Ocelot: Scaling observational causal inference at LinkedIn

Co-authors: Kenneth Tay and Xiaofeng Wang

At LinkedIn, we constantly evaluate the value our products and services deliver, so that we can provide the best possible experiences for our members and customers. This includes understanding how product changes impact key metrics related to those experiences. However, simply looking at correlations between product changes and key metrics can be misleading. As we know, correlation does not always imply causation. When making decisions about the path forward for a product or feature, we need to know the causal impact of that change on our key metrics.

The ideal way to establish causality is through A/B testing, where we randomly split a test population into two or more groups and provide them with different variants of the product (which we call “treatments”). Due to the randomized assignment, the groups are essentially the same, except for the treatment they received, and so any difference in metrics between the groups can be attributed solely to the treatment. Our T-REX experimentation platform allows us to do A/B testing at scale, adding 2,000 new experiments on a weekly basis, serving a user population of more than 850 million members.

However, there are many situations where A/B testing is either infeasible or too costly. For these situations, we turn to the field of observational causal inference to estimate the impact of product changes. We have previously published case studies illustrating the importance of observational causal inference in the article “The Importance of Being Causal.”

In this blog post, we share more details on how LinkedIn performs observational causal inference at scale using our Ocelot platform. We also cover the other important measures we have put in place to ensure that a high standard is met for our causal inference studies and, ultimately, for the product changes that improve member and customer experiences.

What is observational causal inference?

As previously mentioned, sometimes A/B testing is not possible or too expensive, but we still might want to understand the causal effect of a change. A few examples include: 

  • Estimating the impact of brand marketing campaigns. Most of these campaigns (e.g., TV, billboard, radio) cannot be randomized at the user level within the target region.

  • Estimating the impact of bugs or downtime from different sources. Quantifying the impact of bugs from different sources enables us to prioritize infrastructure resources. However, we would not want to run an experiment where we artificially randomize bad experiences between users.

  • Estimating the effect of exogenous shocks to the economy. We are interested in understanding how shocks to the economy (e.g. government policy changes, economic downturn) affect the labor marketplace. We cannot randomize who is impacted by the change and who is not.

In such cases, we would utilize observational causal inference, which is a collection of methods to estimate treatment effects when the treatment is observed rather than randomly assigned. In observational causal inference, we know the treatment status of each user, but the treatment assignment is not random so the raw metric difference between those who were treated and those who were not cannot be causally attributed to the treatment. In particular, the groups might be systematically different from each other (even after taking treatment status out of the picture), and so metric differences could be due to these underlying differences instead. This phenomenon is known as confounding (see Figure 1 for an example).

Figure 1. Example of confounding. The tables show sample data for the control and treatment groups respectively. It looks like the treatment results in larger outcome values (mean of the “Sessions” column), but this difference could be attributed to the fact that more highly active users took the treatment and these users tend to have larger outcome values, not because of the treatment itself.

Techniques in observational causal inference allow us to estimate the effect of a treatment correctly by adjusting for confounding. It is worth noting that a central difficulty of observational causal inference is that some confounding variables (“confounders”) are observed while others are not. Different observational causal methods have different assumptions and different ways to treat unobserved confounders.
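
To make “adjusting for confounding” concrete: under the standard (and untestable) assumption that all confounders X are observed, the average treatment effect (ATE) can be identified by comparing treated and untreated units within levels of X and then averaging over the population:

\[ \text{ATE} = \mathbb{E}_X\big[\, \mathbb{E}[Y \mid T = 1, X] - \mathbb{E}[Y \mid T = 0, X] \,\big] \]

This is a textbook identification result, shown here for intuition rather than as any particular Ocelot estimator; when some confounders are unobserved, the equality breaks down, which is exactly why different methods make different assumptions about them.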

Ocelot: LinkedIn’s platform for observational causal inference

Although observational causal inference is a well-studied research area, not all data scientists are fluent in the full suite of techniques. To make observational causal inference more accessible, easier to use, and faster to execute at LinkedIn, we built an internal web application that enables users to run complex causal studies with no coding effort. It aims to deliver estimates of causal relationships from observational data, along with robustness checks, to end users.

Figure 2 shows the high-level design of the Ocelot platform. 

Figure 2. Ocelot High-Level Design

There are two major components in the platform. The first is the Ocelot web app (Ocelot UI + Ocelot web services). This is a web application used to run causal studies, present results and diagnostic information, organize study iterations, and share knowledge across the company. Here are some key features of the web app:

  • Provides a guided form to lead users through the causal study setup, including what output metrics will be measured, what the control/treatment labels are, what confounders should be included, and, most importantly, the time periods over which each variable (e.g., output metrics, control/treatment labels, and confounders) should be computed.

  • Provides UI-layer validation to avoid misconfiguration of the causal study. For example, metric dates for A/A robustness checks must be prior to the control/treatment label date. (We will explain what these checks are later in the post.)

  • Presents a detailed report with the key results and robustness-check status highlighted (e.g., A/A tests, rerandomization tests, and coverage checks; see the later section Ensuring robustness of study results to learn more).

  • Features peer-reviewed, high-quality causal studies with large business impact, so others can use them as templates to jumpstart their own causal studies.

The second component is the Ocelot pipelines, which are fully integrated data pipelines consisting of Java jobs, Spark jobs, and R jobs running on Azkaban (a LinkedIn open-source workflow manager). These pipelines both prepare the modeling data according to the user configuration and execute the causal modeling code. We chose to bundle data preparation with causal modeling for the following reasons.

First, correctly setting the variable dates is critical for the correctness of the causal inference conclusions. For example, Figure 3 shows the date requirements for the fixed effects model (FEM). For the outcome metric to be a result of the treatment, it has to be measured after the treatment has been administered, while the covariates need to be measured before the treatment time period. For a typical FEM with four time periods, there are 24 dates involved (for each of the three variable types in each time period, we need to set a start and an end date). It is easy to mistakenly overlap dates across time periods, but because the Ocelot platform provides this functionality with UI validation, neither users nor reviewers have to worry about the correctness of the data preparation.
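
To make the date requirements concrete, here is a minimal sketch of the kind of window-ordering check the UI-layer validation performs. The names and logic are hypothetical, not Ocelot's actual implementation:

import java.time.LocalDate;

final class DateWindowValidator {

  /** A measurement window [start, end). */
  record Window(LocalDate start, LocalDate end) {
    boolean endsOnOrBefore(Window other) {
      return !end().isAfter(other.start());
    }
  }

  /** Covariates must be measured before treatment, and outcomes after it. */
  static void validate(Window covariates, Window treatment, Window outcome) {
    if (!covariates.endsOnOrBefore(treatment)) {
      throw new IllegalArgumentException("covariate window must end before the treatment window starts");
    }
    if (!treatment.endsOnOrBefore(outcome)) {
      throw new IllegalArgumentException("outcome window must start after the treatment window ends");
    }
  }

  public static void main(String[] args) {
    // Example: covariates in January, treatment in early February, outcomes afterwards.
    validate(new Window(LocalDate.of(2023, 1, 1), LocalDate.of(2023, 2, 1)),
             new Window(LocalDate.of(2023, 2, 1), LocalDate.of(2023, 2, 15)),
             new Window(LocalDate.of(2023, 2, 15), LocalDate.of(2023, 3, 15)));
  }
}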

Figure 3. An example of a Fixed Effect Modeling Date Setup
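
For reference, a two-way fixed effects specification of the kind FEM estimates typically takes the form

\[ Y_{it} = \alpha_i + \lambda_t + \tau\, T_{it} + \beta^\top X_{it} + \varepsilon_{it} \]

where \( \alpha_i \) is a member-level fixed effect, \( \lambda_t \) a time-period effect, \( T_{it} \) the treatment indicator, \( X_{it} \) the time-varying covariates, and \( \tau \) the treatment effect of interest. (This is the standard textbook form; Ocelot's exact specification may differ.)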

Second, LinkedIn has 875+ million members and keeps growing. Joining large-scale member data with many confounding variables requires skillful data engineering. We fine-tuned Spark jobs to reduce the data preparation time and failure rate. Ocelot is also integrated with our internal feature store, Feathr, so users only need to select covariates by name; Ocelot handles the join logic to ensure modeling data correctness. For example, if users wish to control for session count in the previous week, they can simply pick “macrosessions_sum_7d” as a covariate in the causal study configuration. Our pipeline maps this covariate name to the corresponding data sources and aggregates the seven-day values according to the different date configurations for both the causal modeling phase and the robustness check phase. To further boost productivity, we worked with domain experts to predefine a standard covariate set, which currently includes more than 200 commonly used covariates.

Lastly, we can enforce the best practice of always running robustness checks along with causal modeling without requiring users to prepare the data multiple times. 

Figure 4 shows the five methods offered on our Ocelot platform. They are Coarsened Exact Matching (CEM) and the Doubly Robust (DR) estimator (also known as the augmented inverse propensity weighted estimator) for cross-sectional data, instrumental variables (IV) estimation when an instrument is available, fixed effects models (FEM) for panel data, and Bayesian structural time series (BSTS) for time series data.
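
As a concrete example, the DR estimator combines an outcome model \( \hat{\mu}_t(x) \) with a propensity model \( \hat{e}(x) \). One standard form of the AIPW point estimate is

\[ \hat{\tau}_{\mathrm{DR}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i\,\big(Y_i - \hat{\mu}_1(X_i)\big)}{\hat{e}(X_i)} - \frac{(1 - T_i)\,\big(Y_i - \hat{\mu}_0(X_i)\big)}{1 - \hat{e}(X_i)} \right] \]

which remains consistent if either the outcome model or the propensity model is correctly specified, hence “doubly robust.” (This is the standard textbook form, shown for intuition; not necessarily Ocelot's exact implementation.)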

A typical Ocelot user journey is illustrated by the following screenshots. First, the user logs in to the Ocelot platform and picks a causal method (Figure 4).

Figure 4. Ocelot Landing Page

Then, the user looks at the featured analyses and/or past analyses to learn how to set up parameters for their causal study (Figure 5). Users are also free to create an analysis from scratch.

Figure 5. Methodology Landing Page & Past Analysis History

Third, the user fills in a guided form to set up the causal study, and executes the analysis with the click of a button (Figure 6).

Figure 6. Create & Execute a New Analysis Page

Next, the user reviews the results (Figure 7).

Figure 7. Results Page

Then, unlike a simple deep-dive analysis, the user usually iterates on the causal study by including different covariates or changing the study population. All past execution history is captured to avoid p-hacking (Figure 8).

Figure 8. Iterate on Causal Design

Finally, once the results are ready, the user can submit the analysis for our committee’s review (Figure 9).

Figure 9. Request Committee Review

As shown in the previous screenshots, our Ocelot platform provides the following convenient features to increase users’ productivity:

  1. Searchable descriptions, goals, and tags for each causal study, so that new users can easily learn how to run observational causal studies.

  2. A clone function to jump-start a new analysis or a new iteration by copying a past causal analysis configuration.

  3. Data visualizations to help users configure causal study dates correctly and to highlight key results.

Since the launch of the Ocelot platform in 2019, we have been successful at democratizing and expediting observational causal inference within LinkedIn’s data science community. Prior to Ocelot, it usually took a few data scientists (one experienced observational causal inference expert and one domain expert) up to six weeks to design a causal study, build the data pipeline to create the dataset, write ad-hoc causal modeling scripts, and validate and analyze the results. Due to this resource-intensive process, only 10-20 observational causal studies had been produced up to that point. With the Ocelot platform, domain expert data scientists can run causal studies on their own, and it usually takes them just a couple of hours to learn the tool and execute a simple causal study. Even accounting for iterations and reviews, a thorough causal study can be completed with less than one week’s effort. Since the launch of Ocelot, we have run more than 50 causal studies every year, many of them providing deep insights into the LinkedIn ecosystem and influencing product strategy. (A few of these studies are discussed in the article “The Importance of Being Causal” in the Harvard Data Science Review.)

Ensuring robustness of study results

Because the estimates from observational causal studies are used to guide product decision-making, they must be reliable. Hence, the design of our studies, the methods we use, and the way we interpret the results must meet a high bar of rigor. We do this in two main ways at LinkedIn: (1) a central review committee, and (2) automated robustness checks on Ocelot.

We have set up a central review committee that vets the design of observational causal studies and guides proper interpretation of study results. Study design and results are presented and discussed weekly, and treatment effect estimates can only be interpreted as causal if the committee deems the study to be rigorous enough.

The committee consists of members from the horizontal Data Science Applied Research team and data scientists from each product. Having both types of members is key to the committee’s success as they bring complementary strengths to the table. While all members of the committee have a strong understanding of observational causal inference, members from the horizontal team have deep technical expertise on the methods in experimentation and causal inference, with the ability to develop new methods when needed. Product data scientists have the domain knowledge to ensure the study’s design and interpretation of results make business sense.

The central committee also plays a vital role in raising the level of knowledge of observational causal inference across LinkedIn. Members from the horizontal team distill and share the latest advances in observational causal inference and updates to the Ocelot platform, through documents, presentations to vertical teams, as well as through the product data scientists in the central committee. Product data scientists also act as the “champion” for observational causal inference in their line of business, seeking out opportunities where observational causal inference could be helpful, and giving advice to team members who are running observational causal studies.

In addition to manual review, methods on the Ocelot platform have automated robustness checks which, if passed, increase confidence in the treatment effect estimates. If the robustness checks fail, the methodology cannot be used to estimate the causal effect, and the user is not allowed to claim that the estimates are causal. For most of the methods on Ocelot, we have some version of the A/A test. The A/A test is easiest to explain in the A/B testing setting: in an A/A test, we randomly split the test population into two groups but give both groups the same treatment. Since the treatments are the same, we do not expect any metric differences; statistically significant differences suggest that something is wrong with the study design and that the results cannot be trusted. For observational causal inference, we find settings where the treatment effect should be zero after adjusting for confounding. The A/A test fails if the treatment effect estimate is significantly different from zero even after adjusting for confounding; such a failure suggests that there are residual confounders contributing to outcome differences, which should not be mistaken for a treatment effect.
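
For intuition, here is a minimal sketch of the A/A idea in its simplest (randomized) form, using Apache Commons Math for the significance test. This is illustrative only; Ocelot's observational A/A checks additionally adjust for confounding before testing whether the estimate differs from zero.

import java.util.Random;
import org.apache.commons.math3.stat.inference.TTest;

public class AaCheckSketch {
  public static void main(String[] args) {
    Random rng = new Random(42);
    int n = 10_000;
    double[] groupA = new double[n];
    double[] groupB = new double[n];
    for (int i = 0; i < n; i++) {
      // Both groups receive identical treatment, so any "effect" is spurious.
      groupA[i] = 5.0 + rng.nextGaussian();
      groupB[i] = 5.0 + rng.nextGaussian();
    }
    double pValue = new TTest().tTest(groupA, groupB);
    // A significant difference would signal a broken design (e.g., non-random
    // assignment or residual confounding), so causal claims should not be made.
    System.out.printf("A/A p-value = %.3f -> %s%n", pValue,
        pValue < 0.05 ? "FAIL: investigate the study design" : "PASS");
  }
}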

While robustness checks help to increase trust in study results, we note that observational causal methods often require assumptions that are impossible to verify (e.g., no unobserved confounding for the doubly robust method, the exclusion restriction for instrumental variables). Review committee members, in conjunction with study owners, use domain knowledge to assess the reasonableness of these assumptions in the study’s context, and to point out these assumptions whenever the study results are used. This also underscores the importance of sound study design.

Conclusion

Observational causal inference is an important complement to A/B testing, enabling us to measure the effect of product changes when we are not able to randomize the treatment among users. Our Ocelot platform enables us to do this at scale in a robust manner. We are continually thinking about which methods to add to the platform and how to ensure that observational causal inference is done rigorously: if you have any thoughts on the topic, we would love to hear them!

Acknowledgements

We would like to thank our colleagues Xiaonan (Kate) Ding, David Tag, Donghoon (Don) Jung, Rina Friedberg, Min Liu, Albert Chen, Vivek Agrawal, and Simon Yu for building the Ocelot platform; YinYin Yu, Weitao Duan, Dan Antzelevitch, Parvez Ahammad, Zheng Li, Sofus Macskassy, Souvik Ghosh, and Ya Xu for their continued support and leadership in advancing the observational causal studies platform. We would also like to thank the many internal users who provided valuable feedback to improve the platform, especially Rose Tan, Ming Wu, and Joyce Chen. Finally, we are grateful to the LinkedIn editorial team for their comments and suggestions on earlier versions of this post.

Career stories: Influencing engineering growth at LinkedIn

Since learning frontend and backend skills, Rishika’s passion for engineering has expanded beyond her team at LinkedIn to grow into her own digital community. As she develops as an engineer, giving back has become the most rewarding part of her role.

From intern to engineer—life at LinkedIn

My career with LinkedIn began with a college internship, where I got to dive into all things engineering. Even as a summer intern, I absorbed so much about frontend and backend engineering during my time here. When I considered joining LinkedIn full-time after graduation, I thought back to the work culture and how my manager treated me during my internship. Although I had a virtual experience during COVID-19, the LinkedIn team ensured I was involved in team meetings and discussions. That mentorship opportunity ultimately led me to accept an offer from LinkedIn over other offers. 

Before joining LinkedIn full-time, I worked at Adobe as a Product Intern for six months, where my projects revolved around the core libraries in the C++ language. When I started my role here, I had to shift to a different tech stack: Java for the backend and a JavaScript framework for the frontend. This was a new challenge for me, but the learning curve was beneficial since I got hands-on exposure and picked up new things by myself. I have also had the chance to work with some of the finest engineers; learning from the people around me has been such a fulfilling experience. I would like to thank Sandeep and Yash for their constant support and for mentoring me since the very beginning of my journey with LinkedIn.

Currently, I’m working with the Trust team on building moderation tools for all our LinkedIn content while guaranteeing that we remove spam on our platform, which can negatively affect the LinkedIn member experience. Depending on the project, I work on both the backend and the frontend, since my team handles the full-stack development. At LinkedIn, I have had the opportunity to work on a diverse set of projects and handle them from end to end. 

Mentoring the next generation of engineering graduates

I didn’t have a mentor during college, so I’m passionate about helping college juniors find their way in engineering. When I first started out, I came from a biology background, so I didn’t know about programming languages or how to translate them into a technical resume. I wish there had been someone to help me out with debugging and finding solutions, so it’s important to me to give back in that way.

I’m quite active in university communities, participating in student-led tech events like hackathons to help students get into tech and secure their first job in the industry. I also love virtual events like X (formerly Twitter) and LinkedIn Live events. Additionally, I’m part of LinkedIn’s CoachIn Program, where we help with resume building and offer scholarships for women in tech.

Influencing online and off at LinkedIn

I love creating engineering content on LinkedIn, X, and other social media platforms, where people often contact me about opportunities at LinkedIn Engineering. It brings me so much satisfaction to tell others about our amazing company culture and connect with future grads. 

When I embarked on my role during COVID-19, building an online presence helped me stay connected with what’s happening in the tech world. I began posting on X first, and once that community grew, I launched my YouTube channel to share beginner-level content on data structures and algorithms. My managers and peers at LinkedIn were so supportive, so I broadened my content to cover aspects like soft skills, student hackathons, resume building, and more. While this is in addition to my regular engineering duties, I truly enjoy sharing my insights with my audience of 60,000+ followers. And the enthusiasm from my team inspires me to keep going! I’m excited to see what the future holds for me at LinkedIn as an engineer and a resource for my community on the LinkedIn platform.

About Rishika

Rishika holds a Bachelor of Technology from Indira Gandhi Delhi Technical University for Women. Before joining LinkedIn, she interned at Google as part of the SPS program and as a Product Intern at Adobe. She currently works as a software engineer on LinkedIn’s Trust Team. Outside of work, Rishika loves to travel all over India and create digital art. 

Editor’s note: Considering an engineering/tech career at LinkedIn? In this Career Stories series, you’ll hear first-hand from our engineers and technologists about real life at LinkedIn — including our meaningful work, collaborative culture, and transformational growth. For more on tech careers at LinkedIn, visit: lnkd.in/EngCareers.

    Career Stories: Learning and growing through mentorship and community

    Lekshmy has always been interested in a role in a company that would allow her to use her people skills and engineering background to help others. Working as a software engineer at various companies led her to hear about the company culture at LinkedIn. After some focused networking, Lekshmy landed her position at LinkedIn and has been continuing to excel ever since.

    How did I get my job at LinkedIn? Through LinkedIn. 

    Before my current role, I had heard great things about the company and its culture. After hearing about InDays (Investment Days) and how LinkedIn supports its employees, I knew I wanted to work there. 

    While at the College of Engineering, Trivandrum (CET), I knew I wanted to pursue a career in software engineering. Engineering is something that I’m good at and absolutely love, and my passion for the field has only grown since joining LinkedIn. When I graduated from CET, I began working at Groupon as a software developer, starting on databases, REST APIs, application deployment, and data structures. From that role, I was able to advance into the position of software development engineer 2, which enabled me to dive into other software languages, as well as the development of internal systems. That’s where I first began mentoring teammates and realized I loved teaching and helping others. It was around this time that I heard of LinkedIn through the grapevine.

    Joining the LinkedIn community

    Everything I heard about LinkedIn made me very interested in career opportunities there, but I didn’t have connections yet. I did some research and reached out to a talent acquisition manager on LinkedIn and created a connection which started a path to my first role at the company. 

    When I joined LinkedIn, I started on the LinkedIn Talent Solutions (LTS) team. It was a phenomenal way to start because not only did I enjoy the work, but the experience served as a proper introduction to the culture at LinkedIn. I started during the pandemic, which meant remote working, and eventually, as the world situation improved, we went hybrid. This is a great system for me; I have a wonderful blend of being in the office and working remotely. When I’m in the office, I like to catch up with my team by talking about movies or playing games, going beyond work topics, and getting to know each other. With LinkedIn’s culture, you really feel that sense of belonging and recognize that this is an environment where you can build lasting connections. 

    LinkedIn: a people-first company 

    If you haven’t been able to tell already, even though I mostly work with software, I truly am a people person. I just love being part of a community. At the height of the pandemic, I’ll admit I struggled with a bit of imposter syndrome and anxiety. But I wasn’t sure how to ask for help. I talked with my mentor at LinkedIn, and they recommended I use the Employee Assistance Program (EAP) that LinkedIn provides. 

    I was nervous about taking advantage of the program, but I am so happy that I did. The EAP helped me immensely when everything felt uncertain, and I truly felt that the company was on my side, giving me the space and resources to help relieve my stress. Now, when a colleague struggles with something similar, I recommend they consider the EAP, knowing firsthand how effective it is.

    Building a path for others’ growth

    With my mentor, I was also able to learn about and become a part of our Women in Technology (WIT) Invest Program. WIT Invest is a program that provides opportunities like networking, mentorship check-ins, and executive coaching sessions. WIT Invest helped me adopt a daily growth mindset and find my own path as a mentor for college students. When mentoring, I aim to build trust and be open, allowing an authentic connection to form. The students I work with come to me for all kinds of guidance; it’s just one way I give back to the next generation and the wider LinkedIn community. Providing the kind of support my mentor gave me early on was a full-circle moment for me.

    Working at LinkedIn is everything I thought it would be and more. I honestly wake up excited to work every day. In my three years here, I have learned so much, met new people, and engaged with new ideas, all of which have advanced my career and helped me support the professional development of my peers. I am so happy I took a leap of faith and messaged that talent acquisition manager on LinkedIn. To anyone thinking about applying to LinkedIn, go for it. Apply, send a message, and network—you never know what one connection can bring! 

    About Lekshmy

    Based in Bengaluru, Karnataka, India, Lekshmy is a Senior Software Engineer on LinkedIn’s Hiring Platform Engineering team, focused on the Internal Mobility Project. Before joining LinkedIn, Lekshmy held various software engineering positions at Groupon, including SDE 3. Lekshmy holds a degree in Computer Science from the College of Engineering, Trivandrum, and is a trained classical dancer. Outside of work, Lekshmy enjoys painting, gardening, and trying new hobbies that pique her interest.

    Editor’s note: Considering an engineering/tech career at LinkedIn? In this Career Stories series, you’ll hear first-hand from our engineers and technologists about real life at LinkedIn — including our meaningful work, collaborative culture, and transformational growth. For more on tech careers at LinkedIn, visit: lnkd.in/EngCareers.


    Solving Espresso’s scalability and performance challenges to support our member base

    Espresso is the database that we designed to power our member profiles, feed, recommendations, and hundreds of other LinkedIn applications that handle large amounts of data and need both high performance and reliability. As Espresso continued to expand in support of our 950M+ member base, the number of network connections that it needed began to drive scalability and resiliency challenges. To address these challenges, we migrated to HTTP/2. With the initial Netty-based implementation, we observed a 45% degradation in throughput, which we needed to analyze and correct.

    In this post, we will explain how we solved these challenges and improved system performance. We will also delve into the various optimization efforts we employed on Espresso’s online operation section, implementing one approach that resulted in a 75% performance boost.

    Espresso Architecture


    Figure 1.  Espresso System Overview

    Figure 1 is a high-level overview of the Espresso ecosystem, which includes the online operation section of Espresso (the main focus of this blog post). This section comprises two major components – the router and the storage node. The router is responsible for directing the request to the relevant storage node and the storage layer’s primary responsibility is to get data from the MySQL database and present the response in the desired format to the member. Espresso utilizes the open-source framework Netty for the transport layer, which has been heavily customized for Espresso’s needs. 

    Need for new transport layer architecture

    In the communication between the router and the storage layer, our earlier approach utilized HTTP/1.1, a protocol extensively employed for interactions between web servers and clients. However, HTTP/1.1 effectively operates on a connection-per-request basis: each connection can carry only one in-flight request at a time. In the context of large clusters, this approach led to millions of concurrent connections between the router and the storage nodes. This resulted in constraints on scalability and resiliency, and numerous performance-related hurdles.

    Scalability: Scalability is a crucial aspect of any database system, and Espresso is no exception. In our recent cluster expansion, adding an additional 100 router nodes caused the memory usage to spike by around 2.5GB. The additional memory can be attributed to the new TCP network connections within the storage nodes. Consequently, we experienced a 15% latency increase due to an increase in garbage collection. The number of connections to storage nodes posed a significant challenge to scaling up the cluster, and we needed to address this to ensure seamless scalability.

    Resiliency: In the event of network flaps and switch upgrades, the process of re-establishing thousands of connections from the router often breaches the connection limit on the storage node. This, in turn, causes errors and the router to fail to communicate with the storage nodes. 

    Performance: When using the HTTP/1.1 architecture, routers maintain a limited pool of connections to each storage node within the cluster. In some larger clusters, the wait time to acquire a connection can be as high as 15ms at the 95th percentile due to the limited pool. This delay can significantly affect the system’s response time.

    We determined that all of the above limitations could be resolved by transitioning to HTTP/2, as it supports connection multiplexing and requires a significantly lower number of connections between the router and the storage node.

    We explored various technologies for the HTTP/2 implementation, but due to the strong support from the open-source community and our familiarity with the framework, we went with Netty. When using Netty out of the box, the HTTP/2 implementation's throughput was 45% less than that of the original (HTTP/1.1) implementation. Because the out-of-the-box performance was so poor, we had to implement different optimizations to enhance performance.
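
    For context, a baseline (pre-optimization) HTTP/2 client pipeline in Netty looks roughly like the sketch below, using Netty 4.1's public API; this is illustrative, and Espresso's actual setup is heavily customized:

    import io.netty.bootstrap.Bootstrap;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.EventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioSocketChannel;
    import io.netty.handler.codec.http2.Http2FrameCodecBuilder;
    import io.netty.handler.codec.http2.Http2MultiplexHandler;

    final class Http2ClientBootstrapSketch {
      static Bootstrap create(EventLoopGroup group) {
        return new Bootstrap()
            .group(group)
            .channel(NioSocketChannel.class)
            .handler(new ChannelInitializer<SocketChannel>() {
              @Override
              protected void initChannel(SocketChannel ch) {
                ch.pipeline().addLast(
                    // Translates bytes on the single TCP connection to/from HTTP/2 frames.
                    Http2FrameCodecBuilder.forClient().build(),
                    // Opens a lightweight child channel per HTTP/2 stream, letting many
                    // in-flight requests share one connection (multiplexing).
                    new Http2MultiplexHandler(new ChannelInboundHandlerAdapter()));
              }
            });
      }
    }

    Multiplexing is what removes the connection-per-request constraint; the optimizations described below target the per-stream costs that this default setup introduces.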

    The experiment was run on a production-like test cluster, with traffic that was a mix of read and write access patterns. The results are as follows:

    Protocol    QPS          Single Read Latency (P99)    Multi-Read Latency (P99)
    HTTP/1.1    9K           7ms                          25ms
    HTTP/2      5K (-45%)    11ms (+57%)                  42ms (+68%)

    On the routing layer, further analysis using flame graphs revealed the major differences between the two protocols shown in the following table.

    CPU overhead                                         HTTP/1.1    HTTP/2
    Acquiring a connection and processing the request    20%         32% (+60%)
    Encode/decode HTTP request                           18%         32% (+77%)

    Improvements to Request/Response Handling

    Reusing the Stream Channel Pipeline

    One of the core concepts of Netty is its ChannelPipeline. As seen in Figure 2, when data is received from the socket, it is passed through the pipeline, which processes it. A ChannelPipeline contains a list of handlers, each working on a specific task.


    Figure 2. Netty Pipeline

    In the original HTTP/1.1 Netty pipeline, a set of 15-20 handlers was established when a connection was made, and this pipeline was reused for all subsequent requests served on the same connection. 

    However, in Netty's default HTTP/2 implementation, a fresh pipeline is generated for each new stream or request. For instance, a multi-get request to a router with over 100 keys can often result in approximately 30 to 35 requests being sent to the storage node. Consequently, the router must initiate new pipelines for all 35 storage node requests. Creating and dismantling a pipeline with a considerable number of handlers for each request turned out to be notably resource-intensive in terms of memory utilization and garbage collection.

    To address this concern, we developed a forked version of Netty's Http2MultiplexHandler that maintains a queue of local stream channels. As illustrated in Figure 3, on receiving a new request, the multiplex handler no longer generates a new pipeline. Instead, it retrieves a local channel from the queue and employs it to process the request. Once the request completes, the channel is returned to the queue for future use. Through the reuse of existing channels, the creation and destruction of pipelines are minimized, leading to a reduction in memory strain and garbage collection.
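
    The channel-reuse idea can be sketched as a simple pool (an illustration of the pattern, not the actual forked handler):

    import io.netty.channel.Channel;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    /** Sketch: pool HTTP/2 stream channels so their pipelines are reused
     *  across requests instead of being rebuilt per stream. */
    final class StreamChannelPool {
      private final Queue<Channel> idle = new ConcurrentLinkedQueue<>();

      /** Reuse an idle stream channel if available; null means the caller
       *  opens a new stream (and builds its pipeline once). */
      Channel acquire() {
        return idle.poll();
      }

      /** After the request completes, keep the channel (and its fully built
       *  pipeline) around for the next request. */
      void release(Channel ch) {
        if (ch.isActive()) {
          idle.offer(ch);
        }
      }
    }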


    Figure 3. Sequence diagram of stream channel reuse

    Addressing uneven work distribution among Netty I/O threads 

    When a new connection is created, Netty assigns this connection to one of the 64 I/O threads. In Espresso, the number of I/O threads is equal to twice the number of cores present. The I/O thread associated with the connection is responsible for I/O and handling the request/response on the connection. Netty’s default implementation employs a rudimentary method for selecting an appropriate I/O thread out of the 64 available for a new channel. Our observation revealed that this approach leads to a significantly uneven distribution of workload among the I/O threads. 

    In a standard deployment, we observed that 20% of I/O threads were managing 50% of the total connections/requests. To address this issue, we introduced a BalancedEventLoopGroup, designed to distribute connections evenly across all available worker threads. During channel registration, the BalancedEventLoopGroup iterates through the worker threads to ensure a more equitable allocation of workload.

    After this change, when a channel is registered, an event loop whose connection count is below the average is selected.

    private EventLoop selectLoop() {
      int average = averageChannelsPerEventLoop();
      // Start with Netty's default round-robin choice.
      EventLoop loop = next();
      if (_eventLoopCount > 1 && isUnbalanced(loop, average)) {
        // The default choice is overloaded: scan the event loops in random order
        // and take the first one whose connection count is below the average.
        ArrayList<EventLoop> list = new ArrayList<>(_eventLoopCount);
        _eventLoopGroup.forEach(eventExecutor -> list.add((EventLoop) eventExecutor));
        Collections.shuffle(list, ThreadLocalRandom.current());
        Iterator<EventLoop> it = list.iterator();
        do {
          loop = it.next();
        } while (it.hasNext() && isUnbalanced(loop, average));
      }
      return loop;
    }

    Reducing context switches when acquiring a connection 

    In the HTTP/2 implementation, each router maintains 10 connections to every storage node. These connections serve as communication pathways for the router I/O threads interfacing with the storage node. Previously, we utilized Netty’s FixedChannelPool implementation to oversee connection pools, handling tasks like acquiring, releasing, and establishing new connections. 

    However, the underlying queue within Netty's implementation is not thread-safe. To obtain a connection from the pool, the requesting worker thread must engage the I/O worker overseeing the pool, a process that led to two context switches. To resolve this, we developed a derivative of the Netty pool implementation that employs a high-performance, thread-safe queue. The task is now executed by the requesting thread instead of a distinct I/O thread, effectively eliminating the need for context switches.
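
    The essence of the change can be sketched as follows (a hypothetical simplification, not the actual derivative of FixedChannelPool):

    import io.netty.channel.Channel;
    import io.netty.channel.EventLoop;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.function.Consumer;

    final class ConnectionAcquisitionSketch {
      private final Queue<Channel> pool = new ConcurrentLinkedQueue<>(); // thread-safe

      /** Before: the pool's queue is confined to its event loop, so every acquire
       *  hops to that thread and back (two context switches). */
      void acquireViaEventLoop(EventLoop poolLoop, Consumer<Channel> callback) {
        poolLoop.execute(() -> callback.accept(pool.poll()));
      }

      /** After: a thread-safe queue lets the requesting thread take a
       *  connection in place, with no thread hand-off. */
      Channel acquireInline() {
        return pool.poll();
      }
    }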

    Improvements to SSL Performance

    The following section describes various optimizations to improve the SSL performance.

    Offloading DNS lookup and handshake to separate thread pool

    During an SSL handshake, the DNS lookup procedure for resolving a hostname to an IP address functions as a blocking operation. Consequently, the I/O thread responsible for executing the handshake might be held up for the entirety of the DNS lookup process. This delay can result in request timeouts and other issues, especially when managing a substantial influx of incoming connections concurrently.  

    To tackle this concern, we developed an SSL initializer that conducts the DNS lookup on a different thread prior to initiating the handshake. It then passes the InetAddress, which contains both the IP address and the hostname, to the SSL handshake procedure, effectively circumventing the need for a DNS lookup during the handshake.
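
    A sketch of the idea (illustrative; the real SSL initializer is internal, and startTlsHandshake below is a hypothetical callback):

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.CompletionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    final class DnsOffloadSketch {
      // Dedicated pool so blocking lookups never run on a Netty I/O thread.
      private static final ExecutorService DNS_POOL = Executors.newFixedThreadPool(4);

      static CompletableFuture<InetAddress> resolve(String host) {
        return CompletableFuture.supplyAsync(() -> {
          try {
            return InetAddress.getByName(host); // blocking DNS lookup
          } catch (UnknownHostException e) {
            throw new CompletionException(e);
          }
        }, DNS_POOL);
      }
    }

    // Usage: resolve(host).thenAccept(addr -> startTlsHandshake(addr));
    // The handshake then receives an already-resolved address.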

    Enabling Native SSL encryption/decryption

    Java’s default built-in SSL implementation carries a significant performance overhead. Netty offers a JNI-based SSL engine that demonstrates exceptional efficiency in both CPU and memory utilization. Upon enabling OpenSSL within the storage layer, we observed a notable 10% reduction in latency. (The router layer already utilizes OpenSSL.)  

    To employ Netty Native SSL, one must include the pertinent Netty Native dependencies, as it interfaces with OpenSSL through the JNI (Java Native Interface). For more detailed information, please refer to https://netty.io/wiki/forked-tomcat-native.html.
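
    For illustration, switching to the native engine is typically a small change when building the client SslContext, assuming the netty-tcnative (OpenSSL) dependency is on the classpath:

    import io.netty.handler.ssl.OpenSsl;
    import io.netty.handler.ssl.SslContext;
    import io.netty.handler.ssl.SslContextBuilder;
    import io.netty.handler.ssl.SslProvider;
    import javax.net.ssl.SSLException;

    final class NativeSslSketch {
      static SslContext clientContext() throws SSLException {
        // Prefer the JNI/OpenSSL engine when the native library is available;
        // otherwise fall back to the JDK implementation.
        SslProvider provider = OpenSsl.isAvailable() ? SslProvider.OPENSSL : SslProvider.JDK;
        return SslContextBuilder.forClient().sslProvider(provider).build();
      }
    }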

    Improvements to Encode/Decode performance

    This section focuses on the performance improvements we made when converting bytes to HTTP objects and vice versa. Approximately 20% of our CPU cycles were spent encoding and decoding bytes. Unlike a typical service, Espresso has very rich headers. Our HTTP/2 implementation wraps the existing HTTP/1.1 pipeline with HTTP/2 functionality: the HTTP/2 layer handles network communication, while the core business logic resides within the HTTP/1.1 layer. Because of this, each incoming request required the conversion of HTTP/2 requests to HTTP/1.1 and vice versa, which resulted in high CPU usage, memory consumption, and garbage creation.

    To improve performance, we implemented a custom codec designed for efficient handling of HTTP headers. We introduced a new request class named Http1Request, which effectively encapsulates an HTTP/2 request as an HTTP/1.1 request by wrapping the Http2 headers. The primary objective behind this approach is to avoid the expensive conversion of HTTP/1.1 headers to HTTP/2 and vice versa.

    For example:

    public class Http1Headers extends HttpHeaders {
      private final Http2Headers _headers;
      ….
    }

    Operations such as get, set, and contains then operate directly on the underlying Http2Headers:

    @Override
    public String get(String name) {
      return str(_headers.get(AsciiString.cached(name).toLowerCase()));
    }

    To make this possible, we developed a new codec that is essentially a clone of Netty’s Http2StreamFrameToHttpObjectCodec. This codec is designed to translate HTTP/2 StreamFrames to HTTP/1.1 requests/responses with minimal overhead. By using this new codec, we were able to significantly improve the performance of encode/decode operations and reduce the amount of garbage generated during the conversions.

    Disabling HPACK Header Compression

    HTTP/2 introduced a new header compression algorithm known as HPACK. It works by maintaining indexed dictionaries on both the client and the server. Instead of transmitting a complete string value, HPACK sends the associated index (an integer) when transmitting a header. HPACK encompasses two key components:

    1. Static Table – A dictionary comprising 61 commonly used headers.

    2. Dynamic Table – This table retains the user-generated header information.

    HPACK header compression is tailored to scenarios where header contents remain relatively constant. But Espresso has very rich headers with stateful information such as timestamps, SCNs, and so on, so HPACK did not align well with Espresso's requirements.

    Upon examining flame graphs, we observed a substantial stack dedicated to encoding/decoding dynamic tables. Consequently, we opted to disable dynamic header compression, leading to an approximate 3% enhancement in performance.

    In Netty, this can be disabled using the following:

    Http2FrameCodecBuilder.forClient()
        .initialSettings(Http2Settings.defaultSettings().headerTableSize(0));

    Results

    Latency Improvements

    P99.9 Latency     HTTP/1.1    HTTP/2
    Single Key Get    20ms        7ms (-66%)
    Multi Key Get     80ms        20ms (-75%)

    We observed a 75% reduction in 99th and 99.9th percentile multi-read and read latencies, decreasing from 80ms to 20ms.


    Figure 4. Latency reduction after HTTP/2

    We observed similar latency reductions across the 90th percentile and higher.  

    Reduction in TCP connections

                                 HTTP/1.1      HTTP/2
    No. of TCP connections       32 million    3.9 million (-88%)

    We observed an 88% reduction in the number of connections required between routers and storage nodes in some of our largest clusters.


    Figure 5. Total number of connections after HTTP/2

    Reduction in Garbage Collection time

    We observed a 75% reduction in garbage collection times for both young and old gen.

    GC           HTTP/1.1    HTTP/2
    Young Gen    2000 ms     500 ms (-75%)
    Old Gen      80 ms       15 ms (-81%)

    Figure 6. Reduction in time for GC after HTTP/2

    Waiting time to acquire a Storage Node connection

    HTTP/2 eliminates the need to wait for a storage node connection by enabling multiplexing on a single TCP connection, which is a significant factor in reducing latency compared to HTTP/1.1.

                                                             HTTP/1.1    HTTP/2
    Wait time in router to get a storage node connection     11ms        0.02ms (-99.8%)

    Figure 7. Reduction in wait time to get a connection after HTTP/2

    Conclusion

    Espresso has a large server fleet and is mission-critical to a number of LinkedIn applications. With HTTP/2 migration, we successfully solved Espresso’s scalability problems due to the huge number of TCP connections required between the router and the storage nodes. The new architecture also reduced the latencies by 75% and made Espresso more resilient. 

    Acknowledgments

    I would like to thank my colleagues Antony Curtis, Yaoming Zhan, BinBing Hou, Wenqing Ding, Andy Mao, and Rahul Mehrotra who worked on this project. The project demanded a great deal of time and effort due to the complexity involved in optimizing the performance. I would like to thank Kamlakar Singh and Yun Sun for reviewing the blog and providing valuable feedback. 

    We would also like to thank our management, Madhur Badal, Alok Dhariwal, and Gayatri Penumetsa, for their support and resources, which played a crucial role in the success of this project. Their encouragement and guidance helped the team overcome challenges and deliver the project on time.
