Real-time analytics on network flow data with Apache Pinot

The LinkedIn infrastructure has thousands of services serving millions of queries per second. At this scale, tools that provide observability into the LinkedIn infrastructure are imperative to ensure that issues are quickly detected, diagnosed, and remediated. This level of visibility helps prevent outages so we can deliver the best experience for our members. To provide observability, various data points need to be collected, such as metrics, events, logs, and flows. Once collected, the data points can then be processed and made available, in real time, for engineers to use for alerting, troubleshooting, capacity planning, and other operations.

At LinkedIn, we developed InFlow to provide observability into network flows. A network flow describes the movement of a packet through a network. It is the metadata of a packet sampled at a network device, describing the packet in terms of the 5-tuple: source IP, source port, destination IP, destination port, and protocol. It may also contain source and destination autonomous system numbers (ASNs), the IP address of the network device that captured the flow, the input and output interface indices of the network device where the traffic was sampled, and the number of bytes transferred.

Network devices can be configured to export this information to an external collector using various protocols. InFlow understands the industry standard sFlow and IPFIX protocols for collecting flows.

How LinkedIn leverages flow data

InFlow provides a rich set of time-series network data with over 50 dimensions, such as source and destination sites, security zones, ASNs, IP address type, and protocol. With this data, various types of analytical queries can be run to gain meaningful insights into network health and characteristics.

Figure 1.  A screenshot from InFlow UI’s Top Services tab which shows the 5 services consuming the most network bandwidth and the variation of this traffic over the last 2 hours

Most commonly, InFlow is used for operational troubleshooting to get complete visibility into traffic. For example, if there is an outage due to the exhaustion of a network link's capacity, InFlow can be used to find the top talkers for that link, i.e., the hosts and services consuming the most bandwidth (Figure 1). Based on the nature of those services, further steps can then be taken to remediate the issue.

Flow data also provides source and destination ASN information, which can be used to optimize cost based on the bandwidth consumption of different kinds of peering with external networks. It can also be used to analyze traffic along several dimensions for network operations, for example, finding the distribution of traffic between IPv4 and IPv6 flows or by Type of Service (ToS) bits.

InFlow architecture overview

Figure 2. InFlow architecture

Figure 2 shows the overall InFlow architecture. The platform is divided into 3 main components: flow collector, flow enricher, and InFlow API with Pinot as a storage system. Each component has been modeled as an independent microservice to provide the following benefits:

  1. It enforces the single responsibility principle and prevents the system from becoming a monolith.
  2. Each of the components has different requirements in terms of scaling. Separate microservices ensure that each can be scaled independently.
  3. This architecture creates loosely coupled pluggable services which can be reused for other scenarios.

Flow collection

InFlow receives 50k flows per second from over 100 different devices on the LinkedIn backbone and edge. InFlow supports sFlow and IPFIX as protocols for collecting flows from network devices; the choice depends on each device vendor's protocol support and on keeping the impact of flow export on the device's performance minimal. The InFlow collector receives and parses these incoming flows, aggregates the data into unique flows for a minute, and pushes them to a Kafka topic for raw flows.
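
The collector's per-minute aggregation step can be pictured with a minimal sketch like the one below. The topic name, broker address, and flow field names are illustrative assumptions, not InFlow's actual implementation.

```python
# Hypothetical sketch of the collector's per-minute aggregation step.
# Topic name, broker address, and field names are illustrative assumptions.
import json
import time
from collections import defaultdict

from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers="kafka.example:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def flow_key(flow):
    # Unique flow identity: the 5-tuple plus the exporting device.
    return (flow["src_ip"], flow["src_port"], flow["dst_ip"],
            flow["dst_port"], flow["protocol"], flow["exporter"])

def aggregate_and_publish(parsed_flows, topic="inflow-raw-flows"):
    """Aggregate one minute of parsed sFlow/IPFIX samples into unique flows."""
    buckets = defaultdict(lambda: {"bytes": 0, "packets": 0})
    for flow in parsed_flows:
        bucket = buckets[flow_key(flow)]
        bucket["bytes"] += flow["bytes"]
        bucket["packets"] += flow.get("packets", 1)

    minute = int(time.time() // 60) * 60
    for key, counters in buckets.items():
        src_ip, src_port, dst_ip, dst_port, protocol, exporter = key
        producer.send(topic, {
            "timestamp": minute,
            "src_ip": src_ip, "src_port": src_port,
            "dst_ip": dst_ip, "dst_port": dst_port,
            "protocol": protocol, "exporter": exporter,
            **counters,
        })
    producer.flush()
```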

Flow enrichment

The data processing pipeline for InFlow leverages Apache Kafka and Apache Samza for stream processing of incoming flow events. Our streaming pipeline processes 50k messages per second, enriching the data with 40 additional fields (like service, source and destination sites, security zones, ASNs, and IP address type), which are fetched from various internal services at LinkedIn. For example, our data center infrastructure management system, InOps, provides the site, security zone, and security domain of the source and destination IPs for a flow. The incoming raw flow messages are consumed by a stream processing job on Samza and, after the enriched fields are added, the result is pushed to an enriched Kafka topic.
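
The production job runs on Samza; purely as an illustration of the enrichment step, here is a simplified Python sketch in which the lookup clients and field names are hypothetical.

```python
# Conceptual sketch of flow enrichment (the real job is a Samza stream task).
# The lookup-service clients and field names are hypothetical.
def enrich_flow(raw_flow, inops_client, asn_client, service_registry):
    enriched = dict(raw_flow)

    # Site / security zone / security domain for both endpoints (from InOps).
    for side in ("src", "dst"):
        ip = raw_flow[f"{side}_ip"]
        location = inops_client.lookup(ip)          # hypothetical API
        enriched[f"{side}_site"] = location["site"]
        enriched[f"{side}_security_zone"] = location["security_zone"]
        enriched[f"{side}_security_domain"] = location["security_domain"]

    # Autonomous system numbers and owning service.
    enriched["src_asn"] = asn_client.asn_for(raw_flow["src_ip"])
    enriched["dst_asn"] = asn_client.asn_for(raw_flow["dst_ip"])
    enriched["service"] = service_registry.service_for(
        raw_flow["dst_ip"], raw_flow["dst_port"])

    # IP address type (IPv4 vs. IPv6) derived from the address itself.
    enriched["ip_version"] = 6 if ":" in raw_flow["src_ip"] else 4
    return enriched
```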

Data storage

InFlow requires storage of tens of TBs of data with a retention of 30 days. To support its real-time troubleshooting use case, the data must be queryable in real-time with sub-second latency so that engineers can query the data without any hassles during outages. For the storage layer, InFlow leverages Apache Pinot.

InFlow UI

Figure 3.  A screenshot from InFlow UI’s Explore tab which provides a self-service interface for users to visualize flow data by grouping and filtering on different dimensions

The InFlow UI is a dashboard pre-populated with some of the most commonly used visualizations of flow data. It provides a rich interface where the data can be filtered or grouped by any of the 40 different dimension fields. The UI also has an Explore section, which allows for the creation of ad-hoc queries. The UI is built on top of the InFlow API, a middleware responsible for translating user input into Pinot queries and issuing them to the Pinot cluster.
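
As a rough illustration of what the InFlow API does, the sketch below assembles a Pinot SQL query from a user's group-by and filter selections; the table name, column names, and query shape are assumptions rather than InFlow's actual schema.

```python
# Hypothetical sketch of translating UI selections into a Pinot SQL query.
# Table and column names are assumptions, not InFlow's actual schema.
def build_pinot_query(group_by, filters, start_ms, end_ms, limit=10):
    where = [f"timestampMillis BETWEEN {start_ms} AND {end_ms}"]
    for column, value in filters.items():
        where.append(f"{column} = '{value}'")
    dims = ", ".join(group_by)
    return (
        f"SELECT {dims}, SUM(bytes) AS total_bytes "
        f"FROM inflow "
        f"WHERE {' AND '.join(where)} "
        f"GROUP BY {dims} "
        f"ORDER BY total_bytes DESC "
        f"LIMIT {limit}"
    )

# Example: top 5 services by bandwidth over the last 2 hours (as in Figure 1).
# now_ms = int(time.time() * 1000)
# sql = build_pinot_query(["service"], {"flowType": "internet"},
#                         start_ms=now_ms - 2 * 3600 * 1000,
#                         end_ms=now_ms, limit=5)
```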

Pinot as a storage layer

In the first version of InFlow, data was ingested from the enriched Kafka topic to HDFS, and we leveraged Trino to facilitate user queries on the data in HDFS. However, the ETL and aggregation pipeline added a 15-20 minute delay, reducing the freshness of the data. Additionally, query latencies to HDFS using Presto were on the order of 15-30 seconds. This latency and delay were acceptable for historical data analytics; for real-time troubleshooting, however, the data needs to be available with a maximum delay of 1 minute.

Based on the query latency and data freshness requirements, we explored several storage solutions available at LinkedIn (like Espresso, Kusto, and Pinot) and decided on onboarding our data to Apache Pinot. When looking for solutions, we needed a reliable system providing real-time ingestion and sub-second query latencies. Pinot's support for Lambda and Lambda-less architectures, real-time ingestion, and low latency at high throughput could help us achieve optimal results. Additionally, the Pinot team at LinkedIn is experimenting with a new use case called Real-time Operational Metrics Analysis (ROMA), which enables engineers to slice and dice metrics along different combinations of dimensions. This helps monitor infrastructure in near real-time, analyze the last few weeks, months, or years of data to discover trends and patterns for forecasting and capacity planning, and find the root cause of outages quickly to reduce time to recovery. These objectives aligned well with our problem statement of processing large numbers of metrics in real-time.

The Pinot ingestion pipeline consumes directly from the enriched Kafka topic and creates segments on the Pinot servers, which improves the freshness of the data in the system to less than a minute. User requests from the InFlow UI are converted to Pinot SQL queries and sent to the Pinot broker for processing. Since Pinot servers keep data and indices in cache-friendly data structures, query latencies are a huge improvement over the previous version, where data was queried from disk (HDFS).
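
For reference, a Pinot REALTIME table that consumes from a Kafka topic is configured roughly as follows (shown here as a Python dict). The table name, topic, column names, and thresholds are assumptions, and only a subset of the available options is shown.

```python
# Rough sketch of a Pinot REALTIME table config consuming an enriched Kafka
# topic. Names and values are illustrative; only a subset of options is shown.
inflow_realtime_table = {
    "tableName": "inflow",
    "tableType": "REALTIME",
    "segmentsConfig": {
        "timeColumnName": "timestampMillis",   # hypothetical time column
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "30",            # 30-day retention
        "replication": "3",
    },
    "tableIndexConfig": {
        "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "inflow-enriched-flows",
            "stream.kafka.broker.list": "kafka.example:9092",
            "stream.kafka.consumer.type": "lowlevel",
            "stream.kafka.decoder.class.name":
                "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
            "realtime.segment.flush.threshold.rows": "5000000",
        },
    },
}
```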

Several optimizations were needed to reach these query latency and ingestion targets. Because the data volume of the input Kafka topic is high, we experimented with the ingestion parameters to determine the optimal number of partitions in the topic, allowing for parallel consumption into segments in Pinot. Most of our queries involved a regexp_like condition on the devicehostname column, which is the name of the network device that exported the flow; this is used to narrow down to a specific plane of the network. regexp_like is inefficient because it cannot leverage any index. To resolve this, we set up an ingestion transformation in Pinot, one of the transformation functions that can be applied to data before it is ingested. The transformation created a derived column, flowType, which classifies a flow into a specific plane of the network based on the name of the device that exported it. For example, if the exporting device is at the edge of our network, the flow can be classified as an Internet-facing flow. The flowType column is an indexed column used for equality comparisons instead of regexp_like, and this helped improve query latency by 50%.
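
Pinot's ingestion transforms are declared in the table config; a hedged sketch of such a fragment, with an illustrative Groovy expression and an assumed naming convention for network planes, might look like this:

```python
# Sketch of an ingestionConfig fragment deriving flowType from the exporting
# device's hostname; the Groovy expression and plane names are assumptions.
ingestion_config = {
    "transformConfigs": [
        {
            "columnName": "flowType",
            "transformFunction": (
                "Groovy({devicehostname.contains('edge') ? "
                "'internet' : 'internal'}, devicehostname)"
            ),
        }
    ]
}

# Queries can then filter with an indexed equality check,
#   WHERE flowType = 'internet'
# instead of regexp_like(devicehostname, '.*edge.*').
```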

Queries from InFlow always request data from a specific time range. To improve query performance, timestamp-based segment pruning was enabled on Pinot. This improved query latencies, since only the segments relevant to a query's timestamp filter are selected for processing. Based on the Pinot team's input, indexes on the different dimension columns were also set up to aid query performance.
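
Time-based segment pruning and per-column indexes are likewise declared in the table config; the fragment below is a minimal, assumed example with hypothetical column names.

```python
# Assumed config fragment enabling time-based segment pruning and inverted
# indexes on frequently filtered dimension columns (column names hypothetical).
table_config_fragment = {
    "routing": {
        # Prune segments that fall entirely outside the query's time range.
        "segmentPrunerTypes": ["time"],
    },
    "tableIndexConfig": {
        "invertedIndexColumns": ["flowType", "srcSite", "dstSite", "srcAsn"],
    },
}
```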

Conclusion

Figure 4.  Latency metric for InFlow API query for top flows in the last 12 hours before and after onboarding to Pinot

Following the successful onboarding of flow data to a real-time table on Pinot, the freshness of the data improved from 15 minutes to 1 minute and query latencies were reduced by as much as 95%. Some of the more expensive queries, which took as much as 6 minutes using Presto, now complete in 4 seconds using Pinot. This has made it easier for network engineers at LinkedIn to get the data they need for troubleshooting or for running real-time analytics on network flow data.

What’s next

The current network flow data only provides us with sampled flows from the LinkedIn backbone and edge network. Skyfall is an eBPF-based agent, developed at LinkedIn, that collects flow data and network metrics from the host's kernel with minimal overhead. The agent captures all flows for the host without sampling and will be deployed across all servers in the LinkedIn fleet. This will give us 100% coverage of flows across our data centers and enable us to support use cases that require unsampled flow data, such as security auditing and validation. Because the agent collects more data from more devices, the scale of data collected by Skyfall is expected to be 100 times that of InFlow. We are looking forward to leveraging the InFlow architecture to support this scale and provide real-time analytics on top of the rich set of metrics exported by the Skyfall agent. Another upcoming feature that we are excited about is leveraging InFlow data for anomaly detection and more traffic analytics.

Acknowledgements

Onboarding our data to Pinot was a collaborative effort and we would like to express our gratitude to Subbu Subramaniam, Sajjad Moradi, Florence Zhang, and the Pinot team at LinkedIn for their patience and efforts in understanding our requirements and working on the optimizations required to get us to optimal performance.

Thanks to Prashanth Kumar for the continuous dialogue in helping us understand the network engineering perspective on flow data. Thanks to Varoun P and Vishwa Mohan for their leadership and continued support.

Feathr joins LF AI & Data Foundation

In April 2022, Feathr was released under the Apache 2.0 license and we announced, in close conjunction with our Microsoft Azure partners, native integration and support for Feathr on Azure. Since being open sourced, Feathr has achieved substantial popularity among the machine learning operations (MLOps) community. It has been adopted by companies of various sizes across multiple industries and the community continues to grow rapidly. Most excitingly, more and more open-source enthusiasts are contributing code to Feathr.

It’s clear that many others experience the same pain points that Feathr aims to address. That’s why we are excited to share it with a broader audience and for Feathr to be adopted by a broader open-source community with help from LF AI & Data.

Donating Feathr to LF AI & Data will help ensure that it continues to grow and evolve across various dimensions, including visibility, user base, and contributor base. The Feathr development team will also have more opportunities to collaborate with other member companies and projects, such as achieving richer online store support via integration with Milvus and JanusGraph, and adopting the open data lineage standard from OpenLineage. As a result, we hope Feathr helps AI engineers build and scale feature pipelines and feature applications in ways that push MLOps tech stacks and the industry forward for years to come.

The Feathr feature store provides an abstraction layer between raw data and ML models. This abstraction layer standardizes and simplifies feature definition, transformation, serving, storage, and access from within ML workflows or applications. Feathr empowers AI engineers to focus on feature engineering while it takes care of data serialization format, connecting to various databases, performance optimization, and credential management. More specifically, Feathr helps:

  • Define features once and use them in different scenarios, like model training and model serving
  • Create training datasets with point-in-time correct semantics
  • Connect to various offline data sources (data lakes and data warehouses), and then transform source data into features
  • Deliver feature data from the offline system to the online store for faster online serving
  • Discover features or share features among colleagues or teams with ease

To learn more, please visit Feathr's GitHub page and our April 2022 blog, Open sourcing Feathr – LinkedIn's feature store for productive machine learning.

Acceptance into LF AI & Data is an important recognition from the Linux Foundation. We believe a large, diverse, healthy, and self-sustaining Feathr open-source community is important. We're excited for this new chapter of Feathr and to welcome more people into the Feathr community.

Career stories: Rejoining LinkedIn to scale our media infrastructure

Originally from Argentina, systems & infrastructure engineering leader Federico was a founding member of the Media Infrastructure team in 2015. Now based in Bellevue, Wash., Federico shares how his supportive mentor, LinkedIn’s “sweet spot” scale, and the distinctive engineering challenges here ultimately brought him back to LinkedIn in 2019.

My love for engineering started in my home country of Argentina. After working as an engineer in a corporate setting for a few years, I decided to start my own company focused on custom software development. I loved the interesting problems I could solve every day for my clients, but I was searching for greater economic opportunities in the U.S., where most of my clients were based. After working as a contractor for YouTube, I found my passion for media and engineering of video systems.

Joining and rejoining LinkedIn

When LinkedIn reached out to me with an opportunity to build their video platform in 2015, I jumped at the chance. It was thrilling to join LinkedIn at a time when we were launching in-feed video. What originally started as a team of two grew to nine people, and that's when LinkedIn began training me to step into my first management role for the Media Infrastructure team.

After growing in my management position for a few years, I left LinkedIn for an opportunity working on larger scale systems. But I quickly became burned out and missed my original role as an individual contributor at LinkedIn. My previous manager at LinkedIn was so supportive. I was offered a role as a technical architect (i.e., Senior Staff) for media infrastructure, which allowed me to return to LinkedIn with new technical knowledge, and the same passion for my work.

Making the move to a new LinkedIn home base

Once our team had grown to almost 40 people, we reached the point at which it made sense to look for additional engineering talent outside the San Francisco Bay and New York City areas. It is challenging to find engineers in the media domain since very few companies are doing what LinkedIn does at scale. That’s when we started considering the next office location as an opportunity to bring in more talent.

Ultimately, we decided on Bellevue, Washington. After eight years in the Bay Area, I was ready for a move, and Bellevue was the right fit for my wife and me for many reasons. For example, many of the media companies we partnered with had a strong engineering presence in Seattle. Our driving motivation was to spearhead the company culture and to build an identity for a new LinkedIn office. The Bellevue office just turned one year old and we have been able to build a thriving engineering community here that’s growing quickly.

Taking ownership and giving back

In my current role as a Principal Staff Software Engineer, I love that I can mix the technical side of engineering with driving the strategic and product roadmap for my organization.

As an infrastructure engineer, there’s a sweet spot here between the scale of your work and the size of your engineering team at LinkedIn. We have relatively small teams tackling very large problems in complex technical domains. This creates great opportunities for individual ownership over a significant engineering problem on a large scale. We have space to get involved and truly make a difference instead of simply being a cog in a wheel.

Throughout my time in Silicon Valley, so many mentors were instrumental in shaping my career. As I’ve grown, I’ve tried to prioritize paying it forward by mentoring my team and other engineers at LinkedIn. Relationships matter, especially at LinkedIn. Building your network is a really core value here, because we thrive on connections.

More About Federico

Based in Bellevue, Washington, Federico is a Principal Staff Systems & Infrastructure Engineer on LinkedIn's Media Infrastructure team. Prior to his time at LinkedIn, Federico's engineering career led him from launching his own software development company, ESTUDIO42, to software engineering roles at YouTube and Instagram. Federico holds a degree in Computer Engineering from the Universidad Nacional de Tucuman in Argentina. Outside of work, Federico enjoys traveling with his wife, cooking, visiting shuttle expeditions, and mixing music.

Operating system upgrades at LinkedIn’s scale

Introduction

Completing recurring operating system (OS) upgrades on time and without impacting users can be challenging. For LinkedIn, completing these upgrades at a massive scale has its own complexities as we’re often facing multiple upgrades. To secure our platform and protect our members’ data, we needed a fast and reliable OS upgrade framework with little to no human intervention.

In this blog, we’ll introduce a newly developed system, Operating System Upgrade Automation (OSUA), which allows LinkedIn to scale OS upgrades. OSUA has been used for more than 200,000 upgrades on servers that host LinkedIn’s applications.

Key features

Drawing on lessons learned from past upgrades, we built OSUA around the following four features.

Zero impact

One of our key values at LinkedIn is putting our customers and members first. In engineering, this means site-up (linkedin.com can be accessed and served anytime, from anywhere, securely) is always our first priority. OSUA is designed with mechanisms to ensure that no user-facing impact is risked during OS upgrades on servers; zero impact comes before any other feature in our design decisions.

High throughput

LinkedIn has a growing footprint of on-prem serving facilities consisting of hundreds of thousands of physical servers. To perform timely upgrades, OSUA achieves high throughput by leveraging parallelization and the ephemerality of some applications, without sacrificing site-up or causing performance regressions in the middle of upgrades. During the most recent fleet-level upgrades, OSUA was capable of upgrading more than 10x more hosts per day than the old mechanisms. Currently, we are working towards the goal of doubling what OSUA can do now.

Support heterogeneous environments

The LinkedIn serving environment is heterogeneous, including (but not limited to) stateless applications, stateful systems (explained in detail later in the post), and infrastructure services. These are hosted in multiple locations, managed by a variety of schedulers ranging from Rain and Kubernetes to Yarn, and most are deployed in a multi-tenant fashion. OSUA currently supports approximately 94% of the LinkedIn footprint made up by these systems, and its coverage continues to increase.

Automation, autonomous, reduce toil

In the past years, LinkedIn has undergone a few company-wide server OS upgrades for purposes like tech refreshes or improving our platform's security. Our previous processes for OS upgrades were highly labor intensive, which added a significant amount of toil for the teams involved. To overcome this, OSUA is designed to be a hands-free, self-serve service where users only need to click a button (or submit a CLI command). Any failures caused by upgrades are reported back quickly to the corresponding teams. To customize the upgrade process for different teams, OSUA also allows users to set up and manage their own upgrade policies.

Technical approach

At a high level, a server (as an example) needs to go through the following three steps sequentially for an OS upgrade:

Figure 1: General steps of hosts undergoing an OS upgrade

  1. Drain: Gracefully stop or evacuate the applications running on the host that are serving traffic, sometimes with extra steps to initiate data rebalancing for stateful systems.
  2. Upgrade: A server is upgraded either through a full reimage, which wipes the main partition but retains the data partition, or through a yum update-style reimage, which retains both the main and data partitions. At the end of an upgrade, every server must pass machine health checks, appropriate to its hardware and system specs, before serving any workloads.
  3. Recover: Applications are redeployed onto the server if they were allocated to it and not moved elsewhere as part of the drain step, possibly with data rebalancing and other handling for stateful systems. Servers whose ephemeral applications were evacuated become ready for allocation of new workloads.

These three steps seem simple, but they are much more complex at scale. To upgrade the entire LinkedIn fleet and provide the features listed above, OSUA is built at the orchestration layer to manage and coordinate upgrades, with the following highlights.
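
Conceptually, the per-host flow that OSUA orchestrates can be sketched as follows; the scheduler and imaging-service objects are placeholders for the real drain, reimage, and recovery machinery, not OSUA's actual interfaces.

```python
# Conceptual per-host upgrade flow orchestrated by OSUA; all helpers are
# placeholders for the real drain / reimage / recovery machinery.
def upgrade_host(host, scheduler, imaging_service):
    # 1. Drain: gracefully stop or evacuate application instances
    #    (stateful systems may first rebalance data elsewhere).
    scheduler.drain(host)

    # 2. Upgrade: full reimage (data partition retained) or in-place update,
    #    followed by hardware/system health checks.
    imaging_service.reimage(host, keep_data_partition=True)
    if not imaging_service.health_check(host):
        raise RuntimeError(f"{host} failed post-upgrade health checks")

    # 3. Recover: redeploy the applications still allocated to the host, or
    #    hand the host back to the scheduler for new workloads.
    scheduler.recover(host)
```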

Unified workflow

To support heterogeneous environments while maintaining a common upgrade process and experience, after extensive internal research, requirements gathering, and case studies, we developed a single workflow that serves as a one-size-fits-all solution, with the portability and flexibility to accommodate the characteristics of various applications and resource schedulers. Having one workflow also helps reduce onboarding and education efforts.

With a host/server/VM as the working unit of the upgrade process, the workflow is as follows:

Figure 2: Unified workflow of steps of hosts undergoing an OS upgrade with optional pre-/post-step

Here are some design decisions worth highlighting:

  • Customized handling for the drain and recover phases gives applications the ability to handle necessary tasks before and after upgrades in their own way. This is essential for preparing stateful systems for an upgrade and recovering them to their pre-upgrade, ready-to-serve condition.
  • The drain and recover phases are abstract. They are expressed as jobs encoded in a rest.li schema for multiple consumers (resource schedulers, in this case) to work on, so any consumer can be plugged in and execute its tasks in whatever way suits its needs.

Impact analysis and batching

At LinkedIn, all OS upgrades are performed while live traffic is served. Therefore, during the drain phase, OSUA can only take down a computed subset of the hosts submitted into its pipeline, relying on capacity redundancy/reserve to ensure that linkedin.com always has the capacity it needs.

OSUA leverages an internal standardized impact approval system (Blessin), which allows application teams to specify acceptable impact as a percentage of the total number of instances or as an absolute capacity number, or consults custom-built APIs provided by individual service controllers (often cluster management services) to determine whether and when a group of instances can be taken down.

While processing each host, OSUA determines all of the application instances on the host and validates them against the rules configured in Blessin to decide whether they can be taken down. If all of the application instances on a host can be taken down, the host is picked for OS upgrade. The following figure illustrates a simplified example of determining whether a host can be taken down for upgrade or whether extra coordination, such as waiting, is needed.

Figure 3: Example of impact analysis process

Some stateful teams require that a group of hosts within a fault zone (a logical group in which either all hosts or none can undergo maintenance at once) be upgraded together so that the overall cluster's rebalancing can be kept minimal. In such a scenario, OSUA tries to drain, upgrade, and recover these hosts as a single batch if all of the hosts in the batch are approved.

To maximize throughput, the impact analysis and batching mechanism is streamlined and runs at intervals, in parallel, to refresh data (such as capacity and upgrade status) in a timely manner and continuously pick hosts for upgrade.
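
A simplified sketch of the per-host impact check is shown below; the Blessin client, inventory API, and policy fields are hypothetical stand-ins for the real approval system.

```python
# Simplified sketch of OSUA's per-host impact analysis; the Blessin client
# and its fields are hypothetical stand-ins for the real approval system.
def can_take_down(host, inventory, blessin):
    """Return True only if every application instance on the host can
    tolerate the additional downtime."""
    for instance in inventory.instances_on(host):
        app = instance.application
        policy = blessin.policy_for(app)            # allowed impact for app
        total = inventory.instance_count(app)
        already_down = inventory.down_count(app)    # in maintenance/unhealthy

        allowed = max(
            int(total * policy.max_impact_percent / 100),
            policy.max_impact_absolute,
        )
        if already_down + 1 > allowed:
            return False                            # would exceed allowance
    return True

def pick_hosts_for_upgrade(candidate_hosts, inventory, blessin):
    # Re-evaluated at intervals so capacity and status data stay fresh.
    return [h for h in candidate_hosts if can_take_down(h, inventory, blessin)]
```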

Cross-system operation coordination

OSUA is not the only system doing maintenance on the site. There are constant deployments on application instances initiated by other maintenance activities, such as data defragmentation, repartitioning of stateful systems, network switch upgrades, etc. Such maintenance activities, along with routine code releases, have to be coordinated so that only one activity takes place on a host at a time. Otherwise, OSUA could pick a host to drain at the same time a routine code release takes place, which would also affect the cluster health of the application instances and cause the total impact to exceed the allowance.

Figure 4: Workflow of OSUA acquiring a lock from Insync while a system tries but fails to get the lock

To avoid this race condition, our SRE teams are working on a centralized locking system (Insync) where application instances and hosts can be locked for certain maintenance or release activities, ensuring that only one activity can take place at a time using a first in, first out (FIFO) method. A host that is locked successfully is considered down for maintenance in the effective-availability calculation during impact analysis. OSUA picks a host for maintenance only if the effective availability of each of its application instances is within the threshold configured by the application owners, and if the host is not already locked for any other maintenance activity.
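
The coordination with Insync can be pictured as a try-lock wrapped around the whole drain/upgrade/recover flow; the client API in this sketch is hypothetical.

```python
# Hypothetical sketch of coordinating an upgrade through a FIFO lock in Insync.
def upgrade_with_lock(host, insync, osua):
    lock = insync.try_acquire(resource=host, owner="OSUA",
                              reason="os-upgrade")   # FIFO: first requester wins
    if lock is None:
        # Another maintenance or release activity holds the lock; skip for now
        # and retry the host in a later cycle.
        return False
    try:
        osua.drain(host)
        osua.upgrade(host)
        osua.recover(host)
    finally:
        insync.release(lock)   # make the host available for other activities
    return True
```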

Customized execution handling for stateful systems and more

While keeping the OS upgrade workflow unified so that most applications can leverage it, a number of systems need customized handling in the form of pluggable add-on steps to the workflow because of their complexity. One example is stateful systems.

A stateful system is one where the operation of the system depends on a critical internal “state.” This state could be data or metadata that acts as the memory and history of the system at that point in time. The LinkedIn technical ecosystem comprises many stateful applications, especially on the data tier. These systems often have custom workflows that need to be executed before taking a node out of rotation (a.k.a. pre-steps) or bringing it back into the cluster (a.k.a. post-steps). These workflows vary quite a bit across the fleet and pose a bigger challenge for an automated OS upgrade setup.

In the past, engineers would need to run a number of administrative tasks manually or use scripts on to-be-upgraded hosts to ensure all necessary pre-steps are completed. Additionally, the problem is often compounded by the need to migrate data out of the to-be-upgraded host and rebalance the data across the rest of the cluster so that a minimum safe number of copies is maintained within the cluster. OSUA has to solve these diverse sets of problems while ensuring that no human toil is involved during the upgrade process.

To address the diverse demands of these systems, OSUA aligns on a solution that is uniform in approach while still giving stateful systems the flexibility to automate their unique upgrade requirements. To do this, OSUA leverages an in-house platform, STORU. STORU was initially developed to automate large-scale switch upgrades, but the system is extensible and supports customized automation before and after operations.

For pre-steps and post-steps, OSUA leverages a feature of STORU, custom hooks, which enables application owners to build custom application logic that would be executed before and after the OS upgrade process.

Figure 5: An example of custom hook execution of pre-step

In this section, we will focus on custom hooks and explore some of their salient features.

  • Pre- and post-steps: As discussed earlier, a pre-step of the custom hook allows custom code execution to get hosts ready. This is usually required to safely take hosts out of rotation with optional customized extra steps. A post-step is a mirror image of the pre-step that is executed after the OS upgrade is complete to revert the outcomes of the pre-step.
  • Custom hook execution order: OSUA allows custom hooks to be executed in various stages, which are defined relative to the application deployment step during the upgrade process. Both pre- and post-steps can be executed before, after, or both before and after application (un)deployment. This provides flexibility for stateful applications to configure how custom code execution can be invoked.
  • Custom parameters: OSUA also allows application teams to define and pass additional parameters to custom hooks when submitting host(s) for upgrade. This helps custom code handle specific nuanced cases that might apply to a subset of hosts in the fleet when they are submitted for upgrade.
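
Putting these pieces together, a stateful system's custom hook might be shaped roughly like the sketch below; this is not STORU's actual API, just an illustration of the pre-/post-step contract under assumed cluster-management calls.

```python
# Illustrative shape of a custom hook for a stateful system; this is not
# STORU's actual API, only a sketch of the pre-/post-step contract.
class ExampleStatefulHook:
    def __init__(self, cluster_client):
        self.cluster = cluster_client   # hypothetical cluster-management client

    def pre_step(self, host, params):
        # Run before the OS upgrade: move data off the host and make sure the
        # cluster still holds a safe number of replicas.
        self.cluster.mark_in_maintenance(host)
        self.cluster.migrate_partitions_off(host)
        self.cluster.wait_for_min_replication(params.get("min_replicas", 3))

    def post_step(self, host, params):
        # Mirror image of the pre-step, run after the upgrade completes:
        # bring the host back and rebalance data onto it.
        self.cluster.clear_maintenance(host)
        self.cluster.rebalance_onto(host)
```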

Auto-remediation

At scale, a certain percentage of failures will occur during any step of the OS upgrade process, ranging from failed application uninstallation or deployment to server breakdown. OSUA is equipped with mechanisms to detect, analyze, triage, and remediate failures automatically, which greatly reduces human toil and facilitates company-wide hardware repair and refresh.

Self-contained

OSUA is by nature an infrastructure service. To be self-contained and avoid circular dependencies, which can result in cascading outages, we built OSUA on top of a limited number of internal control plane services and avoided depending on large-scale data plane systems whenever alternative solutions could be found. For example, for event messaging, instead of using LinkedIn's ready-for-use Kafka clusters, we implemented a lightweight REST-based pub-sub mechanism within OSUA. This avoids circular dependencies such as Kafka relying on Kafka (as an OSUA dependency) to upgrade the OS of Kafka hosts, which could lead to cascading failures when an upgrade is unsuccessful.

A recent LinkedIn OS upgrade

Since its introduction, OSUA has successfully performed more than 200k upgrades at LinkedIn, with more than seven million system packages updated and 18 million vulnerabilities addressed on these servers, and with no external impact to LinkedIn customers and members from outages rooted in systemic processes. Further, the engineering effort teams spend on OS upgrades has been reduced by 90% compared to previous upgrades, which were even smaller in scale. The daily peak upgrade velocity is a 10x improvement over previous upgrades.

Now, many LinkedIn engineering teams come to this single platform to delegate OS upgrade operations worry-free.

Next steps

OSUA has recently shown success in LinkedIn's on-prem infrastructure upgrades. However, increasing upgrade velocity with a lower failure rate and less human intervention remains our continued focus.

Acknowledgements

OSUA could not have been accomplished without the help of many engineers, managers, and TPMs across many teams. The engineers who have made contributions to OSUA are: Anil Alluri, Aman Sharma, Anant Bir Singh, Barak Zhou, Clint Joseph, Hari Prabhakaran, Jose Thomas Kiriyanthan, Junyuan Zeng, Keith Ward, Nikhita Kataria, Parvathy Geetha, Ronak Nathani, Ritu Panjwani, Subhas Sinha, Sagar Ippalpalli, Tim McNally, Vijay Bais, Ying He, Yash Shah, John Sushant Sundharam, and Deepshika. Special thanks to our TPMs Sean Patrick and Soumya Nair, who have been steering this project from Day 1. We'd also like to thank the engineering leadership, Ashi Sareen, Mir Islam, Samir Tata, Sankar Hariharan, and Senthilkumar Eswaran, who have been providing continuous support for building OSUA. Additionally, we would like to thank Adam Debus, Justin Anderson, and Samir Jafferali for their reviews and valuable feedback.
