Open Sourcing AvroTensorDataset: A Performant TensorFlow Dataset For Processing Avro Data

Co-authors: Jonathan Hung, Pei-Lun Liao, Lijuan Zhang, Abin Shahab, Keqiu Hu
TensorFlow is one of the most popular frameworks we use to train machine learning (ML) models at LinkedIn. It allows us to develop various ML models across our platform that power relevance and matching in the news feed, advertisements, recruiting solutions, and more. To ensure the best member experience, we want our models to be accurate and up-to-date, which requires training the models as fast as possible. However, we found that many of our workloads were bottlenecked by reading multiple terabytes of input data.
To remove this bottleneck, we built AvroTensorDataset, a TensorFlow dataset for reading, parsing, and processing Avro data. AvroTensorDataset speeds up data preprocessing by multiple orders of magnitude, enabling us to keep site content as fresh as possible for our members. Today, we’re excited to open source this tool so that other Avro and TensorFlow users can use this dataset in their machine learning pipelines to get a large performance boost to their training workloads.
In this blog post, we will discuss the AvroTensorDataset API, techniques we used to improve data processing speeds by up to 162x over existing solutions (thereby decreasing overall training time by up to 66%), and performance results from benchmarks and production.
In general, a machine learning training pipeline requires the following steps:
- Input data pre-processing
- Ingesting input data from disk to memory
- Training the machine learning model
- Model validation and post-processing
Today at LinkedIn, Avro is the primary supported storage format for machine learning training data (LinkedIn uses Apache Hadoop for much of our data processing, and Avro is a widely used serialization format in Hadoop). Users provide a schema describing their data format, and Avro provides multi-language support for reading and writing Avro data from/to disk.
Avro schemas support a wide variety of types: primitive types (int, long, float, boolean, etc.) and complex types (record, enum, array, map, union, fixed). Avro serializes and deserializes data based on the types declared in the schema. For example, ints and longs use variable-length zig-zag encoding, and arrays are encoded as a count (the number of elements in the array), followed by the encoded array elements, and then zero-terminated.
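For concreteness, here is a small, self-contained sketch of the variable-length zig-zag encoding mentioned above; it mirrors standard Avro behavior and is not part of the ATDS code:

```python
# Zig-zag maps signed integers to unsigned ones so that values of small
# magnitude (positive or negative) encode to few bytes: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
def zigzag_encode(n: int) -> int:
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

def encode_long(n: int) -> bytes:
    """Variable-length (varint) encoding of a zig-zag encoded 64-bit long."""
    z = zigzag_encode(n) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while True:
        byte = z & 0x7F              # 7 payload bits per output byte
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

assert encode_long(-1) == b"\x01" and encode_long(64) == b"\x80\x01"
```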
An Avro file is formatted with the following bytes:
Figure 1: Avro file and data block byte layout
The Avro file consists of four “magic” bytes, file metadata (including a schema, which all objects in this file must conform to), a 16-byte file-specific sync marker, and a sequence of data blocks separated by the file’s sync marker.
Each data block contains the number of objects in that block, the size in bytes of the objects in that block, and a sequence of serialized objects.
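For reference, Avro data is typically read back in plain Python with the standard avro package, which recovers the schema from the file header and then decodes the objects of each data block in sequence (the file name below is a placeholder):

```python
from avro.datafile import DataFileReader
from avro.io import DatumReader

# The reader parses the header (magic bytes, metadata/schema, sync marker)
# and then iterates over the serialized objects of each data block in order.
reader = DataFileReader(open("part-00000.avro", "rb"), DatumReader())
for record in reader:
    print(record)
reader.close()
```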
Existing AvroRecordDataset
TensorFlow I/O contains an existing AvroRecordDataset which reads and parses Avro files into tensors. The AvroRecordDataset itself is a tf.data.Dataset implementation whose associated AvroRecordDataset operation reads bytes from Avro files into memory.
AvroRecordDataset supports prefetching, parsing, shuffling, and batching via an auxiliary function make_avro_record_dataset which:
- Creates an AvroRecordDataset
- Shuffles via the underlying tf.data.Dataset ShuffleDataset operation
- Batches via the underlying tf.data.Dataset BatchDataset operation
- Parses by applying the ParseAvro operation via the tf.data.Dataset map operation
- Prefetches via the underlying tf.data.Dataset PrefetchDataset operation
The ParseAvro operation can parse Avro data with arbitrary schemas (primitive types, and/or nested complex types such as maps, unions, arrays, etc.). It defers parsing to Avro’s GenericReader; this implementation recursively decodes the incoming bytes based on the potentially arbitrarily nested schemas (e.g. an array within a map, within a union, etc…). For complex types like arrays, it dynamically resizes the in-memory data structure which stores the parsed elements as it sequentially parses additional elements.
AvroTensorDataset API
Python API
The AvroTensorDataset supports the same features as AvroRecordDataset. Here is an example of how to instantiate AvroTensorDataset:

```python
import tensorflow as tf
from tensorflow_io.core.python.experimental.atds.dataset import ATDSDataset
from tensorflow_io.core.python.experimental.atds.features import DenseFeature, SparseFeature, VarlenFeature

dataset = ATDSDataset(
    filenames=["part-00000.avro", "part-00001.avro"],
    batch_size=1024,
    features={
        "dense_feature": DenseFeature(shape=[128], dtype=tf.float32),
        "sparse_feature": SparseFeature(shape=[50001], dtype=tf.float32),
        # -1 means unknown dimension.
        "varlen_feature": VarlenFeature(shape=[-1, -1], dtype=tf.int64),
    },
)
```
The constructor supports the following arguments:
| Argument | Type | Comment |
| --- | --- | --- |
| filenames | tf.string or tf.data.Dataset | A tf.string tensor containing one or more filenames. |
| batch_size | tf.int64 | A tf.int64 scalar representing the number of records to read and parse per iteration. |
| features | Dict[str, Union[DenseFeature, SparseFeature, VarlenFeature]] | A feature configuration dict with feature name as key and feature spec as value. We support DenseFeature, SparseFeature, and VarlenFeature specs. All of them are named tuples with shape and dtype information. |
| drop_remainder | tf.bool | (Optional.) A tf.bool scalar tf.Tensor, representing whether the last batch should be dropped in the case it has fewer than batch_size elements. The default behavior is not to drop the smaller batch. |
| reader_buffer_size | tf.int64 | (Optional.) A tf.int64 scalar representing the number of bytes used in the file content buffering. Default is 128 * 1024 (128 KB). |
| shuffle_buffer_size | tf.int64 | (Optional.) A tf.int64 scalar representing the number of records to shuffle together before batching. Default is zero. Zero shuffle buffer size means shuffle is disabled. |
| num_parallel_calls | tf.int64 | (Optional.) A tf.int64 scalar representing the maximum thread number used in the dataset. If greater than one, records in files are processed in parallel. The number will be truncated when it is greater than the maximum available parallelism on the host. If the value tf.data.AUTOTUNE is used, then the number of parallel calls is set dynamically based on available CPU and workload. Default is 1. |
At a minimum, the constructor requires the list of files to read, the batch size (to support batching), and a dict containing feature specs. Prefetch is enabled by default, and its behavior can be tuned via reader_buffer_size. Parsing happens automatically within the ATDSDataset operation. Shuffling is supported by configuring shuffle_buffer_size.
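Putting this together, a minimal sketch of consuming the dataset in a training loop might look like the following. The file name, feature spec, and training step are placeholders, and each yielded element is expected to be a dict mapping feature names to batched tensors:

```python
import tensorflow as tf
from tensorflow_io.core.python.experimental.atds.dataset import ATDSDataset
from tensorflow_io.core.python.experimental.atds.features import DenseFeature

dataset = ATDSDataset(
    filenames=["part-00000.avro"],   # placeholder file name
    batch_size=1024,
    features={"dense_feature": DenseFeature(shape=[128], dtype=tf.float32)},
)

for batch in dataset:
    dense = batch["dense_feature"]   # expected shape: [1024, 128]
    # train_step(dense)              # placeholder for the actual training step
```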
Supported Avro Schemas
Although Avro supports many complex types (unions, maps, etc.), AvroTensorDataset only supports records of primitives and nested arrays. These supported types cover most TensorFlow use cases, and we get a big performance boost by only supporting a subset of complex types (more on that later).
AvroTensorDataset supports dense features, sparse features, and variable-length features, built on the TensorFlow primitive types that have Avro counterparts. They are represented in Avro as follows:
Primitive Types
All Avro primitive types are supported, and map to the following TensorFlow dtypes:
| Avro data type | int | long | float | double | boolean | string | bytes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| tf.dtype | tf.int32 | tf.int64 | tf.float32 | tf.float64 | tf.bool | tf.string | tf.string |
Dense Features
Dense features are represented as nested arrays in Avro. For example, a doubly nested array represents a dense feature with rank 2. Some examples of Avro schemas representing dense features:
"fields": [ { "name" : "scalar_double_feature", "type" : "double" }, { "name" : "1d_double_feature", "type" : { "type": "array", "items" : "double" } }, { "name" : "2d_float_feature", "type" : { "type": "array", "items" : { "type": "array", "items": "float" } } } ]
Dense features are parsed into dense tensors. For the above, the features argument to ATDSDataset might be:
{ "scalar_double_feature": DenseFeature(shape=[], dtype=tf.float64), "1d_double_feature": DenseFeature(shape=[128], dtype=tf.float64), "2d_float_feature": DenseFeature(shape=[16, 100], dtype=tf.float32), }
Sparse Features
Sparse features are represented as a flat list of arrays in Avro. For a sparse feature with rank N, the Avro schema contains N+1 arrays: arrays named “indices0”, “indices1”, …, “indices(N-1)” and an array named “values”. All N+1 arrays should have the same length. For example, this is the schema for a sparse feature with dtype float and rank 2:
"fields": [ { "name" : "2d_float_sparse_feature", "type" : { "type" : "record", "name" : "2d_float_sparse_feature", "fields" : [ { "name": "indices0", "type": { "type": "array", "items": "long" } }, { "name": "indices1", "type": { "type": "array", "items": "long" } }, { "name": "values", "type": { "type": "array", "items": "float" } } ] } } ]
Sparse features are parsed into sparse tensors. For the above, the features argument to ATDSDataset might be:
{ "2d_float_sparse_feature": SparseFeature(shape=[16, 10], dtype=tf.float32), }
The i-th indices array holds the indices for dimension i; in other words, the Avro representation of a sparse tensor is in coordinate (COO) format. For example, the sparse tensor tf.sparse.SparseTensor(indices=[[0,1], [2,4], [6,5]], values=[1.0, 2.0, 3.0], dense_shape=[8, 10]) would be represented in Avro via the following:
{ "indices0" : [0, 2, 6], "indices1" : [1, 4, 5], "values" : [1.0, 2.0, 3.0] }
VarLen Features
VarLen features are similar to dense features in that they are also represented as nested arrays in Avro, but they can have dimensions of unknown length (indicated by -1). Some examples of Avro schemas representing variable-length features:
"fields": [ { "name" : "1d_bool_varlen_feature", "type" : { "type": "array", "items" : "boolean" } }, { "name" : "2d_long_varlen_feature", "type" : { "type": "array", "items" : { "type": "array", "items": "long" } } } ]
Dimensions with length -1 can be variable length, hence variable-length features are parsed into sparse tensors. For the above, the features argument to ATDSDataset might be:
{ "1d_bool_varlen_feature": VarlenFeature(shape=[-1], dtype=tf.bool), "2d_long_varlen_feature": VarlenFeature(shape=[2, -1], dtype=tf.int64), }
Here, 2d_long_varlen_feature has variable length in the last dimension; for example, an object with values [[1, 2, 3], [4, 5]] would be parsed as tf.sparse.SparseTensor(indices=[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1]], values=[1, 2, 3, 4, 5], dense_shape=[2, 3]).
Performance Optimizations
AvroTensorDataset implements a few features to optimize performance.
Operation Fusion
AvroTensorDataset fuses several TensorFlow Dataset operations: Read, Prefetch, Parse, Shuffle, and Batch, into a single ATDSDataset op.
- The read step reads raw Avro bytes from a local or remote filesystem into a memory buffer.
- The prefetch step provides readahead in a separate producer thread; the remaining steps act as consumers of the prefetched bytes.
- The parse step converts the in-memory bytes to TensorFlow tensors. The bytes are decoded based on the provided feature metadata (e.g., a column whose metadata is a DenseFeature with shape [10, 20] and dtype tf.float32 is parsed into a 2-D tensor with dtype tf.float32).
- The shuffle step shuffles the objects read into memory by the prefetch step. The prefetch step reads batch_size + shuffle_buffer_size objects into memory, and the shuffle step randomly chooses batch_size objects to parse and return.
- The batch step merges the features of multiple records into single batched tensors, reducing the memory footprint and improving training performance.
Implementing these steps as separate operations introduces overhead that hurts data ingestion performance. While each step can be individually multithreaded via parallel loops over the entire data ingestion pipeline, fusing them into a single operation allows for better multithreading, pipelining, and tuning.
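For contrast, a non-fused pipeline wires these stages together as separate tf.data transformations, each a distinct dataset op with its own scheduling overhead. The sketch below is only an illustration of that shape (source_dataset and parse_fn are placeholders), not the actual make_avro_record_dataset implementation:

```python
import tensorflow as tf

def unfused_pipeline(source_dataset, parse_fn, batch_size, shuffle_buffer_size):
    # Each call below creates a separate tf.data dataset op; ATDSDataset fuses
    # the equivalent read/prefetch/parse/shuffle/batch work into one op.
    return (
        source_dataset                                        # read raw serialized records
        .shuffle(shuffle_buffer_size)                         # shuffle
        .batch(batch_size)                                    # batch
        .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)   # parse
        .prefetch(tf.data.AUTOTUNE)                           # prefetch
    )
```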
Schema Performance
As mentioned earlier, AvroTensorDataset only supports reading Avro primitives and array types. With these types, we can already support dense, sparse, and ragged tensors, which cover the majority of use cases.
Previously, when supporting other complex types such as unions and maps, Avro schemas could get arbitrarily complicated (e.g. a record containing a map, whose values are arrays containing unions of int and long). Furthermore, since an Avro block stores objects in sequence, they are decoded in sequence, and each object must be deserialized according to the (arbitrarily complicated) schema. For a schema with lots of nested unions/maps/arrays/etc., this recursive type checking introduces a lot of overhead. We fix this by only supporting arrays and records as complex types.
Decoding arrays also introduces overhead. An array is serialized with the following bytes:
Figure 2: Avro array byte layout
It contains a sequence of blocks, where each block contains a count and a sequence of serialized array objects. Therefore, the blocks must be decoded in sequence, and we don’t know the length of the array until all of the array’s blocks are decoded. This requires us to continuously resize the in-memory data structure storing the decoded array as more blocks are decoded. We fix this by passing the array shapes to the ATDSDataset constructor, so we can pre-allocate the in-memory array without having to resize it.
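The difference can be sketched with NumPy buffers standing in for the parser's output storage (illustrative only, not the ATDS implementation):

```python
import numpy as np

# Without a declared shape, each decoded Avro array block forces the parser
# to grow its output buffer.
def decode_without_shape(blocks):
    out = np.empty(0, dtype=np.float32)
    for block in blocks:  # block: list of decoded floats
        out = np.concatenate([out, np.asarray(block, dtype=np.float32)])  # repeated reallocation + copy
    return out

# With the shape declared up front (e.g. DenseFeature(shape=[128], ...)),
# the output buffer can be allocated once and filled in place.
def decode_with_shape(blocks, length):
    out = np.empty(length, dtype=np.float32)  # single allocation
    offset = 0
    for block in blocks:
        out[offset:offset + len(block)] = block
        offset += len(block)
    return out
```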
Shuffle Algorithm
Another challenge with Avro is that Avro blocks do not track the offsets of the objects within a block, which makes it impossible to jump to a random offset and decode an object. In other words, Avro blocks can only be read sequentially. This limitation adds complexity to shuffling: to shuffle Avro records within a block, we would have to read all records sequentially and shuffle the intermediate results, which introduces an extra copy that hurts performance.
In the ATDSDataset, we propose a shuffle algorithm that samples the number of records to read from each in-memory Avro block and merges the objects read from multiple blocks into the batched output. This way, we still read Avro objects sequentially without an extra copy. Avro blocks are kept in memory until they are fully read. For example, assume three Avro blocks are loaded into memory and each block stores ten Avro objects. ATDSDataset can read one object from block 1, two objects from block 2, and one object from block 3 to create output tensors with batch size 4. The number of objects to read from each block is randomly sampled. Although the algorithm does not provide perfect shuffling, we have not seen model performance degradation in our production models.
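A simplified sketch of the idea (not the production implementation): blocks stay in memory, the number of objects taken from each block is sampled, and every block is still consumed strictly in order.

```python
import random

def sample_batch(blocks, batch_size, rng=random):
    """blocks: list of iterators, each yielding the objects of one in-memory Avro block."""
    batch = []
    while len(batch) < batch_size and blocks:
        block = rng.choice(blocks)                       # pick a block at random
        take = rng.randint(1, batch_size - len(batch))   # sample how many objects to take from it
        for _ in range(take):
            try:
                batch.append(next(block))                # sequential read within the block
            except StopIteration:
                blocks.remove(block)                     # block fully consumed; drop it
                break
    return batch
```

In the example from the text (three blocks of ten objects, batch size 4), one call might take one object from block 1, two from block 2, and one from block 3.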
Thread Parallelism
The ATDSDataset constructor takes a num_parallel_calls argument which determines how many threads to use for parsing. ATDSDataset determines which blocks the next returned batch of objects belongs to (either the earliest-read blocks if shuffle is not enabled, or the blocks containing the randomly selected batch_size objects if shuffle is enabled). These blocks are split across the configured number of threads and parsed in parallel.
Block Distribution
The logic for distributing blocks across threads can also impact performance. Ideally, threads complete parsing at the same time; otherwise, multi-threading doesn’t achieve maximum speedup. To achieve this, we apply a cost-based model to estimate the time it takes to process a block, then distribute blocks by balancing cost. The cost of a block is impacted by whether it is compressed or not, and how many remaining undecoded objects it contains.
Here is an example with 8 in-memory blocks, and 4 threads. Blocks could be distributed via the following:
Figure 3: Eight blocks distributed across four threads
Note that since the blocks given to threads 0 and 1 are uncompressed, these threads are given more blocks to decode compared to threads 2 and 3, and threads 0 and 1 are given (roughly) equal numbers of objects to decode.
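A rough sketch of such a cost-based assignment; the cost function and the compression weight below are illustrative placeholders, not the production cost model:

```python
def distribute_blocks(blocks, num_threads, compressed_weight=3.0):
    """blocks: list of (num_undecoded_objects, is_compressed) tuples."""
    def cost(block):
        objects, is_compressed = block
        # Compressed blocks are more expensive to process than uncompressed ones.
        return objects * (compressed_weight if is_compressed else 1.0)

    assignments = [[] for _ in range(num_threads)]
    loads = [0.0] * num_threads
    # Greedy balancing: hand each block (largest cost first) to the least-loaded thread.
    for block in sorted(blocks, key=cost, reverse=True):
        idx = loads.index(min(loads))
        assignments[idx].append(block)
        loads[idx] += cost(block)
    return assignments
```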
Thread Count Auto-Tuning
Although increasing thread count can help performance, eventually it will reach a point of diminishing returns; increasing thread count too much could actually hurt performance as well, due to thread latency overhead. Furthermore, it would be wasteful to spawn six threads if there are only five blocks in memory.
num_parallel_calls supports the tf.data.AUTOTUNE parameter which will let ATDSDataset determine the appropriate number of threads when processing each batch. To do this, it chooses the thread count which will minimize estimated cost, where the estimated cost for a thread count is:
estimated_cost = (Σ block_cost) / thread_count + thread_latency_overhead
We compute the total cost of decompressing and decoding the current batch, distribute this cost among all threads, and add the thread latency overhead for this thread count. Note that increasing thread count will reduce the average cost, but increase the thread latency overhead.
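A minimal sketch of that selection rule, assuming some estimate of per-thread-count latency overhead is available (thread_latency_overhead is a placeholder function):

```python
def choose_thread_count(block_costs, max_threads, thread_latency_overhead):
    """Pick the thread count minimizing (total block cost / threads) + thread overhead."""
    total_cost = sum(block_costs)
    best_count, best_estimate = 1, float("inf")
    # No benefit in spawning more threads than there are blocks to decode.
    for count in range(1, min(max_threads, len(block_costs)) + 1):
        estimate = total_cost / count + thread_latency_overhead(count)
        if estimate < best_estimate:
            best_count, best_estimate = count, estimate
    return best_count
```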
Thread Parallelism Benchmarks
In our experiments, we found that increasing thread parallelism can help speed up throughput by distributing the parsing workload across threads. It is especially helpful for workloads with a large number of blocks to process on each iteration (e.g. workloads with a large batch size).
We ran benchmarks to measure I/O throughput on various thread counts (with deflate codec). The benchmark contains various dense and sparse features with different shapes and dtypes.
Figure 4: Throughput scaling for multi-threaded AvroTensorDataset
Generally, increasing threads can increase throughput, with better scaling as the batch size increases (since there is more workload to distribute among more threads). Furthermore, thread autotuning can achieve close to optimal performance.
Performance Results
AvroTensorDataset has been in production at LinkedIn for over a year as the default Avro reader for machine learning training, and has removed I/O as a training bottleneck. It improves on existing Avro data ingestion solutions by multiple orders of magnitude.
We ran a benchmark on an internal production schema across various batch sizes to compare the I/O performance of AvroRecordDataset and ATDSDataset. The schema contained:
- 6 scalar tensors (dense tensors with rank 0)
- 8 dense tensors with rank 1
- 5 sparse tensors with rank 1
This was the average time spent in I/O per step:
| | Batch size 64 | Batch size 256 | Batch size 1024 |
| --- | --- | --- | --- |
| AvroRecordDataset | 40 ms/step | 160 ms/step | 650 ms/step |
| ATDSDataset | 1.2 ms/step | 1.3 ms/step | 4 ms/step |
| Improvement | 33x | 123x | 162x |
Figure 5: AvroRecordDataset vs. AvroTensorDataset latency
Furthermore, we saw a 35%-66% reduction in total training time (not just I/O time) for production flows.
Conclusion
ATDSDataset is LinkedIn’s solution to efficiently read Avro data into TensorFlow. Through multiple performance enhancements, we were able to speed up I/O throughput by orders of magnitude over existing Avro reader solutions. Our team at LinkedIn worked closely with the TensorFlow I/O community to open-source this feature, and we hope that by open-sourcing it, the TensorFlow community can also benefit from these performance enhancements. For more details, please check out the ATDSDataset code on GitHub here.
Acknowledgments
Thanks to an amazing team of engineers in the Deep Learning Infrastructure team Pei-Lun Liao, Jonathan Hung, Abin Shahab, Arup De, Lijuan Zhang, and Cheng Ren for working on this project, and special thanks to Pei-Lun Liao for starting and providing technical guidance throughout this project. Thanks to the management team for supporting this project: Keqiu Hu, Joshua Hartman, Animesh Singh, Tanton Gibbs, and Kapil Surlaker. Many thanks to the support from the TensorFlow open-source community for reviewing the PR: Vignesh Kothapalli. Last but not least, many thanks to the reviewers of this blog post: Ben Levine, Animesh Singh, Qingquan Song, Keqiu Hu, and the LinkedIn Editorial team: Katherine Vaiente, and Greg Earl for your reviews and suggestions.
Career stories: Influencing engineering growth at LinkedIn

Since learning frontend and backend skills, Rishika’s passion for engineering has expanded beyond her team at LinkedIn to grow into her own digital community. As she develops as an engineer, giving back has become the most rewarding part of her role.
From intern to engineer—life at LinkedIn
My career with LinkedIn began with a college internship, where I got to dive into all things engineering. Even as a summer intern, I absorbed so much about frontend and backend engineering during my time here. When I considered joining LinkedIn full-time after graduation, I thought back to the work culture and how my manager treated me during my internship. Although I had a virtual experience during COVID-19, the LinkedIn team ensured I was involved in team meetings and discussions. That mentorship opportunity ultimately led me to accept an offer from LinkedIn over other offers.
Before joining LinkedIn full-time, I worked with Adobe as a Product Intern for six months, where my projects revolved around the core libraries in the C++ language. When I started my role here, I had to shift to using a different tech stack: Java for the backend and JavaScript framework for the frontend. This was a new challenge for me, but the learning curve was beneficial since I got hands-on exposure to pick up new things by myself. Also, I have had the chance to work with some of the finest engineers; learning from the people around me has been such a fulfilling experience. I would like to thank Sandeep and Yash for their constant support throughout my journey and for mentoring me since the very beginning of my journey with LinkedIn.
Currently, I’m working with the Trust team on building moderation tools for all our LinkedIn content while guaranteeing that we remove spam on our platform, which can negatively affect the LinkedIn member experience. Depending on the project, I work on both the backend and the frontend, since my team handles the full-stack development. At LinkedIn, I have had the opportunity to work on a diverse set of projects and handle them from end to end.
Mentoring the next generation of engineering graduates
I didn’t have a mentor during college, so I’m so passionate about helping college juniors find their way in engineering. When I first started out, I came from a biology background, so I was not aware of programming languages and how to translate them into building a technical resume. I wish there would have been someone to help me out with debugging and finding solutions, so it’s important to me to give back in that way.
I’m quite active in university communities, participating in student-led tech events like hackathons to help students get into tech and secure their first job in the industry. I also love virtual events like X (formerly Twitter) and LinkedIn Live events. Additionally, I’m part of LinkedIn’s CoachIn Program, where we help with resume building and offer scholarships for women in tech.
Influencing online and off at LinkedIn
I love creating engineering content on LinkedIn, X, and other social media platforms, where people often contact me about opportunities at LinkedIn Engineering. It brings me so much satisfaction to tell others about our amazing company culture and connect with future grads.
When I embarked on my role during COVID-19, building an online presence helped me stay connected with what’s happening in the tech world. I began posting on X first, and once that community grew, I launched my YouTube channel to share beginner-level content on data structures and algorithms. My managers and peers at LinkedIn were so supportive, so I broadened my content to cover aspects like soft skills, student hackathons, resume building, and more. While this is in addition to my regular engineering duties, I truly enjoy sharing my insights with my audience of 60,000+ followers. And the enthusiasm from my team inspires me to keep going! I’m excited to see what the future holds for me at LinkedIn as an engineer and a resource for my community on the LinkedIn platform.
About Rishika
Rishika holds a Bachelor of Technology from Indira Gandhi Delhi Technical University for Women. Before joining LinkedIn, she interned at Google as part of the SPS program and as a Product Intern at Adobe. She currently works as a software engineer on LinkedIn’s Trust Team. Outside of work, Rishika loves to travel all over India and create digital art.
Editor’s note: Considering an engineering/tech career at LinkedIn? In this Career Stories series, you’ll hear first-hand from our engineers and technologists about real life at LinkedIn — including our meaningful work, collaborative culture, and transformational growth. For more on tech careers at LinkedIn, visit: lnkd.in/EngCareers.
Career Stories: Learning and growing through mentorship and community

Lekshmy has always been interested in a role in a company that would allow her to use her people skills and engineering background to help others. Working as a software engineer at various companies led her to hear about the company culture at LinkedIn. After some focused networking, Lekshmy landed her position at LinkedIn and has been continuing to excel ever since.
How did I get my job at LinkedIn? Through LinkedIn.
Before my current role, I had heard great things about the company and its culture. After hearing about InDays (Investment Days) and how LinkedIn supports its employees, I knew I wanted to work there.
While at the College of Engineering, Trivandrum (CET), I knew I wanted to pursue a career in software engineering. Engineering is something that I’m good at and absolutely love, and my passion for the field has only grown since joining LinkedIn. When I graduated from CET, I began working at Groupon as a software developer, starting on databases, REST APIs, application deployment, and data structures. From that role, I was able to advance into the position of software developer engineer 2, which enabled me to dive into other software languages, as well as the development of internal systems. That’s where I first began mentoring teammates and realized I loved teaching and helping others. It was around this time that I heard of LinkedIn through the grapevine.
Joining the LinkedIn community
Everything I heard about LinkedIn made me very interested in career opportunities there, but I didn’t have connections yet. I did some research and reached out to a talent acquisition manager on LinkedIn and created a connection which started a path to my first role at the company.
When I joined LinkedIn, I started on the LinkedIn Talent Solutions (LTS) team. It was a phenomenal way to start because not only did I enjoy the work, but the experience served as a proper introduction to the culture at LinkedIn. I started during the pandemic, which meant remote working, and eventually, as the world situation improved, we went hybrid. This is a great system for me; I have a wonderful blend of being in the office and working remotely. When I’m in the office, I like to catch up with my team by talking about movies or playing games, going beyond work topics, and getting to know each other. With LinkedIn’s culture, you really feel that sense of belonging and recognize that this is an environment where you can build lasting connections.
LinkedIn: a people-first company
If you haven’t been able to tell already, even though I mostly work with software, I truly am a people person. I just love being part of a community. At the height of the pandemic, I’ll admit I struggled with a bit of imposter syndrome and anxiety. But I wasn’t sure how to ask for help. I talked with my mentor at LinkedIn, and they recommended I use the Employee Assistance Program (EAP) that LinkedIn provides.
I was nervous about taking advantage of the program, but I am so happy that I did. The EAP helped me immensely when everything felt uncertain, and I truly felt that the company was on my side, giving me the space and resources to help relieve my stress. Now, when a colleague struggles with something similar, I recommend they consider the EAP, knowing firsthand how effective it is.
Building a path for others’ growth
With my mentor, I was also able to learn about and become a part of our Women in Technology (WIT) Invest Program. WIT Invest is a program that provides opportunities like networking, mentorship check-ins, and executive coaching sessions. WIT Invest helped me adopt a daily growth mindset and find my own path as a mentor for college students. When mentoring, I aim to build trust and be open, allowing an authentic connection to form. The students I work with come to me for all kinds of guidance; it’s just one way I give back to the next generation and the wider LinkedIn community. Providing the kind of support my mentor gave me early on was a full-circle moment for me.
Working at LinkedIn is everything I thought it would be and more. I honestly wake up excited to work every day. In my three years here, I have learned so much, met new people, and engaged with new ideas, all of which have advanced my career and helped me support the professional development of my peers. I am so happy I took a leap of faith and messaged that talent acquisition manager on LinkedIn. To anyone thinking about applying to LinkedIn, go for it. Apply, send a message, and network—you never know what one connection can bring!
About Lekshmy
Based in Bengaluru, Karnataka, India, Lekshmy is a Senior Software Engineer on LinkedIn’s Hiring Platform Engineering team, focused on the Internal Mobility Project. Before joining LinkedIn, Lekshmy held various software engineering positions at Groupon, including SDE 3. Lekshmy holds a degree in Computer Science from the College of Engineering, Trivandrum, and is a trained classical dancer. Outside of work, Lekshmy enjoys painting, gardening, and trying new hobbies that pique her interest.
Editor’s note: Considering an engineering/tech career at LinkedIn? In this Career Stories series, you’ll hear first-hand from our engineers and technologists about real life at LinkedIn — including our meaningful work, collaborative culture, and transformational growth. For more on tech careers at LinkedIn, visit: lnkd.in/EngCareers.
Solving Espresso’s scalability and performance challenges to support our member base

Espresso is the database that we designed to power our member profiles, feed, recommendations, and hundreds of other LinkedIn applications that handle large amounts of data and need both high performance and reliability. As Espresso continued to expand in support of our 950M+ member base, the number of network connections that it needed began to drive scalability and resiliency challenges. To address these challenges, we migrated to HTTP/2. With the initial Netty-based implementation, we observed a 45% degradation in throughput, which we needed to analyze and correct.
In this post, we will explain how we solved these challenges and improved system performance. We will also delve into the various optimization efforts we employed on Espresso’s online operation section, implementing one approach that resulted in a 75% performance boost.
Espresso Architecture
Figure 1. Espresso System Overview
Figure 1 is a high-level overview of the Espresso ecosystem, which includes the online operation section of Espresso (the main focus of this blog post). This section comprises two major components – the router and the storage node. The router is responsible for directing the request to the relevant storage node and the storage layer’s primary responsibility is to get data from the MySQL database and present the response in the desired format to the member. Espresso utilizes the open-source framework Netty for the transport layer, which has been heavily customized for Espresso’s needs.
Need for new transport layer architecture
In the communication between the router and storage layer, our earlier approach involved utilizing HTTP/1.1, a protocol extensively employed for interactions between web servers and clients. However, HTTP/1.1 operates on a connection-per-request basis. In the context of large clusters, this approach led to millions of concurrent connections between the router and the storage nodes. This resulted in constraints on scalability, resiliency, and numerous performance-related hurdles.
Scalability: Scalability is a crucial aspect of any database system, and Espresso is no exception. In our recent cluster expansion, adding an additional 100 router nodes caused the memory usage to spike by around 2.5GB. The additional memory can be attributed to the new TCP network connections within the storage nodes. Consequently, we experienced a 15% latency increase due to an increase in garbage collection. The number of connections to storage nodes posed a significant challenge to scaling up the cluster, and we needed to address this to ensure seamless scalability.
Resiliency: In the event of network flaps and switch upgrades, the process of re-establishing thousands of connections from the router often breaches the connection limit on the storage node. This, in turn, causes errors and the router to fail to communicate with the storage nodes.
Performance: When using the HTTP/1.1 architecture, routers maintain a limited pool of connections to each storage node within the cluster. In some larger clusters, the wait time to acquire a connection can be as high as 15ms at the 95th percentile due to the limited pool. This delay can significantly affect the system’s response time.
We determined that all of the above limitations could be resolved by transitioning to HTTP/2, as it supports connection multiplexing and requires a significantly lower number of connections between the router and the storage node.
We explored various technologies for the HTTP/2 implementation, but due to the strong support from the open-source community and our familiarity with the framework, we went with Netty. When using Netty out of the box, the HTTP/2 implementation's throughput was 45% less than the original (HTTP/1.1) implementation. Because the out-of-the-box performance was so poor, we had to implement different optimizations to enhance performance.
The experiment was run on a production-like test cluster, and the traffic was a combination of access patterns that included read and write traffic. The results are as follows:
| Protocol | QPS | Single Read Latency (P99) | Multi-Read Latency (P99) |
| --- | --- | --- | --- |
| HTTP/1.1 | 9K | 7ms | 25ms |
| HTTP/2 | 5K (-45%) | 11ms (+57%) | 42ms (+68%) |
On the routing layer, further analysis using flame graphs revealed the major differences between the two protocols, shown in the following table.
| CPU overhead | HTTP/1.1 | HTTP/2 |
| --- | --- | --- |
| Acquiring a connection and processing the request | 20% | 32% (+60%) |
| Encode/decode HTTP request | 18% | 32% (+77%) |
Improvements to Request/Response Handling
Reusing the Stream Channel Pipeline
One of the core concepts of Netty is its ChannelPipeline. As seen in Figure 2, when data is received from the socket, it is passed through the pipeline, which processes it. A ChannelPipeline contains a list of handlers, each working on a specific task.
Figure 2. Netty Pipeline
In the original HTTP/1.1 Netty pipeline, a set of 15-20 handlers was established when a connection was made, and this pipeline was reused for all subsequent requests served on the same connection.
However, in Netty’s default HTTP/2 implementation, a fresh pipeline is generated for each new stream or request. For instance, a multi-get request to a router with over 100 keys can often result in approximately 30 to 35 requests being sent to the storage node, so the router must initiate new pipelines for all 35 storage node requests. Creating and dismantling pipelines with a considerable number of handlers for every request turned out to be notably resource-intensive in terms of memory utilization and garbage collection.
To address this concern, we developed a forked version of Netty’s Http2MultiplexHandler that maintains a queue of local stream channels. As illustrated in Figure 3, on receiving a new request, the multiplex handler no longer generates a new pipeline; instead, it retrieves a local channel from the queue and uses it to process the request. After the request completes, the channel is returned to the queue for future use. By reusing existing channels, the creation and destruction of pipelines are minimized, reducing memory strain and garbage collection.
Figure 3. Sequence diagram of stream channel reuse
Addressing uneven work distribution among Netty I/O threads
When a new connection is created, Netty assigns this connection to one of the 64 I/O threads. In Espresso, the number of I/O threads is equal to twice the number of cores present. The I/O thread associated with the connection is responsible for I/O and handling the request/response on the connection. Netty’s default implementation employs a rudimentary method for selecting an appropriate I/O thread out of the 64 available for a new channel. Our observation revealed that this approach leads to a significantly uneven distribution of workload among the I/O threads.
In a standard deployment, we observed that 20% of I/O threads were managing 50% of all connections/requests. To address this issue, we introduced a BalancedEventLoopGroup. This entity is designed to evenly distribute connections across all available worker threads. During channel registration, the BalancedEventLoopGroup iterates through the worker threads to ensure a more equitable allocation of workload.
After this change, when a channel is registered, an event loop whose number of connections is below the average is selected:
```java
private EventLoop selectLoop() {
  int average = averageChannelsPerEventLoop();
  EventLoop loop = next();
  if (_eventLoopCount > 1 && isUnbalanced(loop, average)) {
    // The default choice is overloaded: scan the event loops in random order
    // until one with a below-average connection count is found (or the list is exhausted).
    ArrayList<EventLoop> list = new ArrayList<>(_eventLoopCount);
    _eventLoopGroup.forEach(eventExecutor -> list.add((EventLoop) eventExecutor));
    Collections.shuffle(list, ThreadLocalRandom.current());
    Iterator<EventLoop> it = list.iterator();
    do {
      loop = it.next();
    } while (it.hasNext() && isUnbalanced(loop, average));
  }
  return loop;
}
```
Reducing context switches when acquiring a connection
In the HTTP/2 implementation, each router maintains 10 connections to every storage node. These connections serve as communication pathways for the router I/O threads interfacing with the storage node. Previously, we utilized Netty’s FixedChannelPool implementation to oversee connection pools, handling tasks like acquiring, releasing, and establishing new connections.
However, the underlying queue within Netty’s implementation is not inherently thread-safe. To obtain a connection from the pool, the requesting worker thread must engage the I/O worker overseeing the pool. This process led to two context switches. To resolve this, we developed a derivative of the Netty pool implementation that employs a high-performance, thread-safe queue. Now, the task is executed by the requesting thread instead of a distinct I/O thread, effectively eliminating the need for context switches.
Improvements to SSL Performance
The following section describes various optimizations to improve the SSL performance.
Offloading DNS lookup and handshake to separate thread pool
During an SSL handshake, the DNS lookup procedure for resolving a hostname to an IP address functions as a blocking operation. Consequently, the I/O thread responsible for executing the handshake might be held up for the entirety of the DNS lookup process. This delay can result in request timeouts and other issues, especially when managing a substantial influx of incoming connections concurrently.
To tackle this concern, we developed an SSL initializer that conducts the DNS lookup on a different thread prior to initiating the handshake. This approach passes the InetAddress, which contains both the IP address and hostname, to the SSL handshake procedure, effectively circumventing the need for a DNS lookup during the handshake.
Enabling Native SSL encryption/decryption
Java’s default built-in SSL implementation carries a significant performance overhead. Netty offers a JNI-based SSL engine that demonstrates exceptional efficiency in both CPU and memory utilization. Upon enabling OpenSSL within the storage layer, we observed a notable 10% reduction in latency. (The router layer already utilizes OpenSSL.)
To employ Netty Native SSL, one must include the pertinent Netty Native dependencies, as it interfaces with OpenSSL through the JNI (Java Native Interface). For more detailed information, please refer to https://netty.io/wiki/forked-tomcat-native.html.
Improvements to Encode/Decode performance
This section focuses on the performance improvements we made when converting bytes to HTTP objects and vice versa. Approximately 20% of our CPU cycles were spent encoding and decoding bytes. Unlike a typical service, Espresso has very rich headers. Our HTTP/2 implementation wraps the existing HTTP/1.1 pipeline with HTTP/2 functionality: while the HTTP/2 layer handles network communication, the core business logic resides within the HTTP/1.1 layer. Because of this, each incoming request required converting HTTP/2 requests to HTTP/1.1 and vice versa, which resulted in high CPU usage, memory consumption, and garbage creation.
To improve performance, we implemented a custom codec designed for efficient handling of HTTP headers. We introduced a new request class named Http1Request. This class encapsulates an HTTP/2 request as an HTTP/1.1 request by wrapping the Http2Headers. The primary objective behind this approach is to avoid the expensive conversion of HTTP/1.1 headers to HTTP/2 and vice versa.
For example:
```java
public class Http1Headers extends HttpHeaders {
  private final Http2Headers _headers;
  // ...
}
```
Operations such as get, set, and contains operate on the underlying Http2Headers:

```java
@Override
public String get(String name) {
  return str(_headers.get(AsciiString.cached(name).toLowerCase()));
}
```
To make this possible, we developed a new codec that is essentially a clone of Netty’s Http2StreamFrameToHttpObjectCodec. This codec is designed to translate HTTP/2 StreamFrames to HTTP/1.1 requests/responses with minimal overhead. By using this new codec, we were able to significantly improve the performance of encode/decode operations and reduce the amount of garbage generated during the conversions.
Disabling HPACK Header Compression
HTTP/2 introduced a new header compression algorithm known as HPACK. It works by maintaining an index list or dictionaries on both the client and server. Instead of transmitting the complete string value, HPACK sends the associated index (integer) when transmitting a header. HPACK encompasses two key components:
- Static Table – A dictionary comprising 61 commonly used headers.
- Dynamic Table – This table retains the user-generated header information.
HPACK header compression is tailored to scenarios where header contents remain relatively constant, but Espresso has very rich headers with stateful information such as timestamps, SCN, and so on. Unfortunately, HPACK didn’t align well with Espresso’s requirements.
Upon examining flame graphs, we observed a substantial stack dedicated to encoding/decoding dynamic tables. Consequently, we opted to disable dynamic header compression, leading to an approximate 3% enhancement in performance.
In Netty, this can be disabled using the following:
```java
Http2FrameCodecBuilder.forClient()
    .initialSettings(Http2Settings.defaultSettings().headerTableSize(0));
```
Results
Latency Improvements
| P99.9 Latency | HTTP/1.1 | HTTP/2 |
| --- | --- | --- |
| Single Key Get | 20ms | 7ms (-66%) |
| Multi Key Get | 80ms | 20ms (-75%) |
We observed a 75% reduction in 99th and 99.9th percentile multi-read and read latencies, decreasing from 80ms to 20ms.
Figure 4. Latency reduction after HTTP/2
We observed similar latency reductions across the 90th percentile and higher.
Reduction in TCP connections
| | HTTP/1.1 | HTTP/2 |
| --- | --- | --- |
| Number of TCP connections | 32 million | 3.9 million (-88%) |
We observed an 88% reduction in the number of connections required between routers and storage nodes in some of our largest clusters.
Figure 5. Total number of connections after HTTP/2
Reduction in Garbage Collection time
We observed a 75% reduction in garbage collection times for both young and old gen.
| GC | HTTP/1.1 | HTTP/2 |
| --- | --- | --- |
| Young Gen | 2000 ms | 500 ms (-75%) |
| Old Gen | 80 ms | 15 ms (-81%) |
Figure 6. Reduction in time for GC after HTTP/2
Waiting time to acquire a Storage Node connection
HTTP/2 eliminates the need to wait for a storage node connection by enabling multiplexing on a single TCP connection, which is a significant factor in reducing latency compared to HTTP/1.1.
| | HTTP/1.1 | HTTP/2 |
| --- | --- | --- |
| Wait time in router to get a storage node connection | 11ms | 0.02ms (-99%) |
Figure 7. Reduction in wait time to get a connection after HTTP/2
Conclusion
Espresso has a large server fleet and is mission-critical to a number of LinkedIn applications. With HTTP/2 migration, we successfully solved Espresso’s scalability problems due to the huge number of TCP connections required between the router and the storage nodes. The new architecture also reduced the latencies by 75% and made Espresso more resilient.
Acknowledgments
I would like to thank my colleagues Antony Curtis, Yaoming Zhan, BinBing Hou, Wenqing Ding, Andy Mao, and Rahul Mehrotra who worked on this project. The project demanded a great deal of time and effort due to the complexity involved in optimizing the performance. I would like to thank Kamlakar Singh and Yun Sun for reviewing the blog and providing valuable feedback.
We would also like to thank our management Madhur Badal, Alok Dhariwal and Gayatri Penumetsa for their support and resources, which played a crucial role in the success of this project. Their encouragement and guidance helped the team overcome challenges and deliver the project on time.