Overcoming challenges with Linux cgroups memory accounting

Introduction

LinkedIn’s de facto search solution, Galene, is a Search-as-a-Service infrastructure that powers a multitude of search products at LinkedIn, from member-facing searches (such as searching for jobs or other members) to internal index searches. Galene’s responsiveness and reliability are paramount as it caters to many critical features.

This post discusses debugging an issue where hosts ran out of memory and became inaccessible, even though the applications running on them were limited by cgroups. We’ll cover memory accounting in cgroups and how it is not always straightforward when there are multiple variables at play. We will also discuss a case where cgroups may not account for memory as we expect, which can be disastrous for co-hosted applications or the host itself.

This issue arose from one of the services in the search stack, the searcher-app, which is responsible for querying search indexes. The indexes are stored as flat files in a binary format specific to Galene and loaded into the searcher-app’s memory using mmap() calls. The application also uses the mlockall() call to keep the files in memory and disable paging, as paging can cause extremely high tail latencies. When mlockall() is not used, the Linux kernel can swap out pages that are part of the index and not frequently accessed; a query touching one of those sections then requires disk access, which increases latency. Searcher applications, like a number of other apps, are hosted on containers and use memory and CPU cgroups to limit resources used by an application or system process on the host.
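To make this concrete, here is a minimal C sketch of the load-and-pin pattern described above; the index path is hypothetical and error handling is kept to a minimum, so this illustrates the calls rather than Galene’s actual code:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *index_path = "/data/index/base.idx";  /* hypothetical path */

    int fd = open(index_path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the flat index file into the address space; pages are faulted
     * in from disk on demand as the index is read. */
    void *idx = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (idx == MAP_FAILED) { perror("mmap"); return 1; }

    /* Pin all current and future mappings in RAM so index pages are
     * never swapped out (needs CAP_IPC_LOCK or a high RLIMIT_MEMLOCK). */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) { perror("mlockall"); return 1; }

    /* ... serve queries against idx ... */

    munmap(idx, st.st_size);
    close(fd);
    return 0;
}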

Issue 1: Low memory leads to excessive page swapping and high latency

We received an alert notification that one of our search clusters was having issues and noticed that many of our searcher-apps were down. When we tried to restart the apps, we saw that the physical host itself was not responding and needed a power cycle via the console to get any response. Two observations stood out from the debugging: before going into the “unresponsive” state, the system was under a memory crunch, and once it entered that state, no logs of any kind were generated on the host.


Fig 1: Host disk read time graph (y-axis in milliseconds)


Fig 2: Host available memory graph

We noticed that the host was running low on memory and that there was also an increase in disk read times. This observation, along with an increase in page faults, led us to realize that the pages were being swapped too often because the host was low on memory, which led to high disk writes and slowed down read times. The search application was a major contributor to the lack of memory on the host. So, we optimized the searcher-app’s memory utilization and reduced the cgroup memory limit for the app, which in turn reserved more memory for system processes and resolved the issue.


Issue 2: An unknown cause for reserving large amounts of memory, leading to unresponsive hosts

Six months later, we had the same problem on another cluster. During our debugging this time around, we uncovered something specific: the application tried to reserve a huge chunk of memory right before the system hung and pushed the host into an unreachable state. This led us to suspect Linux’s cgroup memory enforcement as the culprit. We wrote a small C program to try and reproduce the issue, running it inside a cgroup under a few different memory overallocation patterns, but in all cases the Linux OOMkiller was correctly invoked and killed the application process. We could not simulate the host-hang situation, so we had to look back at our OS metrics.
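A reproducer of this kind can be as simple as the following sketch (assuming it is started inside a memory-limited cgroup; our actual test program may have differed). Touching every allocated page forces the cgroup to be charged until its limit is hit and the OOMkiller fires:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    const size_t chunk = 64UL << 20;  /* 64 MiB per allocation */
    size_t total = 0;

    for (;;) {
        char *p = malloc(chunk);
        if (p == NULL) {
            fprintf(stderr, "malloc failed after %zu MiB\n", total >> 20);
            break;
        }
        /* Touch every page so the memory is actually charged to the
         * cgroup, not just reserved as virtual address space. */
        memset(p, 0xA5, chunk);
        total += chunk;
        printf("allocated %zu MiB\n", total >> 20);
    }
    return 0;
}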


Debugging

Once we established that the issue was a memory crunch, we began investigating the memory usage pattern on the host. Interestingly, we found that the application cgroup showed much less memory usage than expected.


Application cgroup total memory usage graph

The above graph shows memory usage of about 51GB before the node went unreachable. The red circle marks the point where it went unreachable; we will use the same point in all of our further graphs. The ideal way to calculate the entire memory usage for a cgroup is anonymous Resident Set Size (RSS) + page cache + swap used by the cgroup. Because we use mlockall(), we don’t use swap, so we don’t need to worry about that here. RSS is how much memory a process currently has in main memory (RAM). The cgroup stat file used for the following cgroup graphs only shows the anonymous part of RSS; the total RSS of a process is the sum of anonymous RSS, file RSS, and shared RSS. File RSS (which contains the mmapped files) is accounted for in the page cache, and the shared RSS size is too small to be of any significance in these calculations.
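As a sketch of that arithmetic, assuming cgroup v1 and a hypothetical cgroup path, the relevant fields can be read straight out of memory.stat (the swap field only appears when swap accounting is enabled):

#include <stdio.h>
#include <string.h>

int main(void) {
    /* hypothetical cgroup v1 path for the searcher application */
    FILE *f = fopen("/sys/fs/cgroup/memory/app/searcher/memory.stat", "r");
    if (f == NULL) { perror("fopen"); return 1; }

    char key[64];
    unsigned long long val, rss = 0, cache = 0, swap = 0;
    while (fscanf(f, "%63s %llu", key, &val) == 2) {
        if (strcmp(key, "rss") == 0)   rss = val;   /* anonymous RSS only */
        if (strcmp(key, "cache") == 0) cache = val; /* page cache, incl. mmapped files */
        if (strcmp(key, "swap") == 0)  swap = val;  /* stays 0 for us: mlockall() */
    }
    fclose(f);

    printf("total = %llu bytes\n", rss + cache + swap);
    return 0;
}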


Application cgroup RSS usage graph


Application cgroup page cache usage graph

From the previous graphs, adding up the RSS and page cache usage (19GB and 31GB) gives roughly 50GB. That’s in line with the “Application cgroup total memory usage graph” shown at the beginning of this section.


Searcher application base index size graph


Searcher application middle index size graph

From these two graphs, we can see that the base index size is 32GB and the middle index size is 12GB, which brings us to a total size of 44GB—the size of flat index files mmapped into memory. When we add the RSS value of 19GB, we get a total usage of 63GB.

So, the application is using 63GB of memory, based on the above calculation from the actual file size of the indexes and the RSS, which we verified by looking at the process on the host. This means that our cgroup is not reporting the correct amount of memory used for cache: we need 44GB of cache, but the cgroup only shows 31GB.

The current hierarchy of our cgroups is:

  • Root cgroup

    • Application parent cgroup

      • Application 1 cgroup

      • Application 2 cgroup

Now, let’s compare the application cgroup page cache usage with the parent cgroup metrics. We wanted to compare the different cgroups to identify at which level the memory was not being reported as we expected.


Parent cgroup page cache usage graph


Application cgroup page cache usage graph

The dip in cache usage by the application cgroup is due to a restart. After the restart, we see that the application cgroup is reporting the wrong numbers for the cache. We expect around 44GB of cache, but the application cgroup only shows around 10GB just after restart, while the parent cgroup still reports the right amount of cache usage.

OOMkiller will not kick in, even when the application is using more memory than allocated, because the application cgroup is not reporting the correct memory usage. This can cause the search application to hog memory on the box and other services to become starved for memory, which leads to swapping, and eventually the system becomes unreachable.

Understanding page cache accounting in cgroups

Let us first understand how memory is accounted for in cgroups:

  • RSS: This one is simple. Just add up the RSS of all the processes under that cgroup.

  • Cache: Shared pages are accounted for on a first-touch basis. This means that any page brought into memory by a process inside a cgroup is charged to that cgroup. If the page already existed in memory, the accounting gets complicated: the page stays charged to the original cgroup, and will only eventually be re-charged to a new cgroup if that cgroup keeps accessing the page aggressively (illustrated by the sketch below).
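The following C sketch shows how this first-touch behavior could be observed, assuming cgroup v1, root privileges, two pre-created memory cgroups A and B, and a hypothetical index path; it is an illustration, not our production tooling:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Move the calling process into a memory cgroup (cgroup v1). */
static void join_cgroup(const char *name) {
    char path[128];
    snprintf(path, sizeof(path), "/sys/fs/cgroup/memory/%s/tasks", name);
    FILE *f = fopen(path, "w");
    if (f) { fprintf(f, "%d\n", getpid()); fclose(f); }
}

/* Map a file and fault every page in, pulling it into the page cache. */
static void touch_file(const char *file) {
    int fd = open(file, O_RDONLY);
    if (fd < 0) return;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return; }
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { close(fd); return; }
    volatile char sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];  /* fault each page into the page cache */
    (void)sum;
    munmap(p, st.st_size);
    close(fd);
}

int main(void) {
    join_cgroup("A");
    touch_file("/data/index/base.idx"); /* first touch: pages charged to A */

    join_cgroup("B");
    touch_file("/data/index/base.idx"); /* pages already resident: B's cache stays near zero */

    /* Compare the "cache" field in A's and B's memory.stat to confirm. */
    return 0;
}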

In our stack, restarts or redeploys follow these steps:

  1. Stop application

  2. Delete application cgroup

  3. Create application cgroup

  4. Start application

In our case, we deploy new indexes and then the application’s cgroup reports the correct memory usage. Once the index grows and reaches the application cgroup memory limit, the OOMkiller is invoked and the application is killed. From there, our automation kicks in and starts the application. This leads to the existing application cgroup being deleted and a new one being created. But this time, the application cgroup’s memory accounting is wrong. This is because the pages for the index are already in memory, but the new application cgroup is not charged for them. As a result, the index keeps growing and the host faces a memory crunch, which leads to thrashing (Figures 1, 2). The OOMkiller is not invoked by the application cgroup because it reports less memory than is actually being used. Our application uses mlockall(), so its memory cannot be swapped; this leads to other critical system applications being swapped instead, and causes the host to go into an “unresponsive” state.

Validating the findings

We did a small experiment to validate our findings. We picked one host showing lower-than-expected application cgroup memory usage, stopped the application, destroyed the cgroup, and had the machine drop all of its page cache. After that, we created a new cgroup and started the application inside it.
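The cache-drop step in this experiment uses the standard kernel interface; here is a minimal sketch (requires root; the cgroup teardown and recreation around it are the usual rmdir/mkdir on the cgroup filesystem):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    sync();  /* flush dirty pages first so clean pages can be reclaimed */

    FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
    if (f == NULL) { perror("fopen"); return 1; }
    fputs("3\n", f);  /* 1 = page cache, 2 = dentries and inodes, 3 = both */
    fclose(f);
    return 0;
}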


Application cgroup page cache usage graph

The application cgroup showed the right amount of memory after the above steps. This verified that the issue was caused by a new application cgroup not charging pages to itself, even if the application inside it is the only one using those pages.

Solution

First, we wanted to set up proper monitoring to catch the growth of indexes before we ran out of memory. We used metrics emitted by the application to monitor the index size, and tracked the RSS memory used by the cgroup, to set up an alert that would let us know when a set threshold had been exceeded. This gave us enough time to mitigate the issue before we ran out of memory, but a sudden increase in memory could still happen in some cases, so we needed a failsafe to ensure that the host doesn’t go into an unresponsive state.

The total memory usage shown by the parent cgroup is still correct, as previously discussed. When the old cgroup is destroyed, the parent still retains the total memory usage numbers, which include the page cache. To ensure that the OOMkiller is invoked when the parent is breaching its limits, we are planning to put a memory limit on the parent cgroup. Doing so can cause a noisy neighbor situation, where a different co-hosted application is killed rather than the one abusing the memory, but considering that the host will go unreachable and both applications will suffer if the memory situation becomes too overloaded, this is the best current solution to the issue.
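Mechanically, the planned failsafe is just a limit written on the parent cgroup. A sketch, assuming cgroup v1 and a hypothetical parent path and limit:

#include <stdio.h>

int main(void) {
    /* hypothetical parent cgroup path */
    FILE *f = fopen("/sys/fs/cgroup/memory/app/memory.limit_in_bytes", "w");
    if (f == NULL) { perror("fopen"); return 1; }
    fprintf(f, "%llu\n", 60ULL << 30);  /* e.g. 60 GiB for the whole parent */
    fclose(f);
    return 0;
}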

While we did consider a few other solutions (listed below), we determined that they didn’t fit our needs.

  • Adding a cache flush each time a cgroup is created: this would unnecessarily affect other applications running on the host because of disk I/O using up CPU cycles.

  • Leverage tmpfs to host indexes: this would require changes on the application side and a different configuration for searcher hosts.

  • Create a parent cgroup with limits per application cgroup: this would require extensive changes from the current provisioning and deployment tooling.

After evaluating all these approaches, we decided to go with setting a cgroup limit on the parent cgroup.


Conclusion

Debugging an issue is always filled with surprises and learnings. From this issue, we realized that memory accounting in cgroups can be complicated when page cache is involved. Using mlockall() can lead to critical services being swapped out when the application starts hogging memory. But most importantly, this process was a good reminder of the importance of challenging the assumptions we make during debugging—for instance, if we had questioned cgroup’s memory reporting during the initial issue, we would have had one less issue in production. After adding monitoring to detect the issue, we figured out that there were other clusters affected by this and we could fix it before it caused any production impact.

Acknowledgments

I would like to thank Kalyan Somasundaram and Mike Svoboda for helping me during the triaging. Also, a big thanks again to Kalyan for reviewing this blog post. Finally, I would like to acknowledge the constant encouragement and support from my manager, Venu Ryali.


    Building LinkedIn’s Skills Graph to Power a Skills-First World


    Co-authors: Sofus Macskássy, Yi Pan, Ji Yan, Yanen Li, Di Zhou, Shiyong Lin

    As industries rapidly evolve, so do the skills necessary for success. Skill sets for jobs globally have changed by 25% since 2015 and this number is expected to double by 2027. Yet, we’ve long relied on insufficient and unequal signals when evaluating talent and predicting success – who you know, where you went to school, or who your last employer was. If we look at the labor market instead through the lens of skills – the skills you have and the skills a role or industry demands – we can create a transparent and fair job matching process that drives better outcomes for employers and employees. 

    This new reality requires a common understanding of skills, backed by better data. For nearly a decade, our Economic Graph has helped leaders benchmark and compare labor markets and economies across the world. A critical element of this analysis is the insight provided by LinkedIn’s Skills Graph, which creates a common language around skills to help us all better understand the skills that power the global workforce. The Skills Graph does this by dynamically mapping the relationships between 39K skills, 875M people, 59M companies, and other organizations globally. 

    It also drives relevance and matching across LinkedIn – helping learners find content more relevant to their career path; helping job seekers find jobs that are a good fit; and helping recruiters find the highest quality candidates. For example, these relationships between skills mean we can detect that “cost management” in a job seeker’s profile is relevant to a job posting that lists “project budgeting” as a required skill.

    Building the LinkedIn Skills Graph

    At the heart of our Skills Graph lies our skills taxonomy. The taxonomy is a curated list of unique skills and their intertwined relationships, each with detailed information about those skills. It’s built on a deep understanding of how skills power professional journeys, including what skills are required in a job, what skills a member has, and how members move from one position to the next. 

    Today, our taxonomy consists of over 39,000 skills spanning 26 languages, over 374,000 aliases (different ways to refer to the same skill – e.g., “data analysis” and “data analytics”), and more than 200,000 links between skills. Even more important than the volume of data, the key to unlocking the power of skills lies in the structure and relationships between the skills. To create a stronger network of connected skills in our taxonomy, we utilize a framework we call “Structured Skills.” This framework increases our understanding of every skill in our database by mapping its relationships to the other skills around it, and creates richer, more accurate skill-driven experiences for our members and customers. For example:

    • If a member knows about Artificial Neural Networks, the member knows something about Deep Learning, which means the member knows something about Machine Learning.

    • If a job requires Supply Chain Engineering, having a skill in Supply Chain Management or Industrial Engineering is definitely also relevant.

    Creating meaningful and accurate relationships between skill sets is critical to getting the most out of our Structured Skills. To do this, our machine learning and artificial intelligence systems comb through massive amounts of data and suggest new skills and relations between them. As our Skills Graph continues to grow and learn with AI, we are committed to maintaining the high quality of the data and connections found in our taxonomy. We do this with the help of trained taxonomists on our team, who manually review our skills data and ensure that we can verify its integrity and relevancy.

    Structured skills consists of meaningful relationships between skills that empower deep reasoning to match members to relevant content such as jobs, learning material, and feed posts

    But building the taxonomy and Structured Skills is meaningless without connecting them to the jobs and members on our platform. Together, the Structured Skills and the mapping to our members and jobs make up our Skills Graph, and both are needed to unlock the full potential of a skill-based job market.


    Structured skills enrich the set of skills for both members and jobs to ensure we can find all the relevant jobs for a member. We show the skill overlap so that members can see which of their skills are a match and also potential skill gaps that they might want to address for their own career growth

    Leveraging Machine Learning to map skills to members and jobs

    Although millions of LinkedIn members have added skills to their profile, many have not added their most relevant skills to their skills sections or kept their skills section up to date. Instead, they list relevant skills in their summary sections, within the job experience descriptions in their profiles, or on the resumes they submit. On the other hand, many jobs on LinkedIn don’t comprehensively describe what skills are needed. Many listings are also not submitted directly by a recruiter through an online job posting, but are instead ingested from our customers’ websites. In these scenarios where skills are not explicitly provided, it’s critical to pull skills data from the job descriptions, summaries, and more, to create a tool that drives reliable insights.

    As you can imagine, this process requires processing a lot of text. So, we have built machine learning models that leverage natural language understanding, deep learning, and information extraction technologies. To help train these models, our human labelers use AI to connect text found across jobs, profiles, and learning courses, to specific skills in our taxonomy. Our system then learns to recognize different ways to refer to the same type of skill. Combined with natural language processing, we extract skills from many different types of text – with a high degree of confidence – to make sure we have high coverage and high precision when we map skills to our members and job posts.


    We also leverage various clustering and machine learning algorithms to identify the core skills related to a given job or function. We do this by applying these tools to all member histories and all job descriptions on our platform to identify the skills that are likely associated with a job post or member job experience. These techniques, together with Structured Skills, create a holistic picture of the skills a member has and the skills needed to do a job.

    When hirers create a job post on the LinkedIn platform, we use machine learning and Structured Skills to suggest explicit skills that we can tag the post with to increase discoverability

    These models are designed to continuously improve and learn over time based on engagement from members on the LinkedIn platform, job seekers, hirers, and learners. For example, when a hirer posts a new job on our platform and the hirer types in the job description, our machine learning model automatically suggests the skills that are associated with that job posting. The hirer can refine the selection of skills that best represent the qualification of this job by removing and adding these suggested skills manually.


    Looking forward

    Beyond streamlining the hiring process, understanding members’ skills allows us to surface more relevant posts in their feed, suggest people they should connect with, and companies to follow. It also helps sales and marketing professionals on LinkedIn be more effective by using skills for ads targeting, and provides insights to our sales and marketing customers by sharing details on the skill sets of those who engage with their content. As our Skills Graph continues to evolve in parallel with the global workforce, it will only become smarter and deliver better outcomes for hirers, learners, job seekers, customers, and members.

    Realizing a more equitable and efficient future of work will rely on building a deeper understanding of peoples’ abilities and potential. To keep up, some companies are already utilizing skills to identify qualified candidates – more than 40% of hirers on LinkedIn explicitly use skills data to fill their roles. 

    As our CEO Ryan Roslansky stated at LinkedIn’s Talent Connect event this year, “We can build a world where everyone has access to opportunity not because of where they were born, who they know, or where they went to school, but because of their actual skills and ability.” Our Skills Graph will continue to be a critical part of how we help make a skills-based labor market a reality. We’re excited to share updates as our work continues on this journey.



    TopicGC: How LinkedIn cleans up unused metadata for its Kafka clusters


    Introduction

    Apache Kafka is an open-sourced event streaming platform where users can create Kafka topics as data transmission units, and then publish or subscribe to the topic with producers and consumers. While most Kafka topics are actively used, some are not needed anymore because business needs changed or the topics themselves are ephemeral. Kafka itself doesn’t have a mechanism to automatically detect unused topics and delete them. This is usually not a big concern, since a Kafka cluster can hold a considerable number of topics, from hundreds to thousands. However, if the topic count keeps growing, it will eventually hit a bottleneck and have disruptive effects on the entire Kafka cluster. The TopicGC service was born to solve this exact problem. It was proven to reduce Kafka pressure by deleting ~20% of topics, and improved Kafka’s produce and consume performance by at least 30%.

    Motivation

    As the first step, we need to understand how unused topics can cause pressure on Kafka. As with many other storage systems, all Kafka topics have a retention period, meaning that for any unused topics, the data will be purged after a period of time and the topic will become empty. A common question here is, “How could empty topics affect Kafka?”

    Metadata pressure

    For topic management purposes, Kafka stores the metadata of topics in multiple places, including Apache ZooKeeper and a metadata cache on every single broker. Topic metadata contains information of partition and replica assignments. 

    Let’s do a simple calculation here: topic A can have 25 partitions, with a replication factor of three, meaning each partition has three replicas. Even if topic A is not used anymore, Kafka still needs to store the location info of all 75 replicas somewhere.


    The effect of metadata pressure may not be that obvious for a single topic, but it can make a big difference if there are a lot of topics. The metadata can consume memory from Kafka brokers and ZooKeeper nodes, and can add payload to metadata requests. 

    Fetch requests

    In Kafka, the follower replicas periodically send fetch requests to the leader replicas to keep sync with the leader. Even for empty topics and partitions, the followers still try to sync with the leaders. Because Kafka does not know whether a topic is permanently unused, it always forces the followers to fetch from the leaders. These redundant fetch requests will further lead to more fetch threads being created, which can cause extra network, CPU, and memory utilization, and can dominate the request queues, causing other requests to be delayed or even dropped.


    Controller initialization

    The Kafka controller is a broker that coordinates and manages other brokers in a Kafka cluster. Many Kafka requests have to be handled by the controller, so controller availability is crucial to Kafka.


    On controller failover, a new controller has to be elected and take over the role of managing the cluster. The new controller takes some time to load the metadata of the entire cluster from ZooKeeper before it can act as the controller; this is called the controller initialization time. As mentioned earlier in this post, unused topics can generate extra metadata that makes the controller initialization slower and threatens Kafka availability. Issues can arise when the ZooKeeper response is larger than 1MB. For one of our largest clusters, the ZooKeeper response has already reached 0.75MB, and we anticipate that within two to three years it will hit a bottleneck.

    Service design

    While designing TopicGC, we kept in mind a number of requirements. Functionally, we determined that the system must set criteria to determine whether a topic should be deleted, constantly run the garbage collector (GC) process to remove the unused topics, and notify the user before topic deletion.

    Additionally, we identified non-functional requirements for the system. The requirements include ensuring no data loss during topic deletion, removal of all dependencies from unused topics before deletion, and the ability to recover the topic states from service failures.

    To satisfy those requirements, we designed TopicGC based on a state machine model, which we will discuss in more detail in the following sections.

    Topic state machine


    To achieve all of the functional requirements, TopicGC internally runs a state machine. Each topic instance is associated with a state, and several background jobs periodically run and transition the topic states as needed. Table 1 describes all possible states in TopicGC.

    Table 1: Topic states and descriptions
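    As a hypothetical reconstruction, the state machine can be thought of as an enum plus a transition function. Only the USED, UNUSED, and INCOMPLETE states are named in this post, so the intermediate states in this C sketch are illustrative stand-ins for the workflow steps that follow:

    #include <stdbool.h>

    typedef enum {
        TOPIC_USED,          /* topic has recent traffic or is newly created */
        TOPIC_UNUSED,        /* met the "unused" criteria; owner notified */
        TOPIC_WRITE_BLOCKED, /* illustrative: write access sealed */
        TOPIC_MIRROR_OFF,    /* illustrative: Brooklin mirroring disabled */
        TOPIC_DELETED,       /* illustrative: removed via the admin client */
        TOPIC_INCOMPLETE     /* usage detected mid-deletion; recover to USED */
    } topic_state;

    /* Periodic background jobs drive transitions like this one. */
    topic_state next_state(topic_state s, bool usage_detected) {
        if (usage_detected && s != TOPIC_USED)
            return TOPIC_INCOMPLETE;        /* abort and recover the topic */
        switch (s) {
        case TOPIC_UNUSED:        return TOPIC_WRITE_BLOCKED;
        case TOPIC_WRITE_BLOCKED: return TOPIC_MIRROR_OFF;
        case TOPIC_MIRROR_OFF:    return TOPIC_DELETED;
        case TOPIC_INCOMPLETE:    return TOPIC_USED;
        default:                  return s;
        }
    }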


    TopicGC workflow

    With the help of internal states, TopicGC follows a certain workflow to delete unused topics.


    Figure 1: TopicGC state machine

    Detect topic usage


    TopicGC has a background job to find unused topics. Internally, we use the following criteria to determine whether a topic is unused:

    • The topic is empty
    • There is no BytesIn/BytesOut
    • There is no READ/WRITE access event in the past 60 days
    • The topic is not newly created in the past 60 days 

    The TopicGC service fetches the above information from ZooKeeper and a variety of internal data sources, such as our metrics reporting system.
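    In code, the check reduces to a simple predicate; the following sketch assumes a struct populated from those data sources, with hypothetical field names:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical snapshot of per-topic data pulled from ZooKeeper and
     * the metrics system; field names are illustrative. */
    typedef struct {
        bool     is_empty;            /* no data left on any partition */
        uint64_t bytes_in, bytes_out; /* traffic over the observation window */
        int      days_since_access;   /* last READ/WRITE access event */
        int      days_since_creation;
    } topic_info;

    bool is_unused(const topic_info *t) {
        return t->is_empty
            && t->bytes_in == 0 && t->bytes_out == 0
            && t->days_since_access   >= 60
            && t->days_since_creation >= 60;
    }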


    Send email notification

    If a topic is in the UNUSED state, TopicGC will trigger the email sending service to find the LDAP user info of the topic owner and send email notifications. This is important because we don’t know whether the topic is temporarily idle or permanently unused. In the former case, once the topic owner receives the email, they can take actions to prevent the topic from being deleted.

    Block write access

    This is the most important step in the TopicGC workflow. Consider a case where a user produces some data at the last second before topic deletion; that data would be lost when the topic is deleted. Thus, avoiding data loss is a crucial challenge for TopicGC. To ensure the TopicGC service doesn’t delete topics that have last-minute writes, we introduced a block-write-access step before the topic deletion. After write access is blocked on the topic, there is no chance that TopicGC can cause data loss.


    Notice that Kafka doesn’t have a mechanism to “seal” a topic. Here we leverage LinkedIn’s internal way of blocking topic access. At LinkedIn, we have internal access services that allow us to control access to all data resources, including Kafka topics. To seal a topic, TopicGC sends a request to the access service to block any read and write access to the topic.

    Disable mirroring

    The data of a topic can be mirrored to other clusters via Brooklin. Brooklin is open-sourced by LinkedIn as a framework to stream data between various heterogeneous source and destination systems with high reliability and throughput at scale. Before deleting the topic, we need to disable Brooklin mirroring of the topic. Brooklin can be regarded as a wildcard consumer for all Kafka topics; if a topic is deleted without informing Brooklin, Brooklin will throw exceptions about consuming from non-existent topics. For the same reason, if there are any other services that consume from all topics, TopicGC should tell those services to stop consuming from the garbage topics before topic deletion.

    Delete topics

    Once all preparations are done, the TopicGC service will trigger the topic deletion by calling the Kafka admin client. The topic deletion process can be customized; in our case, we delete topics in batches. Because topic deletion can introduce extra load to Kafka clusters, we cap the number of concurrent topic deletions at three.
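    A sketch of that batching logic, with a hypothetical stand-in for the admin-client call and the completion wait elided:

    #include <stdio.h>

    #define MAX_CONCURRENT_DELETES 3

    /* Hypothetical stand-in for the Kafka admin client deletion call. */
    static void delete_topic(const char *name) {
        printf("deleting topic %s\n", name);
    }

    static void delete_in_batches(const char **topics, int n) {
        for (int i = 0; i < n; i += MAX_CONCURRENT_DELETES) {
            /* issue at most three deletions per batch */
            for (int j = i; j < n && j < i + MAX_CONCURRENT_DELETES; j++)
                delete_topic(topics[j]);
            /* ... wait for this batch to complete before the next ... */
        }
    }

    int main(void) {
        const char *garbage[] = {"t1", "t2", "t3", "t4", "t5"};
        delete_in_batches(garbage, 5);
        return 0;
    }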


    Last minute usage check


    Before any of the actual changes are made to the topic (including blocking write access, disabling mirroring, and topic deletion), we run a last-minute usage check for the topic. This adds an extra layer of security to prevent data loss. If TopicGC detects usage at any point during the deletion process, it will mark the topic as INCOMPLETE and start recovering the topic back to the USED state.

    Impact of TopicGC

    We launched TopicGC in one of our largest data pipelines, and were able to reduce the topic count by nearly 20%. In the graph, each color represents a distinct Kafka cluster in the pipeline.


    Figure 2: Total topic count during TopicGC

    Improvement on CPU usage

    The topic deletion helps to reduce the total fetch requests in the Kafka clusters and as a result, the CPU usage drops significantly after the unused topics are deleted. The total Kafka CPU usage had about a 30% reduction.


    Figure 3: CPU usage improvement by TopicGC

    Improvement on client request performance

    Due to the CPU usage reduction, Kafka brokers are able to handle requests more efficiently. As a result, Kafka’s request handling performance improved, and request latencies dropped by up to 40%. Figure 4 shows the decrease in latency for metadata requests.


    Figure 4: Kafka request performance improvement by TopicGC

    Conclusion

    After we launched TopicGC to delete unused topics for Kafka, it has deleted nearly 20% of topics, and significantly reduced the metadata pressure of our Kafka clusters. From our metrics, the client request performance is improved around 40% and CPU usage is reduced by up to 30%. 

    Future plans

    As TopicGC has shown its ability to clean up Kafka clusters and improve Kafka performance, we have decided to launch the service for all of our internal Kafka clusters. We hope that TopicGC can help LinkedIn use Kafka resources more effectively.

    Acknowledgements

    Many thanks to Joseph Lin and Lincong Li for coming up with the idea of TopicGC and implementing the original design. We are also grateful for our managers Rohit Rakshe and Adem Efe Gencer, who provided significant support for this project. Last but not least, we want to shout out to the Kafka SRE team and Brooklin SRE team for acting as helpful partners. With their help, we smoothly launched TopicGC and were able to see these exciting results.



    Render Models at LinkedIn


    Co-Authors: Mahesh Vishwanath, Eric Babyak, Sonali Bhadra, Umair Saeed

    Introduction

    We use render models to pass data to our client applications, describing both the content (text, images, buttons, etc.) and the layout to display on the screen. This means most of this logic is moved out of the clients and centralized on the server. This enables us to deliver new features faster to our members and customers while keeping the experience consistent and being responsive to change.

    Overview

    Traditionally, many of our API models tend to be centered around the raw data that’s needed for clients to render a view, which we refer to as data modeling. With this approach, clients own the business logic that transforms the data into a view model to display. Often this business logic layer can grow quite complex over time as more features and use cases need to be supported.

    This is where render models come into the picture. A render model is an API modeling strategy where the server returns data that describes the view that will be rendered. Other commonly used terms that describe the same technique are Server Driven User Interface (SDUI), or View Models. With render models, the client business logic tends to be much thinner, because the logic that transforms raw data into view models now resides in the API layer. For any given render model, the client should have a single, shared function that is responsible for generating the UI representation of the render model.


    Architectural comparison between data modeling and render modeling

    Example

    To highlight the core differences in modeling strategy between a render model and data model, let’s walk through a quick example of how we can model the same UI with these two strategies. In the following UI, we want to show a list of entities that contain some companies, groups, and profiles.


    An example UI of an ‘interests’ card to display to members

    Following the data model approach, we would look at the list as a mix of different entity types (members, companies, groups, etc.) and design a model so that each entity type would contain the necessary information for clients to be able to transform the data into the view shown in the design.

    record FollowableEntity {
      /**
       * Each model in the union below contains data that is related
       * to the entity it represents.
       */
      entity: union[
        Profile,
        Company,
        Group
      ]
    }

    record Profile {
      // Details for a Profile.
      …
    }

    record Company {
      // Details for a Company.
      …
    }

    record Group {
      // Details for a Group.
      …
    }

    When applying a render model approach, rather than worry about the different entity types we want to support for this feature, we look at the different UI elements that are needed in the designs.


    An ‘interests’ card categorized by UI elements

    In this case, we have one image, one title text, and two other smaller subtexts. A render model represents these fields directly.

    record FollowableEntity {
      /**
       * An image to represent the logo for each element
       * e.g. the Microsoft logo.
       */
      image: Image

      /**
       * Text to represent the main bold text
       * e.g. ‘Microsoft’
       */
      titleText: Text

      /**
       * Text to represent the small sub text that displays a statistic
       * about the entity this element represents.
       * e.g. ‘10,975,744 followers’
       */
      statisticText: Text

      /**
       * Optional text to provide more information about the entity.
       * Empty in the first element case, ‘CEO of Microsoft’ in the 2nd one.
       */
      caption: optional Text
    }

    With the above modeling, the client layer remains very thin as it simply displays each image/text returned from the API. The clients are unaware of which underlying entity each element represents, as the server is responsible for transforming the data into displayable content.

    API design with render models

    API modeling with render models can live on a spectrum between the two extremes of frontend modeling strategies: pure data models and pure view models. With pure data models, different types of content use different models, even if they look the same on the UI. Clients know exactly what entity they are displaying and most of the business logic is on the clients, so complex product UX can be implemented as needed. Pure view models are heavily templated and clients have no context on what they are actually displaying, with almost all business logic on the API. In practice, we have moved away from using pure view models because the lack of context on the clients’ end makes it difficult to support complex functionality, such as client animations and client-side consistency support.


    Typically, when we use render models, our models have both view model and data model aspects. We prefer to use view modeling most of the time to abstract away most of the view logic on the API and to keep the view layer on the client as thin as possible. We can mix in data models as needed, to support the cases where we need specific context about the data being displayed.


    A spectrum of modeling strategies between pure view models and pure data models

    To see this concretely, let’s continue our previous example of a FollowableEntity. The member can tap on an entity to begin following the profile, company, or group. As a slightly contrived example, imagine that we perform different client-side actions based on the type of the entity. In such a scenario, the clients need to know the type of the entity, and at first blush it might appear that the render models approach isn’t feasible. However, we can combine these approaches to get the best of both worlds. We can continue to use a render model to display all the client data but embed the data model inside the render model to provide context for making the follow request.

    record FollowableEntity {
      /**
       * An image to represent the logo for each element
       * e.g. the Microsoft logo.
       */
      image: Image

      /**
       * Text to represent the main bold text
       * e.g. ‘Microsoft’
       */
      titleText: Text

      /**
       * Text to represent the small sub text that displays a statistic
       * about the entity this element represents.
       * e.g. ‘10,975,744 followers’
       */
      statisticText: Text

      /**
       * Optional text to provide more information about the entity.
       * Empty in the first element case, ‘CEO of Microsoft’ in the 2nd one.
       */
      caption: optional Text

      /**
       * An embedded data model that provides context for interacting
       * with this entity.
       */
      entity: union[
        Profile,
        Company,
        Group
      ]
    }

    Client theming, layout, and accessibility

    Clients have the most context about how information will be displayed to users. Understanding the dynamics of client-side control over the UX is an important consideration when we build render models. This is particularly important because clients can alter display settings like theme, layout, screen size, and dynamic font size without requesting new render models from the server.

    Properties like colors, local image references, borders, or corner radius are sent using semantic tokens (e.g., color-action instead of blue) from our render models. Our clients maintain a mapping from these semantic tokens to concrete values based on the design language for the specific feature on a given platform (e.g. iOS, Android, etc.). Referencing theme properties with semantic tokens enables our client applications to maintain dynamic control over the theme.
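    As an illustration of such a mapping (the token names and hex values here are hypothetical, and real clients implement this in their own platform languages), a semantic token table might look like the following C sketch:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical client-side table from semantic tokens to concrete
     * values; real clients keep one mapping per platform and theme. */
    struct token_mapping { const char *token; const char *light; const char *dark; };

    static const struct token_mapping THEME[] = {
        { "color-action",     "#0A66C2", "#71B7FB" },  /* illustrative values */
        { "color-background", "#FFFFFF", "#1D2226" },
    };

    static const char *resolve(const char *token, int dark_mode) {
        for (size_t i = 0; i < sizeof(THEME) / sizeof(THEME[0]); i++)
            if (strcmp(THEME[i].token, token) == 0)
                return dark_mode ? THEME[i].dark : THEME[i].light;
        return NULL;  /* unknown token: caller falls back to a default */
    }

    int main(void) {
        printf("color-action resolves to %s in dark mode\n", resolve("color-action", 1));
        return 0;
    }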

    For the layout, our render models are not intended to dictate the exact layout of the UI because they are not aware of the total available screen space. Instead, the models describe the order, context, and priorities for views, allowing client utilities to ultimately determine how the components should be placed based on available space (screen size and orientation). One way we accomplish this is by referring to the sizes of views by terms like “small” or “large” and allowing clients to apply what that sizing means based on the context and screen size.


    It is critical that we maintain the same level of accessibility when our UIs are driven by render models. To do so, we provide accessibility text where necessary in our models, map our render models to components that have accessibility concerns baked in (minimum tap targets), and use semantics instead of specific values when describing sizes, layouts, etc.


    Write use cases

    One of the most challenging aspects of render models is dealing with write use cases, like filling forms and taking actions on the app (such as following a company, connecting with a person, sending a message, etc.). These use cases need specific data to be written to backends and cannot be modeled in a completely generic way, making it hard to use render models.

    Actions are modeled by sending the current state of the action and its other possible states from the server to the clients. This tells the clients exactly what to display. In addition, it allows them to maintain any custom logic to implement a complex UI or perform state-changing follow-up actions.

    To support forms, we created a standardized library to read and write forms, with full client infrastructure support out of the box. Similar to how traditional read-based render models attempt to leverage generic fields and models to represent different forms of data, our standardized forms library leverages form components as its backbone to generically represent data in a form by the type of UI element it represents (such as a ‘single line component’ or a ‘toggle component’).


    Render models in practice

    As we have mentioned above, the consistency of your UI is an important factor when leveraging render models. LinkedIn is built on a semantics-based design system that includes foundations like color and text, as well as shared components such as buttons and labels. Similarly, we have created layers of common UX render models in our API that include foundational and component models, which are built on top of those foundations.

    Our foundational models include rich representations of text and images and are backed by client infrastructure that renders these models consistently across LinkedIn. Representing rich text through a common model and render utilities enables us to provide a consistent member experience and maintain our accessibility standards (for instance, we can restrict the usage of underlining in text that is not a link). Our image model and processing ensures that we use the correct placeholders and failure images based on what the actual image being fetched presents (e.g., a member profile). These capabilities of the foundational models are available without any client consumer knowledge of what the actual text or image represents and this information is all encapsulated by the server-driven model and shared client render utilities.

    The foundational models can be used on their own or through component models that are built on top of the foundations. They foster re-use and improve our development velocity by providing a common model and shared infrastructure that resolves the component. One example is our common insight model, which combines an image with some insightful text.


    A commonly used ‘insight’ model used throughout the site

    Over the years, many teams at LinkedIn have taken on large initiatives to re-architect their pages based on render model concepts built on top of these foundational models. No two use cases are exactly alike, but a few of the major use cases include:

    • The profile page, which is built using a set of render model-based components stitched together to compose the page. For more details on this architecture, see this blog post published earlier this year.

    • The search results page, built using multiple card render model templates to display different types of search results in a consistent manner. See this blog post for more details.

    • The main feed, built centered around the consistent rendering of one update with optional components to allow for variability based on different content types.


    A feed component designed around several components

    • The notifications tab, which helped standardize 50+ notification types into one simple render model template.

    A notifications card designed using a standardized UI template

    All of these use cases have seen some of the key benefits highlighted in this post: simpler client-side logic, a consistent design feel, faster iteration, and development and experimentation velocity for new features and bugs.

    Render model tradeoffs

    Render models come with their pros and cons, so it is important to properly understand your product use case and vision before implementing them.


    Benefits

    With render models, teams are able to create leverage and control when a consistent visual experience, within a defined design boundary, is required across diverse use cases. This is enabled by centralizing logic on the server rather than duplicating logic across clients. It fosters generalized and simpler client-side implementation, with clients requiring less logic to render the user interface since most business logic lives on the server.

    Render models also decrease repeated design decisions and client-side work to onboard use cases when the use case fits an existing visual experience. It fosters generalized API schemas, thereby encouraging reuse across different features if the UI is similar to an existing feature.

    With more logic pushed to the API and a thin client-side layer, it enables faster experimentation and iteration as changes can be made by only modifying the server code without needing client-side changes on all platforms (iOS, Android, and Web). This is especially advantageous with mobile clients that might have older, but still supported versions in the wild for long periods of time.

    Similarly, as most of the business logic is on the server, it is likely that any bugs will be on the server instead of clients. Render models enable faster turnaround time to get these issues fixed and into production, as server-side fixes apply to all clients without needing to wait for a new mobile app release and for users to upgrade.


    Disadvantages

    As mentioned previously, render models rely on consistent UIs. However, if the same data backs multiple, visually-distinct UIs, it reduces the reusability of your API, because the render model needs more complexity to handle the various types of UIs. If the UI does need to change outside the framework, the client code and server code need to be updated, sometimes in invasive ways. By comparison, UI-only changes typically do not require changes to data models. For some of these reasons, the upfront costs to implement and design render models are often higher, due to the need to define the platform and its boundaries, especially on the client.

    Render models are un-opinionated about writes and occasionally require write-only models or additional work to write data. This is contrasted with data models where the same data models can be used in a CRUD format.

    Client-side tracking with render models has to be conceived at the design phase, whereas tracking with data models is more composable from the client. It can be difficult to support use case-specific custom tracking in a generic render model.

    Finally, there are some cases where client business logic is unavoidable such as in cases with complex interactions between various user interface elements. These could be animations or client-data interactions. In such scenarios, render models are likely not the best approach as, without the specific context, it becomes difficult to have any client-side business logic.


    When to use render models?

    Render models are most beneficial when building a platform that requires onboarding many use cases that have a similar UI layout. This is particularly useful when you have multiple types of backend data entities that will all render similarly on clients. Product and design teams must have stable, consistent requirements and they, along with engineering, need to have a common understanding of what kinds of flexibility they will need to support and how to do so.

    Additionally, if there are complex product requirements that need involved client-side logic, this may be a good opportunity to push some of the logic to the API. For example, it is often easier to send a computed text from the API directly rather than sending multiple fields that the client then needs to handle in order to construct the text. Being able to consolidate/centralize logic on the server, and thus simplifying clients, makes their behavior more consistent and bug-free.

    On the flip side, if there is a lack of stability or consistency in products and designs, any large product or design changes are more difficult to implement with render models due to needing schema changes.

    Render models are effective when defining generic templates that clients can render. If the product experience does not need to display different variants of data with the same UI, it would be nearly impossible to define such a generic template, and would often be simpler to use models that are more use case-specific rather than over-generalizing the model designs.

    Acknowledgments

    Render models have been adopted through many projects and our best practices have evolved over several years. Many have contributed to the design and implementation behind this modeling approach and we want to give a special shoutout to Nathan Hibner, Zach Moore, Logan Carmody, and Gabriel Csapo for being key drivers in formulating these guidelines and principles formally for the larger LinkedIn community.
