LinkedIn’s mission is to connect the world’s professionals to make them more productive and successful. One way we advance this mission is by partnering with other organizations to deliver world-class integrations. We are developing a platform-as-a-service (PaaS) that provides exploratory access, insights, and conflation with LinkedIn’s Economic Graph to enable product integrations with strategic partners and customers. Our goal is to build a best-in-class API platform that is easy to use, efficient, and easy for our customers to operate. However, one challenge we found was that crafting and externalizing APIs to fit a specific use case is time consuming. In this blog post, we will walk through our adoption of GraphQL and how it helped speed up our API development in support of our customers.
GraphQL is a query language for data APIs. It is a server-side runtime specification for executing queries using a type system. GraphQL is not tied to any specific database or storage engine. Instead, it is backed by server-side code and data. In the following sections, we will discuss specifics of how GraphQL enables us to significantly accelerate our API development. The key strength of GraphQL is its flexibility in querying, which removes the ties to rigidly defined APIs.
Problem statement: Where we were
Every customer looking to integrate with LinkedIn has a unique set of product concerns, mostly related to data and retrieval patterns. Before the GraphQL model, this resulted in us manifesting a new REST API for each use case. Creating and operating all externalized APIs led to a high cost barrier that only highly-scaled or high-ROI use cases could overcome.
Prior to GraphQL, we employed two major strategies in our APIs to lower our cost of fulfilling different use cases’ requirements:
- Define different APIs only when necessary: For use cases with different input scenarios (e.g., search by email, search by schoolId) that lacked reusability, we defined new APIs.
- Support projections: For use cases that required different fields to be projected in the response, even if they had the same input, we added projection support.
These strategies helped. However, the whole process still required a long development cycle, especially for the first strategy. When we needed to define a new API or a variant of an existing one, we had to handle the REST API definition, the request’s input processing, and the response assembly. After accounting for rigorous technical and business reviews and testing, it could take up to two quarters of development to fulfill a new product feature or onboard a new use case.
Leveraging REST projections made our APIs more reusable but came at a cost. It significantly increased the complexity of monitoring and operations. Splitting the monitoring metrics with different projection sets for the same API is challenging in itself; it also increases the challenge of evolving API products and increases the time to debug in the event of an issue.
What made the long onboarding time problematic was demand (a fortunate problem to have). Our team had customers lining up, and we were unable to keep up with demand, let alone make progress towards clearing out the backlog. We needed to scale without simply throwing more engineers at the problem.
Solution: Where we are
At a high level, GraphQL helps us complete the story. Users can explore the service’s capabilities via its type system, and a new use case is created by defining a new GraphQL query. Because the GraphQL service automatically works for any new query against its type system, no additional engineering effort is needed. In this section, we focus on how we used GraphQL to address our onboarding capacity problems in the GraphQL service itself. We also briefly discuss the gaps we see in the current offering of GraphQL and how we extended it for our use cases.
When we designed our solution, we used the following guiding principles:
- Follow industry-supported best GraphQL practices
- Keep things simple and generic, with no additional manual processes for query execution
- Alleviate toil with API definition and with response wiring, increasing onboarding velocity and capacity
- Include full feature coverage for existing use cases, providing a migration path
One of the most common use cases within LinkedIn is search. Some examples include finding a person by their attributes, jobs that match certain criteria, and many more. Within LinkedIn, we chose to have our GraphQL API follow the OpenCRUD specification. This specification handles SQL-like queries very well. For search queries, however, it falls short due to differences in query definition and response shape:
- Search semantics are not accurately expressed in type systems (e.g., full-text, typeahead);
- Search results include search-specific metadata, like matched fields, but relevance features are not well captured in a type system;
- Search operators, like Lucene’s “should” and “must,” as well as facets, are not well supported in OpenCRUD.
The gaps in the type system for search use cases also make it difficult to execute the search GraphQL queries generically.
Now, we will talk about the solution in different aspects: the type system and query engine.
The type system is the heart of GraphQL. It captures the GraphQL service’s capabilities, like supported retrieval patterns, input conditions, nodes, and edges between nodes. Currently, none of the GraphQL type systems in the industry that support search meet all of the required capabilities, like rich filtering or search query specification. We decided to expand the OpenCRUD type system with a search-related “Search Entity” type that includes the rich search response, like facets or matched-term information. At LinkedIn, we wanted different search verticals to provide searchability over different datasets/entities. Meanwhile, with the Search Entity type, we define a rich input structure that includes the search semantics, relevance models, and search-specific operators. An excerpt from our Economic Graph type system demonstrates the modeling for search entities. The following example is the “Search Entity” for the MemberSearch type.
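The schema excerpt referenced above did not survive editing; the following is a hedged sketch of what such a Search Entity might look like. All type and field names here are hypothetical and illustrative, not LinkedIn’s actual schema:

```graphql
# Hypothetical excerpt; LinkedIn's actual type and field names differ.
type MemberSearch {
  elements: [Member!]!            # the matched entities
  paging: Paging!
  facets: [Facet!]                # faceted counts for query refinement
  matchedTerms: [MatchedTerm!]    # search-specific response metadata
}

input MemberSearchInput {
  keywords: String                # full-text search semantics
  typeahead: Boolean              # search semantics OpenCRUD cannot express
  relevanceModel: String          # named relevance model to apply
  must: [MemberFilterInput!]      # Lucene-style "must" operator
  should: [MemberFilterInput!]    # Lucene-style "should" operator
  facetRequests: [FacetRequestInput!]
}
```

Note how the input structure carries the search semantics, relevance model, and operators that the plain OpenCRUD filter grammar lacks, while the response type carries facet and matched-term metadata alongside the entities themselves.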
There are other companies that have created different strategies to define search capabilities in their GraphQL type system. After performing a comparison, we believe our GraphQL type system provides a better user experience, as it follows the GraphQL syntax, provides richer query capabilities, and accepts faceted query features.
|  | Query Format | Query Syntax |
|---|---|---|
|  | GraphQL-oriented | GraphQL |
|  | Free-formed | GitHub Syntax |
|  | Free-formed | Customized |
|  | GraphQL-oriented | GraphQL |
|  | GraphQL-oriented | GraphQL |
Here is the high-level architecture showing how the expanded type system is executed. The search types are treated as a distinct execution node, which captures the relations to the other nodes we plan to fetch data from.
We are leveraging the graphql-java engine to handle the execution concerns of the GraphQL story. With our enhancements and architecture, the engine provides generic request orchestration and response stitching. With that, a new query against the same type system the engine is serving will automatically work, with no extra configuration needed. For our purpose of handling search with GraphQL, we needed to expand the engine to allow downstream calls to search via a search data fetcher.
Having built well-structured schemas (like MemberSearch) for search input and search response types in the type system, we could easily define the generic function for the data fetcher to handle delegation to downstream search verticals. Keywords (e.g., matchTerm, Facet, or relevanceModel) in the type system determine which search vertical a request needs to be routed to and query rewriting. After we filled in the gaps for Search Entity execution, we were able to stitch together the GraphQL engine and our type system to provide the dynamic search capability we needed.
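To make the keyword-driven routing concrete, here is a minimal sketch in Python. The vertical names, field names, and rewrite rules are all hypothetical stand-ins for illustration; the real data fetcher is implemented against graphql-java and LinkedIn’s internal search verticals:

```python
# Hypothetical sketch of keyword-based routing inside a generic search
# data fetcher; real vertical names, fields, and rewrite rules differ.

# Mapping from the searched entity type to its downstream search vertical.
VERTICALS = {
    "MemberSearch": "member-search-vertical",
    "JobSearch": "job-search-vertical",
}

def route_and_rewrite(entity_type: str, query_input: dict) -> dict:
    """Pick the downstream vertical and rewrite the GraphQL input
    into a vertical-specific search request."""
    vertical = VERTICALS.get(entity_type)
    if vertical is None:
        raise ValueError(f"no search vertical registered for {entity_type}")

    request = {"vertical": vertical, "query": query_input.get("keywords", "")}
    # Keywords in the type system (matchTerm, facet, relevanceModel)
    # drive the vertical-specific query rewriting.
    if "relevanceModel" in query_input:
        request["relevanceModel"] = query_input["relevanceModel"]
    if "facets" in query_input:
        request["facetRequests"] = query_input["facets"]
    return request
```

Because the routing keys off well-structured schema keywords rather than per-use-case code, any new query against the same type system is handled by this one generic function.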
Along with feature completion, we’re continuing to enhance the foundation support to further improve the whole development process with type system generation tools for search. Because a search type system requires hints to uncover the details about all the searchable fields, facetable fields, and the searched entity, among others, we defined a search index spec: a config file that provides hints for generating the relevance search type system. An automated search type system generation tool will parse the search index spec and generate the type system. We included an example of a search index spec, which we used to further reduce the time to initiate a new GraphQL service with a new type system from three weeks to a couple of days.
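The search index spec example mentioned above did not survive editing; the following is a hedged sketch of what such a config might look like. The format and every field name here are hypothetical:

```yaml
# Hypothetical search index spec; the real config format and field
# names used at LinkedIn differ.
entity: Member
searchableFields:
  - name: firstName
    type: text
    typeahead: true
  - name: headline
    type: text
facetableFields:
  - name: companyId
  - name: geoRegion
relevanceModels:
  - default
  - peopleTypeahead
```

The generation tool parses hints like these (searchable fields, facetable fields, the searched entity) and emits the corresponding search type system automatically.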
GraphQL always returns HTTP 200, even with partial errors, which makes GraphQL services hard to debug. We have adopted and extended our internal query monitoring tool to further analyze GraphQL error responses, so that we can differentiate server-side errors from client-side ones, and we use that for service monitoring. Better monitoring and debuggability will help reduce future maintenance effort.
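A minimal sketch of this kind of classification, relying only on the `data`/`errors` response shape defined by the GraphQL specification; the classification rules themselves are illustrative, not LinkedIn’s actual monitoring logic:

```python
# Classify a GraphQL response (which arrives with HTTP 200 either way)
# for monitoring purposes. Per the GraphQL spec, a null "data" field
# means the whole operation failed; errors alongside non-null data
# indicate a partial failure.

def classify_response(body: dict) -> str:
    errors = body.get("errors") or []
    if not errors:
        return "success"
    if body.get("data") is None:
        return "server_error"    # whole-operation failure
    return "partial_error"       # some fields resolved, some failed
```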
Improvements on development velocity
With the benefits of GraphQL and our enhanced foundation support, we reduced the total development effort from two to three quarters to two days for use cases covered by the type system, and eliminated the manual development of new data APIs and their request/response handling. The whole process of onboarding a new use case no longer relies on any engineer’s manual effort. Instead, users can explore the capabilities available in our system, register GraphQL queries for their own use cases, and, once a query is approved, use it directly.
In the past few quarters, we’ve onboarded two new use cases into the new system. For both use cases, we reduced the overall onboarding time to 2 days.
Future work: Where we will be
We have been operationalizing the searchable Economic Graph by onboarding new use cases. What’s next on our roadmap is to apply the same concept to our PaaS to enable search over the conflated Economic Graph. We are also planning to implement schema delegation in our ecosystem to boost the efficiency of query resolutions. All of these enhancements are stepping stones towards our goal of building a best-in-class API platform that is easy to use, efficient, and easy to operate. Stay tuned for future updates.
The success of this project would not be possible without support from many people from many teams at LinkedIn. We cannot possibly thank everyone but would like to make a special callout to: Alex Song, Jingwen Mo, Areg Nersisyan, and Lin Shen. In addition, we would like to thank Juan Grande and Diego Buthay for helping us with their expertise in the search domain. We also would like to thank Gopal Holla, Prakhar Sharma, Min Chen, Vickey Yeh, and Arun Ponniah Sethuramalingam, who helped consult on how the type system should be expanded and how we could leverage the existing work provided by our infra teams. Last but not least, we would like to thank Ankit Gupta and Aarthi Jayaram for sponsoring this project.
TopicGC: How LinkedIn cleans up unused metadata for its Kafka clusters
Apache Kafka is an open-source event streaming platform where users can create Kafka topics as data transmission units, and then publish or subscribe to those topics with producers and consumers. While most Kafka topics are actively used, some are no longer needed because business needs changed or the topics themselves are ephemeral. Kafka itself doesn’t have a mechanism to automatically detect unused topics and delete them. This is usually not a big concern, since a Kafka cluster can hold a considerable number of topics (hundreds to thousands). However, if the topic count keeps growing, it will eventually hit a bottleneck and have disruptive effects on the entire Kafka cluster. The TopicGC service was born to solve this exact problem. It has been proven to reduce Kafka pressure by deleting ~20% of topics, and improved Kafka’s produce and consume performance by at least 30%.
As the first step, we need to understand how unused topics can cause pressure on Kafka. Like many other storage systems, all Kafka topics have a retention period, meaning that for any unused topics, the data will be purged after a period of time and the topic will become empty. A common question here is, “How could empty topics affect Kafka?”
For topic management purposes, Kafka stores the metadata of topics in multiple places, including Apache ZooKeeper and a metadata cache on every broker. Topic metadata contains information about partition and replica assignments.
Let’s do a simple calculation here: suppose topic A has 25 partitions, with a replication factor of three, meaning each partition has three replicas. Even if topic A is no longer used, Kafka still needs to store the location info of all 75 replicas somewhere.
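The arithmetic above, written out:

```python
# Replica metadata Kafka must track for a topic, used or not:
# one entry per replica of each partition.
def replica_count(partitions: int, replication_factor: int) -> int:
    return partitions * replication_factor

# Topic A from the example: 25 partitions, replication factor 3.
# replica_count(25, 3) == 75 replica locations to track.
```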
The effect of metadata pressure may not be that obvious for a single topic, but it can make a big difference if there are a lot of topics. The metadata can consume memory from Kafka brokers and ZooKeeper nodes, and can add payload to metadata requests.
In Kafka, follower replicas periodically send fetch requests to the leader replicas to stay in sync with the leader. Even for empty topics and partitions, the followers still try to sync with the leaders. Because Kafka does not know whether a topic is permanently unused, it always forces the followers to fetch from the leaders. These redundant fetch requests lead to more fetch threads being created, which can cause extra network, CPU, and memory utilization, and can dominate the request queues, causing other requests to be delayed or even dropped.
The Kafka controller is a broker that coordinates and manages the other brokers in a Kafka cluster. Many Kafka requests have to be handled by the controller, so controller availability is crucial to Kafka.
On controller failover, a new controller has to be elected and take over the role of managing the cluster. The new controller takes some time to load the metadata of the entire cluster from ZooKeeper before it can act as the controller; this is called the controller initialization time. As mentioned earlier in this post, unused topics generate extra metadata that makes controller initialization slower and threatens Kafka’s availability. Issues can arise when the ZooKeeper response is larger than 1MB. For one of our largest clusters, the ZooKeeper response has already reached 0.75MB, and we anticipate it will hit this bottleneck within two to three years.
While designing TopicGC, we kept in mind a number of requirements. Functionally, we determined that the system must set criteria to determine whether a topic should be deleted, constantly run the garbage collection (GC) process to remove the unused topics, and notify the user before topic deletion.
Additionally, we identified non-functional requirements for the system. The requirements include ensuring no data loss during topic deletion, removal of all dependencies from unused topics before deletion, and the ability to recover the topic states from service failures.
To satisfy those requirements, we designed TopicGC based on a state machine model, which we will discuss in more detail in the following sections.
Topic state machine
To achieve all of the functional requirements, TopicGC internally runs a state machine. Each topic instance is associated with a state, and several background jobs periodically run and transition the topic states as needed. Table 1 describes all possible states in TopicGC.
Table 1: Topic states and descriptions
With the help of internal states, TopicGC follows a certain workflow to delete unused topics.
Figure 1: TopicGC state machine
Detect topic usage
TopicGC has a background job to find unused topics. Internally, we use the following criteria to determine whether a topic is unused:
- The topic is empty
- There is no BytesIn/BytesOut
- There is no READ/WRITE access event in the past 60 days
- The topic is not newly created in the past 60 days
The TopicGC service fetches the above information from ZooKeeper and a variety of internal data sources, such as our metrics reporting system.
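The criteria above can be sketched as a single predicate. This is an illustrative simplification: the field names are hypothetical, and the real service assembles these signals from ZooKeeper and internal metrics systems rather than a single dict:

```python
# Hedged sketch of the usage-detection predicate; field names are
# illustrative, not TopicGC's actual data model.
from datetime import datetime, timedelta

IDLE_WINDOW = timedelta(days=60)

def is_unused(topic: dict, now: datetime) -> bool:
    idle_cutoff = now - IDLE_WINDOW
    return (
        topic["message_count"] == 0                  # the topic is empty
        and topic["bytes_in"] == 0
        and topic["bytes_out"] == 0                  # no BytesIn/BytesOut
        and topic["last_access"] <= idle_cutoff      # no READ/WRITE in 60 days
        and topic["created_at"] <= idle_cutoff       # not newly created
    )
```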
Send email notification
If a topic is in the UNUSED state, TopicGC will trigger the email sending service to find the LDAP user info of the topic owner and send email notifications. This is important because we don’t know whether the topic is temporarily idle or permanently unused. In the former case, once the topic owner receives the email, they can take actions to prevent the topic from being deleted.
Block write access
This is the most important step in the TopicGC workflow. Consider this case: if a user produces some data in the last second before topic deletion, that data will be lost when the topic is deleted. Thus, avoiding data loss is a crucial challenge for TopicGC. To ensure the service doesn’t delete topics that receive last-minute writes, we introduced a block-write-access step before topic deletion. After write access to the topic is blocked, there is no chance that TopicGC can cause data loss.
Notice that Kafka doesn’t have a mechanism to “seal” a topic. Here, we leverage LinkedIn’s internal way of blocking topic access: we have access-management services that allow us to control access to all data resources, including Kafka topics. To seal a topic, TopicGC sends a request to the access service to block any read and write access to the topic.
The data of a topic can be mirrored to other clusters via Brooklin, a framework open-sourced by LinkedIn to stream data between various heterogeneous source and destination systems with high reliability and throughput at scale. Before deleting a topic, we need to disable Brooklin mirroring for it. Brooklin can be regarded as a wildcard consumer for all Kafka topics; if a topic is deleted without informing Brooklin, Brooklin will throw exceptions about consuming from non-existent topics. For the same reason, if any other services consume from all topics, TopicGC should tell them to stop consuming from the garbage topics before deletion.
Once all preparations are done, the TopicGC service will trigger the topic deletion by calling the Kafka admin client. The topic deletion process can be customized and in our case, we delete topics in batches. Because topic deletion can introduce extra load to Kafka clusters, we set an upper limit of the concurrent topic deletion number to three.
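A minimal sketch of batched deletion with the concurrency cap of three described above. The `delete_fn` callback stands in for the Kafka admin client call; the batching policy is the only part taken from the text:

```python
# Delete topics in batches, capping concurrent deletions at three to
# limit the extra load topic deletion puts on the cluster.
MAX_CONCURRENT_DELETES = 3

def delete_in_batches(topics, delete_fn):
    """delete_fn stands in for the admin-client deletion call
    (e.g., AdminClient.deleteTopics in the Java client)."""
    batches = []
    for i in range(0, len(topics), MAX_CONCURRENT_DELETES):
        batch = topics[i:i + MAX_CONCURRENT_DELETES]
        delete_fn(batch)
        batches.append(batch)
    return batches
```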
Last minute usage check
Before making any actual changes to the topic (including blocking write access, disabling mirroring, and topic deletion), we run a last-minute usage check for the topic. This adds an extra layer of security to prevent data loss. If TopicGC detects usage at any point during the deletion process, it marks the topic with the INCOMPLETE state and starts recovering it back to the USED state.
Impact of TopicGC
We launched TopicGC in one of our largest data pipelines, and were able to reduce the topic count by nearly 20%. In the graph, each color represents a distinct Kafka cluster in the pipeline.
Figure 2: Total topic count during TopicGC
Improvement on CPU usage
The topic deletion helps to reduce the total fetch requests in the Kafka clusters and as a result, the CPU usage drops significantly after the unused topics are deleted. The total Kafka CPU usage had about a 30% reduction.
Figure 3: CPU usage improvement by TopicGC
Improvement on client request performance
Due to the CPU usage reduction, Kafka brokers are able to handle the requests more efficiently. As a result, Kafka’s request handling performance improved, and request latencies dropped by up to 40%. Figure 4 shows the decrease in latency for Metadata Request.
Figure 4: Kafka request performance improvement by TopicGC
Since we launched TopicGC to delete unused topics for Kafka, it has deleted nearly 20% of our topics and significantly reduced the metadata pressure on our Kafka clusters. From our metrics, client request performance improved by around 40% and CPU usage dropped by up to 30%.
As TopicGC has shown its ability to clean up Kafka clusters and improve Kafka performance, we have decided to launch the service to all of our internal Kafka clusters. We are hoping to see that TopicGC can help LinkedIn have a more effective resource usage on Kafka.
Many thanks to Joseph Lin and Lincong Li for coming up with the idea of TopicGC and implementing the original design. We are also grateful for our managers Rohit Rakshe and Adem Efe Gencer, who provided significant support for this project. Last but not least, we want to give a shout-out to the Kafka SRE and Brooklin SRE teams for being helpful partners. With their help, we smoothly launched TopicGC and were able to see these exciting results.
Render Models at LinkedIn
We use render models to pass data to our client applications, describing both the content (text, images, buttons, etc.) and the layout to display on the screen. This means most of this logic is moved out of the clients and centralized on the server, which enables us to deliver new features faster to our members and customers while keeping the experience consistent and staying responsive to change.
Traditionally, many of our API models tend to be centered around the raw data that’s needed for clients to render a view, which we refer to as data modeling. With this approach, clients own the business logic that transforms the data into a view model to display. Often this business logic layer can grow quite complex over time as more features and use cases need to be supported.
This is where render models come into the picture. A render model is an API modeling strategy where the server returns data that describes the view that will be rendered. Other commonly used terms that describe the same technique are Server Driven User Interface (SDUI), or View Models. With render models, the client business logic tends to be much thinner, because the logic that transforms raw data into view models now resides in the API layer. For any given render model, the client should have a single, shared function that is responsible for generating the UI representation of the render model.
Architectural comparison between data modeling and render modeling
To highlight the core differences in modeling strategy between a render model and data model, let’s walk through a quick example of how we can model the same UI with these two strategies. In the following UI, we want to show a list of entities that contain some companies, groups, and profiles.
An example UI of an ‘interests’ card to display to members
Following the data model approach, we would look at the list as a mix of different entity types (members, companies, groups, etc.) and design a model so that each entity type would contain the necessary information for clients to be able to transform the data into the view shown in the design.
When applying a render model approach, rather than worry about the different entity types we want to support for this feature, we look at the different UI elements that are needed in the designs.
An ‘interests’ card categorized by UI elements
In this case, we have one image, one title text, and two other smaller subtexts. A render model represents these fields directly.
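A hedged sketch of such a render model and the single shared client render function it pairs with. Field and type names here are illustrative (LinkedIn’s actual schemas are defined in its API layer, not Python):

```python
# Illustrative render model for the card above: one image, one title,
# and two smaller subtexts. Names are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EntityCardRenderModel:
    image_url: str
    title: str
    subtitle: Optional[str] = None
    caption: Optional[str] = None

def render(model: EntityCardRenderModel) -> list:
    """The single, shared client function: display exactly what the
    server sent, with no knowledge of the underlying entity type."""
    parts = [model.image_url, model.title]
    if model.subtitle:
        parts.append(model.subtitle)
    if model.caption:
        parts.append(model.caption)
    return parts
```

The same model renders a company, a group, or a profile; only the server knows which one it is.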
With the above modeling, the client layer remains very thin as it simply displays each image/text returned from the API. The clients are unaware of which underlying entity each element represents, as the server is responsible for transforming the data into displayable content.
API design with render models
API modeling with render models can live on a spectrum between the two extremes of frontend modeling strategies, such as pure data models and pure view models. With pure data models, different types of content use different models, even if they look the same on UI. Clients know exactly what entity they are displaying and most of the business logic is on clients, so complex product UX can be implemented as needed. Pure view models are heavily-templated and clients have no context on what they are actually displaying with almost all business logic on the API. In practice, we have moved away from using pure view models due to difficulties in supporting complex functionality, such as client animations and client-side consistency support, due to the lack of context on the clients’ end.
Typically, when we use render models, our models have both view model and data model aspects. We prefer to use view modeling most of the time to abstract away most of the view logic on the API and to keep the view layer on the client as thin as possible. We can mix in data models as needed, to support the cases where we need specific context about the data being displayed.
A spectrum of modeling strategies between pure view models and pure data models
To see this concretely, let’s continue our previous example of a FollowableEntity. The member can tap on an entity to begin following the profile, company, or group. As a slightly contrived example, imagine that we perform different client-side actions based on the type of the entity. In such a scenario, the clients need to know the type of the entity, and at first blush it might appear that the render model approach isn’t feasible. However, we can combine these approaches to get the best of both worlds: we can continue to use a render model to display all the client data, but embed the data model inside the render model to provide context for making the follow request.
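Sketching that combination, with all names hypothetical: the render model stays view-oriented, while a small embedded data model carries just enough context for the write path.

```python
# Illustrative mix of render model + embedded data model; names are
# hypothetical, not LinkedIn's actual FollowableEntity schema.
from dataclasses import dataclass

@dataclass
class FollowAction:
    entity_type: str   # "PROFILE" | "COMPANY" | "GROUP": data-model context
    entity_urn: str    # identifier the client needs for the follow request

@dataclass
class FollowableEntity:
    image_url: str     # render-model fields: displayed as-is
    title: str
    follow_action: FollowAction  # embedded data model

def on_tap(entity: FollowableEntity) -> str:
    # The client renders image/title blindly, but uses the embedded
    # data model to issue the type-aware follow request.
    return f"POST /follow/{entity.follow_action.entity_urn}"
```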
Client theming, layout, and accessibility
Clients have the most context about how information will be displayed to users. Understanding the dynamics of client-side control over the UX is an important consideration when we build render models. This is particularly important because clients can alter display settings like theme, layout, screen size, and dynamic font size without requesting new render models from the server.
Properties like colors, local image references, borders, or corner radius are sent using semantic tokens (e.g., color-action instead of blue) from our render models. Our clients maintain a mapping from these semantic tokens to concrete values based on the design language for the specific feature on a given platform (e.g. iOS, Android, etc.). Referencing theme properties with semantic tokens enables our client applications to maintain dynamic control over the theme.
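A small sketch of this token-resolution layer on the client. The token names and hex values below are illustrative assumptions, not LinkedIn’s actual design-language mapping:

```python
# Client-side mapping from server-sent semantic tokens to concrete
# values for the active theme. Tokens and values are illustrative.
THEMES = {
    "light": {"color-action": "#0A66C2", "color-background": "#FFFFFF"},
    "dark":  {"color-action": "#70B5F9", "color-background": "#1D2226"},
}

def resolve_token(theme: str, token: str) -> str:
    """The server sends 'color-action', never a raw color; the client
    resolves it per-theme, so theme switches need no new render models."""
    return THEMES[theme][token]
```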
For the layout, our render models are not intended to dictate the exact layout of the UI because they are not aware of the total available screen space. Instead, the models describe the order, context, and priorities for views, allowing client utilities to ultimately determine how the components should be placed based on available space (screen size and orientation). One way we accomplish this is by referring to the sizes of views by terms like “small” or “large” and allowing clients to apply what that sizing means based on the context and screen size.
It is critical that we maintain the same level of accessibility when our UIs are driven by render models. To do so, we provide accessibility text where necessary in our models, map our render models to components that have accessibility concerns baked in (minimum tap targets), and use semantics instead of specific values when describing sizes, layouts, etc.
Write use cases
One of the most challenging aspects of render models is dealing with write use cases, like filling forms and taking actions on the app (such as following a company, connecting with a person, sending a message, etc.). These use cases need specific data to be written to backends and cannot be modeled in a completely generic way, making it hard to use render models.
Actions are modeled by sending the current state of the action and its other possible states from the server to the clients. This tells the clients exactly what to display. In addition, it allows them to maintain any custom logic to implement a complex UI or perform state-changing follow-up actions.
To support forms, we created a standardized library to read and write forms, with full client infrastructure support out of the box. Similar to how traditional read-based render models attempt to leverage generic fields and models to represent different forms of data, our standardized forms library leverages form components as its backbone to generically represent data in a form by the type of UI element it represents (such as a ‘single line component’ or a ‘toggle component’).
Render models in practice
As we have mentioned above, the consistency of your UI is an important factor when leveraging render models. LinkedIn is built on a semantics-based design system that includes foundations like color and text, as well as shared components such as buttons and labels. Similarly, we have created layers of common UX render models in our API that include foundational and component models, which are built on top of those foundations.
Our foundational models include rich representations of text and images and are backed by client infrastructure that renders these models consistently across LinkedIn. Representing rich text through a common model and render utilities enables us to provide a consistent member experience and maintain our accessibility standards (for instance, we can restrict the usage of underlining in text that is not a link). Our image model and processing ensures that we use the correct placeholders and failure images based on what the actual image being fetched presents (e.g., a member profile). These capabilities of the foundational models are available without any client consumer knowledge of what the actual text or image represents and this information is all encapsulated by the server-driven model and shared client render utilities.
The foundational models can be used on their own or through component models that are built on top of the foundations. They foster re-use and improve our development velocity by providing a common model and shared infrastructure that resolves the component. One example is our common insight model, which combines an image with some insightful text.
A commonly used ‘insight’ model used throughout the site
Over the years, many teams at LinkedIn have taken on large initiatives to re-architect their pages based on render model concepts built on top of these foundational models. No two use cases are exactly alike, but a few of the major use cases include:
- The profile page, which is built using a set of render model-based components stitched together to compose the page. For more details on this architecture, see this blog post published earlier this year.
- The search results page, built using multiple card render model templates to display different types of search results in a consistent manner. See this blog post for more details.
- The main feed, built centered around the consistent rendering of one update, with optional components to allow for variability based on different content types.
A feed component designed around several components
- The notifications tab, which helped standardize 50+ notification types into one simple render model template.
A notifications card designed using a standardized UI template
All of these use cases have seen some of the key benefits highlighted in this post: simpler client-side logic, a consistent design feel, and faster iteration and experimentation velocity for new features and bug fixes.
Render model tradeoffs
Render models come with their pros and cons, so it is important to properly understand your product use case and vision before implementing them.
With render models, teams are able to create leverage and control when a consistent visual experience, within a defined design boundary, is required across diverse use cases. This is enabled by centralizing logic on the server rather than duplicating logic across clients. It fosters generalized and simpler client-side implementation, with clients requiring less logic to render the user interface since most business logic lives on the server.
Render models also reduce repeated design decisions and the client-side work needed to onboard use cases that fit an existing visual experience. They foster generalized API schemas, encouraging reuse across different features when the UI is similar to an existing one.
With more logic pushed to the API and a thin client-side layer, teams can experiment and iterate faster, because changes can be made by modifying only the server code, without client-side changes on all platforms (iOS, Android, and Web). This is especially advantageous with mobile clients, which might have older but still supported versions in the wild for long periods of time.
Similarly, as most of the business logic is on the server, it is likely that any bugs will be on the server instead of clients. Render models enable faster turnaround time to get these issues fixed and into production, as server-side fixes apply to all clients without needing to wait for a new mobile app release and for users to upgrade.
As mentioned previously, render models rely on consistent UIs. If the same data backs multiple, visually-distinct UIs, the reusability of your API suffers because the render model needs more complexity to handle the various types of UIs. If the UI does need to change outside the framework, both client and server code need to be updated, sometimes in invasive ways; by comparison, UI-only changes typically do not require changes to data models. For these reasons, the upfront cost to design and implement render models is often higher, due to the need to define the platform and its boundaries, especially on the client.
Render models are un-opinionated about writes and occasionally require write-only models or additional work to write data. This contrasts with data models, where the same models can be used in a CRUD fashion.
Client-side tracking with render models has to be conceived at the design phase, whereas tracking with data models is more composable from the client. It can be difficult to support use case-specific custom tracking in a generic render model.
Finally, there are some cases where client business logic is unavoidable such as in cases with complex interactions between various user interface elements. These could be animations or client-data interactions. In such scenarios, render models are likely not the best approach as, without the specific context, it becomes difficult to have any client-side business logic.
When to use render models?
Render models are most beneficial when building a platform that requires onboarding many use cases that have a similar UI layout. This is particularly useful when you have multiple types of backend data entities that will all render similarly on clients. Product and design teams must have stable, consistent requirements and they, along with engineering, need to have a common understanding of what kinds of flexibility they will need to support and how to do so.
Additionally, if there are complex product requirements that would otherwise need involved client-side logic, it can be a good opportunity to push some of that logic to the API. For example, it is often easier to send a computed text from the API directly rather than sending multiple fields that the client then needs to assemble into the text. Consolidating and centralizing logic on the server simplifies clients and makes their behavior more consistent and less error-prone.
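For instance, a server-side helper might compute the final display string so every client renders it verbatim. A minimal sketch (the function name and strings are illustrative):

```python
def mutual_connections_insight(first_name: str, count: int) -> str:
    """Compute the display text on the server so clients just render it."""
    if count == 0:
        return "No mutual connections"
    if count == 1:
        return f"{first_name} is a mutual connection"
    return f"{first_name} and {count - 1} other mutual connections"
```

Without this, each of the iOS, Android, and Web clients would need to duplicate the pluralization and edge-case logic from the raw fields.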
On the flip side, if products and designs lack stability or consistency, large product or design changes are more difficult to implement with render models because they require schema changes.
Render models are effective when defining generic templates that clients can render. If the product experience does not need to display different variants of data with the same UI, it would be nearly impossible to define such a generic template, and it is often simpler to use more use case-specific models rather than over-generalizing the model design.
Render models have been adopted across many projects, and our best practices have evolved over several years. Many have contributed to the design and implementation behind this modeling approach, and we want to give a special shoutout to Nathan Hibner, Zach Moore, Logan Carmody, and Gabriel Csapo for being key drivers in formulating these guidelines and principles formally for the larger LinkedIn community.
(Re)building Threat Detection and Incident Response at LinkedIn
LinkedIn connects and empowers more than 875 million members and, over the past few years, has undergone tremendous growth. As an integral part of the Information Security organization at LinkedIn, the Threat Detection and Incident Response team (aka SEEK) defends LinkedIn against computer security threats. As we continued on a rapid growth trajectory, the SEEK team decided to reimagine its capabilities and the scale of its monitoring and response solutions. What SEEK set out to do was akin to shooting for the moon, so we named the program “Moonbase.” Moonbase set the stage for significantly more mature capabilities that ultimately led to some impressive results: we reduced incident investigation times by 50%, expanded threat detection coverage by 900%, and cut our time to detect and contain security incidents from weeks or days to hours.
In this blog, we will discuss how LinkedIn rebuilt its security operations platform and teams, scaled to protect nearly 20,000 employees and more than 875 million members, and our approach and strategy to achieve this objective. In subsequent posts, we will do a deep dive into how we built and scaled threat detection, improved asset visibility, and enhanced our ability to respond to incidents within minutes, with many lessons learned along the way.
Software-defined Security Operations Center (SOC)
While these data points demonstrate some of the return on investment, they don’t show how much moving to Moonbase improved the quality of life for our security analysts and engineers. With Moonbase, we eliminated the need to manually search through CSV files on a remote file server, or to download logs and other datasets directly for processing and searching. The move to a software-defined, cloud-centric security operations center accelerated the team’s ability to analyze more data in more detail, while reducing the toil of manual acquisition, transformation, and exploration of data.
Having scaled other initiatives in the past, we knew before we started the program rebuild that we would need strong guiding principles to keep us on track and within scope, and to set reasonable expectations.
Preserving human capital
As we thought about pursuing a Software-defined SOC framework for our threat detection and incident response program, preserving our human capital was one of the main driving factors. This principle continues to drive the team, leading to outcomes such as reduced toil through automation, maximized true positives, tight collaboration between our detection engineers and incident responders, and semi-automated triage and investigation activities.
It’s commonly said that security is everyone’s responsibility, and that includes threat detection and incident response. In our experience, centralizing all responsibilities of threat detection and incident response within a single team restricts progress. Democratization can be a force multiplier and a catalyst for rapid progress if approached pragmatically. For example, we developed a user attestation platform where a user is informed about suspicious activity, given context around why we think it is suspicious, and asked whether they recognize the activity. Depending on the user’s response and circumstantial factors, a workflow is triggered that could lead to an incident being opened for investigation. This helped reduce toil and the time to contain threats by offering an immediate response to unusual activity. Democratization has been applied to several other use cases with varying degrees of success, from gaining visibility to gathering threat detection ideas.
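The decision step of such an attestation workflow might be sketched as follows (the response values, risk score, and the 0.8 cutoff are assumptions for illustration, not LinkedIn’s actual logic):

```python
def attestation_outcome(user_recognizes_activity: bool, risk_score: float) -> str:
    """Decide the next workflow step after a user attestation response.

    risk_score stands in for hypothetical circumstantial factors (0.0-1.0).
    """
    if not user_recognizes_activity:
        return "open_incident"   # user denies the activity: investigate
    if risk_score >= 0.8:
        return "open_incident"   # recognized, but context is too risky to auto-close
    return "auto_close"          # recognized and low risk: no analyst toil
```

The point of the design is that the common, low-risk case resolves without any analyst involvement, with failsafes routing everything else to the response team.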
Building for the future while addressing the present
The rebuilding of the threat detection and incident response program was done while still running the existing operations. With this approach, we were able to carve out space among the team to work on more strategic initiatives.
Security, scalability, and reliability of infrastructure
As the team increased its visibility tenfold, the demand for data normalization and searching continued to grow. Our platform, tooling, data, and processes must be reliable and scalable at the most critical times, such as during an incident or an active attack. The team ensured a focus on resiliency in the face of setbacks such as software bugs or system failures. To get ahead of potential problems, we committed to planning for failure, early warning, and recovery states, ensuring our critical detection and response systems were available when most needed. Security of the LinkedIn platform is at the forefront of everything the team does and is etched into our thought process as we build these systems.
As we started thinking about the broad problem space we needed to address, a few fundamental questions came up.
The first was, “What visibility would the threat detection and incident response team need to be effective?” This fundamental question helped us shape our thinking and requirements about how to rebuild the function. There are several ways to approach building out an operational security response and detection engineering team, whether that’s an incident response firm on retainer, an in-house security operations center, a managed service provider, or some combination of these; plenty of approaches work for many organizations. At LinkedIn, we wanted to work with what we already had: a great engineering team and culture, the ability to take some intelligent risks, and the support of our peers to help build and maintain the pipeline and platform we needed to be effective.
The next question we asked was, “How do we provide coverage for the threats affecting our infrastructure and users?” There are many areas that require attention when building out a Software-defined SOC, and it can be difficult to know where to prioritize your efforts. The long-term goal was inspired by the original intent of the SOCless concept, which suggests that nearly all incident handling would be automated, with a few human checkpoints to ensure quality and accuracy, paired with in-depth investigations as necessary. Given our team’s skills and development-focused culture, reducing the need for human interaction in security monitoring operations was an attractive idea. With this, we needed to build and maintain development, logging, and alerting pipelines, decide what data sources to consume, and decide what detections to prioritize.
“What are the most important assets we need to protect?” was the third question we asked. Due to the scope of Moonbase, we had to decide how to deliver the biggest impact on security before we had designed the new processes or completed the deployment. This meant we focused on the most important assets first, commonly known as “Crown Jewels.” Starting with systems we knew and understood, we could more easily test our detection pipeline. Many of the early onboarded data sources and subsequent detections were from the Crown Jewels. Ultimately, this was a quick start for us, but did not yet offer the comprehensive visibility and detection capabilities we needed.
This leads us to the last question we asked: “How can we improve the lives of our incident responders with a small team?” Early detections in the new system, while eventually iterated on and tuned, led to classic analyst fatigue. To ease the burden on responders, we built a simple tuning request functionality that enables quick feedback and the ability to pause a lower-quality detection. This has let us maintain reasonable expectations of analysts while reducing the potential for fatigue, alert overload, and job dissatisfaction. Additionally, we have focused on decentralizing the early phases of the security monitoring process, which has significantly reduced the toil and investigation required from analysts. When a class of alerts can be sent, with additional context and the proper failsafes in place, to the potential victims or sources of the alert, analysts can focus on responding to the true threats. Another example is directly notifying a system owner or responsible team about unusual or high-risk activities, such as a new administrator, new two-factor accounts, or credential changes. If the activity is normal, expected, or a scheduled administrative activity that the system owners can confirm, again with the proper failsafes, there is no need to involve the incident response team directly. Tickets generated from security alerts are still created and logged for analysis and quality assurance, but reviewing a select number of tickets periodically for accuracy is preferable to manually working through cases that likely have no impact or threat.
Phase 1: Foundation and visibility
The first phase of rebuilding our security platform was our end-to-end infrastructure and process development. We already had a small team and some data sources, including some security-specific telemetry (endpoint and network detection technologies), so we focused on finding the right platform first and then on building strong processes and infrastructure around it. Given the rapidly growing business, a lean team, and other constraints, a few base requirements were established for the foundational platform(s).
Low operational overhead
This enables the team to focus on the analysis tasks meant for humans and lets automation do the work of data gathering and transformation. Constantly working through the critical, yet laborious extract, transform, and load processes while trying to hunt for and respond to threats is a quick recipe for burnout.
Built-in scalability and reliability
Reliability is a naturally critical component in any security system, however, with a small team, it must be low maintenance and easy to scale. Additionally, time invested should be primarily focused on building on top of the platform, rather than keeping the platform operational. This is one of the strongest arguments for a cloud solution, as it allows us to work with our internal IT partners and cloud service providers. Through this, we can ensure a reliable backbone for our pipelines and alerting, along with other programs that coordinate these processes to gather data and context.
Rapid time to value
Building a functioning operational team takes time and practice, so when choosing the technology toolset, we need to see results quickly to focus on process improvement and developing other important capabilities.
To minimize context switching and inefficiencies, we want the data, context, and all entity information to be available in our alerting and monitoring system. This naturally leads toward a log management system and security information and event management (SIEM) overlays.
By the end of this phase, we wanted to be able to onboard a handful of data sources to a collaborative platform where we could run searches, write detections, and automatically respond to incidents. This was the beginning of our methodical approach to deploying detections through a traditional CI/CD pipeline.
Capturing data at scale, and in a heterogeneous environment, is a huge challenge on its own. Not all sources can use the same pipelines, and developing and maintaining hundreds of data pipelines is too much administrative overhead for a small team. We started our search for a solution internally and decided on a hybrid approach. LinkedIn already has a robust data streaming and processing infrastructure in the form of Kafka and Samza, which ships trillions of messages (17 trillion per day!) and processes hundreds of petabytes of data. Most of LinkedIn’s production and supporting infrastructure can already transport data through these pipelines, which made them an attractive early target. However, LinkedIn has grown both organically and through acquisitions, and between software-as-a-service applications, different cloud platforms, and other factors, other supported and reliable modes of transport were needed. After analysis, the team settled on four strategic modes of data collection, including the ultimate fallback of REST APIs provided by the SIEM.
A simplified diagram of data collection pipelines
Kafka
Most of our infrastructure is already capable of writing to Kafka. With reliability and scalability in mind, Kafka was a natural choice for the primary data collection medium.
Syslog collectors
Some infrastructure, such as firewalls, cannot write to Kafka natively. We operate a cluster of Syslog collectors for anything that supports Syslog export but not Kafka messages.
Serverless data pipelines
These are employed mostly for collecting logs from SaaS, PaaS, and other cloud platforms.
SIEM REST API
The data collector REST API is the collection mechanism natively supported by the SIEM for accepting logs and storing them against known schemas. This is currently the most commonly used transport mechanism, scaling to hundreds of terabytes.
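Conceptually, routing a data source to one of these four transports reduces to a capability check. A sketch (the capability flags are illustrative, not real configuration keys):

```python
def choose_transport(source: dict) -> str:
    """Pick one of the four collection modes for a data source (illustrative)."""
    if source.get("kafka_capable"):
        return "kafka"                # primary, scalable collection medium
    if source.get("syslog_capable"):
        return "syslog"               # e.g., firewalls with Syslog export
    if source.get("cloud_platform"):
        return "serverless_pipeline"  # SaaS/PaaS/cloud platform logs
    return "siem_rest_api"            # ultimate fallback provided by the SIEM
```

Ordering the checks this way keeps sources on the most reliable, lowest-overhead pipeline they support.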
Security infrastructure as code
The team has built deep experience with security tooling and platforms over the years. As we rebuilt our foundational infrastructure, we knew we needed to apply a more comprehensive, engineering-first approach. One aspect of this engineering-first approach to defense infrastructure was treating everything as code to maintain consistency, reliability, and quality. This led to the development of the Moonbase continuous integration and continuous deployment (CI/CD) pipeline. Through the CI/CD pipeline, the team manages all detections, data source definitions and parsers, automated playbooks and serverless functions, and the Jupyter notebooks used for investigations.
Having every engineer work on the development and improvement of the detection playbooks, with the rigor that comes from typical CI/CD review and testing stages, gives the team a strong, templateable, and heavily quality-assured playbook for detecting and responding to threats. Simple mistakes, or detections that could lead to unnecessary resource usage, are easily prevented through the peer review process. Our tailored, comprehensive CI validations for each resource type help us programmatically detect issues in these artifacts during PR validation and significantly improve the deployment success rate. Change tracking is also much easier, as pull request IDs can be added to investigation tickets for reference or used in other parts of the tuning process.
The Moonbase CI/CD pipeline is serverless, built on top of Azure DevOps. Azure Repos is a source code management solution similar to GitHub that we use for all our code, with deployment done through Azure Pipelines. Azure Pipelines is a robust CI/CD platform that supports multi-stage deployments, Azure CLI tasks, integration with Azure Repos, and PR-triggered CI builds. It also lets us deploy the same resource to multiple SIEM instances within different scopes by updating only deployment configuration settings. We leverage both to build, validate, and deploy detections and all other deployables, following a trunk-based development model. Artifacts like queries are enriched in the pipeline before deployment. These enrichments help not only with detections but also with tracking threat detection coverage, incident metrics, and more.
While a CI/CD pipeline brings many features, like constant validation, decentralized review, mandatory documentation, and metrics, templating is one of the strongest points of this detection engineering approach. The pipeline allows any analyst on the team to quickly deploy a new detection or update an existing one. In this screenshot from VS Code, you can see an example detection template looking for unexpected activity from inactive service principals (SPNs).
The pane on the right shows the actual detection within its template. The entire detection isn’t pictured here to obscure some internal-only information, but this snippet shows what a detection engineer has configured for this specific detection in the orange-colored text. Other items, like how far back to search and the severity of the alert, can be specified in the template.
Additionally, we explicitly configured the data sources needed to execute the detection within the template.
To assist in understanding our coverage of threats, we map each detection to its MITRE ATT&CK ID and Sub ID. Not only does this help us track our coverage, but it also enables us to write additional queries or detections that collate any technique or class of attacks into a single report.
Finally, the query itself is listed (in this case we’re using KQL to search the data).
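Put together, a detection template of this shape might look like the following sketch. Every name and value here is hypothetical rather than LinkedIn’s actual schema, and the KQL query is deliberately elided, as in the screenshot:

```python
# Hypothetical detection template for unexpected inactive-SPN activity.
# All field names and values are illustrative assumptions.
inactive_spn_detection = {
    "name": "Unexpected activity from inactive service principal",
    "severity": "High",                              # alert severity
    "lookback": "14d",                               # how far back to search
    "data_sources": ["ServicePrincipalSignInLogs"],  # assumed source name
    "mitre_attack": ["T1078.004"],                   # Valid Accounts: Cloud Accounts
    "query": "<KQL elided>",                         # internal-only, not shown
}
```

Because the template is declarative, the pipeline can enrich it (e.g., for ATT&CK coverage reporting) before deployment without the engineer touching the query itself.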
Our internally-developed CLI generates boilerplate templates for new artifacts, helps engineers maintain quality and deployment velocity, and improves productivity by letting engineers validate their changes locally.
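A minimal version of the kind of local/CI validation such a CLI might run is sketched below (the required field names and the ATT&CK ID pattern are assumptions, not the team’s actual rules):

```python
import re

# Assumed required fields for a detection artifact (illustrative).
REQUIRED_FIELDS = {"name", "severity", "query", "mitre_attack", "data_sources"}
# MITRE ATT&CK technique IDs look like T1078 or, with a sub-technique, T1078.004.
MITRE_ID = re.compile(r"^T\d{4}(\.\d{3})?$")

def validate_detection(artifact: dict) -> list:
    """Return a list of validation errors; an empty list means the artifact passes."""
    errors = ["missing field: " + f for f in sorted(REQUIRED_FIELDS - artifact.keys())]
    for tid in artifact.get("mitre_attack", []):
        if not MITRE_ID.match(tid):
            errors.append("bad ATT&CK ID: " + tid)
    return errors
```

Running checks like this in the PR build is what catches malformed artifacts before they ever reach the SIEM.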
It is important to create space for innovation. Earlier, the team often found itself deep in operational toil, going from one issue to another. A deliberate effort to pause operational work that yielded a low return on investment helped the team make space for innovation. Staying true to the guiding principle of automating what makes sense, the team uses automation heavily to remove as much mundane, repeatable work as possible. The following are some of the automation tools and platforms the team currently uses; they have enabled the team to unlock efficiency and quality across its functions.
Automated playbooks and workflow automation
The team leverages a no-code workflow automation platform that comes tightly integrated with the cloud SIEM. It is a highly scalable integration platform that allows building solutions quickly, thanks to many built-in connectors across the product ecosystem, like Jira, ServiceNow, and several custom tools the team depends on. Use cases include alert and incident enrichment, which makes all the context required to triage an alert available to engineers; automated post-alert playbooks for prioritization; automated containment and remediation jobs; and other business workflows. Another use case is automated quality control and post-incident processing, which helps us learn lessons from previous incidents.
Several complex automation jobs are written as serverless functions. These are easy to deploy, scalable, and have a very low operational overhead. These functions are used for ingesting data from on-prem and online sources, along with more complex automation jobs like the containment of compromised identities.
Resilience and reliability are broad topics and are not covered in depth in this blog; however, data and platform reliability are absolutely critical. A change in underlying data can have major cascading effects on detection and response, and data availability during incidents is key. Outside of the core monitoring of the platforms, the team relies on three different avenues for signals of degraded performance:
Monitoring message drop rate, latency, and volume allows the team to quickly identify issues before they impact detection or response activities.
Heartbeat messages are sent to ensure the pipeline is functional and that messages travel from source to destination within expected timeframes.
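Such a heartbeat check reduces to comparing send and receive timestamps against an expected latency bound. A sketch (the 300-second bound is an assumption for illustration):

```python
from typing import Optional

def heartbeat_healthy(sent_ts: float, received_ts: Optional[float],
                      max_latency_s: float = 300.0) -> bool:
    """A synthetic message is healthy only if it arrived within the latency bound."""
    if received_ts is None:
        return False  # the message never made it through the pipeline
    return (received_ts - sent_ts) <= max_latency_s
```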
Ideal behavior indicators (operational alerts)
Data sources are dynamic in nature. When onboarding a data source, a detection is developed to monitor that source’s health; for example, alerting when the number of unique source hosts decreases by more than a threshold percentage.
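That example health detection can be expressed as a simple threshold comparison. A sketch with illustrative parameter names:

```python
def source_host_drop_alert(baseline_hosts: int, current_hosts: int,
                           threshold_pct: float) -> bool:
    """Alert when unique source hosts drop more than threshold_pct vs. baseline."""
    if baseline_hosts == 0:
        return False  # no baseline yet; nothing to compare against
    drop_pct = (baseline_hosts - current_hosts) / baseline_hosts * 100.0
    return drop_pct > threshold_pct
```

For example, a source that normally reports 100 unique hosts but suddenly reports 60 has dropped 40%, which would trip a 30% threshold.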
These are only some of the health checks and alerts we developed to ensure that our systems were logging and reporting properly. We try to graph availability as well as detection efficacy data to assist us in constantly re-evaluating our detection posture and accuracy.
What did we get for all this?
Threat detection and incident response is not a purely operational problem; using an engineering lens paid off well for our team, and if this is not something you’re doing, we hope this post drives you toward that realization. Leaving these processes to organic growth is unsustainable and violates our guiding principles, which are designed to prevent burnout and ensure a quality response. In the end, we achieved significant improvements: we expanded our data collection infrastructure by 10x, going from gigabytes to petabytes; our average time to triage incidents went from hours to minutes; we maintained 99.99% uptime for the platform and connected tooling; and correlation became possible, significantly reducing alert fatigue and improving precision. Additionally, automated response playbooks reduced response toil by automatically handling simple tasks like enrichment, case creation and updates, and additional evidence gathering. We were also able to quickly integrate threat intelligence into our enrichment pipeline, providing much richer context for investigations.
More work to do
What we’ve covered in this blog represents the work of a handful of engineers, product and project managers, analysts, and developers over a relatively short period of time. We learned many valuable lessons along the way and have since developed new ways of thinking about detection, automated response, and scaling. Whatever platform a team chooses for security monitoring, it cannot be overstated how important a solid design and architecture are to its eventual success.
In a future post, we’ll cover more details on how incident response ties in with our platforms and pipelines and a re-architected approach to detection engineering, including the lessons we learned.
This work would not have been possible without the talented team of people supporting our success.
Core contributors: Vishal Mujumdar, Amir Jalali, Alex Harding, Tom Leahy, Tanvi Kolte, Arthur Kao, Swathi Chandrasekar, Erik Miyake, Prateek Jain, Lalith Machiraju, Gaurav Gupta, Brian Pierini, Sergei Rousakov, Jeff Bollinger and several other partners within the organization.