Currently, LinkedIn infrastructure is composed of hundreds of thousands of hosts across multiple data centers. Observability into our infrastructure makes it possible for us to focus on the health and performance of our critical services to provide the best experience to our members. With LinkedIn’s large infrastructure growth over the past few years, observability has become more critical to pinpoint the potential root causes for any infrastructure failure or anomaly. There are a few elegant in-house monitoring systems at LinkedIn that provide network switch level metrics, logs, and even flow-level visibility by sampling packets going through our network. However, all of these rely on sampling or some kind of periodic polling of data, which for any meaningful sampling rate generates a very large volume of data to be processed and analyzed.
This led us to start looking into better approaches to extract this observability information. We realized that the most reliable source for this would be to get this information from the servers themselves through a lightweight, widely deployed agent. This is where we started leveraging eBPF for the agent, by tapping into system events and reading system metrics directly to extract the desired information. With eBPF, instead of tracing each packet, we are tracing syscalls at a layer closer to the application layer and summarizing the data in-kernel to have minimal tracing overhead. The eBPF agent has been named Skyfall and will be referred to as such throughout this post.
In this post, we will talk about the architecture of the Skyfall agent that is deployed across the fleet of hosts in our DCs, the data pipeline involved with it, and where this data can be leveraged. We will also talk about the challenges involved with deploying an always-running agent at such a wide scale and what we have learned on our journey so far.
What is eBPF?
Classic BPF, which stands for Berkeley packet filter, originated from packet filtering in the early 1990s and was used in packet filtering tools, most popularly in tcpdump. eBPF was built on the same principles but has features beyond just packet filtering.
eBPF is an in-kernel execution engine with its own instruction set and further infrastructure around it, like maps and in-built helper functions, as well as a pseudo file system for pinning maps. It was developed to allow user-supplied programs to be attached to any code-path in the kernel in a secure and lightweight manner. For example, if we want to trace all force-kill events, we can simply add a hook into the kill syscall via eBPF to capture such events.
Currently in Linux, the most common probe types available are:
- Kprobes: A debugging mechanism for the Linux kernel that enables breaking into any kernel routine dynamically and collecting debugging and performance information non-disruptively.
- Tracepoints: A tracepoint is a marker within the kernel source that, when enabled, can be used to hook into a running kernel at the point where the marker is located. Tracepoints are more stable than kprobes.
- XDP: This hook is at the earliest point possible in the networking driver and triggers a run of the eBPF program on packet reception before it goes through any kernel networking stack.
- TC classifier: Similar to XDP, this hook is attached at the network interface, but will run after the networking stack has done initial processing of the packet.
An eBPF application typically consists of two parts: the kernel-space program, which runs in the kernel and is triggered by events, and the user-space program, which is responsible for loading the eBPF program into the kernel and setting up the associated maps.
How are we using eBPF?
The Skyfall agent runs on almost all servers within our datacenter fleet. Using eBPF, we are able to correlate kernel events with network flow data in real-time. In a traditional network monitoring setup, for instance, we would need to monitor all the network interfaces on a host. However, with eBPF we can simply look at the kernel state to get the necessary statistics directly, since the kernel already knows about the network traffic on the host. Looking at kernel state using eBPF helps in identifying which services, processes, and containers are participating in communication sessions objectively, with very low CPU overhead. This mapping of system events to network traffic provides our engineers with the multi-dimensional context necessary to reduce the entropy of monitored data.
Skyfall program details
The Skyfall agent hooks into the following protocol-specific (TCP, UDP) lifecycle kernel functions via kprobes and kretprobes to collect the desired data:
- tcp_set_state: For tracing TCP state changes.
- tcp_v4_connect, tcp_v6_connect: For all TCP connection attempts.
- inet_csk_accept: For TCP accept events.
- ip4_datagram_connect, ip6_datagram_connect: For UDP connect events.
The following TCP metrics are collected along with the traffic byte count:
- Smoothed RTT (round-trip time): An exponentially weighted moving average of RTT samples, which TCP also uses to adjust the RTO (retransmission timeout). We collect this metric to measure the network’s contribution to overall performance.
- RTT variance: An indication of path jitter. TCP uses this value, combined with SRTT, to compute the RTO. We are collecting this metric to detect transient network issues.
- Packet loss and retransmits: These metrics are collected to monitor network performance.
- Sending congestion window size: Congestion window controls the number of packets a TCP flow may have in the network at any time.
We extract the above TCP metrics with a few lines of eBPF code: in the kernel-space program, we cast the Linux sock struct to a tcp_sock struct to access the TCP-specific fields and assign them to the corresponding fields in our event struct. Note that we use the BPF helper bpf_probe_read, which lets us safely read a given number of bytes from kernel space, to read the TCP fields. There are a number of other BPF helper functions that eBPF programs can use to interact with the system, or with the context in which they run.
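For context on how the kernel maintains the SRTT, RTT variance, and RTO values that Skyfall reads, the standard estimator from RFC 6298 can be sketched in a few lines of Python (this mirrors the kernel’s computation conceptually; it is not Skyfall’s code):

```python
# RFC 6298 smoothing constants.
ALPHA = 1 / 8   # weight of a new sample in SRTT
BETA = 1 / 4    # weight of a new sample in RTT variance
K = 4           # variance multiplier in the RTO

def update_rtt_estimate(srtt, rttvar, sample):
    """Fold one RTT sample (seconds) into the smoothed estimate."""
    if srtt is None:  # first measurement
        srtt = sample
        rttvar = sample / 2
    else:
        # The variance is updated before SRTT, per RFC 6298.
        rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - sample)
        srtt = (1 - ALPHA) * srtt + ALPHA * sample
    rto = srtt + K * rttvar
    return srtt, rttvar, rto

srtt, rttvar, rto = update_rtt_estimate(None, None, 0.100)
# After the first 100 ms sample: srtt=0.100, rttvar=0.050, rto=0.300
srtt, rttvar, rto = update_rtt_estimate(srtt, rttvar, 0.200)
```

A jittery path inflates rttvar, and with it the RTO, which is why we export both values to detect transient network issues.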
Skyfall architecture overview
The collected data is ingested into a highly scalable and efficient data collection pipeline built on our in-house InFlow collectors. InFlow is our network flow collection, aggregation, and visualization platform. The Skyfall agent encodes the collected data into sFlow datagrams using our custom XDR schema and sends them to the InFlow collectors.
The collectors internally aggregate some of the data to remove redundancies and send it to our highly scalable in-house Kafka cluster. From there, the data follows two main paths. First, it is consumed by a job in Samza, our stream processing system, which transforms the data and extracts service-to-service dependencies to determine the upstream and downstream dependencies for any given service, along with traffic attributes like Tx/Rx bytes, packet retransmission count, and RTT. Second, it is ETL-ed into our HDFS datastore, where we run various analytics jobs.
In the current deployment state, the Skyfall agents across the fleet produce around 12M events per second to the InFlow collectors. Post-aggregation, this drops to 1.4M messages per second produced to Kafka. The aggregation logic keys each flow or event by the 4-tuple of protocol, source IP, destination IP, and service port. Either the source port or the destination port is identified as the service port, and the other is treated as an ephemeral port. The logic for identifying which port acts as the service port is handled by the Skyfall agent itself, since we already maintain a list of listening ports on the host. On aggregation, we take the ephemeral port associated with the flow and append it to the ephemeral-port list of the aggregated event. This approach reduced the Kafka message count by almost 70% compared to a simpler aggregation based on the 5-tuple (protocol, source IP, source port, destination IP, destination port).
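The 4-tuple aggregation can be sketched as follows (field names are illustrative, not the production schema):

```python
def aggregate(flows):
    """Collapse 5-tuple flows into 4-tuple events, accumulating the
    ephemeral ports and byte counts."""
    events = {}
    for f in flows:
        key = (f["proto"], f["src_ip"], f["dst_ip"], f["service_port"])
        ev = events.setdefault(key, {"ephemeral_ports": [], "tx_bytes": 0})
        ev["ephemeral_ports"].append(f["ephemeral_port"])
        ev["tx_bytes"] += f["tx_bytes"]
    return events

# Three flows to the same service collapse into one aggregated event.
flows = [
    {"proto": "tcp", "src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
     "service_port": 443, "ephemeral_port": p, "tx_bytes": 100}
    for p in (50001, 50002, 50003)
]
events = aggregate(flows)
```

Because many clients talk to the same service port from different ephemeral ports, keying on the 4-tuple merges what would otherwise be many distinct 5-tuple events.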
The following graph shows the drop in the rate of Kafka messages produced per second (y-axis) over time (x-axis) after introducing flow aggregation.
Challenges and learnings
Given the massive scale at which LinkedIn operates, deploying an always-running agent across hundreds of thousands of servers comes with a number of challenges. Here are a few of the challenges we faced in our journey with Skyfall.
Operating systems with different kernel versions
To run the Skyfall agent across different kernel versions, we had to deal with various Linux kernel struct rearrangements and modifications across those versions. Unfortunately, since most of our machines don’t have BTF (BPF Type Format) support yet, we could not leverage the BPF CO-RE (compile once, run everywhere) approach. Fortunately, we only had to handle these modifications and struct rearrangements across a handful of kernel versions. Because eBPF is usually compiled from C with clang and linked into an ELF (Executable and Linkable Format) binary, we were able to compile the kernel-space program against a small number of different kernel headers to generate kernel-specific ELF binaries. The user-space program then simply loads the appropriate ELF binary at the agent’s startup, according to the kernel version of the host.
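The loader-side selection can be sketched as follows (the version boundaries and file names here are illustrative assumptions, not our actual build matrix):

```python
def pick_bpf_object(kernel_release, builds):
    """Return the ELF object built against the newest kernel headers
    that are not newer than the running kernel."""
    # "5.4.0-generic" -> (5, 4)
    version = tuple(int(x) for x in kernel_release.split("-")[0].split(".")[:2])
    candidates = [(v, path) for v, path in builds.items() if v <= version]
    if not candidates:
        raise RuntimeError(f"no eBPF object built for kernel {kernel_release}")
    return max(candidates)[1]

# Hypothetical per-kernel builds produced at compile time.
BUILDS = {
    (4, 19): "skyfall_bpf_4.19.o",
    (5, 4): "skyfall_bpf_5.4.o",
    (5, 10): "skyfall_bpf_5.10.o",
}
obj = pick_bpf_object("5.4.0-generic", BUILDS)  # -> "skyfall_bpf_5.4.o"
```

With BTF and CO-RE available, this whole selection step would disappear in favor of a single relocatable object.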
Handling performance overhead
Our initial implementation traced protocol-specific sendmsg and recvmsg functions, so the CPU overhead was proportional to the rate at which these functions were being called. On most servers with normal application workloads we saw very low CPU usage, under 3% of a single CPU core. On hosts running very high throughput applications and database services, we started seeing high CPU utilization with over 70k function calls being traced per second. The CPU usage on some of these hosts went above 50% of 1 CPU core.
To solve this problem, we switched to tracing lifecycle functions, like the protocol-specific accept, connect, and close/disconnect functions, instead of the sendmsg and recvmsg functions, because these lifecycle functions are called at a much lower rate. This cut the CPU overhead by almost 90%: on most of the hosts that were previously reporting high CPU usage, the agent now runs well under 5% of a single CPU core. Along with this, we adjusted the agent process’s nice and ionice values to lower its scheduling priority and reduce contention with application workloads, both for CPU time and disk I/O.
Where can this data be used?
The collected network flows, along with TCP/IP statistics such as retransmits, RTTs, and Tx/Rx bytes, are ingested by our analytics jobs to maximize coverage and provide more granular insight into our traffic metrics, aiding the following use cases:
- Pinpointing network bottlenecks with granular app/service level information. This helps in determining whether the reported issue is caused by the network or by other elements of the stack.
- Analyzing top traffic-intensive services and traffic patterns for capacity planning. During major switch link failures this helps in avoiding disruption of critical services.
- Constructing a service-to-service dependency graph to identify highly interconnected services, services interacting across security zone boundaries, unmanaged applications, and more.
We are also looking to leverage eBPF’s capabilities for tracing security events, to capture suspicious and malicious activities on hosts such as privilege escalations and suspicious binary or module loads. In addition, we will be tracing via uprobes, which allow us to dynamically intercept a userspace program, for example to trace shell command lines.
eBPF is a powerful tool for the observability space and helps us with collecting valuable data points with minimal overhead. It has been both a challenging and exciting journey working with eBPF at LinkedIn. With our initial efforts and learnings along the way, we have been able to establish a solid foundation for leveraging eBPF for a variety of use cases and are looking forward to building interesting tools on top of the rich dataset it provides.
Skyfall is the result of continued efforts from a lot of people within the Infrastructure Development team, both past and present. I would like to thank Varoun P, Haribabu Viswanathan, Ananya Shandilya, Manish Arora, and Sunil Thunuguntla.
TopicGC: How LinkedIn cleans up unused metadata for its Kafka clusters
Apache Kafka is an open-sourced event streaming platform where users can create Kafka topics as data transmission units, and then publish or subscribe to the topic with producers and consumers. While most Kafka topics are actively used, some are not needed anymore because business needs changed or the topics themselves are ephemeral. Kafka itself doesn’t have a mechanism to automatically detect unused topics and delete them. This is usually not a big concern, since a Kafka cluster can hold a considerable number of topics, from hundreds to thousands. However, if the topic count keeps growing, it will eventually hit a bottleneck and have disruptive effects on the entire Kafka cluster. The TopicGC service was born to solve this exact problem. It has been proven to reduce Kafka pressure by deleting ~20% of topics, and to improve Kafka’s produce and consume performance by at least 30%.
As the first step, we need to understand how unused topics can cause pressure on Kafka. Like many other storage systems, all Kafka topics have a retention period, meaning that for any unused topics, the data will be purged after a period of time and the topic will become empty. A common question here is, “How could empty topics affect Kafka?”
For topic management purposes, Kafka stores the metadata of topics in multiple places, including Apache ZooKeeper and a metadata cache on every single broker. Topic metadata contains information of partition and replica assignments.
Let’s do some simple calculation here: suppose topic A has 25 partitions, with a replication factor of three, meaning each partition has three replicas. Even if topic A is not used anymore, Kafka still needs to store the location info of all 75 replicas somewhere.
The effect of metadata pressure may not be that obvious for a single topic, but it can make a big difference if there are a lot of topics. The metadata can consume memory from Kafka brokers and ZooKeeper nodes, and can add payload to metadata requests.
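The back-of-the-envelope calculation above generalizes simply:

```python
# Every topic keeps partitions x replication-factor replica assignments
# in metadata, whether or not the topic carries any traffic.

def replica_entries(topics, partitions_per_topic, replication_factor):
    return topics * partitions_per_topic * replication_factor

# One unused topic like topic A: 25 partitions x 3 replicas = 75 entries.
# A thousand such topics leave 75,000 entries in ZooKeeper and in every
# broker's metadata cache.
```

This is why pressure that is negligible for one empty topic becomes significant at cluster scale.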
In Kafka, the follower replicas periodically send fetch requests to the leader replicas to keep sync with the leader. Even for empty topics and partitions, the followers still try to sync with the leaders. Because Kafka does not know whether a topic is permanently unused, it always forces the followers to fetch from the leaders. These redundant fetch requests will further lead to more fetch threads being created, which can cause extra network, CPU, and memory utilization, and can dominate the request queues, causing other requests to be delayed or even dropped.
The Kafka controller is a broker that coordinates and manages the other brokers in a Kafka cluster. Many Kafka requests have to be handled by the controller, so controller availability is crucial to Kafka.
On controller failover, a new controller has to be elected and take over the role of managing the cluster. The new controller takes some time to load the metadata of the entire cluster from ZooKeeper before it can act as the controller; this is called the controller initialization time. As mentioned earlier in this post, unused topics generate extra metadata that makes controller initialization slower and threatens Kafka availability. Issues can arise when the ZooKeeper response is larger than 1MB. For one of our largest clusters, the ZooKeeper response has already reached 0.75MB, and we anticipate it will hit this bottleneck within two to three years.
While designing TopicGC, we kept in mind a number of requirements. Functionally, we determined that the system must set criteria to determine whether a topic should be deleted, constantly run the garbage collection (GC) process to remove unused topics, and notify users before topic deletion.
Additionally, we identified non-functional requirements for the system. The requirements include ensuring no data loss during topic deletion, removal of all dependencies from unused topics before deletion, and the ability to recover the topic states from service failures.
To satisfy those requirements, we designed TopicGC based on a state machine model, which we will discuss in more detail in the following sections.
Topic state machine
To achieve all of the functional requirements, TopicGC internally runs a state machine. Each topic instance is associated with a state, and several background jobs periodically run and transition the topic states as needed. Table 1 describes all possible states in TopicGC.
Table 1: Topic states and descriptions
With the help of internal states, TopicGC follows a certain workflow to delete unused topics.
Figure 1: TopicGC state machine
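A minimal Python sketch of such a state machine follows. State names other than USED, UNUSED, and INCOMPLETE (which appear later in this post) are assumptions standing in for the entries of Table 1:

```python
# Allowed transitions; several background jobs drive topics through them.
ALLOWED = {
    "USED":            {"UNUSED"},                # detected as inactive
    "UNUSED":          {"NOTIFIED", "USED"},      # email owner, or usage resumed
    "NOTIFIED":        {"WRITE_BLOCKED", "USED"},
    "WRITE_BLOCKED":   {"MIRROR_DISABLED", "INCOMPLETE"},
    "MIRROR_DISABLED": {"DELETED", "INCOMPLETE"},
    "INCOMPLETE":      {"USED"},                  # recover the topic
    "DELETED":         set(),                     # terminal
}

class TopicState:
    def __init__(self, topic):
        self.topic = topic
        self.state = "USED"

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.topic}: {self.state} -> {new_state} not allowed")
        self.state = new_state

# Happy path: a topic walks through the whole workflow to deletion.
t = TopicState("metrics-staging")
for s in ("UNUSED", "NOTIFIED", "WRITE_BLOCKED", "MIRROR_DISABLED", "DELETED"):
    t.transition(s)
```

Encoding the workflow as explicit states is also what makes recovery from service failures possible: on restart, each topic simply resumes from its persisted state.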
Detect topic usage
TopicGC has a background job to find unused topics. Internally, we use the following criteria to determine whether a topic is unused:
- The topic is empty
- There is no BytesIn/BytesOut
- There is no READ/WRITE access event in the past 60 days
- The topic is not newly created in the past 60 days
The TopicGC service fetches the above information from ZooKeeper and a variety of internal data sources, such as our metrics reporting system.
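Expressed as a predicate, the four criteria look roughly like this (field names are illustrative; in production the inputs come from ZooKeeper and our internal metrics and audit systems):

```python
from datetime import datetime, timedelta

IDLE_WINDOW = timedelta(days=60)

def is_unused(topic, now):
    return (
        topic["is_empty"]                                        # criterion 1
        and topic["bytes_in"] == 0 and topic["bytes_out"] == 0   # criterion 2
        and now - topic["last_access_event"] > IDLE_WINDOW       # criterion 3
        and now - topic["created_at"] > IDLE_WINDOW              # criterion 4
    )

now = datetime(2022, 11, 1)
stale = {"is_empty": True, "bytes_in": 0, "bytes_out": 0,
         "last_access_event": datetime(2022, 5, 1),
         "created_at": datetime(2021, 1, 1)}
is_unused(stale, now)  # -> True: empty, idle, and not newly created
```

The creation-time check (criterion 4) prevents a brand-new topic, which is naturally empty and idle, from being flagged for deletion.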
Send email notification
If a topic is in the UNUSED state, TopicGC will trigger the email sending service to find the LDAP user info of the topic owner and send email notifications. This is important because we don’t know whether the topic is temporarily idle or permanently unused. In the former case, once the topic owner receives the email, they can take actions to prevent the topic from being deleted.
Block write access
This is the most important step in the TopicGC workflow. Consider this case: if a user produces some data right at the last second before topic deletion, the data will be lost with the topic. Thus, avoiding data loss is a crucial challenge for TopicGC. To ensure the service doesn’t delete topics that receive last-minute writes, we introduced a block-write-access step before topic deletion. After write access is blocked on the topic, there is no chance that TopicGC can cause data loss.
Notice that Kafka doesn’t have a mechanism to “seal” a topic, so we leverage LinkedIn’s internal way of blocking topic access. At LinkedIn, we have access-management services that allow us to control access to all data resources, including Kafka topics. To seal a topic, TopicGC sends a request to the access service to block any read and write access to the topic.
Disable mirroring
The data of a topic can be mirrored to other clusters via Brooklin. Brooklin, open-sourced by LinkedIn, is a framework to stream data between various heterogeneous source and destination systems with high reliability and throughput at scale. Before deleting a topic, we need to disable Brooklin mirroring for it. Brooklin can be regarded as a wildcard consumer of all Kafka topics: if a topic is deleted without informing Brooklin, Brooklin will throw exceptions about consuming from non-existent topics. For the same reason, if any other services consume from all topics, TopicGC should tell those services to stop consuming from the garbage topics before deletion.
Delete topics
Once all preparations are done, the TopicGC service triggers the topic deletion by calling the Kafka admin client. The topic deletion process can be customized; in our case, we delete topics in batches. Because topic deletion can introduce extra load on Kafka clusters, we set an upper limit of three on the number of concurrent topic deletions.
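The batching step can be sketched as follows, with `delete_topic` standing in for the Kafka admin-client call (in production the deletions within a batch would be issued concurrently and awaited):

```python
def delete_in_batches(topics, delete_topic, max_concurrent=3):
    """Delete topics in batches of at most `max_concurrent`."""
    deleted = []
    for i in range(0, len(topics), max_concurrent):
        batch = topics[i:i + max_concurrent]
        # Each batch completes before the next one starts, capping the
        # deletion load placed on the cluster at any moment.
        for topic in batch:
            delete_topic(topic)
            deleted.append(topic)
    return deleted

calls = []
delete_in_batches(["t1", "t2", "t3", "t4", "t5"], calls.append)
# Two batches: ["t1", "t2", "t3"], then ["t4", "t5"].
```

The cap of three is the knob that trades cleanup speed against broker load.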
Last minute usage check
Before making any actual changes to the topic (including blocking write access, disabling mirroring, and topic deletion), we run a last-minute usage check on the topic. This adds an extra layer of protection against data loss. If TopicGC detects usage at any point during the deletion process, it marks the topic as being in the INCOMPLETE state and starts recovering it back to the USED state.
Impact of TopicGC
We launched TopicGC in one of our largest data pipelines, and were able to reduce the topic count by nearly 20%. In the graph, each color represents a distinct Kafka cluster in the pipeline.
Figure 2: Total topic count during TopicGC
Improvement on CPU usage
The topic deletion helps to reduce the total fetch requests in the Kafka clusters and as a result, the CPU usage drops significantly after the unused topics are deleted. The total Kafka CPU usage had about a 30% reduction.
Figure 3: CPU usage improvement by TopicGC
Improvement on client request performance
Due to the CPU usage reduction, Kafka brokers are able to handle requests more efficiently. As a result, Kafka’s request handling performance improved, and request latencies dropped by up to 40%. Figure 4 shows the decrease in latency for metadata requests.
Figure 4: Kafka request performance improvement by TopicGC
Since we launched TopicGC to delete unused topics for Kafka, it has deleted nearly 20% of our topics and significantly reduced the metadata pressure on our Kafka clusters. From our metrics, client request performance improved by around 40% and CPU usage was reduced by up to 30%.
As TopicGC has shown its ability to clean up Kafka clusters and improve Kafka performance, we have decided to launch the service to all of our internal Kafka clusters. We are hoping to see that TopicGC can help LinkedIn have a more effective resource usage on Kafka.
Many thanks to Joseph Lin and Lincong Li for coming up with the idea of TopicGC and implementing the original design. We are also grateful for our managers Rohit Rakshe and Adem Efe Gencer, who provided significant support for this project. Last but not least, we want to shout out to the Kafka SRE team and the Brooklin SRE team for being helpful partners. With their help, we smoothly launched TopicGC and were able to see these exciting results.
Render Models at LinkedIn
We use render models to pass data to our client applications, describing both the content (text, images, buttons, etc.) and the layout to display on the screen. This means most of the rendering logic is moved out of the clients and centralized on the server, which enables us to deliver new features faster to our members and customers while keeping the experience consistent and staying responsive to change.
Traditionally, many of our API models tend to be centered around the raw data that’s needed for clients to render a view, which we refer to as data modeling. With this approach, clients own the business logic that transforms the data into a view model to display. Often this business logic layer can grow quite complex over time as more features and use cases need to be supported.
This is where render models come into the picture. A render model is an API modeling strategy where the server returns data that describes the view that will be rendered. Other commonly used terms that describe the same technique are Server Driven User Interface (SDUI), or View Models. With render models, the client business logic tends to be much thinner, because the logic that transforms raw data into view models now resides in the API layer. For any given render model, the client should have a single, shared function that is responsible for generating the UI representation of the render model.
Architectural comparison between data modeling and render modeling
To highlight the core differences in modeling strategy between a render model and data model, let’s walk through a quick example of how we can model the same UI with these two strategies. In the following UI, we want to show a list of entities that contain some companies, groups, and profiles.
An example UI of an ‘interests’ card to display to members
Following the data model approach, we would look at the list as a mix of different entity types (members, companies, groups, etc.) and design a model so that each entity type would contain the necessary information for clients to be able to transform the data into the view shown in the design.
When applying a render model approach, rather than worry about the different entity types we want to support for this feature, we look at the different UI elements that are needed in the designs.
An ‘interests’ card categorized by UI elements
In this case, we have one image, one title text, and two other smaller subtexts. A render model represents these fields directly.
With the above modeling, the client layer remains very thin as it simply displays each image/text returned from the API. The clients are unaware of which underlying entity each element represents, as the server is responsible for transforming the data into displayable content.
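A minimal sketch of this render model in Python (field names are illustrative, not LinkedIn’s actual schema):

```python
from dataclasses import dataclass

@dataclass
class EntityCardRenderModel:
    """One image, one title text, and two smaller subtexts, as in the design."""
    image_url: str
    title: str
    subtext1: str
    subtext2: str

def render(model):
    # The client's only job: place each element. No per-entity branching.
    return [model.image_url, model.title, model.subtext1, model.subtext2]

# The same model renders a company, a group, or a profile; the server
# decides what goes in each slot.
card = EntityCardRenderModel(
    image_url="https://example.com/logo.png",
    title="LinkedIn",
    subtext1="Internet",
    subtext2="1,234,567 followers",
)
```

Under the data model approach, by contrast, the client would need a separate transformation for each entity type to produce these same four slots.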
API design with render models
API modeling with render models lives on a spectrum between two extremes of frontend modeling strategy: pure data models and pure view models. With pure data models, different types of content use different models, even if they look the same in the UI. Clients know exactly what entity they are displaying, and most of the business logic is on the clients, so complex product UX can be implemented as needed. Pure view models are heavily templated, and clients have almost no context on what they are actually displaying, with nearly all business logic on the API. In practice, we have moved away from pure view models due to difficulties in supporting complex functionality, such as client animations and client-side consistency support, because of the lack of context on the clients’ end.
Typically, when we use render models, our models have both view model and data model aspects. We prefer to use view modeling most of the time to abstract away most of the view logic on the API and to keep the view layer on the client as thin as possible. We can mix in data models as needed, to support the cases where we need specific context about the data being displayed.
A spectrum of modeling strategies between pure view models and pure data models
To see this concretely, let’s continue our previous example of a FollowableEntity. The member can tap on an entity to begin following the profile, company, or group. As a slightly contrived example, imagine that we perform different client-side actions based on the type of the entity. In such a scenario, the clients need to know the entity’s type, and at first brush it might appear that the render model approach isn’t feasible. However, we can combine these approaches to get the best of both worlds: we continue to use a render model to display all the client data, but embed the data model inside the render model to provide context for making the follow request.
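A sketch of this combination (the URN and field names are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class FollowActionData:
    """Data-model fragment: just enough context to act on the entity."""
    entity_type: str   # e.g. "COMPANY", "PROFILE", or "GROUP"
    entity_urn: str

@dataclass
class FollowableEntityRender:
    """Render model: drives the display; the client doesn't branch on it."""
    title: str
    subtitle: str
    follow_action: FollowActionData

def on_follow_tapped(entity):
    # The client uses only the embedded data model to build the request;
    # all display fields remain opaque to it.
    return f"POST /follows {entity.follow_action.entity_urn}"

entity = FollowableEntityRender(
    title="LinkedIn",
    subtitle="Internet",
    follow_action=FollowActionData("COMPANY", "urn:li:company:1337"),
)
```

The display stays fully server-driven while the embedded `FollowActionData` carries the type-specific context the write path needs.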
Client theming, layout, and accessibility
Clients have the most context about how information will be displayed to users. Understanding the dynamics of client-side control over the UX is an important consideration when we build render models. This is particularly important because clients can alter display settings like theme, layout, screen size, and dynamic font size without requesting new render models from the server.
Properties like colors, local image references, borders, or corner radius are sent using semantic tokens (e.g., color-action instead of blue) from our render models. Our clients maintain a mapping from these semantic tokens to concrete values based on the design language for the specific feature on a given platform (e.g. iOS, Android, etc.). Referencing theme properties with semantic tokens enables our client applications to maintain dynamic control over the theme.
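A sketch of this token resolution on the client (token names and color values are illustrative):

```python
# The server sends semantic tokens like "color-action"; each client keeps
# a per-theme mapping from tokens to concrete values, so the theme can
# change without fetching new render models.
THEMES = {
    "light": {"color-action": "#0A66C2", "color-background": "#FFFFFF"},
    "dark":  {"color-action": "#70B5F9", "color-background": "#1D2226"},
}

def resolve(token, theme):
    return THEMES[theme][token]

# The same render model payload works under either theme:
resolve("color-action", "light")
resolve("color-action", "dark")
```

If the server instead sent concrete values like `blue`, every theme switch would require a round trip, and platform-specific design-language differences could not be honored.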
For the layout, our render models are not intended to dictate the exact layout of the UI because they are not aware of the total available screen space. Instead, the models describe the order, context, and priorities for views, allowing client utilities to ultimately determine how the components should be placed based on available space (screen size and orientation). One way we accomplish this is by referring to the sizes of views by terms like “small” or “large” and allowing clients to apply what that sizing means based on the context and screen size.
It is critical that we maintain the same level of accessibility when our UIs are driven by render models. To do so, we provide accessibility text where necessary in our models, map our render models to components that have accessibility concerns baked in (minimum tap targets), and use semantics instead of specific values when describing sizes, layouts, etc.
Write use cases
One of the most challenging aspects of render models is dealing with write use cases, like filling forms and taking actions on the app (such as following a company, connecting with a person, sending a message, etc.). These use cases need specific data to be written to backends and cannot be modeled in a completely generic way, making it hard to use render models.
Actions are modeled by sending the current state of the action and its other possible states from the server to the clients. This tells the clients exactly what to display. In addition, it allows them to maintain any custom logic to implement a complex UI or perform state-changing follow-up actions.
To support forms, we created a standardized library to read and write forms, with full client infrastructure support out of the box. Similar to how traditional read-based render models attempt to leverage generic fields and models to represent different forms of data, our standardized forms library leverages form components as its backbone to generically represent data in a form by the type of UI element it represents (such as a ‘single line component’ or a ‘toggle component’).
Render models in practice
As we have mentioned above, the consistency of your UI is an important factor when leveraging render models. LinkedIn is built on a semantics-based design system that includes foundations like color and text, as well as shared components such as buttons and labels. Similarly, we have created layers of common UX render models in our API that include foundational and component models, which are built on top of those foundations.
Our foundational models include rich representations of text and images and are backed by client infrastructure that renders these models consistently across LinkedIn. Representing rich text through a common model and render utilities enables us to provide a consistent member experience and maintain our accessibility standards (for instance, we can restrict the usage of underlining in text that is not a link). Our image model and processing ensure that we use the correct placeholders and failure images based on what the actual image being fetched represents (e.g., a member profile). These capabilities of the foundational models are available without any client consumer knowledge of what the actual text or image represents; this information is all encapsulated by the server-driven model and shared client render utilities.
The foundational models can be used on their own or through component models that are built on top of the foundations. They foster re-use and improve our development velocity by providing a common model and shared infrastructure that resolves the component. One example is our common insight model, which combines an image with some insightful text.
A commonly used ‘insight’ model used throughout the site
Over the years, many teams at LinkedIn have taken on large initiatives to re-architect their pages based on render model concepts built on top of these foundational models. No two use cases are exactly alike, but a few of the major use cases include:
The profile page, which is built using a set of render model-based components stitched together to compose the page. For more details on this architecture, see this blog post published earlier this year.
The search results page, built using multiple card render model templates to display different types of search results in a consistent manner. See this blog post for more details.
The main feed, built centered around the consistent rendering of one update with optional components to allow for variability based on different content types.
A feed component designed around several optional components
The notifications tab, which helped standardize 50+ notification types into one simple render model template.
A notifications card designed using a standardized UI template
All of these use cases have seen the key benefits highlighted in this post: simpler client-side logic, a consistent design feel, and faster iteration, development, and experimentation velocity for new features and bug fixes.
Render model tradeoffs
Render models come with their pros and cons, so it is important to properly understand your product use case and vision before implementing them.
With render models, teams are able to create leverage and control when a consistent visual experience, within a defined design boundary, is required across diverse use cases. This is enabled by centralizing logic on the server rather than duplicating logic across clients. It fosters generalized and simpler client-side implementation, with clients requiring less logic to render the user interface since most business logic lives on the server.
Render models also decrease repeated design decisions and client-side work when onboarding use cases that fit an existing visual experience. They foster generalized API schemas, thereby encouraging reuse across different features when the UI is similar to an existing feature.
With more logic pushed to the API and a thin client-side layer, it enables faster experimentation and iteration as changes can be made by only modifying the server code without needing client-side changes on all platforms (iOS, Android, and Web). This is especially advantageous with mobile clients that might have older, but still supported versions in the wild for long periods of time.
Similarly, as most of the business logic is on the server, it is likely that any bugs will be on the server instead of clients. Render models enable faster turnaround time to get these issues fixed and into production, as server-side fixes apply to all clients without needing to wait for a new mobile app release and for users to upgrade.
As mentioned previously, render models rely on consistent UIs. If the same data backs multiple visually-distinct UIs, the reusability of your API drops because the render model needs more complexity to handle the various UI types. If the UI does need to change outside the framework, both the client and server code must be updated, sometimes in invasive ways; by comparison, UI-only changes typically do not require changes to data models. For these reasons, the upfront cost to design and implement render models is often higher, due to the need to define the platform and its boundaries, especially on the client.
Render models are un-opinionated about writes and occasionally require write-only models or additional work to write data. This is contrasted with data models where the same data models can be used in a CRUD format.
Client-side tracking with render models has to be conceived at the design phase, where tracking with data models is more composable from the client. It can be difficult to support use case-specific custom tracking in a generic render model.
Finally, there are some cases where client business logic is unavoidable, such as complex interactions between various user interface elements (animations, or client-data interactions). In such scenarios, render models are likely not the best approach, because the client lacks the specific context needed to implement that business logic.
When to use render models?
Render models are most beneficial when building a platform that requires onboarding many use cases that have a similar UI layout. This is particularly useful when you have multiple types of backend data entities that will all render similarly on clients. Product and design teams must have stable, consistent requirements and they, along with engineering, need to have a common understanding of what kinds of flexibility they will need to support and how to do so.
Additionally, if there are complex product requirements that need involved client-side logic, this may be a good opportunity to push some of the logic to the API. For example, it is often easier to send a computed text from the API directly rather than sending multiple fields that the client then needs to handle in order to construct the text. Being able to consolidate/centralize logic on the server, and thus simplifying clients, makes their behavior more consistent and bug-free.
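As a minimal illustration of that point, the server can compute the final display string once instead of shipping raw fields to every client. The function name and pluralization rules here are assumptions:

```python
# Sketch: the API returns one computed string; clients render it as-is,
# rather than each platform reimplementing the assembly logic.

def computed_insight_text(connection_name: str, mutual_count: int) -> str:
    if mutual_count == 0:
        return "No mutual connections"
    if mutual_count == 1:
        return f"{connection_name} is a mutual connection"
    return f"{connection_name} and {mutual_count - 1} others are mutual connections"

print(computed_insight_text("Ada", 3))  # "Ada and 2 others are mutual connections"
```

Any change to this wording now ships to iOS, Android, and Web simultaneously, without client releases.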
On the flip side, if there is a lack of stability or consistency in products and designs, any large product or design changes are more difficult to implement with render models due to needing schema changes.
Render models are effective when defining generic templates that clients can render. If the product experience does not need to display different variants of data with the same UI, defining such a generic template is nearly impossible, and it is often simpler to use models that are more use case-specific rather than over-generalizing the model designs.
Render models have been adopted across many projects and our best practices have evolved over several years. Many have contributed to the design and implementation behind this modeling approach and we want to give a special shoutout to Nathan Hibner, Zach Moore, Logan Carmody, and Gabriel Csapo for being key drivers in formulating these guidelines and principles formally for the larger LinkedIn community.
(Re)building Threat Detection and Incident Response at LinkedIn
LinkedIn connects and empowers more than 875 million members and over the past few years, has undergone tremendous growth. As an integral part of the Information Security organization at LinkedIn, the Threat Detection and Incident Response team (aka SEEK) defends LinkedIn against computer security threats. As we continue to experience a rapid growth trajectory, the SEEK team decided to reimagine its capabilities and the scale of its monitoring and response solutions. What SEEK set out to do was akin to shooting for the moon, so we named the program “Moonbase.” Moonbase set the stage for significantly more mature capabilities that ultimately led to some impressive results. With Moonbase, we were able to reduce incident investigation times by 50%, expand threat detection coverage by 900%, and reduce our time to detect and contain security incidents from weeks or days to hours.
In this blog, we will discuss how LinkedIn rebuilt its security operations platform and teams, scaled to protect nearly 20,000 employees and more than 875 million members, and our approach and strategy to achieve this objective. In subsequent posts, we will do a deep dive into how we built and scaled threat detection, improved asset visibility, and enhanced our ability to respond to incidents within minutes, with many lessons learned along the way.
Software-defined Security Operations Center (SOC)
While these data points demonstrated some of the return on investment, they didn’t show how much Moonbase improved the quality of life for our security analysts and engineers. With Moonbase, we were able to eliminate the need to manually search through CSV files on a remote file server, or to download logs and other datasets directly for processing and searching. The move to a software-defined and cloud-centric security operations center accelerated the team’s ability to analyze more data in more detail, while reducing the toil of manual acquisition, transformation, and exploration of data.
Having scaled other initiatives in the past, we knew before we started the program rebuild we would need strong guiding principles to keep us on track, within scope, and set reasonable expectations.
Preserving Human Capital
As we thought about pursuing a Software-defined SOC framework for our threat detection and incident response program, preserving our human capital was one of the main driving factors. This principle continues to drive the team, leading to outcomes such as reduced toil through automation, maximized true positives, tight collaboration between our detection engineers and incident responders, and semi-automated triage and investigation activities.
It’s commonly said that security is everyone’s responsibility, and that includes threat detection and incident response. In our experience, centralizing all responsibilities of threat detection and incident response within a single team restricts progress. Democratization can be a force multiplier and a catalyst for rapid progress if approached pragmatically. For example, we developed a user attestation platform where a user is informed about suspicious activity, provided context around why we think it is suspicious, and asked a question of whether they recognize the suspicious activity. Depending on the user’s response and circumstantial factors, a workflow is triggered that could lead to an incident being opened for investigation. This helped reduce toil and the time to contain threats by offering an immediate response to an unusual activity. Democratization has been applied to several other use cases with varying degrees of success from gaining visibility to gathering threat detection ideas.
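The attestation flow described above could be sketched roughly as follows. The function, risk score, and threshold are illustrative assumptions, not the actual platform logic:

```python
# Minimal sketch of the user attestation workflow: the user is asked whether
# they recognize a suspicious activity, and the outcome plus circumstantial
# risk decides whether an incident is opened.

def triage_attestation(user_recognizes: bool, risk_score: float,
                       risk_threshold: float = 0.8) -> str:
    if user_recognizes and risk_score < risk_threshold:
        return "close: benign, user-confirmed"
    if user_recognizes:
        # High-risk activity still gets a human look even if confirmed.
        return "review: user-confirmed but high risk"
    return "open_incident: unrecognized activity"

print(triage_attestation(user_recognizes=False, risk_score=0.3))
```

The failsafe is the threshold branch: a user confirmation alone never closes a high-risk alert.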
Building for the future while addressing the present
The rebuilding of the threat detection and incident response program was done while still running the existing operations. With this approach, we were able to carve out space among the team to work on more strategic initiatives.
Security, scalability, and reliability of infrastructure
As the team increased its visibility tenfold, the demand for data normalization and searching continued to grow. Our platform, tooling, data, and processes must be reliable and scalable at the most critical times, like during an incident or an active attack. The team ensured a focus on resiliency in the face of setbacks such as software bugs or system failures. To get ahead of potential problems, we committed to planning for failure, early warning, and recovery states to ensure our critical detection and response systems were available when most needed. Security of the LinkedIn platform is at the forefront of everything the team does and is etched into our thought process as we build these systems.
As we started thinking about the broad problem space we needed to address, a few fundamental questions came up.
The first was, “What visibility would the threat detection and incident response team need to be effective?” This fundamental question helped us shape our thinking and requirements about how to rebuild the function. There are several ways to approach building out an operational security response and detection engineering team. Whether that’s an incident response firm on retainer, an in-house security operations center, a managed service provider, or a combination of these, there are plenty of approaches that work for many organizations. At LinkedIn, we wanted to work with what we already had, which was a great engineering team and culture, the ability to take some intelligent risks, and the support of our peers to help build and maintain the pipeline and platform we needed to be effective.
The next question that we asked was, “How do we provide coverage for the threats affecting our infrastructure and users?” There are many areas that require attention when building out a Software-defined SOC and it can be difficult to know where to prioritize your efforts. The long-term goal was inspired by the original intent of the SOCless concept, which suggests that almost all incident handling would be automated with a few human checkpoints to ensure quality and accuracy, paired with in-depth investigations as necessary. Given our team’s skills and development-focused culture, the benefit of reducing the need for human interaction in security monitoring operations was an attractive idea. With this, we needed to build and maintain development, logging, and alerting pipelines, decide what data sources to consume, and decide what detections to prioritize.
“What are the most important assets we need to protect?” was the third question we asked. Due to the scope of Moonbase, we had to decide how to deliver the biggest impact on security before we had designed the new processes or completed the deployment. This meant we focused on the most important assets first, commonly known as “Crown Jewels.” Starting with systems we knew and understood, we could more easily test our detection pipeline. Many of the early onboarded data sources and subsequent detections were from the Crown Jewels. Ultimately, this was a quick start for us, but did not yet offer the comprehensive visibility and detection capabilities we needed.
This leads us to the last question we asked, “How can we improve the lives of our incident responders with a small team?” Early detections in the new system, while eventually iterated or tuned, led to classic analyst fatigue. To ease the burden and improve the lives of the responders, we built a simple tuning request functionality, which enables quick feedback and the potential to pause on a lower-quality detection. This principle has enabled us to maintain a reasonable expectation of results from analysts while reducing the potential for additional fatigue, alert overload, and job dissatisfaction. Additionally, we have focused on decentralizing the early phases of the security monitoring process, which has led to significantly less toil and investigation required from analysts. When a class of alerts can be sent with additional context to potential victims or sources of the alert, with the proper failsafes in place the analyst can focus on responding to the true threats. Another example is directly notifying a system owner or responsible team whenever there are unusual or high-risk activities such as a new administrator, new two-factor accounts, credential changes, etc. If the activity is normal, expected, or a scheduled administration activity that can be confirmed by the system owners, again with the proper failsafes, there is no need to directly notify the incident response team. Tickets generated from security alerts are still created and logged for analysis and quality assurance, but reviewing a select number of tickets periodically for accuracy is preferred to manually communicating through cases that likely have no impact or threat.
Phase 1: Foundation and visibility
The first phase of rebuilding our security platform was our end-to-end infrastructure and process development. We already had a small team and some data sources, including some security-specific telemetry (endpoint and network detection technologies), so we focused on finding the right platform first and then on building strong processes and infrastructure around it. Given the rapidly growing business, a lean team, and other constraints, a few base requirements were established for the foundational platform(s).
Low operational overhead
This enables the team to focus on the analysis tasks meant for humans and lets automation do the work of data gathering and transformation. Constantly working through the critical, yet laborious extract, transform, and load processes while trying to hunt for and respond to threats is a quick recipe for burnout.
Built-in scalability and reliability
Reliability is a naturally critical component in any security system, however, with a small team, it must be low maintenance and easy to scale. Additionally, time invested should be primarily focused on building on top of the platform, rather than keeping the platform operational. This is one of the strongest arguments for a cloud solution, as it allows us to work with our internal IT partners and cloud service providers. Through this, we can ensure a reliable backbone for our pipelines and alerting, along with other programs that coordinate these processes to gather data and context.
Rapid time to value
Building a functioning operational team takes time and practice, so when choosing the technology toolset, we need to see results quickly to focus on process improvement and developing other important capabilities.
To minimize context switching and inefficiencies, we want the data, context, and all entity information to be available in our alerting and monitoring system. This naturally leads toward a log management system and security information and event management (SIEM) overlays.
By the end of this phase, we wanted to be able to onboard a handful of data sources to a collaborative platform where we could run searches, write detections, and automatically respond to incidents. This was the beginning of our methodical approach to deploying detections through a traditional CI/CD pipeline.
Capturing data at scale, and in a heterogeneous environment, is a huge challenge on its own. Not all sources can use the same pipelines, and developing and maintaining hundreds of data pipelines is too much administrative overhead for a small team. We started our search for a solution internally and decided on a hybrid approach. LinkedIn already has a robust data streaming and processing infrastructure in the form of Kafka and Samza, which ships trillions of messages (17 trillion messages per day!) and processes hundreds of petabytes of data. Most of the LinkedIn production and supporting infrastructure is already capable of transporting data through these pipelines, which made them an attractive early target. However, LinkedIn has grown organically, and between acquisitions, software-as-a-service applications, different cloud platforms, and other factors, there needed to be other supported and reliable modes of transport. After analysis, the team developed four strategic modes of data collection, including the ultimate fallback of REST APIs provided by the SIEM.
A simplified diagram of data collection pipelines
Kafka
Most of our infrastructure is already capable of writing to Kafka. With reliability and scalability in mind, Kafka was a natural choice as the primary data collection medium.
Syslog collectors
Some infrastructure, like firewalls, cannot write to Kafka natively. We operate a cluster of Syslog collectors for anything that supports Syslog export but not Kafka messages.
Serverless data pipelines
These are employed mostly for collecting logs from SaaS, PaaS, and other cloud platforms.
Data collector REST API
The data collector REST API is the collection mechanism natively supported by the SIEM for accepting logs and storing them against known schemas. This is currently the most commonly used transport mechanism, scaling to hundreds of terabytes.
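The four collection modes described above amount to a per-source routing decision, which could be sketched as follows. The source attributes and mapping are assumptions for illustration:

```python
# Sketch: route each data source onto one of the four transport modes.

def pick_transport(source: dict) -> str:
    if source.get("kafka_capable"):
        return "kafka"                # primary: reliable and scalable
    if source.get("syslog_capable"):
        return "syslog"               # e.g. firewalls with Syslog export
    if source.get("platform") in {"saas", "paas", "cloud"}:
        return "serverless_pipeline"  # SaaS/PaaS/cloud platform logs
    return "siem_rest_api"            # ultimate fallback

print(pick_transport({"platform": "saas"}))
```

Centralizing the routing decision keeps the number of bespoke pipelines, and the administrative overhead they carry, bounded.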
Security infrastructure as code
The team has built deep experience with security tooling and platforms over the years. As we rebuilt our foundational infrastructure, we knew we needed a more comprehensive engineering-first approach. One aspect of this approach to defense infrastructure was treating everything as code to maintain consistency, reliability, and quality. This led to the development of the Moonbase continuous integration and continuous deployment (CI/CD) pipeline, through which the team manages all detections, data source definitions and parsers, automated playbooks and serverless functions, Jupyter notebooks used for investigations, and more.
Having every engineer work on the development and improvement of the detection playbooks, as well as having the applied rigor that comes from the typical CI/CD review and testing stages, gives the team a strong, templateable, and heavily quality-assured playbook for detecting and responding to threats. Simple mistakes or detections that could lead to unnecessary resource usage are easily prevented through the peer review process. Our tailored comprehensive CI validations for each resource type help us to programmatically detect any issues in these artifacts within the PR validation process and improve the deployment success rate significantly. Change tracking is also much easier as pull request IDs can be added to investigation tickets for reference or used in other parts of the tuning process.
The Moonbase CI/CD pipeline is serverless, built on top of Azure DevOps. Azure Repos is a source code management solution similar to GitHub that we use for all our code, with deployment done through Azure Pipelines. Azure Pipelines is a robust CI/CD platform that supports multi-stage deployments, integration with Azure CLI tasks, integration with Azure Repos, PR-triggered CI builds, and more. This also lets us deploy the same resource to multiple SIEM instances within different scopes simply by updating deployment configuration settings. We leverage both to build, validate, and deploy detections and all other deployables, following a trunk-based development model. Artifacts like queries are enriched in the pipeline before deployment. These enrichments help not only with detections, but also with tracking threat detection coverage, metrics for incidents, and more.
While there are many features of a CI/CD pipeline like constant validation, decentralized review, mandatory documentation, and metrics, the templating aspect is one of the stronger points of this detection engineering approach. The pipeline allows any analyst on the team to quickly deploy a new detection or update an existing one. In this screenshot from VSCode you can see an example detection template looking for unexpected activity from inactive service principals (SPN).
The pane on the right shows the actual detection within its template. The entire detection isn’t pictured here to obscure some internal-only information, but this snippet shows what a detection engineer has configured for this specific detection in the orange-colored text. Other items, like how far back to search and the severity of the alert, can be specified in the template.
Additionally, we explicitly configured the data sources needed to execute the detection within the template.
To assist in understanding our coverage of threats, we map each detection to its MITRE ATT&CK ID and Sub ID. Not only does this help us track our coverage, but it also enables us to write additional queries or detections that collate any technique or class of attacks into a single report.
Finally, the query itself is listed (in this case we’re using KQL to search the data).
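Putting the pieces together, a detection artifact with the fields described above, plus a CI-style validation pass, might look like this sketch. The schema, field names, and query are hypothetical stand-ins for the internal template:

```python
from dataclasses import dataclass

# Illustrative detection-as-code artifact mirroring the template fields
# described above; not the internal schema.

@dataclass
class Detection:
    name: str
    severity: str          # e.g. "High"
    lookback: str          # how far back to search
    data_sources: list     # data sources needed to execute the detection
    mitre_technique: str   # MITRE ATT&CK technique ID
    mitre_sub_technique: str
    query: str             # KQL executed by the SIEM

def validate(d: Detection) -> list:
    """A CI-style validation pass returning a list of problems found."""
    problems = []
    if d.severity not in {"Low", "Medium", "High", "Critical"}:
        problems.append("unknown severity")
    if not d.mitre_technique.startswith("T"):
        problems.append("missing ATT&CK mapping")
    if not d.data_sources:
        problems.append("no data sources declared")
    return problems

det = Detection(
    name="Inactive SPN activity",
    severity="High",
    lookback="14d",
    data_sources=["SigninLogs"],
    mitre_technique="T1078",
    mitre_sub_technique="T1078.004",
    query="SigninLogs | where ...",
)
print(validate(det))  # [] when the artifact passes validation
```

Checks of this kind, run in PR validation, are what catch malformed artifacts before deployment.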
Our internally-developed CLI generates boilerplate templates for new artifacts, helps engineers maintain quality and deployment velocity, and improves productivity by letting engineers validate their changes locally.
It is important to create space for innovation. Earlier, the team often found themselves deep in operational toil, going from one issue to another. A very deliberate effort to pause operational work, which yielded a low return on investment, really helped the team make space for innovation. Staying true to the guiding principles of automating what makes sense, the team uses automation heavily to remove as much of the mundane and repeatable work as possible. The following are some automation tools and platforms that the team currently uses. These platforms and the automation have enabled the team to unlock efficiency and quality across the functions.
Automated playbooks and workflow automation
The team leverages a no-code workflow automation platform that comes tightly integrated with the cloud SIEM. It is a highly scalable integration platform that allows building solutions quickly due to the many built-in connectors across the product ecosystem, like Jira, ServiceNow, and several custom tools that the team depends on. Some use cases include alert and incident enrichment, which makes all the context required to triage an alert available to engineers; automated post-alert playbooks for prioritization; automated containment and remediation jobs; and other business workflows. Another use case is automated quality control and post-incident processing, which allows us to learn lessons from previous incidents.
Several complex automation jobs are written as serverless functions. These are easy to deploy, scalable, and have a very low operational overhead. These functions are used for ingesting data from on-prem and online sources, along with more complex automation jobs like the containment of compromised identities.
Resilience and reliability are broad topics, and thus not covered in depth in this blog; however, data and platform reliability are absolutely critical. A change in underlying data can have major cascading effects on detection and response, and data availability during incidents is key, too. Outside of the core monitoring of the platforms, the team relied on three different avenues for signals of degraded performance:
Pipeline health metrics
Looking at metrics like message drop rate, latency, and volume allows the team to quickly identify issues before they impact detection and/or response activities.
Synthetic messages
Sending test messages to ensure the pipeline is functional and that messages get from source to destination within expected timeframes.
Ideal behavior indicators (operational alerts)
Data sources are dynamic in nature. When onboarding a data source, a detection is developed to monitor the data source’s health; for example, sending an alert when the number of unique source hosts decreases by more than a threshold percentage.
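That health check could be sketched as a simple threshold rule; the threshold value and baseline window are illustrative assumptions:

```python
# Sketch: alert when unique source hosts drop by more than a threshold
# percentage versus a baseline period for the same data source.

def source_host_drop_alert(baseline_hosts: int, current_hosts: int,
                           threshold_pct: float = 30.0) -> bool:
    if baseline_hosts == 0:
        return False  # nothing to compare against yet
    drop_pct = (baseline_hosts - current_hosts) / baseline_hosts * 100
    return drop_pct > threshold_pct

print(source_host_drop_alert(200, 120))  # True: a 40% drop exceeds 30%
```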
These are only some of the health checks and alerts we developed to ensure that our systems were logging and reporting properly. We try to graph availability as well as detection efficacy data to assist us in constantly re-evaluating our detection posture and accuracy.
What did we get for all this?
Threat detection and incident response is not a purely operational problem; applying an engineering lens paid off well for our team, and if that is not something you’re doing, we hope this post drives you toward that realization. Leaving these processes to organic growth is unsustainable and violates our guiding principles, which are designed to prevent burnout for the team and ensure a quality response. In the end, we achieved significant success and improvements: we expanded our data collection infrastructure by 10x, going from gigabytes to petabytes; our average time to triage incidents went from hours to minutes; we maintained 99.99% uptime for the platform and connected tooling; and correlation became possible, significantly reducing alert fatigue and improving precision. Additionally, automated response playbooks reduced response toil by automatically handling simple tasks like enrichment, case creation and updates, and additional evidence gathering. We were also able to quickly integrate threat intelligence into our enrichment pipeline, providing much richer context for investigations.
More work to do
What we’ve covered in this blog represents the work of a handful of engineers, product and project managers, analysts, and developers over a relatively short period of time. We learned many valuable lessons over time and have since developed new ways of thinking about detection, automated response, and scaling. Whatever platform a team decides on using for security monitoring, it cannot be overstated how important a solid design and architecture will be to its eventual success.
In a future post, we’ll cover more details on how incident response ties in with our platforms and pipelines and a re-architected approach to detection engineering, including the lessons we learned.
This work would not have been possible without the talented team of people supporting our success.
Core contributors: Vishal Mujumdar, Amir Jalali, Alex Harding, Tom Leahy, Tanvi Kolte, Arthur Kao, Swathi Chandrasekar, Erik Miyake, Prateek Jain, Lalith Machiraju, Gaurav Gupta, Brian Pierini, Sergei Rousakov, Jeff Bollinger and several other partners within the organization.