We are proud to announce the open sourcing of Venice, LinkedIn’s derived data platform that powers more than 1800 of our datasets and is leveraged by over 300 distinct applications. Venice is a high-throughput, low-latency, highly-available, horizontally-scalable, eventually-consistent storage system with first-class support for ingesting the output of batch and stream processing jobs.
Venice entered production at the end of 2016 and has been scaling continuously ever since, gradually replacing many other systems, including Voldemort and several proprietary ones. It is used by the majority of our AI use cases, including People You May Know, Jobs You May Be Interested In, and many more.
The very first commit in the Venice codebase happened a little more than eight years ago, but development picked up in earnest with a fully staffed team in the beginning of 2016, and continues to this day. In total, 42 people have contributed more than 3700 commits to Venice, which currently equates to 67 years of development work.
Throughout all this, the project traversed many phases, increasing in both scope and scale. The team learned many lessons along the way, which led to a plethora of minor fixes and optimizations, along with several rounds of major architectural enhancements. The last such round brought active-active replication to Venice and served as a catalyst for clearing many layers of tech debt. Against this backdrop, we think Venice is in the best position it’s ever been in to propel the industry forward, beyond the boundaries of LinkedIn.
This blog post provides an overview of Venice and some of its use cases, divided across three sections:
- Writing to Venice
- Reading from Venice
- Operating Venice at scale
Writing data seems like a natural place to start, since anyone trying out a new database needs to write data first, before reading it.
How we write into Venice also happens to be one of the significant architectural characteristics that makes it different from most traditional databases. The main thing to keep in mind with Venice is that, since it is a derived data store that is fed from offline and nearline sources, all writes are asynchronous. In other words, there is no support for performing strongly consistent online write requests, as one would get in MySQL or Apache HBase. Venice is not the only system designed this way—another example of a derived data store with the same constraints is Apache Pinot. This architecture enables Venice to achieve very high write throughputs.
In this section, we will go over the various mechanisms provided by Venice to swap an entire dataset, as well as to insert and update rows within a dataset. Along the way, we will also mention some use cases that leverage Venice, and reference previously published information about them.
The most popular way to write data into Venice is using the Push Job, which takes data from an Apache Hadoop grid and writes it into Venice. Currently, LinkedIn has more than 1400 push jobs running daily that are completely automated.
By default, the Venice Push Job will swap the data, meaning that the new dataset version gets loaded in the background without affecting online reads, and when a given data center is done loading, then read requests within that data center get swapped over to the new version of the dataset. The previous dataset version is kept as a backup, and we can quickly roll back to it if necessary. Older backups get purged automatically according to configurable policies. Since this job performs a full swap, it doesn’t make sense to run multiple Full Pushes against the same dataset at the same time, and Venice guards against this, treating it as a user error. Full Pushes are ideal in cases where all or most of the data needs to change, whereas other write methods described in the next sections are better suited if changing only small portions of the data.
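The swap-and-backup lifecycle described above can be sketched as follows. This is a minimal illustration; the class and method names are our own, not Venice APIs.

```python
# Minimal sketch of the Full Push version-swap lifecycle (illustrative
# only): a new dataset version is loaded in the background, reads are
# swapped to it once loading completes, and the previous version is kept
# as a backup for fast rollback.

class VersionedStore:
    def __init__(self, backups_to_keep=1):
        self.versions = {}       # version number -> key/value rows
        self.current = None      # version currently serving reads
        self.backups_to_keep = backups_to_keep

    def full_push(self, rows):
        """Load a new version in the background, then swap reads over."""
        new_version = max(self.versions, default=0) + 1
        self.versions[new_version] = dict(rows)  # background load
        self.current = new_version               # swap reads over
        self._purge_old_backups()
        return new_version

    def rollback(self):
        """Quickly point reads back at the most recent backup version."""
        backups = [v for v in self.versions if v < self.current]
        if backups:
            self.current = max(backups)

    def _purge_old_backups(self):
        backups = sorted(v for v in self.versions if v < self.current)
        for v in backups[: max(0, len(backups) - self.backups_to_keep)]:
            del self.versions[v]

    def get(self, key):
        return self.versions[self.current].get(key)
```

Because the new version is loaded entirely off the read path, a bad push can be undone simply by pointing reads back at the backup version.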
By 2018, the roughly 500 Voldemort Read-Only use cases running in production had been fully migrated to Venice, leveraging its Full Push capability as a drop-in replacement.
A large portion of Full Push use cases are AI workflows that retrain daily or even several times a day. A prominent example of this is People You May Know. More generally, LinkedIn’s AI infrastructure has adopted a virtual feature store architecture where the underlying storage is pluggable, and Venice is the most popular choice of online storage to use with our feature store, Feathr, which was recently open sourced. Tight integration between Venice and Feathr is widely used within LinkedIn, and this integration will be available in a future release of Feathr.
Another way to leverage the Venice Push Job is in incremental mode, in which case the new data gets added to the existing dataset, rather than swapping it wholesale. Data from incremental pushes gets written to both the current version and the backup version (if any) of the dataset. Contrary to the default (Full Push) mode, Venice supports having multiple Incremental Push jobs run in parallel, which is useful in cases where many different offline sources need to be aggregated into a single dataset.
Besides writing from Hadoop, Venice also supports writing from a stream processor. At LinkedIn, we use Apache Samza for stream processing, and that is what Venice is currently integrated with. We are looking forward to hearing from the community about which other stream processors they would like to integrate with Venice. Because Venice has a sufficiently flexible architecture, integrating additional stream processors is possible.
Similar to incremental pushes, data written from stream processors gets applied both to the current and backup versions of a dataset, and multiple stream processing jobs can all write concurrently into the same dataset.
Since 2017, LinkedIn’s A/B testing experimentation platform has leveraged Venice with both full pushes and streaming writes for its member attribute stores. We will come back to this use case in the next section.
There is an alternative way to leverage a stream processor that is similar to a Full Push. If the stream processor is leveraging an input source such as Brooklin that supports bootstrapping the whole upstream dataset, then the owner of the stream processing job can choose to reprocess a snapshot of its whole input. The stream processor’s output is written to Venice as a new dataset version, loaded in the background, and, finally, the read traffic gets swapped over to the new version. This is conceptually similar to the Kappa Architecture.
We’ve described before on the blog our data standardization work, which helps us build taxonomies to power our Economic Graph. These standardization use cases started off using a different database, but have since fully migrated their more than 100 datasets to Venice. Built-in support for reprocessing allows standardization engineers to run a reprocessing job in hours instead of days, and to run more of them at the same time, both of which greatly increase their productivity.
For both incremental pushes and streaming writes, it is not only possible to write entire rows, but also to update specific columns of the affected rows. This is especially useful in cases where multiple sources need to be merged together into the same dataset, with each source contributing different columns.
Another supported flavor of partial updates is collection merging, which enables the user to declaratively add or remove certain elements from a set or a map.
The goal of partial updates is for the user to mutate some of the data within a row without needing to know the rest of that row. This is important because all writes to Venice are asynchronous, and it is therefore not possible to perform “read-modify-write” workloads. Compared to systems which provide direct support for read-modify-write workloads (e.g., via optimistic locking), Venice’s approach of declaratively pushing down the update into the servers is more efficient and better at dealing with large numbers of concurrent writers. It is also the only practical way to make hybrid workloads work (see below).
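The push-down idea can be sketched like this. It is an illustration, not Venice's write API: the writer names only the fields to set and the collection elements to add or remove, and the server merges them into the stored row without any read-modify-write.

```python
# Sketch of a declarative partial update applied server-side
# (illustrative; not Venice's actual API): the writer never needs to
# read the rest of the row it is mutating.

def apply_partial_update(row, set_fields=None, add_to_set=None,
                         remove_from_set=None):
    row = dict(row)  # the stored row, known only to the server
    for field, value in (set_fields or {}).items():
        row[field] = value
    for field, elements in (add_to_set or {}).items():
        row[field] = set(row.get(field, set())) | set(elements)
    for field, elements in (remove_from_set or {}).items():
        row[field] = set(row.get(field, set())) - set(elements)
    return row
```

Because the update is self-describing, many writers can update different columns of the same row concurrently without coordinating with each other.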
Hybrid write workloads
Venice supports hybrid write workloads that include both swapping the entire dataset (via Full Push jobs or reprocessing jobs) and nearline writes (via incremental pushes or streaming). All of these data sources get merged together seamlessly by Venice. The way this works is that after having loaded a new dataset version in the background, but before swapping the reads over to it, there is a replay phase where recent nearline writes get written on top of the new dataset version. When the replay is caught up, then reads get swapped over. How far back the replay should start from is configurable.
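The replay phase can be sketched as follows; this is an illustration of the concept, not Venice internals.

```python
# Sketch of the hybrid replay phase: after the new version finishes
# loading in the background, nearline writes from a configurable point
# in the past are replayed on top of it, and reads are swapped over
# only once the replay has caught up.

def replay_then_swap(new_version_rows, nearline_log, replay_from):
    merged = dict(new_version_rows)  # freshly pushed dataset version
    for timestamp, key, value in nearline_log:
        if timestamp >= replay_from:
            merged[key] = value      # nearline write lands on the new version
    return merged                    # reads swap over to this merged view
```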
This previous blog post on Venice Hybrid goes into more detail. Of note, the bookkeeping implementation details described in that post have since been redesigned.
As of today, including both incremental pushes and streaming writes, there are more than 200 hybrid datasets in production.
Putting it all together
In summary, we can say that Venice supports two data sources: Hadoop and Samza (or eventually any stream processor, if the open source community wants it), which are the columns in the following table. In addition, Venice supports three writing patterns, which are the rows of the following table. This 2×3 matrix results in 6 possible combinations, all of which are supported.
| | Hadoop | Samza |
|---|---|---|
| Full dataset swap | Full Push job | Reprocessing job |
| Insertion of some rows into an existing dataset | Incremental push job | Streaming writes |
| Updates to some columns of some rows | Incremental push job doing partial updates | Streaming writes doing partial updates |
In addition to supporting these individual workloads, Venice also supports hybrid workloads, where any of the “full dataset swap” methods can be merged with any of the insertion and update methods, thanks to the built-in replay mechanism.
In terms of scale, if we add up all of the write methods together, Venice at LinkedIn continuously ingests an average of 14 GB per second, or 39M rows per second. But the average doesn’t tell the full story, because Full Pushes are by nature spiky. At peak throughput, Venice ingests 50 GB per second, or 113M rows per second. These figures add up to 1.2 PB daily and a bit more than three trillion rows daily.
On a per-server basis, we throttle the asynchronous ingestion of data by throughput of both bytes and rows. Currently, our servers with the most aggressive tunings are throttling at 200 MB per second, and 400K rows per second, all the while continuing to serve low latency online read requests without hiccups. Ingestion efficiency and performance is an area of the system that we continuously invest in, and we are committed to keep pushing the envelope on these limits.
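A minimal sketch of dual throttling by bytes and rows is shown below. The budgets match the per-server figures above, but the token-bucket mechanism here is an illustration, not Venice's implementation.

```python
# Illustrative dual throttler: a batch of records is admitted only if
# both the byte budget and the row budget allow it; budgets are
# restored once per second.

class IngestionThrottler:
    def __init__(self, bytes_per_sec, rows_per_sec):
        self.limits = (bytes_per_sec, rows_per_sec)
        self.bytes_budget, self.rows_budget = self.limits

    def refill(self):
        """Called once per second to restore both budgets."""
        self.bytes_budget, self.rows_budget = self.limits

    def try_ingest(self, num_bytes, num_rows):
        """Admit a batch only if both budgets allow it."""
        if num_bytes <= self.bytes_budget and num_rows <= self.rows_budget:
            self.bytes_budget -= num_bytes
            self.rows_budget -= num_rows
            return True
        return False
```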
Now that we know how to populate Venice from various sources, let’s explore how data can be read from it. We will go over the three main read APIs: Single Gets, Batch Gets, and Read Compute. We will then look at a special client mode called Da Vinci.
The simplest way to interact with Venice is as if it were a key-value store: for a given key, retrieve the associated value.
LinkedIn’s A/B testing platform, mentioned earlier, is one of the heavy users of this functionality. Including all use cases, Venice serves 6.25M Single Get queries per second at peak.
It is also possible to retrieve all values associated with a set of keys. This is a popular usage pattern across search and recommendation use cases, where we need to retrieve features associated with a large number of entities. We then score each entity by feeding its features into some ML model, and finally return the top K most relevant entities.
A prominent example of this is the LinkedIn Feed, which makes heavy use of batch gets. Including all Batch Get use cases, Venice serves 45M key lookups per second at peak.
Another way to get data out of Venice is via its declarative read computation DSL, which supports operations such as projecting a subset of columns, as well as executing vector math functions (e.g., dot product, cosine similarity, etc.) on the embeddings (i.e., arrays of floats) stored in the datasets.
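Conceptually, a read-compute query asks the server to do something like the following. This is an illustration only; Venice exposes these operations through a declarative DSL, not these function names.

```python
# Sketch of server-side read compute: project a couple of columns and
# reduce a large stored embedding to a single scalar, so only small
# results cross the wire.
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    return dot_product(a, b) / norm

def read_compute(row, project, embedding_field, query_embedding):
    result = {col: row[col] for col in project}           # projection
    result["score"] = cosine_similarity(row[embedding_field],
                                        query_embedding)  # vector math
    return result
```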
Since 2019, these functions have been used by People You May Know (PYMK) to achieve horizontally scalable online deep learning. While it is common to see large scale deep learning get executed offline (e.g., via TonY), online deep learning tends to be constrained to a smaller scale (i.e., single instance). It is thus fairly novel, in our view, to achieve horizontal scalability in the context of online inference. The data stored in the PYMK dataset includes many large embeddings, and retrieving all of this data into the application is prohibitively expensive, given its latency budget. To solve this performance bottleneck, PYMK declares in its query what operations to perform on the data, and Venice performs it on PYMK’s behalf, returning only the very small scalar results. A previous talk given at QCon AI provides more details on this use case.
At peak, read compute use cases perform 46M key lookups per second, and every single one of these lookups also performs an average of 10 distinct arithmetic operations on various embeddings stored inside the looked up value.
So far, we have been exploring use cases that perform various kinds of remote queries against the Venice backend. These are the original and oldest use cases supported by Venice, and collectively we refer to them as Classical Venice. In 2020, we built an alternative client library called Da Vinci, which supports all of the same APIs we already looked at, but provides a very different cost-performance tradeoff, by serving all calls from local state rather than remote queries.
Da Vinci is a portion of the Venice server code, packaged as a client library. Besides the APIs of the Classical Venice client, it also offers APIs for subscribing to all or some partitions of the dataset. Subscribed partitions are fully loaded into the host application’s local state and eagerly updated by listening to Venice’s internal changelog of updates. In this way, the state inside Da Vinci mirrors that of the Venice server, and the two are eventually consistent with one another. All of the write methods described earlier are fully compatible with both clients.
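The eager-cache behavior can be sketched as follows; the class and method names are hypothetical, not Da Vinci's actual API.

```python
# Illustrative eager cache: subscribe to some partitions, bootstrap
# their contents into local state, then keep that state fresh by
# applying the server's changelog of updates.

class EagerCacheClient:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.subscribed = set()
        self.state = {}

    def partition_of(self, key):
        return key % self.num_partitions  # toy partitioner for integer keys

    def subscribe(self, partitions, bootstrap_rows):
        """Fully load the subscribed partitions into local state."""
        self.subscribed |= set(partitions)
        for key, value in bootstrap_rows.items():
            if self.partition_of(key) in self.subscribed:
                self.state[key] = value

    def on_changelog_event(self, key, value):
        """Eagerly apply a server-side write to keep local state in sync."""
        if self.partition_of(key) in self.subscribed:
            self.state[key] = value

    def get(self, key):
        return self.state.get(key)  # served locally, no remote query
```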
The same dataset can be used in both Classical and Da Vinci fashion at the same time. The application can choose which client to use, and switch between them easily, if cost-performance requirements change over time.
Da Vinci is what we call an “eager cache,” in opposition to the more traditional read-through cache, which needs to be configured with a TTL. The main advantage of a read-through cache is that the size of the local state is configurable, but the drawback is the awkward tradeoff of tuning TTL, which pits freshness against hit rate. An eager cache provides the opposite tradeoff: the size of the local state is not configurable (beyond controlling which partitions to subscribe to), but it delivers a 100% hit rate and optimal freshness.
The easiest way to leverage Da Vinci is to eagerly load all partitions of the dataset of interest. This works well if the total dataset size is small—let’s say a few GB. We have some online applications in the ads space that do this.
Alternatively, Da Vinci also supports custom partitioning. This enables an application to control which partition gets ingested in each instance, and thus reduces the size of the local state in each instance. This is a flexible mechanism, which is usable in a variety of ways, but also more advanced. It is appropriate for use within distributed systems that already have a built-in notion of partitioning. In this way, Da Vinci becomes a building block for other data systems.
Galene, our proprietary search stack, uses Da Vinci to load AI feature data inside its searchers, where it performs ranking on many documents at high throughput. It leverages Da Vinci with its all-in-RAM configuration*, since the state is not that large (e.g., 10 GB per partition) and the latency requirement is very tight. These use cases usually have few partitions (e.g., 16) but can have a very high replication factor (e.g., 60), adding up to a thousand instances all subscribing to the same Venice dataset.
Samza stream processing jobs also leverage Da Vinci to load AI feature data. In this case, the state can be very large, in the tens of TB in total, so many more partitions are used (e.g., one to two thousand) to spread it into smaller chunks. For stream processing, only one replica is needed, or two if a warm standby is desired, so it is the inverse of the Galene use cases. These use cases leverage the SSD-friendly configuration* to be more cost-effective.
* N.B.: Both the all-in-RAM and SSD-friendly configurations are described in more detail in the storage section.
Venice was designed from the ground up for massive scale and operability, and a large portion of development efforts is continuously invested in these important areas. As such, the system has built-in support for, among other things:
- Multi-region support
- Multi-cluster support & multi-tenancy
- Operator-driven configuration
- Self-healing
- Elastic & linear scalability
In the subsequent sections, we’ll discuss each of these in turn.
It is a requirement to have our data systems deployed in all data centers. Not all data systems have the same level of integration between data centers, however. In the case of Venice, the system is globally integrated in such a way that keeping data in-sync across regions is easy, while also leveraging the isolation of each data center to better manage risks.
The Push Job writes to all data centers, and we have Hadoop grids in more than one region, all of which can serve as the source of a push. Nearline writes are also replicated to all regions in an active-active manner, with deterministic conflict resolution in case of concurrent writes.
These are important characteristics for our users, who want data to keep getting refreshed even if the data sources are having availability issues in some of the regions.
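Deterministic conflict resolution can be sketched like this. It is a simplification (Venice's real merge logic is richer): concurrent writes carry a timestamp, and ties are broken by a stable comparison, so every region converges on the same winner regardless of arrival order.

```python
# Simplified sketch of deterministic conflict resolution for
# active-active replication: the highest timestamp wins, and timestamp
# ties are broken deterministically by region id, so all regions
# converge on the same value regardless of delivery order.

def resolve(a, b):
    """Each write is a (timestamp, region_id, value) tuple."""
    return max(a, b)  # compares timestamp first, then region_id as tiebreaker

def converge(writes):
    winner = writes[0]
    for w in writes[1:]:
        winner = resolve(winner, w)
    return winner
```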
Administrative operations, such as dataset creation or schema evolution, are performed asynchronously by each region, so that if a region has availability issues, it can catch up to what it missed when it becomes healthy again. This may seem like a trivial detail, but at LinkedIn we have datasets getting created and dropped every day, from users leveraging our self-service console and also from various systems doing these operations dynamically. As such, it is infeasible for Venice operators to supervise these procedures.
Operators also have the ability to repair the data in an unhealthy region by copying it from a healthy region. In rare cases of correlated hardware failures, this last resort mechanism can save the day.
Each region contains many Venice clusters, some of which are dedicated to very sensitive use cases, while others are heavily multi-tenant. Currently, there are 17 production Venice clusters per region, spanning a total of 4000 physical hosts.
Dataset to cluster assignment is fully operator-driven and invisible to users. Users writing to and reading from Venice need not know what cluster their dataset is hosted on. Moreover, operators can migrate datasets between clusters, also invisibly for users. Migration is used to optimize resource utilization, or to leverage special tunings that some clusters have.
Each Venice cluster may contain many datasets; our most densely populated cluster currently holds more than 500 datasets. In such an environment, it is critical to have mechanisms for preventing noisy neighbors from affecting each other. This is implemented in the form of quotas, of which there are three kinds, all configurable on a per-dataset basis.
There is a storage quota, which is the total amount of persistent storage a dataset is allowed to take. For Full Push jobs, this is verified prior to the push taking place, so that an oversized push doesn’t even begin to enter the system. For other write methods, there is an ongoing verification that can cause ingestion to stall if a dataset is going over its limit, in which case dataset owners are alerted and can request more quota.
There are two kinds of read quotas: one for the maximum number of keys queryable in a Batch Get query, and the other for the maximum rate of keys looked up per second. In other words, N Single Get queries cost the same amount of quota as one Batch Get query of N keys.
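The key-based accounting can be sketched as follows (illustrative names): N Single Gets consume exactly as much quota as one Batch Get of N keys.

```python
# Illustrative read quota: every query is charged by the number of
# keys it looks up, so Single Gets and Batch Gets draw from the same
# per-second budget.

class ReadQuota:
    def __init__(self, keys_per_second):
        self.budget = keys_per_second

    def charge(self, num_keys):
        """Charge a query by its key count; reject it if over quota."""
        if num_keys > self.budget:
            return False
        self.budget -= num_keys
        return True
```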
One of the core principles of LinkedIn’s data infrastructure teams is Invisible Infra, which means that users should be empowered to achieve their business goals without needing to know how the infrastructure helps them fulfill those goals. An example of this is given in the previous multi-cluster section, in the form of operator-driven cluster migration. There are many more such examples.
Venice supports end-to-end compression, which helps minimize storage and transmission costs. Various compression schemes are supported, and operators can choose to enable them on behalf of users. This configuration change takes effect on the next Full Push, which can be triggered by the user’s own schedule, or triggered by the operator. The latest compression scheme we integrated is Zstandard, which we use to build a shared dictionary as part of the Push Job. This dictionary is stored just once per dataset version, thus greatly reducing total storage space in cases where some bits of data are repeated across records. In the most extreme case, we’ve seen a dataset deflate by 40x!
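The shared-dictionary idea can be illustrated with zlib's preset-dictionary support from the Python standard library. Venice itself trains a Zstandard dictionary during the Push Job; this is only an analogy.

```python
# The shared-dictionary idea, illustrated with zlib preset dictionaries:
# bits of data that repeat across records go into a dictionary stored
# once, and each record back-references into it.
import zlib

shared_dict = b'{"memberId": , "title": "Software Engineer", "company": "LinkedIn"}'

def compress(record, zdict):
    c = zlib.compressobj(zdict=zdict)
    return c.compress(record) + c.flush()

def decompress(blob, zdict):
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob)

record = b'{"memberId": 42, "title": "Software Engineer", "company": "LinkedIn"}'
with_dict = compress(record, shared_dict)
without_dict = zlib.compress(record)
# The dictionary is stored once per dataset version, so every record
# that back-references into it compresses far smaller than it would alone.
```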
Venice also supports repartitioning datasets, which is another type of configuration change taking effect on the next Full Push. Partitions are the unit of scale in the system, and as datasets grow, it is sometimes useful to be able to change this parameter. For a given dataset size, having a greater number of smaller partitions enables faster push and rebalancing (more on this later).
Venice has a pluggable storage engine architecture. Previously, Venice leveraged BDB-JE as its storage engine, but since then, we have been able to fully swap this for RocksDB, without needing any user intervention. Currently, we use two different RocksDB formats: block-based (optimized for larger-than-RAM datasets, backed by SSD) and plain table (optimized for all-in-RAM datasets). As usual, changing between these formats is operator-driven. A previous blog post chronicles some of our recent RocksDB-related challenges and how we solved them.
We are looking forward to seeing what the open source community will do with this abstraction, and whether new kinds of use cases may be achievable within the scope of Venice by plugging in new storage engines.
Venice leverages Apache Helix for partition-replica placements in each cluster, which enables the system to react readily to hardware failures. If servers crash, Helix automatically rebalances the partitions they hosted onto healthy servers. In addition, we leverage Helix’s support for rack-awareness, which ensures that at most one replica of a given partition is in the same rack. This is useful so that network switch maintenance or failures don’t cause us to lose many replicas of a partition at once. Rack-awareness is also useful in that it allows us to simultaneously bounce all Venice servers in a rack at once, thus greatly reducing the time required to perform a rolling restart of a full cluster. In these scenarios, Helix is smart enough to avoid triggering a surge of self-healing rebalance operations. A previous post provides more details on Venice’s usage of Helix, although the state model has been redesigned since the post was published.
Elastic & linear scalability
Again thanks to Helix, it is easy to inject extra hardware into a cluster and to rebalance the data into it. The exact same code path that is leveraged more than 1400 times a day to rewrite entire datasets is also leveraged during rebalance operations, thus making it very well-tested and robust. This important design choice also means that whenever we further optimize the write path (which we are more or less continuously doing), it improves both push and rebalance performance, thus giving more elasticity to the system.
Beyond this, another important aspect of scalability is to ensure that the capacity to serve read traffic increases linearly, in proportion to the added hardware. For the more intense use cases described in this post—Feed and PYMK—the traffic is high enough that we benefit from over-replicating these clusters beyond the number of replicas needed for reliability purposes. Furthermore, we group replicas into various sets, so that a given Batch Get or Read Compute query can be guaranteed to hit all partitions within a bounded number of servers. This is important, because otherwise the query fanout size could increase to arbitrarily high amounts and introduce tail latency issues, which would mean the system’s capacity scales sublinearly as the hardware footprint increases. This and many other optimizations are described in more detail in a previous blog post dedicated to supporting large fanout use cases.
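The replica-grouping idea can be sketched as follows (an illustration, not Venice's placement logic): each group of servers covers every partition, so a Batch Get or Read Compute query can be served entirely within one group of bounded size, no matter how large the cluster grows.

```python
# Illustrative replica grouping for bounded fanout: each replica set is
# laid out round-robin so that every group covers all partitions.

def build_groups(num_partitions, replication_factor, servers_per_group):
    groups = []
    for _ in range(replication_factor):
        group = {server: [] for server in range(servers_per_group)}
        for partition in range(num_partitions):
            group[partition % servers_per_group].append(partition)
        groups.append(group)
    return groups

def fanout(group, wanted_partitions):
    """Servers in one group that a query touching `wanted_partitions` hits."""
    return {server for server, partitions in group.items()
            if wanted_partitions & set(partitions)}
```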
It would be difficult to exhaustively cover all aspects of a system as large as Venice in a single blog post. Hopefully, this overview of functionalities and use cases gives a glimpse of the range of possibilities it enables. Below is a quick recap to put things in a broader context.
What is Venice for?
We have found Venice to be a useful component in almost every user-facing AI use case at LinkedIn, from the most basic to cutting edge ones like PYMK. The user-facing aspect is significant, since AI can also be leveraged in offline-only ways, in which case there may be no need for a scalable storage system optimized for serving ML features for online inference workloads.
Besides AI, there are other use cases where we have found it useful to have a derived data store with first-class support for ingesting lots of data in various ways. One example is our use of Venice as the stateful backbone of our A/B testing platform. There are other examples as well, such as caching a massaged view of source-of-truth storage systems, when cost and performance matter, but strong consistency does not.
At the end of the day, Venice is a specialized solution, and there is no intent for it to be anyone’s one and only data system, nor even their first one. Most likely, all organizations need to have some source of truth database, like an RDBMS, a key-value store, or a document store. It is also likely that any organization would already have some form of batch or stream processing infrastructure (or both) before they feel the need to store the output of these processing jobs in a derived data store. Once these other components are in place, Venice may be a worthwhile addition to complement an organization’s data ecosystem.
The road ahead
In this post, we have focused on the larger scale and most mature capabilities of Venice, but we also have a packed roadmap of exciting projects that we will share more details about soon.
We hope this upcoming chapter of the project’s life in the open source community will bring value to our peers and propel the project to the next level.
TopicGC: How LinkedIn cleans up unused metadata for its Kafka clusters
Apache Kafka is an open-source event streaming platform where users can create Kafka topics as data transmission units, and then publish or subscribe to them with producers and consumers. While most Kafka topics are actively used, some are no longer needed because business needs changed or the topics themselves were ephemeral. Kafka itself doesn’t have a mechanism to automatically detect unused topics and delete them. This is usually not a big concern, since a Kafka cluster can hold a considerable number of topics, from hundreds to thousands. However, if the number of topics keeps growing, it will eventually hit a bottleneck and have disruptive effects on the entire Kafka cluster. The TopicGC service was born to solve this exact problem. It was proven to reduce Kafka pressure by deleting ~20% of topics, and improved Kafka’s produce and consume performance by at least 30%.
As the first step, we need to understand how unused topics can cause pressure on Kafka. Like many other storage systems, all Kafka topics have a retention period, meaning that for any unused topics, the data will be purged after a period of time and the topic will become empty. A common question here is, “How could empty topics affect Kafka?”
For topic management purposes, Kafka stores the metadata of topics in multiple places, including Apache ZooKeeper and a metadata cache on every single broker. Topic metadata contains information about partition and replica assignments.
Let’s do a simple calculation: suppose topic A has 25 partitions, with a replication factor of three, meaning each partition has three replicas. Even if topic A is no longer used, Kafka still needs to store the location info of all 75 replicas somewhere.
The effect of metadata pressure may not be that obvious for a single topic, but it can make a big difference if there are a lot of topics. The metadata can consume memory from Kafka brokers and ZooKeeper nodes, and can add payload to metadata requests.
In Kafka, follower replicas periodically send fetch requests to the leader replicas to stay in sync with them. Even for empty topics and partitions, the followers still try to sync with the leaders. Because Kafka does not know whether a topic is permanently unused, it always forces the followers to fetch from the leaders. These redundant fetch requests lead to more fetch threads being created, which can cause extra network, CPU, and memory utilization, and can dominate the request queues, causing other requests to be delayed or even dropped.
The Kafka controller is a broker that coordinates and manages the other brokers in a Kafka cluster. Many Kafka requests have to be handled by the controller, so controller availability is crucial to Kafka.
On controller failover, a new controller has to be elected and take over the role of managing the cluster. The new controller takes some time to load the metadata of the entire cluster from ZooKeeper before it can act as the controller, which is called the controller initialization time. As mentioned earlier in this post, unused topics generate extra metadata that makes controller initialization slower and threatens Kafka availability. Issues can arise when the ZooKeeper response is larger than 1MB. For one of our largest clusters, the ZooKeeper response has already reached 0.75MB, and we anticipate it will hit this bottleneck within two to three years.
While designing TopicGC, we kept a number of requirements in mind. Functionally, we determined that the system must set criteria to determine whether a topic should be deleted, constantly run the garbage collection (GC) process to remove unused topics, and notify users before topic deletion.
Additionally, we identified non-functional requirements for the system. The requirements include ensuring no data loss during topic deletion, removal of all dependencies from unused topics before deletion, and the ability to recover the topic states from service failures.
To satisfy those requirements, we designed TopicGC based on a state machine model, which we will discuss in more detail in the following sections.
Topic state machine
To achieve all of the functional requirements, TopicGC internally runs a state machine. Each topic instance is associated with a state, and several background jobs periodically run and transition topic states as needed. Table 1 describes all possible states in TopicGC.
Table 1: Topic states and descriptions
With the help of internal states, TopicGC follows a certain workflow to delete unused topics.
Figure 1: TopicGC state machine
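The workflow above can be sketched as a small state machine. USED, UNUSED, and INCOMPLETE are state names that appear in this post; the intermediate names below are hypothetical stand-ins for the block-write and disable-mirroring steps, not TopicGC's actual states:

```python
from enum import Enum, auto

class TopicState(Enum):
    USED = auto()
    UNUSED = auto()           # detected as unused; owner gets notified
    WRITE_BLOCKED = auto()    # hypothetical: write access has been sealed
    MIRROR_DISABLED = auto()  # hypothetical: Brooklin mirroring removed
    DELETED = auto()
    INCOMPLETE = auto()       # usage detected mid-workflow; recover to USED

# Forward path of the deletion workflow; any usage detected along the
# way sends the topic to INCOMPLETE, from which it is recovered to USED.
NEXT = {
    TopicState.UNUSED: TopicState.WRITE_BLOCKED,
    TopicState.WRITE_BLOCKED: TopicState.MIRROR_DISABLED,
    TopicState.MIRROR_DISABLED: TopicState.DELETED,
}

def advance(state: TopicState, usage_detected: bool) -> TopicState:
    """Move a topic one step through the GC workflow."""
    if usage_detected:
        return TopicState.INCOMPLETE
    return NEXT.get(state, state)  # terminal states stay put
```

The key design property is that every forward transition is guarded by the usage check, so a topic can never reach DELETED while traffic is still observed.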
Detect topic usage
TopicGC has a background job to find unused topics. Internally, we use the following criteria to determine whether a topic is unused:
- The topic is empty
- There is no BytesIn/BytesOut
- There is no READ/WRITE access event in the past 60 days
- The topic is not newly created in the past 60 days
The TopicGC service fetches the above information from ZooKeeper and a variety of internal data sources, such as our metrics reporting system.
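As a sketch, the four criteria might be combined as follows. The field names are illustrative, not TopicGC's actual schema; only the 60-day window and the criteria themselves come from the post:

```python
from dataclasses import dataclass

LOOKBACK_DAYS = 60  # window used by the criteria in this post

@dataclass
class TopicStats:
    message_count: int           # from ZooKeeper / broker metadata
    bytes_in: float              # from the metrics reporting system
    bytes_out: float
    days_since_last_access: int  # last READ/WRITE access event
    days_since_creation: int

def is_unused(t: TopicStats) -> bool:
    """A topic is a GC candidate only if all four criteria hold."""
    return (
        t.message_count == 0
        and t.bytes_in == 0 and t.bytes_out == 0
        and t.days_since_last_access >= LOOKBACK_DAYS
        and t.days_since_creation >= LOOKBACK_DAYS
    )
```

Note that the criteria are conjunctive: a newly created topic is never eligible, even if it is empty and idle.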
Send email notification
If a topic is in the UNUSED state, TopicGC will trigger the email sending service to find the LDAP user info of the topic owner and send email notifications. This is important because we don’t know whether the topic is temporarily idle or permanently unused. In the former case, once the topic owner receives the email, they can take actions to prevent the topic from being deleted.
Block write access
This is the most important step in the TopicGC workflow. Consider a case where a user produces data at the last second before topic deletion: that data would be lost along with the topic. Thus, avoiding data loss is a crucial challenge for TopicGC. To ensure the TopicGC service doesn’t delete topics that received last-minute writes, we introduced a block-write-access step before topic deletion. Once write access is blocked on a topic, there is no chance that TopicGC can cause data loss.
Note that Kafka doesn’t have a built-in mechanism to “seal” a topic, so we leverage LinkedIn’s internal access-control services, which allow us to control access to all data resources, including Kafka topics. To seal a topic, TopicGC sends a request to the access service to block any read and write access to the topic.
The data of a topic can be mirrored to other clusters via Brooklin, which LinkedIn open-sourced as a framework to stream data between heterogeneous source and destination systems with high reliability and throughput at scale. Before deleting a topic, we need to disable Brooklin mirroring for it. Brooklin can be regarded as a wildcard consumer of all Kafka topics; if a topic is deleted without informing Brooklin, Brooklin will throw exceptions about consuming from non-existent topics. For the same reason, if any other services consume from all topics, TopicGC should tell them to stop consuming from the garbage topics before deletion.
Once all preparations are done, the TopicGC service triggers the topic deletion by calling the Kafka admin client. The deletion process can be customized; in our case, we delete topics in batches. Because topic deletion can add extra load to Kafka clusters, we cap the number of concurrent topic deletions at three.
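A minimal sketch of batched deletion with a three-at-a-time cap; `delete_batch` is a hypothetical stand-in for the Kafka admin client's topic-deletion call, not a real client API:

```python
from typing import Callable, Iterable, List

MAX_CONCURRENT_DELETIONS = 3  # the cap mentioned in this post

def delete_in_batches(topics: Iterable[str],
                      delete_batch: Callable[[List[str]], None]) -> int:
    """Delete topics in batches of at most MAX_CONCURRENT_DELETIONS.

    `delete_batch` stands in for an admin-client call that deletes a
    list of topics; batching bounds the load placed on the cluster.
    """
    topics = list(topics)
    for i in range(0, len(topics), MAX_CONCURRENT_DELETIONS):
        delete_batch(topics[i:i + MAX_CONCURRENT_DELETIONS])
    return len(topics)
```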
Last minute usage check
Before making any actual changes to a topic (including blocking write access, disabling mirroring, and topic deletion), we run a last-minute usage check on the topic. This adds an extra layer of protection against data loss. If TopicGC detects usage at any point during the deletion process, it marks the topic as INCOMPLETE and starts recovering it back to the USED state.
Impact of TopicGC
We launched TopicGC in one of our largest data pipelines, and were able to reduce the topic count by nearly 20%. In the graph, each color represents a distinct Kafka cluster in the pipeline.
Figure 2: Total topic count during TopicGC
Improvement on CPU usage
The topic deletion helps reduce the total fetch requests in the Kafka clusters and, as a result, CPU usage drops significantly after the unused topics are deleted. Total Kafka CPU usage saw about a 30% reduction.
Figure 3: CPU usage improvement by TopicGC
Improvement on client request performance
Due to the CPU usage reduction, Kafka brokers are able to handle the requests more efficiently. As a result, Kafka’s request handling performance improved, and request latencies dropped by up to 40%. Figure 4 shows the decrease in latency for Metadata Request.
Figure 4: Kafka request performance improvement by TopicGC
Since we launched TopicGC to delete unused Kafka topics, it has deleted nearly 20% of our topics and significantly reduced the metadata pressure on our Kafka clusters. From our metrics, client request performance improved by up to 40% and CPU usage dropped by about 30%.
As TopicGC has shown its ability to clean up Kafka clusters and improve Kafka performance, we have decided to launch the service across all of our internal Kafka clusters. We hope TopicGC will help LinkedIn use its Kafka resources more effectively.
Many thanks to Joseph Lin and Lincong Li for coming up with the idea of TopicGC and implementing the original design. We are also grateful for our managers Rohit Rakshe and Adem Efe Gencer, who provided significant support for this project. Last but not least, we want to shout out to the Kafka SRE and Brooklin SRE teams for being helpful partners. With their help, we smoothly launched TopicGC and were able to see these exciting results.
Render Models at LinkedIn
We use render models to pass data to our client applications, describing both the content (text, images, buttons, etc.) and the layout to display on the screen. This means most of this logic is moved out of the clients and centralized on the server, which enables us to deliver new features faster to our members and customers while keeping the experience consistent and responsive to change.
Traditionally, many of our API models tend to be centered around the raw data that’s needed for clients to render a view, which we refer to as data modeling. With this approach, clients own the business logic that transforms the data into a view model to display. Often this business logic layer can grow quite complex over time as more features and use cases need to be supported.
This is where render models come into the picture. A render model is an API modeling strategy where the server returns data that describes the view that will be rendered. Other commonly used terms that describe the same technique are Server Driven User Interface (SDUI), or View Models. With render models, the client business logic tends to be much thinner, because the logic that transforms raw data into view models now resides in the API layer. For any given render model, the client should have a single, shared function that is responsible for generating the UI representation of the render model.
Architectural comparison between data modeling and render modeling
To highlight the core differences in modeling strategy between a render model and data model, let’s walk through a quick example of how we can model the same UI with these two strategies. In the following UI, we want to show a list of entities that contain some companies, groups, and profiles.
An example UI of an ‘interests’ card to display to members
Following the data model approach, we would look at the list as a mix of different entity types (members, companies, groups, etc.) and design a model so that each entity type would contain the necessary information for clients to be able to transform the data into the view shown in the design.
When applying a render model approach, rather than worry about the different entity types we want to support for this feature, we look at the different UI elements that are needed in the designs.
An ‘interests’ card categorized by UI elements
In this case, we have one image, one title text, and two other smaller subtexts. A render model represents these fields directly.
With the above modeling, the client layer remains very thin as it simply displays each image/text returned from the API. The clients are unaware of which underlying entity each element represents, as the server is responsible for transforming the data into displayable content.
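As an illustration, such a render model could look like the following sketch. The type and field names are hypothetical, not LinkedIn's actual schema; the point is that the client's shared renderer displays the fields verbatim, with no knowledge of the underlying entity type:

```python
from dataclasses import dataclass

@dataclass
class EntityCardRenderModel:
    """Hypothetical render model for the 'interests' card: one image,
    one title text, and two smaller subtexts, all display-ready."""
    image_url: str
    title: str
    subtitle: str
    caption: str

def render(card: EntityCardRenderModel) -> str:
    # Shared client renderer: lays out the fields directly, whether the
    # card represents a company, a group, or a profile.
    return f"[{card.image_url}] {card.title} | {card.subtitle} | {card.caption}"
```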
API design with render models
API modeling with render models can live on a spectrum between two extremes of frontend modeling strategy: pure data models and pure view models. With pure data models, different types of content use different models, even if they look the same in the UI. Clients know exactly what entity they are displaying, and most of the business logic lives on the clients, so complex product UX can be implemented as needed. Pure view models are heavily templated; clients have no context on what they are actually displaying, with almost all business logic on the API. In practice, we have moved away from pure view models because the lack of context on the client’s end makes it difficult to support complex functionality, such as client animations and client-side consistency.
Typically, when we use render models, our models have both view model and data model aspects. We prefer to use view modeling most of the time to abstract away most of the view logic on the API and to keep the view layer on the client as thin as possible. We can mix in data models as needed, to support the cases where we need specific context about the data being displayed.
A spectrum of modeling strategies between pure view models and pure data models
To see this concretely, let’s continue our previous example with a FollowableEntity. The member can tap on an entity to begin following the profile, company, or group. As a slightly contrived example, imagine that we perform different client-side actions based on the type of the entity. In such a scenario, the clients need to know the type of the entity, and at first brush it might appear that the render models approach isn’t feasible. However, we can combine these approaches to get the best of both worlds. We can continue to use a render model to display all the client data, but embed the data model inside the render model to provide context for making the follow request.
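A sketch of this hybrid, with hypothetical type names: the render-model fields drive display via the shared renderer, while the embedded data model carries just enough context for the client to issue the right follow request:

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical data models: just enough context for the write path.
@dataclass
class CompanyFollowTarget:
    company_id: str

@dataclass
class ProfileFollowTarget:
    member_id: str

@dataclass
class FollowableEntity:
    # Render-model half: displayed verbatim by the shared client renderer.
    image_url: str
    title: str
    subtitle: str
    # Data-model half: embedded context for making the follow request
    # and for any type-specific client-side action.
    follow_target: Union[CompanyFollowTarget, ProfileFollowTarget]
```

The display path stays generic, while the small embedded data model is the only place the client branches on entity type.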
Client theming, layout, and accessibility
Clients have the most context about how information will be displayed to users. Understanding the dynamics of client-side control over the UX is an important consideration when we build render models. This is particularly important because clients can alter display settings like theme, layout, screen size, and dynamic font size without requesting new render models from the server.
Properties like colors, local image references, borders, or corner radius are sent using semantic tokens (e.g., color-action instead of blue) from our render models. Our clients maintain a mapping from these semantic tokens to concrete values based on the design language for the specific feature on a given platform (e.g. iOS, Android, etc.). Referencing theme properties with semantic tokens enables our client applications to maintain dynamic control over the theme.
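A minimal sketch of the token-resolution idea; the token names and hex values below are made up for illustration and are not LinkedIn's actual design-system tokens:

```python
# Each client platform maintains a mapping from semantic tokens to
# concrete values for the active theme (hypothetical token names).
LIGHT_THEME = {"color-action": "#0a66c2", "color-text": "#000000"}
DARK_THEME = {"color-action": "#70b5f9", "color-text": "#ffffff"}

def resolve(token: str, theme: dict) -> str:
    """Map a semantic token from a render model to a concrete value.

    The server only ever sends the token, so switching themes is a
    purely client-side decision requiring no new server response.
    """
    return theme[token]
```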
For the layout, our render models are not intended to dictate the exact layout of the UI because they are not aware of the total available screen space. Instead, the models describe the order, context, and priorities for views, allowing client utilities to ultimately determine how the components should be placed based on available space (screen size and orientation). One way we accomplish this is by referring to the sizes of views by terms like “small” or “large” and allowing clients to apply what that sizing means based on the context and screen size.
It is critical that we maintain the same level of accessibility when our UIs are driven by render models. To do so, we provide accessibility text where necessary in our models, map our render models to components that have accessibility concerns baked in (minimum tap targets), and use semantics instead of specific values when describing sizes, layouts, etc.
Write use cases
One of the most challenging aspects of render models is dealing with write use cases, like filling forms and taking actions on the app (such as following a company, connecting with a person, sending a message, etc.). These use cases need specific data to be written to backends and cannot be modeled in a completely generic way, making it hard to use render models.
Actions are modeled by sending the current state of the action and its other possible states from the server to the clients. This tells the clients exactly what to display. In addition, it allows them to maintain any custom logic to implement a complex UI or perform state-changing follow-up actions.
To support forms, we created a standardized library to read and write forms, with full client infrastructure support out of the box. Similar to how traditional read-based render models attempt to leverage generic fields and models to represent different forms of data, our standardized forms library leverages form components as its backbone to generically represent data in a form by the type of UI element it represents (such as a ‘single line component’ or a ‘toggle component’).
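A sketch of the component-based form idea, with hypothetical component types: the form is represented generically by the kind of UI element, not by the feature it belongs to:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

@dataclass
class SingleLineComponent:
    label: str
    value: str = ""

@dataclass
class ToggleComponent:
    label: str
    enabled: bool = False

FormComponent = Union[SingleLineComponent, ToggleComponent]

@dataclass
class Form:
    components: List[FormComponent] = field(default_factory=list)

    def values(self) -> Dict[str, object]:
        """Collect what the user entered, keyed by component label."""
        out: Dict[str, object] = {}
        for c in self.components:
            out[c.label] = c.value if isinstance(c, SingleLineComponent) else c.enabled
        return out
```

Because the client infrastructure only needs one renderer per component type, a new form can be onboarded by composing existing components on the server.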
Render models in practice
As we have mentioned above, the consistency of your UI is an important factor when leveraging render models. LinkedIn is built on a semantics-based design system that includes foundations like color and text, as well as shared components such as buttons and labels. Similarly, we have created layers of common UX render models in our API that include foundational and component models, which are built on top of those foundations.
Our foundational models include rich representations of text and images and are backed by client infrastructure that renders these models consistently across LinkedIn. Representing rich text through a common model and render utilities enables us to provide a consistent member experience and maintain our accessibility standards (for instance, we can restrict the usage of underlining in text that is not a link). Our image model and processing ensures that we use the correct placeholders and failure images based on what the actual image being fetched presents (e.g., a member profile). These capabilities of the foundational models are available without any client consumer knowledge of what the actual text or image represents and this information is all encapsulated by the server-driven model and shared client render utilities.
The foundational models can be used on their own or through component models that are built on top of the foundations. They foster re-use and improve our development velocity by providing a common model and shared infrastructure that resolves the component. One example is our common insight model, which combines an image with some insightful text.
A commonly used ‘insight’ model used throughout the site
Over the years, many teams at LinkedIn have taken on large initiatives to re-architect their pages based on render model concepts built on top of these foundational models. No two use cases are exactly alike, but a few of the major use cases include:
- The profile page, which is built using a set of render model-based components stitched together to compose the page. For more details on this architecture, see this blog post published earlier this year.
- The search results page, built using multiple card render model templates to display different types of search results in a consistent manner. See this blog post for more details.
- The main feed, built centered around the consistent rendering of one update, with optional components to allow for variability based on different content types.
A feed component designed around several components
- The notifications tab, which helped standardize 50+ notification types into one simple render model template.
A notifications card designed using a standardized UI template
All of these use cases have seen some of the key benefits highlighted in this post: simpler client-side logic, a consistent design feel, faster iteration, and greater development and experimentation velocity for new features and bug fixes.
Render model tradeoffs
Render models come with their pros and cons, so it is important to properly understand your product use case and vision before implementing them.
With render models, teams gain leverage and control when a consistent visual experience, within a defined design boundary, is required across diverse use cases. This is enabled by centralizing logic on the server rather than duplicating it across clients, which fosters a generalized and simpler client-side implementation: clients require less logic to render the user interface, since most business logic lives on the server.
Render models also reduce repeated design decisions and client-side work when onboarding use cases that fit an existing visual experience. They foster generalized API schemas, encouraging reuse across features whose UI is similar to an existing one.
With more logic pushed to the API and a thin client-side layer, render models enable faster experimentation and iteration: changes can be made by modifying only the server code, without client-side changes on every platform (iOS, Android, and Web). This is especially advantageous for mobile clients, where older but still supported versions may remain in the wild for long periods of time.
Similarly, as most of the business logic is on the server, it is likely that any bugs will be on the server instead of clients. Render models enable faster turnaround time to get these issues fixed and into production, as server-side fixes apply to all clients without needing to wait for a new mobile app release and for users to upgrade.
As mentioned previously, render models rely on consistent UIs. However, if the same data backs multiple, visually-distinct UIs, it reduces the reusability of your API because the render model needs more complexity to be able to handle the various types of UIs. If the UI does need to change outside the framework, the client-code and server code needs to be updated, sometimes in invasive ways. By comparison, UI-only changes typically do not require changes to data models. For some of these reasons, upfront costs to implement and design render models are often higher due to the need to define the platform and its boundaries, especially on the client.
Render models are un-opinionated about writes and occasionally require write-only models or additional work to write data. This is contrasted with data models where the same data models can be used in a CRUD format.
Client-side tracking with render models has to be conceived at the design phase, where tracking with data models is more composable from the client. It can be difficult to support use case-specific custom tracking in a generic render model.
Finally, there are some cases where client business logic is unavoidable such as in cases with complex interactions between various user interface elements. These could be animations or client-data interactions. In such scenarios, render models are likely not the best approach as, without the specific context, it becomes difficult to have any client-side business logic.
When to use render models?
Render models are most beneficial when building a platform that requires onboarding many use cases that have a similar UI layout. This is particularly useful when you have multiple types of backend data entities that will all render similarly on clients. Product and design teams must have stable, consistent requirements and they, along with engineering, need to have a common understanding of what kinds of flexibility they will need to support and how to do so.
Additionally, if there are complex product requirements that need involved client-side logic, this may be a good opportunity to push some of the logic to the API. For example, it is often easier to send a computed text from the API directly rather than sending multiple fields that the client then needs to handle in order to construct the text. Being able to consolidate/centralize logic on the server, and thus simplifying clients, makes their behavior more consistent and bug-free.
On the flip side, if there is a lack of stability or consistency in products and designs, any large product or design changes are more difficult to implement with render models due to needing schema changes.
Render models are effective when defining generic templates that clients can render. If the product experience does not need to display different variants of data with the same UI, it would be nearly impossible to define such a generic template, and would often be simpler to use models that are more use case-specific rather than over-generalizing the model designs.
Render models have been adopted across many projects, and our best practices have evolved over several years. Many have contributed to the design and implementation behind this modeling approach, and we want to give a special shoutout to Nathan Hibner, Zach Moore, Logan Carmody, and Gabriel Csapo for being key drivers in formulating these guidelines and principles formally for the larger LinkedIn community.
(Re)building Threat Detection and Incident Response at LinkedIn
LinkedIn connects and empowers more than 875 million members and, over the past few years, has undergone tremendous growth. As an integral part of the Information Security organization at LinkedIn, the Threat Detection and Incident Response team (aka SEEK) defends LinkedIn against computer security threats. As we continue on a rapid growth trajectory, the SEEK team decided to reimagine its capabilities and the scale of its monitoring and response solutions. What SEEK set out to do was akin to shooting for the moon, so we named the program “Moonbase.” Moonbase set the stage for significantly more mature capabilities that ultimately led to some impressive results. With Moonbase, we were able to reduce incident investigation times by 50%, expand threat detection coverage by 900%, and reduce our time to detect and contain security incidents from weeks or days to hours.
In this blog, we will discuss how LinkedIn rebuilt its security operations platform and teams, scaled to protect nearly 20,000 employees and more than 875 million members, and our approach and strategy to achieve this objective. In subsequent posts, we will do a deep dive into how we built and scaled threat detection, improved asset visibility, and enhanced our ability to respond to incidents within minutes, with many lessons learned along the way.
Software-defined Security Operations Center (SOC)
While these data points demonstrated some of the return on the investment, they didn’t show how much scaling to Moonbase improved the quality of life for our security analysts and engineers. With Moonbase, we were able to eliminate the need to manually search through CSV files on a remote file server, or the need to download logs and other datasets directly for processing and searching. The move to a software-defined and cloud-centric security operations center accelerated the team’s ability to analyze more data in more detail, while reducing the toil of manual acquisition, transformation, and exploration of data.
Having scaled other initiatives in the past, we knew before we started the program rebuild we would need strong guiding principles to keep us on track, within scope, and set reasonable expectations.
Preserving Human Capital
As we thought about pursuing a Software-defined SOC framework for our threat detection and incident response program, preserving our human capital was one of the main driving factors. This principle continues to drive the team, leading to outcomes such as reduced toil through automation, maximized true positives, tight collaboration between our detection engineers and incident responders, and semi-automated triage and investigation activities.
It’s commonly said that security is everyone’s responsibility, and that includes threat detection and incident response. In our experience, centralizing all responsibilities for threat detection and incident response within a single team restricts progress. Democratization can be a force multiplier and a catalyst for rapid progress if approached pragmatically. For example, we developed a user attestation platform where a user is informed about suspicious activity, given context around why we think it is suspicious, and asked whether they recognize the activity. Depending on the user’s response and circumstantial factors, a workflow is triggered that could lead to an incident being opened for investigation. This helped reduce toil and the time to contain threats by offering an immediate response to unusual activity. Democratization has been applied to several other use cases with varying degrees of success, from gaining visibility to gathering threat detection ideas.
Building for the future while addressing the present
The rebuilding of the threat detection and incident response program was done while still running the existing operations. With this approach, we were able to carve out space among the team to work on more strategic initiatives.
Security, scalability, and reliability of infrastructure
As the team increased its visibility tenfold, the demand for data normalization and searching continued to grow. Our platform, tooling, data, and processes must be reliable and scalable at the most critical times, like during an incident or an active attack. The team ensured a focus on resiliency in the face of setbacks such as software bugs or system failures. To get ahead of potential problems, we committed to planning for failure, early warning, and recovery states to ensure our critical detection and response systems were available when most needed. Security of the LinkedIn platform is at the forefront of everything the team does and is etched into our thought process as we build these systems.
As we started thinking about the broad problem space we needed to address, a few fundamental questions came up.
The first was, “What visibility would the threat detection and incident response team need to be effective?” This fundamental question helped us shape our thinking and requirements about how to rebuild the function. There are several ways to approach building out an operational security response and detection engineering team. Whether that’s an incident response firm on retainer, an in-house security operations center, a managed service provider, or a combination of both, there are plenty of approaches that work for many organizations. At LinkedIn, we wanted to work with what we already had, which was a great engineering team and culture, the ability to take some intelligent risks, and the support of our peers to help build and maintain the pipeline and platform we needed to be effective.
The next question that we asked was, “How do we provide coverage for the threats affecting our infrastructure and users?” There are many areas that require attention when building out a Software-defined SOC, and it can be difficult to know where to prioritize your efforts. The long-term goal was inspired by the original intent of the SOCless concept, which suggests that almost all incident handling can be automated, with a few human checkpoints to ensure quality and accuracy, paired with in-depth investigations as necessary. Given our team’s skills and development-focused culture, reducing the need for human interaction in security monitoring operations was an attractive idea. With this, we needed to build and maintain development, logging, and alerting pipelines, decide what data sources to consume, and decide what detections to prioritize.
“What are the most important assets we need to protect?” was the third question we asked. Due to the scope of Moonbase, we had to decide how to deliver the biggest impact on security before we had designed the new processes or completed the deployment. This meant we focused on the most important assets first, commonly known as “Crown Jewels.” Starting with systems we knew and understood, we could more easily test our detection pipeline. Many of the early onboarded data sources and subsequent detections were from the Crown Jewels. Ultimately, this was a quick start for us, but did not yet offer the comprehensive visibility and detection capabilities we needed.
This leads us to the last question we asked: “How can we improve the lives of our incident responders with a small team?” Early detections in the new system, while eventually iterated on or tuned, led to classic analyst fatigue. To ease the burden on responders, we built a simple tuning request function, which enables quick feedback and the option to pause a lower-quality detection. This has let us maintain reasonable expectations of analysts while reducing the potential for fatigue, alert overload, and job dissatisfaction. Additionally, we have focused on decentralizing the early phases of the security monitoring process, which has significantly reduced the toil and investigation required from analysts. When a class of alerts can be sent, with additional context, to the potential victims or sources of the alert, and with the proper failsafes in place, the analyst can focus on responding to the true threats. Another example is directly notifying a system owner or responsible team whenever there are unusual or high-risk activities, such as a new administrator, new two-factor accounts, or credential changes. If the activity is normal, expected, or a scheduled administration activity that the system owners can confirm, again with the proper failsafes, there is no need to notify the incident response team directly. Tickets generated from security alerts are still created and logged for analysis and quality assurance, but periodically reviewing a select number of tickets for accuracy is preferable to manually working through cases that likely involve no impact or threat.
Phase 1: Foundation and visibility
The first phase of rebuilding our security platform was our end-to-end infrastructure and process development. We already had a small team and some data sources, including some security-specific telemetry (endpoint and network detection technologies), so we focused on finding the right platform first and then on building strong processes and infrastructure around it. Given the rapidly growing business, a lean team, and other constraints, a few base requirements were established for the foundational platform(s).
Low operational overhead
This enables the team to focus on the analysis tasks meant for humans and lets automation do the work of data gathering and transformation. Constantly working through the critical, yet laborious extract, transform, and load processes while trying to hunt for and respond to threats is a quick recipe for burnout.
Built-in scalability and reliability
Reliability is naturally a critical component of any security system; however, with a small team, the system must also be low maintenance and easy to scale. Additionally, time invested should primarily go toward building on top of the platform, rather than keeping the platform operational. This is one of the strongest arguments for a cloud solution, as it allows us to work with our internal IT partners and cloud service providers. Through this, we can ensure a reliable backbone for our pipelines and alerting, along with the other programs that coordinate these processes to gather data and context.
Rapid time to value
Building a functioning operational team takes time and practice, so when choosing the technology toolset, we need to see results quickly to focus on process improvement and developing other important capabilities.
To minimize context switching and inefficiencies, we want the data, context, and all entity information to be available in our alerting and monitoring system. This naturally leads toward a log management system and security information and event management (SIEM) overlays.
By the end of this phase, we wanted to be able to onboard a handful of data sources to a collaborative platform where we could run searches, write detections, and automatically respond to incidents. This was the beginning of our methodical approach to deploying detections through a traditional CI/CD pipeline.
Capturing data at scale, and in a heterogeneous environment, is a huge challenge on its own. Not all sources can use the same pipelines, and developing and maintaining hundreds of data pipelines is too much administrative overhead for a small team. We started our search for a solution internally and decided on a hybrid approach. LinkedIn already has a robust data streaming and processing infrastructure in the form of Kafka and Samza, which ships trillions of messages (17 trillion per day!) and processes hundreds of petabytes of data. Most of LinkedIn's production and supporting infrastructure can already transport data through these pipelines, which made them an attractive early target. However, LinkedIn has grown organically, and between acquisitions, software-as-a-service applications, different cloud platforms, and other factors, we needed additional supported and reliable modes of transport. After analysis, the team developed four strategic modes of data collection, including the ultimate fallback of REST APIs provided by the SIEM.
A simplified diagram of data collection pipelines
Kafka
Most of our infrastructure is already capable of writing to Kafka. With reliability and scalability in mind, Kafka was a natural choice as the primary data collection medium.
Syslog collectors
Some infrastructure, like firewalls, cannot write to Kafka natively. We operate a cluster of Syslog collectors for anything that supports Syslog export but not Kafka messages.
Serverless data pipelines
These are employed mostly for collecting logs from SaaS, PaaS, and other cloud platforms.
SIEM REST API
The data collector REST API is the collection mechanism natively supported by the SIEM for accepting logs and storing them against known schemas. It is currently our most commonly used transport mechanism, scaling to hundreds of terabytes.
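Data collector APIs of this kind typically authenticate each POST with an HMAC-signed header. As a rough sketch of how a log shipper builds that header (modeled on Azure Monitor's HTTP Data Collector API; the workspace ID and key below are placeholders, not our actual configuration):

```python
import base64
import hashlib
import hmac

def build_signature(workspace_id, shared_key, date_rfc1123, content_length,
                    method="POST", content_type="application/json",
                    resource="/api/logs"):
    """Build the SharedKey authorization header for a data collector POST.

    `shared_key` is the base64-encoded workspace key; `date_rfc1123` is the
    value of the x-ms-date header, e.g. "Mon, 01 Jan 2024 00:00:00 GMT".
    """
    # The string-to-sign covers the method, body length, content type,
    # date header, and resource path, so a request cannot be replayed
    # with a modified payload.
    string_to_sign = (f"{method}\n{content_length}\n{content_type}\n"
                      f"x-ms-date:{date_rfc1123}\n{resource}")
    key = base64.b64decode(shared_key)
    digest = hmac.new(key, string_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    return f"SharedKey {workspace_id}:{base64.b64encode(digest).decode()}"
```

The resulting header accompanies the JSON payload, along with a header naming the target schema the SIEM should store the records against.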
Security infrastructure as code
The team has built deep experience with security tooling and platforms over the years. As we rebuilt our foundational infrastructure, we knew we needed a more comprehensive engineering-first approach. One aspect of applying that approach to defense infrastructure was treating everything as code to maintain consistency, reliability, and quality. This led to the development of the Moonbase continuous integration and continuous deployment (CI/CD) pipeline. Through the CI/CD pipeline, the team manages all detections, data source definitions and parsers, automated playbooks, serverless functions, and the Jupyter notebooks used for investigations.
Having every engineer work on developing and improving the detection playbooks, with the rigor that comes from the typical CI/CD review and testing stages, gives the team a strong, templateable, and heavily quality-assured playbook for detecting and responding to threats. Simple mistakes, or detections that could lead to unnecessary resource usage, are easily caught in peer review. Comprehensive CI validations tailored to each resource type programmatically surface issues in these artifacts during PR validation and significantly improve the deployment success rate. Change tracking is also much easier, as pull request IDs can be added to investigation tickets for reference or used in other parts of the tuning process.
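To illustrate what a per-resource CI validation can look like, here is a minimal sketch that checks a detection artifact for the fields a reviewer would otherwise have to eyeball. The field names and severity levels are hypothetical, not Moonbase's actual schema:

```python
# Illustrative schema for a detection artifact; not the actual Moonbase format.
REQUIRED_FIELDS = {"name", "severity", "query", "attack_id", "lookback"}
VALID_SEVERITIES = {"informational", "low", "medium", "high"}

def validate_detection(detection):
    """Return a list of validation errors for one detection artifact.

    An empty list means the artifact passes; anything else fails the
    PR validation stage with an actionable message.
    """
    errors = [f"missing field: {f}"
              for f in sorted(REQUIRED_FIELDS - detection.keys())]
    severity = detection.get("severity")
    if severity is not None and severity not in VALID_SEVERITIES:
        errors.append(f"invalid severity: {severity}")
    if not detection.get("query", "").strip():
        errors.append("query must not be empty")
    return errors
```

Running a check like this on every artifact touched by a pull request turns schema mistakes into fast, local failures instead of broken deployments.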
The Moonbase CI/CD pipeline is serverless, built on top of Azure DevOps. Azure Repos, a source code management solution similar to GitHub, holds all of our code, and deployment is done through Azure Pipelines, a robust CI/CD platform that supports multi-stage deployments, Azure CLI tasks, Azure Repos integration, and PR-triggered CI builds. It also lets us deploy the same resource to multiple SIEM instances within different scopes simply by updating deployment configuration settings. We leverage both to build, validate, and deploy detections and all other deployables, following a trunk-based development model. Artifacts like queries are enriched in the pipeline before deployment; these enrichments help not only with detections but also with tracking threat detection coverage, incident metrics, and more.
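One way such a pre-deployment enrichment step can work is by stamping deployment metadata onto the query itself, so the deployed artifact carries its own provenance. A minimal sketch, with illustrative fields rather than Moonbase's actual enrichment set:

```python
def enrich_query(query, detection_name, attack_id, pr_id):
    """Prepend provenance metadata to a KQL query before deployment.

    KQL uses // for line comments, so the stamped header does not change
    the query's behavior in the SIEM, but it makes the deployed artifact
    traceable back to its detection, ATT&CK technique, and pull request.
    """
    header = (f"// detection: {detection_name}\n"
              f"// mitre-attack: {attack_id}\n"
              f"// deployed-from-pr: {pr_id}\n")
    return header + query
```

Because the metadata travels with the query, coverage reports and incident metrics can be derived from what is actually deployed rather than from a separate inventory that can drift out of date.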
While there are many valuable features of a CI/CD pipeline, like constant validation, decentralized review, mandatory documentation, and metrics, the templating aspect is one of the stronger points of this detection engineering approach. The pipeline allows any analyst on the team to quickly deploy a new detection or update an existing one. In this screenshot from VS Code, you can see an example detection template looking for unexpected activity from inactive service principals (SPNs).
The pane on the right shows the actual detection within its template. The entire detection isn’t pictured here to obscure some internal-only information, but this snippet shows what a detection engineer has configured for this specific detection in the orange-colored text. Other items, like how far back to search and the severity of the alert, can be specified in the template.
Additionally, we explicitly configured the data sources needed to execute the detection within the template.
To assist in understanding our coverage of threats, we map each detection to its MITRE ATT&CK ID and Sub ID. Not only does this help us track our coverage, but it also enables us to write additional queries or detections that collate any technique or class of attacks into a single report.
Finally, the query itself is listed (in this case we’re using KQL to search the data).
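Because every detection carries its ATT&CK ID, collating coverage becomes a simple grouping exercise. A minimal sketch, assuming detections are parsed into dicts with a hypothetical `attack_id` field (the detection names here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical deployed detections; names and fields are illustrative.
DETECTIONS = [
    {"name": "inactive-spn-activity", "severity": "high",
     "attack_id": "T1078.004", "lookback": "14d"},
    {"name": "impossible-travel-signin", "severity": "medium",
     "attack_id": "T1078.004", "lookback": "1d"},
    {"name": "oauth-consent-grant", "severity": "high",
     "attack_id": "T1528", "lookback": "7d"},
]

def coverage_by_technique(detections):
    """Collate detections under their parent MITRE ATT&CK technique.

    Sub-technique IDs like T1078.004 roll up to their parent (T1078),
    so a single report shows every detection covering a technique.
    """
    coverage = defaultdict(list)
    for detection in detections:
        technique = detection["attack_id"].split(".")[0]
        coverage[technique].append(detection["name"])
    return dict(coverage)
```

A report built this way immediately shows both which techniques have overlapping detections and, just as importantly, which tracked techniques have none.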
Our internally-developed CLI generates boilerplate templates for new artifacts and improves productivity by helping engineers validate their changes locally, maintaining both quality and deployment velocity.
It is important to create space for innovation. Earlier, the team often found itself deep in operational toil, going from one issue to the next. A very deliberate effort to pause operational work that yielded a low return on investment helped the team make space for innovation. Staying true to the guiding principle of automating what makes sense, the team uses automation heavily to remove as much mundane, repeatable work as possible. The following are some of the automation tools and platforms the team currently uses; they have unlocked efficiency and quality across our functions.
Automated playbooks and workflow automation
The team leverages a no-code workflow automation platform that comes tightly integrated with the cloud SIEM. It is a highly scalable integration platform that allows building solutions quickly, thanks to many built-in connectors across the product ecosystem, like Jira, ServiceNow, and several custom tools the team depends on. Use cases include alert and incident enrichment, which puts all the context required to triage an alert in front of the engineers; automated post-alert playbooks for prioritization; automated containment and remediation jobs; and other business workflows. Another use case is automated quality control and post-incident processing, which allows us to learn lessons from previous incidents.
Several complex automation jobs are written as serverless functions. These are easy to deploy, scalable, and have a very low operational overhead. These functions are used for ingesting data from on-prem and online sources, along with more complex automation jobs like the containment of compromised identities.
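As a sketch of what the core of such a containment job might look like, here is a plain function that could run in any serverless runtime; `IdentityClient` and its methods are hypothetical stand-ins for a real directory API wrapper, not our actual implementation:

```python
def contain_compromised_identity(identity_client, user_id, ticket_id):
    """Containment steps for a compromised identity: disable sign-in,
    revoke active sessions, and record what was done for the incident
    ticket. `identity_client` is a hypothetical directory API wrapper.
    """
    actions = []
    identity_client.disable_sign_in(user_id)
    actions.append("sign_in_disabled")
    identity_client.revoke_sessions(user_id)
    actions.append("sessions_revoked")
    # Returning a structured result lets the workflow platform attach
    # the containment record to the incident ticket automatically.
    return {"user_id": user_id, "ticket_id": ticket_id, "actions": actions}

class FakeIdentityClient:
    """In-memory stand-in for the directory API, useful for local testing."""
    def __init__(self):
        self.calls = []
    def disable_sign_in(self, user_id):
        self.calls.append(("disable_sign_in", user_id))
    def revoke_sessions(self, user_id):
        self.calls.append(("revoke_sessions", user_id))
```

Keeping the logic as a plain function with an injected client makes the job trivially testable outside the serverless environment, which matters when the code's purpose is locking accounts.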
Resilience and reliability are broad topics and thus not covered fully in this blog; however, data and platform reliability are absolutely critical. A change in the underlying data can have major cascading effects on detection and response, and data availability during incidents is key. Outside of the core monitoring of the platforms, the team relies on three avenues for signals of degraded performance:
Pipeline health metrics
Looking at things like message drop rate, latency, and volume allows the team to quickly identify issues before they impact detection or response activities.
Synthetic messages
Sending test messages to ensure the pipeline is functional and that messages get from source to destination within expected timeframes.
Ideal behavior indicators (operational alerts)
Data sources are dynamic in nature, so when onboarding a data source, we develop a detection to monitor its health; for example, alerting when the number of unique source hosts decreases by more than a threshold percentage.
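That health check boils down to comparing distinct-host counts across two windows. A minimal sketch of the logic, with a hypothetical 30% default threshold:

```python
def source_host_drop_alert(baseline_hosts, current_hosts, threshold_pct=30.0):
    """Return True when the number of unique reporting hosts drops by
    more than threshold_pct relative to the baseline window.

    `baseline_hosts` and `current_hosts` are iterables of host names seen
    in each window; duplicates are collapsed before comparing.
    """
    baseline = len(set(baseline_hosts))
    if baseline == 0:
        return False  # nothing to compare against; handled at onboarding
    drop_pct = (baseline - len(set(current_hosts))) / baseline * 100
    return drop_pct > threshold_pct
```

In practice the same comparison can be expressed directly as a scheduled query in the SIEM, but the shape of the check is the same: count distinct hosts per window, then alert on the relative drop.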
These are only some of the health checks and alerts we developed to ensure that our systems were logging and reporting properly. We try to graph availability as well as detection efficacy data to assist us in constantly re-evaluating our detection posture and accuracy.
What did we get for all this?
Threat detection and incident response is not a purely operational problem, and applying an engineering lens paid off well for our team; if this is not something you're doing, we hope this post drives you toward that realization. Leaving these processes to organic growth is unsustainable and violates our guiding principles, which are designed to prevent burnout for the team and ensure a quality response. In the end, we achieved significant success and improvements. We expanded our data collection infrastructure by orders of magnitude, from gigabytes to petabytes; our average time to triage incidents went from hours to minutes; we maintained 99.99% uptime for the platform and connected tooling; and correlation became possible, significantly reducing alert fatigue and improving precision. Additionally, automated response playbooks reduced response toil by automatically handling simple tasks like enrichment, case creation and updates, and additional evidence gathering. We were also able to quickly integrate threat intelligence into our enrichment pipeline, providing much richer context for investigations.
More work to do
What we’ve covered in this blog represents the work of a handful of engineers, product and project managers, analysts, and developers over a relatively short period of time. We learned many valuable lessons over time and have since developed new ways of thinking about detection, automated response, and scaling. Whatever platform a team decides on using for security monitoring, it cannot be overstated how important a solid design and architecture will be to its eventual success.
In a future post, we’ll cover more details on how incident response ties in with our platforms and pipelines and a re-architected approach to detection engineering, including the lessons we learned.
This work would not have been possible without the talented team of people supporting our success.
Core contributors: Vishal Mujumdar, Amir Jalali, Alex Harding, Tom Leahy, Tanvi Kolte, Arthur Kao, Swathi Chandrasekar, Erik Miyake, Prateek Jain, Lalith Machiraju, Gaurav Gupta, Brian Pierini, Sergei Rousakov, Jeff Bollinger and several other partners within the organization.