Figure 1: Evolving the current data ecosystem to leverage Super Tables
The Super Tables initiative aims to mitigate these challenges, as illustrated in Figure 1. Our goal is to build only a small number of enterprise-grade Super Tables that many downstream users can leverage; each domain may have only a few such highly leveraged tables. By consolidating hundreds of SOTs into these few Super Tables, a large number of SOTs can eventually be deprecated and retired.
- Having one or two Super Tables for a business entity or event (i.e., in a domain) makes it easier to explore and consume data for analytics. It minimizes data duplication and reduces the time needed to find the right data.
Strengthens Reliability & Usability
- Super Tables are materialized with precomputed joins of related business entities or events, consolidating data that is commonly needed for downstream analytics. This obviates the need to perform such joins in many downstream use cases, leading to simpler, more performant downstream flows and better resource utilization.
- Super Tables provide and publish SLA commitments on availability and supportability, backed by proactive data quality checks (structural, semantic, and variance) and a programmatic way of querying current and historical data quality check results.
Improves Change Management
- Super Tables have well-defined governance policies for change management and communication, and a committed change velocity (e.g., monthly deployment of changes). Upstream data producers and downstream consumers (across teams and domains) are both involved in the governance of changes to the STs.
Super Tables design principles
In this section, we highlight several important design principles for Super Tables. We want to emphasize that most of these principles are applicable to any dataset, not just Super Tables.
Building a data product requires a good understanding of the domain and the available data sources. It is important to establish the source of truth and document that information. Choosing the right sources promotes data consistency. For example, if a source is continuously refreshed via streaming, the same metric computed at different times may yield inconsistent results; consumers may not realize this and can be surprised by the outcome. Once sources are identified, we need to look at upstream availability and business requirements so that the ST’s SLA can be established. One may argue that we should consolidate as many datasets into a single ST as possible. However, each added data source increases the resources needed to materialize the ST and can jeopardize its SLA commitment, so a good understanding of how the extra data source will be leveraged downstream (e.g., it is needed to compute a critical metric) is warranted.
Field naming conventions and field groupings are established so that users can easily understand the meaning of each field, and frozen fields (immutable values) are identified. For example, a job poster may switch companies, but the hiring company for the job posting shouldn’t change; it is therefore important to include the (immutable) hiring company information instead of the job poster’s current company. Future flow executions should not change frozen field values. By default, schema changes (both compatible and incompatible) in data sources do not affect the ST. For example, if a new column is added to a source, the column will not appear in the ST by default. Similarly, if an existing source column is deleted, its value is nullified and the ST team is notified. Decoupling the ST’s schema from its sources ensures that a source change will not accidentally break the ST flow. Planned changes are documented and communicated to consumers through a distribution list.
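The decoupling just described can be sketched as a simple projection onto a pinned ST schema; the column names below are hypothetical stand-ins, not the actual JOBS schema:

```python
# Sketch: decoupling a Super Table's schema from source schema changes.
# The ST schema is pinned; source columns are projected into it explicitly,
# so new source columns are ignored and deleted ones are nullified.
# (Hypothetical column names for illustration.)

ST_SCHEMA = ["job_id", "hiring_company_id", "posting_date"]

def project_source_row(row: dict) -> dict:
    """Project a source row onto the pinned ST schema.

    Extra source columns are dropped; missing ones become NULL (None),
    which would also trigger a notification to the ST team in practice.
    """
    projected = {col: row.get(col) for col in ST_SCHEMA}
    missing = [col for col in ST_SCHEMA if col not in row]
    if missing:
        print(f"notify ST team: source dropped columns {missing}")
    return projected

# A new source column ("job_poster_company") does not leak into the ST,
# and a deleted source column ("posting_date") is nullified.
row = {"job_id": 1, "hiring_company_id": 42, "job_poster_company": 99}
print(project_source_row(row))
```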
The retention policy of the ST is well established and published so that downstream consumers are fully aware of the policy and any future changes. Likewise, the retention policies of its data sources are tracked and monitored.
Establishing Upstream SLA Commitment
To meet the ST’s own SLA commitment, the SLA commitments of all its data sources must be established and agreed upon. Any changes are communicated so that proper action can be taken.
Data quality checks on all data sources, as well as on the ST itself, must be put in place. For example, the ST’s primary key column should not contain NULLs or duplicates. Data profiles on fields are performed to identify potential outliers. If a data source has seriously bad data quality, the ST flow may be halted until the quality issue is resolved.
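As a rough illustration of the primary-key checks mentioned above, here is a minimal checker (the column name is hypothetical; this is not LinkedIn’s actual tooling):

```python
# Sketch of the primary-key checks described above: the ST's key column
# must have no NULLs and no duplicates.
from collections import Counter

def check_primary_key(rows, key="job_id"):
    """Return a list of data-quality violations for the key column."""
    violations = []
    values = [r.get(key) for r in rows]
    if any(v is None for v in values):
        violations.append("NULL primary key")
    counts = Counter(v for v in values if v is not None)
    dupes = [v for v, c in counts.items() if c > 1]
    if dupes:
        violations.append(f"duplicate keys: {sorted(dupes)}")
    return violations

# A clean table yields no violations; a dirty one yields both.
print(check_primary_key([{"job_id": 1}, {"job_id": 2}]))
print(check_primary_key([{"job_id": 1}, {"job_id": 1}, {"job_id": None}]))
```

In a real flow, a non-empty violation list would halt materialization until the issue is resolved.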
It is very important to have both dataset and column-level documentation available to users to determine if the ST can be leveraged. Very often users want to understand the sources and how the ST and its fields are derived.
High Availability / Disaster Recovery
An ST aims for 99+% availability. For a daily ST flow, that translates to approximately one SLA miss per quarter. To improve availability, STs can be materialized on multiple clusters. With an active-active configuration, the ST flows are executed independently, and consistency is guaranteed across the two clusters (the JOBS flows run on two production clusters). In case of an SLA miss or disaster recovery, data can be copied from the mirror dataset on the other cluster.
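The availability arithmetic works out as follows: a daily flow at 99% availability misses about 3.65 runs a year, or roughly one per quarter.

```python
# Back-of-the-envelope check of the SLA claim above.
availability = 0.99
runs_per_year = 365

expected_misses_per_year = runs_per_year * (1 - availability)
expected_misses_per_quarter = expected_misses_per_year / 4

print(round(expected_misses_per_year, 2))     # 3.65
print(round(expected_misses_per_quarter, 2))  # 0.91, i.e. ~1 per quarter
```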
ST flows must be monitored closely for any deviation from the norm.
- Flow performance: tracking the flow’s runtime trend can help prevent SLA misses
- Data quality: the sources must adhere to established quality standards, and data quality metrics are shown on dashboards
- Cross cluster: the datasets on multiple clusters must be compared to detect deviations
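The cross-cluster check above can be sketched as a simple deviation test on, say, row counts of the same partition materialized on two clusters; the tolerance value is an illustrative assumption:

```python
# Sketch of cross-cluster deviation detection: flag when the two clusters'
# row counts for the same ST partition differ by more than a tolerance.
def detect_deviation(count_a: int, count_b: int, tolerance: float = 0.001) -> bool:
    """True if the two clusters' row counts deviate beyond `tolerance`."""
    if max(count_a, count_b) == 0:
        return False  # both empty: nothing to compare
    return abs(count_a - count_b) / max(count_a, count_b) > tolerance

print(detect_deviation(1_000_000, 1_000_000))  # False: clusters agree
print(detect_deviation(1_000_000, 998_000))    # True: 0.2% deviation, alert
```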
A governance body (comprising teams from upstream and downstream) is established to ensure that the ST design principles are followed, that the dataflow operation meets its SLA and preserves data quality, and that changes are communicated to consumers. For example, if a downstream user wants to include a new field from a source, the user submits the request to the governance body for evaluation and recommendation. In the past, even a very large field would have been included automatically, blowing up the final dataset size and jeopardizing the delivery SLA; the governance body now weighs all the tradeoffs in its final recommendation. A minimum monthly release cadence is established for accepting change requests so that the ST has the agility to serve business needs.
A brief introduction to two Super Tables
The first Super Table is JOBS. Before the JOBS ST was built, there were a dozen tables and views, built over the years, that each joined a job posting dataset (there is more than one) with different dimensional tables for various analytics, such as job poster information and standardized/normalized job attributes (location, titles, industries, skills, etc.). These tables/views may be created in different formats (such as Apache Avro or Apache ORC) or on different clusters, and they may have different arrival frequencies and times (SLAs). Adding more complexity, there are many different kinds of jobs, some of which may be deleted or archived, and various job listing and posting types. Choosing the right dataset for a particular use case requires the complicated task of understanding and analyzing various data sources, joining the right dimensional datasets, and in some cases repeating joins redundantly. As such, the learning curve is steep.
The JOBS ST combines data elements from 57+ critical data sources owned by multiple LinkedIn teams across different organizations. Totaling 158 columns, the JOBS ST precomputes and combines the information most needed for job-related analyses and insights. Our intent is that the JOBS ST be leveraged extensively for the most critical use cases. The JOBS ST has a daily morning SLA, and the daily flow runs on two different production clusters to provide high availability. Data quality of all data sources, and of JOBS itself, is enforced and monitored continuously.
Leveraging the JOBS ST is easier and more efficient than using its original source datasets. In fact, in many downstream flows, the flow logic is simplified to just scanning the JOBS ST or joining it with one other dimensional table, making the logic more efficient. With JOBS available, the existing dozen job-related tables/views will be deprecated and their consumers migrated to JOBS.
The second Super Table is Ad Events. Before the Ad Events ST was built, there were seven different advertisement (ads) related tables, including ad impressions, clicks, and video views. They have many fields in common, such as campaign and advertiser information. Downstream consumers frequently need to join multiple ads tables with the campaign and advertiser dimension tables to get insights on ad revenue, performance, etc. The duplicated fields and frequent downstream joins added unnecessary storage and computation.
Upon analyzing the commonly joined tables and downstream metrics, we created the Ad Events ST with 150+ columns to provide precomputed information ready for ads insights analysis and reporting.
From designing and implementing the Super Tables, we have learned many valuable lessons. First, we learned that understanding the use cases and usage patterns is crucial in determining the main benefits of a Super Table. Before building a new Super Table, it’s important to look around and see what similar tables are already in place. Sometimes it’s better to strengthen an existing table than to build one from scratch. When weighing these two options, factors to consider include the quality, coverage, support, and usage of the existing tables.
Next, we learned that while building Super Tables, it’s important to identify semantic logic and its owners, to ensure the logic is correct, and to understand how it will evolve over time. Data transformation includes structural logic (e.g., table joins, deduplication, and data type conversion) and semantic logic (e.g., a method of predicting the likelihood of a future purchase based on browsing history). Semantic logic is usually owned by a specific team with deep domain knowledge. Without proper communication and collaboration, semantic logic can become poorly maintained and outdated. A better solution is to separate semantic logic into a different layer (such as Dali Views built on top of the ST), managed by the team with the domain knowledge.
With Super Table SLAs, the stricter the SLA, the less tolerant you can be of issues. This means implementing mechanisms like defensive data loading, safe data transformation, ironclad alerting and escalation, and runtime data quality checks. With softer SLAs, you can tolerate a failure, then triage and resolve the issue; with strict SLAs, sometimes you cannot tolerate a single failure.
Another lesson: in an age where privacy is emphasized, data inaccuracy can have disastrous cascading effects, and lowering costs is at the forefront, reducing redundant datasets is the closest thing to a panacea there is. Often, it is better to align a Super Table to a distinct use case and ensure all requirements are met than to tolerate duplication for the sake of speed. Likewise, consolidating multiple similar datasets (which target different use cases) into a single Super Table is critical.
Lastly, after the release of an ST, one of the critical tasks is migrating users of the legacy SOTs to the new ST. The sooner the migration is complete, the more resources can be freed up for other tasks. We have learned that it is imperative to provide awareness and support to the downstream users who need to perform the migrations. To that end, we created a migration guide that outlines all the impacted SOTs, including detailed column-to-column mappings and suggested validations for ensuring data correctness and quality. The release of the JOBS ST has significantly improved the lives of both the owners of jobs data sources and the downstream consumers. Before the ST, knowledge of jobs business logic was scattered across several data sources owned by different teams, none of which had the full picture, making it extremely difficult for downstream consumers to figure out the right data source for their use cases. Time-consuming communication among teams was usually required, and data could easily be misused. Since the release of the ST, a governance body involving the various stakeholders manages its evolution and maintenance. For instance, when ST consumers requested a new field, the governance body discussed the best design and implementation approach, which involved gathering raw data from the sources and implementing the transformation logic in the ST. The monthly release cadence allows development agility, and JOBS data is integrated with easier access and much less confusion.
Conclusions and future work
Democratization of data authoring brings agility to analytics throughout the LinkedIn community. However, it introduces lots of challenges such as discoverability, redundancy, and inconsistency. We have launched two Super Tables at LinkedIn that address these challenges, and are in the process of identifying and building more Super Tables. Both STs have simplified data discovery by providing the “go-to” tables. They have also simplified downstream logic and hence saved computation resources. The created value is also amplified due to the high-leverage nature of these tables.
The following table summarizes the benefits of building and leveraging Super Tables.
TopicGC: How LinkedIn cleans up unused metadata for its Kafka clusters
Apache Kafka is an open-source event streaming platform where users create Kafka topics as data transmission units, and then publish or subscribe to a topic with producers and consumers. While most Kafka topics are actively used, some are no longer needed because business needs changed or the topics themselves were ephemeral. Kafka itself has no mechanism to automatically detect unused topics and delete them. This is usually not a big concern, since a Kafka cluster can hold a considerable number of topics (hundreds to thousands). However, if the topic count keeps growing, it will eventually hit a bottleneck and have disruptive effects on the entire Kafka cluster. The TopicGC service was born to solve this exact problem. It has been proven to reduce Kafka pressure by deleting ~20% of topics, and it improved Kafka’s produce and consume performance by at least 30%.
As the first step, we need to understand how unused topics can cause pressure on Kafka. Like many other storage systems, all Kafka topics have a retention period, meaning that for any unused topics, the data will be purged after a period of time and the topic will become empty. A common question here is, “How could empty topics affect Kafka?”
For topic management purposes, Kafka stores the metadata of topics in multiple places, including Apache ZooKeeper and a metadata cache on every single broker. Topic metadata contains information of partition and replica assignments.
Let’s do some simple math here: suppose topic A has 25 partitions, with a replication factor of three, meaning each partition has three replicas. Even if topic A is not used anymore, Kafka still needs to store the location info of all 75 replicas somewhere.
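That calculation is just partitions times replication factor, and it compounds quickly across many unused topics:

```python
# The arithmetic above: each topic's metadata covers
# partitions x replication_factor replica assignments.
def replica_count(partitions: int, replication_factor: int) -> int:
    return partitions * replication_factor

print(replica_count(25, 3))  # 75 replicas tracked for topic A alone

# 1,000 similar unused topics would leave 75,000 stale replica
# assignments in ZooKeeper and every broker's metadata cache.
print(1_000 * replica_count(25, 3))  # 75000
```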
The effect of metadata pressure may not be that obvious for a single topic, but it can make a big difference if there are a lot of topics. The metadata can consume memory from Kafka brokers and ZooKeeper nodes, and can add payload to metadata requests.
In Kafka, follower replicas periodically send fetch requests to the leader replicas to stay in sync with the leaders. Even for empty topics and partitions, the followers still try to sync with the leaders. Because Kafka does not know whether a topic is permanently unused, it always forces the followers to fetch from the leaders. These redundant fetch requests lead to more fetch threads being created, which can cause extra network, CPU, and memory utilization, and can dominate the request queues, causing other requests to be delayed or even dropped.
The Kafka controller is a broker that coordinates and manages other brokers in a Kafka cluster. Many Kafka requests have to be handled by the controller, so controller availability is crucial to Kafka.
On controller failover, a new controller has to be elected and take over the role of managing the cluster. The new controller takes some time to load the metadata of the entire cluster from ZooKeeper before it can act as the controller; this is called the controller initialization time. As mentioned earlier in this post, unused topics generate extra metadata that makes controller initialization slower and threatens Kafka availability. Issues can arise when the ZooKeeper response grows larger than 1MB. For one of our largest clusters, the ZooKeeper response has already reached 0.75MB, and we anticipate it will hit this bottleneck within two to three years.
While designing TopicGC, we kept in mind a number of requirements. Functionally, we determined that the system must set criteria to determine whether a topic should be deleted, constantly run the garbage collection (GC) process to remove the unused topics, and notify the user before topic deletion.
Additionally, we identified non-functional requirements for the system. The requirements include ensuring no data loss during topic deletion, removal of all dependencies from unused topics before deletion, and the ability to recover the topic states from service failures.
To satisfy those requirements, we designed TopicGC based on a state machine model, which we will discuss in more detail in the following sections.
Topic state machine
To achieve all of the functional requirements, TopicGC internally runs a state machine. Each topic instance is associated with a state, and several background jobs periodically run and transition topic states as needed. Table 1 describes all possible states in TopicGC.
Table 1: Topic states and descriptions
With the help of internal states, TopicGC follows a certain workflow to delete unused topics.
Figure 1: TopicGC state machine
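As a sketch of how such a state machine might look in code (the state names and allowed transitions here are assumptions based on the workflow described in this post, not the exact contents of Table 1):

```python
# Illustrative TopicGC state machine. State names are assumed.
USED, UNUSED, WRITE_BLOCKED, DELETED, INCOMPLETE = (
    "USED", "UNUSED", "WRITE_BLOCKED", "DELETED", "INCOMPLETE")

TRANSITIONS = {
    USED: {UNUSED},                        # usage detector finds no activity
    UNUSED: {WRITE_BLOCKED, USED},         # block writes, or usage reappears
    WRITE_BLOCKED: {DELETED, INCOMPLETE},  # delete, or last-minute usage seen
    INCOMPLETE: {USED},                    # recover the topic
    DELETED: set(),                        # terminal state
}

def transition(state: str, new_state: str) -> str:
    """Move a topic to `new_state`, rejecting illegal transitions."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

# Happy path: the background jobs walk a topic toward deletion.
state = transition(USED, UNUSED)
state = transition(state, WRITE_BLOCKED)
print(state)  # WRITE_BLOCKED
```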
Detect topic usage
TopicGC has a background job to find unused topics. Internally, we use the following criteria to determine whether a topic is unused:
- The topic is empty
- There is no BytesIn/BytesOut
- There is no READ/WRITE access event in the past 60 days
- The topic is not newly created in the past 60 days
The TopicGC service fetches the above information from ZooKeeper and a variety of internal data sources, such as our metrics reporting system.
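Combining the four criteria into one predicate might look like this; the field names are hypothetical stand-ins for the data fetched from ZooKeeper and our metrics systems:

```python
# Sketch of the unused-topic predicate described above.
IDLE_DAYS = 60  # the 60-day window from the criteria list

def is_unused(topic: dict) -> bool:
    """A topic is unused only if all four criteria hold."""
    return (
        topic["is_empty"]
        and topic["bytes_in"] == 0
        and topic["bytes_out"] == 0
        and topic["days_since_last_access"] > IDLE_DAYS  # no READ/WRITE events
        and topic["age_days"] > IDLE_DAYS                # not newly created
    )

print(is_unused({
    "is_empty": True, "bytes_in": 0, "bytes_out": 0,
    "days_since_last_access": 90, "age_days": 400,
}))  # True
```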
Send email notification
If a topic is in the UNUSED state, TopicGC will trigger the email sending service to find the LDAP user info of the topic owner and send email notifications. This is important because we don’t know whether the topic is temporarily idle or permanently unused. In the former case, once the topic owner receives the email, they can take actions to prevent the topic from being deleted.
Block write access
This is the most important step in the TopicGC workflow. Consider this case: if a user produces some data at the last second before topic deletion, that data will be lost along with the topic. Thus, avoiding data loss is a crucial challenge for TopicGC. To ensure the service doesn’t delete topics that receive last-minute writes, we introduced a block-write-access step before topic deletion. After write access is blocked on the topic, there is no chance that TopicGC can cause data loss.
Note that Kafka doesn’t have a mechanism to “seal” a topic, so we leverage LinkedIn’s internal way to block topic access: we have internal access services that allow us to control access to all data resources, including Kafka topics. To seal a topic, TopicGC sends a request to the access service to block any read and write access to the topic.
The data of a topic can be mirrored to other clusters via Brooklin, a framework open-sourced by LinkedIn for streaming data between heterogeneous source and destination systems with high reliability and throughput at scale. Before deleting a topic, we need to disable its Brooklin mirroring. Brooklin can be regarded as a wildcard consumer of all Kafka topics; if a topic is deleted without informing Brooklin, Brooklin will throw exceptions about consuming from non-existent topics. For the same reason, any other service that consumes from all topics should be told to stop consuming from the garbage topics before deletion.
Once all preparations are done, the TopicGC service triggers the topic deletion by calling the Kafka admin client. The topic deletion process can be customized; in our case, we delete topics in batches. Because topic deletion can introduce extra load on Kafka clusters, we cap the number of concurrent topic deletions at three.
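Batching with a cap of three can be sketched as follows; in the real service, each batch would be handed to the Kafka admin client rather than printed:

```python
# Sketch: split the garbage-topic list into batches of at most three,
# matching the concurrent-deletion cap described above.
def batched(topics, batch_size=3):
    """Yield topics in batches of at most `batch_size` for deletion."""
    for i in range(0, len(topics), batch_size):
        yield topics[i:i + batch_size]

for batch in batched(["t1", "t2", "t3", "t4", "t5"]):
    print(batch)  # each batch is submitted to the Kafka admin client
# ['t1', 't2', 't3']
# ['t4', 't5']
```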
Last minute usage check
Before making any actual changes to the topic (including blocking write access, disabling mirroring, and topic deletion), we run a last-minute usage check for the topic. This adds an extra layer of protection against data loss. If TopicGC detects usage at any point during the deletion process, it marks the topic as INCOMPLETE and starts recovering it back to the USED state.
Impact of TopicGC
We launched TopicGC in one of our largest data pipelines, and were able to reduce the topic count by nearly 20%. In the graph, each color represents a distinct Kafka cluster in the pipeline.
Figure 2: Total topic count during TopicGC
Improvement on CPU usage
The topic deletion helps reduce the total number of fetch requests in the Kafka clusters; as a result, CPU usage drops significantly after the unused topics are deleted. Total Kafka CPU usage saw about a 30% reduction.
Figure 3: CPU usage improvement by TopicGC
Improvement on client request performance
Due to the CPU usage reduction, Kafka brokers are able to handle requests more efficiently. As a result, Kafka’s request handling performance improved, and request latencies dropped by up to 40%. Figure 4 shows the decrease in latency for metadata requests.
Figure 4: Kafka request performance improvement by TopicGC
Since we launched TopicGC to delete unused topics, it has deleted nearly 20% of our topics and significantly reduced the metadata pressure on our Kafka clusters. From our metrics, client request performance improved by up to 40% and CPU usage dropped by about 30%.
As TopicGC has shown its ability to clean up Kafka clusters and improve Kafka performance, we have decided to launch the service to all of our internal Kafka clusters. We are hoping to see that TopicGC can help LinkedIn have a more effective resource usage on Kafka.
Many thanks to Joseph Lin and Lincong Li for coming up with the idea of TopicGC and implementing the original design. We are also grateful for our managers Rohit Rakshe and Adem Efe Gencer, who provided significant support for this project. Last but not least, we want to shout out the Kafka SRE and Brooklin SRE teams for being helpful partners. With their help, we smoothly launched TopicGC and were able to see these exciting results.
Render Models at LinkedIn
We use render models to pass data to our client applications, describing the content (text, images, buttons, etc.) and the layout to display on screen. This moves most of that logic out of the clients and centralizes it on the server, which enables us to deliver new features to our members and customers faster while keeping the experience consistent and staying responsive to change.
Traditionally, many of our API models tend to be centered around the raw data that’s needed for clients to render a view, which we refer to as data modeling. With this approach, clients own the business logic that transforms the data into a view model to display. Often this business logic layer can grow quite complex over time as more features and use cases need to be supported.
This is where render models come into the picture. A render model is an API modeling strategy where the server returns data that describes the view that will be rendered. Other commonly used terms that describe the same technique are Server Driven User Interface (SDUI), or View Models. With render models, the client business logic tends to be much thinner, because the logic that transforms raw data into view models now resides in the API layer. For any given render model, the client should have a single, shared function that is responsible for generating the UI representation of the render model.
Architectural comparison between data modeling and render modeling
To highlight the core differences in modeling strategy between a render model and data model, let’s walk through a quick example of how we can model the same UI with these two strategies. In the following UI, we want to show a list of entities that contain some companies, groups, and profiles.
An example UI of an ‘interests’ card to display to members
Following the data model approach, we would look at the list as a mix of different entity types (members, companies, groups, etc.) and design a model so that each entity type would contain the necessary information for clients to be able to transform the data into the view shown in the design.
When applying a render model approach, rather than worry about the different entity types we want to support for this feature, we look at the different UI elements that are needed in the designs.
An ‘interests’ card categorized by UI elements
In this case, we have one image, one title text, and two other smaller subtexts. A render model represents these fields directly.
With the above modeling, the client layer remains very thin as it simply displays each image/text returned from the API. The clients are unaware of which underlying entity each element represents, as the server is responsible for transforming the data into displayable content.
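A minimal sketch of such a render model (the class and field names are illustrative, not LinkedIn’s actual schema):

```python
# Sketch of the render model described above: the server sends display-ready
# fields (one image, a title, two subtexts) rather than raw entity data.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EntityCardRenderModel:
    image_url: str
    title: str
    subtext1: Optional[str] = None
    subtext2: Optional[str] = None

# The client's single shared render function just lays these fields out;
# it never learns whether the row is a company, a group, or a profile.
card = EntityCardRenderModel(
    image_url="https://media.example/logo.png",
    title="LinkedIn",
    subtext1="Internet",
    subtext2="25M followers",
)
print(card.title)
```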
API design with render models
API modeling with render models can live on a spectrum between the two extremes of frontend modeling strategies, such as pure data models and pure view models. With pure data models, different types of content use different models, even if they look the same on UI. Clients know exactly what entity they are displaying and most of the business logic is on clients, so complex product UX can be implemented as needed. Pure view models are heavily-templated and clients have no context on what they are actually displaying with almost all business logic on the API. In practice, we have moved away from using pure view models due to difficulties in supporting complex functionality, such as client animations and client-side consistency support, due to the lack of context on the clients’ end.
Typically, when we use render models, our models have both view model and data model aspects. We prefer to use view modeling most of the time to abstract away most of the view logic on the API and to keep the view layer on the client as thin as possible. We can mix in data models as needed, to support the cases where we need specific context about the data being displayed.
A spectrum of modeling strategies between pure view models and pure data models
To see this concretely, let’s continue our previous example of a FollowableEntity. The member can tap on an entity to begin following the profile, company, or group. As a slightly contrived example, imagine that we perform different client-side actions based on the type of the entity. In such a scenario, the clients need to know the entity’s type, and at first blush it might appear that the render model approach isn’t feasible. However, we can combine these approaches to get the best of both worlds: we continue to use a render model to display all the client data, but embed the data model inside the render model to provide context for making the follow request.
Client theming, layout, and accessibility
Clients have the most context about how information will be displayed to users. Understanding the dynamics of client-side control over the UX is an important consideration when we build render models. This is particularly important because clients can alter display settings like theme, layout, screen size, and dynamic font size without requesting new render models from the server.
Properties like colors, local image references, borders, or corner radius are sent using semantic tokens (e.g., color-action instead of blue) from our render models. Our clients maintain a mapping from these semantic tokens to concrete values based on the design language for the specific feature on a given platform (e.g. iOS, Android, etc.). Referencing theme properties with semantic tokens enables our client applications to maintain dynamic control over the theme.
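Semantic-token resolution can be sketched as a per-theme lookup table on the client; the token names and hex values below are illustrative assumptions, not LinkedIn’s design-language values:

```python
# Sketch: the render model carries semantic tokens like "color-action";
# each client maps them to concrete values for the active theme.
THEMES = {
    "light": {"color-action": "#0a66c2", "color-text": "#000000"},
    "dark":  {"color-action": "#70b5f9", "color-text": "#ffffff"},
}

def resolve_token(token: str, theme: str) -> str:
    """Map a semantic token from the render model to a concrete value."""
    return THEMES[theme][token]

# The same render model displays correctly when the user switches themes,
# with no new server round trip.
print(resolve_token("color-action", "light"))  # #0a66c2
print(resolve_token("color-action", "dark"))   # #70b5f9
```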
For the layout, our render models are not intended to dictate the exact layout of the UI because they are not aware of the total available screen space. Instead, the models describe the order, context, and priorities for views, allowing client utilities to ultimately determine how the components should be placed based on available space (screen size and orientation). One way we accomplish this is by referring to the sizes of views by terms like “small” or “large” and allowing clients to apply what that sizing means based on the context and screen size.
It is critical that we maintain the same level of accessibility when our UIs are driven by render models. To do so, we provide accessibility text where necessary in our models, map our render models to components that have accessibility concerns baked in (minimum tap targets), and use semantics instead of specific values when describing sizes, layouts, etc.
Write use cases
One of the most challenging aspects of render models is dealing with write use cases, like filling forms and taking actions on the app (such as following a company, connecting with a person, sending a message, etc.). These use cases need specific data to be written to backends and cannot be modeled in a completely generic way, making it hard to use render models.
Actions are modeled by sending the current state of the action and its other possible states from the server to the clients. This tells the clients exactly what to display. In addition, it allows them to maintain any custom logic to implement a complex UI or perform state-changing follow-up actions.
To support forms, we created a standardized library to read and write forms, with full client infrastructure support out of the box. Similar to how traditional read-based render models attempt to leverage generic fields and models to represent different forms of data, our standardized forms library leverages form components as its backbone to generically represent data in a form by the type of UI element it represents (such as a ‘single line component’ or a ‘toggle component’).
Render models in practice
As we have mentioned above, the consistency of your UI is an important factor when leveraging render models. LinkedIn is built on a semantics-based design system that includes foundations like color and text, as well as shared components such as buttons and labels. Similarly, we have created layers of common UX render models in our API that include foundational and component models, which are built on top of those foundations.
Our foundational models include rich representations of text and images and are backed by client infrastructure that renders these models consistently across LinkedIn. Representing rich text through a common model and render utilities enables us to provide a consistent member experience and maintain our accessibility standards (for instance, we can restrict the usage of underlining in text that is not a link). Our image model and processing ensures that we use the correct placeholders and failure images based on what the actual image being fetched presents (e.g., a member profile). These capabilities of the foundational models are available without any client consumer knowledge of what the actual text or image represents and this information is all encapsulated by the server-driven model and shared client render utilities.
The foundational models can be used on their own or through component models that are built on top of the foundations. They foster re-use and improve our development velocity by providing a common model and shared infrastructure that resolves the component. One example is our common insight model, which combines an image with some insightful text.
A commonly used ‘insight’ model used throughout the site
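The insight component described above can be sketched as a composition of foundational models. This is a hedged illustration; the class and field names are hypothetical, not LinkedIn's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TextModel:
    """Foundational rich-text model (attribute spans omitted for brevity)."""
    text: str

@dataclass
class ImageModel:
    """Foundational image model; the placeholder hint drives failure rendering."""
    url: str
    placeholder: str = "default"

@dataclass
class InsightModel:
    """Component model: an image combined with some insightful text."""
    image: ImageModel
    text: TextModel

def render_insight(insight: InsightModel) -> str:
    """A trivial stand-in for the shared client render utility."""
    return f"[img:{insight.image.url}] {insight.text.text}"

insight = InsightModel(
    ImageModel("https://example.com/avatar.png", placeholder="profile"),
    TextModel("12 connections also follow this company"),
)
print(render_insight(insight))
```

Because the component is built from foundational models, any surface that renders an insight inherits the shared text and image handling for free.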
Over the years, many teams at LinkedIn have taken on large initiatives to re-architect their pages based on render model concepts built on top of these foundational models. No two use cases are exactly alike, but a few of the major use cases include:
- The profile page, which is built using a set of render model-based components stitched together to compose the page. For more details on this architecture, see this blog post published earlier this year.
- The search results page, built using multiple card render model templates to display different types of search results in a consistent manner. See this blog post for more details.
- The main feed, built centered around the consistent rendering of one update with optional components to allow for variability based on different content types.

A feed component designed around several optional components

- The notifications tab, which helped standardize 50+ notification types into one simple render model template.
A notifications card designed using a standardized UI template
All of these use cases have seen the key benefits highlighted in this post: simpler client-side logic, a consistent look and feel, and faster iteration, development, and experimentation velocity for new features and bug fixes.
Render model tradeoffs
Render models come with their pros and cons, so it is important to properly understand your product use case and vision before implementing them.
With render models, teams are able to create leverage and control when a consistent visual experience, within a defined design boundary, is required across diverse use cases. This is enabled by centralizing logic on the server rather than duplicating logic across clients. It fosters generalized and simpler client-side implementation, with clients requiring less logic to render the user interface since most business logic lives on the server.
Render models also decrease repeated design decisions and client-side work when onboarding use cases that fit an existing visual experience. They foster generalized API schemas, thereby encouraging reuse across different features when the UI is similar to an existing feature.
With more logic pushed to the API and a thin client-side layer, teams can experiment and iterate faster, as changes can be made by only modifying server code without needing client-side changes on all platforms (iOS, Android, and Web). This is especially advantageous with mobile clients that might have older, but still supported, versions in the wild for long periods of time.
Similarly, as most of the business logic is on the server, it is likely that any bugs will be on the server instead of clients. Render models enable faster turnaround time to get these issues fixed and into production, as server-side fixes apply to all clients without needing to wait for a new mobile app release and for users to upgrade.
As mentioned previously, render models rely on consistent UIs. However, if the same data backs multiple, visually distinct UIs, it reduces the reusability of your API because the render model needs more complexity to handle the various types of UIs. If the UI does need to change outside the framework, the client and server code both need to be updated, sometimes in invasive ways. By comparison, UI-only changes typically do not require changes to data models. For these reasons, upfront costs to implement and design render models are often higher due to the need to define the platform and its boundaries, especially on the client.
Render models are un-opinionated about writes and occasionally require write-only models or additional work to write data. This contrasts with data models, which can typically be used in a full CRUD format.
Client-side tracking with render models has to be conceived at the design phase, whereas tracking with data models is more composable from the client. It can be difficult to support use case-specific custom tracking in a generic render model.
Finally, there are some cases where client business logic is unavoidable such as in cases with complex interactions between various user interface elements. These could be animations or client-data interactions. In such scenarios, render models are likely not the best approach as, without the specific context, it becomes difficult to have any client-side business logic.
When to use render models?
Render models are most beneficial when building a platform that requires onboarding many use cases that have a similar UI layout. This is particularly useful when you have multiple types of backend data entities that will all render similarly on clients. Product and design teams must have stable, consistent requirements and they, along with engineering, need to have a common understanding of what kinds of flexibility they will need to support and how to do so.
Additionally, if there are complex product requirements that need involved client-side logic, this may be a good opportunity to push some of the logic to the API. For example, it is often easier to send computed text from the API directly rather than sending multiple fields that the client then needs to handle in order to construct the text. Being able to consolidate and centralize logic on the server, thus simplifying clients, makes their behavior more consistent and less bug-prone.
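The "computed text" point above can be illustrated with a minimal sketch. The field names are hypothetical: instead of returning raw fields that every client must format, the server computes the final display string once.

```python
def server_response_raw(name: str, mutual_count: int) -> dict:
    # Each client (iOS, Android, Web) must duplicate the formatting rules.
    return {"name": name, "mutualConnections": mutual_count}

def server_response_rendered(name: str, mutual_count: int) -> dict:
    # The server computes the text once; clients just display it.
    if mutual_count == 0:
        text = name
    elif mutual_count == 1:
        text = f"{name} and 1 mutual connection"
    else:
        text = f"{name} and {mutual_count} mutual connections"
    return {"displayText": text}

print(server_response_rendered("Jane Doe", 3))
# {'displayText': 'Jane Doe and 3 mutual connections'}
```

A pluralization fix in the rendered variant ships to all platforms at once; in the raw variant, it would require three client releases.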
On the flip side, if products and designs lack stability or consistency, large product or design changes are more difficult to implement with render models because they require schema changes.
Render models are effective when defining generic templates that clients can render. If the product experience does not need to display different variants of data with the same UI, it would be nearly impossible to define such a generic template, and it is often simpler to use models that are more use case-specific rather than over-generalizing the model designs.
Render models have been adopted across many projects, and our best practices have evolved over several years. Many have contributed to the design and implementation behind this modeling approach, and we want to give a special shoutout to Nathan Hibner, Zach Moore, Logan Carmody, and Gabriel Csapo for being key drivers in formulating these guidelines and principles formally for the larger LinkedIn community.
(Re)building Threat Detection and Incident Response at LinkedIn
LinkedIn connects and empowers more than 875 million members and, over the past few years, has undergone tremendous growth. As an integral part of the Information Security organization at LinkedIn, the Threat Detection and Incident Response team (aka SEEK) defends LinkedIn against computer security threats. As we continue to experience a rapid growth trajectory, the SEEK team decided to reimagine its capabilities and the scale of its monitoring and response solutions. What SEEK set out to do was akin to shooting for the moon, so we named the program “Moonbase.” Moonbase set the stage for significantly more mature capabilities that ultimately led to some impressive results. With Moonbase, we were able to reduce incident investigation times by 50%, expand threat detection coverage by 900%, and reduce our time to detect and contain security incidents from weeks or days to hours.
In this blog, we will discuss how LinkedIn rebuilt its security operations platform and teams, scaled to protect nearly 20,000 employees and more than 875 million members, and our approach and strategy to achieve this objective. In subsequent posts, we will do a deep dive into how we built and scaled threat detection, improved asset visibility, and enhanced our ability to respond to incidents within minutes, with many lessons learned along the way.
Software-defined Security Operations Center (SOC)
While these data points demonstrated some of the return on investment, they didn’t capture how much the move to Moonbase improved the quality of life for our security analysts and engineers. With Moonbase, we were able to eliminate the need to manually search through CSV files on a remote file server, or to download logs and other datasets directly for processing and searching. The move to a software-defined and cloud-centric security operations center accelerated the team’s ability to analyze more data in more detail, while reducing the toil of manual acquisition, transformation, and exploration of data.
Having scaled other initiatives in the past, we knew before we started the program rebuild that we would need strong guiding principles to keep us on track and within scope, and to set reasonable expectations.
Preserving Human Capital
As we thought about pursuing a Software-defined SOC framework for our threat detection and incident response program, preserving our human capital was one of the main driving factors. This principle continues to drive the team, leading to outcomes such as reduced toil through automation, maximized true positives, tight collaboration between our detection engineers and incident responders, and semi-automated triage and investigation activities.
It’s commonly said that security is everyone’s responsibility, and that includes threat detection and incident response. In our experience, centralizing all responsibilities of threat detection and incident response within a single team restricts progress. Democratization can be a force multiplier and a catalyst for rapid progress if approached pragmatically. For example, we developed a user attestation platform where a user is informed about suspicious activity, provided context around why we think it is suspicious, and asked whether they recognize the activity. Depending on the user’s response and circumstantial factors, a workflow is triggered that could lead to an incident being opened for investigation. This helped reduce toil and the time to contain threats by offering an immediate response to unusual activity. Democratization has been applied to several other use cases with varying degrees of success, from gaining visibility to gathering threat detection ideas.
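The attestation workflow described above can be sketched as a simple decision function. This is a hedged illustration with made-up outcome names and thresholds; the real platform's logic and risk factors are not public.

```python
def attestation_outcome(user_recognizes: bool, risk_score: int) -> str:
    """Return the next workflow step for a suspicious-activity attestation.

    risk_score: 0-100, derived from circumstantial factors
    (geo, device, time of day); the thresholds here are illustrative.
    """
    if not user_recognizes:
        return "open_incident"     # user denies the activity: investigate
    if risk_score >= 80:
        return "open_incident"     # recognized, but too risky to auto-close
    if risk_score >= 50:
        return "queue_for_review"  # recognized, moderate risk: spot-check
    return "auto_close"            # recognized and low risk: no analyst toil

print(attestation_outcome(user_recognizes=False, risk_score=10))  # open_incident
```

The benign, low-risk path never reaches an analyst, which is exactly where the toil reduction comes from; the failsafes live in the thresholds and in periodic review of auto-closed cases.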
Building for the future while addressing the present
The rebuilding of the threat detection and incident response program was done while still running the existing operations. With this approach, we were able to carve out space among the team to work on more strategic initiatives.
Security, scalability, and reliability of infrastructure
As the team increased its visibility tenfold, the demand for data normalization and searching continued to grow. Our platform, tooling, data, and processes must be reliable and scalable at the most critical times, like during an incident or an active attack. The team ensured a focus on resiliency in the face of setbacks such as software bugs or system failures. To get ahead of potential problems, we committed to planning for failure, early warning, and recovery states to ensure our critical detection and response systems were available when most needed. Security of the LinkedIn platform is at the forefront of everything the team does and is etched into our thought process as we build these systems.
As we started thinking about the broad problem space we needed to address, a few fundamental questions came up.
The first was, “What visibility would the threat detection and incident response team need to be effective?” This fundamental question helped us shape our thinking and requirements about how to rebuild the function. There are several ways to approach building out an operational security response and detection engineering team. Whether that’s an incident response firm on retainer, an in-house security operations center, a managed service provider, or a combination of these, there are plenty of approaches that work for many organizations. At LinkedIn, we wanted to work with what we already had, which was a great engineering team and culture, the ability to take some intelligent risks, and the support of our peers to help build and maintain the pipeline and platform we needed to be effective.
The next question that we asked was, “How do we provide coverage for the threats affecting our infrastructure and users?” There are many areas that require attention when building out a Software-defined SOC and it can be difficult to know where to prioritize your efforts. The long-term goal was inspired by the original intent of the SOCless concept, which suggests that almost all incident handling would be automated with a few human checkpoints to ensure quality and accuracy, paired with in-depth investigations as necessary. Given our team’s skills and development-focused culture, the benefit of reducing the need for human interaction in security monitoring operations was an attractive idea. With this, we needed to build and maintain development, logging, and alerting pipelines, decide what data sources to consume, and decide what detections to prioritize.
“What are the most important assets we need to protect?” was the third question we asked. Due to the scope of Moonbase, we had to decide how to deliver the biggest impact on security before we had designed the new processes or completed the deployment. This meant we focused on the most important assets first, commonly known as “Crown Jewels.” Starting with systems we knew and understood, we could more easily test our detection pipeline. Many of the early onboarded data sources and subsequent detections were from the Crown Jewels. Ultimately, this was a quick start for us, but did not yet offer the comprehensive visibility and detection capabilities we needed.
This leads us to the last question we asked, “How can we improve the lives of our incident responders with a small team?” Early detections in the new system, though eventually iterated on and tuned, led to classic analyst fatigue. To ease the burden and improve the lives of the responders, we built a simple tuning request functionality, which enables quick feedback and the potential to pause a lower-quality detection. This principle has enabled us to maintain a reasonable expectation of results from analysts while reducing the potential for additional fatigue, alert overload, and job dissatisfaction. Additionally, we have focused on decentralizing the early phases of the security monitoring process, which has led to significantly less toil and investigation required from analysts. When a class of alerts can be sent with additional context to the potential victims or sources of the alert, with the proper failsafes in place, the analyst can focus on responding to true threats. Another example is directly notifying a system owner or responsible team whenever there are unusual or high-risk activities, such as a new administrator, new two-factor accounts, or credential changes. If the activity is normal, expected, or a scheduled administration activity that can be confirmed by the system owners, again with the proper failsafes, there is no need to directly notify the incident response team. Tickets generated from security alerts are still created and logged for analysis and quality assurance, but reviewing a select number of tickets periodically for accuracy is preferable to manually working through cases that likely have no impact or threat.
Phase 1: Foundation and visibility
The first phase of rebuilding our security platform was our end-to-end infrastructure and process development. We already had a small team and some data sources, including some security-specific telemetry (endpoint and network detection technologies), so we focused on finding the right platform first and then on building strong processes and infrastructure around it. Given the rapidly growing business, a lean team, and other constraints, a few base requirements were established for the foundational platform(s).
Low operational overhead
This enables the team to focus on the analysis tasks meant for humans and lets automation do the work of data gathering and transformation. Constantly working through the critical, yet laborious extract, transform, and load processes while trying to hunt for and respond to threats is a quick recipe for burnout.
Built-in scalability and reliability
Reliability is a naturally critical component in any security system; however, with a small team, it must be low-maintenance and easy to scale. Additionally, time invested should be primarily focused on building on top of the platform, rather than keeping the platform operational. This is one of the strongest arguments for a cloud solution, as it allows us to work with our internal IT partners and cloud service providers. Through this, we can ensure a reliable backbone for our pipelines and alerting, along with other programs that coordinate these processes to gather data and context.
Rapid time to value
Building a functioning operational team takes time and practice, so when choosing the technology toolset, we need to see results quickly to focus on process improvement and developing other important capabilities.
To minimize context switching and inefficiencies, we want the data, context, and all entity information to be available in our alerting and monitoring system. This naturally leads toward a log management system and security information and event management (SIEM) overlays.
By the end of this phase, we wanted to be able to onboard a handful of data sources to a collaborative platform where we could run searches, write detections, and automatically respond to incidents. This was the beginning of our methodical approach to deploying detections through a traditional CI/CD pipeline.
Capturing data at scale, and in a heterogeneous environment, is a huge challenge on its own. Not all sources can use the same pipelines, and developing and maintaining hundreds of data pipelines is too much administrative overhead for a small team. We started our search for a solution internally and decided on a hybrid approach. LinkedIn already has a robust data streaming and processing infrastructure in the form of Kafka and Samza, which ships trillions of messages (17 trillion per day!) and processes hundreds of petabytes of data. Most of the LinkedIn production and supporting infrastructure is already capable of transporting data through these pipelines, which made them an attractive early target. However, LinkedIn has grown organically, and between acquisitions, software-as-a-service applications, different cloud platforms, and other factors, other supported and reliable modes of transport were needed. After analysis, the team developed four strategic modes of data collection, including the ultimate fallback of REST APIs provided by the SIEM.
A simplified diagram of data collection pipelines
Kafka
Most of our infrastructure is already capable of writing to Kafka. With reliability and scalability in mind, Kafka was a perfect choice for a primary data collection medium.
Syslog collectors
Some infrastructure, like firewalls, is not capable of writing to Kafka natively. We operate a cluster of Syslog collectors for anything that supports Syslog export but not Kafka messages.
Serverless data pipelines
These are employed mostly for collecting logs from SaaS, PaaS, and other cloud platforms.
SIEM REST API
The data collector REST API is the collection mechanism natively supported by the SIEM for accepting logs and storing them against known schemas. This is currently the most commonly used transport mechanism, scaling to hundreds of terabytes.
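Whatever the transport, records from heterogeneous sources must land in a known schema before they are useful for detection. A minimal sketch of such a normalization step, with illustrative field names (not LinkedIn's actual schema):

```python
from datetime import datetime, timezone

# The common schema every source is mapped onto before storage.
COMMON_FIELDS = ("timestamp", "source", "host", "event_type", "raw")

def normalize(record: dict, source: str) -> dict:
    """Map a source-specific record onto the common schema."""
    # Different sources name the same concept differently.
    host = (record.get("host") or record.get("hostname")
            or record.get("computer", "unknown"))
    ts = record.get("timestamp") or record.get("@timestamp")
    if ts is None:
        # Fall back to ingest time when the source omits a timestamp.
        ts = datetime.now(timezone.utc).isoformat()
    return {
        "timestamp": ts,
        "source": source,
        "host": host,
        "event_type": record.get("event_type", record.get("action", "unknown")),
        "raw": record,  # keep the original for forensics
    }

evt = normalize({"hostname": "fw-01", "action": "deny"}, source="firewall_syslog")
print(sorted(evt) == sorted(COMMON_FIELDS))  # True
```

Keeping the raw record alongside the normalized fields preserves forensic detail even when a mapping is lossy.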
Security infrastructure as code
The team has built deep experience with security tooling and platforms over the years. As we rebuilt our foundational infrastructure, we knew we needed to apply a more comprehensive engineering-first approach. One aspect of this engineering-first approach to defense infrastructure was treating everything as code to maintain consistency, reliability, and quality. This led to the development of the Moonbase continuous integration and continuous deployment (CI/CD) pipeline. Through the CI/CD pipeline, the team manages all detections, data source definitions and parsers, automated playbooks and serverless functions, and Jupyter notebooks used for investigations.
Having every engineer work on the development and improvement of the detection playbooks, along with the rigor that comes from the typical CI/CD review and testing stages, gives the team a strong, templateable, and heavily quality-assured playbook for detecting and responding to threats. Simple mistakes, or detections that could lead to unnecessary resource usage, are easily prevented through the peer review process. Our tailored, comprehensive CI validations for each resource type help us programmatically detect issues in these artifacts during PR validation and significantly improve the deployment success rate. Change tracking is also much easier, as pull request IDs can be added to investigation tickets for reference or used in other parts of the tuning process.
The Moonbase CI/CD pipeline is serverless, built on top of Azure DevOps. Azure Repos is a source code management solution similar to GitHub that we use for all our code, with deployment done through Azure Pipelines. Azure Pipelines is a robust CI/CD platform that supports multi-stage deployments, integration with Azure CLI tasks, integration with Azure Repos, PR-triggered CI builds, and more. It also lets us deploy the same resource to multiple SIEM instances within different scopes simply by updating deployment configuration settings. We leverage both to build, validate, and deploy detections and all other deployables, following a trunk-based development model. Artifacts like queries are enriched in the pipeline before deployment. These enrichments help not only with detections but also with tracking threat detection coverage, incident metrics, and more.
While there are many features of a CI/CD pipeline like constant validation, decentralized review, mandatory documentation, and metrics, the templating aspect is one of the stronger points of this detection engineering approach. The pipeline allows any analyst on the team to quickly deploy a new detection or update an existing one. In this screenshot from VSCode you can see an example detection template looking for unexpected activity from inactive service principals (SPN).
The pane on the right shows the actual detection within its template. The entire detection isn’t pictured here to obscure some internal-only information, but this snippet shows what a detection engineer has configured for this specific detection in the orange-colored text. Other items, like how far back to search and the severity of the alert, can be specified in the template.
Additionally, we explicitly configured the data sources needed to execute the detection within the template.
To assist in understanding our coverage of threats, we map each detection to its MITRE ATT&CK ID and Sub ID. Not only does this help us track our coverage, but it also enables us to write additional queries or detections that collate any technique or class of attacks into a single report.
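A CI validation of the kind described above can be sketched as a small check that every detection's metadata carries a well-formed ATT&CK technique ID. The metadata field names and severity levels here are hypothetical, but the `T1078` / `T1078.004` ID format is MITRE's published convention.

```python
import re

# Technique IDs look like T1078; sub-techniques add a 3-digit suffix: T1078.004.
ATTACK_ID = re.compile(r"^T\d{4}(\.\d{3})?$")

def validate_detection(metadata: dict) -> list:
    """Return a list of validation errors (empty means the PR check passes)."""
    errors = []
    technique = metadata.get("mitre_attack_id")
    if not technique or not ATTACK_ID.match(technique):
        errors.append(f"invalid or missing MITRE ATT&CK ID: {technique!r}")
    if metadata.get("severity") not in {"low", "medium", "high", "critical"}:
        errors.append("severity must be one of low/medium/high/critical")
    return errors

print(validate_detection({"mitre_attack_id": "T1078.004", "severity": "high"}))  # []
print(validate_detection({"mitre_attack_id": "1078", "severity": "urgent"}))
```

Running a check like this during PR validation is what makes coverage reports trustworthy: every deployed detection is guaranteed to map onto the ATT&CK matrix.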
Finally, the query itself is listed (in this case we’re using KQL to search the data).
Our internally developed CLI generates boilerplate templates for new artifacts, helps engineers maintain quality and deployment velocity, and improves productivity by letting them validate changes locally.
It is important to create space for innovation. Earlier, the team often found themselves deep in operational toil, going from one issue to another. A very deliberate effort to pause operational work that yielded a low return on investment really helped the team make space for innovation. Staying true to the guiding principle of automating what makes sense, the team uses automation heavily to remove as much of the mundane and repeatable work as possible. The following are some of the automation tools and platforms the team currently uses. These platforms and the automation they enable have unlocked efficiency and quality across the team’s functions.
Automated playbooks and workflow automation
The team leverages a no-code workflow automation platform that comes tightly integrated with the cloud SIEM. It is a highly scalable integration platform that allows building solutions quickly, thanks to the many built-in connectors across the product ecosystem, like Jira, ServiceNow, and several custom tools that the team depends on. Use cases include alert and incident enrichment, which makes all the context required to triage an alert available to engineers; automated post-alert playbooks for prioritization; automated containment and remediation jobs; and other business workflows. Another use case is automated quality control and post-incident processing, which allows us to learn lessons from previous incidents.
Several complex automation jobs are written as serverless functions. These are easy to deploy, scalable, and have a very low operational overhead. These functions are used for ingesting data from on-prem and online sources, along with more complex automation jobs like the containment of compromised identities.
Resilience and reliability are broad topics and thus not covered fully in this blog; however, data and platform reliability are absolutely critical. A change in underlying data can have major cascading effects on detection and response. Data availability during incidents is key, too. Outside of the core monitoring of the platforms, the team relied on three different avenues for signals of degraded performance:
Pipeline health metrics
Looking at things like message drop rate, latency, and volume allows the team to quickly identify any issues before they impact detections and/or response activities.
Synthetic messages (heartbeats)
Sending messages to ensure the pipeline is functional and that messages get from source to destination within expected timeframes.
Ideal behavior indicators (operational alerts)
Data sources are dynamic in nature. When onboarding a data source, a detection is developed to monitor the data source’s health, for example, sending an alert when the number of unique source hosts decreases by more than a threshold percentage.
These are only some of the health checks and alerts we developed to ensure that our systems were logging and reporting properly. We try to graph availability as well as detection efficacy data to assist us in constantly re-evaluating our detection posture and accuracy.
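The unique-source-hosts check mentioned above can be sketched as a simple threshold comparison against a baseline window. The 30% threshold is illustrative, not the team's actual value.

```python
def source_health_alert(baseline_hosts: int, current_hosts: int,
                        threshold_pct: float = 30.0) -> bool:
    """Return True if the unique-host count dropped more than threshold_pct
    relative to the baseline window for this data source."""
    if baseline_hosts == 0:
        return False  # no baseline to compare against yet
    drop_pct = (baseline_hosts - current_hosts) / baseline_hosts * 100.0
    return drop_pct > threshold_pct

print(source_health_alert(baseline_hosts=200, current_hosts=120))  # True (40% drop)
print(source_health_alert(baseline_hosts=200, current_hosts=180))  # False (10% drop)
```

A silent or partially broken log source looks exactly like an absence of threats, which is why this class of operational alert matters as much as the detections themselves.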
What did we get for all this?
Threat detection and incident response is not a purely operational problem; using an engineering lens paid off well for our team. If this is not something you’re doing, we hope this post drives you toward that realization. Leaving the processes to organic growth is unsustainable and violates our guiding principles, which were designed to prevent burnout for the team and ensure a quality response. In the end, we achieved significant success and improvements. We were able to expand our data collection infrastructure by 10x, going from gigabytes to petabytes; our average time to triage incidents went from hours to minutes; we were able to maintain 99.99% uptime for the platform and connected tooling; and correlation was now possible, significantly reducing alert fatigue and improving precision. Additionally, automated response playbooks allowed us to reduce toil for response and automatically handle simple tasks like enrichment, case creation and updates, or additional evidence gathering. We were also able to quickly integrate threat intelligence into our enrichment pipeline, providing much richer context for investigations.
More work to do
What we’ve covered in this blog represents the work of a handful of engineers, product and project managers, analysts, and developers over a relatively short period of time. We learned many valuable lessons over time and have since developed new ways of thinking about detection, automated response, and scaling. Whatever platform a team decides on using for security monitoring, it cannot be overstated how important a solid design and architecture will be to its eventual success.
In a future post, we’ll cover more details on how incident response ties in with our platforms and pipelines and a re-architected approach to detection engineering, including the lessons we learned.
This work would not have been possible without the talented team of people supporting our success.
Core contributors: Vishal Mujumdar, Amir Jalali, Alex Harding, Tom Leahy, Tanvi Kolte, Arthur Kao, Swathi Chandrasekar, Erik Miyake, Prateek Jain, Lalith Machiraju, Gaurav Gupta, Brian Pierini, Sergei Rousakov, Jeff Bollinger and several other partners within the organization.