Towards data quality management at LinkedIn

Co-authors: Liangzhao Zeng, Ting Yu (Cliff) Leung, Jimmy Hong, and Kevin Lau

Introduction

Data is at the heart of all our products and decisions at LinkedIn, and the quality of our data is vital to our success. While data quality is not an uncommon problem, our scale (hundreds of thousands of pipelines and streams, as well as over an exabyte of data in our data lake alone) presents unique challenges. Most pipelines have complex logic involving a multitude of upstream datasets, and common data quality issues in these datasets can be broadly classified into two categories: metadata and semantic.

The metadata category of data quality issues concerns data availability, freshness, schema changes, and data completeness (volume deviation), all of which can be assessed without examining the content of the dataset. This kind of data health information can be collected from the metadata repository (e.g., the Hive metastore and related events) or storage services (e.g., HDFS), i.e., purely from the metadata of the datasets. The semantic category of data quality issues concerns the content of the datasets, such as column value nullability, duplication, distribution, exceptional values, etc. This kind of data health information can be collected from the data profile (if it exists) or by scanning the dataset itself, i.e., the data content. To lay a solid foundation for tackling data quality, we must attack the problem at the root. In this blog post, we focus on the metadata category of data quality issues and demonstrate how we monitor the data quality of datasets at scale: data availability, data freshness, data completeness, and schema changes.
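As a concrete illustration of a metadata-category check, the sketch below flags a late or missing daily partition using only arrival timestamps of the kind a metadata repository or audit log provides; the names, dates, and deadline are hypothetical, and this is a simplification for illustration, not LinkedIn’s implementation.

```python
from datetime import datetime, timedelta

# Hypothetical arrival times collected from the Hive metastore / HDFS
# audit logs: partition date -> creation timestamp. No file content is read.
arrivals = {
    "2022-03-01": datetime(2022, 3, 2, 6, 40),
    "2022-03-02": datetime(2022, 3, 3, 6, 55),
    "2022-03-03": datetime(2022, 3, 4, 9, 10),  # landed after the deadline
}

def freshness_status(partition, deadline_hour=7):
    """Flag a daily partition as LATE if it arrived after the expected
    deadline (here, 7 a.m. the next day), or MISSING if it never arrived."""
    arrived = arrivals.get(partition)
    if arrived is None:
        return "MISSING"
    deadline = datetime.fromisoformat(partition) + timedelta(days=1,
                                                             hours=deadline_hour)
    return "LATE" if arrived > deadline else "ON_TIME"

for day in ("2022-03-02", "2022-03-03", "2022-03-04"):
    print(day, freshness_status(day))  # ON_TIME, LATE, MISSING
```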


A typical AI use case

We will use a typical AI use case to illustrate the challenges in addressing data quality issues. The following figures show the actual arrival time and data volume of a dataset consumed by the AI use case. The dataset is expected to arrive daily before 7 a.m., and the data volume follows a weekly pattern (i.e., seasonality) in which weekend volume is lower.

Figure: graphs showing dataset arrival times and data volume

It is quite common for a complex pipeline to consume tens of (if not more than a hundred) upstream datasets like the one we illustrated. These upstream datasets may be ingested into storage or created by different pipelines at different times and with different arrival frequencies. For example, some datasets are produced on a monthly, weekly, daily, or hourly basis, while others are produced several times a day so that the latest version can be used. Some are static in nature (updated only occasionally) and others are continuously ingested into the system.

All of these complexities make it difficult to reason about upstream dataset quality. In many cases, awareness of the following aspects can greatly help the operation of these complex pipelines:

  • Data freshness/staleness: An older version of the dataset may be used unintentionally in the pipeline logic (due to late arrival of the latest version), or the pipeline may be stopped because of a late arrival. Either case can degrade overall data quality. In many cases, a pipeline may accept reading slightly stale source data, but users must be aware that the quality of the output may degrade over time.

  • Data volume deviation: A sudden drop in data volume could mean that only partial data was generated. If the situation goes undetected, the output dataset will be inaccurate. A steep increase in data volume can cause the pipeline logic to fail due to insufficiently allocated system resources (e.g., memory), delaying delivery of the output dataset. (A sketch of a seasonality-aware volume check follows this list.)

  • Schema change: A schema change (e.g., adding a field to a structure) may break some downstream pipeline logic unintentionally, leading to pipeline failure.
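Here is a minimal sketch of the seasonality-aware volume check referenced above; it is illustrative rather than DHM’s production logic, and the 50% tolerance is made up. Comparing today’s volume against the trailing average for the same weekday keeps the expected weekend dip from raising false alarms.

```python
from statistics import mean

def volume_alert(history, today, weeks=4, tolerance=0.5):
    """history: daily volumes (rows or bytes), oldest first, ending yesterday.
    Compares `today` to the mean of the same weekday over the past `weeks`
    weeks, so the weekly pattern (e.g., lower weekend volume) is respected."""
    baseline = mean(history[-7 * weeks::7])  # same weekday, 7 days apart
    deviation = (today - baseline) / baseline
    if deviation < -tolerance:
        return f"DROP {deviation:+.0%}: possible partial data"
    if deviation > tolerance:
        return f"SPIKE {deviation:+.0%}: risk of under-provisioned resources"
    return None  # within tolerance: no alert

# Four weeks of history with a weekend dip; the same-weekday baseline
# absorbs the weekly seasonality.
history = [100, 100, 100, 100, 100, 60, 60] * 4
print(volume_alert(history, today=100))  # None (a normal weekday)
print(volume_alert(history, today=40))   # DROP -60%: possible partial data
```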

Challenges

Data health can be greatly improved if the following challenges are tackled systematically.

At LinkedIn, there are hundreds of thousands of datasets residing on multiple Hadoop clusters. Some datasets are replicated across clusters, either by copying or by being created in parallel for redundancy, while other datasets are variants served for different purposes (e.g., with all sensitive information obfuscated). A monitoring solution must address this scale and offer low overhead, short latency, and monitoring by default (i.e., no manual onboarding process).

LinkedIn has a huge number of data pipelines; each consumes a large number of datasets and has its own expectations about dataset availability. Consider a pipeline reading a dataset that has an SLA of daily arrival by 8 a.m. If the pipeline won’t start until 2 p.m., does missing this SLA (with an actual arrival time of 10 a.m.) have any business significance for the pipeline? If the dataset is backfilled by noon, this particular SLA miss is irrelevant to the consuming pipeline, and receiving an alert about the miss at 10 a.m. can be annoying, leading to email alert fatigue (discussed further below). A dataset has multiple consumers, each with its own expectations about the input dataset’s availability. In other words, even if producers miss their SLA commitment, the miss may not impact all of the consumers; the real impact depends on the consumption usage.

There are many different types of datasets at LinkedIn, with offline datasets classified as either snapshot or partitioned. Datasets can also have different arrival frequencies: monthly, weekly, daily, multiple times per day, or hourly. When a pipeline consumes any of these datasets, defining the freshness (or staleness) of the dataset can be a challenge. For example, if a pipeline reads a dataset that is refreshed several times a day and the latest arrival is delayed (its producer missed the SLA), is the pipeline reading stale data even though it can read an earlier version that arrived the same day? Reading an earlier version may have little business impact on the output, but it is important that consumers be aware of the situation.

Engineers receive various email alerts throughout the day, some of which are false or duplicate alarms. As a result, engineers face email alert fatigue and sometimes don’t pay close attention to alerts that indicate real issues or failures. Differing expectations about dataset availability magnify the fatigue when consumers receive too many alerts that are technically valid but not relevant, leading to real data quality issues being mishandled or overlooked. Consequently, reducing email alert fatigue is a paramount goal; at the same time, including possible remedial actions or information in the email alert is critical to the success of the monitoring solution.


Existing solutions often require dataset owners to grant explicit read permission to the monitoring service. This manual step increases the friction of adopting a solution at scale.

Data Health Monitor (DHM) architecture


Figure 1. Data Health Monitor system architecture

The high level system architecture of DHM is shown in Figure 1 and is divided into three phases:

  • Observation: DHM leverages low-latency Hive metadata and HDFS audit logs as the sources of truth. Data health vital signs, such as dataset or partition arrival times, freshness timestamps, and schemas, are collected automatically at scale. DHM continuously gathers new audit logs and derives new vital signs, which are persisted on HDFS. With this approach, all Hive/HDFS datasets are monitored by default, without additional onboarding steps.

  • Understanding: After automatically collecting data health vital signs in the repository for several days, DHM can safely infer several key dataset properties with high accuracy, including the arrival frequency and average arrival time (a simplified sketch of this inference follows this list). Most datasets are ingested or created on HDFS on a regular basis, such as daily, several times a day, or hourly; there are other cadences as well: monthly, weekly, static, or unknown. For example, DHM can infer that a new dataset partition has been created daily at 8 a.m. (on average). The inferred arrival time and frequency are then used as the default assertion on the dataset’s expected arrival. Dataset consumers can easily revise the expected arrival time to fit their needs via a self-service UI. Continuing the previous example, a consumer might only want to be alerted if the partition has not arrived by 11 a.m., and users can even change the arrival frequency; for example, users may want to receive alerts once a day for hourly datasets. In our experience, some data pipelines are less sensitive to dataset freshness: a dataset may arrive once a day, yet few of its changes would actually affect the output dataset’s quality. Instead of potentially being spammed with daily email alerts, such users may revise the assertion to be alerted on a different schedule, e.g., only if there has been no new arrival in the past 10 days. This capability requires assertion evaluation just before alerting and cannot be scheduled purely on the dataset arrival pattern. The architecture offers a lot of flexibility to reduce email alert fatigue.

  • Reasoning: DHM monitors all datasets for several key data health events (such as arrivals and schema changes) and reasons about their health status based on the inferred key properties, such as arrival frequency and time. The reasoning results form the basis for email alerting. Before sending an alert, however, DHM determines whether the identified health issue is a duplicate; if so, the alert is suppressed to avoid spamming. Users who want to receive email alerts can create a subscription to the dataset via the DHM self-service UI. As stated earlier, users may revise the inferred properties (such as the expected arrival time) to match their expectations. The email alert contains relevant information, including the reasoning basis and related context, so that users can take action.
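To make the inference concrete, here is a minimal sketch, assuming arrival timestamps have already been collected as vital signs; it is a simplification for illustration, not DHM’s production classifier, and all thresholds are made up. The median gap between consecutive arrivals suggests the frequency, and the mean time of day yields the default arrival assertion.

```python
from datetime import datetime
from statistics import mean, median

def infer_arrival_properties(arrivals):
    """arrivals: sorted datetimes at which new partitions/versions landed.
    Returns an inferred (frequency, average arrival hour) pair."""
    gaps = [(b - a).total_seconds() / 3600  # hours between arrivals
            for a, b in zip(arrivals, arrivals[1:])]
    gap = median(gaps)
    if gap <= 2:
        freq = "hourly"
    elif gap <= 12:
        freq = "several times a day"
    elif gap <= 36:
        freq = "daily"
    elif gap <= 10 * 24:
        freq = "weekly"
    else:
        freq = "monthly/static"
    avg_hour = mean(t.hour + t.minute / 60 for t in arrivals)
    return freq, avg_hour

stamps = [datetime(2022, 3, d, 7, m)
          for d, m in [(1, 50), (2, 40), (3, 55), (4, 45), (5, 58)]]
print(infer_arrival_properties(stamps))  # ('daily', ~7.8) -> assert by ~8 a.m.
```

A consumer-facing assertion then simply overrides the inferred (frequency, deadline) pair, which is why revising it in the self-service UI is cheap.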

Together, these three phases compose the data health monitoring service, which automatically discovers all of the datasets in the ecosystem and collects their data quality metrics. The collected data quality metrics in the data warehouse provide a source of truth for data quality information that users can leverage, for example by creating a data health dashboard or performing analytics on data health, such as trend analysis.

Status

The Data Health Monitor project has been generally available (GA) to the LinkedIn community since August 2021 and has been widely adopted by many organizations, such as the marketing and AI/ML teams. As of now, all HDFS/Hive datasets are monitored by DHM across several Hadoop clusters. DHM collects one billion data health vital signs for about 150k critical datasets daily, demonstrating its scalability and performance.

Dataset consumers can create a subscription on any dataset and will receive email alerts according to the subscription settings. There are several important metrics that we use to measure the success of DHM:

  • Subscriptions (each with an email alert) reached more than 2,000 within several months of general availability

  • On average, DHM sent out roughly 1,500 alerts weekly

  • Over 98% of the alerts are true positives and accurate with respect to their subscriptions, as validated against real-time dataset status

  • The DHM alert SLA (the time between detecting a health issue according to the subscription and sending out an email) has been reduced to roughly 30 minutes after post-GA optimization

Conclusion and future work

DHM is a scalable data health monitoring service that has been deployed widely across the LinkedIn community; it handles large-scale dataset subscriptions with low latency and high-precision alerts. DHM is a self-service solution that requires no manual onboarding steps and offers a simple UI for alert configuration.


DHM is still a “monitoring-after-the-fact,” or diagnosis, solution, meaning that by the time an issue is detected, the data health problem has already occurred and some damage may have been done, such as computing a new dataset from partial or stale data, or a flow failure. With our ability to shorten detection latency and diagnose accurately, we plan to incorporate several preventive-care capabilities, such as predicting flow failures due to data volume changes or severe data skew. We also intend to provide a complete data quality solution that addresses both the metadata and semantic categories.

Acknowledgements

We would like to thank all current and former DHM team members (especially Ting-Kuan Wu, Silvester Yao, and Yen-Ting Liu) for their design/implementation and contributions to the project.  Furthermore, we thank our early partners who provided us with valuable and timely feedback.


Career Stories: Breaking barriers with LinkedIn


After interning with us, Beatrix resonated with the culture and community she found at LinkedIn, and rejoined us post-undergrad. As she continues exploring her passion for frontend (UI) and accessibility engineering, she shares why launching her career with LinkedIn is one of the best decisions she has made.


From intern to engineer

In 2019, my career with LinkedIn started with a UI (frontend) engineering internship in the San Francisco office. I had done a bit of user experience (UX) work before, so I was excited that there was a specific frontend role for interns at LinkedIn. I learned a lot about frontend development and immersed myself in LinkedIn’s culture. After my internship, I felt like I had barely scratched the surface of all there was for me to learn at LinkedIn, and I was so happy to get a return offer as a UI engineer on the LinkedIn Marketing Solutions team after I graduated in 2020. 

I had a bit of an untraditional path to engineering — I’ve always been creative and loved taking Latin, but I started taking an interest in computer science during high school, which continued into my undergraduate studies at Vassar. Computer science combined many things I liked about science with my love for humanities and logic; it’s truly a multidisciplinary field. 

I honed these skills during my extracurriculars in college as a teaching assistant with Kode with Klossy, a summer camp that helps teach teen girls how to code, and through attending the Grace Hopper Celebration, a women in tech conference. It was a full-circle moment when I had the privilege to attend the virtual Grace Hopper Conference in 2020 with LinkedIn. 


A culture of Next Plays

One of the attributes that kept me here is LinkedIn’s culture of transformation. With every team I’ve been on, there’s a lot of celebration of people’s “Next Plays” as we call it here, whether that’s a new job opportunity or promotion at LinkedIn itself or elsewhere. 

While I enjoyed my time working on LinkedIn Marketing Solutions’ Campaign Manager, after 1.5 years on the team, I was eager to dive deeper into a new challenge and more accessibility work, to better support diverse learners and LinkedIn users (or members as we call them) with disabilities. 


Thanks to LinkedIn’s wonderful culture, with its emphasis on collaboration and mentorship, I was able to connect with engineers from across the engineering organization to find a role that combined my interests in accessibility engineering and development for the main LinkedIn site. With that mentorship, I found a new role earlier this year on our LinkedIn Talent Solutions team, centered on job search and evaluation engineering work.


I’ve always found LinkedIn to be very human in its approach to work, because everything we do stems from our mission to build economic opportunities and connections for people. The job search team is focused on helping people get jobs, and through our accessibility work, we make more jobs accessible to every member of the global workforce. This focus was also present in other roles that I’ve had at LinkedIn. While on the LMS team, I got to work on our reflow efforts, which ensure that the pages in Campaign Manager are usable on many screen sizes and at different zoom levels.

50M job seekers visit LinkedIn every week — resulting in 95 job applications every second, and six hires every minute. To help job seekers find their dream jobs at that scale is incredibly rewarding, and I feel very fortunate to contribute directly to those outcomes as an engineer.  


Being there through the tough times

LinkedIn’s human approach also transcends the work itself. I grew up in Los Angeles, and I’m incredibly close to my parents and two sisters: one in high school and one in college.

I remember getting pulled into a family emergency where I had to unexpectedly fly back home from San Francisco to help, and told my manager, “I’m not sure when I’ll be back [in San Francisco].” My team and managers were incredibly compassionate, and I was able to spend time with my family and fly back to visit them as needed to help support during this difficult time. 


I was also able to be in Los Angeles from Thanksgiving to New Year’s with my family, working remotely from LA. The flexibility and earnest support I’ve received from my team during both the good and the tough times have meant the world.


Craftsmanship in engineering

On the technical side of the house, one of the things that impresses me about LinkedIn is engineering’s emphasis on craftsmanship here, especially on our Talent Solutions team. We invest a lot into the foundations of our code base and code quality, ensuring that we are writing code that we can build on in the future. 

While working on new features for the site is always exciting, I am also grateful to have the opportunity to work on the efforts that improve the site behind the scenes, like documentation and other quality-of-life changes. Many teams at LinkedIn are trying to push foundational work initiatives like this forward. Documentation is one of those things that always comes up in developer productivity and happiness here at LinkedIn, so I’m glad to be able to contribute in ways that help make my colleagues’ work lives easier. 

Recently, my team discussed a situation with a contractor who came into the code base and thought our code tests did not make sense. This confusion sparked us to begin renaming the tests, changing the wording, and agreeing on the clearest way of labeling our code tests. I have so appreciated having space for these discussions; although product users will never see this, it is something that makes our code so much more reliable. 


Breaking barriers through Women in Tech

Since I joined LinkedIn full-time in the midst of remote work during the pandemic, I wanted to find ways to connect with other engineers. I’m so thankful that LinkedIn has given me those opportunities; I was one of the founding members of the LinkedIn Marketing Solutions branch of Women in Tech (LMS WiT) and joined our Out@In (LGBTQIA+) Employee Resource Group (ERG). It is incredible how leadership opportunities at LinkedIn aren’t gated by age or company tenure. I was able to grow and learn so much about what it means to be organized, to be a leader, and to use my position to help the WiT community.


Within LMS WiT, I helped to co-found the Amplify Voices track. Shortly after joining, I raised that we should rename our Male Allies track; I had heard from several nonbinary employees on our LinkedIn Marketing Solutions team who were wondering whether there was room for them within WiT. It was powerful to me that my group was receptive to my idea and changed the name to WiT Allies the very next day, so that more LinkedIn employees felt included. For a group focused on equality, empowerment, and speaking up for yourself in a professional setting, it’s essential to have these discussions about inclusiveness.

Anytime I had a suggestion in ERGs, it was always considered thoughtfully and there was a lot of trust placed in me even as a young professional. In LinkedIn’s ERGs, there’s this openness that breaks down artificial limits and helps us grow as leaders. This spirit of inclusiveness is what makes LinkedIn such a welcoming place. 


About Beatrix

Beatrix is a frontend (UI) engineer on our LinkedIn Talent Solutions team. Prior to her current role, Beatrix was a UI engineering intern and a UI engineer on our LinkedIn Marketing Solutions team. She graduated from Vassar College with a degree in computer science. In her free time, Beatrix enjoys spending time with her two cats, Mr. Darcy and Georgiana, cross-stitching and crocheting, and gaming.

Editor’s note: Considering an engineering/tech career at LinkedIn? In this Career Stories series, you’ll hear first-hand from our engineers and technologists about real life at LinkedIn — including our meaningful work, collaborative culture, and transformational growth. For more on tech careers at LinkedIn, visit: lnkd.in/EngCareers.


Measuring marketing incremental impacts beyond last click attribution


Co-authors: Maggie Zhang, Joyce Chen, and Ming Wu

What’s my ROI?

In every company, there’s a fundamental need to understand the impact of marketing campaigns. You want to be able to measure how many incremental conversions different channels and touchpoints are driving. The best practice of individual-level A/B testing is not applicable in traditional channels such as TV ads, radio, or billboards. Even in digital marketing channels, new regulations and public awareness of data privacy have made A/B testing on third-party platforms, which requires transferring user-level data, harder than ever. As a compromise, companies often rely on the last-click attribution model, which gives 100% of the credit for a conversion to the last marketing touchpoint/campaign in a user’s journey. This means that not only does it ignore everything (e.g., engagement, other media exposure) that happened earlier in the user journey, it also tends to over-credit the last touchpoint (usually a paid media exposure) for conversions that would have been achieved organically without the media exposure.

To accurately quantify the true incremental impact of marketing campaigns, we adopted a powerful approach: a Bayesian Structural Time Series (BSTS) model that measures the causal effect of an intervention.

The basic idea is simple and intuitive: we design an experiment in which the experimental units are defined by targetable geographical areas. The planned marketing intervention is applied in selected areas (the test areas), and the remaining areas serve as the control. A BSTS model is created to predict the test areas’ would-be performance in the alternative scenario with no marketing intervention. The delta between the observed and predicted performance of the test areas enables us to measure the true impact of the marketing intervention.

What is BSTS?

The BSTS model is a statistical technique designed to work with time series data; it is used for time series forecasting and for inferring causal impact. You can refer to this paper and Google’s open-source R CausalImpact package for more details.


Let’s use geo-based marketing campaign measurement as an example. At a high level, constructing an adequate counterfactual for the test markets’ performance requires three sources of information. The first is the time series performance of the test markets prior to the marketing campaign. The second is the time series performance of the control markets that are predictive of the test markets’ performance before the campaign (many considerations go into picking the most relevant subset to use as contemporaneous controls). The third is prior knowledge about the model parameters, for example from previous studies.

BSTS causal impact analysis steps

To infer the causal impact of a marketing campaign with the BSTS model approach, the following steps need to take place.

Metric selection

A true-north metric is used to select comparable markets. Whether it’s traffic, job views, or job applications, we have to be very clear about what we want to drive and what we want to measure.


Geo-split

One key assumption of a geo test is that the control markets’ time series data are predictive of the test markets’ time series data. We can form test and control groups by leveraging a sampling/matching algorithm to select comparable groups of markets based on historical time series data. There are two algorithms for forming the comparable groups, depending on the actual business needs:

  • MarketMatching is used to find matching markets when marketers already have a list of markets they want to run campaigns in. For example, if a billboard campaign is set to launch in New York, the matching algorithm might find that San Francisco and Chicago are good markets to use as controls.

  • The stratified sampling approach pre-divides the list of markets into homogeneous groups called strata based on characteristics that they share (e.g., location, revenue share), then draws randomly from each stratum to form the test sample. It guards against an “unrepresentative” sample (e.g., all-coastal states for a nationwide Google search campaign) by ensuring each subgroup of a given population is adequately represented. This allows marketers to properly infer the performance of a large-scale, non-local campaign. (A minimal sketch of the stratified draw follows this list.)
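A minimal sketch of the stratified draw follows; the market table, strata, and sampling fraction are hypothetical, and the production market-selection pipeline is considerably more involved.

```python
import pandas as pd

# Hypothetical market list: one row per targetable geo market.
markets = pd.DataFrame({
    "market": ["NY", "SF", "CHI", "AUS", "DEN", "ATL", "SEA", "MIA"],
    "region": ["east", "west", "central", "central",
               "central", "east", "west", "east"],
    "revenue_share": [0.25, 0.18, 0.12, 0.08, 0.07, 0.12, 0.10, 0.08],
})

# Sample half of each region (stratum) so every subgroup is represented,
# e.g., avoiding an all-coastal test group for a nationwide campaign.
test = (markets.groupby("region", group_keys=False)
               .sample(frac=0.5, random_state=7))
control = markets.drop(test.index)
print("test:", sorted(test["market"]))
print("control:", sorted(control["market"]))
```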

Theoretically, a geo-split can be implemented at various levels (nation, state, county). In reality, a good choice of geo-split level should fulfill these requirements:

  • Targetable: it must be possible to fully control the marketing activities at this level on the desired ad platforms. Geo-targeting capabilities and restrictions vary across platforms, and it is important to understand them before planning your test.

  • Measurable: it must be possible to observe the ad spend and accurately measure the response metric at this level.

  • Economical: for example, it is not a good idea to run a job promotion campaign with a state-level split. Some people may reside in New Jersey while working in New York City. Instead, the campaign should be run in the entire New York metropolitan area, which covers the key areas in both New Jersey and New York, thereby reducing the risk of cross-group contamination.

Modeling

After decisions have been made on the geo group assignments and the true-north metric, we can construct two time series (test/control) using historical data aggregated at the assigned geo-group level. We recommend finding a period without major regional marketing activities. The period required for training the model depends on the availability of the data and the variance of the time series. If the training period is too short, there will not be enough data to learn the relationship between the test and control time series, resulting in high bias. If the training period is too long, the relationship may change over time and no longer apply. In practice, we find one to three months to be a good duration.

The next step is to build a model that can accurately predict test time series based on the control time series. 

A good time series model needs to be flexible and transparent; it should take into account the seasonality, the macroeconomic trend, and the business drivers, and be able to quantify the impact of each. BSTS allows you to explicitly specify the posterior uncertainty of each individual component (regression, seasonality, trend). You can also control the variance of each component and impose prior beliefs within its Bayesian framework. Mean Absolute Prediction Error (MAPE) is used to evaluate the goodness of fit of the model during the training period (sketched below). A good MAPE score (usually <5%) is a strong signal that the selected control group can be used to accurately predict the counterfactual of the test market.
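For concreteness, here is a sketch of the MAPE computation over the training window, using the standard mean of |actual - predicted| / |actual| definition; the <5% bar from the text is a heuristic, not a hard rule.

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error between the observed test-market
    series and the model's predictions over the training window."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

# A score under ~5% suggests the control markets predict the test
# markets well enough to serve as a credible counterfactual.
print(mape([100, 120, 90], [98, 125, 88]))  # ~2.8
```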

Validation 

Prior to the campaign launch, we establish an AA-testing process to validate the model’s performance and rule out pre-existing bias that could undermine the causal inference. During the AA test period, no marketing intervention is applied to either the treatment or the control group. We expect the model to report no statistically significant difference between the predicted time series and the observed time series. If the AA test fails, further deep dives and a re-design of the test are required.

Figure: AA testing process

Power analysis and budget scenarios 

Similar to an A/B test, we’d like a power analysis at the design stage of a geo experiment. If those markets are used as control and treatment in the experiment and the true-north metric is sessions, what is the probability of detecting an effect if a session lift of a particular magnitude is truly present? Unlike A/B tests, there is no theoretical approach for conducting this power analysis. The current approach to estimating the minimum detectable effect (MDE) and the required test duration is simulation, in which a synthetic lift is added to the treatment group to approximate the effect of a marketing campaign (see the sketch below). We can then work with marketing partners to create budget scenarios at different MDE levels to ensure incrementality can be detected with a reasonable chance, and with a reasonable budget and pacing plan. A budget scenario usually takes account of several factors, including media cost, MDE, targeting plan (audience size/launch areas), and campaign duration.
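The simulation can be sketched as follows; the noise model is intentionally simple, and a plain one-sided test on the daily deltas stands in for the full BSTS posterior check, so treat the numbers as illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def detection_power(baseline, lift, n_sims=2000, noise_sd=0.05):
    """Fraction of simulations in which a synthetic lift (e.g., 0.03 = +3%)
    injected on top of the expected series is flagged as significant."""
    baseline = np.asarray(baseline, dtype=float)
    hits = 0
    for _ in range(n_sims):
        treated = baseline * (1 + lift) * rng.normal(1, noise_sd, baseline.size)
        delta = (treated - baseline) / baseline
        t = delta.mean() / (delta.std(ddof=1) / np.sqrt(delta.size))
        hits += t > 1.645  # ~one-sided 5% critical value
    return hits / n_sims

# 28-day campaign: is a +3% session lift detectable at this noise level?
days = np.full(28, 10_000.0)
print(detection_power(days, lift=0.03))   # high -> 3% is above the MDE
print(detection_power(days, lift=0.005))  # low  -> 0.5% is below the MDE
```

Sweeping `lift` until the power crosses a target (say 80%) yields the MDE for a given budget and duration.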

Measurement 

At the end of the campaign, we apply the previously trained BSTS model to forge a synthetic control from the control time series data of the post-intervention period. The synthetic control (the predicted series) is then compared with the observed time series of the test markets to measure the true impact of the marketing intervention. As in A/B tests, impacts are considered statistically significant only if the p-value of (delta > 0) is below 0.05.
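In code, the measurement step looks roughly like the sketch below, assuming one of the open-source Python ports of Google’s CausalImpact (here the `causalimpact` module installed via the tfcausalimpact package); the file name, columns, and dates are hypothetical.

```python
import pandas as pd
from causalimpact import CausalImpact  # pip install tfcausalimpact

# Hypothetical daily true-north metric: the test markets' series in the
# first column, the control markets' series as predictors in the rest.
data = pd.read_csv("geo_metric_daily.csv", index_col="date",
                   parse_dates=True)[["test_sessions", "ctrl_a", "ctrl_b"]]

pre_period = ["2022-01-01", "2022-03-31"]   # training window
post_period = ["2022-04-01", "2022-05-31"]  # campaign window

# Fits the model on the pre-period, forges the synthetic control for the
# post-period, and reports the delta with its posterior p-value.
ci = CausalImpact(data, pre_period, post_period)
print(ci.summary())
ci.plot()
```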

Figure: BSTS measurement results

Successful use case of BSTS at LinkedIn

At LinkedIn, our Data Science team has successfully applied the BSTS approach to many unique business cases and answered questions that would otherwise have remained myths to our business.

In one of our full-funnel brand marketing national campaigns, which lasted two months with multi-channel deployment (including TV, billboard, audio, digital, and social), we applied our BSTS approach and concluded that the national full-funnel campaign drove an almost double-digit lift in targeted metrics.

In one of our paid job distribution programs, we successfully designed the go-dark city selection using the aforementioned stratified approach. By applying BSTS, we proved that the program’s Return on Advertising Spend (ROAS) has a healthy reading, well above 1.0.

In our paid app activation program, we were able to leverage BSTS to infer members’ incremental lifetime value (LTV) by country and by operating system (iOS, Android). The results guided the app activation program’s future investment.

In a recent Google Universal App campaign where we promoted LinkedIn Apps on Android, we again applied BSTS and concluded that in the tested geography, about half of the app installs reported through last click are incremental.


Conclusion

Understanding marketing campaign ROI is a crucial business challenge. When the gold standard of A/B testing, or measurement at the individual unit level, is not available, BSTS is a powerful alternative for measuring a marketing campaign’s causal impact at a geo-aggregated level. The LinkedIn Data Science team, by establishing a BSTS measurement framework and best practices, has successfully applied the approach to deliver insightful measurement results that have improved our marketing channel efficiency and budget allocation.

We’d like to end this blog post by highlighting that, in addition to measuring past marketing campaign performance with BSTS, in subsequent work we also feed the BSTS results into a marketing mix model (MMM) to optimally allocate spend across future investments. Media mix models can provide a high-level cross-channel view of how marketing channels are performing. By triangulating modeling results with rigorous BSTS causal experimentation, one can improve the model’s robustness and its ability to recover some of the lost signals.

Acknowledgements

We would like to acknowledge Rahul Todkar and Ya Xu for their leadership in this cross-team work and Minjae Ormes, Ryan McDougall, Tim Clancy, Kim Chitra, Ginger Cherny, for their business partnership and support. We would like to thank all of the collaborators, reviewers and users who assisted with BSTS Geo Causal Inference Studies from the Data Science Applied Research team (Rina Friedberg, Albert Chen, Shan Ba), the Go-to-Market Data team (Fangfang Tan, Jina Lin, Catherine Wang, Kelly Chang, Sylvana Yelda), the Consumer Product Marketing Team (Rajarshi Chatterjee, Shauna-kay Campbell, Emma Yu), the Paid Media team (Nicolette Song, Krizia Manzano, Sandy Shen), the Careers Engineering Team (Wenxuan Gao, Dheemanth Bykere Mallikarjun), and our wonderful partners, the DSPx team (Xiaofeng Wang, Daniel Antzelevitch, Kate Xiaonan Ding) who helped build the automated solution. 


Career stories: Next plays, jungle gyms, and Python


Since she was a child, Deepti has been motivated to help people. This drive led her on a career journey with many pivots and moves — akin to navigating a children’s jungle gym — between industries and around the world. Based in Bangalore, this biomedical engineer turned data scientist shares how LinkedIn helped her gain new technical skills, dive into meaningful work, and grow. 


Growing up in Mumbai, India, I always imagined myself in a career where I could give back. I once dreamed of becoming a neurosurgeon, but early in my career, I took a different path and earned a bachelor’s in electronics engineering. While studying engineering gave me the foundation for my future career, I quickly realized that my job options wouldn’t help me make the difference that I wanted to. So, I decided to complete a master’s program in biomedical engineering at Drexel University in Philadelphia. 

After graduating, I found an opportunity at the Toyota Technical Center in Boston, where I helped build driver safety systems that incorporated human physiological considerations into injury prevention. Toyota is where I first began to reconsider my perspective on what it means to help others, realizing that I could draw on my STEM background to build safer systems that would benefit everyone.


Embracing a data-driven career change

Soon, however, home and family called me back to India, where, at the time, biomedical research was not as exciting as the work I had been doing in the U.S. While CT scans and MRIs are, of course, critical instruments, I increasingly felt that I wasn’t giving back in the way I’d hoped. After two years, I knew it was time to push myself out of my comfort zone once again, which led me to data science.

When I first broke into the field, data science was more like informal analytics. Yet I was intrigued by this new discipline, where I could use the skills I had gained as an engineer, like problem-solving and logical thinking, while also building unique expertise. When I started, my mantra was to stay focused on learning and not worry about my experience (or lack thereof) when surrounded by data scientists who, though just out of school, had more experience in the field than me.


My instincts served me well, and I quickly grew from an analyst to a senior manager — this pace of career progression is the norm in startups, where fast growth is expected. In a short period, my time became less focused on getting my hands dirty with data, and more centered on managing clients and stakeholders and putting out fires. After seven years, I missed building things and solving problems, which is when the perfect opportunity opened up at LinkedIn. 


Giving back to the global community at LinkedIn

With a desire to do more, I was recruited at the right time for a data scientist position on LinkedIn’s Economic Graph, our digital representation of the global economy. The Economic Graph research team I was on was a global team with people based in the U.S., Europe, and Singapore. What appealed to me most about the Economic Graph was that we work and collaborate alongside the government and other non-governmental organizations (NGOs) to deliver insights that enable our members to succeed and connect with the right opportunities for them. 

The Economic Graph team partners with public sector organizations to provide data insights that improve policy decisions. For example, if a government ministry is considering where to invest in education, it needs data on issues like labor market demand and skills gaps. Our team would deliver such power-packed insights using our member (i.e., LinkedIn user) platform data. At LinkedIn, we ensure that member data is used safely, and we’re proud that the trust we’ve built with our members enables us to deliver these insights.


Python, Scala, and people management

When the Economic Graph team consolidated, I knew it was time for the next stage of my career, or my Next Play as we call it here. My manager pushed me to consider taking on a tech lead role in data science within the Business Operations team at LinkedIn in India. I admit I was reluctant to go back to a position focused on business revenues, as I had grown attached to the research mission in my previous role. Soon, though, I realized that everything we do at LinkedIn helps advance our mission and vision for the community. 

Now, I’m managing a newsletter and leading a team of data scientists solving business-critical problems across the company. It’s precisely the kind of exposure I’m looking for at this point in my career, gaining horizontal expertise by engaging all these different domains. LinkedIn is all about learning. Here, managers encourage people to take charge of their careers, experiment, and move into other roles according to their interests and goals. 


We don’t shy away from challenges and learning curves. For example, I’ve had to upskill myself in coding. For nearly 14 years, I primarily used the R programming language. Now, we’ve moved on to Python and Scala, building on our everyday work with statistics and math. 

It’s not all about tech, though. We deal with unique questions, so problem-solving skills are critical. It’s also essential to think about the business contexts and ask the right questions. Then, we bring it all together with technology to solve a problem in a structured manner.


Moving forward on the LinkedIn learning path

When thinking about career trajectories, I always return to this metaphor of a children’s jungle gym. I tell my team that it’s about moving through a matrix rather than climbing a ladder one step at a time. You’re still moving from one point to the next, but the next step isn’t necessarily upward. The reality is that, sometimes, you have to move down a level to reach a specific endpoint. 


I’ve moved from senior management roles into positions with no one reporting to me. At first, I thought, “Am I down-leveling?” Then I would remind myself that my final goal is always to do something meaningful. At that point, taking on a more technical role was a step in that direction. 

Then, with the knowledge and expertise I gained, I could return to leading teams and tackling bigger challenges. Growth looks different for different people, but as long as you have that fire inside you to keep learning and growing, changing domains, jobs, or even countries will only help you in your journey.


About Deepti 

Based in Bengaluru, India, Deepti is a senior data scientist at LinkedIn. Before LinkedIn, she spent nearly seven years as a senior analyst and senior manager for [24]7.ai working on customer engagement solutions. Born and raised in India, she holds a bachelor’s degree in electronics engineering from the University of Mumbai, and a master’s in biomedical engineering from Drexel University in the U.S. Outside of work, Deepti spends time with her two daughters, and shares her passions for interior design and gender equality issues on social media. 

Editor’s note: Considering an engineering/tech career at LinkedIn? In this Career Stories series, you’ll hear first-hand from our engineers and technologists about real life at LinkedIn — including our meaningful work, collaborative culture, and transformational growth. For more on tech careers at LinkedIn, visit: lnkd.in/EngCareers.
