Silent but deadly: there is nothing more destructive than data corruptions that cannot be caught by the various error capture tools in hardware and that, even in software, can be hard to spot before they have infected an entire application.
This is especially devastating at Facebook scale but engineering teams at the social giant have discovered strategies to keep a local problem from going global. A single hardware-rooted error can cascade into a massive problem when multiplied at hyperscale and for Facebook, keeping this at bay takes a combination of hardware resiliency, production detection mechanisms, and a broader fault-tolerant software architecture.
Facebook’s infrastructure team started an effort in 2018 to understand the roots of and fixes for silent data corruption, including how fleet-wide fixes might look and what those detection strategies might cost in terms of overhead.
Engineers found that many of the cascading errors originate in production CPUs, but not always because of the “soft errors” associated with radiation or synthetic fault injection. Rather, they find these can happen randomly on CPUs in repeatable ways. Although ECC is useful, it is focused on problems in SRAM, and other elements are susceptible too. The Facebook engineering team that reported on these problems finds that CPU silent data corruptions occur at rates orders of magnitude higher than soft errors because of a lack of error correction in other blocks.
Increased CPU complexity opens the door to more errors, and when compounded at hyperscale datacenter levels with ever-denser nodes, these at-scale problems will only become more widespread. At the hardware level, the problems range from general device errors (placement and routing problems can lead to different arrival times for signals, causing bit-flips, for instance) to more manufacturing-centric problems such as etching errors. Further, early life failures of devices and degradation of existing CPUs can also have hard-to-detect impacts.
For example, when you perform 2×3, the CPU may give a result of 5 instead of 6 silently under certain microarchitectural conditions without any indication of the miscomputation in the system event or error logs. As a result, a service utilizing the CPU is potentially unaware of the computational accuracy and keeps consuming the incorrect values in the application.
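One common way to surface this kind of miscomputation is a known-answer test: run operations whose results were computed ahead of time and flag any mismatch. The sketch below is only an illustration of that idea, not Facebook’s detection tooling; the names `KNOWN_ANSWERS` and `run_known_answer_test` are hypothetical, and a real test would have to exercise the specific instructions and microarchitectural conditions involved.

```python
from typing import Callable, List, Tuple

# Hypothetical table of inputs and precomputed expected products.
KNOWN_ANSWERS: List[Tuple[Tuple[int, int], int]] = [
    ((2, 3), 6),
    ((7, 6), 42),
    ((1 << 20, 3), 3145728),
]


def run_known_answer_test(
    multiply: Callable[[int, int], int] = lambda a, b: a * b,
) -> List[Tuple[Tuple[int, int], int, int]]:
    """Return (inputs, expected, actual) for every mismatching computation."""
    failures = []
    for (a, b), expected in KNOWN_ANSWERS:
        actual = multiply(a, b)
        if actual != expected:
            # A silent corruption leaves no log entry, so the comparison
            # against the stored answer is the only signal available.
            failures.append(((a, b), expected, actual))
    return failures


def faulty_multiply(a: int, b: int) -> int:
    """Simulated defective unit that returns 5 for 2 x 3 (illustration only)."""
    return 5 if (a, b) == (2, 3) else a * b
```

Calling `run_known_answer_test(faulty_multiply)` reports the 2×3 case as a failure, while the default multiply on a healthy machine reports none.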
“Silent data corruptions are real phenomena in datacenter applications running at scale,” members from the Facebook infrastructure team explain. “Understanding these corruptions helps us gain insights into the silicon device characteristics; through intricate instruction flows and their interactions with compilers and software architectures. Multiple strategies of detection and mitigation exist, with each contributing additional cost and complexity into a large-scale datacenter infrastructure.”
Facebook used a few reference applications to highlight the impact of silent data corruption at scale, including a Spark workflow that runs millions of wordcount computations per day and FB’s compression application, which similarly performs millions of compression/decompression computations daily. In the compression example, Facebook observed a case where the algorithm returned a size of “0” for a single file (it was supposed to be a non-zero number), so the file was not written into the decompressed output database. “As a result, the database had missing files. The missing files subsequently propagated to the application. An application keeping a list of key value store mappings for compressed files immediately observes that files that were compressed are no longer recoverable. The chain of dependencies causes the application to fail.” Pretty soon, the querying infrastructure reports back with critical data loss. The problem is clear from this one example. Imagine if it were larger than just compression or wordcount. Facebook can.
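An application-level guard against this failure mode is to verify each compressed output before trusting it. The following is a minimal sketch using Python’s standard `zlib` module; it is an assumption about how such a check could look, not the compression library or pipeline Facebook describes, and `compress_verified` is a hypothetical helper name. Note that the check runs on the same machine, so a defective CPU could in principle corrupt the verification as well.

```python
import zlib


def compress_verified(data: bytes) -> bytes:
    """Compress data and confirm the result round-trips before returning it."""
    compressed = zlib.compress(data)

    # A zero-length result for non-empty input mirrors the "0 size" symptom
    # described above and should never be accepted silently.
    if data and len(compressed) == 0:
        raise ValueError("compressor reported zero size for non-empty input")

    # Round-trip check: decompress and compare against the original bytes.
    if zlib.decompress(compressed) != data:
        raise ValueError("round-trip mismatch: compressed output is corrupt")

    return compressed
```

The trade-off is the one the article keeps returning to: every write now pays for an extra decompression, so the check buys detection at the cost of throughput.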
Data corruptions propagate across the stack and manifest as application level problems. These types of errors can result in data loss and can require months of debug engineering time… With increased silicon density and technology scaling, we believe that academic researchers and industry should invest in methods to counter these issues.
Debugging is arduous but it is still at the heart of how Facebook handles these silent data corruptions, although not until they’re loud enough to be heard. “To debug a silent error, we cannot proceed forward without understanding which machine level instructions are executed. We either need an ahead-of-time compiler for Java and Scala or we need a probe, which upon execution of the JIT code, provides the list of instructions executed.” Their best practices for silent error debugging are detailed in section 5.2.
An overall suite of fault tolerance mechanisms is also key to Facebook’s strategy. These include redundancy at the software level but, of course, this comes with costs. “The cost of redundancy has a direct effect on resources; the more redundant the architecture, the larger the duplicate resource pool requirements,” even though this is the most certain path to probabilistic fault tolerance. Less overhead-laden approaches include relying on fault tolerant libraries (PyTorch is specifically cited), although this is not “free” either: the impact on application performance is palpable.
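Software-level redundancy is often illustrated with majority voting across replicated computations. The sketch below assumes each replica is a zero-argument callable returning a hashable result; `majority_vote` is a hypothetical helper, and in practice the replicas would run on different cores or machines rather than in one process.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(replicas: Sequence[Callable[[], Hashable]]) -> Hashable:
    """Run every replica and return the value a strict majority agrees on."""
    results = [run() for run in replicas]
    value, count = Counter(results).most_common(1)[0]

    # Require more than half of the replicas to agree; otherwise no result
    # can be trusted and the caller must retry or escalate.
    if count <= len(results) // 2:
        raise RuntimeError("no majority among replica results")
    return value
```

With three replicas, one corrupted result is outvoted, but the computation costs three times the resources, which is the duplicate resource pool the quote refers to.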
“This effort would need a close handshake between the hardware silent error research community and the software library community.”
In terms of that handshake, Facebook is openly calling on datacenter device makers to understand that their largest customers are expecting more, especially given the cascading wide-net impacts of hardware-derived errors.
“Silent data corruptions are not limited to rare one in a million occurrences within a large-scale infrastructure. These errors are systemic and are not as well understood as the other failure modes like Machine Check Exceptions.” The infrastructure team adds that there are several studies evaluating techniques to reduce the soft error rate within processors, and that those lessons can be carried into similar, repeatable SDCs, which can occur at a higher rate.
A large part of the responsibility should be shared by device makers, Facebook says. These approaches are on the manufacturer’s side and can include beefing up the blocks on a device for better datapath protection using custom ECCs, providing better randomized testing, understanding increased density means higher propagation of errors and most important, understanding “at scale behavior” via “close partnership with customers using devices at scale to understand the impact of silent errors.” This would include occurrence rates, time to failure in production, dependency on frequency, and environmental issues that impact these errors.
“Facebook infrastructure has implemented multiple variants of the above hardware detection and software fault tolerant techniques in the past 18 months. Quantification of benefits and costs for each of the methods described above has helped the infrastructure to be reliable for the Facebook family of apps.” The infrastructure team plans to release a follow-on with more detail about the various trade-offs and costs for their current approaches.
More detail, including Facebook’s best practices for fault tolerance in software and architecting around potential hardware failures can be found here.
Introducing Facebook Graph API v18.0 and Marketing API v18.0
Today, we are releasing Facebook Graph API v18.0 and Marketing API v18.0. As part of this release, we are highlighting changes below that we believe are relevant to parts of our developer community. These changes include announcements, product updates, and notifications on deprecations that we believe are relevant to your application(s)’ integration with our platform.
For a complete list of all changes and their details, please visit our changelog.
Consolidation of Audience Location Status Options for Location Targeting
As previously announced in May 2023, we have consolidated Audience Location Status to our current default option of “People living in or recently in this location” when choosing the type of audience to reach within their Location Targeting selections. This update reflects a consolidation of other previously available options and removal of our “People traveling in this location” option.
We are making this change as part of our ongoing efforts to deliver more value to businesses, simplify our ads system, and streamline our targeting options in order to increase performance efficiency and remove options that have low usage.
This update will apply to new or duplicated campaigns. Existing campaigns created prior to launch will not be entered in this new experience unless they are in draft mode or duplicated.
Add “add_security_recommendation” and “code_expiration_minutes” to WA Message Templates API
Earlier this year, we released WhatsApp’s authentication solution which enabled creating and sending authentication templates with native buttons and preset authentication messages. With the release of Graph API v18, we’re making improvements to the retrieval of authentication templates, making the end-to-end authentication template process easier for BSPs and businesses.
With Graph API v18, BSPs and businesses can have better visibility into preset authentication message template content after creation. Specifically, payloads will return preset content configuration options, in addition to the text used by WhatsApp. This improvement can enable BSPs and businesses to build “edit” UIs for authentication templates that can be constructed on top of the API.
Note that errors may occur when upgrading to Graph API v18 if BSPs or businesses are taking the entire response from the GET request and providing it back to the POST request to update templates. To resolve, the body/header/footer text fields should be dropped before passing back into the API.
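A minimal sketch of that workaround is shown below. The payload shape is an assumption made for illustration (a `components` list whose entries carry a `type` and, for body/header/footer, a `text` key), and `strip_preset_text` is a hypothetical helper rather than part of any Meta SDK; check the actual response format in the developer documentation before relying on it.

```python
import copy

# Component types whose text is assumed to be preset and returned only on GET.
TEXT_COMPONENT_TYPES = {"BODY", "HEADER", "FOOTER"}


def strip_preset_text(template: dict) -> dict:
    """Return a copy of a fetched template with preset text fields removed."""
    cleaned = copy.deepcopy(template)  # leave the fetched response untouched
    for component in cleaned.get("components", []):
        if component.get("type") in TEXT_COMPONENT_TYPES:
            component.pop("text", None)
    return cleaned
```

The cleaned dictionary, rather than the raw GET response, would then be used as the body of the update request.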
Re-launching dev docs and changelogs for creating Call Ads
- Facebook Reels Placement for Call Ads
Meta is releasing the ability to deliver Call Ads through the Facebook Reels platform. Call ads allow users to call businesses in the moment of consideration when they view an ad, and help businesses drive more complex discussions with interested users. This is an opportunity for businesses to advertise with call ads based on people’s real-time behavior on Facebook. At the ad set level within Ads Manager, businesses can choose to add “Facebook Reels” under the Placements section.
- Re-Launching Call Ads via API
On September 12, 2023, we’re providing updated guidance on how to create Call Ads via the API. We are introducing documentation solely for Call Ads, so that 3P developers can more easily create Call Ads campaigns and know how to view insights about their ongoing Call Ads campaigns, including call-related metrics. In the future, we also plan to support Call Add-ons via our API platform. To create Call Ads via the API platform, developers need the same general permissions that are required to create other ads.
Please refer to developer documentation for additional information.
Deprecations & Breaking Changes
Graph API changes for user granular permission feature
We are updating two graph API endpoints for WhatsAppBusinessAccount. These endpoints are as follows:
- Retrieve message templates associated with WhatsAppBusiness Account
- Retrieve phone numbers associated with WhatsAppBusiness Account
With v18, we are rolling out a new feature “user granular permission”. All existing users who are already added to WhatsAppBusinessAccount will be backfilled and will continue to have access (no impact).
The admin has the flexibility to change these permissions. If the admin changes the permission and removes access to view message templates or phone numbers for one of their users, that specific user will start getting an error message saying you do not have permission to view message templates or phone numbers on all versions v18 and older.
Deprecate legacy metrics naming for IG Media and User Insights
Starting on September 12, Instagram will remove duplicative and legacy insights metrics from the Instagram Graph API in order to share a single source of metrics with our developers.
This new upgrade reduces any confusion as well as increases the reliability and quality of our reporting.
Ninety days after this launch (i.e., December 11, 2023), we will remove all these duplicative and legacy insights metrics from the Instagram Graph API on all versions in order to be more consistent with the Instagram app.
We appreciate all the feedback that we’ve received from our developer community, and look forward to continuing to work together.
Deprecate all Facebook Wi-Fi v1 and Facebook Wi-Fi v2 endpoints
Facebook Wi-Fi was designed to improve the experience of connecting to Wi-Fi hotspots at businesses. It allowed a merchant’s customers to get free Wi-Fi simply by checking in on Facebook. It also allowed merchants to control who could use their Wi-Fi and for how long, and integrated with ads to enable targeting to customers who had used the merchant’s Wi-Fi. This product was deprecated on June 12, 2023. As the partner notice period has ended, all endpoints used by Facebook Wi-Fi v1 and Facebook Wi-Fi v2 have been deprecated and removed.
API Version Deprecations:
- September 14, 2023: Graph API v11.0 will be deprecated and removed from the platform
- February 8, 2024: Graph API v12.0 will be deprecated and removed from the platform
- May 28, 2024: Graph API v13.0 will be deprecated and removed from the platform
- September 20, 2023: Marketing API v14.0 will be deprecated and removed from the platform
- September 20, 2023: Marketing API v15.0 will be deprecated and removed from the platform
- February 06, 2024: Marketing API v16.0 will be deprecated and removed from the platform
To avoid disruption to your business, we recommend migrating all calls to the latest API version that launched today.
Facebook Platform SDK
As part of our 2-year deprecation schedule for Platform SDKs, please note the upcoming deprecations and sunsets:
- October 2023: Facebook Platform SDK v11.0 or below will be sunset
- February 2024: Facebook Platform SDK v12.0 or below will be sunset
First seen at developers.facebook.com
Allowing Users to Promote Stories as Ads (via Marketing API)
Before today (August 28, 2023), advertisers could not promote images and/or videos used in Instagram Stories as ads via the Instagram Marketing API. This process created unwanted friction for our partners and their customers.
After consistently hearing about this pain point from our developer community, we have removed this unwanted friction for advertisers and now allow users to seamlessly promote their image and/or video media used in Instagram Stories as ads via the Instagram Marketing API as of August 28, 2023.
We appreciate all the feedback received from our developer community, and hope to continue improving your experience.
Please review the developer documentation to learn more.
First seen at developers.facebook.com
Launching second release of Facebook Reels API: An enterprise solution for desktop and web publishers
We’re excited to announce that the second release of FB Reels API is now publicly available for third-party developers. FB Reels API enables users of third-party platforms to share Reels directly to public Facebook Pages and the New Pages Experience.
FB Reels API has grown significantly since the first release in September 2022. The new version of the API now supports custom thumbnails, automatic music tagging, tagging collaborators, longer reels, and better error handling.
FB Reels API will also support scheduling and draft capability to allow creators to take advantage of tools provided either by Meta or by our partners. Based on the feedback we received from our partners, we’ll now provide additional audio insights via the Audio Recommendations API and reels performance metrics via the Insights API.
Our goal in the next couple of releases is to continue to make it easier for creators to develop quality content by adding features like early copyright detection and A/B testing. We’re also excited to start working on enhanced creation features like video clipping, so stay tuned to hear more about those features in the future.
If you are a developer interested in integrating with the Facebook Reels API, please refer to the Developer Documents for more info.
Not sure if this product is for you? Check out our entire suite of sharing offerings.
Tune in to Product @scale event to learn more about FB Video APIs and hear from some of our customers.
First seen at developers.facebook.com