In 2015, Facebook introduced a feature that enabled users to go live for their followers and friends on the platform. It was initially available to only a few celebrities but was eventually rolled out to all users.
From a user’s point of view, the feature looks very simple: you tap ‘Go Live’ and all your friends can watch your video almost instantly, just like a video call with a lot of people. But behind the scenes, the infrastructure that makes this possible is carefully designed and not at all intuitive.
Today, we are going to discuss exactly that. This is based on ideas I collected from a few articles and primarily from this talk by Sachin Kulkarni, who was the Video Infra Director at Facebook back then.
Various video streaming protocols
When it comes to video streaming, different companies use different protocols. Zoom uses WebRTC, while Netflix uses something called DASH (Dynamic Adaptive Streaming over HTTP).
This decision is mostly based on the functionality and the company’s existing infrastructure.
WebRTC
WebRTC is a real-time communication protocol that was open-sourced by Google and is widely used by various video and audio streaming products.
It is based on UDP and provides a real-time streaming experience for users. UDP doesn’t wait for an acknowledgment of each data packet, so it has lower latency than TCP.
TCP requires the receiver to acknowledge that each data packet has been received, and retransmits anything that is lost. Because of that back and forth, there is slightly more latency while streaming. For video calls, real-time transmission is what matters, and even if some packets get lost, it doesn’t hurt the experience much.
That’s one of the reasons Zoom uses WebRTC for transmission!
WebRTC is also a peer-to-peer protocol, which means the media can flow directly between the participants without a server in the middle (apart from the signaling needed to set up the connection). Modern web browsers support the WebRTC APIs, so you can do all of this inside a web browser as well!
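To make that concrete, here is a minimal browser-side sketch of the standard WebRTC APIs. It only captures the camera and creates an offer; the signaling step is application-specific, so it is stubbed out as a hypothetical sendOfferToPeer function.

```typescript
// Minimal WebRTC sketch: capture the camera/mic and offer a peer-to-peer connection.
// Signaling (exchanging the offer/answer and ICE candidates between peers) is
// application-specific, so it's stubbed out here as a hypothetical sendOfferToPeer().
async function startPeerToPeerBroadcast(sendOfferToPeer: (sdp: string) => void): Promise<void> {
  // Ask the browser for camera and microphone access.
  const media = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });

  // Create a peer connection; a STUN server lets peers discover their public addresses.
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
  });

  // Attach every captured track so the media flows directly to the remote peer over UDP.
  media.getTracks().forEach((track) => pc.addTrack(track, media));

  // Create and apply an SDP offer describing the media we want to send.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // Hand the offer to whatever signaling channel the app uses (WebSocket, HTTP, ...).
  sendOfferToPeer(offer.sdp ?? "");
}
```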
RTMP
RTMP (Real-Time Messaging Protocol) is based on TCP, so it carries the extra latency discussed above. But it’s widely used in many video applications (YouTube Live, for example, accepts RTMP broadcasts).
RTMP also has a secure variant, RTMPS, which wraps the stream in TLS so the data is encrypted before transmission.
What does Facebook Live use?
Since WebRTC runs over UDP, and UDP is not supported by Facebook’s existing infrastructure, they needed a TCP-based solution. Hence, the Facebook team decided to go with RTMPS (the secure variant of RTMP) for the broadcast client and DASH for streaming to viewers.
How is Facebook Live different from other video streaming methods?
Naturally, you might ask how Facebook Live differs from video calls and from other video streaming platforms like YouTube.
There are some differences.
Unlike a video call, it’s not a one-to-one or one-to-few stream; it’s a one-to-many stream. Anyone who gets access to the live stream can view it.
Because of the real-time nature of a live stream, delays cannot be tolerated. So even if an end user’s internet connection is poor, the Facebook team needs a way to deliver the stream in real time and keep the experience good.
Also, since it’s real time, you cannot cache the content beforehand and pre-populate it on the servers you expect to face heavy load.
Another challenge with live streaming is that you cannot predict the number of viewers beforehand. Any stream can go viral, with a flood of users pouring in to watch it in real time.
Overview of Facebook Live’s infrastructure
There are three main components in this setup:
The Broadcast client that generates the feed
A stack of servers known as Point of Presence or PoP
Another, more complex, stack of servers that is called a Data Center.
There are multiple PoPs distributed around the globe and a smaller number of Data Centers. The PoPs contain a bunch of Proxygen hosts and a caching mechanism.
Proxygen is Facebook’s open-source HTTP framework that they employ in their servers for load balancing, reverse proxying, and so on.
Just like the PoPs, the data centers also contain Proxygen hosts and a caching layer, but they additionally contain servers that encode the video stream. One of the ways Facebook is able to stream the video without much delay is by using adaptive bitrate.
What is Adaptive Bitrate?
Adaptive Bitrate is a widely used method to speed up video streaming. When a video is uploaded, it’s broken down into smaller parts, and each part is stored at multiple resolutions.
When a client requests a video, the resolution that best suits their bandwidth is sent to them.
That’s the reason you sometimes see the quality shift in the middle of a video on YouTube.
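Here is a rough sketch of the idea (generic player-side logic, not Facebook’s actual implementation; the rendition table and the 80% headroom factor are made-up examples): measure the recent throughput and pick the highest rendition that fits under it.

```typescript
// Adaptive bitrate sketch: the same content is encoded at several bitrates,
// and the player picks the best rendition its measured bandwidth can sustain.
interface Rendition {
  resolution: string;   // e.g. "1080p"
  bitrateKbps: number;  // average bitrate of this encoding
}

const renditions: Rendition[] = [
  { resolution: "1080p", bitrateKbps: 4500 },
  { resolution: "720p",  bitrateKbps: 2500 },
  { resolution: "480p",  bitrateKbps: 1000 },
  { resolution: "240p",  bitrateKbps:  400 },
];

// Pick the highest-quality rendition that fits within the measured bandwidth,
// keeping some headroom so playback doesn't stall on small fluctuations.
function pickRendition(measuredKbps: number, headroom = 0.8): Rendition {
  const budget = measuredKbps * headroom;
  const fitting = renditions.filter((r) => r.bitrateKbps <= budget);
  // Fall back to the lowest rendition if even that doesn't fit.
  return fitting[0] ?? renditions[renditions.length - 1];
}

console.log(pickRendition(3200).resolution); // "720p" (budget of 2560 kbps)
```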
Broadcasting a live stream
When a client wants to start a live stream, it sends a request to the nearest PoP. The PoP returns three parameters associated with that stream:
Stream ID: This identifies the stream uniquely. It is also the key used for consistent hashing, which maps the stream to a particular server and keeps choosing the same server every time that stream is requested; if that server goes down, the stream can be remapped to another server with minimal disruption (see the sketch after this list).
Security tokens: These are used for security purposes such as encrypting the video stream and authenticating the user.
URI: This helps the client connect to the right data center.
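As a side note, here is a small sketch of how consistent hashing works in general (a generic hash ring, not Facebook’s implementation; the server names are made up): the stream ID is hashed onto a ring of servers, the same ID always lands on the same server, and if that server disappears the stream simply rolls over to the next one on the ring.

```typescript
// Consistent hashing sketch: map a stream ID onto a ring of servers so that the
// same ID always resolves to the same server, and removing a server only remaps
// the streams that were living on it.
function fnv1a(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

class HashRing {
  // Ring positions sorted ascending so we can walk "clockwise" to the next server.
  private ring: { position: number; server: string }[] = [];

  constructor(servers: string[], replicas = 100) {
    for (const server of servers) {
      // Virtual nodes spread each server more evenly around the ring.
      for (let i = 0; i < replicas; i++) {
        this.ring.push({ position: fnv1a(`${server}#${i}`), server });
      }
    }
    this.ring.sort((a, b) => a.position - b.position);
  }

  // The stream maps to the first ring position at or after its hash (wrapping around).
  lookup(streamId: string): string {
    const h = fnv1a(streamId);
    const entry = this.ring.find((e) => e.position >= h) ?? this.ring[0];
    return entry.server;
  }

  remove(server: string): void {
    this.ring = this.ring.filter((e) => e.server !== server);
  }
}

const ring = new HashRing(["encoder-a", "encoder-b", "encoder-c"]);
console.log(ring.lookup("stream-123456"));  // always the same server for this stream ID
ring.remove(ring.lookup("stream-123456"));  // if that server goes down...
console.log(ring.lookup("stream-123456"));  // ...the stream rolls over to the next server
```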
Once the client connects to a PoP, the PoP decides which data center to forward the stream to. The data center stores the stream and encodes it into multiple formats. Whenever a streaming client requests the live stream, the appropriate video resolution is sent across.
Adaptive bitrate is also employed on the broadcasting side to enable a better user experience.
Streaming a live stream
A live stream is broken down into one-second segments, and information about each segment, such as its URI, is stored in a manifest file. This manifest file lives in the data center and, upon request, is sent out to the other servers.
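As a simplified stand-in (real DASH manifests are XML and carry more detail; the field names here are illustrative), the manifest is essentially a rolling index of the most recent segments per rendition:

```typescript
// Simplified stand-in for a live manifest: a rolling index of the most recent
// ~1-second segments for each encoded rendition. Real DASH manifests are XML,
// but the information they carry is roughly this.
interface SegmentEntry {
  sequence: number;     // monotonically increasing segment number
  durationSec: number;  // roughly one second for a live stream
  uri: string;          // where the player can fetch this segment from
}

interface LiveManifest {
  streamId: string;
  renditions: Record<string, SegmentEntry[]>;  // rendition name -> recent segments
}

// Append the newest segment and keep only a short sliding window, since a live
// player only ever needs the most recent few seconds of the stream.
function addSegment(manifest: LiveManifest, rendition: string, seg: SegmentEntry, windowSize = 10): void {
  const list = manifest.renditions[rendition] ?? (manifest.renditions[rendition] = []);
  list.push(seg);
  if (list.length > windowSize) list.shift();
}
```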
The main function of these intermediate servers is to cache the data so that the load is distributed properly.
Whenever a streaming client requests a stream, it connects to a nearby PoP. The PoP checks its cache for the requested stream; if it cannot find it, it connects to the right data center and fetches the stream data. The data center itself has a caching layer that caches the data coming from the encoding servers.
When a client requests a stream, both caching layers, in the data center as well as in the PoP, are populated before the data is sent to the client. This multi-level caching helps distribute the load evenly.
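Here is a sketch of that two-level lookup (the CacheTier class and the fetchFromEncoder function are made up for illustration; real caches would also handle expiry and size limits):

```typescript
// Two-level caching sketch: the viewer hits a PoP; on a cache miss the PoP asks
// the data center, whose own cache shields the encoding servers. Both caches are
// populated on the way back, so later viewers are served from the edge.
type Fetcher = (segmentUri: string) => Promise<Uint8Array>;

class CacheTier {
  private cache = new Map<string, Uint8Array>();
  constructor(private upstream: Fetcher) {}

  async get(segmentUri: string): Promise<Uint8Array> {
    const hit = this.cache.get(segmentUri);
    if (hit) return hit;                           // serve from this tier's cache
    const data = await this.upstream(segmentUri);  // miss: go one level up
    this.cache.set(segmentUri, data);              // populate on the way back down
    return data;
  }
}

// Hypothetical origin fetch standing in for the encoding hosts in the data center.
const fetchFromEncoder: Fetcher = async (_uri) => new Uint8Array();

const dataCenterCache = new CacheTier(fetchFromEncoder);
const popCache = new CacheTier((uri) => dataCenterCache.get(uri));

// The first request travels PoP -> data center -> encoder; repeats are served at the PoP.
void popCache.get("/live/stream-123/720p/segment-42.m4s");
```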
There’s another great optimization that the Facebook team did.
Let’s say there are multiple users requesting the same live stream at the same time. Each request would hit the PoP, see that the stream data is not in the cache, hit the data center, fetch the stream data, and then populate the cache.
But if multiple people request the same stream, Facebook doesn’t want all of those requests to hit the data center at the same time. So subsequent requests are made to wait while the first request is still fetching the data from the data center and populating the cache.
Once the cache is populated, the waiting requests are served straight from the cache.
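In code, that idea (often called request coalescing) can be as simple as sharing one in-flight promise per cache key. The sketch below is illustrative; fetchFromDataCenter is a hypothetical stand-in for the PoP-to-data-center hop.

```typescript
// Request-coalescing sketch: if many viewers ask the PoP for the same uncached
// segment at once, only the first request goes to the data center; the rest
// wait on the same in-flight promise and are then served from the cache.
const cache = new Map<string, Uint8Array>();
const inFlight = new Map<string, Promise<Uint8Array>>();

// Hypothetical upstream fetch standing in for the PoP -> data center hop.
declare function fetchFromDataCenter(segmentUri: string): Promise<Uint8Array>;

async function getSegment(segmentUri: string): Promise<Uint8Array> {
  const cached = cache.get(segmentUri);
  if (cached) return cached;                  // warm cache: serve immediately

  let pending = inFlight.get(segmentUri);
  if (!pending) {
    // First request for this segment: actually go to the data center.
    pending = fetchFromDataCenter(segmentUri)
      .then((data) => {
        cache.set(segmentUri, data);          // populate the PoP cache
        return data;
      })
      .finally(() => inFlight.delete(segmentUri));
    inFlight.set(segmentUri, pending);
  }
  // Everyone else piggybacks on the same in-flight request.
  return pending;
}
```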
Wrapping up
Oftentimes, what seems simple on the user’s end has highly complex strategies working in the background.
I would highly recommend watching the talk to get a better insight into the problem and how they solved it.
If you like my content and want to support me to keep me going, consider buying me a coffee ☕️ ☕️
Connect with me!
If you need help with your service architecture, you can email me 💌 : dennysam14@gmail.com