Introduction
Handling massive amounts of data in real time is a challenge for any technology company. Twitter processes roughly 400 billion events per day, and to manage that scale efficiently it rebuilt its real-time data processing architecture. This article looks at that transition and at how the Kappa model improved both performance and accuracy.
The Shift to Kappa Architecture
Twitter's new architecture adopts the Kappa model, which simplifies processing by using a single real-time pipeline. It replaced the older Lambda architecture, which maintained separate batch and real-time pipelines. The new system preprocesses Kafka events on-premises, converts them to Google Pub/Sub messages, and uses Google Dataflow for deduplication and aggregation. The final results are stored in Bigtable.
This streamlined design eliminates the batch pipeline entirely, significantly reducing complexity and cost while handling late events robustly and minimizing data loss.
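To make the flow concrete, here is a minimal sketch of such a streaming pipeline using the Apache Beam Python SDK, which is how Dataflow jobs are typically written. The topic name, event schema, deduplication key, and aggregation (interactions per user) are illustrative assumptions, not Twitter's actual job.

```python
# Sketch of a Kappa-style streaming pipeline: Pub/Sub -> dedup -> aggregate.
# Topic, schema, and keys are hypothetical; run with --runner=DataflowRunner.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def parse(raw: bytes):
    """Decode a JSON Pub/Sub message into an (event_id, user_id) pair."""
    event = json.loads(raw.decode("utf-8"))
    return event["event_id"], event["user_id"]


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/interaction-events")
        | "Parse" >> beam.Map(parse)
        # Fixed windows bound how long dedup/aggregation state is held.
        | "Window" >> beam.WindowInto(FixedWindows(60))
        # At-least-once publishing can deliver the same event twice; dropping
        # duplicate (event_id, user_id) pairs restores effectively-exactly-once
        # semantics within the window.
        | "Dedup" >> beam.Distinct()
        | "KeyByUser" >> beam.Map(lambda kv: (kv[1], 1))
        | "CountPerUser" >> beam.CombinePerKey(sum)
        # Production would write these aggregates to Bigtable; print is a
        # stand-in sink for the sketch.
        | "Sink" >> beam.Map(print)
    )
```

Windowing is what keeps a single streaming pipeline tractable at this throughput: the runner only has to retain deduplication and aggregation state for the duration of each window.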
Key Metrics of the New Architecture
- Latency: The new architecture achieves a stable latency of ~10 seconds, a significant improvement from the previous range of 10 seconds to 10 minutes.
- Throughput: The throughput has increased dramatically to ~1GB/s, compared to the old architecture's maximum of ~100MB/s.
- Processing Accuracy: Effectively exactly-once processing is achieved by publishing data at least once to Google Pub/Sub and deduplicating it in Dataflow (see the sketch after this list).
- Cost Efficiency: The elimination of batch pipelines results in considerable cost savings, simplifying the overall architecture.
- Data Accuracy: Over 95% of the new pipeline results match the old batch pipeline, with enhanced handling of late events.
- Event Loss: The new system ensures no event loss occurs during restarts, maintaining data integrity and reliability.
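The at-least-once-plus-deduplication idea from the accuracy bullet above can be illustrated in a few lines of Python; the event IDs and values are made up, but the mechanism is the same: replayed deliveries are recognized by ID and counted only once.

```python
# Toy illustration of at-least-once delivery + idempotent deduplication.
# Event IDs and values are hypothetical.
seen_ids: set[str] = set()
total = 0


def process(event_id: str, value: int) -> None:
    """Apply an event to the aggregate, ignoring duplicate deliveries."""
    global total
    if event_id in seen_ids:  # redelivered event: drop it
        return
    seen_ids.add(event_id)
    total += value


# The publisher retries "e1", so it arrives twice:
for event_id, value in [("e1", 5), ("e2", 3), ("e1", 5)]:
    process(event_id, value)

print(total)  # 8, not 13: the duplicate was suppressed
```

In the real pipeline, the "seen" state is kept per window by Dataflow rather than in an unbounded in-memory set, which is why bounded windows matter.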
Conclusion
By adopting the Kappa architecture, Twitter has significantly improved its real-time data processing capabilities. The new system enhances performance and accuracy while simplifying the data pipeline and lowering its cost. Twitter's approach sets a high bar for handling massive-scale data in real time, with robust and reliable processing of 400 billion events daily.
Stay tuned for more insights on cutting-edge data processing technologies in our next blog.
Raman Sapezhka
CEO/CTO, Plantago