How to Stream 100B+ Events Daily with Spark Structured Streaming — Architecture, Scalability
Passionate software engineer and technical leader with vast experience in developing high-load systems and Big Data infrastructure, and in building cloud services from the garage phase through public launch.
At AppsFlyer we ingest more than 100 billion events daily through our Kafka operation, which are then stored in a data warehouse hosted on Amazon S3. As AppsFlyer's growth and scale increased, data latency and system resilience became real pain points, and we realized we had to rethink our approach and find a better way to process our raw data.
This talk will focus on how we migrated our raw data ingestion system to a solution based on Spark Structured Streaming.
I'll discuss what Spark Structured Streaming actually is, what motivated the migration, and why it was the right solution for us. You will see some of the challenges we faced during implementation, such as choosing the right data partitioning scheme, ensuring continued compliance with our GDPR solution, and building the tooling needed to support the migration while preserving our exactly-once guarantees.
The solution and the approaches presented in this talk can be applied to your own data pipelines to build more resilient systems and more correct data flows.