Your mission, should you decide to accept it, is the following: aggregate temperature values from a CSV file and group them by weather station name. There’s only one caveat: the file has 1,000,000,000 rows!
This is the task of the “One Billion Row Challenge” which went viral within the Java community earlier this year. Join me for this talk, where I’ll dive into some of the tricks employed by the fastest solutions for processing the challenge’s 13 GB input file in less than two seconds. Parallelization and efficient memory access, optimized parsing routines using SIMD and SWAR, and custom map implementations are just some of the topics we’ll discuss.
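To make the SWAR (SIMD Within A Register) idea concrete, here is a minimal sketch of a trick widely used by fast 1BRC solutions: locating the `;` separator in eight bytes at once with plain 64-bit arithmetic, instead of checking one byte at a time. The class and method names are illustrative, not taken from any particular submission.

```java
// SWAR sketch: find the first ';' byte in a 64-bit word.
// Matching bytes become 0x00 after the XOR; the classic
// (x - 0x01...) & ~x & 0x80... trick then flags zero bytes.
public class SwarDemo {

    private static final long SEMI_PATTERN = 0x3B3B3B3B3B3B3B3BL; // ';' in every byte

    // Returns the index (0-7) of the first ';' in the little-endian word,
    // or 8 if the word contains no ';'.
    static int firstSemicolon(long word) {
        long diff = word ^ SEMI_PATTERN;
        long match = (diff - 0x0101010101010101L) & ~diff & 0x8080808080808080L;
        return Long.numberOfTrailingZeros(match) >>> 3;
    }

    public static void main(String[] args) {
        // Assemble the first 8 bytes of a row into a little-endian word.
        byte[] bytes = "Hamburg;12.3".getBytes();
        long word = 0;
        for (int i = 7; i >= 0; i--) {
            word = (word << 8) | (bytes[i] & 0xFF);
        }
        System.out.println(firstSemicolon(word)); // ';' sits at byte index 7
    }
}
```

Real solutions apply this to memory-mapped file regions, but the bit manipulation is the same.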
I will also share some of the personal experiences and lessons I learned while running this challenge for and with the community.
Interview:
What key takeaways can attendees expect from your InfoQ Dev Summit session?
- Java is fast, really fast: you can process one billion rows in less than two seconds, and we'll discuss some of the techniques for doing so.
- Getting good performance is about getting the basics right and avoiding easy mistakes.
- Efforts increase exponentially the further you get on the optimization curve; decide carefully how far you should go.
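As a baseline for the techniques mentioned above, the core aggregation pattern can be sketched with plain parallel streams. This is a hedged, in-memory illustration, not a tuned 1BRC entry: real solutions memory-map the file and use custom maps, and the class and record names here are mine.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Minimal parallel min/mean/max aggregation per station: split the work
// across threads, accumulate partial stats, merge at the end.
public class ParallelAggDemo {

    record Stats(double min, double max, double sum, long count) {
        Stats merge(Stats other) {
            return new Stats(Math.min(min, other.min), Math.max(max, other.max),
                             sum + other.sum, count + other.count);
        }
        double mean() { return sum / count; }
    }

    static Map<String, Stats> aggregate(List<String> lines) {
        return lines.parallelStream()
            .map(line -> line.split(";"))                 // "station;temperature"
            .collect(Collectors.toConcurrentMap(
                parts -> parts[0],
                parts -> {
                    double t = Double.parseDouble(parts[1]);
                    return new Stats(t, t, t, 1);
                },
                Stats::merge));                           // combine partial results
    }

    public static void main(String[] args) {
        var result = aggregate(List.of("Hamburg;12.0", "Munich;8.6", "Hamburg;8.0"));
        System.out.println(result.get("Hamburg").mean()); // prints 10.0
    }
}
```

The gap between this idiomatic baseline and the sub-two-second leaders is exactly where the optimization curve from the last takeaway gets steep.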
What's the focus of your work these days?
I work on the data platform team at Decodable, a SaaS for real-time ETL and stream processing based on Apache Flink and Debezium. One insight is that once people have seen their first data use case in real-time, they usually want more of the same. A great developer experience is key here, so that people can develop their data pipelines with ease and flexibility, for instance with tools to preview stream processing jobs, reprocess data, manage and evolve data schemas, and more.
What technical aspects of your role are most important?
Given we are a start-up, the engineering team is relatively small, so you need to be versatile: from the foundations of Java and how to use it effectively and efficiently, to a deep understanding of the semantics of stream processing, all the way up to building a distributed platform running on top of Kubernetes.
How does your InfoQ Dev Summit Munich session address current challenges or trends in the industry?
One of my goals for this talk is to debunk the myth that Java is outdated or slow. As the results of the challenge have shown, nothing could be further from the truth. With the time of low interest rates and free money behind us, it's more critical than ever to make the most out of the compute resources available to you. The techniques and tools discussed in the talk can help with that.
How do you see the concepts discussed in your InfoQ Dev Summit Munich session shaping the future of the industry?
I hope this talk can give people some ideas for building more efficient applications, while not forgetting about other aspects such as maintainability. Compute-heavy tasks like 1BRC can also be used as a benchmark for informing decisions on changes to the Java platform, such as the planned removal of the memory access methods in Java's Unsafe class.
Speaker
Gunnar Morling
Senior Staff Software Engineer @Decodableco, Open Source Aficionado, Previously Debezium Project Lead @Red Hat, Creator of The One Billion Row Challenge
Gunnar Morling is a software engineer and open-source enthusiast by heart, currently working at Decodable on stream processing based on Apache Flink. In his prior role as a software engineer at Red Hat, he led the Debezium project, a distributed platform for change data capture. He is a Java Champion and has founded multiple open source projects such as JfrUnit, kcctl, and MapStruct. Gunnar is an avid blogger (morling.dev) and has spoken at a wide range of conferences like QCon, JavaOne, and Devoxx. He lives in Hamburg, Germany.