Name: Apache Beam (incubating): Unified batch and streaming data processing
Start: 2016-05-19T16:00:00-0700
End: 2016-05-19T16:20:00-0700

Data By the Bay is the first Data Grid conference matrix with 6 vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.


Apache Beam (incubating): Unified batch and streaming data processing

This talk traces the evolution of ideas in Google's data processing tools over the past 13 years: from classic MapReduce, to strongly consistent stream processing with MillWheel, to the unified batch and streaming programming model of Apache Beam.

Beam is based on years of experience developing Big Data infrastructure within Google (such as MapReduce, FlumeJava, and MillWheel) and has now been donated to the open-source community at large.

Beam cleanly separates the different aspects of temporal data processing: what computation to apply, where in event time to apply it, when in processing time to produce results, and how to refine the results as late data arrives. By decoupling semantics from the underlying execution environment, Beam provides portability across multiple runners, both proprietary (e.g., Google Cloud Dataflow) and open-source (e.g., Flink and Spark).
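
To make the four questions concrete, here is a toy sketch in plain Python. It is not the Beam SDK and does not use any Beam API; the names `WINDOW_SIZE`, `window_start`, and `run`, as well as the sample data, are hypothetical and chosen only for illustration. It assumes a sum as the computation, fixed 60-second event-time windows, a result emitted on every arriving element, and accumulating refinements.

```python
from collections import defaultdict

# "Where": fixed windows of 60 seconds in event time (assumed size).
WINDOW_SIZE = 60


def window_start(event_time):
    """Map an event timestamp to the start of its fixed window."""
    return event_time - event_time % WINDOW_SIZE


def run(arrivals):
    """Process (event_time, value) pairs in the order they arrive.

    what:  sum the values
    where: fixed 60-second windows in event time
    when:  emit a result every time an element arrives
    how:   accumulate, so each emission refines the previous one
    """
    sums = defaultdict(int)
    emitted = []
    for event_time, value in arrivals:
        w = window_start(event_time)
        sums[w] += value
        emitted.append((w, sums[w]))
    return emitted


# Arrival (processing-time) order. The third element is late: its event
# time belongs to window 0, but it arrives after window 60 has data.
arrivals = [(5, 3), (65, 4), (50, 2), (70, 1)]
results = run(arrivals)
for window, total in results:
    print(f"window starting at {window}s: running sum = {total}")
```

The late element does not disturb window 60; it only causes window 0 to emit a refined sum. Changing any one of the four choices (for example, emitting less often) leaves the other three untouched, which is the separation the model aims for.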

I will give an overview of the programming model and current status of the project and invite you to participate in its rapidly developing ecosystem.

Speakers

Eugene Kirpichov

Senior Software Engineer, Google

I'm an engineer on the Google Cloud Dataflow team. I'm interested in some programming- and math-related topics, equality-related issues, cognitive psychology, and a bunch of other things.

Thursday May 19, 2016 4:00pm - 4:20pm PDT
Ada

Pipelines

Data By the Bay
