Data By the Bay is the first Data Grid conference matrix with 6 vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.
Friday, May 20 • 11:10am - 11:30am
A Scalable GA4GH Server Implementation


Genomics and health data come in large volumes, are usually distributed across remote data centers, and carry many constraints related to privacy and confidentiality. Scalability is required at two levels. Within a single data center, distributed computing technologies such as Apache Spark, scalable machine learning libraries, and distributed databases are a good match. Across data centers, the scheme for sharing data and data processing methods must be guided by interoperability standards. The Global Alliance for Genomics and Health (GA4GH) is defining such a standard. We present an implementation of a GA4GH server that uses distributed computing and databases as its back-end engine, providing a scalable reference implementation. We also show how to extend the GA4GH server with new functionality, such as requesting model estimation (machine learning) and predictions from those models. Finally, using the Spark Notebook as an interactive tool, we show how to generate a client for the GA4GH server and how to execute methods on the server.
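
As a rough illustration of what executing a method on a GA4GH server can look like from the client side, the sketch below builds a variants search request in Python. It is not the client shown in the talk. The server address, the identifier "example-variant-set", and the helper name build_variant_search are hypothetical, and the endpoint path and JSON field names are assumed from the GA4GH API schemas of that period, so the server presented here may differ. The request is only constructed, not sent.

```python
import json
from urllib.request import Request

# Hypothetical server address; the abstract does not name one.
BASE_URL = "http://localhost:8000"


def build_variant_search(variant_set_id, reference_name, start, end,
                         page_size=100, page_token=None):
    """Build (but do not send) a POST request for a variants search.

    Field names are assumed from the GA4GH API schemas, not taken
    from the talk; check them against the server you target.
    """
    body = {
        "variantSetId": variant_set_id,
        "referenceName": reference_name,
        "start": start,
        "end": end,
        "pageSize": page_size,
        "pageToken": page_token,  # None asks for the first page
    }
    return Request(
        BASE_URL + "/variants/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative identifiers only.
request = build_variant_search("example-variant-set", "1", 10000, 20000)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it. A paginated server would be expected to return a token for the next page, which can be fed back in through page_token.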

Speakers

Andy Petrella

Cofounder, Data Fellas
Creator of Spark Notebook


Friday May 20, 2016 11:10am - 11:30am PDT
Ada
