Closing the Gap Between Compute and Data
Workshop at 2022 UC Santa Cruz Open Source Symposium
Session Chair: Carlos Maltzahn
Wednesday, Sept 28, 11am-12:30pm Pacific Time
The ecosystem around Apache Arrow and Substrait provides exciting opportunities for caching (data close to compute) and pushdown (compute close to data). In this session we are looking at emerging open source infrastructures in industry as well as opportunities for university research and contributions.
Agenda
Shouwei Chen and Hope Wang (Alluxio): Tackling I/O Challenges in Modern Data Lakes
Abstract: It has become increasingly popular to build modern open-source data lakes for big data analytics and AI workloads. The architecture of data platforms has been evolving heavily over the past few years with many open source communities participating and collaborating in this movement. Many focus on better and more cloud-native approaches to serve metadata for structured data, but challenges remain for retrieving data more efficiently and providing sufficient bandwidth. For example, the scalability and cost-efficiency of cloud-native storage services are driving many organizations to embrace hybrid or multi-cloud architectures. It is important to present data in the lake efficiently to the computation with I/O bandwidth shared fairly across lake users. On the application side, the I/O workload is also quickly evolving in its patterns. For example, recent machine learning jobs tend to retrieve hundreds of millions of relatively small files/objects in training, which increasingly challenges the scalability, cost-efficiency and throughput of metadata serving.
In this talk, Shouwei Chen and Hope Wang will provide their views based on observations from working with many open source users. They will share their analysis of these industry trends, challenges, and success stories from working in the open source ecosystem.
Bio: Dr. Shouwei Chen is a core maintainer and product manager of open-source Alluxio. Before joining Alluxio, Shouwei received a Ph.D. degree from Rutgers University. Shouwei’s research focuses on the co-design of memory-centric computing frameworks with in-memory distributed file systems in large-scale environments.
Bio: Hope Wang is a technical marketer, an evangelist and open-source contributor of Alluxio, and an advocate for women in tech. Prior to joining Alluxio, Hope was a Venture Capitalist focusing on emerging technologies. Previously, she was the Data Architect at China Mobile. Hope earned bachelor's degrees in Computer Science & Economics and a master's degree in Software Engineering from Peking University, as well as an MBA from the University of Southern California - Marshall School of Business.
Weston Pace (Voltron Data): Arrow-native query engines
Abstract: From its origins as a standard format for representing columnar data in memory, the Apache Arrow project and the ecosystem around it have grown to include numerous tools and integrations for working natively with Arrow-formatted data. Recently, several query engines have emerged that are capable of processing analytic queries on Arrow-formatted data. This talk gives an overview of these engines—including Acero, DataFusion, DuckDB, and Velox—and provides additional details about Acero, the streaming query engine built into the C++ library.
Bio: Weston Pace has been doing full stack development in the telecom industry for over 7 years. Pace earned a B.S. in Computer Science and Mathematics and an M.S. in Computer Science from Colorado State University.
Ian Cook (Voltron Data): Arrow and Substrait: The Missing Glue in the Deconstructed Database
Abstract: Arrow defines a standard for representing data. Substrait defines a standard for representing operations on data. Learn how Arrow and Substrait enable interoperability between the decoupled layers of modern data platforms.
Bio: Ian Cook is the Product Management Director at Voltron Data. He previously worked on curriculum development, first at TIBCO Software and then at Cloudera. He is the co-founder and director of the Research Triangle Analysts and the founder and director of the Raleigh-Durham-Chapel Hill R Users Group. Cook earned a B.S. in Applied Mathematics from Stony Brook University and an M.S. in Statistics from Lehigh University.