Building a SIMD Supported Vectorized Native Engine for Spark SQL

Spark SQL works very well with structured, row-based data. Vectorized readers and writers for Parquet and ORC make I/O much faster. Spark SQL also uses WholeStageCodeGen to improve performance through JIT-compiled Java code. However, the Java JIT usually does not make good use of the latest SIMD instructions in complicated queries. Apache Arrow provides a columnar in-memory layout and SIMD-optimized kernels, as well as Gandiva, an LLVM-based SQL engine. These native libraries can accelerate Spark SQL by reducing CPU usage for both I/O and execution.

In this session, we take a deep dive into how we built a native SQL engine for Spark by leveraging Apache Arrow's Gandiva and compute kernels. We will introduce the general design of commonly used operators such as aggregation, sorting, and joining, and discuss how we can optimize these operators with SIMD instructions. We will also introduce how to implement WholeStageCodeGen with native libraries. Finally, we will use micro-benchmarks and TPC-H workloads to explain how vectorized execution can benefit these workloads.

About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Read more here: databricks.com/product/unified-data-analytics-plat…

See all the previous Summit sessions:

Connect with us:
Website: databricks.com/
Facebook: www.facebook.com/databricksinc
Twitter: twitter.com/databricks
LinkedIn: www.linkedin.com/company/databricks/
Instagram: www.instagram.com/databricksinc/

Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here: databricks.com/databricks-named-leader-by-gartner
