oss-data-tools-landscape

Query and Processing Tools

Query and processing tools are essential components of any data analytics infrastructure. They enable organizations to extract insights from large volumes of data, perform complex computations, and support decision-making processes. These tools can be broadly categorized into five main areas: query engines, stream processing, batch processing, dataframe processing, and datawarehouse & OLAP.

Available Tools

Here is a summary table of the main query and processing tools we have identified.

Query Engine

| Tool | Subcategory | Creation Date | Stars | Forks | Contributors | Last Release | Latest Commit | Meets Criteria* | Link |
|---|---|---|---|---|---|---|---|---|---|
| Apache Calcite | Query Engine | 25/06/2014 | 4938 | 2447 | 328 | N/A | 17/09/2025 | Yes | https://github.com/apache/calcite |
| Apache Drill | Query Engine | 05/09/2012 | 1991 | 985 | 161 | 29/06/2025 | 16/09/2025 | Yes | https://github.com/apache/drill |
| Datafusion | Query Engine | 17/04/2021 | 7743 | 1639 | 416 | N/A | 17/09/2025 | Yes | https://github.com/apache/arrow-datafusion |
| DuckDB | Query Engine | 26/06/2018 | 32860 | 2591 | 339 | 16/09/2025 | 17/09/2025 | Yes | https://github.com/duckdb/duckdb |
| Hydra | Query Engine | 22/07/2022 | 2985 | 92 | 16 | 01/04/2024 | 10/02/2025 | No | https://github.com/hydradatabase/hydra |
| PostgreSQL | Query Engine | 21/09/2010 | 18547 | 5112 | 42 | N/A | 17/09/2025 | Yes | https://github.com/postgres/postgres |
| Presto | Query Engine | 09/08/2012 | 16502 | 5500 | 324 | 27/08/2025 | 17/09/2025 | Yes | https://github.com/prestodb/presto |
| Trino | Query Engine | 19/01/2019 | 11881 | 3334 | 333 | N/A | 17/09/2025 | Yes | https://github.com/trinodb/trino |

Stream Processing

| Tool | Subcategory | Creation Date | Stars | Forks | Contributors | Last Release | Latest Commit | Meets Criteria* | Link |
|---|---|---|---|---|---|---|---|---|---|
| Apache Flink | Stream Processing | 07/06/2014 | 25274 | 13782 | 286 | N/A | 17/09/2025 | Yes | https://github.com/apache/flink |
| Apache Kafka | Stream Processing | 15/08/2011 | 30921 | 14636 | 345 | N/A | 17/09/2025 | Yes | https://github.com/apache/kafka |
| Apache Samza | Stream Processing | 14/03/2015 | 832 | 334 | 132 | N/A | 02/05/2025 | Yes | https://github.com/apache/samza |
| Apache Storm | Stream Processing | 05/11/2013 | 6653 | 4060 | 280 | 03/08/2025 | 15/09/2025 | Yes | https://github.com/apache/storm |
| Materialize | Stream Processing | 22/02/2019 | 6111 | 478 | 146 | 14/08/2024 | 17/09/2025 | Yes | https://github.com/MaterializeInc/materialize |
| Redpanda | Stream Processing | 02/11/2020 | 11004 | 675 | 146 | 11/09/2025 | 17/09/2025 | Yes | https://github.com/redpanda-data/redpanda |

Batch Processing

| Tool | Subcategory | Creation Date | Stars | Forks | Contributors | Last Release | Latest Commit | Meets Criteria* | Link |
|---|---|---|---|---|---|---|---|---|---|
| AmphiETL | Batch Processing | 20/03/2024 | 1098 | 74 | 8 | N/A | 12/09/2025 | Yes | https://github.com/amphi-ai/amphi-etl |
| Apache Beam | Batch Processing | 02/02/2016 | 8298 | 4403 | 308 | 15/09/2025 | 17/09/2025 | Yes | https://github.com/apache/beam |
| Apache Hop | Batch Processing | 24/09/2019 | 1231 | 402 | 93 | 08/08/2025 | 17/09/2025 | Yes | https://github.com/apache/hop |
| Apache Spark | Batch Processing | 25/02/2014 | 41906 | 28826 | 333 | N/A | 17/09/2025 | Yes | https://github.com/apache/spark |
| dbt core | Batch Processing | 10/03/2016 | 11392 | 1802 | 306 | 10/09/2025 | 17/09/2025 | Yes | https://github.com/dbt-labs/dbt-core |
| Talaxie | Batch Processing | 28/05/2024 | 4 | 2 | 142 | N/A | 20/10/2024 | No | https://github.com/Talaxie/tdi-studio-se |

Dataframe Processing

| Tool | Subcategory | Creation Date | Stars | Forks | Contributors | Last Release | Latest Commit | Meets Criteria* | Link |
|---|---|---|---|---|---|---|---|---|---|
| Dask | Dataframe Processing | 04/01/2015 | 13487 | 1796 | 416 | 16/09/2025 | 16/09/2025 | Yes | https://github.com/dask/dask |
| Ibis Project | Dataframe Processing | 17/04/2015 | 6102 | 665 | 202 | 28/07/2025 | 17/09/2025 | Yes | https://github.com/ibis-project/ibis |
| Pandas | Dataframe Processing | 24/08/2010 | 46597 | 18963 | 413 | 21/08/2025 | 17/09/2025 | Yes | https://github.com/pandas-dev/pandas |
| Polars | Dataframe Processing | 13/05/2020 | 35359 | 2397 | 443 | 16/09/2025 | 17/09/2025 | Yes | https://github.com/pola-rs/polars |

Datawarehouse & OLAP

| Tool | Subcategory | Creation Date | Stars | Forks | Contributors | Last Release | Latest Commit | Meets Criteria* | Link |
|---|---|---|---|---|---|---|---|---|---|
| Apache Hive | Datawarehouse & OLAP | 21/05/2009 | 5792 | 4768 | 257 | N/A | 16/09/2025 | Yes | https://github.com/apache/hive |
| Apache Impala | Datawarehouse & OLAP | 13/04/2016 | 1242 | 537 | 173 | 07/03/2025 | 17/09/2025 | Yes | https://github.com/apache/impala |
| Apache Kylin | Datawarehouse & OLAP | 03/01/2015 | 3748 | 1520 | 60 | 06/04/2025 | 17/09/2025 | Yes | https://github.com/apache/kylin |
| ClickHouse | Datawarehouse & OLAP | 02/06/2016 | 42935 | 7663 | 297 | 16/09/2025 | 17/09/2025 | Yes | https://github.com/ClickHouse/ClickHouse |
| Doris | Datawarehouse & OLAP | 10/08/2017 | 14277 | 3561 | 336 | 03/09/2025 | 17/09/2025 | Yes | https://github.com/apache/doris |
| Druid | Datawarehouse & OLAP | 23/10/2012 | 13828 | 3758 | 355 | 11/08/2025 | 17/09/2025 | Yes | https://github.com/apache/druid |
| Pinot | Datawarehouse & OLAP | 19/05/2014 | 5901 | 1416 | 367 | 15/09/2025 | 17/09/2025 | Yes | https://github.com/apache/pinot |
| StarRocks | Datawarehouse & OLAP | 04/09/2021 | 10671 | 2138 | 401 | 09/09/2025 | 17/09/2025 | Yes | https://github.com/StarRocks/starrocks |

*Criteria: more than 40 contributors, more than 500 stars, and a recent release or commit.
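
The criteria can be read as a simple predicate over the table columns. The following is a minimal sketch, not the method actually used to fill in the "Meets Criteria" column. The source does not define "recent", so the 365-day window, the snapshot date (taken from the newest "Latest Commit" value in the tables), and the `Tool` and `meets_criteria` names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Optional

# Assumption: snapshot date = newest "Latest Commit" value in the tables.
SNAPSHOT = date(2025, 9, 17)
# Assumption: "recent" means activity within 365 days; the source gives no threshold.
RECENT_WINDOW = timedelta(days=365)


@dataclass
class Tool:
    """Hypothetical record holding the columns the criteria depend on."""
    name: str
    stars: int
    contributors: int
    last_release: Optional[str]  # dd/mm/yyyy, or None when the table says N/A
    latest_commit: str           # dd/mm/yyyy


def parse(value: str) -> date:
    """Parse the dd/mm/yyyy dates used in the tables."""
    return datetime.strptime(value, "%d/%m/%Y").date()


def meets_criteria(tool: Tool, today: date = SNAPSHOT) -> bool:
    """True if the tool has >40 contributors, >500 stars, and recent activity."""
    activity = [parse(tool.latest_commit)]
    if tool.last_release is not None:
        activity.append(parse(tool.last_release))
    recent = (today - max(activity)) <= RECENT_WINDOW
    return tool.contributors > 40 and tool.stars > 500 and recent


# A few rows copied from the tables above.
tools = [
    Tool("DuckDB", 32860, 339, "16/09/2025", "17/09/2025"),
    Tool("Hydra", 2985, 16, "01/04/2024", "10/02/2025"),
    Tool("PostgreSQL", 18547, 42, None, "17/09/2025"),
    Tool("Talaxie", 4, 142, None, "20/10/2024"),
]

for tool in tools:
    print(f"{tool.name}: {'Yes' if meets_criteria(tool) else 'No'}")
```

Under these assumptions, Hydra fails on contributor count and Talaxie fails on stars, which agrees with the "No" entries in the tables for those two rows. A stricter or looser recency window would not change either result.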

Tool Details

Query Engine

  1. Apache Calcite: Dynamic data management framework providing query optimization, data federation, and more.
  2. Apache Drill: Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.
  3. Datafusion: Fast, extensible query engine written in Rust that uses Apache Arrow as its in-memory format.
  4. DuckDB: In-process SQL OLAP database management system, designed to be fast and efficient for analytical queries.
  5. GraphQL: Query language for APIs and a runtime for executing those queries with existing data.
  6. Hydra: Open-source, column-oriented PostgreSQL distribution aimed at analytical queries.
  7. PostgreSQL: Powerful, open source object-relational database system with a strong reputation for reliability and data integrity.
  8. Presto: Distributed SQL query engine for big data, designed for fast analytic queries against data of any size.
  9. Trino: Fast distributed SQL query engine for big data analytics, designed to efficiently query vast amounts of data.

Stream Processing

  1. Apache Flink: Stateful computations over data streams, providing precise control of time and state.
  2. Apache Kafka: Distributed event streaming platform capable of handling trillions of events a day.
  3. Apache Samza: Distributed stream processing framework that uses Apache Kafka for messaging, and Hadoop YARN for fault tolerance.
  4. Apache Storm: Distributed real-time computation system for processing fast, large streams of data.
  5. Materialize: Streaming database that makes it easy to build real-time applications on streaming data.
  6. Redpanda: Modern streaming platform compatible with Kafka API, built for mission-critical workloads with high performance.

Batch Processing

  1. AmphiETL: Cloud-native ETL platform built for modern data teams, offering scalable data transformations and integrations.
  2. Apache Beam: Unified programming model for batch and streaming data processing, offering language-specific SDKs.
  3. Apache Hop: Data orchestration and data engineering platform designed for visual development of data pipelines and workflows.
  4. Apache Spark: Fast and general-purpose cluster computing system, providing high-level APIs in Java, Scala, Python and R.
  5. dbt core: Transforms data in warehouses by allowing analysts and engineers to define models using SQL SELECT statements.

Dataframe Processing

  1. Dask: Flexible library for parallel computing in Python, scaling Python and Pandas workflows efficiently.
  2. Ibis Project: Expression compiler for analytics, bridging different query engines with a unified Python API for data analytics.
  3. Pandas: Powerful Python data manipulation and analysis library, offering data structures for efficiently storing large datasets.
  4. Polars: Fast multi-threaded DataFrame library for Rust and Python, designed as a faster alternative to Pandas.

Datawarehouse & OLAP

  1. Apache Hive: Data warehouse software facilitating reading, writing, and managing large datasets in distributed storage using SQL.
  2. Apache Impala: Massively Parallel Processing (MPP) SQL query engine for data stored in Hadoop clusters.
  3. Apache Kylin: Extreme OLAP engine for big data that allows for sub-second queries on datasets with trillions of rows.
  4. ClickHouse: Open-source column-oriented database management system for real-time analytics using SQL.
  5. Doris: High-performance real-time analytical database based on MPP architecture.
  6. Druid: High performance real-time analytics database designed for workflows where fast queries and ingest really matter.
  7. StarRocks: High-performance analytical database that enables real-time, multi-dimensional, and highly concurrent data analysis.

These tools offer a wide range of capabilities for querying and processing data in various scenarios. When choosing a tool, consider factors such as:

Remember that different categories of tools can be combined to create comprehensive data processing pipelines:

The choice of tools can significantly impact the performance and capabilities of your data analytics infrastructure. It’s often beneficial to combine multiple tools to address different aspects of your data processing needs while maintaining a balance between functionality, complexity, and maintainability.

The Challenge of Choice

The open-source community has developed numerous solutions for various aspects of data handling, including: