oss-data-tools-landscape

Query and Processing Tools

Query and processing tools are essential components of any data analytics infrastructure. They enable organizations to extract insights from large volumes of data, perform complex computations, and support decision-making processes. These tools can be broadly categorized into five main areas: query engines, stream processing, batch processing, dataframe processing, and datawarehouse & OLAP.

Query Engine: Query engines are designed to efficiently retrieve and analyze data from various sources. They allow users to write and execute queries, often using SQL or SQL-like languages, to extract specific information from databases or data lakes.
Stream Processing: Stream processing deals with real-time data analysis. It processes data as it arrives, allowing for immediate insights and actions. This is particularly useful for scenarios requiring real-time decision making or continuous data analysis.
Batch Processing: Batch processing involves processing large volumes of data at scheduled intervals. It’s typically used for handling large datasets where immediate results are not required. Batch processing is efficient for complex analyses that require processing entire datasets.
Dataframe Processing: Dataframe processing tools provide efficient ways to manipulate and analyze structured data in memory. They offer intuitive APIs for data transformation, aggregation, and analysis, typically optimized for performance and ease of use.
Datawarehouse & OLAP: Datawarehouse and OLAP (Online Analytical Processing) tools are specialized systems designed for storing and analyzing large volumes of historical data, enabling complex analytical queries and multidimensional analysis.
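The difference between batch and stream processing described above can be illustrated with a minimal sketch in plain Python. It is not tied to any tool in this landscape; the `Event` shape and the function names `batch_totals` and `stream_totals` are hypothetical and chosen only for illustration. The batch function sees the whole dataset and returns one result, while the stream function updates its state per event and emits an intermediate result each time.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical event shape: (sensor id, reading)
Event = tuple[str, float]


def batch_totals(events: list[Event]) -> dict[str, float]:
    """Batch style: the full dataset is available, one result is produced at the end."""
    totals: dict[str, float] = defaultdict(float)
    for key, value in events:
        totals[key] += value
    return dict(totals)


def stream_totals(events: Iterable[Event]) -> Iterator[dict[str, float]]:
    """Stream style: state is updated per event and a fresh result is emitted each time."""
    totals: dict[str, float] = defaultdict(float)
    for key, value in events:
        totals[key] += value
        yield dict(totals)  # snapshot of the running totals so far


events = [("a", 1.0), ("b", 2.0), ("a", 3.0)]

batch_result = batch_totals(events)
stream_results = list(stream_totals(iter(events)))
```

Once every event has been consumed, the last streamed snapshot equals the batch result. Real stream processors add concerns this sketch ignores, such as windowing, fault tolerance, and out-of-order events.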

Available Tools

The tables below summarize the main query and processing tools we have identified, grouped by category.

Query Engine

| Tool | Subcategory | Creation Date | Stars | Forks | Contributors | Last Release | Latest Commit | Meets Criteria* | Link |
|---|---|---|---|---|---|---|---|---|---|
| Apache Calcite | Query Engine | 25/06/2014 | 4938 | 2447 | 328 | N/A | 17/09/2025 | Yes | https://github.com/apache/calcite |
| Apache Drill | Query Engine | 05/09/2012 | 1991 | 985 | 161 | 29/06/2025 | 16/09/2025 | Yes | https://github.com/apache/drill |
| Datafusion | Query Engine | 17/04/2021 | 7743 | 1639 | 416 | N/A | 17/09/2025 | Yes | https://github.com/apache/arrow-datafusion |
| DuckDB | Query Engine | 26/06/2018 | 32860 | 2591 | 339 | 16/09/2025 | 17/09/2025 | Yes | https://github.com/duckdb/duckdb |
| Hydra | Query Engine | 22/07/2022 | 2985 | 92 | 16 | 01/04/2024 | 10/02/2025 | No | https://github.com/hydradatabase/hydra |
| PostgreSQL | Query Engine | 21/09/2010 | 18547 | 5112 | 42 | N/A | 17/09/2025 | Yes | https://github.com/postgres/postgres |
| Presto | Query Engine | 09/08/2012 | 16502 | 5500 | 324 | 27/08/2025 | 17/09/2025 | Yes | https://github.com/prestodb/presto |
| Trino | Query Engine | 19/01/2019 | 11881 | 3334 | 333 | N/A | 17/09/2025 | Yes | https://github.com/trinodb/trino |

Stream Processing

Tool	Subcategory	Creation Date	Stars	Forks	Contributors	Last Release	Latest Commit	Meets Criteria*	Link
Apache Flink	Stream Processing	07/06/2014	25274	13782	286	N/A	17/09/2025	Yes	https://github.com/apache/flink
Apache Kafka	Stream Processing	15/08/2011	30921	14636	345	N/A	17/09/2025	Yes	https://github.com/apache/kafka
Apache Samza	Stream Processing	14/03/2015	832	334	132	N/A	02/05/2025	Yes	https://github.com/apache/samza
Apache Storm	Stream Processing	05/11/2013	6653	4060	280	03/08/2025	15/09/2025	Yes	https://github.com/apache/storm
Materialize	Stream Processing	22/02/2019	6111	478	146	14/08/2024	17/09/2025	Yes	https://github.com/MaterializeInc/materialize
Redpanda	Stream Processing	02/11/2020	11004	675	146	11/09/2025	17/09/2025	Yes	https://github.com/redpanda-data/redpanda

Batch Processing

Tool	Subcategory	Creation Date	Stars	Forks	Contributors	Last Release	Latest Commit	Meets Criteria*	Link
AmphiETL	Batch Processing	20/03/2024	1098	74	8	N/A	12/09/2025	Yes	https://github.com/amphi-ai/amphi-etl
Apache Beam	Batch Processing	02/02/2016	8298	4403	308	15/09/2025	17/09/2025	Yes	https://github.com/apache/beam
Apache Hop	Batch Processing	24/09/2019	1231	402	93	08/08/2025	17/09/2025	Yes	https://github.com/apache/hop
Apache Spark	Batch Processing	25/02/2014	41906	28826	333	N/A	17/09/2025	Yes	https://github.com/apache/spark
dbt core	Batch Processing	10/03/2016	11392	1802	306	10/09/2025	17/09/2025	Yes	https://github.com/dbt-labs/dbt-core
Talaxie	Batch Processing	28/05/2024	4	2	142	N/A	20/10/2024	No	https://github.com/Talaxie/tdi-studio-se

Dataframe Processing

Tool	Subcategory	Creation Date	Stars	Forks	Contributors	Last Release	Latest Commit	Meets Criteria*	Link
Dask	Dataframe Processing	04/01/2015	13487	1796	416	16/09/2025	16/09/2025	Yes	https://github.com/dask/dask
Ibis Project	Dataframe Processing	17/04/2015	6102	665	202	28/07/2025	17/09/2025	Yes	https://github.com/ibis-project/ibis
Pandas	Dataframe Processing	24/08/2010	46597	18963	413	21/08/2025	17/09/2025	Yes	https://github.com/pandas-dev/pandas
Polars	Dataframe Processing	13/05/2020	35359	2397	443	16/09/2025	17/09/2025	Yes	https://github.com/pola-rs/polars

Datawarehouse & OLAP

Tool	Subcategory	Creation Date	Stars	Forks	Contributors	Last Release	Latest Commit	Meets Criteria*	Link
Apache Hive	Datawarehouse & OLAP	21/05/2009	5792	4768	257	N/A	16/09/2025	Yes	https://github.com/apache/hive
Apache Impala	Datawarehouse & OLAP	13/04/2016	1242	537	173	07/03/2025	17/09/2025	Yes	https://github.com/apache/impala
Apache Kylin	Datawarehouse & OLAP	03/01/2015	3748	1520	60	06/04/2025	17/09/2025	Yes	https://github.com/apache/kylin
ClickHouse	Datawarehouse & OLAP	02/06/2016	42935	7663	297	16/09/2025	17/09/2025	Yes	https://github.com/ClickHouse/ClickHouse
Doris	Datawarehouse & OLAP	10/08/2017	14277	3561	336	03/09/2025	17/09/2025	Yes	https://github.com/apache/doris
Druid	Datawarehouse & OLAP	23/10/2012	13828	3758	355	11/08/2025	17/09/2025	Yes	https://github.com/apache/druid
Pinot	Datawarehouse & OLAP	19/05/2014	5901	1416	367	15/09/2025	17/09/2025	Yes	https://github.com/apache/pinot
StarRocks	Datawarehouse & OLAP	04/09/2021	10671	2138	401	09/09/2025	17/09/2025	Yes	https://github.com/StarRocks/starrocks

*Criteria: more than 40 contributors, more than 500 stars, and a recent release or commit.
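The criteria above can be read as a simple filter. The sketch below is one possible interpretation, not the procedure used to build the tables: the source does not define "recent", so the 365-day `RECENCY_WINDOW` is an assumption, as are the `Repo` class and the `meets_criteria` function name. The sample values are copied from the tables, with the snapshot date taken as the latest commit date that appears in them.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: "recent" means activity within 365 days of the snapshot date.
RECENCY_WINDOW = timedelta(days=365)
# Latest commit date that appears in the tables, used as the snapshot date.
SNAPSHOT_DATE = date(2025, 9, 17)


@dataclass
class Repo:
    """Hypothetical record holding the columns the criteria depend on."""
    name: str
    stars: int
    contributors: int
    latest_commit: date


def meets_criteria(repo: Repo, today: date = SNAPSHOT_DATE) -> bool:
    """Return True when all three thresholds are satisfied."""
    is_recent = today - repo.latest_commit <= RECENCY_WINDOW
    return repo.contributors > 40 and repo.stars > 500 and is_recent


duckdb = Repo("DuckDB", stars=32860, contributors=339, latest_commit=date(2025, 9, 17))
hydra = Repo("Hydra", stars=2985, contributors=16, latest_commit=date(2025, 2, 10))
talaxie = Repo("Talaxie", stars=4, contributors=142, latest_commit=date(2024, 10, 20))

results = {repo.name: meets_criteria(repo) for repo in (duckdb, hydra, talaxie)}
```

With these inputs, DuckDB passes, Hydra fails on contributor count, and Talaxie fails on stars, which matches the corresponding rows in the tables.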

Tool Details

Query Engine

Apache Calcite: Dynamic data management framework providing query optimization, data federation, and more.
Apache Drill: Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.
Datafusion: Fast, extensible query engine written in Rust that uses Apache Arrow as its in-memory format.
DuckDB: In-process SQL OLAP database management system, designed to be fast and efficient for analytical queries.
GraphQL: Query language for APIs and a runtime for executing those queries with existing data.
Hydra: Open source, column-oriented Postgres that adds columnar storage for analytical queries to PostgreSQL.
PostgreSQL: Powerful, open source object-relational database system with a strong reputation for reliability and data integrity.
Presto: Distributed SQL query engine for big data, designed for fast analytic queries against data of any size.
Trino: Fast distributed SQL query engine for big data analytics, designed to efficiently query vast amounts of data.

Stream Processing

Apache Flink: Stateful computations over data streams, providing precise control of time and state.
Apache Kafka: Distributed event streaming platform capable of handling trillions of events a day.
Apache Samza: Distributed stream processing framework that uses Apache Kafka for messaging and Hadoop YARN for fault tolerance and resource management.
Apache Storm: Distributed real-time computation system for processing fast, large streams of data.
Materialize: Streaming database that makes it easy to build real-time applications on streaming data.
Redpanda: Modern streaming platform compatible with Kafka API, built for mission-critical workloads with high performance.

Batch Processing

AmphiETL: Low-code, visual ETL tool for data preparation and transformation that generates Python code.
Apache Beam: Unified programming model for batch and streaming data processing, offering language-specific SDKs.
Apache Hop: Data orchestration and data engineering platform designed for visual development of data pipelines and workflows.
Apache Spark: Fast and general-purpose cluster computing system, providing high-level APIs in Java, Scala, Python and R.
dbt core: Transforms data in warehouses by allowing analysts and engineers to define models using SQL SELECT statements.

Dataframe Processing

Dask: Flexible library for parallel computing in Python, scaling Python and Pandas workflows efficiently.
Ibis Project: Expression compiler for analytics, bridging different query engines with a unified Python API for data analytics.
Pandas: Powerful Python data manipulation and analysis library, offering DataFrame and Series data structures for working with labeled, in-memory tabular data.
Polars: Fast multi-threaded DataFrame library for Rust and Python, designed as a faster alternative to Pandas.

Datawarehouse & OLAP

Apache Hive: Data warehouse software facilitating reading, writing, and managing large datasets in distributed storage using SQL.
Apache Impala: Massively Parallel Processing (MPP) SQL query engine for data stored in Hadoop clusters.
Apache Kylin: Extreme OLAP engine for big data that allows for sub-second queries on datasets with trillions of rows.
ClickHouse: Open-source column-oriented database management system for real-time analytics using SQL.
Doris: High-performance real-time analytical database based on MPP architecture.
Druid: High performance real-time analytics database designed for workflows where fast queries and ingest really matter.
StarRocks: High-performance analytical database that enables real-time, multi-dimensional, and highly concurrent data analysis.

These tools offer a wide range of capabilities for querying and processing data in various scenarios. When choosing a tool, consider factors such as:

Scale of your data
Real-time requirements
Query complexity
Integration needs with existing data stack
Performance requirements
Team expertise and learning curve

Remember that different categories of tools can be combined to create comprehensive data processing pipelines:

Use Dataframe Processing tools for exploratory data analysis and prototyping
Implement Stream Processing for real-time data needs
Deploy Batch Processing for large-scale periodic processing
Leverage Query Engines for ad-hoc analysis
Utilize Datawarehouse & OLAP systems for historical analysis and reporting
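As a minimal illustration of the ad-hoc analysis step, the sketch below runs a SQL aggregation against an in-memory table. It uses Python's built-in `sqlite3` module purely as a stand-in for an in-process SQL engine. SQLite is not one of the tools listed in this section, and the `orders` table, its columns, and its rows are hypothetical.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 10.0), ("south", 5.0), ("north", 2.5)],
)

# Ad-hoc question: what is the total order amount per region?
rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()
conn.close()
```

The same declarative query style carries over to the query engines listed above, although connection setup, SQL dialect, and supported data sources differ from one engine to another.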

The choice of tools can significantly impact the performance and capabilities of your data analytics infrastructure. It’s often beneficial to combine multiple tools to address different aspects of your data processing needs while maintaining a balance between functionality, complexity, and maintainability.

The Challenge of Choice

The open-source community has developed numerous solutions for various aspects of data handling, including: