oss-data-tools-landscape

Data Storage Tools and Formats

Data storage is a fundamental aspect of any data management strategy. It involves organizing and preserving data in various formats and systems to ensure efficient access, retrieval, and analysis. In the context of big data and modern analytics, choosing the right storage solution is crucial for performance, scalability, and data integrity.

They can be broadly categorized into three main areas:

File Layer: The file layer represents the fundamental formats for storing and organizing data. These formats focus on efficient storage, compression, and access patterns for raw data, providing the foundation for higher-level data operations.
Metadata Layer: The metadata layer builds upon basic file formats to provide advanced features like ACID transactions, schema evolution, and versioning. These systems manage the organization and tracking of data changes while ensuring data consistency and reliability.
Data Modeling: Data modeling tools help in creating structured representations of data systems, defining relationships between data elements, and managing data transformations. They are essential for maintaining data quality and enabling effective analysis.

Available Tools and Formats

Here is a summary table of the main data storage tools and formats we have identified.

File Layer

Tool	Subcategory	Creation Date	Stars	Forks	Contributors	Last Release	Latest Commit	Meets Criteria*	Link
Avro	File Layer	21/05/2009	3149	1692	372	05/08/2024	11/09/2025	Yes	https://github.com/apache/avro
ORC	File Layer	06/05/2015	742	500	136	30/07/2025	17/09/2025	Yes	https://github.com/apache/orc
Parquet	File Layer	10/06/2014	2940	1482	235	03/09/2025	16/09/2025	Yes	https://github.com/apache/parquet-mr

Metadata Layer

Tool	Subcategory	Creation Date	Stars	Forks	Contributors	Last Release	Latest Commit	Meets Criteria*	Link
Delta Lake	Metadata Layer	22/04/2019	8275	1915	373	09/06/2025	17/09/2025	Yes	https://github.com/delta-io/delta
Hive Metastore	Metadata Layer	21/05/2009	5792	4768	257	N/A	16/09/2025	Yes	https://github.com/apache/hive
Hudi	Metadata Layer	14/12/2016	5938	2439	375	02/05/2025	17/09/2025	Yes	https://github.com/apache/hudi
Iceberg	Metadata Layer	19/11/2018	7974	2781	401	11/09/2025	17/09/2025	Yes	https://github.com/apache/iceberg
Nessie	Metadata Layer	09/04/2020	1322	160	66	16/09/2025	17/09/2025	Yes	https://github.com/projectnessie/nessie
Paimon	Metadata Layer	12/01/2022	2998	1226	269	N/A	17/09/2025	Yes	https://github.com/apache/paimon
Polaris	Metadata Layer	29/05/2024	1661	305	94	20/08/2025	17/09/2025	Yes	https://github.com/apache/polaris

Data Modeling

Tool	Subcategory	Creation Date	Stars	Forks	Contributors	Last Release	Latest Commit	Meets Criteria*	Link
Big Functions	Data Modeling	24/08/2022	750	70	36	15/05/2025	26/05/2025	No	https://github.com/unytics/bigfunctions
dbt core	Data Modeling	10/03/2016	11392	1802	306	10/09/2025	17/09/2025	Yes	https://github.com/dbt-labs/dbt-core
GraphQL	Data Modeling	01/07/2015	14569	1145	127	04/09/2025	04/09/2025	Yes	https://github.com/graphql/graphql-spec
SQL Mesh	Data Modeling	23/09/2022	2611	260	108	17/09/2025	17/09/2025	Yes	https://github.com/TobikoData/sqlmesh

*Criteria: >40 contributors, >500 stars, and recent releases/commit

Tool and Format Details

File Layer

Avro: A row-based storage format, Avro is a data serialization system that provides rich data structures and a compact, fast, binary data format.
ORC: (Optimized Row Columnar) A highly efficient way to store Hive data. It was designed to overcome limitations of other Hive file formats.
Parquet: A columnar storage file format available to any project in the Hadoop ecosystem. Parquet is built from the ground up with complex nested data structures in mind.

Metadata Layer

Delta Lake: An open-source storage layer bringing ACID transactions to Apache Spark and big data workloads, enabling reliable data lake operations with time travel, schema enforcement, and batch/streaming unification.
Hive Metastore: A centralized metadata repository service that stores semantic and technical metadata about data assets, providing metadata management and schema validation for Hadoop ecosystem components.
Hudi: A data lake storage system enabling atomic publishing, record-level updates/deletes, and incremental data processing. It offers snapshot isolation and provides efficient upsert and delete capabilities.
Iceberg: A table format for massive analytic datasets, offering schema evolution, hidden partitioning, and snapshot isolation. Supports efficient reads and writes with partition pruning and metadata handling.
Nessie: A Git-like version control system for data lakes enabling branch-based development, metadata versioning, and time travel capabilities across multiple table formats and data lake configurations.
Paimon: A streaming data lake platform optimized for high-speed data ingestion and real-time analytics, featuring changelog tracking and efficient incremental computation on dynamic datasets.
Polaris: A unified metadata management system providing data discovery, lineage tracking, and governance capabilities across data lakes and warehouses with integrated security controls.

Data Modeling

dbt core: A command-line tool that enables data analysts and engineers to transform data in their warehouses more effectively.
GraphQL: A query language for APIs and a runtime for executing those queries with your existing data. It provides a complete and understandable description of the data in your API.
SQL Mesh: An open-source tool for building and managing data transformations, with a focus on data modeling and lineage.

When choosing storage and data modeling tools, consider factors such as data volume, query patterns, integration with existing systems, scalability requirements, and the specific needs of your data team. For file formats, think about compression, schema evolution capabilities, and compatibility with your processing engines. For metadata layers, consider transaction support and real-time requirements. For data modeling tools, consider the complexity of your data relationships, the need for version control, and collaboration features.

It’s often beneficial to combine multiple tools. For example, you might use Parquet for base storage, Delta Lake for transaction support, and dbt for transformation and modeling. The key is to create a flexible, scalable data infrastructure that supports your current needs and can evolve with your organization’s data strategy.

Remember, the choice of storage tools can significantly impact query performance, data governance, and the overall efficiency of your data operations. It’s worth investing time in selecting the right combination of tools for your specific use case.

The Challenge of Choice

The open-source community has developed numerous solutions for various aspects of data handling, including: