oss-data-tools-landscape

Data Storage Tools and Formats

Data storage is a fundamental aspect of any data management strategy. It involves organizing and preserving data in various formats and systems to ensure efficient access, retrieval, and analysis. In the context of big data and modern analytics, choosing the right storage solution is crucial for performance, scalability, and data integrity.

They can be broadly categorized into three main areas:

Available Tools and Formats

Here is a summary table of the main data storage tools and formats we have identified.

File Layer

Tool Subcategory Creation Date Stars Forks Contributors Last Release Latest Commit Meets Criteria* Link
Avro File Layer 21/05/2009 3149 1692 372 05/08/2024 11/09/2025 Yes https://github.com/apache/avro
ORC File Layer 06/05/2015 742 500 136 30/07/2025 17/09/2025 Yes https://github.com/apache/orc
Parquet File Layer 10/06/2014 2940 1482 235 03/09/2025 16/09/2025 Yes https://github.com/apache/parquet-mr

Metadata Layer

Tool Subcategory Creation Date Stars Forks Contributors Last Release Latest Commit Meets Criteria* Link
Delta Lake Metadata Layer 22/04/2019 8275 1915 373 09/06/2025 17/09/2025 Yes https://github.com/delta-io/delta
Hive Metastore Metadata Layer 21/05/2009 5792 4768 257 N/A 16/09/2025 Yes https://github.com/apache/hive
Hudi Metadata Layer 14/12/2016 5938 2439 375 02/05/2025 17/09/2025 Yes https://github.com/apache/hudi
Iceberg Metadata Layer 19/11/2018 7974 2781 401 11/09/2025 17/09/2025 Yes https://github.com/apache/iceberg
Nessie Metadata Layer 09/04/2020 1322 160 66 16/09/2025 17/09/2025 Yes https://github.com/projectnessie/nessie
Paimon Metadata Layer 12/01/2022 2998 1226 269 N/A 17/09/2025 Yes https://github.com/apache/paimon
Polaris Metadata Layer 29/05/2024 1661 305 94 20/08/2025 17/09/2025 Yes https://github.com/apache/polaris

Data Modeling

Tool Subcategory Creation Date Stars Forks Contributors Last Release Latest Commit Meets Criteria* Link
Big Functions Data Modeling 24/08/2022 750 70 36 15/05/2025 26/05/2025 No https://github.com/unytics/bigfunctions
dbt core Data Modeling 10/03/2016 11392 1802 306 10/09/2025 17/09/2025 Yes https://github.com/dbt-labs/dbt-core
GraphQL Data Modeling 01/07/2015 14569 1145 127 04/09/2025 04/09/2025 Yes https://github.com/graphql/graphql-spec
SQL Mesh Data Modeling 23/09/2022 2611 260 108 17/09/2025 17/09/2025 Yes https://github.com/TobikoData/sqlmesh

*Criteria: >40 contributors, >500 stars, and recent releases/commit

Tool and Format Details

File Layer

  1. Avro: A row-based storage format, Avro is a data serialization system that provides rich data structures and a compact, fast, binary data format.
  2. ORC: (Optimized Row Columnar) A highly efficient way to store Hive data. It was designed to overcome limitations of other Hive file formats.
  3. Parquet: A columnar storage file format available to any project in the Hadoop ecosystem. Parquet is built from the ground up with complex nested data structures in mind.

Metadata Layer

  1. Delta Lake: An open-source storage layer bringing ACID transactions to Apache Spark and big data workloads, enabling reliable data lake operations with time travel, schema enforcement, and batch/streaming unification.
  2. Hive Metastore: A centralized metadata repository service that stores semantic and technical metadata about data assets, providing metadata management and schema validation for Hadoop ecosystem components.
  3. Hudi: A data lake storage system enabling atomic publishing, record-level updates/deletes, and incremental data processing. It offers snapshot isolation and provides efficient upsert and delete capabilities.
  4. Iceberg: A table format for massive analytic datasets, offering schema evolution, hidden partitioning, and snapshot isolation. Supports efficient reads and writes with partition pruning and metadata handling.
  5. Nessie: A Git-like version control system for data lakes enabling branch-based development, metadata versioning, and time travel capabilities across multiple table formats and data lake configurations.
  6. Paimon: A streaming data lake platform optimized for high-speed data ingestion and real-time analytics, featuring changelog tracking and efficient incremental computation on dynamic datasets.
  7. Polaris: A unified metadata management system providing data discovery, lineage tracking, and governance capabilities across data lakes and warehouses with integrated security controls.

Data Modeling

  1. dbt core: A command-line tool that enables data analysts and engineers to transform data in their warehouses more effectively.
  2. GraphQL: A query language for APIs and a runtime for executing those queries with your existing data. It provides a complete and understandable description of the data in your API.
  3. SQL Mesh: An open-source tool for building and managing data transformations, with a focus on data modeling and lineage.

When choosing storage and data modeling tools, consider factors such as data volume, query patterns, integration with existing systems, scalability requirements, and the specific needs of your data team. For file formats, think about compression, schema evolution capabilities, and compatibility with your processing engines. For metadata layers, consider transaction support and real-time requirements. For data modeling tools, consider the complexity of your data relationships, the need for version control, and collaboration features.

It’s often beneficial to combine multiple tools. For example, you might use Parquet for base storage, Delta Lake for transaction support, and dbt for transformation and modeling. The key is to create a flexible, scalable data infrastructure that supports your current needs and can evolve with your organization’s data strategy.

Remember, the choice of storage tools can significantly impact query performance, data governance, and the overall efficiency of your data operations. It’s worth investing time in selecting the right combination of tools for your specific use case.

The Challenge of Choice

The open-source community has developed numerous solutions for various aspects of data handling, including: