Data storage is a fundamental aspect of any data management strategy. It involves organizing and preserving data in various formats and systems to ensure efficient access, retrieval, and analysis. In the context of big data and modern analytics, choosing the right storage solution is crucial for performance, scalability, and data integrity.
Storage tools can be broadly categorized into three main areas: the file layer (file formats), the metadata layer, and data modeling.
The tables below summarize the main data storage tools and formats we have identified, one table per subcategory. Dates use the DD/MM/YYYY format.

Tool | Subcategory | Creation Date | Stars | Forks | Contributors | Last Release | Latest Commit | Meets Criteria* | Link |
---|---|---|---|---|---|---|---|---|---|
Avro | File Layer | 21/05/2009 | 3149 | 1692 | 372 | 05/08/2024 | 11/09/2025 | Yes | https://github.com/apache/avro |
ORC | File Layer | 06/05/2015 | 742 | 500 | 136 | 30/07/2025 | 17/09/2025 | Yes | https://github.com/apache/orc |
Parquet | File Layer | 10/06/2014 | 2940 | 1482 | 235 | 03/09/2025 | 16/09/2025 | Yes | https://github.com/apache/parquet-mr |

Tool | Subcategory | Creation Date | Stars | Forks | Contributors | Last Release | Latest Commit | Meets Criteria* | Link |
---|---|---|---|---|---|---|---|---|---|
Delta Lake | Metadata Layer | 22/04/2019 | 8275 | 1915 | 373 | 09/06/2025 | 17/09/2025 | Yes | https://github.com/delta-io/delta |
Hive Metastore | Metadata Layer | 21/05/2009 | 5792 | 4768 | 257 | N/A | 16/09/2025 | Yes | https://github.com/apache/hive |
Hudi | Metadata Layer | 14/12/2016 | 5938 | 2439 | 375 | 02/05/2025 | 17/09/2025 | Yes | https://github.com/apache/hudi |
Iceberg | Metadata Layer | 19/11/2018 | 7974 | 2781 | 401 | 11/09/2025 | 17/09/2025 | Yes | https://github.com/apache/iceberg |
Nessie | Metadata Layer | 09/04/2020 | 1322 | 160 | 66 | 16/09/2025 | 17/09/2025 | Yes | https://github.com/projectnessie/nessie |
Paimon | Metadata Layer | 12/01/2022 | 2998 | 1226 | 269 | N/A | 17/09/2025 | Yes | https://github.com/apache/paimon |
Polaris | Metadata Layer | 29/05/2024 | 1661 | 305 | 94 | 20/08/2025 | 17/09/2025 | Yes | https://github.com/apache/polaris |

Tool | Subcategory | Creation Date | Stars | Forks | Contributors | Last Release | Latest Commit | Meets Criteria* | Link |
---|---|---|---|---|---|---|---|---|---|
Big Functions | Data Modeling | 24/08/2022 | 750 | 70 | 36 | 15/05/2025 | 26/05/2025 | No | https://github.com/unytics/bigfunctions |
dbt core | Data Modeling | 10/03/2016 | 11392 | 1802 | 306 | 10/09/2025 | 17/09/2025 | Yes | https://github.com/dbt-labs/dbt-core |
GraphQL | Data Modeling | 01/07/2015 | 14569 | 1145 | 127 | 04/09/2025 | 04/09/2025 | Yes | https://github.com/graphql/graphql-spec |
SQL Mesh | Data Modeling | 23/09/2022 | 2611 | 260 | 108 | 17/09/2025 | 17/09/2025 | Yes | https://github.com/TobikoData/sqlmesh |

*Criteria: >40 contributors, >500 stars, and a recent release or commit.
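The criteria check can be expressed as a small function. The following is a minimal sketch, not the method actually used to fill in the tables: the `Repo` class, the `meets_criteria` helper, the 365-day definition of "recent", and the reference date are all assumptions made for illustration. It treats either a recent release or a recent commit as sufficient, which matches rows such as Paimon that list no release but still meet the criteria.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Optional


def parse_date(text: str) -> Optional[date]:
    """Parse a DD/MM/YYYY table cell; 'N/A' becomes None."""
    if text.strip().upper() == "N/A":
        return None
    return datetime.strptime(text.strip(), "%d/%m/%Y").date()


@dataclass
class Repo:
    """Hypothetical container for the columns the criteria depend on."""
    name: str
    stars: int
    contributors: int
    last_release: Optional[date]
    latest_commit: Optional[date]


def meets_criteria(repo: Repo, as_of: date, max_age_days: int = 365) -> bool:
    """Return True if the repo has >40 contributors, >500 stars, and recent activity.

    'Recent' is assumed to mean a release or commit within max_age_days of as_of.
    """
    cutoff = as_of - timedelta(days=max_age_days)
    activity = [d for d in (repo.last_release, repo.latest_commit) if d is not None]
    recently_active = any(d >= cutoff for d in activity)
    return repo.contributors > 40 and repo.stars > 500 and recently_active


# Assumed reference date: the most recent commit date that appears in the tables.
AS_OF = date(2025, 9, 17)

# Two rows copied from the tables above.
paimon = Repo("Paimon", 2998, 269, parse_date("N/A"), parse_date("17/09/2025"))
big_functions = Repo("Big Functions", 750, 36, parse_date("15/05/2025"), parse_date("26/05/2025"))

for repo in (paimon, big_functions):
    print(repo.name, meets_criteria(repo, AS_OF))
```

Under these assumptions, Big Functions fails only on the contributor count (36 is not greater than 40), which is consistent with its "No" in the table.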
When choosing storage and data modeling tools, consider factors such as data volume, query patterns, integration with existing systems, scalability requirements, and the specific needs of your data team. For file formats, think about compression, schema evolution capabilities, and compatibility with your processing engines. For metadata layers, consider transaction support and real-time requirements. For data modeling tools, consider the complexity of your data relationships, the need for version control, and collaboration features.
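To make "schema evolution" more concrete, here is a minimal pure-Python sketch of the general idea: a record written under an older schema is read under a newer one, with defaults filling in fields that were added later and unknown fields dropped. This is not the Avro, ORC, or Parquet API; the `READER_SCHEMA` layout and the `resolve` helper are hypothetical and only illustrate the concept.

```python
# Hypothetical reader schema: field name -> default used when a record lacks that field.
READER_SCHEMA = {"id": None, "name": "", "country": "unknown"}


def resolve(record: dict, reader_schema: dict) -> dict:
    """Project a record onto the reader schema.

    Fields missing from the record take the schema default;
    fields the schema does not know about are dropped.
    """
    return {field: record.get(field, default) for field, default in reader_schema.items()}


# A record written before "country" existed, still carrying a retired field.
old_record = {"id": 1, "name": "Ada", "legacy_flag": True}

print(resolve(old_record, READER_SCHEMA))
```

When comparing formats, it is worth checking which kinds of change (adding, removing, or renaming fields) each format and your processing engine can handle without rewriting existing files.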
It’s often beneficial to combine multiple tools. For example, you might use Parquet for base storage, Delta Lake for transaction support, and dbt for transformation and modeling. The key is to create a flexible, scalable data infrastructure that supports your current needs and can evolve with your organization’s data strategy.
Remember, the choice of storage tools can significantly impact query performance, data governance, and the overall efficiency of your data operations. It’s worth investing time in selecting the right combination of tools for your specific use case.
The open-source community has developed numerous solutions for various aspects of data handling, including: