[NEW BLOG]: What is Apache Arrow Flight / Flight SQL / ADBC? 🎉
We need to ask: why don’t ODBC & JDBC fit in today’s analytical world?
These protocols were designed primarily for row-based workloads.
What about columnar “Arrow” based data?
dipankar-tnt.medium.com/what-is-apac...
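A toy sketch (plain Python, not the actual Arrow API) of the mismatch the post describes: a columnar engine keeps each column contiguous, but a row-oriented protocol like ODBC/JDBC hands results back record by record, forcing a transposition before data can go over the wire.

```python
# Columnar layout: one contiguous list per column (how an Arrow-style
# engine holds data; column names here are made up for illustration).
columns = {
    "user_id": [1, 2, 3],
    "clicks":  [10, 25, 7],
}

# What a row-oriented protocol forces us to produce: one record at a time.
rows = [dict(zip(columns, vals)) for vals in zip(*columns.values())]

print(rows[0])  # {'user_id': 1, 'clicks': 10}
```

Arrow Flight avoids this round trip by shipping columnar record batches directly.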
Posts by Dipankar Mazumdar
Blogged: ACID in Lakehouse - How Apache Hudi, Iceberg & Delta Lake implement it
www.onehouse.ai/blog/acid-tr...
📅 Save the Date 📅
Community Over Code North America 2025 has been announced!
Where: Minneapolis, MN (USA)
When: September 11-14, 2025
Read more about #CommunityOverCode --> https://buff.ly/4jQx36S
Blogged: Concurrency control methods in a Lakehouse with Apache Hudi, Iceberg & Delta Lake.
In this blog, I go into the fundamentals of concurrency control and explore why it is essential for lakehouses, covering OCC, MVCC & non-blocking concurrency control.
hudi.apache.org/blog/2025/01...
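A minimal sketch of the OCC idea the blog covers: each writer snapshots the table version it read, does its work without holding locks, and at commit time fails if another writer committed in the meantime. (Illustrative only; the class and names here are hypothetical, not any format's real API.)

```python
# Optimistic concurrency control in miniature: conflicts are detected
# at commit time by comparing against the version the writer read.
class Table:
    def __init__(self):
        self.version = 0

    def begin(self):
        return self.version           # snapshot the version we started from

    def commit(self, read_version):
        if self.version != read_version:   # someone else committed first
            raise RuntimeError("conflict: retry the write")
        self.version += 1                  # publish the new snapshot

t = Table()
v1 = t.begin()
v2 = t.begin()          # two writers start from the same snapshot
t.commit(v1)            # first writer wins
try:
    t.commit(v2)        # second writer conflicts and must retry
except RuntimeError as e:
    print(e)            # conflict: retry the write
```

MVCC goes further by keeping multiple versions around so readers never block on writers.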
Blogged: Clustering Algorithms in Open Lakehouse formats such as Apache Hudi, Apache Iceberg & Delta Lake.
Querying huge volumes of data from storage demands optimized query speed
Your queries are fast today, but they might not stay fast over time!
Read: www.onehouse.ai/blog/what-is...
Some of the highlight items from Hudi 1.0:
- Introduction of LSM trees (log-structured merge-trees)
- Expression & secondary indexes
- Non-blocking concurrency control
- Partial merges
Blog: hudi.apache.org/blog/2024/12...
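For readers new to LSM trees: a toy sketch of the merge semantics, assuming nothing about Hudi's actual on-disk layout. Writes land in small sorted runs that are periodically merged, with newer runs winning when a key appears in more than one.

```python
# Two sorted runs; the newer run carries an updated value for key "c".
runs = [
    [("a", 1), ("c", 3)],   # older run
    [("b", 2), ("c", 4)],   # newer run
]

# Merge: apply runs oldest-to-newest so later writes overwrite earlier ones.
merged = {}
for run in runs:
    for key, val in run:
        merged[key] = val

print(sorted(merged.items()))  # [('a', 1), ('b', 2), ('c', 4)]
```

Keeping writes append-only like this is what makes the structure cheap to write while staying efficient to read.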
✅ Similar to a database's access methods, Hudi features a multi-modal index with asynchronous MVCC, enhancing write efficiency & data consistency. It aims to apply various index types for both writes & reads, improving efficiency with new schemes, supported by engines like Presto, Spark, and Trino.
✅ Like a lock manager in a database, Hudi uses external lock managers & plans to centralize this via a metaserver. It implements Optimistic Concurrency Control (OCC) for concurrent writers and Multi-Version Concurrency Control (MVCC) to ensure non-blocking interactions.
✅ Like a database's log manager, Hudi organizes logs for recovery, structures data into file groups and slices, tracks changes via timelines, manages rollbacks with marker files, & generates compressible metadata for enhanced data tracking and operations such as CDC.
✅ Hudi has many building blocks (Log manager, Lock manager, Access methods, etc.) that make up a DBMS.
✅ If we compare Hudi's architecture to the seminal "Architecture of a Database System" paper, we can see how Hudi serves as the foundational half of a database optimized for data lakes.
This is the main "design difference" to understand when comparing/evaluating Hudi especially against other lakehouse table formats.
With the new 1.0 release of Apache Hudi, we are now closer to the vision of building the first transactional database for the data lake.
Let’s explore:
Right from its inception back at Uber, Apache Hudi has been approached as a database problem for data lakes rather than as just a standalone metadata format.
Hudi brings a core transactional layer (via its Storage Engine) to cloud data lakes, typically seen in any database management system.
“Bringing the database kernel to data lakes” - this is what Apache Hudi started with before the world heard of something called “Lakehouse”.
Lakehouse means only one thing: data lakes needed a “transactional layer” on top of Parquet for running db-style workloads (both transactional & analytical).
Now seems like a good time to remind people of my starter pack.
go.bsky.app/T1SxhAe
The 200th edition of Data Engineering Weekly is out. Thank you all for your kind support
www.dataengineeringw...
Wrote a lil bit about openness and interoperability here: www.onehouse.ai/blog/open-ta...
The question we should ask: can I seamlessly switch between specific components or the overall platform—whether it's a vendor-managed or self-managed open source solution—as new requirements emerge?
Apache XTable at Scale in Production!
This is a solid example to show the metadata translation capability for open table formats like Iceberg, Hudi & Delta.
Fabric users can work with Iceberg tables written by Snowflake without any rewrites or data copies.
Link: blog.fabric.microsoft.com/en-us/blog/s...
With this feature, users can use OneLake shortcuts to point to an Iceberg table written using Snowflake (or another engine), and it will present that table as a Delta Lake table, which works well within the Fabric ecosystem.
This is powered by XTable
xtable.apache.org
✅ Customers using multiple formats can add XTable to their existing data pipelines (say, via an Apache Airflow operator or a Lambda function)
The announcement on Fabric OneLake-Snowflake interoperability is a critical example that solidifies point (1).
And even if they do work with multiple formats, it is practically tough to build optimization capabilities for each of these formats.
So, to summarize, I see XTable having 2 major applications:
✅ On the compute-side with vendors using XTable as the interoperability layer
On the query engine-side (warehouse, lake compute), more & more vendors are now looking at integrating with these open formats.
In reality, it is tough to have robust support for every single format. By robust, I mean full write support, schema evolution & compaction.
Each of these formats shines in specific use cases depending on its unique features!
And so based on your use case & technical fit (in your data architecture), you should be free to use anything without being married to just one.
XTable started with the core idea around “interoperability”.
That you should be able to write data in any format of your choice irrespective of whether it’s Iceberg, Hudi or Delta.
Then you can bring any compute engine of your choice that works well with a particular format & run analytics on top.
New blog post on the fun new hardware advancements which databases can leverage for great gains, and why the cloud means it doesn't matter that they exist. 🫠
transactional.blog/b...
This might be the first time an open source app is at the top of the app store. Definitely the first open source social app.
Congrats Alison & good to see you here!
5. Cleaning: As data is continuously written, updated, & deleted, older file versions & metadata accumulate over time. This can lead to significant storage bloat & long file listing times, which negatively impact query performance.
You need a cleaner service like this: hudi.apache.org/docs/hoodie_...
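A hedged sketch of what such a cleaner service does, reduced to its simplest policy: keep the newest N versions of each file group and plan deletes for the rest. (Illustrative only; `plan_clean` and its inputs are made up here — see the linked Hudi docs for the real retention policies.)

```python
def plan_clean(file_versions, keep_last=2):
    """file_versions: {file_group: [version, ...]}.
    Returns (group, version) pairs that are safe to delete."""
    to_delete = []
    for group, versions in file_versions.items():
        # Everything except the newest `keep_last` versions is reclaimable.
        for v in sorted(versions)[:-keep_last]:
            to_delete.append((group, v))
    return to_delete

print(plan_clean({"fg-1": [1, 2, 3, 4], "fg-2": [7]}))
# [('fg-1', 1), ('fg-1', 2)]
```

Real cleaners also have to respect in-flight readers, which is why retention is usually expressed in commits or hours rather than a raw count.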
4. Data Skipping: This is a technique used to enhance query performance by eliminating the need to scan irrelevant files.
This includes Parquet min/max stats & Bloom Filters.
A Bloom filter is a probabilistic data structure that lets you quickly determine whether a value might be present in a dataset.
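A minimal Bloom filter sketch to make the "might be present" semantics concrete: k hashes set k bits per inserted value; if any of a lookup's bits is unset, the value is definitely absent, otherwise it is maybe present. (A toy, not the tuned implementations Parquet or Hudi actually use.)

```python
import hashlib

SIZE, K = 256, 3  # bit-array size and number of hash functions (toy values)

def _bits(value):
    # Derive K positions from K salted hashes of the value.
    for i in range(K):
        h = hashlib.sha256(f"{i}:{value}".encode()).digest()
        yield int.from_bytes(h[:4], "big") % SIZE

def make_filter(values):
    bits = [False] * SIZE
    for v in values:
        for b in _bits(v):
            bits[b] = True
    return bits

def maybe_contains(bits, value):
    # Any unset bit -> definitely not inserted; all set -> maybe inserted.
    return all(bits[b] for b in _bits(value))

bf = make_filter(["user_42", "user_99"])
print(maybe_contains(bf, "user_42"))    # True — no false negatives, ever
```

False positives are possible (unrelated values may hash to already-set bits), which is why a Bloom filter can only rule files out, never confirm a match.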
Clustering algorithms include linear sorting & multi-dimensional clustering (Z-ordering, Hilbert Curves).
Multi-dimensional clustering reorganizes data across multiple columns simultaneously, optimizing queries that filter on more than one dimension.
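A quick sketch of one such technique, Z-ordering: interleave the bits of two column values into a single Morton key, so rows close in both dimensions get nearby keys and a plain sort on the key clusters the data for multi-column filters. (The sample values are made up for illustration.)

```python
def z_order(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one Morton key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # even bit positions: x
        key |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions: y
    return key

rows = [(3, 5), (3, 4), (200, 7), (2, 5)]
rows.sort(key=lambda r: z_order(*r))
print(rows)  # [(3, 4), (2, 5), (3, 5), (200, 7)]
```

Note how (200, 7) sorts far from the small points even though its second coordinate is close: the key reflects distance in both dimensions at once, unlike a linear sort on a single column.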