[NEW BLOG]: What is Apache Arrow Flight / Flight SQL / ADBC? 🎉
We need to ask: why don’t ODBC & JDBC fit in today’s analytical world?
These protocols were designed primarily for row-based workloads.
What about columnar “Arrow” based data?
dipankar-tnt.medium.com/what-is-apac...
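A toy sketch (plain Python, not the actual Arrow API) of the mismatch the post describes: a columnar engine keeps each column contiguous, but a row-oriented protocol like ODBC/JDBC hands results back record by record, forcing a transposition before data can go over the wire.

```python
# Columnar layout: one contiguous list per column (how an Arrow-style
# engine holds data; column names here are made up for illustration).
columns = {
    "user_id": [1, 2, 3],
    "clicks":  [10, 25, 7],
}

# What a row-oriented protocol forces us to produce: one record at a time.
rows = [dict(zip(columns, vals)) for vals in zip(*columns.values())]

print(rows[0])  # {'user_id': 1, 'clicks': 10}
```

Arrow Flight avoids this round trip by shipping columnar record batches directly.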
Posts by Dipankar Mazumdar
Blogged: ACID in Lakehouse - How Apache Hudi, Iceberg & Delta Lake implement it
www.onehouse.ai/blog/acid-tr...
📅 Save the Date 📅
Community Over Code North America 2025 has been announced!
Where: Minneapolis, MN (USA)
When: September 11-14, 2025
Read more about #CommunityOverCode --> https://buff.ly/4jQx36S
Blogged: Concurrency control methods in a Lakehouse with Apache Hudi, Iceberg & Delta Lake.
In this blog, I go into the fundamentals of concurrency control and explore why it is essential for lakehouses, covering OCC, MVCC & non-blocking concurrency control.
hudi.apache.org/blog/2025/01...
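A minimal sketch of the OCC idea the blog covers: each writer snapshots the table version it read, does its work without holding locks, and at commit time fails if another writer committed in the meantime. (Illustrative only; the class and names here are hypothetical, not any format's real API.)

```python
# Optimistic concurrency control in miniature: conflicts are detected
# at commit time by comparing against the version the writer read.
class Table:
    def __init__(self):
        self.version = 0

    def begin(self):
        return self.version           # snapshot the version we started from

    def commit(self, read_version):
        if self.version != read_version:   # someone else committed first
            raise RuntimeError("conflict: retry the write")
        self.version += 1                  # publish the new snapshot

t = Table()
v1 = t.begin()
v2 = t.begin()          # two writers start from the same snapshot
t.commit(v1)            # first writer wins
try:
    t.commit(v2)        # second writer conflicts and must retry
except RuntimeError as e:
    print(e)            # conflict: retry the write
```

MVCC goes further by keeping multiple versions around so readers never block on writers.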
Blogged: Clustering Algorithms in Open Lakehouse formats such as Apache Hudi, Apache Iceberg & Delta Lake.
Querying huge volumes of data from storage demands optimized query speed
Your queries are fast today, but they might not stay fast over time!
Read: www.onehouse.ai/blog/what-is...
Some of the highlight items from Hudi 1.0:
- Introduction of LSM trees (log-structured merge-trees)
- Expression & secondary indexes
- Non-blocking concurrency control
- Partial merges
Blog: hudi.apache.org/blog/2024/12...
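For readers new to LSM trees: a toy sketch of the merge semantics, assuming nothing about Hudi's actual on-disk layout. Writes land in small sorted runs that are periodically merged, with newer runs winning when a key appears in more than one.

```python
# Two sorted runs; the newer run carries an updated value for key "c".
runs = [
    [("a", 1), ("c", 3)],   # older run
    [("b", 2), ("c", 4)],   # newer run
]

# Merge: apply runs oldest-to-newest so later writes overwrite earlier ones.
merged = {}
for run in runs:
    for key, val in run:
        merged[key] = val

print(sorted(merged.items()))  # [('a', 1), ('b', 2), ('c', 4)]
```

Keeping writes append-only like this is what makes the structure cheap to write while staying efficient to read.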
✅ Similar to a database's access methods, Hudi features a multi-modal index with asynchronous MVCC, enhancing write efficiency & data consistency. It aims to apply various index types for both writes & reads, improving efficiency with new schemes, supported by engines like Presto, Spark, and Trino.
✅ Like a lock manager in a database, Hudi uses external lock managers & plans to centralize this via a metaserver. It implements Optimistic Concurrency Control (OCC) for concurrent writers and Multi-Version Concurrency Control (MVCC) to ensure non-blocking interactions.
✅ Like a database's log manager, Hudi organizes logs for recovery, structures data into file groups and slices, tracks changes via timelines, manages rollbacks with marker files, & generates compressible metadata for enhanced data tracking and operations such as CDC.
✅ Hudi has many building blocks (Log manager, Lock manager, Access methods, etc.) that make up a DBMS.
✅ If we compare Hudi's architecture to the seminal "Architecture of a Database System" paper, we can see how Hudi serves as the foundational half of a database optimized for data lakes.
This is the main "design difference" to understand when comparing/evaluating Hudi especially against other lakehouse table formats.
With the new 1.0 release of Apache Hudi, we are now closer to the vision of building the first transactional database for the data lake.
Let’s explore:
Right from its inception back at Uber, Apache Hudi has been approached as a database problem for data lakes rather than as just a standalone metadata format.
Hudi brings a core transactional layer (via its Storage Engine) to cloud data lakes, typically seen in any database management system.
“Bringing the database kernel to data lakes” - this is what Apache Hudi started with before the world heard of something called “Lakehouse”.
Lakehouse means only one thing: data lakes needed a “transactional layer” on top of Parquet for running db-style workloads (both transactional & analytical).
Now seems like a good time to remind people of my starter pack.
go.bsky.app/T1SxhAe
The 200th edition of Data Engineering Weekly is out. Thank you all for your kind support
www.dataengineeringw...
Wrote a lil bit about openness and interoperability here: www.onehouse.ai/blog/open-ta...
The question we should ask: can I seamlessly switch between specific components or the overall platform—whether it's a vendor-managed or self-managed open source solution—as new requirements emerge?
Apache XTable at Scale in Production!
This is a solid example to show the metadata translation capability for open table formats like Iceberg, Hudi & Delta.
Fabric users can work with Iceberg tables written by Snowflake without any rewrites or data copies.
Link: blog.fabric.microsoft.com/en-us/blog/s...
With this feature, users can use OneLake shortcuts to point to an Iceberg table written using Snowflake (or another engine), and it will present that table as a Delta Lake table, which works well within the Fabric ecosystem.
This is powered by XTable
xtable.apache.org
✅ Customers using multiple formats can add XTable to their existing data pipelines (say, via an Apache Airflow operator or a Lambda function)
The announcement on Fabric OneLake-Snowflake interoperability is a critical example that solidifies point (1).
And even if they do work with multiple formats, it is practically tough to build optimization capabilities for each of these formats.
So, to summarize, I see XTable having 2 major applications:
✅ On the compute-side with vendors using XTable as the interoperability layer
On the query engine-side (warehouse, lake compute), more & more vendors are now looking at integrating with these open formats.
In reality, it is tough to have robust support for every single format. By robust, I mean full write support, schema evolution & compaction.
Each of these formats shines in specific use cases depending on its unique features!
And so based on your use case & technical fit (in your data architecture), you should be free to use anything without being married to just one.
XTable started with the core idea around “interoperability”.
That you should be able to write data in any format of your choice irrespective of whether it’s Iceberg, Hudi or Delta.
Then you can bring any compute engine of your choice that works well with a particular format & run analytics on top.
New blog post on the fun new hardware advancements which databases can leverage for great gains, and why the cloud means it doesn't matter that they exist. 🫠
transactional.blog/b...
This might be the first time an open source app is at the top of the app store. Definitely the first open source social app.
Congrats Alison & good to see you here!
5. Cleaning: As data is continuously written, updated, & deleted, older file versions & metadata accumulate over time. This can lead to significant storage bloat & long file listing times, which negatively impact query performance.
You need a cleaner service like this: hudi.apache.org/docs/hoodie_...
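A hedged sketch of what such a cleaner service does, reduced to its simplest policy: keep the newest N versions of each file group and plan deletes for the rest. (Illustrative only; `plan_clean` and its inputs are made up here — see the linked Hudi docs for the real retention policies.)

```python
def plan_clean(file_versions, keep_last=2):
    """file_versions: {file_group: [version, ...]}.
    Returns (group, version) pairs that are safe to delete."""
    to_delete = []
    for group, versions in file_versions.items():
        # Everything except the newest `keep_last` versions is reclaimable.
        for v in sorted(versions)[:-keep_last]:
            to_delete.append((group, v))
    return to_delete

print(plan_clean({"fg-1": [1, 2, 3, 4], "fg-2": [7]}))
# [('fg-1', 1), ('fg-1', 2)]
```

Real cleaners also have to respect in-flight readers, which is why retention is usually expressed in commits or hours rather than a raw count.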
4. Data Skipping: This is a technique used to enhance query performance by eliminating the need to scan irrelevant files.
This includes Parquet min/max stats & Bloom Filters.
A Bloom filter is a probabilistic data structure that lets you quickly determine whether a value might be present in a dataset.
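A minimal Bloom filter sketch to make the "might be present" semantics concrete: k hashes set k bits per inserted value; if any of a lookup's bits is unset, the value is definitely absent, otherwise it is maybe present. (A toy, not the tuned implementations Parquet or Hudi actually use.)

```python
import hashlib

SIZE, K = 256, 3  # bit-array size and number of hash functions (toy values)

def _bits(value):
    # Derive K positions from K salted hashes of the value.
    for i in range(K):
        h = hashlib.sha256(f"{i}:{value}".encode()).digest()
        yield int.from_bytes(h[:4], "big") % SIZE

def make_filter(values):
    bits = [False] * SIZE
    for v in values:
        for b in _bits(v):
            bits[b] = True
    return bits

def maybe_contains(bits, value):
    # Any unset bit -> definitely not inserted; all set -> maybe inserted.
    return all(bits[b] for b in _bits(value))

bf = make_filter(["user_42", "user_99"])
print(maybe_contains(bf, "user_42"))    # True — no false negatives, ever
```

False positives are possible (unrelated values may hash to already-set bits), which is why a Bloom filter can only rule files out, never confirm a match.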
Clustering algorithms include linear sorting & multi-dimensional clustering (Z-ordering, Hilbert Curves).
Multi-dimensional clustering reorganizes data across multiple columns simultaneously, optimizing queries that filter on more than one dimension.
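A quick sketch of one such technique, Z-ordering: interleave the bits of two column values into a single Morton key, so rows close in both dimensions get nearby keys and a plain sort on the key clusters the data for multi-column filters. (The sample values are made up for illustration.)

```python
def z_order(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one Morton key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # even bit positions: x
        key |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions: y
    return key

rows = [(3, 5), (3, 4), (200, 7), (2, 5)]
rows.sort(key=lambda r: z_order(*r))
print(rows)  # [(3, 4), (2, 5), (3, 5), (200, 7)]
```

Note how (200, 7) sorts far from the small points even though its second coordinate is close: the key reflects distance in both dimensions at once, unlike a linear sort on a single column.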