Skip to main content

Excel-Parquet Integration: Mastering Data Analysis with DuckDB

· 5 min read
Simon Müller
Co-Founder & CTO

Summary

  • The article guides on integrating Excel with Parquet files using DuckDB, highlighting the efficiency of DuckDB for large data sets and how it surpasses Excel's normal data handling limits.
  • It includes step-by-step instructions for installing the DuckDB ODBC driver, configuring Excel, setting up the ODBC connection, and suggests using Power Query Desktop for data transformation.

Introduction

DuckDB, an open-source database engine optimized for OLAP, offers a lightweight, low-dependency solution for large-scale data analysis, reminiscent of Microsoft Access. Excelling in handling Parquet files with robust SQL support, DuckDB extends Excel's data processing capabilities, enabling analysis of datasets beyond Excel's usual limits. This blog post demonstrates the straightforward process of connecting Excel with Parquet files using DuckDB, unlocking new possibilities in data handling and analysis.

ODBC Installation

To begin querying Parquet files with DuckDB, you must first install the DuckDB ODBC driver. The specific release you'll need for this setup is DuckDB Release 0.9.2. The installation process is as follows:

  1. Download the Windows ODBC driver bundle from the DuckDB releases page. Look for duckdb_odbc-windows-amd64.zip.
  2. Extract the archive to a folder you control — for example C:\Tools\duckdb_odbc\.
  3. Run odbc_install.exe from inside that folder as Administrator. The installer registers the driver under the name DuckDB Driver in the Windows ODBC Data Source Administrator.

You can verify the driver is registered by opening ODBC Data Sources (64-bit) from the Start menu and looking under the Drivers tab.

Preparing a Parquet File from DuckDB

Before pointing Excel at anything, you need a Parquet file to read. Open the DuckDB CLI and create one from a real-ish dataset:

-- Pull a public CSV directly into DuckDB
CREATE TABLE flights AS
SELECT *
FROM read_csv_auto('https://raw.githubusercontent.com/plotly/datasets/master/2015_flights.csv');

-- Persist it as Parquet
COPY flights TO 'C:/Data/flights.parquet' (FORMAT PARQUET);

That single COPY statement compresses ~580k flight rows down to a few megabytes — Excel would balk at the raw CSV but DuckDB chews through it in seconds.

Configuring the ODBC Data Source

Excel reaches Parquet through DuckDB via a System DSN.

  1. Open ODBC Data Sources (64-bit)System DSN tab → Add….
  2. Pick DuckDB Driver and click Finish.
  3. In the DSN dialog, give the source a name (e.g. duckdb-parquet) and set the Database field to the directory holding your Parquet files — for example C:\Data\. Leave the rest at defaults.
  4. Click OK to save.

DuckDB's ODBC layer will then treat each *.parquet file in that folder as a queryable table.

Connecting from Excel

In Excel 2019+ / Microsoft 365:

  1. DataGet DataFrom Other SourcesFrom ODBC.
  2. Pick the duckdb-parquet DSN you created.
  3. Excel pops the Navigator window with each Parquet file exposed as a table.
  4. Choose Transform Data to drop into Power Query Editor instead of loading the entire file straight into a worksheet — Power Query is where you do the heavy lifting.

Why Power Query Desktop

Loading a 580k-row Parquet file directly into a worksheet defeats the point of using DuckDB. Power Query Editor lets you push filtering, grouping, and joins back into the database via folding:

  • Add a Filter Rows step (e.g. DEPARTURE_DELAY > 30) and Power Query will translate it into an ODBC WHERE clause.
  • Add Group By for an aggregation (e.g. flights per AIRLINE) and again it folds into a SQL GROUP BY against DuckDB.
  • Only the result set lands in the worksheet — usually a few hundred rows that Excel handles instantly.

You can confirm the folding is happening by right-clicking a step and looking at View Native Query. If Power Query shows you the generated SQL, the step is being executed inside DuckDB.

A Realistic Example

Suppose the flights file has 27 columns and you only want average and max departure delay per airline, plus carrier code. Build the query in Power Query Editor:

Source       → ODBC.DataSource("dsn=duckdb-parquet")
flights → Source{[Name="flights"]}[Data]
selected → Table.SelectColumns(flights, {"AIRLINE", "DEPARTURE_DELAY"})
grouped → Table.Group(
selected, {"AIRLINE"},
{{"avg_delay", each List.Average([DEPARTURE_DELAY])},
{"max_delay", each List.Max([DEPARTURE_DELAY])}}
)

Excel only ever pulls the aggregated result — a 14-row table by airline.

Where This Fits in a Bigger Stack

The Parquet-via-ODBC pattern shines for analyst workflows where:

  • Source files are too big for Excel to hold in memory but small enough to ship as Parquet (gigabytes, not terabytes).
  • The team already lives in Excel and Power Query and shouldn't have to learn a BI tool just to slice a few million rows.
  • You want Excel to be the consumer of computed results, not the place the heavy work happens.

For SAP-resident data the same idea applies — extract once with ERPL, persist to Parquet, hand the Parquet to Excel users via DuckDB ODBC. Best of both worlds: SAP fidelity, Excel ergonomics, no warehouse in between.

Wrapping Up

Connecting Excel to Parquet through DuckDB takes about ten minutes once the ODBC driver is in place, and immediately unlocks datasets that previously didn't fit. Pair it with Power Query's query-folding and you keep all the heavy filtering and grouping inside DuckDB, where it belongs.

If you want to take the same pattern further into your SAP landscape, install ERPL and the same Parquet trick works directly against KNA1, VBAK, or any BW data source.