Pandas vs DuckDB: A Practical Comparison for Python Data Workflows
Pandas has been the go-to tool for data manipulation in Python for well over ten years. Whether you have cleaned features for a machine learning model, inspected a CSV file in a notebook, or built a small ETL workflow, there is a good chance your code has centered around the Pandas DataFrame. Pandas has played a major role in the Python data ecosystem for a long time.
However, as datasets have grown and analytical workloads have become more demanding, the in-memory and mostly single-threaded nature of Pandas has increasingly become a limitation. DuckDB addresses many of these challenges as an in-process SQL OLAP, or Online Analytical Processing, engine. Its common description as “SQLite for analytics” reflects its promise of:
- Fast and scalable analytical queries
- Native support for Parquet and Arrow
- Parallel execution
- The ability to process datasets larger than available memory
This article compares both tools from a practical, developer-focused perspective. It looks at how they differ in architecture, performance, scalability, and developer experience. It also explains where DuckDB can meaningfully take over tasks from Pandas and where Pandas remains the better option.
Key Takeaways
- DuckDB should be seen as a complement to Pandas, not as a full replacement. DuckDB is designed for large analytical workloads, while Pandas remains very useful for small to medium-sized datasets, feature engineering, and general Python-native data manipulation.
- Their architectures are fundamentally different. Pandas is a Python library built around an in-memory DataFrame object. DuckDB is a vectorized, SQL-first, column-oriented OLAP engine that supports parallel execution and can query data beyond memory limits.
- Performance and scalability often favor DuckDB, especially when working with large Parquet or Arrow datasets. Its query execution can be much faster than Pandas because of projection pushdown, filter pushdown, multi-threading, and disk-backed execution.
- A major strength of DuckDB is its ability to fit into hybrid workflows. It can query Pandas, Polars, and Arrow DataFrames directly, combine them with large external Parquet, CSV, or SQLite files, and return results in different formats. This makes it possible to use SQL directly inside a notebook workflow.
- The best choice depends on the task. DuckDB is well suited for OLAP-style queries, large joins, and big datasets. Pandas is still excellent for interactive data cleaning, custom Python logic, close integration with machine learning and visualization libraries, and workloads that fit comfortably into RAM.
What Are Pandas and DuckDB?
To compare their performance and workflow capabilities properly, it helps to understand what each tool is designed to be. Pandas is a DataFrame library that is deeply integrated with Python. DuckDB, by contrast, is an embedded SQL analytics engine that runs directly inside your Python process.
Pandas as a Python DataFrame Library
Pandas is widely regarded as the standard library for data wrangling in Python. It introduced the DataFrame concept to Python in 2008 and has since become one of the most important tools in the data science ecosystem. A Pandas DataFrame is a two-dimensional, in-memory table structure with labeled rows and columns. It provides fast indexed access through a Python API. Operations such as slicing, filtering, and transforming data are supported by C and NumPy operations for vectorized execution.
Key Characteristics of Pandas
- In-memory operation: Pandas DataFrames are stored in RAM. This allows fast random access for smaller datasets, but it also means that the complete dataset must fit into memory. When working with several million rows, memory errors and noticeable slowdowns can appear quickly if the system runs out of available memory.
- Rich API and ecosystem: Pandas provides a broad, Python-friendly API for filtering, aggregating, merging, time series handling, and many other data tasks. It is highly flexible and also allows custom Python logic inside DataFrame workflows. Pandas also works very well with other libraries, including visualization tools and scikit-learn, making it easy to pass a DataFrame into charts or machine learning models.
- Ease of use: Many data scientists find Pandas intuitive. Since the code is written in normal Python, it is usually straightforward to debug and iterate on. Basic Pandas workflows also have a relatively gentle learning curve.
DuckDB as an In-Process SQL Engine
DuckDB is often described as the “SQLite for analytics”. Like Pandas, it can query data inside your Python process. Under the hood, however, DuckDB works very differently:
- Columnar, vectorized engine: DuckDB stores data in a column-oriented format and uses vectorized execution, processing chunks of column values, or vectors, at a time.
- SQL-centered interface: DuckDB is mainly used through SQL queries such as SELECT, JOIN, and GROUP BY. It is a complete SQL engine with a rich dialect that is similar to PostgreSQL SQL. Workflows that require many lines of Pandas code can often be written as one SQL query in DuckDB.
- Disk-based and memory-conscious: DuckDB can query disk-based columnar formats such as Parquet, CSV, and JSON directly without first loading everything into memory. In practice, you can point DuckDB to a 10 GB Parquet file and run a limited SELECT query almost immediately with low memory overhead.
- Embedded OLAP database functionality: Because DuckDB is a database engine, it includes features expected from databases, including concurrent query support, transactions, window functions, common table expressions, parallel execution, and other advanced SQL capabilities.
- Lightweight and dependency-free: DuckDB is a small package of around 10 MB with no external server requirement. After installing it with pip, it is ready to use. Everything it needs is statically linked, so no separate database service is required.
Architectural Differences Between Pandas and DuckDB
A comparison between DuckDB and Pandas shows several important architectural distinctions:
| Feature | Pandas DataFrame Library | DuckDB In-Process SQL Engine |
|---|---|---|
| Core abstraction | Python DataFrame and Series objects with an imperative API. | Relational tables queried through declarative SQL and exposed through a Python API. |
| Storage format | In practice, DataFrames are row-oriented from a workflow perspective. Columns are backed by NumPy or Arrow arrays, but many operations move through rows in Python. | Columnar storage with data held in contiguous column blocks, optimized for scans and aggregations. |
| Execution model | Runs through Python loops and vectorized NumPy or Arrow operations. It is single-threaded by default, with multi-threading only in certain cases. | Uses a vectorized query engine that processes batches of values per operator through a C++ engine and can use multiple CPU cores. |
| Memory and scale | The full DataFrame, or grouped data during group-by operations, must fit into RAM. Out-of-core workflows need manual workarounds or other tools. | Built for larger-than-memory analytics, with transparent disk spilling and streaming from columnar files such as Parquet. |
| Concurrency and transactions | No ACID transaction layer. Operations modify in-memory objects inside one Python process. | Supports ACID transactions with MVCC, allowing safe transactional updates and consistent reads, including when using a persistent .duckdb file. |
| Query language and ergonomics | Python DataFrame API with method chains such as .groupby(), .agg(), and .merge(). Very flexible, but complex pipelines can become verbose. | Standard SQL with joins, window functions, common table expressions, and macros. For users comfortable with SQL, complex analytics can often be clearer and shorter. |
| Integration with DataFrames | Works directly with in-memory DataFrames, and much of the ecosystem is built around the Pandas API. | Can query Pandas and Polars DataFrames through replacement scans, using zero-copy where possible, and can return results as Pandas, Polars, Arrow, or NumPy. |
| File and external data access | Often depends on separate Python libraries, such as pyarrow or storage-specific packages, for Parquet, cloud/object storage, and other formats. Input and output logic is usually written explicitly in Python. | Includes connectors and extensions that can query CSV, Parquet, JSON, and many other formats, including object storage and HTTP sources, directly from SQL without additional Python packages. |
| Ecosystem and tooling | Large and mature ecosystem with strong support in visualization and machine learning libraries such as Matplotlib, Seaborn, scikit-learn, and statsmodels. | Growing analytics-focused ecosystem. Integrates through Arrow and connectors and can serve as a fast local engine for BI tools, dbt, or custom applications. |
| Typical sweet spot | Interactive exploration, feature engineering, and small to medium-sized datasets that fit easily into memory. Well suited for fast experimentation in notebooks. | Heavy analytical queries such as joins, aggregations, and window functions on large local datasets. Useful for SQL-first workflows, local OLAP, and moving expensive work away from Pandas. |
DuckDB is built as a columnar database, so an aggregation such as SELECT SUM(value) can read only the columns needed instead of entire rows. With vectorized execution, operations are performed on large blocks of values at once, reducing overhead per operation. Pandas, by default, often processes data row by row or in smaller slices. As a result, DuckDB is frequently faster and more memory-efficient for aggregations, filters, and joins, especially as datasets become larger.
Benchmarks
The following benchmark compares DuckDB, Polars, and Pandas. It was run on a MacBook Pro M2 with 16 GB of RAM. The synthetic benchmark dataset represents 1 million people and 10 million position records. The performance results showed DuckDB as the fastest solution. On average, DuckDB finished in 3.84 seconds. Polars followed closely with an average runtime of 5.77 seconds. Pandas was slower than both DuckDB and Polars, with an average runtime of 19.57 seconds, which is more than five times slower than DuckDB.
This result is useful when evaluating the strengths and limits of Pandas, especially for large datasets or performance-sensitive data processing. It also shows why newer and more specialized data processing tools such as DuckDB and Polars are becoming increasingly important. Depending on the workload, they may offer substantial performance advantages over Pandas.
Integrating DuckDB and Pandas
One of DuckDB’s biggest strengths is that it does not force you to choose between SQL and Pandas. Instead, it fits into existing Python workflows:
- Run SQL queries directly against in-memory Pandas DataFrames.
- Return query results as Pandas, Polars, or Arrow objects.
- Combine external files such as CSV, Parquet, or JSON with DataFrames in one query.
- Use the same integration style with Polars and Apache Arrow.
DuckDB becomes a SQL engine inside your notebook rather than a separate system that has to be managed.
1. Querying Pandas DataFrames with SQL Using Replacement Scans
With DuckDB’s replacement scan mechanism, a Pandas DataFrame can be treated like a SQL table by referencing the variable name in a query. No explicit registration is required.
Example: Filtering and Transforming a Pandas DataFrame with SQL
pip install duckdb
import duckdb
import pandas as pd
# Create a small DataFrame in Pandas
df = pd.DataFrame(
{
"city": ["Paris", "Berlin", "Lagos", "Delhi"],
"temp_c": [21, 19, 30, 35],
}
)
# DuckDB detects 'df' as a table through replacement scans
result = duckdb.sql("""
SELECT
city,
temp_c,
temp_c * 9.0/5.0 + 32 AS temp_f
FROM df
WHERE temp_c >= 25
""").df()
print(result)
Key Points
- df should be a regular Pandas DataFrame.
- duckdb.sql(…) automatically reads df as a SQL table named df.
- No copy operation is required because DuckDB accesses the existing DataFrame in memory.
- The final result is materialized as another Pandas DataFrame.
You can also use an explicit connection and registration when you need more control:
con = duckdb.connect()
con.register("cities", df)
hot_cities = con.sql("""
SELECT city, temp_c
FROM cities
WHERE temp_c >= 25
""").df()
2. Returning Results as Pandas, Polars, or Arrow
After executing a SQL query, you usually continue working in the Python data tool that best fits your workflow. DuckDB’s Python API provides convenient conversion methods for this:
- .df() returns a Pandas DataFrame
- .pl() returns a Polars DataFrame
- .arrow() returns an Apache Arrow Table
- .fetchnumpy() returns NumPy arrays
Example: One SQL Query with Multiple Result Formats
import duckdb
import pandas as pd
import polars as pl
import pyarrow as pa
# Simple example dataset
df = pd.DataFrame(
{
"product": ["A", "A", "B", "B"],
"amount": [100, 150, 80, 120],
}
)
rel = duckdb.sql("""
SELECT
product,
SUM(amount) AS total_amount
FROM df
GROUP BY product
ORDER BY total_amount DESC
""")
# 1) Return results as Pandas
pandas_result = rel.df()
# 2) Return results as Polars
polars_result = rel.pl()
# 3) Return results as an Arrow table
arrow_result = rel.arrow()
print("Pandas:\n", pandas_result)
print("Polars:\n", polars_result)
print("Arrow schema:\n", arrow_result.schema)
In a real project, you might:
- Use .df() when passing the result to scikit-learn or Python plotting libraries.
- Use .pl() when the next processing steps are written in Polars.
- Use .arrow() when data needs to move into systems that support Arrow or when zero-copy interchange is useful.
3. Hybrid Workflows: Joining Files and DataFrames in One Query
DuckDB is especially strong in hybrid workflows that combine:
- Large datasets stored in files, such as Parquet on disk or in object storage.
- Smaller reference tables held in memory through Pandas.
These can be joined in one SQL statement without writing custom ETL logic.
Example: Joining a Large Parquet Fact Table with a Small Pandas Dimension Table
For example, you might have:
- data/orders.parquet, containing tens of millions of orders and too large to handle comfortably in Pandas.
- A small Pandas DataFrame with customer segments calculated from a CRM export.
import os
import duckdb
import pandas as pd
from pathlib import Path
# 1. Make sure the 'data' directory exists
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
# 2. Build a small demo orders dataset in Pandas
orders = pd.DataFrame(
{
"order_id": [101, 102, 103, 104, 105],
"customer_id": [1, 2, 1, 3, 2],
"order_date": pd.to_datetime(
["2025-01-02", "2025-01-03", "2025-01-05", "2025-02-01", "2025-02-10"]
),
"order_amount": [200.0, 150.0, 300.0, 80.0, 120.0],
}
)
parquet_path = data_dir / "orders.parquet"
# 3. Store the data as Parquet, using pyarrow or fastparquet
orders.to_parquet(parquet_path, index=False)
print(f"Saved demo file at: {parquet_path.resolve()}")
# 4. Create a small customer segment DataFrame
customer_segments = pd.DataFrame(
{
"customer_id": [1, 2, 3],
"segment": ["Enterprise", "SMB", "Consumer"],
}
)
# 5. Connect DuckDB and register the dimension table
con = duckdb.connect()
con.register("customer_segments", customer_segments)
# 6. Execute a hybrid query that joins Parquet data with Pandas data
query = f"""
SELECT
o.customer_id,
s.segment,
SUM(o.order_amount) AS total_spend
FROM '{parquet_path.as_posix()}' AS o
JOIN customer_segments AS s
ON o.customer_id = s.customer_id
WHERE o.order_date >= DATE '2025-01-01'
GROUP BY o.customer_id, s.segment
ORDER BY total_spend DESC
"""
top_customers = con.sql(query).df()
print(top_customers)
This pattern demonstrates a best-of-both-worlds approach:
- Pandas can be used for ad-hoc data preparation, feature engineering, or creating lookup tables.
- DuckDB can efficiently scan and aggregate large columnar files.
- Everything happens inside one Python process or notebook, without needing Spark, Dask, or a separate data warehouse.
4. Replacement Scans with Polars and Arrow
DuckDB works well with Pandas, but its integration does not stop there. It also works smoothly with Polars and Apache Arrow. This is important when using a mixed data stack.
There are two common scenarios:
- Use a Polars DataFrame or Arrow table as input for DuckDB queries.
- Return DuckDB query results as Polars or Arrow objects.
Example A: Querying a Polars DataFrame with SQL
import duckdb
import polars as pl
# Create a Polars DataFrame
pl_df = pl.DataFrame(
{
"id": [1, 2, 3, 4],
"value": [10, 20, 30, 40],
}
)
# DuckDB can detect 'pl_df' similarly to Pandas through replacement scans
rel = duckdb.sql("""
SELECT
id,
value,
value * 2 AS value_x2
FROM pl_df
WHERE value >= 20
""")
# Return the result as Polars again
result_pl = rel.pl()
print(result_pl)
Example B: Working with Arrow Tables
import duckdb
import pyarrow as pa
# Create a small Arrow table
arrow_table = pa.table(
{
"event_type": ["click", "view", "click", "purchase"],
"user_id": [10, 10, 11, 10],
"amount": [None, None, None, 99.0],
}
)
# Register the Arrow table explicitly for clarity
con = duckdb.connect()
con.register("events", arrow_table)
# Run SQL on the Arrow table
agg = con.sql("""
SELECT
user_id,
COUNT(*) AS num_events,
SUM(CASE WHEN event_type = 'purchase' THEN amount ELSE 0 END) AS revenue
FROM events
GROUP BY user_id
""").arrow()
print(agg)
In both examples:
- Unnecessary copies are avoided by staying close to Arrow or Polars memory layouts.
- DuckDB acts as a fast SQL layer over existing DataFrame or Arrow data.
- You can convert between Pandas, Polars, and Arrow at the boundaries of your pipeline, depending on what each step requires.
When DuckDB Can Replace Pandas and When It Should Not
Should Pandas be replaced completely by DuckDB? In some situations, yes, but not in every case. The following overview shows which types of tasks DuckDB can handle well and which are usually better suited to Pandas:
| Tool | Use Case | When to Prefer It / Description |
|---|---|---|
| DuckDB | Very large datasets or big data on a laptop | Useful for analyzing datasets that are too large for Pandas, such as tens of millions of rows or multi-gigabyte files. It is suitable for log files, telemetry, and other large tables where Pandas would require sampling, downcasting, or might fail. DuckDB can often work with more than 100 million rows on a typical laptop with 16 GB RAM. |
| DuckDB | Analytical queries and OLAP-style reporting | Best for workflows that resemble SQL, including grouping, joining, filtering, aggregations, and window functions. DuckDB can replace chains of df.groupby().agg() and merge operations with compact SQL. It is often used for reports, exploratory analysis, and notebook-based transformations. |
| DuckDB | Columnar file formats such as Parquet and Arrow | Useful when data is mainly stored as Parquet or Arrow. DuckDB can query these files directly without loading full DataFrames into memory and can also write Parquet data. It works as a local query engine over data lake-style files and is often faster and more memory-efficient. |
| DuckDB | Interactive exploration with SQL | Helpful for users who prefer exploring data with SQL statements such as SELECT, WHERE, and LIMIT. DuckDB supports iterative querying in notebooks and scripts, similar to a local SQL IDE that can run offline. It can replace many Pandas-based exploration steps for SQL-oriented users. |
| DuckDB | Memory-bound computations | Well suited to operations that are very memory-heavy in Pandas, including large group-bys with many categories, multiple big joins, or wide tables. DuckDB uses streaming aggregation and disk-spilling join strategies, making workloads possible that might exceed Pandas memory limits. |
| DuckDB | Production and data engineering pipelines | Useful for recurring analytics in scripts or applications that process large data. It can be more stable and predictable at scale than Pandas, which may fail as data volumes grow. DuckDB fits nightly aggregation jobs, batch ETL, and embedded analytics, and it can be used from Python, R, Java, and other environments. |
| Pandas | Small datasets and quick scripts | Best for tiny or small datasets, such as a CSV with 5,000 rows, and quick one-off analyses. Loading and manipulating data in Pandas is immediate, and writing a short Python snippet is often faster than writing SQL, especially for users fluent in Python. |
| Pandas | Complex data manipulation and feature engineering | Best for tasks that need arbitrary Python logic, external libraries, or iterative algorithms. It is easy to create new columns using loops, vectorized operations, or apply. Pandas is useful for custom text processing, advanced ranking logic, specialized pivots, and detailed feature engineering. |
| Pandas | Ecosystem and library compatibility | Many tools expect Pandas DataFrames or NumPy arrays, including scikit-learn, Matplotlib, and Seaborn. Even if DuckDB prepares the data, the workflow often ends in Pandas for modeling or visualization. Staying in Pandas can be simpler when every step depends on it. |
| Pandas | Real-time interactive notebook edits | Strong for interactive data cleaning and inspection with commands such as df.head(), boolean filters, and manual edits like df.iloc[5, 3] = None. Because the data is directly editable in memory, ad-hoc adjustments can be more convenient than re-running SQL queries. |
| Pandas | Team habits and existing codebases | Useful when there is already a large Pandas codebase or when a team is comfortable with Pandas but not SQL. Rewriting everything in DuckDB has a cost, so gradual adoption is usually better. Pandas remains valuable because of its maturity and familiarity. |
Pandas remains strong in terms of flexibility and for the broad range of small and medium-sized workloads where its overhead is not noticeable and its native Python integration is helpful. DuckDB is not meant to replace that completely. Instead, it is designed for situations where Pandas struggles, such as large datasets, demanding analytical queries, and SQL-centered workflows.
Limitations and Considerations
Although DuckDB has clear advantages, it is not a universal solution. Important limitations include:
- Concurrency: DuckDB uses a single-writer model. It supports multiple concurrent readers, but only one write transaction at a time. For multi-writer workloads or heavy OLTP use cases, a server-based database such as PostgreSQL is a better fit.
- Ecosystem maturity: Pandas has existed since 2008 and has a mature ecosystem of visualization and machine learning libraries around it. DuckDB’s ecosystem is growing quickly, but some higher-level convenience features available in Pandas are still missing.
- Learning curve: Pandas syntax may feel more natural to analysts who do not know SQL. On the other hand, SQL offers long-term stability, portability, and broad familiarity across many systems.
- No built-in plotting: DuckDB is a query engine. For visualizations, results still need to be used with Pandas, Matplotlib, or another plotting library.
- Experimental functionality: DuckDB is under active development, and some features may still be experimental, including certain extensions or support for newer file formats. These features can change between versions.
Conclusion
DuckDB is not intended to eliminate Pandas. Instead, it extends what Pandas can do. As data workloads grow beyond a single in-memory DataFrame, DuckDB brings a modern, high-performance analytical engine directly into the Python workflow. It can read Parquet files, push filters down to storage, use all available CPU cores, and work with datasets much larger than RAM. This is possible without deploying a server, changing the working environment, or dealing with the overhead of distributed databases.
Pandas continues to be a strong choice for daily exploration, prototyping, and fast feature engineering that benefits from Python’s flexibility. But when a workload involves scanning millions of rows, joining large Parquet files, or working with Arrow-backed datasets, DuckDB provides speed and scalability that Pandas alone does not offer.
DuckDB gives data scientists the performance of a local analytical database inside a notebook while extending workflows built around Pandas, Polars, Arrow, and dbt.
In practice, the future is not “DuckDB versus Pandas.” It is DuckDB and Pandas together. Each tool has a clear role in the modern analytics stack. Combined, they create a workflow that is fast, flexible, and suitable for the scale of today’s data.


