Introducing Apache Arrow Support in mssql-python

Reviewed by Sumit Sarabhai

Fetching a million rows from SQL Server into a Polars DataFrame used to mean a million Python objects, a million GC allocations, and then throwing it all away to build a DataFrame. Not anymore. mssql-python now supports fetching SQL Server data directly as Apache Arrow structures - a faster and more memory-efficient path for anyone working with SQL Server data in Polars, Pandas, DuckDB, or any other Arrow-native library. This feature was contributed by community developer Felix Graßl (@ffelixg), and we are thrilled to ship it.

[alert type="note" heading="Key Terms"]API (Application Programming Interface): a source-code contract that defines how to call a function or library. ABI (Application Binary Interface): a binary-level contract that specifies how compiled code is laid out in memory. Two programs built in different languages can share an ABI and exchange data directly - no serialization is needed. Arrow C Data Interface: Apache Arrow's ABI specification - the standard that makes zero-copy data exchange between languages possible.[/alert]

What Is Apache Arrow?

The key insight behind Apache Arrow is zero-copy language interoperability. Arrow defines a stable shared-memory layout - the Arrow C Data Interface, a cross-language ABI - that any language can produce or consume by exchanging a pointer, with no serialization, no copies, and no re-parsing. A C++ database driver and a Python DataFrame library can work on the exact same memory without either one knowing about the other.

Built on top of that, Arrow uses a columnar in-memory format: instead of representing a table as a list of rows, each row a collection of Python objects, Arrow stores all values for a column contiguously in a typed buffer. Nulls are tracked in a compact bitmap rather than per-cell None objects. For a database driver, this means the entire fetch loop can run in C++ and write values directly into Arrow buffers - no Python object creation per row, no garbage-collector pressure. The DataFrame library receives a pointer to that memory and can begin operating on it immediately. Crucially, subsequent operations - filters, joins, aggregations - also work in-place on those same buffers. A Polars pipeline reading from mssql-python never needs to materialize intermediate Python objects at any stage, making Arrow the right foundation for high-throughput data processing pipelines.

For users of mssql-python, this translates into four concrete benefits (a short sketch of the columnar layout follows the list):
  • Speed: The columnar fetch path avoids Python object creation per row, which should make fetching noticeably faster for many SQL Server types - especially temporal types like DATETIME and DATETIMEOFFSET, where Python-side per-value conversions are eliminated entirely.
  • Lower memory usage: A column of one million integers is a single contiguous C array, not a million individual Python objects.
  • Seamless interoperability: Polars, Pandas (via ArrowDtype), DuckDB, Hugging Face datasets, and many other libraries all speak Arrow natively, so hand-off between mssql-python and those tools is zero-copy.
  • Purely additive: Your existing fetchone, fetchmany, and fetchall code is completely unaffected. You opt in only where you need it.
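
To make the columnar layout concrete, here is a minimal pyarrow sketch (illustrative only; pyarrow is assumed to be installed):
import pyarrow as pa

# A column of integers with a null: Arrow stores one contiguous,
# typed data buffer plus a compact validity bitmap - no per-cell
# Python objects.
col = pa.array([1, 2, None, 4], type=pa.int64())
validity, data = col.buffers()   # primitive arrays: [bitmap, values]
print(col.null_count)            # -> 1
print(validity.size, data.size)  # bitmap bytes, 8 bytes per int64 (padded)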
[alert type="warning" heading=""]Try it here: pip install mssql-python Calling all Python + SQL developers! We invite the community to try out mssql-python and help us shape the future of high-performance SQL Server connectivity in Python.![/alert]

The Arrow Fetch APIs

Three APIs have been added to the Cursor object.

1. cursor.arrow_batch(batch_size=8192) → pyarrow.RecordBatch

Fetches the next batch of up to batch_size rows as an Arrow RecordBatch and advances the cursor. RecordBatches are the building block for higher-level Arrow data types such as tables and the batch-reader interface.
import mssql_python

conn   = mssql_python.connect(conn_str)  # conn_str: your connection string
cursor = conn.cursor()
cursor.execute("SELECT * FROM SalesData")

partial_data = cursor.arrow_batch(batch_size=50000)
process(partial_data)   # pyarrow.RecordBatch
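
A RecordBatch composes directly into higher-level structures. For example, batches you have collected can be combined into a single Table with pyarrow (a sketch; pyarrow is assumed to be imported as pa):
import pyarrow as pa

# Combine one or more RecordBatches into a Table; the Table
# references the batches' column buffers rather than copying them.
table = pa.Table.from_batches([partial_data])
print(table.num_rows, table.schema)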

2. cursor.arrow(batch_size=8192) → pyarrow.Table

Eagerly fetches the entire result set into a single Arrow Table. This is the simplest path and works well for analytics queries where the result fits comfortably in memory. However, because it materializes the full result set at once, it can cause high peak RAM usage or out-of-memory errors on very large or unbounded queries. For large exports or ETL workloads, prefer cursor.arrow_reader() (streaming, fetches lazily) or cursor.arrow_batch() (fetch one batch at a time). In both cases, batch_size is a tuning knob: larger batches improve throughput but increase peak memory; smaller batches reduce memory at the cost of slightly more per-batch overhead.
cursor.execute("SELECT customer_id, order_date, amount FROM Orders")
table = cursor.arrow()

# Zero-copy conversion to Polars
import polars as pl
df = pl.DataFrame(table)

# Or to Pandas with Arrow-backed dtypes
import pandas as pd
df = table.to_pandas(types_mapper=pd.ArrowDtype)

3. cursor.arrow_reader(batch_size=8192) → pyarrow.RecordBatchReader

Returns a lazy RecordBatchReader. Batches are fetched only when the reader is iterated, enabling streaming over very large result sets. RecordBatchReader is also accepted directly by DuckDB, Lance, and other Arrow-native libraries.
cursor.execute("SELECT * FROM LargeEventLog")
reader = cursor.arrow_reader(batch_size=100000)

for batch in reader:
    sink.write(batch)   # sink: any batch-wise consumer, e.g. a Parquet writer
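
Because a RecordBatchReader is accepted directly, you can also hand the stream to DuckDB and aggregate it without materializing rows in Python. A sketch, assuming the duckdb package is installed; DuckDB's replacement scan resolves the local variable reader:
import duckdb

cursor.execute("SELECT * FROM LargeEventLog")
reader = cursor.arrow_reader(batch_size=100000)

# DuckDB consumes the Arrow stream batch by batch.
print(duckdb.sql("SELECT COUNT(*) AS n FROM reader").fetchall())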

Testing

We validated the Arrow fetch path against the standard Python row fetch path across a range of SQL Server types - numeric, temporal, string, and UUID - for both single-column and wide (20-column) tables. The full test script and results are available in the Resources section; we encourage you to run them on your own hardware to see the difference for your workload.

In our testing, the Arrow path was consistently faster for most SQL Server types. Temporal types showed the largest gains: types like DATETIME and DATETIMEOFFSET benefit significantly because the Arrow path handles timezone normalization and value encoding entirely in C++, eliminating per-value Python-side conversions. DATETIMEOFFSET in particular showed some of the most pronounced speedups we observed. A quick way to compare the two paths on your own table is sketched below.
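
A minimal comparison sketch (table name illustrative; re-run each measurement several times for stable numbers):
import time

cursor.execute("SELECT * FROM SalesData")
t0 = time.perf_counter()
rows = cursor.fetchall()       # classic row-based fetch
t_rows = time.perf_counter() - t0

cursor.execute("SELECT * FROM SalesData")
t0 = time.perf_counter()
table = cursor.arrow()         # Arrow columnar fetch
t_arrow = time.perf_counter() - t0

print(f"fetchall: {t_rows:.3f}s  arrow: {t_arrow:.3f}s")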
JSON Serialization Bonus

The Arrow path can also benefit API workloads that serialize results to JSON. Instead of fetchall() + json.dumps(), fetch via cursor.arrow(), wrap in a Polars DataFrame, and call df.write_json() - the entire pipeline bypasses Python objects and can be noticeably faster, especially for types like DATETIMEOFFSET. A sketch of that pipeline follows the note below.

[alert type="warning" heading="NVARCHAR on Linux"]Our Linux tests show longer fetch times for NVARCHAR due to the current UTF-16 → UTF-8 conversion path. On Windows, NVARCHAR fetches consistently faster with Arrow. A fix is targeted for a follow-up release.[/alert]
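
A minimal sketch of the JSON pipeline (query and column names are illustrative; polars is assumed to be installed):
import polars as pl

cursor.execute("SELECT order_id, placed_at, amount FROM Orders")

# Arrow -> Polars -> JSON without per-row Python objects.
df = pl.DataFrame(cursor.arrow())
payload = df.write_json()   # JSON string, ready for an HTTP response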

Getting Started

Install or upgrade mssql-python, then add pyarrow:
pip install mssql-python pyarrow
For IDE type hints and static type checking:
pip install pyarrow-stubs
Then swap in cursor.arrow() wherever you would have called fetchall() and converted to a DataFrame. Your existing code is completely unaffected - Arrow support is purely additive.
import mssql_python
import polars as pl

conn   = mssql_python.connect(conn_str)
cursor = conn.cursor()

cursor.execute("SELECT * FROM dbo.LargeSalesTable")
df = pl.DataFrame(cursor.arrow())

print(df.describe())

What's Next

One known area we are actively working to improve is NVARCHAR performance on Linux. SQL Server returns Unicode string data in UTF-16 encoding, which the driver must convert to UTF-8 before handing it to Arrow. On Windows this conversion uses a native system API that is very fast, but the current Linux code path goes through a slower chain of intermediate steps. As a result, NVARCHAR columns on Linux show longer fetch times compared to the Python fetch path - the opposite of every other type. A fix using a more efficient codec is in progress for a follow-up release. On Windows, our tests show NVARCHAR fetching noticeably faster with Arrow, and Linux will follow.

A Note of Thanks

This feature was contributed by Felix Graßl (@ffelixg), the author of zodbc, his own Zig-based ODBC driver. His deep familiarity with ODBC and Arrow made this a thorough, well-tested contribution covering both Linux and Windows, and all three fetch patterns. We are very grateful for his work and the care he brought to this feature.

Resources

Try It and Share Your Feedback! 

We invite you to: 
  1. Check out the mssql-python driver and integrate it into your projects.
  2. Share your thoughts: Open issues, suggest features, and contribute to the project. 
  3. Join the conversation: GitHub Discussions | SQL Server Tech Community
[alert type="tip" heading="Use Python Driver with Free Azure SQL Database"]You can use the Python Driver with the free version of Azure SQL Database! Deploy Azure SQL Database for free Deploy Azure SQL Managed Instance for free Perfect for testing, development, or learning scenarios without incurring costs.[/alert]
[alert type="note" heading=""]Have questions or feedback? Open an issue or discussion on GitHub, or reach out to the team at mssql-python@microsoft.com[/alert]
