FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468 by subrata-ms · Pull Request #478 · microsoft/mssql-python

subrata-ms · 2026-03-19T05:46:01Z

Work Item / Issue Reference

AB#43177

GitHub Issue: #468

Summary

This pull request significantly improves cross-platform handling of character encoding for VARCHAR and related types in the mssql_python ODBC bindings, ensuring consistent Unicode support and simplifying buffer allocation logic. The changes also update test cases to reflect expanded Unicode and encoding coverage.

Platform-specific encoding and buffer allocation:

Windows vs. Linux/macOS VARCHAR handling: On Windows, VARCHAR columns are now fetched as SQL_C_WCHAR (UTF-16), matching NVARCHAR handling and avoiding code page guessing; on Linux/macOS, VARCHAR columns are fetched as SQL_C_CHAR (UTF-8), with buffer sizes adjusted for potential multi-byte expansion. [1] [2] [3] [4] [5] [6] [7] [8]
Processor selection: On Windows, VARCHAR columns now use the NVARCHAR processor (ProcessWChar), ensuring correct decoding and consistency.

LOB and buffer management improvements:

LOB streaming triggers: Logic for identifying LOB columns for streaming is clarified and made consistent across FetchMany_wrap and FetchAll_wrap. [1] [2]
Buffer allocation comments and logic: Comments and allocation formulas are updated for clarity and correctness, especially regarding potential over-allocation and streaming fallback for large columns. [1] [2]

Unicode and encoding test coverage:

Expanded test suite: The encoding/decoding test suite is updated to include more comprehensive Unicode and encoding scenarios, increasing the total number of tests from 131 to 154.
Improved test table creation: Test cases now consistently use multi-line string formatting for table creation, supporting Unicode and edge-case data types. [1] [2] [3] [4] [5] [6] [7]

Minor code cleanup:

Formatting and error handling: Minor formatting changes and improved error handling for decoding failures and buffer truncation, enhancing code readability and maintainability. [1] [2]

These updates ensure robust, consistent Unicode support across platforms and provide a solid foundation for further improvements to encoding/decoding and LOB streaming in the driver.

github-actions · 2026-03-19T05:51:11Z

📊 Code Coverage Report

🔥 Diff Coverage 100%	🎯 Overall Coverage 77%	📈 Total Lines Covered: `5665` out of `7336` 📁 Project: `mssql-python`

Diff Coverage

Diff: main...HEAD, staged and unstaged changes

mssql_python/pybind/ddbc_bindings.cpp (100%)

Summary

Total: 7 lines
Missing: 0 lines
Coverage: 100%

📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)

mssql_python.pybind.logger_bridge.hpp: 58.8%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 67.8%
mssql_python.pybind.ddbc_bindings.cpp: 69.7%
mssql_python.row.py: 70.5%
mssql_python.pybind.connection.connection.cpp: 75.3%
mssql_python.__init__.py: 77.1%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%

🔗 Quick Links

⚙️ Build Summary	📋 Coverage Details
View Azure DevOps Build	Browse Full Coverage Report

Copilot

Pull request overview

This PR addresses cross-platform inconsistencies when fetching CP1252-collated VARCHAR data by changing the ODBC C-type used for VARCHAR retrieval on Windows, aiming to ensure consistent Unicode (str) results across Windows and Linux/macOS.

Changes:

On Windows, fetch/bind VARCHAR (and related SQL_CHAR types) as SQL_C_WCHAR and process via the wide-char path to avoid code page guessing and bytes fallback.
On Linux/macOS, clarify/standardize SQL_C_CHAR buffer sizing assumptions for UTF-8 expansion.
Add a new CP1252 regression test section covering problematic bytes and multiple fetch paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File	Description
`mssql_python/pybind/ddbc_bindings.cpp`	Updates `SQLGetData_wrap` and batch fetch binding/processing to use `SQL_C_WCHAR` for `VARCHAR` on Windows; adjusts buffer sizing logic and LOB detection formatting.
`tests/test_013_encoding_decoding.py`	Adds a new CP1252 `VARCHAR` consistency test suite intended to ensure `VARCHAR` returns `str` across platforms and fetch methods.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

mssql_python/pybind/ddbc_bindings.cpp

+                                "returning NULL",
+                                i);
+                            row.append(py::none());


tests/test_013_encoding_decoding.py

+        assert isinstance(value, str), (
+            f"Expected str for CHAR({byte_val}) ({description}) but got "
+            f"{type(value).__name__}: {value!r} (platform={sys.platform})."
+        )


tests/test_013_encoding_decoding.py

+        assert isinstance(value, str), (
+            f"{fetch_method}: expected str but got {type(value).__name__}: "
+            f"{value!r} (platform={sys.platform})."
+        )


tests/test_013_encoding_decoding.py

+            f"Expected str but got {type(value).__name__}: {value!r}. "
+            f"Embedded byte 0xAD was not decoded (platform={sys.platform})."
+        )
+        assert "hello" in value and "world" in value


mssql_python/pybind/ddbc_bindings.cpp

    // Note: wcharEncoding parameter is reserved for future use
    // Currently WCHAR data always uses UTF-16LE for Windows compatibility
    (void)wcharEncoding;  // Suppress unused parameter warning
+#if !defined(__APPLE__) && !defined(__linux__)


mssql_python/pybind/ddbc_bindings.cpp

                break;
            case SQL_SS_UDT:
                rowSize += (static_cast<SQLLEN>(columnSize) == SQL_NO_TOTAL || columnSize == 0)
-                               ? SQL_MAX_LOB_SIZE : columnSize;
+                               ? SQL_MAX_LOB_SIZE
+                               : columnSize;


ffelixg

Hey, love to see this!

The encoding used by the odbc driver is actually not directly dependent on the operating system, instead it checks which locale is being used and encodes according to that. The difference between the operating systems is that windows uses cp1252 by default and linux uses utf-8. However, you can change the locale and make the driver return data in some other encoding. For example the current test passes on Linux with mssql_python 1.4.0, which shows that you can provoke the same issue on Linux:

def test_locale_varchar_decode_iso885915():
    assert locale.getlocale() == ('en_US', 'UTF-8')

    # important: change locale BEFORE connecting
    locale.setlocale(locale.LC_ALL, 'en_US.iso885915')
    connection = connect()
    cursor = connection.cursor()

    (val,) = cursor.execute("SELECT '€'").fetchone()
    # without changing locale, we would get val == '€'
    assert val == b'\xa4', val

    cursor.close()
    connection.close()

If we really want to continue to use SQL_C_CHAR for fetching varchar data, even if just on Linux, we would need to note down the current locale when connecting and only fetch SQL_C_CHAR if it is UTF-8 and SQL_C_WCHAR otherwise. This is made more complicated by connection pooling though. Apparently some pooling is already active by default, as followup connections still use the same locale, even if I change it beforehand.

On the flip side, if you do this, you can get rid of the platform specific code, because windows also allows changing the locale to UTF-8, which would enable SQL_C_CHAR there as well.

base logic change

12ad019

github-actions bot added the pr-size: medium Moderate update size label Mar 19, 2026

linting fix

d7e894b

subrata-ms marked this pull request as ready for review March 19, 2026 09:07

subrata-ms requested review from bewithgaurav, Copilot, gargsaumya, jahnvi480 and sumitmsft March 19, 2026 09:07

Copilot AI reviewed Mar 19, 2026

View reviewed changes

ffelixg reviewed Mar 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#478

FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#478
subrata-ms wants to merge 2 commits intomainfrom
subrata-ms/cp1252_encoding

subrata-ms commented Mar 19, 2026 •

edited by azure-boards bot

Loading

Uh oh!

github-actions bot commented Mar 19, 2026 •

edited

Loading

🔥 Diff Coverage

100%

🎯 Overall Coverage

77%

Uh oh!

Copilot AI left a comment

Uh oh!

ffelixg left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

subrata-ms commented Mar 19, 2026 • edited by azure-boards bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Work Item / Issue Reference

Summary

Uh oh!

github-actions bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📊 Code Coverage Report

🔥 Diff Coverage

100%

🎯 Overall Coverage

77%

Diff Coverage

Diff: main...HEAD, staged and unstaged changes

Summary

📋 Files Needing Attention

🔗 Quick Links

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

ffelixg left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

subrata-ms commented Mar 19, 2026 •

edited by azure-boards bot

Loading

github-actions bot commented Mar 19, 2026 •

edited

Loading