Skip to content

FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#478

Open
subrata-ms wants to merge 2 commits intomainfrom
subrata-ms/cp1252_encoding
Open

FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#478
subrata-ms wants to merge 2 commits intomainfrom
subrata-ms/cp1252_encoding

Conversation

@subrata-ms
Copy link
Contributor

@subrata-ms subrata-ms commented Mar 19, 2026

Work Item / Issue Reference

AB#43177

GitHub Issue: #468


Summary

This pull request significantly improves cross-platform handling of character encoding for VARCHAR and related types in the mssql_python ODBC bindings, ensuring consistent Unicode support and simplifying buffer allocation logic. The changes also update test cases to reflect expanded Unicode and encoding coverage.

Platform-specific encoding and buffer allocation:

  • Windows vs. Linux/macOS VARCHAR handling: On Windows, VARCHAR columns are now fetched as SQL_C_WCHAR (UTF-16), matching NVARCHAR handling and avoiding code page guessing; on Linux/macOS, VARCHAR columns are fetched as SQL_C_CHAR (UTF-8), with buffer sizes adjusted for potential multi-byte expansion. [1] [2] [3] [4] [5] [6] [7] [8]
  • Processor selection: On Windows, VARCHAR columns now use the NVARCHAR processor (ProcessWChar), ensuring correct decoding and consistency.

LOB and buffer management improvements:

  • LOB streaming triggers: Logic for identifying LOB columns for streaming is clarified and made consistent across FetchMany_wrap and FetchAll_wrap. [1] [2]
  • Buffer allocation comments and logic: Comments and allocation formulas are updated for clarity and correctness, especially regarding potential over-allocation and streaming fallback for large columns. [1] [2]

Unicode and encoding test coverage:

  • Expanded test suite: The encoding/decoding test suite is updated to include more comprehensive Unicode and encoding scenarios, increasing the total number of tests from 131 to 154.
  • Improved test table creation: Test cases now consistently use multi-line string formatting for table creation, supporting Unicode and edge-case data types. [1] [2] [3] [4] [5] [6] [7]

Minor code cleanup:

  • Formatting and error handling: Minor formatting changes and improved error handling for decoding failures and buffer truncation, enhancing code readability and maintainability. [1] [2]

These updates ensure robust, consistent Unicode support across platforms and provide a solid foundation for further improvements to encoding/decoding and LOB streaming in the driver.

@github-actions github-actions bot added the pr-size: medium Moderate update size label Mar 19, 2026
@github-actions
Copy link

github-actions bot commented Mar 19, 2026

📊 Code Coverage Report

🔥 Diff Coverage

100%


🎯 Overall Coverage

77%


📈 Total Lines Covered: 5665 out of 7336
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/ddbc_bindings.cpp (100%)

Summary

  • Total: 7 lines
  • Missing: 0 lines
  • Coverage: 100%

📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.logger_bridge.hpp: 58.8%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 67.8%
mssql_python.pybind.ddbc_bindings.cpp: 69.7%
mssql_python.row.py: 70.5%
mssql_python.pybind.connection.connection.cpp: 75.3%
mssql_python.__init__.py: 77.1%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

@subrata-ms subrata-ms marked this pull request as ready for review March 19, 2026 09:07
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses cross-platform inconsistencies when fetching CP1252-collated VARCHAR data by changing the ODBC C-type used for VARCHAR retrieval on Windows, aiming to ensure consistent Unicode (str) results across Windows and Linux/macOS.

Changes:

  • On Windows, fetch/bind VARCHAR (and related SQL_CHAR types) as SQL_C_WCHAR and process via the wide-char path to avoid code page guessing and bytes fallback.
  • On Linux/macOS, clarify/standardize SQL_C_CHAR buffer sizing assumptions for UTF-8 expansion.
  • Add a new CP1252 regression test section covering problematic bytes and multiple fetch paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
mssql_python/pybind/ddbc_bindings.cpp Updates SQLGetData_wrap and batch fetch binding/processing to use SQL_C_WCHAR for VARCHAR on Windows; adjusts buffer sizing logic and LOB detection formatting.
tests/test_013_encoding_decoding.py Adds a new CP1252 VARCHAR consistency test suite intended to ensure VARCHAR returns str across platforms and fetch methods.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +3068 to +3070
"returning NULL",
i);
row.append(py::none());
assert isinstance(value, str), (
f"Expected str for CHAR({byte_val}) ({description}) but got "
f"{type(value).__name__}: {value!r} (platform={sys.platform})."
)
assert isinstance(value, str), (
f"{fetch_method}: expected str but got {type(value).__name__}: "
f"{value!r} (platform={sys.platform})."
)
f"Expected str but got {type(value).__name__}: {value!r}. "
f"Embedded byte 0xAD was not decoded (platform={sys.platform})."
)
assert "hello" in value and "world" in value
// Note: wcharEncoding parameter is reserved for future use
// Currently WCHAR data always uses UTF-16LE for Windows compatibility
(void)wcharEncoding; // Suppress unused parameter warning
#if !defined(__APPLE__) && !defined(__linux__)
Comment on lines 4102 to +4106
break;
case SQL_SS_UDT:
rowSize += (static_cast<SQLLEN>(columnSize) == SQL_NO_TOTAL || columnSize == 0)
? SQL_MAX_LOB_SIZE : columnSize;
? SQL_MAX_LOB_SIZE
: columnSize;
Copy link
Contributor

@ffelixg ffelixg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, love to see this!

The encoding used by the odbc driver is actually not directly dependent on the operating system, instead it checks which locale is being used and encodes according to that. The difference between the operating systems is that windows uses cp1252 by default and linux uses utf-8. However, you can change the locale and make the driver return data in some other encoding. For example the current test passes on Linux with mssql_python 1.4.0, which shows that you can provoke the same issue on Linux:

def test_locale_varchar_decode_iso885915():
    assert locale.getlocale() == ('en_US', 'UTF-8')

    # important: change locale BEFORE connecting
    locale.setlocale(locale.LC_ALL, 'en_US.iso885915')
    connection = connect()
    cursor = connection.cursor()

    (val,) = cursor.execute("SELECT '€'").fetchone()
    # without changing locale, we would get val == '€'
    assert val == b'\xa4', val

    cursor.close()
    connection.close()

If we really want to continue to use SQL_C_CHAR for fetching varchar data, even if just on Linux, we would need to note down the current locale when connecting and only fetch SQL_C_CHAR if it is UTF-8 and SQL_C_WCHAR otherwise. This is made more complicated by connection pooling though. Apparently some pooling is already active by default, as followup connections still use the same locale, even if I change it beforehand.

On the flip side, if you do this, you can get rid of the platform specific code, because windows also allows changing the locale to UTF-8, which would enable SQL_C_CHAR there as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-size: medium Moderate update size

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants