Skip to content

Fix Lance writer to emit Arrow FixedSizeList for array columns to enable native vector search#2707

Open
leekeiabstraction wants to merge 2 commits intoapache:mainfrom
leekeiabstraction:lance-writer-emit-arrow-fixedSizeList
Open

Fix Lance writer to emit Arrow FixedSizeList for array columns to enable native vector search#2707
leekeiabstraction wants to merge 2 commits intoapache:mainfrom
leekeiabstraction:lance-writer-emit-arrow-fixedSizeList

Conversation

@leekeiabstraction
Copy link
Contributor

Purpose

Linked issue: close #2706

Fix Lance writer to emit Arrow FixedSizeList for array columns to enable native vector search.

Querying with pylance native vector search fails because pylance expects embedding column to be of fixedSizeList type instead of variable size list. This PR fixes the conversion from Fluss' arrow data to Lance's arrow data by using FixedSizeList if <column>.arrow.fixed-size-list.size is defined, similar to how Spark SQL does it ref: https://lance.org/integrations/spark/operations/ddl/create-table/#creating-large-string-columns

Tests

Added unit test

  • Tested manually that pylance successfully performed a vector search on the tiered data

@leekeiabstraction
Copy link
Contributor Author

@XuQianJin-Stars Appreciate if you can review this PR 🙏

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes the Lance writer to emit Arrow FixedSizeList types for array columns, enabling native vector search in PyLance. Previously, array columns were always written as variable-length Arrow Lists, preventing Lance's vector search functionality from working. The fix introduces a configuration property pattern <column>.arrow.fixed-size-list.size that allows users to specify when array columns should be written as FixedSizeList.

Changes:

  • Added support for configuring FixedSizeList via table properties in LanceArrowUtils
  • Implemented conversion logic from ListVector to FixedSizeListVector in ArrowDataConverter
  • Updated table creation in LanceLakeCatalog and tests to pass custom properties through the schema conversion pipeline

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
LanceArrowUtils.java Adds configuration property support and logic to create FixedSizeList ArrowType when the property is set
ArrowDataConverter.java Implements new method to copy data from shaded ListVector to non-shaded FixedSizeListVector
LanceLakeCatalog.java Passes custom properties to toArrowSchema to enable FixedSizeList configuration
LanceTieringTest.java Adds comprehensive test coverage with embedding column using FixedSizeList type

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if (logicalType instanceof ArrayType && tableProperties != null) {
String sizeStr = tableProperties.get(fieldName + FIXED_SIZE_LIST_SIZE_SUFFIX);
if (sizeStr != null) {
arrowType = new ArrowType.FixedSizeList(Integer.parseInt(sizeStr));
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Integer.parseInt call can throw NumberFormatException if the property value is not a valid integer. This could happen if a user provides an invalid configuration. Consider adding proper error handling with a descriptive error message that includes the invalid value and expected format. For example, wrap this in a try-catch block and throw an IllegalArgumentException with a message like "Invalid fixed-size list size for column 'columnName': 'sizeStr'. Expected a positive integer."

Copilot uses AI. Check for mistakes.
if (logicalType instanceof ArrayType && tableProperties != null) {
String sizeStr = tableProperties.get(fieldName + FIXED_SIZE_LIST_SIZE_SUFFIX);
if (sizeStr != null) {
arrowType = new ArrowType.FixedSizeList(Integer.parseInt(sizeStr));
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no validation to ensure that the parsed size is positive. Arrow's FixedSizeList requires a positive size value. If a user provides zero or a negative value, it could lead to runtime errors when creating Arrow vectors. Add validation after parsing to ensure the size is positive, throwing an IllegalArgumentException if it's not.

Suggested change
arrowType = new ArrowType.FixedSizeList(Integer.parseInt(sizeStr));
int size = Integer.parseInt(sizeStr);
if (size <= 0) {
throw new IllegalArgumentException(
"FixedSizeList size for field '" + fieldName + "' must be positive, but was: " + size);
}
arrowType = new ArrowType.FixedSizeList(size);

Copilot uses AI. Check for mistakes.
Comment on lines +155 to +192
private static void copyListToFixedSizeListVectorData(
org.apache.fluss.shaded.arrow.org.apache.arrow.vector.complex.ListVector
shadedListVector,
FixedSizeListVector nonShadedFixedSizeListVector) {

// Copy the child data vector recursively (e.g., the float values)
org.apache.fluss.shaded.arrow.org.apache.arrow.vector.FieldVector shadedDataVector =
shadedListVector.getDataVector();
FieldVector nonShadedDataVector = nonShadedFixedSizeListVector.getDataVector();

if (shadedDataVector != null && nonShadedDataVector != null) {
copyVectorData(shadedDataVector, nonShadedDataVector);
}

// FixedSizeListVector only has a validity buffer (no offset buffer).
// Copy the validity buffer from the shaded ListVector.
List<org.apache.fluss.shaded.arrow.org.apache.arrow.memory.ArrowBuf> shadedBuffers =
shadedListVector.getFieldBuffers();
List<ArrowBuf> nonShadedBuffers = nonShadedFixedSizeListVector.getFieldBuffers();

// Both ListVector and FixedSizeListVector have validity as their first buffer
if (!shadedBuffers.isEmpty() && !nonShadedBuffers.isEmpty()) {
org.apache.fluss.shaded.arrow.org.apache.arrow.memory.ArrowBuf shadedValidityBuf =
shadedBuffers.get(0);
ArrowBuf nonShadedValidityBuf = nonShadedBuffers.get(0);

long size = Math.min(shadedValidityBuf.capacity(), nonShadedValidityBuf.capacity());
if (size > 0) {
ByteBuffer srcBuffer = shadedValidityBuf.nioBuffer(0, (int) size);
srcBuffer.position(0);
srcBuffer.limit((int) Math.min(size, Integer.MAX_VALUE));
nonShadedValidityBuf.setBytes(0, srcBuffer);
}
}

int valueCount = shadedListVector.getValueCount();
nonShadedFixedSizeListVector.setValueCount(valueCount);
}
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyListToFixedSizeListVectorData method does not validate that the source ListVector data actually contains arrays of the expected fixed size. If the actual data has arrays of varying lengths or a different size than what the FixedSizeListVector schema expects, this could lead to data corruption or runtime errors. Consider adding validation to check that each list entry in shadedListVector has the correct size matching the FixedSizeList's configured size before copying.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[lake/lance] Lance writer should emit Arrow FixedSizeList for array columns to enable native vector search

2 participants