Fix Lance writer to emit Arrow FixedSizeList for array columns to enable native vector search#2707
Conversation
…ble native vector search
|
@XuQianJin-Stars Appreciate if you can review this PR 🙏 |
There was a problem hiding this comment.
Pull request overview
This PR fixes the Lance writer to emit Arrow FixedSizeList types for array columns, enabling native vector search in PyLance. Previously, array columns were always written as variable-length Arrow Lists, preventing Lance's vector search functionality from working. The fix introduces a configuration property pattern <column>.arrow.fixed-size-list.size that allows users to specify when array columns should be written as FixedSizeList.
Changes:
- Added support for configuring FixedSizeList via table properties in
LanceArrowUtils - Implemented conversion logic from ListVector to FixedSizeListVector in
ArrowDataConverter - Updated table creation in
LanceLakeCatalogand tests to pass custom properties through the schema conversion pipeline
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| LanceArrowUtils.java | Adds configuration property support and logic to create FixedSizeList ArrowType when the property is set |
| ArrowDataConverter.java | Implements new method to copy data from shaded ListVector to non-shaded FixedSizeListVector |
| LanceLakeCatalog.java | Passes custom properties to toArrowSchema to enable FixedSizeList configuration |
| LanceTieringTest.java | Adds comprehensive test coverage with embedding column using FixedSizeList type |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (logicalType instanceof ArrayType && tableProperties != null) { | ||
| String sizeStr = tableProperties.get(fieldName + FIXED_SIZE_LIST_SIZE_SUFFIX); | ||
| if (sizeStr != null) { | ||
| arrowType = new ArrowType.FixedSizeList(Integer.parseInt(sizeStr)); |
There was a problem hiding this comment.
The Integer.parseInt call can throw NumberFormatException if the property value is not a valid integer. This could happen if a user provides an invalid configuration. Consider adding proper error handling with a descriptive error message that includes the invalid value and expected format. For example, wrap this in a try-catch block and throw an IllegalArgumentException with a message like "Invalid fixed-size list size for column 'columnName': 'sizeStr'. Expected a positive integer."
| if (logicalType instanceof ArrayType && tableProperties != null) { | ||
| String sizeStr = tableProperties.get(fieldName + FIXED_SIZE_LIST_SIZE_SUFFIX); | ||
| if (sizeStr != null) { | ||
| arrowType = new ArrowType.FixedSizeList(Integer.parseInt(sizeStr)); |
There was a problem hiding this comment.
There is no validation to ensure that the parsed size is positive. Arrow's FixedSizeList requires a positive size value. If a user provides zero or a negative value, it could lead to runtime errors when creating Arrow vectors. Add validation after parsing to ensure the size is positive, throwing an IllegalArgumentException if it's not.
| arrowType = new ArrowType.FixedSizeList(Integer.parseInt(sizeStr)); | |
| int size = Integer.parseInt(sizeStr); | |
| if (size <= 0) { | |
| throw new IllegalArgumentException( | |
| "FixedSizeList size for field '" + fieldName + "' must be positive, but was: " + size); | |
| } | |
| arrowType = new ArrowType.FixedSizeList(size); |
| private static void copyListToFixedSizeListVectorData( | ||
| org.apache.fluss.shaded.arrow.org.apache.arrow.vector.complex.ListVector | ||
| shadedListVector, | ||
| FixedSizeListVector nonShadedFixedSizeListVector) { | ||
|
|
||
| // Copy the child data vector recursively (e.g., the float values) | ||
| org.apache.fluss.shaded.arrow.org.apache.arrow.vector.FieldVector shadedDataVector = | ||
| shadedListVector.getDataVector(); | ||
| FieldVector nonShadedDataVector = nonShadedFixedSizeListVector.getDataVector(); | ||
|
|
||
| if (shadedDataVector != null && nonShadedDataVector != null) { | ||
| copyVectorData(shadedDataVector, nonShadedDataVector); | ||
| } | ||
|
|
||
| // FixedSizeListVector only has a validity buffer (no offset buffer). | ||
| // Copy the validity buffer from the shaded ListVector. | ||
| List<org.apache.fluss.shaded.arrow.org.apache.arrow.memory.ArrowBuf> shadedBuffers = | ||
| shadedListVector.getFieldBuffers(); | ||
| List<ArrowBuf> nonShadedBuffers = nonShadedFixedSizeListVector.getFieldBuffers(); | ||
|
|
||
| // Both ListVector and FixedSizeListVector have validity as their first buffer | ||
| if (!shadedBuffers.isEmpty() && !nonShadedBuffers.isEmpty()) { | ||
| org.apache.fluss.shaded.arrow.org.apache.arrow.memory.ArrowBuf shadedValidityBuf = | ||
| shadedBuffers.get(0); | ||
| ArrowBuf nonShadedValidityBuf = nonShadedBuffers.get(0); | ||
|
|
||
| long size = Math.min(shadedValidityBuf.capacity(), nonShadedValidityBuf.capacity()); | ||
| if (size > 0) { | ||
| ByteBuffer srcBuffer = shadedValidityBuf.nioBuffer(0, (int) size); | ||
| srcBuffer.position(0); | ||
| srcBuffer.limit((int) Math.min(size, Integer.MAX_VALUE)); | ||
| nonShadedValidityBuf.setBytes(0, srcBuffer); | ||
| } | ||
| } | ||
|
|
||
| int valueCount = shadedListVector.getValueCount(); | ||
| nonShadedFixedSizeListVector.setValueCount(valueCount); | ||
| } |
There was a problem hiding this comment.
The copyListToFixedSizeListVectorData method does not validate that the source ListVector data actually contains arrays of the expected fixed size. If the actual data has arrays of varying lengths or a different size than what the FixedSizeListVector schema expects, this could lead to data corruption or runtime errors. Consider adding validation to check that each list entry in shadedListVector has the correct size matching the FixedSizeList's configured size before copying.
Purpose
Linked issue: close #2706
Fix Lance writer to emit Arrow FixedSizeList for array columns to enable native vector search.
Querying with pylance native vector search fails because pylance expects embedding column to be of fixedSizeList type instead of variable size list. This PR fixes the conversion from Fluss' arrow data to Lance's arrow data by using FixedSizeList if
<column>.arrow.fixed-size-list.sizeis defined, similar to how Spark SQL does it ref: https://lance.org/integrations/spark/operations/ddl/create-table/#creating-large-string-columnsTests
Added unit test