Skip to content

GH-531: Add parquet flatbuf schema#544

Open
alkis wants to merge 4 commits intoapache:masterfrom
alkis:patch-2
Open

GH-531: Add parquet flatbuf schema#544
alkis wants to merge 4 commits intoapache:masterfrom
alkis:patch-2

Conversation

@alkis
Copy link
Contributor

@alkis alkis commented Dec 12, 2025

Rationale for this change

Improve wide table support.

What changes are included in this PR?

Add parquet flatbuf schema.

Do these changes have PoC implementations?

apache/arrow#48431

codec: CompressionCodec;
num_values: long = null; // only present if not equal to rg.num_rows
total_uncompressed_size: long;
total_compressed_size: long;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to keep total unencoded size here which I think is generally useful? But I suppose it can be added after?

dictionary_page_offset: long = null;
statistics: Statistics;
is_fully_dict_encoded: bool;
bloom_filter_offset: long = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we this be made a struct/value to make the bloom filter info more self contained?

row_groups: [RowGroup];
kv: [KV];
created_by: string;
// column_orders: [ColumnOrder]; // moved to SchemaElement
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this row for now?

Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need to add an apache header here, and CI to make sure this compiles?

@Jiayi-Wang-db
Copy link

Hi @rok and @emkornfield , could you help to have another look of this pr?

min_lo4: uint;
min_lo8: ulong;
min_hi8: ulong;
min_len: byte = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
min_len: byte = null;
min_len: int = null;

Original suffix lenght could exceed int8 range of byte type.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rok , the previous comment is outdated. max_len and min_len store the truncated suffix length, which means the maximum value is 16. We use negative numbers to represent inexact values. I have updated the comment and the example, please take a look.

max_lo4: uint;
max_lo8: ulong;
max_hi8: ulong;
max_len: byte = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above:

Suggested change
max_len: byte = null;
max_len: int = null;


/** repetition of the field. The root of the schema does not have a repetition_type.
* All other nodes must have one */
repetition_type: FieldRepetitionType;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To allow for root to not have repetition type. In thrift we have optional:

3: optional FieldRepetitionType repetition_type;

Suggested change
repetition_type: FieldRepetitionType;
repetition_type: FieldRepetitionType = null;

@rok
Copy link
Member

rok commented Mar 11, 2026

@Jiayi-Wang-db Feel free to resolve comments you feel were addressed to make this more readable?

@rok
Copy link
Member

rok commented Mar 11, 2026

Is parquet3.bfbs required?

@Jiayi-Wang-db
Copy link

Is parquet3.bfbs required?

Not at all. I pushed it accidentally. Removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants