Fix incorrect handling of missing values in ToParentBlockJoinSortField (#15548)#15727
Fix incorrect handling of missing values in ToParentBlockJoinSortField (#15548)#15727gmarouli wants to merge 12 commits intoapache:mainfrom
Conversation
romseygeek
left a comment
There was a problem hiding this comment.
This looks great @gmarouli, thank you! I left some comments, but this all heading in the right direction.
lucene/join/src/java/org/apache/lucene/search/join/ToParentDocValues.java
Outdated
Show resolved
Hide resolved
| } | ||
|
|
||
| private static void validate(Type type, Object childMissingValue) { | ||
| switch (type) { |
There was a problem hiding this comment.
Nit: we can use switch expressions here (ie, case STRING -> validateMissingValue())
There was a problem hiding this comment.
Done, I was a bit reluctant to add the comment // $CASES-OMITTED$ because I thought it might be recommended, but I did find other similar places in the code that it was used so it might be ok, right?
lucene/join/src/java/org/apache/lucene/search/join/ToParentDocValues.java
Outdated
Show resolved
Hide resolved
| int prevParentDocID = parents.prevSetBit(docID - 1); | ||
| hasChildWithMissingValue = childrenWithValuesCount < getTotalChildrenCount(prevParentDocID); | ||
| if (hasChildWithMissingValue && missingWithChildWithoutValue) { | ||
| docID = nextDoc(); |
There was a problem hiding this comment.
Can we avoid recursion here?
There was a problem hiding this comment.
Yes, I used a loop with the seen flag in the condition. I changed the code, so seen is set to true only when we are certain that the parent doc id will be returned.
It used to be set as long as we found a child doc, so I added a dedicated local flag for that and that allowed us to use the seen flag in the new loop.
lucene/join/src/java/org/apache/lucene/search/join/ToParentDocValues.java
Outdated
Show resolved
Hide resolved
c3f4c8a to
b03918f
Compare
Thank you for all the guidance @romseygeek . I followed up to the comments and I am open to further changes if you see fit. |
romseygeek
left a comment
There was a problem hiding this comment.
Thanks @gmarouli! I left another comment.
| int nextParentDocID = parents.nextSetBit(childWithValues.docID()); | ||
| collector.reset(); | ||
| seen = true; | ||
| int nextParentDocID = parents.nextSetBit(childWithValues.docID()); |
There was a problem hiding this comment.
I think we need to check for NO_MORE_DOCS here now, given that it's possible that we iterate through parent docs until there are none left?
There was a problem hiding this comment.
Worth adding a test that exercises this scenario.
There was a problem hiding this comment.
Thank you for spotting it! It should have been in the loop. Fixed and test added, I added a new one dedicated to nextDoc because of its complexity, but if you think it would be better to add it to the existing one I can do it too.
| * targetParentDocID}. | ||
| * | ||
| * <p>A parent document is considered to have a value when at least one of its child documents has | ||
| * a value. When {@code skipParentsWithMissingChildValues} is {@code true}, the parent is |
There was a problem hiding this comment.
Reading this again, I'm not sure it's quite what we want? We need to return a top-level missing value when a parent has no child documents with sort values, rather than when it has some missing ones.
There was a problem hiding this comment.
We need to return a top-level missing value when a parent has no child documents with sort values, rather than when it has some missing ones.
I agree and this is what we do for the numeric doc values. But with binary doc values, we do not have a default value, we only have the flag, if missing values go first or last, so the missing value ends up missing in the parent too.
This why I approached it like this, if a parent has children with missing values and the children with a missing value is meant to "win" (this is why we say check if the sorting is reverse or not and if it's sortingMissingLast) then the parent is treated like it did not having children.
Am I missing some nuance here? (I wouldn't be surprised).
There was a problem hiding this comment.
Oh, right I see what you mean. I think it still needs to return true from advanceExact() though? And then return an appropriate value from ord().
There was a problem hiding this comment.
Ah, I see. What would be an appropriate value for ord(), my initial thought is -1 since any positive number could be confused with an actual ordinal, right?
There was a problem hiding this comment.
TermOrdValComparator has this snippet:
if (sortMissingLast) {
missingOrd = Integer.MAX_VALUE;
} else {
missingOrd = -1;
}
There was a problem hiding this comment.
I see, so we could use the same approach based on the child sortMissingLast. If parent and child have different reverse or sortMissingLast will be able to differentiate between a parent without children and a parent with a child with a missing value.
Thank you! This was very helpful.
romseygeek
left a comment
There was a problem hiding this comment.
Nearly there, I think! Sorry this is taking so long...
| @@ -88,6 +97,9 @@ public boolean advanceExact(int targetParentDocID) throws IOException { | |||
|
|
|||
| @Override | |||
| public int ordValue() { | |||
| if (iter.hasChildWithMissingValue() && (reverse == sortMissingLast)) { | |||
There was a problem hiding this comment.
I'm not sure this is correct, I think we need to check against the ord if there's a missing value? Like you do for NumDV
| ords[5] = 3; | ||
| ords[8] = 10; | ||
|
|
||
| final SortedDocValues naturalOrder = |
There was a problem hiding this comment.
Let's also add a test case for wrap(.... , true, true)
Description
Problem
ToParentBlockJoinSortFielddid not handle missing values for child documents. When a parent had a mix of children with a value and children without, the children without values were silently ignored from the min/max selection. This led to incorrect sort ordering -- the missing children should participate in the comparison using a configurable missing value, consistent with how the missing value inSortFieldworks.Additionally, the parent-level
missingValuewhich was not configurable after #15483, is reinstated and configurable via the constructor.Plan
missingValueconfiguration for the parent-level inToParentBlockJoinSortFieldmaking it again configurable.reverseandorderflags in the constructor ofToParentBlockJoinSortField.missingValueconfiguration for the child-level inToParentBlockJoinSortFieldand extendToParentDocValuesandBlockJoinSelectorto detect children with missing values and incorporate the configurable missing value into the min/max aggregation, with proper handling for all supported types.Fixes #15548