[DE-6999] Enable image deduplication within nucleus sdk#452
Open
See title.
Merge after the sister PR is deployed: https://github.com/scaleapi/scaleapi/pull/134861
Added some unit and integration tests. These tests create an image fixture dataset and a video fixture dataset. Both are made up of TEST_IMAGE_URLS so...
To get the integration tests to pass completely, I had to run some backfills. Specifically, for each `TEST_IMAGE_URL` from `TEST_IMAGE_URLS`, I had to:

- Backfill all occurrences of `(TEST_IMAGE_URL, 60ad648c85db770026e9bf77)` in the `nucleus.processing_upload` table. `(original_url, user_id)` is a composite index. The `user_id` for pytests, as defined in `helpers.ts`, is `NUCLEUS_PYTEST_USER_ID = "60ad648c85db770026e9bf77"`.
- Backfill all occurrences of `TEST_IMAGE_URL` in the `nucleus.processed_upload` table. `original_url` is the index.

See this comment for more info.
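The lookup keys for the two backfills above can be enumerated with a short sketch. The URLs below are placeholders (the real `TEST_IMAGE_URLS` are defined in the test helpers), and the function names are hypothetical; only the user ID and the index columns come from the description above.

```python
from typing import List, Tuple

# Value quoted in the PR description (defined in helpers.ts).
NUCLEUS_PYTEST_USER_ID = "60ad648c85db770026e9bf77"

# Placeholder URLs for illustration only; these are NOT the real TEST_IMAGE_URLS.
TEST_IMAGE_URLS = [
    "https://example.com/images/a.jpg",
    "https://example.com/images/b.jpg",
]


def processing_upload_keys(urls: List[str], user_id: str) -> List[Tuple[str, str]]:
    """Composite (original_url, user_id) keys to backfill in nucleus.processing_upload."""
    return [(url, user_id) for url in urls]


def processed_upload_keys(urls: List[str]) -> List[str]:
    """original_url keys to backfill in nucleus.processed_upload."""
    return list(urls)


if __name__ == "__main__":
    for key in processing_upload_keys(TEST_IMAGE_URLS, NUCLEUS_PYTEST_USER_ID):
        print("processing_upload:", key)
    for url in processed_upload_keys(TEST_IMAGE_URLS):
        print("processed_upload:", url)
```

The sketch only lists which rows each backfill has to touch; how the backfill itself is run is outside the scope of this PR description.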
Run `pip install -e .` to have this venv connect to the local version of the SDK (this repo). Make sure the venv is created with Python 3.11. Currently, some of the SDK code doesn't support newer versions of Python (I can look into upgrading it). Based on the client installation tests, it only supports Python 3.7-3.11.

Example test script of valid usage:
Output:
Examples of invalid usage:
Greptile Summary
This PR adds image deduplication support to the Nucleus Python SDK by introducing two new methods on the `Dataset` class: `deduplicate()` (by reference IDs or entire dataset) and `deduplicate_by_ids()` (by internal dataset item IDs). Both use perceptual hashing (pHash) with a configurable Hamming distance threshold (0-64).

- `DeduplicationResult` and `DeduplicationStats` dataclasses in a new `nucleus/deduplication.py` module for structured results
- `Dataset.deduplicate()` with optional `reference_ids` filtering and `Dataset.deduplicate_by_ids()` for direct item ID-based deduplication

Confidence Score: 4/5

`nucleus/dataset.py` has minor style considerations but no functional issues.

Important Files Changed

- `DeduplicationResult` and `DeduplicationStats` dataclasses. Simple, well-structured, no issues.
- `deduplicate()` and `deduplicate_by_ids()` methods with client-side validation and API calls. Well-documented with proper error handling. Uses local `threshold` for stats construction instead of server response value.
- `DeduplicationResult` and `DeduplicationStats` in `__all__` and imports from `deduplication` module. Correctly placed alphabetically.

Sequence Diagram
sequenceDiagram
    participant User
    participant Dataset
    participant NucleusClient
    participant API as Nucleus API
    User->>Dataset: deduplicate(threshold, reference_ids?)
    Dataset->>Dataset: Validate reference_ids not empty list
    Dataset->>NucleusClient: make_request(payload, "dataset/{id}/deduplicate")
    NucleusClient->>API: POST /dataset/{id}/deduplicate
    API-->>NucleusClient: {unique_item_ids, unique_reference_ids, stats}
    NucleusClient-->>Dataset: response dict
    Dataset-->>User: DeduplicationResult
    User->>Dataset: deduplicate_by_ids(threshold, dataset_item_ids)
    Dataset->>Dataset: Validate dataset_item_ids not empty
    Dataset->>NucleusClient: make_request(payload, "dataset/{id}/deduplicate")
    NucleusClient->>API: POST /dataset/{id}/deduplicate
    API-->>NucleusClient: {unique_item_ids, unique_reference_ids, stats}
    NucleusClient-->>Dataset: response dict
    Dataset-->>User: DeduplicationResult

Last reviewed commit: 9ec043a
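To make the threshold semantics in the summary concrete, here is a minimal sketch of Hamming-distance deduplication over 64-bit perceptual hashes. It assumes the hashes are already computed, treats two items as duplicates when their distance is less than or equal to the threshold, and uses a simple greedy pass. All function names are hypothetical, and the server-side algorithm behind `dataset/{id}/deduplicate` is not shown in this PR, so it may well differ.

```python
from typing import Dict, List

HASH_BITS = 64  # a 64-bit pHash yields Hamming distances in the range 0..64


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two 64-bit hashes."""
    mask = (1 << HASH_BITS) - 1
    return bin((a ^ b) & mask).count("1")


def validate_threshold(threshold: int) -> None:
    """Reject thresholds outside 0..64 (bool is excluded even though it is an int)."""
    if isinstance(threshold, bool) or not isinstance(threshold, int):
        raise ValueError("threshold must be an integer")
    if not 0 <= threshold <= HASH_BITS:
        raise ValueError(f"threshold must be between 0 and {HASH_BITS}")


def greedy_deduplicate(hashes: Dict[str, int], threshold: int) -> List[str]:
    """Return IDs of items kept as unique, in input order.

    Assumption: an item is a duplicate if it lies within `threshold` bits
    of any item that has already been kept.
    """
    validate_threshold(threshold)
    kept: List[str] = []
    for item_id, item_hash in hashes.items():
        if all(hamming_distance(item_hash, hashes[k]) > threshold for k in kept):
            kept.append(item_id)
    return kept


if __name__ == "__main__":
    # Made-up hashes for illustration; real pHash values come from image content.
    example = {"a": 0b0000, "b": 0b0001, "c": 0xFF00, "d": 0b0000}
    print(greedy_deduplicate(example, threshold=0))
    print(greedy_deduplicate(example, threshold=1))
```

Under this assumed definition, a threshold of 0 collapses only items with identical hashes, while larger thresholds also merge near-duplicates. The greedy pass is quadratic in the number of kept items, which is fine for illustration but not for large datasets.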