
[DE-6999] Enable image deduplication within nucleus sdk #452

Open
edwinpav wants to merge 10 commits into master from edwinpav/dedup

Conversation


@edwinpav edwinpav commented Feb 23, 2026

See title.

Merge in after sister pr is deployed: https://github.com/scaleapi/scaleapi/pull/134861

Added some unit and integration tests. These tests create an image fixture dataset and a video fixture dataset. Both are made up of TEST_IMAGE_URLS so...

To get the integration tests to pass completely, I had to run some backfills. Specifically, I had to backfill each TEST_IMAGE_URL from TEST_IMAGE_URLS:

  1. Backfill all occurrences of (TEST_IMAGE_URL, 60ad648c85db770026e9bf77) in the nucleus.processing_upload table

    • Why? This is the table used for caching async uploads to a dataset, and it caches based on (original_url, user_id), a composite index. The user_id for pytests, as defined in helpers.ts, is NUCLEUS_PYTEST_USER_ID = "60ad648c85db770026e9bf77".
  2. Backfill all occurrences of TEST_IMAGE_URL in the nucleus.processed_upload table

    • Why? This is the table used for caching sync uploads to a dataset, and it caches based on just original_url - this is the index.

See this comment for more info.

To test locally:

  1. From the root of this repo, create a venv and run pip install -e . so the venv uses the local version of the SDK (this repo). Make sure the venv is created with Python 3.11. Currently, some of the SDK code doesn't support newer versions of Python (I can look into upgrading it); based on the client installation tests, it only supports Python 3.7-3.11.
  2. Run test scripts within this venv (can run from any repo).

Example test script of valid usage:

import nucleus

# define variables
corp_api_key="<SCALE_API_KEY>"
customer_id="68921622befbf26f9e535024"
SCALE_API_KEY=f"{corp_api_key}|{customer_id}"
endpoint="http://localhost:3000/v1/nucleus"
dataset_id = "ds_d6ccka5zks5g0bheab8g"

# initialize client
client = nucleus.NucleusClient(SCALE_API_KEY, endpoint=endpoint)
print(client)
dataset = client.get_dataset(dataset_id)
print(dataset)

entire_dataset_dedup = dataset.deduplicate(threshold=30)
print(entire_dataset_dedup)
print()

ref_ids_dedup = dataset.deduplicate(threshold=10, reference_ids=["video1/0", "video1/1", "video1/2", "video1/3", "video1/4", "video1/5"])
print(ref_ids_dedup)
print(ref_ids_dedup.stats)
print()

dataset_item_ids_dedup = dataset.deduplicate_by_ids(threshold=10, dataset_item_ids=["di_d6ccmm2mc93g23g1maag", "di_d6ccmm2mc93g23g1mab0", "di_d6ccmm2mc93g23g1mabg", "di_d6ccmm2mc93g23g1mac0", "di_d6ccmm2mc93g23g1macg", "di_d6ccmm2mc93g23g1mad0"])
print(dataset_item_ids_dedup)
print(dataset_item_ids_dedup.stats)

Output:

NucleusClient(api_key='scaleint_c5477527b28e4911b887ac2ede355eac|68921622befbf26f9e535024', use_notebook=False, endpoint='http://localhost:3000/v1/nucleus')
Dataset(name='test-phash-scene-1', dataset_id='ds_d6ccka5zks5g0bheab8g', is_scene='True')
DeduplicationResult(unique_item_ids=['di_d6ccmm2mc93g23g1maag', 'di_d6ccmm2mc93g23g1mb1g', 'di_d6ccmm2mc93g23g1mbyg', 'di_d6ccmm2mc93g23g1mchg', 'di_d6ccmm2mc93g23g1mfk0', 'di_d6ccmm2mc93g23g1mgeg', 'di_d6ccmm2mc93g23g1mh10', 'di_d6ccp5rmc93g200pr6n0', 'di_d6ccp70mc93g1yqr4pw0'], unique_reference_ids=['video1/0', 'video1/46', 'video1/104', 'video1/142', 'video1/337', 'video1/392', 'video1/429', 'video3/108', 'video2/397'], stats=DeduplicationStats(threshold=30, original_count=1802, deduplicated_count=9))

DeduplicationResult(unique_item_ids=['di_d6ccmm2mc93g23g1maag', 'di_d6ccmm2mc93g23g1mac0'], unique_reference_ids=['video1/0', 'video1/3'], stats=DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2))
DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2)

DeduplicationResult(unique_item_ids=['di_d6ccmm2mc93g23g1maag', 'di_d6ccmm2mc93g23g1mac0'], unique_reference_ids=['video1/0', 'video1/3'], stats=DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2))
DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2)
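One way to consume a result like the one above (plain Python, outside the SDK): the duplicate items are simply the originally submitted IDs minus the unique ones. The ID lists below are copied from the reference_ids example above.

```python
# Reference IDs submitted in the example above
original_ids = ["video1/0", "video1/1", "video1/2", "video1/3", "video1/4", "video1/5"]
# unique_reference_ids from the returned DeduplicationResult
unique_ids = ["video1/0", "video1/3"]

# Items flagged as duplicates are the set difference, with order preserved
duplicate_ids = [i for i in original_ids if i not in set(unique_ids)]
print(duplicate_ids)  # ['video1/1', 'video1/2', 'video1/4', 'video1/5']
```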

Examples of invalid usage:

# not passing threshold
entire_dataset_dedup = dataset.deduplicate()

# output
    entire_dataset_dedup = dataset.deduplicate()
                           ^^^^^^^^^^^^^^^^^^^^^
TypeError: Dataset.deduplicate() missing 1 required positional argument: 'threshold'

# invalid threshold
entire_dataset_dedup = dataset.deduplicate(threshold=70) 
# or
entire_dataset_dedup = dataset.deduplicate(threshold=-5)

# output (for both)
Tried to post http://localhost:3000/v1/nucleus/dataset/ds_d6ccka5zks5g0bheab8g/deduplicate, but received 400: Bad Request.
The detailed error is:
{"error":"An unexpected internal error occured: threshold must be an integer between 0 and 64","route":"/v1/nucleus/dataset/ds_d6ccka5zks5g0bheab8g/deduplicate","request_id":"51bb8a48-0a63-44f8-8ef5-5954493b0edb","status_code":400}

# empty list for ref_ids instead of just not passing in that param
entire_dataset_dedup = dataset.deduplicate(threshold=10, reference_ids=[])

# output
ValueError: reference_ids cannot be empty. Omit reference_ids parameter to deduplicate entire dataset.

# empty list for dataset_item_ids
entire_dataset_dedup = dataset.deduplicate_by_ids(threshold=10, dataset_item_ids=[])

# output
ValueError: dataset_item_ids must be non-empty. Use deduplicate() for entire dataset.
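The guards producing the errors above can be approximated as follows. This is a sketch, not the SDK's actual code: the function name is hypothetical, and the threshold rule is taken from the 400 response quoted above (the SDK itself defers threshold validation to the server).

```python
from typing import List, Optional

def validate_dedup_args(threshold: int, reference_ids: Optional[List[str]] = None) -> None:
    """Illustrative client-side guards mirroring the errors shown above."""
    # Rule quoted in the server's 400 response
    if not isinstance(threshold, int) or not (0 <= threshold <= 64):
        raise ValueError("threshold must be an integer between 0 and 64")
    # An empty list is rejected; None means "deduplicate the entire dataset"
    if reference_ids is not None and len(reference_ids) == 0:
        raise ValueError(
            "reference_ids cannot be empty. "
            "Omit reference_ids parameter to deduplicate entire dataset."
        )
```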

Greptile Summary

This PR adds image deduplication support to the Nucleus Python SDK by introducing two new methods on the Dataset class: deduplicate() (by reference IDs or entire dataset) and deduplicate_by_ids() (by internal dataset item IDs). Both use perceptual hashing (pHash) with a configurable Hamming distance threshold (0-64).

  • Adds DeduplicationResult and DeduplicationStats dataclasses in a new nucleus/deduplication.py module for structured results
  • Adds Dataset.deduplicate() with optional reference_ids filtering and Dataset.deduplicate_by_ids() for direct item ID-based deduplication
  • Includes client-side validation (empty list guards) with server-side threshold validation
  • Comprehensive test suite covering image datasets, video scene datasets, video URL datasets, edge cases (threshold boundaries, empty datasets, single items, duplicates), and idempotency
  • Version bump to 0.17.12 with changelog entry
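For context on what the threshold means: perceptual hashing compares two 64-bit hashes by counting differing bits (Hamming distance), so threshold 0 matches only identical hashes and larger values merge progressively less similar images. A minimal illustration, not SDK code:

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two 64-bit perceptual hashes."""
    return bin(hash_a ^ hash_b).count("1")

# Two small example hashes differing in exactly 2 bits
a = 0b1010_1100
b = 0b1010_0101
print(hamming_distance(a, b))  # 2

# With threshold=10, these two items would be considered duplicates
threshold = 10
print(hamming_distance(a, b) <= threshold)  # True
```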

Confidence Score: 4/5

  • This PR is safe to merge — it adds new methods with no changes to existing behavior.
  • The PR is additive-only: new dataclasses, new methods, and new tests. No existing code paths are modified. Client-side validation is correct and API integration follows established patterns. Comprehensive integration test coverage with edge cases. Minor note: stats threshold is sourced from the local parameter rather than the server response, which is fine for current behavior.
  • No files require special attention. nucleus/dataset.py has minor style considerations but no functional issues.

Important Files Changed

Filename Overview
nucleus/deduplication.py New file with clean DeduplicationResult and DeduplicationStats dataclasses. Simple, well-structured, no issues.
nucleus/dataset.py Adds deduplicate() and deduplicate_by_ids() methods with client-side validation and API calls. Well-documented with proper error handling. Uses local threshold for stats construction instead of server response value.
nucleus/__init__.py Exports DeduplicationResult and DeduplicationStats in __all__ and imports them from the deduplication module. Correctly placed alphabetically.
tests/test_deduplication.py Comprehensive test coverage with unit tests for validation and integration tests for various dataset types (image, video scene, video URL). Covers edge cases, idempotency, and response invariants.
CHANGELOG.md Adds v0.17.12 changelog entry documenting new deduplication methods with example usage.
pyproject.toml Version bump from 0.17.11 to 0.17.12.
nucleus/constants.py Adds THRESHOLD_KEY constant in proper alphabetical order.
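Inferred from the repr strings in the example output above, the two dataclasses in nucleus/deduplication.py presumably look roughly like this. This is a reconstruction for illustration, not the file's actual contents.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DeduplicationStats:
    threshold: int
    original_count: int
    deduplicated_count: int

@dataclass
class DeduplicationResult:
    unique_item_ids: List[str]
    unique_reference_ids: List[str]
    stats: DeduplicationStats

# Values taken from the deduplicate_by_ids example output above
result = DeduplicationResult(
    unique_item_ids=["di_d6ccmm2mc93g23g1maag", "di_d6ccmm2mc93g23g1mac0"],
    unique_reference_ids=["video1/0", "video1/3"],
    stats=DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2),
)
print(result.stats.deduplicated_count)  # 2
```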

Sequence Diagram

sequenceDiagram
    participant User
    participant Dataset
    participant NucleusClient
    participant API as Nucleus API

    User->>Dataset: deduplicate(threshold, reference_ids?)
    Dataset->>Dataset: Validate reference_ids not empty list
    Dataset->>NucleusClient: make_request(payload, "dataset/{id}/deduplicate")
    NucleusClient->>API: POST /dataset/{id}/deduplicate
    API-->>NucleusClient: {unique_item_ids, unique_reference_ids, stats}
    NucleusClient-->>Dataset: response dict
    Dataset-->>User: DeduplicationResult

    User->>Dataset: deduplicate_by_ids(threshold, dataset_item_ids)
    Dataset->>Dataset: Validate dataset_item_ids not empty
    Dataset->>NucleusClient: make_request(payload, "dataset/{id}/deduplicate")
    NucleusClient->>API: POST /dataset/{id}/deduplicate
    API-->>NucleusClient: {unique_item_ids, unique_reference_ids, stats}
    NucleusClient-->>Dataset: response dict
    Dataset-->>User: DeduplicationResult
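The two flows in the diagram differ only in the payload they send to the same route. A tiny sketch of the request construction; the route comes from the diagram, but the payload key names are guesses, not the SDK's actual constants:

```python
from typing import List, Optional, Tuple

def build_deduplicate_request(
    threshold: int,
    dataset_id: str,
    reference_ids: Optional[List[str]] = None,
) -> Tuple[str, dict]:
    """Build (route, payload) for POST dataset/{id}/deduplicate, per the diagram."""
    route = f"dataset/{dataset_id}/deduplicate"
    payload = {"threshold": threshold}
    if reference_ids is not None:
        payload["reference_ids"] = reference_ids
    return route, payload

route, payload = build_deduplicate_request(30, "ds_d6ccka5zks5g0bheab8g")
print(route)    # dataset/ds_d6ccka5zks5g0bheab8g/deduplicate
print(payload)  # {'threshold': 30}
```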

Last reviewed commit: 9ec043a

@edwinpav edwinpav self-assigned this Feb 23, 2026
@edwinpav edwinpav requested a review from vinay553 February 28, 2026 06:50
@edwinpav edwinpav marked this pull request as ready for review February 28, 2026 06:50