
[DE-6999] Enable image deduplication within nucleus sdk #452

Open
edwinpav wants to merge 10 commits into master from edwinpav/dedup

Conversation


@edwinpav edwinpav commented Feb 23, 2026

See title.

Merge in after sister pr is deployed: https://github.com/scaleapi/scaleapi/pull/134861

Added some unit and integration tests. These tests create an image fixture dataset and a video fixture dataset. Both are made up of TEST_IMAGE_URLS so...

To get the integration tests to pass completely, I had to run some backfills. Specifically, I had to backfill each TEST_IMAGE_URL from TEST_IMAGE_URLS:

  1. Backfill all occurrences of (TEST_IMAGE_URL, 60ad648c85db770026e9bf77) in the nucleus.processing_upload table

    • Why? This is the table used for caching async uploads to a dataset, and it caches based on (original_url, user_id), a composite index. The user_id for pytests, as defined in helpers.ts, is NUCLEUS_PYTEST_USER_ID = "60ad648c85db770026e9bf77".
  2. Backfill all occurrences of TEST_IMAGE_URL in the nucleus.processed_upload table

    • Why? This is the table used for caching sync uploads to a dataset, and it caches based on just original_url - this is the index.

See this comment for more info.

To test locally:

  1. From the root of this repo, create a venv and run pip install -e . so the venv uses the local version of the SDK (this repo). Make sure the venv is created with Python 3.11. Currently, some of the SDK code doesn't support newer versions of Python (I can look into upgrading it); based on the client installation tests, it only supports Python 3.7-3.11.
  2. Run test scripts within this venv (can run from any repo).

Example test script of valid usage:

import nucleus

# define variables
corp_api_key="<SCALE_API_KEY>"
customer_id="68921622befbf26f9e535024"
SCALE_API_KEY=f"{corp_api_key}|{customer_id}"
endpoint="http://localhost:3000/v1/nucleus"
dataset_id = "ds_d6ccka5zks5g0bheab8g"

# initialize client
client = nucleus.NucleusClient(SCALE_API_KEY, endpoint=endpoint)
print(client)
dataset = client.get_dataset(dataset_id)
print(dataset)

entire_dataset_dedup = dataset.deduplicate(threshold=30)
print(entire_dataset_dedup)
print()

ref_ids_dedup = dataset.deduplicate(threshold=10, reference_ids=["video1/0", "video1/1", "video1/2", "video1/3", "video1/4", "video1/5"])
print(ref_ids_dedup)
print(ref_ids_dedup.stats)
print()

dataset_item_ids_dedup = dataset.deduplicate_by_ids(threshold=10, dataset_item_ids=["di_d6ccmm2mc93g23g1maag", "di_d6ccmm2mc93g23g1mab0", "di_d6ccmm2mc93g23g1mabg", "di_d6ccmm2mc93g23g1mac0", "di_d6ccmm2mc93g23g1macg", "di_d6ccmm2mc93g23g1mad0"])
print(dataset_item_ids_dedup)
print(dataset_item_ids_dedup.stats)

Output:

NucleusClient(api_key='scaleint_c5477527b28e4911b887ac2ede355eac|68921622befbf26f9e535024', use_notebook=False, endpoint='http://localhost:3000/v1/nucleus')
Dataset(name='test-phash-scene-1', dataset_id='ds_d6ccka5zks5g0bheab8g', is_scene='True')
DeduplicationResult(unique_item_ids=['di_d6ccmm2mc93g23g1maag', 'di_d6ccmm2mc93g23g1mb1g', 'di_d6ccmm2mc93g23g1mbyg', 'di_d6ccmm2mc93g23g1mchg', 'di_d6ccmm2mc93g23g1mfk0', 'di_d6ccmm2mc93g23g1mgeg', 'di_d6ccmm2mc93g23g1mh10', 'di_d6ccp5rmc93g200pr6n0', 'di_d6ccp70mc93g1yqr4pw0'], unique_reference_ids=['video1/0', 'video1/46', 'video1/104', 'video1/142', 'video1/337', 'video1/392', 'video1/429', 'video3/108', 'video2/397'], stats=DeduplicationStats(threshold=30, original_count=1802, deduplicated_count=9))

DeduplicationResult(unique_item_ids=['di_d6ccmm2mc93g23g1maag', 'di_d6ccmm2mc93g23g1mac0'], unique_reference_ids=['video1/0', 'video1/3'], stats=DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2))
DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2)

DeduplicationResult(unique_item_ids=['di_d6ccmm2mc93g23g1maag', 'di_d6ccmm2mc93g23g1mac0'], unique_reference_ids=['video1/0', 'video1/3'], stats=DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2))
DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2)
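One way to consume a result like the one above (plain Python, outside the SDK): the duplicate items are simply the originally submitted IDs minus the unique ones. The ID lists below are copied from the reference_ids example above.

```python
# Reference IDs submitted in the example above
original_ids = ["video1/0", "video1/1", "video1/2", "video1/3", "video1/4", "video1/5"]
# unique_reference_ids from the returned DeduplicationResult
unique_ids = ["video1/0", "video1/3"]

# Items flagged as duplicates are the set difference, with order preserved
duplicate_ids = [i for i in original_ids if i not in set(unique_ids)]
print(duplicate_ids)  # ['video1/1', 'video1/2', 'video1/4', 'video1/5']
```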

Examples of invalid usage:

# not passing threshold
entire_dataset_dedup = dataset.deduplicate()

# output
    entire_dataset_dedup = dataset.deduplicate()
                           ^^^^^^^^^^^^^^^^^^^^^
TypeError: Dataset.deduplicate() missing 1 required positional argument: 'threshold'

# invalid threshold
entire_dataset_dedup = dataset.deduplicate(threshold=70) 
# or
entire_dataset_dedup = dataset.deduplicate(threshold=-5)

# output (for both)
Tried to post http://localhost:3000/v1/nucleus/dataset/ds_d6ccka5zks5g0bheab8g/deduplicate, but received 400: Bad Request.
The detailed error is:
{"error":"An unexpected internal error occured: threshold must be an integer between 0 and 64","route":"/v1/nucleus/dataset/ds_d6ccka5zks5g0bheab8g/deduplicate","request_id":"51bb8a48-0a63-44f8-8ef5-5954493b0edb","status_code":400}

# empty list for ref_ids instead of just not passing in that param
entire_dataset_dedup = dataset.deduplicate(threshold=10, reference_ids=[])

# output
ValueError: reference_ids cannot be empty. Omit reference_ids parameter to deduplicate entire dataset.

# empty list for dataset_item_ids
entire_dataset_dedup = dataset.deduplicate_by_ids(threshold=10, dataset_item_ids=[])

# output
ValueError: dataset_item_ids must be non-empty. Use deduplicate() for entire dataset.
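The guards producing the errors above can be approximated as follows. This is a sketch, not the SDK's actual code: the function name is hypothetical, and the threshold rule is taken from the 400 response quoted above (the SDK itself defers threshold validation to the server).

```python
from typing import List, Optional

def validate_dedup_args(threshold: int, reference_ids: Optional[List[str]] = None) -> None:
    """Illustrative client-side guards mirroring the errors shown above."""
    # Rule quoted in the server's 400 response
    if not isinstance(threshold, int) or not (0 <= threshold <= 64):
        raise ValueError("threshold must be an integer between 0 and 64")
    # An empty list is rejected; None means "deduplicate the entire dataset"
    if reference_ids is not None and len(reference_ids) == 0:
        raise ValueError(
            "reference_ids cannot be empty. "
            "Omit reference_ids parameter to deduplicate entire dataset."
        )
```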

Greptile Summary

This PR adds image deduplication support to the Nucleus Python SDK by introducing two new methods on the Dataset class: deduplicate() (by reference IDs or entire dataset) and deduplicate_by_ids() (by internal dataset item IDs). Both use perceptual hashing (pHash) with a configurable Hamming distance threshold (0-64).

  • Adds DeduplicationResult and DeduplicationStats dataclasses in a new nucleus/deduplication.py module for structured results
  • Adds Dataset.deduplicate() with optional reference_ids filtering and Dataset.deduplicate_by_ids() for direct item ID-based deduplication
  • Includes client-side validation (empty list guards) with server-side threshold validation
  • Comprehensive test suite covering image datasets, video scene datasets, video URL datasets, edge cases (threshold boundaries, empty datasets, single items, duplicates), and idempotency
  • Version bump to 0.17.12 with changelog entry
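For context on what the threshold means: perceptual hashing compares two 64-bit hashes by counting differing bits (Hamming distance), so threshold 0 matches only identical hashes and larger values merge progressively less similar images. A minimal illustration, not SDK code:

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two 64-bit perceptual hashes."""
    return bin(hash_a ^ hash_b).count("1")

# Two small example hashes differing in exactly 2 bits
a = 0b1010_1100
b = 0b1010_0101
print(hamming_distance(a, b))  # 2

# With threshold=10, these two items would be considered duplicates
threshold = 10
print(hamming_distance(a, b) <= threshold)  # True
```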

Confidence Score: 4/5

  • This PR is safe to merge — it adds new methods with no changes to existing behavior.
  • The PR is additive-only: new dataclasses, new methods, and new tests. No existing code paths are modified. Client-side validation is correct and API integration follows established patterns. Comprehensive integration test coverage with edge cases. Minor note: stats threshold is sourced from the local parameter rather than the server response, which is fine for current behavior.
  • No files require special attention. nucleus/dataset.py has minor style considerations but no functional issues.

Important Files Changed

Filename Overview
nucleus/deduplication.py New file with clean DeduplicationResult and DeduplicationStats dataclasses. Simple, well-structured, no issues.
nucleus/dataset.py Adds deduplicate() and deduplicate_by_ids() methods with client-side validation and API calls. Well-documented with proper error handling. Uses local threshold for stats construction instead of server response value.
nucleus/__init__.py Exports DeduplicationResult and DeduplicationStats in __all__ and imports them from the deduplication module. Correctly placed alphabetically.
tests/test_deduplication.py Comprehensive test coverage with unit tests for validation and integration tests for various dataset types (image, video scene, video URL). Covers edge cases, idempotency, and response invariants.
CHANGELOG.md Adds v0.17.12 changelog entry documenting new deduplication methods with example usage.
pyproject.toml Version bump from 0.17.11 to 0.17.12.
nucleus/constants.py Adds THRESHOLD_KEY constant in proper alphabetical order.
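Inferred from the repr strings in the example output above, the two dataclasses in nucleus/deduplication.py presumably look roughly like this. This is a reconstruction for illustration, not the file's actual contents.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DeduplicationStats:
    threshold: int
    original_count: int
    deduplicated_count: int

@dataclass
class DeduplicationResult:
    unique_item_ids: List[str]
    unique_reference_ids: List[str]
    stats: DeduplicationStats

# Values taken from the deduplicate_by_ids example output above
result = DeduplicationResult(
    unique_item_ids=["di_d6ccmm2mc93g23g1maag", "di_d6ccmm2mc93g23g1mac0"],
    unique_reference_ids=["video1/0", "video1/3"],
    stats=DeduplicationStats(threshold=10, original_count=6, deduplicated_count=2),
)
print(result.stats.deduplicated_count)  # 2
```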

Sequence Diagram

sequenceDiagram
    participant User
    participant Dataset
    participant NucleusClient
    participant API as Nucleus API

    User->>Dataset: deduplicate(threshold, reference_ids?)
    Dataset->>Dataset: Validate reference_ids not empty list
    Dataset->>NucleusClient: make_request(payload, "dataset/{id}/deduplicate")
    NucleusClient->>API: POST /dataset/{id}/deduplicate
    API-->>NucleusClient: {unique_item_ids, unique_reference_ids, stats}
    NucleusClient-->>Dataset: response dict
    Dataset-->>User: DeduplicationResult

    User->>Dataset: deduplicate_by_ids(threshold, dataset_item_ids)
    Dataset->>Dataset: Validate dataset_item_ids not empty
    Dataset->>NucleusClient: make_request(payload, "dataset/{id}/deduplicate")
    NucleusClient->>API: POST /dataset/{id}/deduplicate
    API-->>NucleusClient: {unique_item_ids, unique_reference_ids, stats}
    NucleusClient-->>Dataset: response dict
    Dataset-->>User: DeduplicationResult
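The two flows in the diagram differ only in the payload they send to the same route. A tiny sketch of the request construction; the route comes from the diagram, but the payload key names are guesses, not the SDK's actual constants:

```python
from typing import List, Optional, Tuple

def build_deduplicate_request(
    threshold: int,
    dataset_id: str,
    reference_ids: Optional[List[str]] = None,
) -> Tuple[str, dict]:
    """Build (route, payload) for POST dataset/{id}/deduplicate, per the diagram."""
    route = f"dataset/{dataset_id}/deduplicate"
    payload = {"threshold": threshold}
    if reference_ids is not None:
        payload["reference_ids"] = reference_ids
    return route, payload

route, payload = build_deduplicate_request(30, "ds_d6ccka5zks5g0bheab8g")
print(route)    # dataset/ds_d6ccka5zks5g0bheab8g/deduplicate
print(payload)  # {'threshold': 30}
```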

Last reviewed commit: 9ec043a

@edwinpav edwinpav self-assigned this Feb 23, 2026
@edwinpav edwinpav requested a review from vinay553 February 28, 2026 06:50
@edwinpav edwinpav marked this pull request as ready for review February 28, 2026 06:50