Skip to content

[SYNPY-1439]Support manifest.tsv in sync_from_synapse and sync_to_synapse [test needed]#1339

Draft
danlu1 wants to merge 1 commit intodevelopfrom
synpy-1439-manifest-oop
Draft

[SYNPY-1439]Support manifest.tsv in sync_from_synapse and sync_to_synapse [test needed]#1339
danlu1 wants to merge 1 commit intodevelopfrom
synpy-1439-manifest-oop

Conversation

@danlu1
Copy link
Copy Markdown
Contributor

@danlu1 danlu1 commented Mar 18, 2026

Problem:

We would like to have sync_from_synapse and sync_to_synapse support the writing of data to or from a manifest TSV file.

Solution:

  1. This pull request introduces comprehensive manifest file capabilities to the Synapse Python client, enabling users to generate, validate, and upload manifest TSV files for bulk file operations.
  2. It also adds support for generating manifests from the Synapse Download List, exposes relevant constants, and enhances documentation with detailed guides and references.

AI generated summary:
Manifest functionality and documentation:

  • Added the ManifestGeneratable mixin to both Project and Folder classes, providing methods for manifest generation, validation, and upload. This enables users to easily work with manifest files for bulk operations. [1] [2] [3]
  • Introduced new documentation: manifest_generatable.md (API reference) and manifest_operations.md (tutorial), covering manifest generation, upload, validation, provenance handling, annotation management, and async usage. [1] [2]

Download List manifest support:

  • Added the DownloadListManifestRequest class for generating manifest files from a user's download list using the Synapse Asynchronous Job service, with support for custom CSV formatting and async APIs.
  • Updated imports and __all__ in synapseclient.models and synapseclient.models.mixins to expose DownloadListManifestRequest, ManifestGeneratable, MANIFEST_FILENAME, and DEFAULT_GENERATED_MANIFEST_KEYS for public use. [1] [2] [3] [4]

Constants and API exposure:

  • Exposed MANIFEST_FILENAME and DEFAULT_GENERATED_MANIFEST_KEYS as public constants for consistent manifest file naming and column management across the package. [1] [2] [3] [4]

These changes significantly enhance the usability of manifest-based workflows in Synapse, streamline bulk data operations, and provide clear documentation and APIs for both synchronous and asynchronous use.

  • Describe the specific technical solution that this PR provides.
  • Include details such as:
    • How you debugged the problem
    • Why you chose this solution
    • Does the solution make any impact beyond the immediate problem
    • Any necessary technical debt incurred
    • Any important design decisions/links to design documentation (if applicable)

Testing:

Unit test is added.

@danlu1 danlu1 requested a review from a team as a code owner March 18, 2026 17:43
@danlu1 danlu1 marked this pull request as draft March 18, 2026 17:43
@danlu1 danlu1 changed the title [SYIN-1439]Support manifest.tsv in sync_from_synapse and sync_to_synapse [SYIN-1439]Support manifest.tsv in sync_from_synapse and sync_to_synapse [test needed] Mar 18, 2026
Copy link
Copy Markdown
Member

@BryanFauble BryanFauble left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great addition — having manifest support directly on the OOP models is a nice step forward. I reviewed this specifically against the server-side manifest implementation in Synapse-Repository-Services to identify alignment opportunities. Most of my comments are about interoperability between client-generated and server-generated manifests. Nothing blocking since this is still in draft.

Note: This comment was drafted with AI assistance and reviewed by me for accuracy.

Comment on lines +29 to +40
DEFAULT_GENERATED_MANIFEST_KEYS = [
"path",
"parent",
"name",
"id",
"synapseStore",
"contentType",
"used",
"executed",
"activityName",
"activityDescription",
]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to be aware of: the server-side manifest (generated by DownloadListManifestRequest via ManifestKeys enum in the Java backend) uses different column names than what we define here. Here's the mapping:

Concept Python Client Server (ManifestKeys)
Entity ID id ID
Parent parent parentId
Version (missing) versionNumber
File size (missing) dataFileSizeBytes
MD5 (missing) dataFileMD5Hex
Created by (missing) createdBy
Created on (missing) createdOn
Modified by (missing) modifiedBy
Modified on (missing) modifiedOn
Synapse URL (missing) synapseURL

Should we consider adding versionNumber and dataFileMD5Hex to the default keys? Those are already available on the File model and are commonly useful for users doing bulk operations. Not necessarily in this PR, but worth thinking about for interoperability.

Note: This comment was drafted with AI assistance and reviewed by me for accuracy.

timeout: int = 120,
*,
synapse_client: Optional[Synapse] = None,
) -> str:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to call out: the manifest generated by this method comes from the server, which uses column names like ID, parentId, versionNumber, dataFileSizeBytes, etc. But from_manifest() expects the Python client column names (id, parent, path). So a user can't take the output of generate_download_list_manifest() and feed it directly into from_manifest() without manual column renaming.

Should we document this explicitly, or would it make sense to add a column mapping/normalization option so the two are interoperable?

Note: This comment was drafted with AI assistance and reviewed by me for accuracy.

Parse a manifest cell value into an appropriate Python type.

Handles:
- List syntax: [a,b,c] -> ['a', 'b', 'c']
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: _parse_manifest_value doesn't handle datetime/ISO 8601 strings — it will return them as plain strings. The existing _convert_cell_in_manifest_to_python_types in synapseutils/sync.py does handle datetime parsing (lines ~240-260). Since the server stores annotations with a TIMESTAMP_MS type, a round-trip through generate → parse could lose the datetime typing.

Is the plan to add datetime parsing here later, or is there a reason to keep this simpler than the sync.py version?

Note: This comment was drafted with AI assistance and reviewed by me for accuracy.

Comment on lines +664 to +677
file = File(
path=path,
parent_id=file_parent,
name=row.get("name", "").strip() or None,
id=row.get("id", "").strip() or None,
synapse_store=(
row.get("synapseStore", "").strip().lower() != "false"
if row.get("synapseStore", "").strip()
else True
),
content_type=row.get("contentType", "").strip() or None,
merge_existing_annotations=merge_existing_annotations,
associate_activity_to_new_version=associate_activity_to_new_version,
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forceVersion is listed in STORE_FUNCTION_FIELDS (line 26) and excluded from annotations via skip_keys, but it's never actually read from the manifest row and applied to the File object. The File model does accept force_version — should this be wired up?

Note: This comment was drafted with AI assistance and reviewed by me for accuracy.

Comment on lines +216 to +222
downloaded_path = await download_by_file_handle(
file_handle_id=self.result_file_handle_id,
synapse_id=self.result_file_handle_id,
entity_type="FileEntity",
destination=download_path,
synapse_client=synapse_client,
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is synapse_id=self.result_file_handle_id correct here? The manifest isn't associated with a FileEntity — it's a standalone file handle from the async job. I'm not sure if download_by_file_handle handles this case correctly when it tries to resolve the association. Worth double-checking this path in an integration test.

Note: This comment was drafted with AI assistance and reviewed by me for accuracy.

@BryanFauble BryanFauble changed the title [SYIN-1439]Support manifest.tsv in sync_from_synapse and sync_to_synapse [test needed] [SYN-1439]Support manifest.tsv in sync_from_synapse and sync_to_synapse [test needed] Mar 23, 2026
@BryanFauble BryanFauble requested a review from a team March 23, 2026 16:23
@BryanFauble BryanFauble changed the title [SYN-1439]Support manifest.tsv in sync_from_synapse and sync_to_synapse [test needed] [SYNPY-1439]Support manifest.tsv in sync_from_synapse and sync_to_synapse [test needed] Mar 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants