[SYNPY-1439]Support manifest.tsv in sync_from_synapse and sync_to_synapse [test needed]#1339
[SYNPY-1439]Support manifest.tsv in sync_from_synapse and sync_to_synapse [test needed]#1339
Conversation
BryanFauble
left a comment
There was a problem hiding this comment.
This is a great addition — having manifest support directly on the OOP models is a nice step forward. I reviewed this specifically against the server-side manifest implementation in Synapse-Repository-Services to identify alignment opportunities. Most of my comments are about interoperability between client-generated and server-generated manifests. Nothing blocking since this is still in draft.
Note: This comment was drafted with AI assistance and reviewed by me for accuracy.
| DEFAULT_GENERATED_MANIFEST_KEYS = [ | ||
| "path", | ||
| "parent", | ||
| "name", | ||
| "id", | ||
| "synapseStore", | ||
| "contentType", | ||
| "used", | ||
| "executed", | ||
| "activityName", | ||
| "activityDescription", | ||
| ] |
There was a problem hiding this comment.
Something to be aware of: the server-side manifest (generated by DownloadListManifestRequest via ManifestKeys enum in the Java backend) uses different column names than what we define here. Here's the mapping:
| Concept | Python Client | Server (ManifestKeys) |
|---|---|---|
| Entity ID | id |
ID |
| Parent | parent |
parentId |
| Version | (missing) | versionNumber |
| File size | (missing) | dataFileSizeBytes |
| MD5 | (missing) | dataFileMD5Hex |
| Created by | (missing) | createdBy |
| Created on | (missing) | createdOn |
| Modified by | (missing) | modifiedBy |
| Modified on | (missing) | modifiedOn |
| Synapse URL | (missing) | synapseURL |
Should we consider adding versionNumber and dataFileMD5Hex to the default keys? Those are already available on the File model and are commonly useful for users doing bulk operations. Not necessarily in this PR, but worth thinking about for interoperability.
Note: This comment was drafted with AI assistance and reviewed by me for accuracy.
| timeout: int = 120, | ||
| *, | ||
| synapse_client: Optional[Synapse] = None, | ||
| ) -> str: |
There was a problem hiding this comment.
One thing to call out: the manifest generated by this method comes from the server, which uses column names like ID, parentId, versionNumber, dataFileSizeBytes, etc. But from_manifest() expects the Python client column names (id, parent, path). So a user can't take the output of generate_download_list_manifest() and feed it directly into from_manifest() without manual column renaming.
Should we document this explicitly, or would it make sense to add a column mapping/normalization option so the two are interoperable?
Note: This comment was drafted with AI assistance and reviewed by me for accuracy.
| Parse a manifest cell value into an appropriate Python type. | ||
|
|
||
| Handles: | ||
| - List syntax: [a,b,c] -> ['a', 'b', 'c'] |
There was a problem hiding this comment.
Nit: _parse_manifest_value doesn't handle datetime/ISO 8601 strings — it will return them as plain strings. The existing _convert_cell_in_manifest_to_python_types in synapseutils/sync.py does handle datetime parsing (lines ~240-260). Since the server stores annotations with a TIMESTAMP_MS type, a round-trip through generate → parse could lose the datetime typing.
Is the plan to add datetime parsing here later, or is there a reason to keep this simpler than the sync.py version?
Note: This comment was drafted with AI assistance and reviewed by me for accuracy.
| file = File( | ||
| path=path, | ||
| parent_id=file_parent, | ||
| name=row.get("name", "").strip() or None, | ||
| id=row.get("id", "").strip() or None, | ||
| synapse_store=( | ||
| row.get("synapseStore", "").strip().lower() != "false" | ||
| if row.get("synapseStore", "").strip() | ||
| else True | ||
| ), | ||
| content_type=row.get("contentType", "").strip() or None, | ||
| merge_existing_annotations=merge_existing_annotations, | ||
| associate_activity_to_new_version=associate_activity_to_new_version, | ||
| ) |
There was a problem hiding this comment.
forceVersion is listed in STORE_FUNCTION_FIELDS (line 26) and excluded from annotations via skip_keys, but it's never actually read from the manifest row and applied to the File object. The File model does accept force_version — should this be wired up?
Note: This comment was drafted with AI assistance and reviewed by me for accuracy.
| downloaded_path = await download_by_file_handle( | ||
| file_handle_id=self.result_file_handle_id, | ||
| synapse_id=self.result_file_handle_id, | ||
| entity_type="FileEntity", | ||
| destination=download_path, | ||
| synapse_client=synapse_client, | ||
| ) |
There was a problem hiding this comment.
Is synapse_id=self.result_file_handle_id correct here? The manifest isn't associated with a FileEntity — it's a standalone file handle from the async job. I'm not sure if download_by_file_handle handles this case correctly when it tries to resolve the association. Worth double-checking this path in an integration test.
Note: This comment was drafted with AI assistance and reviewed by me for accuracy.
Problem:
We would like to have sync_from_synapse and sync_to_synapse support the writing of data to or from a manifest TSV file.
Solution:
AI generated summary:
Manifest functionality and documentation:
ManifestGeneratablemixin to bothProjectandFolderclasses, providing methods for manifest generation, validation, and upload. This enables users to easily work with manifest files for bulk operations. [1] [2] [3]manifest_generatable.md(API reference) andmanifest_operations.md(tutorial), covering manifest generation, upload, validation, provenance handling, annotation management, and async usage. [1] [2]Download List manifest support:
DownloadListManifestRequestclass for generating manifest files from a user's download list using the Synapse Asynchronous Job service, with support for custom CSV formatting and async APIs.__all__insynapseclient.modelsandsynapseclient.models.mixinsto exposeDownloadListManifestRequest,ManifestGeneratable,MANIFEST_FILENAME, andDEFAULT_GENERATED_MANIFEST_KEYSfor public use. [1] [2] [3] [4]Constants and API exposure:
MANIFEST_FILENAMEandDEFAULT_GENERATED_MANIFEST_KEYSas public constants for consistent manifest file naming and column management across the package. [1] [2] [3] [4]These changes significantly enhance the usability of manifest-based workflows in Synapse, streamline bulk data operations, and provide clear documentation and APIs for both synchronous and asynchronous use.
Testing:
Unit test is added.