diff --git a/docs/reference/experimental/mixins/manifest_generatable.md b/docs/reference/experimental/mixins/manifest_generatable.md new file mode 100644 index 000000000..47aac2a4c --- /dev/null +++ b/docs/reference/experimental/mixins/manifest_generatable.md @@ -0,0 +1,69 @@ +# ManifestGeneratable Mixin + +The `ManifestGeneratable` mixin provides manifest TSV file generation and reading capabilities for container entities (Projects and Folders). + +## Overview + +This mixin enables: + +- Generating manifest TSV files after syncing from Synapse +- Uploading files from manifest TSV files +- Validating manifest files before upload + +## Usage + +The mixin is automatically available on `Project` and `Folder` classes: + +```python +from synapseclient.models import Project, Folder + +# Project and Folder both have manifest capabilities +project = Project(id="syn123") +folder = Folder(id="syn456") +``` + +## API Reference + +::: synapseclient.models.mixins.manifest.ManifestGeneratable + options: + show_root_heading: true + show_source: false + members: + - generate_manifest + - generate_manifest_async + - from_manifest + - from_manifest_async + - validate_manifest + - validate_manifest_async + - get_manifest_data + - get_manifest_data_async + +## Constants + +### MANIFEST_FILENAME + +The default filename for generated manifests: `SYNAPSE_METADATA_MANIFEST.tsv` + +```python +from synapseclient.models import MANIFEST_FILENAME + +print(MANIFEST_FILENAME) # "SYNAPSE_METADATA_MANIFEST.tsv" +``` + +### DEFAULT_GENERATED_MANIFEST_KEYS + +The default columns included in generated manifest files: + +```python +from synapseclient.models import DEFAULT_GENERATED_MANIFEST_KEYS + +print(DEFAULT_GENERATED_MANIFEST_KEYS) +# ['path', 'parent', 'name', 'id', 'synapseStore', 'contentType', +# 'used', 'executed', 'activityName', 'activityDescription'] +``` + +## See Also + +- [Manifest Operations Tutorial](../../../tutorials/python/manifest_operations.md) +- [StorableContainer 
Mixin](storable_container.md) +- [Manifest TSV Format](../../../explanations/manifest_tsv.md) diff --git a/docs/tutorials/python/manifest_operations.md b/docs/tutorials/python/manifest_operations.md new file mode 100644 index 000000000..25362a347 --- /dev/null +++ b/docs/tutorials/python/manifest_operations.md @@ -0,0 +1,328 @@ +# Manifest Operations + +This tutorial covers how to work with manifest TSV files for bulk file operations in Synapse. Manifest files provide a way to track file metadata, download files with their annotations, and upload files with provenance information. + +## Overview + +A manifest file is a tab-separated values (TSV) file that contains metadata about files in Synapse. The manifest includes: + +- File paths and Synapse IDs +- Parent container IDs +- Annotations +- Provenance information (used/executed references) + +## Generating Manifests During Download + +When syncing files from Synapse, you can automatically generate a manifest file that captures all file metadata. 
+ +### Using sync_from_synapse with Manifest Generation + +```python +from synapseclient.models import Project +import synapseclient + +synapseclient.login() + +# Download a project with manifest generation at each directory level +project = Project(id="syn123456").sync_from_synapse( + path="/path/to/download", + generate_manifest="all" +) + +# Or generate a single manifest at the root level only +project = Project(id="syn123456").sync_from_synapse( + path="/path/to/download", + generate_manifest="root" +) +``` + +### Manifest Generation Options + +The `generate_manifest` parameter accepts three values: + +| Value | Description | +|-------|-------------| +| `"suppress"` | (Default) Do not create any manifest files | +| `"root"` | Create a single manifest at the root download path | +| `"all"` | Create a manifest in each directory level | + +### Generating Manifest Separately + +You can also generate a manifest after syncing: + +```python +from synapseclient.models import Project +import synapseclient + +synapseclient.login() + +# First sync without manifest +project = Project(id="syn123456").sync_from_synapse( + path="/path/to/download" +) + +# Then generate manifest separately +manifest_path = project.generate_manifest( + path="/path/to/download", + manifest_scope="root" +) +print(f"Manifest created at: {manifest_path}") +``` + +## Manifest File Format + +The generated manifest file (`SYNAPSE_METADATA_MANIFEST.tsv`) contains the following columns: + +| Column | Description | +|--------|-------------| +| `path` | Local file path | +| `parent` | Synapse ID of the parent container | +| `name` | File name in Synapse | +| `id` | Synapse file ID | +| `synapseStore` | Whether the file is stored in Synapse | +| `contentType` | MIME type of the file | +| `used` | Provenance - entities used to create this file | +| `executed` | Provenance - code/scripts executed | +| `activityName` | Name of the provenance activity | +| `activityDescription` | Description of the provenance 
activity | +| *custom columns* | Any annotations on the files | + +### Example Manifest + +```tsv +path parent name id synapseStore contentType used executed activityName activityDescription study dataType +/data/file1.csv syn123 file1.csv syn456 True text/csv Data Processing Study1 RNA-seq +/data/file2.csv syn123 file2.csv syn789 True text/csv syn456 Analysis Processed from file1 Study1 RNA-seq +``` + +## Uploading Files from a Manifest + +You can upload files to Synapse using a manifest file: + +```python +from synapseclient.models import Project +import synapseclient + +synapseclient.login() + +# Upload files from a manifest +files = Project.from_manifest( + manifest_path="/path/to/manifest.tsv", + parent_id="syn123456" +) + +for file in files: + print(f"Uploaded: {file.name} ({file.id})") +``` + +### Dry Run Validation + +Before uploading, you can validate the manifest: + +```python +from synapseclient.models import Project + +# Validate without uploading +is_valid, errors = Project.validate_manifest( + manifest_path="/path/to/manifest.tsv" +) + +if is_valid: + print("Manifest is valid, ready for upload") +else: + for error in errors: + print(f"Error: {error}") +``` + +Or use the `dry_run` option to validate the manifest and see what would be uploaded without making changes: + +```python +# Dry run - validates and returns what would be uploaded, but doesn't upload +files = Project.from_manifest( + manifest_path="/path/to/manifest.tsv", + parent_id="syn123456", + dry_run=True # Validate only, no actual upload +) +print(f"Would upload {len(files)} files") +``` + +The `dry_run` parameter is useful for: + +- Validating manifest format before committing to an upload +- Testing your manifest configuration +- Previewing which files will be affected + +## Working with Annotations + +Annotations in the manifest are automatically handled: + +### On Download + +When generating a manifest, all file annotations are included as additional columns: + +```python +project = 
Project(id="syn123456").sync_from_synapse( + path="/path/to/download", + generate_manifest="root" +) +# Annotations appear as columns in the manifest +``` + +### On Upload + +Any columns in the manifest that aren't standard fields become annotations: + +```tsv +path parent study dataType specimenType +/data/file1.csv syn123 Study1 RNA-seq tissue +``` + +```python +files = Project.from_manifest( + manifest_path="/path/to/manifest.tsv", + parent_id="syn123456", + merge_existing_annotations=True # Merge with existing annotations +) +``` + +## Working with Provenance + +### On Download + +Provenance information is captured in the `used`, `executed`, `activityName`, and `activityDescription` columns: + +```python +project = Project(id="syn123456").sync_from_synapse( + path="/path/to/download", + include_activity=True, # Include provenance + generate_manifest="root" +) +``` + +### On Upload + +You can specify provenance in the manifest: + +```tsv +path parent used executed activityName activityDescription +/data/output.csv syn123 syn456;syn789 https://github.com/repo/script.py Analysis Generated from input files +``` + +- Multiple references are separated by semicolons (`;`) +- References can be Synapse IDs, URLs, or local file paths + +## Synapse Download List Integration + +The manifest functionality integrates with Synapse's Download List feature. You can generate a manifest directly from your Synapse download list, which is useful for exporting metadata about files you've queued for download in the Synapse web interface. 
+ +### Generating Manifest from Download List + +```python +from synapseclient.models import Project +import synapseclient + +synapseclient.login() + +# Generate a manifest from your Synapse download list +manifest_path = Project.generate_download_list_manifest( + download_path="/path/to/save/manifest" +) +print(f"Manifest downloaded to: {manifest_path}") +``` + +### Custom CSV Formatting + +You can customize the manifest format: + +```python +from synapseclient.models import Project +import synapseclient + +synapseclient.login() + +# Generate a tab-separated manifest +manifest_path = Project.generate_download_list_manifest( + download_path="/path/to/save/manifest", + csv_separator="\t", # Tab-separated + include_header=True +) +``` + +### Using DownloadListManifestRequest Directly + +For more control over the manifest generation process, use the `DownloadListManifestRequest` class directly: + +```python +from synapseclient.models import DownloadListManifestRequest, CsvTableDescriptor +import synapseclient + +synapseclient.login() + +# Create a request with custom CSV formatting +request = DownloadListManifestRequest( + csv_table_descriptor=CsvTableDescriptor( + separator="\t", + quote_character='"', + is_first_line_header=True + ) +) + +# Send the job and wait for completion +request.send_job_and_wait() + +# Download the generated manifest +manifest_path = request.download_manifest(download_path="/path/to/download") +print(f"Manifest file handle: {request.result_file_handle_id}") +``` + +## Best Practices + +1. **Use `generate_manifest="root"` for simple cases** - Creates a single manifest at the root level, easier to manage. + +2. **Use `generate_manifest="all"` for complex hierarchies** - Creates manifests at each directory level, useful for large projects with many subdirectories. + +3. **Validate manifests before upload** - Use `validate_manifest()` or `dry_run=True` to catch errors early. + +4. 
**Include provenance information** - Set `include_activity=True` when syncing to capture provenance in the manifest. + +5. **Backup your manifest** - The manifest is a valuable record of your data and its metadata. + +## Async API + +All manifest operations are available as async methods: + +```python +import asyncio +from synapseclient.models import Project +import synapseclient + +async def main(): + synapseclient.login() + + # Async sync with manifest + project = Project(id="syn123456") + await project.sync_from_synapse_async( + path="/path/to/download", + generate_manifest="root" + ) + + # Async manifest generation + manifest_path = await project.generate_manifest_async( + path="/path/to/download", + manifest_scope="root" + ) + + # Async upload from manifest + files = await Project.from_manifest_async( + manifest_path="/path/to/manifest.tsv", + parent_id="syn123456" + ) + +asyncio.run(main()) +``` + +## See Also + +- [Download Data in Bulk](download_data_in_bulk.md) +- [Upload Data in Bulk](upload_data_in_bulk.md) +- [Manifest TSV Format](../../explanations/manifest_tsv.md) diff --git a/synapseclient/models/__init__.py b/synapseclient/models/__init__.py index 554de0bc2..23cd1a96a 100644 --- a/synapseclient/models/__init__.py +++ b/synapseclient/models/__init__.py @@ -68,7 +68,11 @@ WikiOrderHint, WikiPage, ) - +from synapseclient.models.download_list import DownloadListManifestRequest +from synapseclient.models.mixins.manifest import ( + DEFAULT_GENERATED_MANIFEST_KEYS, + MANIFEST_FILENAME, +) __all__ = [ "Activity", "UsedURL", @@ -153,6 +157,11 @@ # Form models "FormGroup", "FormData", + # Manifest constants + "MANIFEST_FILENAME", + "DEFAULT_GENERATED_MANIFEST_KEYS", + # Download List models + "DownloadListManifestRequest", ] # Static methods to expose as functions diff --git a/synapseclient/models/download_list.py b/synapseclient/models/download_list.py new file mode 100644 index 000000000..e1c0eb866 --- /dev/null +++ b/synapseclient/models/download_list.py 
@@ -0,0 +1,224 @@ +"""Models for interacting with Synapse's Download List functionality. + +This module provides classes for generating manifest files from a user's download list +using the Synapse Asynchronous Job service. + +See: https://rest-docs.synapse.org/rest/POST/download/list/manifest/async/start.html +""" + +from dataclasses import dataclass, field +from typing import Any, Dict, Optional + +from typing_extensions import Self + +from synapseclient import Synapse +from synapseclient.core.async_utils import async_to_sync, otel_trace_method +from synapseclient.core.constants.concrete_types import DOWNLOAD_LIST_MANIFEST_REQUEST +from synapseclient.core.download import download_by_file_handle +from synapseclient.core.utils import delete_none_keys +from synapseclient.models.mixins.asynchronous_job import AsynchronousCommunicator +from synapseclient.models.protocols.download_list_protocol import ( + DownloadListManifestRequestSynchronousProtocol, +) +from synapseclient.models.table_components import CsvTableDescriptor + + +@dataclass +@async_to_sync +class DownloadListManifestRequest( + DownloadListManifestRequestSynchronousProtocol, AsynchronousCommunicator +): + """ + A request to generate a manifest file (CSV) of the current user's download list. + + This class uses the Synapse Asynchronous Job service to generate a manifest file + containing metadata about files in the user's download list. The manifest can be + used to download files or for record-keeping purposes. + + See: https://rest-docs.synapse.org/rest/org/sagebionetworks/repo/model/download/DownloadListManifestRequest.html + + Attributes: + csv_table_descriptor: Optional CSV formatting options for the manifest. + result_file_handle_id: The file handle ID of the generated manifest (populated after completion). 
+ + Example: Generate a manifest from download list + Generate a CSV manifest from your download list: + + from synapseclient.models import DownloadListManifestRequest + import synapseclient + + synapseclient.login() + + # Create and send the request + request = DownloadListManifestRequest() + request.send_job_and_wait() + + print(f"Manifest file handle: {request.result_file_handle_id}") + + Example: Generate manifest with custom CSV formatting + Use custom separator and quote characters: + + from synapseclient.models import DownloadListManifestRequest, CsvTableDescriptor + import synapseclient + + synapseclient.login() + + request = DownloadListManifestRequest( + csv_table_descriptor=CsvTableDescriptor( + separator="\t", # Tab-separated + is_first_line_header=True + ) + ) + request.send_job_and_wait() + """ + + concrete_type: str = field( + default=DOWNLOAD_LIST_MANIFEST_REQUEST, repr=False, compare=False + ) + """The concrete type of this request.""" + + csv_table_descriptor: Optional[CsvTableDescriptor] = None + """Optional CSV formatting options for the manifest file.""" + + result_file_handle_id: Optional[str] = None + """The file handle ID of the generated manifest file. Populated after the job completes.""" + + def to_synapse_request(self) -> Dict[str, Any]: + """ + Convert this request to the format expected by the Synapse REST API. + + Returns: + A dictionary containing the request body for the Synapse API. + """ + request = { + "concreteType": self.concrete_type, + } + if self.csv_table_descriptor: + request[ + "csvTableDescriptor" + ] = self.csv_table_descriptor.to_synapse_request() + delete_none_keys(request) + return request + + def fill_from_dict(self, synapse_response: Dict[str, Any]) -> Self: + """ + Populate this object from a Synapse REST API response. + + Arguments: + synapse_response: The response from the REST API. + + Returns: + This object with fields populated from the response. 
+ """ + self.result_file_handle_id = synapse_response.get("resultFileHandleId", None) + return self + + @otel_trace_method( + method_to_trace_name=lambda self, **kwargs: "DownloadListManifestRequest_send_job_and_wait" + ) + async def send_job_and_wait_async( + self, + post_exchange_args: Optional[Dict[str, Any]] = None, + timeout: int = 120, + *, + synapse_client: Optional[Synapse] = None, + ) -> Self: + """Send the job to the Asynchronous Job service and wait for it to complete. + + This method sends the manifest generation request to Synapse and waits + for the job to complete. After completion, the `result_file_handle_id` + attribute will be populated. + + Arguments: + post_exchange_args: Additional arguments to pass to the request. + timeout: The number of seconds to wait for the job to complete or progress + before raising a SynapseTimeoutError. Defaults to 120. + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last created + instance from the Synapse class constructor. + + Returns: + This instance with `result_file_handle_id` populated. + + Raises: + SynapseTimeoutError: If the job does not complete within the timeout. + SynapseError: If the job fails. 
+ + Example: Generate a manifest + Generate a manifest from the download list: + + from synapseclient.models import DownloadListManifestRequest + import synapseclient + + synapseclient.login() + + request = DownloadListManifestRequest() + request.send_job_and_wait() + print(f"Manifest file handle: {request.result_file_handle_id}") + """ + return await super().send_job_and_wait_async( + post_exchange_args=post_exchange_args, + timeout=timeout, + synapse_client=synapse_client, + ) + + @otel_trace_method( + method_to_trace_name=lambda self, **kwargs: "DownloadListManifestRequest_download_manifest" + ) + async def download_manifest_async( + self, + download_path: str, + *, + synapse_client: Optional[Synapse] = None, + ) -> str: + """ + Download the generated manifest file to a local path. + + This method should be called after `send_job_and_wait()` has completed + successfully and `result_file_handle_id` is populated. + + Arguments: + download_path: The local directory path where the manifest will be saved. + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last created + instance from the Synapse class constructor. + + Returns: + The full path to the downloaded manifest file. + + Raises: + ValueError: If the manifest has not been generated yet (no result_file_handle_id). + + Example: Download the manifest after generation + Generate and download a manifest: + + from synapseclient.models import DownloadListManifestRequest + import synapseclient + + synapseclient.login() + + request = DownloadListManifestRequest() + request.send_job_and_wait() + + manifest_path = request.download_manifest(download_path="/path/to/download") + print(f"Manifest downloaded to: {manifest_path}") + """ + if not self.result_file_handle_id: + raise ValueError( + "Manifest has not been generated yet. " + "Call send_job_and_wait() before downloading." 
+ ) + + # Download the file handle using the download module + # For download list manifests, the synapse_id parameter is set to the file handle ID + # because these manifests are not associated with a specific entity. The download + # service handles this case by using the file handle directly. + downloaded_path = await download_by_file_handle( + file_handle_id=self.result_file_handle_id, + synapse_id=self.result_file_handle_id, + entity_type="FileEntity", + destination=download_path, + synapse_client=synapse_client, + ) + + return downloaded_path diff --git a/synapseclient/models/folder.py b/synapseclient/models/folder.py index a0658f521..45ba51f6d 100644 --- a/synapseclient/models/folder.py +++ b/synapseclient/models/folder.py @@ -39,7 +39,7 @@ VirtualTable, ) - +from synapseclient.models.mixins.manifest import ManifestGeneratable @dataclass() @async_to_sync class Folder( @@ -47,6 +47,7 @@ class Folder( AccessControllable, StorableContainer, ContainerEntityJSONSchema, + ManifestGeneratable, ): """Folder is a hierarchical container for organizing data in Synapse. 
diff --git a/synapseclient/models/mixins/__init__.py b/synapseclient/models/mixins/__init__.py index 62ddcf017..1467a9ed3 100644 --- a/synapseclient/models/mixins/__init__.py +++ b/synapseclient/models/mixins/__init__.py @@ -21,7 +21,11 @@ ValidationException, ) from synapseclient.models.mixins.storable_container import StorableContainer - +from synapseclient.models.mixins.manifest import ( + DEFAULT_GENERATED_MANIFEST_KEYS, + MANIFEST_FILENAME, + ManifestGeneratable, +) __all__ = [ "AccessControllable", "StorableContainer", @@ -40,4 +44,7 @@ "FormChangeRequest", "FormSubmissionStatus", "StateEnum", + "ManifestGeneratable", + "MANIFEST_FILENAME", + "DEFAULT_GENERATED_MANIFEST_KEYS", ] diff --git a/synapseclient/models/mixins/manifest.py b/synapseclient/models/mixins/manifest.py new file mode 100644 index 000000000..785a9c7b9 --- /dev/null +++ b/synapseclient/models/mixins/manifest.py @@ -0,0 +1,950 @@ +"""Mixin for objects that can generate and read manifest TSV files.""" + +import csv +import datetime +import io +import os +import re +import sys +from typing import TYPE_CHECKING, Any, Dict, List, Optional, Tuple, Union + +from synapseclient import Synapse +from synapseclient.core import utils +from synapseclient.core.async_utils import async_to_sync, otel_trace_method +from synapseclient.core.utils import is_synapse_id_str, is_url, topolgical_sort +from synapseclient.models.protocols.manifest_protocol import ( + ManifestGeneratableSynchronousProtocol, +) + +if TYPE_CHECKING: + from synapseclient.models import File + +# When new fields are added to the manifest they will also need to be added to +# file.py#_determine_fields_to_ignore_in_merge +REQUIRED_FIELDS = ["path", "parent"] +FILE_CONSTRUCTOR_FIELDS = ["name", "id", "synapseStore", "contentType"] +STORE_FUNCTION_FIELDS = ["activityName", "activityDescription", "forceVersion"] +PROVENANCE_FIELDS = ["used", "executed"] +MANIFEST_FILENAME = "SYNAPSE_METADATA_MANIFEST.tsv" +DEFAULT_GENERATED_MANIFEST_KEYS = [ + 
"path", + "parent", + "name", + "id", + "synapseStore", + "contentType", + "used", + "executed", + "activityName", + "activityDescription", +] +ARRAY_BRACKET_PATTERN = re.compile(r"^\[.*\]$") +SINGLE_OPEN_BRACKET_PATTERN = re.compile(r"^\[") +SINGLE_CLOSING_BRACKET_PATTERN = re.compile(r"\]$") +# https://stackoverflow.com/questions/18893390/splitting-on-comma-outside-quotes +COMMAS_OUTSIDE_DOUBLE_QUOTES_PATTERN = re.compile(r",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)") + + +def _manifest_filename(path: str) -> str: + """Get the full path to the manifest file. + + Arguments: + path: The directory where the manifest file will be created. + + Returns: + The full path to the manifest file. + """ + return os.path.join(path, MANIFEST_FILENAME) + + +def _convert_manifest_data_items_to_string_list( + items: List[Union[str, datetime.datetime, bool, int, float]], +) -> str: + """ + Handle converting an individual key that contains a possible list of data into a + list of strings or objects that can be written to the manifest file. + + This has specific logic around how to handle datetime fields. + + When working with datetime fields we are printing the ISO 8601 UTC representation of + the datetime. + + When working with non strings we are printing the non-quoted version of the object. + + Example: Examples + Several examples of how this function works. 
+ + >>> _convert_manifest_data_items_to_string_list(["a", "b", "c"]) + '[a,b,c]' + >>> _convert_manifest_data_items_to_string_list(["string,with,commas", "string without commas"]) + '["string,with,commas",string without commas]' + >>> _convert_manifest_data_items_to_string_list(["string,with,commas"]) + 'string,with,commas' + >>> _convert_manifest_data_items_to_string_list( + [datetime.datetime(2020, 1, 1, 0, 0, 0, 0, tzinfo=datetime.timezone.utc)]) + '2020-01-01T00:00:00Z' + >>> _convert_manifest_data_items_to_string_list([True]) + 'True' + >>> _convert_manifest_data_items_to_string_list([1]) + '1' + >>> _convert_manifest_data_items_to_string_list([1.0]) + '1.0' + >>> _convert_manifest_data_items_to_string_list( + [datetime.datetime(2020, 1, 1, 0, 0, 0, 0, tzinfo=datetime.timezone.utc), + datetime.datetime(2021, 1, 1, 0, 0, 0, 0, tzinfo=datetime.timezone.utc)]) + '[2020-01-01T00:00:00Z,2021-01-01T00:00:00Z]' + + + Args: + items: The list of items to convert. + + Returns: + The list of items converted to strings. + """ + items_to_write = [] + for item in items: + if isinstance(item, datetime.datetime): + items_to_write.append( + utils.datetime_to_iso(dt=item, include_milliseconds_if_zero=False) + ) + else: + # If a string based annotation has a comma in it + # this will wrap the string in quotes so it won't be parsed + # as multiple values. 
For example this is an annotation with 2 values: + # [my first annotation, "my, second, annotation"] + # This is an annotation with 4 values: + # [my first annotation, my, second, annotation] + if isinstance(item, str): + if len(items) > 1 and "," in item: + items_to_write.append(f'"{item}"') + else: + items_to_write.append(item) + else: + items_to_write.append(repr(item)) + + if len(items_to_write) > 1: + return f'[{",".join(items_to_write)}]' + elif len(items_to_write) == 1: + return items_to_write[0] + else: + return "" + + +def _convert_manifest_data_row_to_dict(row: dict, keys: List[str]) -> dict: + """ + Convert a row of data to a dict that can be written to a manifest file. + + Args: + row: The row of data to convert. + keys: The keys of the manifest. Used to select the rows of data. + + Returns: + The dict representation of the row. + """ + data_to_write = {} + for key in keys: + data_for_key = row.get(key, "") + if isinstance(data_for_key, list): + items_to_write = _convert_manifest_data_items_to_string_list(data_for_key) + data_to_write[key] = items_to_write + else: + data_to_write[key] = data_for_key + return data_to_write + + +def _write_manifest_data(filename: str, keys: List[str], data: List[dict]) -> None: + """ + Write a number of keys and a list of data to a manifest file. This will write + the data out as a tab separated file. + + For the data we are writing to the TSV file we are not quoting the content with any + characters. This is because the syncToSynapse function does not require strings to + be quoted. When quote characters were included extra double quotes were being added + to the strings when they were written to the manifest file. This was not causing + errors, however, it was changing the content of the manifest file when changes + were not required. + + Args: + filename: The name of the file to write to. + keys: The keys of the manifest. + data: The data to write to the manifest. 
This should be a list of dicts where + each dict represents a row of data. + """ + with io.open(filename, "w", encoding="utf8") if filename else sys.stdout as fp: + csv_writer = csv.DictWriter( + fp, + keys, + restval="", + extrasaction="ignore", + delimiter="\t", + quotechar=None, + quoting=csv.QUOTE_NONE, + ) + csv_writer.writeheader() + for row in data: + csv_writer.writerow(rowdict=_convert_manifest_data_row_to_dict(row, keys)) + + +def _extract_entity_metadata_for_file( + all_files: List["File"], +) -> Tuple[List[str], List[Dict[str, str]]]: + """ + Extracts metadata from the list of File Entities and returns them in a form + usable by csv.DictWriter. + + Arguments: + all_files: an iterable that provides File entities + + Returns: + keys: a list of column headers + data: a list of dicts containing data from each row + """ + keys = list(DEFAULT_GENERATED_MANIFEST_KEYS) + annotation_keys = set() + data = [] + for entity in all_files: + row = { + "parent": entity.parent_id, + "path": entity.path, + "name": entity.name, + "id": entity.id, + "synapseStore": entity.synapse_store, + "contentType": entity.content_type, + } + + if entity.annotations: + annotation_keys.update(set(entity.annotations.keys())) + row.update( + { + key: (val if len(val) > 0 else "") + for key, val in entity.annotations.items() + } + ) + + row_provenance = _get_entity_provenance_dict_for_file(entity=entity) + row.update(row_provenance) + + data.append(row) + keys.extend(annotation_keys) + return keys, data + + +def _get_entity_provenance_dict_for_file(entity: "File") -> Dict[str, str]: + """ + Arguments: + entity: File entity object + + Returns: + dict: a dict with a subset of the provenance metadata for the entity. + An empty dict is returned if the metadata does not have a provenance record. 
+ """ + if not entity.activity: + return {} + + used_activities = [] + for used_activity in entity.activity.used: + used_activities.append(used_activity.format_for_manifest()) + + executed_activities = [] + for executed_activity in entity.activity.executed: + executed_activities.append(executed_activity.format_for_manifest()) + + return { + "used": ";".join(used_activities), + "executed": ";".join(executed_activities), + "activityName": entity.activity.name or "", + "activityDescription": entity.activity.description or "", + } + + +def _validate_manifest_required_fields( + manifest_path: str, +) -> Tuple[bool, List[str]]: + """ + Validate that a manifest file exists and has the required fields. + + Args: + manifest_path: Path to the manifest file. + + Returns: + Tuple of (is_valid, list_of_error_messages). + """ + errors = [] + + if not os.path.isfile(manifest_path): + errors.append(f"Manifest file not found: {manifest_path}") + return (False, errors) + + try: + with io.open(manifest_path, "r", encoding="utf8") as fp: + reader = csv.DictReader(fp, delimiter="\t") + headers = reader.fieldnames or [] + + # Check for required fields + for field in REQUIRED_FIELDS: + if field not in headers: + errors.append(f"Missing required field: {field}") + + # Validate each row + row_num = 1 + for row in reader: + row_num += 1 + path = row.get("path", "") + parent = row.get("parent", "") + + if not path: + errors.append(f"Row {row_num}: 'path' is empty") + + if not parent: + errors.append(f"Row {row_num}: 'parent' is empty") + elif not is_synapse_id_str(parent) and not is_url(parent): + errors.append( + f"Row {row_num}: 'parent' is not a valid Synapse ID: {parent}" + ) + + # Check if path exists (skip URLs) + if path and not is_url(path): + expanded_path = os.path.abspath( + os.path.expandvars(os.path.expanduser(path)) + ) + if not os.path.isfile(expanded_path): + errors.append(f"Row {row_num}: File not found: {path}") + + except Exception as e: + errors.append(f"Error reading 
manifest file: {str(e)}") + + return (len(errors) == 0, errors) + + +@async_to_sync +class ManifestGeneratable(ManifestGeneratableSynchronousProtocol): + """ + Mixin for objects that can generate and read manifest TSV files. + + In order to use this mixin, the class must have the following attributes: + + - `id` + - `name` + - `_synced_from_synapse` + + The class must also inherit from `StorableContainer` mixin which provides: + + - `flatten_file_list()` + - `map_directory_to_all_contained_files()` + """ + + id: Optional[str] = None + name: Optional[str] = None + _synced_from_synapse: bool = False + + @otel_trace_method( + method_to_trace_name=lambda self, **kwargs: f"{self.__class__.__name__}_generate_manifest: {self.id}" + ) + async def generate_manifest_async( + self, + path: str, + manifest_scope: str = "all", + *, + synapse_client: Optional[Synapse] = None, + ) -> Optional[str]: + """ + Generate a manifest TSV file for all files in this container. + + This method should be called after `sync_from_synapse()` to generate + a manifest of all downloaded files with their metadata. + + Arguments: + path: The directory where the manifest file(s) will be written. + manifest_scope: Controls manifest file generation: + + - "all": Create a manifest in each directory level + - "root": Create a single manifest at the root path only + - "suppress": Do not create any manifest files + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last created + instance from the Synapse class constructor. + + Returns: + The path to the root manifest file if created, or None if suppressed. + + Raises: + ValueError: If the container has not been synced from Synapse. + ValueError: If manifest_scope is not one of 'all', 'root', 'suppress'. 
+ + Example: Generate manifest after sync + Generate a manifest file after syncing from Synapse: + + from synapseclient.models import Project + + import synapseclient + synapseclient.login() + + project = Project(id="syn123").sync_from_synapse( + path="/path/to/download" + ) + manifest_path = project.generate_manifest( + path="/path/to/download", + manifest_scope="root" + ) + print(f"Manifest created at: {manifest_path}") + """ + if manifest_scope not in ("all", "root", "suppress"): + raise ValueError( + 'Value of manifest_scope should be one of ("all", "root", "suppress")' + ) + + if manifest_scope == "suppress": + return None + + if not self._synced_from_synapse: + raise ValueError( + "Container has not been synced from Synapse. " + "Call sync_from_synapse() before generating a manifest." + ) + + syn = Synapse.get_client(synapse_client=synapse_client) + + # Expand the path + path = os.path.expanduser(path) if path else None + if not path: + raise ValueError("A path must be provided to generate a manifest.") + + # Get all files from this container + all_files = self.flatten_file_list() + + if not all_files: + syn.logger.info( + f"[{self.id}:{self.name}]: No files found in container, " + "skipping manifest generation." 
+ ) + return None + + root_manifest_path = None + + if manifest_scope == "root": + # Generate a single manifest at the root + keys, data = _extract_entity_metadata_for_file(all_files=all_files) + manifest_path = _manifest_filename(path) + _write_manifest_data(manifest_path, keys, data) + root_manifest_path = manifest_path + syn.logger.info( + f"[{self.id}:{self.name}]: Created manifest at {manifest_path}" + ) + elif manifest_scope == "all": + # Generate a manifest at each directory level + directory_map = self.map_directory_to_all_contained_files(root_path=path) + + for directory_path, files_in_directory in directory_map.items(): + if files_in_directory: + keys, data = _extract_entity_metadata_for_file( + all_files=files_in_directory + ) + manifest_path = _manifest_filename(directory_path) + _write_manifest_data(manifest_path, keys, data) + + # Track the root manifest path + if directory_path == path: + root_manifest_path = manifest_path + + syn.logger.info( + f"[{self.id}:{self.name}]: Created manifest at {manifest_path}" + ) + + return root_manifest_path + + @otel_trace_method( + method_to_trace_name=lambda self, **kwargs: f"{self.__class__.__name__}_get_manifest_data: {self.id}" + ) + async def get_manifest_data_async( + self, + *, + synapse_client: Optional[Synapse] = None, + ) -> Tuple[List[str], List[Dict[str, str]]]: + """ + Get manifest data for all files in this container. + + This method extracts metadata from all files that have been synced + to this container. The data can be used to generate a manifest file + or for other purposes. + + Arguments: + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last created + instance from the Synapse class constructor. + + Returns: + Tuple of (keys, data) where keys is a list of column headers + and data is a list of dictionaries, one per file, containing + the file metadata. 
+ + Raises: + ValueError: If the container has not been synced from Synapse. + + Example: Get manifest data + Get manifest data for all files in a project: + + from synapseclient.models import Project + + import synapseclient + synapseclient.login() + + project = Project(id="syn123").sync_from_synapse( + path="/path/to/download" + ) + keys, data = project.get_manifest_data() + for row in data: + print(f"File: {row['name']} at {row['path']}") + """ + if not self._synced_from_synapse: + raise ValueError( + "Container has not been synced from Synapse. " + "Call sync_from_synapse() before getting manifest data." + ) + + all_files = self.flatten_file_list() + return _extract_entity_metadata_for_file(all_files=all_files) + + @classmethod + @otel_trace_method( + method_to_trace_name=lambda cls, **kwargs: f"{cls.__name__}_from_manifest" + ) + async def from_manifest_async( + cls, + manifest_path: str, + parent_id: str, + dry_run: bool = False, + merge_existing_annotations: bool = True, + associate_activity_to_new_version: bool = False, + *, + synapse_client: Optional[Synapse] = None, + ) -> List["File"]: + """ + Upload files to Synapse from a manifest TSV file. + + This method reads a manifest TSV file and uploads all files defined in it + to Synapse. The manifest file must contain at minimum the 'path' and 'parent' + columns. + + Arguments: + manifest_path: Path to the manifest TSV file. + parent_id: The Synapse ID of the parent container (Project or Folder) + where files will be uploaded if not specified in the manifest. + dry_run: If True, validate the manifest but do not upload. + merge_existing_annotations: If True, merge annotations with existing + annotations on the file. If False, replace existing annotations. + associate_activity_to_new_version: If True, copy the activity + (provenance) from the previous version to the new version. 
+ synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last created + instance from the Synapse class constructor. + + Returns: + List of File objects that were uploaded. + + Raises: + ValueError: If the manifest file does not exist. + ValueError: If the manifest file is missing required fields. + ValueError: If a local file path in the manifest does not exist. + + Example: Upload files from a manifest + Upload files from a manifest TSV file: + + from synapseclient.models import Project + + import synapseclient + synapseclient.login() + + files = Project.from_manifest( + manifest_path="/path/to/manifest.tsv", + parent_id="syn123" + ) + for file in files: + print(f"Uploaded: {file.name} ({file.id})") + + Example: Dry run validation + Validate a manifest without uploading: + + from synapseclient.models import Project + + import synapseclient + synapseclient.login() + + files = Project.from_manifest( + manifest_path="/path/to/manifest.tsv", + parent_id="syn123", + dry_run=True + ) + print("Manifest is valid, ready for upload") + """ + from synapseclient.models import Activity, File + + syn = Synapse.get_client(synapse_client=synapse_client) + + # Validate the manifest + is_valid, errors = _validate_manifest_required_fields(manifest_path) + if not is_valid: + raise ValueError( + "Invalid manifest file:\n" + "\n".join(f" - {e}" for e in errors) + ) + + # Read the manifest + rows = [] + with io.open(manifest_path, "r", encoding="utf8") as fp: + reader = csv.DictReader(fp, delimiter="\t") + for row in reader: + rows.append(row) + + if dry_run: + syn.logger.info( + f"Dry run: {len(rows)} files would be uploaded from manifest" + ) + return [] + + # Build dependency graph for provenance ordering + path_to_row = {} + upload_order = {} + + for row in rows: + path = row.get("path", "") + if path and not is_url(path): + path = os.path.abspath(os.path.expandvars(os.path.expanduser(path))) + path_to_row[path] = row + + # 
Collect provenance references + all_refs = [] + used = row.get("used", "") + if used and used.strip(): + for item in used.split(";"): + item = item.strip() + if item: + if os.path.isfile( + os.path.abspath( + os.path.expandvars(os.path.expanduser(item)) + ) + ): + all_refs.append( + os.path.abspath( + os.path.expandvars(os.path.expanduser(item)) + ) + ) + + executed = row.get("executed", "") + if executed and executed.strip(): + for item in executed.split(";"): + item = item.strip() + if item: + if os.path.isfile( + os.path.abspath( + os.path.expandvars(os.path.expanduser(item)) + ) + ): + all_refs.append( + os.path.abspath( + os.path.expandvars(os.path.expanduser(item)) + ) + ) + + upload_order[path] = all_refs + + # Topologically sort based on provenance dependencies + sorted_paths = topolgical_sort(upload_order) + sorted_paths = [p[0] for p in sorted_paths] + + # Track uploaded files for provenance resolution + path_to_synapse_id: Dict[str, str] = {} + uploaded_files: List["File"] = [] + + for path in sorted_paths: + row = path_to_row[path] + + # Get parent - use manifest value or fall back to provided parent_id + file_parent = row.get("parent", "").strip() or parent_id + + # Build the File object + file = File( + path=path, + parent_id=file_parent, + name=row.get("name", "").strip() or None, + id=row.get("id", "").strip() or None, + synapse_store=( + row.get("synapseStore", "").strip().lower() != "false" + if row.get("synapseStore", "").strip() + else True + ), + content_type=row.get("contentType", "").strip() or None, + merge_existing_annotations=merge_existing_annotations, + associate_activity_to_new_version=associate_activity_to_new_version, + ) + + # Build annotations from extra columns + annotations = {} + skip_keys = set( + REQUIRED_FIELDS + + FILE_CONSTRUCTOR_FIELDS + + STORE_FUNCTION_FIELDS + + PROVENANCE_FIELDS + ) + for key, value in row.items(): + if key not in skip_keys and value and value.strip(): + annotations[key] = 
_parse_manifest_value(value.strip()) + if annotations: + file.annotations = annotations + + # Build provenance/activity + used_items = [] + executed_items = [] + + used_str = row.get("used", "") + if used_str and used_str.strip(): + for item in used_str.split(";"): + item = item.strip() + if item: + used_items.append( + _resolve_provenance_item(item, path_to_synapse_id) + ) + + executed_str = row.get("executed", "") + if executed_str and executed_str.strip(): + for item in executed_str.split(";"): + item = item.strip() + if item: + executed_items.append( + _resolve_provenance_item(item, path_to_synapse_id) + ) + + if used_items or executed_items: + activity = Activity( + name=row.get("activityName", "").strip() or None, + description=row.get("activityDescription", "").strip() or None, + used=used_items, + executed=executed_items, + ) + file.activity = activity + + # Upload the file + file = await file.store_async(synapse_client=syn) + + # Track for provenance resolution + path_to_synapse_id[path] = file.id + uploaded_files.append(file) + + syn.logger.info(f"Uploaded: {file.name} ({file.id})") + + return uploaded_files + + @staticmethod + @otel_trace_method(method_to_trace_name=lambda **kwargs: "validate_manifest") + async def validate_manifest_async( + manifest_path: str, + *, + synapse_client: Optional[Synapse] = None, + ) -> Tuple[bool, List[str]]: + """ + Validate a manifest TSV file without uploading. + + This method validates a manifest file to ensure it is properly formatted + and all paths exist. + + Arguments: + manifest_path: Path to the manifest TSV file. + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last created + instance from the Synapse class constructor. + + Returns: + Tuple of (is_valid, list_of_error_messages). If the manifest is valid, + is_valid will be True and the list will be empty. 
+ + Example: Validate a manifest file + Validate a manifest file before uploading: + + from synapseclient.models import Project + + is_valid, errors = Project.validate_manifest( + manifest_path="/path/to/manifest.tsv" + ) + if is_valid: + print("Manifest is valid") + else: + for error in errors: + print(f"Error: {error}") + """ + return _validate_manifest_required_fields(manifest_path) + + @staticmethod + async def generate_download_list_manifest_async( + download_path: str, + csv_separator: str = ",", + include_header: bool = True, + timeout: int = 120, + *, + synapse_client: Optional[Synapse] = None, + ) -> str: + """ + Generate a manifest file from the current user's download list using the + Synapse REST API. + + This method creates a CSV manifest containing metadata about all files in + the user's download list. The manifest is generated server-side by Synapse + and then downloaded to the specified path. + + This is interoperable with the Synapse download list feature and provides + a way to export the download list as a manifest file that can be used for + bulk operations. + + Arguments: + download_path: The local directory path where the manifest will be saved. + csv_separator: The delimiter character for the CSV file. + Defaults to "," for comma-separated values. Use "\\t" for tab-separated. + include_header: Whether to include column headers in the first row. + Defaults to True. + timeout: The number of seconds to wait for the job to complete. + Defaults to 120 seconds. + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last created + instance from the Synapse class constructor. + + Returns: + The full path to the downloaded manifest file. 
+ + Example: Generate manifest from download list + Generate a manifest from your Synapse download list: + + from synapseclient.models import Project + + import synapseclient + synapseclient.login() + + # Generate manifest from download list + manifest_path = Project.generate_download_list_manifest( + download_path="/path/to/download" + ) + print(f"Manifest downloaded to: {manifest_path}") + + Example: Generate tab-separated manifest + Generate a TSV manifest from your download list: + + from synapseclient.models import Project + + import synapseclient + synapseclient.login() + + manifest_path = Project.generate_download_list_manifest( + download_path="/path/to/download", + csv_separator="\t" + ) + + See Also: + - `DownloadListManifestRequest`: The underlying request class for more + fine-grained control over the manifest generation process. + """ + from synapseclient.models.download_list import DownloadListManifestRequest + from synapseclient.models.table_components import CsvTableDescriptor + + # Create the request with CSV formatting options + request = DownloadListManifestRequest( + csv_table_descriptor=CsvTableDescriptor( + separator=csv_separator, + is_first_line_header=include_header, + ) + ) + + # Send the job and wait for completion + await request.send_job_and_wait_async( + timeout=timeout, + synapse_client=synapse_client, + ) + + # Download the manifest + manifest_file_path = await request.download_manifest_async( + download_path=download_path, + synapse_client=synapse_client, + ) + + return manifest_file_path + + +def _resolve_provenance_item( + item: str, + path_to_synapse_id: Dict[str, str], +) -> Any: + """ + Resolve a provenance item to a UsedEntity or UsedURL. + + Args: + item: The provenance item string (could be a path, Synapse ID, or URL). + path_to_synapse_id: Mapping of local file paths to their Synapse IDs. + + Returns: + UsedEntity or UsedURL object. 
+ """ + from synapseclient.models import UsedEntity, UsedURL + + # Check if it's a local file path that was uploaded + expanded_path = os.path.abspath(os.path.expandvars(os.path.expanduser(item))) + if expanded_path in path_to_synapse_id: + return UsedEntity(target_id=path_to_synapse_id[expanded_path]) + + # Check if it's a URL + if is_url(item): + return UsedURL(url=item) + + # Check if it's a Synapse ID + if is_synapse_id_str(item): + return UsedEntity(target_id=item) + + # Otherwise, fall back to treating the item as a Synapse ID + return UsedEntity(target_id=item) + + +def _parse_manifest_value(value: str) -> Any: + """ + Parse a manifest cell value into an appropriate Python type. + + Handles: + - List syntax: [a,b,c] -> ['a', 'b', 'c'] + - Boolean strings: 'true', 'false' -> True, False + - Numeric strings: '123' -> 123, '1.5' -> 1.5 + - Everything else: returned as string + + Args: + value: The string value from the manifest. + + Returns: + The parsed value. + """ + # Check for list syntax + if ARRAY_BRACKET_PATTERN.match(value): + # Remove brackets + inner = value[1:-1] + # Split on commas outside quotes + items = COMMAS_OUTSIDE_DOUBLE_QUOTES_PATTERN.split(inner) + result = [] + for item in items: + item = item.strip() + # Remove surrounding quotes if present + if item.startswith('"') and item.endswith('"'): + item = item[1:-1] + result.append(item) + return result + + # Check for boolean + if value.lower() == "true": + return True + if value.lower() == "false": + return False + + # Check for integer + try: + return int(value) + except ValueError: + pass + + # Check for float + try: + return float(value) + except ValueError: + pass + + # Return as string + return value diff --git a/synapseclient/models/project.py b/synapseclient/models/project.py index a1a6a1c21..50f8d0d7f 100644 --- a/synapseclient/models/project.py +++ b/synapseclient/models/project.py @@ -37,7 +37,7 @@ Table, VirtualTable, ) - +from synapseclient.models.mixins.manifest import ManifestGeneratable @dataclass() 
@async_to_sync @@ -46,6 +46,7 @@ class Project( AccessControllable, StorableContainer, ContainerEntityJSONSchema, + ManifestGeneratable, ): """A Project is a top-level container for organizing data in Synapse. diff --git a/synapseclient/models/protocols/manifest_protocol.py b/synapseclient/models/protocols/manifest_protocol.py new file mode 100644 index 000000000..1da447da0 --- /dev/null +++ b/synapseclient/models/protocols/manifest_protocol.py @@ -0,0 +1,240 @@ +"""Protocol for the specific methods of ManifestGeneratable mixin that have +synchronous counterparts generated at runtime.""" + +from typing import Dict, List, Optional, Protocol, Tuple + +from synapseclient import Synapse + + +class ManifestGeneratableSynchronousProtocol(Protocol): + """ + The protocol for methods that are asynchronous but also + have a synchronous counterpart that may also be called. + """ + + def generate_manifest( + self, + path: str, + manifest_scope: str = "all", + *, + synapse_client: Optional[Synapse] = None, + ) -> Optional[str]: + """Generate a manifest TSV file for all files in this container. + + This method should be called after `sync_from_synapse()` to generate + a manifest of all downloaded files with their metadata. + + Arguments: + path: The directory where the manifest file(s) will be written. + manifest_scope: Controls manifest file generation: + + - "all": Create a manifest in each directory level + - "root": Create a single manifest at the root path only + - "suppress": Do not create any manifest files + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last created + instance from the Synapse class constructor. + + Returns: + The path to the root manifest file if created, or None if suppressed. + + Raises: + ValueError: If the container has not been synced from Synapse. + ValueError: If manifest_scope is not one of 'all', 'root', 'suppress'. 
+ + Example: Generate manifest after sync + Generate a manifest file after syncing from Synapse: + + from synapseclient.models import Project + + import synapseclient + synapseclient.login() + + project = Project(id="syn123").sync_from_synapse( + path="/path/to/download" + ) + manifest_path = project.generate_manifest( + path="/path/to/download", + manifest_scope="root" + ) + print(f"Manifest created at: {manifest_path}") + """ + return None + + @classmethod + def from_manifest( + cls, + manifest_path: str, + parent_id: str, + dry_run: bool = False, + merge_existing_annotations: bool = True, + associate_activity_to_new_version: bool = False, + *, + synapse_client: Optional[Synapse] = None, + ) -> List: + """Upload files to Synapse from a manifest TSV file. + + This method reads a manifest TSV file and uploads all files defined in it + to Synapse. The manifest file must contain at minimum the 'path' and 'parent' + columns. + + Arguments: + manifest_path: Path to the manifest TSV file. + parent_id: The Synapse ID of the parent container (Project or Folder) + where files will be uploaded if not specified in the manifest. + dry_run: If True, validate the manifest but do not upload. + merge_existing_annotations: If True, merge annotations with existing + annotations on the file. If False, replace existing annotations. + associate_activity_to_new_version: If True, copy the activity + (provenance) from the previous version to the new version. + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last created + instance from the Synapse class constructor. + + Returns: + List of File objects that were uploaded. 
+ + Example: Upload files from a manifest + Upload files from a manifest TSV file: + + from synapseclient.models import Project + + import synapseclient + synapseclient.login() + + files = Project.from_manifest( + manifest_path="/path/to/manifest.tsv", + parent_id="syn123" + ) + for file in files: + print(f"Uploaded: {file.name} ({file.id})") + """ + return [] + + @staticmethod + def validate_manifest( + manifest_path: str, + *, + synapse_client: Optional[Synapse] = None, + ) -> Tuple[bool, List[str]]: + """Validate a manifest TSV file without uploading. + + This method validates a manifest file to ensure it is properly formatted + and all paths exist. + + Arguments: + manifest_path: Path to the manifest TSV file. + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last created + instance from the Synapse class constructor. + + Returns: + Tuple of (is_valid, list_of_error_messages). If the manifest is valid, + is_valid will be True and the list will be empty. + + Example: Validate a manifest file + Validate a manifest file before uploading: + + from synapseclient.models import Project + + is_valid, errors = Project.validate_manifest( + manifest_path="/path/to/manifest.tsv" + ) + if is_valid: + print("Manifest is valid") + else: + for error in errors: + print(f"Error: {error}") + """ + return (True, []) + + def get_manifest_data( + self, + *, + synapse_client: Optional[Synapse] = None, + ) -> Tuple[List[str], List[Dict[str, str]]]: + """Get manifest data for all files in this container. + + This method extracts metadata from all files that have been synced + to this container. The data can be used to generate a manifest file + or for other purposes. + + Arguments: + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last created + instance from the Synapse class constructor. 
+ + Returns: + Tuple of (keys, data) where keys is a list of column headers + and data is a list of dictionaries, one per file, containing + the file metadata. + + Raises: + ValueError: If the container has not been synced from Synapse. + + Example: Get manifest data + Get manifest data for all files in a project: + + from synapseclient.models import Project + + import synapseclient + synapseclient.login() + + project = Project(id="syn123").sync_from_synapse( + path="/path/to/download" + ) + keys, data = project.get_manifest_data() + for row in data: + print(f"File: {row['name']} at {row['path']}") + """ + return ([], []) + + @staticmethod + def generate_download_list_manifest( + download_path: str, + csv_separator: str = ",", + include_header: bool = True, + timeout: int = 120, + *, + synapse_client: Optional[Synapse] = None, + ) -> str: + """Generate a manifest file from the current user's download list. + + This method creates a CSV manifest containing metadata about all files in + the user's download list. The manifest is generated server-side by Synapse + and then downloaded to the specified path. + + This is interoperable with the Synapse download list feature and provides + a way to export the download list as a manifest file that can be used for + bulk operations. + + Arguments: + download_path: The local directory path where the manifest will be saved. + csv_separator: The delimiter character for the CSV file. + Defaults to "," for comma-separated values. Use "\\t" for tab-separated. + include_header: Whether to include column headers in the first row. + Defaults to True. + timeout: The number of seconds to wait for the job to complete. + Defaults to 120 seconds. + synapse_client: If not passed in and caching was not disabled by + `Synapse.allow_client_caching(False)` this will use the last created + instance from the Synapse class constructor. + + Returns: + The full path to the downloaded manifest file. 
+ + Example: Generate manifest from download list + Generate a manifest from your Synapse download list: + + from synapseclient.models import Project + + import synapseclient + synapseclient.login() + + # Generate manifest from download list + manifest_path = Project.generate_download_list_manifest( + download_path="/path/to/download" + ) + print(f"Manifest downloaded to: {manifest_path}") + """ + return "" diff --git a/tests/unit/synapseclient/models/unit_test_manifest.py b/tests/unit/synapseclient/models/unit_test_manifest.py new file mode 100644 index 000000000..4c65ac7c3 --- /dev/null +++ b/tests/unit/synapseclient/models/unit_test_manifest.py @@ -0,0 +1,499 @@ +"""Unit tests for the synapseclient.models.mixins.manifest module.""" + +import datetime +import os +import tempfile + +import pytest + +from synapseclient.models.mixins.manifest import ( + DEFAULT_GENERATED_MANIFEST_KEYS, + MANIFEST_FILENAME, + _convert_manifest_data_items_to_string_list, + _convert_manifest_data_row_to_dict, + _extract_entity_metadata_for_file, + _get_entity_provenance_dict_for_file, + _manifest_filename, + _parse_manifest_value, + _validate_manifest_required_fields, + _write_manifest_data, +) + + +class TestManifestConstants: + """Tests for manifest constants.""" + + def test_manifest_filename_constant(self): + """Test the MANIFEST_FILENAME constant.""" + assert MANIFEST_FILENAME == "SYNAPSE_METADATA_MANIFEST.tsv" + + def test_default_manifest_keys(self): + """Test the DEFAULT_GENERATED_MANIFEST_KEYS constant.""" + expected_keys = [ + "path", + "parent", + "name", + "id", + "synapseStore", + "contentType", + "used", + "executed", + "activityName", + "activityDescription", + ] + assert DEFAULT_GENERATED_MANIFEST_KEYS == expected_keys + + +class TestManifestFilename: + """Tests for _manifest_filename function.""" + + def test_manifest_filename(self): + """Test generating manifest filename.""" + # GIVEN a path + path = "/path/to/directory" + + # WHEN we generate the manifest filename + 
result = _manifest_filename(path) + + # THEN it should be the path joined with MANIFEST_FILENAME + assert result == os.path.join(path, MANIFEST_FILENAME) + + +class TestConvertManifestDataItemsToStringList: + """Tests for _convert_manifest_data_items_to_string_list function.""" + + def test_single_string(self): + """Test converting a single string.""" + # GIVEN a list with a single string + items = ["hello"] + + # WHEN we convert to string + result = _convert_manifest_data_items_to_string_list(items) + + # THEN it should return the string directly + assert result == "hello" + + def test_multiple_strings(self): + """Test converting multiple strings.""" + # GIVEN a list with multiple strings + items = ["a", "b", "c"] + + # WHEN we convert to string + result = _convert_manifest_data_items_to_string_list(items) + + # THEN it should return a bracketed list + assert result == "[a,b,c]" + + def test_string_with_comma(self): + """Test converting a string with comma.""" + # GIVEN a single item with comma (no quotes needed for single item) + items = ["hello,world"] + + # WHEN we convert to string + result = _convert_manifest_data_items_to_string_list(items) + + # THEN it should return the string directly + assert result == "hello,world" + + def test_multiple_strings_with_comma(self): + """Test converting multiple strings where one has a comma.""" + # GIVEN multiple strings where one contains commas + items = ["string,with,commas", "string without commas"] + + # WHEN we convert to string + result = _convert_manifest_data_items_to_string_list(items) + + # THEN the comma-containing string should be quoted + assert result == '["string,with,commas",string without commas]' + + def test_datetime(self): + """Test converting a datetime.""" + # GIVEN a datetime value + dt = datetime.datetime(2020, 1, 1, 0, 0, 0, 0, tzinfo=datetime.timezone.utc) + + # WHEN we convert to string + result = _convert_manifest_data_items_to_string_list([dt]) + + # THEN it should return ISO format + assert 
result == "2020-01-01T00:00:00Z"
+
+    def test_multiple_datetimes(self):
+        """Test converting multiple datetimes."""
+        # GIVEN multiple datetime values
+        dt1 = datetime.datetime(2020, 1, 1, 0, 0, 0, 0, tzinfo=datetime.timezone.utc)
+        dt2 = datetime.datetime(2021, 1, 1, 0, 0, 0, 0, tzinfo=datetime.timezone.utc)
+
+        # WHEN we convert to string
+        result = _convert_manifest_data_items_to_string_list([dt1, dt2])
+
+        # THEN it should return a bracketed list of ISO dates
+        assert result == "[2020-01-01T00:00:00Z,2021-01-01T00:00:00Z]"
+
+    def test_boolean_true(self):
+        """Test converting True."""
+        # GIVEN a True value
+        items = [True]
+
+        # WHEN we convert to string
+        result = _convert_manifest_data_items_to_string_list(items)
+
+        # THEN it should return "True"
+        assert result == "True"
+
+    def test_boolean_false(self):
+        """Test converting False."""
+        # GIVEN a False value
+        items = [False]
+
+        # WHEN we convert to string
+        result = _convert_manifest_data_items_to_string_list(items)
+
+        # THEN it should return "False"
+        assert result == "False"
+
+    def test_integer(self):
+        """Test converting an integer."""
+        # GIVEN an integer value
+        items = [1]
+
+        # WHEN we convert to string
+        result = _convert_manifest_data_items_to_string_list(items)
+
+        # THEN it should return the string representation
+        assert result == "1"
+
+    def test_float(self):
+        """Test converting a float."""
+        # GIVEN a float value
+        items = [1.5]
+
+        # WHEN we convert to string
+        result = _convert_manifest_data_items_to_string_list(items)
+
+        # THEN it should return the string representation
+        assert result == "1.5"
+
+    def test_empty_list(self):
+        """Test converting an empty list."""
+        # GIVEN an empty list
+        items = []
+
+        # WHEN we convert to string
+        result = _convert_manifest_data_items_to_string_list(items)
+
+        # THEN it should return an empty string
+        assert result == ""
+
+
+class TestConvertManifestDataRowToDict:
+    """Tests for _convert_manifest_data_row_to_dict function."""
+
+    def test_simple_row(self):
+        """Test converting a simple row."""
+        # GIVEN a row with simple values
+        row = {"path": "/path/to/file", "name": "file.txt"}
+        keys = ["path", "name"]
+
+        # WHEN we convert it
+        result = _convert_manifest_data_row_to_dict(row, keys)
+
+        # THEN it should return the same values
+        assert result == {"path": "/path/to/file", "name": "file.txt"}
+
+    def test_row_with_list(self):
+        """Test converting a row with a list value."""
+        # GIVEN a row with a list value
+        row = {"annotations": ["a", "b", "c"]}
+        keys = ["annotations"]
+
+        # WHEN we convert it
+        result = _convert_manifest_data_row_to_dict(row, keys)
+
+        # THEN the list should be converted to a string
+        assert result == {"annotations": "[a,b,c]"}
+
+    def test_missing_key(self):
+        """Test converting a row with a missing key."""
+        # GIVEN a row missing a key
+        row = {"path": "/path/to/file"}
+        keys = ["path", "name"]
+
+        # WHEN we convert it
+        result = _convert_manifest_data_row_to_dict(row, keys)
+
+        # THEN the missing key should be empty string
+        assert result == {"path": "/path/to/file", "name": ""}
+
+
+class TestParseManifestValue:
+    """Tests for _parse_manifest_value function."""
+
+    def test_simple_string(self):
+        """Test parsing a simple string."""
+        assert _parse_manifest_value("hello") == "hello"
+
+    def test_list_syntax(self):
+        """Test parsing list syntax."""
+        assert _parse_manifest_value("[a,b,c]") == ["a", "b", "c"]
+
+    def test_list_with_quoted_string(self):
+        """Test parsing list with quoted string containing comma."""
+        result = _parse_manifest_value('["hello,world",other]')
+        assert result == ["hello,world", "other"]
+
+    def test_boolean_true(self):
+        """Test parsing 'true' string."""
+        assert _parse_manifest_value("true") is True
+        assert _parse_manifest_value("True") is True
+        assert _parse_manifest_value("TRUE") is True
+
+    def test_boolean_false(self):
+        """Test parsing 'false' string."""
+        assert _parse_manifest_value("false") is False
+        assert _parse_manifest_value("False") is False
+        assert _parse_manifest_value("FALSE") is False
+
+    def test_integer(self):
+        """Test parsing an integer string."""
+        assert _parse_manifest_value("123") == 123
+
+    def test_float(self):
+        """Test parsing a float string."""
+        assert _parse_manifest_value("1.5") == 1.5
+
+    def test_non_numeric_string(self):
+        """Test that non-numeric strings stay as strings."""
+        assert _parse_manifest_value("hello123") == "hello123"
+
+
+class TestWriteManifestData:
+    """Tests for _write_manifest_data function."""
+
+    def test_write_simple_manifest(self):
+        """Test writing a simple manifest file."""
+        # GIVEN simple data
+        keys = ["path", "name", "id"]
+        data = [
+            {"path": "/path/to/file1.txt", "name": "file1.txt", "id": "syn123"},
+            {"path": "/path/to/file2.txt", "name": "file2.txt", "id": "syn456"},
+        ]
+
+        # WHEN we write it to a temp file
+        with tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".tsv") as f:
+            filename = f.name
+
+        try:
+            _write_manifest_data(filename, keys, data)
+
+            # THEN the file should contain the expected content
+            with open(filename, "r") as f:
+                content = f.read()
+
+            lines = content.strip().split("\n")
+            assert len(lines) == 3  # header + 2 data rows
+            assert lines[0] == "path\tname\tid"
+            assert lines[1] == "/path/to/file1.txt\tfile1.txt\tsyn123"
+            assert lines[2] == "/path/to/file2.txt\tfile2.txt\tsyn456"
+        finally:
+            os.unlink(filename)
+
+
+class TestValidateManifestRequiredFields:
+    """Tests for _validate_manifest_required_fields function."""
+
+    def test_valid_manifest(self):
+        """Test validating a valid manifest file."""
+        # GIVEN a valid manifest file
+        with tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".tsv") as f:
+            f.write("path\tparent\n")
+            f.write(f"{f.name}\tsyn123\n")
+            filename = f.name
+
+        try:
+            # Create the file referenced in path column
+            with open(filename, "a") as f:
+                pass  # File already exists
+
+            # WHEN we validate it
+            is_valid, errors = _validate_manifest_required_fields(filename)
+
+            # THEN it should be valid
+            assert is_valid is True
+            assert errors == []
+        finally:
+            os.unlink(filename)
+
+    def test_missing_file(self):
+        """Test validating a non-existent manifest file."""
+        # WHEN we validate a non-existent file
+        is_valid, errors = _validate_manifest_required_fields("/nonexistent/file.tsv")
+
+        # THEN it should be invalid
+        assert is_valid is False
+        assert len(errors) == 1
+        assert "not found" in errors[0]
+
+    def test_missing_required_field(self):
+        """Test validating a manifest missing a required field."""
+        # GIVEN a manifest missing the 'parent' field
+        with tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".tsv") as f:
+            f.write("path\tname\n")
+            f.write("/path/to/file.txt\tfile.txt\n")
+            filename = f.name
+
+        try:
+            # WHEN we validate it
+            is_valid, errors = _validate_manifest_required_fields(filename)
+
+            # THEN it should be invalid
+            assert is_valid is False
+            assert any("parent" in e for e in errors)
+        finally:
+            os.unlink(filename)
+
+    def test_empty_path(self):
+        """Test validating a manifest with empty path."""
+        # GIVEN a manifest with empty path
+        with tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".tsv") as f:
+            f.write("path\tparent\n")
+            f.write("\tsyn123\n")
+            filename = f.name
+
+        try:
+            # WHEN we validate it
+            is_valid, errors = _validate_manifest_required_fields(filename)
+
+            # THEN it should be invalid
+            assert is_valid is False
+            assert any("'path' is empty" in e for e in errors)
+        finally:
+            os.unlink(filename)
+
+    def test_invalid_parent_id(self):
+        """Test validating a manifest with invalid parent ID."""
+        # GIVEN a manifest with invalid parent ID
+        with tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".tsv") as f:
+            f.write("path\tparent\n")
+            f.write(f"{f.name}\tinvalid_parent\n")
+            filename = f.name
+
+        try:
+            # WHEN we validate it
+            is_valid, errors = _validate_manifest_required_fields(filename)
+
+            # THEN it should be invalid
+            assert is_valid is False
+            assert any("not a valid Synapse ID" in e for e in errors)
+        finally:
+            os.unlink(filename)
+
+
+class TestExtractEntityMetadataForFile:
+    """Tests for _extract_entity_metadata_for_file function."""
+
+    def test_extract_basic_metadata(self):
+        """Test extracting basic file metadata."""
+
+        # GIVEN a mock File object
+        class MockFile:
+            def __init__(self):
+                self.parent_id = "syn123"
+                self.path = "/path/to/file.txt"
+                self.name = "file.txt"
+                self.id = "syn456"
+                self.synapse_store = True
+                self.content_type = "text/plain"
+                self.annotations = None
+                self.activity = None
+
+        file = MockFile()
+
+        # WHEN we extract metadata
+        keys, data = _extract_entity_metadata_for_file([file])
+
+        # THEN we should get the expected data
+        assert "path" in keys
+        assert "parent" in keys
+        assert "name" in keys
+        assert "id" in keys
+        assert len(data) == 1
+        assert data[0]["path"] == "/path/to/file.txt"
+        assert data[0]["parent"] == "syn123"
+        assert data[0]["name"] == "file.txt"
+        assert data[0]["id"] == "syn456"
+
+    def test_extract_with_annotations(self):
+        """Test extracting metadata with annotations."""
+
+        # GIVEN a mock File object with annotations
+        class MockFile:
+            def __init__(self):
+                self.parent_id = "syn123"
+                self.path = "/path/to/file.txt"
+                self.name = "file.txt"
+                self.id = "syn456"
+                self.synapse_store = True
+                self.content_type = "text/plain"
+                self.annotations = {"study": ["Study1"], "dataType": ["RNA-seq"]}
+                self.activity = None
+
+        file = MockFile()
+
+        # WHEN we extract metadata
+        keys, data = _extract_entity_metadata_for_file([file])
+
+        # THEN annotation keys should be included
+        assert "study" in keys
+        assert "dataType" in keys
+        assert data[0]["study"] == ["Study1"]
+        assert data[0]["dataType"] == ["RNA-seq"]
+
+
+class TestGetEntityProvenanceDictForFile:
+    """Tests for _get_entity_provenance_dict_for_file function."""
+
+    def test_no_activity(self):
+        """Test extracting provenance when there is no activity."""
+
+        # GIVEN a mock File object with no activity
+        class MockFile:
+            def __init__(self):
+                self.activity = None
+
+        file = MockFile()
+
+        # WHEN we extract provenance
+        result = _get_entity_provenance_dict_for_file(file)
+
+        # THEN we should get an empty dict
+        assert result == {}
+
+    def test_with_activity(self):
+        """Test extracting provenance when there is an activity."""
+
+        # GIVEN mock objects
+        class MockUsedEntity:
+            def format_for_manifest(self):
+                return "syn789"
+
+        class MockActivity:
+            def __init__(self):
+                self.name = "Analysis"
+                self.description = "Processing data"
+                self.used = [MockUsedEntity()]
+                self.executed = []
+
+        class MockFile:
+            def __init__(self):
+                self.activity = MockActivity()
+
+        file = MockFile()
+
+        # WHEN we extract provenance
+        result = _get_entity_provenance_dict_for_file(file)
+
+        # THEN we should get the expected dict
+        assert result["activityName"] == "Analysis"
+        assert result["activityDescription"] == "Processing data"
+        assert result["used"] == "syn789"
+        assert result["executed"] == ""
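For readers of this diff, the `TestParseManifestValue` cases above pin down a cell-parsing contract: bracketed cells become lists, quoted list items may contain commas, booleans are case-insensitive, and numeric strings become numbers. Below is a minimal sketch of a parser that satisfies those cases. The names `parse_scalar` and `parse_manifest_value_sketch` are hypothetical, and this is not the `synapseclient` implementation of `_parse_manifest_value`, which may differ in details. In particular, converting list items to scalars is an assumption; the tests only cover string items.

```python
import csv
import io


def parse_scalar(text):
    """Convert one cell fragment to bool, int, or float, else leave it a str."""
    lowered = text.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    # Try the narrower numeric type first so "123" stays an int.
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def parse_manifest_value_sketch(cell):
    """Parse a manifest cell: '[a,b]' becomes a list, anything else a scalar."""
    if len(cell) >= 2 and cell.startswith("[") and cell.endswith("]"):
        inner = cell[1:-1]
        if not inner:
            return []
        # csv.reader honours double quotes, so '"hello,world"' stays one item.
        items = next(csv.reader(io.StringIO(inner)))
        # Assumption: list items get the same scalar conversion as bare cells.
        return [parse_scalar(item) for item in items]
    return parse_scalar(cell)


if __name__ == "__main__":
    print(parse_manifest_value_sketch('["hello,world",other]'))
    print(parse_manifest_value_sketch("TRUE"), parse_manifest_value_sketch("1.5"))
```

Delegating the comma splitting to `csv.reader` keeps the quoting rules out of hand-written code. One caveat of this sketch is that `float()` also accepts strings such as `nan` and `inf`, which a production parser may prefer to keep as text.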