diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 000000000..de9cc1a37 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,113 @@ + + +## Project + +Synapse Python Client — official Python SDK and CLI for Synapse (synapse.org), a collaborative science platform by Sage Bionetworks. Provides programmatic access to entities (projects, files, folders, tables, views), metadata, permissions, evaluations, and data curation workflows. Published to PyPI as `synapseclient`. + +## Stack + +- Python 3.10–3.14 (`setup.cfg`: `python_requires = >=3.10, <3.15`) +- HTTP: httpx (async), requests (sync/legacy) +- Models: stdlib dataclasses (NOT Pydantic) +- Tests: pytest 8.2, pytest-asyncio, pytest-socket, pytest-xdist +- Docs: MkDocs with Material theme, mkdocstrings +- Linting: ruff, black (line-length 88), isort (profile=black), bandit +- CI: GitHub Actions → SonarCloud, PyPI deploy on release +- Docker: `Dockerfile` at repo root, published to `ghcr.io/sage-bionetworks/synapsepythonclient` + +## Commands + +```bash +# Install for development +pip install -e ".[boto3,pandas,pysftp,tests,curator,dev]" + +# Unit tests +pytest -sv tests/unit + +# Integration tests (requires Synapse credentials, runs in parallel) +pytest -sv --reruns 3 tests/integration -n 8 --dist loadscope + +# Pre-commit checks (ruff, black, isort, bandit) +pre-commit run --all-files + +# Build docs locally +pip install -e ".[docs]" && mkdocs serve +``` + +## Conventions + +### Async-first with generated sync wrappers +All new methods must be async with `_async` suffix. The `@async_to_sync` class decorator (`core/async_utils.py`) auto-generates sync counterparts at class definition time. Never write sync methods manually on model classes — the decorator handles it. + +### `wrap_async_to_sync()` for standalone functions +Use `wrap_async_to_sync()` (not `@async_to_sync`) for free-standing async functions outside of classes — see `operations/` layer for the pattern. The class decorator only works on classes. 
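
The standalone-function wrapping described above can be illustrated with a minimal sketch. It is not the project's `wrap_async_to_sync()` implementation: `to_sync` and `get_name_async` are hypothetical names, and the sketch assumes no event loop is already running when the sync wrapper is called.

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable, TypeVar

T = TypeVar("T")


def to_sync(async_fn: Callable[..., Awaitable[T]]) -> Callable[..., T]:
    """Hypothetical helper: build a blocking wrapper around a free-standing coroutine function."""

    @functools.wraps(async_fn)
    def wrapper(*args: Any, **kwargs: Any) -> T:
        # asyncio.run() starts a fresh event loop and raises RuntimeError
        # if it is called while another loop is already running.
        return asyncio.run(async_fn(*args, **kwargs))

    return wrapper


async def get_name_async(entity_id: str) -> str:
    """Hypothetical async function standing in for an operations-layer call."""
    await asyncio.sleep(0)
    return f"name-of-{entity_id}"


# The sync counterpart is derived from the async function, never hand-written.
get_name = to_sync(get_name_async)
```

Calling `get_name("syn123")` then blocks until the coroutine finishes, which is the behavior callers expect from the sync half of the API.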
+ +### Protocol classes for sync type hints +Each model in `models/` has a corresponding protocol in `models/protocols/` defining the sync method signatures. When adding a new async method to a model, add its sync signature to the protocol class so IDE type hints work. + +### Dataclass models with `fill_from_dict()` +Models are `@dataclass` classes, NOT Pydantic. REST responses are deserialized via `fill_from_dict()` methods on each model. New models must follow this pattern. + +### Concrete types are Java class names +`core/constants/concrete_types.py` maps Java class names (e.g., `org.sagebionetworks.repo.model.FileEntity`) for polymorphic entity deserialization. When adding new entity types, register the concrete type string here AND in `api/entity_factory.py` AND in `models/mixins/asynchronous_job.py` if it's an async job type. + +### Options dataclass pattern +The `operations/` layer uses dataclass option objects (`StoreFileOptions`, `FileOptions`, `TableOptions`, etc.) to bundle type-specific configuration for CRUD operations. Follow this pattern for new entity-type-specific options. + +### Mixin composition for shared behavior +Shared functionality lives in `models/mixins/` (AccessControllable, StorableContainer, AsynchronousJob, etc.). Prefer adding to existing mixins over duplicating logic across models. + +### `synapse_client` parameter pattern +Most functions accept an optional `synapse_client` parameter. If omitted, `Synapse.get_client()` returns the cached singleton. Never pass `None` explicitly — omit the argument instead. + +### Branch naming +Use `SYNPY-{issue_number}` or `synpy-{issue_number}` prefix for feature branches. PR titles follow `[SYNPY-XXXX] Description` format. 
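
As an illustration of the dataclass and `fill_from_dict()` conventions listed above, here is a minimal sketch. `ExampleEntity` and its fields are hypothetical and chosen only for the example; real models carry more fields and behavior.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ExampleEntity:
    """Hypothetical model used only to illustrate the pattern."""

    id: Optional[str] = None
    name: Optional[str] = None
    parent_id: Optional[str] = None
    # Read-only server timestamp: excluded from equality comparisons.
    created_on: Optional[str] = field(default=None, compare=False)

    def fill_from_dict(self, synapse_entity: Dict[str, Any]) -> "ExampleEntity":
        # Map camelCase REST keys onto snake_case fields.
        self.id = synapse_entity.get("id", None)
        self.name = synapse_entity.get("name", None)
        self.parent_id = synapse_entity.get("parentId", None)
        self.created_on = synapse_entity.get("createdOn", None)
        return self


entity = ExampleEntity().fill_from_dict(
    {
        "id": "syn123",
        "name": "demo",
        "parentId": "syn1",
        "createdOn": "2024-01-01T00:00:00.000Z",
    }
)
```

Returning `self` lets callers chain construction and deserialization in one expression, and `compare=False` keeps server-managed fields out of equality checks.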
+ +## Architecture + +``` +synapseclient/ +├── client.py # Synapse class — public entry point, REST methods, auth (9600+ lines) +├── api/ # REST API layer — one file per resource type (21 files) +│ └── entity_factory.py # Polymorphic entity deserialization via concrete type dispatch +├── models/ # Dataclass entities (Project, File, Table, etc.) (28 files) +│ ├── protocols/ # Sync method type signatures for IDE hints (18 files) +│ ├── mixins/ # Shared behavior (ACL, containers, async jobs, tables) (7 files) +│ └── services/ # Model-level business logic (storable_entity, search) +├── operations/ # High-level CRUD: get(), store(), delete() — factory dispatch +├── core/ # Infrastructure: upload/download, retry, cache, creds, OTel +│ ├── upload/ # Multipart upload (sync + async) +│ ├── download/ # File download (sync + async) +│ ├── credentials/ # Auth chain (PAT, env var, config file, AWS SSM) +│ ├── constants/ # Concrete types, config keys, limits, method flags +│ ├── models/ # ACL, Permission, DictObject, custom JSON serialization +│ └── multithread_download/ # Threaded download manager +├── extensions/ +│ └── curator/ # Schema curation (pandas, networkx, rdflib) — optional +├── services/ # JSON schema validation services +└── entity.py, table.py, ... # Legacy classes (pre-OOP rewrite, read-only) + +synapseutils/ # Legacy bulk utilities (copy, sync, migrate, walk) — sync-only +``` + +Data flow: User → `operations/` factory → model async methods → `api/` service functions → `client.py` REST calls → Synapse API. Responses deserialized via `fill_from_dict()` on model instances. + +## Constraints + +- Do not use Pydantic for models — the codebase uses stdlib dataclasses with custom serialization. Mixing would break the `@async_to_sync` decorator and `fill_from_dict()` pattern. +- Do not write synchronous test files — write async tests only. The `@async_to_sync` decorator is validated by a dedicated smoke test. Duplicate sync tests were removed to cut CI cost. 
+- Unit tests must not make network calls — `pytest-socket` blocks all sockets. Use `pytest-mock` for HTTP mocking. +- `develop` is the default/main branch, not `main` or `master`. PRs target `develop`. +- Legacy classes in root `synapseclient/` (entity.py, table.py, etc.) are kept for backwards compatibility. New features go in `models/` using the dataclass pattern. +- Avoid adding new methods to `client.py` (9600+ lines) — prefer the `api/` + `models/` layered pattern. +- `synapseutils/` is legacy sync-only (uses `requests`, NOT `httpx`). Do not add async methods there — new async equivalents go in `models/` or `operations/`. + +## Testing + +- `asyncio_mode = auto` in pytest.ini — no need for `@pytest.mark.asyncio` +- `asyncio_default_fixture_loop_scope = session` — all async tests share one event loop +- Unit test client fixture: session-scoped, `skip_checks=True`, `cache_client=False` +- Integration tests use `--reruns 3` for flaky retries and `-n 8 --dist loadscope` for parallelism +- Integration fixtures create per-worker Synapse projects; use `schedule_for_cleanup()` for teardown +- Auth env vars: `SYNAPSE_AUTH_TOKEN` (bearer token), `SYNAPSE_PROFILE` (config file profile, default: `"default"`), `SYNAPSE_TOKEN_AWS_SSM_PARAMETER_NAME` (AWS SSM path) +- CI runs integration tests only on Python 3.10 and 3.14 (oldest + newest) to limit Synapse server load diff --git a/docs/CLAUDE.md b/docs/CLAUDE.md new file mode 100644 index 000000000..5ad47d242 --- /dev/null +++ b/docs/CLAUDE.md @@ -0,0 +1,55 @@ + + +## Project + +User-facing documentation for the Synapse Python Client. Built with MkDocs + Material theme, deployed via GitHub Pages. Follows the Diataxis documentation framework with four content types: tutorials, guides, reference, and explanations. + +## Stack + +MkDocs with Material theme, mkdocstrings (Google-style docstrings), termynal (CLI animations), markdown-include (file embedding). 
+ +## Conventions + +### Content types (Diataxis framework) +- **tutorials/** — Step-by-step learning (competence-building). Themed around a biomedical researcher working with Alzheimer's Disease data. Progressive build-up: Project → Folder → File → Annotations → etc. +- **guides/** — How-to guides for specific use cases (problem-solution oriented). Includes extension-specific guides (curator). +- **reference/** — API reference auto-generated from docstrings via mkdocstrings. Split into `experimental/sync/` and `experimental/async/` for new OOP API. +- **explanations/** — Deep conceptual content ("why" not just "how"). Design decisions, internal machinery. + +### File inclusion pattern (markdown-include) +Tutorial code lives in `tutorials/python/tutorial_scripts/*.py` and is embedded in markdown via line-range includes: +```markdown +{!docs/tutorials/python/tutorial_scripts/annotation.py!lines=5-23} +``` +Single source of truth — edit the `.py` file, not the markdown. Changing line numbers in scripts requires updating the line ranges in the corresponding `.md` files. + +### mkdocstrings reference generation +Reference markdown files use `::: synapseclient.ClassName` syntax to trigger auto-generation from docstrings. Key configuration: +- `docstring_style: google` — parse Google-style docstrings +- `members_order: source` — preserve source code order +- `filters: ["!^_", "!to_synapse_request", "!fill_from_dict"]` — private members, `to_synapse_request()`, and `fill_from_dict()` are excluded from docs +- `inherited_members: true` — shows mixin methods on inheriting classes +- Member lists are explicit — each reference page specifies which methods to document + +### Anchor links for cross-referencing +Pattern: `[](){ #reference-anchor }` in reference pages. Tutorials link to reference via `[API Reference][project-reference-sync]`. Explicit type hints use: `[syn.login][synapseclient.Synapse.login]`. 
+ +### termynal CLI animations +Terminal animation blocks marked with a `<!-- termynal -->` HTML comment. Prompts configured as `$` or `>`. Used in authentication.md and installation docs. + +### Custom CSS (`css/custom.css`) +- API reference indentation: `doc-contents` has 25px left padding with border +- Smaller table font (0.7rem) for API docs +- Wide layout: `max-width: 1700px` for complex content + +### Navigation structure +Defined in `mkdocs.yml` nav section. 6 main sections: Home, Tutorials, How-To Guides, API Reference, Further Reading, News. API Reference has ~85 markdown files (~40 legacy, ~45 experimental). + +## Constraints + +- Do not edit tutorial code inline in markdown — edit the `.py` script file in `tutorial_scripts/` and update line ranges if needed. +- Reference docs auto-generate from source docstrings — to change method documentation, edit the docstring in the Python source, not the markdown. +- `mkdocs.yml` is at the repo root, not in `docs/` — it configures the entire doc build. +- Docs deploy to Read the Docs (configured via `.readthedocs.yaml` at repo root). +- Local build output goes to `docs_site/` (via `site_dir` in `mkdocs.yml`) — gitignored. +- Cross-referencing uses the `autorefs` plugin: `[display text][synapseclient.ClassName.method]` auto-resolves to mkdocstrings anchors. diff --git a/synapseclient/api/CLAUDE.md b/synapseclient/api/CLAUDE.md new file mode 100644 index 000000000..954cfc915 --- /dev/null +++ b/synapseclient/api/CLAUDE.md @@ -0,0 +1,48 @@ + + +## Project + +REST API service layer — thin async functions that map to Synapse REST endpoints. One file per resource type. Called by model layer, never by end users directly. 
+ +## Conventions + +### Function signature pattern +```python +async def verb_resource( + required_param: str, + optional_param: str = None, + *, + synapse_client: Optional["Synapse"] = None, +) -> Dict[str, Any]: +``` +- All functions are `async def` +- `synapse_client` is always the last parameter, keyword-only (after `*`) +- Use `Synapse.get_client(synapse_client=synapse_client)` to get the client instance +- Use `TYPE_CHECKING` guard for `Synapse` import — avoids circular dependencies between `api/` and `client.py` + +### REST call pattern +```python +client = Synapse.get_client(synapse_client=synapse_client) +return await client.rest_post_async(uri="/endpoint", body=json.dumps(request)) +``` +Available methods: `rest_get_async`, `rest_post_async`, `rest_put_async`, `rest_delete_async`. Pass `endpoint=client.fileHandleEndpoint` for file handle operations; omit for the default repository endpoint. Use `json.dumps()` for request bodies — not raw dicts. + +### Return values +- Most functions return raw `Dict[str, Any]` — transformation happens in the model layer via `fill_from_dict()` +- Some return typed dataclass instances (e.g., `EntityHeader` from `entity_services.py`) when the data is only used internally +- Delete operations return `None` + +### Pagination +Use helpers from `api_client.py`: +- `rest_get_paginated_async()` — for GET endpoints with limit/offset. Expects `results` or `children` key in response. +- `rest_post_paginated_async()` — for POST endpoints with `nextPageToken`. Expects `page` array in response. +Both are async generators yielding individual items. + +### Entity factory (`entity_factory.py`) +Polymorphic entity deserialization via concrete type dispatch. Maps Java class names from `core/constants/concrete_types.py` to model classes. When adding a new entity type, register the type mapping here. + +### Adding a new service file +1. Create `synapseclient/api/new_service.py` +2. 
Add all public functions to `api/__init__.py` imports and `__all__` — every public function must be re-exported +3. Use `json.dumps()` for request bodies (not dict) +4. Reference `entity_services.py` for CRUD pattern, `table_services.py` for pagination pattern diff --git a/synapseclient/core/CLAUDE.md b/synapseclient/core/CLAUDE.md new file mode 100644 index 000000000..642abc9a6 --- /dev/null +++ b/synapseclient/core/CLAUDE.md @@ -0,0 +1,62 @@ + + +## Project + +Infrastructure layer — authentication, file transfer, retry logic, caching, OpenTelemetry tracing, and the `async_to_sync` decorator that powers the dual sync/async API. + +## Conventions + +### async_to_sync decorator (`async_utils.py`) +- Scans class for `*_async` methods and creates sync wrappers stripping the suffix +- Uses `ClassOrInstance` descriptor — methods work on both class and instance +- Detects running event loop: uses `nest_asyncio.apply()` for nested loops (Python <3.14), raises `RuntimeError` on Python 3.14+ instructing users to call async directly +- `wrap_async_to_sync()` for standalone functions (not class methods) — used in `operations/` layer +- `wrap_async_generator_to_sync_generator()` for async generators — must call `aclose()` in finally block +- `@skip_async_to_sync` decorator excludes specific methods from sync wrapper generation (sets `_skip_conversion = True`) +- `@otel_trace_method()` wraps async methods with OpenTelemetry spans. 
Format: `f"{ClassName}_{Operation}: ID: {self.id}, Name: {self.name}"` + +### Retry patterns (`retry.py`) +- `with_retry()` — count-based exponential backoff (default 3 retries), jitter 0.5-1.5x multiplier +- `with_retry_time_based_async()` — time-bounded (default 20 min), exponential backoff with 0.01-0.1 random jitter +- Default retryable status codes: `[429, 500, 502, 503, 504]` +- `NON_RETRYABLE_ERRORS` list overrides status code retry (currently: `["is not a table or view"]`) +- 429 throttling: wait bumps to 16 seconds minimum +- Sets OTel span attribute `synapse.retries` on retry + +### Credentials chain (`credentials/`) +Provider chain tries in order: login args → config file → env var (`SYNAPSE_AUTH_TOKEN`) → AWS SSM. Credentials implement `requests.auth.AuthBase`, adding `Authorization: Bearer` header. Profile selection via `SYNAPSE_PROFILE` env var or `--profile` arg. + +### Upload/download +- Both use 60-retry params spanning ~30 minutes for resilience +- Upload determines storage location from project settings, supports S3/SFTP/GCP +- Download validates MD5 post-transfer, raises `SynapseMd5MismatchError` on mismatch +- Progress via `tqdm`; multi-threaded uploads suppress per-file messages via `cumulative_transfer_progress` + +### concrete_types.py +Maps Java class names from Synapse REST API for polymorphic deserialization. When adding a new entity type, add its concrete type string here AND in `api/entity_factory.py` type map AND in `models/mixins/asynchronous_job.py` ASYNC_JOB_URIS if it's an async job type. + +### Key reusable utilities (`utils.py`) +- `delete_none_keys(d)` — removes None-valued keys from dict. MUST call before all API requests — Synapse rejects null values. 
+- `id_of(obj)` — extracts Synapse ID from entity, dict, or string +- `concrete_type_of(entity)` — gets the concrete type string from an entity +- `get_synid_and_version(id_str)` — parses "synXXX.N" strings into (id, version) tuples +- `merge_dataclass_entities(source, dest, ...)` — merges fields from one dataclass into another +- `log_dataclass_diff(obj1, obj2)` — logs field-by-field differences between two dataclass instances +- `snake_case(name)` — converts camelCase to snake_case +- `normalize_whitespace(s)` — collapses whitespace +- `MB`, `KB`, `GB` — byte size constants +- `make_bogus_data_file()`, `make_bogus_binary_file(n)`, `make_bogus_uuid_file()` — test file generators (in production code, used by tests) + +### Exception hierarchy (`exceptions.py`) +`SynapseError` base with 14+ subclasses: `SynapseHTTPError`, `SynapseMd5MismatchError`, `SynapseFileNotFoundError`, `SynapseNotFoundError`, `SynapseAuthenticationError`, etc. `_raise_for_status()` and `_raise_for_status_httpx()` handle HTTP error responses with Bearer token redaction via `BEARER_TOKEN_PATTERN` regex. + +### Rolled-up subdirectories + +**`core/models/`** — Internal dataclasses for ACL, Permission, DictObject (dict-like base class), and custom JSON serialization utilities. `DictObject` (`dict_object.py`) provides dot-notation access to dict entries. + +**`core/multithread_download/`** — Threaded download manager with `shared_executor()` context manager for external thread pool configuration. Uses `DownloadRequest` dataclass. Default part size: `SYNAPSE_DEFAULT_DOWNLOAD_PART_SIZE`. + +## Constraints + +- Bearer tokens must never appear in logs — use `BEARER_TOKEN_PATTERN` regex for redaction. +- `delete_none_keys()` must be called on all dicts before sending to the API — Synapse rejects null values. 
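
The two constraints above can be illustrated with a short sketch. The helper names and the regular expression are assumptions made for the example: the project's own `delete_none_keys()` and `BEARER_TOKEN_PATTERN` may behave differently (for instance, this sketch returns a new dict rather than modifying its argument).

```python
import re
from typing import Any, Dict

# Assumed pattern for illustration only; the real BEARER_TOKEN_PATTERN may differ.
EXAMPLE_BEARER_PATTERN = re.compile(r"(Bearer\s+)[A-Za-z0-9\-._~+/]+=*")


def drop_none_values(request: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of the request without None-valued keys."""
    return {key: value for key, value in request.items() if value is not None}


def redact_bearer(message: str) -> str:
    """Hypothetical helper: mask the token that follows a 'Bearer' prefix."""
    return EXAMPLE_BEARER_PATTERN.sub(r"\1[REDACTED]", message)


body = drop_none_values({"name": "demo", "description": None, "parentId": "syn1"})
log_line = redact_bearer("Authorization: Bearer abc.def-123")
```

Applying both steps at the boundary (just before a request is sent, just before a message is logged) keeps the rule in one place instead of relying on every caller to remember it.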
diff --git a/synapseclient/core/constants/CLAUDE.md b/synapseclient/core/constants/CLAUDE.md new file mode 100644 index 000000000..d5d42ff72 --- /dev/null +++ b/synapseclient/core/constants/CLAUDE.md @@ -0,0 +1,22 @@ + + +## Project + +Centralized constants used across the codebase — concrete type mappings, API limits, collision modes, and config file keys. + +## Conventions + +### concrete_types.py — 3-way registration required +Maps Java class name strings (e.g., `org.sagebionetworks.repo.model.FileEntity`) for polymorphic entity deserialization. When adding a new entity or job type, register in THREE places: +1. `concrete_types.py` — add the constant string +2. `api/entity_factory.py` — add to the type dispatch map +3. `models/mixins/asynchronous_job.py` `ASYNC_JOB_URIS` — add if it's an async job type + +### limits.py +`MAX_FILE_HANDLE_PER_COPY_REQUEST = 100` and other API batch size limits. + +### method_flags.py +Collision handling modes for file downloads: `COLLISION_OVERWRITE_LOCAL`, `COLLISION_KEEP_LOCAL`, `COLLISION_KEEP_BOTH`. + +### config_file_constants.py +Section and key names for the `~/.synapseConfig` file. `AUTHENTICATION_SECTION_NAME` identifies the auth section. diff --git a/synapseclient/core/credentials/CLAUDE.md b/synapseclient/core/credentials/CLAUDE.md new file mode 100644 index 000000000..85b758f46 --- /dev/null +++ b/synapseclient/core/credentials/CLAUDE.md @@ -0,0 +1,23 @@ + + +## Project + +Authentication credential providers implementing a chain-of-responsibility pattern for token resolution. + +## Conventions + +### Provider chain order (priority) +1. **UserArgsCredentialsProvider** — explicit login args passed to `syn.login()` +2. **ConfigFileCredentialsProvider** — `~/.synapseConfig` file (profile-aware via sections) +3. **EnvironmentVariableCredentialsProvider** — `SYNAPSE_AUTH_TOKEN` env var +4. 
**AWSParameterStoreCredentialsProvider** — AWS SSM Parameter Store (via `SYNAPSE_TOKEN_AWS_SSM_PARAMETER_NAME` env var) + +### Profile selection +Select profile via `SYNAPSE_PROFILE` env var or `--profile` CLI arg. If username provided in login args differs from config file username, config credentials are rejected — prevents ambiguity. + +### Token handling +`SynapseAuthTokenCredentials` implements `requests.auth.AuthBase`, adding `Authorization: Bearer` header. JWT validation failure is silent (logs warning, does not raise) — allows tokens with unrecognized formats to attempt API calls. + +## Constraints + +- Bearer tokens must never appear in logs — redact with `BEARER_TOKEN_PATTERN` regex before logging. diff --git a/synapseclient/core/download/CLAUDE.md b/synapseclient/core/download/CLAUDE.md new file mode 100644 index 000000000..c6c359104 --- /dev/null +++ b/synapseclient/core/download/CLAUDE.md @@ -0,0 +1,26 @@ + + +## Project + +File download from Synapse storage with MD5 validation, collision handling, and progress tracking. + +## Conventions + +### Primary download path +`download_async.py` is the primary async download implementation. `download_functions.py` contains shared helpers and the sync download wrapper. + +### MD5 validation +Post-transfer MD5 validation is mandatory. Raises `SynapseMd5MismatchError` on mismatch — the download is retried automatically (60 retries spanning ~30 minutes). + +### Collision handling +Controlled by `if_collision` parameter, using constants from `core/constants/method_flags.py`: +- `overwrite.local` — replace existing local file +- `keep.local` — skip download if local file exists +- `keep.both` — rename downloaded file to avoid collision + +### Progress tracking +Uses `shared_download_progress_bar` from `core/transfer_bar.py` for tqdm-based progress. Multi-file downloads track cumulative progress via `cumulative_transfer_progress`. 
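
The post-transfer MD5 check described under "MD5 validation" can be sketched as follows. This is an illustrative outline using only the standard library: `Md5MismatchError`, `md5_of_file`, and `validate_download` are hypothetical names standing in for the project's own error class and download code.

```python
import hashlib


class Md5MismatchError(Exception):
    """Hypothetical stand-in for SynapseMd5MismatchError."""


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so large downloads are not read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_download(path: str, expected_md5: str) -> None:
    """Raise if the file on disk does not match the checksum reported by the server."""
    actual_md5 = md5_of_file(path)
    if actual_md5 != expected_md5:
        raise Md5MismatchError(
            f"Downloaded file {path} has MD5 {actual_md5}, expected {expected_md5}"
        )
```

A caller would typically wrap the download and `validate_download()` in its retry loop, so that a mismatch triggers another attempt rather than leaving a corrupt file in place.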
+ +### Key helpers +- `ensure_download_location_is_directory()` — validates/creates download directory +- `download_by_file_handle()` — downloads a file given its handle metadata diff --git a/synapseclient/core/upload/CLAUDE.md b/synapseclient/core/upload/CLAUDE.md new file mode 100644 index 000000000..b28de12f7 --- /dev/null +++ b/synapseclient/core/upload/CLAUDE.md @@ -0,0 +1,38 @@ + + +## Project + +Multipart file upload to Synapse storage (S3, GCP, SFTP). Dual implementation: sync (requests) and async (httpx). + +## Conventions + +### Constants +- `MAX_NUMBER_OF_PARTS = 10000` +- `MIN_PART_SIZE = 5 MB` +- `DEFAULT_PART_SIZE = 8 MB` +- `MAX_RETRIES = 7` +- Upload retry: 60 retries spanning ~30 minutes for resilience + +### Sync vs async duality +`multipart_upload.py` (sync/requests) and `multipart_upload_async.py` (async/httpx) must be kept in feature parity. Both implement `UploadAttempt` / `UploadAttemptAsync` classes orchestrating multi-part uploads with presigned URL batching. + +### Async-specific patterns +- `HandlePartResult` dataclass tracks individual part uploads +- `shared_progress_bar()` context manager for tqdm integration across concurrent tasks +- Explicit `gc.collect()` calls and psutil memory monitoring during large uploads — prevents memory pressure +- Uses `asyncio.Lock` to guard shared state across concurrent tasks (coroutine-safe; `asyncio.Lock` is not thread-safe) + +### Sync-specific patterns +- Thread-local `requests.Session` storage for persistent HTTP connections per thread +- `shared_executor()` context manager allows callers to provide their own thread pool + +### Upload flow +1. Pre-upload: MD5 calculation, MIME type detection, storage location determination from project settings +2. Presigned URL batch fetching with expiry detection and refresh +3. Multi-part upload with retry per part +4. 
Post-upload: complete upload API call, retrieve file handle + +### upload_utils.py +- `get_partial_file_chunk()` — binary file chunk reader with offset tracking +- `get_partial_dataframe_chunk()` — DataFrame chunk reader (iterates in 100-row increments) +- MD5 calculation, MIME type guessing, part size computation diff --git a/synapseclient/extensions/curator/CLAUDE.md b/synapseclient/extensions/curator/CLAUDE.md new file mode 100644 index 000000000..6cd1effa8 --- /dev/null +++ b/synapseclient/extensions/curator/CLAUDE.md @@ -0,0 +1,40 @@ + + +## Project + +Schema curation tools for data modeling — JSON Schema generation from CSV/JSONLD data models, schema registration/binding to Synapse entities, and metadata task creation for file-based and record-based curation workflows. + +## Stack + +Optional dependencies (gated by `[curator]` extras): pandas, pandarallel, networkx, rdflib, inflection, dataclasses-json. + +## Conventions + +### schema_generation.py (5984 lines) +Largest file in the codebase. Contains `DataModelParser`, `DataModelComponent`, `DataModelRelationships` classes. Uses networkx (DiGraph, MultiDiGraph) for node/edge relationships and cycle detection (via multiprocessing). Many deprecated validation rule enums marked for removal (SYNPY-1724, SYNPY-1692). Active development area — multiple recent PRs modifying conditionals, display names, and grouping. + +### schema_registry.py +Query engine for the schema registry table. Default table ID: `syn69735275` (configurable via parameter). Builds SQL WHERE clauses from filter kwargs — supports exact match and LIKE pattern match. `return_latest_only=True` returns newest version URI only. 
+ +### schema_management.py +Thin wrappers around `JSONSchema` OOP model: +- `register_jsonschema()` / `register_jsonschema_async()` — loads schema from file, calls `.store_async()` +- `bind_jsonschema()` / `bind_jsonschema_async()` — binds schema to entity +- `fix_schema_name()` — replaces dashes/underscores with periods for Synapse compliance + +Uses `wrap_async_to_sync()` for sync versions (not class decorator). + +### file_based_metadata_task.py +Creates EntityView from JSON Schema bound to folder/project. `create_json_schema_entity_view()` auto-reorders columns (createdBy→name→id to front). `create_or_update_wiki_with_entity_view()` embeds EntityView query in Wiki page. + +### record_based_metadata_task.py +Extracts schema properties → DataFrame → RecordSet → CurationTask + Grid. Supports URI-based schemas via `JSONSchema.from_uri()`. + +### utils.py +`project_id_from_entity_id()` — traverses folder hierarchy up to project (max 1000 iterations). Uses legacy sync `get()` API in a loop — known tech debt. + +## Constraints + +- This area is under active development with frequent PRs. Be cautious about large refactors — coordinate with the curator team. +- `schema_generation.py` contains deprecated patterns (SYNPY-1724) that are still in use — do not remove without verifying the deprecation timeline. +- Uses `urllib.request` in one place instead of httpx (has TODO to replace) — do not propagate this pattern elsewhere. diff --git a/synapseclient/models/CLAUDE.md b/synapseclient/models/CLAUDE.md new file mode 100644 index 000000000..43f00a2dd --- /dev/null +++ b/synapseclient/models/CLAUDE.md @@ -0,0 +1,67 @@ + + +## Project + +Dataclass-based entity models for the Synapse REST API. Each model represents a Synapse resource (Project, File, Folder, Table, etc.) with async-first methods and auto-generated sync wrappers. + +## Conventions + +### New model checklist +1. 
Decorate with `@dataclass()` then `@async_to_sync` (order matters — `@async_to_sync` must be outer) +2. Inherit from: `SynchronousProtocol`, then mixins (`AccessControllable`, `StorableContainer`, etc.) +3. Create a matching protocol file in `protocols/` with sync method signatures +4. Register concrete type in `core/constants/concrete_types.py` +5. Add to `models/__init__.py` exports and `__all__` +6. Add to entity factory type map in `api/entity_factory.py` if it's an entity type +7. Add to `ASYNC_JOB_URIS` in `models/mixins/asynchronous_job.py` if it uses async jobs + +### Standard fields every entity model must have +```python +id: Optional[str] = None +name: Optional[str] = None +etag: Optional[str] = None +created_on: Optional[str] = field(default=None, compare=False) +modified_on: Optional[str] = field(default=None, compare=False) +created_by: Optional[str] = field(default=None, compare=False) +modified_by: Optional[str] = field(default=None, compare=False) +create_or_update: bool = field(default=True, repr=False) +_last_persistent_instance: Optional["Self"] = field(default=None, repr=False, compare=False) +``` + +Use `compare=False` for read-only timestamps, child collections, annotations, and internal state — this makes `has_changed` compare only user-modifiable fields. + +### fill_from_dict() pattern +Maps camelCase REST keys to snake_case fields via `.get("camelCaseKey", None)`. Must return `self`. Handle annotations separately with `set_annotations` parameter. Reference: `folder.py`, `file.py`. + +### Annotations handling +Annotations are deserialized separately from `fill_from_dict()` — they use a `set_annotations` flag parameter. The `Annotations` model wraps key-value metadata. When storing, annotations are sent via a separate API call in `models/services/storable_entity_components.py`. + +### Activity/provenance pattern +`Activity` model tracks provenance (what data/code produced an entity). 
Contains `used` and `executed` lists of `UsedEntity`/`UsedURL` references. Activity is stored as a separate component — the `associate_activity_to_new_version` flag on File controls whether activity transfers to new versions. + +### _last_persistent_instance lifecycle +- Set via `_set_last_persistent_instance()` after every successful `store_async()` and `get_async()` +- Uses `dataclasses.replace(self)` with `deepcopy` for annotations +- Enables `has_changed` property — skips redundant API calls when nothing changed +- Drives `create_or_update` logic: if no `_last_persistent_instance`, attempts merge with existing Synapse entity + +### @otel_trace_method on every async method +Apply to all async methods that call Synapse. Format: `f"{ClassName}_{Operation}: ID: {self.id}, Name: {self.name}"`. + +### delete_none_keys() before API calls +Always call `delete_none_keys()` on request dicts before passing to `store_entity()` — the Synapse API rejects `None` values. + +### EnumCoercionMixin for enum fields +If a model has enum-typed fields, inherit from `EnumCoercionMixin` and declare `_ENUM_FIELDS: ClassVar[Dict[str, type]]` mapping field names to enum classes. Auto-coerces strings to enums on assignment via `__setattr__`. + +### OOP table.py vs legacy synapseclient/table.py +`models/table.py` is the modern OOP dataclass Table. `synapseclient/table.py` in the package root is the legacy Table class (MutableMapping-based). New table features go in `models/table.py` and `models/table_components.py`. + +### Business logic in services/ +Complex orchestration logic lives in `models/services/` (storable_entity, storable_entity_components, search) — not directly on model classes. This keeps models thin. + +## Constraints + +- Never manually write sync methods on models — `@async_to_sync` generates them. Use `@skip_async_to_sync` to exclude specific methods. 
+- Protocol files must exactly match the async method signatures (minus `_async` suffix) — they exist for IDE type hints, not runtime dispatch. +- Child collections (files, folders, tables) must use `compare=False` to avoid breaking `has_changed`. diff --git a/synapseclient/models/mixins/CLAUDE.md b/synapseclient/models/mixins/CLAUDE.md new file mode 100644 index 000000000..ae3a2da35 --- /dev/null +++ b/synapseclient/models/mixins/CLAUDE.md @@ -0,0 +1,30 @@ + + +## Project + +Composable behavior mixins for model classes — ACL management, container operations, async job orchestration, table CRUD, form submissions, and JSON schema validation. + +## Conventions + +### access_control.py (2527 lines) +`AccessControllable` mixin provides `get_permissions()`, `set_permissions()`, `delete_acl()`. Uses `BenefactorTracker` dataclass to track ACL cascade when inheritance changes — maps entity→benefactor and benefactor→children relationships. Batch ACL operations use `asyncio.as_completed()` for concurrency with tqdm progress bars. + +### storable_container.py (1530 lines) +`StorableContainer` mixin for entities containing files/folders/tables. Queue-based concurrent download/upload via `_worker()` coroutine processing `asyncio.Queue`. `sync_from_synapse_async()` recursively downloads child entities. `FailureStrategy` enum (LOG_EXCEPTION vs RAISE_EXCEPTION) controls child entity error handling. Uses `wrap_async_generator_to_sync_generator()` for `get_children`. Child entity type dispatch via concrete type → model class mapping. + +### asynchronous_job.py (516 lines) +`AsynchronousCommunicator` abstract mixin for long-running Synapse jobs. `ASYNC_JOB_URIS` dict maps concrete types to REST endpoints — when adding a new async job type, register here AND in `core/constants/concrete_types.py`. `send_job_and_wait_async()` polls job status with tqdm progress. Subclasses must implement `to_synapse_request()` and `fill_from_dict()`. 
+ +### table_components.py (4634 lines) +Massive table CRUD mixin: schema management, query execution, row operations, CSV upload/download, snapshot creation. Uses `multipart_upload_dataframe_async` for CSV data. Pandas integration for to_csv/read_csv. Column type mapping between Python types and Synapse column types. Multiple TODOs for incomplete features (SYNPY-1651). + +### form.py (178 lines) +`FormData` submission mixin with `FormGroup`, `FormChangeRequest`, `FormSubmissionStatus` dataclasses. `StateEnum`: WAITING_FOR_SUBMISSION, SUBMITTED_WAITING_FOR_REVIEW, ACCEPTED, REJECTED. + +### json_schema.py (1267 lines) +Schema validation and creation via async jobs. `JSONSchemaVersionInfo`, `JSONSchemaBinding`, `JSONSchemaValidation` dataclasses. Used by entities that support schema binding (Folder, Project). + +## Constraints + +- When adding a new async job type, register in BOTH `ASYNC_JOB_URIS` (here) and `concrete_types.py` — missing either causes runtime errors. +- Child collections on `StorableContainer` models must use `compare=False` in field definition to avoid breaking `has_changed` comparison. diff --git a/synapseclient/models/protocols/CLAUDE.md b/synapseclient/models/protocols/CLAUDE.md new file mode 100644 index 000000000..03ceff63f --- /dev/null +++ b/synapseclient/models/protocols/CLAUDE.md @@ -0,0 +1,27 @@ + + +## Project + +Protocol classes providing sync method type hints for IDE autocompletion. One protocol file per model class (18 files). + +## Conventions + +### Naming convention +- File: `{entity}_protocol.py` (e.g., `file_protocol.py`, `project_protocol.py`) +- Class: `{Entity}SynchronousProtocol` (e.g., `FileSynchronousProtocol`) + +### Signature matching +Every async method on a model must have a corresponding sync signature here — method name without `_async` suffix, same parameters (including `synapse_client: Optional["Synapse"] = None`). Body is always `...` (no implementation). 
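As a rough illustration of the signature-matching rule, the sketch below pairs a protocol stub with an async model method and checks that their parameters agree. `Widget`, `WidgetSynchronousProtocol`, the placeholder `Synapse` class, and the `signature_mismatches()` helper are hypothetical names invented for this example; the repository may enforce the rule differently or only by review.

```python
import inspect
from typing import List, Optional, Protocol


class Synapse:
    """Placeholder so the forward reference resolves; not the real client."""


class WidgetSynchronousProtocol(Protocol):
    # Sync stub: same parameters as store_async, body is only `...`.
    def store(self, *, synapse_client: Optional["Synapse"] = None) -> "Widget":
        ...


class Widget(WidgetSynchronousProtocol):
    async def store_async(
        self, *, synapse_client: Optional["Synapse"] = None
    ) -> "Widget":
        return self


def _params(func) -> list:
    """Reduce a callable to (name, kind, default) triples for comparison."""
    sig = inspect.signature(func)
    return [(p.name, p.kind, p.default) for p in sig.parameters.values()]


def signature_mismatches(model_cls: type, protocol_cls: type) -> List[str]:
    """List every *_async method whose protocol stub is missing or differs."""
    problems = []
    for name, func in inspect.getmembers(model_cls, inspect.iscoroutinefunction):
        if not name.endswith("_async"):
            continue
        sync_name = name[: -len("_async")]
        stub = protocol_cls.__dict__.get(sync_name)
        if stub is None:
            problems.append(f"{sync_name}: missing from protocol")
        elif _params(func) != _params(stub):
            problems.append(f"{sync_name}: parameters differ")
    return problems
```

A check like this could run as a unit test so that drift between a model and its protocol is caught before it shows up as a misleading IDE hint.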
+ +### Purpose +The `@async_to_sync` decorator generates the actual sync implementation at class definition time. These protocol files exist solely so IDEs can provide type hints, autocomplete, and documentation for the generated sync methods. + +### Adding a new method +1. Add async method to model class (e.g., `store_async()`) +2. Add sync signature to the corresponding protocol (e.g., `store()` with `...` body) +3. The decorator auto-generates the working sync implementation + +## Constraints + +- Protocol signatures must exactly match async signatures minus the `_async` suffix — mismatches cause IDE type hint errors. +- Do not add implementation logic to protocols — they are type stubs only. diff --git a/synapseclient/models/services/CLAUDE.md b/synapseclient/models/services/CLAUDE.md new file mode 100644 index 000000000..cd2a479c9 --- /dev/null +++ b/synapseclient/models/services/CLAUDE.md @@ -0,0 +1,20 @@ + + +## Project + +Business logic extracted from model classes to keep models thin. Internal-only — not part of the public API. + +## Conventions + +### storable_entity.py +`store_entity()` async function orchestrates entity POST/PUT to Synapse. Handles version numbering: if `version_label` changed or `force_version=True`, increments version. Note: this function has an explicit TODO marking it as incomplete/WIP. + +### storable_entity_components.py +`store_entity_components()` orchestrates storing annotations, activity, and ACL as separate API calls after the entity itself is stored. `FailureStrategy` enum (LOG_EXCEPTION, RAISE_EXCEPTION) controls error handling. `wrap_coroutine()` helper wraps individual component store operations. + +### search.py +`get_id()` utility resolves an entity by name+parent or by Synapse ID. Has a TODO for deprecated code replacement (SYNPY-1623) — uses `asyncio.get_event_loop().run_in_executor()` as a legacy pattern for blocking operations. 
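For readers unfamiliar with the executor pattern mentioned for `search.py`, here is a minimal sketch of it next to `asyncio.to_thread()`, a standard-library alternative available since Python 3.9. The source does not say what the planned replacement under SYNPY-1623 is, so the second function is only one plausible option; `blocking_lookup()` is a hypothetical stand-in for a blocking legacy call.

```python
import asyncio
import functools
import time


def blocking_lookup(name: str, parent: str) -> str:
    """Hypothetical blocking call standing in for a legacy sync client method."""
    time.sleep(0.01)
    return f"syn-{parent}-{name}"


async def lookup_legacy(name: str, parent: str) -> str:
    # Legacy pattern: hand the blocking call to the loop's default thread pool.
    loop = asyncio.get_event_loop()
    call = functools.partial(blocking_lookup, name, parent=parent)
    return await loop.run_in_executor(None, call)


async def lookup_to_thread(name: str, parent: str) -> str:
    # Same effect with less ceremony; keyword arguments pass through directly.
    return await asyncio.to_thread(blocking_lookup, name, parent=parent)
```

Both variants keep the event loop responsive while the blocking call runs in a worker thread; neither makes the underlying call itself asynchronous.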
+ +## Constraints + +- These are internal service functions — do not expose in `models/__init__.py` or import from user-facing code. diff --git a/synapseclient/operations/CLAUDE.md b/synapseclient/operations/CLAUDE.md new file mode 100644 index 000000000..a09c3a436 --- /dev/null +++ b/synapseclient/operations/CLAUDE.md @@ -0,0 +1,39 @@ + + +## Project + +High-level CRUD factory methods (`get`, `store`, `delete`) that dispatch to the correct entity-type-specific handler. Entry point for users who want a simpler interface than calling model methods directly. + +## Conventions + +### Sync wrapper pattern +Uses `wrap_async_to_sync()` on standalone async functions — NOT the `@async_to_sync` class decorator (which only works on classes). Every public async function has a sync counterpart generated this way. + +### Factory dispatch via isinstance() +`store_async()` routes to entity-specific handlers via `isinstance()` checks: +- File/RecordSet → `_handle_store_file_entity()` +- Project/Folder → `_handle_store_container_entity()` +- Table-like entities → `_handle_store_table_entity()` +- Link → `_handle_store_link_entity()` +- Team → if has `id`: `.store_async()`, else: `.create_async()` +- AgentSession → `.update_async()` (not `.store_async()`) + +### Options dataclasses +Type-specific configuration bundled in dataclass objects: +- **Store**: `StoreFileOptions`, `StoreContainerOptions`, `StoreTableOptions`, `StoreGridOptions`, `StoreJSONSchemaOptions` +- **Get**: `FileOptions`, `ActivityOptions`, `TableOptions`, `LinkOptions` + +`LinkOptions.follow_link=True` returns the target entity, not the Link itself. + +### Delete version precedence +Version resolution order: explicit `version` parameter > entity's `version_number` attribute > version parsed from ID string (e.g., "syn123.4"). Only warns on conflict if both explicit param and attribute are set and differ. 
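The precedence rule above can be sketched as a small pure function. This is an illustration of the stated order only, not the code in `operations/`: the function name `resolve_delete_version()`, its parameters, and the warning text are assumptions.

```python
import warnings
from typing import Optional, Tuple


def resolve_delete_version(
    entity_id: str,
    version: Optional[int] = None,
    version_number: Optional[int] = None,
) -> Tuple[str, Optional[int]]:
    """Return (base_id, version) using: explicit > attribute > ID suffix."""
    base_id, _, suffix = entity_id.partition(".")
    id_version = int(suffix) if suffix.isdigit() else None

    if version is not None:
        # Warn only when the explicit parameter and the attribute disagree.
        if version_number is not None and version_number != version:
            warnings.warn(
                f"version={version} overrides version_number={version_number}"
            )
        return base_id, version
    if version_number is not None:
        return base_id, version_number
    return base_id, id_version
```

For example, `resolve_delete_version("syn123.4", version_number=2)` picks the attribute over the ID suffix, and a conflict between the attribute and the suffix stays silent because only the explicit-versus-attribute case warns.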
+ +### FailureStrategy +`FailureStrategy` enum controls child entity error handling in container store operations: +- `LOG_EXCEPTION` — log error, continue with remaining children +- `RAISE_EXCEPTION` — raise immediately on first child failure + +### Adding new operations +1. Add async function in the appropriate file +2. Create sync wrapper with `wrap_async_to_sync()` +3. Export both in `operations/__init__.py` and `__all__` diff --git a/synapseutils/CLAUDE.md b/synapseutils/CLAUDE.md new file mode 100644 index 000000000..999f4543b --- /dev/null +++ b/synapseutils/CLAUDE.md @@ -0,0 +1,34 @@ + + +## Project + +Legacy bulk utility functions for copy, sync, migrate, walk, describe, and monitor operations. Pre-OOP code using legacy `requests` HTTP and old-style Entity classes (not modern dataclass models). + +## Conventions + +### Naming convention +Functions use camelCase (legacy convention) — e.g., `syncFromSynapse()`, `copyFileHandles()`, `notifyMe()`. Do NOT convert to snake_case — this is the public API. + +### migrate_functions.py (1429 lines) +Uses SQLite database for migration state persistence. `MigrationResult` proxy object iterates results without loading all into memory — avoids memory issues for repos with millions of files. Two-phase pattern: `index_files_for_migration()` then `migrate_indexed_files()`. Uses concurrent.futures thread pool with configurable part size (default 100 MB). + +### sync.py (1528 lines) +`syncFromSynapse()` / `syncToSynapse()` for bulk folder transfer. Generates manifest files for tracking. Known issue: TODO at line 967 notes "absence of a raise here appears to be a bug and yet tests fail if this is raised" — `SynapseFileNotFoundError` handling may be incorrect. + +### copy_functions.py (965 lines) +`copyFileHandles()` batches by `MAX_FILE_HANDLE_PER_COPY_REQUEST`. Returns list with potential `failureCodes` (UNAUTHORIZED, NOT_FOUND). `copyWiki()` and `changeFileMetaData()` for metadata operations. 
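The batching behavior described for `copyFileHandles()` amounts to splitting a list into fixed-size chunks, one per request. A minimal sketch under stated assumptions: `EXAMPLE_MAX_PER_REQUEST` and its value of 100 are hypothetical placeholders for `MAX_FILE_HANDLE_PER_COPY_REQUEST`, whose real value is not given here, and `batched()` is not a function from `synapseutils`.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Hypothetical limit; the real constant's value is not stated in this document.
EXAMPLE_MAX_PER_REQUEST = 100


def batched(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start : start + size])
```

With 250 file handle IDs and a limit of 100, this yields three batches of 100, 100, and 50 items, and a caller would concatenate the per-batch results into the single returned list.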
+ +### monitor.py (192 lines) +`notifyMe()` — decorator for sync functions that sends email notification on completion/failure. `notify_me_async()` — async variant. Both retry on failure with configurable retry count. Uses `syn.sendMessage()` with user's owner ID. + +### walk_functions.py +`walk()` — recursive entity tree traversal similar to `os.walk()`. Returns generator of (dirpath, dirnames, filenames) tuples. + +### describe_functions.py +Opens CSV/TSV entities as pandas DataFrames. Calculates per-column stats: mode, min/max (numeric), mean, dtype. + +## Constraints + +- ALL functions use legacy sync `requests` library, NOT httpx. Do NOT add async methods here — new async equivalents go in `synapseclient/models/` or `synapseclient/operations/`. +- Uses legacy Entity classes (`from synapseclient import Entity, File, Folder`) — NOT modern dataclass models. +- Do not refactor to modern patterns without a migration plan — these are public APIs with external consumers. diff --git a/tests/CLAUDE.md b/tests/CLAUDE.md new file mode 100644 index 000000000..0953eb97f --- /dev/null +++ b/tests/CLAUDE.md @@ -0,0 +1,45 @@ + + +## Project + +Test suite for the Synapse Python Client. Unit tests run without network access; integration tests hit the live Synapse API. + +## Conventions + +### Write async tests only +Do not create synchronous test files. The `@async_to_sync` decorator is validated by a dedicated smoke test (`tests/integration/synapseclient/models/synchronous/test_sync_wrapper_smoke.py`). Duplicate sync tests were removed to cut CI cost and maintenance burden. + +### Unit tests (`tests/unit/`) +- `pytest-socket` blocks all network calls (unix sockets allowed on non-Windows for async event loop). On Windows, socket disabling is skipped entirely — tests still run but are not network-isolated. 
+- Session-scoped `syn` fixture: `Synapse(skip_checks=True, cache_client=False)` with silent logger +- Autouse `set_timezone` fixture forces `TZ=UTC` for deterministic timestamps +- Client caching disabled via `Synapse.allow_client_caching(False)` +- Use `AsyncMock` for async method mocking, `create_autospec` for type-safe mocks +- Class-based test organization with `@pytest.fixture(scope="function", autouse=True)` for setup +- Test file naming: `unit_test_*.py` (legacy) or `test_*.py` (newer) — both patterns are discovered by pytest + +### Integration tests (`tests/integration/`) +- All async tests share one event loop: `asyncio_default_fixture_loop_scope = session` +- `schedule_for_cleanup(item)` — defer entity/file cleanup to session teardown. Always use this instead of inline deletion. Cleanup list is reversed before execution for dependency ordering (children deleted before parents). +- Per-worker project fixtures (`project_model`, `project`) created during session setup +- `--reruns 3` for flaky retry, `-n 8 --dist loadscope` for parallelism +- OpenTelemetry tracing opt-in via `SYNAPSE_INTEGRATION_TEST_OTEL_ENABLED` env var +- Two client fixtures: `syn` (silent logger) and `syn_with_logger` (verbose) +- conftest.py locations: `tests/unit/conftest.py` (session client, socket blocking, UTC timezone), `tests/integration/conftest.py` (logged-in client, per-worker projects, cleanup fixture) + +### Test utilities +- `tests/test_utils.py`: `spy_for_async_function(original_func)` — wraps async function for pytest-mock spying while preserving async behavior. `spy_for_function(original_func)` — sync variant. +- `tests/integration/helpers.py`: `wait_for_condition(condition_fn, timeout_seconds=60)` — async polling helper with exponential backoff. Accepts sync or async condition functions. 
+- `tests/integration/__init__.py`: `QUERY_TIMEOUT_SEC = 600`, `ASYNC_JOB_TIMEOUT_SEC = 600` +- Test data generators in production code: `core/utils.py` has `make_bogus_data_file()`, `make_bogus_binary_file(n)`, `make_bogus_uuid_file()` + +### No `@pytest.mark.asyncio` needed +`asyncio_mode = auto` in pytest.ini — all async test functions are auto-detected. + +### Python 3.14+ limitation +Sync wrapper smoke tests are skipped on Python 3.14+ — `@async_to_sync` raises `RuntimeError` when an event loop is already active (pytest-asyncio runs one). Users on 3.14+ must call async methods directly. + +## Constraints + +- Unit tests must never make network calls — `pytest-socket` will fail them. Mock all HTTP interactions. +- Integration test cleanup is mandatory — use `schedule_for_cleanup()` for every created resource to avoid orphaned Synapse entities.
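The mocking convention for unit tests (`AsyncMock` for async methods, `create_autospec` for type-safe mocks) can be sketched as below. `ExampleClient`, `fetch_name()`, and the test name are hypothetical and exist only for this illustration; with `asyncio_mode = auto`, pytest would collect the async test function without a marker.

```python
import asyncio
from unittest.mock import AsyncMock, create_autospec


class ExampleClient:
    """Hypothetical stand-in for a client with an async HTTP method."""

    async def rest_get_async(self, uri: str) -> dict:
        raise RuntimeError("network access is not allowed in unit tests")


async def fetch_name(client: ExampleClient, entity_id: str) -> str:
    """Code under test: one HTTP call, then a field lookup."""
    response = await client.rest_get_async(f"/entity/{entity_id}")
    return response["name"]


async def test_fetch_name_uses_mocked_client() -> None:
    # create_autospec mirrors ExampleClient's methods and signatures;
    # async methods on the spec become AsyncMock children.
    client = create_autospec(ExampleClient, instance=True)
    assert isinstance(client.rest_get_async, AsyncMock)
    client.rest_get_async.return_value = {"name": "demo"}

    assert await fetch_name(client, "syn123") == "demo"
    client.rest_get_async.assert_awaited_once_with("/entity/syn123")
```

Because the autospec'd mock checks call signatures against the real class, a test that calls a method with the wrong arguments fails instead of passing silently.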