
Add storage finder facets and update CI workflow permissions#281

Open
s-sajid-ali wants to merge 43 commits into main from ci/storage-finder-cron-fix

Conversation

@s-sajid-ali
Member

@s-sajid-ali s-sajid-ali commented Feb 11, 2026

This PR updates the implementation that populates the data finder data so that it accounts for all facets, and updates the CI workflow permissions. On a related note, the source Google Sheet URL has also been updated.

@github-actions

github-actions bot commented Feb 11, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NYU-RTS.github.io/rts-docs/pr-preview/pr-281/

Built to branch gh-pages at 2026-03-09 16:45 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@s-sajid-ali
Member Author

While the initial implementation of the storage finder used data from the CSV sheet, the generated JSON file was subsequently edited (for instance, in commit d23fe45). Per Claude, here is a summary of the changes that need to be made in the CSV file: https://gist.github.com/s-sajid-ali/33ce8a6488db28582d7ccba462e46bff

@s-sajid-ali s-sajid-ali force-pushed the ci/storage-finder-cron-fix branch from 3af04ac to 7d91a7b Compare February 18, 2026 20:46
s-sajid-ali and others added 6 commits March 4, 2026 14:55
Add synchronous access and alumni access facets to the storage finder configuration. Update workflow to include explicit permissions for improved security. Regenerate data files and update dependencies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…et tree

Improved the clarity of access permission descriptions in the config, reorganized the facet tree with numeric IDs, added contextual descriptions for the risk classification and affiliation questions, and reordered questions for better user flow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Restore descriptions from main branch for "What is the risk classification of your data?" and "What is your University affiliation?" facets to improve user guidance in the storage finder UI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@s-sajid-ali s-sajid-ali force-pushed the ci/storage-finder-cron-fix branch from 7d91a7b to 14978d6 Compare March 4, 2026 19:55
@s-sajid-ali
Member Author

Deleted the Spatial Data Repository option, as it is being sunset, with ultraviolet taking over existing content, per Deb Verhoff in the source sheet.

@s-sajid-ali
Member Author

Per analysis by @Amanda-dong: 7fdcefd removed Research Archival Storage (which is a duplicate of HPC Archive). The new sheet has been updated to account for this.

s-sajid-ali and others added 8 commits March 4, 2026 15:11
Adds a new "From where will the data be accessed?" question with four
choices (VPN, Public Cloud, Off Campus, Browser GUI), driven by the new
"Access locations" CSV column.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The access-location facet was missing a corresponding field definition,
so access location data was not included in service records' field_data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace loose word-boundary patterns with patterns that require each
keyword to appear as a standalone comma-separated item, preventing
"VPN" from matching within embedded text and ensuring any combination
of access locations is handled correctly without hardcoding.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
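
The commit above describes replacing loose word-boundary patterns with ones that accept a keyword only as a complete comma-separated item. A minimal sketch of that idea follows. The names `escapeRegExp`, `itemPattern`, and `matchAccessLocations` are hypothetical and not taken from the repository, and the actual patterns in the PR may differ.

```typescript
// Hypothetical sketch; none of these names come from the PR itself.
const ACCESS_LOCATIONS: string[] = ["VPN", "Public Cloud", "Off Campus", "Browser GUI"];

// Escape regex metacharacters so a keyword is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Match `keyword` only when it is a whole item: preceded by the start of the
// cell or a comma, and followed by a comma or the end of the cell.
function itemPattern(keyword: string): RegExp {
  return new RegExp("(?:^|,)\\s*" + escapeRegExp(keyword) + "\\s*(?:,|$)");
}

// Return every known access location listed in a cell, in any combination.
function matchAccessLocations(cell: string): string[] {
  return ACCESS_LOCATIONS.filter((keyword) => itemPattern(keyword).test(cell));
}
```

With this sketch, `matchAccessLocations("VPN, Off Campus")` yields both items, while a cell such as "Requires VPN client" yields none, whereas a loose `\bVPN\b` pattern would have matched it.
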
The Google Sheet export includes a line break inside the column header
"Access locations (VPN, Public Cloud, \nOff Campus, Browser GUI)", causing
row lookups to return undefined for every service and triggering the
fallback: "all" for all access-location facets regardless of actual data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
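
The failure described above comes from looking up a column by its single-line header text when the exported header contains an embedded line break. One possible remedy, sketched here under the assumption that each CSV row is parsed into an object keyed by header text, is to normalize whitespace in the headers before any lookup. `normalizeHeader`, `normalizeRow`, and the sample row are hypothetical; the PR may fix this differently.

```typescript
// Hypothetical sketch; the row shape and helper names are assumptions.
type Row = { [header: string]: string };

// Collapse every run of whitespace, including embedded line breaks, to one space.
function normalizeHeader(header: string): string {
  return header.replace(/\s+/g, " ").trim();
}

// Re-key a parsed CSV row so lookups can use the single-line header text.
function normalizeRow(row: Row): Row {
  const normalized: Row = {};
  for (const header of Object.keys(row)) {
    const value = row[header];
    if (value !== undefined) {
      normalized[normalizeHeader(header)] = value;
    }
  }
  return normalized;
}

// Header as a lookup would spell it, versus the key exported with a line break.
const ACCESS_HEADER = "Access locations (VPN, Public Cloud, Off Campus, Browser GUI)";
const exportedRow: Row = {
  "Service": "Example service", // placeholder data, not from the sheet
  "Access locations (VPN, Public Cloud, \nOff Campus, Browser GUI)": "VPN, Off Campus",
};
```

Here `exportedRow[ACCESS_HEADER]` is `undefined`, which is the condition that triggered the fallback, while `normalizeRow(exportedRow)[ACCESS_HEADER]` returns the cell value.
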
@s-sajid-ali
Member Author

Removed the option "From workstations or laptops" from the access-location facet (which corresponds to the question "From where will the data be accessed?"), since it did not meaningfully differentiate among the options.

s-sajid-ali and others added 2 commits March 6, 2026 12:04
Replace outdated Drupal-based documentation with accurate description
of the Google Sheets CSV generator workflow, automated GitHub Actions
weekly sync, current questions, service fields, and matching logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@s-sajid-ali
Member Author

s-sajid-ali commented Mar 6, 2026

Pinging @remram44 @VickyRampin to review the data services your team offers. In particular, the Ceph service has no information about access policies. This is the source sheet. Please let me know if you do not have access to edit it and I'll grant it.

@remram44
Member

remram44 commented Mar 6, 2026

The storage finder can't be found from the search bar, only the bottom of the front page: https://services.rt.nyu.edu/storage-finder/

What does "synchronous access" mean? I see it's "yes" for HPC scratch and Qualtrics but "no" for research workspace. I suggest removing that line entirely if we don't know what it is, as users won't know either.

"Alumni access" is also yes for scratch, so maybe it should be yes for Ceph? Obviously you need to have a sponsor either way; alumni don't automatically get access to HPC scratch.

At this point Ceph is accessible in 3 different ways:

  • through data transfer nodes via SSH/sshfs/rsync
  • through Nextcloud in the browser or via a WebDAV client
  • through S3

Maybe those should be separate rows, depending on what "synchronous access" means and the amount of detail we want in the "permission settings" column?

@s-sajid-ali
Member Author

s-sajid-ali commented Mar 6, 2026

> The storage finder can't be found from the search bar, only the bottom of the front page: https://services.rt.nyu.edu/storage-finder/

Correct, it is not indexed by the client-side search engine we currently use, and I don't think there's a way to add non-local URLs for that search index to crawl. We could do that if we switch to Algolia (in #298) by adding the source URL for the Google Sheet. Alternatively, we could convert that sheet to markdown, add a new page at https://services.rt.nyu.edu/storage-finder-data/, and let the client-side search indexer crawl that markdown page.

> What does "synchronous access" mean? I see it's "yes" for HPC scratch and Qualtrics but "no" for research workspace. I suggest removing that line entirely if we don't know what it is, as users won't know either.

I'm okay with that. @genericdata: What was this column originally meant to indicate?

> "Alumni access" is also yes for scratch, so maybe it should be yes for Ceph? Obviously you need to have a sponsor either way; alumni don't automatically get access to HPC scratch.

Allowed that facet for Ceph for consistency and yes, we'll have to point to the access policy somewhere.

s-sajid-ali and others added 9 commits March 6, 2026 17:25
…alues

Update anchor-based regex matchers to handle leading/trailing whitespace
and newlines in spreadsheet cell values. Adds multiline flag (m) so ^/$
match line boundaries, and adds \s* around anchors to absorb surrounding
whitespace.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
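
The change described above can be illustrated with a small sketch that contrasts a strict anchored pattern with one that adds the `m` flag and `\s*` around the anchors. `escapeRegExp`, `lineMatcher`, and the sample patterns are hypothetical; the matchers in the PR are not shown in this conversation and may look different.

```typescript
// Hypothetical sketch of a whitespace-tolerant, anchor-based matcher.

// Escape regex metacharacters so the expected value is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Accept `value` when it fills a whole line of the cell, ignoring surrounding
// whitespace. The "m" flag makes ^ and $ match at line boundaries as well as
// at the start and end of the string.
function lineMatcher(value: string): RegExp {
  return new RegExp("^\\s*" + escapeRegExp(value) + "\\s*$", "m");
}

// A strict pattern rejects stray whitespace; the tolerant one absorbs it.
const strictVpn = /^VPN$/;
const tolerantVpn = lineMatcher("VPN");
```

For a cell value of `" VPN\n"`, `strictVpn.test` returns `false` and `tolerantVpn.test` returns `true`, while "Requires VPN client" is still rejected by both. One consequence of the `m` flag worth keeping in mind is that a multi-line cell matches if any single line equals the value.
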
…e suffixes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…capacity matchers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@remram44
Member

remram44 commented Mar 6, 2026

I mean that you can't find this page with the search, not the Google sheet. "data finder" or "storage" should probably point to the storage finder.

What does "alumni access" mean? That it is possible for an alum to have access if they get a researcher sponsor? In that case, the answer is "yes" for a lot more options (S3, HPC RPS, Data Lake, probably Google Shared Drive and Research Workspace too).

@s-sajid-ali
Member Author

> I mean that you can't find this page with the search, not the Google sheet. "data finder" or "storage" should probably point to the storage finder.

I'll take a look at indexing that page.

> What does "alumni access" mean? That it is possible for an alum to have access if they get a researcher sponsor? In that case, the answer is "yes" for a lot more options (S3, HPC RPS, Data Lake, probably Google Shared Drive and Research Workspace too).

I agree and have removed it now. I was so focused on moving to ingesting the data from the Google sheet that I didn't really think about which data made sense to move.

@remram44
Member

remram44 commented Mar 9, 2026

I think something happened with the risk rating: the table now only shows "Storable Files: High", which is not as clear as the previous "Storable Files: High, Moderate, & Low Risk". The word "risk" should be present.

@remram44
Member

remram44 commented Mar 9, 2026

A lot of other facets lost details. For example, "backup" changed for Box from "Retains up to 100 previous versions of a single file" to "yes" (lost details) and for S3 from "available for additional cost" to "yes" (incorrect).

s-sajid-ali and others added 2 commits March 9, 2026 11:45
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2 participants