Skip to content

Add scripts for scraping more models#2

Open
yuqiannemo wants to merge 1 commit intoXtra-Computing:mainfrom
yuqiannemo:dna_dev/more_models
Open

Add scripts for scraping more models#2
yuqiannemo wants to merge 1 commit intoXtra-Computing:mainfrom
yuqiannemo:dna_dev/more_models

Conversation

@yuqiannemo
Copy link

No description provided.

Copy link
Collaborator

@JerryLife JerryLife left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good — a few items to address before merging:

Copy link
Collaborator

@JerryLife JerryLife left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Thanks for adding the model scraping scripts! A few items to address before merging.


1. Output path — write to configs/

The pipeline stores model lists in configs/ following a {source}_llm_list naming convention:

Existing file Purpose
configs/llm_list.txt Generic model list (HuggingFace)
configs/OpenRouter_llm_list.txt OpenRouter model list
configs/llm_metadata.json Model metadata

The new scripts should follow the same convention. Suggested output paths:

  • closed_source_models.pyconfigs/openrouter_llm_list.jsonl
  • open_source_models.pyconfigs/huggingface_llm_list.jsonl

Use a path relative to the script's location so it works regardless of the working directory:

from pathlib import Path

ROOT = Path(__file__).resolve().parent.parent
output_path = ROOT / "configs" / "openrouter_llm_list.jsonl"

2. Add if __name__ == "__main__" guards

Both scripts execute API calls and I/O at module level. This means they cannot be imported without side effects (e.g., for testing or reuse). Please wrap the execution logic:

def main():
    # ... existing logic ...

if __name__ == "__main__":
    main()

3. Add error handling for HTTP requests

closed_source_models.py lines 5–6 — the requests.get() call has no timeout, no status code check, and no exception handling. Suggested fix:

response = requests.get(url, timeout=30)
response.raise_for_status()

4. Declare dependencies

requests and huggingface_hub are not currently declared in the project's dependencies. Either:

  • Add them as optional dependencies in pyproject.toml (e.g., under [project.optional-dependencies]), or
  • Add a comment at the top of each script noting the required packages.

5. modality field is always "unknown"

closed_source_models.py line 43 — architecture.get("modality") does not exist in the OpenRouter API schema. The script already reads input_modalities and output_modalities; consider constructing a modality string from those, e.g.:

input_mods = architecture.get("input_modalities", [])
output_mods = architecture.get("output_modalities", [])
"modality": f"{'|'.join(input_mods)} -> {'|'.join(output_mods)}"

Summary of requested changes

# Item Priority
1 Write output to configs/ with consistent naming (openrouter_llm_list.jsonl, huggingface_llm_list.jsonl) Required
2 Add if __name__ == "__main__" guards Required
3 Add timeout and raise_for_status() to HTTP request Required
4 Declare requests / huggingface_hub dependencies Required
5 Fix dead modality field Minor

@JerryLife
Copy link
Collaborator

Noted that both scripts should be merged into dev branch. Not the main branch.

@JerryLife JerryLife self-assigned this Mar 19, 2026
@JerryLife JerryLife added the enhancement New feature or request label Mar 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants