
Contributing to the GenBio SDK

The GenBio SDK is an internal library for all components and workflows that are part of the virtual cell system.

Workflows are virtual cell capabilities: functions that apply one or more components to answer the canonical virtual cell queries.

Components are the building blocks of workflows. They include model inference functions (e.g., protein_embedding) and data retrieval functions (e.g., gene_reference_genome). These operations rely on backend services for heavy lifting (Inference Engine, Molecular Design Platform, Data Lake).

The GenBio SDK is the contract between the backend services (components), the canonical workflows, and the frontend system. It serves as a focal point for internal developers to document, verify, and use these components and workflows, ensuring they are used correctly and understood by all.

This document describes how the SDK is structured and how to fully integrate your components, services, and workflows.

  1. Integration Prerequisites
  2. Integration Checklist
  3. Project Structure
  4. Step-by-Step Integration Walkthrough
  5. Function Conventions
  6. Testing

Integration Prerequisites

Before SDK integration, your model and/or reference data must be hosted on the Inference Engine, Molecular Design Platform, and/or Data Lake.

For a model or data function to be ready for integration, it must have a standalone codebase with:

  1. Instructions to download any model weights and example data
  2. Instructions to install the codebase and dependencies in a clean environment
  3. An example inference script that takes an example input and produces an example output
  4. Ample description of the model's purpose and use-cases, as well as references to the model's training data, testing data, and performance metrics (if applicable)

A good example codebase is gsRNA-activity-prediction.
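A minimal sketch of item 3, the example inference script, might look like the following. The model call here is a toy stand-in (string length scaled to a score), purely to illustrate the expected "example input in, example output out" shape:

```python
def predict(sequence: str) -> dict:
    """Toy stand-in for the real model inference call."""
    return {"input": sequence, "score": len(sequence) / 100.0}


if __name__ == "__main__":
    # Example input -> example output, as the prerequisites require.
    print(predict("ACGTACGT"))
```

The real script should load the downloaded weights and run actual inference; the point is that a reviewer can run it end-to-end in a clean environment.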

Once this is complete, ping the #vc-system-integration channel and the relevant integration contacts:

| Service | Primary Contact | Email |
| --- | --- | --- |
| Inference Engine | Deepak Kumar | deepak.kumar@genbio.ai |
| Molecular Design Platform | Aleks Kovaltsuk | aleksandr.kovaltsuk@genbio.ai |
| Data Lake | Sammy Agrawal | samarth.agrawal@genbio.ai |

Integration Checklist

Virtual cell integration is defined by top-level workflows that stitch together multiple model and data components.

Once the model and reference data are hosted, integration can start; you are responsible for integrating your assigned question from that point on. Integration is complete when the question is implemented as a workflow in the SDK, with all components calling the correct endpoints and passing the correct parameters.

Every new component or workflow integration requires all of the following:

| # | What | Where |
| --- | --- | --- |
| 1 | Wrapper function | src/genbio/toolkit/aido_models_apis.py, src/genbio/data/aido_datalake_apis.py, or src/genbio/pipelines/<pipeline_name>.py |
| 2 | Export | src/genbio/toolkit/__init__.py or src/genbio/data/__init__.py (import block + __all__) |
| 3 | Pytest | tests/test_model_apis.py, tests/test_datalake.py, or tests/test_pipelines.py, with appropriate markers and fixtures |
| 4 | Documentation | A complete docstring in the wrapper function, plus additions to capabilities, components, and api_reference in docs/ |
| 5 | curl example | scripts/api-examples/<endpoint>.sh or scripts/data-examples/<fn_name>.sh |

The wrapper function is the SDK's contract with internal users and the frontend system — it must have a complete docstring (see Function Conventions). The pytest test must call the live endpoint and use the save_api_response / save_api_latency fixtures.

Project Structure

genbio-sdk/
├── src/genbio/
│   ├── utils.py                    ← Shared HTTP utilities
│   ├── toolkit/
│   │   ├── __init__.py             ← Public exports (add new functions here)
│   │   ├── aido_models_apis.py     ← All AIDO Inference Engine + MDP wrappers
│   │   └── references/             ← Static data files (gene vocabs, track DBs)
│   └── data/
│       ├── __init__.py             ← Public exports for datalake APIs
│       └── aido_datalake_apis.py   ← AIDO Datalake wrappers
├── tests/
│   ├── conftest.py                 ← Fixtures (save_api_response, save_api_latency)
│   ├── test_model_apis.py          ← Tests for all AIDO model functions
│   ├── test_datalake.py            ← Tests for AIDO Datalake functions
│   └── test_assets/                ← h5ad files and other test inputs
├── scripts/
│   ├── api-examples/               ← curl examples for inference/MDP endpoints
│   ├── data-examples/              ← curl examples for datalake endpoints
│   └── stress-tests/               ← Inference engine load tests
├── pyproject.toml
├── .env.example
└── uv.lock

Backend services

The SDK wraps four backend services:

| Service | Env var | Namespace | Description | Port-forward |
| --- | --- | --- | --- | --- |
| Inference Engine | AIDO_INFERENCE_ENGINE_URL | infra | AIDO foundation models (protein, cell, DNA, RNA, tissue, structure prediction) | kubectl port-forward svc/inference-engine-staging-head-ilb -n infra 8000:8000 |
| Structure Utils | STRUCTURE_UTILS_URL | mdp-dev | PDB sequence search for structure generation | kubectl port-forward svc/structure-utils-api-serve-svc -n mdp-dev 8001:8000 |
| Structure Generation | STRUCTURE_GENERATION_URL | mdp-dev | Antibody structure diffusion model | kubectl port-forward svc/structure-generation-api-serve-svc -n mdp-dev 8002:8000 |
| Datalake | AIDO_DATALAKE_URL | datalake-dev | Biological reference data | kubectl port-forward svc/aido-datalake-service -n datalake-dev 8003:80 |

All four run on the virtual-lab-test GKE cluster. See the README quickstart for port-forward instructions. Most of the heavy lifting should be done by these services; new functionality often depends on upstream endpoints the SDK can call.

  • If you need to integrate a new dataset that shouldn’t be downloaded at runtime, contact Sammy for integration into the data service.

  • If you need to integrate a new model which must be hosted, contact Deepak for integration into the inference engine.

Afterward, you should have the resources to add any virtual cell workflows using these components.

Frontend services

The GenBio SDK is used by the frontend system to route user queries to the correct workflows, providing the canonical virtual cell capabilities.

The router exposes this API schema, which routes each query to a SDK workflow. The contract between the workflows and the frontend enables us to continue developing the frontend and backend independently.


Step-by-Step Integration Walkthrough

Step 0 — Developer setup

git clone git@gitlab.genbio.ai:virtual-cell-system/genbio-sdk.git
cd genbio-sdk
uv venv
source .venv/bin/activate
uv sync --extra dev

Or with pip:

pip install -e ".[dev]"

Set up backend access by following the README quickstart (port-forwards + env vars).
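Before moving on, it can help to sanity-check that the service URLs from the backend services table are actually set in your environment. This is an optional convenience snippet, not part of the SDK:

```python
import os

# Env var names come from the backend services table in this document.
required = [
    "AIDO_INFERENCE_ENGINE_URL",
    "STRUCTURE_UTILS_URL",
    "STRUCTURE_GENERATION_URL",
    "AIDO_DATALAKE_URL",
]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    print(f"Missing env vars: {', '.join(missing)} -- re-check your .env")
else:
    print("Backend env vars look good")
```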

Step 1 — Add the function to aido_models_apis.py

Each function wraps exactly one model endpoint. Follow this pattern:

import os
from typing import Any

import requests

from genbio.utils import raise_for_status_with_context


def new_model_function(
    query: str | list[str],
    param: str = "default",
) -> dict[str, Any]:
    """
    One-line description of what the model does.

    Notes:
        Background on the model — architecture, training data, capabilities.

    Args:
        query: Description of the input.
        param: Description of the parameter.

    Returns:
        A dictionary with the following fields:
        - "model_name": The model identifier.
        - "return_code": Integer status code (0 = success).
        - "output": A dictionary containing model outputs.
        - "parameters": Echo of the input parameters.
    """
    base_url = os.environ["AIDO_INFERENCE_ENGINE_URL"].rstrip("/")
    response = requests.post(
        f"{base_url}/category/models/Model.Name",
        json={"query": query, "param": param},
        headers={
            "X-API-Key": os.environ["AIDO_INFERENCE_ENGINE_API_KEY"],
            "Content-Type": "application/json",
        },
    )
    raise_for_status_with_context(response)
    return response.json()

Key rules:

  • Read URLs and keys from os.environ (not function parameters)
  • Use raise_for_status_with_context() for error handling
  • Return the raw JSON dict — don't post-process unless the function's purpose requires it
  • For file uploads (h5ad), use requests.post(..., files={"file": open(path, "rb")}, data={...})

Step 2 — Export from __init__.py

Add the function to both the import block and the __all__ list in src/genbio/toolkit/__init__.py, under the appropriate category comment.

Step 3 — Add a test

Add a test to tests/test_model_apis.py:

import time

import pytest

from genbio.toolkit import new_model_function


@pytest.mark.inference_engine
def test_new_model_function(save_api_response, save_api_latency):
    start = time.time()
    resp = new_model_function("input_data")
    save_api_latency(time.time() - start)
    assert isinstance(resp, dict)
    save_api_response(resp)

Use @pytest.mark.inference_engine for inference engine endpoints or @pytest.mark.mdp for MDP endpoints.

Step 4 — Add a shell script example

Add a curl example in scripts/api-examples/:

#!/bin/bash
curl -X POST "${AIDO_INFERENCE_ENGINE_URL}/category/models/Model.Name" \
  -H "X-API-Key: ${AIDO_INFERENCE_ENGINE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"query": "INPUT", "param": "default"}' \
  | jq

Step 5 — Update README

Add the function to the Available Functions table in README.md under the appropriate category.

Step 6 — Update the docs site

The documentation site is built with mkdocs and lives in docs/. API reference pages use mkdocstrings directives (::: module.path.function_name) to auto-generate docs from your docstrings — no manual copy-paste needed.

Add a ::: directive for your new function in the correct file(s) based on which backend it targets:

| Backend | API Reference file | Also appears in |
| --- | --- | --- |
| Inference Engine (AIDO_INFERENCE_ENGINE_URL) | docs/docs/api_reference/inference_engine.md | docs/docs/components/inference_engine.md |
| MDP (STRUCTURE_UTILS_URL / STRUCTURE_GENERATION_URL) | docs/docs/api_reference/mdp.md | docs/docs/components/mdp.md |
| Data Lake (AIDO_DATALAKE_URL) | docs/docs/api_reference/data_lake.md | docs/docs/components/data_lake.md |
| Workflows | docs/docs/api_reference/workflows.md | docs/docs/capabilities/workflows.md |
| 3rd Party (BioMart, Semantic Scholar, etc.) | docs/docs/api_reference/third_party.md | |

Place the directive under the appropriate section heading. For example, to add a new protein function to the Inference Engine page:

## Protein

::: genbio.toolkit.aido_models_apis.protein_embedding

::: genbio.toolkit.aido_models_apis.your_new_function

The Components pages (docs/docs/components/) mirror the API Reference pages — update both so the function appears in all views. To preview locally:

uv sync --extra docs
mkdocs serve --config-file docs/mkdocs.yml

Function Conventions

Naming

  • Embedding functions: {modality}_embedding_{size} (e.g., cell_embedding_small)
  • Query/prediction functions: {modality}_{task} (e.g., protein_stability, rna_secondary_structure)
  • Vocab helpers: {function}_gene_vocab (e.g., embedding_gene_vocab)
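If it helps, the conventions above can be checked mechanically. This is an informal sketch (the regexes are my approximation of the rules, not part of the SDK):

```python
import re

# {modality}_embedding_{size}; the size suffix is optional (e.g. protein_embedding).
EMBEDDING = re.compile(r"^[a-z0-9]+_embedding(_[a-z0-9]+)?$")
# {modality}_{task}: plain snake_case with at least two segments.
TASK = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)+$")

assert EMBEDDING.match("cell_embedding_small")
assert EMBEDDING.match("protein_embedding")
assert not EMBEDDING.match("CellEmbedding")
assert TASK.match("rna_secondary_structure")
```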

Docstrings

Every function has a Google-style docstring with:

  • One-line summary
  • Notes: section with model background (architecture, training data, performance)
  • Args: with type and description for each parameter
  • Returns: with field-by-field description of the response dict

Error handling

  • Use raise_for_status_with_context() from genbio.utils — it enhances HTTP errors with the response body
  • Don't catch exceptions inside wrapper functions; let them propagate
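The helper's behavior can be sketched roughly as follows. This is a duck-typed stand-in, not the real genbio.utils implementation (which works with requests' HTTPError):

```python
from types import SimpleNamespace


class HTTPErrorWithContext(Exception):
    """Stand-in error type for the sketch."""


def raise_for_status_with_context(response) -> None:
    # The helper's contract: on 4xx/5xx, raise an error that includes the
    # response body, which a plain raise_for_status() would omit.
    if 400 <= response.status_code < 600:
        raise HTTPErrorWithContext(
            f"{response.status_code} error for {response.url}: {response.text}"
        )


# Any object with status_code/url/text works for the sketch:
bad = SimpleNamespace(
    status_code=404, url="http://example/x", text='{"detail": "not found"}'
)
```

Because the body is carried in the exception message, letting it propagate out of the wrapper gives callers the full error context for free.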

Testing

Running tests

# All inference engine tests
pytest tests/test_model_apis.py -m inference_engine -v

# All MDP tests
pytest tests/test_model_apis.py -m mdp -v

# Single test
pytest tests/test_model_apis.py -k test_protein_embedding -v

# Full suite
pytest tests/ -v

Test fixtures

  • save_api_response(resp) — saves the response JSON to tests/responses/{test_name}.json
  • save_api_latency(duration) — records latency to tests/responses/api_latency.csv

Both fixtures are optional but recommended for all API tests.
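Conceptually, the fixtures behave something like the sketch below. This is a hedged approximation, not the actual conftest.py code (the real versions are pytest fixtures that infer the test name automatically):

```python
import csv
import json
from pathlib import Path


def save_api_response(resp: dict, test_name: str, out_dir: str = "tests/responses") -> Path:
    # Persist the raw JSON response under tests/responses/{test_name}.json
    path = Path(out_dir) / f"{test_name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(resp, indent=2))
    return path


def save_api_latency(duration: float, test_name: str, out_dir: str = "tests/responses") -> Path:
    # Append one row per measurement to tests/responses/api_latency.csv
    path = Path(out_dir) / "api_latency.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", newline="") as fh:
        csv.writer(fh).writerow([test_name, f"{duration:.3f}"])
    return path
```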

Pytest markers

| Marker | Description |
| --- | --- |
| inference_engine | Tests that call the AIDO Inference Engine |
| mdp | Tests that call MDP (structure generation) services |
| datalake | Tests that call the AIDO Datalake |