
Contributing to the GenBio SDK

The GenBio SDK is an internal library for all components and workflows that are part of the virtual cell system.

Workflows are virtual cell capabilities: functions that apply one or more components to answer the canonical virtual cell queries.

Components are the building blocks of workflows. They include model inference functions (e.g., protein_embedding) and data retrieval functions (e.g., gene_reference_genome). These operations rely on backend services for heavy lifting (Inference Engine, Molecular Design Platform, Data Lake).

The GenBio SDK is the contract between the backend services (components), the canonical workflows, and the frontend system. It serves as a focal point for internal developers to document, verify, and use these components and workflows, ensuring they are used correctly and understood by all.

This document describes how the SDK is structured and how to fully integrate your components, services, and workflows.

  1. Integration Prerequisites
  2. Integration Checklist
  3. Project Structure
  4. Step-by-Step Integration Walkthrough
  5. Function Conventions
  6. Testing

Integration Prerequisites

Before SDK integration, your model and/or reference data must be hosted on the Inference Engine, Molecular Design Platform, and/or Data Lake.

For a model or data function to be ready for integration, it must have a standalone codebase with:

  1. Instructions to download any model weights and example data
  2. Instructions to install the codebase and dependencies in a clean environment
  3. An example inference script that takes an example input and produces an example output
  4. Ample description of the model's purpose and use-cases, as well as references to the model's training data, testing data, and performance metrics (if applicable)

A good example codebase is gsRNA-activity-prediction.
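A minimal sketch of item 3, the example inference script, might look like the following. The model call here is a toy stand-in (string length scaled to a score), purely to illustrate the expected "example input in, example output out" shape:

```python
def predict(sequence: str) -> dict:
    """Toy stand-in for the real model inference call."""
    return {"input": sequence, "score": len(sequence) / 100.0}


if __name__ == "__main__":
    # Example input -> example output, as the prerequisites require.
    print(predict("ACGTACGT"))
```

The real script should load the downloaded weights and run actual inference; the point is that a reviewer can run it end-to-end in a clean environment.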

Once this is complete, ping the #vc-system-integration channel and the relevant integration contacts:

| Service | Primary Contact | Email |
| --- | --- | --- |
| Inference Engine | Deepak Kumar | deepak.kumar@genbio.ai |
| Molecular Design Platform | Aleks Kovaltsuk | aleksandr.kovaltsuk@genbio.ai |
| Data Lake | Sammy Agrawal | samarth.agrawal@genbio.ai |

Integration Checklist

Virtual cell integration is defined by top-level workflows that stitch together multiple model and data components.

Once the model and reference data are hosted, integration can start; you are responsible for integrating your assigned question from that point on. Integration is complete when the question is implemented as a workflow in the SDK, with all components calling the correct endpoints and passing the correct parameters.

Every new component or workflow integration requires all of the following:

| # | What | Where |
| --- | --- | --- |
| 1 | Wrapper function | src/genbio/toolkit/aido_models_apis.py, src/genbio/data/aido_datalake_apis.py, or src/genbio/pipelines/<pipeline_name>.py |
| 2 | Export | src/genbio/toolkit/__init__.py or src/genbio/data/__init__.py (import block + __all__) |
| 3 | Pytest | tests/test_model_apis.py, tests/test_datalake.py, or tests/test_pipelines.py, with appropriate markers and fixtures |
| 4 | Documentation | A complete docstring in the wrapper function, plus additions to capabilities, components, and api_reference in docs/ |
| 5 | curl example | scripts/api-examples/<endpoint>.sh or scripts/data-examples/<fn_name>.sh |

The wrapper function is the SDK's contract with internal users and the frontend system — it must have a complete docstring (see Function Conventions). The pytest test must call the live endpoint and use the save_api_response / save_api_latency fixtures.

Project Structure

genbio-sdk/
├── src/genbio/
│   ├── utils.py                    ← Shared HTTP utilities
│   ├── toolkit/
│   │   ├── __init__.py             ← Public exports (add new functions here)
│   │   ├── aido_models_apis.py     ← All AIDO Inference Engine + MDP wrappers
│   │   └── references/             ← Static data files (gene vocabs, track DBs)
│   └── data/
│       ├── __init__.py             ← Public exports for datalake APIs
│       └── aido_datalake_apis.py   ← AIDO Datalake wrappers
├── tests/
│   ├── conftest.py                 ← Fixtures (save_api_response, save_api_latency)
│   ├── test_model_apis.py          ← Tests for all AIDO model functions
│   ├── test_datalake.py            ← Tests for AIDO Datalake functions
│   └── test_assets/                ← h5ad files and other test inputs
├── scripts/
│   ├── api-examples/               ← curl examples for inference/MDP endpoints
│   ├── data-examples/              ← curl examples for datalake endpoints
│   └── stress-tests/               ← Inference engine load tests
├── pyproject.toml
├── .env.example
└── uv.lock

Backend services

The SDK wraps four backend services:

| Service | Env var | Namespace | Description | Port-forward |
| --- | --- | --- | --- | --- |
| Inference Engine | AIDO_INFERENCE_ENGINE_URL | infra | AIDO foundation models (protein, cell, DNA, RNA, tissue, structure prediction) | kubectl port-forward svc/inference-engine-staging-head-ilb -n infra 8000:8000 |
| Structure Utils | STRUCTURE_UTILS_URL | mdp-dev | PDB sequence search for structure generation | kubectl port-forward svc/structure-utils-api-serve-svc -n mdp-dev 8001:8000 |
| Structure Generation | STRUCTURE_GENERATION_URL | mdp-dev | Antibody structure diffusion model | kubectl port-forward svc/structure-generation-api-serve-svc -n mdp-dev 8002:8000 |
| Datalake | AIDO_DATALAKE_URL | datalake-dev | Biological reference data | kubectl port-forward svc/aido-datalake-service -n datalake-dev 8003:80 |

All four run on the virtual-lab-test GKE cluster. See the README quickstart for port-forward instructions. Most of the heavy lifting should be done by these services; new functionality often depends on upstream endpoints the SDK can call.

  • If you need to integrate a new dataset that shouldn’t be downloaded at runtime, contact Sammy for integration into the data service.

  • If you need to integrate a new model which must be hosted, contact Deepak for integration into the inference engine.

Afterward, you should have the resources to add any virtual cell workflows using these components.

Frontend services

The GenBio SDK is used by the frontend system to route user queries to the correct workflows, providing the canonical virtual cell capabilities.

The router exposes this API schema, which routes each query to a SDK workflow. The contract between the workflows and the frontend enables us to continue developing the frontend and backend independently.


Step-by-Step Integration Walkthrough

Step 0 — Developer setup

git clone git@gitlab.genbio.ai:virtual-cell-system/genbio-sdk.git
cd genbio-sdk
uv venv
source .venv/bin/activate
uv sync --extra dev

Or with pip:

pip install -e ".[dev]"

Set up backend access by following the README quickstart (port-forwards + env vars).
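Before moving on, it can help to sanity-check that the service URLs from the backend services table are actually set in your environment. This is an optional convenience snippet, not part of the SDK:

```python
import os

# Env var names come from the backend services table in this document.
required = [
    "AIDO_INFERENCE_ENGINE_URL",
    "STRUCTURE_UTILS_URL",
    "STRUCTURE_GENERATION_URL",
    "AIDO_DATALAKE_URL",
]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    print(f"Missing env vars: {', '.join(missing)} -- re-check your .env")
else:
    print("Backend env vars look good")
```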

Step 1 — Add the function to aido_models_apis.py

Each function wraps exactly one model endpoint. Follow this pattern:

import os
from typing import Any

import requests

from genbio.utils import raise_for_status_with_context


def new_model_function(
    query: str | list[str],
    param: str = "default",
) -> dict[str, Any]:
    """
    One-line description of what the model does.

    Notes:
        Background on the model — architecture, training data, capabilities.

    Args:
        query: Description of the input.
        param: Description of the parameter.

    Returns:
        A dictionary with the following fields:
        - "model_name": The model identifier.
        - "return_code": Integer status code (0 = success).
        - "output": A dictionary containing model outputs.
        - "parameters": Echo of the input parameters.
    """
    base_url = os.environ["AIDO_INFERENCE_ENGINE_URL"].rstrip("/")
    response = requests.post(
        f"{base_url}/category/models/Model.Name",
        json={"query": query, "param": param},
        headers={
            "X-API-Key": os.environ["AIDO_INFERENCE_ENGINE_API_KEY"],
            "Content-Type": "application/json",
        },
    )
    raise_for_status_with_context(response)
    return response.json()

Key rules:

  • Read URLs and keys from os.environ (not function parameters)
  • Use raise_for_status_with_context() for error handling
  • Return the raw JSON dict — don't post-process unless the function's purpose requires it
  • For file uploads (h5ad), use requests.post(..., files={"file": open(path, "rb")}, data={...})

Step 2 — Export from __init__.py

Add the function to both the import block and the __all__ list in src/genbio/toolkit/__init__.py, under the appropriate category comment.

Step 3 — Add a test

Add a test to tests/test_model_apis.py:

import time

import pytest

from genbio.toolkit import new_model_function


@pytest.mark.inference_engine
def test_new_model_function(save_api_response, save_api_latency):
    start = time.time()
    resp = new_model_function("input_data")
    save_api_latency(time.time() - start)
    assert isinstance(resp, dict)
    save_api_response(resp)

Use @pytest.mark.inference_engine for inference engine endpoints or @pytest.mark.mdp for MDP endpoints.

Step 4 — Add a shell script example

Add a curl example in scripts/api-examples/:

#!/bin/bash
curl -X POST "${AIDO_INFERENCE_ENGINE_URL}/category/models/Model.Name" \
  -H "X-API-Key: ${AIDO_INFERENCE_ENGINE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"query": "INPUT", "param": "default"}' \
  | jq

Step 5 — Update README

Add the function to the Available Functions table in README.md under the appropriate category.

Step 6 — Update the docs site

The documentation site is built with mkdocs and lives in docs/. API reference pages use mkdocstrings directives (::: module.path.function_name) to auto-generate docs from your docstrings — no manual copy-paste needed.

Add a ::: directive for your new function in the correct file(s) based on which backend it targets:

| Backend | API Reference file | Also appears in |
| --- | --- | --- |
| Inference Engine (AIDO_INFERENCE_ENGINE_URL) | docs/docs/api_reference/inference_engine.md | docs/docs/components/inference_engine.md |
| MDP (STRUCTURE_UTILS_URL / STRUCTURE_GENERATION_URL) | docs/docs/api_reference/mdp.md | docs/docs/components/mdp.md |
| Data Lake (AIDO_DATALAKE_URL) | docs/docs/api_reference/data_lake.md | docs/docs/components/data_lake.md |
| Workflows | docs/docs/api_reference/workflows.md | docs/docs/capabilities/workflows.md |
| 3rd Party (BioMart, Semantic Scholar, etc.) | docs/docs/api_reference/third_party.md | |

Place the directive under the appropriate section heading. For example, to add a new protein function to the Inference Engine page:

## Protein

::: genbio.toolkit.aido_models_apis.protein_embedding

::: genbio.toolkit.aido_models_apis.your_new_function

The Components pages (docs/docs/components/) mirror the API Reference pages — update both so the function appears in all views. To preview locally:

uv sync --extra docs
mkdocs serve --config-file docs/mkdocs.yml

Function Conventions

Naming

  • Embedding functions: {modality}_embedding_{size} (e.g., cell_embedding_small)
  • Query/prediction functions: {modality}_{task} (e.g., protein_stability, rna_secondary_structure)
  • Vocab helpers: {function}_gene_vocab (e.g., embedding_gene_vocab)
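If it helps, the conventions above can be checked mechanically. This is an informal sketch (the regexes are my approximation of the rules, not part of the SDK):

```python
import re

# {modality}_embedding_{size}; the size suffix is optional (e.g. protein_embedding).
EMBEDDING = re.compile(r"^[a-z0-9]+_embedding(_[a-z0-9]+)?$")
# {modality}_{task}: plain snake_case with at least two segments.
TASK = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)+$")

assert EMBEDDING.match("cell_embedding_small")
assert EMBEDDING.match("protein_embedding")
assert not EMBEDDING.match("CellEmbedding")
assert TASK.match("rna_secondary_structure")
```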

Docstrings

Every function has a Google-style docstring with:

  • One-line summary
  • Notes: section with model background (architecture, training data, performance)
  • Args: with type and description for each parameter
  • Returns: with field-by-field description of the response dict

Error handling

  • Use raise_for_status_with_context() from genbio.utils — it enhances HTTP errors with the response body
  • Don't catch exceptions inside wrapper functions; let them propagate
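The helper's behavior can be sketched roughly as follows. This is a duck-typed stand-in, not the real genbio.utils implementation (which works with requests' HTTPError):

```python
from types import SimpleNamespace


class HTTPErrorWithContext(Exception):
    """Stand-in error type for the sketch."""


def raise_for_status_with_context(response) -> None:
    # The helper's contract: on 4xx/5xx, raise an error that includes the
    # response body, which a plain raise_for_status() would omit.
    if 400 <= response.status_code < 600:
        raise HTTPErrorWithContext(
            f"{response.status_code} error for {response.url}: {response.text}"
        )


# Any object with status_code/url/text works for the sketch:
bad = SimpleNamespace(
    status_code=404, url="http://example/x", text='{"detail": "not found"}'
)
```

Because the body is carried in the exception message, letting it propagate out of the wrapper gives callers the full error context for free.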

Testing

Running tests

# All inference engine tests
pytest tests/test_model_apis.py -m inference_engine -v

# All MDP tests
pytest tests/test_model_apis.py -m mdp -v

# Single test
pytest tests/test_model_apis.py -k test_protein_embedding -v

# Full suite
pytest tests/ -v

Test fixtures

  • save_api_response(resp) — saves the response JSON to tests/responses/{test_name}.json
  • save_api_latency(duration) — records latency to tests/responses/api_latency.csv

Both fixtures are optional but recommended for all API tests.
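Conceptually, the fixtures behave something like the sketch below. This is a hedged approximation, not the actual conftest.py code (the real versions are pytest fixtures that infer the test name automatically):

```python
import csv
import json
from pathlib import Path


def save_api_response(resp: dict, test_name: str, out_dir: str = "tests/responses") -> Path:
    # Persist the raw JSON response under tests/responses/{test_name}.json
    path = Path(out_dir) / f"{test_name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(resp, indent=2))
    return path


def save_api_latency(duration: float, test_name: str, out_dir: str = "tests/responses") -> Path:
    # Append one row per measurement to tests/responses/api_latency.csv
    path = Path(out_dir) / "api_latency.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", newline="") as fh:
        csv.writer(fh).writerow([test_name, f"{duration:.3f}"])
    return path
```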

Pytest markers

| Marker | Description |
| --- | --- |
| inference_engine | Tests that call the AIDO Inference Engine |
| mdp | Tests that call MDP (structure generation) services |
| datalake | Tests that call the AIDO Datalake |