
Inference Engine

AIDO foundation model endpoints served by the Inference Engine (AIDO_INFERENCE_ENGINE_URL).

Protein

genbio.toolkit.aido_models_apis.protein_embedding

protein_embedding(query: str | list[str], pooling: Literal['mean', 'max', 'min', 'none'] = 'mean') -> dict[str, Any]

Compute protein sequence embeddings from an amino acid sequence.

Notes

This function accesses the state-of-the-art AIDO.Protein-16B model, a bidirectional transformer encoder trained via masked language modeling on more than 1.2 trillion amino acids from UniRef90 and ColabFoldDB. The model operates on single amino acid sequences and produces contextual representations for downstream tasks such as embedding-based similarity search, clustering, and training downstream models. No task head is applied; this endpoint exposes backbone embedding inference only. It can process up to 1023 amino acids per sequence and returns either a pooled protein-level embedding when pooling is "mean", "max", or "min", or residue-level embeddings when pooling is "none".

Parameters:

Name Type Description Default
query str | list[str]

A single protein sequence (string of amino acid tokens) or a list of sequences. Sequences are tokenized at single-amino-acid resolution using the model's fixed vocabulary of canonical amino acids.

required
pooling Literal['mean', 'max', 'min', 'none']

Strategy to aggregate token-level representations into a sequence-level embedding. Options:
  • "mean": mean pooling over sequence tokens (default)
  • "max": max pooling
  • "min": min pooling
  • "none": return token-level embeddings without pooling

'mean'

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "embedding": A nested list of floats containing the computed embeddings. For pooled outputs, this has shape [N, D], where N is the number of input sequences and D is the model hidden size (2304). For unpooled output ("none"), embeddings are returned at token resolution with shape [N, L+1, D], where L is the sequence length (padded to the maximum in the batch) and the +1 accounts for a prepended CLS token.
  • "shape": A list specifying the tensor shape of the embedding output, typically [N, D] for pooled embeddings, or [N, max(L)+1, D] for unpooled.
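
As a minimal sketch of the embedding-based similarity search mentioned in the notes, the pooled [N, D] output can be compared row by row with cosine similarity. The response below is a hand-made stand-in rather than real model output (D is shortened from 2304 to 3), and the helper names are illustrative only, not part of the toolkit.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pairwise_similarity(response):
    """Return an N x N similarity matrix for a pooled ([N, D]) response."""
    if len(response["shape"]) != 2:
        raise ValueError("expected pooled embeddings with shape [N, D]")
    vectors = response["embedding"]
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical pooled response for three sequences (not real model output).
mock_response = {
    "embedding": [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 0.0, 0.0]],
    "shape": [3, 3],
}
sim = pairwise_similarity(mock_response)
```

Unpooled output ("none") has three dimensions and includes padding positions, so it would need its own pooling or masking step before a comparison like this.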

genbio.toolkit.aido_models_apis.protein_stability

protein_stability(query: str | list[str]) -> dict[str, Any]

Predict protein stability from amino acid sequences.

Notes

This function accesses the state-of-the-art AIDO.Protein-16B-stability-prediction model, which is fine-tuned from AIDO.Protein-16B on a dataset of 55k small protein fragments (41-50 aa) with experimental measurements of resistance to proteolytic degradation (stability). The predicted stability is a float in arbitrary units, where higher values indicate greater resistance to degradation.

Parameters:

Name Type Description Default
query str | list[str]

A single protein sequence (string of amino acid tokens) or a list of sequences. Sequences are tokenized at single-amino-acid resolution using the model's fixed vocabulary of canonical amino acids.

required

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "model_name": The identifier of the model used ("AIDO.Protein-16B-stability-prediction").
  • "return_code": Integer status code (0 indicates success).
  • "output": A dictionary containing:
    • "shape": A list [N, 1] where N is the number of input sequences.
    • "values": A nested list of floats [[score_1], [score_2], ..., [score_N]] containing the predicted stability score for each sequence.
  • "parameters": A dictionary with metadata including:
    • "model_id": The model identifier.
    • "sequence_count": Number of sequences processed.
    • "sequence_lengths": List of lengths for each input sequence.
  • "error": None if successful, otherwise contains error information.
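
The nested [N, 1] layout of "values" can be flattened and paired with the inputs, for example to rank candidates. This is a sketch against a hand-made response; it assumes scores come back in the same order as the input sequences, which the text above implies but does not state outright. The helper name and the mock sequences are hypothetical.

```python
def rank_by_stability(sequences, response):
    """Pair each input sequence with its predicted score, most stable first."""
    if response["return_code"] != 0 or response["error"] is not None:
        raise RuntimeError(f"stability request failed: {response['error']}")
    n_rows, n_cols = response["output"]["shape"]
    if n_rows != len(sequences) or n_cols != 1:
        raise ValueError("expected exactly one score per input sequence")
    # Each row of "values" is a one-element list: [[score_1], [score_2], ...].
    scores = [row[0] for row in response["output"]["values"]]
    # Assumption: row i of "values" corresponds to sequences[i].
    return sorted(zip(sequences, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical inputs and response (scores are made up, arbitrary units).
mock_sequences = ["MKTAYIAKQR", "GSHMLEDPVA", "ACDEFGHIKL"]
mock_response = {
    "model_name": "AIDO.Protein-16B-stability-prediction",
    "return_code": 0,
    "output": {"shape": [3, 1], "values": [[0.2], [1.4], [-0.3]]},
    "parameters": {"sequence_count": 3, "sequence_lengths": [10, 10, 10]},
    "error": None,
}
ranked = rank_by_stability(mock_sequences, mock_response)
```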

genbio.toolkit.aido_models_apis.protein_protein_interaction

protein_protein_interaction(seq_a: str, seq_b: str, crop_mode: Literal['head', 'tail', 'center'] = 'head') -> dict[str, Any]

Predict protein-protein interaction, binding sites, and cross-attention mapping.

Notes

This function accesses the AIDO.ProteinProteinInteraction model for predicting protein-protein interactions between two amino acid sequences. The model provides:
  • Interaction probability (0-1) and binary label
  • Per-residue binding site probabilities for both chains
  • Cross-attention matrix between the two sequences

Sequences exceeding the internal crop size (1000 residues) will be cropped according to the specified crop_mode.

Parameters:

Name Type Description Default
seq_a str

Primary amino acid sequence of protein A. Must contain only valid amino acid characters: ACDEFGHIKLMNPQRSTVWY (and X for unknown). Maximum length 2048.

required
seq_b str

Primary amino acid sequence of protein B. Must contain only valid amino acid characters: ACDEFGHIKLMNPQRSTVWY (and X for unknown). Maximum length 2048.

required
crop_mode Literal['head', 'tail', 'center']

Crop mode for sequences exceeding the internal crop size (1000 residues). Options: "head", "tail", "center". Default "head".

'head'
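
The input constraints above (allowed characters, 2048 maximum length, 1000-residue crop) can be checked on the client before a request is sent. The sketch below is illustrative only: the function names are hypothetical, and the cropping semantics ("head" keeps the first residues, "tail" the last, "center" a centered window) are an assumption based on the mode names, not a confirmed description of the server's behavior.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")  # 20 canonical residues plus X
MAX_INPUT_LENGTH = 2048
CROP_SIZE = 1000


def validate_sequence(seq, name="seq"):
    """Raise ValueError if seq violates the documented input constraints."""
    if not seq:
        raise ValueError(f"{name} is empty")
    if len(seq) > MAX_INPUT_LENGTH:
        raise ValueError(f"{name} has {len(seq)} residues (max {MAX_INPUT_LENGTH})")
    invalid = sorted(set(seq) - VALID_RESIDUES)
    if invalid:
        raise ValueError(f"{name} contains invalid characters: {''.join(invalid)}")
    return seq


def crop_sequence(seq, crop_mode="head", crop_size=CROP_SIZE):
    """Assumed cropping behavior for sequences longer than crop_size."""
    if len(seq) <= crop_size:
        return seq
    if crop_mode == "head":
        return seq[:crop_size]
    if crop_mode == "tail":
        return seq[-crop_size:]
    if crop_mode == "center":
        start = (len(seq) - crop_size) // 2
        return seq[start:start + crop_size]
    raise ValueError(f"unknown crop_mode: {crop_mode!r}")
```

With a toy crop size of 3, `crop_sequence("ACDEFGHIK", "center", 3)` keeps the middle window `"EFG"`.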

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "model_name": The identifier of the model used ("AIDO.ProteinProteinInteraction").
  • "return_code": Integer status code (0 indicates success, 1 for error).
  • "output": A dictionary containing:
    • "ppi": Protein-protein interaction prediction
      • "prob": Interaction probability (0-1), computed via sigmoid
      • "label": Binary interaction label (1 if prob >= 0.5, 0 otherwise)
    • "binding_sites": Per-residue binding site predictions
      • "chain1_prob": Array of per-residue binding probabilities for protein A
      • "chain2_prob": Array of per-residue binding probabilities for protein B
    • "attention": Cross-attention map between sequences
      • "shape": Shape of the attention matrix [len_a, len_b]
      • "matrix": Cross-attention scores between residues
    • "meta": Metadata about processed sequences
      • "len_a": Actual length of protein A after cleaning and cropping
      • "len_b": Actual length of protein B after cleaning and cropping
      • "crop_size": Max residue length per chain (1000)
  • "parameters": A dictionary with metadata including:
    • "model_id": The model identifier
    • "seq_a_length": Length of the input seq_a
    • "seq_b_length": Length of the input seq_b
    • "crop_mode": The crop mode applied
  • "error": None if successful, otherwise contains error information.
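
A short sketch of how the nested "output" dictionary might be read: the highest-probability binding residues on one chain, and the residue pair with the largest cross-attention score. The mock values are made up, the helper names are hypothetical, and the code assumes "matrix" is a nested list with len_a rows and len_b columns, with 0-based indices that refer to the cropped sequences.

```python
def top_binding_sites(probs, k=3):
    """Return the k (index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def strongest_contact(attention):
    """Return ((i, j), score) for the largest entry of the attention matrix."""
    rows, cols = attention["shape"]
    matrix = attention["matrix"]
    best = max(
        ((i, j) for i in range(rows) for j in range(cols)),
        key=lambda ij: matrix[ij[0]][ij[1]],
    )
    return best, matrix[best[0]][best[1]]


# Hypothetical "output" dictionary for a 3-residue and a 2-residue chain.
mock_output = {
    "ppi": {"prob": 0.91, "label": 1},
    "binding_sites": {"chain1_prob": [0.1, 0.8, 0.3], "chain2_prob": [0.7, 0.2]},
    "attention": {"shape": [3, 2], "matrix": [[0.1, 0.0], [0.6, 0.2], [0.05, 0.05]]},
    "meta": {"len_a": 3, "len_b": 2, "crop_size": 1000},
}
sites_a = top_binding_sites(mock_output["binding_sites"]["chain1_prob"], k=2)
pair, score = strongest_contact(mock_output["attention"])
```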

genbio.toolkit.aido_models_apis.reactome_pathway_query

reactome_pathway_query(sequence: str | list[str]) -> dict[str, Any]

Predict Reactome pathway memberships for protein amino acid sequences.

Notes

This function accesses the AIDO.ReactomePathway-Query model, which predicts which of 1,766 curated biological pathways from the Reactome database a protein is involved in based on its primary sequence. The model first obtains protein embeddings via the AIDO.Protein foundation model, then uses a KNeighborsClassifier trained on 11,660 sequences from the Reactome Physical Entity mapping (UniProt-to-pathway). Only pathways with at least 10 member sequences were included in training. The model was optimized for Macro F1 to handle the significant sparsity of the pathway membership matrix.

Parameters:

Name Type Description Default
sequence str | list[str]

A single protein amino acid sequence (string) or a list of sequences (e.g., "MRLPAQ..." or ["SEQ1...", "SEQ2..."]). Each sequence must be <= 2048 characters long. Sequences are tokenized at single-amino-acid resolution using the AIDO.Protein model vocabulary.

required

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "model_name": The identifier of the model used ("AIDO.ReactomePathway-Query").
  • "return_code": Integer status code (0 indicates success).
  • "output": A dictionary containing:
    • "results": Array of results, one per input sequence. Each result contains:
      • "sequence": The original input amino acid sequence string.
      • "pathways": A list of predicted Reactome pathway names (strings) that the protein is predicted to be involved in. May be empty if no pathways meet the classification threshold.
      • "pathway_count": Integer count of predicted pathways for this sequence.
    • "count": Total number of sequences in the results.
  • "parameters": A dictionary with metadata including:
    • "model_id": The model identifier.
    • "sequence_count": Number of sequences processed.
  • "error": None if successful, otherwise contains error information.
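
When several sequences are queried at once, the per-sequence "pathways" lists can be tallied to see which pathways recur across the batch. The following is a sketch over a hand-made response; the sequences and pathway assignments are invented for illustration, and the helper name is hypothetical.

```python
from collections import Counter


def pathway_frequencies(response):
    """Count in how many input sequences each predicted pathway appears."""
    if response["return_code"] != 0 or response["error"] is not None:
        raise RuntimeError(f"pathway query failed: {response['error']}")
    counts = Counter()
    for result in response["output"]["results"]:
        # set() guards against a pathway being listed twice for one sequence.
        counts.update(set(result["pathways"]))
    return counts


# Hypothetical response for three sequences (assignments are made up).
mock_response = {
    "model_name": "AIDO.ReactomePathway-Query",
    "return_code": 0,
    "output": {
        "results": [
            {"sequence": "MRLPAQ", "pathways": ["Signal Transduction", "Immune System"], "pathway_count": 2},
            {"sequence": "MKTAYI", "pathways": ["Signal Transduction"], "pathway_count": 1},
            {"sequence": "GSHMLE", "pathways": [], "pathway_count": 0},
        ],
        "count": 3,
    },
    "parameters": {"sequence_count": 3},
    "error": None,
}
frequencies = pathway_frequencies(mock_response)
```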

Cell

genbio.toolkit.aido_models_apis.cell_embedding_small

cell_embedding_small(h5ad_path: str, pooling: Literal['mean', 'none'] = 'mean', do_cell_average: bool = False, pooling_dim: int | None = None) -> dict[str, Any]

Compute cell embeddings from single-cell RNA-seq data.

Notes

This function accesses the SOTA AIDO.Cell-3M model, a scRNA-seq count bidirectional transformer encoder (BERT) model trained on 50 million cells from over 100 tissue types (963 billion gene tokens). The model uses an auto-discretization strategy for encoding continuous gene expression values. The model operates on the human transcriptome as input (up to 19,264 HGNC symbols, see tool aido_gene_list), learning a representation of cell and gene states from the transcriptional context. The rich contextual representations are SOTA for various downstream tasks such as embedding-based similarity search, clustering, and training downstream models. No task head is applied; this endpoint exposes backbone embedding inference only. Returns either pooled cell-level embeddings when pooling is "mean", or gene-level embeddings when pooling is "none".

Parameters:

Name Type Description Default
h5ad_path str

Path to an h5ad file containing single-cell gene expression data. The file should contain a cell-by-gene expression matrix compatible with AnnData format, representing the transcriptome of one or more cells.

required
pooling Literal['mean', 'none']

Strategy to aggregate gene-level representations into a cell-level embedding. Options:
  • "mean": mean pooling over gene tokens (default)
  • "none": return gene-level embeddings without pooling

'mean'
do_cell_average bool

If True and multiple cells are present, average all cells before embedding, producing a single embedding vector. If False (default), returns one embedding per cell.

False
pooling_dim int | None

Dimension along which to pool when pooling="mean". Default is None (server defaults to 1, pooling over the sequence/gene dimension). Allowed range: -2 to 2. Only applicable when pooling is "mean"; ignored otherwise.

None

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "model_name": The identifier of the model used ("AIDO.Cell-3M").
  • "return_code": Integer status code (0 indicates success).
  • "output": A dictionary containing:
    • "shape": If pooling is "mean", a list [N, D] where N is the number of cells (or 1 if do_cell_average=True) and D is the embedding dimension (128 for AIDO.Cell-3M). If pooling is "none", a list [N, G, D] where G is the number of genes (19,264), in the order defined by the aido_gene_list tool, with missing genes imputed and extra genes ignored.
    • "values": A nested list of floats containing the computed embeddings. For pooled outputs, shape is [N, 128]. For unpooled output ("none"), embeddings are returned at gene-level resolution with shape [N, 19264, 128].
  • "parameters": A dictionary with metadata including:
    • "model_id": The model identifier.
    • "pooling": The pooling strategy used.
    • "pooling_dim": The pooling dimension used (if provided).
    • "do_cell_average": Whether cell averaging was applied.
    • "is_aligned": Boolean indicating whether the data was considered aligned before inference. If False, alignment was performed.
    • "filename": The uploaded file name.
    • "cell_count": Number of cells in the input data.
    • "gene_count": Number of genes in the input data.
  • "error": None if successful, otherwise contains error information.

genbio.toolkit.aido_models_apis.cell_embedding_medium

cell_embedding_medium(h5ad_path: str, pooling: Literal['mean', 'none'] = 'mean', do_cell_average: bool = False, pooling_dim: int | None = None) -> dict[str, Any]

Compute cell embeddings from single-cell RNA-seq data.

Notes

This function accesses the SOTA AIDO.Cell-10M model, a scRNA-seq count bidirectional transformer encoder (BERT) model trained on 50 million cells from over 100 tissue types (963 billion gene tokens). The model uses an auto-discretization strategy for encoding continuous gene expression values. The model operates on the human transcriptome as input (up to 19,264 HGNC symbols, see tool aido_gene_list), learning a representation of cell and gene states from the transcriptional context. The rich contextual representations are SOTA for various downstream tasks such as embedding-based similarity search, clustering, and training downstream models. No task head is applied; this endpoint exposes backbone embedding inference only. Returns either pooled cell-level embeddings when pooling is "mean", or gene-level embeddings when pooling is "none".

Parameters:

Name Type Description Default
h5ad_path str

Path to an h5ad file containing single-cell gene expression data. The file should contain a cell-by-gene expression matrix compatible with AnnData format, representing the transcriptome of one or more cells.

required
pooling Literal['mean', 'none']

Strategy to aggregate gene-level representations into a cell-level embedding. Options:
  • "mean": mean pooling over gene tokens (default)
  • "none": return gene-level embeddings without pooling

'mean'
do_cell_average bool

If True and multiple cells are present, average all cells before embedding, producing a single embedding vector. If False (default), returns one embedding per cell.

False
pooling_dim int | None

Dimension along which to pool when pooling="mean". Default is None (server defaults to 1, pooling over the sequence/gene dimension). Allowed range: -2 to 2. Only applicable when pooling is "mean"; ignored otherwise.

None

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "model_name": The identifier of the model used ("AIDO.Cell-10M").
  • "return_code": Integer status code (0 indicates success).
  • "output": A dictionary containing:
    • "shape": If pooling is "mean", a list [N, D] where N is the number of cells (or 1 if do_cell_average=True) and D is the embedding dimension (256 for AIDO.Cell-10M). If pooling is "none", a list [N, G, D] where G is the number of genes (19,264), in the order defined by the aido_gene_list tool, with missing genes imputed and extra genes ignored.
    • "values": A nested list of floats containing the computed embeddings. For pooled outputs, shape is [N, 256]. For unpooled output ("none"), embeddings are returned at gene-level resolution with shape [N, 19264, 256].
  • "parameters": A dictionary with metadata including:
    • "model_id": The model identifier.
    • "pooling": The pooling strategy used.
    • "pooling_dim": The pooling dimension used (if provided).
    • "do_cell_average": Whether cell averaging was applied.
    • "is_aligned": Boolean indicating whether the data was considered aligned before inference. If False, alignment was performed.
    • "filename": The uploaded file name.
    • "cell_count": Number of cells in the input data.
    • "gene_count": Number of genes in the input data.
  • "error": None if successful, otherwise contains error information.

genbio.toolkit.aido_models_apis.cell_embedding_large

cell_embedding_large(h5ad_path: str, pooling: Literal['mean', 'none'] = 'mean', do_cell_average: bool = False, pooling_dim: int | None = None) -> dict[str, Any]

Compute cell embeddings from single-cell RNA-seq data.

Notes

This function accesses the SOTA AIDO.Cell-100M model, a scRNA-seq count bidirectional transformer encoder (BERT) model trained on 50 million cells from over 100 tissue types (963 billion gene tokens). The model uses an auto-discretization strategy for encoding continuous gene expression values. The model operates on the human transcriptome as input (up to 19,264 HGNC symbols, see tool aido_gene_list), learning a representation of cell and gene states from the transcriptional context. The rich contextual representations are SOTA for various downstream tasks such as embedding-based similarity search, clustering, and training downstream models. No task head is applied; this endpoint exposes backbone embedding inference only. Returns either pooled cell-level embeddings when pooling is "mean", or gene-level embeddings when pooling is "none".

Parameters:

Name Type Description Default
h5ad_path str

Path to an h5ad file containing single-cell gene expression data. The file should contain a cell-by-gene expression matrix compatible with AnnData format, representing the transcriptome of one or more cells.

required
pooling Literal['mean', 'none']

Strategy to aggregate gene-level representations into a cell-level embedding. Options:
  • "mean": mean pooling over gene tokens (default)
  • "none": return gene-level embeddings without pooling

'mean'
do_cell_average bool

If True and multiple cells are present, average all cells before embedding, producing a single embedding vector. If False (default), returns one embedding per cell.

False
pooling_dim int | None

Dimension along which to pool when pooling="mean". Default is None (server defaults to 1, pooling over the sequence/gene dimension). Allowed range: -2 to 2. Only applicable when pooling is "mean"; ignored otherwise.

None

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "model_name": The identifier of the model used ("AIDO.Cell-100M").
  • "return_code": Integer status code (0 indicates success).
  • "output": A dictionary containing:
    • "shape": If pooling is "mean", a list [N, D] where N is the number of cells (or 1 if do_cell_average=True) and D is the embedding dimension (640 for AIDO.Cell-100M). If pooling is "none", a list [N, G, D] where G is the number of genes (19,264), in the order defined by the aido_gene_list tool, with missing genes imputed and extra genes ignored.
    • "values": A nested list of floats containing the computed embeddings. For pooled outputs, shape is [N, 640]. For unpooled output ("none"), embeddings are returned at gene-level resolution with shape [N, 19264, 640].
  • "parameters": A dictionary with metadata including:
    • "model_id": The model identifier.
    • "pooling": The pooling strategy used.
    • "pooling_dim": The pooling dimension used (if provided).
    • "do_cell_average": Whether cell averaging was applied.
    • "is_aligned": Boolean indicating whether the data was considered aligned before inference. If False, alignment was performed.
    • "filename": The uploaded file name.
    • "cell_count": Number of cells in the input data.
    • "gene_count": Number of genes in the input data.
  • "error": None if successful, otherwise contains error information.
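
The shape rules for the three cell embedding endpoints can be summarized in a small helper, which also makes the size of unpooled responses easy to estimate before a request is made. This is a sketch under stated assumptions: pooling_dim is left at its default, do_cell_average is assumed to reduce N to 1 in both pooling modes, and 640 is used for AIDO.Cell-100M as given in the "values" description. The function names are hypothetical.

```python
HIDDEN_SIZE = {"AIDO.Cell-3M": 128, "AIDO.Cell-10M": 256, "AIDO.Cell-100M": 640}
N_GENES = 19264  # genes in the AIDO gene index


def expected_shape(model_name, n_cells, pooling="mean", do_cell_average=False):
    """Expected "shape" field, assuming the default pooling_dim."""
    d = HIDDEN_SIZE[model_name]
    n = 1 if do_cell_average else n_cells
    if pooling == "mean":
        return [n, d]
    if pooling == "none":
        return [n, N_GENES, d]
    raise ValueError(f"unsupported pooling: {pooling!r}")


def value_count(shape):
    """Number of floats in a response with the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


pooled = expected_shape("AIDO.Cell-3M", n_cells=100)
unpooled = expected_shape("AIDO.Cell-100M", n_cells=100, pooling="none")
unpooled_floats = value_count(unpooled)
```

For 100 cells, the unpooled AIDO.Cell-100M response holds 100 × 19,264 × 640 floats (over a billion values in a nested list), so pooled output or small batches are likely the practical choice for gene-level work.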

genbio.toolkit.aido_models_apis.cell_type_annotation

cell_type_annotation(h5ad_path: str, tissue: str, return_probs: bool = False) -> dict[str, Any]

Predict cell types from single-cell RNA-seq data using tissue-specific models.

Notes

This function accesses the AIDO.CellType-Query model, which uses tissue-specific pretrained classification models to predict cell type labels for each cell in the input dataset based on single-cell RNA-seq expression patterns. The model checkpoint is selected based on the provided tissue name (e.g., "Kidney"). The function automatically realigns input genes to the AIDO gene index (19,264 genes) if needed. Available tissues include: Bladder, Blood, Bone_Marrow, Ear, Eye, Fat, Heart, Kidney, Large_Intestine, Liver, Lung, Lymph_Node, Mammary, Muscle, Ovary, Pancreas, Prostate, Salivary_Gland, Skin, Small_Intestine, Spleen, Stomach, Testis, Thymus, Tongue, Trachea, Uterus, and Vasculature (call cell_type_annotation_supported_tissues() for more details). Model performance may vary by tissue type.

Parameters:

Name Type Description Default
h5ad_path str

Path to an h5ad file containing single-cell gene expression data. The file should contain a cell-by-gene expression matrix compatible with AnnData format.

required
tissue str

Tissue name used to select the pretrained model checkpoint (e.g., "Kidney"). Must match one of the available tissue-specific models. See cell_type_annotation_supported_tissues() for the full list of supported tissues.

required
return_probs bool

If True, returns the probabilities for each cell type per cell. If False (default), returns only the predicted cell type label (the cell type with maximum probability).

False

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "model_name": The identifier of the model used ("AIDO.CellType-Query").
  • "return_code": Integer status code (200 indicates success, 400/500 for errors).
  • "output": A dictionary containing:
    • "results": Array of objects, one per cell.
      When return_probs=False, each contains:
      • "cell_id": String identifier for the cell (from adata.obs_names).
      • "predicted_label": String predicted cell type label (max probability).
      When return_probs=True, each contains:
      • "cell_id": String identifier for the cell.
      • One key-value pair per possible cell type, giving the float probability for that cell type.
    • "count": Integer number of cells in the results.
    • "realigned_to_gene_index": Boolean indicating whether gene realignment was applied to match the AIDO gene index.
  • "parameters": A dictionary with metadata including:
    • "model_id": The model identifier.
    • "query_file": The uploaded file name.
    • "tissue": The tissue name used.
    • "return_probs": Whether probabilities were returned.
    • "cell_count": Number of cells in the input data.
    • "gene_count": Number of genes in the input data.
  • "error": None if successful, otherwise contains error information.
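
Because the shape of each entry in "results" depends on return_probs, downstream code may want to handle both layouts. The sketch below reduces either form to one label per cell and tallies the labels. It assumes that, when probabilities are returned, every key other than "cell_id" is a cell type name mapped to its probability; the cell IDs and cell type names in the mock data are invented, and the helper name is hypothetical.

```python
from collections import Counter


def predicted_labels(results):
    """Map cell_id -> predicted label for either results layout."""
    labels = {}
    for row in results:
        if "predicted_label" in row:
            # return_probs=False layout: the label is given directly.
            labels[row["cell_id"]] = row["predicted_label"]
        else:
            # return_probs=True layout (assumed): remaining keys are cell types.
            probs = {k: v for k, v in row.items() if k != "cell_id"}
            labels[row["cell_id"]] = max(probs, key=probs.get)
    return labels


# Hypothetical results in both layouts.
label_rows = [
    {"cell_id": "cell_0", "predicted_label": "podocyte"},
    {"cell_id": "cell_1", "predicted_label": "T cell"},
]
prob_rows = [
    {"cell_id": "cell_0", "podocyte": 0.9, "T cell": 0.1},
    {"cell_id": "cell_1", "podocyte": 0.2, "T cell": 0.8},
]
from_labels = predicted_labels(label_rows)
from_probs = predicted_labels(prob_rows)
label_counts = Counter(from_probs.values())
```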

genbio.toolkit.aido_models_apis.cell_type_annotation_supported_tissues

cell_type_annotation_supported_tissues() -> list[str]

Retrieve the list of supported tissue types for cell type annotation.

Notes

These tissue names are used to select the appropriate pretrained model checkpoint for cell type classification. Model availability and performance may vary across different tissue types. All tissues should have model support, but some may be more comprehensive than others.

Returns:

Type Description
list[str]

A list of tissue names (strings) that are supported by the AIDO.CellType-Query model.

genbio.toolkit.aido_models_apis.cell_age_predictor

cell_age_predictor(h5ad_path: str) -> dict[str, Any]

Predict biological age from single-cell RNA-seq data.

Notes

This function accesses the AIDO.AgePredictor model, a transcriptomic clock based on the Cell Perceiver architecture fine-tuned for age regression. The model was derived from a pretrained Cell Perceiver and fine-tuned on CellXGene data with experimentally measured donor ages. The model operates on raw (un-normalized) scRNA-seq counts from the human transcriptome (20,062 genes) and predicts the biological age of the sampled tissue or donor. The model returns both normalized predictions (z-scores relative to the training set distribution) and denormalized age predictions in years. This model is suitable for estimating biological age, which is a reasonable proxy for overall cellular stress and disease.

Parameters:

Name Type Description Default
h5ad_path str

Path to an h5ad file containing single-cell gene expression data. The file should contain a cell-by-gene expression matrix with raw (un-normalized) scRNA-seq counts. The model expects 20,062 genes; missing genes will be imputed with a mask value, and extra genes will be ignored. Most genes overlap with the HGNC gene set (see tool aido_gene_list).

required

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "model_name": The identifier of the model used ("AIDO.AgePredictor").
  • "return_code": Integer status code (0 indicates success).
  • "output": A dictionary containing:
    • "age_predictions": A list of predicted ages in years, one per cell.
    • "age_predictions_normalized": A list of z-score normalized age predictions (relative to the training set mean and standard deviation).
    • "normalization_mean": The mean age from the training set (approximately 53 years).
    • "normalization_std": The standard deviation of ages in the training set (approximately 22 years).
    • "age_range": A list [min_age, max_age] indicating the range of predicted ages across all cells in the input.
  • "parameters": A dictionary with metadata including:
    • "model_id": The model identifier.
    • "is_aligned": Boolean indicating whether the input data was already aligned to the model's gene set. If False, alignment was performed during preprocessing.
    • "filename": The uploaded file name.
    • "cell_count": Number of cells in the input data.
    • "gene_count": Number of genes in the input data.
  • "error": None if successful, otherwise contains error information.
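
The normalized and denormalized predictions are related by the usual z-score inverse, age = z × std + mean. The sketch below applies that relation to a hand-made "output" dictionary; it assumes the endpoint denormalizes in exactly this way, and the mock values (using 53 and 22 for the mean and standard deviation) are illustrative rather than real predictions.

```python
def denormalize(z_scores, mean, std):
    """Invert z-score normalization: age = z * std + mean."""
    return [z * std + mean for z in z_scores]


# Hypothetical "output" dictionary for three cells.
mock_output = {
    "age_predictions": [53.0, 75.0, 31.0],
    "age_predictions_normalized": [0.0, 1.0, -1.0],
    "normalization_mean": 53.0,
    "normalization_std": 22.0,
    "age_range": [31.0, 75.0],
}
recovered = denormalize(
    mock_output["age_predictions_normalized"],
    mock_output["normalization_mean"],
    mock_output["normalization_std"],
)
# age_range should span the smallest and largest per-cell prediction.
computed_range = [min(recovered), max(recovered)]
```

A z-score of 1.0 therefore corresponds to a prediction one training-set standard deviation (about 22 years) above the training mean.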

genbio.toolkit.aido_models_apis.embedding_gene_vocab

embedding_gene_vocab() -> list[str]

Retrieve the ordered list of genes supported by the cell_embedding_* and tissue_embedding_* models.

Returns:

Type Description
list[str]

A list of HGNC gene symbols (strings) that are recognized by the cell and tissue embedding models, in the order they are returned in model outputs.

genbio.toolkit.aido_models_apis.age_predictor_gene_vocab

age_predictor_gene_vocab() -> list[str]

Retrieve the gene vocabulary for the age predictor model.

Notes

This function returns the list of genes (20,062 genes) that the AIDO.AgePredictor model expects in the input. The model automatically imputes missing genes and ignores extra genes during preprocessing.

Returns:

Type Description
list[str]

A list of gene names (strings, mix of HGNC symbols and Ensembl IDs) that are used by the age predictor model, in the order expected by the model.
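
The alignment step described for these vocabularies (missing genes imputed, extra genes ignored) can be pictured with a small pure-Python sketch. It is only an illustration of the idea, not the server's preprocessing: the function name is hypothetical, the three-gene vocabulary is a toy stand-in for the real list, and the fill value of 0.0 is a placeholder since the actual mask value is not given here.

```python
def align_to_vocab(expression, vocab, fill_value=0.0):
    """Reorder one cell's {gene: value} mapping to match the vocabulary order.

    Returns (aligned_values, missing_genes, extra_genes).
    """
    vocab_set = set(vocab)
    # Genes in the vocabulary but absent from the input get fill_value.
    aligned = [expression.get(gene, fill_value) for gene in vocab]
    missing = [gene for gene in vocab if gene not in expression]
    # Genes in the input but outside the vocabulary are dropped.
    extra = [gene for gene in expression if gene not in vocab_set]
    return aligned, missing, extra


# Toy vocabulary and one cell's counts (illustrative values only).
toy_vocab = ["TP53", "GAPDH", "ACTB"]
cell_counts = {"GAPDH": 12.0, "TP53": 3.0, "MYGENE1": 7.0}
aligned, missing, extra = align_to_vocab(cell_counts, toy_vocab)
is_aligned = not missing and not extra and list(cell_counts) == toy_vocab
```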

DNA

genbio.toolkit.aido_models_apis.dna_embedding_small

dna_embedding_small(sequences: list[str], pooling: Literal['mean', 'max', 'none'] = 'mean') -> dict[str, Any]

Compute DNA sequence embeddings from nucleotide sequences.

Notes

This function accesses the AIDO.DNA-300M model, a DNA foundation model based on the bidirectional transformer encoder (BERT) architecture, trained via masked language modeling on 10.6 billion nucleotides from 796 species. The model operates on DNA sequences with single-nucleotide tokenization (A, T, C, G, N), producing contextual representations for embedding-based similarity search, clustering, and training downstream models. No task head is applied; this endpoint exposes backbone embedding inference only. It can process up to 4000 nucleotides per sequence and returns either a pooled sequence-level embedding when pooling is "mean" or "max", or nucleotide-level embeddings when pooling is "none".

Parameters:

Name Type Description Default
sequences list[str]

A list of DNA sequences (strings of nucleotide tokens). Sequences are tokenized at single-nucleotide resolution using the vocabulary: A, T, C, G, N, where N denotes uncertain elements.

required
pooling Literal['mean', 'max', 'none']

Strategy to aggregate nucleotide-level representations into a sequence-level embedding. Options:
  • "mean": mean pooling over sequence tokens (default)
  • "max": max pooling
  • "none": return nucleotide-level embeddings without pooling

'mean'

Returns:

A dictionary with the following fields:

  • "model_name": The identifier of the model used ("AIDO.DNA-300M").
  • "return_code": Integer status code (0 indicates success).
  • "output": A dictionary containing:
    • "shape": A list specifying the tensor shape. For pooled outputs ("mean" or "max"), this is [N, 1024] where N is the number of input sequences and 1024 is the model hidden size. For unpooled output ("none"), shape is [N, L+2, 1024] where L is the sequence length (padded to the maximum in the batch), and the +2 accounts for prepended CLS and appended EOS tokens.
    • "values": A nested list of floats containing the computed embeddings.
  • "parameters": A dictionary with metadata including:
    • "model_id": The model identifier.
    • "pooling": The pooling strategy used.
    • "sequence_count": Number of sequences processed.
    • "sequence_lengths": List of lengths for each input sequence.
  • "error": None if successful, otherwise contains error information.

genbio.toolkit.aido_models_apis.dna_embedding_large

dna_embedding_large(sequences: list[str], pooling: Literal['mean', 'max', 'none'] = 'mean') -> dict[str, Any]

Compute DNA sequence embeddings from nucleotide sequences.

Notes

This function accesses the AIDO.DNA-7B model, a DNA foundation model based on the bidirectional transformer encoder (BERT) architecture, trained via masked language modeling on 10.6 billion nucleotides from 796 species. The model operates on DNA sequences with single-nucleotide tokenization (A, T, C, G, N), producing contextual representations for embedding-based similarity search, clustering, and training downstream models. No task head is applied; this endpoint exposes backbone embedding inference only. It can process up to 4000 nucleotides per sequence and returns either a pooled sequence-level embedding when pooling is "mean" or "max", or nucleotide-level embeddings when pooling is "none".

Parameters:

Name Type Description Default
sequences list[str]

A list of DNA sequences (strings of nucleotide tokens). Sequences are tokenized at single-nucleotide resolution using the vocabulary: A, T, C, G, N, where N denotes uncertain elements.

required
pooling Literal['mean', 'max', 'none']

Strategy to aggregate nucleotide-level representations into a sequence-level embedding. Options:
  • "mean": mean pooling over sequence tokens (default)
  • "max": max pooling
  • "none": return nucleotide-level embeddings without pooling

'mean'

Returns:

A dictionary with the following fields:

  • "model_name": The identifier of the model used ("AIDO.DNA-7B").
  • "return_code": Integer status code (0 indicates success).
  • "output": A dictionary containing:
    • "shape": A list specifying the tensor shape. For pooled outputs ("mean" or "max"), this is [N, 4352] where N is the number of input sequences and 4352 is the model hidden size. For unpooled output ("none"), shape is [N, L+2, 4352] where L is the sequence length (padded to the maximum in the batch), and the +2 accounts for prepended CLS and appended EOS tokens.
    • "values": A nested list of floats containing the computed embeddings.
  • "parameters": A dictionary with metadata including:
    • "model_id": The model identifier.
    • "pooling": The pooling strategy used.
    • "sequence_count": Number of sequences processed.
    • "sequence_lengths": List of lengths for each input sequence.
  • "error": None if successful, otherwise contains error information.
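
For the two DNA embedding endpoints, the input constraints and the shape rules can be checked locally before a request is sent. The sketch below is illustrative: the function names are hypothetical, and it assumes the vocabulary check is case-sensitive (uppercase A, T, C, G, N only), which the text does not specify.

```python
DNA_VOCAB = set("ATCGN")
MAX_LENGTH = 4000  # nucleotides per sequence
HIDDEN_SIZE = {"AIDO.DNA-300M": 1024, "AIDO.DNA-7B": 4352}


def check_sequences(sequences):
    """Raise ValueError if any sequence violates the documented constraints."""
    for i, seq in enumerate(sequences):
        if not seq:
            raise ValueError(f"sequence {i} is empty")
        if len(seq) > MAX_LENGTH:
            raise ValueError(f"sequence {i} has {len(seq)} nt (max {MAX_LENGTH})")
        invalid = sorted(set(seq) - DNA_VOCAB)
        if invalid:
            raise ValueError(f"sequence {i} contains invalid tokens: {invalid}")


def expected_shape(sequences, model_name, pooling="mean"):
    """Expected "shape" field for a batch of sequences."""
    d = HIDDEN_SIZE[model_name]
    n = len(sequences)
    if pooling in ("mean", "max"):
        return [n, d]
    if pooling == "none":
        # Padded to the longest sequence, plus CLS and EOS tokens.
        return [n, max(len(s) for s in sequences) + 2, d]
    raise ValueError(f"unsupported pooling: {pooling!r}")


batch = ["ACGT", "ACGTNN"]
check_sequences(batch)
pooled_shape = expected_shape(batch, "AIDO.DNA-300M")
unpooled_shape = expected_shape(batch, "AIDO.DNA-7B", pooling="none")
```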

genbio.toolkit.aido_models_apis.dna2_flashzoi_rep1

dna2_flashzoi_rep1(sequence: str, output_type: Literal['tracks', 'embeddings'] = 'tracks', is_human: bool = True, bins_to_return: int = 6144) -> dict[str, Any]

Predict genomic tracks (RNAseq) or generate embeddings from long DNA sequences.

Notes

This function accesses the AIDO.DNA2-470M-Flashzoi-rep1 model, a two-part model consisting of (1) a long bidirectional transformer backbone trained via masked language modeling on 8.8 trillion nucleotides from 113,379 prokaryotic genomes and 15,032 eukaryotic genomes, and (2) a genomic predictor head based on Flashzoi, fine-tuned to predict 7,611 genomic assay tracks from DNA sequence on the ENCODE dataset. The model operates on DNA sequences with single-nucleotide tokenization (A, T, C, G, N). In "embeddings" mode, the model produces rich contextual representations for embedding-based similarity search, clustering, and training downstream models. In "tracks" mode, the model predicts 7,611 genomic assay tracks including RNA-seq, CAGE, DNase, ATAC-seq, ChIP-seq (transcription factors), and ChIP-seq (histone modifications) in 32 bp bins from 196,608 bp input sequences for human or mouse genomes.

Parameters:

Name Type Description Default
sequence str

A single DNA sequence (string of nucleotide tokens). Sequences must be exactly 196,608 base pairs, and are tokenized at single-nucleotide resolution using the vocabulary: A, T, C, G, N (where N denotes uncertain elements).

required
output_type Literal['tracks', 'embeddings']

Type of output to generate. Options include: - "tracks": return predicted genomic assay tracks (default), - "embeddings": return intermediate embeddings from the model backbone.

'tracks'
is_human bool

Whether the input sequence is from a human genome (True, default) or mouse genome (False). This determines which species-specific output head is used for prediction.

True
bins_to_return int

Number of output bins to return from the center of the prediction. Default is 6144 bins, which corresponds to 196,608 bp at 32 bp resolution. Set to -1 to return all bins.

6144

Returns:

Type Description
dict[str, Any]

Dictionary containing:

dict[str, Any]
  • model_name: Model identifier string
dict[str, Any]
  • return_code: 0 for success, 1 for error
dict[str, Any]
  • output: Dictionary with:
  • sha256: File checksum
  • shape: Array dimensions [1, bins, features]
  • dtype: Data type (e.g., "float32")
  • values: numpy.ndarray with prediction data
  • output_type: "tracks" or "embeddings"
dict[str, Any]
  • parameters: Request metadata dict
dict[str, Any]
  • error: Error message or None
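
The relationship between the input window, the 32 bp bins, and bins_to_return is plain arithmetic. The sketch below works it through; the function names are hypothetical, and it assumes the central crop is symmetric about the window centre, which the description implies but does not spell out.

```python
WINDOW_BP = 196_608  # required input length, from the parameter description
BIN_BP = 32          # output bin width, from the Notes


def total_bins() -> int:
    """Number of 32 bp bins that tile the full input window."""
    return WINDOW_BP // BIN_BP


def center_crop_span(bins_to_return: int) -> tuple[int, int]:
    """Half-open bp range [start, end) covered by the central bins.

    Assumption: the crop is symmetric about the window centre.
    """
    n = total_bins()
    if bins_to_return == -1:  # -1 means "return all bins"
        bins_to_return = n
    if not 0 < bins_to_return <= n:
        raise ValueError(f"bins_to_return must be -1 or in 1..{n}")
    first_bin = (n - bins_to_return) // 2
    return first_bin * BIN_BP, (first_bin + bins_to_return) * BIN_BP


full_span = center_crop_span(6144)
central_span = center_crop_span(1024)
```

With the default of 6144 bins the span is the whole window; requesting 1024 bins covers 32,768 bp around the centre under the symmetry assumption.
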
dna2_track_search(query: str, k: int = 10, track_types: list[str] | None = None) -> pd.DataFrame

Search for genomic tracks by text description to identify relevant assays for DNA sequence analysis.

Notes

This function provides text-based search over 7,611 genomic assay tracks from ENCODE, FANTOM, and GTEx that are predicted by the dna2_flashzoi_rep1 model. It uses semantic search to find tracks matching a text query. This tool is designed to work in conjunction with dna2_flashzoi_rep1: first use this search to identify relevant track indices, then use those indices to filter the 7,611 predictions from dna2_flashzoi_rep1.

The search covers diverse assay types including: - CAGE: Cap Analysis of Gene Expression (transcription start sites) - RNA: RNA-seq (gene expression) - DNASE: DNase-seq (open chromatin) - ATAC: ATAC-seq (chromatin accessibility) - CHIP_H: ChIP-seq for histone modifications (e.g., H3K4me3, H3K27ac) - CHIP_TF: ChIP-seq for transcription factors (e.g., CTCF, TP53)

Tracks span various cell types, tissues, and conditions from the ENCODE and FANTOM projects. This is a LOCAL search tool using pre-computed embeddings - it does NOT call the AIDO Inference Engine API.

Parameters:

Name Type Description Default
query str

Text description to search for (e.g., "liver tissue", "MCF-7 breast cancer cells", "histone H3K27 acetylation"). The search matches against track descriptions including assay type, cell type, tissue type, and experimental conditions.

required
k int

Number of top results to return, ranked by similarity score. Default is 10.

10
track_types list[str] | None

Optional filter to restrict search to specific assay types. Must be a subset of: ["ATAC", "CAGE", "CHIP_H", "CHIP_TF", "DNASE", "RNA"]. If None (default), searches across all track types. Use CHIP_H for histone modifications and CHIP_TF for transcription factors.

None

Returns:

Type Description
DataFrame

A pandas DataFrame with columns:

DataFrame
  • "track_idx": Integer index (0-7610) of this track in the model output.
DataFrame
  • "identifier": String identifier for the track (e.g., "CNhs10624+").
DataFrame
  • "track_type": Assay type (ATAC, CAGE, CHIP_H, CHIP_TF, DNASE, or RNA).
DataFrame
  • "description": Full human-readable description of the track.
DataFrame
  • "distance": Distance between the query and the track description (lower values indicate better matches).
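
The two-step workflow from the Notes (search for tracks, then filter the predictions by track_idx) can be sketched without the toolkit. Everything here is illustrative: `select_tracks` is a hypothetical helper, `hits` stands in for the rows of the returned DataFrame, and the prediction is a tiny nested list rather than the real [1, bins, 7611] array.

```python
def select_tracks(prediction, track_idxs):
    """Keep only the requested feature columns.

    prediction: nested list shaped [1, bins, features]
    returns:    nested list shaped [bins, len(track_idxs)]
    """
    (bins,) = prediction  # unpack the batch dimension of size 1
    return [[row[i] for i in track_idxs] for row in bins]


# Stand-in for the search result rows (hypothetical values).
hits = [
    {"track_idx": 2, "track_type": "CAGE", "distance": 0.12},
    {"track_idx": 0, "track_type": "RNA", "distance": 0.30},
]
# Lower distance means a better match, so sort ascending.
idxs = [h["track_idx"] for h in sorted(hits, key=lambda h: h["distance"])]

# Toy prediction: 1 batch, 2 bins, 3 tracks.
prediction = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]
selected = select_tracks(prediction, idxs)
```

When the prediction is a numpy.ndarray, as documented for dna2_flashzoi_rep1, the equivalent selection would be `values[0][:, idxs]`.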

genbio.toolkit.aido_models_apis.predict_tracks_v3

predict_tracks_v3(sequence: str, track_idxs: list[int] | None = None, track_type: TrackType | None = None, is_human: bool = True) -> dict[str, Any]

Predict genomic tracks or generate embeddings from a DNA sequence using AIDO.DNA3-AG-524K.

In "tracks" mode, predicts genomic assay tracks at two resolutions from 196,608 bp input sequences for human or mouse genomes. In "embeddings" mode, returns backbone representations suitable for similarity search, clustering, or training downstream models.

Parameters:

Name Type Description Default
sequence str

DNA sequence to predict tracks for. Tokenized at single-nucleotide resolution using A, T, C, G, N.

required
track_idxs list[int] | None

Integer indices selecting specific tracks from the output. If None, all tracks for the requested type are returned.

None
track_type TrackType | None

Assay type to predict. Passed to the API for server-side filtering. 1 bp resolution: TrackType.CAGE, TrackType.RNA_SEQ, TrackType.ATAC, TrackType.DNASE, TrackType.PROCAP, TrackType.SPLICE_SITES, TrackType.SPLICE_SITE_USAGE 128 bp resolution: TrackType.CHIP_HISTONE, TrackType.CHIP_TF

None
is_human bool

True (default) for human, False for mouse. Only applies when output_type is "tracks".

True

Returns:

Type Description
dict[str, Any]

Dictionary with keys:

dict[str, Any]
  • model_name: str
dict[str, Any]
  • return_code: 0 for success, 1 for error
dict[str, Any]
  • output: dict with:
  • output_type: "tracks" or "embeddings"
  • sha256: checksum of the result file
  • tracks_1bp_shape / tracks_128bp_shape: shapes of the respective arrays (present when output_type is "tracks"; shape (0,) when filtered out)
  • values: NpzFile with keys:
    • tracks_1bp: float32 ndarray [1, seq_len, n_1bp_tracks], or shape (0,) if filtered
    • tracks_128bp: float32 ndarray [1, seq_len//128, n_128bp_tracks], or shape (0,) if filtered
    • metadata_1bp / metadata_128bp: object arrays with track metadata
  • For embeddings: values is a single float32 ndarray of shape [1, seq_len, 512]
dict[str, Any]
  • parameters: request metadata dict
dict[str, Any]
  • error: str or None
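
Each track type is served at one of two resolutions, which determines both the array that holds it and the length of its position axis. A small lookup makes this explicit. The sketch uses plain strings in place of the TrackType members and hypothetical helper names; the grouping itself is taken from the track_type description.

```python
# Resolution in bp per assay type, as listed under track_type.
# Plain strings stand in for the TrackType members here.
RESOLUTION_BP = {
    "CAGE": 1, "RNA_SEQ": 1, "ATAC": 1, "DNASE": 1, "PROCAP": 1,
    "SPLICE_SITES": 1, "SPLICE_SITE_USAGE": 1,
    "CHIP_HISTONE": 128, "CHIP_TF": 128,
}


def output_key(track_type: str) -> str:
    """Name of the array in output["values"] that holds this track type."""
    return f"tracks_{RESOLUTION_BP[track_type]}bp"


def expected_positions(track_type: str, seq_len: int) -> int:
    """Length of the position axis: seq_len at 1 bp, seq_len // 128 at 128 bp."""
    return seq_len // RESOLUTION_BP[track_type]


key_cage = output_key("CAGE")
n_chip = expected_positions("CHIP_TF", 196_608)
```

For a 196,608 bp input this gives 1,536 positions for the ChIP tracks and 196,608 positions for the 1 bp tracks.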

genbio.toolkit.aido_models_apis.dna_v3_embeddings

dna_v3_embeddings(sequence: str)

Generate embeddings from a DNA sequence using AIDO.DNA3-AG-524K.

Returns backbone representations suitable for similarity search, clustering, or training downstream models.

Parameters:

Name Type Description Default
sequence str

A single DNA sequence. Tokenized at single-nucleotide resolution using A, T, C, G, N.

required

Returns:

Type Description

Dictionary with keys:

  • model_name: str
  • return_code: 0 for success, 1 for error
  • output: dict with:
  • output_type: "tracks" or "embeddings"
  • sha256: checksum of the result file
  • tracks_1bp_shape / tracks_128bp_shape: shapes of the respective arrays (present when output_type is "tracks"; shape (0,) when filtered out)
  • values: NpzFile with keys:
    • tracks_1bp: float32 ndarray [1, seq_len, n_1bp_tracks], or shape (0,) if filtered
    • tracks_128bp: float32 ndarray [1, seq_len//128, n_128bp_tracks], or shape (0,) if filtered
    • metadata_1bp / metadata_128bp: object arrays with track metadata
  • For embeddings: values is a single float32 ndarray of shape [1, seq_len, 512]
  • parameters: request metadata dict
  • error: str or None

Tissue

genbio.toolkit.aido_models_apis.tissue_embedding_small

tissue_embedding_small(h5ad_path: str, pooling: Literal['mean', 'max', 'first_token', 'all', 'none'] = 'mean', neighbor_num: int = 8) -> dict[str, Any]

Compute spatially-aware tissue embeddings from spatially resolved single-cell RNA-seq data.

Notes

This function accesses the SOTA AIDO.Tissue-3M model spatial endpoint, a bidirectional transformer encoder trained on spatially resolved single-cell RNA-seq data (76 slides with 22M cells from Vizgen, Nanostring, and 10xGenomics). The model incorporates spatial cell information by retrieving K nearest neighbor cells for each center cell, concatenating the center cell and neighbor cell expression vectors as input with 2D rotary positional embeddings where the first dimension represents gene index and the second represents cell index. The model operates on the human transcriptome as input (up to 19,264 HGNC symbols, see tool aido_gene_list), learning a spatially-aware representation of the center cell. The rich contextual representations are SOTA for downstream tasks such as embedding-based similarity search, clustering, and training downstream models for niche and density prediction. No task head is applied; this endpoint exposes backbone embedding inference only. CRITICAL: Input h5ad files MUST contain spatial coordinates in adata.obs with columns "x" and "y".

Parameters:

Name Type Description Default
h5ad_path str

Path to an h5ad file containing spatially resolved single-cell gene expression data. The file MUST contain a cell-by-gene expression matrix in adata.X with spatial coordinate information in adata.obs.x and adata.obs.y columns (required for spatial context).

required
pooling Literal['mean', 'max', 'first_token', 'all', 'none']

Strategy to aggregate hidden-state representations. Options include: - "mean": mean pooling across sequence tokens (default) → [n_cells, hidden_dim], - "max": max pooling across sequence tokens → [n_cells, hidden_dim], - "first_token": use first token only → [n_cells, hidden_dim], - "all": return all sequence tokens → [n_cells, seq_len, hidden_dim], - "none": return gene-level embeddings without pooling.

'mean'
neighbor_num int

Number of spatial neighbors to include for each center cell. Must be non-negative. Default is 8, which retrieves 8 nearest neighbor cells based on spatial coordinates for spatial context modeling.

8

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

dict[str, Any]
  • "model_name": The identifier of the model used ("AIDO.Tissue-3M").
dict[str, Any]
  • "return_code": Integer status code (0 indicates success).
dict[str, Any]
  • "output": A dictionary containing:
  • "shape": Shape depends on pooling mode:
    • "mean"/"max"/"first_token": [n_cells, 128] where 128 is the embedding dimension.
    • "all": [n_cells, seq_len, 128] where seq_len is the sequence length.
    • "none": [n_cells, 19264, 128] for gene-level embeddings.
  • "values": A nested list of floats containing the computed embeddings.
dict[str, Any]
  • "parameters": A dictionary with metadata including:
  • "model_id": The model identifier.
  • "neighbor_num": Number of spatial neighbors used.
  • "pooling": The pooling strategy used.
  • "is_aligned": Boolean indicating whether the data was considered aligned before inference.
  • "query": The uploaded file name.
dict[str, Any]
  • "error": None if successful, otherwise contains error information (400/500 status codes).
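
To make the role of neighbor_num concrete, the following sketch retrieves the nearest spatial neighbors of each cell from (x, y) coordinates by brute force. It illustrates the idea only: the function name is hypothetical, and the distance metric and tie-breaking used by the model's own preprocessing are assumptions (Euclidean distance, ties broken by cell index).

```python
import math


def nearest_neighbors(coords, neighbor_num):
    """For each cell, return the indices of its closest other cells.

    coords: list of (x, y) tuples, one per cell
    Brute force, O(n^2); Euclidean distance is an assumption.
    """
    if neighbor_num < 0:
        raise ValueError("neighbor_num must be non-negative")
    result = []
    for i, (xi, yi) in enumerate(coords):
        others = [
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        ]
        others.sort()  # by distance, then by index on ties
        result.append([j for _, j in others[:neighbor_num]])
    return result


# Four toy cells; the last one is far from the rest.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (10.0, 10.0)]
neighbors = nearest_neighbors(coords, 2)
```

A slide with fewer than neighbor_num + 1 cells simply yields shorter neighbor lists in this sketch; how the endpoint handles that case is not stated.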

genbio.toolkit.aido_models_apis.tissue_embedding_large

tissue_embedding_large(h5ad_path: str, pooling: Literal['mean', 'max', 'first_token', 'all', 'none'] = 'mean', neighbor_num: int = 8) -> dict[str, Any]

Compute spatially-aware tissue embeddings from spatially resolved single-cell RNA-seq data.

Notes

This function accesses the SOTA AIDO.Tissue-60M model spatial endpoint, a bidirectional transformer encoder trained on spatially resolved single-cell RNA-seq data (76 slides with 22M cells from Vizgen, Nanostring, and 10xGenomics). The model incorporates spatial cell information by retrieving K nearest neighbor cells for each center cell, concatenating the center cell and neighbor cell expression vectors as input with 2D rotary positional embeddings where the first dimension represents gene index and the second represents cell index. The model operates on the human transcriptome as input (up to 19,264 HGNC symbols, see tool aido_gene_list), learning a spatially-aware representation of the center cell. The rich contextual representations are SOTA for downstream tasks such as embedding-based similarity search, clustering, and training downstream models for niche and density prediction. No task head is applied; this endpoint exposes backbone embedding inference only. CRITICAL: Input h5ad files MUST contain spatial coordinates in adata.obs with columns "x" and "y".

Parameters:

Name Type Description Default
h5ad_path str

Path to an h5ad file containing spatially resolved single-cell gene expression data. The file MUST contain a cell-by-gene expression matrix in adata.X with spatial coordinate information in adata.obs.x and adata.obs.y columns (required for spatial context).

required
pooling Literal['mean', 'max', 'first_token', 'all', 'none']

Strategy to aggregate hidden-state representations. Options include: - "mean": mean pooling across sequence tokens (default) → [n_cells, hidden_dim], - "max": max pooling across sequence tokens → [n_cells, hidden_dim], - "first_token": use first token only → [n_cells, hidden_dim], - "all": return all sequence tokens → [n_cells, seq_len, hidden_dim], - "none": return gene-level embeddings without pooling.

'mean'
neighbor_num int

Number of spatial neighbors to include for each center cell. Must be non-negative. Default is 8, which retrieves 8 nearest neighbor cells based on spatial coordinates for spatial context modeling.

8

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

dict[str, Any]
  • "model_name": The identifier of the model used ("AIDO.Tissue-60M").
dict[str, Any]
  • "return_code": Integer status code (0 indicates success).
dict[str, Any]
  • "output": A dictionary containing:
  • "shape": Shape depends on pooling mode:
    • "mean"/"max"/"first_token": [n_cells, 512] where 512 is the embedding dimension.
    • "all": [n_cells, seq_len, 512] where seq_len is the sequence length.
    • "none": [n_cells, 19264, 512] for gene-level embeddings.
  • "values": A nested list of floats containing the computed embeddings.
dict[str, Any]
  • "parameters": A dictionary with metadata including:
  • "model_id": The model identifier.
  • "neighbor_num": Number of spatial neighbors used.
  • "pooling": The pooling strategy used.
  • "is_aligned": Boolean indicating whether the data was considered aligned before inference.
  • "query": The uploaded file name.
dict[str, Any]
  • "error": None if successful, otherwise contains error information (400/500 status codes).
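
The output shape depends on both the model and the pooling mode, and the unpooled modes can be very large. The sketch below encodes the documented shapes for both tissue models so that payload size can be estimated before a request is made; the function name is hypothetical.

```python
import math

# Embedding dimensions documented for the two tissue endpoints.
HIDDEN_DIM = {"AIDO.Tissue-3M": 128, "AIDO.Tissue-60M": 512}
N_GENES = 19264  # gene axis length in "none" mode


def expected_shape(model_name, pooling, n_cells, seq_len=None):
    """Documented output shape for a given model and pooling mode."""
    dim = HIDDEN_DIM[model_name]
    if pooling in ("mean", "max", "first_token"):
        return [n_cells, dim]
    if pooling == "all":
        if seq_len is None:
            raise ValueError("seq_len is required for pooling='all'")
        return [n_cells, seq_len, dim]
    if pooling == "none":
        return [n_cells, N_GENES, dim]
    raise ValueError(f"unknown pooling mode: {pooling}")


pooled = expected_shape("AIDO.Tissue-60M", "mean", 1000)
unpooled = expected_shape("AIDO.Tissue-60M", "none", 1000)
n_floats = math.prod(unpooled)  # number of values in the response
```

For 1,000 cells, gene-level output from the 60M model holds roughly 9.9 billion floats, so pooled modes are the practical choice for most batch sizes.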

RNA

genbio.toolkit.aido_models_apis.ncrna_embedding

ncrna_embedding(sequences: list[str], pooling: Literal['mean', 'max', 'none'] = 'mean') -> dict[str, Any]

Compute non-coding and regulatory RNA sequence embeddings from nucleotide sequences.

Notes

This function accesses the SOTA AIDO.RNA-1.6B model, a bidirectional encoder-only transformer with 1.6 billion parameters trained via masked language modeling on 42 million non-coding RNA sequences from RNAcentral. The model operates on RNA sequences with single-nucleotide tokenization (A, U, C, G), producing rich contextual representations. The representations are suitable for embedding-based similarity search, clustering, and training downstream models for tasks such as secondary structure prediction, inverse folding, and function classification. No task head is applied; this endpoint exposes backbone embedding inference only. Returns either pooled sequence-level embeddings when pooling is "mean" or "max", or nucleotide-level embeddings when pooling is "none".

Parameters:

Name Type Description Default
sequences list[str]

A list of RNA sequences (strings of nucleotide tokens). Sequences are tokenized at single-nucleotide resolution using the vocabulary: A, U, C, G.

required
pooling Literal['mean', 'max', 'none']

Strategy to aggregate nucleotide-level representations into a sequence-level embedding. Options include: - "mean": mean pooling over sequence tokens (default), - "max": max pooling, - "none": return nucleotide-level embeddings without pooling.

'mean'

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

dict[str, Any]
  • "model_name": The identifier of the model used ("AIDO.RNA-1.6B").
dict[str, Any]
  • "return_code": Integer status code (0 indicates success).
dict[str, Any]
  • "output": A dictionary containing:
  • "shape": A list specifying the tensor shape. For pooled outputs ("mean" or "max"), this is [N, 2048] where N is the number of input sequences and 2048 is the model hidden size. For unpooled output ("none"), shape is [N, L+2, 2048] where L is the sequence length (padded to the maximum in the batch), and the +2 accounts for prepended CLS and appended EOS tokens.
  • "values": A nested list of floats containing the computed embeddings.
dict[str, Any]
  • "parameters": A dictionary with metadata including:
  • "model_id": The model identifier.
  • "pooling": The pooling strategy used.
  • "sequence_count": Number of sequences processed.
  • "sequence_lengths": List of lengths for each input sequence.
dict[str, Any]
  • "error": None if successful, otherwise contains error information.
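
Because the RNA endpoints use the vocabulary A, U, C, G, sequences copied from DNA records need converting first. A minimal client-side sketch follows; `to_rna` is a hypothetical helper, and whether the endpoint itself tolerates T or lowercase input is not stated, so the conversion is done up front.

```python
RNA_VOCAB = set("AUCG")


def to_rna(sequence: str) -> str:
    """Uppercase, map T to U, and reject anything outside A, U, C, G."""
    rna = sequence.strip().upper().replace("T", "U")
    invalid = set(rna) - RNA_VOCAB
    if invalid:
        raise ValueError(f"unsupported characters: {sorted(invalid)}")
    return rna


# Hypothetical inputs: one DNA-style, one already RNA.
sequences = [to_rna(s) for s in ["atgcGT", "AUGGCU"]]
```
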

genbio.toolkit.aido_models_apis.mrna_embedding

mrna_embedding(sequences: list[str], pooling: Literal['mean', 'max', 'none'] = 'mean') -> dict[str, Any]

Compute coding sequence (mRNA/CDS) embeddings from RNA nucleotide sequences.

Notes

This function accesses the AIDO.RNA-1.6B-CDS model, a domain-adapted version of the SOTA AIDO.RNA-1.6B bidirectional encoder-only transformer trained on 9 million coding sequences. The model continues pre-training from AIDO.RNA-1.6B on coding sequence data, specializing it for mRNA and coding DNA sequence tasks. The model operates on RNA sequences with single-nucleotide tokenization (A, U, C, G), producing rich contextual representations optimized for coding sequences. The representations are suitable for embedding-based similarity search, clustering, and training downstream models for tasks such as translation efficiency prediction, protein abundance prediction, and codon optimization. No task head is applied; this endpoint exposes backbone embedding inference only. Returns either pooled sequence-level embeddings when pooling is "mean" or "max", or nucleotide-level embeddings when pooling is "none".

Parameters:

Name Type Description Default
sequences list[str]

A list of RNA coding sequences (strings of nucleotide tokens). Sequences are tokenized at single-nucleotide resolution using the vocabulary: A, U, C, G.

required
pooling Literal['mean', 'max', 'none']

Strategy to aggregate nucleotide-level representations into a sequence-level embedding. Options include: - "mean": mean pooling over sequence tokens (default), - "max": max pooling, - "none": return nucleotide-level embeddings without pooling.

'mean'

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

dict[str, Any]
  • "model_name": The identifier of the model used ("AIDO.RNA-1.6B-CDS").
dict[str, Any]
  • "return_code": Integer status code (0 indicates success).
dict[str, Any]
  • "output": A dictionary containing:
  • "shape": A list specifying the tensor shape. For pooled outputs ("mean" or "max"), this is [N, 2048] where N is the number of input sequences and 2048 is the model hidden size. For unpooled output ("none"), shape is [N, L+2, 2048] where L is the sequence length (padded to the maximum in the batch), and the +2 accounts for prepended CLS and appended EOS tokens.
  • "values": A nested list of floats containing the computed embeddings.
dict[str, Any]
  • "parameters": A dictionary with metadata including:
  • "model_id": The model identifier.
  • "pooling": The pooling strategy used.
  • "sequence_count": Number of sequences processed.
  • "sequence_lengths": List of lengths for each input sequence.
dict[str, Any]
  • "error": None if successful, otherwise contains error information.
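
Embedding-based similarity search, mentioned in the Notes, reduces to comparing pooled vectors. The sketch below ranks candidates by cosine similarity using only the standard library; the function names are hypothetical and the vectors are two-dimensional toys rather than real 2048-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query, candidates):
    """Indices of candidates, most similar to the query first."""
    scores = [(cosine(query, c), i) for i, c in enumerate(candidates)]
    return [i for _, i in sorted(scores, key=lambda s: -s[0])]


# Toy pooled embeddings standing in for rows of output["values"].
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
order = rank_by_similarity(query, candidates)
```
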

genbio.toolkit.aido_models_apis.rna_translation_efficiency_muscle

rna_translation_efficiency_muscle(sequences: list[str], pooling: Literal['mean', 'max', 'none'] = 'mean') -> dict[str, Any]

Predict translation efficiency in muscle tissue from 5' UTR sequences.

Notes

This function accesses the AIDO.RNA-1.6B-translation-efficiency-muscle model, which is fine-tuned from the SOTA AIDO.RNA-1.6B non-coding RNA sequence model on an endogenous human 5' UTR dataset measuring the ratio of Ribo-seq to RNA-seq RPKM values (translation efficiency) with 1,260 100bp 5' UTR sequences. Predictions are normalized to arbitrary units where higher values indicate more efficient translation. This model is specialized with observational data from human muscle tissue.

Parameters:

Name Type Description Default
sequences list[str]

A list of 5' UTR sequences up to 100bp (strings of RNA nucleotide tokens). Sequences are tokenized at single-nucleotide resolution using the vocabulary: A, U, C, G.

required
pooling Literal['mean', 'max', 'none']

Strategy to aggregate nucleotide-level representations. This parameter is passed to the model but translation efficiency prediction always returns a single scalar value per sequence. Options include: - "mean": mean pooling over sequence tokens (default), - "max": max pooling, - "none": no pooling (not recommended for this task).

'mean'

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

dict[str, Any]
  • "model_name": The identifier of the model used ("AIDO.RNA-1.6B-translation-efficiency-muscle").
dict[str, Any]
  • "return_code": Integer status code (0 indicates success).
dict[str, Any]
  • "output": A dictionary containing:
  • "shape": A list [N, 1] where N is the number of input sequences.
  • "values": A nested list of floats [[score_1], [score_2], ..., [score_N]] containing the predicted translation efficiency score for each sequence. Scores are in arbitrary units where higher values indicate greater translation efficiency in muscle tissue.
dict[str, Any]
  • "parameters": A dictionary with metadata including:
  • "model_id": The model identifier.
  • "pooling": The pooling strategy used.
  • "sequence_count": Number of sequences processed.
  • "sequence_lengths": List of lengths for each input sequence.
dict[str, Any]
  • "error": None if successful, otherwise contains error information.
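
Since "values" is a list of one-element lists, a typical next step is to flatten it and rank the inputs. The sketch below does this on a fabricated response; `rank_sequences` is a hypothetical helper, the sequences and scores are made up, and it assumes scores are returned in input order, which this endpoint's description does not state explicitly.

```python
def rank_sequences(sequences, response):
    """Pair each input with its score and sort from highest to lowest.

    Assumption: output["values"] follows the order of the input sequences.
    """
    if response["return_code"] != 0:
        raise RuntimeError(f"prediction failed: {response['error']}")
    values = response["output"]["values"]
    if len(values) != len(sequences):
        raise ValueError("number of scores does not match number of sequences")
    scored = [(seq, row[0]) for seq, row in zip(sequences, values)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Fabricated inputs and response, for illustration only.
utrs = ["GGGAAAUUUCCC", "AUGCAUGCAUGC", "CCCCGGGGAAAA"]
fake_response = {
    "return_code": 0,
    "output": {"shape": [3, 1], "values": [[0.2], [1.5], [-0.3]]},
    "error": None,
}
ranked = rank_sequences(utrs, fake_response)
```

Scores are in arbitrary units, so the ranking is meaningful only within a set of sequences scored by the same model.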

genbio.toolkit.aido_models_apis.rna_secondary_structure

rna_secondary_structure(sequences: list[str]) -> dict[str, Any]

Predict RNA secondary structure from nucleotide sequences in dot-bracket notation.

Notes

This function accesses the SOTA AIDO.RNA-1.6B-bpRNA_secondary_structure_prediction model, which is fine-tuned from AIDO.RNA-1.6B on the bpRNA and Archive-II datasets for RNA secondary structure prediction. The model predicts base pairing patterns that form RNA secondary structure, achieving SOTA performance with an F1 score of 0.783 on the bpRNA-TS0 test set, and demonstrates strong inter-family generalization across nine RNA families including 5S rRNA, tRNA, tmRNA, RNase P RNA, and others. Predictions are returned in dot-bracket notation where paired bases are indicated by matching parentheses and unpaired bases by dots. All sequences are processed in chunks of up to 1000 nucleotides.

Parameters:

Name Type Description Default
sequences list[str]

A list of RNA sequences to be processed in chunks of up to 1000 nucleotides each (strings of nucleotide tokens). Sequences are tokenized at single-nucleotide resolution using the vocabulary: A, U, C, G.

required

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

dict[str, Any]
  • "model_name": The identifier of the model used ("AIDO.RNA-1.6B-bpRNA_secondary_structure_prediction").
dict[str, Any]
  • "return_code": Integer status code (0 indicates success).
dict[str, Any]
  • "output": A dictionary containing:
  • "structures": A list of dictionaries, one per input sequence, each containing:
    • "dot_bracket": A string in dot-bracket notation representing the predicted secondary structure, where '.' indicates unpaired bases and matching '(' and ')' indicate base pairs.
dict[str, Any]
  • "parameters": A dictionary with metadata including:
  • "model_id": The model identifier.
  • "pooling": The pooling strategy used (deprecated).
  • "sequence_count": Number of sequences processed.
  • "sequence_lengths": List of lengths for each input sequence.
dict[str, Any]
  • "error": None if successful, otherwise contains error information.
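
The "dot_bracket" string can be converted into explicit base pairs with a stack. The sketch below handles only the three symbols named above ('.', '(' and ')'); the function name is hypothetical and positions are 0-based.

```python
def base_pairs(dot_bracket: str) -> list[tuple[int, int]]:
    """Return sorted (i, j) index pairs for matching parentheses (0-based)."""
    stack, pairs = [], []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected symbol {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# A hairpin: two stacked pairs closing a two-nucleotide loop, one trailing base.
pairs = base_pairs("((..)).")
```
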

genbio.toolkit.aido_models_apis.rna_protein_abundance_hsapiens

rna_protein_abundance_hsapiens(sequences: list[str], pooling: Literal['mean', 'max', 'none'] = 'mean') -> dict[str, Any]

Predict protein abundance from mRNA coding sequences in human cells.

Notes

This function accesses the AIDO.RNA-1.6B-CDS-protein-abundance-hsapiens model, which is fine-tuned from the SOTA AIDO.RNA-1.6B-CDS coding sequence model on a dataset of 11.8k human CDS with lengths between 156 and 2048 bp and experimentally measured protein abundance from PAXdb, mainly consisting of mass spectrometry-based quantifications. The model predicts the steady-state protein abundance that would result from a given coding sequence, capturing the effects of mRNA stability, ribosome throughput, and other factors that influence protein expression in human cells. Predictions are in arbitrary units where higher values indicate greater protein abundance. This model is specialized for human (Homo sapiens) cells.

Parameters:

Name Type Description Default
sequences list[str]

A list of mRNA coding sequences up to 2048 nucleotides in length (strings of RNA nucleotide tokens). Sequences are tokenized at single-nucleotide resolution using the vocabulary: A, U, C, G.

required
pooling Literal['mean', 'max', 'none']

Strategy to aggregate nucleotide-level representations. This parameter is passed to the model but protein abundance prediction always returns a single scalar value per sequence. Options include: - "mean": mean pooling over sequence tokens (default), - "max": max pooling, - "none": no pooling (not recommended for this task).

'mean'

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

dict[str, Any]
  • "model_name": The identifier of the model used ("AIDO.RNA-1.6B-CDS-protein-abundance-hsapiens").
dict[str, Any]
  • "return_code": Integer status code (0 indicates success).
dict[str, Any]
  • "output": A dictionary containing:
  • "shape": A list [N, 1] where N is the number of input sequences.
  • "values": A nested list of floats [[score_1], [score_2], ..., [score_N]] containing the predicted protein abundance score for each sequence. Scores are in arbitrary units where higher values indicate greater protein abundance in human cells.
dict[str, Any]
  • "parameters": A dictionary with metadata including:
  • "model_id": The model identifier.
  • "pooling": The pooling strategy used.
  • "sequence_count": Number of sequences processed.
  • "sequence_lengths": List of lengths for each input sequence.
dict[str, Any]
  • "error": None if successful, otherwise contains error information.

genbio.toolkit.aido_models_apis.rna_splice_site_query

rna_splice_site_query(sequences: list[str]) -> dict[str, Any]

Predict splice site probabilities from RNA sequences.

Notes

This function accesses the AIDO.RNA-SpliceSite-Query model, a fine-tuned AIDO RNA model that predicts acceptor and donor splice site probabilities from RNA sequences. The model is specialized for identifying potential splice sites in pre-mRNA sequences and returns probability scores for both acceptor (3' splice site) and donor (5' splice site) positions. CRITICAL REQUIREMENT: Each input sequence must be exactly 600 nucleotides long and should be centered around the potential splice site of interest. Sequences of other lengths will be rejected by the API.

Parameters:

Name Type Description Default
sequences list[str]

List of RNA sequences to score. Each sequence must be exactly 600 nucleotides long, centered around a potential splice site. Sequences should contain only standard RNA nucleotides (A, U, C, G). The 600bp requirement ensures proper context for splice site prediction.

required

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

dict[str, Any]
  • "model_name": The identifier of the model used ("AIDO.RNA-SpliceSite-Query").
dict[str, Any]
  • "return_code": Integer status code (200 indicates success).
dict[str, Any]
  • "output": A dictionary containing:
  • "predictions": Array of predictions, one per input sequence. Each prediction contains:
    • "acceptor_proba": Float probability (0-1) that the sequence contains an acceptor splice site (3' splice site, AG dinucleotide).
    • "donor_proba": Float probability (0-1) that the sequence contains a donor splice site (5' splice site, GU dinucleotide).
dict[str, Any]
  • "parameters": A dictionary with metadata including:
  • "model_id": The model identifier.
  • "num_sequences": Number of sequences processed.
dict[str, Any]
  • "error": None if successful, otherwise contains error information.
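
Because every input must be exactly 600 nucleotides and centered on the candidate site, a window usually has to be cut from a longer transcript. The sketch below shows one way to do it; `splice_window` is a hypothetical helper, and "centered" is interpreted here as 300 nucleotides on each side with the site at index 300 of the window, which is an assumption about the expected convention.

```python
WINDOW = 600  # required input length, from the parameter description


def splice_window(transcript: str, site: int) -> str:
    """Cut a 600-nt window with `site` (0-based) at index 300 of the window.

    Assumption: "centered" means 300 nt upstream and 300 nt from the site onward.
    """
    half = WINDOW // 2
    start, end = site - half, site + half
    if start < 0 or end > len(transcript):
        raise ValueError("not enough flanking sequence for a 600-nt window")
    return transcript[start:end]


# Toy transcript with a GU dinucleotide at position 300.
transcript = "A" * 300 + "GU" + "C" * 400
window = splice_window(transcript, 300)
```

Sites too close to either end of the transcript raise an error here rather than being padded, since the description says sequences of other lengths are rejected.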

genbio.toolkit.aido_models_apis.gsrna_activity_query

gsrna_activity_query(sequences: list[str]) -> dict[str, Any]

Predict guide RNA (gsRNA) activity scores from RNA sequences.

Notes

This function accesses the AIDO.gsRNA-Activity-Query model, which predicts activity scores for guide RNA sequences. The model is trained to predict the effectiveness of guide RNAs for gene editing applications. Activity scores are averaged across 5-fold ensemble predictions for improved reliability. CRITICAL REQUIREMENT: Each input sequence must be exactly 21 nucleotides long and contain only A, C, G, T characters (case-insensitive).

Parameters:

Name Type Description Default
sequences list[str]

List of RNA sequences to score. Each sequence must be exactly 21 nucleotides long and contain only standard RNA/DNA nucleotides (A, C, G, T - case-insensitive, automatically converted to uppercase). Empty sequences are not allowed.

required

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

dict[str, Any]
  • "model_name": The identifier of the model used ("AIDO.gsRNA-Activity-Query").
dict[str, Any]
  • "return_code": Integer status code (0 indicates success).
dict[str, Any]
  • "output": A dictionary containing:
  • "activity_scores": Array of predicted activity scores (floats), one per input sequence. Scores are in the same order as input sequences and represent the predicted effectiveness of each guide RNA. Values are averaged across 5-fold ensemble predictions.
dict[str, Any]
  • "parameters": A dictionary with metadata including:
  • "model_id": The model identifier.
  • "sequence_count": Number of sequences processed.
dict[str, Any]
  • "error": None if successful, otherwise contains error information.
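
The input constraints (exactly 21 characters from A, C, G, T, case-insensitive) are easy to check before a request is sent. A minimal sketch with a hypothetical helper name:

```python
import re

# Exactly 21 characters from A, C, G, T, matched after uppercasing.
_GUIDE = re.compile(r"[ACGT]{21}")


def normalize_guides(sequences):
    """Uppercase each guide and reject any that break the documented constraints."""
    normalized = []
    for seq in sequences:
        upper = seq.upper()
        if not _GUIDE.fullmatch(upper):
            raise ValueError(f"guide must be 21 nt of A/C/G/T: {seq!r}")
        normalized.append(upper)
    return normalized


# Made-up 21-nt guide in mixed case.
guides = normalize_guides(["acgtacgtacgtacgtacgta"])
```
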

Structure Prediction

genbio.toolkit.aido_models_apis.structure_prediction

structure_prediction(query_1: str, query_1_type: Literal['proteinChain', 'rnaSequence', 'ligand'], query_2: str | None = None, query_2_type: Literal['proteinChain', 'rnaSequence', 'ligand'] | None = None, query_3: str | None = None, query_3_type: Literal['proteinChain', 'rnaSequence', 'ligand'] | None = None) -> dict[str, Any]

Predict full-atom 3D structure and interactions of biomolecules including proteins, DNA, RNA, and small molecule ligands.

Notes

This function accesses the SOTA AIDO.StructurePrediction model, an AlphaFold3-like full-atom structure prediction model designed to predict the structure and interactions of biological molecules including proteins, DNA, RNA, ligands, and antibodies. The model achieves state-of-the-art performance on all structure prediction tasks, with especially strong performance for immunology-related structure prediction tasks, including antibody, nanobody, antibody-antigen, and nanobody-antigen complexes. Predictions are returned in CIF (Crystallographic Information File) format, always with chain keys A0, B0, and C0, suitable for structural analysis and visualization with py3Dmol, PyMOL, ChimeraX, and other structural biology tools.

Parameters:

Name Type Description Default
query_1 str

Sequence of the first molecule. Amino acids if proteinChain, AUGC nucleotides if rnaSequence, ATGC nucleotides if dnaSequence, or SMILES string if ligand. Ligand also tolerates CCD code if known.

required
query_1_type str

Type of the first sequence; one of ["proteinChain", "rnaSequence", "dnaSequence", "ligand"].

required
query_2 str

Sequence of the second molecule, same specification as query_1 (default: None).

None
query_2_type str

Type of the second sequence, same specification as query_1_type (default: None).

None
query_3 str

Sequence of the third molecule, same specification as query_1 (default: None).

None
query_3_type str

Type of the third sequence, same specification as query_1_type (default: None).

None

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "model_name": The identifier of the model used ("AIDO.StructurePrediction").
  • "return_code": Integer status code (0 indicates success).
  • "output": A dictionary containing:
    • "cif_data": A string containing the full structure prediction in CIF (Crystallographic Information File) format, including atomic coordinates, connectivity information, entity definitions, and metadata. The CIF format can be parsed by standard structural biology tools for visualization and analysis. Always returns chains "A0", "B0", and "C0" for the input molecules.
  • "parameters": A dictionary containing the input parameters provided to the model, including all query sequences and their types.
  • "error": None if successful, otherwise contains error information.
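
Because the function takes numbered query_N / query_N_type pairs, it can be convenient to build the keyword arguments from a list of molecules and to unwrap the CIF text from the response. The sketch below is an illustration under stated assumptions: build_structure_kwargs and extract_cif are hypothetical helpers, not part of the toolkit, and the sequences are short placeholders rather than meaningful inputs.

```python
from typing import Any


# Hypothetical helper; not part of genbio.toolkit.
def build_structure_kwargs(molecules: list[tuple[str, str]]) -> dict[str, Any]:
    """Map one to three (sequence, type) pairs onto query_1..query_3 kwargs."""
    if not 1 <= len(molecules) <= 3:
        raise ValueError("structure_prediction accepts one to three molecules")
    kwargs: dict[str, Any] = {}
    for index, (sequence, molecule_type) in enumerate(molecules, start=1):
        kwargs[f"query_{index}"] = sequence
        kwargs[f"query_{index}_type"] = molecule_type
    return kwargs


# Hypothetical helper; relies only on the response fields documented above.
def extract_cif(response: dict[str, Any]) -> str:
    """Return the CIF text, raising if the call did not succeed."""
    if response["return_code"] != 0:
        raise RuntimeError(f"structure prediction failed: {response['error']}")
    return response["output"]["cif_data"]


# Placeholder inputs: a short peptide and ethanol as a SMILES string.
kwargs = build_structure_kwargs([("MKTAYIAKQR", "proteinChain"), ("CCO", "ligand")])
print(kwargs)

# With the toolkit installed, the call would look like:
#   from genbio.toolkit.aido_models_apis import structure_prediction
#   cif_text = extract_cif(structure_prediction(**kwargs))
```

The returned CIF string can be written to a .cif file and opened in any of the viewers named in the notes.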

Perturbation & Interactome

genbio.toolkit.aido_models_apis.perturbation_effect_query

perturbation_effect_query(h5ad_path: str, cell_line: Literal['H1', 'Hep-G2', 'Jurkat', 'K-562', 'RPE1'], query_obs_col: str = 'gene', query_control: str = 'ctrl', query_condition: str = 'cond_A', ref_input_type: Literal['raw', 'delta'] = 'delta', metric: Literal['cosine', 'euclidean', 'correlation', 'spearman'] = 'correlation', target_sum: float = 10000.0, top_k: int = 10) -> dict[str, Any]

Query perturbation effect database for similar genetic perturbations.

Notes

This function accesses the AIDO.Perturbation-Query model, which searches a reference database of genetic perturbation effects to find perturbations with similar transcriptomic signatures. The model compares the user-provided query data (control vs condition) against a pre-computed reference database and returns the most similar perturbations ranked by distance score. Currently, only the H1 cell line is fully supported with comprehensive reference data. The other cell lines (Hep-G2, Jurkat, K-562, RPE1) may have limited reference data, and a query against them can fail if the reference data does not cover them. The query data should contain both control and perturbed cells, labelled in the obs dataframe.

Parameters:

Name Type Description Default
h5ad_path str

Path to an h5ad file containing single-cell RNA-seq expression data with control and condition/perturbed cells. The file should contain a cell-by-gene expression matrix with labels in the obs dataframe indicating which cells are control vs condition.

required
cell_line Literal['H1', 'Hep-G2', 'Jurkat', 'K-562', 'RPE1']

Cell line to query against. Must be one of: "H1", "Hep-G2", "Jurkat", "K-562", "RPE1". Currently only "H1" is fully supported with complete reference data.

required
query_obs_col str

Column name in query.obs that contains control vs condition labels. Default is "gene".

'gene'
query_control str

Label value in query_obs_col that identifies control cells. Default is "ctrl".

'ctrl'
query_condition str

Label value in query_obs_col that identifies perturbed/condition cells. Default is "cond_A".

'cond_A'
ref_input_type Literal['raw', 'delta']

Reference data type: "raw" for raw counts, "delta" for pre-computed deltas. Default is "delta". Note that reference files use target_gene column (not gene) for perturbation names.

'delta'
metric Literal['cosine', 'euclidean', 'correlation', 'spearman']

Distance metric for similarity calculation. Must be one of: "cosine", "euclidean", "correlation", "spearman". Default is "correlation".

'correlation'
target_sum float

Target sum for normalization (used when ref_input_type="raw"). Default is 10000.0.

10000.0
top_k int

Number of top-ranked perturbation matches to return. Default is 10.

10

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "model_name": The identifier of the model used ("AIDO.Perturbation-Query").
  • "return_code": Integer status code (0 indicates success).
  • "output": A dictionary containing:
    • "results": Array of ranked perturbation matches, each with:
      • "gene": Gene/perturbation name from the reference database.
      • "distance_score": Similarity distance score (lower = more similar/closer match). For correlation metric, this is 1 - correlation.
      • "rank": Rank of this perturbation (1 = most similar).
    • "count": Number of results returned (≤ top_k).
  • "parameters": A dictionary with metadata including:
    • "model_id": The model identifier.
    • "query_file": The uploaded file name.
    • "cell_line": Cell line used.
    • "query_obs_col", "query_control", "query_condition": Query labels.
    • "ref_input_type": Reference data type.
    • "metric": Distance metric used.
    • "target_sum": Normalization target.
    • "top_k": Number of top results requested.
    • "cell_count": Number of cells in query data.
    • "gene_count": Number of genes in query data.
  • "error": None if successful, otherwise contains error information.
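
For the correlation metric, distance_score is described above as 1 - correlation. The sketch below illustrates that arithmetic and the resulting ranking on toy delta vectors in pure Python. It is not the service's implementation: the helper names, the toy gene names, and the toy values are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_by_correlation_distance(
    query_delta: list[float],
    reference_deltas: dict[str, list[float]],
    top_k: int = 10,
) -> list[dict]:
    """Rank reference perturbations by 1 - Pearson correlation (lower = closer)."""
    scored = [
        {"gene": gene, "distance_score": 1.0 - pearson(query_delta, delta)}
        for gene, delta in reference_deltas.items()
    ]
    scored.sort(key=lambda item: item["distance_score"])
    return [dict(item, rank=rank) for rank, item in enumerate(scored[:top_k], start=1)]


# Toy values for illustration only; not real expression data.
query = [1.0, 2.0, 3.0, 4.0]
reference = {
    "GENE_A": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with the query
    "GENE_B": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated
    "GENE_C": [1.0, 3.0, 2.0, 4.0],  # partially correlated
}
results = rank_by_correlation_distance(query, reference, top_k=2)
for hit in results:
    print(hit)
```

Under this definition the score ranges from 0 (identical direction of change) to 2 (exactly opposite), which is why lower values rank first.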

genbio.toolkit.aido_models_apis.perturbation_effect_query_cell_lines

perturbation_effect_query_cell_lines() -> list[str]

Retrieve the list of supported cell lines for perturbation effect queries.

Notes

These cell line names are used to select the reference database for perturbation effect similarity searches. Currently, only "H1" is fully supported with comprehensive reference data. Other cell lines may have limited reference data availability.

Returns:

Type Description
list[str]

A list of cell line names (strings) that are supported by the AIDO.Perturbation-Query model.

genbio.toolkit.aido_models_apis.interactome_query

interactome_query(gene_symbol: str, n_neighbors: int = 10, metric: Literal['pearson', 'spearman', 'euclidean'] = 'euclidean') -> dict[str, Any]

Query the interactome embeddings for nearest neighbor genes.

Notes

This function accesses the AIDO.Interactome-Query model, which searches pre-computed gene interaction embeddings to find the nearest neighbor genes for a given query gene. The model returns genes with similar interaction patterns based on the specified distance/similarity metric. The query gene itself appears in the results with rank 1 and a score of 0.0 (for the euclidean metric) or 1.0 (for the pearson and spearman metrics). This tool is useful for identifying genes with similar biological functions, interaction partners, or pathway memberships.

Parameters:

Name Type Description Default
gene_symbol str

Gene symbol to query (e.g., 'CXCL8', 'CDKN1A'). Must exist in the interactome reference database vocabulary (~18,000 genes).

required
n_neighbors int

Number of nearest neighbors to return (including the query gene itself). Default is 10.

10
metric Literal['pearson', 'spearman', 'euclidean']

Distance/similarity metric for nearest neighbor search. Must be one of: "pearson", "spearman", or "euclidean". Default is "euclidean". Score interpretation varies by metric:

  • euclidean: lower = closer (query gene has 0.0)
  • pearson/spearman: higher = more similar (query gene has 1.0)

'euclidean'

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "model_name": The identifier of the model used ("AIDO.Interactome-Query").
  • "return_code": Integer status code (0 indicates success).
  • "output": A dictionary containing:
    • "results": Array of ranked neighbor genes, each with:
      • "gene": Gene symbol from the interactome database.
      • "distance_score": Similarity/distance score. Interpretation depends on metric: For euclidean: lower = closer (query gene has 0.0). For pearson/spearman: higher = more similar (query gene has 1.0).
      • "rank": Rank of this neighbor (1 = query gene itself, 2 = nearest neighbor, etc.).
    • "count": Number of results returned (≤ n_neighbors).
  • "parameters": A dictionary with metadata including:
    • "model_id": The model identifier.
    • "gene_symbol": The query gene.
    • "n_neighbors": Number of neighbors requested.
    • "metric": Distance metric used.
  • "error": None if successful, otherwise contains error information.
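
Since the query gene is returned as its own rank-1 hit, callers who want only true neighbors need to drop it. The sketch below shows one way to do that against the response shape documented above. neighbors_excluding_query is a hypothetical helper, and the mock response uses made-up gene names and scores, not real model output.

```python
from typing import Any


# Hypothetical helper; relies only on the documented response fields.
def neighbors_excluding_query(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the ranked hits without the query gene's self-match."""
    if response["return_code"] != 0:
        raise RuntimeError(f"interactome query failed: {response['error']}")
    query_gene = response["parameters"]["gene_symbol"]
    return [hit for hit in response["output"]["results"] if hit["gene"] != query_gene]


# Mock response for illustration only (placeholder genes and scores).
mock_response = {
    "model_name": "AIDO.Interactome-Query",
    "return_code": 0,
    "output": {
        "results": [
            {"gene": "CXCL8", "distance_score": 0.0, "rank": 1},
            {"gene": "GENE_X", "distance_score": 0.5, "rank": 2},
            {"gene": "GENE_Y", "distance_score": 0.9, "rank": 3},
        ],
        "count": 3,
    },
    "parameters": {
        "model_id": "placeholder",
        "gene_symbol": "CXCL8",
        "n_neighbors": 3,
        "metric": "euclidean",
    },
    "error": None,
}

neighbors = neighbors_excluding_query(mock_response)
print([hit["gene"] for hit in neighbors])
```

Because n_neighbors counts the query gene, requesting k + 1 neighbors yields k genes after this filter.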

genbio.toolkit.aido_models_apis.interactome_query_gene_vocab

interactome_query_gene_vocab() -> list[str]

Retrieve the gene vocabulary for interactome queries.

Notes

This function returns the list of gene symbols that are recognized by the AIDO.Interactome-Query model. Query genes must exist in this vocabulary (approximately 18,000 genes).

Returns:

Type Description
list[str]

A list of HGNC gene symbols (strings) that are supported by the interactome query model, in alphabetical order.
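
Because query genes must exist in the vocabulary, it can help to screen a gene list against it before querying. The sketch below is a minimal illustration: split_by_vocab is a hypothetical helper, it uses exact case-sensitive matching (an assumption, since the matching rules are not stated here), and the small vocabulary stands in for the real list returned by interactome_query_gene_vocab().

```python
# Hypothetical helper; not part of genbio.toolkit.
def split_by_vocab(genes: list[str], vocab: list[str]) -> tuple[list[str], list[str]]:
    """Partition genes into (known, unknown) using exact, case-sensitive matching."""
    vocab_set = set(vocab)  # set lookup keeps this fast for ~18,000 symbols
    known = [gene for gene in genes if gene in vocab_set]
    unknown = [gene for gene in genes if gene not in vocab_set]
    return known, unknown


# Stand-in vocabulary for illustration; with the toolkit installed it would be:
#   from genbio.toolkit.aido_models_apis import interactome_query_gene_vocab
#   vocab = interactome_query_gene_vocab()
toy_vocab = ["CDKN1A", "CXCL8", "TP53"]

known, unknown = split_by_vocab(["CXCL8", "NOT_A_GENE"], toy_vocab)
print("known:", known)
print("unknown:", unknown)
```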

genbio.toolkit.aido_models_apis.knockout_effect_query

knockout_effect_query(gene_symbol: str | list[str]) -> dict[str, Any]

Predict transcriptomic effects of gene knockout.

Notes

This function accesses the AIDO.KnockoutEffect-Query model, which returns predicted expression changes for approximately 6,000 readout genes resulting from knocking out one or more query genes. The model predicts the difference in expression between knockout and baseline conditions (Expression_KO - Expression_Baseline). Positive values indicate upregulation after knockout, while negative values indicate downregulation. This tool is useful for understanding the downstream transcriptional effects of perturbing specific genes, identifying potential compensatory mechanisms, and predicting phenotypic outcomes.

Parameters:

Name Type Description Default
gene_symbol str | list[str]

Gene symbol(s) to query. Can be either a single gene as a string (e.g., "CXCL8") or a list of genes (e.g., ["CXCL8", "PLAG1", "TP53"]). Each gene must exist in the knockout effect reference database vocabulary (~18,000 genes).

required

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "model_name": The identifier of the model used ("AIDO.KnockoutEffect-Query").
  • "return_code": Integer status code (0 indicates success).
  • "output": A dictionary containing:
    • "results": Array of results, one per queried gene. Each result contains:
      • "gene": Gene symbol that was knocked out.
      • "readout_genes": Dictionary mapping readout gene symbols to delta values (predicted expression changes). Keys are gene symbols (~6,000 readout genes), values are floating-point numbers representing the change in expression. Positive values indicate upregulation, negative values indicate downregulation.
    • "count": Number of results returned (equals number of queried genes).
  • "parameters": A dictionary with metadata including:
    • "model_id": The model identifier.
    • "gene_symbol": The query gene(s).
    • "gene_count": Number of genes queried.
  • "error": None if successful, otherwise contains error information.
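
With roughly 6,000 readout genes per knocked-out gene, a common first step is to pull out the strongest predicted changes in each direction. The sketch below shows one way to do that from a readout_genes mapping. top_changes is a hypothetical helper, and the mapping uses made-up gene names and delta values, not real predictions.

```python
# Hypothetical helper; operates on the documented "readout_genes" mapping.
def top_changes(
    readout_genes: dict[str, float], n: int = 5
) -> tuple[list[tuple[str, float]], list[tuple[str, float]]]:
    """Return (most upregulated, most downregulated) readout genes, n each at most."""
    up = sorted(
        ((gene, delta) for gene, delta in readout_genes.items() if delta > 0),
        key=lambda pair: pair[1],
        reverse=True,  # largest positive delta first
    )
    down = sorted(
        ((gene, delta) for gene, delta in readout_genes.items() if delta < 0),
        key=lambda pair: pair[1],  # most negative delta first
    )
    return up[:n], down[:n]


# Made-up deltas for illustration only.
toy_readout = {"GENE_A": 1.2, "GENE_B": -0.8, "GENE_C": 0.1, "GENE_D": -2.0}

up, down = top_changes(toy_readout, n=2)
print("up:", up)
print("down:", down)
```

For a real response, the same function would be applied to each entry of output["results"], using that entry's "readout_genes" value.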

genbio.toolkit.aido_models_apis.knockout_effect_query_gene_vocab

knockout_effect_query_gene_vocab() -> list[str]

Retrieve the input gene vocabulary for knockout effect queries.

Notes

This function returns the list of gene symbols that can be queried in the AIDO.KnockoutEffect-Query model. Query genes must exist in this vocabulary (approximately 18,000 genes).

Returns:

Type Description
list[str]

A list of HGNC gene symbols (strings) that can be knocked out in the knockout effect prediction model, in alphabetical order.

genbio.toolkit.aido_models_apis.knockout_effect_query_readout_genes

knockout_effect_query_readout_genes() -> list[str]

Retrieve the readout gene vocabulary for knockout effect query results.

Notes

This function returns the list of readout gene symbols that are included in knockout effect predictions. The model predicts expression changes for approximately 6,000 readout genes in response to knocking out the query gene(s).

Returns:

Type Description
list[str]

A list of HGNC gene symbols (strings) for which expression changes are predicted in knockout effect results, in alphabetical order.

genbio.toolkit.aido_models_apis.gene_knockdown_query

gene_knockdown_query(genes: list[str], context_name: str = 'K562', return_symbols: bool = True, *, download: bool = True) -> dict[str, Any]

Simulate gene knockdowns and predict genome-wide expression changes.

Notes

This function performs In-Silico Perturbation via the AIDO.GeneKnockdown model. It predicts how the transcriptome of a cell (the "context") would change if one or more genes were knocked down. Unlike a simple lookup, this uses a trained Multi-Layer Perceptron (MLPConcat) to generalise effects based on gene embeddings and input context (gene expression) vectors.

Model & Data Context

  • Model (MLPConcat): A neural network that takes two inputs — a cell-state vector C and a perturbation vector P — and predicts delta expression (knockdown vs. baseline).
  • Embeddings: Uses NBFNet-predicted embeddings that incorporate biological priors from gene ontologies and protein–protein interaction networks.
  • Contexts: 1,048 cell types / tissues are supported (e.g. "K562", "hepatoblastoma cell HepG2", "HEK293T", "endoderm cells"). Use the associated context vocabulary function for the full list.
  • Targets: 18,425 gene symbols can be knocked down.
  • Readout: Each prediction covers 15,867 readout genes (columns of the output matrix — gene symbols or Ensembl IDs). Note: the readout gene set differs from the target gene set.

Result delivery

Because the result matrix can be very large (up to ~1 GB for 18,000 genes), the API returns a lightweight JSON receipt with download URIs instead of streaming the data inline. The .h5ad file is stored in the inference results object store (GCS).

When download=True this function automatically fetches the .h5ad file and parses it into an anndata.AnnData object, available as output["adata"]. It first tries a direct download via the google-cloud-storage SDK (fastest for in-cluster / same-project callers, zero egress cost), and falls back to the HTTPS signed URL for external callers. For example:

    from genbio.toolkit.aido_models_apis import gene_knockdown_query

    resp = gene_knockdown_query(["MYC"], download=True)
    adata = resp["output"]["adata"]  # parsed anndata.AnnData result

Parameters:

Name Type Description Default
genes list[str]

List of gene symbols to knock down (e.g. ["MYC", "TP53"]). Case-insensitive. Genes not found in the model's embedding vocabulary (18,425 genes) are silently skipped. At least one valid gene is required.

required
context_name str

Cell type / tissue context for the prediction. Supports 1,048 contexts. Defaults to "K562".

'K562'
return_symbols bool

If True (default), readout gene columns in the .h5ad are HGNC gene symbols. If False, they are Ensembl IDs.

True
download bool

If True, automatically download the .h5ad result and parse it into an anndata.AnnData object at output["adata"]. The download strategy prefers the GCS SDK (in-cluster) and falls back to the signed URL (external). Defaults to True.

True

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "model_name": The identifier of the model ("AIDO.GeneKnockdown").
  • "return_code": Integer status code (0 = success).
  • "output": On success, the GCS result receipt:
    • "uri": Provider-native URI, e.g. gs://inference-engine-results/gene_knockdown//knockdown_result.h5ad.
    • "signed_url": Time-limited HTTPS download URL (may be None if signing is unavailable).
    • "key": Object key within the results bucket.
    • "size_bytes": Size of the .h5ad file in bytes.
    • "content_type": Always "application/x-hdf5".
    • "upload_duration_ms": Server-side upload time (ms).
    • "adata" (only when download=True): The result loaded as an anndata.AnnData object (shape [n_genes, 15867]).
  • "parameters": Echo of request metadata (model_id, genes, context_name, gene_count, return_symbols).
  • "error": None on success, otherwise a human-readable error message.

Raises:

Type Description
HTTPError

If the API returns an error status.
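
When download=False, or when the caller wants to inspect the receipt before touching a large file, the fields documented above are enough to decide how to proceed. The sketch below is a minimal illustration: describe_knockdown_result is a hypothetical helper, and the mock receipt uses placeholder URIs and sizes rather than real values.

```python
from typing import Any


# Hypothetical helper; relies only on the documented receipt fields.
def describe_knockdown_result(response: dict[str, Any]) -> dict[str, Any]:
    """Summarise where the result can be read from and how large it is."""
    if response["return_code"] != 0:
        raise RuntimeError(f"gene knockdown failed: {response['error']}")
    output = response["output"]
    if "adata" in output:
        source = "adata"  # already downloaded and parsed
    elif output.get("signed_url"):
        source = "signed_url"  # fetch over HTTPS
    else:
        source = "uri"  # signing unavailable; only the provider-native URI remains
    return {"source": source, "size_mb": output["size_bytes"] / 1_000_000}


# Mock receipt for illustration only (placeholder bucket, key, and sizes).
mock_receipt = {
    "model_name": "AIDO.GeneKnockdown",
    "return_code": 0,
    "output": {
        "uri": "gs://example-bucket/example/knockdown_result.h5ad",
        "signed_url": "https://example.com/placeholder-signed-url",
        "key": "example/knockdown_result.h5ad",
        "size_bytes": 2_500_000,
        "content_type": "application/x-hdf5",
        "upload_duration_ms": 120,
    },
    "parameters": {"genes": ["MYC"], "context_name": "K562"},
    "error": None,
}

summary = describe_knockdown_result(mock_receipt)
print(summary)
```

Checking size_bytes first is a cheap way to avoid pulling a very large matrix into memory unintentionally.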