Inference Engine - Other
Additional AIDO foundation model endpoints served by the Inference Engine (AIDO_INFERENCE_ENGINE_URL).
Protein
genbio.toolkit.aido_models_apis.protein_embedding
protein_embedding(query: str | list[str], pooling: Literal['mean', 'max', 'min', 'none'] = 'mean') -> dict[str, Any]
Compute protein sequence embeddings from an amino acid sequence.
Notes
This function accesses the SOTA AIDO.Protein-16B bidirectional transformer encoder trained via
masked language modeling on >1.2 trillion amino acids from UniRef90 and
ColabFoldDB. The model operates on single amino acid sequences, and
produces rich contextual representations that are SOTA for various downstream
tasks such as embedding-based similarity search, clustering, and training downstream models.
No task head is applied; this endpoint exposes backbone
embedding inference only. It can process up to 1023 amino acids per sequence and
returns either a pooled protein-level embedding when pooling is "mean", "max", or "min",
or residue-level embeddings when pooling is "none".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str \| list[str] | A single protein sequence (string of amino acid tokens) or a list of sequences. Sequences are tokenized at single-amino-acid resolution using the model's fixed vocabulary of canonical amino acids. | required |
| pooling | Literal['mean', 'max', 'min', 'none'] | Strategy to aggregate token-level representations into a sequence-level embedding. Options: "mean": mean pooling over sequence tokens (default); "max": max pooling; "min": min pooling; "none": return token-level embeddings without pooling. | 'mean' |

Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
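
To make the pooling options concrete, the sketch below reproduces mean, max, and min pooling over a toy residue-level matrix in plain Python. It only illustrates the arithmetic: `pool_tokens` is a hypothetical helper, not part of the toolkit, and how the server treats special tokens during pooling is not stated here.

```python
from statistics import fmean


def pool_tokens(token_embeddings, pooling="mean"):
    """Collapse a [length, dim] nested list into a single [dim] vector.

    Hypothetical helper for illustration only; the endpoint performs
    pooling server-side.
    """
    if pooling == "none":
        return token_embeddings  # residue-level output, returned unchanged
    reducers = {"mean": fmean, "max": max, "min": min}
    if pooling not in reducers:
        raise ValueError(f"unsupported pooling: {pooling!r}")
    reduce_column = reducers[pooling]
    # zip(*rows) iterates over embedding dimensions (columns).
    return [reduce_column(column) for column in zip(*token_embeddings)]


# Toy input: 3 residues, 2 embedding dimensions (made-up numbers).
residues = [[1.0, 4.0], [2.0, 0.0], [3.0, 2.0]]
mean_vec = pool_tokens(residues, "mean")
max_vec = pool_tokens(residues, "max")
min_vec = pool_tokens(residues, "min")
```

With pooling other than "none", each sequence is reduced to one vector regardless of its length, which is what makes pooled embeddings convenient for similarity search and clustering.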
genbio.toolkit.aido_models_apis.protein_stability
protein_stability(query: str | list[str]) -> dict[str, Any]
Predict protein stability from amino acid sequences.
Notes
This function accesses the SOTA AIDO.Protein-16B-stability-prediction model, which is fine-tuned from AIDO.Protein-16B on a dataset of 55k small protein fragments (41-50 aa) with experimental measurements of resistance to proteolytic degradation (stability). The predicted stability is a float in arbitrary units, where higher values indicate greater resistance to degradation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str \| list[str] | A single protein sequence (string of amino acid tokens) or a list of sequences. Sequences are tokenized at single-amino-acid resolution using the model's fixed vocabulary of canonical amino acids. | required |

Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
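
Because the stability model was fine-tuned on 41-50 aa fragments, it can help to flag inputs that fall outside that range before interpreting predictions. The following is a minimal sketch of such a pre-check. `check_fragment` is a hypothetical helper, and the endpoint itself may well accept other lengths; the check only marks sequences that differ from the training data described above.

```python
# The 20 canonical amino acids, one-letter codes.
CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")


def check_fragment(seq, min_len=41, max_len=50):
    """Return a list of warnings for one candidate sequence (empty if none)."""
    warnings = []
    unknown = sorted(set(seq.upper()) - CANONICAL_AA)
    if unknown:
        warnings.append(f"non-canonical residues: {''.join(unknown)}")
    if not min_len <= len(seq) <= max_len:
        warnings.append(f"length {len(seq)} outside {min_len}-{max_len} aa")
    return warnings


ok = check_fragment("ACDEFGHIKLMNPQRSTVWY" * 2 + "ACDEF")  # 45 residues
bad = check_fragment("ACDXB")
```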
Cell
genbio.toolkit.aido_models_apis.cell_embedding_small
cell_embedding_small(h5ad_path: str, pooling: Literal['mean', 'none'] = 'mean', do_cell_average: bool = False, pooling_dim: int | None = None) -> dict[str, Any]
Compute cell embeddings from single-cell RNA-seq data.
Notes
This function accesses the SOTA AIDO.Cell-3M model, a scRNA-seq count bidirectional transformer encoder (BERT) model
trained on 50 million cells from over 100 tissue types (963 billion gene tokens).
The model uses an auto-discretization strategy for encoding continuous gene expression values.
The model operates on the human transcriptome as input (up to 19,264 HGNC symbols, see tool aido_gene_list), learning
a representation of cell and gene states from the transcriptional context.
The rich contextual representations are SOTA for various downstream
tasks such as embedding-based similarity search, clustering, and training downstream models.
No task head is applied; this endpoint exposes backbone embedding inference only.
Returns either pooled cell-level embeddings when pooling is "mean",
or gene-level embeddings when pooling is "none".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| h5ad_path | str | Path to an h5ad file containing single-cell gene expression data. The file should contain a cell-by-gene expression matrix compatible with AnnData format, representing the transcriptome of one or more cells. | required |
| pooling | Literal['mean', 'none'] | Strategy to aggregate gene-level representations into a cell-level embedding. Options: "mean": mean pooling over gene tokens (default); "none": return gene-level embeddings without pooling. | 'mean' |
| do_cell_average | bool | If True and multiple cells are present, average all cells before embedding, producing a single embedding vector. If False (default), returns one embedding per cell. | False |
| pooling_dim | int \| None | Dimension along which to pool when pooling="mean". Default is None (server defaults to 1, pooling over the sequence/gene dimension). Allowed range: -2 to 2. Only applicable when pooling is "mean"; ignored otherwise. | None |

Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
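
The effect of do_cell_average can be pictured as collapsing the cell-by-gene matrix into one averaged profile before it is embedded. The sketch below assumes a simple arithmetic mean over cells. That is an assumption made for illustration (the server-side averaging is not specified here), and `average_cells` is a hypothetical helper.

```python
def average_cells(matrix):
    """Average a [n_cells, n_genes] nested list into one [n_genes] profile.

    Illustration only: assumes a plain arithmetic mean across cells.
    """
    n_cells = len(matrix)
    # zip(*matrix) iterates over genes (columns).
    return [sum(column) / n_cells for column in zip(*matrix)]


# Toy expression matrix: 2 cells x 3 genes (made-up values).
counts = [[0.0, 2.0, 4.0], [2.0, 2.0, 0.0]]
pseudo_cell = average_cells(counts)
```

With do_cell_average=False the two cells would yield two embeddings; with True, a single embedding is produced for the whole file.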
genbio.toolkit.aido_models_apis.cell_embedding_medium
cell_embedding_medium(h5ad_path: str, pooling: Literal['mean', 'none'] = 'mean', do_cell_average: bool = False, pooling_dim: int | None = None) -> dict[str, Any]
Compute cell embeddings from single-cell RNA-seq data.
Notes
This function accesses the SOTA AIDO.Cell-10M model, a scRNA-seq count bidirectional transformer encoder (BERT) model
trained on 50 million cells from over 100 tissue types (963 billion gene tokens).
The model uses an auto-discretization strategy for encoding continuous gene expression values.
The model operates on the human transcriptome as input (up to 19,264 HGNC symbols, see tool aido_gene_list), learning
a representation of cell and gene states from the transcriptional context.
The rich contextual representations are SOTA for various downstream
tasks such as embedding-based similarity search, clustering, and training downstream models.
No task head is applied; this endpoint exposes backbone embedding inference only.
Returns either pooled cell-level embeddings when pooling is "mean",
or gene-level embeddings when pooling is "none".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| h5ad_path | str | Path to an h5ad file containing single-cell gene expression data. The file should contain a cell-by-gene expression matrix compatible with AnnData format, representing the transcriptome of one or more cells. | required |
| pooling | Literal['mean', 'none'] | Strategy to aggregate gene-level representations into a cell-level embedding. Options: "mean": mean pooling over gene tokens (default); "none": return gene-level embeddings without pooling. | 'mean' |
| do_cell_average | bool | If True and multiple cells are present, average all cells before embedding, producing a single embedding vector. If False (default), returns one embedding per cell. | False |
| pooling_dim | int \| None | Dimension along which to pool when pooling="mean". Default is None (server defaults to 1, pooling over the sequence/gene dimension). Allowed range: -2 to 2. Only applicable when pooling is "mean"; ignored otherwise. | None |

Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
genbio.toolkit.aido_models_apis.cell_embedding_large
cell_embedding_large(h5ad_path: str, pooling: Literal['mean', 'none'] = 'mean', do_cell_average: bool = False, pooling_dim: int | None = None) -> dict[str, Any]
Compute cell embeddings from single-cell RNA-seq data.
Notes
This function accesses the SOTA AIDO.Cell-100M model, a scRNA-seq count bidirectional transformer encoder (BERT) model
trained on 50 million cells from over 100 tissue types (963 billion gene tokens).
The model uses an auto-discretization strategy for encoding continuous gene expression values.
The model operates on the human transcriptome as input (up to 19,264 HGNC symbols, see tool aido_gene_list), learning
a representation of cell and gene states from the transcriptional context.
The rich contextual representations are SOTA for various downstream
tasks such as embedding-based similarity search, clustering, and training downstream models.
No task head is applied; this endpoint exposes backbone embedding inference only.
Returns either pooled cell-level embeddings when pooling is "mean",
or gene-level embeddings when pooling is "none".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| h5ad_path | str | Path to an h5ad file containing single-cell gene expression data. The file should contain a cell-by-gene expression matrix compatible with AnnData format, representing the transcriptome of one or more cells. | required |
| pooling | Literal['mean', 'none'] | Strategy to aggregate gene-level representations into a cell-level embedding. Options: "mean": mean pooling over gene tokens (default); "none": return gene-level embeddings without pooling. | 'mean' |
| do_cell_average | bool | If True and multiple cells are present, average all cells before embedding, producing a single embedding vector. If False (default), returns one embedding per cell. | False |
| pooling_dim | int \| None | Dimension along which to pool when pooling="mean". Default is None (server defaults to 1, pooling over the sequence/gene dimension). Allowed range: -2 to 2. Only applicable when pooling is "mean"; ignored otherwise. | None |

Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
genbio.toolkit.aido_models_apis.cell_type_annotation
cell_type_annotation(h5ad_path: str, tissue: str, return_probs: bool = False) -> dict[str, Any]
Predict cell types from single-cell RNA-seq data using tissue-specific models.
Notes
This function accesses the AIDO.CellType-Query model, which uses tissue-specific pretrained classification models to predict cell type labels for each cell in the input dataset based on single-cell RNA-seq expression patterns. The model checkpoint is selected based on the provided tissue name (e.g., "Kidney"). The function automatically realigns input genes to the AIDO gene index (19,264 genes) if needed. Available tissues include: Bladder, Blood, Bone_Marrow, Ear, Eye, Fat, Heart, Kidney, Large_Intestine, Liver, Lung, Lymph_Node, Mammary, Muscle, Ovary, Pancreas, Prostate, Salivary_Gland, Skin, Small_Intestine, Spleen, Stomach, Testis, Thymus, Tongue, Trachea, Uterus, and Vasculature (call cell_type_annotation_supported_tissues() for more details). Model performance may vary by tissue type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| h5ad_path | str | Path to an h5ad file containing single-cell gene expression data. The file should contain a cell-by-gene expression matrix compatible with AnnData format. | required |
| tissue | str | Tissue name used to select the pretrained model checkpoint (e.g., "Kidney"). Must match one of the available tissue-specific models. See cell_type_annotation_supported_tissues() for the full list of supported tissues. | required |
| return_probs | bool | If True, returns the probabilities for each cell type per cell. If False (default), returns only the predicted cell type label (the cell type with maximum probability). | False |

Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
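
Since the tissue argument must match a supported name exactly, a small client-side normalizer can catch typos before a request is sent. The sketch below hard-codes the tissue names listed in the notes above; in practice the list would come from cell_type_annotation_supported_tissues(). Both `normalize_tissue` and `top_label` are hypothetical helpers, and the `{label: probability}` mapping used for `top_label` is an assumed shape, since the layout of the returned probabilities is not documented here.

```python
SUPPORTED_TISSUES = [
    "Bladder", "Blood", "Bone_Marrow", "Ear", "Eye", "Fat", "Heart",
    "Kidney", "Large_Intestine", "Liver", "Lung", "Lymph_Node", "Mammary",
    "Muscle", "Ovary", "Pancreas", "Prostate", "Salivary_Gland", "Skin",
    "Small_Intestine", "Spleen", "Stomach", "Testis", "Thymus", "Tongue",
    "Trachea", "Uterus", "Vasculature",
]


def normalize_tissue(name, supported=SUPPORTED_TISSUES):
    """Map loose user input such as ' bone marrow ' to 'Bone_Marrow'."""
    key = name.strip().replace(" ", "_").replace("-", "_").lower()
    lookup = {tissue.lower(): tissue for tissue in supported}
    if key not in lookup:
        raise ValueError(f"unsupported tissue: {name!r}")
    return lookup[key]


def top_label(probabilities):
    """Pick the label with maximum probability (assumed {label: prob} dict)."""
    return max(probabilities, key=probabilities.get)


tissue = normalize_tissue(" bone marrow ")
label = top_label({"T cell": 0.7, "B cell": 0.3})
```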
genbio.toolkit.aido_models_apis.cell_type_annotation_supported_tissues
cell_type_annotation_supported_tissues() -> list[str]
Retrieve the list of supported tissue types for cell type annotation.
Notes
These tissue names are used to select the appropriate pretrained model checkpoint for cell type classification. Model availability and performance may vary across different tissue types. All tissues should have model support, but some may be more comprehensive than others.
Returns:
| Type | Description |
|---|---|
| list[str] | A list of tissue names (strings) that are supported by the AIDO.CellType-Query model. |
genbio.toolkit.aido_models_apis.cell_age_predictor
cell_age_predictor(h5ad_path: str) -> dict[str, Any]
Predict biological age from single-cell RNA-seq data.
Notes
This function accesses the AIDO.AgePredictor model, a transcriptomic clock based on the Cell Perceiver architecture fine-tuned for age regression. The model was derived from a pretrained Cell Perceiver and fine-tuned on CellXGene data with experimentally measured donor ages. The model operates on raw (un-normalized) scRNA-seq counts from the human transcriptome (20,062 genes) and predicts the biological age of the sampled tissue or donor. The model returns both normalized predictions (z-scores relative to the training set distribution) and denormalized age predictions in years. The predicted age is suitable as an estimate of biological age and as a reasonable proxy for overall cellular stress and disease.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| h5ad_path | str | Path to an h5ad file containing single-cell gene expression data. The file should contain a cell-by-gene expression matrix with raw (un-normalized) scRNA-seq counts. The model expects 20,062 genes; missing genes will be imputed with a mask value, and extra genes will be ignored. Most genes overlap with the HGNC gene set (see tool | required |

Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
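
The gene handling described for h5ad_path (missing genes imputed with a mask value, extra genes ignored) amounts to aligning each cell to a fixed vocabulary. The sketch below shows that alignment on a toy vocabulary. It is an illustration of the idea, not the model's preprocessing code: `align_counts` is a hypothetical helper, the gene names are placeholders, and `-1.0` stands in for the real mask value, which is not stated here.

```python
def align_counts(cell_counts, vocab, mask_value=-1.0):
    """Order one cell's {gene: count} mapping to match `vocab`.

    Genes absent from the cell receive `mask_value` (placeholder value);
    genes that are not in `vocab` are dropped.
    """
    return [cell_counts.get(gene, mask_value) for gene in vocab]


toy_vocab = ["GENE_A", "GENE_B", "GENE_C"]  # placeholder gene names
cell = {"GENE_C": 5.0, "GENE_A": 1.0, "GENE_X": 9.0}  # GENE_X is not in the vocabulary
aligned = align_counts(cell, toy_vocab)
```

According to the notes above this step happens automatically, so the sketch is mainly useful for estimating how many genes in a dataset would be masked.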
genbio.toolkit.aido_models_apis.embedding_gene_vocab
embedding_gene_vocab() -> list[str]
Retrieve the ordered list of genes supported by the cell_embedding_small, cell_embedding_medium, cell_embedding_large, tissue_embedding_small, and tissue_embedding_large models.
Returns:
| Type | Description |
|---|---|
| list[str] | A list of HGNC gene symbols (strings) that are recognized by the cell and tissue embedding models, in the order they are returned in model outputs. |
genbio.toolkit.aido_models_apis.age_predictor_gene_vocab
age_predictor_gene_vocab() -> list[str]
Retrieve the gene vocabulary for the age predictor model.
Notes
This function returns the list of genes (approximately 20,062 genes) that the AIDO.AgePredictor model expects in the input. The model automatically imputes missing genes and ignores extra genes during preprocessing.
Returns:
| Type | Description |
|---|---|
| list[str] | A list of gene names (strings, mix of HGNC symbols and Ensembl IDs) that are used by the age predictor model, in the order expected by the model. |
DNA
genbio.toolkit.aido_models_apis.dna_embedding_small
dna_embedding_small(sequences: list[str], pooling: Literal['mean', 'max', 'none'] = 'mean') -> dict[str, Any]
Compute DNA sequence embeddings from nucleotide sequences.
Notes
This function accesses the AIDO.DNA-300M model, a DNA foundation model based on
the bidirectional transformer encoder (BERT) architecture trained via masked language
modeling on 10.6 billion nucleotides from 796 species. The model operates on DNA
sequences with single-nucleotide tokenization (A, T, C, G, N), producing rich contextual
representations for embedding-based similarity search, clustering, and training downstream models.
No task head is applied; this endpoint exposes backbone
embedding inference only. It can process up to 4000 nucleotides per sequence and returns
either a pooled sequence-level embedding when pooling is "mean" or "max",
or nucleotide-level embeddings when pooling is "none".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequences | list[str] | A list of DNA sequences (strings of nucleotide tokens). Sequences are tokenized at single-nucleotide resolution using the vocabulary: A, T, C, G, N, where N denotes uncertain elements. | required |
| pooling | Literal['mean', 'max', 'none'] | Strategy to aggregate nucleotide-level representations into a sequence-level embedding. Options: "mean": mean pooling over sequence tokens (default); "max": max pooling; "none": return nucleotide-level embeddings without pooling. | 'mean' |

Returns: A dictionary with the following fields:

- "model_name": The identifier of the model used ("AIDO.DNA-300M").
- "return_code": Integer status code (0 indicates success).
- "output": A dictionary containing:
    - "shape": A list specifying the tensor shape. For pooled outputs ("mean" or "max"), this is [N, 1024] where N is the number of input sequences and 1024 is the model hidden size. For unpooled output ("none"), shape is [N, L+2, 1024] where L is the sequence length (padded to the maximum in the batch), and the +2 accounts for prepended CLS and appended EOS tokens.
    - "values": A nested list of floats containing the computed embeddings.
- "parameters": A dictionary with metadata including:
    - "model_id": The model identifier.
    - "pooling": The pooling strategy used.
    - "sequence_count": Number of sequences processed.
    - "sequence_lengths": List of lengths for each input sequence.
- "error": None if successful, otherwise contains error information.
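
A caller will typically check return_code and confirm that the declared shape agrees with the nested values before using the embeddings. The sketch below does this against a hand-built mock response that follows the field layout described above; no request is made. `nested_shape` and `unpack_embeddings` are hypothetical helpers, and the mock uses a hidden size of 2 instead of 1024 to stay readable.

```python
def nested_shape(values):
    """Infer the shape of a regular nested list, e.g. [[1, 2], [3, 4]] -> [2, 2]."""
    shape = []
    while isinstance(values, list):
        shape.append(len(values))
        values = values[0] if values else None
    return shape


def unpack_embeddings(response):
    """Return response['output']['values'] after basic sanity checks."""
    if response["return_code"] != 0:
        raise RuntimeError(f"inference failed: {response.get('error')}")
    output = response["output"]
    if nested_shape(output["values"]) != list(output["shape"]):
        raise ValueError("declared shape does not match values")
    return output["values"]


# Mock response (made-up numbers, hidden size 2 for brevity).
mock = {
    "model_name": "AIDO.DNA-300M",
    "return_code": 0,
    "output": {"shape": [2, 2], "values": [[0.1, 0.2], [0.3, 0.4]]},
    "parameters": {"pooling": "mean", "sequence_count": 2, "sequence_lengths": [4, 6]},
    "error": None,
}
embeddings = unpack_embeddings(mock)
```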
genbio.toolkit.aido_models_apis.dna_embedding_large
dna_embedding_large(sequences: list[str], pooling: Literal['mean', 'max', 'none'] = 'mean') -> dict[str, Any]
Compute DNA sequence embeddings from nucleotide sequences.
Notes
This function accesses the AIDO.DNA-7B model, a DNA foundation model based on
the bidirectional transformer encoder (BERT) architecture trained via masked language
modeling on 10.6 billion nucleotides from 796 species. The model operates on DNA
sequences with single-nucleotide tokenization (A, T, C, G, N), producing rich contextual
representations for embedding-based similarity search, clustering, and training downstream models.
No task head is applied; this endpoint exposes backbone
embedding inference only. It can process up to 4000 nucleotides per sequence and returns
either a pooled sequence-level embedding when pooling is "mean" or "max",
or nucleotide-level embeddings when pooling is "none".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequences | list[str] | A list of DNA sequences (strings of nucleotide tokens). Sequences are tokenized at single-nucleotide resolution using the vocabulary: A, T, C, G, N, where N denotes uncertain elements. | required |
| pooling | Literal['mean', 'max', 'none'] | Strategy to aggregate nucleotide-level representations into a sequence-level embedding. Options: "mean": mean pooling over sequence tokens (default); "max": max pooling; "none": return nucleotide-level embeddings without pooling. | 'mean' |

Returns: A dictionary with the following fields:

- "model_name": The identifier of the model used ("AIDO.DNA-7B").
- "return_code": Integer status code (0 indicates success).
- "output": A dictionary containing:
    - "shape": A list specifying the tensor shape. For pooled outputs ("mean" or "max"), this is [N, 4352] where N is the number of input sequences and 4352 is the model hidden size. For unpooled output ("none"), shape is [N, L+2, 4352] where L is the sequence length (padded to the maximum in the batch), and the +2 accounts for prepended CLS and appended EOS tokens.
    - "values": A nested list of floats containing the computed embeddings.
- "parameters": A dictionary with metadata including:
    - "model_id": The model identifier.
    - "pooling": The pooling strategy used.
    - "sequence_count": Number of sequences processed.
    - "sequence_lengths": List of lengths for each input sequence.
- "error": None if successful, otherwise contains error information.
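
With pooling="none", each sequence comes back with a CLS row in front, an EOS row after it, and padding up to the longest sequence in the batch. The sketch below recovers per-nucleotide rows using the reported sequence_lengths. It assumes that the nucleotides of sequence i occupy positions 1 through sequence_lengths[i], which follows from the description above but is not confirmed for padded sequences; `strip_special_tokens` is a hypothetical helper.

```python
def strip_special_tokens(values, sequence_lengths):
    """Keep only per-nucleotide rows: positions 1..length for each sequence.

    Assumes position 0 is CLS and the sequence tokens follow immediately.
    """
    return [rows[1 : 1 + length] for rows, length in zip(values, sequence_lengths)]


# Toy batch: 2 sequences (lengths 3 and 1), hidden size 1, padded to L+2 = 5 rows.
values = [
    [[0.0], [1.0], [2.0], [3.0], [9.0]],  # CLS, 3 nucleotides, EOS
    [[0.0], [7.0], [9.0], [0.0], [0.0]],  # CLS, 1 nucleotide, EOS, padding
]
per_nucleotide = strip_special_tokens(values, [3, 1])
```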
genbio.toolkit.aido_models_apis.dna_v3_embeddings
dna_v3_embeddings(sequence: str)
Generate embeddings from a DNA sequence using AIDO.DNA3-AG-524K.
Returns backbone representations suitable for similarity search, clustering, or training downstream models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | str | A single DNA sequence. Tokenized at single-nucleotide resolution using A, T, C, G, N. | required |

Returns:
| Type | Description |
|---|---|
| | Dictionary with keys: |
Tissue
genbio.toolkit.aido_models_apis.tissue_embedding_small
tissue_embedding_small(h5ad_path: str, pooling: Literal['mean', 'max', 'first_token', 'all', 'none'] = 'mean', neighbor_num: int = 8) -> dict[str, Any]
Compute spatially-aware tissue embeddings from spatially resolved single-cell RNA-seq data.
Notes
This function accesses the SOTA AIDO.Tissue-3M model spatial endpoint, a bidirectional transformer encoder
trained on spatially resolved single-cell RNA-seq data
(76 slides with 22M cells from Vizgen, Nanostring, and 10xGenomics).
The model incorporates spatial cell information by retrieving K nearest neighbor cells
for each center cell, concatenating the center cell and neighbor cell expression
vectors as input with 2D rotary positional embeddings where
the first dimension represents gene index and the second represents cell index.
The model operates on the human transcriptome as input (up to 19,264 HGNC symbols, see tool aido_gene_list), learning a
spatially-aware representation of the center cell. The rich contextual representations are
SOTA for downstream tasks such as embedding-based similarity
search, clustering, and training downstream models for niche and density prediction.
No task head is applied; this endpoint exposes backbone
embedding inference only.
CRITICAL: Input h5ad files MUST contain spatial coordinates in adata.obs with columns "x" and "y".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| h5ad_path | str | Path to an h5ad file containing spatially resolved single-cell gene expression data. The file MUST contain a cell-by-gene expression matrix in adata.X with spatial coordinate information in adata.obs.x and adata.obs.y columns (required for spatial context). | required |
| pooling | Literal['mean', 'max', 'first_token', 'all', 'none'] | Strategy to aggregate hidden-state representations. Options: "mean": mean pooling across sequence tokens (default) → [n_cells, hidden_dim]; "max": max pooling across sequence tokens → [n_cells, hidden_dim]; "first_token": use first token only → [n_cells, hidden_dim]; "all": return all sequence tokens → [n_cells, seq_len, hidden_dim]; "none": return gene-level embeddings without pooling. | 'mean' |
| neighbor_num | int | Number of spatial neighbors to include for each center cell. Must be non-negative. Default is 8, which retrieves 8 nearest neighbor cells based on spatial coordinates for spatial context modeling. | 8 |

Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
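
The neighbor_num parameter controls how many spatially nearest cells are attached to each center cell. The sketch below shows one way such a lookup can be done from the x and y coordinates, using brute-force Euclidean distance. Both the distance metric and the `nearest_neighbors` helper are assumptions for illustration; the server's actual neighbor search is not described here.

```python
import math


def nearest_neighbors(coords, k=8):
    """For each cell, return the indices of its k spatially nearest other cells."""
    neighbors = []
    for i, center in enumerate(coords):
        others = [j for j in range(len(coords)) if j != i]
        # Sort the remaining cells by Euclidean distance to the center cell.
        others.sort(key=lambda j: math.dist(center, coords[j]))
        neighbors.append(others[:k])
    return neighbors


# Toy (x, y) coordinates for 4 cells.
coords = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (0.0, 2.0)]
knn = nearest_neighbors(coords, k=2)
```

Each center cell's expression vector is then combined with those of its neighbors, which is why the x and y columns are mandatory.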
genbio.toolkit.aido_models_apis.tissue_embedding_large
tissue_embedding_large(h5ad_path: str, pooling: Literal['mean', 'max', 'first_token', 'all', 'none'] = 'mean', neighbor_num: int = 8) -> dict[str, Any]
Compute spatially-aware tissue embeddings from spatially resolved single-cell RNA-seq data.
Notes
This function accesses the SOTA AIDO.Tissue-60M model spatial endpoint, a bidirectional transformer encoder
trained on spatially resolved single-cell RNA-seq data
(76 slides with 22M cells from Vizgen, Nanostring, and 10xGenomics).
The model incorporates spatial cell information by retrieving K nearest neighbor cells
for each center cell, concatenating the center cell and neighbor cell expression
vectors as input with 2D rotary positional embeddings where
the first dimension represents gene index and the second represents cell index.
The model operates on the human transcriptome as input (up to 19,264 HGNC symbols, see tool aido_gene_list), learning a
spatially-aware representation of the center cell. The rich contextual representations are
SOTA for downstream tasks such as embedding-based similarity
search, clustering, and training downstream models for niche and density prediction.
No task head is applied; this endpoint exposes backbone
embedding inference only.
CRITICAL: Input h5ad files MUST contain spatial coordinates in adata.obs with columns "x" and "y".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| h5ad_path | str | Path to an h5ad file containing spatially resolved single-cell gene expression data. The file MUST contain a cell-by-gene expression matrix in adata.X with spatial coordinate information in adata.obs.x and adata.obs.y columns (required for spatial context). | required |
| pooling | Literal['mean', 'max', 'first_token', 'all', 'none'] | Strategy to aggregate hidden-state representations. Options: "mean": mean pooling across sequence tokens (default) → [n_cells, hidden_dim]; "max": max pooling across sequence tokens → [n_cells, hidden_dim]; "first_token": use first token only → [n_cells, hidden_dim]; "all": return all sequence tokens → [n_cells, seq_len, hidden_dim]; "none": return gene-level embeddings without pooling. | 'mean' |
| neighbor_num | int | Number of spatial neighbors to include for each center cell. Must be non-negative. Default is 8, which retrieves 8 nearest neighbor cells based on spatial coordinates for spatial context modeling. | 8 |

Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
RNA
genbio.toolkit.aido_models_apis.ncrna_embedding
ncrna_embedding(sequences: list[str], pooling: Literal['mean', 'max', 'none'] = 'mean') -> dict[str, Any]
Compute non-coding and regulatory RNA sequence embeddings from nucleotide sequences.
Notes
This function accesses the SOTA AIDO.RNA-1.6B model, a bidirectional encoder-only transformer with 1.6 billion parameters
trained via masked language modeling on 42 million non-coding RNA sequences from RNAcentral. The model operates
on RNA sequences with single-nucleotide tokenization (A, U, C, G), producing rich contextual
representations that achieve state-of-the-art performance.
The representations are suitable for
embedding-based similarity search, clustering, and training downstream models such as
secondary structure prediction, inverse folding, and function classification.
No task head is applied; this endpoint exposes backbone
embedding inference only. Returns either pooled sequence-level embeddings when pooling is "mean" or "max",
or nucleotide-level embeddings when pooling is "none".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequences | list[str] | A list of RNA sequences (strings of nucleotide tokens). Sequences are tokenized at single-nucleotide resolution using the vocabulary: A, U, C, G. | required |
| pooling | Literal['mean', 'max', 'none'] | Strategy to aggregate nucleotide-level representations into a sequence-level embedding. Options: "mean": mean pooling over sequence tokens (default); "max": max pooling; "none": return nucleotide-level embeddings without pooling. | 'mean' |

Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
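
Because the RNA vocabulary is A, U, C, G, sequences stored in DNA notation (with T) may need to be converted before they are submitted. The sketch below is a minimal client-side conversion and check. Whether the endpoint itself tolerates T or lowercase input is not stated here, so this is a defensive assumption, and `to_rna` is a hypothetical helper.

```python
def to_rna(sequence):
    """Uppercase a nucleotide string and replace T with U.

    Raises ValueError if characters outside A, U, C, G remain.
    """
    rna = sequence.strip().upper().replace("T", "U")
    invalid = sorted(set(rna) - set("AUCG"))
    if invalid:
        raise ValueError(f"unsupported characters: {''.join(invalid)}")
    return rna


rna = to_rna("atgGCTtaa")
```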
genbio.toolkit.aido_models_apis.mrna_embedding
mrna_embedding(sequences: list[str], pooling: Literal['mean', 'max', 'none'] = 'mean') -> dict[str, Any]
Compute coding sequence (mRNA/CDS) embeddings from RNA nucleotide sequences.
Notes
This function accesses the AIDO.RNA-1.6B-CDS model, a domain-adapted version of the SOTA
AIDO.RNA-1.6B bidirectional encoder-only transformer trained on 9 million coding sequences.
The model continues pre-training from AIDO.RNA-1.6B on coding sequence data,
specializing it for mRNA and coding DNA sequence tasks.
The model operates on RNA sequences with single-nucleotide tokenization (A, U, C, G), producing
rich contextual representations optimized for coding sequences. The representations are suitable for
embedding-based similarity search, clustering, and training downstream models for tasks such as
translation efficiency prediction, protein abundance prediction, and codon optimization.
No task head is applied; this endpoint exposes backbone
embedding inference only. Returns either pooled sequence-level embeddings when pooling is "mean" or "max",
or nucleotide-level embeddings when pooling is "none".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequences | list[str] | A list of RNA coding sequences (strings of nucleotide tokens). Sequences are tokenized at single-nucleotide resolution using the vocabulary: A, U, C, G. | required |
| pooling | Literal['mean', 'max', 'none'] | Strategy to aggregate nucleotide-level representations into a sequence-level embedding. Options: "mean": mean pooling over sequence tokens (default); "max": max pooling; "none": return nucleotide-level embeddings without pooling. | 'mean' |

Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
genbio.toolkit.aido_models_apis.rna_translation_efficiency_muscle
rna_translation_efficiency_muscle(sequences: list[str], pooling: Literal['mean', 'max', 'none'] = 'mean') -> dict[str, Any]
Predict translation efficiency in muscle tissue from mRNA 5' UTR sequences.
Notes
This function accesses the AIDO.RNA-1.6B-translation-efficiency-muscle model, which is fine-tuned from the SOTA AIDO.RNA-1.6B non-coding RNA sequence model on an endogenous human 5' UTR dataset of 1,260 sequences, each 100 bp long, labeled with translation efficiency (the ratio of Ribo-seq to RNA-seq RPKM values). Predictions are normalized to arbitrary units where higher values indicate more efficient translation. This model is specialized with observational data from human muscle tissue.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequences | list[str] | A list of 5' UTR sequences up to 100bp (strings of RNA nucleotide tokens). Sequences are tokenized at single-nucleotide resolution using the vocabulary: A, U, C, G. | required |
| pooling | Literal['mean', 'max', 'none'] | Strategy to aggregate nucleotide-level representations. This parameter is passed to the model but translation efficiency prediction always returns a single scalar value per sequence. Options: "mean": mean pooling over sequence tokens (default); "max": max pooling; "none": no pooling (not recommended for this task). | 'mean' |

Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
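
The training label behind this model is translation efficiency, the ratio of Ribo-seq RPKM to RNA-seq RPKM. The sketch below computes that raw ratio for made-up numbers to show what the quantity measures. It does not reproduce the normalization applied to the model's predictions, which is not specified here, and `translation_efficiency` is a hypothetical helper.

```python
def translation_efficiency(ribo_rpkm, rna_rpkm):
    """Raw translation efficiency: Ribo-seq RPKM divided by RNA-seq RPKM."""
    if rna_rpkm <= 0:
        raise ValueError("RNA-seq RPKM must be positive")
    return ribo_rpkm / rna_rpkm


# Made-up measurements for a single transcript.
te = translation_efficiency(ribo_rpkm=30.0, rna_rpkm=12.0)
```

A transcript with more ribosome footprints per unit of mRNA gets a higher ratio, which matches the direction of the predicted score (higher means more efficient translation).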
genbio.toolkit.aido_models_apis.rna_protein_abundance_hsapiens
rna_protein_abundance_hsapiens(sequences: list[str], pooling: Literal['mean', 'max', 'none'] = 'mean') -> dict[str, Any]
Predict protein abundance from mRNA coding sequences in human cells.
Notes
This function accesses the AIDO.RNA-1.6B-CDS-protein-abundance-hsapiens model, which is fine-tuned from the SOTA AIDO.RNA-1.6B-CDS coding sequence model on a dataset of 11.8k human mRNA coding sequences (CDS) with lengths between 156 and 2048 bp and experimentally measured protein abundance from PAXdb, mainly consisting of mass spectrometry-based quantifications. The model predicts the steady-state protein abundance that would result from a given coding sequence, capturing the effects of mRNA stability, ribosome throughput, and other factors that influence protein expression in human cells. Predictions are in arbitrary units where higher values indicate greater protein abundance. This model is specialized for human (Homo sapiens) cells.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequences | list[str] | A list of mRNA coding sequences up to 2048 nucleotides in length (strings of RNA nucleotide tokens). Sequences are tokenized at single-nucleotide resolution using the vocabulary: A, U, C, G. | required |
| pooling | Literal['mean', 'max', 'none'] | Strategy to aggregate nucleotide-level representations. This parameter is passed to the model, but protein abundance prediction always returns a single scalar value per sequence. Options include: - "mean": mean pooling over sequence tokens (default), - "max": max pooling, - "none": no pooling (not recommended for this task). | 'mean' |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
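The input constraints described above (vocabulary A, U, C, G and at most 2048 nucleotides) can be checked before sending a request. The following is a minimal sketch, not part of the toolkit: `prepare_cds` is a hypothetical helper, and the uppercasing and T-to-U conversion it performs are assumptions about convenient client-side normalization, not documented endpoint behavior.

```python
ALLOWED = frozenset("AUCG")
MAX_CDS_LENGTH = 2048  # nucleotides, per the parameter description above


def prepare_cds(sequence: str) -> str:
    """Normalize one coding sequence and check it against the documented limits.

    Hypothetical helper: uppercasing and T->U conversion are assumptions made
    for convenience; the endpoint itself may or may not do either.
    """
    rna = sequence.strip().upper().replace("T", "U")
    if not rna:
        raise ValueError("empty sequence")
    if len(rna) > MAX_CDS_LENGTH:
        raise ValueError(f"sequence has {len(rna)} nt; limit is {MAX_CDS_LENGTH}")
    unexpected = set(rna) - ALLOWED
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return rna


sequences = [prepare_cds(s) for s in ["atggccTTTaaa", "AUGGCGUAA"]]

# With the toolkit installed and AIDO_INFERENCE_ENGINE_URL configured,
# the cleaned list could then be passed on:
# from genbio.toolkit.aido_models_apis import rna_protein_abundance_hsapiens
# result = rna_protein_abundance_hsapiens(sequences)
```

Rejecting bad input locally gives a clearer error message than a failed request, and keeps the list passed to the endpoint in a single consistent alphabet.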
genbio.toolkit.aido_models_apis.gsrna_activity_query ¶
gsrna_activity_query(sequences: list[str]) -> dict[str, Any]
Predict guide RNA (gsRNA) activity scores from RNA sequences.
Notes
This function accesses the AIDO.gsRNA-Activity-Query model, which predicts activity scores for guide RNA sequences. The model is trained to predict the effectiveness of guide RNAs for gene editing applications. Each activity score is the average of the predictions from a 5-fold ensemble, which improves reliability. Requirement: each input sequence must be exactly 21 nucleotides long and contain only the characters A, C, G, and T (case-insensitive).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequences | list[str] | List of RNA sequences to score. Each sequence must be exactly 21 nucleotides long and contain only the characters A, C, G, and T (case-insensitive; automatically converted to uppercase). Empty sequences are not allowed. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
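Because the length and alphabet requirement is strict, it can help to validate guides before calling the endpoint. The sketch below is illustrative only: `check_guides` is a hypothetical helper, not a toolkit function, and it simply applies the rule stated above (exactly 21 characters from A, C, G, T, case-insensitive).

```python
import re

# Exactly 21 characters, each one of A, C, G, T (checked after uppercasing).
GUIDE_PATTERN = re.compile(r"[ACGT]{21}")


def check_guides(sequences: list[str]) -> list[str]:
    """Return uppercased guides, raising ValueError on the first invalid one."""
    checked = []
    for index, seq in enumerate(sequences):
        upper = seq.upper()
        if not GUIDE_PATTERN.fullmatch(upper):
            raise ValueError(
                f"sequence {index} ({seq!r}) must be exactly 21 characters "
                "from A, C, G, T"
            )
        checked.append(upper)
    return checked


guides = check_guides(["acgtacgtacgtacgtacgta", "GGGGCCCCAAAATTTTGGGGC"])

# With the toolkit installed and the Inference Engine reachable:
# from genbio.toolkit.aido_models_apis import gsrna_activity_query
# result = gsrna_activity_query(guides)
```

Note that a guide written with U instead of T fails this check, which matches the documented alphabet.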
Perturbation & Interactome¶
genbio.toolkit.aido_models_apis.perturbation_effect_query ¶
perturbation_effect_query(h5ad_path: str, cell_line: Literal['H1', 'Hep-G2', 'Jurkat', 'K-562', 'RPE1'], query_obs_col: str = 'gene', query_control: str = 'ctrl', query_condition: str = 'cond_A', ref_input_type: Literal['raw', 'delta'] = 'delta', metric: Literal['cosine', 'euclidean', 'correlation', 'spearman'] = 'correlation', target_sum: float = 10000.0, top_k: int = 10) -> dict[str, Any]
Query perturbation effect database for similar genetic perturbations.
Notes
This function accesses the AIDO.Perturbation-Query model, which searches a reference database of genetic perturbation effects to find perturbations with similar transcriptomic signatures. The model compares the user-provided query data (control vs condition) against a pre-computed reference database and returns the most similar perturbations ranked by distance score. Currently, only the H1 cell line is fully supported with comprehensive reference data. The other cell lines (Hep-G2, Jurkat, K-562, RPE1) may have limited reference data, and queries against them can fail if the reference data does not include them. The query data should contain both control and perturbed cells, with labels in the obs dataframe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| h5ad_path | str | Path to an h5ad file containing single-cell RNA-seq expression data with control and condition/perturbed cells. The file should contain a cell-by-gene expression matrix with labels in the obs dataframe indicating which cells are control vs condition. | required |
| cell_line | Literal['H1', 'Hep-G2', 'Jurkat', 'K-562', 'RPE1'] | Cell line to query against. Must be one of: "H1", "Hep-G2", "Jurkat", "K-562", "RPE1". Currently only "H1" is fully supported with complete reference data. | required |
| query_obs_col | str | Column name in query.obs that contains control vs condition labels. Default is "gene". | 'gene' |
| query_control | str | Label value in query_obs_col that identifies control cells. Default is "ctrl". | 'ctrl' |
| query_condition | str | Label value in query_obs_col that identifies perturbed/condition cells. Default is "cond_A". | 'cond_A' |
| ref_input_type | Literal['raw', 'delta'] | Reference data type: "raw" for raw counts, "delta" for pre-computed deltas. Default is "delta". Note that reference files use the target_gene column (not gene) for perturbation names. | 'delta' |
| metric | Literal['cosine', 'euclidean', 'correlation', 'spearman'] | Distance metric for similarity calculation. Must be one of: "cosine", "euclidean", "correlation", "spearman". Default is "correlation". | 'correlation' |
| target_sum | float | Target sum for normalization (used when ref_input_type="raw"). Default is 10000.0. | 10000.0 |
| top_k | int | Number of top-ranked perturbation matches to return. Default is 10. | 10 |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
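With nine parameters, several of them restricted to fixed choices, it can be useful to assemble and check the arguments in one place before making a call. The sketch below mirrors the signature and defaults listed above; `build_query_kwargs` is a hypothetical helper, and the `top_k >= 1` check is an assumption rather than a documented constraint.

```python
from typing import Any

CELL_LINES = ("H1", "Hep-G2", "Jurkat", "K-562", "RPE1")
METRICS = ("cosine", "euclidean", "correlation", "spearman")
REF_INPUT_TYPES = ("raw", "delta")


def build_query_kwargs(
    h5ad_path: str,
    cell_line: str,
    *,
    query_obs_col: str = "gene",
    query_control: str = "ctrl",
    query_condition: str = "cond_A",
    ref_input_type: str = "delta",
    metric: str = "correlation",
    target_sum: float = 10000.0,
    top_k: int = 10,
) -> dict[str, Any]:
    """Check the enumerated options and return keyword arguments for the query."""
    if cell_line not in CELL_LINES:
        raise ValueError(f"cell_line must be one of {CELL_LINES}")
    if metric not in METRICS:
        raise ValueError(f"metric must be one of {METRICS}")
    if ref_input_type not in REF_INPUT_TYPES:
        raise ValueError(f"ref_input_type must be one of {REF_INPUT_TYPES}")
    if top_k < 1:  # assumption: at least one match must be requested
        raise ValueError("top_k must be at least 1")
    return {
        "h5ad_path": h5ad_path,
        "cell_line": cell_line,
        "query_obs_col": query_obs_col,
        "query_control": query_control,
        "query_condition": query_condition,
        "ref_input_type": ref_input_type,
        "metric": metric,
        "target_sum": target_sum,
        "top_k": top_k,
    }


# "query.h5ad" and the label "TP53" are placeholder values for illustration.
kwargs = build_query_kwargs("query.h5ad", "H1", query_condition="TP53", top_k=5)

# With the toolkit installed and the Inference Engine reachable:
# from genbio.toolkit.aido_models_apis import perturbation_effect_query
# result = perturbation_effect_query(**kwargs)
```

The labels passed as `query_control` and `query_condition` must actually occur in the `query_obs_col` column of the h5ad file; this sketch does not open the file to confirm that.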
genbio.toolkit.aido_models_apis.perturbation_effect_query_cell_lines ¶
perturbation_effect_query_cell_lines() -> list[str]
Retrieve the list of supported cell lines for perturbation effect queries.
Notes
These cell line names are used to select the reference database for perturbation effect similarity searches. Currently, only "H1" is fully supported with comprehensive reference data. Other cell lines may have limited reference data availability.
Returns:
| Type | Description |
|---|---|
| list[str] | A list of cell line names (strings) that are supported by the AIDO.Perturbation-Query model. |
genbio.toolkit.aido_models_apis.interactome_query ¶
interactome_query(gene_symbol: str, n_neighbors: int = 10, metric: Literal['pearson', 'spearman', 'euclidean'] = 'euclidean') -> dict[str, Any]
Query the interactome embeddings for nearest neighbor genes.
Notes
This function accesses the AIDO.Interactome-Query model, which searches pre-computed gene interaction embeddings to find the nearest neighbor genes for a given query gene. The model returns genes with similar interaction patterns based on the specified distance/similarity metric. The query gene itself appears in the results with rank 1 and a score of 0.0 (for the euclidean metric) or 1.0 (for the pearson and spearman metrics). This tool is useful for identifying genes with similar biological functions, interaction partners, or pathway memberships.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| gene_symbol | str | Gene symbol to query (e.g., 'CXCL8', 'CDKN1A'). Must exist in the interactome reference database vocabulary (~18,000 genes). | required |
| n_neighbors | int | Number of nearest neighbors to return (including the query gene itself). Default is 10. | 10 |
| metric | Literal['pearson', 'spearman', 'euclidean'] | Distance/similarity metric for nearest neighbor search. Must be one of: "pearson", "spearman", or "euclidean". Default is "euclidean". Score interpretation varies by metric: - euclidean: lower = closer (query gene has 0.0) - pearson/spearman: higher = more similar (query gene has 1.0) | 'euclidean' |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
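Since the score direction depends on the metric, downstream code that re-sorts or filters neighbors has to account for it. The sketch below shows one way to do that. It assumes the scores have already been pulled out of the returned dictionary into a plain gene-to-score mapping; the field layout of the real response is not shown here, and `rank_neighbors`, the placeholder gene names, and the score values are all made up for illustration.

```python
def rank_neighbors(
    scores: dict[str, float], query_gene: str, metric: str
) -> list[tuple[str, float]]:
    """Order genes from most to least similar, leaving out the query gene."""
    if metric not in ("pearson", "spearman", "euclidean"):
        raise ValueError(f"unsupported metric: {metric!r}")
    # euclidean is a distance (lower = closer);
    # pearson and spearman are similarities (higher = more similar).
    higher_is_closer = metric != "euclidean"
    others = [(gene, score) for gene, score in scores.items() if gene != query_gene]
    return sorted(others, key=lambda pair: pair[1], reverse=higher_is_closer)


# Made-up scores with placeholder neighbor names; the query gene is included,
# as described above, with 0.0 (euclidean) or 1.0 (pearson).
euclidean_scores = {"CXCL8": 0.0, "GENE_A": 1.7, "GENE_B": 0.9}
pearson_scores = {"CXCL8": 1.0, "GENE_A": 0.4, "GENE_B": 0.8}

by_distance = rank_neighbors(euclidean_scores, "CXCL8", "euclidean")
by_similarity = rank_neighbors(pearson_scores, "CXCL8", "pearson")
```

Because the query gene counts toward `n_neighbors`, requesting `n_neighbors=10` leaves nine other genes after the query gene is dropped.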
genbio.toolkit.aido_models_apis.interactome_query_gene_vocab ¶
interactome_query_gene_vocab() -> list[str]
Retrieve the gene vocabulary for interactome queries.
Notes
This function returns the list of gene symbols that are recognized by the AIDO.Interactome-Query model. Query genes must exist in this vocabulary (approximately 18,000 genes).
Returns:
| Type | Description |
|---|---|
| list[str] | A list of HGNC gene symbols (strings) that are supported by the interactome query model, in alphabetical order. |
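The vocabulary can be used to screen gene symbols before querying, so that unknown symbols are reported up front. The following is a small sketch: `split_by_vocab` is a hypothetical helper, the three-gene list stands in for the real vocabulary of roughly 18,000 symbols, and exact, case-sensitive matching is an assumption.

```python
def split_by_vocab(
    symbols: list[str], vocab: list[str]
) -> tuple[list[str], list[str]]:
    """Split symbols into those present in the vocabulary and those missing."""
    known = set(vocab)  # set membership keeps each lookup cheap for ~18,000 genes
    found = [s for s in symbols if s in known]
    missing = [s for s in symbols if s not in known]
    return found, missing


# Stand-in vocabulary for illustration; in practice it would come from:
# from genbio.toolkit.aido_models_apis import interactome_query_gene_vocab
# vocab = interactome_query_gene_vocab()
vocab = ["CDKN1A", "CXCL8", "TP53"]

found, missing = split_by_vocab(["CXCL8", "cdkn1a", "NOT_A_GENE"], vocab)
```

Here the lowercase `cdkn1a` lands in `missing` because the comparison is exact; uppercasing symbols first is a reasonable extra step if inputs come from mixed sources.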