Inference Engine
AIDO foundation model endpoints served by the Inference Engine (AIDO_INFERENCE_ENGINE_URL).
Protein
genbio.toolkit.aido_models_apis.protein_protein_interaction
protein_protein_interaction(seq_a: str, seq_b: str, crop_mode: Literal['head', 'tail', 'center'] = 'head') -> dict[str, Any]
Predict protein-protein interaction, binding sites, and cross-attention mapping.
Notes
This function accesses the AIDO.ProteinProteinInteraction model for predicting protein-protein interactions between two amino acid sequences. The model provides:
- Interaction probability (0-1) and binary label
- Per-residue binding site probabilities for both chains
- Cross-attention matrix between the two sequences
Sequences exceeding the internal crop size (1000 residues) will be cropped according to the specified crop_mode.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seq_a | str | Primary amino acid sequence of protein A. Must contain only valid amino acid characters: ACDEFGHIKLMNPQRSTVWY (and X for unknown). Maximum length 2048. | required |
| seq_b | str | Primary amino acid sequence of protein B. Must contain only valid amino acid characters: ACDEFGHIKLMNPQRSTVWY (and X for unknown). Maximum length 2048. | required |
| crop_mode | Literal['head', 'tail', 'center'] | Crop mode for sequences exceeding the internal crop size (1000 residues). Options: "head", "tail", "center". Default "head". | 'head' |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
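The input constraints in the parameter table can be checked client-side before a request is sent. The following is a minimal sketch, not part of the toolkit: `check_ppi_sequence` and `preview_crop` are hypothetical helpers, and the reading of the crop modes (`head` keeps the first 1000 residues, `tail` the last 1000, `center` the middle 1000) is an assumption, since the page does not define them.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")  # alphabet from the parameter table
MAX_LENGTH = 2048  # documented maximum input length
CROP_SIZE = 1000   # documented internal crop size


def check_ppi_sequence(seq: str) -> str:
    """Return the sequence unchanged, or raise ValueError if it is invalid."""
    if not seq:
        raise ValueError("sequence is empty")
    if len(seq) > MAX_LENGTH:
        raise ValueError(f"sequence has {len(seq)} residues; maximum is {MAX_LENGTH}")
    bad = sorted(set(seq) - VALID_RESIDUES)
    if bad:
        raise ValueError(f"invalid residue characters: {''.join(bad)}")
    return seq


def preview_crop(seq: str, crop_mode: str = "head", size: int = CROP_SIZE) -> str:
    """Assumed crop semantics; the server-side behaviour may differ."""
    if len(seq) <= size:
        return seq
    if crop_mode == "head":
        return seq[:size]
    if crop_mode == "tail":
        return seq[-size:]
    if crop_mode == "center":
        start = (len(seq) - size) // 2
        return seq[start:start + size]
    raise ValueError(f"unknown crop_mode: {crop_mode!r}")


# Hypothetical usage (needs the toolkit and a reachable Inference Engine):
# from genbio.toolkit.aido_models_apis import protein_protein_interaction
# result = protein_protein_interaction(
#     check_ppi_sequence(seq_a), check_ppi_sequence(seq_b), crop_mode="center"
# )
```

Previewing the crop locally shows which residues would fall outside the assumed 1000-residue window, which matters when a known binding site sits near one end of a long chain.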
genbio.toolkit.aido_models_apis.reactome_pathway_query
reactome_pathway_query(sequence: str | list[str]) -> dict[str, Any]
Predict Reactome pathway memberships for protein amino acid sequences.
Notes
This function accesses the AIDO.ReactomePathway-Query model, which predicts which of 1,766 curated biological pathways from the Reactome database a protein is involved in based on its primary sequence. The model first obtains protein embeddings via the AIDO.Protein foundation model, then uses a KNeighborsClassifier trained on 11,660 sequences from the Reactome Physical Entity mapping (UniProt-to-pathway). Only pathways with at least 10 member sequences were included in training. The model was optimized for Macro F1 to handle the significant sparsity of the pathway membership matrix.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | str \| list[str] | A single protein amino acid sequence (string) or a list of sequences (e.g., "MRLPAQ..." or ["SEQ1...", "SEQ2..."]). Each sequence must be <= 2048 characters long. Sequences are tokenized at single-amino-acid resolution using the AIDO.Protein model vocabulary. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
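The notes mention that the classifier was optimized for Macro F1 because pathway membership is sparse. As an illustration only (this is not the model's training code, and the pathway labels are placeholders), the sketch below computes macro and micro F1 for a multi-label problem. Labels with no true or predicted members are given an F1 of 0, which is a convention chosen here.

```python
from __future__ import annotations


def label_counts(y_true: list[set], y_pred: list[set], label: str) -> tuple[int, int, int]:
    """Count true positives, false positives and false negatives for one label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: list[set], y_pred: list[set], labels: list[str]) -> float:
    """Average of per-label F1: every pathway counts equally, however rare."""
    scores = [f1_from_counts(*label_counts(y_true, y_pred, lab)) for lab in labels]
    return sum(scores) / len(scores)


def micro_f1(y_true: list[set], y_pred: list[set], labels: list[str]) -> float:
    """F1 over pooled counts: frequent pathways dominate the score."""
    tp = fp = fn = 0
    for lab in labels:
        a, b, c = label_counts(y_true, y_pred, lab)
        tp, fp, fn = tp + a, fp + b, fn + c
    return f1_from_counts(tp, fp, fn)


# Three proteins; "pathway_b" is rare and is missed by the predictions.
truth = [{"pathway_a"}, {"pathway_a"}, {"pathway_a", "pathway_b"}]
preds = [{"pathway_a"}, {"pathway_a"}, {"pathway_a"}]
labels = ["pathway_a", "pathway_b"]
print(macro_f1(truth, preds, labels), micro_f1(truth, preds, labels))
```

In this toy case the missed rare pathway halves the macro score while the micro score stays high, which is why macro averaging is the stricter target when most pathways have few members.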
DNA
genbio.toolkit.aido_models_apis.dna2_flashzoi_rep1
dna2_flashzoi_rep1(sequence: str, output_type: Literal['tracks', 'embeddings'] = 'tracks', is_human: bool = True, bins_to_return: int = 6144) -> dict[str, Any]
Predict genomic tracks (RNAseq) or generate embeddings from long DNA sequences.
Notes
This function accesses the AIDO.DNA2-470M-Flashzoi-rep1 model, a two-part model consisting of:
1. A long bidirectional transformer backbone trained via masked language modeling on 8.8 trillion nucleotides from 113,379 prokaryotic genomes and 15,032 eukaryotic genomes.
2. A genomic predictor head based on Flashzoi, fine-tuned to predict 7,611 genomic assay tracks from DNA sequence on the ENCODE dataset.

The model operates on DNA sequences with single-nucleotide tokenization (A, T, C, G, N). In "embeddings" mode, the model produces rich contextual representations for embedding-based similarity search, clustering, and training downstream models. In "tracks" mode, the model predicts 7,611 genomic assay tracks including RNA-seq, CAGE, DNase, ATAC-seq, ChIP-seq (transcription factors), and ChIP-seq (histone modifications) in 32 bp bins from 196,608 bp input sequences for human or mouse genomes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | str | A single DNA sequence (string of nucleotide tokens). Sequences must be exactly 196,608 base pairs, and are tokenized at single-nucleotide resolution using the vocabulary: A, T, C, G, N (where N denotes uncertain elements). | required |
| output_type | Literal['tracks', 'embeddings'] | Type of output to generate. Options: "tracks" returns predicted genomic assay tracks (default); "embeddings" returns intermediate embeddings from the model backbone. | 'tracks' |
| is_human | bool | Whether the input sequence is from a human genome (True, default) or mouse genome (False). This determines which species-specific output head is used for prediction. | True |
| bins_to_return | int | Number of output bins to return from the center of the prediction. Default is 6144 bins, which corresponds to 196,608 bp at 32 bp resolution. Set to -1 to return all bins. | 6144 |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary containing: |
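The relationship between input length, bin size, and `bins_to_return` is plain arithmetic: 196,608 bp / 32 bp per bin = 6144 bins. The sketch below works through it. `center_bin_window` and `bin_to_bp` are hypothetical helpers, and the exact way the service rounds an odd remainder when centering is an assumption.

```python
from __future__ import annotations

INPUT_LENGTH = 196_608                 # required sequence length in bp
BIN_SIZE = 32                          # bp per output bin
TOTAL_BINS = INPUT_LENGTH // BIN_SIZE  # 6144


def center_bin_window(bins_to_return: int, total_bins: int = TOTAL_BINS) -> tuple[int, int]:
    """Return (start, stop) bin indices of a centered window; -1 selects all bins."""
    if bins_to_return == -1:
        return 0, total_bins
    if not 0 < bins_to_return <= total_bins:
        raise ValueError(f"bins_to_return must be -1 or in 1..{total_bins}")
    start = (total_bins - bins_to_return) // 2  # assumed rounding
    return start, start + bins_to_return


def bin_to_bp(bin_index: int, bin_size: int = BIN_SIZE) -> tuple[int, int]:
    """Half-open bp interval, relative to the input start, covered by one bin."""
    return bin_index * bin_size, (bin_index + 1) * bin_size


def check_dna_input(sequence: str) -> None:
    """Enforce the documented length and alphabet before sending a request."""
    if len(sequence) != INPUT_LENGTH:
        raise ValueError(f"expected {INPUT_LENGTH} bp, got {len(sequence)}")
    bad = set(sequence) - set("ATCGN")
    if bad:
        raise ValueError(f"invalid nucleotides: {sorted(bad)}")
```

For example, requesting 1024 bins would, under this assumption, cover bins 2560 to 3583, that is, the central 32,768 bp of the input.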
genbio.toolkit.aido_models_apis.dna2_track_search
dna2_track_search(query: str, k: int = 10, track_types: list[str] | None = None) -> pd.DataFrame
Search for genomic tracks by text description to identify relevant assays for DNA sequence analysis.
Notes
This function provides text-based search over 7,611 genomic assay tracks from ENCODE, FANTOM, and GTEx that are predicted by the dna2_flashzoi_rep1 model. It uses semantic search to find tracks matching a text query. This tool is designed to work in conjunction with dna2_flashzoi_rep1: first use this search to identify relevant track indices, then use those indices to filter the 7,611 predictions from dna2_flashzoi_rep1.
The search covers diverse assay types including:
- CAGE: Cap Analysis of Gene Expression (transcription start sites)
- RNA: RNA-seq (gene expression)
- DNASE: DNase-seq (open chromatin)
- ATAC: ATAC-seq (chromatin accessibility)
- CHIP_H: ChIP-seq for histone modifications (e.g., H3K4me3, H3K27ac)
- CHIP_TF: ChIP-seq for transcription factors (e.g., CTCF, TP53)
Tracks span various cell types, tissues, and conditions from the ENCODE and FANTOM projects. This is a LOCAL search tool using pre-computed embeddings - it does NOT call the AIDO Inference Engine API.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | str | Text description to search for (e.g., "liver tissue", "MCF-7 breast cancer cells", "histone H3K27 acetylation"). The search matches against track descriptions including assay type, cell type, tissue type, and experimental conditions. | required |
| k | int | Number of top results to return, ranked by similarity score. Default is 10. | 10 |
| track_types | list[str] \| None | Optional filter to restrict search to specific assay types. Must be a subset of: ["ATAC", "CAGE", "CHIP_H", "CHIP_TF", "DNASE", "RNA"]. If None (default), searches across all track types. Use CHIP_H for histone modifications and CHIP_TF for transcription factors. | None |
Returns:
| Type | Description |
|---|---|
| DataFrame | A pandas DataFrame with columns: |
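The two-step workflow described in the notes (search for track indices, then filter the 7,611 predictions) can be sketched without the toolkit. The helpers below are hypothetical, and two things are assumed because this page does not show them: that predictions are laid out as rows of bins with one column per track, and that the search result exposes an integer track index per row.

```python
from __future__ import annotations

ALLOWED_TRACK_TYPES = ("ATAC", "CAGE", "CHIP_H", "CHIP_TF", "DNASE", "RNA")


def check_track_types(track_types: list[str] | None) -> list[str] | None:
    """Reject assay types outside the documented set; None means 'all types'."""
    if track_types is None:
        return None
    unknown = [t for t in track_types if t not in ALLOWED_TRACK_TYPES]
    if unknown:
        raise ValueError(f"unknown track types: {unknown}")
    return list(track_types)


def select_tracks(predictions: list[list[float]], track_indices: list[int]) -> list[list[float]]:
    """Keep only the chosen track columns from a bins-by-tracks matrix (assumed layout)."""
    return [[row[i] for i in track_indices] for row in predictions]


# Toy matrix: 2 bins x 4 tracks, standing in for 6144 bins x 7611 tracks.
toy_predictions = [[0.1, 0.2, 0.3, 0.4],
                   [1.1, 1.2, 1.3, 1.4]]
print(select_tracks(toy_predictions, [3, 0]))
```

With real outputs the same selection would normally be done with array indexing; plain lists are used here only to keep the sketch dependency-free.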
genbio.toolkit.aido_models_apis.predict_tracks_v3
predict_tracks_v3(sequence: str, track_idxs: list[int] | None = None, track_type: TrackType | None = None, is_human: bool = True) -> dict[str, Any]
Predict genomic tracks or generate embeddings from a DNA sequence using AIDO.DNA3-AG-524K.
In "tracks" mode, predicts genomic assay tracks at two resolutions from 196,608 bp input sequences for human or mouse genomes. In "embeddings" mode, returns backbone representations suitable for similarity search, clustering, or training downstream models.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | str | DNA sequence to predict tracks for. Tokenized at single-nucleotide resolution using A, T, C, G, N. | required |
| track_idxs | list[int] \| None | Integer indices selecting specific tracks from the output. If None, all tracks for the requested type are returned. | None |
| track_type | TrackType \| None | Assay type to predict. Passed to the API for server-side filtering. 1 bp resolution: TrackType.CAGE, TrackType.RNA_SEQ, TrackType.ATAC, TrackType.DNASE, TrackType.PROCAP, TrackType.SPLICE_SITES, TrackType.SPLICE_SITE_USAGE. 128 bp resolution: TrackType.CHIP_HISTONE, TrackType.CHIP_TF. | None |
| is_human | bool | True (default) for human, False for mouse. Only applies when output_type is "tracks". | True |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dictionary with keys: |
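The two output resolutions listed for `track_type` imply different output lengths for the same input. The sketch below tabulates them. Plain strings stand in for the `TrackType` members (the enum itself is not imported here), and the assumption that the number of output positions equals the input length divided by the resolution is not confirmed by this page.

```python
# Resolution in bp per output position, keyed by TrackType member name.
RESOLUTION_BP = {
    "CAGE": 1,
    "RNA_SEQ": 1,
    "ATAC": 1,
    "DNASE": 1,
    "PROCAP": 1,
    "SPLICE_SITES": 1,
    "SPLICE_SITE_USAGE": 1,
    "CHIP_HISTONE": 128,
    "CHIP_TF": 128,
}


def expected_positions(track_type_name: str, sequence_length: int = 196_608) -> int:
    """Assumed number of output positions for one track of the given type."""
    try:
        resolution = RESOLUTION_BP[track_type_name]
    except KeyError:
        raise ValueError(f"unknown track type: {track_type_name!r}") from None
    return sequence_length // resolution


for name in ("CAGE", "CHIP_TF"):
    print(name, expected_positions(name))
```

Under these assumptions a 196,608 bp input yields 196,608 positions for the 1 bp assays and 1,536 positions for the 128 bp ChIP assays, so arrays for the two groups cannot be stacked without resampling.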
RNA
genbio.toolkit.aido_models_apis.rna_secondary_structure
rna_secondary_structure(sequences: list[str]) -> dict[str, Any]
Predict RNA secondary structure from nucleotide sequences in dot-bracket notation.
Notes
This function accesses the SOTA AIDO.RNA-1.6B-bpRNA_secondary_structure_prediction model, which is fine-tuned from AIDO.RNA-1.6B on the bpRNA and Archive-II datasets for RNA secondary structure prediction. The model predicts base pairing patterns that form RNA secondary structure, achieving SOTA performance with an F1 score of 0.783 on the bpRNA-TS0 test set, and demonstrates strong inter-family generalization across nine RNA families including 5S rRNA, tRNA, tmRNA, RNase P RNA, and others. Predictions are returned in dot-bracket notation where paired bases are indicated by matching parentheses and unpaired bases by dots. All sequences are processed in chunks of up to 1000 nucleotides.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequences | list[str] | A list of RNA sequences (strings of nucleotide tokens), processed in chunks of up to 1000 nucleotides each. Sequences are tokenized at single-nucleotide resolution using the vocabulary: A, U, C, G. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
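Dot-bracket strings are compact but awkward to analyze directly. A common first step is converting them to a list of paired positions with a stack. The sketch below does this for structures that use only dots and parentheses, as described in the notes; `dot_bracket_to_pairs` is a hypothetical helper, and pseudoknot brackets such as `[` and `]` are deliberately rejected.

```python
from __future__ import annotations


def dot_bracket_to_pairs(structure: str) -> list[tuple[int, int]]:
    """Return 0-based (i, j) index pairs, i < j, for every matched parenthesis."""
    stack: list[int] = []
    pairs: list[tuple[int, int]] = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)  # remember the opening position
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))  # closes the most recent opener
        elif ch != ".":
            raise ValueError(f"unsupported character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def paired_fraction(structure: str) -> float:
    """Fraction of nucleotides that take part in a base pair."""
    return 2 * len(dot_bracket_to_pairs(structure)) / len(structure) if structure else 0.0


print(dot_bracket_to_pairs("((..))"))
```

The pair list makes it straightforward to compare a prediction against a reference structure, for example by intersecting the two sets of pairs to compute precision, recall, and F1.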
genbio.toolkit.aido_models_apis.rna_splice_site_query
rna_splice_site_query(sequences: list[str]) -> dict[str, Any]
Predict splice site probabilities from RNA sequences.
Notes
This function accesses the AIDO.RNA-SpliceSite-Query model, a fine-tuned AIDO RNA model that predicts acceptor and donor splice site probabilities from RNA sequences. The model is specialized for identifying potential splice sites in pre-mRNA sequences and returns probability scores for both acceptor (3' splice site) and donor (5' splice site) positions. CRITICAL REQUIREMENT: Each input sequence must be exactly 600 nucleotides long and should be centered around the potential splice site of interest. Sequences of other lengths will be rejected by the API.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequences | list[str] | List of RNA sequences to score. Each sequence must be exactly 600 nucleotides long, centered around a potential splice site. Sequences should contain only standard RNA nucleotides (A, U, C, G). The 600-nucleotide requirement ensures proper context for splice site prediction. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
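Because the API rejects anything that is not exactly 600 nucleotides, inputs usually have to be cut from a longer transcript first. The sketch below extracts such a window. `splice_window` is a hypothetical helper; placing the candidate site at index 300 of the window and converting DNA-style `T` to `U` are both assumptions, since the page does not say which position counts as the center of an even-length window.

```python
WINDOW = 600  # documented required length


def splice_window(transcript: str, site: int, window: int = WINDOW) -> str:
    """Cut a `window`-nt RNA sequence with 0-based position `site` at index window // 2."""
    rna = transcript.upper().replace("T", "U")  # assumed convenience conversion
    start = site - window // 2
    stop = start + window
    if start < 0 or stop > len(rna):
        raise ValueError("site is too close to a transcript end for a full window")
    sub = rna[start:stop]
    bad = set(sub) - set("AUCG")
    if bad:
        raise ValueError(f"non-standard nucleotides in window: {sorted(bad)}")
    return sub


# Toy transcript with a single G at position 300.
toy = "A" * 300 + "G" + "C" * 400
win = splice_window(toy, 300)
print(len(win), win[300])
```

Sites closer than 300 nt to either end raise an error here rather than being padded, since padding behaviour is not described and could bias the prediction.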
Structure Prediction
genbio.toolkit.aido_models_apis.structure_prediction
structure_prediction(query_1: str, query_1_type: Literal['proteinChain', 'rnaSequence', 'ligand'], query_2: str | None = None, query_2_type: Literal['proteinChain', 'rnaSequence', 'ligand'] | None = None, query_3: str | None = None, query_3_type: Literal['proteinChain', 'rnaSequence', 'ligand'] | None = None) -> dict[str, Any]
Predict full-atom 3D structure and interactions of biomolecules including proteins, DNA, RNA, and small molecule ligands.
Notes
This function accesses the SOTA AIDO.StructurePrediction model, an AlphaFold3-like full-atom structure prediction model designed to predict the structure and interactions of biological molecules including proteins, DNA, RNA, ligands, and antibodies. The model achieves state-of-the-art performance on all structure prediction tasks, with especially strong performance for immunology-related structure prediction tasks, including antibody, nanobody, antibody-antigen, and nanobody-antigen complexes. Predictions are returned in CIF (Crystallographic Information File) format, always with chain keys A0, B0, and C0, suitable for structural analysis and visualization with py3Dmol, PyMOL, ChimeraX, and other structural biology tools.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query_1 | str | Sequence of the first molecule. Amino acids if proteinChain, AUGC nucleotides if rnaSequence, ATGC nucleotides if dnaSequence, or SMILES string if ligand. Ligand also tolerates CCD code if known. | required |
| query_1_type | str | Type of the first sequence; one of ["proteinChain", "rnaSequence", "dnaSequence", "ligand"]. | required |
| query_2 | str | Sequence of the second molecule, same specification as query_1. | None |
| query_2_type | str | Type of the second sequence, same specification as query_1_type. | None |
| query_3 | str | Sequence of the third molecule, same specification as query_1. | None |
| query_3_type | str | Type of the third sequence, same specification as query_1_type. | None |
Returns:
A dictionary with the following fields:
- "model_name": The identifier of the model used ("AIDO.StructurePrediction").
- "return_code": Integer status code (0 indicates success).
- "output": A dictionary containing:
  - "cif_data": A string containing the full structure prediction in CIF (Crystallographic Information File) format, including atomic coordinates, connectivity information, entity definitions, and metadata. The CIF format can be parsed by standard structural biology tools for visualization and analysis. Always returns chains "A0", "B0", and "C0" for the input molecules.
- "parameters": A dictionary containing the input parameters provided to the model, including all query sequences and their types.
- "error": None if successful, otherwise contains error information.
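A caller will usually want to check the status code before touching the CIF text. The sketch below follows the field names listed in the return description; `extract_cif` is a hypothetical helper, the sample response is fabricated for illustration only, and treating "error" as a top-level key is an assumption about the layout.

```python
def extract_cif(response: dict) -> str:
    """Return the CIF text from a structure_prediction response, or raise on failure."""
    if response.get("return_code") != 0:
        raise RuntimeError(f"structure prediction failed: {response.get('error')}")
    cif = response.get("output", {}).get("cif_data")
    if not cif:
        raise RuntimeError("response reported success but contained no cif_data")
    return cif


# Illustrative stand-in for a real response (not actual model output).
fake_ok = {
    "model_name": "AIDO.StructurePrediction",
    "return_code": 0,
    "output": {"cif_data": "data_example\n#\n"},
    "parameters": {"query_1": "MKT", "query_1_type": "proteinChain"},
    "error": None,
}
fake_failed = {"return_code": 1, "output": None, "error": "invalid sequence"}

print(extract_cif(fake_ok).splitlines()[0])
```

The returned string can be written to a `.cif` file and opened in PyMOL or ChimeraX, or passed directly to py3Dmol for inline viewing.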
Perturbation
genbio.toolkit.aido_models_apis.knockout_effect_query
knockout_effect_query(gene_symbol: str | list[str]) -> dict[str, Any]
Predict transcriptomic effects of gene knockout.
Notes
This function accesses the AIDO.KnockoutEffect-Query model, which returns predicted expression changes for approximately 6,000 readout genes resulting from knocking out one or more query genes. The model predicts the difference in expression between knockout and baseline conditions (Expression_KO - Expression_Baseline). Positive values indicate upregulation after knockout, while negative values indicate downregulation. This tool is useful for understanding the downstream transcriptional effects of perturbing specific genes, identifying potential compensatory mechanisms, and predicting phenotypic outcomes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| gene_symbol | str \| list[str] | Gene symbol(s) to query. Can be either a single gene as a string (e.g., "CXCL8") or a list of genes (e.g., ["CXCL8", "PLAG1", "TP53"]). Each gene must exist in the knockout effect reference database vocabulary (~18,000 genes). | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
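Since predictions are differences (Expression_KO - Expression_Baseline), the sign carries the direction of the change. The sketch below ranks the strongest up- and downregulated readout genes. `top_changes` is a hypothetical helper, and the input shape (a mapping from readout gene symbol to predicted difference) is an assumption, because the layout of the returned dictionary is not shown on this page. The numbers are made up.

```python
from __future__ import annotations


def top_changes(deltas: dict[str, float], n: int = 5) -> dict[str, list[tuple[str, float]]]:
    """Split predicted expression differences into top upregulated and downregulated genes."""
    ranked = sorted(deltas.items(), key=lambda item: item[1])
    down = [(gene, d) for gene, d in ranked[:n] if d < 0]        # most negative first
    up = [(gene, d) for gene, d in ranked[::-1][:n] if d > 0]    # most positive first
    return {"up": up, "down": down}


# Illustrative values only, not model output.
toy_deltas = {"GENE_A": 1.5, "GENE_B": -2.0, "GENE_C": 0.2, "GENE_D": -0.1}
print(top_changes(toy_deltas, n=1))
```

Genes with a difference of exactly zero are left out of both lists, so a short result for a given `n` means few readout genes moved in that direction.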
genbio.toolkit.aido_models_apis.knockout_effect_query_gene_vocab
knockout_effect_query_gene_vocab() -> list[str]
Retrieve the input gene vocabulary for knockout effect queries.
Notes
This function returns the list of gene symbols that can be queried in the AIDO.KnockoutEffect-Query model. Query genes must exist in this vocabulary (approximately 18,000 genes).
Returns:
| Type | Description |
|---|---|
| list[str] | A list of HGNC gene symbols (strings) that can be knocked out in the knockout effect prediction model, in alphabetical order. |
genbio.toolkit.aido_models_apis.knockout_effect_query_readout_genes
knockout_effect_query_readout_genes() -> list[str]
Retrieve the readout gene vocabulary for knockout effect query results.
Notes
This function returns the list of readout gene symbols that are included in knockout effect predictions. The model predicts expression changes for approximately 6,000 readout genes in response to knocking out the query gene(s).
Returns:
| Type | Description |
|---|---|
| list[str] | A list of HGNC gene symbols (strings) for which expression changes are predicted in knockout effect results, in alphabetical order. |
genbio.toolkit.aido_models_apis.gene_knockdown_query
gene_knockdown_query(genes: list[str], context_name: str = 'K562', return_symbols: bool = True, *, download: bool = True) -> dict[str, Any]
Simulate gene knockdowns and predict genome-wide expression changes.
Notes
This function performs In-Silico Perturbation via the AIDO.GeneKnockdown model. It predicts how the transcriptome of a cell (the "context") would change if one or more genes were knocked down. Unlike a simple lookup, this uses a trained Multi-Layer Perceptron (MLPConcat) to generalise effects based on gene embeddings and input context (gene expression) vectors.
Model & Data Context
- Model (MLPConcat): A neural network that takes two inputs — a cell-state vector C and a perturbation vector P — and predicts delta expression (knockdown vs. baseline).
- Embeddings: Uses NBFNet-predicted embeddings that incorporate biological priors from gene ontologies and protein–protein interaction networks.
- Contexts: 1,048 cell types / tissues are supported (e.g. "K562", "hepatoblastoma cell HepG2", "HEK293T", "endoderm cells"). Use the associated context vocabulary function for the full list.
- Targets: 18,425 gene symbols can be knocked down.
- Readout: Each prediction covers 15,867 readout genes (columns of the output matrix — gene symbols or Ensembl IDs). Note: the readout gene set differs from the target gene set.
Result delivery
Because the result matrix can be very large (up to ~1 GB for 18,000 genes), the API returns a lightweight JSON receipt with download URIs instead of streaming the data inline. The .h5ad file is stored in the inference results object store (GCS).
When download=True this function automatically fetches the .h5ad file and parses it into an anndata.AnnData object, available as output["adata"]. It first tries a direct download via the google-cloud-storage SDK (fastest for in-cluster / same-project callers, zero egress cost), and falls back to the HTTPS signed URL for external callers. The parsed AnnData object is available directly::
    resp = gene_knockdown_query(["MYC"], download=True)
    adata = resp["output"]["adata"]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| genes | list[str] | List of gene symbols to knock down (e.g. ["MYC", "TP53"]). Case-insensitive. Genes not found in the model's embedding vocabulary (18,425 genes) are silently skipped. At least one valid gene is required. | required |
| context_name | str | Cell type / tissue context for the prediction. Supports 1,048 contexts. Defaults to "K562". | 'K562' |
| return_symbols | bool | If True (default), readout gene columns in the .h5ad are HGNC gene symbols. If False, they are Ensembl IDs. | True |
| download | bool | If True, automatically download the .h5ad result and parse it into an anndata.AnnData object at output["adata"]. The download strategy prefers the GCS SDK (in-cluster) and falls back to the signed URL (external). Defaults to True. | True |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dictionary with the following fields: |
Raises:
| Type | Description |
|---|---|
| HTTPError | If the API returns an error status. |
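Because unknown genes are silently skipped, a typo in the `genes` list produces a smaller result rather than an error. A local pre-check can surface this. The sketch below is one way to do it: `partition_genes` is a hypothetical helper, the vocabulary is supplied by the caller (no vocabulary function for this model is documented on this page), and matching by upper-casing is an assumption about how the case-insensitive lookup works.

```python
from __future__ import annotations

from typing import Iterable


def partition_genes(genes: list[str], vocab: Iterable[str]) -> tuple[list[str], list[str]]:
    """Split requested genes into (found, skipped) using case-insensitive matching."""
    known = {symbol.upper() for symbol in vocab}
    found: list[str] = []
    skipped: list[str] = []
    for gene in genes:
        (found if gene.upper() in known else skipped).append(gene)
    if not found:
        # Mirrors the documented rule that at least one valid gene is required.
        raise ValueError("none of the requested genes are in the vocabulary")
    return found, skipped


# Toy vocabulary standing in for the 18,425 target gene symbols.
toy_vocab = ["MYC", "TP53", "GATA1"]
found, skipped = partition_genes(["myc", "TP53", "NOTAGENE"], toy_vocab)
print(found, skipped)

# Hypothetical follow-up call (needs the toolkit and a reachable Inference Engine):
# resp = gene_knockdown_query(found, context_name="K562")
```

Logging the `skipped` list before the request makes it clear why the returned AnnData object may have fewer rows than the number of genes requested.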