3rd Party Tools

Gene Mapping

genbio.toolkit.gene_mapping_api.biomart_gene_mapping

biomart_gene_mapping(gene_ids: list[str], dataset: str = 'hsapiens_gene_ensembl', filter_type: str = 'external_gene_name', host: str = 'www.ensembl.org', include_go: bool = False) -> pd.DataFrame

Map and convert gene identifiers using Ensembl BioMart.

Notes

This function uses the Ensembl BioMart web service to convert between different gene identifier types and retrieve gene annotations. BioMart provides access to comprehensive gene information including genomic coordinates, gene descriptions, and Gene Ontology (GO) annotations.

Common use cases:

  • Convert between identifier types (e.g., gene symbols to Ensembl IDs)
  • Retrieve gene annotations and descriptions
  • Get genomic coordinates (chromosome, start/end positions)
  • Map genes to Gene Ontology terms
  • Cross-reference between databases (Ensembl, Entrez, HGNC)

The function queries Ensembl's public BioMart service and returns extended gene information. When include_go=True, GO annotations are included, which will create multiple rows per gene (one for each GO term).
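Because include_go=True yields one row per gene and GO term pair, downstream code often needs to fold those rows back into one entry per gene. The following is a minimal sketch, assuming the result has been converted to a list of row dictionaries (for example with pandas' `DataFrame.to_dict("records")`). The helper name `collapse_go_terms` and the sample rows are illustrative and not part of the toolkit.

```python
from collections import defaultdict


def collapse_go_terms(rows):
    """Group GO term IDs by Ensembl gene ID, preserving first-seen order."""
    go_by_gene = defaultdict(list)
    for row in rows:
        # Touching the key ensures genes without GO terms still appear.
        terms = go_by_gene[row["ensembl_gene_id"]]
        go_id = row.get("go_id")
        # Skip missing values (None or NaN) and duplicate terms.
        if isinstance(go_id, str) and go_id and go_id not in terms:
            terms.append(go_id)
    return dict(go_by_gene)


# Illustrative rows only; real output carries more columns.
rows = [
    {"ensembl_gene_id": "ENSG00000141510", "go_id": "GO:0006915"},
    {"ensembl_gene_id": "ENSG00000141510", "go_id": "GO:0003677"},
    {"ensembl_gene_id": "ENSG00000141510", "go_id": "GO:0006915"},
]
go_terms = collapse_go_terms(rows)
print(go_terms)
```

Keeping the GO terms in a list per gene preserves the one-row-per-gene shape that include_go=False would have produced, while retaining the annotations.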

For more information about BioMart and available datasets, visit: https://www.ensembl.org/biomart

Parameters:

Name Type Description Default
gene_ids list[str]

List of gene identifiers to map or convert (e.g., ['TP53', 'BRCA1', 'EGFR']). The identifier type should match the filter_type parameter.

required
dataset str

Ensembl BioMart dataset name (default: 'hsapiens_gene_ensembl'). Common datasets:

  • 'hsapiens_gene_ensembl': Human genes
  • 'mmusculus_gene_ensembl': Mouse genes
  • 'drerio_gene_ensembl': Zebrafish genes
  • 'rnorvegicus_gene_ensembl': Rat genes
  • 'dmelanogaster_gene_ensembl': Fruit fly genes
  • 'celegans_gene_ensembl': C. elegans genes

'hsapiens_gene_ensembl'
filter_type str

Type of input gene identifiers (default: 'external_gene_name'). Common filter types:

  • 'external_gene_name': Gene symbols (e.g., 'TP53', 'BRCA1')
  • 'ensembl_gene_id': Ensembl gene IDs (e.g., 'ENSG00000141510')
  • 'entrezgene_id': NCBI Entrez gene IDs (numeric)
  • 'hgnc_id': HGNC IDs (for human genes)
  • 'uniprot_gn_id': UniProt gene names

'external_gene_name'
host str

Ensembl BioMart host server (default: 'www.ensembl.org'). Alternative mirror servers:

  • 'useast.ensembl.org': US East mirror
  • 'asia.ensembl.org': Asia mirror

The function automatically falls back to the mirrors if the primary host fails.

'www.ensembl.org'
include_go bool

Include Gene Ontology term IDs in results (default: False). When True, adds 'go_id' column but creates multiple rows per gene (one row for each GO term associated with the gene). When False, returns one row per gene without GO annotations.

False
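Since gene_ids must all match the chosen filter_type, it can help to check the shape of an identifier before querying. The sketch below is a heuristic only, written under the assumption that identifiers follow the example formats listed above. `guess_filter_type` is a hypothetical helper, not a toolkit function, and it does not recognise organism-specific schemes such as fly or worm gene IDs.

```python
import re


def guess_filter_type(identifier: str) -> str:
    """Heuristically pick a BioMart filter_type for a single identifier."""
    ident = identifier.strip()
    # Ensembl gene IDs: "ENS", optional species letters, "G", 11 digits.
    if re.fullmatch(r"ENS[A-Z]*G\d{11}", ident):
        return "ensembl_gene_id"
    # HGNC IDs are assumed to carry an "HGNC:" prefix.
    if re.fullmatch(r"HGNC:\d+", ident):
        return "hgnc_id"
    # Purely numeric identifiers are treated as NCBI Entrez gene IDs.
    if ident.isdigit():
        return "entrezgene_id"
    # Everything else is assumed to be a gene symbol.
    return "external_gene_name"


for example in ["TP53", "ENSG00000141510", "7157", "HGNC:11998"]:
    print(example, "->", guess_filter_type(example))
```

A list that yields more than one guessed type is a sign that the identifiers should be split into separate calls, one per filter_type.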

Returns:

Type Description
DataFrame

pandas DataFrame with gene mapping and annotation results, containing columns:

  • 'ensembl_gene_id': Ensembl gene identifier
  • 'external_gene_name': Gene symbol
  • 'entrezgene_id': NCBI Entrez gene ID (may be NA for some genes)
  • 'description': Human-readable gene description
  • 'chromosome_name': Chromosome location (e.g., '1', 'X', 'MT')
  • 'start_position': Gene start position (base pairs)
  • 'end_position': Gene end position (base pairs)
  • 'gene_biotype': Gene type (e.g., 'protein_coding', 'lncRNA', 'pseudogene')
  • 'go_id': Gene Ontology term ID (only when include_go=True)

When include_go=False (default): Returns one row per gene.

When include_go=True: Genes appear in multiple rows if associated with multiple GO terms.

Returns empty DataFrame if no matching genes are found in BioMart.
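A common follow-up is to turn the result into a symbol-to-Ensembl lookup and to see which inputs came back without a match. The documentation does not say how partially matched inputs are reported, so this sketch assumes unmatched identifiers are simply absent from the result. It works on row dictionaries (such as those from `DataFrame.to_dict("records")`), and `symbol_to_ensembl` is an illustrative name.

```python
def symbol_to_ensembl(rows, requested):
    """Return (mapping, unmatched) for gene symbols against result rows."""
    mapping = {}
    for row in rows:
        # Keep the first Ensembl ID seen for each symbol.
        mapping.setdefault(row["external_gene_name"], row["ensembl_gene_id"])
    # Anything requested but never seen in the rows counts as unmatched.
    unmatched = [symbol for symbol in requested if symbol not in mapping]
    return mapping, unmatched


# Illustrative row; real output carries the full set of columns.
rows = [
    {"external_gene_name": "TP53", "ensembl_gene_id": "ENSG00000141510"},
]
mapping, unmatched = symbol_to_ensembl(rows, ["TP53", "NOT_A_GENE"])
print(mapping, unmatched)
```

With an empty result, the mapping is empty and every requested symbol ends up in the unmatched list, which makes the "no matching genes" case easy to detect.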

genbio.toolkit.gene_mapping_api.get_orthology_table

get_orthology_table() -> pd.DataFrame

Load orthology mapping table for human, mouse, marmoset, and rhesus macaque genes.

Notes

This function loads a pre-compiled orthology table containing gene mappings between four species: human, mouse, common marmoset, and rhesus macaque. The table includes Ensembl gene IDs, NCBI gene IDs, and gene symbols for each species where orthologous genes have been identified.

Pulls directly from https://raw.githubusercontent.com/AllenInstitute/GeneOrthology/refs/heads/main/csv/mouse_human_marmoset_macaque_orthologs_20231113.csv

Common use cases:

  • Convert gene identifiers between model organisms
  • Find mouse orthologs for human disease genes
  • Identify conserved genes across primate species
  • Cross-reference experimental results between species
  • Filter for genes with established orthologs in specific organisms

Returns:

Type Description
DataFrame

pandas DataFrame with orthology mappings containing 14 columns:

  • 'human_geneid': NCBI Entrez gene ID (integer)
  • 'human_EnsemblID': Ensembl gene identifier (e.g., 'ENSG00000121410')
  • 'human_Symbol': Official gene symbol (e.g., 'A1BG')
  • 'human_type_of_gene': Gene type (e.g., 'protein-coding', 'ncRNA')
  • 'human_description': Gene description/name
  • 'marmoset_EnsemblID': Ensembl gene identifier (e.g., 'ENSCJAG00000000314')
  • 'marmoset_geneid': NCBI Entrez gene ID (integer)
  • 'marmoset_Symbol': Gene symbol
  • 'mouse_EnsemblID': Ensembl gene identifier (e.g., 'ENSMUSG00000022347')
  • 'mouse_geneid': NCBI Entrez gene ID (integer)
  • 'mouse_Symbol': Gene symbol (e.g., 'A1bg')
  • 'rhesus_macaque_EnsemblID': Ensembl gene identifier (e.g., 'ENSMMUG00000012459')
  • 'rhesus_macaque_geneid': NCBI Entrez gene ID (integer)
  • 'rhesus_macaque_Symbol': Gene symbol

The table contains approximately 18,000 human genes with their orthologs.

Missing ortholog information is represented as NaN in the DataFrame.
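Because missing orthologs are stored as NaN, lookups need to treat NaN as "no ortholog" rather than as a value. The sketch below shows one way to find mouse symbols for a list of human symbols, assuming the table has been converted to row dictionaries. The helper names and the second sample row are made up for illustration.

```python
import math


def is_missing(value):
    """True for None or a float NaN, the usual pandas missing markers."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mouse_orthologs(rows, human_symbols):
    """Map each requested human symbol to its mouse symbol, or None."""
    wanted = set(human_symbols)
    result = {symbol: None for symbol in human_symbols}
    for row in rows:
        symbol = row.get("human_Symbol")
        mouse = row.get("mouse_Symbol")
        if symbol in wanted and not is_missing(mouse):
            result[symbol] = mouse
    return result


rows = [
    {"human_Symbol": "A1BG", "mouse_Symbol": "A1bg"},
    {"human_Symbol": "EXAMPLE1", "mouse_Symbol": float("nan")},  # made-up row
]
orthologs = mouse_orthologs(rows, ["A1BG", "EXAMPLE1", "ABSENT"])
print(orthologs)
```

Symbols that are absent from the table and symbols whose ortholog is NaN both map to None here; callers that need to tell the two apart can check membership in the table separately.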

Semantic Scholar

genbio.toolkit.semantic_scholar_apis.search_papers

search_papers(query: str, fields: str | None = None, limit: int = 10) -> dict[str, Any]

Search for papers by keyword query.

Notes

This function accesses the Semantic Scholar Academic Graph API paper search endpoint. It returns papers matching the query string, ranked by relevance. The search supports natural language queries and returns paginated results.

Parameters:

Name Type Description Default
query str

Search query string (e.g., "COVID-19 vaccines", "neural networks"). Supports natural language queries.

required
fields str | None

Comma-separated list of fields to return (e.g., "paperId,title,authors"). If None, defaults to "paperId,title,authors,year,abstract,url". Available fields include: paperId, externalIds, url, title, abstract, venue, year, referenceCount, citationCount, influentialCitationCount, isOpenAccess, fieldsOfStudy, s2FieldsOfStudy, publicationTypes, publicationDate, journal, authors, citations, references, embedding.

None
limit int

Maximum number of results to return (default: 10, max: 100).

10
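To make the parameter defaults concrete, here is a sketch of how a request to the public Semantic Scholar paper search endpoint could be assembled from these three arguments. It is not the toolkit's implementation: `build_search_url` is a hypothetical helper, and whether the toolkit clamps an oversized limit or raises an error is not stated in this reference.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
DEFAULT_FIELDS = "paperId,title,authors,year,abstract,url"


def build_search_url(query, fields=None, limit=10):
    """Assemble a paper-search URL, applying the documented defaults."""
    params = {
        "query": query,
        "fields": fields or DEFAULT_FIELDS,
        # Clamp to the documented maximum of 100; the lower bound of 1
        # is an assumption made for this sketch.
        "limit": max(1, min(limit, 100)),
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


url = build_search_url("neural networks", fields="paperId,title", limit=500)
print(url)
```

Requesting only the fields that are needed keeps responses small, which matters most for heavy fields such as abstract, citations, references, and embedding.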

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "total": Total number of results matching the query.
  • "offset": Current offset in the result set.
  • "next": Offset for the next page of results (if available).
  • "data": List of paper objects, each containing the requested fields.
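The returned dictionary can be reduced to a short listing, and the presence of "next" tells the caller whether more results exist beyond this page. The sketch below assumes each author entry is a dictionary with a "name" key and uses a hand-written response rather than real API output. `summarize_search` is an illustrative name.

```python
def summarize_search(response):
    """Return (one-line summaries, has_more) for a search response dict."""
    lines = []
    for paper in response.get("data", []):
        authors = paper.get("authors") or []
        first_author = authors[0]["name"] if authors else "unknown"
        year = paper.get("year") or "n.d."
        lines.append(f"{first_author} ({year}): {paper.get('title')}")
    # "next" is only expected when another page of results exists.
    has_more = response.get("next") is not None
    return lines, has_more


# Hand-written response for illustration; not real API output.
response = {
    "total": 2,
    "offset": 0,
    "next": 1,
    "data": [
        {"paperId": "abc123", "title": "An Example Paper", "year": 2021,
         "authors": [{"authorId": "1", "name": "A. Author"}]},
    ],
}
lines, has_more = summarize_search(response)
print(lines, has_more)
```

The fallbacks for missing authors and years matter in practice, since a paper object only contains the fields that were requested.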

genbio.toolkit.semantic_scholar_apis.get_paper_details

get_paper_details(paper_id: str, fields: str | None = None) -> dict[str, Any]

Get detailed information about a specific paper.

Notes

This function accesses the Semantic Scholar Academic Graph API paper details endpoint. It returns comprehensive information about a single paper identified by its ID. Supports multiple ID formats including S2 paper ID, DOI, ArXiv ID, MAG ID, ACL ID, PubMed ID, and Corpus ID.

Parameters:

Name Type Description Default
paper_id str

Paper identifier in one of the following formats:

  • S2 paper ID: e.g., "649def34f8be52c8b66281af98ae884c09aef38b"
  • DOI: e.g., "DOI:10.1038/s41586-020-2012-7"
  • ArXiv ID: e.g., "ARXIV:2106.15928"
  • PubMed ID: e.g., "PMID:33268865"
  • Corpus ID: e.g., "CorpusID:3658586"
  • MAG ID: e.g., "MAG:112218234"
  • ACL ID: e.g., "ACL:W12-3903"

required
fields str | None

Comma-separated list of fields to return. If None, defaults to "paperId,title,authors,year,abstract,url,citationCount,referenceCount,publicationDate". See search_papers() for available fields.

None
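Identifiers copied from papers or reference managers often lack the prefixes listed above. The following sketch adds a prefix where the type can be inferred from the shape of the string and refuses to guess otherwise. The patterns are heuristics chosen for this example, and `normalize_paper_id` is a hypothetical helper rather than part of the toolkit.

```python
import re

KNOWN_PREFIXES = {"DOI", "ARXIV", "PMID", "CORPUSID", "MAG", "ACL"}


def normalize_paper_id(raw: str) -> str:
    """Add the documented prefix to a bare DOI or arXiv ID."""
    value = raw.strip()
    if ":" in value and value.split(":", 1)[0].upper() in KNOWN_PREFIXES:
        return value  # already prefixed
    if re.fullmatch(r"[0-9a-f]{40}", value):
        return value  # S2 paper ID (40 hex characters)
    if re.fullmatch(r"10\.\d{4,9}/\S+", value):
        return f"DOI:{value}"
    if re.fullmatch(r"\d{4}\.\d{4,5}(v\d+)?", value):
        return f"ARXIV:{value}"
    # Bare numbers are ambiguous (PubMed, Corpus, or MAG ID), so fail loudly.
    raise ValueError(f"Cannot infer identifier type for {raw!r}")


print(normalize_paper_id("10.1038/s41586-020-2012-7"))
print(normalize_paper_id("2106.15928"))
```

Raising on ambiguous input is deliberate: a bare number such as 33268865 could be a PubMed, Corpus, or MAG ID, and guessing wrong would silently fetch a different paper.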

Returns:

Type Description
dict[str, Any]

A dictionary containing the requested fields for the paper.

Returns None if the paper is not found.

genbio.toolkit.semantic_scholar_apis.get_paper_citations

get_paper_citations(paper_id: str, fields: str | None = None, limit: int = 100) -> dict[str, Any]

Get citations for a specific paper.

Notes

This function accesses the Semantic Scholar Academic Graph API citations endpoint. It returns papers that cite the specified paper, along with citation contexts (the text snippets where the citation appears).

Parameters:

Name Type Description Default
paper_id str

Paper identifier (see get_paper_details for supported formats).

required
fields str | None

Comma-separated list of fields to return for each citing paper. If None, defaults to "paperId,title,authors,year". See search_papers() for available fields.

None
limit int

Maximum number of citations to return (default: 100, max: 1000).

100

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "offset": Current offset in the result set.
  • "next": Offset for the next page of results (if available).
  • "data": List of citation objects, each containing:
      • "citingPaper": Paper object with requested fields
      • "contexts": List of citation context strings
      • "intents": List of citation intent categories (Background, Methodology, ResultComparison)
      • "isInfluential": Boolean indicating if this is an influential citation
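The per-citation metadata makes it straightforward to profile how a paper is being cited. The sketch below tallies intents and collects the IDs of influential citing papers from a hand-written response. The intent strings reuse the category names listed above, and the exact casing returned by the API may differ. `summarize_citations` is an illustrative name.

```python
from collections import Counter


def summarize_citations(response):
    """Count citation intents and collect IDs of influential citing papers."""
    intent_counts = Counter()
    influential_ids = []
    for citation in response.get("data", []):
        intent_counts.update(citation.get("intents") or [])
        if citation.get("isInfluential"):
            paper = citation.get("citingPaper") or {}
            influential_ids.append(paper.get("paperId"))
    return intent_counts, influential_ids


# Hand-written response for illustration; not real API output.
response = {
    "offset": 0,
    "data": [
        {"citingPaper": {"paperId": "p1", "title": "First"},
         "contexts": ["... as shown in [1] ..."],
         "intents": ["Background"], "isInfluential": False},
        {"citingPaper": {"paperId": "p2", "title": "Second"},
         "contexts": [], "intents": ["Methodology", "Background"],
         "isInfluential": True},
    ],
}
intent_counts, influential_ids = summarize_citations(response)
print(intent_counts, influential_ids)
```

A single citation can carry several intents, so the intent counts can sum to more than the number of citations.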

genbio.toolkit.semantic_scholar_apis.get_paper_references

get_paper_references(paper_id: str, fields: str | None = None, limit: int = 100) -> dict[str, Any]

Get references cited by a specific paper.

Notes

This function accesses the Semantic Scholar Academic Graph API references endpoint. It returns papers that are cited by the specified paper, along with citation contexts (the text snippets where the reference appears).

Parameters:

Name Type Description Default
paper_id str

Paper identifier (see get_paper_details for supported formats).

required
fields str | None

Comma-separated list of fields to return for each cited paper. If None, defaults to "paperId,title,authors,year". See search_papers() for available fields.

None
limit int

Maximum number of references to return (default: 100, max: 1000).

100

Returns:

Type Description
dict[str, Any]

A dictionary with the following fields:

  • "offset": Current offset in the result set.
  • "next": Offset for the next page of results (if available).
  • "data": List of reference objects, each containing:
      • "citedPaper": Paper object with requested fields
      • "contexts": List of citation context strings
      • "intents": List of citation intent categories
      • "isInfluential": Boolean indicating if this is an influential citation
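When building a reference list for further lookups, it is useful to separate entries that carry a usable paper ID from those that do not. This sketch assumes that some "citedPaper" objects may have an empty or missing "paperId", which this reference does not confirm, and treats those as unresolved. The function name and the sample response are illustrative.

```python
def split_references(response):
    """Split references into resolved paper IDs and unresolved titles."""
    resolved, unresolved = [], []
    for reference in response.get("data", []):
        paper = reference.get("citedPaper") or {}
        paper_id = paper.get("paperId")
        if paper_id:
            resolved.append(paper_id)
        else:
            # Keep the title so the entry can still be reported.
            unresolved.append(paper.get("title"))
    return resolved, unresolved


# Hand-written response for illustration; not real API output.
response = {
    "offset": 0,
    "data": [
        {"citedPaper": {"paperId": "r1", "title": "Resolved Work"},
         "contexts": [], "intents": [], "isInfluential": False},
        {"citedPaper": {"paperId": None, "title": "Unresolved Work"},
         "contexts": [], "intents": [], "isInfluential": False},
    ],
}
resolved, unresolved = split_references(response)
print(resolved, unresolved)
```

The resolved IDs can be passed straight to get_paper_details, while the unresolved titles are candidates for a search_papers query.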