3rd Party Tools

Gene Mapping

genbio.toolkit.gene_mapping_api.biomart_gene_mapping

biomart_gene_mapping(gene_ids: list[str], dataset: str = 'hsapiens_gene_ensembl', filter_type: str = 'external_gene_name', host: str = 'www.ensembl.org', include_go: bool = False) -> pd.DataFrame

Map and convert gene identifiers using Ensembl BioMart.

Notes

This function uses the Ensembl BioMart web service to convert between different gene identifier types and retrieve gene annotations. BioMart provides access to comprehensive gene information including genomic coordinates, gene descriptions, and Gene Ontology (GO) annotations.

Common use cases:

- Convert between identifier types (e.g., gene symbols to Ensembl IDs)
- Retrieve gene annotations and descriptions
- Get genomic coordinates (chromosome, start/end positions)
- Map genes to Gene Ontology terms
- Cross-reference between databases (Ensembl, Entrez, HGNC)

The function queries Ensembl's public BioMart service and returns extended gene information. When include_go=True, GO annotations are included, which will create multiple rows per gene (one for each GO term).

For more information about BioMart and available datasets, visit: https://www.ensembl.org/biomart

Parameters:

Name	Type	Description	Default
`gene_ids`	`list[str]`	List of gene identifiers to map or convert (e.g., ['TP53', 'BRCA1', 'EGFR']). The identifier type should match the filter_type parameter.	required
`dataset`	`str`	Ensembl BioMart dataset name (default: 'hsapiens_gene_ensembl'). Common datasets: 'hsapiens_gene_ensembl' (human genes); 'mmusculus_gene_ensembl' (mouse genes); 'drerio_gene_ensembl' (zebrafish genes); 'rnorvegicus_gene_ensembl' (rat genes); 'dmelanogaster_gene_ensembl' (fruit fly genes); 'celegans_gene_ensembl' (C. elegans genes).	`'hsapiens_gene_ensembl'`
`filter_type`	`str`	Type of input gene identifiers (default: 'external_gene_name'). Common filter types: 'external_gene_name' (gene symbols, e.g., 'TP53', 'BRCA1'); 'ensembl_gene_id' (Ensembl gene IDs, e.g., 'ENSG00000141510'); 'entrezgene_id' (NCBI Entrez gene IDs, numeric); 'hgnc_id' (HGNC IDs, for human genes); 'uniprot_gn_id' (UniProt gene names).	`'external_gene_name'`
`host`	`str`	Ensembl BioMart host server (default: 'www.ensembl.org'). Alternative mirror servers: 'useast.ensembl.org' (US East mirror); 'asia.ensembl.org' (Asia mirror). The function automatically falls back to the mirrors if the primary host fails.	`'www.ensembl.org'`
`include_go`	`bool`	Include Gene Ontology term IDs in results (default: False). When True, adds 'go_id' column but creates multiple rows per gene (one row for each GO term associated with the gene). When False, returns one row per gene without GO annotations.	`False`
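
The mirror fallback described for `host` can be pictured with a small sketch. This is not the library's implementation: `query_with_fallback` and `run_query` are hypothetical names, the host order is an assumption, and the query function is a stand-in that simulates a failing primary host.

```python
from __future__ import annotations
from typing import Callable, Sequence

# Hosts listed in the parameter table; trying them in this order is an assumption.
HOSTS = ("www.ensembl.org", "useast.ensembl.org", "asia.ensembl.org")


def query_with_fallback(
    run_query: Callable[[str], list],
    hosts: Sequence[str] = HOSTS,
) -> tuple:
    """Try each host in order and return (host, result) from the first that works.

    `run_query` is a hypothetical callable that performs one BioMart query
    against the given host and raises an exception on failure.
    """
    errors = []
    for host in hosts:
        try:
            return host, run_query(host)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{host}: {exc}")
    raise ConnectionError("all BioMart hosts failed: " + "; ".join(errors))


# Stand-in query: the primary host "times out", the first mirror answers.
def fake_query(host: str) -> list:
    if host == "www.ensembl.org":
        raise TimeoutError("simulated timeout")
    return [{"external_gene_name": "TP53", "ensembl_gene_id": "ENSG00000141510"}]


used_host, rows = query_with_fallback(fake_query)
print(used_host, rows)
```

Passing the query as a callable keeps the retry logic independent of how the request itself is made.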

Returns:

Type	Description
`DataFrame`	pandas DataFrame with gene mapping and annotation results, containing columns:
`DataFrame`	'ensembl_gene_id': Ensembl gene identifier
`DataFrame`	'external_gene_name': Gene symbol
`DataFrame`	'entrezgene_id': NCBI Entrez gene ID (may be NA for some genes)
`DataFrame`	'description': Human-readable gene description
`DataFrame`	'chromosome_name': Chromosome location (e.g., '1', 'X', 'MT')
`DataFrame`	'start_position': Gene start position (base pairs)
`DataFrame`	'end_position': Gene end position (base pairs)
`DataFrame`	'gene_biotype': Gene type (e.g., 'protein_coding', 'lncRNA', 'pseudogene')
`DataFrame`	'go_id': Gene Ontology term ID (only when include_go=True)
`DataFrame`	When include_go=False (default), returns one row per gene.
`DataFrame`	When include_go=True, a gene appears in multiple rows if it is associated with multiple GO terms.
`DataFrame`	Returns an empty DataFrame if no matching genes are found in BioMart.
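
Because `include_go=True` yields one row per gene and GO term, a common follow-up is to collapse the rows back to one entry per gene. The sketch below works on plain row dictionaries (a DataFrame can be converted with `df.to_dict("records")`); `collapse_go_terms` is a hypothetical helper and the gene and GO identifiers are placeholders, not real annotations.

```python
from __future__ import annotations


def collapse_go_terms(records: list) -> dict:
    """Group GO term IDs by Ensembl gene ID, dropping missing and duplicate terms."""
    go_by_gene: dict = {}
    for row in records:
        # setdefault keeps genes without any GO term in the result (empty list).
        terms = go_by_gene.setdefault(row["ensembl_gene_id"], [])
        go_id = row.get("go_id")
        # Missing values arrive as None or NaN, so only non-empty strings count.
        if isinstance(go_id, str) and go_id and go_id not in terms:
            terms.append(go_id)
    return go_by_gene


# Placeholder rows in the documented long format.
rows = [
    {"ensembl_gene_id": "ENSG_A", "go_id": "GO:term1"},
    {"ensembl_gene_id": "ENSG_A", "go_id": "GO:term2"},
    {"ensembl_gene_id": "ENSG_A", "go_id": "GO:term1"},  # duplicate row
    {"ensembl_gene_id": "ENSG_B", "go_id": None},        # gene without GO terms
]
go_terms = collapse_go_terms(rows)
print(go_terms)
```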

genbio.toolkit.gene_mapping_api.get_orthology_table

get_orthology_table() -> pd.DataFrame

Load orthology mapping table for human, mouse, marmoset, and rhesus macaque genes.

Notes

This function loads a pre-compiled orthology table containing gene mappings between four species: human, mouse, common marmoset, and rhesus macaque. The table includes Ensembl gene IDs, NCBI gene IDs, and gene symbols for each species where orthologous genes have been identified.

The table is downloaded directly from https://raw.githubusercontent.com/AllenInstitute/GeneOrthology/refs/heads/main/csv/mouse_human_marmoset_macaque_orthologs_20231113.csv

Common use cases:

- Convert gene identifiers between model organisms
- Find mouse orthologs for human disease genes
- Identify conserved genes across primate species
- Cross-reference experimental results between species
- Filter for genes with established orthologs in specific organisms

Returns:

Type	Description
`DataFrame`	pandas DataFrame with orthology mappings containing 14 columns:
`DataFrame`	'human_geneid': NCBI Entrez gene ID (integer)
`DataFrame`	'human_EnsemblID': Ensembl gene identifier (e.g., 'ENSG00000121410')
`DataFrame`	'human_Symbol': Official gene symbol (e.g., 'A1BG')
`DataFrame`	'human_type_of_gene': Gene type (e.g., 'protein-coding', 'ncRNA')
`DataFrame`	'human_description': Gene description/name
`DataFrame`	'marmoset_EnsemblID': Ensembl gene identifier (e.g., 'ENSCJAG00000000314')
`DataFrame`	'marmoset_geneid': NCBI Entrez gene ID (integer)
`DataFrame`	'marmoset_Symbol': Gene symbol
`DataFrame`	'mouse_EnsemblID': Ensembl gene identifier (e.g., 'ENSMUSG00000022347')
`DataFrame`	'mouse_geneid': NCBI Entrez gene ID (integer)
`DataFrame`	'mouse_Symbol': Gene symbol (e.g., 'A1bg')
`DataFrame`	'rhesus_macaque_EnsemblID': Ensembl gene identifier (e.g., 'ENSMMUG00000012459')
`DataFrame`	'rhesus_macaque_geneid': NCBI Entrez gene ID (integer)
`DataFrame`	'rhesus_macaque_Symbol': Gene symbol
`DataFrame`	The table contains approximately 18,000 human genes with their orthologs.
`DataFrame`	Missing ortholog information is represented as NaN in the DataFrame.
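
As an illustration of working with the NaN-marked gaps, the following sketch builds a human-to-mouse symbol lookup from rows shaped like the table above. `human_to_mouse_symbols` and `is_missing` are hypothetical helpers operating on row dictionaries (for example from `df.to_dict("records")`), and the second sample row is made up to show a missing ortholog.

```python
from __future__ import annotations
import math


def is_missing(value: object) -> bool:
    """True for None or NaN, the marker used for absent ortholog information."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def human_to_mouse_symbols(records: list) -> dict:
    """Map human gene symbols to mouse symbols, skipping rows without a mouse ortholog."""
    mapping = {}
    for row in records:
        human = row.get("human_Symbol")
        mouse = row.get("mouse_Symbol")
        if not is_missing(human) and not is_missing(mouse):
            mapping[human] = mouse
    return mapping


rows = [
    {"human_Symbol": "A1BG", "mouse_Symbol": "A1bg"},
    # Made-up row: a human gene with no recorded mouse ortholog.
    {"human_Symbol": "EXAMPLE_GENE", "mouse_Symbol": float("nan")},
]
symbol_map = human_to_mouse_symbols(rows)
print(symbol_map)
```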

Semantic Scholar

genbio.toolkit.semantic_scholar_apis.search_papers

search_papers(query: str, fields: str | None = None, limit: int = 10) -> dict[str, Any]

Search for papers by keyword query.

Notes

This function accesses the Semantic Scholar Academic Graph API paper search endpoint. It returns papers matching the query string, ranked by relevance. The search supports natural language queries and returns paginated results.

Parameters:

Name	Type	Description	Default
`query`	`str`	Search query string (e.g., "COVID-19 vaccines", "neural networks"). Supports natural language queries.	required
`fields`	`str \| None`	Comma-separated list of fields to return (e.g., "paperId,title,authors"). If None, defaults to "paperId,title,authors,year,abstract,url". Available fields include: paperId, externalIds, url, title, abstract, venue, year, referenceCount, citationCount, influentialCitationCount, isOpenAccess, fieldsOfStudy, s2FieldsOfStudy, publicationTypes, publicationDate, journal, authors, citations, references, embedding.	`None`
`limit`	`int`	Maximum number of results to return (default: 10, max: 100).	`10`

Returns:

Type	Description
`dict[str, Any]`	A dictionary with the following fields:
`dict[str, Any]`	"total": Total number of results matching the query.
`dict[str, Any]`	"offset": Current offset in the result set.
`dict[str, Any]`	"next": Offset for the next page of results (if available).
`dict[str, Any]`	"data": List of paper objects, each containing the requested fields.
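
To make the parameters concrete, here is a sketch of the request URL such a search could produce against the public Academic Graph API paper search endpoint. `build_search_url` is a hypothetical helper, not part of the toolkit, and clamping `limit` to the documented maximum is an assumption; the wrapper may instead reject larger values.

```python
from __future__ import annotations
from urllib.parse import urlencode

SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
# Default field list as documented for search_papers.
DEFAULT_FIELDS = "paperId,title,authors,year,abstract,url"


def build_search_url(query: str, fields: str | None = None, limit: int = 10) -> str:
    """Assemble the query string for a paper search request."""
    params = {
        "query": query,
        "fields": fields or DEFAULT_FIELDS,
        # Assumption: out-of-range limits are clamped to 1..100.
        "limit": max(1, min(limit, 100)),
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


url = build_search_url("neural networks", fields="paperId,title", limit=500)
print(url)
```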

genbio.toolkit.semantic_scholar_apis.get_paper_details

get_paper_details(paper_id: str, fields: str | None = None) -> dict[str, Any]

Get detailed information about a specific paper.

Notes

This function accesses the Semantic Scholar Academic Graph API paper details endpoint. It returns comprehensive information about a single paper identified by its ID. It supports multiple ID formats, including S2 paper ID, DOI, ArXiv ID, MAG ID, ACL ID, PubMed ID, and Corpus ID.

Parameters:

Name	Type	Description	Default
`paper_id`	`str`	Paper identifier in one of the following formats: S2 paper ID (e.g., "649def34f8be52c8b66281af98ae884c09aef38b"); DOI (e.g., "DOI:10.1038/s41586-020-2012-7"); ArXiv ID (e.g., "ARXIV:2106.15928"); PubMed ID (e.g., "PMID:33268865"); Corpus ID (e.g., "CorpusID:3658586"); MAG ID (e.g., "MAG:112218234"); ACL ID (e.g., "ACL:W12-3903").	required
`fields`	`str \| None`	Comma-separated list of fields to return. If None, defaults to "paperId,title,authors,year,abstract,url,citationCount,referenceCount,publicationDate". See search_papers() for available fields.	`None`
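
Since `paper_id` accepts several prefixed formats, it can help to check an identifier before sending a request. The sketch below is a hypothetical pre-flight check, not something the toolkit provides: it assumes S2 paper IDs are 40-character lowercase hexadecimal strings (as in the example above) and matches prefixes case-insensitively for leniency.

```python
from __future__ import annotations
import re

# Prefixes taken from the paper_id description.
KNOWN_PREFIXES = ("DOI", "ARXIV", "PMID", "CorpusID", "MAG", "ACL")
# Assumption: a bare S2 paper ID is a 40-character lowercase hex string.
S2_ID = re.compile(r"[0-9a-f]{40}")


def classify_paper_id(paper_id: str) -> str:
    """Return the ID kind ('S2', 'DOI', 'ARXIV', ...) or raise ValueError."""
    if S2_ID.fullmatch(paper_id):
        return "S2"
    prefix, sep, rest = paper_id.partition(":")
    if sep and rest:
        for known in KNOWN_PREFIXES:
            if prefix.lower() == known.lower():
                return known
    raise ValueError(f"unrecognised paper ID format: {paper_id!r}")


examples = [
    "649def34f8be52c8b66281af98ae884c09aef38b",
    "DOI:10.1038/s41586-020-2012-7",
    "ARXIV:2106.15928",
    "CorpusID:3658586",
]
kinds = [classify_paper_id(p) for p in examples]
print(kinds)
```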

Returns:

Type	Description
`dict[str, Any]`	A dictionary containing the requested fields for the paper.
`dict[str, Any]`	Returns None if the paper is not found.

genbio.toolkit.semantic_scholar_apis.get_paper_citations

get_paper_citations(paper_id: str, fields: str | None = None, limit: int = 100) -> dict[str, Any]

Get citations for a specific paper.

Notes

This function accesses the Semantic Scholar Academic Graph API citations endpoint. It returns papers that cite the specified paper, along with citation contexts (the text snippets where the citation appears).

Parameters:

Name	Type	Description	Default
`paper_id`	`str`	Paper identifier (see get_paper_details for supported formats).	required
`fields`	`str \| None`	Comma-separated list of fields to return for each citing paper. If None, defaults to "paperId,title,authors,year". See search_papers() for available fields.	`None`
`limit`	`int`	Maximum number of citations to return (default: 100, max: 1000).	`100`

Returns:

Type	Description
`dict[str, Any]`	A dictionary with the following fields:
`dict[str, Any]`	"offset": Current offset in the result set.
`dict[str, Any]`	"next": Offset for the next page of results (if available).
`dict[str, Any]`	"data": List of citation objects, each containing "citingPaper" (paper object with the requested fields), "contexts" (list of citation context strings), "intents" (list of citation intent categories: Background, Methodology, ResultComparison), and "isInfluential" (boolean indicating whether this is an influential citation).
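
A response of this shape can be summarised with a few lines of plain Python. The sketch below counts intents and collects the titles of influential citing papers; `summarize_citations` is a hypothetical helper and the sample response is made up to match the documented structure.

```python
from __future__ import annotations
from collections import Counter


def summarize_citations(response: dict) -> dict:
    """Count citation intents and collect titles of influential citing papers."""
    citations = response.get("data") or []
    intent_counts = Counter(
        intent for item in citations for intent in (item.get("intents") or [])
    )
    influential_titles = [
        item["citingPaper"].get("title")
        for item in citations
        if item.get("isInfluential")
    ]
    return {
        "total": len(citations),
        "intents": dict(intent_counts),
        "influential_titles": influential_titles,
    }


# Made-up response following the documented fields.
response = {
    "offset": 0,
    "data": [
        {"citingPaper": {"paperId": "p1", "title": "Paper One"},
         "contexts": ["..."], "intents": ["Background"], "isInfluential": False},
        {"citingPaper": {"paperId": "p2", "title": "Paper Two"},
         "contexts": ["..."], "intents": ["Background", "Methodology"], "isInfluential": True},
        {"citingPaper": {"paperId": "p3", "title": "Paper Three"},
         "contexts": [], "intents": [], "isInfluential": False},
    ],
}
summary = summarize_citations(response)
print(summary)
```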

genbio.toolkit.semantic_scholar_apis.get_paper_references

get_paper_references(paper_id: str, fields: str | None = None, limit: int = 100) -> dict[str, Any]

Get references cited by a specific paper.

Notes

This function accesses the Semantic Scholar Academic Graph API references endpoint. It returns papers that are cited by the specified paper, along with citation contexts (the text snippets where the reference appears).

Parameters:

Name	Type	Description	Default
`paper_id`	`str`	Paper identifier (see get_paper_details for supported formats).	required
`fields`	`str \| None`	Comma-separated list of fields to return for each cited paper. If None, defaults to "paperId,title,authors,year". See search_papers() for available fields.	`None`
`limit`	`int`	Maximum number of references to return (default: 100, max: 1000).	`100`

Returns:

Type	Description
`dict[str, Any]`	A dictionary with the following fields:
`dict[str, Any]`	"offset": Current offset in the result set.
`dict[str, Any]`	"next": Offset for the next page of results (if available).
`dict[str, Any]`	"data": List of reference objects, each containing "citedPaper" (paper object with the requested fields), "contexts" (list of citation context strings), "intents" (list of citation intent categories), and "isInfluential" (boolean indicating whether this is an influential citation).
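
One way to use the reference list is to order the cited papers by publication year. The sketch below assumes the "year" field was requested and that some entries may lack a resolved "paperId" or a year, in which case they are skipped; `cited_papers_by_year` is a hypothetical helper and the sample response is made up.

```python
from __future__ import annotations


def cited_papers_by_year(response: dict) -> list:
    """Return (year, title) pairs for resolvable references, newest first."""
    papers = []
    for item in response.get("data") or []:
        paper = item.get("citedPaper") or {}
        # Assumption: unresolved references carry no paperId and are skipped.
        if paper.get("paperId") and paper.get("year") is not None:
            papers.append((paper["year"], paper.get("title", "")))
    return sorted(papers, key=lambda pair: pair[0], reverse=True)


# Made-up response following the documented fields.
response = {
    "offset": 0,
    "data": [
        {"citedPaper": {"paperId": "r1", "title": "Older Work", "year": 2019}},
        {"citedPaper": {"paperId": None, "title": "Unresolved Entry", "year": 2020}},
        {"citedPaper": {"paperId": "r2", "title": "Newer Work", "year": 2021}},
    ],
}
ordered = cited_papers_by_year(response)
print(ordered)
```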