Exploring Ensembl

Work

Useful

Omics

A side quest in understanding how to work with genetic data.

Author

Bailey Andrew

Published

January 10, 2023

So, I’ve been working on blog posts that do a walkthrough on working with genetic data. I keep running into issues and having to go on sidequests (😅) but I think this one deserves its own blog post. Shoutout to Morgan for the help.

A major resource in bioinformatics is Ensembl. In this blog post, we’ll spend some time exploring it, capping off with using it to accomplish the useful task of how to grab all known mitochondrial genes in Zebrafish (Danio rerio).

Suppose in our dataset we had a gene called ENSDARG00000000001. What can we actually say about it? Well, first of all, this is an Ensembl ID, so it follows the pattern ENS[species prefix][feature type prefix][a unique eleven digit number]. In this case, DAR is the species (Danio rerio) and G indicates that it is referring to a gene.

We can look this gene up on Ensembl to learn more about it:

This gene is also known as slc35a5, but these names can be harder to work with as it is subject to change if our knowledge about its role in biological processes changes; the Ensembl ID is more permanent. On the other hand, this name is arguably more informative, as it is an abbreviation for its role as “solute carrier family 35 member A5”.

If we want to investigate this gene more, we can click on the ZFIN link provided in the summary section.

Here we can see yet another gene ID (ZDB-GENE-030616-55), as well as a link to the naming history of this gene which may be fun to explore. There are some other goodies on this page, but we’ll return to Enbembl as that is the topic of this post.

Our goal is to get a list of all mitochondrial genes in Danio rerio. ENSDARG00000000001 is not a mitochondrial gene, because it is located on chromosome 9. One way to find a list of mitochondrial genes is to search for genes with names beginning with mt-, because that is how mitochondrial genes are named.

However, this isn’t convenient for manual use!

Failed attempt at using scanpy

One way to try to get this data is to use scanpy:

# I had to also install pybiomart, which was only on pip
from scanpy import queries

queries.mitochondrial_genes(
    "drerio",
    attrname = "ensembl_gene_id"
)

HTTPError: 500 Server Error: Internal Server Error for url: http://www.ensembl.org:80/biomart/martservice?type=registry

Unfortunately, Biomart was down at the time I tried to do this! (Biomart seems to be the api for this type of stuff).

Use this data-mining tool to export custom datasets from Ensembl.

– Ensemble docs about what Biomart is

I found that the “asia” mirror of Biomart gives a different error:

mitgenes = queries.mitochondrial_genes(
    "drerio",
    attrname = "ensembl_gene_id",
    host = "asia.ensembl.org"
)
mitgenes

HTTPError: 504 Server Error: Gateway Time-out for url: http://asia.ensembl.org:80/biomart/martservice?type=registry

Another way to get the data is to search by location. We can play around with their region-searcher to see that the mitochondrial dna of a zebrafish is just over 16 kilobases long (16,596 bases to be exact).

If we do that, we can find this cute overview of the mitochondrial dna:

Not relevant to us, though - we want to click on the “export” button, which will give us this popup:

You can then download all the mitochondrial data! It’s a rather small file that looks like this:

seqname,source,feature,start,end,score,strand,frame,hid,hstart,hend,genscan,gene_id,transcript_id,exon_id,gene_type,variation_name,probe_name
MT,EVA,variation,113,113,.,+,.,,,,,,,,,rs508804888,
[...]
MT,Ensembl,gene,951,1019,.,+,0,,,,,ENSDARG00000083480.3,ENSDART00000116162.3,ENSDARE00000880048,Mt_tRNA,,
[...]

import pandas as pd

mito_data = pd.read_csv("./localdata/mt-danio-rerio.txt", sep=",")
mito_data

	seqname	source	feature	start	end	score	strand	frame	hid	hstart	hend	genscan	gene_id	transcript_id	exon_id	gene_type	variation_name	probe_name
0	MT	EVA	variation	113	113	.	+	.	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	rs508804888	NaN
1	MT	EVA	variation	239	239	.	+	.	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	rs513784503	NaN
2	MT	EVA	variation	314	314	.	+	.	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	rs504526537	NaN
3	MT	EVA	variation	339	339	.	+	.	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	rs511578098	NaN
4	MT	EVA	variation	438	438	.	+	.	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	rs514602887	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
174	MT	Ensembl	gene	14714	15232	.	-	0	NaN	NaN	NaN	NaN	ENSDARG00000063922.3	ENSDART00000093623.3	ENSDARE00000685766	protein_coding	NaN	NaN
175	MT	Ensembl	gene	15233	15301	.	-	0	NaN	NaN	NaN	NaN	ENSDARG00000083312.3	ENSDART00000116823.3	ENSDARE00000882905	Mt_tRNA	NaN	NaN
176	MT	Ensembl	gene	15308	16448	.	+	0	NaN	NaN	NaN	NaN	ENSDARG00000063924.3	ENSDART00000093625.3	ENSDARE00000685768	protein_coding	NaN	NaN
177	MT	Ensembl	gene	16449	16520	.	+	0	NaN	NaN	NaN	NaN	ENSDARG00000083462.3	ENSDART00000116552.3	ENSDARE00000881627	Mt_tRNA	NaN	NaN
178	MT	Ensembl	gene	16527	16596	.	-	1	NaN	NaN	NaN	NaN	ENSDARG00000081475.3	ENSDART00000115546.3	ENSDARE00000882281	Mt_tRNA	NaN	NaN

179 rows × 18 columns

# Grab all the genes, get their ids, and chop off the '.3' 
# ending which indicates the version
mito_genes = mito_data[mito_data["feature"] == "gene"]["gene_id"].apply(
    lambda x: x[:-2]
)
mito_genes

142    ENSDARG00000083480
143    ENSDARG00000082753
144    ENSDARG00000081443
145    ENSDARG00000080337
146    ENSDARG00000083046
147    ENSDARG00000063895
148    ENSDARG00000083118
149    ENSDARG00000080630
150    ENSDARG00000082084
151    ENSDARG00000063899
152    ENSDARG00000080718
153    ENSDARG00000080401
154    ENSDARG00000081938
155    ENSDARG00000082789
156    ENSDARG00000080128
157    ENSDARG00000063905
158    ENSDARG00000081369
159    ENSDARG00000083519
160    ENSDARG00000063908
161    ENSDARG00000080151
162    ENSDARG00000063910
163    ENSDARG00000063911
164    ENSDARG00000063912
165    ENSDARG00000081758
166    ENSDARG00000063914
167    ENSDARG00000080329
168    ENSDARG00000063916
169    ENSDARG00000063917
170    ENSDARG00000082716
171    ENSDARG00000082123
172    ENSDARG00000081280
173    ENSDARG00000063921
174    ENSDARG00000063922
175    ENSDARG00000083312
176    ENSDARG00000063924
177    ENSDARG00000083462
178    ENSDARG00000081475
Name: gene_id, dtype: object

mito_genes.to_csv("./localdata/mito-genes.csv", index=False)

So, uh, yeah - success!