How to extract gene annotation information from a GFF file

Published: 2017-01-20 14:05:04

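GFF3 is one of the most common formats for gene annotation files. For a detailed description of the format, see https://archive.broadinstitute.org/annotation/argo/help/gff3.html

In short, a GFF3 file is a tab-delimited text file with 9 columns, described as follows: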

1. seqname

The name of the sequence. Typically a chromosome or a contig. Argo does not care what you put here. It will superimpose GFF features on any sequence you like.

2. source

The program that generated this feature. Argo displays the value of this field in the inspector but does not do anything special with it.

3. feature

The name of this type of feature. The official GFF3 spec states that this should be a term from the SOFA ontology, but Argo does not do anything with this value except display it.

4. start

The starting position of the feature in the sequence. The first base is numbered 1.

5. end

The ending position of the feature (inclusive).

6. score

A score between 0 and 1000. If there is no score value, enter ".".

7. strand

Valid entries include '+', '-', or '.' (for don't know/don't care).

8. frame

If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'. Argo does not do anything with this field except display its value.

9. GFF3: grouping attributes

Attribute keys and values are separated by '=' signs. Values must be URI encoded. Attribute pairs are separated by semicolons. Certain special attributes are used for grouping and identification. This field is the one important difference between GFF flavors; for the other flavors, see

https://archive.broadinstitute.org/annotation/argo/help/gff.html
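To make the nine columns concrete, the sketch below prints each field of a single record next to its column name. The record is invented for illustration (the gene `GENE1` and the source `ExampleSource` are hypothetical) and is not taken from the NCBI file.

```shell
# One made-up GFF3 record (hypothetical gene, for illustration only).
record=$(printf 'chr1\tExampleSource\tgene\t100\t900\t.\t+\t.\tID=gene0;Name=GENE1;gene_biotype=protein_coding')

# Print the column number, the column name, and the value of each of the
# nine tab-separated fields.
labeled=$(printf '%s\n' "$record" | awk -F '\t' '
{
    n = split("seqname source feature start end score strand frame attributes", name, " ")
    for (i = 1; i <= n; i++) print i, name[i], $i
}')
printf '%s\n' "$labeled"
```

Only column 9 has internal structure (semicolon-separated `key=value` pairs), which is why the rest of this article focuses on it.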



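In bioinformatics analyses we often need to extract gene annotation information (column 9) and append it to a result table of differentially expressed genes or genes of interest. Column 9 usually holds many attributes, and the set of attributes is not the same for every gene, so in general we only need a few important ones, such as Dbxref, gene_biotype, and description.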

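This article uses the human GRCh38.p7 annotation file released by NCBI as an example and performs the extraction with awk. The awk manual is available at:

https://www.gnu.org/software/gawk/manual/gawk.html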


1. Download the annotation file for the target species:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000001405.33_GRCh38.p7/GCF_000001405.33_GRCh38.p7_genomic.gff.gz


Then decompress GCF_000001405.33_GRCh38.p7_genomic.gff.gz

to obtain the decompressed file GCF_000001405.33_GRCh38.p7_genomic.gff
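The decompression step can be sketched with gzip, assuming gzip is installed. So that the example runs without downloading anything, it uses a tiny stand-in archive named `demo.gff.gz` (a hypothetical file created on the spot) in a scratch directory.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for the downloaded archive: a tiny gzip-compressed GFF3 file.
printf '##gff-version 3\nchr1\tExampleSource\tgene\t100\t900\t.\t+\t.\tID=gene0;Name=GENE1\n' | gzip -c > demo.gff.gz

# gzip -dc writes the decompressed data to stdout and keeps the .gz file.
gzip -dc demo.gff.gz > demo.gff

# Quick sanity check: count the records, skipping comment lines.
n_records=$(grep -vc '^#' demo.gff)
echo "records: $n_records"
```

For the real file, the same `gzip -dc` call applies with GCF_000001405.33_GRCh38.p7_genomic.gff.gz as input and GCF_000001405.33_GRCh38.p7_genomic.gff as output.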

 


2. Check which attribute keys occur in column 9:

$ awk 'BEGIN{FS=OFS="\t"} $3=="gene"{split($9, a, ";"); for(i in a){split(a[i], b, "="); if(++c[b[1]]==1) print b[1]}}' GCF_000001405.33_GRCh38.p7_genomic.gff


The command prints the following keys:

ID, Dbxref, Name, description

gbkey, gene, gene_biotype, pseudo, gene_synonym, partial, start_range, end_range

exception, Note
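Because not every gene carries every key, it can help to count how many gene records contain each one. The following is a minimal sketch on three invented records (all names and values are made up); for real data, the GFF file would be passed to awk in place of the sample.

```shell
# Three made-up records: two genes and one mRNA. The second gene has no
# description attribute.
sample=$(
  printf 'chr1\tEx\tgene\t1\t90\t.\t+\t.\tID=gene0;Name=GENEA;gene_biotype=protein_coding;description=made-up gene A\n'
  printf 'chr1\tEx\tmRNA\t1\t90\t.\t+\t.\tID=rna0;Parent=gene0\n'
  printf 'chr2\tEx\tgene\t5\t80\t.\t-\t.\tID=gene1;Name=GENEB;gene_biotype=lncRNA\n'
)

# For every gene record, split column 9 into key=value pairs and count
# how often each key occurs.
key_counts=$(printf '%s\n' "$sample" | awk -F '\t' '
$3 == "gene" {
    n = split($9, pair, ";")
    for (i = 1; i <= n; i++) {
        split(pair[i], kv, "=")
        count[kv[1]]++
    }
}
END { for (k in count) print k "\t" count[k] }' | sort)
printf '%s\n' "$key_counts"
```

A key whose count is lower than the number of gene records (here `description`) will produce empty cells when it is extracted into a table.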


Then use the following command to view column 9 of the gene records in the GFF3 file:

$ awk -F "\t" '$3=="gene"{print $9}' GCF_000001405.33_GRCh38.p7_genomic.gff | cat -n | less






3. Next, use awk to extract the gene annotation information (taking Dbxref, gene_biotype, and description as an example):

$ awk 'BEGIN{FS=OFS="\t"} $3=="gene"{print $0}' GCF_000001405.33_GRCh38.p7_genomic.gff |
sed 's/;/\t/g' |
awk 'BEGIN{FS=OFS="\t"} {for(i=1; i<=NF; i++){split($i, a, "=");
b[a[1]]=a[2]}} {print b["Name"],b["Dbxref"],b["gene_biotype"],b["description"]}
{split("", b, ":")}'


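The terminal shows the extracted information, tab-delimited, in the order Name, Dbxref, gene_biotype, description.

Note: some genes lack certain attributes. For example, the gene LOC105379212 has no description, so the corresponding column is an empty string.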
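The empty-column behaviour can be checked on a small sample. The sketch below is a variant of the extraction, not the exact pipeline above: it splits only column 9 inside awk instead of converting semicolons with sed, and it splits each pair at the first '=' so that values containing '=' stay intact. The records and their values are invented.

```shell
# Two made-up gene records; the second one has no description.
sample=$(
  printf 'chr1\tEx\tgene\t1\t90\t.\t+\t.\tID=gene0;Dbxref=GeneID:1;Name=GENEA;gene_biotype=protein_coding;description=made-up gene A\n'
  printf 'chr2\tEx\tgene\t5\t80\t.\t-\t.\tID=gene1;Dbxref=GeneID:2;Name=GENEB;gene_biotype=lncRNA\n'
)

extracted=$(printf '%s\n' "$sample" | awk 'BEGIN { FS = OFS = "\t" }
$3 == "gene" {
    # Start from an empty map for every record so that values never
    # carry over from the previous gene.
    split("", attr)
    n = split($9, pair, ";")
    for (i = 1; i <= n; i++) {
        eq = index(pair[i], "=")
        if (eq > 0) attr[substr(pair[i], 1, eq - 1)] = substr(pair[i], eq + 1)
    }
    # A key that is absent for this gene prints as an empty field.
    print attr["Name"], attr["Dbxref"], attr["gene_biotype"], attr["description"]
}')
printf '%s\n' "$extracted"
```

Both output rows have four tab-separated fields; the row for `GENEB` simply ends with an empty fourth field.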

 


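4. The extracted information printed to the terminal can be given a header line and written to a file. Genes that occur on more than one chromosome appear more than once in column 1, so only the first occurrence of each name should be kept. Pipe the result of step 3 into the following commands:

$ sed '1i Name\tDbxref\tgene_biotype\tdescription' | awk -F "\t" '++a[$1]==1'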
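The header and de-duplication can also be done in a single awk call, which avoids relying on the one-line form of `1i` and the `\t` escape that GNU sed accepts. This is a sketch on made-up rows standing in for the output of step 3; the temporary output file is a stand-in for a real path such as `gene_annotation.tsv` (a hypothetical name).

```shell
# Made-up output of step 3. GENEA appears twice, as a gene annotated on
# two chromosomes would.
rows=$(printf 'GENEA\tGeneID:1\tprotein_coding\tmade-up gene A\nGENEB\tGeneID:2\tlncRNA\t\nGENEA\tGeneID:1\tprotein_coding\tmade-up gene A\n')

outfile=$(mktemp)   # stand-in for a real output path

printf '%s\n' "$rows" | awk 'BEGIN {
    FS = OFS = "\t"
    print "Name", "Dbxref", "gene_biotype", "description"   # header row
}
!seen[$1]++' > "$outfile"

cat "$outfile"
```

The pattern `!seen[$1]++` is true only the first time a value of column 1 is encountered, so it keeps the first row for each gene name, the same effect as `++a[$1]==1`.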




Editor: 欧易生物. Copyright: http://www.oebiotech.com. Please credit the source when reposting.
