[Bioinfo Blog 005] [Python Code 001]: FASTA file processing (unfinished)

Format description

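FASTA is a text-based format for representing nucleotide sequences (or amino acid sequences). Each nucleotide (or amino acid) is encoded as a single letter, and a sequence name and comments may be placed on a line before the sequence.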

>gi|46575915|ref|NM_008261.2| Mus musculus hepatic nuclear factor 4, alpha (Hnf4a), mRNA
GGGACCTGGGAGGAGGCAGGAGGAGGGCGGGGACGGGGGGGGCTGGGGCTCAGCCCAGGGGCTTGGGTGG

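A FASTA record begins with ">", followed immediately by the sequence identifier.
The sequence itself starts on the next line and represents one strand read from 5' to 3'; each line is usually no longer than 80 characters.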
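As a minimal sketch of how the header line described above can be taken apart in Python: the function name `parse_header` is hypothetical, and the split at the first whitespace follows a common convention (identifier first, free-text description after), not a rule stated in this post.

```python
def parse_header(header_line):
    """Split a FASTA header line into (identifier, description)."""
    if not header_line.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    # Assumption: the identifier runs up to the first whitespace,
    # and everything after it is a free-text description.
    parts = header_line[1:].strip().split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


header = ">gi|46575915|ref|NM_008261.2| Mus musculus hepatic nuclear factor 4, alpha (Hnf4a), mRNA"
identifier, description = parse_header(header)
print(identifier)
print(description)
```

For the example record above, the identifier is the pipe-delimited accession string and the description is the organism and gene text that follows it.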

Shell script for converting FASTQ to FASTA:

awk '{if(NR%4 == 1){print ">" substr($0, 2)}}{if(NR%4 == 2){print}}' fastq > fasta
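The same conversion can be sketched in Python. Like the awk one-liner, this assumes every FASTQ record occupies exactly four lines (header, sequence, "+", quality); the function name `fastq_to_fasta` and the in-memory example data are illustrative only.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Copy the header and sequence lines of 4-line FASTQ records as FASTA."""
    for line_number, line in enumerate(fastq_handle):
        position = line_number % 4
        if position == 0:
            # Drop the leading '@' of the FASTQ header and write '>' instead
            fasta_handle.write(">" + line.rstrip("\n")[1:] + "\n")
        elif position == 1:
            # The sequence line is copied unchanged
            fasta_handle.write(line.rstrip("\n") + "\n")
        # Positions 2 (the '+' line) and 3 (quality scores) are skipped


# Made-up two-record FASTQ input, held in memory for the demonstration
example_fastq = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n")
example_fasta = io.StringIO()
fastq_to_fasta(example_fastq, example_fasta)
print(example_fasta.getvalue(), end="")
```

With real files, the two `io.StringIO` objects would be replaced by handles from `open(...)`.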

FASTA file processing

Reading files

When the sequences need no processing

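## param file: path to a FASTA-format file
## return: None

def fa_cat(file):
    ## Print every line with surrounding whitespace removed;
    ## the with block closes the file when the loop ends
    with open(file) as fa_in:
        for line in fa_in:
            print(line.strip())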

fa_cat("test1.fa")

When the sequences need processing and are written back out as FASTA

## Define a list to contain seq info
seqInfo = []
## Define a list to contain seq
mrnaSeq = []
fa_Num = -1

## Read the file one line each time
with open("test1.fa", "r") as fa_in:
	for line in fa_in:
		## Remove \n at the end of each line
		line = line.rstrip()
		## Skip blank lines so that line[0] cannot fail
		if not line:
			continue
		## If lines start with > ,save in seqInfo,else in mrnaSeq
		if line[0] == ">":
			seqInfo.append(line)
			fa_Num = fa_Num + 1
			mrnaSeq.append("")
		else:
			## Upper-case and replace T with U on sequence lines only,
			## so that the header text is left unchanged
			mrnaSeq[fa_Num] = mrnaSeq[fa_Num] + line.upper().replace("T", "U")

Dropping the header lines and reading the sequence lines into a list

## Read the file into a list; each element is one line of sequence as a string
ls = []
with open('test1.fa') as fa_in:
        for line in fa_in:
                if not line.startswith('>'):
                        ls.append(line.replace('\n', ''))

Reading into a dictionary to keep each sequence with its name

seq = {}
with open('test1.fa') as fa_in:
        for line in fa_in:
                if line.startswith('>'):
                        ## Use the first whitespace-delimited field of the header as the key
                        name = line[1:].split()[0]
                        seq[name] = ''
                else:
                        ## Append this sequence line to the current record
                        seq[name] += line.strip()
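The dictionary loop above assumes that the file starts with a header line and that every name is unique. Otherwise the first sequence line raises a NameError, and a repeated name silently resets the earlier record. A hedged sketch that makes both cases explicit follows; the function name `read_fasta_dict` and the example data are hypothetical.

```python
import io


def read_fasta_dict(handle):
    """Return {name: sequence} built from an open FASTA handle."""
    sequences = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            # First whitespace-delimited field of the header is the key
            name = fields[0] if fields else ""
            if name in sequences:
                raise ValueError("duplicate sequence name: " + name)
            sequences[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first header line")
        else:
            sequences[name] += line
    return sequences


# Made-up input held in memory; a real call would pass open("test1.fa")
example = io.StringIO(">seq1 first record\nACGT\nTTAA\n\n>seq2\nGGCC\n")
seqs = read_fasta_dict(example)
for seq_name, sequence in seqs.items():
    print(seq_name, len(sequence))
```

Raising on duplicates is one possible policy; keeping the first record or merging the records would be equally defensible, depending on the data.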

Writing files

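## Output the file
## proSeq is assumed to be a list of processed sequences, one per entry in
## seqInfo; it is not defined in this post
## Define a file to write in
res = open("test2.fa", "w")
for i in range(fa_Num + 1):
	res.write(seqInfo[i] + "\n")
	while len(proSeq[i]) > 70:
		## Each line contains 70 characters
		res.write(proSeq[i][:70] + "\n")
		proSeq[i] = proSeq[i][70:]
	## Write the remaining characters (70 or fewer) of this sequence
	res.write(proSeq[i] + "\n")
res.close()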
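The line-wrapping step can also be written with a stepped `range`, which leaves the stored sequences unmodified. This is a minimal sketch: the function name `write_fasta`, its parameters, and the example data are assumptions for illustration, and unlike the loop above it writes no sequence line at all for an empty sequence.

```python
import io


def write_fasta(handle, headers, sequences, width=70):
    """Write records to handle, wrapping each sequence at `width` characters."""
    for header, sequence in zip(headers, sequences):
        handle.write(header + "\n")
        # Slice the sequence into consecutive chunks of at most `width` characters
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


# Made-up record of 150 characters, written to an in-memory buffer
out = io.StringIO()
write_fasta(out, [">seq1"], ["A" * 150], width=70)
print(out.getvalue(), end="")
```

With a real file, `out` would be a handle opened in a `with open("test2.fa", "w")` block, so the file is closed even if writing fails.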