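Format description
FASTA is a text-based format for representing nucleotide sequences (or amino acid sequences). Each base (or amino acid) is encoded as a single letter, and a sequence name and comments may be placed before the sequence.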
>gi|46575915|ref|NM_008261.2| Mus musculus hepatic nuclear factor 4, alpha (Hnf4a), mRNA
GGGACCTGGGAGGAGGCAGGAGGAGGGCGGGGACGGGGGGGGCTGGGGCTCAGCCCAGGGGCTTGGGTGG
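A FASTA record begins with ">", followed immediately by the sequence identifier.
The sequence starts on the next line and represents one strand read from 5' to 3'; each line is generally no longer than 80 characters.
Shell script for converting FASTQ to FASTA (assumes every FASTQ record spans exactly four lines):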
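The header layout described above can be sketched in Python. This is a minimal illustration, not part of the original notes: the function name parse_header is hypothetical, and it assumes that the identifier is everything between ">" and the first whitespace, with the remainder treated as a free-text description.

```python
def parse_header(header_line):
    """Split a FASTA header line into (identifier, description)."""
    if not header_line.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    # Assumption: the identifier runs up to the first whitespace;
    # whatever follows is the free-text description.
    parts = header_line[1:].strip().split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


header = ">gi|46575915|ref|NM_008261.2| Mus musculus hepatic nuclear factor 4, alpha (Hnf4a), mRNA"
identifier, description = parse_header(header)
print(identifier)   # gi|46575915|ref|NM_008261.2|
print(description)  # Mus musculus hepatic nuclear factor 4, alpha (Hnf4a), mRNA
```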
awk '{if(NR%4 == 1){print ">" substr($0, 2)}}{if(NR%4 == 2){print}}' fastq > fasta
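For readers who prefer Python, the same conversion might look roughly like the sketch below. The function name fastq_to_fasta is hypothetical, and like the awk command it assumes that every FASTQ record occupies exactly four lines (header, sequence, "+", quality).

```python
def fastq_to_fasta(fastq_lines):
    """Yield FASTA lines from an iterable of FASTQ lines (4 lines per record)."""
    for index, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if index % 4 == 0:
            # Header line: replace the leading "@" with ">"
            yield ">" + line[1:]
        elif index % 4 == 1:
            # Sequence line: copied unchanged
            yield line
        # The "+" line and the quality line are dropped


example = ["@read1 sample\n", "ACGT\n", "+\n", "IIII\n"]
print("\n".join(fastq_to_fasta(example)))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly and the result written line by line to an output file.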
Processing FASTA files
Reading files
When the sequences need no processing
## param file: path to a FASTA file
## return: None
def fa_cat(file):
    with open(file) as fa_in:
        for line in fa_in:
            print(line.strip())

fa_cat("test1.fa")
When the sequences must be processed and written out as FASTA
## Define a list to contain seq info
seqInfo = []
## Define a list to contain seq
mrnaSeq = []
fa_Num = -1
## Read the file one line at a time
with open("test1.fa", "r") as fa_in:
    for line in fa_in:
        ## Remove trailing whitespace, including \n
        line = line.rstrip()
        ## Skip blank lines so that line[0] cannot fail
        if not line:
            continue
        ## Lines starting with > are saved unchanged in seqInfo
        if line[0] == ">":
            seqInfo.append(line)
            fa_Num = fa_Num + 1
            mrnaSeq.append("")
        else:
            ## Sequence lines are upper-cased and converted from DNA to RNA (T -> U)
            mrnaSeq[fa_Num] = mrnaSeq[fa_Num] + line.upper().replace("T", "U")
Dropping the header lines and reading the sequence lines into a list
## Read the file into a list; each element is one line of sequence as a string
ls = []
with open('test1.fa') as fa_in:
    for line in fa_in:
        if not line.startswith('>'):
            ls.append(line.replace('\n', ''))
Reading into a dictionary to keep each sequence paired with its identifier
seq = {}
with open('test1.fa') as fa_in:
    for line in fa_in:
        if line.startswith('>'):
            ## Use the first word after the leading ">" as the key
            name = line[1:].split()[0]
            seq[name] = ''
        else:
            seq[name] += line.strip()
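When the file is too large to hold comfortably in a dictionary, the same idea can be written as a generator that yields one record at a time. The sketch below is an illustration only: the name iter_fasta is hypothetical, and it assumes that the identifier is the first word after ">" and that any lines appearing before the first header should be ignored.

```python
def iter_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                # A new header closes the previous record
                yield name, "".join(chunks)
            # Keep only the identifier, i.e. the text before the first whitespace
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        # Emit the last record in the input
        yield name, "".join(chunks)


example = [">seq1 first record\n", "ACGT\n", "TTGA\n", ">seq2\n", "GGCC\n"]
for name, sequence in iter_fasta(example):
    print(name, sequence)
```

Passing an open file handle instead of the example list streams records from disk, and dict(iter_fasta(handle)) rebuilds the dictionary form when that is more convenient.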
Writing files
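## Output the file
## proSeq holds the processed sequences, one per header; it is assumed here
## to be the mrnaSeq list built in the reading step
proSeq = list(mrnaSeq)
## Define a file to write in
with open("test2.fa", "w") as res:
    for i in range(fa_Num + 1):
        res.write(seqInfo[i] + "\n")
        ## Each line contains at most 70 characters
        for start in range(0, len(proSeq[i]), 70):
            res.write(proSeq[i][start:start + 70] + "\n")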
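The line-wrapping logic can also be packaged as a small reusable function. This is a sketch under stated assumptions: the name write_fasta and its width parameter are hypothetical, headers are assumed to already include the leading ">", and io.StringIO stands in for a real output file so the example runs without touching disk.

```python
import io


def write_fasta(handle, records, width=70):
    """Write (header, sequence) pairs to handle, wrapping sequences at width."""
    for header, sequence in records:
        handle.write(header + "\n")
        # Slice the sequence into consecutive chunks of at most `width` characters
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


buffer = io.StringIO()
write_fasta(buffer, [(">seq1", "ACGU" * 40)], width=70)
print(buffer.getvalue())
```

Slicing by index leaves the input sequences untouched, so the same list can be written again or processed further afterwards.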