最开始,师妹来问我的,如何处理BWA的报错。看报错信息提示是因为输入的Read1和Read2的Fastq文件中有不成对的Read。原因在于这两个fq文件来自之前的分析过程中的bam文件抽取得来。
最开始,师妹来问我的,如何处理BWA的报错。看报错信息提示是因为输入的Read1和Read2的Fastq文件中有不成对的Read。原因在于这两个fq文件来自之前的分析过程中的bam文件抽取得来。
我还是第一次知道BWA这些输入reads一定要成对,否则就报错
因此要对成对的reads进行提取,才能进行下一步。最开始我是不原因自己写个脚本干这个事情,我当时下午还要开会,正在努力的赶结果。所以在网上找了别人写好的脚本,发给师妹后,就继续干活了。
谁知道,立马就报错了,好吧。刚开始我还很纳闷为啥报错。我觉得这么简单的脚本,环境依赖很简单的说,怎么会报错呢。
当时我对这个脚本还报有学习的心态(因为咱的python也是一般的说),随后去测试了一下。就发现是作者的脚本在读取fq文件的时候,字符没有decode的问题。
不过这也可能是python的不同版本的问题
嗯,总之感觉这个脚本写的很一般~,算了,自己还是用pysam重新写个吧。脚本放在Github上.可以直接copy下来或者下载下来。
使用
使用起来比较简单,只有三个参数:
Extract the paired read from the PE platfrom reads
optional arguments:
-h, --help show this help message and exit
--read1 READ1, -1 READ1
the read1 file
--read2 READ2, -2 READ2
the read2 file
--pattern PATTERN, -p PATTERN
the python re module regrex expression which distinguish two read file like @read29299[.1] and @read29299[.2], then the paremeter should be
'\.[1-2]$',the default value is noting to replace
可以直接输入进行运行:
python paired_Read.py -1 文件1 -2 文件2
而pattern
参数则不是必须的,但是非常重要,例如read1 file 和read2 file 的reads名称是一模一样的,那直接不用设置这个参数,但是如果两个reads文件的reads name 是有后缀的不同,那就需要用正则匹配删掉不同的后缀。
例如read1文件的名字都是read29299xxx.1, 而read2文件的名字都是read29299xxx.2,那这个时候就需要pattern
参数,去掉后面的.1和.2
,正则匹配中的\.[1-2]$
就可以很好的去掉后缀,从而判断成对reads。
例子
先创建两个FastQ文件,test.R1.fq
内容如下
@SRR5720828.1.sra.unpaired
CCATTCCTAAACTTACTATCAACAATAAATAAATATTATCCCTACGCCAATCAAACATACCCCTTAAACTAAAATCTGATCTCCAAAAAAATATGAAAGATTTTGAATGTTAAGTATATTGGATTTATGTGAATAAGTGGTGTTTGAAGT
+
AAFFFFJJFJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJFJAJJJJJJJFJJJF<AJJJJJAJJJJJJFJJJJJJAJJFAFF7FAJJJFAJJJJF<FJFJAFJFFAFJJ<JFJJ<-<-7AFJJFJJJJJ-7A<
@SRR5720828.1.sra.63
CTATCCGAATTCCCACAAAAACAACCAATTTCCATCAACAAACACTCGCACTACACAAAACACTAAACTAAACTCAATAAAAAAAAAAACCCAAAAAAAAAACACTTACCCCTCAAAATCATAATATCACAACCTAAAATTCCGTCTTCC
+
AAFFFFJFFJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFFFJJJJJJJJJJFAAAJJJAJJJAFJJFJF-AJFJJJJFJJJFJJJJAJA<FFJJAFFJJ<FJ-FA<FJJFJ-<FF<A<FFAA<F
test.R2.fq
内容如下
@SRR5720828.1.sra.713
ATTCCTCCCATCCACTCATTACTTTTCCTAACTTCAAACACCACTTATTCACATAAATCCAATATACTTAACATTCAAAATCTTTCATATTTTTTTGGAGATCAGATTTTAGTTTAAGGGGTATGTTTGATTGGCGTAGGGATAATATTT
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJFJJJJJJJJJJJJFJJJJJJ<F7FJJJ<FFJJJJFFJ<<AFJJJJJJJJJJAFFJJJFFFJJJ-<7<FFJJJ
@SRR5720828.1.sra.63
TGTGTTTATTGGGAAGGAGGATGGGAAGACGGAATTTTAGGTTGTGATATGATGATTTTGAGGGGTAAGTGTTTTTTTTTTGGGTTTTTTTTTTTATTGAGTTTAGTTTAGTGTTTTGTGTAGTGCGAGTGTTTGTTGATGGAAATTGGT
+
AAFFFJJJFJJJJJJ<AJJJJJJJJJJJJJJJAJJJFJJJJF<FJJF-FJ-FJJFA--AFFJFJJ<JJJFJAF<JFJFFJJJJA<FAAA77AAFJJJAA7A-AFJJJJAAFFJJJJJ<-A-7A--<F<A)<7<<<-AJ<7AA<F-AF7AF
运行
python paired_Read.py -1 test.R1.fq -2 test.R2.fq
结果展示,生产四个文件,分别是read1.paired.fq,read2.paired.fq,read1.unpaired.fq,read2.unpaired.fq,顾名思义就知道是什么意思了。结果展示如下:
(qiime2) ➜ pydata head test.R1.unpaired.fq
@SRR5720828.1.sra.unpaired
CCATTCCTAAACTTACTATCAACAATAAATAAATATTATCCCTACGCCAATCAAACATACCCCTTAAACTAAAATCTGATCTCCAAAAAAATATGAAAGATTTTGAATGTTAAGTATATTGGATTTATGTGAATAAGTGGTGTTTGAAGT
+
AAFFFFJJFJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJFJAJJJJJJJFJJJF<AJJJJJAJJJJJJFJJJJJJAJJFAFF7FAJJJFAJJJJF<FJFJAFJFFAFJJ<JFJJ<-<-7AFJJFJJJJJ-7A<
(qiime2) ➜ pydata head test.R2.unpaired.fq
@SRR5720828.1.sra.713
ATTCCTCCCATCCACTCATTACTTTTCCTAACTTCAAACACCACTTATTCACATAAATCCAATATACTTAACATTCAAAATCTTTCATATTTTTTTGGAGATCAGATTTTAGTTTAAGGGGTATGTTTGATTGGCGTAGGGATAATATTT
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJFJJJJJJJJJJJJFJJJJJJ<F7FJJJ<FFJJJJFFJ<<AFJJJJJJJJJJAFFJJJFFFJJJ-<7<FFJJJ
(qiime2) ➜ pydata head test.R2.paired.fq
@SRR5720828.1.sra.63
TGTGTTTATTGGGAAGGAGGATGGGAAGACGGAATTTTAGGTTGTGATATGATGATTTTGAGGGGTAAGTGTTTTTTTTTTGGGTTTTTTTTTTTATTGAGTTTAGTTTAGTGTTTTGTGTAGTGCGAGTGTTTGTTGATGGAAATTGGT
+
AAFFFJJJFJJJJJJ<AJJJJJJJJJJJJJJJAJJJFJJJJF<FJJF-FJ-FJJFA--AFFJFJJ<JJJFJAF<JFJFFJJJJA<FAAA77AAFJJJAA7A-AFJJJJAAFFJJJJJ<-A-7A--<F<A)<7<<<-AJ<7AA<F-AF7AF
(qiime2) ➜ pydata head test.R1.paired.fq
@SRR5720828.1.sra.63
CTATCCGAATTCCCACAAAAACAACCAATTTCCATCAACAAACACTCGCACTACACAAAACACTAAACTAAACTCAATAAAAAAAAAAACCCAAAAAAAAAACACTTACCCCTCAAAATCATAATATCACAACCTAAAATTCCGTCTTCC
+
AAFFFFJFFJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFFFJJJJJJJJJJFAAAJJJAJJJAFJJFJF-AJFJJJJFJJJFJJJJAJA<FFJJAFFJJ<FJ-FA<FJJFJ-<FF<A<FFAA<F
后记
因为我使用了python的集合函数,速度应该更快一些,同时考虑reads name的通配问题,以及保留了对fasta文件的支持。后续可以根据需要修改。
总体上感觉写个脚本半个小时,锻炼一下python水平,还挺好的。