最开始,师妹来问我的,如何处理BWA的报错。看报错信息提示是因为输入的Read1和Read2的Fastq文件中有不成对的Read。原因在于这两个fq文件来自之前的分析过程中的bam文件抽取得来。

最开始,师妹来问我的,如何处理BWA的报错。看报错信息提示是因为输入的Read1和Read2的Fastq文件中有不成对的Read。原因在于这两个fq文件来自之前的分析过程中的bam文件抽取得来。

我还是第一次知道BWA这些输入reads一定要成对,否则就报错

因此要对成对的reads进行提取,才能进行下一步。最开始我是不原因自己写个脚本干这个事情,我当时下午还要开会,正在努力的赶结果。所以在网上找了别人写好的脚本,发给师妹后,就继续干活了。

image-20220419140004712

谁知道,立马就报错了,好吧。刚开始我还很纳闷为啥报错。我觉得这么简单的脚本,环境依赖很简单的说,怎么会报错呢。

当时我对这个脚本还报有学习的心态(因为咱的python也是一般的说),随后去测试了一下。就发现是作者的脚本在读取fq文件的时候,字符没有decode的问题。

不过这也可能是python的不同版本的问题

嗯,总之感觉这个脚本写的很一般~,算了,自己还是用pysam重新写个吧。脚本放在Github上.可以直接copy下来或者下载下来。

使用

使用起来比较简单,只有三个参数:

Extract the paired read from the PE platfrom reads

optional arguments:
  -h, --help            show this help message and exit
  --read1 READ1, -1 READ1
                        the read1 file
  --read2 READ2, -2 READ2
                        the read2 file
  --pattern PATTERN, -p PATTERN
                        the python re module regrex expression which distinguish two read file like @read29299[.1] and @read29299[.2], then the paremeter should be
                        '\.[1-2]$',the default value is noting to replace

可以直接输入进行运行:

python paired_Read.py -1 文件1 -2 文件2

pattern参数则不是必须的,但是非常重要,例如read1 file 和read2 file 的reads名称是一模一样的,那直接不用设置这个参数,但是如果两个reads文件的reads name 是有后缀的不同,那就需要用正则匹配删掉不同的后缀。

例如read1文件的名字都是read29299xxx.1, 而read2文件的名字都是read29299xxx.2,那这个时候就需要pattern参数,去掉后面的.1和.2,正则匹配中的\.[1-2]$就可以很好的去掉后缀,从而判断成对reads。

例子

先创建两个FastQ文件,test.R1.fq内容如下

@SRR5720828.1.sra.unpaired
CCATTCCTAAACTTACTATCAACAATAAATAAATATTATCCCTACGCCAATCAAACATACCCCTTAAACTAAAATCTGATCTCCAAAAAAATATGAAAGATTTTGAATGTTAAGTATATTGGATTTATGTGAATAAGTGGTGTTTGAAGT
+
AAFFFFJJFJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJFJAJJJJJJJFJJJF<AJJJJJAJJJJJJFJJJJJJAJJFAFF7FAJJJFAJJJJF<FJFJAFJFFAFJJ<JFJJ<-<-7AFJJFJJJJJ-7A<
@SRR5720828.1.sra.63
CTATCCGAATTCCCACAAAAACAACCAATTTCCATCAACAAACACTCGCACTACACAAAACACTAAACTAAACTCAATAAAAAAAAAAACCCAAAAAAAAAACACTTACCCCTCAAAATCATAATATCACAACCTAAAATTCCGTCTTCC
+
AAFFFFJFFJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFFFJJJJJJJJJJFAAAJJJAJJJAFJJFJF-AJFJJJJFJJJFJJJJAJA<FFJJAFFJJ<FJ-FA<FJJFJ-<FF<A<FFAA<F

test.R2.fq内容如下

@SRR5720828.1.sra.713
ATTCCTCCCATCCACTCATTACTTTTCCTAACTTCAAACACCACTTATTCACATAAATCCAATATACTTAACATTCAAAATCTTTCATATTTTTTTGGAGATCAGATTTTAGTTTAAGGGGTATGTTTGATTGGCGTAGGGATAATATTT
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJFJJJJJJJJJJJJFJJJJJJ<F7FJJJ<FFJJJJFFJ<<AFJJJJJJJJJJAFFJJJFFFJJJ-<7<FFJJJ
@SRR5720828.1.sra.63
TGTGTTTATTGGGAAGGAGGATGGGAAGACGGAATTTTAGGTTGTGATATGATGATTTTGAGGGGTAAGTGTTTTTTTTTTGGGTTTTTTTTTTTATTGAGTTTAGTTTAGTGTTTTGTGTAGTGCGAGTGTTTGTTGATGGAAATTGGT
+
AAFFFJJJFJJJJJJ<AJJJJJJJJJJJJJJJAJJJFJJJJF<FJJF-FJ-FJJFA--AFFJFJJ<JJJFJAF<JFJFFJJJJA<FAAA77AAFJJJAA7A-AFJJJJAAFFJJJJJ<-A-7A--<F<A)<7<<<-AJ<7AA<F-AF7AF

运行

python paired_Read.py -1 test.R1.fq -2 test.R2.fq 

结果展示,生产四个文件,分别是read1.paired.fq,read2.paired.fq,read1.unpaired.fq,read2.unpaired.fq,顾名思义就知道是什么意思了。结果展示如下:

(qiime2) ➜ pydata head test.R1.unpaired.fq                                               
@SRR5720828.1.sra.unpaired
CCATTCCTAAACTTACTATCAACAATAAATAAATATTATCCCTACGCCAATCAAACATACCCCTTAAACTAAAATCTGATCTCCAAAAAAATATGAAAGATTTTGAATGTTAAGTATATTGGATTTATGTGAATAAGTGGTGTTTGAAGT
+
AAFFFFJJFJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJFJAJJJJJJJFJJJF<AJJJJJAJJJJJJFJJJJJJAJJFAFF7FAJJJFAJJJJF<FJFJAFJFFAFJJ<JFJJ<-<-7AFJJFJJJJJ-7A<

(qiime2) ➜ pydata head test.R2.unpaired.fq
@SRR5720828.1.sra.713
ATTCCTCCCATCCACTCATTACTTTTCCTAACTTCAAACACCACTTATTCACATAAATCCAATATACTTAACATTCAAAATCTTTCATATTTTTTTGGAGATCAGATTTTAGTTTAAGGGGTATGTTTGATTGGCGTAGGGATAATATTT
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJFJJJJJJJJJJJJFJJJJJJ<F7FJJJ<FFJJJJFFJ<<AFJJJJJJJJJJAFFJJJFFFJJJ-<7<FFJJJ

(qiime2) ➜ pydata head test.R2.paired.fq  
@SRR5720828.1.sra.63
TGTGTTTATTGGGAAGGAGGATGGGAAGACGGAATTTTAGGTTGTGATATGATGATTTTGAGGGGTAAGTGTTTTTTTTTTGGGTTTTTTTTTTTATTGAGTTTAGTTTAGTGTTTTGTGTAGTGCGAGTGTTTGTTGATGGAAATTGGT
+
AAFFFJJJFJJJJJJ<AJJJJJJJJJJJJJJJAJJJFJJJJF<FJJF-FJ-FJJFA--AFFJFJJ<JJJFJAF<JFJFFJJJJA<FAAA77AAFJJJAA7A-AFJJJJAAFFJJJJJ<-A-7A--<F<A)<7<<<-AJ<7AA<F-AF7AF

(qiime2) ➜ pydata head test.R1.paired.fq
@SRR5720828.1.sra.63
CTATCCGAATTCCCACAAAAACAACCAATTTCCATCAACAAACACTCGCACTACACAAAACACTAAACTAAACTCAATAAAAAAAAAAACCCAAAAAAAAAACACTTACCCCTCAAAATCATAATATCACAACCTAAAATTCCGTCTTCC
+
AAFFFFJFFJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFFFJJJJJJJJJJFAAAJJJAJJJAFJJFJF-AJFJJJJFJJJFJJJJAJA<FFJJAFFJJ<FJ-FA<FJJFJ-<FF<A<FFAA<F

后记

因为我使用了python的集合函数,速度应该更快一些,同时考虑reads name的通配问题,以及保留了对fasta文件的支持。后续可以根据需要修改。

总体上感觉写个脚本半个小时,锻炼一下python水平,还挺好的。