The goal of this project is to assess the quality of the A. inornata genome that was produce by Dovetail Genomics
I am using Trinity to adapter trim, normalize and assemble a transcriptome for A inornata. Five different tissue samples will be used (blood, brain, heart, germinal bed, and testes).
I concatinated all of the reads from each tissue together by R1 and R2 (PE read designation). Leaving a single R1 and R2 for heart, blood, brain, testes, and germinal bed (blood_R1.fastq.gz, blood_R2.fastq.gz, brain_R1.fastq.gz, brain_R2.fastq.gz, germinalbed_R1.fastq.gz, germinalbed_R2.fastq.gz, heart_R1.fastq.gz, heart_R2.fastq.gz, testes_R1.fastq.gz, testes_R2.fastq.gz)
#author: tfording
#FILES: Blood
#Trinity --seqType fq --SS_lib_type RF --left /n/projects/tfording/a_inornata/blood_R1.fastq.gz --right /n/projects/tfording/a_inornata/RNAseq_data/blood_R2.fastq.gz --CPU 24 --max_memory 600G --normalize_reads --verbose --min_kmer_cov 2 --min_contig_length 200 --output /n/projects/tfording/a_inornata/RNAseq_data/Trinity_blood --trimmomatic --quality_trimming_params "ILLUMINACLIP:/n/apps/CentOS7/bin/trinityrnaseq-2.1.1/trinity-plugins/Trimmomatic/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25"
#FILES: Brain
#Trinity --seqType fq --SS_lib_type RF --left /n/projects/tfording/a_inornata/RNAseq_data/brain_R1.fastq.gz --right /n/projects/tfording/a_inornata/RNAseq_data/brain_R2.fastq.gz --CPU 24 --max_memory 600G --normalize_reads --verbose --min_kmer_cov 2 --min_contig_length 200 --output /n/projects/tfording/a_inornata/RNAseq_data/Trinity_brain --quality_trimming_params "ILLUMINACLIP:/n/apps/CentOS7/bin/trinityrnaseq-2.1.1/trinity-plugins/Trimmomatic/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25"
#FILES: GerminalBed
#Trinity --seqType fq --SS_lib_type RF --left /n/projects/tfording/a_inornata/RNAseq_data/germinalbed_R1.fastq.gz --right /n/projects/tfording/a_inornata/RNAseq_data/germinalbed_R2.fastq.gz --CPU 24 --max_memory 600G --normalize_reads --verbose --min_kmer_cov 2 --min_contig_length 200 --output /n/projects/tfording/a_inornata/RNAseq_data/Trinity_germinalbed --quality_trimming_params "ILLUMINACLIP:/n/apps/CentOS7/bin/trinityrnaseq-2.1.1/trinity-plugins/Trimmomatic/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25"
#FILES: Heart
#Trinity --seqType fq --SS_lib_type RF --left /n/projects/tfording/a_inornata/RNAseq_data/heart_R1.fastq.gz --right /n/projects/tfording/a_inornata/RNAseq_data/heart_R2.fastq.gz --CPU 24 --max_memory 600G --normalize_reads --verbose --min_kmer_cov 2 --min_contig_length 200 --output /n/projects/tfording/a_inornata/RNAseq_data/Trinity_heart --quality_trimming_params "ILLUMINACLIP:/n/apps/CentOS7/bin/trinityrnaseq-2.1.1/trinity-plugins/Trimmomatic/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25"
#FILES: Testes
#Trinity --seqType fq --SS_lib_type RF --left /n/projects/tfording/a_inornata/RNAseq_data/testes_R1.fastq.gz --right /n/projects/tfording/a_inornata/RNAseq_data/testes_R2.fastq.gz --CPU 24 --max_memory 600G --normalize_reads --verbose --min_kmer_cov 2 --min_contig_length 200 --output /n/projects/tfording/a_inornata/RNAseq_data/Trinity_testes --quality_trimming_params "ILLUMINACLIP:/n/apps/CentOS7/bin/trinityrnaseq-2.1.1/trinity-plugins/Trimmomatic/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25"
I normalized by tissue type to a min kmer count of 2, max coverage of 50 (see http://ivory.idyll.org/blog/trinity-in-silico-normalize.html for how trinity's in silico normalization works) and minimum contig length of 200.
Script:
# author: tfording
import numpy
import argparse
import os
import pandas as pd
from scipy import stats, integrate
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
def parse_trinity_fasta(input_file_path):
'''
This function will iterate through a trinity fasta and pull lengths from header lines to pass to various functions
:param input_file_path: Directory to file
:return: Above parameters in a table format
'''
length_array = []
fh = open(input_file_path, "r")
for line in fh:
if str(line[0]) == '>':
line = line.split()
length = line[1]
length = length[4:]
length_array.append(length)
length_array_int = []
for item in length_array:
item = int(item)
length_array_int.append(item)
min = min_contig(length_array_int)
max = max_contig(length_array_int)
total_size = total_size_in_bp(length_array_int)
file_name = parse_file_name(input_file_path)
num_contigs = len(length_array_int)
median = median_contig(length_array_int)
average = average_contig(length_array_int)
n_50 = N50_func(length_array_int, total_size)
print 'File Name:', file_name
print 'Number of Contigs:', num_contigs
print 'Total Size:', total_size
print 'Min Contig Size:', min
print 'Max Contig Size:', max
print 'Average Contig Size:', average
print 'Median Contig Size:', median
print 'N50:', n_50
contig_length_dist(length_array_int, file_name)
def contig_length_dist(length_array, file_name):
'''
This function plots a length distribution
:param length_array:
:return:
'''
#plt.show(sns.distplot(length_array, bins=40, kde=False, rug=True, axlabel='Contig Length', label=file_name+' contig dist'))
def N50_func(length_array, total_len):
'''
Calculates N50 of given array
:param length_array: array of integers
:return: N50 of array
'''
new_list = sorted(length_array)
len_count = 0
pos_count = 0
for i in new_list:
len_count += i
if len_count >= (total_len/2.0):
return i
def average_contig(length_array):
'''
:param length_array: Array of integers
:return: mean of array
'''
return numpy.mean(length_array)
def median_contig(length_array):
'''
:param length_array: array of integers
:return: median of array
'''
return numpy.median(length_array)
def parse_file_name(path_to_file):
'''
:param path_to_file: directory (str)
:return: Just the file name
'''
path_to_file= path_to_file.split('/')
return path_to_file[-1]
def min_contig(length_array):
'''
:param length_array: array of integers
:return: min of array
'''
min_contig = min(length_array)
return min_contig
def max_contig(length_array):
'''
:param length_array: array of integers
:return: max of array
'''
max_contig = max(length_array)
return max_contig
def total_size_in_bp(length_array):
'''
:param length_array: array of integers
:return: sum of integers in array
'''
total_size = 0
for item in length_array:
total_size += item
return total_size
parse_trinity_fasta('/n/projects/tfording/a_inornata/RNAseq_data/Trinity_blood/blood_Trinity.fasta')
parse_trinity_fasta('/n/projects/tfording/a_inornata/RNAseq_data/Trinity_brain/brain_Trinity.fasta')
parse_trinity_fasta('/n/projects/tfording/a_inornata/RNAseq_data/Trinity_germinalbed/germinalbed_Trinity.fasta')
parse_trinity_fasta('/n/projects/tfording/a_inornata/RNAseq_data/Trinity_heart/heart_Trinity.fasta')
parse_trinity_fasta('/n/projects/tfording/a_inornata/RNAseq_data/Trinity_testes/testes_Trinity.fasta')
Duncan and my data sets are drastically different in number of reads retained after normalization and trimmomatic. I am going to investigate which step within trinity cause the massive loss in reads in my data set.
#java -jar /n/apps/CentOS7/bin/trinityrnaseq-2.1.1/trinity-plugins/Trimmomatic/trimmomatic-0.32.jar PE -threads 4 \
#/n/projects/tfording/a_inornata/RNAseq_data/testes_R1.fastq.gz \
#/n/projects/tfording/a_inornata/RNAseq_data/testes_R2.fastq.gz \
#-baseout testesdiagnosis \
After running trimmomatic on the testes data set, 115,305,156 reads were retained from the original 115,660,629 reads. This is a net loss of 355,473 reads. Trimmomatic is not responsible for the massive loss in data.
Installed trinity 2.3.2 via anaconda.
#/n/projects/tfording/anaconda2/bin/Trinity --seqType fq --SS_lib_type RF --left /n/projects/tfording/a_inornata/RNAseq_data/testes_R1.fastq.gz --right /n/projects/tfording/a_inornata/RNAseq_data/testes_R2.fastq.gz --CPU 16 --max_memory 400G --normalize_reads --verbose --min_kmer_cov 2 --min_contig_length 200 --output /n/projects/tfording/a_inornata/RNAseq_data/Trinity_testes_2.3.2 --quality_trimming_params "ILLUMINACLIP:/n/projects/tfording/anaconda2/pkgs/trinity-2.3.2-0/opt/trinity-2.3.2/trinity-plugins/Trimmomatic-0.32/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25"
parse_trinity_fasta('/n/projects/tfording/a_inornata/RNAseq_data/Trinity_testes_2.3.2/testes_Trinity_2.3.2.fasta')
After running Trinity 2.3.2 I got a very similar set of data, so I ran BUSCO (at Duncan's request) on both of my generated testes data sets, as well as on his. I ran BUSCO on both CEGs and the Tetrapod references. I installed BUSCO and its supporting programs via anaconda.
## EUKARYOTA
# Trinity 2.1.1
#python /n/projects/tfording/anaconda2/bin/BUSCO.py -i /n/projects/tfording/a_inornata/RNAseq_data/Trinity_testes/testes_Trinity.fasta -o EUK_trinity_2.1.1 -l /n/projects/tfording/BUSCO_lineages/eukaryota_odb9 -m tran -c 4
# Trinity 2.3.2
#python /n/projects/tfording/anaconda2/bin/BUSCO.py -i /n/projects/tfording/a_inornata/RNAseq_data/Trinity_testes_2.3.2/testes_Trinity_2.3.2.fasta -o EUK_trinity_2.3.2 -l /n/projects/tfording/BUSCO_lineages/eukaryota_odb9 -m tran -c 4
# Duncan's Testes (Trinity 2.3.2)
#python /n/projects/tfording/anaconda2/bin/BUSCO.py -i /n/projects/dut/a_marmorata/teste_transcriptome_sex_determ/data/trinity_out/Trinity.fasta -o EUK_duncans_testes_trinity_2.3.2 -l /n/projects/tfording/BUSCO_lineages/eukaryota_odb9 -m tran -c 4
## TETRAPODA
# Trinity 2.1.1
#python /n/projects/tfording/anaconda2/bin/BUSCO.py -i /n/projects/tfording/a_inornata/RNAseq_data/Trinity_testes/testes_Trinity.fasta -o TET_trinity_2.1.1 -l /n/projects/tfording/BUSCO_lineages/tetrapoda_odb9 -m tran -c 4
# Trinity 2.3.2
#python /n/projects/tfording/anaconda2/bin/BUSCO.py -i /n/projects/tfording/a_inornata/RNAseq_data/Trinity_testes_2.3.2/testes_Trinity_2.3.2.fasta -o TET_trinity_2.3.2 -l /n/projects/tfording/BUSCO_lineages/tetrapoda_odb9 -m tran -c 4
# Duncan's Testes (Trinity 2.3.2)
#python /n/projects/tfording/anaconda2/bin/BUSCO.py -i /n/projects/dut/a_marmorata/teste_transcriptome_sex_determ/data/trinity_out/Trinity.fasta -o TET_duncans_testes_trinity_2.3.2 -l /n/projects/tfording/BUSCO_lineages/tetrapoda_odb9 -m tran -c 4
Results from BUSCO: (TESTES)
My data set continues to appear to be drastically less informative. I am going to run Trinity (2.3.2) on the testes data set, without doing a normalization in an attempt to get more contigs and contigs of higher quality.
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import os
import math
def histogram(file_path):
'''
This function takes a jellyfish.kmers.fa.histo file from Trinity and plots a histogram.
:param file_path: Path to jellyfish.kmers.fa.histo
:return: Nothing is returned, a histogram is plotted
'''
fh = open(file_path, "r")
kmer_freq = []
num_kmers = []
for line in fh:
line = line.split()
kmer_freq.append(int(line[0]))
num_kmers.append(math.log10(int(line[1])))
#print kmer_freq, num_kmers
# the histogram of the data
fig = plt.figure()
ax = fig.add_subplot(111)
fig_size = plt.rcParams["figure.figsize"]
fig.set_size_inches(10,12)
ax.bar(kmer_freq, num_kmers, width=100, color='b', edgecolor='b')
ax.set_xlabel('Freq of Kmer')
ax.set_ylabel('Num of Kmer (Log10)')
ax.set_title('Kmer Dist')
ax.grid(True)
plt.show()
I decided to do kmer distributions (at the reccomendation of both Duncan and Aaron) to see how my seemingly problematic datasets compared to Duncan's data set which performed very well in BUSCO
A. inornata Trinity 2.1.1 (Testes)
histogram('/n/projects/tfording/a_inornata/RNAseq_data/Trinity_testes/jellyfish.kmers.fa.histo')
A. inornata Trinity 2.3.2 (Testes)
histogram('/n/projects/tfording/a_inornata/RNAseq_data/Trinity_testes_2.3.2/jellyfish.kmers.fa.histo')
A. marmorata Trinity 2.3.2 (Testes) DUNCAN'S DATA SET
histogram('/n/projects/dut/a_marmorata/teste_transcriptome_sex_determ/data/trinity_out/jellyfish.kmers.fa.histo')
Trinity 2.1.1 Brain
histogram('/n/projects/tfording/a_inornata/RNAseq_data/Trinity_brain/jellyfish.kmers.fa.histo')
Trinity 2.1.1 Germinal Bed
histogram('/n/projects/tfording/a_inornata/RNAseq_data/Trinity_germinalbed/jellyfish.kmers.fa.histo')
Trinity 2.1.1 Heart
histogram('/n/projects/tfording/a_inornata/RNAseq_data/Trinity_heart/jellyfish.kmers.fa.histo')
My distributions seem to drop off rather quickly (with the exception of the blood library which is below), while Duncan's spread much more nicely. This implies a loss of complexity in the data set. I am going to talking to Rachel and see if she did anything differently with these data sets, and run the entire set of data together through Trinity 2.3.2 and see how it looks.
After talking to Rachel, there wasn't anything notable done differently with the data set. I am hoping that a non-normalized run will restore the kmer complexity.
Blood Library
Looking at the blood library, it seems to have done better through the Trinity normalization step.
I also ran it through BUSCO and got the following numbers back:.
Trinity 2.1.1 Blood
histogram('/n/projects/tfording/a_inornata/RNAseq_data/Trinity_blood/jellyfish.kmers.fa.histo')
Testes Trinity (2.3.2) Without Normalization
histogram('/n/projects/tfording/a_inornata/RNAseq_data/not_normalized_Trinity_testes_2.3.2/jellyfish.kmers.fa.histo')
BUSCO Results for Non-normalized testes:
parse_trinity_fasta('/n/projects/tfording/a_inornata/RNAseq_data/not_normalized_Trinity_testes_2.3.2/Trinity.fasta')
histogram('/n/projects/tfording/a_inornata/RNAseq_data/Trinity_alltissues_2.3.2/jellyfish.kmers.fa.histo')
This distribution seems normal in complexity.
parse_trinity_fasta('/n/projects/tfording/a_inornata/RNAseq_data/Trinity_alltissues_2.3.2/Trinity.fasta')
DUNCAN'S A.marmorata
histogram('/n/projects/dut/a_marmorata/marmorata_comprehensive_transcriptome/data/trinity_out/jellyfish.kmers.fa.histo')
parse_trinity_fasta('/n/projects/dut/a_marmorata/marmorata_comprehensive_transcriptome/data/trinity_out/Trinity.fasta')
The A. inornata transcriptome is comparable to the A. marmorata transcriptome in terms of BUSCOs. So I will continue forward with the genome annotation.
#BUSCO SCRIPT
python /n/projects/tfording/anaconda2/bin/BUSCO.py -i /n/projects/tfording/a_inornata/transcriptome/inornataTranscriptome.fasta -o ino_trans_VERT_BUSCO.out -l /n/projects/tfording/BUSCO_lineages/vertebrata_odb9 -m tran -c 4
#python /n/projects/tfording/anaconda2/bin/BUSCO.py -i /n/projects/tfording/a_inornata/RNAseq_data/Trinity_testes_2.3.2/testes_Trinity_2.3.2.fasta -o EUK_trinity_2.3.2 -l /n/projects/tfording/BUSCO_lineages/eukaryota_odb9 -m tran -c 4
Goal of this portion is to mask out repeats in the genome using repeat modeler and repeat masker.
#BuildDatabase -name AspInornata -engine ncbi AspIno_2.0.fasta
#RepeatModeler -engine ncbi -pa 16 -database AspInornata
Note to self: Repeat Modeler takes a LLLLOOOONNNNNGGGG time. 10+ days. After repeatmodeler sat at the same step for over 3 days, I decided to kill the script and start it with more threads. The script would not restart so I had to start it over.
RepeatModeler did not output the file I needed. Instead of spending more time on this I'm going to move forward using the masked inornata genome Rutendo created. File loc: '/n/projects/tfording/a_inornata/repeatmodeler/RM_42558.SunJun260011112016/consensi.fa.classified'
#RepeatMasker -xsmall -gff -s -pa 20 -lib /n/projects/tfording/a_inornata/repeatmodeler/RM_42558.SunJun260011112016/consensi.fa.classified /n/projects/tfording/genomes/AspIno_2.0.fasta -dir /n/projects/tfording/a_inornata > repeatmasker.out
File Name: AspIno_2.0.fasta
Number of Elements | Length Occupied | Percentage of Sequence |
Following the pipeline that was used for the marmorata genome, I am mapping the reads back using STAR.
#STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /n/projects/tfording/a_inornata/genome --genomeFastaFiles /n/projects/tfording/a_inornata/AspIno_2.0.fasta.masked --limitGenomeGenerateRAM 200000000000
Index Out:
#STAR --readFilesIn /n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1655_CAGATC_2_R1_heart.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1655_CAGATC_2_R2_heart.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1655_CAGATC_R1_heart.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1655_CAGATC_R2_heart.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1655_CGATGT_2_R1_brain.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1655_CGATGT_2_R2_brain.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1655_CGATGT_R1_brain.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1655_CGATGT_R2_brain.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1655_GTGAAA_2_R1_testes.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1655_GTGAAA_2_R2_testes.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1655_GTGAAA_R1_testes.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1655_GTGAAA_R2_testes.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1656_ACAGTG_2_R1_brain.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1656_ACAGTG_2_R2_brain.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1656_ACAGTG_R1_brain.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1656_ACAGTG_R2_brain.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1656_ATGTCA_2_R1_testes.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1656_ATGTCA_2_R2_testes.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1656_ATGTCA_R1_testes.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1656_ATGTCA_R2_testes.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1656_GCCAAT_2_R1_heart.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1656_GCCAAT_2_R2_heart.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1656_GCCAAT_R1_heart.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1656_GCCAAT_R2_heart.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1684_CCGTCC_2_R1_germinalbed.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1684_CCGTCC_2_R2_germinalbed.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1684_CCGTCC_R1_germinalbed.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG1684_CCGTCC_R2_germinalbed.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG897_CGTACG_2_R1_blood.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG897_CGTACG_2_R2_blood.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG897_CGTACG_R1_blood.fastq.gz,/n/projects/tfording/a_inornata/data/raw_data/RNA_seq/MOLNG897_CGTACG_R2_blood.fastq.gz --readFilesCommand gunzip -c --genomeDir /n/projects/tfording/a_inornata/genome/STAR_genome --outFileNamePrefix /n/projects/tfording/a_inornata/data/STAR_alignments_run1_april17 --runThreadN 8 --twopassMode Basic --alignSJoverhangMin 8 --alignSJDBoverhangMin 5 --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 50000000000 --outFilterType Normal
STAR --readFilesIn --readFilesCommand “cat” --genomeDir ../data/marm_ref --outFileNamePrefix ../data/100_million_alignments_2ndpass/merged_samples_ --runThreadN 8 --alignSJoverhangMin 8 --alignSJDBoverhangMin 5 --sjdbFileChrStartEnd ../data/100_million_alignments/merged_samples_SJ.out.tab --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 50000000000 --outFilterType Normal
python /n/projects/tfording/anaconda2/bin/BUSCO.py -i /n/projects/tfording/a_inornata/genome/inornata_09Jan2016_DgKHC_uppercase.fasta. masked -o ino_masked_vert_BUSCO.out -l /n/projects/tfording/BUSCO_lineages/vertebrata_odb9 -m genome -c 4
NOTE TO SELF: BUSCO requires a path to Augustus to be set. This is done by typing the following into the command line environment:
export AUGUSTUS_CONFIG_PATH=/n/projects/tfording/anaconda2/pkgs/augustus-3.2.3-boost1.60_0/config
BUSCO OUTPUT:
Installed using Anaconda command: conda install -c bioconda braker=1.9
Lots of perl dependencies were missing. After trying to install the needed dependencies, Anaconda informed me of conflicts between packages of different versions. It appears as if I need to use an older version of perl to allow for the Logger::Simple module to be installed from CPAN. To do this I created a separate Conda Environment named Braker1Perl. This is where I am installing the needed packages for Braker.
5May2017: Installing in a new environment didn't work. Duncan said that he used the site wide version of perl. This resolved the Logger::Simple issue, but BRAKER still failed after trying to run it.
Command to run BRAKER1:
/n/projects/tfording/anaconda2/bin/perl /n/projects/tfording/anaconda2/pkgs/braker-1.9-1/bin/braker.pl --genome=/n/projects/tfording/a_inornata/star/AspIno_2.0.fasta.masked --species=AspIno --bam=/n/projects/tfording/a_inornata/data/STAR_alignments_run1_april17Aligned.sortedByCoord.out.bam --cores 10 --filterOutShort on --gff3 --overwrite &>AspIno_braker1_Check3_stdout.txt &l
I have installed all the perl dependencies that maker is currently throwing errors for. Perl however isn't utilizing them currently. I am going to try and switch back to my conda installation of perl. If it doesn't work I'll need to look into how to point my CPAN installations to perl.
After seemingly endless dependency errors I finally got BRAKER to work. I ended up putting the paths to Augustus, Bamtools and GeneMark in my .bash_profile. This fixed BRAKER's inability to find these programs despite the paths being passed to it in the command originally.
#python /n/projects/tfording/anaconda2/bin/BUSCO.py -i /n/projects/tfording/a_inornata/bin/braker/AspIno/augustus.aa -o EUK_maker_1_BUSCO.out -l /n/projects/tfording/BUSCO_lineages/vertebrata_odb9 -m proteins -c 4
EUKARYOTIC BUSCOs C:77.8%[S:75.2%,D:2.6%],F:15.5%,M:6.7%,n:303
236 Complete BUSCOs (C)
228 Complete and single-copy BUSCOs (S)
8 Complete and duplicated BUSCOs (D)
47 Fragmented BUSCOs (F)
20 Missing BUSCOs (M)
303 Total BUSCO groups searched
Vertabrate BUSCOs C:51.8%[S:47.6%,D:4.2%],F:27.7%,M:20.5%,n:2586
1339 Complete BUSCOs (C)
1231 Complete and single-copy BUSCOs (S)
108 Complete and duplicated BUSCOs (D)
717 Fragmented BUSCOs (F)
530 Missing BUSCOs (M)
2586 Total BUSCO groups searched
#python /n/projects/tfording/anaconda2/bin/BUSCO.py -i /n/projects/tfording/a_inornata/bin/braker/AIno/augustus.aa -o EUK_maker_1_BUSCO.out -l /n/projects/tfording/BUSCO_lineages/vertebrata_odb9 -m proteins -c 4
Vert BUSCOs BRAKER V2
/n/projects/tfording/a_inornata/bin/maker/bin/maker
#author=Tyler Fording
from __future__ import division
import argparse
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
def plot_repeats(file_path1):
fh_in = open(file_path1, 'r')
line_list = []
############################################ Read in file
for line in fh_in:
line = line.strip('\n')
line = line.split()
for i in line:
i = i.strip
if float(line[3]) > 0.0010:
line_list.append(line)
############################################ Populate data structures for plotting
x_tick_labels = [i[0].strip() for i in line_list]
x_list = [i+1 for i in range(len(line_list))]
y_list = [float(i[3].strip()) for i in line_list]
data = []
new_line_list = []
for item in line_list:
data.append({'label':item[0], 'color':'maroon', 'height':float(item[3])})
############################################ Plot data
figure, ax = plt.subplots()
i = 1
for bar in data:
ax.bar(i, bar['height'], align='center', color=bar['color'])
i += 1
plt.xticks(x_list, x_tick_labels, rotation=45, ha='right')
plt.ylabel('% Abundence')
plt.xlabel('Repeat Class')
plt.title('Inornatus Repeat Abundence by Class')
for i in x_list:
ax.text(i, y_list[i-1]+2, str(y_list[i-1]), fontweight='bold', ha='center', rotation=90)
plt.show()
#plt.savefig('ino_repeats.pdf')
plot_repeats('/n/projects/tfording/a_inornata/repeatmasker/repeat_masker_rutendo/inornata/inornata.table.txt')
I started this process by running Trinnotate on the Inornata transcriptome. Trinnotate is apart of an established pipeline that can be found (https://trinotate.github.io/). The commands are as follows:
#blastx -query Trinity.fasta -db uniprot_sprot.pep -num_threads 8 -max_target_seqs 1 -outfmt 6 > blastx.outfmt6
This command uses blastx to search the Trinity transcriptome assembly for transcripts.
#blastp -query transdecoder.pep -db uniprot_sprot.pep -num_threads 8 -max_target_seqs 1 -outfmt 6 > blastp.outfmt6
This command uses blastp to seach Swissprot database for Transdecoder-predicted proteins. It creates a database of proteins to be used later in the pipeline
#hmmscan --cpu 8 --domtblout TrinotatePFAM.out Pfam-A.hmm transdecoder.pep > pfam.log
HMMER identifies protein domains and outputs a log of all the protein domains found.
#./signalp -f short -n signalp_short.out /n/projects/tfording/a_inornata/transcriptome/Transcriptome_annotation/ino_trans_longestorfs.pep
SignalP predicts signal peptides in the transcriptome
#tmhmm --short < transdecoder.pep > tmhmm.out
tmHMM predicts transmembrane regions.
#/n/projects/tfording/a_inornata/transcriptome/bin/Trinnotate/util/rnammer_support/RnammerTranscriptome.pl --transcriptome inornataTranscriptome.fasta --path_to_rnammer /n/projects/tfording/a_inornata/transcriptome/bin/rnammer-1.2.src/rnammer
RNAMMER identifies rRNA transcripts it outputs a file called inornataTranscriptome.fasta.rnammer.gff
"RNAMMER was originally developed to identify rRNA genes in genomic sequences. To have it identify rRNA sequences among our large sets of transcriptome sequences, we first concatenate all the transcripts together into a single super-scaffold, run RNAMMER to identify rRNA homologies, and then transform the rRNA feature coordinates in the super-scaffold back to the transcriptome reference coordinates. The following script will perform all of these operations:" -(https://trinotate.github.io/)
Note to self: The installation of RNAmmer requires some hacking. I attempted to do as trinotate's website instructed, but the code seems to have been changed since they wrote their hack instructions. They note that this is not a required step to finish annotation, but I would like to produce as complete an annotation as is possible, so I am going to spend some more time trying to a solution.
Number of 100% BlastX hits: 13103
#awk 'BEGIN {count=0 ;} $3 == "100.000" {count ++ ;} END {print count ;}' inornata_blastx.outfmt6
Number of 95% BlastX hits: 21676
Number of 90% BlastX hits: 55868
Number of 85% BlastX hits: 89550
Number of 80% BlastX hits: 122962
Number of 75% BlastX hits: 154266
Number of 70% BlastX hits: 184451