Last updated: 2018-08-28
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed:
set.seed(20180801)
The command set.seed(20180801)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: f774786
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .RData
Ignored: .Rhistory
Ignored: .Rproj.user/
Untracked files:
Untracked: com_threeprime.Rproj
Untracked: data/PeakPerExon/
Untracked: data/PeakPerGene/
Untracked: data/comp.pheno.data.csv
Untracked: data/dist_TES/
Untracked: data/dist_upexon/
Untracked: data/liftover/
Untracked: data/map.stats.csv
Untracked: data/map.stats.xlsx
Untracked: data/orthoPeak_quant/
Unstaged changes:
Deleted: comparitive_threeprime.Rproj
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | f774786 | brimittleman | 2018-08-28 | initialze APA pheno analysis and map peaks to genes |
In this analysis I want to assign all of the orthologus peaks to the human and chimp mRNAs. Once I do this I will be able to create a PAS usage phenotype. This is x/y. X is the read count in a peak and y is the number of reads in any peak for the same gene. This analysis will follow a similar pipeline to https://brimittleman.github.io/threeprimeseq/pheno.leaf.comb.html. I will run all of this seperatly for the human and chimp peaks. Unfortunately I cannot forse strandedness here like I did in the human LCL. I lost the strand information in the liftover.
#!/bin/bash
#SBATCH --job-name=intGenes_orthopeaks
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=intGenes_orthopeaks.out
#SBATCH --error=intGenes_orthopeaks.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
module load Anaconda3
source activate comp_threeprime_env
#human
bedtools intersect -wa -wb -sorted -a /project2/gilad/briana/comparitive_threeprime/data/ortho_peaks/humanOrthoPeaks.sort.bed -b /project2/gilad/briana/genome_anotation_data/comp_genomes/gene_annos/humanGene_ncbiRefSeq_mRNA_sort.bed > /project2/gilad/briana/comparitive_threeprime/data/PeaksGeneAnno/humanOrthoPeaks.sort.refseqgenes.bed
#chimp
bedtools intersect -wa -wb -sorted -a /project2/gilad/briana/comparitive_threeprime/data/ortho_peaks/chimpOrthoPeaks.sort.bed -b /project2/gilad/briana/genome_anotation_data/comp_genomes/gene_annos/chimpGene_refGene_mRNA_sort.bed > /project2/gilad/briana/comparitive_threeprime/data/PeaksGeneAnno/chimpOrthoPeaks.sort.refgenes.bed
The resulting file has info from everything. I need to use awk to keep just the colums I want. chr, start, end, peak name, “.”, gene strand, gene name. The first . is for a score and the second . is strand.
awk '{print $1 "\t" $2 "\t" $3 "\t" $4 "\t " "." "\t" $10 "\t" $8}' /project2/gilad/briana/comparitive_threeprime/data/PeaksGeneAnno/chimpOrthoPeaks.sort.refgenes.bed > /project2/gilad/briana/comparitive_threeprime/data/PeaksGeneAnno/chimpOrthoPeaks.sort.refgenes.sm.bed
awk '{print $1 "\t" $2 "\t" $3 "\t" $4 "\t " "." "\t" $10 "\t" $8}' /project2/gilad/briana/comparitive_threeprime/data/PeaksGeneAnno/humanOrthoPeaks.sort.refseqgenes.bed > /project2/gilad/briana/comparitive_threeprime/data/PeaksGeneAnno/humanOrthoPeaks.sort.refseqgenes.sm.bed
Before I run feature count I want to create an ID for each peak and transform the data into a SAF file. The ID will be peak#:ch#:start:end:strand:gene
#peakwGene2Saf_C.py
from misc_helper import *
fout = file("/project2/gilad/briana/comparitive_threeprime/data/PeaksGeneAnno/chimpOrthoPeaks.sort.refgenes.sm.SAF",'w')
fout.write("GeneID\tChr\tStart\tEnd\tStrand\n")
for ln in open("/project2/gilad/briana/comparitive_threeprime/data/PeaksGeneAnno/chimpOrthoPeaks.sort.refgenes.sm.bed"):
chrom, start, end, name, score, strand, gene = ln.split()
start_i=int(start)
end_i=int(end)
ID = "%s:%s:%d:%d:%s:%s"%(name, chrom, start_i, end_i, strand, gene)
fout.write("%s\t%s\t%d\t%d\t%s\n"%(ID, chrom, start_i, end_i, strand))
fout.close()
#peakwGene2Saf_H.py
from misc_helper import *
fout = file("/project2/gilad/briana/comparitive_threeprime/data/PeaksGeneAnno/humanOrthoPeaks.sort.refseqgenes.sm.SAF",'w')
fout.write("GeneID\tChr\tStart\tEnd\tStrand\n")
for ln in open("/project2/gilad/briana/comparitive_threeprime/data/PeaksGeneAnno/humanOrthoPeaks.sort.refseqgenes.sm.bed"):
chrom, start, end, name, score, strand, gene = ln.split()
start_i=int(start)
end_i=int(end)
ID = "%s:%s:%d:%d:%s:%s"%(name, chrom, start_i, end_i, strand, gene)
fout.write("%s\t%s\t%d\t%d\t%s\n"%(ID, chrom, start_i, end_i, strand))
fout.close()
Feature counts on these peaks.
fc_orthopeaksWgene.sh
#!/bin/bash
#SBATCH --job-name=fc_orthopeaksWgene
#SBATCH --account=pi-yangili1
#SBATCH --time=24:00:00
#SBATCH --output=fc_orthopeaksWgene.out
#SBATCH --error=fc_orthopeaksWgene.err
#SBATCH --partition=broadwl
#SBATCH --mem=12G
#SBATCH --mail-type=END
module load Anaconda3
source activate comp_threeprime_env
# outdir: /project2/gilad/briana/comparitive_threeprime/data/PeakwGene_quant
featureCounts -a /project2/gilad/briana/comparitive_threeprime/data/PeaksGeneAnno/humanOrthoPeaks.sort.refseqgenes.sm.SAF -F SAF -o /project2/gilad/briana/comparitive_threeprime/data/PeakwGene_quant/HumanTotal_Orthopeak.quant /project2/gilad/briana/comparitive_threeprime/human/data/sort/*T-sort.bam -s 1
featureCounts -a /project2/gilad/briana/comparitive_threeprime/data/PeaksGeneAnno/humanOrthoPeaks.sort.refseqgenes.sm.SAF -F SAF -o /project2/gilad/briana/comparitive_threeprime/data/PeakwGene_quant/HumanNuclear_Orthopeak.quant /project2/gilad/briana/comparitive_threeprime/human/data/sort/*N-sort.bam -s 1
featureCounts -a /project2/gilad/briana/comparitive_threeprime/data/PeaksGeneAnno/chimpOrthoPeaks.sort.refgenes.sm.SAF -F SAF -o /project2/gilad/briana/comparitive_threeprime/data/PeakwGene_quant/ChimpTotal_Orthopeak.quant /project2/gilad/briana/comparitive_threeprime/chimp/data/sort/*T-sort.bam -s 1
featureCounts -a /project2/gilad/briana/comparitive_threeprime/data/PeaksGeneAnno/chimpOrthoPeaks.sort.refgenes.sm.SAF -F SAF -o /project2/gilad/briana/comparitive_threeprime/data/PeakwGene_quant/ChimpNuclear_Orthopeak.quant /project2/gilad/briana/comparitive_threeprime/chimp/data/sort/*N-sort.bam -s 1
This leads to loss a large number of reads most likely because they fall outside of the orthologous peaks or in peaks not assigned to genes. This is a huge problem in the chimps because there are only 2000 annotated genes.
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] workflowr_1.1.1 Rcpp_0.12.18 digest_0.6.16
[4] rprojroot_1.3-2 R.methodsS3_1.7.1 backports_1.1.2
[7] git2r_0.23.0 magrittr_1.5 evaluate_0.11
[10] stringi_1.2.4 whisker_0.3-2 R.oo_1.22.0
[13] R.utils_2.7.0 rmarkdown_1.10 tools_3.5.1
[16] stringr_1.3.1 yaml_2.2.0 compiler_3.5.1
[19] htmltools_0.3.6 knitr_1.20
This reproducible R Markdown analysis was created with workflowr 1.1.1