Strand Specificity of Net seq data

Last updated: 2018-03-26

Code version: 8e2043e

The purpose of this analysis is to explore the strand specificity of my net seq data. Within genes we expect strand specific expression and at TSS we would expect some convergent transcription. I will filter the overlaping genes as done in the Meyer paper.

Mayer filtering: Pol-II-transcribed genes that do not overlap other genes within 2.5 kb of the TSS and polyadenylation site and are longer than 2 kb

Filter the genes

bedtools closest -N

Get the protein coding genes:

grep 'protein_coding' gencode.v19.annotation.bed  | awk '$8 == "gene"' |cut -f1-6  |sed 's/^chr//' >  gencode.v19.annotation.proteincodinggene.bed

Only keep genes that are 2kb. I want genes where end (3) - start (2) is greater than 2000

awk '{if( ($3 - $2) > 2000)  print($1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6)}' gencode.v19.annotation.proteincodinggene.bed > gencode.v19.annotation.proteincodinggene.2kb.bed

Next step is to look for overlab distance with bedtools closest -N.

gene_dist.sh


#!/bin/bash

#SBATCH --job-name=gene_dist
#SBATCH --time=8:00:00
#SBATCH --output=gene_dist.out
#SBATCH --error=gene_dist.err
#SBATCH --partition=gilad
#SBATCH --mem=10G
#SBATCH --mail-type=END

module load Anaconda3  

source activate net-seq 

bedtools closest -N -d -a /project2/gilad/briana/genome_anotation_data/gencode.v19.annotation.proteincodinggene.2kb.bed -b /project2/gilad/briana/genome_anotation_data/gencode.v19.annotation.proteincodinggene.2kb.bed > /project2/gilad/briana/genome_anotation_data/gencode.v19.annotation.proteincodinggene.2kb.dist.bed

Column 13 of this file must be > 2500

awk '{if( $13 > 2500)  print($1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6)}' gencode.v19.annotation.proteincodinggene.2kb.dist.bed > gencode.v19.annotation.distfilteredgenes.bed

This leaves 9367 genes.

Get strand specific gene counts for these genes

Do this for 18486 first.

Sort both files with sort -k1,1 -k2,2n


#!/bin/bash

#SBATCH --job-name=strand
#SBATCH --time=8:00:00
#SBATCH --output=strand.out
#SBATCH --error=strnad.err
#SBATCH --partition=broadwl
#SBATCH --mem=40G
#SBATCH --mail-type=END



bedtools coverage -counts -s -sorted -a /project2/gilad/briana/genome_anotation_data/gencode.v19.annotation.distfilteredgenes.sort.bed -b /project2/gilad/briana/Net-seq-pilot/data/bed/YG-SP-NET3-18486_combined_Netpilot-bedsort.bed  > /project2/gilad/briana/Net-seq-pilot/output/strand_spec/same_strand_genecounts_18486.txt 


bedtools coverage -counts -S -sorted -a /project2/gilad/briana/genome_anotation_data/gencode.v19.annotation.distfilteredgenes.sort.bed -b /project2/gilad/briana/Net-seq-pilot/data/bed/YG-SP-NET3-18486_combined_Netpilot-bedsort.bed  > /project2/gilad/briana/Net-seq-pilot/output/strand_spec/opp_strand_genecounts_18486.txt

Process data

Library input:

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(workflowr)

Loading required package: rmarkdown

This is workflowr version 0.7.0
Run ?workflowr for help getting started

library(ggplot2)

Input data:

opp_strand=read.table("../data/same_strand_genecounts_18486.txt", header=FALSE, stringsAsFactors = FALSE)
colnames(opp_strand)= c("chr", "start", "end", "gene","score", "strand", "opp_count")
same_strand=read.table("../data/opp_strand_genecounts_18486.txt", header=FALSE, stringsAsFactors = FALSE)
colnames(same_strand)= c("chr", "start", "end", "gene","score", "strand", "same_count")

Merge on the gene names:

full_strand=inner_join(same_strand,opp_strand, by=c(c("chr", "start", "end", "gene","score", "strand")))
full_strand= full_strand %>% mutate(perc_same=(same_count /(same_count + opp_count))) %>% filter(perc_same != "NaN")

Explore data:

summary(full_strand$perc_same)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.5000  0.8333  0.6927  0.9611  1.0000

plot(sort(full_strand$perc_same, decreasing=TRUE), col=ifelse((sort(full_strand$perc_same, decreasing=TRUE)) >.5, "red", "black"), ylab="same strand / gene count", xlab="Genes ordered by strand spec percent", main="Gene counts strand specificity 18486")
legend("topright", legend = c("top 75%"), col=c("red"), pch=16, cex=1)

boxplot(log10(full_strand$same_count), log10(full_strand$opp_count), names=c("Same strand", "Opposite Strand"), ylab="log10 Gene counts", main="Gene counts by strand")

Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
$group == : Outlier (-Inf) in boxplot 1 is not drawn

Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
$group == : Outlier (-Inf) in boxplot 2 is not drawn

I am going to run the gene counts analysis not strand specific to make sure I am getting a sum of the these counts.


#!/bin/bash

#SBATCH --job-name=all_counts
#SBATCH --time=8:00:00
#SBATCH --output=all_strand.out
#SBATCH --error=all_strnad.err
#SBATCH --partition=broadwl
#SBATCH --mem=40G
#SBATCH --mail-type=END



bedtools coverage -counts  -sorted -a /project2/gilad/briana/genome_anotation_data/gencode.v19.annotation.distfilteredgenes.sort.bed -b /project2/gilad/briana/Net-seq-pilot/data/bed/YG-SP-NET3-18486_combined_Netpilot-bedsort.bed  > /project2/gilad/briana/Net-seq-pilot/output/strand_spec/all_strand_genecounts_18486.txt

This file confirmed that the counts without specificity is the sum of the strand information.

Try this analysis in another line

strand_spec_18505.sh

#!/bin/bash

#SBATCH --job-name=strand_18505
#SBATCH --time=8:00:00
#SBATCH --output=strand.out
#SBATCH --error=strnad.err
#SBATCH --partition=broadwl
#SBATCH --mem=40G
#SBATCH --mail-type=END


bedtools coverage -counts -s -sorted -a /project2/gilad/briana/genome_anotation_data/gencode.v19.annotation.distfilteredgenes.sort.bed -b /project2/gilad/briana/Net-seq-pilot/data/bed/YG-SP-NET3-18505_combined_Netpilot-bedsort.bed  > /project2/gilad/briana/Net-seq-pilot/output/strand_spec/same_strand_genecounts_18505.txt 


bedtools coverage -counts -S -sorted -a /project2/gilad/briana/genome_anotation_data/gencode.v19.annotation.distfilteredgenes.sort.bed -b /project2/gilad/briana/Net-seq-pilot/data/bed/YG-SP-NET3-18505_combined_Netpilot-bedsort.bed  > /project2/gilad/briana/Net-seq-pilot/output/strand_spec/opp_strand_genecounts_18505.txt

opp_strand_18505=read.table("../data/same_strand_genecounts_18505.txt", header=FALSE, stringsAsFactors = FALSE)
colnames(opp_strand_18505)= c("chr", "start", "end", "gene","score", "strand", "opp_count")
same_strand_18505=read.table("../data/opp_strand_genecounts_18505.txt", header=FALSE, stringsAsFactors = FALSE)
colnames(same_strand_18505)= c("chr", "start", "end", "gene","score", "strand", "same_count")


full_strand_18505=inner_join(same_strand_18505,opp_strand_18505, by=c(c("chr", "start", "end", "gene","score", "strand")))
full_strand_18505= full_strand_18505 %>% mutate(perc_same=(same_count /(same_count + opp_count))) %>% filter(perc_same != "NaN")

summary(full_strand_18505$perc_same)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.4545  0.8333  0.6850  0.9722  1.0000

plot(sort(full_strand_18505$perc_same, decreasing = TRUE), col=ifelse((sort(full_strand_18505$perc_same, decreasing = TRUE))>.45, "red", "black"), ylab="same strand / gene count", xlab="Genes ordered by strand spec percent", main="Gene counts strand specificity 18505")
legend("topright", legend = c("top 75%"), col=c("red"), pch=16, cex=1)

boxplot(log10(full_strand_18505$same_count), log10(full_strand_18505$opp_count), names=c("Same strand", "Opposite Strand"), ylab="log10 Gene counts", main="Gene counts by strand 18505")

Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
$group == : Outlier (-Inf) in boxplot 1 is not drawn

Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
$group == : Outlier (-Inf) in boxplot 2 is not drawn

Distribution for genes >1 18486

gene_counts_18486=read.table("../data/all_strand_genecounts_18486.txt", stringsAsFactors = FALSE, header=FALSE)
colnames(gene_counts_18486)= c("chr", "start", "end", "gene","score", "strand", "count")
mapped_18486_mil= 65189389 / 10^6
gene_counts_18486=gene_counts_18486 %>% mutate(K_length= (end - start)/100) %>% mutate(rpkm= + count/(K_length * mapped_18486_mil)) %>% filter(rpkm>=1)

Join this with the strand specific information.

top_genes_strand= inner_join(full_strand, gene_counts_18486, by=c("chr", "start", "end", "gene","score", "strand"))

Plot this distribution:

plot(sort(top_genes_strand$perc_same, decreasing = TRUE), ylab="same strand / gene count", xlab="Genes ordered by strand spec percent", main="Gene counts strand specificity 18486 for genes >1 RPKM")

Box plots:

boxplot(log10(top_genes_strand$same_count), log10(top_genes_strand$opp_count), names=c("Same strand", "Opposite Strand"), ylab="log10 Gene counts", main="Gene counts by strand, genes >1 RPKM")

Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
$group == : Outlier (-Inf) in boxplot 1 is not drawn

Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
$group == : Outlier (-Inf) in boxplot 2 is not drawn

Filter out the small RNAs that fall in these genes

Filter the bams as I did in the create blacklist. filt_smRNA_bam.sh

#!/bin/bash

#SBATCH --job-name=intersect_bam
#SBATCH --time=8:00:00
#SBATCH --output=intbam_sbatch.out
#SBATCH --error=intbam_sbatch.err
#SBATCH --partition=broadwl
#SBATCH --mem=20G
#SBATCH --mail-type=END


module load Anaconda3  

source activate net-seq 


sample=$1

describer=$(echo ${sample} | sed -e 's/.*\YG-SP-//' | sed -e "s/_combined_Netpilot-sort.bam$//")


bedtools intersect -v -wa -abam $1 -b /project2/gilad/briana/genome_anotation_data/snRNA.gencode.v19.nochr.bed /project2/gilad/briana/genome_anotation_data/snoRNA.gencode.v19.nochr.bed  > /project2/gilad/briana/Net-seq-pilot/data/filter_bam/${describer}.rnafilter.bam

run on: /project2/gilad/briana/Net-seq-pilot/data/sort/YG-SP-NET3-18486_combined_Netpilot-sort.bam

Now when I look at the bams in IGV I have removed the snoRNA in TPT1.

Rerun analysis with Filtered bam

This time I will switch the -s and -S so we get the correct same/opp strand information in the txt files.

Convert the filtered bam file to a bed file and sort it:

bamToBed -i NET3-18486.rnafilter.bam > ../bed/YG-SP-NET3-18486_combined_Netpilot-rnafilt.bed


sort -k1,1 -k2,2n YG-SP-NET3-18486_combined_Netpilot-rnafilt.bed > YG-SP-NET3-18486_combined_Netpilot-rnafilt-sort.bed

strand_filt.sh

#!/bin/bash

#SBATCH --job-name=_filtstrand
#SBATCH --time=8:00:00
#SBATCH --output=filtstrand.out
#SBATCH --error=filtstrnad.err
#SBATCH --partition=broadwl
#SBATCH --mem=40G
#SBATCH --mail-type=END



bedtools coverage -counts -S -sorted -a /project2/gilad/briana/genome_anotation_data/gencode.v19.annotation.distfilteredgenes.sort.bed -b /project2/gilad/briana/Net-seq-pilot/data/bed/YG-SP-NET3-18486_combined_Netpilot-rnafilt-sort.bed  > /project2/gilad/briana/Net-seq-pilot/output/strand_spec/same_strand_genecounts_filt_18486.txt 


bedtools coverage -counts -s -sorted -a /project2/gilad/briana/genome_anotation_data/gencode.v19.annotation.distfilteredgenes.sort.bed -b /project2/gilad/briana/Net-seq-pilot/data/bed/YG-SP-NET3-18486_combined_Netpilot-rnafilt-sort.bed  > /project2/gilad/briana/Net-seq-pilot/output/strand_spec/opp_strand_genecounts_filt_18486.txt

opp_strand_18486_filt=read.table("../data/opp_strand_genecounts_filt_18486.txt", header=FALSE, stringsAsFactors = FALSE)
colnames(opp_strand_18486_filt)= c("chr", "start", "end", "gene","score", "strand", "opp_count")
same_strand_18486_filt=read.table("../data/same_strand_genecounts_filt_18486.txt", header=FALSE, stringsAsFactors = FALSE)
colnames(same_strand_18486_filt)= c("chr", "start", "end", "gene","score", "strand", "same_count")


full_strand_18486_filt=inner_join(same_strand_18486_filt, opp_strand_18486_filt, by=c(c("chr", "start", "end", "gene","score", "strand")))
full_strand_18486_filt= full_strand_18486_filt %>% mutate(perc_same=(same_count /(same_count + opp_count))) %>% filter(perc_same != "NaN")

summary(full_strand_18486_filt$perc_same)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.5000  0.8319  0.6927  0.9594  1.0000

plot(sort(full_strand_18486_filt$perc_same, decreasing = TRUE), col=ifelse((sort(full_strand_18486_filt$perc_same, decreasing = TRUE))>.5, "red", "black"), ylab="same strand / gene count", xlab="Genes ordered by strand spec percent", main="Gene counts strand specificity filtered 18486")
legend("topright", legend = c("top 75%"), col=c("red"), pch=16, cex=1)

boxplot(log10(full_strand_18486_filt$same_count), log10(full_strand_18486_filt$opp_count), names=c("Same strand", "Opposite Strand"), ylab="log10 Gene counts", main="Gene counts by strand 18486 filtered")

Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
$group == : Outlier (-Inf) in boxplot 1 is not drawn

Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
$group == : Outlier (-Inf) in boxplot 2 is not drawn

 hist(full_strand_18486_filt$perc_same, xlab="Same strand/full count for gene", main="18486 strand specificity on filtered genes and filtered bams")

Session information

sessionInfo()

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2    ggplot2_2.2.1   workflowr_0.7.0 rmarkdown_1.8.5
[5] dplyr_0.7.4    

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15     knitr_1.18       bindr_0.1        magrittr_1.5    
 [5] munsell_0.4.3    colorspace_1.3-2 R6_2.2.2         rlang_0.1.6     
 [9] plyr_1.8.4       stringr_1.2.0    tools_3.4.2      grid_3.4.2      
[13] gtable_0.2.0     git2r_0.21.0     htmltools_0.3.6  lazyeval_0.2.1  
[17] yaml_2.1.16      rprojroot_1.3-2  digest_0.6.14    assertthat_0.2.0
[21] tibble_1.4.2     glue_1.2.0       evaluate_0.10.1  stringi_1.1.6   
[25] compiler_3.4.2   pillar_1.1.0     scales_0.5.0     backports_1.1.2 
[29] pkgconfig_2.0.1

This R Markdown site was created with workflowr