Location of Signal Site

Last updated: 2019-03-26

Checks: 6 0

Knit directory: threeprimeseq/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.2.0). The Report tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(12345)

The command set.seed(12345) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Repository version: e7a4994

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/.DS_Store
    Ignored:    data/perm_QTL_trans_noMP_5percov/
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  KalistoAbundance18486.txt
    Untracked:  analysis/4suDataIGV.Rmd
    Untracked:  analysis/AdaptTonyExampleplots.Rmd
    Untracked:  analysis/DirectionapaQTL.Rmd
    Untracked:  analysis/EvaleQTLs.Rmd
    Untracked:  analysis/YL_QTL_test.Rmd
    Untracked:  analysis/groSeqAnalysis.Rmd
    Untracked:  analysis/ncbiRefSeq_sm.sort.mRNA.bed
    Untracked:  analysis/peaksWithMisprimming.Rmd
    Untracked:  analysis/snake.config.notes.Rmd
    Untracked:  analysis/verifyBAM.Rmd
    Untracked:  analysis/verifybam_dubs.Rmd
    Untracked:  code/PeaksToCoverPerReads.py
    Untracked:  code/strober_pc_pve_heatmap_func.R
    Untracked:  data/18486.genecov.txt
    Untracked:  data/APApeaksYL.total.inbrain.bed
    Untracked:  data/AllPeak_counts/
    Untracked:  data/ApaQTLs/
    Untracked:  data/ApaQTLs_otherPhen/
    Untracked:  data/CTCF/
    Untracked:  data/ChromHmmOverlap/
    Untracked:  data/DistTXN2Peak_genelocAnno/
    Untracked:  data/EmpiricalDists/
    Untracked:  data/FeatureoverlapPeaks/
    Untracked:  data/GM12878.chromHMM.bed
    Untracked:  data/GM12878.chromHMM.txt
    Untracked:  data/GWAS_overlap/
    Untracked:  data/LianoglouLCL/
    Untracked:  data/LocusZoom/
    Untracked:  data/LocusZoom_Unexp/
    Untracked:  data/LocusZoom_proc/
    Untracked:  data/MatchedSnps/
    Untracked:  data/NucSpecQTL/
    Untracked:  data/NuclearApaQTLs.txt
    Untracked:  data/PeakCounts/
    Untracked:  data/PeakCounts_noMP_5perc/
    Untracked:  data/PeakCounts_noMP_genelocanno/
    Untracked:  data/PeakUsage/
    Untracked:  data/PeakUsage_noMP/
    Untracked:  data/PeakUsage_noMP_GeneLocAnno/
    Untracked:  data/PeaksUsed/
    Untracked:  data/PeaksUsed_noMP_5percCov/
    Untracked:  data/PolyA_DB/
    Untracked:  data/QTL_overlap/
    Untracked:  data/RNAdecay/
    Untracked:  data/RNAkalisto/
    Untracked:  data/RefSeq_annotations/
    Untracked:  data/Replicates_usage/
    Untracked:  data/Signal_Loc/
    Untracked:  data/TotalApaQTLs.txt
    Untracked:  data/Totalpeaks_filtered_clean.bed
    Untracked:  data/UnderstandPeaksQC/
    Untracked:  data/WASP_STAT/
    Untracked:  data/YL-SP-18486-T-combined-genecov.txt
    Untracked:  data/YL-SP-18486-T_S9_R1_001-genecov.txt
    Untracked:  data/YL_QTL_test/
    Untracked:  data/apaExamp/
    Untracked:  data/apaExamp_proc/
    Untracked:  data/apaQTL_examp_noMP/
    Untracked:  data/bedgraph_peaks/
    Untracked:  data/bin200.5.T.nuccov.bed
    Untracked:  data/bin200.Anuccov.bed
    Untracked:  data/bin200.nuccov.bed
    Untracked:  data/clean_peaks/
    Untracked:  data/comb_map_stats.csv
    Untracked:  data/comb_map_stats.xlsx
    Untracked:  data/comb_map_stats_39ind.csv
    Untracked:  data/combined_reads_mapped_three_prime_seq.csv
    Untracked:  data/diff_iso_GeneLocAnno/
    Untracked:  data/diff_iso_proc/
    Untracked:  data/diff_iso_trans/
    Untracked:  data/eQTL_inAPA/
    Untracked:  data/eQTLs_Lietal/
    Untracked:  data/ensemble_to_genename.txt
    Untracked:  data/example_gene_peakQuant/
    Untracked:  data/explainProtVar/
    Untracked:  data/filtPeakOppstrand_cov_noMP_GeneLocAnno_5perc/
    Untracked:  data/filtered_APApeaks_merged_allchrom_refseqTrans.closest2End.bed
    Untracked:  data/filtered_APApeaks_merged_allchrom_refseqTrans.closest2End.noties.bed
    Untracked:  data/first50lines_closest.txt
    Untracked:  data/gencov.test.csv
    Untracked:  data/gencov.test.txt
    Untracked:  data/gencov_zero.test.csv
    Untracked:  data/gencov_zero.test.txt
    Untracked:  data/gene_cov/
    Untracked:  data/joined
    Untracked:  data/leafcutter/
    Untracked:  data/merged_combined_YL-SP-threeprimeseq.bg
    Untracked:  data/molPheno_noMP/
    Untracked:  data/mol_overlap/
    Untracked:  data/mol_pheno/
    Untracked:  data/nom_QTL/
    Untracked:  data/nom_QTL_opp/
    Untracked:  data/nom_QTL_trans/
    Untracked:  data/nuc6up/
    Untracked:  data/nuc_10up/
    Untracked:  data/other_qtls/
    Untracked:  data/pQTL_inAPA/
    Untracked:  data/pQTL_otherphen/
    Untracked:  data/pacbio_cov/
    Untracked:  data/peakPerRefSeqGene/
    Untracked:  data/peaks4DT/
    Untracked:  data/perm_QTL/
    Untracked:  data/perm_QTL_GeneLocAnno_noMP_5percov/
    Untracked:  data/perm_QTL_GeneLocAnno_noMP_5percov_3UTR/
    Untracked:  data/perm_QTL_diffWindow/
    Untracked:  data/perm_QTL_opp/
    Untracked:  data/perm_QTL_trans/
    Untracked:  data/perm_QTL_trans_filt/
    Untracked:  data/protAndAPAAndExplmRes.Rda
    Untracked:  data/protAndAPAlmRes.Rda
    Untracked:  data/protAndExpressionlmRes.Rda
    Untracked:  data/reads_mapped_three_prime_seq.csv
    Untracked:  data/smash.cov.results.bed
    Untracked:  data/smash.cov.results.csv
    Untracked:  data/smash.cov.results.txt
    Untracked:  data/smash_testregion/
    Untracked:  data/ssFC200.cov.bed
    Untracked:  data/temp.file1
    Untracked:  data/temp.file2
    Untracked:  data/temp.gencov.test.txt
    Untracked:  data/temp.gencov_zero.test.txt
    Untracked:  data/threePrimeSeqMetaData.csv
    Untracked:  data/threePrimeSeqMetaData55Ind.txt
    Untracked:  data/threePrimeSeqMetaData55Ind.xlsx
    Untracked:  data/threePrimeSeqMetaData55Ind_noDup.txt
    Untracked:  data/threePrimeSeqMetaData55Ind_noDup.xlsx
    Untracked:  data/threePrimeSeqMetaData55Ind_noDup_WASPMAP.txt
    Untracked:  data/threePrimeSeqMetaData55Ind_noDup_WASPMAP.xlsx
    Untracked:  data/threePrimeSeqMetaData55Ind_redobatch4.xlsx
    Untracked:  data/~$threePrimeSeqMetaData55Ind_redobatch4.xlsx
    Untracked:  manuscript/
    Untracked:  output/APAqtlExamp/
    Untracked:  output/LZ/
    Untracked:  output/deeptools_plots/
    Untracked:  output/picard/
    Untracked:  output/plots/
    Untracked:  output/qual.fig2.pdf

Unstaged changes:
    Modified:   analysis/28ind.peak.explore.Rmd
    Modified:   analysis/CompareLianoglouData.Rmd
    Modified:   analysis/HistoneModandPAS.Rmd
    Modified:   analysis/NewPeakPostMP.Rmd
    Modified:   analysis/NuclearSpecQTL.Rmd
    Modified:   analysis/PeakToXper.Rmd
    Modified:   analysis/RNAdecayAndAPA.Rmd
    Modified:   analysis/apaQTLoverlapGWAS.Rmd
    Modified:   analysis/characterize_apaQTLs.Rmd
    Modified:   analysis/cleanupdtseq.internalpriming.Rmd
    Modified:   analysis/coloc_apaQTLs_protQTLs.Rmd
    Modified:   analysis/dif.iso.usage.leafcutter.Rmd
    Modified:   analysis/diffIsoAnalysisNewMapping.Rmd
    Modified:   analysis/diff_iso_pipeline.Rmd
    Modified:   analysis/explainpQTLs.Rmd
    Modified:   analysis/explore.filters.Rmd
    Modified:   analysis/fixBWChromNames.Rmd
    Modified:   analysis/flash2mash.Rmd
    Modified:   analysis/initialPacBioQuant.Rmd
    Modified:   analysis/mispriming_approach.Rmd
    Modified:   analysis/overlapMolQTL.Rmd
    Modified:   analysis/overlapMolQTL.opposite.Rmd
    Modified:   analysis/overlap_qtls.Rmd
    Modified:   analysis/peakOverlap_oppstrand.Rmd
    Modified:   analysis/peakQCPPlots.Rmd
    Modified:   analysis/pheno.leaf.comb.Rmd
    Modified:   analysis/pipeline_55Ind.Rmd
    Modified:   analysis/swarmPlots_QTLs.Rmd
    Modified:   analysis/test.max2.Rmd
    Modified:   analysis/test.smash.Rmd
    Modified:   analysis/understandPeaks.Rmd
    Modified:   analysis/unexplainedeQTL_analysis.Rmd
    Modified:   code/Snakefile

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
Rmd	e7a4994	Briana Mittleman	2019-03-26	add AAAAAAA
html	27cd8f1	Briana Mittleman	2019-03-12	Build site.
Rmd	e2c9e1a	Briana Mittleman	2019-03-12	add prop of sites with signal
html	e60daf6	Briana Mittleman	2019-03-10	Build site.
Rmd	2ae4d1e	Briana Mittleman	2019-03-10	add new prop plot
html	04987b5	Briana Mittleman	2019-03-09	Build site.
Rmd	b78e201	Briana Mittleman	2019-03-09	add nuclear intronic
html	101e468	Briana Mittleman	2019-03-09	Build site.
Rmd	c9aeecf	Briana Mittleman	2019-03-09	add cononical vs not plots
html	0917c2a	Briana Mittleman	2019-03-09	Build site.
Rmd	2d022d6	Briana Mittleman	2019-03-09	add location proportion plots
html	ca1a9f4	Briana Mittleman	2019-03-08	Build site.
Rmd	64fe413	Briana Mittleman	2019-03-08	add signal site loc hist
html	ffb0e84	Briana Mittleman	2019-03-07	Build site.
Rmd	52e7514	Briana Mittleman	2019-03-07	start new analysis - 3’ side of peak
html	638d12e	Briana Mittleman	2019-03-07	Build site.
Rmd	72c34ce	Briana Mittleman	2019-03-07	signal site loc based on front side
html	4bf5d09	Briana Mittleman	2019-03-06	Build site.
Rmd	8717550	Briana Mittleman	2019-03-06	res for AATAAA
html	4023fe0	Briana Mittleman	2019-03-06	Build site.
Rmd	d561190	Briana Mittleman	2019-03-06	analysis up to getting seqs
html	ba63ea2	Briana Mittleman	2019-03-06	Build site.
Rmd	c200503	Briana Mittleman	2019-03-06	add signal site loc analysis

In the Signal Site enrichment analysis I looked at the peaks to see if signal sites are enriched upstream of my peaks. I found this is true but now I want to see where the signal sites are in comparison to my peaks. I am going to use the biostrings package tool matchPWM for this analysis.

library(workflowr)

This is workflowr version 1.2.0
Run ?workflowr for help getting started

library(tidyverse)

── Attaching packages ──────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

✔ ggplot2 3.1.0       ✔ purrr   0.3.1  
✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
✔ tidyr   0.8.3       ✔ stringr 1.4.0  
✔ readr   1.3.1       ✔ forcats 0.4.0

Warning: package 'tibble' was built under R version 3.5.2

Warning: package 'tidyr' was built under R version 3.5.2

Warning: package 'purrr' was built under R version 3.5.2

Warning: package 'dplyr' was built under R version 3.5.2

Warning: package 'stringr' was built under R version 3.5.2

Warning: package 'forcats' was built under R version 3.5.2

── Conflicts ─────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(Biostrings)

Warning: package 'Biostrings' was built under R version 3.5.2

Loading required package: BiocGenerics

Loading required package: parallel


Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:dplyr':

    combine, intersect, setdiff, union

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, basename, cbind,
    colMeans, colnames, colSums, dirname, do.call, duplicated,
    eval, evalq, Filter, Find, get, grep, grepl, intersect,
    is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
    paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
    Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which, which.max,
    which.min

Loading required package: S4Vectors

Loading required package: stats4


Attaching package: 'S4Vectors'

The following objects are masked from 'package:dplyr':

    first, rename

The following object is masked from 'package:tidyr':

    expand

The following object is masked from 'package:base':

    expand.grid

Loading required package: IRanges


Attaching package: 'IRanges'

The following objects are masked from 'package:dplyr':

    collapse, desc, slice

The following object is masked from 'package:purrr':

    reduce

Loading required package: XVector


Attaching package: 'XVector'

The following object is masked from 'package:purrr':

    compact


Attaching package: 'Biostrings'

The following object is masked from 'package:base':

    strsplit

library(BSgenome)

Loading required package: GenomeInfoDb

Warning: package 'GenomeInfoDb' was built under R version 3.5.2

Loading required package: GenomicRanges

Loading required package: rtracklayer

Warning: package 'rtracklayer' was built under R version 3.5.2

library(cowplot)

Warning: package 'cowplot' was built under R version 3.5.2


Attaching package: 'cowplot'

The following object is masked from 'package:ggplot2':

    ggsave

library(genomation)

Loading required package: grid

Make location file

I need to get the coordinates for the regions I care about. I want to look at the peak and 150bp upstream. This is probably larger than I will need to look at but it will be good to have an inclusive look first.

I want to use the peak file and make a file that is the peak and upstream 150:

Upstream150Bases.py

#python  
def main(Fin, Fout):
  outBed=open(Fout, "w")
  chrom_lengths=open("/project2/gilad/briana/genome_anotation_data/chrom_lengths2.sort.bed","r")
  #make a dictionary with chrom lengths
  length_dic={}
  for i in chrom_lengths:
    chrom, start, end = i.split()
    length_dic[str(chrom)]=int(end)  

#write file 
  for ln in open(Fin):
    chrom, start, end, name, score, strand = ln.split()
    chrom=str(chrom)
    if strand=="+":
      start_new=int(start)-150
      if start_new <= 1:
        start_new = 0 
      end_new= int(end)
      if end_new == 0:
        end_new=1
      outBed.write("%s\t%d\t%d\t%s\t%s\t%s\n"%(chrom, start_new, end_new, name, score, strand))
    if strand == "-":
      start_new=int(start)
      end_new=int(end) + 150
      outBed.write("%s\t%d\t%d\t%s\t%s\t%s\n"%(chrom, start_new, end_new, name, score, strand))
  outBed.close()  

if __name__ == "__main__":
    import sys
    inFile = sys.argv[1]
    outFile=sys.argv[2] 
    main(inFile, outFile)

run_get150up.sh

#!/bin/bash

#SBATCH --job-name=run_get150up
#SBATCH --account=pi-yangili1
#SBATCH --time=36:00:00
#SBATCH --output=run_get150upt.out
#SBATCH --error=run_get150up.err
#SBATCH --partition=broadwl
#SBATCH --mem=16G
#SBATCH --mail-type=END

module load Anaconda3
source activate three-prime-env

python Upstream150Bases.py /project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand.bed  /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_peakand150up.bed

Get subject (reads)

Input the regions:

Fix chromosomes:

PeakRegions=read.table("../data/Signal_Loc/APAPeaks_5percCov_fixedStrand_peakand150up.bed", header=F,col.names = c("chr","start", "end", "peak", "score", "strand")) %>% mutate(Chrom=paste("chr", chr, sep="")) %>% select(Chrom, start,end,peak,score,strand)
write.table(PeakRegions, file="../data/Signal_Loc/APAPeaks_5percCov_fixedStrand_peakand150up_fixedChr.bed", quote=F, col.names = F, row.names = F, sep="\t")

#convert to reads 

reads.GR= readGeneric(file="../data/Signal_Loc/APAPeaks_5percCov_fixedStrand_peakand150up_fixedChr.bed",chr =1, start = 2, end =3,  meta.cols =4, header=F, zero.based=TRUE,strand=6)

I need to overlap these positions with the genome

Make motifs

AATAAA= PWM("AATAAA", type = c("log2probratio", "prob"), prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25))
ATTAAA= PWM("ATTAAA", type = c("log2probratio", "prob"), prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25))
AGTAAA= PWM("AGTAAA", type = c("log2probratio", "prob"), prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25))
TATAAA= PWM("TATAAA", type = c("log2probratio", "prob"), prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25))
CATAAA= PWM("CATAAA", type = c("log2probratio", "prob"), prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25))
GATAAA= PWM("GATAAA", type = c("log2probratio", "prob"), prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25))
AATATA= PWM("AATATA", type = c("log2probratio", "prob"), prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25))
AATACA= PWM("AATACA", type = c("log2probratio", "prob"), prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25))
AATAGA= PWM("AATAGA", type = c("log2probratio", "prob"), prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25))
AAAAAG= PWM("AAAAAG", type = c("log2probratio", "prob"), prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25))
ACTAAA= PWM("ACTAAA", type = c("log2probratio", "prob"), prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25))

find the mathes

genome.hg19 <- getBSgenome("BSgenome.Hsapiens.UCSC.hg19")

#matches <- matchPWM(pwm=AATAAA, subject = genome.hg19) %>% keepStandardChromosomes(., species= "Homo sapiens")


DNAstringSetPeaks=data.frame(seq=getSeq(genome.hg19, reads.GR))

x=DNAString(DNAstringSetPeaks[1,1])

hits <- matchPWM(AATAAA,x,with.score=T) 


start(hits)

[1] 215 219 255

Look over and make hits file for all

list_AATAAA_res=c()
for (i in 1:nrow(DNAstringSetPeaks)){
  x=DNAString(DNAstringSetPeaks[i,1])
  list_AATAAA_res=c(list_AATAAA_res,matchPWM(AATAAA,x,with.score=T))
}

Get out the start positions:

starts_AATAAA=c()
nsig=c()
last_oc_AATAAA=c()
for (i in list_AATAAA_res){
  nsig=c(nsig, length(start(i)))
  starts_AATAAA=c(starts_AATAAA, start(i))
  #print(length(start(i)))
  if (length(start(i)) != 0 ){
    last_oc_AATAAA=c(last_oc_AATAAA, max(start(i),na.rm =T))
  }
}

Histogram of results:

summary(starts_AATAAA)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0    79.0   175.0   156.7   216.0  1612.0

hist(starts_AATAAA,breaks=10000)

Version	Author	Date
638d12e	Briana Mittleman	2019-03-07
4bf5d09	Briana Mittleman	2019-03-06

summary(nsig)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   3.000   3.857   5.000  74.000

sum(nsig==0)

[1] 3584

sum(nsig==1)

[1] 8358

hist(nsig,breaks=100)

Version	Author	Date
638d12e	Briana Mittleman	2019-03-07
4bf5d09	Briana Mittleman	2019-03-06

Look at the first occurence:

summary(last_oc_AATAAA)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0   209.0   215.0   213.1   231.0  1612.0

hist(last_oc_AATAAA,breaks=1000)

Version	Author	Date
638d12e	Briana Mittleman	2019-03-07
4bf5d09	Briana Mittleman	2019-03-06

I want to get the closest occurance of the

Change analysis:

I want to start at the end of the peak (most downstream) and look for the signal sites. For confidence a peak is one PAS, I will look only at peaks less than 100bp long. I will extend the peak upstream 100 basepairs. I will look from the downstream end for the sites.

Subset peaks less than 200bp

filterPeaks100length.py

peaks=open("/project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand.bed", "r")

outPeaks=open("/project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_filter4length.bed", "w")

nNotOk=0
for ln in peaks:
    start= int(ln.split()[1])
    end=int(ln.split()[2])
    length=end - start 
    if length <= 100:
        outPeaks.write(ln)
    else:
        nNotOk +=1
        
print(nNotOk)
outPeaks.close()

This filters 12105 peaks.

Get sequence for peaks and 100 upstream

Upstream100Bases_filteredpeaks.py

#python  
def main(Fin, Fout):
  outBed=open(Fout, "w")
  chrom_lengths=open("/project2/gilad/briana/genome_anotation_data/chrom_lengths2.sort.bed","r")
  #make a dictionary with chrom lengths
  length_dic={}
  for i in chrom_lengths:
    chrom, start, end = i.split()
    length_dic[str(chrom)]=int(end)  

#write file 
  for ln in open(Fin):
    chrom, start, end, name, score, strand = ln.split()
    chrom=str(chrom)
    if strand=="+":
      start_new=int(start)-100
      if start_new <= 1:
        start_new = 0 
      end_new= int(end)
      if end_new == 0:
        end_new=1
      outBed.write("%s\t%d\t%d\t%s\t%s\t%s\n"%(chrom, start_new, end_new, name, score, strand))
    if strand == "-":
      start_new=int(start)
      end_new=int(end) + 100
      outBed.write("%s\t%d\t%d\t%s\t%s\t%s\n"%(chrom, start_new, end_new, name, score, strand))
  outBed.close()  

if __name__ == "__main__":
    import sys
    inFile = sys.argv[1]
    outFile=sys.argv[2] 
    main(inFile, outFile)

Run this:

python Upstream100Bases_filteredpeaks.py /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_filter4length.bed  /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_filter4length_upstream100.bed

Run bedtools nuc for this to get the sequences:

nucpeaksand100up.sh

#!/bin/bash

#SBATCH --job-name=nucpeaksand100up
#SBATCH --account=pi-yangili1
#SBATCH --time=8:00:00
#SBATCH --output=nucpeaksand100up.out
#SBATCH --error=nucpeaksand100up.err
#SBATCH --partition=broadwl
#SBATCH --mem=36G
#SBATCH --mail-type=END

module load Anaconda3
source activate three-prime-env

bedtools nuc -seq -fi /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa -bed /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_filter4length_upstream100.bed  > /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_filter4length_upstream100_Seq.bed

This is looking at the positive strand left to right always. I need to go from the right and look at the reverse signal sites. I can look into ways to flip a string in python

‘a string’[::-1]

find distance to peaks

change region I am looking at before I do this - move to new analysis and try a new method for this analysis

DistPAS2Sig.py

def main(Insite, out):
  sigsite=[Insite]
  
  inBed=open("/project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_filter4length_upstream100_Seq.bed", "r")
  outRes=open(out, "w")
  
  #function for reverse compliments  
  
  def ReverseComplement1(seq):
      seq_dict = {'A':'T','T':'A','G':'C','C':'G', 'a':'t', 't':'a', 'g':'c', 'c':'g'}
      bases=[seq_dict[base] for base in seq]
      bases=reversed(bases)
      return("".join(bases))
  
  
  #reverse comp each signal site 
  sigsite_revComp=[]
  for i in sigsite:
      sigsite_revComp.append(ReverseComplement1(i))
      
  #want a dictionary for each of the sites and its reverse compliment:
  sigsites_dic={}
  for i in range(len(sigsite)):
      sigsites_dic[sigsite[i]]=sigsite_revComp[i]    
      
  
  #function to get occurance: takes in sig site and sequence (give it the correct stranded stuff)
  
  #make 2 of these, this is for the pos strand
  def getOccurance(sigsite, seq):
      if sigsite in seq:
          length=len(seq)
          pos= seq.rfind(sigsite)
          posF=length-pos
          return(posF)
      else:
          return(-9)
          
  #negative strand occurance function:
  
  def getOccurance_neg(sigsite, seq):
      sigsite=sigsites_dic[sigsite]
      if sigsite in seq:
          pos= seq.find(sigsite)
          return(pos + 6)
      else:
          return(-9)
  
  #i can only addpend the value if the function does not return -9
      
  
  #function i can run on each signal site  
  
  
  #loop through peaks and check for every site, first ask stand and do the rev  
  def loop41site(site):
      resList=[]
      for ln in inBed:
          strand=ln.split()[5]
          seq= ln.split()[15]
          if strand == "+":
              loc= getOccurance(site, seq)
              if loc !=-9:
                  resList.append(loc)
          else: 
              loc=getOccurance_neg(site,seq)
              if loc !=-9:
                  resList.append(loc)
      return(resList)
          
  
  #run this for each sig site
  res_dic={}
  for i in sigsite:
      res_dic[i]=[]
  for i in sigsite:
      reslist=loop41site(i)
      res_dic[i]=reslist
  
  
  outRes.write("%s\n"%(sigsite[0]))
  for i in reslist:
      outRes.write("%d\n"%(i))
      
  
  
  outRes.close()
      
    
if __name__ == "__main__":
    import sys
    Site_in = sys.argv[1]
    outFile= "/project2/gilad/briana/threeprimeseq/data/Signal_Loc/Loc_%s_Distance2end.txt"%(Site_in)
    main(Site_in, outFile)

make a test with just 1 site:

test_DistPAS2Sig.py


sigsite=['ATTAAA'] 

inBed=open("/project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_filter4length_upstream100_Seq.bed", "r")
outRes=open('/project2/gilad/briana/threeprimeseq/data/Signal_Loc/test.text', "w")

#function for reverse compliments  

def ReverseComplement1(seq):
    seq_dict = {'A':'T','T':'A','G':'C','C':'G', 'a':'t', 't':'a', 'g':'c', 'c':'g'}
    bases=[seq_dict[base] for base in seq]
    bases=reversed(bases)
    return("".join(bases))


#reverse comp each signal site 
sigsite_revComp=[]
for i in sigsite:
    sigsite_revComp.append(ReverseComplement1(i))
    
#want a dictionary for each of the sites and its reverse compliment:
sigsites_dic={}
for i in range(len(sigsite)):
    sigsites_dic[sigsite[i]]=sigsite_revComp[i]    
    

#function to get occurance: takes in sig site and sequence (give it the correct stranded stuff)

#make 2 of these, this is for the pos strand
def getOccurance(sigsite, seq):
    if sigsite in seq:
        print(sigsite)
        print(seq)
        pos= seq.rfind(sigsite)
        return(pos)
    else:
        return(-9)
        
#negative strand occurance function:

def getOccurance_neg(sigsite, seq):
    sigsite=sigsites_dic[sigsite]
    if sigsite in seq:
        pos= seq.find(sigsite)
        return(pos + 6)
    else:
        return(-9)

#i can only addpend the value if the function does not return -9
    

#function i can run on each signal site  


#loop through peaks and check for every site, first ask stand and do the rev  
def loop41site(site):
    resList=[]
    for ln in inBed:
        strand=ln.split()[5]
        seq= ln.split()[15]
        if strand == "+":
            loc= getOccurance(site, seq)
            print(loc)
            if loc !=-9:
                resList.append(str(loc))
        else: 
            loc=getOccurance_neg(site,seq)
            if loc !=-9:
                resList.append(str(loc))
    return(resList)
        

#run this for each sig site
res_dic={}
for i in sigsite:
    res_dic[i]=[]
for i in sigsite:
    reslist=loop41site(i)
    res_dic[i]=reslist
    
for key, value in res_dic.items():
    valString=":".join(value)
    outRes.write("%s\t%s\n"%(key, valString))


outRes.close()

run_DistPAS2Sig.sh

#!/bin/bash

#SBATCH --job-name=run_DistPAS2Sig
#SBATCH --account=pi-yangili1
#SBATCH --time=36:00:00
#SBATCH --output=run_DistPAS2Sig.out
#SBATCH --error=run_DistPAS2Sig.err
#SBATCH --partition=broadwl
#SBATCH --mem=16G
#SBATCH --mail-type=END

module load Anaconda3
source activate three-prime-env


python DistPAS2Sig.py AATAAA
python DistPAS2Sig.py ATTAAA
python DistPAS2Sig.py AGTAAA
python DistPAS2Sig.py TATAAA
python DistPAS2Sig.py CATAAA
python DistPAS2Sig.py GATAAA
python DistPAS2Sig.py AATATA
python DistPAS2Sig.py AATACA
python DistPAS2Sig.py AATAGA
python DistPAS2Sig.py AAAAAG
python DistPAS2Sig.py ACTAAA
python DistPAS2Sig.py AAAAAA

HEREEE

Make histograms:

Loc_AATAAA= read.table("../data/Signal_Loc/Loc_AATAAA_Distance2end.txt", header=T) %>% mutate(Site="AATAAA")
nrow(Loc_AATAAA)

[1] 11809

Loc_AAAAAG= read.table("../data/Signal_Loc/Loc_AAAAAG_Distance2end.txt", header=T) %>% mutate(Site="AAAAAG")


Loc_AATACA= read.table("../data/Signal_Loc/Loc_AATACA_Distance2end.txt", header=T) %>% mutate(Site="AATACA")


Loc_AATAGA= read.table("../data/Signal_Loc/Loc_AATAGA_Distance2end.txt", header=T) %>% mutate(Site="AATAGA")


Loc_AATATA= read.table("../data/Signal_Loc/Loc_AATATA_Distance2end.txt", header=T) %>% mutate(Site="AATATA")


Loc_ACTAAA= read.table("../data/Signal_Loc/Loc_ACTAAA_Distance2end.txt", header=T) %>% mutate(Site="ACTAAA")


Loc_AGTAAA= read.table("../data/Signal_Loc/Loc_AGTAAA_Distance2end.txt", header=T) %>% mutate(Site="AGTAAA")



Loc_ATTAAA= read.table("../data/Signal_Loc/Loc_ATTAAA_Distance2end.txt", header=T) %>% mutate(Site="ATTAAA")

Loc_CATAAA= read.table("../data/Signal_Loc/Loc_CATAAA_Distance2end.txt", header=T) %>% mutate(Site="CATAAA")

Loc_GATAAA= read.table("../data/Signal_Loc/Loc_GATAAA_Distance2end.txt", header=T) %>% mutate(Site="GATAAA")

Loc_TATAAA= read.table("../data/Signal_Loc/Loc_TATAAA_Distance2end.txt", header=T) %>% mutate(Site="TATAAA")

Loc_AAAAAA= read.table("../data/Signal_Loc/Loc_AAAAAA_Distance2end.txt", header=T) %>% mutate(Site="AAAAAA")

dist_Loc= ggplot(Loc_AATAAA, aes(x=AATAAA)) + 
  geom_histogram(bins=100, fill="red", alpha=.3) + 
  labs(title="Distance from PAS", x="Number of PAS", y="N basepair from PAS") +
  geom_histogram(bins=100,data=Loc_AAAAAG, aes(x=AAAAAG), fill="orange", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AATACA, aes(x=AATACA), fill="yellow", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AATAGA, aes(x=AATAGA), fill="green", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AATATA, aes(x=AATATA), fill="blue", alpha=.3) +
  geom_histogram(bins=100,data=Loc_ACTAAA, aes(x=ACTAAA), fill="purple", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AGTAAA, aes(x=AGTAAA), fill="firebrick3", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AGTAAA, aes(x=AGTAAA), fill="darksalmon", alpha=.3) +
  geom_histogram(bins=100,data=Loc_CATAAA, aes(x=CATAAA), fill="darkslategray", alpha=.3) +
  geom_histogram(bins=100,data=Loc_GATAAA, aes(x=GATAAA), fill="deeppink1", alpha=.3) +
  geom_histogram(bins=100,data=Loc_TATAAA, aes(x=TATAAA), fill="lightcyan1", alpha=.3) 
  geom_histogram(bins=100,data=Loc_AAAAAA, aes(x=AAAAAA), fill="black", alpha=.3)

mapping: x = ~AAAAAA 
geom_bar: na.rm = FALSE
stat_bin: binwidth = NULL, bins = 100, na.rm = FALSE, pad = FALSE
position_stack

dist_Loc

Version	Author	Date
e60daf6	Briana Mittleman	2019-03-10
ca1a9f4	Briana Mittleman	2019-03-08

Make a long dataframe for all of this to make it easier to manipulate.

colnames(Loc_AATAAA)=c("Count", "Site")
colnames(Loc_AAAAAG)=c("Count", "Site")
colnames(Loc_AATACA)=c("Count", "Site")
colnames(Loc_AATAGA)=c("Count", "Site")
colnames(Loc_AATATA)=c("Count", "Site")
colnames(Loc_ACTAAA)=c("Count", "Site")
colnames(Loc_AGTAAA)=c("Count", "Site")
colnames(Loc_CATAAA)=c("Count", "Site")
colnames(Loc_ATTAAA)=c("Count", "Site")
colnames(Loc_GATAAA)=c("Count", "Site")
colnames(Loc_TATAAA)=c("Count", "Site")
colnames(Loc_AAAAAA)=c("Count", "Site")


AllsiteDF=as.data.frame(rbind(Loc_AATAAA,Loc_AAAAAG,Loc_AATACA,Loc_AATAGA,Loc_AATATA,Loc_ACTAAA,Loc_AGTAAA,Loc_ATTAAA, Loc_GATAAA,Loc_TATAAA,Loc_CATAAA, Loc_AAAAAA)) %>% mutate(NegCount=-1*as.integer(as.character(Count)), Cononical=ifelse(Site=="AATAAA", "Yes","No"))


 AllsiteDF_to100= AllsiteDF %>% filter(Count < 100)

plot:

ggplot(AllsiteDF_to100, aes(group=Site, x=NegCount, fill=Site)) + geom_histogram(position="stack",bins=50 ) + labs(x="Distance from PAS", y="N annotated Sites", title="Location of annotated signal sites")

Version	Author	Date
e60daf6	Briana Mittleman	2019-03-10
0917c2a	Briana Mittleman	2019-03-09

Do this as proportion:

allpeak=33432
AllsiteDF_to100_prop=AllsiteDF_to100 %>% group_by(Site,NegCount) %>% summarise(CountperPos=n()) %>% mutate(TotCount=sum(CountperPos),prop=CountperPos/allpeak)

#%>% ungroup() %>% group_by(Site) %>% mutate(nType=sum(Count), prop=CountperPos/nType)

annoationPAS_allpeak=ggplot(AllsiteDF_to100_prop, aes(fill=Site, y=prop, x=NegCount)) + geom_density(stat="identity") + facet_wrap(~Site) + labs(x="Distance from PAS", y="Proportion of Signal Site", title="Location of annotated signal sites")

annoationPAS_allpeak

Version	Author	Date
27cd8f1	Briana Mittleman	2019-03-12
e60daf6	Briana Mittleman	2019-03-10
0917c2a	Briana Mittleman	2019-03-09

ggsave(annoationPAS_allpeak, file="../output/plots/annoationPAS_allpeakPropLocFacet.png")

Saving 7 x 5 in image

annoationPAS_allpeakProphist=ggplot(AllsiteDF_to100_prop, aes(fill=Site, by=Site, y=prop, x=NegCount)) + geom_histogram(stat="identity", position="stack") + labs(x="Distance from PAS", y="Proportion of Signal Site", title="Location of annotated signal sites")

Warning: Ignoring unknown parameters: binwidth, bins, pad

annoationPAS_allpeakProphist

Version	Author	Date
27cd8f1	Briana Mittleman	2019-03-12
e60daf6	Briana Mittleman	2019-03-10
0917c2a	Briana Mittleman	2019-03-09

ggsave(annoationPAS_allpeakProphist, file="../output/plots/annoationPAS_allpeakPropLocStackHist.png")

Saving 7 x 5 in image

Run for nuclear specific PAS:

Used more in NUclear * /project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc.bed

Used more in nucelar, in intron * /project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc_Intron.bed

Filter out ones that are too long:

filterPeaks100length_nuc.py

peaks=open(" /project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc.bed", "r")

outPeaks=open("/project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_SigUsageNuc_filter4length.bed", "w")

nNotOk=0
for ln in peaks:
    start= int(ln.split()[1])
    end=int(ln.split()[2])
    length=end - start 
    if length <= 100:
        outPeaks.write(ln)
    else:
        nNotOk +=1
        
print(nNotOk)
outPeaks.close()

filterPeaks100length_nucintron.py

peaks=open("/project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc_Intron.bed", "r")

outPeaks=open("/project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_SigUsageNuc_Intron_filter4length.bed", "w")

nNotOk=0
for ln in peaks:
    start= int(ln.split()[1])
    end=int(ln.split()[2])
    length=end - start 
    if length <= 100:
        outPeaks.write(ln)
    else:
        nNotOk +=1
        
print(nNotOk)
outPeaks.close()

python Upstream100Bases_filteredpeaks.py /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_SigUsageNuc_Intron_filter4length.bed /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_SigUsageNuc_Intron_filter4length_upstream100.bed

python Upstream100Bases_filteredpeaks.py /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_SigUsageNuc_filter4length.bed /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_SigUsageNuc_filter4length_upstream100.bed

nucpeaksand100up_nuc.sh

#!/bin/bash

#SBATCH --job-name=nucpeaksand100up
#SBATCH --account=pi-yangili1
#SBATCH --time=8:00:00
#SBATCH --output=nucpeaksand100up.out
#SBATCH --error=nucpeaksand100up.err
#SBATCH --partition=broadwl
#SBATCH --mem=36G
#SBATCH --mail-type=END

module load Anaconda3
source activate three-prime-env

bedtools nuc -seq -fi /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa -bed /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_SigUsageNuc_Intron_filter4length_upstream100.bed  > /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_SigUsageNuc_Intron_filter4length_upstream100_seq.bed

bedtools nuc -seq -fi /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa -bed /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_SigUsageNuc_filter4length_upstream100.bed  > /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_SigUsageNuc_filter4length_upstream100_seq.bed

DistPAS2Sig_nuclear.py

def main(Insite, out):
  sigsite=[Insite]
  
  inBed=open("/project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_SigUsageNuc_filter4length_upstream100_seq.bed", "r")
  outRes=open(out, "w")
  
  #function for reverse compliments  
  
  def ReverseComplement1(seq):
      seq_dict = {'A':'T','T':'A','G':'C','C':'G', 'a':'t', 't':'a', 'g':'c', 'c':'g'}
      bases=[seq_dict[base] for base in seq]
      bases=reversed(bases)
      return("".join(bases))
  
  
  #reverse comp each signal site 
  sigsite_revComp=[]
  for i in sigsite:
      sigsite_revComp.append(ReverseComplement1(i))
      
  #want a dictionary for each of the sites and its reverse compliment:
  sigsites_dic={}
  for i in range(len(sigsite)):
      sigsites_dic[sigsite[i]]=sigsite_revComp[i]    
      
  
  #function to get occurance: takes in sig site and sequence (give it the correct stranded stuff)
  
  #make 2 of these, this is for the pos strand
  def getOccurance(sigsite, seq):
      if sigsite in seq:
          length=len(seq)
          pos= seq.rfind(sigsite)
          posF=length-pos
          return(posF)
      else:
          return(-9)
          
  #negative strand occurance function:
  
  def getOccurance_neg(sigsite, seq):
      sigsite=sigsites_dic[sigsite]
      if sigsite in seq:
          pos= seq.find(sigsite)
          return(pos + 6)
      else:
          return(-9)
  
  #i can only addpend the value if the function does not return -9
      
  
  #function i can run on each signal site  
  
  
  #loop through peaks and check for every site, first ask stand and do the rev  
  def loop41site(site):
      resList=[]
      for ln in inBed:
          strand=ln.split()[5]
          seq= ln.split()[15]
          if strand == "+":
              loc= getOccurance(site, seq)
              if loc !=-9:
                  resList.append(loc)
          else: 
              loc=getOccurance_neg(site,seq)
              if loc !=-9:
                  resList.append(loc)
      return(resList)
          
  
  #run this for each sig site
  res_dic={}
  for i in sigsite:
      res_dic[i]=[]
  for i in sigsite:
      reslist=loop41site(i)
      res_dic[i]=reslist
  
  
  outRes.write("%s\n"%(sigsite[0]))
  for i in reslist:
      outRes.write("%d\n"%(i))
      
  
  
  outRes.close()
      
    
if __name__ == "__main__":
    import sys
    Site_in = sys.argv[1]
    outFile= "/project2/gilad/briana/threeprimeseq/data/Signal_Loc/Loc_%s_Distance2end_nuclear.txt"%(Site_in)
    main(Site_in, outFile)

DistPAS2Sig_nucIntron.py

def main(Insite, out):
  sigsite=[Insite]
  
  inBed=open("/project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_SigUsageNuc_Intron_filter4length_upstream100_seq.bed", "r")
  outRes=open(out, "w")
  
  #function for reverse compliments  
  
  def ReverseComplement1(seq):
      seq_dict = {'A':'T','T':'A','G':'C','C':'G', 'a':'t', 't':'a', 'g':'c', 'c':'g'}
      bases=[seq_dict[base] for base in seq]
      bases=reversed(bases)
      return("".join(bases))
  
  
  #reverse comp each signal site 
  sigsite_revComp=[]
  for i in sigsite:
      sigsite_revComp.append(ReverseComplement1(i))
      
  #want a dictionary for each of the sites and its reverse compliment:
  sigsites_dic={}
  for i in range(len(sigsite)):
      sigsites_dic[sigsite[i]]=sigsite_revComp[i]    
      
  
  #function to get occurance: takes in sig site and sequence (give it the correct stranded stuff)
  
  #make 2 of these, this is for the pos strand
  def getOccurance(sigsite, seq):
      if sigsite in seq:
          length=len(seq)
          pos= seq.rfind(sigsite)
          posF=length-pos
          return(posF)
      else:
          return(-9)
          
  #negative strand occurance function:
  
  def getOccurance_neg(sigsite, seq):
      sigsite=sigsites_dic[sigsite]
      if sigsite in seq:
          pos= seq.find(sigsite)
          return(pos + 6)
      else:
          return(-9)
  
  #i can only addpend the value if the function does not return -9
      
  
  #function i can run on each signal site  
  
  
  #loop through peaks and check for every site, first ask stand and do the rev  
  def loop41site(site):
      resList=[]
      for ln in inBed:
          strand=ln.split()[5]
          seq= ln.split()[15]
          if strand == "+":
              loc= getOccurance(site, seq)
              if loc !=-9:
                  resList.append(loc)
          else: 
              loc=getOccurance_neg(site,seq)
              if loc !=-9:
                  resList.append(loc)
      return(resList)
          
  
  #run this for each sig site
  res_dic={}
  for i in sigsite:
      res_dic[i]=[]
  for i in sigsite:
      reslist=loop41site(i)
      res_dic[i]=reslist
  
  
  outRes.write("%s\n"%(sigsite[0]))
  for i in reslist:
      outRes.write("%d\n"%(i))
      
  
  
  outRes.close()
      
    
if __name__ == "__main__":
    import sys
    Site_in = sys.argv[1]
    outFile= "/project2/gilad/briana/threeprimeseq/data/Signal_Loc/Loc_%s_Distance2end_nuclearIntron.txt"%(Site_in)
    main(Site_in, outFile)

Run both of these:
run_DistPAS2Sig_nuc.sh

#!/bin/bash

#SBATCH --job-name=run_DistPAS2Sig_nuc
#SBATCH --account=pi-yangili1
#SBATCH --time=36:00:00
#SBATCH --output=un_DistPAS2Sig_nuc.out
#SBATCH --error=un_DistPAS2Sig_nuc.err
#SBATCH --partition=broadwl
#SBATCH --mem=16G
#SBATCH --mail-type=END

module load Anaconda3
source activate three-prime-env


python DistPAS2Sig_nucIntron.py AATAAA
python DistPAS2Sig_nucIntron.py ATTAAA
python DistPAS2Sig_nucIntron.py AGTAAA
python DistPAS2Sig_nucIntron.py TATAAA
python DistPAS2Sig_nucIntron.py CATAAA
python DistPAS2Sig_nucIntron.py GATAAA
python DistPAS2Sig_nucIntron.py AATATA
python DistPAS2Sig_nucIntron.py AATACA
python DistPAS2Sig_nucIntron.py AATAGA
python DistPAS2Sig_nucIntron.py AAAAAG
python DistPAS2Sig_nucIntron.py ACTAAA
python DistPAS2Sig_nucIntron.py AAAAAA


python DistPAS2Sig_nuclear.py AATAAA
python DistPAS2Sig_nuclear.py ATTAAA
python DistPAS2Sig_nuclear.py AGTAAA
python DistPAS2Sig_nuclear.py TATAAA
python DistPAS2Sig_nuclear.py CATAAA
python DistPAS2Sig_nuclear.py GATAAA
python DistPAS2Sig_nuclear.py AATATA
python DistPAS2Sig_nuclear.py AATACA
python DistPAS2Sig_nuclear.py AATAGA
python DistPAS2Sig_nuclear.py AAAAAG
python DistPAS2Sig_nuclear.py ACTAAA
python DistPAS2Sig_nuclear.py AAAAAA

Loc_AATAAA_Nuc= read.table("../data/Signal_Loc/Loc_AATAAA_Distance2end_nuclear.txt", header=T) %>% mutate(Site="AATAAA")


Loc_AAAAAG_Nuc= read.table("../data/Signal_Loc/Loc_AAAAAG_Distance2end_nuclear.txt", header=T) %>% mutate(Site="AAAAAG")


Loc_AATACA_Nuc= read.table("../data/Signal_Loc/Loc_AATACA_Distance2end_nuclear.txt", header=T) %>% mutate(Site="AATACA")


Loc_AATAGA_Nuc= read.table("../data/Signal_Loc/Loc_AATAGA_Distance2end_nuclear.txt", header=T) %>% mutate(Site="AATAGA")


Loc_AATATA_Nuc =read.table("../data/Signal_Loc/Loc_AATATA_Distance2end_nuclear.txt", header=T) %>% mutate(Site="AATATA")


Loc_ACTAAA_Nuc= read.table("../data/Signal_Loc/Loc_ACTAAA_Distance2end_nuclear.txt", header=T) %>% mutate(Site="ACTAAA")


Loc_AGTAAA_Nuc= read.table("../data/Signal_Loc/Loc_AGTAAA_Distance2end_nuclear.txt", header=T) %>% mutate(Site="AGTAAA")



Loc_ATTAAA_Nuc=read.table("../data/Signal_Loc/Loc_ATTAAA_Distance2end_nuclear.txt", header=T) %>% mutate(Site="ATTAAA")

Loc_CATAAA_Nuc= read.table("../data/Signal_Loc/Loc_CATAAA_Distance2end_nuclear.txt", header=T) %>% mutate(Site="CATAAA")

Loc_GATAAA_Nuc= read.table("../data/Signal_Loc/Loc_GATAAA_Distance2end_nuclear.txt", header=T) %>% mutate(Site="GATAAA")

Loc_TATAAA_Nuc= read.table("../data/Signal_Loc/Loc_TATAAA_Distance2end_nuclear.txt", header=T) %>% mutate(Site="TATAAA")
Loc_AAAAAA_Nuc= read.table("../data/Signal_Loc/Loc_AAAAAA_Distance2end_nuclear.txt", header=T) %>% mutate(Site="AAAAAA")


dist_Loc_Nuc= ggplot(Loc_AATAAA_Nuc, aes(x=AATAAA)) + 
  geom_histogram(bins=100, fill="red", alpha=.3) + 
  labs(title="Distance from PAS in Nuclear Used peaks", x="Number of PAS", y="N basepair from PAS") +
  geom_histogram(bins=100,data=Loc_AAAAAG_Nuc, aes(x=AAAAAG), fill="orange", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AATACA_Nuc, aes(x=AATACA), fill="yellow", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AATAGA_Nuc, aes(x=AATAGA), fill="green", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AATATA_Nuc, aes(x=AATATA), fill="blue", alpha=.3) +
  geom_histogram(bins=100,data=Loc_ACTAAA_Nuc, aes(x=ACTAAA), fill="purple", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AGTAAA_Nuc, aes(x=AGTAAA), fill="firebrick3", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AGTAAA_Nuc, aes(x=AGTAAA), fill="darksalmon", alpha=.3) +
  geom_histogram(bins=100,data=Loc_CATAAA_Nuc, aes(x=CATAAA), fill="darkslategray", alpha=.3) +
  geom_histogram(bins=100,data=Loc_GATAAA_Nuc, aes(x=GATAAA), fill="deeppink1", alpha=.3) +
  geom_histogram(bins=100,data=Loc_TATAAA_Nuc, aes(x=TATAAA), fill="lightcyan1", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AAAAAA_Nuc, aes(x=AAAAAA), fill="black", alpha=.3) 
  
  
dist_Loc_Nuc

Version	Author	Date
e60daf6	Briana Mittleman	2019-03-10
0917c2a	Briana Mittleman	2019-03-09

Make a long dataframe for all of this to make it easier to manipulate.

colnames(Loc_AATAAA_Nuc)=c("Count", "Site")
colnames(Loc_AAAAAG_Nuc)=c("Count", "Site")
colnames(Loc_AATACA_Nuc)=c("Count", "Site")
colnames(Loc_AATAGA_Nuc)=c("Count", "Site")
colnames(Loc_AATATA_Nuc)=c("Count", "Site")
colnames(Loc_ACTAAA_Nuc)=c("Count", "Site")
colnames(Loc_AGTAAA_Nuc)=c("Count", "Site")
colnames(Loc_CATAAA_Nuc)=c("Count", "Site")
colnames(Loc_ATTAAA_Nuc)=c("Count", "Site")
colnames(Loc_GATAAA_Nuc)=c("Count", "Site")
colnames(Loc_TATAAA_Nuc)=c("Count", "Site")
colnames(Loc_AAAAAA_Nuc)=c("Count", "Site")


AllsiteDF_Nuc=as.data.frame(rbind(Loc_AATAAA_Nuc,Loc_AAAAAG_Nuc,Loc_AATACA_Nuc,Loc_AATAGA_Nuc,Loc_AATATA_Nuc,Loc_ACTAAA_Nuc,Loc_AGTAAA_Nuc,Loc_ATTAAA_Nuc, Loc_GATAAA_Nuc,Loc_TATAAA_Nuc,Loc_CATAAA_Nuc,Loc_AAAAAA_Nuc)) %>% mutate(NegCount=-1*as.integer(as.character(Count)), Cononical=ifelse(Site=="AATAAA", "Yes","No"))


 AllsiteDF_to100_Nuc= AllsiteDF_Nuc %>% filter(Count < 100)

plot:

ggplot(AllsiteDF_to100_Nuc, aes(group=Site, x=NegCount, fill=Site)) + geom_histogram(position="stack",bins=50 ) + labs(x="Distance from PAS", y="N annotated Sites", title="Location of annotated signal sites- Used more in Nuclear")

Version	Author	Date
e60daf6	Briana Mittleman	2019-03-10
0917c2a	Briana Mittleman	2019-03-09
ca1a9f4	Briana Mittleman	2019-03-08

Do this as proportion:

nucmore=657
AllsiteDF_to100_prop_Nuc=AllsiteDF_to100_Nuc %>% group_by(Site,NegCount) %>% summarise(CountperPos=n()) %>% mutate(TotCount=sum(CountperPos),prop=CountperPos/nucmore)

#%>% ungroup() %>% group_by(Site) %>% mutate(nType=sum(Count), prop=CountperPos/nType)

annoationPAS_Nucpeak=ggplot(AllsiteDF_to100_prop_Nuc, aes(fill=Site, y=prop, x=NegCount)) + geom_histogram(stat="identity") + facet_wrap(~Site) + labs(x="Distance from PAS", y="Proportion of Signal Site", title="Location of annotated signal sites for PAS used more in Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

annoationPAS_Nucpeak

Version	Author	Date
27cd8f1	Briana Mittleman	2019-03-12
e60daf6	Briana Mittleman	2019-03-10
0917c2a	Briana Mittleman	2019-03-09

ggsave(annoationPAS_Nucpeak, file="../output/plots/annoationPAS_NuclearpeaksPropLocFacet.png")

Saving 7 x 5 in image

annoationPAS_NucpeakProphist=ggplot(AllsiteDF_to100_prop_Nuc, aes(fill=Site, by=Site, y=prop, x=NegCount)) + geom_histogram(stat="identity", position="stack") + labs(x="Distance from PAS", y="Proportion of Signal Site", title="Location of annotated signal sites used more in Nuc")

Warning: Ignoring unknown parameters: binwidth, bins, pad

annoationPAS_NucpeakProphist

Version	Author	Date
27cd8f1	Briana Mittleman	2019-03-12
e60daf6	Briana Mittleman	2019-03-10
0917c2a	Briana Mittleman	2019-03-09

ggsave(annoationPAS_NucpeakProphist, file="../output/plots/annoationPAS_NucpeakPropLocStackHist.png")

Saving 7 x 5 in image

Loc_AATAAA_Int= read.table("../data/Signal_Loc/Loc_AATAAA_Distance2end_nuclearIntron.txt", header=T)%>% mutate(Site="AATAAA")


Loc_AAAAAG_Int= read.table("../data/Signal_Loc/Loc_AAAAAG_Distance2end_nuclearIntron.txt", header=T)%>% mutate(Site="AAAAAG")


Loc_AATACA_Int= read.table("../data/Signal_Loc/Loc_AATACA_Distance2end_nuclearIntron.txt", header=T)%>% mutate(Site="AATACA")


Loc_AATAGA_Int= read.table("../data/Signal_Loc/Loc_AATAGA_Distance2end_nuclearIntron.txt", header=T)%>% mutate(Site="ATAGA")


Loc_AATATA_Int =read.table("../data/Signal_Loc/Loc_AATATA_Distance2end_nuclearIntron.txt", header=T)%>% mutate(Site="AATATA")

Loc_ACTAAA_Int= read.table("../data/Signal_Loc/Loc_ACTAAA_Distance2end_nuclearIntron.txt", header=T)%>% mutate(Site="ACTAAA")


Loc_AGTAAA_Int= read.table("../data/Signal_Loc/Loc_AGTAAA_Distance2end_nuclearIntron.txt", header=T)%>% mutate(Site="AGTAAA")



Loc_ATTAAA_Int=read.table("../data/Signal_Loc/Loc_ATTAAA_Distance2end_nuclearIntron.txt", header=T)%>% mutate(Site="ATTAAA")

Loc_CATAAA_Int= read.table("../data/Signal_Loc/Loc_CATAAA_Distance2end_nuclearIntron.txt", header=T)%>% mutate(Site="CATAAA")

Loc_GATAAA_Int= read.table("../data/Signal_Loc/Loc_GATAAA_Distance2end_nuclearIntron.txt", header=T)%>% mutate(Site="GATAAA")

Loc_TATAAA_Int= read.table("../data/Signal_Loc/Loc_TATAAA_Distance2end_nuclearIntron.txt", header=T)%>% mutate(Site="TATAAA")
Loc_AAAAAA_Int= read.table("../data/Signal_Loc/Loc_AAAAAA_Distance2end_nuclearIntron.txt", header=T)%>% mutate(Site="AAAAAA")


dist_Loc_Int= ggplot(Loc_AATAAA_Int, aes(x=AATAAA)) + 
  geom_histogram(bins=100, fill="red", alpha=.3) + 
  labs(title="Distance from PAS in Intronic Nuclear Used peaks", x="Number of PAS", y="N basepair from PAS") +
  geom_histogram(bins=100,data=Loc_AAAAAG_Int, aes(x=AAAAAG), fill="orange", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AATACA_Int, aes(x=AATACA), fill="yellow", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AATAGA_Int, aes(x=AATAGA), fill="green", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AATATA_Int, aes(x=AATATA), fill="blue", alpha=.3) +
  geom_histogram(bins=100,data=Loc_ACTAAA_Int, aes(x=ACTAAA), fill="purple", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AGTAAA_Int, aes(x=AGTAAA), fill="firebrick3", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AGTAAA_Int, aes(x=AGTAAA), fill="darksalmon", alpha=.3) +
  geom_histogram(bins=100,data=Loc_CATAAA_Int, aes(x=CATAAA), fill="darkslategray", alpha=.3) +
  geom_histogram(bins=100,data=Loc_GATAAA_Int, aes(x=GATAAA), fill="deeppink1", alpha=.3) +
  geom_histogram(bins=100,data=Loc_TATAAA_Int, aes(x=TATAAA), fill="lightcyan1", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AAAAAA_Int, aes(x=AAAAAA), fill="black", alpha=.3) 
  
  
dist_Loc_Int

Version	Author	Date
e60daf6	Briana Mittleman	2019-03-10
0917c2a	Briana Mittleman	2019-03-09

Make a long dataframe for all of this to make it easier to manipulate.

colnames(Loc_AATAAA_Int)=c("Count", "Site")
colnames(Loc_AAAAAG_Int)=c("Count", "Site")
colnames(Loc_AATACA_Int)=c("Count", "Site")
colnames(Loc_AATAGA_Int)=c("Count", "Site")
colnames(Loc_AATATA_Int)=c("Count", "Site")
colnames(Loc_ACTAAA_Int)=c("Count", "Site")
colnames(Loc_AGTAAA_Int)=c("Count", "Site")
colnames(Loc_CATAAA_Int)=c("Count", "Site")
colnames(Loc_ATTAAA_Int)=c("Count", "Site")
colnames(Loc_GATAAA_Int)=c("Count", "Site")
colnames(Loc_TATAAA_Int)=c("Count", "Site")
colnames(Loc_AAAAAA_Int)=c("Count", "Site")


AllsiteDF_NucInt=as.data.frame(rbind(Loc_AATAAA_Int,Loc_AAAAAG_Int,Loc_AATACA_Int,Loc_AATAGA_Int,Loc_AATATA_Int,Loc_ACTAAA_Int,Loc_AGTAAA_Int,Loc_ATTAAA_Int, Loc_GATAAA_Int,Loc_TATAAA_Int,Loc_CATAAA_Int,Loc_AAAAAA_Int)) %>% mutate(NegCount=-1*as.integer(as.character(Count)), Cononical=ifelse(Site=="AATAAA", "Yes","No"))


 AllsiteDF_to100_NucInt= AllsiteDF_NucInt %>% filter(Count < 100)

plot:

ggplot(AllsiteDF_to100_NucInt, aes(group=Site, x=NegCount, fill=Site)) + geom_histogram(position="stack",bins=50 ) + labs(x="Distance from PAS", y="N annotated Sites", title="Location of annotated signal sites- Intronic Used more in Nuclear")

Version	Author	Date
e60daf6	Briana Mittleman	2019-03-10
04987b5	Briana Mittleman	2019-03-09

Do this as proportion:

nucmore_int=449
AllsiteDF_to100_prop_NucInt=AllsiteDF_to100_NucInt %>% group_by(Site,NegCount) %>% summarise(CountperPos=n()) %>% mutate(TotCount=sum(CountperPos),prop=CountperPos/nucmore_int)

#%>% ungroup() %>% group_by(Site) %>% mutate(nType=sum(Count), prop=CountperPos/nType)

annoationPAS_NucIntpeak=ggplot(AllsiteDF_to100_prop_NucInt, aes(fill=Site, y=prop, x=NegCount)) + geom_histogram(stat="identity") + facet_wrap(~Site) + labs(x="Distance from PAS", y="Proportion of Signal Site", title="Location of annotated signal sites for Intronic PAS used more in Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

annoationPAS_NucIntpeak

Version	Author	Date
27cd8f1	Briana Mittleman	2019-03-12
e60daf6	Briana Mittleman	2019-03-10
04987b5	Briana Mittleman	2019-03-09

ggsave(annoationPAS_NucIntpeak, file="../output/plots/annoationPAS_NuclearIntronicpeaksPropLocFacet.png")

Saving 7 x 5 in image

annoationPAS_NucIntpeakProphist=ggplot(AllsiteDF_to100_prop_NucInt, aes(fill=Site, by=Site, y=prop, x=NegCount)) + geom_histogram(stat="identity", position="stack") + labs(x="Distance from PAS", y="Proportion of Signal Site", title="Location of annotated signal sites Intronic PAS used more in Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

annoationPAS_NucIntpeakProphist

Version	Author	Date
27cd8f1	Briana Mittleman	2019-03-12
e60daf6	Briana Mittleman	2019-03-10
04987b5	Briana Mittleman	2019-03-09

ggsave(annoationPAS_NucIntpeakProphist, file="../output/plots/annoationPAS_NucIntronicpeakPropLocStackHist.png")

Saving 7 x 5 in image

All intronic peaks:

*/project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_INTRON.bed (17854)

filterPeaks100length_Intron.py

peaks=open("/project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_INTRON.bed", "r")

outPeaks=open("/project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_INTRON_filter4length.bed", "w")

nNotOk=0
for ln in peaks:
    start= int(ln.split()[1])
    end=int(ln.split()[2])
    length=end - start 
    if length <= 100:
        outPeaks.write(ln)
    else:
        nNotOk +=1
        
print(nNotOk)
outPeaks.close()

12250 peaks

python Upstream100Bases_filteredpeaks.py /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_INTRON_filter4length.bed /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_INTRON_filter4length_upstream100.bed

nucpeaksand100up_intron.sh

#!/bin/bash

#SBATCH --job-name=nucpeaksand100up_intron
#SBATCH --account=pi-yangili1
#SBATCH --time=8:00:00
#SBATCH --output=nucpeaksand100up_intron.out
#SBATCH --error=nucpeaksand100up_intron.err
#SBATCH --partition=broadwl
#SBATCH --mem=36G
#SBATCH --mail-type=END

module load Anaconda3
source activate three-prime-env

bedtools nuc -seq -fi /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa -bed /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_INTRON_filter4length_upstream100.bed > /project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_INTRON_filter4length_upstream100_seq.bed

DistPAS2Sig_Intron.py

def main(Insite, out):
  sigsite=[Insite]
  
  inBed=open("/project2/gilad/briana/threeprimeseq/data/Signal_Loc/APAPeaks_5percCov_fixedStrand_INTRON_filter4length_upstream100_seq.bed", "r")
  outRes=open(out, "w")
  
  #function for reverse compliments  
  
  def ReverseComplement1(seq):
      seq_dict = {'A':'T','T':'A','G':'C','C':'G', 'a':'t', 't':'a', 'g':'c', 'c':'g'}
      bases=[seq_dict[base] for base in seq]
      bases=reversed(bases)
      return("".join(bases))
  
  
  #reverse comp each signal site 
  sigsite_revComp=[]
  for i in sigsite:
      sigsite_revComp.append(ReverseComplement1(i))
      
  #want a dictionary for each of the sites and its reverse compliment:
  sigsites_dic={}
  for i in range(len(sigsite)):
      sigsites_dic[sigsite[i]]=sigsite_revComp[i]    
      
  
  #function to get occurance: takes in sig site and sequence (give it the correct stranded stuff)
  
  #make 2 of these, this is for the pos strand
  def getOccurance(sigsite, seq):
      if sigsite in seq:
          length=len(seq)
          pos= seq.rfind(sigsite)
          posF=length-pos
          return(posF)
      else:
          return(-9)
          
  #negative strand occurance function:
  
  def getOccurance_neg(sigsite, seq):
      sigsite=sigsites_dic[sigsite]
      if sigsite in seq:
          pos= seq.find(sigsite)
          return(pos + 6)
      else:
          return(-9)
  
  #i can only addpend the value if the function does not return -9
      
  
  #function i can run on each signal site  
  
  
  #loop through peaks and check for every site, first ask stand and do the rev  
  def loop41site(site):
      resList=[]
      for ln in inBed:
          strand=ln.split()[5]
          seq= ln.split()[15]
          if strand == "+":
              loc= getOccurance(site, seq)
              if loc !=-9:
                  resList.append(loc)
          else: 
              loc=getOccurance_neg(site,seq)
              if loc !=-9:
                  resList.append(loc)
      return(resList)
          
  
  #run this for each sig site
  res_dic={}
  for i in sigsite:
      res_dic[i]=[]
  for i in sigsite:
      reslist=loop41site(i)
      res_dic[i]=reslist
  
  
  outRes.write("%s\n"%(sigsite[0]))
  for i in reslist:
      outRes.write("%d\n"%(i))
      
  
  
  outRes.close()
      
    
if __name__ == "__main__":
    import sys
    Site_in = sys.argv[1]
    outFile= "/project2/gilad/briana/threeprimeseq/data/Signal_Loc/Loc_%s_Distance2end_Intronic.txt"%(Site_in)
    main(Site_in, outFile)

run_DistPAS2Sig_Intron.sh

#!/bin/bash

#SBATCH --job-name=run_DistPAS2Sig_Intronc
#SBATCH --account=pi-yangili1
#SBATCH --time=36:00:00
#SBATCH --output=run_DistPAS2Sig_Intron.out
#SBATCH --error=run_DistPAS2Sig_Intron.err
#SBATCH --partition=broadwl
#SBATCH --mem=16G
#SBATCH --mail-type=END

module load Anaconda3
source activate three-prime-env


python DistPAS2Sig_Intron.py AATAAA
python DistPAS2Sig_Intron.py ATTAAA
python DistPAS2Sig_Intron.py AGTAAA
python DistPAS2Sig_Intron.py TATAAA
python DistPAS2Sig_Intron.py CATAAA
python DistPAS2Sig_Intron.py GATAAA
python DistPAS2Sig_Intron.py AATATA
python DistPAS2Sig_Intron.py AATACA
python DistPAS2Sig_Intron.py AATAGA
python DistPAS2Sig_Intron.py AAAAAG
python DistPAS2Sig_Intron.py ACTAAA
python DistPAS2Sig_Intron.py AAAAAA

Loc_AATAAA_AllInt= read.table("../data/Signal_Loc/Loc_AATAAA_Distance2end_Intronic.txt", header=T) %>% mutate(Site="AATAAA")


Loc_AAAAAG_AllInt= read.table("../data/Signal_Loc/Loc_AAAAAG_Distance2end_Intronic.txt", header=T) %>% mutate(Site="AAAAAG")


Loc_AATACA_AllInt= read.table("../data/Signal_Loc/Loc_AATACA_Distance2end_Intronic.txt", header=T) %>% mutate(Site="AATACA")


Loc_AATAGA_AllInt= read.table("../data/Signal_Loc/Loc_AATAGA_Distance2end_Intronic.txt", header=T) %>% mutate(Site="AATAGA")


Loc_AATATA_AllInt =read.table("../data/Signal_Loc/Loc_AATATA_Distance2end_Intronic.txt", header=T) %>% mutate(Site="AATATA")


Loc_ACTAAA_AllInt= read.table("../data/Signal_Loc/Loc_ACTAAA_Distance2end_Intronic.txt", header=T) %>% mutate(Site="ACTAAA")


Loc_AGTAAA_AllInt= read.table("../data/Signal_Loc/Loc_AGTAAA_Distance2end_Intronic.txt", header=T) %>% mutate(Site="AGTAAA")



Loc_ATTAAA_AllInt=read.table("../data/Signal_Loc/Loc_ATTAAA_Distance2end_Intronic.txt", header=T) %>% mutate(Site="ATTAAA")

Loc_CATAAA_AllInt= read.table("../data/Signal_Loc/Loc_CATAAA_Distance2end_Intronic.txt", header=T) %>% mutate(Site="CATAAA")

Loc_GATAAA_AllInt= read.table("../data/Signal_Loc/Loc_GATAAA_Distance2end_Intronic.txt", header=T)%>% mutate(Site="GATAAA")

Loc_TATAAA_AllInt= read.table("../data/Signal_Loc/Loc_TATAAA_Distance2end_Intronic.txt", header=T)%>% mutate(Site="TATAAA")

Loc_AAAAAA_AllInt= read.table("../data/Signal_Loc/Loc_AAAAAA_Distance2end_Intronic.txt", header=T)%>% mutate(Site="AAAAAA")

dist_Loc_AllInt= ggplot(Loc_AATAAA_AllInt, aes(x=AATAAA)) + 
  geom_histogram(bins=100, fill="red", alpha=.3) + 
  labs(title="Distance from PAS in Intronic peaks", x="Number of PAS", y="N basepair from PAS") +
  geom_histogram(bins=100,data=Loc_AAAAAG_AllInt, aes(x=AAAAAG), fill="orange", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AATACA_AllInt, aes(x=AATACA), fill="yellow", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AATAGA_AllInt, aes(x=AATAGA), fill="green", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AATATA_AllInt, aes(x=AATATA), fill="blue", alpha=.3) +
  geom_histogram(bins=100,data=Loc_ACTAAA_AllInt, aes(x=ACTAAA), fill="purple", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AGTAAA_AllInt, aes(x=AGTAAA), fill="firebrick3", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AGTAAA_AllInt, aes(x=AGTAAA), fill="darksalmon", alpha=.3) +
  geom_histogram(bins=100,data=Loc_CATAAA_AllInt, aes(x=CATAAA), fill="darkslategray", alpha=.3) +
  geom_histogram(bins=100,data=Loc_GATAAA_AllInt, aes(x=GATAAA), fill="deeppink1", alpha=.3) +
  geom_histogram(bins=100,data=Loc_TATAAA_AllInt, aes(x=TATAAA), fill="lightcyan1", alpha=.3) +
  geom_histogram(bins=100,data=Loc_AAAAAA_AllInt, aes(x=AAAAAA), fill="black", alpha=.3) 
  
  
dist_Loc_AllInt

Version	Author	Date
e60daf6	Briana Mittleman	2019-03-10
04987b5	Briana Mittleman	2019-03-09

Make a long dataframe for all of this to make it easier to manipulate.

colnames(Loc_AATAAA_AllInt)=c("Count", "Site")
colnames(Loc_AAAAAG_AllInt)=c("Count", "Site")
colnames(Loc_AATACA_AllInt)=c("Count", "Site")
colnames(Loc_AATAGA_AllInt)=c("Count", "Site")
colnames(Loc_AATATA_AllInt)=c("Count", "Site")
colnames(Loc_ACTAAA_AllInt)=c("Count", "Site")
colnames(Loc_AGTAAA_AllInt)=c("Count", "Site")
colnames(Loc_CATAAA_AllInt)=c("Count", "Site")
colnames(Loc_ATTAAA_AllInt)=c("Count", "Site")
colnames(Loc_GATAAA_AllInt)=c("Count", "Site")
colnames(Loc_TATAAA_AllInt)=c("Count", "Site")
colnames(Loc_AAAAAA_AllInt)=c("Count", "Site")


AllsiteDF_NucAllInt=as.data.frame(rbind(Loc_AATAAA_AllInt,Loc_AAAAAG_AllInt,Loc_AATACA_AllInt,Loc_AATAGA_AllInt,Loc_AATATA_AllInt,Loc_ACTAAA_AllInt,Loc_AGTAAA_AllInt,Loc_ATTAAA_AllInt, Loc_GATAAA_AllInt,Loc_TATAAA_AllInt,Loc_CATAAA_AllInt,Loc_AAAAAA_AllInt)) %>% mutate(NegCount=-1*as.integer(as.character(Count)), Cononical=ifelse(Site=="AATAAA", "Yes","No"))


 AllsiteDF_to100_NucAllInt= AllsiteDF_NucAllInt %>% filter(Count < 100)

plot:

ggplot(AllsiteDF_to100_NucAllInt, aes(group=Site, x=NegCount, fill=Site)) + geom_histogram(position="stack",bins=50 ) + labs(x="Distance from PAS", y="N annotated Sites", title="Location of annotated signal sites- Intronic PAS")

Version	Author	Date
e60daf6	Briana Mittleman	2019-03-10
04987b5	Briana Mittleman	2019-03-09

Do this as proportion:

allIntron=12251 
AllsiteDF_to100_prop_NucAllInt=AllsiteDF_to100_NucAllInt %>% group_by(Site,NegCount) %>% summarise(CountperPos=n()) %>% mutate(TotCount=sum(CountperPos),prop=CountperPos/allIntron)

#%>% ungroup() %>% group_by(Site) %>% mutate(nType=sum(Count), prop=CountperPos/nType)

annoationPAS_NucAllIntpeak=ggplot(AllsiteDF_to100_prop_NucAllInt, aes(fill=Site, y=prop, x=NegCount)) + geom_histogram(stat="identity") + facet_wrap(~Site) + labs(x="Distance from PAS", y="Proportion of Signal Site", title="Location of annotated signal sites for all Intronic PAS")

Warning: Ignoring unknown parameters: binwidth, bins, pad

annoationPAS_NucAllIntpeak

Version	Author	Date
27cd8f1	Briana Mittleman	2019-03-12
e60daf6	Briana Mittleman	2019-03-10

ggsave(annoationPAS_NucAllIntpeak, file="../output/plots/annoationPAS_AllIntronicpeaksPropLocFacet.png")

Saving 7 x 5 in image

annoationPAS_NucAllIntpeakProphist=ggplot(AllsiteDF_to100_prop_NucAllInt, aes(fill=Site, by=Site, y=prop, x=NegCount)) + geom_histogram(stat="identity", position="stack") + labs(x="Distance from PAS", y="Proportion of Signal Site", title="Location of annotated signal sites for all Intronic PAS")

Warning: Ignoring unknown parameters: binwidth, bins, pad

annoationPAS_NucAllIntpeakProphist

Version	Author	Date
27cd8f1	Briana Mittleman	2019-03-12
e60daf6	Briana Mittleman	2019-03-10

ggsave(annoationPAS_NucAllIntpeakProphist, file="../output/plots/annoationPAS_AllIntronicpeakPropLocStackHist.png")

Saving 7 x 5 in image

Subsample from AllsiteDF_NucAllInt and look at it in comparison to the nuclear specific peaks:

subsample 220

AllsiteDF_to100_NucAllInt_sampl= AllsiteDF_to100_NucAllInt %>% sample_n(138)


AllsiteDF_to100_prop_NucAllInt_samp=AllsiteDF_to100_NucAllInt_sampl %>% group_by(Site,NegCount) %>% summarise(CountperPos=n()) %>% mutate(TotCount=sum(CountperPos),prop=CountperPos/TotCount)


annoationPAS_NucAllIntpeak_samp=ggplot(AllsiteDF_to100_prop_NucAllInt_samp, aes(fill=Site, y=prop, x=NegCount)) + geom_histogram(stat="identity") + facet_wrap(~Site) + labs(x="Distance from PAS", y="Proportion of Signal Site", title="Location of annotated signal sites for sample of Intronic PAS")

Warning: Ignoring unknown parameters: binwidth, bins, pad

annoationPAS_NucAllIntpeak_samp

Version	Author	Date
e60daf6	Briana Mittleman	2019-03-10

ggsave(annoationPAS_NucAllIntpeak_samp, file="../output/plots/annoationPAS_SampleofIntronicpeaksPropLocFacet.png")

Saving 7 x 5 in image

Compare cononical vs other:

AllsiteDF_to100_con=AllsiteDF_to100 %>% group_by(NegCount,Cononical) %>% summarise(PerSite=n()) %>% ungroup() %>% group_by(Cononical) %>% mutate(NCon=sum(PerSite), PropPerSite=PerSite/NCon)


ggplot(AllsiteDF_to100_con,aes(x=NegCount, by=Cononical, fill=Cononical, y=PropPerSite)) +geom_histogram(stat="identity", alpha=.5,bins=50)

Warning: Ignoring unknown parameters: binwidth, bins, pad

Version	Author	Date
e60daf6	Briana Mittleman	2019-03-10

#+ facet_grid(~Cononical)

Look in nuclear

AllsiteDF_to100_Nuc_con=AllsiteDF_to100_Nuc %>% group_by(NegCount,Cononical) %>% summarise(PerSite=n()) %>% ungroup() %>% group_by(Cononical) %>% mutate(NCon=sum(PerSite), PropPerSite=PerSite/NCon)


ggplot(AllsiteDF_to100_Nuc_con,aes(x=NegCount, by=Cononical, fill=Cononical, y=PropPerSite)) +geom_histogram(stat="identity",alpha=.5,bins=50)

Warning: Ignoring unknown parameters: binwidth, bins, pad

Version	Author	Date
e60daf6	Briana Mittleman	2019-03-10

Make plots with all sites as demoniator

I want to look at percentages but for all sites:

AllsiteDF_to100_fuklprop=AllsiteDF_to100  %>%  group_by(Site,NegCount) %>% summarise(CountperPos=n()) %>% mutate(TotCount=nrow(AllsiteDF_to100),prop=CountperPos/33432)

allSite=ggplot(AllsiteDF_to100_fuklprop, aes(group=Site, x=NegCount, y=prop, fill=Site)) + geom_histogram(stat="identity",position="stack" ) + labs(x="Distance from PAS", y="Proportion of all annotated sites", title="Location of annotated signal sites")

Warning: Ignoring unknown parameters: binwidth, bins, pad

AllsiteDF_to100_nuc_fullprop=AllsiteDF_to100_Nuc  %>%  group_by(Site,NegCount) %>% summarise(CountperPos=n()) %>% mutate(TotCount=nrow(AllsiteDF_to100_Nuc),prop=CountperPos/657)

nucSite=ggplot(AllsiteDF_to100_nuc_fullprop, aes(group=Site, x=NegCount, y=prop, fill=Site)) + geom_histogram(stat="identity",position="stack" ) + labs(x="Distance from PAS", y="Proportion of all annotated sites", title="Location of annotated signal sites- Nuclear")

Warning: Ignoring unknown parameters: binwidth, bins, pad

AllsiteDF_to100_nucInt_fullprop=AllsiteDF_to100_NucInt  %>%  group_by(Site,NegCount) %>% summarise(CountperPos=n()) %>% mutate(TotCount=nrow(AllsiteDF_to100_NucInt),prop=CountperPos/449)

nucIntsite=ggplot(AllsiteDF_to100_nucInt_fullprop, aes(group=Site, x=NegCount, y=prop, fill=Site)) + geom_histogram(stat="identity",position="stack" ) + labs(x="Distance from PAS", y="Proportion of all annotated sites", title="Location of annotated signal sites- Nuclear Intronic")

Warning: Ignoring unknown parameters: binwidth, bins, pad

AllsiteDF_to100_prop_NucAllInt

AllsiteDF_to100_nucAllInt_fullprop=AllsiteDF_to100_NucAllInt  %>%  group_by(Site,NegCount) %>% summarise(CountperPos=n()) %>% mutate(TotCount=nrow(AllsiteDF_to100_NucAllInt),prop=CountperPos/12251)

AllIntsite=ggplot(AllsiteDF_to100_nucAllInt_fullprop, aes(group=Site, x=NegCount, y=prop, fill=Site)) + geom_histogram(stat="identity",position="stack" ) + labs(x="Distance from PAS", y="Proportion of all annotated sites", title="Location of annotated signal sites- All Intronic")

Warning: Ignoring unknown parameters: binwidth, bins, pad

PropSignalSite_ofall=plot_grid(allSite, AllIntsite, nucSite,nucIntsite)

ggsave(PropSignalSite_ofall, file="../output/plots/PropofAllSigSite_bySet.png", height = 10, width=10)

Total PAS

Now many motifs am I finding within -15 and -40.

I can filter the AllsiteDF:

AllsiteDF_15.40= AllsiteDF_to100 %>% filter(NegCount <= -15, NegCount >= -40)
dim(AllsiteDF_15.40)

[1] 19989     4

AllsiteDF_10.50= AllsiteDF %>% filter(NegCount <= -10, NegCount >= -50)
dim(AllsiteDF_10.50)

[1] 22192     4

Proprortion between -15 and -40 for each:

all15.40=AllsiteDF_to100 %>% filter(NegCount <= -15, NegCount >= -50) %>% nrow() /33432
nuc15.40=AllsiteDF_to100_Nuc %>%filter(NegCount <= -15, NegCount >= -50) %>% nrow() /657
nucint15.40=AllsiteDF_to100_NucInt%>%filter(NegCount <= -15, NegCount >= -50) %>% nrow() /449
allnuc15.40 =AllsiteDF_to100_NucAllInt %>% filter(NegCount <= -15, NegCount >= -50) %>% nrow() / 12251
  
DF_1540=as.data.frame(cbind(Set=c("All PAS", "Nuclear Specific", "Nuclear Specific in Introns", "All PAS in Introns"), value=c(all15.40,nuc15.40,nucint15.40,allnuc15.40), location=c("Full Gene", "Full Gene", "Intronic", "Intronic")))


DF_1540$value=as.numeric(as.character(DF_1540$value))

site15.40=ggplot(DF_1540,aes(x=Set, y=value,fill=location)) + geom_bar(stat="identity") + labs(y="Proportion with annotated site",title="Proportion of PAS with annoatated sites \n 15-40 basepairs upstream") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_fill_brewer(palette = "Set1")

site15.40

Version	Author	Date
27cd8f1	Briana Mittleman	2019-03-12

ggsave(site15.40, file="../output/plots/PropofPASwithSignalSite.png")

Saving 7 x 5 in image

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.1

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 genomation_1.14.0                
 [3] cowplot_0.9.4                     BSgenome_1.50.0                  
 [5] rtracklayer_1.42.2                GenomicRanges_1.34.0             
 [7] GenomeInfoDb_1.18.2               Biostrings_2.50.2                
 [9] XVector_0.22.0                    IRanges_2.16.0                   
[11] S4Vectors_0.20.1                  BiocGenerics_0.28.0              
[13] forcats_0.4.0                     stringr_1.4.0                    
[15] dplyr_0.8.0.1                     purrr_0.3.1                      
[17] readr_1.3.1                       tidyr_0.8.3                      
[19] tibble_2.0.1                      ggplot2_3.1.0                    
[21] tidyverse_1.2.1                   workflowr_1.2.0                  

loaded via a namespace (and not attached):
 [1] nlme_3.1-137                bitops_1.0-6               
 [3] matrixStats_0.54.0          fs_1.2.6                   
 [5] lubridate_1.7.4             RColorBrewer_1.1-2         
 [7] httr_1.4.0                  rprojroot_1.3-2            
 [9] tools_3.5.1                 backports_1.1.3            
[11] R6_2.4.0                    KernSmooth_2.23-15         
[13] lazyeval_0.2.1              colorspace_1.4-0           
[15] seqPattern_1.14.0           withr_2.1.2                
[17] tidyselect_0.2.5            compiler_3.5.1             
[19] git2r_0.24.0                cli_1.0.1                  
[21] rvest_0.3.2                 Biobase_2.42.0             
[23] xml2_1.2.0                  DelayedArray_0.8.0         
[25] labeling_0.3                scales_1.0.0               
[27] digest_0.6.18               Rsamtools_1.34.1           
[29] rmarkdown_1.11              pkgconfig_2.0.2            
[31] htmltools_0.3.6             plotrix_3.7-4              
[33] rlang_0.3.1                 readxl_1.3.0               
[35] rstudioapi_0.9.0            impute_1.56.0              
[37] generics_0.0.2              jsonlite_1.6               
[39] BiocParallel_1.16.6         RCurl_1.95-4.12            
[41] magrittr_1.5                GenomeInfoDbData_1.2.0     
[43] Matrix_1.2-15               Rcpp_1.0.0                 
[45] munsell_0.5.0               reticulate_1.11.1          
[47] stringi_1.3.1               whisker_0.3-2              
[49] yaml_2.2.0                  SummarizedExperiment_1.12.0
[51] zlibbioc_1.28.0             plyr_1.8.4                 
[53] crayon_1.3.4                lattice_0.20-38            
[55] haven_2.1.0                 hms_0.4.2                  
[57] knitr_1.21                  pillar_1.3.1               
[59] reshape2_1.4.3              XML_3.98-1.19              
[61] glue_1.3.0                  evaluate_0.13              
[63] data.table_1.12.0           modelr_0.1.4               
[65] cellranger_1.1.0            gtable_0.2.0               
[67] assertthat_0.2.0            xfun_0.5                   
[69] gridBase_0.4-7              broom_0.5.1                
[71] GenomicAlignments_1.18.1

Location of Signal Site

Briana Mittleman

3/6/2019

Make location file

Get subject (reads)

Make motifs

find the mathes

Change analysis:

Subset peaks less than 200bp

Get sequence for peaks and 100 upstream

find distance to peaks

Make histograms:

Run for nuclear specific PAS:

All intronic peaks:

Compare cononical vs other:

Make plots with all sites as demoniator

Total PAS