Last updated: 2019-03-06

Checks: 6 0

Knit directory: threeprimeseq/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.2.0). The Report tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(12345) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    data/.DS_Store
    Ignored:    data/perm_QTL_trans_noMP_5percov/
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  KalistoAbundance18486.txt
    Untracked:  analysis/4suDataIGV.Rmd
    Untracked:  analysis/DirectionapaQTL.Rmd
    Untracked:  analysis/EmpDistforOverlaps.Rmd
    Untracked:  analysis/EvaleQTLs.Rmd
    Untracked:  analysis/YL_QTL_test.Rmd
    Untracked:  analysis/groSeqAnalysis.Rmd
    Untracked:  analysis/ncbiRefSeq_sm.sort.mRNA.bed
    Untracked:  analysis/snake.config.notes.Rmd
    Untracked:  analysis/verifyBAM.Rmd
    Untracked:  analysis/verifybam_dubs.Rmd
    Untracked:  code/PeaksToCoverPerReads.py
    Untracked:  code/strober_pc_pve_heatmap_func.R
    Untracked:  data/18486.genecov.txt
    Untracked:  data/APApeaksYL.total.inbrain.bed
    Untracked:  data/AllPeak_counts/
    Untracked:  data/ApaQTLs/
    Untracked:  data/ApaQTLs_otherPhen/
    Untracked:  data/ChromHmmOverlap/
    Untracked:  data/DistTXN2Peak_genelocAnno/
    Untracked:  data/FeatureoverlapPeaks/
    Untracked:  data/GM12878.chromHMM.bed
    Untracked:  data/GM12878.chromHMM.txt
    Untracked:  data/LianoglouLCL/
    Untracked:  data/LocusZoom/
    Untracked:  data/LocusZoom_Unexp/
    Untracked:  data/LocusZoom_proc/
    Untracked:  data/MatchedSnps/
    Untracked:  data/NuclearApaQTLs.txt
    Untracked:  data/PeakCounts/
    Untracked:  data/PeakCounts_noMP_5perc/
    Untracked:  data/PeakCounts_noMP_genelocanno/
    Untracked:  data/PeakUsage/
    Untracked:  data/PeakUsage_noMP/
    Untracked:  data/PeakUsage_noMP_GeneLocAnno/
    Untracked:  data/PeaksUsed/
    Untracked:  data/PeaksUsed_noMP_5percCov/
    Untracked:  data/PolyA_DB/
    Untracked:  data/QTL_overlap/
    Untracked:  data/RNAkalisto/
    Untracked:  data/RefSeq_annotations/
    Untracked:  data/Replicates_usage/
    Untracked:  data/TotalApaQTLs.txt
    Untracked:  data/Totalpeaks_filtered_clean.bed
    Untracked:  data/UnderstandPeaksQC/
    Untracked:  data/WASP_STAT/
    Untracked:  data/YL-SP-18486-T-combined-genecov.txt
    Untracked:  data/YL-SP-18486-T_S9_R1_001-genecov.txt
    Untracked:  data/YL_QTL_test/
    Untracked:  data/apaExamp/
    Untracked:  data/apaExamp_proc/
    Untracked:  data/apaQTL_examp_noMP/
    Untracked:  data/bedgraph_peaks/
    Untracked:  data/bin200.5.T.nuccov.bed
    Untracked:  data/bin200.Anuccov.bed
    Untracked:  data/bin200.nuccov.bed
    Untracked:  data/clean_peaks/
    Untracked:  data/comb_map_stats.csv
    Untracked:  data/comb_map_stats.xlsx
    Untracked:  data/comb_map_stats_39ind.csv
    Untracked:  data/combined_reads_mapped_three_prime_seq.csv
    Untracked:  data/diff_iso_GeneLocAnno/
    Untracked:  data/diff_iso_proc/
    Untracked:  data/diff_iso_trans/
    Untracked:  data/eQTLs_Lietal/
    Untracked:  data/ensemble_to_genename.txt
    Untracked:  data/example_gene_peakQuant/
    Untracked:  data/explainProtVar/
    Untracked:  data/filtPeakOppstrand_cov_noMP_GeneLocAnno_5perc/
    Untracked:  data/filtered_APApeaks_merged_allchrom_refseqTrans.closest2End.bed
    Untracked:  data/filtered_APApeaks_merged_allchrom_refseqTrans.closest2End.noties.bed
    Untracked:  data/first50lines_closest.txt
    Untracked:  data/gencov.test.csv
    Untracked:  data/gencov.test.txt
    Untracked:  data/gencov_zero.test.csv
    Untracked:  data/gencov_zero.test.txt
    Untracked:  data/gene_cov/
    Untracked:  data/joined
    Untracked:  data/leafcutter/
    Untracked:  data/merged_combined_YL-SP-threeprimeseq.bg
    Untracked:  data/molPheno_noMP/
    Untracked:  data/mol_overlap/
    Untracked:  data/mol_pheno/
    Untracked:  data/nom_QTL/
    Untracked:  data/nom_QTL_opp/
    Untracked:  data/nom_QTL_trans/
    Untracked:  data/nuc6up/
    Untracked:  data/nuc_10up/
    Untracked:  data/other_qtls/
    Untracked:  data/pQTL_otherphen/
    Untracked:  data/pacbio_cov/
    Untracked:  data/peakPerRefSeqGene/
    Untracked:  data/peaks4DT/
    Untracked:  data/perm_QTL/
    Untracked:  data/perm_QTL_GeneLocAnno_noMP_5percov/
    Untracked:  data/perm_QTL_GeneLocAnno_noMP_5percov_3UTR/
    Untracked:  data/perm_QTL_diffWindow/
    Untracked:  data/perm_QTL_opp/
    Untracked:  data/perm_QTL_trans/
    Untracked:  data/perm_QTL_trans_filt/
    Untracked:  data/protAndAPAAndExplmRes.Rda
    Untracked:  data/protAndAPAlmRes.Rda
    Untracked:  data/protAndExpressionlmRes.Rda
    Untracked:  data/reads_mapped_three_prime_seq.csv
    Untracked:  data/smash.cov.results.bed
    Untracked:  data/smash.cov.results.csv
    Untracked:  data/smash.cov.results.txt
    Untracked:  data/smash_testregion/
    Untracked:  data/ssFC200.cov.bed
    Untracked:  data/temp.file1
    Untracked:  data/temp.file2
    Untracked:  data/temp.gencov.test.txt
    Untracked:  data/temp.gencov_zero.test.txt
    Untracked:  data/threePrimeSeqMetaData.csv
    Untracked:  data/threePrimeSeqMetaData55Ind.txt
    Untracked:  data/threePrimeSeqMetaData55Ind.xlsx
    Untracked:  data/threePrimeSeqMetaData55Ind_noDup.txt
    Untracked:  data/threePrimeSeqMetaData55Ind_noDup.xlsx
    Untracked:  data/threePrimeSeqMetaData55Ind_noDup_WASPMAP.txt
    Untracked:  data/threePrimeSeqMetaData55Ind_noDup_WASPMAP.xlsx
    Untracked:  output/LZ/
    Untracked:  output/deeptools_plots/
    Untracked:  output/picard/
    Untracked:  output/plots/
    Untracked:  output/qual.fig2.pdf

Unstaged changes:
    Modified:   analysis/28ind.peak.explore.Rmd
    Modified:   analysis/CompareLianoglouData.Rmd
    Modified:   analysis/NewPeakPostMP.Rmd
    Modified:   analysis/apaQTLoverlapGWAS.Rmd
    Modified:   analysis/cleanupdtseq.internalpriming.Rmd
    Modified:   analysis/coloc_apaQTLs_protQTLs.Rmd
    Modified:   analysis/dif.iso.usage.leafcutter.Rmd
    Modified:   analysis/diff_iso_pipeline.Rmd
    Modified:   analysis/explainpQTLs.Rmd
    Modified:   analysis/explore.filters.Rmd
    Modified:   analysis/fixBWChromNames.Rmd
    Modified:   analysis/flash2mash.Rmd
    Modified:   analysis/mispriming_approach.Rmd
    Modified:   analysis/overlapMolQTL.Rmd
    Modified:   analysis/overlapMolQTL.opposite.Rmd
    Modified:   analysis/overlap_qtls.Rmd
    Modified:   analysis/peakOverlap_oppstrand.Rmd
    Modified:   analysis/peakQCPPlots.Rmd
    Modified:   analysis/pheno.leaf.comb.Rmd
    Modified:   analysis/pipeline_55Ind.Rmd
    Modified:   analysis/swarmPlots_QTLs.Rmd
    Modified:   analysis/test.max2.Rmd
    Modified:   analysis/test.smash.Rmd
    Modified:   analysis/understandPeaks.Rmd
    Modified:   analysis/unexplainedeQTL_analysis.Rmd
    Modified:   code/Snakefile

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd e1c4295 Briana Mittleman 2019-03-06 add messing with meme- start own script
html 8a73f91 Briana Mittleman 2019-03-02 Build site.
Rmd 49f21df Briana Mittleman 2019-03-02 modify analysis for len 6 add qtl match
html 901e191 Briana Mittleman 2019-03-02 Build site.
Rmd a96297c Briana Mittleman 2019-03-02 add signal site enrich analysis

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC310884/ (50bp up)

Singal site in peak

I will use homer to look for enriched signals in my signal sites.

I need to get the 50 base pairs upstream of my peaks and the shuffled peaks. This should encompase the signal site. This is similar to the 10 bp upstream script I used to identify evidence of mispriming. I want it to take any peak file and conver the bed to the 50bp uptream.

Upstream50Bases.py

#python  
def main(Fin, Fout):
  outBed=open(Fout, "w")
  chrom_lengths=open("/project2/gilad/briana/genome_anotation_data/chrom_lengths2.sort.bed","r")
  #make a dictionary with chrom lengths
  length_dic={}
  for i in chrom_lengths:
    chrom, start, end = i.split()
    length_dic[str(chrom)]=int(end)  
#write file 
  for ln in open(Fin):
    chrom, start, end, name, score, strand = ln.split()
    chrom=str(chrom)
    if strand=="+":
      start_new=int(start)-50
      if start_new <= 1:
        start_new = 0 
      end_new= int(start)
      if end_new == 0:
        end_new=1
      outBed.write("%s\t%d\t%d\t%s\t%s\t%s\n"%(chrom, start_new, end_new, name, score, strand))
    if strand == "-":
      start_new=int(end)
      end_new=int(end) + 50
      outBed.write("%s\t%d\t%d\t%s\t%s\t%s\n"%(chrom, start_new, end_new, name, score, strand))
  outBed.close()  
if __name__ == "__main__":
    import sys
    inFile = sys.argv[1]
    outFile=sy.argv[2] 
    main(inFile, outFile)

RUn this for both the peak and shuff file with the fixed peak strands:

  • /project2/gilad/briana/threeprimeseq/data/FeatureoverlapPeaks/shuffled_FilterPeaks.sort.bed

  • /project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand.bed

run_get50up.sh

#!/bin/bash

#SBATCH --job-name=run_get50up
#SBATCH --account=pi-yangili1
#SBATCH --time=36:00:00
#SBATCH --output=run_get50upt.out
#SBATCH --error=run_get50up.err
#SBATCH --partition=broadwl
#SBATCH --mem=16G
#SBATCH --mail-type=END

module load Anaconda3
source activate three-prime-env

python Upstream50Bases.py /project2/gilad/briana/threeprimeseq/data/FeatureoverlapPeaks/shuffled_FilterPeaks.sort.bed /project2/gilad/briana/threeprimeseq/data/SignalEnrich/fiftyup_shuffled_FilterPeaks.sort.bed 

python Upstream50Bases.py /project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand.bed /project2/gilad/briana/threeprimeseq/data/SignalEnrich/fiftyup_APAPeaks_5percCov_fixedStrand.bed

pabp (signal site) is top known

Run homer enrichement

-h hypergeometic enrichment

run it in downloaded version for now- will fix when environment solves and this is in anaconda env.

signalSiteEnrich.sh

#!/bin/bash

#SBATCH --job-name=signalSiteEnrich
#SBATCH --account=pi-yangili1
#SBATCH --time=36:00:00
#SBATCH --output=signalSiteEnrich.out
#SBATCH --error=signalSiteEnrich.err
#SBATCH --partition=bigmem2
#SBATCH --mem=100G
#SBATCH --mail-type=END
source deactivate three-prime-env
module unload Anaconda3

findMotifsGenome.pl /project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand.bed /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa /project2/gilad/briana/threeprimeseq/data/SignalEnrich/ -size -300,100 -h -bg  /project2/gilad/briana/threeprimeseq/data/FeatureoverlapPeaks/shuffled_FilterPeaks.sort.bed -len 6

I see the correct enrichment but I want to understand where the enrichment is compared to the peaks. Iam worried the peaks are really close to eachother and maybe the enrichment was from a signal site of another peak.

Homer gives some results:

Tpos: average position of motif in target sequences (0 = start of sequences)

This result is only there for the new res. THe known res arnt there. I can run homer specfically looking for the motifs in the https://www.ncbi.nlm.nih.gov/pmc/articles/PMC310884/ paper.

I need to decide on a: Log odds detection threshold, used to determine bound vs. unbound sites (mandatory) example: 8.059752

I cau use the seq2profile.pl for this (here I can say allow 0 mismatch )

Motif file:

/project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/

makeMotifFiles.sh


#!/bin/bash

#SBATCH --job-name=makeMotifFiles
#SBATCH --account=pi-yangili1
#SBATCH --time=36:00:00
#SBATCH --output=makeMotifFiles.out
#SBATCH --error=makeMotifFiles.err
#SBATCH --partition=bigmem2
#SBATCH --mem=100G
#SBATCH --mail-type=END

source ~/.bash_profile

 ~/homer/bin/seq2profile.pl AATAAA AATAAA-Sig1 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AATAAA-Sig1.motif
 ~/homer/bin/seq2profile.pl ATTAAA ATTAAA-Sig2 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/ATTAAA-Sig2.motif
 ~/homer/bin/seq2profile.pl AGTAAA AGTAAA-Sig3 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AGTAAA-Sig3.motif
 ~/homer/bin/seq2profile.pl TATAAA TATAAA-Sig4 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/TATAAA-Sig4.motif  
 ~/homer/bin/seq2profile.pl CATAAA CATAAA-Sig5 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/CATAAA-Sig5.motif 
 ~/homer/bin/seq2profile.pl GATAAA GATAAA-Sig6 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/GATAAA-Sig6.motif 
 ~/homer/bin/seq2profile.pl AATATA AATATA-Sig7 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AATATA-Sig7.motif 
 ~/homer/bin/seq2profile.pl AATACA AATACA-Sig8 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AATACA-Sig8.motif
 ~/homer/bin/seq2profile.pl AATAGA AATAGA-Sig9 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AATAGA-Sig9.motif
 ~/homer/bin/seq2profile.pl AAAAAG AAAAAG-Sig10 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AAAAAG-Sig10.motif
 ~/homer/bin/seq2profile.pl ACTAAA ACTAAA-Sig11 > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/ACTAAA-Sig11.motif

I can then run the motif finding on these

I want to find inidivdual matches for these with http://meme-suite.org/doc/fimo.html

I need to convert these with motif2meme

#R 
source("/project2/gilad/briana/Motif2Meme/motif2meme.R")
motif2meme("/project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AATAAA-Sig1.motif")

Read 7 items Error in index.start:index.end : NA/NaN argument

Try to make my own: /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AATAAA_meme.motif

MEME version 4

ALPHABET= ACGT


strands: +


Background letter frequencies (from
A 0.303 C 0.183 G 0.209 T 0.306 

MOTIF AATAAA 
letter-probability matrix: alength= 4 w= 6 nsites= 6 E= 0
0.997   0.001   0.001   0.001
0.001   0.997   0.001   0.001
0.001   0.001   0.001   0.997
0.997   0.001   0.001   0.001
0.997   0.001   0.001   0.001
0.997   0.001   0.001   0.001

I also need to make a fasta formated file for the regions. (start with the 50 upstream)

takes the bedtools nuc res and makes a fasta

interactively

#example
# >peakid
#seq  


inFile=open("/project2/gilad/briana/threeprimeseq/data/SignalEnrich/Seq_fiftyup_APAPeaks_5percCov_fixedStrand.bed", "r")
outFile=open("/project2/gilad/briana/threeprimeseq/data/SignalEnrich/Seq_fiftyup_APAPeaks_5percCov_fixedStrand.fasta", "w")  

for i, ln in enumerate(inFile):
    if i >0:
        id=ln.split()[3]
        seq=ln.split()[15]
        outFile.write(">%s\n%s\n"%(id, seq))
outFile.close()

add this to conda environment “meme” Test this one the one motif:

fimo --o /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/ /project2/gilad/briana/threeprimeseq/data/SignalEnrich_MyMotifs/AATAAA_meme.motif /project2/gilad/briana/threeprimeseq/data/SignalEnrich/Seq_fiftyup_APAPeaks_5percCov_fixedStrand.fasta

If this doesnt work I will write my own python script to look for the motifs in the sequences. I can then look through the sequences and see if any of those are there and if so which. I can look for each in an heiarchical way. I can then find the location of the signal. I can use the .find command for this in python. It will give me the first index if findes the string.

I can make a second dictionary with the resuts. Anytime a signal is found, Ican add the location of the signal to a list in the value for the dictionary.

change region I am looking at before I do this - move to new analysis and try a new method for this analysis


sigsite=['AATAAA', 'ATTAAA' ,'AGTAAA' ,'TATAAA'  ,'CATAAA' ,'GATAAA' ,'AATATA' ,'AATACA' ,'AATAGA' ,'AAAAAG','ACTAAA'] 


#start results dic, key is site, value is empty list  
Res_dic={}
for i in sigsite:
    Res_dic[i]=[]


upstreamSeq=open("/project2/gilad/briana/threeprimeseq/data/SignalEnrich/Seq_fiftyup_APAPeaks_5percCov_fixedStrand.fasta", "r")

for i,ln in enumerate(upstreamSeq):
    if i % 2 ==0: 
        if sigsite[0] in ln:
        if sigsite[1] in ln:
        if sigsite[2] in ln:
        if sigsite[3] in ln:
        if sigsite[4] in ln:
        if sigsite[5] in ln:
        if sigsite[6] in ln:
        if sigsite[7] in ln:
        if sigsite[8] in ln:
        if sigsite[9] in ln:
        if sigsite[10] in ln:
#not finished code

Signal site in qtls

/project2/gilad/briana/threeprimeseq/data/ApaQTLs/Nuclear.apaQTLs.sort.bed /project2/gilad/briana/threeprimeseq/data/ApaQTLs/Total.apaQTLs.sort.bed

cat /project2/gilad/briana/threeprimeseq/data/ApaQTLs/Nuclear.apaQTLs.sort.bed /project2/gilad/briana/threeprimeseq/data/ApaQTLs/Total.apaQTLs.sort.bed | sort -k1,1 -k2,2n > /project2/gilad/briana/threeprimeseq/data/ApaQTLs/All.apaQTLs.sort.bed
cat /project2/gilad/briana/threeprimeseq/data/MatchedSnp/Nuclear_matched_snps_sort.bed /project2/gilad/briana/threeprimeseq/data/MatchedSnp/Total_matched_snps_sort.bed  | sort -k1,1 -k2,2n > /project2/gilad/briana/threeprimeseq/data/MatchedSnp/All_matched_snps_sort.bed

signalSiteEnrich_QTLs.sh

#!/bin/bash

#SBATCH --job-name=signalSiteEnrich_QTL
#SBATCH --account=pi-yangili1
#SBATCH --time=36:00:00
#SBATCH --output=signalSiteEnrich_QTL.out
#SBATCH --error=signalSiteEnrich_QTL.err
#SBATCH --partition=bigmem2
#SBATCH --mem=100G
#SBATCH --mail-type=END

#source deactivate three-prime-env
#module unload Anaconda3

 ~/homer/bin/findMotifsGenome.pl /project2/gilad/briana/threeprimeseq/data/ApaQTLs/All.apaQTLs.sort.bed /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa /project2/gilad/briana/threeprimeseq/data/SignalEnrich_QTL/ -size 100 -h -bg  /project2/gilad/briana/threeprimeseq/data/MatchedSnp/All_matched_snps_sort.bed -len 6

results dont mean much at the moment

Signal enrichment nuc spec

/project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc.bed

Make a shuffled file of this for background:

Do this interactively

import pybedtools 
peaks= pybedtools.BedTool("/project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc.bed")
peaks_shuf=peaks.shuffle(genome='hg19')
peaks_shuf.saveas("/project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpec/shuffled_MoreSigUsageNuclear.bed")
sort -k1,1 -k2,2n /project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpec/shuffled_MoreSigUsageNuclear.bed | sed 's/^chr//'  > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpec/shuffled_MoreSigUsageNuclear.sort.bed

signalSiteEnrich_nucSpec.sh

#!/bin/bash

#SBATCH --job-name=signalSiteEnrich_nucSpec
#SBATCH --account=pi-yangili1
#SBATCH --time=36:00:00
#SBATCH --output=signalSiteEnrich_nucSpec.out
#SBATCH --error=signalSiteEnrich_nucSpec.err
#SBATCH --partition=bigmem2
#SBATCH --mem=100G
#SBATCH --mail-type=END
source deactivate three-prime-env
module unload Anaconda3

findMotifsGenome.pl /project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc.bed /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa /project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpec/ -size -300,100 -h -bg  /project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpec/shuffled_MoreSigUsageNuclear.sort.bed -len 6

Subset these by those in introns:

/project2/gilad/briana/threeprimeseq/data/mergedPeaks_noMP_GeneLoc/Filtered_APApeaks_merged_allchrom_noMP.sort.named.noCHR_geneLocParsed_withAnno.SAF

I want to subset only those in the /project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc.bed file that have an intron annoations:

SigUsageNuc_subsetInron.py

anno=open("/project2/gilad/briana/threeprimeseq/data/mergedPeaks_noMP_GeneLoc/Filtered_APApeaks_merged_allchrom_noMP.sort.named.noCHR_geneLocParsed_withAnno.SAF","r")
inbed="/project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc.bed"
outBed=open("/project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc_Intron.bed","w")  

#make dic with intron peaks 
intron={}
for i,ln in enumerate(anno):
    if i >0:
        peak=ln.split()[0].split(":")[0]
        loc=ln.split()[0].split(":")[-1]
        if loc == "intron":
            intron[peak]=""
        


for ln in open(inbed, "r"):
    peak=ln.split()[3].split(":")[1]
    if peak in intron.keys():
        outBed.write(ln)
        
outBed.close()
    

Of these 519 are intron. I now need to do the shuff on these:

Do this interactively

import pybedtools 
peaks= pybedtools.BedTool("/project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc_Intron.bed")
peaks_shuf=peaks.shuffle(genome='hg19')
peaks_shuf.saveas("/project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpecIntron/shuffled_MoreSigUsageNuclear_intron.bed")
sort -k1,1 -k2,2n /project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpecIntron/shuffled_MoreSigUsageNuclear_intron.bed | sed 's/^chr//'  > /project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpecIntron/shuffled_MoreSigUsageNuclear_intron.sort.bed

signalSiteEnrich_nucSpec_intron.sh

#!/bin/bash

#SBATCH --job-name=signalSiteEnrich_nucSpec_intron
#SBATCH --account=pi-yangili1
#SBATCH --time=36:00:00
#SBATCH --output=signalSiteEnrich_nucSpec_intron.out
#SBATCH --error=signalSiteEnrich_nucSpec_intron.err
#SBATCH --partition=bigmem2
#SBATCH --mem=100G
#SBATCH --mail-type=END
source ~/.bash_profile

 ~/homer/bin/findMotifsGenome.pl /project2/gilad/briana/threeprimeseq/data/peaks4DT/APAPeaks_5percCov_fixedStrand_SigUsageNuc_Intron.bed /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa /project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpecIntron/ -size -300,100 -h -bg  /project2/gilad/briana/threeprimeseq/data/SignalEnrich_nucSpecIntron/shuffled_MoreSigUsageNuclear_intron.sort.bed -len 6

Upstream sequences

I saw enrichment but now I need to work on location of the enrichment. I have the 50bp upstream of the PAS from earlier in this analysis. I can use bedtools nuc to get the actual sequences for these regions.

nuc50upPeaks.sh

#!/bin/bash

#SBATCH --job-name=nuc50upPeaks
#SBATCH --account=pi-yangili1
#SBATCH --time=8:00:00
#SBATCH --output=nuc50upPeaks.out
#SBATCH --error=nuc50upPeaks.err
#SBATCH --partition=broadwl
#SBATCH --mem=36G
#SBATCH --mail-type=END

module load Anaconda3
source activate three-prime-env

bedtools nuc -s -seq -fi /project2/gilad/briana/genome_anotation_data/genome/Homo_sapiens.GRCh37.75.dna_sm.all.fa -bed /project2/gilad/briana/threeprimeseq/data/SignalEnrich/fiftyup_APAPeaks_5percCov_fixedStrand.bed > /project2/gilad/briana/threeprimeseq/data/SignalEnrich/Seq_fiftyup_APAPeaks_5percCov_fixedStrand.bed   


sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.1

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] workflowr_1.2.0 Rcpp_0.12.19    lattice_0.20-35 digest_0.6.17  
 [5] rprojroot_1.3-2 grid_3.5.1      jsonlite_1.6    backports_1.1.2
 [9] git2r_0.24.0    magrittr_1.5    evaluate_0.13   stringi_1.2.4  
[13] fs_1.2.6        whisker_0.3-2   Matrix_1.2-14   reticulate_1.10
[17] rmarkdown_1.11  tools_3.5.1     stringr_1.4.0   glue_1.3.0     
[21] yaml_2.2.0      compiler_3.5.1  htmltools_0.3.6 knitr_1.20