Last updated: 2017-11-07
Code version: cdfdd68
The snake and config files are included in the `net_seq_pipeline` folder, which also holds the human reference genome and the conda environment file.
Before each session:

```bash
module load Anaconda3
source activate net-seq
```

To test my snake file:

```bash
snakemake -np
```
```yaml
# Snakemake configuration file
# Specify paths to data files
# Paths must end with forward slash

# Project directory
dir_proj: /project2/gilad/briana/net_seq_pipeline/

# Directory with additional scripts
scripts: /project2/gilad/briana/net_seq_pipeline/scripts/

# Make sure to also update path for log files in cluster.json
dir_log: log/

# Specify Ensembl release for genome sequence and annotation
# http://feb2014.archive.ensembl.org/index.html
ensembl_archive: feb2014.archive.ensembl.org
ensembl_rel: 75
ensembl_genome: GRCh37.75

# Specify chromosomes for quantifying gene expression
chr_genes: ["1", "2", "3", "4", "5", "6", "7", "8", "9",
            "10", "11", "12", "13", "14", "15", "16",
            "17", "18", "19", "20", "21", "22",
            "X", "Y", "MT"]
```
Make this file executable:

```bash
chmod 700 config.yaml
```
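Since the config comments require every path to end with a forward slash, a quick sanity check can catch mistakes before Snakemake builds broken paths. A minimal sketch (the `check_trailing_slash` helper is hypothetical, not part of the pipeline):

```python
# Hypothetical helper: flag path-like config entries missing a trailing '/'.
# Assumes a flat dict of the kind produced by loading config.yaml.
def check_trailing_slash(config):
    """Return the list of path-like keys whose value lacks a trailing '/'."""
    bad = []
    for key, value in config.items():
        if key.startswith("dir_") or key == "scripts":
            if not value.endswith("/"):
                bad.append(key)
    return bad

example = {
    "dir_proj": "/project2/gilad/briana/net_seq_pipeline/",
    "scripts": "/project2/gilad/briana/net_seq_pipeline/scripts",  # missing slash
    "dir_log": "log/",
}
```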
```python
# Snakefile
#
# This file runs the net-seq pipeline from fastq files, including
# assembling the reference genome.
#
# To configure the paths to data files and other settings, edit
# config.yaml.
#
# To configure job submission settings for the cluster, edit
# cluster.json and submit-snakemake.sh.
#
# To run on RCC midway2, use `bash submit-snakemake.sh`.

import glob
import os
from snakemake.utils import R

# Configuration ----------------------------------------------------------

configfile: "config.yaml"

# Specify Ensembl release for genome sequence and annotation
ensembl_archive = config["ensembl_archive"]
ensembl_rel = config["ensembl_rel"]
ensembl_ftp = "ftp://ftp.ensembl.org/pub/release-" + \
              str(ensembl_rel) + \
              "/fasta/homo_sapiens/dna/"
ensembl_exons = "exons-ensembl-release-" + str(ensembl_rel) + ".saf"
ensembl_genome = config["ensembl_genome"]

# Paths for data (end with forward slash)
dir_proj = config["dir_proj"]
dir_data = dir_proj + "data/"
dir_genome = dir_proj + "genome/"
output = dir_proj + "output/"
fastq_dir = dir_data + "fastq/"
fastqc_dir = output + "fastqc/"

chr_genes = config["chr_genes"]

assert os.path.exists(dir_proj), "Project directory does not exist"

# Directory to send log files. Needs to be created manually since it
# is not a file created by a Snakemake rule.
dir_log = config["dir_log"]
if not os.path.isdir(dir_log):
    os.mkdir(dir_log)
```
These rules are from John's cardioQTL Snakefile:
```python
rule download_genome:
    output: dir_genome + "Homo_sapiens." + ensembl_genome + \
            ".dna_sm.chromosome.{chr}.fa.gz"
    params: chr = "{chr}", build = ensembl_genome
    shell: "wget -O {output} {ensembl_ftp}Homo_sapiens.{params.build}.dna_sm.chromosome.{params.chr}.fa.gz"

rule unzip_chromosome_fasta:
    input: dir_genome + "Homo_sapiens." + ensembl_genome + \
           ".dna_sm.chromosome.{chr}.fa.gz"
    output: temp(dir_genome + "Homo_sapiens." + ensembl_genome + \
                 ".dna_sm.chromosome.{chr}.fa")
    shell: "zcat {input} > {output}"

rule subread_index:
    input: expand(dir_genome + "Homo_sapiens." + ensembl_genome + \
                  ".dna_sm.chromosome.{chr}.fa", \
                  chr = chr_genes)
    output: dir_genome + ensembl_genome + ".reads"
    params: prefix = dir_genome + ensembl_genome
    shell: "subread-buildindex -o {params.prefix} {input}"
```
First add a path to the fastqc directory:

```python
fastqc_dir = output + "fastqc/"

rule fastqc:
    input:
        fastq_dir + "{samples}.fastq"
    output:
        fastqc_dir + "{samples}_fastqc.html",
        fastqc_dir + "{samples}_fastqc.zip"
    params:
        outdir = fastqc_dir
    shell:
        "fastqc -o {params.outdir} {input}"
```
I may not need to do this right now; this is code to maybe add later:

```python
rule unzip:
    input:
        fastqc_dir + "{sample}_fastqc.zip"
    output:
        fastqc_dir + "{sample}_fastqc/fastqc_data.txt"
    params:
        outdir = fastqc_dir
    shell:
        "unzip -d {params.outdir} {input}"
```
This is the next step to figure out. I need to trim the primer and extract the barcode from the linker.

Barcode DNA linker: `/5rApp/(N)6CTGTAGGCACCATCAAT/3ddC`

Use UMI-tools to extract the first 6 bases of each read.
```python
rule umi_extract_umi:
    input:
        fastq_dir + "{samples}.fastq",
    output:
        fastq_extr + "{samples}_extracted.fastq",
    params:
        bc = "NNNNNN"
    shell:
        "umi_tools extract --bc-pattern={params.bc} --stdin {input} --stdout {output}"
```
Need to tell it which samples to use:

```python
samples = set(glob_wildcards(fastq_dir + "{samples}.fastq").samples)
```
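For intuition, `glob_wildcards` here just recovers the sample names from the fastq filenames. A rough pure-Python equivalent, written over a list of paths so it is easy to test (the `samples_from_paths` helper is hypothetical):

```python
import os

def samples_from_paths(paths, suffix=".fastq"):
    """Rough equivalent of glob_wildcards(fastq_dir + "{samples}.fastq"):
    strip the directory and the suffix from each fastq path to recover
    the set of sample names."""
    return {os.path.basename(p)[: -len(suffix)]
            for p in paths if p.endswith(suffix)}
```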
```python
rule subjunc:
    input:
        read = fastq_extr + "{samples}_extracted.fastq",
        index = dir_genome + ensembl_genome + ".reads"
    output:
        dir_bam + "{samples}.bam"
    params:
        prefix = dir_genome + ensembl_genome
    threads: 8
    shell: "subjunc -i {params.prefix} -r {input.read} -T {threads} -o {output}"

rule sort_bam:
    input:
        dir_bam + "{samples}.bam"
    output:
        dir_sort + "{samples}-sort.bam"
    shell: "samtools sort -o {output} {input}"

rule index_bam:
    input:
        dir_sort + "{samples}-sort.bam"
    output:
        dir_sort + "{samples}-sort.bam.bai"
    shell: "samtools index {input}"

# Add folders
dir_bam = dir_data + "bam/"
dir_sort = dir_data + "sort/"
```
```python
rule dedup_bam:
    input:
        bam = dir_sort + "{samples}-sort.bam",
        index = dir_sort + "{samples}-sort.bam.bai"
    output:
        dir_dedup + "{samples}-sort.dedup.bam"
    shell: "umi_tools dedup -I {input.bam} -S {output}"

# Directory to add
dir_dedup = dir_data + "dedup/"
```
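At its simplest, `umi_tools dedup` keeps one read per (mapping position, UMI) pair, so PCR duplicates with the same UMI collapse to a single read. A simplified sketch of that idea (real umi_tools also merges UMIs within edit distance 1 using its directional method; the `dedup_by_umi` helper is hypothetical):

```python
def dedup_by_umi(reads):
    """Keep the first read seen for each (chrom, pos, umi) combination.
    Each read is a (chrom, pos, umi, name) tuple; returns kept names."""
    keep = []
    seen = set()
    for chrom, pos, umi, name in reads:
        key = (chrom, pos, umi)
        if key not in seen:
            seen.add(key)
            keep.append(name)
    return keep
```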
I will use this bedtools command to visualize the densities of net-seq reads:
http://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html

```bash
bedtools genomecov -bg -5 -ibam <bam> -g <genome>
```

Flags to use:

* `-bg`: report genome-wide coverage in bedgraph format
* `-5`: calculate coverage for the 5' end of each read
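To make the `-bg -5` combination concrete: each read contributes a single count at its 5'-most genomic base (the start on the plus strand, the last base on the minus strand), and the counts are reported as bedgraph intervals. A small sketch of that computation (the `five_prime_bedgraph` helper is hypothetical, using 0-based half-open coordinates):

```python
from collections import Counter

def five_prime_bedgraph(reads):
    """Count reads by the position of their 5' end and emit
    (chrom, start, end, count) bedgraph-style records.
    Each read is a (chrom, start, end, strand) tuple."""
    counts = Counter()
    for chrom, start, end, strand in reads:
        pos = start if strand == "+" else end - 1
        counts[(chrom, pos)] += 1
    return [(chrom, pos, pos + 1, n)
            for (chrom, pos), n in sorted(counts.items())]
```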
```python
rule genome_cov:
    input:
        bamdedup = dir_dedup + "{samples}-sort.dedup.bam",
        genome = dir_genome + ensembl_genome + ".reads"
    output:
        dir_cov + "{samples}-sort.dedup.cov.bed"
    shell: "bedtools genomecov -bg -5 -ibam {input.bamdedup} -g {input.genome} > {output}"

# Directory to add
dir_cov = dir_data + "cov/"
```
It works when I run it for just one file:

```bash
bedtools genomecov -bg -5 -ibam data/dedup/SRR1575922_CUT-sort.dedup.bam -g genome/GRCh37.75.reads
```

Problem: the shell command needs to redirect bedtools' stdout to the output file, and I am not sure what file type this should be.
```python
rule all:
    input:
        expand(fastqc_dir + "{samples}_fastqc.html", samples = samples),
        dir_genome + ensembl_genome + ".reads",
        expand(fastq_extr + "{samples}_extracted.fastq", samples = samples),
        expand(dir_bam + "{samples}.bam", samples = samples),
        expand(dir_sort + "{samples}-sort.bam", samples = samples),
        expand(dir_sort + "{samples}-sort.bam.bai", samples = samples),
        expand(dir_dedup + "{samples}-sort.dedup.bam", samples = samples),
        expand(dir_cov + "{samples}-sort.dedup.cov.bed", samples = samples)
```
I will need to create a cluster.json file where I specify the memory for each of my rules. Created for the first rules:

```json
{
    "__default__" :
    {
        "mem" : 2000,
        "n" : 1,
        "tasks" : 1,
        "name" : "{rule}-{wildcards}",
        "logfile" : "log/snake-{rule}-{wildcards}-%j.out"
    },
    "subread_index" :
    {
        "mem" : 8000
    },
    "fastqc" :
    {
        "mem" : 8000
    },
    "subjunc" :
    {
        "mem" : 16000,
        "tasks" : 8
    },
    "sort_bam" :
    {
        "mem" : 8000
    }
}
```
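The way these settings reach the scheduler is that each rule's entry is merged over `__default__` and substituted into the sbatch command line in submit-snakemake.sh. A sketch of that merge, assuming a submit script along the lines of `--cluster "sbatch --mem={cluster.mem} ..."` (the `sbatch_command` helper and the exact flag mapping are assumptions, not the actual script):

```python
# Hypothetical helper: merge a rule's cluster.json entry over __default__
# and format it as an sbatch command line.
def sbatch_command(cluster, rule):
    settings = dict(cluster["__default__"])
    settings.update(cluster.get(rule, {}))
    return ("sbatch --mem={mem} --nodes={n} --ntasks={tasks} "
            "--job-name={name} --output={logfile}").format(**settings)

cluster = {
    "__default__": {"mem": 2000, "n": 1, "tasks": 1,
                    "name": "{rule}-{wildcards}",
                    "logfile": "log/snake-{rule}-{wildcards}-%j.out"},
    "subjunc": {"mem": 16000, "tasks": 8},
}
```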
The full files are in my pipeline folder on midway.

To check the Snakefile:

```bash
snakemake -np
```

To run the Snakefile, in the pipeline directory:

```bash
bash scripts/submit-snakemake.sh
```

To check running jobs:

```bash
squeue --user=brimittleman
```
```r
sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] compiler_3.4.2  backports_1.1.1 magrittr_1.5    rprojroot_1.2
 [5] tools_3.4.2     htmltools_0.3.6 yaml_2.1.14     Rcpp_0.12.13
 [9] stringi_1.1.5   rmarkdown_1.6   knitr_1.17      git2r_0.19.0
[13] stringr_1.2.0   digest_0.6.12   evaluate_0.10.1
```
This R Markdown site was created with workflowr