Last updated: 2017-11-07

Code version: cdfdd68

The snake and config files are included in the net_seq_pipeline folder. This directory has the human reference genome and the conda environment file.

before each session: module load Anaconda3 source activate net-seq

To test my snake file: snakemake -np

Set up config file:

# Snakemake configuration file

# Specify paths to data files

# Paths must end with forward slash

#project directory 
dir_proj: /project2/gilad/briana/net_seq_pipeline/

#directory with aditional scripts:
scripts: /project2/gilad/briana/net_seq_pipeline/scripts/

# Make sure to also update path for log files in cluster.json
dir_log: log/

# Specify Ensembl release for genome sequence and annotation
# http://feb2014.archive.ensembl.org/index.html
ensembl_archive: feb2014.archive.ensembl.org
ensembl_rel: 75
ensembl_genome: GRCh37.75

#Specify chromosomes
#
# For quantifying gene expression
chr_genes: ["1", "2", "3", "4", "5", "6", "7", "8", "9",
            "10", "11", "12", "13", "14", "15", "16",
            "17", "18", "19", "20", "21", "22",
            "X", "Y", "MT"]

Make this file executable:

chmod 700 config.yaml

Set up snakefile:

#Snakefile
#
#This file will run the net-seq pipeline from fastq files including assembling reference genome
#
#To configure the paths to data files and other settings, edit
#config.yaml
#
#to configure job submission settings for cluster, edit
#cluster.json and submit.snakemake.sh

#to run on RCC midway2 use 'bash submit-snakemake.sh'


import glob
import os
from snakemake.utils import R

#Configuration -------------------------------------  


configfile:  "config.yaml"


#Specifu Ensembl release for genome sequence and annotation
ensembl_archive = config["ensembl_archive"]
ensembl_rel = config["ensembl_rel"]
ensembl_ftp = "ftp://ftp.ensembl.org/pub/release-" + \
              str(ensembl_rel) + \
              "/fasta/homo_sapiens/dna/"
ensembl_exons = "exons-ensembl-release-" + str(ensembl_rel) + ".saf"
ensembl_genome = config["ensembl_genome"]

#Paths for data (end with forward slash)
dir_proj=config["dir_proj"]
dir_data= dir_proj + "data/"
dir_genome= dir_proj + "genome/"
output= dir_proj + "output/"
fastq_dir= dir_data + "fastq/"
fastqc_dir = output + "fastqc/"
chr_genes = config["chr_genes"]



assert os.path.exists(dir_proj), "Project directory exists"

#Directory to send log files. Needs to be created manually since it
#is not a file created by a Snakemake rule.
dir_log = config["dir_log"] 
if not os.path.isdir(dir_log):
     os.mkdir(dir_log)

Rules to add to my snake file

Rules to download and index genome

This is from John’s cardioQTL snake file

rule download_genome:
    output: dir_genome + "Homo_sapiens." + ensembl_genome + \
            ".dna_sm.chromosome.{chr}.fa.gz"
    params: chr = "{chr}", build = ensembl_genome
    shell: "wget -O {output} {ensembl_ftp}Homo_sapiens.{params.build}.dna_sm.chromosome.{params.chr}.fa.gz"

rule unzip_chromosome_fasta:
    input: dir_genome + "Homo_sapiens." + ensembl_genome + \
           ".dna_sm.chromosome.{chr}.fa.gz"
    output: temp(dir_genome + "Homo_sapiens." + ensembl_genome + \
                 ".dna_sm.chromosome.{chr}.fa")
    shell: "zcat {input} > {output}"

rule subread_index:
    input: expand(dir_genome + "Homo_sapiens." + ensembl_genome + \
                  ".dna_sm.chromosome.{chr}.fa", \
                  chr = chr_genes)
    output: dir_genome + ensembl_genome + ".reads"
    params: prefix = dir_genome + ensembl_genome
    shell: "subread-buildindex -o {params.prefix} {input}"

Fastqc rule

First add a path to the fastqc dir.

fastqc_dir = output + "fastqc/
rule fastqc:
    input:
        fastq + "{samples}.fastq"
    output:
        fastqc_dir + "{samples}_fastqc.html",
        fastqc_dir + "{samples}_fastqc.zip"
    params:
        outdir = fastqc_dir
    shell:
        "fastqc -o {params.outdir} {input}"

Unzip fastqc

Maybe I dont need to do this right now: This is code to maybe add later:

rule unzip:
    input:
        fastqc_dir + "{sample}_fastqc.zip"
    output:
        fastqc_dir + "{sample}_fastqc.zip/fastqc_data.txt"
    params:
        outdir = fastqc_dir
    shell:
        "unzip -d {params.outdir} {input}"

Extract umi data

This is the next step to figure out. I need to trim the primer and extract the barcode from the linker.

Barcode DNA linker /5rApp/(N)6CTGTAGGCACCATCAAT/3ddC

Use UMI tools to extract first 6 for each read

rule umi_extract_umi:
    input:
        fastq_dir + "{samples}.fastq",
    output:
        fastq_extr + "{samples}_extracted.fastq",
    params:
        bc = "NNNNNN""
    shell:
        "umi_tools extract --bc-pattern={params.bc} --stdin {input} --stdout {output}"

Need to tell it which samples to use:

samples = set(glob_wildcards(fastq_dir + "{samples}.fastq").samples)

Align with Subread-subjunc

rule subjunc:
    input: 
        read = fastq_extr + "{samples}_extracted.fastq",
        index = dir_genome + ensembl_genome + ".reads"
    output: 
        dir_bam + "{samples}.bam"
    params: 
        prefix = dir_genome + ensembl_genome
    threads: 8
    shell: "subjunc -i {params.prefix} -r {input.read} -T {threads} > {output}"

rule sort_bam:
    input: 
        dir_bam + "{samples}.bam"
    output:
        dir_sort + "{samples}-sort.bam"
    shell: "samtools sort -o {output} {input}"   
    
    
rule index_bam:
    input: 
        dir_sort + "{samples}-sort.bam"
    output:
        dir_sort + "{samples}-sort.bam.bai"
    shell: "samtools index {input}"
    
#add folder
dir_bam= dir_data + "bam/"
dir_sort= dir_data + "sort/"

Rule dedup

rule dedup_bam:
    input: 
        bam = dir_sort + "{samples}-sort.bam",
        index = dir_sort + "{samples}-sort.bam.bai"
    output:
        dir_dedup + "{samples}-sort.dedup.bam"
    shell: "umi_tools dedup -I {input.bam} -S {output}"
    
    
#dir to add
dir_dedup=dir_data + "dedup/"

Rule genomecov

I will use this bedtools command to visualize the densities of netseq reads.

http://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html

bedtools genomecov -bg -5 -i fastq -g indexed genome

Flags to use:

  • -bg : genome wide coverage in a bedgraph format

  • -5 calculate coverage for the 5` end of the read


rule genome_cov:
    input:
       bamdedup = dir_dedup + "{samples}-sort.dedup.bam",
       genome = dir_genome + ensembl_genome + ".reads"
    output:
       dir_cov + "{samples}-sort.dedup.cov.bed"
    shell: "bedtools genomecov -bg -5 -ibam {input.bamdedup} -g {input.genome}" > {output}



#dir to add
dir_cov= dir_data + "cov/"

Works when i run it for just the file:

bedtools genomecov -bg -5 -ibam data/dedup/SRR1575922_CUT-sort.dedup.bam -g genome/GRCh37.75.reads

Problem:

This snake rule doesnt know where to write the output. I am not sure what file type this should be.

Rule all:

rule all:
    input:
       expand(fastqc_dir + "{samples}_fastqc.html", samples = samples), 
       dir_genome + ensembl_genome + ".reads",
       expand(fastq_extr + "{samples}_extracted.fastq", samples = samples),
       expand(dir_bam + "{samples}.bam", samples = samples),
       expand(dir_sort + "{samples}-sort.bam", samples= samples),
       expand(dir_sort + "{samples}-sort.bam.bai", samples=samples),
       expand(dir_dedup + "{samples}-sort.dedup.bam", samples=samples),
       expand(dir_cov + "{samples}-sort.dedup.cov.bed", samples=samples)
       

Cluster

Will need to create a cluster.json file where I specify the memory for each of my rules

Create for first 2 rules:

{
    "__default__" :
    {
        "mem" : 2000,
        "n" : 1,
    "tasks" : 1,
    "name" : "{rule}-{wildcards}",
    "logfile": "log/snake-{rule}-{wildcards}-%j.out"
    },

"subread_index" :
    {
            "mem" : 8000
    },
"fastqc" :
    {
              "mem" : 8000
    }
"subjunc" :
    {
             "mem": 16000,
             "tasks" : 8
    },
"sort_bam" :
    {
             "mem": 8000
    }

}

Full files are in my pipeline folder on midway

Run

To check the snakefile:
snakefile -np
To run the snakefile, in the pipleline directory:
scripts/submit-snakemake.sh
to check running jobs:
squeue -user=brimittleman

Things to Change

  • add a get exons rule/download exon file

Session information

sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_3.4.2  backports_1.1.1 magrittr_1.5    rprojroot_1.2  
 [5] tools_3.4.2     htmltools_0.3.6 yaml_2.1.14     Rcpp_0.12.13   
 [9] stringi_1.1.5   rmarkdown_1.6   knitr_1.17      git2r_0.19.0   
[13] stringr_1.2.0   digest_0.6.12   evaluate_0.10.1

This R Markdown site was created with workflowr