Maizego Summer Tutorial






Lecture 3

Make explicit and reusable pipelines


  • 👨‍🔬😃: Bro~ can you share your GO annotation pipelines? I'd like to run them on my data.

  • 🤪: Sure thing, here you are! Just replace the paths with yours and it'll work.

    • NOTE: DO NOT really try to run the code above !!!

  • 👨‍🔬🥶: ...Thank ...You

  • 🤪: I'm always here for you!


  • 👨‍🔬😟: Hi, can you share your GO annotation pipelines? I'd like to run them on my data.

  • 👨‍💻: Wait a sec ...

  • 👨‍💻: Ok, I've put them here: ZMP_blast2go. The README and test data will lead you to your goal.


  • 👨‍🔬😂: Oh boy! I wish I could marry you ...

  • 👨‍💻😲: All ...right, Bye💦💨

  • 👨‍🔬😂: Wait!! Teach me how to write solid pipelines!


Cautions when working with linux/shell/bash

1. exit status and set -o pipefail

  • # a counter that is allergic to seven
    function algc7(){
        # I pass numbers down the pipe, but I die as soon as I see a seven
        # (exit only kills this pipeline stage, which runs in a subshell)
        while read -r num;do
            [[ "$num" == *"7"* ]] && exit 1
            echo "$num"
        done
    }
    seq 10 | wc -l; echo $?         # expect 10 and normal exit, get 10 and exit 0
    seq 10 | algc7 | wc -l; echo $? # expect  6 and error exit,  get 6  and exit 0, quack!
    set -o pipefail
    seq 10 | algc7 | wc -l; echo $? # expect  6 and error exit,  get 6  and exit 1, good doctor!
    
  • Better to put set -o pipefail in your ~/.bashrc and in every shell script


2. set -o nounset and set -o errexit: useful in shell scripts

3. Or check the exit status explicitly

if [ $? -ne 0 ];then
    echo "[ERROR]: Some error msg" >&2
    exit 1
fi
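
To see what items 2 and 3 change in practice, here is a minimal sketch. Each case runs in a child bash (bash -c) so that a failure cannot kill your current shell. The variable names outdir and outdri (a deliberate typo) are made up for illustration.

```shell
#!/usr/bin/env bash
# Without nounset: a misspelled variable silently expands to an empty string.
bash -c 'outdir=results; echo "cleaning: ${outdri}/tmp"'   # prints "cleaning: /tmp"
echo "exit status without nounset: $?"

# With nounset: the same typo aborts with an "unbound variable" error.
bash -c 'set -o nounset; outdir=results; echo "cleaning: ${outdri}/tmp"'
echo "exit status with nounset: $?"                        # non-zero

# Without errexit: the script keeps going after a failed command.
bash -c 'false; echo "still running after a failure"'

# With errexit: the script stops at the first failed command.
bash -c 'set -o errexit; false; echo "never printed"'
echo "exit status with errexit: $?"                        # non-zero

# Item 3 without $?: test the command directly in the if.
if ! bash -c 'exit 3'; then
    echo "[ERROR]: the step failed" >&2
fi
```

Note that errexit ignores failures inside if conditions and && / || chains, so an explicit check is still worth having for the steps that matter.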

4. check if the output is valid:

file exists, variable not empty, number within a cutoff, string length ...
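
The checks listed in item 4 could look like the sketch below. The helper names (check_file, check_not_empty, check_int_range, check_min_length) are hypothetical, not part of any library. Each returns 0 when the check passes and 1 otherwise, so they can be chained with || exit 1.

```shell
#!/usr/bin/env bash
# File exists and is not empty.
check_file() {
    [[ -s "$1" ]] || { echo "[ERROR]: missing or empty file: $1" >&2; return 1; }
}

# Value is not an empty string; $2 is only a label for the error message.
check_not_empty() {
    [[ -n "${1:-}" ]] || { echo "[ERROR]: empty value for ${2:-variable}" >&2; return 1; }
}

# Non-negative integer within [min, max].
check_int_range() {
    local value=$1 min=$2 max=$3
    [[ "$value" =~ ^[0-9]+$ ]] || { echo "[ERROR]: not an integer: $value" >&2; return 1; }
    # 10# forces base 10, so a value like 08 is not read as (invalid) octal
    (( 10#$value >= min && 10#$value <= max )) ||
        { echo "[ERROR]: $value is outside [$min, $max]" >&2; return 1; }
}

# String has at least N characters.
check_min_length() {
    local str=$1 min=$2
    (( ${#str} >= min )) || { echo "[ERROR]: string shorter than $min: $str" >&2; return 1; }
}

# Usage on a throwaway file
tmp=$(mktemp)
echo "ATGC" > "$tmp"
check_file "$tmp"          && echo "file ok"
check_not_empty "B73" name && echo "name ok"
check_int_range 42 0 100   && echo "number ok"
check_min_length "ATGC" 4  && echo "length ok"
rm -f "$tmp"
```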

5. cmds you should use with caution

rm -f, mv, tar,
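
Why these three? Each of them can destroy data without asking. The sketch below shows one defensive habit per command. It works in a scratch directory from mktemp -d, and all file names are made up for illustration.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)                      # scratch directory for the demo
echo "precious" > "$workdir/result.txt"
echo "new"      > "$workdir/new.txt"

# mv overwrites an existing target without asking; -n (no-clobber) refuses to.
mv -n "$workdir/new.txt" "$workdir/result.txt"
kept=$(cat "$workdir/result.txt")
echo "result.txt still contains: $kept"

# rm -rf "$dir/cache" turns into rm -rf "/cache" when dir is empty.
# ${dir:?} makes the shell abort the command before rm ever runs.
dir=""
( rm -rf "${dir:?dir is empty}/cache" ) 2>/dev/null || echo "rm was never executed"

# tar extracts into the current directory and overwrites files there:
# list the archive first, then extract into a directory of its own.
tar -czf "$workdir/backup.tar.gz" -C "$workdir" result.txt
listing=$(tar -tzf "$workdir/backup.tar.gz")
echo "archive contains: $listing"
mkdir "$workdir/extracted"
tar -xzf "$workdir/backup.tar.gz" -C "$workdir/extracted"

rm -rf "${workdir:?}"                     # clean up the scratch directory
```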


Why do we need pipelines?

  • maintenance

  • do not re-invent wheels

  • share with colleagues

  • help to sort your thoughts

  • help to build more complex projects

What do we want from pipelines?

  • adjustable parameters

  • dependency checking and chaining

  • step-by-step control


Ways to build complex pipelines

  • make: Very hacker-like, not widely used

  • shell script

  • other (usually high-level) programming languages: perl, python, R, julia ...

  • Third party tools: snakemake, nextflow, wdl ...


  • We will only cover bash script based pipelines here


Pipelines with bash: things you should know ahead

  • check dependencies: which, -x, --version ...
  • logging, stdout & stderr: >&1, >&2
  • parse arguments: getopts, getopt (GNU), DIY: $@ + case
  • command chains: |, &&, ||
  • step check and skip:
    # check result files (when one main result file is needed)
    if [[ ! -s "output.txt" ]];then
        # your CMDs go here and generate result file
        $CMDs > output.txt
        if [ $? -ne 0 ];then rm output.txt; exit 1; fi
    fi
    # check step tag file (when many result files are needed, or you want to control manually)
    if [ ! -s "step1.done" ];then
        # your CMDs go here
        $CMD1 &&
        $CMD2
        if [ $? -ne 0 ];then exit 1;fi
        echo "done" > step1.done
    fi
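
The first three bullets above (dependency check, logging to stderr, argument parsing) can be sketched together as follows. This is only one possible layout: the function names log, check_deps and parse_args, the options -i/-o/-t and their defaults are all hypothetical. getopts is the bash builtin; the leading ":" in the option string switches it to silent mode so the script can print its own error messages.

```shell
#!/usr/bin/env bash
set -o nounset
set -o pipefail

# Log to stderr so that stdout stays clean for real results.
log() { echo "[$(date +'%F %T')] $*" >&2; }

# Check that every required tool is on $PATH; report all missing ones at once.
check_deps() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            log "[ERROR]: dependency not found: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Parse -i <input> [-o <outdir>] [-t <threads>] into global variables.
parse_args() {
    local opt OPTIND=1                  # reset OPTIND so the function is re-callable
    input="" outdir="out" threads=1     # hypothetical defaults
    while getopts ":i:o:t:h" opt; do
        case "$opt" in
            i) input=$OPTARG ;;
            o) outdir=$OPTARG ;;
            t) threads=$OPTARG ;;
            h) echo "Usage: pipeline.sh -i input [-o outdir] [-t threads]" >&2; return 2 ;;
            :) log "[ERROR]: option -$OPTARG needs a value"; return 1 ;;
            \?) log "[ERROR]: unknown option -$OPTARG"; return 1 ;;
        esac
    done
    [[ -n "$input" ]] || { log "[ERROR]: -i is required"; return 1; }
}

# Usage: in a real script you would call parse_args "$@" and exit 1 on failure.
check_deps bash sort && log "all dependencies found"
parse_args -i reads.fa -t 4 && log "input=$input outdir=$outdir threads=$threads"
```

Collecting every missing tool before failing saves the user from fixing dependencies one run at a time.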
    


Learn with a real problem:

>>> build an extremely accurate body index predictor 💯

A prototype goes here:


List the demands

  • A user input prompt

  • A supaaaar cooool ~~ calculating progress indicator

  • Print the result


Search for solutions

  • input prompt: DIY with echo + read

  • progress indicator: DIY with echo

  • result print: cmatrix
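
The two DIY pieces might be sketched like this. The function names ask and progress_bar are hypothetical, and the bar uses printf rather than echo because printf can pad the bar to a fixed width. The demo feeds the answer with a here-string (<<<) so it runs without a keyboard; in a real script you would simply call ask and type.

```shell
#!/usr/bin/env bash
# Prompt on stderr, read one line from stdin, print the answer on stdout.
ask() {
    local prompt=$1 answer
    echo "$prompt" >&2
    read -r answer
    echo "$answer"
}

# Draw a one-line text progress bar on stderr: progress_bar <done> <total> [width]
progress_bar() {
    local current=$1 total=$2 width=${3:-20}
    local filled=$(( current * width / total ))
    local bar
    bar=$(printf '%*s' "$filled" '')    # a run of $filled spaces ...
    bar=${bar// /#}                     # ... turned into # characters
    # \r returns to the start of the line, so each call redraws the same line
    printf '\r[%-*s] %3d%%' "$width" "$bar" $(( current * 100 / total )) >&2
}

height=$(ask "Please input your height (in cm):" <<< "175")
for i in $(seq 1 10); do
    progress_bar "$i" 10
    sleep 0.1
done
echo >&2                                # finish the progress line
echo "You typed: $height"
```

Sending the prompt and the bar to stderr keeps stdout free, so the answer can be captured with $( ... ) as above.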





Learn about:


Build the app: the design


Build the app: install the dependency

1. install cmatrix the hard way: compile from source code

# step 1: download the source code
git clone https://github.com/abishekvashok/cmatrix.git

# step 2: learn how to compile: from INSTALL, or README

# step 3: build
autoreconf --install
./configure
make && make install # "make install" may fail if you are not root

# step 3': no root? install under a prefix you own instead
./configure --prefix=/path/to/your/path
make && make install

# step 4: add the executable to the $PATH
cd /path/to/cmatrix/executable
echo "export PATH=$PWD:\$PATH" >> $HOME/.bashrc
source $HOME/.bashrc
# test
which cmatrix
cmatrix


Build the app: Coding

#!/usr/bin/env bash
# first things first: set options
set -o nounset
set -o pipefail

# define common log functions: with color
function mylog () {
    local info=$1
    echo -e "\033[36m[$(date +'%y-%m-%d %H:%M')]\033[0m $info" >&2
}
function myrcd () {
    local info=$1
    echo -e "\033[32m>>>------------>\033[0m $info" >&2
}
function mywarn () {
    local info=$1
    echo -e "\033[35m[WARNING]\033[0m --> $info" >&2
}
export -f mylog myrcd mywarn

# step 1: get user's height in cm
# set the main variable to store input
export height=0
echo "Please input your height (in cm):" >&2 # hint to stderr
read height
# keep asking for the height until it is valid
while [[ $height -gt 1000 || $height -lt 10 ]]
do
    mywarn "Really? Your height is $height cm? \nI can only predict human.\nPlease input again (height in cm):" >&2
    read height
done

# step 2: fake calculating function
function fake_calc_prg(){
    # we generate random strings for logging
    mylog "Start calculating ..."
    sleep 2
    mylog "Building models ..."
    sleep 2
    cat /dev/urandom | strings -n 10 | head -n 20 |\
    while read str;do
        myrcd "$str$str"
        sleep 0.2
    done
}
export -f mylog myrcd fake_calc_prg

# step 3: main procedure: do fake calc and out
fake_calc_prg
mylog "Generating result reports ..."
sleep 3
export outtext="##= Your height is $height cm =##"
cmatrix -rM "$(echo $outtext)" -u 1

Save the above code locally to mzg_height_predictor.sh, then run bash mzg_height_predictor.sh


Build the app: Debug and Improvement

  • Try running the script with 179.999 as input



❓ Homework:

1. How to fix the bug above?

2. Can we add a fake multi-thread feature?

If the user chooses, say, 4 threads, the whole "calculation" should run 3~4 times faster.


Class dismissed!

Bye~