MaizegoSummer2022






Lecture 04

Genome Annotation: An overview


What is annotation



  • 👧: Guess who I ran into on my way to the pharmacy this morning?

  • 🤕: Pharmacy! Pharmacy! Pharmacy! Important things get said three times. Don't ask me how I know.


What is genome annotation

Professor Aaron Quinlan's course is an excellent start!

Only the first 20 minutes are played here; see Aaron Quinlan's YouTube channel for the full course.


Quick overview of genome (gene) annotation

1. Structural annotation

  • 1.1 De novo (Ab initio) annotation

    • Model-based prediction
  • 1.2 Homology-based annotation

    • Annotate based on sequence similarity to genes of related species
  • 1.3 EST evidence (RNA-Seq)

    • Annotate based on the species' own data

2. Functional annotation

  • Ontology: GO, KEGG ...
  • Public DBs: NR, UniProt, InterPro, enzymes ...
  • Orthologs: RBH (reciprocal best hit), synteny, phylogeny ...
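
As an illustration of the RBH idea listed above, here is a minimal sketch. It assumes the similarity search (for example BLAST or DIAMOND) has already been run in both directions and parsed into `(query, subject, score)` tuples; the function names and the toy gene IDs are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, score) tuples, e.g. parsed
    from tabular search output (the parsing itself is not shown here).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy data (hypothetical gene IDs and scores).
a_vs_b = [("a1", "b1", 250.0), ("a1", "b2", 90.0),
          ("a2", "b2", 120.0), ("a3", "b1", 80.0)]
b_vs_a = [("b1", "a1", 245.0), ("b2", "a2", 118.0), ("b2", "a1", 85.0)]

print(reciprocal_best_hits(a_vs_b, b_vs_a))
# [('a1', 'b1'), ('a2', 'b2')]
```

Here `a3` is dropped because its best hit `b1` points back to `a1`, not to `a3`.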


1. De novo (Ab initio) annotation

# Prerequisite knowledge
## Basic biology concepts
Codon, exon, intron, splicing, GC content & CpG islands, TSS/TES, ORF, 5'/3' ends, strand (+/-) ...

## Algorithms
Pattern matching, HMMs, dynamic programming ...

Common software:

  • HMM-based: FGENESH, AUGUSTUS, GeneMark, SNAP, GENSCAN ...
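
To make the concepts of codon, ORF, and strand concrete, here is a minimal pattern-matching sketch that scans all six reading frames for ATG...stop ORFs. It is far simpler than the HMM-based predictors listed above (no introns, no statistical model) and is only meant as an illustration; the function names and the `min_codons` parameter are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Find ATG...stop ORFs in all six reading frames.

    Returns (strand, start, end, orf_sequence) tuples; start/end are
    0-based, half-open coordinates on the forward strand, and the ORF
    sequence is read 5'->3' on its own strand (stop codon included).
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if (end - start) // 3 >= min_codons:
                        if strand == "+":
                            orfs.append((strand, start, end, s[start:end]))
                        else:
                            # Convert back to forward-strand coordinates.
                            orfs.append((strand, n - end, n - start, s[start:end]))
                    start = None
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
# [('+', 2, 14, 'ATGAAATTTTAG')]
```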


2. Homology-based annotation

  1. Collect evidence (proteins) and remove redundant sequences

  2. Align the proteins to the genome

  3. Parse the alignments and generate annotated regions

  • Alternatively, use whole-genome alignment (WGA) against well-annotated genomes (not commonly used)

Common software:

  • Dedup: CD-HIT
  • Mapping: DIAMOND/BLASTP, Exonerate ...
  • WGA: BLAT, LAST, minimap2, MUMmer, AnchorWave ...
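
Step 3 above (parsing alignments into annotated regions) can be sketched as an interval merge. This is a simplified illustration, assuming the protein-to-genome hits have already been parsed into `(seqid, start, end, strand)` tuples with 1-based inclusive coordinates; real tools also resolve exon structure, which is not shown. The function name and the `max_gap` parameter are hypothetical.

```python
from collections import defaultdict


def hits_to_loci(hits, max_gap=0):
    """Merge overlapping (or nearby) hits into candidate loci.

    `hits` holds (seqid, start, end, strand) tuples, 1-based inclusive.
    Hits on the same sequence and strand are merged when the next hit
    starts within `max_gap` bases of the current locus end.
    """
    by_key = defaultdict(list)
    for seqid, start, end, strand in hits:
        by_key[(seqid, strand)].append((start, end))

    loci = []
    for (seqid, strand), intervals in sorted(by_key.items()):
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + max_gap:
                cur_end = max(cur_end, end)  # extend the current locus
            else:
                loci.append((seqid, cur_start, cur_end, strand))
                cur_start, cur_end = start, end
        loci.append((seqid, cur_start, cur_end, strand))
    return loci


# Toy hits (hypothetical coordinates).
hits = [("chr1", 100, 200, "+"), ("chr1", 150, 300, "+"),
        ("chr1", 500, 600, "+"), ("chr1", 120, 180, "-")]
print(hits_to_loci(hits))
# [('chr1', 100, 300, '+'), ('chr1', 500, 600, '+'), ('chr1', 120, 180, '-')]
```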


3. EST evidence (RNA-Seq)

  1. Collect public data (ESTs)
  2. RNA sequencing: 2nd-generation, 3rd-generation
  3. De novo assembly or genome-guided assembly
  4. Transcript mapping against the genome
  5. Parse the alignments
  6. Refine

Common software:

  • RNA-seq mapping: minimap2, HISAT2, STAR ...
  • Assembly: StringTie, Cufflinks ...
  • Mapping transcripts: GMAP, minimap2 ...
  • Pipelines: PASA
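
The "parse alignment" step can be illustrated with a spliced alignment in SAM format, where the `N` operation in the CIGAR string marks a skipped reference region (an intron). The sketch below converts a SAM `POS` and CIGAR into exon blocks; it assumes a single alignment record and ignores strand and splice-site checks. The function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_to_exons(pos, cigar):
    """Convert a SAM POS (1-based) and CIGAR string into exon blocks.

    Returns (start, end) tuples, 1-based inclusive, on the reference.
    M, D, =, X consume the reference; N splits blocks (intron);
    I, S, H, P do not consume the reference.
    """
    exons = []
    block_start = pos
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "MD=X":
            ref += length
        elif op == "N":
            if ref > block_start:
                exons.append((block_start, ref - 1))
            ref += length
            block_start = ref
    if ref > block_start:
        exons.append((block_start, ref - 1))
    return exons


def exons_to_introns(exons):
    """Return the gaps between consecutive exon blocks."""
    return [(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(exons, exons[1:])]


exons = cigar_to_exons(100, "5S50M200N30M2I20M")
print(exons)                    # [(100, 149), (350, 399)]
print(exons_to_introns(exons))  # [(150, 349)]
```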


4. Combine all evidence

Evidence confidence level (highest to lowest)

  1. Iso-Seq (3rd-generation RNA sequencing): high quality, isoforms, UTRs; but rarely available
  2. 2nd-generation RNA-seq: some UTRs, reliable structures, some isoforms; commonly available
  3. Homology evidence
  4. Predictions: poor on isoforms

Common software:

  • Combining evidence: EVM, MAKER ...
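
The ranking above can be turned into a toy weighting scheme: each candidate model is scored by the evidence types that support it, and the best-scoring model per locus is kept. This is only a sketch of the general idea and not how EVM or MAKER actually work; the weight values, dictionary keys, and function names are all hypothetical.

```python
# Hypothetical weights that mirror the confidence ranking above.
EVIDENCE_WEIGHTS = {"isoseq": 10, "rnaseq": 6, "homology": 3, "ab_initio": 1}


def score_model(evidence_types):
    """Sum the weights of the distinct evidence types supporting a model."""
    return sum(EVIDENCE_WEIGHTS[e] for e in set(evidence_types))


def pick_best_models(candidates):
    """Keep the highest-scoring candidate model for each locus.

    `candidates` is a list of dicts with the (hypothetical) keys
    'locus', 'model_id', and 'evidence' (a list of evidence types).
    Ties keep the first candidate seen.
    """
    best = {}
    for cand in candidates:
        score = score_model(cand["evidence"])
        locus = cand["locus"]
        if locus not in best or score > best[locus][1]:
            best[locus] = (cand["model_id"], score)
    return {locus: model_id for locus, (model_id, _) in best.items()}


candidates = [
    {"locus": "L1", "model_id": "L1.pred", "evidence": ["ab_initio"]},
    {"locus": "L1", "model_id": "L1.rna", "evidence": ["rnaseq", "homology"]},
    {"locus": "L2", "model_id": "L2.iso", "evidence": ["isoseq"]},
    {"locus": "L2", "model_id": "L2.mix", "evidence": ["rnaseq", "homology", "ab_initio"]},
]
print(pick_best_models(candidates))
# {'L1': 'L1.rna', 'L2': 'L2.iso'}
```

In this toy scoring, L2 shows a tie-break by evidence order: Iso-Seq support alone (10) matches the combined weight of the three weaker evidence types (6 + 3 + 1), and the first candidate seen is kept. Real weights need tuning per project.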


Before gene annotation:

1. Genome assembly evaluation

  • Low-quality assemblies lead to wrong gene annotations

2. TE annotation and repeat masking

  • Another important area of genome annotation, which we will cover in the future
  • Masking reduces computing time and noise
  • Do not mask simple and low-complexity repeats (some genes contain these structures)
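
The masking rule above can be sketched as follows. The example assumes soft-masking (lowercasing), which is one common convention and not something the slide specifies, and it assumes repeat intervals labeled with RepeatMasker-style class names such as `Simple_repeat` and `Low_complexity`. The function name and input layout are hypothetical.

```python
# Repeat classes left unmasked, since genes may contain them.
SKIP_CLASSES = {"Simple_repeat", "Low_complexity"}


def soft_mask(seq, repeats):
    """Lowercase repeat intervals, leaving simple/low-complexity repeats as-is.

    `repeats` holds (start, end, repeat_class) tuples with 0-based,
    half-open coordinates. Intervals outside the sequence are clipped.
    """
    chars = list(seq.upper())
    for start, end, repeat_class in repeats:
        if repeat_class in SKIP_CLASSES:
            continue
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


repeats = [(2, 6, "LTR/Gypsy"), (8, 12, "Simple_repeat")]
print(soft_mask("ACGTACGTACGT", repeats))
# ACgtacGTACGT
```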


Alongside or after gene annotation

Alongside:

  • tRNA and ncRNA annotation

    • tRNA: tRNAscan-SE
    • ncRNA: Rfam covariance models (searched with Infernal)
After:

  • Filtering: models with little evidence, models overlapping TEs, prematurely terminated models ...
  • Functional annotation and downstream analyses ...
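
Two of the filters above can be sketched in a few lines. This example interprets "prematurely terminated" as an in-frame stop codon before the last codon of the CDS, which is an assumption, and it uses a hypothetical TE-overlap threshold of 0.5. All function names and parameters are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_premature_stop(cds):
    """True if an in-frame stop codon occurs before the final codon."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])


def te_overlap_fraction(exons, te_intervals):
    """Fraction of exonic bases covered by TE intervals.

    Both inputs are lists of (start, end), 0-based half-open.
    TE intervals are assumed not to overlap each other.
    """
    total = sum(end - start for start, end in exons)
    covered = 0
    for start, end in exons:
        for te_start, te_end in te_intervals:
            covered += max(0, min(end, te_end) - max(start, te_start))
    return covered / total if total else 0.0


def keep_model(cds, exons, te_intervals, max_te_fraction=0.5):
    """Keep a model with no premature stop and limited TE overlap."""
    return (not has_premature_stop(cds)
            and te_overlap_fraction(exons, te_intervals) <= max_te_fraction)


exons = [(0, 100), (200, 300)]
print(keep_model("ATGAAAGGGTGA", exons, [(50, 120)]))  # True
print(keep_model("ATGTAAGGGTGA", exons, [(50, 120)]))  # False
```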


Overview of gene annotation pipelines:


Common wheels

  • MAKER: probably the most widely used annotation tool; tutorials are available on the NCPGR wiki

  • Funannotate: a newer, user-friendly pipeline wrapper, available on GitHub

  • ...


Detailed tutorials from BGI College are available on BiliBili

Thx & Bye ~