Ivan G. Costa Filho igcf@cin.ufpe.br Centro de Informática Universidade Federal de Pernambuco


Biologia In Silico - Centro de Informática - UFPE


String Processing

Topics

• Biological strings
• Basic problems
– pairwise/multiple alignment
– motif search
– modeling of protein families
• Methods
– dynamic programming algorithms
– hidden Markov models
– probabilistic methods

Course

• Lectures - March/April
– introduction of basic concepts/methods
– practical sessions
• Seminars - April/May
– presentation of course topics
– individual for graduate students
– in pairs for undergraduates
• Project - May to June
– analysis of real data (from the papers discussed), in groups

Assessment

• 40% - seminar presentations
– evaluation by classmates, and attendance
• 20% - exercise lists
• 40% - group project
– individual grade: each group is responsible for describing its members' participation

Bibliography

• R. Durbin, S. R. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press.
• N. Jones and P. Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
• See the course page for the specific literature for each lecture:
– www.cin.ufpe.br/~igcf

Molecular Biology

Understanding life at the cellular level

• How genetic information is inherited
• How genetic information influences cellular processes
• How genes work together to carry out a cellular function

Genetic Information - DNA

• DNA (deoxyribonucleic acid)
– chain of nucleotides
– 4 types: A, C, G, T
– forms a double strand through base complementarity
• A pairs with T and C pairs with G
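The base-pairing rule on this slide can be written as a small lookup table. The sketch below is only an illustration: the function names are hypothetical, and it assumes the input contains only the four bases A, C, G, T.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())

def reverse_complement(strand):
    """Return the opposite strand, read in its own 5'->3' direction."""
    return complement(strand)[::-1]

print(complement("ATGCTGACC"))          # TACGACTGG
print(reverse_complement("ATGCTGACC"))  # GGTCAGCAT
```

Because the two strands run in opposite directions, the reverse complement is usually the form needed when searching both strands of a sequence.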

Central Dogma - Transcription

• Transcription
– DNA to RNA
• RNA (ribonucleic acid)
– single-stranded
– 4 types: A, C, G, U
– unstable molecules
– carries information from the nucleus to the cytoplasm

Central Dogma - Transcription

• Transcription copies the base sequence of the DNA into RNA (with U instead of T).
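At the string level, transcription as described here is a single substitution. This minimal sketch assumes the input is the coding strand written 5'->3' (in the cell, the RNA is synthesized as the complement of the template strand, which gives the same result); the function name is hypothetical.

```python
def transcribe(coding_strand):
    """Copy a DNA coding-strand sequence into RNA, writing U in place of T."""
    return coding_strand.upper().replace("T", "U")

print(transcribe("ATGGTGCTG"))  # AUGGUGCUG
```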

Central Dogma - Translation

• Translation
– RNA -> proteins
– carried out by the ribosome
– genetic code
• Proteins
– chain of amino acids
– 20 different types
– fold into a three-dimensional structure
– functional entities of the cell

Translation - Genetic Code

• Each codon (a combination of 3 bases) encodes one of the 20 amino acids.
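A rough sketch of how the codon table can be applied to an mRNA string. It assumes the standard genetic code, written compactly with the bases in the order U, C, A, G and one-letter amino acid codes (`*` marks a stop codon); the function name and the choice to stop at the first stop codon are illustrative assumptions, not something stated on the slide.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated as UUU, UUC, UUA, UUG, UCU, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna):
    """Translate an mRNA string codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":          # stop codon: translation ends here
            break
        protein.append(aa)
    return "".join(protein)

print(translate("AUGGUGCUGUCUGCCUAA"))  # MVLSA
```

There are 64 codons for 20 amino acids, so most amino acids are encoded by more than one codon.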

Central Dogma

• Dogma: flow of information
DNA -> mRNA -> protein
• Gene: segment of DNA encoding a protein.
• Transcript: segment of RNA transcribed from a gene.
• In this simplified view, one gene corresponds to one protein and one cellular function.

Control of Gene Expression

• How is gene expression controlled?
• Certain proteins, called transcription factors, bind to DNA and are responsible for initiating transcription.

Control of Gene Regulation

Bioinformatics

• Manage molecular biological data
– store in databases, organise, formalise, describe...
• Compare molecular biological data
• Find patterns in molecular biological data
– phylogenies
– correlations (sequence / structure / expression / function / disease)

Goals:
• characterise biological patterns & processes
• predict biological properties
– low-level data ⇒ high-level properties (e.g., sequence ⇒ function)

Bioinformatics: neighbour disciplines

• Computational biology
– broader concept: includes computational ecology, physiology, neurology etc...
• -omics:
– Genomics
– Transcriptomics
– Proteomics
• Systems biology
– putting it all together...
– building models, identifying control & regulation

Molecular biology data...

• DNA sequences

>alpha-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTGGAGCCGAGGCCCTGGAGAGGTGCGGGCTGAGCTTGGGGAAACCATGGGCAAGGGGGGCGACTGGGTGGGAGCCCTACAGGGCTGCTGGGGGTTGTTCGGCTGGGGGTCAGCACTGACCATCCCGCTCCCGCAGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTTGCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAGAGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACCCTGTCAACTTCAAGGCAGGCGGGGGACGGGGGTCAGGGGCCGGGGAGTTGGGGGCCAGGGACCTGGTTGGGGATCCGGGGCCATGCCGGCGGTACTGAGCCCTGTTTTGCCTTGCAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACACCCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGATAA
>alpha-A
ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGCCAGGCCGGTGACTTGGGTGGTGAAGCCCTGGAGAGGTATGTGGTCATCCGTCATTACCCCATCTCTTGTCTGTCTGTGACTCCATCCCATCTGCCCCCATACTCTCCCCATCCATAACTGTCCCTGTTCTATGTGGCCCTGGCTCTGTCTCATCTGTCCCCAACTGTCCCTGATTGCCTCTGTCCCCCAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACCTGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTGAGGCTGCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACGCCCAAAAGCTCCGTGTGGACCCCGTCAACTTCAAAGTGAGCATCTGGGAAGGGGTGACCAGTCTGGCTCCCCTCCTGCACACACCTCTGGCTACCCCCTCACCTCACCCCCTTGCTCACCATCTCCTTTTGCCTTTCAGCTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTTCCCCTCTCTCCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGGCACCGTCCTTACTGCCAAGTACCGTTAA
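The DNA sequences on this slide are in FASTA format: a header line starting with `>` gives the record name, and the lines that follow hold the sequence. A minimal reader could look like the sketch below; the function name and the short example records are hypothetical, with the sequences shortened to the first bases of each record.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                    # skip blank lines
        if line.startswith(">"):
            name = line[1:]             # header line: start a new record
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {n: "".join(parts) for n, parts in records.items()}

example = """>alpha-D
ATGCTGACCGACTCTGACAAGAAG
>alpha-A
ATGGTGCTGTCT
GCCAACGACAAG
"""
for name, seq in parse_fasta(example).items():
    print(name, len(seq))
```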

Molecular biology data...

• Amino acid sequences
• Protein structure:
– X-ray crystallography
– NMR

Cell biology & proteomics data...

• Subcellular localization


Prediction Methods

• Homology / alignment
• Simple pattern (“word”) recognition
• Statistical methods
– Weight matrices: calculate amino acid probabilities
– Other examples: regression, variance analysis, clustering
• Machine learning
– Like statistical methods, but parameters are estimated by iterative training rather than direct calculation
– Examples: Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM)
• Combinations

Similarity between sequences

If two sequences look similar, the explanation may be:
• Homology (common descent)
• Convergent evolution (common function → common selective pressure)
• Chance!

Sequences are related

• Darwin: all organisms are related through descent with modification
• => Sequences are related through descent with modification
• => Similar molecules have similar functions in different organisms

Phylogenetic tree based on ribosomal RNA: three domains of life

Sequences are related II

Phylogenetic tree of globin-type proteins found in humans


Why compare sequences?

• Determination of evolutionary relationships

• Prediction of protein function and structure (database searches).

Protein 1: binds oxygen

Sequence similarity

Protein 2: binds oxygen?

Biological Databases

• Vast amounts of biological and sequence data are freely available through online databases
• Use computational algorithms to efficiently store large amounts of biological data

Examples

• NCBI GenBank http://ncbi.nih.gov
– huge collection of databases, the most prominent being the nucleotide sequence database
• Protein Data Bank http://www.pdb.org
– database of protein tertiary structures
• SWISSPROT http://www.expasy.org/sprot/
– database of annotated protein sequences
• PROSITE http://kr.expasy.org/prosite
– database of protein active site motifs

Sequence Alignment

BLAST

• A computational tool that allows us to compare query sequences with entries in current biological databases.
• A great tool for predicting the function of an unknown sequence based on alignment similarity to known genes.

Some Early Roles of Bioinformatics

• Sequence comparison
• Searches in sequence databases

Biological Sequence Comparison

• Needleman-Wunsch, 1970
– dynamic programming algorithm to align sequences
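To make the dynamic programming idea concrete, here is a minimal sketch of global alignment in the Needleman-Wunsch style. The scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name are assumptions chosen for illustration; they are not taken from the slide or from the 1970 paper.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Traceback from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

print(needleman_wunsch("ACGT", "AGT"))
```

The table has (n+1) x (m+1) cells and each cell is filled in constant time, so the running time is proportional to the product of the two sequence lengths.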

Searching for Localization Signals

Protein sorting in eukaryotes

• Proteins belong in different organelles of the cell, and some even have their function outside the cell
• In 1999, Günter Blobel was awarded the Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell"

Protein sorting: secretory pathway / ER

Secretory proteins have a signal peptide.

Initially, they are transported across the ER membrane.

Signal peptides

A signal peptide is an N-terminal part of the amino acid chain, containing a hydrophobic region.

Signal peptides differ between proteins, and can be hard to recognize.


Simple pattern (“word”) recognition

Example: PROSITE entry PS00014, ER_TARGET:

Endoplasmic reticulum targeting sequence (“KDEL-signal”).

Pattern: [KRHQSA]-[DENQ]-E-L

NB: only yes/no answers!
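A pattern of this kind maps directly onto a regular expression. The sketch below handles only the small subset of PROSITE syntax used here (single residues, bracketed alternatives, `x` for any residue, `-` as separator). The helper names are hypothetical, and the `c_terminal` flag is an assumption of this sketch, reflecting that the KDEL signal sits at the C-terminal end of the protein; the pattern as written on the slide carries no anchor.

```python
import re

def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern to a regex (subset of the syntax only)."""
    elements = pattern.split("-")
    return "".join("." if e == "x" else e for e in elements)

ER_TARGET = prosite_to_regex("[KRHQSA]-[DENQ]-E-L")   # -> "[KRHQSA][DENQ]EL"

def has_er_target(protein, c_terminal=True):
    """Yes/no answer: does the protein contain the ER targeting pattern?"""
    regex = ER_TARGET + ("$" if c_terminal else "")
    return re.search(regex, protein) is not None

print(has_er_target("MKTAYIAKDEL"))                    # True
print(has_er_target("MKDELTAYIA"))                     # False
print(has_er_target("MKDELTAYIA", c_terminal=False))   # True
```

As the slide notes, the answer is strictly yes or no: a sequence that differs from the pattern in a single position is rejected just like an unrelated one.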

Statistical Methods

ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

• Estimate probabilities for nucleotides / amino acids
• Information content in sequences; logos; Position-Weight Matrices.
• Quantitative answers.
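As a rough illustration of these ideas, the sketch below reads the peptide data on this slide as ten aligned 9-residue peptides, estimates per-position amino acid frequencies, computes the information content of each column, and scores a peptide with a simple log-odds sum. The uniform background of 1/20, the crude floor `pseudo` for unseen residues, and all function names are assumptions made for the example; practical weight matrices use more careful pseudocounts and sequence weighting.

```python
import math
from collections import Counter

PEPTIDES = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
            "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]

def column_frequencies(seqs):
    """Observed residue frequencies for each alignment column."""
    n = len(seqs)
    return [{aa: count / n for aa, count in Counter(col).items()}
            for col in zip(*seqs)]

def information_content(freqs, alphabet_size=20):
    """Information per column in bits: log2(alphabet size) minus entropy."""
    return [math.log2(alphabet_size) + sum(p * math.log2(p) for p in col.values())
            for col in freqs]

def log_odds_score(peptide, freqs, background=1 / 20, pseudo=0.01):
    """Sum over positions of log2(p / q); `pseudo` replaces unseen residues."""
    return sum(math.log2(freqs[i].get(aa, pseudo) / background)
               for i, aa in enumerate(peptide))

freqs = column_frequencies(PEPTIDES)
print(freqs[0])                                   # frequencies at position 1
print([round(x, 2) for x in information_content(freqs)])
print(round(log_odds_score("ALAKAAAAV", freqs), 2))
```

Unlike the yes/no pattern match, the score is quantitative: peptides can be ranked by how well they fit the estimated profile.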

Motif Search

Random Sample

atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca

tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag

gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca


Implanting Motif AAAAAAAAGGGGGGG

atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa

tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag

gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa


Where is the Implanted Motif?

atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga

tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag

gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga


Implanting Motif AAAAAAAAGGGGGGG with Four Mutations

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa


Where is the Motif???

atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga

tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag

gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga


Why Is Finding a (15,4) Motif Difficult?

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

AgAAgAAAGGttGGG

cAAtAAAAcGGcGGG

..|..|||.|..|||


Next Lecture

• Read chapter 1 of Durbin
• Introduction to dynamic programming algorithms (10/08)

Acknowledgments

• Some slides taken from
– Biological Sequence Analysis course, CBS, Technical University of Denmark
– Neil Jones, University of California at San Diego
