UNIVERSIDADE FEDERAL DE PELOTAS, Graduate Program in Agronomy

Thesis

In silico characterization of microsatellites in the rice genome and comparative analysis with other plant species

Luciano Carlos da Maia

Pelotas, 2009

Luciano Carlos da Maia

Agronomist

In silico characterization of microsatellites in the rice genome and comparative analysis with other plant species

Advisor: Antônio Costa de Oliveira, PhD, FAEM/UFPel

Co-advisor: Fernando Irajá Félix de Carvalho, PhD, FAEM/UFPel

Pelotas, 2009

Thesis presented to the Graduate Program in Agronomy of the Universidade Federal de Pelotas in partial fulfillment of the requirements for the degree of Doctor of Science (field of knowledge: Plant Breeding).

Examining committee:

Professor, PhD., Antonio Costa de Oliveira (UFPel), Chair

Professor, PhD., Fernando Irajá Félix de Carvalho (UFPel)

Professor, PhD., Antonio Vargas de Oliveira Figueira (ESALQ-USP)

Professor, PhD., Odir Antonio Dellagostin (UFPel)

Professor, PhD., Cesar Valmor Rombaldi (UFPel)

To my parents, Milton and Ivone.

To my sisters Lú, Binha and Tamara, and to my brother Zico.

To everyone I love.

I dedicate this work.

Acknowledgments

- To God, for my life and for the clarity of my convictions.

- To Professor Antônio, for his guidance, his support of the bioinformatics research, and his friendship.

- To Professor Fernando Carvalho, for his tireless work in teaching plant breeding and for the always welcome scoldings.

- To everyone with whom I shared these years at the Plant Genomics and Breeding Center (Centro de Genomica e Fitomelhoramento): friends and colleagues...

- To grandpa Alcides, grandma Maurisa and Osmil.

- To my faithful friends (Batista, Walter, Galvão, Zé Siqueira, Denival, Cibalena, Anderson and Mané) and relatives back in Rechan, who remain part of my life even after all these years away.

- To Éder Moreira (Jacupiranga), a friend through every hardship...

- To the friends with whom I shared a home during these years: Mano Lima (and Darliane...), Julio and Diego.

- To Dario Palmieri (UNESP), Mauricio Kopp (EMBRAPA), Velci (UFSM) and Valmor (KSP), for their friendship and the constant professional discussions...

- To the couple Fernando Henning and Lili Mertz, great friends, for the hours of pão de queijo, coffee, BLASTs and many discussions...

- To Cris Sakashita...

- To everyone who contributed and helped me...

- To the agencies CAPES and CNPq, without which this dream could not have come true.

- To all, a warm embrace!

Behold, a sower went out to sow. And as he sowed, some seed fell by the wayside, and the birds came and ate it. Other seed fell on stony ground, where there was not much earth, and it sprang up quickly because the soil was not deep; but when the sun rose it was scorched and withered, because it had no root. Other seed fell among thorns, and the thorns grew up and choked it. And other seed fell on good soil and bore fruit: some a hundredfold, some sixty, some thirty.

(Matthew 13:3-8)

Summary

MAIA, Luciano Carlos da. In silico characterization of microsatellites in the rice genome and comparative analysis with other plant species. 2009. 83 p. Thesis (Doctorate) - Graduate Program in Agronomy. Universidade Federal de Pelotas, Pelotas.

Molecular markers have been used successfully in genetic mapping and marker-assisted selection as an auxiliary tool for plant breeding and for transferring information between related species. In this sense, understanding the occurrence of microsatellites in the genomes of improved species such as wheat, rice and maize can help improve basic knowledge of the grass species described as "orphan". Rice, after the completion of its genome sequence, has been proposed as a genetic model among the grasses. Among the different types of molecular markers, microsatellites are indicated as the preferred class for these studies. In general, strategies for transferring molecular markers between species still face difficulties and open questions regarding which microsatellite patterns are most conserved among plant species, genera and families. This study aimed to use bioinformatics tools to characterize microsatellites from the genomes of rice and other species, making it possible to predict which microsatellite patterns are most promising for transfer. Three studies were carried out. The first consisted of developing and validating a tool for locating microsatellites, designing primers and simulating PCR. A database containing 28,469 fl-cDNA sequences from japonica rice was used. From a total of 3,907 loci found, 3,329 primer sets were designed and tested by PCR simulation, which showed that only 2,397 (72%) of them amplified specific regions.

In the second study, the occurrence of microsatellites was analyzed in expressed regions of ten species from three different plant families. The results indicated the frequency and patterns of microsatellites within and between the different families. In the third study, microsatellites were characterized in the complete rice genome. The results showed a conserved pattern of occurrence of the different microsatellites across the different chromosomes, and which arrangements were the most abundant. Inferences about which elements allow the best genome coverage were discussed.

Keywords: Oryza sativa subsp. japonica. Simple Sequence Repeat. Microsatellites. Bioinformatics. Genomics. Plant breeding. Grasses.

Abstract

MAIA, Luciano Carlos da. In silico characterization of microsatellites in the rice genome and comparative analysis with other plant species. 2009. 83 p. Thesis (Doctorate) - Graduate Program in Agronomy. Universidade Federal de Pelotas, Pelotas.

Molecular markers have been successfully applied in genetic mapping and marker-assisted selection as an auxiliary tool for plant breeding and for the transfer of genetic information among related species. In this sense, understanding the genome elements occurring in important crop species such as wheat, rice and maize can be used to improve basic knowledge of orphan grass species. Rice, after the completion of its genome sequence, has been proposed as a genetic model for the grasses. Among the different types of molecular markers, microsatellites have been indicated as the preferred class for such studies. In general, the strategy of transferring molecular markers between species still poses questions and difficulties regarding the most conserved microsatellite patterns among plant species, genera and families. The objective of this study was to use bioinformatics tools to characterize microsatellites from rice and other grass species of economic importance, enabling the prediction of the microsatellite patterns that are most promising in transfer strategies. Three studies were performed. The first concerned the development and validation of a microsatellite search tool with primer design and PCR simulation. A database containing 28,469 fl-cDNA sequences from the japonica rice genome was used. From a total of 3,907 microsatellite loci, 3,329 primer pairs were designed and tested using the simulated PCR feature, showing that only 2,397 (72%) of the pairs amplified specific regions. The second study aimed to describe the occurrence of microsatellites in expressed regions from ten species of three different plant families.

The results indicated the frequency and patterns of occurrence of microsatellites within and between the different families. The third study aimed to characterize the complete occurrence of microsatellites in the rice genome. The results showed a different pattern of occurrence of microsatellites for the different chromosomes and which arrangements are most abundant. Inferences on which elements allow better genome coverage are discussed.

Keywords: Oryza sativa subsp. japonica. Simple Sequence Repeat. Microsatellites. Bioinformatics. Genomics. Plant Breeding. Grasses.

List of figures

2. SSR Locator: Tool for Simple Sequence Repeat Discovery Integrated with Primer Design and PCR Simulation

Figure 1. Flow-chart showing the functional structure of SSR Locator. (A) Perl script to search SSRs; (B) text file where information from detected SSRs is stored; (C) module for the statistical calculations for SSR motif occurrence; (D) module that formats text files into standard Primer3 input files; (E) running of Primer3; (F) module for running Virtual-PCR (using a second sequence file as a template); (G) module performing global alignment between homologous amplicons; (H) identity and alignment score calculations between homologous amplicons; and (I) file containing SSR, primer, homologous amplicons, identity, and score information.

3. Tandem Repeat distribution in gene transcripts of three plant families

Figure 1. Percentage of expressed sequences containing tandem repeat loci.

4. Distribution and patterns of microsatellite occurrence in the whole rice genome

Figure 1. Percentage occurrence of different microsatellite types (≥ 12 bp) in the chromosomes.

Figure 2. Percentage occurrence of different microsatellite types (≥ 20 bp) in the twelve chromosomes.

List of tables

2. SSR Locator: Tool for Simple Sequence Repeat Discovery Integrated with Primer Design and PCR Simulation

Table 1. Distribution of SSR/minisatellite motifs according to the number of repeats.

Table 2. Distribution of SSR/minisatellite repeats in the rice cDNA collection.

Table 3. Distribution of amplicon alignments for specific and redundant amplicons with varying identity levels.

3. Tandem Repeat distribution in gene transcripts of three plant families

Table 1. Overall distribution (amounts and percentage) of expressed sequences in translated and non-translated regions.

Table 2. Overall distribution of tandem repeat occurrences in translated and non-translated transcripts.

Table 3. Overall occurrence, in percentage, of microsatellite and minisatellite motifs in different regions of ten plant species.

Table 4. Distribution of di-, tri- and tetramer motifs, percentage occurrence per species and average occurrence per family.

Table 5. Distribution of penta- to decamer motifs, percentage occurrence per species and average occurrence per family.

4. Distribution and patterns of microsatellite occurrence in the whole rice genome

Table 1. Total amounts of microsatellite types (≥ 12 bp)* in the twelve chromosomes.

Table 2. Distributions, percentage and frequency of different microsatellite types within Classes I and II in the twelve chromosomes.

Table 3. Average locus size (bp) of different microsatellite types within Classes I and II for the twelve chromosomes.

Table 4. Average distances (Kb) between different microsatellite loci within Classes I and II in the chromosomes.

Contents

Summary
Abstract
List of figures
List of tables
Contents
1. General introduction
2. SSR Locator: Tool for Simple Sequence Repeat Discovery Integrated with Primer Design and PCR Simulation
    Abstract
    1. Introduction
    2. Material and Methods
    3. Results
    4. Conclusions
    References
3. Tandem Repeat distribution in gene transcripts of three plant families
    ABSTRACT
    INTRODUCTION
    MATERIAL AND METHODS
    RESULTS AND DISCUSSION
    CONCLUSIONS
    REFERENCES
4. Distribution and patterns of microsatellite occurrence in the whole rice genome
    ABSTRACT
    INTRODUCTION
    MATERIAL AND METHODS
    RESULTS AND DISCUSSION
    CONCLUSION
    REFERENCES
5. Final considerations
6. References for Item 1
VITAE

1. General introduction

The development of new varieties that meet the demand for greater genetic potential for yield is the main goal of every breeding program, and the success of such a program depends on a dynamic and efficient method for meeting its objectives. Plant breeding therefore requires three fundamental steps to obtain superior genotypes: the presence of genetic variability, efficient selection of the most promising genotypes, and adjustment of the best genetic constitutions to the growing environment (CARVALHO et al., 2003).

Identifying this variability has been the subject of many studies because, when the variability of a given trait is evaluated, its expression is often masked by environmental effects or by allelic or gene interactions. These facts complicate the breeder's selection work and in many cases require trials repeated over several years and at different locations in order to overcome the effect of the environment (CARVALHO et al., 2003).

Selecting an individual with the genetic potential for high yield thus becomes one of the breeder's most arduous tasks, because achieving significant progress in the trait requires replacing a large number of alleles at different loci. The difficulty lies in tracking the segregation of several alleles at several loci at the same time (CARVALHO, 1982).

More recently, biotechnology techniques such as molecular markers have been described as auxiliary strategies for overcoming these difficulties: once molecular markers associated with genes of interest have been identified, the genotypes carrying the best alleles can be identified free of environmental effects (MAIA, 2007).

Among the known molecular markers, microsatellites, or SSRs, are a particularly promising class. This class of markers is powerful in a variety of applications in plant genetics and breeding because of its reproducibility, multiallelic nature, codominant inheritance and abundance in different genomes (TEMNYKH et al., 2001; VARSHNEY et al., 2005).

First described as microsatellites by Litt and Luty (1989) and as SSRs (simple sequence repeats) by Tautz et al. (1989), these DNA sequences consist, according to Morgante and Olivieri (1993), of motifs of 1, 2, 3, 4, 5 or 6 nucleotides repeated in tandem.

Regions of repetitive DNA (microsatellites) are more prone to forming loops or the structures known as hairpins. In these stretches, DNA polymerase undergoes slippage during replication, causing the insertion or deletion of nucleotides and thereby increasing or decreasing the length of the repeat sequence (WELL et al., 1998; IYER et al., 2000).

Currently, with the accumulation of data on the expressed regions of different genomes (ESTs and cDNAs), the characterization and development of microsatellite markers derived from these regions, described as functional markers, represent a promising strategy for plant breeding, since they offer advantages over marker classes based on anonymous genomic regions (VARSHNEY et al., 2005).

In plants, although several studies have described the levels of occurrence of microsatellites associated with transcribed regions (TEMNYKH et al., 2001; MCCOUCH et al., 2002; MORGANTE et al., 2002; THIEL et al., 2003; NICOT et al., 2004; LAWON and ZHANG, 2006; VARSHNEY et al., 2006; KASHI and KING, 2006; ZHANG et al., 2006), comparative and descriptive approaches can still offer new perspectives on the characteristics of these markers. Distinct groups of plant species are frequently being sequenced, which makes it possible to re-evaluate databases enriched with new sequences representing divergent evolutionary groups and/or different genetic models.

The general objective of this work was to use bioinformatics to fully characterize the occurrence of microsatellites in the rice genome; to verify the occurrence of microsatellites in expressed regions of this genome; to identify occurrence patterns of these markers in different grass and dicotyledonous species; and to predict which occurrence patterns make the best markers for rice and for the grasses, and which patterns increase the success rate of strategies for transferring this class of markers between species.

2. SSR Locator: Tool for Simple Sequence Repeat Discovery Integrated with Primer Design and PCR Simulation

International Journal of Plant Genomics (ISSN 1687-5389)

Abstract

Microsatellites or SSRs (simple sequence repeats) are ubiquitous short

tandem duplications occurring in eukaryotic organisms. These sequences are among

the best marker technologies applied in plant genetics and breeding. The abundant

genomic, BAC, and EST sequences available in databases allow the survey

regarding presence and location of SSR loci. Additional information concerning

primer sequences is also the target of plant geneticists and breeders. In this paper,

we describe a utility that integrates SSR searches, frequency of occurrence of motifs

and arrangements, primer design, and PCR simulation against other databases. This

simulation allows the performance of global alignments and identity and homology

searches between different amplified sequences, that is, amplicons. In order to

validate the tool functions, SSR discovery searches were performed in a database

containing 28,469 nonredundant rice cDNA sequences.

1. Introduction

Microsatellites or SSRs (simple sequence repeats) are sequences in which

one or few bases are tandemly repeated for varying numbers of times [1]. Variations

in SSR regions originate mostly from errors during the replication process, frequently

DNA polymerase slippage, generating insertion or deletion of base pairs, resulting,

respectively, in larger or smaller regions [2, 3]. SSR assessments in the human

genome have shown that many diseases are caused by mutation in these sequences

[4].

SSRs can be found in different regions of genes, that is, coding sequences,

untranslated sequences (5′-UTR and 3′-UTR), and introns, where the expansions

and/or contractions can lead to gain or loss of gene function [5]. Also, there is

evidence that the genomic distribution of SSRs is related to chromatin organization,

recombination, and DNA repair. SSRs are found throughout the genome, in both

protein-coding and noncoding regions. Genome fractions as low as 0.85%

(Arabidopsis thaliana), 0.37% (Zea mays), 0.21% (Caenorhabditis elegans), 0.30%

(Saccharomyces cerevisiae) and as high as 3.0% (Homo sapiens) and 3.21% (Fugu

rubripes) have been found. Some bias for defined genomic locations has also been

reported [6, 7]. This class of markers is broadly applied in genetics and plant

breeding, due to its reproducibility, multiallelic, codominant nature, and genomic

abundance. Its use for integrating genetic maps, physical mapping, and anchoring

gives geneticists and plant breeders a pathway to link genotype and phenotype

variations [8].

The protocols for isolating SSR loci for a new species were always very labor-

intensive. Currently, with the accumulation of biological data originating from whole

genome sequence initiatives, the use of bioinformatics tools helps to maximize the

identification of these sequences and consequently, the efficiency in the number of

generated markers [9]. The first in silico studies of SSRs were developed using

FASTA [10] and BLAST [11] packages. Later, more specific algorithms, such as

SPUTNIK [12], REPEATMASKER [13], TRF (Tandem Repeats Finder) [14], TROLL

[15], MISA [16] and SSRIT (Simple Sequence Repeat Tool) [17], were developed [9].

SSR detection is generally followed by the use of another program for primer

design, to be anchored on flanking sequences. Also, in some applications, a third

step using e-PCR [18] is added, with the goal of verifying primer redundancy. The

sequential use of several programs is often called a pipeline. Building such a

pipeline can be a very difficult task for research groups not familiar with programming

tools. In the present work, a computing tool with an interface for Windows users was

developed, called SSR Locator. The application integrates the following functions: (i)

detection and characterization of SSRs and minisatellite motifs between 1 and 10

base pairs; (ii) primer design for each locus found; (iii) simulation of PCR

(polymerase chain reaction), amplifying fragments with different primer pairs from a

given set of fasta files; (iv) global alignment between amplicons generated by the

same primer pair; and (v) estimation of global alignment scores and identities

between amplicons, generating information on primer specificity and redundancy.

The described tool is publicly available at the site

http://www.ufpel.edu.br/~lmaia.faem.

2. Material and Methods

2.1. Algorithms

The algorithms used for the searches, alignment, and homology estimates are

described separately.

2.2. SSR Search

The algorithm used for perfect and imperfect micro-/minisatellite searches was

written in Perl and consists of the generation of a matrix that mixes A(adenine),

T(thymine), C(cytosine), and G(guanine) in all possible composite arrangements

between 1 and 10 nucleotides. The script instructions perform readings on fasta files,

searching all possible arrangements in each database sequence.
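As a rough illustration of the search described above, the following Python sketch enumerates candidate motifs and scans a sequence for perfect tandem runs. It is not the Perl code of SSR Locator: the function names (`is_primitive`, `all_motifs`, `find_repeats`), the use of regular expressions, and the exclusion of motifs that are themselves repeats of a shorter unit are assumptions made for this example.

```python
import re
from itertools import product


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit ('ATAT' is not primitive)."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def all_motifs(max_len=3):
    """Enumerate every primitive motif of length 1..max_len over A, C, G, T."""
    for k in range(1, max_len + 1):
        for letters in product("ACGT", repeat=k):
            motif = "".join(letters)
            if is_primitive(motif):
                yield motif


def find_repeats(seq, motif, min_repeats):
    """Return (start, end, repeat_count) for each perfect run of `motif`.

    Coordinates are 0-based and half-open; matching is exact and case-insensitive.
    """
    pattern = re.compile("(?:%s){%d,}" % (motif, min_repeats))
    return [(m.start(), m.end(), (m.end() - m.start()) // len(motif))
            for m in pattern.finditer(seq.upper())]
```

For example, `find_repeats("GGC" + "AT" * 10 + "CCG", "AT", 10)` reports a single run of ten AT units starting at position 3. The default `max_len=3` keeps the enumeration small; with motifs up to 10 nucleotides the number of arrangements grows to roughly a million.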

Several instructions in the algorithm used in SSRLocator resemble those from

MISA [16] and SSRIT [17]. However, additional instructions have been inserted in

SSRLocator's code. Instead of allowing the overlap of a few nucleotides when two

SSRs are adjacent to each other and one of them is shorter than the minimum size

for a given class as found in MISA and SSRIT, a module written in Delphi language

records the data and eliminates such overlaps.

The SSR Locator software contains windows focused on the selection and

configuration of SSR and minisatellite types (mono- to 10-mers) and a minimum

number of repeats for each of the selected types. The algorithm calls a repeat perfect

when the nearest adjacent loci, upstream or downstream, lie more than 100 bp

away.

The algorithm calls an imperfect repeat when the same motif is present on

both sides of a fragment containing up to 5 base pairs. The algorithm identifies a

composite locus when two or more adjacent loci are found at distances between 6

and 100 bp [16].
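The distance rules above can be sketched as a small classifier for two neighbouring repeat runs. This is a hypothetical helper, not part of SSR Locator; the tuple layout and the function name `relation` are invented for the example. The text does not say how two different motifs separated by 5 bp or less are treated, so the sketch labels them composite as an explicit assumption.

```python
def relation(left, right):
    """Classify two neighbouring repeat runs found on the same sequence.

    Each run is a (start, end, motif) tuple with 0-based, half-open
    coordinates, and `left` must end before `right` starts.
    """
    gap = right[0] - left[1]
    if gap < 0:
        raise ValueError("runs overlap")
    if gap > 100:
        return "perfect"      # far apart: two independent perfect loci
    if gap <= 5 and left[2] == right[2]:
        return "imperfect"    # same motif interrupted by a short fragment
    # 6-100 bp apart (from the text), or different motifs within 5 bp
    # (assumption of this sketch): one composite locus
    return "composite"
```

Under these rules, two AT runs separated by 3 bp form one imperfect locus, while an AT run followed 30 bp later by a GGC run forms a composite locus.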

In this study, only "Class I" (≥20 bp) repeats are shown. These repeats have

been described as the most efficient loci for use as molecular markers [17]. The

software SSRLocator was configured to locate SSRs of at least 20 bp:

monomers(x20), 2-mers(x10), 3-mers(x7), 4-mers(x5), 5-mers(x4), 6-mers(x4), and

minisatellites: 7-mers(x3), 8-mers(x3), 9-mers(x3), and 10-mers(x3).
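The configured thresholds can be checked with simple arithmetic: the shortest locus accepted for a motif is the motif length multiplied by its minimum repeat count. The short sketch below tabulates these values; the names `MIN_REPEATS` and `min_locus_bp` are illustrative only.

```python
# Minimum number of repeats per motif length, as configured in this study.
MIN_REPEATS = {1: 20, 2: 10, 3: 7, 4: 5, 5: 4, 6: 4, 7: 3, 8: 3, 9: 3, 10: 3}

# Shortest locus (in bp) accepted for each motif length.
min_locus_bp = {k: k * n for k, n in MIN_REPEATS.items()}

for k in sorted(min_locus_bp):
    print("%2d-mer x %2d repeats = %2d bp" % (k, MIN_REPEATS[k], min_locus_bp[k]))
```

Every class meets the 20 bp limit, with the shortest accepted loci ranging from 20 bp (monomers, 2-mers, 4-mers and 5-mers) to 30 bp (10-mers).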

In order to validate the efficiency of SSRLocator in finding SSRs and

minisatellites, the same database was analyzed with MISA and SSRIT, using the

same parameters for minimum number of repeats.

2.3. Primer Design

A routine written in Delphi calls Primer3 [19], which

performs the primer design. These results are fed to a module that performs Virtual-

PCRs and allocates individual identification, forward and reverse primer sequences,

and a sequence fragment corresponding to the region flanked by the primers (original

amplicon) to each SSR locus. A window allows the selection of Primer3 parameters,

such as range of primer and amplicon sizes, as well as optimum primer size, ranges

of melting temperature (Tm) (minimum, maximum, and optimum) and GC content

(minimum and optimum). For primer searches, the software automatically looks for

five base pair distances from both SSR (5′ and 3′) flanking sites. In this study, the

following parameters were used: amplicon size between 100 and 280 bp; minimum,

optimum, and maximum melting temperature (Tm) of 45, 50, and 55 °C, respectively; and

minimum, optimum, and maximum primer size of 15, 20, and 25 bp, respectively.
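To make these settings concrete, the sketch below writes one Primer3 input record in Boulder-IO format. It is only an illustration: the tag names follow the Primer3 2.x convention and may differ from those of the Primer3 release that SSR Locator called, the function name `primer3_record` is hypothetical, and reading the five-base-pair rule as a target region extended by 5 bp on each side of the SSR is an interpretation, not something the text spells out.

```python
def primer3_record(seq_id, template, ssr_start, ssr_end, flank=5):
    """Build one Boulder-IO record for a single SSR locus.

    ssr_start and ssr_end are 0-based, half-open coordinates of the repeat.
    The target is widened by `flank` bp on both sides (assumed reading of
    the five-base-pair rule) and clipped to the template.
    """
    target_start = max(0, ssr_start - flank)
    target_len = min(len(template), ssr_end + flank) - target_start
    tags = [
        ("SEQUENCE_ID", seq_id),
        ("SEQUENCE_TEMPLATE", template),
        ("SEQUENCE_TARGET", "%d,%d" % (target_start, target_len)),
        ("PRIMER_FIRST_BASE_INDEX", 0),          # coordinates above are 0-based
        ("PRIMER_PRODUCT_SIZE_RANGE", "100-280"),
        ("PRIMER_MIN_TM", 45),
        ("PRIMER_OPT_TM", 50),
        ("PRIMER_MAX_TM", 55),
        ("PRIMER_MIN_SIZE", 15),
        ("PRIMER_OPT_SIZE", 20),
        ("PRIMER_MAX_SIZE", 25),
    ]
    # A Boulder-IO record is a list of TAG=VALUE lines closed by a lone "=".
    return "".join("%s=%s\n" % pair for pair in tags) + "=\n"
```

Calling `primer3_record("example_cdna", template, 100, 120)` on a 300 bp template yields a record whose target is `95,30`, that is, the 20 bp repeat plus 5 bp on each side.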

2.4. Virtual-PCR

The module used to simulate a PCR reaction was written in Delphi. The

algorithm consists of reading the file generated by the previous module (SSR locus,

forward and reverse primers, and original amplicon), followed by a search of

sequences containing primer annealing sites. When annealing sites are found for the

two primers, the flanked region and the primer sequences are copied to a new

variable called "paralog amplicon."
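A minimal sketch of this kind of PCR simulation is given below. It is not the Delphi module itself: it assumes exact, mismatch-free annealing, scans only the given strand of the template, pairs each forward site with the nearest downstream reverse site, and uses an arbitrary `max_len` cut-off. All of these choices, like the function names, are assumptions of the example.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def virtual_pcr(template, forward, reverse, max_len=1000):
    """Return the amplicons (primers included) that a primer pair would produce.

    The forward primer must match the template exactly, and the reverse
    primer must match the opposite strand downstream of it.
    """
    template = template.upper()
    fwd = forward.upper()
    rev_site = reverse_complement(reverse.upper())
    amplicons = []
    start = template.find(fwd)
    while start != -1:
        # nearest reverse-primer site located after the forward primer
        end = template.find(rev_site, start + len(fwd))
        if end != -1 and end + len(rev_site) - start <= max_len:
            amplicons.append(template[start:end + len(rev_site)])
        start = template.find(fwd, start + 1)
    return amplicons
```

When a primer pair returns more than one amplicon across a sequence collection, the extra products correspond to the redundant (paralog) amplicons discussed in the text.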

2.5. Global Alignment

For the global alignment between paralog and original amplicon sequences

and score calculations (match, mismatch, gaps), a routine was written in Delphi

language using the algorithms of Needleman and Wunsch (1970) [20] and Smith and

Waterman (1981) [21]. Also, in the same module, amplicon identities were calculated

according to Waterman (1994) [22] and Vingron and Waterman (1994) [23].
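For readers unfamiliar with global alignment, the following sketch implements a basic Needleman-Wunsch scoring matrix with a traceback that counts matching columns. The scoring values (match +1, mismatch -1, linear gap -1) and the definition of identity as matches divided by alignment length are assumptions of this example; the text does not state the values used in SSR Locator.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, identity).

    Identity is the fraction of alignment columns holding identical bases
    (an assumed definition for this sketch).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting columns and matches.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1          # gap in b
        else:
            j -= 1          # gap in a
        columns += 1
    return score[n][m], (matches / columns if columns else 1.0)
```

With these settings, aligning `ACGT` against `AGT` gives a score of 2 and an identity of 0.75 (three matching columns out of four).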

2.6. Implementation

The strategy of creating a two-language hybrid program was motivated by (i) the higher speed achieved by handling large text files with Perl as compared to Delphi, and (ii) the better suitability of Perl for generating the combinatorial strings to be located. The Perl module was converted into an executable file, making it unnecessary to install Perl libraries during program installation. The graphical interface, which integrates input and output windows with the Windows operating system, was built using the Turbo Delphi suite, where a menu system executes calls for each of the previously described modules.

2.7. Sequences for Analysis

A total of 28,469 nonredundant full-length cDNA sequences from rice (Oryza sativa ssp. japonica cv. Nipponbare), sequenced by The Rice Full-Length cDNA Consortium and mapped on the databases derived from the sequencing of the japonica (japonica draft genome, BAC/PAC clones, IRGSP) and indica (indica draft genome) subspecies [24], were used for the analyses. These sequences are deposited in NCBI as two groups, the first comprising accessions AK058203 to AK074028 and the second comprising accessions AK098843 to AK111488. All these sequences can also be found in KOME (Knowledge-based Oryza Molecular Biological Encyclopedia).

A flow chart representing the different steps performed by the software is

shown in Figure 1.

3. Results

3.1. Program Validation

A total of 3,907 micro- and minisatellites were detected by SSRLocator in the

28,469 analyzed cDNA sequences. The same database searched with MISA and

SSRIT presented 3,913 and 3,917 loci, respectively. The mono-, 4-mer, 6-mer, 7-

mer, 8-mer, 9-mer, and 10-mer repeats were identical for the three programs. In the

case of 2-mer repeats, 594 elements were detected by SSRLocator and 596

elements were detected by MISA and SSRIT. 3-mer repeats were scored differently

by SSRLocator (1,990 loci) and the other two algorithms (1,994 loci). For 5-mer repeats,

SSRLocator and MISA found the same number of repeats (426), while SSRIT (430)

found a different value.

3.2. Overall Distribution of SSR Types

The results obtained with SSRLocator indicate that out of 28,469 cDNA

sequences, 3,765 (13.22%) presented one or more micro-/minisatellite loci. In other

studies, microsatellites were found in the following proportions in ESTs: 3% in

Arabidopsis [25], 4% in Rosaceae [26], 8.11% in barley [16], 2.9% in sugarcane [27],

and values ranging between 6-11% [28] and 1.5-4.7% [29] for cereals in general

(maize, barley, rye, sorghum, rice, and wheat).

Of the 3,765 fl-cDNA sequences containing repeats, 3,632 (96.47%) carried only a single micro-/minisatellite locus; these single loci account for 92.96% of all loci detected. In 125 sequences, two loci were detected, in seven sequences three loci, and only one sequence had four loci, adding up to 3,907 occurrences. Among the types analyzed, SSRs (mono- to 6-mer repeats) and minisatellites (7- to 10-mer repeats) comprised 96.98% and 3.02% of detected loci, respectively.

The occurrences detected by SSRLocator consisted of 138 monomers, 594 2-mers, 1990 3-mers, 251 4-mers, 426 5-mers, 390 6-mers, 82 7-mers, 6 8-mers, 25 9-mers, and 5 10-mers, corresponding to rates of 3.53%, 15.20%, 50.93%, 6.42%, 10.90%, 9.98%, 2.10%, 0.15%, 0.64%, and 0.13%, respectively (see Table 1).
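These rates follow directly from the class counts. As a minimal sketch (the variable names are illustrative and not taken from SSRLocator), the shares per class, and the split between microsatellites (mono- to 6-mer) and minisatellites (7- to 10-mer), can be recomputed as follows:

```python
# Locus counts per repeat class, as reported in the text and Table 1.
counts = {
    "mono": 138, "2-mer": 594, "3-mer": 1990, "4-mer": 251, "5-mer": 426,
    "6-mer": 390, "7-mer": 82, "8-mer": 6, "9-mer": 25, "10-mer": 5,
}
total = sum(counts.values())

# Percentage of all detected loci contributed by each class.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

# Microsatellites are the mono- to 6-mer classes; minisatellites are 7- to 10-mers.
mini_classes = ("7-mer", "8-mer", "9-mer", "10-mer")
mini_total = sum(counts[name] for name in mini_classes)
ssr_share = round(100 * (total - mini_total) / total, 2)
mini_share = round(100 * mini_total / total, 2)

print(total, shares)
print(ssr_share, mini_share)
```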

For SSRs in barley, maize, wheat, sorghum, rye, and rice, average percentage values have been reported as between 17 and 40% for 2-mer, 54–78% for 3-mer, 2.6–6.6% for 4-mer, 0.4–1.3% for 5-mer, and less than 1% for 6-mer repeats [28], and as 26.5% for 2-mer, 65.4% for 3-mer, 6.8% for 4-mer, 0.77% for 5-mer, and 0.45% for 6-mer repeats [30]. In nonredundant transcripts from the TIGR database, 15.6% 2-mer, 61.6% 3-mer, 8.5% 4-mer, and 14.4% 5-mer repeats were found in rice [31]. The frequency of micro-/minisatellite loci per million nucleotides (loci/Mb) [6] in this study was 2.94, 12.64, 42.34, 5.34, 9.06, 8.30, 1.74, 0.13, 0.53, and 0.11 for mono- to 10-mer repeats, respectively. An overall frequency of 83.13 loci/Mb was found (see Table 1). In other studies, different rates were described in analyses of EST databases, such as 133 loci/Mb (barley), 161 loci/Mb (wheat, sorghum and rye), and 256 loci/Mb for rice [28]. Also, for nonredundant ESTs in rice, sorghum, barley, wheat, and Arabidopsis, frequencies of 277, 169, 112, 94 and 133 loci/Mb were found, respectively [30]. Frequencies closer to those found in this study were described for CDS regions of Rosaceae species, with averages of 40.9–78 loci/Mb for rose, almond and peach, while 39 loci/Mb were found for Arabidopsis [26].
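The loci/Mb figures are simple densities: the number of loci divided by the size of the searched database in megabases. The sketch below illustrates the calculation; the database size of 47.0 Mb is an assumption inferred back from the reported values (3907 loci at 83.13 loci/Mb) and is not stated in the text, and the function name is hypothetical.

```python
def loci_per_mb(loci, total_bp):
    """Density of loci per million nucleotides."""
    return loci / (total_bp / 1_000_000)

# ASSUMPTION: total size of the searched fl-cDNA set, inferred from the
# reported overall density rather than taken from the source.
ASSUMED_TOTAL_BP = 47_000_000

counts = {
    "mono": 138, "2-mer": 594, "3-mer": 1990, "4-mer": 251, "5-mer": 426,
    "6-mer": 390, "7-mer": 82, "8-mer": 6, "9-mer": 25, "10-mer": 5,
}
densities = {name: round(loci_per_mb(n, ASSUMED_TOTAL_BP), 2)
             for name, n in counts.items()}
overall = round(loci_per_mb(sum(counts.values()), ASSUMED_TOTAL_BP), 2)
print(densities, overall)
```

Because the density depends on the denominator, comparisons between studies are only meaningful when the database type (redundant ESTs, nonredundant ESTs, fl-cDNAs) and the minimum repeat thresholds are comparable.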

3.3. Occurrence Patterns for Different SSR and Minisatellite Types and Motifs

Monomers, 2-Mers, 3-Mers, and 4-Mers

In Table 2, the contents and percentage values for different micro-/minisatellite motifs are shown. For monomer, 2-mer and 3-mer repeats, all possible arrangements are shown, while for 4-mer to 10-mer repeats, only the ten most frequent motifs are shown.
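The motif labels in Table 2 (for example AG/CT, CCG/CGG, or GATC alone) appear to group each motif with its reverse complement, so that the same locus read from either strand falls into one row. The following Python sketch illustrates that grouping; it is an interpretation of the table layout rather than SSRLocator code, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(motif):
    """Return the motif as read on the opposite strand (5' to 3')."""
    return motif.translate(COMPLEMENT)[::-1]

def strand_pair(motif):
    """Label a motif together with its reverse complement.

    Self-complementary motifs such as AT or GATC form a single-member group.
    The two members are sorted alphabetically, which is a choice made here
    and not necessarily the ordering used in Table 2.
    """
    partner = reverse_complement(motif)
    if partner == motif:
        return motif
    return "/".join(sorted((motif, partner)))

for motif in ("CT", "TC", "CGG", "GATC", "TAAT"):
    print(motif, "->", strand_pair(motif))
```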

The A/T monomer repeats were found in 125 loci, with 111 (88.80%) and 14 (11.20%) loci formed by A and T nucleotides, respectively. The C/G motifs were found in 13 loci, with ten (76.92%) and three (23.08%) loci formed by C and G, respectively. A/T-containing SSRs were predominant and comprised 90.58% of monomer loci. In the overall distribution, the monomers represent 3.53% of the 3907 detected loci. Motifs AG/CT and GA/TC were the most frequent 2-mers, adding up to 84.52% of 2-mer SSRs and representing 6.89% and 5.96% of all 3907 detected occurrences, respectively. The motifs CT, GA, and TC were the most abundant, comprising 172, 143, and 90 loci, respectively. In maize, barley, rice, sorghum, and wheat ESTs, the motif AG was described as the most frequent [6, 16, 28, 29, 31, 32]. However, in some studies, the most frequent motif was GA [30, 33]. Repeats composed of guanine and cytosine were the most abundant among trimers, with occurrences of 18.44%, 17.89%, and 10.60%, respectively, for the motifs CCG/CGG, CGC/GCG, and GCC/GGC, adding up to 23.9% of the overall frequencies of micro-/minisatellites in the analysis.

The motifs CGC, CCG, and CGG were the most frequent, comprising 218, 197, and 170 loci, respectively. Many reports indicate the 3-mer CCG as the most frequent in maize, barley, wheat, sorghum and rye [6, 16, 28, 32], sugarcane [27] and rice [29, 31]. Among 4-mers, 100 different arrangements were found, of which the motifs GATC (7.17%), ATTA/TAAT (6.77%), and ATCG/CGAT (5.98%) were the most frequent. These motifs add up to 19.92% of the 4-mer repeats found and represent 1.28% of the overall content of micro-/minisatellites.

In barley ESTs, ACGT was reported as the most abundant motif [16, 28]. For other species, AAAG/CTTT and AAGG/CCTT in Lolium perenne [34], AAAG/CTTT and AAAC/GTTT in Arabidopsis UTRs [6, 35], and AAAT and AAAG in citrus [36, 37] were described as the most abundant.

3.4. Remaining Repeats

Among 5-mers, 188 different arrangements were detected and the most frequent were CTCCT, CTCTC, and CCTCC with 17, 17, and 12 occurrences, respectively. In the analysis of CDS regions, the ACCCG motif was the most frequent in Arabidopsis, AAAAG in S. cerevisiae and C. elegans, and AAAAC in different primates [38]. Also, the motifs AAAAT, AAAAC, and AAAAG were described as the most frequent in eukaryotes [39].

In rice, the motifs AGAGG and AGGGG were the most abundant [31]. Repeats of type 6-mer were detected in 230 different arrangements, of which CGCCTC and TCGCCG were the most frequent, occurring in 12 and 10 loci, respectively. Other studies have shown higher frequencies for the motifs AAGATG and AAAAAT in Arabidopsis [35], AAAAAG in citrus [36], AACACG in S. cerevisiae, ACCAGG in C. elegans and CCCCGG in primates [38]. For all remaining repeats (minisatellites), the occurrences are widely distributed with low percentage values for each arrangement. For 7-mer, 8-mer, 9-mer, and 10-mer repeats, the totals of occurrences were 57, 5, 23, and 5, respectively.

3.5. Primer Design and PCR Simulation

The design of primers for the 3907 detected micro-/minisatellites resulted in 3329 primer pairs, covering 85.20% of loci. Running the “Virtual PCR” generated a total of 4610 amplicons. A module in SSRLocator checks for primer redundancy. A total of 2397 primer pairs amplified only the fragment from their original locus (specific amplicons) and 932 pairs amplified one or more regions besides the original locus. Of these, 692 pairs amplified two fragments, one from the original site and a second from another region (paralogous). In this case, 692 specific amplicons plus 692 redundant amplicons were detected. A total of 143, 90, 2, and 5 primer pairs generated three (two redundancies), four (three redundancies), five (four redundancies), and six (five redundancies) fragments, respectively. In total, the 932 primer pairs with more than one anchoring region yielded 932 specific amplicons and 1281 redundant amplicons, adding up to 2213 fragments.
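The amplicon totals in this paragraph can be reproduced from the number of primer pairs in each redundancy class. The following Python sketch is only a bookkeeping check with illustrative variable names; each pair contributes one specific amplicon (its original locus) and one redundant amplicon for every additional fragment.

```python
# Number of primer pairs, keyed by how many fragments each pair amplified.
pairs_by_fragments = {1: 2397, 2: 692, 3: 143, 4: 90, 5: 2, 6: 5}

total_pairs = sum(pairs_by_fragments.values())
redundant_pairs = sum(n for k, n in pairs_by_fragments.items() if k > 1)

# Every fragment beyond the original locus counts as a redundant amplicon.
redundant_amplicons = sum((k - 1) * n for k, n in pairs_by_fragments.items())
total_amplicons = sum(k * n for k, n in pairs_by_fragments.items())

# Fragments produced by the redundant pairs alone (specific + redundant).
fragments_from_redundant_pairs = redundant_pairs + redundant_amplicons

print(total_pairs, redundant_pairs, redundant_amplicons,
      fragments_from_redundant_pairs, total_amplicons)
```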

To investigate the ability of these primers to amplify genomic sequences, an extra experiment was performed against the whole rice genomic sequence available at NCBI. The different groups of redundant and nonredundant primer sets, that is, those amplifying one, two, three, or more times in the cDNA database, were tested against the genomic sequence.

Of the 2397 nonredundant primers, only 924 amplified a locus in the genomic sequence. This difference was expected because of difficulties in amplifying genomic regions: if a primer anneals to a boundary region between two exons in the cDNA, the presence of an intron makes this annealing site unavailable in the genome. It is interesting to note that of the 924 primer sets that produced an amplicon, 914 (99%) amplified only one locus in the genomic sequence, in agreement with the cDNA results.

When the primer sets that amplified two different cDNAs were run against the genomic sequence, only 294/692 (42.5%) amplified, with 14.5% able to amplify two different loci. Only one primer set amplified more than two loci. These results indicate that SSRLocator performance was consistent between the two databases regarding the nonredundant loci; that is, the loci that could be amplified in both databases kept their nonredundant status. The changes observed for the redundant loci can be attributed to many causes, including redundancy in the cDNA database, but also to biological reasons related to primer positioning.
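The principle of the in silico PCR used to classify primer pairs as redundant or nonredundant can be illustrated with a minimal Python sketch. This is not the SSRLocator implementation: it assumes exact primer matches, searches only one template orientation, and uses hypothetical function names, toy sequences, and an arbitrary maximum product size.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def find_all(text, pattern):
    """Yield every start index of pattern in text (overlapping hits included)."""
    start = text.find(pattern)
    while start != -1:
        yield start
        start = text.find(pattern, start + 1)

def virtual_pcr(forward, reverse, templates, max_size=1000):
    """Return (template_id, start, end) for each predicted amplicon.

    Simplifications (assumptions of this sketch): primers must match exactly,
    and only the orientation with the forward primer on the given strand is
    searched.
    """
    reverse_site = reverse_complement(reverse)
    hits = []
    for name, seq in templates.items():
        for f_start in find_all(seq, forward):
            for r_start in find_all(seq, reverse_site):
                end = r_start + len(reverse_site)
                if r_start >= f_start + len(forward) and end - f_start <= max_size:
                    hits.append((name, f_start, end))
    return hits

# Toy templates: two cDNAs share the primer sites but differ in repeat number.
templates = {
    "cDNA_1": "TT" + "ACGGATCC" + "AG" * 10 + "GGTACCTT" + "GA",
    "cDNA_2": "C" + "ACGGATCC" + "AG" * 6 + "GGTACCTT",
    "cDNA_3": "ACGT" * 10,
}
hits = virtual_pcr("ACGGATCC", "AAGGTACC", templates)
status = "nonredundant" if len(hits) == 1 else "redundant"
print(hits, status)
```

In this toy case the primer pair anchors in two templates and would therefore be reported as redundant; a pair with a single hit would be kept as a specific (nonredundant) marker candidate.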

3.6. Identity between Specific and Redundant Amplicons

Results of global alignment between amplicons from original and redundant sites are shown in Table 3. Among the 1281 redundant amplifications, 787 (61.44%) resulted in a perfect alignment between both loci (identity equal to 100%). For redundant amplicons with identity levels of 96–99% and 90–95%, 452 (35.28%) and 8 (0.62%) loci were found, respectively. Alignments with identity levels below 90% were found in only 2.65% of cases. The fact that such a high percentage of redundant loci show high identity is probably a consequence of the genome fraction chosen, that is, expressed sequences. This fraction is under tight selection pressure and should not accumulate variations such as substitutions or indels at a high rate. As expected, comparisons to the whole genome generated a great deal of polymorphism, due to the inclusion of intronic regions in the alignments (data not shown).
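Identity values of this kind come from a global (Needleman-Wunsch) alignment of the two amplicons [20]. The sketch below shows one way to compute such an identity in Python; the scoring values (match 1, mismatch -1, linear gap -2) and the definition of identity as matches over alignment length are assumptions of this example and may differ from the settings used in SSRLocator.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[-1][-1]

def percent_identity(aligned_a, aligned_b):
    """Identical aligned positions as a percentage of the alignment length."""
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)

specific = "ACGTACGT"   # toy amplicon from the original locus
redundant = "ACGTTCGT"  # toy amplicon from a second locus
aln_a, aln_b, aln_score = global_alignment(specific, redundant)
print(aln_a, aln_b, aln_score, percent_identity(aln_a, aln_b))
```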

4. Conclusions

The software SSRLocator was successfully implemented, adding steps for (1) SSR discovery, (2) primer design, and (3) PCR simulation between the primers obtained from original sequences and other fasta files. Also, the software produces reports for frequency of occurrence, nucleotide arrangement, primer lists with all standard information needed for PCR and global alignments. From the PCR simulation, it was possible to point out which primer pairs were nonredundant, suggesting that these primers are more appropriate for mapping purposes. In this case, however, wet lab experiments should be performed to confirm the advantage of nonredundant over redundant primers for mapping.

It is possible that the results for micro-/minisatellite frequencies (loci/Mb)

obtained in this study diverge from the results found in the literature. This can be

explained by the different databases used (redundant ESTs, nonredundant ESTs

and/or fl-cDNA), different algorithm configurations and minimum requirements set for

counting motifs. Another explanation for some contrasting results is the fact that only

“Class I” repeats were analyzed in our study.

The results showed that 932 (27.99%) primers presented amplifications in more than one gene sequence. This could be mostly due to the fact that primer pairs derived from a specific gene (cDNA) anchored in similar sites in other duplicated genes, since 5,607/28,469 (19.70%) genes were described as paralogs in the annotation of the database used [24]. Gene duplication, along with polyploidy and transposon amplification, is among the major driving forces in genome evolution [40].

It is therefore not surprising that so many loci show redundancy. A second possibility is that some primers were generated from protein domain regions within the analyzed cDNAs. These domains could be found in protein families with many genome copies, resulting in the observed redundancies. A validation of the redundancies found in the cDNA results was obtained through a virtual PCR against the whole rice genome sequence. Of the nonredundant primers that generated an amplicon, ca. 99% remained nonredundant.

Finally, this tool can be used successfully for data mining strategies to find

SSR primers in genomic or expressed sequences (ESTs/cDNAs). Also, this software

can be a tool for microsatellite discovery in databanks of related species, anchoring

primers in ortholog or paralog regions contained between databases from two

different species.

References

1. M. Morgante, M. Hanafey, and W. Powell, “Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes,” Nature Genetics, vol. 30, no. 2, pp. 194–200, 2002.

2. R. R. Iyer, A. Pluciennik, W. A. Rosche, R. R. Sinden, and R. D. Wells, “DNA polymerase III proofreading mutants enhance the expansion and deletion of triplet repeat sequences in Escherichia coli,” Journal of Biological Chemistry, vol. 275, no. 3, pp. 2174–2184, 2000.

3. H. Ellegren, “Microsatellites: simple sequences with complex evolution,” Nature Reviews Genetics, vol. 5, no. 6, pp. 435–445, 2004.

4. S. M. Mirkin, “DNA structures, repeat expansions and human hereditary disorders,” Current Opinion in Structural Biology, vol. 16, no. 3, pp. 351–358, 2006.

5. B. Li, Q. Xia, C. Lu, Z. Zhou, and Z. Xiang, “Analysis on frequency and density of microsatellites in coding sequences of several eukaryotic genomes,” Genomics Proteomics & Bioinformatics, vol. 2, no. 1, pp. 24–31, 2004.

6. M. Morgante, M. Hanafey, and W. Powell, “Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes,” Nature Genetics, vol. 30, no. 2, pp. 194–200, 2002.

7. S. Subramanian, R. K. Mishra, and L. Singh, “Genome-wide analysis of microsatellite repeats in humans: their abundance and density in specific genomic regions,” Genome Biology, vol. 4, no. 2, p. R13, 2003.

8. R. K. Varshney, A. Graner, and M. E. Sorrells, “Genic microsatellite markers in plants: features and applications,” Trends in Biotechnology, vol. 23, no. 1, pp. 48–55, 2005.

9. M. Bilgen, M. Karaca, A. N. Onus, and A. G. Ince, “A software program combining sequence motif searches with keywords for finding repeats containing DNA sequences,” Bioinformatics, vol. 20, no. 18, pp. 3379–3386, 2004.

10. W. R. Pearson and D. J. Lipman, “Improved tools for biological sequence comparison,” Proceedings of the National Academy of Sciences of the United States of America, vol. 85, no. 8, pp. 2444–2448, 1988.

11. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.

12. C. Abajian, SPUTNIK, 1994, http://www.abajian.com/sputnik.

13. A. F. A. Smit, R. Hubley, and P. Green, RepeatMasker Open-3.0, 1996, http://www.repeatmasker.org.

14. G. Benson, “Tandem repeats finder: a program to analyze DNA sequences,” Nucleic Acids Research, vol. 27, no. 2, pp. 573–580, 1999.

15. A. T. Castelo, W. Martins, and G. R. Gao, “TROLL–tandem repeat occurrence locator,” Bioinformatics, vol. 18, no. 4, pp. 634–636, 2002.

16. T. Thiel, W. Michalek, R. K. Varshney, and A. Graner, “Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.),” Theoretical and Applied Genetics, vol. 106, no. 3, pp. 411–422, 2003.

17. S. Temnykh, G. DeClerck, A. Lukashova, L. Lipovich, S. Cartinhour, and S. McCouch, “Computational and experimental analysis of microsatellites in rice (Oryza sativa L.): frequency, length variation, transposon associations, and genetic marker potential,” Genome Research, vol. 11, no. 8, pp. 1441–1452, 2001.

18. G. D. Schuler, “Sequence mapping by electronic PCR,” Genome Research, vol. 7, no. 5, pp. 541–550, 1997.

19. S. Rozen and H. Skaletsky, “Primer3 on the WWW for general users and for biologist programmers,” Methods in Molecular Biology, vol. 132, part 3, pp. 365–386, 2000.

20. S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970.

21. T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981.

22. M. Waterman, “Estimating statistical significance of sequence alignments,” Philosophical Transactions of the Royal Society of London. Series B, vol. 344, no. 1310, pp. 383–390, 1994.

23. M. Vingron and M. S. Waterman, “Sequence alignment and penalty choice. Review of concepts, case studies and implications,” Journal of Molecular Biology, vol. 235, no. 1, pp. 1–12, 1994.

24. S. Kikuchi, K. Satoh, T. Nagata, et al., “Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice: the rice full-length cDNA consortium,” Science, vol. 301, no. 5631, pp. 376–379, 2003.

25. L. Cardle, L. Ramsay, D. Milbourne, M. Macaulay, D. Marshall, and R. Waugh, “Computational and experimental characterization of physically clustered simple sequence repeats in plants,” Genetics, vol. 156, no. 2, pp. 847–854, 2000.

26. S. Jung, A. Abbott, C. Jesudurai, J. Tomkins, and D. Main, “Frequency, type, distribution and annotation of simple sequence repeats in Rosaceae ESTs,” Functional & Integrative Genomics, vol. 5, no. 3, pp. 136–143, 2005.

27. G. M. Cordeiro, R. Casu, C. L. McIntyre, J. M. Manners, and R. J. Henry, “Microsatellite markers from sugarcane (Saccharum spp.) ESTs cross transferable to erianthus and sorghum,” Plant Science, vol. 160, no. 6, pp. 1115–1123, 2001.

28. R. K. Varshney, T. Thiel, N. Stein, P. Langridge, and A. Graner, “In silico analysis on frequency and distribution of microsatellites in ESTs of some cereal species,” Cellular & Molecular Biology Letters, vol. 7, no. 2A, pp. 537–546, 2002.

29. R. V. Kantety, M. La Rota, D. E. Matthews, and M. E. Sorrells, “Data mining for simple sequence repeats in expressed sequence tags from barley, maize, rice, sorghum and wheat,” Plant Molecular Biology, vol. 48, no. 5-6, pp. 501–510, 2002.

30. S. K. Parida, K. Anand Raj Kumar, V. Dalal, N. K. Singh, and T. Mohapatra, “Unigene derived microsatellite markers for the cereal genomes,” Theoretical and Applied Genetics, vol. 112, no. 5, pp. 808–817, 2006.

31. M. La Rota, R. V. Kantety, J.-K. Yu, and M. E. Sorrells, “Nonrandom distribution and frequencies of genomic and EST-derived microsatellite markers in rice, wheat, and barley,” BMC Genomics, vol. 6, article 23, 2005.

32. J.-K. Yu, T. M. Dake, S. Singh, et al., “Development and mapping of EST-derived simple sequence repeat markers for hexaploid wheat,” Genome, vol. 47, no. 5, pp. 805–818, 2004.

33. N. Nicot, V. Chiquet, B. Gandon, et al., “Study of simple sequence repeat (SSR) markers from wheat expressed sequence tags (ESTs),” Theoretical and Applied Genetics, vol. 109, no. 4, pp. 800–805, 2004.

34. T. Asp, U. K. Frei, T. Didion, K. K. Nielsen, and T. Lübberstedt, “Frequency, type, and distribution of EST-SSRs from three genotypes of Lolium perenne, and their conservation across orthologous sequences of Festuca arundinacea, Brachypodium distachyon, and Oryza sativa,” BMC Plant Biology, vol. 7, article 36, 2007.

35. L. Zhang, D. Yuan, S. Yu, et al., “Preference of simple sequence repeats in coding and non-coding regions of Arabidopsis thaliana,” Bioinformatics, vol. 20, no. 7, pp. 1081–1086, 2004.

36. D. Jiang, G.-Y. Zhong, and Q.-B. Hong, “Analysis of microsatellites in citrus unigenes,” Acta Genetica Sinica, vol. 33, no. 4, pp. 345–353, 2006.

37. D. A. Palmieri, V. M. Novelli, M. Bastianel, et al., “Frequency and distribution of microsatellites from ESTs of citrus,” Genetics and Molecular Biology, vol. 30, no. 3, supplement, pp. 1009–1018, 2007.

38. G. Tóth, Z. Gáspári, and J. Jurka, “Microsatellites in different eukaryotic genomes: surveys and analysis,” Genome Research, vol. 10, no. 7, pp. 967–981, 2000.

39. Y.-C. Li, A. B. Korol, T. Fahima, and E. Nevo, “Microsatellites within genes: structure, function, and evolution,” Molecular Biology and Evolution, vol. 21, no. 6, pp. 991–1007, 2004.

40. E. A. Kellogg and J. L. Bennetzen, “The evolution of nuclear genome structure in seed plants,” American Journal of Botany, vol. 91, no. 10, pp. 1709–1725, 2004.

Figure 1. Flow-chart showing the functional structure of SSR Locator. (A) Perl script to search SSRs; (B) text file where information

from detected SSRs is stored; (C) module for the statistical calculations for SSR motif occurrence; (D) module that formats text files

into standard Primer3 input files; (E) running of Primer3; (F) module for running Virtual-PCR (using a second sequence file as a

template); (G) module performing global alignment between homologous amplicons; (H) identity and alignment score calculations

between homologous amplicons; and (I) file containing SSR, primer, homologous amplicons, identity, and score information.

Table 1: Distribution of SSR/minisatellite motifs according to the number of repeats.

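Each cell gives the number of loci followed by the percentage within the repeat class in parentheses.

Repeats | Mono- (%) | 2-mer (%) | 3-mer (%) | 4-mer (%) | 5-mer (%) | 6-mer (%) | 7-mer (%) | 8-mer (%) | 9-mer (%) | 10-mer (%) | Total (%)
3 | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 78 (95.12) | 6 (100) | 24 (96) | 5 (100) | 113 (2.89)
4 | 0 (-) | 0 (-) | 0 (-) | 0 (-) | 348 (81.69) | 323 (82.82) | 4 (4.88) | 0 (0) | 1 (4) | 0 (0) | 676 (17.30)
5 | 0 (-) | 0 (-) | 0 (-) | 181 (72.11) | 69 (16.20) | 45 (11.54) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 295 (7.55)
6 | 0 (-) | 0 (-) | 0 (-) | 41 (16.33) | 7 (1.64) | 13 (3.33) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 61 (1.56)
7 | 0 (-) | 0 (-) | 1220 (61.31) | 9 (3.59) | 0 (0) | 5 (1.28) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 1234 (31.58)
8 | 0 (-) | 0 (-) | 441 (22.16) | 9 (3.59) | 1 (0.23) | 1 (0.26) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 452 (11.57)
9 | 0 (-) | 0 (-) | 173 (8.69) | 4 (1.59) | 0 (0) | 1 (0.26) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 178 (4.56)
10 | 0 (-) | 125 (21.04) | 68 (3.42) | 1 (0.40) | 0 (0) | 2 (0.51) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 196 (5.02)
11 | 0 (-) | 82 (13.80) | 32 (1.61) | 3 (1.20) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 117 (2.99)
12 | 0 (-) | 76 (12.79) | 18 (0.90) | 1 (0.40) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 95 (2.43)
13 | 0 (-) | 71 (11.95) | 5 (0.25) | 1 (0.40) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 77 (1.97)
14 | 0 (-) | 39 (6.57) | 2 (0.10) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 41 (1.05)
15 | 0 (-) | 44 (7.41) | 5 (0.25) | 0 (0) | 1 (0.23) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 50 (1.28)
16 | 0 (-) | 30 (5.05) | 2 (0.10) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 32 (0.82)
17 | 0 (-) | 33 (5.56) | 1 (0.05) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 34 (0.87)
18 | 0 (-) | 15 (2.53) | 3 (0.15) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 18 (0.46)
19 | 0 (-) | 17 (2.86) | 1 (0.05) | 1 (0.40) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 19 (0.49)
20 | 21 (15.22) | 14 (2.36) | 2 (0.10) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 37 (0.95)
21 | 19 (13.77) | 8 (1.35) | 2 (0.10) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 29 (0.74)
22 | 15 (10.87) | 6 (1.01) | 3 (0.15) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 24 (0.61)
23 | 8 (5.80) | 7 (1.18) | 3 (0.15) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 18 (0.46)
24 | 3 (2.17) | 5 (0.84) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 8 (0.20)
25 | 9 (6.52) | 5 (0.84) | 1 (0.05) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 15 (0.38)
26 | 5 (3.62) | 4 (0.67) | 2 (0.10) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 11 (0.28)
27 | 3 (2.17) | 1 (0.17) | 1 (0.05) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 5 (0.13)
28 | 1 (0.72) | 3 (0.51) | 3 (0.15) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 7 (0.18)
29 | 4 (2.90) | 0 (0) | 1 (0.05) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 5 (0.13)
30 | 2 (1.45) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 2 (0.05)
31 | 9 (6.52) | 2 (0.34) | 1 (0.05) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 12 (0.31)
32 | 3 (2.17) | 3 (0.51) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 6 (0.15)
33 | 3 (2.17) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 3 (0.08)
34 | 1 (0.72) | 1 (0.17) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 2 (0.05)
35 | 6 (4.35) | 1 (0.17) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 7 (0.18)
36 | 1 (0.72) | 1 (0.17) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 2 (0.05)
37 | 1 (0.72) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 1 (0.03)
38 | 4 (2.90) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 4 (0.10)
39 | 0 (0) | 1 (0.17) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 1 (0.03)
40 | 1 (0.72) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 1 (0.03)
41 | 1 (0.72) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 1 (0.03)
42 | 2 (1.45) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 2 (0.05)
43 | 2 (1.45) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 2 (0.05)
44 | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0.00)
≥45 | 14 (10.14) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 0 (0) | 14 (0.36)
Total | 138 | 594 | 1990 | 251 | 426 | 390 | 82 | 6 | 25 | 5 | 3907
(%) | 3.53 | 15.20 | 50.93 | 6.42 | 10.90 | 9.98 | 2.10 | 0.15 | 0.64 | 0.13 | 100.00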

Table 2: Distribution of SSR/minisatellite repeats in the rice cDNA collection.

Type | Motif | Occur.(1) | (%)(1) | Occur.(2) | (%)(2) | Total | (%) Group | (%) Overall
Mono- | A/T | 111 | 88.80 | 14 | 11.20 | 125 | 90.58 | 3.20
 | C/G | 10 | 76.92 | 3 | 23.08 | 13 | 9.42 | 0.33
2-mer | AG/CT | 97 | 36.06 | 172 | 63.94 | 269 | 45.29 | 6.89
 | GA/TC | 143 | 61.37 | 90 | 38.63 | 233 | 39.23 | 5.96
 | CA/TG | 10 | 35.71 | 18 | 64.29 | 28 | 4.71 | 0.72
 | AT | 24 | 100.00 | - | - | 24 | 4.04 | 0.61
 | AC/GT | 6 | 31.58 | 13 | 68.42 | 19 | 3.20 | 0.49
 | TA | 19 | 100.00 | - | - | 19 | 3.20 | 0.49
 | CG | 2 | 100.00 | - | - | 2 | 0.34 | 0.05
3-mer | CCG/CGG | 197 | 53.68 | 170 | 46.32 | 367 | 18.44 | 9.39
 | CGC/GCG | 218 | 61.24 | 138 | 38.76 | 356 | 17.89 | 9.11
 | GCC/GGC | 112 | 53.08 | 99 | 46.92 | 211 | 10.60 | 5.40
 | CTC/GAG | 73 | 42.69 | 98 | 57.31 | 171 | 8.59 | 4.38
 | AGG/CCT | 34 | 30.91 | 76 | 69.09 | 110 | 5.53 | 2.82
 | GGA/TCC | 60 | 62.50 | 36 | 37.50 | 96 | 4.82 | 2.46
 | CAG/CTG | 58 | 76.32 | 18 | 23.68 | 76 | 3.82 | 1.95
 | AAG/CTT | 34 | 50.75 | 33 | 49.25 | 67 | 3.37 | 1.71
 | CGA/TCG | 33 | 54.10 | 28 | 45.90 | 61 | 3.07 | 1.56
 | AGC/GCT | 36 | 62.07 | 22 | 37.93 | 58 | 2.91 | 1.48
 | GCA/TGC | 47 | 83.93 | 9 | 16.07 | 56 | 2.81 | 1.43
 | AGA/TCT | 33 | 62.26 | 20 | 37.74 | 53 | 2.66 | 1.36
 | CCA/TGG | 39 | 75.00 | 13 | 25.00 | 52 | 2.61 | 1.33
 | ACC/GGT | 22 | 48.89 | 23 | 51.11 | 45 | 2.26 | 1.15
 | GAA/TTC | 28 | 63.64 | 16 | 36.36 | 44 | 2.21 | 1.13
 | CAC/GTG | 28 | 65.12 | 15 | 34.88 | 43 | 2.16 | 1.10
 | GAC/GTC | 18 | 54.55 | 15 | 45.45 | 33 | 1.66 | 0.84
 | ACG/CGT | 11 | 42.31 | 15 | 57.69 | 26 | 1.31 | 0.67
 | ATC/GAT | 5 | 45.45 | 6 | 54.55 | 11 | 0.55 | 0.28
 | TCA/TGA | 5 | 50.00 | 5 | 50.00 | 10 | 0.50 | 0.26
 | CAA/TTG | 4 | 50.00 | 4 | 50.00 | 8 | 0.40 | 0.20
 | ACT/AGT | 3 | 42.86 | 4 | 57.14 | 7 | 0.35 | 0.18
 | TAA/TTA | 1 | 14.29 | 6 | 85.71 | 7 | 0.35 | 0.18
 | CTA/TAG | 4 | 66.67 | 2 | 33.33 | 6 | 0.30 | 0.15
 | AAT/ATT | 1 | 20.00 | 4 | 80.00 | 5 | 0.25 | 0.13
 | CAT/ATG | 4 | 100.00 | - | - | 4 | 0.20 | 0.10
 | AAC/GTT | 3 | 75.00 | 1 | 25.00 | 4 | 0.20 | 0.10
 | ATA/TAT | 1 | 50.00 | 1 | 50.00 | 2 | 0.10 | 0.05
 | GTA/TAC | 1 | 100.00 | - | - | 1 | 0.05 | 0.03
4-mer | GATC | 18 | 100.00 | 0 | 0 | 18 | 7.17 | 0.46
 | ATTA/TAAT | 9 | 52.94 | 8 | 47.06 | 17 | 6.77 | 0.44
 | ATCG/CGAT | 3 | 20.00 | 12 | 80.00 | 15 | 5.98 | 0.38
 | CATC/GATG | 4 | 40.00 | 6 | 60.00 | 10 | 3.98 | 0.26
 | AGAA/TTCT | 2 | 25.00 | 6 | 75.00 | 8 | 3.19 | 0.20
 | GCTA/TAGC | 6 | 75.00 | 2 | 25.00 | 8 | 3.19 | 0.20
 | GATA/TATC | 1 | 14.29 | 6 | 85.71 | 7 | 2.79 | 0.18
 | GCGA/TCGC | 3 | 42.86 | 4 | 57.14 | 7 | 2.79 | 0.18
 | GCAC/GTGC | 2 | 33.33 | 4 | 66.67 | 6 | 2.39 | 0.15
 | AGGG/CCCT | 2 | 33.33 | 4 | 66.67 | 6 | 2.39 | 0.15
5-mer | AGGAG/CTCCT | 3 | 15.00 | 17 | 85.00 | 20 | 4.69 | 0.51
 | CTCTC/GAGAG | 17 | 89.47 | 2 | 10.53 | 19 | 4.46 | 0.49
 | GAGGA/TCCTC | 9 | 56.25 | 7 | 43.75 | 16 | 3.76 | 0.41
 | CCTCC/GGAGG | 12 | 80.00 | 3 | 20.00 | 15 | 3.52 | 0.38
 | AGAGG/CCTCT | 4 | 26.67 | 11 | 73.33 | 15 | 3.52 | 0.38
 | GGAGA/TCTCC | 2 | 18.18 | 9 | 81.82 | 11 | 2.58 | 0.28
 | CTCGC/GCGAG | 7 | 77.78 | 2 | 22.22 | 9 | 2.11 | 0.23
 | AGCTA/TAGCT | 4 | 44.44 | 5 | 55.56 | 9 | 2.11 | 0.23
 | GAAAA/TTTTC | 2 | 25.00 | 6 | 75.00 | 8 | 1.88 | 0.20
 | AGGCG/CGCCT | 2 | 25.00 | 6 | 75.00 | 8 | 1.88 | 0.20
6-mer | CGCCTC/GAGGCG | 12 | 85.71 | 2 | 14.29 | 14 | 3.59 | 0.36
 | CGGCGA/TCGCCG | 4 | 28.57 | 10 | 71.43 | 14 | 3.59 | 0.36
 | CCTCCG/CGGAGG | 9 | 81.82 | 2 | 18.18 | 11 | 2.82 | 0.28
 | AGGCGG/CCGCCT | 1 | 10.00 | 9 | 90.00 | 10 | 2.56 | 0.26
 | CCGTCG/CGACGG | 4 | 44.44 | 5 | 55.56 | 9 | 2.31 | 0.23
 | CGTCGC/GCGACG | 7 | 77.78 | 2 | 22.22 | 9 | 2.31 | 0.23
 | ACCGCC/GGCGGT | 1 | 12.50 | 7 | 87.50 | 8 | 2.05 | 0.20
 | CCACCG/CGGTGG | 6 | 85.71 | 1 | 14.29 | 7 | 1.79 | 0.18
 | GGCGGA/TCCGCC | 5 | 71.43 | 2 | 28.57 | 7 | 1.79 | 0.18
 | CTCCAT/ATGGAG | 6 | 100.00 | 0 | 0 | 6 | 1.54 | 0.15
7-mer | CCGCCGC/GCGGCGG | 4 | 66.67 | 2 | 33.33 | 6 | 7.32 | 0.15
 | CTCTCTC/GAGAGAG | 4 | 80.00 | 1 | 20.00 | 5 | 6.10 | 0.13
 | CCTCTCT/AGAGAGG | 4 | 100.00 | 0 | 0 | 4 | 4.88 | 0.10
 | CTCTCTT/AAGAGAG | 4 | 100.00 | 0 | 0 | 4 | 4.88 | 0.10
 | CCCAAAT/ATTTGGG | 3 | 100.00 | 0 | 0 | 3 | 3.66 | 0.08
 | GCCGCCG/CGGCGGC | 3 | 100.00 | 0 | 0 | 3 | 3.66 | 0.08
 | GCGGCGC/GCGCCGC | 2 | 100.00 | 0 | 0 | 2 | 2.44 | 0.05
 | AATAAAA/TTTTATT | 2 | 100.00 | 0 | 0 | 2 | 2.44 | 0.05
 | GTGTGCG/CGCACAC | 2 | 100.00 | 0 | 0 | 2 | 2.44 | 0.05
 | CGCCGTC/GACGGCG | 2 | 100.00 | 0 | 0 | 2 | 2.44 | 0.05
8-mer | TTGGTTTC/GAAACCAA | 2 | 100.00 | 0 | 0 | 2 | 33.33 | 0.05
 | TGGGCTTG/CAAGCCCA | 1 | 100.00 | 0 | 0 | 1 | 16.67 | 0.03
 | GCTTCTTG/CAAGAAGC | 1 | 100.00 | 0 | 0 | 1 | 16.67 | 0.03
 | ACGGGCGA/TCGCCCGT | 1 | 100.00 | 0 | 0 | 1 | 16.67 | 0.03
 | ATGATGTA/TACATCAT | 1 | 100.00 | 0 | 0 | 1 | 16.67 | 0.03
9-mer | TCGGCGGCG/CGCCGCCGA | 2 | 100.00 | 0 | 0 | 2 | 8.00 | 0.05
 | AGGTGGTGG/CCACCACCT | 2 | 100.00 | 0 | 0 | 2 | 8.00 | 0.05
 | CCGGTGCGA/TCGCACCGG | 1 | 100.00 | 0 | 0 | 1 | 4.00 | 0.03
 | ACGAGGAGG/CCTCCTCGT | 1 | 100.00 | 0 | 0 | 1 | 4.00 | 0.03
 | TCCCTTTTC/GAAAAGGGA | 1 | 100.00 | 0 | 0 | 1 | 4.00 | 0.03
 | CGGCATGAA/TTCATGCCG | 1 | 100.00 | 0 | 0 | 1 | 4.00 | 0.03
 | CGGCAGCGA/TCGCTGCCG | 1 | 100.00 | 0 | 0 | 1 | 4.00 | 0.03
 | ACCATCCCG/CGGGATGGT | 1 | 100.00 | 0 | 0 | 1 | 4.00 | 0.03
 | ATGGGCGGC/GCCGCCCAT | 1 | 100.00 | 0 | 0 | 1 | 4.00 | 0.03
 | ATGCAGGGT/ACCCTGCAT | 1 | 100.00 | 0 | 0 | 1 | 4.00 | 0.03
10-mer | AGCCCCAACG/CGTTGGGGCT | 1 | 50.00 | 1 | 50.00 | 2 | 40.00 | 0.05
 | TTTTTTTCTT/AAGAAAAAAA | 1 | 100.00 | 0 | 0 | 1 | 20.00 | 0.03
 | CCTGCTTTGC/GCAAAGCAGG | 1 | 100.00 | 0 | 0 | 1 | 20.00 | 0.03
 | ATCTCCGCCG/CGGCGGAGAT | 1 | 100.00 | 0 | 0 | 1 | 20.00 | 0.03

Table 3: Distribution of amplicon alignments for specific and redundant amplicons

with varying identity levels.

Identity (%) | 100 | 99 | 98 | 97 | 96 | 95–90 | 89–80 | 79–70 | 69–60 | ≤59 | Total
Amplicons | 787 | 261 | 151 | 29 | 11 | 8 | 8 | 6 | 5 | 15 | 1281
% | 61.44 | 20.37 | 11.79 | 2.26 | 0.86 | 0.62 | 0.62 | 0.47 | 0.39 | 1.17 | -

3. Tandem Repeat distribution in gene transcripts of three plant families

Genetics and Molecular Biology (ISSN 1415-4757)

ABSTRACT

Tandem repeats (microsatellites or SSRs) are molecular markers with great potential for plant genetic studies. Modern strategies include the transfer of these markers between widely studied and orphan species. In silico analyses make it possible to study the distribution patterns of microsatellites and to predict which motifs would be more amenable to interspecies transfer. Transcribed sequences (Unigene) from ten species of three plant families were surveyed for the occurrence of micro- and minisatellites. Transcripts from different species displayed different rates of tandem repeat occurrence, ranging from 1.47% to 11.28%. Similar as well as different patterns were found within and between plant families. The results also indicate a lack of association between genome size and tandem repeat fractions in expressed regions. The conservation of motifs among species and its implications for evolution and genome dynamics are discussed.

INTRODUCTION

Microsatellites or SSRs (simple sequence repeats) are DNA sequences formed by the tandem arrangement of motifs of one to six base pairs, widely distributed in prokaryote and eukaryote genomes (Morgante and Olivieri, 1993; Tóth et al., 2000). Microsatellite regions tend to form loop or hairpin structures, leading to slippage of DNA polymerase during replication and provoking the insertion or deletion of nucleotides (Iyer et al., 2000). Expansions and/or contractions of microsatellites may lead to a gain or loss of gene function (Li et al., 2002, 2004a). Initially, it was suggested that the occurrence and distribution of microsatellites were the result of random processes. However, newer evidence indicates that the genomic distribution of these repeats originates from non-random processes (Bell, 1996; Li et al., 2004b). Microsatellites have been reported to correspond to 0.85% of the Arabidopsis (Arabidopsis thaliana), 0.37% of the maize (Zea mays subsp. mays), 3.21% of the fugu fish (Fugu rubripes), 0.21% of the nematode Caenorhabditis elegans and 0.30% of the yeast (Saccharomyces cerevisiae) genomes (Morgante et al., 2002). They also make up 3.00% of the human genome (Subramanian et al., 2003).

For microsatellites located in genic regions, the 5'UTR is the hotspot for the presence of this type of repeat. It is known that contractions and/or expansions of repeats found in 5'UTR regions alter the transcription and/or the translation of these genes (Li et al., 2004b; Zhang et al., 2006a). Mutations in microsatellite loci found in 3'UTR regions are associated with gene silencing, changes in transcript export to the cytosol and in splicing mechanisms, as well as changes in the expression levels of flanking genes (Davis et al., 1997; Thornton et al., 1997; Philips et al., 1998; Conne et al., 2000). For coding sequences (CDS), the impact of mutations has been described as functional changes, loss of function and protein truncation (Li et al., 2004b). In plants, although many studies have reported microsatellite frequencies in transcribed regions (Temnykh et al., 2001; McCouch et al., 2002; Morgante et al., 2002; Thiel et al., 2003; Nicot et al., 2004; Kashi and King, 2006; Lawson and Zhang, 2006; Varshney et al., 2006; Zhang et al., 2006b), additional comparative or descriptive analyses can offer novel perspectives on their use as molecular markers. The genomic abundance of microsatellites and their association with many phenotypes make this class of molecular markers a powerful tool for many applications in plant genetics. The identification of microsatellite markers derived from ESTs and/or cDNAs, described as functional markers, represents an even more useful possibility when compared to markers based on anonymous regions (Varshney et al., 2005, 2006).

In order to provide information regarding the patterns of microsatellite

occurrence and distribution on transcribed genome regions, non-redundant full-length

cDNAs (fl-cDNAs) and/or ESTs belonging to ten plant species from three different

families (Brassicaceae, Solanaceae and Poaceae) were used.

MATERIAL AND METHODS

Obtaining the expressed sequence

Files containing expressed sequences were obtained for the following families/species: Brassicaceae (Arabidopsis thaliana and Brassica napus), Solanaceae (Solanum lycopersicum and Solanum tuberosum) and Poaceae (Oryza sativa, Sorghum bicolor, Triticum aestivum, Zea mays, Saccharum officinarum and Hordeum vulgare), deposited in the NCBI Unigene database. The non-redundant yet representative sequences for all known genes in each species were selected. The sequences used in the present study were downloaded from the Unigene database in June 2008.

Distribution of sequences in different transcribed regions

Using scripts developed in Perl and based on the existing

annotation of each cDNA and/or EST sequence, the sequences were

split into CDS, upstream and downstream regions and written to separate FASTA

files named CDS, 5´UTR and 3´UTR for each species. Since the annotation of

introns was not part of the database, the repeats present in intronic regions were not

considered in this study.
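
The partitioning step described above can be illustrated with a minimal sketch. It is written in Python rather than the Perl used in this study, and it is not the original script. The function names `split_transcript` and `to_fasta`, the record layout and the use of 1-based inclusive CDS coordinates are assumptions made only for illustration.

```python
from typing import NamedTuple


class TranscriptRegions(NamedTuple):
    """The three transcribed regions of one annotated sequence."""
    utr5: str
    cds: str
    utr3: str


def split_transcript(seq: str, cds_start: int, cds_end: int) -> TranscriptRegions:
    """Split a transcript into 5'UTR, CDS and 3'UTR.

    cds_start and cds_end are assumed to be 1-based, inclusive
    coordinates taken from the sequence annotation (hypothetical input).
    """
    if not 1 <= cds_start <= cds_end <= len(seq):
        raise ValueError("CDS coordinates fall outside the sequence")
    return TranscriptRegions(
        utr5=seq[: cds_start - 1],       # everything upstream of the CDS
        cds=seq[cds_start - 1 : cds_end],
        utr3=seq[cds_end:],              # everything downstream of the CDS
    )


def to_fasta(name: str, region: str, seq: str) -> str:
    """Format one region as a FASTA record (header layout is illustrative)."""
    return f">{name}|{region}\n{seq}\n"


# Toy transcript: 3 bp upstream, a 9 bp CDS, 2 bp downstream.
regions = split_transcript("GGCATGAAATAGCC", cds_start=4, cds_end=12)
for label, sequence in zip(("5UTR", "CDS", "3UTR"), regions):
    if sequence:  # sequences lacking a UTR simply produce no record
        print(to_fasta("toy_transcript", label, sequence), end="")
```

A sequence annotated without one of the UTRs yields an empty string for that region, which mirrors the observation that not all database entries contain both UTRs.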

Location of tandem repeats

To locate micro- and minisatellites, the SSRLocator software was used

(Maia et al., 2008). The software options were adjusted to locate monomers, dimers,

trimers, pentamers and hexamers containing a minimum of 10, 7, 5, 4 and 4 repeats,

respectively. For minisatellites, heptamers, octamers, nonamers and decamers

containing a minimum of 3, 3, 3 and 2 repeats, respectively, were selected.
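
To make the threshold-based search concrete, the following sketch locates perfect tandem repeats with regular expressions. It does not reproduce SSRLocator's algorithm; the function names are hypothetical. The minimum repeat numbers follow the text as written, except for tetramers, whose threshold is not stated there and is set to 5 purely as an assumption.

```python
import re

# Motif length -> minimum number of repeats. Lengths 1, 2, 3, 5, 6 and
# 7-10 follow the Methods text as written; the tetramer value (4: 5) is
# an assumption, since the text gives no threshold for tetramers.
MIN_REPEATS = {1: 10, 2: 7, 3: 5, 4: 5, 5: 4, 6: 4, 7: 3, 8: 3, 9: 3, 10: 2}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit."""
    return motif not in (motif + motif)[1:-1]


def find_tandem_repeats(seq: str, min_repeats: dict = MIN_REPEATS) -> list:
    """Return (start, motif, repeat_count) for perfect tandem repeats.

    start is 1-based. Non-primitive motifs (e.g. AGAG, a dimer in
    disguise) are skipped so that a locus is reported once, under its
    shortest repeat unit. Imperfect and compound repeats are ignored.
    """
    seq = seq.upper()
    loci = []
    for k, n in sorted(min_repeats.items()):
        # One copy of a k-mer followed by at least n-1 further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                loci.append((match.start() + 1, motif, len(match.group(0)) // k))
    return loci


# Toy sequence with an (AG)8 dimer locus and a (GAA)5 trimer locus.
example = "TT" + "AG" * 8 + "CCC" + "GAA" * 5 + "T"
print(find_tandem_repeats(example))
```

With these thresholds every reported locus spans at least 10 bp for monomers and roughly 20 bp or more for the other classes, which is in line with the 20 bp minimum size commonly used in EST surveys.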

RESULTS AND DISCUSSION

Distribution of sequences in UTRs and CDSs

The sequences separated into coding regions (CDS) and untranslated

transcribed regions (5´UTR and 3´UTR), described by number of sequences, amount

(Mb) and average size (bp) for the ten species, are shown in Table 1. Average sequence

lengths ranged from 560 to 893 bp in all species except A.

thaliana and O. sativa, whose databases contained longer sequences, with

averages of 1,447 and 1,490 bp, respectively. The Poaceae species Z. mays and O.

sativa had the largest numbers of sequences deposited in Unigene, with 57,447 and

40,259 sequences, respectively. It must be taken into account that

not all sequences deposited in this database contain 5´UTR and 3´UTR regions, and

in some sequences, both sequence types are found and in others only one (i.e., 5´ or

3´UTR) is found. The overall average sizes were found to be 130 bp for 5´UTR, 873

bp for CDS and 270 bp for 3´UTR regions. The total nucleotides allocated in each

region were on average 0.9% for 5´UTR, 97.5% for CDS and 1.6% for 3´UTR

regions. The only species with contrasting values was Arabidopsis, where 6.8%,

82.6% and 10.7% of total nucleotides were allocated for 5´UTR, CDS and 3´UTR

regions, respectively.

Percentage of expressed sequences with tandem repeats

On average, 3.55% of the analyzed sequences contained one or more tandem repeat

loci. Figure 1 shows the percentage of sequences containing tandem repeats

for each species. The highest value was found for rice (11.28%).

The lowest values were found for the Solanaceae species, i.e. 1.47% and 1.76%

for S. lycopersicum and S. tuberosum, respectively. The percentage found for

Arabidopsis (3.88%) agrees with other reports, which found between

3% and 5% of sequences containing tandem repeats (Cardle et al., 2000; Kumpatla and

Mukhopadhyay, 2005). For B. napus, S. lycopersicum and S. tuberosum,

2.42%, 1.47% and 1.76% of sequences contained tandem repeats,

respectively. Different values (6.9%, 4.7% and 2.65%) have been reported for the

same species, respectively (Kumpatla and Mukhopadhyay, 2005). For Poaceae, the

comparison of the present results with former reports for H. vulgare (4.25% vs. 8.11%),

Z. mays (2.14% vs. 1.5%), O. sativa (11.28% vs. 4.7%), S. officinarum (2.13% vs.

2.9%) and T. aestivum (2.38% vs. 7.5%) shows a different range of values (Cordeiro

et al., 2001; Kantety et al., 2002; Thiel et al., 2003; Nicot et al., 2004; Asp et al.,

2007). However, all differences are within a 2- to 3-fold range.

The variation in percentage values among different reports is related

to the search strategy used (software, repeat number and repeat type defined for the search) by the

authors. However, there is general agreement that microsatellite stretches with minimum

sizes of 20 bp are present in approximately 2-5% of cereal EST sequences

(Varshney et al., 2005).

Frequency of tandem repeats in UTR and CDS regions

Results for total occurrences (total loci), percentage per region (loci amounts

per region divided by total number of loci) and frequencies (amount of loci per

megabase) are shown separately for each species and genic region (5´UTR, CDS

and 3´UTR) in Table 2. In 5´UTR and 3´UTR regions, 4.92% (529 loci) and 2.21%

(237 loci) of all repeats found in the surveyed species (10,731 loci) were detected, with

average frequencies of 1.3 and 0.7 loci/Mb, respectively. In coding regions (CDS) a

higher occurrence of micro- and minisatellites was detected, reaching 92.86% of total

loci found (9,965 occurrences), with an average frequency of 35.1 loci/Mb. The higher

percentage of repeats in CDS regions is a consequence of the trimers present

in this region. However, for Arabidopsis, large percentages of dimers (17.9%), trimers

(19.3%) and total (44.5%) microsatellites were found in UTR regions, contrasting with

the other species (Table 3). For Rosaceae, between 44.3% and 53.2% of

microsatellites were found in UTR regions (Jung et al., 2005). For Arabidopsis, 81%

and 26% of dimers and trimers were found in UTR regions (Yu et al., 2004).
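
As a worked illustration of the two quantities defined above (percentage per region and loci per megabase), the sketch below recomputes the Arabidopsis row of Table 2 from the locus counts and the total sequence amount in Table 1. The function names are hypothetical. Dividing by the total megabases of expressed sequence, rather than by the megabases of each region, is an inference from the tabulated numbers and not a statement of the exact procedure used.

```python
def percent_of_total(region_loci: int, total_loci: int) -> float:
    """Loci in one region as a percentage of all loci in the species."""
    return 100.0 * region_loci / total_loci


def loci_per_mb(loci: int, total_mb: float) -> float:
    """Frequency of loci per megabase of expressed sequence."""
    return loci / total_mb


# A. thaliana: 43.3 Mb of expressed sequence (Table 1), loci from Table 2.
total_mb = 43.3
loci = {"5'UTR": 395, "CDS": 610, "3'UTR": 157}
total_loci = sum(loci.values())

for region, count in loci.items():
    print(region,
          round(percent_of_total(count, total_loci), 1),
          round(loci_per_mb(count, total_mb), 1))
print("Total", total_loci, round(loci_per_mb(total_loci, total_mb)))
```

Under this assumption the computed values match the tabulated 34.0%, 52.5% and 13.5% and the frequencies of 9.1, 14.1 and 3.6 loci/Mb for Arabidopsis.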

In the present study, the highest proportion of microsatellites in 5´UTR regions was

detected in Arabidopsis, with a frequency of 9.1 loci/Mb. These repeats represented

34% of all 1,162 repeats found in the 29,918 sequences analyzed in this species.

The species O. sativa and H. vulgare had the second and third highest

frequencies of repeats in these regions, with 1.3 and 1.0 loci/Mb,

respectively (Table 2).

Many studies indicate that UTRs are more abundant in microsatellites than CDS

regions (Morgante et al., 2002). In the present work, the finding that 92.86% of microsatellite loci lie in

CDS regions is probably due to an annotation deficiency in separating translated from

non-translated fractions in the Unigene transcript database.

As observed for 5´UTRs, contrasting values were also found in 3´UTR regions.

Much higher values were found for Arabidopsis (average of 3.6 loci/Mb) when

compared to values below 0.6 loci/Mb found for the remaining species (Table 2).

Considering all 5´UTR, 3´UTR and CDS occurrences for all species, the

average frequency observed is 37 loci/Mb. The values range from 18 loci/Mb in

tomato to 76 loci/Mb in rice. Average frequency values per family are: 29.0 loci/Mb in

Brassicaceae, 19.9 loci/Mb in Solanaceae and 45.4 loci/Mb in Poaceae (Table 2).

Many reports have shown higher values than those found in this study, i.e.,

112-133 loci/Mb in barley, 133 loci/Mb in maize, 94-161 loci/Mb in wheat, 158-169

loci/Mb in sorghum, 161 loci/Mb in rye, 256-277 loci/Mb in rice and 133 loci/Mb in

Arabidopsis (Varshney et al., 2002; Thiel et al., 2003; Parida et al., 2006). In Citrus

species, values as high as 507 loci/Mb have been described in EST sequences

(Palmieri et al., 2007). Values as high as 125 loci/Mb were also found in Brassica

rapa (Hong et al., 2007). Frequency values closer to our study have been reported

for CDS regions of Rosa chinensis (Rose), Prunus dulcis (Almond), Prunus persica

(Peach) and Arabidopsis, showing values ranging from 39-78 loci/Mb (Jung et al.,

2005).

Percentage occurrences of different microsatellite types in UTR and CDS regions

In Table 3, the detailed percentage values for each repeat type in the different

sections of a genic region are listed for each species. The average occurrence of

dimer microsatellites in all species was 21.9%, with the majority of these loci present in

CDS regions. For each family, the average percentage of dimer occurrence was

31.5% for Brassicaceae, 21.7% for Solanaceae and 18.8% for Poaceae species.

The percentage values for dimer microsatellites in CDS regions ranged from 4.0% for

Arabidopsis to 40.8% in B. napus. An interesting feature that seems to be particular

to the Arabidopsis genome is the high occurrence of dimer microsatellites in 5´

(13.6%) and 3´ (4.3%) UTR regions. Within the Poaceae, dimer microsatellites

ranged from 15.4% in barley to 27.3% in wheat (Table 3). Other studies indicate that

the highest rates of dimer occurrence are generally associated with 5´UTR regions

(Morgante et al., 2002; Lawson and Zhang, 2006; Hong et al., 2007), but one should

keep in mind that the prevalence in CDS regions observed here may reflect deficient

annotation of the database. Trimer microsatellites represented 40.2% of occurrences,

with a high predominance in CDS regions. The species with the highest trimer values

were Arabidopsis, rice and tomato, with 58.0%, 54.7% and 41.4% of occurrences,

respectively. The average percentage of trimers within each family was 47.0% in

Brassicaceae, 37.8% in Solanaceae and 38.7% in Poaceae. Among the Poaceae

species, the highest percentage of occurrence was found in rice (54.7%) and the

lowest percentage of trimer occurrence was for maize (34.6%). In Brassicaceae,

trimers were more frequent in Arabidopsis (58.0%) and less frequent in B.

napus (36.1%) (Table 3).

Tetramers represented, on average, 8.2% of microsatellites, with average

frequencies of 3.4%, 4.4% and 11.0% for Brassicaceae, Solanaceae and Poaceae,

respectively. Among the Brassicaceae, the difference in frequency between

Arabidopsis (2.9%) and B. napus (4.4%) was less than two-fold. In Poaceae, a 2.7-fold

difference was found between rice (6.1%) and barley (16.5%).
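
The fold differences quoted for the repeat classes can be read as the ratio of the larger to the smaller percentage in Table 3. This reading is an assumption, since the term is not defined in the text. Under it, the tetramer comparisons work out as follows:

```latex
\[
\text{fold difference} = \frac{p_{\text{high}}}{p_{\text{low}}},
\qquad
\frac{16.5}{6.1} \approx 2.7 \ \text{(barley vs. rice)},
\qquad
\frac{4.4}{2.9} \approx 1.5 \ \text{(\textit{B. napus} vs. \textit{Arabidopsis})}
\]
```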

Pentamers represented, on average, 10.36% of microsatellites, with average

frequencies of 4.5%, 6.6% and 13.6% for Brassicaceae, Solanaceae and Poaceae,

respectively (Table 3). Differences of less than two-fold were found within the Brassicaceae

and Solanaceae. In Poaceae, however, a 1.7-fold difference was found

between rice (9.7%) and maize (16.5%).

Hexamers represented, on average, 13.8% of microsatellites, with average

frequencies of 8.1%, 19.1% and 13.0% for Brassicaceae, Solanaceae and Poaceae, respectively. In

Poaceae, a 2.4-fold difference was found between wheat (7.7%) and sorghum

(18.3%).

Minisatellite frequencies were also assessed in the data (Table 3).

Heptamers represented, on average, 4.5% of total (minisatellite plus microsatellite)

occurrences. These types of repeats were more common in the Solanaceae family

(9.6%). In Brassicaceae and Poaceae, the average frequencies of heptamers were

3.3% and 3.2%, respectively. Octamers were more frequent in the Brassicaceae

(0.8%), when compared to the Solanaceae (0.3%) and Poaceae (0.1%). Nonamers

were also more frequent in Brassicaceae (0.9%), when compared to Solanaceae

(0.6%) and Poaceae (0.5%). Decamers were comparatively less frequent than other

minisatellites, reaching frequencies of 0.2%, 0.1% and zero in Brassicaceae,

Poaceae and Solanaceae, respectively (Table 3).

Many studies have reported EST sequences containing microsatellites. For

the Poaceae (rice, maize, sorghum, barley and wheat) frequencies ranging from 16.6

to 40% for dimers, 41 to 78% for trimers, 2.6 to 14% for tetramers, 0.4-18.9% for

pentamers and below 1% for hexamers (Varshney et al., 2002; Thiel et al., 2003; La

Rota et al., 2005; Parida et al., 2006) have been reported. In the case of Arabidopsis,

frequencies of dimers (36.5%), trimers (62.1%), tetramers (1.1%), pentamers (0.15%)

and hexamers (0.13%) have been reported (Parida et al., 2006).

Most frequent motifs

Dimers and trimers

In Tables 4 and 5 the motif frequencies per species and average frequency

per family are listed. For dimers, differences were observed within and between

families. For Brassicaceae, the dimer motifs AG/CT and GA/TC were the most

frequent, reaching 9.69% and 8.89% of observations in the family. A 6.9-fold

difference was found for AG/CT between Arabidopsis (2.46%) and B. napus

(16.93%). Also for the motif GA/TC, a near 10-fold difference was found between

Arabidopsis (1.64%) and B. napus (16.14%). Other reports have shown that the

motifs AG and GA were the most frequent in Arabidopsis (Cardle et al., 2000;

Morgante et al., 2002; Lawson and Zhang, 2006; Parida et al., 2006) and AT/TA in B.

rapa (Hong et al., 2007). Among the Solanaceae, the motifs AT/AT and TA/TA were

the most frequent, with frequencies of 8.29% and 5.69%, respectively. In Solanaceae

ESTs, frequencies between 20-25% and 15-20% were found for the dimers AG and

AT, respectively (Kumpatla and Mukhopadhyay, 2005). In Poaceae, the most frequent

motifs were AG/CT and GA/TC, with average percentage values of 6.72% and

5.61%, respectively. In other studies, frequencies ranging from 38-50% were found

for the motif AG in maize, barley, rice, sorghum and wheat (Morgante et al., 2002;

Varshney et al., 2002; Kantety et al., 2002; Thiel et al., 2003; Yu et al., 2004; La Rota

et al., 2005) and frequencies of 50% for AC in barley (Varshney et al., 2002).

However, other reports have shown GA as the most abundant motif in grasses

(Temnykh et al., 2001; Kantety et al., 2002; Nicot et al., 2004; Parida et al., 2006).

In all species analysed in the present study, the lowest frequencies were

found for motifs formed by guanine and cytosine (CG/GC), which were even

absent in the Brassicaceae and Solanaceae species.
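
The paired motif notation used throughout (e.g. AG/CT, GA/TC) joins a motif with its reverse complement, because a repeat read on one strand appears as its reverse complement on the other. A minimal sketch of that labelling is given below. The function names are hypothetical, and the alphabetical ordering of the two members is simply a convention chosen here to reproduce labels such as AG/CT.

```python
# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif: str) -> str:
    """Return the reverse complement of a DNA motif."""
    return motif.upper().translate(_COMPLEMENT)[::-1]


def motif_label(motif: str) -> str:
    """Label a motif together with its reverse complement, e.g. CT -> AG/CT.

    Cyclic permutations (AG vs. GA) are deliberately kept apart,
    as they are reported as separate classes in the text.
    """
    motif = motif.upper()
    return "/".join(sorted((motif, reverse_complement(motif))))


for m in ("CT", "TC", "AT", "TTC", "CGG"):
    print(m, "->", motif_label(m))
```

Self-complementary motifs such as AT or TA pair with themselves, which is why they appear as AT/AT and TA/TA.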

The data regarding trimer frequencies show, as already observed for dimers,

that motif patterns are different within as well as between families (Table 4). Among

the Brassicaceae, the motifs GAA/TTC and AAG/CTT were the most abundant,

reaching frequencies of 8.36% and 6.73%, respectively. Contrasting values were

verified for GAA/TTC between Arabidopsis (12.13%) and B. napus (4.59%). The

motif AAG/CTT also showed contrasting values for Arabidopsis (9.51%) and B.

napus (3.96%). Other reports have claimed that AAG is the most frequent motif for

Arabidopsis and B. rapa (Morgante et al., 2002; Hong et al., 2007). In the

Solanaceae, the motifs GAA/TTC and AGA/TCT were the most frequent, showing

values of 4.75% and 4.60%, respectively. For both motifs, the frequency values were

higher in S. tuberosum. Similar results were obtained in Arabidopsis, B. napus, B.

rapa, S. lycopersicum and S. tuberosum (Kumpatla and Mukhopadhyay, 2005) and in

Citrus (Jiang et al., 2006) where the motifs AAG/AGA/GAA were the most frequent.

In the Poaceae, the trimers CCG/CGG, CGC/GCG and GCC/GGC were the most

frequent, corresponding to 5.89%, 5.85% and 5.06%, respectively, adding up to

16.80% of all microsatellites found. Within the family, different motifs were the most

common, i.e., for O. sativa, S. bicolor and H. vulgare, the motifs CCG/CGG were

predominant. For T. aestivum and S. officinarum it was GCC/GGC and for Z. mays it

was CGC/GCG. Other reports have shown predominance of the motif CCG in grass

species Z. mays, H. vulgare, O. sativa, S. bicolor, T. aestivum, S. cereale and S.

officinarum (Cordeiro et al., 2001; Kantety et al., 2002; Morgante et al., 2002;

Varshney et al., 2002; Thiel et al., 2003; Nicot et al., 2004; Yu et al., 2004; La Rota

et al., 2005; Peng et al., 2005). These motifs (CCG/CGG, CGC/GCG and GCC/GGC)

seem to be less common in other families, where instead of values around 16.8%

(found for grasses), these motifs reached frequency values of 0.56% in Brassicaceae

and 0.36% in the Solanaceae.

Tetramers, pentamers and hexamers

For the loci formed by motifs longer than three nucleotides, only the ten highest

average percentage values for each family are shown (Tables 4 and 5).

In Brassicaceae, tetramer motifs occurring at higher frequencies were

AAGA/TCTT, AAAC/GTTT and GAAA/TTTC, adding up to 1.04% of all motifs found. Other

reports indicate that the motifs AAAG/AAAT were predominant in Arabidopsis and AAAT

in B. rapa (Cardle et al., 2000; Hong et al., 2007). For 5´UTR/CDS and 3´UTR

Arabidopsis regions, the predominant motifs reported were AAAG/CTTT and

AAAC/GTTT, respectively (Morgante et al., 2002; Zhang et al., 2004). For the

Solanaceae species, 1.96% of all motifs found were either TAAA/TTTA, TTAA/TTAA

or AAGA/TCTT. These results agree with EST data from 20 dicot species (Kumpatla

and Mukhopadhyay, 2005). Among the grasses, 0.85% of all motifs were either

CCTC/GAGG, AGGA/TCCT or CATC/GATG. Differences in the predominant

tetramer rates were found among the species (Table 4). Other reports have shown

ACGT as the most abundant for barley (Varshney et al., 2002; Thiel et al., 2003),

AAAG/CTTT and AAGG/CCTT for perennial ryegrass (Asp et al. 2007) and AAAG as

the most frequent motif in rice BACs (McCouch et al., 2002).

Pentamers present at rates of 0.80% (GAAAA/TTTTC, AAAAT/ATTTT and

AAAAC/GTTTT), 1.37% (AAAAT/ATTTT, AAAAG/CTTTT and AGAAG/CTTCT) and

0.83% (CTCTC/GAGAG, GAGGA/TCCTC and CTTCC/GGAAG) were predominant

in Brassicaceae, Solanaceae and Poaceae, respectively. The major difference

among plant families is the predominance of A/T in Brassicaceae and Solanaceae.

Also, reports on CDS regions of Arabidopsis, S. cerevisiae and C. elegans indicated

the predominance of ACCCG and AAAAG (Toth et al., 2000). For eukaryotes in

general, AAAAT, AAAAC and AAAAG have been shown to be predominant (Li et al.,

2004a). On the other hand, 5´UTR and 3´UTR regions of Arabidopsis were shown to

be rich in AAGAG and AAAAC, respectively (Zhang et al., 2004). AAAAT (Hong et

al., 2007) and AAAAT/AAAAG (Jiang et al., 2006) were described as frequently

found in Rosaceae and Citrus, respectively.

In transcripts from TIGR database, the motif AGAGG was predominant in rice,

AGGGG in barley and ACGAT in wheat (La Rota et al., 2005). Very little information

was found describing the preferential occurrences of pentamers in grasses and

information found about eukaryotes (Toth et al., 2000; Li et al., 2004a), Citrus

(Palmieri et al., 2007; Jiang et al., 2006), Arabidopsis (Zhang et al., 2004) and

Rosaceae (Hong et al., 2007) showed variable results.

A pattern of occurrence of hexamers among and within the three analyzed

plant families was found (Table 5). Only one study reports results in agreement with the

present results, regarding the predominance of AAGGAG hexamers found in

Arabidopsis (Toth et al., 2000). Other reports indicate that the major occurrences of

hexamers are AAGATG, AAAGAG and AAAAAT in Arabidopsis (Zhang et al., 2004),

AAAAAG in Citrus (Jiang et al., 2006), AACACG in S. cerevisiae, ACCAGG in C.

elegans, AAGGC in mammals and CCCCGG in primates (Toth et al., 2000). The ten

major occurrences in heptamers, octamers, nonamers and decamers are presented

in Table 5. Occurrences are widely variable within and among families, making it

difficult to establish a pattern or discussion based on similarities.

Genome dynamics is very complex regarding microsatellite motifs in plants. A

higher conservation of dimer motifs (AG/CT and GA/TC) seems to overcome

evolutionary distances such as those found between monocot and dicot

plants. However, within the dicots, this conservation may not hold. Unexpectedly,

Poaceae and Brassicaceae were closer when these motifs were analyzed. On the

other hand, trimer microsatellites that are known to be predominant in coding regions

follow the expected pattern of conservation, showing similar rates and predominant

motifs (GAA/TTC) between the two dicot families. Trimers present at higher

frequencies in the grasses tend to be formed by GC arrangements, in contrast to

dicot plants where GATC combinations are more frequently found. The higher

frequency of AT-rich repeats is also found in pentamer motifs in the dicot families.

Repeats of higher complexity did not show detectable conserved patterns in this

study.

CONCLUSIONS

The occurrence of micro and minisatellites in rice sequences (11.28%) is

higher than in other species, ranging from 2.5 to 5 times more sequences containing

these repetitive DNA loci. The fact that species having larger genomes (T. aestivum,

H. vulgare and S. officinarum) do not present a corresponding higher frequency of

repetitive loci suggests that there is no relationship between genome size and rates

of tandem repeat occurrence in functional regions. However, the lower coverage of

sequences present in databases for these species could also be a reason for the low

rates found in some species. For Arabidopsis and rice, the results obtained are closer

to reality because both are considered model species and have been studied at

deeper coverage.

The distribution of micro- and minisatellites was higher in CDS regions for all

studied species. Also, microsatellites (97%) were more common than minisatellites

(3%). Per family, the predominant dimer motifs were the same for Brassicaceae and

Poaceae (AG/CT) and different for the Solanaceae (AT/AT). Trimers were the

predominant repeats, ranging between 34.3% and 58.0% with different rates

depending on the family or species. For the Solanaceae, the predominant trimer

motifs were not the same for S. lycopersicum (ATA/TAT and AAT/TTA) and S.

tuberosum (GAA/TTC and AGA/TCT), which could be due to selection. Among the

grasses, trimers formed by C/G were the most abundant; however, the specific motifs

are variable between species.

Disagreements between earlier reports and the results obtained in the present

work, where dimers were also frequent in CDS regions, could be due to the fact that

the Unigene database contains predominantly EST clusters. Therefore, UTR regions

tend to be under-represented in the annotated sequences present

in this database. This is true for all species except Arabidopsis. This could be

solved if the genes were manually curated to define the different regions; however, it

would take a community effort to accomplish such a task.

The obtained results shed light on the patterns of tandem repeat occurrence

within and between different plant families, facilitating the use of plant breeding

strategies based on the transfer of markers from model to orphan species.

ACKNOWLEDGMENTS:

The authors thank CNPq for fellowships and grants. The authors also thank

Dr. Dario Abel Palmieri (UNESP/Assis-SP) and Dr. Olivier Panaud (University of

Perpignan) for fruitful discussions.

REFERENCES:

Asp T, Frei UK, Didion T, Nielsen KK and Lübberstedt T (2007) Frequency, type, and

distribution of EST-SSRs from three genotypes of Lolium perenne, and their

conservation across orthologous sequences of Festuca arundinacea, Brachypodium

distachyon, and Oryza sativa. BMC Plant Biol, 7:36.

Bell GI (1996) Evolution of simple sequence repeats. Comput Chem, 20:41-48.

Cardle L, Ramsay L, Milbourne D, Macaulay M, Marshall D and Waugh R (2000)

Computational and experimental characterization of physically clustered simple

sequence repeats in plants. Genetics, 156:847-854.

Conne B, Stutz A and Vassalli JD (2000) The 3' untranslated region of messenger

RNA: A molecular 'hotspot' for pathology? Nat Med, 6:637-641.

Cordeiro GM, Casu R, McIntyre CL, Manners JM and Henry RJ (2001) Microsatellite

markers from sugarcane (Saccharum spp.) ESTs cross transferable to erianthus and

sorghum. Plant Sci, 160:1115-1123.

Davis BM, McCurrach ME, Taneja KL, Singer RH and Housman DE (1997)

Expansion of a CUG trinucleotide repeat in the 3' untranslated region of myotonic

dystrophy protein kinase transcripts results in nuclear retention of transcripts. Proc

Natl Acad Sci USA, 94:7388-7393.

Hong CP, Piao ZY, Kang TW, Batley J, Yang TJ, Hur YK, Bhak J, Park BS, Edwards

D and Lim YP (2007) Genomic distribution of simple sequence repeats in Brassica

rapa. Mol Cells, 23:349-356.

Iyer RR, Pluciennik A, Rosche WA, Sinden RR and Wells RD (2000) DNA

polymerase III proofreading mutants enhance the expansion and deletion of triplet

repeat sequences in Escherichia coli. J Biol Chem, 275: 2174-2184.

Jiang D, Zhong GY and Hong QB (2006) Analysis of microsatellites in citrus

unigenes. Yi chuan xue bao (Acta genetica Sinica) 33:345-353.

Jung S, Abbott A, Jesudurai C, Tomkins J and Main D (2005) Frequency, type,

distribution and annotation of simple sequence repeats in Rosaceae ESTs. Funct

Integr Genomics, 5:136-143.

Kantety RV, La Rota M, Matthews DE and Sorrells ME (2002) Data mining for simple

sequence repeats in expressed sequence tags from barley, maize, rice, sorghum and

wheat. Plant Mol Biol, 48:501-510.

Kashi Y and King DG (2006) Simple sequence repeats as advantageous mutators in

evolution. Trends Genet, 22:253-259.

Kumpatla SP and Mukhopadhyay S (2005) Mining and survey of simple sequence

repeats in expressed sequence tags of dicotyledonous species. Genome, 48:985-

998.

La Rota M, Kantety RV, Yu JK and Sorrells ME (2005) Nonrandom distribution and

frequencies of genomic and EST-derived microsatellite markers in rice, wheat, and

barley. BMC Genomics, 6:23.

Lawson MJ and Zhang L (2006) Distinct patterns of SSR distribution in the

Arabidopsis thaliana and rice genomes. Genome Biol, 7: R14.

Li YC, Korol AB, Fahima T, Beiles A and Nevo E (2002) Microsatellites: genomic

distribution, putative functions and mutational mechanisms: a review. Mol Ecol,

11:2453-2465.

Li YC, Korol AB, Fahima T and Nevo E (2004a) Microsatellites within genes:

structure, function, and evolution. Mol Biol Evol, 21: 991-1007.

Li B, Xia Q, Lu C, Zhou Z and Xiang Z (2004b) Analysis on frequency and density of

microsatellites in coding sequences of several eukaryotic genomes. Genomics

Proteomics Bioinformatics, 2:24-31.

Maia LC da, Palmieri DA, de Souza VQ, Kopp MM, de Carvalho FI, Costa de Oliveira

A. (2008) SSR Locator: Tool for Simple Sequence Repeat Discovery Integrated with

Primer Design and PCR Simulation. Int J Plant Genomics. 412696.

McCouch SR, Teytelman L, Xu Y, Lobos KB, Clare K, Walton M, Fu B, Maghirang R,

Li Z, Xing Y, Zhang Q, Kono I, Yano M, Fjellstrom R, DeClerck G, Schneider D,

Cartinhour S, Ware D and Stein L (2002) Development and mapping of 2240 new

SSR markers for rice (Oryza sativa L.) DNA Res, 9:199-207.

Morgante M, Hanafey M and Powell W (2002) Microsatellites are preferentially

associated with nonrepetitive DNA in plant genomes. Nat Genet, 30: 194-200.

Morgante M and Olivieri AM (1993) PCR-amplified microsatellites as markers in plant

genetics. Plant J, 3: 175-182.

Nicot N, Chiquet V, Gandon B, Amilhat L, Legeai F, Leroy P, Bernard M and Sourdille

P (2004) Study of simple sequence repeat (SSR) markers from wheat expressed

sequence tags (ESTs). Theor Appl Genet, 109: 800-805.

Palmieri DA, Novelli VM, Bastianel M, Cristofani M, Monge GA, Carlos EF, Oliveira

AC and Machado MA (2007) Frequency and distribution of microsatellites from ESTs

of citrus. Genet Mol Biol, 30: 1009-1018.

Parida SK, Anand Raj Kumar K, Dalal V, Singh NK and Mohapatra T (2006) Unigene

derived microsatellite markers for the cereal genomes. Theor Appl Genet, 112:808-

817.

Peng JH and Lapitan NL (2005) Characterization of EST-derived microsatellites in

the wheat genome and development of eSSR markers. Funct Integr Genomics, 5:

80-96.

Philips AV, Timchenko LT and Cooper TA (1998) Disruption of splicing regulated by a

CUG-binding protein in myotonic dystrophy. Science, 280: 737-741.

Subramanian S, Mishra RK and Singh L (2003) Genome-wide analysis of

microsatellite repeats in humans: their abundance and density in specific genomic

regions. Genome Biol, 4: R13.

Temnykh S, DeClerck G, Lukashova A, Lipovich L, Cartinhour S and McCouch S

(2001) Computational and experimental analysis of microsatellites in rice (Oryza

sativa L.): frequency, length variation, transposon associations, and genetic marker

potential. Genome Research. 11:1441-1452.

Thiel T, Michalek W, Varshney RK and Graner A (2003) Exploiting EST databases for

the development and characterization of gene-derived SSR-markers in barley

(Hordeum vulgare L.). Theor Appl Genet, 106:411-422.

Thornton CA, Wymer JP, Simmons Z, McClain C and Moxley RT (1997) Expansion

of the myotonic dystrophy CTG repeat reduces expression of the flanking DMAHP

gene. Nat Genet, 16: 407-409.

Tóth G, Gáspári Z and Jurka J (2000) Microsatellites in different eukaryotic genomes:

survey and analysis. Genome Res, 10:967-981.

Varshney RK, Graner A and Sorrells ME (2005) Genic microsatellite markers in

plants: features and applications. Trends Biotechnol, 23:48-55.

Varshney RK, Thiel T, Stein N, Langridge P and Graner A (2002) In silico analysis on

frequency and distribution of microsatellites in ESTs of some cereal species. Cell

Mol Biol Lett, 7:537-546.

Varshney RK, Hoisington DA, Tyagi AK (2006) Advances in cereal genomics and

applications in crop breeding. Trends Biotechnol, 24:490-499.

Yu JK, Dake TM, Singh S, Benscher D, Li W, Gill B and Sorrells ME (2004)

Development and mapping of EST-derived simple sequence repeat markers for

hexaploid wheat. Genome, 47:805-818.

Zhang L, Yuan D, Yu S, Li Z, Cao Y, Miao Z, Qian H and Tang K (2004) Preference

of simple sequence repeats in coding and non coding regions of Arabidopsis

thaliana. Bioinformatics. 20:1081-1086.

Zhang L, Zuo K, Zhang F, Cao Y, Wang J, Zhang Y, Sun X and Tang K (2006a)

Conservation of noncoding microsatellites in plants: implication for gene regulation.

BMC Genomics, 7:323.

Zhang L, Yu S, Cao Y, Wang J, Zuo K, Qin J and Tang K (2006b) Distributional

gradient of amino acid repeats in plant proteins. Genome. 49:900-905.

Table 1. Overall distribution (amounts and percentage) of expressed sequences in translated and non-translated regions.

Species | Expressed sequences (Total Seq.1, Total Mb1, Mean bp1) | 5'UTR (Total Seq.2, % Mb2, Mean bp2) | CDS (Total Seq.3, % Mb3, Mean bp3) | 3'UTR (Total Seq.4, % Mb4, Mean bp4)
A. thaliana | 29,918 43.3 1,447 | 16,625 6.8 176 | 29,918 82.6 1,195 | 17,591 10.7 262
B. napus | 26,285 20.3 773 | 216 0.1 74 | 26,285 99.7 770 | 242 0.2 204
S. lycopersicum | 16,945 14.0 823 | 614 0.5 103 | 16,945 98.3 809 | 710 1.2 245
S. tuberosum | 19,539 15.6 796 | 554 0.3 93 | 19,539 98.6 785 | 635 1.0 252
O. sativa | 40,259 60.0 1,490 | 1,088 0.5 270 | 40,259 98.7 1,470 | 1,158 0.8 438
S. bicolor | 13,547 9.5 699 | 68 0.1 115 | 13,547 99.7 697 | 82 0.2 244
T. aestivum | 34,505 26.2 758 | 498 0.2 92 | 34,505 99.2 753 | 611 0.6 246
Z. mays | 57,447 32.2 560 | 704 0.3 120 | 57,447 99.1 555 | 803 0.7 275
S. officinarum | 15,586 12.7 815 | 48 0.1 160 | 15,586 99.8 813 | 54 0.1 273
H. vulgare | 21,418 19.1 893 | 359 0.2 102 | 21,418 99.2 886 | 458 0.6 259
Average | 27,545 25.3 905 | 2,077 0.9 130 | 27,545 97.5 873 | 2,234 1.6 269.8

Expressed sequences: Total Seq.1 (total number of cDNA sequences), Total Mb1 (sum of base pairs of fl-cDNA sequences, in Mb), Mean bp1 (average size of sequences: sum of base pairs divided by number of sequences, Total Mb1 / Total Seq.1). 5'UTR: Total Seq.2 (total sequences containing 5'UTR regions), % Mb2 (percentage of Total Mb1 contained in 5'UTR regions), Mean bp2 (average size of 5'UTR sequences: sum of base pairs in 5'UTR regions divided by Total Seq.2). CDS: Total Seq.3 (total sequences containing CDS regions), % Mb3 (percentage of Total Mb1 contained in CDS regions), Mean bp3 (average size of CDS sequences: sum of base pairs in CDS regions divided by Total Seq.3). 3'UTR: Total Seq.4 (total sequences containing 3'UTR regions), % Mb4 (percentage of Total Mb1 contained in 3'UTR regions), Mean bp4 (average size of 3'UTR sequences: sum of base pairs in 3'UTR regions divided by Total Seq.4).

Figure 1. Percentage of expressed sequences containing tandem repeat loci.

Table 2. Overall distribution of tandem repeat occurrences in translated and non-translated transcripts.

Values for each region are given as Occurrence, %, ssr/Mb; values for Total are given as Occurrence, ssr/Mb.

Species | 5'UTR | CDS | 3'UTR | Total
A. thaliana | 395 34.0 9.1 | 610 52.5 14.1 | 157 13.5 3.6 | 1,162 27
B. napus | 1 0.2 0.0 | 632 99.5 31.1 | 2 0.3 0.1 | 635 31
S. lycopersicum | 6 2.4 0.4 | 234 94.0 16.8 | 9 3.6 0.6 | 249 18
S. tuberosum | 4 1.2 0.3 | 336 97.7 21.6 | 4 1.2 0.3 | 344 22
O. sativa | 78 1.7 1.3 | 4,433 97.6 73.9 | 29 0.6 0.5 | 4,540 76
S. bicolor | 3 0.6 0.3 | 505 99.4 53.3 | 0 0.0 0.0 | 508 54
T. aestivum | 11 1.3 0.4 | 795 97.0 30.4 | 14 1.7 0.5 | 820 31
Z. mays | 12 1.0 0.4 | 1,205 98.0 37.4 | 13 1.1 0.4 | 1,230 38
S. officinarum | 0 0.0 0.0 | 332 100.0 26.1 | 0 0.0 0.0 | 332 26
H. vulgare | 19 2.1 1.0 | 883 96.9 46.2 | 9 1.0 0.5 | 911 48
Average | 529 4.9 1.3 | 9,965 92.9 35.1 | 237 2.2 0.7 | 10,731 37

Table 3. Overall occurrence, in percentage, of microsatellite and minisatellite motifs on different regions of ten plant species.

Microsatellites  Dimer                       Trimer                      Tetramer                    Pentamer                    Hexamer
                 5'UTR  CDS    3'UTR  Total  5'UTR  CDS    3'UTR  Total  5'UTR  CDS    3'UTR  Total  5'UTR  CDS    3'UTR  Total  5'UTR  CDS    3'UTR  Total
A. thaliana      13.6   4.0    4.3    21.9   14.6   38.6   4.7    58.0   1.0    0.9    1.0    2.9    2.1    0.8    1.9    4.7    0.9    5.8    0.4    7.1
B. napus         0.2    40.8   0.2    41.1   -      35.9   0.2    36.1   -      4.4    -      4.4    -      4.3    -      4.3    -      9.1    -      9.1
S. lycopersicum  0.4    17.7   2.0    20.1   0.4    40.2   0.8    41.4   -      4.4    -      4.4    -      6.0    0.8    6.8    0.8    17.3   -      18.1
S. tuberosum     0.3    22.4   0.6    23.3   0.3    34.0   -      34.3   -      4.4    -      4.4    -      6.1    0.3    6.4    -      20.1   -      20.1
O. sativa        0.5    14.9   0.3    15.7   0.7    53.9   0.1    54.7   0.0    6.0    0.1    6.1    0.3    9.3    0.1    9.7    0.1    10.3   0.0    10.4
S. bicolor       0.2    18.5   -      18.7   0.2    35.2   -      35.4   -      10.2   -      10.2   -      14.6   -      14.6   0.2    18.1   -      18.3
T. aestivum      0.5    26.5   0.4    27.3   0.5    34.0   0.5    35.0   0.2    13.3   0.1    13.7   0.1    11.3   0.6    12.1   -      7.6    0.1    7.7
Z. mays          0.5    16.0   0.5    17.0   0.2    34.5   -      34.6   0.1    10.7   0.4    11.2   0.1    16.2   0.2    16.4   0.1    17.4   -      17.5
S. officinarum   -      18.7   -      18.7   -      36.4   -      36.4   -      8.4    -      8.4    -      14.5   -      14.5   -      16.9   -      16.9
H. vulgare       0.5    14.6   0.2    15.4   0.7    35.1   0.3    36.1   0.4    15.7   0.3    16.5   0.2    13.8   0.1    14.2   0.1    12.8   -      13.0
Average          1.7    19.4   0.8    21.9   1.7    37.8   0.7    40.2   0.2    7.8    0.2    8.2    0.3    9.7    0.4    10.4   0.2    13.5   0.1    13.8

Minisatellites   Heptamer                    Octamer                     Nonamer                     Decamer                     Overall
                 5'UTR  CDS    3'UTR  Total  5'UTR  CDS    3'UTR  Total  5'UTR  CDS    3'UTR  Total  5'UTR  CDS    3'UTR  Total  5'UTR  CDS    3'UTR  Total
A. thaliana      1.0    0.9    0.8    2.8    0.6    0.3    0.3    1.2    0.1    1.0    -      1.1    0.1    0.2    0.1    0.3    34.0   52.5   13.5   100.0
B. napus         -      3.8    -      3.8    -      0.5    -      0.5    -      0.6    -      0.6    -      0.2    -      0.2    0.2    99.5   0.3    100.0
S. lycopersicum  0.8    8.4    -      9.2    -      -      -      -      -      -      -      -      -      -      -      -      2.4    94.0   3.6    100.0
S. tuberosum     0.6    9.0    0.3    9.9    -      0.6    -      0.6    -      1.2    -      1.2    -      -      -      -      1.2    97.7   1.2    100.0
O. sativa        0.1    2.0    0.0    2.1    0.0    0.2    -      0.2    -      0.7    -      0.7    -      0.3    -      0.3    1.7    97.6   0.6    100.0
S. bicolor       -      2.4    -      2.4    -      -      -      -      -      0.4    -      0.4    -      -      -      -      0.6    99.4   -      100.0
T. aestivum      -      3.5    -      3.5    -      0.2    -      0.2    -      0.4    -      0.4    -      0.1    -      0.1    1.3    97.0   1.7    100.0
Z. mays          0.1    3.0    -      3.1    -      -      -      -      -      0.2    -      0.2    -      -      -      -      1.0    98.0   1.1    100.0
S. officinarum   -      4.2    -      4.2    -      -      -      -      -      0.6    -      0.6    -      0.3    -      0.3    -      100.0  -      100.0
H. vulgare       0.1    3.6    -      3.7    -      0.2    -      0.2    -      1.0    -      1.0    -      -      -      -      2.1    96.9   1.0    100.0
Average          0.3    4.1    0.1    4.5    0.1    0.2    0.0    0.3    0.0    0.6    -      0.6    0.0    0.1    0.0    0.1    4.4    93.3   2.3    100.0

Table 4. Distribution of di-, tri- and tetramer motifs, percentage occurrence per species and average occurrence per family. Brassicaceae Solanaceae Poaceae

Ara Bra Average Lyc Sol Average Ory Sor Tri Zea Sac HorDimersAG/CT 2.46 16.93 9.69 AT/AT 8.55 8.04 8.29 AG/CT 6.38 5.15 9.06 6.56 6.63 6.57GA/TC 1.64 16.14 8.89 TA/TA 5.13 6.25 5.69 GA/TC 5.46 5.35 10.19 5.15 3.92 3.62AT/AT 1.80 4.11 2.96 GA/TC 1.71 4.76 3.24 AT/AT 1.31 1.39 1.01 1.83 2.71 1.25TA/TA 0.98 2.22 1.60 AG/CT 3.42 2.98 3.20 CA/TG 0.56 2.38 2.89 0.75 1.51 1.36GT/AC 0.49 0.79 0.64 GT/AC 0.00 0.60 0.30 GT/AC 0.59 2.38 2.77 0.50 1.20 1.36CA/TG 0.16 0.79 0.48 CA/TG 0.00 0.30 0.15 TA/TA 0.92 1.98 1.26 1.58 2.41 0.57GC/GC 0.00 0.00 0.00 GC/GC 0.00 0.00 0.00 GC/GC 0.00 0.00 0.00 0.00 0.30 0.23CG/CG 0.00 0.00 0.00 CG/CG 0.00 0.00 0.00 CG/CG 0.07 0.00 0.13 0.00 0.00 0.11TrimersGAA/TTC 12.13 4.59 8.36 GAA/TTC 3.85 5.65 4.75 CCG/CGG 11.41 5.15 2.52 5.81 4.22 6.23AAG/CTT 9.51 3.96 6.73 AGA/TCT 3.85 5.36 4.60 CGC/GCG 10.47 4.75 3.02 5.98 6.02 4.87AGA/TCT 8.85 4.59 6.72 ATA/TAT 5.13 3.57 4.35 GCC/GGC 6.11 4.95 3.27 5.81 6.93 3.28ATC/GAT 7.54 2.22 4.88 AAT/ATT 4.27 2.98 3.62 CAG/CTG 1.87 2.77 2.64 2.41 3.31 2.60TCA/TGA 4.59 2.37 3.48 AAG/CTT 3.42 3.57 3.50 GCA/TGC 1.47 2.77 2.01 2.16 1.81 2.83CAA/TTG 4.75 1.90 3.33 TAA/TTA 2.99 1.19 2.09 CTC/GAG 3.77 1.19 1.89 1.49 2.41 2.15ATG/CAT 4.43 1.74 3.08 CAA/TTG 2.14 1.19 1.66 AGC/GCT 1.47 2.18 1.26 1.16 2.41 2.27AAC/GTT 4.10 1.27 2.68 CTC/GAG 2.14 0.60 1.37 AGG/CCT 2.50 1.19 1.89 1.24 0.30 1.59ACA/TGT 3.93 1.11 2.52 CAG/CTG 2.14 0.60 1.37 GGA/TCC 2.57 0.99 1.13 1.74 1.20 0.79GGA/TCC 3.44 0.79 2.12 TCA/TGA 0.85 1.79 1.32 AAG/CTT 1.51 0.59 1.64 0.41 0.30 1.59AGG/CCT 1.31 2.06 1.68 ACA/TGT 1.71 0.89 1.30 CAA/TTG 0.29 0.40 3.02 0.41 1.20 0.68CTC/GAG 1.15 2.22 1.68 CAC/GTG 2.14 0.30 1.22 CCA/TGG 1.38 1.39 0.38 0.75 0.60 1.13ACC/GGT 2.13 0.63 1.38 ATC/GAT 1.71 0.60 1.15 CGA/TCG 1.58 0.99 0.38 1.58 0.30 0.34CCA/TGG 1.48 1.11 1.29 CCA/TGG 0.85 1.19 1.02 CAC/GTG 0.99 0.79 0.75 0.58 0.90 1.13CAC/GTG 1.31 0.32 0.81 CCG/CGG 1.71 0.30 1.00 GAC/GTC 0.99 0.40 0.50 1.00 1.20 0.68GCA/TGC 0.16 0.95 0.56 GGA/TCC 0.85 0.89 0.87 AGA/TCT 1.35 0.20 
1.01 0.33 0.60 0.79TAA/TTA 0.00 0.95 0.47 ACC/GGT 0.43 1.19 0.81 GAA/TTC 1.40 0.40 1.64 0.17 0.00 0.68ACT/AGT 0.66 0.16 0.41 GCA/TGC 0.43 0.89 0.66 ACC/GGT 1.29 0.40 0.88 0.33 0.60 0.23AAT/ATT 0.16 0.63 0.40 ATG/CAT 0.85 0.30 0.58 ACG/CGT 0.79 1.39 0.13 0.50 0.60 0.11CAG/CTG 0.33 0.32 0.32 AGC/GCT 0.43 0.30 0.36 ACA/TGT 0.14 0.20 1.89 0.08 0.60 0.45AGC/GCT 0.33 0.32 0.32 GTA/TAC 0.00 0.60 0.30 ATC/GAT 0.32 0.59 0.38 0.25 0.00 0.57GAC/GTC 0.33 0.32 0.32 GAC/GTC 0.43 0.00 0.21 TCA/TGA 0.32 0.20 0.38 0.00 0.30 0.57CCG/CGG 0.16 0.47 0.32 ACT/AGT 0.43 0.00 0.21 AAC/GTT 0.14 0.00 0.88 0.17 0.30 0.00GCC/GGC 0.00 0.47 0.24 CGC/GCG 0.00 0.30 0.15 ATG/CAT 0.25 0.20 0.25 0.08 0.00 0.57ATA/TAT 0.00 0.47 0.24 GCC/GGC 0.00 0.30 0.15 ATA/TAT 0.14 0.40 0.50 0.17 0.00 0.00GTA/TAC 0.33 0.00 0.16 AAC/GTT 0.00 0.30 0.15 AAT/ATT 0.25 0.00 0.13 0.17 0.30 0.11CTA/TAG 0.33 0.00 0.16 AGG/CCT 0.00 0.00 0.00 ACT/AGT 0.11 0.59 0.13 0.00 0.00 0.00CGA/TCG 0.16 0.16 0.16 CGA/TCG 0.00 0.00 0.00 TAA/TTA 0.18 0.00 0.13 0.41 0.00 0.00CGC/GCG 0.00 0.00 0.00 ACG/CGT 0.00 0.00 0.00 GTA/TAC 0.07 0.20 0.38 0.00 0.00 0.00ACG/CGT 0.00 0.00 0.00 CTA/TAG 0.00 0.00 0.00 CTA/TAG 0.09 0.20 0.13 0.00 0.00 0.00TetramersAAGA/TCTT 0.33 0.47 0.40 TAAA/TTTA 0.85 0.89 0.87 CCTC/GAGG 0.09 0.40 0.50 0.17 0.00 0.79AAAC/GTTT 0.33 0.32 0.32 TTAA/TTAA 0.85 0.30 0.58 AGGA/TCCT 0.14 0.00 0.13 0.17 0.60 0.57GAAA/TTTC 0.33 0.32 0.32 AAGA/TCTT 0.43 0.60 0.51 CATC/GATG 0.27 0.00 0.50 0.25 0.00 0.57AGGA/TCCT 0.16 0.16 0.16 AAAG/CTTT 0.00 0.60 0.30 CACG/CGTG 0.09 0.20 0.13 0.08 0.60 0.45CAAA/TTTG 0.16 0.16 0.16 AGAT/ATCT 0.00 0.60 0.30 AAAG/CTTT 0.14 0.20 0.00 0.08 0.90 0.23CATA/TATG 0.16 0.16 0.16 AAAT/ATTT 0.43 0.00 0.21 ATGC/GCAT 0.00 0.00 0.38 0.33 0.00 0.79AAAG/CTTT 0.00 0.32 0.16 AATT/AATT 0.43 0.00 0.21 CATA/TATG 0.14 0.00 0.50 0.41 0.30 0.11AACA/TGTT 0.00 0.32 0.16 ATTA/TAAT 0.43 0.00 0.21 TCCA/TGGA 0.11 0.00 0.50 0.50 0.00 0.34ACAA/TTGT 0.00 0.32 0.16 CCTC/GAGG 0.43 0.00 0.21 CTGC/GCAG 0.02 0.59 0.38 0.33 0.00 0.11

Ara (Arabidopsis thaliana), Bra (Brassica napus), Lyc (Solanum lycopersicum), Sol (Solanum tuberosum), Ory (Oryza sativa), Sor (Sorghum bicolor), Tri (Triticum aestivum), Zea (Zea mays), Sac (Saccharum officinarum) and Hor (Hordeum vulgare).

Table 5. Distribution of penta- to decamer motifs, percentage occurrence per species and average occurrence per family.

Brassicaceae Solanaceae PoaceaeAra Bra Average Lyc Sol Average Ory Sor Tri Zea Sac Hor Average

PentamersGAAAA/TTTTC 0.16 0.47 0.32 AAAAT/ATTTT 0.85 0.30 0.58 CTCTC/GAGAG 0.34 0.59 0.00 0.25 0.30 0.68 0.36AAAAT/ATTTT 0.16 0.32 0.24 AAAAG/CTTTT 0.85 0.00 0.43 GAGGA/TCCTC 0.32 0.00 0.38 0.17 0.00 0.57 0.24AAAAC/GTTTT 0.00 0.47 0.24 AGAAG/CTTCT 0.43 0.30 0.36 CTTCC/GGAAG 0.07 0.20 0.25 0.17 0.60 0.11 0.23CAAAA/TTTTG 0.33 0.00 0.16 ATAAA/TTTAT 0.43 0.30 0.36 GGAGA/TCTCC 0.25 0.20 0.13 0.33 0.00 0.34 0.21GAATC/GATTC 0.00 0.32 0.16 GAAAA/TTTTC 0.43 0.30 0.36 AGGAG/CTCCT 0.29 0.20 0.13 0.33 0.00 0.23 0.20AAATA/TATTT 0.16 0.00 0.08 CAAAC/GTTTG 0.00 0.60 0.30 AGAGG/CCTCT 0.32 0.00 0.25 0.17 0.00 0.34 0.18ACAAA/TTTGT 0.16 0.00 0.08 AAATA/TATTT 0.43 0.00 0.21 CTCCC/GGGAG 0.16 0.00 0.13 0.17 0.60 0.00 0.18ACAAC/GTTGT 0.16 0.00 0.08 AAATC/GATTT 0.43 0.00 0.21 CACCA/TGGTG 0.00 0.00 0.38 0.33 0.30 0.00 0.17ACTAG/CTAGT 0.16 0.00 0.08 AACTG/CAGTT 0.43 0.00 0.21 AGAAG/CTTCT 0.09 0.20 0.25 0.00 0.00 0.45 0.17TGTTC/GAACA 0.16 0.00 0.08 AATAA/TTATT 0.43 0.00 0.21 AGGGG/CCCCT 0.18 0.00 0.25 0.08 0.00 0.45 0.16HexamersGATGAA/TTCATC 0.33 0.16 0.24 GGTGGA/TCCACC 0.00 2.38 1.19 CGGCGA/TCGCCG 0.38 0.20 0.13 0.25 0.30 0.11 0.23AAAACA/TGTTTT 0.00 0.47 0.24 GAAGTA/TACTTC 0.85 0.60 0.72 GCACCA/TGGTGC 0.09 0.00 0.25 0.17 0.60 0.00 0.19AAGGAG/CTCCTT 0.33 0.00 0.16 AGCAGG/CCTGCT 0.85 0.30 0.58 AGGCGG/CCGCCT 0.25 0.20 0.13 0.25 0.00 0.23 0.17AGCCTC/GAGGCT 0.33 0.00 0.16 CAGCAA/TTGCTG 0.43 0.60 0.51 CCGACG/CGTCGG 0.09 0.00 0.00 0.17 0.60 0.11 0.16ATCACC/GGTGAT 0.33 0.00 0.16 CCAACA/TGTTGG 0.85 0.00 0.43 CCGTCG/CGACGG 0.18 0.00 0.13 0.17 0.30 0.11 0.15ATGAAG/CTTCAT 0.33 0.00 0.16 CCTATC/GATAGG 0.85 0.00 0.43 GCCTCC/GGAGGC 0.18 0.40 0.13 0.17 0.00 0.00 0.14CATCAC/GTGATG 0.33 0.00 0.16 GGATGA/TCATCC 0.85 0.00 0.43 GCCACC/GGTGGC 0.02 0.40 0.00 0.00 0.30 0.11 0.14CCTCCA/TGGAGG 0.33 0.00 0.16 AGGAAG/CTTCCT 0.43 0.30 0.36 CGGCGC/GCGCCG 0.05 0.59 0.00 0.17 0.00 0.00 0.13CCTGAG/CTCAGG 0.33 0.00 0.16 ATGAAG/CTTCAT 0.43 0.30 0.36 CGACGC/GCGTCG 0.07 0.40 0.00 0.33 0.00 0.00 0.13GAATCC/GGATTC 0.33 0.00 0.16 
CAACCT/AGGTTG 0.43 0.30 0.36 GGAGCC/GGCTCC 0.00 0.20 0.13 0.17 0.30 0.00 0.13HeptamersACACAAA/TTTGTGT 0.33 0.00 0.16 CTTCTCT/AGAGAAG 0.85 0.00 0.43 CCGCCGC/GCGGCGG 0.18 0.20 0.00 0.00 0.00 0.11 0.08GAGAGAA/TTCTCTC 0.16 0.16 0.16 GATCTCC/GGAGATC 0.85 0.00 0.43 CGCCGCC/GGCGGCG 0.02 0.20 0.25 0.00 0.00 0.00 0.08AGAGAGA/TCTCTCT 0.00 0.32 0.16 AAAAAAT/ATTTTTT 0.43 0.30 0.36 CCGGCGA/TCGCCGG 0.00 0.40 0.00 0.00 0.00 0.00 0.07AATTACA/TGTAATT 0.16 0.00 0.08 AAATTTA/TAAATTT 0.43 0.30 0.36 CCGCCGA/TCGGCGG 0.00 0.00 0.00 0.08 0.30 0.00 0.06ATGAGTG/CACTCAT 0.16 0.00 0.08 TCAACTA/TAGTTGA 0.00 0.60 0.30 CGGCAGG/CCTGCCG 0.02 0.00 0.00 0.00 0.30 0.00 0.05CAGCGAC/GTCGCTG 0.16 0.00 0.08 TTTTTTG/CAAAAAA 0.00 0.60 0.30 AAAATGA/TCATTTT 0.00 0.00 0.00 0.00 0.30 0.00 0.05CATTCAA/TTGAATG 0.16 0.00 0.08 AATTGAG/CTCAATT 0.43 0.00 0.21 ACGCAAG/CTTGCGT 0.00 0.00 0.00 0.00 0.30 0.00 0.05CCTCTCT/AGAGAGG 0.16 0.00 0.08 AGAAACA/TGTTTCT 0.43 0.00 0.21 AGCAGAG/CTCTGCT 0.00 0.00 0.00 0.00 0.30 0.00 0.05CTCAACT/AGTTGAG 0.16 0.00 0.08 ATCGCCG/CGGCGAT 0.43 0.00 0.21 CACGCCG/CGGCGTG 0.00 0.00 0.00 0.00 0.30 0.00 0.05TCTCAAA/TTTGAGA 0.16 0.00 0.08 ATGATTC/GAATCAT 0.43 0.00 0.21 CACTGCG/CGCAGTG 0.00 0.00 0.00 0.00 0.30 0.00 0.05OctamersATGTATGA/TCATACAT 0.16 0.00 0.08 AAGAAAAA/TTTTTCTT 0.00 0.30 0.15 GAAGTCAA/TTGACTTC 0.00 0.00 0.13 0.00 0.00 0.00 0.02CCCCTTCT/AGAAGGGG 0.16 0.00 0.08 TTTCTCTC/GAGAGAAA 0.00 0.30 0.15 GCGACCGA/TCGGTCGC 0.00 0.00 0.13 0.00 0.00 0.00 0.02CTTGTTCC/GGAACAAG 0.16 0.00 0.08 AAAAAAAC/GTTTTTTT 0.00 0.00 0.00 CCGCACGC/GCGTGCGG 0.00 0.00 0.00 0.00 0.00 0.11 0.02GAAGCAAG/CTTGCTTC 0.16 0.00 0.08 ACGGGCGA/TCGCCCGT 0.00 0.00 0.00 CCTATCTA/TAGATAGG 0.00 0.00 0.00 0.00 0.00 0.11 0.02AAAAAAAC/GTTTTTTT 0.00 0.16 0.08 AGAAAAAA/TTTTTTCT 0.00 0.00 0.00 CAAGAAGC/GCTTCTTG 0.05 0.00 0.00 0.00 0.00 0.00 0.01AGAAAAAA/TTTTTTCT 0.00 0.16 0.08 ATCAGGGA/TCCCTGAT 0.00 0.00 0.00 ACGGGCGA/TCGCCCGT 0.02 0.00 0.00 0.00 0.00 0.00 0.00TCTTTGTG/CACAAAGA 0.00 0.16 0.08 ATGATGTA/TACATCAT 0.00 0.00 0.00 
ATCAGGGA/TCCCTGAT 0.02 0.00 0.00 0.00 0.00 0.00 0.00AAGAAAAA/TTTTTCTT 0.00 0.00 0.00 ATGTATGA/TCATACAT 0.00 0.00 0.00 ATGATGTA/TACATCAT 0.02 0.00 0.00 0.00 0.00 0.00 0.00ACGGGCGA/TCGCCCGT 0.00 0.00 0.00 CAAGAAGC/GCTTCTTG 0.00 0.00 0.00 TCAAATTT/AAATTTGA 0.02 0.00 0.00 0.00 0.00 0.00 0.00ATCAGGGA/TCCCTGAT 0.00 0.00 0.00 CCCCTTCT/AGAAGGGG 0.00 0.00 0.00 TGGGCTTG/CAAGCCCA 0.02 0.00 0.00 0.00 0.00 0.00 0.00NonamersAAGATGAAG/CTTCATCTT 0.16 0.00 0.08 ACTCCTTCA/TGAAGGAGT 0.00 0.30 0.15 ACGACTACG/CGTAGTCGT 0.00 0.00 0.00 0.00 0.30 0.00 0.05AATGGGTGG/CCACCCATT 0.16 0.00 0.08 CAAATTACC/GGTAATTTG 0.00 0.30 0.15 AGCGAAGAA/TTCTTCGCT 0.00 0.00 0.00 0.00 0.30 0.00 0.05AGAAGGAAG/CTTCCTTCT 0.16 0.00 0.08 CAGACTATT/AATAGTCTG 0.00 0.30 0.15 AGCACCAGC/GCTGGTGCT 0.00 0.20 0.00 0.00 0.00 0.00 0.03ATGGGTGAC/GTCACCCAT 0.16 0.00 0.08 CTTCTTATC/GATAAGAAG 0.00 0.30 0.15 GGTGGTATG/CATACCACC 0.00 0.20 0.00 0.00 0.00 0.00 0.03GAAGGAGAA/TTCTCCTTC 0.16 0.00 0.08 AAAAAAAAC/GTTTTTTTT 0.00 0.00 0.00 ACCCTCTCC/GGAGAGGGT 0.00 0.00 0.13 0.00 0.00 0.00 0.02GAGAAGAAG/CTTCTTCTC 0.16 0.00 0.08 AACAGGAGA/TCTCCTGTT 0.00 0.00 0.00 CCGCTGGAT/ATCCAGCGG 0.00 0.00 0.13 0.00 0.00 0.00 0.02GAGGAAGAA/TTCTTCCTC 0.16 0.00 0.08 AAGATGAAG/CTTCATCTT 0.00 0.00 0.00 GCTGTGACC/GGTCACAGC 0.00 0.00 0.13 0.00 0.00 0.00 0.02GAGGAAGAG/CTCTTCCTC 0.16 0.00 0.08 AATGGGTGG/CCACCCATT 0.00 0.00 0.00 ACCACCAGC/GCTGGTGGT 0.00 0.00 0.00 0.00 0.00 0.11 0.02TATAATTCG/CGAATTATA 0.16 0.00 0.08 ACAGCAACA/TGTTGCTGT 0.00 0.00 0.00 ACCACGGAC/GTCCGTGGT 0.00 0.00 0.00 0.00 0.00 0.11 0.02TCTTCGTCT/AGACGAAGA 0.16 0.00 0.08 ACCACCAGC/GCTGGTGGT 0.00 0.00 0.00 CCATCCTTA/TAAGGATGG 0.00 0.00 0.00 0.00 0.00 0.11 0.02DecamersACTTTGAGTG/CACTCAAAGT 0.16 0.00 0.08 AAAAAGAAAA/TTTTCTTTTT 0.00 0.00 0.00 AAAAAGAAAA/TTTTCTTTTT 0.00 0.00 0.00 0.00 0.30 0.00 0.05CAAAGTCACT/AGTGACTTTG 0.16 0.00 0.08 ACTTTGAGTG/CACTCAAAGT 0.00 0.00 0.00 CCACGCGTCG/CGACGCGTGG 0.23 0.00 0.00 0.00 0.00 0.00 0.04TTTTTTTTCT/AGAAAAAAAA 0.00 0.16 0.08 AGCCCCAACG/CGTTGGGGCT 0.00 0.00 0.00 
TTTTTTTTCT/AGAAAAAAAA 0.00 0.00 0.13 0.00 0.00 0.00 0.02AAAAAGAAAA/TTTTCTTTTT 0.00 0.00 0.00 ATCTCCGCCG/CGGCGGAGAT 0.00 0.00 0.00 AGCCCCAACG/CGTTGGGGCT 0.05 0.00 0.00 0.00 0.00 0.00 0.01 Ara (Arabidopsis thaliana), Bra (Brassica napus), Lyc (Solanum lycopersicum), Sol (Solanum tuberosum), Ory (Oryza sativa), Sor(Sorghum bicolor), Tri (Triticum aestivum), Zea (Zea mays), Sac (Saccharum officinarum) and Hor (Hordeum vulgare)

4. Distribution and patterns of microsatellite occurrence in the whole rice genome

Genetics and Molecular Biology (ISSN 1415-4757)

ABSTRACT

The objective of this work was to describe the abundance of microsatellites in the complete sequence of the rice genome. Total occurrence and type distribution of microsatellites per chromosome were evaluated. Our results indicate that the occurrence of different loci on distinct chromosomes follows an apparent distribution pattern. The results also indicate that, if only two-mers and three-mers are selected, markers can be positioned on average every 24.9 Kb.

INTRODUCTION

Microsatellites, or SSRs (simple sequence repeats), are DNA sequences formed by the tandem repetition of motifs of one to six base pairs (Morgante and Olivieri, 1993). Microsatellite regions arise from loops or hairpin structures formed during replication, and their length can increase or decrease through errors of the DNA polymerase (Iyer et al., 2000). Initially, some authors attributed the origin of microsatellites to random processes. Currently, however, many studies describe the genomic distribution of microsatellites as non-random (Li et al., 2002). This view is based on evidence that these mutations affect chromatin organization, regulation of gene activity, recombination and DNA replication (Li et al., 2004). During the 1980s, microsatellites were studied primarily in humans (Weber and May, 1989; Litt and Luty, 1989; Tautz, 1989) and were later identified in other eukaryotic genomes, including plants (Morgante and Olivieri, 1993; Wang et al., 1994; Taramino and Tingey, 1996; McCouch et al., 2001). Microsatellites are an important class of molecular markers used to understand spatial relationships between chromosome segments, and they can be useful for evaluating temporal and evolutionary relationships between species and genera (Kashi et al., 1997).

Among the different types of molecular markers, microsatellites are very promising because of their multi-allelic nature, reproducibility, co-dominant inheritance and abundant genomic distribution. These markers are useful for integrating genetic and physical maps with sequenced genomic regions, and they give breeders and plant geneticists an efficient tool to link phenotypic and genotypic variation (Varshney et al., 2005). Several studies on microsatellites using information from ESTs, cDNAs and BACs from the rice genome have been published (Temnykh et al., 2001; McCouch et al., 1997, 2002; Morgante et al., 2002; Varshney et al., 2002; La Rota et al., 2005; IRGSP, 2005; Parida et al., 2006). However, in most of these studies the complete sequence was not yet available. With the complete and ordered sequence now available, more precise statistics on rice microsatellites can be obtained.

In the present work, the abundance of microsatellites in the rice genome was analyzed in order to clarify the rates, frequencies and distribution of the different microsatellite types on the different chromosomes, and to describe which loci are best suited for use as molecular markers.

MATERIAL AND METHODS

FASTA files containing the pseudomolecules corresponding to the twelve rice chromosomes (Oryza sativa ssp. japonica cv. Nipponbare) (IRGSP, 2005) were obtained from the NCBI (National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/) database. Microsatellite searches were performed with SSRLocator, a software developed in our laboratory (Maia et al., 2008) (www.ufpel.edu.br/~lmaia.faem). The software was configured to locate Class I (≥ 20 bp) and Class II (≥ 12 and < 20 bp) microsatellites (Temnykh et al., 2001). It was also configured to select only repeats with a minimum length of 12 bp, i.e., 12 repeats for monomers, 6 for two-mers, 4 for three-mers, 3 for four-mers and five-mers, and 2 for six-mers. Class I and Class II repeats were then stored in separate files and analyzed separately.
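The search criteria above can be illustrated with a short sketch. This is not SSRLocator's algorithm; it is a minimal regular-expression scan written under the assumption that only perfect repeats of the bases A, C, G and T are of interest, and the names `MIN_REPEATS`, `is_reducible` and `find_ssrs` are hypothetical. It applies the minimum repeat numbers listed above and labels each locus as Class I (≥ 20 bp) or Class II (≥ 12 and < 20 bp).

```python
import re

# Minimum repeat counts per motif length, so that every locus spans >= 12 bp
# (values taken from the search configuration described in the text).
MIN_REPEATS = {1: 12, 2: 6, 3: 4, 4: 3, 5: 3, 6: 2}


def is_reducible(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. ATAT = AT x 2)."""
    k = len(motif)
    return any(k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k))


def find_ssrs(sequence):
    """Return (start, motif, repeats, length_bp, ssr_class) for perfect repeats."""
    sequence = sequence.upper()
    loci = []
    for k, min_rep in MIN_REPEATS.items():
        # One motif of k bases followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if is_reducible(motif):
                continue  # the same stretch is reported under its shorter motif
            length = match.end() - match.start()
            ssr_class = "I" if length >= 20 else "II"
            loci.append((match.start(), motif, length // k, length, ssr_class))
    return sorted(loci)


# Illustrative sequence: (AT)10 is 20 bp (Class I), (ACG)4 is 12 bp (Class II).
example = "GG" + "AT" * 10 + "CC" + "ACG" * 4 + "TT"
for locus in find_ssrs(example):
    print(locus)
```

The sketch ignores imperfect repeats and loci that overlap a longer repeat of another motif size, both of which a dedicated tool would need to handle.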

RESULTS AND DISCUSSION

Overall occurrence

Table 1 shows the size (Mb) and the genome percentage covered by each rice chromosome, followed by the total occurrence of each microsatellite type and the number of loci per million base pairs (loci/Mb) per chromosome.

Chromosome 1 is the largest (43.3 Mb) and represents 11.7% of total genome size. The smallest genome fractions are found in chromosomes 9 and 10, each measuring 22.7 Mb and representing 6.1% of the total genome. Total occurrence of microsatellites was 484,613 loci, including both classes (I and II) and all types of microsatellites. The most and least common types were six-mers (290,360) and five-mers (15,924), respectively. The highest and lowest percentage values were 59.9% (six-mers) and 3.3% (five-mers and monomers), respectively (Table 1). The highest average density was 780.2 loci/Mb (six-mers) and the lowest was 42.5 loci/Mb (five-mers). The overall average occurrence was 1,301 loci/Mb.
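The density figures can be reproduced from the totals in Table 1. The sketch below, with hypothetical variable names, uses the per-chromosome total frequencies from Table 1. It suggests that the reported overall average (about 1,301 loci/Mb) corresponds to the unweighted mean of the twelve chromosome densities, which is slightly lower than the pooled density obtained by dividing all loci by the total genome size.

```python
# Per-chromosome total densities (loci/Mb), chromosomes 1 to 12, from Table 1.
chromosome_density = [1377.2, 1374.0, 1387.8, 1253.2, 1353.4, 1360.5,
                      1307.3, 1332.9, 1322.5, 1318.2, 946.5, 1284.8]

total_loci = 484_613      # all microsatellite loci (Classes I and II)
six_mer_loci = 290_360    # six-mer loci
genome_mb = 370.8         # summed chromosome sizes (Mb)

# Unweighted mean of the chromosome densities.
mean_density = sum(chromosome_density) / len(chromosome_density)

# Pooled density: every locus divided by the whole genome length.
pooled_density = total_loci / genome_mb

# Share of six-mers among all loci (percent).
six_mer_share = 100 * six_mer_loci / total_loci

print(f"mean of chromosome densities: {mean_density:.1f} loci/Mb")
print(f"pooled genome density:        {pooled_density:.1f} loci/Mb")
print(f"six-mer share:                {six_mer_share:.1f} %")
```

The gap between the two density values comes mainly from chromosome 11, whose low density pulls the unweighted mean down.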

Looking at some rice BACs, Morgante et al. (2002) found a minimum of 118.4

five-mers and a maximum of 321.3 three-mers per Mb, respectively. For gene-rich

maize BACs, the maximum frequency found was 267.5 loci/Mb (tetramers) and the

minimum frequency 108.4 was 48.0 loci/Mb (monomers). Gene-poor BAC results

were 15.4 monomer loci/Mb and 171.4 three-mer loci/Mb for minimum and maximum

frequency, respectively. A survey of rice genomic sequences (474 Mb) revealed a range from 7.1 monomer loci to 807.8 six-mer loci per Mb. An Arabidopsis survey indicated that microsatellite loci range from 32.0 to 733.0 for five-mers and six-mers, respectively. Another survey involving Medicago truncatula (77.1 Mb) genomic clones also showed the highest (733.9 loci/Mb) and the lowest (32.0 loci/Mb) frequencies for six-mers and five-mers, respectively (Mun et al., 2006). The results of the above-mentioned reports agree with those obtained in the present work. The major differences found between rice, M. truncatula and Arabidopsis were for monomers, which were 2.5 to 4 times less frequent in rice, and for three-mers, which were twice as frequent in rice.

The overall results of this study indicate a common pattern of occurrence across chromosomes, except for chromosomes 4 and 11, where lower frequencies were found (Table 2).

Separate Class I and Class II occurrences

Table 2 shows the distribution and occurrence of Class I and Class II microsatellites per chromosome, followed by their percentage values and locus frequency per Mb. A total of 22,581 Class I and 462,032 Class II microsatellites were detected, corresponding to 5% and 95% of total occurrences, respectively.

Class I microsatellites comprised 826 monomer, 10,542 two-mer, 4,949 three-mer, 2,345 four-mer, 2,654 five-mer and 1,265 six-mer loci. The most frequent type was two-mers, corresponding to 46.88% of overall Class I occurrence. Six-mers accounted for 5.60% of overall Class I occurrence, and monomers (826 loci) were the least frequent type. The density of Class I microsatellites ranged from 2.2 loci/Mb (monomers) to 28.4 loci/Mb (two-mers), with an average total of 60.6 loci/Mb. For Class II microsatellites, the density ranged from 41.0 loci/Mb (monomers) to 776.8 loci/Mb (six-mers), with an average total density of 1,240 loci/Mb.
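The Class I figures above can be cross-checked with a few lines of arithmetic. The sketch below (variable names are hypothetical) uses only the pooled counts given in the text. Shares computed this way may differ slightly from percentages averaged over chromosomes.

```python
# Class I loci per repeat type, as listed in the text.
class_i = {"mono": 826, "di": 10_542, "tri": 4_949,
           "tetra": 2_345, "penta": 2_654, "hexa": 1_265}
class_ii_total = 462_032

class_i_total = sum(class_i.values())
all_loci = class_i_total + class_ii_total

# Share of each repeat type within Class I (percent).
share = {name: 100 * count / class_i_total for name, count in class_i.items()}

# Repeat types ordered from most to least frequent.
ranking = sorted(class_i, key=class_i.get, reverse=True)

print("Class I total:", class_i_total)
print("Class I share of all loci: %.1f %%" % (100 * class_i_total / all_loci))
for name in ranking:
    print(f"{name:>5}: {class_i[name]:>6} loci, {share[name]:.2f} % of Class I")
```

The per-type counts add up to the 22,581 Class I loci, and together with the Class II total they recover the 484,613 loci reported for the whole genome.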

In the survey by Mun et al. (2006), Class I microsatellites showed a frequency range from 2.3 loci/Mb (monomers) to 29.8 loci/Mb (two-mers) for rice, and from 1.4 loci/Mb (four-mers) to 21.4 loci/Mb (two-mers) for Arabidopsis.

Figure 2 shows the average percentage of Class I microsatellites. The least frequent types were monomers (3.53%) and the most frequent were two-mers (47.00%). When both classes are considered, however, six-mers become the most frequent repeat type (59.99%) (Figure 1).

Microsatellite sizes

Table 3 shows the average sizes of Class I and Class II microsatellites for each chromosome. For Class I, sizes ranged from 21.4 bp (five-mers) to 39.5 bp (two-mers). For Class II, sizes ranged from 12.5 bp (four-mers) to 15 bp (five-mers). The overall averages were 27.5 bp and 13.3 bp for Class I and Class II, respectively. Table 3 also shows that average microsatellite sizes vary little among chromosomes.

Considering that Class I microsatellites are reported as the most useful molecular markers, two-mer and three-mer loci are the best candidates, since they presented the longest repeat sequences on average.

Distances between microsatellites

Table 4 presents the average distances between microsatellites (Kb) per class, type and chromosome. For both classes, distances between microsatellites were shortest for three-mers (4.6 Kb) and longest for monomers (445.8 Kb).

The overall average distance between microsatellite loci was 11.5 Kb, while for Class I this average was 189.9 Kb. Regarding the distribution per chromosome, the shortest average distance between microsatellites considering both classes was found in chromosome 1 (10.5 Kb) and the longest in chromosome 4 (12.6 Kb). When only Class I microsatellites are considered, the shortest and longest distances were 162.8 and 215.9 Kb, found in chromosomes 1 and 10, respectively.

CONCLUSION

The results provide a general view of the abundance and distribution patterns of microsatellites in the rice genome. Previous reports were based on genomic samples or on unordered sequences from pseudomolecules representing rice chromosomes.

In the initial analysis, where both Class I and Class II (≥ 12 bp) were considered, the overall frequency of each microsatellite type in each chromosome was assessed. In this analysis, six-mers were the most abundant type (59.9%). This could be because any six-mer locus with two or more repeats was detected. Also in this analysis, the comparison with A. thaliana and M. truncatula indicated similar frequencies for two-, four-, five- and six-mers and contrasting frequencies for monomers and three-mers among the three species.

For the second part of the analysis, where only Class I repeats (≥ 20 bp) were considered, the most abundant types were two-mers (28.4 loci/Mb) and three-mers (13.2 loci/Mb), with average distances of 34.4 Kb between two-mers and 74.8 Kb between three-mers. Besides being more frequent, these repeats were also the longest, with average lengths of 39.5 bp and 28.5 bp for two-mers and three-mers, respectively. Using both types together, one can reach an average coverage of three loci every 74.8 Kb, or one locus every 24.9 Kb. This coverage represents an excellent supply of markers to saturate any targeted genomic region in mapping studies.
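The marker spacing quoted above can be approximated from the densities. The sketch below assumes that loci are spread evenly along the genome, so that the mean spacing in Kb is 1,000 divided by the density in loci/Mb; this is a simplification, and the function name `spacing_kb` is hypothetical. It also reproduces the estimate in the text, which divides the 74.8 Kb three-mer interval among three loci.

```python
def spacing_kb(loci_per_mb):
    """Mean distance between loci (Kb), assuming they are evenly spaced."""
    return 1000.0 / loci_per_mb


two_mer_density = 28.4    # Class I two-mers (loci/Mb)
three_mer_density = 13.2  # Class I three-mers (loci/Mb)

two_mer_spacing = spacing_kb(two_mer_density)
three_mer_spacing = spacing_kb(three_mer_density)

# Both types used together: densities add, so the spacing shrinks.
combined_spacing = spacing_kb(two_mer_density + three_mer_density)

# Estimate used in the text: three loci in every 74.8 Kb interval.
text_estimate = 74.8 / 3

print(f"two-mers only:      {two_mer_spacing:.1f} Kb")
print(f"three-mers only:    {three_mer_spacing:.1f} Kb")
print(f"both types:         {combined_spacing:.1f} Kb")
print(f"estimate from text: {text_estimate:.1f} Kb")
```

Both routes give a spacing of roughly 24 to 25 Kb between Class I two-mer and three-mer loci.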

Finally, the data on the loci detected, the average distance between loci, repeat size and the ratio of different types in the rice genome suggest a similar distribution across the 12 chromosomes.

REFERENCES:

Iyer RR, Pluciennik A, Rosche WA, Sinden RR, Wells RD (2000) DNA polymerase III proofreading mutants enhance the expansion and deletion of triplet repeat sequences in Escherichia coli. J Biol Chem. 275(3):2174-2184.

IRGSP. (2005) The map-based sequence of the rice genome. Nature.

11;436(7052):793-800.

Kashi Y, King D, Soller M. (1997) Simple sequence repeats as a source of

quantitative genetic variation. Trends Genet. 13(2):74-8.

La Rota M, Kantety RV, Yu JK, Sorrells ME (2005) Nonrandom distribution and

frequencies of genomic and EST-derived microsatellite markers in rice, wheat, and

barley. BMC Genomics. 18;6(1):23.

Li YC, Korol AB, Fahima T, Beiles A, Nevo E. (2002) Microsatellites: genomic

distribution, putative functions and mutational mechanisms: a review. Mol Ecol.

11(12):2453-65.

Li YC, Korol AB, Fahima T, Nevo E (2004) Microsatellites within genes: structure,

function, and evolution. Mol Biol Evol. 21(6):991-1007.

McCouch SR, Chen X, Panaud O, Temnykh S, Xu Y, Cho YG, Huang N, Ishii T, Blair

M (1997). Microsatellite marker development, mapping and applications in rice

genetics and breeding. Plant Mol Biol. 35(1-2):89-99.

McCouch SR, Teytelman L, Xu Y, Lobos KB, Clare K, Walton M, Fu B, Maghirang R,

Li Z, Xing Y, Zhang Q, Kono I, Yano M, Fjellstrom R, DeClerck G, Schneider D,

Cartinhour S, Ware D, Stein L. (2002) Development and mapping of 2240 new SSR

markers for rice (Oryza sativa L.). DNA Res. 9(6):199-207.

Morgante M, Hanafey M, Powell W. (2002) Microsatellites are preferentially

associated with nonrepetitive DNA in plant genomes. Nat Genet. 30(2):194-200.

Morgante M, Olivieri AM (1993) PCR-amplified microsatellites as markers in plant

genetics. Plant J 1: 175-182.

Mun JH, Kim DJ, Choi HK, Gish J, Debellé F, Mudge J, Denny R, Endré G, Saurat O,

Dudez AM, Kiss GB, Roe B, Young ND, Cook DR. (2006) Distribution of

microsatellites in the genome of Medicago truncatula: a resource of genetic markers

that integrate genetic and physical maps. Genetics. 172(4):2541-55.

Parida SK, Anand Raj Kumar K, Dalal V, Singh NK, Mohapatra T. (2006) Unigene

derived microsatellite markers for the cereal genomes. Theor Appl Genet.

112(5):808-17.

Taramino G, Tingey S. (1996) Simple sequence repeats for germplasm analysis and

mapping in maize. Genome. 39(2):277-87.

Tautz D. (1989) Hypervariability of simple sequences as a general source for

polymorphic DNA markers. Nucleic Acids Res. 17(16):6463-71.

Temnykh S, DeClerck G, Lukashova A, Lipovich L, Cartinhour S, McCouch S. (2001)

Computational and experimental analysis of microsatellites in rice (Oryza sativa L.):

frequency, length variation, transposon associations, and genetic marker potential.

Genome Res. 11(8):1441-52.

Varshney RK, Graner A, Sorrells ME. (2005) Genic microsatellite markers in plants:

features and applications. Trends Biotechnol. 23(1):48-55.

Varshney RK, Thiel T, Stein N, Langridge P, Graner A. (2002) In silico analysis on

frequency and distribution of microsatellites in ESTs of some cereal species. Cell

Mol Biol Lett. 7(2A):537-46.

Wang Z, Weber JL, Zhong G, Tanksley SD (1994) Survey of plant short tandem DNA

repeats. Theor Appl Genet 88: 1-6.

Weber JL, May PE. (1989) Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am J Hum Genet 44: 388-396.

Table 1. Total amounts of microsatellite types (≥ 12 bp)* in the twelve chromosomes.

                      Mono-               Di-                 Tri-                Tetra-              Penta-              Hexa-               Total
Chr.     Mb     %     Amount   Frequency  Amount   Frequency  Amount   Frequency  Amount   Frequency  Amount   Frequency  Amount   Frequency  Amount   Frequency
1        43.3   11.7  2,183    50.5       4,326    100.0      9,472    218.9      5,943    137.4      2,118    49.0       35,537   821.4      59,579   1,377.2
2        36.0   9.7   1,774    49.3       3,442    95.7       8,178    227.5      4,966    138.1      1,716    47.7       29,327   815.7      49,403   1,374.0
3        36.2   9.8   1,785    49.3       3,504    96.8       8,337    230.4      5,073    140.2      1,698    46.9       29,830   824.2      50,227   1,387.8
4        35.5   9.6   1,514    42.6       3,137    88.4       7,077    199.4      4,196    118.2      1,336    37.6       27,226   767.0      44,486   1,253.2
5        29.7   8.0   1,331    44.8       2,847    95.7       7,012    235.8      3,842    129.2      1,346    45.3       23,868   802.6      40,246   1,353.4
6        30.7   8.3   1,368    44.5       3,154    102.6      7,085    230.5      4,004    130.3      1,399    45.5       24,800   807.0      41,810   1,360.5
7        29.6   8.0   1,284    43.3       2,681    90.4       6,231    210.2      3,851    129.9      1,372    46.3       23,334   787.1      38,753   1,307.3
8        28.4   7.7   1,251    44.0       2,795    98.3       6,255    220.0      3,743    131.6      1,169    41.1       22,688   797.9      37,901   1,332.9
9        22.7   6.1   958      42.2       2,277    100.3      4,958    218.4      2,967    130.7      888      39.1       17,969   791.7      30,017   1,322.5
10       22.7   6.1   881      38.8       2,268    100.0      4,942    217.8      2,778    122.5      962      42.4       18,074   796.7      29,905   1,318.2
11       28.4   7.7   768      27.1       2,083    73.4       4,041    142.4      2,684    94.6       802      28.3       16,491   580.9      26,869   946.5
12       27.6   7.4   1,133    41.1       2,797    101.5      5,682    206.1      3,471    125.9      1,118    40.6       21,216   769.6      35,417   1,284.8
Total    370.8  -     16,230   -          35,311   -          79,270   -          47,518   -          15,924   -          290,360  -          484,613  -
Average         8.3   1,353    43.1       2,943    95.3       6,606    213.1      3,960    127.4      1,327    42.5       24,197   780.2      40,384   1,301.5
%                     3.3                 7.3                 16.4                9.8                 3.3                 59.9                -        -

*Class I microsatellites: ≥ 20 bp. Class II microsatellites: ≥ 12 bp and < 20 bp.

Table 2. Distributions, percentage and frequency of different microsatellite types within Classes I and II in the twelve chromosomes.

Chromossome/Type Mono- Di- Tri- Tetra- Penta- Hexa- TotalClasses I II I II I II I II I II I II I II

Chr.     Row        Mono I  Mono II     Di I    Di II    Tri I   Tri II  Tetra I Tetra II  Penta I Penta II   Hexa I  Hexa II  Total I Total II
1        Occur.        136    2,047    1,312    3,014      630    8,842      284    5,659      337    1,781      155   35,382    2,854   56,725
         %            0.06     0.94     0.30     0.70     0.07     0.93     0.05     0.95     0.16     0.84     0.00     1.00     0.05     0.95
         Freq.         3.1     47.3     30.3     69.7     14.6    204.4      6.6    130.8      7.8     41.2      3.6    817.9     66.0  1,311.2
2        Occur.         99    1,675    1,066    2,376      567    7,611      268    4,698      287    1,429      125   29,202    2,412   46,991
         %            0.06     0.94     0.31     0.69     0.07     0.93     0.05     0.95     0.17     0.83     0.00     1.00     0.05     0.95
         Freq.         2.8     46.6     29.6     66.1     15.8    211.7      7.5    130.7      8.0     39.7      3.5    812.2     67.1  1,306.9
3        Occur.         94    1,691    1,145    2,359      599    7,738      231    4,842      270    1,428      127   29,703    2,466   47,761
         %            0.05     0.95     0.33     0.67     0.07     0.93     0.05     0.95     0.16     0.84     0.00     1.00     0.05     0.95
         Freq.         2.6     46.7     31.6     65.2     16.6    213.8      6.4    133.8      7.5     39.5      3.5    820.7     68.1  1,319.6
4        Occur.         68    1,446      840    2,297      408    6,669      166    4,030      220    1,116      118   27,108    1,820   42,666
         %            0.04     0.96     0.27     0.73     0.06     0.94     0.04     0.96     0.16     0.84     0.00     1.00     0.04     0.96
         Freq.         1.9     40.7     23.7     64.7     11.5    187.9      4.7    113.5      6.2     31.4      3.3    763.6     51.3  1,201.9
5        Occur.         67    1,264      846    2,001      415    6,597      190    3,652      249    1,097      103   23,765    1,870   38,376
         %            0.05     0.95     0.30     0.70     0.06     0.94     0.05     0.95     0.18     0.82     0.00     1.00     0.05     0.95
         Freq.         2.3     42.5     28.4     67.3     14.0    221.8      6.4    122.8      8.4     36.9      3.5    799.2     62.9  1,290.5
6        Occur.         62    1,306      882    2,272      424    6,661      185    3,819      243    1,156      116   24,684    1,912   39,898
         %            0.05     0.95     0.28     0.72     0.06     0.94     0.05     0.95     0.17     0.83     0.00     1.00     0.05     0.95
         Freq.         2.0     42.5     28.7     73.9     13.8    216.7      6.0    124.3      7.9     37.6      3.8    803.2     62.2  1,298.3
7        Occur.         65    1,219      801    1,880      380    5,851      178    3,673      233    1,139      103   23,231    1,760   36,993
         %            0.05     0.95     0.30     0.70     0.06     0.94     0.05     0.95     0.17     0.83     0.00     1.00     0.05     0.95
         Freq.         2.2     41.1     27.0     63.4     12.8    197.4      6.0    123.9      7.9     38.4      3.5    783.7     59.4  1,247.9
8        Occur.         65    1,186      820    1,975      406    5,849      183    3,560      191      978      116   22,572    1,781   36,120
         %            0.05     0.95     0.29     0.71     0.06     0.94     0.05     0.95     0.16     0.84     0.01     0.99     0.05     0.95
         Freq.         2.3     41.7     28.8     69.5     14.3    205.7      6.4    125.2      6.7     34.4      4.1    793.8     62.6  1,270.3
9        Occur.         58      900      712    1,565      268    4,690      175    2,792      150      738       78   17,891    1,441   28,576
         %            0.06     0.94     0.31     0.69     0.05     0.95     0.06     0.94     0.17     0.83     0.00     1.00     0.05     0.95
         Freq.         2.6     39.7     31.4     69.0     11.8    206.6      7.7    123.0      6.6     32.5      3.4    788.3     63.5  1,259.0
10       Occur.         37      844      649    1,619      292    4,650      125    2,653      156      806       85   17,989    1,344   28,561
         %            0.04     0.96     0.29     0.71     0.06     0.94     0.04     0.96     0.16     0.84     0.00     1.00     0.04     0.96
         Freq.         1.6     37.2     28.6     71.4     12.9    205.0      5.5    116.9      6.9     35.5      3.7    793.0     59.2  1,259.0
11       Occur.         26      742      628    1,455      226    3,815      164    2,520      139      663       57   16,434    1,240   25,629
         %            0.03     0.97     0.30     0.70     0.06     0.94     0.06     0.94     0.17     0.83     0.00     1.00     0.05     0.95
         Freq.         0.9     26.1     22.1     51.3      8.0    134.4      5.8     88.8      4.9     23.4      2.0    578.9     43.7    902.8
12       Occur.         49    1,084      841    1,956      334    5,348      196    3,275      179      939       82   21,134    1,681   33,736
         %            0.04     0.96     0.30     0.70     0.06     0.94     0.06     0.94     0.16     0.84     0.00     1.00     0.05     0.95
         Freq.         1.8     39.3     30.5     71.0     12.1    194.0      7.1    118.8      6.5     34.1      3.0    766.6     61.0  1,223.8
Total    Occur.        826   15,404   10,542   24,769    4,949   74,321    2,345   45,173    2,654   13,270    1,265  289,095   22,581  462,032
Average  Occur.       68.8  1,283.7    878.5  2,064.1    412.4  6,193.4    195.4  3,764.4    221.2  1,105.8    105.4 24,091.3  1,881.8 38,502.7
         %            0.05     0.95     0.30     0.70     0.06     0.94     0.05     0.95     0.17     0.83     0.00     1.00     0.05     0.95
         Freq.         2.2     41.0     28.4     66.9     13.2    200.0      6.3    121.0      7.1     35.4      3.4    776.8     60.6  1,240.9
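The % and Freq. rows appear to be derived from the occurrence counts. The following Python sketch illustrates one reading of them, under two assumptions that are inferred from the numbers rather than stated here: that % is the share of a repeat type's loci falling in each class, and that Freq. is the number of loci per megabase. The function names and the 43.26 Mb chromosome length are hypothetical and serve only as an illustration.

```python
def class_shares(class_i, class_ii):
    """Return the fractions of loci of one repeat type in Class I and Class II."""
    total = class_i + class_ii
    if total == 0:
        raise ValueError("no loci for this repeat type")
    return class_i / total, class_ii / total


def loci_per_mb(count, length_bp):
    """Return the number of loci per megabase of sequence."""
    return count / (length_bp / 1_000_000)


# Chromosome 1 mononucleotide counts from the table: 136 (Class I), 2047 (Class II).
share_i, share_ii = class_shares(136, 2047)
print(f"{share_i:.2f} {share_ii:.2f}")

# ASSUMPTION: 43.26 Mb is a hypothetical chromosome length, not a value from the text.
print(f"{loci_per_mb(136, 43_260_000):.1f}")
```

With these inputs the shares round to the 0.06 and 0.94 reported for chromosome 1, which supports the first assumption; the frequency depends entirely on the assumed length.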


Table 3. Average locus size (bp) of different microsatellite types within Classes I and II for the twelve chromosomes.

Chr.          Mono I    Mono II       Di I      Di II      Tri I     Tri II    Tetra I   Tetra II    Penta I   Penta II     Hexa I    Hexa II  Average I Average II
1               22.7       13.6       37.7       13.7       28.6       13.3       25.3       12.5       21.3       15.0       26.9       12.2       27.1       13.4
2               22.8       13.6       39.2       13.8       28.5       13.3       25.6       12.5       21.3       15.0       26.3       12.2       27.3       13.4
3               23.2       13.6       39.7       13.7       28.2       13.3       24.5       12.5       21.5       15.0       25.6       12.2       27.1       13.4
4               24.0       13.5       36.6       13.6       28.0       13.2       26.1       12.5       21.2       15.0       29.6       12.2       27.6       13.3
5               23.5       13.4       38.9       13.7       26.8       13.3       25.7       12.5       21.6       15.0       25.4       12.2       27.0       13.3
6               23.0       13.5       41.6       13.7       28.6       13.2       27.8       12.5       21.3       15.0       25.8       12.2       28.0       13.3
7               23.4       13.5       38.0       13.7       29.2       13.2       29.7       12.5       21.4       15.0       25.0       12.2       27.8       13.3
8               23.0       13.4       40.6       13.7       27.6       13.2       26.3       12.5       21.6       15.0       25.3       12.2       27.4       13.3
9               22.9       13.6       40.7       13.6       28.1       13.2       27.5       12.5       21.2       15.0       25.5       12.2       27.6       13.3
10              24.3       13.5       41.0       13.6       27.8       13.2       23.2       12.4       20.9       15.0       25.0       12.2       27.0       13.3
11              24.7       13.4       40.4       13.8       29.1       13.1       25.6       12.5       21.2       15.0       25.1       12.2       27.7       13.3
12              23.8       13.5       39.6       13.7       30.8       13.2       26.2       12.5       22.2       15.0       25.4       12.2       28.0       13.3
Average         23.4       13.5       39.5       13.7       28.5       13.2       26.1       12.5       21.4       15.0       25.9       12.2       27.5       13.3


Table 4. Average distances (kb) between microsatellite loci of each type, for Class I and Class II combined (I-II) and for Class I alone (I), in the twelve chromosomes.

Chr.        Mono I-II       Mono I      Di I-II         Di I     Tri I-II        Tri I   Tetra I-II      Tetra I   Penta I-II      Penta I    Hexa I-II       Hexa I Average I-II    Average I
1                19.8        316.2         10.0         32.9          4.6         68.6          7.3        152.1         20.4        128.3          1.2        278.9         10.5        162.8
2                20.2        353.0         10.4         33.6          4.4         63.4          7.2        134.6         20.9        125.4          1.2        285.8         10.7        166.0
3                20.2        381.1         10.3         31.5          4.3         60.4          7.1        156.5         21.3        133.7          1.2        284.1         10.7        174.5
4                23.4        521.5         11.3         42.2          5.0         87.0          8.4        213.3         26.6        160.9          1.3        300.8         12.7        220.9
5                22.1        427.5         10.4         34.9          4.2         71.1          7.7        155.1         22.0        118.5          1.2        286.8         11.3        182.3
6                22.4        462.6          9.7         34.7          4.3         72.4          7.7        165.9         21.9        125.5          1.2        263.6         11.2        187.5
7                23.0        450.4         11.0         36.9          4.7         77.9          7.7        166.1         21.6        126.3          1.3        286.1         11.5        190.6
8                22.6        423.0         10.1         34.5          4.5         69.9          7.6        154.9         24.3        148.4          1.2        244.3         11.7        179.2
9                23.5        370.2          9.9         31.6          4.6         84.3          7.6        129.2         25.5        150.6          1.3        288.8         12.1        175.8
10               25.7        591.1         10.0         34.9          4.6         77.6          8.2        181.1         23.6        145.3          1.2        265.5         12.2        215.9
11               24.3        526.7          9.8         32.7          4.8         82.5          7.9        140.4         24.6        153.7          1.3        335.5         12.1        211.9
12               24.3        526.7          9.8         32.7          4.8         82.5          7.9        140.4         24.6        153.7          1.3        335.5         12.1        211.9
Average          22.6        445.8         10.2         34.4          4.6         74.8          7.7        157.5         23.1        139.2          1.2        288.0         11.6        189.9
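One simple way to read these distances is as sequence length divided by the number of loci. The Python sketch below uses that simplification, which is an assumption: the thesis may instead have averaged the actual gaps between neighbouring loci. The function name and the 43.26 Mb chromosome length are hypothetical.

```python
def mean_spacing_kb(length_bp, n_loci):
    """Approximate the average distance between loci, in kilobases."""
    if n_loci <= 0:
        raise ValueError("n_loci must be positive")
    return length_bp / n_loci / 1000


# ASSUMPTION: hypothetical chromosome length used only for illustration.
CHR1_LENGTH_BP = 43_260_000

# Chromosome 1 mononucleotide loci: 136 in Class I and 2047 in Class II.
print(round(mean_spacing_kb(CHR1_LENGTH_BP, 136 + 2047), 1))  # Classes I and II together
print(round(mean_spacing_kb(CHR1_LENGTH_BP, 136), 1))         # Class I alone
```

Under this assumed length the combined-class value lands near the tabulated figure, while the Class I value differs slightly, so the approximation should be taken as a reading aid rather than a reproduction of the table.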


[Figure 1: bar chart. Y-axis: % occurrence (0 to 70). X-axis: microsatellite types (mono-mer, 2-mer, 3-mer, 4-mer, 5-mer, 6-mer). One series per chromosome (Chr.1 to Chr.12).]

Figure 1. Percentage occurrence of different microsatellite types (≥ 12 bp)* in the chromosomes.


[Figure 2: bar chart. Y-axis: % occurrence (0 to 60). X-axis: microsatellite types (mono-mer, 2-mer, 3-mer, 4-mer, 5-mer, 6-mer). One series per chromosome (Chr.1 to Chr.12).]

Figure 2. Percentage occurrence of different microsatellite types (≥ 20 bp) in the twelve chromosomes.

* Including both Class I and Class II.
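The length thresholds in the two figure captions (≥ 12 bp when both classes are counted, ≥ 20 bp otherwise) suggest how loci are binned by length. A minimal Python sketch of such a binning follows, assuming perfect repeats only, motifs of 1 to 6 bp, Class I meaning at least 20 bp and Class II meaning 12 to 19 bp. The function names and the regular-expression approach are illustrative and are not the pipeline used in the thesis.

```python
import re

MIN_LEN_BP = 12      # ASSUMPTION: shortest locus counted (lower bound of Class II)
CLASS_I_MIN_BP = 20  # ASSUMPTION: loci at least this long are treated as Class I


def is_primitive(motif):
    """Return True if the motif is not a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq, max_motif=6):
    """Yield (start, motif, length_bp, class) for perfect tandem repeats in seq."""
    seq = seq.upper()
    for k in range(1, max_motif + 1):
        # A k-base unit followed by one or more exact copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1+" % k)
        for match in pattern.finditer(seq):
            motif, length = match.group(1), len(match.group(0))
            if length < MIN_LEN_BP or not is_primitive(motif):
                continue
            locus_class = "I" if length >= CLASS_I_MIN_BP else "II"
            yield match.start(), motif, length, locus_class


example = "GG" + "AT" * 10 + "CC" + "GAA" * 4 + "TT"
for hit in find_ssrs(example):
    print(hit)
```

The primitive-motif check keeps a dimer run such as (AT)10 from also being reported as a tetramer or hexamer. Imperfect and compound repeats are ignored in this sketch.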


5. Final Considerations

Molecular markers are currently a tool of great importance in support of plant breeding; however, strategies must be investigated to increase their success rates in genetic mapping and marker-assisted selection studies. To that end, bioinformatics-based studies were carried out to assess the patterns and abundance of different types and arrangements of microsatellite loci in the complete genome sequence of rice and of other species.

Analysis of the complete rice genome sequence showed that microsatellite loci formed by dimers and trimers are the most abundant, and that using these two types of arrangement allows, on average, one marker to be placed every 24,900 nucleotides (24.9 kb), resulting in excellent genome coverage.

In the study that analyzed the occurrence of microsatellites in 28,469 gene sequences (fl-cDNA), a total of 3,907 mini- and microsatellite loci were found in 3,765 sequences (13.22%), and 3,329 primer sets were designed, corresponding to 85.20% of the loci found. PCR simulation with the 3,329 primer sets showed that 2,397 sets amplified only the original fragment, whereas 932 sets amplified redundant regions in addition to the original locus.
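The PCR simulation step can be pictured as counting, for each primer set, how many places in the sequence collection it could amplify. The Python sketch below is a deliberately naive version that assumes exact primer matches, a forward primer on the given strand only, and a hypothetical maximum product size. It is not the tool used in the thesis, and the primers and sequences are made up for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(text, pattern):
    """Return every start index of pattern in text, including overlapping ones."""
    hits, i = [], text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def count_amplicons(forward, reverse, sequences, max_product=1000):
    """Count forward/reverse primer site pairs that could yield a product."""
    rev_site = reverse_complement(reverse)  # how the reverse primer site reads on this strand
    total = 0
    for seq in sequences.values():
        rev_starts = find_all(seq, rev_site)
        for start in find_all(seq, forward):
            for rev_start in rev_starts:
                product = rev_start + len(rev_site) - start
                if rev_start >= start + len(forward) and product <= max_product:
                    total += 1
    return total


def classify_primer_set(forward, reverse, sequences, max_product=1000):
    """Label a primer set by its number of predicted amplicons."""
    n = count_amplicons(forward, reverse, sequences, max_product)
    if n == 0:
        return "no product"
    return "specific" if n == 1 else "redundant"


# Hypothetical toy sequences and primers, for illustration only.
toy = {
    "s1": "AAACCGGTT" + "AT" * 6 + "GGCATTCAA",
    "s2": "TTACCGGAAAGGCATTGG",
}
print(classify_primer_set("ACCGG", "AATGCC", {"s1": toy["s1"]}))
print(classify_primer_set("ACCGG", "AATGCC", toy))
```

A real simulation would also tolerate mismatches and search both strands. The sketch only shows why a primer set is called specific (one predicted product) or redundant (several).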

Comparisons among species of different families showed that AG/CT dimers were predominant in the Brassicaceae and Poaceae families, and AT/AT in the Solanaceae. Among trimer microsatellites, the motifs ATA/TAT/AAT/TTA and GAA/TTC/AGA/TCT were predominant in the brassicas and the solanaceous species, whereas in the grasses the most frequent trimers were those composed of C/G.
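Groupings such as AG/CT or ATA/TAT/AAT/TTA treat motifs that differ only in reading frame or strand as a single class. A minimal Python sketch of that convention follows, assuming equivalence under rotation and reverse complement; the function names are illustrative and do not come from the thesis.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def canonical_motif(motif):
    """Return the alphabetically smallest rotation of the motif or of its reverse complement."""
    motif = motif.upper()
    candidates = []
    for unit in (motif, reverse_complement(motif)):
        candidates.extend(unit[i:] + unit[:i] for i in range(len(unit)))
    return min(candidates)


for group in (["AG", "CT"], ["ATA", "TAT", "AAT", "TTA"], ["GAA", "TTC", "AGA", "TCT"]):
    print(group, "->", {canonical_motif(m) for m in group})
```

Each group named in the text collapses to one canonical key, so counts from different frames and strands can be pooled before families are compared.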


Finally, the overall result of the three studies indicated that microsatellite loci provide good coverage of both genic and intergenic regions of rice, and that, for transferring markers between genic regions of the grasses, loci formed by C/G trimers are the most suitable. Transfer between species within each family is supported by the patterns found within each family studied; between different families, however, only a few shared locus patterns were evident, indicating a low potential for marker transfer between evolutionarily more distant species.

Future laboratory studies will be needed to validate the results obtained in silico. Primer sets derived from the loci with the most abundant patterns among the grasses should be tested for their actual ability to amplify DNA regions in the different grasses studied, thereby confirming which marker patterns are best suited for transfer to little-studied grasses. The second need for validation of the in silico results concerns the primer sets that amplify specific and/or redundant loci in rice. Primers with a real ability to amplify specific loci can be used in mapping and marker-assisted selection strategies, and those able to amplify redundant regions (several loci at once) can be used in studies of genetic variability and diversity. After validation, both will result in auxiliary tools for plant breeding programs.




VITAE

Luciano Carlos da Maia was born on 13 July 1976 in Itapetininga-SP. He graduated in Data Processing Technology in 1995 from the Associação de Ensino de Itapetininga. From 1992 to 1999 he worked on the development of information systems for agricultural and cost management in the livestock and citrus division of Grupo Votorantim (Itapetininga-SP). He entered the Faculdade de Agronomia Eliseu Maciel (FAEM) of the Universidade Federal de Pelotas (UFPel) in March 2000. He was an intern and undergraduate research fellow (CNPq) from 2000 to 2002 at the Bacteriology Laboratory of the Department of Plant Protection, FAEM/UFPel. In 2003 he began an internship at the Centro de Genômica e Fitomelhoramento, holding a fellowship from the Fundação Delfin Mendes until he completed the course in 2004. In March 2005 he began a master's degree in Agronomy, in the area of Plant Breeding, at FAEM/UFPel, under the supervision of Professors Antonio Costa de Oliveira and Fernando Irajá Félix de Carvalho. In May 2006, having met the necessary requirements, he advanced to the doctoral level. Throughout this period he has been developing bioinformatics work in support of rice, wheat, and oat breeding, as well as other genomics and molecular biology studies of the species studied by the Plant Breeding group at FAEM/UFPel.