Introduction to SPAGeDi, a program for spatial pattern analysis of genetic diversity

SPAGeDi 1.3

a program for Spatial Pattern Analysis of Genetic Diversity

by Olivier J. HARDY and Xavier VEKEMANS

with the contribution of Reed A. CARTWRIGHT

User’s manual

Address for correspondence:

Service Eco-éthologie Evolutive, CP160/12

Université Libre de Bruxelles

50 Av. F. Roosevelt

B-1050 Brussels, Belgium

e-mail: ohardy@ulb.ac.be

Last update: 22 March 2009

Contents

1. Note about SPAGeDi 1.3 and installation

2. What is SPAGeDi ?

2.1. Purpose

2.2. How to use SPAGeDi – short overview

2.3. Data treated by SPAGeDi

2.4. Three ways to specify populations

2.5. Statistics computed

3. Creating a data file

3.1. Structure of the data file

3.2. How to code genotypes?

3.3. Example of data file

3.4. Note about distance intervals

3.5. Note about spatial groups

3.6. Note about microsatellite allele sizes

3.7. Using a matrix to define pairwise spatial distances

3.8. Defining genetic distances between alleles

3.9. Defining reference allele frequencies for relatedness coefficients

3.10. Present data size limitations

4. Running the program

4.1. Launching the program

4.2. Specifying the data / results files

4.3. Selecting the appropriate options

4.4. Information displayed during computations

5. Interpreting the results file

5.1. Basic information

5.2. Allele frequency analysis

5.3. Type of analyses

5.4. Distance intervals

5.5. Computed statistics

5.6. Permutation tests

5.7. Matrices of pairwise coefficients/distances

6. Technical notes

6.1. Statistics for individual level analyses

6.2. Statistics for population level analyses

6.3. Inference of gene dispersal distances

6.4. Estimating the actual variance of pairwise coefficients for marker based heritability and QST estimates

6.5. Testing phylogeographic patterns

7. References

8. Bug reports

1. NOTE ABOUT SPAGeDi 1.3 AND INSTALLATION

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see .

SPAGeDi has been tested on several data sets and results were checked for consistency with alternative software whenever possible. It may nevertheless still contain bugs (corrected bugs are listed at the end of this manual). Some of these bugs are probably easy to detect because they cause the program to crash or lead to obviously erroneous results for particular data sets and analyses. But others, more critical, may just cause biased results that appear plausible. Hence, it is advised to check carefully the consistency of the information in the results file. The authors would appreciate being informed of any detected bug. The authors claim no responsibility if or whenever a bug causes a misinterpretation of the results given by SPAGeDi.

What’s new in SPAGeDi ?

Implementations in version 1.3:

1°) SPAGeDi 1.3 can be compiled for different platforms including Windows, Mac OS X, and Linux (thanks to Reed Cartwright!).

2°) The iterative procedure to estimate gene dispersal parameters has been improved.

3°) A new statistic (Nij) characterizes similarity between individuals using “ordered alleles”. It is an analogue of the kinship coefficient that considers the phylogenetic distance between alleles (or haplotypes). Permutation tests make it possible to assess whether the allele phylogeny contributes to the genetic structure, providing a test of phylogeographic patterns at the individual level.

4°) Spatial coordinates can now be given as latitudes and longitudes in decimal degrees (using negative numbers for Southern latitudes or Western longitudes). To this end, the number of spatial coordinates (3rd number of the first line) must be set to -2.

Implementations in version 1.2:

1°) SPAGeDi 1.2 proposes new statistics (e.g. NST) to characterize differentiation among populations using “ordered alleles”, i.e. considering the phylogenetic distance between alleles (or haplotypes), as proposed by Pons & Petit (1996). Permutation tests make it possible to assess whether the allele phylogeny contributes to the differentiation pattern, which can be used to test phylogeographic patterns.

2°) SPAGeDi 1.2 proposes an estimator of the mean kinship coefficient between populations (Gij) closely related to the autocorrelation of population allele frequencies (Barbujani 1987).

3°) SPAGeDi 1.2 proposes a new estimator of the relationship coefficient between individuals (Li et al. 1993).

4°) SPAGeDi 1.2 can use specific reference allele frequencies (to be specified in a file) to compute relatedness coefficients between individuals.

5°) SPAGeDi 1.2 includes an iterative procedure to estimate gene dispersal parameters from isolation-by-distance patterns by regressing pairwise kinship coefficients on distance over a restricted distance range (this requires an estimate of the effective population density).

6°) SPAGeDi 1.2 provides better error messages. The most common data file errors are systematically listed in a file called “error.txt” when launching the program. As far as possible, the error messages displayed when problems occur were improved. These messages are not yet optimal, so suggestions to improve them are welcome.

Empty lines in data files are now allowed. Problems when entering instructions with the keyboard under Windows 2000 and later versions have been solved.

Implementations in version 1.1:

1°) SPAGeDi 1.1 can treat data from dominant genetic markers such as AFLP or RAPD to compute pairwise relatedness coefficients between individuals. Details about the statistics used can be found in Hardy (2003). The way to code phenotypes of dominant markers in the data file is explained in § 3.2.2.

2°) SPAGeDi 1.1 proposes an allele size permutation test indicating whether microsatellite allele sizes are informative with respect to genetic differentiation. Details about this test and its applications are given in Hardy et al. (2003).

Installation

For Microsoft Windows users, an executable (SPAGeDi.exe) can be downloaded and used directly without installation, as for the previous versions.

Native installers for Windows, Mac OS X, and Linux created by Reed A. Cartwright are also available. These native binaries have not been fully tested and may not work on all versions of their operating systems. If they fail to run properly, you can always compile the source packages. If the Windows version of SPAGeDi fails to work, you might also need to install the Microsoft Visual C++ 2008 SP1 Redistributable Package.

SPAGeDi-1.3a-win32-x86.exe: Windows Installer, 32-bit

SPAGeDi-1.3a-Darwin-universal.dmg: Mac OS X Installer, 32-bit (intel and ppc)

SPAGeDi-1.3a-Linux-i686.tar.gz: Linux Binary Package (2.6 kernel)

SPAGeDi-1.3a-FreeBSD-i386.tar.gz: FreeBSD Binary Package (7.1 kernel)

For all users, a portable source code (Unix/Windows/OS X port) has been created by Reed A. Cartwright. Compiling SPAGeDi from source requires CMake. Windows, Mac OS X and most Unix-like operating systems can install CMake through their package systems; otherwise it can be compiled from its source code. Once CMake is installed, download SPAGeDi’s source code.

SPAGeDi-1.3a.tar.gz: Unix and Mac OS X source (\n linefeeds)

SPAGeDi-1.3a.zip: Windows source (\r\n linefeeds)

Installation from source should be roughly the same on most systems. Detailed instructions for Linux follow. Open a terminal and issue the following commands, once you have downloaded SPAGeDi’s source code to your home directory.

tar xvzf SPAGeDi-1.3a.tar.gz

cd SPAGeDi-1.3a

cmake .

make

make install

A “-G” option can be passed to CMake to designate a different build system (generator). This is most important when compiling SPAGeDi on Windows. See CMake’s documentation for more information.

How to cite SPAGeDi?

Hardy, O. J. & X. Vekemans (2002). SPAGeDi: a versatile computer program to analyse spatial genetic structure at the individual or population levels. Molecular Ecology Notes 2: 618-620.

Acknowledgments

We would like to thank the SPAGeDi users who have identified bugs or have given us other feedback on the program, in particular Dave Coltman, Britta Denise Hardesty, Myriam Heuertz, Xavier Turon, Mine Turktas, Peter Wandeler, Reed Cartwright.

2. WHAT IS SPAGeDi ?

2.1. PURPOSE

SPAGeDi is primarily designed to characterise the spatial genetic structure of mapped individuals and/or mapped populations using genotype data (e.g. isozymes, RFLP, microsatellites) of any ploidy level. For polyploids, analyses assume polysomic inheritance as in autopolyploids. Polyploids with disomic inheritance (allopolyploids) can be treated correctly only if alleles from different homeologous genomes can be distinguished so that genotypes are treated as diploid data. SPAGeDi can compute inbreeding coefficients as well as various statistics describing relatedness or differentiation between individuals or populations by pairwise comparisons. To analyse how values of pairwise comparisons are related to geographical distances, SPAGeDi computes 1°) average values for a set of predefined distance intervals, in a way similar to a spatial autocorrelation analysis, and 2°) linear regressions of pairwise statistics on geographical distances (or their logarithm). The slopes of these regressions can potentially be used to obtain indirect estimates of gene dispersal distance parameters (e.g. neighbourhood size), and provide a synthetic measure of the strength of spatial structuring. SPAGeDi can also treat data without spatial information, providing global estimates of genetic differentiation and/or matrices of pairwise statistics between individuals or populations.

Different permutation procedures make it possible to test whether there is significant inbreeding, population differentiation or spatial structure, or whether microsatellite allele size or the phylogenetic distance between alleles carries relevant information about genetic structure.

Analyses can be carried out on data sets containing individuals with different ploidy levels, but not on data sets mixing loci corresponding to different ploidy levels within individuals (e.g. genotypes based on nuclear and cytoplasmic DNA cannot be analysed simultaneously, except for a haploid organism). Data from dominant markers (RAPD, AFLP) can be used to carry out analyses at the individual level with diploids (relatedness coefficients between individuals). Presently, there are no statistics adapted to dominant markers for analyses at the population level or for higher ploidy levels. One can always enter such data as haploid data (not mixed with data from codominant markers), but much caution must be taken in the interpretation of the results.

2.2. HOW TO USE SPAGeDi – SHORT OVERVIEW

SPAGeDi has no fancy windowing features. To launch the program just double click on the program icon and type the name of the data file or drag a data file icon into the program window. You can also launch the program with a command line (see §4.1). A single data file must contain all individual characteristics (name, category, spatial coordinate(s), genotypes). Details of the analyses to be carried out (individual versus population level, population definition, statistics, permutation tests, various options) will be specified after the program has been launched. Results of the analyses are written to a single results file. Data and results files are text files with tab delimited pieces of information. Hence they are best opened and edited using a worksheet software such as Excel. Data files can be converted from and into FSTAT and GENEPOP formats. Although error messages are displayed when problems occur, typically because the data file is not properly formatted, they may not be sufficient to find out the errors. Therefore we urge users to read carefully the instructions for preparing data files (next chapter).

2.3. DATA TREATED BY SPAGeDi

SPAGeDi requires that the following information is provided for each individual: 1°) one to three spatial coordinates on a Cartesian coordinate system, or latitude and longitude in degrees (optional), 2°) the value of a categorical variable (optional) and 3°) its genotype at each locus (missing data allowed). The categorical variable can be used to define populations or to restrict analyses within or among categories. The spatial coordinates permit SPAGeDi to compute pairwise distances between individuals or populations (Euclidean distances). Alternatively, pairwise distances can be defined in a separate matrix. To compute some statistics, a matrix giving phyletic distances between alleles can also be provided.

2.4. THREE WAYS TO SPECIFY POPULATIONS

Populations can be defined in three different ways: 1°) as categorical groups, where one population includes all individuals sharing the same categorical variable; 2°) as spatial groups, where a spatial group includes all individuals sharing the same spatial coordinates and following each other in the data file; 3°) as spatio-categorical groups, where a spatio-categorical group includes all individuals belonging to both the same spatial group and categorical group. When populations are defined using the categorical variable, the spatial coordinates of a given population are computed by averaging the coordinates of the individuals it contains.
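The three population definitions can be illustrated with a short Python sketch. This is not SPAGeDi's code: the record layout `(name, category, coordinates)` and the function names are assumptions made for illustration, and the group names follow the naming convention given in the notes of § 3.7 (first individual of the spatial group, joined to the category with `-` for spatio-categorical groups).

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical record layout: (individual name, category, coordinates).

def categorical_groups(individuals):
    """One population per value of the categorical variable."""
    groups = defaultdict(list)
    for name, category, _ in individuals:
        groups[category].append(name)
    return dict(groups)

def spatial_groups(individuals):
    """One population per run of consecutive individuals sharing coordinates."""
    groups = {}
    for _, run in groupby(individuals, key=lambda rec: rec[2]):
        run = list(run)
        groups[run[0][0]] = [name for name, _, _ in run]
    return groups

def spatio_categorical_groups(individuals):
    """Split each spatial group further by category."""
    groups = {}
    for _, run in groupby(individuals, key=lambda rec: rec[2]):
        run = list(run)
        first_name = run[0][0]
        for name, category, _ in run:
            groups.setdefault(f"{first_name}-{category}", []).append(name)
    return groups

def mean_coordinates(individuals, members):
    """Average the coordinates of the listed individuals (categorical groups)."""
    points = [xy for name, _, xy in individuals if name in members]
    return tuple(sum(axis) / len(points) for axis in zip(*points))
```

Note that individual "e" in a data set such as `[a(0,0), b(0,0), c(1,1), d(1,1), e(0,0)]` would form its own spatial group in this sketch, because it does not directly follow "a" and "b" in the file.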

2.5. STATISTICS COMPUTED

Statistics for pairwise comparisons between populations include:

FST: a measure of population differentiation (intra-class kinship coefficient) (Weir & Cockerham 1984)

GST: equivalent to FST but an estimator with different statistical properties (Pons & Petit 1996)

RST: FST analogue based on allele size (Slatkin 1995, estimated as in Michalakis & Excoffier 1996)

NST: FST analogue accounting for the genetic distances between alleles (Pons & Petit 1996)

Rho: intra-class relatedness coefficient permitting among-ploidy comparisons (Ronfort et al. 1998)

Gij: mean kinship coefficient between populations (Barbujani 1987)

Nij: Gij analogue accounting for the genetic distances between alleles (OJ Hardy, unpubl)

Ds: Nei’s 1978 standard genetic distance

(δμ)²: Ds analogue based on allele size (Goldstein and Pollock 1997)

Global F- or R- statistics (inbreeding coefficients) are also provided.

Statistics designed for pairwise comparisons between individuals include:

Kinship coefficients, Fij: 3 estimators including one for dominant markers in diploids (Loiselle et al. 1995, Ritland 1996, Hardy 2003)

Relationship coefficients: 6 estimators including one for dominant markers in diploids (Hardy & Vekemans 1999, Lynch & Ritland 1999, Queller & Goodnight 1989, Wang 2002, Li et al. 1993, Hardy 2003)

Fraternity coefficients: 2 estimators (Lynch & Ritland 1999, Wang 2002)

Rousset’s distance between individuals, Aij (Rousset 2000)

A kinship analogue based on allele size, Rij (I’ in Streiff et al. 1998)

A kinship analogue based on the genetic distances between alleles, Nij (OJ Hardy, unpubl)

Inbreeding coefficients (computed as kinship coefficients between gene copies within individuals)

All statistics are computed for each locus and as a multilocus weighted average. Note that an estimate of the inbreeding coefficient must be entered to compute kinship or relationship coefficients with dominant markers in diploids.

The actual variance of these coefficients (i.e. the remaining variance when sampling variance has been removed) can be estimated following the method of Ritland (2000) for each distance interval defined. The actual variance of kinship (or relatedness) coefficients and of pairwise FST is necessary for in situ, genetic marker based inference of, respectively, the heritability and QST of quantitative traits.

For pairwise coefficients, mean values per distance interval and regression slopes on spatial distance are given (unless spatial information is lacking).

Jackknifying loci (i.e. deleting information from one locus at a time) provides approximate standard errors for the multilocus estimates.
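As an illustration of jackknifing over loci, the sketch below computes a delete-one jackknife standard error in Python. It uses the textbook jackknife formula; whether SPAGeDi uses exactly this expression is an assumption, and the function name and the `estimator` argument are hypothetical.

```python
import math

def jackknife_standard_error(per_locus_values, estimator):
    """Delete-one jackknife over loci (textbook formula, assumed here).

    per_locus_values: one entry per locus (any object the estimator accepts)
    estimator: function mapping a list of per-locus entries to one number,
               e.g. a (weighted) multilocus average
    """
    n = len(per_locus_values)
    if n < 2:
        raise ValueError("at least two loci are needed to jackknife")
    # Recompute the multilocus estimate with each locus left out in turn.
    leave_one_out = [
        estimator(per_locus_values[:i] + per_locus_values[i + 1:])
        for i in range(n)
    ]
    mean = sum(leave_one_out) / n
    return math.sqrt((n - 1) / n * sum((v - mean) ** 2 for v in leave_one_out))

def plain_mean(values):
    """Unweighted average, used here only as a stand-in estimator."""
    return sum(values) / len(values)
```

With an unweighted mean as estimator, this reduces to the usual standard error of the mean across loci.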

Permutation tests, whereby the statistics are recomputed after locations, individuals or genes have been permuted, provide ad hoc tests for spatial genetic structure, population differentiation or inbreeding coefficients, respectively. Note that permuting locations is equivalent to carrying out a Mantel test. Permutation of microsatellite allele sizes or of the phylogenetic distances between alleles also makes it possible to test whether the mutation rate is sufficient to affect the genetic structure (test of phylogeographic patterns) (Hardy et al. 2003; Pons & Petit 1996).
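The idea of a location permutation test can be sketched in Python as follows. This is a minimal illustration, not SPAGeDi's implementation: the two-sided comparison, the counting of the observed value among the permutations, and all function names are assumptions.

```python
import itertools
import math
import random

def _slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def regression_slope(coords, stat):
    """Slope of a pairwise statistic on Euclidean distance over all pairs."""
    pairs = list(itertools.combinations(range(len(coords)), 2))
    distances = [math.dist(coords[i], coords[j]) for i, j in pairs]
    values = [stat[i][j] for i, j in pairs]
    return _slope(distances, values)

def location_permutation_test(coords, stat, n_perm=999, seed=0):
    """Permute locations among individuals and compare slopes (two-sided)."""
    rng = random.Random(seed)
    observed = regression_slope(coords, stat)
    extreme = 1  # the observed configuration counts as one permutation
    for _ in range(n_perm):
        shuffled = list(coords)
        rng.shuffle(shuffled)
        if abs(regression_slope(shuffled, stat)) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / (n_perm + 1)
```

Here `stat` is a square matrix (list of lists) of pairwise coefficients indexed like `coords`; only the genetic data stay attached to individuals while locations are shuffled, which is what makes the procedure comparable to a Mantel test.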

3. CREATING A DATA FILE

The data file is a text file. It is advised to create the data file using a worksheet program such as Excel and then save it as a “tab delimited text file”. If you do not have this option, try “DOS text” (“Text Unicode” or “ASCII” formats might not work).

3.1. STRUCTURE OF THE DATA FILE

Comment lines: they are not read by the program and can be put anywhere in the file. Comment lines must begin with the two characters // . Empty lines are allowed.

The data file must be in the following format, with each piece of information within a line being separated by a tab (i.e. each piece of information put in adjacent columns if using a worksheet program to generate the data file). Hereafter, first, second, third,… line refers to non-comment and non-empty lines.

• first line: 6 format numbers separated by a tab in the following order:

- number of individuals

- number of categories (0 if no category defined)

- number of spatial coordinates in a Cartesian coordinate system (0 to 3) or put -2 for latitude + longitude

- number of loci (or the number you wish to use if the data set contains more)

- number of digits used to code one allele (1 to 3); or set a value ≤0 to specify data from dominant markers

- ploidy (2 = diploid; for data with several ploidy levels, give the largest)

• second line: definition of distance intervals:

- number of distance intervals (n)

- the n maximal distances corresponding to each interval

Note 1: alternatively you can enter only the desired number of intervals preceded by a negative sign; the program then defines the n maximal distances in such a way that the number of pairwise comparisons within each distance interval is approximately constant.

Note 2: if you do not wish distance intervals, put 0.

Note 3: if you use latitude + longitude, distance intervals must be given in km.

• third line: the names used as column labels (up to 15 characters long, without space):

- a generic name for individuals (e.g. “Ind”)

- a generic name for categories (e.g. “Cat”), only if categories are defined

- a generic name for each spatial coordinate (e.g. “X”, “Y”)

- the name of each locus (e.g. “Pgm”, “Est”, …)

• fourth line and next ones: individual data (each line = 1 individual):

- name of the individual (up to 15 characters)

- name of the category (up to 15 characters), only if categories are defined

- coordinate along each axis (up to 10 digits), or latitude followed by longitude

- genotype at each locus (also separated by a tab)

Note: if the number of spatial coordinates is set to -2, latitudes and longitudes must be given in decimal degrees, using negative numbers for Southern latitude or Western longitude.

• last line (after the last individual): the word "END" (in uppercase)

3.2. HOW TO CODE GENOTYPES?

3.2.1. Codominant data

Single locus genotypes are represented by numbers in either of the following ways:

1°) the allele of each homologous gene is up to n digits long and alleles are separated by any number of non-numerical characters other than a tab (n is specified in the first line): e.g.,

12/45    1 12    99, 23    6.6    36--01

are correct genotypes for a diploid with up to 2 digits per allele.

2°) the allele of each homologous gene is exactly n digits long and alleles are not separated by other characters: e.g.,

1245    0112    9923    0606    3601

are the same genotypes as above.

In both cases, non-numerical characters cannot follow the rightmost digit.

Notes:

1°) missing genotypes are represented by giving the value 0: e.g.,

0 0 0 000,000,000,000 000

all represent a missing genotype.

2°) incomplete genotypes are represented by giving the value 0 to undetermined alleles on the right: e.g.,

05-00    05,0    500    0500

all represent the same incomplete genotype of a diploid (2 digits per allele).

3°) the leading 0’s are optional, so that 0112 and 0606 could also be written as 112 and 606, respectively.

4°) different ploidy levels can co-occur within a data set (not within a single individual); therefore alleles are defined only for the necessary number of genes, or 0 values are attributed to “alleles” on the left: e.g.,

123 125 125 121    97 123    0 0 97 123

are correct genotypes for a tetraploid and two diploids, respectively (3 digits per allele).

5°) do not confuse incomplete genotypes with genotypes for a ploidy level lower than announced: e.g.,

2 3 4 0    4500    0 2 3 4    45

successively represent 2 tetraploids (with incomplete genotypes), a triploid and a diploid (the two latter with complete genotypes) (1 digit per allele).
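The coding rules for codominant genotypes can be summarised by a small Python sketch. It is one possible reading of the rules above, not SPAGeDi's own parser: the function name is hypothetical, and the choice to return alleles as a list padded with zeros on the left (for lower ploidy) is an assumption made for illustration.

```python
import re

def parse_genotype(text, digits, ploidy):
    """Split one genotype cell into a list of `ploidy` allele numbers.

    digits: number of digits per allele (5th format number of the data file)
    ploidy: largest ploidy level in the data set (6th format number)
    A 0 on the right marks an undetermined allele; 0s on the left mark a
    genotype of lower ploidy than announced.
    """
    text = text.strip()
    if re.fullmatch(r"\d+", text):
        # Fixed-width form: leading zeros are optional, so pad on the left.
        padded = text.zfill(digits * ploidy)
        alleles = [int(padded[i:i + digits]) for i in range(0, len(padded), digits)]
    else:
        # Separated form: any non-numerical characters act as separators.
        alleles = [int(token) for token in re.findall(r"\d+", text)]
    if len(alleles) > ploidy:
        raise ValueError(f"too many alleles in genotype {text!r}")
    return [0] * (ploidy - len(alleles)) + alleles
```

For example, `parse_genotype("12/45", 2, 2)` and `parse_genotype("1245", 2, 2)` both give `[12, 45]`, and `parse_genotype("45", 1, 4)` gives `[0, 0, 4, 5]`, i.e. a diploid genotype in a data set declared as tetraploid.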

3.2.2. Dominant data (set the 5th format number, “number of digits”, ≤ 0 and the 6th format number, ploidy = 2)

Single locus genotypes are represented by numbers in either of the following ways:

1°) if the “number of digit” is set to 0, put

0 for a missing data

1 for a recessive genotype

2 for a dominant genotype

2°) if the “number of digit” is set to –X (i.e. a negative number), put

-X for a missing data

0 for a recessive genotype

1 for a dominant genotype

3.3. EXAMPLE OF DATA FILE

// this is an example (lines beginning with // are comment lines)

// #ind    #cat    #coord    #loci    #dig/loc    ploidy

5    0    2    4    2    2

4    10    20.5    50    100

Ind    Lat    Long    adh    got    pgm    lap

ind1    7.3    21    0101    0303    0    0101

ind2    8.4    52    101    101    102    103

3    5.11    103    0101    0303    0102    0103

4    1.0    13.2    1,1    3 3    1-2    01 03

lastind    1.94    129    0    1701    0118    1799

END

which specifies that 5 diploid individuals, not defined by categories, with location defined by 2 spatial coordinates, are scored at 4 loci where alleles are defined by 2 digits, and that 4 distance intervals will be considered as follows: [0 to 10], ]10 to 20.5], ]20.5 to 50], ]50 to 100]. Note that individuals 3 and 4 share the same genotypes but written in different ways.
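For users who prefer to generate data files with a script rather than a worksheet program, the following Python sketch assembles a tab-delimited file in the layout described in § 3.1. It is only an illustration under stated assumptions: the function name and argument layout are hypothetical, and the sketch handles only data without categories.

```python
def format_spagedi_data(individuals, loci, coord_names, intervals,
                        digits=2, ploidy=2):
    """Return the text of a tab-delimited data file (no categories).

    individuals: list of (name, coordinates, genotypes) with one genotype
                 string per locus
    intervals:   maximal distance of each distance interval ([] for none)
    """
    def row(cells):
        return "\t".join(str(cell) for cell in cells)

    lines = [
        # first line: the 6 format numbers (number of categories fixed to 0)
        row([len(individuals), 0, len(coord_names), len(loci), digits, ploidy]),
        # second line: number of intervals followed by their maximal distances
        row([len(intervals), *intervals]),
        # third line: column labels
        row(["Ind", *coord_names, *loci]),
    ]
    for name, coords, genotypes in individuals:
        if len(coords) != len(coord_names) or len(genotypes) != len(loci):
            raise ValueError(f"wrong number of columns for individual {name!r}")
        lines.append(row([name, *coords, *genotypes]))
    lines.append("END")
    return "\n".join(lines) + "\n"
```

The returned string can be written to disk with, for instance, `open("data.txt", "w").write(text)`, where the file name is of course arbitrary.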

3.4. NOTE ABOUT DISTANCE INTERVALS

When specific distance intervals are defined in the data file, the program checks that the maximal distance between two individuals / populations is not greater than the maximal distance of the last distance interval. Otherwise, an additional interval is created. Additional classes are also created for analyses at the individual level: an intra-individual class containing inbreeding coefficients (only for kinship statistics), and an intra-group class if individuals are organised into spatial groups (see §3.5.).

Use a point to indicate decimals (as in American notation); using a comma (as in French notation) would cause distances to be misinterpreted.
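When a negative number of intervals is given on the second line of the data file, the program chooses the maximal distances so that each interval holds roughly the same number of pairs. The Python sketch below shows one way such bounds could be obtained from the sorted pairwise distances; the exact rule used by SPAGeDi is not described here, so the quantile rule and the function name are assumptions.

```python
import itertools
import math

def equal_count_bounds(coords, n_intervals):
    """Maximal distance of each interval so that pair counts are about equal."""
    distances = sorted(
        math.dist(a, b) for a, b in itertools.combinations(coords, 2)
    )
    if not distances:
        raise ValueError("at least two locations are needed")
    # Upper bound of interval k = distance of the last pair falling in it.
    return [
        distances[math.ceil(len(distances) * k / n_intervals) - 1]
        for k in range(1, n_intervals + 1)
    ]
```

With many tied distances (e.g. individuals sampled on a regular grid), consecutive bounds can coincide, so the resulting classes should be checked before use.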

3.5. NOTE ABOUT SPATIAL GROUPS

If individuals consist of spatial groups that should be recognized (e.g. sibs from a given family, individuals from a given population), individuals belonging to the same group must follow each other in the data file and they must be given the same spatial coordinates. For analyses carried out at the individual level, the program will then add a distance class for the pairwise coefficients between members of the same group (intra-group class). For analyses at the individual level, when each individual receives specific spatial coordinates (no spatial groups, i.e. no two adjacent individuals in the data file share the same location), individuals are considered as independent from one another. This is typically the kind of analysis focusing on one continuously distributed population. If instead individuals are organised in spatial groups, individuals from the same group are treated as dependent. In such a case, regression analyses do not take into account pairwise comparisons between individuals from the same group. The procedure for location permutations is also affected, as spatial group locations rather than individual locations are permuted (see §5.6.). When asking for the matrices of pairwise spatial and genetic distances between individuals, the value of the spatial distance between members of the same group is set conventionally to –1.

3.6. NOTE ABOUT MICROSATELLITE ALLELE SIZES

Several statistics are based on microsatellite allele sizes (e.g. R-statistics, Goldstein and Pollock’s (1997) (δμ)², the kinship analogue of Streiff et al. (1998)) using the size specified in the genotypes of the data file. Ideally, this size should be the number of repeats of the microsatellite motif. The computed statistics will still be valid if the size corresponds to a constant plus the number of repeats (but the mean allele size information, see § 5.2., will not give the mean number of repeats). Problems may occur if allele sizes are given in terms of number of nucleotides rather than repeats. For the (δμ)² statistic, single locus estimates will be multiplied by the square of the motif size (the same holds for the variance of allele size information, § 5.2). For R-statistics and the Streiff et al. (1998) kinship analogue, single locus estimates will not be affected, but multilocus estimates would be affected if the motif size varies among loci, in which case one should change the data file, dividing allele sizes per locus by the corresponding motif size.
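The effect of the size unit can be checked on a toy example. The Python sketch below assumes that (δμ)² is the squared difference between the mean allele sizes of two populations, and uses made-up allele sizes; the function names are hypothetical.

```python
def delta_mu_squared(sizes_a, sizes_b):
    """Squared difference between the mean allele sizes of two samples."""
    mean_a = sum(sizes_a) / len(sizes_a)
    mean_b = sum(sizes_b) / len(sizes_b)
    return (mean_a - mean_b) ** 2

def to_repeat_units(sizes_in_nucleotides, motif_size):
    """Divide allele sizes by the motif size, as advised for the data file."""
    return [size / motif_size for size in sizes_in_nucleotides]

# Made-up data: a dinucleotide locus scored in nucleotides.
motif = 2
pop_a_bp = [120, 124]
pop_b_bp = [126, 130]

in_nucleotides = delta_mu_squared(pop_a_bp, pop_b_bp)          # 36.0
in_repeats = delta_mu_squared(to_repeat_units(pop_a_bp, motif),
                              to_repeat_units(pop_b_bp, motif))  # 9.0
```

The value computed from nucleotide sizes is the value in repeat units multiplied by the square of the motif size (here 2² = 4), while the constant contributed by the flanking regions cancels out.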

3.7. USING A MATRIX TO DEFINE ARBITRARY PAIRWISE SPATIAL DISTANCES

Pairwise spatial distances between individuals or populations are normally computed as Euclidean distances using the spatial coordinates or from latitudes and longitudes. However, you can also specify each pairwise distance in an arbitrary way using a matrix. This can be useful in three cases: 1°) If you wish to consider non-Euclidean spatial distances, such as distances more closely related to the probability of gene movements between locations. 2°) If you are not interested in spatial distances but in some other kind of pairwise distances (e.g. a morphological distance between individuals or populations) that you wish to correlate with genetic distance. 3°) If you wish to compute average statistics for particular pairwise comparisons between individuals / populations (for this purpose, you can define “distance” intervals and pairwise “distances” using integers).

The matrix of pairwise distances can be put at the end of the data file (just after the word “END”, see section 3.1.), or at the beginning of another text file. The use of such a matrix and its location are specified while running the program (see §4.2.5. and 4.2.9.). The matrix can be written in two formats: a matrix format or a column format.

Matrix format:

This is a square matrix. The first line must begin with the letter M, followed by a number representing the matrix size (# of lines and columns). Then, individual or population names corresponding to each column must be written (separated by tabs). Each of the next lines begins with the corresponding individual or population name followed by the pairwise distances attributed. The last line must contain the word “END”. Example:

// This is an example of a pairwise distance matrix written in matrix format with 5 rows and columns

M5 pop1 pop2 pop3 pop4 pop5

pop1 0 10.3 12 6 0

pop2 10.3 0 65 18 98

pop3 12 65 0 34 54

pop4 6 18 34 0 15

pop5 0 98 54 15 0

END

Column format:

In column format, each line corresponds to a pairwise comparison. The first line must begin with the letter C, followed by the number of lines (# of pairwise distances defined). Each of the next lines begins with the two individual or population names, separated by a tab, followed by the pairwise distance attributed. The last line must contain the word “END”. Example (the following matrix contains the same information as the one above except that self comparisons are left undefined):

// This is an example of a pairwise distance matrix written in column format with 10 pairwise distances defined

C10

pop1 pop2 10.3

pop1 pop3 12

pop1 pop4 6

pop1 pop5 0

pop2 pop3 65

pop2 pop4 18

pop2 pop5 98

pop3 pop4 34

pop3 pop5 54

pop4 pop5 15

END

Notes:

1°) For both matrix and column formats, the order of individuals / populations is unimportant (i.e. does not need to follow that of the data file).

2°) Self-comparisons are not taken into account.

3°) The names must match exactly those of the data file (case also matters!). This is straightforward for analyses at the individual level. However, for analyses at population level, population names vary: A) If one population = one categorical group, its name is that of the category. B) If one population = one spatial group, its name is that of the first individual of the spatial group in the data file. C) If one population = one spatio-categorical group, its name is written by joining the name of the first individual of the spatial group (as found in the data file) with the name of the category, the two being separated by the character ‘-’.

In order to create a template of the arbitrary matrix with the correct individual / population names, it can be convenient to run the program a first time without defining a pairwise distance matrix but asking to write pairwise distances and statistics in matrix or column formats (see §4.2.5. and 4.2.7.).

4°) Each pairwise comparison does not need to be defined, so that a matrix that does not contain all individuals / populations, or a matrix incompletely filled, are also accepted.

5°) Symmetrical comparisons (e.g. i-j and j-i) cannot contain different distances (but one can be undefined).
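Since the two formats carry the same information, one can be derived from the other. The Python sketch below converts a matrix-format block into column format. It is a hedged illustration: the function name is hypothetical, and it assumes a completely filled matrix, names without spaces, and a first cell written as “M” directly followed by the size (as in the example above).

```python
def matrix_to_column(text):
    """Convert a matrix-format distance block into column format."""
    rows = [
        line.split()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("//")
    ]
    header = rows[0]
    size = int(header[0][1:])            # "M5" -> 5
    names = header[1:1 + size]
    order = {name: rank for rank, name in enumerate(names)}

    pairs = []
    for row in rows[1:1 + size]:
        row_name, values = row[0], row[1:]
        for col_name, value in zip(names, values):
            # Keep each pair once and skip self-comparisons.
            if order[row_name] < order[col_name]:
                pairs.append((row_name, col_name, value))

    lines = [f"C{len(pairs)}"]
    lines += ["\t".join(pair) for pair in pairs]
    lines.append("END")
    return "\n".join(lines)
```

Only the upper half of the matrix is emitted, which is sufficient because symmetrical comparisons must not contain different distances.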

3.8. DEFINING GENETIC DISTANCES BETWEEN ALLELES

When a statistic based on the genetic distances between alleles is requested (e.g. NST, Nij), the program asks to specify the file containing the distance matrix between alleles. The latter can be put at the end of the data file or in another file, and must be a symmetrical square matrix with the following format:

First line: name of the locus followed by the allele names (numbers)

Next lines: allele name followed by the genetic distance between alleles

Example:

// This is an example of a distance matrix between alleles for a locus called “Hapl”

Hapl  1     2     3     4     15    26    7     18

1     0     6     5     2     4     4     3     3

2     6     0     1     4     2     2     3     3

3     5     1     0     3     1     1     2     2

4     2     4     3     0     2     2     1     1

15    4     2     1     2     0     2     1     3

26    4     2     1     2     2     0     3     1

7                                         0     2

18                                              0

END

Notes:

1°) Locus names must match exactly those of the data file (case matters).

2°) The order of alleles must be the same along rows and columns.

3°) Each allele found in the data file must occur in the matrix but the latter can contain additional alleles.

4°) Self-comparisons are not taken into account.

5°) The distance between each allelic pair must be defined but it can be so only one time in the matrix (i.e. a half matrix is also accepted).

3.9. DEFINING REFERENCE ALLELE FREQUENCIES FOR RELATEDNESS COEFFICIENTS

Most statistics available for analyses at the individual level (coefficients of kinship, relationship,…) provide measures of genetic similarity between individuals that are relative to a sample of individuals (usually all individuals in the data set), which defines the “reference allele frequencies”. However, specific reference allele frequencies can be given in a distinct file (see option § 4.3.3. - 6bis) with the following format:

First line: for consecutive loci, name of each locus followed by the total number of alleles

Next lines (one per allele): for consecutive loci, allele name followed by the allele frequency

Example:

// This is an example of a matrix with reference allele frequencies for 3 loci called “Loc1”, “Loc2”, “Loc3”.

Loc1  5     Loc2  8     Loc3  3
1     0.3   120   0.01  2     0.67
2     0.1   122   0.04  43    0.32
3     0.05  124   0.35  3     0.01
4     0.15  130   0.13
15    0.4   132   0.10
            140   0.05
            142   0.07
            144   0.25

Notes:

1°) These allele frequencies must be in a distinct file (the default name is “freq.txt”), not in the data file.

2°) Locus names must match exactly those of the data file (case matters).

3°) All loci in the data file must occur but additional loci may also occur (they will not be read).

4°) The order of alleles is unimportant.

5°) All alleles found in the data file must occur and be given a non-null frequency. Other alleles can also be present.

6°) The sum of allele frequencies at each locus must be one (sum between 0.999 and 1.001 accepted).
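Notes 3°, 5° and 6° lend themselves to an automatic check. The sketch below is only an illustration in Python, not SPAGeDi code; the function name and the dictionary layout are assumptions made for the example. It verifies that every locus and allele of the data file is present with a non-null frequency and that frequencies at each locus sum to one within the accepted tolerance.

```python
def check_reference_frequencies(ref_freqs, data_alleles, tol=0.001):
    """Return a list of problems found in reference allele frequencies.

    ref_freqs:    {locus: {allele: frequency}} as read from the frequency file
    data_alleles: {locus: set of alleles observed in the data file}
    tol:          accepted deviation of the sum from 1 (0.999 to 1.001)
    """
    problems = []
    for locus, alleles in data_alleles.items():
        if locus not in ref_freqs:
            problems.append(f"{locus}: locus missing from the frequency file")
            continue
        freqs = ref_freqs[locus]
        for allele in sorted(alleles):
            if freqs.get(allele, 0.0) <= 0.0:
                problems.append(f"{locus}: allele {allele} has no non-null frequency")
        total = sum(freqs.values())
        if abs(total - 1.0) > tol:
            problems.append(f"{locus}: frequencies sum to {total:.4f}")
    return problems
```

Loci present only in the frequency file are ignored, in line with note 3°.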

3.10. PRESENT DATA SIZE LIMITATIONS

max. 100000 individuals

max. 10000 loci

max. 999 alleles per locus (i.e. max 3 digits per allele)

max. 30 characters for the individual, category and locus names

max. 20000 random permutations

max. ploidy = 8 (octoploid) (note that all analyses on polyploids assume polysomic inheritance)

max. 100 distance intervals

max. length of any line in the data file: 100000 characters

Please contact us if these limitations are a problem for you; we may be able to send you a recompiled version with other specifications.

4. RUNNING THE PROGRAM

The program is written in C language. It has no fancy windowing features.

4.1. LAUNCHING THE PROGRAM

Launch the program by double-clicking on its icon and type the name of the data file (which should reside in the same folder as the program file; otherwise you must type its full path), or drag the icon of the data file into the SPAGeDi window (the data file can then reside anywhere and the results file will be written in the directory of the data file). (This is also the procedure to follow if you need to import a data file.) In the Windows version, you can also drag the data file directly onto the SPAGeDi program icon or a shortcut to it. The program can also be launched from a command line or via another application, which can be useful to analyse numerous data sets obtained, for instance, by simulations. A command file containing the keystrokes used to run an analysis can be supplied to automate the runs (e.g. “spagedi < cmds.txt”, where cmds.txt is the file with the keystroke commands). On Unix-derived systems, the program “tee” can be used to record keystrokes for later playback (e.g. “tee cmds.txt | spagedi” to record and “spagedi < cmds.txt” to repeat).
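When the program is driven from another application, the redirection “spagedi < cmds.txt” can be reproduced with a few lines of script. The Python sketch below is an illustration only; it assumes that the executable is reachable under the name “spagedi” (adapt the name or path to your installation), and the helper function is hypothetical.

```python
import subprocess


def run_with_keystrokes(command, keystroke_file):
    """Run a console program with its keyboard input read from a file.

    Equivalent to `command < keystroke_file` in a shell. Returns the
    CompletedProcess, whose stdout/stderr hold what the program printed.
    """
    with open(keystroke_file, "rb") as stdin:
        return subprocess.run(command, stdin=stdin,
                              capture_output=True, check=False)


# Example (assumes an executable named "spagedi" and a recorded cmds.txt):
# result = run_with_keystrokes(["spagedi"], "cmds.txt")
# print(result.returncode)
```

Looping over several keystroke files in this way is one possible approach to analysing many simulated data sets in sequence.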

Error messages are given when files cannot be opened, or when data files are not well formatted or contain inconsistent information. These messages are not yet optimal and you may have difficulties finding out what is wrong in your data file (suggestions to improve this are welcome). When launching SPAGeDi, an error file “error.txt” is opened (and its previous content erased) and common errors made when preparing data files are listed. Additional information is added to this file whenever a problem occurs. SPAGeDi checks that the number of individuals and the number of categories found match those specified in the data file, but there is no check for the number of loci (analyses considering only the first loci listed can thus be done by adjusting the number of loci given in the line containing the format numbers).

4.2. SPECIFYING THE DATA / RESULTS FILES

Once the program is launched, you are requested to enter the name of the data file (unless you dragged the data file icon on the program) and the name of the results file.

If you just press RETURN to these questions, the default names “in.txt” and/or “out.txt” will be considered as data and results files, respectively (this can be useful if you wish to carry out many different analyses on the same data set without having to enter the file names each time).

You can also import data from a file in FSTAT (Goudet 1995) or GENEPOP (Raymond and Rousset 1995) format. To do so, press SPACE and then RETURN when asked to enter the data file name, and select the format of the data file (FSTAT or GENEPOP). A new data file in SPAGeDi format will then be created, but it will not contain spatial information, so you need to add it (as spatial coordinates per individual or as a matrix of pairwise distances), unless you do not need spatial analyses.

If a file with the same name as the results file already exists in the folder, the program will ask if you wish to: erase the existing file first (enter ‘e’), add results to the end of this file (enter ‘a’ or simply press RETURN), or change the name of the output file (enter the new name).

Once the data and results files are specified, the program first displays the basic information from the data file on the screen and waits for the user to hit the RETURN key. The first set of information displayed is: the number of individuals, the number of categories and their names, the number of spatial coordinates and their names, the number of loci and their names, the number of digits used to specify alleles, the specified ploidy of the data, and the number of individuals of each ploidy. At this stage, if some individuals have missing genotypes at all loci, a warning message is issued (but the analysis can go on anyway), and if different loci suggest different ploidy levels within some individuals, a warning message is issued and the data file must be modified (the program stops here). The second set of information displayed is the groups recognised (categorical, spatial and spatio-categorical ones) with the minimal and maximal numbers of individuals per group.

4.3. SELECTING THE APPROPRIATE OPTIONS

You define the analyses to carry out and the results to write down by selecting options in 4 successive panels: 1°) Level of analyses, 2°) Statistics, 3°) Computational options, 4°) Output options. Some of the options will not be available depending on the structure of the data. You can come back to the beginning at different stages if you made an error of selection.

4.3.1. Level of analyses: individual vs population

Analyses are carried out at the individual level or at the population level. When both categorical and spatial groups occur, you also have the choice among three different ways to define populations: as categorical, spatial, or spatio-categorical groups. If there are neither categorical nor spatial groups in the data set, analyses are restricted to the individual level.

4.3.2. Statistics

You must select the statistics to be computed (you can select several simultaneously). These statistics are computed for each pair of individuals or populations and the average values per distance interval as well as the regression statistics are given in the results file. More details about those statistics are given in § 6.1 and 6.2. For analyses at the individual level with codominant markers, 12 statistics for pairwise comparisons between individuals are available:

1°) A kinship coefficient estimated according to J. Nason (described in Loiselle et al. 1995).

2°) A kinship coefficient estimated according to Ritland (1996).

3°) A relationship coefficient computed as Moran’s I statistic (Hardy and Vekemans 1999).

4°) A relationship coefficient estimated according to Queller and Goodnight (1989).

5°) A relationship coefficient estimated according to Lynch and Ritland (1999) (r coef).

6°) A relationship coefficient estimated according to Wang (2002) (r coef).

7°) A relationship coefficient estimated according to Li et al. (1993).

8°) A fraternity coefficient (4-genes coefficient) estimated according to Lynch and Ritland (1999) (Δ coef).

9°) A fraternity coefficient (4-genes coefficient) estimated according to Wang (2002) (Δ coef).

10°) Aij: a distance measure described in Rousset (2000) (the one called a by Rousset).

11°) Rij: a kinship coefficient analogue based on allele sizes (for microsatellites) (I’ in Streiff et al. 1998).

12°) Nij: a kinship coefficient analogue based on distances between alleles (OJ Hardy, unpublished).

Note: statistic 10° cannot be computed for haploid data, and statistics 4°, 5°, 6°, 7°, 8° and 9° can presently be computed only for diploid data (5°, 6°, 8° and 9° also assume a population with Hardy-Weinberg genotypic proportions).

For the kinship coefficients, intra-individual values are also computed (as kinship between gene copies within individuals), providing estimates of an inbreeding coefficient.

For analyses at the individual level with dominant markers in diploids (see §3.2.2), 2 statistics are available:

1°) A kinship coefficient (Hardy 2003).

2°) A relationship coefficient (Hardy 2003).

For analyses at the population level with codominant markers, there are 9 choices for global and pairwise statistics between populations:

Statistics based on allele identity / non-identity

1°) Global F-statistics and pairwise FST (Weir & Cockerham 1984)

2°) Global F-statistics and pairwise Rho (Streiff et al. 1998)

3°) Global GST and pairwise GST (Pons & Petit 1996)

4°) Global GST and pairwise Gij (Barbujani 1987)

5°) Global F-statistics and pairwise Ds (Nei’s standard genetic distance, Nei 1978)

Statistics based on allele size for microsatellites

6°) Global R-statistics and pairwise RST (

7°) Global R-statistics and pairwise dm2 (Goldstein’s (δμ)² distance, Goldstein and Pollock 1997)

Statistics based on distances between alleles

8°) Global NST and pairwise NST (Pons & Petit 1996)

9°) Global NST and pairwise Nij (OJ Hardy, unpublished)

When a statistic based on distances between alleles is requested, the program will ask you to specify the file containing the matrix of distances between alleles.

4.3.3. Computational options

Once the statistics are chosen, you can select among different options regarding computations (several options can be selected simultaneously):

1°) Use a matrix to define pairwise spatial distances.

This option allows you to define pairwise spatial distances between individuals / populations in an arbitrary way (otherwise, Euclidean distances are computed from the spatial coordinates given in the data file). To use it, you must enter the name of the file containing the matrix (if the matrix follows the genotype information in the data file, just press RETURN). Details of the format of the matrix are given in § 3.7.

2°) Make partial regression analyses (i.e. over restricted distance range).

This option allows you to define a distance range within which the spatial regression is computed, a useful option for gene dispersal parameter estimations (§ 6.3.). If this option is not selected, the regressions are carried out using all pairwise comparisons, except those with a distance of zero for the regressions on ln(distance).

If the option is selected, the minimal and maximal distances defining the range must be given. Entering no value (i.e. just pressing RETURN) means that the minimal or maximal distance is not bounded.

3°) Make permutation tests.

This option allows you to test the significance of different statistics by random permutations of genes, individuals, locations, or allele sizes. More details are given in § 4.3.5.

4°) Jackknife over loci.

With this option, mean jackknifed estimators and jackknife standard errors (SE) are computed for multilocus average statistics. They can be used to derive approximate confidence intervals as the mean ± 2 SE.

Jackknifing requires at least 2 polymorphic loci, but estimates from many polymorphic loci are necessary to obtain reliable SE. Because SE are approximate, they should not be used for formal tests.
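For readers unfamiliar with the procedure, a delete-one jackknife over loci can be sketched as follows. This Python example is an illustration under stated assumptions, not the program's source code: it uses the usual pseudovalue scheme, and it assumes, for the sake of the example, that the multilocus estimate is a ratio of sums of per-locus numerators and denominators. All function names are hypothetical.

```python
from math import sqrt


def jackknife_over_loci(per_locus, estimator):
    """Delete-one jackknife: return (jackknifed estimate, standard error).

    per_locus: list with one entry per polymorphic locus
    estimator: function computing the multilocus statistic from such a list
    """
    n = len(per_locus)
    if n < 2:
        raise ValueError("jackknifing requires at least 2 loci")
    full = estimator(per_locus)
    # Pseudovalue i contrasts the full estimate with the one lacking locus i.
    pseudo = [n * full - (n - 1) * estimator(per_locus[:i] + per_locus[i + 1:])
              for i in range(n)]
    mean = sum(pseudo) / n
    variance = sum((p - mean) ** 2 for p in pseudo) / (n - 1)
    return mean, sqrt(variance / n)


def ratio_of_sums(per_locus):
    """Assumed multilocus estimator: sum of numerators over sum of denominators."""
    return sum(num for num, _ in per_locus) / sum(den for _, den in per_locus)


# Hypothetical per-locus (numerator, denominator) pairs for four loci.
loci = [(0.02, 0.50), (0.05, 0.70), (0.01, 0.40), (0.04, 0.60)]
estimate, se = jackknife_over_loci(loci, ratio_of_sums)
approx_ci = (estimate - 2 * se, estimate + 2 * se)
```

With only four loci, as here, the interval is very rough, which is the reason many polymorphic loci are recommended.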

5°) Restrict pairwise comparisons within or among (selected) categories.

If the data are organised in categorical groups and analyses are carried out at the level of individuals or populations defined as spatio-categorical groups, you can select the type of pairwise comparisons for which the pairwise statistics are to be computed:

1°) All pairs (i.e. irrespective of categorical groups) = default option

2°) Only pairs within categories

3°) Only pairs among categories

4°) Only pairs within a specified category

5°) Only pairs between two specified categories

When 4° or 5° is selected, the name(s) of the category(ies) is(are) to be given.

When 2° or 4° is selected and analyses are carried out at the individual level, you must select between two reference allele frequencies to compute the statistics (see § 6.1.1. for explanations):

1°) whole sample (i.e. pairwise coefficients are computed relative to the whole sample)

2°) sample within category (i.e. pairwise coefficients are computed relative to the sample to which the pair of individuals belongs)

6°) Pairwise Fst (or Rst, or Rho) provided as Fst/(1-Fst) ratio.

When this option is selected, pairwise differentiation between populations will be estimated using FST/(1-FST) ratios. This is useful to analyse isolation-by-distance patterns because FST/(1-FST) is expected to vary linearly with the distance or its logarithm. See §6.3.
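The transformation and the subsequent regression can be illustrated with a short Python sketch. The pairwise values below are invented for the example and the function names are hypothetical; the regression is an ordinary least-squares fit on ln(distance).

```python
from math import log


def linearised_fst(fst):
    """Return FST/(1-FST); the ratio is undefined when FST reaches 1."""
    if fst >= 1.0:
        raise ValueError("FST/(1-FST) is undefined for FST >= 1")
    return fst / (1.0 - fst)


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))


# Hypothetical (distance, pairwise FST) values for three pairs of populations.
pairs = [(10.0, 0.02), (100.0, 0.05), (1000.0, 0.08)]
b_log = slope([log(d) for d, _ in pairs],
              [linearised_fst(f) for _, f in pairs])
```

A positive slope b_log indicates that differentiation increases with the logarithm of the distance.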

6bis°) Define reference allele frequencies to compute relatedness coefficients.

When this option is selected, pairwise relatedness coefficients will be computed relative to reference allele frequencies given in a separate file (see §3.9. for the format). SPAGeDi will ask for the name of this file. This option cannot be applied to the statistics developed for dominant markers in diploids, the relationship coefficient computed as a Moran’s I statistic, or Rousset’s (2000) a coefficient.

4.3.4. Output options

A second set of options concerns the information given in the results file:

1°) Report allele frequencies for each population / category (otherwise only averages reported).

In the results file, global allele frequencies and gene diversities are reported. Activating this option means that this information will also be given for each population (or for each categorical group in the case of analyses at the individual level including categories).

2°) Report all stat of regression analyses (otherwise only slopes reported).

When this option is activated, the following statistics of the regressions of pairwise statistics on spatial distances are provided: slope, intercept, determination coefficient, number of pairs, mean and variance of values of (log) distance and statistics.

3°) Report matrices with pairwise spatial distances and genetic coefficients.

With this option, pairwise spatial distances and pairwise statistics are given at the end of the results file. You must also specify whether the pairwise statistics are to be given for each locus or only as multilocus estimates, and whether pairwise values are to be written only in columnar form or also in matrix form. You can also select Phylip format, which gives a square matrix of genetic distances that can be copied directly to a text file for further analyses (there are no tab delimiters). Note that in Phylip format, negative genetic distances are given the value –0.0000. Estimates of the inbreeding coefficient for each individual are given in the columnar format if you asked to compute a kinship coefficient between individuals (the inbreeding coefficients given are computed as kinship coefficients between homologous genes within individuals).

4°) Convert data file into GENEPOP or FSTAT format.

This option allows you to create a data file that can be used by the software FSTAT (Goudet 1995) or GENEPOP (Raymond and Rousset 1995), and it is available only with diploid data. If analyses were requested at the population level, the GENEPOP or FSTAT file codes data for the same populations as selected. For analyses selected at the individual level, the FSTAT file codes data as a single population, whereas the GENEPOP file codes data as if each individual constituted a single population (this is the necessary format to use Rousset’s pairwise distance between individuals in GENEPOP).

5°) Estimate gene dispersal sigma.

For analyses at the individual level, this option can be used to estimate the gene dispersal distance parameter sigma from the regression of pairwise kinship coefficients (or Rousset’s distance) on the logarithm of the distance (Rousset 2000; Hardy et al. 2006). You must assume that genotypes come from a two-dimensional population at drift-dispersal equilibrium so that theoretical expectations of isolation-by-distance models hold (§ 6.3.). You will be asked to enter the effective population density. Be consistent with units: the density must be given in number of individuals per square distance unit, where the distance unit is the same as the one used for spatial coordinates or for the spatial distance matrix. You must also enter X, which defines the distance range (sigma to X·sigma) over which the regression is applied (X should be between 10 and 50; the default value is 20). SPAGeDi will then apply an iterative procedure to estimate sigma from the genetic structure on the restricted distance range (see § 6.3.). The iterative procedure might not converge, indicating that the data are not powerful enough to give reliable estimates (see § 6.3. for additional advice).
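The logic of such an iterative procedure can be sketched in Python as follows. This is a simplified illustration, not the algorithm as implemented in SPAGeDi (see § 6.3. for the actual method). It assumes the two-dimensional isolation-by-distance expectation in which the neighbourhood size is Nb = -(1 - F1)/b, where b is the slope of kinship on ln(distance) and F1 the mean kinship between neighbours, together with Nb = 4·π·D·sigma², where D is the effective density. Function names and the convergence rule are hypothetical.

```python
from math import log, pi, sqrt


def slope_on_log_distance(pairs, dmin, dmax):
    """Least-squares slope of kinship on ln(distance) within [dmin, dmax]."""
    pts = [(log(d), f) for d, f in pairs if d > 0 and dmin <= d <= dmax]
    if len(pts) < 2:
        return None
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    if sxx == 0:
        return None
    return sum((x - mx) * (y - my) for x, y in pts) / sxx


def estimate_sigma(pairs, f1, density, x_factor=20.0, max_iter=100, tol=1e-6):
    """Return (Nb, sigma), or None when the iteration does not converge.

    pairs: list of (distance, pairwise kinship); f1: kinship between neighbours
    """
    b = slope_on_log_distance(pairs, 0.0, float("inf"))  # start: all pairs
    sigma = None
    for _ in range(max_iter):
        if b is None or b >= 0:
            return None  # kinship does not decrease over this distance range
        nb = -(1.0 - f1) / b
        new_sigma = sqrt(nb / (4.0 * pi * density))
        if sigma is not None and abs(new_sigma - sigma) <= tol * sigma:
            return nb, new_sigma
        sigma = new_sigma
        # Restrict the next regression to distances from sigma to X * sigma.
        b = slope_on_log_distance(pairs, sigma, x_factor * sigma)
    return None
```

This sketch simply gives up when successive estimates cycle; it does not reproduce the averaging over a cycle that SPAGeDi reports in that situation.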

6°) Report actual variance of pairwise genetic coefficients (Ritland 2000).

With this option activated, the actual variance (i.e. excluding sampling variance) of pairwise statistics is given for each distance class following the approach described in Ritland (2000), which requires independent loci (at least two). An estimate of the standard error by jackknifing over loci is also given with at least 3 loci.

This variance is useful to compute marker-based estimates of the heritability (h²) or population differentiation (QST) at quantitative traits (Ritland 1996, 2000).

4.3.5. Permutation tests

If permutation tests are selected, you have two sets of additional options (you can select several at once):

Firstly (only if statistics based on allele size or distance between alleles have been selected),

1°) Test of genetic structuring (permuting genes, individuals and/or locations)

To test individual inbreeding, population differentiation, and/or spatial structure.

2°) Test of mutation effect on genetic structure (permuting alleles)

To test if the allele size (microsatellites) or the phylogenetic distance between alleles is informative with respect to genetic structuring.

3°) Test of mutation effect on genetic differentiation for each pair of populations

To test, for each pair of populations, if the allele size or the phylogenetic distance between alleles is informative with respect to differentiation.

Secondly,

1°) Report only P-values (otherwise details of permutation tests are reported)

If this option is selected, only P-values for 2-sided tests are reported. Otherwise, the following details are given: object permuted, # permutations, # of different values of the statistic after permutation, observed values before permutation, mean values after permutation, standard errors of mean values after permutation, 95% confidence intervals, P-values of 1- and 2-sided tests.

2°) Define # of permutations for each randomised unit (otherwise same #)

Allows defining a high number of permutations for the statistics that most interest you, and no or few permutations for the ones that are not of interest for you or that would take a lot of computation time.

3°) Initialise random number generator (otherwise initialisation on clock)

Define initial seed for random number generator, otherwise the latter is defined according to the computer’s internal clock (this option is useful for debugging).

You must then enter the number(s) of permutations you wish. On large data sets, resampling can be time consuming; hence there is a compromise between computation time and precision of the probability (P-values). It is advisable to enter at least 199 if you are satisfied with a 5% significance level, 999 for a 1% level, 9999 for a 0.1% level. Enter "0" if you do not need tests.
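The link between the number of permutations and the precision of P-values can be made explicit with a small Python sketch. It assumes a common convention in which the observed value is counted as one of the N + 1 configurations; whether SPAGeDi uses exactly this convention is not stated here, and the function names are hypothetical.

```python
def smallest_p_value(n_permutations):
    """Smallest one-sided P-value attainable with the given number of permutations."""
    return 1.0 / (n_permutations + 1)


def one_sided_p_value(observed, permuted_values):
    """P-value for H1: observed > expected, counting the observed value itself."""
    n_ge = sum(1 for value in permuted_values if value >= observed)
    return (n_ge + 1) / (len(permuted_values) + 1)


for n in (199, 999, 9999):
    print(n, smallest_p_value(n))
```

With 199 permutations the smallest attainable P-value is 0.005, which leaves a reasonable margin for a test at the 5% level.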

4.4. INFORMATION DISPLAYED DURING COMPUTATIONS

Once the program proceeds to the calculations, it displays the computational stage: computation of allele frequencies, of distance intervals, of pairwise statistics, permutation tests. When the computations are finished, a message will appear on the screen and pressing any key will close the window. You can proceed to examination of the results file. If the program crashed, do not forget to open the file "error.txt", because this may give you some information on the origin of the problem.

Details relative to distance intervals are displayed once computed and computations proceed.

Each interval (class) is characterised by

1°) max d: its maximal distance (the minimal distance is the maximal distance of the preceding interval)

2°) mean d: the average distance between individuals / populations for the pairs belonging to the interval

3°) mean ln(d): idem but using the ln(distance) between individuals / populations

4°) # pairs: the number of pairwise comparisons belonging to the interval

5°) % partic: the proportion (%) of all individuals / populations represented at least once in the interval

6°) CV partic: the coefficient of variation of the number of times each individual / population is represented

Notes:

1°) If analyses are restricted to pairwise comparisons within or among (specified) category(ies), the information per distance intervals considers only pairs satisfying these conditions.

2°) Information on distance intervals can be useful for fine-tuning them. For example, low % partic and/or high CV partic means that the statistics computed for the corresponding interval involve data from only a fraction of the individuals / populations. Hence, as a rule of thumb, we advise that for each distance interval: % partic > 50%, and CV partic <= 1. For analyses at the individual level we also advise that # pairs > 100, given the large standard errors typically observed for pairwise coefficients between individuals (with many loci or highly polymorphic loci this number could be reduced).
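To try out candidate distance intervals before a run, these diagnostics can be approximated outside the program. The Python sketch below is an illustration, not SPAGeDi code: it assumes Euclidean distances, intervals of the form ]max d of the preceding class, max d], and a coefficient of variation based on the standard deviation with divisor n. The function name is hypothetical.

```python
from itertools import combinations
from math import dist, sqrt


def interval_diagnostics(coords, max_dists):
    """Return # pairs, % partic and CV partic for each distance interval.

    coords:    list of coordinate tuples, one per individual / population
    max_dists: increasing list of maximal distances, one per interval
    """
    n = len(coords)
    times = [[0] * n for _ in max_dists]  # participations per interval
    n_pairs = [0] * len(max_dists)
    for i, j in combinations(range(n), 2):
        d = dist(coords[i], coords[j])
        for c, upper in enumerate(max_dists):
            if d <= upper:  # first interval whose upper bound covers d
                n_pairs[c] += 1
                times[c][i] += 1
                times[c][j] += 1
                break
    report = []
    for c, upper in enumerate(max_dists):
        mean = sum(times[c]) / n
        sd = sqrt(sum((t - mean) ** 2 for t in times[c]) / n)
        report.append({
            "max d": upper,
            "# pairs": n_pairs[c],
            "% partic": 100.0 * sum(1 for t in times[c] if t > 0) / n,
            "CV partic": sd / mean if mean > 0 else float("nan"),
        })
    return report
```

Intervals for which “% partic” falls below 50 or “CV partic” exceeds 1 are candidates for merging with a neighbouring interval.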

5. INTERPRETING THE RESULTS FILE

All the results are found in a single results file. The results file can be read as a text file but it is best to open it with a worksheet program (e.g. Excel), tabs being used to delimit columns. The results appear in the following order.

5.1. BASIC INFORMATION

First, the basic information as it appeared on the screen when running the program is written: names of data and results files, numbers of individuals, categories, spatial coordinates and loci, names of categories, spatial coordinates and loci, ploidy, numbers of individuals for each ploidy level, number of categorical, spatial and spatio-categorical groups (see § 4.2.).

5.2. ALLELE FREQUENCY ANALYSIS

Second, for each locus are written: the number of missing genotypes (# missing genotypes), the number of incomplete genotypes (# incomplete genotypes), the total number of defined genes (# of defined genes), the number of alleles with non-zero frequency (# alleles), the gene diversity corrected for sample size (He), the name (or size) of each allele (allele names or allele size) (i.e. the number given in the data file), and the allele frequencies (allele frequencies). When a statistic based on allele size (e.g. R-statistics) has been selected, the mean (Mean allele size) and variance (Variance of allele size) of allele sizes are also given. This information is given for the whole sample and, if requested when selecting the options, for each population (analysis at population level) or each category (analysis at individual level).

If relatedness coefficients were computed using specified reference allele frequencies (individual level analyses), the latter will be written.

5.3. TYPE OF ANALYSES

After the allele frequencies information, it is specified whether the analyses are carried out at the individual or population level, if pairwise comparisons are restricted to pairs within or among category(ies), and, for analyses at the individual level, if statistics are computed on basis of the global (whole sample) or local (within category) allele frequencies (for comparisons within (a) category) or relative to given reference allele frequencies.

5.4. DISTANCE INTERVALS

Next, for each distance interval corresponding to a column are written:

- Dist classes: the names of the distance classes (1, 2,…)

- Max distance: the maximum distance defining the interval: distance interval c = ] Max dist (c-1), Max dist (c) ]

- Number of pairs: the number of pairs of individuals separated by the given distance interval

- % partic: the percentage of individuals participating at least once in a pairwise comparison within the interval

- CV partic: the coefficient of variation (i.e. the ratio of the standard deviation over the average) of the number of times each individual participates in pairwise comparisons within the interval

- Mean distance: the average distance separating pairs of individuals within the interval

- Mean ln(distance): the average natural logarithm of the distance separating pairs of individuals within the interval

Note: For analyses at the individual level, an intra individual class is added for comparison of genes within individuals (only defined for kinship statistics when ploidy is larger than one), and this class actually corresponds to an inbreeding coefficient. When individuals consist of groups, the distance class “1” corresponds to intra group comparisons.

5.5. COMPUTED STATISTICS

For each selected statistic, the following results are given for the multilocus estimate and each locus:

- in columns labelled FIT, FIS, FST or RIT, RIS, RST or GST or NST (for analyses at population level only): the global statistics. When analyses are restricted to comparisons within a given category or between two given categories, global statistics are computed considering only the populations included in the concerned category(ies).

- in columns corresponding to each distance class: the average value of the pairwise coefficients computed over all pairs of individuals or populations within the distance interval (all pairs of genes within individuals in the case of the “intra individual” class, for analyses at the individual level).

- under the column “average”: the average value of the coefficients computed over all pairs of individuals or populations, whatever the distance (for analyses at individual level, it includes intra group class but not intra individual class).

- under “distance range for regression analyses”: the distance range used to compute regressions of pairwise statistics on spatial distance or ln(distance).

The next columns report the results of the regression analyses, first with the linear distance, then with the ln(distance). If the option “Report all stat of regression analyses” has not been selected (see §4.3.4), only the slopes (b-lin and b-log) are given; otherwise the following statistics are reported for each regression analysis:

- the slope b

- the intercept a

- the coefficient of determination r2 (i.e. squared correlation coefficient)

- the number of pairwise comparisons N (taking account of missing data)

- the mean (Md) and variance (Vd) of pairwise distances or ln(distances)

- the mean (Mv) and variance (Vv) of pairwise statistics

If the option Jackknife over loci has been selected (see §4.3.3), results of a jackknife procedure deleting one locus at a time are given on the two lines following the information of the last locus: the first line gives the jackknifed estimates, the second one gives their standard errors. Calculations follow Sokal and Rohlf (1995, p.821).

If the option Report actual variance of pairwise genetic coefficients has been selected (see §4.3.4), estimates of the actual variance of pairwise genetic coefficients within each distance interval are provided following the method of Ritland (2000).

If the option Estimate gene dispersal sigma has been selected (see §4.3.4), estimates of the neighbourhood size (Nb) and sigma (the square root of half the mean square parent-offspring distance) are given (see §6.3 for technical notes). If the iterative procedure used to get these estimates did not converge, the message “no convergence” appears. If successive estimates cycled among a set of values, the average and the range of values obtained over a cycle are given. When the option Jackknife over loci was also selected, the iterative procedure is repeated after removing one locus at a time, and standard errors on estimates are given.

Notes:

1°) For analyses at the individual level, the intra individual kinship coefficient is an inbreeding coefficient expressing the departure from Hardy-Weinberg genotypic proportions (cf. FIS). When individuals consist of spatial groups corresponding to different populations, this is equivalent to FIT (not FIS). Kinship statistics for the intra group class provide an estimator similar to FST if groups correspond to different populations.

2°) For analyses at the individual level, the slopes of the regressions do not include the pairs of individuals within spatial groups (intra group class). As slopes do not depend on an arbitrary choice of distance, they offer a convenient measure of the degree of spatial genetic structuring. Moreover, under some conditions, these slopes can be related to population genetic parameters like neighbourhood size (see §6.3.).

5.6. PERMUTATION TESTS

If permutation tests are selected as an option (see §4.3.3 and §4.3.5), results of these tests are written after the pairwise coefficients. These tests are based on the comparison of the observed values with the corresponding frequency distributions obtained when random permutations of the data are performed. For each locus and the multilocus estimates, tests are given for global statistics (population level analyses), each distance class, and the slopes of the regression analyses.

The following information is reported (unless the option “Report only P-values” has been selected, see §4.3.5, in which case only P-values for the two-sided tests are given):

- the object (genes, individuals or location) permuted (and how): Object permuted

- the number of valid permutations (i.e. for which the statistic was computable): N valid permut

- the number of different values obtained for the different permutations: N different permut val

- the observed value (i.e. before permutation): Obs val

- the average value after permutation: Mean permut val

- the standard error of the distribution of values after permutation: SD permut val

- the lower 95% confidence interval value: 95%CI-inf

- the upper 95% confidence interval value: 95%CI-sup

- the P-value for the 1-sided test observed value < permuted value: P(1-sided test, H1: obs<exp)

- the P-value for the 1-sided test observed value > permuted value: P(1-sided test, H1: obs>exp)

- the P-value for the 2-sided test observed value different from permuted value: P(2-sided test, H1: obs!=exp)

The following code is used to designate the object permuted and how it is permuted (Object permuted):

GaI permutation of Genes among all Individuals

GaIwC permutation of Genes among Individuals within Category

GaIwP permutation of Genes among Individuals within Population

IaSG permutation of Individuals among Spatial Groups

IaSGwC permutation of Individuals among Spatial Groups within Category

IaP permutation of Individuals among all Populations

IaPwC permutation of Individuals among Populations within Category

ILaI permutation of Individual Locations among all Individuals

ILaIwC permutation of Individual Locations among Individuals within Category

SGLaSG permutation of Spatial Group Locations among all Spatial Groups

SGLaSGwC permutation of Spatial Group Locations among Spatial Groups within Category

PLaP permutation of Population Locations among all Populations

PLaPwC permutation of Population Locations among Populations within Category

ASaAwL permutation of Allele Sizes among Alleles within Locus

RCoDMbA permutation of Rows and Columns of Distance Matrices between Alleles

When permutation of an object is done within category, it means that the permuted objects remain in their original categorical group after permutation. This is done when pairwise comparisons are restricted to within category(ies) (see §4.2.3.).
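The idea of permuting within category can be sketched as follows. This is a minimal illustration in Python, not SPAGeDi's own code: the function name `permute_within_category`, the coordinates and the category labels are all hypothetical.

```python
import random
from collections import defaultdict

def permute_within_category(values, categories, rng):
    """Shuffle `values` only among objects sharing a category label.

    Returns a new list in which position i always holds a value that
    originally belonged to an object of the same category as object i.
    """
    # Group the positions of the objects by their category label.
    positions = defaultdict(list)
    for idx, cat in enumerate(categories):
        positions[cat].append(idx)

    permuted = list(values)
    for idxs in positions.values():
        # Shuffle the values of this category and put them back
        # into the positions occupied by that same category.
        shuffled = [values[i] for i in idxs]
        rng.shuffle(shuffled)
        for i, v in zip(idxs, shuffled):
            permuted[i] = v
    return permuted

# Hypothetical example: five individual locations in two categories.
rng = random.Random(1)
locations = [(0, 0), (0, 5), (3, 1), (9, 9), (8, 7)]
categories = ["adult", "adult", "adult", "juvenile", "juvenile"]
shuffled_locations = permute_within_category(locations, categories, rng)
```

After the shuffle, the three "adult" positions still hold the three adult locations (in some order), and likewise for the two "juvenile" positions, which is the property described above for codes ending in wC.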

As the preceding code shows, the object permuted varies:

- Genes are permuted among individuals, each locus independently, for tests on FIS, FIT, RIS, RIT and intra-individual coefficients. Missing data are not permuted (i.e. permutation concerns only defined genes). For FIS and RIS, genes are permuted only within population.

- Individuals (i.e. whole genotypes) are permuted among populations or spatial groups for tests on global FST, RST, Rho, GST, NST and intra-group coefficients.

- Individual Locations (for analyses at the individual level without spatial groups), Spatial Group Locations (for analyses at the individual level with spatial groups), or Population Locations (for analyses at the population level) are permuted among the available locations for tests on each distance class (except the intra-individual and intra-group ones), and tests on the regression slopes. This is equivalent to a Mantel test between a matrix of genetic distances and a matrix of geographic distances.
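The location permutation described in the last item can be illustrated with a small Mantel-type test. The sketch below is written in Python under stated assumptions and is not SPAGeDi's implementation: the matrices are invented, the correlation between the two matrices stands in for the tested statistic, and the P-value conventions (counting the observed value among the permutations, and doubling the smaller one-sided P-value for the two-sided test) are common choices that may differ from those used by the program.

```python
import math
import random
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def upper_triangle(matrix, order):
    """Pairwise values above the diagonal, with objects relabelled by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(i + 1, n)]

def mantel_test(genetic, spatial, n_permut, rng):
    """Permute locations (rows and columns of `spatial` together)."""
    identity = list(range(len(genetic)))
    gen = upper_triangle(genetic, identity)
    obs = pearson(gen, upper_triangle(spatial, identity))

    permuted = []
    for _ in range(n_permut):
        order = identity[:]
        rng.shuffle(order)  # reassign locations to objects at random
        permuted.append(pearson(gen, upper_triangle(spatial, order)))

    n = len(permuted)
    # Assumed convention: the observed value counts as one permutation.
    p_less = (sum(v <= obs for v in permuted) + 1) / (n + 1)
    p_greater = (sum(v >= obs for v in permuted) + 1) / (n + 1)
    return {
        "n_valid": n,
        "obs": obs,
        "mean_permut": statistics.fmean(permuted),
        "sd_permut": statistics.stdev(permuted),
        "p_less": p_less,        # H1: obs < exp
        "p_greater": p_greater,  # H1: obs > exp
        "p_two_sided": min(1.0, 2 * min(p_less, p_greater)),  # assumed rule
    }

# Hypothetical data: five locations on a line and invented genetic distances.
coords = [0.0, 1.0, 2.0, 4.0, 7.0]
spatial = [[abs(a - b) for b in coords] for a in coords]
genetic = [
    [0.00, 0.02, 0.05, 0.07, 0.12],
    [0.02, 0.00, 0.01, 0.06, 0.10],
    [0.05, 0.01, 0.00, 0.04, 0.09],
    [0.07, 0.06, 0.04, 0.00, 0.05],
    [0.12, 0.10, 0.09, 0.05, 0.00],
]
result = mantel_test(genetic, spatial, n_permut=199, rng=random.Random(42))
```

The keys of `result` loosely mirror the reported fields (N valid permut, Obs val, Mean permut val, SD permut val and the three P-values). Rows and columns of the spatial matrix are reordered together, so each permutation keeps the set of pairwise spatial distances intact and only changes which object sits at which location.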
