Data Format of QTLNetwork

To run an analysis with QTLNetwork 2.0, two source files are required: a marker linkage map file (the map file, for short) and a data file. The map file gives the order of, and the genetic distances between, all observed markers on the chromosomes or linkage groups. The data file contains the marker and trait observations for all individuals. Sample files demonstrating both formats are provided in the \SampleData sub-directory of the folder where QTLNetwork (Windows GUI version) is installed. Map and data files prepared for the QTLMapper software can be used directly by QTLNetwork.

1. Format of marker linkage map file

This file contains information about the marker linkage map, such as the number of chromosomes, the number and order of markers on each chromosome, and the distances between flanking markers. It consists of a general description and a map body.

General Description: This part comes at the beginning of the map file. A typical general description looks like:

_DistanceUnit cM

_MapFunction K

_Chromosomes 4

_MarkerNumbers 6 4 7 9

There are four possible items in the general description, and they can appear in any order. Each item is a key string followed by one or more specifications. Each key string must start with an underscore “_” and must not contain any list separator (white space or tab). The specifications must be separated from the key string by at least one list separator, and neighboring specifications must also be separated from each other by at least one list separator. A key string and its specifications must be placed on the same line. Key strings and specifications (if they are characters) are not case sensitive.
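The syntax rules above can be sketched as a small reader. This is only an illustration, not QTLNetwork's own parser: the function name parse_description is hypothetical, and lowercasing the key strings assumes that key matching ignores case.

```python
def parse_description(lines):
    """Collect `_Key spec spec ...` items into a dict (hypothetical helper)."""
    items = {}
    for line in lines:
        fields = line.split()  # splits on any run of spaces or tabs
        if not fields or not fields[0].startswith("_"):
            continue  # not a general-description item
        # Assumption: key strings are matched without regard to case.
        items[fields[0].lower()] = fields[1:]
    return items


header = parse_description([
    "_DistanceUnit cM",
    "_MapFunction K",
    "_Chromosomes 4",
    "_MarkerNumbers 6 4 7 9",
])
marker_numbers = [int(n) for n in header["_markernumbers"]]
print(marker_numbers)
```

Because items may appear in any order, a dictionary keyed by the key string is a natural fit; the specifications are kept as strings and converted only where a number is expected.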

_DistanceUnit specifies the unit of genetic distance used in the map file. The specification string “cM” stands for centimorgan and “M” stands for Morgan.

_MapFunction indicates the map function used in creating the marker linkage map for transforming recombination fractions into genetic distances. Specification character “K” is for Kosambi function and “H” for Haldane function.

_Chromosomes is for specifying the total number of chromosomes or linkage groups involved in the map file.

_MarkerNumbers is for specifying the number of markers on each of the chromosomes. The order of the numbers must be consistent with that for genetic-distance columns in the map body.
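For readers unfamiliar with the two map functions named under _MapFunction, the standard Kosambi and Haldane formulas can be sketched as follows. The function name distance_cM is hypothetical, and the snippet only illustrates how a recombination fraction becomes a distance; QTLNetwork reads the distances from the map file and this is not its code.

```python
import math


def distance_cM(r, map_function="K"):
    """Convert a recombination fraction r (0 <= r < 0.5) to centimorgans."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    code = map_function.upper()
    if code == "K":    # Kosambi: d = (1/4) ln((1 + 2r) / (1 - 2r)) Morgans
        morgans = 0.25 * math.log((1 + 2 * r) / (1 - 2 * r))
    elif code == "H":  # Haldane: d = -(1/2) ln(1 - 2r) Morgans
        morgans = -0.5 * math.log(1 - 2 * r)
    else:
        raise ValueError(f"unknown map function code: {map_function!r}")
    return 100.0 * morgans  # 1 Morgan = 100 centimorgans


print(round(distance_cM(0.10, "K"), 2), round(distance_cM(0.10, "H"), 2))
```

For the same recombination fraction the Haldane distance is the larger of the two, which is why the map file must state which function produced its distances.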

Map Body: This part starts from key string *MapBegin* and ends at key string *MapEnd*. A typical map body looks like:

*MapBegin*

Marker#   Ch1     Ch2     Ch3     Ch4

1         0.00    0.00    0.00    0.00

2         9.84    11.26   7.45    9.85

3         10.22   8.69    9.10    10.93

4         8.25    9.87    10.66   10.70

5         9.79            10.16   10.10

6         7.47            8.34    11.30

7                         11.21   9.30

8                                 7.23

9                                 11.78

*MapEnd*

The strings (Marker#, Ch1, Ch2, Ch3, Ch4) in the second row show the contents of the columns below them. The Marker# column (first column) is for the order of all markers on each chromosome; the maximum order is equal to the number of markers on the chromosome that has the most markers among all the chromosomes. The Ch1 column (second column) to Ch4 column (last column) each represents a chromosome or linkage group, and contains genetic distances between adjacent markers on the chromosome. Specifically, the genetic distance for the first marker on each chromosome must be set to zero as the start point of the linkage map for the chromosome; the distance for the second marker is between the first and the second markers; the distance for the third marker is between the second and the third markers, and so on. The order of Ch1 column (second column) to Ch4 (last column) must be consistent with that for the numbers following the key string _MarkerNumbers.
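A minimal sketch of reading the map body is given below. It assumes that the file is whitespace-separated and that a chromosome with fewer markers simply has no value in the later rows, as in the sample above, so the counts from _MarkerNumbers decide which chromosome each value belongs to. The name parse_map_body is hypothetical.

```python
from itertools import accumulate


def parse_map_body(lines, marker_numbers):
    """Return one list of adjacent-marker distances per chromosome."""
    stripped = [s.strip() for s in lines]
    lowered = [s.lower() for s in stripped]
    begin, end = lowered.index("*mapbegin*"), lowered.index("*mapend*")
    rows = [s.split() for s in stripped[begin + 1:end] if s]
    rows = rows[1:]  # drop the column-title row (Marker# Ch1 Ch2 ...)
    chromosomes = [[] for _ in marker_numbers]
    for order, fields in enumerate(rows, start=1):
        # Only chromosomes with at least `order` markers have a value here.
        present = [c for c, n in enumerate(marker_numbers) if n >= order]
        values = fields[1:]
        if len(values) != len(present):
            raise ValueError(f"row {order}: expected {len(present)} distances")
        for c, v in zip(present, values):
            chromosomes[c].append(float(v))
    return chromosomes


sample_map = """*MapBegin*
Marker# Ch1 Ch2 Ch3 Ch4
1 0.00 0.00 0.00 0.00
2 9.84 11.26 7.45 9.85
3 10.22 8.69 9.10 10.93
4 8.25 9.87 10.66 10.70
5 9.79 10.16 10.10
6 7.47 8.34 11.30
7 11.21 9.30
8 7.23
9 11.78
*MapEnd*""".splitlines()

chromosomes = parse_map_body(sample_map, [6, 4, 7, 9])
# Cumulative positions: running sum of the adjacent-marker distances.
positions = [list(accumulate(c)) for c in chromosomes]
print([len(c) for c in chromosomes])
```

Since each value is the distance from the previous marker, the position of a marker on its chromosome is the running sum of the column down to that row.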

2. Format of data file

The data file contains information on population type, number of genotypes sampled from the population, number of observations, observations for both markers and quantitative traits, etc. It is composed of four parts: general description, marker data body, trait data body, and some comment lines.

General description: This part specifies the basic features of the data file and is usually placed at the beginning of the file. As in the map file, each item in the general description is a key string followed by one or more specifications. Each key string must start with an underscore “_”, and no white space is allowed within it. There are eight possible items in the general description, and they can be arranged in any order. A typical general description for a data file looks like:

_Population DH

_Genotypes 200

_Observations 400

_Environments yes

_Replications no

_TraitNumber 1

_TotalMarker 64

_MarkerCode P1=1 P2=2 F1=3 F1P1=4 F1P2=5

_Population specifies the population type used. Some commonly used populations are listed as follows:

RI population – derived from a cross between two pure-line parents. The specification word for RI population can be RI or RIL.

BC population – derived from crossing F1 with one of the inbred parents. The specification words for BC1 and BC2 populations are B1 and B2, respectively.

F2 population – derived from selfing or sib-mating F1 that is made by crossing two inbred lines.

Immortalized F2 (IF2) population – derived by random mating among individuals of a DH or RI population (see Hua JP, Xing YZ, Xu CG, Sun XL, Yu SB and Zhang QF (2002) Genetic dissection of an elite rice hybrid revealed that heterozygotes are not always advantageous for performance. Genetics 162: 1885–1895). The specification word IF2DH is for an IF2 population derived from a DH population, and IF2RI for one derived from an RI population.

BnFn population – derived from F1 by backcrossing to one of the inbred parents or by selfing for several generations. In each generation, selfing, backcrossing, or creating doubled haploids is permitted. Take the following designs as examples:

The specification words for the four designs above are FFF, B1B1B1, B2B1F and B2B1D, respectively.
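The four specification words suggest that a BnFn word is a sequence of per-generation steps. The sketch below splits a word into such steps. Reading F as selfing, B1 and B2 as backcrossing to one parent or the other, and D as doubled-haploid creation is an assumption inferred from the examples, and split_design is a hypothetical helper rather than part of QTLNetwork.

```python
import re

# Assumed step codes, inferred from FFF, B1B1B1, B2B1F and B2B1D.
STEP = re.compile(r"B1|B2|F|D", re.IGNORECASE)


def split_design(word):
    """Split a BnFn specification word into its per-generation steps."""
    steps = [s.upper() for s in STEP.findall(word)]
    if "".join(steps) != word.upper():
        raise ValueError(f"unrecognized design word: {word!r}")
    return steps


for word in ("FFF", "B1B1B1", "B2B1F", "B2B1D"):
    print(word, split_design(word))
```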

_Genotypes specifies the total number of genotypes sampled from the mapping population.

_Observations specifies the total number of observations for each trait studied.

_Environments specifies the status of experimental design for environments. If the experiment is conducted in multiple environments, write the specification word yes after the key word _Environments, otherwise write no.

_Replications specifies the status of experimental design for replications or blocks. If the experiment is conducted with replications or blocks, write the specification word yes after the keyword _Replications, otherwise write no.

_TraitNumber specifies the total number of traits included in the data file.

_TotalMarker specifies the total number of markers included in the data file. This number must equal the sum of the numbers given for _MarkerNumbers in the map file.

_MarkerCode defines a marker coding scheme. There are five possible strings for the specifications. Each of the strings looks like an equation, but no white space is allowed within the string. On the left side of the equation symbol is the marker phenotype specification:

P1: Marker phenotype being the same as that of P1;

P2: Marker phenotype being the same as that of P2;

F1: Marker phenotype being the same as that of F1;

F1P1: Marker phenotype that is not P2 type (P1 dominant, or P1 type and F1 type indistinguishable);

F1P2: Marker phenotype that is not P1 type (P2 dominant, or P2 type and F1 type indistinguishable).

On the right side of the equation symbol is the code for the marker type. The marker code must always be a single character (a number or a letter). The dot “.” is used to represent a missing marker datum or trait value. Codes need not be specified for every possible marker type, except for an F2 population. For example, if the marker data were collected from a DH population, specifying codes for the P1 and P2 types is enough.
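The _MarkerCode rules, together with the _TotalMarker consistency requirement, can be sketched as below. The helper names parse_marker_code and total_marker_matches are hypothetical, and the checks reflect only what is stated above (five allowed phenotype names, single-character codes, the dot reserved for missing values).

```python
ALLOWED = {"P1", "P2", "F1", "F1P1", "F1P2"}
MISSING = "."  # reserved for missing marker data or trait values


def parse_marker_code(specs):
    """Map each single-character code to its marker phenotype."""
    codes = {}
    for spec in specs:
        phenotype, sep, code = spec.partition("=")
        phenotype = phenotype.upper()
        if sep != "=" or phenotype not in ALLOWED:
            raise ValueError(f"bad marker code specification: {spec!r}")
        if len(code) != 1 or not code.isalnum():
            raise ValueError(f"code must be one letter or digit: {spec!r}")
        if code in codes:
            raise ValueError(f"code {code!r} is assigned twice")
        codes[code] = phenotype
    return codes


def total_marker_matches(total_marker, marker_numbers):
    """_TotalMarker must equal the sum of _MarkerNumbers from the map file."""
    return total_marker == sum(marker_numbers)


codes = parse_marker_code("P1=1 P2=2 F1=3 F1P1=4 F1P2=5".split())
print(codes)
# Hypothetical pairing: a map with 6 + 4 + 7 + 9 markers needs _TotalMarker 26.
print(total_marker_matches(26, [6, 4, 7, 9]))
```

Keying the dictionary by code rather than by phenotype makes the later lookup of each value in the marker data body a single step.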

Marker data body: This part is enclosed by the two key strings *MarkerBegin* and *MarkerEnd*. The order of the marker data for the different marker loci must be consistent with the order of markers on each chromosome given in the map file. Because spreadsheet software usually limits the number of columns in a sheet, two arrangements of marker data are supported.

Type I:

*MarkerBegin*

#Ind Mk1 Mk2 Mk3 Mk4 Mk5 Mk6 Mk7 Mk8 Mk9;

1 1 1 1 2 2 2 2 1 1 ;

2 1 1 . 1 1 2 2 2 2 ;

3 2 . 2 1 1 1 1 2 2 ;

……

89 2 2 2 2 . 1 1 . 1 ;

90 1 1 2 2 2 2 2 1 1 ;

*MarkerEnd*

Type II:

*MarkerBegin*

#Mk 1 2 3 4 5 … 48 49 50 … 88 89 90 ;

Mk1 1 1 1 2 1 … 2 2 1 … 1 2 1 ;

Mk2 1 1 1 . 2 … 1 2 1 … 1 2 1 ;

Mk3 1 . 1 2 2 … 1 2 2 … 1 2 2 ;

Mk4 2 1 1 1 1 … 1 2 2 … 1 2 2 ;

Mk5 2 1 . 1 1 … 1 1 1 … 2 . 2 ;

Mk6 2 2 2 1 1 … 1 1 1 … 2 1 2 ;

Mk7 2 2 2 1 1 … 2 1 1 … 2 1 1 ;

Mk8 2 2 2 2 2 … 2 1 2 … 1 . 1 ;

Mk9 1 2 2 2 2 … 2 2 2 … 2 1 1 ;

*MarkerEnd*

The two arrangements are distinguished by a keyword placed at the beginning of the second row: #Ind for Type I and #Mk for Type II. Marker names and marker data must be arranged in the order given in the map file. No list separator is allowed within a marker name. Each row must end with a semicolon “;”.

Trait data body: This part lies between the two key strings *TraitBegin* and *TraitEnd*. Trait data are entered by source. The source comprises the environment (if available), the replication (if available), and the genotype from which the observations were obtained for all the traits studied. The following is an example of a trait data body.
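A minimal sketch of a reader that accepts either arrangement and returns the same individual-by-marker table is shown below. The function name parse_marker_body and the two tiny inputs are hypothetical; they are not taken from the QTLNetwork sample files.

```python
def parse_marker_body(lines):
    """Return (marker_names, genotypes); genotypes[i][j] is the code of
    individual i at marker j, whichever arrangement the file uses."""
    stripped = [s.strip() for s in lines]
    lowered = [s.lower() for s in stripped]
    begin, end = lowered.index("*markerbegin*"), lowered.index("*markerend*")
    rows = []
    for s in stripped[begin + 1:end]:
        if not s:
            continue
        if not s.endswith(";"):
            raise ValueError(f"row does not end with ';': {s!r}")
        rows.append(s[:-1].split())
    title, body = rows[0], rows[1:]
    keyword = title[0].lower()
    if keyword == "#ind":   # Type I: one row per individual
        names = title[1:]
        genotypes = [r[1:] for r in body]
    elif keyword == "#mk":  # Type II: one row per marker, so transpose
        names = [r[0] for r in body]
        genotypes = [list(col) for col in zip(*(r[1:] for r in body))]
    else:
        raise ValueError(f"unknown arrangement keyword: {title[0]!r}")
    return names, genotypes


# Hypothetical data: two individuals, three markers, one missing value.
type_one = ["*MarkerBegin*", "#Ind Mk1 Mk2 Mk3;", "1 1 1 2 ;", "2 1 . 1 ;", "*MarkerEnd*"]
type_two = ["*MarkerBegin*", "#Mk 1 2 ;", "Mk1 1 1 ;", "Mk2 1 . ;", "Mk3 2 1 ;", "*MarkerEnd*"]
names_one, table_one = parse_marker_body(type_one)
names_two, table_two = parse_marker_body(type_two)
print(table_one == table_two)
```

Type II is simply the transpose of Type I, so normalizing both to one layout keeps the rest of a pipeline independent of the arrangement chosen in the file.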

*TraitBegin*

Env# Rep# Geno# Trait_1 Trait_2 Trait_3 ;

1 1 1 2.44 7.40 10.04 ;

1 1 2 2.40 4.32 8.55 ;

……

1 1 90 3.54 8.19 10.74 ;

1 2 1 3.17 6.91 11.86 ;

1 2 2 1.90 4.31 11.36 ;

……

1 2 90 3.22 10.54 11.48 ;

2 1 1 5.74 12.78 11.27 ;

2 1 2 7.65 7.02 11.96 ;

……

2 1 90 6.58 13.92 9.94 ;

2 2 1 6.01 10.22 9.95 ;

2 2 2 6.22 11.99 7.81 ;

……

2 2 90 7.98 13.21 12.03 ;

*TraitEnd*

The second row contains the source indicator strings and the trait names. The number of source strings depends on the experimental design. If the experiment has both environments and replications, three source strings are required: the first for environment (Env#), the second for replication (Rep#), and the last for genotype (Geno#). Any strings may be used to label the sources, because they only indicate what the numbers in the columns below them represent. If the experiment has no environmental factor or no replications, the corresponding column must be removed. A semicolon “;” is also required at the end of each observation data row.
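The layout of the trait data body can be sketched as follows. The helper names count_sources and parse_trait_body are hypothetical, the number of source columns is derived from the _Environments and _Replications settings, and the missing value in the last sample row is added here purely for illustration.

```python
def count_sources(environments, replications):
    """The genotype column is always present; the other two are optional."""
    return 1 + (environments.lower() == "yes") + (replications.lower() == "yes")


def parse_trait_body(lines, n_sources):
    """Return (trait_names, records); each record is (sources, {trait: value})."""
    stripped = [s.strip() for s in lines]
    lowered = [s.lower() for s in stripped]
    begin, end = lowered.index("*traitbegin*"), lowered.index("*traitend*")
    rows = []
    for s in stripped[begin + 1:end]:
        if not s:
            continue
        if not s.endswith(";"):
            raise ValueError(f"row does not end with ';': {s!r}")
        rows.append(s[:-1].split())
    trait_names = rows[0][n_sources:]
    records = []
    for fields in rows[1:]:
        sources = tuple(fields[:n_sources])
        # A dot marks a missing trait value.
        values = [None if v == "." else float(v) for v in fields[n_sources:]]
        if len(values) != len(trait_names):
            raise ValueError(f"wrong number of trait values: {fields!r}")
        records.append((sources, dict(zip(trait_names, values))))
    return trait_names, records


sample_traits = [
    "*TraitBegin*",
    "Env# Rep# Geno# Trait_1 Trait_2 Trait_3 ;",
    "1 1 1 2.44 7.40 10.04 ;",
    "1 1 2 2.40 4.32 8.55 ;",
    "1 2 1 3.17 . 11.86 ;",  # hypothetical row with a missing Trait_2 value
    "*TraitEnd*",
]
trait_names, records = parse_trait_body(sample_traits, count_sources("yes", "yes"))
print(trait_names, len(records))
```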

