Conventional breeding techniques are struggling to keep pace with climate change and the growing global demand for agricultural products. Meeting these challenges requires upgrading breeding methods to next-generation technologies, collectively known as Breeding 4.0. Central to this frontier is Genotype-to-Phenotype (G2P) prediction, and accurate, reliable prediction depends on developing and optimizing machine learning (ML) algorithms. A major obstacle to using ML in breeding practice today is the scarcity of G2P paired data: well-curated, labelled phenotypic data linked to the corresponding genomic data, such as genetic variants.

BreedingAIDB is the first manually curated G2P paired data database specifically designed to support the application of ML in G2P prediction and breeding practice. Release 1.0 includes G2P paired data for rice, soybean, and maize. In addition to G2P paired data, BreedingAIDB offers three core functional modules: Feature Extraction, Phenotype Prediction, and ML Project, each designed to enhance the utility of this database.

You can find more details about the data we have collected in this module.

  • Rice, Oryza sativa, is the main staple food for over half the world's population, particularly in Asia. You can access 143,477 rice G2P paired data records across 41 different phenotypes. This collection spans 12 different regions and covers the years 2011 to 2018 and extends to 2020.
  • Soybean, Glycine max, is an important protein and oil crop. You can access 284,395 soybean G2P paired data records covering 114 different phenotypes. This dataset spans five growing locations and covers the years 2013 to 2019.
  • Maize, Zea mays, is an important source of food and feed. You can access 12,654 maize G2P paired data records across 18 different phenotypes. This dataset covers three geographical areas and the years 2016 and 2017.

BreedingAIDB provides two forms of unstructured data and one form of structured data, all directly applicable to ML for breeding. For structured data, we used GSCtool to generate GSCtool features, which can be input directly into a G2P model. Unstructured data consists of VCF (Variant Call Format) and GVCF (genomic VCF) files. G2P paired data for all three crops are accessible through the module.

  • GVCF Files: Cover all loci in the genome (or in specific intervals), regardless of whether a variant is present at a given locus.
  • VCF Files: Contain high-quality Single Nucleotide Polymorphism (SNP) information, generated through a rigorous genomic analysis process and common filtering criteria.
  • GSCtool Features: Generated by GSCtool, these features can be input directly into the Genotype-to-Phenotype (G2P) model.
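
To illustrate how the VCF and GVCF files differ in practice, the following minimal Python sketch scans VCF-formatted text and keeps only SNP records. It assumes the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT, ...) and that non-variant GVCF records carry a symbolic ALT allele such as <NON_REF>, a convention used by some variant callers. The function name iter_snps and the sample records are hypothetical and are not part of BreedingAIDB.

```python
def iter_snps(vcf_text):
    """Yield (chrom, pos, ref, alts) for single-nucleotide records only.

    Assumption: input follows the tab-separated VCF layout, where the first
    five columns are CHROM, POS, ID, REF, ALT.
    """
    for line in vcf_text.splitlines():
        # Skip blank lines, "##" meta lines, and the "#CHROM" header line.
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        alts = alt.split(",")
        # A SNP has a one-base REF and only one-base ALT alleles.
        # Indels and symbolic alleles such as <NON_REF> fail this check.
        is_snp = len(ref) == 1 and ref in "ACGT" and all(
            len(a) == 1 and a in "ACGT" for a in alts
        )
        if is_snp:
            yield chrom, int(pos), ref, alts


# Hypothetical sample: one SNP, one deletion, one GVCF-style reference block.
sample = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t50\tPASS\t.",
    "chr1\t300\t.\tC\t<NON_REF>\t.\t.\tEND=350",
])
snps = list(iter_snps(sample))
```

Only the first sample record passes the filter; the deletion and the reference block are skipped, which mirrors the difference between an SNP-only VCF and a GVCF that reports every locus.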

For phenotype data drawn from various data silos, we manually harmonized the labels and units. Users can select a phenotype of interest and obtain the phenotype information together with the corresponding genomic data.
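
As a rough illustration of what harmonizing labels and units involves, the sketch below maps differently spelled phenotype labels to one canonical name and converts lengths to a single unit. The alias table, the unit factors, and the harmonize function are all hypothetical examples; the actual curation in BreedingAIDB was done manually and its rules are not described here.

```python
# Hypothetical alias table: raw label (lowercased) -> canonical phenotype name.
LABEL_ALIASES = {
    "gl": "grain_length",
    "grain_len": "grain_length",
    "grain length": "grain_length",
}

# Hypothetical conversion factors from a source unit to millimetres.
UNIT_TO_MM = {"mm": 1.0, "cm": 10.0}


def harmonize(raw_label, value, unit):
    """Return (canonical_label, value_in_mm, "mm") for one length measurement."""
    key = raw_label.strip().lower()
    # Unknown labels are passed through unchanged rather than guessed.
    label = LABEL_ALIASES.get(key, key)
    if unit not in UNIT_TO_MM:
        raise ValueError(f"unsupported unit: {unit!r}")
    return label, value * UNIT_TO_MM[unit], "mm"


record = harmonize("GL", 0.75, "cm")
```

Keeping the mappings in plain tables makes each curation decision easy to review, and raising an error on an unknown unit avoids silently mixing scales.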

BreedingAIDB offers three core functional modules: Feature Extraction, Phenotype Prediction, and ML Project, each designed to enhance the utility of this database.

  • Feature Extraction: Upload variant detection results (VCF files containing SNP information) generated using IRGSP 1.0 (rice), B37 (maize), or Williams 82 (soybean) as the reference genome, and select the matching reference genome. GSCtool then performs the feature extraction and returns the results needed for subsequent analyses and model construction.
  • Phenotype Prediction: This module predicts grain length and grain width for rice from feature files extracted by GSCtool. Upload the feature files to obtain phenotype predictions based on your genomic data and support more informed breeding decisions.
  • ML Project: This module integrates Optuna and LightGBM. Users can upload feature data generated by any feature representation algorithm, provided that the data includes a label column named "labels"; all other columns are treated as features. Users can customize the hyperparameter search space to identify the model configuration that best suits their data. Once the data is uploaded, ML Project immediately starts model training and optimization. When training and optimization finish, ML Project returns a LightGBM model with the best hyperparameter configuration found for G2P prediction.
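
The input convention and the tuning step described for the last module can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the module's actual implementation: it assumes a CSV upload with numeric features, a regression target, a search space limited to two LightGBM parameters, and 3-fold cross-validated RMSE as the objective. The function names split_features_and_labels and tune_lightgbm and the search_space dictionary layout are hypothetical. The third-party packages (optuna, lightgbm, scikit-learn, numpy) are imported only inside tune_lightgbm, so the data-preparation part runs with the standard library alone.

```python
import csv
import io


def split_features_and_labels(csv_text, label_column="labels"):
    """Split a CSV table into feature names, feature rows, and labels.

    The column named "labels" holds the phenotype values; every other
    column is treated as a feature. Assumes all values are numeric.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or label_column not in reader.fieldnames:
        raise ValueError(f"missing required column: {label_column!r}")
    feature_names = [n for n in reader.fieldnames if n != label_column]
    features, labels = [], []
    for row in reader:
        features.append([float(row[n]) for n in feature_names])
        labels.append(float(row[label_column]))
    return feature_names, features, labels


def tune_lightgbm(features, labels, search_space, n_trials=20):
    """Search LightGBM hyperparameters with Optuna; return the best found.

    search_space is a hypothetical dict of (low, high) bounds, e.g.
    {"num_leaves": (16, 128), "learning_rate": (0.01, 0.3)}.
    """
    import lightgbm as lgb
    import numpy as np
    import optuna
    from sklearn.model_selection import cross_val_score

    X, y = np.asarray(features), np.asarray(labels)

    def objective(trial):
        params = {
            "num_leaves": trial.suggest_int(
                "num_leaves", *search_space["num_leaves"]
            ),
            "learning_rate": trial.suggest_float(
                "learning_rate", *search_space["learning_rate"], log=True
            ),
        }
        model = lgb.LGBMRegressor(**params)
        # scikit-learn reports negated RMSE, so flip the sign to minimize it.
        scores = cross_val_score(
            model, X, y, cv=3, scoring="neg_root_mean_squared_error"
        )
        return -scores.mean()

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params


# Hypothetical two-row upload with two features and a "labels" column.
names, X_rows, y_values = split_features_and_labels(
    "f1,f2,labels\n0,1,5.5\n2,0,6.1\n"
)
```

With the optional packages installed and a realistically sized table, a call such as tune_lightgbm(X_rows, y_values, {"num_leaves": (16, 128), "learning_rate": (0.01, 0.3)}) would run the search and return the best parameter values found within those bounds.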

If you wish to contribute your data or to work with specific datasets that are not available in the database, BreedingAIDB provides two convenient modules: File Upload and Information Submission.

  • File Upload: You can upload processed files that include phenotype labels.
  • Information Submission: For substantial datasets that are not available in the database, users can provide relevant information through Information Submission. We will process the requested datasets and incorporate them into BreedingAIDB.