It is now known that the genome of all living organisms consists of DNA: the blueprint for their development and function. DNA is a double helix composed of nucleotides arranged in a defined order along the molecule, forming the genes: functional units of DNA that specify the biological heritage of each human, including natural individual variation of about 1.0 percent [1-2]. The complete human genome consists of 22 autosomes (1−22), two sex chromosomes (X and Y) and maternally inherited mitochondrial DNA (mtDNA) [3]. Through converging developments in genetics, recombinant DNA technology and increasingly automatable methods for DNA sequencing, the idea of sequencing the human genome became feasible [4]. The path from the first sequencing methods - developed independently by Sanger and by Maxam and Gilbert - through the polymerase chain reaction (PCR) to laboratory automation led to the large-scale generation of genomic data sets, for which “hierarchical shotgun sequencing” was the fundamental strategy [5]. But why sequence the full human genome, or any genome for that matter [6]?
A complete, high-quality human reference genome aims to be a baseline for clinical, comparative and population genomic analyses, providing a centralized coordinate system for reporting and comparing results across studies. Sequencing the full human genome is also a scientific opportunity to reveal diversity across human populations, which represents only a small glimpse through the larger window of evolution [7]. A reference also deepens our understanding of mutational processes through inter-species sequence comparisons [4]. The scientific and long-lasting value of the human reference genome lies in being a foundational open-access resource for future biomedical research and precision medicine [8].
The effort to sequence DNA with the Human Genome Project (HGP)
Such great value merits an equally great investment. The Human Genome Project (HGP) was launched in 1990, funded by the US government through the National Institutes of Health. It was a truly collaborative international effort involving twenty centers in six countries, known as the International Human Genome Sequencing Consortium (IHGSC), whose highest-priority goal was to obtain a highly accurate sequence of the vast majority of the euchromatic portion of the human genome [9-10]. Through open collaboration, public data sharing and the study of ethical, legal and social issues, the HGP had a great scientific and social impact. Because the precise scientific plan and the feasible degree of accuracy and completeness were unclear at the outset, the sequencing of the human genome proceeded through a first draft phase that yielded about 90% of the information, though in imperfect form, followed by a finishing phase that yielded about 99% in high-quality form [10].
A strong demand for human sequence data led the consortium to focus on an initial goal of producing a “working draft” genome sequence by the end of 2001, using a hierarchical mapping and sequencing strategy, also referred to as “map-based”, “BAC-based” or “clone-by-clone”. This approach involved sequencing overlapping large-insert clones spanning the genome, chosen from eight large-insert libraries prepared from DNA obtained from twenty anonymous volunteer human donors. By aligning and merging the sequenced reads based on their overlapping nucleotides, contiguous DNA segments (contigs) were built and then linked together into “sequence-contig scaffolds” using paired-read information, yielding an initial version of the human genome [2-3].
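The idea of merging reads by their overlapping nucleotides can be illustrated with a toy example. The following is only a minimal sketch of greedy overlap merging, not the consortium's actual assembly pipeline: the function names, the `min_len` threshold and the example reads are all hypothetical, and the paired-read scaffolding step is not modelled.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap.

    Stops when no remaining pair overlaps by at least `min_len` bases
    and returns the resulting contigs.
    """
    contigs = list(reads)
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        # Join the two sequences, keeping the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [
            c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)
        ] + [merged]


# Three hypothetical reads sampled from the toy sequence ATGGCGTACGTTA.
reads = ["ATGGCGT", "GCGTACG", "TACGTTA"]
print(greedy_assemble(reads))
```

With these toy reads the three fragments collapse into a single contig. Real assemblers must additionally cope with sequencing errors, reverse-complement reads and repeats, which this sketch ignores.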
This assembly had important shortcomings: besides omitting about 10% of the euchromatic genome, it was interrupted by 147,821 gaps, and the order and orientation of many segments within local regions were unknown. The assembly went on to be “finished” in 2003, two years ahead of previous projections: compared to the draft sequence, the near-complete sequence had substantially fewer gaps (341) and greater continuity, reflecting an overall improvement of about 475-fold, and the unknown local order and orientation were resolved [9]. Of the remaining gaps, 33 reflected heterochromatic regions, which were not targeted by the HGP, and 308 were in euchromatic regions, most of which were associated with segmental duplications. This is no coincidence: the highly repetitive nature of these sequences made them largely refractory to previous shotgun strategies. The HGP was a multi-year technological achievement, around which an industry has grown to supply the scientific research community with the equipment, supplies and services required [11]. Despite its flaws, the final result reported by the IHGSC served as a firm foundation for biomedical research in the decades ahead [10].
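As a quick arithmetic check on the gap figures quoted above, the short sketch below recomputes the totals. The variable names are illustrative only. Note that the gap counts alone give a smaller ratio than the 475-fold figure, which the text ties to continuity; reading it as a separate continuity measure rather than a gap ratio is an assumption made here.

```python
# Gap counts as quoted in the text (draft vs. finished assembly).
draft_gaps = 147_821
finished_gaps = 341

# Breakdown of the remaining gaps, as quoted in the text.
heterochromatic_gaps = 33
euchromatic_gaps = 308

# The two categories should add up to the finished-assembly total.
assert heterochromatic_gaps + euchromatic_gaps == finished_gaps

# Fold reduction in the number of gaps between draft and finished sequence.
fold_reduction = draft_gaps / finished_gaps
print(f"Gap count reduced about {fold_reduction:.0f}-fold")
```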
A jump to T2T
The value of having a human genomic reference assembly was clear, and the assembly produced by the HGP has been continually updated over the past decade through a dedicated effort by the Genome Reference Consortium (GRC) [12-13]. The limitations of automated Sanger sequencing showed the need for new and improved technologies for sequencing large numbers of human genomes, as was later done by the 1000 Genomes Project Consortium. Newer sequencing technologies arrived in the marketplace, referred to as Next Generation Sequencing (NGS). These technologies produce shorter reads at lower cost and in much higher data volumes [14].
The plummeting costs and massive throughput of second-generation sequencing platforms, coupled with efficient computational analysis tools, have made it possible to analyze sequencing-based experimental data on an unprecedented scale [15]. However, for de novo genome assembly, shorter read lengths mean that repetitive DNA regions, such as telomeres, create much greater problems than they did in the era of Sanger sequencing: from a computational perspective, repeats create ambiguities in alignment and assembly, which in turn can produce biases and errors when interpreting results. In more detail, repeats can erroneously collapse on top of one another and can cause complex, misassembled rearrangements, typically resulting in a chimera created by joining two chromosomal regions that do not belong near one another [16]. Therefore, it has become critical to develop new hybrid sequencing approaches, such as multiplatform strategies that combine third-generation long-read technologies, high-quality finished long-insert clones and new assembly algorithms that can accommodate these heterogeneous datasets [15].
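The ambiguity that repeats create for short reads can be shown with a toy reference. This is a minimal sketch under stated assumptions: the sequences, the repeat and the helper `find_all` are hypothetical, and exact string matching stands in for a real read aligner.

```python
def find_all(reference: str, read: str) -> list[int]:
    """Return every 0-based position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


# Hypothetical 10 bp repeat placed twice in a toy reference,
# each copy surrounded by unique flanking sequence.
REPEAT = "AGCTTGACCA"
REFERENCE = "TTGCA" + REPEAT + "GGATC" + REPEAT + "CCTAG"

# A short read that lies entirely inside the repeat.
short_read = REPEAT[2:8]

# A longer read that spans the first repeat copy plus unique flanks.
long_read = REFERENCE[2:18]

print("short read maps to:", find_all(REFERENCE, short_read))
print("long read maps to: ", find_all(REFERENCE, long_read))
```

The short read matches inside both repeat copies, so its true origin cannot be determined, while the longer read is anchored by the unique flanking bases and matches only once. This is the intuition behind using long reads to resolve repetitive regions.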
Although some repeats appear to be non-functional, others have played a part in human evolution, at times creating novel functions, but also acting as independent, ‘selfish’ sequence elements [16]. Given the functional role of repeats, and the extent of genetic diversity within and between human populations highlighted by the 1000 Genomes Project (1KGP) and gnomAD, these gaps in the sequence could obscure variation and its contribution to human phenotypes and disease, undermining the overall quality and representativeness of the human reference genome [7]. Addressing the remaining 8% of the genome, in 2022 the Telomere-to-Telomere (T2T) Consortium presented a complete 3.055 billion–base pair sequence of a human genome, T2T-CHM13, providing the most complete reference assembly for any mammal, positioned to replace GRCh38, the currently used reference, as the prevailing reference for human genetics [7, 12].
Leveraging the complementary strengths of PacBio HiFi and Oxford Nanopore ultra-long-read sequencing to assemble the uniformly homozygous CHM13hTERT cell line (CHM13), the T2T Consortium has provided gapless assemblies for all chromosomes except Y, has corrected errors in the prior references, and has introduced nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome for variational and functional studies [17]. Indeed, the T2T-CHM13 reference genome has been reported to universally improve read mapping and variant identification in a globally diverse cohort [7].
The importance of diversity in assemblies
The under-representation of ethnically diverse populations exacerbates health inequalities, preventing a full understanding of the genetic architecture of human diseases and compromising clinical practice and public health policy [18]. The T2T Consortium has taken a substantial step toward assembly models that represent all humans, a goal the Human Pangenome Reference Consortium is now pursuing [7, 12]. At the time of the HGP, the human genome was by far the largest genome to be sequenced, and its size and complexity presented many challenges for sequence assembly [4]. Despite the scientific and economic value of the reference, it has many shortcomings, including that it is not actually finished [12]. Along the path from the Human Genome Project to T2T, genomic sequencing has become crucial to the diagnosis and clinical management of patients with constitutional diseases and has quickly become an integral part of the new era of personalized and precision medicine [19]. Currently, NGS can speed up the early diagnosis of diseases and help discover pharmacogenetic markers that support personalized therapies [20]. Considering that patterns of genetic variation among populations can affect both disease risk and treatment efficacy, it is necessary to build a more inclusive resource with better representation of human genomic diversity to better serve humanity [8, 18].
Written by Emanuela Passarelli