The Entire Human Genome Finally Sequenced! Here’s What This Means
A classic Arabic proverb states: مَنْ عَرَفَ نَفْسَهُ فَقَدْ عَرَفَ رَبَّهُ “Whoever knows himself, knows his Lord”, (Ḥilyat al-Awliyā’ 10/208).
Allah said: ثُمَّ سَوَّاهُ وَنَفَخَ فِيهِ مِن رُّوحِهِ ۖ وَجَعَلَ لَكُمُ السَّمْعَ وَالْأَبْصَارَ وَالْأَفْئِدَةَ ۚ قَلِيلًا مَّا تَشْكُرُونَ “Then, He fashioned him and blew into him from His spirit. He made for you hearing, seeing, and hearts, yet little are you grateful” (Surat al-Sajdah 32:8)
Al-Baydawi commented on this verse, writing:
إِضَافَةً إِلَى نَفْسِهِ تَشْرِيفًا لَهُ وَإِشْعَارًا بِأَنَّهُ خَلْقٌ عَجِيبٌ وَأَنَّ لَهُ شَأْنًا لَهُ مُنَاسَبَةٌ مَا إِلَى الْحَضْرَةِ الرُّبُوبِيَّةِ وَلِأَجْلِهِ قِيلَ مَنْ عَرَفَ نَفْسَهُ فَقَدْ عَرَفَ رَبَّهُ
It adds nobility to himself and indicates that humanity is a wondrous creation, that his prestige is appropriate enough to enter the presence of the Lord. For this reason, it is said: Whoever knows himself, knows his Lord. Source: Tafsīr al-Bayḍāwī 32:8
Al-Ghazali writes:
فمن عرف سر الروح فقد عرف نفسه وإذا عرف نفسه فقد عرف ربه وإذا عرف نفسه وربه عرف أنه أمر رباني بطبعه وفطرته وأنه في العالم الجسماني غريب وأن هبوطه إليه لم يكن بمقتضى طبعه في ذاته
Whoever knows the mysteries of the spirit, knows himself. If he knows himself, he knows his Lord. If he knows himself and his Lord, he knows his matter is heavenly in his nature and his instinct, and that he is a stranger in the corporeal world, that his descent into it is not as a result of his nature in itself. Source: Iḥyā’ Ulūm al-Dīn 3/382.
Abu Huraira reported: The Messenger of Allah, peace and blessings be upon him, said:
إِنَّ اللَّهَ قَالَ وَمَا تَقَرَّبَ إِلَيَّ عَبْدِي بِشَيْءٍ أَحَبَّ إِلَيَّ مِمَّا افْتَرَضْتُ عَلَيْهِ وَمَا يَزَالُ عَبْدِي يَتَقَرَّبُ إِلَيَّ بِالنَّوَافِلِ حَتَّى أُحِبَّهُ فَإِذَا أَحْبَبْتُهُ كُنْتُ سَمْعَهُ الَّذِي يَسْمَعُ بِهِ وَبَصَرَهُ الَّذِي يُبْصِرُ بِهِ وَيَدَهُ الَّتِي يَبْطِشُ بِهَا وَرِجْلَهُ الَّتِي يَمْشِي بِهَا
Allah said: My servant does not grow closer to me with anything more beloved to Me than the duties I have imposed upon him. My servant continues to grow closer to Me with extra good works until I love him. When I love him, I am his hearing with which he hears, his seeing with which he sees, his hand with which he strikes, and his foot with which he walks. Source: Ṣaḥīḥ al-Bukhārī 6137.
Sahl ibn Abdullah, may Allah have mercy on him, said:
وَإِذَا عَرَفَ نَفْسَهُ عَرَفَ مَقَامَهُ مِنْ رَبِّهِ وَإِذَا عَرَفَ عَقْلَهُ عَرَفَ حَالَهُ فِيمَا بَيْنَهُ وَبَيْنَ رَبِّهِ
If one knows himself, one knows his status with his Lord. If one knows his mind, one knows his state between him and his Lord. Source: Ḥilyat al-Awliyā’ 10/201.
Sahl was asked about the saying, “Whoever knows himself, knows his Lord.” Sahl said:
مَنْ عَرَّفَ نَفْسَهُ لِرَبِّهِ عَرَّفَ رَبَّهُ لِنَفْسِهِ
Whoever defines himself for the sake of his Lord, his Lord defines him for the sake of himself. Source: Ḥilyat al-Awliyā’ 10/208.
Al-Ghazali writes:
وهو الذي إذا عرفه الإنسان فقد عرف نفسه وإذا عرف نفسه فقد عرف ربه وهو الذي إذا جهله الإنسان فقد جهل نفسه وإذا جهل نفسه فقد جهل ربه … ومن لم يعرف قلبه ليراقبه ويراعيه ويترصد لما يلوح من خزائن الملكوت عليه وفيه فهو ممن قال الله تعالى فيه نسوا الله فأنساهم أنفسهم أولئك هم الفاسقون فمعرفة القلب وحقيقة أوصافه أصل الدين وأساس طريق السالكين
The heart is that by which a human being comes to know himself. If he comes to know himself, he knows his Lord. It is that by which a human being is ignorant of himself. If he is ignorant of himself, he is ignorant of his Lord… Whoever does not know his heart, to be mindful of it, to be watchful over it, and to observe what shines over it and through it of heavenly treasures, he is one of those about whom Allah Almighty said: They forgot Allah, so He made them forget themselves, those are truly wicked, (59:19). Thus, knowledge of the heart, its realities, and its qualities is the foundation of the religion and the basis of spiritual seeking. Source: Iḥyā’ Ulūm al-Dīn 3:2-3
And Al-Ghazali writes:
ولن تصل أيها الطالب إلى القيام بأوامر الله تعالى إلا بمراقبة قلبك وجوارحك في لحظاتك وأنفاسك
O seeker, you will never be able to establish the commands of Allah Almighty unless you are mindful of your heart, your limbs, your every moment, and your every breath. Source: Bidāyat al-Hidāyah 1/28.
Al-Nawwas ibn Sam’an reported: The Messenger of Allah, peace and blessings be upon him, said:
الْبِرُّ حُسْنُ الْخُلُقِ وَالإِثْمُ مَا حَاكَ فِي صَدْرِكَ وَكَرِهْتَ أَنْ يَطَّلِعَ عَلَيْهِ النَّاسُ
Righteousness is good character, and sin is what wavers in your heart and you hate for people to find out about it. Source: Ṣaḥīḥ Muslim 2553.
And Al-Ghazali writes:
أَنْ يَعْرِفَ نَفْسَهُ وَيَعْرِفَ رَبَّهُ تَعَالَى وَيَكْفِيهِ ذَلِكَ فِي إِزَالَةِ الْكِبْرِ فَإِنَّهُ مَهْمَا عَرَفَ نَفْسَهُ حَقَّ الْمَعْرِفَةِ عَلِمَ أنه أذل من كل ذليل وأقل من كل قليل وأنه لا يليق به إلا التواضع والذلة والمهانة وَإِذَا عَرَفَ رَبَّهُ عَلِمَ أَنَّهُ لَا تَلِيقُ الْعَظَمَةُ وَالْكِبْرِيَاءُ إِلَّا بِاللَّهِ
One who knows himself, knows his Lord Almighty, and that is enough to remove arrogance. No matter what he truly knows about himself, he will know that he is abased in every way, he is small in every way. There is nothing appropriate for him but to be humble, meek, and unassuming. If he knows his Lord, he will know that glory and grandeur are not befitting for anyone but Allah. Source: Iḥyā’ Ulūm al-Dīn 3/358.
(Source: https://www.abuaminaelias.com/whoever-knows-himself-knows-his-lord/)
Abstract
In 2001, Celera Genomics and the International Human Genome Sequencing Consortium published their initial drafts of the human genome, which revolutionized the field of genomics. While these drafts and the updates that followed effectively covered the euchromatic fraction of the genome, the heterochromatin and many other complex regions were left unfinished or erroneous. Addressing this remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium has finished the first truly complete 3.055 billion base pair (bp) sequence of a human genome, representing the largest improvement to the human reference genome since its initial release. The new T2T-CHM13 reference includes gapless assemblies for all 22 autosomes plus Chromosome X, corrects numerous errors, and introduces nearly 200 million bp of novel sequence containing 2,226 paralogous gene copies, 115 of which are predicted to be protein coding. The newly completed regions include all centromeric satellite arrays and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies for the first time.
Introduction
The latest major update to the human reference genome was released by the Genome Reference Consortium (GRC) in 2013 and most recently patched in 2019 (GRCh38.p13) (1). This assembly traces its origin to the publicly funded Human Genome Project (2) and has been continually improved over the past two decades. Unlike the competing Celera assembly (3), and most modern genome projects that are also based on shotgun sequence assembly (4), the GRC human reference assembly is primarily based on Sanger sequencing data derived from bacterial artificial chromosome (BAC) clones that were ordered and oriented along the genome via radiation hybrid, genetic linkage, and fingerprint maps (5). This laborious approach resulted in what remains one of the most continuous and accurate reference genomes today. However, reliance on these technologies limited the assembly to only the euchromatic regions of the genome that could be reliably cloned into BACs, mapped, and assembled. Restriction enzyme biases led to the underrepresentation of many long, tandem repeats in the resulting BAC libraries, and the opportunistic assembly of BACs derived from multiple different individuals resulted in a mosaic assembly that does not represent a continuous haplotype. As such, the current GRC assembly contains several unsolvable gaps, where a correct genomic reconstruction is impossible due to incompatible structural polymorphisms associated with segmental duplications on either side of the gap (6). As a result of these shortcomings, many repetitive and polymorphic regions of the genome have been left unfinished or incorrectly assembled for over 20 years.
The current GRCh38.p13 reference genome contains 151 Mbp of unknown sequence distributed throughout the genome, including pericentromeric and subtelomeric regions, recent segmental duplications, ampliconic gene arrays, and ribosomal DNA (rDNA) arrays, all of which are necessary for fundamental cellular processes (Fig. 1A). Some of the largest reference gaps include the entire p-arms (short arms) of all five acrocentric chromosomes (Chr13, Chr14, Chr15, Chr21, and Chr22), and large human satellite arrays (e.g., Chr1, Chr9, and Chr16), which are currently represented in the reference simply as multi-megabase stretches of unknown bases (‘N’s). In addition to these apparent gaps, other regions of the current reference are artificial or are otherwise incorrect. The centromeric alpha satellite arrays, for example, are represented in GRCh38 as computationally generated models of alpha satellite monomers to serve as decoys for resequencing analyses (7). In the case of the acrocentrics, some sequence is included for the p-arm of Chromosome 21 but appears incorrectly localized and poorly assembled, resulting in false gene duplications that complicate downstream analyses (8). When compared to other human genomes, the current reference also shows a genome-wide deletion bias, suggesting the systematic collapse of repeats during its initial cloning and/or assembly (9).
Despite the functional importance of these missing or erroneous regions, the Human Genome Project was officially declared complete in 2003 (10), and there was limited progress towards closing the remaining gaps in the years that followed. This was largely due to limitations of its construction discussed above, but also due to the sequencing technologies of the time, which were dominated by low-cost, high-throughput methods capable of sequencing only a few hundred bases per read. Thus, shotgun-based assembly methods were unable to surpass the quality of the existing reference. However, recent advances in long-read genome sequencing and assembly methods have enabled the complete assembly of individual human chromosomes from telomere to telomere without gaps (11, 12). In addition to using long reads, these T2T projects have targeted the genomes of clonal, complete hydatidiform mole (CHM) cell lines, which are almost completely homozygous and therefore easier to assemble than heterozygous diploid genomes (13). This single-haplotype, de novo strategy overcomes the limitations of the GRC’s mosaic BAC-based legacy, bypasses the challenges of structural polymorphism, and allows the use of modern genome sequencing and assembly methods.
Application of long-read sequencing for the improvement of the human reference genome followed the introduction of PacBio’s single-molecule, polymerase-based technology (14). This was the first commercial sequencing technology capable of producing multi-kilobase sequence reads, which, even with a 15% error rate, proved capable of resolving complex forms of structural variation and gaps in GRCh38 (9, 15). The next major advance in sequencing read lengths came from Oxford Nanopore’s single-molecule, nanopore-based technology, capable of sequencing “ultra-long” reads in excess of 1 Mbp (16), but again with an error rate of 15%. By spanning most genomic repeats, these ultra-long reads enabled highly continuous de novo assembly (17), including the first complete assemblies of a human centromere (ChrY) (18) and a human chromosome (ChrX) (11). However, due to their high error rate, these long-read technologies have posed considerable algorithmic challenges, especially for the reliable assembly of long, highly similar repeat arrays (19). Improved sequencing accuracy simplifies the problem, but past technologies have excelled at either accuracy or length, not both. PacBio’s recent “HiFi” circular consensus sequencing offers a compromise of 20 kbp read lengths and a median accuracy of 99.9% (20, 21), which has resulted in unprecedented assembly accuracy with relatively minor adjustments to standard assembly approaches (22, 23). Whereas ultra-long nanopore sequencing excels at spanning long, identical repeats, HiFi sequencing excels at differentiating subtly diverged repeat copies or haplotypes.
In order to create a complete and gapless human genome assembly, we leveraged the complementary aspects of PacBio HiFi and Oxford Nanopore ultra-long read sequencing, combined with the essentially haploid nature of the CHM13hTERT cell line (hereafter, CHM13) (24). The resulting T2T-CHM13 reference assembly removes a 20-year-old barrier that has hidden 8% of the genome from sequence-based analysis, including all centromeric regions and the entire short arms of five human chromosomes. Here we describe the construction, validation, and initial analysis of the first truly complete human reference genome and discuss its potential impact on the field.
Telomere-to-telomere consortium
Introduction
We have sequenced the CHM13hTERT human cell line with a number of technologies. Human genomic DNA was extracted from the cultured cell line. As the DNA is native, modified bases will be preserved. The data includes 30x PacBio HiFi, 120x coverage of Oxford Nanopore, 70x PacBio CLR, 50x 10X Genomics, as well as BioNano DLS and Arima Genomics HiC. Most raw data is available from this site, with the exception of the PacBio data which was generated by the University of Washington/PacBio and is available from NCBI SRA.
The known issues identified in the v1.1 assembly are tracked at CHM13 issues. A UCSC browser is also available.
Data reuse and license
All data is released to the public domain (CC0) and we encourage its reuse. While not required, we would appreciate if you would acknowledge the “Telomere-to-Telomere” (T2T) consortium for the creation of this data and encourage you to contact us if you would like to perform analyses on it. More information about our consortium can be found on the T2T homepage.
Citations:
- Vollger MR, et al. Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads. Annals of Human Genetics, 2019.
- Miga KH, Koren S, et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature, 2020.
- Nurk S, Walenz BP, et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Research, 2020.
- Logsdon GA, et al. The structure, function, and evolution of a complete human chromosome 8. Nature, 2021.
The complete sequence of a human genome and companion papers
- Nurk S, Koren S, Rhie A, Rautiainen M, et al. The complete sequence of a human genome. bioRxiv, 2021.
- Vollger MR, et al. Segmental duplications and their variation in a complete human genome. bioRxiv, 2021.
- Gershman A, et al. Epigenetic Patterns in a Complete Human Genome. bioRxiv, 2021.
Assembly releases
v1.1
Complete T2T reconstruction of a human genome. Changes from v1.0 include filled rDNA gaps and improved polishing within telomeres. One rare heterozygous variant causing a premature stop codon was changed at chr9:134589924 to the more common allele. Also available at NCBI.
v1.0
Complete T2T reconstruction of a human genome, with the exception of 5 known gaps within the rDNA arrays. Polished assembly based on v0.9. Introduces 4 structural corrections and 993 small variant corrections, including a 4 kb telomere extension on chr18. Polishing was performed using a conservative custom pipeline based on DeepVariant calls and structural corrections were manually curated. Consensus quality exceeds Q60. Prior to a preprint being drafted, a brief summary can be found at this blog post. Also available at NCBI.
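For readers less familiar with the Q notation, consensus quality values such as Q60 are conventionally read on the Phred scale, where Q = -10 * log10(p) and p is the per-base error probability. The snippet below is a minimal sketch of that conversion only; the function names are illustrative and are not part of any T2T tooling.

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Per-base error probability implied by a Phred-scaled quality value."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Phred-scaled quality value implied by a per-base error probability."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Q60 is the consensus quality quoted for the polished assembly.
    print(f"Q60 -> error rate {phred_to_error_rate(60):.1e}")
```

On this scale, Q60 corresponds to roughly one expected error per million bases.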
v0.9
T2T reconstruction of all 23 chromosomes of CHM13 based on a custom assembly pipeline, briefly featuring:
- Homopolymer-compression and self-correction of PacBio HiFi reads
- Rescoring of overlaps to account for recurrent PacBio HiFi errors
- Construction and custom pruning of a string graph built over 100% identical overlaps
- Manual reconstruction of chromosomal paths through the graph, aided where necessary by ultra-long Nanopore reads
- Layout/consensus of original HiFi reads, corresponding to the resulting paths
- Patching of regions absent from HiFi data with v0.7 draft sequences
Consensus quality exceeds Q60. Mitochondrial sequence DNA included. Centers of the 5 rDNA arrays are represented by N-gaps.
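The first step in the list above, homopolymer compression, collapses every run of identical bases into a single base. The sketch below illustrates that one idea in Python; it is not the consortium's pipeline code, and the function name is an assumption made for this example.

```python
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse each run of identical characters into one character."""
    # groupby yields one (base, run) pair per maximal run of equal bases.
    return "".join(base for base, _run in groupby(seq))


if __name__ == "__main__":
    print(homopolymer_compress("AAACCGTTTT"))
```

Because the run lengths are discarded, coordinates in compressed space do not match those of the original sequence. The download list notes this for the v0.9 graph, where node consensus lengths do not match the GFA.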
v0.7
Assembly draft v0.7 was generated with Canu v1.7.1 including rel1 data up to 2018/11/15 and incorporating the previously released PacBio data. Two gaps on the X plus the centromere were manually resolved. Contigs with low coverage support were split and the assembly was scaffolded with BioNano. The assembly was polished with two rounds of nanopolish and two rounds of arrow. The X polishing was done using unique markers matched between the assembly and the raw read data, while the rest of the genome used traditional polishing. Finally, the assembly was polished with 10X Genomics data. We validated the assembly using independent BACs. The overall QV is estimated to be Q37 (Q42 in unique regions) and the assembly resolves over 80% of available CHM13 BACs (280/341). The assembly is 2.94 Gbp in size with 359 scaffolds (448 contigs) and an NG50 of 83 Mbp (70 Mbp). Outside of Chr8 and ChrX, this should be considered a draft and likely has mis-assemblies. Older unpolished assemblies are available for benchmarking purposes, but are of lower quality and should not be used for analyses. Also available at NCBI.
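The NG50 quoted above is a continuity statistic: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the genome size. N50 is the same calculation against the total assembly length instead. A minimal sketch, assuming a plain list of sequence lengths as input (the function name and the toy numbers are illustrative only):

```python
def nx50(lengths, genome_size=None):
    """Return N50, or NG50 when an estimated genome_size is supplied."""
    total = genome_size if genome_size is not None else sum(lengths)
    target = total / 2
    running = 0
    # Walk from the longest sequence down until half the target is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    # The sequences cover less than half of genome_size.
    return 0


if __name__ == "__main__":
    toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
    print("N50:", nx50(toy_lengths))
    print("NG50:", nx50(toy_lengths, genome_size=40))
```

Using the genome size rather than the assembly size keeps the statistic comparable between assemblies of different total length.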
HG002 Chromosome X
Assembly draft v0.7, generated with the same methods used for CHM13 asm v0.9, using the HG002 HiFi data available from the HPRC HG002 data freeze. Due to HiFi coverage gaps that were not patched, the assembly is missing approximately 2 Mbp on the p-arm (including the PAR).
Downloads
- Assembly draft v1.1 (md5: 1cab2b2776005cdf339ec9f283ba2c70)
- Annotation from CAT and Liftoff
- annotation gff3 file (md5: 14865ece7fe6367b8e2b06776a3d522f)
- Telomere identified by the VGP pipeline
- telomere bed file (md5: d6b148d16bf303e25552e381cddff9df)
- Liftover from v1.0
- chain file (md5: 7a6eb727ad489fe040d5ec2c8383c961)
- Alignments (the index bai file is available under the same name as the bam with .bai appended, e.g. chm13.draft_v1.1.hifi_20k.wm_2.0.1.pri.bam has a chm13.draft_v1.1.hifi_20k.wm_2.0.1.pri.bam.bai)
- PacBio HiFi alignments (generated via Winnowmap v2.01 -x map-pb) (md5: ab6b38cb00efa919f6d93bc89787a121)
- Oxford Nanopore Guppy alignments (generated via Winnowmap v2.01 -x map-ont) (md5: 5cb543ac85513995893015a3709806f4)
- PCRFree Illumina alignments (generated via bwa mem v0.7.15) (md5: bb41008d0f5de787d26896fb49027420)
- Assembly draft v1.0 (md5: 6d827b6512562630137008830c46e1ac)
- Annotation from CAT and Liftoff
- annotation gff3 file (md5: a39f18f553d5a426eaef9cfd4f858bf6)
- Telomere identified by the VGP pipeline
- telomere bed file (md5: 5cdca0c8b563b87f7a624d61ae0b5497)
- Alignments (the index bai file is available under the same name as the bam with .bai appended, e.g. chm13.draft_v1.0.wm_2.01.hifi.pri.bam has a chm13.draft_v1.0.wm_2.01.hifi.pri.bam.bai)
- PacBio CLR alignments (generated via Winnowmap v2.01 -x map-pb-clr) (md5: 235e23c72676279714a091fb226f3b1a)
- PacBio HiFi alignments (generated via Winnowmap v2.01 -x map-pb) (md5: 2380bee4c3544d179b51cf22846e33ab)
- Oxford Nanopore Guppy alignments (generated via Winnowmap v2.01 -x map-ont) (md5: 5a012ae791f48678b829da6770216f5d)
- Oxford Nanopore Bonito alignments (generated via Winnowmap v2.01 -x map-ont) (md5: 84b0b9d5935140ead1d032b0a1610c39)
- PCRFree Illumina alignments (generated via bwa mem v0.7.15) (md5: 9143c6d6dc3e8f537c49f43f9e6cbedd)
- Assembly draft v0.9 (md5: 05fd40ffc5d68a9b6754773a56381db8)
- Regions patched by non-HiFi data & rDNA loci (md5: a754f98d5e960b3d1e9029cba4414cf2)
- v0.9 assembly graph in GFA format (built over homopolymer-compressed HiFi reads) (md5: df2218db9ebbcd239d07d2544372cfa5)
- Consensus sequences for individual nodes of the v0.9 assembly graph (since the sequence is not homopolymer compressed, the lengths and overlap sizes will not match the GFA!) (md5: 086d3d968b2c8cbc8c4be891e56ad177)
- Genomic paths through the v0.9 graph (the part of chr9 that was reconstructed by a different assembly method is excluded) (md5: 913205d75f5f9c49e5269eb4363fbf16)
- Alignments (the index bai file is available under the same name as the bam with .bai appended, e.g. chm13.draft_v0.9.clr.bam has a chm13.draft_v0.9.clr.bam.bai)
- PacBio CLR alignments (generated via Winnowmap v1.11 -x map-pb-clr) (md5: 7cd9c812e4398db6ed318969fe7080f9)
- PacBio HiFi alignments (generated via Winnowmap v1.11 -x map-pb) (md5: 7527b44aba07d9acbed597fbc445b61a)
- Oxford Nanopore alignments (generated via Winnowmap v1.11 -x map-ont) (md5: 4a5bbf70193e65c35a287a70099bb99c)
- PCRFree Illumina alignments (generated via bwa mem v0.7.15) (md5: 7c13fd36ae404eb41697ec5d54ba608f)
- Chromosome X v0.7 (md5: 89b3dd61db66177dd830527b920956fa)
- Chromosome X v0.7 Nanopore rel1 unique k-mer anchored mappings (md5: ada12a00d4781f6b0101a09be19abe93)
- Chromosome X v0.7 PacBio HiFi unique k-mer anchored mappings (md5: bd22daaf6d4a2cd775f109a853a911a9)
- Chromosome X v0.7 PacBio CLR unique k-mer anchored mappings (md5: 69be7bd105ee590bf57853c249e1f8d8)
- Chromosome 8 v9 (md5: cc33037728ab1f743d3e79f85e8c10ac)
- Chromosome 8 v9 Nanopore rel5 unique k-mer anchored mappings (md5: e953525b097c98d8485a3a7b152da897)
- Assembly draft v0.7 (md5: b9777540aaa0251c7dbb4974fb0a69d6)
- Assembly draft v0.6 (md5: c3e3318e82ba5dc64b74f458f4989b85)
- Assembly draft v0.4 (md5: 7e3c2fff9479ba45f7916fa1eee1310b)
- HG002 chrX draft v0.7 (not T2T, missing p-arm PAR region) (md5: 1d79ac022424fc5671135e2ac362d91d)
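Each download above is listed with an md5 checksum, which can be compared against a locally computed digest to confirm that a transfer completed intact. The following is a minimal sketch using Python's standard hashlib module; the helper names and the local file path in the usage comment are hypothetical.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for multi-gigabyte files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True when the file's digest matches the published checksum."""
    return md5sum(path) == expected_md5.strip().lower()


# Example usage (the local file name is a placeholder):
# verify_download("my_local_copy.fasta.gz", "1cab2b2776005cdf339ec9f283ba2c70")
```

A mismatch usually means a truncated or corrupted download, in which case fetching the file again is the simplest remedy.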
Sequencing Data
HiFi Data
A total of 100 Gbp of data (32.4x coverage) in HiFi 20 kbp libraries (used for v0.9-v1.1 assemblies) is available from NCBI. An additional 76 Gbp of data (24.4x coverage) is available in HiFi 10 kbp libraries at NCBI.
Oxford Nanopore Data
Nanopore sequencing was performed using Josh Quick’s ultra-long read (UL) protocol and modifications as described in The structure, function, and evolution of a complete human chromosome 8.
We sequenced a total of 390 Gbp of data (126x coverage). The read N50 is 58 kbp and there are 219 Gbp of data in reads >50 kbp (71x). The longest full-length mapping read is 1.3 Mbp. Sequencing data was generated from three lines of CHM13 (NHGRI, UW, UCD), which all originate from the original line established by Urvashi Surti. Only the NHGRI line was karyotyped and confirmed to be stable prior to sequencing. For the NHGRI line, NHGRI (PI: Phillippy) and University of Nottingham (PI: Loose) contributed approximately 140 flowcells of UL data using Quick’s ultra-long protocol; 199 Gbp (64x, 1.4 Gbp/flowcell). The read N50 is 71 kbp and there are 128 Gbp of data in reads >50 kbp (41x). For the UW line, University of Washington (PI: Eichler) contributed 106 flowcells of UL data using a new UL protocol developed by Glennis Logsdon; 69 Gbp (22x, 0.6 Gbp/flowcell). The read N50 is 133 kbp and there are 57 Gbp of data in reads >50 kbp (18x). For the UCD line, UC Davis (PI: Dennis) contributed two PromethION cells using a ligation prep; 114 Gbp (37x, 57 Gbp/flowcell). The read N50 is 36 kbp and there are 25 Gbp of data in reads >50 kbp (8x).
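The fold-coverage figures quoted above are total sequenced bases divided by genome size. The text does not state which genome size was used; the sketch below assumes about 3.1 Gbp, a value chosen here only because it reproduces the quoted figures after rounding. The constant and function names are illustrative.

```python
# Assumption: genome size used for the coverage figures (not given in the text).
ASSUMED_GENOME_SIZE_GBP = 3.1


def fold_coverage(total_gbp, genome_size_gbp=ASSUMED_GENOME_SIZE_GBP):
    """Average depth of coverage: sequenced bases divided by genome size."""
    return total_gbp / genome_size_gbp


if __name__ == "__main__":
    # Yields in Gbp, taken from the paragraph above.
    yields = {"all lines": 390, "NHGRI line": 199, "UW line": 69, "UCD line": 114}
    for name, gbp in yields.items():
        print(f"{name}: {gbp} Gbp is about {fold_coverage(gbp):.0f}x")
```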
Read ids broken out by sequencing location are available for NHGRI, U of Nottingham, UW, and UCD.
rel7 (genomic DNA)
rel7 is the full dataset as of 2020/10/01. All data was re-called using Bonito v0.3.1.
Downloads
- Bonito 0.3.1 (md5: d56fb4b4e4a7165c8fa7315795d9d419)
rel6 (genomic DNA)
rel6 is the full dataset as of 2020/10/01, adding UW data from partitions 232-243. All data was re-called using Guppy 3.6.0 with the HAC model.
Downloads
- Guppy flip-flop 3.6.0 (md5: b6f9b702d5dd1a3407b5343fb17391b2)
- Guppy sequencing summary (md5: 9e2bb5a1fa57dfb5a743ee30d64b8613)
rel5 (genomic DNA)
rel5 is the full dataset as of 2019/09/01. All data was re-called using Guppy 3.6.0 with the HAC model.
Downloads
- Guppy flip-flop 3.6.0 (md5: fe4941f1f4c5d5b551c50faf368735fb)
rel4 (genomic DNA)
rel4 is the full dataset as of 2019/09/01. All data was re-called using Guppy 3.4.5 with the HAC model.
Downloads
- Guppy flip-flop 3.4.5 (md5: dad0b6caa4a2b03f57387c1bd8107b2f)
rel3 (genomic DNA)
rel3 is the full dataset as of 2019/09/01. All data was re-called using Guppy 3.1.5 with the HAC model. We have provided mappings both to our current draft assembly and to GRCh38 with decoys in cram format, using minimap2.
Downloads
- Guppy flip-flop 3.1.5 (md5: 92026d97a898c2f5b65074048a1caabf)
- Canu v1.9 rel3 assembly (no curation or polishing, resolves 314 BACs at Q24) (md5: a05a864eb90578f0fe36e0d774395075)
- Flye v2.5 rel3 assembly (no curation or polishing, resolves 253 BACs at Q22) (md5: 80428824ecc3ec41cde9301aa3a986d0)
- Shasta rel3 assembly (no curation or polishing, resolves 176 BACs at Q28) (md5: 4da86a6b4af5fa5c35407d7cf39c1bac)
- Guppy flip-flop mapped to asm v0.7 with minimap2 (md5: 02b8966c447f2cc9dc1ae211930fd4e3)
- Guppy flip-flop mapped to GRCh38 with decoys with minimap2 (md5: a18c3c9e9f3fa638ff348ebba0f883da)
rel2 (genomic DNA)
rel2 is the same data as rel1 but recalled with the latest generation callers (Guppy flip-flop 2.3.1). We have provided mappings both to our current draft assembly and to the GRCh38 with decoys in cram format, using minimap2.
Downloads
- Guppy flip-flop 2.3.1 (md5: 7e3f4ded02d500a3db0c76c84cdc42b9)
- Canu v1.8 rel2 assembly (no curation or polishing, resolves 287 BACs at Q20) (md5: 778ec406528e153e9b0cb74b4a4caade)
- Guppy flip-flop mapped to asm v0.6 with minimap2 (md5: 20afc508915207c5082e6f3c427739d2)
- Guppy flip-flop mapped to GRCh38 with decoys with minimap2 (md5: 1a4888cafbc935a21c17f449b4802438)
rel1 (genomic DNA)
The full dataset as of 2019/01/09. These basecalls were generated on-instrument and use older versions of Guppy (depending on when the flowcell ran on the instrument).
Downloads
- Guppy on-instrument (md5: c2cb74601eb657df21b7d25980908288)
fast5 data
The raw fast5 data, without basecalls, is available for completeness. The data is grouped into 243 sets.
- Partitions 1-94 were sequenced at NHGRI
- Partitions 95-98 were sequenced at University of Nottingham
- Partitions 99-144 were sequenced at NHGRI
- Partitions 145-224 were sequenced at University of Washington
- Partitions 225-226 were sequenced at UC Davis
- Partitions 227-231 were sequenced at NHGRI
- Partitions 232-243 were sequenced at University of Washington
- Note that when the tgz files were grouped and uploaded, some inadvertently included more than a single partition. These are denoted as partition ranges in the downloads (e.g. 145-149).
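The partition ranges above can be encoded as a small lookup table when the sequencing site of a given fast5 partition is needed. This is an illustrative sketch only: the table restates the ranges listed above, and the function name is an assumption.

```python
# (first partition, last partition, sequencing site), as listed above.
PARTITION_SITES = [
    (1, 94, "NHGRI"),
    (95, 98, "University of Nottingham"),
    (99, 144, "NHGRI"),
    (145, 224, "University of Washington"),
    (225, 226, "UC Davis"),
    (227, 231, "NHGRI"),
    (232, 243, "University of Washington"),
]


def site_for_partition(partition: int) -> str:
    """Return the sequencing site for a fast5 partition number (1-243)."""
    for first, last, site in PARTITION_SITES:
        if first <= partition <= last:
            return site
    raise ValueError(f"partition {partition} is outside the range 1-243")


if __name__ == "__main__":
    print(site_for_partition(96))
```

For archives that bundle several partitions (e.g. 145-149), looking up either end of the range gives the same site, since none of the listed ranges straddles a site boundary in that example.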
Downloads
- Partition 001 (md5: c837460c50a4446fc8320c95dc88f204)
- Partition 002 (md5: 05ceccf4256d248aaec2a4c61e58c26c)
- Partition 003 (md5: 879e3a6391e5da5f943fa46b92decd47)
- Partition 004 (md5: 600bfa46c741eeff0064b1d8040b9349)
- Partition 005 (md5: 1a72beff4b2e4556c5033176ed1cd109)
- Partition 006 (md5: fcd6f8ceeac2034eddaa33cedf6d0010)
- Partition 007 (md5: 0d44cb41a4888b55bce2cba7e70107ba)
- Partition 008 (md5: 52242770505ac9aca1070e0b926c4769)
- Partition 009 (md5: 4e85e63a4ebf8efb2f97fdcee46e5737)
- Partition 010 (md5: e495530dd8a68b7bc9864ab89a4ef52f)
- Partition 011 (md5: 3b57e6256d0162d83a281e74157134e0)
- Partition 012 (md5: 735a0a03c6bec1e0ed417baa0c2d7db2)
- Partition 013 (md5: 90c51a9ab06266b2a980bcc16d3d3960)
- Partition 014 (md5: 645ea0b4edc2bfc71c708a53d5b0d92b)
- Partition 015 (md5: 24f456adb4c1c6579fe34f07c82179e7)
- Partition 016 (md5: 6b72ddda5a7a1c10b50f3026914519ec)
- Partition 017 (md5: 14e7b918b28ecc784b68569454fa27d9)
- Partition 018 (md5: d5f7c9b1d88cf48298f6cbbb2a2a45a9)
- Partition 019 (md5: cefa121a627dfcf9a1dfb117065a7264)
- Partition 020 (md5: ca0729b28cd4cccc81eba670c6e86689)
- Partition 021 (md5: 51a873a2019f2b091ab035cc3f074bb8)
- Partition 022 (md5: e9235f052d651b4ba1fdaaa06ad134d0)
- Partition 023 (md5: 82735ac6bde6dd3064ed23ec89befbe5)
- Partition 024 (md5: e1e05425f9823e50650bd2cf1efa41c6)
- Partition 025 (md5: f8efb23a5e77b12f46bce73b2ddba36a)
- Partition 026 (md5: 829f32786514b092da9e4fb8701da037)
- Partition 027 (md5: 15ebb086d975583386c1d0e49fbca932)
- Partition 028 (md5: 202f0507d424b70b60b785d5131d28a4)
- Partition 029 (md5: 3c5b3522dd741214554f84d8645cdf20)
- Partition 030 (md5: 6e54914ef12c9b7757496b5867028067)
- Partition 031 (md5: e9501d4d0fd38d64c2ad1c81f8d1a0e3)
- Partition 032 (md5: 1f3ff51da0e87c2009bef8256b930f0b)
- Partition 033 (md5: 76a518084b021db82fd5dab7540e88bb)
- Partition 034 (md5: fd9f4dcfaeb89134a4f700a5346c16fa)
- Partition 035 (md5: dbdd53ba61d67a7f61405ae39d2b931b)
- Partition 036 (md5: c243b8f64bde0051fe104e8baaecf09b)
- Partition 037 (md5: aafa1d558881b2b4856fde3af0cbb9b2)
- Partition 038 (md5: d2e39e42eaf6a0a63d0542435590dd88)
- Partition 039 (md5: ef48d5c46f19de02fb6f6646726c95de)
- Partition 040 (md5: 17d7d34b45e14b2a79fc30e5c5084315)
- Partition 041 (md5: eb6a16d0b37d538bdbf90c3bfcc0f098)
- Partition 042 (md5: 7dbf87d75c901463b2e4e4afdc4adb52)
- Partition 043 (md5: 97c071a1d0a170e9f4809f6cdc459a6b)
- Partition 044 (md5: 27dc707435a2c98fc7201ccefec68c9d)
- Partition 045 (md5: 54ce28e1e1b54ab9fd8dd072711acd30)
- Partition 046 (md5: b174c7826fc399312fad331660745e55)
- Partition 047 (md5: 2b6ce400051fce5d2de09fd8fd461fc8)
- Partition 048 (md5: 81415b29f2b6a605473af6d3529758b1)
- Partition 049 (md5: ffc9182d8a9ad9752b6571d3d2f2b69d)
- Partition 050 (md5: 790281fcf0512a798b6f0e75b14620be)
- Partition 051 (md5: 4fc5dc17819a3727e5cedaa89550ef9f)
- Partition 052 (md5: d33a70e926dee0e67cf1a75d50ee1249)
- Partition 053 (md5: 04e641cfc8bbe7233773fc38add3fbd2)
- Partition 054 (md5: 958b62e07349258d93ee3e089c6f91ff)
- Partition 055 (md5: 55f74869ce3303277edf2225b4796fa0)
- Partition 056 (md5: 29b205c649f66e3d44ea9f598b492bc2)
- Partition 057 (md5: 7336b91e333ae912b4cfc6e366570c54)
- Partition 058 (md5: 2d992482005a2523f710487f2c0a0a31)
- Partition 059 (md5: 3b45c205982796a90aa0f40955c4937b)
- Partition 060 (md5: f085ae6a4818c44d03a6f5adfc445699)
- Partition 061 (md5: 1c5a3a0ed8b53a930535b9d34e6a0667)
- Partition 062 (md5: fbfd4ffb7cf8fca4d613d0ec67d3104c)
- Partition 063 (md5: 9ddf7a9fe7e9cf8ceb02b8debed41fcc)
- Partition 064 (md5: ee3ac8080a19d4a6ab3af84074d03d7a)
- Partition 065 (md5: d94a12692d399c44612cab8b2aea8164)
- Partition 066 (md5: a9f3bfa69bbc248b33f99f42827331eb)
- Partition 067 (md5: 6c9d4b38edc6f78521f3cfdd8edc571c)
- Partition 068 (md5: 76a29683bfad7c4a0b8a0bdbbbd6fd49)
- Partition 069 (md5: f924667636c528d56e46aa92db0a182d)
- Partition 070 (md5: f813b0a4b2a4a2353c7deb539f16f286)
- Partition 071 (md5: fa56e2524ea2cc57e79f692466375b83)
- Partition 072 (md5: 23b1df220d55ab9df2735c74849a53c9)
- Partition 073 (md5: 70839cbc61d3d8af7fafcb7ba8f96461)
- Partition 074 (md5: 109b91ceda32ab0f8b9edb24cb35fb23)
- Partition 075 (md5: 53c466af09a3a119df3255189091bcda)
- Partition 076 (md5: 22ad2327db64767e34378508afe60706)
- Partition 077 (md5: 64c7c1702e3476137c54ebc0c07d970e)
- Partition 078 (md5: 6e2048a8a2ceb36bb679455e0af81230)
- Partition 079 (md5: 45717c24fe844f2605be81bd8e15d856)
- Partition 080 (md5: 1ac20637828f0f3115f1c0f289e006aa)
- Partition 081 (md5: e7b5e584de5f2cbda1d53ec2f6e2668e)
- Partition 082 (md5: aad214d168ad3a59488dfac71fcedc22)
- Partition 083 (md5: d557dee3b08c61d540fd6a00689341fa)
- Partition 084 (md5: cc2b4676515b988dd4f64724e49c3304)
- Partition 085 (md5: 34e6154991e5d5c641e22a529c5f06e1)
- Partition 086 (md5: 2f9ff4371f32c3a33ea081ad8825437e)
- Partition 087 (md5: 945504e89ba54cdab032eac63985d216)
- Partition 088 (md5: 46a8ba05cb12b268c7f7ce04575d24da)
- Partition 089 (md5: 5fd0219c9c99aa08ce07bb35e647144c)
- Partition 090 (md5: da0e3f19f81c99a89bcff7e8f74dc6cb)
- Partition 091 (md5: c11b11f3386d47dd33acc3cba7f44fb2)
- Partition 092 (md5: 87dfa60ae9308214b43aa7075ddd9f44)
- Partition 093 (md5: 6eced035881d3e804bea7103d26c042e)
- Partition 094 (md5: 59ebbc64994779244e5f7431c54b819e)
- Partition 095 (md5: 4de3c1f5163357a256847c1082379df3)
- Partition 096 (md5: cf16e88c803b82b052651171490d6d5a)
- Partition 097 (md5: bcf0e6944fb937bdda07a68530e63f01)
- Partition 098 (md5: 22c785691baddf3bddf3b0c77080adf2)
- Partition 099 (md5: 9fa25adca355abe3161060393b40de45)
- Partition 100 (md5: 12c7eabb92e0c9fe7ac4fcfa6f4a2795)
- Partition 101 (md5: 020bae3d98d9c5b2df8faca3f8e46ead)
- Partition 102 (md5: dd4d7a7c6d682271bb9d76cc8cd2f284)
- Partition 103 (md5: 8ea12825fced78d35d6e427c02db33db)
- Partition 104 (md5: 2137c06f010f11aa150a3e431fb502b3)
- Partition 105 (md5: 1c0e69e080eb86fc4f46bf91780f7dbe)
- Partition 106 (md5: c5acaa0cf6786fa2420fe938c564f743)
- Partition 107 (md5: dadc81bebb317516b57329cf8a79dd8d)
- Partition 108 (md5: 020d60709c8892d8abc24f2cb3abadd1)
- Partition 109 (md5: 1b3463481d8203bf617705d1becb86d5)
- Partition 110 (md5: e04fdfdd42d9e6b3f1cc10d54a0ea738)
- Partition 111 (md5: ee6b441916a8170fc3c59958180c9af0)
- Partition 112 (md5: 5eeef61c820be9c7826226d0b5eaadfb)
- Partition 113 (md5: 3b8d107886b0b0f2c7a046e96bcf6693)
- Partition 114 (md5: cbfbe53039a2196d8c043de6af850e2a)
- Partition 115 (md5: 69448ff84cac02071991de26dd60e9e6)
- Partition 116 (md5: 128875eca40e2ac2ef52653724afd579)
- Partition 117 (md5: 8c9b722f6f5cf25b26573a1f1d8807f1)
- Partition 118 (md5: d0ecc4997ed5b2e9db2d7418b55bf017)
- Partition 119 (md5: 899dec0634e97a2a1bdc73ee375b7c84)
- Partition 120 (md5: b17d8467e54d28f3a0748f8e0d86305b)
- Partition 121 (md5: 3d25e5440f17ad324d3b9176c31443dd)
- Partition 122 (md5: 3d89e66d0558babc5b42488e9d7e9b09)
- Partition 123 (md5: 5a48a1424b00933956a21582a66b4ef9)
- Partition 124 (md5: 02ca9ed9b6570a8e5fd8f862adc1ae9d)
- Partition 125 (md5: 7ed0458200f8499ee0529ae691460de0)
- Partition 126 (md5: 9f641a474a8eed64b658e48d0004cde0)
- Partition 127 (md5: a28c15458c2df4455d35c3d1b6f9d0f8)
- Partition 128 (md5: cd1f20b2f3dc7a6d293e2dcc30f3d70d)
- Partition 129 (md5: b269d0ad7ee3879ce92d10aa0b817f6a)
- Partition 130 (md5: 8179021a84f457f545265d80b061640d)
- Partition 131 (md5: 1104dd5a0c900d9862017b1196d9109f)
- Partition 132 (md5: 936403b29d7022f19e988caa5e4885c9)
- Partition 133 (md5: 97b539127ed11106ef75df3759391e92)
- Partition 134 (md5: a7409210b3f3ac08b14b5833b8dc97f4)
- Partition 135 (md5: c266e484108dd9beb2ae23bc6f17cedb)
- Partition 136 (md5: 3047a843da59b3020669116228961b3d)
- Partition 137 (md5: 83a41de61e307a4d184ec94a6d3dc5c7)
- Partition 138 (md5: 8c0ae1c83c4a968218a7193836613b48)
- Partition 139 (md5: 5a69a9b10a0b509ad18a4531901cb128)
- Partition 140 (md5: a6eda1f81e1d528b9bffd07c9e701ff6)
- Partition 141 (md5: 871020ffbc86e0abacc92174d00f97d6)
- Partition 142 (md5: 0fa6572159ec40ebda3400316bceb036)
- Partition 143 (md5: 1275a1587f52129a2eb31b1c4c0ca10c)
- Partition 144 (md5: cd983796c2c94d4dd0b19af8b134ef1f)
- Partition 145-148 (md5: d07b77482f5ae167fa8806f51ac0db3c)
- Partition 149-150 (md5: 0d045ac58956492f98b1937faae88d06)
- Partition 151-152 (md5: 54fdf8a38733d0fe01add1c4695a7d89)
- Partition 153-154 (md5: 4b8b18a9a1f12047e0309365aecc4832)
- Partition 155-157 (md5: dec98ff1d34ab863b5bf2a0356001089)
- Partition 158 (md5: 0d9df61f2eaa0b723fca237034ce74b6)
- Partition 159 (md5: 5acd09c60776e515831d1c2f547da1e1)
- Partition 160 (md5: 46142bd5341d2c5fae08a479dda540a8)
- Partition 161 (md5: 15fe73bc61b67762a6d8d99ac695bae7)
- Partition 162 (md5: b20122665bcf330c9305b000a759c0e3)
- Partition 163 (md5: 98f1933ee44988d0d60edd065aea745f)
- Partition 164-166 (md5: 27330b662904fe3921ec1fbe9a5e0a39)
- Partition 167-168 (md5: 5517cdbaf3d851fa5787711eaca8192b)
- Partition 169-171 (md5: 14524b5fa02d41bd2c983322a4321099)
- Partition 172 (md5: 2a2c642990854cc005ebbde51dbec56a)
- Partition 173 (md5: 32adcc1d1e9d122d2cffab3648cbd1b7)
- Partition 174 (md5: 52905ca54875717a3e3d4cdb5955df46)
- Partition 175 (md5: 11c847f4e695cab036315ae2428cf80b)
- Partition 176 (md5: dcb226eef40a0bef2fd2d5d26f13b88c)
- Partition 177 (md5: 0192f5f9618f119f7cba8f58f2f1fe68)
- Partition 178 (md5: bf9fc0582a6f1f5e4b419f6dd3ecc949)
- Partition 179 (md5: dd6a42f110d1be4041d1fa23403070c2)
- Partition 180 (md5: 817ea0768e2275a6240aefba5e9402b8)
- Partition 181 (md5: fdb66d1d5ec39338673e857c2aa69a87)
- Partition 182 (md5: e5bf39ba1d99e5337b294ae428a2c72d)
- Partition 183 (md5: d8de690d6db614c887c1799c9abfba89)
- Partition 184 (md5: adf710282e35b7cd1f0ea77fcdc32c5a)
- Partition 185 (md5: 4ee185a2889430f63f1960219a68ed78)
- Partition 186 (md5: e51dd1c3db3348ce6fa1d089bfbaaa26)
- Partition 187 (md5: c3b6b5476e3982bcc19b3026aa9786b4)
- Partition 188 (md5: 2eae48d373c6cb85e6b652a7db224f7f)
- Partition 189 (md5: 80719a5ce718ac1a7935ce5ca90e1ac7)
- Partition 190 (md5: f8ae9a7954e6ade74bfbbc10772b5b77)
- Partition 191 (md5: cf16a4d22b1f1e7c86b68bb51789e473)
- Partition 192 (md5: a797cbb62e49135fbc152ec8497d5370)
- Partition 193 (md5: 33e754109a7882fd7c068416859a0695)
- Partition 194 (md5: 7ff95e524daa1d937c3e65213ba901b7)
- Partition 195 (md5: 00c8789114fbaed7fb5f884fdca96346)
- Partition 196 (md5: e307840ffdbddaf7237a65efa8c85188)
- Partition 197 (md5: 9dcee9a1d3d576599d31dda1c8b38ff8)
- Partition 198-199 (md5: b57b2927e330af13858cfb6ff8ed13bd)
- Partition 200-201 (md5: e4aa7c85f2c1513a0669bf85bab92832)
- Partition 202 (md5: c48d830ee1710f23d125301172954e35)
- Partition 203 (md5: d60a220607d465dc3f6cf2779efb1262)
- Partition 204 (md5: 6fc8204a147898ba1d93945a78b33be8)
- Partition 205-206 (md5: 95b0fa324b0341668ba55675bed664d1)
- Partition 207 (md5: 37d370034bcd5503baf1fda12b184def)
- Partition 208 (md5: 1922135013379a366119367e780915ff)
- Partition 209-210 (md5: b60425e6503e5a3618d0019189aa6a17)
- Partition 211-212 (md5: 316efb5752cbedca3593a704f83178ce)
- Partition 213 (md5: 102ce625ade4851c6ef77350c1d66bba)
- Partition 214 (md5: b56af7f1bb450d861d9eecfc681e3f09)
- Partition 215 (md5: 6524dbc2d7f6b8bc65c30ff68d400e00)
- Partition 216 (md5: c24fdefe4aa9a563f06d33367264e57d)
- Partition 217 (md5: 14e79f6d50ce4d2b91364ac176bb9170)
- Partition 218 (md5: 5b71e4c287589290c699d78597eb0fe0)
- Partition 219-220 (md5: 06220779ab619d8f6ec927b5a53f5bce)
- Partition 221 (md5: d07c41c9dbf5f6fe7c0745c4323d4a36)
- Partition 222 (md5: 5fb46b69c2192b8c77a6505ccf0a3499)
- Partition 223 (md5: 036b920192151a51a561792cb3257ecf)
- Partition 224 (md5: a0a16ac031a6bafdba6c299282f5275a)
- Partition 225 (md5: 378aebc21351b13ba643bb83645ae860)
- Partition 226 (md5: d99c0ef473cec223269f1c91a6d99bc7)
- Partition 227 (md5: 0d9a266167d7b1429866dbbff76427fb)
- Partition 228 (md5: 246d9aebae66f06767e0177e1d26a735)
- Partition 229 (md5: 14934a026120af86908254d0c336a144)
- Partition 230 (md5: ef20895ee39928e8c77e57be3f11afe0)
- Partition 231 (md5: ccec880307a9c9999aa7d468df4911c5)
- Partition 232 (md5: 81917bd7ce37628b5b7438dc531147ff)
- Partition 233-234 (md5: 798f496e2c086dccdb7f62db866c1525)
- Partition 235 (md5: 28da4a29c963879ea6af14fd7ea47313)
- Partition 236 (md5: af43d8d1e1420531c33c8f63012a5582)
- Partition 237 (md5: 1cb0cfc3a263f331eb45fb47082d1f15)
- Partition 238 (md5: b7df2be60ead2d63e3358cda9a532b12)
- Partition 239 (md5: d62b1dc41ff27bb6c090467c0b390363)
- Partition 240 (md5: 4a870d135d93ec1e834ceba7c203061e)
- Partition 241 (md5: 1135fbd234a5b635d6155b8f400bc8ab)
- Partition 242 (md5: d4a08ac562906f07f282e078e3db4c5e)
- Partition 243 (md5: 066517834bf644bcb0e6f76e34213dac)
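Each entry above pairs a download with its expected MD5 digest. As a minimal sketch, assuming GNU `md5sum` is installed, a downloaded file could be checked against the published value as follows; `verify_md5` is a hypothetical helper name, not part of any tool mentioned here.

```shell
# Compare a file's MD5 digest with the value published in the list.
# Usage: verify_md5 FILE EXPECTED_MD5   (hypothetical helper)
verify_md5() {
    file=$1
    expected=$2
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(md5sum "$file" | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (expected $expected, got $actual)" >&2
        return 1
    fi
}
```

For example, `verify_md5 downloaded_file 4fc5dc17819a3727e5cedaa89550ef9f` would check Partition 051, where `downloaded_file` is a placeholder for whatever local name the partition was saved under.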
10X Genomics Data
Raw fastq files
Approximately 50x of data was generated on a NovaSeq instrument. Based on the summary output of Supernova, there are 1.2 billion reads with 41x effective coverage. The mean molecule length is 130 kbp, with an N50 of 864 reads per barcode.
Downloads
- CHM13_prep5_S13_L002_I1_001 (md5: 84af4586ca9f78060d5802b36cdd9e8a)
- CHM13_prep5_S13_L002_R1_001 (md5: 231633e0cf2fbdeba732dc7ad6233fa0)
- CHM13_prep5_S13_L002_R2_001 (md5: 386febfc3fc760e11e315e69310ed3d8)
- CHM13_prep5_S14_L002_I1_001 (md5: f0b7628e90dfaf2f702ec613c7b61ca7)
- CHM13_prep5_S14_L002_R1_001 (md5: 86afbc7a41ea1c81657bf1ca64d1178c)
- CHM13_prep5_S14_L002_R2_001 (md5: 3dfbe58b5ae715213e20614837dcf3b7)
- CHM13_prep5_S15_L002_I1_001 (md5: ee34f03c765787ea069050d8eaac1de4)
- CHM13_prep5_S15_L002_R1_001 (md5: 73edcb56dd18d7b7b2705b4db7b4efc5)
- CHM13_prep5_S15_L002_R2_001 (md5: a0de8e5bc127203129e4e1437b3e6aaa)
- CHM13_prep5_S16_L002_I1_001 (md5: 42db246f7e5725a7b6ff3f5f5aedfd6e)
- CHM13_prep5_S16_L002_R1_001 (md5: 3d3db7eccaf388fbcd901cbc6ad47630)
- CHM13_prep5_S16_L002_R2_001 (md5: 9dfcc17398a7acd906212a09ab4c8903)
BioNano DLS Data
Approximately 430x of data was generated using the Saphyr instrument and the DLE-1 enzyme. There are 15.2 M molecules with an N50 molecule length of 115.9 kbp and a max of 2.3 Mbp (2 M molecules > 150 kbp, N50 218 kbp). The assembly of the molecules is 2.97 Gbp in size with 255 contigs and an NG50 of 59.6 Mbp.
Downloads
Hi-C Data
A library was generated using an Arima Genomics kit and sequenced to approximately 40x on an Illumina HiSeq X.
Downloads
- CHM13.rep1_lane1_R1.fastq.gz (md5: 41d2f26eb1f958723e28e32ca471b680)
- CHM13.rep1_lane1_R2.fastq.gz (md5: 2747aaf1d128182bcaa151098e0abe74)
- CHM13.rep2_lane1_R1.fastq.gz (md5: 26ce58141bb25b4931512ec4cf176f64)
- CHM13.rep2_lane1_R2.fastq.gz (md5: 77b71bd1067c6e4e908a9aaa05f4bd73)
RNA-seq data
Two separate poly-A prep libraries were generated at UC Davis and sequenced as 2×150 bp RNA-seq reads on an Illumina NovaSeq (~25 million paired-end reads each).
Downloads
- CHM13_1_S182_L002_R1_001.fastq.gz (md5: 4bbbc3bea152273d8d609c54d66c6d82)
- CHM13_1_S182_L002_R2_001.fastq.gz (md5: 3c9445f5370fbf85e5af8d8c44ad3379)
- CHM13_2_S183_L002_R1_001.fastq.gz (md5: 61ef6c5bb88286af497f8dcc8d32a5dc)
- CHM13_2_S183_L002_R2_001.fastq.gz (md5: 5c49547f57f2b5fd795b8c87cdfbdb6f)
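The download lists on this page share one shape, `- NAME (md5: DIGEST)`. As a rough sketch, assuming the list has been saved to a local text file, the entries can be rewritten into the `DIGEST  NAME` layout that `md5sum -c` reads, so a whole batch of downloads can be checked in one pass. The function name `list_to_manifest` is hypothetical.

```shell
# Turn "- NAME (md5: DIGEST)" list entries into "DIGEST  NAME" lines,
# the format read by `md5sum -c`. Lines that do not match (headings,
# prose) are dropped.
list_to_manifest() {
    sed -n 's/^- \(.*\) (md5: \([0-9a-f]\{32\}\))$/\2  \1/p'
}
```

A possible use, with `list.txt` as a placeholder file name: `list_to_manifest < list.txt > checksums.md5`, followed by `md5sum -c checksums.md5` in the directory holding the downloaded files.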
Previously generated PacBio data
The PacBio data (both CLR and HiFi) was previously generated and is available from the SRA. The cells used for Arrow polishing of the v0.7 assembly are listed here.
Notes on downloading files.
Files are generously hosted by Amazon Web Services. Although they are available as straightforward HTTP links, download performance is improved by using the Amazon Web Services command-line interface. References should be amended to use the s3:// addressing scheme, i.e. replace https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/ with s3://human-pangenomics/T2T/ to download. For example, to download CHM13_prep5_S13_L002_I1_001.fastq.gz to the current working directory, use the following command.
aws s3 --no-sign-request cp s3://human-pangenomics/T2T/CHM13/10x/CHM13_prep5_S13_L002_I1_001.fastq.gz .
To download the full dataset instead, use the following command.
aws s3 --no-sign-request sync s3://human-pangenomics/T2T/CHM13/ .
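The prefix replacement described above can be scripted. The following is a minimal sketch in POSIX shell; the helper name `to_s3_uri` is hypothetical, and the function only handles the one HTTPS prefix quoted in this section, passing anything else through unchanged.

```shell
# Rewrite an HTTPS link into the equivalent s3:// URI by swapping the
# documented prefix; any other input is printed as-is.
to_s3_uri() {
    https_prefix='https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/'
    case $1 in
        "$https_prefix"*)
            # Strip the HTTPS prefix and prepend the s3:// one.
            printf 's3://human-pangenomics/T2T/%s\n' "${1#$https_prefix}"
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
}
```

With a link held in a shell variable `url`, the download then becomes `aws s3 --no-sign-request cp "$(to_s3_uri "$url")" .`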
The s3 command can also be used to get information on the dataset, for example reporting the size of every file in human-readable format.
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/
To obtain technology-specific sizes, use the following commands.
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/nanopore/fast5
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/nanopore/rel2
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/assemblies
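When only the totals matter, the per-file listing can be filtered out. This sketch assumes that `--summarize` appends lines beginning with `Total Objects:` and `Total Size:` to the listing; both function names are hypothetical.

```shell
# Keep only the "Total Objects" / "Total Size" lines from a listing.
# Assumption: these are the summary lines appended by --summarize.
summary_only() {
    grep -E '^ *Total (Objects|Size):'
}

# Print the summary for each prefix given, relative to the CHM13 path.
report_sizes() {
    for prefix in "$@"; do
        echo "== $prefix"
        aws s3 --no-sign-request ls --recursive --human-readable --summarize \
            "s3://human-pangenomics/T2T/CHM13/$prefix" | summary_only
    done
}
```

For the three locations above, this would be called as `report_sizes nanopore/fast5 nanopore/rel2 assemblies`.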
Amending max_concurrent_requests and related settings, as per this guide, will improve download performance further.
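As an illustration only, these settings live under the `s3` key of a profile in the AWS CLI configuration file (typically `~/.aws/config`). The values below are assumptions chosen for the example, not recommendations from this page; suitable numbers depend on the available bandwidth and machine.

```ini
[default]
s3 =
  # Number of transfers run in parallel (illustrative value).
  max_concurrent_requests = 20
  # Number of tasks that may be queued for transfer (illustrative value).
  max_queue_size = 10000
```

The same values can also be set from the command line, for example with `aws configure set default.s3.max_concurrent_requests 20`.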
You can also browse all the files available on S3 via the web interface.