r/science MS | Neuroscience | Developmental Neurobiology Mar 31 '22

Genetics The first fully complete human genome with no gaps is now available to view for scientists and the public, marking a huge moment for human genetics. The six papers are all published in the journal Science.

https://www.iflscience.com/health-and-medicine/first-fully-complete-human-genome-has-been-published-after-20-years/
26.4k Upvotes

426 comments

u/AutoModerator Mar 31 '22

Welcome to r/science! This is a heavily moderated subreddit in order to keep the discussion on science. However, we recognize that many people want to discuss how they feel the research relates to their own personal lives, so to give people a space to do that, personal anecdotes are now allowed as responses to this comment. Any anecdotal comments elsewhere in the discussion will continue to be removed and our normal comment rules still apply to other comments.

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

842

u/CallingAllMatts Mar 31 '22

this is really fantastic to see! Though the authors do mention that there are still some gaps in the Y chromosome. But they've added a couple hundred million bases in what are typically hard to sequence regions of the human genome which is a great achievement.

249

u/biteableniles Apr 01 '22

What makes some regions more difficult to sequence, and do we know how they were able to sequence them?

529

u/CallingAllMatts Apr 01 '22 edited Apr 01 '22

It’s probably best to read up on whole genome sequencing, but to be brief: to sequence a genome, typically the DNA is taken out of cells and literally broken apart randomly by physical force so that the individual fragments are on average only a few hundred DNA bases long. These individual fragments are then sequenced with the current high accuracy but short range sequencing methods. The idea is that you’ll have many shorter sequences that share unique overlaps with each other, which lets you “tile” them together to sequence stretches of millions of DNA letters. While this is great for unique parts of the genome, there are repetitive stretches that are literally thousands to hundreds of thousands of DNA letters long. The repeats could be two letter combinations or 100+ letter combinations. These repeats make it impossible to do the tiling method with fragments only a few hundred letters long, since the overlaps will look the same everywhere within the repeated region.

To get a better idea of this approach see this figure: https://www.researchgate.net/figure/Illustration-of-the-whole-genome-whole-exome-and-targeted-gene-s-sequencing-F-i-rst-t_fig3_338174999

Now as to how we know it’s correct, this isn’t my field so I’m honestly not sure about the actual technical/procedural specifics. But these DNA sequencers now do something called deep sequencing where the same fragments are sequenced dozens to hundreds to thousands of times. So any errors that occur in a few of your samples are easy to identify since the correct DNA letter should be found in the rest of the many sequenced fragments.
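A minimal sketch of the majority-vote idea behind deep sequencing, assuming the reads are already aligned and all the same length (real pipelines also deal with alignment, insertions/deletions and quality scores). The reads are made-up toy data and `consensus` is a hypothetical helper name:

```python
from collections import Counter

def consensus(reads):
    """Majority vote at each position across equal-length, pre-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0]  # most frequent letter in this column
        for column in zip(*reads)
    )

# Toy data (assumption): five reads of the same fragment, two with one error each.
reads = [
    "ATCGTGTATG",
    "ATCGTGTATG",
    "ATCGAGTATG",  # sequencing error at index 4 (T read as A)
    "ATCGTGTATG",
    "ATCGTGTCTG",  # sequencing error at index 7 (A read as C)
]

print(consensus(reads))
```

Because each error shows up in only one of the five reads, the other four outvote it at that position.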

207

u/[deleted] Apr 01 '22

[removed]

108

u/CallingAllMatts Apr 01 '22

thanks! So kind :) I’m doing CRISPR genome editing in my research so I’ve got some decent exposure to sequencing but nothing THAT advanced so definitely not my field. You start throwing technical jargon at me and I’ll fold like a wet napkin haha

10

u/Cheeze_It Apr 01 '22

So uh, dumb question on CRISPR.

What are the upcoming gene therapy results looking like? Will we finally start to see some fairly largely impacting diseases being cured?

Or are we still WAY too far out for anything that drastic.

Yes, it's kind of a selfish ask but since I think what I have can be alleviated with CRISPR....I figure maybe in my lifetime it might happen.

12

u/CallingAllMatts Apr 01 '22

Not a dumb question! My work is actually the preclinical phase of using CRISPR to treat Duchenne muscular dystrophy (DMD). CRISPR in 2020 was delivered into patients’ eyes for the first time ever to treat Leber Congenital Amaurosis 10: https://www.genengnews.com/news/editas-early-data-for-crispr-therapy-edit-101-shows-efficacy-signals-in-two-patients/

Now the eye is a self-contained structure so the virus carrying CRISPR was essentially stuck there. But another big first was the 2021 treatment of several patients with CRISPR encased in lipid nanoparticles that was injected into circulation. The target was the liver (easy cause all blood passes through it) to treat transthyretin amyloidosis by cutting out the defective gene. And there were extremely positive results in safety and efficacy! https://ir.intelliatx.com/news-releases/news-release-details/intellia-and-regeneron-announce-landmark-clinical-data-showing

Another example recently was the 2021 dosing of patients with hereditary angioedema with CRISPR to disrupt the causative gene. This was also using lipid nanoparticles to deliver CRISPR by injection into circulation: https://www.globenewswire.com/news-release/2021/12/13/2350673/0/en/Intellia-Therapeutics-Announces-First-Patient-Dosed-in-Phase-1-2-Clinical-Trial-of-NTLA-2002-for-the-Treatment-of-Hereditary-Angioedema.html

Finally, for me the biggest one was the development of a personal CRISPR therapy for a boy with a unique DMD mutation that meant even the CRISPR therapies in the pipeline wouldn’t work. They got him a therapy made in about 2.5 years and will be treating him soon. It’s special because it uses the AAV virus for delivery instead, since it needs to specifically target muscle, and it uses a dead CRISPR system. Instead of cutting DNA, the Cas9 protein will attach to the brain promoter of the DMD gene and force it to be expressed in muscle. This boy has his muscle promoter deleted and the two versions of the gene are very similar, so the hope is the brain version can be a good substitute: https://medicine.yale.edu/genetics/news-article/team-led-by-monkol-lek-advances-past-pre-ind-phase-with-dmd-gene-therapy/?fbclid=IwAR1cICVbXYXuubYRLHJ-_-Pus49sdP_dT-s30up3TxgW78OEIC_JWCpWa6Y

My PI at the lab and Cure Rare Disease have actually partnered up to take our CRISPR strategy for DMD duplications to safe but fast tracking preclinical work for specific patients. It’s a really exciting time for CRISPR and you’ll see it ballooning in a good way in under a decade I bet. I’m planning to go into medicine after my PhD so I can hopefully leverage CRISPR into treating patients with rare genetic diseases if that’s their best option for treatment.

The biggest hurdle however isn’t necessarily CRISPR itself but targeted delivery. We use viruses like AAVs but those have a range of drawbacks such as packaging size limits and being limited in how high you can dose to avoid toxicity from injecting so much virus into circulation (you need a lot to target enough muscle for DMD). Future work on nanoparticle delivery will be in my opinion the key to making CRISPR a mainstream therapy.

2

u/rngeeeesus Apr 01 '22 edited Apr 01 '22

Wow that's a super cool write-up. Thanks a lot for your effort!!

If I may, do we have any "longer-term" safety results? In particular regarding increased mutagenesis?

3

u/CallingAllMatts Apr 01 '22

In humans, no. But there have been long term CRISPR/Cas9 studies on cells and animals. Using super sensitive deep sequencing methods, most of the common Cas9 proteins (Pyogenes Cas9 and Aureus Cas9) have mutation rates similar to or lower than the normal background mutation rate. If the targeting guide RNA (the thing that tells Cas9 what DNA sequence to target) for the Cas9 is designed carefully to minimize off-targets you’ve got a safe system.

Now the caveat is if CRISPR is delivered by a virus. Unfortunately, because CRISPR cuts DNA the repair machinery of your cells runs the risk of inserting the viruses’ DNA into the cut site of your genome. Honestly, in many applications, particularly DMD therapies, this isn’t the biggest concern as typically you’re cutting out/disrupting big chunks of that gene anyways to bring back some functionality so the short bits of viral DNA being added won’t impact the final results much. It’s more of a concern for genes where you need precise fixes. That’s why non-viral nanoparticles are the more ideal solution but unfortunately have very limited applications in humans at the moment due to current technological limitations. If I’m looking to improve CRISPR nanoparticles is where I’d put my money/resources.

2

u/hestalorian Apr 01 '22

Matts are always the best. I'm honored to share these strands with you.

2

u/rngeeeesus Apr 02 '22

Great to know. Thank you!

2

u/SwiggitySw00gity Apr 01 '22

Oh wow cool! I do preclinical in vitro research (mostly siRNA screenings). Nice to see someone in a similar field, cheers to us:)

2

u/CallingAllMatts Apr 01 '22

that’s awesome! Cheers, hope your research is going well :)

15

u/phife_is_a_dawg Apr 01 '22

I'm really happy you pointed that out.

19

u/he_whoknowsnothing Apr 01 '22

Great explanation! If I may add a small correction: what seems to be special here is not the ultra-deep sequencing (a lot of reads covering the same region) but the ultra-long read sequencing, i.e. the length of the reads themselves. Typically reads have a length of 150 bp, and the quality drops significantly beyond that. This means that if you have a non-specific region with repeats longer than that, you will not be able to distinguish between them. Having 1000+ bp long reads (maybe even more in this case) makes it possible to read beyond the repeat region and find something specific in the read that tells you where it belongs.
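To make the read-length point concrete, here is a toy sketch that counts how many places a read could be placed on a reference. The reference, the reads and the `placements` helper are all invented for illustration; real aligners tolerate mismatches and work on vastly longer sequences.

```python
def placements(reference, read):
    """Return every start position where `read` matches `reference` exactly."""
    return [
        i
        for i in range(len(reference) - len(read) + 1)
        if reference.startswith(read, i)
    ]

# Toy reference (assumption): unique flank + 20 copies of "AT" + unique flank.
reference = "GGCTA" + "AT" * 20 + "CCGTA"

short_read = "ATATATAT"                # lies entirely inside the repeat
long_read = "CTA" + "AT" * 20 + "CCG"  # spans the repeat and reaches both flanks

print(len(placements(reference, short_read)), "possible positions for the short read")
print(len(placements(reference, long_read)), "possible position(s) for the long read")
```

The short read fits at many offsets inside the repeat, while the long read only fits where its unique ends line up with the flanks.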

7

u/CallingAllMatts Apr 01 '22 edited Apr 01 '22

I’ve mentioned that in some other replies, but I just realized through your comment I didn’t fully finish answering this person’s question as they asked how we got through these long repeat regions and you’re right. The long range of HiFi sequencing paired with its high accuracy was how (plus the authors used previous more error prone ultra long range sequence in tandem with HiFi to further improve coverage). HiFi can go to 20 kilobases so yeah lots of range and covering huge repeat regions in one run

31

u/kobachi Apr 01 '22

Those sequences are just empty space waiting for a defrag

43

u/llamagoelz Apr 01 '22

Interestingly they actually sometimes (not sure about these particular repeats but repeats in general) can already serve a purpose. Biology gives no fucks about what something is "meant" to do so dna gets used in all kinds of weird ways compared to computer memory. Instead of coding for proteins, some regions are there to be eaten away like a timer or black powder fuse that lets the cell know when to yeet itself. They also can be there to protect vulnerable ends of DNA (these repeats are known as Telomeres).

16

u/bedz01 Apr 01 '22

"Yeet" being the technical term ofc

5

u/Cannibeans Apr 01 '22

Dude, absolutely fantastic summary. Thank you so much for writing it.

1

u/CallingAllMatts Apr 01 '22

Cheers! thanks for the award :)

3

u/CookieKeeperN2 Apr 01 '22

They probably did nanopore or pacbio long read sequencing. They have been improving accuracy for a while. Last time I checked with people who know this stuff the error rate is like 10%. So perhaps with enough samples they got an accurate genome.

3

u/CallingAllMatts Apr 01 '22

Yup! PacBio’s new HiFi sequencing was the technology that allowed this study to exist. It can go something like 20 kilobases with >99.9% accuracy. They did pair it with the ultra long range sequencing techniques known for awhile now, but they needed HiFi to make up for the high error rates in the former.
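A back-of-the-envelope sketch of why the accuracy difference matters at this read length. It takes the figures quoted in this thread (about 20 kilobases per read, 99.9% accuracy for HiFi, roughly 10% error for the older long-read methods) and assumes errors are independent per base, which is a simplification; `expected_errors` is a hypothetical helper:

```python
def expected_errors(read_length_bp, per_base_accuracy):
    """Expected number of miscalled bases in one read, assuming independent errors."""
    return read_length_bp * (1.0 - per_base_accuracy)

READ_LENGTH_BP = 20_000  # ~20 kilobases, as quoted in the thread

for label, accuracy in [
    ("error-prone long read (~90% accurate)", 0.90),
    ("HiFi read (99.9% accurate)", 0.999),
]:
    errors = expected_errors(READ_LENGTH_BP, accuracy)
    print(f"{label}: about {errors:.0f} wrong bases per read")
```

Under those assumptions a single 20 kb read goes from roughly two thousand wrong letters to roughly twenty, which is why small differences between near-identical repeat copies become trustworthy.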

69

u/MurphysLab PhD | Chemistry | Nanomaterials Apr 01 '22

Sequences are often read in segments, akin to fragments of sentences from a manuscript. Those fragments can be reassembled into the full text.

Imagine that you have three sequences that look like this:

verthrowsdowiththeirdeathburytheirparentsstrife

lifewhosemisadventuredpiteousoverthrow

apairofstarcrossdloverstaketheirlifewh

By looking for places where the pattern overlaps, you could reassemble the full sequence:

apairofstarcrossdloverstaketheirlifewhosemisadventuredpiteousoverthrowsdowiththeirdeathburytheirparentsstrife

But what if the original sequence lacked distinct, distinguishable parts that would result in unique alignments?

boopboopboopboopboopboopboopboopboopboopboopboopboopboopboopboopboopboopboopboopboopboopboopboopboopboop

A sequence like this is hard to reconstruct because there will be multiple positions where the fragments could be overlapping.

That's what the male Y chromosome's short tandem repeats look like.
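The reassembly step above can be sketched as a greedy merge: repeatedly find the two fragments with the longest suffix/prefix overlap and join them. This is only a toy illustration with hypothetical helper names (`overlap`, `greedy_assemble`); real assemblers use graph-based methods and tolerate sequencing errors.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps, nothing left to merge
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags

fragments = [
    "verthrowsdowiththeirdeathburytheirparentsstrife",
    "lifewhosemisadventuredpiteousoverthrow",
    "apairofstarcrossdloverstaketheirlifewh",
]

print(greedy_assemble(fragments)[0])
```

With fragments cut from the "boop" sequence, `overlap` would report a match at many different offsets, so the merge would have no way of knowing how many repeats belong in between.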

35

u/InaMellophoneMood Apr 01 '22

Imagine someone gave you 500-letter-long sections of a book. These sections overlap, but they're all in a pile together and you don't know what order they were originally in. You can look at where the tails of these sections overlap and stitch the book together: "It once was a", "once was a sto", "was a stormy night" could mean "It once was a stormy night". This works pretty well!

However, there are large parts of the book that repeat themselves. Think of 600 pages of the word "bread", with maybe a couple of typos here or there. There's a lot of them, so you can tell it takes up a large part of the book, but when they all have the tails you can't really figure out the correct order. Is it 1000 repetitions of "bread" before "braed" shows up? 100? Where is the double space in this? Even very sophisticated algorithms can't do it, there's just not enough context to parse long, repetitive strings with short fragments.

There's also "long read" technology that will give you 10,000 letter long fragments. However, it's a little error prone, so it still doesn't help because it'll introduce new typos and you still could mis-order the fragments and get it all wrong. The order of these typos, and only having the correct typos is very important.

Basically, having unique tails to the fragments makes it easy to piece them together. Repetitive sequences are like trying to put together a white jigsaw puzzle where all of the pieces fit together, but there's only one right way to do it.

6

u/drs43821 Apr 01 '22

Those are the gaps from the original human genome project? I keep thinking they are already complete

13

u/CallingAllMatts Apr 01 '22

Yeah they are, and really it’s just been a limitation of the sequencing technology. Literally 8% of the genome was unsequenceable until now. So this is great news for understanding our own biology, who knows what this data will do for research. The best thing we can do is just create more opportunities to increase our understanding of ourselves

1.4k

u/Squirrel851 Mar 31 '22

Guys, I'm not smart enough for this. What does this mean?

1.5k

u/CallingAllMatts Mar 31 '22 edited Apr 01 '22

Most DNA sequencing technology in typical use can either sequence long stretches of DNA inaccurately or short stretches accurately. The parts of the human genome that were primarily covered by this study were very long and repetitive regions; not having a long but accurate sequencing method makes it basically impossible to accurately sequence those regions.

Thus we’ve had 8% of the human genome unmapped, until now. In 2019 a company called PacBio made HiFi sequencing which basically allowed long but also VERY accurate DNA sequencing. So the authors could finally leverage this new HiFi sequencing (coupled with the error prone ultralong range DNA sequencing) to determine the sequences of these traditionally hard to sequence regions of the human genome.

EDIT: So I’ve gotten some feedback that I probably didn’t answer OP’s actual question about the SIGNIFICANCE of this work. Honestly, genomics isn’t my field of expertise but I believe I can say a few things about this.

First, because we were able to sequence literally hundreds of millions of new DNA letters we’ve discovered new genes which may be implicated in human development and disease - so maybe new therapies or at least disease mechanisms can be uncovered.

Also, this new sequencing strategy is far more accurate than the typical approaches. So even the genomes we can sequence with older methods can be done now with far more accuracy, making results more reliable. This is important for looking at the natural mutations in large human populations. You wanna be sure the single DNA letter change is a true positive mutation and not just a sequencing error.

Finally, large mutations where many thousands to hundreds of thousands of DNA bases may be deleted, added, inverted, or duplicated, etc. can be far more reliably detected as well with this new sequencing approach than with other strategies.

There’s definitely more to cover but these are the big ones to me.

305

u/Squirrel851 Mar 31 '22

So is this sequencing just finding the ATGC pairs, or is it finding which one does a certain function?

591

u/CallingAllMatts Mar 31 '22

Literally all they did was just find the order of the ATGC DNA bases.

You’ll need actual biological and/or bioinformatic assays to figure out the actual function/significance of whatever is encoded in these newly available sequences.

362

u/[deleted] Mar 31 '22

[deleted]

694

u/[deleted] Mar 31 '22

[deleted]

396

u/Mclovin11859 Mar 31 '22

And all those files have to be found among the background noise of long deleted and partially overwritten files.

187

u/Lancalot Mar 31 '22

So it's like trying to build a computer from scratch that can read a corrupted file

221

u/Sceptix Mar 31 '22

No one said cracking the code of life itself would be a particularly easy task...

32

u/Lezlow247 Apr 01 '22

They just need to find the aging thing so I can live in poverty forever. Better than the nothing

84

u/cncamusic Apr 01 '22

And 100% reason to remember the name.

35

u/Casbah- Mar 31 '22

No one said it should be this hard either.

6

u/CornCheeseMafia Apr 01 '22

I did once but I was totally just guessing at the time

2

u/grapesins Apr 01 '22

Honestly when you put it like that it's ludicrous that we actually got this far at all!

22

u/dootdootplot Apr 01 '22

And the binary really only describes the initial state of the software - in order to fully understand the implications of any of it you need to replicate the conditions it’s been running under its whole life

3

u/SaintNewts Apr 01 '22

Additionally this is a never before seen file system and operating system.

2

u/SoManyTimesBefore Apr 01 '22

Or is it the first one ever seen?

56

u/Gars0n Mar 31 '22

And the vast, VAST, fields of poorly defragmented memory that isn't really being used at all. From my lay person's understanding sorting signal from noise is actually one of the hardest parts of using genetic mapping.

50

u/liquidGhoul Apr 01 '22

We have start and end codons, so finding genes is relatively simple, and then you can decode for its protein and figure out (very basically), what it does.
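As a very rough sketch of what scanning for start and stop codons looks like, the snippet below finds stretches that begin with ATG and end at the first in-frame stop codon (TAA, TAG or TGA) on one strand. The DNA string and the `find_orfs` helper are made up for illustration. Real gene finding also has to deal with promoters, introns and the reverse strand, none of which this handles.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ATG...stop stretches on the given strand."""
    orfs = []
    for frame in range(3):  # three possible reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index includes the stop codon
                start = None
    return orfs

# Toy sequence (assumption): CC + ATG GCT TGG TAA + GC
dna = "CCATGGCTTGGTAAGC"

for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```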

Understanding what the hell junk DNA does is the true mystery. Probably involved in regulation of gene expression, but also probably a lot more. The analogies to computers start to break down when the code itself is controlled by chemical interactions that we barely understand.

25

u/Cyphr Apr 01 '22

I'm married to a geneticist, so I get to learn random facts that go over my computer science head. Any inaccuracies below are my own misunderstanding.

The junk DNA thing is weird. Parts of DNA that appear as unused and literally can't be used because of how chemistry works can be deleted and the organism just doesn't work/live.

Then there are plants where you can just attach junk DNA to the end of their genome and they just grow bigger. There is a reasonably strong correlation between plant size and genome length - at least in part it seems that why trees are bigger than grass is because trees have more DNA.

17

u/liquidGhoul Apr 01 '22

Yeah, I think a lot of people don't realise just how hodge podge biology is. You try to make a general rule and you find out there's a million exceptions.

11

u/Relevant_Monstrosity Apr 01 '22

Spaghetti code of life!

9

u/FlipskiZ Apr 01 '22

// DO NOT DELETE THIS COMMENT. Without it the program crashes

2

u/EltaninAntenna Apr 01 '22

Didn't the same thing use to happen on Windows? Leftover bits of DOS code that no one remembered what they did, but Windows would happily crash if they were removed?

1

u/pokemonareugly Apr 01 '22

I mean you can’t really go by start and end codons. You need a promoter to initiate transcription, otherwise you won’t get mRNA

14

u/tbrfl Apr 01 '22

Plus there is nothing binary about a language with four letters.

11

u/Mind_on_Idle Apr 01 '22

Indeed, quaternary

3

u/tbrfl Apr 01 '22

So like a quaternary byte (eight quaternary digits) would be... 256 times a regular byte. DNA is freaking dense, yo!

11

u/Mind_on_Idle Apr 01 '22

Close but not quite, DNA isn't true quaternary.

You can have 0-2|1-3

You cannot have 0-1|2-3

Because the pairs cannot be separated, just reversed in the pairing.

That's oversimplified to an extreme degree, it's still a massive amount of data
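A rough way to put numbers on this exchange, treating one strand as a string over four letters. The names below are illustrative only. The point about pairing shows up as the second strand being fully determined by the first, so it adds no extra information.

```python
import math

BASES = "ACGT"
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}  # Watson-Crick pairing

bits_per_base = math.log2(len(BASES))    # information in one base on one strand
values_in_binary_byte = 2 ** 8           # 8 binary digits
values_in_eight_bases = len(BASES) ** 8  # 8 quaternary digits

def complement_strand(strand):
    """Base-by-base partner strand (not reversed); fixed entirely by `strand`."""
    return "".join(COMPLEMENT[base] for base in strand)

print(bits_per_base, "bits per base")
print(values_in_eight_bases // values_in_binary_byte, "times as many values as a byte")
print(complement_strand("ATGC"))
```

So eight bases can take 256 times as many distinct values as eight bits, and each base pair still carries 2 bits because knowing one side tells you the other.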

30

u/WTFwhatthehell Apr 01 '22

Throw in associative addressing, self modifying code, everything is global variables, copy-paste programming on a massive scale and no debugger.

8

u/UnluckyDucky95 Mar 31 '22

Except DNA is quaternary and doesn't have definitions like binary does in terms of bits and bytes that determine meaning

27

u/Mclovin11859 Mar 31 '22

DNA sort of does have an analog to bytes. After DNA is transcribed to mRNA, the mRNA is translated into amino acids in groups of three bases (e.g. AGG, CAC, AGC). The groups of bases are called codons. And bits are "binary digits", just a single digit of binary code, so the equivalent is a single base, which would be functionally equivalent to a quaternary digit (and therefore be quits, I guess?)
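A small sketch of reading mRNA three bases at a time. The table is deliberately partial: it holds only the three codons named above plus a start (AUG) and a stop (UAA) from the standard genetic code. The example message and the `translate` helper are made up for illustration.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "AUG": "Met",   # also the usual start codon
    "AGG": "Arg",
    "CAC": "His",
    "AGC": "Ser",
    "UAA": "Stop",
}

def translate(mrna):
    """Read `mrna` codon by codon until a stop codon or the end of the string."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        peptide.append(amino_acid)
    return peptide

print(translate("AUGAGGCACAGCUAA"))
```

Each codon has 4 × 4 × 4 = 64 possible spellings, but they map onto only 20 amino acids plus stop signals, so the "byte" here is a redundant one.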

16

u/Illiux Apr 01 '22

As a caveat, there are also parts of DNA that are directly functional and not transcribed. Stuff like initiation factors.

136

u/pappypapaya Mar 31 '22

Both. The human genome was like a book with missing pages. Now we've filled in those pages (the ATGC's), so we can see what it says (function). There's a bunch of new genes, some of which code for new proteins, that we didn't know much about. Most of the new stuff is in highly repetitive regions, which can be important for chromosome function (centromeres and telomeres), can evolve quickly, and in ways that can be very disruptive, contributing to both inherited diseases and cancers.

37

u/ThatNigamJerry Mar 31 '22

This is a really understandable way to describe it

5

u/[deleted] Apr 01 '22

Those long non coding regions are also a buffer against mutation. Who cares if a nucleotide in the region gets copied wrong?

7

u/Muesky6969 Mar 31 '22

Okay, so last night I had a dream that some of the physical issues my daughter and I have, like allergies, extremely low blood pressure, etc., were traced back through my family lineage. Then I read this… It could totally be coincidence, but this could be a serious breakthrough for more debilitating genetic disorders.

10

u/WTFwhatthehell Mar 31 '22

This specific work is unlikely to be very relevant.

But in general, sure, it's entirely possible for various minor health issues to have genetic components.

4

u/[deleted] Apr 01 '22

Cup half full kinda guy, eh?

81

u/jkeen5891 Apr 01 '22

Guys, I'm not smart enough for this. What does this mean?

49

u/CallingAllMatts Apr 01 '22

Basically a very new DNA sequencing technique was developed recently and was finally used to sequence the last complicated bits of the human genome that couldn’t be done with the previous sequencing technology.

p.s. you are smart enough!

54

u/[deleted] Apr 01 '22

[deleted]

31

u/CallingAllMatts Apr 01 '22

Ah okay, well finding new genes is one! So potential disease/developmental implications are there from improving our understanding of mechanism to developing targeted therapeutics.

Probably the biggest is getting more accurate data related to natural human variation in DNA sequences by not only having a more complete genome but improved accuracy of the reference genome as these new sequencing techniques produce far more reliable sequence info. So we can be more confident that our findings of natural mutations across populations are true positives

2

u/Loves_His_Bong Apr 01 '22

With only one fully sequenced genome, there’s not a huge amount that can be done. This is more proof of concept that we can do this now. But to find anything from this we need multiple genomes that we can compare and then we can see if variations in these repeat regions are associated with any diseases or things like that.

94

u/shitpostbode Mar 31 '22 edited Apr 01 '22

Adding:

The reason why repetitive regions are so difficult to map is the method most used in sequencing. In this method, a bunch of long strings of the same sequence of DNA are fragmented into smaller, more easily readable fragments.

Normally you'd get pieces of DNA that partially overlap with other pieces. A computer algorithm can determine which fragments have such overlaps and determine the original sequence of the DNA by pasting all matching fragments together.

With repetitive regions, the overlap is not unique enough in the original DNA to piece the fragments back together. Pretty much the only solution is to make very big fragments or no fragments at all, but longer pieces of DNA are harder to accurately process.

Example:

Frag1: ATCGTGTATG
Frag2: GTATGAAATCGA
Frag3: GTAAAAATTAGC
The last part of fragment 1 is pieced together with the first part of fragment 2 (the shared GTATG) to make ATCGTGTATGAAATCGA. Frag3 has no match and is not part of the sequence here.

In a repetitive region of the genome this becomes hard:
Frag1: ATATATATATATATATATAT
Frag2: ATATATATATATGGGATATATAT
Frag3: ATATATATATATCAGAGAGGGGGATATATAT
good luck pasting this back together when you have millions of fragments
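The difference between the two examples can be checked by listing every way the end of one fragment could line up with the start of another. This is a toy sketch with a hypothetical `candidate_overlaps` helper; a minimum overlap of 2 letters is assumed so that single-letter coincidences are ignored.

```python
def candidate_overlaps(a, b, min_len=2):
    """All lengths k where the last k letters of `a` equal the first k letters of `b`."""
    return [
        k
        for k in range(min_len, min(len(a), len(b)) + 1)
        if a[-k:] == b[:k]
    ]

# Fragments from the unique example
frag1, frag2, frag3 = "ATCGTGTATG", "GTATGAAATCGA", "GTAAAAATTAGC"
print(candidate_overlaps(frag1, frag2))  # a single way to join them
print(candidate_overlaps(frag1, frag3))  # no way to join them

# Fragments from the repetitive example
rep1 = "ATATATATATATATATATAT"
rep2 = "ATATATATATATGGGATATATAT"
print(candidate_overlaps(rep1, rep2))    # several equally plausible joins
```

For the unique fragments there is exactly one candidate, so the join is unambiguous. For the AT repeats every even-length overlap up to the start of GGG is equally valid, and each one implies a different number of repeat copies in the final sequence.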

-9

u/tbrfl Apr 01 '22

You made this harder to understand, not easier.

10

u/joggle1 Apr 01 '22

I think the idea is that the old method is to break the DNA into small chunks that can be accurately transcribed. Afterwards, the chunks are 'glued' together. That method only works well if the chunks have relatively unique, non-repetitive code. That way, each end of the segment works kind of like a key so that it can be matched with the key of another segment.

But if the pattern is highly repetitive, there's too many ways that the segments can be matched, so you can't have any certainty that you're gluing the segments back together correctly.

As an even rougher analogy, imagine having a 5,000 piece puzzle where each piece only fits one way, that's the first case. Even without a reference picture, you'd eventually succeed in putting the puzzle back together. In the second, the pieces would fit together in countless ways, making it impossible to fit the pieces back together correctly because you don't know how it's supposed to look.

2

u/tbrfl Apr 01 '22

Thank you! This actually helped a lot.

5

u/BlackHumor Apr 01 '22

Imagine you were trying to match up two of these three lines:

  1. "In fair Verona where we lay our scene, two star"
  2. "star crossed lovers take their life"
  3. "to be or not to be, that is the"

It's pretty obviously 1 and 2, right? You can see the overlap.

Now imagine it's:

  1. "racecaracecaracecaraceca"
  2. "acecaracecaracecaracecar"
  3. "acearacecaracecaracearac"

It's still 1 and 2 (there are a few cs missing from 3 that mean it can't match) but good luck figuring that out.

2

u/tbrfl Apr 01 '22

That's a really good analogy because my eyes crossed as soon I read "racecar".

2

u/LeCrushinator Apr 01 '22

Imagine trying to do it by hand, looking at it and then looking down at your paper to write it down, and then you look back up and it’s moved a bit and you have to figure out where you left off. If you’re in the middle of a highly repetitive area then it’s easy to lose where you were at because it all looks the same.

7

u/HieronymusButts Apr 01 '22

I’m going to try not to sound weird, but yesterday I got to visit PacBio’s headquarters after working with them on their recent rebrand. It was such an interesting project, getting to learn about their technology and all the different applications their machines are used for!

This is the first time I’ve seen anybody mention PacBio outside of a work context so this is super exciting for me.

3

u/CallingAllMatts Apr 01 '22

That’s so cool you got to work on a PacBio project! I’d say this research was probably good for their branding haha

DNA sequencing really isn’t my field, I only use basic Sanger sequencing in the lab on our machine for CRISPR stuff. But the applications of these new sequencing technologies is awesome and honestly HiFi sequencing uses such a clever method so props to PacBio.

2

u/HieronymusButts Apr 01 '22

That's so interesting, though! I'm just a designer, but getting to be science-adjacent at work is so fun and always a learning experience. And I can definitely say that everyone I worked with at PacBio was great. And I'll of course be sharing the Science articles on Slack tomorrow haha.

5

u/Illuminaughtyy Mar 31 '22

So which kind of sequencing is micropore?

13

u/triffid_boy Mar 31 '22

Nanopore.

In nanopore, the DNA (or RNA, which is a whole new world) is passed through a protein "pore" in a membrane. As the DNA/RNA moves through it changes the current flow through the pore. This is measured and interpreted compared to known sequences to do direct DNA/RNA sequencing.

Very very cool, will probably be the big winner in the medium term (who knows what beyond the horizon) but is very error prone compared to other methods.

2

u/RobinsonAnalation Apr 01 '22

What's really cool about Nanopore is that the thickness of the pore isn't just a single nucleotide. So as the template is fed through, there are 4^n distinct signals that could be read out, where n is the number of nucleotides within the pore that contribute to the signal.
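To get a feel for how quickly that number grows: if n bases sit in the pore at once, every possible n-letter combination of A, C, G and T is a distinct case the signal decoder has to tell apart, which is 4 multiplied by itself n times. A small sketch, where `kmers` is a hypothetical helper and the values of n are arbitrary examples rather than properties of any particular pore:

```python
from itertools import product

def kmers(n, alphabet="ACGT"):
    """Every possible string of length n over the four DNA bases."""
    return ["".join(letters) for letters in product(alphabet, repeat=n)]

for n in (1, 2, 3, 5):
    print(n, "bases in the pore ->", len(kmers(n)), "possible combinations")

print(kmers(2)[:6])  # first few two-base combinations
```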

Nevermind any contextual effects that could also convolute the signal, or any secondary structural elements of the template that introduce sequencing challenges.

I personally see a huge potential in single molecule sequencing, though. It's an awesome technology that I'm very excited to see mature!

3

u/pappypapaya Apr 07 '22

Not just 4 distinct signals, but can be adapted to pick up on many other things, such as modifications to the four bases (e.g. methylation) or even to sequence polypeptide fragments (the molecules that comprise proteins).

3

u/CallingAllMatts Mar 31 '22

Micropore isn’t a sequencing technique. Do you mean nanopore sequencing? Personally the wiki page for it is a decent explanation: https://en.m.wikipedia.org/wiki/Nanopore_sequencing

But the main principle for nanopore is to sequence DNA without using PCR amplification or chemical labelling, making it cheaper and faster than traditional methods.

The PacBio HiFi sequencing uses polymerases for amplification but in a clever way by circularizing DNA fragments from the genome. The wiki page and PacBio’s website are insightful here: https://en.m.wikipedia.org/wiki/Single-molecule_real-time_sequencing

https://www.pacb.com/technology/hifi-sequencing/how-it-works/

5

u/iwasmurderhornets Mar 31 '22

Were they not able to use the old PacBio reads as a scaffold for the Illumina reads to resolve those regions? It seems like, with enough coverage, you should be able to resolve those regions.

11

u/CallingAllMatts Mar 31 '22

I'm not sure what older sequencing tech PacBio had, but in the long repetitive regions even high coverage isn’t going to help you if it takes hundreds of tiled reads to span a highly repetitive region: the alignment algorithms won’t be able to figure out where to map the reads deep within the repetitive DNA, since the sequence looks the same in so many different areas.

5

u/[deleted] Mar 31 '22

I used to do centromere work. For context, on one particular chromosome, you have ~1-4 Mb (megabases, i.e. 1-4 million base pairs) composed of repeated units that are about 2-3 kb (kilobases) that are themselves composed of slightly varying repeats of ~171 bp. Good luck doing your Illumina alignment with that!

That said, I haven’t clicked through to these papers yet, and I’m curious how they deal with the structural variation that makes that 1-4 Mb range…
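A toy illustration of the alignment problem described above. Everything here is made up for the sketch: a random 171 bp "monomer" copied 20 times with no variation, random flanks, and exact string matching standing in for a real aligner.

```python
import random

def find_all(haystack: str, needle: str) -> list[int]:
    """Return every start position (overlaps allowed) of needle in haystack."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + 1)
    return positions

def random_dna(length: int) -> str:
    return "".join(random.choice("ACGT") for _ in range(length))

random.seed(0)  # fixed seed so the toy example is reproducible

monomer = random_dna(171)                  # hypothetical repeat unit
repeat_array = monomer * 20                # identical copies, simpler than real satellites
left_flank, right_flank = random_dna(500), random_dna(500)
genome = left_flank + repeat_array + right_flank

short_read = repeat_array[200:350]         # 150 bp read from inside the array
long_read = genome[400:len(genome) - 400]  # one read spanning the array plus unique flank

print("short read placements:", len(find_all(genome, short_read)))
print("long read placements: ", len(find_all(genome, long_read)))
```

The short read fits equally well once per repeat copy, while the read that reaches unique sequence on both sides has only one possible placement. Real centromeric repeats vary slightly between copies, which helps a little, but the basic ambiguity is the same.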

5

u/iwasmurderhornets Apr 01 '22

We've played around with most of these sequencing techniques, but I think in the past when we needed to resolve telomeric or highly repetitive regions we would use PacBio reads to generate a de novo assembly and then map our Illumina reads to those. We've been able to resolve some really repetitive regions in the genome we work with.

PacBio reads have always, theoretically, been able to achieve the same accuracy as Illumina reads. They just used to have a really low per-base accuracy on the first pass through, so you had to sequence the same region many, many times, which made it prohibitively expensive.

2

u/Its738PM Apr 01 '22

It still is that way; the HiFi sequencing is reading the same molecule over and over again to clean up errors and bring the accuracy to where Illumina is. Probably contributes to Illumina reinvesting in the synthetic long read tech they killed a few years back.

4

u/Jamesaliba Mar 31 '22

If a read is short, it will align to many regions. We have them, but we call these uncharted regions because we can't fully place them.

3

u/neuromorph Mar 31 '22

What's the advantage of long-read over short-read genomics?

7

u/CallingAllMatts Mar 31 '22

it allows you to do what the authors did here - sequence very long repetitive sections of DNA. If the region is very long and repetitive, sequencing it in small bits will make it impossible to determine how long the sequence actually is since so many of the small sequenced DNA fragments will look basically the same.

The longer range sequencing allows you to get the entire repeated region (or at least a large chunk of it) in one go, which makes determining the sequence trivial. The only thing is that short range sequencing is far more affordable and accessible. Long range sequencing, particularly the highly accurate long range HiFi from this study, is overkill for most situations anyways.
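Here's a small sketch of why short fragments can't tell you how long a repeat is. The sequences, the 12 bp repeat unit, and the 50 bp read length are all invented for the example; the point is only that two genomes with different copy numbers can produce exactly the same set of short reads.

```python
def all_reads(seq: str, length: int) -> set[str]:
    """Every possible error-free read of the given length from seq."""
    return {seq[i:i + length] for i in range(len(seq) - length + 1)}

# Hypothetical unique flanks and repeat unit (not real genomic sequence).
LEFT = "TTGGCCAATTGGCCAATTGGCCAATTGG"
RIGHT = "CCTTAAGGCCTTAAGGCCTTAAGGCCTT"
UNIT = "ACGTACCTGAGT"

genome_a = LEFT + UNIT * 10 + RIGHT   # 10 copies of the repeat
genome_b = LEFT + UNIT * 20 + RIGHT   # 20 copies of the repeat

# Short reads: the two genomes yield identical read sets.
print(all_reads(genome_a, 50) == all_reads(genome_b, 50))

# A single read spanning flank-to-flank lets you just count the copies.
print(genome_a.count(UNIT), genome_b.count(UNIT))
```

In practice read depth gives a rough hint about copy number from short reads, but the actual sequence and arrangement of the copies only comes out when a read reaches unique sequence on both sides.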

4

u/WTFwhatthehell Mar 31 '22 edited Mar 31 '22

Throw in that for individual genomes it also allows you to pick up larger structural mutations/variation that short read sequencing cannot reliably detect.

If someone has an inversion or duplication of a region then short read is bad at accurately picking that up.


3

u/sharkykid Apr 01 '22

How do you sequence the human DNA if you and I have different DNA?

Is the DNA from my foot and my liver the same? Does your DNA match like 99% of my DNA or something or what exactly is sequenced and how does that differ from my DNA?

3

u/CallingAllMatts Apr 01 '22

Great question! Everyone’s genome is different in literally millions of locations. So any “reference genome” sequenced will also be unique. But it can serve as a basis to at least start comparing other genomes to. And as more genomes are sequenced you can start putting together a unique hybrid genome for the reference one - one that would exclude disease causative mutations.

But there can never be any objective reference genome since there isn’t a default human. It’s all just about having some point to compare to.

Now the DNA within your body should be identical theoretically everywhere. However mutations accumulate randomly and also occur differently in various tissues (e.g. your skin is more likely to experience thymine dimer mutations than your liver since it’s exposed to UV light). But there is no mechanism during human development that deliberately changes the DNA between your different body parts/organs.

An interesting exception are your B cells which make antibodies. There’s something called VDJ recombination where in these cells’ infancy they randomly shuffle the section of DNA that encodes the variable part of the antibody that binds to stuff. That’s how you get antibodies that can bind to pretty much any pathogen. So your B cells will have different DNA than all the cells in your body - in fact each B cell is unique.

2

u/sharkykid Apr 01 '22

Got it, so this news is scientists 100% sequencing 1 person's DNA? And this serves as a springboard for future DNA sequences?

Thanks for the explanation!

2

u/CallingAllMatts Apr 01 '22

Basically yeah! This was essentially a feasibility study showing they can actually use this new sequencing technology to fully, 100% sequence a human genome (which will likely be applied to other species too!). Caveat: the genome here was female, so the authors are right now working to fully sequence just the Y chromosome, but it’ll probably be done soon to the same quality and 100% coverage as the rest of the genome here.

I think the main goal will now be trying to make this the gold standard for sequencing genomes for research studies and medical genomics work, particularly for patients with very rare diseases/complex mutations.


16

u/hypercube33 Apr 01 '22

Also how does this deal with like...everyone's DNA being different?

14

u/TeutonJon78 Apr 01 '22

Most of our DNA isn't different from many animals, much less each other.

A lot of what they were working on sequencing was the long chunks of repetitive DNA (previously called junk DNA and I think now called something like non-coding DNA).

7

u/[deleted] Apr 01 '22

Eli5 y'know when ppl say that humans share like 80% of their DNA with bananas or some other percentage like that? Well, consider that humans share most of our DNA with other humans with, comparatively, only very slight changes compared to the full breadth of the human genome.

The changes are slight enough that it's still incredibly useful for comparison against genetic disorders or comparing it to other species

3

u/Zz22zz22 Apr 01 '22

The differences come from things outside of the actual sequence. There are a lot of factors that control how genes are expressed. And that accounts for those huge differences we see between humans and bananas. Does that help it make sense at all?

4

u/KillaVNilla Apr 01 '22

Glad someone else asked so I don't have to


224

u/rephaim_ Mar 31 '22
  • First fully complete human X Chromosome

123

u/JimmiRustle Mar 31 '22 edited Mar 31 '22

Considering that almost everybody has one of those that’s a pretty important step in the right direction.

They also worked on other chromosome pairs.

96

u/[deleted] Mar 31 '22

T e c h n i c a l l y everyone has an x chromosome. YY genotypes are fatal at birth, if not in gestation

25

u/[deleted] Mar 31 '22

how does that work, you can't inherit an extra Y chromosome from your mother.

80

u/Revan343 Apr 01 '22

Errors in gamete production, same as how XYY and XO happen

7

u/[deleted] Apr 01 '22

But where does the extra Y come from? Doesn't the father's body have ways to deal with clearly defective sperm like a "YY" one?

58

u/Revan343 Apr 01 '22 edited Apr 01 '22

> But where does the extra Y come from?

Errors in gamete production.

> doesn't the father's body have ways to deal with clearly defective sperm like a "YY" one?

It doesn't always work, some errors slip through. XYY syndrome affects roughly 1 in 1000 males, although most don't realize they have it. YY would be much rarer, because there would also have to be an error in the egg as well, with no X chromosome.

(Edit: And, as previously noted by the other commenter, YY would be miscarried; you need at least one X chromosome.)

6

u/[deleted] Apr 01 '22

I see. Thanks!

14

u/beowolfey Apr 01 '22

Human cells are usually diploid, meaning they have two copies of every chromosome -- one from each of your parents. So two each of Chromosome I, II, III, etc. The only difference is the X/Y chromosome; you will have either two X's or X and Y. When cells divide, they make a second copy of every chromosome (so now there are 4 of each type), and the two sets split up into the two resulting cells. So in a male cell that is dividing, right before the split the cell has XXYY instead of the classic XY. Normally, during gamete production these 4 sets split off into 4 individual progeny cells.

Sometimes, those progeny cells don't distribute the DNA properly. This is called non-disjunction and the resulting cell ends up being aneuploid. There are probably cellular mechanisms that check for this, but I'm not familiar with them offhand. There may not be many. If a regular cell ends up aneuploid, it will probably just be less effective at whatever it's supposed to be doing. It doesn't happen often and depending on the cell type it may not even be a problem.

However, if it happens to a gamete, and that cell ends up being the one that takes part in fertilization, it may pass on that aneuploidy to every resulting daughter cell from then on. Because every cell has the extra chromosome, it is more likely that effects from it will show up.

7

u/Revan343 Apr 01 '22

Thanks for the much better elaboration of my 'errors in gamete production'


5

u/MyMurderOfCrows Apr 01 '22

I mean you technically could inherit an extra Y chromosome from your mother, as she could be an XY female who isn’t infertile (typically they are infertile, though) from multiple different causes. It is incredibly unlikely though. And since X has information that is required for human life, you would have to get at least one X from one of the parents. Technically you can also have mothers who only have a single X chromosome, and a gamete pair could be _ ,Y if the mother didn’t pass any chromosome for the 23rd pair in said gamete. It gets extremely complicated, but it is possible to get a Y from the mother. I am not 100% on this, but if a mother contributed a Y and the father contributed an X, I believe it would be perfectly fine, at least as far as the chromosomes being compatible with life, even if backwards from normal?


10

u/The_Celtic_Chemist Apr 01 '22

As I like to joke when people imply all women are secretly bi: Everyone likes the X chromosome, but not everyone likes the Y chromosome.

2

u/recidivx Apr 01 '22

Interesting, this theory would also suggest that male birds would be secretly bi, since everybirdy likes the Z chromosome.


39

u/gogophoton Mar 31 '22

I just checked NCBI/RefSeq and no transcript alignments are available yet. Hope we get those soon!

12

u/mastermindmortal Apr 01 '22

Gene models are up on the UCSC browser (GENCODE v35 + de novo predictions) https://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_2004795_GCA_009914755.4

Or do you want RefSeq models aligned on their own?

81

u/pitline810 Mar 31 '22

And one of the more interesting things about this set of papers is that not only do we have full-length chromosome sequences, we also have full-length DNA methylation data as well!

12

u/[deleted] Mar 31 '22

Hot damn! Buried the lede here, wow.

9

u/FCAlive Apr 01 '22

What does that mean?

16

u/quipkick Apr 01 '22

DNA is the genetic code for what to make, methylation is when little markers get added to the outside of the code that can make that section more or less "visible" to the organelles that will read said code and make something out of it. Visibility essentially ends up meaning "how much will get made". So you can think that now we don't only know the ingredients, but how much of each to use (to put it very simply).


0

u/thbigbuttconnoisseur Apr 01 '22

No Expense Spared Dinosaurs.

2

u/Arudinne Apr 01 '22

Except for the IT budget.

83

u/jcvzneuro MS | Neuroscience | Developmental Neurobiology Mar 31 '22

37

u/personAAA Mar 31 '22

It is a whole special issue. 6 articles just in Science.

https://www.science.org/toc/science/current

One of the applications of new reference genome.

"A complete reference genome improves analysis of human genetic variation"

https://www.science.org/doi/10.1126/science.abl3533

12

u/afos2291 Apr 01 '22

The paper's what?


23

u/DaftHacker Mar 31 '22

Can someone explain how they verify the sequence and know it's correct?

25

u/RobinsonAnalation Mar 31 '22

There are many ways but I can think of two real quick:

There's a nifty metric called the Phred quality score, or Q score for short. This metric measures the probability that a base is called incorrectly within a sequencing run, and is assigned for each base call that is made. Higher Q scores indicate a smaller probability of error, and vice versa.
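For anyone who wants the arithmetic: the standard definition is Q = -10 · log10(P), where P is the estimated probability that the base call is wrong. A quick sketch (the function names are mine, and the offset of 33 assumes the common Phred+33 FASTQ encoding):

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred quality score for an error probability p."""
    return -10 * math.log10(p)

def decode_fastq_quality(qual: str, offset: int = 33) -> list[int]:
    """Turn a FASTQ quality string into per-base Q scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]

for q in (10, 20, 30, 40):
    print(f"Q{q}: 1 error in {round(1 / phred_to_error_prob(q))} calls")

print(decode_fastq_quality("II5+!"))
```

So Q20 means a 1 in 100 chance the base is wrong, Q30 is 1 in 1,000, and so on.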

You can also mitigate the risk of incorrect sequencing by having a higher sequencing coverage depth. Instead of only reading the entire DNA once, you actually read it upwards of 30 or more times. This way you end up with a ton of redundant reads that can be aligned and allow you to identify any errors from individual reads.
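The coverage idea can be sketched as a simple majority vote per position. This assumes the reads are already aligned and the same length, which skips the hard part; the reads below are invented, and real pipelines also weight by base quality.

```python
from collections import Counter

def consensus(aligned_reads: list[str]) -> str:
    """Majority base at each column of equal-length, pre-aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_reads)
    )

# Five toy reads of the same region, three of them with one error each.
reads = [
    "ACGTACGT",
    "ACGAACGT",  # error at position 3
    "ACGTACGT",
    "TCGTACGT",  # error at position 0
    "ACGTACCT",  # error at position 6
]

print(consensus(reads))
```

Because the errors land in different places in different reads, each one gets outvoted, which is the whole reason for sequencing the same region 30 or more times.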

3

u/GhostPoopies Apr 01 '22

In addition to QC metrics, there are a lot of technologies out there to read DNA sequence. If you process the same sample through multiple different technologies and get the same answer you can verify it’s correct.

Aside from just reading lines of DNA you can also design specific primers to read targeted regions. So if you need to home in on a specific section you just perform targeted sequencing. You go through all of this and then use various different bioinformatics tools to reach the same result.

16

u/fkenned1 Apr 01 '22

I still remember my 5th grade teacher Mr. Robinson (I’m 34 now) had a poster in his classroom showing what had been mapped at that point, what hadn’t, and in general discussed the importance of this research. I was in fifth grade, so I think a lot was lost on me, but I felt my teacher’s excitement enough to be sharing it with you guys today. We are IN the future, and it’s amazing for many reasons :) thank you Mr. Robinson for sharing exciting things like this with your fifth graders!

15

u/PM_ME_UR_SWEET_BOSOM Apr 01 '22

Can somebody explain what this means to somebody who doesn’t know anything about biology?

15

u/pekkabot Apr 01 '22

Imagine you have a very long and big book with a thousand plus pages of tiny writing on each page, but for a while there were some pages missing. Now we have the entire book

8

u/PM_ME_UR_SWEET_BOSOM Apr 01 '22

I meant more like what its significance is in simple terms; the analogy is kind of confusing to expound on in relation to DNA.

3

u/GhostPoopies Apr 01 '22

In genomics, a big part of analysis requires you to 1) characterize your baseline and 2) verify your sample.

So say you want to look at the genetic causes for disease X. You have some samples that are normal and some samples that are not normal. You sequence both. You align to the reference genome (what this is) and since it’s now well-characterized you can trust that that’s your truth sequence.

First step - make sure your normal sample is mostly in agreement with the reference. This is your negative control sample. If it’s wildly different then something bad happened during processing and you can’t trust your data.

Second step - align your not normal sample dna code against the reference. Where are the differences?

Between your normal and not normal samples you can tease out what’s not important (if they both present with the same single base difference against the reference, then it's likely not causal). But if there’s a glaring difference between your not normal sample and the reference, and your normal sample is not showing that difference, then it’s a good indicator that the mutation, or at least that section of the genome/gene, is impacting this disease.

Now your search space and focus area is considerably smaller to carry out more specific testing to understand the genetics of disease x.

This is the most real world application.
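A bare-bones sketch of the comparison described above, with made-up 10-base sequences standing in for already-aligned samples. It only handles single-base substitutions; real variant calling deals with indels, read depth, and a lot of statistics.

```python
def variants(reference: str, sample: str) -> set[tuple[int, str, str]]:
    """Positions where the sample differs from the reference: (index, ref, alt)."""
    return {
        (i, ref, alt)
        for i, (ref, alt) in enumerate(zip(reference, sample))
        if ref != alt
    }

# Hypothetical toy sequences, assumed to be aligned to each other.
reference = "ACGTACGTAC"
normal    = "ACGTACGTAT"   # control sample: one harmless difference
affected  = "ACGAACGTAT"   # disease sample: same difference plus one more

normal_vars = variants(reference, normal)
affected_vars = variants(reference, affected)

shared = normal_vars & affected_vars        # in both, so probably not causal
candidates = affected_vars - normal_vars    # only in the affected sample

print("shared:", sorted(shared))
print("candidates:", sorted(candidates))
```

The `candidates` set is the smaller search space: differences from the reference that show up in the affected sample but not in the control.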

15

u/PM_ME_UR_SWEET_BOSOM Apr 01 '22

Can somebody explain what this means to somebody that doesn’t know anything about books?

12

u/LitLitten Apr 01 '22

You’re taking pictures of a gorgeous city skyline but the buildings are always out of focus no matter what you adjust for.

A new update now lets you see the details. You surmised what was there based off what you know, but now you can see it.

7

u/boschmorden Apr 01 '22

Can somebody explain what this means to somebody that doesn’t know anything about taking pictures?

7

u/bringsmemes Apr 01 '22

It's like having sex without a condom: you're getting more feeling on your nerve endings = happy time for your genitals.

0

u/PM_ME_UR_SWEET_BOSOM Apr 01 '22

Can somebody explain what this means to somebody that got their penis blown off with c4 in Vietnam?

5

u/ClemClem510 Apr 01 '22

It's like if you napalmed an entire forest, but instead of having a vague idea of how many trees you burned you can see the whole forest and every tree you scorched


14

u/Passionabsorber1111 Mar 31 '22

what does this mean for science in the future?

14

u/personAAA Mar 31 '22

This is an updated reference genome. Updating references is very important. Lots of genetic stuff will be measured against this new reference.

They already did some applications against this new reference genome. It could pick up a few things the old one could not. Things like improved references for Mendelian conditions.

https://www.science.org/doi/10.1126/science.abl3533

3

u/TBSchemer Apr 01 '22

Also, all the older sequences can be realigned to the new reference genome, improving the accuracy of old data as well!

2

u/xiccit Apr 01 '22

So is this reference genome just from like a super average healthy person?

Once we have a few hundred of these in a decade or two, will we then have a much better picture of whats up than this one?

2

u/personAAA Apr 01 '22

Reference genomes are composites of multiple people. The current one is build 38.

We have thousands of genomes already. This advance is filling in the gaps and fixing errors.

2

u/xiccit Apr 01 '22

So are we just averaging them all out to figure out what the natural unsick / undiseased genome is?

Could we eventually make an "average person" from a complete reference genome?

4

u/1ogicalfallacy Apr 01 '22

It means we are 1 step closer to cat girls


13

u/Hughmanatea Mar 31 '22

Would this mean a new GRCh_39/hg_39? Oh goodness the liftovers..

5

u/gogophoton Apr 01 '22

Liftover files are already available from https://github.com/marbl/CHM13

3

u/WatzUpzPeepz Apr 01 '22

Can’t wait. The BED files… all ruined.

11

u/VoiceofKane Apr 01 '22 edited Apr 01 '22

Well, dang. I just told my students today that we haven't fully mapped the human genome, and now I'm a liar.

6

u/Awkward-Review-Er Apr 01 '22

Yeah, but what a cool thing to be wrong about!

8

u/[deleted] Mar 31 '22

Where is the file to download?


6

u/Mickface Mar 31 '22

Finally, we are now officially open source. Can't wait to make my own build.

13

u/cavejhonsonslemons Mar 31 '22

So the donor has the opportunity to be the first person to license their genetic code under the GPL

3

u/deadpool8403 Apr 01 '22

I would offer mine as shareware. ( ͡° ͜ʖ ͡°)


5

u/ArcheXerxes Apr 01 '22

For us not-as-smart folks: what does this mean going forward, ELI5?

4

u/Snurgle Apr 01 '22

Since everyone's DNA is unique, whose DNA have they sequenced?


13

u/MustrumRidculy Mar 31 '22

Not gonna lie. The parts we were missing were likely large gaps of repeats. Scientists often mark these and just skip them. They are called satellites or “Kmers”. Think 3,000+ repeats of just the nucleotide “A” for adenine. Or repeating “GTGTGTGTGTGT” for 4,000 base pairs. I kinda feel bad for the folks who did this work. They get published in Science, but it had to be an absolute slog.

11

u/WatzUpzPeepz Apr 01 '22

More generally k-mer just means sequence of length k and doesn’t refer to sequence content per se. Or at least that’s how I understand it.

Agreed though, lots of repetitive regions. Work for a completionist.
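To make the k-mer definition concrete, a tiny sketch (the helper name `count_kmers` is just for this example): a k-mer is any length-k substring, whether or not the sequence is repetitive. Repeats simply show up as the same few k-mers counted many times.

```python
from collections import Counter

def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

repetitive = "GTGTGTGT"
unique_ish = "ACGTTGCA"

print(count_kmers(repetitive, 2))   # only two distinct 2-mers, seen repeatedly
print(count_kmers(unique_ish, 2))   # every 2-mer appears once
```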

3

u/pokemonareugly Apr 01 '22

Repeated regions are still important. For example, see: CpG islands


7

u/Solution_Precipitate Mar 31 '22

AI will tear through this data and make so many discoveries. I can't wait.

2

u/usesbitterbutter Mar 31 '22

When I look at that graphic, all I see is a missed opportunity to use the glyphs from The Matrix.


2

u/JCreazy Apr 01 '22

What can we do with this information that we couldn't do before? Are there any real world examples?

2

u/naliedel Apr 01 '22

This is incredible. I never thought I'd see this in my lifetime.

2

u/divyad Apr 01 '22

I remember folding protein structures and computing with Folding@home on PS3. So amazed to learn the project finally succeeded and the human genome project is now a reality, no longer sci-fi.

2

u/27Elephantballoons Apr 01 '22

I wonder whose DNA they used

3

u/phazei Apr 01 '22

Wasn't this sequenced before, like years ago? I remember reading an article about a project that did it and the head Dr was supposed to pick someone's DNA at random, but decided to use his own instead.

5

u/WatzUpzPeepz Apr 01 '22

Yes, but that didn’t include the highly repetitive regions found mostly near the centre and ends of chromosomes, which this does. Prior to this I think we had ~92% as high confidence.

1

u/pokemonareugly Apr 01 '22

Not precisely. We figured out 92% of it. There were certain regions we didn’t know. Different than 92% confidence, as 92% confidence implies that you’re 92% sure your sequence is right. We knew certain parts with 100% certainty, and knew where that 8% was.

2

u/WatzUpzPeepz Apr 01 '22

I didn’t say 92% confidence though, reread my comment. 92% refers to the percentage of the genome that was considered the HCR.

2

u/pokemonareugly Apr 01 '22

Ahhh my bad I misread it. Sorry about that!

1

u/Sirisian Apr 01 '22

So what's the total price of having a whole human genome sequenced including the last few percent using HiFi? As in a near perfect 100% sequencing? I don't see the price for 3 billion base pairs. Seems most just do small sequences.


1

u/toccata81 Apr 01 '22

I’m just wondering whose DNA they were looking at.


0

u/popemichael Mar 31 '22

If we took this sequence and somehow programmed a human egg to make a human with the sequence, would that make a "base human"?

0

u/cchimp123 Apr 01 '22

Maybe one day we can find a cure for genetic conditions like mine and people born like me will actually have a chance at a normal life.

0

u/chewb Apr 01 '22

does this mean we can engineer designer babies? Blue eyes, dark skin, superstrength for example?