To what extent does whole genome sequencing add value above SNP arrays?
Attention conservation notice: I wrote this as the final essay for my course on personal genomics with Michael Linderman at Mt Sinai. The main question of the essay was: what is the point, in 2016, of getting your whole genome sequencing (WGS) data, if you already have your SNP data? Overall, I found analyzing my WGS data an interesting experience, but the vast majority of known genomic info is still at the SNP level, and there are some bugs in contemporary variant callers that make WGS calls more likely to be false positives, as I experienced first-hand.
This fall, I was lucky enough to be a part of a course at ISMMS where we learned about genomics by analyzing our own whole genome sequencing results; the sequencing was graciously paid for by the school [1]. Amazingly, the cost of genome sequencing dropped from $5000 to $1500 (or even $1000) in just this past year, but it’s still a significant investment in our education by ISMMS and I appreciate it. According to the course director, Michael Linderman, there are only on the order of 1000 people with access to their own whole genome sequencing results and the ability to interpret them, which puts us in a pretty small, fortunate group. That said, it probably won’t be a small group for long, since just over the past few months, Veritas Genetics announced that it will offer WGS alongside analysis commercially for a new low price of $999 [2].
I already had access to single-nucleotide polymorphism (SNP) array results from 23andMe, so a very basic question was what kind of data I could get from having my whole genome sequenced that I didn’t already have access to. First, some terminology: the difference between a SNP and a normal genetic variant is that the alternate allele of a SNP must be present in at least 1% of the population. Not surprisingly, most of the papers published about genomics on PubMed study the effect of SNPs, in large part because those are the variants for which there is sufficient power to address biomedical questions robustly. So I already had access to the majority of the well-studied variants through my SNP data. From one perspective, then, going from the ~300,000 SNPs that I got from 23andMe to the ~3,000,000,000 base pair calls in the human genome seems like a classic case of the big data trap: collecting more data without any point. And I’ll freely admit that I’ve fallen victim to this tendency at least a few times in my life.
After a little more literature searching and soul searching about what I expected to learn, it became apparent that what whole genome sequencing is best at is detecting very private variants – that is, unsurprisingly, things that are present in less than 1% of the population. Any such rare variants that I would find might be present just in my immediate family a countable number of generations back, or they might even be found only in me. But these rare variants can add up to a fairly non-trivial number. As it turns out, the average person has about 100 heterozygous loss of function variants, which include stop insertions, frameshift mutations, splicing mutations, and large deletions [3]. And since my dad was on the older side when I was born, and older male age is associated with more new genetic variants [4], I knew that I was liable to have an especially large burden of new variants.
On the big day when our sequences had finally been aligned and the variants had been called, the first thing I did was to filter those variants down to the 2000 or so that were most likely to be damaging. I scanned down the gene list meticulously, looking for gene names that I recognized. Since I had to memorize a fairly large number of disease-causing genes during my preclinical med school courses, I figured recognizing a gene name would in general be a bad sign. I was relieved and felt lucky to find no predicted damaging mutations in genes that I knew to cause major disease, such as the cancer-associated genes BRCA1/2 [5]. Overall this process was not very efficient, but it was pretty fun.
The next time that I sat down to analyze my genetic variants, I decided to filter for variants that were likely to have an effect on the way I think. So I intersected the genes in which I had predicted function-altering variants with another list from a study [6] that measured which genes have the highest RNA expression – a proxy for “are made the most” – in neurons. Here’s a plot of the results:
The green dot represents the gene in which I have a predicted damaging mutation with the strongest expression in neurons, which is the gene SYN2. The protein that this gene codes for is thought to be selectively produced in synapses, where it probably plays a role in synaptic vesicle transport [7]. Synaptic vesicles, in turn, are what neurotransmitters are stored in before they are released into the synaptic cleft to communicate with the postsynaptic neuron. You might think of them as the “cargo trucks” of the synapse, storing and carrying around the payload of neurotransmitters before they are sent to the next neuron. So naturally, I became curious about what the effect of that variant might be.
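For anyone who wants to try the same kind of intersection, here is a minimal sketch in Python. It assumes you have a set of gene symbols with predicted function-altering variants and a table of neuronal expression values. Apart from SYN2, the gene names are placeholders, and every expression value (including the one for SYN2) is made up for illustration. This is not the exact pipeline used in the course.

```python
# Hypothetical inputs: gene symbols with predicted function-altering variants,
# and invented neuronal expression values (not taken from reference [6]).
damaging_genes = {"SYN2", "GENE_A", "GENE_B", "GENE_C"}
neuron_expression = {
    "SYN2": 120.0,
    "GENE_A": 3.5,
    "GENE_B": 40.2,
    "GENE_D": 300.0,  # highly expressed, but not in the damaging-variant list
}


def rank_by_expression(genes, expression):
    """Return (gene, expression) pairs for genes found in both inputs,
    sorted from highest to lowest expression."""
    shared = genes.intersection(expression)  # iterating a dict yields its keys
    pairs = [(gene, expression[gene]) for gene in shared]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranked = rank_by_expression(damaging_genes, neuron_expression)
for gene, value in ranked:
    print(f"{gene}\t{value}")
```

Genes that appear in only one of the two inputs (GENE_C and GENE_D here) simply drop out, which is the point of the intersection.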
First, I took a look at what my actual predicted variant in the SYN2 gene was. Specifically, I was predicted to have a frameshift mutation, due to the deletion of a CGCGA sequence at chromosome 3, position 12,046,269. In general, frameshift mutations are pretty cool. DNA is made into proteins three nucleotides at a time, so insertions or deletions whose length is a multiple of three only alter a small number of amino acids. But if a mutation adds or removes any other number of nucleotides, it messes up this three nucleotide reading frame, and the whole rest of the protein is totally different. What was predicted to happen in my version of the SYN2 protein is that, 66 nucleotides after the frameshift, a new stop signal was introduced. So I would have 22 amino acids in my version of SYN2 that are not found in most people, and then the protein was predicted to end. Although it's fun to speculate that maybe those 22 amino acids could turn me into a mutant supergenius if I could just learn how to tap into their mythical synaptic powers, most likely my predicted mutant version of SYN2 would be simply degraded. And since I'm predicted to be heterozygous for the mutation, my non-mutated version of SYN2 could simply pick up the slack. That said, in the absence of compensation, I'd be expected to have ~50% less of this key synaptic protein than the average person.
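The reading-frame arithmetic above can be checked with a few lines of Python. This is only a sketch: the toy sequence is invented and is not the real SYN2 sequence. Only the deleted CGCGA and the 66-nucleotide figure come from the text.

```python
def codons(seq):
    """Split a sequence into complete three-nucleotide codons,
    dropping any incomplete codon at the end."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def is_frameshift(indel_length):
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


deleted = "CGCGA"  # the deleted sequence reported for my SYN2 call
print(is_frameshift(len(deleted)))

# 66 nucleotides between the frameshift and the new stop signal
novel_amino_acids = 66 // 3
print(novel_amino_acids)

# Toy illustration (invented sequence): deleting CGCGA changes every codon after it.
toy = "ATGGCCCGCGAATTTGGGTAA"
start = toy.index(deleted)
mutant = toy[:start] + toy[start + len(deleted):]
print(codons(toy))
print(codons(mutant))
```

In the toy example the first two codons survive, and everything downstream of the deletion is read in a different frame.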
Naturally, next I did a search for the functional role of a loss of function mutation in SYN2. The first paper I found [7] had the suddenly ominous title: “SYN2 is an autism predisposing gene: loss-of-function mutations alter synaptic vesicle cycling and axon outgrowth.” Specifically, this paper showed that two missense (amino-acid changing) and one frameshift mutation were found in male individuals with autism spectrum disorder, but none were found in male controls without autism spectrum disorder. They also showed that neurons lacking SYN2 have a lower number of synaptic vesicles ready to be released from their synapses, which is consistent with the predicted role of SYN2. I had some qualms about this paper, like the fact that they extrapolated from SYN2 homozygous knock-out mouse studies to humans that were heterozygous for a loss-of-function variant in SYN2, and indeed the mouse study that they built upon did not find a phenotype in SYN2 heterozygous knock-out mice [8]. But overall, this study was a sign that my predicted frameshift mutation might really be playing a significant functional role.
Given that I also had access to SNP data from both of my parents through 23andMe, my next step was to find out which of them I inherited the predicted SYN2 frameshift variant from, so that I could figure out which of my parents I would be able to subsequently blame for all of my problems. But this is where things took another unexpected turn. In order to discover which of my parents was the culprit, I had to analyze the raw reads in the Integrative Genomics Viewer (IGV), to find another tagging SNP that I could also see in the data from 23andMe. But when I actually looked at the reads, what I discovered instead was way more homozygous variation (seen via the single-colored vertical lines) relative to the reference genome than I expected:
This homozygosity of the variants is surprising and makes us suspicious that maybe there’s something going on other than just the mutation – maybe there was a problem in aligning my reads to the reference genome. And indeed, for technical reasons that are beyond the scope of this essay, in class we aligned to the hg19 build of the reference genome, which as it turns out, happens to differ from the hg38 reference genome at this region pretty substantially. And when I aligned one of the individual sequencing reads against the hg38 reference at this location, what I detected was not a deletion, but rather an insertion of 12 base pairs. Since 12 divided by 3 is a whole number, 4, that means that this is an in-frame mutation, which is much less likely to have the serious loss-of-function effect that a frameshift mutation would. And indeed, looking at the DNA sequence that was inserted, it appears that the insertion is probably due to a tandem repeat, with one mismatch:
So, to recapitulate, analyzing the raw reads using the updated reference genome, I found out that I likely do not have a frameshift mutation in SYN2 after all. That said, the potential presence of a tandem repeat expansion within the coding sequence of SYN2 – leading to four extra amino acids in that protein – is itself pretty interesting and could still have some sort of a biological effect. After all, this protein is likely a key component of the cargo truck for my neurotransmitters.
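The "tandem repeat with one mismatch" check is easy to express in code. The sketch below assumes you have the inserted bases and the reference sequence immediately upstream of the insertion point. Both sequences here are hypothetical placeholders, not my actual SYN2 reads, and the one-mismatch tolerance is simply the number mentioned above.

```python
def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def looks_like_tandem_repeat(upstream, insertion, max_mismatches=1):
    """True if the insertion nearly duplicates the bases directly before it."""
    neighbor = upstream[-len(insertion):]
    if len(neighbor) != len(insertion):
        return False  # not enough flanking sequence to compare
    return hamming(neighbor, insertion) <= max_mismatches


# Hypothetical sequences for illustration only.
upstream_flank = "TTACGCCGCAGCCGCAGC"
insertion = "CGCAGCCGCAGT"  # 12 inserted bases

in_frame = len(insertion) % 3 == 0
extra_amino_acids = len(insertion) // 3
print(in_frame, extra_amino_acids)
print(looks_like_tandem_repeat(upstream_flank, insertion))
```

A real analysis would also need to check the downstream flank and handle repeat units shorter than the insertion, which this sketch ignores.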
In summary, I think I can say that if you’ve had your SNP data analyzed, that's going to make up the lion's share of digestible information. However, there are likely to be some interesting things for you to learn from having your WGS data analyzed as well. First, although I haven’t yet found any rare variants in my genome that might significantly increase my risk of disease in a potentially actionable way, I certainly could have. You don’t bring a life jacket on a boat because you think you’re going to fall overboard – you bring it because you might. Second, it was enlightening to learn first-hand about the lack of adequate tools for analyzing genomes, especially at the variant calling and variant analysis steps. We really are in the Wild West era of genomics. This is both exciting and motivating. I now have a better idea of what it is like to have a likely false positive variant call like I had with SYN2.
Finally, getting your genome sequenced isn’t just about your own health – it’s also about your family’s health and the health of society at large. For example, I’m also in the process of donating my whole genome sequencing data to the Personal Genome Project (I've already put up my VCF file). If you have access to SNP data and/or you want to try to have your whole genome sequenced, and you are willing to make the data publicly available, then you should consider joining too. I think that by pooling genome and phenotype data in an open way, we’re going to make some discoveries that will improve human health in a big way.
References
[1]: Linderman MD, Bashir A, Diaz GA, et al. Preparing the next generation of genomicists: a laboratory-style course in medical genomics. BMC Med Genomics. 2015;8:47.
[2]: http://www.prnewswire.com/news-releases/veritas-genetics-breaks-1000-whole-genome-barrier-300150585.html
[3]: MacArthur DG, Balasubramanian S, Frankish A, et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science. 2012;335(6070):823-8.
[4]: Kong A, Frigge ML, Masson G, et al. Rate of de novo mutations and the importance of father's age to disease risk. Nature. 2012;488(7412):471-5.
[5]: You might be wondering, if you’re male, then why are you worried about a BRCA mutation? Well, although BRCA mutations are much more dangerous in women, they can also increase the risk of certain cancer types in men. For example, according to one study with 1000 participants, there is around a 5-fold increased risk for prostate cancer in men with a BRCA2 mutation. See: Kote-Jarai Z, Leongamornlert D, Saunders E, et al. BRCA2 is a moderate penetrance gene contributing to young-onset prostate cancer: implications for genetic testing in prostate cancer patients. Br J Cancer. 2011;105(8):1230-4.
[6]: Zhang Y, Chen K, Sloan SA, et al. An RNA-sequencing transcriptome and splicing database of glia, neurons, and vascular cells of the cerebral cortex. J Neurosci. 2014;34(36):11929-47.
[7]: Corradi A, Fadda M, Piton A, et al. SYN2 is an autism predisposing gene: loss-of-function mutations alter synaptic vesicle cycling and axon outgrowth. Hum Mol Genet. 2014;23(1):90-103.
[8]: Greco B, Managò F, Tucci V, Kao HT, Valtorta F, Benfenati F. Autism-related behavioral abnormalities in synapsin knockout mice. Behav Brain Res. 2013;251:65-74.