Tuesday, June 28, 2011

The Inevitable

I took a break from the lab last week and headed up with my boyfriend to his brother's wedding at their family farm in Virginia. It was beautiful, just on the edge of Shenandoah National Park, and it was a break from the heat and the monotony of work.

Which brings me to the inevitable. Meeting new people, and answering the question, "so what do you do?"
Around scientists this is easy. I may feel slight embarrassment when explaining what I do to molecular biologists, because sometimes there is slight shame in "discovery science" (which there shouldn't be, but that's another topic for another day). I work in human genetics. It's hard, if not impossible, to design an elegant experiment to test a hypothesis when what you're interested in is humans.

But what do you say around people with no science background at all? It's especially hard because my boyfriend is one of those do-gooder types. If we're standing next to each other and someone asks what he does, he says, "I'm starting my master's in social work, and I'm specifically interested in resettling African refugees." At which point you just watch people swoon. Everyone knows what social work is, and everyone knows it's such a noble profession where you tend to be overworked and underpaid, so this makes A (the boyfriend) just the bee's knees at social events. And then they look over at me, and I'm all, "human genetics, yo" and a few things happen.

For one, the conversation tends to come to a complete stop. It's actually a great talent of mine, to bring happy light-hearted conversations to a screeching halt. I'm kind of awkward. This happens for one of two reasons. One, people just don't understand or relate. Even if people can't relate to social work, they can understand what a social worker does. You help people. That's awesome. What does a scientist do? Crickets. The second thing that happens, which I'm wildly trying to prevent, is people are intimidated. You do science?? You must be smart. At this point I just want to pull out my GPA and be like, "See? See that GPA barely floating above a 3?? And that's really only because I padded it with my creative writing classes!" On the vast spectrum of biology majors, I'm definitely on the lower end.

I tend to lean towards the more awkward side of social situations. I'm kind of quiet around new people, and I get slight social anxiety in big groups. Luckily, because I am (sort of, pretending to be) a scientist, I adapt very quickly to new situations. Here are some tips that I use to deal:

  1. Alcohol helps. Alcohol helps all social situations. After a beer or two, it's way easier to introduce yourself and unstressfully describe what you do. 
  2. Make it understandable. This is easier in the human genetics field. I now open with the disease I work on, even though in reality it's secondary to the genetics and the biology I do. I'll usually tack on "genetics", but sometimes I'll skip straight to saying, "I'm studying autism." If you work on chromosome segregation in yeast (Hi Dad!) (and that's even the dumbed down version), and there's no prayer of getting someone to understand what you do, go straight to the disease. "Understanding the mechanisms that cause cancer" works. Cancer! Down Syndrome! etc. 
  3. Make it relatable. Sometimes people just honestly want to know what you do every day. A lot of people's science experience stops at high school, and some people have vague recollections of using pipets or microscopes. If you take pictures of things on microscopes, say that. I've been differentiating neurons from neuronal stem cells lately, and sometimes I'll say, "I'm growing neurons." That's cool. I tend to not mention stem cells, especially at weddings or if I don't know the people very well, because sometimes that gets you to a political discussion. Unless you're into that, of course, then by all means, go ahead.
  4. Just relax and be yourself. I love what I do. It's why I'm doing it. And if you just forget that you might come off as intimidating or weird or nerdy for a minute, and you just try to convey that you love science, and the creativity, the flexibility, the constant discovery, people will be able to relate to that. Sometimes I just say that my job is awesome because it's so conducive to having a kid. I can make my own hours, and be ready for sick days and teacher work days. My lab isn't too competitive, and everyone has a wide variety of interests. And I love it here. And remembering that you love what you do, well, that's the greatest confidence boost you can get. 
So, young padawan, in conclusion, go boldly into non-scientific social situations. And when in doubt, remember, you're curing cancer. 

Monday, June 20, 2011

Monday morning

So I get this email from my post doc at 9:30 on Monday morning.

Hey there, I'm working from home today, but I'm going to be talking to [The Big Boss Man aka the PI] so could you please send me some info on what has been accomplished on your end this week. 
...
It's Monday morning! At 9:30! The only thing I have accomplished so far has been turning on my computer, discovering that Chrome was running a little slow, and clearing out my cache.

It's Monday, people, let's all act accordingly.

Tuesday, June 14, 2011

A brief overview of sequencing technologies

I started writing a post earlier today about some cool collaborative genetics projects that are going on around the world, and then I realized that these projects all rely on a common knowledge of sequencing. In an effort to not completely sound like a Wikipedia article on sequencing, I’m going to just focus on the ones that I use in the lab on a fairly regular basis, since next-gen sequencing is sort of a “you seen one you seen ‘em all” type of technology. So I’m going to go through in chronological order: Sanger sequencing (the method behind shotgun sequencing), Next Generation/High Throughput Sequencing (with the emphasis being on Illumina’s method), and then touch briefly on Next-next-gen sequencing (I think they’re calling it third generation now). Then I’ll take a brief look at how we use sequencing in the lab. Ironically, we now all use High Throughput Sequencing as our first pass through, but everything called in HTS, and I mean everything, will get verified through Sanger sequencing, the oldie but goodie.

Shotgun sequencing
Shotgun sequencing is built on Sanger sequencing, named for the dude who developed it, Frederick Sanger. (Strictly speaking, Sanger is the chemistry, and shotgunning is the strategy of sequencing lots of random fragments with it, which I get to in the recap.) Here’s how it works. You amplify the sequence you want, whether by PCR or in a vector in vivo, and then your sequencing reaction contains your purified target sequence, a primer for that sequence, DNA polymerase, regular ol’ dNTPs, and fluorescently labeled ddNTPs (dideoxynucleotides), which are chain terminating: after one is added, no other nucleotides can be added on after it. Your sequence will stop after a fluorescently labeled ddNTP is added. See where I’m going with this yet? So you put all that stuff in a tube and start a PCR reaction cycling (I know, I know, polymerase chain reaction reaction, but it’s like ATM machine, you just sometimes need to say the machine after it even though it’s included in the acronym), and what happens is, polymerase works its magic, attaching dNTPs and labeled ddNTPs on to the exposed single strand of your target DNA sequence. Except that every time a labeled ddNTP is added on, that DNA polymerase stops putting on more nucleotides, so that fragment is stuck at a set length. So after x number of cycles, you’ll have a whole bunch of fragments of your sequence, all with one labeled nucleotide as the last one on that sequence, like thus:
Your labeled fragments are then sorted by size by electrophoresis in a capillary tube, and the colors are read off one by one, and the data you get back looks like this:
You get one read per sequence, so in order to detect any heterozygosity, you’d need to run sequencing in both the forward and reverse directions.
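If it helps to see the chain-termination idea as a toy, here is a minimal Python sketch. Everything in it is made up for illustration: the template, the 20% chance of grabbing a terminator at each position, and the function names. It only models the logic (extend, stop at a labeled terminator, sort by size, read the last base), not the real chemistry.

```python
import random

# Hypothetical template, written in the order the polymerase reads it.
TEMPLATE = "ATGGCGTACGTTAGC"
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def extend_once(template, terminator_fraction, rng):
    """Copy the template base by base until a labeled terminator goes in."""
    fragment = []
    for base in template:
        fragment.append(COMPLEMENT[base])
        if rng.random() < terminator_fraction:
            return "".join(fragment)  # terminator added: extension stops here
    return None  # ran off the end without a terminator, so nothing is labeled


def sanger_read(template, n_fragments=5000, terminator_fraction=0.2, seed=0):
    """Make many fragments, sort them by size, and read off the last bases."""
    rng = random.Random(seed)
    labeled = set()
    for _ in range(n_fragments):
        fragment = extend_once(template, terminator_fraction, rng)
        if fragment is not None:
            labeled.add(fragment)
    by_size = sorted(labeled, key=len)  # this is the capillary's job
    return "".join(fragment[-1] for fragment in by_size)


read = sanger_read(TEMPLATE)
expected = "".join(COMPLEMENT[base] for base in TEMPLATE)
print(read == expected)
```

With enough fragments there is one ending at every position, so reading the last base of each fragment from shortest to longest spells out the strand that was synthesized.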
So there you have it. Sanger sequencing. Let’s recap.

Pros: For small sequences? It’s cheap and fast.  I can throw my product in a microcentrifuge tube with some primers, walk it down to our sequencing facility, and for 8 bucks, they’ll give me really high quality sequence of about 800 base pairs or so.

Cons: You need to know your target sequence (or at least enough of it to amplify and sequence), which means you need to already know the genomic location, and you need to design primers flanking it.

Where it’s been: People got whole genome data from these little dudes. That’s where the shotgunning comes in. You get massive amounts of these short fragments, and then some crazy algorithm + a supercomputer + someone smarter than me will sit there and align sequence fragments by where they overlap, and construct a whole genome sequence from that. No one does that anymore.
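For the curious, here is a toy version of that "crazy algorithm": a greedy overlap assembler in Python. This is only a sketch under big assumptions (error-free reads, no repeats, a made-up 18-base "genome", and a minimum overlap of 3 that I picked arbitrarily). Real assemblers are far more sophisticated.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # nothing overlaps any more
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs


# Hypothetical genome and three overlapping reads, given out of order.
genome = "ATGGCGTACGTTAGCCTA"
reads = [genome[10:18], genome[0:8], genome[5:13]]
print(greedy_assemble(reads))
```

The reads go in shuffled and come out stitched into a single sequence, purely on the basis of where their ends overlap.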

Where it is now: People use Sanger sequencing for much smaller applications now. We use it to verify PCR fragments and vector sequences. We’ll also use it to verify some of the sequence calls that we get from the high-throughput methods of DNA sequencing.

Next Generation Sequencing (NGS)
Next Generation Sequencing has a couple of aliases. Some people call it High Throughput Sequencing, others call it, well, I spoke too soon. I think people just call it next gen or high throughput. There are a bunch of platforms for it, which is great. Next gen sequencing is the textbook definition of how capitalism works. Competition to come up with the best product for the lowest price is allowing for the development of some genius technologies at prices that are allowing more and more people to use this as an integral part of their research. That being said, I’m really only going to talk about Illumina’s platform, because it’s what my lab uses, and it’s the one I’m most familiar with. The actual technologies are a bit different, but the concept of high throughput/efficiency is the same for all.
There are two big steps in NGS, I like to think of it as “at the bench” and then “at the really scary expensive sequencing machine”, but I’m not requiring that you use my terminology. At the bench is preparing the genomic library, and at the really scary expensive sequencing machine is cluster generation and the actual sequencing.
At the bench
Preparing genomic library: You start out with your genomic DNA, you shear it into more manageable sizes (~300 bp), and you ligate Illumina’s sequencing adapters on to the fragments. This is your first step, and these sequencing adapters will be necessary for the sequencing steps. You then amplify these fragments. The primers that you use to amplify match the sequencing adapters that you ligated on, which ensures that the fragments you carry forward have adapters ligated on both of their ends.
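Here is a rough Python sketch of that library prep logic: shear, ligate, then keep only what would amplify. The adapter sequences are placeholders (not Illumina's real ones), and the 90% ligation efficiency and the fragment size spread are numbers I made up just so that some fragments fail.

```python
import random

# Placeholder adapter sequences, NOT Illumina's real adapters.
ADAPTER_5 = "AAAACCCC"
ADAPTER_3 = "GGGGTTTT"


def shear(genome, mean_size, rng):
    """Chop the genome into consecutive pieces of roughly mean_size bases."""
    fragments = []
    start = 0
    while start < len(genome):
        size = max(1, int(rng.gauss(mean_size, mean_size * 0.1)))
        fragments.append(genome[start:start + size])
        start += size
    return fragments


def ligate(fragment, rng, efficiency=0.9):
    """Try to stick an adapter on each end; each attempt can fail."""
    left = ADAPTER_5 if rng.random() < efficiency else ""
    right = ADAPTER_3 if rng.random() < efficiency else ""
    return left + fragment + right


def amplifiable(molecule):
    """PCR with adapter primers only works if both ends carry an adapter."""
    return molecule.startswith(ADAPTER_5) and molecule.endswith(ADAPTER_3)


def prepare_library(genome, mean_size=300, seed=0):
    rng = random.Random(seed)
    ligated = [ligate(fragment, rng) for fragment in shear(genome, mean_size, rng)]
    return [molecule for molecule in ligated if amplifiable(molecule)]


# A random stand-in "genome" just to have something to shear.
genome = "".join(random.Random(1).choice("ACGT") for _ in range(3000))
library = prepare_library(genome)
print(len(library), "fragments made it into the library")
```

The amplification step doubles as a filter: anything missing an adapter on either end simply never gets copied.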
Illumina Sequencing
The Machine: Your prepared genomic library is placed on a flow cell. A flow cell has eight lanes, and in general, one sample is run per one lane.
Cluster Generation: In this part, your single-stranded fragments randomly attach to the inside of the flow cell channels. Remember those adapters you ligated on to prep your library? Primers matching them are on the surface of that flow cell.

Unlabeled nucleotides are then added, as well as an enzyme that initiates solid-phase bridge amplification. This just means those free standing ends bend over to find their other primer, like this:


The enzyme also works to then make all of your little single bridges into double stranded bridges. The double stranded molecules are then denatured, leaving only single stranded templates attached to the flow cell (but because you made them double stranded before denaturing, you have complementary strands.) You repeat this about a million times, so you end up with several million dense clusters of double-stranded DNA. 

You end up choosing only one sequencing adapter, so each cluster ends up with strands in only one direction.

The Actual Sequencing
A sequencing primer is added, plus DNA polymerase and all four labeled dNTPs, which again are made so that only one base can be added per cycle. After that first base goes on, a laser is used to excite the fluorescence, and an image is captured of the emitted fluorescence. The images look something like this:

Then you literally rinse and repeat. Rinse to remove all the leftover dNTPs that didn’t stick on that last time. Oh, but then you have to remove the terminator property of the dNTPs that are stuck on to allow further extension. Then you add more dNTPs and take another picture.

See? Same place (those are your clusters) but a new color, and a new picture is taken.
Cycles continue until each read is 76 bases long (one base per cycle). Computers analyze the image data to give you actual sequence. The cool thing is that they use astronomy imaging techniques to monitor the same place over time. After you analyze all your millions of clusters to get your reads, you have real whole genome sequence!
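To make the "same place, new color, new picture" idea concrete, here is a small Python sketch. It is a cartoon, not Illumina's software: the dye colors, the cluster coordinates, and the templates are all invented, and each "image" is just a dictionary from a cluster's position to the color it flashed that cycle.

```python
# Hypothetical dye colors; real instruments use their own dye set.
COLOR_OF_BASE = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_OF_COLOR = {color: base for base, color in COLOR_OF_BASE.items()}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def take_image(clusters, cycle):
    """One cycle: every cluster adds a single labeled base and lights up."""
    image = {}
    for position, template in clusters.items():
        if cycle < len(template):
            image[position] = COLOR_OF_BASE[COMPLEMENT[template[cycle]]]
    return image


def call_bases(images):
    """Follow each position through the stack of images to build its read."""
    reads = {}
    for image in images:
        for position, color in image.items():
            reads[position] = reads.get(position, "") + BASE_OF_COLOR[color]
    return reads


# Two made-up clusters, keyed by their (x, y) spot on the flow cell.
clusters = {(10, 42): "ACGTTGCA", (57, 8): "GGATCCAT"}
n_cycles = 5
images = [take_image(clusters, cycle) for cycle in range(n_cycles)]
reads = call_bases(images)
print(reads)
```

The read length is set by the number of cycles you run, which is why 76 cycles gets you 76 bases per cluster.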
Again, this is only just sequence. You don’t know where in the genome it aligns to, only that it existed in your original sequence. People use a variety of programs to align sequences to a reference genome.  I’ll go a little into aligning after I briefly touch on a variation on this theme.

Exome sequencing: A lot of times, groups will choose just to sequence the “exome”, that is, all of a person’s exons. The bet is, that this is where the good stuff is going to be, mutations in coding regions are definitely A Very Bad Thing, and also way easier to functionally analyze and decide whether or not it’s a causative bad change. Plus, it’s easy. You add a sequence capture step while you’re preparing your library, so you only capture the exons.

Why exome and not whole genome? The main reason is $$$$$. Exons make up about 50 Mb of the genome, as opposed to the full 3 Gb. (Mb=megabase, Gb=gigabase: that’s 50 million basepairs compared with 3 billion basepairs.) It’s about $3k to sequence all the exons of a person with Illumina, and about $10k to sequence everything. Or something like that. The prices are going down each day though. Plus, the most useful information is coming from your coding regions anyway. A lot of times, when people do whole genome sequencing, they filter out all the noncoding regions anyway. It’s a lot of data, and it’s easier to tackle parts that you know are important. If you can’t find anything then, it’s time to look in noncoding regions. But we know so little about the genome, more information is not always better.
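A quick back-of-the-envelope in Python, using only the rough numbers above (which, again, are "something like that" figures and will be out of date fast):

```python
# Rough figures quoted in this post, not official prices.
EXOME_BASES = 50_000_000
GENOME_BASES = 3_000_000_000
EXOME_COST_USD = 3_000
GENOME_COST_USD = 10_000

exome_fraction = EXOME_BASES / GENOME_BASES


def cost_per_megabase(cost_usd, bases):
    """Dollars spent per million bases of target."""
    return cost_usd / (bases / 1_000_000)


print(f"Exome share of genome: {exome_fraction:.1%}")
print(f"Exome:  ${cost_per_megabase(EXOME_COST_USD, EXOME_BASES):.2f} per Mb")
print(f"Genome: ${cost_per_megabase(GENOME_COST_USD, GENOME_BASES):.2f} per Mb")
```

So the exome is under 2% of the genome. Per megabase it is actually the pricier option (about $60 versus about $3.33 with these numbers), but the total bill is what the lab budget cares about.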
Pros: Next gen sequencing is the crux of "discovery" science, that is, looking for something in the genome that you have no idea where it is, or even if it's there. It generates massive amounts of data which brings me to...
Cons: The sheer amount of data that you get from this means that it takes a huge team, and huge computer power, to sort through it all, and there are still large amounts of unaligned sequence that no one can make head nor tail out of. And for all you know, the piece that you're looking for is in that little nugget on your hard drive that you don't know what to do with. And it's expensive.



Caveats
The reference sequence: The “one” reference sequence that people use in general, as used by the UCSC browser, is this one unknown guy in like, Buffalo, NY. See any issues with that??? There are an insane number of issues with that. When we use this reference, we are taking a huge huge chance that this is just the normal framework of the human genome. This is really most likely not the case. There are now a ton of databases that are purely dedicated to documenting all variation that has been reported in the human genome. dbSNP, 1000 Genomes, HapMap: these are huge collaborative projects whose sole purpose is to work with and around the fact that people are so different, there’s no way we can have just one reference sequence to align to.
Speaking of aligning, there are two words that people use to describe what they do with their short little sequences from the sequencing machine. They usually say that the next step is aligning to the reference or assembling. A lot of the time people use the two interchangeably, so much so that it’s pretty much accepted, but they’re not the same; there’s a subtle difference. Aligning your sequence to a reference implies just that: based on your sequence, you look at the reference sequence, see where that piece goes, and put it there. When you assemble your own sequence, you’re putting the pieces of your sequence together based on the parts of those sequences that overlap each other. That’s how the first “shotgun” sequences were put together.
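To show the "see where that piece goes and put it there" version, here is a deliberately naive aligner in Python. It just slides the read along a made-up reference and counts mismatches; the one-mismatch allowance is my own arbitrary choice, and real aligners use indexes and scoring schemes that are way beyond this sketch. Note that, unlike assembly, the reads never get compared to each other, only to the reference.

```python
def align(read, reference, max_mismatches=1):
    """Return (position, mismatches) for the best placement, or None."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for r, w in zip(read, window) if r != w)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best


# Hypothetical reference and reads.
reference = "ATGGCGTACGTTAGCCTA"
print(align("GTACGTTA", reference))  # a perfect match
print(align("GTACCTTA", reference))  # same spot, one base different
print(align("CCCCCCCC", reference))  # goes nowhere: an unaligned read
```

The second read is the interesting one: it lands in the same place with one mismatch, which is either a sequencing error or a real variant, and figuring out which is where the follow-up work comes in. The third is the kind of read that ends up in that unaligned pile on your hard drive.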
That concludes this super brief (well, I tried to make it super brief, I’ll give me...a C for effort) background on sequencing. Next, I’ll look at what people are doing with these technologies. Stay tuned for stuff on the 1000 Genomes Project and HapMap.









References: I got most of this from my head, and some powerpoint slides left over from classes I took in college. Unfortunately, my prof cited a post doc, who didn’t cite anything so the google images that I’ve been basing my paint images off of were found here

Thursday, June 9, 2011

Not a blog about autism

I swear, this isn't a blog all about autism, but a recent trio of papers published in Neuron has made the news. (LA Times) They're looking at de novo copy number changes (complete opposite of the last paper I read) and one also did some cool pathway analysis stuff. I might look into them a little more later, but now, off to do some lab work. I purified a bunch of vectors via midi-preps yesterday (~600ug), and I've gotta go send them off for lentiviral packaging.

See ya later!

Wednesday, June 8, 2011

Journal Club: Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations

 Hey y’all! Welcome to my first ever ever post on this new blog. I’m starting out with a journal club feature--a close reading of an interesting or recently published paper in a field I hopefully know a little about.

Today I’ll be reading and writing and working through this paper. PMID: 21572417
(**don’t know what a PMID is? Check out my frequently used acronyms page here)



First off, a little bit of background. Autism aka ASD, or Autism Spectrum Disorder, is an extremely complicated disorder, diverse in phenotype, and the genetic etiology is wildly unknown. There aren’t even really many candidate genes that people can agree on yet. We’ll touch more on that later as we get through this paper. This paper specifically states that, “ASDs are characterized by pervasive impairment in language, communication and social reciprocity and restricted interests or stereotyped behaviors.” There have been some GWAS hits for autism, and there have been a few rare variants of large effect that have also been described, but for the most part the genetic basis for the vast majority of autism cases remains unknown. And for the stuff that is known, it’s a long way away from being useful information to the general public (read: clinical setting). And long way is an understatement. It’s a really really long way away.

Hypothesis

What O’Roak et al. hypothesize is this: that sporadic cases of ASD (that is, families with only one child with autism and no family history, cases that seemingly came out of genetic nowhere) are more likely to result from a de novo, or new, mutation (not inherited from either parent), as opposed to cases in families with multiple affected individuals, which should more likely result from inherited variants.




Now I’ma let you finish: but hold on a minute. There is a huge, huge assumption that they are making here, which is that cases that seem sporadic actually are sporadic. They may not be. Families that have a first child with autism are a lot less likely to have more kids. And if they have two kids who are born with autism they are really really less likely to have more kids after that. (Like those scientific terms I used there?) So discovering cases that are truly familial is really hard. The fact that this hypothesis only focuses on the rare variant model of disease etiology is another shortcoming, but you can’t really fault this paper for that...yet. You can only test one hypothesis at a time, after all. What’s bad is when you limit yourself only to that hypothesis in the results and discussion. But we haven’t gotten there yet. </end Kanye interruption>

Methods

20 Autism trios: this is generally two parents and an affected kid. Their clinical evaluations can be found in the supplementary information. Here’s another source of bias. ASD is such a diverse disorder, thus the inclusion of the word “spectrum” in its name. As stringent and as uniform as the diagnostics manual tries to be, there is still a lot of wiggle room that is so dependent on the clinician. I know a clinician that pretty much diagnoses anyone who’s the least bit weird as BAP--or Broad Autism Phenotype. If one clinician were to evaluate the parents as having BAP, and then the kid as ASD, that’s like, two weird parents having a super weird kid, and that could severely affect the interpretation of the data. Because then you’re not looking for a rare de novo mutation. Then you’re looking for maybe two subtle-effect, perhaps common, variants that together compound to a stronger phenotype in the kid. Just, you know, a pitfall of limiting yourself to the rare variant model. Agh sorry, back to the methods.
20 Autism trios.

They did aCGH on them all, and found no large CNVs, except for in one: a maternally inherited deletion. Remember, they used trios, so they have data on the mom, the dad, and the kid, which is important in discovering origin. This is good because there aren't that many great algorithms (yet) to call copy number variations based on exome sequencing results (read depth, etc.). Some groups are working on it, but I don't think it's quite there yet.

Then they did exome sequencing. That is, sequencing all the exons. (Because that’s the most likely place for a rare protein-altering variant to be, right?)
Filtering methods: They threw out all variants previously observed in dbSNP, the 1000 Genomes Project, and other exome sequencing data they had. They identified <5 de novo candidates per trio, and validated them using Sanger sequencing.
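The filtering logic boils down to set subtraction, so here is a minimal Python sketch of the idea. To be clear, this is my cartoon of it, not the paper's actual pipeline: the variants below are invented, the `known` set stands in for dbSNP, 1000 Genomes, and their other exomes, and real pipelines also have to worry about coverage and genotype quality in the parents.

```python
def de_novo_candidates(child, mother, father, known):
    """Variants in the kid that are not previously reported and not in either parent."""
    novel = child - known           # throw out anything seen before
    return novel - mother - father  # throw out anything inherited


# Hypothetical variants: (chromosome, position, reference allele, alternate allele)
known = {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")}
mother = {("chr1", 1000, "A", "G"), ("chr3", 42, "G", "A")}
father = {("chr2", 500, "C", "T")}
child = {
    ("chr1", 1000, "A", "G"),   # common variant, inherited from mom
    ("chr3", 42, "G", "A"),     # rare, but inherited from mom
    ("chr12", 777, "C", "T"),   # not in the databases or in either parent
}

candidates = de_novo_candidates(child, mother, father, known)
print(candidates)
```

Only the last variant survives, and that is the kind of candidate that would then get walked down the hall for Sanger validation.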

Results

1) The overall protein-coding de novo rate per trio was higher than expected.

2) Using two independent quantitative measures, the Grantham matrix score, for the nature of the amino-acid replacement, and the Genomic Evolutionary Rate Profiling (GERP) score, for the degree of nucleotide-level evolutionary conservation, they concluded that the de novo mutations they found were subjected to stronger selection and are likely to have functional impact.

3) 4 out of the 20 trios had “disruptive de novo mutations that are potentially causative, including genes previously associated with autism, intellectual disability and epilepsy.”

4) These genes lead to ASD presentation (they go more in depth in the paper, I’m pulling out basically the last sentence of each paragraph that they do.)
a) “Our data suggest that de novo mutations in GRIN2B may also lead to an ASD presentation.”
b) “SCN1A was previously associated with epilepsy and has been suggested as an ASDs candidate gene.”
c) “Additional study is warranted, as laminins have structural similarities to the neurexin and contactin-associated families of proteins, both of which have been associated with ASDs.”
d) FOXP1  encodes a member of the forkhead-box family of transcription factors and is closely related to FOXP2, a gene implicated in rare monogenic forms of speech and language disorder.

Discussion

There wasn’t overwhelming evidence showing excessive burden of mutations in ASD candidate genes. The people in whom they found potentially causative de novo mutations were all among the most severe cases. That is, most had a pretty severe intellectual disability, and features of epilepsy. And also, more importantly, the genes that they did identify with de novo mutations had also been “disrupted in children with intellectual disability without ASD,” which, they acknowledge, “provides further evidence that these genetic pathways may lead to a spectrum of neurodevelopmental outcomes depending on the genetic and environmental context.”

Way to cover your butts, y’all. It might be autism but it also might not.

Here are my thoughts.

First of all, this paper cannot prove causality. Just because someone has a deleterious allele does not mean that that is the cause of the disease that they have. It could be a silent mutation. They could have one good copy. One good copy might be enough to carry you through with no bad effects. Their conclusions that these genes lead to a presentation of autism are purely based on candidate genes and what’s been seen before. The conclusion that de novo mutations may contribute substantially to the genetic etiology of ASD...doesn’t really hold up here, because, well, they haven’t proved that these are the causal mutations.

We know that exome sequencing works in finding new mutations. It’s brought us a long way. This same group brought us the genetic cause of a new Mendelian disease by only sequencing the exomes of 4 people. Exome sequencing works for Mendelian diseases, plain and simple. Therefore, it would be safe to say that exome sequencing also would work for cases of autism that appear to be mini-Mendelian disorders, that is, an ASD phenotype, but a little more severe, so that it happens to be caused by a protein-altering mutation.

However, exome sequencing, and this study in general (again, in my opinion), does not add any new information to the genetic etiology of autism. Correct me if I’m wrong. Did you see a new pathway illuminated? Did you see a really large sample size and a really small p-value? Yeah, there were some pretty looking genes that coded for like, sodium channels, and things that look like neurons, but there were simply not enough numbers, and not enough molecular follow up, to claim what they claimed from the start.

In my opinion, and this is truly my own opinion, and sometimes I have opinions on things I’m not professionally trained in but instead on things I’ve read about on the internet, this paper got published because people are so hungry for any inkling as to what causes autism. And their methods, well, are really inoffensive. Sequencing has worked in the past. And it will work again. It will find variants that are rare, and do have severe phenotypic effects, however, it’s not going to help that much with disorders like autism. Or rather, it will help in the cases of autism that act like Mendelian diseases. But being able to explain a large amount of cases that occur, and a lot of the less severe cases that occur? We’re still waiting for that one.