Generating new protein sequences with a character-level recurrent neural network
This past weekend, using Andrej Karpathy's outrageously simple and helpful GitHub repository [1], I trained a recurrent neural network on my laptop [2]. If you are reading this post in part because you want to do a similar thing, rest assured that by far the most time-consuming part was installing Torch7. Well, don't rest totally assured, because that process was actually pretty annoying for me [3].

Anyway, I wanted to train a character-level neural network because I was really impressed by Andrej's generated Shakespearian sonnets, as well as Talcos's generated MTG cards. But I wanted to train the RNN on something more biological. So as input data I used a 76 MB FASTA file of the full set of human protein sequences, which is available for download via ftp here. As with any FASTA file, this is the format of the input data (I know this is boring, but it will become important later):
>gi|53828740|ref|NP_001005484.1| olfactory receptor 4F5 [Homo sapiens]
>gi|767901760|ref|XP_011542107.1| PREDICTED: uncharacterized protein LOC102725121 isoform X2 [Homo sapiens]
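If you want to poke at a file in this format yourself, here is a minimal Python sketch of a FASTA reader. It is not part of Andrej's repository. The function name `read_fasta` is just an illustrative choice, and the residues in the toy example are made up (only the header is taken from the real file).

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]  # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)  # a sequence may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


# Toy input: the header is from the post, the residues are made up.
example = """\
>gi|53828740|ref|NP_001005484.1| olfactory receptor 4F5 [Homo sapiens]
MACDEFGHIK
LMNPQRSTVW
"""
records = list(read_fasta(example.splitlines()))
```

For a real file you would pass an open file handle instead of `example.splitlines()`, since iterating over a handle also yields one line at a time.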
The lines beginning with a greater-than sign (">") contain metadata about the protein sequence that follows. The protein sequences themselves are made up of capital letters, each of which refers to one of the twenty main amino acids that are coded for by the human genome.

After changing to the directory that I cloned Andrej's GitHub repository into, I trained the neural network using the following call:

[code language="bash"]
th train.lua -data_dir data/sapiens_fasta/ -gpuid -1 -max_epochs 1 -eval_val_every 40
[/code]

One epoch means that the neural network is run once through the input file, which is almost certainly not enough passes through the data for it to learn truly useful trends, but I was limited on CPU time, and running it through just once took almost 40 hours. This RNN has no structural priors and certainly doesn't know English, so it has to learn everything fresh from the data. One hundredth of the way through the epoch, the RNN was struggling to learn basic aspects of the file format:
IEDEACNGEHGKRVCLYDGLLLSGENHGEGITRKRLEPLPQPRSPGESTLGVIVMATKQVRLEVH>DRgN|6P61|725902|Ur9r0||PT_f120145573o.r heCn r1aMthrffcenaR o rara-op2oninier1 oisceeo oo s dosrXs26mnccmagosf e]giM:mnpo1lnsrgoase poraom/oifoo rissl,]MHDTTEDVAAEYLDVVCLYMWYCVVSYQVQIDNLCKQCEDAGRKKHLNFALFDKDNSKAKVKHSVEAVGHNVVDASSAPVPYYAGSIDLQPVGVREACEQ
One tenth of the way through the epoch, the RNN had the format mostly down, and had even learned that "XP" precedes a predicted protein, while "NP" marks a protein known from experimental data (e.g., a cDNA sequence):
>gi|670774209|ref|XP_011526108.1| PREDICTED: neinator uxtylcerin-1 isoform X1 [Homo sapiens]
>gi|115370400|ref|NP_001185001.1| oimh meeroribh CAE3 pomolphinase CA-ase popotecusution maclor domaiongbnating protein 2Hisoform 309 isoform X2 [Homo sapiens]
As you can see, it had also learned that (almost) all protein sequences start with an "M", since methionine is coded for by the start codon, AUG, which the ribosome recognizes.
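These observations can be turned into a rough automatic check on generated samples. The sketch below is my own illustration, not something from the repository. The regular expression simply encodes the header layout shown in the examples above, and `check_record` is a hypothetical helper name. It reports whether a header is well formed, whether the "XP"/"PREDICTED:" pairing is consistent, and whether the sequence starts with "M".

```python
import re

# Header layout as it appears in the examples in this post:
# >gi|<number>|ref|<accession>| <description> [Homo sapiens]
HEADER_RE = re.compile(
    r"^>gi\|(?P<gi>\d+)\|ref\|(?P<acc>[A-Z]{2}_\d+\.\d+)\| "
    r"(?P<desc>.+) \[Homo sapiens\]$"
)


def check_record(header: str, sequence: str) -> dict:
    """Check one generated record against the patterns described above."""
    starts_with_m = sequence.startswith("M")
    match = HEADER_RE.match(header)
    if match is None:
        return {
            "well_formed": False,
            "prefix_consistent": False,
            "starts_with_m": starts_with_m,
        }
    # "XP_" accessions should carry "PREDICTED:", "NP_" ones should not.
    has_xp_prefix = match.group("acc").startswith("XP_")
    has_predicted_label = match.group("desc").startswith("PREDICTED: ")
    return {
        "well_formed": True,
        "prefix_consistent": has_xp_prefix == has_predicted_label,
        "starts_with_m": starts_with_m,
    }


# Header generated one tenth of the way through the epoch; the sequence is made up.
result = check_record(
    ">gi|670774209|ref|XP_011526108.1| PREDICTED: neinator uxtylcerin-1 "
    "isoform X1 [Homo sapiens]",
    "MKTAYIAKQR",
)
```

Averaging these flags over many sampled records at different checkpoints would give a crude picture of how quickly the network picks up each convention.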
Finally, by the end of the epoch, the RNN had the format down and was generating some protein names that are kind of in the "uncanny valley":
>gi|38956652|ref|NP_001018931.1| hamine/transmerabulyryl-depenter protein 1 isoform 2 [Homo sapiens]
>gi|767959205|ref|XP_011517870.1| PREDICTED: D3 glutamate channel protein isoform X5 [Homo sapiens]
This RNN has a particular penchant for combining parts of names, and some of these actually make sense, like "receptorogen" or "elongatase". I ran BLAST on ~10 of the protein sequences generated after the full epoch of training to see whether they had any evolutionary conservation, but none of them had any conservation above chance, suggesting that the RNN isn't just repeating protein sequences from its training data. I also ran a structure prediction on one of the generated protein sequences, and it is predicted to consist of one protein domain with a good template from the Protein Data Bank (PDB).
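For the "is it just repeating the training data?" question, a much cruder check than BLAST is to count how many short substrings (k-mers) of a generated sequence appear verbatim in the training sequences. The sketch below is only a stand-in under that assumption: it finds exact matches only and says nothing about evolutionary conservation. The function names and the default `k = 8` are arbitrary illustrative choices.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(generated: str, training: Iterable[str], k: int = 8) -> float:
    """Fraction of the generated sequence's k-mers found in any training sequence."""
    generated_kmers = kmers(generated, k)
    if not generated_kmers:
        return 0.0  # sequence shorter than k
    training_kmers: Set[str] = set()
    for seq in training:
        training_kmers |= kmers(seq, k)
    return len(generated_kmers & training_kmers) / len(generated_kmers)


# Made-up toy sequences, not real proteins.
training_set = ["MACDEFGHIKLMNPQRSTVWY"]
copied = shared_kmer_fraction("MACDEFGHIKLMNPQRSTVWY", training_set)
novel = shared_kmer_fraction("MWWWWWWWWWWW", training_set)
```

A value near 1 would suggest the network is copying stretches of its input, while a value near 0 is what you would hope for from a newly generated sequence. Holding every k-mer of the full 76 MB file in memory may be expensive, so treat this as a sketch for small samples.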
Here is what the protein is predicted to look like:
In ribbon diagrams like this one, coiled ribbons conventionally depict alpha helices and arrows depict beta strands. The generated protein has a reasonable amount of alpha helix (31%; the average globular protein contains 30%).

It's interesting to think about what applying RNNs to protein sequences or other sorts of biological data might accomplish. For example, you could potentially feed the RNN a list of many anti-microbial proteins as training data, to try to generate new peptides that you could test as novel antibiotics.
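As a rough sketch of how such a training set might be assembled, the Python below keeps only the records whose header mentions a keyword and writes them back out as FASTA text. This is an assumption-laden illustration. A keyword match on the header is a very blunt way to find anti-microbial proteins, the helper names are hypothetical, and the toy records are made up.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def select_records(records: List[Record], keyword: str) -> List[Record]:
    """Keep records whose header mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [(h, s) for h, s in records if needle in h.lower()]


def to_fasta(records: List[Record], width: int = 70) -> str:
    """Format records as FASTA text, wrapping sequences at `width` characters."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Made-up toy records, not real database entries.
toy_records = [
    ("toy|1| antimicrobial peptide example", "MKKLL"),
    ("toy|2| olfactory receptor example", "MACD"),
]
subset = select_records(toy_records, "antimicrobial")
fasta_text = to_fasta(subset)
```

As far as I recall from the repository's README, the training script reads a file named `input.txt` from the directory given by `-data_dir`, so the resulting text would be saved under that name. Check the README before relying on this.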
[1] See also this post, where Andrej explains the character-level RNN in more detail and uses it to generate Shakespeare sonnets.

[2] A non-CUDA-compatible MacBook Pro.

[3] How to install Torch7 and run this on a MacBook Pro that is not CUDA-compatible: a) install/update Homebrew, b) install/update Lua, c) follow the command-line instructions at the Torch7 website to install it, d) run source ~/.profile to load Torch7 at the terminal (I ignored this part of the instructions, which is part of what made it take so long for me), e) get the necessary luarocks, f) fork/clone Andrej's repo, and g) run the nn training and sampling commands with the -gpuid -1 option, since your machine is not CUDA-compatible (I also ignored this part of the instructions, to my vexation).
References

Morten Källberg, Haipeng Wang, Sheng Wang, Jian Peng, Zhiyong Wang, Hui Lu, and Jinbo Xu. Template-based protein structure modeling using the RaptorX web server. Nature Protocols 7, 1511-1522, 2012.

Andrej Karpathy's GitHub repository: Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch. 2015.