-
You are currently browsing the archives for October, 2011
Twitter
- Nice recommendations.. RT @Jiaaro: A few things to remember while coding in Python http://t.co/BL1hLhw8May 20, 2012 2:29
- if "rm *" deletes everything in a system, one should consider to have a better admin and training http://t.co/vE4bs4JeMay 20, 2012 4:50
- Any one still remember the company VALinux?May 18, 2012 5:01
- RT @ctitusbrown: Pursuant to yesterday's #assembly tweetstorm: Zamin Iqbal sez "Indel calling with mapping is really hard!"May 16, 2012 3:24
- "One thing that evolution shows us is that small, agile species tend to do better than large, (cont) http://t.co/YSbQqqVPMay 16, 2012 5:59
Archives
- March 2012 (1)
- February 2012 (3)
- October 2011 (1)
- June 2011 (2)
- April 2011 (2)
- March 2011 (2)
- May 2010 (1)
- March 2010 (1)
- February 2010 (1)
- July 2009 (1)
- June 2009 (1)
- October 2008 (1)
- August 2008 (1)
- May 2008 (2)
- April 2008 (3)
- September 2007 (1)
- July 2007 (2)
- June 2007 (2)
- April 2007 (1)
- March 2007 (2)
- February 2007 (7)
- January 2007 (6)
- November 2006 (1)
- October 2006 (2)
- September 2006 (2)
- July 2006 (3)
- June 2006 (2)
- May 2006 (1)
- January 2006 (3)
- November 2005 (5)
- October 2005 (4)
- September 2005 (5)
- August 2005 (1)
- June 2005 (3)
- May 2005 (2)
- April 2005 (2)
- February 2005 (7)
- January 2005 (3)
Some thought on the interview puzzles from a dot-com/DNA sequencing data processing company
Oct 22
Posted by Jason Chin in comment, programming, Uncategorized | No Comments
Run into this earlier today. If you are interested in solving computational puzzles, I think they are good ones. It is tempting to write some real code to solve them but it is more interesting and important for me to spend my time analyzing some real data to solve real scientific puzzles this weekend. Anyway, to think and find the clues to the answers is straightforward, especially during and after my morning shower time.
Regardless the verbose description of the questions, the route to solve question is quite straightforward more or less. A real implementation might be slightly more complicated. The following are my tips on “how to solve” these puzzles. (I could be giving wrong answers/tips. If you want a job from DNAnexus, you are on your own.)
(1) Insane in the Membrane
Obviously, the question has nothing really to do with any biological membrane. It is actually more related to “maze solving algorithm” or “finding shortest path” in a graph. One way to solve the problem is to convert the lattice to a graph where each node in the graph represents each empty space (“o”). The edges connect all nodes satisfying the constrain “Each successive position is only 1 nm away from the one before it”. You can start the search with a seed node that only has an edge. Then, using the standard BFS algorithm to find the longest path that connect to the seed node. If there is no node with single edge attached, one can pick any node that is not visited during the graph search as the new seed node. If you find one path that is longer the “Danny Dendrite’s genome”, output the solution. If not, try other seed node until you test all seed nodes. If you can not find any path from any seed node that is longer than the “Danny Dendrite’s genome”, output “impossible”.
(2) Hungry Hungry Coders
Think the “enjoyment values” as a matrix of M by N. What you try to do to find maximum sum when picking one single element per row but non of the element can have a share column. One obvious initial state is to pick the maximum value for each row. If there is no overlapping column, Done. If such assignment is not possible, one need to find potential other assignment which minimize the reduction of the sum. I have not to encounter such problem in my work or research yet, although I can see such algorithm is super-useful on solving practical resource allocation. A quick search shows the “correct” solution is probably the “Hungarian algorithm“. There are several different variants to solve such problem. It would be interested to know which one is mostly efficient especially in the case that there might be a lot of degenerated solutions. Also, it might be interesting to see whether it really works sociologically asking engineers to writing down 10 numbers on 10 menu items for every lunch. It seems quite a way to waste of time.
(3) Genome Search
The major constrain of the problem is “You must not load the entire genome into memory; furthermore, you may read through the genome sequence only once.” Namely, you have to stream the reference genome and doing sequence comparison at the same time. Like all good exact sequencing match problem, using hash values is the way to go. For streaming approach, I think the answer is in some sort of rolling hash. One can calculate the hash values of the “K sequences, each of length M”, into K hash values with one of the rolling hash algorithms. Then, calculate the rolling hash values of the 670G bps sequence with the various lengths of the K sequences while streaming the 670G bps sequence file through the memory. If there is a matched hash value, double check the strings do match and the matched hash value is not due to collision. By the way, good luck on sequencing and doing assembly on the 670 Gbps potentially highly repetitive genome.
Oh, well, while it could be fun to implement these algorithms to see that I indeed get the “correct” answer, I would be more interested to see how they are used in solving real genomics problems. While doing exact match string in a smart way is cool, to build a computing infrastructure to be able to do not-exact matches over and over again is way more useful, e.g., NCBI’s blast server. I assume that is one of what DNAnexus’ directions in the future. I do hope there is indeed a good HPC computation platform to help the scientific communities to solve large data / large analysis problems. In the meantime, I do also believe that the innovation on the DNA detection/sequencing technologies remains one of the most important parts driving biological/medical research to solve important problems. Maybe large data analysis with well-known analytics is only part of the equation. Using data and good analytics to solve basic technology, chemistry, signal processing problems, and to understand the nature of different kind of DNA sequence data is really fun and important.