What Will You Choose?

Here is the “de-nosing”/”consensus” “algorithm” (A not-very-efficient one, but it takes less than 2 minutes to write..)

pixel = {}
for im in rgn_array:
    for x in range(im.size[0]):
        for y in range(im.size[0]):
            pixel.setdefault( (x,y), [])
            pixel[(x,y)].append( im.getpixel( (x,y) ) )
...
for x in range(im.size[0]):
    for y in range(im.size[0]):
        rgb = Counter(pixel[(x,y)]).most_common(1)[0][0]

It is not much different…. (maybe different abstraction about real world)

I was cooking my breakfast and started to think about “why hands-on cutting-edge bioinformatics is harder (to teach for non-geeks)?” It reminds me how I learn to do some bench work. Some basic unix tools, e.g., “awk”, “sort”, “python”, etc., are more like those little gadgets, a vortexer, various kinds of pipets, different kind of tubes, centrifuges, growth medium, etc., on a bench. If we are doing cutting-edge molecule biology, we might not want to put everything together before knowing exactly what to do on a larger scale automatic liquid handler. In the physical world, it is easy for people to have the right perception about the scale of the problem. On a computer, it is hard to people to realize doing fundamental image processing research is very different from knowing how to run “Photo Shop”. Don’t get me wrong, I love Photo Shop, but it is not a tool for doing cutting edge image processing research (I believe Matlab and some python/C code are the standard there now.) And it needs a lot of man-hours / capitals from cutting-edge image processing to Photo Shop.

When people start to develop a new molecule biology protocol, they will survey the exist tools they have to see whether some of the tools can help them to do step A, B or C. This is the same for bioinformatics too. When a molecular biologist does not find a tool that can solve their problem, he/she might start to check out VWR or some other catalogs to see if there is a new gadget (e.g. a different kind of gel box, incubator, purification column, etc.) that can make his/her work more efficient. This is the same in bioinformatics too. Some bioinformatics tools are obvious to use. However, for some other tools, reading manual carefully would be essential to full utilize its power. To me, the good thing about computational work is that, most of relevant information are on-line (in the system or on the Internet), so most of time I can be fully self-sufficient. In the meantime, one can still learn a lot from exchanging relevant information/little tricks with each other if there is some common basic understanding how things work. I believe this is the same for bench work too. For example, if one does not understand how different “columns” for filtration/purification work, one can not efficiently develop a strategy using them. (Yes, I had a lot trouble about those. I guess that I did not do enough homework. Maybe that is why I come back doing computational work mostly.)

Running the de Brujin graph IPython Notebook Example

First, you will need to have a correct python environment to use the notebook. If you don’t have an ipython 0.13 environment installed, you need to setup for whatever computer you are using. I use a Mac or a Linux VM for most of my personal work. (Linux VM/Linux/Windows for my day time job.) Here is my instruction on how I set it up on my own computer: pythonbrew_ipython.rst

Once you get the pythonbrew environment installed. Start with a clean shell. Activate the environment by

$ source "$HOME/.pythonbrew/etc/bashrc"

Now go to a clean working directory. Let’s call it “workdir”.

$ cd workdir

You will need to install “networkx” for the de Brujin group example to work.

$ pip install networkx

Then, you can clone my github repository and start IPython Notebook in a correct directory

$ git clone https://github.com/cschin/ipython_d3_mashup.git
$ cd ipython_d3_mashup/ipython_13_vis_example/
$ ipython notebook --pylab=inline

If you have a Mac OS X.8 and you don’t block any local host port, you should get your Safari browser pop up.
Click the link pointing to the “De_Bruijn_VIS” notebook. It should open the notebook. Shift-Enter to execute each cell.
The notebook does use some IPython extension mechanism to download some extra code from the github python_d3_mashup repository. Sometimes, you might need to run the notebook twice to make sure some of the extension is working.

I could put up a video instruction later if I have time. And if you damage on your computer by following these instructions, you are on your own.

Hacks About d3.js + IPython 0.13 official release

Thanks the IPython team for their excellent work on 0.13 release. The new re-factored javascript for IPython 0.13 notebook makes writing mashup using d3.js + IPython simpler. I put two examples in ipython_13_vis_example/ at github

The examples should work with my own IPython vis_0.13 branch. (More specifically, I test them with this commit.)

In IPython 0.13, I do not need to patch IPython’s official CodeCell and Notebook javascript like in the previous hack. I add two extra files, IPython/frontend/html/notebook/static/vis/vis_extension.js and IPython/frontend/html/notebook/visutils.py, in the code tree to support excuting javascript code from IPython and excute Python code from javascript.

The GDP_CO2_Example.ipynb only uses the vis_extension.py. It shows how to make a movable chart with IPython + d3.js.

The Word_Ladder_network_vis.ipynb is an experiment to show how to build interacitve widget to show it is possible to use python code as callbacks for some html elements.

Currently, I don’t feel happy about the visutils.py code. It is quite ugly. If time permits, I will think a better way to make the mapping between javascript objects and python objects more transparent. It is quite tricky to debug if any simple mistake is in the code.

Once I get some more time over the coming weekend. I can post some screen shots or videos.

Can Not Resist Hacking IPython + d3.js, Another Force Layout Demo for Word Ladder Game

Post a video for visualizing the neighbors of English word (http://en.wikipedia.org/wiki/Word_ladder). It shows what can be done now with minimum change to IPython 0.13-dev source + some simple monkey patches.

I will write down what I think where we can go from here later. The IPython notebook can be download from here. The monkey patches that make this working can be downloaded from my fork of the IPython source code from the GitHub site too.

Yet another ipython + d3.js example: motion chart

I gave a lightening talk in a recent Bay Area Python meet-up. I went over some of my recent hacks on combining ipython notebook and d3.js. What I wanted to show was how to mix python code and javascript code to create a dynamic programming/data analysis notebook. I created yet another example to demonstrate the great potential on combining the powerful tools.

If you are interested, you can try the this ipython notebook. You will need to download the development branch of the ipython v 0.13 to see the notebook. The notebook itself includes some of the explanation on how to run it and how it is done. I did not spend too much polishing the code and the motion chart, but it got the basic ingredients. If you want to peek it, here is a short screen recoding to show it looks like.

Experimenting with ipython notebook bi-directional communication

Thanks for Brian Granger pointing out how to make bi-directional communication from javascript in a ipython-notebook front-end to back-end ipython kernel using the existing websocket/zmq channel architecture in ipython (see the thread ). I have been hacking around to see how to do it. I need to modified a few lines of the ipython-notebook javascript to make it work ( see my github commit ). I wrote some example to show how it works ( the ipython notebook and a screen shot ). Pretty cool that it works. It seems that one can develop a widget library to avoid hand-crafting both the javascript and python code for such communication. All right, one more small step toward to building some cool interactive visualization / analysis tools with ipython.

iPython Notebook / d3.js mashup

While I have been using ipython for a long time, I never really it more than just checking whether some code snippets working as expected. (Well, I tried to play with the parallel computing framework with ipython, but I never put it into production.) Just recently, I start to look into the ipython web-based notebook feature more carefully. It is great and make me think the ipython will make a python programmer or someone uses python for data analysis much more productive. (I used to envy the “RStudio” in the R-lang land, now, we python programmer finally have something more competitive.)

The cool thing using a web page as front-end is there are a lot potential using web interface for some cool visualization. I played with protovis.js a while ago. Recently, I went to a visualization meets-up, d3.js was mentioned a numbers of time. Then the idea comes to my mind “is it possible to combine the best of two world, python and d3.js?” After consulting some more experience users in the ipython-dev mailing list to see what is possible, I decided to spend some of my weekend time to hack it around. In the meantime, I get the chance to play with tornado, zero-mq and websocket, all the fun stuff these days. At the end, I am able to pass some javascript code written within the ipython notebook to get the browser to execute it and show some animation with d3.js. This will enable to create more fancier visualization in an interactive way all in a browser.

My weekend hacking results are hosted at github . I think there is a great potential to make thing like this working better. (For example, can we have a pythonic backend of d3.js? :) ) It definitely worth to mess it around to see more use like this.

Some thought on the interview puzzles from a dot-com/DNA sequencing data processing company

Run into this earlier today. If you are interested in solving computational puzzles, I think they are good ones. It is tempting to write some real code to solve them but it is more interesting and important for me to spend my time analyzing some real data to solve real scientific puzzles this weekend. Anyway, to think and find the clues to the answers is straightforward, especially during and after my morning shower time.

Regardless the verbose description of the questions, the route to solve question is quite straightforward more or less. A real implementation might be slightly more complicated. The following are my tips on “how to solve” these puzzles. (I could be giving wrong answers/tips. If you want a job from DNAnexus, you are on your own.)

(1) Insane in the Membrane

Obviously, the question has nothing really to do with any biological membrane. It is actually more related to “maze solving algorithm” or “finding shortest path” in a graph. One way to solve the problem is to convert the lattice to a graph where each node in the graph represents each empty space (“o”). The edges connect all nodes satisfying the constrain “Each successive position is only 1 nm away from the one before it”. You can start the search with a seed node that only has an edge. Then, using the standard BFS algorithm to find the longest path that connect to the seed node. If there is no node with single edge attached, one can pick any node that is not visited during the graph search as the new seed node. If you find one path that is longer the “Danny Dendrite’s genome”, output the solution. If not, try other seed node until you test all seed nodes. If you can not find any path from any seed node that is longer than the “Danny Dendrite’s genome”, output “impossible”.

(2) Hungry Hungry Coders

Think the “enjoyment values” as a matrix of M by N. What you try to do to find maximum sum when picking one single element per row but non of the element can have a share column. One obvious initial state is to pick the maximum value for each row. If there is no overlapping column, Done. If such assignment is not possible, one need to find potential other assignment which minimize the reduction of the sum. I have not to encounter such problem in my work or research yet, although I can see such algorithm is super-useful on solving practical resource allocation. A quick search shows the “correct” solution is probably the “Hungarian algorithm“. There are several different variants to solve such problem. It would be interested to know which one is mostly efficient especially in the case that there might be a lot of degenerated solutions. Also, it might be interesting to see whether it really works sociologically asking engineers to writing down 10 numbers on 10 menu items for every lunch. It seems quite a way to waste of time. :)

(3) Genome Search

The major constrain of the problem is “You must not load the entire genome into memory; furthermore, you may read through the genome sequence only once.” Namely, you have to stream the reference genome and doing sequence comparison at the same time. Like all good exact sequencing match problem, using hash values is the way to go. For streaming approach, I think the answer is in some sort of rolling hash. One can calculate the hash values of the “K sequences, each of length M”, into K hash values with one of the rolling hash algorithms. Then, calculate the rolling hash values of the 670G bps sequence with the various lengths of the K sequences while streaming the 670G bps sequence file through the memory. If there is a matched hash value, double check the strings do match and the matched hash value is not due to collision. By the way, good luck on sequencing and doing assembly on the 670 Gbps potentially highly repetitive genome.

Oh, well, while it could be fun to implement these algorithms to see that I indeed get the “correct” answer, I would be more interested to see how they are used in solving real genomics problems. While doing exact match string in a smart way is cool, to build a computing infrastructure to be able to do not-exact matches over and over again is way more useful, e.g., NCBI’s blast server. I assume that is one of what DNAnexus’ directions in the future. I do hope there is indeed a good HPC computation platform to help the scientific communities to solve large data / large analysis problems. In the meantime, I do also believe that the innovation on the DNA detection/sequencing technologies remains one of the most important parts driving biological/medical research to solve important problems. Maybe large data analysis with well-known analytics is only part of the equation. Using data and good analytics to solve basic technology, chemistry, signal processing problems, and to understand the nature of different kind of DNA sequence data is really fun and important.

Hacking and Research

I read Richard Stallman’s story about how he was able to fix a printing problem in the MIT AI lab long time ago. The main point is that if you know how to modify a device or an instrument then you have the ability to extend the functionality well and you can make things more useful. In order to do so, you will have to understand the underlying mechanism about how a thing works. In the old days before any computer code is popular, one might learn how a machine works by just looking how it works if one has good amount domain knowledge. However, the universal applications using computer program to control machines makes modification harder. Computer is a wonderful tool to process complicated information and procedures but it also allows concealing about how a device or an instrument works. When there is no source code or some simple API document allowing you to get some understand how the computer controlling a device or an instrument, then some reverse engineering is un-avoidable. This is my story about how I learn to “hack” a machine in order to do some interesting science.

I got a Ph.D in theoretical physics. My thesis was about stochastic fluctuations in non-equilibrium system. My major interests were to use some statistical mechanics framework to explain interesting natural phenomena. What kind of non-equilibrium system can be more interesting than life itself? I started working as a post-doc in UCSF after graduate school. Initially, I mostly focused on some biological network analysis of protein-protein interactions and gene regulations in the beginning. Trained as a theorist with some computational skills, my initial research plan followed my advisor’s paradigm, which was to gather data in public domain and did analysis on the data to publish new analysis results. However, while there are a few successful cases, such approach is in general not favored in the biological research field in my opinion. First of all, unless you know all the details of the initial experiment that generates the data, it is very likely one might not have enough information to interpret the new analysis results or the interpretation might be simply wrong. Secondly, the researchers who generate the data should have done some “first-order” analysis that reveals most significant information with straightforward analysis with the data set. Most low hanging fruits should be presented in the initial study. Secondly analysis on such data set is just harder and may not have the proportional impact as the initial analysis. Moreover, the biological research field generally values original experiments more than fancy theoretical analysis, although there is a new trend after more and more quantitative emphasis on biological study recently. These lines of thought made me to decide I would need to do my own experiments to gather my own data. But, how?

My advisor in UCSF was also a theoretical physicist who became a bioinformatist and built a career in UCSF providing his quantitative skill to the research community in UCSF. However, as far as I could tell, he had very limited experience in experimental sciences. That has been said, the setup in the new UCSF mission campus was indeed great for me who only had mostly theoretical background to learn new tricks in experimental biological sciences. Our lab was surrounded by some of the best molecule biology labs in the country. I started to learn some basic molecule techniques from the post-docs and graduate students in those labs. I still remember some generous graduate students and post-docs spent their time to show me how to do the basic like how to do precision pipetting. My first few attempts to do simple PCR failed miserably because I was not really able to pipet the 1 ul enzyme precisely for the reaction. Sometimes, I felt so embarrassing since I was not even able to calculate the amount regents for PCR and carried the reaction out correctly even though I though I had no problem to attack some difficult and highly mathematical problems in physics if I wanted to. The learning curve was indeed quite steep, however, I had some other advantage: trained as a physicist, I knew some of the fundamental principles on how certain instruments worked. Moreover, it was more or less easy for me to understand how software could be written and how the software controlling the instruments worked. These skills allowed me to do something interesting and went beyond to do some experiments that the vendor could not support.

When I was analyzing some of the time course microarray data, one question we asked was about how to explain certain pattern in the time course. Were there some fundamental and new principles in governing the dynamics of gene expression? Although microarray allowed monitoring a large number of gene expression at the mRNA level at once, the resolution in time and expression level was quite rough. Could I get a better measuring system for understanding gene regulation dynamics with much better time resolution and expression level quantitation? At that time, GFP-fused proteins were used to measure gene expression at the protein level. Two labs in UCSF worked hard to tag several thousands of proteins in S. cerevisiae with GFP and study their static expression under several conditions and the localization of the proteins in a cell. The high throughput measurements of protein expression were done by using a flow cytometer. A flow cytometer uses a flow stream to separate individual cell such that high throughput optical measurement for each individual cells can be done. Although you can do tons of single cell measurement at once, the typically experiment setup is that you still need to collect cells, put them into a tube, put the tube into the flow cytometer and click a number of buttons on the control software to get the measurement done. I did try to take time course this way manually. For a simple low time resolution experiment, like measurement every 20 minute for several hours, I will have run between the incubator of my cell culture and the instrument room of the flow cytometer all the time. I will have to check the cell density and wash the cell and the flow cytometer intake every 20 minutes. Well, to do large-scale experiments with this way was just probably not my style. I needed to find a better way to do it.

The previous study using the same flow cytometer that I was using was indeed done with a robotic setup. The interesting part was that the robotic system was synchronized to the flow cytometer software by simulating UI events to trigger the measurement software to start and stop recording. The flow cytometer vendor did not support such feature yet. I think it was Joe deRisi figured out how to do that and wrote some program that controlled a sample-delivering robot and trigger the measurement software automatically. Instead of the robotic sample delivering system, I was also setting up a different sampling delivering system that used a bioreactor to deliver live culture continuously to the flow cytometer. Even though I could using same UI simulation hack that Joe did, I was also trying to find some other to collect the data more efficiently since pulling the data out of the instrument software was also quite painful to do. It would be nice I could have controlling system and data collection system in one single program that I had more or less full control on the control and data flow. Joe did notice that the vendor’s software running on a PC talked to the embedding system on the instrument through a standard network interface. Basically, the instrument was controlling by an embedding system and data was collected by the same system. The software running on a PC was merely a frontend and a client. If we could figure out the data stream between the PC and the embedding system, we could do much more than the vendor’s software could do. By the way, we did try to talk with the vendor to see whether we could get some documentation about it. Well, as usual, the vendor was quite reluctant to reveal more. (Now, I am on the vendor side. I probably understand their concern better. But, we did not ask too much. What we needed was some simple description about the communication protocol. It had to be already somewhere in their software organization. Otherwise, software engineer could not write the code.)

I did not remember why I knew how to sniff the network communication. I probably learnt that because I installed some on-the-edge linux software that did not work after installation and I needed to do some tcpdump or so to see what was going on to fix the problem as a hobby. So, I installed “ethereal” (it is now called Wireshark) and start to dump the TCP/IP data stream between the vendor’s software on the PC and the embedding system. Of course, I was hopping they simply used some clear text protocol. Unfortunately, they were using some sort of binary protocol. However, when I was staring at those hex numbers of the output from an “ethereal” session. I did notice some regularity. This means, at least, they did not encrypt the data. Great, at least, I could start to guess what those hex numbers represented. Some portion of the data stream had some regularity. The numbers seems repeating similar pattern every 4 bytes. What could they be? Well, there were a lot of things in computer represented with 32 bits. On the other hand, the most common one for such instrument was probably some floating numbers. Yes, I took those 4 byte chucks and figured out the right byte orders, treated them as IEEE float number format, then I wrote some short code to translate them. The number I got totally make sense!! I could read those numbers and understand what they were. Moreover, I could start to tell which parts are control commands, and which parts are data. The rest of that week, I was running between the lab doing some experiments, collected more TCP/IP data streams under different conditions and went to a hospital to stay with my wife who was expecting a new baby in the antepartum department of hospital. I took my laptop with me and worked on deciphering the data and code while my wife was resting.

Eventually, I got to a point that I could decipher and read the data stream to get everything I needed without using the vendor’s controlling software. The next thing is to combine such knowledge to write my own software (GUI application for control and data processing) to bypass the control application from the vendor. I used the same old version Borland C/C++ to build a GUI interface under Windows such that other people can use it too. (If I were the only user, some scripting through a command line interface was probably good enough.) The results were much useful than the original software that the vendor provides. We could combine the data acquisition and an external robot seamlessly to make the measurement more reliable. Actually, there were a few glitches here and there initially since I was almost the sole developer for the project and there were no comprehensive testing and user case studies. Most of them got fixed and could be fixed easily. However, I was always afraid something might be broken after a vendor software/firmware upgrade. It is possible that the vendor can totally do a firmware upgrade that changing the protocol that I have reversed engineered. If so, I will need to restart the whole process.

There were also some side-effect after people can really see the “raw data”. The flowcytometer integrated the electric signals from photo multiplier tubes to get total fluorescence. Anyone who had some experience in such numerical integration method would not be surprised with negative values. My code faithfully reported some of the negative values reported by the firmware of the machine. However, it caused me some trouble. I could not remember how many times I needed to explain the meaning of such negative values to my colleagues who did not know how such machine and algorithm worked.

To be able to control the acquisition for the flow cytometer is an important step stone for what I wanted to do: monitoring GFP expression of a few proteins in real time. I wrote a few simple program with few simple parts (a chemostats as an incubator and a syringe pump to deliver the sample), I was able to see the time course of gene induction by staving the yeast cells for some essential ingredient. I though it was a great system to really measure how gene circuits response to environmental perturbation to get a better understanding and modeling about gene regulations. We wrote a paper using the system to study a particular amino acid synthesis pathway. Unfortunately, due to the hacky nature of the code and lack of general instrumentation mind set in biological research field. We might be the only group that was doing such thing. Although we can share our code, it will be much easier if the vendor opens their protocols such that we can make sure the code will work across different machines. If so, there might be much more other researchers who will like to use such powerful tools to study gene expression in real time.

Since then, I proudly quitted my postdoc job, and got “real job” in few bio-tech companies working on mostly on data analysis, software and algorithm development. I am hoping that the company I am working for now or in the further will embrace openness and I like to see people hacking a machine beyond its original purpose to generate interesting scientific research results.