phdfs.py, a ctypes wrapper of hadoop libhdfs for python

May 10th, 2008

I use python and the hadoop distributed file system (HDFS) to process large amounts of data at work. Instead of using the regular map-reduce mechanism provided by hadoop, I have a home-made map-reduce engine written in python using Pyro. It turns out to be quite efficient, and for some simple map-reduce work it is sometimes much faster than the corresponding streaming code. For this kind of work, I access files in HDFS by running “hadoop fs -cat” through a unix pipe (popen) in python. It seemed to me that it might be useful to bypass the somewhat ugly combination of a unix pipe and “hadoop fs -cat”. There is already a SWIG wrapper of hdfs for python. However, I think it is nice to have a ctypes wrapper so that no extra compiling is necessary for installation. I spent a few nights working on such a wrapper and hope it will be useful. The result is a single python module that I call “phdfs”. It provides most of the API in libhdfs. It should be useful if you want to read, write and manipulate the hadoop filesystem with the flexible and powerful python syntax.
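
For comparison, the pipe-based approach that phdfs is meant to replace can be sketched roughly as follows. This is only a minimal sketch and not code from phdfs.py: the helper name hdfs_cat_lines and its cat_command parameter are made up for illustration, and it assumes the hadoop command is on the PATH. It is written for a current python 3.

```python
import subprocess

def hdfs_cat_lines(path, cat_command=("hadoop", "fs", "-cat")):
    """Yield the lines of `path` as streamed by `cat_command`.

    `cat_command` defaults to the "hadoop fs -cat" call; it is a parameter
    (hypothetical, for illustration) so any cat-like command can be used.
    """
    proc = subprocess.Popen(list(cat_command) + [path], stdout=subprocess.PIPE)
    try:
        for raw in proc.stdout:
            # Each item is one line of bytes read from the pipe.
            yield raw.decode("utf-8").rstrip("\r\n")
    finally:
        # Always release the pipe and reap the child process.
        proc.stdout.close()
        proc.wait()
```

Every call starts a new process and pushes all the data through a pipe, which is the overhead a direct wrapper around libhdfs avoids.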

You can download phdfs.py and try it out yourself. I have not tested all the methods, so YMMV.

Postdocs, “Not Exactly Students, Not Exactly Employees, What are you?”

May 3rd, 2008

My neighbor showed me this article from the East Bay Express. Those stories sound very familiar. My personal feeling is that this academic system should be fixed soon. The academic community should give more recognition to postdocs.

As a postdoc, you don’t get the benefits that students get. You are not considered a formal employee. You don’t get any benefits and you are paid little in the name of science. I still remember how absurd it felt when I was told I could not pay my monthly parking fee by automatic deduction from my paycheck, because I was a “temporary worker” at the school where I had been working for a few years.

Well, I can not say that my career has not benefited from my postdoc research. But I can not say I totally enjoyed being treated by the school as a “temporary worker” for an indefinite amount of time. One should treat the real “workhorses” of the academic research industry a little better. Without these workhorses, there would be no “super-stars” in the research communities. Anyway, there is not much point in my complaining anymore. Industrial R&D can be fun too.

A remarkable piece of writing, shared for all to enjoy

April 18th, 2008

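Today, in the twenty-first century, an editorial in the overseas edition of a major Taiwanese newspaper group contained the following sentences:

“In this ‘century-old establishment’, the 58-year-old Ma Ying-jeou is a new star in the prime of his life, like the sun at high noon.”

“He led the Kuomintang back to its feet after its fall, rallied the people of Taiwan, especially the younger generation, and brought an end to the pro-independence regime. He is the very model of ‘youth creating the era’.”

“Einstein's ‘theory of relativity’ reshaped a century of scientific mysteries, yet the way of Confucius and Mencius has influenced society and people's hearts for more than a thousand years, and even the followers of Marx have no choice but to believe in it.”

“Encouraging young people to learn from Ma Ying-jeou is by no means ‘idol worship’, still less an attempt to create a ‘new god’. It is simply drawing an example from close at hand, using facts that everyone can see, to urge the young generation that will carry the past into the future to forge and temper themselves well, and to live up to the expectation of ‘youth creating the era’.”

Having lived abroad for a long time, I do not know much about Ma Ying-jeou, and I have no opinion of him. But after reading this editorial, I could not help recalling the era when even elementary school compositions had to end with “rescue our compatriots on the mainland from their misery” and “unify China under the Three Principles of the People”. Maybe, just maybe, a certain sage ruler can complete the great cause of defeating communism and restoring the nation without kowtowing to dictatorship. Then there would be no need to go and pay respects every year.

Disclaimer: when I was young and ignorant, I probably wrote plenty of remarkable pieces myself, to pass exams or to wangle an official day off, but that was the shared historical karma of the last century!

One of the original sources of the piece

Where I first came across the piece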

My one day trip to Lugradio, San Francisco, 2008

April 12th, 2008

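I heard from PingYeh that Lugradio was being held in San Francisco this weekend. On a whim, I decided to ask my wife and daughter for a day off to go and join the fun.

Although I installed Linux with kernel version 0.97 back in 1993 or 1994 and then used nothing but Linux for several years, this was the first time I had attended a Linux / open source community event. By the time open source activities took off in Taiwan, I was no longer there, and after coming to the US, between my studies and plain laziness, I never looked for fun events to attend. So joining the crowd this time felt quite fresh to me.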


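I arrived at the venue a little after eleven, paid ten US dollars on the spot, registered, picked up my name badge and a small bag of sponsor goodies, and went in to look around. The venue was CITY VIEW, on the top floor of the Metreon theater complex in San Francisco. In all my earlier visits to the Metreon I had never heard that there was a place there that could host events and had a decent city view as well. Three talks ran at the same time, so you could pick whichever interested you most. If you did not feel like listening, you could wander through the vendor exhibits. I sat in on a few talks more or less at random: someone from Second Life talked about their open source strategy, someone from Bungee Connect demonstrated their development platform, someone from VMWare showed virtual machine streaming, and Aza Raskin of Humanized discussed user interfaces, among others. Most were quite interesting, but there was not much technically deep discussion; most of it stayed at a rather abstract level. That was fine, though. Judging from the questions from the audience, many of the attendees cared a great deal about the development of open source.

As for the vendor exhibits, I went around collecting quite a few live CDs from Linux vendors. Among all the exhibits, though, the most interesting to me were two hardware vendors. One was TI's beagleboard, a single-chip computer that runs Linux. It looks like once TI releases it in June, I may not be able to resist spending a little money to buy one to play with.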


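The other interesting thing was that I finally saw the legendary OLPC. It really is a cute thing that you cannot help playing with. Unfortunately, this fun laptop was for looking only.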


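The last talk today ran until five. I had planned to skip the evening party and was about to head home. Before I left the venue, a software engineer with a great interest in amateur biotech, on learning that I work at a biotech company, enthusiastically discussed with me whether there were any biotech projects one could do at home, and only let me go home after an hour of chatting.

Lugradio has another full day of activities tomorrow, but I have other things to do and cannot go. Still, today's day trip was very rewarding. I unexpectedly came away with a lot of inspiration that I do not get at work or at home, and a San Francisco city view that is not easy to come by!!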

ad$ense or ad$pam?

April 2nd, 2008


I wish I had a little bit more virtual memory so I could convert virtual money into the real kind.

Using Safari on iPhone to read CHM file

September 23rd, 2007

The iPhone is a fancy toy with a lot of power, but Apple deliberately locks away much of its potential. One thing I would like to do on an iPhone is read CHM files. As a weekend project, I set up the tool chain for the iPhone following the instructions. Then I grabbed the source code of chmlib. With some minor modifications, I was able to compile chmlib as an iPhone binary library. That was very encouraging.

This provides a convenient way to turn the iPhone into a CHM reader. The chmlib source distribution includes an example program that runs as an http server and serves the content of a CHM file as standard web pages. “mobileSafari” has no problem rendering the results, but the fonts are usually too small to read and the text is typically rendered too wide, so a lot of horizontal scrolling becomes annoyingly necessary.

I decided to combine some python code with the chm_http server from the chmlib source code. I modified the source code of chm_http so that it can call python code to modify the HTML in the CHM file, replacing the original CSS with new settings for reading on a small screen. Furthermore, I found it tedious to start chm_http from a terminal every time I wanted to read a different book, so I wrote another small python script that scans a directory, finds all the CHM files in it and outputs an index html page. In the end, I was able to point mobileSafari at the index page and select the book I wanted to read. The “chm_http” server would then start automatically to serve that book.

If you are interested in reading CHM files on your iPhone, get this iphoneCHM.tgz (the file will be uploaded soon). Copy “chm_http2”, “rewriteHTML.py”, and “CHMServer” to “/usr/local/bin/” on your iPhone. Change the permissions of these files so that you can run all of them. Put some chm files in /var/root/Media/CHM_Ebooks/. Open a terminal on the iPhone, or ssh into the iPhone, and run “CHMServer”. After that, point Safari at this URL: http://127.0.0.1:8000. You should see the links to the CHM files. You can now click on any of them and enjoy a nice reading time.
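
The directory-scanning step could look something like the sketch below. It is not the script shipped in iphoneCHM.tgz: the function name build_index_html, the link format and the styling are all assumptions made for illustration, and it is written for a current python 3 rather than the python available on the phone at the time.

```python
import html
import os
from urllib.parse import quote

def build_index_html(directory):
    """Return an HTML page linking to every .chm file found in `directory`."""
    names = sorted(n for n in os.listdir(directory) if n.lower().endswith(".chm"))
    # The href format ("/" + file name) is an assumption; a real server
    # would need to map it to "open this book".
    items = "\n".join(
        '<li><a href="/{0}">{1}</a></li>'.format(quote(n), html.escape(n))
        for n in names
    )
    # A device-width viewport and a large base font keep the list readable
    # on a small screen.
    return (
        "<html><head>"
        '<meta name="viewport" content="width=device-width">'
        "<style>body { font-size: 20px; }</style>"
        "</head><body><ul>\n" + items + "\n</ul></body></html>"
    )
```

Calling build_index_html("/var/root/Media/CHM_Ebooks/") would then give the page that the browser opens first.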


Tomorrow

July 22nd, 2007

Tomorrow, a new day with new challenges! What a great feeling to be entering the next stage of my career! Although I had to make a tough decision, it is indeed time to move on. I am really feeling the excitement of a new environment and a new career path now.

Writing a Chinese input method with python+OV

July 18th, 2007

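A while ago I saw that lukhnos was writing a module that lets OV use Ruby to write filters, and on a whim I wanted to see how to write an OV filter module in python. I spent some time studying how to embed python in C/C++. Although I have been using python for research programming for quite a while, and have used SWIG to control physics simulation programs written in C/C++, this was my first time actually calling python from C/C++. I spent some time writing a prototype and, with lukhnos's help, finished a module that lets OV filters be written in python (in the OV svn repository: Modules/OVOFPythonBased/).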

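An OV filter mainly calls a method named process. What gets passed to process is just a string, so implementing an OV filter is not too hard. During development, most of the time went into reading the python C API documentation and getting familiar with how to create python objects in C/C++ and pass arguments to python methods.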
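
As a rough illustration of how little the python side has to do, a filter might be as small as the sketch below. Only the function name process comes from the description above; the assumption that it receives one string and returns the filtered string, and the punctuation table itself, are made up for this example (written for python 3).

```python
# Hypothetical filter module: the host is assumed to call process(text)
# with the text to be filtered and to use the string that is returned.
PUNCTUATION = {",": ",", ".": "。", "?": "?", "!": "!"}

def process(text):
    """Replace half-width punctuation with its full-width counterpart."""
    return "".join(PUNCTUATION.get(ch, ch) for ch in text)
```

For example, process("你好, OV!") gives "你好, OV!".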

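Letting OV's filter mechanism be implemented in a modern language such as python or ruby is only the first step. The convenience of a dynamic language already greatly simplifies developing new filters. So the next step is to see whether OV input methods can be written in Python or Ruby. When experimenting in the future with more complex natural-language-processing input method modules, such as something like Chewing (酷音), being able to write the input method in Python or Ruby should help a lot.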

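When writing a python based OV filter, you only need to define the python method/function that corresponds to process. The python side is completely passive: the python code does not need to care about the C/C++ classes and instances, and only has to implement a function named process. But when writing an OV input method module, several OV objects must be passed into Python, and Python should also be able to subclass the OV classes to keep the OV API interface consistent. Basically the following things need to be done:

(1) Use SWIG to turn the OV classes into Python classes.
(2) Define OV C/C++ classes that correspond to the Python classes.
(3) The most important part of (2) is to convert the instance pointers in OV C/C++ into objects that Python can recognize.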

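Of these three tasks, the easiest is (1). Basically you just cut and paste Framework/Headers/OpenVanilla.h into the SWIG interface file. The only thing to watch out for is telling SWIG to turn the C++ classes into python classes, which requires SWIG's directors. See the relevant SWIG documentation and OVIMPython.i in Modules/OVIMPython/.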

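Next, OV C/C++ has to be made aware of Python. This mainly means subclassing two OV C/C++ classes, OVInputMethodContext and OVInputMethod, so that the OV loader can call the corresponding python objects. See the OVIMPythonBasedContext and OVIMPythonBased classes in Modules/OVIMPython/OVIMPythonBased.cpp. What these two wrapper classes do is simple: they implement C++ methods that call the corresponding Python instance methods. One obstacle, though, is that most of the parameters of the methods in OVInputMethodContext and OVInputMethod are pointers to C++ instances. How to turn those instances into python objects and pass them to python is a rather tricky problem, and it involves the details of how SWIG maps C++ objects to python objects.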

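Perhaps SWIG has a matching solution, but I did not see an obvious way to solve this in the SWIG documentation. I eventually found a hint while studying the python module file that SWIG generates. For every C++ class to be wrapped, SWIG generates two corresponding python classes. For example, given the following declaration in C++: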

class OVKeyCode : public OVBase  {
public:
    virtual int code()=0;
};

SWIG creates the following two python classes:

class OVKeyCode(OVBase):
   ...

class OVKeyCodePtr(OVKeyCode):
   ...

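The constructor of OVKeyCodePtr ( __init__() in python ) can build the corresponding python object from the python pointer object that SWIG uses to represent a C/C++ pointer. So the next thing to do is convert the C/C++ pointer into a pointer object in python. SWIG's approach is simply to encode the address and type of the C/C++ pointer as a specially formatted string. The C/C++ wrapper file that SWIG generates contains a special function ( char *SWIG_PackData(char *c, void *ptr, int sz) ) that converts a C/C++ pointer into a string. We can therefore use this function to convert the C/C++ pointer into the corresponding python string, and then use the SWIG-generated aClassPtr to create an instance of aClass in python, whose implementation is the corresponding C/C++ implementation.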
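
To make the “pointer as a string” idea concrete, here is a pure-python illustration (python 3). It is not SWIG code: the helper names pack_pointer and unpack_pointer are made up, and the layout used here (an underscore, the hex dump of the pointer bytes, then "_p_" and the type name) is an assumption modeled on how such strings commonly look, which may differ between SWIG versions.

```python
import struct

def pack_pointer(address, type_name):
    """Encode an address and a type name as '_<hex bytes>_p_<type>'."""
    raw = struct.pack("P", address)      # native, pointer-sized byte string
    return "_" + raw.hex() + "_p_" + type_name

def unpack_pointer(text):
    """Recover (address, type_name) from a string made by pack_pointer."""
    size = struct.calcsize("P")          # pointer size on this machine
    hex_part = text[1:1 + 2 * size]      # two hex digits per pointer byte
    type_part = text[1 + 2 * size:]
    (address,) = struct.unpack("P", bytes.fromhex(hex_part))
    return address, type_part[len("_p_"):]
```

The string carries no object data at all, only where the C++ instance lives and what type to treat it as, which is why the python instance built from it ends up calling straight into the C/C++ implementation.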

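This mapping is really a bit too complicated and unintuitive. I have not yet read the SWIG documentation in detail, so I do not know whether there is a more elegant way to do the same thing. Even so, someone who wants to write an input method in python can completely ignore the wrapper itself and the complexity of mapping objects between the two languages, and concentrate on writing the input method in python. lukhnos's earlier post, 用Python + OpenVanilla寫輸入法 (“Writing an input method with Python + OpenVanilla”), has a minimum example of an OV input method in python.

I think this is only a first step, a small exercise for me to understand embedding python. If I have time, I will look into doing something really interesting with python in OV.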

Circular references in python

June 24th, 2007

The following code creates a circular reference in python:

>>> aRef = []
>>> aRef.append(aRef)
>>> print(aRef)
[[...]]

This creates a list object referred to by a variable named “aRef”. The first element in the list object is a reference to the list itself. Here, “del aRef” removes the reference from aRef to the list object. However, the reference count of the list object does not drop to zero, and the list is not freed by reference counting, since the list object still refers to itself. Instead, the garbage collector in Python periodically checks whether such circular references exist, and the interpreter collects them. The following is an example of manually collecting the space used by circularly referenced objects.

>>> import gc
>>> gc.collect()
0
>>> del aRef
>>> gc.collect()
1
>>> gc.collect()
0
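
The same behaviour can be observed from a script rather than by reading the counts in an interactive session. The sketch below is written for python 3 and uses a small class instead of a list, because a plain list cannot be the target of a weak reference; the names Node and probe are just for illustration.

```python
import gc
import weakref

class Node(object):
    """A tiny object that can hold a reference to itself."""
    def __init__(self):
        self.me = self          # the object now refers to itself

gc.disable()                    # keep the automatic collector out of the way
node = Node()
probe = weakref.ref(node)       # a weak reference does not keep node alive

del node                        # drops the name; the self-reference remains
alive_after_del = probe() is not None

found = gc.collect()            # run the cycle detector by hand
alive_after_collect = probe() is not None
gc.enable()

print(alive_after_del, found > 0, alive_after_collect)
```

The object survives “del” because of its own self-reference, and it only goes away once the cycle detector has run.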


Tokyo Tower at Night

April 22nd, 2007


The night view of Tokyo from the fifty-fifth floor of the Mori Tower in Roppongi.

Viacom vs. google

March 27th, 2007

Well, actually, I just want to test video embedding. This clip is pretty
funny anyway.

Demonstrating a “quickhull” implementation in javascript

March 13th, 2007

Quickhull is an algorithm, similar to quicksort, that uses a divide and conquer strategy to find the convex hull of scattered points. The green line in the plot is the initial base line and the gray lines are the intermediate base lines. The final convex hull is shown in red.
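
For readers who prefer to see the idea in a few lines, here is a minimal python sketch of the same divide and conquer scheme. It is not a port of qh.js; the function names are my own and points are assumed to be (x, y) tuples. The line from the leftmost to the rightmost point plays the role of the initial base line, and each recursive call works on one of the intermediate base lines.

```python
def cross(o, a, b):
    """Twice the signed area of triangle o-a-b; positive if b is left of o->a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def _hull_side(a, b, points):
    """Hull vertices strictly left of the base line a->b, ordered from a to b."""
    left = [p for p in points if cross(a, b, p) > 0]
    if not left:
        return []
    far = max(left, key=lambda p: cross(a, b, p))   # farthest from the base line
    # The two new base lines a->far and far->b replace a->b.
    return _hull_side(a, far, left) + [far] + _hull_side(far, b, left)

def quickhull(points):
    """Return the hull vertices in clockwise order from the leftmost point."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    a, b = pts[0], pts[-1]                          # initial base line
    return [a] + _hull_side(a, b, pts) + [b] + _hull_side(b, a, pts)
```

For the corners of a square plus its center, quickhull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]) returns [(0, 0), (0, 2), (2, 2), (2, 0)]: the interior point is discarded, just as points inside the red outline are in the plot.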



source code: qh.js


