phdfs.py, a ctypes wrapper of Hadoop libhdfs for Python
May 10th, 2008

I use Python and the Hadoop Distributed File System (HDFS) to process large amounts of data at work. Instead of using the regular map-reduce mechanism provided by Hadoop, I have a home-made map-reduce engine written in Python using Pyro. It turns out to be quite efficient, and for some simple map-reduce jobs it is much faster than the corresponding Hadoop streaming code. For this kind of work, I read files in HDFS by running “hadoop fs -cat” through a Unix pipe (popen) in Python. It seemed to me that it would be useful to bypass this somewhat ugly combination of a Unix pipe and “hadoop fs -cat”. There is already a SWIG-based Python wrapper for HDFS. However, I think it is nicer to have a ctypes wrapper, so that no extra compilation is needed for installation. I spent a few nights working on such a wrapper and hope it will be useful. The result is a single Python module that I call “phdfs”. It provides most of the API in libhdfs. It should be useful if you want to read, write, and manipulate the Hadoop filesystem with the flexible and powerful Python syntax.
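For reference, the pipe-based approach described above looks roughly like the following. This is only a sketch and not code from my engine: the function name `hdfs_cat_lines` and the `hadoop_cmd` parameter are made up for illustration, and it assumes the `hadoop` command is on the PATH.

```python
import subprocess


def hdfs_cat_lines(path, hadoop_cmd=("hadoop", "fs", "-cat")):
    """Yield the lines of an HDFS file streamed through a Unix pipe.

    hadoop_cmd is a hypothetical parameter: the command prefix that
    gets the file path appended as its last argument.
    """
    proc = subprocess.Popen(list(hadoop_cmd) + [path], stdout=subprocess.PIPE)
    try:
        # Iterate over the pipe so the whole file is never held in memory.
        for raw in proc.stdout:
            yield raw.decode("utf-8").rstrip("\r\n")
    finally:
        # Close the pipe and reap the child process even on early exit.
        proc.stdout.close()
        proc.wait()
```

Every call starts a new `hadoop` process (and with it a JVM), which is part of why this combination feels clumsy for many small reads.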
You can download phdfs.py and try it out yourself. I have not tested all the methods, so YMMV.
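To give a flavor of what a ctypes binding to libhdfs involves, here is a minimal sketch. It is not the actual phdfs.py code: the helper names `declare_signatures` and `read_file` are hypothetical, and the type aliases and function signatures are my reading of the libhdfs C header (hdfs.h), so check them against the header shipped with your Hadoop version.

```python
import ctypes
import os

# Type aliases mirroring hdfs.h (assumed from the libhdfs C header).
tSize = ctypes.c_int32
tPort = ctypes.c_uint16
hdfsFS = ctypes.c_void_p
hdfsFile = ctypes.c_void_p


def declare_signatures(lib):
    """Attach argument and return types to a few libhdfs functions."""
    lib.hdfsConnect.argtypes = [ctypes.c_char_p, tPort]
    lib.hdfsConnect.restype = hdfsFS
    lib.hdfsOpenFile.argtypes = [hdfsFS, ctypes.c_char_p, ctypes.c_int,
                                 ctypes.c_int, ctypes.c_short, tSize]
    lib.hdfsOpenFile.restype = hdfsFile
    lib.hdfsRead.argtypes = [hdfsFS, hdfsFile, ctypes.c_void_p, tSize]
    lib.hdfsRead.restype = tSize
    lib.hdfsCloseFile.argtypes = [hdfsFS, hdfsFile]
    lib.hdfsCloseFile.restype = ctypes.c_int
    lib.hdfsDisconnect.argtypes = [hdfsFS]
    lib.hdfsDisconnect.restype = ctypes.c_int
    return lib


def read_file(lib, host, port, path, chunk_size=65536):
    """Read a whole HDFS file into a bytes object, chunk by chunk."""
    fs = lib.hdfsConnect(host.encode(), port)
    if not fs:
        raise IOError("could not connect to %s:%d" % (host, port))
    try:
        # Zeros for buffer size, replication and block size ask for defaults.
        handle = lib.hdfsOpenFile(fs, path.encode(), os.O_RDONLY, 0, 0, 0)
        if not handle:
            raise IOError("could not open %s" % path)
        try:
            buf = ctypes.create_string_buffer(chunk_size)
            chunks = []
            while True:
                n = lib.hdfsRead(fs, handle, buf, chunk_size)
                if n < 0:
                    raise IOError("read failed on %s" % path)
                if n == 0:  # end of file
                    break
                chunks.append(buf.raw[:n])
            return b"".join(chunks)
        finally:
            lib.hdfsCloseFile(fs, handle)
    finally:
        lib.hdfsDisconnect(fs)
```

With a real installation, usage would be something like `lib = declare_signatures(ctypes.CDLL("libhdfs.so"))` followed by `read_file(lib, "localhost", 9000, "/some/file")`. The library name, host, port, and path here are placeholders, and libhdfs also needs the Hadoop jars on the CLASSPATH because it starts a JVM internally.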





