I use Python and the Hadoop Distributed File System (HDFS) to process large amounts of data at work. Instead of the regular map-reduce mechanism provided by Hadoop, I use a home-made map-reduce engine written in Python with Pyro. It turns out to be quite efficient, and for some simple map-reduce jobs it is much faster than the corresponding Hadoop streaming code. For this kind of work, I read files in HDFS by running “hadoop fs -cat” through a Unix pipe (popen) in Python. It seemed useful to be able to bypass the somewhat ugly combination of a Unix pipe and “hadoop fs -cat”. There is already a SWIG-based Python wrapper for HDFS. However, I think it is nice to have a ctypes wrapper, so that no extra compilation is needed for installation. I spent a few nights working on such a wrapper and hope it will be useful. The result is a single Python module that I call “phdfs”. It provides most of the API in libhdfs. It should be useful if you want to read, write and manipulate the Hadoop filesystem with the flexible and powerful Python syntax.
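For readers who have not seen the pipe approach, here is a minimal sketch of reading an HDFS file through “hadoop fs -cat” from Python. It is not the code from my engine; the function name `iter_hdfs_lines` and the `cat_cmd` parameter are made up for this illustration, and it assumes the `hadoop` command is on your PATH.

```python
import subprocess


def iter_hdfs_lines(path, cat_cmd=("hadoop", "fs", "-cat")):
    """Yield the lines of `path` as streamed by `cat_cmd` through a pipe.

    `cat_cmd` is a hypothetical parameter added for this sketch so that the
    command can be swapped out (for example for plain `cat` on a local file).
    """
    cmd = list(cat_cmd) + [path]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        # Iterate over the pipe so the whole file is never held in memory.
        for raw in proc.stdout:
            yield raw.decode("utf-8", errors="replace").rstrip("\n")
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    # Only reached when the stream was read to the end.
    if returncode != 0:
        raise RuntimeError("%r exited with status %d" % (cmd, returncode))


# Example use (the path is a placeholder):
# for line in iter_hdfs_lines("/user/me/part-00000"):
#     fields = line.split("\t")
```

This works, but every file you open costs a new process (and a JVM start-up for the hadoop command), which is part of why the pipe feels clumsy.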
You can download phdfs.py and try it out yourself. I have not tested all the methods, so YMMV.
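To give a rough idea of what a ctypes binding to libhdfs involves, here is a small sketch. It is not the phdfs code and does not show the phdfs API; the helper names `declare_prototypes` and `read_file` are invented for this example. The C functions it calls (`hdfsConnect`, `hdfsOpenFile`, `hdfsRead`, `hdfsCloseFile`, `hdfsDisconnect`) are declared in libhdfs's `hdfs.h`, and the sketch treats the `hdfsFS` and `hdfsFile` handles as opaque pointers.

```python
import ctypes
import os


def declare_prototypes(lib):
    """Declare argument and return types for a few libhdfs functions."""
    vp = ctypes.c_void_p  # hdfsFS and hdfsFile are treated as opaque pointers
    lib.hdfsConnect.argtypes = [ctypes.c_char_p, ctypes.c_uint16]
    lib.hdfsConnect.restype = vp
    lib.hdfsOpenFile.argtypes = [vp, ctypes.c_char_p, ctypes.c_int,
                                 ctypes.c_int, ctypes.c_short, ctypes.c_int32]
    lib.hdfsOpenFile.restype = vp
    lib.hdfsRead.argtypes = [vp, vp, vp, ctypes.c_int32]
    lib.hdfsRead.restype = ctypes.c_int32
    lib.hdfsCloseFile.argtypes = [vp, vp]
    lib.hdfsCloseFile.restype = ctypes.c_int
    lib.hdfsDisconnect.argtypes = [vp]
    lib.hdfsDisconnect.restype = ctypes.c_int
    return lib


def read_file(lib, host, port, path, chunk_size=65536):
    """Read a whole HDFS file into a bytes object (hypothetical helper)."""
    fs = lib.hdfsConnect(host.encode(), port)
    if not fs:
        raise IOError("cannot connect to %s:%d" % (host, port))
    try:
        # Zeros ask libhdfs for its default buffer size, replication, block size.
        handle = lib.hdfsOpenFile(fs, path.encode(), os.O_RDONLY, 0, 0, 0)
        if not handle:
            raise IOError("cannot open %s" % path)
        try:
            buf = ctypes.create_string_buffer(chunk_size)
            chunks = []
            while True:
                n = lib.hdfsRead(fs, handle, buf, chunk_size)
                if n < 0:
                    raise IOError("read failed on %s" % path)
                if n == 0:  # end of file
                    break
                chunks.append(buf.raw[:n])
            return b"".join(chunks)
        finally:
            lib.hdfsCloseFile(fs, handle)
    finally:
        lib.hdfsDisconnect(fs)


# Example use; the library name, host, port and path are placeholders, and
# loading libhdfs normally also needs a JVM and the Hadoop CLASSPATH set up:
# lib = declare_prototypes(ctypes.CDLL("libhdfs.so"))
# data = read_file(lib, "namenode.example.com", 9000, "/user/me/part-00000")
```

The point of going through ctypes is that all of this is plain Python on top of the shared library that ships with Hadoop, so there is no extension module to compile.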
#1 by Erik Forsberg on January 29th, 2010
This looks very interesting, but unfortunately the link to phdfs.py is unavailable. I would appreciate it if that could be fixed!
#2 by Jason Chin on February 10th, 2010
I have fixed the link. I wish I had more time to work on this too…