Posted to common-user@hadoop.apache.org by Steve Sapovits <ss...@invitemedia.com> on 2008/02/21 22:23:05 UTC

Python access to HDFS

Are there any existing HDFS access packages out there for Python?

I've had some success using SWIG and the C HDFS code, as documented
here:

     http://www.stat.purdue.edu/~sguha/code.html

(halfway down the page) but it's slow adding support for some of the more
complex functions.  If there's anything out there I missed, I'd like to hear
about it.

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com



RE: Python access to HDFS

Posted by dhruba Borthakur <dh...@yahoo-inc.com>.
Hi Pete,

If you are referring to the ability to re-open a file and append to it,
then this feature is not in 0.16. Please see:
http://issues.apache.org/jira/browse/HADOOP-1700

Thanks,
dhruba

-----Original Message-----
From: Pete Wyckoff [mailto:pwyckoff@facebook.com] 
Sent: Thursday, February 21, 2008 4:09 PM
To: core-user@hadoop.apache.org
Subject: Re: Python access to HDFS


We're profiling and tuning read performance for fuse dfs and have writes
implemented, but I haven't been able to test them because I haven't tried
0.16 yet. Writing requires the ability to create the file, close it, and then
re-open it to start writing, which can't be done until 0.16.


--pete



On 2/21/08 3:50 PM, "Steve Sapovits" <ss...@invitemedia.com> wrote:

> Jeff Hammerbacher wrote:
> 
>> maybe the dfs could expose a thrift interface in future releases?
> 
> ThruDB exposes Lucene via Thrift but not the underlying HDFS.   I just
> need HDFS access in Python for now.
> 
>> you could also use the FUSE module to mount the dfs and just write to it
>> like any other filesystem...
> 
> Good point.  I'll check that avenue.  Would FUSE add much overhead for
> writing lots of data?   I see a Python binding for it.


Re: Python access to HDFS

Posted by Steve Sapovits <ss...@invitemedia.com>.
Roddy Lindsay wrote:

> I do it the old fashioned way:
> 
> (w, r) = os.popen2("%s/bin/hadoop dfs -cat %s" % (hadoop_home.rstrip('/'), filename))

I considered this but ultimately it probably won't scale for our data volume.

I'll probably continue building on the SWIG base since that's working pretty well
so far ... there's just the SWIG learning curve for complicated interface mappings.

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com


RE: Python access to HDFS

Posted by Roddy Lindsay <rl...@facebook.com>.
I do it the old fashioned way:

(w, r) = os.popen2("%s/bin/hadoop dfs -cat %s" % (hadoop_home.rstrip('/'), filename))
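
For anyone reading this in the archive on a newer Python: os.popen2 was
deprecated in Python 2.6 and removed in Python 3. A minimal sketch of the
same idea with the subprocess module might look like the following. The
helper names (hdfs_cat_command, hdfs_cat) and the chunk size are made up
for illustration; only the "bin/hadoop dfs -cat" command line comes from
the one-liner above.

```python
import subprocess


def hdfs_cat_command(hadoop_home, filename):
    """Build the argv for `hadoop dfs -cat` (hypothetical helper)."""
    return [hadoop_home.rstrip('/') + '/bin/hadoop', 'dfs', '-cat', filename]


def hdfs_cat(hadoop_home, filename, chunk_size=1 << 20):
    """Yield the file's contents as byte chunks read from the CLI's stdout."""
    proc = subprocess.Popen(
        hdfs_cat_command(hadoop_home, filename),
        stdout=subprocess.PIPE,
    )
    try:
        # Read fixed-size chunks until EOF so a large file is never held
        # in memory all at once.
        for chunk in iter(lambda: proc.stdout.read(chunk_size), b''):
            yield chunk
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError('hadoop dfs -cat exited with %d' % returncode)
```

Passing the command as a list avoids shell quoting problems with unusual
file names. Each call still starts a separate hadoop process, so this does
nothing for the scaling concern raised elsewhere in the thread.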






Re: Python access to HDFS

Posted by Pete Wyckoff <pw...@facebook.com>.
We're profiling and tuning read performance for fuse dfs and have writes
implemented, but I haven't been able to test them because I haven't tried
0.16 yet. Writing requires the ability to create the file, close it, and then
re-open it to start writing, which can't be done until 0.16.


--pete



On 2/21/08 3:50 PM, "Steve Sapovits" <ss...@invitemedia.com> wrote:

> Jeff Hammerbacher wrote:
> 
>> maybe the dfs could expose a thrift interface in future releases?
> 
> ThruDB exposes Lucene via Thrift but not the underlying HDFS.   I just
> need HDFS access in Python for now.
> 
>> you could also use the FUSE module to mount the dfs and just write to it
>> like any other filesystem...
> 
> Good point.  I'll check that avenue.  Would FUSE add much overhead for
> writing lots of data?   I see a Python binding for it.


Re: Python access to HDFS

Posted by Steve Sapovits <ss...@invitemedia.com>.
Jeff Hammerbacher wrote:

> maybe the dfs could expose a thrift interface in future releases?

ThruDB exposes Lucene via Thrift but not the underlying HDFS.   I just
need HDFS access in Python for now.

> you could also use the FUSE module to mount the dfs and just write to it
> like any other filesystem...

Good point.  I'll check that avenue.  Would FUSE add much overhead for
writing lots of data?   I see a Python binding for it.

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com

Re: Python access to HDFS

Posted by Jeff Hammerbacher <je...@gmail.com>.
maybe the dfs could expose a thrift interface in future releases?

you could also use the FUSE module to mount the dfs and just write to it
like any other filesystem...
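
To make the FUSE suggestion concrete: once the dfs is mounted, Python needs
no HDFS-specific package at all, only ordinary file calls. The sketch below
assumes a mount whose write path actually works (the replies in this thread
say fuse dfs writes were still untested at the time). The function name
copy_into_mount and any mount point such as /mnt/hdfs are hypothetical.

```python
import os
import shutil


def copy_into_mount(local_path, mount_point, relative_dest, chunk_size=1 << 20):
    """Copy a local file to a path under a mounted filesystem.

    mount_point is wherever the dfs is mounted (an assumption of this
    sketch); relative_dest is the destination path relative to it.
    """
    dest = os.path.join(mount_point, relative_dest)
    # Create any missing parent directories under the mount.
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    # Write sequentially in a single open/close; the file is never
    # re-opened or appended to.
    with open(local_path, 'rb') as src, open(dest, 'wb') as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dest
```

A call would look like copy_into_mount('part-0', '/mnt/hdfs', 'logs/part-0').
How much overhead the FUSE layer adds for large writes is exactly the open
question in this thread, so it would need to be measured.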

On Thu, Feb 21, 2008 at 1:23 PM, Steve Sapovits <ss...@invitemedia.com>
wrote:

>
> Are there any existing HDFS access packages out there for Python?
>
> I've had some success using SWIG and the C HDFS code, as documented
> here:
>
>     http://www.stat.purdue.edu/~sguha/code.html
>
> (halfway down the page) but it's slow adding support for some of the more
> complex functions.  If there's anything out there I missed, I'd like to hear
> about it.
>
> --
> Steve Sapovits
> Invite Media  -  http://www.invitemedia.com
> ssapovits@invitemedia.com
>
>
>