Posted to hdfs-user@hadoop.apache.org by Bjoern Schiessle <bj...@schiessle.org> on 2010/08/09 18:18:27 UTC

Best way to write files to hdfs (from a Python app)

Hi all,

I'm developing a web application with Django (Python) which should access an
HBase database and store large files to HDFS.

I wonder what is the best way to write files to HDFS from my Django app?
Basically I thought of two ways, but maybe you know a better option:

1. First store the file on the local file system and then move it to HDFS
via the Thrift interface. (downside: the web application server always
needs enough free disk space)

2. Use hdfs-fuse to mount the HDFS file system and write the file directly
to HDFS; a minimal sketch of this follows below. (downside: I don't know
how well hdfs-fuse is supported, and I'm not sure if it is a good idea to
mount the file system and run large operations on it).
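
For concreteness, here is a minimal sketch of option 2, assuming HDFS has
been mounted via hdfs-fuse at /mnt/hdfs (the mount point, function name and
paths are made up for illustration):

    # Stream a file-like object (e.g. a Django UploadedFile) onto the
    # FUSE-mounted HDFS path in chunks, so it never has to fit in memory.
    import shutil

    def save_to_mounted_hdfs(src_file, dest_path="/mnt/hdfs/uploads/data.bin"):
        with open(dest_path, "wb") as out:
            shutil.copyfileobj(src_file, out)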

Since I'm new to HDFS and Hadoop in general, I'm not sure which is the best
and least error-prone way.

What would be your recommendation?

Thanks a lot! 
Björn


Re: Best way to write files to hdfs (from a Python app) (problem solved)

Posted by Bjoern Schiessle <bj...@schiessle.org>.
Hi all,

I have solved the problem. It wasn't Hadoop but my network setup. I used my
laptop (via WLAN) as a client, and it seems the WLAN gateway has a firewall
which blocked the connection. After I connected my laptop by wire,
everything worked! :-)

best wishes & thanks a lot for all your help and useful mails!
Björn

-- 
Björn Schießle
Support Free Software, join FSFE's Fellowship (fellowship.fsfe.org)
Buy books and support Free Software (wiki.fsfe.org/SupportPrograms)

Re: Best way to write files to hdfs (from a Python app)

Posted by Bjoern Schiessle <bj...@schiessle.org>.
Hi,

I have read various mailing list archives and played around a bit with my
configuration. It seems others have had similar problems (remote access to
the namenode) in the past.

I'm now one step further. On both the Hadoop server and the client (which
doesn't run any Hadoop daemon) I have replaced the hostname with the actual
IP of the server.
Modified configuration files: core-site.xml, masters, slaves,
mapred-site.xml.

Now I can access the namenode and file system from the client with the
web interface. Also "telnet hadoopserver 9000" works.

But running "bin/hadoop fs -ls /" on the client still gives me:

10/08/12 14:08:11 INFO ipc.Client: Retrying connect to server: /129.69.216.55:9000. Already tried 0 time(s).
10/08/12 14:08:12 INFO ipc.Client: Retrying connect to server: /129.69.216.55:9000. Already tried 1 time(s).
10/08/12 14:08:13 INFO ipc.Client: Retrying connect to server: /129.69.216.55:9000. Already tried 2 time(s). 
...

This error doesn't generate any log messages. Is it possible to get more
verbose output for debugging?

Any idea what could be wrong?

Thanks a lot!
Björn

Re: Best way to write files to hdfs (from a Python app)

Posted by David Rosenstrauch <da...@darose.net>.
On 08/12/2010 08:01 AM, Bjoern Schiessle wrote:
> Hey Jeff,
>
> On Wed, 11 Aug 2010 10:40:29 -0700 Jeff Hammerbacher wrote:
>> You also mention that your app will be accessing data stored in HBase.
>> There's a Python client for the Avro HBase gateway at
>> http://github.com/hammer/pyhbase. If you try it out, let me know how it
>> goes.
>
> What's the difference between Avro and Thrift? Are there any specific
> reasons to prefer one over the other?
>
> I tried to find some documentation about Avro, but it seems that this is
> quite a new project.
>
> best wishes,
> Björn

This blog post is a good intro:

http://www.searchenginecaffe.com/2009/07/hadoop-data-serialization-battle.html

Avro is going to be supported natively in Hadoop going forward, so if 
you're on the fence, I'd choose Avro.

I've been using Avro for about a month now (just for serialization, not 
RPC) and I've been pretty happy with it.

HTH,

DR

Re: Best way to write files to hdfs (from a Python app)

Posted by Bjoern Schiessle <bj...@schiessle.org>.
Hey Jeff,

On Wed, 11 Aug 2010 10:40:29 -0700 Jeff Hammerbacher wrote:
> You also mention that your app will be accessing data stored in HBase.
> There's a Python client for the Avro HBase gateway at
> http://github.com/hammer/pyhbase. If you try it out, let me know how it
> goes.

What's the difference between Avro and Thrift? Are there any specific
reasons to prefer one over the other?

I tried to find some documentation about Avro, but it seems that this is
quite a new project.

best wishes,
Björn

Re: Best way to write files to hdfs (from a Python app)

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Hey Björn,

You also mention that your app will be accessing data stored in HBase.
There's a Python client for the Avro HBase gateway at
http://github.com/hammer/pyhbase. If you try it out, let me know how it
goes.

Thanks,
Jeff

On Wed, Aug 11, 2010 at 4:39 AM, Bjoern Schiessle <bj...@schiessle.org> wrote:

> On Tue, 10 Aug 2010 09:39:17 -0700 Philip Zeyliger wrote:
> > On Tue, Aug 10, 2010 at 5:06 AM, Bjoern Schiessle
> > <bj...@schiessle.org> wrote:
> > > On Mon, 9 Aug 2010 16:35:07 -0700 Philip Zeyliger wrote:
> > > > To give you an example of how this may be done, HUE, under the
> > > > covers, pipes your data to 'bin/hadoop fs
> > > > -Dhadoop.job.ugi=user,group put - path'. (That's from memory, but
> > > > it's approximately right; the full python code is at
> > > >
> > >
> http://github.com/cloudera/hue/blob/master/desktop/libs/hadoop/src/hadoop/fs/hadoopfs.py#L692
> > > > )
> > >
> > > Thank you! If I understand it correctly this only works if my python
> > > app runs on the same server as hadoop, right?
> > >
> >
> > It works only if your python app has network connectivity to your
> > namenode. You can access an explicitly specified HDFS by passing
> > -Dfs.default.name=hdfs://<namenode>:<namenode_port>/
> > .  (The default is read from hadoop-site.xml (or perhaps hdfs-site.xml),
> > and, I think, defaults to file:///).
>
> Thank you. This sounds really good! I tried it but I still have a problem.
>
> The namenode is defined in hadoop/conf/core-site.xml. On the namenode it
> looks like this:
>
> <property>
>  <name>fs.default.name</name>
>  <value>hdfs://hadoopserver:9000</value>
> </property>
>
> I have now copied the whole hadoop directory to the client where the
> python app runs.
>
> If I run "hadoop fs -ls /" I get a message that it can't connect to the
> server, and Hadoop tries to connect again and again:
>
> 10/08/11 12:06:34 INFO ipc.Client: Retrying connect to server: hadoopserver/129.69.216.55:9000. Already tried 0 time(s).
> 10/08/11 12:06:35 INFO ipc.Client: Retrying connect to server: hadoopserver/129.69.216.55:9000. Already tried 1 time(s).
>
> From the client I can access the web interface of the namenode
> (hadoopserver:50070). "Browse the file system" links to
> http://pcmoholynagy:50070/nn_browsedfscontent.jsp but if I click on the
> link I get redirected to
> http://localhost:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=%2F
> which of course can't be accessed by the client. If I replace "localhost"
> with "hadoopserver" it works.
>
> Maybe the wrong redirection also causes the problem if I call "bin/hadoop
> fs -ls /"?
>
> I have tried to find something by reading the documentation and by
> googling, but I couldn't find a solution.
>
> Any ideas?
>
> Thanks!
> Björn
>

Re: Best way to write files to hdfs (from a Python app)

Posted by Bjoern Schiessle <bj...@schiessle.org>.
On Tue, 10 Aug 2010 09:39:17 -0700 Philip Zeyliger wrote:
> On Tue, Aug 10, 2010 at 5:06 AM, Bjoern Schiessle
> <bj...@schiessle.org> wrote:
> > On Mon, 9 Aug 2010 16:35:07 -0700 Philip Zeyliger wrote:
> > > To give you an example of how this may be done, HUE, under the
> > > covers, pipes your data to 'bin/hadoop fs
> > > -Dhadoop.job.ugi=user,group put - path'. (That's from memory, but
> > > it's approximately right; the full python code is at
> > >
> > http://github.com/cloudera/hue/blob/master/desktop/libs/hadoop/src/hadoop/fs/hadoopfs.py#L692
> > > )
> >
> > Thank you! If I understand it correctly this only works if my python
> > app runs on the same server as hadoop, right?
> >
> 
> It works only if your python app has network connectivity to your
> namenode. You can access an explicitly specified HDFS by passing
> -Dfs.default.name=hdfs://<namenode>:<namenode_port>/
> .  (The default is read from hadoop-site.xml (or perhaps hdfs-site.xml),
> and, I think, defaults to file:///).

Thank you. This sounds really good! I tried it but I still have a problem.

The namenode is defined in hadoop/conf/core-site.xml. On the namenode it
looks like this:

<property>
  <name>fs.default.name</name>
  <value>hdfs://hadoopserver:9000</value>
</property>

I have now copied the whole hadoop directory to the client where the
python app runs.

If I run "hadoop fs -ls /" I get a message that it can't connect to the
server, and Hadoop tries to connect again and again:

10/08/11 12:06:34 INFO ipc.Client: Retrying connect to server: hadoopserver/129.69.216.55:9000. Already tried 0 time(s).
10/08/11 12:06:35 INFO ipc.Client: Retrying connect to server: hadoopserver/129.69.216.55:9000. Already tried 1 time(s).

From the client I can access the web interface of the namenode
(hadoopserver:50070). "Browse the file system" links to
http://pcmoholynagy:50070/nn_browsedfscontent.jsp but if I click on the
link I get redirected to
http://localhost:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=%2F
which of course can't be accessed by the client. If I replace "localhost"
with "hadoopserver" it works.

Maybe the wrong redirection also causes the problem if I call "bin/hadoop
fs -ls /"?

I have tried to find something by reading the documentation and by
googling, but I couldn't find a solution.

Any ideas?

Thanks!
Björn

Re: Best way to write files to hdfs (from a Python app)

Posted by Travis Crawford <tr...@gmail.com>.
Has anyone tried using SWIG to wrap libhdfs?

I spent some time today doing this, and it seems like it could be a
great solution, but it's also a fair amount of work (especially having
never used SWIG before). If this seems generally worthwhile I could
finish it up.

Or is the Thrift interface the API to use? Is anyone using it successfully?

I'm primarily interested in building some filesystem management +
reporting tools, so being slower than the Java interface is not a problem.
I'd prefer not to parse the command-line tool output, though :)

--travis



On Tue, Aug 10, 2010 at 9:39 AM, Philip Zeyliger <ph...@cloudera.com> wrote:
>
>
> On Tue, Aug 10, 2010 at 5:06 AM, Bjoern Schiessle <bj...@schiessle.org>
> wrote:
>>
>> Hi Philip,
>>
>> On Mon, 9 Aug 2010 16:35:07 -0700 Philip Zeyliger wrote:
>> > To give you an example of how this may be done, HUE, under the covers,
>> > pipes your data to 'bin/hadoop fs -Dhadoop.job.ugi=user,group put -
>> > path'. (That's from memory, but it's approximately right; the full
>> > python code is at
>> >
>> > http://github.com/cloudera/hue/blob/master/desktop/libs/hadoop/src/hadoop/fs/hadoopfs.py#L692
>> > )
>>
>> Thank you! If I understand it correctly this only works if my python app
>> runs on the same server as hadoop, right?
>
> It works only if your python app has network connectivity to your namenode.
>  You can access an explicitly specified HDFS by passing
> -Dfs.default.name=hdfs://<namenode>:<namenode_port>/ .  (The default is read
> from hadoop-site.xml (or perhaps hdfs-site.xml), and, I think, defaults to
> file:///).
>

Re: Best way to write files to hdfs (from a Python app)

Posted by Philip Zeyliger <ph...@cloudera.com>.
On Tue, Aug 10, 2010 at 5:06 AM, Bjoern Schiessle <bj...@schiessle.org> wrote:

> Hi Philip,
>
> On Mon, 9 Aug 2010 16:35:07 -0700 Philip Zeyliger wrote:
> > To give you an example of how this may be done, HUE, under the covers,
> > pipes your data to 'bin/hadoop fs -Dhadoop.job.ugi=user,group put -
> > path'. (That's from memory, but it's approximately right; the full
> > python code is at
> >
> http://github.com/cloudera/hue/blob/master/desktop/libs/hadoop/src/hadoop/fs/hadoopfs.py#L692
> > )
>
> Thank you! If I understand it correctly this only works if my python app
> runs on the same server as hadoop, right?
>

It works only if your python app has network connectivity to your namenode.
You can access an explicitly specified HDFS by passing
-Dfs.default.name=hdfs://<namenode>:<namenode_port>/.
(The default is read from hadoop-site.xml (or perhaps hdfs-site.xml),
and, I think, defaults to file:///).
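
For example, run from the client, the full command would look something like
this (using the hadoopserver:9000 namenode address mentioned elsewhere in
this thread):

    bin/hadoop fs -Dfs.default.name=hdfs://hadoopserver:9000/ -ls /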

Re: Best way to write files to hdfs (from a Python app)

Posted by Bjoern Schiessle <bj...@schiessle.org>.
Hi Philip,

On Mon, 9 Aug 2010 16:35:07 -0700 Philip Zeyliger wrote:
> To give you an example of how this may be done, HUE, under the covers,
> pipes your data to 'bin/hadoop fs -Dhadoop.job.ugi=user,group put -
> path'. (That's from memory, but it's approximately right; the full
> python code is at
> http://github.com/cloudera/hue/blob/master/desktop/libs/hadoop/src/hadoop/fs/hadoopfs.py#L692
> )

Thank you! If I understand it correctly this only works if my python app
runs on the same server as hadoop, right?

I would like to run the Python app on a different server, hence my two
ideas: (1) Thrift or (2) hdfs-fuse.

Thrift seems to be able to store only string content to HDFS, not binary
files. At least I couldn't find an interface for a simple put operation.

So at the moment I'm not sure how to continue.

Any ideas?

Thanks,
Björn

Re: Best way to write files to hdfs (from a Python app)

Posted by Philip Zeyliger <ph...@cloudera.com>.
Hi Bjoern,

To give you an example of how this may be done, HUE, under the covers, pipes
your data to 'bin/hadoop fs -Dhadoop.job.ugi=user,group put - path'.
(That's from memory, but it's approximately right; the full python code is at
http://github.com/cloudera/hue/blob/master/desktop/libs/hadoop/src/hadoop/fs/hadoopfs.py#L692 )
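
A rough Python sketch of that piping approach (the function name, arguments
and error handling are made up here; see the HUE source above for the real
code):

    import shutil
    import subprocess

    def put_stream_to_hdfs(src_file, hdfs_path, user="user", group="group"):
        # Pipe the data into 'hadoop fs -put - <path>' so the web server
        # never has to stage the whole file on its local disk.
        cmd = ["bin/hadoop", "fs",
               "-Dhadoop.job.ugi=%s,%s" % (user, group),
               "-put", "-", hdfs_path]
        proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
        shutil.copyfileobj(src_file, proc.stdin)
        proc.stdin.close()
        if proc.wait() != 0:
            raise IOError("hadoop fs -put failed for %s" % hdfs_path)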

Cheers,

-- Philip



On Mon, Aug 9, 2010 at 9:18 AM, Bjoern Schiessle <bj...@schiessle.org> wrote:

> Hi all,
>
> I'm developing a web application with Django (Python) which should access an
> HBase database and store large files to HDFS.
>
> I wonder what is the best way to write files to HDFS from my Django app?
> Basically I thought of two ways, but maybe you know a better option:
>
> 1. First store the file on the local file system and then move it to HDFS
> via the Thrift interface. (downside: the web application server always
> needs enough free disk space)
>
> 2. Use hdfs-fuse to mount the HDFS file system and write the file directly
> to HDFS. (downside: I don't know how well hdfs-fuse is supported, and I'm
> not sure if it is a good idea to mount the file system and run large
> operations on it).
>
> Since I'm new to HDFS and Hadoop in general, I'm not sure which is the best
> and least error-prone way.
>
> What would be your recommendation?
>
> Thanks a lot!
> Björn
>
>