Posted to general@hadoop.apache.org by Mark Laffoon <ml...@semanticresearch.com> on 2010/11/02 19:25:47 UTC

web-based file transfer

We want to enable our web-based client (i.e. browser client, Java applet,
whatever) to transfer files into a system backed by HDFS. The obvious
simple solution is to do HTTP file uploads, then copy the file to HDFS. I
was wondering if there is a way to do it with an HDFS-enabled applet where
the server gives the client the necessary Hadoop configuration
information, and the client applet pushes the data directly into HDFS.

 

Has anybody done this or something similar? Can you give me a starting
point? (I'm about to go wander through the Hadoop CLI code to get ideas.)

 

Thanks,

Mark


RE: web-based file transfer

Posted by "Gibbon, Robert, VF-Group" <Ro...@vodafone.com>.
>Even all the Java servlet APIs assume that the content-length header 
>fits into a signed 32 bit integer and gets unhappy once you go over 2GB 
>(something I worry about in
>http://jira.smartfrog.org/jira/browse/SFOS-1476 )

I built my HDFS WebDAV implementation against the JackRabbit 1.6.4 library - AFAIK it has used a long for the content-length header since release 1.5.5, not a 32-bit int:

https://issues.apache.org/jira/browse/JCR-2009

That means any large-file limitations are going to be on the client side, especially for 32-bit OSes. So yes, it might be worth thinking about leveraging HAR archives to keep the file count down if you do choose to go down the same route.

R



Re: web-based file transfer

Posted by Steve Loughran <st...@apache.org>.
On 02/11/10 18:25, Mark Laffoon wrote:
> We want to enable our web-based client (i.e. browser client, java applet,
> whatever?) to transfer files into a system backed by hdfs. The obvious
> simple solution is to do http file uploads, then copy the file to hdfs. I
> was wondering if there is a way to do it with an hdfs-enabled applet where
> the server gives the client the necessary hadoop configuration
> information, and the client applet pushes the data directly into hdfs.


I recall some work done with WebDAV
   https://issues.apache.org/jira/browse/HDFS-225
but I don't think it's progressed.

I've done things like this in the past with servlets and forms; the 
webapp you deploy has the Hadoop configuration (and the network rights 
to talk to HDFS in the datacentre), and the clients PUT/POST up content:

http://www.slideshare.net/steve_l/long-haul-hadoop
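
For illustration, a minimal sketch of that kind of servlet - UploadToHdfsServlet
and the /uploads landing directory are made-up names, and it assumes the Hadoop
client jars and site configuration are on the webapp's classpath:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical servlet: the webapp holds the Hadoop configuration and the
// network rights to reach HDFS; clients just PUT/POST their bytes at it.
public class UploadToHdfsServlet extends HttpServlet {

    private Configuration conf;

    @Override
    public void init() throws ServletException {
        // Assumes core-site.xml/hdfs-site.xml are on the webapp classpath,
        // or set fs.default.name explicitly here.
        conf = new Configuration();
    }

    @Override
    protected void doPut(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Map the request path onto a (hypothetical) landing directory in HDFS.
        Path target = new Path("/uploads" + req.getPathInfo());

        FileSystem fs = FileSystem.get(conf);
        InputStream in = req.getInputStream();
        OutputStream out = fs.create(target, true);
        try {
            // Stream the request body straight into HDFS; no local spooling.
            IOUtils.copyBytes(in, out, conf, false);
        } finally {
            out.close();
            in.close();
        }
        resp.setStatus(HttpServletResponse.SC_CREATED);
    }

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Treat a raw POST body the same way; a multipart form upload would
        // need parsing first.
        doPut(req, resp);
    }
}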

However, you are limited to about 2GB of upload/download in most web 
clients; some (Chrome) go up to 4GB, but you are pushing the limit there. 
Even the Java servlet APIs assume that the Content-Length header 
fits into a signed 32-bit integer and get unhappy once you go over 2GB 
(something I worry about in 
http://jira.smartfrog.org/jira/browse/SFOS-1476 )

Because Hadoop really likes large files - tens to hundreds of GB in a big 
cluster - I don't think the current web infrastructure is up to working 
with it.


That said, looking at Hudson for the nightly runs of my bulk IO tests, 
Jetty will serve up 4GB in 5 minutes (over the loopback interface), and I can POST or 
PUT up 4GB, but I have to get/set the content-length headers myself rather 
than rely on the java.net client and servlet implementations to handle it:

http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/components/www/src/org/smartfrog/services/www/bulkio/client/SunJavaBulkIOClient.java?revision=8430&view=markup
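
For illustration, a cut-down sketch of that kind of client using plain java.net.
The URL and file are placeholders, and note the long overload of
setFixedLengthStreamingMode only arrived in Java 7; on Java 6 you would fall
back to chunked streaming or manage the length another way, as the client above does:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch of a >2GB PUT with java.net, assuming a Java 7+ client.
public class BigPut {
    public static void main(String[] args) throws Exception {
        File src = new File(args[0]);
        URL url = new URL(args[1]); // e.g. the upload servlet's URL

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("PUT");
        // Java 7+ overload takes a long, so the Content-Length can exceed 2GB.
        // On Java 6 only the int overload exists; use
        // conn.setChunkedStreamingMode(64 * 1024) instead.
        conn.setFixedLengthStreamingMode(src.length());

        InputStream in = new FileInputStream(src);
        OutputStream out = conn.getOutputStream();
        try {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } finally {
            out.close();
            in.close();
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}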

If you can control the client, then maybe you would be able to do >4GB 
uploads, but otherwise you are stuck with data <2GB in size, which is - 
what - 4-8 blocks in a production cluster?

-steve

RE: web-based file transfer

Posted by "Gibbon, Robert, VF-Group" <Ro...@vodafone.com>.

> What are the performance characteristics like for the webdav solution?

The HDFS-over-WebDAV setup is horizontally scalable: just keep adding Jettys and put a round-robin VIP in front. It is stateless, so there's no need for sticky sessions.

It is not especially chatty unless you are doing complex directory traversals - it's just HTTP PUT and HTTP GET - much the same as most REST implementations, in fact.

For us, it's more than good enough.



-----Original Message-----
From: Mark Laffoon [mailto:mlaffoon@semanticresearch.com]
Sent: Fri 11/5/2010 10:34 PM
To: general@hadoop.apache.org
Subject: RE: web-based file transfer
 
Robert,

What are the performance characteristics like for the webdav solution? I
ask for two reasons: since it is implemented over tcp it probably isn't
much faster than http fileupload; I've had previous experience with webdav
(on top of object stores) and we found the protocol to be very "chatty".

Since our use-case is fairly simple (just need to transfer lots of files
from lots of clients; navigating the results isn't necessary), will the
webdav solution be too much?

Comments?

Thanks!

-----Original Message-----
From: Gibbon, Robert, VF-Group [mailto:Robert.Gibbon@vodafone.com] 
Sent: Wednesday, November 03, 2010 4:20 PM
To: general@hadoop.apache.org; general@hadoop.apache.org
Subject: RE: web-based file transfer

Check out HDFS over WebDAV

- http://www.hadoop.iponweb.net/Home/hdfs-over-webdav

WebDAV is an HTTP based protocol for accessing remote filesystems.

I'm running an adapted version of this. It runs under Jetty which is
pretty industry standard and is built on Apache JackRabbit which is pretty
production stable too. I lashed together a custom JAAS authentication
module to authenticate it against our user database.

You can mount WebDAV on Linux using FUSE and WDFS, or script sessions with
cadaver on Solaris/Unix without mounting WebDav. It works pretty sweet on
Windows and Apple, too. 

Recent versions of Jetty have built in traffic shaping and QoS features,
although you might get more mileage from HAProxy or a hardware
loadbalancer.

It works pretty sweet as it enforces HDFS permissions (if you have them
enabled). To get Hadoop permission integrity enforced on MapReduce jobs
check out Oozie - it's a job submission proxy which runs under Tomcat
(might work with Jetty too - haven't tried) and can use a custom
ServletFilter for authentication which you can also patch onto your own
user database/directory. 

Then you just need to seal the perimeter of your cluster with Firewall
rules and you're good to go

No more Kerberos!
R

-----Original Message-----
From: Eric Sammer [mailto:esammer@cloudera.com]
Sent: Wed 11/3/2010 5:05 PM
To: general@hadoop.apache.org
Subject: Re: web-based file transfer
 
Something like it, but Chukwa is more similar to Flume. For *files*
one may want something slightly different. For a stream of (data)
events, Chukwa, Flume, or Scribe are appropriate.

On Wed, Nov 3, 2010 at 1:22 AM, Ian Holsman <ha...@holsman.net> wrote:
> Doesn't chukwa do something like this?
>
> ---
> Ian Holsman - 703 879-3128
>
> I saw the angel in the marble and carved until I set him free -- Michelangelo
>
> On 03/11/2010, at 5:44 AM, Eric Sammer <es...@cloudera.com> wrote:
>
>> I would recommend against clients pushing data directly to hdfs like
>> this for a few reasons.
>>
>> 1. The HDFS cluster would need to be directly exposed to a public
>> network; you don't want to do this.
>> 2. You'd be applying (presumably) a high concurrent load to HDFS which
>> isn't its strong point.
>>
>> From an architecture point of view, it's much nicer to have a queuing
>> system between the upload and ingestion into HDFS that you can
>> throttle and control, if necessary. This also allows you to isolate
>> the cluster from the outside world. As to not bottleneck on a single
>> writer, you can have uploaded files land in a queue and have multiple
>> competing consumers popping files (or file names upon which to
>> operate) out of the queue and handling the writing in parallel while
>> being able to control the number of workers. If the initial upload is
>> to a shared device like NFS, you can have writers live on multiple
>> boxes and distribute the work.
>>
>> Another option is to consider Flume, but only if you can deal with the
>> fact that it effectively throws away the notion of files and treats
>> their contents as individual events, etc.
>> http://github.com/cloudera/flume.
>>
>> Hope that helps.
>>
>> On Tue, Nov 2, 2010 at 2:25 PM, Mark Laffoon
>> <ml...@semanticresearch.com> wrote:
>>> We want to enable our web-based client (i.e. browser client, java applet,
>>> whatever?) to transfer files into a system backed by hdfs. The obvious
>>> simple solution is to do http file uploads, then copy the file to hdfs. I
>>> was wondering if there is a way to do it with an hdfs-enabled applet where
>>> the server gives the client the necessary hadoop configuration
>>> information, and the client applet pushes the data directly into hdfs.
>>>
>>>
>>>
>>> Has anybody done this or something similar? Can you give me a starting
>>> point (I'm about to go wander through the hadoop CLI code to get ideas).
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Mark
>>>
>>>
>>
>>
>>
>> --
>> Eric Sammer
>> twitter: esammer
>> data: www.cloudera.com
>



-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com



RE: web-based file transfer

Posted by Mark Laffoon <ml...@semanticresearch.com>.
Robert,

What are the performance characteristics like for the WebDAV solution? I
ask for two reasons: since it is implemented over TCP it probably isn't
much faster than HTTP file upload, and I've had previous experience with WebDAV
(on top of object stores) where we found the protocol to be very "chatty".

Since our use case is fairly simple (we just need to transfer lots of files
from lots of clients; navigating the results isn't necessary), would the
WebDAV solution be too much?

Comments?

Thanks!

-----Original Message-----
From: Gibbon, Robert, VF-Group [mailto:Robert.Gibbon@vodafone.com] 
Sent: Wednesday, November 03, 2010 4:20 PM
To: general@hadoop.apache.org; general@hadoop.apache.org
Subject: RE: web-based file transfer

Check out HDFS over WebDAV

- http://www.hadoop.iponweb.net/Home/hdfs-over-webdav

WebDAV is an HTTP based protocol for accessing remote filesystems.

I'm running an adapted version of this. It runs under Jetty which is
pretty industry standard and is built on Apache JackRabbit which is pretty
production stable too. I lashed together a custom JAAS authentication
module to authenticate it against our user database.

You can mount WebDAV on Linux using FUSE and WDFS, or script sessions with
cadaver on Solaris/Unix without mounting WebDav. It works pretty sweet on
Windows and Apple, too. 

Recent versions of Jetty have built in traffic shaping and QoS features,
although you might get more mileage from HAProxy or a hardware
loadbalancer.

It works pretty sweet as it enforces HDFS permissions (if you have them
enabled). To get Hadoop permission integrity enforced on MapReduce jobs
check out Oozie - it's a job submission proxy which runs under Tomcat
(might work with Jetty too - haven't tried) and can use a custom
ServletFilter for authentication which you can also patch onto your own
user database/directory. 

Then you just need to seal the perimeter of your cluster with Firewall
rules and you're good to go

No more Kerberos!
R

-----Original Message-----
From: Eric Sammer [mailto:esammer@cloudera.com]
Sent: Wed 11/3/2010 5:05 PM
To: general@hadoop.apache.org
Subject: Re: web-based file transfer
 
Something like it, but Chukwa is more similar to Flume. For *files*
one may want something slightly different. For a stream of (data)
events, Chukwa, Flume, or Scribe are appropriate.

On Wed, Nov 3, 2010 at 1:22 AM, Ian Holsman <ha...@holsman.net> wrote:
> Doesn't chukwa do something like this?
>
> ---
> Ian Holsman - 703 879-3128
>
> I saw the angel in the marble and carved until I set him free -- Michelangelo
>
> On 03/11/2010, at 5:44 AM, Eric Sammer <es...@cloudera.com> wrote:
>
>> I would recommend against clients pushing data directly to hdfs like
>> this for a few reasons.
>>
>> 1. The HDFS cluster would need to be directly exposed to a public
>> network; you don't want to do this.
>> 2. You'd be applying (presumably) a high concurrent load to HDFS which
>> isn't its strong point.
>>
>> From an architecture point of view, it's much nicer to have a queuing
>> system between the upload and ingestion into HDFS that you can
>> throttle and control, if necessary. This also allows you to isolate
>> the cluster from the outside world. As to not bottleneck on a single
>> writer, you can have uploaded files land in a queue and have multiple
>> competing consumers popping files (or file names upon which to
>> operate) out of the queue and handling the writing in parallel while
>> being able to control the number of workers. If the initial upload is
>> to a shared device like NFS, you can have writers live on multiple
>> boxes and distribute the work.
>>
>> Another option is to consider Flume, but only if you can deal with the
>> fact that it effectively throws away the notion of files and treats
>> their contents as individual events, etc.
>> http://github.com/cloudera/flume.
>>
>> Hope that helps.
>>
>> On Tue, Nov 2, 2010 at 2:25 PM, Mark Laffoon
>> <ml...@semanticresearch.com> wrote:
>>> We want to enable our web-based client (i.e. browser client, java applet,
>>> whatever?) to transfer files into a system backed by hdfs. The obvious
>>> simple solution is to do http file uploads, then copy the file to hdfs. I
>>> was wondering if there is a way to do it with an hdfs-enabled applet where
>>> the server gives the client the necessary hadoop configuration
>>> information, and the client applet pushes the data directly into hdfs.
>>>
>>>
>>>
>>> Has anybody done this or something similar? Can you give me a starting
>>> point (I'm about to go wander through the hadoop CLI code to get ideas).
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Mark
>>>
>>>
>>
>>
>>
>> --
>> Eric Sammer
>> twitter: esammer
>> data: www.cloudera.com
>



-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com


RE: web-based file transfer

Posted by "Gibbon, Robert, VF-Group" <Ro...@vodafone.com>.
Check out HDFS over WebDAV

- http://www.hadoop.iponweb.net/Home/hdfs-over-webdav

WebDAV is an HTTP-based protocol for accessing remote filesystems.

I'm running an adapted version of this. It runs under Jetty, which is pretty much industry standard, and is built on Apache JackRabbit, which is pretty production-stable too. I lashed together a custom JAAS authentication module to authenticate it against our user database.
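
For illustration, a stripped-down sketch of that kind of LoginModule - the class
name and checkCredentials() are placeholders for whatever user store you actually have:

import java.security.Principal;
import java.util.Map;

import javax.security.auth.Subject;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.NameCallback;
import javax.security.auth.callback.PasswordCallback;
import javax.security.auth.login.LoginException;
import javax.security.auth.spi.LoginModule;

// Sketch of a custom JAAS LoginModule; checkCredentials() stands in for the
// real user-database lookup.
public class UserDbLoginModule implements LoginModule {

    private Subject subject;
    private CallbackHandler handler;
    private String user;
    private boolean succeeded;

    public void initialize(Subject subject, CallbackHandler handler,
                           Map<String, ?> sharedState, Map<String, ?> options) {
        this.subject = subject;
        this.handler = handler;
    }

    public boolean login() throws LoginException {
        NameCallback name = new NameCallback("user");
        PasswordCallback pass = new PasswordCallback("password", false);
        try {
            handler.handle(new Callback[] { name, pass });
        } catch (Exception e) {
            throw new LoginException(e.toString());
        }
        user = name.getName();
        char[] pw = pass.getPassword();
        succeeded = checkCredentials(user, pw == null ? "" : new String(pw));
        if (!succeeded) {
            throw new LoginException("bad credentials for " + user);
        }
        return true;
    }

    public boolean commit() {
        if (succeeded) {
            // Attach a principal so downstream code can see who logged in.
            final String name = user;
            subject.getPrincipals().add(new Principal() {
                public String getName() { return name; }
            });
        }
        return succeeded;
    }

    public boolean abort()  { succeeded = false; return true; }
    public boolean logout() { subject.getPrincipals().clear(); return true; }

    // Placeholder: replace with a real lookup against your user database.
    private boolean checkCredentials(String user, String password) {
        return user != null && !password.isEmpty();
    }
}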

You can mount WebDAV on Linux using FUSE and WDFS, or script sessions with cadaver on Solaris/Unix without mounting WebDAV. It works pretty sweet on Windows and Apple, too.

Recent versions of Jetty have built-in traffic shaping and QoS features, although you might get more mileage from HAProxy or a hardware load balancer.

It works pretty sweet as it enforces HDFS permissions (if you have them enabled). To get Hadoop permissions enforced on MapReduce jobs, check out Oozie - it's a job-submission proxy which runs under Tomcat (it might work with Jetty too - I haven't tried) and can use a custom ServletFilter for authentication, which you can also hook up to your own user database/directory.
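
The ServletFilter side is just the standard javax.servlet.Filter contract. A
minimal sketch - UserDbAuthFilter and isValid() are made-up names standing in
for your own directory lookup:

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch of an authentication filter; isValid() is a stand-in for a lookup
// against your own user database or directory.
public class UserDbAuthFilter implements Filter {

    public void init(FilterConfig config) throws ServletException { }

    public void doFilter(ServletRequest req, ServletResponse resp,
                         FilterChain chain) throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) req;
        String user = http.getRemoteUser();          // or parse an auth header
        if (user != null && isValid(user)) {
            chain.doFilter(req, resp);               // let the request through
        } else {
            ((HttpServletResponse) resp).sendError(
                    HttpServletResponse.SC_UNAUTHORIZED);
        }
    }

    public void destroy() { }

    // Placeholder for the real check against the user store.
    private boolean isValid(String user) {
        return !user.isEmpty();
    }
}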

Then you just need to seal the perimeter of your cluster with firewall rules and you're good to go.

No more Kerberos!
R

-----Original Message-----
From: Eric Sammer [mailto:esammer@cloudera.com]
Sent: Wed 11/3/2010 5:05 PM
To: general@hadoop.apache.org
Subject: Re: web-based file transfer
 
Something like it, but Chukwa is more similar to Flume. For *files*
one may want something slightly different. For a stream of (data)
events, Chukwa, Flume, or Scribe are appropriate.

On Wed, Nov 3, 2010 at 1:22 AM, Ian Holsman <ha...@holsman.net> wrote:
> Doesn't chukwa do something like this?
>
> ---
> Ian Holsman - 703 879-3128
>
> I saw the angel in the marble and carved until I set him free -- Michelangelo
>
> On 03/11/2010, at 5:44 AM, Eric Sammer <es...@cloudera.com> wrote:
>
>> I would recommend against clients pushing data directly to hdfs like
>> this for a few reasons.
>>
>> 1. The HDFS cluster would need to be directly exposed to a public
>> network; you don't want to do this.
>> 2. You'd be applying (presumably) a high concurrent load to HDFS which
>> isn't its strong point.
>>
>> From an architecture point of view, it's much nicer to have a queuing
>> system between the upload and ingestion into HDFS that you can
>> throttle and control, if necessary. This also allows you to isolate
>> the cluster from the outside world. As to not bottleneck on a single
>> writer, you can have uploaded files land in a queue and have multiple
>> competing consumers popping files (or file names upon which to
>> operate) out of the queue and handling the writing in parallel while
>> being able to control the number of workers. If the initial upload is
>> to a shared device like NFS, you can have writers live on multiple
>> boxes and distribute the work.
>>
>> Another option is to consider Flume, but only if you can deal with the
>> fact that it effectively throws away the notion of files and treats
>> their contents as individual events, etc.
>> http://github.com/cloudera/flume.
>>
>> Hope that helps.
>>
>> On Tue, Nov 2, 2010 at 2:25 PM, Mark Laffoon
>> <ml...@semanticresearch.com> wrote:
>>> We want to enable our web-based client (i.e. browser client, java applet,
>>> whatever?) to transfer files into a system backed by hdfs. The obvious
>>> simple solution is to do http file uploads, then copy the file to hdfs. I
>>> was wondering if there is a way to do it with an hdfs-enabled applet where
>>> the server gives the client the necessary hadoop configuration
>>> information, and the client applet pushes the data directly into hdfs.
>>>
>>>
>>>
>>> Has anybody done this or something similar? Can you give me a starting
>>> point (I'm about to go wander through the hadoop CLI code to get ideas).
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Mark
>>>
>>>
>>
>>
>>
>> --
>> Eric Sammer
>> twitter: esammer
>> data: www.cloudera.com
>



-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com


Re: web-based file transfer

Posted by Eric Sammer <es...@cloudera.com>.
Something like it, but Chukwa is more similar to Flume. For *files*
one may want something slightly different. For a stream of (data)
events, Chukwa, Flume, or Scribe are appropriate.

On Wed, Nov 3, 2010 at 1:22 AM, Ian Holsman <ha...@holsman.net> wrote:
> Doesn't chukwa do something like this?
>
> ---
> Ian Holsman - 703 879-3128
>
> I saw the angel in the marble and carved until I set him free -- Michelangelo
>
> On 03/11/2010, at 5:44 AM, Eric Sammer <es...@cloudera.com> wrote:
>
>> I would recommend against clients pushing data directly to hdfs like
>> this for a few reasons.
>>
>> 1. The HDFS cluster would need to be directly exposed to a public
>> network; you don't want to do this.
>> 2. You'd be applying (presumably) a high concurrent load to HDFS which
>> isn't its strong point.
>>
>> From an architecture point of view, it's much nicer to have a queuing
>> system between the upload and ingestion into HDFS that you can
>> throttle and control, if necessary. This also allows you to isolate
>> the cluster from the outside world. As to not bottleneck on a single
>> writer, you can have uploaded files land in a queue and have multiple
>> competing consumers popping files (or file names upon which to
>> operate) out of the queue and handling the writing in parallel while
>> being able to control the number of workers. If the initial upload is
>> to a shared device like NFS, you can have writers live on multiple
>> boxes and distribute the work.
>>
>> Another option is to consider Flume, but only if you can deal with the
>> fact that it effectively throws away the notion of files and treats
>> their contents as individual events, etc.
>> http://github.com/cloudera/flume.
>>
>> Hope that helps.
>>
>> On Tue, Nov 2, 2010 at 2:25 PM, Mark Laffoon
>> <ml...@semanticresearch.com> wrote:
>>> We want to enable our web-based client (i.e. browser client, java applet,
>>> whatever?) to transfer files into a system backed by hdfs. The obvious
>>> simple solution is to do http file uploads, then copy the file to hdfs. I
>>> was wondering if there is a way to do it with an hdfs-enabled applet where
>>> the server gives the client the necessary hadoop configuration
>>> information, and the client applet pushes the data directly into hdfs.
>>>
>>>
>>>
>>> Has anybody done this or something similar? Can you give me a starting
>>> point (I'm about to go wander through the hadoop CLI code to get ideas).
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Mark
>>>
>>>
>>
>>
>>
>> --
>> Eric Sammer
>> twitter: esammer
>> data: www.cloudera.com
>



-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com

Re: web-based file transfer

Posted by Ian Holsman <ha...@holsman.net>.
Doesn't chukwa do something like this?

---
Ian Holsman - 703 879-3128

I saw the angel in the marble and carved until I set him free -- Michelangelo

On 03/11/2010, at 5:44 AM, Eric Sammer <es...@cloudera.com> wrote:

> I would recommend against clients pushing data directly to hdfs like
> this for a few reasons.
> 
> 1. The HDFS cluster would need to be directly exposed to a public
> network; you don't want to do this.
> 2. You'd be applying (presumably) a high concurrent load to HDFS which
> isn't its strong point.
> 
> From an architecture point of view, it's much nicer to have a queuing
> system between the upload and ingestion into HDFS that you can
> throttle and control, if necessary. This also allows you to isolate
> the cluster from the outside world. As to not bottleneck on a single
> writer, you can have uploaded files land in a queue and have multiple
> competing consumers popping files (or file names upon which to
> operate) out of the queue and handling the writing in parallel while
> being able to control the number of workers. If the initial upload is
> to a shared device like NFS, you can have writers live on multiple
> boxes and distribute the work.
> 
> Another option is to consider Flume, but only if you can deal with the
> fact that it effectively throws away the notion of files and treats
> their contents as individual events, etc.
> http://github.com/cloudera/flume.
> 
> Hope that helps.
> 
> On Tue, Nov 2, 2010 at 2:25 PM, Mark Laffoon
> <ml...@semanticresearch.com> wrote:
>> We want to enable our web-based client (i.e. browser client, java applet,
>> whatever?) to transfer files into a system backed by hdfs. The obvious
>> simple solution is to do http file uploads, then copy the file to hdfs. I
>> was wondering if there is a way to do it with an hdfs-enabled applet where
>> the server gives the client the necessary hadoop configuration
>> information, and the client applet pushes the data directly into hdfs.
>> 
>> 
>> 
>> Has anybody done this or something similar? Can you give me a starting
>> point (I'm about to go wander through the hadoop CLI code to get ideas).
>> 
>> 
>> 
>> Thanks,
>> 
>> Mark
>> 
>> 
> 
> 
> 
> -- 
> Eric Sammer
> twitter: esammer
> data: www.cloudera.com

Re: web-based file transfer

Posted by Eric Sammer <es...@cloudera.com>.
I would recommend against clients pushing data directly to HDFS like
this, for a few reasons.

1. The HDFS cluster would need to be directly exposed to a public
network; you don't want to do this.
2. You'd be applying (presumably) a high concurrent load to HDFS which
isn't its strong point.

From an architecture point of view, it's much nicer to have a queuing
system between the upload and ingestion into HDFS that you can
throttle and control, if necessary. This also allows you to isolate
the cluster from the outside world. So as not to bottleneck on a single
writer, you can have uploaded files land in a queue and have multiple
competing consumers popping files (or file names upon which to
operate) out of the queue and handling the writing in parallel, while
being able to control the number of workers. If the initial upload is
to a shared device like NFS, you can have writers live on multiple
boxes and distribute the work.
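
To illustrate, a rough sketch of the competing-consumers idea. It uses an
in-JVM BlockingQueue of local files to stay self-contained; in practice the
queue would be JMS, a database table, a watched NFS directory, or similar, and
the /ingest target path is made up:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Competing consumers pop uploaded files off a queue and copy them into HDFS;
// the number of workers is the throttle.
public class HdfsIngestWorkers {

    public static void main(String[] args) throws Exception {
        final Configuration conf = new Configuration();
        final BlockingQueue<File> queue = new LinkedBlockingQueue<File>();
        int workers = 4; // tune this to control the load on the cluster

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        FileSystem fs = FileSystem.get(conf);
                        while (true) {
                            File next = queue.take();   // blocks until work arrives
                            copyIn(fs, next, conf);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        // Producers (the upload webapp, an NFS directory watcher, ...) would
        // call queue.put(file) as files land.
    }

    static void copyIn(FileSystem fs, File src, Configuration conf) throws Exception {
        Path dest = new Path("/ingest/" + src.getName());   // hypothetical target
        InputStream in = new FileInputStream(src);
        OutputStream out = fs.create(dest, true);
        try {
            IOUtils.copyBytes(in, out, conf, false);
        } finally {
            out.close();
            in.close();
        }
    }
}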

Another option is to consider Flume, but only if you can deal with the
fact that it effectively throws away the notion of files and treats
their contents as individual events, etc.
http://github.com/cloudera/flume.

Hope that helps.

On Tue, Nov 2, 2010 at 2:25 PM, Mark Laffoon
<ml...@semanticresearch.com> wrote:
> We want to enable our web-based client (i.e. browser client, java applet,
> whatever?) to transfer files into a system backed by hdfs. The obvious
> simple solution is to do http file uploads, then copy the file to hdfs. I
> was wondering if there is a way to do it with an hdfs-enabled applet where
> the server gives the client the necessary hadoop configuration
> information, and the client applet pushes the data directly into hdfs.
>
>
>
> Has anybody done this or something similar? Can you give me a starting
> point (I'm about to go wander through the hadoop CLI code to get ideas).
>
>
>
> Thanks,
>
> Mark
>
>



-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com

Re: web-based file transfer

Posted by Venkatesh S <sv...@yahoo-inc.com>.
There is also HDFS Proxy in contrib, which currently only does listing and streaming of files over HTTP. But we are very close to getting the next version of the proxy out, with read+write (full FS access).

Venkatesh


On 11/2/10 11:55 PM, "Mark Laffoon" <ml...@semanticresearch.com> wrote:

We want to enable our web-based client (i.e. browser client, java applet,
whatever?) to transfer files into a system backed by hdfs. The obvious
simple solution is to do http file uploads, then copy the file to hdfs. I
was wondering if there is a way to do it with an hdfs-enabled applet where
the server gives the client the necessary hadoop configuration
information, and the client applet pushes the data directly into hdfs.



Has anybody done this or something similar? Can you give me a starting
point (I'm about to go wander through the hadoop CLI code to get ideas).



Thanks,

Mark