Posted to hdfs-user@hadoop.apache.org by Steve Sonnenberg <st...@gmail.com> on 2012/08/29 22:58:11 UTC

Importing Data into HDFS

Is there any way to import data into HDFS without copying it in (kind of
like by reference)?
I'm pretty sure the answer to this is no.

What I'm looking for is something that will take existing NFS data and
access it as an HDFS filesystem.
Use case: I have existing data in a warehouse that I would like to run
MapReduce etc. on without copying it into HDFS.

If the data were in S3, could I run MapReduce on it?

Thanks

-- 
Steve Sonnenberg

Re: Importing Data into HDFS

Posted by Kai Voigt <k...@123.org>.
Hello,

On 29.08.2012 at 22:58, Steve Sonnenberg <st...@gmail.com> wrote:

> Is there any way to import data into HDFS without copying it in (kind of like by reference)?
> I'm pretty sure the answer to this is no.
> 
> What I'm looking for is something that will take existing NFS data and access it as an HDFS filesystem.
> Use case: I have existing data in a warehouse that I would like to run MapReduce etc. on without copying it into HDFS.
> 
> If the data were in S3, could I run MapReduce on it?


Hadoop has a filesystem abstraction layer that supports many physical filesystem implementations: HDFS of course, but also the local filesystem, S3, FTP, and others.
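
To make that concrete, here is a minimal sketch against the org.apache.hadoop.fs.FileSystem API (the class name and usage are illustrative, not from the original post): the URI scheme selects which implementation you get, so the same code can list an HDFS directory, a local or NFS-mounted directory via file://, or an S3 bucket via s3n://.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListAnyFs {
    public static void main(String[] args) throws Exception {
        // The URI scheme picks the implementation: hdfs://, file://,
        // s3n://, ftp://, ... An NFS export mounted at the same path on
        // every node can be read through file:// without copying it in.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            System.out.println(status.getPath());
        }
    }
}

The same applies to MapReduce input paths: if the NFS export is mounted at the same location on every node, a file:// input path will work, which is one answer to your use case.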

You simply lose data locality if you're running MapReduce on data that is, well, not local to where it's being processed.

With data stored in S3, a common solution is to fire up an EMR (Elastic MapReduce) cluster inside Amazon's data center to work on your S3 data. It's not true data locality, but at least the processing happens in the same data center as your data. And once you're done processing, you can take down the EMR cluster.
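
As a rough sketch of what that looks like in job-setup code (the bucket name, paths, and credential values below are all placeholders, and the s3n:// scheme reflects Hadoop releases of this era):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3JobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder credentials; an EMR cluster normally supplies
        // these through its own configuration instead.
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        Job job = new Job(conf, "process-s3-data");
        job.setJarByClass(S3JobSetup.class);
        // Mapper/reducer setup omitted; "my-bucket" is a placeholder.
        FileInputFormat.addInputPath(job, new Path("s3n://my-bucket/input/"));
        FileOutputFormat.setOutputPath(job, new Path("s3n://my-bucket/output/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}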

Kai

-- 
Kai Voigt
k@123.org




