Posted to common-user@hadoop.apache.org by sumit ghosh <su...@yahoo.com> on 2012/10/30 11:07:04 UTC

Loading Data to HDFS

Hi,

I have data on a remote machine accessible over ssh. I have Hadoop CDH4 installed on RHEL. I am planning to load quite a few petabytes of data onto HDFS.
 
Which would be the fastest method to use, and are there any projects around Hadoop that could help as well?

 
I cannot install Hadoop-Client on the remote machine.
 
Have a great Day Ahead!
Sumit.
 
 
---------------
I am attaching my previous discussion from the cdh-user list below to avoid duplication.
---------------
On Wed, Oct 24, 2012 at 9:29 PM, Alejandro Abdelnur <tu...@cloudera.com> wrote:
In addition to Jarcec's suggestions, you could use HttpFS. Then you'd only need to open a single host:port in your firewall, as all the traffic goes through it.
thx
Alejandro
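
A minimal sketch of the HttpFS route Alejandro describes, in Python with the requests library. The host name, the HttpFS default port 14000, the user name and the paths are assumptions to adjust for your cluster:

import requests

HTTPFS = "http://httpfs-host.example.com:14000"  # the single host:port opened in the firewall
USER = "sumit"                                   # hypothetical Hadoop user name

def upload(local_path, hdfs_path):
    # Step 1: ask HttpFS to create the file; it answers with a redirect
    # that points back at the same host:port, so only one hole in the
    # firewall is ever needed.
    url = f"{HTTPFS}/webhdfs/v1{hdfs_path}?op=CREATE&user.name={USER}"
    r = requests.put(url, allow_redirects=False)
    r.raise_for_status()
    # Step 2: stream the file body to the redirect target.
    with open(local_path, "rb") as f:
        requests.put(r.headers["Location"], data=f,
                     headers={"Content-Type": "application/octet-stream"}).raise_for_status()

upload("/data/part-00000", "/ingest/part-00000")  # hypothetical paths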

On Oct 24, 2012, at 8:28 AM, Jarek Jarcec Cecho <ja...@cloudera.com> wrote:
> Hi Sumit,
> there are plenty of ways to achieve that. Please find my feedback below:
>
>> Does Sqoop support loading flat files to HDFS?
>
> No, Sqoop only supports moving data from external databases and warehouse systems. Copying flat files is not supported at the moment.
>
>> Can we use distcp?
>
> No. DistCp can be used only to copy data between HDFS filesystems.
>
>> How do we use the core-site.xml file on the remote machine to use
>> copyFromLocal?
>
> Yes, you can install the Hadoop binaries on your machine (with no running Hadoop services) and use the hadoop binary to upload data; a sketch of scripting this follows the quoted thread below. The installation procedure is described in the CDH4 installation guide [1] (follow the "client" installation).
>
> Another way that I can think of is leveraging WebHDFS [2] or maybe hdfs-fuse [3]?
>
> Jarcec
>
> Links:
> 1: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation
> 2: https://ccp.cloudera.com/display/CDH4DOC/Deploying+HDFS+on+a+Cluster#DeployingHDFSonaCluster-EnablingWebHDFS
> 3: https://ccp.cloudera.com/display/CDH4DOC/Mountable+HDFS
>
> On Wed, Oct 24, 2012 at 01:33:29AM -0700, Sumit Ghosh wrote:
>>
>>
>> Hi,
>>
>> I have data on a remote machine accessible over ssh. What is the fastest
>> way to load data onto HDFS?
>>
>> Does Sqoop support loading flat files to HDFS?
>> Can we use distcp?
>> How do we use the core-site.xml file on the remote machine to use
>> copyFromLocal?
>>
>> Which would be the best to use, and are there any other open source
>> projects around Hadoop that could help as well?
>> Have a great Day Ahead!
>> Sumit
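
A minimal sketch of the client-only route Jarcec suggests above, assuming the CDH4 client packages are installed and core-site.xml points at the cluster; the local and HDFS paths are hypothetical:

import subprocess
from pathlib import Path

LOCAL_DIR = Path("/data/export")   # where the flat files sit locally
HDFS_DIR = "/ingest"               # destination directory on HDFS

for f in sorted(LOCAL_DIR.glob("*")):
    if not f.is_file():
        continue
    # One hadoop CLI call per file; -copyFromLocal refuses to overwrite
    # an existing destination, so interrupted runs are easy to detect.
    subprocess.run(
        ["hadoop", "fs", "-copyFromLocal", str(f), f"{HDFS_DIR}/{f.name}"],
        check=True,
    )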

Re: Loading Data to HDFS

Posted by Ranjith <ra...@gmail.com>.
Along the lines of the email below, have any libraries been built to copy files into the cluster in parallel, using some sort of byte-offset technique, etc.?

Thanks,
Ranjith
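
No such library comes up in this thread, but one point is worth noting: HDFS permits only a single writer per file and no writes at arbitrary byte offsets, so in practice the parallelism is per file (or per pre-split chunk uploaded as its own file), not per byte range. A minimal thread-pool sketch over the WebHDFS/HttpFS REST API; the host, port, user and paths are assumptions:

import concurrent.futures
from pathlib import Path

import requests

HTTPFS = "http://httpfs-host.example.com:14000"
USER = "sumit"

def upload(local, hdfs_path):
    # Two-step WebHDFS create: get the redirect, then stream the bytes.
    url = f"{HTTPFS}/webhdfs/v1{hdfs_path}?op=CREATE&user.name={USER}"
    r = requests.put(url, allow_redirects=False)
    r.raise_for_status()
    with open(local, "rb") as f:
        requests.put(r.headers["Location"], data=f,
                     headers={"Content-Type": "application/octet-stream"}).raise_for_status()
    return hdfs_path

files = [f for f in sorted(Path("/data/export").glob("*")) if f.is_file()]
# Eight concurrent streams; tune to what the uplink can actually sustain.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for done in pool.map(upload, files, (f"/ingest/{f.name}" for f in files)):
        print("done:", done)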

On Oct 30, 2012, at 9:24 AM, "M. C. Srivas" <mc...@gmail.com> wrote:

> Loading a petabyte from a single machine will take you about 4 months,
> assuming you can push 100MB/s (1 GigE) continuously for 24 hrs/day over
> those 4 months. Any interruptions and the 4 months will become 6 months.
> 
> You might want to consider a more parallel solution instead of a single
> gateway machine.
> 

Re: Loading Data to HDFS

Posted by "M. C. Srivas" <mc...@gmail.com>.
Loading a petabyte from a single machine will take you about 4 months,
assuming you can push 100MB/s (1 GigE) continuously for 24 hrs/day over
those 4 months. Any interruptions and the 4 months will become 6 months.

You might want to consider a more parallel solution instead of a single
gateway machine.
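
The arithmetic behind that figure, as a quick check:

SIZE = 10**15               # one petabyte, in bytes
RATE = 100 * 10**6          # 100 MB/s sustained on one GigE link
days = SIZE / RATE / 86400  # 86400 seconds per day
print(f"{days:.0f} days")   # -> 116 days, i.e. roughly 4 months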



Re: Loading Data to HDFS

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
> I don't know what you mean by gateway, but in order to get a rough idea of
> the time needed you need three values

I believe Sumit's setup is a cluster within a firewall, with the Hadoop
client machines also within the firewall; the only way to access the
cluster is to ssh from outside to one of the Hadoop client machines
and then submit your jobs. These Hadoop client machines are often
referred to as gateway machines.





-- 
Alejandro

Re: Loading Data to HDFS

Posted by sumit ghosh <su...@yahoo.com>.
Hi Bertrand,

A gateway machine is one that is usually used to connect to the Hadoop cluster; the machine itself does not run a DataNode or TaskTracker.
 
Warm Regards,
Sumit



Re: Loading Data to HDFS

Posted by Bertrand Dechoux <de...@gmail.com>.
I don't know what you mean by gateway, but in order to get a rough idea of
the time needed you need three values:
* the amount of data you want to put on Hadoop
* the Hadoop bandwidth with regard to local storage (read/write)
* the bandwidth between where your data is stored and where the Hadoop
cluster is

For the latter, for big volumes, physically moving the volumes is a viable
solution.
It will depend on your constraints of course: budget, speed...

Bertrand
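
A small sketch of how those three values combine: the load is gated by the slower of the network and HDFS write bandwidths. The figures below are made-up examples:

def load_days(data_tb, net_mb_s, hdfs_write_mb_s):
    # The end-to-end rate is the bottleneck of the two bandwidths.
    bottleneck = min(net_mb_s, hdfs_write_mb_s)   # MB/s
    seconds = data_tb * 10**6 / bottleneck        # 1 TB = 10**6 MB
    return seconds / 86400

# 1 PB over one GigE link into a cluster that can absorb 1 GB/s:
print(f"{load_days(1000, 100, 1000):.0f} days")   # network-bound: ~116 days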




-- 
Bertrand Dechoux

Re: Loading Data to HDFS

Posted by sumit ghosh <su...@yahoo.com>.
Hi Bertrand,
 
By physically moving the data, do you mean that the data volume is connected to the gateway machine and the data is loaded from the local copy using copyFromLocal?
 
Thanks,
Sumit



Re: Loading Data to HDFS

Posted by Bertrand Dechoux <de...@gmail.com>.
It might sound like an outdated approach, but can't you move the data physically?
From what I understand, it is a one-shot load and not "streaming", so it could be
a good method if you have the access, of course.

Regards

Bertrand





-- 
Bertrand Dechoux