Posted to common-user@hadoop.apache.org by Jean-Pierre <je...@247realmedia.com> on 2008/03/27 20:41:59 UTC

[Map/Reduce][HDFS]

Hello,

I'm working on a large amount of logs, and I've noticed that
distributing the data over the network (./hadoop dfs -put input input)
takes a lot of time.

Let's say that my data is already distributed among the machines; is
there any way to tell Hadoop to use the already existing
distribution?

Thanks

-- 
Jean-Pierre <je...@247realmedia.com>



RE: [Map/Reduce][HDFS]

Posted by Devaraj Das <dd...@yahoo-inc.com>.
Hi Jean, no, that is not directly possible. You have to pass your data
through the DFS client in order for it to become part of the DFS (e.g. hadoop
fs -put .., etc., or programmatically). 
(Removing core-dev from this thread, since this is really a core-user
question.)
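For concreteness, the CLI route can be sketched as a small loop (the destination and source paths below are illustrative placeholders, not from this thread; the `echo` makes it a dry run):

```shell
# Sketch: push local log directories into HDFS through the DFS client.
# DEST and the SRC list are illustrative placeholders.
DEST=/user/jp/input
for SRC in /var/logs/day1 /var/logs/day2; do
  # 'echo' prints the command instead of running it; remove it on a
  # real cluster to actually perform the copy.
  echo hadoop fs -put "$SRC" "$DEST"
done
```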

> -----Original Message-----
> From: Jean-Pierre [mailto:jean-pierre.ocalan@247realmedia.com] 
> Sent: Friday, March 28, 2008 8:58 PM
> To: core-user@hadoop.apache.org; core-dev
> Subject: Re: [Map/Reduce][HDFS]
> 
> Hello
> 
> I'm not sure I've understood... actually I've already set this 
> field in the configuration file. I think this field just 
> specifies the master for the HDFS. 
> 
> My problem is that I have many machines, each holding a 
> bunch of files which together represent the distributed data, 
> and I want to use this distribution of data with Hadoop. Maybe 
> there is another configuration file which allows me to tell 
> Hadoop how to use my file distribution.
> Is it possible? Or should I adapt my distribution of 
> data to Hadoop's?
> 
> Anyway thanks for your answer Peeyush.
> 
> On Fri, 2008-03-28 at 16:22 +0530, Peeyush Bishnoi wrote:
> > hello ,
> > 
> > Yes, you can do this by specifying in hadoop-site.xml the 
> > location of the namenode where your data is already 
> > distributed.
> > 
> > ---------------------------------------------------------------
> > <property>
> >   <name>fs.default.name</name>
> >   <value><IPAddress:PortNo></value>
> > </property>
> > 
> > ---------------------------------------------------------------
> > 
> > Thanks
> > 
> > ---
> > Peeyush
> > 
> > 
> > On Thu, 2008-03-27 at 15:41 -0400, Jean-Pierre wrote:
> > 
> > > Hello,
> > > 
> > > I'm working on a large amount of logs, and I've noticed that
> > > distributing the data over the network (./hadoop dfs -put
> > > input input) takes a lot of time.
> > > 
> > > Let's say that my data is already distributed among the
> > > machines; is there any way to tell Hadoop to use the already
> > > existing distribution?
> > > 
> > > Thanks
> > > 
> 
> 
> 


Re: [Map/Reduce][HDFS]

Posted by Ted Dunning <td...@veoh.com>.

Try running dfs -put on each of the machines that has content.  That will
give you good balance and should let you write at very high speed (depending
on your cluster size).
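Ted's suggestion can be sketched as a small driver script (the hostnames and paths below are illustrative assumptions, not from this thread; the `echo` makes it a dry run):

```shell
# Sketch: have every machine that holds a slice of the logs upload its
# own files, so the writes come from many nodes at once.
# Hostnames and paths are illustrative placeholders.
for HOST in node1 node2 node3; do
  # Dry run: drop the 'echo', append '&' to each ssh command, and add
  # a final 'wait' to launch the uploads in parallel on a real cluster.
  echo ssh "$HOST" "hadoop fs -put /local/logs /user/jp/input/$HOST"
done
```

Because each node pushes only its local files, no data crosses the network twice, which is what makes this faster than a single `-put` from one machine.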


On 3/28/08 8:27 AM, "Jean-Pierre" <je...@247realmedia.com>
wrote:

> Hello
> 
> I'm not sure I've understood... actually I've already set this field in
> the configuration file. I think this field just specifies the master
> for the HDFS. 
> 
> My problem is that I have many machines, each holding a bunch of
> files which together represent the distributed data, and I want to use
> this distribution of data with Hadoop. Maybe there is another
> configuration file which allows me to tell Hadoop how to use my file
> distribution. Is it possible? Or should I adapt my distribution of
> data to Hadoop's?
> 
> Anyway thanks for your answer Peeyush.
> 
> On Fri, 2008-03-28 at 16:22 +0530, Peeyush Bishnoi wrote:
>> hello ,
>> 
>> Yes, you can do this by specifying in hadoop-site.xml the location of
>> the namenode where your data is already distributed.
>> 
>> ---------------------------------------------------------------
>> <property>
>>   <name>fs.default.name</name>
>>   <value> <IPAddress:PortNo>  </value>
>> </property>
>> 
>> ---------------------------------------------------------------
>> 
>> Thanks
>> 
>> ---
>> Peeyush
>> 
>> 
>> On Thu, 2008-03-27 at 15:41 -0400, Jean-Pierre wrote:
>> 
>>> Hello,
>>> 
>>> I'm working on a large amount of logs, and I've noticed that
>>> distributing the data over the network (./hadoop dfs -put input input)
>>> takes a lot of time.
>>> 
>>> Let's say that my data is already distributed among the machines; is
>>> there any way to tell Hadoop to use the already existing
>>> distribution?
>>> 
>>> Thanks
>>> 
> 
> 


Re: [Map/Reduce][HDFS]

Posted by Jean-Pierre <je...@247realmedia.com>.
Hello

I'm not sure I've understood... actually I've already set this field in
the configuration file. I think this field just specifies the master
for the HDFS. 

My problem is that I have many machines, each holding a bunch of
files which together represent the distributed data, and I want to use
this distribution of data with Hadoop. Maybe there is another
configuration file which allows me to tell Hadoop how to use my file
distribution. Is it possible? Or should I adapt my distribution of
data to Hadoop's?

Anyway thanks for your answer Peeyush.

On Fri, 2008-03-28 at 16:22 +0530, Peeyush Bishnoi wrote:
> hello ,
> 
> Yes, you can do this by specifying in hadoop-site.xml the location of
> the namenode where your data is already distributed.
> 
> ---------------------------------------------------------------
> <property>
>   <name>fs.default.name</name>
>   <value> <IPAddress:PortNo>  </value>
> </property>
> 
> ---------------------------------------------------------------
> 
> Thanks
> 
> ---
> Peeyush
> 
> 
> On Thu, 2008-03-27 at 15:41 -0400, Jean-Pierre wrote:
> 
> > Hello,
> > 
> > I'm working on a large amount of logs, and I've noticed that
> > distributing the data over the network (./hadoop dfs -put input input)
> > takes a lot of time.
> > 
> > Let's say that my data is already distributed among the machines; is
> > there any way to tell Hadoop to use the already existing
> > distribution?
> > 
> > Thanks
> > 



Re: [Map/Reduce][HDFS]

Posted by Peeyush Bishnoi <pe...@yahoo-inc.com>.
hello ,

Yes, you can do this by specifying in hadoop-site.xml the location of
the namenode where your data is already distributed.

---------------------------------------------------------------
<property>
  <name>fs.default.name</name>
  <value><IPAddress:PortNo></value>
</property>

---------------------------------------------------------------
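For reference, a filled-in version of that property might look like the fragment below (the host and port are illustrative assumptions; note that fs.default.name only tells clients where the namenode runs, it does not import data already sitting on the machines' local disks):

```
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:9000</value>
</property>
```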

Thanks

---
Peeyush


On Thu, 2008-03-27 at 15:41 -0400, Jean-Pierre wrote:

> Hello,
> 
> I'm working on a large amount of logs, and I've noticed that
> distributing the data over the network (./hadoop dfs -put input input)
> takes a lot of time.
> 
> Let's say that my data is already distributed among the machines; is
> there any way to tell Hadoop to use the already existing
> distribution?
> 
> Thanks
> 
