Posted to common-user@hadoop.apache.org by Usman Waheed <us...@opera.com> on 2009/05/01 13:22:40 UTC

Multiple HDFS clients

Hi,

I just wanted to share a test we conducted on our small cluster of 3
datanodes and one namenode. Basically we have lots of data to process, and
we run a parsing script outside Hadoop that creates the key,value pairs.
This output, which is plain text files, is then imported into Hadoop using
the put/get etc. commands.
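
For reference, roughly what one of these puts looks like through Hadoop's
Java FileSystem API instead of the command line (the namenode address and
paths below are just placeholders, not our real ones):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutParsedOutput {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the client at the namenode; host/port are placeholders.
            conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
            FileSystem fs = FileSystem.get(conf);
            // Copy one locally parsed key,value file into HDFS,
            // same effect as "hadoop fs -put".
            fs.copyFromLocalFile(new Path("/tmp/parsed/part-000.txt"),
                                 new Path("/user/usman/parsed/part-000.txt"));
            fs.close();
        }
    }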

To speed things up, we run the parsing jobs in parallel on multiple
machines that are not part of our cluster (3 datanodes + namenode) but
that have the same version of Hadoop installed as the cluster, which we
use to perform the puts. This workflow has significantly reduced the time
it takes to import the data into Hadoop, after which we run the
reduce-only step to aggregate.
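
The reduce-only aggregation step is nothing fancy; a rough sketch with the
old mapred API would look something like the following (class names, paths,
and the assumption that the values are numeric counts are illustrative, not
our exact job):

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.IdentityMapper;

    public class AggregateParsed {
        // Sums the numeric values emitted by the external parser, per key.
        public static class SumReducer extends MapReduceBase
                implements Reducer<Text, Text, Text, LongWritable> {
            public void reduce(Text key, Iterator<Text> values,
                               OutputCollector<Text, LongWritable> out,
                               Reporter reporter) throws IOException {
                long sum = 0;
                while (values.hasNext()) {
                    sum += Long.parseLong(values.next().toString());
                }
                out.collect(key, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf job = new JobConf(AggregateParsed.class);
            job.setJobName("aggregate-parsed");
            // Input is tab-separated key<TAB>value text from the parser.
            job.setInputFormat(KeyValueTextInputFormat.class);
            // "Reduce-only": the identity mapper just passes records through.
            job.setMapperClass(IdentityMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.setInputPaths(job, new Path("/user/usman/parsed"));
            FileOutputFormat.setOutputPath(job, new Path("/user/usman/aggregated"));
            JobClient.runJob(job);
        }
    }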

Currently the way we insert data is through our namenode, which all the
machines outside the cluster (call them HDFS clients) connect to; they are
not part of the master/slave setup. I haven't tried it, but maybe we can
perform these puts via the datanodes themselves and not just through the
namenode? Right now the namenode is the single point through which the
HDFS client machines insert the parsed data.

Secondly, I would assume that this is a safe way to import parsed data
into Hadoop before we aggregate, and that it will most likely not cause
any data corruption in HDFS. Granted, anything can happen :).

It would also be interesting to import our raw logs and perform the
mapping step inside Hadoop versus doing it outside; I wonder if the
performance would be better, worse, or the same. Yes, this depends on many
factors, among them the number of datanodes, the amount of data to
process, the hardware, etc., and we are limited there. We are trying to
utilize machines outside the cluster which are idle and can process the
data, and then insert the output into HDFS via puts.

Your thoughts, comments, suggestions are welcome.

Thanks,
Usman




-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/

Re: Multiple HDFS clients

Posted by Usman Waheed <us...@opera.com>.
Hi Todd,

Thank you for your input. Our data is like any Apache log file(s): basic
logging info which we are parsing. We have a lot of it, which is why we
are using Hadoop :).

I will look into running TTs on the HDFS clients just for job processing,
without storing any data locally. We can then also gauge performance by
running both the map and reduce steps inside the cluster.

Thanks,
Usman

> On Fri, May 1, 2009 at 4:22 AM, Usman Waheed <us...@opera.com> wrote:
>
>> Hi,
>>
>> I just wanted to share a test we conducted in our small cluster of 3
>> datanodes and one namenode. Basically we have lots of data to process  
>> and we
>> run a parsing script outside hadoop that creates the key,value pairs.  
>> This
>> output which is plain txt files is then imported into hadoop using the
>> put/get etc commands.
>>
>
> Thanks for sharing, Usman. Some comments below:
>
>
>>
>> In order to speed up things we run the parsing jobs on multiple  
>> machines in
>> parallel which are not part of our cluster (3 datanodes + namenode) but  
>> they
>> do have the same version of hadoop installed as the cluster which we  
>> use to
>> perform the puts. This work flow has significantly improved our time to
>> import the data into HADOOP after which we run the reduce-only step to
>> aggregate.
>>
>> Currently the way to insert data is through our namenode which all the
>> machines outside the cluster call them hdfs clients connect to and are
>> not
>> part of the master/slave setup. I haven't tried but maybe we can perform
>> these puts via the datanodes themselves and not just through the  
>> namenode?
>> Right now the namenode is the single point through which the hdfs client
>> machines insert the parsed data.
>>
>
> When you do an hdfs put "to the namenode" the actual data transfer goes  
> to
> the datanodes anyway - the namenode isn't a data bottleneck. It's just  
> used
> to allocate block locations for writers, and then DFSClient connects
> directly to the DNs to transfer.
>
>
>>
>> Secondly i would assume that this is a safe way to import parsed data  
>> into
>> hadoop before we aggregate and will most likely not cause any data
>> corruption in HDFS. Granted anything can happen :).
>>
>
> Yep, this should be as safe as any other method.
>
>
>>
>> It would be interesting to import our logs and perform the mapping step
>> inside HADOOP versus doing it outside. I wonder if the performance will  
>> be
>> better, worse or the same. Yes this is dependent on many factors and  
>> one of
>> them is the amount of datanodes, data to process, hardware etc we have  
>> but
>> we are limited. We are trying to utilize machines outside the cluster  
>> which
>> are idle and can process info and then insert the output into HADOOP  
>> HDFS
>> via puts.
>>
>
> The performance is probably comparable on a small cluster like this,
> depending on what the ratio of input/output data is. The advantage of  
> doing
> the writes from within the cluster is that the namenode will try to  
> allocate
> block locations on the local node, so there is less total transfer into  
> the
> cluster. In a larger cluster, you might end up with a network bottleneck
> going into the cluster, but with three nodes and any reasonable switch  
> you
> shouldn't be running into that.
>
> What does your input data look like? It might make more sense to upload  
> it
> directly to the cluster and then use a MR job to perform the  
> transformation.
> This way you don't have to worry about doing that distribution yourself.  
> If
> you want to make use of those "extra" nodes that aren't part of your
> cluster, you could probably just run TTs on them without running DNs.
>
> -Todd



-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/

Re: Multiple HDFS clients

Posted by Todd Lipcon <to...@cloudera.com>.
On Fri, May 1, 2009 at 4:22 AM, Usman Waheed <us...@opera.com> wrote:

> Hi,
>
> I just wanted to share a test we conducted in our small cluster of 3
> datanodes and one namenode. Basically we have lots of data to process and we
> run a parsing script outside hadoop that creates the key,value pairs. This
> output which is plain txt files is then imported into hadoop using the
> put/get etc commands.
>

Thanks for sharing, Usman. Some comments below:


>
> In order to speed up things we run the parsing jobs on multiple machines in
> parallel which are not part of our cluster (3 datanodes + namenode) but they
> do have the same version of hadoop installed as the cluster which we use to
> perform the puts. This work flow has significantly improved our time to
> import the data into HADOOP after which we run the reduce-only step to
> aggregate.
>
> Currently the way to insert data is through our namenode which all the
> machines outside the cluster call them hdfs clients connect to and are not
> part of the master/slave setup. I haven't tried but maybe we can perform
> these puts via the datanodes themselves and not just through the namenode?
> Right now the namenode is the single point through which the hdfs client
> machines insert the parsed data.
>

When you do an hdfs put "to the namenode" the actual data transfer goes to
the datanodes anyway - the namenode isn't a data bottleneck. It's just used
to allocate block locations for writers, and then DFSClient connects
directly to the DNs to transfer.
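
To sketch it (hostnames and paths are just placeholders): the only cluster
address a client needs in its configuration is the namenode's, and the
bytes it writes get streamed to the datanodes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ClientWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The client is only told where the namenode is. DFSClient asks
            // it for block locations, then streams data to the datanodes.
            conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
            FileSystem fs = FileSystem.get(conf);
            FSDataOutputStream out = fs.create(new Path("/user/usman/parsed/sample.txt"));
            out.writeBytes("key1\t42\n"); // payload goes to datanodes, not the namenode
            out.close();
            fs.close();
        }
    }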


>
> Secondly i would assume that this is a safe way to import parsed data into
> hadoop before we aggregate and will most likely not cause any data
> corruption in HDFS. Granted anything can happen :).
>

Yep, this should be as safe as any other method.


>
> It would be interesting to import our logs and perform the mapping step
> inside HADOOP versus doing it outside. I wonder if the performance will be
> better, worse or the same. Yes this is dependent on many factors and one of
> them is the amount of datanodes, data to process, hardware etc we have but
> we are limited. We are trying to utilize machines outside the cluster which
> are idle and can process info and then insert the output into HADOOP HDFS
> via puts.
>

The performance is probably comparable on a small cluster like this,
depending on what the ratio of input/output data is. The advantage of doing
the writes from within the cluster is that the namenode will try to allocate
block locations on the local node, so there is less total transfer into the
cluster. In a larger cluster, you might end up with a network bottleneck
going into the cluster, but with three nodes and any reasonable switch you
shouldn't be running into that.

What does your input data look like? It might make more sense to upload it
directly to the cluster and then use a MR job to perform the transformation.
This way you don't have to worry about doing that distribution yourself. If
you want to make use of those "extra" nodes that aren't part of your
cluster, you could probably just run TTs on them without running DNs.
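
For example, a hypothetical mapper along these lines could replace the
external parsing script (the log format and the choice of key/value are
invented here; run it map-only with setNumReduceTasks(0), or attach your
aggregation reducer to parse and aggregate in one pass):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Turns one raw log line into a key,value record inside the cluster.
    public class LogParseMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            String[] fields = line.toString().split(" ");
            if (fields.length > 6) {
                // Made-up choice: key on the request path, count occurrences.
                out.collect(new Text(fields[6]), new Text("1"));
            }
        }
    }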

-Todd