Posted to common-user@hadoop.apache.org by Raghu Angadi <ra...@yahoo-inc.com> on 2009/03/12 22:44:40 UTC

Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

TCK wrote:
> How well does the read throughput from HDFS scale with the number of data nodes?
> For example, if I had a large file (say 10 GB) on a 10 data node cluster, would
> the time taken to read this whole file in parallel (i.e., with multiple reader
> client processes requesting different parts of the file in parallel) be halved
> if I had the same file on a 20 data node cluster?

It depends: yes, if whatever was the bottleneck with 10 nodes is still the
bottleneck with 20 (i.e. you are able to saturate it in both cases) and that
resource is the one that got scaled (disk or network).

> Is this not possible because HDFS doesn't support random seeks? 
HDFS does support random seeks for reading, so your case should work.
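
Just to make that concrete, here is a minimal sketch of a parallel range
reader against the Hadoop FileSystem API (the path, reader count, and buffer
size are made-up illustration values, not anything from your setup): each
thread opens its own stream, seeks to the start of its byte range, and reads
sequentially from there.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelRangeRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path("/logs/big.log");        // hypothetical 10 GB file
    FileSystem fs = FileSystem.get(file.toUri(), conf);
    long len = fs.getFileStatus(file).getLen();

    int readers = 8;                              // one client per byte range
    long chunk = (len + readers - 1) / readers;
    Thread[] threads = new Thread[readers];

    for (int i = 0; i < readers; i++) {
      final long start = i * chunk;
      final long end = Math.min(start + chunk, len);
      threads[i] = new Thread(() -> {
        byte[] buf = new byte[128 * 1024];
        try (FSDataInputStream in = fs.open(file)) {
          in.seek(start);                         // random seek to this range
          long pos = start;
          while (pos < end) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, end - pos));
            if (n < 0) break;
            pos += n;                             // process buf[0..n) here
          }
        } catch (Exception e) {
          e.printStackTrace();
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) t.join();
  }
}

With enough such readers spread across enough client machines, the aggregate
rate is bounded by the datanodes' disks and NICs, which is where adding data
nodes helps.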

Raghu.


> What about if the file was split up into multiple smaller files before placing it in HDFS?
> Thanks for your input.
> -TCK
> 
> 
> 
> 
> --- On Wed, 2/4/09, Brian Bockelman <bb...@cse.unl.edu> wrote:
> From: Brian Bockelman <bb...@cse.unl.edu>
> Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
> To: core-user@hadoop.apache.org
> Date: Wednesday, February 4, 2009, 1:50 PM
> 
> Sounds overly complicated.  Complicated usually leads to mistakes :)
> 
> What about just having a single cluster and only running the tasktrackers on
> the fast CPUs?  No messy cross-cluster transferring.
> 
> Brian
> 
> On Feb 4, 2009, at 12:46 PM, TCK wrote:
> 
>>
>> Thanks, Brian. This sounds encouraging for us.
>>
>> What are the advantages/disadvantages of keeping a persistent storage
>> (HD/K)FS cluster separate from a processing Hadoop+(HD/K)FS cluster?
>>
>> The advantage I can think of is that a permanent storage cluster has
>> different requirements from a map-reduce processing cluster -- the
>> permanent storage cluster would need faster, bigger hard disks, and would
>> need to grow as the total volume of all collected logs grows, whereas the
>> processing cluster would need fast CPUs and would only need to grow with
>> the rate of incoming data. So it seems to make sense to me to copy a piece
>> of data from the permanent storage cluster to the processing cluster only
>> when it needs to be processed. Is my line of thinking reasonable? How
>> would this compare to running the map-reduce processing on the same
>> cluster as the data is stored in? Which approach is used by most people?
>> Best Regards,
>> TCK
>>
>>
>>
>> --- On Wed, 2/4/09, Brian Bockelman <bb...@cse.unl.edu> wrote:
>> From: Brian Bockelman <bb...@cse.unl.edu>
>> Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
>> To: core-user@hadoop.apache.org
>> Date: Wednesday, February 4, 2009, 1:06 PM
>>
>> Hey TCK,
>>
>> We use HDFS+FUSE solely as a storage solution for an application which
>> doesn't understand MapReduce.  We've scaled this solution to around
>> 80Gbps.  For 300 processes reading from the same file, we get about 20Gbps.
>>
>> Do consider your data retention policies -- I would say that Hadoop as a
>> storage system is thus far about 99% reliable for storage and is not a
>> backup solution.  If you're scared of losing more than 1% of your logs,
>> have a good backup solution.  I would also add that while you are learning
>> your operational staff's abilities, expect even more data loss.  As you
>> gain experience, data loss goes down.
>>
>> I don't believe we've lost a single block in the last month, but it
>> took us 2-3 months of 1%-level losses to get here.
>>
>> Brian
>>
>> On Feb 4, 2009, at 11:51 AM, TCK wrote:
>>
>>> Hey guys,
>>>
>>> We have been using Hadoop to do batch processing of logs. The logs get
>>> written and stored on a NAS. Our Hadoop cluster periodically copies a
>>> batch of new logs from the NAS, via NFS, into Hadoop's HDFS, processes
>>> them, and copies the output back to the NAS. The HDFS is cleaned up at
>>> the end of each batch (i.e., everything in it is deleted).
>>>
>>> The problem is that reads off the NAS via NFS don't scale, even if we
>>> try to scale the copying process by adding more threads to read in
>>> parallel.
>>>
>>> If we instead stored the log files on an HDFS cluster (instead of the
>>> NAS), it seems like the reads would scale since the data can be read
>>> from multiple data nodes at the same time without any contention
>>> (except network IO, which shouldn't be a problem).
>>>
>>> I would appreciate it if anyone could share any similar experience they
>>> have had with doing parallel reads from a storage HDFS.
>>>
>>> Also, is it a good idea to have a separate HDFS for storage vs. one for
>>> doing the batch processing?
>>>
>>> Best Regards,
>>> TCK


Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

Posted by Sriram Rao <sr...@gmail.com>.
Hey TCK,

We operate a large cluster in which we run both HDFS and KFS in the same
cluster, on the same nodes.  We run two instances of KFS and one instance
of HDFS in the cluster:
 - Our logs are in KFS, and we have KFS set up in WORM mode (a mode in
which deletions/renames on files/dirs are permitted only on files with a
.tmp extension).
 - Map/reduce jobs read from the WORM instance and can write to HDFS or to
the KFS instance set up in r/w mode.
 - For archival purposes, we back up data between the two different DFS
implementations.

The throughput you get depends on your cluster setup: in our case, we
have 4 1-TB disks on each node that we can push at 100 MB/s apiece.
In JBOD mode, in theory, we can get 400 MB/s.  With a 1 Gbps NIC, the
theoretical limit is 125 MB/s.
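
To spell out that arithmetic (a back-of-envelope sketch; the figures are the
ones quoted above, not measurements of any particular cluster):

public class ThroughputEstimate {
  public static void main(String[] args) {
    double perDiskMBps = 100.0;            // one 1-TB drive, sequential read
    int disksPerNode = 4;                  // JBOD, no RAID overhead assumed
    double nicMBps = 1000.0 / 8.0;         // 1 Gbps NIC ~ 125 MB/s

    double localLimit = perDiskMBps * disksPerNode;       // 400 MB/s
    double remoteLimit = Math.min(localLimit, nicMBps);   // 125 MB/s

    System.out.printf("local (disk) limit:  %.0f MB/s%n", localLimit);
    System.out.printf("remote (NIC) limit:  %.0f MB/s%n", remoteLimit);
  }
}

So on this hardware a remote reader is capped by the NIC well before the
disks run out; only node-local reads see the full 400 MB/s.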

Sriram
