Posted to common-user@hadoop.apache.org by Sriram Rao <sr...@gmail.com> on 2009/03/13 07:02:00 UTC

Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

Hey TCK,

We operate a large cluster in which we run both HDFS and KFS on the
same nodes.  We run two instances of KFS and one instance of HDFS in
the cluster:
 - Our logs are in KFS, and that KFS instance is set up in WORM mode
(a mode in which deletions/renames of files/dirs are permitted only
on files with a .tmp extension).
 - Map/reduce jobs read from the WORM instance and can write to HDFS
or to the KFS instance set up in r/w mode.
 - For archival purposes, we back up data between the two different
DFS implementations (see the sketch below).
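
To give a concrete idea of what that archival copy looks like, here is
a minimal sketch using the stock Hadoop FileSystem API.  The host
names, ports, and paths are made up, and it assumes the KFS bindings
for Hadoop (the kfs:// scheme) are configured on the client:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class ArchiveCopy {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Source: the WORM KFS instance that holds the logs (hypothetical host/port).
        FileSystem srcFs =
            FileSystem.get(URI.create("kfs://kfs-meta.example.com:20000"), conf);
        // Destination: the HDFS instance that takes the archival copy (hypothetical).
        FileSystem dstFs =
            FileSystem.get(URI.create("hdfs://hdfs-nn.example.com:9000"), conf);
        Path src = new Path("/logs/2009/03/12");
        Path dst = new Path("/archive/logs/2009/03/12");
        // Recursive copy; do not delete the source (the WORM setup would
        // refuse the delete anyway).
        FileUtil.copy(srcFs, src, dstFs, dst, false, conf);
      }
    }

When there is a lot of data to move, a distcp job between the two file
systems does the same thing in parallel.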

The throughput you get depends on your cluster setup: in our case, we
have four 1-TB disks on each node, each of which we can push at
100 MB/s.  In JBOD mode, in theory, we can get 400 MB/s of aggregate
disk bandwidth per node.  With a 1 Gbps NIC, the theoretical limit on
the network side is 125 MB/s, so remote reads hit the NIC long before
the disks saturate.
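
As a back-of-the-envelope check of those numbers (purely illustrative,
with the figures above hard-coded):

    public class NodeThroughput {
      public static void main(String[] args) {
        double diskMBps = 100.0;          // per-disk streaming rate
        int disks = 4;                    // JBOD, no striping
        double nicMBps = 1000.0 / 8.0;    // 1 Gbps NIC -> 125 MB/s
        double localCeiling  = disks * diskMBps;                 // 400 MB/s
        double remoteCeiling = Math.min(localCeiling, nicMBps);  // 125 MB/s
        System.out.printf("local read ceiling:  %.0f MB/s%n", localCeiling);
        System.out.printf("remote read ceiling: %.0f MB/s%n", remoteCeiling);
      }
    }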

Sriram

>>>
>>> Thanks, Brian. This sounds encouraging for us.
>>>
>>> What are the advantages/disadvantages of keeping a persistent storage
>>> (HD/K)FS cluster separate from a processing Hadoop+(HD/K)FS cluster?
>>>
>>> The advantage I can think of is that a permanent storage cluster has
>>> different requirements from a map-reduce processing cluster -- the
>>> permanent storage cluster would need faster, bigger hard disks, and
>>> would need to grow as the total volume of all collected logs grows,
>>> whereas the processing cluster would need fast CPUs and would only
>>> need to grow with the rate of incoming data.  So it seems to make
>>> sense to me to copy a piece of data from the permanent storage
>>> cluster to the processing cluster only when it needs to be processed.
>>> Is my line of thinking reasonable? How would this compare to running
>>> the map-reduce processing on the same cluster as the data is stored
>>> in? Which approach is used by most people?
>>>
>>> Best Regards,
>>> TCK
>>>
>>>
>>>
>>> --- On Wed, 2/4/09, Brian Bockelman <bb...@cse.unl.edu> wrote:
>>> From: Brian Bockelman <bb...@cse.unl.edu>
>>> Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
>>>
>>> To: core-user@hadoop.apache.org
>>> Date: Wednesday, February 4, 2009, 1:06 PM
>>>
>>> Hey TCK,
>>>
>>> We use HDFS+FUSE solely as a storage solution for an application
>>> which doesn't understand MapReduce.  We've scaled this solution to
>>> around 80Gbps.  For 300 processes reading from the same file, we get
>>> about 20Gbps.
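
For context: with HDFS mounted through FUSE, each of those reader
processes is just doing ordinary POSIX file I/O against the mount
point.  Roughly, each reader does the equivalent of the following
(the mount point and file name are invented for illustration):

    import java.io.FileInputStream;
    import java.io.IOException;

    public class FuseReader {
      public static void main(String[] args) throws IOException {
        // HDFS exposed as a POSIX file system at an assumed FUSE mount point.
        String path = "/mnt/hdfs/data/shared-input.dat";
        byte[] buf = new byte[1 << 20];   // 1 MB read buffer
        long total = 0;
        FileInputStream in = new FileInputStream(path);
        try {
          int n;
          while ((n = in.read(buf)) > 0) {
            total += n;                   // application would consume buf[0..n) here
          }
        } finally {
          in.close();
        }
        System.out.println("read " + total + " bytes through the FUSE mount");
      }
    }
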
>>>
>>> Do consider your data retention policies -- I would say that Hadoop
>>> as a storage system is thus far about 99% reliable for storage and is
>>> not a backup solution.  If you're scared of getting more than 1% of
>>> your logs lost, have a good backup solution.  I would also add that
>>> when you are learning your operational staff's abilities, expect even
>>> more data loss.  As you gain experience, data loss goes down.
>>>
>>> I don't believe we've lost a single block in the last month, but it
>>> took us 2-3 months of 1%-level losses to get here.
>>>
>>> Brian
>>>
>>> On Feb 4, 2009, at 11:51 AM, TCK wrote:
>>>
>>>> Hey guys,
>>>>
>>>> We have been using Hadoop to do batch processing of logs. The logs
>>>> get written and stored on a NAS. Our Hadoop cluster periodically
>>>> copies a batch of new logs from the NAS, via NFS, into Hadoop's
>>>> HDFS, processes them, and copies the output back to the NAS. The
>>>> HDFS is cleaned up at the end of each batch (i.e., everything in it
>>>> is deleted).
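
As an aside, the copy-in / process / copy-out cycle described above
maps onto the Hadoop FileSystem API roughly as follows.  This is only
a sketch: the NFS mount point and paths are invented, and the actual
map/reduce job submission is omitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BatchCycle {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The default file system is assumed to be the processing cluster's HDFS.
        FileSystem hdfs = FileSystem.get(conf);

        Path nasIn   = new Path("/mnt/nas/logs/batch-0042");     // NFS mount (hypothetical)
        Path nasOut  = new Path("/mnt/nas/results/batch-0042");
        Path hdfsIn  = new Path("/batch/input");
        Path hdfsOut = new Path("/batch/output");

        // 1. Copy the new batch of logs from the NAS (via its NFS mount) into HDFS.
        hdfs.copyFromLocalFile(false, true, nasIn, hdfsIn);

        // 2. ... run the map/reduce job over hdfsIn, writing to hdfsOut ...

        // 3. Copy the job output back to the NAS.
        hdfs.copyToLocalFile(hdfsOut, nasOut);

        // 4. Clean up HDFS at the end of the batch.
        hdfs.delete(hdfsIn, true);
        hdfs.delete(hdfsOut, true);
      }
    }

The copy-in step is the part bounded by the NAS/NFS link, no matter
how many threads drive it.
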
>>>>
>>>> The problem is that reads off the NAS via NFS don't scale, even if
>>>> we try to scale the copying process by adding more threads to read
>>>> in parallel.
>>>>
>>>> If we instead stored the log files on an HDFS cluster (instead of a
>>>> NAS), it seems like the reads would scale since the data can be read
>>>> from multiple data nodes at the same time without any contention
>>>> (except network IO, which shouldn't be a problem).
>>>>
>>>> I would appreciate it if anyone could share any similar experience
>>>> they have had with doing parallel reads from a storage HDFS.
>>>>
>>>> Also, is it a good idea to have a separate HDFS for storage vs. for
>>>> doing the batch processing?
>>>>
>>>> Best Regards,
>>>> TCK