Posted to common-user@hadoop.apache.org by TCK <mo...@yahoo.com> on 2009/02/04 18:51:28 UTC

Batch processing with Hadoop -- does HDFS scale for parallel reads?

Hey guys, 

We have been using Hadoop to do batch processing of logs. The logs get written and stored on a NAS. Our Hadoop cluster periodically copies a batch of new logs from the NAS, via NFS, into HDFS, processes them, and copies the output back to the NAS. HDFS is cleaned up at the end of each batch (i.e., everything in it is deleted).
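
(For concreteness, the copy-in step amounts to something like the sketch below against the standard FileSystem API -- the class name and paths are made-up placeholders:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyLogsIn {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);  // fs.default.name from the config
        // The NAS is NFS-mounted, so it looks like a local path to the client.
        Path nasBatch = new Path("file:///mnt/nas/logs/batch-0001");  // placeholder
        Path hdfsDir = new Path("/input/batch-0001");                 // placeholder
        // copyFromLocalFile(delSrc, overwrite, src, dst)
        hdfs.copyFromLocalFile(false, true, nasBatch, hdfsDir);
      }
    }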

The problem is that reads off the NAS via NFS don't scale, even when we add more reader threads to parallelize the copy.

If we instead stored the log files on an HDFS cluster, it seems like the reads would scale, since the data could be read from multiple data nodes at the same time without any contention (except network I/O, which shouldn't be a problem).

I would appreciate it if anyone could share similar experiences they have had with doing parallel reads from a storage HDFS.

Also, is it a good idea to have a separate HDFS for storage vs. one for doing the batch processing?

Best Regards,
TCK


Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

Posted by Sriram Rao <sr...@gmail.com>.
Hey TCK,

We operate a large cluster in which we run HDFS and KFS side by side on the same nodes -- two instances of KFS and one instance of HDFS:
 - Our logs are in KFS, which we have set up in WORM mode (a mode in which deletions/renames on files/dirs are permitted only for files with a .tmp extension).
 - Map/reduce jobs read from the WORM instance and can write to HDFS or to the KFS instance set up in r/w mode.
 - For archival purposes, we back up data between the two different DFS implementations.
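
(The archival step is conceptually just a FileSystem-to-FileSystem copy; for real volume you would use something like distcp, but here is a minimal sketch -- the kfs:// and hdfs:// host/port URIs and the paths are placeholders:)

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class ArchiveCopy {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Source: the r/w KFS instance; destination: HDFS.  URIs are placeholders.
        FileSystem srcFs = FileSystem.get(URI.create("kfs://kfs-meta:20000/"), conf);
        FileSystem dstFs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
        // FileUtil.copy streams between any two FileSystem implementations.
        FileUtil.copy(srcFs, new Path("/logs/2009/02/04"),
                      dstFs, new Path("/archive/logs/2009/02/04"),
                      false /* don't delete source */, conf);
      }
    }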

The throughput you get depends on your cluster setup: in our case, we have four 1-TB disks on each node, each of which we can push at 100 MB/s.  In JBOD mode, in theory, we can get 400 MB/s per node.  With a 1 Gbps NIC, though, the theoretical limit is 125 MB/s (10^9 bits/s divided by 8), so the network, not the disks, is the per-node bottleneck for remote readers.

Sriram


Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
TCK wrote:
> How well does the read throughput from HDFS scale with the number of data nodes?
> For example, if I had a large file (say 10 GB) on a 10-data-node cluster, would the time taken to read this whole file in parallel (i.e., with multiple reader client processes requesting different parts of the file in parallel) be halved if I had the same file on a 20-data-node cluster?

It depends: yes, if whatever was the bottleneck with 10 nodes continues to be the bottleneck with 20 (i.e., you are able to saturate it in both cases) and that resource is what got scaled (disk or network).

> Is this not possible because HDFS doesn't support random seeks?

HDFS does support random seeks for reading, so your case should work.
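
For example, FSDataInputStream gives you both seek() and positioned reads; a minimal sketch (the path and offsets are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RangeRead {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(new Path("/input/big-10gb.log"));
        byte[] buf = new byte[64 * 1024];
        // Positioned read: thread-safe, leaves the stream's own offset alone.
        int n = in.read(5L * 1024 * 1024 * 1024, buf, 0, buf.length);
        // Classic seek + read also works (this one does move the offset).
        in.seek(1024L * 1024 * 1024);
        int m = in.read(buf);
        in.close();
        System.out.println("read " + n + " and " + m + " bytes");
      }
    }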

Raghu.


Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Feb 6, 2009, at 11:00 AM, TCK wrote:

>
> How well does the read throughput from HDFS scale with the number of data nodes?
> For example, if I had a large file (say 10 GB) on a 10-data-node cluster, would the time taken to read this whole file in parallel (i.e., with multiple reader client processes requesting different parts of the file in parallel) be halved if I had the same file on a 20-data-node cluster?

Possibly.  (I don't give a firm answer because the answer depends on the number of chunks and the number of replicas.)

If there are enough replicas and enough separate reading processes with enough network bandwidth, then yes, your read bandwidth could double.
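
(If one file is the hot spot, you can also raise its replication factor after the fact so that more datanodes hold each block; a quick sketch, with a made-up path:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BumpReplication {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // More replicas means more datanodes that can serve each block to readers.
        fs.setReplication(new Path("/input/big-10gb.log"), (short) 10);
      }
    }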

> Is this not possible because HDFS doesn't support random seeks?

It does for reads.  It does not for writes.

Trust me, our physicists have what can best be described as "the most god-awful random read patterns you've seen in your life" and they do fine on HDFS.

> What about if the file was split up into multiple smaller files before placing it in HDFS?

Then things would be less efficient and you'd be less likely to scale.

Brian


Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

Posted by TCK <mo...@yahoo.com>.
How well does the read throughput from HDFS scale with the number of data nodes?
For example, if I had a large file (say 10 GB) on a 10-data-node cluster, would the time taken to read this whole file in parallel (i.e., with multiple reader client processes requesting different parts of the file in parallel) be halved if I had the same file on a 20-data-node cluster? Is this not possible because HDFS doesn't support random seeks? What about if the file was split up into multiple smaller files before placing it in HDFS?
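
(To be concrete, by "reading in parallel" I mean something like the sketch below, where each thread pulls a different byte range with positioned reads; the path, thread count, and buffer size are made up.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ParallelRangeRead {
      public static void main(String[] args) throws Exception {
        final FileSystem fs = FileSystem.get(new Configuration());
        final Path file = new Path("/input/big-10gb.log");  // placeholder
        final long len = fs.getFileStatus(file).getLen();
        final int readers = 8;
        final long chunk = (len + readers - 1) / readers;
        Thread[] threads = new Thread[readers];
        for (int i = 0; i < readers; i++) {
          final long start = i * chunk;
          final long end = Math.min(start + chunk, len);
          threads[i] = new Thread(new Runnable() {
            public void run() {
              try {
                // Each reader opens its own stream and reads only its byte range.
                FSDataInputStream in = fs.open(file);
                byte[] buf = new byte[1 << 20];
                long pos = start;
                while (pos < end) {
                  int want = (int) Math.min(buf.length, end - pos);
                  int got = in.read(pos, buf, 0, want);  // positioned read
                  if (got < 0) break;
                  pos += got;  // process buf[0..got) here
                }
                in.close();
              } catch (Exception e) {
                e.printStackTrace();
              }
            }
          });
          threads[i].start();
        }
        for (Thread t : threads) t.join();
      }
    }
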
Thanks for your input.
-TCK


Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Sounds overly complicated.  Complicated usually leads to mistakes :)

What about just having a single cluster and only running the tasktrackers on the fast CPUs?  No messy cross-cluster transferring.

Brian


Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

Posted by TCK <mo...@yahoo.com>.

Thanks, Brian. This sounds encouraging for us.

What are the advantages/disadvantages of keeping a persistent storage (HD/K)FS cluster separate from a processing Hadoop+(HD/K)FS cluster?
The advantage I can think of is that a permanent storage cluster has different requirements from a map-reduce processing cluster: the permanent storage cluster would need faster, bigger hard disks and would need to grow as the total volume of collected logs grows, whereas the processing cluster would need fast CPUs and would only need to grow with the rate of incoming data. So it seems to make sense to me to copy a piece of data from the permanent storage cluster to the processing cluster only when it needs to be processed. Is my line of thinking reasonable? How would this compare to running the map-reduce processing on the same cluster the data is stored in? Which approach is used by most people?

Best Regards,
TCK


Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey TCK,

We use HDFS+FUSE solely as a storage solution for an application which doesn't understand MapReduce.  We've scaled this solution to around 80 Gbps.  For 300 processes reading from the same file, we get about 20 Gbps.

Do consider your data retention policies -- I would say that Hadoop as a storage system is thus far about 99% reliable for storage and is not a backup solution.  If you're scared of losing more than 1% of your logs, have a good backup solution.  I would also add that while you are still learning your operational staff's abilities, you should expect even more data loss.  As you gain experience, data loss goes down.

I don't believe we've lost a single block in the last month, but it took us 2-3 months of 1%-level losses to get here.

Brian
