Posted to common-user@hadoop.apache.org by Steve Sapovits <ss...@invitemedia.com> on 2008/02/26 02:12:41 UTC

long write operations and data recovery

If I have a write operation that takes a while between opening and closing
the file, what is the effect of a node doing that writing crashing in the middle?
For example, suppose I have large logs that I write to continually, rolling them
every N minutes (say every hour for the sake of discussion).  If I have the file
opened and am 90% done with my writes and things crash, what happens to the
data I've "written" ... realizing that at some level, data isn't visible to the rest
of the cluster until the file is closed.

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com


RE: long write operations and data recovery

Posted by dhruba Borthakur <dh...@yahoo-inc.com>.
It would be nice if a layer on top of the dfs client could be built to handle
disconnected operation. That layer could cache files on local disk if HDFS
is unavailable, and then upload those files into HDFS when the HDFS
service comes back online. I think such a service would be helpful for
most HDFS installations.
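
A minimal sketch of what such a layer might look like, assuming the standard
FileSystem/Path API. The class name, spool directory, and drain policy are all
hypothetical; this is not an existing Hadoop facility:

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical write-through layer: spool locally, drain into HDFS once it is reachable. */
public class SpoolingUploader {
    private final File spoolDir;      // local staging area used while HDFS is unavailable
    private final Path hdfsDir;       // destination directory in HDFS
    private final Configuration conf;

    public SpoolingUploader(File spoolDir, Path hdfsDir, Configuration conf) {
        this.spoolDir = spoolDir;
        this.hdfsDir = hdfsDir;
        this.conf = conf;
    }

    /** Try to push every spooled file into HDFS; leave files in place if HDFS is still down. */
    public void drain() {
        File[] pending = spoolDir.listFiles();
        if (pending == null) {
            return;                   // nothing spooled yet
        }
        try {
            FileSystem fs = FileSystem.get(conf);   // fails here or on the copy if HDFS is unreachable
            for (File f : pending) {
                Path dst = new Path(hdfsDir, f.getName());
                fs.copyFromLocalFile(new Path(f.getAbsolutePath()), dst);
                f.delete();                         // only after a successful copy
            }
        } catch (IOException e) {
            // HDFS still down: keep the spooled files and retry on the next drain() call
        }
    }
}

A caller would write to the spool directory whenever HDFS writes fail and invoke
drain() periodically (say, from a timer) until the backlog clears.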

Thanks,
dhruba

-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Friday, February 29, 2008 11:33 AM
To: core-user@hadoop.apache.org
Subject: Re: long write operations and data recovery


Unless your volume is MUCH higher than ours, I think you can get by with a
relatively small farm of log consolidators that collect and concatenate
files.

If each log line is 100 bytes after compression (that is huge really) and
you have 10,000 events per second (also pretty danged high) then you are
only writing 1MB/s.  If you need a day of buffering (=100,000 seconds), then
you need 100GB of buffer storage.  These are very, very moderate
requirements for your ingestion point.


On 2/29/08 11:18 AM, "Steve Sapovits" <ss...@invitemedia.com> wrote:

> Ted Dunning wrote:
> 
>> In our case, we looked at the problem and decided that Hadoop wasn't
>> feasible for our real-time needs in any case.  There were several
>> issues,
>> 
>> - first of all, map-reduce itself didn't seem very plausible for
>> real-time applications.  That left hbase and hdfs as the capabilities
>> offered by hadoop (for real-time stuff)
> 
> We'll be using map-reduce batch mode, so we're okay there.
> 
>> The upshot is that we use hadoop extensively for batch operations
>> where it really shines.  The other nice effect is that we don't have
>> to worry all that much about HA (at least not real-time HA) since we
>> don't do real-time with hadoop.
> 
> What I'm struggling with is the write side of things.  We'll have a huge
> amount of data to write that's essentially a log format.  It would seem
> that writing that outside of HDFS then trying to batch import it would
> be a losing battle -- that you would need the distributed nature of HDFS
> to do very large volume writes directly and wouldn't easily be able to take
> some other flat storage model and feed it in as a secondary step without
> having the HDFS side start to lag behind.
> 
> The realization is that Name Node could go down so we'll have to have a
> backup store that might be used during temporary outages, but that
> most of the writes would be direct HDFS updates.
> 
> The alternative would seem to be to end up with a set of distributed files
> without some unifying distributed file system (e.g., like lots of Apache
> web logs on many many individual boxes) and then have to come up with
> some way to funnel those back into HDFS.


Re: long write operations and data recovery

Posted by Andy Li <an...@gmail.com>.
What about a hot standby namenode?

For a write-ahead log used for crash protection and recovery, I think this is
fine for small I/O.  For large volumes, though, the write-ahead log takes up a
lot of the system's I/O capacity, since each block costs two I/Os (one for the
log and one for the actual data).  This falls back on how current database
designs implement crash recovery.

Another thing I don't see in the picture is how Hadoop manages file system
instructions.  Each OS has a different file system implementation, and I believe
that calling 'write' or 'flush' does not really flush the data to disk.  I'm not
sure if this is inevitable and OS dependent, but I cannot find any documents
describing how Hadoop handles this.
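
For what it's worth, later versions of the HDFS client expose explicit calls for
this. A minimal sketch, assuming the FSDataOutputStream hflush()/hsync() API that
appeared in later Hadoop releases (not necessarily the version discussed here):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FlushExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/logs/current.log"));  // example path
        out.write("one event line\n".getBytes("UTF-8"));
        // hflush() pushes buffered data out to the datanodes so new readers can see it;
        // hsync() additionally asks the datanodes to sync to their local disks.  Neither
        // is the same as a plain OS-level flush, which is the distinction raised above.
        out.hflush();
        out.hsync();
        out.close();
    }
}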

P.S. I handle HA and fail-over in my own application, but I think for a
framework it should be transparent (or semi-transparent) to the user.

-annndy

On Fri, Feb 29, 2008 at 1:54 PM, Joydeep Sen Sarma <js...@facebook.com>
wrote:

> I would agree with Ted. You should easily be able to get 100MBps write
> throughput on a standard Netapp box (with read bandwidth left over -
> since the peak write throughput rating is more than twice that). Even
> at an average write throughput rate of 50MBps - the daily data volume
> would be (drumroll ..) 4+TB!
>
> So buffer to a decent box and copy stuff over ..
>
> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com]
> Sent: Friday, February 29, 2008 11:33 AM
> To: core-user@hadoop.apache.org
> Subject: Re: long write operations and data recovery
>
>
> Unless your volume is MUCH higher than ours, I think you can get by with a
> relatively small farm of log consolidators that collect and concatenate
> files.
>
> If each log line is 100 bytes after compression (that is huge really) and
> you have 10,000 events per second (also pretty danged high) then you are
> only writing 1MB/s.  If you need a day of buffering (=100,000 seconds), then
> you need 100GB of buffer storage.  These are very, very moderate
> requirements for your ingestion point.
>
>
> On 2/29/08 11:18 AM, "Steve Sapovits" <ss...@invitemedia.com> wrote:
>
> > Ted Dunning wrote:
> >
> >> In our case, we looked at the problem and decided that Hadoop wasn't
> >> feasible for our real-time needs in any case.  There were several
> >> issues,
> >>
> >> - first of all, map-reduce itself didn't seem very plausible for
> >> real-time applications.  That left hbase and hdfs as the capabilities
> >> offered by hadoop (for real-time stuff)
> >
> > We'll be using map-reduce batch mode, so we're okay there.
> >
> >> The upshot is that we use hadoop extensively for batch operations
> >> where it really shines.  The other nice effect is that we don't have
> >> to worry all that much about HA (at least not real-time HA) since we
> >> don't do real-time with hadoop.
> >
> > What I'm struggling with is the write side of things.  We'll have a huge
> > amount of data to write that's essentially a log format.  It would seem
> > that writing that outside of HDFS then trying to batch import it would
> > be a losing battle -- that you would need the distributed nature of HDFS
> > to do very large volume writes directly and wouldn't easily be able to take
> > some other flat storage model and feed it in as a secondary step without
> > having the HDFS side start to lag behind.
> >
> > The realization is that Name Node could go down so we'll have to have a
> > backup store that might be used during temporary outages, but that
> > most of the writes would be direct HDFS updates.
> >
> > The alternative would seem to be to end up with a set of distributed files
> > without some unifying distributed file system (e.g., like lots of Apache
> > web logs on many many individual boxes) and then have to come up with
> > some way to funnel those back into HDFS.
>
>

RE: long write operations and data recovery

Posted by Joydeep Sen Sarma <js...@facebook.com>.
I would agree with Ted. You should easily be able to get 100MBps write
throughput on a standard Netapp box (with read bandwidth left over -
since the peak write throughput rating is more than twice that). Even
at an average write throughput rate of 50MBps - the daily data volume
would be (drumroll ..) 4+TB! 
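
For reference, the back-of-the-envelope arithmetic behind that figure:

\[ 50\ \mathrm{MB/s} \times 86{,}400\ \mathrm{s/day} \approx 4.3\ \mathrm{TB/day} \]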

So buffer to a decent box and copy stuff over ..

-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Friday, February 29, 2008 11:33 AM
To: core-user@hadoop.apache.org
Subject: Re: long write operations and data recovery


Unless your volume is MUCH higher than ours, I think you can get by with a
relatively small farm of log consolidators that collect and concatenate
files.

If each log line is 100 bytes after compression (that is huge really) and
you have 10,000 events per second (also pretty danged high) then you are
only writing 1MB/s.  If you need a day of buffering (=100,000 seconds), then
you need 100GB of buffer storage.  These are very, very moderate
requirements for your ingestion point.


On 2/29/08 11:18 AM, "Steve Sapovits" <ss...@invitemedia.com> wrote:

> Ted Dunning wrote:
> 
>> In our case, we looked at the problem and decided that Hadoop wasn't
>> feasible for our real-time needs in any case.  There were several
>> issues,
>> 
>> - first of all, map-reduce itself didn't seem very plausible for
>> real-time applications.  That left hbase and hdfs as the capabilities
>> offered by hadoop (for real-time stuff)
> 
> We'll be using map-reduce batch mode, so we're okay there.
> 
>> The upshot is that we use hadoop extensively for batch operations
>> where it really shines.  The other nice effect is that we don't have
>> to worry all that much about HA (at least not real-time HA) since we
>> don't do real-time with hadoop.
> 
> What I'm struggling with is the write side of things.  We'll have a huge
> amount of data to write that's essentially a log format.  It would seem
> that writing that outside of HDFS then trying to batch import it would
> be a losing battle -- that you would need the distributed nature of HDFS
> to do very large volume writes directly and wouldn't easily be able to take
> some other flat storage model and feed it in as a secondary step without
> having the HDFS side start to lag behind.
> 
> The realization is that Name Node could go down so we'll have to have a
> backup store that might be used during temporary outages, but that
> most of the writes would be direct HDFS updates.
> 
> The alternative would seem to be to end up with a set of distributed files
> without some unifying distributed file system (e.g., like lots of Apache
> web logs on many many individual boxes) and then have to come up with
> some way to funnel those back into HDFS.


Re: long write operations and data recovery

Posted by Ted Dunning <td...@veoh.com>.
Unless your volume is MUCH higher than ours, I think you can get by with a
relatively small farm of log consolidators that collect and concatenate
files.

If each log line is 100 bytes after compression (that is huge really) and
you have 10,000 events per second (also pretty danged high) then you are
only writing 1MB/s.  If you need a day of buffering (=100,000 seconds), then
you need 100GB of buffer storage.  These are very, very moderate
requirements for your ingestion point.


On 2/29/08 11:18 AM, "Steve Sapovits" <ss...@invitemedia.com> wrote:

> Ted Dunning wrote:
> 
>> In our case, we looked at the problem and decided that Hadoop wasn't
>> feasible for our real-time needs in any case.  There were several
>> issues,
>> 
>> - first of all, map-reduce itself didn't seem very plausible for
>> real-time applications.  That left hbase and hdfs as the capabilities
>> offered by hadoop (for real-time stuff)
> 
> We'll be using map-reduce batch mode, so we're okay there.
> 
>> The upshot is that we use hadoop extensively for batch operations
>> where it really shines.  The other nice effect is that we don't have
>> to worry all that much about HA (at least not real-time HA) since we
>> don't do real-time with hadoop.
> 
> What I'm struggling with is the write side of things.  We'll have a huge
> amount of data to write that's essentially a log format.  It would seem
> that writing that outside of HDFS then trying to batch import it would
> be a losing battle -- that you would need the distributed nature of HDFS
> to do very large volume writes directly and wouldn't easily be able to take
> some other flat storage model and feed it in as a secondary step without
> having the HDFS side start to lag behind.
> 
> The realization is that Name Node could go down so we'll have to have a
> backup store that might be used during temporary outages, but that
> most of the writes would be direct HDFS updates.
> 
> The alternative would seem to be to end up with a set of distributed files
> without some unifying distributed file system (e.g., like lots of Apache
> web logs on many many individual boxes) and then have to come up with
> some way to funnel those back into HDFS.


Re: long write operations and data recovery

Posted by Steve Sapovits <ss...@invitemedia.com>.
Ted Dunning wrote:

> In our case, we looked at the problem and decided that Hadoop wasn't 
> feasible for our real-time needs in any case.  There were several
> issues,
> 
> - first of all, map-reduce itself didn't seem very plausible for
> real-time applications.  That left hbase and hdfs as the capabilities
> offered by hadoop (for real-time stuff)

We'll be using map-reduce batch mode, so we're okay there.

> The upshot is that we use hadoop extensively for batch operations
> where it really shines.  The other nice effect is that we don't have
> to worry all that much about HA (at least not real-time HA) since we
> don't do real-time with hadoop.

What I'm struggling with is the write side of things.  We'll have a huge
amount of data to write that's essentially a log format.  It would seem
that writing that outside of HDFS then trying to batch import it would
be a losing battle -- that you would need the distributed nature of HDFS
to do very large volume writes directly and wouldn't easily be able to take
some other flat storage model and feed it in as a secondary step without
having the HDFS side start to lag behind.

The realization is that Name Node could go down so we'll have to have a
backup store that might be used during temporary outages, but that 
most of the writes would be direct HDFS updates.

The alternative would seem to be to end up with a set of distributed files
without some unifying distributed file system (e.g., like lots of Apache 
web logs on many many individual boxes) and then have to come up with
some way to funnel those back into HDFS.

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com


Re: long write operations and data recovery

Posted by Ted Dunning <td...@veoh.com>.
In our case, we looked at the problem and decided that Hadoop wasn't
feasible for our real-time needs in any case.  There were several issues,

- first of all, map-reduce itself didn't seem very plausible for real-time
applications.  That left hbase and hdfs as the capabilities offered by
hadoop (for real-time stuff)

- hbase was far too immature to consider using.  Also, the read rate from
hbase is not that impressive compared, say, to a bank of a dozen or more
memcaches.

- hdfs won't handle nearly the volume of files that we need to work with.
In our main delivery system (one of many needs), we have nearly a billion
(=10^9) files that we have to be able to export at high data rates.  That
just isn't feasible in hadoop without lots of extra work.

The upshot is that we use hadoop extensively for batch operations where it
really shines.  The other nice effect is that we don't have to worry all
that much about HA (at least not real-time HA) since we don't do real-time
with hadoop.  


On 2/28/08 9:53 PM, "dhruba Borthakur" <dh...@yahoo-inc.com> wrote:

> I agree with Joydeep. For batch processing, it is sufficient to make the
> application not assume that HDFS is always up and active. However, for
> real-time applications that are not batch-centric, it might not be
> sufficient. There are a few things that HDFS could do to better handle
> Namenode outages:
> 
> 1. Make Clients handle transient Namenode downtime. This requires that
> Namenode restarts are fast, clients can handle long Namenode outages,
> etc.etc.
> 2. Design HDFS Namenode to be a set of two, an active one and a passive
> one. The active Namenode could continuously forward transactions to the
> passive one. In case of failure of the active Namenode, the passive
> could take over. This type of High-Availability would probably be very
> necessary for non-batch-type-applications.
> 
> Thanks,
> dhruba
> 
> -----Original Message-----
> From: Joydeep Sen Sarma [mailto:jssarma@facebook.com]
> Sent: Thursday, February 28, 2008 6:06 PM
> To: core-user@hadoop.apache.org
> Subject: RE: long write operations and data recovery
> 
> We have had a lot of peace of mind by building a data pipeline that does
> not assume that hdfs is always up and running. If the application is
> primarily non real-time log processing - I would suggest
> batch/incremental copies of data to hdfs that can catch up automatically
> in case of failures/downtimes.
> 
> we have an rsync-like map-reduce job that monitors log directories and
> keeps pulling new data in (and I suspect a lot of other users do similar
> stuff as well). It might be a useful notion to generalize and put in
> contrib.
> 
> 
> -----Original Message-----
> From: Steve Sapovits [mailto:ssapovits@invitemedia.com]
> Sent: Thursday, February 28, 2008 4:54 PM
> To: core-user@hadoop.apache.org
> Subject: Re: long write operations and data recovery
> 
> 
>> How does replication affect this?  If there's at least one replicated
>>  client still running, I assume that takes care of it?
> 
> Never mind -- I get this now after reading the docs again.
> 
> My remaining point of failure question concerns name nodes.  The docs say manual
> intervention is still required if a name node goes down.  How is this typically
> managed in production environments?  It would seem even a short name node outage
> in a data intensive environment would lead to data loss (no name node to give the
> data to).


RE: long write operations and data recovery

Posted by dhruba Borthakur <dh...@yahoo-inc.com>.
I agree with Joydeep. For batch processing, it is sufficient to make the
application not assume that HDFS is always up and active. However, for
real-time applications that are not batch-centric, it might not be
sufficient. There are a few things that HDFS could do to better handle
Namenode outages:

1. Make clients handle transient Namenode downtime. This requires that
Namenode restarts are fast, that clients can handle long Namenode outages,
etc. (a rough client-side retry sketch follows below).
2. Design HDFS Namenode to be a set of two, an active one and a passive
one. The active Namenode could continuously forward transactions to the
passive one. In case of failure of the active Namenode, the passive
could take over. This type of High-Availability would probably be very
necessary for non-batch-type-applications.
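
A minimal sketch of the client-side retry idea in point 1. The retry count,
backoff, and helper name are illustrative only, not an existing Hadoop facility:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative wrapper that retries an HDFS file creation across a transient Namenode outage. */
public class RetryingCreate {
    public static FSDataOutputStream createWithRetry(Configuration conf, Path p,
                                                     int attempts, long sleepMs)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return FileSystem.get(conf).create(p);  // fails while the Namenode is down
            } catch (IOException e) {
                last = e;                               // remember the failure and back off
                Thread.sleep(sleepMs);
            }
        }
        throw last;                                     // the outage outlasted our patience
    }
}

A real client would also have to cope with the Namenode disappearing in the middle
of an already-open write, which is the harder part of the problem.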

Thanks,
dhruba

-----Original Message-----
From: Joydeep Sen Sarma [mailto:jssarma@facebook.com] 
Sent: Thursday, February 28, 2008 6:06 PM
To: core-user@hadoop.apache.org
Subject: RE: long write operations and data recovery

We have had a lot of peace of mind by building a data pipeline that does
not assume that hdfs is always up and running. If the application is
primarily non real-time log processing - I would suggest
batch/incremental copies of data to hdfs that can catch up automatically
in case of failures/downtimes.

we have an rsync-like map-reduce job that monitors log directories and
keeps pulling new data in (and I suspect a lot of other users do similar
stuff as well). It might be a useful notion to generalize and put in
contrib.


-----Original Message-----
From: Steve Sapovits [mailto:ssapovits@invitemedia.com] 
Sent: Thursday, February 28, 2008 4:54 PM
To: core-user@hadoop.apache.org
Subject: Re: long write operations and data recovery


> How does replication affect this?  If there's at least one replicated
>  client still running, I assume that takes care of it?

Never mind -- I get this now after reading the docs again.

My remaining point of failure question concerns name nodes.  The docs say manual
intervention is still required if a name node goes down.  How is this typically
managed in production environments?  It would seem even a short name node outage
in a data intensive environment would lead to data loss (no name node to give the
data to).

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com


Re: long write operations and data recovery

Posted by Jason Venner <ja...@attributor.com>.
Us also.
Pulling in data from external machines and then running a pipeline of simple
map/reduces is our standard pattern.

Joydeep Sen Sarma wrote:
> We have had a lot of peace of mind by building a data pipeline that does
> not assume that hdfs is always up and running. If the application is
> primarily non real-time log processing - I would suggest
> batch/incremental copies of data to hdfs that can catch up automatically
> in case of failures/downtimes.
>
> we have an rsync-like map-reduce job that monitors log directories and
> keeps pulling new data in (and I suspect a lot of other users do similar
> stuff as well). It might be a useful notion to generalize and put in
> contrib.
>
>
> -----Original Message-----
> From: Steve Sapovits [mailto:ssapovits@invitemedia.com] 
> Sent: Thursday, February 28, 2008 4:54 PM
> To: core-user@hadoop.apache.org
> Subject: Re: long write operations and data recovery
>
>
>   
>> How does replication affect this?  If there's at least one replicated
>>  client still running, I assume that takes care of it?
>>     
>
> Never mind -- I get this now after reading the docs again.
>
> My remaining point of failure question concerns name nodes.  The docs say manual
> intervention is still required if a name node goes down.  How is this typically
> managed in production environments?  It would seem even a short name node outage
> in a data intensive environment would lead to data loss (no name node to give the
> data to).
>
>   

Re: long write operations and data recovery

Posted by Ted Dunning <td...@veoh.com>.

This is exactly what we do as well.  We also have auto-detection for
modifications and downstream processing so that back-filling in the presence of
error correction is possible (the errors can be old processing code or file
munging).


On 2/28/08 6:06 PM, "Joydeep Sen Sarma" <js...@facebook.com> wrote:

> We have had a lot of peace of mind by building a data pipeline that does
> not assume that hdfs is always up and running. If the application is
> primarily non real-time log processing - I would suggest
> batch/incremental copies of data to hdfs that can catch up automatically
> in case of failures/downtimes.
> 
> we have an rsync-like map-reduce job that monitors log directories and
> keeps pulling new data in (and I suspect a lot of other users do similar
> stuff as well). It might be a useful notion to generalize and put in
> contrib.


RE: long write operations and data recovery

Posted by Joydeep Sen Sarma <js...@facebook.com>.
We have had a lot of peace of mind by building a data pipeline that does
not assume that hdfs is always up and running. If the application is
primarily non real-time log processing - I would suggest
batch/incremental copies of data to hdfs that can catch up automatically
in case of failures/downtimes.

we have an rsync-like map-reduce job that monitors log directories and
keeps pulling new data in (and I suspect a lot of other users do similar
stuff as well). It might be a useful notion to generalize and put in
contrib.
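
A stripped-down sketch of that kind of catch-up copy (not the actual job described
above; the directory names and the skip-if-present convention are made up for
illustration):

import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative catch-up copier: pull any local log file that HDFS does not have yet. */
public class LogCatchUp {
    public static void main(String[] args) throws Exception {
        File localLogDir = new File("/var/log/app");     // hypothetical local log directory
        Path hdfsLogDir = new Path("/data/logs/app");    // hypothetical HDFS destination
        FileSystem fs = FileSystem.get(new Configuration());
        fs.mkdirs(hdfsLogDir);

        File[] logs = localLogDir.listFiles();
        if (logs == null) {
            return;                                      // nothing to do (or directory missing)
        }
        for (File f : logs) {
            Path dst = new Path(hdfsLogDir, f.getName());
            // Skip anything already copied; after a failure or HDFS downtime the next
            // run simply picks up whatever is still missing.
            if (!fs.exists(dst)) {
                fs.copyFromLocalFile(new Path(f.getAbsolutePath()), dst);
            }
        }
    }
}

A production version would copy to a temporary name and rename on completion, so a
crash mid-copy cannot leave a partial file that later runs skip over.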


-----Original Message-----
From: Steve Sapovits [mailto:ssapovits@invitemedia.com] 
Sent: Thursday, February 28, 2008 4:54 PM
To: core-user@hadoop.apache.org
Subject: Re: long write operations and data recovery


> How does replication affect this?  If there's at least one replicated
>  client still running, I assume that takes care of it?

Never mind -- I get this now after reading the docs again.

My remaining point of failure question concerns name nodes.  The docs say manual
intervention is still required if a name node goes down.  How is this typically
managed in production environments?  It would seem even a short name node outage
in a data intensive environment would lead to data loss (no name node to give the
data to).

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com


Re: long write operations and data recovery

Posted by Steve Sapovits <ss...@invitemedia.com>.
> How does replication affect this?  If there's at least one replicated
>  client still running, I assume that takes care of it?

Never mind -- I get this now after reading the docs again.

My remaining point of failure question concerns name nodes.  The docs say manual 
intervention is still required if a name node goes down.  How is this typically managed
in production environments?   It would seem even a short name node outage in a 
data intensive environment would lead to data loss (no name node to give the data
to).

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com


Re: long write operations and data recovery

Posted by Steve Sapovits <ss...@invitemedia.com>.
dhruba Borthakur wrote:

> The Namenode maintains a lease for every open file that is being written
> to. If the client that was writing to the file disappears, the Namenode
> will do "lease recovery" after expiry of the lease timeout (1 hour). The
> lease recovery process (in most cases) will remove the last block from
> the file (it was not fully written because the client crashed before it
> could fill up the block) and close the file.

How does replication affect this?  If there's at least one replicated client still
running, I assume that takes care of it?

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com

RE: long write operations and data recovery

Posted by dhruba Borthakur <dh...@yahoo-inc.com>.
The Namenode maintains a lease for every open file that is being written
to. If the client that was writing to the file disappears, the Namenode
will do "lease recovery" after expiry of the lease timeout (1 hour). The
lease recovery process (in most cases) will remove the last block from
the file (it was not fully written because the client crashed before it
could fill up the block) and close the file.
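
Given that behavior, a writer can bound its exposure by rolling files frequently, so
a crash costs at most the unclosed tail of the current file. A rough sketch of that
pattern (paths and the roll interval are made up; this is not a built-in Hadoop
feature):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative rolling log writer: closing a file makes its contents durable and visible. */
public class RollingLogWriter {
    private static final long ROLL_MILLIS = 60L * 60L * 1000L;  // roll hourly, for example

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long rolledAt = System.currentTimeMillis();
        FSDataOutputStream out = fs.create(new Path("/logs/events." + rolledAt));

        while (true) {
            out.write(nextLogLine());                            // application-specific source
            if (System.currentTimeMillis() - rolledAt >= ROLL_MILLIS) {
                out.close();                                     // closed files survive a client crash
                rolledAt = System.currentTimeMillis();
                out = fs.create(new Path("/logs/events." + rolledAt));
            }
        }
    }

    private static byte[] nextLogLine() {
        return "example event\n".getBytes();                     // stand-in for real log input
    }
}

If the client crashes mid-interval, lease recovery as described above eventually closes
the partial file, typically at the cost of the last, unfilled block.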

Thanks,
dhruba

-----Original Message-----
From: Steve Sapovits [mailto:ssapovits@invitemedia.com] 
Sent: Monday, February 25, 2008 5:13 PM
To: core-user@hadoop.apache.org
Subject: long write operations and data recovery


If I have a write operation that takes a while between opening and closing
the file, what is the effect of a node doing that writing crashing in the middle?
For example, suppose I have large logs that I write to continually, rolling them
every N minutes (say every hour for the sake of discussion).  If I have the file
opened and am 90% done with my writes and things crash, what happens to the
data I've "written" ... realizing that at some level, data isn't visible to the rest
of the cluster until the file is closed.

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com