Posted to common-user@hadoop.apache.org by Cagdas Gerede <ca...@gmail.com> on 2008/04/17 04:40:20 UTC

Help: When is it safe to discard a block in the application layer

I am working on an application on top of Hadoop Distributed File System.

The high-level flow goes like this: a user's data block arrives at the application
server. The application server uses Hadoop's DistributedFileSystem API to
write the data block to the file system. Once the block has been replicated three
times, the application server notifies the user, who can then get
rid of the data since it now lives in persistent, fault-tolerant storage.

I couldn't figure out the following. Let's say this is my sample program to
write a block:

byte[] data = new byte[blockSize];
out.write(data, 0, data.length);
...

where out is
out = fs.create(outFile, FsPermission.getDefault(), true, 4096,
                (short) replicationCount, blockSize, progress);


My application writes the data to the stream and then moves on to the next
line. At that point, I believe I cannot be sure that the block has been replicated
at least, say, 3 times. Under the hood, the DFSClient is possibly still
pushing this data to other datanodes.

Given this, how can my application know that the data block has been
replicated 3 times and that it is safe to discard this data?

There are a couple of things you might suggest:
1) Set the minimum-replicas property to 3: even with this setting, the
application still moves on to the next line before the data has actually been
replicated 3 times.
2) Right after you write, repeatedly fetch cache hints from the master and
check whether the master knows of 3 replicas of this block: my problem with this
approach is that the application will stall on every block it
needs to store, since it takes some time for the datanodes to report and
for the master to process the block reports. What is worse, if a datanode in the
pipeline fails, we have no way of learning about the error.
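To make option 2 concrete, here is a minimal polling sketch. The waitForReplicas helper and its IntSupplier parameter are hypothetical: the supplier stands in for a call that asks the master how many replicas of the block it currently knows about (e.g. by counting the hosts returned for the block's locations).

```java
import java.util.function.IntSupplier;

// Hypothetical poller for option 2: keep asking the namenode how many
// replicas of the block it knows about, until the target count is reached
// or we give up. replicaCount stands in for a real lookup such as counting
// the hostnames the namenode reports for the block.
public class ReplicaPoller {
    public static boolean waitForReplicas(IntSupplier replicaCount,
                                          int target,
                                          int maxAttempts,
                                          long sleepMillis) {
        for (int i = 0; i < maxAttempts; i++) {
            if (replicaCount.getAsInt() >= target) {
                return true;               // safe to discard the local copy
            }
            try {
                Thread.sleep(sleepMillis); // block reports take time to arrive
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;                      // caller must keep its copy and retry
    }

    public static void main(String[] args) {
        // Simulated namenode view: successive reports grow from 1 to 3 replicas.
        int[] reports = {1, 2, 3};
        int[] idx = {0};
        boolean ok = waitForReplicas(
            () -> reports[Math.min(idx[0]++, reports.length - 1)], 3, 10, 1);
        System.out.println(ok ? "replicated" : "not yet replicated");
    }
}
```

This makes the drawback visible too: the sleep-and-retry loop is exactly the per-block stall described above.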

To sum up, I am not sure when the right time is to discard a block of data
with the guarantee that it has been replicated a certain number of times.

Please help,

Thanks,
Cagdas

Re: Help: When is it safe to discard a block in the application layer

Posted by Cagdas Gerede <ca...@gmail.com>.
In the DataStreamer class (in DFSClient.java), the run() method contains a line
like this:

if (progress != null) { progress.progress(); }

I think progress is called only once the block has been replicated at least
the minimum number of times.

I could pass in my progress object and wait until this method is invoked before
deleting my application cache.
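A minimal sketch of the waiting mechanics described above, assuming a hypothetical ProgressGate helper whose onProgress() stands in for the single progress() method of org.apache.hadoop.util.Progressable. Whether progress() really fires only after minimum replication is exactly the open question in this thread, so this only shows the plumbing, not a guarantee:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch: hand fs.create(...) a Progressable whose progress() trips a latch,
// and have the writer thread block on the latch before discarding its local
// copy. onProgress() models Progressable's void progress() so the sketch
// runs without Hadoop on the classpath.
public class ProgressGate {
    private final CountDownLatch firstAck = new CountDownLatch(1);

    // In real code this body would live inside the Progressable instance
    // passed as the last argument to fs.create(...).
    public void onProgress() {
        firstAck.countDown();
    }

    // Returns true if progress() was observed within the timeout.
    public boolean awaitFirstAck(long timeoutMillis) {
        try {
            return firstAck.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ProgressGate gate = new ProgressGate();
        // Simulate the DFSClient's streamer thread invoking progress().
        new Thread(gate::onProgress).start();
        System.out.println(gate.awaitFirstAck(5000) ? "acked" : "timed out");
    }
}
```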


Does this seem right? Am I missing something?

Thanks for your answer.

-- 
------------
Best Regards, Cagdas Evren Gerede
Home Page: http://cagdasgerede.info


On Thu, Apr 17, 2008 at 10:21 AM, dhruba Borthakur <dh...@yahoo-inc.com>
wrote:

>  The DFSClient caches small packets (e.g. 64K write buffers) and they are
> lazily flushed to the datanodes in the pipeline. So, when an application
> completes an out.write() call, it is definitely not guaranteed that the data
> has been sent to even one datanode.
>
>
>
> One option would be to retrieve cache hints from the Namenode and determine
> if the block has three locations.
>
>
>
> Thanks,
>
> dhruba

RE: Help: When is it safe to discard a block in the application layer

Posted by dhruba Borthakur <dh...@yahoo-inc.com>.
The DFSClient caches small packets (e.g. 64K write buffers) and they are
lazily flushed to the datanodes in the pipeline. So, when an application
completes an out.write() call, it is definitely not guaranteed that the data
has been sent to even one datanode.

 

One option would be to retrieve cache hints from the Namenode and
determine if the block has three locations.
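A sketch of that check, once the block locations have been retrieved from the Namenode. The API of this era returns, roughly speaking, one array of hostnames per block; the check itself is modeled on plain String[][] arrays so it runs without a cluster, and the ReplicaCheck/hasReplicas names are hypothetical:

```java
// Sketch of the suggestion above: after writing, ask the namenode for the
// block locations and verify that every block is known on enough hosts.
// hostsPerBlock models the per-block hostname lists the namenode returns.
public class ReplicaCheck {
    public static boolean hasReplicas(String[][] hostsPerBlock, int target) {
        for (String[] hosts : hostsPerBlock) {
            if (hosts.length < target) {
                return false;    // at least one block is still under-replicated
            }
        }
        return true;             // namenode knows 'target' copies of every block
    }

    public static void main(String[] args) {
        String[][] hints = {
            {"dn1", "dn2", "dn3"},   // block 0: fully replicated
            {"dn1", "dn2"},          // block 1: still waiting for a report
        };
        System.out.println(hasReplicas(hints, 3));
    }
}
```

Note this is a point-in-time check; it would typically be repeated until it passes, which is the latency concern raised in the original question.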

 

Thanks,

dhruba

 
