Posted to common-user@hadoop.apache.org by Meghana <me...@germinait.com> on 2011/07/28 10:08:17 UTC

Reader/Writer problem in HDFS

Hi,

We have a job where the map tasks are given the path to an output folder.
Each map task writes a single file to that folder. There is no reduce phase.
There is another thread, which constantly looks for new files in the output
folder. If found, it persists the contents to index, and deletes the file.

We use this code in the map task:
// declare the stream outside the try block so it is in scope for finally
OutputStream oStream = null;
try {
    oStream = fileSystem.create(path);
    IOUtils.write("xyz", oStream);
} finally {
    IOUtils.closeQuietly(oStream);
}

The problem: sometimes the reader thread sees and tries to read a file which
is not yet fully written to HDFS (or whose checksum is not written yet, etc.),
and throws an error. Is it possible to write an HDFS file in such a way that
it won't be visible until it is fully written?

We use Hadoop 0.20.203.

Thanks,

Meghana
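
For concreteness, the reader thread described above might look roughly like
this (class and method names such as OutputFolderPoller and indexContents are
illustrative, not from the post). It also makes the race visible: listStatus
can return a path whose writer has not yet closed the stream.

import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputFolderPoller implements Runnable {
    private final FileSystem fileSystem;
    private final Path outputDir;

    public OutputFolderPoller(FileSystem fileSystem, Path outputDir) {
        this.fileSystem = fileSystem;
        this.outputDir = outputDir;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                // listStatus can return files whose writers have not yet
                // closed the stream -- that is the race described above.
                for (FileStatus status : fileSystem.listStatus(outputDir)) {
                    InputStream in = fileSystem.open(status.getPath());
                    try {
                        indexContents(in); // hypothetical indexing step
                    } finally {
                        IOUtils.closeQuietly(in);
                    }
                    fileSystem.delete(status.getPath(), false);
                }
                Thread.sleep(1000L);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (Exception e) {
                // partial files surface here as read or checksum errors
                e.printStackTrace();
            }
        }
    }

    private void indexContents(InputStream in) {
        // persist the contents to the index (omitted)
    }
}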

RE: Reader/Writer problem in HDFS

Posted by Laxman <la...@huawei.com>.
There is no such API, as far as I know. copyFromLocal is the closest thing to
one, but it may not fit your scenario, I guess.

--Laxman
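
For reference, the FileSystem call behind the copyFromLocal shell command is
copyFromLocalFile; a minimal sketch (the paths are illustrative). Note that
the API itself still streams into the destination path, so on its own it may
not hide the file until it is complete.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyFromLocalExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path localFile = new Path("/tmp/part-00000"); // staged on local disk
        Path hdfsTarget = new Path("/output/part-00000"); // final HDFS location

        // delSrc = false keeps the local copy after the upload
        fs.copyFromLocalFile(false, localFile, hdfsTarget);
    }
}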

-----Original Message-----
From: Meghana [mailto:meghana.marathe@germinait.com] 
Sent: Thursday, July 28, 2011 4:32 PM
To: hdfs-user@hadoop.apache.org; lakshman_ch@huawei.com
Cc: common-user@hadoop.apache.org
Subject: Re: Reader/Writer problem in HDFS

Thanks Laxman! That would definitely help things. :)

Is there a better FileSystem (or other) method call to create a file in one go
(i.e. atomically, I guess?), without having to call create() and then write to
the stream?

..meghana


On 28 July 2011 16:12, Laxman <la...@huawei.com> wrote:

> One approach can be to use a ".tmp" extension while writing. Once the write
> is completed, rename back to the original file name. Also, the reader has to
> filter out ".tmp" files.
>
> This will ensure the reader will not pick up the partial files.
>
> We have a similar scenario where the above-mentioned approach resolved the issue.
>
> -----Original Message-----
> From: Meghana [mailto:meghana.marathe@germinait.com]
> Sent: Thursday, July 28, 2011 1:38 PM
> To: common-user; hdfs-user@hadoop.apache.org
> Subject: Reader/Writer problem in HDFS
>
> Hi,
>
> We have a job where the map tasks are given the path to an output folder.
> Each map task writes a single file to that folder. There is no reduce phase.
> There is another thread, which constantly looks for new files in the output
> folder. If found, it persists the contents to index, and deletes the file.
>
> We use this code in the map task:
> // declare the stream outside the try block so it is in scope for finally
> OutputStream oStream = null;
> try {
>     oStream = fileSystem.create(path);
>     IOUtils.write("xyz", oStream);
> } finally {
>     IOUtils.closeQuietly(oStream);
> }
>
> The problem: sometimes the reader thread sees and tries to read a file which
> is not yet fully written to HDFS (or whose checksum is not written yet, etc.),
> and throws an error. Is it possible to write an HDFS file in such a way that
> it won't be visible until it is fully written?
>
> We use Hadoop 0.20.203.
>
> Thanks,
>
> Meghana
>
>


Re: Reader/Writer problem in HDFS

Posted by Meghana <me...@germinait.com>.
Thanks Laxman! That would definitely help things. :)

Is there a better FileSystem (or other) method call to create a file in one go
(i.e. atomically, I guess?), without having to call create() and then write to
the stream?

..meghana
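
On the reading side, the ".tmp" filtering that Laxman suggests below can be
done with a PathFilter; a minimal sketch (the class and method names are
illustrative, not a Hadoop API):

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class SkipTmpFiles {
    // Accept only files that are not still being written under the
    // ".tmp" naming convention proposed in this thread.
    private static final PathFilter NOT_TMP = new PathFilter() {
        @Override
        public boolean accept(Path path) {
            return !path.getName().endsWith(".tmp");
        }
    };

    public static FileStatus[] listFinishedFiles(FileSystem fs, Path dir)
            throws IOException {
        return fs.listStatus(dir, NOT_TMP);
    }
}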


On 28 July 2011 16:12, Laxman <la...@huawei.com> wrote:

> One approach can be to use a ".tmp" extension while writing. Once the write
> is completed, rename back to the original file name. Also, the reader has to
> filter out ".tmp" files.
>
> This will ensure the reader will not pick up the partial files.
>
> We have a similar scenario where the above-mentioned approach resolved the issue.
>
> -----Original Message-----
> From: Meghana [mailto:meghana.marathe@germinait.com]
> Sent: Thursday, July 28, 2011 1:38 PM
> To: common-user; hdfs-user@hadoop.apache.org
> Subject: Reader/Writer problem in HDFS
>
> Hi,
>
> We have a job where the map tasks are given the path to an output folder.
> Each map task writes a single file to that folder. There is no reduce phase.
> There is another thread, which constantly looks for new files in the output
> folder. If found, it persists the contents to index, and deletes the file.
>
> We use this code in the map task:
> // declare the stream outside the try block so it is in scope for finally
> OutputStream oStream = null;
> try {
>     oStream = fileSystem.create(path);
>     IOUtils.write("xyz", oStream);
> } finally {
>     IOUtils.closeQuietly(oStream);
> }
>
> The problem: sometimes the reader thread sees and tries to read a file which
> is not yet fully written to HDFS (or whose checksum is not written yet, etc.),
> and throws an error. Is it possible to write an HDFS file in such a way that
> it won't be visible until it is fully written?
>
> We use Hadoop 0.20.203.
>
> Thanks,
>
> Meghana
>
>

RE: Reader/Writer problem in HDFS

Posted by Laxman <la...@huawei.com>.
One approach can be to use a ".tmp" extension while writing. Once the write
is completed, rename back to the original file name. Also, the reader has to
filter out ".tmp" files.

This will ensure the reader will not pick up the partial files.

We have a similar scenario where the above-mentioned approach resolved the issue.
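
A sketch of the writer side of this approach, adapted from the code in the
original post below (the writeThenRename helper name is illustrative): write
under a ".tmp" name, close the stream, then rename, so the final name only
ever refers to a complete file.

import java.io.IOException;
import java.io.OutputStream;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TmpRenameWriter {
    public static void writeThenRename(FileSystem fileSystem, Path path,
                                       String contents) throws IOException {
        // Write to "<name>.tmp" first so readers never see a partial file.
        Path tmpPath = path.suffix(".tmp");
        OutputStream oStream = null;
        try {
            oStream = fileSystem.create(tmpPath);
            IOUtils.write(contents, oStream);
        } finally {
            IOUtils.closeQuietly(oStream);
        }
        // rename() is a metadata operation on the NameNode; the data and
        // checksums are already fully written by the time it runs.
        if (!fileSystem.rename(tmpPath, path)) {
            throw new IOException("rename failed: " + tmpPath + " -> " + path);
        }
    }
}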

-----Original Message-----
From: Meghana [mailto:meghana.marathe@germinait.com] 
Sent: Thursday, July 28, 2011 1:38 PM
To: common-user; hdfs-user@hadoop.apache.org
Subject: Reader/Writer problem in HDFS

Hi,

We have a job where the map tasks are given the path to an output folder.
Each map task writes a single file to that folder. There is no reduce phase.
There is another thread, which constantly looks for new files in the output
folder. If found, it persists the contents to index, and deletes the file.

We use this code in the map task:
// declare the stream outside the try block so it is in scope for finally
OutputStream oStream = null;
try {
    oStream = fileSystem.create(path);
    IOUtils.write("xyz", oStream);
} finally {
    IOUtils.closeQuietly(oStream);
}

The problem: sometimes the reader thread sees and tries to read a file which
is not yet fully written to HDFS (or whose checksum is not written yet, etc.),
and throws an error. Is it possible to write an HDFS file in such a way that
it won't be visible until it is fully written?

We use Hadoop 0.20.203.

Thanks,

Meghana

