Posted to user@hadoop.apache.org by Peter Sheridan <ps...@millennialmedia.com> on 2012/09/25 18:28:27 UTC

Detect when file is not being written by another process

Hi all.

We're using Hadoop 1.0.3.  We need to pick up a set of large (4+GB) files when they've finished being written to HDFS by a different process.  There doesn't appear to be an API specifically for this.  We had discovered through experimentation that the FileSystem.append() method can be used for this purpose — it will fail if another process is writing to the file.
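
In code, the probe amounted to something like this (a minimal sketch with made-up names; on 1.0.x, append() also has to be enabled via dfs.support.append, if memory serves):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of the append-based probe: append() throws while another client
    // still holds the write lease on the file. As described below, this is
    // the approach that ended up corrupting files on a multi-node cluster,
    // so treat it as a record of the experiment, not a recommendation.
    public class AppendProbe {
        public static boolean looksClosed(FileSystem fs, Path path) {
            try {
                fs.append(path).close();   // fails if another writer holds the lease
                return true;
            } catch (IOException e) {
                return false;              // file is still being written
            }
        }
    }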

However: when running this on a multi-node cluster, using that API actually corrupts the file.  Perhaps this is a known issue?  Looking at the bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a bunch of similar-sounding things.

What's the right way to solve this problem?  Thanks.


--Pete


Re: Detect when file is not being written by another process

Posted by Andy Isaacson <ad...@cloudera.com>.
On Tue, Sep 25, 2012 at 9:28 AM, Peter Sheridan
<ps...@millennialmedia.com> wrote:
> We're using Hadoop 1.0.3.  We need to pick up a set of large (4+GB) files
> when they've finished being written to HDFS by a different process.

The common way to solve this problem is to modify the writing
application to write to a temporary filename and then rename the
temporary to the target filename when the write is complete.

That way, if the file exists without the temporary tag, the reader can
be confident the file is complete.
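
A writer-side sketch of that pattern, assuming a "._tmp" suffix convention (the names here are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Write to "<target>._tmp", close the stream, then rename to <target>.
    // Rename is atomic in HDFS, so a reader never sees a partial file under
    // the final name.
    public class AtomicWrite {
        public static void write(FileSystem fs, Path target, byte[] data)
                throws IOException {
            Path tmp = target.suffix("._tmp");   // example temporary suffix
            FSDataOutputStream out = fs.create(tmp);
            try {
                out.write(data);
            } finally {
                out.close();
            }
            if (!fs.rename(tmp, target)) {
                throw new IOException("rename failed: " + tmp + " -> " + target);
            }
        }
    }

The reader side then just ignores any path ending in "._tmp" when it scans the directory.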

-andy

Re: Detect when file is not being written by another process

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
AFAIK there is no way to determine if a file has been fully written or not.

Oozie uses a feature of Hadoop which writes a _SUCCESS flag file in
the output directory of a job. This _SUCCESS file is written at job
completion time, thus ensuring all the output of the job is ready.
This means that when Oozie is configured to look for a directory FOO/,
in practice it looks for the existence of the FOO/_SUCCESS file.

You can configure Oozie to look for the existence of FOO/ itself, but this
means you'll have to write to a temp dir, e.g. FOO_TMP/, and rename it to
FOO/ once you have finished writing the data.
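
The reader-side check is then a one-line existence test (sketch; FOO/ stands for whatever output directory you watch):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // A job's output directory is complete once the _SUCCESS marker,
    // written at job commit time, exists inside it.
    public class SuccessFlag {
        public static boolean isReady(FileSystem fs, Path outputDir)
                throws IOException {
            return fs.exists(new Path(outputDir, "_SUCCESS"));
        }
    }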

Thx

On Wed, Sep 26, 2012 at 1:52 AM, Hemanth Yamijala
<yh...@thoughtworks.com> wrote:
> Agree with Bejoy. The problem you've mentioned sounds like building
> something like a workflow, which is what Oozie is supposed to do.
>
> Thanks
> hemanth
>
>
> On Wed, Sep 26, 2012 at 12:22 AM, Bejoy Ks <be...@gmail.com> wrote:
>>
>> Hi Peter
>>
>> AFAIK Oozie has a mechanism to achieve this. You can trigger your jobs as
>> soon as the files are written to a certain HDFS directory.
>>
>>
>> On Tue, Sep 25, 2012 at 10:23 PM, Peter Sheridan
>> <ps...@millennialmedia.com> wrote:
>>>
>>> These are log files being deposited by other processes, which we may not
>>> have control over.
>>>
>>> We don't want multiple processes to write to the same files — we just
>>> don't want to start our jobs until they have been completely written.
>>>
>>> Sorry for lack of clarity & thanks for the response.
>>>
>>>
>>> --Pete
>>>
>>> From: Bertrand Dechoux <de...@gmail.com>
>>> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
>>> Date: Tuesday, September 25, 2012 12:33 PM
>>> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
>>> Subject: Re: Detect when file is not being written by another process
>>>
>>> Hi,
>>>
>>> Multiple files and aggregation, or something like HBase?
>>>
>>> Could you tell us more about your context? What are the volumes? Why do
>>> you want multiple processes to write to the same file?
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>> On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan
>>> <ps...@millennialmedia.com> wrote:
>>>>
>>>> Hi all.
>>>>
>>>> We're using Hadoop 1.0.3.  We need to pick up a set of large (4+GB)
>>>> files when they've finished being written to HDFS by a different process.
>>>> There doesn't appear to be an API specifically for this.  We had discovered
>>>> through experimentation that the FileSystem.append() method can be used for
>>>> this purpose — it will fail if another process is writing to the file.
>>>>
>>>> However: when running this on a multi-node cluster, using that API
>>>> actually corrupts the file.  Perhaps this is a known issue?  Looking at the
>>>> bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a bunch
>>>> of similar-sounding things.
>>>>
>>>> What's the right way to solve this problem?  Thanks.
>>>>
>>>>
>>>> --Pete
>>>>
>>>
>>>
>>>
>>> --
>>> Bertrand Dechoux
>>
>>
>



-- 
Alejandro

Re: Detect when file is not being written by another process

Posted by Hemanth Yamijala <yh...@thoughtworks.com>.
Agree with Bejoy. The problem you've mentioned sounds like building
something like a workflow, which is what Oozie is supposed to do.

Thanks
hemanth

On Wed, Sep 26, 2012 at 12:22 AM, Bejoy Ks <be...@gmail.com> wrote:

> Hi Peter
>
> AFAIK Oozie has a mechanism to achieve this. You can trigger your jobs as
> soon as the files are written to a certain HDFS directory.
>
>
> On Tue, Sep 25, 2012 at 10:23 PM, Peter Sheridan <
> psheridan@millennialmedia.com> wrote:
>
>>  These are log files being deposited by other processes, which we may
>> not have control over.
>>
>>  We don't want multiple processes to write to the same files — we just
>> don't want to start our jobs until they have been completely written.
>>
>>  Sorry for lack of clarity & thanks for the response.
>>
>>
>>  --Pete
>>
>>   From: Bertrand Dechoux <de...@gmail.com>
>> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
>> Date: Tuesday, September 25, 2012 12:33 PM
>> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
>> Subject: Re: Detect when file is not being written by another process
>>
>>  Hi,
>>
>> Multiple files and aggregation, or something like HBase?
>>
>> Could you tell us more about your context? What are the volumes? Why do
>> you want multiple processes to write to the same file?
>>
>> Regards
>>
>> Bertrand
>>
>> On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan <
>> psheridan@millennialmedia.com> wrote:
>>
>>>  Hi all.
>>>
>>>  We're using Hadoop 1.0.3.  We need to pick up a set of large (4+GB)
>>> files when they've finished being written to HDFS by a different process.
>>>  There doesn't appear to be an API specifically for this.  We had
>>> discovered through experimentation that the FileSystem.append() method can
>>> be used for this purpose — it will fail if another process is writing to
>>> the file.
>>>
>>>  However: when running this on a multi-node cluster, using that API
>>> actually corrupts the file.  Perhaps this is a known issue?  Looking at the
>>> bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a
>>> bunch of similar-sounding things.
>>>
>>>  What's the right way to solve this problem?  Thanks.
>>>
>>>
>>>  --Pete
>>>
>>>
>>
>>
>> --
>> Bertrand Dechoux
>>
>
>

Re: Detect when file is not being written by another process

Posted by Bejoy Ks <be...@gmail.com>.
Hi Peter

AFAIK Oozie has a mechanism to achieve this. You can trigger your jobs as
soon as the files are written to a certain HDFS directory.

On Tue, Sep 25, 2012 at 10:23 PM, Peter Sheridan <
psheridan@millennialmedia.com> wrote:

>  These are log files being deposited by other processes, which we may not
> have control over.
>
>  We don't want multiple processes to write to the same files — we just
> don't want to start our jobs until they have been completely written.
>
>  Sorry for lack of clarity & thanks for the response.
>
>
>  --Pete
>
>   From: Bertrand Dechoux <de...@gmail.com>
> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Date: Tuesday, September 25, 2012 12:33 PM
> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Re: Detect when file is not being written by another process
>
>  Hi,
>
> Multiple files and aggregation, or something like HBase?
>
> Could you tell us more about your context? What are the volumes? Why do
> you want multiple processes to write to the same file?
>
> Regards
>
> Bertrand
>
> On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan <
> psheridan@millennialmedia.com> wrote:
>
>>  Hi all.
>>
>>  We're using Hadoop 1.0.3.  We need to pick up a set of large (4+GB)
>> files when they've finished being written to HDFS by a different process.
>>  There doesn't appear to be an API specifically for this.  We had
>> discovered through experimentation that the FileSystem.append() method can
>> be used for this purpose — it will fail if another process is writing to
>> the file.
>>
>>  However: when running this on a multi-node cluster, using that API
>> actually corrupts the file.  Perhaps this is a known issue?  Looking at the
>> bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a
>> bunch of similar-sounding things.
>>
>>  What's the right way to solve this problem?  Thanks.
>>
>>
>>  --Pete
>>
>>
>
>
> --
> Bertrand Dechoux
>

Re: Detect when file is not being written by another process

Posted by Peter Sheridan <ps...@millennialmedia.com>.
These are log files being deposited by other processes, which we may not have control over.

We don't want multiple processes to write to the same files — we just don't want to start our jobs until they have been completely written.

Sorry for lack of clarity & thanks for the response.


--Pete

From: Bertrand Dechoux <de...@gmail.com>
Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Date: Tuesday, September 25, 2012 12:33 PM
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Detect when file is not being written by another process

Hi,

Multiple files and aggregation, or something like HBase?

Could you tell us more about your context? What are the volumes? Why do you want multiple processes to write to the same file?

Regards

Bertrand

On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan <ps...@millennialmedia.com> wrote:
Hi all.

We're using Hadoop 1.0.3.  We need to pick up a set of large (4+GB) files when they've finished being written to HDFS by a different process.  There doesn't appear to be an API specifically for this.  We had discovered through experimentation that the FileSystem.append() method can be used for this purpose — it will fail if another process is writing to the file.

However: when running this on a multi-node cluster, using that API actually corrupts the file.  Perhaps this is a known issue?  Looking at the bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a bunch of similar-sounding things.

What's the right way to solve this problem?  Thanks.


--Pete




--
Bertrand Dechoux

Re: Detect when file is not being written by another process

Posted by Bertrand Dechoux <de...@gmail.com>.
Hi,

Multiple files and aggregation, or something like HBase?

Could you tell us more about your context? What are the volumes? Why do
you want multiple processes to write to the same file?

Regards

Bertrand

On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan <
psheridan@millennialmedia.com> wrote:

>  Hi all.
>
>  We're using Hadoop 1.0.3.  We need to pick up a set of large (4+GB)
> files when they've finished being written to HDFS by a different process.
>  There doesn't appear to be an API specifically for this.  We had
> discovered through experimentation that the FileSystem.append() method can
> be used for this purpose — it will fail if another process is writing to
> the file.
>
>  However: when running this on a multi-node cluster, using that API
> actually corrupts the file.  Perhaps this is a known issue?  Looking at the
> bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a
> bunch of similar-sounding things.
>
>  What's the right way to solve this problem?  Thanks.
>
>
>  --Pete
>
>


-- 
Bertrand Dechoux
