Posted to hdfs-user@hadoop.apache.org by kang hua <ka...@msn.com> on 2011/09/07 13:33:53 UTC

Question about hdfs close * hflush behavior


Hi friends:
   I have two questions.
   The first one is: I use libhdfs's hflush to flush my data to a file, and in the same process context I can read it. But the file appears unchanged if I check from the hadoop shell -- its length is zero (checked with "hadoop fs -ls xxx" or by reading it in a program); however, when I restart HDFS, I can read the file's content that I flushed. Why?
   Can I hflush data to a file without closing it, and at the same time read the flushed data from another process?
   The second one is: once an HDFS file is closed, is the last written block untouched, so that even if I open the file in append mode, the namenode will allocate a new block for the appended data?
   I find that if I close a file and reopen it in append mode again and again, the hdfs report shows used space much larger than the file's logical size.
   btw: I use cloudera ch2

Thanks a lot!
kanghua

Re: Question about hdfs close * hflush behavior

Posted by Kanghua151 <ka...@msn.com>.
I get it. Thanks!

Sent from my iPhone

On 2011-9-8, at 23:57, Todd Lipcon <to...@cloudera.com> wrote:

> 2011/9/8 Kanghua151 <ka...@msn.com>:
>> You are so nice, thank you very much :)
>> One last question:
>> Can I trigger block sync without restarting HDFS?
> 
> Close the file or have a machine crash :) But no, not really.
> 

Re: Question about hdfs close * hflush behavior

Posted by Todd Lipcon <to...@cloudera.com>.
2011/9/8 Kanghua151 <ka...@msn.com>:
> You are so nice, thank you very much :)
> One last question:
> Can I trigger block sync without restarting HDFS?

Close the file or have a machine crash :) But no, not really.

-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Question about hdfs close * hflush behavior

Posted by Kanghua151 <ka...@msn.com>.
You are so nice, thank you very much :)
One last question:
Can I trigger block sync without restarting HDFS?


Sent from my iPhone


Re: Question about hdfs close * hflush behavior

Posted by Todd Lipcon <to...@cloudera.com>.
2011/9/7 kang hua <ka...@msn.com>:
> Thanks, my friend! Please allow me to ask a few more detailed questions.
> 1. Yes, I can use hadoop fs -tail or -cat xxx to see the file content. But
> how can I get the file's real size from another process if the namenode has
> not been updated? What I really want is to read the data at the tail of the
> file.

You can open the file and then use an API on the DFSInputStream class
to find the length. I don't recall the name of the API, but if you
look in there, you should see it.
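
Something along these lines, as an untested sketch (the exact class and
method names vary by version; in later Hadoop releases the call is
getVisibleLength() on the HDFS-specific HdfsDataInputStream):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.client.HdfsDataInputStream;

    public class VisibleLength {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Open the file even while another process is still writing it.
        FSDataInputStream in = fs.open(new Path(args[0]));
        // The length recorded at the NameNode lags behind hflush; the
        // open stream can report how many bytes are actually readable.
        long visible = ((HdfsDataInputStream) in).getVisibleLength();
        System.out.println("visible length: " + visible);
        in.close();
      }
    }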

>
> 2. Why is it that when I reboot hdfs, the content I flushed shows up via
> "hadoop fs -ls xxx"?

On restart, the namenode triggers block synchronization, and the
up-to-date length is determined.

> 3. In append mode: if I close the file and reopen it in append mode again
> and again, the real data space increases normally, but the namenode shows
> dfs used space increasing too fast. Is it a bug?

Might be a bug, yes.

> 4. In which version of hdfs is append bug-free?

0.21, which is buggy in other aspects. So, no stable released version
has a working append() call.

In truth I've never seen a _good_ use case for
append-to-an-existing-file. Usually you can do just as well by keeping
the file open and periodically hflushing, or rolling to a new file
when you want to add more records to an existing dataset.
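
That keep-it-open pattern looks roughly like this (untested sketch; on
0.20 the flush call is the deprecated sync(), which later APIs rename to
hflush()):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PeriodicFlushWriter {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path(args[0]));
        for (int i = 0; i < 100; i++) {
          out.write(("record " + i + "\n").getBytes("UTF-8"));
          if (i % 10 == 9) {
            // Make everything written so far visible to readers without
            // closing the file; this is hflush() on 0.21+.
            out.sync();
          }
        }
        out.close(); // close finalizes the block and updates the NameNode
      }
    }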

-Todd




-- 
Todd Lipcon
Software Engineer, Cloudera

RE: Question about hdfs close * hflush behavior

Posted by kang hua <ka...@msn.com>.
Thanks, my friend! Please allow me to ask a few more detailed questions.

1. Yes, I can use hadoop fs -tail or -cat xxx to see the file content. But how can I get the file's real size from another process if the namenode has not been updated? What I really want is to read the data at the tail of the file.
2. Why is it that when I reboot hdfs, the content I flushed shows up via "hadoop fs -ls xxx"?
3. In append mode: if I close the file and reopen it in append mode again and again, the real data space increases normally, but the namenode shows dfs used space increasing too fast. Is it a bug?
4. In which version of hdfs is append bug-free?

Thanks again.
kanghua


Re: Question about hdfs close * hflush behavior

Posted by Todd Lipcon <to...@cloudera.com>.
2011/9/7 kang hua <ka...@msn.com>:
>
> Hi friends:
>    I have two questions.
>    The first one is: I use libhdfs's hflush to flush my data to a file, and
> in the same process context I can read it. But the file appears unchanged if
> I check from the hadoop shell -- its length is zero (checked with "hadoop fs
> -ls xxx" or by reading it in a program); however, when I restart HDFS, I can
> read the file's content that I flushed. Why?

If we were to update the file metadata on hflush, it would be very
expensive, since the metadata lives in the NameNode.

If you do hadoop fs -cat xxx, you should see the entirety of the flushed data.
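
For example, a reader in another process that goes through the normal
FileSystem API will see the hflushed bytes, even while -ls still reports
a stale length. A minimal untested sketch (names illustrative):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FlushedReader {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Open the file while another process still holds it open for
        // write; everything hflushed so far is readable.
        BufferedReader r = new BufferedReader(
            new InputStreamReader(fs.open(new Path(args[0])), "UTF-8"));
        String line;
        while ((line = r.readLine()) != null) {
          System.out.println(line);
        }
        r.close();
      }
    }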

>    Can I hflush data to a file without closing it, and at the same time
> read the flushed data from another process?

yes.

>
>    The second one is: once an HDFS file is closed, is the last written
> block untouched, so that even if I open the file in append mode, the
> namenode will allocate a new block for the appended data?

No, it reopens the last block of the existing file for append.
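
For reference, an append reopen has roughly this call shape (untested
sketch only; as noted below, append() itself is buggy on all 0.20
releases, so this is not a recommendation):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendCycle {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path(args[0]);
        for (int i = 0; i < 5; i++) {
          // Each append() reopens the file's last, partially filled
          // block rather than allocating a fresh block for the new data.
          FSDataOutputStream out = fs.append(p);
          out.write(("append pass " + i + "\n").getBytes("UTF-8"));
          out.close();
        }
      }
    }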

>    I find that if I close a file and reopen it in append mode again and
> again, the hdfs report shows used space much larger than the file's
> logical size.

Not sure I follow what you mean by this. Can you give more detail?

>    btw: I use cloudera ch2

The actual "append()" function has some bugs in all of the 0.20
releases, including Cloudera's. The hflush/sync() API is fine to use,
but I would recommend against using append().

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera