Posted to hdfs-user@hadoop.apache.org by Xiaobin She <xi...@gmail.com> on 2013/12/17 12:35:47 UTC

Why can't other processes see the change after calling hdfsHFlush unless hdfsCloseFile is called?

hi,

I'm using libhdfs to work with HDFS in a C++ program, and I have run into a problem.

Here is the scenario:
1. First I call hdfsOpenFile with the O_WRONLY flag to open a file.
2. I call hdfsWrite to write some data.
3. I call hdfsHFlush to flush the data; according to the header hdfs.h, after this call returns new readers should be able to see the data.
4. I send an HTTP GET request to list the files in that directory through the WebHDFS interface (I have to use WebHDFS because I need to deal with symlink files).
5. In the JSON response returned by WebHDFS, the length of the file is still 0.
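
The WebHDFS request in step 4 looks roughly like this (the host, port, and directory here are placeholders, not my real ones):

    GET http://namenode:50070/webhdfs/v1/mydir?op=LISTSTATUS

and I read the "length" field of each FileStatus entry in the JSON response.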

I have tried replacing hdfsHFlush with hdfsFlush or hdfsSync, and calling all three together, but it still doesn't work.

But if I call hdfsCloseFile after hdfsHFlush, then I can get the correct file length through the WebHDFS interface.
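
For concreteness, this is roughly the sequence I am running (a trimmed sketch; the connection settings and path below are placeholders, not my real ones):

    #include <hdfs.h>    // libhdfs
    #include <fcntl.h>   // O_WRONLY
    #include <stdio.h>
    #include <string.h>

    int main() {
        // "default" picks up fs.defaultFS from the client configuration
        hdfsFS fs = hdfsConnect("default", 0);
        if (!fs) return 1;

        hdfsFile f = hdfsOpenFile(fs, "/tmp/test.txt", O_WRONLY, 0, 0, 0);
        if (!f) return 1;

        const char *buf = "some data";
        hdfsWrite(fs, f, buf, (tSize)strlen(buf));
        hdfsHFlush(fs, f);  // per hdfs.h, new readers should now see the data

        // The length the NameNode reports (the same number WebHDFS
        // LISTSTATUS returns) is still 0 here:
        hdfsFileInfo *info = hdfsGetPathInfo(fs, "/tmp/test.txt");
        if (info) {
            printf("length before close: %lld\n", (long long)info->mSize);
            hdfsFreeFileInfo(info, 1);
        }

        hdfsCloseFile(fs, f);  // only after this is the length correct
        info = hdfsGetPathInfo(fs, "/tmp/test.txt");
        if (info) {
            printf("length after close: %lld\n", (long long)info->mSize);
            hdfsFreeFileInfo(info, 1);
        }
        hdfsDisconnect(fs);
        return 0;
    }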


Is this right? That is, if you want another process to see the changed data, do you need to call hdfsCloseFile?

Or is there something I did wrong?

thank you very much for your help.

RE: Why can't other processes see the change after calling hdfsHFlush unless hdfsCloseFile is called?

Posted by java8964 <ja...@hotmail.com>.
I don't think a file in HDFS can be written concurrently. Process B won't be able to open the file for writing (but it can read) until it is CLOSED by process A.
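
A quick sketch of how you could check that from libhdfs (the path is a placeholder, and I am assuming the refused lease shows up as hdfsOpenFile returning NULL):

    #include <hdfs.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main() {
        hdfsFS fs = hdfsConnect("default", 0);
        if (!fs) return 1;
        // Run this while process A still has /tmp/test.txt open for write.
        hdfsFile g = hdfsOpenFile(fs, "/tmp/test.txt", O_WRONLY, 0, 0, 0);
        if (g == NULL) {
            // Expected: HDFS allows a single writer per file via a lease
            // (AlreadyBeingCreatedException on the Java side).
            fprintf(stderr, "second open for write was refused\n");
        } else {
            hdfsCloseFile(fs, g);  // we got the lease after all
        }
        hdfsDisconnect(fs);
        return 0;
    }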
Yong

Date: Fri, 20 Dec 2013 15:55:00 +0800
Subject: Re: Why can't other processes see the change after calling hdfsHFlush unless hdfsCloseFile is called?
From: xiaobinshe@gmail.com
To: user@hadoop.apache.org

To Peyman,

thank you for your reply.

So the file's properties are stored in the NameNode, and they will not be updated until the file is closed.

But won't this cause problems?

For example:
1. Process A opens the file in write mode, writes 1MB of data, flushes the data, and holds the file handle open.
2. Process B opens the file in write mode, writes 1MB of data, and flushes the data.
3. Process B closes the file; at this point, other processes should be able to see the new length of the file.
4. Process C opens the file in read mode, gets the length of the file, and reads that many bytes of data.

So at this point, the last 1MB of data that process C read was written by process A? Am I right?
If this is right, then process C read an extra 1MB of data because process B closed the file, but that data was written by process A.

This seems a little weird.

2013/12/20 Peyman Mohajerian <mo...@gmail.com>

OK, I just read the book section on this (Hadoop: The Definitive Guide), just to be sure: the length of a file is stored in the NameNode, and it is updated only after the client calls the NameNode following the close of the file. At that point, if the NameNode has received all the ACKs from the DataNodes (e.g. the minimum replication is met), it will set the length metadata. So this is one of the last steps, and it is done for performance reasons; the client decides when it is done writing.

On Thu, Dec 19, 2013 at 8:36 AM, Xiaobin She <xi...@gmail.com> wrote:


To Devin,

thank you very much for your explanation.

I did find that I can read the data out of the file even if I did not close the file I'm writing to (the read operation is called on another file handle opened on the same file, but still in the same process), which made me more confused at the time, because I thought: since I can read the data from the file, why can't I get the length of the file correctly?

But from the explanation you have given, I think I understand it now.

So it seems that in order to do what I want (write some data to the file, and then get the length of the file through the WebHDFS interface), I have to open and close the file every time I do a write operation.
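
Something like this per-batch open/append/close loop, I suppose (just a sketch, assuming the cluster has append support enabled; the helper name and path are placeholders):

    #include <hdfs.h>
    #include <fcntl.h>
    #include <string.h>

    // Append one batch and close, so the NameNode updates the file
    // length and WebHDFS reports it correctly.
    static int write_batch(hdfsFS fs, const char *path, const char *data) {
        hdfsFile f = hdfsOpenFile(fs, path, O_WRONLY | O_APPEND, 0, 0, 0);
        if (!f) return -1;
        if (hdfsWrite(fs, f, data, (tSize)strlen(data)) < 0) {
            hdfsCloseFile(fs, f);
            return -1;
        }
        return hdfsCloseFile(fs, f);  // close publishes the new length
    }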




Thank you very much again.

xiaobinshe

2013/12/19 Devin Suiter RDX <ds...@rdx.com>

Hello,
In my experience with Flume, watching the HDFS sink's verbose output, I know that even after a file has been flushed, while it is still open it reads as a 0-byte file, even if there is actually data contained in the file.

A HDFS "file" is a meta-location that can accept streaming input for as long as it is open, so the length cannot be mathematically defined until a start and an end are in place.




The flush operation moves data from a buffer to a storage medium, but I don't think that necessarily means it tells the HDFS RecordWriter to place the "end of stream/EOF" marker down, since the "file" meta-location in HDFS is a pile of actual files around the cluster on physical disk that HDFS presents to you as one file. The HDFS "file" and the physical file splits on disk are distinct, and I would suspect that your HDFS flush calls are forcing Hadoop to move the physical file splits from their DataNode buffers to disk, but are not telling HDFS that you expect no further input - that is what the HDFS close does.

One thing you could try - instead of asking for the length property, which is probably unavailable until the close call, try asking for/viewing the contents of the file.
Your scenario step 3 says "according to the header hdfs.h, after this call returns, new readers should be able to see the data" which isn't the same as "new readers can obtain an updated property value from the file metadata" - one is looking at the data inside the container, and the other is asking the container to describe itself.
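
For example, something along these lines should count the bytes that are actually visible to a new reader, even while the writer still holds the file open (a sketch; the helper name and buffer size are arbitrary):

    #include <hdfs.h>
    #include <fcntl.h>

    // Count the bytes actually readable, instead of trusting the
    // length field in the file status.
    long long readable_bytes(hdfsFS fs, const char *path) {
        hdfsFile f = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
        if (!f) return -1;
        char buf[4096];
        long long total = 0;
        tSize n;
        while ((n = hdfsRead(fs, f, buf, (tSize)sizeof(buf))) > 0)
            total += n;
        hdfsCloseFile(fs, f);
        return total;
    }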




I hope that helps with your problem!

Devin Suiter
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Thu, Dec 19, 2013 at 7:50 AM, Xiaobin She <xi...@gmail.com> wrote:

sorry to reply to my own thread.

Does anyone know the answer to this question?
If so, can you please tell me if my understanding is right or wrong?

thanks.

RE: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?

Posted by java8964 <ja...@hotmail.com>.
I don't think in HDFS, a file can be written concurrently. Process B won't be able to write the file (But can read) until it is CLOSED by process A.
Yong

Date: Fri, 20 Dec 2013 15:55:00 +0800
Subject: Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?
From: xiaobinshe@gmail.com
To: user@hadoop.apache.org

To Peyman,

thank you for your reply.

So the property of the file is stored in namenode, and it will not be updated until the file is closed.

But isn't this will cause some problem ?


For example,
1. process A open the file in write mode, wirte 1MB data, flush the data, and hold the file handler opened
2. process B open the file in write mode, wirte 1MB data, flush the data
3. process B close the file, at this point, the other process should be able to see the new lenght of the file

4. process C open the file in read mode, get the length of the file, and read length bytes of data

So at this point, the last 1MB data that process C was read is written by process A ? Am I right ?
If this is right, then process C read one more 1MB data because process B close the file, but it read the data which is written by process A.

this seems a little wired.







2013/12/20 Peyman Mohajerian <mo...@gmail.com>

Ok i just read the book section on this (Definite Guide to Hadoop), just to be sure, length of a file is stored in Name Node, and its updated only after client calls Name Node after close of the file. At that point if Name Node has received all the ACK from Data Nodes then it will set the length meta-data (e.g. minimum replication is met), so one of the last steps and its for performance reasons, client decides when its done writing.



On Thu, Dec 19, 2013 at 8:36 AM, Xiaobin She <xi...@gmail.com> wrote:


To Devin,

thank you very much for your explanation.

I do found that I can read the data out of the file even if I did not close the file I'm writing to ( the read operation is call on another file handler opened on the same file but still in the same process ), which make me more confuse at that time, because I think since I can read the data from the file , why can't I get the length of the file correctly.




But from the explantion that you have described, I think I can understand it now.

So it seems in order to do what I want ( write some data to the file, and then get the length of the file throuth webhdfs interface), I have to open and close the file every time I do the write operation.




Thank you very much again.

xiaobinshe





2013/12/19 Devin Suiter RDX <ds...@rdx.com>



Hello,
In my experience with Flume, watching the HDFS Sink verbose output, I know that even after a file has flushed, but is still open, it reads as a 0-byte file, even if there is actually data contained in the file.




A HDFS "file" is a meta-location that can accept streaming input for as long as it is open, so the length cannot be mathematically defined until a start and an end are in place.




The flush operation moves data from a buffer to a storage medium, but I don't think that necessarily means that it tells the HDFS RecordWriter to place the "end of stream/EOF" marker down, since the "file" meta-location in HDFS is a pile of actual files around the cluster on physical disk that HDFS presents to you as one file. The HDFS "file" and the physical file splits on disk are distinct, and I would suspect that your HDFS flush calls are forcing Hadoop to move the physical filesplits from their physical datanode buffers to disk, but is not telling HDFS that you expect no further input - that is what the HDFS close will do.




One thing you could try - instead of asking for the length property, which is probably unavailable until the close call, try asking for/viewing the contents of the file.
Your scenario step 3 says "according to the header hdfs.h, after this call returns, new readers should be able to see the data" which isn't the same as "new readers can obtain an updated property value from the file metadata" - one is looking at the data inside the container, and the other is asking the container to describe itself.




I hope that helps with your problem!




Devin SuiterJr. Data Solutions Software Engineer100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212




Google Voice: 412-256-8556 | www.rdx.com


On Thu, Dec 19, 2013 at 7:50 AM, Xiaobin She <xi...@gmail.com> wrote:





sorry to reply to my own thread.

Does anyone know the answer to this question?
If so, can you please tell me if my understanding is right or wrong?

thanks.








2013/12/17 Xiaobin She <xi...@gmail.com>





hi, 

I'm using libhdfs to deal with hdfs in an c++ programme.

And I have encountered an problem.

here is the scenario :
1. first I call hdfsOpenFile with O_WRONLY flag to open an file






2. call hdfsWrite to write some data
3. call hdfsHFlush to flush the data,  according to the header hdfs.h, after this call returns, new readers shoule be able to see the data
4. I use an http get request to get the file list on that directionary through the webhdfs interface,






here  I have to use the webhdfs interface because I need to deal with symlink file
5. from the json response which is returned by the webhdfs, I found that the lenght of the file is still 0.

I have tried to replace hdfsHFlush with hdfsFlush or hdfsSync, or call these three together, but still doesn't work.







Buf if I call hdfsCloseFile after I call the hdfsHFlush, then I can get the correct file lenght through the webhdfs interface.


Is this right? I mean if you want the other process to see the change  of data, you need to call hdfsCloseFile?







Or is there somethings I did wrong?

thank you very much for your help.














 		 	   		  

RE: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?

Posted by java8964 <ja...@hotmail.com>.
I don't think in HDFS, a file can be written concurrently. Process B won't be able to write the file (But can read) until it is CLOSED by process A.
Yong

Date: Fri, 20 Dec 2013 15:55:00 +0800
Subject: Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?
From: xiaobinshe@gmail.com
To: user@hadoop.apache.org

To Peyman,

thank you for your reply.

So the property of the file is stored in namenode, and it will not be updated until the file is closed.

But isn't this will cause some problem ?


For example,
1. process A open the file in write mode, wirte 1MB data, flush the data, and hold the file handler opened
2. process B open the file in write mode, wirte 1MB data, flush the data
3. process B close the file, at this point, the other process should be able to see the new lenght of the file

4. process C open the file in read mode, get the length of the file, and read length bytes of data

So at this point, the last 1MB data that process C was read is written by process A ? Am I right ?
If this is right, then process C read one more 1MB data because process B close the file, but it read the data which is written by process A.

this seems a little wired.







2013/12/20 Peyman Mohajerian <mo...@gmail.com>

Ok i just read the book section on this (Definite Guide to Hadoop), just to be sure, length of a file is stored in Name Node, and its updated only after client calls Name Node after close of the file. At that point if Name Node has received all the ACK from Data Nodes then it will set the length meta-data (e.g. minimum replication is met), so one of the last steps and its for performance reasons, client decides when its done writing.



On Thu, Dec 19, 2013 at 8:36 AM, Xiaobin She <xi...@gmail.com> wrote:


To Devin,

thank you very much for your explanation.

I do found that I can read the data out of the file even if I did not close the file I'm writing to ( the read operation is call on another file handler opened on the same file but still in the same process ), which make me more confuse at that time, because I think since I can read the data from the file , why can't I get the length of the file correctly.




But from the explantion that you have described, I think I can understand it now.

So it seems in order to do what I want ( write some data to the file, and then get the length of the file throuth webhdfs interface), I have to open and close the file every time I do the write operation.




Thank you very much again.

xiaobinshe





2013/12/19 Devin Suiter RDX <ds...@rdx.com>



Hello,
In my experience with Flume, watching the HDFS Sink verbose output, I know that even after a file has flushed, but is still open, it reads as a 0-byte file, even if there is actually data contained in the file.




A HDFS "file" is a meta-location that can accept streaming input for as long as it is open, so the length cannot be mathematically defined until a start and an end are in place.




The flush operation moves data from a buffer to a storage medium, but I don't think that necessarily means that it tells the HDFS RecordWriter to place the "end of stream/EOF" marker down, since the "file" meta-location in HDFS is a pile of actual files around the cluster on physical disk that HDFS presents to you as one file. The HDFS "file" and the physical file splits on disk are distinct, and I would suspect that your HDFS flush calls are forcing Hadoop to move the physical filesplits from their physical datanode buffers to disk, but is not telling HDFS that you expect no further input - that is what the HDFS close will do.




One thing you could try - instead of asking for the length property, which is probably unavailable until the close call, try asking for/viewing the contents of the file.
Your scenario step 3 says "according to the header hdfs.h, after this call returns, new readers should be able to see the data" which isn't the same as "new readers can obtain an updated property value from the file metadata" - one is looking at the data inside the container, and the other is asking the container to describe itself.




I hope that helps with your problem!




Devin SuiterJr. Data Solutions Software Engineer100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212




Google Voice: 412-256-8556 | www.rdx.com


On Thu, Dec 19, 2013 at 7:50 AM, Xiaobin She <xi...@gmail.com> wrote:





sorry to reply to my own thread.

Does anyone know the answer to this question?
If so, can you please tell me if my understanding is right or wrong?

thanks.








2013/12/17 Xiaobin She <xi...@gmail.com>





hi, 

I'm using libhdfs to deal with hdfs in an c++ programme.

And I have encountered an problem.

here is the scenario :
1. first I call hdfsOpenFile with O_WRONLY flag to open an file






2. call hdfsWrite to write some data
3. call hdfsHFlush to flush the data,  according to the header hdfs.h, after this call returns, new readers shoule be able to see the data
4. I use an http get request to get the file list on that directionary through the webhdfs interface,






here  I have to use the webhdfs interface because I need to deal with symlink file
5. from the json response which is returned by the webhdfs, I found that the lenght of the file is still 0.

I have tried to replace hdfsHFlush with hdfsFlush or hdfsSync, or call these three together, but still doesn't work.







Buf if I call hdfsCloseFile after I call the hdfsHFlush, then I can get the correct file lenght through the webhdfs interface.


Is this right? I mean if you want the other process to see the change  of data, you need to call hdfsCloseFile?







Or is there somethings I did wrong?

thank you very much for your help.














 		 	   		  

RE: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?

Posted by java8964 <ja...@hotmail.com>.
I don't think in HDFS, a file can be written concurrently. Process B won't be able to write the file (But can read) until it is CLOSED by process A.
Yong

Date: Fri, 20 Dec 2013 15:55:00 +0800
Subject: Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?
From: xiaobinshe@gmail.com
To: user@hadoop.apache.org

To Peyman,

thank you for your reply.

So the property of the file is stored in namenode, and it will not be updated until the file is closed.

But isn't this will cause some problem ?


For example,
1. process A open the file in write mode, wirte 1MB data, flush the data, and hold the file handler opened
2. process B open the file in write mode, wirte 1MB data, flush the data
3. process B close the file, at this point, the other process should be able to see the new lenght of the file

4. process C open the file in read mode, get the length of the file, and read length bytes of data

So at this point, the last 1MB data that process C was read is written by process A ? Am I right ?
If this is right, then process C read one more 1MB data because process B close the file, but it read the data which is written by process A.

this seems a little wired.







2013/12/20 Peyman Mohajerian <mo...@gmail.com>

Ok i just read the book section on this (Definite Guide to Hadoop), just to be sure, length of a file is stored in Name Node, and its updated only after client calls Name Node after close of the file. At that point if Name Node has received all the ACK from Data Nodes then it will set the length meta-data (e.g. minimum replication is met), so one of the last steps and its for performance reasons, client decides when its done writing.



On Thu, Dec 19, 2013 at 8:36 AM, Xiaobin She <xi...@gmail.com> wrote:


To Devin,

thank you very much for your explanation.

I do found that I can read the data out of the file even if I did not close the file I'm writing to ( the read operation is call on another file handler opened on the same file but still in the same process ), which make me more confuse at that time, because I think since I can read the data from the file , why can't I get the length of the file correctly.




But from the explantion that you have described, I think I can understand it now.

So it seems in order to do what I want ( write some data to the file, and then get the length of the file throuth webhdfs interface), I have to open and close the file every time I do the write operation.




Thank you very much again.

xiaobinshe





2013/12/19 Devin Suiter RDX <ds...@rdx.com>



Hello,
In my experience with Flume, watching the HDFS Sink verbose output, I know that even after a file has flushed, but is still open, it reads as a 0-byte file, even if there is actually data contained in the file.




A HDFS "file" is a meta-location that can accept streaming input for as long as it is open, so the length cannot be mathematically defined until a start and an end are in place.




The flush operation moves data from a buffer to a storage medium, but I don't think that necessarily means that it tells the HDFS RecordWriter to place the "end of stream/EOF" marker down, since the "file" meta-location in HDFS is a pile of actual files around the cluster on physical disk that HDFS presents to you as one file. The HDFS "file" and the physical file splits on disk are distinct, and I would suspect that your HDFS flush calls are forcing Hadoop to move the physical filesplits from their physical datanode buffers to disk, but is not telling HDFS that you expect no further input - that is what the HDFS close will do.




One thing you could try - instead of asking for the length property, which is probably unavailable until the close call, try asking for/viewing the contents of the file.
Your scenario step 3 says "according to the header hdfs.h, after this call returns, new readers should be able to see the data" which isn't the same as "new readers can obtain an updated property value from the file metadata" - one is looking at the data inside the container, and the other is asking the container to describe itself.




I hope that helps with your problem!




Devin SuiterJr. Data Solutions Software Engineer100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212




Google Voice: 412-256-8556 | www.rdx.com


On Thu, Dec 19, 2013 at 7:50 AM, Xiaobin She <xi...@gmail.com> wrote:





sorry to reply to my own thread.

Does anyone know the answer to this question?
If so, can you please tell me if my understanding is right or wrong?

thanks.








2013/12/17 Xiaobin She <xi...@gmail.com>





hi, 

I'm using libhdfs to deal with hdfs in an c++ programme.

And I have encountered an problem.

here is the scenario :
1. first I call hdfsOpenFile with O_WRONLY flag to open an file






2. call hdfsWrite to write some data
3. call hdfsHFlush to flush the data,  according to the header hdfs.h, after this call returns, new readers shoule be able to see the data
4. I use an http get request to get the file list on that directionary through the webhdfs interface,






here  I have to use the webhdfs interface because I need to deal with symlink file
5. from the json response which is returned by the webhdfs, I found that the lenght of the file is still 0.

I have tried to replace hdfsHFlush with hdfsFlush or hdfsSync, or call these three together, but still doesn't work.







Buf if I call hdfsCloseFile after I call the hdfsHFlush, then I can get the correct file lenght through the webhdfs interface.


Is this right? I mean if you want the other process to see the change  of data, you need to call hdfsCloseFile?







Or is there somethings I did wrong?

thank you very much for your help.














 		 	   		  

Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?

Posted by Xiaobin She <xi...@gmail.com>.
To Peyman,

thank you for your reply.

So the property of the file is stored in namenode, and it will not be
updated until the file is closed.

But isn't this will cause some problem ?

For example,
1. process A open the file in write mode, wirte 1MB data, flush the data,
and hold the file handler opened
2. process B open the file in write mode, wirte 1MB data, flush the data
3. process B close the file, at this point, the other process should be
able to see the new lenght of the file
4. process C open the file in read mode, get the length of the file, and
read length bytes of data

So at this point, the last 1MB data that process C was read is written by
process A ? Am I right ?
If this is right, then process C read one more 1MB data because process B
close the file, but it read the data which is written by process A.
this seems a little wired.







2013/12/20 Peyman Mohajerian <mo...@gmail.com>

> Ok i just read the book section on this (Definite Guide to Hadoop), just
> to be sure, length of a file is stored in Name Node, and its updated only
> after client calls Name Node after close of the file. At that point if Name
> Node has received all the ACK from Data Nodes then it will set the length
> meta-data (e.g. minimum replication is met), so one of the last steps and
> its for performance reasons, client decides when its done writing.
>
>
> On Thu, Dec 19, 2013 at 8:36 AM, Xiaobin She <xi...@gmail.com> wrote:
>
>> To Devin,
>>
>> thank you very much for your explanation.
>>
>> I do found that I can read the data out of the file even if I did not
>> close the file I'm writing to ( the read operation is call on another file
>> handler opened on the same file but still in the same process ), which make
>> me more confuse at that time, because I think since I can read the data
>> from the file , why can't I get the length of the file correctly.
>>
>> But from the explantion that you have described, I think I can understand
>> it now.
>>
>> So it seems in order to do what I want ( write some data to the file, and
>> then get the length of the file throuth webhdfs interface), I have to open
>> and close the file every time I do the write operation.
>>
>> Thank you very much again.
>>
>> xiaobinshe
>>
>>
>>
>>
>>
>> 2013/12/19 Devin Suiter RDX <ds...@rdx.com>
>>
>>> Hello,
>>>
>>> In my experience with Flume, watching the HDFS Sink verbose output, I
>>> know that even after a file has flushed, but is still open, it reads as a
>>> 0-byte file, even if there is actually data contained in the file.
>>>
>>> A HDFS "file" is a meta-location that can accept streaming input for as
>>> long as it is open, so the length cannot be mathematically defined until a
>>> start and an end are in place.
>>>
>>> The flush operation moves data from a buffer to a storage medium, but I
>>> don't think that necessarily means that it tells the HDFS RecordWriter to
>>> place the "end of stream/EOF" marker down, since the "file" meta-location
>>> in HDFS is a pile of actual files around the cluster on physical disk that
>>> HDFS presents to you as one file. The HDFS "file" and the physical file
>>> splits on disk are distinct, and I would suspect that your HDFS flush calls
>>> are forcing Hadoop to move the physical filesplits from their physical
>>> datanode buffers to disk, but is not telling HDFS that you expect no
>>> further input - that is what the HDFS close will do.
>>>
>>> One thing you could try - instead of asking for the length property,
>>> which is probably unavailable until the close call, try asking for/viewing
>>> the contents of the file.
>>>
>>> Your scenario step 3 says "according to the header hdfs.h, after this
>>> call returns, *new readers should be able to see the data*" which isn't
>>> the same as "new readers can obtain an updated property value from the file
>>> metadata" - one is looking at the data inside the container, and the other
>>> is asking the container to describe itself.
>>>
>>> I hope that helps with your problem!
>>>
>>>
>>> *Devin Suiter*
>>> Jr. Data Solutions Software Engineer
>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>> Google Voice: 412-256-8556 | www.rdx.com
>>>
>>>
>>> On Thu, Dec 19, 2013 at 7:50 AM, Xiaobin She <xi...@gmail.com>wrote:
>>>
>>>>
>>>> sorry to reply to my own thread.
>>>>
>>>> Does anyone know the answer to this question?
>>>> If so, can you please tell me if my understanding is right or wrong?
>>>>
>>>> thanks.
>>>>
>>>>
>>>>
>>>> 2013/12/17 Xiaobin She <xi...@gmail.com>
>>>>
>>>>> hi,
>>>>>
>>>>> I'm using libhdfs to deal with hdfs in an c++ programme.
>>>>>
>>>>> And I have encountered an problem.
>>>>>
>>>>> here is the scenario :
>>>>> 1. first I call hdfsOpenFile with O_WRONLY flag to open an file
>>>>> 2. call hdfsWrite to write some data
>>>>> 3. call hdfsHFlush to flush the data,  according to the header hdfs.h,
>>>>> after this call returns, new readers shoule be able to see the data
>>>>> 4. I use an http get request to get the file list on that directionary
>>>>> through the webhdfs interface,
>>>>> here  I have to use the webhdfs interface because I need to deal with
>>>>> symlink file
>>>>> 5. from the json response which is returned by the webhdfs, I found
>>>>> that the lenght of the file is still 0.
>>>>>
>>>>> I have tried to replace hdfsHFlush with hdfsFlush or hdfsSync, or call
>>>>> these three together, but still doesn't work.
>>>>>
>>>>> Buf if I call hdfsCloseFile after I call the hdfsHFlush, then I can
>>>>> get the correct file lenght through the webhdfs interface.
>>>>>
>>>>>
>>>>> Is this right? I mean if you want the other process to see the change
>>>>> of data, you need to call hdfsCloseFile?
>>>>>
>>>>> Or is there somethings I did wrong?
>>>>>
>>>>> thank you very much for your help.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?

Posted by Xiaobin She <xi...@gmail.com>.
To Peyman,

thank you for your reply.

So the property of the file is stored in namenode, and it will not be
updated until the file is closed.

But isn't this will cause some problem ?

For example,
1. process A open the file in write mode, wirte 1MB data, flush the data,
and hold the file handler opened
2. process B open the file in write mode, wirte 1MB data, flush the data
3. process B close the file, at this point, the other process should be
able to see the new lenght of the file
4. process C open the file in read mode, get the length of the file, and
read length bytes of data

So at this point, the last 1MB data that process C was read is written by
process A ? Am I right ?
If this is right, then process C read one more 1MB data because process B
close the file, but it read the data which is written by process A.
this seems a little wired.







2013/12/20 Peyman Mohajerian <mo...@gmail.com>

> Ok i just read the book section on this (Definite Guide to Hadoop), just
> to be sure, length of a file is stored in Name Node, and its updated only
> after client calls Name Node after close of the file. At that point if Name
> Node has received all the ACK from Data Nodes then it will set the length
> meta-data (e.g. minimum replication is met), so one of the last steps and
> its for performance reasons, client decides when its done writing.
>
>
> On Thu, Dec 19, 2013 at 8:36 AM, Xiaobin She <xi...@gmail.com> wrote:
>
>> To Devin,
>>
>> thank you very much for your explanation.
>>
>> I do found that I can read the data out of the file even if I did not
>> close the file I'm writing to ( the read operation is call on another file
>> handler opened on the same file but still in the same process ), which make
>> me more confuse at that time, because I think since I can read the data
>> from the file , why can't I get the length of the file correctly.
>>
>> But from the explantion that you have described, I think I can understand
>> it now.
>>
>> So it seems in order to do what I want ( write some data to the file, and
>> then get the length of the file throuth webhdfs interface), I have to open
>> and close the file every time I do the write operation.
>>
>> Thank you very much again.
>>
>> xiaobinshe
>>
>>
>>
>>
>>
>> 2013/12/19 Devin Suiter RDX <ds...@rdx.com>
>>
>>> Hello,
>>>
>>> In my experience with Flume, watching the HDFS Sink verbose output, I
>>> know that even after a file has flushed, but is still open, it reads as a
>>> 0-byte file, even if there is actually data contained in the file.
>>>
>>> A HDFS "file" is a meta-location that can accept streaming input for as
>>> long as it is open, so the length cannot be mathematically defined until a
>>> start and an end are in place.
>>>
>>> The flush operation moves data from a buffer to a storage medium, but I
>>> don't think that necessarily means that it tells the HDFS RecordWriter to
>>> place the "end of stream/EOF" marker down, since the "file" meta-location
>>> in HDFS is a pile of actual files around the cluster on physical disk that
>>> HDFS presents to you as one file. The HDFS "file" and the physical file
>>> splits on disk are distinct, and I would suspect that your HDFS flush calls
>>> are forcing Hadoop to move the physical filesplits from their physical
>>> datanode buffers to disk, but is not telling HDFS that you expect no
>>> further input - that is what the HDFS close will do.
>>>
>>> One thing you could try - instead of asking for the length property,
>>> which is probably unavailable until the close call, try asking for/viewing
>>> the contents of the file.
>>>
>>> Your scenario step 3 says "according to the header hdfs.h, after this
>>> call returns, *new readers should be able to see the data*" which isn't
>>> the same as "new readers can obtain an updated property value from the file
>>> metadata" - one is looking at the data inside the container, and the other
>>> is asking the container to describe itself.
>>>
>>> I hope that helps with your problem!
>>>
>>>
>>> *Devin Suiter*
>>> Jr. Data Solutions Software Engineer
>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>> Google Voice: 412-256-8556 | www.rdx.com
>>>
>>>
>>> On Thu, Dec 19, 2013 at 7:50 AM, Xiaobin She <xi...@gmail.com>wrote:
>>>
>>>>
>>>> sorry to reply to my own thread.
>>>>
>>>> Does anyone know the answer to this question?
>>>> If so, can you please tell me if my understanding is right or wrong?
>>>>
>>>> thanks.
>>>>
>>>>
>>>>
>>>> 2013/12/17 Xiaobin She <xi...@gmail.com>
>>>>
>>>>> hi,
>>>>>
>>>>> I'm using libhdfs to deal with hdfs in an c++ programme.
>>>>>
>>>>> And I have encountered an problem.
>>>>>
>>>>> here is the scenario :
>>>>> 1. first I call hdfsOpenFile with O_WRONLY flag to open an file
>>>>> 2. call hdfsWrite to write some data
>>>>> 3. call hdfsHFlush to flush the data,  according to the header hdfs.h,
>>>>> after this call returns, new readers shoule be able to see the data
>>>>> 4. I use an http get request to get the file list on that directionary
>>>>> through the webhdfs interface,
>>>>> here  I have to use the webhdfs interface because I need to deal with
>>>>> symlink file
>>>>> 5. from the json response which is returned by the webhdfs, I found
>>>>> that the lenght of the file is still 0.
>>>>>
>>>>> I have tried to replace hdfsHFlush with hdfsFlush or hdfsSync, or call
>>>>> these three together, but still doesn't work.
>>>>>
>>>>> Buf if I call hdfsCloseFile after I call the hdfsHFlush, then I can
>>>>> get the correct file lenght through the webhdfs interface.
>>>>>
>>>>>
>>>>> Is this right? I mean if you want the other process to see the change
>>>>> of data, you need to call hdfsCloseFile?
>>>>>
>>>>> Or is there somethings I did wrong?
>>>>>
>>>>> thank you very much for your help.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?

Posted by Xiaobin She <xi...@gmail.com>.
To Peyman,

thank you for your reply.

So the property of the file is stored in namenode, and it will not be
updated until the file is closed.

But isn't this will cause some problem ?

For example,
1. process A open the file in write mode, wirte 1MB data, flush the data,
and hold the file handler opened
2. process B open the file in write mode, wirte 1MB data, flush the data
3. process B close the file, at this point, the other process should be
able to see the new lenght of the file
4. process C open the file in read mode, get the length of the file, and
read length bytes of data

So at this point, the last 1MB data that process C was read is written by
process A ? Am I right ?
If this is right, then process C read one more 1MB data because process B
close the file, but it read the data which is written by process A.
this seems a little wired.







2013/12/20 Peyman Mohajerian <mo...@gmail.com>

> Ok i just read the book section on this (Definite Guide to Hadoop), just
> to be sure, length of a file is stored in Name Node, and its updated only
> after client calls Name Node after close of the file. At that point if Name
> Node has received all the ACK from Data Nodes then it will set the length
> meta-data (e.g. minimum replication is met), so one of the last steps and
> its for performance reasons, client decides when its done writing.
>
>
> On Thu, Dec 19, 2013 at 8:36 AM, Xiaobin She <xi...@gmail.com> wrote:
>
>> To Devin,
>>
>> thank you very much for your explanation.
>>
>> I do found that I can read the data out of the file even if I did not
>> close the file I'm writing to ( the read operation is call on another file
>> handler opened on the same file but still in the same process ), which make
>> me more confuse at that time, because I think since I can read the data
>> from the file , why can't I get the length of the file correctly.
>>
>> But from the explantion that you have described, I think I can understand
>> it now.
>>
>> So it seems in order to do what I want ( write some data to the file, and
>> then get the length of the file throuth webhdfs interface), I have to open
>> and close the file every time I do the write operation.
>>
>> Thank you very much again.
>>
>> xiaobinshe
>>
>>
>>
>>
>>
>> 2013/12/19 Devin Suiter RDX <ds...@rdx.com>
>>
>>> Hello,
>>>
>>> In my experience with Flume, watching the HDFS Sink verbose output, I
>>> know that even after a file has flushed, but is still open, it reads as a
>>> 0-byte file, even if there is actually data contained in the file.
>>>
>>> A HDFS "file" is a meta-location that can accept streaming input for as
>>> long as it is open, so the length cannot be mathematically defined until a
>>> start and an end are in place.
>>>
>>> The flush operation moves data from a buffer to a storage medium, but I
>>> don't think that necessarily means that it tells the HDFS RecordWriter to
>>> place the "end of stream/EOF" marker down, since the "file" meta-location
>>> in HDFS is a pile of actual files around the cluster on physical disk that
>>> HDFS presents to you as one file. The HDFS "file" and the physical file
>>> splits on disk are distinct, and I would suspect that your HDFS flush calls
>>> are forcing Hadoop to move the physical filesplits from their physical
>>> datanode buffers to disk, but is not telling HDFS that you expect no
>>> further input - that is what the HDFS close will do.
>>>
>>> One thing you could try - instead of asking for the length property,
>>> which is probably unavailable until the close call, try asking for/viewing
>>> the contents of the file.
>>>
>>> Your scenario step 3 says "according to the header hdfs.h, after this
>>> call returns, *new readers should be able to see the data*" which isn't
>>> the same as "new readers can obtain an updated property value from the file
>>> metadata" - one is looking at the data inside the container, and the other
>>> is asking the container to describe itself.
>>>
>>> I hope that helps with your problem!
>>>
>>>
>>> *Devin Suiter*
>>> Jr. Data Solutions Software Engineer
>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>> Google Voice: 412-256-8556 | www.rdx.com
>>>
>>>
>>> On Thu, Dec 19, 2013 at 7:50 AM, Xiaobin She <xi...@gmail.com>wrote:
>>>
>>>>
>>>> sorry to reply to my own thread.
>>>>
>>>> Does anyone know the answer to this question?
>>>> If so, can you please tell me if my understanding is right or wrong?
>>>>
>>>> thanks.
>>>>
>>>>
>>>>
>>>> 2013/12/17 Xiaobin She <xi...@gmail.com>
>>>>
>>>>> hi,
>>>>>
>>>>> I'm using libhdfs to deal with hdfs in an c++ programme.
>>>>>
>>>>> And I have encountered an problem.
>>>>>
>>>>> here is the scenario :
>>>>> 1. first I call hdfsOpenFile with O_WRONLY flag to open an file
>>>>> 2. call hdfsWrite to write some data
>>>>> 3. call hdfsHFlush to flush the data,  according to the header hdfs.h,
>>>>> after this call returns, new readers shoule be able to see the data
>>>>> 4. I use an http get request to get the file list on that directionary
>>>>> through the webhdfs interface,
>>>>> here  I have to use the webhdfs interface because I need to deal with
>>>>> symlink file
>>>>> 5. from the json response which is returned by the webhdfs, I found
>>>>> that the lenght of the file is still 0.
>>>>>
>>>>> I have tried to replace hdfsHFlush with hdfsFlush or hdfsSync, or call
>>>>> these three together, but still doesn't work.
>>>>>
>>>>> Buf if I call hdfsCloseFile after I call the hdfsHFlush, then I can
>>>>> get the correct file lenght through the webhdfs interface.
>>>>>
>>>>>
>>>>> Is this right? I mean if you want the other process to see the change
>>>>> of data, you need to call hdfsCloseFile?
>>>>>
>>>>> Or is there somethings I did wrong?
>>>>>
>>>>> thank you very much for your help.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?

Posted by Xiaobin She <xi...@gmail.com>.
To Peyman,

thank you for your reply.

So the property of the file is stored in namenode, and it will not be
updated until the file is closed.

But isn't this will cause some problem ?

For example,
1. process A open the file in write mode, wirte 1MB data, flush the data,
and hold the file handler opened
2. process B open the file in write mode, wirte 1MB data, flush the data
3. process B close the file, at this point, the other process should be
able to see the new lenght of the file
4. process C open the file in read mode, get the length of the file, and
read length bytes of data

So at this point, the last 1MB data that process C was read is written by
process A ? Am I right ?
If this is right, then process C read one more 1MB data because process B
close the file, but it read the data which is written by process A.
this seems a little wired.







2013/12/20 Peyman Mohajerian <mo...@gmail.com>

> Ok i just read the book section on this (Definite Guide to Hadoop), just
> to be sure, length of a file is stored in Name Node, and its updated only
> after client calls Name Node after close of the file. At that point if Name
> Node has received all the ACK from Data Nodes then it will set the length
> meta-data (e.g. minimum replication is met), so one of the last steps and
> its for performance reasons, client decides when its done writing.
>
>
> On Thu, Dec 19, 2013 at 8:36 AM, Xiaobin She <xi...@gmail.com> wrote:
>
>> To Devin,
>>
>> thank you very much for your explanation.
>>
>> I do found that I can read the data out of the file even if I did not
>> close the file I'm writing to ( the read operation is call on another file
>> handler opened on the same file but still in the same process ), which make
>> me more confuse at that time, because I think since I can read the data
>> from the file , why can't I get the length of the file correctly.
>>
>> But from the explantion that you have described, I think I can understand
>> it now.
>>
>> So it seems in order to do what I want ( write some data to the file, and
>> then get the length of the file throuth webhdfs interface), I have to open
>> and close the file every time I do the write operation.
>>
>> Thank you very much again.
>>
>> xiaobinshe
>>
>>
>>
>>
>>
>> 2013/12/19 Devin Suiter RDX <ds...@rdx.com>
>>
>>> Hello,
>>>
>>> In my experience with Flume, watching the HDFS Sink verbose output, I
>>> know that even after a file has flushed, but is still open, it reads as a
>>> 0-byte file, even if there is actually data contained in the file.
>>>
>>> A HDFS "file" is a meta-location that can accept streaming input for as
>>> long as it is open, so the length cannot be mathematically defined until a
>>> start and an end are in place.
>>>
>>> The flush operation moves data from a buffer to a storage medium, but I
>>> don't think that necessarily means that it tells the HDFS RecordWriter to
>>> place the "end of stream/EOF" marker down, since the "file" meta-location
>>> in HDFS is a pile of actual files around the cluster on physical disk that
>>> HDFS presents to you as one file. The HDFS "file" and the physical file
>>> splits on disk are distinct, and I would suspect that your HDFS flush calls
>>> are forcing Hadoop to move the physical filesplits from their physical
>>> datanode buffers to disk, but is not telling HDFS that you expect no
>>> further input - that is what the HDFS close will do.
>>>
>>> One thing you could try - instead of asking for the length property,
>>> which is probably unavailable until the close call, try asking for/viewing
>>> the contents of the file.
>>>
>>> Your scenario step 3 says "according to the header hdfs.h, after this
>>> call returns, *new readers should be able to see the data*" which isn't
>>> the same as "new readers can obtain an updated property value from the file
>>> metadata" - one is looking at the data inside the container, and the other
>>> is asking the container to describe itself.
>>>
>>> I hope that helps with your problem!
>>>
>>>
>>> *Devin Suiter*
>>> Jr. Data Solutions Software Engineer
>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>> Google Voice: 412-256-8556 | www.rdx.com
>>>
>>>
>>> On Thu, Dec 19, 2013 at 7:50 AM, Xiaobin She <xi...@gmail.com>wrote:
>>>
>>>>
>>>> sorry to reply to my own thread.
>>>>
>>>> Does anyone know the answer to this question?
>>>> If so, can you please tell me if my understanding is right or wrong?
>>>>
>>>> thanks.
>>>>
>>>>
>>>>
>>>> 2013/12/17 Xiaobin She <xi...@gmail.com>
>>>>
>>>>> hi,
>>>>>
>>>>> I'm using libhdfs to deal with hdfs in an c++ programme.
>>>>>
>>>>> And I have encountered an problem.
>>>>>
>>>>> here is the scenario :
>>>>> 1. first I call hdfsOpenFile with O_WRONLY flag to open an file
>>>>> 2. call hdfsWrite to write some data
>>>>> 3. call hdfsHFlush to flush the data,  according to the header hdfs.h,
>>>>> after this call returns, new readers shoule be able to see the data
>>>>> 4. I use an http get request to get the file list on that directionary
>>>>> through the webhdfs interface,
>>>>> here  I have to use the webhdfs interface because I need to deal with
>>>>> symlink file
>>>>> 5. from the json response which is returned by the webhdfs, I found
>>>>> that the lenght of the file is still 0.
>>>>>
>>>>> I have tried to replace hdfsHFlush with hdfsFlush or hdfsSync, or call
>>>>> these three together, but still doesn't work.
>>>>>
>>>>> Buf if I call hdfsCloseFile after I call the hdfsHFlush, then I can
>>>>> get the correct file lenght through the webhdfs interface.
>>>>>
>>>>>
>>>>> Is this right? I mean if you want the other process to see the change
>>>>> of data, you need to call hdfsCloseFile?
>>>>>
>>>>> Or is there somethings I did wrong?
>>>>>
>>>>> thank you very much for your help.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?

Posted by Peyman Mohajerian <mo...@gmail.com>.
Ok i just read the book section on this (Definite Guide to Hadoop), just to
be sure, length of a file is stored in Name Node, and its updated only
after client calls Name Node after close of the file. At that point if Name
Node has received all the ACK from Data Nodes then it will set the length
meta-data (e.g. minimum replication is met), so one of the last steps and
its for performance reasons, client decides when its done writing.


On Thu, Dec 19, 2013 at 8:36 AM, Xiaobin She <xi...@gmail.com> wrote:

> To Devin,
>
> thank you very much for your explanation.
>
> I do found that I can read the data out of the file even if I did not
> close the file I'm writing to ( the read operation is call on another file
> handler opened on the same file but still in the same process ), which make
> me more confuse at that time, because I think since I can read the data
> from the file , why can't I get the length of the file correctly.
>
> But from the explantion that you have described, I think I can understand
> it now.
>
> So it seems in order to do what I want ( write some data to the file, and
> then get the length of the file throuth webhdfs interface), I have to open
> and close the file every time I do the write operation.
>
> Thank you very much again.
>
> xiaobinshe
>
>
>
>
>
> 2013/12/19 Devin Suiter RDX <ds...@rdx.com>
>
>> Hello,
>>
>> In my experience with Flume, watching the HDFS Sink verbose output, I
>> know that even after a file has flushed, but is still open, it reads as a
>> 0-byte file, even if there is actually data contained in the file.
>>
>> A HDFS "file" is a meta-location that can accept streaming input for as
>> long as it is open, so the length cannot be mathematically defined until a
>> start and an end are in place.
>>
>> The flush operation moves data from a buffer to a storage medium, but I
>> don't think that necessarily means that it tells the HDFS RecordWriter to
>> place the "end of stream/EOF" marker down, since the "file" meta-location
>> in HDFS is a pile of actual files around the cluster on physical disk that
>> HDFS presents to you as one file. The HDFS "file" and the physical file
>> splits on disk are distinct, and I would suspect that your HDFS flush calls
>> are forcing Hadoop to move the physical filesplits from their physical
>> datanode buffers to disk, but is not telling HDFS that you expect no
>> further input - that is what the HDFS close will do.
>>
>> One thing you could try - instead of asking for the length property,
>> which is probably unavailable until the close call, try asking for/viewing
>> the contents of the file.
>>
>> Your scenario step 3 says "according to the header hdfs.h, after this
>> call returns, *new readers should be able to see the data*" which isn't
>> the same as "new readers can obtain an updated property value from the file
>> metadata" - one is looking at the data inside the container, and the other
>> is asking the container to describe itself.
>>
>> I hope that helps with your problem!
>>
>>
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>>
>> On Thu, Dec 19, 2013 at 7:50 AM, Xiaobin She <xi...@gmail.com>wrote:
>>
>>>
>>> sorry to reply to my own thread.
>>>
>>> Does anyone know the answer to this question?
>>> If so, can you please tell me if my understanding is right or wrong?
>>>
>>> thanks.
>>>
>>>
>>>
>>> 2013/12/17 Xiaobin She <xi...@gmail.com>
>>>
>>>> hi,
>>>>
>>>> I'm using libhdfs to deal with hdfs in an c++ programme.
>>>>
>>>> And I have encountered an problem.
>>>>
>>>> here is the scenario :
>>>> 1. first I call hdfsOpenFile with O_WRONLY flag to open an file
>>>> 2. call hdfsWrite to write some data
>>>> 3. call hdfsHFlush to flush the data,  according to the header hdfs.h,
>>>> after this call returns, new readers shoule be able to see the data
>>>> 4. I use an http get request to get the file list on that directionary
>>>> through the webhdfs interface,
>>>> here  I have to use the webhdfs interface because I need to deal with
>>>> symlink file
>>>> 5. from the json response which is returned by the webhdfs, I found
>>>> that the lenght of the file is still 0.
>>>>
>>>> I have tried to replace hdfsHFlush with hdfsFlush or hdfsSync, or call
>>>> these three together, but still doesn't work.
>>>>
>>>> Buf if I call hdfsCloseFile after I call the hdfsHFlush, then I can get
>>>> the correct file lenght through the webhdfs interface.
>>>>
>>>>
>>>> Is this right? I mean if you want the other process to see the change
>>>> of data, you need to call hdfsCloseFile?
>>>>
>>>> Or is there somethings I did wrong?
>>>>
>>>> thank you very much for your help.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?

Posted by Peyman Mohajerian <mo...@gmail.com>.
Ok i just read the book section on this (Definite Guide to Hadoop), just to
be sure, length of a file is stored in Name Node, and its updated only
after client calls Name Node after close of the file. At that point if Name
Node has received all the ACK from Data Nodes then it will set the length
meta-data (e.g. minimum replication is met), so one of the last steps and
its for performance reasons, client decides when its done writing.


On Thu, Dec 19, 2013 at 8:36 AM, Xiaobin She <xi...@gmail.com> wrote:

> To Devin,
>
> thank you very much for your explanation.
>
> I do found that I can read the data out of the file even if I did not
> close the file I'm writing to ( the read operation is call on another file
> handler opened on the same file but still in the same process ), which make
> me more confuse at that time, because I think since I can read the data
> from the file , why can't I get the length of the file correctly.
>
> But from the explantion that you have described, I think I can understand
> it now.
>
> So it seems in order to do what I want ( write some data to the file, and
> then get the length of the file throuth webhdfs interface), I have to open
> and close the file every time I do the write operation.
>
> Thank you very much again.
>
> xiaobinshe
>
>
>
>
>
> 2013/12/19 Devin Suiter RDX <ds...@rdx.com>
>
>> Hello,
>>
>> In my experience with Flume, watching the HDFS Sink verbose output, I
>> know that even after a file has flushed, but is still open, it reads as a
>> 0-byte file, even if there is actually data contained in the file.
>>
>> A HDFS "file" is a meta-location that can accept streaming input for as
>> long as it is open, so the length cannot be mathematically defined until a
>> start and an end are in place.
>>
>> The flush operation moves data from a buffer to a storage medium, but I
>> don't think that necessarily means that it tells the HDFS RecordWriter to
>> place the "end of stream/EOF" marker down, since the "file" meta-location
>> in HDFS is a pile of actual files around the cluster on physical disk that
>> HDFS presents to you as one file. The HDFS "file" and the physical file
>> splits on disk are distinct, and I would suspect that your HDFS flush calls
>> are forcing Hadoop to move the physical filesplits from their physical
>> datanode buffers to disk, but is not telling HDFS that you expect no
>> further input - that is what the HDFS close will do.
>>
>> One thing you could try - instead of asking for the length property,
>> which is probably unavailable until the close call, try asking for/viewing
>> the contents of the file.
>>
>> Your scenario step 3 says "according to the header hdfs.h, after this
>> call returns, *new readers should be able to see the data*" which isn't
>> the same as "new readers can obtain an updated property value from the file
>> metadata" - one is looking at the data inside the container, and the other
>> is asking the container to describe itself.
>>
>> I hope that helps with your problem!
>>
>>
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>>
>> On Thu, Dec 19, 2013 at 7:50 AM, Xiaobin She <xi...@gmail.com>wrote:
>>
>>>
>>> sorry to reply to my own thread.
>>>
>>> Does anyone know the answer to this question?
>>> If so, can you please tell me if my understanding is right or wrong?
>>>
>>> thanks.
>>>
>>>
>>>
>>> 2013/12/17 Xiaobin She <xi...@gmail.com>
>>>
>>>> hi,
>>>>
>>>> I'm using libhdfs to deal with hdfs in a C++ program.
>>>>
>>>> And I have encountered a problem.
>>>>
>>>> here is the scenario:
>>>> 1. first I call hdfsOpenFile with the O_WRONLY flag to open a file
>>>> 2. call hdfsWrite to write some data
>>>> 3. call hdfsHFlush to flush the data; according to the header hdfs.h,
>>>> after this call returns, new readers should be able to see the data
>>>> 4. I use an http get request to get the file list in that directory
>>>> through the webhdfs interface,
>>>> here I have to use the webhdfs interface because I need to deal with
>>>> symlink files
>>>> 5. from the json response which is returned by webhdfs, I found
>>>> that the length of the file is still 0.
>>>>
>>>> I have tried to replace hdfsHFlush with hdfsFlush or hdfsSync, or to
>>>> call all three together, but it still doesn't work.
>>>>
>>>> But if I call hdfsCloseFile after I call hdfsHFlush, then I can get
>>>> the correct file length through the webhdfs interface.
>>>>
>>>>
>>>> Is this right? I mean, if you want other processes to see the change
>>>> of data, you need to call hdfsCloseFile?
>>>>
>>>> Or is there something I did wrong?
>>>>
>>>> Thank you very much for your help.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?

Posted by Xiaobin She <xi...@gmail.com>.
To Devin,

thank you very much for your explanation.

I did find that I can read the data out of the file even if I did not close
the file I'm writing to (the read is done through another file handle opened
on the same file, but still in the same process), which made me more confused
at the time, because I thought: since I can read the data from the file, why
can't I get the length of the file correctly?
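
Something like this minimal libhdfs sketch shows what I mean (the connection
string and path are just placeholders, and error checks are omitted):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include "hdfs.h"

    int main() {
        hdfsFS fs = hdfsConnect("default", 0);            /* placeholder */
        const char *path = "/tmp/open_reader_demo.txt";   /* placeholder */
        hdfsFile w = hdfsOpenFile(fs, path, O_WRONLY, 0, 0, 0);
        const char *msg = "flushed but not closed\n";
        hdfsWrite(fs, w, msg, (tSize)strlen(msg));
        hdfsHFlush(fs, w);                  /* writer handle stays open */

        /* a second, independent handle on the same path can read the data */
        hdfsFile r = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
        char out[128];
        tSize got = hdfsRead(fs, r, out, (tSize)sizeof(out));
        printf("read %d bytes while the writer is still open\n", (int)got);
        hdfsCloseFile(fs, r);

        hdfsCloseFile(fs, w);
        hdfsDisconnect(fs);
        return 0;
    }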

But from the explanation you have given, I think I understand it now.

So it seems that in order to do what I want (write some data to the file, and
then get the length of the file through the webhdfs interface), I have to
open and close the file every time I do a write operation.
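
A minimal sketch of that open-write-close loop (the placeholders again;
whether the O_APPEND reopen works depends on the cluster's append support):

    #include <fcntl.h>
    #include <string.h>
    #include "hdfs.h"

    int main() {
        hdfsFS fs = hdfsConnect("default", 0);           /* placeholder */
        const char *path = "/tmp/batched_writes.txt";    /* placeholder */
        for (int batch = 0; batch < 3; batch++) {
            int flags = (batch == 0) ? O_WRONLY : (O_WRONLY | O_APPEND);
            hdfsFile f = hdfsOpenFile(fs, path, flags, 0, 0, 0);
            const char *msg = "one batch of records\n";
            hdfsWrite(fs, f, msg, (tSize)strlen(msg));
            /* each close makes the NameNode publish a fresh length,
               so webhdfs GETFILESTATUS can see it */
            hdfsCloseFile(fs, f);
        }
        hdfsDisconnect(fs);
        return 0;
    }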

Thank you very much again.

xiaobinshe





2013/12/19 Devin Suiter RDX <ds...@rdx.com>

> Hello,
>
> In my experience with Flume, watching the HDFS Sink verbose output, I know
> that even after a file has flushed, but is still open, it reads as a 0-byte
> file, even if there is actually data contained in the file.
>
> An HDFS "file" is a meta-location that can accept streaming input for as
> long as it is open, so the length cannot be mathematically defined until a
> start and an end are in place.
>
> The flush operation moves data from a buffer to a storage medium, but I
> don't think that necessarily means that it tells the HDFS RecordWriter to
> place the "end of stream/EOF" marker down, since the "file" meta-location
> in HDFS is a pile of actual files around the cluster on physical disk that
> HDFS presents to you as one file. The HDFS "file" and the physical file
> splits on disk are distinct, and I would suspect that your HDFS flush calls
> are forcing Hadoop to move the physical filesplits from their physical
> datanode buffers to disk, but is not telling HDFS that you expect no
> further input - that is what the HDFS close will do.
>
> One thing you could try - instead of asking for the length property, which
> is probably unavailable until the close call, try asking for/viewing the
> contents of the file.
>
> Your scenario step 3 says "according to the header hdfs.h, after this
> call returns, *new readers should be able to see the data*" which isn't
> the same as "new readers can obtain an updated property value from the file
> metadata" - one is looking at the data inside the container, and the other
> is asking the container to describe itself.
>
> I hope that helps with your problem!
>
>
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
>
> On Thu, Dec 19, 2013 at 7:50 AM, Xiaobin She <xi...@gmail.com> wrote:
>
>>
>> sorry to reply to my own thread.
>>
>> Does anyone know the answer to this question?
>> If so, can you please tell me if my understanding is right or wrong?
>>
>> thanks.
>>
>>
>>
>> 2013/12/17 Xiaobin She <xi...@gmail.com>
>>
>>> hi,
>>>
>>> I'm using libhdfs to deal with hdfs in a C++ program.
>>>
>>> And I have encountered a problem.
>>>
>>> here is the scenario:
>>> 1. first I call hdfsOpenFile with the O_WRONLY flag to open a file
>>> 2. call hdfsWrite to write some data
>>> 3. call hdfsHFlush to flush the data; according to the header hdfs.h,
>>> after this call returns, new readers should be able to see the data
>>> 4. I use an http get request to get the file list in that directory
>>> through the webhdfs interface,
>>> here I have to use the webhdfs interface because I need to deal with
>>> symlink files
>>> 5. from the json response which is returned by webhdfs, I found that
>>> the length of the file is still 0.
>>>
>>> I have tried to replace hdfsHFlush with hdfsFlush or hdfsSync, or to
>>> call all three together, but it still doesn't work.
>>>
>>> But if I call hdfsCloseFile after I call hdfsHFlush, then I can get
>>> the correct file length through the webhdfs interface.
>>>
>>>
>>> Is this right? I mean, if you want other processes to see the change
>>> of data, you need to call hdfsCloseFile?
>>>
>>> Or is there something I did wrong?
>>>
>>> Thank you very much for your help.
>>>
>>>
>>>
>>>
>>>
>>
>

Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?

Posted by Devin Suiter RDX <ds...@rdx.com>.
Hello,

In my experience with Flume, watching the HDFS Sink verbose output, I know
that even after a file has flushed, but is still open, it reads as a 0-byte
file, even if there is actually data contained in the file.

An HDFS "file" is a meta-location that can accept streaming input for as
long as it is open, so the length cannot be mathematically defined until a
start and an end are in place.

The flush operation moves data from a buffer to a storage medium, but I
don't think that necessarily means that it tells the HDFS RecordWriter to
place the "end of stream/EOF" marker down, since the "file" meta-location
in HDFS is a pile of actual files around the cluster on physical disk that
HDFS presents to you as one file. The HDFS "file" and the physical file
splits on disk are distinct, and I would suspect that your HDFS flush calls
are forcing Hadoop to move the physical filesplits from their physical
datanode buffers to disk, but is not telling HDFS that you expect no
further input - that is what the HDFS close will do.

One thing you could try - instead of asking for the length property, which
is probably unavailable until the close call, try asking for/viewing the
contents of the file.
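
For example, you could hit both webhdfs endpoints for the same path and
compare what comes back (host and port here are placeholders for your
NameNode's HTTP address):

    http://<namenode>:50070/webhdfs/v1/<path>?op=GETFILESTATUS   (file status - length may still show 0)
    http://<namenode>:50070/webhdfs/v1/<path>?op=OPEN            (file contents - the data written so far)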

Your scenario step 3 says "according to the header hdfs.h, after this call
returns, *new readers should be able to see the data*" which isn't the same
as "new readers can obtain an updated property value from the file
metadata" - one is looking at the data inside the container, and the other
is asking the container to describe itself.

I hope that helps with your problem!


*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Thu, Dec 19, 2013 at 7:50 AM, Xiaobin She <xi...@gmail.com> wrote:

>
> sorry to reply to my own thread.
>
> Does anyone know the answer to this question?
> If so, can you please tell me if my understanding is right or wrong?
>
> thanks.
>
>
>
> 2013/12/17 Xiaobin She <xi...@gmail.com>
>
>> hi,
>>
>> I'm using libhdfs to deal with hdfs in a C++ program.
>>
>> And I have encountered a problem.
>>
>> here is the scenario:
>> 1. first I call hdfsOpenFile with the O_WRONLY flag to open a file
>> 2. call hdfsWrite to write some data
>> 3. call hdfsHFlush to flush the data; according to the header hdfs.h,
>> after this call returns, new readers should be able to see the data
>> 4. I use an http get request to get the file list in that directory
>> through the webhdfs interface,
>> here I have to use the webhdfs interface because I need to deal with
>> symlink files
>> 5. from the json response which is returned by webhdfs, I found that
>> the length of the file is still 0.
>>
>> I have tried to replace hdfsHFlush with hdfsFlush or hdfsSync, or to
>> call all three together, but it still doesn't work.
>>
>> But if I call hdfsCloseFile after I call hdfsHFlush, then I can get
>> the correct file length through the webhdfs interface.
>>
>>
>> Is this right? I mean, if you want other processes to see the change
>> of data, you need to call hdfsCloseFile?
>>
>> Or is there something I did wrong?
>>
>> Thank you very much for your help.
>>
>>
>>
>>
>>
>

Re: Why other process can't see the change after calling hdfsHFlush unless hdfsCloseFile is called?

Posted by Xiaobin She <xi...@gmail.com>.
sorry to reply to my own thread.

Does anyone know the answer to this question?
If so, can you please tell me if my understanding is right or wrong?

thanks.



2013/12/17 Xiaobin She <xi...@gmail.com>

> hi,
>
> I'm using libhdfs to deal with hdfs in a C++ program.
>
> And I have encountered a problem.
>
> here is the scenario:
> 1. first I call hdfsOpenFile with the O_WRONLY flag to open a file
> 2. call hdfsWrite to write some data
> 3. call hdfsHFlush to flush the data; according to the header hdfs.h,
> after this call returns, new readers should be able to see the data
> 4. I use an http get request to get the file list in that directory
> through the webhdfs interface,
> here I have to use the webhdfs interface because I need to deal with
> symlink files
> 5. from the json response which is returned by webhdfs, I found that
> the length of the file is still 0.
>
> I have tried to replace hdfsHFlush with hdfsFlush or hdfsSync, or to
> call all three together, but it still doesn't work.
>
> But if I call hdfsCloseFile after I call hdfsHFlush, then I can get
> the correct file length through the webhdfs interface.
>
>
> Is this right? I mean, if you want other processes to see the change
> of data, you need to call hdfsCloseFile?
>
> Or is there something I did wrong?
>
> Thank you very much for your help.
>
>
>
>
>
