Posted to hdfs-user@hadoop.apache.org by "Wu, Jiang2 " <ji...@citi.com> on 2013/08/20 23:36:40 UTC

read a changing hdfs file

Hi,

I did some experiments reading an HDFS file while it is being written. It seems that the read takes a snapshot at the moment the file is opened, and does not see any data appended to the file afterwards. This is different from what happens when reading a changing local file. My code is as follows:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URI;
    import java.util.Scanner;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    Configuration conf = new Configuration();
    InputStream in = null;
    try {
        // Open the file through the HDFS FileSystem API
        FileSystem fs = FileSystem.get(URI.create("hdfs://MyCluster/"), conf);
        in = fs.open(new Path("/tmp/test.txt"));
        // Read line by line until the stream reports no more data
        Scanner scanner = new Scanner(in);
        while (scanner.hasNextLine()) {
            System.out.println("+++++++++++++++++++++++++++++++ read " + scanner.nextLine());
        }
        System.out.println("+++++++++++++++++++++++++++++++ reader finished");
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeStream(in);
    }

I'm wondering whether this is the designed HDFS read behavior, or whether it can be changed with a different API or configuration. What I expect is the same behavior as reading a local file: when a reader reads a file while another writer is writing to it, the reader receives all the data written by the writer.

Thanks,
Jiang



Re: read a changing hdfs file

Posted by Shahab Yunus <sh...@gmail.com>.
As far as I understand (and experts can correct me), data in a file that is
being written becomes visible to readers once a full HDFS block's worth of
data has been written, and the same applies to each subsequent block. A
block's worth of data is essentially the unit of coherency, i.e. the unit
for which durability is guaranteed. You can force visibility by calling the
sync methods (*hsync/hflush) to flush your writes to the file system so they
become visible as you write them, but that comes at a cost in performance.
So it depends on your application and requirements, i.e. the trade-off
between performance and data visibility/durability.
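
As a rough illustration, here is a minimal writer-side sketch of that
approach (the class name, path, and loop are made up for the example; it
assumes a client where FSDataOutputStream exposes hflush/hsync):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HflushWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://MyCluster/"), conf);
            FSDataOutputStream out = fs.create(new Path("/tmp/test.txt"));
            try {
                for (int i = 0; i < 10; i++) {
                    out.writeBytes("line " + i + "\n");
                    // hflush() pushes the buffered bytes out to the datanodes
                    // so that readers opening the file now can see them;
                    // hsync() additionally forces them to disk, at further cost.
                    out.hflush();
                }
            } finally {
                out.close();
            }
        }
    }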

*Read more about the definition, differences and use of the appropriate
method here:
http://hadoop-hbase.blogspot.com/2012/05/hbase-hdfs-and-durable-sync.html
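
And on the reading side: since each open() only sees what was visible at
that moment, a reader can approximate tailing the file by periodically
re-opening it and seeking past the bytes it has already consumed. Again
just a sketch (class name, path, and poll interval are arbitrary, and it
assumes the file only ever grows):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsTail {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://MyCluster/"), conf);
            Path path = new Path("/tmp/test.txt");
            long offset = 0;          // bytes consumed so far
            byte[] buf = new byte[4096];
            while (true) {
                // Each open() takes a fresh snapshot of what is currently
                // visible, so re-opening picks up newly appended data.
                FSDataInputStream in = fs.open(path);
                try {
                    in.seek(offset);  // skip what we already printed
                    int n;
                    while ((n = in.read(buf)) > 0) {
                        System.out.print(new String(buf, 0, n, "UTF-8"));
                        offset += n;
                    }
                } finally {
                    in.close();
                }
                Thread.sleep(1000);   // poll once per second
            }
        }
    }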

Regards,
Shahab


On Tue, Aug 20, 2013 at 5:36 PM, Wu, Jiang2 <ji...@citi.com> wrote:

>  Hi,
>
> I did some experiments reading an HDFS file while it is being written. It
> seems that the read takes a snapshot at the moment the file is opened, and
> does not see any data appended to the file afterwards. [code snipped]
>
> I'm wondering whether this is the designed HDFS read behavior, or whether
> it can be changed with a different API or configuration. What I expect is
> the same behavior as reading a local file: when a reader reads a file
> while another writer is writing to it, the reader receives all the data
> written by the writer.
>
> Thanks,
> Jiang
