Posted to common-user@hadoop.apache.org by Young-Geun Park <yo...@gmail.com> on 2012/09/04 10:30:18 UTC

questions about SequenceFile

Hi, All


I ran an MR program, WordCount:

The input file is a sequence file compressed with the Snappy codec, using
block compression.

The InputFormat is SequenceFileInputFormat.


To check whether the SequenceFile.Writer.sync() method would affect an MR
program, I ran two cases: in one, writer.sync() was called; in the other,
it was not.

The result was that there was no difference in MR running time between the
two cases; the elapsed times were about the same.


Does the sync() method in SequenceFile.Writer not affect MR performance?


Another question:

According to the source code, a sequence file is split in getSplits() in
FileInputFormat, which is the superclass of SequenceFileInputFormat.

The split size in getSplits() is determined by the default block size
(dfs.block.size) when the default configuration is used.
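For reference, FileInputFormat (new MapReduce API) computes the split size
roughly as max(minSize, min(maxSize, blockSize)). Here is a minimal Python
sketch of that rule; the function name and default values are illustrative,
not Hadoop's actual code:

```python
def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Mirrors FileInputFormat's rule:
    #   splitSize = max(minSize, min(maxSize, blockSize))
    # With default min/max settings, the split size is simply the DFS block size.
    return max(min_size, min(max_size, block_size))

# With a 64 MB dfs.block.size and default settings, each split is 64 MB.
print(compute_split_size(64 * 1024 * 1024))  # 67108864
```

This is why, under default configuration, splits line up with HDFS block
boundaries rather than with record boundaries.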

But I think record boundaries should be considered when splitting a
sequence file.

I cannot understand how a sequence file can be split at the default block
size without regard for record boundaries.

Am I missing something?


Regards,

Park

Re: questions about SequenceFile

Posted by Harsh J <ha...@cloudera.com>.
Hi Young,

Note that the SequenceFile.Writer#sync method is not the same as HDFS
sync(); it just writes a sync marker (a run of bytes marking an end point
for one or more records, somewhat like a newline in a text file, but not
written after every record).

I don't think sync() would affect much. However, if you want larger
compressed blocks, you should sync less often (i.e., put more data between
sync marker points).
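To illustrate the trade-off, here is a toy model (not the actual
SequenceFile format; real block compression, headers, and sync markers
differ): syncing after every record forces each record into its own
compressed block, while syncing rarely lets the codec compress large blocks
and amortize the marker overhead:

```python
import zlib

SYNC_MARKER = b"\x00" * 16  # placeholder; real SequenceFiles use a random 16-byte marker

def write_blocks(records, sync_every):
    """Simplified model of a block-compressed writer: a sync flushes the
    current block and appends a sync marker, so syncing more often means
    smaller, worse-compressing blocks."""
    out = bytearray()
    buf = bytearray()
    for i, rec in enumerate(records, 1):
        buf += rec
        if i % sync_every == 0:
            out += zlib.compress(bytes(buf)) + SYNC_MARKER
            buf.clear()
    if buf:  # flush any trailing partial block
        out += zlib.compress(bytes(buf)) + SYNC_MARKER
    return bytes(out)

records = [b"the quick brown fox " * 5] * 100
frequent = write_blocks(records, sync_every=1)    # sync after every record
rare = write_blocks(records, sync_every=100)      # sync once at the end
print(len(frequent), len(rare))  # rare syncing yields a much smaller file
```

The file size difference is the compression cost of frequent syncing; the
MR wall-clock time, as you observed, need not change noticeably.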

The SequenceFile reader takes care of record boundary checks when given an
offset and a length to read. The reader auto-adjusts the read until the
next sync point. The logic of record-boundary reading in MR split-read mode
is hence similar to the newline-based file reading explained at
http://wiki.apache.org/hadoop/HadoopMapReduce, except that the sync markers
play the role of the newlines here.
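That boundary-adjustment rule can be sketched as follows (a simplified
model with made-up offsets, not Hadoop's reader code): each split's reader
seeks forward to the first sync marker at or after its start offset, then
reads whole sync blocks until a block would start at or past its end
offset. Every block is therefore consumed by exactly one split, even though
the split offsets themselves ignore record boundaries:

```python
def blocks_for_split(sync_offsets, start, end):
    """Return the sync-block start offsets read by split [start, end).

    sync_offsets: sorted byte offsets of sync markers (one at offset 0).
    The reader seeks to the first sync at or after `start`; if that sync
    lies beyond `end`, this split reads nothing (a later split's reader
    will pick those blocks up instead).
    """
    begin = next((s for s in sync_offsets if s >= start), None)
    if begin is None or begin >= end:
        return []
    return [s for s in sync_offsets if begin <= s < end]

syncs = [0, 40, 90, 130, 180]              # sync marker offsets in a 200-byte file
splits = [(0, 64), (64, 128), (128, 200)]  # splits cut at a fixed "block size"
covered = [blocks_for_split(syncs, s, e) for s, e in splits]
print(covered)  # each sync block is read by exactly one split
```

Note how the block at offset 40 straddles the 64-byte split boundary: the
first split reads it past its nominal end, and the second split's reader
skips ahead to the sync at 90, so nothing is read twice or lost.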

On Tue, Sep 4, 2012 at 2:00 PM, Young-Geun Park
<yo...@gmail.com> wrote:
> Hi, All
>
>
> I ran an MR program, WordCount:
>
> The input file is a sequence file compressed with the Snappy codec, using
> block compression.
>
> The InputFormat is SequenceFileInputFormat.
>
>
> To check whether the SequenceFile.Writer.sync() method would affect an MR
> program, I ran two cases: in one, writer.sync() was called; in the other,
> it was not.
>
> The result was that there was no difference in MR running time between
> the two cases; the elapsed times were about the same.
>
>
> Does the sync() method in SequenceFile.Writer not affect MR performance?
>
>
> Another question:
>
> According to the source code, a sequence file is split in getSplits() in
> FileInputFormat, which is the superclass of SequenceFileInputFormat.
>
> The split size in getSplits() is determined by the default block size
> (dfs.block.size) when the default configuration is used.
>
> But I think record boundaries should be considered when splitting a
> sequence file.
>
> I cannot understand how a sequence file can be split at the default block
> size without regard for record boundaries.
>
> Am I missing something?
>
>
> Regards,
>
> Park



-- 
Harsh J
