You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Attila Jeges (Code Review)" <ge...@cloudera.org> on 2017/02/22 10:15:46 UTC

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Attila Jeges has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6107

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................

IMPALA-3079: Fix sequence file writer

Before the fix, sequence file writer produced corrupt files in some
cases. Steps to reproduce:
  SET ALLOW_UNSUPPORTED_FORMATS=1;

  create table store_sales_seq_snap like tpcds_parquet.store_sales
  stored as SEQUENCEFILE;

  insert into store_sales_seq_snap partition(ss_sold_date_sk)
  select * from tpcds_parquet.store_sales
  where ss_sold_date_sk between 2450816 and 2451200;

The insert statement produces a corrupt file that cannot be read back.

The fix corrects the implementation of zero-compressed encoding in
ReadWriteUtil class. It also fixes the calculation of block sizes in
SnappyBlockCompressor::ProcessBlock().

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
---
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
6 files changed, 119 insertions(+), 39 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6107/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6107/1/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

PS1, Line 219:  if (val & 0xFF00000000000000llu) return 9;
             :   if (val & 0x00FF000000000000llu) return 8;
             :   if (val & 0x0000FF0000000000llu) return 7;
             :   if (val & 0x000000FF00000000llu) return 6;
             :   if (val & 0x00000000FF000000llu) return 5;
             :   if (val & 0x0000000000FF0000llu) return 4;
             :   if (val & 0x000000000000FF00llu) return 3;
> Why not do a bit-scan for first set bit and divide the result by 8 ?
Typo. I meant bit scan from the MSB. __builtin_clzl()


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#5).

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................

IMPALA-3079: Fix sequence file writer

Before the fix, sequence file writer produced corrupt files in some
cases. Steps to reproduce:
  SET ALLOW_UNSUPPORTED_FORMATS=1;

  create table store_sales_seq_snap like tpcds_parquet.store_sales
  stored as SEQUENCEFILE;

  insert into store_sales_seq_snap partition(ss_sold_date_sk)
  select * from tpcds_parquet.store_sales
  where ss_sold_date_sk between 2450816 and 2451200;

The insert statement produces a corrupt file that cannot be read back.

This change fixes:
- The implementation of zero-compressed encoding in ReadWriteUtil
  class.
- The calculation of block sizes in SnappyBlockCompressor class.
- Creating record/block compressed sequence files in
  HdfsSequenceTableWriter class.

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
---
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
8 files changed, 484 insertions(+), 80 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6107/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#9).

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................

IMPALA-3079: Fix sequence file writer

This change fixes the following issues in the Sequence File Writer:
1. ReadWriteUtil::VLongRequiredBytes() and ReadWriteUtil::PutVLong()
   were broken. As a result, Impala created corrupt uncompressed
   sequence files.

2. KEY_CLASS_NAME was missing from the sequence file header. As a
   result, Hive could not read back uncompressed sequence files
   created by Impala.

3. Impala created record-compressed sequence files with empty keys
   block. As a result, Hive could not read back record-compressed
   sequence files created by Impala.

4. Impala created block-compressed files with:
   - empty key-lengths block
   - empty keys block
   - empty value-lengths block
   This resulted in invalid block-compressed sequence files that Hive could
   not read back.

5. In some cases the wrong Record-compression flag was written to the
   sequence file header. As a result, Hive could not read back record-
   compressed sequence files created by Impala.

6. Impala added 'sync_marker' instead of 'neg1_sync_marker' to the
   beginning of blocks in block-compressed sequence files. Hive could
   not read these files back.

7. The calculation of block sizes in SnappyBlockCompressor class was
   incorrect for odd-length buffers.

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
---
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
8 files changed, 500 insertions(+), 82 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6107/9
-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#4).

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................

IMPALA-3079: Fix sequence file writer

Before the fix, sequence file writer produced corrupt files in some
cases. Steps to reproduce:
  SET ALLOW_UNSUPPORTED_FORMATS=1;

  create table store_sales_seq_snap like tpcds_parquet.store_sales
  stored as SEQUENCEFILE;

  insert into store_sales_seq_snap partition(ss_sold_date_sk)
  select * from tpcds_parquet.store_sales
  where ss_sold_date_sk between 2450816 and 2451200;

The insert statement produces a corrupt file that cannot be read back.

This change fixes:
- The implementation of zero-compressed encoding in ReadWriteUtil
  class.
- The calculation of block sizes in SnappyBlockCompressor class.
- Creating record/block compressed sequence files in
  HdfsSequenceTableWriter class.

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
---
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
8 files changed, 396 insertions(+), 78 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6107/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 2:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/6107/1/be/src/exec/read-write-util-test.cc
File be/src/exec/read-write-util-test.cc:

PS1, Line 88:  
> nit: blank space.
Done


http://gerrit.cloudera.org:8080/#/c/6107/1/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

PS1, Line 217: ~val;
> Did you mean to convert this to the absolute value ? If so, you are missing
It should be ~val.


PS1, Line 219: 
             : 
             : inline int64_t ReadWriteUtil::PutVLong(int64_t val, uint8_t* buf) {
             :   int64_t num_bytes = VLongRequiredBytes(val);
             : 
             :   if (num_bytes == 1) {
             :     // store the value itself instead of the
> Typo. I meant bit scan from the MSB. __builtin_clzl()
Done


PS1, Line 240:     buf[i] = (val >> (8 * (num_bytes - i - 1))) & 0xFF;
             :   }
> Is this behavior documented somewhere ? Is it part of the sequence file sta
Yes, it is part of the sequence file standard and it can be read back by Hive. I've added an E2E test to confirm this.

Sequence file documentation: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/SequenceFile.html

The Zero-Compressed Integer format is described here:
https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/io/WritableUtils.html#writeVLong(java.io.DataOutput, long)

The details are in the actual source code (WriteVLong() method):
https://hadoop.apache.org/docs/r2.7.2/api/src-html/org/apache/hadoop/io/WritableUtils.html


PS1, Line 246: Util::PutVInt(int32
> Big Endianness ?
Correct. Changed the comment.


http://gerrit.cloudera.org:8080/#/c/6107/1/be/src/util/compress.cc
File be/src/util/compress.cc:

PS1, Line 209:  each preceded with an integer for
> , each preceded with an integer for its size.
Done


PS1, Line 210: // For testing purposes we are going to generate two blocks if input_length >= 4K
> That seems a bit inefficient to split into two blocks for input_length == 2
Done,changed the threshold to 4K


PS1, Line 235: ReadWriteUtil::
> The declaration can be moved into the loop body.
Done


http://gerrit.cloudera.org:8080/#/c/6107/1/testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
File testdata/workloads/functional-query/queries/QueryTest/seq-writer.test:

Line 93: SET COMPRESSION_CODEC=NONE;
> what about the other codecs?
This test case runs for 8.5 seconds.

I've added test cases for other codecs with RECORD/BLOCK compression mode.


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#2).

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................

IMPALA-3079: Fix sequence file writer

Before the fix, sequence file writer produced corrupt files in some
cases. Steps to reproduce:
  SET ALLOW_UNSUPPORTED_FORMATS=1;

  create table store_sales_seq_snap like tpcds_parquet.store_sales
  stored as SEQUENCEFILE;

  insert into store_sales_seq_snap partition(ss_sold_date_sk)
  select * from tpcds_parquet.store_sales
  where ss_sold_date_sk between 2450816 and 2451200;

The insert statement produces a corrupt file that cannot be read back.

This change fixes:
- The implementation of zero-compressed encoding in ReadWriteUtil
  class.
- The calculation of block sizes in SnappyBlockCompressor class.
- Creating record/block compressed sequence files in
  HdfsSequenceTableWriter class.

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
---
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
8 files changed, 385 insertions(+), 77 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6107/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


IMPALA-3079: Fix sequence file writer

This change fixes the following issues in the Sequence File Writer:
1. ReadWriteUtil::VLongRequiredBytes() and ReadWriteUtil::PutVLong()
   were broken. As a result, Impala created corrupt uncompressed
   sequence files.

2. KEY_CLASS_NAME was missing from the sequence file header. As a
   result, Hive could not read back uncompressed sequence files
   created by Impala.

3. Impala created record-compressed sequence files with empty keys
   block. As a result, Hive could not read back record-compressed
   sequence files created by Impala.

4. Impala created block-compressed files with:
   - empty key-lengths block
   - empty keys block
   - empty value-lengths block
   This resulted in invalid block-compressed sequence files that Hive could
   not read back.

5. In some cases the wrong Record-compression flag was written to the
   sequence file header. As a result, Hive could not read back record-
   compressed sequence files created by Impala.

6. Impala added 'sync_marker' instead of 'neg1_sync_marker' to the
   beginning of blocks in block-compressed sequence files. Hive could
   not read these files back.

7. The calculation of block sizes in SnappyBlockCompressor class was
   incorrect for odd-length buffers.

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Reviewed-on: http://gerrit.cloudera.org:8080/6107
Reviewed-by: Michael Ho <kw...@cloudera.com>
Reviewed-by: Attila Jeges <at...@cloudera.com>
Reviewed-by: Dan Hecht <dh...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
8 files changed, 500 insertions(+), 82 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Michael Ho: Looks good to me, approved
  Attila Jeges: Looks good to me, but someone else must approve
  Dan Hecht: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6107/1/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

PS1, Line 217: val ^ (static_cast<int64_t>(-1))
Did you mean to convert this to the absolute value ? If so, you are missing a +1. If not, why not just do ~val ?


PS1, Line 219:  if (val & 0xFF00000000000000llu) return 9;
             :   if (val & 0x00FF000000000000llu) return 8;
             :   if (val & 0x0000FF0000000000llu) return 7;
             :   if (val & 0x000000FF00000000llu) return 6;
             :   if (val & 0x00000000FF000000llu) return 5;
             :   if (val & 0x0000000000FF0000llu) return 4;
             :   if (val & 0x000000000000FF00llu) return 3;
Why not do a bit-scan for first set bit and divide the result by 8 ?


PS1, Line 240:     buf[0] = -(num_bytes + 119);
             :     val = val ^ (static_cast<int64_t>(-1));
Is this behavior documented somewhere ? Is it part of the sequence file standard ? In other words, if the field is encoded this way, can it be read back by say Hive ?


PS1, Line 246: reversed endianness
Big Endianness ?


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 4:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.cc
File be/src/exec/hdfs-sequence-table-writer.cc:

PS4, Line 179: 
             : 
> The old sequence file writer had multiple issues:
Thanks for listing them out. Please also put this list in the commit message.


http://gerrit.cloudera.org:8080/#/c/6107/5/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

Line 214: // Returns size of the encoded long value, including the 1 byte for length.
for val < -112 or val > 127.


PS5, Line 228: 
nit: long line


PS5, Line 245: -(num_bytes + 111);
nit: help to comment why it's 119 here (which is different from 120 in the comment above) as we already include the length of the extra 1 byte to differentiate between the case of -112 <= v <= 127 vs other values.


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6107

to look at the new patch set (#6).

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................

IMPALA-3079: Fix sequence file writer

This change fixes the following issues in the Sequence File Writer:
1. ReadWriteUtil::VLongRequiredBytes() and ReadWriteUtil::PutVLong()
   were broken. As a result, Impala could not read back uncompressed
   sequence files created by Impala.

2. KEY_CLASS_NAME was missing from the sequence file header. As a
   result, Hive could not read back uncompressed sequence files
   created by Impala.

3. Impala created record-compressed sequence files with empty keys
   block. As a result, Hive could not read back record-compressed
   sequence files created by Impala.

4. Impala created block-compressed files with:
   - empty key-lengths block
   - empty keys block
   - empty value-lengths block
   This resulted in invalid block-compressed sequence files that Hive could
   not read back.

5. In some cases the wrong Record-compression flag was written to the
   sequence file header. As a result, Hive could not read back record-
   compressed sequence files created by Impala.

6. Impala added 'sync_marker' instead of 'neg1_sync_marker' to the
   beginning of blocks in block-compressed sequence files. Hive could
   not read these files back.

7. The calculation of block sizes in SnappyBlockCompressor class was
   incorrect for odd-length buffers.

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
---
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
8 files changed, 494 insertions(+), 80 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6107/6
-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6107/7/be/src/util/compress.cc
File be/src/util/compress.cc:

PS7, Line 214: We are going to generate two blocks if input_length >= 4K.
> let's put back: ... for testing purposes.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6107

to look at the new patch set (#8).

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................

IMPALA-3079: Fix sequence file writer

This change fixes the following issues in the Sequence File Writer:
1. ReadWriteUtil::VLongRequiredBytes() and ReadWriteUtil::PutVLong()
   were broken. As a result, Impala created corrupt uncompressed
   sequence files.

2. KEY_CLASS_NAME was missing from the sequence file header. As a
   result, Hive could not read back uncompressed sequence files
   created by Impala.

3. Impala created record-compressed sequence files with empty keys
   block. As a result, Hive could not read back record-compressed
   sequence files created by Impala.

4. Impala created block-compressed files with:
   - empty key-lengths block
   - empty keys block
   - empty value-lengths block
   This resulted in invalid block-compressed sequence files that Hive could
   not read back.

5. In some cases the wrong Record-compression flag was written to the
   sequence file header. As a result, Hive could not read back record-
   compressed sequence files created by Impala.

6. Impala added 'sync_marker' instead of 'neg1_sync_marker' to the
   beginning of blocks in block-compressed sequence files. Hive could
   not read these files back.

7. The calculation of block sizes in SnappyBlockCompressor class was
   incorrect for odd-length buffers.

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
---
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
8 files changed, 500 insertions(+), 82 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6107/8
-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#7).

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................

IMPALA-3079: Fix sequence file writer

This change fixes the following issues in the Sequence File Writer:
1. ReadWriteUtil::VLongRequiredBytes() and ReadWriteUtil::PutVLong()
   were broken. As a result, Impala created corrupt uncompressed
   sequence files.

2. KEY_CLASS_NAME was missing from the sequence file header. As a
   result, Hive could not read back uncompressed sequence files
   created by Impala.

3. Impala created record-compressed sequence files with empty keys
   block. As a result, Hive could not read back record-compressed
   sequence files created by Impala.

4. Impala created block-compressed files with:
   - empty key-lengths block
   - empty keys block
   - empty value-lengths block
   This resulted in invalid block-compressed sequence files that Hive could
   not read back.

5. In some cases the wrong Record-compression flag was written to the
   sequence file header. As a result, Hive could not read back record-
   compressed sequence files created by Impala.

6. Impala added 'sync_marker' instead of 'neg1_sync_marker' to the
   beginning of blocks in block-compressed sequence files. Hive could
   not read these files back.

7. The calculation of block sizes in SnappyBlockCompressor class was
   incorrect for odd-length buffers.

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
---
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
8 files changed, 500 insertions(+), 82 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6107/7
-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 8: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#6).

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................

IMPALA-3079: Fix sequence file writer

This change fixes the following issues in the Sequence File Writer:
1. ReadWriteUtil::VLongRequiredBytes() and ReadWriteUtil::PutVLong()
   were broken. As a result, Impala could not read back uncompressed
   sequence files created by Impala.

2. KEY_CLASS_NAME was missing from the sequence file header. As a
   result, Hive could not read back uncompressed sequence files
   created by Impala.

3. Impala created record-compressed sequence files with empty keys
   block. As a result, Hive could not read back record-compressed
   sequence files created by Impala.

4. Impala created block-compressed files with:
   - empty key-lengths block
   - empty keys block
   - empty value-lengths block
   This resulted in invalid block-compressed sequence files that Hive could
   not read back.

5. In some cases the wrong Record-compression flag was written to the
   sequence file header. As a result, Hive could not read back record-
   compressed sequence files created by Impala.

6. Impala added 'sync_marker' instead of 'neg1_sync_marker' to the
   beginning of blocks in block-compressed sequence files. Hive could
   not read these files back.

7. The calculation of block sizes in SnappyBlockCompressor class was
   incorrect for odd-length buffers.

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
---
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
8 files changed, 494 insertions(+), 80 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6107/6
-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#4).

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................

IMPALA-3079: Fix sequence file writer

Before the fix, sequence file writer produced corrupt files in some
cases. Steps to reproduce:
  SET ALLOW_UNSUPPORTED_FORMATS=1;

  create table store_sales_seq_snap like tpcds_parquet.store_sales
  stored as SEQUENCEFILE;

  insert into store_sales_seq_snap partition(ss_sold_date_sk)
  select * from tpcds_parquet.store_sales
  where ss_sold_date_sk between 2450816 and 2451200;

The insert statement produces a corrupt file that cannot be read back.

This change fixes:
- The implementation of zero-compressed encoding in ReadWriteUtil
  class.
- The calculation of block sizes in SnappyBlockCompressor class.
- Creating record/block compressed sequence files in
  HdfsSequenceTableWriter class.

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
---
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
8 files changed, 396 insertions(+), 78 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6107/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 7:

> (2 comments)
 > 
 > I think we should support writing sequence file. This change have
 > fixed some bugs to get the sequence file writer to work but it's
 > unclear to me if the existing tests have enough coverage. May help
 > to think through the testing first before moving it out of the
 > flag.

There's no particular reason to write sequence files.

-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 9: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 1:

(4 comments)

Some initial comments...

http://gerrit.cloudera.org:8080/#/c/6107/1/be/src/exec/read-write-util-test.cc
File be/src/exec/read-write-util-test.cc:

PS1, Line 88:  
nit: blank space.


http://gerrit.cloudera.org:8080/#/c/6107/1/be/src/util/compress.cc
File be/src/util/compress.cc:

PS1, Line 209: each preceded with an integer size
, each preceded with an integer for its size.


PS1, Line 210: // For testing purposes we are going to generate two blocks if input_length >= 2.
That seems a bit inefficient to split into two blocks for input_length == 2. Should we move this threshold higher and improve the test cases instead ?


PS1, Line 235: uint8_t* sizep;
The declaration can be moved into the loop body.


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#2).

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................

IMPALA-3079: Fix sequence file writer

Before the fix, sequence file writer produced corrupt files in some
cases. Steps to reproduce:
  SET ALLOW_UNSUPPORTED_FORMATS=1;

  create table store_sales_seq_snap like tpcds_parquet.store_sales
  stored as SEQUENCEFILE;

  insert into store_sales_seq_snap partition(ss_sold_date_sk)
  select * from tpcds_parquet.store_sales
  where ss_sold_date_sk between 2450816 and 2451200;

The insert statement produces a corrupt file that cannot be read back.

This change fixes:
- The implementation of zero-compressed encoding in ReadWriteUtil
  class.
- The calculation of block sizes in SnappyBlockCompressor class.
- Creating record/block compressed sequence files in
  HdfsSequenceTableWriter class.

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
---
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
8 files changed, 385 insertions(+), 77 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6107/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 9: Code-Review+1

> Carry +2 forward.

Core tests succeeded: http://sandbox.jenkins.cloudera.com/job/impala-private-build-and-test/5500/

-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 9: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 6:

(2 comments)

I think we should support writing sequence file. This change have fixed some bugs to get the sequence file writer to work but it's unclear to me if the existing tests have enough coverage. May help to think through the testing first before moving it out of the flag.

http://gerrit.cloudera.org:8080/#/c/6107/6/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

PS6, Line 247: DCHECK(val < -112);
nit: DCHECK_LT(val, -112);


PS6, Line 258: DCHECK(val > 127);
nit: DCHECK_GT(val, 127);


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 9:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/502/

-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#5).

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................

IMPALA-3079: Fix sequence file writer

Before the fix, sequence file writer produced corrupt files in some
cases. Steps to reproduce:
  SET ALLOW_UNSUPPORTED_FORMATS=1;

  create table store_sales_seq_snap like tpcds_parquet.store_sales
  stored as SEQUENCEFILE;

  insert into store_sales_seq_snap partition(ss_sold_date_sk)
  select * from tpcds_parquet.store_sales
  where ss_sold_date_sk between 2450816 and 2451200;

The insert statement produces a corrupt file that cannot be read back.

This change fixes:
- The implementation of zero-compressed encoding in ReadWriteUtil
  class.
- The calculation of block sizes in SnappyBlockCompressor class.
- Creating record/block compressed sequence files in
  HdfsSequenceTableWriter class.

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
---
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
8 files changed, 484 insertions(+), 80 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6107/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 5:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.cc
File be/src/exec/hdfs-sequence-table-writer.cc:

PS4, Line 179: 
             : 
> Is this the reason the old sequence file writer created corrupted files ?
The old sequence file writer had multiple issues:

1. ReadWriteUtil::VLongRequiredBytes() and ReadWriteUtil::PutVLong() were broken. As a result, Impala could not read back uncompressed sequence files created by Impala (see be/src/exec/read-write-util.h).

2. KEY_CLASS_NAME was missing from the sequence file header. As a result, Hive could not read back uncompressed sequence files created by Impala (see be/src/exec/hdfs-sequence-table-writer.cc).

3. Keys were missing from record-compressed sequence files. Hive could not read back record-compressed sequence files created by Impala (see be/src/exec/hdfs-sequence-table-writer.cc).

4. In some cases the wrong Record-compression flag was written to the sequence file header. As a result, Hive could not read back record-compressed sequence files created by Impala (see be/src/exec/hdfs-sequence-table-writer.cc).

5.Impala added 'sync_marker' instead of 'neg1_sync_marker' to the beginning of blocks in block-compressed sequence files. Hive could not read these files back.

6. Impala created block-compressed files with:
- empty key-lengths block (L176)
- empty keys block (L177)
- empty value-lengths block (L180)

This resulted in invalid block-compressed sequence files that Hive could not read back (see HdfsSequenceTableWriter::WriteCompressedBlock()).


PS4, Line 139: _cast<const uint8_t*>(KEY_CLASS_NAME));
> nit: indent 4. Same below.
Done


Line 191:   record.WriteBytes(output_length, output);
> It seems a bit unfortunate that we don't free the temp buffer (i.e. output)
Added FreeAll to the end of the 'Flush()' function.


PS4, Line 193:   // Output compressed keys block-size & compressed keys block.
             :   // The keys block contains "\0\0\0\0" byte sequence as a key for each row (this is what
             :   // Hive does).
> Does not writing key-lengths block and key block prevent the file from bein
Yes, Hive failed with an exception when I tried that.


http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.h
File be/src/exec/hdfs-sequence-table-writer.h:

Line 29: 
> Would be good to add the details of the Sequence file's layout as top level
Done


http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

Line 230: // For more information, see the documentation for 'WritableUtils.writeVLong()' method:
> DCHECK_GE(num_bytes, 2);
Done


PS3, Line 233: nt64_t num_b
> May also want to state that the source of this behavior.
Done


http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/util/compress.cc
File be/src/util/compress.cc:

Line 248:     outp += size;
> DCHECK_LE(outp - out_buffer_,  length);
Done


http://gerrit.cloudera.org:8080/#/c/6107/3/testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
File testdata/workloads/functional-query/queries/QueryTest/seq-writer.test:

Line 212: stored as SEQUENCEFILE;
> May be helpful to also add a test for writing empty file:
I tried this and it doesn't write an empty file. It doesn't create a file at all. Probably there's no easy way to force the sequence file writer to create an empty-file.


http://gerrit.cloudera.org:8080/#/c/6107/3/tests/query_test/test_compressed_formats.py
File tests/query_test/test_compressed_formats.py:

PS3, Line 170:       # Read it back in Impala
             :       output = self.client.execute('select count(*) from %s' % table_name)
             :       assert '16541' == output.get_data()
             :       # Read it back in Hive
             :       output = self.run_stmt_in_hive('select count(*) from %s' % table_name)
             :       assert '16541' == output.split('\n')[1]
             : 
             :   def test_avro_writer(self, vector):
             :     self.run_test_case('QueryTest/avro-wri
> Doesn't this duplicate the second test ? May help to test with empty file a
The 2nd test is for record-compressed sequence files while the 3rd is for block-compressed seq files.
Added tests for files greater than 4K and less than 4K. I could not figure out how to create an empty file.


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 6:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/6107/6//COMMIT_MSG
Commit Message:

PS6, Line 11: Impala
> wouldn't this bug mean that no one could read these files? i.e. they were c
True, rephrased the commit message.


http://gerrit.cloudera.org:8080/#/c/6107/6/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

PS6, Line 223: i
> what is 'i'? is this talking about 'val'?
Yes, I've corrected the comment.


PS6, Line 248: buff
> buf
Done


PS6, Line 259: buff
> buf
Done


PS6, Line 260: buff
> buf
Done


http://gerrit.cloudera.org:8080/#/c/6107/6/be/src/util/compress.cc
File be/src/util/compress.cc:

PS6, Line 209: its
> what is this referring to? size of what?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 6:

(2 comments)

Thanks for the review.

http://gerrit.cloudera.org:8080/#/c/6107/6/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

PS6, Line 247: DCHECK(val < -112);
> nit: DCHECK_LT(val, -112);
Done


PS6, Line 258: DCHECK(val > 127);
> nit: DCHECK_GT(val, 127);
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Hello Michael Ho,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6107

to look at the new patch set (#7).

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................

IMPALA-3079: Fix sequence file writer

This change fixes the following issues in the Sequence File Writer:
1. ReadWriteUtil::VLongRequiredBytes() and ReadWriteUtil::PutVLong()
   were broken. As a result, Impala created corrupt uncompressed
   sequence files.

2. KEY_CLASS_NAME was missing from the sequence file header. As a
   result, Hive could not read back uncompressed sequence files
   created by Impala.

3. Impala created record-compressed sequence files with empty keys
   block. As a result, Hive could not read back record-compressed
   sequence files created by Impala.

4. Impala created block-compressed files with:
   - empty key-lengths block
   - empty keys block
   - empty value-lengths block
   This resulted in invalid block-compressed sequence files that Hive could
   not read back.

5. In some cases the wrong Record-compression flag was written to the
   sequence file header. As a result, Hive could not read back record-
   compressed sequence files created by Impala.

6. Impala added 'sync_marker' instead of 'neg1_sync_marker' to the
   beginning of blocks in block-compressed sequence files. Hive could
   not read these files back.

7. The calculation of block sizes in SnappyBlockCompressor class was
   incorrect for odd-length buffers.

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
---
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
8 files changed, 500 insertions(+), 82 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6107/7
-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 5: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 6:

Forgive the drive-by comment, but I'm curious about whether we plan to make sequence files a supported format for writing. It seems strange to put all this effort into it and keep it hidden behind the flag with other file writers like Avro that are totally broken.

-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

PS3, Line 215: line int ReadWriteUtil::VLongRequiredBytes(int64_t val) {
> nit: consider putting it at line 213 instead, as a function header.
Done


PS3, Line 218:  9 - __builtin_clzll(val)/8;
> nit: unnecessary parenthesis.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#8).

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................

IMPALA-3079: Fix sequence file writer

This change fixes the following issues in the Sequence File Writer:
1. ReadWriteUtil::VLongRequiredBytes() and ReadWriteUtil::PutVLong()
   were broken. As a result, Impala created corrupt uncompressed
   sequence files.

2. KEY_CLASS_NAME was missing from the sequence file header. As a
   result, Hive could not read back uncompressed sequence files
   created by Impala.

3. Impala created record-compressed sequence files with empty keys
   block. As a result, Hive could not read back record-compressed
   sequence files created by Impala.

4. Impala created block-compressed files with:
   - empty key-lengths block
   - empty keys block
   - empty value-lengths block
   This resulted in invalid block-compressed sequence files that Hive could
   not read back.

5. In some cases the wrong Record-compression flag was written to the
   sequence file header. As a result, Hive could not read back record-
   compressed sequence files created by Impala.

6. Impala added 'sync_marker' instead of 'neg1_sync_marker' to the
   beginning of blocks in block-compressed sequence files. Hive could
   not read these files back.

7. The calculation of block sizes in SnappyBlockCompressor class was
   incorrect for odd-length buffers.

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
---
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
8 files changed, 500 insertions(+), 82 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6107/8
-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 3:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6107/1/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

PS1, Line 240:     buf[i] = (val >> (8 * (num_bytes - i - 1))) & 0xFF;
             :   }
> Yes, it is part of the sequence file standard and it can be read back by Hi
Thanks for the pointers. Would you mind adding the definitionamd the URLs above as function header:

Serializes a long to a binary stream with zero-compressed encoding. For -112 <= i <= 127, only one byte is used with the actual value. For other values of i, the first byte value indicates whether the long is positive or negative, and the number of bytes that follow. If the first byte value v is between -113 and -120, the following long is positive, with number of bytes that follow are -(v+112). If the first byte value v is between -121 and -128, the following long is negative, with number of bytes that follow are -(v+120). Bytes are stored in the high-non-zero-byte-first order.


http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

PS3, Line 215: // returns size of the encoded long value, including the 1 byte for length
nit: consider putting it at line 213 instead, as a function header.


PS3, Line 218:  (9 - __builtin_clzll(val)/8);
nit: unnecessary parenthesis.


PS3, Line 233:  val = ~val;
May also want to state that the source of this behavior.


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 9: Code-Review+2

Carry +2 forward.

-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 7: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6107/7/be/src/util/compress.cc
File be/src/util/compress.cc:

PS7, Line 214: We are going to generate two blocks if input_length >= 4K.
let's put back: ... for testing purposes.
assuming that's the reason. otherwise, how would a reader know why the code does that?


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 6:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/6107/6//COMMIT_MSG
Commit Message:

PS6, Line 11: Impala
wouldn't this bug mean that no one could read these files? i.e. they were completely invalid.


http://gerrit.cloudera.org:8080/#/c/6107/6/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

PS6, Line 223: i
what is 'i'? is this talking about 'val'?


PS6, Line 248: buff
buf


PS6, Line 259: buff
buf


PS6, Line 260: buff
buf


http://gerrit.cloudera.org:8080/#/c/6107/6/be/src/util/compress.cc
File be/src/util/compress.cc:

PS6, Line 209: its
what is this referring to? size of what?
and i think this comment would be clearer if reworded to explain the block encoding in order (rather than "preceded with").


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 6: Code-Review+1

Carry previous +1 forward.

-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 6:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.cc
File be/src/exec/hdfs-sequence-table-writer.cc:

PS4, Line 179: 
             : 
> Thanks for listing them out. Please also put this list in the commit messag
Done


http://gerrit.cloudera.org:8080/#/c/6107/5/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

Line 214: // Returns size of the encoded long value, including the 1 byte for length for val < -112
> for val < -112 or val > 127.
Done


PS5, Line 228: ollow
> nit: long line
Done


PS5, Line 245: um_bytes, 9);
> nit: help to comment why it's 119 here (which is different from 120 in the 
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#3).

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................

IMPALA-3079: Fix sequence file writer

Before the fix, sequence file writer produced corrupt files in some
cases. Steps to reproduce:
  SET ALLOW_UNSUPPORTED_FORMATS=1;

  create table store_sales_seq_snap like tpcds_parquet.store_sales
  stored as SEQUENCEFILE;

  insert into store_sales_seq_snap partition(ss_sold_date_sk)
  select * from tpcds_parquet.store_sales
  where ss_sold_date_sk between 2450816 and 2451200;

The insert statement produces a corrupt file that cannot be read back.

This change fixes:
- The implementation of zero-compressed encoding in ReadWriteUtil
  class.
- The calculation of block sizes in SnappyBlockCompressor class.
- Creating record/block compressed sequence files in
  HdfsSequenceTableWriter class.

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
---
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
8 files changed, 385 insertions(+), 77 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6107/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 4:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.cc
File be/src/exec/hdfs-sequence-table-writer.cc:

PS4, Line 179: 
             : 
Is this the reason the old sequence file writer created corrupted files ?


PS4, Line 139: reinterpret_cast<const uint8_t*>(KEY_CLASS_NAME));
nit: indent 4. Same below.


Line 191:   record.WriteBytes(output_length, output);
It seems a bit unfortunate that we don't free the temp buffer (i.e. output) after copying the data to record.

Do you know if there may be other allocations from mem_pool_ which need to persist across calls to WriteCompressedBlock() ? Otherwise, we should consider calling mem_pool_->FreeAll() at the caller of this function to avoid accumulating too much memory when writing a lot of data.


PS4, Line 193:   // Output compressed keys block-size & compressed keys block.
             :   // The keys block contains "\0\0\0\0" byte sequence as a key for each row (this is what
             :   // Hive does).
Does not writing key-lengths block and key block prevent the file from being read back correctly in Hive ?


http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.h
File be/src/exec/hdfs-sequence-table-writer.h:

Line 29: 
Would be good to add the details of the Sequence file's layout as top level comment here. The following link seems to include quite a bit of useful information:

https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/io/SequenceFile.html


http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

Line 230: // https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/io/WritableUtils.html
DCHECK_GE(num_bytes, 2);
DCHECK_LE(num_bytes, 9);


http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/util/compress.cc
File be/src/util/compress.cc:

Line 248:     outp += size;
DCHECK_LE(outp - out_buffer_,  length);


http://gerrit.cloudera.org:8080/#/c/6107/3/testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
File testdata/workloads/functional-query/queries/QueryTest/seq-writer.test:

Line 212: SET COMPRESSION_CODEC=SNAPPY_BLOCKED;
May be helpful to also add a test for writing empty file:

select * from tpcds_parquet.store_sales where ss_item_sk=-1;


http://gerrit.cloudera.org:8080/#/c/6107/3/tests/query_test/test_compressed_formats.py
File tests/query_test/test_compressed_formats.py:

PS3, Line 170:     # Write block-compressed sequence file and then read it back in Hive
             :     table_name = '%s.seq_tbl_block' % unique_database
             :     self.client.execute("set COMPRESSION_CODEC=DEFAULT")
             :     self.client.execute("set SEQ_COMPRESSION_MODE=BLOCK")
             :     self.client.execute("create table %s like functional.alltypes" % table_name)
             :     self.client.execute("insert into %s partition(year, month) "
             :         "select * from functional.alltypes" % table_name)
             :     output = self.run_stmt_in_hive("select count(*) from %s" % table_name)
             :     assert "7300" == output.split('\n')[1]
Doesn't this duplicate the second test ? May help to test with empty file and file less than 4K and greater than 4K.


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6107/1/testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
File testdata/workloads/functional-query/queries/QueryTest/seq-writer.test:

Line 93: SET COMPRESSION_CODEC=SNAPPY;
what about the other codecs?

how long does it take to run this particular test case?


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#3).

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................

IMPALA-3079: Fix sequence file writer

Before the fix, sequence file writer produced corrupt files in some
cases. Steps to reproduce:
  SET ALLOW_UNSUPPORTED_FORMATS=1;

  create table store_sales_seq_snap like tpcds_parquet.store_sales
  stored as SEQUENCEFILE;

  insert into store_sales_seq_snap partition(ss_sold_date_sk)
  select * from tpcds_parquet.store_sales
  where ss_sold_date_sk between 2450816 and 2451200;

The insert statement produces a corrupt file that cannot be read back.

This change fixes:
- The implementation of zero-compressed encoding in ReadWriteUtil
  class.
- The calculation of block sizes in SnappyBlockCompressor class.
- Creating record/block compressed sequence files in
  HdfsSequenceTableWriter class.

Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
---
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/read-write-util-test.cc
M be/src/exec/read-write-util.h
M be/src/util/compress.cc
M be/src/util/decompress-test.cc
M testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
M tests/query_test/test_compressed_formats.py
8 files changed, 385 insertions(+), 77 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/6107/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>

[Impala-ASF-CR] IMPALA-3079: Fix sequence file writer

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6107/1/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

PS1, Line 240:     buf[0] = -(num_bytes + 119);
             :     val = val ^ (static_cast<int64_t>(-1));
> Thanks for the pointers. Would you mind adding the definitionamd the URLs a
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-HasComments: Yes