You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org> on 2017/06/22 22:29:37 UTC

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Dimitris Tsirogiannis has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7268

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................

IMPALA-5500: Reduce catalog update topic size

Problem: IMPALA-4029 introduced the use of the flatbuffers serialization
libary for storing file and block metadata. That change reduced the
effectiveness of the Thrift compaction protocol (when
--compact_catalog_topic is used), thereby causing a 2X increase in
catalog update topic size when the compact protocol is used.

Fix: Snappy compress the catalog topic updates before sent to the
statestore when --compact_catalog_topic is set to true.

Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Results:
---
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-server.h
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/util/compress.h
M be/src/util/decompress.h
A tests/custom_cluster/test_compact_catalog_updates.py
7 files changed, 168 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/7268/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7268/2/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

Line 330:       if (!status.ok()) {
> take out of if/else since it's common to both cases
Done


Line 367:   uint8_t* compressed_buffer_ptr = &compressed_buffer[0];
> comptessed_buffer.data()
Done


Line 370:   // Base64 encode the compressed catalog object and store it in the topic item.
> Is base64-encoding really necessary? Why don't the raw bytes work?
Hm, not really. It works without it, I only used it because the topic items use strings for keys and values. Removed it.


http://gerrit.cloudera.org:8080/#/c/7268/2/be/src/catalog/catalog-server.h
File be/src/catalog/catalog-server.h:

Line 200:   /// Serializes 'catalog_object' and compresses it using Snappy compression.
> remove second "compression"
Done


http://gerrit.cloudera.org:8080/#/c/7268/2/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

Line 1417: Status ImpalaServer::DecompressAndDeserializeCatalogObject(const TTopicItem& item,
> Why not separate decompression from deserialization? Seem like separate con
Done


http://gerrit.cloudera.org:8080/#/c/7268/2/be/src/util/compress.h
File be/src/util/compress.h:

Line 103:   SnappyCompressor(MemPool* mem_pool = NULL, bool reuse_buffer = false);
> Why these changes? Shouldn't these always be created using CreateCompressor
oops forgot about these. I was playing around with directly creating the compressor. reverted.


http://gerrit.cloudera.org:8080/#/c/7268/2/tests/custom_cluster/test_compact_catalog_updates.py
File tests/custom_cluster/test_compact_catalog_updates.py:

Line 45:           "select count(*) from functional.alltypes")
> Check result? Maybe also use another query on a parquet table.
Added the result check. Out of curiosity, what's so special about the parquet table in this case?


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has uploaded a new patch set (#3).

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................

IMPALA-5500: Reduce catalog update topic size

Problem: IMPALA-4029 introduced the use of the flatbuffers serialization
libary for storing file and block metadata. That change reduced the
effectiveness of the Thrift compaction protocol (when
--compact_catalog_topic is used), thereby causing a 2X increase in
catalog update topic size when the compact protocol is used.

Fix: Snappy compress the catalog topic updates before sent to the
statestore when --compact_catalog_topic is set to true.

Results: ~4X reduction in catalog update topic size

Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
---
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-server.h
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/util/compress.h
M be/src/util/decompress.h
A tests/custom_cluster/test_compact_catalog_updates.py
7 files changed, 141 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/7268/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 7: Code-Review+1

I'm not sure how i missed that

-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/7268/5/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

Line 327:     vector<uint8_t> serialized_object;
Unused variable?


Line 358:   RETURN_IF_ERROR(Codec::CreateCompressor(nullptr, false, THdfsCompression::SNAPPY,
Why not LZ4 to be consistent with what we do for exchanges? I don't have a problem with Snappy if we have some reason to use it but my understanding was that LZ4 was generally superior.


Line 368:   item_value->assign(reinterpret_cast<const char*>(compressed_data.data()),
Why do we need this extra copy here? It seems like we should serialize to a string, then compress directly into the output string. That should be possible if we Serialize() into a temporary string then either compress it or move() it into the output string.


http://gerrit.cloudera.org:8080/#/c/7268/5/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

PS5, Line 1293: reinterpret_cast<const uint8_t*>(
Is this cast needed? vector<uint8_t>.data() should return uint8_t*.


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 9: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 10: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7268/2/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

Line 330:       if (!status.ok()) {
take out of if/else since it's common to both cases


Line 367:   uint8_t* compressed_buffer_ptr = &compressed_buffer[0];
comptessed_buffer.data()


Line 370:   // Base64 encode the compressed catalog object and store it in the topic item.
Is base64-encoding really necessary? Why don't the raw bytes work?


http://gerrit.cloudera.org:8080/#/c/7268/2/be/src/catalog/catalog-server.h
File be/src/catalog/catalog-server.h:

Line 200:   /// Serializes 'catalog_object' and compresses it using Snappy compression.
remove second "compression"


http://gerrit.cloudera.org:8080/#/c/7268/2/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

Line 1417: Status ImpalaServer::DecompressAndDeserializeCatalogObject(const TTopicItem& item,
Why not separate decompression from deserialization? Seem like separate concerns.
Then the code would be something like:

buffer = raw buffer
if (compaction is on) buffer = decompress();
DeserializeThriftMsg()
...


http://gerrit.cloudera.org:8080/#/c/7268/2/be/src/util/compress.h
File be/src/util/compress.h:

Line 103:   SnappyCompressor(MemPool* mem_pool = NULL, bool reuse_buffer = false);
Why these changes? Shouldn't these always be created using CreateCompressor()?


http://gerrit.cloudera.org:8080/#/c/7268/2/tests/custom_cluster/test_compact_catalog_updates.py
File tests/custom_cluster/test_compact_catalog_updates.py:

Line 45:           "select count(*) from functional.alltypes")
Check result? Maybe also use another query on a parquet table.


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has uploaded a new patch set (#7).

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................

IMPALA-5500: Reduce catalog update topic size

Problem: IMPALA-4029 introduced the use of the flatbuffers serialization
libary for storing file and block metadata. That change reduced the
effectiveness of the Thrift compaction protocol (when
--compact_catalog_topic is used), thereby causing a 2X increase in
catalog update topic size when the compact protocol is used.

Fix: LZ4 compress the catalog topic updates before sent to the
statestore when --compact_catalog_topic is set to true.

Results: ~4X reduction in catalog update topic size

Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
---
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-util.cc
M be/src/catalog/catalog-util.h
M be/src/service/impala-server.cc
A tests/custom_cluster/test_compact_catalog_updates.py
5 files changed, 135 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/7268/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 7:

> The code changes look good. We don't seem to have any test coverage
 > for this flag though - I grepped for it and don't see it referenced
 > in any tests. Can we add a custom cluster test that starts up
 > Impala with this flag and runs a query?
 > 
 > Although I'm actually wondering if this should just be the
 > default/only mode. It seems like compression and decompression are
 > probably less of a bottleneck than the network.

But I added a custom test for in this patch. Do you mean an additional test?

-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has uploaded a new patch set (#6).

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................

IMPALA-5500: Reduce catalog update topic size

Problem: IMPALA-4029 introduced the use of the flatbuffers serialization
libary for storing file and block metadata. That change reduced the
effectiveness of the Thrift compaction protocol (when
--compact_catalog_topic is used), thereby causing a 2X increase in
catalog update topic size when the compact protocol is used.

Fix: Snappy compress the catalog topic updates before sent to the
statestore when --compact_catalog_topic is set to true.

Results: ~4X reduction in catalog update topic size

Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
---
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-util.cc
M be/src/catalog/catalog-util.h
M be/src/service/impala-server.cc
A tests/custom_cluster/test_compact_catalog_updates.py
5 files changed, 135 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/7268/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7268/5/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

Line 358:   RETURN_IF_ERROR(Codec::CreateCompressor(nullptr, false, THdfsCompression::SNAPPY,
> LZ4 decompressor requires a pre-allocated output. The problem in this case 
I see - we're depending on MaxOutputLen() on the decompression path. Can't we include the uncompressed size along with the compressed data then? Seems like the right thing to do and a decent performance win.

Otherwise we should explain why we're using Snappy here because it's subtle.


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 8: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7268/8/fe/src/main/java/org/apache/impala/analysis/TableSampleClause.java
File fe/src/main/java/org/apache/impala/analysis/TableSampleClause.java:

Line 21: import com.google.common.base.Preconditions;
These changes are in GVO right now:
http://gerrit.cloudera.org:8080/7315


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7268/4/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

Line 328:     Status status = thrift_serializer_.Serialize(&catalog_object, &serialized_object);
> Let's try to avoid the extra copy for the !FLAGS_compact_catalog_topic case
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 5:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7268/5/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

Line 327:     vector<uint8_t> serialized_object;
> Unused variable?
Done


Line 334:       // Compress the catalog object
> remove (same as code)
Done


Line 358:   RETURN_IF_ERROR(Codec::CreateCompressor(nullptr, false, THdfsCompression::SNAPPY,
> I see - we're depending on MaxOutputLen() on the decompression path. Can't 
I am hesitant to do that. We need essentially to populate the TopicItem with an additional field that is only catalog specific. Also, we already get a 4x compression of catalog topic updates, so I am wondering if this is really worth it. Not super familiar with LZ4's performance/compression ratio. Is it significantly better?


Line 367:   compressed_data.resize(result_len);
> remove
Done


http://gerrit.cloudera.org:8080/#/c/7268/5/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

PS5, Line 1293: reinterpret_cast<const uint8_t*>(
> Is this cast needed? vector<uint8_t>.data() should return uint8_t*.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 5:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7268/5/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

Line 358:   RETURN_IF_ERROR(Codec::CreateCompressor(nullptr, false, THdfsCompression::SNAPPY,
> I'm in favor of LZ4.
Done


Line 368:   item_value->assign(reinterpret_cast<const char*>(compressed_data.data()),
> Why do we need this extra copy here? It seems like we should serialize to a
Done


http://gerrit.cloudera.org:8080/#/c/7268/5/be/src/service/impala-server.h
File be/src/service/impala-server.h:

Line 556:   Status DecompressCatalogObject(const TTopicItem& item,
> It's a little weird to have the Compress/Decompress functions in different 
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 10:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/801/

-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7268/7/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

Line 335:         pending_topic_updates_.pop_back();
I think we could incorrectly end up popping two pending topic updates here. One way to fix it is to check status.ok() in L331


http://gerrit.cloudera.org:8080/#/c/7268/7/be/src/catalog/catalog-util.cc
File be/src/catalog/catalog-util.cc:

Line 202:   ReadWriteUtil::PutInt(output_buffer_ptr, (uint32_t)catalog_object->size());
static_cast instead of C-style cast


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7268/5/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

Line 358:   RETURN_IF_ERROR(Codec::CreateCompressor(nullptr, false, THdfsCompression::SNAPPY,
> I am hesitant to do that. We need essentially to populate the TopicItem wit
I see your point.

https://github.com/lz4/lz4 reports that it's maybe 50% faster. According to https://www.percona.com/blog/2016/04/13/evaluating-database-compression-methods-update/ maybe 30-40%

We could alternatively just write the size at the start or end of the buffer on both ends with ReadWriteUtil::PutVLong() or something like that. That's what snappy is doing internally I think.

This may be overengineering it so I'm fine if you just want to explain the motivation for snappy.


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has uploaded a new patch set (#4).

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................

IMPALA-5500: Reduce catalog update topic size

Problem: IMPALA-4029 introduced the use of the flatbuffers serialization
libary for storing file and block metadata. That change reduced the
effectiveness of the Thrift compaction protocol (when
--compact_catalog_topic is used), thereby causing a 2X increase in
catalog update topic size when the compact protocol is used.

Fix: Snappy compress the catalog topic updates before sent to the
statestore when --compact_catalog_topic is set to true.

Results: ~4X reduction in catalog update topic size

Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
---
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-server.h
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
A tests/custom_cluster/test_compact_catalog_updates.py
5 files changed, 143 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/7268/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7268/5/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

Line 358:   RETURN_IF_ERROR(Codec::CreateCompressor(nullptr, false, THdfsCompression::SNAPPY,
> Why not LZ4 to be consistent with what we do for exchanges? I don't have a 
LZ4 decompressor requires a pre-allocated output. The problem in this case is that we don't know the initial (pre-compressed) size of the catalog object.


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7268/7/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

Line 335:         pending_topic_updates_.pop_back();
> I think we could incorrectly end up popping two pending topic updates here.
Good catch. Done


http://gerrit.cloudera.org:8080/#/c/7268/7/be/src/catalog/catalog-util.cc
File be/src/catalog/catalog-util.cc:

Line 202:   ReadWriteUtil::PutInt(output_buffer_ptr, (uint32_t)catalog_object->size());
> static_cast instead of C-style cast
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 7:

The code changes look good. We don't seem to have any test coverage for this flag though - I grepped for it and don't see it referenced in any tests. Can we add a custom cluster test that starts up Impala with this flag and runs a query?

Although I'm actually wondering if this should just be the default/only mode. It seems like compression and decompression are probably less of a bottleneck than the network.

-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has uploaded a new patch set (#5).

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................

IMPALA-5500: Reduce catalog update topic size

Problem: IMPALA-4029 introduced the use of the flatbuffers serialization
libary for storing file and block metadata. That change reduced the
effectiveness of the Thrift compaction protocol (when
--compact_catalog_topic is used), thereby causing a 2X increase in
catalog update topic size when the compact protocol is used.

Fix: Snappy compress the catalog topic updates before sent to the
statestore when --compact_catalog_topic is set to true.

Results: ~4X reduction in catalog update topic size

Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
---
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-server.h
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
A tests/custom_cluster/test_compact_catalog_updates.py
5 files changed, 138 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/7268/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7268/3/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

Line 331:       status = SerializeAndCompressCatalogObject(catalog_object, item);
> Thinking we should have a similar separation of serialization and compressi
Done


Line 366:   compressed_buffer.resize(result_len);
> Remove since we're copying into item.value anyway
Done


Line 367:   item.value = string(reinterpret_cast<char*>(compressed_buffer_ptr),
> use item.value.assign()
Done


http://gerrit.cloudera.org:8080/#/c/7268/3/be/src/service/impala-server.h
File be/src/service/impala-server.h:

Line 557:       std::vector<uint8_t>& data_buffer) WARN_UNUSED_RESULT;
> we use pointers for out params, i.e. use std::vector<uint8_t>*
Done


http://gerrit.cloudera.org:8080/#/c/7268/3/be/src/util/compress.h
File be/src/util/compress.h:

Line 113
> revert
Done


http://gerrit.cloudera.org:8080/#/c/7268/3/be/src/util/decompress.h
File be/src/util/decompress.h:

Line 86: 
> revert
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has uploaded a new patch set (#2).

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................

IMPALA-5500: Reduce catalog update topic size

Problem: IMPALA-4029 introduced the use of the flatbuffers serialization
libary for storing file and block metadata. That change reduced the
effectiveness of the Thrift compaction protocol (when
--compact_catalog_topic is used), thereby causing a 2X increase in
catalog update topic size when the compact protocol is used.

Fix: Snappy compress the catalog topic updates before sent to the
statestore when --compact_catalog_topic is set to true.

Results: ~4X reduction in catalog update topic size

Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
---
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-server.h
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/util/compress.h
M be/src/util/decompress.h
A tests/custom_cluster/test_compact_catalog_updates.py
7 files changed, 168 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/7268/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 5:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7268/5/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

Line 334:       // Compress the catalog object
remove (same as code)


Line 367:   compressed_data.resize(result_len);
remove


http://gerrit.cloudera.org:8080/#/c/7268/5/be/src/service/impala-server.h
File be/src/service/impala-server.h:

Line 556:   Status DecompressCatalogObject(const TTopicItem& item,
It's a little weird to have the Compress/Decompress functions in different classes and specific to catalog objects. How about we 
introduce static functions in SnappyCompressor like this:

static Status Compress(string*);
static Status Decompress(string*);

Both compress/decompress in place using temporaries as necessary.

Before you make the change, let me ask Tim what he thinks.


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


IMPALA-5500: Reduce catalog update topic size

Problem: IMPALA-4029 introduced the use of the flatbuffers serialization
libary for storing file and block metadata. That change reduced the
effectiveness of the Thrift compaction protocol (when
--compact_catalog_topic is used), thereby causing a 2X increase in
catalog update topic size when the compact protocol is used.

Fix: LZ4 compress the catalog topic updates before sent to the
statestore when --compact_catalog_topic is set to true.

Results: ~4X reduction in catalog update topic size

Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Reviewed-on: http://gerrit.cloudera.org:8080/7268
Reviewed-by: Dimitris Tsirogiannis <dt...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-util.cc
M be/src/catalog/catalog-util.h
M be/src/service/impala-server.cc
A tests/custom_cluster/test_compact_catalog_updates.py
5 files changed, 136 insertions(+), 5 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Dimitris Tsirogiannis: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7268/5/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

Line 358:   RETURN_IF_ERROR(Codec::CreateCompressor(nullptr, false, THdfsCompression::SNAPPY,
> I see your point.
I'm in favor of LZ4.


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7268/4/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

Line 328:     Status status = thrift_serializer_.Serialize(&catalog_object, &serialized_object);
Let's try to avoid the extra copy for the !FLAGS_compact_catalog_topic case. I think you could pass in &item.value to Serialize like before, and then change CompressCatalogObject() to take a const uint8_t* and a length as input.


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 10: Code-Review+2

Rebase, keep Alex's +2

-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7268

to look at the new patch set (#8).

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................

IMPALA-5500: Reduce catalog update topic size

Problem: IMPALA-4029 introduced the use of the flatbuffers serialization
libary for storing file and block metadata. That change reduced the
effectiveness of the Thrift compaction protocol (when
--compact_catalog_topic is used), thereby causing a 2X increase in
catalog update topic size when the compact protocol is used.

Fix: LZ4 compress the catalog topic updates before sent to the
statestore when --compact_catalog_topic is set to true.

Results: ~4X reduction in catalog update topic size

Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
---
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-util.cc
M be/src/catalog/catalog-util.h
M be/src/service/impala-server.cc
M fe/src/main/java/org/apache/impala/analysis/TableSampleClause.java
M fe/src/main/java/org/apache/impala/analysis/TimestampLiteral.java
M fe/src/test/java/org/apache/impala/hive/executor/UdfExecutorTest.java
A tests/custom_cluster/test_compact_catalog_updates.py
8 files changed, 139 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/7268/8
-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7268/3/be/src/catalog/catalog-server.cc
File be/src/catalog/catalog-server.cc:

Line 331:       status = SerializeAndCompressCatalogObject(catalog_object, item);
Thinking we should have a similar separation of serialization and compression like we do for the decompression.


Line 366:   compressed_buffer.resize(result_len);
Remove since we're copying into item.value anyway


Line 367:   item.value = string(reinterpret_cast<char*>(compressed_buffer_ptr),
use item.value.assign()


http://gerrit.cloudera.org:8080/#/c/7268/3/be/src/service/impala-server.h
File be/src/service/impala-server.h:

Line 557:       std::vector<uint8_t>& data_buffer) WARN_UNUSED_RESULT;
we use pointers for out params, i.e. use std::vector<uint8_t>*


http://gerrit.cloudera.org:8080/#/c/7268/3/be/src/util/compress.h
File be/src/util/compress.h:

Line 113
revert


http://gerrit.cloudera.org:8080/#/c/7268/3/be/src/util/decompress.h
File be/src/util/decompress.h:

Line 86: 
revert


http://gerrit.cloudera.org:8080/#/c/7268/2/tests/custom_cluster/test_compact_catalog_updates.py
File tests/custom_cluster/test_compact_catalog_updates.py:

Line 45:       assert result.data[0] == "7300"
> Added the result check. Out of curiosity, what's so special about the parqu
Nothing special about Parquet, just another table.


-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5500: Reduce catalog update topic size

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7268

to look at the new patch set (#9).

Change subject: IMPALA-5500: Reduce catalog update topic size
......................................................................

IMPALA-5500: Reduce catalog update topic size

Problem: IMPALA-4029 introduced the use of the flatbuffers serialization
libary for storing file and block metadata. That change reduced the
effectiveness of the Thrift compaction protocol (when
--compact_catalog_topic is used), thereby causing a 2X increase in
catalog update topic size when the compact protocol is used.

Fix: LZ4 compress the catalog topic updates before sent to the
statestore when --compact_catalog_topic is set to true.

Results: ~4X reduction in catalog update topic size

Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
---
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-util.cc
M be/src/catalog/catalog-util.h
M be/src/service/impala-server.cc
A tests/custom_cluster/test_compact_catalog_updates.py
5 files changed, 136 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/7268/9
-- 
To view, visit http://gerrit.cloudera.org:8080/7268
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>