You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@impala.apache.org by "Alex Behm (Code Review)" <ge...@cloudera.org> on 2016/05/02 21:03:53 UTC

[Impala-CR](cdh5-trunk) IMPALA-2736: Basic column-wise slot materialization in Parquet scanner.

Alex Behm has posted comments on this change.

Change subject: IMPALA-2736: Basic column-wise slot materialization in Parquet scanner.
......................................................................


Patch Set 8:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/2779/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 329:       uint8_t* tuple_mem, int* num_values) {
> describe role of tuple_mem
Done


Line 369:   bool ReadValueBatch(MemPool* pool, int max_values, int tuple_size,
> instead of having this be templated, why not inline the code at the call si
Based on my experience working with this code, the most difficult part is getting all the strange edge cases to work, so duplicating loops/code seems like a bad idea to me.

I did what you suggested, but I think it will become more clear in my next patch (that is already in CR) that this template is useful for avoiding duplication of rather tricky code pieces. We can revisit in the next CR.

I moved the implementations below in this .cc


Line 708:   virtual bool ReadValue(MemPool* pool, Tuple* tuple) {
> do we still need this?
R: Yes. We need it for the row-wise materialization of items in CollectionValues.


Line 723:     return ReadValueBatch<false>(pool, max_values, tuple_size, tuple_mem, num_values);
> same comment about inlining the call of the templated function applies here
I don't think it's a good idea, but that will only become clear in my next patch, so I followed your suggestion, and we can revisit in the next CR.


Line 755:   template<bool IN_COLLECTION>
> since we're now dealing with batches, it would be nice to get rid of the te
IN_COLLECTION is gone for now. Did you have others in mind?


Line 1722:   // This function most not be called when the output batch is already full. As long as
> must
Done


Line 1797:     ++stats->total_possible;
> todo: do these stats updates outside the loop if possible, as in stats->tot
Merged with existing TODO. Done.


Line 1801:     // producing an output batch.
> good idea
Done


Line 1874:         continue_execution = col_reader->ReadNonRepeatedValueBatch(
> is there a more graceful name? ReadScalarValueBatch? TopLevelValueBatch?
Neither "scalar" nor "top-level" are discriminative enough. We can materialize repeated values into the top-level tuple, e.g.,:

select orderkey from customer.orders;

Happy to change the fn name, but I think the current name is technically accurate, and I could not come up with anything better.


http://gerrit.cloudera.org:8080/#/c/2779/8/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 484:   template <bool IN_COLLECTION>
> the comment says top-level tuples, how can in_collection be true? (or: what
IN_COLLECTION is to be understood in the "is in repeated field" sense and not in the "is materializing a CollectionValue" sense.
Agree it's confusing. Let's agree on a new name before I change it everywhere.
How about "IN_REPEATED_FIELD"?

I also removed IN_COLLECTION from AssembleRows(). Hopefully this makes the meaning a little clearer.

Note that we can materialize values in repeated fields into the top-level tuple, e.g.:

select orderkey from customers.orders;


-- 
To view, visit http://gerrit.cloudera.org:8080/2779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Gerrit-PatchSet: 8
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <sk...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes