You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2022/12/16 08:04:51 UTC
[Impala-ASF-CR] WiP: IMPALA-10798 : Prototype for JSON reader

Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17771 )

Change subject: WiP: IMPALA-10798 : Prototype for JSON reader
......................................................................


Patch Set 22:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/17771/22/be/src/exec/hdfs-json-scanner.cc
File be/src/exec/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17771/22/be/src/exec/hdfs-json-scanner.cc@192
PS22, Line 192: new ArrowMemPool
Will the arrow lib delete this object at the end? Or is it the caller's duty to delete this object?


http://gerrit.cloudera.org:8080/#/c/17771/22/be/src/exec/hdfs-json-scanner.cc@201
PS22, Line 201:   res2 = reader->Read();
Please add a comment mentioning that "Read the entire JSON file and convert it to a Arrow Table"


http://gerrit.cloudera.org:8080/#/c/17771/22/be/src/exec/hdfs-json-scanner.cc@218
PS22, Line 218:   vector<ColumnDescriptor> cv = tad->col_descs();
              :   vector<std::shared_ptr<arrow::Field>> fields_list = {};
              :   // convert impala tuple descriptor to arrow schema
              :   for (auto cvf : cv) {
              :     std::shared_ptr<arrow::Field> field_a =
              :         arrow::field(cvf.name(), ColumnType2ArrowType(cvf.type()));
              :     fields_list.push_back(field_a);
              :   }
This explicitly sets schema for all columns of the table since 'tad' is the TableDescriptor. We should be able to only set schema of the required columns, i.e. use 'td', the TupleDescriptor directly. This avoids internal arrow parse errors on unneeded columns, e.g. decimal parse errors due to scale/precision overflow.


http://gerrit.cloudera.org:8080/#/c/17771/22/be/src/exec/hdfs-json-scanner.cc@279
PS22, Line 279:       if (i == chunked_boundary_) {
I think this is only updated in the first column and the prerequirement is that chunks in different columns are aligned. Is it alway true?



-- 
To view, visit http://gerrit.cloudera.org:8080/17771
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If79364a421d862d0d837f9be694911e388d4d629
Gerrit-Change-Number: 17771
Gerrit-PatchSet: 22
Gerrit-Owner: Anonymous Coward <sh...@gmail.com>
Gerrit-Reviewer: Aman Sinha <am...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward <pr...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward <sh...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 16 Dec 2022 08:04:51 +0000
Gerrit-HasComments: Yes