Posted to reviews@impala.apache.org by "Tim Armstrong (Code Review)" <ge...@cloudera.org> on 2018/02/01 23:50:47 UTC

[Impala-ASF-CR] IMPALA-5717: Support for ORC data files

Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for ORC data files
......................................................................


Patch Set 3:

I'm still trying to grok the patch. I have a couple of higher-level asks:

* In the planner, we assume in a few places that PARQUET is the only columnar file format, e.g. in the code below. We should identify the places where "== PARQUET" really means "isColumnar()" and update those accordingly so that ORC is also counted.

    if (table.getMajorityFormat() == HdfsFileFormat.PARQUET) {
      // For the purpose of this estimation, the number of per-host scan ranges for
      // Parquet files are equal to the number of columns read from the file. I.e.
      // excluding partition columns and columns that are populated from file metadata.

* You should add ORC to test_scanners_fuzz.py and run it in a loop for a while. That often flushes out bugs in handling invalid data.

  while impala-py.test tests/query_test/test_scanners_fuzz.py -k parquet; do echo yes ; done
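
To make the planner ask above concrete, here is a rough sketch of the kind of check I have in mind. FileFormatSketch, ScanRangeEstimateSketch, and perHostScanRanges are hypothetical stand-ins, not Impala's actual classes or methods; only the PARQUET and ORC names and the "columns read" rule come from the snippet quoted above, and the non-columnar fallback is an assumption made purely for illustration.

```java
// Hypothetical stand-in for HdfsFileFormat. Only PARQUET and ORC come from the
// review; TEXT and AVRO are here just to have non-columnar examples.
enum FileFormatSketch {
  TEXT(false),
  AVRO(false),
  PARQUET(true),
  ORC(true);

  private final boolean columnar;

  FileFormatSketch(boolean columnar) {
    this.columnar = columnar;
  }

  // Single place that answers "is this a columnar format?", so call sites
  // no longer need to compare against PARQUET directly.
  boolean isColumnar() {
    return columnar;
  }
}

final class ScanRangeEstimateSketch {
  // Assumption: for columnar formats the estimate is the number of columns read
  // from the file (as in the quoted planner comment); for other formats the
  // caller-supplied fallback is returned unchanged.
  static long perHostScanRanges(
      FileFormatSketch majorityFormat, long numColumnsRead, long fallbackEstimate) {
    if (majorityFormat.isColumnar()) {
      return numColumnsRead;
    }
    return fallbackEstimate;
  }

  public static void main(String[] args) {
    // ORC takes the same branch as PARQUET once the check is isColumnar().
    System.out.println(perHostScanRanges(FileFormatSketch.ORC, 5, 40));
    System.out.println(perHostScanRanges(FileFormatSketch.TEXT, 5, 40));
  }
}
```

With a helper like this, each "== PARQUET" site can be audited individually: the ones that are really about columnar layout switch to isColumnar(), and the ones that are genuinely Parquet-specific stay as they are.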


-- 
To view, visit http://gerrit.cloudera.org:8080/9134

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 01 Feb 2018 23:50:47 +0000
Gerrit-HasComments: No