You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tajo.apache.org by jinossy <gi...@git.apache.org> on 2016/02/03 11:15:06 UTC

[GitHub] tajo pull request: TAJO-2073: Upgrade parquet-mr to 1.8.1.

GitHub user jinossy opened a pull request:

    https://github.com/apache/tajo/pull/958

    TAJO-2073: Upgrade parquet-mr to 1.8.1.

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinossy/tajo TAJO-2073

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tajo/pull/958.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #958
    
----
commit 093bbd63c0c22bee51cdc1335f27e9766249af6e
Author: Jinho Kim <jh...@apache.org>
Date:   2016-02-03T10:12:37Z

    TAJO-2073: Upgrade parquet-mr to 1.8.1.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-2073: Upgrade parquet-mr to 1.8.1.

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/tajo/pull/958


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-2073: Upgrade parquet-mr to 1.8.1.

Posted by jinossy <gi...@git.apache.org>.
Github user jinossy commented on the pull request:

    https://github.com/apache/tajo/pull/958#issuecomment-184022027
  
    Thanks for your review!
    I'll commit it that reflects your comments


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-2073: Upgrade parquet-mr to 1.8.1.

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/958#discussion_r52702202
  
    --- Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/InternalParquetRecordReader.java ---
    @@ -70,37 +81,50 @@
       private long totalCountLoadedSoFar = 0;
     
       private Path file;
    +  private UnmaterializableRecordCounter unmaterializableRecordCounter;
    +
    +  /**
    +   * @param readSupport Object which helps reads files of the given type, e.g. Thrift, Avro.
    +   * @param filter for filtering individual records
    +   */
    +  public InternalParquetRecordReader(ReadSupport<T> readSupport, Filter filter) {
    +    this.readSupport = readSupport;
    +    this.filter = checkNotNull(filter, "filter");
    +  }
     
       /**
        * @param readSupport Object which helps reads files of the given type, e.g. Thrift, Avro.
        */
       public InternalParquetRecordReader(ReadSupport<T> readSupport) {
    -    this(readSupport, null);
    +    this(readSupport, FilterCompat.NOOP);
       }
     
       /**
        * @param readSupport Object which helps reads files of the given type, e.g. Thrift, Avro.
        * @param filter Optional filter for only returning matching records.
    +   * @deprecated use {@link #InternalParquetRecordReader(ReadSupport, Filter)}
        */
    -  public InternalParquetRecordReader(ReadSupport<T> readSupport, UnboundRecordFilter
    -      filter) {
    -    this.readSupport = readSupport;
    -    this.recordFilter = filter;
    +  @Deprecated
    +  public InternalParquetRecordReader(ReadSupport<T> readSupport, UnboundRecordFilter filter) {
    +    this(readSupport, FilterCompat.get(filter));
       }
     
       private void checkRead() throws IOException {
         if (current == totalCountLoadedSoFar) {
           if (current != 0) {
    -        long timeAssembling = System.currentTimeMillis() - startedAssemblingCurrentBlockAt;
    -        totalTimeSpentProcessingRecords += timeAssembling;
    -        if (DEBUG) LOG.debug("Assembled and processed " + totalCountLoadedSoFar + " records from " + columnCount + " columns in " + totalTimeSpentProcessingRecords + " ms: " + ((float) totalCountLoadedSoFar / totalTimeSpentProcessingRecords) + " rec/ms, " + ((float) totalCountLoadedSoFar * columnCount / totalTimeSpentProcessingRecords) + " cell/ms");
    -        long totalTime = totalTimeSpentProcessingRecords + totalTimeSpentReadingBytes;
    -        long percentReading = 100 * totalTimeSpentReadingBytes / totalTime;
    -        long percentProcessing = 100 * totalTimeSpentProcessingRecords / totalTime;
    -        if (DEBUG) LOG.debug("time spent so far " + percentReading + "% reading ("+totalTimeSpentReadingBytes+" ms) and " + percentProcessing + "% processing ("+totalTimeSpentProcessingRecords+" ms)");
    +        totalTimeSpentProcessingRecords += (System.currentTimeMillis() - startedAssemblingCurrentBlockAt);
    +        if (Log.INFO) {
    --- End diff --
    
    Even though these logs seem to be printed whenever a row group is fully read, I'm concerned with there will be too many logs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-2073: Upgrade parquet-mr to 1.8.1.

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/958#discussion_r52701715
  
    --- Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/parquet/ParquetScanner.java ---
    @@ -78,6 +96,7 @@ public Tuple next() throws IOException {
        */
       @Override
       public void reset() throws IOException {
    +    throw new TajoRuntimeException(new UnsupportedException());
    --- End diff --
    
    UnimplementedException looks more proper.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-2073: Upgrade parquet-mr to 1.8.1.

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on the pull request:

    https://github.com/apache/tajo/pull/958#issuecomment-183168945
  
    +1. I left trivial comments. Please consider them before you commit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-2073: Upgrade parquet-mr to 1.8.1.

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/958#discussion_r52702301
  
    --- Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/thirdparty/parquet/InternalParquetRecordReader.java ---
    @@ -138,24 +162,29 @@ public float getProgress() {
         return (float) current / total;
       }
     
    -  public void initialize(MessageType requestedSchema, MessageType fileSchema,
    -                         Map<String, String> extraMetadata, Map<String, String> readSupportMetadata,
    +  public void initialize(MessageType fileSchema,
    --- End diff --
    
    FileMetaData also includes the file schema.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---