Posted to issues@spark.apache.org by "Yash Datta (JIRA)" <ji...@apache.org> on 2014/11/12 15:12:33 UTC

[jira] [Created] (SPARK-4365) Remove unnecessary filter call on records returned from parquet library

Yash Datta created SPARK-4365:
---------------------------------

             Summary: Remove unnecessary filter call on records returned from parquet library
                 Key: SPARK-4365
                 URL: https://issues.apache.org/jira/browse/SPARK-4365
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.1.0
            Reporter: Yash Datta
            Priority: Minor
             Fix For: 1.2.0


Since the parquet library has been updated, we no longer need to filter the records it returns for null values, because the library now skips those records itself:

From parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java:


  public boolean nextKeyValue() throws IOException, InterruptedException {
    boolean recordFound = false;

    while (!recordFound) {
      // no more records left
      if (current >= total) { return false; }

      try {
        checkRead();
        currentValue = recordReader.read();
        current ++; 
        if (recordReader.shouldSkipCurrentRecord()) {
          // this record is being filtered via the filter2 package
          if (DEBUG) LOG.debug("skipping record");
          continue;
        }   

        if (currentValue == null) {
          // only happens with FilteredRecordReader at end of block
          current = totalCountLoadedSoFar;
          if (DEBUG) LOG.debug("filtered record reader reached end of block");
          continue;
        }   
        recordFound = true;

        if (DEBUG) LOG.debug("read value: " + currentValue);
      } catch (RuntimeException e) {
        throw new ParquetDecodingException(format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e); 
      }   
    }   
    return true;
  }
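
To illustrate why the caller-side filter becomes redundant, here is a minimal sketch of a reader whose nextKeyValue() loop has the same shape as the one above. It is not the parquet or Spark API: the class SkippingReaderSketch, its readAll helper, and the use of a plain list of strings (with null standing in for a record the reader should skip) are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a record reader; not the parquet or Spark API.
class SkippingReaderSketch {
    // Raw values; a null entry stands in for a record the reader should skip.
    private final Iterator<String> source;
    private String currentValue;

    SkippingReaderSketch(List<String> rawValues) {
        this.source = rawValues.iterator();
    }

    // Same shape as nextKeyValue() above: keep reading until a non-null
    // record is found, or report that no records are left.
    boolean nextKeyValue() {
        while (source.hasNext()) {
            String value = source.next();
            if (value == null) {
                continue; // skipped inside the reader, never surfaced to the caller
            }
            currentValue = value;
            return true;
        }
        return false;
    }

    String getCurrentValue() {
        return currentValue;
    }

    // Caller-side loop: no null check is needed, because the reader
    // only returns true after storing a non-null value.
    static List<String> readAll(List<String> rawValues) {
        SkippingReaderSketch reader = new SkippingReaderSketch(rawValues);
        List<String> out = new ArrayList<>();
        while (reader.nextKeyValue()) {
            out.add(reader.getCurrentValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("a", null, "b", null);
        System.out.println(readAll(raw)); // prints [a, b]
    }
}
```

Because the reader never hands a null to its caller, an additional null filter applied to its output would drop nothing, which is the redundancy this issue proposes to remove on the Spark side.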

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)