You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by arina-ielchiieva <gi...@git.apache.org> on 2017/11/09 14:00:49 UTC

[GitHub] drill pull request #1030: DRILL-5941: Skip header / footer improvements for ...

GitHub user arina-ielchiieva opened a pull request:

    https://github.com/apache/drill/pull/1030

    DRILL-5941: Skip header / footer improvements for Hive storage plugin

    Overview:
    1. When table has header / footer process input splits fo the same file in one reader (bug fix for DRILL-5941).
    2. Apply skip header logic during reader initialization only once to avoid checks during reading the data (DRILL-5106).
    3. Apply skip footer logic only when footer is more then 0, otherwise default processing will be done without buffering data in queue (DRIL-5106).
    
    Code changes:
    1. AbstractReadersInitializer was introduced to factor out common logic during readers intialization.
    It will have three implementations:
    a. Default (each input split gets its own reader);
    b. Empty (for empty tables);
    c. InputSplitGroups (applied when table has header / footer and input splits of the same file should be processed together).
    
    2. AbstractRecordsInspector was introduced to improve performance when table has footer is less or equals to 0.
    It will have two implementations:
    a. Default (records will be processed one by one without buffering);
    b. SkipFooter (queue will be used to buffer N records that should be skipped in the end of file processing).
    
    3. Allow HiveAbstractReader to have multiple input splits.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/arina-ielchiieva/drill DRILL-5941

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1030.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1030
    
----

----


---

[GitHub] drill issue #1030: DRILL-5941: Skip header / footer improvements for Hive st...

Posted by priteshm <gi...@git.apache.org>.
Github user priteshm commented on the issue:

    https://github.com/apache/drill/pull/1030
  
    @ppadma can you review this?


---

[GitHub] drill issue #1030: DRILL-5941: Skip header / footer improvements for Hive st...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/1030
  
    @arina-ielchiieva That will have performance impact.  Better way would be to keep one reader per split and see if we can figure out a way to tell readers how many rows they should skip (for header or footer).


---

[GitHub] drill pull request #1030: DRILL-5941: Skip header / footer improvements for ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/1030


---

[GitHub] drill issue #1030: DRILL-5941: Skip header / footer improvements for Hive st...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on the issue:

    https://github.com/apache/drill/pull/1030
  
    @ppadma performance impact will be only on those tables that have skip header / footer functionality (one reader per file), for other tables processing will be the same (one reader per input). Currently for tables that have skip header / footer functionality and have multiple input splits, these rows are skipped for each input, thus leaving user with incorrect data set.
    
    Regarding your suggestion, as I have mentioned in my previous comment (please see links to the code there as well) we use hadoop reader for the data and don't have information about number of rows in input split (I wish we did, it would make life much easier). 
    
    I have an idea how to parallelize readers when we have only header (though still all readers will be on the same node) but when we have footer we'll have to read one file per reader.  Also we can consider Drill text reader usage instead of hadoop one (as we do for parquet files).
    
    Anyway, I suggest we commit these changes and create new Jira for future improvement of skip header / footer performance. Will this be acceptable?


---

[GitHub] drill issue #1030: DRILL-5941: Skip header / footer improvements for Hive st...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/1030
  
    @arina-ielchiieva I am concerned about performance impact by grouping all splits in a single reader (essentially, not parallelizing at all).
    Wondering if it is possible to do this way:
    During planning, in HiveScan,  if it is text file and has header/footer, get the number of rows to skip. Read the header/footer rows and based on that, adjust the first/last split and offset within them. The splits which have only header/footer rows can be removed from inputSplits. In HiveSubScan, change hiveReadEntry to be a list (one entry for each split). Add an entry in hiveReadEntry, numRowsToSkip (or offsetToStart) which can be passed to the recordReaders in getBatch for each subScan. This is fairly complicated and I am sure I might be missing some details :-)


---

[GitHub] drill issue #1030: DRILL-5941: Skip header / footer improvements for Hive st...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/1030
  
    @arina-ielchiieva I have a question. If the splits for a file are spread across multiple fragments, does this logic work ?



---

[GitHub] drill issue #1030: DRILL-5941: Skip header / footer improvements for Hive st...

Posted by parthchandra <gi...@git.apache.org>.
Github user parthchandra commented on the issue:

    https://github.com/apache/drill/pull/1030
  
    +1


---

[GitHub] drill issue #1030: DRILL-5941: Skip header / footer improvements for Hive st...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on the issue:

    https://github.com/apache/drill/pull/1030
  
    @ppadma pull request is updated to handle header / footer logic in distributed environment. Comment with PR description in the beginning of the pull request is updated. Please review when possible.


---

[GitHub] drill issue #1030: DRILL-5941: Skip header / footer improvements for Hive st...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/1030
  
    @arina-ielchiieva Ideally, it would be better to do this properly first time so we don't have to revisit. But, since there is a correctness issue, it is ok to go with the approach you have and address the performance problem later. I will finish review and post my comments soon.


---

[GitHub] drill issue #1030: DRILL-5941: Skip header / footer improvements for Hive st...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/1030
  
    +1. LGTM.


---

[GitHub] drill issue #1030: DRILL-5941: Skip header / footer improvements for Hive st...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/1030
  
    For FWIW, the native CSV reader does the following:
    
    * To read the header, it seeks to offset 0 in the file, regardless of the block being read, then reads the header, which may be a remote read.
    * The reader then seeks to the start of its block. If this is the first block, it skips the header, else it searches for the start of the next record.
    
    Since Hive has the same challenges, Hive must have solved this, we have only to research that existing solution.
    
    One simple solution is:
    
    * If block number is 0, skip the header.
    * If block number is 1 or larger, look for the next record separator.


---

[GitHub] drill issue #1030: DRILL-5941: Skip header / footer improvements for Hive st...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on the issue:

    https://github.com/apache/drill/pull/1030
  
    @ppadma really good point! I did not realize that when creating readers input splits can be already distributed among several drillbits. So I guess we should make sure that when we have skip header / footer logic in table input splits of the same file reside in one `HiveSubScan` during planning stage. I'll have to make some changes and let you know once done.


---

[GitHub] drill issue #1030: DRILL-5941: Skip header / footer improvements for Hive st...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on the issue:

    https://github.com/apache/drill/pull/1030
  
    @ppadma 
    To create reader for each input split and maintain skip header / footer functionality we need to know how many rows are in input split. Unfortunately, input split does not hold such information, only number of bytes. [1] We can't apply skip header functionality for the first input split and skip footer for the last input either since we don't know how many rows will be skipped, it can be the situation that we need to skip the whole first input split and partially second.
    
    @paul-rogers 
    To read from hive we actually use Hadoop reader [2, 3] so if I am not mistaken unfortunately the described above approach can be applied.
    
    [1] https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/FileSplit.html
    [2] https://github.com/apache/drill/blob/master/contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveAbstractReader.java#L234
    [3] https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/RecordReader.html


---