Posted to issues@drill.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/03/13 00:03:00 UTC

[jira] [Commented] (DRILL-7641) Convert Excel Reader to Use Streaming Reader

    [ https://issues.apache.org/jira/browse/DRILL-7641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058303#comment-17058303 ] 

ASF GitHub Bot commented on DRILL-7641:
---------------------------------------

cgivre commented on pull request #2024: DRILL-7641 Convert Excel Reader to use Streaming Reader
URL: https://github.com/apache/drill/pull/2024
 
 
   # [DRILL-7641](https://issues.apache.org/jira/browse/DRILL-7641): Convert Excel Reader to use Streaming Reader
   
   ## Description
   The current implementation of the Excel reader uses the Apache POI reader, which uses excessive amounts of memory. As a result, attempting to read large Excel files causes out-of-memory errors.
   This PR converts the format plugin to use a streaming reader that is still based on the POI library. The documentation for the streaming reader can be found at [1]. The library is billed as a drop-in replacement for the POI reader; however, I had to make some minor changes to the batch reader to get it to work. This PR also includes minor code cleanup.
   
   [1]: https://github.com/pjfanning/excel-streaming-reader
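
As a rough illustration of why streaming keeps memory bounded, here is a minimal sketch using only the JDK's StAX pull parser. It is not the code in this PR and not the excel-streaming-reader API; the class name `StreamingSheetSketch` and the method `streamRows` are hypothetical, and the input is assumed to be a simplified sheet XML fragment with `row` and `v` elements. The point is only that each row is handed to the caller and then discarded, rather than the whole workbook being built in memory first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration only: not Drill's batch reader and not the library's API.
final class StreamingSheetSketch {

  // Pulls rows one at a time from sheet XML and hands each one to the consumer.
  // Only the current row's cell values are held in memory at any point.
  // Returns the number of rows seen.
  static int streamRows(String sheetXml, Consumer<List<String>> rowConsumer) {
    int rowCount = 0;
    try {
      XMLStreamReader xml = XMLInputFactory.newInstance()
          .createXMLStreamReader(new StringReader(sheetXml));
      List<String> cells = new ArrayList<>();
      while (xml.hasNext()) {
        int event = xml.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
          if ("row".equals(xml.getLocalName())) {
            // Start a fresh list so the previous row can be garbage collected.
            cells = new ArrayList<>();
          } else if ("v".equals(xml.getLocalName())) {
            // <v> holds the raw cell value as text.
            cells.add(xml.getElementText());
          }
        } else if (event == XMLStreamConstants.END_ELEMENT
            && "row".equals(xml.getLocalName())) {
          rowConsumer.accept(cells);
          rowCount++;
        }
      }
      xml.close();
    } catch (XMLStreamException e) {
      throw new IllegalStateException("Malformed sheet XML", e);
    }
    return rowCount;
  }
}
```

A real .xlsx file is a zip archive, and cell values are often indexes into a shared strings table, so this sketch deliberately ignores unzipping, shared strings, and cell types.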
   
   ## Documentation
   No user visible changes.
   
   ## Testing
   All unit tests from the original plugin pass. Additionally, I tested this with large Excel files on my local machine, and Drill was able to query them; before this PR, Drill would run out of memory.
 


> Convert Excel Reader to Use Streaming Reader
> --------------------------------------------
>
>                 Key: DRILL-7641
>                 URL: https://issues.apache.org/jira/browse/DRILL-7641
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Text & CSV
>    Affects Versions: 1.17.0
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Major
>             Fix For: 1.18.0
>
>
> The current implementation of the Excel reader uses the Apache POI reader, which uses excessive amounts of memory. As a result, attempting to read large Excel files causes out-of-memory errors.
> This PR converts the format plugin to use a streaming reader that is still based on the POI library. The documentation for the streaming reader can be found at [1].
> All unit tests pass, and I tested the plugin with some large Excel files on my computer.
> [1]: [https://github.com/pjfanning/excel-streaming-reader]
>  


