Posted to issues@drill.apache.org by "PJ Fanning (Jira)" <ji...@apache.org> on 2021/12/27 13:36:00 UTC

[jira] [Created] (DRILL-8096) format-excel reader: support different Shared String implementations

PJ Fanning created DRILL-8096:
---------------------------------

             Summary: format-excel reader: support different Shared String implementations
                 Key: DRILL-8096
                 URL: https://issues.apache.org/jira/browse/DRILL-8096
             Project: Apache Drill
          Issue Type: Improvement
          Components: Execution - Data Types
            Reporter: PJ Fanning


Handling the Shared Strings Table is one of the biggest consumers of memory and processing time when reading Excel files.

excel-streaming-reader v3.3.0 supports three Shared Strings Table implementations.

I suggest that Drill use the ReadOnlySharedStringTable as the default.

Drill currently uses the full-featured Apache POI SharedStringTable by default, which requires more memory and parsing effort.

There is also a TempFileSharedStringTable, which uses a temp file to keep the data out of heap memory. This is still pretty fast because it is implemented using an H2 database MVMap.

If letting users configure which implementation they want sounds useful, I can do a PR.
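
As a rough sketch of what such a setting might look like: the enum below maps a user-supplied config value to one of the three implementations and falls back to the read-only one when nothing is set. All names here (SharedStringsOption, parse, the constant names, the accepted spellings) are hypothetical and only for illustration. They are not existing Drill or excel-streaming-reader API, and the real option would need to be wired into the format plugin config.

```java
import java.util.Locale;

// Hypothetical option type for illustration only; these names are NOT part
// of Drill or excel-streaming-reader.
enum SharedStringsOption {
    READ_ONLY,   // proposed default: ReadOnlySharedStringTable
    FULL_POI,    // full-featured Apache POI SharedStringTable (current default)
    TEMP_FILE;   // TempFileSharedStringTable: data kept out of heap memory

    // Parses a user-supplied config value. A missing or blank value falls
    // back to the proposed default; an unrecognized value is rejected.
    static SharedStringsOption parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return READ_ONLY;
        }
        // Accept spellings such as "temp-file" or "Temp_File".
        String normalized = value.trim()
                .toUpperCase(Locale.ROOT)
                .replace('-', '_');
        try {
            return valueOf(normalized);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                    "Unknown shared strings implementation: " + value, e);
        }
    }
}
```

Rejecting unknown values, rather than silently using the default, is one possible design choice: a typo in the config would otherwise go unnoticed and the user would get different memory behaviour than they asked for.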

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)