Posted to dev@poi.apache.org by Jörn Franke <jo...@gmail.com> on 2018/10/19 19:57:06 UTC

Re: Using StAX parser for .xlsx files

Thanks a lot for all the comments. I have updated the HadoopOffice library to
also include StAX parsing. At a later stage I will also add reading of comments.

On Fri, May 18, 2018 at 11:37 PM Jörn Franke <jo...@gmail.com> wrote:

> Hi,
>
> I wonder if someone has already successfully used the StAX parser with
> .xlsx files (i.e. a pull model instead of an event-driven push model, cf.
> https://docs.oracle.com/javase/tutorial/jaxp/stax/why.html).
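>
> For illustration, here is a minimal, self-contained sketch of what I mean by a
> pull model: plain javax.xml.stream, no POI involved yet, and the XML literal is
> just a made-up fragment. The caller advances the parser with next() instead of
> receiving callbacks:
>
> import java.io.ByteArrayInputStream;
> import java.io.InputStream;
> import java.nio.charset.StandardCharsets;
> import javax.xml.stream.XMLInputFactory;
> import javax.xml.stream.XMLStreamConstants;
> import javax.xml.stream.XMLStreamReader;
>
> public class PullExample {
>     public static void main(String[] args) throws Exception {
>         InputStream in = new ByteArrayInputStream(
>             "<row><c t=\"n\"><v>42</v></c></row>".getBytes(StandardCharsets.UTF_8));
>         XMLStreamReader xmlr = XMLInputFactory.newInstance().createXMLStreamReader(in);
>         while (xmlr.hasNext()) {                      // the caller pulls...
>             int event = xmlr.next();                  // ...one event at a time
>             if (event == XMLStreamConstants.START_ELEMENT) {
>                 System.out.println("start: " + xmlr.getLocalName());
>             } else if (event == XMLStreamConstants.CHARACTERS && !xmlr.isWhiteSpace()) {
>                 System.out.println("text: " + xmlr.getText());
>             }
>         }
>         xmlr.close();
>     }
> }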
>
> The reason I ask is that on Big Data platforms (for which we implemented
> the HadoopOffice library powered by Apache POI, cf.
> https://github.com/ZuInnoTe/hadoopoffice) the event-driven model causes a
> lot of memory overhead: virtually all Big Data platforms implement a pull
> model, so if I use the push model provided by the event API I need to load
> the full content into memory to make it available as a pull model, since
> those platforms are not event driven.
>
> I found the StaxHelper class, but I have little idea how it can be used
> within Apache POI:
> https://poi.apache.org/apidocs/org/apache/poi/util/StaxHelper.html
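>
> If I read the javadoc correctly, StaxHelper provides securely configured StAX
> factories, so presumably the plain factory call in the sketch further below
> could be replaced with something like the following (assuming a static
> newXMLInputFactory() method; to be verified against the javadoc, and
> sheetInputStream is just a placeholder for one of the streams returned by
> XSSFReader#getSheetsData()):
>
> // hedged sketch: StaxHelper as a drop-in replacement for XMLInputFactory.newInstance()
> XMLInputFactory xmlif = org.apache.poi.util.StaxHelper.newXMLInputFactory();
> XMLStreamReader xmlr = xmlif.createXMLStreamReader(sheetInputStream);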
>
>
> The main goal is to have a lightweight approach, as offered by the
> current POI event push model for reading .xlsx files (
> https://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api), but in the
> form of a pull model, e.g. as illustrated in the following pseudo code:
>
> // imports needed for this sketch
> import java.io.InputStream;
> import javax.xml.stream.XMLInputFactory;
> import javax.xml.stream.XMLStreamReader;
> import org.apache.poi.openxml4j.opc.OPCPackage;
> import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
> import org.apache.poi.xssf.eventusermodel.XSSFReader;
>
> OPCPackage pkg; // opened elsewhere
> XSSFReader r;
> ReadOnlySharedStringsTable strings;
> XSSFReader.SheetIterator iter;
> XMLInputFactory xmlif;
> XMLStreamReader xmlr;
>
> /* called once at the start of processing */
> public void init() throws Exception {
>     r = new XSSFReader(pkg);
>     strings = new ReadOnlySharedStringsTable(pkg);
>     // iterator over the sheet input streams
>     iter = (XSSFReader.SheetIterator) r.getSheetsData();
>     // XML factory to create the StAX parser
>     xmlif = XMLInputFactory.newInstance();
> }
>
> /**
>  * Called by the Big Data platform (Flink, Spark, Hadoop, Hive etc.) to get the
>  * next row without reading the full file into memory as in the DOM or SAX
>  * (push) model. Returns null if no further record exists, otherwise returns
>  * the current cell content as a String.
>  */
> public String getNextRow() throws Exception {
>     // check if we have data to read in the current sheet
>     if ((xmlr == null) || !xmlr.hasNext()) {
>         if (!iter.hasNext()) {
>             return null;
>         }
>         // pull the next sheet into a StAX parser
>         xmlr = xmlif.createXMLStreamReader(iter.next());
>     }
>     // read the data from StAX
>     xmlr.next();
>     return xmlr.getText(); // just an example; in fact I need to check for a "c"
>     // tag, check its type, look it up in the shared strings table if it is a
>     // string, and otherwise convert the numeric value to a date or a number
> }
>
> Any pitfalls with this approach? I assume I need to link the shared strings
> table somehow to the cell tag (I can probably derive how to do this from the
> source code of the event API; a rough sketch follows below).
> Formula evaluation is not that important here (for formula evaluation the
> HadoopOffice library offers loading the full file into memory).
> Of course the full shared strings table still needs to be loaded into memory,
> but I expect the data to have a very small shared strings table and/or to use
> mostly numeric values.
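>
> To make the "link the shared strings table to the cell tag" part concrete, here
> is a rough sketch of how I imagine resolving a single "c" element pulled from
> the stream. The element and attribute names ("c", "v", t="s") come from the
> SpreadsheetML schema; the exact lookup method on ReadOnlySharedStringsTable
> (getItemAt vs. getEntryAt) depends on the POI version, and readCellValue itself
> is just a name I made up:
>
> // imports assumed: javax.xml.stream.*, org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable
> // xmlr must be positioned on the START_ELEMENT of a "c" (cell) element
> private String readCellValue(XMLStreamReader xmlr, ReadOnlySharedStringsTable strings)
>         throws XMLStreamException {
>     String cellType = xmlr.getAttributeValue(null, "t"); // "s" = shared string, otherwise numeric/other
>     String rawValue = null;
>     while (xmlr.hasNext()) {
>         int event = xmlr.next();
>         if (event == XMLStreamConstants.START_ELEMENT && "v".equals(xmlr.getLocalName())) {
>             rawValue = xmlr.getElementText(); // text content of the <v> (value) element
>         } else if (event == XMLStreamConstants.END_ELEMENT && "c".equals(xmlr.getLocalName())) {
>             break; // end of this cell
>         }
>     }
>     if (rawValue == null) {
>         return null; // empty cell
>     }
>     if ("s".equals(cellType)) {
>         int idx = Integer.parseInt(rawValue);
>         return strings.getItemAt(idx).getString(); // or getEntryAt(idx) on older POI versions
>     }
>     return rawValue; // numeric/date values would still need handling via the styles table
> }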
>
> Has anyone used the StAX API together with .xlsx files with POI?
>
>
> Thank you.
>
> best regards
>