You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@poi.apache.org by Ole Kozaczenko <sh...@gmail.com> on 2015/09/14 13:29:37 UTC

Apache POI Streaming API doesn't recognize Excel (xlsx) content

I have a class which ingests .xlsx-files. I took it from this example and
modified it for my needs:
https://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/xssf/eventusermodel/XLSX2CSV.java


Now the application processes some files just fine, others not at all. If I
change one single field or even character in one of the not working files
and save them again, the whole content is processed correctly. Does anyone
have an idea what might be the reason for (imho it lies somewhere within
the original excel files).

To whom it may help, here is my code:


package com.goodgamestudios.icosphere.service.fileReader;

    import com.goodgamestudios.icosphere.datamodel.DataSet;
    import com.goodgamestudios.icosphere.datamodel.Tuple;
    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
    import org.apache.poi.openxml4j.opc.OPCPackage;
    import org.apache.poi.ss.usermodel.BuiltinFormats;
    import org.apache.poi.ss.usermodel.DataFormatter;
    import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
    import org.apache.poi.xssf.eventusermodel.XSSFReader;
    import org.apache.poi.xssf.model.SharedStringsTable;
    import org.apache.poi.xssf.model.StylesTable;
    import org.apache.poi.xssf.usermodel.XSSFCellStyle;
    import org.apache.poi.xssf.usermodel.XSSFRichTextString;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
     import org.xml.sax.helpers.DefaultHandler;
    import org.xml.sax.helpers.XMLReaderFactory;


    public class ExcelFileReader implements FileReader {

    static final Logger LOG = LoggerFactory.getLogger(ExcelFileReader.class);
    private SheetHandler handler;

    @Override
    public DataSet getDataFromFile(File file) throws IOException {

        LOG.info("Start ingesting file {}");
        try {
            OPCPackage pkg = OPCPackage.open(file);
            XSSFReader reader = new XSSFReader(pkg);
            StylesTable styles = reader.getStylesTable();
            ReadOnlySharedStringsTable strings = new
ReadOnlySharedStringsTable(pkg);

            SharedStringsTable sst = reader.getSharedStringsTable();
            XMLReader parser =
XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
            handler = new SheetHandler(styles, strings, 24);
            parser.setContentHandler(handler);

            // rId2 found by processing the Workbook
            // Seems to either be rId# or rSheet#
            System.out.println("yooooo 1");
            InputStream sheet2 = reader.getSheet("rId2");
            System.out.println("yooooo 2");
            InputSource sheetSource = new InputSource(sheet2);
            System.out.println("yooooo 3");
            parser.parse(sheetSource);
            LOG.debug("{} rows parsed", handler.getData().getRows().size() + 1);
            sheet2.close();
            return handler.getData();

        } catch (OpenXML4JException | SAXException ex) {
            LOG.warn("Unable to parse file {}", file.getName());
            LOG.warn("Exception: {}: ", ex);
        }

        return null;
    }

    /**
     * See org.xml.sax.helpers.DefaultHandler javadocs
     *
     * Derived from http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api
     * <p/>
     * Also see Standard ECMA-376, 1st edition, part 4, pages 1928ff, at
     * http://www.ecma-international.org/publications/standards/Ecma-376.htm
     * <p/>
     * A web-friendly version is http://openiso.org/Ecma/376/Part4
     */
    private static class SheetHandler extends DefaultHandler {

        boolean isFirstRow = true;
        private int quantityOfColumns;
        private int currentColumnNumber = 1;
        int currentRowNumber = 1;
        private int rowNumberOfLastCell = 1;
        private DataSet data = new DataSet();
        private Tuple tuple;

        /**
         * Table with styles
         */
        private StylesTable stylesTable;

        /**
         * Table with unique strings
         */
        private ReadOnlySharedStringsTable sharedStringsTable;

        /**
         * Number of columns to read starting with leftmost
         */
        private final int minColumnCount;

        // Set when V start element is seen
        private boolean vIsOpen;

        // Set when cell start element is seen;
        // used when cell close element is seen.
        private xssfDataType nextDataType;

        // Used to format numeric cell values.
        private short formatIndex;
        private String formatString;
        private final DataFormatter formatter;

        // The last column printed to the output stream
        private int lastColumnNumber = -1;

        // Gathers characters as they are seen.
        private StringBuffer value;

        static final Logger LOG = LoggerFactory.getLogger(SheetHandler.class);

        private SheetHandler(StylesTable styles,
                ReadOnlySharedStringsTable strings,
                int cols) {
            this.stylesTable = styles;
            this.sharedStringsTable = strings;
            this.minColumnCount = cols;
            this.value = new StringBuffer();
            this.nextDataType = xssfDataType.NUMBER;
            this.formatter = new DataFormatter();
            LOG.debug("Sheethandler created");
        }

        /*
         * (non-Javadoc)
         * @see
org.xml.sax.helpers.DefaultHandler#startElement(java.lang.String,
java.lang.String, java.lang.String, org.xml.sax.Attributes)
         */
        public void startElement(String uri, String localName, String name,
                Attributes attributes) throws SAXException {
            System.out.println("yooooooooooo start:uri:" + uri + "
localname: " + localName + " name: " + name);
            if ("inlineStr".equals(name) || "v".equals(name)) {
                vIsOpen = true;
                // Clear contents cache
                value.setLength(0);
            } // c => cell
            else if ("c".equals(name)) {
                // Get the cell reference
                String r = attributes.getValue("r");
                int firstDigit = -1;
                for (int c = 0; c < r.length(); ++c) {
                    if (Character.isDigit(r.charAt(c))) {
                        firstDigit = c;
                        break;
                    }
                }
                currentColumnNumber = nameToColumn(r.substring(0, firstDigit));
                System.out.println("colu mn " + currentColumnNumber);

                // Set up defaults.
                this.nextDataType = xssfDataType.NUMBER;
                this.formatIndex = -1;
                this.formatString = null;
                String cellType = attributes.getValue("t");
                String cellStyleStr = attributes.getValue("s");
                if ("b".equals(cellType)) {
                    nextDataType = xssfDataType.BOOL;
                } else if ("e".equals(cellType)) {
                    nextDataType = xssfDataType.ERROR;
                } else if ("inlineStr".equals(cellType)) {
                    nextDataType = xssfDataType.INLINESTR;
                } else if ("s".equals(cellType)) {
                    nextDataType = xssfDataType.SSTINDEX;
                } else if ("str".equals(cellType)) {
                    nextDataType = xssfDataType.FORMULA;
                } else if (cellStyleStr != null) {
                    // It's a number, but almost certainly one
                    //  with a special style or format
                    XSSFCellStyle style = null;
                    if (cellStyleStr != null) {
                        int styleIndex = Integer.parseInt(cellStyleStr);
                        style = stylesTable.getStyleAt(styleIndex);
                    } else if (stylesTable.getNumCellStyles() > 0) {
                        style = stylesTable.getStyleAt(0);
                    }
                    if (style != null) {
                        this.formatIndex = style.getDataFormat();
                        this.formatString = style.getDataFormatString();
                        if (this.formatString == null) {
                            this.formatString =
BuiltinFormats.getBuiltinFormat(this.formatIndex);
                        }
                    }
                }
            }

        }

        /*
         * (non-Javadoc)
         * @see
org.xml.sax.helpers.DefaultHandler#endElement(java.lang.String,
java.lang.String, java.lang.String)
         */
        public void endElement(String uri, String localName, String name)
                throws SAXException {

            String thisStr = null;

            // v => contents of a cell
            if ("v".equals(name)) {
                // Process the value contents as required.
                // Do now, as characters() may be called more than once
                switch (nextDataType) {

                    case BOOL:
                        char first = value.charAt(0);
                        thisStr = first == '0' ? "FALSE" : "TRUE";
                        break;

                    case ERROR:
                        thisStr = "\"ERROR:" + value.toString() + '"';
                        break;

                    case FORMULA:
                        // A formula could result in a string value,
                        // so always add double-quote characters.
                        thisStr = '"' + value.toString() + '"';
                        break;

                    case INLINESTR:
                        // TODO: have seen an example of this, so it's untested.
                        XSSFRichTextString rtsi = new
XSSFRichTextString(value.toString());
                        thisStr = '"' + rtsi.toString() + '"';
                        break;

                    case SSTINDEX:
                        String sstIndex = value.toString();
                        try {
                            int idx = Integer.parseInt(sstIndex);
                            XSSFRichTextString rtss = new
XSSFRichTextString(sharedStringsTable.getEntryAt(idx));
                            thisStr = rtss.toString();
                        } catch (NumberFormatException ex) {
                            System.out.println("Failed to parse SST
index '" + sstIndex + "': " + ex.toString());
                        }
                        break;

                    case NUMBER:
                        String n = value.toString();
                        if (this.formatString != null && n.length() > 0) {
                            thisStr =
formatter.formatRawCellContents(Double.parseDouble(n),
this.formatIndex, this.formatString);
                        } else {
                            thisStr = n;
                        }
                        break;

                    default:
                        thisStr = "(TODO: Unexpected type: " +
nextDataType + ")";
                        break;
                }

                // Output after we've seen the string contents
                // Emit commas for any fields that were missing on this row
                if (lastColumnNumber == -1) {
                    lastColumnNumber = 0;
                }
                for (int i = lastColumnNumber; i < currentColumnNumber; ++i) {
                }

                // Might be the empty string.
                System.out.println(thisStr);
                if (isFirstRow) {
                    data.getHeaders().add(thisStr);
                } else {
                    tuple.getRowEntries()[currentColumnNumber] = thisStr;
                }
                // Update column
                if (currentColumnNumber > -1) {
                    lastColumnNumber = currentColumnNumber;
                }

            } else if ("row".equals(name)) {

                // We're onto a new row
                System.out.println("nextrow");
                lastColumnNumber = -1;
                System.out.println("yoooooo tuple:" + tuple);
                if (isFirstRow) {
                    isFirstRow = false;
                    quantityOfColumns = data.getHeaders().size();
                    tuple = new Tuple(quantityOfColumns);

                } else if (!tuple.isEmpty()) {
                    data.addRow(tuple);
                    tuple = new Tuple(quantityOfColumns);
                }
            }

        }

        /**
         * Captures characters only if a suitable element is open. Originally
         * was just "v"; extended for inlineStr also.
         */
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            if (vIsOpen) {
                value.append(ch, start, length);
            }
        }

        /**
         * Converts an Excel column name like "C" to a zero-based index.
         *
         * @param name
         * @return Index corresponding to the specified name
         */
        private int nameToColumn(String name) {
            int column = -1;
            for (int i = 0; i < name.length(); ++i) {
                int c = name.charAt(i);
                column = (column + 1) * 26 + c - 'A';
            }
            return column;
        }

        public DataSet getData() {
            return data;
        }
    }

    /**
     * The type of the data value is indicated by an attribute on the cell. The
     * value is usually in a "v" element within the cell.
     */
    enum xssfDataType {

        BOOL,
        ERROR,
        FORMULA,
        INLINESTR,
        SSTINDEX,
        NUMBER,
    }}


MHO the application is for some reason not recognizing neither the
rows nor their contents. I let it print the parameters of the
startElement-Method, where it should go through rows, columns, cells
etc: file-upload.net/download-10897984/output.txt.html
<http://www.file-upload.net/download-10897984/output.txt.html>


Thx for help!

Re: Apache POI Streaming API doesn't recognize Excel (xlsx) content

Posted by Mark Beardsley <ma...@tiscali.co.uk>.
Thanks Jens, you beat me to it.

I took the example published on the POI website - ExampleEventUserModel -
and made a couple of changes to the startElement() and endElement() methods
there. As you have suggested Jens, I modified them to use the localName
parameter in addition to the name parameter like this;

public void startElement(String uri, String localName, String name,
    Attributes attributes) throws SAXException {
   // c => cell
   if(localName.equals("c") || name.equals("c")) {
    // Print the cell reference
    System.out.print(attributes.getValue("r") + " - ");
    // Figure out if the value is an index in the SST
    String cellType = attributes.getValue("t");
    if(cellType != null && cellType.equals("s")) {
     nextIsString = true;
    } else {
     nextIsString = false;
    }
   }
   // Clear contents cache
   lastContents = "";
  }
  
  public void endElement(String uri, String localName, String name)
    throws SAXException {
   // Process the last contents as required.
   // Do now, as characters() may be called more than once
   if(nextIsString) {
    int idx = Integer.parseInt(lastContents);
    lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
    nextIsString = false;
   }

   // v => contents of a cell
   // Output after we've seen the string contents
   if(localName.equals("v") || name.equals("v")) {
    System.out.println(lastContents);
   }
  }

and that seems to do the trick.

One other point worth mentioning here is that not all Excel files use the
shared strings table. Some place strings in line with the sheets' markup and
this can also trip up some content handlers. When I get the time, I will
take a look at this as well, that is once I can find an example Excel
workbook that uses inline strings, they seem to be fairly thin on the ground
so to speak.



--
View this message in context: http://apache-poi.1045710.n5.nabble.com/Apache-POI-Streaming-API-doesn-t-recognize-Excel-xlsx-content-tp5720197p5720232.html
Sent from the POI - User mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI Streaming API doesn't recognize Excel (xlsx) content

Posted by Jens Walter <je...@apimeister.com>.
I got it working in my environment, but I'm not sure what to do with that now.

@Mark since you are investigating too, I can give you an update on what I 
found.

The issue is indeed the DefaultHandler of the parser.
The first thing I changed was the "ReadOnlySharedStringsTable".
In here the startElement and endElement method are using the parameter "name".
After changing that to localName this class produced the necessary StringTable.

The second step was doing the same change on the "MyXSSFSheetHandler".
To get the localName populated, I had to change the XSLX2CSV.processSheet 
Method to work with the saxFactory namespace aware.

 SAXParserFactory saxFactory = SAXParserFactory.newInstance();
 saxFactory.setNamespaceAware(true);

Those changes seem to do the trick.


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI Streaming API doesn't recognize Excel (xlsx) content

Posted by Mark Beardsley <ma...@tiscali.co.uk>.
It looks as though the 'normal' api works fine with this file and that the
'problem' may well relate to the handler being used to parse the xml markup.
So far, I have only had the opportunity to play with the file/code but
should be able to take a closer look over the weekend. Will post if I do
find anything.



--
View this message in context: http://apache-poi.1045710.n5.nabble.com/Apache-POI-Streaming-API-doesn-t-recognize-Excel-xlsx-content-tp5720197p5720230.html
Sent from the POI - User mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI Streaming API doesn't recognize Excel (xlsx) content

Posted by Jens Walter <je...@apimeister.com>.
Mark Beardsley <markbrdsly <at> tiscali.co.uk> writes:

> 
> Hard to say much at all without looking at one of these files. Could you
> possibly post one of the files that is causing issues for you? With your
> saying that opening and then re-saving cures the problems, I suspect that
> the files may contain something that is not to spec as far as POI is
> concerned. Do you know how these files were created? There are report
> generators that create files Excel will accept but POI will not I believe,
> although I cannot name one for you at the moment. I am certain that others
> know far more about this than do I but we can make a start if you could post
> one of the problematical files.
> 
> --
> View this message in context: http://apache-poi.1045710.n5.nabble.com/Apache-POI-
Streaming-API-doesn-t-recognize-Excel-xlsx-content-tp5720197p5720213.html
> Sent from the POI - User mailing list archive at Nabble.com.
> 


I have the same issue here and can provide a sample xlsx-file.
I tried to parse Excel sheets provided by this site:
https://www.bondora.ee/en/invest/statistics/data_export
link to file:
https://www.bondora.ee/marketing/media/MonthlyOverview.xlsx

The problem I found seems to be related to the namespaces used within the xmls. 
If the workbook.xml only contains a default namespace und uses non-prefixed 
tags, the content is parseable. As soon as the xml does not use that shortcut, 
but uses prefixed elements, the implementation doesn't work no more.

I tried to fix the sample XLS2CSV (made the xml parser namespace-aware) but 
that didn't lead to a working version (although the parsing worked).
The next issue is the parsing of the sharedStrings.xml. This one is also not 
namespace-aware, so I still get empty values for all cells.

To me this seems like a bug in the poi library, since those Excel sheets are 
valid.

I will look further into the issue and report back if there is a chance of 
implementing this without changing too much code.

PS: I quick workaround is to load the excel sheet into LibreOffice and then 
save it again. LibreOffice than generates a xlsx-file with a default namespace.


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Apache POI Streaming API doesn't recognize Excel (xlsx) content

Posted by Mark Beardsley <ma...@tiscali.co.uk>.
Hard to say much at all without looking at one of these files. Could you
possibly post one of the files that is causing issues for you? With your
saying that opening and then re-saving cures the problems, I suspect that
the files may contain something that is not to spec as far as POI is
concerned. Do you know how these files were created? There are report
generators that create files Excel will accept but POI will not I believe,
although I cannot name one for you at the moment. I am certain that others
know far more about this than do I but we can make a start if you could post
one of the problematical files.



--
View this message in context: http://apache-poi.1045710.n5.nabble.com/Apache-POI-Streaming-API-doesn-t-recognize-Excel-xlsx-content-tp5720197p5720213.html
Sent from the POI - User mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org