Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2011/05/23 18:29:47 UTC

[jira] [Resolved] (TIKA-665) NullPointerException from com.sun.org.apache.xml.internal.serializer.ToStream.writeAttrString on some excel files from the CLI

     [ https://issues.apache.org/jira/browse/TIKA-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-665.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0
         Assignee: Jukka Zitting

Fixed in revision 1126568.

The getAddress() call on a HyperlinkRecord was returning null for one of the links within the spreadsheet, so I simply added a null check for that. Not sure if this is something that can/should be fixed in POI or if it's OK for the return value to be null.
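
For illustration, here is a minimal sketch of that kind of guard. It is not the code committed in revision 1126568: the class NullSafeLinkSketch, the method renderLinkedText and the RecordingHandler helper are hypothetical names, and it uses only the JDK's SAX classes so that it runs without Tika or POI. The idea is that a null link address is never passed on as an attribute value, since that appears to be what the serializer in the stack trace below fails on.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class NullSafeLinkSketch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /**
     * Emits the text wrapped in an <a href="..."> element only when an
     * address is available. A null address (assumed here to mean "no target
     * recorded for this link") falls back to plain text, so no attribute
     * with a null value ever reaches the downstream handler.
     */
    public static void renderLinkedText(ContentHandler handler, String address, String text)
            throws SAXException {
        char[] chars = text.toCharArray();
        if (address == null) {
            handler.characters(chars, 0, chars.length);
            return;
        }
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "href", "href", "CDATA", address);
        handler.startElement(XHTML, "a", "a", attrs);
        handler.characters(chars, 0, chars.length);
        handler.endElement(XHTML, "a", "a");
    }

    /** Hypothetical helper that records SAX events as strings for inspection. */
    public static class RecordingHandler extends DefaultHandler {
        public final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("start:" + qName + "[href=" + atts.getValue("href") + "]");
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end:" + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            events.add("text:" + new String(ch, start, length));
        }
    }

    public static void main(String[] args) throws SAXException {
        RecordingHandler handler = new RecordingHandler();
        renderLinkedText(handler, "http://example.com/", "with target");
        renderLinkedText(handler, null, "without target");
        System.out.println(handler.events);
    }
}
```

Whether the guard belongs at the point where the address is read or at the point where the link is rendered is a design choice. This sketch puts it at render time so that the cell text is still emitted when the address is missing.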

Note that there seems to be some extra debug output coming to System.out from within POI when I parse this file. It would be nice if that could be avoided.



> NullPointerException from com.sun.org.apache.xml.internal.serializer.ToStream.writeAttrString on some excel files from the CLI
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-665
>                 URL: https://issues.apache.org/jira/browse/TIKA-665
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Nick Burch
>            Assignee: Jukka Zitting
>             Fix For: 1.0
>
>         Attachments: hyperlink_excel2001.xls
>
>
> I've discovered that a small number of Excel files (and possibly others, though I haven't noticed any) will cause com.sun.org.apache.xml.internal.serializer.ToStream.writeAttrString to blow up with an NPE. The text being passed through from the Excel parser looks fine though.
> The full stacktrace when run from the CLI is:
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@bf7916
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:126)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:340)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:97)
> Caused by: java.lang.NullPointerException
> 	at com.sun.org.apache.xml.internal.serializer.ToStream.writeAttrString(ToStream.java:1966)
> 	at com.sun.org.apache.xml.internal.serializer.ToStream.processAttributes(ToStream.java:1946)
> 	at com.sun.org.apache.xml.internal.serializer.ToStream.closeStartTag(ToStream.java:2429)
> 	at com.sun.org.apache.xml.internal.serializer.ToStream.characters(ToStream.java:1381)
> 	at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.characters(TransformerHandlerImpl.java:172)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:167)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
> 	at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:39)
> 	at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:61)
> 	at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:113)
> 	at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:151)
> 	at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:261)
> 	at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:287)
> 	at org.apache.tika.parser.microsoft.TextCell.render(TextCell.java:35)
> 	at org.apache.tika.parser.microsoft.CellDecorator.render(CellDecorator.java:34)
> 	at org.apache.tika.parser.microsoft.LinkedCell.render(LinkedCell.java:36)
> 	at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processExtraText(ExcelExtractor.java:423)
> 	at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processSheet(ExcelExtractor.java:522)
> 	at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord(ExcelExtractor.java:346)
> 	at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord(ExcelExtractor.java:297)
> 	at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord(FormatTrackingHSSFListener.java:82)
> 	at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord(HSSFRequest.java:112)
> 	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:147)
> 	at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:106)
> 	at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:276)
> 	at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:136)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:206)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	... 5 more
> Looking at the Excel parser code, it seems that we're not doing anything wrong, so I think the issue is with the SAX serialization code used by the CLI.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira