Posted to user@tika.apache.org by Kevin Krouse <ke...@labkey.com> on 2011/12/02 20:34:10 UTC
ignore mac hidden binary files?
Hello Tikas,
We are getting XML parse exceptions when Tika tries to index Mac hidden
metadata files that start with a "._" prefix. I don't know much about
these hidden files, but they are binary files and won't
parse as XML.
Should we be filtering these out before Tika tries to process them or
is it a bug in the AutoDetectParser?
org.labkey.search.model.LuceneSearchServiceImpl$PreProcessingException: /Users/kevink/data/._somefile.xml
    at org.labkey.search.model.LuceneSearchServiceImpl.logAsPreProcessingException(LuceneSearchServiceImpl.java:701)
    at org.labkey.search.model.LuceneSearchServiceImpl.preprocess(LuceneSearchServiceImpl.java:499)
    at org.labkey.search.model.AbstractSearchService.preprocess(AbstractSearchService.java:883)
    at org.labkey.search.model.AbstractSearchService.getPreprocessedItem(AbstractSearchService.java:967)
    at org.labkey.search.model.AbstractSearchService$7.run(AbstractSearchService.java:1003)
    at java.lang.Thread.run(Thread.java:680)
org.apache.tika.exception.TikaException: XML parse error
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:71)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:145)
    at org.labkey.search.model.LuceneSearchServiceImpl.parse(LuceneSearchServiceImpl.java:575)
    at org.labkey.search.model.LuceneSearchServiceImpl.preprocess(LuceneSearchServiceImpl.java:339)
    ... 4 more
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:196)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:175)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:394)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:322)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:281)
    at org.apache.xerces.impl.XMLScanner.reportFatalError(XMLScanner.java:1459)
    at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(XMLDocumentScannerImpl.java:870)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:324)
    at org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:845)
    at org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:768)
    at org.apache.xerces.parsers.XMLParser.parse(XMLParser.java:108)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1196)
    at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:555)
    at org.apache.xerces.jaxp.SAXParserImpl.parse(SAXParserImpl.java:289)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:65)
    ... 10 more
Kevin
Re: ignore mac hidden binary files?
Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 8 Dec 2011, Kevin Krouse wrote:
> Anyone?
I'd suggest you ignore them for now.
The issue is that they have the same name as the real file (plus a "._"
prefix), so the extension suggests a type that the file actually isn't.
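As a rough illustration of ignoring these files before they reach Tika, the sketch below skips any path whose file name starts with "._". The class and method names (AppleDoubleNameFilter, isAppleDoubleName) are hypothetical and not part of Tika or of the LabKey code in the stack trace; it assumes that the name prefix alone is a good enough signal for the indexer.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: decides by file name only, without opening the file.
class AppleDoubleNameFilter {

    /** Returns true when the last path element starts with the "._" prefix. */
    static boolean isAppleDoubleName(Path path) {
        Path name = path.getFileName();
        return name != null && name.toString().startsWith("._");
    }

    public static void main(String[] args) {
        Path hidden = Paths.get("/Users/kevink/data/._somefile.xml");
        Path real = Paths.get("/Users/kevink/data/somefile.xml");
        System.out.println(isAppleDoubleName(hidden)); // true
        System.out.println(isAppleDoubleName(real));   // false
    }
}
```

A caller would run this check before handing the file to the parser and simply skip the file when it returns true.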
We should probably add mime magic to detect them, if anyone knows which
bits of the header are stable? Looking at a few files I have to hand,
they all seem to start with
00000000  00 05 16 07 00 02 00 00  4d 61 63 20 4f 53 20 58  |........Mac OS X|
00000010  20 20 20 20 20 20 20 20  00 02 00 00 00 09 00 00  |        ........|
Nick
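To make the magic-byte idea concrete, here is a minimal sketch of a content-based check that could complement a file name test. It assumes that only the first four bytes of the dump above (00 05 16 07) are stable, which is the open question in the message; the class and method names are hypothetical and this is not how Tika implements detection.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: compares the start of a file with the bytes from the dump.
class AppleDoubleMagic {

    // Assumption: the first four bytes shown in the hex dump are the stable part.
    private static final byte[] MAGIC = {0x00, 0x05, 0x16, 0x07};

    /** Returns true when the given header bytes begin with MAGIC. */
    static boolean startsWithMagic(byte[] header) {
        return header.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(header, MAGIC.length), MAGIC);
    }

    /** Reads up to four bytes from the stream and compares them with MAGIC. */
    static boolean looksLikeAppleDouble(InputStream in) throws IOException {
        byte[] header = new byte[MAGIC.length];
        int read = 0;
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) {
                break; // stream ended before four bytes were available
            }
            read += n;
        }
        return read == header.length && startsWithMagic(header);
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = {0x00, 0x05, 0x16, 0x07, 0x00, 0x02, 0x00, 0x00};
        byte[] xml = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeAppleDouble(new ByteArrayInputStream(sample))); // true
        System.out.println(looksLikeAppleDouble(new ByteArrayInputStream(xml)));    // false
    }
}
```

Note that looksLikeAppleDouble consumes the bytes it reads, so a caller that goes on to parse the same stream would need to reopen it or use mark/reset where the stream supports it.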
Re: ignore mac hidden binary files?
Posted by Kevin Krouse <ke...@labkey.com>.
Anyone?
Kevin