Posted to user@tika.apache.org by Kevin Krouse <ke...@labkey.com> on 2011/12/02 20:34:10 UTC

ignore mac hidden binary files?

Hello Tikas,
We are getting XML parse exceptions when Tika tries to index Mac hidden
metadata files that start with a "._" prefix.  I don't know much about
these hidden files, but they are binary files and won't
parse as XML.
Should we be filtering these out before Tika tries to process them or
is it a bug in the AutoDetectParser?

org.labkey.search.model.LuceneSearchServiceImpl$PreProcessingException:/Users/kevink/data/._somefile.xml
   at org.labkey.search.model.LuceneSearchServiceImpl.logAsPreProcessingException(LuceneSearchServiceImpl.java:701)
   at org.labkey.search.model.LuceneSearchServiceImpl.preprocess(LuceneSearchServiceImpl.java:499)
   at org.labkey.search.model.AbstractSearchService.preprocess(AbstractSearchService.java:883)
   at org.labkey.search.model.AbstractSearchService.getPreprocessedItem(AbstractSearchService.java:967)
   at org.labkey.search.model.AbstractSearchService$7.run(AbstractSearchService.java:1003)
   at java.lang.Thread.run(Thread.java:680)
org.apache.tika.exception.TikaException: XML parse error
   at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:71)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:145)
   at org.labkey.search.model.LuceneSearchServiceImpl.parse(LuceneSearchServiceImpl.java:575)
   at org.labkey.search.model.LuceneSearchServiceImpl.preprocess(LuceneSearchServiceImpl.java:339)
   ... 4 more
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
   at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:196)
   at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:175)
   at org.apache.xerces.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:394)
   at org.apache.xerces.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:322)
   at org.apache.xerces.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:281)
   at org.apache.xerces.impl.XMLScanner.reportFatalError(XMLScanner.java:1459)
   at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(XMLDocumentScannerImpl.java:870)
   at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:324)
   at org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:845)
   at org.apache.xerces.parsers.XML11Configuration.parse(XML11Configuration.java:768)
   at org.apache.xerces.parsers.XMLParser.parse(XMLParser.java:108)
   at org.apache.xerces.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1196)
   at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:555)
   at org.apache.xerces.jaxp.SAXParserImpl.parse(SAXParserImpl.java:289)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
   at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:65)
   ... 10 more

Kevin

Re: ignore mac hidden binary files?

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 8 Dec 2011, Kevin Krouse wrote:
> Anyone?

I'd suggest you ignore them for now.

The issue is that they have the same name as the real file (plus the "._"
prefix), so the extension suggests a different type from what the file
actually contains.
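
As a rough illustration of the "ignore them" approach, the caller could
skip anything whose name starts with "._" before it is ever handed to
Tika. This is only a sketch under assumptions: the class and method names
below (MacHiddenFileFilter, isMacHiddenName) are hypothetical and are not
part of Tika or of the LabKey code in the stack trace.

```java
import java.io.File;
import java.io.FileFilter;

// Hypothetical filter: rejects files whose name starts with "._".
final class MacHiddenFileFilter implements FileFilter {

    // True when the bare file name carries the "._" prefix.
    static boolean isMacHiddenName(String name) {
        return name != null && name.startsWith("._");
    }

    // Accept everything except the "._" companion files.
    @Override
    public boolean accept(File file) {
        return !isMacHiddenName(file.getName());
    }
}
```

A crawler could pass an instance to File.listFiles(FileFilter) so the
"._" files never reach the parser. Note this filters on the name only,
so an ordinary file that happens to be named "._something" would be
skipped too.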

We should probably add mime magic to detect them. Does anyone know which
bits of the header are stable? Looking at a few files I have to hand,
they all seem to start with:

00000000  00 05 16 07 00 02 00 00  4d 61 63 20 4f 53 20 58  |........Mac OS X|
00000010  20 20 20 20 20 20 20 20  00 02 00 00 00 09 00 00  |        ........|
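
Until such magic exists in Tika itself, a caller could do a similar check
by hand. The sketch below compares only the first four bytes from the
dump above (00 05 16 07). Whether those four bytes are enough, and which
of the later bytes are stable, is exactly the open question, so treat the
choice of prefix as an assumption. The class and method names
(MacHiddenMagic, hasMagic) are hypothetical.

```java
// Hypothetical helper: checks a buffer for the leading bytes seen in the dump.
final class MacHiddenMagic {

    // Assumption: only the first four bytes of the dump are used as the signature.
    private static final byte[] MAGIC = {0x00, 0x05, 0x16, 0x07};

    // True when 'head' is at least four bytes long and starts with MAGIC.
    static boolean hasMagic(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The caller would read the first few bytes of the stream (using
mark/reset so the parser still sees the whole file) and skip the file
when hasMagic returns true. Unlike a name check, this looks at the
content, so it does not depend on the "._" prefix being present.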


Nick


Re: ignore mac hidden binary files?

Posted by Kevin Krouse <ke...@labkey.com>.
Anyone?

Kevin

