Posted to user@tika.apache.org by Florent André <fl...@4sengines.com> on 2010/01/25 15:50:07 UTC

Remove headers from the parser

Hello, 

I use the AutoDetectParser.parse(java.io.InputStream stream,
org.xml.sax.ContentHandler handler, Metadata metadata). 

I use the parse function many times with the same ContentHandler. 

My problem is:
- on each parse, Tika sends the XML declaration
(<?xml version="1.0" encoding="UTF-8"?>) to the ContentHandler.

This is a problem for me because the repeated declaration prevents me from
processing the ContentHandler output with a SAX component (a Cocoon transformer).

For example, after using Tika, my output is:
<root>
<documentparse id="1" <?xml version="1.0" encoding="UTF-8"?>>
<html>
... content from tika
</html>
<documentparse id="2" <?xml version="1.0" encoding="UTF-8"?>>
<html>
... content from tika
</html>
</documentparse>

Is there a way to disable sending the XML declaration?

Thanks in advance,
++

Re: Remove headers from the parser

Posted by Florent André <fl...@4sengines.com>.
Thanks, it works like a charm.
HAND

On Mon, 25 Jan 2010 20:40:59 +0100, Jukka Zitting <ju...@gmail.com>
wrote:
> Hi,
> 
> On Mon, Jan 25, 2010 at 3:50 PM, Florent André
> <fl...@4sengines.com> wrote:
>> I use the parse function many times with the same ContentHandler.
>> [...]
>> Is there a way to disable sending the XML declaration?
> 
> Check out the EmbeddedContentHandler [1] wrapper that's designed for
> this purpose.
> 
> [1] http://lucene.apache.org/tika/0.5/api/org/apache/tika/sax/EmbeddedContentHandler.html
> 
> BR,
> 
> Jukka Zitting

Re: Remove headers from the parser

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Jan 25, 2010 at 3:50 PM, Florent André
<fl...@4sengines.com> wrote:
> I use the parse function many times with the same ContentHandler.
> [...]
> Is there a way to disable sending the XML declaration?

Check out the EmbeddedContentHandler [1] wrapper that's designed for
this purpose.

[1] http://lucene.apache.org/tika/0.5/api/org/apache/tika/sax/EmbeddedContentHandler.html

BR,

Jukka Zitting
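
For readers of the archive, here is a rough sketch of the idea behind the
suggested wrapper: a decorator that forwards every SAX event except
startDocument() and endDocument(), so the outer handler sees only one
document. It uses only JDK classes so it can run without Tika. The class
names EmbeddedHandlerSketch and NoDocumentEvents and the fakeParse() helper
are made up for illustration; fakeParse() only stands in for a real
parser.parse(...) call. With Tika itself, the equivalent would presumably be
to pass new EmbeddedContentHandler(handler) to parse() instead of the
handler directly.

```java
import java.io.StringWriter;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

public class EmbeddedHandlerSketch {

    /** Hypothetical decorator: forwards all SAX events except the document start/end. */
    static class NoDocumentEvents extends XMLFilterImpl {
        NoDocumentEvents(ContentHandler delegate) {
            setContentHandler(delegate);
        }
        @Override public void startDocument() { /* swallowed on purpose */ }
        @Override public void endDocument() { /* swallowed on purpose */ }
    }

    /** Stand-in for one parser.parse(...) call: emits a complete small document. */
    static void fakeParse(ContentHandler handler, String text) throws SAXException {
        handler.startDocument();
        handler.startElement("", "html", "html", new AttributesImpl());
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement("", "html", "html");
        handler.endDocument();
    }

    /** Serializes two "parsed" documents under a single root element. */
    static String run() throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler out = factory.newTransformerHandler();
        StringWriter writer = new StringWriter();
        out.setResult(new StreamResult(writer));

        // The outer document is started and ended exactly once, by us.
        out.startDocument();
        out.startElement("", "root", "root", new AttributesImpl());

        ContentHandler embedded = new NoDocumentEvents(out);
        for (int id = 1; id <= 2; id++) {
            AttributesImpl atts = new AttributesImpl();
            atts.addAttribute("", "id", "id", "CDATA", String.valueOf(id));
            out.startElement("", "documentparse", "documentparse", atts);
            fakeParse(embedded, "content " + id); // document events are filtered out
            out.endElement("", "documentparse", "documentparse");
        }

        out.endElement("", "root", "root");
        out.endDocument();
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```

The outer handler is passed unwrapped only for the root and documentparse
elements; each per-document parse goes through the wrapper, so the repeated
document events never reach the serializer.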