Posted to dev@tika.apache.org by Jonathan Koren <jo...@soe.ucsc.edu> on 2009/02/26 06:31:49 UTC

ParsingReader and PackageParser

I have a tar file whose contents I want to index as separate
files.  To do this, I hooked up an AutoDetectParser to a
ParsingReader.  I'm using ParsingReader since the total uncompressed
contents of the tars can be quite large.

If I understand how AutoDetectParser works, it figures out that the
file is a tar, and thus fires off a TarParser, which is a type of
PackageParser.  The PackageParser reads the tar, and sends SAX events
to some Tika internal representation of the file.  Specifically, it
sends magic DIVs delimiting the contents of each file, which in
turn are parsed by another AutoDetectParser.  The complete sequence of
SAX events for the entire tar file is then passed from the outermost
AutoDetectParser to ParsingReader.

Here's where things go off the track.  The output stream of
ParsingReader is *plain text*, meaning that it is now impossible to
determine where one file within the tar ends and where the next file
begins.  Poking around within ParsingReader shows that the SAX events
are being passed through a BodyContentHandler, which, when constructed
with the default constructor, only writes out the characters of the
XML stream (i.e., it performs an XML-to-text conversion).
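
To make the loss of boundaries concrete, here is a small sketch that
uses only the JDK's org.xml.sax classes.  TextOnlyHandler is a
hypothetical stand-in for a handler that keeps character data and
drops element events; it is not Tika's BodyContentHandler, and the
plain "div" elements fed to it are only an assumed shape for the
per-entry markers.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in: keeps character data, ignores all element events.
class TextOnlyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Element boundaries are dropped here, which is where entry borders get lost.
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Likewise ignored.
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    // Feeds two "entries", each wrapped in its own div, and returns what survives.
    static String demo() {
        TextOnlyHandler handler = new TextOnlyHandler();
        for (String body : new String[] {"first file", "second file"}) {
            handler.startElement("", "div", "div", new AttributesImpl());
            handler.characters(body.toCharArray(), 0, body.length());
            handler.endElement("", "div", "div");
        }
        return handler.text();
    }
}
```

Calling TextOnlyHandler.demo() yields the two bodies run together with
nothing between them, so a consumer of the text cannot tell where the
first entry stops.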

It seems like there either needs to be a way for ParsingReader to  
associate a ContentHandler with its internal BodyContentHandler, or  
the default action for BodyContentHandler should be to send the XML  
directly, and not convert it to plain text.

Oh, and subclassing ParsingReader isn't an option without essentially  
reimplementing it since the problematic BodyContentHandler is  
instantiated within the private ParsingThread class.

Ideas?  Suggestions?

--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/



Re: ParsingReader and PackageParser

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Feb 27, 2009 at 1:06 AM, Jonathan Koren <jo...@soe.ucsc.edu> wrote:
> Actually, if ParsingReader had some sort of mode where it spat out the XML
> directly instead of (indirectly) using WriteOutContentHandler to convert
> everything to plain text, then one could use whatever XML parser, including
> an XML-to-text converter, on the read side.  As it is, it seems like
> ParsingReader is being just a little too smart.

The main purpose of ParsingReader is to be a Reader, i.e. to produce a
stream of characters.

If you want to see the SAX events, you can just use the Parser
interface directly:

    ContentHandler handler = new MyCustomContentHandler();
    new AutoDetectParser().parse(..., handler, ...);
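
One possible shape for such a handler, as a sketch only.
MyCustomContentHandler is not defined in this thread, so the
EntrySplittingHandler class below is hypothetical.  It assumes each tar
entry arrives wrapped in a div whose class attribute is "package-entry";
that value is an assumption and should be checked against the events the
parser actually emits.  It uses only the JDK's org.xml.sax classes.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of each entry div separately.
class EntrySplittingHandler extends DefaultHandler {
    // Assumption: the class value that marks a per-entry div.
    private static final String ENTRY_CLASS = "package-entry";

    private final List<String> entries = new ArrayList<>();
    private StringBuilder current; // null while outside an entry div
    private int depth;             // open elements nested inside the current entry

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Fall back to the qualified name when namespace processing is off.
        String name = localName.isEmpty() ? qName : localName;
        if (current != null) {
            depth++; // nested element inside an entry; remember to match its end tag
        } else if ("div".equals(name) && ENTRY_CLASS.equals(atts.getValue("class"))) {
            current = new StringBuilder();
            depth = 0;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return; // not inside an entry
        }
        if (depth == 0) {
            entries.add(current.toString()); // closing tag of the entry div itself
            current = null;
        } else {
            depth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    List<String> getEntries() {
        return entries;
    }
}
```

An instance could be passed as the handler argument in the parse call
above; getEntries() would then hold one string per entry.  Collecting
whole entries in memory is only for illustration; for large archives
each entry would more likely be handed off as soon as its closing div
arrives.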

BR,

Jukka Zitting

Re: ParsingReader and PackageParser

Posted by Jonathan Koren <jo...@soe.ucsc.edu>.
On Feb 26, 2009, at 5:52 AM, Jukka Zitting wrote:

> Hi,
>
> On Thu, Feb 26, 2009 at 6:31 AM, Jonathan Koren  
> <jo...@soe.ucsc.edu> wrote:
>> Ideas?  Suggestions?
>
> If you need special processing for tar files, then the best
> alternative is probably to use the TarInputStream class directly, and
> use the higher level Tika parsers only for parsing the individual tar
> entries.
>
> If you need such processing to be an integral part of Tika, then you
> can wrap your custom logic into a Parser class and modify your
> configuration to use that parser instead of the default TarParser for
> tar files.

I was originally thinking about some way of having ParsingReader set a
ContentHandler for its internal BodyContentHandler.  As it's set up
now, you can't get SAX events at all with a ParsingReader.
Unfortunately, there doesn't really seem to be a clean or general way
to do that.

Actually, if ParsingReader had some sort of mode where it spat out the
XML directly instead of (indirectly) using WriteOutContentHandler to
convert everything to plain text, then one could use whatever XML
parser, including an XML-to-text converter, on the read side.  As it
is, it seems like ParsingReader is being just a little too smart.

Comments?

--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/



Re: ParsingReader and PackageParser

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Feb 26, 2009 at 6:31 AM, Jonathan Koren <jo...@soe.ucsc.edu> wrote:
> Ideas?  Suggestions?

If you need special processing for tar files, then the best
alternative is probably to use the TarInputStream class directly, and
use the higher level Tika parsers only for parsing the individual tar
entries.
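
The overall shape of that approach can be sketched as follows.  This is
not the TarInputStream class: TarEntryWalker and EntryCallback are
hypothetical names, and the walker is a minimal JDK-only stand-in that
reads plain tar headers (name at offset 0, octal size at offset 124,
type flag at offset 156) just to show each entry being handed over as
its own bounded stream.  It ignores long-name and large-size extensions,
so a real tar library is the better choice in practice.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class TarEntryWalker {

    // Hypothetical callback: this is where a per-entry parser would be invoked.
    interface EntryCallback {
        void onEntry(String name, InputStream content) throws IOException;
    }

    static void walk(InputStream tar, EntryCallback callback) throws IOException {
        DataInputStream in = new DataInputStream(tar);
        byte[] header = new byte[512];
        while (true) {
            try {
                in.readFully(header);
            } catch (EOFException e) {
                return; // no end marker: stop quietly in this sketch
            }
            if (header[0] == 0) {
                return; // a header starting with NUL is treated as the end marker
            }
            String name = field(header, 0, 100);
            long size = Long.parseLong(field(header, 124, 12).trim(), 8);
            byte type = header[156];
            Bounded content = new Bounded(in, size);
            if (type == '0' || type == 0) { // regular files only
                callback.onEntry(name, content);
            }
            content.drain(); // skip whatever the callback left unread
            new Bounded(in, (512 - size % 512) % 512).drain(); // block padding
        }
    }

    // Returns the NUL-terminated ASCII string stored in a header field.
    private static String field(byte[] block, int offset, int length) {
        int end = offset;
        while (end < offset + length && block[end] != 0) {
            end++;
        }
        return new String(block, offset, end - offset, StandardCharsets.US_ASCII);
    }

    // View of at most `remaining` bytes of the source; close() leaves the source open.
    private static final class Bounded extends InputStream {
        private final InputStream source;
        private long remaining;

        Bounded(InputStream source, long remaining) {
            this.source = source;
            this.remaining = remaining;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) {
                return -1;
            }
            int b = source.read();
            remaining = (b < 0) ? 0 : remaining - 1;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            if (remaining <= 0) {
                return -1;
            }
            int n = source.read(buf, off, (int) Math.min(len, remaining));
            remaining = (n < 0) ? 0 : remaining - n;
            return n;
        }

        void drain() throws IOException {
            byte[] scratch = new byte[4096];
            while (read(scratch, 0, scratch.length) >= 0) {
                // discard
            }
        }
    }
}
```

Because each callback sees exactly one entry, the boundary between files
never has to be recovered from the parsed output, and entries are
streamed rather than loaded whole.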

If you need such processing to be an integral part of Tika, then you
can wrap your custom logic into a Parser class and modify your
configuration to use that parser instead of the default TarParser for
tar files.

BR,

Jukka Zitting