Posted to dev@tika.apache.org by Jonathan Koren <jo...@soe.ucsc.edu> on 2009/02/26 06:31:49 UTC
ParsingReader and PackageParser
I have a tar file whose contents I want to index as separate
files. To do this, I hooked up an AutoDetectParser to a
ParsingReader. I'm using ParsingReader since the total uncompressed
contents of the tars can be quite large.
If I understand how AutoDetectParser works, it figures out that the
file is a tar, and thus fires off a TarParser which is a type of
PackageParser. The PackageParser reads the tar, and sends SAX events
to some Tika internal representation of the file. Specifically, it
sends magic DIVs delimiting the contents of each file, which in
turn are parsed by another AutoDetectParser. The complete sequence of
SAX events for the entire tar file is then passed from the outermost
AutoDetectParser to ParsingReader.
Here's where things go off the track. The output stream of
ParsingReader is *plain text*, meaning that it is now impossible to
determine where one file within the tar ends, and where the next file
begins. Poking around within ParsingReader shows that the SAX events
are being passed through a BodyContentHandler, which, when constructed
with the default constructor, only writes out the characters of the
XML stream (i.e. it performs an XML-to-text conversion).
It seems like there either needs to be a way for ParsingReader to
associate a ContentHandler with its internal BodyContentHandler, or
the default action for BodyContentHandler should be to send the XML
directly, and not convert it to plain text.
Oh, and subclassing ParsingReader isn't an option without essentially
reimplementing it since the problematic BodyContentHandler is
instantiated within the private ParsingThread class.
Ideas? Suggestions?
--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/
Re: ParsingReader and PackageParser
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Fri, Feb 27, 2009 at 1:06 AM, Jonathan Koren <jo...@soe.ucsc.edu> wrote:
> Actually, if ParsingReader had some sort of mode where it spat out the XML
> directly instead of (indirectly) using WriteOutContentHandler to convert
> everything to plain text, then one could use whatever XML parser, including
> an XML-to-text converter, on the read side. As it is, it seems like
> ParsingReader is being just a little too smart.
The main purpose of ParsingReader is to be a Reader, i.e. to produce a
stream of characters.
If you want to see the SAX events, you can just use the Parser
interface directly:
    ContentHandler handler = new MyCustomContentHandler();
    new AutoDetectParser().parse(..., handler, ...);
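As a rough illustration of what such a custom handler might look like for the tar case, here is a minimal sketch that uses only the JDK's SAX classes. It assumes each archive entry arrives wrapped in a div carrying a marker class attribute; the class name EntrySplittingHandler and the marker value "package-entry" are hypothetical, so check what your Tika version actually emits before relying on it.

```java
import java.util.function.Consumer;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hands the text of each marker div to a callback, one call per archive entry. */
class EntrySplittingHandler extends DefaultHandler {
    // Assumption: the marker class value is hypothetical, not confirmed by the thread.
    static final String ENTRY_CLASS = "package-entry";

    private final Consumer<String> onEntry;
    private StringBuilder current; // non-null while inside an entry div
    private int depth;             // nesting depth of elements inside the entry div

    EntrySplittingHandler(Consumer<String> onEntry) {
        this.onEntry = onEntry;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current != null) {
            depth++; // an element nested inside the current entry
        } else if ("div".equals(localName) && ENTRY_CLASS.equals(atts.getValue("class"))) {
            current = new StringBuilder(); // a new entry starts here
            depth = 0;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return; // outside any entry
        }
        if (depth == 0) {
            onEntry.accept(current.toString()); // the entry div itself is closing
            current = null;
        } else {
            depth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length); // text outside entries is dropped
        }
    }
}
```

An instance would be passed as the handler argument of the parse call above, with the callback forwarding each entry's text to the indexer. Only one entry's text is buffered at a time, but entries nested inside another entry (a tar within a tar) are folded into the enclosing one in this sketch.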
BR,
Jukka Zitting
Re: ParsingReader and PackageParser
Posted by Jonathan Koren <jo...@soe.ucsc.edu>.
On Feb 26, 2009, at 5:52 AM, Jukka Zitting wrote:
> Hi,
>
> On Thu, Feb 26, 2009 at 6:31 AM, Jonathan Koren
> <jo...@soe.ucsc.edu> wrote:
>> Ideas? Suggestions?
>
> If you need special processing for tar files, then the best
> alternative is probably to use the TarInputStream class directly, and
> use the higher level Tika parsers only for parsing the individual tar
> entries.
>
> If you need such processing to be an integral part of Tika, then you
> can wrap your custom logic into a Parser class and modify your
> configuration to use that parser instead of the default TarParser for
> tar files.
I was originally thinking about some way of having ParsingReader set a
ContentHandler for its internal BodyContentHandler. As it's set up
now, you can't get SAX events at all with a ParsingReader.
Unfortunately, there doesn't really seem to be a clean or general way
to do that.
Actually, if ParsingReader had some sort of mode where it spat out the
XML directly instead of (indirectly) using WriteOutContentHandler to
convert everything to plain text, then one could use whatever XML
parser, including an XML-to-text converter, on the read side. As it is,
it seems like ParsingReader is being just a little too smart.
Comments?
--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/
Re: ParsingReader and PackageParser
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Thu, Feb 26, 2009 at 6:31 AM, Jonathan Koren <jo...@soe.ucsc.edu> wrote:
> Ideas? Suggestions?
If you need special processing for tar files, then the best
alternative is probably to use the TarInputStream class directly, and
use the higher level Tika parsers only for parsing the individual tar
entries.
If you need such processing to be an integral part of Tika, then you
can wrap your custom logic into a Parser class and modify your
configuration to use that parser instead of the default TarParser for
tar files.
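The first suggestion (walk the tar entries yourself and give each one to a parser) can be sketched roughly as follows. To stay dependency-free this sketch does not use TarInputStream; it reads the 512-byte tar headers by hand, which is a swapped-in simplification. It assumes plain headers with short names, skips checksum verification and long-name extensions, and buffers each entry fully in memory. The class SimpleTarWalker and its callback are hypothetical names; the callback is where a per-entry Tika parse would go.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.BiConsumer;

/** Minimal tar walker: calls the consumer with (name, bytes) for each regular file. */
class SimpleTarWalker {
    private static final int BLOCK = 512; // tar headers and data are padded to 512 bytes

    static void forEachEntry(InputStream tar, BiConsumer<String, byte[]> consumer) {
        DataInputStream in = new DataInputStream(tar);
        byte[] header = new byte[BLOCK];
        try {
            while (true) {
                try {
                    in.readFully(header);
                } catch (EOFException e) {
                    return; // stream ended without the usual zero blocks
                }
                if (header[0] == 0) {
                    return; // an empty name marks the end-of-archive block
                }
                String name = field(header, 0, 100);           // name: offset 0, 100 bytes
                String octal = field(header, 124, 12).trim();  // size: offset 124, octal text
                int size = octal.isEmpty() ? 0 : (int) Long.parseLong(octal, 8);
                byte type = header[156];                       // typeflag: '0' or NUL = file

                byte[] data = new byte[size]; // simplification: whole entry in memory
                in.readFully(data);
                int padding = (BLOCK - size % BLOCK) % BLOCK;
                in.readFully(new byte[padding]); // skip to the next block boundary

                if (type == '0' || type == 0) {
                    consumer.accept(name, data); // e.g. parse this entry with Tika here
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a NUL-terminated ASCII field from a header block. */
    private static String field(byte[] buf, int off, int len) {
        int end = off;
        while (end < off + len && buf[end] != 0) {
            end++;
        }
        return new String(buf, off, end - off, StandardCharsets.US_ASCII);
    }
}
```

Because each entry reaches the callback separately, the boundaries between files are never lost. A real implementation would more likely stream each entry from a tar library such as the TarInputStream mentioned above than buffer it.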
BR,
Jukka Zitting