Posted to dev@tika.apache.org by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> on 2014/11/17 23:25:05 UTC

TIKA-1445 and having multiple Parsers (as many as needed) work on the same MediaType

Hi Guys,

There is a great discussion going on around TIKA-1445 right now
that I wanted to bring to the dev list:

http://issues.apache.org/jira/browse/TIKA-1445

What we are seeing from OCR and GDAL lately is that there may be a
use case to have multiple parsers called for the same MediaType.
In this fashion, each parser contributes *more* metadata and content
handling, rather than simply replacing it, or being the only Parser
selected to contribute to it. Tim brought up the following questions
that I wanted to respond to here on list:

{quote} How will we handle: 1) Two parsers both "set" a value in
the Metadata object? Will the second overwrite the value of the
first?  2) Content: How will we know when a document ends?
AutoDetectParser would wrap the handler in an
EndDocumentShieldingContentHandler and then call endDocument when
done?  3) Will the user be able to parse the output from the handler
to figure out which parser is responsible for which content? Let's
say a user wants to pull the electronic text out of a PDF and render
the page as an image and then run it through OCR, would we have
something like <div parser="o.a.t.p.PDFParser"> or similar?  If we
go this route, we'd want to make sure we don't have literally
duplicate parsers (as we do now).  This sounds more complicated
than having parent parsers know which children they control and how
to control them, but, it might make sense.  Aside from OCR
{quote}

Here are my replies:

#1 We will use a default policy of "append" which allows the Metadata
object to append values to the same key, rather than replace them.
We could also couple this with X-Parsed-By, which is an ordered
list of what Parser parsed what so that we can reconstruct what
Parser contributed what field. If it's multi-valued, we can also
add fields for Offsets, etc.  An alternative here would also be to
prefix metadata keys in this CompositeParser by the X-Parsed-By
parser name, to avoid conflicts. Users would be able to switch the
policy from "append" to "overwrite" in which this isn't a problem,
and we simply allow the last parser to input into a conflicting key
to be the one that takes precedence. One option with overwrite would
be to allow in this policy for providing a precedence order of
Parsers (e.g., the current service list could be a precedence order).
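
To make the two policies concrete, here is a minimal sketch in plain Java.
It is not Tika code: the class MetadataMergeSketch, the Policy enum and the
use of a Map of Lists in place of Tika's Metadata object are all assumptions
made for illustration. Only the X-Parsed-By key name comes from the
discussion above. Parsers are merged in call order, so under "overwrite" the
last parser to touch a key wins.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the append/overwrite policies; not part of Tika. */
final class MetadataMergeSketch {

    enum Policy { APPEND, OVERWRITE }

    /**
     * Merges per-parser metadata in iteration order.
     * perParser maps a parser name to the key/values that parser produced.
     */
    static Map<String, List<String>> merge(
            Map<String, Map<String, List<String>>> perParser, Policy policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, List<String>>> parser : perParser.entrySet()) {
            // Record which parser ran, in order, so contributions can be traced.
            merged.computeIfAbsent("X-Parsed-By", k -> new ArrayList<>())
                  .add(parser.getKey());
            for (Map.Entry<String, List<String>> field : parser.getValue().entrySet()) {
                if (policy == Policy.OVERWRITE) {
                    // Last parser to set a key takes precedence.
                    merged.put(field.getKey(), new ArrayList<>(field.getValue()));
                } else {
                    // Append: keep every value contributed for the key.
                    merged.computeIfAbsent(field.getKey(), k -> new ArrayList<>())
                          .addAll(field.getValue());
                }
            }
        }
        return merged;
    }
}
```

With a LinkedHashMap as input, the insertion order doubles as the precedence
order, which is one way the service list idea could be expressed.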

That said, how sure are we that this is a *real* problem? Some
parsers parse the same MediaType but contribute vastly different
and non-overlapping keys to the metadata object.

#2 I like your suggestion - or the alternative as I suggested would
be to reset the stream to the beginning after each parser, or
alternatively keep a clone of the original stream as a copy, and
then clone it for each called Parser attempt?
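
A rough sketch of the "keep a copy and hand each parser a fresh stream"
option, using only the JDK (Java 9+ for readAllBytes). The names
ReplayableStreamSketch and StreamConsumer are made up here, and StreamConsumer
is just a stand-in for a parser. Buffering in memory is an assumption that
only suits small inputs; large inputs would need spooling to a temp file
instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: replay one input to several consumers. */
final class ReplayableStreamSketch {

    /** Stand-in for a parser: consumes a stream and returns some result. */
    interface StreamConsumer {
        String consume(InputStream in) throws IOException;
    }

    static List<String> runAll(InputStream original, List<StreamConsumer> consumers)
            throws IOException {
        // Read the original once; every consumer gets its own view of the bytes.
        byte[] bytes = original.readAllBytes();
        List<String> results = new ArrayList<>();
        for (StreamConsumer consumer : consumers) {
            // A new stream per consumer always starts at offset 0.
            results.add(consumer.consume(new ByteArrayInputStream(bytes)));
        }
        return results;
    }
}
```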

#3 I like your idea about wrapping content provided by handlers
with the parser attribute. Very neat, let's try that!
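
For illustration, wrapping each parser's SAX output in a div that carries a
parser attribute could look roughly like this. It uses only the JDK's
org.xml.sax classes. ParserDivSketch and SaxProducer are hypothetical names,
SaxProducer stands in for a parser writing to a handler, and the XHTML
namespace on the div is an assumption about what the output should use.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical sketch: tag each parser's output with its name. */
final class ParserDivSketch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Stand-in for a parser that emits SAX events to a handler. */
    interface SaxProducer {
        void produce(ContentHandler handler) throws SAXException;
    }

    static void wrap(String parserName, SaxProducer parser, ContentHandler out)
            throws SAXException {
        AttributesImpl attrs = new AttributesImpl();
        // <div parser="..."> so consumers can tell which parser wrote what.
        attrs.addAttribute("", "parser", "parser", "CDATA", parserName);
        out.startElement(XHTML, "div", "div", attrs);
        parser.produce(out);
        out.endElement(XHTML, "div", "div");
    }
}
```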

OK, thanks. I will add this to the JIRA issue too, but I think this
is a good thing to have on the dev@ list.

Cheers, 
Chris



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: TIKA-1445 and having multiple Parsers (as many as needed) work on the same MediaType

Posted by David Meikle <lo...@gmail.com>.

Hi Guys,

> On 18 Nov 2014, at 16:52, Allison, Timothy B. <ta...@mitre.org> wrote:
> 
> Chris,
>  Thank you for moving this to the dev list.  This would be a fairly large change, and the discussion is valuable.

Given the potential implications of the change, I am wondering if it is worth scheduling a Google Hangout / Conference Call / IRC session to chat through things once we have all had time to flesh our thoughts out?

I am happy to facilitate setting this up and documenting it (meeting notes), so we can include outputs on the list for further discussion and subsequent formal decision making with everyone involved.

Cheers,
Dave

RE: TIKA-1445 and having multiple Parsers (as many as needed) work on the same MediaType

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Chris,
  Thank you for moving this to the dev list.  This would be a fairly large change, and the discussion is valuable.

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Monday, November 17, 2014 5:25 PM
To: dev@tika.apache.org
Subject: TIKA-1445 and having multiple Parsers (as many as needed) work on the same MediaType

Hi Guys,

There is a great discussion going on around TIKA-1445 right now
that I wanted to bring to the dev list:

http://issues.apache.org/jira/browse/TIKA-1445

What we are seeing from OCR and GDAL lately is that there may be a
use case to have multiple parsers called for the same MediaType.
In this fashion, each parser contributes *more* metadata and content
handling, rather than simply replacing it, or being the only Parser
selected to contribute to it. Tim brought up the following questions
that I wanted to respond to here on list:

{quote} How will we handle: 1) Two parsers both "set" a value in
the Metadata object? Will the second overwrite the value of the
first?  2) Content: How will we know when a document ends?
AutoDetectParser would wrap the handler in an
EndDocumentShieldingContentHandler and then call endDocument when
done?  3) Will the user be able to parse the output from the handler
to figure out which parser is responsible for which content? Let's
say a user wants to pull the electronic text out of a PDF and render
the page as an image and then run it through OCR, would we have
something like <div parser="o.a.t.p.PDFParser"> or similar?  If we
go this route, we'd want to make sure we don't have literally
duplicate parsers (as we do now).  This sounds more complicated
than having parent parsers know which children they control and how
to control them, but, it might make sense.  Aside from OCR
{quote}

Here are my replies:

#1 We will use a default policy of "append" which allows the Metadata
object to append values to the same key, rather than replace them.
We could also couple this with X-Parsed-By, which is an ordered
list of what Parser parsed what so that we can reconstruct what
Parser contributed what field. If it's multi-valued, we can also
add fields for Offsets, etc.  An alternative here would also be to
prefix metadata keys in this CompositeParser by the X-Parsed-By
parser name, to avoid conflicts. Users would be able to switch the
policy from "append" to "overwrite" in which this isn't a problem,
and we simply allow the last parser to input into a conflicting key
to be the one that takes precedence. One option with overwrite would
be to allow in this policy for providing a precedence order of
Parsers (e.g., the current service list could be a precedence order).

That said, how sure are we that this is a *real* problem? Some
parsers parse the same MediaType but contribute vastly different
and non-overlapping keys to the metadata object.

>>I agree that different parsers contribute vastly different metadata keys, and, frankly, in the current use case, the tesseract parser should add nearly zero metadata, so this won't be an issue.  However, if we're going to change the way we've been doing things generally, I wanted us to think of the implications.  The root of my initial concern with this is that the child parsers choose whether or not to add or set.  

>>Oh, but wait, ok, so what we'd actually do is send in a new metadata object for each parser and then at the CompositeParser level, we'd make the decision on whether to append or overwrite the data that we got from each Metadata object.  But wait, aren't there some Properties that only allow one value (e.g. TikaCoreProperties.TITLE)?  Ok, so, when we merge the Metadata objects, we just get String(s) as keys, so we lose the Property restrictions.  Will this wreck XMP or lead to a bad day for people expecting these restrictions?

#2 I like your suggestion - or the alternative as I suggested would
be to reset the stream to the beginning after each parser, or
alternatively keep a clone of the original stream as a copy, and
then clone it for each called Parser attempt?

>>I think we're talking about different things.  Yes, we'll definitely need to reset or spool the stream depending on its length.  My concern was more with the handlers.  If the first parser calls endDocument() and we don't shield that, then if someone uses the BodyContentHandler, then they might not see contents from the second/third parser because the initial parser "ended" the document.  I need to test this concern, but I think that this was the root of TIKA-1124.
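
>>A minimal sketch of the shielding idea, using only the JDK's org.xml.sax classes. This is not the Tika handler mentioned above: ShieldedHandlerSketch, Shield and SaxProducer are hypothetical names, and SaxProducer stands in for a parser. The shield forwards every event except startDocument/endDocument, so the caller opens and closes the real handler exactly once no matter how many parsers run.

```java
import java.util.List;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical sketch: let several parsers share one handler. */
final class ShieldedHandlerSketch {

    /** Forwards all events downstream except document start/end. */
    static final class Shield extends XMLFilterImpl {
        Shield(ContentHandler downstream) {
            setContentHandler(downstream);
        }
        @Override public void startDocument() { /* swallowed */ }
        @Override public void endDocument() { /* swallowed */ }
    }

    /** Stand-in for a parser that emits SAX events to a handler. */
    interface SaxProducer {
        void produce(ContentHandler handler) throws SAXException;
    }

    static void runAll(ContentHandler real, List<SaxProducer> parsers)
            throws SAXException {
        real.startDocument();              // opened once by the caller
        ContentHandler shield = new Shield(real);
        for (SaxProducer parser : parsers) {
            parser.produce(shield);        // each parser may call endDocument()
        }
        real.endDocument();                // closed once, after the last parser
    }
}
```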

#3 I like your idea about wrapping content provided by handlers
with the parser attribute. Very neat, let's try that!