Posted to dev@tika.apache.org by Nick Burch <ap...@gagravarr.org> on 2018/01/02 19:20:20 UTC

Re: Not-yet-broken breaking changes for Tika 2?

Sorry to ignore this for so long...

On Thu, 26 Oct 2017, Chris Mattmann wrote:
> On collision, the precedence order defines what key takes precedence and 
> _overwrites_ the other. Overwrite is but one option (you could save 
> *all* the values it’s a multi-valued key structure so…)

OK, I think that's fine. I've had a go at updating the wiki for the 
metadata case:
https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
And example Tika Config settings for it:
https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
If people are happy with how that sounds/looks, I can have a stab at 
implementing it, as I *think* it's quite easy.
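
To make the two collision options concrete, here is a minimal sketch in plain
Java. It is not Tika code: MetadataMerge, OnCollision and merge() are
hypothetical names, and plain maps stand in for Tika's Metadata object. It only
assumes that the parsers' outputs are supplied lowest precedence first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of Tika): merges per-parser metadata in precedence order. */
final class MetadataMerge {

    /** What to do when two parsers report the same key. */
    enum OnCollision { OVERWRITE, KEEP_ALL }

    /**
     * Merges the given maps. The list is ordered lowest precedence first,
     * so under OVERWRITE the last map that defines a key wins.
     */
    static Map<String, List<String>> merge(List<Map<String, List<String>>> lowestPrecedenceFirst,
                                           OnCollision policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> source : lowestPrecedenceFirst) {
            for (Map.Entry<String, List<String>> entry : source.entrySet()) {
                if (policy == OnCollision.OVERWRITE) {
                    // Replace whatever a lower-precedence parser supplied.
                    merged.put(entry.getKey(), new ArrayList<>(entry.getValue()));
                } else {
                    // Multi-valued key: append, keeping every parser's values.
                    merged.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                          .addAll(entry.getValue());
                }
            }
        }
        return merged;
    }
}
```

With Tim's example, one parser giving dc:creator of Tim and a higher-precedence
one giving Chris, OVERWRITE would leave only Chris while KEEP_ALL would keep
both values in order.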


However... that still leaves the Content (XHTML SAX events) case to solve!

Anyone have any ideas on how we can append to, or cancel/reset, the 
ContentHandler's series of SAX events when we move on to a second (or 
later) parser for a file?

Thanks
Nick

> On 10/26/17, 9:43 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
>
>    On Thu, 26 Oct 2017, Chris Mattmann wrote:
>    > My general approach to conflicting metadata is simply to define
>    > precedence orders.
>    >
>    > For example here is one documented from OODT:
>    >
>    > https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
>    >
>    > We can do similar things with Tika, e.g.,
>    >
>    > [CoreMetadata.PROPERTIES]
>    > [ImageParser.METADATA]
>    > [TikaOCR.METADATA]
>
>    What happens if two different parsers both output the same bit of metadata
>    though? eg Tim's example of one giving dc:creator of Tim and the second
>    giving dc:creator of Chris?
>
>
>    Secondly, what about the XHTML sax events stream? I think that's probably
>    the harder case...
>
>    Nick

Re: Not-yet-broken breaking changes for Tika 2?

Posted by Chris Mattmann <ma...@apache.org>.
IMO, if parser p1 throws an exception and we move to p2 before p1 has finished 
emitting its SAX events, we can create a special tag indicating the exception, e.g. 
<span class="tika-exception">Message here</span>, and output that before moving to 
p2 in the chain...
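
A rough sketch of how such a marker could be written to a standard SAX
ContentHandler. The ExceptionMarker class and its write() method are
hypothetical names for illustration only; the only APIs assumed are the
org.xml.sax interfaces that ship with the JDK, and the "tika-exception" class
value is the one suggested above.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical helper (not part of Tika): writes an exception marker into the SAX stream. */
final class ExceptionMarker {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Emits <span class="tika-exception">message</span> to the given handler. */
    static void write(ContentHandler handler, Throwable failure) throws SAXException {
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "class", "class", "CDATA", "tika-exception");
        handler.startElement(XHTML, "span", "span", attrs);
        // String.valueOf guards against exceptions that carry no message.
        String message = String.valueOf(failure.getMessage());
        handler.characters(message.toCharArray(), 0, message.length());
        handler.endElement(XHTML, "span", "span");
    }
}
```

The fallback logic would call this once, right after catching p1's exception
and before handing the same handler to p2.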



On 2/7/18, 7:00 AM, "Allison, Timothy B." <ta...@mitre.org> wrote:

    Do we worry about properly closing tags on an exception?
    
    <body>
    	<div parser="parser1">
    		<p>
    kaboom
    	<div parser="parser2">
    ....
    
    My focus is normally text so broken tags aren't a problem for me...but others?
    
    -----Original Message-----
    From: Luís Filipe Nassif [mailto:lfcnassif@gmail.com] 
    Sent: Monday, February 5, 2018 5:34 PM
    To: dev@tika.apache.org
    Subject: Re: Not-yet-broken breaking changes for Tika 2?
    
    From a forensic use case it is better just saying we are trying another parser and not resetting the content handler, because the first parser can extract relevant content before the exception.
    
    To not spool everything to temp files to re-read the stream, I think we can create an optional setinputstreamfactory() method in TikaInputStream, so the user can implement an InputStreamFactory interface with a getInputStream method, if he does not want to pay a performance hit with temp files for everything.
    
    Luis
    
    Em 5 de fev de 2018 4:52 PM, "Chris Mattmann" <ma...@apache.org>
    escreveu:
    
    I think we should just say, OK now we're trying  a different parser....
    
    
    
    On 2/5/18, 9:51 AM, "Allison, Timothy B." <ta...@mitre.org> wrote:
    
        To my mind, the real challenge is what to do with content that should be ignored...
    
        If the strategy is back-off-on-exception (try the DOCX parser, but if there's an exception, use the Zip parser), what do we do with the sax elements that have already been written?  Do we need a new handler type that has a reset() method?
    
        Or do we just say, hey, now we're trying a different parser...
    
    
        -----Original Message-----
        From: Mattmann, Chris A (1761) [mailto:chris.a.mattmann@jpl.nasa.gov]
        Sent: Monday, February 5, 2018 12:29 PM
        To: dev@tika.apache.org
        Subject: Re: Not-yet-broken breaking changes for Tika 2?
    
        Our solution is just to run the parser 2x....yes I get it will induce overhead, but as a start, why not?
        In short just run through the stream 2x....
    
        ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
        Chris Mattmann, Ph.D.
        Associate Chief Technology and Innovation Officer, OCIO
        Manager, Advanced IT Research and Open Source Projects Office (1761)
        Manager, NSF and Open Source Programs and Applications Office (8212)
        NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
        Office: 180-503E, Mailstop: 180-502
        Email: chris.a.mattmann@nasa.gov
        WWW:  http://sunset.usc.edu/~mattmann/
        ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
        Director, Information Retrieval and Data Science Group (IRDS)
        Adjunct Associate Professor, Computer Science Department
        University of Southern California, Los Angeles, CA 90089 USA
        WWW: http://irds.usc.edu/
        ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    
    
        On 2/5/18, 9:25 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
    
            On Mon, 5 Feb 2018, Chris Mattmann wrote:
            > Let's have a go at implementing it! You know my thoughts (make it like
            > OODT ;) )\
    
            I'm still keen to hear how we can do the text content like OODT!
    
            I have tried to copy the OODT model for the proposed metadata case though
            :)
    
            Nick
    
            > On 2/5/18, 8:37 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
            >
            >    Ping - anyone got any thoughts on the proposed metadata parser stuff, and
            >    any ideas on the content part?
    



Re: Not-yet-broken breaking changes for Tika 2?

Posted by Luís Filipe Nassif <lf...@gmail.com>.
Mine too, but I know it is important for many use cases. Maybe we could add
some tracking of open tags to XHTMLContentHandler, plus a new method to close
them?
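
A minimal sketch of that idea, written as a standalone wrapper rather than as a
change to XHTMLContentHandler itself. OpenTagTrackingHandler, depth() and
closeOpenElements() are hypothetical names; only the JDK's org.xml.sax classes
are assumed to exist.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical wrapper: remembers which elements are open so they can be closed after a failure. */
class OpenTagTrackingHandler extends DefaultHandler {
    private final ContentHandler delegate;
    // Each entry is {uri, localName, qName} of an element that has not been ended yet.
    private final Deque<String[]> open = new ArrayDeque<>();

    OpenTagTrackingHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        open.push(new String[] {uri, localName, qName});
        delegate.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (!open.isEmpty()) {
            open.pop();
        }
        delegate.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        delegate.characters(ch, start, length);
    }

    /** Number of currently open elements; record it before starting a parser. */
    int depth() {
        return open.size();
    }

    /** Ends, innermost first, every element opened beyond the given depth. */
    void closeOpenElements(int keepDepth) throws SAXException {
        while (open.size() > keepDepth) {
            String[] element = open.pop();
            delegate.endElement(element[0], element[1], element[2]);
        }
    }
}
```

The caller would note depth() before running parser1 and, on an exception, call
closeOpenElements() with that value so that parser2's div starts from
well-formed output.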

2018-02-07 12:59 GMT-02:00 Allison, Timothy B. <ta...@mitre.org>:

> Do we worry about properly closing tags on an exception?
>
> <body>
>         <div parser="parser1">
>                 <p>
> kaboom
>         <div parser="parser2">
> ....
>
> My focus is normally text so broken tags aren't a problem for me...but
> others?
>
> -----Original Message-----
> From: Luís Filipe Nassif [mailto:lfcnassif@gmail.com]
> Sent: Monday, February 5, 2018 5:34 PM
> To: dev@tika.apache.org
> Subject: Re: Not-yet-broken breaking changes for Tika 2?
>
> From a forensic use case it is better just saying we are trying another
> parser and not resetting the content handler, because the first parser can
> extract relevant content before the exception.
>
> To not spool everything to temp files to re-read the stream, I think we
> can create an optional setinputstreamfactory() method in TikaInputStream,
> so the user can implement an InputStreamFactory interface with a
> getInputStream method, if he does not want to pay a performance hit with
> temp files for everything.
>
> Luis
>
> Em 5 de fev de 2018 4:52 PM, "Chris Mattmann" <ma...@apache.org>
> escreveu:
>
> I think we should just say, OK now we're trying  a different parser....
>
>
>
> On 2/5/18, 9:51 AM, "Allison, Timothy B." <ta...@mitre.org> wrote:
>
>     To my mind, the real challenge is what to do with content that should
> be ignored...
>
>     If the strategy is back-off-on-exception (try the DOCX parser, but if
> there's an exception, use the Zip parser), what do we do with the sax
> elements that have already been written?  Do we need a new handler type
> that has a reset() method?
>
>     Or do we just say, hey, now we're trying a different parser...
>
>
>     -----Original Message-----
>     From: Mattmann, Chris A (1761) [mailto:chris.a.mattmann@jpl.nasa.gov]
>     Sent: Monday, February 5, 2018 12:29 PM
>     To: dev@tika.apache.org
>     Subject: Re: Not-yet-broken breaking changes for Tika 2?
>
>     Our solution is just to run the parser 2x....yes I get it will induce
> overhead, but as a start, why not?
>     In short just run through the stream 2x....
>
>     On 2/5/18, 9:25 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
>
>         On Mon, 5 Feb 2018, Chris Mattmann wrote:
>         > Let's have a go at implementing it! You know my thoughts (make
> it like
>         > OODT ;) )\
>
>         I'm still keen to hear how we can do the text content like OODT!
>
>         I have tried to copy the OODT model for the proposed metadata case
> though
>         :)
>
>         Nick
>
>         > On 2/5/18, 8:37 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
>         >
>         >    Ping - anyone got any thoughts on the proposed metadata parser
> stuff, and
>         >    any ideas on the content part?
>

RE: Not-yet-broken breaking changes for Tika 2?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Do we worry about properly closing tags on an exception?

<body>
	<div parser="parser1">
		<p>
kaboom
	<div parser="parser2">
....

My focus is normally text so broken tags aren't a problem for me...but others?

-----Original Message-----
From: Luís Filipe Nassif [mailto:lfcnassif@gmail.com] 
Sent: Monday, February 5, 2018 5:34 PM
To: dev@tika.apache.org
Subject: Re: Not-yet-broken breaking changes for Tika 2?

From a forensic use case it is better just saying we are trying another parser and not resetting the content handler, because the first parser can extract relevant content before the exception.

To not spool everything to temp files to re-read the stream, I think we can create an optional setinputstreamfactory() method in TikaInputStream, so the user can implement an InputStreamFactory interface with a getInputStream method, if he does not want to pay a performance hit with temp files for everything.

Luis

Em 5 de fev de 2018 4:52 PM, "Chris Mattmann" <ma...@apache.org>
escreveu:

I think we should just say, OK now we're trying  a different parser....



On 2/5/18, 9:51 AM, "Allison, Timothy B." <ta...@mitre.org> wrote:

    To my mind, the real challenge is what to do with content that should be ignored...

    If the strategy is back-off-on-exception (try the DOCX parser, but if there's an exception, use the Zip parser), what do we do with the sax elements that have already been written?  Do we need a new handler type that has a reset() method?

    Or do we just say, hey, now we're trying a different parser...


    -----Original Message-----
    From: Mattmann, Chris A (1761) [mailto:chris.a.mattmann@jpl.nasa.gov]
    Sent: Monday, February 5, 2018 12:29 PM
    To: dev@tika.apache.org
    Subject: Re: Not-yet-broken breaking changes for Tika 2?

    Our solution is just to run the parser 2x....yes I get it will induce overhead, but as a start, why not?
    In short just run through the stream 2x....

    On 2/5/18, 9:25 AM, "Nick Burch" <ap...@gagravarr.org> wrote:

        On Mon, 5 Feb 2018, Chris Mattmann wrote:
        > Let's have a go at implementing it! You know my thoughts (make it like
        > OODT ;) )\

        I'm still keen to hear how we can do the text content like OODT!

        I have tried to copy the OODT model for the proposed metadata case though
        :)

        Nick

        > On 2/5/18, 8:37 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
        >
        >    Ping - anyone got any thoughts on the proposed metadata parser stuff, and
        >    any ideas on the content part?

Re: Not-yet-broken breaking changes for Tika 2?

Posted by Luís Filipe Nassif <lf...@gmail.com>.
From a forensic use case it is better just to say that we are trying another
parser and not reset the content handler, because the first parser may have
extracted relevant content before the exception.

To avoid spooling everything to temp files in order to re-read the stream, I
think we can create an optional setinputstreamfactory() method in
TikaInputStream, so the user can implement an InputStreamFactory interface with
a getInputStream method if they do not want to pay the performance hit of temp
files for everything.
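
A sketch of what such a factory could look like, purely as an illustration.
InputStreamFactory and getInputStream are the names proposed above; the
FallbackReader class, its StreamConsumer interface and firstSuccessful() are
made up here to stand in for the parser chain, and nothing below is an existing
TikaInputStream API.

```java
import java.io.IOException;
import java.io.InputStream;

/** Proposed (hypothetical) interface: hands out a fresh stream over the same bytes on every call. */
interface InputStreamFactory {
    InputStream getInputStream() throws IOException;
}

/** Hypothetical stand-in for the parser chain: each attempt re-opens the stream via the factory. */
final class FallbackReader {

    /** Hypothetical stand-in for a parser: reads the stream and returns extracted text. */
    interface StreamConsumer {
        String consume(InputStream in) throws Exception;
    }

    /** Tries each consumer in turn against a freshly opened stream; no temp-file spooling needed. */
    static String firstSuccessful(InputStreamFactory factory, StreamConsumer... consumers)
            throws IOException {
        for (StreamConsumer consumer : consumers) {
            try (InputStream in = factory.getInputStream()) {
                return consumer.consume(in);
            } catch (Exception e) {
                // This attempt failed; the next one starts from the beginning of a new stream.
            }
        }
        throw new IOException("all attempts failed");
    }
}
```

A user whose data already lives in a file or byte array could implement the
factory as a one-line lambda that re-opens that source.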

Luis

Em 5 de fev de 2018 4:52 PM, "Chris Mattmann" <ma...@apache.org>
escreveu:

I think we should just say, OK now we're trying  a different parser....



On 2/5/18, 9:51 AM, "Allison, Timothy B." <ta...@mitre.org> wrote:

    To my mind, the real challenge is what to do with content that should be ignored...

    If the strategy is back-off-on-exception (try the DOCX parser, but if there's an exception, use the Zip parser), what do we do with the sax elements that have already been written?  Do we need a new handler type that has a reset() method?

    Or do we just say, hey, now we're trying a different parser...


    -----Original Message-----
    From: Mattmann, Chris A (1761) [mailto:chris.a.mattmann@jpl.nasa.gov]
    Sent: Monday, February 5, 2018 12:29 PM
    To: dev@tika.apache.org
    Subject: Re: Not-yet-broken breaking changes for Tika 2?

    Our solution is just to run the parser 2x....yes I get it will induce overhead, but as a start, why not?
    In short just run through the stream 2x....

    On 2/5/18, 9:25 AM, "Nick Burch" <ap...@gagravarr.org> wrote:

        On Mon, 5 Feb 2018, Chris Mattmann wrote:
        > Let's have a go at implementing it! You know my thoughts (make it like
        > OODT ;) )\

        I'm still keen to hear how we can do the text content like OODT!

        I have tried to copy the OODT model for the proposed metadata case though
        :)

        Nick

        > On 2/5/18, 8:37 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
        >
        >    Ping - anyone got any thoughts on the proposed metadata parser stuff, and
        >    any ideas on the content part?

Re: Not-yet-broken breaking changes for Tika 2?

Posted by Chris Mattmann <ma...@apache.org>.
I think we should just say, OK now we're trying a different parser....



On 2/5/18, 9:51 AM, "Allison, Timothy B." <ta...@mitre.org> wrote:

    To my mind, the real challenge is what to do with content that should be ignored...
    
    If the strategy is back-off-on-exception (try the DOCX parser, but if there's an exception, use the Zip parser), what do we do with the sax elements that have already been written?  Do we need a new handler type that has a reset() method?
    
    Or do we just say, hey, now we're trying a different parser...
    
    
    -----Original Message-----
    From: Mattmann, Chris A (1761) [mailto:chris.a.mattmann@jpl.nasa.gov] 
    Sent: Monday, February 5, 2018 12:29 PM
    To: dev@tika.apache.org
    Subject: Re: Not-yet-broken breaking changes for Tika 2?
    
    Our solution is just to run the parser 2x....yes I get it will induce overhead, but as a start, why not?
    In short just run through the stream 2x....
    
    On 2/5/18, 9:25 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
    
        On Mon, 5 Feb 2018, Chris Mattmann wrote:
        > Let's have a go at implementing it! You know my thoughts (make it like 
        > OODT ;) )\
        
        I'm still keen to hear how we can do the text content like OODT!
        
        I have tried to copy the OODT model for the proposed metadata case though 
        :)
        
        Nick
        
        > On 2/5/18, 8:37 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
        >
        >    Ping - anyone got any thoughts on the proposed metadata parser stuff, and
        >    any ideas on the content part?
    
    



RE: Not-yet-broken breaking changes for Tika 2?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
To my mind, the real challenge is what to do with content that should be ignored...

If the strategy is back-off-on-exception (try the DOCX parser, but if there's an exception, use the Zip parser), what do we do with the sax elements that have already been written?  Do we need a new handler type that has a reset() method?

Or do we just say, hey, now we're trying a different parser...
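
To make the reset() idea concrete, here is a minimal sketch, assuming we buffer SAX events and only pass them on once a parser has finished cleanly. ResettableHandler and its reset()/flushTo() methods are hypothetical names, not existing Tika classes, and only element and character events are buffered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler (not a Tika class): buffers SAX events until flushed. */
class ResettableHandler extends DefaultHandler {

    /** One buffered SAX event that can be replayed against a real handler. */
    private interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Parsers may reuse the Attributes object, so take a copy.
        Attributes copy = new AttributesImpl(atts);
        events.add(t -> t.startElement(uri, localName, qName, copy));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add(t -> t.endElement(uri, localName, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The char[] is only valid during the callback, so copy the slice.
        char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(t -> t.characters(copy, 0, copy.length));
    }

    /** Discard everything written by a parser that failed part-way through. */
    public void reset() {
        events.clear();
    }

    /** Replay the buffered events to the caller's real handler, then clear the buffer. */
    public void flushTo(ContentHandler target) throws SAXException {
        for (Event e : events) {
            e.replay(target);
        }
        events.clear();
    }
}
```

A fallback wrapper would call reset() when the first parser throws, run the next parser against the same buffer, and call flushTo() on the user's handler only after a parser succeeds. The cost is that nothing streams to the caller until then, and the whole event list sits in memory.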


-----Original Message-----
From: Mattmann, Chris A (1761) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Monday, February 5, 2018 12:29 PM
To: dev@tika.apache.org
Subject: Re: Not-yet-broken breaking changes for Tika 2?

Our solution is just to run the parser 2x....yes I get it will induce overhead, but as a start, why not?
In short just run through the stream 2x....



RE: Not-yet-broken breaking changes for Tika 2?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Spool to temp file?
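
One reading of this, as a rough sketch: copy the incoming stream to a temporary file once, so each parser attempt can open a fresh stream from byte 0. SpooledInput is a hypothetical helper name, not an existing Tika class; if I recall correctly, TikaInputStream already does something similar when a file is requested.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not a Tika class): spools a stream so it can be re-read. */
class SpooledInput implements AutoCloseable {

    private final Path tmp;

    SpooledInput(InputStream in) throws IOException {
        tmp = Files.createTempFile("spool-", ".tmp");
        // createTempFile already created the file, so allow overwriting it.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Each call returns a new stream positioned at the start of the data. */
    InputStream open() throws IOException {
        return Files.newInputStream(tmp);
    }

    /** Remove the temporary file once all parsers are done. */
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(tmp);
    }
}
```

That covers re-reading the input for a second parser; it does not by itself solve what to do with SAX events the first parser has already written.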

-----Original Message-----
From: Mattmann, Chris A (1761) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Monday, February 5, 2018 12:29 PM
To: dev@tika.apache.org
Subject: Re: Not-yet-broken breaking changes for Tika 2?

Our solution is just to run the parser 2x....yes I get it will induce overhead, but as a start, why not?
In short just run through the stream 2x....



Re: Not-yet-broken breaking changes for Tika 2?

Posted by "Mattmann, Chris A (1761)" <ch...@jpl.nasa.gov>.
Our solution is just to run the parser 2x....yes I get it will induce overhead, but as a start, why not?
In short just run through the stream 2x....

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Associate Chief Technology and Innovation Officer, OCIO
Manager, Advanced IT Research and Open Source Projects Office (1761)
Manager, NSF and Open Source Programs and Applications Office (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-502
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 
 
On 2/5/18, 9:25 AM, "Nick Burch" <ap...@gagravarr.org> wrote:

    On Mon, 5 Feb 2018, Chris Mattmann wrote:
    > Let's have a go at implementing it! You know my thoughts (make it like 
    > OODT ;) )
    
    I'm still keen to hear how we can do the text content like OODT!
    
    I have tried to copy the OODT model for the proposed metadata case though 
    :)
    
    Nick
    
    > On 2/5/18, 8:37 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
    >
    >    Ping - anyone got any thoughts on the proposed metadata parser stuff, and
    >    any ideas on the content part?
    >
    >    On Tue, 2 Jan 2018, Nick Burch wrote:
    >    > On Thu, 26 Oct 2017, Chris Mattmann wrote:
    >    >> On collision, the precedence order defines what key takes precedence and
    >    >> _overwrites_ the other. Overwrite is but one option (you could save *all*
    >    >> the values it’s a multi-valued key structure so…)
    >    >
    >    > OK, I think that's fine. I've had a go at updating the wiki for the metadata
    >    > case:
    >    > https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
    >    > And example Tika Config settings for it
    >    > https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
    >    > If people are happy with how that sounds/looks, I can have a stab at
    >    > implementing it, as I *think* it's quite easy
    >    >
    >    >
    >    > However... that still leaves the Context (XHTML SAX events) case to solve!
    >    >
    >    > Anyone have any ideas on how we can append to or cancel/reset the Content
    >    > Handler series of SAX events when we move onto a second+ parser for a file?
    >    >
    >    > Thanks
    >    > Nick
    >    >
    >    >> On 10/26/17, 9:43 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
    >    >>
    >    >>    On Thu, 26 Oct 2017, Chris Mattmann wrote:
    >    >>    > My general approach to conflicting metadata is simply to define
    >    >>    > precedence orders.
    >    >>    >
    >    >>    > For example here is one documented from OODT:
    >    >>    >
    >    >>    > https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
    >    >>    >
    >    >>    > We can do similar things with Tika, e.g.,
    >    >>    >
    >    >>    > [CoreMetadata.PROPERTIES]
    >    >>    > [ImageParser.METADATA]
    >    >>    > [TikaOCR.METADATA]
    >    >>
    >    >>    What happens if two different parsers both output the same bit of
    >    >>    metadata though? eg Tim's example of one giving dc:creator of Tim and
    >    >>    the second giving dc:creator of Chris?
    >    >>
    >    >>
    >    >>    Secondly, what about the XHTML sax events stream? I think that's
    >    >>    probably the harder case...
    >    >>
    >    >>    Nick
    >
    >
    >


Re: Not-yet-broken breaking changes for Tika 2?

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 5 Feb 2018, Chris Mattmann wrote:
> Let's have a go at implementing it! You know my thoughts (make it like 
> OODT ;) )

I'm still keen to hear how we can do the text content like OODT!

I have tried to copy the OODT model for the proposed metadata case though 
:)

Nick

> On 2/5/18, 8:37 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
>
>    Ping - anyone got any thoughts on the proposed metadata parser stuff, and
>    any ideas on the content part?
>
>    On Tue, 2 Jan 2018, Nick Burch wrote:
>    > On Thu, 26 Oct 2017, Chris Mattmann wrote:
>    >> On collision, the precedence order defines what key takes precedence and
>    >> _overwrites_ the other. Overwrite is but one option (you could save *all*
>    >> the values it’s a multi-valued key structure so…)
>    >
>    > OK, I think that's fine. I've had a go at updating the wiki for the metadata
>    > case:
>    > https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
>    > And example Tika Config settings for it
>    > https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
>    > If people are happy with how that sounds/looks, I can have a stab at
>    > implementing it, as I *think* it's quite easy
>    >
>    >
>    > However... that still leaves the Context (XHTML SAX events) case to solve!
>    >
>    > Anyone have any ideas on how we can append to or cancel/reset the Content
>    > Handler series of SAX events when we move onto a second+ parser for a file?
>    >
>    > Thanks
>    > Nick
>    >
>    >> On 10/26/17, 9:43 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
>    >>
>    >>    On Thu, 26 Oct 2017, Chris Mattmann wrote:
>    >>    > My general approach to conflicting metadata is simply to define
>    >>    > precedence orders.
>    >>    >
>    >>    > For example here is one documented from OODT:
>    >>    >
>    >>    > https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
>    >>    >
>    >>    > We can do similar things with Tika, e.g.,
>    >>    >
>    >>    > [CoreMetadata.PROPERTIES]
>    >>    > [ImageParser.METADATA]
>    >>    > [TikaOCR.METADATA]
>    >>
>    >>    What happens if two different parsers both output the same bit of
>    >>    metadata though? eg Tim's example of one giving dc:creator of Tim and
>    >>    the second giving dc:creator of Chris?
>    >>
>    >>
>    >>    Secondly, what about the XHTML sax events stream? I think that's
>    >>    probably the harder case...
>    >>
>    >>    Nick
>
>
>

Re: Not-yet-broken breaking changes for Tika 2?

Posted by Chris Mattmann <ma...@apache.org>.
Let's have a go at implementing it! You know my thoughts (make it like OODT ;) )
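
For the metadata half, the precedence-order idea quoted further down could be sketched roughly as follows. This is only an illustration under assumptions: PrecedenceMerge, OnCollision and the map-of-lists shape are made-up names for the example, not Tika's Metadata API or OODT code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not Tika's Metadata class): merge per-parser metadata by precedence. */
class PrecedenceMerge {

    /** What to do when two parsers emit the same key. */
    enum OnCollision { OVERWRITE, KEEP_ALL }

    /**
     * Merge metadata from several parsers.
     *
     * @param sources per-parser metadata, ordered from lowest to highest precedence
     * @param policy  collision policy applied when a key is seen more than once
     */
    static Map<String, List<String>> merge(List<Map<String, List<String>>> sources,
                                           OnCollision policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> source : sources) {
            for (Map.Entry<String, List<String>> e : source.entrySet()) {
                if (policy == OnCollision.OVERWRITE || !merged.containsKey(e.getKey())) {
                    // Later (higher-precedence) sources replace earlier values.
                    merged.put(e.getKey(), new ArrayList<>(e.getValue()));
                } else {
                    // Multi-valued key: keep every value, in precedence order.
                    merged.get(e.getKey()).addAll(e.getValue());
                }
            }
        }
        return merged;
    }
}
```

With Tim's dc:creator example, OVERWRITE leaves only the value from the higher-precedence parser, while KEEP_ALL keeps both values in precedence order.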



On 2/5/18, 8:37 AM, "Nick Burch" <ap...@gagravarr.org> wrote:

    Ping - anyone got any thoughts on the proposed metadata parser stuff, and 
    any ideas on the content part?
    
    On Tue, 2 Jan 2018, Nick Burch wrote:
    > On Thu, 26 Oct 2017, Chris Mattmann wrote:
    >> On collision, the precedence order defines what key takes precedence and 
    >> _overwrites_ the other. Overwrite is but one option (you could save *all* 
    >> the values it’s a multi-valued key structure so…)
    >
    > OK, I think that's fine. I've had a go at updating the wiki for the metadata 
    > case:
    > https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
    > And example Tika Config settings for it
    > https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
    > If people are happy with how that sounds/looks, I can have a stab at 
    > implementing it, as I *think* it's quite easy
    >
    >
    > However... that still leaves the Context (XHTML SAX events) case to solve!
    >
    > Anyone have any ideas on how we can append to or cancel/reset the Content 
    > Handler series of SAX events when we move onto a second+ parser for a file?
    >
    > Thanks
    > Nick
    >
    >> On 10/26/17, 9:43 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
    >>
    >>    On Thu, 26 Oct 2017, Chris Mattmann wrote:
    >>    > My general approach to conflicting metadata is simply to define
    >>    > precedence orders.
    >>    >
    >>    > For example here is one documented from OODT:
    >>    >
    >>    > https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
    >>    >
    >>    > We can do similar things with Tika, e.g.,
    >>    >
    >>    > [CoreMetadata.PROPERTIES]
    >>    > [ImageParser.METADATA]
    >>    > [TikaOCR.METADATA]
    >>
    >>    What happens if two different parsers both output the same bit of
    >>    metadata though? eg Tim's example of one giving dc:creator of Tim and
    >>    the second giving dc:creator of Chris?
    >>
    >>
    >>    Secondly, what about the XHTML sax events stream? I think that's
    >>    probably the harder case...
    >>
    >>    Nick



RE: Not-yet-broken breaking changes for Tika 2?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
On the metadata stuff, I'm coming around to Ray Gauss's proposal.  I wanted too much back then, and his solution is super elegant, IIRC.

-----Original Message-----
From: Nick Burch [mailto:apache@gagravarr.org] 
Sent: Monday, February 5, 2018 11:37 AM
To: dev@tika.apache.org
Subject: Re: Not-yet-broken breaking changes for Tika 2?

Ping - anyone got any thoughts on the proposed metadata parser stuff, and any ideas on the content part?

On Tue, 2 Jan 2018, Nick Burch wrote:
> On Thu, 26 Oct 2017, Chris Mattmann wrote:
>> On collision, the precedence order defines what key takes precedence 
>> and _overwrites_ the other. Overwrite is but one option (you could 
>> save *all* the values it’s a multi-valued key structure so…)
>
> OK, I think that's fine. I've had a go at updating the wiki for the
> metadata case:
> https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
> And example Tika Config settings for it
> https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
> If people are happy with how that sounds/looks, I can have a stab at 
> implementing it, as I *think* it's quite easy
>
>
> However... that still leaves the Context (XHTML SAX events) case to solve!
>
> Anyone have any ideas on how we can append to or cancel/reset the 
> Content Handler series of SAX events when we move onto a second+ parser for a file?
>
> Thanks
> Nick
>
>> On 10/26/17, 9:43 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
>>
>>    On Thu, 26 Oct 2017, Chris Mattmann wrote:
>>    > My general approach to conflicting metadata is simply to define
>>    > precedence orders.
>>    >
>>    > For example here is one documented from OODT:
>>    >
>>    > https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
>>    >
>>    > We can do similar things with Tika, e.g.,
>>    >
>>    > [CoreMetadata.PROPERTIES]
>>    > [ImageParser.METADATA]
>>    > [TikaOCR.METADATA]
>>
>>    What happens if two different parsers both output the same bit of
>>    metadata though? eg Tim's example of one giving dc:creator of Tim and
>>    the second giving dc:creator of Chris?
>>
>>
>>    Secondly, what about the XHTML sax events stream? I think that's
>>    probably the harder case...
>>
>>    Nick

Re: Not-yet-broken breaking changes for Tika 2?

Posted by Nick Burch <ap...@gagravarr.org>.
Ping - anyone got any thoughts on the proposed metadata parser stuff, and 
any ideas on the content part?

On Tue, 2 Jan 2018, Nick Burch wrote:
> On Thu, 26 Oct 2017, Chris Mattmann wrote:
>> On collision, the precedence order defines what key takes precedence and 
>> _overwrites_ the other. Overwrite is but one option (you could save *all* 
>> the values it’s a multi-valued key structure so…)
>
> OK, I think that's fine. I've had a go at updating the wiki for the metadata 
> case:
> https://wiki.apache.org/tika/CompositeParserDiscussion#Supplementary.2FAdditive
> And example Tika Config settings for it
> https://wiki.apache.org/tika/CompositeParserDiscussion#line-20
> If people are happy with how that sounds/looks, I can have a stab at 
> implementing it, as I *think* it's quite easy
>
>
> However... that still leaves the Context (XHTML SAX events) case to solve!
>
> Anyone have any ideas on how we can append to or cancel/reset the Content 
> Handler series of SAX events when we move onto a second+ parser for a file?
>
> Thanks
> Nick
>
>> On 10/26/17, 9:43 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
>>
>>    On Thu, 26 Oct 2017, Chris Mattmann wrote:
>>    > My general approach to conflicting metadata is simply to define
>>    > precedence orders.
>>    >
>>    > For example here is one documented from OODT:
>>    >
>>    > https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
>>    >
>>    > We can do similar things with Tika, e.g.,
>>    >
>>    > [CoreMetadata.PROPERTIES]
>>    > [ImageParser.METADATA]
>>    > [TikaOCR.METADATA]
>>
>>    What happens if two different parsers both output the same bit of
>>    metadata though? eg Tim's example of one giving dc:creator of Tim and
>>    the second giving dc:creator of Chris?
>>
>>
>>    Secondly, what about the XHTML sax events stream? I think that's
>>    probably the harder case...
>>
>>    Nick