Posted to dev@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2018/03/01 12:21:51 UTC

Tika 1.18?

All,
There have been some important bug fixes, a few new capabilities, and the upgrading of dependencies because of CVEs.  There are a bunch of mime tickets from Andreas Meier that I’d like to get into 1.18.  Is there anything else that is critical?
Schedule-wise, I propose getting changes in by, say, next Friday (3/9), running regression tests the next week, and cutting RC1 the following week[0]?
WDYT?

Cheers,

            Tim

[0] week = “open source week” which can be significantly longer than a calendar week when surprises emerge. 😊

Timothy B. Allison, Ph.D.
Principal Artificial Intelligence Engineer
T835/Human Language Technology
The MITRE Corporation
7515 Colshire Drive, McLean, VA  22102
703-983-2473 (phone); 703-983-1379 (fax)



Re: Tika 1.18?

Posted by Luís Filipe Nassif <lf...@gmail.com>.
I thought about logging any custom-mimetype override that is applied, so
the user will be warned about it. Maybe we could additionally create a
specific attribute in the mimetype definition XML to configure that a custom
type must override the default one instead of aborting. As for multiple
conflicting custom mimes from different (external) projects, Tika currently
aborts, and that is already a problem now.

So I think it needs additional discussion and should not be addressed in
the next release. I will copy/paste this discussion into the JIRA issue.
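
To make the idea above concrete, here is a minimal sketch of the proposed behaviour, not Tika's actual implementation: a custom glob flagged as an override replaces a built-in one and a warning is recorded, while any other conflict (including two custom definitions claiming the same glob) still aborts. The class GlobRegistry, the Source enum, the override flag and the two media type names are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of glob registration; this is not Tika's API. */
class GlobRegistry {
    /** Where a glob registration came from. */
    enum Source { BUILT_IN, CUSTOM }

    private static final class Entry {
        final String mimeType;
        final Source source;

        Entry(String mimeType, Source source) {
            this.mimeType = mimeType;
            this.source = source;
        }
    }

    private final Map<String, Entry> globs = new HashMap<>();
    private final List<String> warnings = new ArrayList<>();

    /**
     * Registers a glob. The override flag stands in for the proposed
     * (assumed) XML attribute on a custom mimetype definition.
     */
    void register(String glob, String mimeType, Source source, boolean override) {
        Entry existing = globs.get(glob);
        if (existing == null) {
            globs.put(glob, new Entry(mimeType, source));
            return;
        }
        if (existing.mimeType.equals(mimeType)) {
            return; // same glob, same type: nothing to resolve
        }
        if (existing.source == Source.BUILT_IN && source == Source.CUSTOM && override) {
            // Record the override so the user is warned about it.
            warnings.add("Glob " + glob + ": custom type " + mimeType
                    + " overrides built-in type " + existing.mimeType);
            globs.put(glob, new Entry(mimeType, source));
            return;
        }
        // Everything else, including custom vs. custom, still aborts.
        throw new IllegalStateException("Conflicting glob " + glob + ": "
                + existing.mimeType + " vs " + mimeType);
    }

    String lookup(String glob) {
        Entry e = globs.get(glob);
        return e == null ? null : e.mimeType;
    }

    List<String> warnings() {
        return warnings;
    }

    public static void main(String[] args) {
        GlobRegistry registry = new GlobRegistry();
        // Placeholder type names, chosen only for the example.
        registry.register("*.mts", "application/x-builtin-example", Source.BUILT_IN, false);
        registry.register("*.mts", "video/x-custom-example", Source.CUSTOM, true);
        System.out.println(registry.lookup("*.mts"));
        registry.warnings().forEach(System.out::println);
    }
}
```

The sketch deliberately leaves the custom-versus-custom case as a hard failure, since that is the part that still needs discussion.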

However, I would like to see the detection of MTS videos fixed, although it
conflicts with another existing mime glob. Is there any workaround for this
specific case? If yes, I can open a different ticket.



On 2 Mar 2018 18:23, "Nick Burch" <ap...@gagravarr.org> wrote:

On Fri, 2 Mar 2018, Luís Filipe Nassif wrote:

> If I make no progress on TIKA-1466 until 3/9, you can start the release
> process without it. But do you devs agree with the proposed change: allow
> overriding of glob patterns in custom-mimetypes.xml?
>

What happens if you have two different custom files which both claim the
same glob?

We have historically been a bit stricter about built-in types overriding,
in part to avoid people doing silly things by mistake, and in part to push
people a bit more towards contributing fixes/enhancements for built-in
types. I think the latter is less of a thing today, as we've a lot more
covered as standard, so it's just the former we need to worry about.

How do we help people know when they have conflicting overrides (possibly
from different projects), help them sensibly merge or turn off Tika
provided magic+definitions, and to alert them to when their copied +
customised version probably wants updating following a tika upgrade giving
a newer definition? Do a better job of those than we currently do now, then
I'm very happy to +1 it :)

Nick

Re: Tika 1.18?

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 2 Mar 2018, Luís Filipe Nassif wrote:
> If I make no progress on TIKA-1466 until 3/9, you can start the release
> process without it. But do you devs agree with the proposed change: allow
> overriding of glob patterns in custom-mimetypes.xml?

What happens if you have two different custom files which both claim the 
same glob?

We have historically been a bit stricter about built-in types overriding, 
in part to avoid people doing silly things by mistake, and in part to push 
people a bit more towards contributing fixes/enhancements for built-in 
types. I think the latter is less of a thing today, as we've a lot more 
covered as standard, so it's just the former we need to worry about.

How do we help people know when they have conflicting overrides (possibly 
from different projects), help them sensibly merge or turn off Tika 
provided magic+definitions, and to alert them to when their copied + 
customised version probably wants updating following a tika upgrade giving 
a newer definition? Do a better job of those than we currently do now, 
then I'm very happy to +1 it :)

Nick

RE: Tika 1.18?

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 12 Mar 2018, Allison, Timothy B. wrote:
> Anyone have anything they'd like to get in before I run the regression 
> tests?  I can certainly put it off a few days.

I've made some progress on the metadata-only fallback/merge multiple 
parser work from https://wiki.apache.org/tika/CompositeParserDiscussion, 
but it's some way off finished yet. I don't think I can cause any 
regressions though! It can also wait for 1.19 if I don't get it stable in 
time to come off a branch.

Nick

RE: Tika 1.18?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
I'm working with PDFBox on regression tests for 2.0.9 now.  I'll probably kick off our own preliminary full corpus regression tests shortly... ~2018-03-12T20:00 UTC 

Anyone have anything they'd like to get in before I run the regression tests?  I can certainly put it off a few days.

Cheers,

             Tim

-----Original Message-----
From: Chris Mattmann [mailto:mattmann@apache.org] 
Sent: Wednesday, March 7, 2018 4:57 PM
To: dev@tika.apache.org
Subject: Re: Tika 1.18?

Sounds good to me, thanks Tim. Happy to line it up with PDFBox 2.0.9.


Re: Tika 1.18?

Posted by Chris Mattmann <ma...@apache.org>.
Sounds good to me, thanks Tim. Happy to line it up with PDFBox 2.0.9.


On 3/7/18, 1:16 PM, "Allison, Timothy B." <ta...@mitre.org> wrote:

    All,
    
      I think I've made the updates that I wanted to make sure got in to 1.18.  It looks like PDFBox is going to start their release cycle shortly.  Should we wait for PDFBox 2.0.9?    
    
      That may add a week or two to our release, although, frankly, it might not.  We can start running the regression tests March 9(ish) and see if anything dire appears...
    
      Cheers,
    
              Tim
    
    



RE: Tika 1.18?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
All,

  I think I've made the updates that I wanted to make sure got in to 1.18.  It looks like PDFBox is going to start their release cycle shortly.  Should we wait for PDFBox 2.0.9?    

  That may add a week or two to our release, although, frankly, it might not.  We can start running the regression tests March 9(ish) and see if anything dire appears...

  Cheers,

          Tim


RE: Tika 1.18?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
> But do you devs agree with the proposed change: allow overriding of glob patterns in custom-mimetypes.xml?

+1 from me

From: Luís Filipe Nassif [mailto:lfcnassif@gmail.com]
Sent: Friday, March 2, 2018 8:21 AM
To: Allison, Timothy B. <ta...@mitre.org>
Cc: dev@tika.apache.org
Subject: Re: Tika 1.18?

If I make no progress on TIKA-1466 until 3/9, you can start the release process without it. But do you devs agree with the proposed change: allow overriding of glob patterns in custom-mimetypes.xml?


Re: Tika 1.18?

Posted by Luís Filipe Nassif <lf...@gmail.com>.
If I make no progress on TIKA-1466 until 3/9, you can start the release
process without it. But do you devs agree with the proposed change: allow
overriding of glob patterns in custom-mimetypes.xml?

2018-03-02 10:03 GMT-03:00 Allison, Timothy B. <ta...@mitre.org>:

> TIKA-2591 and TIKA-2568
> +1
>
> TIKA-1466 -- how long will it take, do you think?  This seems potentially
> non-trivial...
>
> -----Original Message-----
> From: Luís Filipe Nassif [mailto:lfcnassif@gmail.com]
> Sent: Thursday, March 1, 2018 5:41 PM
> To: dev@tika.apache.org
> Subject: Re: Tika 1.18?
>
> I think we should work around TIKA-2591, and I would like to work on
> TIKA-1466 (what do you think?) and fix TIKA-2568.
>
> Cheers,
> Luis
>

RE: Tika 1.18?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
TIKA-2591 and TIKA-2568
+1

TIKA-1466 -- how long will it take, do you think?  This seems potentially non-trivial...

-----Original Message-----
From: Luís Filipe Nassif [mailto:lfcnassif@gmail.com] 
Sent: Thursday, March 1, 2018 5:41 PM
To: dev@tika.apache.org
Subject: Re: Tika 1.18?

I think we should work around TIKA-2591, and I would like to work on TIKA-1466 (what do you think?) and fix TIKA-2568.

Cheers,
Luis

Re: Tika 1.18?

Posted by Luís Filipe Nassif <lf...@gmail.com>.
I think we should work around TIKA-2591, and I would like to work
on TIKA-1466 (what do you think?) and fix TIKA-2568.

Cheers,
Luis


2018-03-01 13:24 GMT-03:00 Chris Mattmann <ma...@apache.org>:

> Same: makes perfect sense to me, and let's do it. I just updated (finally)
> Tika Python downstream to be based on Tika 1.16; I guess I should get it
> based on 1.17 soon too:
>
> https://github.com/chrismattmann/tika-python/blob/master/tika/__init__.py#L17
>
> Cheers,
> Chris
>
> On 3/1/18, 5:16 AM, "Nick Burch" <ap...@gagravarr.org> wrote:
>
>     On Thu, 1 Mar 2018, Allison, Timothy B. wrote:
>     > There have been some important bug fixes, a few new capabilities, and
>     > the upgrading of dependencies because of CVEs.  There are a bunch of
>     > mime tickets from Andreas Meier that I’d like to get into 1.18.  Is
>     > there anything else that is critical?
>
>     I've had a busy few weeks, so haven't yet had a chance to try out my
>     proposed multi-parser stuff for 2.x. I'll hopefully take a look next
>     week, assuming even the fastest review cycle and everyone loving it, I
>     can't see us being ready to all sign-off on those "2.x breaking
>     changes" until probably April.
>
>     Given that, doing an interim 1.x release soon makes sense to me!
>
>     Nick
>
>
>

Re: Tika 1.18?

Posted by Chris Mattmann <ma...@apache.org>.
Same: makes perfect sense to me, and let's do it. I just updated (finally) Tika Python
downstream to be based on Tika 1.16; I guess I should get it based on 1.17 soon too:

https://github.com/chrismattmann/tika-python/blob/master/tika/__init__.py#L17

Cheers,
Chris

On 3/1/18, 5:16 AM, "Nick Burch" <ap...@gagravarr.org> wrote:

    On Thu, 1 Mar 2018, Allison, Timothy B. wrote:
    > There have been some important bug fixes, a few new capabilities, and 
    > the upgrading of dependencies because of CVEs.  There are a bunch of 
    > mime tickets from Andreas Meier that I’d like to get into 1.18.  Is 
    > there anything else that is critical?
    
    I've had a busy few weeks, so haven't yet had a chance to try out my 
    proposed multi-parser stuff for 2.x. I'll hopefully take a look next week, 
    assuming even the fastest review cycle and everyone loving it, I can't see 
    us being ready to all sign-off on those "2.x breaking changes" until 
    probably April.
    
    Given that, doing an interim 1.x release soon makes sense to me!
    
    Nick



Re: Tika 1.18?

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 1 Mar 2018, Allison, Timothy B. wrote:
> There have been some important bug fixes, a few new capabilities, and 
> the upgrading of dependencies because of CVEs.  There are a bunch of 
> mime tickets from Andreas Meier that I’d like to get into 1.18.  Is 
> there anything else that is critical?

I've had a busy few weeks, so haven't yet had a chance to try out my 
proposed multi-parser stuff for 2.x. I'll hopefully take a look next week, 
assuming even the fastest review cycle and everyone loving it, I can't see 
us being ready to all sign-off on those "2.x breaking changes" until 
probably April.

Given that, doing an interim 1.x release soon makes sense to me!

Nick