Posted to dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2020/08/14 13:06:16 UTC

Tika 2.0 modularization

All,
  I _think_ I might have some time to start working on integrating Bob's
work on the current main branch.  I'll have to ignore most of the incoming
issues for a bit...unlike the last 4 years...this time I mean it. :)
  Let me know if there are any objections to heading down this path now.

   Cheers,

              Tim

Re: [EXTERNAL] Tika 2.0 modularization

Posted by Tim Allison <ta...@apache.org>.
Thank you!

>Somehow I did not find a couple of parsers, probably it is because of
>on-going work ...

Yep.  Exactly.  I didn't want to put in the work in this direction if there
were any showstoppers.

>If we are going to make Tika more modern, maybe Gradle can do the trick?

My Gradle isn't as strong as my Maven, but if you or anyone else wants to
translate, I'd be good with that.  Let me do the Maven modularization
first?  How much effort would this be?

>Do we plan to add new Java "goodies" like lambdas, foreign-memory access
>API, records

Elasticsearch is already at Java 11, and the next version of Solr requires
Java 11.  I'm happy keeping Tika at Java 8 or moving to 11.  I think 14 is
a bit too cutting edge for Tika 2.0.0...maybe 3.0.0?

Any thoughts on what we do with Jigsaw?  Should we shoot the moon and move
to 11 and Jigsaw, go with multi-release jars, or just go with what we have
and make modest changes so that we aren't hostile to folks using Jigsaw?



On Tue, Aug 18, 2020 at 11:38 AM Oleg Tikhonov <ol...@apache.org> wrote:

> Hi Tim,
> looks awesome.
> Somehow I did not find a couple of parsers, probably it is because of
> on-going work ...
> In addition, I was thinking about "getting rid of" Maven. If we are going
> to make Tika more modern, maybe Gradle can do the trick?
> Do we plan to add new Java "goodies" like lambdas, foreign-memory access
> API, records ...
>
> WDYT?
> BR,
> Oleg

Re: [EXTERNAL] Tika 2.0 modularization

Posted by Oleg Tikhonov <ol...@apache.org>.
Hi Tim,
looks awesome.
Somehow I did not find a couple of parsers, probably it is because of
on-going work ...
In addition, I was thinking about "getting rid of" Maven. If we are going
to make Tika more modern, maybe Gradle can do the trick?
Do we plan to add new Java "goodies" like lambdas, foreign-memory access
API, records ...

WDYT?
BR,
Oleg




On Tue, Aug 18, 2020 at 5:41 PM Tim Allison <ta...@apache.org> wrote:

> If anyone has any time, please take a look here:
> https://github.com/apache/tika/tree/branch_2x/tika-parser-modules
>
> Does this basically look ok?
>
> I've put the integration tests in
> https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
> ... that doesn't build yet.
>
> I've flipped Bob's design so that the integration tests pull test files
> from the individual parser modules via test-jar.

Re: [EXTERNAL] Tika 2.0 modularization

Posted by Ken Krugler <kk...@transpac.com>.
Hi Tim,

I looked at the HTML module, and it seems logical/straightforward.

Thanks for pushing on this.

— Ken

> On Aug 18, 2020, at 7:40 AM, Tim Allison <ta...@apache.org> wrote:
> 
> If anyone has any time, please take a look here:
> https://github.com/apache/tika/tree/branch_2x/tika-parser-modules
> 
> Does this basically look ok?
> 
> I've put the integration tests in
> https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
> ... that doesn't build yet.
> 
> I've flipped Bob's design so that the integration tests pull test files
> from the individual parser modules via test-jar.

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr


Re: [EXTERNAL] Tika 2.0 modularization

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Tim

It looks good. Perfect.
Do you plan to have tika-parsers reuse the new modules as its dependencies?

Cheers, Sergey

On Tue, Aug 18, 2020 at 3:41 PM Tim Allison <ta...@apache.org> wrote:

> If anyone has any time, please take a look here:
> https://github.com/apache/tika/tree/branch_2x/tika-parser-modules
>
> Does this basically look ok?
>
> I've put the integration tests in
> https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
> ... that doesn't build yet.
>
> I've flipped Bob's design so that the integration tests pull test files
> from the individual parser modules via test-jar.

Re: [EXTERNAL] Tika 2.0 modularization

Posted by Bob Paulin <bo...@bobpaulin.com>.
Hey Tim,

Just started taking a look.  The test-jar approach could work, but I
recall running into some issues getting access to some of the test
files inside the test-jars for some of the JUnit tests.  For many tests
this was simple, but for some I think it would require larger functional
changes to the code that I was not comfortable proposing at the time.

Makes sense to try this path again and see if you can get further than I
did.
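
One common cause of this kind of trouble (an assumption here, since the
thread does not say which tests failed) is test code that turns a classpath
resource URL into a java.io.File. That works while the test files sit in a
directory, but not once they are packaged in a test-jar, where the URL uses
the "jar" protocol. The sketch below builds a throwaway jar as a stand-in
for a parser module's test-jar and reads a resource from it as a stream.
The class name, the helper methods, and the resource name
"test-documents/example.txt" are all hypothetical and are not taken from
the Tika code base.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

public class TestJarResourceSketch {

    // Hypothetical resource name; not an actual Tika test document.
    static final String RESOURCE = "test-documents/example.txt";

    /** Builds a tiny jar holding one resource, standing in for a module's test-jar. */
    static Path buildFakeTestJar() throws IOException {
        Path jar = Files.createTempFile("fake-parser-module-tests", ".jar");
        try (OutputStream os = Files.newOutputStream(jar);
             JarOutputStream jos = new JarOutputStream(os)) {
            jos.putNextEntry(new JarEntry(RESOURCE));
            jos.write("hello from a test-jar".getBytes(StandardCharsets.UTF_8));
            jos.closeEntry();
        }
        return jar;
    }

    /** Reads a classpath resource as a stream; works for directories and jars alike. */
    static String readResource(ClassLoader loader, String name) throws IOException {
        try (InputStream is = loader.getResourceAsStream(name)) {
            if (is == null) {
                throw new IOException("resource not found: " + name);
            }
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = is.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** Returns true only if the URL points at a plain file on disk. */
    static boolean isPlainFile(URL url) {
        return "file".equals(url.getProtocol());
    }

    public static void main(String[] args) throws Exception {
        Path jar = buildFakeTestJar();
        // Parent is null so only the fake test-jar (plus the bootstrap loader) is searched.
        try (URLClassLoader loader =
                 new URLClassLoader(new URL[] {jar.toUri().toURL()}, null)) {
            URL url = loader.getResource(RESOURCE);
            System.out.println("protocol: " + url.getProtocol());
            System.out.println("usable as java.io.File: " + isPlainFile(url));
            System.out.println("content: " + readResource(loader, RESOURCE));
        } finally {
            Files.deleteIfExists(jar);
        }
    }
}
```

Tests that only need an InputStream can switch to this pattern with little
effort. Tests that genuinely need a file path would have to copy the stream
to a temporary file first, which may be the kind of larger change described
above.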

- Bob

On 8/18/2020 9:40 AM, Tim Allison wrote:
> If anyone has any time, please take a look here:
> https://github.com/apache/tika/tree/branch_2x/tika-parser-modules
>
> Does this basically look ok?
>
> I've put the integration tests in
> https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
> ... that doesn't build yet.
>
> I've flipped Bob's design so that the integration tests pull test files
> from the individual parser modules via test-jar.

Re: [EXTERNAL] Tika 2.0 modularization

Posted by Tim Allison <ta...@apache.org>.
If anyone has any time, please take a look here:
https://github.com/apache/tika/tree/branch_2x/tika-parser-modules

Does this basically look ok?

I've put the integration tests in
https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
... that doesn't build yet.

I've flipped Bob's design so that the integration tests pull test files
from the individual parser modules via test-jar.

On Fri, Aug 14, 2020 at 3:30 PM Bob Paulin <bo...@bobpaulin.com> wrote:

> +1 excited about this.
>
> - Bob

Re: [EXTERNAL] Tika 2.0 modularization

Posted by Bob Paulin <bo...@bobpaulin.com>.
+1 excited about this.

- Bob

On 8/14/2020 11:29 AM, Sergey Beryozkin wrote:
> +1 😀
>
> Cheers Sergey

Re: [EXTERNAL] Tika 2.0 modularization

Posted by Sergey Beryozkin <sb...@gmail.com>.
+1 😀

Cheers Sergey

On Fri 14 Aug 2020, 18:26 Chris Mattmann, <ma...@apache.org> wrote:

> Haha  I’m down and supportive!
>
>
>
> Time’s TIME FOR 2.x 😊

Re: [EXTERNAL] Tika 2.0 modularization

Posted by Chris Mattmann <ma...@apache.org>.
Haha  I’m down and supportive!

Time’s TIME FOR 2.x 😊