Posted to dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2020/08/14 13:06:16 UTC
Tika 2.0 modularization
All,
I _think_ I might have some time to start working on integrating Bob's
work on the current main branch. I'll have to ignore most of the incoming
issues for a bit...unlike the last 4 years...this time I mean it. :)
Let me know if there are any objections to heading down this path now.
Cheers,
Tim
Re: [EXTERNAL] Tika 2.0 modularization
Posted by Tim Allison <ta...@apache.org>.
Thank you!
>Somehow I did not find a couple of parsers, probably it is because of
>on-going work ...
Yep. Exactly. I didn't want to put in the work in this direction if there
were any showstoppers.
>If we are going to make Tika more modern, maybe gradle can do a trick?
My gradle isn't as strong as maven, but if you or anyone else wants to
translate, I'd be good with that. Let me do the maven modularization
first? How much effort would this be?
>Do we plan to add new Java "goodies" like lambdas, foreign-memory access
>API, records
Elasticsearch is already at 11, and the next version of Solr requires 11.
I'm happy keeping Tika at 1.8 or moving to 11. I think 14 is a bit too
cutting edge for Tika 2.0.0...maybe 3.0.0?
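For context on the language-level question: lambdas have been part of the language since Java 8, so they already compile at the 1.8 level mentioned above, whereas records were only a preview feature in Java 14 and the foreign-memory access API was still an incubator module there. A minimal sketch of the lambda style at the 1.8 level follows; the class, method, and sample names are made up for illustration and are not Tika APIs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these names come from Tika.
public class LambdaSketch {

    // Keeps the names ending in "Parser" and sorts them, using a lambda
    // and the streams API, both available since Java 8.
    public static List<String> parserNames(List<String> classNames) {
        return classNames.stream()
                .filter(name -> name.endsWith("Parser"))
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("PdfParser", "HtmlDetector", "HtmlParser");
        System.out.println(parserNames(names));
    }
}
```

Nothing in this sketch needs a baseline newer than 1.8; only the records and foreign-memory parts of the question would force a move past 11.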
Any thoughts on what we do with Jigsaw? Should we shoot the moon and move
to 11 and Jigsaw, go with multi-release jars, or just go with what we have
and make modest changes so that we aren't hostile to folks using Jigsaw?
Re: [EXTERNAL] Tika 2.0 modularization
Posted by Oleg Tikhonov <ol...@apache.org>.
Hi Tim,
looks awesome.
Somehow I did not find a couple of parsers, probably it is because of
on-going work ...
In addition, I was thinking about "getting rid of" maven. If we are going
to make Tika more modern, maybe gradle can do a trick?
Do we plan to add new Java "goodies" like lambdas, foreign-memory access
API, records ...
WDYT?
BR,
Oleg
Re: [EXTERNAL] Tika 2.0 modularization
Posted by Ken Krugler <kk...@transpac.com>.
Hi Tim,
I looked at the HTML module, and it seems logical/straightforward.
Thanks for pushing on this.
— Ken
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
Re: [EXTERNAL] Tika 2.0 modularization
Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Tim
It looks good. Perfect.
Do you plan to have tika-parsers reuse the new modules as its dependencies?
Cheers, Sergey
Re: [EXTERNAL] Tika 2.0 modularization
Posted by Bob Paulin <bo...@bobpaulin.com>.
Hey Tim,
Just started taking a look. The test-jar approach could work but I
recall I ran into some issues with getting access to some of the test
files inside the test-jars for some of the junits. For many tests this
was simple but for some I think it would require larger functional
changes to the code that I was not comfortable proposing at the time.
Makes sense to try this path again and see if you can get further than I
did.
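One general Java behavior that can make test files inside a test-jar awkward to reach (an assumption about the kind of issue involved, not something confirmed in this thread): a resource packaged in a jar is reported with a `jar:` URL and cannot be opened as a `java.io.File`, so tests that expect file paths have to switch to streams. A minimal sketch, using a throwaway jar as a stand-in for a test-jar; the class name and the sample entry are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

// Hypothetical illustration; this is not code from the Tika build.
public class TestJarResourceSketch {

    // Builds a temporary jar with a single entry, standing in for a test-jar.
    public static Path buildJar(String entryName, String content) throws IOException {
        Path jar = Files.createTempFile("sketch-tests", ".jar");
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jos = new JarOutputStream(out)) {
            jos.putNextEntry(new JarEntry(entryName));
            jos.write(content.getBytes(StandardCharsets.UTF_8));
            jos.closeEntry();
        }
        return jar;
    }

    // Returns the URL scheme the class loader reports for the resource.
    public static String resourceProtocol(Path jar, String entryName) throws IOException {
        try (URLClassLoader loader = new URLClassLoader(new URL[] {jar.toUri().toURL()}, null)) {
            return loader.getResource(entryName).getProtocol();
        }
    }

    // Reads the resource as a stream, which works whether it lives in a
    // directory or inside a jar.
    public static String readResource(Path jar, String entryName) throws IOException {
        try (URLClassLoader loader = new URLClassLoader(new URL[] {jar.toUri().toURL()}, null);
             InputStream in = loader.getResourceAsStream(entryName)) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String entry = "test-documents/sample.txt"; // hypothetical test file
        Path jar = buildJar(entry, "hello");
        try {
            System.out.println(resourceProtocol(jar, entry));
            System.out.println(readResource(jar, entry));
        } finally {
            Files.deleteIfExists(jar);
        }
    }
}
```

Tests that already read their inputs through `getResourceAsStream` would be unaffected by such a move; tests that hand a path to code expecting a real file are the ones that would need functional changes.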
- Bob
Re: [EXTERNAL] Tika 2.0 modularization
Posted by Tim Allison <ta...@apache.org>.
If anyone has any time, please take a look here:
https://github.com/apache/tika/tree/branch_2x/tika-parser-modules
Does this basically look ok?
I've put the integration tests in
https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
... that doesn't build yet.
I've flipped Bob's design so that the integration tests pull test files
from the individual parser modules via test-jar.
Re: [EXTERNAL] Tika 2.0 modularization
Posted by Bob Paulin <bo...@bobpaulin.com>.
+1 excited about this.
- Bob
Re: [EXTERNAL] Tika 2.0 modularization
Posted by Sergey Beryozkin <sb...@gmail.com>.
+1 😀
Cheers Sergey
Re: [EXTERNAL] Tika 2.0 modularization
Posted by Chris Mattmann <ma...@apache.org>.
Haha I’m down and supportive!
Time’s TIME FOR 2.x 😊