Posted to user@tika.apache.org by Nick Burch <ni...@apache.org> on 2016/05/11 21:09:19 UTC

My "What's new with Apache Tika 2.0" talk slides

Hi All

For those who couldn't make it to Vancouver this week, the slides from my 
"What's new with Apache Tika 2.0" talk are now available online:
http://www.slideshare.net/NickBurch2/apache-tika-whats-new-with-20

The audio was recorded; hopefully it will be available to go with the 
slides in a few days' time.

Nick

RE: Testing 2.0-SNAPSHOT with Apache CXF Tika demo

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Agreed.  Thank you, Sergey!

>> I think the next steps should be determining if there are any other breaking features we'd like to include in 2.0 and perhaps we can get Tim to run the 2.x branch through his massive regression test :).

I think we're still a good way from this stage, but yes, I'll be happy to run those when it is time. :)

A few things that we still need to do:
1) remove deprecated metadata keys
2) figure out how to implement the resettable contenthandler in support of combo-parsers (for the back-off-on-exception mode).
3) make sure that mods to trunk actually made it into 2.0.  I think there are quite a few mods that have only been made to trunk.
4) Anything in lang detect module?
5) TIKA-1607 -- new metadata model (?)
...
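
As a rough illustration of item 2, here is a minimal Java sketch of a back-off-on-exception loop. It is not Tika code: TextParser, parseWithFallback and the StringBuilder standing in for a resettable content handler are all hypothetical names, used only to show why output from a failed attempt has to be discarded before the next parser is tried.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types are Tika APIs.
final class FallbackParsing {

    /** Stand-in for a parser that writes extracted text into a buffer and may fail midway. */
    interface TextParser {
        void parse(String input, StringBuilder out) throws Exception;
    }

    /**
     * Tries each parser in order. If one throws, the partial output it wrote
     * is discarded (the "reset") before the next parser is tried.
     */
    static String parseWithFallback(List<TextParser> parsers, String input) throws Exception {
        StringBuilder out = new StringBuilder();
        Exception last = null;
        for (TextParser parser : parsers) {
            out.setLength(0); // reset: drop anything written by a failed attempt
            try {
                parser.parse(input, out);
                return out.toString();
            } catch (Exception e) {
                last = e; // remember the failure and back off to the next parser
            }
        }
        if (last == null) {
            throw new IllegalArgumentException("no parsers supplied");
        }
        throw last; // every parser failed: surface the last failure
    }

    public static void main(String[] args) throws Exception {
        // First parser writes some output and then fails; second one succeeds.
        TextParser failing = (in, out) -> {
            out.append("partial ");
            throw new IllegalStateException("simulated parse failure");
        };
        TextParser working = (in, out) -> out.append(in.toUpperCase());
        System.out.println(parseWithFallback(Arrays.asList(failing, working), "hello"));
    }
}
```

A real content handler streams SAX events rather than filling a buffer, so the actual design would presumably need to buffer or replay events; the sketch only captures the control flow.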

Re: Testing 2.0-SNAPSHOT with Apache CXF Tika demo

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Bob
On 17/05/16 02:37, Bob Paulin wrote:
> Sergey,
>
> Great to hear the code works well with the new modules!  And I do agree
> that Tika has a number of application specific usecases that can be
> explored.  I think the other goal is making the upgrade paths easier so
> developers don't have to drag "JAR Hell" with them into their projects.
> It was good to see in your commit you got to remove the maven exclusions
> as well.  I think you can also remove the explicit tika-core entry as
> that should be a transitive dependency of any of the modules.

The reason I had to list it explicitly is that tika-core has 
provided scope in the CXF module where the Tika extensions are shipped. 
Indeed, I was surprised yesterday when I saw:

Caused by: java.lang.ClassNotFoundException: 
org.apache.tika.exception.AccessPermissionException

and then it was gone after I added a tika-core dependency, and it took 
me a while to realize why I had to do it. I guess we might have a 
dedicated CXF Tika module introduced later on with the strong 
dependencies...
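
One quick way to confirm this kind of missing-dependency problem is to check whether the class is visible on the runtime classpath. The following is a minimal Java sketch; ClasspathCheck and isOnClasspath are hypothetical names, not part of Tika or CXF, and only the class name is taken from the stack trace above.

```java
// Hypothetical helper, not part of Tika or CXF: reports whether a class
// can be loaded from the current classpath.
final class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.tika.exception.AccessPermissionException";
        System.out.println(name + " on classpath: " + isOnClasspath(name));
    }
}
```

If this reports false in a module that depends on a provided-scope artifact, the artifact is probably not being pulled in transitively and needs to be declared explicitly.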

> This type
> of working is a huge help in moving towards the 2.0 release.
>
Very minor effort on my part given all the work which has already been 
done in Tika 2.0 :-)

> I think the next steps should be determining if there are any other
> breaking features we'd like to include in 2.0 and perhaps we can get Tim
> to run the 2.x branch through his massive regression test :).
>
Cheers, Sergey
>
> - Bob
>
>
>
> On 5/16/2016 10:21 AM, Sergey Beryozkin wrote:
>> Hi All
>>
>> Hope this message will be more relevant compared to the one I posted
>> after a social event at Apache Con NA 2016 :-). I had a chance to talk
>> to Nick and Bob the next day and we agreed it would be good to have
>> Tika 2.0-SNAPSHOT tested a bit more. Specifically I committed to
>> updating a Tika-based demo we ship in Apache CXF to use 2.0-SNAPSHOT
>> module dependencies - no pressure is expected on CXF master in the
>> short term given that the master release won't happen in the next few
>> months for sure.
>>
>> FYI, in CXF we ship this demo:
>>
>> https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search
>>
>>
>> IMHO it is a very cool demo written by my CXF colleague Andriy Redko.
>> This demo was part of his NA 2015 presentation:
>>
>> http://events.linuxfoundation.org/sites/events/files/slides/Apache%20CXF%2C%20Tika%20and%20Lucene.pdf
>>
>>
>> Here is a demo description: a user can upload PDF or ODT files to a
>> JAX-RS service using an HTML form. The uploaded files are submitted to
>> a CXF Tika extensions:
>>
>> https://github.com/apache/cxf/tree/master/rt/rs/extensions/search/src/main/java/org/apache/cxf/jaxrs/ext/search/tika
>>
>>
>> with this code:
>>
>> https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L115
>>
>>
>> where the Tika reported content/metadata is saved with Lucene.
>>
>>
>> Next a user enter a search phrase and finds matching documents, with
>> the links to them being reported so that a user can download it.
>>
>> IMHO it is an interesting demo because it shows how Tika can help in
>> some application specific situations...
>>
>> Finally, to the actual experiment I did today. Updating the demo to
>> use individual parser modules was easy:
>>
>> http://git-wip-us.apache.org/repos/asf/cxf/commit/c2ccecb2
>>
>> All works well, better modularization in 2.0 will be welcomed
>>
>> Thanks, Sergey
>>
>>
>>
>>
>>
>>
>


Re: Testing 2.0-SNAPSHOT with Apache CXF Tika demo

Posted by Bob Paulin <bo...@bobpaulin.com>.
Sergey,

Great to hear the code works well with the new modules!  And I do agree 
that Tika has a number of application specific usecases that can be 
explored.  I think the other goal is making the upgrade paths easier so 
developers don't have to drag "JAR Hell" with them into their projects.  
It was good to see in your commit you got to remove the maven exclusions 
as well.  I think you can also remove the explicit tika-core entry as 
that should be a transitive dependency of any of the modules.  This type 
of working is a huge help in moving towards the 2.0 release.

I think the next steps should be determining if there are any other 
breaking features we'd like to include in 2.0 and perhaps we can get Tim 
to run the 2.x branch through his massive regression test :).


- Bob



On 5/16/2016 10:21 AM, Sergey Beryozkin wrote:
> Hi All
>
> Hope this message will be more relevant compared to the one I posted 
> after a social event at Apache Con NA 2016 :-). I had a chance to talk 
> to Nick and Bob the next day and we agreed it would be good to have 
> Tika 2.0-SNAPSHOT tested a bit more. Specifically I committed to 
> updating a Tika-based demo we ship in Apache CXF to use 2.0-SNAPSHOT 
> module dependencies - no pressure is expected on CXF master in the 
> short term given that the master release won't happen in the next few 
> months for sure.
>
> FYI, in CXF we ship this demo:
>
> https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search 
>
>
> IMHO it is a very cool demo written by my CXF colleague Andriy Redko. 
> This demo was part of his NA 2015 presentation:
>
> http://events.linuxfoundation.org/sites/events/files/slides/Apache%20CXF%2C%20Tika%20and%20Lucene.pdf 
>
>
> Here is a demo description: a user can upload PDF or ODT files to a 
> JAX-RS service using an HTML form. The uploaded files are submitted to 
> a CXF Tika extensions:
>
> https://github.com/apache/cxf/tree/master/rt/rs/extensions/search/src/main/java/org/apache/cxf/jaxrs/ext/search/tika 
>
>
> with this code:
>
> https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L115 
>
>
> where the Tika reported content/metadata is saved with Lucene.
>
>
> Next a user enter a search phrase and finds matching documents, with 
> the links to them being reported so that a user can download it.
>
> IMHO it is an interesting demo because it shows how Tika can help in 
> some application specific situations...
>
> Finally, to the actual experiment I did today. Updating the demo to 
> use individual parser modules was easy:
>
> http://git-wip-us.apache.org/repos/asf/cxf/commit/c2ccecb2
>
> All works well, better modularization in 2.0 will be welcomed
>
> Thanks, Sergey
>
>
>
>
>
>


Testing 2.0-SNAPSHOT with Apache CXF Tika demo

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi All

Hope this message will be more relevant compared to the one I posted 
after a social event at Apache Con NA 2016 :-). I had a chance to talk 
to Nick and Bob the next day and we agreed it would be good to have Tika 
2.0-SNAPSHOT tested a bit more. Specifically I committed to updating a 
Tika-based demo we ship in Apache CXF to use 2.0-SNAPSHOT module 
dependencies - no pressure is expected on CXF master in the short term 
given that the master release won't happen in the next few months for sure.

FYI, in CXF we ship this demo:

https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search

IMHO it is a very cool demo written by my CXF colleague Andriy Redko. 
This demo was part of his NA 2015 presentation:

http://events.linuxfoundation.org/sites/events/files/slides/Apache%20CXF%2C%20Tika%20and%20Lucene.pdf

Here is a demo description: a user can upload PDF or ODT files to a 
JAX-RS service using an HTML form. The uploaded files are submitted to the 
CXF Tika extension:

https://github.com/apache/cxf/tree/master/rt/rs/extensions/search/src/main/java/org/apache/cxf/jaxrs/ext/search/tika

with this code:

https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L115 


where the Tika reported content/metadata is saved with Lucene.


Next, a user enters a search phrase and finds matching documents, with the 
links to them being reported so that the user can download them.

IMHO it is an interesting demo because it shows how Tika can help in 
some application specific situations...

Finally, to the actual experiment I did today. Updating the demo to use 
individual parser modules was easy:

http://git-wip-us.apache.org/repos/asf/cxf/commit/c2ccecb2

All works well; the better modularization in 2.0 will be welcome.

Thanks, Sergey






Re: My "What's new with Apache Tika 2.0" talk slides

Posted by Sergey Beryozkin <sb...@gmail.com>.
Saw Nick passing by but by the time I was ready to say hi he was gone, 
tomorrow then :-) And I guess I've seen Ken too, but did not know it was 
him :-). And to make it relevant: Tika rocks of course :-)
On 12/05/16 04:41, Ken Krugler wrote:
> One annoying attendee kept asking about the new language detector support in 2.0 :)
>
> — Ken
>
>> On May 11, 2016, at 5:04pm, Allison, Timothy B. <ta...@mitre.org> wrote:
>>
>> Great slides.  Thank you, Nick.  Wish I could be there...
>>
>> Any feedback/guidance from the audience?
>>
>> -----Original Message-----
>> From: Nick Burch [mailto:nick@apache.org]
>> Sent: Wednesday, May 11, 2016 5:09 PM
>> To: user@tika.apache.org
>> Cc: dev@tika.apache.org
>> Subject: My "What's new with Apache Tika 2.0" talk slides
>>
>> Hi All
>>
>> For those who couldn't make it to Vancouver this week, the slides from my "What's new with Apache Tika 2.0" talk are now available online:
>> http://www.slideshare.net/NickBurch2/apache-tika-whats-new-with-20
>>
>> The audio was recorded, hopefully that will be available to go with the slides in a few days time
>>
>> Nick
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: My "What's new with Apache Tika 2.0" talk slides

Posted by Ken Krugler <kk...@transpac.com>.
One annoying attendee kept asking about the new language detector support in 2.0 :)

— Ken

> On May 11, 2016, at 5:04pm, Allison, Timothy B. <ta...@mitre.org> wrote:
> 
> Great slides.  Thank you, Nick.  Wish I could be there...
> 
> Any feedback/guidance from the audience?
> 
> -----Original Message-----
> From: Nick Burch [mailto:nick@apache.org] 
> Sent: Wednesday, May 11, 2016 5:09 PM
> To: user@tika.apache.org
> Cc: dev@tika.apache.org
> Subject: My "What's new with Apache Tika 2.0" talk slides
> 
> Hi All
> 
> For those who couldn't make it to Vancouver this week, the slides from my "What's new with Apache Tika 2.0" talk are now available online:
> http://www.slideshare.net/NickBurch2/apache-tika-whats-new-with-20
> 
> The audio was recorded, hopefully that will be available to go with the slides in a few days time
> 
> Nick

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




RE: My "What's new with Apache Tika 2.0" talk slides

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Great slides.  Thank you, Nick.  Wish I could be there...

Any feedback/guidance from the audience?

-----Original Message-----
From: Nick Burch [mailto:nick@apache.org] 
Sent: Wednesday, May 11, 2016 5:09 PM
To: user@tika.apache.org
Cc: dev@tika.apache.org
Subject: My "What's new with Apache Tika 2.0" talk slides

Hi All

For those who couldn't make it to Vancouver this week, the slides from my "What's new with Apache Tika 2.0" talk are now available online:
http://www.slideshare.net/NickBurch2/apache-tika-whats-new-with-20

The audio was recorded, hopefully that will be available to go with the slides in a few days time

Nick
