Posted to user@tika.apache.org by Sergey Beryozkin <sb...@gmail.com> on 2017/04/05 11:21:35 UTC

Re: Streaming and Tika

Hi All

Would it make sense to consider doing something like this for a single 
format, e.g. PDF, or another one that may be the most 'capable' of 
reporting its events in a pull-like fashion?

Tom, others, what do you think?

Cheers, Sergey
On 10/11/16 12:14, Sergey Beryozkin wrote:
> Hi All
>
> I've been looking at how to integrate Tika in some of the streaming
> pipelines, and I'm finding it difficult to set up with the
> callback-based SAX mechanism.
>
> Does it make sense to consider starting to add a StAX-like Parser API?
>
> So far the only reference to StAX I've seen is
> https://issues.apache.org/jira/browse/TIKA-1321
>
> Cheers, Sergey

Re: Streaming and Tika

Posted by Sergey Beryozkin <sb...@gmail.com>.

Hi Tim

Thanks, np at all,

The other day I thought of experimenting with integrating Tika into 
Apache Beam pipelines, where the input data is pulled from its source 
regularly. That is why I thought such an integration would need Tika to 
provide a pull-like parser interface.
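
To illustrate what "pull-like" means here, below is a minimal sketch that 
uses only the JDK's StAX reader on an XHTML string. It is not a Tika API: 
the class XhtmlPullDemo and the method firstParagraph are hypothetical 
names, and the input merely stands in for the kind of XHTML Tika produces. 
The point is that the consumer asks for the next event and can stop as 
soon as it has what it needs.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical demo class; not part of Tika. */
class XhtmlPullDemo {

    /**
     * Pulls events one at a time and returns the text of the first
     * <p> element, or null if there is none. Assumes the paragraph
     * contains text only (getElementText rejects child elements).
     */
    static String firstParagraph(String xhtml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Input is treated as untrusted, so DTD processing is switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xhtml));
            try {
                while (reader.hasNext()) {
                    // The consumer drives the parser: nothing is read
                    // until next() is called.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "p".equals(reader.getLocalName())) {
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XHTML", e);
        }
    }
}
```

With a callback-based handler the same early exit would need some way to 
abort the parse from inside a callback, which is the part that is awkward 
to fit into a pull-driven pipeline.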

I agree that simply attempting to convert Tika parsers to use StAX or 
similar is not realistic, but perhaps a POC around the XHTML parser could 
be attempted. That said, it probably does not make much sense, as it won't 
work for all (or most mainstream) Tika parsers anyway...

Thanks, Sergey

On 05/04/17 14:51, Allison, Timothy B. wrote:
> Sergey,
>
>   Good to hear from you.  I'm sorry for not responding sooner.
>
>   First a note on streaming and Tika.  If I understand correctly, from the very beginning of Tika the goal was for full streaming processing.  Unfortunately, for some file formats, we have to read the entire file before we can parse it, so streaming is somewhat of an illusion.  Also, for some files, metadata can't be extracted until after some of the contents are extracted which means that in some cases you'll get more metadata in the Metadata object than you'll get in our xhtml.
>
>   I've dabbled in StAX, and at one point, I found it easier to work with than SAX so I have some sympathy.
>
>   Given that everything in Tika is SAX based, I worry that the benefit isn't worth the effort of converting parsers to StAX.
>
>   What particulars about our SAX handlers make them not conducive to streaming in your case?  Is there anything we can change with less effort than moving to StAX that would help?
>
>   I'm not against you experimenting with a new PDFParser, but overall, it feels like it would be quite a bit of work.
>
>   If you want to work on new handlers, how about the rewriteable ones we need for Tika 2.0? :)
>
>        Cheers,
>
>                         Tim
>
>
>

RE: Streaming and Tika

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Sergey, 

  Good to hear from you.  I'm sorry for not responding sooner.  

  First a note on streaming and Tika.  If I understand correctly, from the very beginning of Tika the goal was for full streaming processing.  Unfortunately, for some file formats, we have to read the entire file before we can parse it, so streaming is somewhat of an illusion.  Also, for some files, metadata can't be extracted until after some of the contents are extracted which means that in some cases you'll get more metadata in the Metadata object than you'll get in our xhtml.

  I've dabbled in StAX, and at one point, I found it easier to work with than SAX so I have some sympathy.

  Given that everything in Tika is SAX based, I worry that the benefit isn't worth the effort of converting parsers to StAX.

  What particulars about our SAX handlers make them not conducive to streaming in your case?  Is there anything we can change with less effort than moving to StAX that would help?
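
  One lower-effort direction, sketched here under assumptions rather than as a proposal for Tika itself: leave the SAX side untouched, run the parse on a background thread, and hand events to the consumer through a bounded queue so the consumer can pull them. The class SaxPullBridge and the "start:"/"text:"/"end:" event strings are hypothetical, and the JDK's SAX parser stands in for a Tika parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical adapter: exposes push-style SAX callbacks as a pull-style Iterator. */
class SaxPullBridge implements Iterator<String> {
    // Unique end-of-stream marker, compared by identity.
    private static final String END = new String("END");
    // Bounded queue: the parser thread blocks when the consumer falls behind.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
    private volatile Exception failure;
    private String next;

    SaxPullBridge(String xml) {
        Thread producer = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new InputSource(new StringReader(xml)), new QueueingHandler());
            } catch (Exception e) {
                failure = e; // reported to the consumer once the queue is drained
            } finally {
                try {
                    queue.put(END);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.setDaemon(true);
        producer.start();
    }

    /** Ordinary SAX handler; each callback becomes one queued event. */
    private final class QueueingHandler extends DefaultHandler {
        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            enqueue("start:" + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            enqueue("end:" + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                enqueue("text:" + text);
            }
        }

        private void enqueue(String event) throws SAXException {
            try {
                queue.put(event);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new SAXException(e);
            }
        }
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = queue.take(); // blocks until the parser produces an event
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                next = END;
            }
        }
        if (next == END && failure != null) {
            throw new IllegalStateException("Parse failed", failure);
        }
        return next != END;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String event = next;
        next = null;
        return event;
    }

    /** Convenience for small inputs: pulls every event into a list. */
    static List<String> drain(String xml) {
        List<String> events = new ArrayList<>();
        new SaxPullBridge(xml).forEachRemaining(events::add);
        return events;
    }
}
```

  The cost is one extra thread per parse, and it does nothing for formats where the whole file has to be read before any event can be emitted.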

  I'm not against you experimenting with a new PDFParser, but overall, it feels like it would be quite a bit of work.

  If you want to work on new handlers, how about the rewriteable ones we need for Tika 2.0? :)

       Cheers,

                        Tim
