Posted to dev@tika.apache.org by Sergey Beryozkin <sb...@gmail.com> on 2017/12/28 13:43:35 UTC
Re: Integrating Tika with Apache Beam

Hi All

A short update: my original TikaIO contribution changed a lot after 
the round of reviews, but the good news is that it stayed a native Beam IO 
component and will be available from Beam 2.3.0.
TikaIO will now return something called ParseResult, which includes the 
complete String and metadata content, but also a Throwable if an 
exception occurred in some file.
Tika streaming is not utilized at the moment, but as soon as good 
use cases emerge I'm sure the Beam community will be open to enhancing 
TikaIO further...
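
A minimal sketch of the shape such a result could take, for illustration
only. ParseResultSketch and all of its methods are hypothetical names, and
the String-to-String map is a simplified stand-in for Tika's metadata. This
is not Beam's actual ParseResult class, so please check the Beam 2.3.0
sources for the real API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the result type described above (not Beam's class).
// It carries either the extracted text plus metadata, or the error for one file.
final class ParseResultSketch {
    private final String fileLocation;
    private final String content;               // complete extracted text, "" on failure
    private final Map<String, String> metadata; // simplified stand-in for Tika metadata
    private final Throwable error;              // null when parsing succeeded

    private ParseResultSketch(String fileLocation, String content,
                              Map<String, String> metadata, Throwable error) {
        this.fileLocation = fileLocation;
        this.content = content;
        // Defensive copy so later changes by the caller do not leak in.
        this.metadata = Collections.unmodifiableMap(new LinkedHashMap<>(metadata));
        this.error = error;
    }

    static ParseResultSketch success(String fileLocation, String content,
                                     Map<String, String> metadata) {
        return new ParseResultSketch(fileLocation, content, metadata, null);
    }

    static ParseResultSketch failure(String fileLocation, Throwable error) {
        return new ParseResultSketch(fileLocation, "",
                Collections.<String, String>emptyMap(), error);
    }

    boolean isSuccess() { return error == null; }
    String getFileLocation() { return fileLocation; }
    String getContent() { return content; }
    Map<String, String> getMetadata() { return metadata; }
    Throwable getError() { return error; }

    public static void main(String[] args) {
        Map<String, String> md = new LinkedHashMap<>();
        md.put("Content-Type", "application/pdf");
        ParseResultSketch[] results = {
            success("/files/a.pdf", "some extracted text", md),
            failure("/files/b.pdf", new IllegalStateException("corrupt file"))
        };
        // One bad file is reported alongside the good ones instead of aborting.
        for (ParseResultSketch r : results) {
            if (r.isSuccess()) {
                System.out.println(r.getFileLocation() + ": " + r.getContent());
            } else {
                System.out.println(r.getFileLocation() + " failed: "
                        + r.getError().getMessage());
            }
        }
    }
}
```

With a result of this kind, a downstream step could branch on isSuccess()
rather than have a single unparseable file fail the whole run.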

Cheers. Sergey


On 21/09/17 18:54, Chris Mattmann wrote:
> Thanks Sergey, feel free to CC me directly at mattmann@apache.org on the Beam thread.
> My own 2c is that Tika’s “metadata” extraction can be in any order, and with our tika-dl module
> and the new feature extraction from multimedia files using TensorFlow and DL4J, these are
> perfect examples where the order/extraction doesn’t matter…
> 
> 
> 
> On 9/21/17, 2:52 AM, "Sergey Beryozkin" <sb...@gmail.com> wrote:
> 
>      Hi Guys
>      
>      TikaIO is getting some serious attention now on the Beam dev, and
>      unfortunately it is not all about it being a great addition to Beam.
>      
>      The team is wondering what one can do with TikaIO vs someone just doing
>      some custom Beam function.
>      
>      TikaIO, like any other bounded text reader, will produce the data in
>      order, but the Beam runtime can make it totally unordered by the time
>      it reaches the pipeline.
>      
>      I gave one example where we used the Tika output to save it all to
>      Lucene (with the file name associated) and then search for the files
>      which contain a certain word.
>      
>      Tim, Chris, others, if you have some interesting examples to share where
>      it did not matter in which order Tika-produced data were made eventually
>      available, then please let me know, or reply directly to a Beam dev
>      thread titled "TikaIO concerns".
>      
>      Note, if Beam devs decide they don't want it then one option can be to
>      create a tika-integrations/beam module and experiment there - I'm not
>      saying it will need to be done but it's something that may be worth
>      considering
>      
>      Sergey
>      On 15/09/17 12:02, Sergey Beryozkin wrote:
>      > Hi Chris
>      >
>      > thanks,
>      >
>      > at the moment TikaIO (originally named TikaReader, as it can only read,
>      > but we renamed it to follow the convention) is a bounded reader, so you
>      > can, say, ask it to read
>      >
>      > /files/*.pdf
>      >
>      > and it will read all the N files there, and will end the run.
>      >
>      > I'm not sure yet what the best strategy is for making it an unbounded
>      > reader that can continuously poll or be notified of new files
>      > becoming available... There are some ideas about scheduling the bounded
>      > Beam pipelines, haven't looked yet...
>      >
>      > In the short term, the simplest solution would be to create a new
>      > instance of the TikaIO pipeline and point it to the new temp folder
>      > where a new batch of files has been dropped.
>      >
>      > Thanks, Sergey
>      > On 11/09/17 22:41, Mattmann, Chris A (3010) wrote:
>      >> Amazing work, thank you Sergey!!
>      >>
>      >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>      >>
>      >> Chris Mattmann, Ph.D.
>      >> Principal Data Scientist, Engineering Administrative Office (3010)
>      >> Manager, NSF & Open Source Projects Formulation and Development
>      >> Offices (8212)
>      >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>      >> Office: 180-503E, Mailstop: 180-503
>      >> Email: chris.a.mattmann@nasa.gov
>      >> WWW:  http://sunset.usc.edu/~mattmann/
>      >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>      >>
>      >> Director, Information Retrieval and Data Science Group (IRDS)
>      >> Adjunct Associate Professor, Computer Science Department
>      >> University of Southern California, Los Angeles, CA 90089 USA
>      >> WWW: http://irds.usc.edu/
>      >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>      >>
>      >>
>      >> On 9/11/17, 7:33 AM, "Allison, Timothy B." <ta...@mitre.org> wrote:
>      >>
>      >>      What great news!  Thank you, Sergey!!!
>      >>      -----Original Message-----
>      >>      From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
>      >>      Sent: Monday, September 11, 2017 9:18 AM
>      >>      To: Allison, Timothy B. <ta...@mitre.org>; dev@tika.apache.org
>      >>      Subject: Re: Integrating Tika with Apache Beam
>      >>      Hi Tim, All
>      >>      It took some time, but the Beam TikaIO component is finally in its
>      >> 2.2.0-SNAPSHOT master:
>      >>      https://github.com/apache/beam/tree/master/sdks/java/io/tika
>      >>      I've created a basic project which can help with running it quickly:
>      >>      https://github.com/sberyozkin/beamTikaExample
>      >>      One can just build it and run as suggested in Readme.md, simply
>      >> have some PDF files for example, and point to one or all of them.
>      >>      By default, Beam will output the data to /tmp/tika.
>      >>      main() can be updated to support more options; they can be
>      >> collected either from the command line with TikaOptions:
>      >>
>      >> https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/main/java/org/apache/beam/sdk/io/tika/TikaOptions.java
>      >>
>      >>      (all options but the "--input" are optional)
>      >>      or directly from the code, some variations are shown in the tests:
>      >>
>      >> https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java
>      >>
>      >>      By default TikaReader will use an internal queue to make the SAX
>      >> events available to the Beam pipeline, which is why you can see
>      >> options like "queuePollTime", etc. If it's known that a given parser
>      >> can really only read the whole text in a single op then the process
>      >> can be optimized with 'parseSynchronously'...
>      >>      One can also try to update main() in the example to do more
>      >> interesting things than just print the data :-).
>      >>      Give it a try please if you get a chance, help make TikaIO a
>      >> major part of Beam :-) with PRs, etc
>      >>      Thanks, Sergey
>      >>      On 25/05/17 17:47, Sergey Beryozkin wrote:
>      >>      > Hi Guys
>      >>      >
>      >>      > The link to the initial code is available in JIRA, at this
>      >> stage the
>      >>      > focus is on preparing a solid initial PR, and then we can all
>      >> improve
>      >>      > Tika related code :-)
>      >>      >
>      >>      > Cheers, Sergey
>      >>      > On 24/05/17 11:41, Sergey Beryozkin wrote:
>      >>      >> Hi Tim, All,
>      >>      >>
>      >>      >> I thought I'd start a dedicated thread.
>      >>      >>
>      >>      >> I added some initial comments to [1], I'm quite close now to
>      >> creating
>      >>      >> the initial PR.
>      >>      >>
>      >>      >> Thanks, Sergey
>      >>      >>
>      >>      >> [1] https://issues.apache.org/jira/browse/BEAM-2328
>      >>      >> On 23/05/17 17:42, Allison, Timothy B. wrote:
>      >>      >>> Another idea...if you have any interest, it would be great to
>      >> get
>      >>      >>> Apache Beam set up on our Rackspace VM (with Spark?) and use
>      >> it for
>      >>      >>> our regression tests?
>      >>      >>>
>      >>      >>> -----Original Message-----
>      >>      >>> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
>      >>      >>> Sent: Friday, May 19, 2017 4:21 PM
>      >>      >>> To: user@tika.apache.org
>      >>      >>> Subject: Re: Extracting Text from embedded images in PDF docs
>      >>      >>>
>      >>      >>> Hi Tim
>      >>      >>>
>      >>      >>> Sure, once I get an initial PR ready I'll send an update and
>      >> I'll
>      >>      >>> explain what I did for a start and we will discuss it further
>      >>      >>>
>      >>      >
>      >>      >
>      >>
>      
>      
>      --
>      Sergey Beryozkin
>      
>      Talend Community Coders
>      http://coders.talend.com/
>      
> 
>