Posted to dev@tika.apache.org by Sergey Beryozkin <sb...@gmail.com> on 2017/05/24 10:41:54 UTC

Integrating Tika with Apache Beam

Hi Tim, All,

I thought I'd start a dedicated thread.

I added some initial comments to [1], and I'm now quite close to creating 
the initial PR.

Thanks, Sergey

[1] https://issues.apache.org/jira/browse/BEAM-2328

On 23/05/17 17:42, Allison, Timothy B. wrote:
> Another idea...if you have any interest, it would be great to get Apache Beam set up on our Rackspace VM (with Spark?) and use it for our regression tests?
> 
> -----Original Message-----
> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
> Sent: Friday, May 19, 2017 4:21 PM
> To: user@tika.apache.org
> Subject: Re: Extracting Text from embedded images in PDF docs
> 
> Hi Tim
> 
> Sure, once I get an initial PR ready I'll send an update and I'll explain what I did for a start and we will discuss it further
>

Re: Integrating Tika with Apache Beam

Posted by Sergey Beryozkin <sb...@gmail.com>.

Hi Tim

I just used 'mvn install -DskipTests=true' to build it quickly, and then 
ran 'mvn clean install' inside the tika module.

I use Eclipse. The Beam docs on how to set it up are good, but it has not 
quite worked for me yet for all of Beam; I only managed to import the 
individual Tika module.

Cheers, Sergey
On 25/05/17 19:30, Allison, Timothy B. wrote:
> Awesome!
> 
> Any tips on building Beam?  Should it work on (dare I say) Windows?
> 
> IntelliJ is complaining that it can't find jdk.tools:jdk.tools:1.6 as a dependency under many of the Hadoop modules.
> 
> mvn clean install is failing at Beam::SDKS::Java::Core
> 
> 
> [ERROR]   AvroIOTest.testWriteDisplayData:561
> Expected: display data with item: (with key is "filePrefix" and with type is <STRING> and with value is "/foo")
>       but: found 6 non-matching item(s):
> <[]org.apache.beam.sdk.io.AvroIO$Write:codec=snappy
> []org.apache.beam.sdk.io.AvroIO$Write:schema=org.apache.beam.sdk.io.AvroIOTest$GenericClass
> []org.apache.beam.sdk.io.AvroIO$Write:fileSuffix=bar
> []org.apache.beam.sdk.io.AvroIO$Write:numShards=100
> []org.apache.beam.sdk.io.AvroIO$Write:shardNameTemplate=-SS-of-NN-
> []org.apache.beam.sdk.io.AvroIO$Write:filePrefix=C:\foo>
> [ERROR]   FileBasedSinkTest.testRemoveWithTempFilename:148->testRemoveTemporaryFiles:261 temp file C:\Users\tallison\AppData\Local\Temp\junit5212433513605155196\temp\file0 exists
> Expected: is <false>
>       but: was <true>
> [ERROR]   FileBasedSourceTest.testSplittingFailsOnEmptyFileExpansion
> Expected: (an instance of java.io.FileNotFoundException and exception with message a string containing "No files found for spec: C:\Users\tallison\AppData\Local\Temp\junit1719865221821921346\junit7087025770573441186/missing.txt")
>       but: an instance of java.io.FileNotFoundException <java.lang.IllegalStateException: Unable to find registrar for c> is a java.lang.IllegalStateException
> Stacktrace was: java.lang.IllegalStateException: Unable to find registrar for c
>          at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:447)
>          at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:111)
> 
> 
> among many other errors...
> -----Original Message-----
> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
> Sent: Thursday, May 25, 2017 12:47 PM
> To: Allison, Timothy B. <ta...@mitre.org>; dev@tika.apache.org
> Subject: Re: Integrating Tika with Apache Beam
> 
> Hi Guys
> 
> The link to the initial code is available in JIRA. At this stage the focus is on preparing a solid initial PR, and then we can all improve the Tika-related code :-)
> 
> Cheers, Sergey


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

RE: Integrating Tika with Apache Beam

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Ha....Beam doesn't work on Windows currently...
https://issues.apache.org/jira/browse/BEAM-2299
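
The "Unable to find registrar for c" message in the build log quoted in this thread is consistent with the drive letter of a Windows path such as C:\foo being read as a URI scheme. The following is a minimal Java sketch of that failure mode, not Beam's actual code: the class name SchemeSketch, the regular expression and the "file" fallback are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SchemeSketch {
    // Assumed pattern, loosely following URI scheme syntax (a letter, then
    // letters, digits, '+', '-' or '.', then a colon). Not copied from Beam.
    private static final Pattern SCHEME =
            Pattern.compile("(?<scheme>[a-zA-Z][-a-zA-Z0-9+.]*):.*");

    /** Returns the scheme of a file spec, or "file" when no scheme is present. */
    static String parseScheme(String spec) {
        Matcher m = SCHEME.matcher(spec);
        return m.matches() ? m.group("scheme").toLowerCase(Locale.ROOT) : "file";
    }

    public static void main(String[] args) {
        // A Unix path has no scheme, so the local file system would be chosen.
        System.out.println(parseScheme("/tmp/foo"));
        // A genuine scheme is picked up as expected.
        System.out.println(parseScheme("gs://bucket/foo"));
        // A Windows drive letter looks like a one-letter scheme.
        System.out.println(parseScheme("C:\\Users\\foo"));
    }
}
```

Under these assumptions a lookup keyed on the returned scheme would search for a file system registered for "c" and find none, so a Windows-aware variant would need to special-case single-letter schemes before the lookup.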



RE: Integrating Tika with Apache Beam

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Awesome!

Any tips on building Beam?  Should it work on (dare I say) Windows?

IntelliJ is complaining that it can't find jdk.tools:jdk.tools:1.6 as a dependency under many of the Hadoop modules.

mvn clean install is failing at Beam::SDKS::Java::Core


[ERROR]   AvroIOTest.testWriteDisplayData:561
Expected: display data with item: (with key is "filePrefix" and with type is <STRING> and with value is "/foo")
     but: found 6 non-matching item(s):
<[]org.apache.beam.sdk.io.AvroIO$Write:codec=snappy
[]org.apache.beam.sdk.io.AvroIO$Write:schema=org.apache.beam.sdk.io.AvroIOTest$GenericClass
[]org.apache.beam.sdk.io.AvroIO$Write:fileSuffix=bar
[]org.apache.beam.sdk.io.AvroIO$Write:numShards=100
[]org.apache.beam.sdk.io.AvroIO$Write:shardNameTemplate=-SS-of-NN-
[]org.apache.beam.sdk.io.AvroIO$Write:filePrefix=C:\foo>
[ERROR]   FileBasedSinkTest.testRemoveWithTempFilename:148->testRemoveTemporaryFiles:261 temp file C:\Users\tallison\AppData\Local\Temp\junit5212433513605155196\temp\file0 exists
Expected: is <false>
     but: was <true>
[ERROR]   FileBasedSourceTest.testSplittingFailsOnEmptyFileExpansion
Expected: (an instance of java.io.FileNotFoundException and exception with message a string containing "No files found for spec: C:\Users\tallison\AppData\Local\Temp\junit1719865221821921346\junit7087025770573441186/missing.txt")
     but: an instance of java.io.FileNotFoundException <java.lang.IllegalStateException: Unable to find registrar for c> is a java.lang.IllegalStateException
Stacktrace was: java.lang.IllegalStateException: Unable to find registrar for c
        at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:447)
        at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:111)


among many other errors...

Re: Integrating Tika with Apache Beam

Posted by Sergey Beryozkin <sb...@gmail.com>.

Hi Tim

Thanks. The code, especially the part that adapts the Tika events to the 
Beam pipeline, will most likely need to be improved :-).
I've tried to make it as configurable as possible (pointing to the 
location of the TikaConfig if needed, etc.), but it's only a start...
I already see a typo in the TikaOptions doc for the minimum text length, 
so it's time to create a new PR :-)

Cheers, Sergey
On 11/09/17 15:33, Allison, Timothy B. wrote:
> What great news!  Thank you, Sergey!!!
> 
> -----Original Message-----
> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
> Sent: Monday, September 11, 2017 9:18 AM
> To: Allison, Timothy B. <ta...@mitre.org>; dev@tika.apache.org
> Subject: Re: Integrating Tika with Apache Beam
> 
> Hi Tim, All
> 
> It took some time, but the Beam TikaIO component is finally in Beam's 2.2.0-SNAPSHOT master:
> 
> https://github.com/apache/beam/tree/master/sdks/java/io/tika
> 
> I've created a basic project which can help with running it quickly:
> 
> https://github.com/sberyozkin/beamTikaExample
> 
> One can just build it and run it as suggested in Readme.md; simply have some files ready (PDFs, for example) and point it to one or all of them.
> 
> By default, Beam will output the data to /tmp/tika.
> 
> main() can be updated to support more options. They can be collected either from the command line with TikaOptions:
> 
> https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/main/java/org/apache/beam/sdk/io/tika/TikaOptions.java
> 
> (all options but the "--input" are optional)
> 
> or directly from the code, some variations are shown in the tests:
> 
> https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java
> 
> By default TikaReader will use an internal queue to make the SAX events available to the Beam pipeline, which is why you can see options like "queuePollTime", etc. If it's known that a given parser really does read the whole text in a single operation, then the process can be optimized with 'parseSynchronously'...
> 
> One can also try to update main() in the example to do more interesting things than just print the data :-).
> 
> Give it a try please if you get a chance, and help make TikaIO a major part of Beam :-) with PRs, etc.
> 
> Thanks, Sergey
> 
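
As an illustration of the queue-based hand-off described in the announcement quoted above, here is a minimal Java sketch using only JDK classes. It is not the TikaReader implementation: the class QueueHandOffSketch, the empty Optional used as an end marker and the parameter queuePollTimeMillis are hypothetical, and a plain SAX parser stands in for Tika. A background thread parses and pushes text chunks to a queue, while the caller polls the queue with a timeout.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class QueueHandOffSketch {

    /** SAX handler that forwards each non-blank text chunk to a queue. */
    static final class QueueingHandler extends DefaultHandler {
        private final BlockingQueue<Optional<String>> queue;

        QueueingHandler(BlockingQueue<Optional<String>> queue) {
            this.queue = queue;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            String text = new String(ch, start, length);
            if (!text.trim().isEmpty()) {
                queue.add(Optional.of(text));
            }
        }
    }

    /** Parses on a background thread and drains the queue on the caller's thread. */
    static List<String> read(String xml, long queuePollTimeMillis) {
        BlockingQueue<Optional<String>> queue = new LinkedBlockingQueue<>();
        Thread parser = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(
                        new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                        new QueueingHandler(queue));
            } catch (Exception e) {
                // A real reader would surface the error; the sketch just stops.
            } finally {
                queue.add(Optional.empty()); // hypothetical end-of-input marker
            }
        });
        parser.start();

        List<String> chunks = new ArrayList<>();
        try {
            while (true) {
                Optional<String> next =
                        queue.poll(queuePollTimeMillis, TimeUnit.MILLISECONDS);
                if (next == null) {
                    continue; // nothing arrived within the poll time; try again
                }
                if (!next.isPresent()) {
                    break; // end marker: the parser has finished
                }
                chunks.add(next.get());
            }
            parser.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading", e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> chunks = read("<doc><p>first</p><p>second</p></doc>", 50);
        System.out.println(String.join(" | ", chunks));
    }
}
```

The synchronous alternative mentioned in the announcement would correspond to calling the parser on the caller's thread and collecting the text directly, which removes the queue and the polling delay at the cost of holding the whole text in memory at once.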

Re: Integrating Tika with Apache Beam

Posted by Sergey Beryozkin <sb...@gmail.com>.

Hi Chris, thanks,
On 21/09/17 18:54, Chris Mattmann wrote:
> Thanks Sergey, feel free to CC me directly at mattmann@apache.org on the Beam thread.
> My own 2c is that Tika’s “metadata” extraction can be in any order, and with our tika-dl module
> and the new feature extraction from multimedia files using Tensorflow and DL4j these are
> perfect examples where the order/extraction doesn’t matter…
> 
Are these data the 'metadata' (author, date) or the text content? The 
main issue so far is that TikaIO will extract the content in the right 
order, but the Beam threads can totally reorder the individual content 
pieces. So the first question is what, from a practical point of view, 
can be done with these unordered data pieces. The follow-up question, 
specific to the TikaIO+Beam implementation, is how to ensure the data 
stay ordered all the way until they reach the end of the pipeline.
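
One generic way to cope with reordering, sketched here in plain Java rather than with Beam, is to tag every content piece with its source file and a sequence number at extraction time and to regroup and sort by those tags at the end. This is only an assumption about how it could be approached, not something TikaIO does; the Chunk class and the reassemble method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OrderedChunksSketch {

    /** A piece of extracted text tagged with its source file and position. */
    static final class Chunk {
        final String file;
        final int index;
        final String text;

        Chunk(String file, int index, String text) {
            this.file = file;
            this.index = index;
            this.text = text;
        }
    }

    /** Regroups chunks by file and restores the original order within each file. */
    static Map<String, String> reassemble(List<Chunk> chunks) {
        Map<String, List<Chunk>> byFile = new TreeMap<>();
        for (Chunk c : chunks) {
            byFile.computeIfAbsent(c.file, k -> new ArrayList<>()).add(c);
        }
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, List<Chunk>> e : byFile.entrySet()) {
            List<Chunk> list = e.getValue();
            list.sort(Comparator.comparingInt(c -> c.index));
            StringBuilder sb = new StringBuilder();
            for (Chunk c : list) {
                sb.append(c.text);
            }
            result.put(e.getKey(), sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>(Arrays.asList(
                new Chunk("a.pdf", 0, "Hello, "),
                new Chunk("a.pdf", 1, "world"),
                new Chunk("b.pdf", 0, "Tika "),
                new Chunk("b.pdf", 1, "and Beam")));
        Collections.shuffle(chunks); // simulate the runtime reordering the pieces
        System.out.println(reassemble(chunks));
    }
}
```

The cost of this approach is that a whole file's pieces must be collected before any of them can be used in order, so it mainly helps when the end of the pipeline needs complete documents.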

I'll CC you :-)

Thanks. Sergey

> 
> 
> On 9/21/17, 2:52 AM, "Sergey Beryozkin" <sb...@gmail.com> wrote:
> 
>      Hi Guys
>      
>      TikaIO is getting some serious attention now on the Beam dev, and
>      unfortunately it is not all about it being a great addition to Beam.
>      
>      The team is wondering what one can do with TikaIO vs someone just doing
>      some custom Beam function.
>      
>      TikaIO, like any other bounded text reader, will produce the data in an
>      ordered way, but the Beam runtime can deliver them to the pipeline
>      totally unordered.
>      
>      I gave one example where we used the Tika output to save it all to
>      Lucene (with the file name associated) and then search for the files
>      which contain a certain word.
>      
>      Tim, Chris, others, if you have some interesting examples to share where
>      it did not matter in which order Tika-produced data were made eventually
>      available, then please let me know, or reply directly to a Beam dev
>      thread titled "TikaIO concerns".
>      
>      Note, if Beam devs decide they don't want it then one option can be to
>      create a tika-integrations/beam module and experiment there - I'm not
>      saying it will need to be done but it's something that may be worth
>      considering
>      
>      Sergey
>      On 15/09/17 12:02, Sergey Beryozkin wrote:
>      > Hi Chris
>      >
>      > thanks,
>      >
>      > at the moment TikaIO (originally named TikaReader as it can only read,
>      > but we renamed it to follow the convention) is a bounded reader, so you
>      > can, say, ask it to read
>      >
>      > /files/*.pdf
>      >
>      > and it will read all the N files there and then end the run.
>      >
>      > I'm not sure yet what the best strategy is for making it an unbounded
>      > reader that can continuously poll or be notified of new files
>      > becoming available... There are some ideas about scheduling bounded
>      > Beam pipelines, but I haven't looked yet...
>      >
>      > In the short term, the simplest solution would be to create a new
>      > instance of the TikaIO pipeline and point it to the new temp folder
>      > where a new batch of files has been dropped.
>      >
>      > Thanks, Sergey
>      > On 11/09/17 22:41, Mattmann, Chris A (3010) wrote:
>      >> Amazing work, thank you Sergey!!
>      >>
>      >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>      >>
>      >> Chris Mattmann, Ph.D.
>      >> Principal Data Scientist, Engineering Administrative Office (3010)
>      >> Manager, NSF & Open Source Projects Formulation and Development
>      >> Offices (8212)
>      >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>      >> Office: 180-503E, Mailstop: 180-503
>      >> Email: chris.a.mattmann@nasa.gov
>      >> WWW:  http://sunset.usc.edu/~mattmann/
>      >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>      >>
>      >> Director, Information Retrieval and Data Science Group (IRDS)
>      >> Adjunct Associate Professor, Computer Science Department
>      >> University of Southern California, Los Angeles, CA 90089 USA
>      >> WWW: http://irds.usc.edu/
>      >> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: Integrating Tika with Apache Beam

Posted by Sergey Beryozkin <sb...@gmail.com>.

Hi All

A short update: my original TikaIO contribution changed a lot after 
the round of reviews, but the good news is that it stayed a native Beam IO 
component and will be available from Beam 2.3.0.
TikaIO will now return something called ParseResult, which includes the 
complete String content and the metadata, and also a Throwable if an 
exception occurred while parsing a file.
Tika streaming is not utilized at the moment, but as soon as good 
use cases emerge I'm sure the Beam community will be open to enhancing 
TikaIO further...
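
To illustrate the shape of such a result, here is a minimal Java sketch of a holder that carries either the content and metadata or the error for a given file. It is not Beam's ParseResult class: ParseOutcome, its fields and the stand-in parse method are hypothetical and only show the idea of returning the failure instead of throwing it.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class ParseOutcomeSketch {

    /** Hypothetical result holder: content plus metadata, or an error. */
    static final class ParseOutcome {
        final String fileLocation;
        final String content;
        final Map<String, String> metadata;
        final Throwable error;

        private ParseOutcome(String fileLocation, String content,
                             Map<String, String> metadata, Throwable error) {
            this.fileLocation = fileLocation;
            this.content = content;
            this.metadata = metadata;
            this.error = error;
        }

        static ParseOutcome success(String fileLocation, String content,
                                    Map<String, String> metadata) {
            return new ParseOutcome(fileLocation, content, metadata, null);
        }

        static ParseOutcome failure(String fileLocation, Throwable error) {
            return new ParseOutcome(fileLocation, null,
                    Collections.<String, String>emptyMap(), error);
        }

        boolean isSuccess() {
            return error == null;
        }
    }

    /** Stand-in for a parser: reports empty input as a failure instead of throwing. */
    static ParseOutcome parse(String fileLocation, String rawText) {
        try {
            if (rawText.isEmpty()) {
                throw new IllegalArgumentException("empty document");
            }
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("length", String.valueOf(rawText.length()));
            return ParseOutcome.success(fileLocation, rawText.trim(), metadata);
        } catch (RuntimeException e) {
            return ParseOutcome.failure(fileLocation, e);
        }
    }

    public static void main(String[] args) {
        ParseOutcome ok = parse("a.txt", " some text ");
        ParseOutcome bad = parse("b.txt", "");
        System.out.println(ok.fileLocation + " success=" + ok.isSuccess());
        System.out.println(bad.fileLocation + " success=" + bad.isSuccess());
    }
}
```

Keeping the error inside the result lets one bad file travel through the pipeline alongside the good ones, so a later step can decide whether to log it, retry it or fail the run.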

Cheers. Sergey



Re: Integrating Tika with Apache Beam

Posted by Chris Mattmann <ma...@apache.org>.

Thanks Sergey, feel free to CC me directly at mattmann@apache.org on the Beam thread.
My own 2c is that Tika’s “metadata” extraction can happen in any order, and our tika-dl module
and the new feature extraction from multimedia files using TensorFlow and DL4J are
perfect examples where the order of extraction doesn’t matter…



On 9/21/17, 2:52 AM, "Sergey Beryozkin" <sb...@gmail.com> wrote:

    Hi Guys
    
    TikaIO is getting some serious attention now on the Beam dev, and 
    unfortunately it is not all about it being a great addition to Beam.
    
    The team is wondering what one can do with TikaIO vs someone just doing 
    some custom Beam function.
    
    TikaIO, like any other bounded text reader, produces its data in order, 
    but the Beam runtime can make that data arrive in the pipeline totally 
    unordered.
    
    I gave one example where we used the Tika output to save it all to 
    Lucene (with the file name associated) and then search for the files 
    which contain a certain word.
    
    Tim, Chris, others, if you have some interesting examples to share where 
    it did not matter in which order Tika-produced data were made eventually 
    available, then please let me know, or reply directly to a Beam dev 
    thread titled "TikaIO concerns".
    
    Note, if Beam devs decide they don't want it then one option can be to 
    create a tika-integrations/beam module and experiment there - I'm not 
    saying it will need to be done but it's something that may be worth 
    considering
    
    Sergey
    
    
    -- 
    Sergey Beryozkin
    
    Talend Community Coders
    http://coders.talend.com/

Re: Integrating Tika with Apache Beam

Posted by Sergey Beryozkin <sb...@gmail.com>.

Hi Tim
Thanks, will link you to the thread shortly

In general, I'd say TikaIO has probably generated more interest than 
some of the other Beam IOs, which is a good sign :-)

The questions at the moment:
1) what interesting things can be done with the unordered Tika produced data
2) would it really help if users can write the custom functions 
themselves (I'd say the utility code always helps for some cases)

I also believe it would be possible to somehow make all the Tika 
produced data ordered in the end, but that would be the next phase...
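
One way such ordering might be recovered later is sketched below. It is only an illustration under stated assumptions, not how TikaIO works: every piece of text is assumed to carry its file name and a sequence number, so that a later step can group by file and sort. The names Reorder, Chunk and reassemble are hypothetical; in a real Beam pipeline the grouping would presumably be left to the runner (for example with a GroupByKey).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class Reorder {

    // One piece of extracted text, tagged with its source file and position (hypothetical type).
    public static final class Chunk {
        public final String file;
        public final int seq;
        public final String text;

        public Chunk(String file, int seq, String text) {
            this.file = file;
            this.seq = seq;
            this.text = text;
        }
    }

    // Groups chunks by file and restores the original order within each file.
    public static Map<String, String> reassemble(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparing((Chunk c) -> c.file)
                .thenComparingInt(c -> c.seq));
        Map<String, String> textByFile = new TreeMap<>();
        for (Chunk c : sorted) {
            // Append each chunk to the text already collected for its file.
            textByFile.merge(c.file, c.text, (a, b) -> a + " " + b);
        }
        return textByFile;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>(Arrays.asList(
                new Chunk("a.pdf", 0, "first"),
                new Chunk("a.pdf", 1, "second"),
                new Chunk("b.pdf", 0, "only")));
        Collections.shuffle(chunks); // simulate unordered delivery
        System.out.println(reassemble(chunks)); // {a.pdf=first second, b.pdf=only}
    }
}
```

The cost is that the whole text of a file has to be collected before it can be emitted in order, which is probably why this would be a later phase.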

At the moment it's those 2 issues which are the main ones...

Thanks, Sergey

P.S I'd not like this TikaIO idea to cause some 'battles' :-), I think 
it would be cool if Tika were one of the native Beam IOs (it would also 
be big for the tooling side of things), if not then indeed Tika users 
can easily do something themselves on top of Beam
On 21/09/17 13:28, Allison, Timothy B. wrote:
> Hi Sergey,
> 
> I just subscribed to Beam's dev list.  Can you forward me your latest email so that I can respond to the thread?  Or can you ping me via their list?  Thank you!
> 
> -----Original Message-----
> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
> Sent: Thursday, September 21, 2017 5:53 AM
> To: dev@tika.apache.org
> Subject: Re: Integrating Tika with Apache Beam
> 
> Hi Guys
> 
> TikaIO is getting some serious attention now on the Beam dev, and unfortunately it is not all about it being a great addition to Beam.
> 
> The team is wondering what one can do with TikaIO vs someone just doing some custom Beam function.
> 
> TikaIO, like any other bounded text reader, produces its data in order, but the Beam runtime can make that data arrive in the pipeline totally unordered.
> 
> I gave one example where we used the Tika output to save it all to Lucene (with the file name associated) and then search for the files which contain a certain word.
> 
> Tim, Chris, others, if you have some interesting examples to share where it did not matter in which order Tika-produced data were made eventually available, then please let me know, or reply directly to a Beam dev thread titled "TikaIO concerns".
> 
> Note, if Beam devs decide they don't want it then one option can be to create a tika-integrations/beam module and experiment there - I'm not saying it will need to be done but it's something that may be worth considering
> 
> Sergey
> 
> 
> --
> Sergey Beryozkin
> 
> Talend Community Coders
> http://coders.talend.com/
>

RE: Integrating Tika with Apache Beam

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Hi Sergey,

I just subscribed to Beam's dev list.  Can you forward me your latest email so that I can respond to the thread?  Or can you ping me via their list?  Thank you!

-----Original Message-----
From: Sergey Beryozkin [mailto:sberyozkin@gmail.com] 
Sent: Thursday, September 21, 2017 5:53 AM
To: dev@tika.apache.org
Subject: Re: Integrating Tika with Apache Beam

Hi Guys

TikaIO is getting some serious attention now on the Beam dev, and unfortunately it is not all about it being a great addition to Beam.

The team is wondering what one can do with TikaIO vs someone just doing some custom Beam function.

TikaIO, like any other bounded text reader, produces its data in order, but the Beam runtime can make that data arrive in the pipeline totally unordered.

I gave one example where we used the Tika output to save it all to Lucene (with the file name associated) and then search for the files which contain a certain word.

Tim, Chris, others, if you have some interesting examples to share where it did not matter in which order Tika-produced data were made eventually available, then please let me know, or reply directly to a Beam dev thread titled "TikaIO concerns".

Note, if Beam devs decide they don't want it then one option can be to create a tika-integrations/beam module and experiment there - I'm not saying it will need to be done but it's something that may be worth considering

Sergey


--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: Integrating Tika with Apache Beam

Posted by Sergey Beryozkin <sb...@gmail.com>.

Hi Guys

TikaIO is getting some serious attention now on the Beam dev, and 
unfortunately it is not all about it being a great addition to Beam.

The team is wondering what one can do with TikaIO vs someone just doing 
some custom Beam function.

TikaIO, like any other bounded text reader, produces its data in order, 
but the Beam runtime can make that data arrive in the pipeline totally 
unordered.

I gave one example where we used the Tika output to save it all to 
Lucene (with the file name associated) and then search for the files 
which contain a certain word.
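
To show why order does not matter for that kind of search, here is a minimal sketch. It is not the code we used: Lucene is replaced by a plain in-memory map, and the names UnorderedIndex, add and filesContaining are made up for the illustration. Pieces of text are indexed together with their file name, and the search result is the same whether the pieces arrive in order or shuffled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class UnorderedIndex {

    // word -> names of the files that contain it
    private final Map<String, Set<String>> index = new HashMap<>();

    // Adds one piece of extracted text; pieces may arrive in any order.
    public void add(String fileName, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(fileName);
            }
        }
    }

    // Returns the files containing the word, or an empty set if there are none.
    public Set<String> filesContaining(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        List<String[]> chunks = new ArrayList<>(Arrays.asList(
                new String[] {"a.pdf", "Apache Beam pipelines"},
                new String[] {"a.pdf", "run on many runners"},
                new String[] {"b.pdf", "Tika extracts text"},
                new String[] {"b.pdf", "and Beam can process it"}));

        UnorderedIndex ordered = new UnorderedIndex();
        for (String[] c : chunks) {
            ordered.add(c[0], c[1]);
        }

        Collections.shuffle(chunks); // simulate unordered delivery by the runtime
        UnorderedIndex shuffled = new UnorderedIndex();
        for (String[] c : chunks) {
            shuffled.add(c[0], c[1]);
        }

        System.out.println(ordered.filesContaining("beam")); // [a.pdf, b.pdf]
        System.out.println(ordered.filesContaining("beam")
                .equals(shuffled.filesContaining("beam"))); // true
    }
}
```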

Tim, Chris, others, if you have some interesting examples to share where 
it did not matter in which order Tika-produced data were made eventually 
available, then please let me know, or reply directly to a Beam dev 
thread titled "TikaIO concerns".

Note, if Beam devs decide they don't want it then one option can be to 
create a tika-integrations/beam module and experiment there - I'm not 
saying it will need to be done but it's something that may be worth 
considering

Sergey
On 15/09/17 12:02, Sergey Beryozkin wrote:
> Hi Chris
> 
> thanks,
> 
> at the moment TikaIO (originally named TikaReader as it can only read, 
> but we renamed it to follow the convention) is a bounded reader, so you 
> can, say, ask it to read
> 
> /files/*.pdf
> 
> and it will read all the N files there, and will end the run.
> 
> I'm not sure yet what the best strategy is for making it an unbounded 
> reader which can continuously poll for, or be notified of, new files 
> becoming available... There are some ideas about scheduling bounded 
> Beam pipelines, I haven't looked yet...
> 
> In the short term, the simplest solution would be to create a new 
> instance of the TikaIO pipeline and point it to the temp folder where a 
> new batch of files has been dropped.
> 
> Thanks, Sergey


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: Integrating Tika with Apache Beam

Posted by Sergey Beryozkin <sb...@gmail.com>.

Hi Chris

thanks,

at the moment TikaIO (originally named TikaReader as it can only read, 
but we renamed it to follow the convention) is a bounded reader, so you 
can, say, ask it to read

/files/*.pdf

and it will read all the N files there, and will end the run.

I'm not sure yet what the best strategy is for making it an unbounded 
reader which can continuously poll for, or be notified of, new files 
becoming available... There are some ideas about scheduling bounded 
Beam pipelines, I haven't looked yet...

In the short term, the simplest solution would be to create a new 
instance of the TikaIO pipeline and point it to the temp folder where a 
new batch of files has been dropped.
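
A rough sketch of the "bounded" behaviour and the batch-per-folder workaround, using only the JDK and not TikaIO itself: the files matching a glob are listed exactly once and the run ends, and a new batch means a new run against a new folder. The names BoundedBatch, listBatch and newBatchFolder are hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BoundedBatch {

    // Lists the file names matching the glob exactly once; files dropped later are not seen.
    public static List<String> listBatch(Path folder, String glob) {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(folder, glob)) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(names);
        return names;
    }

    // Demo helper: creates a temp folder holding empty files with the given names.
    public static Path newBatchFolder(String... fileNames) {
        try {
            Path dir = Files.createTempDirectory("tika-batch");
            for (String name : fileNames) {
                Files.createFile(dir.resolve(name));
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // First batch: one bounded run over everything matching *.pdf.
        Path first = newBatchFolder("a.pdf", "b.pdf", "notes.txt");
        System.out.println(listBatch(first, "*.pdf")); // [a.pdf, b.pdf]

        // A later batch goes to a new folder and is handled by a new run.
        Path second = newBatchFolder("c.pdf");
        System.out.println(listBatch(second, "*.pdf")); // [c.pdf]
    }
}
```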

Thanks, Sergey
On 11/09/17 22:41, Mattmann, Chris A (3010) wrote:
> Amazing work, thank you Sergey!!
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Principal Data Scientist, Engineering Administrative Office (3010)
> Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 180-503E, Mailstop: 180-503
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   
> 
> On 9/11/17, 7:33 AM, "Allison, Timothy B." <ta...@mitre.org> wrote:
> 
>      What great news!  Thank you, Sergey!!!
>      
>      -----Original Message-----
>      From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
>      Sent: Monday, September 11, 2017 9:18 AM
>      To: Allison, Timothy B. <ta...@mitre.org>; dev@tika.apache.org
>      Subject: Re: Integrating Tika with Apache Beam
>      
>      Hi Tim, All
>      
>      It took some time, but the Beam TikaIO component is finally in the 2.2.0-SNAPSHOT master:
>      
>      https://github.com/apache/beam/tree/master/sdks/java/io/tika
>      
>      I've created a basic project which can help with running it quickly:
>      
>      https://github.com/sberyozkin/beamTikaExample
>      
>      One can just build it and run as suggested in Readme.md, simply have some PDF files for example, and point to one or all of them.
>      
>      By default, Beam will output the data to /tmp/tika.
>      
>      main() can be updated to support more options; they can be collected either from the command line with TikaOptions:
>      
>      https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/main/java/org/apache/beam/sdk/io/tika/TikaOptions.java
>      
>      (all options but the "--input" are optional)
>      
>      or directly from the code, some variations are shown in the tests:
>      
>      https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java
>      
>      By default TikaReader will use an internal queue to make the SAX events available to the Beam pipeline, this is why you can see the options like "queuePollTime", etc. If it's known that a given parser can really read the whole text in the single op only then the process can be optimized with 'parseSynchronously'...
>      
>      One can also try to update main() in the example to do more interesting things then just print the data :-).
>      
>      Give it a try please if you get a chance, help make TikeIO the major part of Beam :-) with PRs, etc
>      
>      Thanks, Sergey
>      
>      
>      
>      
>      
>      On 25/05/17 17:47, Sergey Beryozkin wrote:
>      > Hi Guys
>      >
>      > The link to the initial code is available in JIRA, at this stage the
>      > focus is on preparing a solid initial PR, and then we can all improve
>      > Tika related code :-)
>      >
>      > Cheers, Sergey
>      > On 24/05/17 11:41, Sergey Beryozkin wrote:
>      >> Hi Tim, All,
>      >>
>      >> I thought I'd start a dedicated thread.
>      >>
>      >> I added some initial comments to [1], I'm quite close now to creating
>      >> the initial PR.
>      >>
>      >> Thanks, Sergey
>      >>
>      >> [1] https://issues.apache.org/jira/browse/BEAM-2328
>      >> On 23/05/17 17:42, Allison, Timothy B. wrote:
>      >>> Another idea...if you have any interest, it would be great to get
>      >>> Apache Beam set up on our Rackspace VM (with Spark?) and use it for
>      >>> our regression tests?
>      >>>
>      >>> -----Original Message-----
>      >>> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
>      >>> Sent: Friday, May 19, 2017 4:21 PM
>      >>> To: user@tika.apache.org
>      >>> Subject: Re: Extracting Text from embedded images in PDF docs
>      >>>
>      >>> Hi Tim
>      >>>
>      >>> Sure, once I get an initial PR ready I'll send an update and I'll
>      >>> explain what I did for a start and we will discuss it further
>      >>>
>      >
>      >
>      
>

Re: Integrating Tika with Apache Beam

Posted by "Mattmann, Chris A (3010)" <ch...@jpl.nasa.gov>.

Amazing work, thank you Sergey!!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 


RE: Integrating Tika with Apache Beam

Posted by "Allison, Timothy B." <ta...@mitre.org>.

What great news!  Thank you, Sergey!!!


Re: Integrating Tika with Apache Beam

Posted by Sergey Beryozkin <sb...@gmail.com>.

Hi Tim, All

It took some time, but the Beam TikaIO component is finally in Beam's 
2.2.0-SNAPSHOT master:

https://github.com/apache/beam/tree/master/sdks/java/io/tika

I've created a basic project which can help with running it quickly:

https://github.com/sberyozkin/beamTikaExample

One can just build it and run it as suggested in Readme.md; simply have 
some files at hand, PDFs for example, and point it to one or all of them.

By default, Beam will output the data to /tmp/tika.

main() can be updated to support more options. They can be collected 
either from the command line with TikaOptions:

https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/main/java/org/apache/beam/sdk/io/tika/TikaOptions.java

(all options but the "--input" are optional)

or directly from the code, some variations are shown in the tests:

https://github.com/apache/beam/blob/master/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java

By default TikaReader uses an internal queue to make the SAX events 
available to the Beam pipeline, which is why you can see options like 
"queuePollTime", etc. If it's known that a given parser can only read 
the whole text in a single operation, then the process can be optimized 
with 'parseSynchronously'...
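
To illustrate the difference between the two modes, here is a minimal sketch 
that uses only the JDK. It is not the actual TikaReader code: the class name, 
the method names and the end-of-document marker are all hypothetical, and the 
list of strings merely stands in for the text a parser would report through 
SAX events. The queued variant polls with a timeout (the role "queuePollTime" 
plays), while the synchronous variant returns everything in one step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of queued vs. synchronous reading; not TikaIO code. */
class QueuedTextReaderSketch {

    // Distinct object used as an end-of-document marker (compared by identity).
    private static final String END = new String("<end>");

    /** Queued mode: a "parser" thread produces chunks, the caller polls for them. */
    static List<String> readQueued(List<String> chunks, long queuePollTimeMillis) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Stands in for a parser reporting text through SAX callbacks.
        Thread parser = new Thread(() -> {
            for (String chunk : chunks) {
                queue.add(chunk);
            }
            queue.add(END);
        });
        parser.start();

        List<String> result = new ArrayList<>();
        try {
            while (true) {
                // Wait up to the poll time for the next chunk.
                String next = queue.poll(queuePollTimeMillis, TimeUnit.MILLISECONDS);
                if (next == null) {
                    continue; // nothing arrived yet, poll again
                }
                if (next == END) {
                    break; // the parser has finished
                }
                result.add(next);
            }
            parser.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while reading", e);
        }
        return result;
    }

    /** Synchronous mode: the whole text is produced in one step, no queue involved. */
    static List<String> readSynchronously(List<String> chunks) {
        return new ArrayList<>(chunks);
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("Hello", ", ", "Beam");
        System.out.println(String.join("", readQueued(chunks, 50)));
        System.out.println(String.join("", readSynchronously(chunks)));
    }
}
```

Both variants return the same text; the queued one simply lets the consumer 
start working before the parser has finished, at the cost of an extra thread 
and the polling. A real reader would also want to bound the total waiting 
time rather than poll indefinitely.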

One can also try to update main() in the example to do more interesting 
things than just print the data :-).

Please give it a try if you get a chance, and help make TikaIO a major 
part of Beam :-) with PRs, etc.

Thanks, Sergey






Re: Integrating Tika with Apache Beam

Posted by Sergey Beryozkin <sb...@gmail.com>.

Hi Guys

The link to the initial code is available in JIRA. At this stage the 
focus is on preparing a solid initial PR, and then we can all improve 
the Tika-related code :-)

Cheers, Sergey
On 24/05/17 11:41, Sergey Beryozkin wrote:
> Hi Tim, All,
> 
> I thought I'd start a dedicated thread.
> 
> I added some initial comments to [1], I'm quite close now to creating 
> the initial PR.
> 
> Thanks, Sergey
> 
> [1] https://issues.apache.org/jira/browse/BEAM-2328
> On 23/05/17 17:42, Allison, Timothy B. wrote:
>> Another idea...if you have any interest, it would be great to get 
>> Apache Beam set up on our Rackspace VM (with Spark?) and use it for 
>> our regression tests?
>>
>> -----Original Message-----
>> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
>> Sent: Friday, May 19, 2017 4:21 PM
>> To: user@tika.apache.org
>> Subject: Re: Extracting Text from embedded images in PDF docs
>>
>> Hi Tim
>>
>> Sure, once I get an initial PR ready I'll send an update and I'll 
>> explain what I did for a start and we will discuss it further
>>


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/