Posted to commits@beam.apache.org by "Sergey Beryozkin (JIRA)" <ji...@apache.org> on 2017/05/24 10:37:04 UTC

[jira] [Comment Edited] (BEAM-2328) Introduce Apache Tika Input component

    [ https://issues.apache.org/jira/browse/BEAM-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022676#comment-16022676 ] 

Sergey Beryozkin edited comment on BEAM-2328 at 5/24/17 10:36 AM:
------------------------------------------------------------------

Apache Tika parsers report content via SAX events (see https://tika.apache.org/1.14/).

I'm implementing a TikaReader that adapts the sequence of SAX events to the streaming BoundedReader API by using an internal ExecutorService and a ConcurrentLinkedQueue. Thus, when the Beam thread comes in and calls start() and then advance(), it won't have to parse the given file content immediately. A good number of Tika parsers can report the data in chunks, so the proposed TikaReader implementation should be quite efficient.
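
To illustrate the idea only, here is a minimal sketch of the adapter pattern described above. It is not the actual TikaReader code: it uses the JDK's built-in SAX parser in place of a Tika parser, the class name SaxChunkReader is hypothetical, and it swaps the ConcurrentLinkedQueue for a LinkedBlockingQueue with an end-of-stream sentinel so that advance() can block instead of polling.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.NoSuchElementException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical reader exposing SAX character events through start()/advance()/getCurrent(). */
final class SaxChunkReader implements AutoCloseable {
  // End-of-stream sentinel; compared by identity and never handed to callers.
  private static final String END = new String("END");

  private final InputStream input;
  private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  private volatile Exception failure;
  private String current;
  private boolean done;

  SaxChunkReader(InputStream input) {
    this.input = input;
  }

  /** Starts parsing on a background thread, then waits for the first chunk. */
  boolean start() throws Exception {
    executor.submit(() -> {
      try {
        SAXParserFactory.newInstance().newSAXParser().parse(input, new DefaultHandler() {
          @Override
          public void characters(char[] ch, int offset, int length) {
            // Each non-blank run of character data becomes one queued chunk.
            String text = new String(ch, offset, length).trim();
            if (!text.isEmpty()) {
              queue.add(text);
            }
          }
        });
      } catch (Exception e) {
        failure = e; // rethrown on the reading thread once the queue drains
      } finally {
        queue.add(END);
      }
    });
    return advance();
  }

  /** Blocks until the next chunk is available; returns false at end of input. */
  boolean advance() throws Exception {
    if (done) {
      return false;
    }
    String next = queue.take();
    if (next == END) {
      done = true;
      current = null;
      if (failure != null) {
        throw failure;
      }
      return false;
    }
    current = next;
    return true;
  }

  String getCurrent() {
    if (current == null) {
      throw new NoSuchElementException();
    }
    return current;
  }

  @Override
  public void close() {
    executor.shutdownNow();
  }

  public static void main(String[] args) throws Exception {
    byte[] xml = "<doc><p>first</p><p>second</p></doc>".getBytes(StandardCharsets.UTF_8);
    try (SaxChunkReader reader = new SaxChunkReader(new ByteArrayInputStream(xml))) {
      for (boolean more = reader.start(); more; more = reader.advance()) {
        System.out.println(reader.getCurrent());
      }
    }
  }
}
```

The reading thread only waits on the queue, so chunks can be consumed as soon as the parser emits them. Note that a SAX parser is free to split one text node across several characters() calls, so chunk boundaries in this sketch are not guaranteed to match element boundaries.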

Unfortunately, I cannot extend the FileBasedSource/Reader helpers, given that Tika parsers need full control of the InputStream. However, should the PR be accepted, I would definitely see some scope for reusing some of the currently private FileBasedSource/Reader helpers, for example the composite reader that is used when multiple files are picked up.

Right now I have reasonably good starting code IMHO, with TikaInputTest covering reading of PDF, zipped PDF, ODT and two ODT files, with the content and, optionally, the parsed-out metadata also being streamed.

Some of the code I copied from FileBasedSource might be suboptimal when applied to the Tika case. I hope that if the PR eventually gets accepted then, with the help of Tika experts, there will no doubt be more improvements coming in.

Planning to work on creating a branch and PR soon, cheers







> Introduce Apache Tika Input component
> -------------------------------------
>
>                 Key: BEAM-2328
>                 URL: https://issues.apache.org/jira/browse/BEAM-2328
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-ideas
>            Reporter: Sergey Beryozkin
>            Assignee: Davor Bonaci
>             Fix For: 2.1.0
>
>
> Apache Tika is a popular project that offers extensive support for parsing a variety of file formats. It is used in many projects, including Lucene and Elasticsearch.
> Supporting a Tika Input (Read) at the Beam level would be of major interest to many users.
> PR is to follow



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)