Posted to dev@beam.apache.org by Sergey Beryozkin <sb...@gmail.com> on 2017/09/19 10:41:16 UTC

TikaIO concerns

Hi All

This is my first post to the dev list. I work for Talend, I'm a Beam 
novice and an Apache Tika fan, and I thought it would be really great to 
try and link both projects together, which led me to opening [1], where 
I typed some early thoughts, followed by PR [2].

I noticed yesterday that I had the robust :-) (but useful and helpful) 
newer review comments from Eugene pending, so I'd like to summarize a 
bit why I did TikaIO (the reader) the way I did, and then decide, based 
on the feedback from the experts, what to do next.

Apache Tika parsers report the text content in chunks, via SAXParser 
events. It's not possible with Tika to take a file and read it bit by 
bit at the 'initiative' of the Beam reader, line by line; the only way 
is to handle the SAXParser callbacks which report the data chunks. Some 
parsers may report complete lines, some individual words, and some can 
report the data only after they completely parse the document.
It all depends on the data format.
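The JDK's own SAX API has the same push model, so a minimal stdlib-only sketch (plain SAX on XML input, not Tika itself) shows why the caller cannot pull "the next line" - text arrives in whatever chunks the parser decides to deliver via callbacks:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class SaxChunks {
    public static List<String> parse(String xml) throws Exception {
        List<String> chunks = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser decides when, and with how much text, to call
                // back; the caller never asks for "the next line".
                chunks.add(new String(ch, start, length));
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            handler);
        return chunks;
    }
}
```

How the text is split across `characters()` calls is up to the parser; only the concatenation of all chunks is guaranteed, which mirrors the Tika situation described above.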

At the moment TikaIO's TikaReader does not use the Beam threads to parse 
the files; Beam threads only collect the data from the internal queue 
into which the internal TikaReader thread puts the data
(note the data chunks are ordered, even though the tests might suggest 
otherwise).
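The arrangement described above - a parser thread pushing chunks into a queue that the reader thread drains in order - can be sketched with stdlib types only (all names here are illustrative, not the actual TikaIO code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PushToPullBridge {
    // Sentinel marking the end of the document (illustrative).
    private static final String EOF = "__END_OF_DOCUMENT__";
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Producer side: stands in for the internal parser thread, which
    // pushes chunks as the SAX callbacks fire.
    public void startProducer(List<String> chunks) {
        Thread parserThread = new Thread(() -> {
            try {
                for (String chunk : chunks) {
                    queue.put(chunk);
                }
                queue.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        parserThread.setDaemon(true);
        parserThread.start();
    }

    // Consumer side: stands in for the reader's advance() loop, pulling
    // chunks in the same order the producer emitted them (FIFO queue).
    public List<String> drain() throws InterruptedException {
        List<String> out = new ArrayList<>();
        for (String chunk = queue.take(); !chunk.equals(EOF); chunk = queue.take()) {
            out.add(chunk);
        }
        return out;
    }
}
```

The FIFO queue is also why the chunks stay ordered: there is a single producer and a single consumer.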

The reason I did it this way is that I thought:

1) it would make the individual data chunks available to the pipeline 
faster - the parser will keep working through the binary/video etc. file 
while the data already starts flowing. I agree there should be some test 
data available confirming it, but I'm positive at the moment that this 
approach might yield some performance gains with large sets. If the file 
is large, or it has embedded attachments/videos to deal with, then it 
may be more effective not to have the Beam thread deal with it...

2) As I commented at the end of [2], having an option to concatenate the 
data chunks first, before making them available to the pipeline, is 
useful, and I guess doing the same in a ParDo would introduce some 
synchronization issues (though I'm not exactly sure yet)
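The concatenation option amounts to buffering all chunks of a file and emitting a single element; a trivial illustrative sketch (not the TikaIO API):

```java
import java.util.List;

public class ChunkConcat {
    // Collapse the parser's arbitrarily sized chunks into one
    // element per file, instead of one element per chunk.
    public static String concatenate(List<String> chunks) {
        StringBuilder sb = new StringBuilder();
        for (String chunk : chunks) {
            sb.append(chunk);
        }
        return sb.toString();
    }
}
```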

One valid concern there is that the reader is polling the internal 
queue, so, in theory at least, and perhaps in some rare cases too, we 
may have a case where the max polling time has been reached while the 
parser is still busy, and TikaIO fails to report all the file data. I 
think it can be solved by either 2a) configuring the max polling time to 
a very large number which will never be reached in a practical case, or 
2b) simply using a blocking queue without time limits - in the worst 
case, if TikaParser spins and fails to report the end of the document, 
Beam can heal itself if the pipeline blocks.
I propose to follow 2b).
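The difference between 2a) and 2b) comes down to which BlockingQueue call the reader uses; a stdlib sketch of the two options (method names are illustrative):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class PollingStrategies {
    // 2a) Bounded wait: returns null once maxPollMillis elapses, so a slow
    // parser can make the reader believe the document ended early and
    // silently drop the remaining data.
    static String pollWithTimeout(BlockingQueue<String> q, long maxPollMillis)
            throws InterruptedException {
        return q.poll(maxPollMillis, TimeUnit.MILLISECONDS);
    }

    // 2b) Blocking take: waits indefinitely for the next chunk; if the
    // parser never reports the end of the document, the pipeline blocks
    // instead of truncating the data, and the runner can surface the stall.
    static String takeBlocking(BlockingQueue<String> q)
            throws InterruptedException {
        return q.take();
    }
}
```

With 2b) a hung parser manifests as a stalled pipeline rather than as silent data loss, which is the safer failure mode.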


Please let me know what you think.
My plan so far is:
1) start addressing most of Eugene's comments, which will require some 
minor TikaIO updates
2) at the next stage, work on removing the TikaSource internal code 
dealing with file patterns, which I copied from TextIO
3) if needed, mark TikaIO @Experimental to give Tika and Beam users some 
time to try it with some real complex files, and also decide whether 
TikaIO can continue to be implemented as a BoundedSource/Reader or not

Eugene, all, will it work if I start with 1) ?

Thanks, Sergey

[1] https://issues.apache.org/jira/browse/BEAM-2328
[2] https://github.com/apache/beam/pull/3378

Re: TikaIO concerns

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Eugene,

fully agree! My point was more in terms of features: I think it's fair 
to postpone some features of an IO to new PRs.
For example, when we created JmsIO, it only supported TextMessage, and 
support for new message types will be added in new improvement PRs.

That's what I meant by "basically good": it's more about feature scope.

Regards
JB


-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Eugene,

As far as I was concerned, I was quite happy with the initial code I did:

It does work, at least I think it works; I ran the TikaIO example (not 
part of the distro - I can provide a link) and I see a bunch of files 
created (with TextIO.Write linking with TikaIO), and to me it looks 
'perfect'.

Given that, I thought I would see the PR succeeding. At no point was 
there a question of me, being JB's co-worker, relying on JB's friendship 
just to get it in, even though I was keen to see the PR merged asap :-) 
- as long as JB had any questions/review comments for me, I'd address 
them.

I've no problems with cleaning up the code with more PRs;
I'd really prefer that to re-writing it completely.

Does the ordering matter? Perhaps for some cases it does, and for some 
it does not. Maybe it makes sense to support running TikaIO as both a 
bounded reader/source and a ParDo, with the common code reused.

At this stage, IMHO, it might make sense to clean up the current code 
first, before making possibly bigger decisions?

Sergey


Re: TikaIO concerns

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
On Tue, Sep 19, 2017 at 5:13 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi Sergey,
>
> as discussed together during the review, I fully understand the choices
> you did.
>
> Your plan sounds reasonable. Thanks !
>
> Generally speaking, in order to give visibility and encourage
> contribution, I
> think it would make sense to accept a PR if it's basically right (even if
> it's
> not yet perfect) and doesn't break the build.
>
This is a wider discussion than the current thread, but I don't think I
agree with this approach.

We have followed a much stricter standard in the past, and thanks to that,
Beam currently has (in my opinion) an extremely high-quality library of
IOs, and Beam can take pride in not being one of "those" open-source
projects that advertise everything but guarantee nothing and are
frustrating to work with, because everything is slightly broken in some way
or another.

I can recall at most 1 or 2 cases where a contributor gave up on a PR due
to the amount of issues pointed out during review - and in those cases, the
PR was usually in a state where Beam would not have benefitted from merging
the issue-ridden code anyway. Basically, a thorough review in all cases
I've seen so far has been a good idea in retrospect.

There may be trivial fixups best done by a committer rather than author
(e.g. javadoc typos), but I think nontrivial, high-level issues should be
reviewed rigorously.

People trust Beam (especially Beam IOs) with their data, and at least the
correctness-critical stuff *must* be done right. Beam also generally
promises a stable API, so API mistakes are forever, and can not be fixed
iteratively [this can be addressed by marking in-progress work as
@Experimental] - so APIs must be done right as well. On the other hand,
performance, documentation, lack of support for certain features etc. can
be fixed iteratively - I agree that we shouldn't push too hard on that
during review.

There's also the mentorship aspect: I think it is valuable for new Beam
contributors to get thorough review, especially for their first
contributions, as a kick-start to learning the best practices - they are
going to need them repeatedly in their future contributions. Merging code
without sufficient review gives them the immediate gratification of "having
contributed", but denies the mentorship. Moreover, most contributions are
made by a relatively small number of prolific "serial contributors" (you
being a prime example!) who are responsive to feedback and eager to learn,
so the immediate gratification I think is not very important.

I think the best way to handle code reviews for Beam is to give it our best
as reviewers, especially for first-time contributors; and if it feels like
the amount of required changes is too large for the contributor to handle,
then work with them to prioritize the changes, or start small and decompose
the contribution into more manageable pieces, but each merged piece must be
high-quality.


> I would be happy to help on TikaIO as I did during the first review round
> ;)
>
> Regards
> JB
>

Re: TikaIO concerns

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Sergey,

as discussed together during the review, I fully understand the choices you did.

Your plan sounds reasonable. Thanks !

Generally speaking, in order to give visibility and encourage contribution, I 
think it would make sense to accept a PR if it's basically right (even if it's 
not yet perfect) and doesn't break the build.

I would be happy to help on TikaIO as I did during the first review round ;)

Regards
JB


-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Great.  Thank you!


Re: TikaIO concerns

Posted by Chris Mattmann <ma...@apache.org>.
[dropping Beam on this]

Tim, another thing is that you can finally download the TREC-DD Polar data either
from the NSF Arctic Data Center (70GB zip), or from Amazon S3, as described here:

http://github.com/chrismattmann/trec-dd-polar/ 

In case we want to use as part of our regression.

Cheers,
Chris




On 9/22/17, 10:43 AM, "Allison, Timothy B." <ta...@mitre.org> wrote:

    >>1) We've gathered a TB of data from CommonCrawl and we run regression tests against this TB (thank you, Rackspace for hosting our vm!) to try to identify these problems.
    
    And if anyone with connections at a big company doing open source + cloud would be interested in floating us some storage and cycles,  we'd be happy to move off our single vm to increase coverage and improve the speed for our large-scale regression tests.  
    
    :D
    
    But seriously, thank you for this discussion and collaboration!
    
    Cheers,
    
             Tim
    
    



RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
>>1) We've gathered a TB of data from CommonCrawl and we run regression tests against this TB (thank you, Rackspace for hosting our vm!) to try to identify these problems.

And if anyone with connections at a big company doing open source + cloud would be interested in floating us some storage and cycles,  we'd be happy to move off our single vm to increase coverage and improve the speed for our large-scale regression tests.  

:D

But seriously, thank you for this discussion and collaboration!

Cheers,

         Tim


Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Please see comments below, and I'm positive this thread is nearly over :-)
On 22/09/17 22:49, Eugene Kirpichov wrote:
> On Fri, Sep 22, 2017 at 2:20 PM Sergey Beryozkin <sb...@gmail.com>
> wrote:
> 
>> Hi,
>> On 22/09/17 22:02, Eugene Kirpichov wrote:
>>> Sure - with hundreds of different file formats and the abundance of
>> weird /
>>> malformed / malicious files in the wild, it's quite expected that
>> sometimes
>>> the library will crash.
>>>
>>> Some kinds of issues are easier to address than others. We can catch
>>> exceptions and return a ParseResult representing a failure to parse this
>>> document. Addressing freezes and native JVM process crashes is much
>> harder
>>> and probably not necessary in the first version.
>>>
>>> Sergey - I think, the moment you introduce ParseResult into the code,
>> other
>>> changes I suggested will follow "by construction":
>>> - There'll be 1 ParseResult per document, containing filename, content
>> and
>>> metadata, since per discussion above it probably doesn't make sense to
>>> deliver these in separate PCollection elements
>>
>> I was still harboring the hope that may be using a container bean like
>> ParseResult (with the other changes you proposed) can somehow let us
>> stream from Tika into the pipeline.
>>
>> If it is 1 ParseResult per document then it means that until Tika has
>> parsed all the document the pipeline will not see it.
>>
> This is correct, and this is the API I'm suggesting to start with, because
> it's simple and sufficiently useful. I suggest to get into this state
> first, and then deal with creating a separate API that allows to not hold
> the entire parse result as a single PCollection element in memory. This
> should work fine for cases when each document's parse result (not the input
> document itself!) is up to a few hundred megabytes in size.
> 
+1. I was thinking about it yesterday evening and had to admit I had 
no real idea of what I wanted to achieve with the document being 
streamed through the pipeline - partly because my Beam knowledge is 
still pretty limited, but also because I had difficulties coming up 
with concrete use cases.
So yes, let's make the 'mainstream' case work well first.
> 
>>
>> I'm sorry if I may be starting to go in circles. But let me ask this.
>> How can a Beam user write a Beam function which will ensure the Tika
>> content pieces are seen ordered by the pipeline, without TikaIO ?
>>
> To answer this, I'd need you to clarify what you mean by "seen ordered by
> the pipeline" - order is a very vague term when it comes to parallel
> processing. What would you like the pipeline to compute that requires order
> within a document, but does NOT require having the contents of a document
> as a single String?
See above, I don't know :-). The case which I do like, and will work 
on a demo for at a later stage in a dedicated branch, is what I 
described earlier. I would use, say, FileIO to get a list of 1000s of 
matching PDFs, run that through Tika(IO), and have a function which 
outputs the list of matching PDFs (or other formats). For example: 
someone needs to find all the Word docs in a given online library 
which talk about some event. I think it won't matter in this case 
whether the individual lines are ordered or not; we have a link to the 
file name and that's enough...

But I'll return to this favourite case of mine later :-)
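To make that favourite case concrete, here is a plain-Java sketch of what such a pipeline would compute once Tika has extracted the text (the Beam version would use FileIO plus Tika; the class, method and file names here are purely illustrative):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FindMatchingDocs {
    // Stand-in for the text Tika would extract from each matched file.
    public static Map<String, String> extractedText() {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("report.docx", "Minutes of the eclipse viewing event in 2017");
        docs.put("invoice.pdf", "Amount due: 100 EUR");
        docs.put("notes.doc", "Planning for the eclipse event continues");
        return docs;
    }

    // The per-document logic: keep the file name if its content mentions the term.
    // Note it only needs the file-name-to-content link, not any line ordering.
    public static List<String> filesMentioning(Map<String, String> docs, String term) {
        return docs.entrySet().stream()
                .filter(e -> e.getValue().toLowerCase().contains(term.toLowerCase()))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Prints [report.docx, notes.doc]
        System.out.println(filesMentioning(extractedText(), "eclipse"));
    }
}
```

In Beam terms, filesMentioning is the ParDo that would follow Tika(IO) in the pipeline.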

> Or are you asking simply how can users use Tika for arbitrary use cases
> without TikaIO?

The latter: I was really interested in whether it is important for 
any of Beam IO's consumers that the individual data chunks come 
ordered or not, and, if it is, how that is achieved... Knowing that 
would help me/us consider what can possibly be done at a later stage.

If you'd like to talk about it later then it is OK...

Thanks for the help
Sergey
> 
> 
>>
>> May be knowing that will help coming up with the idea how to generalize
>> somehow with the help of TikaIO ?
>>
>>> - Since you're returning a single value per document, there's no reason
>> to
>>> use a BoundedReader
>>> - Likewise, there's no reason to use asynchronicity because you're not
>>> delivering the result incrementally
>>>
>>> I'd suggest to start the refactoring by removing the asynchronous
>> codepath,
>>> then converting from BoundedReader to ParDo or MapElements, then
>> converting
>>> from String to ParseResult.
>> This is a good plan, thanks, I guess at least for small documents it
>> should work well (unless I've misunderstood a ParseResult idea)
>>
>> Thanks, Sergey
>>>
>>> On Fri, Sep 22, 2017 at 12:10 PM Sergey Beryozkin <sb...@gmail.com>
>>> wrote:
>>>
>>>> Hi Tim, All
>>>> On 22/09/17 18:17, Allison, Timothy B. wrote:
>>>>> Y, I think you have it right.
>>>>>
>>>>>> Tika library has a big problem with crashes and freezes
>>>>>
>>>>> I wouldn't want to overstate it.  Crashes and freezes are exceedingly
>>>> rare, but when you are processing millions/billions of files in the wild
>>>> [1], they will happen.  We fix the problems or try to get our
>> dependencies
>>>> to fix the problems when we can,
>>>>
>>>> I only would like to add to this that IMHO it would be more correct to
>>>> state it's not a Tika library's 'fault' that the crashes might occur.
>>>> Tika does its best to get the latest libraries helping it to parse the
>>>> files, but indeed there will always be some file there that might use
>>>> some incomplete format specific tag etc which may cause the specific
>>>> parser to spin - but Tika will include the updated parser library asap.
>>>>
>>>> And with Beam's help the crashes that can kill the Tika jobs completely
>>>> will probably become a history...
>>>>
>>>> Cheers, Sergey
>>>>> but given our past history, I have no reason to believe that these
>>>> problems won't happen again.
>>>>>
>>>>> Thank you, again!
>>>>>
>>>>> Best,
>>>>>
>>>>>                Tim
>>>>>
>>>>> [1] Stuff on the internet or ... some of our users are forensics
>>>> examiners dealing with broken/corrupted files
>>>>>
>>>>> P.S./FTR  😊
>>>>> 1) We've gathered a TB of data from CommonCrawl and we run regression
>>>> tests against this TB (thank you, Rackspace for hosting our vm!) to try
>> to
>>>> identify these problems.
>>>>> 2) We've started a fuzzing effort to try to identify problems.
>>>>> 3) We added "tika-batch" for robust single box fileshare/fileshare
>>>> processing for our low volume users
>>>>> 4) We're trying to get the message out.  Thank you for working with
>> us!!!
>>>>>
>>>>> -----Original Message-----
>>>>> From: Eugene Kirpichov [mailto:kirpichov@google.com.INVALID]
>>>>> Sent: Friday, September 22, 2017 12:48 PM
>>>>> To: dev@beam.apache.org
>>>>> Cc: dev@tika.apache.org
>>>>> Subject: Re: TikaIO concerns
>>>>>
>>>>> Hi Tim,
>>>>>    From what you're saying it sounds like the Tika library has a big
>>>> problem with crashes and freezes, and when applying it at scale (eg. in
>> the
>>>> context of Beam) requires explicitly addressing this problem, eg.
>> accepting
>>>> the fact that in many realistic applications some documents will just
>> need
>>>> to be skipped because they are unprocessable? This would be first
>> example
>>>> of a Beam IO that has this concern, so I'd like to confirm that my
>>>> understanding is correct.
>>>>>
>>>>> On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <
>> tallison@mitre.org>
>>>>> wrote:
>>>>>
>>>>>> Reuven,
>>>>>>
>>>>>> Thank you!  This suggests to me that it is a good idea to integrate
>>>>>> Tika with Beam so that people don't have to 1) (re)discover the need
>>>>>> to make their wrappers robust and then 2) have to reinvent these
>>>>>> wheels for robustness.
>>>>>>
>>>>>> For kicks, see William Palmer's post on his toe-stubbing efforts with
>>>>>> Hadoop [1].  He and other Tika users independently have wound up
>>>>>> carrying out exactly your recommendation for 1) below.
>>>>>>
>>>>>> We have a MockParser that you can get to simulate regular exceptions,
>>>>>> OOMs and permanent hangs by asking Tika to parse a <mock> xml [2].
>>>>>>
>>>>>>> However if processing the document causes the process to crash, then
>>>>>>> it
>>>>>> will be retried.
>>>>>> Any ideas on how to get around this?
>>>>>>
>>>>>> Thank you again.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>>               Tim
>>>>>>
>>>>>> [1]
>>>>>>
>> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
>>>>>> eb-content-nanite/
>>>>>> [2]
>>>>>>
>>>>
>> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
>>>>>>
>>>>
>>>
>>
> 

Re: TikaIO concerns

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
On Fri, Sep 22, 2017 at 2:20 PM Sergey Beryozkin <sb...@gmail.com>
wrote:

> Hi,
> On 22/09/17 22:02, Eugene Kirpichov wrote:
> > Sure - with hundreds of different file formats and the abundance of
> weird /
> > malformed / malicious files in the wild, it's quite expected that
> sometimes
> > the library will crash.
> >
> > Some kinds of issues are easier to address than others. We can catch
> > exceptions and return a ParseResult representing a failure to parse this
> > document. Addressing freezes and native JVM process crashes is much
> harder
> > and probably not necessary in the first version.
> >
> > Sergey - I think, the moment you introduce ParseResult into the code,
> other
> > changes I suggested will follow "by construction":
> > - There'll be 1 ParseResult per document, containing filename, content
> and
> > metadata, since per discussion above it probably doesn't make sense to
> > deliver these in separate PCollection elements
>
> I was still harboring the hope that may be using a container bean like
> ParseResult (with the other changes you proposed) can somehow let us
> stream from Tika into the pipeline.
>
> If it is 1 ParseResult per document then it means that until Tika has
> parsed all the document the pipeline will not see it.
>
This is correct, and this is the API I'm suggesting to start with, because
it's simple and sufficiently useful. I suggest to get into this state
first, and then deal with creating a separate API that allows to not hold
the entire parse result as a single PCollection element in memory. This
should work fine for cases when each document's parse result (not the input
document itself!) is up to a few hundred megabytes in size.
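The ParseResult being discussed could be sketched as a simple value class - the name comes from the thread, but the exact shape below is an assumption, not a finalized API - holding the file name plus either the extracted content and metadata, or the parse error:

```java
import java.util.Collections;
import java.util.Map;

/** One parse outcome per input document: either content + metadata, or an error. */
public final class ParseResult {
    private final String fileName;
    private final String content;           // full extracted text; null on failure
    private final Map<String, String> metadata;
    private final Throwable error;          // null on success

    private ParseResult(String fileName, String content,
                        Map<String, String> metadata, Throwable error) {
        this.fileName = fileName;
        this.content = content;
        this.metadata = metadata;
        this.error = error;
    }

    public static ParseResult success(String fileName, String content,
                                      Map<String, String> metadata) {
        return new ParseResult(fileName, content, metadata, null);
    }

    public static ParseResult failure(String fileName, Throwable error) {
        return new ParseResult(fileName, null, Collections.emptyMap(), error);
    }

    public boolean isSuccess() { return error == null; }
    public String getFileName() { return fileName; }
    public String getContent()  { return content; }
    public Map<String, String> getMetadata() { return metadata; }
    public Throwable getError() { return error; }
}
```

A PCollection<ParseResult> then carries one such element per document, which is what makes the plain ParDo/MapElements implementation possible.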


>
> I'm sorry if I may be starting to go in circles. But let me ask this.
> How can a Beam user write a Beam function which will ensure the Tika
> content pieces are seen ordered by the pipeline, without TikaIO ?
>
To answer this, I'd need you to clarify what you mean by "seen ordered by
the pipeline" - order is a very vague term when it comes to parallel
processing. What would you like the pipeline to compute that requires order
within a document, but does NOT require having the contents of a document
as a single String?
Or are you asking simply how can users use Tika for arbitrary use cases
without TikaIO?


>
> May be knowing that will help coming up with the idea how to generalize
> somehow with the help of TikaIO ?
>
> > - Since you're returning a single value per document, there's no reason
> to
> > use a BoundedReader
> > - Likewise, there's no reason to use asynchronicity because you're not
> > delivering the result incrementally
> >
> > I'd suggest to start the refactoring by removing the asynchronous
> codepath,
> > then converting from BoundedReader to ParDo or MapElements, then
> converting
> > from String to ParseResult.
> This is a good plan, thanks, I guess at least for small documents it
> should work well (unless I've misunderstood a ParseResult idea)
>
> Thanks, Sergey
> >
> > On Fri, Sep 22, 2017 at 12:10 PM Sergey Beryozkin <sb...@gmail.com>
> > wrote:
> >
> >> Hi Tim, All
> >> On 22/09/17 18:17, Allison, Timothy B. wrote:
> >>> Y, I think you have it right.
> >>>
> >>>> Tika library has a big problem with crashes and freezes
> >>>
> >>> I wouldn't want to overstate it.  Crashes and freezes are exceedingly
> >> rare, but when you are processing millions/billions of files in the wild
> >> [1], they will happen.  We fix the problems or try to get our
> dependencies
> >> to fix the problems when we can,
> >>
> >> I only would like to add to this that IMHO it would be more correct to
> >> state it's not a Tika library's 'fault' that the crashes might occur.
> >> Tika does its best to get the latest libraries helping it to parse the
> >> files, but indeed there will always be some file there that might use
> >> some incomplete format specific tag etc which may cause the specific
> >> parser to spin - but Tika will include the updated parser library asap.
> >>
> >> And with Beam's help the crashes that can kill the Tika jobs completely
> >> will probably become a history...
> >>
> >> Cheers, Sergey
> >>> but given our past history, I have no reason to believe that these
> >> problems won't happen again.
> >>>
> >>> Thank you, again!
> >>>
> >>> Best,
> >>>
> >>>               Tim
> >>>
> >>> [1] Stuff on the internet or ... some of our users are forensics
> >> examiners dealing with broken/corrupted files
> >>>
> >>> P.S./FTR  😊
> >>> 1) We've gathered a TB of data from CommonCrawl and we run regression
> >> tests against this TB (thank you, Rackspace for hosting our vm!) to try
> to
> >> identify these problems.
> >>> 2) We've started a fuzzing effort to try to identify problems.
> >>> 3) We added "tika-batch" for robust single box fileshare/fileshare
> >> processing for our low volume users
> >>> 4) We're trying to get the message out.  Thank you for working with
> us!!!
> >>>
> >>> -----Original Message-----
> >>> From: Eugene Kirpichov [mailto:kirpichov@google.com.INVALID]
> >>> Sent: Friday, September 22, 2017 12:48 PM
> >>> To: dev@beam.apache.org
> >>> Cc: dev@tika.apache.org
> >>> Subject: Re: TikaIO concerns
> >>>
> >>> Hi Tim,
> >>>   From what you're saying it sounds like the Tika library has a big
> >> problem with crashes and freezes, and when applying it at scale (eg. in
> the
> >> context of Beam) requires explicitly addressing this problem, eg.
> accepting
> >> the fact that in many realistic applications some documents will just
> need
> >> to be skipped because they are unprocessable? This would be first
> example
> >> of a Beam IO that has this concern, so I'd like to confirm that my
> >> understanding is correct.
> >>>
> >>> On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <
> tallison@mitre.org>
> >>> wrote:
> >>>
> >>>> Reuven,
> >>>>
> >>>> Thank you!  This suggests to me that it is a good idea to integrate
> >>>> Tika with Beam so that people don't have to 1) (re)discover the need
> >>>> to make their wrappers robust and then 2) have to reinvent these
> >>>> wheels for robustness.
> >>>>
> >>>> For kicks, see William Palmer's post on his toe-stubbing efforts with
> >>>> Hadoop [1].  He and other Tika users independently have wound up
> >>>> carrying out exactly your recommendation for 1) below.
> >>>>
> >>>> We have a MockParser that you can get to simulate regular exceptions,
> >>>> OOMs and permanent hangs by asking Tika to parse a <mock> xml [2].
> >>>>
> >>>>> However if processing the document causes the process to crash, then
> >>>>> it
> >>>> will be retried.
> >>>> Any ideas on how to get around this?
> >>>>
> >>>> Thank you again.
> >>>>
> >>>> Cheers,
> >>>>
> >>>>              Tim
> >>>>
> >>>> [1]
> >>>>
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
> >>>> eb-content-nanite/
> >>>> [2]
> >>>>
> >>
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
> >>>>
> >>
> >
>

Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi,
On 22/09/17 22:02, Eugene Kirpichov wrote:
> Sure - with hundreds of different file formats and the abundance of weird /
> malformed / malicious files in the wild, it's quite expected that sometimes
> the library will crash.
> 
> Some kinds of issues are easier to address than others. We can catch
> exceptions and return a ParseResult representing a failure to parse this
> document. Addressing freezes and native JVM process crashes is much harder
> and probably not necessary in the first version.
> 
> Sergey - I think, the moment you introduce ParseResult into the code, other
> changes I suggested will follow "by construction":
> - There'll be 1 ParseResult per document, containing filename, content and
> metadata, since per discussion above it probably doesn't make sense to
> deliver these in separate PCollection elements

I was still harboring the hope that maybe using a container bean like 
ParseResult (with the other changes you proposed) could somehow let us 
stream from Tika into the pipeline.

If it is 1 ParseResult per document, then it means the pipeline will 
not see a document until Tika has parsed all of it.

I'm sorry if I'm starting to go in circles, but let me ask this: how 
can a Beam user write a Beam function which will ensure the Tika 
content pieces are seen ordered by the pipeline, without TikaIO?

Maybe knowing that will help with coming up with an idea of how to 
generalize somehow with the help of TikaIO?

> - Since you're returning a single value per document, there's no reason to
> use a BoundedReader
> - Likewise, there's no reason to use asynchronicity because you're not
> delivering the result incrementally
> 
> I'd suggest to start the refactoring by removing the asynchronous codepath,
> then converting from BoundedReader to ParDo or MapElements, then converting
> from String to ParseResult.
This is a good plan, thanks. I guess at least for small documents it 
should work well (unless I've misunderstood the ParseResult idea).

Thanks, Sergey
> 
> On Fri, Sep 22, 2017 at 12:10 PM Sergey Beryozkin <sb...@gmail.com>
> wrote:
> 
>> Hi Tim, All
>> On 22/09/17 18:17, Allison, Timothy B. wrote:
>>> Y, I think you have it right.
>>>
>>>> Tika library has a big problem with crashes and freezes
>>>
>>> I wouldn't want to overstate it.  Crashes and freezes are exceedingly
>> rare, but when you are processing millions/billions of files in the wild
>> [1], they will happen.  We fix the problems or try to get our dependencies
>> to fix the problems when we can,
>>
>> I only would like to add to this that IMHO it would be more correct to
>> state it's not a Tika library's 'fault' that the crashes might occur.
>> Tika does its best to get the latest libraries helping it to parse the
>> files, but indeed there will always be some file there that might use
>> some incomplete format specific tag etc which may cause the specific
>> parser to spin - but Tika will include the updated parser library asap.
>>
>> And with Beam's help the crashes that can kill the Tika jobs completely
>> will probably become a history...
>>
>> Cheers, Sergey
>>> but given our past history, I have no reason to believe that these
>> problems won't happen again.
>>>
>>> Thank you, again!
>>>
>>> Best,
>>>
>>>               Tim
>>>
>>> [1] Stuff on the internet or ... some of our users are forensics
>> examiners dealing with broken/corrupted files
>>>
>>> P.S./FTR  😊
>>> 1) We've gathered a TB of data from CommonCrawl and we run regression
>> tests against this TB (thank you, Rackspace for hosting our vm!) to try to
>> identify these problems.
>>> 2) We've started a fuzzing effort to try to identify problems.
>>> 3) We added "tika-batch" for robust single box fileshare/fileshare
>> processing for our low volume users
>>> 4) We're trying to get the message out.  Thank you for working with us!!!
>>>
>>> -----Original Message-----
>>> From: Eugene Kirpichov [mailto:kirpichov@google.com.INVALID]
>>> Sent: Friday, September 22, 2017 12:48 PM
>>> To: dev@beam.apache.org
>>> Cc: dev@tika.apache.org
>>> Subject: Re: TikaIO concerns
>>>
>>> Hi Tim,
>>>   From what you're saying it sounds like the Tika library has a big
>> problem with crashes and freezes, and when applying it at scale (eg. in the
>> context of Beam) requires explicitly addressing this problem, eg. accepting
>> the fact that in many realistic applications some documents will just need
>> to be skipped because they are unprocessable? This would be first example
>> of a Beam IO that has this concern, so I'd like to confirm that my
>> understanding is correct.
>>>
>>> On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <ta...@mitre.org>
>>> wrote:
>>>
>>>> Reuven,
>>>>
>>>> Thank you!  This suggests to me that it is a good idea to integrate
>>>> Tika with Beam so that people don't have to 1) (re)discover the need
>>>> to make their wrappers robust and then 2) have to reinvent these
>>>> wheels for robustness.
>>>>
>>>> For kicks, see William Palmer's post on his toe-stubbing efforts with
>>>> Hadoop [1].  He and other Tika users independently have wound up
>>>> carrying out exactly your recommendation for 1) below.
>>>>
>>>> We have a MockParser that you can get to simulate regular exceptions,
>>>> OOMs and permanent hangs by asking Tika to parse a <mock> xml [2].
>>>>
>>>>> However if processing the document causes the process to crash, then
>>>>> it
>>>> will be retried.
>>>> Any ideas on how to get around this?
>>>>
>>>> Thank you again.
>>>>
>>>> Cheers,
>>>>
>>>>              Tim
>>>>
>>>> [1]
>>>> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
>>>> eb-content-nanite/
>>>> [2]
>>>>
>> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
>>>>
>>
> 

Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi,
On 22/09/17 22:02, Eugene Kirpichov wrote:
> Sure - with hundreds of different file formats and the abundance of weird /
> malformed / malicious files in the wild, it's quite expected that sometimes
> the library will crash.
> 
> Some kinds of issues are easier to address than others. We can catch
> exceptions and return a ParseResult representing a failure to parse this
> document. Addressing freezes and native JVM process crashes is much harder
> and probably not necessary in the first version.
> 
> Sergey - I think, the moment you introduce ParseResult into the code, other
> changes I suggested will follow "by construction":
> - There'll be 1 ParseResult per document, containing filename, content and
> metadata, since per discussion above it probably doesn't make sense to
> deliver these in separate PCollection elements

I was still harboring the hope that maybe using a container bean like 
ParseResult (with the other changes you proposed) could somehow let us 
stream from Tika into the pipeline.

If it is 1 ParseResult per document, then the pipeline will not see any 
of a document's content until Tika has finished parsing all of it.

I'm sorry if I'm starting to go in circles, but let me ask this: how 
can a Beam user write a Beam function which ensures the Tika content 
pieces are seen in order by the pipeline, without TikaIO?

Maybe knowing that will help us come up with an idea of how to 
generalize it with the help of TikaIO?

> - Since you're returning a single value per document, there's no reason to
> use a BoundedReader
> - Likewise, there's no reason to use asynchronicity because you're not
> delivering the result incrementally
> 
> I'd suggest to start the refactoring by removing the asynchronous codepath,
> then converting from BoundedReader to ParDo or MapElements, then converting
> from String to ParseResult.
This is a good plan, thanks. I guess at least for small documents it 
should work well (unless I've misunderstood the ParseResult idea).

Thanks, Sergey
> 
> On Fri, Sep 22, 2017 at 12:10 PM Sergey Beryozkin <sb...@gmail.com>
> wrote:
> 
>> Hi Tim, All
>> On 22/09/17 18:17, Allison, Timothy B. wrote:
>>> Y, I think you have it right.
>>>
>>>> Tika library has a big problem with crashes and freezes
>>>
>>> I wouldn't want to overstate it.  Crashes and freezes are exceedingly
>> rare, but when you are processing millions/billions of files in the wild
>> [1], they will happen.  We fix the problems or try to get our dependencies
>> to fix the problems when we can,
>>
>> I only would like to add to this that IMHO it would be more correct to
>> state it's not a Tika library's 'fault' that the crashes might occur.
>> Tika does its best to get the latest libraries helping it to parse the
>> files, but indeed there will always be some file there that might use
>> some incomplete format specific tag etc which may cause the specific
>> parser to spin - but Tika will include the updated parser library asap.
>>
>> And with Beam's help the crashes that can kill the Tika jobs completely
>> will probably become a history...
>>
>> Cheers, Sergey
>>> but given our past history, I have no reason to believe that these
>> problems won't happen again.
>>>
>>> Thank you, again!
>>>
>>> Best,
>>>
>>>               Tim
>>>
>>> [1] Stuff on the internet or ... some of our users are forensics
>> examiners dealing with broken/corrupted files
>>>
>>> P.S./FTR  😊
>>> 1) We've gathered a TB of data from CommonCrawl and we run regression
>> tests against this TB (thank you, Rackspace for hosting our vm!) to try to
>> identify these problems.
>>> 2) We've started a fuzzing effort to try to identify problems.
>>> 3) We added "tika-batch" for robust single box fileshare/fileshare
>> processing for our low volume users
>>> 4) We're trying to get the message out.  Thank you for working with us!!!
>>>
>>> -----Original Message-----
>>> From: Eugene Kirpichov [mailto:kirpichov@google.com.INVALID]
>>> Sent: Friday, September 22, 2017 12:48 PM
>>> To: dev@beam.apache.org
>>> Cc: dev@tika.apache.org
>>> Subject: Re: TikaIO concerns
>>>
>>> Hi Tim,
>>>   From what you're saying it sounds like the Tika library has a big
>> problem with crashes and freezes, and when applying it at scale (eg. in the
>> context of Beam) requires explicitly addressing this problem, eg. accepting
>> the fact that in many realistic applications some documents will just need
>> to be skipped because they are unprocessable? This would be first example
>> of a Beam IO that has this concern, so I'd like to confirm that my
>> understanding is correct.
>>>
>>> On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <ta...@mitre.org>
>>> wrote:
>>>
>>>> Reuven,
>>>>
>>>> Thank you!  This suggests to me that it is a good idea to integrate
>>>> Tika with Beam so that people don't have to 1) (re)discover the need
>>>> to make their wrappers robust and then 2) have to reinvent these
>>>> wheels for robustness.
>>>>
>>>> For kicks, see William Palmer's post on his toe-stubbing efforts with
>>>> Hadoop [1].  He and other Tika users independently have wound up
>>>> carrying out exactly your recommendation for 1) below.
>>>>
>>>> We have a MockParser that you can get to simulate regular exceptions,
>>>> OOMs and permanent hangs by asking Tika to parse a <mock> xml [2].
>>>>
>>>>> However if processing the document causes the process to crash, then
>>>>> it
>>>> will be retried.
>>>> Any ideas on how to get around this?
>>>>
>>>> Thank you again.
>>>>
>>>> Cheers,
>>>>
>>>>              Tim
>>>>
>>>> [1]
>>>> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
>>>> eb-content-nanite/
>>>> [2]
>>>>
>> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
>>>>
>>
> 

Re: TikaIO concerns

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Sure - with hundreds of different file formats and the abundance of weird /
malformed / malicious files in the wild, it's quite expected that sometimes
the library will crash.

Some kinds of issues are easier to address than others. We can catch
exceptions and return a ParseResult representing a failure to parse this
document. Addressing freezes and native JVM process crashes is much harder
and probably not necessary in the first version.

Sergey - I think the moment you introduce ParseResult into the code, the other
changes I suggested will follow "by construction":
- There'll be 1 ParseResult per document, containing filename, content and
metadata, since per discussion above it probably doesn't make sense to
deliver these in separate PCollection elements
- Since you're returning a single value per document, there's no reason to
use a BoundedReader
- Likewise, there's no reason to use asynchronicity because you're not
delivering the result incrementally

I'd suggest to start the refactoring by removing the asynchronous codepath,
then converting from BoundedReader to ParDo or MapElements, then converting
from String to ParseResult.
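The proposed shape can be sketched outside Beam as a plain container; the class name matches the ParseResult discussed in this thread, but the fields and factory methods below are illustrative guesses, not the actual TikaIO API:

```java
// Sketch only: a hypothetical ParseResult container carrying one value per
// document (filename, content, metadata), plus an error for failed parses.
import java.util.Collections;
import java.util.Map;

public class ParseResult {
    public final String filename;
    public final String content;             // full extracted text of the document
    public final Map<String, String> metadata;
    public final Throwable error;            // non-null means the parse failed

    private ParseResult(String filename, String content,
                        Map<String, String> metadata, Throwable error) {
        this.filename = filename;
        this.content = content;
        this.metadata = metadata;
        this.error = error;
    }

    public static ParseResult success(String filename, String content,
                                      Map<String, String> metadata) {
        return new ParseResult(filename, content, metadata, null);
    }

    // Represents "a failure to parse this document" as a regular element,
    // rather than an exception that fails the whole bundle.
    public static ParseResult failure(String filename, Throwable error) {
        return new ParseResult(filename, null, Collections.emptyMap(), error);
    }

    public boolean isSuccess() { return error == null; }
}
```

Inside a ParDo or MapElements, the per-document function would then wrap the Tika parse call in a try/catch and return success(...) or failure(...), so one unprocessable document yields a failure element instead of crashing the pipeline.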

On Fri, Sep 22, 2017 at 12:10 PM Sergey Beryozkin <sb...@gmail.com>
wrote:

> Hi Tim, All
> On 22/09/17 18:17, Allison, Timothy B. wrote:
> > Y, I think you have it right.
> >
> >> Tika library has a big problem with crashes and freezes
> >
> > I wouldn't want to overstate it.  Crashes and freezes are exceedingly
> rare, but when you are processing millions/billions of files in the wild
> [1], they will happen.  We fix the problems or try to get our dependencies
> to fix the problems when we can,
>
> I only would like to add to this that IMHO it would be more correct to
> state it's not a Tika library's 'fault' that the crashes might occur.
> Tika does its best to get the latest libraries helping it to parse the
> files, but indeed there will always be some file there that might use
> some incomplete format specific tag etc which may cause the specific
> parser to spin - but Tika will include the updated parser library asap.
>
> And with Beam's help the crashes that can kill the Tika jobs completely
> will probably become a history...
>
> Cheers, Sergey
> > but given our past history, I have no reason to believe that these
> problems won't happen again.
> >
> > Thank you, again!
> >
> > Best,
> >
> >              Tim
> >
> > [1] Stuff on the internet or ... some of our users are forensics
> examiners dealing with broken/corrupted files
> >
> > P.S./FTR  😊
> > 1) We've gathered a TB of data from CommonCrawl and we run regression
> tests against this TB (thank you, Rackspace for hosting our vm!) to try to
> identify these problems.
> > 2) We've started a fuzzing effort to try to identify problems.
> > 3) We added "tika-batch" for robust single box fileshare/fileshare
> processing for our low volume users
> > 4) We're trying to get the message out.  Thank you for working with us!!!
> >
> > -----Original Message-----
> > From: Eugene Kirpichov [mailto:kirpichov@google.com.INVALID]
> > Sent: Friday, September 22, 2017 12:48 PM
> > To: dev@beam.apache.org
> > Cc: dev@tika.apache.org
> > Subject: Re: TikaIO concerns
> >
> > Hi Tim,
> >  From what you're saying it sounds like the Tika library has a big
> problem with crashes and freezes, and when applying it at scale (eg. in the
> context of Beam) requires explicitly addressing this problem, eg. accepting
> the fact that in many realistic applications some documents will just need
> to be skipped because they are unprocessable? This would be first example
> of a Beam IO that has this concern, so I'd like to confirm that my
> understanding is correct.
> >
> > On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <ta...@mitre.org>
> > wrote:
> >
> >> Reuven,
> >>
> >> Thank you!  This suggests to me that it is a good idea to integrate
> >> Tika with Beam so that people don't have to 1) (re)discover the need
> >> to make their wrappers robust and then 2) have to reinvent these
> >> wheels for robustness.
> >>
> >> For kicks, see William Palmer's post on his toe-stubbing efforts with
> >> Hadoop [1].  He and other Tika users independently have wound up
> >> carrying out exactly your recommendation for 1) below.
> >>
> >> We have a MockParser that you can get to simulate regular exceptions,
> >> OOMs and permanent hangs by asking Tika to parse a <mock> xml [2].
> >>
> >>> However if processing the document causes the process to crash, then
> >>> it
> >> will be retried.
> >> Any ideas on how to get around this?
> >>
> >> Thank you again.
> >>
> >> Cheers,
> >>
> >>             Tim
> >>
> >> [1]
> >> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
> >> eb-content-nanite/
> >> [2]
> >>
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
> >>
>

Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Tim, All
On 22/09/17 18:17, Allison, Timothy B. wrote:
> Y, I think you have it right.
> 
>> Tika library has a big problem with crashes and freezes
> 
> I wouldn't want to overstate it.  Crashes and freezes are exceedingly rare, but when you are processing millions/billions of files in the wild [1], they will happen.  We fix the problems or try to get our dependencies to fix the problems when we can,

I would only like to add that IMHO it would be more correct to say 
it's not the Tika library's 'fault' that the crashes might occur. 
Tika does its best to use the latest libraries that help it parse the 
files, but indeed there will always be some file out there that uses 
some incomplete format-specific tag etc. which may cause the specific 
parser to spin - but Tika will include the updated parser library asap.

And with Beam's help, the crashes that can kill Tika jobs completely 
will probably become history...

Cheers, Sergey
> but given our past history, I have no reason to believe that these problems won't happen again.
> 
> Thank you, again!
> 
> Best,
> 
>              Tim
> 
> [1] Stuff on the internet or ... some of our users are forensics examiners dealing with broken/corrupted files
> 
> P.S./FTR  😊
> 1) We've gathered a TB of data from CommonCrawl and we run regression tests against this TB (thank you, Rackspace for hosting our vm!) to try to identify these problems.
> 2) We've started a fuzzing effort to try to identify problems.
> 3) We added "tika-batch" for robust single box fileshare/fileshare processing for our low volume users
> 4) We're trying to get the message out.  Thank you for working with us!!!
> 
> -----Original Message-----
> From: Eugene Kirpichov [mailto:kirpichov@google.com.INVALID]
> Sent: Friday, September 22, 2017 12:48 PM
> To: dev@beam.apache.org
> Cc: dev@tika.apache.org
> Subject: Re: TikaIO concerns
> 
> Hi Tim,
>  From what you're saying it sounds like the Tika library has a big problem with crashes and freezes, and when applying it at scale (eg. in the context of Beam) requires explicitly addressing this problem, eg. accepting the fact that in many realistic applications some documents will just need to be skipped because they are unprocessable? This would be first example of a Beam IO that has this concern, so I'd like to confirm that my understanding is correct.
> 
> On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <ta...@mitre.org>
> wrote:
> 
>> Reuven,
>>
>> Thank you!  This suggests to me that it is a good idea to integrate
>> Tika with Beam so that people don't have to 1) (re)discover the need
>> to make their wrappers robust and then 2) have to reinvent these
>> wheels for robustness.
>>
>> For kicks, see William Palmer's post on his toe-stubbing efforts with
>> Hadoop [1].  He and other Tika users independently have wound up
>> carrying out exactly your recommendation for 1) below.
>>
>> We have a MockParser that you can get to simulate regular exceptions,
>> OOMs and permanent hangs by asking Tika to parse a <mock> xml [2].
>>
>>> However if processing the document causes the process to crash, then
>>> it
>> will be retried.
>> Any ideas on how to get around this?
>>
>> Thank you again.
>>
>> Cheers,
>>
>>             Tim
>>
>> [1]
>> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
>> eb-content-nanite/
>> [2]
>> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
>>

RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
>>1) We've gathered a TB of data from CommonCrawl and we run regression tests against this TB (thank you, Rackspace for hosting our vm!) to try to identify these problems.

And if anyone with connections at a big company doing open source + cloud would be interested in floating us some storage and cycles,  we'd be happy to move off our single vm to increase coverage and improve the speed for our large-scale regression tests.  

:D

But seriously, thank you for this discussion and collaboration!

Cheers,

         Tim


RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Y, I think you have it right.

> Tika library has a big problem with crashes and freezes

I wouldn't want to overstate it.  Crashes and freezes are exceedingly rare, but when you are processing millions/billions of files in the wild [1], they will happen.  We fix the problems or try to get our dependencies to fix the problems when we can, but given our past history, I have no reason to believe that these problems won't happen again.

Thank you, again!

Best,

            Tim

[1] Stuff on the internet or ... some of our users are forensics examiners dealing with broken/corrupted files

P.S./FTR  😊
1) We've gathered a TB of data from CommonCrawl and we run regression tests against this TB (thank you, Rackspace for hosting our vm!) to try to identify these problems. 
2) We've started a fuzzing effort to try to identify problems.
3) We added "tika-batch" for robust single box fileshare/fileshare processing for our low volume users 
4) We're trying to get the message out.  Thank you for working with us!!!

-----Original Message-----
From: Eugene Kirpichov [mailto:kirpichov@google.com.INVALID] 
Sent: Friday, September 22, 2017 12:48 PM
To: dev@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi Tim,
From what you're saying it sounds like the Tika library has a big problem with crashes and freezes, and when applying it at scale (eg. in the context of Beam) requires explicitly addressing this problem, eg. accepting the fact that in many realistic applications some documents will just need to be skipped because they are unprocessable? This would be first example of a Beam IO that has this concern, so I'd like to confirm that my understanding is correct.

On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <ta...@mitre.org>
wrote:

> Reuven,
>
> Thank you!  This suggests to me that it is a good idea to integrate 
> Tika with Beam so that people don't have to 1) (re)discover the need 
> to make their wrappers robust and then 2) have to reinvent these 
> wheels for robustness.
>
> For kicks, see William Palmer's post on his toe-stubbing efforts with 
> Hadoop [1].  He and other Tika users independently have wound up 
> carrying out exactly your recommendation for 1) below.
>
> We have a MockParser that you can get to simulate regular exceptions, 
> OOMs and permanent hangs by asking Tika to parse a <mock> xml [2].
>
> > However if processing the document causes the process to crash, then 
> > it
> will be retried.
> Any ideas on how to get around this?
>
> Thank you again.
>
> Cheers,
>
>            Tim
>
> [1]
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
> eb-content-nanite/
> [2]
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
>

Re: TikaIO concerns

Posted by Reuven Lax <re...@google.com.INVALID>.
This is similar to what I suggested. It will not handle crashes and
freezes well, however.

On Fri, Sep 22, 2017 at 10:24 AM, Ben Chambers <bc...@apache.org> wrote:

> BigQueryIO allows a side-output for elements that failed to be inserted
> when using the Streaming BigQuery sink:
>
> https://github.com/apache/beam/blob/master/sdks/java/io/
> google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/
> StreamingWriteTables.java#L92
>
> This follows the pattern of a DoFn with multiple outputs, as described here
> https://cloud.google.com/blog/big-data/2016/01/handling-
> invalid-inputs-in-dataflow
>
> So, the DoFn that runs the Tika code could be configured in terms of how
> different failures should be handled, with the option of just outputting
> them to a different PCollection that is then processed in some other way.
>
> On Fri, Sep 22, 2017 at 10:18 AM Allison, Timothy B. <ta...@mitre.org>
> wrote:
>
> > Do tell...
> >
> > Interesting.  Any pointers?
> >
> > -----Original Message-----
> > From: Ben Chambers [mailto:bchambers@google.com.INVALID]
> > Sent: Friday, September 22, 2017 12:50 PM
> > To: dev@beam.apache.org
> > Cc: dev@tika.apache.org
> > Subject: Re: TikaIO concerns
> >
> > Regarding specifically elements that are failing -- I believe some other
> > IO has used the concept of a "Dead Letter" side-output,, where documents
> > that failed to process are side-output so the user can handle them
> > appropriately.
> >
> > On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov
> > <ki...@google.com.invalid> wrote:
> >
> > > Hi Tim,
> > > From what you're saying it sounds like the Tika library has a big
> > > problem with crashes and freezes, and when applying it at scale (eg.
> > > in the context of Beam) requires explicitly addressing this problem,
> > > eg. accepting the fact that in many realistic applications some
> > > documents will just need to be skipped because they are unprocessable?
> > > This would be first example of a Beam IO that has this concern, so I'd
> > > like to confirm that my understanding is correct.
> > >
> > > On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B.
> > > <ta...@mitre.org>
> > > wrote:
> > >
> > > > Reuven,
> > > >
> > > > Thank you!  This suggests to me that it is a good idea to integrate
> > > > Tika with Beam so that people don't have to 1) (re)discover the need
> > > > to make their wrappers robust and then 2) have to reinvent these
> > > > wheels for robustness.
> > > >
> > > > For kicks, see William Palmer's post on his toe-stubbing efforts
> > > > with Hadoop [1].  He and other Tika users independently have wound
> > > > up carrying out exactly your recommendation for 1) below.
> > > >
> > > > We have a MockParser that you can get to simulate regular
> > > > exceptions,
> > > OOMs
> > > > and permanent hangs by asking Tika to parse a <mock> xml [2].
> > > >
> > > > > However if processing the document causes the process to crash,
> > > > > then it
> > > > will be retried.
> > > > Any ideas on how to get around this?
> > > >
> > > > Thank you again.
> > > >
> > > > Cheers,
> > > >
> > > >            Tim
> > > >
> > > > [1]
> > > >
> > > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-w
> > > eb-content-nanite/
> > > > [2]
> > > >
> > > https://github.com/apache/tika/blob/master/tika-parsers/src/test/resou
> > > rces/test-documents/mock/example.xml
> > > >
> > >
> >
>

RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Nice!  Thank you!

-----Original Message-----
From: Ben Chambers [mailto:bchambers@apache.org] 
Sent: Friday, September 22, 2017 1:24 PM
To: dev@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

BigQueryIO allows a side-output for elements that failed to be inserted when using the Streaming BigQuery sink:

https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L92

This follows the pattern of a DoFn with multiple outputs, as described here https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow

So, the DoFn that runs the Tika code could be configured in terms of how different failures should be handled, with the option of just outputting them to a different PCollection that is then processed in some other way.

On Fri, Sep 22, 2017 at 10:18 AM Allison, Timothy B. <ta...@mitre.org>
wrote:

> Do tell...
>
> Interesting.  Any pointers?
>
> -----Original Message-----
> From: Ben Chambers [mailto:bchambers@google.com.INVALID]
> Sent: Friday, September 22, 2017 12:50 PM
> To: dev@beam.apache.org
> Cc: dev@tika.apache.org
> Subject: Re: TikaIO concerns
>
> Regarding specifically elements that are failing -- I believe some 
> other IO has used the concept of a "Dead Letter" side-output,, where 
> documents that failed to process are side-output so the user can 
> handle them appropriately.
>
> On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov 
> <ki...@google.com.invalid> wrote:
>
> > Hi Tim,
> > From what you're saying it sounds like the Tika library has a big 
> > problem with crashes and freezes, and when applying it at scale (eg.
> > in the context of Beam) requires explicitly addressing this problem, 
> > eg. accepting the fact that in many realistic applications some 
> > documents will just need to be skipped because they are unprocessable?
> > This would be first example of a Beam IO that has this concern, so 
> > I'd like to confirm that my understanding is correct.
> >
> > On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B.
> > <ta...@mitre.org>
> > wrote:
> >
> > > Reuven,
> > >
> > > Thank you!  This suggests to me that it is a good idea to 
> > > integrate Tika with Beam so that people don't have to 1) 
> > > (re)discover the need to make their wrappers robust and then 2) 
> > > have to reinvent these wheels for robustness.
> > >
> > > For kicks, see William Palmer's post on his toe-stubbing efforts 
> > > with Hadoop [1].  He and other Tika users independently have wound 
> > > up carrying out exactly your recommendation for 1) below.
> > >
> > > We have a MockParser that you can get to simulate regular 
> > > exceptions,
> > OOMs
> > > and permanent hangs by asking Tika to parse a <mock> xml [2].
> > >
> > > > However if processing the document causes the process to crash, 
> > > > then it
> > > will be retried.
> > > Any ideas on how to get around this?
> > >
> > > Thank you again.
> > >
> > > Cheers,
> > >
> > >            Tim
> > >
> > > [1]
> > >
> > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising
> > -w
> > eb-content-nanite/
> > > [2]
> > >
> > https://github.com/apache/tika/blob/master/tika-parsers/src/test/res
> > ou rces/test-documents/mock/example.xml
> > >
> >
>

Re: TikaIO concerns

Posted by Ben Chambers <bc...@apache.org>.
BigQueryIO allows a side-output for elements that failed to be inserted
when using the Streaming BigQuery sink:

https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L92

This follows the pattern of a DoFn with multiple outputs, as described here
https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow

So, the DoFn that runs the Tika code could be configured in terms of how
different failures should be handled, with the option of just outputting
them to a different PCollection that is then processed in some other way.
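The routing idea can be sketched without the Beam SDK; this models the two output PCollections as plain lists (in real Beam code they would be side outputs addressed via TupleTag), purely as an illustration:

```java
// Sketch only: the "dead letter" routing pattern from the linked blog post,
// modelled with plain collections instead of Beam side outputs.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class DeadLetterDemo {
    public final List<String> parsed = new ArrayList<>();       // main output
    public final List<String> deadLetters = new ArrayList<>();  // failed inputs

    // Try to process each document; failures go to the dead-letter list
    // for separate handling instead of failing the whole batch.
    public void processAll(List<String> docs, Function<String, String> parser) {
        for (String doc : docs) {
            try {
                parsed.add(parser.apply(doc));
            } catch (Exception e) {
                deadLetters.add(doc);
            }
        }
    }
}
```

In an actual DoFn, the catch branch would call output(deadLetterTag, doc) so the failed documents land in their own PCollection, which the user can log, retry, or store as configured.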

On Fri, Sep 22, 2017 at 10:18 AM Allison, Timothy B. <ta...@mitre.org>
wrote:

> Do tell...
>
> Interesting.  Any pointers?
>
> -----Original Message-----
> From: Ben Chambers [mailto:bchambers@google.com.INVALID]
> Sent: Friday, September 22, 2017 12:50 PM
> To: dev@beam.apache.org
> Cc: dev@tika.apache.org
> Subject: Re: TikaIO concerns
>
> Regarding specifically elements that are failing -- I believe some other
> IO has used the concept of a "Dead Letter" side-output,, where documents
> that failed to process are side-output so the user can handle them
> appropriately.
>
> On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov
> <ki...@google.com.invalid> wrote:
>
> > Hi Tim,
> > From what you're saying it sounds like the Tika library has a big
> > problem with crashes and freezes, and when applying it at scale (eg.
> > in the context of Beam) requires explicitly addressing this problem,
> > eg. accepting the fact that in many realistic applications some
> > documents will just need to be skipped because they are unprocessable?
> > This would be first example of a Beam IO that has this concern, so I'd
> > like to confirm that my understanding is correct.
> >
> > On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B.
> > <ta...@mitre.org>
> > wrote:
> >
> > > Reuven,
> > >
> > > Thank you!  This suggests to me that it is a good idea to integrate
> > > Tika with Beam so that people don't have to 1) (re)discover the need
> > > to make their wrappers robust and then 2) have to reinvent these
> > > wheels for robustness.
> > >
> > > For kicks, see William Palmer's post on his toe-stubbing efforts
> > > with Hadoop [1].  He and other Tika users independently have wound
> > > up carrying out exactly your recommendation for 1) below.
> > >
> > > We have a MockParser that you can get to simulate regular
> > > exceptions,
> > OOMs
> > > and permanent hangs by asking Tika to parse a <mock> xml [2].
> > >
> > > > However if processing the document causes the process to crash,
> > > > then it
> > > will be retried.
> > > Any ideas on how to get around this?
> > >
> > > Thank you again.
> > >
> > > Cheers,
> > >
> > >            Tim
> > >
> > > [1]
> > >
> > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> > > [2]
> > >
> > https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
> > >
> >
>


RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Do tell...

Interesting.  Any pointers?

-----Original Message-----
From: Ben Chambers [mailto:bchambers@google.com.INVALID] 
Sent: Friday, September 22, 2017 12:50 PM
To: dev@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Regarding specifically elements that are failing -- I believe some other IO has used the concept of a "Dead Letter" side-output,, where documents that failed to process are side-output so the user can handle them appropriately.

On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov <ki...@google.com.invalid> wrote:

> Hi Tim,
> From what you're saying it sounds like the Tika library has a big 
> problem with crashes and freezes, and when applying it at scale (eg. 
> in the context of Beam) requires explicitly addressing this problem, 
> eg. accepting the fact that in many realistic applications some 
> documents will just need to be skipped because they are unprocessable? 
> This would be first example of a Beam IO that has this concern, so I'd 
> like to confirm that my understanding is correct.
>
> On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. 
> <ta...@mitre.org>
> wrote:
>
> > Reuven,
> >
> > Thank you!  This suggests to me that it is a good idea to integrate 
> > Tika with Beam so that people don't have to 1) (re)discover the need 
> > to make their wrappers robust and then 2) have to reinvent these 
> > wheels for robustness.
> >
> > For kicks, see William Palmer's post on his toe-stubbing efforts 
> > with Hadoop [1].  He and other Tika users independently have wound 
> > up carrying out exactly your recommendation for 1) below.
> >
> > We have a MockParser that you can get to simulate regular 
> > exceptions,
> OOMs
> > and permanent hangs by asking Tika to parse a <mock> xml [2].
> >
> > > However if processing the document causes the process to crash, 
> > > then it
> > will be retried.
> > Any ideas on how to get around this?
> >
> > Thank you again.
> >
> > Cheers,
> >
> >            Tim
> >
> > [1]
> >
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> > [2]
> >
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
> >
>


Re: TikaIO concerns

Posted by Ben Chambers <bc...@google.com.INVALID>.
Regarding specifically elements that are failing -- I believe some other IO
has used the concept of a "Dead Letter" side-output, where documents that
failed to process are side-output so the user can handle them appropriately.

On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov
<ki...@google.com.invalid> wrote:

> Hi Tim,
> From what you're saying it sounds like the Tika library has a big problem
> with crashes and freezes, and when applying it at scale (eg. in the context
> of Beam) requires explicitly addressing this problem, eg. accepting the
> fact that in many realistic applications some documents will just need to
> be skipped because they are unprocessable? This would be first example of a
> Beam IO that has this concern, so I'd like to confirm that my understanding
> is correct.
>
> On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <ta...@mitre.org>
> wrote:
>
> > Reuven,
> >
> > Thank you!  This suggests to me that it is a good idea to integrate Tika
> > with Beam so that people don't have to 1) (re)discover the need to make
> > their wrappers robust and then 2) have to reinvent these wheels for
> > robustness.
> >
> > For kicks, see William Palmer's post on his toe-stubbing efforts with
> > Hadoop [1].  He and other Tika users independently have wound up carrying
> > out exactly your recommendation for 1) below.
> >
> > We have a MockParser that you can get to simulate regular exceptions,
> OOMs
> > and permanent hangs by asking Tika to parse a <mock> xml [2].
> >
> > > However if processing the document causes the process to crash, then it
> > will be retried.
> > Any ideas on how to get around this?
> >
> > Thank you again.
> >
> > Cheers,
> >
> >            Tim
> >
> > [1]
> >
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> > [2]
> >
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
> >
>

RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Y, I think you have it right.

> Tika library has a big problem with crashes and freezes

I wouldn't want to overstate it.  Crashes and freezes are exceedingly rare, but when you are processing millions/billions of files in the wild [1], they will happen.  We fix the problems or try to get our dependencies to fix the problems when we can, but given our past history, I have no reason to believe that these problems won't happen again.

Thank you, again!

Best,

            Tim

[1] Stuff on the internet or ... some of our users are forensics examiners dealing with broken/corrupted files

P.S./FTR  😊
1) We've gathered a TB of data from CommonCrawl and we run regression tests against this TB (thank you, Rackspace for hosting our vm!) to try to identify these problems. 
2) We've started a fuzzing effort to try to identify problems.
3) We added "tika-batch" for robust single box fileshare/fileshare processing for our low volume users 
4) We're trying to get the message out.  Thank you for working with us!!!

-----Original Message-----
From: Eugene Kirpichov [mailto:kirpichov@google.com.INVALID] 
Sent: Friday, September 22, 2017 12:48 PM
To: dev@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi Tim,
From what you're saying it sounds like the Tika library has a big problem with crashes and freezes, and when applying it at scale (eg. in the context of Beam) requires explicitly addressing this problem, eg. accepting the fact that in many realistic applications some documents will just need to be skipped because they are unprocessable? This would be first example of a Beam IO that has this concern, so I'd like to confirm that my understanding is correct.

On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <ta...@mitre.org>
wrote:

> Reuven,
>
> Thank you!  This suggests to me that it is a good idea to integrate 
> Tika with Beam so that people don't have to 1) (re)discover the need 
> to make their wrappers robust and then 2) have to reinvent these 
> wheels for robustness.
>
> For kicks, see William Palmer's post on his toe-stubbing efforts with 
> Hadoop [1].  He and other Tika users independently have wound up 
> carrying out exactly your recommendation for 1) below.
>
> We have a MockParser that you can get to simulate regular exceptions, 
> OOMs and permanent hangs by asking Tika to parse a <mock> xml [2].
>
> > However if processing the document causes the process to crash, then 
> > it
> will be retried.
> Any ideas on how to get around this?
>
> Thank you again.
>
> Cheers,
>
>            Tim
>
> [1]
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> [2]
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
>


Re: TikaIO concerns

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Hi Tim,
From what you're saying it sounds like the Tika library has a big problem
with crashes and freezes, and when applying it at scale (eg. in the context
of Beam) requires explicitly addressing this problem, eg. accepting the
fact that in many realistic applications some documents will just need to
be skipped because they are unprocessable? This would be the first example of a
Beam IO that has this concern, so I'd like to confirm that my understanding
is correct.

On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <ta...@mitre.org>
wrote:

> Reuven,
>
> Thank you!  This suggests to me that it is a good idea to integrate Tika
> with Beam so that people don't have to 1) (re)discover the need to make
> their wrappers robust and then 2) have to reinvent these wheels for
> robustness.
>
> For kicks, see William Palmer's post on his toe-stubbing efforts with
> Hadoop [1].  He and other Tika users independently have wound up carrying
> out exactly your recommendation for 1) below.
>
> We have a MockParser that you can get to simulate regular exceptions, OOMs
> and permanent hangs by asking Tika to parse a <mock> xml [2].
>
> > However if processing the document causes the process to crash, then it
> will be retried.
> Any ideas on how to get around this?
>
> Thank you again.
>
> Cheers,
>
>            Tim
>
> [1]
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> [2]
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
>

RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Reuven,

Thank you!  This suggests to me that it is a good idea to integrate Tika with Beam so that people don't have to 1) (re)discover the need to make their wrappers robust and then 2) have to reinvent these wheels for robustness.  

For kicks, see William Palmer's post on his toe-stubbing efforts with Hadoop [1].  He and other Tika users independently have wound up carrying out exactly your recommendation for 1) below. 

We have a MockParser that you can use to simulate regular exceptions, OOMs
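For anyone curious, such a mock document is just XML containing instructions for the parser, along these lines (reconstructed from memory of the Tika test resources, so the element names here may not be exact; the example.xml linked in [2] is authoritative):

```xml
<mock>
  <!-- emit some normal content, then misbehave -->
  <write element="p">some content</write>
  <throw class="java.io.IOException">simulated parse failure</throw>
  <!-- or instead: hang for 30s, or trigger an OutOfMemoryError -->
  <hang millis="30000" heavy="false" />
  <oom/>
</mock>
```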

> However if processing the document causes the process to crash, then it will be retried.
Any ideas on how to get around this?

Thank you again.

Cheers,

           Tim

[1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml 


Re: TikaIO concerns

Posted by Reuven Lax <re...@google.com.INVALID>.
The answer will be different for the different Beam runners, and even then
probably different in batch and streaming runners.

On Fri, Sep 22, 2017 at 5:01 AM, Allison, Timothy B. <ta...@mitre.org>
wrote:

> @Eugene: What's the best way to have Beam help us with these issues, or do
> these come for free with the Beam framework?
>
> 1) a process-level timeout (because you can't actually kill a thread in
> Java)
>

While some runners might do this, many runners process many items in
parallel on different threads. If this is necessary, the user code
processing Tika should do it itself (e..g delegate processing to a new
worker thread and kill the process if the worker thread exceeds some
timeout).
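To make that first point concrete, here is a minimal, Beam-agnostic sketch of the worker-thread timeout idea using plain JDK concurrency (the Callable is a stand-in for the actual Tika parse call; the class and method names are mine):

```java
import java.util.concurrent.*;

public class TimeoutGuard {
    // Run a potentially hanging task on a worker thread, giving up after a timeout.
    // Returns the parse result, or null if the task did not finish in time.
    static String parseWithTimeout(Callable<String> parse, long timeoutMillis)
            throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<String> result = worker.submit(parse);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true);  // interrupt the worker if it responds to interrupts
            return null;
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A fast "parse" completes normally.
        String ok = parseWithTimeout(() -> "parsed text", 1000);
        // A "parse" that hangs is abandoned after 200ms.
        String hung = parseWithTimeout(
            () -> { Thread.sleep(60_000); return "never"; }, 200);
        System.out.println(ok + "," + hung);  // prints: parsed text,null
    }
}
```

A truly wedged parser thread may ignore interruption entirely, which is why the process-level restart in 2) is still the backstop.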


> 2) a process-level restart on OOM
>

I believe all current runners restart processes on any crash.


> 3) avoid trying to reprocess a badly behaving document
>

There's no obvious way to do this. If an exception is thrown while
processing a document, you can catch the exception and skip the document.
However if processing the document causes the process to crash, then it
will be retried.

RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
@Eugene: What's the best way to have Beam help us with these issues, or do these come for free with the Beam framework? 

1) a process-level timeout (because you can't actually kill a thread in Java)
2) a process-level restart on OOM
3) avoid trying to reprocess a badly behaving document


Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Ben - yes, something like that would work for ReadableFile consumers 
to be able to choose.

Cheers, Sergey
On 04/10/17 22:51, Ben Chambers wrote:
> It looks like ReadableFile#open does currently decompress the stream, but
> it seems like we could add a ReadableFile#openRaw(...) or something like
> that which didn't implicitly decompress. Then libraries such as Tika which
> want the *actual* file content could use that method. Would that address
> your concerns?
> 
> https://github.com/apache/beam/blob/393e5631054a81ae1fdcd304f81cc68cf53d3422/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L131
> 
> On Wed, Oct 4, 2017 at 2:42 PM Sergey Beryozkin <sb...@gmail.com>
> wrote:
> 
>> Wait, but what about Tika doing checks like Zip bombs, etc ? Tika is
>> expected to decompress itself, while ReadableFile has the content
>> decompressed.
>>
>> The other point is that Tika reports the names of the zipped files too,
>> in the content, as you can see from TikaIOTest#readZippedPdfFile.
>>
>> Can we assume that if Metadata does not point to the local file then it
>> can be opened as a URL stream ? The same issue affects TikaConfig, so
>> I'd rather have a solution which will work for MatchResult.Metadata and
>> TikaConfig
>>
>> Thanks, Sergey
>> On 04/10/17 22:02, Sergey Beryozkin wrote:
>>> Good point...
>>>
>>> Sergey
>>>
>>> On 04/10/17 18:24, Eugene Kirpichov wrote:
>>>> Can TikaInputStream consume a regular InputStream? If so, you can
>>>> apply it
>>>> to Channels.newInputStream(channel). If not, applying it to the filename
>>>> extracted from Metadata won't work either because it can point to a file
>>>> that's not on the local disk.
>>>>
>>>> On Wed, Oct 4, 2017, 10:08 AM Sergey Beryozkin <sb...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm starting moving toward
>>>>>
>>>>> class TikaIO {
>>>>>      public static ParseAllToString parseAllToString() {..}
>>>>>      class ParseAllToString extends
>> PTransform<PCollection<ReadableFile>,
>>>>> PCollection<ParseResult>> {
>>>>>        ...configuration properties...
>>>>>        expand {
>>>>>          return input.apply(ParDo.of(new ParseToStringFn))
>>>>>        }
>>>>>        class ParseToStringFn extends DoFn<...> {...}
>>>>>      }
>>>>> }
>>>>>
>>>>> as suggested by Eugene
>>>>>
>>>>> The initial migration seems to work fine, except that ReadableFile and
>>>>> in particular, ReadableByteChannel can not be consumed by
>>>>> TikaInputStream yet (I'll open an enhancement request), besides, it's
>>>>> better let Tika to unzip if needed given that a lot of effort went in
>>>>> Tika into detecting zip security issues...
>>>>>
>>>>> So I'm typing it as
>>>>>
>>>>> class ParseAllToString extends
>>>>> PTransform<PCollection<MatchResult.Metadata>, PCollection<ParseResult>>
>>>>>
>>>>> Cheers, Sergey
>>>>>
>>>>> On 02/10/17 12:03, Sergey Beryozkin wrote:
>>>>>> Thanks for the review, please see the last comment:
>>>>>>
>>>>>> https://github.com/apache/beam/pull/3835#issuecomment-333502388
>>>>>>
>>>>>> (sorry for the possible duplication - but I'm not sure that GitHub
>> will
>>>>>> propagate it - as I can not see a comment there that I left on
>>>>>> Saturday).
>>>>>>
>>>>>> Cheers, Sergey
>>>>>> On 29/09/17 10:21, Sergey Beryozkin wrote:
>>>>>>> Hi
>>>>>>> On 28/09/17 17:09, Eugene Kirpichov wrote:
>>>>>>>> Hi! Glad the refactoring is happening, thanks!
>>>>>>>
>>>>>>> Thanks for getting me focused on having TikaIO supporting the simpler
>>>>>>> (and practical) cases first :-)
>>>>>>>> It was auto-assigned to Reuven as formal owner of the component. I
>>>>>>>> reassigned it to you.
>>>>>>> OK, thanks...
>>>>>>>>
>>>>>>>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin
>>>>>>>> <sberyozkin@gmail.com
>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> I started looking at
>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-2994
>>>>>>>>>
>>>>>>>>> and pushed some initial code to my tikaio branch introducing
>>>>>>>>> ParseResult
>>>>>>>>> and updating the tests but keeping the BounderSource/Reader,
>>>>>>>>> dropping
>>>>>>>>> the asynchronous parsing code, and few other bits.
>>>>>>>>>
>>>>>>>>> Just noticed it is assigned to Reuven - does it mean Reuven is
>>>>>>>>> looking
>>>>>>>>> into it too or was it auto-assigned ?
>>>>>>>>>
>>>>>>>>> I don't mind, would it make sense for me to do an 'interim' PR on
>>>>>>>>> what've done so far before completely removing BoundedSource/Reader
>>>>>>>>> based code ?
>>>>>>>>>
>>>>>>>> Yes :)
>>>>>>>>
>>>>>>> I did commit yesterday to my branch, and it made its way to the
>>>>>>> pending PR (which I forgot about) where I only tweaked a couple of
>> doc
>>>>>>> typos, so I renamed that PR:
>>>>>>>
>>>>>>> https://github.com/apache/beam/pull/3835
>>>>>>>
>>>>>>> (The build failures are apparently due to the build timeouts)
>>>>>>>
>>>>>>> As I mentioned, in this PR I updated the existing TikaIO test to work
>>>>>>> with ParseResult, at the moment a file location as its property. Only
>>>>>>> a file name can easily be saved, I thought it might be important
>> where
>>>>>>> on the network the file is - may be copy it afterwards if needed,
>> etc.
>>>>>>> I'd also have no problems with having it typed as a K key, was only
>>>>>>> trying to make it a bit simpler at the start.
>>>>>>>
>>>>>>> I'll deal with the new configurations after a switch. TikaConfig
>> would
>>>>>>> most likely still need to be supported but I recall you mentioned the
>>>>>>> way it's done now will make it work only with the direct runner. I
>>>>>>> guess I can load it as a URL resource... The other bits like
>> providing
>>>>>>> custom content handlers, parsers, input metadata, may be setting the
>>>>>>> max size of the files, etc, can all be added after a switch.
>>>>>>>
>>>>>>> Note I haven't dealt with a number of your comments to the original
>>>>>>> code which can still be dealt with in the current code - given that
>>>>>>> most of that code will go with the next PR anyway.
>>>>>>>
>>>>>>> Please review or merge if it looks like it is a step in the right
>>>>>>> direction...
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have another question anyway,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> E.g. TikaIO could:
>>>>>>>>>> - take as input a PCollection<ReadableFile>
>>>>>>>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where
>>>>>>>>>> ParseResult
>>>>>>>>>> is a class with properties { String content, Metadata metadata }
>>>>>>>>>> - be configured by: a Parser (it implements Serializable so can be
>>>>>>>>>> specified at pipeline construction time) and a ContentHandler
>> whose
>>>>>>>>>> toString() will go into "content". ContentHandler does not
>>>>>>>>>> implement
>>>>>>>>>> Serializable, so you can not specify it at construction time -
>>>>>>>>>> however,
>>>>>>>>> you
>>>>>>>>>> can let the user specify either its class (if it's a simple
>> handler
>>>>>>>>>> like
>>>>>>>>> a
>>>>>>>>>> BodyContentHandler) or specify a lambda for creating the handler
>>>>>>>>>> (SerializableFunction<Void, ContentHandler>), and potentially
>>>>>>>>>> you can
>>>>>>>>> have
>>>>>>>>>> a simpler facade for Tika.parseAsString() - e.g. call it
>>>>>>>>>> TikaIO.parseAllAsStrings().
>>>>>>>>>>
>>>>>>>>>> Example usage would look like:
>>>>>>>>>>
>>>>>>>>>>       PCollection<KV<String, ParseResult>> parseResults =
>>>>>>>>>> p.apply(FileIO.match().filepattern(...))
>>>>>>>>>>         .apply(FileIO.readMatches())
>>>>>>>>>>         .apply(TikaIO.parseAllAsStrings())
>>>>>>>>>>
>>>>>>>>>> or:
>>>>>>>>>>
>>>>>>>>>>         .apply(TikaIO.parseAll()
>>>>>>>>>>             .withParser(new AutoDetectParser())
>>>>>>>>>>             .withContentHandler(() -> new BodyContentHandler(new
>>>>>>>>>> ToXMLContentHandler())))
>>>>>>>>>>
>>>>>>>>>> You could also have shorthands for letting the user avoid using
>>>>> FileIO
>>>>>>>>>> directly in simple cases, for example:
>>>>>>>>>>         p.apply(TikaIO.parseAsStrings().from(filepattern))
>>>>>>>>>>
>>>>>>>>>> This would of course be implemented as a ParDo or even
>> MapElements,
>>>>>>>>>> and
>>>>>>>>>> you'll be able to share the code between parseAll and regular
>>>>>>>>>> parse.
>>>>>>>>>>
>>>>>>>>> I'd like to understand how to do
>>>>>>>>>
>>>>>>>>> TikaIO.parse().from(filepattern)
>>>>>>>>>
>>>>>>>>> Right now I have TikaIO.Read extending
>>>>>>>>> PTransform<PBegin, PCollection<ParseResult>
>>>>>>>>>
>>>>>>>>> and then the boilerplate code which builds Read when I do something
>>>>>>>>> like
>>>>>>>>>
>>>>>>>>> TikaIO.read().from(filepattern).
>>>>>>>>>
>>>>>>>>> What is the convention for supporting something like
>>>>>>>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can
>> I
>>>>>>>>> see
>>>>>>>>> some example ?
>>>>>>>>>
>>>>>>>> There are a number of IOs that don't use Source - e.g. DatastoreIO
>>>>>>>> and
>>>>>>>> JdbcIO. TextIO.readMatches() might be an even better transform to
>>>>> mimic.
>>>>>>>> Note that in TikaIO you probably won't need a fusion break after the
>>>>>>>> ParDo
>>>>>>>> since there's 1 result per input file.
>>>>>>>>
>>>>>>>
>>>>>>> OK, I'll have a look
>>>>>>>
>>>>>>> Cheers, Sergey
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Many thanks, Sergey
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>
>>
> 



Re: TikaIO Refactoring

Posted by Ben Chambers <bc...@apache.org>.
It looks like ReadableFile#open does currently decompress the stream, but
it seems like we could add a ReadableFile#openRaw(...) or something like
that which didn't implicitly decompress. Then libraries such as Tika which
want the *actual* file content could use that method. Would that address
your concerns?

https://github.com/apache/beam/blob/393e5631054a81ae1fdcd304f81cc68cf53d3422/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L131

On Wed, Oct 4, 2017 at 2:42 PM Sergey Beryozkin <sb...@gmail.com>
wrote:

> Wait, but what about Tika doing checks like Zip bombs, etc ? Tika is
> expected to do the decompression itself, while ReadableFile hands over
> the content already decompressed.
>
> The other point is that Tika reports the names of the zipped files too,
> in the content, as you can see from TikaIOTest#readZippedPdfFile.
>
> Can we assume that if Metadata does not point to the local file then it
> can be opened as a URL stream ? The same issue affects TikaConfig, so
> I'd rather have a solution which will work for MatchResult.Metadata and
> TikaConfig
>
> Thanks, Sergey
> On 04/10/17 22:02, Sergey Beryozkin wrote:
> > Good point...
> >
> > Sergey
> >
> > On 04/10/17 18:24, Eugene Kirpichov wrote:
> >> Can TikaInputStream consume a regular InputStream? If so, you can
> >> apply it
> >> to Channels.newInputStream(channel). If not, applying it to the filename
> >> extracted from Metadata won't work either because it can point to a file
> >> that's not on the local disk.
> >>
> >> On Wed, Oct 4, 2017, 10:08 AM Sergey Beryozkin <sb...@gmail.com>
> >> wrote:
> >>
> >>> I'm starting to move toward
> >>>
> >>> class TikaIO {
> >>>     public static ParseAllToString parseAllToString() {..}
> >>>     class ParseAllToString extends
> PTransform<PCollection<ReadableFile>,
> >>> PCollection<ParseResult>> {
> >>>       ...configuration properties...
> >>>       expand {
> >>>         return input.apply(ParDo.of(new ParseToStringFn))
> >>>       }
> >>>       class ParseToStringFn extends DoFn<...> {...}
> >>>     }
> >>> }
> >>>
> >>> as suggested by Eugene
> >>>
> >>> The initial migration seems to work fine, except that ReadableFile
> >>> and, in particular, ReadableByteChannel cannot be consumed by
> >>> TikaInputStream yet (I'll open an enhancement request); besides, it's
> >>> better to let Tika unzip if needed, given that a lot of effort went
> >>> into Tika's detection of zip security issues...
> >>>
> >>> So I'm typing it as
> >>>
> >>> class ParseAllToString extends
> >>> PTransform<PCollection<MatchResult.Metadata>, PCollection<ParseResult>>
> >>>
> >>> Cheers, Sergey
> >>>
> >>> On 02/10/17 12:03, Sergey Beryozkin wrote:
> >>>> Thanks for the review, please see the last comment:
> >>>>
> >>>> https://github.com/apache/beam/pull/3835#issuecomment-333502388
> >>>>
> >>>> (sorry for the possible duplication - but I'm not sure that GitHub
> will
> >>>> propagate it - as I can not see a comment there that I left on
> >>>> Saturday).
> >>>>
> >>>> Cheers, Sergey
> >>>> On 29/09/17 10:21, Sergey Beryozkin wrote:
> >>>>> Hi
> >>>>> On 28/09/17 17:09, Eugene Kirpichov wrote:
> >>>>>> Hi! Glad the refactoring is happening, thanks!
> >>>>>
> >>>>> Thanks for getting me focused on having TikaIO supporting the simpler
> >>>>> (and practical) cases first :-)
> >>>>>> It was auto-assigned to Reuven as formal owner of the component. I
> >>>>>> reassigned it to you.
> >>>>> OK, thanks...
> >>>>>>
> >>>>>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin
> >>>>>> <sberyozkin@gmail.com
> >>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi
> >>>>>>>
> >>>>>>> I started looking at
> >>>>>>> https://issues.apache.org/jira/browse/BEAM-2994
> >>>>>>>
> >>>>>>> and pushed some initial code to my tikaio branch introducing
> >>>>>>> ParseResult
> >>>>>>> and updating the tests but keeping the BoundedSource/Reader,
> >>>>>>> dropping
> >>>>>>> the asynchronous parsing code, and a few other bits.
> >>>>>>>
> >>>>>>> Just noticed it is assigned to Reuven - does it mean Reuven is
> >>>>>>> looking
> >>>>>>> into it too or was it auto-assigned ?
> >>>>>>>
> >>>>>>> I don't mind, would it make sense for me to do an 'interim' PR on
> >>>>>>> what I've done so far before completely removing BoundedSource/Reader
> >>>>>>> based code ?
> >>>>>>>
> >>>>>> Yes :)
> >>>>>>
> >>>>> I did commit yesterday to my branch, and it made its way to the
> >>>>> pending PR (which I forgot about) where I only tweaked a couple of
> doc
> >>>>> typos, so I renamed that PR:
> >>>>>
> >>>>> https://github.com/apache/beam/pull/3835
> >>>>>
> >>>>> (The build failures are apparently due to the build timeouts)
> >>>>>
> >>>>> As I mentioned, in this PR I updated the existing TikaIO test to
> >>>>> work with ParseResult, which at the moment has a file location as
> >>>>> its property. Only a file name could easily have been saved, but I
> >>>>> thought it might be important to know where on the network the file
> >>>>> is - maybe to copy it afterwards if needed, etc. I'd also have no
> >>>>> problem with having it typed as a K key; I was only trying to make
> >>>>> it a bit simpler at the start.
> >>>>>
> >>>>> I'll deal with the new configuration options after the switch.
> >>>>> TikaConfig would most likely still need to be supported, but I
> >>>>> recall you mentioned that the way it's done now will make it work
> >>>>> only with the direct runner. I guess I can load it as a URL
> >>>>> resource... The other bits, like providing custom content handlers,
> >>>>> parsers, input metadata, maybe setting the max size of the files,
> >>>>> etc., can all be added after the switch.
> >>>>>
> >>>>> Note I haven't dealt with a number of your comments to the original
> >>>>> code which can still be dealt with in the current code - given that
> >>>>> most of that code will go with the next PR anyway.
> >>>>>
> >>>>> Please review or merge if it looks like it is a step in the right
> >>>>> direction...
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> I have another question anyway,
> >>>>>>>
> >>>>>>>
> >>>>>>>> E.g. TikaIO could:
> >>>>>>>> - take as input a PCollection<ReadableFile>
> >>>>>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where
> >>>>>>>> ParseResult
> >>>>>>>> is a class with properties { String content, Metadata metadata }
> >>>>>>>> - be configured by: a Parser (it implements Serializable so can be
> >>>>>>>> specified at pipeline construction time) and a ContentHandler
> whose
> >>>>>>>> toString() will go into "content". ContentHandler does not
> >>>>>>>> implement
> >>>>>>>> Serializable, so you can not specify it at construction time -
> >>>>>>>> however,
> >>>>>>> you
> >>>>>>>> can let the user specify either its class (if it's a simple
> handler
> >>>>>>>> like
> >>>>>>> a
> >>>>>>>> BodyContentHandler) or specify a lambda for creating the handler
> >>>>>>>> (SerializableFunction<Void, ContentHandler>), and potentially
> >>>>>>>> you can
> >>>>>>> have
> >>>>>>>> a simpler facade for Tika.parseAsString() - e.g. call it
> >>>>>>>> TikaIO.parseAllAsStrings().
> >>>>>>>>
> >>>>>>>> Example usage would look like:
> >>>>>>>>
> >>>>>>>>      PCollection<KV<String, ParseResult>> parseResults =
> >>>>>>>> p.apply(FileIO.match().filepattern(...))
> >>>>>>>>        .apply(FileIO.readMatches())
> >>>>>>>>        .apply(TikaIO.parseAllAsStrings())
> >>>>>>>>
> >>>>>>>> or:
> >>>>>>>>
> >>>>>>>>        .apply(TikaIO.parseAll()
> >>>>>>>>            .withParser(new AutoDetectParser())
> >>>>>>>>            .withContentHandler(() -> new BodyContentHandler(new
> >>>>>>>> ToXMLContentHandler())))
> >>>>>>>>
> >>>>>>>> You could also have shorthands for letting the user avoid using
> >>> FileIO
> >>>>>>>> directly in simple cases, for example:
> >>>>>>>>        p.apply(TikaIO.parseAsStrings().from(filepattern))
> >>>>>>>>
> >>>>>>>> This would of course be implemented as a ParDo or even
> MapElements,
> >>>>>>>> and
> >>>>>>>> you'll be able to share the code between parseAll and regular
> >>>>>>>> parse.
> >>>>>>>>
> >>>>>>> I'd like to understand how to do
> >>>>>>>
> >>>>>>> TikaIO.parse().from(filepattern)
> >>>>>>>
> >>>>>>> Right now I have TikaIO.Read extending
> >>>>>>> PTransform<PBegin, PCollection<ParseResult>
> >>>>>>>
> >>>>>>> and then the boilerplate code which builds Read when I do something
> >>>>>>> like
> >>>>>>>
> >>>>>>> TikaIO.read().from(filepattern).
> >>>>>>>
> >>>>>>> What is the convention for supporting something like
> >>>>>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can
> I
> >>>>>>> see
> >>>>>>> some example ?
> >>>>>>>
> >>>>>> There are a number of IOs that don't use Source - e.g. DatastoreIO
> >>>>>> and
> >>>>>> JdbcIO. TextIO.readMatches() might be an even better transform to
> >>> mimic.
> >>>>>> Note that in TikaIO you probably won't need a fusion break after the
> >>>>>> ParDo
> >>>>>> since there's 1 result per input file.
> >>>>>>
> >>>>>
> >>>>> OK, I'll have a look
> >>>>>
> >>>>> Cheers, Sergey
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Many thanks, Sergey
> >>>>>>>
> >>>>>>
> >>>
> >>
>

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi

Yes, using .withCompression(UNCOMPRESSED) works, but the test code looks 
funny:

p.apply("ParseFiles", FileIO.match().filepattern(resourcePath))
         .apply(FileIO.readMatches().withCompression(
           compressed ? Compression.UNCOMPRESSED : Compression.AUTO))
         .apply(TikaIO.read())

One can obviously always set UNCOMPRESSED for Tika and avoid the if/else,
and it will work fine (with some minor confusion expected when someone 
attempts to process compressed files), but IMHO a more strongly typed 
approach for getting the raw content would be useful to have as well.
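The reason the raw content matters is that container formats announce themselves in their first few bytes; if FileIO already stripped the container, Tika's detector has nothing to detect. A pure-JDK sketch of that idea (illustrative only - Tika's real detection is far richer than this):

```java
public class CompressionSniffer {
    // GZIP streams start with the magic bytes 0x1f 0x8b;
    // ZIP local-file entries start with "PK\003\004".
    static boolean looksGzip(byte[] header) {
        return header.length >= 2
            && (header[0] & 0xff) == 0x1f
            && (header[1] & 0xff) == 0x8b;
    }

    static boolean looksZip(byte[] header) {
        return header.length >= 4
            && header[0] == 'P' && header[1] == 'K'
            && header[2] == 3 && header[3] == 4;
    }

    public static void main(String[] args) {
        System.out.println(looksGzip(new byte[] {0x1f, (byte) 0x8b, 8}));  // true
        System.out.println(looksZip(new byte[] {'P', 'K', 3, 4, 0}));      // true
        System.out.println(looksGzip("plain text".getBytes()));            // false
    }
}
```

A decompressed stream no longer starts with either signature, which is exactly the information Tika loses when the content arrives pre-decompressed.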

Thanks, Sergey
On 05/10/17 12:56, Sergey Beryozkin wrote:
> Hi Eugene
> 
> On 04/10/17 22:52, Eugene Kirpichov wrote:
>> You can avoid automatic decompression by using
>> FileIO.readMatches().withCompression(UNCOMPRESSED) (default is AUTO).
> 
> This is nice - it shows it was already anticipated that the
> auto-decompression would not always be needed, and it does help the
> readZippedPdfFile test pass :-). But I'd rather have TikaIO users not
> worry about setting .withCompression(UNCOMPRESSED) for it to work
> correctly, given that the raw content is always needed by Tika - thus
> requiring the users to assert it may be a bit suboptimal.
> 
> It is only useful to auto-decompress if that is indeed what a user wants
> (it's a single file only, no risk of some malicious compression);
> otherwise Tika should do it itself.
> 
> So ideally, after the initial refactoring is complete, we'd only have
> 
> p.apply(FileIO.readMatches()).apply(TikaIO.parseAll()) for reading 
> compressed or uncompressed files, where Tika would just do 
> ReadableFile.openRaw as suggested by Ben, or we'd at least have another 
> shortcut at the FileIO level, something like
> 
> FileIO.readRawMatches()
> 
> which would be equivalent to specifying the UNCOMPRESSED option explicitly.
> 
> 
> Thanks, Sergey
>>
>> On Wed, Oct 4, 2017 at 2:42 PM Sergey Beryozkin <sb...@gmail.com>
>> wrote:
>>
>>> Wait, but what about Tika doing checks like Zip bombs, etc ? Tika is
>>> expected to do the decompression itself, while ReadableFile hands over
>>> the content already decompressed.
>>>
>>> The other point is that Tika reports the names of the zipped files too,
>>> in the content, as you can see from TikaIOTest#readZippedPdfFile.
>>>
>>> Can we assume that if Metadata does not point to the local file then it
>>> can be opened as a URL stream ? The same issue affects TikaConfig, so
>>> I'd rather have a solution which will work for MatchResult.Metadata and
>>> TikaConfig
>>>
>>> Thanks, Sergey
>>> On 04/10/17 22:02, Sergey Beryozkin wrote:
>>>> Good point...
>>>>
>>>> Sergey
>>>>
>>>> On 04/10/17 18:24, Eugene Kirpichov wrote:
>>>>> Can TikaInputStream consume a regular InputStream? If so, you can
>>>>> apply it
>>>>> to Channels.newInputStream(channel). If not, applying it to the 
>>>>> filename
>>>>> extracted from Metadata won't work either because it can point to a 
>>>>> file
>>>>> that's not on the local disk.
>>>>>
>>>>> On Wed, Oct 4, 2017, 10:08 AM Sergey Beryozkin <sb...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I'm starting to move toward
>>>>>>
>>>>>> class TikaIO {
>>>>>>      public static ParseAllToString parseAllToString() {..}
>>>>>>      class ParseAllToString extends
>>> PTransform<PCollection<ReadableFile>,
>>>>>> PCollection<ParseResult>> {
>>>>>>        ...configuration properties...
>>>>>>        expand {
>>>>>>          return input.apply(ParDo.of(new ParseToStringFn))
>>>>>>        }
>>>>>>        class ParseToStringFn extends DoFn<...> {...}
>>>>>>      }
>>>>>> }
>>>>>>
>>>>>> as suggested by Eugene
>>>>>>
>>>>>> The initial migration seems to work fine, except that ReadableFile
>>>>>> and, in particular, ReadableByteChannel cannot be consumed by
>>>>>> TikaInputStream yet (I'll open an enhancement request); besides, it's
>>>>>> better to let Tika unzip if needed, given that a lot of effort went
>>>>>> into Tika's detection of zip security issues...
>>>>>>
>>>>>> So I'm typing it as
>>>>>>
>>>>>> class ParseAllToString extends
>>>>>> PTransform<PCollection<MatchResult.Metadata>, 
>>>>>> PCollection<ParseResult>>
>>>>>>
>>>>>> Cheers, Sergey
>>>>>>
>>>>>> On 02/10/17 12:03, Sergey Beryozkin wrote:
>>>>>>> Thanks for the review, please see the last comment:
>>>>>>>
>>>>>>> https://github.com/apache/beam/pull/3835#issuecomment-333502388
>>>>>>>
>>>>>>> (sorry for the possible duplication - but I'm not sure that GitHub
>>> will
>>>>>>> propagate it - as I can not see a comment there that I left on
>>>>>>> Saturday).
>>>>>>>
>>>>>>> Cheers, Sergey
>>>>>>> On 29/09/17 10:21, Sergey Beryozkin wrote:
>>>>>>>> Hi
>>>>>>>> On 28/09/17 17:09, Eugene Kirpichov wrote:
>>>>>>>>> Hi! Glad the refactoring is happening, thanks!
>>>>>>>>
>>>>>>>> Thanks for getting me focused on having TikaIO supporting the 
>>>>>>>> simpler
>>>>>>>> (and practical) cases first :-)
>>>>>>>>> It was auto-assigned to Reuven as formal owner of the component. I
>>>>>>>>> reassigned it to you.
>>>>>>>> OK, thanks...
>>>>>>>>>
>>>>>>>>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin
>>>>>>>>> <sberyozkin@gmail.com
>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi
>>>>>>>>>>
>>>>>>>>>> I started looking at
>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-2994
>>>>>>>>>>
>>>>>>>>>> and pushed some initial code to my tikaio branch introducing
>>>>>>>>>> ParseResult
>>>>>>>>>> and updating the tests but keeping the BoundedSource/Reader,
>>>>>>>>>> dropping
>>>>>>>>>> the asynchronous parsing code, and a few other bits.
>>>>>>>>>>
>>>>>>>>>> Just noticed it is assigned to Reuven - does it mean Reuven is
>>>>>>>>>> looking
>>>>>>>>>> into it too or was it auto-assigned ?
>>>>>>>>>>
>>>>>>>>>> I don't mind, would it make sense for me to do an 'interim' PR on
>>>>>>>>>> what I've done so far before completely removing 
>>>>>>>>>> BoundedSource/Reader
>>>>>>>>>> based code ?
>>>>>>>>>>
>>>>>>>>> Yes :)
>>>>>>>>>
>>>>>>>> I did commit yesterday to my branch, and it made its way to the
>>>>>>>> pending PR (which I forgot about) where I only tweaked a couple of
>>> doc
>>>>>>>> typos, so I renamed that PR:
>>>>>>>>
>>>>>>>> https://github.com/apache/beam/pull/3835
>>>>>>>>
>>>>>>>> (The build failures are apparently due to the build timeouts)
>>>>>>>>
>>>>>>>> As I mentioned, in this PR I updated the existing TikaIO test to
>>>>>>>> work with ParseResult, which at the moment has a file location as
>>>>>>>> its property. Only a file name could easily have been saved, but I
>>>>>>>> thought it might be important to know where on the network the file
>>>>>>>> is - maybe to copy it afterwards if needed, etc. I'd also have no
>>>>>>>> problem with having it typed as a K key; I was only trying to make
>>>>>>>> it a bit simpler at the start.
>>>>>>>>
>>>>>>>> I'll deal with the new configuration options after the switch.
>>>>>>>> TikaConfig would most likely still need to be supported, but I
>>>>>>>> recall you mentioned that the way it's done now will make it work
>>>>>>>> only with the direct runner. I guess I can load it as a URL
>>>>>>>> resource... The other bits, like providing custom content handlers,
>>>>>>>> parsers, input metadata, maybe setting the max size of the files,
>>>>>>>> etc., can all be added after the switch.
>>>>>>>>
>>>>>>>> Note I haven't dealt with a number of your comments to the original
>>>>>>>> code which can still be dealt with in the current code - given that
>>>>>>>> most of that code will go with the next PR anyway.
>>>>>>>>
>>>>>>>> Please review or merge if it looks like it is a step in the right
>>>>>>>> direction...
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I have another question anyway,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> E.g. TikaIO could:
>>>>>>>>>>> - take as input a PCollection<ReadableFile>
>>>>>>>>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where
>>>>>>>>>>> ParseResult
>>>>>>>>>>> is a class with properties { String content, Metadata metadata }
>>>>>>>>>>> - be configured by: a Parser (it implements Serializable so 
>>>>>>>>>>> can be
>>>>>>>>>>> specified at pipeline construction time) and a ContentHandler
>>> whose
>>>>>>>>>>> toString() will go into "content". ContentHandler does not
>>>>>>>>>>> implement
>>>>>>>>>>> Serializable, so you can not specify it at construction time -
>>>>>>>>>>> however,
>>>>>>>>>> you
>>>>>>>>>>> can let the user specify either its class (if it's a simple
>>> handler
>>>>>>>>>>> like
>>>>>>>>>> a
>>>>>>>>>>> BodyContentHandler) or specify a lambda for creating the handler
>>>>>>>>>>> (SerializableFunction<Void, ContentHandler>), and potentially
>>>>>>>>>>> you can
>>>>>>>>>> have
>>>>>>>>>>> a simpler facade for Tika.parseAsString() - e.g. call it
>>>>>>>>>>> TikaIO.parseAllAsStrings().
>>>>>>>>>>>
>>>>>>>>>>> Example usage would look like:
>>>>>>>>>>>
>>>>>>>>>>>       PCollection<KV<String, ParseResult>> parseResults =
>>>>>>>>>>> p.apply(FileIO.match().filepattern(...))
>>>>>>>>>>>         .apply(FileIO.readMatches())
>>>>>>>>>>>         .apply(TikaIO.parseAllAsStrings())
>>>>>>>>>>>
>>>>>>>>>>> or:
>>>>>>>>>>>
>>>>>>>>>>>         .apply(TikaIO.parseAll()
>>>>>>>>>>>             .withParser(new AutoDetectParser())
>>>>>>>>>>>             .withContentHandler(() -> new BodyContentHandler(new
>>>>>>>>>>> ToXMLContentHandler())))
>>>>>>>>>>>
>>>>>>>>>>> You could also have shorthands for letting the user avoid using
>>>>>> FileIO
>>>>>>>>>>> directly in simple cases, for example:
>>>>>>>>>>>         p.apply(TikaIO.parseAsStrings().from(filepattern))
>>>>>>>>>>>
>>>>>>>>>>> This would of course be implemented as a ParDo or even
>>> MapElements,
>>>>>>>>>>> and
>>>>>>>>>>> you'll be able to share the code between parseAll and regular
>>>>>>>>>>> parse.
>>>>>>>>>>>
>>>>>>>>>> I'd like to understand how to do
>>>>>>>>>>
>>>>>>>>>> TikaIO.parse().from(filepattern)
>>>>>>>>>>
>>>>>>>>>> Right now I have TikaIO.Read extending
>>>>>>>>>> PTransform<PBegin, PCollection<ParseResult>
>>>>>>>>>>
>>>>>>>>>> and then the boilerplate code which builds Read when I do 
>>>>>>>>>> something
>>>>>>>>>> like
>>>>>>>>>>
>>>>>>>>>> TikaIO.read().from(filepattern).
>>>>>>>>>>
>>>>>>>>>> What is the convention for supporting something like
>>>>>>>>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, 
>>>>>>>>>> can
>>> I
>>>>>>>>>> see
>>>>>>>>>> some example ?
>>>>>>>>>>
>>>>>>>>> There are a number of IOs that don't use Source - e.g. DatastoreIO
>>>>>>>>> and
>>>>>>>>> JdbcIO. TextIO.readMatches() might be an even better transform to
>>>>>> mimic.
>>>>>>>>> Note that in TikaIO you probably won't need a fusion break 
>>>>>>>>> after the
>>>>>>>>> ParDo
>>>>>>>>> since there's 1 result per input file.
>>>>>>>>>
>>>>>>>>
>>>>>>>> OK, I'll have a look
>>>>>>>>
>>>>>>>> Cheers, Sergey
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Many thanks, Sergey
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>
>>>
>>
> 
> 

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Eugene

On 04/10/17 22:52, Eugene Kirpichov wrote:
> You can avoid automatic decompression by using
> FileIO.readMatches().withCompression(UNCOMPRESSED) (default is AUTO).

This is nice - it shows it was already anticipated that the 
auto-decompression would not always be needed, and it does help the 
readZippedPdfFile test pass :-). But I'd rather have TikaIO users not 
worry about setting .withCompression(UNCOMPRESSED) for it to work 
correctly, given that the raw content is always needed by Tika - thus 
requiring the users to assert it may be a bit suboptimal.

It is only useful to auto-decompress if that is indeed what a user wants 
(it's a single file only, no risk of some malicious compression); 
otherwise Tika should do it itself.
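To sketch the kind of safeguard at stake with malicious compression (an illustrative pure-JDK example, not Tika's actual code - Tika's real protections are considerably richer), the essence of a zip-bomb guard is capping the decompressed size so a tiny input cannot expand without limit:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class BoundedDecompress {
    // Decompress at most maxBytes, failing fast when the output grows
    // suspiciously large relative to the cap.
    static byte[] decompressBounded(byte[] compressed, int maxBytes)
            throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in =
                new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                if (out.size() + n > maxBytes) {
                    throw new IOException(
                        "decompressed size exceeds " + maxBytes + " bytes");
                }
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write("small payload".getBytes("UTF-8"));
        }
        System.out.println(
            new String(decompressBounded(bos.toByteArray(), 1024), "UTF-8"));
    }
}
```

The point is that this check has to run during decompression; if FileIO has already inflated the content, the guard never gets a chance to fire.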

So ideally, after the initial refactoring is complete, we'd only have

p.apply(FileIO.readMatches()).apply(TikaIO.parseAll()) for reading 
compressed or uncompressed files, where Tika would just do 
ReadableFile.openRaw as suggested by Ben, or we'd at least have another 
shortcut at the FileIO level, something like

FileIO.readRawMatches()

which would be equivalent to specifying the UNCOMPRESSED option explicitly.


Thanks, Sergey
> 
> On Wed, Oct 4, 2017 at 2:42 PM Sergey Beryozkin <sb...@gmail.com>
> wrote:
> 
>> Wait, but what about Tika doing checks like Zip bombs, etc ? Tika is
>> expected to do the decompression itself, while ReadableFile hands over
>> the content already decompressed.
>>
>> The other point is that Tika reports the names of the zipped files too,
>> in the content, as you can see from TikaIOTest#readZippedPdfFile.
>>
>> Can we assume that if Metadata does not point to the local file then it
>> can be opened as a URL stream ? The same issue affects TikaConfig, so
>> I'd rather have a solution which will work for MatchResult.Metadata and
>> TikaConfig
>>
>> Thanks, Sergey
>> On 04/10/17 22:02, Sergey Beryozkin wrote:
>>> Good point...
>>>
>>> Sergey
>>>
>>> On 04/10/17 18:24, Eugene Kirpichov wrote:
>>>> Can TikaInputStream consume a regular InputStream? If so, you can
>>>> apply it
>>>> to Channels.newInputStream(channel). If not, applying it to the filename
>>>> extracted from Metadata won't work either because it can point to a file
>>>> that's not on the local disk.
>>>>
>>>> On Wed, Oct 4, 2017, 10:08 AM Sergey Beryozkin <sb...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm starting to move toward
>>>>>
>>>>> class TikaIO {
>>>>>      public static ParseAllToString parseAllToString() {..}
>>>>>      class ParseAllToString extends
>> PTransform<PCollection<ReadableFile>,
>>>>> PCollection<ParseResult>> {
>>>>>        ...configuration properties...
>>>>>        expand {
>>>>>          return input.apply(ParDo.of(new ParseToStringFn))
>>>>>        }
>>>>>        class ParseToStringFn extends DoFn<...> {...}
>>>>>      }
>>>>> }
>>>>>
>>>>> as suggested by Eugene
>>>>>
>>>>> The initial migration seems to work fine, except that ReadableFile
>>>>> and, in particular, ReadableByteChannel cannot be consumed by
>>>>> TikaInputStream yet (I'll open an enhancement request); besides, it's
>>>>> better to let Tika unzip if needed, given that a lot of effort went
>>>>> into Tika's detection of zip security issues...
>>>>>
>>>>> So I'm typing it as
>>>>>
>>>>> class ParseAllToString extends
>>>>> PTransform<PCollection<MatchResult.Metadata>, PCollection<ParseResult>>
>>>>>
>>>>> Cheers, Sergey
>>>>>
>>>>> On 02/10/17 12:03, Sergey Beryozkin wrote:
>>>>>> Thanks for the review, please see the last comment:
>>>>>>
>>>>>> https://github.com/apache/beam/pull/3835#issuecomment-333502388
>>>>>>
>>>>>> (sorry for the possible duplication - but I'm not sure that GitHub
>> will
>>>>>> propagate it - as I can not see a comment there that I left on
>>>>>> Saturday).
>>>>>>
>>>>>> Cheers, Sergey
>>>>>> On 29/09/17 10:21, Sergey Beryozkin wrote:
>>>>>>> Hi
>>>>>>> On 28/09/17 17:09, Eugene Kirpichov wrote:
>>>>>>>> Hi! Glad the refactoring is happening, thanks!
>>>>>>>
>>>>>>> Thanks for getting me focused on having TikaIO supporting the simpler
>>>>>>> (and practical) cases first :-)
>>>>>>>> It was auto-assigned to Reuven as formal owner of the component. I
>>>>>>>> reassigned it to you.
>>>>>>> OK, thanks...
>>>>>>>>
>>>>>>>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin
>>>>>>>> <sberyozkin@gmail.com
>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> I started looking at
>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-2994
>>>>>>>>>
>>>>>>>>> and pushed some initial code to my tikaio branch introducing
>>>>>>>>> ParseResult
>>>>>>>>> and updating the tests but keeping the BoundedSource/Reader,
>>>>>>>>> dropping
>>>>>>>>> the asynchronous parsing code, and a few other bits.
>>>>>>>>>
>>>>>>>>> Just noticed it is assigned to Reuven - does it mean Reuven is
>>>>>>>>> looking
>>>>>>>>> into it too or was it auto-assigned ?
>>>>>>>>>
>>>>>>>>> I don't mind, would it make sense for me to do an 'interim' PR on
>>>>>>>>> what I've done so far before completely removing BoundedSource/Reader
>>>>>>>>> based code ?
>>>>>>>>>
>>>>>>>> Yes :)
>>>>>>>>
>>>>>>> I did commit yesterday to my branch, and it made its way to the
>>>>>>> pending PR (which I forgot about) where I only tweaked a couple of
>> doc
>>>>>>> typos, so I renamed that PR:
>>>>>>>
>>>>>>> https://github.com/apache/beam/pull/3835
>>>>>>>
>>>>>>> (The build failures are apparently due to the build timeouts)
>>>>>>>
>>>>>>> As I mentioned, in this PR I updated the existing TikaIO test to
>>>>>>> work with ParseResult, which at the moment has a file location as
>>>>>>> its property. Only a file name could easily have been saved, but I
>>>>>>> thought it might be important to know where on the network the file
>>>>>>> is - maybe to copy it afterwards if needed, etc. I'd also have no
>>>>>>> problem with having it typed as a K key; I was only trying to make
>>>>>>> it a bit simpler at the start.
>>>>>>>
>>>>>>> I'll deal with the new configuration options after the switch.
>>>>>>> TikaConfig would most likely still need to be supported, but I
>>>>>>> recall you mentioned that the way it's done now will make it work
>>>>>>> only with the direct runner. I guess I can load it as a URL
>>>>>>> resource... The other bits, like providing custom content handlers,
>>>>>>> parsers, input metadata, maybe setting the max size of the files,
>>>>>>> etc., can all be added after the switch.
>>>>>>>
>>>>>>> Note I haven't dealt with a number of your comments to the original
>>>>>>> code which can still be dealt with in the current code - given that
>>>>>>> most of that code will go with the next PR anyway.
>>>>>>>
>>>>>>> Please review or merge if it looks like it is a step in the right
>>>>>>> direction...
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have another question anyway,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> E.g. TikaIO could:
>>>>>>>>>> - take as input a PCollection<ReadableFile>
>>>>>>>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where
>>>>>>>>>> ParseResult
>>>>>>>>>> is a class with properties { String content, Metadata metadata }
>>>>>>>>>> - be configured by: a Parser (it implements Serializable so can be
>>>>>>>>>> specified at pipeline construction time) and a ContentHandler
>> whose
>>>>>>>>>> toString() will go into "content". ContentHandler does not
>>>>>>>>>> implement
>>>>>>>>>> Serializable, so you can not specify it at construction time -
>>>>>>>>>> however,
>>>>>>>>> you
>>>>>>>>>> can let the user specify either its class (if it's a simple
>> handler
>>>>>>>>>> like
>>>>>>>>> a
>>>>>>>>>> BodyContentHandler) or specify a lambda for creating the handler
>>>>>>>>>> (SerializableFunction<Void, ContentHandler>), and potentially
>>>>>>>>>> you can
>>>>>>>>> have
>>>>>>>>>> a simpler facade for Tika.parseAsString() - e.g. call it
>>>>>>>>>> TikaIO.parseAllAsStrings().
>>>>>>>>>>
>>>>>>>>>> Example usage would look like:
>>>>>>>>>>
>>>>>>>>>>       PCollection<KV<String, ParseResult>> parseResults =
>>>>>>>>>> p.apply(FileIO.match().filepattern(...))
>>>>>>>>>>         .apply(FileIO.readMatches())
>>>>>>>>>>         .apply(TikaIO.parseAllAsStrings())
>>>>>>>>>>
>>>>>>>>>> or:
>>>>>>>>>>
>>>>>>>>>>         .apply(TikaIO.parseAll()
>>>>>>>>>>             .withParser(new AutoDetectParser())
>>>>>>>>>>             .withContentHandler(() -> new BodyContentHandler(new
>>>>>>>>>> ToXMLContentHandler())))
>>>>>>>>>>
>>>>>>>>>> You could also have shorthands for letting the user avoid using
>>>>> FileIO
>>>>>>>>>> directly in simple cases, for example:
>>>>>>>>>>         p.apply(TikaIO.parseAsStrings().from(filepattern))
>>>>>>>>>>
>>>>>>>>>> This would of course be implemented as a ParDo or even
>> MapElements,
>>>>>>>>>> and
>>>>>>>>>> you'll be able to share the code between parseAll and regular
>>>>>>>>>> parse.
>>>>>>>>>>
>>>>>>>>> I'd like to understand how to do
>>>>>>>>>
>>>>>>>>> TikaIO.parse().from(filepattern)
>>>>>>>>>
>>>>>>>>> Right now I have TikaIO.Read extending
>>>>>>>>> PTransform<PBegin, PCollection<ParseResult>
>>>>>>>>>
>>>>>>>>> and then the boilerplate code which builds Read when I do something
>>>>>>>>> like
>>>>>>>>>
>>>>>>>>> TikaIO.read().from(filepattern).
>>>>>>>>>
>>>>>>>>> What is the convention for supporting something like
>>>>>>>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can
>> I
>>>>>>>>> see
>>>>>>>>> some example ?
>>>>>>>>>
>>>>>>>> There are a number of IOs that don't use Source - e.g. DatastoreIO
>>>>>>>> and
>>>>>>>> JdbcIO. TextIO.readMatches() might be an even better transform to
>>>>> mimic.
>>>>>>>> Note that in TikaIO you probably won't need a fusion break after the
>>>>>>>> ParDo
>>>>>>>> since there's 1 result per input file.
>>>>>>>>
>>>>>>>
>>>>>>> OK, I'll have a look
>>>>>>>
>>>>>>> Cheers, Sergey
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Many thanks, Sergey
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>
>>
> 


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: TikaIO Refactoring

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
You can avoid automatic decompression by using
FileIO.readMatches().withCompression(UNCOMPRESSED) (default is AUTO).
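As a rough illustration of the AUTO-versus-explicit distinction (a simplified stand-in, not Beam's actual implementation - the real resolution rules live in org.apache.beam.sdk.io.Compression and cover more formats):

```java
import java.util.Locale;

public class CompressionResolution {
    enum Compression { AUTO, UNCOMPRESSED, GZIP, ZIP }

    // AUTO picks a codec per file (here, naively by extension);
    // any explicit choice, including UNCOMPRESSED, always wins.
    static Compression resolve(Compression requested, String filename) {
        if (requested != Compression.AUTO) {
            return requested;
        }
        String f = filename.toLowerCase(Locale.ROOT);
        if (f.endsWith(".gz")) {
            return Compression.GZIP;
        }
        if (f.endsWith(".zip")) {
            return Compression.ZIP;
        }
        return Compression.UNCOMPRESSED;
    }

    public static void main(String[] args) {
        System.out.println(resolve(Compression.AUTO, "doc.pdf.gz"));      // GZIP
        System.out.println(resolve(Compression.UNCOMPRESSED, "doc.gz"));  // UNCOMPRESSED
    }
}
```

So a TikaIO user who wants Tika itself to see the compressed bytes would pin UNCOMPRESSED, overriding the per-file AUTO detection.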

On Wed, Oct 4, 2017 at 2:42 PM Sergey Beryozkin <sb...@gmail.com>
wrote:

> Wait, but what about Tika doing checks like Zip bombs, etc ? Tika is
> expected to do the decompression itself, while ReadableFile hands over
> the content already decompressed.
>
> The other point is that Tika reports the names of the zipped files too,
> in the content, as you can see from TikaIOTest#readZippedPdfFile.
>
> Can we assume that if Metadata does not point to the local file then it
> can be opened as a URL stream ? The same issue affects TikaConfig, so
> I'd rather have a solution which will work for MatchResult.Metadata and
> TikaConfig
>
> Thanks, Sergey
> On 04/10/17 22:02, Sergey Beryozkin wrote:
> > Good point...
> >
> > Sergey
> >
> > On 04/10/17 18:24, Eugene Kirpichov wrote:
> >> Can TikaInputStream consume a regular InputStream? If so, you can
> >> apply it
> >> to Channels.newInputStream(channel). If not, applying it to the filename
> >> extracted from Metadata won't work either because it can point to a file
> >> that's not on the local disk.
> >>
> >> On Wed, Oct 4, 2017, 10:08 AM Sergey Beryozkin <sb...@gmail.com>
> >> wrote:
> >>
> >>> I'm starting to move toward
> >>>
> >>> class TikaIO {
> >>>     public static ParseAllToString parseAllToString() {..}
> >>>     class ParseAllToString extends
> PTransform<PCollection<ReadableFile>,
> >>> PCollection<ParseResult>> {
> >>>       ...configuration properties...
> >>>       expand {
> >>>         return input.apply(ParDo.of(new ParseToStringFn))
> >>>       }
> >>>       class ParseToStringFn extends DoFn<...> {...}
> >>>     }
> >>> }
> >>>
> >>> as suggested by Eugene
> >>>
> >>> The initial migration seems to work fine, except that ReadableFile
> >>> and, in particular, ReadableByteChannel cannot be consumed by
> >>> TikaInputStream yet (I'll open an enhancement request); besides, it's
> >>> better to let Tika unzip if needed, given that a lot of effort went
> >>> into Tika's detection of zip security issues...
> >>>
> >>> So I'm typing it as
> >>>
> >>> class ParseAllToString extends
> >>> PTransform<PCollection<MatchResult.Metadata>, PCollection<ParseResult>>
> >>>
> >>> Cheers, Sergey
> >>>
> >>> On 02/10/17 12:03, Sergey Beryozkin wrote:
> >>>> Thanks for the review, please see the last comment:
> >>>>
> >>>> https://github.com/apache/beam/pull/3835#issuecomment-333502388
> >>>>
> >>>> (sorry for the possible duplication - but I'm not sure that GitHub
> will
> >>>> propagate it - as I can not see a comment there that I left on
> >>>> Saturday).
> >>>>
> >>>> Cheers, Sergey
> >>>> On 29/09/17 10:21, Sergey Beryozkin wrote:
> >>>>> Hi
> >>>>> On 28/09/17 17:09, Eugene Kirpichov wrote:
> >>>>>> Hi! Glad the refactoring is happening, thanks!
> >>>>>
> >>>>> Thanks for getting me focused on having TikaIO supporting the simpler
> >>>>> (and practical) cases first :-)
> >>>>>> It was auto-assigned to Reuven as formal owner of the component. I
> >>>>>> reassigned it to you.
> >>>>> OK, thanks...
> >>>>>>
> >>>>>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin
> >>>>>> <sberyozkin@gmail.com
> >>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi
> >>>>>>>
> >>>>>>> I started looking at
> >>>>>>> https://issues.apache.org/jira/browse/BEAM-2994
> >>>>>>>
> >>>>>>> and pushed some initial code to my tikaio branch introducing
> >>>>>>> ParseResult
> >>>>>>> and updating the tests but keeping the BounderSource/Reader,
> >>>>>>> dropping
> >>>>>>> the asynchronous parsing code, and few other bits.
> >>>>>>>
> >>>>>>> Just noticed it is assigned to Reuven - does it mean Reuven is
> >>>>>>> looking
> >>>>>>> into it too or was it auto-assigned ?
> >>>>>>>
> >>>>>>> I don't mind, would it make sense for me to do an 'interim' PR on
> >>>>>>> what've done so far before completely removing BoundedSource/Reader
> >>>>>>> based code ?
> >>>>>>>
> >>>>>> Yes :)
> >>>>>>
> >>>>> I did commit yesterday to my branch, and it made its way to the
> >>>>> pending PR (which I forgot about) where I only tweaked a couple of
> doc
> >>>>> typos, so I renamed that PR:
> >>>>>
> >>>>> https://github.com/apache/beam/pull/3835
> >>>>>
> >>>>> (The build failures are apparently due to the build timeouts)
> >>>>>
> >>>>> As I mentioned, in this PR I updated the existing TikaIO test to work
> >>>>> with ParseResult, at the moment a file location as its property. Only
> >>>>> a file name can easily be saved, I thought it might be important
> where
> >>>>> on the network the file is - may be copy it afterwards if needed,
> etc.
> >>>>> I'd also have no problems with having it typed as a K key, was only
> >>>>> trying to make it a bit simpler at the start.
> >>>>>
> >>>>> I'll deal with the new configurations after a switch. TikaConfig
> would
> >>>>> most likely still need to be supported but I recall you mentioned the
> >>>>> way it's done now will make it work only with the direct runner. I
> >>>>> guess I can load it as a URL resource... The other bits like
> providing
> >>>>> custom content handlers, parsers, input metadata, may be setting the
> >>>>> max size of the files, etc, can all be added after a switch.
> >>>>>
> >>>>> Note I haven't dealt with a number of your comments to the original
> >>>>> code which can still be dealt with in the current code - given that
> >>>>> most of that code will go with the next PR anyway.
> >>>>>
> >>>>> Please review or merge if it looks like it is a step in the right
> >>>>> direction...
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> I have another question anyway,
> >>>>>>>
> >>>>>>>
> >>>>>>>> E.g. TikaIO could:
> >>>>>>>> - take as input a PCollection<ReadableFile>
> >>>>>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where
> >>>>>>>> ParseResult
> >>>>>>>> is a class with properties { String content, Metadata metadata }
> >>>>>>>> - be configured by: a Parser (it implements Serializable so can be
> >>>>>>>> specified at pipeline construction time) and a ContentHandler
> whose
> >>>>>>>> toString() will go into "content". ContentHandler does not
> >>>>>>>> implement
> >>>>>>>> Serializable, so you can not specify it at construction time -
> >>>>>>>> however,
> >>>>>>> you
> >>>>>>>> can let the user specify either its class (if it's a simple
> handler
> >>>>>>>> like
> >>>>>>> a
> >>>>>>>> BodyContentHandler) or specify a lambda for creating the handler
> >>>>>>>> (SerializableFunction<Void, ContentHandler>), and potentially
> >>>>>>>> you can
> >>>>>>> have
> >>>>>>>> a simpler facade for Tika.parseAsString() - e.g. call it
> >>>>>>>> TikaIO.parseAllAsStrings().
> >>>>>>>>
> >>>>>>>> Example usage would look like:
> >>>>>>>>
> >>>>>>>>      PCollection<KV<String, ParseResult>> parseResults =
> >>>>>>>> p.apply(FileIO.match().filepattern(...))
> >>>>>>>>        .apply(FileIO.readMatches())
> >>>>>>>>        .apply(TikaIO.parseAllAsStrings())
> >>>>>>>>
> >>>>>>>> or:
> >>>>>>>>
> >>>>>>>>        .apply(TikaIO.parseAll()
> >>>>>>>>            .withParser(new AutoDetectParser())
> >>>>>>>>            .withContentHandler(() -> new BodyContentHandler(new
> >>>>>>>> ToXMLContentHandler())))
> >>>>>>>>
> >>>>>>>> You could also have shorthands for letting the user avoid using
> >>> FileIO
> >>>>>>>> directly in simple cases, for example:
> >>>>>>>>        p.apply(TikaIO.parseAsStrings().from(filepattern))
> >>>>>>>>
> >>>>>>>> This would of course be implemented as a ParDo or even
> MapElements,
> >>>>>>>> and
> >>>>>>>> you'll be able to share the code between parseAll and regular
> >>>>>>>> parse.
> >>>>>>>>
> >>>>>>> I'd like to understand how to do
> >>>>>>>
> >>>>>>> TikaIO.parse().from(filepattern)
> >>>>>>>
> >>>>>>> Right now I have TikaIO.Read extending
> >>>>>>> PTransform<PBegin, PCollection<ParseResult>
> >>>>>>>
> >>>>>>> and then the boilerplate code which builds Read when I do something
> >>>>>>> like
> >>>>>>>
> >>>>>>> TikaIO.read().from(filepattern).
> >>>>>>>
> >>>>>>> What is the convention for supporting something like
> >>>>>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can
> I
> >>>>>>> see
> >>>>>>> some example ?
> >>>>>>>
> >>>>>> There are a number of IOs that don't use Source - e.g. DatastoreIO
> >>>>>> and
> >>>>>> JdbcIO. TextIO.readMatches() might be an even better transform to
> >>> mimic.
> >>>>>> Note that in TikaIO you probably won't need a fusion break after the
> >>>>>> ParDo
> >>>>>> since there's 1 result per input file.
> >>>>>>
> >>>>>
> >>>>> OK, I'll have a look
> >>>>>
> >>>>> Cheers, Sergey
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Many thanks, Sergey
> >>>>>>>
> >>>>>>
> >>>
> >>
>

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Wait, but what about Tika doing checks for zip bombs, etc.? Tika is 
expected to do the decompression itself, while ReadableFile delivers 
the content already decompressed.

The other point is that Tika also reports the names of the zipped files 
in the content, as you can see from TikaIOTest#readZippedPdfFile.

Can we assume that if Metadata does not point to a local file then it 
can be opened as a URL stream? The same issue affects TikaConfig, so 
I'd rather have a solution which works for both MatchResult.Metadata 
and TikaConfig.
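A rough sketch of the fallback I have in mind, with plain JDK types standing in for Beam's MatchResult.Metadata resource id (whether every non-local id is really openable as a URL is exactly the open question, so this is the assumption, not the answer):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ResourceOpener {

    // Try the resource id as a local path first; otherwise assume it is a URL.
    // Ids with schemes the JDK does not know (e.g. "gs://...") would fail with
    // a MalformedURLException - which is the limitation being discussed.
    static InputStream open(String resourceId) throws IOException {
        Path path = Paths.get(resourceId);
        if (Files.exists(path)) {
            return Files.newInputStream(path);
        }
        return new URL(resourceId).openStream();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("tika-config", ".xml");
        try (InputStream in = open(tmp.toString())) {
            System.out.println("opened locally: " + (in != null));
        }
        Files.delete(tmp);
    }
}
```

The same helper would serve both cases above: a TikaConfig location and a Metadata resource id.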

Thanks, Sergey
On 04/10/17 22:02, Sergey Beryozkin wrote:
> Good point...
> 
> Sergey
> 
> On 04/10/17 18:24, Eugene Kirpichov wrote:
>> Can TikaInputStream consume a regular InputStream? If so, you can 
>> apply it
>> to Channels.newInputStream(channel). If not, applying it to the filename
>> extracted from Metadata won't work either because it can point to a file
>> that's not on the local disk.

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Good point...

Sergey

On 04/10/17 18:24, Eugene Kirpichov wrote:
> Can TikaInputStream consume a regular InputStream? If so, you can apply it
> to Channels.newInputStream(channel). If not, applying it to the filename
> extracted from Metadata won't work either because it can point to a file
> that's not on the local disk.

Re: TikaIO Refactoring

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Can TikaInputStream consume a regular InputStream? If so, you can apply it
to Channels.newInputStream(channel). If not, applying it to the filename
extracted from Metadata won't work either because it can point to a file
that's not on the local disk.
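The adaptation in question is pure java.nio; a minimal sketch of reading a ReadableByteChannel (the type Beam's ReadableFile.open() returns) through Channels.newInputStream (the TikaInputStream.get(InputStream) hand-off itself is left as a comment to keep the example dependency-free):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChannelToStream {

    // Adapt a ReadableByteChannel to the InputStream a parser expects,
    // and drain it fully (as a whole-document parse would).
    static String readAll(ReadableByteChannel channel) throws IOException {
        InputStream stream = Channels.newInputStream(channel);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n = stream.read(buf); n != -1; n = stream.read(buf)) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("tikaio", ".txt");
        Files.write(tmp, "hello tika".getBytes(StandardCharsets.UTF_8));
        try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.READ)) {
            // A real pipeline would hand the adapted stream to TikaInputStream.get(...)
            System.out.println(readAll(channel));
        }
        Files.delete(tmp);
    }
}
```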

On Wed, Oct 4, 2017, 10:08 AM Sergey Beryozkin <sb...@gmail.com> wrote:

> I'm starting moving toward
>
> class TikaIO {
>    public static ParseAllToString parseAllToString() {..}
>    class ParseAllToString extends PTransform<PCollection<ReadableFile>,
> PCollection<ParseResult>> {
>      ...configuration properties...
>      expand {
>        return input.apply(ParDo.of(new ParseToStringFn))
>      }
>      class ParseToStringFn extends DoFn<...> {...}
>    }
> }
>
> as suggested by Eugene
>
> The initial migration seems to work fine, except that ReadableFile and
> in particular, ReadableByteChannel can not be consumed by
> TikaInputStream yet (I'll open an enhancement request), besides, it's
> better let Tika to unzip if needed given that a lot of effort went in
> Tika into detecting zip security issues...
>
> So I'm typing it as
>
> class ParseAllToString extends
> PTransform<PCollection<MatchResult.Metadata>, PCollection<ParseResult>>
>
> Cheers, Sergey

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
I'm starting to move toward

class TikaIO {
   public static ParseAllToString parseAllToString() {..}
   class ParseAllToString extends PTransform<PCollection<ReadableFile>, 
PCollection<ParseResult>> {
     ...configuration properties...
     expand {
       return input.apply(ParDo.of(new ParseToStringFn))
     }
     class ParseToStringFn extends DoFn<...> {...}
   }
}

as suggested by Eugene

The initial migration seems to work fine, except that ReadableFile and, 
in particular, ReadableByteChannel cannot be consumed by 
TikaInputStream yet (I'll open an enhancement request). Besides, it's 
better to let Tika do the unzipping if needed, given that a lot of 
effort has gone into Tika's detection of zip security issues...

So I'm typing it as

class ParseAllToString extends 
PTransform<PCollection<MatchResult.Metadata>, PCollection<ParseResult>>
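As an aside, the ParseResult element type can stay a plain value class along the lines Eugene suggested ({ String content, Metadata metadata } plus the file location). A minimal dependency-free sketch, with java.util.Map standing in for Tika's Metadata (the real class would hold org.apache.tika.metadata.Metadata and need a Coder):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ParseResult {

    private final String fileLocation;
    private final String content;
    private final Map<String, String> metadata;

    ParseResult(String fileLocation, String content, Map<String, String> metadata) {
        this.fileLocation = fileLocation;
        this.content = content;
        // Defensive copy so a result, once emitted, is immutable.
        this.metadata = Collections.unmodifiableMap(new HashMap<>(metadata));
    }

    String getFileLocation() { return fileLocation; }
    String getContent() { return content; }
    Map<String, String> getMetadata() { return metadata; }
}
```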

Cheers, Sergey

On 02/10/17 12:03, Sergey Beryozkin wrote:
> Thanks for the review, please see the last comment:
> 
> https://github.com/apache/beam/pull/3835#issuecomment-333502388
> 
> (sorry for the possible duplication - but I'm not sure that GitHub will 
> propagate it - as I can not see a comment there that I left on Saturday).
> 
> Cheers, Sergey
> On 29/09/17 10:21, Sergey Beryozkin wrote:
>> Hi
>> On 28/09/17 17:09, Eugene Kirpichov wrote:
>>> Hi! Glad the refactoring is happening, thanks!
>>
>> Thanks for getting me focused on having TikaIO supporting the simpler 
>> (and practical) cases first :-)
>>> It was auto-assigned to Reuven as formal owner of the component. I
>>> reassigned it to you.
>> OK, thanks...
>>>
>>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sb...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I started looking at
>>>> https://issues.apache.org/jira/browse/BEAM-2994
>>>>
>>>> and pushed some initial code to my tikaio branch introducing 
>>>> ParseResult
>>>> and updating the tests but keeping the BounderSource/Reader, dropping
>>>> the asynchronous parsing code, and few other bits.
>>>>
>>>> Just noticed it is assigned to Reuven - does it mean Reuven is looking
>>>> into it too or was it auto-assigned ?
>>>>
>>>> I don't mind, would it make sense for me to do an 'interim' PR on
>>>> what've done so far before completely removing BoundedSource/Reader
>>>> based code ?
>>>>
>>> Yes :)
>>>
>> I did commit yesterday to my branch, and it made its way to the 
>> pending PR (which I forgot about) where I only tweaked a couple of doc 
>> typos, so I renamed that PR:
>>
>> https://github.com/apache/beam/pull/3835
>>
>> (The build failures are apparently due to the build timeouts)
>>
>> As I mentioned, in this PR I updated the existing TikaIO test to work 
>> with ParseResult, at the moment a file location as its property. Only 
>> a file name can easily be saved, I thought it might be important where 
>> on the network the file is - may be copy it afterwards if needed, etc. 
>> I'd also have no problems with having it typed as a K key, was only 
>> trying to make it a bit simpler at the start.
>>
>> I'll deal with the new configurations after a switch. TikaConfig would 
>> most likely still need to be supported but I recall you mentioned the 
>> way it's done now will make it work only with the direct runner. I 
>> guess I can load it as a URL resource... The other bits like providing 
>> custom content handlers, parsers, input metadata, may be setting the 
>> max size of the files, etc, can all be added after a switch.
>>
>> Note I haven't dealt with a number of your comments to the original 
>> code which can still be dealt with in the current code - given that 
>> most of that code will go with the next PR anyway.
>>
>> Please review or merge if it looks like it is a step in the right 
>> direction...
>>
>>>
>>>>
>>>> I have another question anyway,
>>>>
>>>>
>>>>> E.g. TikaIO could:
>>>>> - take as input a PCollection<ReadableFile>
>>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where 
>>>>> ParseResult
>>>>> is a class with properties { String content, Metadata metadata }
>>>>> - be configured by: a Parser (it implements Serializable so can be
>>>>> specified at pipeline construction time) and a ContentHandler whose
>>>>> toString() will go into "content". ContentHandler does not implement
>>>>> Serializable, so you can not specify it at construction time - 
>>>>> however,
>>>> you
>>>>> can let the user specify either its class (if it's a simple handler 
>>>>> like
>>>> a
>>>>> BodyContentHandler) or specify a lambda for creating the handler
>>>>> (SerializableFunction<Void, ContentHandler>), and potentially you can
>>>> have
>>>>> a simpler facade for Tika.parseAsString() - e.g. call it
>>>>> TikaIO.parseAllAsStrings().
>>>>>
>>>>> Example usage would look like:
>>>>>
>>>>>     PCollection<KV<String, ParseResult>> parseResults =
>>>>> p.apply(FileIO.match().filepattern(...))
>>>>>       .apply(FileIO.readMatches())
>>>>>       .apply(TikaIO.parseAllAsStrings())
>>>>>
>>>>> or:
>>>>>
>>>>>       .apply(TikaIO.parseAll()
>>>>>           .withParser(new AutoDetectParser())
>>>>>           .withContentHandler(() -> new BodyContentHandler(new
>>>>> ToXMLContentHandler())))
>>>>>
>>>>> You could also have shorthands for letting the user avoid using FileIO
>>>>> directly in simple cases, for example:
>>>>>       p.apply(TikaIO.parseAsStrings().from(filepattern))
>>>>>
>>>>> This would of course be implemented as a ParDo or even MapElements, 
>>>>> and
>>>>> you'll be able to share the code between parseAll and regular parse.
>>>>>
>>>> I'd like to understand how to do
>>>>
>>>> TikaIO.parse().from(filepattern)
>>>>
>>>> Right now I have TikaIO.Read extending
>>>> PTransform<PBegin, PCollection<ParseResult>>
>>>>
>>>> and then the boilerplate code which builds Read when I do something 
>>>> like
>>>>
>>>> TikaIO.read().from(filepattern).
>>>>
>>>> What is the convention for supporting something like
>>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I 
>>>> see
>>>> some example ?
>>>>
>>> There are a number of IOs that don't use Source - e.g. DatastoreIO and
>>> JdbcIO. TextIO.readMatches() might be an even better transform to mimic.
>>> Note that in TikaIO you probably won't need a fusion break after the 
>>> ParDo
>>> since there's 1 result per input file.
>>>
>>
>> OK, I'll have a look
>>
>> Cheers, Sergey
>>
>>>
>>>>
>>>> Many thanks, Sergey
>>>>
>>>
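
[Editor's note] The ParseResult shape discussed above ({ String content, Metadata metadata } plus the file location Sergey mentions) can be sketched in plain Java. This is an illustrative simplification, not the class that was merged into Beam: Tika's Metadata is stood in for by a Map<String, String> so the sketch carries no Tika dependency.

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.Map;

// Illustrative sketch only - Tika's Metadata is modeled here as a plain Map.
final class ParseResult implements Serializable {
  private final String fileLocation; // where on the network the file is
  private final String content;      // text produced by the ContentHandler
  private final Map<String, String> metadata;

  ParseResult(String fileLocation, String content, Map<String, String> metadata) {
    this.fileLocation = fileLocation;
    this.content = content;
    this.metadata = Collections.unmodifiableMap(metadata);
  }

  String getFileLocation() { return fileLocation; }
  String getContent() { return content; }
  Map<String, String> getMetadata() { return metadata; }
}
```

Keeping the file location as a plain String property (rather than a generic K key) matches the "simpler at the start" choice described in the message above.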

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Thanks for the review, please see the last comment:

https://github.com/apache/beam/pull/3835#issuecomment-333502388

(sorry for the possible duplication - but I'm not sure that GitHub will 
propagate it - as I cannot see a comment there that I left on Saturday).

Cheers, Sergey
On 29/09/17 10:21, Sergey Beryozkin wrote:
> Hi
> On 28/09/17 17:09, Eugene Kirpichov wrote:
>> Hi! Glad the refactoring is happening, thanks!
> 
> Thanks for getting me focused on having TikaIO supporting the simpler 
> (and practical) cases first :-)
>> It was auto-assigned to Reuven as formal owner of the component. I
>> reassigned it to you.
> OK, thanks...
>>
>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sb...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> I started looking at
>>> https://issues.apache.org/jira/browse/BEAM-2994
>>>
>>> and pushed some initial code to my tikaio branch introducing ParseResult
>>> and updating the tests but keeping the BoundedSource/Reader, dropping
>>> the asynchronous parsing code, and a few other bits.
>>>
>>> Just noticed it is assigned to Reuven - does it mean Reuven is looking
>>> into it too or was it auto-assigned ?
>>>
>>> I don't mind, would it make sense for me to do an 'interim' PR on
>>> what I've done so far before completely removing BoundedSource/Reader
>>> based code ?
>>>
>> Yes :)
>>
> I did commit yesterday to my branch, and it made its way to the pending 
> PR (which I forgot about) where I only tweaked a couple of doc typos, so 
> I renamed that PR:
> 
> https://github.com/apache/beam/pull/3835
> 
> (The build failures are apparently due to the build timeouts)
> 
> As I mentioned, in this PR I updated the existing TikaIO test to work 
> with ParseResult, which at the moment has a file location as its property. 
> Only a file name could easily be saved, but I thought it might be important 
> to record where on the network the file is - maybe to copy it afterwards if 
> needed, etc. I'd also have no problem with having it typed as a K key; I was 
> only trying to make it a bit simpler at the start.
> 
> I'll deal with the new configurations after a switch. TikaConfig would 
> most likely still need to be supported but I recall you mentioned the 
> way it's done now will make it work only with the direct runner. I guess 
> I can load it as a URL resource... The other bits like providing custom 
> content handlers, parsers, input metadata, maybe setting the max size 
> of the files, etc, can all be added after a switch.
> 
> Note I haven't dealt with a number of your comments on the original code 
> which can still be dealt with in the current code - given that most of 
> that code will go with the next PR anyway.
> 
> Please review or merge if it looks like it is a step in the right 
> direction...
> 
>>
>>>
>>> I have another question anyway,
>>>
>>>
>>>> E.g. TikaIO could:
>>>> - take as input a PCollection<ReadableFile>
>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where 
>>>> ParseResult
>>>> is a class with properties { String content, Metadata metadata }
>>>> - be configured by: a Parser (it implements Serializable so can be
>>>> specified at pipeline construction time) and a ContentHandler whose
>>>> toString() will go into "content". ContentHandler does not implement
>>>> Serializable, so you can not specify it at construction time - however,
>>> you
>>>> can let the user specify either its class (if it's a simple handler 
>>>> like
>>> a
>>>> BodyContentHandler) or specify a lambda for creating the handler
>>>> (SerializableFunction<Void, ContentHandler>), and potentially you can
>>> have
>>>> a simpler facade for Tika.parseAsString() - e.g. call it
>>>> TikaIO.parseAllAsStrings().
>>>>
>>>> Example usage would look like:
>>>>
>>>>     PCollection<KV<String, ParseResult>> parseResults =
>>>> p.apply(FileIO.match().filepattern(...))
>>>>       .apply(FileIO.readMatches())
>>>>       .apply(TikaIO.parseAllAsStrings())
>>>>
>>>> or:
>>>>
>>>>       .apply(TikaIO.parseAll()
>>>>           .withParser(new AutoDetectParser())
>>>>           .withContentHandler(() -> new BodyContentHandler(new
>>>> ToXMLContentHandler())))
>>>>
>>>> You could also have shorthands for letting the user avoid using FileIO
>>>> directly in simple cases, for example:
>>>>       p.apply(TikaIO.parseAsStrings().from(filepattern))
>>>>
>>>> This would of course be implemented as a ParDo or even MapElements, and
>>>> you'll be able to share the code between parseAll and regular parse.
>>>>
>>> I'd like to understand how to do
>>>
>>> TikaIO.parse().from(filepattern)
>>>
>>> Right now I have TikaIO.Read extending
>>> PTransform<PBegin, PCollection<ParseResult>>
>>>
>>> and then the boilerplate code which builds Read when I do something like
>>>
>>> TikaIO.read().from(filepattern).
>>>
>>> What is the convention for supporting something like
>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I see
>>> some example ?
>>>
>> There are a number of IOs that don't use Source - e.g. DatastoreIO and
>> JdbcIO. TextIO.readMatches() might be an even better transform to mimic.
>> Note that in TikaIO you probably won't need a fusion break after the 
>> ParDo
>> since there's 1 result per input file.
>>
> 
> OK, I'll have a look
> 
> Cheers, Sergey
> 
>>
>>>
>>> Many thanks, Sergey
>>>
>>

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi
On 28/09/17 17:09, Eugene Kirpichov wrote:
> Hi! Glad the refactoring is happening, thanks!

Thanks for getting me focused on having TikaIO supporting the simpler 
(and practical) cases first :-)
> It was auto-assigned to Reuven as formal owner of the component. I
> reassigned it to you.
OK, thanks...
> 
> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sb...@gmail.com>
> wrote:
> 
>> Hi
>>
>> I started looking at
>> https://issues.apache.org/jira/browse/BEAM-2994
>>
>> and pushed some initial code to my tikaio branch introducing ParseResult
>> and updating the tests but keeping the BoundedSource/Reader, dropping
>> the asynchronous parsing code, and a few other bits.
>>
>> Just noticed it is assigned to Reuven - does it mean Reuven is looking
>> into it too or was it auto-assigned ?
>>
>> I don't mind, would it make sense for me to do an 'interim' PR on
>> what I've done so far before completely removing BoundedSource/Reader
>> based code ?
>>
> Yes :)
> 
I did commit yesterday to my branch, and it made its way to the pending 
PR (which I forgot about) where I only tweaked a couple of doc typos, so 
I renamed that PR:

https://github.com/apache/beam/pull/3835

(The build failures are apparently due to the build timeouts)

As I mentioned, in this PR I updated the existing TikaIO test to work 
with ParseResult, which at the moment has a file location as its property. 
Only a file name could easily be saved, but I thought it might be important 
to record where on the network the file is - maybe to copy it afterwards if 
needed, etc. I'd also have no problem with having it typed as a K key; I was 
only trying to make it a bit simpler at the start.

I'll deal with the new configurations after a switch. TikaConfig would 
most likely still need to be supported but I recall you mentioned the 
way it's done now will make it work only with the direct runner. I guess 
I can load it as a URL resource... The other bits like providing custom 
content handlers, parsers, input metadata, maybe setting the max size 
of the files, etc, can all be added after a switch.

Note I haven't dealt with a number of your comments on the original code; 
they could still be addressed in the current code, but most of that code 
will go with the next PR anyway.

Please review or merge if it looks like it is a step in the right 
direction...

> 
>>
>> I have another question anyway,
>>
>>
>>> E.g. TikaIO could:
>>> - take as input a PCollection<ReadableFile>
>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
>>> is a class with properties { String content, Metadata metadata }
>>> - be configured by: a Parser (it implements Serializable so can be
>>> specified at pipeline construction time) and a ContentHandler whose
>>> toString() will go into "content". ContentHandler does not implement
>>> Serializable, so you can not specify it at construction time - however,
>> you
>>> can let the user specify either its class (if it's a simple handler like
>> a
>>> BodyContentHandler) or specify a lambda for creating the handler
>>> (SerializableFunction<Void, ContentHandler>), and potentially you can
>> have
>>> a simpler facade for Tika.parseAsString() - e.g. call it
>>> TikaIO.parseAllAsStrings().
>>>
>>> Example usage would look like:
>>>
>>>     PCollection<KV<String, ParseResult>> parseResults =
>>> p.apply(FileIO.match().filepattern(...))
>>>       .apply(FileIO.readMatches())
>>>       .apply(TikaIO.parseAllAsStrings())
>>>
>>> or:
>>>
>>>       .apply(TikaIO.parseAll()
>>>           .withParser(new AutoDetectParser())
>>>           .withContentHandler(() -> new BodyContentHandler(new
>>> ToXMLContentHandler())))
>>>
>>> You could also have shorthands for letting the user avoid using FileIO
>>> directly in simple cases, for example:
>>>       p.apply(TikaIO.parseAsStrings().from(filepattern))
>>>
>>> This would of course be implemented as a ParDo or even MapElements, and
>>> you'll be able to share the code between parseAll and regular parse.
>>>
>> I'd like to understand how to do
>>
>> TikaIO.parse().from(filepattern)
>>
>> Right now I have TikaIO.Read extending
>> PTransform<PBegin, PCollection<ParseResult>>
>>
>> and then the boilerplate code which builds Read when I do something like
>>
>> TikaIO.read().from(filepattern).
>>
>> What is the convention for supporting something like
>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I see
>> some example ?
>>
> There are a number of IOs that don't use Source - e.g. DatastoreIO and
> JdbcIO. TextIO.readMatches() might be an even better transform to mimic.
> Note that in TikaIO you probably won't need a fusion break after the ParDo
> since there's 1 result per input file.
> 

OK, I'll have a look

Cheers, Sergey

> 
>>
>> Many thanks, Sergey
>>
> 

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Forgot to remove trim(), but will do it too

Sergey
On 05/10/17 18:09, Sergey Beryozkin wrote:
> Hi Eugene
> 
> I've done an initial commit toward removing TikaSource; more work is 
> needed and I see 3 tasks remaining:
> 1) provide a shortcut which can let users avoid using FileIO directly, 
> as you suggested earlier, at the moment I do:
> 
> https://github.com/sberyozkin/beam/blob/tikaio/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java#L99 
> 
> 
> but would love to be able to type something like this in the simple cases
> 
> PCollection<ParseResult> output =
>          p.apply(TikaIO.parseAll().from(filePattern));
> 
> (note I hope to convince you to keep it as parseAll() as opposed to 
> parseAllToString() :-) but it is a minor and separate issue).
> 
> What I don't understand here is how to do this shortcut without a 
> Pipeline instance, i.e., with explicit FileIO use it looks easy, one 
> creates a pipeline and then one applies to it FileIO and then connects 
> TikaIO via another apply(), but how to implement 
> TikaIO.parseAll().from(filePattern) such that TikaIO links to FileIO 
> internally without .apply ?
> 
> 2) Optimize ParseResult coder as you noted in the review
> 
> 3) Finish it all with finalizing the configuration options (and enabling 
> and enhancing display tests)
> 
> Have a look please, I wonder if it makes sense to merge to the master 
> now for me to do a follow up (and hopefully final) PR next
> 
> Cheers, Sergey
> 
> On 28/09/17 17:09, Eugene Kirpichov wrote:
>> Hi! Glad the refactoring is happening, thanks!
>> It was auto-assigned to Reuven as formal owner of the component. I
>> reassigned it to you.
>>
>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sb...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> I started looking at
>>> https://issues.apache.org/jira/browse/BEAM-2994
>>>
>>> and pushed some initial code to my tikaio branch introducing ParseResult
>>> and updating the tests but keeping the BoundedSource/Reader, dropping
>>> the asynchronous parsing code, and a few other bits.
>>>
>>> Just noticed it is assigned to Reuven - does it mean Reuven is looking
>>> into it too or was it auto-assigned ?
>>>
>>> I don't mind, would it make sense for me to do an 'interim' PR on
>>> what I've done so far before completely removing BoundedSource/Reader
>>> based code ?
>>>
>> Yes :)
>>
>>
>>>
>>> I have another question anyway,
>>>
>>>
>>>> E.g. TikaIO could:
>>>> - take as input a PCollection<ReadableFile>
>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where 
>>>> ParseResult
>>>> is a class with properties { String content, Metadata metadata }
>>>> - be configured by: a Parser (it implements Serializable so can be
>>>> specified at pipeline construction time) and a ContentHandler whose
>>>> toString() will go into "content". ContentHandler does not implement
>>>> Serializable, so you can not specify it at construction time - however,
>>> you
>>>> can let the user specify either its class (if it's a simple handler 
>>>> like
>>> a
>>>> BodyContentHandler) or specify a lambda for creating the handler
>>>> (SerializableFunction<Void, ContentHandler>), and potentially you can
>>> have
>>>> a simpler facade for Tika.parseAsString() - e.g. call it
>>>> TikaIO.parseAllAsStrings().
>>>>
>>>> Example usage would look like:
>>>>
>>>>     PCollection<KV<String, ParseResult>> parseResults =
>>>> p.apply(FileIO.match().filepattern(...))
>>>>       .apply(FileIO.readMatches())
>>>>       .apply(TikaIO.parseAllAsStrings())
>>>>
>>>> or:
>>>>
>>>>       .apply(TikaIO.parseAll()
>>>>           .withParser(new AutoDetectParser())
>>>>           .withContentHandler(() -> new BodyContentHandler(new
>>>> ToXMLContentHandler())))
>>>>
>>>> You could also have shorthands for letting the user avoid using FileIO
>>>> directly in simple cases, for example:
>>>>       p.apply(TikaIO.parseAsStrings().from(filepattern))
>>>>
>>>> This would of course be implemented as a ParDo or even MapElements, and
>>>> you'll be able to share the code between parseAll and regular parse.
>>>>
>>> I'd like to understand how to do
>>>
>>> TikaIO.parse().from(filepattern)
>>>
>>> Right now I have TikaIO.Read extending
>>> PTransform<PBegin, PCollection<ParseResult>>
>>>
>>> and then the boilerplate code which builds Read when I do something like
>>>
>>> TikaIO.read().from(filepattern).
>>>
>>> What is the convention for supporting something like
>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I see
>>> some example ?
>>>
>> There are a number of IOs that don't use Source - e.g. DatastoreIO and
>> JdbcIO. TextIO.readMatches() might be an even better transform to mimic.
>> Note that in TikaIO you probably won't need a fusion break after the 
>> ParDo
>> since there's 1 result per input file.
>>
>>
>>>
>>> Many thanks, Sergey
>>>
>>

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Eugene

Given that I was away last week, I did not have much time to work on the 
PR. Thanks for the latest review comments - all the rounds have helped 
indeed. I've just pushed some minor updates (apart from making the 
TikaConfig code run only once - I guess as part of ParseAll).

Given that there've been quite a few comments, I'm summarizing again 
what I think still needs to be done:
- the shortcut to let users avoid typing FileIO match/readMatches
- ParseResult improvements to carry success/failure, to keep the 
pipeline running even if a corrupt file is found...

Thanks, Sergey
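
[Editor's note] The second item above - letting ParseResult record per-file success or failure so one corrupt file does not fail the whole pipeline - could look roughly like the following. This is only a sketch of the idea under discussion, not the API that was merged; error details are carried as a String here for easy serialization.

```java
import java.io.Serializable;

// Sketch: a ParseResult that records a per-file failure instead of throwing,
// so the parsing ParDo can emit it and the pipeline keeps running.
final class ParseResult implements Serializable {
  private final String fileLocation;
  private final String content;      // null when parsing failed
  private final String errorMessage; // null when parsing succeeded

  private ParseResult(String fileLocation, String content, String errorMessage) {
    this.fileLocation = fileLocation;
    this.content = content;
    this.errorMessage = errorMessage;
  }

  static ParseResult success(String fileLocation, String content) {
    return new ParseResult(fileLocation, content, null);
  }

  static ParseResult failure(String fileLocation, Exception error) {
    return new ParseResult(fileLocation, null, error.toString());
  }

  boolean isSuccess() { return errorMessage == null; }
  String getFileLocation() { return fileLocation; }
  String getContent() { return content; }
  String getErrorMessage() { return errorMessage; }
}
```

A processElement body would then wrap the Tika call in try/catch and emit ParseResult.failure(...) instead of rethrowing.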
On 06/10/17 16:50, Sergey Beryozkin wrote:
> Hi Eugene
> 
> Thanks, I've addressed some of the latest comments, and tried to justify 
> why some would rather not be addressed (simplifying with Tika.parseToString(), 
> removing TikaOptions - maybe later).
> 
> I'll focus on adding a shortcut per the below suggestions, then the 
> better coder, then more configuration options, and work with the review 
> comments in between...
> 
> I'm traveling next week so not sure I'll have enough time to concentrate 
> on this PR but will continue afterwards
> 
> Cheers, Sergey
> On 06/10/17 02:30, Eugene Kirpichov wrote:
>> On Thu, Oct 5, 2017 at 10:15 AM Sergey Beryozkin <sb...@gmail.com>
>> wrote:
>>
>>> Hi Eugene
>>>
>>> I've done an initial commit toward removing TikaSource; more work is
>>> needed and I see 3 tasks remaining:
>>> 1) provide a shortcut which can let users avoid using FileIO directly,
>>> as you suggested earlier, at the moment I do:
>>>
>>>
>>> https://github.com/sberyozkin/beam/blob/tikaio/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java#L99 
>>>
>>>
>>> but would love to be able to type something like this in the simple 
>>> cases
>>>
>>> PCollection<ParseResult> output =
>>>           p.apply(TikaIO.parseAll().from(filePattern));
>>>
>> Yup, makes sense.
>>
>>
>>>
>>> (note I hope to convince you to keep it as parseAll() as opposed to
>>> parseAllToString() :-) but it is a minor and separate issue).
>>>
>>> What I don't understand here is how to do this shortcut without a
>>> Pipeline instance, i.e., with explicit FileIO use it looks easy, one
>>> creates a pipeline and then one applies to it FileIO and then connects
>>> TikaIO via another apply(), but how to implement
>>> TikaIO.parseAll().from(filePattern) such that TikaIO links to FileIO
>>> internally without .apply ?
>>>
>> It is a composite transform - it can construct arbitrary complex stuff
>> inside expand(). TikaIO.parseAll()'s expand() method might look something
>> like:
>>
>> ParseAll.expand(PCollection<String> filepatterns) {
>>    return filepatterns.apply(FileIO.matchAll())
>>        .apply(FileIO.readMatches().withCompression(UNCOMPRESSED))
>>        .apply(TikaIO.parseFiles());
>> }
>>
>> ParseFiles.expand(PCollection<ReadableFile> files) {
>>    return files.apply(ParDo.of(... whatever you have there now ...))
>> }
>>
>>
>>>
>>> 2) Optimize ParseResult coder as you noted in the review
>>>
>>> 3) Finish it all with finalizing the configuration options (and enabling
>>> and enhancing display tests)
>>>
>>> Have a look please, I wonder if it makes sense to merge to the master
>>> now for me to do a follow up (and hopefully final) PR next
>>>
>>> Cheers, Sergey
>>>
>>> On 28/09/17 17:09, Eugene Kirpichov wrote:
>>>> Hi! Glad the refactoring is happening, thanks!
>>>> It was auto-assigned to Reuven as formal owner of the component. I
>>>> reassigned it to you.
>>>>
>>>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sb...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I started looking at
>>>>> https://issues.apache.org/jira/browse/BEAM-2994
>>>>>
>>>>> and pushed some initial code to my tikaio branch introducing 
>>>>> ParseResult
>>>>> and updating the tests but keeping the BoundedSource/Reader, dropping
>>>>> the asynchronous parsing code, and a few other bits.
>>>>>
>>>>> Just noticed it is assigned to Reuven - does it mean Reuven is looking
>>>>> into it too or was it auto-assigned ?
>>>>>
>>>>> I don't mind, would it make sense for me to do an 'interim' PR on
>>>>> what I've done so far before completely removing BoundedSource/Reader
>>>>> based code ?
>>>>>
>>>> Yes :)
>>>>
>>>>
>>>>>
>>>>> I have another question anyway,
>>>>>
>>>>>
>>>>>> E.g. TikaIO could:
>>>>>> - take as input a PCollection<ReadableFile>
>>>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where
>>> ParseResult
>>>>>> is a class with properties { String content, Metadata metadata }
>>>>>> - be configured by: a Parser (it implements Serializable so can be
>>>>>> specified at pipeline construction time) and a ContentHandler whose
>>>>>> toString() will go into "content". ContentHandler does not implement
>>>>>> Serializable, so you can not specify it at construction time - 
>>>>>> however,
>>>>> you
>>>>>> can let the user specify either its class (if it's a simple handler
>>> like
>>>>> a
>>>>>> BodyContentHandler) or specify a lambda for creating the handler
>>>>>> (SerializableFunction<Void, ContentHandler>), and potentially you can
>>>>> have
>>>>>> a simpler facade for Tika.parseAsString() - e.g. call it
>>>>>> TikaIO.parseAllAsStrings().
>>>>>>
>>>>>> Example usage would look like:
>>>>>>
>>>>>>      PCollection<KV<String, ParseResult>> parseResults =
>>>>>> p.apply(FileIO.match().filepattern(...))
>>>>>>        .apply(FileIO.readMatches())
>>>>>>        .apply(TikaIO.parseAllAsStrings())
>>>>>>
>>>>>> or:
>>>>>>
>>>>>>        .apply(TikaIO.parseAll()
>>>>>>            .withParser(new AutoDetectParser())
>>>>>>            .withContentHandler(() -> new BodyContentHandler(new
>>>>>> ToXMLContentHandler())))
>>>>>>
>>>>>> You could also have shorthands for letting the user avoid using 
>>>>>> FileIO
>>>>>> directly in simple cases, for example:
>>>>>>        p.apply(TikaIO.parseAsStrings().from(filepattern))
>>>>>>
>>>>>> This would of course be implemented as a ParDo or even 
>>>>>> MapElements, and
>>>>>> you'll be able to share the code between parseAll and regular parse.
>>>>>>
>>>>> I'd like to understand how to do
>>>>>
>>>>> TikaIO.parse().from(filepattern)
>>>>>
>>>>> Right now I have TikaIO.Read extending
>>>>> PTransform<PBegin, PCollection<ParseResult>>
>>>>>
>>>>> and then the boilerplate code which builds Read when I do something 
>>>>> like
>>>>>
>>>>> TikaIO.read().from(filepattern).
>>>>>
>>>>> What is the convention for supporting something like
>>>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can 
>>>>> I see
>>>>> some example ?
>>>>>
>>>> There are a number of IOs that don't use Source - e.g. DatastoreIO and
>>>> JdbcIO. TextIO.readMatches() might be an even better transform to 
>>>> mimic.
>>>> Note that in TikaIO you probably won't need a fusion break after the
>>> ParDo
>>>> since there's 1 result per input file.
>>>>
>>>>
>>>>>
>>>>> Many thanks, Sergey
>>>>>
>>>>
>>>
>>

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Eugene

Thanks, I've addressed some of the latest comments, and tried to justify 
why some would rather not be addressed (simplifying with Tika.parseToString(), 
removing TikaOptions - maybe later).

I'll focus on adding a shortcut per the below suggestions, then the 
better coder, then more configuration options, and work with the review 
comments in between...

I'm traveling next week, so I'm not sure I'll have enough time to 
concentrate on this PR, but I will continue afterwards.

Cheers, Sergey
On 06/10/17 02:30, Eugene Kirpichov wrote:
> On Thu, Oct 5, 2017 at 10:15 AM Sergey Beryozkin <sb...@gmail.com>
> wrote:
> 
>> Hi Eugene
>>
>> I've done an initial commit toward removing TikaSource; more work is
>> needed and I see 3 tasks remaining:
>> 1) provide a shortcut which can let users avoid using FileIO directly,
>> as you suggested earlier, at the moment I do:
>>
>>
>> https://github.com/sberyozkin/beam/blob/tikaio/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java#L99
>>
>> but would love to be able to type something like this in the simple cases
>>
>> PCollection<ParseResult> output =
>>           p.apply(TikaIO.parseAll().from(filePattern));
>>
> Yup, makes sense.
> 
> 
>>
>> (note I hope to convince you to keep it as parseAll() as opposed to
>> parseAllToString() :-) but it is a minor and separate issue).
>>
>> What I don't understand here is how to do this shortcut without a
>> Pipeline instance, i.e., with explicit FileIO use it looks easy, one
>> creates a pipeline and then one applies to it FileIO and then connects
>> TikaIO via another apply(), but how to implement
>> TikaIO.parseAll().from(filePattern) such that TikaIO links to FileIO
>> internally without .apply ?
>>
> It is a composite transform - it can construct arbitrary complex stuff
> inside expand(). TikaIO.parseAll()'s expand() method might look something
> like:
> 
> ParseAll.expand(PCollection<String> filepatterns) {
>    return filepatterns.apply(FileIO.matchAll())
>        .apply(FileIO.readMatches().withCompression(UNCOMPRESSED))
>        .apply(TikaIO.parseFiles());
> }
> 
> ParseFiles.expand(PCollection<ReadableFile> files) {
>    return files.apply(ParDo.of(... whatever you have there now ...))
> }
> 
> 
>>
>> 2) Optimize ParseResult coder as you noted in the review
>>
>> 3) Finish it all with finalizing the configuration options (and enabling
>> and enhancing display tests)
>>
>> Have a look please, I wonder if it makes sense to merge to the master
>> now for me to do a follow up (and hopefully final) PR next
>>
>> Cheers, Sergey
>>
>> On 28/09/17 17:09, Eugene Kirpichov wrote:
>>> Hi! Glad the refactoring is happening, thanks!
>>> It was auto-assigned to Reuven as formal owner of the component. I
>>> reassigned it to you.
>>>
>>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sb...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I started looking at
>>>> https://issues.apache.org/jira/browse/BEAM-2994
>>>>
>>>> and pushed some initial code to my tikaio branch introducing ParseResult
>>>> and updating the tests but keeping the BoundedSource/Reader, dropping
>>>> the asynchronous parsing code, and a few other bits.
>>>>
>>>> Just noticed it is assigned to Reuven - does it mean Reuven is looking
>>>> into it too or was it auto-assigned ?
>>>>
>>>> I don't mind, would it make sense for me to do an 'interim' PR on
>>>> what I've done so far before completely removing BoundedSource/Reader
>>>> based code ?
>>>>
>>> Yes :)
>>>
>>>
>>>>
>>>> I have another question anyway,
>>>>
>>>>
>>>>> E.g. TikaIO could:
>>>>> - take as input a PCollection<ReadableFile>
>>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where
>> ParseResult
>>>>> is a class with properties { String content, Metadata metadata }
>>>>> - be configured by: a Parser (it implements Serializable so can be
>>>>> specified at pipeline construction time) and a ContentHandler whose
>>>>> toString() will go into "content". ContentHandler does not implement
>>>>> Serializable, so you can not specify it at construction time - however,
>>>> you
>>>>> can let the user specify either its class (if it's a simple handler
>> like
>>>> a
>>>>> BodyContentHandler) or specify a lambda for creating the handler
>>>>> (SerializableFunction<Void, ContentHandler>), and potentially you can
>>>> have
>>>>> a simpler facade for Tika.parseAsString() - e.g. call it
>>>>> TikaIO.parseAllAsStrings().
>>>>>
>>>>> Example usage would look like:
>>>>>
>>>>>      PCollection<KV<String, ParseResult>> parseResults =
>>>>> p.apply(FileIO.match().filepattern(...))
>>>>>        .apply(FileIO.readMatches())
>>>>>        .apply(TikaIO.parseAllAsStrings())
>>>>>
>>>>> or:
>>>>>
>>>>>        .apply(TikaIO.parseAll()
>>>>>            .withParser(new AutoDetectParser())
>>>>>            .withContentHandler(() -> new BodyContentHandler(new
>>>>> ToXMLContentHandler())))
>>>>>
>>>>> You could also have shorthands for letting the user avoid using FileIO
>>>>> directly in simple cases, for example:
>>>>>        p.apply(TikaIO.parseAsStrings().from(filepattern))
>>>>>
>>>>> This would of course be implemented as a ParDo or even MapElements, and
>>>>> you'll be able to share the code between parseAll and regular parse.
>>>>>
>>>> I'd like to understand how to do
>>>>
>>>> TikaIO.parse().from(filepattern)
>>>>
>>>> Right now I have TikaIO.Read extending
>>>> PTransform<PBegin, PCollection<ParseResult>>
>>>>
>>>> and then the boilerplate code which builds Read when I do something like
>>>>
>>>> TikaIO.read().from(filepattern).
>>>>
>>>> What is the convention for supporting something like
>>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I see
>>>> some example ?
>>>>
>>> There are a number of IOs that don't use Source - e.g. DatastoreIO and
>>> JdbcIO. TextIO.readMatches() might be an even better transform to mimic.
>>> Note that in TikaIO you probably won't need a fusion break after the
>> ParDo
>>> since there's 1 result per input file.
>>>
>>>
>>>>
>>>> Many thanks, Sergey
>>>>
>>>
>>
> 

Re: TikaIO Refactoring

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
On Thu, Oct 5, 2017 at 10:15 AM Sergey Beryozkin <sb...@gmail.com>
wrote:

> Hi Eugene
>
> I've done an initial commit toward removing TikaSource; more work is
> needed and I see 3 tasks remaining:
> 1) provide a shortcut which can let users avoid using FileIO directly,
> as you suggested earlier, at the moment I do:
>
>
> https://github.com/sberyozkin/beam/blob/tikaio/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java#L99
>
> but would love to be able to type something like this in the simple cases
>
> PCollection<ParseResult> output =
>          p.apply(TikaIO.parseAll().from(filePattern));
>
Yup, makes sense.


>
> (note I hope to convince you to keep it as parseAll() as opposed to
> parseAllToString() :-) but it is a minor and separate issue).
>
> What I don't understand here is how to do this shortcut without a
> Pipeline instance, i.e., with explicit FileIO use it looks easy, one
> creates a pipeline and then one applies to it FileIO and then connects
> TikaIO via another apply(), but how to implement
> TikaIO.parseAll().from(filePattern) such that TikaIO links to FileIO
> internally without .apply ?
>
It is a composite transform - it can construct arbitrary complex stuff
inside expand(). TikaIO.parseAll()'s expand() method might look something
like:

ParseAll.expand(PCollection<String> filepatterns) {
  return filepatterns.apply(FileIO.matchAll())
      .apply(FileIO.readMatches().withCompression(UNCOMPRESSED))
      .apply(TikaIO.parseFiles());
}

ParseFiles.expand(PCollection<ReadableFile> files) {
  return files.apply(ParDo.of(... whatever you have there now ...))
}
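
For the PBegin-rooted shortcut itself, the trick is that from(filePattern)
can be stored as configuration and turned into a PCollection inside
expand() via Create.of(), so the user never touches a Pipeline directly.
A rough sketch (hypothetical names; assumes the parseFiles() transform
above and Beam's Create/FileIO transforms):

```java
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical sketch: no Pipeline handle is needed because expand()
// receives a PBegin, and Create.of() roots the filepattern in whatever
// pipeline this transform is applied to.
class Parse extends PTransform<PBegin, PCollection<ParseResult>> {
  private final String filepattern;

  Parse(String filepattern) {
    this.filepattern = filepattern;
  }

  @Override
  public PCollection<ParseResult> expand(PBegin input) {
    return input
        .apply(Create.of(filepattern))
        .apply(FileIO.matchAll())
        .apply(FileIO.readMatches().withCompression(Compression.UNCOMPRESSED))
        .apply(TikaIO.parseFiles()); // the ParDo-based transform sketched above
  }
}
```

Then TikaIO.parse().from(filepattern) would just construct and return a
Parse instance.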


>
> 2) Optimize ParseResult coder as you noted in the review
>
> 3) Finish it all with finalizing the configuration options (and enabling
> and enhancing display tests)
>
> Have a look please, I wonder if it makes sense to merge to the master
> now for me to do a follow up (and hopefully final) PR next
>
> Cheers, Sergey
>
> On 28/09/17 17:09, Eugene Kirpichov wrote:
> > Hi! Glad the refactoring is happening, thanks!
> > It was auto-assigned to Reuven as formal owner of the component. I
> > reassigned it to you.
> >
> > On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sb...@gmail.com>
> > wrote:
> >
> >> Hi
> >>
> >> I started looking at
> >> https://issues.apache.org/jira/browse/BEAM-2994
> >>
> >> and pushed some initial code to my tikaio branch introducing ParseResult
> >> and updating the tests but keeping the BoundedSource/Reader, dropping
> >> the asynchronous parsing code, and a few other bits.
> >>
> >> Just noticed it is assigned to Reuven - does it mean Reuven is looking
> >> into it too or was it auto-assigned ?
> >>
> >> I don't mind, would it make sense for me to do an 'interim' PR on
> >> what I've done so far before completely removing BoundedSource/Reader
> >> based code ?
> >>
> > Yes :)
> >
> >
> >>
> >> I have another question anyway,
> >>
> >>
> >>> E.g. TikaIO could:
> >>> - take as input a PCollection<ReadableFile>
> >>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where
> ParseResult
> >>> is a class with properties { String content, Metadata metadata }
> >>> - be configured by: a Parser (it implements Serializable so can be
> >>> specified at pipeline construction time) and a ContentHandler whose
> >>> toString() will go into "content". ContentHandler does not implement
> >>> Serializable, so you can not specify it at construction time - however,
> >> you
> >>> can let the user specify either its class (if it's a simple handler
> like
> >> a
> >>> BodyContentHandler) or specify a lambda for creating the handler
> >>> (SerializableFunction<Void, ContentHandler>), and potentially you can
> >> have
> >>> a simpler facade for Tika.parseAsString() - e.g. call it
> >>> TikaIO.parseAllAsStrings().
> >>>
> >>> Example usage would look like:
> >>>
> >>>     PCollection<KV<String, ParseResult>> parseResults =
> >>> p.apply(FileIO.match().filepattern(...))
> >>>       .apply(FileIO.readMatches())
> >>>       .apply(TikaIO.parseAllAsStrings())
> >>>
> >>> or:
> >>>
> >>>       .apply(TikaIO.parseAll()
> >>>           .withParser(new AutoDetectParser())
> >>>           .withContentHandler(() -> new BodyContentHandler(new
> >>> ToXMLContentHandler())))
> >>>
> >>> You could also have shorthands for letting the user avoid using FileIO
> >>> directly in simple cases, for example:
> >>>       p.apply(TikaIO.parseAsStrings().from(filepattern))
> >>>
> >>> This would of course be implemented as a ParDo or even MapElements, and
> >>> you'll be able to share the code between parseAll and regular parse.
> >>>
> >> I'd like to understand how to do
> >>
> >> TikaIO.parse().from(filepattern)
> >>
> >> Right now I have TikaIO.Read extending
> >> PTransform<PBegin, PCollection<ParseResult>>
> >>
> >> and then the boilerplate code which builds Read when I do something like
> >>
> >> TikaIO.read().from(filepattern).
> >>
> >> What is the convention for supporting something like
> >> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I see
> >> some example ?
> >>
> > There are a number of IOs that don't use Source - e.g. DatastoreIO and
> > JdbcIO. TextIO.readMatches() might be an even better transform to mimic.
> > Note that in TikaIO you probably won't need a fusion break after the
> ParDo
> > since there's 1 result per input file.
> >
> >
> >>
> >> Many thanks, Sergey
> >>
> >
>

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Eugene

I've done an initial commit to do with removing TikaSource, more work is 
needed and I see 3 tasks remaining:
1) provide a shortcut which can let users avoid using FileIO directly, 
as you suggested earlier, at the moment I do:

https://github.com/sberyozkin/beam/blob/tikaio/sdks/java/io/tika/src/test/java/org/apache/beam/sdk/io/tika/TikaIOTest.java#L99

but would love to be able to type something like this in the simple cases

PCollection<ParseResult> output =
         p.apply(TikaIO.parseAll().from(filePattern));

(note I hope to convince you to keep it as parseAll() as opposed to 
parseAllToString() :-) but it is a minor and separate issue).

What I don't understand here is how to do this shortcut without a 
Pipeline instance, i.e., with explicit FileIO use it looks easy, one 
creates a pipeline and then one applies to it FileIO and then connects 
TikaIO via another apply(), but how to implement 
TikaIO.parseAll().from(filePattern) such that TikaIO links to FileIO 
internally without .apply ?

2) Optimize ParseResult coder as you noted in the review

3) Finish it all with finalizing the configuration options (and enabling 
and enhancing display tests)

Have a look please, I wonder if it makes sense to merge to the master 
now for me to do a follow up (and hopefully final) PR next

Cheers, Sergey

On 28/09/17 17:09, Eugene Kirpichov wrote:
> Hi! Glad the refactoring is happening, thanks!
> It was auto-assigned to Reuven as formal owner of the component. I
> reassigned it to you.
> 
> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sb...@gmail.com>
> wrote:
> 
>> Hi
>>
>> I started looking at
>> https://issues.apache.org/jira/browse/BEAM-2994
>>
>> and pushed some initial code to my tikaio branch introducing ParseResult
>> and updating the tests but keeping the BoundedSource/Reader, dropping
>> the asynchronous parsing code, and a few other bits.
>>
>> Just noticed it is assigned to Reuven - does it mean Reuven is looking
>> into it too or was it auto-assigned ?
>>
>> I don't mind, would it make sense for me to do an 'interim' PR on
>> what I've done so far before completely removing BoundedSource/Reader
>> based code ?
>>
> Yes :)
> 
> 
>>
>> I have another question anyway,
>>
>>
>>> E.g. TikaIO could:
>>> - take as input a PCollection<ReadableFile>
>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
>>> is a class with properties { String content, Metadata metadata }
>>> - be configured by: a Parser (it implements Serializable so can be
>>> specified at pipeline construction time) and a ContentHandler whose
>>> toString() will go into "content". ContentHandler does not implement
>>> Serializable, so you can not specify it at construction time - however,
>> you
>>> can let the user specify either its class (if it's a simple handler like
>> a
>>> BodyContentHandler) or specify a lambda for creating the handler
>>> (SerializableFunction<Void, ContentHandler>), and potentially you can
>> have
>>> a simpler facade for Tika.parseAsString() - e.g. call it
>>> TikaIO.parseAllAsStrings().
>>>
>>> Example usage would look like:
>>>
>>>     PCollection<KV<String, ParseResult>> parseResults =
>>> p.apply(FileIO.match().filepattern(...))
>>>       .apply(FileIO.readMatches())
>>>       .apply(TikaIO.parseAllAsStrings())
>>>
>>> or:
>>>
>>>       .apply(TikaIO.parseAll()
>>>           .withParser(new AutoDetectParser())
>>>           .withContentHandler(() -> new BodyContentHandler(new
>>> ToXMLContentHandler())))
>>>
>>> You could also have shorthands for letting the user avoid using FileIO
>>> directly in simple cases, for example:
>>>       p.apply(TikaIO.parseAsStrings().from(filepattern))
>>>
>>> This would of course be implemented as a ParDo or even MapElements, and
>>> you'll be able to share the code between parseAll and regular parse.
>>>
>> I'd like to understand how to do
>>
>> TikaIO.parse().from(filepattern)
>>
>> Right now I have TikaIO.Read extending
>> PTransform<PBegin, PCollection<ParseResult>>
>>
>> and then the boilerplate code which builds Read when I do something like
>>
>> TikaIO.read().from(filepattern).
>>
>> What is the convention for supporting something like
>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I see
>> some example ?
>>
> There are a number of IOs that don't use Source - e.g. DatastoreIO and
> JdbcIO. TextIO.readMatches() might be an even better transform to mimic.
> Note that in TikaIO you probably won't need a fusion break after the ParDo
> since there's 1 result per input file.
> 
> 
>>
>> Many thanks, Sergey
>>
> 

Re: TikaIO Refactoring

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Hi! Glad the refactoring is happening, thanks!
It was auto-assigned to Reuven as formal owner of the component. I
reassigned it to you.

On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sb...@gmail.com>
wrote:

> Hi
>
> I started looking at
> https://issues.apache.org/jira/browse/BEAM-2994
>
> and pushed some initial code to my tikaio branch introducing ParseResult
> and updating the tests but keeping the BoundedSource/Reader, dropping
> the asynchronous parsing code, and a few other bits.
>
> Just noticed it is assigned to Reuven - does it mean Reuven is looking
> into it too or was it auto-assigned ?
>
> I don't mind, would it make sense for me to do an 'interim' PR on
> what I've done so far before completely removing BoundedSource/Reader
> based code ?
>
Yes :)


>
> I have another question anyway,
>
>
> > E.g. TikaIO could:
> > - take as input a PCollection<ReadableFile>
> > - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
> > is a class with properties { String content, Metadata metadata }
> > - be configured by: a Parser (it implements Serializable so can be
> > specified at pipeline construction time) and a ContentHandler whose
> > toString() will go into "content". ContentHandler does not implement
> > Serializable, so you can not specify it at construction time - however,
> you
> > can let the user specify either its class (if it's a simple handler like
> a
> > BodyContentHandler) or specify a lambda for creating the handler
> > (SerializableFunction<Void, ContentHandler>), and potentially you can
> have
> > a simpler facade for Tika.parseAsString() - e.g. call it
> > TikaIO.parseAllAsStrings().
> >
> > Example usage would look like:
> >
> >    PCollection<KV<String, ParseResult>> parseResults =
> > p.apply(FileIO.match().filepattern(...))
> >      .apply(FileIO.readMatches())
> >      .apply(TikaIO.parseAllAsStrings())
> >
> > or:
> >
> >      .apply(TikaIO.parseAll()
> >          .withParser(new AutoDetectParser())
> >          .withContentHandler(() -> new BodyContentHandler(new
> > ToXMLContentHandler())))
> >
> > You could also have shorthands for letting the user avoid using FileIO
> > directly in simple cases, for example:
> >      p.apply(TikaIO.parseAsStrings().from(filepattern))
> >
> > This would of course be implemented as a ParDo or even MapElements, and
> > you'll be able to share the code between parseAll and regular parse.
> >
> I'd like to understand how to do
>
> TikaIO.parse().from(filepattern)
>
> Right now I have TikaIO.Read extending
> PTransform<PBegin, PCollection<ParseResult>>
>
> and then the boilerplate code which builds Read when I do something like
>
> TikaIO.read().from(filepattern).
>
> What is the convention for supporting something like
> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I see
> some example ?
>
There are a number of IOs that don't use Source - e.g. DatastoreIO and
JdbcIO. TextIO.readMatches() might be an even better transform to mimic.
Note that in TikaIO you probably won't need a fusion break after the ParDo
since there's 1 result per input file.


>
> Many thanks, Sergey
>

TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi

I started looking at
https://issues.apache.org/jira/browse/BEAM-2994

and pushed some initial code to my tikaio branch introducing ParseResult 
and updating the tests but keeping the BoundedSource/Reader, dropping 
the asynchronous parsing code, and a few other bits.

Just noticed it is assigned to Reuven - does it mean Reuven is looking 
into it too or was it auto-assigned ?

I don't mind, would it make sense for me to do an 'interim' PR on 
what I've done so far before completely removing BoundedSource/Reader 
based code ?

I have another question anyway,


> E.g. TikaIO could:
> - take as input a PCollection<ReadableFile>
> - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
> is a class with properties { String content, Metadata metadata }
> - be configured by: a Parser (it implements Serializable so can be
> specified at pipeline construction time) and a ContentHandler whose
> toString() will go into "content". ContentHandler does not implement
> Serializable, so you can not specify it at construction time - however, you
> can let the user specify either its class (if it's a simple handler like a
> BodyContentHandler) or specify a lambda for creating the handler
> (SerializableFunction<Void, ContentHandler>), and potentially you can have
> a simpler facade for Tika.parseAsString() - e.g. call it
> TikaIO.parseAllAsStrings().
> 
> Example usage would look like:
> 
>    PCollection<KV<String, ParseResult>> parseResults =
> p.apply(FileIO.match().filepattern(...))
>      .apply(FileIO.readMatches())
>      .apply(TikaIO.parseAllAsStrings())
> 
> or:
> 
>      .apply(TikaIO.parseAll()
>          .withParser(new AutoDetectParser())
>          .withContentHandler(() -> new BodyContentHandler(new
> ToXMLContentHandler())))
> 
> You could also have shorthands for letting the user avoid using FileIO
> directly in simple cases, for example:
>      p.apply(TikaIO.parseAsStrings().from(filepattern))
> 
> This would of course be implemented as a ParDo or even MapElements, and
> you'll be able to share the code between parseAll and regular parse.
> 
I'd like to understand how to do

TikaIO.parse().from(filepattern)

Right now I have TikaIO.Read extending
PTransform<PBegin, PCollection<ParseResult>>

and then the boilerplate code which builds Read when I do something like

TikaIO.read().from(filepattern).

What is the convention for supporting something like
TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I see 
some example ?

Many thanks, Sergey
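
A side note on the ParseResult shape quoted above: the value type itself
can be tiny. A minimal sketch (a plain Map stands in for Tika's Metadata
here, purely to keep the example self-contained):

```java
import java.util.Collections;
import java.util.Map;

// Sketch of the proposed ParseResult value type; real TikaIO would hold
// Tika's Metadata rather than a Map, plus a coder for the pipeline.
final class ParseResult {
  private final String content;
  private final Map<String, String> metadata;

  ParseResult(String content, Map<String, String> metadata) {
    this.content = content;
    this.metadata = Collections.unmodifiableMap(metadata);
  }

  String getContent() {
    return content;
  }

  Map<String, String> getMetadata() {
    return metadata;
  }
}
```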

RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
>> How will it work now, with new Metadata() passed to the AutoDetect parser, will this Metadata have a Metadata value per every attachment, possibly keyed by a name ?

An example of how to call the RecursiveParserWrapper:

https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/ParsingExample.java#L138

To serialize the List<Metadata>, use:

https://github.com/apache/tika/blob/master/tika-serialization/src/main/java/org/apache/tika/metadata/serialization/JsonMetadataList.java#L47 
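
Putting those two links together, the calls fit roughly like this (a
sketch against the Tika 1.x API; assumes tika-parsers and
tika-serialization are on the classpath, and the file path is made up):

```java
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.serialization.JsonMetadataList;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.helpers.DefaultHandler;

class RecursiveParseExample {
  // Parses one file and returns a JSON array with a metadata map per
  // document: index 0 is the container file, later entries are the
  // embedded files, each keeping its own metadata.
  static String parseToJson(String path) throws Exception {
    RecursiveParserWrapper wrapper = new RecursiveParserWrapper(
        new AutoDetectParser(),
        new BasicContentHandlerFactory(
            BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
    try (InputStream stream = Files.newInputStream(Paths.get(path))) {
      wrapper.parse(stream, new DefaultHandler(), new Metadata(),
          new ParseContext());
    }
    List<Metadata> metadataList = wrapper.getMetadata();
    StringWriter writer = new StringWriter();
    JsonMetadataList.toJson(metadataList, writer);
    return writer.toString();
  }
}
```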



Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Tim

Sorry for getting into the RecursiveParserWrapper discussion first, I 
was certain the time zone difference was on my side :-)

How will it work now, with new Metadata() passed to the AutoDetect 
parser, will this Metadata have a Metadata value per every attachment, 
possibly keyed by a name ?

Thanks, Sergey
On 22/09/17 12:58, Allison, Timothy B. wrote:
> @Timothy: can you tell more about this RecursiveParserWrapper? Is this something that the user can configure by specifying the Parser on TikaIO if they so wish?
> 
> Not at the moment, we’d have to do some coding on our end or within Beam.  The format is a list of maps/dicts for each file.  Each map contains all of the metadata, with one key reserved for the content.  If a file has no attachments, the list has length 1; otherwise there’s a map for each embedded file.  Unlike our legacy xhtml, this format maintains metadata for attachments.
> 
> The downside to this extract format is that it requires a full parse of the document and all data to be held in-memory before writing it.  On the other hand, while Tika tries to be streaming, and that was one of the critical early design goals, for some file formats, we simply have to parse the whole thing before we can have any output.
> 
> So, y, large files are a problem. :\
> 
> Example with purely made-up keys representing a pdf file containing an RTF attachment
> [
> {
>     Name : “container file”,
>     Author: “Chris Mattmann”,
>     Content: “Four score and seven years ago…”,
>     Content-type: “application/pdf”
>    …
> },
> {
>    Name : “embedded file1”
>    Author: “Nick Burch”,
>    Content: “When in the course of human events…”,
>    Content-type: “application/rtf”
> }
> ]
> 
> From: Eugene Kirpichov [mailto:kirpichov@google.com]
> Sent: Thursday, September 21, 2017 7:42 PM
> To: Allison, Timothy B. <ta...@mitre.org>; dev@beam.apache.org
> Cc: dev@tika.apache.org
> Subject: Re: TikaIO concerns
> 
> Hi,
> @Sergey:
> - I already marked TikaIO @Experimental, so we can make changes.
> - Yes, the String in KV<String, ParseResult> is the filename. I guess we could alternatively put it into ParseResult - don't have a strong opinion.
> 
> @Chris: unorderedness of Metadata would have helped if we extracted each Metadata item into a separate PCollection element, but that's not what we want to do (we want to have an element per document instead).
> 
> @Timothy: can you tell more about this RecursiveParserWrapper? Is this something that the user can configure by specifying the Parser on TikaIO if they so wish?
> 
> On Thu, Sep 21, 2017 at 2:23 PM Allison, Timothy B. <ta...@mitre.org>> wrote:
> Like Sergey, it’ll take me some time to understand your recommendations.  Thank you!
> 
> On one small point:
>> return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata }
> 
> For this option, I’d strongly encourage using the Json output from the RecursiveParserWrapper that contains metadata and content, and captures metadata even from embedded documents.
> 
>> However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea
> Large documents are a problem, no doubt about it…
> 
> From: Eugene Kirpichov [mailto:kirpichov@google.com<ma...@google.com>]
> Sent: Thursday, September 21, 2017 4:41 PM
> To: Allison, Timothy B. <ta...@mitre.org>>; dev@beam.apache.org<ma...@beam.apache.org>
> Cc: dev@tika.apache.org<ma...@tika.apache.org>
> Subject: Re: TikaIO concerns
> 
> Thanks all for the discussion. It seems we have consensus that both within-document order and association with the original filename are necessary, but currently absent from TikaIO.
> 
> Association with original file:
> Sergey - Beam does not automatically provide a way to associate an element with the file it originated from: automatically tracking data provenance is a known very hard research problem on which many papers have been written, and obvious solutions are very easy to break. See related discussion at https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E .
> 
> If you want the elements of your PCollection to contain additional information, you need the elements themselves to contain this information: the elements are self-contained and have no metadata associated with them (beyond the timestamp and windows, universal to the whole Beam model).
> 
> Order within a file:
> The only way to have any kind of order within a PCollection is to have the elements of the PCollection contain something ordered, e.g. have a PCollection<List<Something>>, where each List is for one file [I'm assuming Tika, at a low level, works on a per-file basis?]. However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea. Because of this, I don't think the result of applying Tika to a single file can be encoded as a PCollection element.
> 
> Given both of these, I think that it's not possible to create a general-purpose TikaIO transform that will be better than manual invocation of Tika as a DoFn on the result of FileIO.readMatches().
> 
> However, looking at the examples at https://tika.apache.org/1.16/examples.html - almost all of the examples involve extracting a single String from each document. This use case, with the assumption that individual documents are small enough, can certainly be simplified and TikaIO could be a facade for doing just this.
> 
> E.g. TikaIO could:
> - take as input a PCollection<ReadableFile>
> - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata }
> - be configured by: a Parser (it implements Serializable so can be specified at pipeline construction time) and a ContentHandler whose toString() will go into "content". ContentHandler does not implement Serializable, so you can not specify it at construction time - however, you can let the user specify either its class (if it's a simple handler like a BodyContentHandler) or specify a lambda for creating the handler (SerializableFunction<Void, ContentHandler>), and potentially you can have a simpler facade for Tika.parseAsString() - e.g. call it TikaIO.parseAllAsStrings().
> 
> Example usage would look like:
> 
>    PCollection<KV<String, ParseResult>> parseResults = p.apply(FileIO.match().filepattern(...))
>      .apply(FileIO.readMatches())
>      .apply(TikaIO.parseAllAsStrings())
> 
> or:
> 
>      .apply(TikaIO.parseAll()
>          .withParser(new AutoDetectParser())
>          .withContentHandler(() -> new BodyContentHandler(new ToXMLContentHandler())))
> 
> You could also have shorthands for letting the user avoid using FileIO directly in simple cases, for example:
>      p.apply(TikaIO.parseAsStrings().from(filepattern))
> 
> This would of course be implemented as a ParDo or even MapElements, and you'll be able to share the code between parseAll and regular parse.
> 
> On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sb...@gmail.com>> wrote:
> Hi Tim
> On 21/09/17 14:33, Allison, Timothy B. wrote:
>> Thank you, Sergey.
>>
>> My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a chance to work with it yet.
>>
>>   From my perspective, if I understand this thread (and I may not!), getting unordered text from _a given file_ is a non-starter for most applications.  The implementation needs to guarantee order per file, and the user has to be able to link the "extract" back to a unique identifier for the document.  If the current implementation doesn't do those things, we need to change it, IMHO.
>>
> Right now the Tika-related reader does not associate a given text fragment
> with the file name, so a function looking at some text and trying to
> find where it came from won't be able to do so.
> 
> So I asked how to do it in Beam, how to attach some context to the given
> piece of data. I hope it can be done and if not - then perhaps some
> improvement can be applied.
> 
> Re the unordered text - yes - this is what we currently have with Beam +
> TikaIO :-).
> 
> The use-case I referred to earlier in this thread (upload PDFs - save
> the possibly unordered text to Lucene with the file name 'attached', let
> users search for the files containing some words - phrases, this works
> OK given that I can see PDF parser for ex reporting the lines) can be
> supported OK with the current TikaIO (provided we find a way to 'attach'
> a file name to the flow).
> 
> I see though supporting the total ordering can be a big deal in other
> cases. Eugene, can you please explain how it can be done, is it
> achievable in principle, without the users having to do some custom
> coding ?
> 
>> To the question of -- why is this in Beam at all; why don't we let users call it if they want it?...
>>
>> No matter how much we do to Tika, it will behave badly sometimes -- permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks using Beam -- folks likely with large batches of unruly/noisy documents -- are more likely to run into these problems than your average couple-of-thousand-docs-from-our-own-company user. So, if there are things we can do in Beam to prevent developers around the world from having to reinvent the wheel for defenses against these problems, then I'd be enormously grateful if we could put Tika into Beam.  That means:
>>
>> 1) a process-level timeout (because you can't actually kill a thread in Java)
>> 2) a process-level restart on OOM
>> 3) avoid trying to reprocess a badly behaving document
>>
>> If Beam automatically handles those problems, then I'd say, y, let users write their own code.  If there is so much as a single configuration knob (and it sounds like Beam is against complex configuration...yay!) to get that working in Beam, then I'd say, please integrate Tika into Beam.  From a safety perspective, it is critical to keep the extraction process entirely separate (jvm, vm, m, rack, data center!) from the transformation+loading steps.  IMHO, very few devs realize this because Tika works well lots of the time...which is why it is critical for us to make it easy for people to get it right all of the time.
>>
>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch mode first in one jvm, and then I kick off another process to do transform/loading into Lucene/Solr from the .json files that Tika generates for each input file.  If I were to scale up, I'd want to maintain this complete separation of steps.
>>
>> Apologies if I've derailed the conversation or misunderstood this thread.
>>
> Major thanks for your input :-)
> 
> Cheers, Sergey
> 
>> Cheers,
>>
>>                  Tim
>>
>> -----Original Message-----
>> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com<ma...@gmail.com>]
>> Sent: Thursday, September 21, 2017 9:07 AM
>> To: dev@beam.apache.org<ma...@beam.apache.org>
>> Cc: Allison, Timothy B. <ta...@mitre.org>>
>> Subject: Re: TikaIO concerns
>>
>> Hi All
>>
>> Please welcome Tim, one of Apache Tika leads and practitioners.
>>
>> Tim, thanks for joining in :-). If you have some great Apache Tika stories to share (preferably involving cases where the order in which consumers dealt with the Tika-produced data did not really matter)
>> then please do so :-).
>>
>> At the moment, even though Tika ContentHandler will emit the ordered data, the Beam runtime will have no guarantees that the downstream pipeline components will see the data coming in the right order.
>>
>> (FYI, I understand from the earlier comments that the total ordering is also achievable but would require the extra API support)
>>
>> Other comments would be welcome too
>>
>> Thanks, Sergey
>>
>> On 21/09/17 10:55, Sergey Beryozkin wrote:
>>> I noticed that the PDF and ODT parsers actually split by lines, not
>>> individual words, and I'm nearly 100% sure I saw Tika reporting individual
>>> lines when it was parsing the text files. The 'min text length'
>>> feature can help with reporting several lines at a time, etc...
>>>
>>> I'm working with this PDF all the time:
>>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
>>>
>>> try it too if you get a chance.
>>>
>>> (and I can imagine not all PDFs/etc representing the 'story' but can
>>> be for ex a log-like content too)
>>>
>>> That said, I don't know how a parser for the format N will behave, it
>>> depends on the individual parsers.
>>>
>>> IMHO it's an equal candidate alongside Text-based bounded IOs...
>>>
>>> I'd like to know though how to make a file name available to the
>>> pipeline which is working with the current text fragment ?
>>>
>>> Going to try and do some measurements and compare the sync vs async
>>> parsing modes...
>>>
>>> Asked the Tika team to support with some more examples...
>>>
>>> Cheers, Sergey
>>> On 20/09/17 22:17, Sergey Beryozkin wrote:
>>>> Hi,
>>>>
>>>> thanks for the explanations,
>>>>
>>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>>>> Hi!
>>>>>
>>>>> TextIO returns an unordered soup of lines contained in all files you
>>>>> ask it to read. People usually use TextIO for reading files where 1
>>>>> line corresponds to 1 independent data element, e.g. a log entry, or
>>>>> a row of a CSV file - so discarding order is ok.
>>>> Just a side note, I'd probably want that be ordered, though I guess
>>>> it depends...
>>>>> However, there is a number of cases where TextIO is a poor fit:
>>>>> - Cases where discarding order is not ok - e.g. if you're doing
>>>>> natural language processing and the text files contain actual prose,
>>>>> where you need to process a file as a whole. TextIO can't do that.
>>>>> - Cases where you need to remember which file each element came
>>>>> from, e.g.
>>>>> if you're creating a search index for the files: TextIO can't do
>>>>> this either.
>>>>>
>>>>> Both of these issues have been raised in the past against TextIO;
>>>>> however it seems that the overwhelming majority of users of TextIO
>>>>> use it for logs or CSV files or alike, so solving these issues has
>>>>> not been a priority.
>>>>> Currently they are solved in a general form via FileIO.read() which
>>>>> gives you access to reading a full file yourself - people who want
>>>>> more flexibility will be able to use standard Java text-parsing
>>>>> utilities on a ReadableFile, without involving TextIO.
>>>>>
>>>>> Same applies for XmlIO: it is specifically designed for the narrow
>>>>> use case where the files contain independent data entries, so
>>>>> returning an unordered soup of them, with no association to the
>>>>> original file, is the user's intention. XmlIO will not work for
>>>>> processing more complex XML files that are not simply a sequence of
>>>>> entries with the same tag, and it also does not remember the
>>>>> original filename.
>>>>>
>>>>
>>>> OK...
>>>>
>>>>> However, if my understanding of Tika use cases is correct, it is
>>>>> mainly used for extracting content from complex file formats - for
>>>>> example, extracting text and images from PDF files or Word
>>>>> documents. I believe this is the main difference between it and
>>>>> TextIO - people usually use Tika for complex use cases where the
>>>>> "unordered soup of stuff" abstraction is not useful.
>>>>>
>>>>> My suspicion about this is confirmed by the fact that the crux of
>>>>> the Tika API is ContentHandler
>>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true
>>>>>
>>>>> whose
>>>>> documentation says "The order of events in this interface is very
>>>>> important, and mirrors the order of information in the document itself."
>>>> All that says is that a (Tika) ContentHandler will be a true SAX
>>>> ContentHandler...
>>>>>
>>>>> Let me give a few examples of what I think is possible with the raw
>>>>> Tika API, but I think is not currently possible with TikaIO - please
>>>>> correct me where I'm wrong, because I'm not particularly familiar
>>>>> with Tika and am judging just based on what I read about it.
>>>>> - User has 100,000 Word documents and wants to convert each of them
>>>>> to text files for future natural language processing.
>>>>> - User has 100,000 PDF files with financial statements, each
>>>>> containing a bunch of unrelated text and - the main content - a list
>>>>> of transactions in PDF tables. User wants to extract each
>>>>> transaction as a PCollection element, discarding the unrelated text.
>>>>> - User has 100,000 PDF files with scientific papers, and wants to
>>>>> extract text from them, somehow parse author and affiliation from
>>>>> the text, and compute statistics of topics and terminology usage by
>>>>> author name and affiliation.
>>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>>>>> observing a location over time: they want to extract metadata from
>>>>> each image using Tika, analyze the images themselves using some
>>>>> other library, and detect anomalies in the overall appearance of the
>>>>> location over time as seen from multiple cameras.
>>>>> I believe all of these cases can not be solved with TikaIO because
>>>>> the resulting PCollection<String> contains no information about
>>>>> which String comes from which document and about the order in which
>>>>> they appear in the document.
>>>> These are good use cases, thanks... I thought you were talking
>>>> about the unordered soup of data produced by TikaIO (and its friends
>>>> TextIO and alike :-)).
>>>> Putting the ordered vs unordered question aside for a sec, why
>>>> exactly a Tika Reader can not make the name of the file it's
>>>> currently reading from available to the pipeline, as some Beam pipeline metadata piece ?
>>>> Surely it must be possible with Beam ? If not then I would be surprised...
>>>>
>>>>>
>>>>> I am, honestly, struggling to think of a case where I would want to
>>>>> use Tika, but where I *would* be ok with getting an unordered soup
>>>>> of strings.
>>>>> So some examples would be very helpful.
>>>>>
>>>> Yes. I'll ask Tika developers to help with some examples, but I'll
>>>> give one example where it did not matter to us in what order
>>>> Tika-produced data were available to the downstream layer.
>>>>
>>>> It's a demo the Apache CXF colleague of mine showed at one of Apache
>>>> Con NAs, and we had a happy audience:
>>>>
>>>> https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search
>>>>
>>>>
>>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
>>>> into Lucene. We associate a file name with the indexed content and
>>>> then let users find a list of PDF files which contain a given word or
>>>> few words, details are here
>>>> https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L131
>>>>
>>>>
>>>> I'd say even more involved search engines would not mind supporting a
>>>> case like that :-)
>>>>
>>>> Now there we process one file at a time, and I understand now that
>>>> with TikaIO and N files it's all over the place really as far as the
>>>> ordering is concerned, which file it's coming from, etc. That's why
>>>> TikaReader must be able to associate the file name with a given piece
>>>> of text it's making available to the pipeline.
>>>>
>>>> I'd be happy to support the ParDo way of linking Tika with Beam.
>>>> If it makes things simpler then it would be good, I've just no idea
>>>> at the moment how to start the pipeline without using a
>>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
>>>> earlier - how can one avoid it with ParDo when implementing a 'min
>>>> len chunk' feature, where the ParDo would have to concatenate several
>>>> SAX data pieces first before making a single composite piece available to the pipeline ?
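To make the 'min len chunk' question concrete: inside a single parse of a single file the concatenation is just local buffering, so no synchronization should be needed. A rough, stdlib-only sketch (the class and method names here are mine, nothing from the actual TikaIO code):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch only: concatenates small SAX-reported text pieces
 * into composite chunks of at least minChunkLength characters. All state
 * is local to one parse of one file, so no locking is required.
 */
class ChunkConcatenator {
  private final int minChunkLength;
  private final StringBuilder buffer = new StringBuilder();

  ChunkConcatenator(int minChunkLength) {
    this.minChunkLength = minChunkLength;
  }

  /** Adds one SAX text piece; returns a composite chunk once it is big enough. */
  List<String> add(String piece) {
    List<String> out = new ArrayList<>();
    buffer.append(piece);
    if (buffer.length() >= minChunkLength) {
      out.add(buffer.toString());
      buffer.setLength(0);
    }
    return out;
  }

  /** Emits whatever is left at the end of the document. */
  List<String> finish() {
    List<String> out = new ArrayList<>();
    if (buffer.length() > 0) {
      out.add(buffer.toString());
      buffer.setLength(0);
    }
    return out;
  }
}
```

The same buffering would work whether the SAX callbacks are driven from a reader or from a ParDo's @ProcessElement call.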
>>>>
>>>>
>>>>> Another way to state it: currently, if I wanted to solve all of the
>>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
>>>>> API myself on the resulting ReadableFile. How can we make TikaIO
>>>>> provide a usability improvement over such usage?
>>>>>
>>>>
>>>>
>>>> If you are actually asking, does it really make sense for Beam to
>>>> ship Tika related code, given that users can just do it themselves,
>>>> I'm not sure.
>>>>
>>>> IMHO it always works better if users have to provide just few config
>>>> options to an integral part of the framework and see things happening.
>>>> It will bring more users.
>>>>
>>>> Whether the current Tika code (refactored or not) stays with Beam or
>>>> not - I'll let you and the team decide; believe it or not I was
>>>> seriously contemplating at the last moment to make it all part of the
>>>> Tika project itself and have a bit more flexibility over there with
>>>> tweaking things, but now that it is in the Beam snapshot - I don't
>>>> know - it's not my decision...
>>>>
>>>>> I am confused by your other comment - "Does the ordering matter ?
>>>>> Perhaps
>>>>> for some cases it does, and for some it does not. May be it makes
>>>>> sense to support running TikaIO as both the bounded reader/source
>>>>> and ParDo, with getting the common code reused." - because using
>>>>> BoundedReader or ParDo is not related to the ordering issue, only to
>>>>> the issue of asynchronous reading and complexity of implementation.
>>>>> The resulting PCollection will be unordered either way - this needs
>>>>> to be solved separately by providing a different API.
>>>> Right, I see now: ParDo is not about making Tika-reported data
>>>> available to the downstream pipeline components in order, only about
>>>> the simpler implementation.
>>>> Association with the file should be possible I hope, and I understand
>>>> it would also be possible to optionally make the data come out in an
>>>> ordered way...
>>>>
>>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
>>>> let me double check: should we still give some thought to the
>>>> possible performance benefit of the current approach ? As I said, I
>>>> can easily get rid of all that polling code and use a simple blocking queue.
>>>>
>>>> Cheers, Sergey
>>>>>
>>>>> Thanks.
>>>>>
>>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
>>>>> <sb...@gmail.com>>
>>>>> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> Glad TikaIO getting some serious attention :-), I believe one thing
>>>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>>>
>>>>>> Before trying to reply online, I'd like to state that my main
>>>>>> assumption is that TikaIO (as far as the read side is concerned) is
>>>>>> no different to Text, XML or similar bounded reader components.
>>>>>>
>>>>>> I have to admit I don't understand your questions about TikaIO
>>>>>> usecases.
>>>>>>
>>>>>> What are the Text Input or XML input use-cases ? These use cases
>>>>>> are Tika input cases as well; the only difference is that Tika can not
>>>>>> split an individual file into a sequence of sources, etc.
>>>>>>
>>>>>> TextIO can read from plain text files (possibly zipped), XmlIO is
>>>>>> optimized around reading from XML files, and I thought I made
>>>>>> it clear (and it is a known fact anyway) that Tika is about reading
>>>>>> basically from any file format.
>>>>>>
>>>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>>>
>>>>>> Sergey
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Replies inline.
>>>>>>>
>>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
>>>>>>> <sb...@gmail.com>>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi All
>>>>>>>>
>>>>>>>> This is my first post to the dev list, I work for Talend, I'm a
>>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
>>>>>>>> great to try and link both projects together, which led me to
>>>>>>>> opening [1] where I typed some early thoughts, followed by PR
>>>>>>>> [2].
>>>>>>>>
>>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
>>>>>>>> newer review comments from Eugene pending, so I'd like to
>>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
>>>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>>>
>>>>>>>> Apache Tika Parsers report the text content in chunks, via
>>>>>>>> SaxParser events. It's not possible with Tika to take a file and
>>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
>>>>>>>> by line, the only way is to handle the SAXParser callbacks which
>>>>>>>> report the data chunks.
>>>>>>>> Some
>>>>>>>> parsers may report the complete lines, some individual words,
>>>>>>>> with some being able to report the data only after they completely
>>>>>>>> parse the document.
>>>>>>>> All depends on the data format.
>>>>>>>>
>>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
>>>>>>>> to parse the files, Beam threads will only collect the data from
>>>>>>>> the internal queue where the internal TikaReader's thread will
>>>>>>>> put the data into (note the data chunks are ordered even though
>>>>>>>> the tests might suggest otherwise).
>>>>>>>>
>>>>>>> I agree that your implementation of reader returns records in
>>>>>>> order
>>>>>>> - but
>>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
>>>>>>> the order in which records are produced by a BoundedReader - the
>>>>>>> order produced by your reader is ignored, and when applying any
>>>>>>> transforms to the
>>>>>> PCollection
>>>>>>> produced by TikaIO, it is impossible to recover the order in which
>>>>>>> your reader returned the records.
>>>>>>>
>>>>>>> With that in mind, is PCollection<String>, containing individual
>>>>>>> Tika-detected items, still the right API for representing the
>>>>>>> result of parsing a large number of documents with Tika?
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> The reason I did it was because I thought
>>>>>>>>
>>>>>>>> 1) it would make the individual data chunks available faster to
>>>>>>>> the pipeline - the parser will continue working via the
>>>>>>>> binary/video etc file while the data will already start flowing -
>>>>>>>> I agree there should be some tests data available confirming it -
>>>>>>>> but I'm positive at the moment this approach might yield some
>>>>>>>> performance gains with the large sets. If the file is large, if
>>>>>>>> it has the embedded attachments/videos to deal with, then it may
>>>>>>>> be more effective not to have the Beam thread deal with it...
>>>>>>>>
>>>>>>>> As I said on the PR, this description contains unfounded and
>>>>>>>> potentially
>>>>>>> incorrect assumptions about how Beam runners execute (or may
>>>>>>> execute in
>>>>>> the
>>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>>>> correctly,
>>>>>>> you might be assuming that:
>>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>>>> complete
>>>>>>> before processing its outputs with downstream transforms
>>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>>>> *concurrently*
>>>>>>> with downstream processing of its results
>>>>>>> - Passing an element from one thread to another using a
>>>>>>> BlockingQueue is free in terms of performance All of these are
>>>>>>> false at least in some runners, and I'm almost certain that in
>>>>>>> reality, performance of this approach is worse than a ParDo in
>>>>>> most
>>>>>>> production runners.
>>>>>>>
>>>>>>> There are other disadvantages to this approach:
>>>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>>>> invisible
>>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>>>>> profiling capabilities, or the ability to get the current stack
>>>>>>> trace for stuck elements, this approach would make the real
>>>>>>> processing invisible to all of these capabilities, and a user
>>>>>>> would only see that the bulk of the time is spent waiting for the
>>>>>>> next element, but not *why* the next
>>>>>> element
>>>>>>> is taking long to compute.
>>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>>>>> invisible to Beam, will make it harder for runners to do
>>>>>>> autoscaling, binpacking
>>>>>> and
>>>>>>> other resource management magic (how much of this runners actually
>>>>>>> do is
>>>>>> a
>>>>>>> separate issue), because the runner will have no way of knowing
>>>>>>> how much CPU/IO this particular transform is actually using - all
>>>>>>> the processing happens in a thread about which the runner is
>>>>>>> unaware.
>>>>>>> - As far as I can tell, the code also hides exceptions that happen
>>>>>>> in the Tika thread
>>>>>>> - Adding the thread management makes the code much more complex,
>>>>>>> easier
>>>>>> to
>>>>>>> introduce bugs, and harder for others to contribute
>>>>>>>
>>>>>>>
>>>>>>>> 2) As I commented at the end of [2], having an option to
>>>>>>>> concatenate the data chunks first before making them available to
>>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
>>>>>>>> introduce some synchronization issues (though not exactly sure
>>>>>>>> yet)
>>>>>>>>
>>>>>>> What are these issues?
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> One of valid concerns there is that the reader is polling the
>>>>>>>> internal queue so, in theory at least, and perhaps in some rare
>>>>>>>> cases too, we may have a case where the max polling time has been
>>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
>>>>>>>> the file data. I think that it can be solved by either 2a)
>>>>>>>> configuring the max polling time to a very large number which
>>>>>>>> will never be reached for a practical case, or
>>>>>>>> 2b) simply use a blocking queue without the time limits - in the
>>>>>>>> worst case, if TikaParser spins and fails to report the end of
>>>>>>>> the document, then Beam can heal itself if the pipeline blocks.
>>>>>>>> I propose to follow 2b).
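Concretely, option 2b) could look like the following stdlib-only sketch (names are hypothetical, not the actual TikaReader code): the producer always enqueues a sentinel when the document ends, so the consumer can block on take() with no max polling time and cannot silently lose the tail of a file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

/**
 * Sketch of option 2b): hand off parsed text chunks through a blocking
 * queue, terminated by a sentinel ("poison pill") instead of relying on
 * polling timeouts that could expire while the parser is still busy.
 */
class QueueHandoff {
  static final String END_OF_DOCUMENT = "\u0000EOD"; // sentinel, never valid content

  static void produce(BlockingQueue<String> queue, String... chunks)
      throws InterruptedException {
    for (String chunk : chunks) {
      queue.put(chunk);          // blocks if a bounded queue is full
    }
    queue.put(END_OF_DOCUMENT);  // always signal completion (e.g. in a finally block)
  }

  static List<String> consume(BlockingQueue<String> queue) throws InterruptedException {
    List<String> out = new ArrayList<>();
    while (true) {
      String chunk = queue.take(); // blocks until data arrives - no timeout needed
      if (END_OF_DOCUMENT.equals(chunk)) {
        return out;
      }
      out.add(chunk);
    }
  }
}
```

If the parser hangs and never enqueues the sentinel, the consumer blocks, which is the "pipeline blocks and Beam can heal itself" behaviour described above.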
>>>>>>>>
>>>>>>> I agree that there should be no way to unintentionally configure
>>>>>>> the transform in a way that will produce silent data loss. Another
>>>>>>> reason for not having these tuning knobs is that it goes against
>>>>>>> Beam's "no knobs"
>>>>>>> philosophy, and that in most cases users have no way of figuring
>>>>>>> out a
>>>>>> good
>>>>>>> value for tuning knobs except for manual experimentation, which is
>>>>>>> extremely brittle and typically gets immediately obsoleted by
>>>>>>> running on
>>>>>> a
>>>>>>> new dataset or updating a version of some of the involved
>>>>>>> dependencies
>>>>>> etc.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Please let me know what you think.
>>>>>>>> My plan so far is:
>>>>>>>> 1) start addressing most of Eugene's comments which would require
>>>>>>>> some minor TikaIO updates
>>>>>>>> 2) work on removing the TikaSource internal code dealing with
>>>>>>>> File patterns which I copied from TextIO at the next stage
>>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
>>>>>>>> users some time to try it with some real complex files and also
>>>>>>>> decide if TikaIO can continue implemented as a
>>>>>>>> BoundedSource/Reader or not
>>>>>>>>
>>>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>>>
>>>>>>> Yes, but I think we should start by discussing the anticipated use
>>>>>>> cases
>>>>>> of
>>>>>>> TikaIO and designing an API for it based on those use cases; and
>>>>>>> then see what's the best implementation for that particular API
>>>>>>> and set of anticipated use cases.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks, Sergey
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>

Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Tim

Sorry for getting into the RecursiveParserWrapper discussion first, I 
was certain the time zone difference was on my side :-)

How will it work now, with new Metadata() passed to the AutoDetect 
parser, will this Metadata have a Metadata value per every attachment, 
possibly keyed by a name ?

Thanks, Sergey
On 22/09/17 12:58, Allison, Timothy B. wrote:
> @Timothy: can you tell more about this RecursiveParserWrapper? Is this something that the user can configure by specifying the Parser on TikaIO if they so wish?
> 
> Not at the moment, we’d have to do some coding on our end or within Beam.  The format is a list of maps/dicts for each file.  Each map contains all of the metadata, with one key reserved for the content.  If a file has no attachments, the list has length 1; otherwise there’s a map for each embedded file.  Unlike our legacy xhtml, this format maintains metadata for attachments.
> 
> The downside to this extract format is that it requires a full parse of the document and all data to be held in-memory before writing it.  On the other hand, while Tika tries to be streaming, and that was one of the critical early design goals, for some file formats, we simply have to parse the whole thing before we can have any output.
> 
> So, y, large files are a problem. :\
> 
> Example with purely made-up keys representing a pdf file containing an RTF attachment
> [
> {
>     Name : “container file”,
>     Author: “Chris Mattmann”,
>     Content: “Four score and seven years ago…”,
>     Content-type: “application/pdf”
>    …
> },
> {
>    Name : “embedded file1”
>    Author: “Nick Burch”,
>    Content: “When in the course of human events…”,
>    Content-type: “application/rtf”
> }
> ]
> 
> From: Eugene Kirpichov [mailto:kirpichov@google.com]
> Sent: Thursday, September 21, 2017 7:42 PM
> To: Allison, Timothy B. <ta...@mitre.org>; dev@beam.apache.org
> Cc: dev@tika.apache.org
> Subject: Re: TikaIO concerns
> 
> Hi,
> @Sergey:
> - I already marked TikaIO @Experimental, so we can make changes.
> - Yes, the String in KV<String, ParseResult> is the filename. I guess we could alternatively put it into ParseResult - don't have a strong opinion.
> 
> @Chris: unorderedness of Metadata would have helped if we extracted each Metadata item into a separate PCollection element, but that's not what we want to do (we want to have an element per document instead).
> 
> @Timothy: can you tell more about this RecursiveParserWrapper? Is this something that the user can configure by specifying the Parser on TikaIO if they so wish?
> 
> On Thu, Sep 21, 2017 at 2:23 PM Allison, Timothy B. <ta...@mitre.org>> wrote:
> Like Sergey, it’ll take me some time to understand your recommendations.  Thank you!
> 
> On one small point:
>> return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata }
> 
> For this option, I’d strongly encourage using the Json output from the RecursiveParserWrapper that contains metadata and content, and captures metadata even from embedded documents.
> 
>> However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea
> Large documents are a problem, no doubt about it…
> 
> From: Eugene Kirpichov [mailto:kirpichov@google.com<ma...@google.com>]
> Sent: Thursday, September 21, 2017 4:41 PM
> To: Allison, Timothy B. <ta...@mitre.org>>; dev@beam.apache.org<ma...@beam.apache.org>
> Cc: dev@tika.apache.org<ma...@tika.apache.org>
> Subject: Re: TikaIO concerns
> 
> Thanks all for the discussion. It seems we have consensus that both within-document order and association with the original filename are necessary, but currently absent from TikaIO.
> 
> Association with original file:
> Sergey - Beam does not automatically provide a way to associate an element with the file it originated from: automatically tracking data provenance is a known very hard research problem on which many papers have been written, and obvious solutions are very easy to break. See related discussion at https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E .
> 
> If you want the elements of your PCollection to contain additional information, you need the elements themselves to contain this information: the elements are self-contained and have no metadata associated with them (beyond the timestamp and windows, universal to the whole Beam model).
> 
> Order within a file:
> The only way to have any kind of order within a PCollection is to have the elements of the PCollection contain something ordered, e.g. have a PCollection<List<Something>>, where each List is for one file [I'm assuming Tika, at a low level, works on a per-file basis?]. However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea. Because of this, I don't think the result of applying Tika to a single file can be encoded as a PCollection element.
> 
> Given both of these, I think that it's not possible to create a general-purpose TikaIO transform that will be better than manual invocation of Tika as a DoFn on the result of FileIO.readMatches().
> 
> However, looking at the examples at https://tika.apache.org/1.16/examples.html - almost all of the examples involve extracting a single String from each document. This use case, with the assumption that individual documents are small enough, can certainly be simplified and TikaIO could be a facade for doing just this.
> 
> E.g. TikaIO could:
> - take as input a PCollection<ReadableFile>
> - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata }
> - be configured by: a Parser (it implements Serializable so can be specified at pipeline construction time) and a ContentHandler whose toString() will go into "content". ContentHandler does not implement Serializable, so you can not specify it at construction time - however, you can let the user specify either its class (if it's a simple handler like a BodyContentHandler) or specify a lambda for creating the handler (SerializableFunction<Void, ContentHandler>), and potentially you can have a simpler facade for Tika.parseAsString() - e.g. call it TikaIO.parseAllAsStrings().
> 
> Example usage would look like:
> 
>    PCollection<KV<String, ParseResult>> parseResults = p.apply(FileIO.match().filepattern(...))
>      .apply(FileIO.readMatches())
>      .apply(TikaIO.parseAllAsStrings())
> 
> or:
> 
>      .apply(TikaIO.parseAll()
>          .withParser(new AutoDetectParser())
>          .withContentHandler(() -> new BodyContentHandler(new ToXMLContentHandler())))
> 
> You could also have shorthands for letting the user avoid using FileIO directly in simple cases, for example:
>      p.apply(TikaIO.parseAsStrings().from(filepattern))
> 
> This would of course be implemented as a ParDo or even MapElements, and you'll be able to share the code between parseAll and regular parse.
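For illustration, parseAll could be sketched roughly as the following DoFn. This is a non-compiled sketch only: ParseResult and the configuration shape are the hypothetical ones proposed in this thread, and the Beam/Tika method names may differ in detail.

```java
// Rough sketch, not compiled against the real APIs; ParseResult is the
// hypothetical { String content, Metadata metadata } class from above.
class ParseAllFn extends DoFn<FileIO.ReadableFile, KV<String, ParseResult>> {
  private final Parser parser; // Serializable, configured at pipeline construction
  private final SerializableFunction<Void, ContentHandler> handlerFactory;

  ParseAllFn(Parser parser, SerializableFunction<Void, ContentHandler> handlerFactory) {
    this.parser = parser;
    this.handlerFactory = handlerFactory;
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    FileIO.ReadableFile file = c.element();
    ContentHandler handler = handlerFactory.apply(null); // fresh handler per document
    Metadata metadata = new Metadata();
    try (InputStream is = Channels.newInputStream(file.open())) {
      parser.parse(is, handler, metadata, new ParseContext());
    }
    // The filename travels with the content, answering the association question.
    c.output(KV.of(file.getMetadata().resourceId().toString(),
                   new ParseResult(handler.toString(), metadata)));
  }
}
```

Because each @ProcessElement call parses one whole file synchronously, within-document order is preserved inside ParseResult.content, and there is no extra thread for the runner to be unaware of.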
> 
> On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sb...@gmail.com>> wrote:
> Hi Tim
> On 21/09/17 14:33, Allison, Timothy B. wrote:
>> Thank you, Sergey.
>>
>> My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a chance to work with it yet.
>>
>>   From my perspective, if I understand this thread (and I may not!), getting unordered text from _a given file_ is a non-starter for most applications.  The implementation needs to guarantee order per file, and the user has to be able to link the "extract" back to a unique identifier for the document.  If the current implementation doesn't do those things, we need to change it, IMHO.
>>
> Right now Tika-related reader does not associate a given text fragment
> with the file name, so a function looking at some text and trying to
> find where it came from won't be able to do so.
> 
> So I asked how to do it in Beam, how to attach some context to the given
> piece of data. I hope it can be done and if not - then perhaps some
> improvement can be applied.
> 
> Re the unordered text - yes - this is what we currently have with Beam +
> TikaIO :-).
> 
> The use-case I referred to earlier in this thread (upload PDFs - save
> the possibly unordered text to Lucene with the file name 'attached', let
> users search for the files containing some words - phrases, this works
> OK given that I can see the PDF parser, for example, reporting the lines) can be
> supported OK with the current TikaIO (provided we find a way to 'attach'
> a file name to the flow).
> 
> I see though supporting the total ordering can be a big deal in other
> cases. Eugene, can you please explain how it can be done, is it
> achievable in principle, without the users having to do some custom
> coding ?
> 
>> To the question of -- why is this in Beam at all; why don't we let users call it if they want it?...
>>
>> No matter how much we do to Tika, it will behave badly sometimes -- permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks using Beam -- folks likely with large batches of unruly/noisy documents -- are more likely to run into these problems than your average couple-of-thousand-docs-from-our-own-company user. So, if there are things we can do in Beam to prevent developers around the world from having to reinvent the wheel for defenses against these problems, then I'd be enormously grateful if we could put Tika into Beam.  That means:
>>
>> 1) a process-level timeout (because you can't actually kill a thread in Java)
>> 2) a process-level restart on OOM
>> 3) avoid trying to reprocess a badly behaving document
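Point 1) can be sketched independently of Beam with nothing but the JDK: run the extraction in a child process and force-kill it if it hangs. The command line below is a placeholder, not how Tika would actually be launched.

```java
import java.util.concurrent.TimeUnit;

/**
 * Sketch of a process-level timeout: since a stuck Java thread cannot be
 * killed reliably, run the risky work in a separate process and
 * destroyForcibly() it (the kill -9 equivalent) when the timeout expires.
 */
class ProcessTimeout {
  /** Returns true if the process finished within the timeout; kills it otherwise. */
  static boolean runWithTimeout(long timeoutMillis, String... command) throws Exception {
    Process process = new ProcessBuilder(command).start();
    if (process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
      return true;
    }
    process.destroyForcibly(); // the only reliable way to stop a truly stuck parse
    process.waitFor();         // reap the killed child
    return false;
  }
}
```

A real harness would also watch for OOM exit codes and record the offending file so it is not reprocessed (points 2 and 3).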
>>
>> If Beam automatically handles those problems, then I'd say, y, let users write their own code.  If there is so much as a single configuration knob (and it sounds like Beam is against complex configuration...yay!) to get that working in Beam, then I'd say, please integrate Tika into Beam.  From a safety perspective, it is critical to keep the extraction process entirely separate (jvm, vm, m, rack, data center!) from the transformation+loading steps.  IMHO, very few devs realize this because Tika works well lots of the time...which is why it is critical for us to make it easy for people to get it right all of the time.
>>
>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch mode first in one jvm, and then I kick off another process to do transform/loading into Lucene/Solr from the .json files that Tika generates for each input file.  If I were to scale up, I'd want to maintain this complete separation of steps.
>>
>> Apologies if I've derailed the conversation or misunderstood this thread.
>>
> Major thanks for your input :-)
> 
> Cheers, Sergey
> 
>> Cheers,
>>
>>                  Tim
>>
>> -----Original Message-----
>> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com<ma...@gmail.com>]
>> Sent: Thursday, September 21, 2017 9:07 AM
>> To: dev@beam.apache.org<ma...@beam.apache.org>
>> Cc: Allison, Timothy B. <ta...@mitre.org>>
>> Subject: Re: TikaIO concerns
>>
>> Hi All
>>
>> Please welcome Tim, one of Apache Tika leads and practitioners.
>>
>> Tim, thanks for joining in :-). If you have some great Apache Tika stories to share (preferably cases where the order in which Tika-produced data were dealt with by the consumers did not really matter) then please do so :-).
>>
>> At the moment, even though Tika ContentHandler will emit the ordered data, the Beam runtime will have no guarantees that the downstream pipeline components will see the data coming in the right order.
>>
>> (FYI, I understand from the earlier comments that the total ordering is also achievable but would require the extra API support)
>>
>> Other comments would be welcome too
>>
>> Thanks, Sergey
>>
>> On 21/09/17 10:55, Sergey Beryozkin wrote:
>>> I noticed that the PDF and ODT parsers actually split by lines, not
>>> individual words, and I'm nearly 100% sure I saw Tika reporting individual
>>> lines when it was parsing text files. The 'min text length'
>>> feature can help with reporting several lines at a time, etc...
>>>
>>> I'm working with this PDF all the time:
>>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
>>>
>>> try it too if you get a chance.
>>>
>>> (and I can imagine not all PDFs/etc representing the 'story' but can
>>> be for ex a log-like content too)
>>>
>>> That said, I don't know how a parser for the format N will behave, it
>>> depends on the individual parsers.
>>>
>>> IMHO it's an equal candidate alongside Text-based bounded IOs...
>>>
>>> I'd like to know, though, how to make the file name available to the
>>> part of the pipeline which is working with the current text fragment ?
>>>
>>> Going to try and do some measurements and compare the sync vs async
>>> parsing modes...
>>>
>>> Asked the Tika team to support with some more examples...
>>>
>>> Cheers, Sergey
>>> On 20/09/17 22:17, Sergey Beryozkin wrote:
>>>> Hi,
>>>>
>>>> thanks for the explanations,
>>>>
>>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>>>> Hi!
>>>>>
>>>>> TextIO returns an unordered soup of lines contained in all files you
>>>>> ask it to read. People usually use TextIO for reading files where 1
>>>>> line corresponds to 1 independent data element, e.g. a log entry, or
>>>>> a row of a CSV file - so discarding order is ok.
>>>> Just a side note, I'd probably want that to be ordered, though I guess
>>>> it depends...
>>>>> However, there is a number of cases where TextIO is a poor fit:
>>>>> - Cases where discarding order is not ok - e.g. if you're doing
>>>>> natural language processing and the text files contain actual prose,
>>>>> where you need to process a file as a whole. TextIO can't do that.
>>>>> - Cases where you need to remember which file each element came
>>>>> from, e.g.
>>>>> if you're creating a search index for the files: TextIO can't do
>>>>> this either.
>>>>>
>>>>> Both of these issues have been raised in the past against TextIO;
>>>>> however it seems that the overwhelming majority of users of TextIO
>>>>> use it for logs or CSV files or alike, so solving these issues has
>>>>> not been a priority.
>>>>> Currently they are solved in a general form via FileIO.read() which
>>>>> gives you access to reading a full file yourself - people who want
>>>>> more flexibility will be able to use standard Java text-parsing
>>>>> utilities on a ReadableFile, without involving TextIO.
>>>>>
>>>>> Same applies for XmlIO: it is specifically designed for the narrow
>>>>> use case where the files contain independent data entries, so
>>>>> returning an unordered soup of them, with no association to the
>>>>> original file, is the user's intention. XmlIO will not work for
>>>>> processing more complex XML files that are not simply a sequence of
>>>>> entries with the same tag, and it also does not remember the
>>>>> original filename.
>>>>>
>>>>
>>>> OK...
>>>>
>>>>> However, if my understanding of Tika use cases is correct, it is
>>>>> mainly used for extracting content from complex file formats - for
>>>>> example, extracting text and images from PDF files or Word
>>>>> documents. I believe this is the main difference between it and
>>>>> TextIO - people usually use Tika for complex use cases where the
>>>>> "unordered soup of stuff" abstraction is not useful.
>>>>>
>>>>> My suspicion about this is confirmed by the fact that the crux of
>>>>> the Tika API is ContentHandler
>>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true
>>>>>
>>>>> whose
>>>>> documentation says "The order of events in this interface is very
>>>>> important, and mirrors the order of information in the document itself."
>>>> All that says is that a (Tika) ContentHandler will be a true SAX
>>>> ContentHandler...
>>>>>
>>>>> Let me give a few examples of what I think is possible with the raw
>>>>> Tika API, but I think is not currently possible with TikaIO - please
>>>>> correct me where I'm wrong, because I'm not particularly familiar
>>>>> with Tika and am judging just based on what I read about it.
>>>>> - User has 100,000 Word documents and wants to convert each of them
>>>>> to text files for future natural language processing.
>>>>> - User has 100,000 PDF files with financial statements, each
>>>>> containing a bunch of unrelated text and - the main content - a list
>>>>> of transactions in PDF tables. User wants to extract each
>>>>> transaction as a PCollection element, discarding the unrelated text.
>>>>> - User has 100,000 PDF files with scientific papers, and wants to
>>>>> extract text from them, somehow parse author and affiliation from
>>>>> the text, and compute statistics of topics and terminology usage by
>>>>> author name and affiliation.
>>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>>>>> observing a location over time: they want to extract metadata from
>>>>> each image using Tika, analyze the images themselves using some
>>>>> other library, and detect anomalies in the overall appearance of the
>>>>> location over time as seen from multiple cameras.
>>>>> I believe all of these cases can not be solved with TikaIO because
>>>>> the resulting PCollection<String> contains no information about
>>>>> which String comes from which document and about the order in which
>>>>> they appear in the document.
>>>> These are good use cases, thanks... I thought you were talking
>>>> about the unordered soup of data produced by TikaIO (and its friends
>>>> TextIO and alike :-)).
>>>> Putting the ordered vs unordered question aside for a sec, why
>>>> exactly can a Tika Reader not make the name of the file it's
>>>> currently reading from available to the pipeline, as some piece of Beam pipeline metadata ?
>>>> Surely it must be possible with Beam ? If not then I would be surprised...
>>>>
>>>>>
>>>>> I am, honestly, struggling to think of a case where I would want to
>>>>> use Tika, but where I *would* be ok with getting an unordered soup
>>>>> of strings.
>>>>> So some examples would be very helpful.
>>>>>
>>>> Yes. I'll ask Tika developers to help with some examples, but I'll
>>>> give one example where it did not matter to us in what order
>>>> Tika-produced data were available to the downstream layer.
>>>>
>>>> It's a demo the Apache CXF colleague of mine showed at one of the
>>>> ApacheCon NAs, and we had a happy audience:
>>>>
>>>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea
>>>> se/samples/jax_rs/search
>>>>
>>>>
>>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
>>>> into Lucene. We associate a file name with the indexed content and
>>>> then let users find a list of PDF files which contain a given word or
>>>> few words, details are here
>>>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea
>>>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal
>>>> og.java#L131
>>>>
>>>>
>>>> I'd say even more involved search engines would not mind supporting a
>>>> case like that :-)
>>>>
>>>> Now there we process one file at a time, and I understand now that
>>>> with TikaIO and N files it's all over the place really as far as the
>>>> ordering is concerned, which file a given chunk is coming from, etc. That's why
>>>> the TikaReader must be able to associate the file name with a given piece
>>>> of text it's making available to the pipeline.
>>>>
>>>> I'd be happy to support the ParDo way of linking Tika with Beam.
>>>> If it makes things simpler then it would be good; I've just no idea
>>>> at the moment how to start the pipeline without using a
>>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
>>>> earlier - how can one avoid it with ParDo when implementing a 'min
>>>> len chunk' feature, where the ParDo would have to concatenate several
>>>> SAX data pieces first before making a single composite piece available to the pipeline ?
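The 'min len chunk' concatenation described above can be prototyped without any Tika-specific machinery, since org.xml.sax ships with the JDK. A minimal sketch (the class name, the flush-at-threshold policy, and the flush-on-endDocument behaviour are illustrative assumptions, not TikaIO code):

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

/** Buffers SAX characters() events and emits chunks of at least minLen chars. */
public class MinChunkHandler extends DefaultHandler {
    private final int minLen;
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> chunks = new ArrayList<>();

    public MinChunkHandler(int minLen) { this.minLen = minLen; }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        if (buffer.length() >= minLen) flush();
    }

    @Override
    public void endDocument() { flush(); } // emit any trailing partial chunk

    private void flush() {
        if (buffer.length() > 0) {
            chunks.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    public List<String> getChunks() { return chunks; }

    public static void main(String[] args) {
        MinChunkHandler h = new MinChunkHandler(10);
        for (String piece : new String[] {"Four ", "score ", "and seven ", "years"}) {
            h.characters(piece.toCharArray(), 0, piece.length());
        }
        h.endDocument();
        System.out.println(h.getChunks()); // small pieces were merged into >= 10-char chunks
    }
}
```

Because the accumulation happens inside the handler itself, no cross-thread synchronization is needed: the concatenation runs on whatever thread drives the SAX callbacks.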
>>>>
>>>>
>>>>> Another way to state it: currently, if I wanted to solve all of the
>>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
>>>>> API myself on the resulting ReadableFile. How can we make TikaIO
>>>>> provide a usability improvement over such usage?
>>>>>
>>>>
>>>>
>>>> If you are actually asking, does it really make sense for Beam to
>>>> ship Tika related code, given that users can just do it themselves,
>>>> I'm not sure.
>>>>
>>>> IMHO it always works better if users have to provide just a few config
>>>> options to an integral part of the framework and see things happening.
>>>> It will bring more users.
>>>>
>>>> Whether the current Tika code (refactored or not) stays with Beam or
>>>> not - I'll let you and the team decide; believe it or not I was
>>>> seriously contemplating at the last moment to make it all part of the
>>>> Tika project itself and have a bit more flexibility over there with
>>>> tweaking things, but now that it is in the Beam snapshot - I don't
>>>> know - it's not my decision...
>>>>
>>>>> I am confused by your other comment - "Does the ordering matter ?
>>>>> Perhaps
>>>>> for some cases it does, and for some it does not. May be it makes
>>>>> sense to support running TikaIO as both the bounded reader/source
>>>>> and ParDo, with getting the common code reused." - because using
>>>>> BoundedReader or ParDo is not related to the ordering issue, only to
>>>>> the issue of asynchronous reading and complexity of implementation.
>>>>> The resulting PCollection will be unordered either way - this needs
>>>>> to be solved separately by providing a different API.
>>>> Right, I see now: so ParDo is not about making Tika-reported data
>>>> available to the downstream pipeline components in order, only about
>>>> a simpler implementation.
>>>> Association with the file should be possible, I hope, and I understand
>>>> it would also be possible to optionally make the data come out in an
>>>> ordered way as well...
>>>>
>>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
>>>> let me double check: should we still give some thought to the
>>>> possible performance benefit of the current approach ? As I said, I
>>>> can easily get rid of all that polling code and use a simple BlockingQueue.
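For reference, the "simple BlockingQueue" handoff mentioned above might look like the following sketch. The end-of-stream sentinel is an assumption of mine, shown because it sidesteps the timed-polling data-loss concern discussed earlier in the thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Producer/consumer handoff with an end-of-stream sentinel instead of timed polling. */
public class ChunkQueueDemo {
    // Sentinel compared by identity (hence the explicit new String).
    public static final String END = new String("<eof>");

    public static List<String> drain(BlockingQueue<String> queue) throws InterruptedException {
        List<String> out = new ArrayList<>();
        String chunk;
        while ((chunk = queue.take()) != END) { // blocks until data or sentinel arrives
            out.add(chunk);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        Thread parser = new Thread(() -> { // stands in for the internal Tika parsing thread
            try {
                queue.put("chunk-1");
                queue.put("chunk-2");
                queue.put(END); // parser signals completion; the consumer never times out
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        parser.start();
        System.out.println(drain(queue)); // prints [chunk-1, chunk-2]
        parser.join();
    }
}
```

With a sentinel there is no "max polling time" knob to mis-tune: the consumer blocks until the producer explicitly reports the end of the document.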
>>>>
>>>> Cheers, Sergey
>>>>>
>>>>> Thanks.
>>>>>
>>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
>>>>> <sb...@gmail.com>>
>>>>> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> Glad TikaIO getting some serious attention :-), I believe one thing
>>>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>>>
>>>>>> Before trying to reply online, I'd like to state that my main
>>>>>> assumption is that TikaIO (as far as the read side is concerned) is
>>>>>> no different to Text, XML or similar bounded reader components.
>>>>>>
>>>>>> I have to admit I don't understand your questions about TikaIO
>>>>>> use cases.
>>>>>>
>>>>>> What are the TextIO or XmlIO use cases ? Those use cases are
>>>>>> Tika input cases as well; the only difference is that Tika can not
>>>>>> split an individual file into a sequence of sources/etc.
>>>>>>
>>>>>> TextIO can read from plain text files (possibly zipped), XmlIO is
>>>>>> optimized around reading from XML files, and I thought I made
>>>>>> it clear (and it is a known fact anyway) that Tika is about reading
>>>>>> from basically any file format.
>>>>>>
>>>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>>>
>>>>>> Sergey
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Replies inline.
>>>>>>>
>>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
>>>>>>> <sb...@gmail.com>>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi All
>>>>>>>>
>>>>>>>> This is my first post to the dev list, I work for Talend, I'm a
>>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
>>>>>>>> great to try and link both projects together, which led me to
>>>>>>>> opening [1] where I typed some early thoughts, followed by PR
>>>>>>>> [2].
>>>>>>>>
>>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
>>>>>>>> newer review comments from Eugene pending, so I'd like to
>>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
>>>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>>>
>>>>>>>> Apache Tika Parsers report the text content in chunks, via
>>>>>>>> SaxParser events. It's not possible with Tika to take a file and
>>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
>>>>>>>> by line, the only way is to handle the SAXParser callbacks which
>>>>>>>> report the data chunks.
>>>>>>>> Some
>>>>>>>> parsers may report complete lines, some individual words,
>>>>>>>> with some being able to report the data only after they
>>>>>>>> completely parse the document.
>>>>>>>> All depends on the data format.
>>>>>>>>
>>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
>>>>>>>> to parse the files, Beam threads will only collect the data from
>>>>>>>> the internal queue where the internal TikaReader's thread will
>>>>>>>> put the data into (note the data chunks are ordered even though
>>>>>>>> the tests might suggest otherwise).
>>>>>>>>
>>>>>>> I agree that your implementation of reader returns records in
>>>>>>> order
>>>>>>> - but
>>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
>>>>>>> the order in which records are produced by a BoundedReader - the
>>>>>>> order produced by your reader is ignored, and when applying any
>>>>>>> transforms to the
>>>>>> PCollection
>>>>>>> produced by TikaIO, it is impossible to recover the order in which
>>>>>>> your reader returned the records.
>>>>>>>
>>>>>>> With that in mind, is PCollection<String>, containing individual
>>>>>>> Tika-detected items, still the right API for representing the
>>>>>>> result of parsing a large number of documents with Tika?
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> The reason I did it was because I thought
>>>>>>>>
>>>>>>>> 1) it would make the individual data chunks available faster to
>>>>>>>> the pipeline - the parser will continue working through the
>>>>>>>> binary/video etc file while the data will already have started flowing -
>>>>>>>> I agree there should be some test data available confirming it -
>>>>>>>> but I'm positive at the moment this approach might yield some
>>>>>>>> performance gains with large sets. If the file is large, if
>>>>>>>> it has embedded attachments/videos to deal with, then it may
>>>>>>>> be more effective not to have the Beam thread deal with it...
>>>>>>>>
>>>>>>>> As I said on the PR, this description contains unfounded and
>>>>>>>> potentially
>>>>>>> incorrect assumptions about how Beam runners execute (or may
>>>>>>> execute in
>>>>>> the
>>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>>>> correctly,
>>>>>>> you might be assuming that:
>>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>>>> complete
>>>>>>> before processing its outputs with downstream transforms
>>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>>>> *concurrently*
>>>>>>> with downstream processing of its results
>>>>>>> - Passing an element from one thread to another using a
>>>>>>> BlockingQueue is free in terms of performance.
>>>>>>> All of these are
>>>>>>> false at least in some runners, and I'm almost certain that in
>>>>>>> reality, performance of this approach is worse than a ParDo in
>>>>>>> most production runners.
>>>>>>>
>>>>>>> There are other disadvantages to this approach:
>>>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>>>> invisible
>>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>>>>> profiling capabilities, or the ability to get the current stack
>>>>>>> trace for stuck elements, this approach would make the real
>>>>>>> processing invisible to all of these capabilities, and a user
>>>>>>> would only see that the bulk of the time is spent waiting for the
>>>>>>> next element, but not *why* the next
>>>>>> element
>>>>>>> is taking long to compute.
>>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>>>>> invisible to Beam, will make it harder for runners to do
>>>>>>> autoscaling, binpacking
>>>>>> and
>>>>>>> other resource management magic (how much of this runners actually
>>>>>>> do is
>>>>>> a
>>>>>>> separate issue), because the runner will have no way of knowing
>>>>>>> how much CPU/IO this particular transform is actually using - all
>>>>>>> the processing happens in a thread about which the runner is
>>>>>>> unaware.
>>>>>>> - As far as I can tell, the code also hides exceptions that happen
>>>>>>> in the Tika thread
>>>>>>> - Adding the thread management makes the code much more complex,
>>>>>>> easier
>>>>>> to
>>>>>>> introduce bugs, and harder for others to contribute
>>>>>>>
>>>>>>>
>>>>>>>> 2) As I commented at the end of [2], having an option to
>>>>>>>> concatenate the data chunks first before making them available to
>>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
>>>>>>>> introduce some synchronization issues (though not exactly sure
>>>>>>>> yet)
>>>>>>>>
>>>>>>> What are these issues?
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> One of valid concerns there is that the reader is polling the
>>>>>>>> internal queue so, in theory at least, and perhaps in some rare
>>>>>>>> cases too, we may have a case where the max polling time has been
>>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
>>>>>>>> the file data. I think that it can be solved by either 2a)
>>>>>>>> configuring the max polling time to a very large number which
>>>>>>>> will never be reached for a practical case, or
>>>>>>>> 2b) simply use a blocking queue without the time limits - in the
>>>>>>>> worst case, if TikaParser spins and fails to report the end of
>>>>>>>> the document, then Beam can heal itself if the pipeline blocks.
>>>>>>>> I propose to follow 2b).
>>>>>>>>
>>>>>>> I agree that there should be no way to unintentionally configure
>>>>>>> the transform in a way that will produce silent data loss. Another
>>>>>>> reason for not having these tuning knobs is that it goes against
>>>>>>> Beam's "no knobs"
>>>>>>> philosophy, and that in most cases users have no way of figuring
>>>>>>> out a
>>>>>> good
>>>>>>> value for tuning knobs except for manual experimentation, which is
>>>>>>> extremely brittle and typically gets immediately obsoleted by
>>>>>>> running on
>>>>>> a
>>>>>>> new dataset or updating a version of some of the involved
>>>>>>> dependencies
>>>>>> etc.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Please let me know what you think.
>>>>>>>> My plan so far is:
>>>>>>>> 1) start addressing most of Eugene's comments which would require
>>>>>>>> some minor TikaIO updates
>>>>>>>> 2) work on removing the TikaSource internal code dealing with
>>>>>>>> File patterns which I copied from TextIO at the next stage
>>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
>>>>>>>> users some time to try it with some real complex files and also
>>>>>>>> decide if TikaIO can continue implemented as a
>>>>>>>> BoundedSource/Reader or not
>>>>>>>>
>>>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>>>
>>>>>>> Yes, but I think we should start by discussing the anticipated use
>>>>>>> cases
>>>>>> of
>>>>>>> TikaIO and designing an API for it based on those use cases; and
>>>>>>> then see what's the best implementation for that particular API
>>>>>>> and set of anticipated use cases.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks, Sergey
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>

RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
@Timothy: can you tell more about this RecursiveParserWrapper? Is this something that the user can configure by specifying the Parser on TikaIO if they so wish?

Not at the moment, we’d have to do some coding on our end or within Beam.  The format is a list of maps/dicts, one for each file.  Each map contains all of the metadata, with one key reserved for the content.  If a file has no attachments, the list has length 1; otherwise there’s a map for each embedded file.  Unlike our legacy xhtml, this format maintains metadata for attachments.

The downside to this extract format is that it requires a full parse of the document and all data to be held in memory before writing it.  On the other hand, while Tika tries to be streaming, and that was one of the critical early design goals, for some file formats we simply have to parse the whole thing before we can have any output.

So, y, large files are a problem. :\

Example, with purely made-up keys, representing a PDF file containing an RTF attachment:
[
  {
    "Name": "container file",
    "Author": "Chris Mattmann",
    "Content": "Four score and seven years ago…",
    "Content-Type": "application/pdf",
    …
  },
  {
    "Name": "embedded file1",
    "Author": "Nick Burch",
    "Content": "When in the course of human events…",
    "Content-Type": "application/rtf"
  }
]
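To make the shape of that extract concrete, here is a small runnable sketch that builds the same list-of-maps structure with plain Java collections. The keys are Tim's made-up keys from the example above, not real Tika metadata names:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Builds a RecursiveParserWrapper-style extract: one metadata map per (embedded) file. */
public class ExtractFormatDemo {
    public static List<Map<String, String>> pdfWithRtfAttachment() {
        List<Map<String, String>> extract = new ArrayList<>();

        Map<String, String> container = new LinkedHashMap<>();
        container.put("Name", "container file");
        container.put("Author", "Chris Mattmann");
        container.put("Content", "Four score and seven years ago...");
        container.put("Content-Type", "application/pdf");
        extract.add(container);

        Map<String, String> embedded = new LinkedHashMap<>();
        embedded.put("Name", "embedded file1");
        embedded.put("Author", "Nick Burch");
        embedded.put("Content", "When in the course of human events...");
        embedded.put("Content-Type", "application/rtf");
        extract.add(embedded);

        return extract; // length 1 for a file with no attachments, 1 + N otherwise
    }

    public static void main(String[] args) {
        List<Map<String, String>> extract = pdfWithRtfAttachment();
        System.out.println(extract.size() + " entries, container type: "
            + extract.get(0).get("Content-Type"));
    }
}
```

A structure like this keeps the per-document (and per-attachment) association that a flat PCollection&lt;String&gt; loses, at the cost of holding the whole parse result in memory.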

From: Eugene Kirpichov [mailto:kirpichov@google.com]
Sent: Thursday, September 21, 2017 7:42 PM
To: Allison, Timothy B. <ta...@mitre.org>; dev@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi,
@Sergey:
- I already marked TikaIO @Experimental, so we can make changes.
- Yes, the String in KV<String, ParseResult> is the filename. I guess we could alternatively put it into ParseResult - don't have a strong opinion.

@Chris: unorderedness of Metadata would have helped if we extracted each Metadata item into a separate PCollection element, but that's not what we want to do (we want to have an element per document instead).

@Timothy: can you tell more about this RecursiveParserWrapper? Is this something that the user can configure by specifying the Parser on TikaIO if they so wish?

On Thu, Sep 21, 2017 at 2:23 PM Allison, Timothy B. <ta...@mitre.org>> wrote:
Like Sergey, it’ll take me some time to understand your recommendations.  Thank you!

On one small point:
>return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata }

For this option, I’d strongly encourage using the JSON output from the RecursiveParserWrapper that contains metadata and content, and captures metadata even from embedded documents.

> However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea
Large documents are a problem, no doubt about it…

From: Eugene Kirpichov [mailto:kirpichov@google.com<ma...@google.com>]
Sent: Thursday, September 21, 2017 4:41 PM
To: Allison, Timothy B. <ta...@mitre.org>>; dev@beam.apache.org<ma...@beam.apache.org>
Cc: dev@tika.apache.org<ma...@tika.apache.org>
Subject: Re: TikaIO concerns

Thanks all for the discussion. It seems we have consensus that both within-document order and association with the original filename are necessary, but currently absent from TikaIO.

Association with original file:
Sergey - Beam does not automatically provide a way to associate an element with the file it originated from: automatically tracking data provenance is a known very hard research problem on which many papers have been written, and obvious solutions are very easy to break. See related discussion at https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E .

If you want the elements of your PCollection to contain additional information, you need the elements themselves to contain this information: the elements are self-contained and have no metadata associated with them (beyond the timestamp and windows, universal to the whole Beam model).

Order within a file:
The only way to have any kind of order within a PCollection is to have the elements of the PCollection contain something ordered, e.g. have a PCollection<List<Something>>, where each List is for one file [I'm assuming Tika, at a low level, works on a per-file basis?]. However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea. Because of this, I don't think the result of applying Tika to a single file can be encoded as a PCollection element.

Given both of these, I think that it's not possible to create a general-purpose TikaIO transform that will be better than manual invocation of Tika as a DoFn on the result of FileIO.readMatches().

However, looking at the examples at https://tika.apache.org/1.16/examples.html - almost all of the examples involve extracting a single String from each document. This use case, with the assumption that individual documents are small enough, can certainly be simplified and TikaIO could be a facade for doing just this.

E.g. TikaIO could:
- take as input a PCollection<ReadableFile>
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata }
- be configured by: a Parser (it implements Serializable so can be specified at pipeline construction time) and a ContentHandler whose toString() will go into "content". ContentHandler does not implement Serializable, so you can not specify it at construction time - however, you can let the user specify either its class (if it's a simple handler like a BodyContentHandler) or specify a lambda for creating the handler (SerializableFunction<Void, ContentHandler>), and potentially you can have a simpler facade for Tika.parseAsString() - e.g. call it TikaIO.parseAllAsStrings().

Example usage would look like:

  PCollection<KV<String, ParseResult>> parseResults = p.apply(FileIO.match().filepattern(...))
    .apply(FileIO.readMatches())
    .apply(TikaIO.parseAllAsStrings())

or:

    .apply(TikaIO.parseAll()
        .withParser(new AutoDetectParser())
        .withContentHandler(() -> new BodyContentHandler(new ToXMLContentHandler())))

You could also have shorthands for letting the user avoid using FileIO directly in simple cases, for example:
    p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and you'll be able to share the code between parseAll and regular parse.
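A minimal sketch of what the proposed ParseResult value class could look like. The field and method names follow the { String content, Metadata metadata } outline above, with metadata modeled as a plain map here since Tika's own Metadata class is not assumed:

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of the proposed TikaIO.ParseResult: extracted content plus metadata. */
public class ParseResult implements Serializable {
    private final String content;
    private final Map<String, String> metadata;

    public ParseResult(String content, Map<String, String> metadata) {
        this.content = content;
        // Defensive copy so pipeline elements stay immutable once constructed.
        this.metadata = Collections.unmodifiableMap(new LinkedHashMap<>(metadata));
    }

    public String getContent() { return content; }
    public Map<String, String> getMetadata() { return metadata; }

    public static void main(String[] args) {
        ParseResult r = new ParseResult(
            "Four score and seven years ago...",
            Collections.singletonMap("Content-Type", "application/pdf"));
        System.out.println(r.getMetadata().get("Content-Type")); // prints application/pdf
    }
}
```

Pairing this with the filename as KV&lt;String, ParseResult&gt; preserves exactly the file-to-content association that the discussion identifies as missing from a bare PCollection&lt;String&gt;.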

On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sb...@gmail.com>> wrote:
Hi Tim
On 21/09/17 14:33, Allison, Timothy B. wrote:
> Thank you, Sergey.
>
> My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a chance to work with it yet.
>
>  From my perspective, if I understand this thread (and I may not!), getting unordered text from _a given file_ is a non-starter for most applications.  The implementation needs to guarantee order per file, and the user has to be able to link the "extract" back to a unique identifier for the document.  If the current implementation doesn't do those things, we need to change it, IMHO.
>
Right now the Tika-related reader does not associate a given text fragment
with the file name, so a function looking at some text and trying to
find where it came from won't be able to do so.

So I asked how to do it in Beam, how to attach some context to the given
piece of data. I hope it can be done and if not - then perhaps some
improvement can be applied.

Re the unordered text - yes - this is what we currently have with Beam +
TikaIO :-).

The use-case I referred to earlier in this thread (upload PDFs, save
the possibly unordered text to Lucene with the file name 'attached', let
users search for the files containing some words or phrases) works
OK given that I can see the PDF parser, for example, reporting the lines.
It can be supported with the current TikaIO (provided we find a way to
'attach' a file name to the flow).

I see though that supporting total ordering can be a big deal in other
cases. Eugene, can you please explain how it can be done ? Is it
achievable in principle, without the users having to do some custom
coding ?

> To the question of -- why is this in Beam at all; why don't we let users call it if they want it?...
>
> No matter how much we do to Tika, it will behave badly sometimes -- permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks using Beam -- folks likely with large batches of unruly/noisy documents -- are more likely to run into these problems than your average couple-of-thousand-docs-from-our-own-company user. So, if there are things we can do in Beam to prevent developers around the world from having to reinvent the wheel for defenses against these problems, then I'd be enormously grateful if we could put Tika into Beam.  That means:
>
> 1) a process-level timeout (because you can't actually kill a thread in Java)
> 2) a process-level restart on OOM
> 3) avoid trying to reprocess a badly behaving document
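Point 1 above can be illustrated with plain java.util.concurrent: a Future timeout detects a hang, but cancel(true) merely requests interruption, which is why a process-level timeout is needed for truly stuck parses. The helper below is an illustrative sketch, not Beam or Tika code:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Shows why thread-level timeouts only detect, not kill, a stuck task. */
public class TimeoutDemo {
    public static String runWithTimeout(Callable<String> task, long millis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> f = pool.submit(task);
        try {
            return f.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true); // only *interrupts*; a non-interruptible parse keeps running
            return "TIMED_OUT";
        } finally {
            pool.shutdownNow(); // same caveat: a truly stuck thread survives this too
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runWithTimeout(() -> "parsed ok", 1000));
        System.out.println(runWithTimeout(() -> { Thread.sleep(60_000); return "late"; }, 100));
    }
}
```

The sleep-based task here is interruptible, so cancel(true) actually stops it; a parser stuck in a tight loop or a native call would not stop, which is the motivation for a separate, killable process.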
>
> If Beam automatically handles those problems, then I'd say, y, let users write their own code.  If there is so much as a single configuration knob (and it sounds like Beam is against complex configuration...yay!) to get that working in Beam, then I'd say, please integrate Tika into Beam.  From a safety perspective, it is critical to keep the extraction process entirely separate (jvm, vm, m, rack, data center!) from the transformation+loading steps.  IMHO, very few devs realize this because Tika works well lots of the time...which is why it is critical for us to make it easy for people to get it right all of the time.
>
> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch mode first in one jvm, and then I kick off another process to do transform/loading into Lucene/Solr from the .json files that Tika generates for each input file.  If I were to scale up, I'd want to maintain this complete separation of steps.
>
> Apologies if I've derailed the conversation or misunderstood this thread.
>
Major thanks for your input :-)

Cheers, Sergey

> Cheers,
>
>                 Tim
>
> -----Original Message-----
> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com<ma...@gmail.com>]
> Sent: Thursday, September 21, 2017 9:07 AM
> To: dev@beam.apache.org<ma...@beam.apache.org>
> Cc: Allison, Timothy B. <ta...@mitre.org>>
> Subject: Re: TikaIO concerns
>
> Hi All
>
> Please welcome Tim, one of Apache Tika leads and practitioners.
>
> Tim, thanks for joining in :-). If you have some great Apache Tika stories to share (preferably involving cases where the order in which Tika-produced data were dealt with by the
> consumers did not really matter) then please do so :-).
>
> At the moment, even though the Tika ContentHandler will emit the ordered data, the Beam runtime gives no guarantee that the downstream pipeline components will see the data coming in the right order.
>
> (FYI, I understand from the earlier comments that the total ordering is also achievable but would require the extra API support)
>
> Other comments would be welcome too
>
> Thanks, Sergey
>
> On 21/09/17 10:55, Sergey Beryozkin wrote:
>> I noticed that the PDF and ODT parsers actually split by lines, not
>> individual words, and I'm nearly 100% sure I saw Tika reporting individual
>> lines when it was parsing the text files. The 'min text length'
>> feature can help with reporting several lines at a time, etc...
>>
>> I'm working with this PDF all the time:
>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
>>
>> try it too if you get a chance.
>>
>> (and I can imagine that not all PDFs/etc represent a 'story'; some can
>> contain, for example, log-like content too)
>>
>> That said, I don't know how a parser for the format N will behave, it
>> depends on the individual parsers.
>>
>> IMHO it's an equal candidate alongside Text-based bounded IOs...
>>
>> I'd like to know though how to make a file name available to the
>> pipeline which is working with the current text fragment ?
>>
>> Going to try and do some measurements and compare the sync vs async
>> parsing modes...
>>
>> Asked the Tika team to support with some more examples...
>>
>> Cheers, Sergey
>> On 20/09/17 22:17, Sergey Beryozkin wrote:
>>> Hi,
>>>
>>> thanks for the explanations,
>>>
>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>>> Hi!
>>>>
>>>> TextIO returns an unordered soup of lines contained in all files you
>>>> ask it to read. People usually use TextIO for reading files where 1
>>>> line corresponds to 1 independent data element, e.g. a log entry, or
>>>> a row of a CSV file - so discarding order is ok.
>>> Just a side note, I'd probably want that to be ordered, though I guess
>>> it depends...
>>>> However, there is a number of cases where TextIO is a poor fit:
>>>> - Cases where discarding order is not ok - e.g. if you're doing
>>>> natural language processing and the text files contain actual prose,
>>>> where you need to process a file as a whole. TextIO can't do that.
>>>> - Cases where you need to remember which file each element came
>>>> from, e.g.
>>>> if you're creating a search index for the files: TextIO can't do
>>>> this either.
>>>>
>>>> Both of these issues have been raised in the past against TextIO;
>>>> however it seems that the overwhelming majority of users of TextIO
>>>> use it for logs or CSV files or alike, so solving these issues has
>>>> not been a priority.
>>>> Currently they are solved in a general form via FileIO.read() which
>>>> gives you access to reading a full file yourself - people who want
>>>> more flexibility will be able to use standard Java text-parsing
>>>> utilities on a ReadableFile, without involving TextIO.
>>>>
>>>> Same applies for XmlIO: it is specifically designed for the narrow
>>>> use case where the files contain independent data entries, so
>>>> returning an unordered soup of them, with no association to the
>>>> original file, is the user's intention. XmlIO will not work for
>>>> processing more complex XML files that are not simply a sequence of
>>>> entries with the same tag, and it also does not remember the
>>>> original filename.
>>>>
>>>
>>> OK...
>>>
>>>> However, if my understanding of Tika use cases is correct, it is
>>>> mainly used for extracting content from complex file formats - for
>>>> example, extracting text and images from PDF files or Word
>>>> documents. I believe this is the main difference between it and
>>>> TextIO - people usually use Tika for complex use cases where the
>>>> "unordered soup of stuff" abstraction is not useful.
>>>>
>>>> My suspicion about this is confirmed by the fact that the crux of
>>>> the Tika API is ContentHandler
>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
>>>> html?is-external=true
>>>>
>>>> whose
>>>> documentation says "The order of events in this interface is very
>>>> important, and mirrors the order of information in the document itself."
>>> All that says is that a (Tika) ContentHandler will be a true SAX
>>> ContentHandler...
>>>>
>>>> Let me give a few examples of what I think is possible with the raw
>>>> Tika API, but I think is not currently possible with TikaIO - please
>>>> correct me where I'm wrong, because I'm not particularly familiar
>>>> with Tika and am judging just based on what I read about it.
>>>> - User has 100,000 Word documents and wants to convert each of them
>>>> to text files for future natural language processing.
>>>> - User has 100,000 PDF files with financial statements, each
>>>> containing a bunch of unrelated text and - the main content - a list
>>>> of transactions in PDF tables. User wants to extract each
>>>> transaction as a PCollection element, discarding the unrelated text.
>>>> - User has 100,000 PDF files with scientific papers, and wants to
>>>> extract text from them, somehow parse author and affiliation from
>>>> the text, and compute statistics of topics and terminology usage by
>>>> author name and affiliation.
>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>>>> observing a location over time: they want to extract metadata from
>>>> each image using Tika, analyze the images themselves using some
>>>> other library, and detect anomalies in the overall appearance of the
>>>> location over time as seen from multiple cameras.
>>>> I believe all of these cases can not be solved with TikaIO because
>>>> the resulting PCollection<String> contains no information about
>>>> which String comes from which document and about the order in which
>>>> they appear in the document.
>>> These are good use cases, thanks... I thought you were talking
>>> about the unordered soup of data produced by TikaIO (and its friends
>>> TextIO and the like :-)).
>>> Putting the ordered vs unordered question aside for a sec, why
>>> exactly a Tika Reader can not make the name of the file it's
>>> currently reading from available to the pipeline, as some Beam pipeline metadata piece ?
>>> Surely it can be possible with Beam ? If not then I would be surprised...
>>>
>>>>
>>>> I am, honestly, struggling to think of a case where I would want to
>>>> use Tika, but where I *would* be ok with getting an unordered soup
>>>> of strings.
>>>> So some examples would be very helpful.
>>>>
>>> Yes. I'll ask Tika developers to help with some examples, but I'll
>>> give one example where it did not matter to us in what order
>>> Tika-produced data were available to the downstream layer.
>>>
>>> It's a demo the Apache CXF colleague of mine showed at one of Apache
>>> Con NAs, and we had a happy audience:
>>>
>>> https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search
>>>
>>>
>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
>>> into Lucene. We associate a file name with the indexed content and
>>> then let users find a list of PDF files which contain a given word or
>>> few words, details are here
>>> https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L131
>>>
>>>
>>> I'd say even more involved search engines would not mind supporting a
>>> case like that :-)
>>>
>>> Now there we process one file at a time, and I understand now that
>>> with TikaIO and N files it's all over the place really as far as the
>>> ordering is concerned, which file it's coming from, etc. That's why
>>> the TikaReader must be able to associate the file name with a given piece
>>> of text it's making available to the pipeline.
>>>
>>> I'd be happy to support the ParDo way of linking Tika with Beam.
>>> If it makes things simpler then it would be good, I've just no idea
>>> at the moment how to start the pipeline without using a
>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
>>> earlier - how can one avoid it with ParDo when implementing a 'min
>>> len chunk' feature, where the ParDo would have to concatenate several
>>> SAX data pieces before making a single composite piece available to the pipeline ?
>>>
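[Editor's note: a minimal, self-contained sketch of the 'min len chunk' idea mentioned above. The class name MinLengthBuffer and its methods are hypothetical, not part of TikaIO. If the buffering happens on the same thread that receives the SAX callbacks, no cross-thread synchronization is needed:]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: accumulates SAX-reported text fragments and emits a
// composite chunk once at least minLen characters have been collected.
// All state is confined to one instance, so no synchronization is required
// when it is driven from a single parse call.
class MinLengthBuffer {
    private final int minLen;
    private final StringBuilder current = new StringBuilder();
    private final List<String> chunks = new ArrayList<>();

    MinLengthBuffer(int minLen) {
        this.minLen = minLen;
    }

    // Called for each SAX characters() callback.
    void append(String fragment) {
        current.append(fragment);
        if (current.length() >= minLen) {
            chunks.add(current.toString());
            current.setLength(0);
        }
    }

    // Called when the parser signals end of document: flush any remainder.
    List<String> finish() {
        if (current.length() > 0) {
            chunks.add(current.toString());
            current.setLength(0);
        }
        return chunks;
    }

    public static void main(String[] args) {
        MinLengthBuffer buf = new MinLengthBuffer(10);
        buf.append("Hello ");
        buf.append("world, "); // 13 chars collected, first chunk emitted
        buf.append("bye");     // remainder flushed by finish()
        System.out.println(buf.finish());
    }
}
```

[The same pattern works inside a DoFn's @ProcessElement call, since each call owns its own buffer instance.]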
>>>
>>>> Another way to state it: currently, if I wanted to solve all of the
>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
>>>> API myself on the resulting ReadableFile. How can we make TikaIO
>>>> provide a usability improvement over such usage?
>>>>
>>>
>>>
>>> If you are actually asking, does it really make sense for Beam to
>>> ship Tika related code, given that users can just do it themselves,
>>> I'm not sure.
>>>
>>> IMHO it always works better if users have to provide just few config
>>> options to an integral part of the framework and see things happening.
>>> It will bring more users.
>>>
>>> Whether the current Tika code (refactored or not) stays with Beam or
>>> not - I'll let you and the team decide; believe it or not I was
>>> seriously contemplating at the last moment to make it all part of the
>>> Tika project itself and have a bit more flexibility over there with
>>> tweaking things, but now that it is in the Beam snapshot - I don't
>>> know - it's not my decision...
>>>
>>>> I am confused by your other comment - "Does the ordering matter ?
>>>> Perhaps
>>>> for some cases it does, and for some it does not. May be it makes
>>>> sense to support running TikaIO as both the bounded reader/source
>>>> and ParDo, with getting the common code reused." - because using
>>>> BoundedReader or ParDo is not related to the ordering issue, only to
>>>> the issue of asynchronous reading and complexity of implementation.
>>>> The resulting PCollection will be unordered either way - this needs
>>>> to be solved separately by providing a different API.
>>> Right I see now, so ParDo is not about making Tika reported data
>>> available to the downstream pipeline components ordered, only about
>>> the simpler implementation.
>>> Association with the file should be possible I hope, but I understand
>>> it would be possible to optionally make the data come out in an
>>> ordered way as well...
>>>
>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
>>> let me double check: should we still give some thought to the
>>> possible performance benefit of the current approach ? As I said, I
>>> can easily get rid of all that polling code, use a simple Blocking queue.
>>>
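[Editor's note: a minimal plain-Java sketch of option 2b) above, with all names hypothetical; the real TikaReader would be fed by Tika's SAX callbacks rather than a list. The parser thread put()s chunks into a BlockingQueue, the reader take()s with no polling timeout, and an end-of-document sentinel terminates the loop, so there is no max-poll-time knob to misconfigure:]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of a blocking handoff between a parser thread and a
// reader: take() blocks until the next chunk or the end sentinel arrives.
class BlockingHandoff {
    private static final String EOF = new String("<eof>"); // sentinel, compared by identity

    static List<String> consume(List<String> parsedChunks) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        Thread parser = new Thread(() -> {
            try {
                // Stand-in for the Tika SAX callbacks producing chunks.
                for (String chunk : parsedChunks) {
                    queue.put(chunk); // blocks if the consumer is slower
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                putUninterruptibly(queue, EOF); // always signal end-of-document
            }
        });
        parser.start();
        List<String> out = new ArrayList<>();
        try {
            for (String chunk = queue.take(); chunk != EOF; chunk = queue.take()) {
                out.add(chunk); // take() blocks; no max-poll-time to tune
            }
            parser.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return out;
    }

    private static void putUninterruptibly(BlockingQueue<String> q, String s) {
        boolean interrupted = false;
        while (true) {
            try { q.put(s); break; }
            catch (InterruptedException e) { interrupted = true; }
        }
        if (interrupted) Thread.currentThread().interrupt();
    }

    public static void main(String[] args) {
        System.out.println(consume(Arrays.asList("page 1", "page 2", "page 3")));
    }
}
```

[A real implementation would also need to hand parser exceptions across the queue instead of swallowing them, for example via a wrapper element carrying either a chunk or a Throwable.]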
>>> Cheers, Sergey
>>>>
>>>> Thanks.
>>>>
>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
>>>> <sb...@gmail.com>>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Glad TikaIO getting some serious attention :-), I believe one thing
>>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>>
>>>>> Before trying to reply online, I'd like to state that my main
>>>>> assumption is that TikaIO (as far as the read side is concerned) is
>>>>> no different to Text, XML or similar bounded reader components.
>>>>>
>>>>> I have to admit I don't understand your questions about TikaIO
>>>>> use cases.
>>>>>
>>>>> What are the Text input or XML input use cases ? These use cases
>>>>> are Tika input cases as well; the only difference is Tika can not
>>>>> split the individual file into a sequence of sources/etc.
>>>>>
>>>>> TextIO can read from plain text files (possibly zipped), XmlIO is
>>>>> optimized around reading from XML files, and I thought I made
>>>>> it clear (and it is a known fact anyway) that Tika is about reading
>>>>> from basically any file format.
>>>>>
>>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>>
>>>>> Sergey
>>>>>
>>>>>
>>>>>
>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Replies inline.
>>>>>>
>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
>>>>>> <sb...@gmail.com>>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All
>>>>>>>
>>>>>>> This is my first post to the dev list, I work for Talend, I'm a
>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
>>>>>>> great to try and link both projects together, which led me to
>>>>>>> opening [1] where I typed some early thoughts, followed by PR
>>>>>>> [2].
>>>>>>>
>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
>>>>>>> newer review comments from Eugene pending, so I'd like to
>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
>>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>>
>>>>>>> Apache Tika Parsers report the text content in chunks, via
>>>>>>> SaxParser events. It's not possible with Tika to take a file and
>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
>>>>>>> by line, the only way is to handle the SAXParser callbacks which
>>>>>>> report the data chunks.
>>>>>>> Some
>>>>>>> parsers may report the complete lines, some individual words,
>>>>>>> with some being able to report the data only after they
>>>>>>> completely parse the document.
>>>>>>> All depends on the data format.
>>>>>>>
>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
>>>>>>> to parse the files, Beam threads will only collect the data from
>>>>>>> the internal queue where the internal TikaReader's thread will
>>>>>>> put the data into (note the data chunks are ordered even though
>>>>>>> the tests might suggest otherwise).
>>>>>>>
>>>>>> I agree that your implementation of reader returns records in
>>>>>> order
>>>>>> - but
>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
>>>>>> the order in which records are produced by a BoundedReader - the
>>>>>> order produced by your reader is ignored, and when applying any
>>>>>> transforms to the
>>>>> PCollection
>>>>>> produced by TikaIO, it is impossible to recover the order in which
>>>>>> your reader returned the records.
>>>>>>
>>>>>> With that in mind, is PCollection<String>, containing individual
>>>>>> Tika-detected items, still the right API for representing the
>>>>>> result of parsing a large number of documents with Tika?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> The reason I did it was because I thought
>>>>>>>
>>>>>>> 1) it would make the individual data chunks available faster to
>>>>>>> the pipeline - the parser will continue working via the
>>>>>>> binary/video etc file while the data will already start flowing -
>>>>>>> I agree there should be some test data available confirming it -
>>>>>>> but I'm positive at the moment this approach might yield some
>>>>>>> performance gains with the large sets. If the file is large, if
>>>>>>> it has the embedded attachments/videos to deal with, then it may
>>>>>>> be more effective not to have the Beam thread deal with it...
>>>>>>>
>>>>>>> As I said on the PR, this description contains unfounded and
>>>>>>> potentially
>>>>>> incorrect assumptions about how Beam runners execute (or may
>>>>>> execute in
>>>>> the
>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>>> correctly,
>>>>>> you might be assuming that:
>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>>> complete
>>>>>> before processing its outputs with downstream transforms
>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>>> *concurrently*
>>>>>> with downstream processing of its results
>>>>>> - Passing an element from one thread to another using a
>>>>>> BlockingQueue is free in terms of performance All of these are
>>>>>> false at least in some runners, and I'm almost certain that in
>>>>>> reality, performance of this approach is worse than a ParDo in
>>>>> most
>>>>>> production runners.
>>>>>>
>>>>>> There are other disadvantages to this approach:
>>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>>> invisible
>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>>>> profiling capabilities, or the ability to get the current stack
>>>>>> trace for stuck elements, this approach would make the real
>>>>>> processing invisible to all of these capabilities, and a user
>>>>>> would only see that the bulk of the time is spent waiting for the
>>>>>> next element, but not *why* the next
>>>>> element
>>>>>> is taking long to compute.
>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>>>> invisible to Beam, will make it harder for runners to do
>>>>>> autoscaling, binpacking
>>>>> and
>>>>>> other resource management magic (how much of this runners actually
>>>>>> do is
>>>>> a
>>>>>> separate issue), because the runner will have no way of knowing
>>>>>> how much CPU/IO this particular transform is actually using - all
>>>>>> the processing happens in a thread about which the runner is
>>>>>> unaware.
>>>>>> - As far as I can tell, the code also hides exceptions that happen
>>>>>> in the Tika thread
>>>>>> - Adding the thread management makes the code much more complex,
>>>>>> easier
>>>>> to
>>>>>> introduce bugs, and harder for others to contribute
>>>>>>
>>>>>>
>>>>>>> 2) As I commented at the end of [2], having an option to
>>>>>>> concatenate the data chunks first before making them available to
>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
>>>>>>> introduce some synchronization issues (though not exactly sure
>>>>>>> yet)
>>>>>>>
>>>>>> What are these issues?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> One of valid concerns there is that the reader is polling the
>>>>>>> internal queue so, in theory at least, and perhaps in some rare
>>>>>>> cases too, we may have a case where the max polling time has been
>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
>>>>>>> the file data. I think that it can be solved by either 2a)
>>>>>>> configuring the max polling time to a very large number which
>>>>>>> will never be reached for a practical case, or
>>>>>>> 2b) simply use a blocking queue without the time limits - in the
>>>>>>> worst case, if TikaParser spins and fails to report the end of
>>>>>>> the document, then Beam can heal itself if the pipeline blocks.
>>>>>>> I propose to follow 2b).
>>>>>>>
>>>>>> I agree that there should be no way to unintentionally configure
>>>>>> the transform in a way that will produce silent data loss. Another
>>>>>> reason for not having these tuning knobs is that it goes against
>>>>>> Beam's "no knobs"
>>>>>> philosophy, and that in most cases users have no way of figuring
>>>>>> out a
>>>>> good
>>>>>> value for tuning knobs except for manual experimentation, which is
>>>>>> extremely brittle and typically gets immediately obsoleted by
>>>>>> running on
>>>>> a
>>>>>> new dataset or updating a version of some of the involved
>>>>>> dependencies
>>>>> etc.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Please let me know what you think.
>>>>>>> My plan so far is:
>>>>>>> 1) start addressing most of Eugene's comments which would require
>>>>>>> some minor TikaIO updates
>>>>>>> 2) work on removing the TikaSource internal code dealing with
>>>>>>> File patterns which I copied from TextIO at the next stage
>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
>>>>>>> users some time to try it with some real complex files and also
>>>>>>> decide if TikaIO can continue implemented as a
>>>>>>> BoundedSource/Reader or not
>>>>>>>
>>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>>
>>>>>> Yes, but I think we should start by discussing the anticipated use
>>>>>> cases
>>>>> of
>>>>>> TikaIO and designing an API for it based on those use cases; and
>>>>>> then see what's the best implementation for that particular API
>>>>>> and set of anticipated use cases.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks, Sergey
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>

Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi,
On 22/09/17 00:42, Eugene Kirpichov wrote:
> Hi,
> @Sergey:
> - I already marked TikaIO @Experimental, so we can make changes.
OK, thanks
> - Yes, the String in KV<String, ParseResult> is the filename. I guess we
> could alternatively put it into ParseResult - don't have a strong opinion.
> 
Sure. If you don't mind, the first thing I'd like to try, hopefully 
early next week, is to introduce ParseResult into the existing code.
I know it won't 'fix' the issues related to the ordering, but starting 
with a complete re-write would be a steep curve for me, so I'd try to 
experiment first with the idea (which I like very much) of wrapping 
several related pieces (content fragment, metadata, and the doc id/file 
name) into ParseResult.
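
[Editor's note: an illustrative sketch only of what such a ParseResult wrapper could look like; the names are hypothetical and the real Beam class would also need a Coder to be usable in a PCollection, which is omitted here:]

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Illustrative value class: a content fragment tied to the file it came
// from, plus the metadata known at the time the fragment was emitted.
class ParseResult {
    private final String fileName;
    private final String content;
    private final Map<String, String> metadata;

    ParseResult(String fileName, String content, Map<String, String> metadata) {
        this.fileName = fileName;
        this.content = content;
        // Snapshot: Tika keeps updating its Metadata while parsing, so copy
        // what is known at emission time rather than sharing the live object.
        this.metadata = Collections.unmodifiableMap(new HashMap<>(metadata));
    }

    String getFileName() { return fileName; }
    String getContent()  { return content; }
    Map<String, String> getMetadata() { return metadata; }

    @Override public boolean equals(Object o) {
        if (!(o instanceof ParseResult)) return false;
        ParseResult r = (ParseResult) o;
        return fileName.equals(r.fileName) && content.equals(r.content)
            && metadata.equals(r.metadata);
    }
    @Override public int hashCode() { return Objects.hash(fileName, content, metadata); }
}
```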

By the way, reporting Tika file (output) metadata with every ParseResult 
instance will work much better than I first thought. I assumed it would 
not work, because Tika does not do a callback when it populates the file 
metadata; it only does that for the actual content. But it does update 
the Metadata instance passed to it while it keeps parsing and finding 
new metadata, so the metadata pieces will be available to the pipeline 
as soon as they become available. Though Tika (1.17 ?) may need to 
ensure its Metadata is backed by a concurrent map for this approach to 
work, not sure yet...


> @Chris: unorderedness of Metadata would have helped if we extracted each
> Metadata item into a separate PCollection element, but that's not what we
> want to do (we want to have an element per document instead).
> 
> @Timothy: can you tell more about this RecursiveParserWrapper? Is this
> something that the user can configure by specifying the Parser on TikaIO if
> they so wish?
> 


As a general note, Metadata passed to the top-level parser acts as a 
file (and embedded attachments) metadata sink but also as a 'helper' to 
the parser, right now TikaIO uses it to pass a media type hint if 
available (to help the auto-detect parser select the correct parser 
faster), and also a parser which will be used to parse the embedded 
attachments (I did it after Tim hinted about it earlier on...).

Not sure if RecursiveParserWrapper can act as a top-level parser or 
needs to be passed as a metadata property to AutoDetectParser, Tim will 
know :-)

Thanks, Sergey

> On Thu, Sep 21, 2017 at 2:23 PM Allison, Timothy B. <ta...@mitre.org>
> wrote:
> 
>> Like Sergey, it’ll take me some time to understand your recommendations.
>> Thank you!
>>
>>
>>
>> On one small point:
>>
>>> return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
>> is a class with properties { String content, Metadata metadata }
>>
>>
>>
>> For this option, I’d strongly encourage using the Json output from the
>> RecursiveParserWrapper that contains metadata and content, and captures
>> metadata even from embedded documents.
>>
>>
>>
>>> However, since TikaIO can be applied to very large files, this could
>> produce very large elements, which is a bad idea
>>
>> Large documents are a problem, no doubt about it…
>>
>>
>>
>> *From:* Eugene Kirpichov [mailto:kirpichov@google.com]
>> *Sent:* Thursday, September 21, 2017 4:41 PM
>> *To:* Allison, Timothy B. <ta...@mitre.org>; dev@beam.apache.org
>> *Cc:* dev@tika.apache.org
>> *Subject:* Re: TikaIO concerns
>>
>>
>>
>> Thanks all for the discussion. It seems we have consensus that both
>> within-document order and association with the original filename are
>> necessary, but currently absent from TikaIO.
>>
>>
>>
>> *Association with original file:*
>>
>> Sergey - Beam does not *automatically* provide a way to associate an
>> element with the file it originated from: automatically tracking data
>> provenance is a known very hard research problem on which many papers have
>> been written, and obvious solutions are very easy to break. See related
>> discussion at
>> https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
>>   .
>>
>>
>>
>> If you want the elements of your PCollection to contain additional
>> information, you need the elements themselves to contain this information:
>> the elements are self-contained and have no metadata associated with them
>> (beyond the timestamp and windows, universal to the whole Beam model).
>>
>>
>>
>> *Order within a file:*
>>
>> The only way to have any kind of order within a PCollection is to have the
>> elements of the PCollection contain something ordered, e.g. have a
>> PCollection<List<Something>>, where each List is for one file [I'm assuming
>> Tika, at a low level, works on a per-file basis?]. However, since TikaIO
>> can be applied to very large files, this could produce very large elements,
>> which is a bad idea. Because of this, I don't think the result of applying
>> Tika to a single file can be encoded as a PCollection element.
>>
>>
>>
>> Given both of these, I think that it's not possible to create a
>> *general-purpose* TikaIO transform that will be better than manual
>> invocation of Tika as a DoFn on the result of FileIO.readMatches().
>>
>>
>>
>> However, looking at the examples at
>> https://tika.apache.org/1.16/examples.html - almost all of the examples
>> involve extracting a single String from each document. This use case, with
>> the assumption that individual documents are small enough, can certainly be
>> simplified and TikaIO could be a facade for doing just this.
>>
>>
>>
>> E.g. TikaIO could:
>>
>> - take as input a PCollection<ReadableFile>
>>
>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
>> is a class with properties { String content, Metadata metadata }
>>
>> - be configured by: a Parser (it implements Serializable so can be
>> specified at pipeline construction time) and a ContentHandler whose
>> toString() will go into "content". ContentHandler does not implement
>> Serializable, so you can not specify it at construction time - however, you
>> can let the user specify either its class (if it's a simple handler like a
>> BodyContentHandler) or specify a lambda for creating the handler
>> (SerializableFunction<Void, ContentHandler>), and potentially you can have
>> a simpler facade for Tika.parseAsString() - e.g. call it
>> TikaIO.parseAllAsStrings().
>>
>>
>>
>> Example usage would look like:
>>
>>
>>
>>    PCollection<KV<String, ParseResult>> parseResults =
>> p.apply(FileIO.match().filepattern(...))
>>
>>      .apply(FileIO.readMatches())
>>
>>      .apply(TikaIO.parseAllAsStrings())
>>
>>
>>
>> or:
>>
>>
>>
>>      .apply(TikaIO.parseAll()
>>
>>          .withParser(new AutoDetectParser())
>>
>>          .withContentHandler(() -> new BodyContentHandler(new
>> ToXMLContentHandler())))
>>
>>
>>
>> You could also have shorthands for letting the user avoid using FileIO
>> directly in simple cases, for example:
>>
>>      p.apply(TikaIO.parseAsStrings().from(filepattern))
>>
>>
>>
>> This would of course be implemented as a ParDo or even MapElements, and
>> you'll be able to share the code between parseAll and regular parse.
>>
>>
>>
>> On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sb...@gmail.com>
>> wrote:
>>
>> Hi Tim
>> On 21/09/17 14:33, Allison, Timothy B. wrote:
>>> Thank you, Sergey.
>>>
>>> My knowledge of Apache Beam is limited -- I saw Davor and
>> Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally
>> impressed, but I haven't had a chance to work with it yet.
>>>
>>> From my perspective, if I understand this thread (and I may not!),
>> getting unordered text from _a given file_ is a non-starter for most
>> applications.  The implementation needs to guarantee order per file, and
>> the user has to be able to link the "extract" back to a unique identifier
>> for the document.  If the current implementation doesn't do those things,
>> we need to change it, IMHO.
>>>
>> Right now Tika-related reader does not associate a given text fragment
>> with the file name, so a function looking at some text and trying to
>> find where it came from won't be able to do so.
>>
>> So I asked how to do it in Beam, how to attach some context to the given
>> piece of data. I hope it can be done and if not - then perhaps some
>> improvement can be applied.
>>
>> Re the unordered text - yes - this is what we currently have with Beam +
>> TikaIO :-).
>>
>> The use case I referred to earlier in this thread (upload PDFs, save
>> the possibly unordered text to Lucene with the file name 'attached',
>> and let users search for the files containing some words or phrases -
>> this works OK given that I can see, for example, the PDF parser
>> reporting the lines) can be supported with the current TikaIO
>> (provided we find a way to 'attach' a file name to the flow).
>>
>> I see though supporting the total ordering can be a big deal in other
>> cases. Eugene, can you please explain how it can be done, is it
>> achievable in principle, without the users having to do some custom
>> coding ?
>>
>>> To the question of -- why is this in Beam at all; why don't we let users
>> call it if they want it?...
>>>
>>> No matter how much we do to Tika, it will behave badly sometimes --
>> permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks
>> using Beam -- folks likely with large batches of unruly/noisy documents --
>> are more likely to run into these problems than your average
>> couple-of-thousand-docs-from-our-own-company user. So, if there are things
>> we can do in Beam to prevent developers around the world from having to
>> reinvent the wheel for defenses against these problems, then I'd be
>> enormously grateful if we could put Tika into Beam.  That means:
>>>
>>> 1) a process-level timeout (because you can't actually kill a thread in
>> Java)
>>> 2) a process-level restart on OOM
>>> 3) avoid trying to reprocess a badly behaving document
>>>
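[Editor's note: Tim's point 1) can be sketched in plain Java along these lines. The shell commands below are stand-ins for a real child-JVM Tika invocation, and a POSIX sh is assumed to be available:]

```java
import java.util.concurrent.TimeUnit;

// Sketch of a process-level timeout: run the extraction in a child process
// and destroy it forcibly if it does not finish in time -- the equivalent
// of kill -9, which a Java thread cannot get.
class ProcessTimeout {
    // Returns true if the process finished within the timeout,
    // false if it had to be killed.
    static boolean runWithTimeout(ProcessBuilder pb, long timeoutMillis) {
        try {
            Process p = pb.start();
            boolean finished = p.waitFor(timeoutMillis, TimeUnit.MILLISECONDS);
            if (!finished) {
                p.destroyForcibly().waitFor(); // SIGKILL, then reap the child
            }
            return finished;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // A well-behaved "parse" finishes in time...
        System.out.println(runWithTimeout(
            new ProcessBuilder("sh", "-c", "exit 0"), 5000));
        // ...while a hung one is killed after 200 ms.
        System.out.println(runWithTimeout(
            new ProcessBuilder("sh", "-c", "sleep 30"), 200));
    }
}
```

[Points 2) and 3) would sit on top of this: a restart loop around the child process, and a record of the offending input file so it is skipped on the next attempt.]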
>>> If Beam automatically handles those problems, then I'd say, y, let users
>> write their own code.  If there is so much as a single configuration knob
>> (and it sounds like Beam is against complex configuration...yay!) to get
>> that working in Beam, then I'd say, please integrate Tika into Beam.  From
>> a safety perspective, it is critical to keep the extraction process
>> entirely separate (jvm, vm, m, rack, data center!) from the
>> transformation+loading steps.  IMHO, very few devs realize this because
>> Tika works well lots of the time...which is why it is critical for us to
>> make it easy for people to get it right all of the time.
>>>
>>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch
>> mode first in one jvm, and then I kick off another process to do
>> transform/loading into Lucene/Solr from the .json files that Tika generates
>> for each input file.  If I were to scale up, I'd want to maintain this
>> complete separation of steps.
>>>
>>> Apologies if I've derailed the conversation or misunderstood this thread.
>>>
>> Major thanks for your input :-)
>>
>> Cheers, Sergey
>>
>>> Cheers,
>>>
>>>                  Tim
>>>
>>> -----Original Message-----
>>> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
>>> Sent: Thursday, September 21, 2017 9:07 AM
>>> To: dev@beam.apache.org
>>> Cc: Allison, Timothy B. <ta...@mitre.org>
>>> Subject: Re: TikaIO concerns
>>>
>>> Hi All
>>>
>>> Please welcome Tim, one of Apache Tika leads and practitioners.
>>>
>>> Tim, thanks for joining in :-). If you have some great Apache Tika
>> stories to share (preferably involving the cases where it did not really
>> matter the ordering in which Tika-produced data were dealt with by the
>>> consumers) then please do so :-).
>>>
>>> At the moment, even though Tika ContentHandler will emit the ordered
>> data, the Beam runtime will have no guarantees that the downstream pipeline
>> components will see the data coming in the right order.
>>>
>>> (FYI, I understand from the earlier comments that the total ordering is
>> also achievable but would require the extra API support)
>>>
>>> Other comments would be welcome too
>>>
>>> Thanks, Sergey
>>>
>>> On 21/09/17 10:55, Sergey Beryozkin wrote:
>>>> I noticed that the PDF and ODT parsers actually split by lines, not
>>>> individual words, and I'm nearly 100% sure I saw Tika reporting individual
>>>> lines when it was parsing the text files. The 'min text length'
>>>> feature can help with reporting several lines at a time, etc...
>>>>
>>>> I'm working with this PDF all the time:
>>>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
>>>>
>>>> try it too if you get a chance.
>>>>
>>>> (and I can imagine not all PDFs/etc represent a 'story'; some can
>>>> be, for example, log-like content too)
>>>>
>>>> That said, I don't know how a parser for the format N will behave, it
>>>> depends on the individual parsers.
>>>>
>>>> IMHO it's an equal candidate alongside Text-based bounded IOs...
>>>>
>>>> I'd like to know though how to make a file name available to the
>>>> pipeline which is working with the current text fragment ?
>>>>
>>>> Going to try and do some measurements and compare the sync vs async
>>>> parsing modes...
>>>>
>>>> Asked the Tika team to support with some more examples...
>>>>
>>>> Cheers, Sergey
>>>> On 20/09/17 22:17, Sergey Beryozkin wrote:
>>>>> Hi,
>>>>>
>>>>> thanks for the explanations,
>>>>>
>>>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>>>>> Hi!
>>>>>>
>>>>>> TextIO returns an unordered soup of lines contained in all files you
>>>>>> ask it to read. People usually use TextIO for reading files where 1
>>>>>> line corresponds to 1 independent data element, e.g. a log entry, or
>>>>>> a row of a CSV file - so discarding order is ok.
>>>>> Just a side note, I'd probably want that to be ordered, though I guess
>>>>> it depends...
>>>>>> However, there is a number of cases where TextIO is a poor fit:
>>>>>> - Cases where discarding order is not ok - e.g. if you're doing
>>>>>> natural language processing and the text files contain actual prose,
>>>>>> where you need to process a file as a whole. TextIO can't do that.
>>>>>> - Cases where you need to remember which file each element came
>>>>>> from, e.g.
>>>>>> if you're creating a search index for the files: TextIO can't do
>>>>>> this either.
>>>>>>
>>>>>> Both of these issues have been raised in the past against TextIO;
>>>>>> however it seems that the overwhelming majority of users of TextIO
>>>>>> use it for logs or CSV files or alike, so solving these issues has
>>>>>> not been a priority.
>>>>>> Currently they are solved in a general form via FileIO.read() which
>>>>>> gives you access to reading a full file yourself - people who want
>>>>>> more flexibility will be able to use standard Java text-parsing
>>>>>> utilities on a ReadableFile, without involving TextIO.
>>>>>>
>>>>>> Same applies for XmlIO: it is specifically designed for the narrow
>>>>>> use case where the files contain independent data entries, so
>>>>>> returning an unordered soup of them, with no association to the
>>>>>> original file, is the user's intention. XmlIO will not work for
>>>>>> processing more complex XML files that are not simply a sequence of
>>>>>> entries with the same tag, and it also does not remember the
>>>>>> original filename.
>>>>>>
>>>>>
>>>>> OK...
>>>>>
>>>>>> However, if my understanding of Tika use cases is correct, it is
>>>>>> mainly used for extracting content from complex file formats - for
>>>>>> example, extracting text and images from PDF files or Word
>>>>>> documents. I believe this is the main difference between it and
>>>>>> TextIO - people usually use Tika for complex use cases where the
>>>>>> "unordered soup of stuff" abstraction is not useful.
>>>>>>
>>>>>> My suspicion about this is confirmed by the fact that the crux of
>>>>>> the Tika API is ContentHandler
>>>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true
>>>>>>
>>>>>> whose
>>>>>> documentation says "The order of events in this interface is very
>>>>>> important, and mirrors the order of information in the document
>> itself."
>>>>> All that says is that a (Tika) ContentHandler will be a true SAX
>>>>> ContentHandler...
>>>>>>
>>>>>> Let me give a few examples of what I think is possible with the raw
>>>>>> Tika API, but I think is not currently possible with TikaIO - please
>>>>>> correct me where I'm wrong, because I'm not particularly familiar
>>>>>> with Tika and am judging just based on what I read about it.
>>>>>> - User has 100,000 Word documents and wants to convert each of them
>>>>>> to text files for future natural language processing.
>>>>>> - User has 100,000 PDF files with financial statements, each
>>>>>> containing a bunch of unrelated text and - the main content - a list
>>>>>> of transactions in PDF tables. User wants to extract each
>>>>>> transaction as a PCollection element, discarding the unrelated text.
>>>>>> - User has 100,000 PDF files with scientific papers, and wants to
>>>>>> extract text from them, somehow parse author and affiliation from
>>>>>> the text, and compute statistics of topics and terminology usage by
>>>>>> author name and affiliation.
>>>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>>>>>> observing a location over time: they want to extract metadata from
>>>>>> each image using Tika, analyze the images themselves using some
>>>>>> other library, and detect anomalies in the overall appearance of the
>>>>>> location over time as seen from multiple cameras.
>>>>>> I believe all of these cases can not be solved with TikaIO because
>>>>>> the resulting PCollection<String> contains no information about
>>>>>> which String comes from which document and about the order in which
>>>>>> they appear in the document.
>>>>> These are good use cases, thanks... I thought you were talking
>>>>> about the unordered soup of data produced by TikaIO (and its friends
>>>>> TextIO and alike :-)).
>>>>> Putting the ordered vs unordered question aside for a sec, why
>>>>> exactly a Tika Reader can not make the name of the file it's
>>>>> currently reading from available to the pipeline, as some Beam
>> pipeline metadata piece ?
>>>>> Surely it can be possible with Beam ? If not then I would be
>> surprised...
>>>>>
>>>>>>
>>>>>> I am, honestly, struggling to think of a case where I would want to
>>>>>> use Tika, but where I *would* be ok with getting an unordered soup
>>>>>> of strings.
>>>>>> So some examples would be very helpful.
>>>>>>
>>>>> Yes. I'll ask Tika developers to help with some examples, but I'll
>>>>> give one example where it did not matter to us in what order
>>>>> Tika-produced data were available to the downstream layer.
>>>>>
>>>>> It's a demo the Apache CXF colleague of mine showed at one of Apache
>>>>> Con NAs, and we had a happy audience:
>>>>>
>>>>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea
>>>>> se/samples/jax_rs/search
>>>>>
>>>>>
>>>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
>>>>> into Lucene. We associate a file name with the indexed content and
>>>>> then let users find a list of PDF files which contain a given word or
>>>>> few words, details are here
>>>>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea
>>>>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal
>>>>> og.java#L131
>>>>>
>>>>>
>>>>> I'd say even more involved search engines would not mind supporting a
>>>>> case like that :-)
>>>>>
>>>>> Now there we process one file at a time, and I understand now that
>>>>> with TikaIO and N files it's all over the place really as far as the
>>>>> ordering is concerned, which file it's coming from, etc. That's why
>>>>> TikaReader must be able to associate the file name with a given piece
>>>>> of text it's making available to the pipeline.
>>>>>
>>>>> I'd be happy to support the ParDo way of linking Tika with Beam.
>>>>> If it makes things simpler then it would be good, I've just no idea
>>>>> at the moment how to start the pipeline without using a
>>>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
>>>>> earlier - how can one avoid it with ParDo when implementing a 'min
>>>>> len chunk' feature, where the ParDo would have to concatenate several
>>>>> SAX data pieces first before making a single composite piece
>>>>> available to the pipeline ?
>>>>>
>>>>>
>>>>>> Another way to state it: currently, if I wanted to solve all of the
>>>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
>>>>>> API myself on the resulting ReadableFile. How can we make TikaIO
>>>>>> provide a usability improvement over such usage?
>>>>>>
>>>>>
>>>>>
>>>>> If you are actually asking, does it really make sense for Beam to
>>>>> ship Tika related code, given that users can just do it themselves,
>>>>> I'm not sure.
>>>>>
>>>>> IMHO it always works better if users have to provide just few config
>>>>> options to an integral part of the framework and see things happening.
>>>>> It will bring more users.
>>>>>
>>>>> Whether the current Tika code (refactored or not) stays with Beam or
>>>>> not - I'll let you and the team decide; believe it or not I was
>>>>> seriously contemplating at the last moment to make it all part of the
>>>>> Tika project itself and have a bit more flexibility over there with
>>>>> tweaking things, but now that it is in the Beam snapshot - I don't
>>>>> know - it's not my decision...
>>>>>
>>>>>> I am confused by your other comment - "Does the ordering matter ?
>>>>>> Perhaps
>>>>>> for some cases it does, and for some it does not. May be it makes
>>>>>> sense to support running TikaIO as both the bounded reader/source
>>>>>> and ParDo, with getting the common code reused." - because using
>>>>>> BoundedReader or ParDo is not related to the ordering issue, only to
>>>>>> the issue of asynchronous reading and complexity of implementation.
>>>>>> The resulting PCollection will be unordered either way - this needs
>>>>>> to be solved separately by providing a different API.
>>>>> Right, I see now, so ParDo is not about making Tika-reported data
>>>>> available to the downstream pipeline components in order, only about
>>>>> the simpler implementation.
>>>>> Association with the file should be possible I hope, and I understand
>>>>> it would also be possible to optionally make the data come out in
>>>>> order...
>>>>>
>>>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
>>>>> let me double check: should we still give some thought to the
>>>>> possible performance benefit of the current approach ? As I said, I
>>>>> can easily get rid of all that polling code, use a simple Blocking
>> queue.
>>>>>
>>>>> Cheers, Sergey
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
>>>>>> <sb...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> Glad TikaIO is getting some serious attention :-), I believe one thing
>>>>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>>>>
>>>>>>> Before trying to reply online, I'd like to state that my main
>>>>>>> assumption is that TikaIO (as far as the read side is concerned) is
>>>>>>> no different to Text, XML or similar bounded reader components.
>>>>>>>
>>>>>>> I have to admit I don't understand your questions about TikaIO
>>>>>>> use cases.
>>>>>>>
>>>>>>> What are the Text input or XML input use cases ? These use cases
>>>>>>> are Tika input cases as well; the only difference is that Tika can
>>>>>>> not split an individual file into a sequence of sources/etc.
>>>>>>>
>>>>>>> TextIO can read from plain text files (possibly zipped), XmlIO is
>>>>>>> optimized around reading from XML files, and I thought I made
>>>>>>> it clear (and it is a known fact anyway) that Tika is about reading
>>>>>>> from basically any file format.
>>>>>>>
>>>>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>>>>
>>>>>>> Sergey
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Replies inline.
>>>>>>>>
>>>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
>>>>>>>> <sb...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi All
>>>>>>>>>
>>>>>>>>> This is my first post to the dev list, I work for Talend, I'm a
>>>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
>>>>>>>>> great to try and link both projects together, which led me to
>>>>>>>>> opening [1] where I typed some early thoughts, followed by PR
>>>>>>>>> [2].
>>>>>>>>>
>>>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
>>>>>>>>> newer review comments from Eugene pending, so I'd like to
>>>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
>>>>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>>>>
>>>>>>>>> Apache Tika Parsers report the text content in chunks, via
>>>>>>>>> SaxParser events. It's not possible with Tika to take a file and
>>>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
>>>>>>>>> by line, the only way is to handle the SAXParser callbacks which
>>>>>>>>> report the data chunks.
>>>>>>>>> Some
>>>>>>>>> parsers may report complete lines, some individual words,
>>>>>>>>> with some able to report the data only after they completely
>>>>>>>>> parse the document.
>>>>>>>>> It all depends on the data format.
>>>>>>>>>
>>>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
>>>>>>>>> to parse the files, Beam threads will only collect the data from
>>>>>>>>> the internal queue where the internal TikaReader's thread will
>>>>>>>>> put the data into (note the data chunks are ordered even though
>>>>>>>>> the tests might suggest otherwise).
>>>>>>>>>
>>>>>>>> I agree that your implementation of reader returns records in
>>>>>>>> order
>>>>>>>> - but
>>>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
>>>>>>>> the order in which records are produced by a BoundedReader - the
>>>>>>>> order produced by your reader is ignored, and when applying any
>>>>>>>> transforms to the
>>>>>>> PCollection
>>>>>>>> produced by TikaIO, it is impossible to recover the order in which
>>>>>>>> your reader returned the records.
>>>>>>>>
>>>>>>>> With that in mind, is PCollection<String>, containing individual
>>>>>>>> Tika-detected items, still the right API for representing the
>>>>>>>> result of parsing a large number of documents with Tika?
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The reason I did it was because I thought
>>>>>>>>>
>>>>>>>>> 1) it would make the individual data chunks available faster to
>>>>>>>>> the pipeline - the parser will continue working via the
>>>>>>>>> binary/video etc file while the data will already start flowing -
>>>>>>>>> I agree there should be some tests data available confirming it -
>>>>>>>>> but I'm positive at the moment this approach might yield some
>>>>>>>>> performance gains with the large sets. If the file is large, if
>>>>>>>>> it has the embedded attachments/videos to deal with, then it may
>>>>>>>>> be more effective not to get the Beam thread deal with it...
>>>>>>>>>
>>>>>>>>> As I said on the PR, this description contains unfounded and
>>>>>>>>> potentially
>>>>>>>> incorrect assumptions about how Beam runners execute (or may
>>>>>>>> execute in
>>>>>>> the
>>>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>>>>> correctly,
>>>>>>>> you might be assuming that:
>>>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>>>>> complete
>>>>>>>> before processing its outputs with downstream transforms
>>>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>>>>> *concurrently*
>>>>>>>> with downstream processing of its results
>>>>>>>> - Passing an element from one thread to another using a
>>>>>>>> BlockingQueue is free in terms of performance.
>>>>>>>> All of these are false at least in some runners, and I'm almost certain that in
>>>>>>>> reality, performance of this approach is worse than a ParDo in
>>>>>>> most
>>>>>>>> production runners.
>>>>>>>>
>>>>>>>> There are other disadvantages to this approach:
>>>>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>>>>> invisible
>>>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>>>>>> profiling capabilities, or the ability to get the current stack
>>>>>>>> trace for stuck elements, this approach would make the real
>>>>>>>> processing invisible to all of these capabilities, and a user
>>>>>>>> would only see that the bulk of the time is spent waiting for the
>>>>>>>> next element, but not *why* the next
>>>>>>> element
>>>>>>>> is taking long to compute.
>>>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>>>>>> invisible to Beam, will make it harder for runners to do
>>>>>>>> autoscaling, binpacking
>>>>>>> and
>>>>>>>> other resource management magic (how much of this runners actually
>>>>>>>> do is
>>>>>>> a
>>>>>>>> separate issue), because the runner will have no way of knowing
>>>>>>>> how much CPU/IO this particular transform is actually using - all
>>>>>>>> the processing happens in a thread about which the runner is
>>>>>>>> unaware.
>>>>>>>> - As far as I can tell, the code also hides exceptions that happen
>>>>>>>> in the Tika thread
>>>>>>>> - Adding the thread management makes the code much more complex,
>>>>>>>> easier
>>>>>>> to
>>>>>>>> introduce bugs, and harder for others to contribute
>>>>>>>>
>>>>>>>>
>>>>>>>>> 2) As I commented at the end of [2], having an option to
>>>>>>>>> concatenate the data chunks first before making them available to
>>>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
>>>>>>>>> introduce some synchronization issues (though not exactly sure
>>>>>>>>> yet)
>>>>>>>>>
>>>>>>>> What are these issues?
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> One of valid concerns there is that the reader is polling the
>>>>>>>>> internal queue so, in theory at least, and perhaps in some rare
>>>>>>>>> cases too, we may have a case where the max polling time has been
>>>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
>>>>>>>>> the file data. I think that it can be solved by either 2a)
>>>>>>>>> configuring the max polling time to a very large number which
>>>>>>>>> will never be reached for a practical case, or
>>>>>>>>> 2b) simply use a blocking queue without the time limits - in the
>>>>>>>>> worst case, if TikaParser spins and fails to report the end of
>>>>>>>>> the document, then Beam can heal itself if the pipeline blocks.
>>>>>>>>> I propose to follow 2b).
>>>>>>>>>
>>>>>>>> I agree that there should be no way to unintentionally configure
>>>>>>>> the transform in a way that will produce silent data loss. Another
>>>>>>>> reason for not having these tuning knobs is that it goes against
>>>>>>>> Beam's "no knobs"
>>>>>>>> philosophy, and that in most cases users have no way of figuring
>>>>>>>> out a
>>>>>>> good
>>>>>>>> value for tuning knobs except for manual experimentation, which is
>>>>>>>> extremely brittle and typically gets immediately obsoleted by
>>>>>>>> running on
>>>>>>> a
>>>>>>>> new dataset or updating a version of some of the involved
>>>>>>>> dependencies
>>>>>>> etc.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please let me know what you think.
>>>>>>>>> My plan so far is:
>>>>>>>>> 1) start addressing most of Eugene's comments which would require
>>>>>>>>> some minor TikaIO updates
>>>>>>>>> 2) work on removing the TikaSource internal code dealing with
>>>>>>>>> File patterns which I copied from TextIO at the next stage
>>>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
>>>>>>>>> users some time to try it with some real complex files and also
>>>>>>>>> decide if TikaIO can continue to be implemented as a
>>>>>>>>> BoundedSource/Reader or not
>>>>>>>>>
>>>>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>>>>
>>>>>>>> Yes, but I think we should start by discussing the anticipated use
>>>>>>>> cases
>>>>>>> of
>>>>>>>> TikaIO and designing an API for it based on those use cases; and
>>>>>>>> then see what's the best implementation for that particular API
>>>>>>>> and set of anticipated use cases.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks, Sergey
>>>>>>>>>
>>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>
>>
> 

RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
@Timothy: can you tell more about this RecursiveParserWrapper? Is this something that the user can configure by specifying the Parser on TikaIO if they so wish?

Not at the moment, we’d have to do some coding on our end or within Beam.  The format is a list of maps/dicts for each file.  Each map contains all of the metadata, with one key reserved for the content.  If a file has no attachments, the list has length 1; otherwise there’s a map for each embedded file.  Unlike our legacy xhtml output, this format maintains metadata for attachments.

The downside to this extract format is that it requires a full parse of the document and all data to be held in-memory before writing it.  On the other hand, while Tika tries to be streaming, and that was one of the critical early design goals, for some file formats, we simply have to parse the whole thing before we can have any output.

So, y, large files are a problem. :\

Example with purely made-up keys representing a PDF file containing an RTF attachment:
[
  {
    "Name": "container file",
    "Author": "Chris Mattmann",
    "Content": "Four score and seven years ago…",
    "Content-type": "application/pdf",
    …
  },
  {
    "Name": "embedded file1",
    "Author": "Nick Burch",
    "Content": "When in the course of human events…",
    "Content-type": "application/rtf"
  }
]
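
For Beam folks who haven't used it, the wrapper Tim describes is invoked roughly like this - a minimal sketch against the Tika 1.16 API (the handler type, write limit, and file name below are arbitrary illustrative choices, and this part of the API changed in later Tika releases):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.helpers.DefaultHandler;

// One parse call yields a List<Metadata>: the container file first, then one
// entry per embedded document, with the extracted text stored under the
// RecursiveParserWrapper.TIKA_CONTENT key of each Metadata.
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(
    new AutoDetectParser(),
    new BasicContentHandlerFactory(
        BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1 /* no write limit */));
try (InputStream is = Files.newInputStream(Paths.get("report.pdf"))) {
  // The supplied ContentHandler is ignored; the factory above decides the output form.
  wrapper.parse(is, new DefaultHandler(), new Metadata(), new ParseContext());
}
List<Metadata> perFileMetadata = wrapper.getMetadata();
// Serializing perFileMetadata as JSON gives the list-of-maps format shown above.
```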

From: Eugene Kirpichov [mailto:kirpichov@google.com]
Sent: Thursday, September 21, 2017 7:42 PM
To: Allison, Timothy B. <ta...@mitre.org>; dev@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi,
@Sergey:
- I already marked TikaIO @Experimental, so we can make changes.
- Yes, the String in KV<String, ParseResult> is the filename. I guess we could alternatively put it into ParseResult - don't have a strong opinion.

@Chris: unorderedness of Metadata would have helped if we extracted each Metadata item into a separate PCollection element, but that's not what we want to do (we want to have an element per document instead).

@Timothy: can you tell more about this RecursiveParserWrapper? Is this something that the user can configure by specifying the Parser on TikaIO if they so wish?

On Thu, Sep 21, 2017 at 2:23 PM Allison, Timothy B. <ta...@mitre.org>> wrote:
Like Sergey, it’ll take me some time to understand your recommendations.  Thank you!

On one small point:
>return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata }

For this option, I’d strongly encourage using the Json output from the RecursiveParserWrapper that contains metadata and content, and captures metadata even from embedded documents.

> However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea
Large documents are a problem, no doubt about it…

From: Eugene Kirpichov [mailto:kirpichov@google.com<ma...@google.com>]
Sent: Thursday, September 21, 2017 4:41 PM
To: Allison, Timothy B. <ta...@mitre.org>>; dev@beam.apache.org<ma...@beam.apache.org>
Cc: dev@tika.apache.org<ma...@tika.apache.org>
Subject: Re: TikaIO concerns

Thanks all for the discussion. It seems we have consensus that both within-document order and association with the original filename are necessary, but currently absent from TikaIO.

Association with original file:
Sergey - Beam does not automatically provide a way to associate an element with the file it originated from: automatically tracking data provenance is a known very hard research problem on which many papers have been written, and obvious solutions are very easy to break. See related discussion at https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E .

If you want the elements of your PCollection to contain additional information, you need the elements themselves to contain this information: the elements are self-contained and have no metadata associated with them (beyond the timestamp and windows, universal to the whole Beam model).

Order within a file:
The only way to have any kind of order within a PCollection is to have the elements of the PCollection contain something ordered, e.g. have a PCollection<List<Something>>, where each List is for one file [I'm assuming Tika, at a low level, works on a per-file basis?]. However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea. Because of this, I don't think the result of applying Tika to a single file can be encoded as a PCollection element.

Given both of these, I think it's not possible to create a general-purpose TikaIO transform that will be better than manual invocation of Tika as a DoFn on the result of FileIO.readMatches().
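
For concreteness, that manual invocation might look like the following sketch (hedged: this uses the Beam FileIO and Tika 1.16 APIs as I understand them, and the filename-keyed KV output is just one possible shape, not a settled design):

```java
import java.io.InputStream;
import java.nio.channels.Channels;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Fragment: assumes an existing Pipeline p.
// One output element per document, keyed by filename; order within the
// document is preserved because the whole extract is a single String.
PCollection<KV<String, String>> contents =
    p.apply(FileIO.match().filepattern("/path/to/docs/*"))
     .apply(FileIO.readMatches())
     .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
       @ProcessElement
       public void process(ProcessContext c) throws Exception {
         FileIO.ReadableFile f = c.element();
         try (InputStream is = Channels.newInputStream(f.open())) {
           BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
           Metadata metadata = new Metadata();
           new AutoDetectParser().parse(is, handler, metadata, new ParseContext());
           c.output(KV.of(f.getMetadata().resourceId().toString(),
                          handler.toString()));
         }
       }
     }));
```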

However, looking at the examples at https://tika.apache.org/1.16/examples.html - almost all of the examples involve extracting a single String from each document. This use case, with the assumption that individual documents are small enough, can certainly be simplified and TikaIO could be a facade for doing just this.

E.g. TikaIO could:
- take as input a PCollection<ReadableFile>
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata }
- be configured by: a Parser (it implements Serializable so can be specified at pipeline construction time) and a ContentHandler whose toString() will go into "content". ContentHandler does not implement Serializable, so you can not specify it at construction time - however, you can let the user specify either its class (if it's a simple handler like a BodyContentHandler) or specify a lambda for creating the handler (SerializableFunction<Void, ContentHandler>), and potentially you can have a simpler facade for Tika.parseAsString() - e.g. call it TikaIO.parseAllAsStrings().

Example usage would look like:

  PCollection<KV<String, ParseResult>> parseResults = p.apply(FileIO.match().filepattern(...))
    .apply(FileIO.readMatches())
    .apply(TikaIO.parseAllAsStrings())

or:

    .apply(TikaIO.parseAll()
        .withParser(new AutoDetectParser())
        .withContentHandler(() -> new BodyContentHandler(new ToXMLContentHandler())))

You could also have shorthands for letting the user avoid using FileIO directly in simple cases, for example:
    p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and you'll be able to share the code between parseAll and regular parse.
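
A rough sketch of the two pieces named above - the ParseResult value class and the serializable handler factory - under the assumption that Tika's Metadata is Serializable (it is in Tika 1.16); the field and variable names are illustrative, not a settled API:

```java
import java.io.Serializable;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;

// Value type produced per document: extracted text plus Tika metadata.
public class ParseResult implements Serializable {
  public final String content;   // the ContentHandler's toString() output
  public final Metadata metadata;

  public ParseResult(String content, Metadata metadata) {
    this.content = content;
    this.metadata = metadata;
  }
}

// ContentHandler is not Serializable, so the transform would be configured
// with a factory that each worker invokes once per document:
SerializableFunction<Void, ContentHandler> handlerFactory =
    v -> new BodyContentHandler(new ToXMLContentHandler());
```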

On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sb...@gmail.com>> wrote:
Hi Tim
On 21/09/17 14:33, Allison, Timothy B. wrote:
> Thank you, Sergey.
>
> My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a chance to work with it yet.
>
>  From my perspective, if I understand this thread (and I may not!), getting unordered text from _a given file_ is a non-starter for most applications.  The implementation needs to guarantee order per file, and the user has to be able to link the "extract" back to a unique identifier for the document.  If the current implementation doesn't do those things, we need to change it, IMHO.
>
Right now the Tika-related reader does not associate a given text fragment
with the file name, so a function looking at some text and trying to
find where it came from won't be able to do so.

So I asked how to do it in Beam, how to attach some context to the given
piece of data. I hope it can be done and if not - then perhaps some
improvement can be applied.

Re the unordered text - yes - this is what we currently have with Beam +
TikaIO :-).

The use case I referred to earlier in this thread (upload PDFs, save
the possibly unordered text to Lucene with the file name 'attached', let
users search for the files containing some words or phrases - this works
OK given that I can see the PDF parser, for example, reporting lines) can be
supported with the current TikaIO (provided we find a way to 'attach'
a file name to the flow).

I see though supporting the total ordering can be a big deal in other
cases. Eugene, can you please explain how it can be done, is it
achievable in principle, without the users having to do some custom
coding ?

> To the question of -- why is this in Beam at all; why don't we let users call it if they want it?...
>
> No matter how much we do to Tika, it will behave badly sometimes -- permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks using Beam -- folks likely with large batches of unruly/noisy documents -- are more likely to run into these problems than your average couple-of-thousand-docs-from-our-own-company user. So, if there are things we can do in Beam to prevent developers around the world from having to reinvent the wheel for defenses against these problems, then I'd be enormously grateful if we could put Tika into Beam.  That means:
>
> 1) a process-level timeout (because you can't actually kill a thread in Java)
> 2) a process-level restart on OOM
> 3) avoid trying to reprocess a badly behaving document
>
> If Beam automatically handles those problems, then I'd say, y, let users write their own code.  If there is so much as a single configuration knob (and it sounds like Beam is against complex configuration...yay!) to get that working in Beam, then I'd say, please integrate Tika into Beam.  From a safety perspective, it is critical to keep the extraction process entirely separate (jvm, vm, m, rack, data center!) from the transformation+loading steps.  IMHO, very few devs realize this because Tika works well lots of the time...which is why it is critical for us to make it easy for people to get it right all of the time.
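
[Tim's point 1) can be sketched with nothing but the JDK: run the parse in a child process and destroy it forcibly on timeout. A minimal illustration of the idea, not Beam- or Tika-specific - the class and method names are made up for this example:]

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;

/** Runs an external command, force-killing it if it exceeds the timeout. */
public class ProcessTimeoutGuard {
  /** Returns true if the process finished in time with exit code 0. */
  public static boolean runWithTimeout(ProcessBuilder pb, Duration timeout)
      throws Exception {
    Process process = pb.start();
    if (!process.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
      process.destroyForcibly(); // the JVM-level analogue of kill -9
      process.waitFor();         // reap the killed child
      return false;              // timed out: treat the document as poison
    }
    return process.exitValue() == 0;
  }
}
```

[A DoFn could delegate each document to such a guarded child process; that also contains OOMs (point 2), since only the child JVM dies.]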
>
> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch mode first in one jvm, and then I kick off another process to do transform/loading into Lucene/Solr from the .json files that Tika generates for each input file.  If I were to scale up, I'd want to maintain this complete separation of steps.
>
> Apologies if I've derailed the conversation or misunderstood this thread.
>
Major thanks for your input :-)

Cheers, Sergey

> Cheers,
>
>                 Tim
>
> -----Original Message-----
> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com<ma...@gmail.com>]
> Sent: Thursday, September 21, 2017 9:07 AM
> To: dev@beam.apache.org<ma...@beam.apache.org>
> Cc: Allison, Timothy B. <ta...@mitre.org>>
> Subject: Re: TikaIO concerns
>
> Hi All
>
> Please welcome Tim, one of Apache Tika leads and practitioners.
>
> Tim, thanks for joining in :-). If you have some great Apache Tika stories to share (preferably cases where the ordering in which Tika-produced data were consumed did not
> really matter) then please do so :-).
>
> At the moment, even though Tika ContentHandler will emit the ordered data, the Beam runtime will have no guarantees that the downstream pipeline components will see the data coming in the right order.
>
> (FYI, I understand from the earlier comments that the total ordering is also achievable but would require the extra API support)
>
> Other comments would be welcome too
>
> Thanks, Sergey
>
> On 21/09/17 10:55, Sergey Beryozkin wrote:
>> I noticed that the PDF and ODT parsers actually split by lines, not
>> individual words, and I'm nearly 100% sure I saw Tika reporting individual
>> lines when it was parsing the text files. The 'min text length'
>> feature can help with reporting several lines at a time, etc...
>>
>> I'm working with this PDF all the time:
>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
>>
>> try it too if you get a chance.
>>
>> (and I can imagine not all PDFs/etc representing the 'story' but can
>> be for ex a log-like content too)
>>
>> That said, I don't know how a parser for the format N will behave, it
>> depends on the individual parsers.
>>
>> IMHO it's an equal candidate alongside Text-based bounded IOs...
>>
>> I'd like to know though how to make a file name available to the
>> pipeline which is working with the current text fragment ?
>>
>> Going to try and do some measurements and compare the sync vs async
>> parsing modes...
>>
>> Asked the Tika team to support with some more examples...
>>
>> Cheers, Sergey
>> On 20/09/17 22:17, Sergey Beryozkin wrote:
>>> Hi,
>>>
>>> thanks for the explanations,
>>>
>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>>> Hi!
>>>>
>>>> TextIO returns an unordered soup of lines contained in all files you
>>>> ask it to read. People usually use TextIO for reading files where 1
>>>> line corresponds to 1 independent data element, e.g. a log entry, or
>>>> a row of a CSV file - so discarding order is ok.
>>> Just a side note, I'd probably want that to be ordered, though I guess
>>> it depends...
>>>> However, there is a number of cases where TextIO is a poor fit:
>>>> - Cases where discarding order is not ok - e.g. if you're doing
>>>> natural language processing and the text files contain actual prose,
>>>> where you need to process a file as a whole. TextIO can't do that.
>>>> - Cases where you need to remember which file each element came
>>>> from, e.g.
>>>> if you're creating a search index for the files: TextIO can't do
>>>> this either.
>>>>
>>>> Both of these issues have been raised in the past against TextIO;
>>>> however it seems that the overwhelming majority of users of TextIO
>>>> use it for logs or CSV files or alike, so solving these issues has
>>>> not been a priority.
>>>> Currently they are solved in a general form via FileIO.read() which
>>>> gives you access to reading a full file yourself - people who want
>>>> more flexibility will be able to use standard Java text-parsing
>>>> utilities on a ReadableFile, without involving TextIO.
>>>>
>>>> Same applies for XmlIO: it is specifically designed for the narrow
>>>> use case where the files contain independent data entries, so
>>>> returning an unordered soup of them, with no association to the
>>>> original file, is the user's intention. XmlIO will not work for
>>>> processing more complex XML files that are not simply a sequence of
>>>> entries with the same tag, and it also does not remember the
>>>> original filename.
>>>>
>>>
>>> OK...
>>>
>>>> However, if my understanding of Tika use cases is correct, it is
>>>> mainly used for extracting content from complex file formats - for
>>>> example, extracting text and images from PDF files or Word
>>>> documents. I believe this is the main difference between it and
>>>> TextIO - people usually use Tika for complex use cases where the
>>>> "unordered soup of stuff" abstraction is not useful.
>>>>
>>>> My suspicion about this is confirmed by the fact that the crux of
>>>> the Tika API is ContentHandler
>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
>>>> html?is-external=true
>>>>
>>>> whose
>>>> documentation says "The order of events in this interface is very
>>>> important, and mirrors the order of information in the document itself."
>>> All that says is that a (Tika) ContentHandler will be a true SAX
>>> ContentHandler...
>>>>
>>>> Let me give a few examples of what I think is possible with the raw
>>>> Tika API, but I think is not currently possible with TikaIO - please
>>>> correct me where I'm wrong, because I'm not particularly familiar
>>>> with Tika and am judging just based on what I read about it.
>>>> - User has 100,000 Word documents and wants to convert each of them
>>>> to text files for future natural language processing.
>>>> - User has 100,000 PDF files with financial statements, each
>>>> containing a bunch of unrelated text and - the main content - a list
>>>> of transactions in PDF tables. User wants to extract each
>>>> transaction as a PCollection element, discarding the unrelated text.
>>>> - User has 100,000 PDF files with scientific papers, and wants to
>>>> extract text from them, somehow parse author and affiliation from
>>>> the text, and compute statistics of topics and terminology usage by
>>>> author name and affiliation.
>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>>>> observing a location over time: they want to extract metadata from
>>>> each image using Tika, analyze the images themselves using some
>>>> other library, and detect anomalies in the overall appearance of the
>>>> location over time as seen from multiple cameras.
>>>> I believe all of these cases can not be solved with TikaIO because
>>>> the resulting PCollection<String> contains no information about
>>>> which String comes from which document and about the order in which
>>>> they appear in the document.
>>> These are good use cases, thanks... I thought you were talking
>>> about the unordered soup of data produced by TikaIO (and its friends
>>> TextIO and alike :-)).
>>> Putting the ordered vs unordered question aside for a sec, why
>>> exactly a Tika Reader can not make the name of the file it's
>>> currently reading from available to the pipeline, as some Beam pipeline metadata piece ?
>>> Surely it can be possible with Beam ? If not then I would be surprised...
>>>
>>>>
>>>> I am, honestly, struggling to think of a case where I would want to
>>>> use Tika, but where I *would* be ok with getting an unordered soup
>>>> of strings.
>>>> So some examples would be very helpful.
>>>>
>>> Yes. I'll ask Tika developers to help with some examples, but I'll
>>> give one example where it did not matter to us in what order
>>> Tika-produced data were available to the downstream layer.
>>>
>>> It's a demo the Apache CXF colleague of mine showed at one of Apache
>>> Con NAs, and we had a happy audience:
>>>
>>> https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search
>>>
>>>
>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
>>> into Lucene. We associate a file name with the indexed content and
>>> then let users find a list of PDF files which contain a given word or
>>> few words, details are here
>>> https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L131
>>>
>>>
>>> I'd say even more involved search engines would not mind supporting a
>>> case like that :-)
>>>
>>> Now there we process one file at a time, and I understand now that
>>> with TikaIO and N files it's all over the place really as far as the
>>> ordering is concerned, which file it's coming from, etc. That's why
>>> TikaReader must be able to associate the file name with a given piece
>>> of text it's making available to the pipeline.
>>>
>>> I'd be happy to support the ParDo way of linking Tika with Beam.
>>> If it makes things simpler then it would be good, I've just no idea
>>> at the moment how to start the pipeline without using a
>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
>>> earlier - how can one avoid it with ParDo when implementing a 'min
>>> len chunk' feature, where the ParDo would have to concatenate several
>>> SAX data pieces first before making a single composite piece available to the pipeline ?
>>>
>>>
>>>> Another way to state it: currently, if I wanted to solve all of the
>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
>>>> API myself on the resulting ReadableFile. How can we make TikaIO
>>>> provide a usability improvement over such usage?
>>>>
>>>
>>>
>>> If you are actually asking, does it really make sense for Beam to
>>> ship Tika related code, given that users can just do it themselves,
>>> I'm not sure.
>>>
>>> IMHO it always works better if users have to provide just a few config
>>> options to an integral part of the framework and see things happening.
>>> It will bring more users.
>>>
>>> Whether the current Tika code (refactored or not) stays with Beam or
>>> not - I'll let you and the team decide; believe it or not I was
>>> seriously contemplating at the last moment to make it all part of the
>>> Tika project itself and have a bit more flexibility over there with
>>> tweaking things, but now that it is in the Beam snapshot - I don't
>>> know - it's not my decision...
>>>
>>>> I am confused by your other comment - "Does the ordering matter ?
>>>> Perhaps
>>>> for some cases it does, and for some it does not. May be it makes
>>>> sense to support running TikaIO as both the bounded reader/source
>>>> and ParDo, with getting the common code reused." - because using
>>>> BoundedReader or ParDo is not related to the ordering issue, only to
>>>> the issue of asynchronous reading and complexity of implementation.
>>>> The resulting PCollection will be unordered either way - this needs
>>>> to be solved separately by providing a different API.
>>> Right, I see now - so ParDo is not about making Tika-reported data
>>> available to the downstream pipeline components in order, only about
>>> the simpler implementation.
>>> Association with the file should be possible, I hope, and I understand
>>> it would also be possible to optionally make the data come out in
>>> order as well...
>>>
>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
>>> let me double check: should we still give some thought to the
>>> possible performance benefit of the current approach ? As I said, I
>>> can easily get rid of all that polling code and use a simple BlockingQueue.
>>>
>>> Cheers, Sergey
>>>>
>>>> Thanks.
>>>>
>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
>>>> <sb...@gmail.com>>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Glad TikaIO is getting some serious attention :-), I believe one thing
>>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>>
>>>>> Before trying to reply online, I'd like to state that my main
>>>>> assumption is that TikaIO (as far as the read side is concerned) is
>>>>> no different to Text, XML or similar bounded reader components.
>>>>>
>>>>> I have to admit I don't understand your questions about TikaIO
>>>>> use cases.
>>>>>
>>>>> What are the Text input or XML input use-cases ? These use cases
>>>>> are Tika input cases as well; the only difference is that Tika can
>>>>> not split the individual file into a sequence of sources, etc.
>>>>>
>>>>> TextIO can read from the plain text files (possibly zipped), XML -
>>>>> optimized around reading from the XML files, and I thought I made
>>>>> it clear (and it is a known fact anyway) Tika was about reading
>>>>> basically from any file format.
>>>>>
>>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>>
>>>>> Sergey
>>>>>
>>>>>
>>>>>
>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Replies inline.
>>>>>>
>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
>>>>>> <sb...@gmail.com>>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All
>>>>>>>
>>>>>>> This is my first post to the dev list, I work for Talend, I'm a
>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
>>>>>>> great to try and link both projects together, which led me to
>>>>>>> opening [1] where I typed some early thoughts, followed by PR
>>>>>>> [2].
>>>>>>>
>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
>>>>>>> newer review comments from Eugene pending, so I'd like to
>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
>>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>>
>>>>>>> Apache Tika Parsers report the text content in chunks, via
>>>>>>> SaxParser events. It's not possible with Tika to take a file and
>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
>>>>>>> by line, the only way is to handle the SAXParser callbacks which
>>>>>>> report the data chunks.
>>>>>>> Some
>>>>>>> parsers may report complete lines, some individual words, and
>>>>>>> some are able to report the data only after they completely
>>>>>>> parse the document.
>>>>>>> All depends on the data format.
>>>>>>>
>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
>>>>>>> to parse the files; Beam threads will only collect the data from
>>>>>>> the internal queue where the internal TikaReader's thread will
>>>>>>> put the data into (note the data chunks are ordered even though
>>>>>>> the tests might suggest otherwise).
>>>>>>>
>>>>>> I agree that your implementation of the reader returns records in
>>>>>> order - but Beam PCollections are not ordered. Nothing in Beam cares about
>>>>>> the order in which records are produced by a BoundedReader - the
>>>>>> order produced by your reader is ignored, and when applying any
>>>>>> transforms to the
>>>>> PCollection
>>>>>> produced by TikaIO, it is impossible to recover the order in which
>>>>>> your reader returned the records.
>>>>>>
>>>>>> With that in mind, is PCollection<String>, containing individual
>>>>>> Tika-detected items, still the right API for representing the
>>>>>> result of parsing a large number of documents with Tika?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> The reason I did it was because I thought
>>>>>>>
>>>>>>> 1) it would make the individual data chunks available faster to
>>>>>>> the pipeline - the parser will continue working via the
>>>>>>> binary/video etc file while the data will already start flowing -
>>>>>>> I agree there should be some tests data available confirming it -
>>>>>>> but I'm positive at the moment this approach might yield some
>>>>>>> performance gains with the large sets. If the file is large, if
>>>>>>> it has the embedded attachments/videos to deal with, then it may
>>>>>>> be more effective not to get the Beam thread deal with it...
>>>>>>>
>>>>>>> As I said on the PR, this description contains unfounded and
>>>>>>> potentially
>>>>>> incorrect assumptions about how Beam runners execute (or may
>>>>>> execute in
>>>>> the
>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>>> correctly,
>>>>>> you might be assuming that:
>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>>> complete
>>>>>> before processing its outputs with downstream transforms
>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>>> *concurrently*
>>>>>> with downstream processing of its results
>>>>>> - Passing an element from one thread to another using a
>>>>>> BlockingQueue is free in terms of performance.
>>>>>> All of these are false at least in some runners, and I'm almost certain that in
>>>>>> reality, performance of this approach is worse than a ParDo in
>>>>> most
>>>>>> production runners.
>>>>>>
>>>>>> There are other disadvantages to this approach:
>>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>>> invisible
>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>>>> profiling capabilities, or the ability to get the current stack
>>>>>> trace for stuck elements, this approach would make the real
>>>>>> processing invisible to all of these capabilities, and a user
>>>>>> would only see that the bulk of the time is spent waiting for the
>>>>>> next element, but not *why* the next
>>>>> element
>>>>>> is taking long to compute.
>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>>>> invisible to Beam, will make it harder for runners to do
>>>>>> autoscaling, binpacking
>>>>> and
>>>>>> other resource management magic (how much of this runners actually
>>>>>> do is
>>>>> a
>>>>>> separate issue), because the runner will have no way of knowing
>>>>>> how much CPU/IO this particular transform is actually using - all
>>>>>> the processing happens in a thread about which the runner is
>>>>>> unaware.
>>>>>> - As far as I can tell, the code also hides exceptions that happen
>>>>>> in the Tika thread
>>>>>> - Adding the thread management makes the code much more complex,
>>>>>> easier
>>>>> to
>>>>>> introduce bugs, and harder for others to contribute
>>>>>>
>>>>>>
>>>>>>> 2) As I commented at the end of [2], having an option to
>>>>>>> concatenate the data chunks first before making them available to
>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
>>>>>>> introduce some synchronization issues (though not exactly sure
>>>>>>> yet)
>>>>>>>
>>>>>> What are these issues?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> One of valid concerns there is that the reader is polling the
>>>>>>> internal queue so, in theory at least, and perhaps in some rare
>>>>>>> cases too, we may have a case where the max polling time has been
>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
>>>>>>> the file data. I think that it can be solved by either 2a)
>>>>>>> configuring the max polling time to a very large number which
>>>>>>> will never be reached for a practical case, or
>>>>>>> 2b) simply use a blocking queue without the time limits - in the
>>>>>>> worst case, if TikaParser spins and fails to report the end of
>>>>>>> the document, then Beam can heal itself if the pipeline blocks.
>>>>>>> I propose to follow 2b).
>>>>>>>
>>>>>> I agree that there should be no way to unintentionally configure
>>>>>> the transform in a way that will produce silent data loss. Another
>>>>>> reason for not having these tuning knobs is that it goes against
>>>>>> Beam's "no knobs"
>>>>>> philosophy, and that in most cases users have no way of figuring
>>>>>> out a
>>>>> good
>>>>>> value for tuning knobs except for manual experimentation, which is
>>>>>> extremely brittle and typically gets immediately obsoleted by
>>>>>> running on
>>>>> a
>>>>>> new dataset or updating a version of some of the involved
>>>>>> dependencies
>>>>> etc.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Please let me know what you think.
>>>>>>> My plan so far is:
>>>>>>> 1) start addressing most of Eugene's comments which would require
>>>>>>> some minor TikaIO updates
>>>>>>> 2) work on removing the TikaSource internal code dealing with
>>>>>>> File patterns which I copied from TextIO at the next stage
>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
>>>>>>> users some time to try it with some real complex files and also
>>>>>>> decide if TikaIO can continue implemented as a
>>>>>>> BoundedSource/Reader or not
>>>>>>>
>>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>>
>>>>>> Yes, but I think we should start by discussing the anticipated use
>>>>>> cases
>>>>> of
>>>>>> TikaIO and designing an API for it based on those use cases; and
>>>>>> then see what's the best implementation for that particular API
>>>>>> and set of anticipated use cases.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks, Sergey
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>

Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi,
On 22/09/17 00:42, Eugene Kirpichov wrote:
> Hi,
> @Sergey:
> - I already marked TikaIO @Experimental, so we can make changes.
OK, thanks
> - Yes, the String in KV<String, ParseResult> is the filename. I guess we
> could alternatively put it into ParseResult - don't have a strong opinion.
> 
Sure. If you don't mind, the first thing I'd like to try, hopefully 
early next week, is to introduce ParseResult into the existing code.
I know it won't 'fix' the ordering issues, but starting with a complete 
re-write would be a steep curve for me, so I'd first experiment with 
the idea (which I like very much) of wrapping several related pieces 
(content fragment, metadata, and the doc id/file name) into ParseResult.
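
To make sure I'm thinking of the same shape, here is a rough sketch of 
what I have in mind - the class and field names below are only my 
guesses, not the final TikaIO API:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only: names are my own guesses, not the final TikaIO API.
public class ParseResultSketch {
    private final String fileName;              // the doc id / file the fragment came from
    private final String content;               // a single content fragment
    private final Map<String, String> metadata; // Tika metadata known so far

    public ParseResultSketch(String fileName, String content, Map<String, String> metadata) {
        this.fileName = fileName;
        this.content = content;
        // defensive copy so a ParseResult element stays immutable once emitted
        this.metadata = Collections.unmodifiableMap(new LinkedHashMap<>(metadata));
    }

    public String getFileName() { return fileName; }
    public String getContent() { return content; }
    public Map<String, String> getMetadata() { return metadata; }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("Content-Type", "application/pdf");
        ParseResultSketch result = new ParseResultSketch("report.pdf", "a text fragment", meta);
        System.out.println(result.getFileName() + ": " + result.getContent());
    }
}
```

That way a downstream function always knows which file a fragment came 
from, even though the fragments themselves arrive unordered.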

By the way, reporting Tika file (output) metadata with every ParseResult 
instance will work much better than I first thought. I assumed it 
wouldn't, because Tika does not issue a callback when it populates the 
file metadata; it only does that for the actual content. However, it 
does update the Metadata instance passed to it as it keeps parsing and 
finding new metadata, so the metadata pieces will become available to 
the pipeline as soon as they are found. Though Tika (1.17 ?) may need to 
ensure its Metadata is backed by a concurrent map for this approach to 
work; not sure yet...
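
The visibility behaviour I'm relying on can be sketched with plain JDK 
types - this is only a stand-in for Tika's Metadata, assuming it were 
backed by a concurrent map:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class MetadataVisibilitySketch {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for Tika's Metadata, assuming a concurrent-map backing.
        Map<String, String> metadata = new ConcurrentHashMap<>();
        CountDownLatch firstEntry = new CountDownLatch(1);

        // "Parser" thread: keeps finding metadata while it parses.
        Thread parser = new Thread(() -> {
            metadata.put("Content-Type", "application/pdf");
            firstEntry.countDown();            // signal that an entry was found
            metadata.put("Author", "unknown"); // found later in the parse
        });
        parser.start();

        // Consumer sees the first entry as soon as it is published,
        // without waiting for the whole parse to finish.
        firstEntry.await();
        System.out.println("seen early: " + metadata.get("Content-Type"));
        parser.join();
        System.out.println("seen after parse: " + metadata.get("Author"));
    }
}
```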


> @Chris: unorderedness of Metadata would have helped if we extracted each
> Metadata item into a separate PCollection element, but that's not what we
> want to do (we want to have an element per document instead).
> 
> @Timothy: can you tell more about this RecursiveParserWrapper? Is this
> something that the user can configure by specifying the Parser on TikaIO if
> they so wish?
> 


As a general note, the Metadata passed to the top-level parser acts as a 
metadata sink for the file (and its embedded attachments), but also as a 
'helper' to the parser. Right now TikaIO uses it to pass a media type 
hint if one is available (to help the auto-detect parser select the 
correct parser faster), and also a parser to be used for the embedded 
attachments (I did it after Tim hinted about it earlier on...).

Not sure if RecursiveParserWrapper can act as a top-level parser or 
needs to be passed as a metadata property to AutoDetectParser; Tim will 
know :-)

Thanks, Sergey

> On Thu, Sep 21, 2017 at 2:23 PM Allison, Timothy B. <ta...@mitre.org>
> wrote:
> 
>> Like Sergey, it’ll take me some time to understand your recommendations.
>> Thank you!
>>
>>
>>
>> On one small point:
>>
>>> return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
>> is a class with properties { String content, Metadata metadata }
>>
>>
>>
>> For this option, I’d strongly encourage using the Json output from the
>> RecursiveParserWrapper that contains metadata and content, and captures
>> metadata even from embedded documents.
>>
>>
>>
>>> However, since TikaIO can be applied to very large files, this could
>> produce very large elements, which is a bad idea
>>
>> Large documents are a problem, no doubt about it…
>>
>>
>>
>> *From:* Eugene Kirpichov [mailto:kirpichov@google.com]
>> *Sent:* Thursday, September 21, 2017 4:41 PM
>> *To:* Allison, Timothy B. <ta...@mitre.org>; dev@beam.apache.org
>> *Cc:* dev@tika.apache.org
>> *Subject:* Re: TikaIO concerns
>>
>>
>>
>> Thanks all for the discussion. It seems we have consensus that both
>> within-document order and association with the original filename are
>> necessary, but currently absent from TikaIO.
>>
>>
>>
>> *Association with original file:*
>>
>> Sergey - Beam does not *automatically* provide a way to associate an
>> element with the file it originated from: automatically tracking data
>> provenance is a known very hard research problem on which many papers have
>> been written, and obvious solutions are very easy to break. See related
>> discussion at
>> https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
>>   .
>>
>>
>>
>> If you want the elements of your PCollection to contain additional
>> information, you need the elements themselves to contain this information:
>> the elements are self-contained and have no metadata associated with them
>> (beyond the timestamp and windows, universal to the whole Beam model).
>>
>>
>>
>> *Order within a file:*
>>
>> The only way to have any kind of order within a PCollection is to have the
>> elements of the PCollection contain something ordered, e.g. have a
>> PCollection<List<Something>>, where each List is for one file [I'm assuming
>> Tika, at a low level, works on a per-file basis?]. However, since TikaIO
>> can be applied to very large files, this could produce very large elements,
>> which is a bad idea. Because of this, I don't think the result of applying
>> Tika to a single file can be encoded as a PCollection element.
>>
>>
>>
>> Given both of these, I think that it's not possible to create a
>> *general-purpose* TikaIO transform that will be better than manual
>> invocation of Tika as a DoFn on the result of FileIO.readMatches().
>>
>>
>>
>> However, looking at the examples at
>> https://tika.apache.org/1.16/examples.html - almost all of the examples
>> involve extracting a single String from each document. This use case, with
>> the assumption that individual documents are small enough, can certainly be
>> simplified and TikaIO could be a facade for doing just this.
>>
>>
>>
>> E.g. TikaIO could:
>>
>> - take as input a PCollection<ReadableFile>
>>
>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
>> is a class with properties { String content, Metadata metadata }
>>
>> - be configured by: a Parser (it implements Serializable so can be
>> specified at pipeline construction time) and a ContentHandler whose
>> toString() will go into "content". ContentHandler does not implement
>> Serializable, so you can not specify it at construction time - however, you
>> can let the user specify either its class (if it's a simple handler like a
>> BodyContentHandler) or specify a lambda for creating the handler
>> (SerializableFunction<Void, ContentHandler>), and potentially you can have
>> a simpler facade for Tika.parseAsString() - e.g. call it
>> TikaIO.parseAllAsStrings().
>>
>>
>>
>> Example usage would look like:
>>
>>
>>
>>    PCollection<KV<String, ParseResult>> parseResults =
>> p.apply(FileIO.match().filepattern(...))
>>
>>      .apply(FileIO.readMatches())
>>
>>      .apply(TikaIO.parseAllAsStrings())
>>
>>
>>
>> or:
>>
>>
>>
>>      .apply(TikaIO.parseAll()
>>
>>          .withParser(new AutoDetectParser())
>>
>>          .withContentHandler(() -> new BodyContentHandler(new
>> ToXMLContentHandler())))
>>
>>
>>
>> You could also have shorthands for letting the user avoid using FileIO
>> directly in simple cases, for example:
>>
>>      p.apply(TikaIO.parseAsStrings().from(filepattern))
>>
>>
>>
>> This would of course be implemented as a ParDo or even MapElements, and
>> you'll be able to share the code between parseAll and regular parse.
>>
>>
>>
>> On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sb...@gmail.com>
>> wrote:
>>
>> Hi Tim
>> On 21/09/17 14:33, Allison, Timothy B. wrote:
>>> Thank you, Sergey.
>>>
>>> My knowledge of Apache Beam is limited -- I saw Davor and
>> Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally
>> impressed, but I haven't had a chance to work with it yet.
>>>
>>>   From my perspective, if I understand this thread (and I may not!),
>> getting unordered text from _a given file_ is a non-starter for most
>> applications.  The implementation needs to guarantee order per file, and
>> the user has to be able to link the "extract" back to a unique identifier
>> for the document.  If the current implementation doesn't do those things,
>> we need to change it, IMHO.
>>>
>> Right now the Tika-related reader does not associate a given text fragment
>> with the file name, so a function looking at some text and trying to
>> find where it came from won't be able to do so.
>>
>> So I asked how to do it in Beam, how to attach some context to the given
>> piece of data. I hope it can be done and if not - then perhaps some
>> improvement can be applied.
>>
>> Re the unordered text - yes - this is what we currently have with Beam +
>> TikaIO :-).
>>
>> The use case I referred to earlier in this thread (upload PDFs, save
>> the possibly unordered text to Lucene with the file name 'attached',
>> and let users search for the files containing some words or phrases -
>> this works OK given that I can see the PDF parser, for example,
>> reporting the lines) can be supported with the current TikaIO
>> (provided we find a way to 'attach' a file name to the flow).
>>
>> I see, though, that supporting total ordering can be a big deal in other
>> cases. Eugene, can you please explain how it can be done, is it
>> achievable in principle, without the users having to do some custom
>> coding ?
>>
>>> To the question of -- why is this in Beam at all; why don't we let users
>> call it if they want it?...
>>>
>>> No matter how much we do to Tika, it will behave badly sometimes --
>> permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks
>> using Beam -- folks likely with large batches of unruly/noisy documents --
>> are more likely to run into these problems than your average
>> couple-of-thousand-docs-from-our-own-company user. So, if there are things
>> we can do in Beam to prevent developers around the world from having to
>> reinvent the wheel for defenses against these problems, then I'd be
>> enormously grateful if we could put Tika into Beam.  That means:
>>>
>>> 1) a process-level timeout (because you can't actually kill a thread in
>> Java)
>>> 2) a process-level restart on OOM
>>> 3) avoid trying to reprocess a badly behaving document
>>>
>>> If Beam automatically handles those problems, then I'd say, y, let users
>> write their own code.  If there is so much as a single configuration knob
>> (and it sounds like Beam is against complex configuration...yay!) to get
>> that working in Beam, then I'd say, please integrate Tika into Beam.  From
>> a safety perspective, it is critical to keep the extraction process
>> entirely separate (jvm, vm, m, rack, data center!) from the
>> transformation+loading steps.  IMHO, very few devs realize this because
>> Tika works well lots of the time...which is why it is critical for us to
>> make it easy for people to get it right all of the time.
>>>
>>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch
>> mode first in one jvm, and then I kick off another process to do
>> transform/loading into Lucene/Solr from the .json files that Tika generates
>> for each input file.  If I were to scale up, I'd want to maintain this
>> complete separation of steps.
>>>
>>> Apologies if I've derailed the conversation or misunderstood this thread.
>>>
>> Major thanks for your input :-)
>>
>> Cheers, Sergey
>>
>>> Cheers,
>>>
>>>                  Tim
>>>
>>> -----Original Message-----
>>> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
>>> Sent: Thursday, September 21, 2017 9:07 AM
>>> To: dev@beam.apache.org
>>> Cc: Allison, Timothy B. <ta...@mitre.org>
>>> Subject: Re: TikaIO concerns
>>>
>>> Hi All
>>>
>>> Please welcome Tim, one of Apache Tika leads and practitioners.
>>>
>>> Tim, thanks for joining in :-). If you have some great Apache Tika
>> stories to share (preferably involving the cases where it did not really
>> matter the ordering in which Tika-produced data were dealt with by the
>>> consumers) then please do so :-).
>>>
>>> At the moment, even though Tika ContentHandler will emit the ordered
>> data, the Beam runtime will have no guarantees that the downstream pipeline
>> components will see the data coming in the right order.
>>>
>>> (FYI, I understand from the earlier comments that the total ordering is
>> also achievable but would require the extra API support)
>>>
>>> Other comments would be welcome too
>>>
>>> Thanks, Sergey
>>>
>>> On 21/09/17 10:55, Sergey Beryozkin wrote:
>>>> I noticed that the PDF and ODT parsers actually split by lines, not
>>>> individual words, and I'm nearly 100% sure I saw Tika reporting
>>>> individual lines when it was parsing text files. The 'min text length'
>>>> feature can help with reporting several lines at a time, etc...
>>>>
>>>> I'm working with this PDF all the time:
>>>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
>>>>
>>>> try it too if you get a chance.
>>>>
>>>> (and I can imagine not all PDFs etc. represent a 'story' - some can
>>>> be, for example, log-like content too)
>>>>
>>>> That said, I don't know how a parser for the format N will behave, it
>>>> depends on the individual parsers.
>>>>
>>>> IMHO it's an equal candidate alongside Text-based bounded IOs...
>>>>
>>>> I'd like to know though how to make a file name available to the
>>>> pipeline which is working with the current text fragment ?
>>>>
>>>> Going to try and do some measurements and compare the sync vs async
>>>> parsing modes...
>>>>
>>>> Asked the Tika team to support with some more examples...
>>>>
>>>> Cheers, Sergey
>>>> On 20/09/17 22:17, Sergey Beryozkin wrote:
>>>>> Hi,
>>>>>
>>>>> thanks for the explanations,
>>>>>
>>>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>>>>> Hi!
>>>>>>
>>>>>> TextIO returns an unordered soup of lines contained in all files you
>>>>>> ask it to read. People usually use TextIO for reading files where 1
>>>>>> line corresponds to 1 independent data element, e.g. a log entry, or
>>>>>> a row of a CSV file - so discarding order is ok.
>>>>> Just a side note, I'd probably want that to be ordered, though I guess
>>>>> it depends...
>>>>>> However, there is a number of cases where TextIO is a poor fit:
>>>>>> - Cases where discarding order is not ok - e.g. if you're doing
>>>>>> natural language processing and the text files contain actual prose,
>>>>>> where you need to process a file as a whole. TextIO can't do that.
>>>>>> - Cases where you need to remember which file each element came
>>>>>> from, e.g.
>>>>>> if you're creating a search index for the files: TextIO can't do
>>>>>> this either.
>>>>>>
>>>>>> Both of these issues have been raised in the past against TextIO;
>>>>>> however it seems that the overwhelming majority of users of TextIO
>>>>>> use it for logs or CSV files or alike, so solving these issues has
>>>>>> not been a priority.
>>>>>> Currently they are solved in a general form via FileIO.read() which
>>>>>> gives you access to reading a full file yourself - people who want
>>>>>> more flexibility will be able to use standard Java text-parsing
>>>>>> utilities on a ReadableFile, without involving TextIO.
>>>>>>
>>>>>> Same applies for XmlIO: it is specifically designed for the narrow
>>>>>> use case where the files contain independent data entries, so
>>>>>> returning an unordered soup of them, with no association to the
>>>>>> original file, is the user's intention. XmlIO will not work for
>>>>>> processing more complex XML files that are not simply a sequence of
>>>>>> entries with the same tag, and it also does not remember the
>>>>>> original filename.
>>>>>>
>>>>>
>>>>> OK...
>>>>>
>>>>>> However, if my understanding of Tika use cases is correct, it is
>>>>>> mainly used for extracting content from complex file formats - for
>>>>>> example, extracting text and images from PDF files or Word
>>>>>> documents. I believe this is the main difference between it and
>>>>>> TextIO - people usually use Tika for complex use cases where the
>>>>>> "unordered soup of stuff" abstraction is not useful.
>>>>>>
>>>>>> My suspicion about this is confirmed by the fact that the crux of
>>>>>> the Tika API is ContentHandler
>>>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true
>>>>>>
>>>>>> whose
>>>>>> documentation says "The order of events in this interface is very
>>>>>> important, and mirrors the order of information in the document
>> itself."
>>>>> All that says is that a (Tika) ContentHandler will be a true SAX
>>>>> ContentHandler...
>>>>>>
>>>>>> Let me give a few examples of what I think is possible with the raw
>>>>>> Tika API, but I think is not currently possible with TikaIO - please
>>>>>> correct me where I'm wrong, because I'm not particularly familiar
>>>>>> with Tika and am judging just based on what I read about it.
>>>>>> - User has 100,000 Word documents and wants to convert each of them
>>>>>> to text files for future natural language processing.
>>>>>> - User has 100,000 PDF files with financial statements, each
>>>>>> containing a bunch of unrelated text and - the main content - a list
>>>>>> of transactions in PDF tables. User wants to extract each
>>>>>> transaction as a PCollection element, discarding the unrelated text.
>>>>>> - User has 100,000 PDF files with scientific papers, and wants to
>>>>>> extract text from them, somehow parse author and affiliation from
>>>>>> the text, and compute statistics of topics and terminology usage by
>>>>>> author name and affiliation.
>>>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>>>>>> observing a location over time: they want to extract metadata from
>>>>>> each image using Tika, analyze the images themselves using some
>>>>>> other library, and detect anomalies in the overall appearance of the
>>>>>> location over time as seen from multiple cameras.
>>>>>> I believe all of these cases can not be solved with TikaIO because
>>>>>> the resulting PCollection<String> contains no information about
>>>>>> which String comes from which document and about the order in which
>>>>>> they appear in the document.
>>>>> These are good use cases, thanks... I thought you were talking
>>>>> about the unordered soup of data produced by TikaIO (and its friends
>>>>> TextIO and alike :-)).
>>>>> Putting the ordered vs unordered question aside for a sec, why
>>>>> exactly can a Tika Reader not make the name of the file it's
>>>>> currently reading from available to the pipeline, as some Beam
>>>>> pipeline metadata piece?
>>>>> Surely it must be possible with Beam? If not then I would be
>>>>> surprised...
>>>>>
>>>>>>
>>>>>> I am, honestly, struggling to think of a case where I would want to
>>>>>> use Tika, but where I *would* be ok with getting an unordered soup
>>>>>> of strings.
>>>>>> So some examples would be very helpful.
>>>>>>
>>>>> Yes. I'll ask Tika developers to help with some examples, but I'll
>>>>> give one example where it did not matter to us in what order
>>>>> Tika-produced data were available to the downstream layer.
>>>>>
>>>>> It's a demo the Apache CXF colleague of mine showed at one of Apache
>>>>> Con NAs, and we had a happy audience:
>>>>>
>>>>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea
>>>>> se/samples/jax_rs/search
>>>>>
>>>>>
>>>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
>>>>> into Lucene. We associate a file name with the indexed content and
>>>>> then let users find a list of PDF files which contain a given word or
>>>>> few words, details are here
>>>>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea
>>>>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal
>>>>> og.java#L131
>>>>>
>>>>>
>>>>> I'd say even more involved search engines would not mind supporting a
>>>>> case like that :-)
>>>>>
>>>>> Now there we process one file at a time, and I understand now that
>>>>> with TikaIO and N files it's all over the place really as far as the
>>>>> ordering is concerned, which file it's coming from, etc. That's why
>>>>> the TikaReader must be able to associate the file name with a given
>>>>> piece of text it's making available to the pipeline.
>>>>>
>>>>> I'd be happy to support the ParDo way of linking Tika with Beam.
>>>>> If it makes things simpler then it would be good; I've just no idea
>>>>> at the moment how to start the pipeline without using a
>>>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
>>>>> earlier - how can one avoid it with ParDo when implementing a 'min
>>>>> len chunk' feature, where the ParDo would have to concatenate several
>>>>> SAX data pieces first before making a single composite piece
>>>>> available to the pipeline?
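
[Editor's note: on the 'min len chunk' synchronization question above - within a
single @ProcessElement call, Tika's SAX callbacks run sequentially on the calling
thread, so a plain unsynchronized accumulator is enough. A minimal sketch, with a
hypothetical class name, of what such a concatenating buffer could look like:]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: buffers SAX-reported text pieces until a minimum
// length is reached, then emits them as one composite chunk. No locking is
// needed because one Tika parse drives all callbacks from a single thread.
final class MinLengthChunker {
  private final int minLength;
  private final StringBuilder buffer = new StringBuilder();
  private final List<String> chunks = new ArrayList<>();

  MinLengthChunker(int minLength) {
    this.minLength = minLength;
  }

  // Called from the SAX characters() callback with each reported piece.
  void append(String piece) {
    buffer.append(piece);
    if (buffer.length() >= minLength) {
      chunks.add(buffer.toString());
      buffer.setLength(0);
    }
  }

  // Called once the document is fully parsed, to emit any remainder.
  List<String> finish() {
    if (buffer.length() > 0) {
      chunks.add(buffer.toString());
      buffer.setLength(0);
    }
    return chunks;
  }
}
```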
>>>>>
>>>>>
>>>>>> Another way to state it: currently, if I wanted to solve all of the
>>>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
>>>>>> API myself on the resulting ReadableFile. How can we make TikaIO
>>>>>> provide a usability improvement over such usage?
>>>>>>
>>>>>
>>>>>
>>>>> If you are actually asking whether it really makes sense for Beam to
>>>>> ship Tika-related code, given that users can just do it themselves,
>>>>> then I'm not sure.
>>>>>
>>>>> IMHO it always works better if users have to provide just a few
>>>>> config options to an integral part of the framework and see things
>>>>> happening. It will bring more users.
>>>>>
>>>>> Whether the current Tika code (refactored or not) stays with Beam or
>>>>> not - I'll let you and the team decide; believe it or not I was
>>>>> seriously contemplating at the last moment to make it all part of the
>>>>> Tika project itself and have a bit more flexibility over there with
>>>>> tweaking things, but now that it is in the Beam snapshot - I don't
>>>>> know - it's not my decision...
>>>>>
>>>>>> I am confused by your other comment - "Does the ordering matter ?
>>>>>> Perhaps
>>>>>> for some cases it does, and for some it does not. May be it makes
>>>>>> sense to support running TikaIO as both the bounded reader/source
>>>>>> and ParDo, with getting the common code reused." - because using
>>>>>> BoundedReader or ParDo is not related to the ordering issue, only to
>>>>>> the issue of asynchronous reading and complexity of implementation.
>>>>>> The resulting PCollection will be unordered either way - this needs
>>>>>> to be solved separately by providing a different API.
>>>>> Right, I see now: ParDo is not about making Tika-reported data
>>>>> available to the downstream pipeline components in order, only about
>>>>> a simpler implementation.
>>>>> Association with the file should be possible I hope, but I understand
>>>>> it would also be possible to optionally make the data come out
>>>>> ordered...
>>>>>
>>>>> Assuming TikaIO stays, and before trying to re-implement it as a
>>>>> ParDo, let me double check: should we still give some thought to the
>>>>> possible performance benefit of the current approach? As I said, I
>>>>> can easily get rid of all that polling code and use a simple
>>>>> BlockingQueue.
>>>>>
>>>>> Cheers, Sergey
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
>>>>>> <sb...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> Glad TikaIO getting some serious attention :-), I believe one thing
>>>>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>>>>
>>>>>>> Before trying to reply online, I'd like to state that my main
>>>>>>> assumption is that TikaIO (as far as the read side is concerned) is
>>>>>>> no different to Text, XML or similar bounded reader components.
>>>>>>>
>>>>>>> I have to admit I don't understand your questions about TikaIO
>>>>>>> usecases.
>>>>>>>
>>>>>>> What are the Text input or XML input use cases? These use cases
>>>>>>> are Tika input cases as well; the only difference is that Tika
>>>>>>> cannot split an individual file into a sequence of sources, etc.
>>>>>>>
>>>>>>> TextIO can read plain text files (possibly zipped), XmlIO is
>>>>>>> optimized around reading XML files, and I thought I made it clear
>>>>>>> (and it is a known fact anyway) that Tika is about reading
>>>>>>> basically any file format.
>>>>>>>
>>>>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>>>>
>>>>>>> Sergey
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Replies inline.
>>>>>>>>
>>>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
>>>>>>>> <sb...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi All
>>>>>>>>>
>>>>>>>>> This is my first post to the dev list, I work for Talend, I'm a
>>>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
>>>>>>>>> great to try and link both projects together, which led me to
>>>>>>>>> opening [1] where I typed some early thoughts, followed by PR
>>>>>>>>> [2].
>>>>>>>>>
>>>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
>>>>>>>>> newer review comments from Eugene pending, so I'd like to
>>>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
>>>>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>>>>
>>>>>>>>> Apache Tika parsers report the text content in chunks, via SAX
>>>>>>>>> parser events. It's not possible with Tika to take a file and
>>>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
>>>>>>>>> by line; the only way is to handle the SAX parser callbacks
>>>>>>>>> which report the data chunks.
>>>>>>>>> Some parsers may report complete lines, some individual words,
>>>>>>>>> with some being able to report the data only after they
>>>>>>>>> completely parse the document.
>>>>>>>>> It all depends on the data format.
>>>>>>>>>
>>>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
>>>>>>>>> to parse the files; Beam threads only collect the data from the
>>>>>>>>> internal queue into which the internal TikaReader thread puts it
>>>>>>>>> (note the data chunks are ordered even though the tests might
>>>>>>>>> suggest otherwise).
>>>>>>>>>
>>>>>>>> I agree that your implementation of reader returns records in
>>>>>>>> order
>>>>>>>> - but
>>>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
>>>>>>>> the order in which records are produced by a BoundedReader - the
>>>>>>>> order produced by your reader is ignored, and when applying any
>>>>>>>> transforms to the
>>>>>>> PCollection
>>>>>>>> produced by TikaIO, it is impossible to recover the order in which
>>>>>>>> your reader returned the records.
>>>>>>>>
>>>>>>>> With that in mind, is PCollection<String>, containing individual
>>>>>>>> Tika-detected items, still the right API for representing the
>>>>>>>> result of parsing a large number of documents with Tika?
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The reason I did it this way was that I thought:
>>>>>>>>>
>>>>>>>>> 1) it would make the individual data chunks available faster to
>>>>>>>>> the pipeline - the parser will continue working through the
>>>>>>>>> binary/video etc file while the data will already start flowing -
>>>>>>>>> I agree there should be some test data available confirming it -
>>>>>>>>> but I'm positive at the moment this approach might yield some
>>>>>>>>> performance gains with large sets. If the file is large, if it
>>>>>>>>> has embedded attachments/videos to deal with, then it may be
>>>>>>>>> more effective not to have the Beam thread deal with it...
>>>>>>>>>
>>>>>>>>> As I said on the PR, this description contains unfounded and
>>>>>>>>> potentially
>>>>>>>> incorrect assumptions about how Beam runners execute (or may
>>>>>>>> execute in
>>>>>>> the
>>>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>>>>> correctly,
>>>>>>>> you might be assuming that:
>>>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>>>>> complete
>>>>>>>> before processing its outputs with downstream transforms
>>>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>>>>> *concurrently*
>>>>>>>> with downstream processing of its results
>>>>>>>> - Passing an element from one thread to another using a
>>>>>>>> BlockingQueue is free in terms of performance.
>>>>>>>> All of these are false at least in some runners, and I'm almost
>>>>>>>> certain that in reality, performance of this approach is worse
>>>>>>>> than a ParDo in most production runners.
>>>>>>>>
>>>>>>>> There are other disadvantages to this approach:
>>>>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>>>>> invisible
>>>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>>>>>> profiling capabilities, or the ability to get the current stack
>>>>>>>> trace for stuck elements, this approach would make the real
>>>>>>>> processing invisible to all of these capabilities, and a user
>>>>>>>> would only see that the bulk of the time is spent waiting for the
>>>>>>>> next element, but not *why* the next
>>>>>>> element
>>>>>>>> is taking long to compute.
>>>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>>>>>> invisible to Beam, will make it harder for runners to do
>>>>>>>> autoscaling, binpacking
>>>>>>> and
>>>>>>>> other resource management magic (how much of this runners actually
>>>>>>>> do is
>>>>>>> a
>>>>>>>> separate issue), because the runner will have no way of knowing
>>>>>>>> how much CPU/IO this particular transform is actually using - all
>>>>>>>> the processing happens in a thread about which the runner is
>>>>>>>> unaware.
>>>>>>>> - As far as I can tell, the code also hides exceptions that happen
>>>>>>>> in the Tika thread
>>>>>>>> - Adding the thread management makes the code much more complex,
>>>>>>>> easier
>>>>>>> to
>>>>>>>> introduce bugs, and harder for others to contribute
>>>>>>>>
>>>>>>>>
>>>>>>>>> 2) As I commented at the end of [2], having an option to
>>>>>>>>> concatenate the data chunks first before making them available to
>>>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
>>>>>>>>> introduce some synchronization issues (though not exactly sure
>>>>>>>>> yet)
>>>>>>>>>
>>>>>>>> What are these issues?
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> One of valid concerns there is that the reader is polling the
>>>>>>>>> internal queue so, in theory at least, and perhaps in some rare
>>>>>>>>> cases too, we may have a case where the max polling time has been
>>>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
>>>>>>>>> the file data. I think that it can be solved by either 2a)
>>>>>>>>> configuring the max polling time to a very large number which
>>>>>>>>> will never be reached for a practical case, or
>>>>>>>>> 2b) simply use a blocking queue without the time limits - in the
>>>>>>>>> worst case, if the Tika parser spins and fails to report the end
>>>>>>>>> of the document, then Beam can heal itself if the pipeline blocks.
>>>>>>>>> I propose to follow 2b).
>>>>>>>>>
>>>>>>>> I agree that there should be no way to unintentionally configure
>>>>>>>> the transform in a way that will produce silent data loss. Another
>>>>>>>> reason for not having these tuning knobs is that it goes against
>>>>>>>> Beam's "no knobs"
>>>>>>>> philosophy, and that in most cases users have no way of figuring
>>>>>>>> out a
>>>>>>> good
>>>>>>>> value for tuning knobs except for manual experimentation, which is
>>>>>>>> extremely brittle and typically gets immediately obsoleted by
>>>>>>>> running on
>>>>>>> a
>>>>>>>> new dataset or updating a version of some of the involved
>>>>>>>> dependencies
>>>>>>> etc.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please let me know what you think.
>>>>>>>>> My plan so far is:
>>>>>>>>> 1) start addressing most of Eugene's comments which would require
>>>>>>>>> some minor TikaIO updates
>>>>>>>>> 2) at the next stage, work on removing the TikaSource internal
>>>>>>>>> code dealing with file patterns which I copied from TextIO
>>>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
>>>>>>>>> users some time to try it with some real complex files and also
>>>>>>>>> decide if TikaIO can continue implemented as a
>>>>>>>>> BoundedSource/Reader or not
>>>>>>>>>
>>>>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>>>>
>>>>>>>> Yes, but I think we should start by discussing the anticipated use
>>>>>>>> cases
>>>>>>> of
>>>>>>>> TikaIO and designing an API for it based on those use cases; and
>>>>>>>> then see what's the best implementation for that particular API
>>>>>>>> and set of anticipated use cases.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks, Sergey
>>>>>>>>>
>>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>
>>
> 

Re: TikaIO concerns

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Hi,
@Sergey:
- I already marked TikaIO @Experimental, so we can make changes.
- Yes, the String in KV<String, ParseResult> is the filename. I guess we
could alternatively put it into ParseResult - don't have a strong opinion.
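
[Editor's note: to make the alternative Eugene mentions concrete, here is a
minimal sketch of a ParseResult that carries the filename itself instead of
relying on KV<String, ParseResult>. Names are illustrative, not the actual
TikaIO API, and metadata is simplified to a plain Map rather than Tika's
Metadata class:]

```java
import java.util.Map;

// Hypothetical value class: one parsed document, with its origin filename,
// extracted text content, and document metadata bundled together.
final class ParseResult {
  private final String fileName;
  private final String content;
  private final Map<String, String> metadata;

  ParseResult(String fileName, String content, Map<String, String> metadata) {
    this.fileName = fileName;
    this.content = content;
    this.metadata = metadata;
  }

  String getFileName() { return fileName; }

  String getContent() { return content; }

  Map<String, String> getMetadata() { return metadata; }
}
```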

@Chris: unorderedness of Metadata would have helped if we extracted each
Metadata item into a separate PCollection element, but that's not what we
want to do (we want to have an element per document instead).

@Timothy: can you tell more about this RecursiveParserWrapper? Is this
something that the user can configure by specifying the Parser on TikaIO if
they so wish?

On Thu, Sep 21, 2017 at 2:23 PM Allison, Timothy B. <ta...@mitre.org>
wrote:

> Like Sergey, it’ll take me some time to understand your recommendations.
> Thank you!
>
>
>
> On one small point:
>
> >return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
> is a class with properties { String content, Metadata metadata }
>
>
>
> For this option, I’d strongly encourage using the Json output from the
> RecursiveParserWrapper that contains metadata and content, and captures
> metadata even from embedded documents.
>
>
>
> > However, since TikaIO can be applied to very large files, this could
> produce very large elements, which is a bad idea
>
> Large documents are a problem, no doubt about it…
>
>
>
> *From:* Eugene Kirpichov [mailto:kirpichov@google.com]
> *Sent:* Thursday, September 21, 2017 4:41 PM
> *To:* Allison, Timothy B. <ta...@mitre.org>; dev@beam.apache.org
> *Cc:* dev@tika.apache.org
> *Subject:* Re: TikaIO concerns
>
>
>
> Thanks all for the discussion. It seems we have consensus that both
> within-document order and association with the original filename are
> necessary, but currently absent from TikaIO.
>
>
>
> *Association with original file:*
>
> Sergey - Beam does not *automatically* provide a way to associate an
> element with the file it originated from: automatically tracking data
> provenance is a known very hard research problem on which many papers have
> been written, and obvious solutions are very easy to break. See related
> discussion at
> https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
>  .
>
>
>
> If you want the elements of your PCollection to contain additional
> information, you need the elements themselves to contain this information:
> the elements are self-contained and have no metadata associated with them
> (beyond the timestamp and windows, universal to the whole Beam model).
>
>
>
> *Order within a file:*
>
> The only way to have any kind of order within a PCollection is to have the
> elements of the PCollection contain something ordered, e.g. have a
> PCollection<List<Something>>, where each List is for one file [I'm assuming
> Tika, at a low level, works on a per-file basis?]. However, since TikaIO
> can be applied to very large files, this could produce very large elements,
> which is a bad idea. Because of this, I don't think the result of applying
> Tika to a single file can be encoded as a PCollection element.
>
>
>
> Given both of these, I think that it's not possible to create a
> *general-purpose* TikaIO transform that will be better than manual
> invocation of Tika as a DoFn on the result of FileIO.readMatches().
>
>
>
> However, looking at the examples at
> https://tika.apache.org/1.16/examples.html - almost all of the examples
> involve extracting a single String from each document. This use case, with
> the assumption that individual documents are small enough, can certainly be
> simplified and TikaIO could be a facade for doing just this.
>
>
>
> E.g. TikaIO could:
>
> - take as input a PCollection<ReadableFile>
>
> - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
> is a class with properties { String content, Metadata metadata }
>
> - be configured by: a Parser (it implements Serializable so can be
> specified at pipeline construction time) and a ContentHandler whose
> toString() will go into "content". ContentHandler does not implement
> Serializable, so you can not specify it at construction time - however, you
> can let the user specify either its class (if it's a simple handler like a
> BodyContentHandler) or specify a lambda for creating the handler
> (SerializableFunction<Void, ContentHandler>), and potentially you can have
> a simpler facade for Tika.parseAsString() - e.g. call it
> TikaIO.parseAllAsStrings().
>
>
>
> Example usage would look like:
>
>
>
>   PCollection<KV<String, ParseResult>> parseResults =
> p.apply(FileIO.match().filepattern(...))
>
>     .apply(FileIO.readMatches())
>
>     .apply(TikaIO.parseAllAsStrings())
>
>
>
> or:
>
>
>
>     .apply(TikaIO.parseAll()
>
>         .withParser(new AutoDetectParser())
>
>         .withContentHandler(() -> new BodyContentHandler(new
> ToXMLContentHandler())))
>
>
>
> You could also have shorthands for letting the user avoid using FileIO
> directly in simple cases, for example:
>
>     p.apply(TikaIO.parseAsStrings().from(filepattern))
>
>
>
> This would of course be implemented as a ParDo or even MapElements, and
> you'll be able to share the code between parseAll and regular parse.
>
>
>
> On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sb...@gmail.com>
> wrote:
>
> Hi Tim
> On 21/09/17 14:33, Allison, Timothy B. wrote:
> > Thank you, Sergey.
> >
> > My knowledge of Apache Beam is limited -- I saw Davor and
> Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally
> impressed, but I haven't had a chance to work with it yet.
> >
> >  From my perspective, if I understand this thread (and I may not!),
> getting unordered text from _a given file_ is a non-starter for most
> applications.  The implementation needs to guarantee order per file, and
> the user has to be able to link the "extract" back to a unique identifier
> for the document.  If the current implementation doesn't do those things,
> we need to change it, IMHO.
> >
> Right now the Tika-related reader does not associate a given text fragment
> with the file name, so a function looking at some text and trying to
> find where it came from won't be able to do so.
>
> So I asked how to do it in Beam, how to attach some context to the given
> piece of data. I hope it can be done and if not - then perhaps some
> improvement can be applied.
>
> Re the unordered text - yes - this is what we currently have with Beam +
> TikaIO :-).
>
> The use case I referred to earlier in this thread (upload PDFs, save
> the possibly unordered text to Lucene with the file name 'attached', let
> users search for the files containing some words or phrases; this works
> OK given that I can see the PDF parser, for example, reporting whole
> lines) can be supported with the current TikaIO (provided we find a way
> to 'attach' a file name to the flow).
>
> I see, though, that supporting total ordering can be a big deal in other
> cases. Eugene, can you please explain how it can be done? Is it
> achievable in principle, without the users having to do some custom
> coding?
>
> > To the question of -- why is this in Beam at all; why don't we let users
> call it if they want it?...
> >
> > No matter how much we do to Tika, it will behave badly sometimes --
> permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks
> using Beam -- folks likely with large batches of unruly/noisy documents --
> are more likely to run into these problems than your average
> couple-of-thousand-docs-from-our-own-company user. So, if there are things
> we can do in Beam to prevent developers around the world from having to
> reinvent the wheel for defenses against these problems, then I'd be
> enormously grateful if we could put Tika into Beam.  That means:
> >
> > 1) a process-level timeout (because you can't actually kill a thread in
> Java)
> > 2) a process-level restart on OOM
> > 3) avoid trying to reprocess a badly behaving document
> >
> > If Beam automatically handles those problems, then I'd say, y, let users
> write their own code.  If there is so much as a single configuration knob
> (and it sounds like Beam is against complex configuration...yay!) to get
> that working in Beam, then I'd say, please integrate Tika into Beam.  From
> a safety perspective, it is critical to keep the extraction process
> entirely separate (jvm, vm, machine, rack, data center!) from the
> transformation+loading steps.  IMHO, very few devs realize this because
> Tika works well lots of the time...which is why it is critical for us to
> make it easy for people to get it right all of the time.
> >
> > Even in my desktop (gah, y, desktop!) search app, I run Tika in batch
> mode first in one jvm, and then I kick off another process to do
> transform/loading into Lucene/Solr from the .json files that Tika generates
> for each input file.  If I were to scale up, I'd want to maintain this
> complete separation of steps.
> >
> > Apologies if I've derailed the conversation or misunderstood this thread.
> >
> Major thanks for your input :-)
>
> Cheers, Sergey
>
> > Cheers,
> >
> >                 Tim
> >
> > -----Original Message-----
> > From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
> > Sent: Thursday, September 21, 2017 9:07 AM
> > To: dev@beam.apache.org
> > Cc: Allison, Timothy B. <ta...@mitre.org>
> > Subject: Re: TikaIO concerns
> >
> > Hi All
> >
> > Please welcome Tim, one of Apache Tika leads and practitioners.
> >
> > Tim, thanks for joining in :-). If you have some great Apache Tika
> > stories to share (preferably involving cases where the order in which
> > Tika-produced data were dealt with by the consumers did not really
> > matter) then please do so :-).
> >
> > At the moment, even though a Tika ContentHandler will emit the ordered
> > data, the Beam runtime gives no guarantee that the downstream pipeline
> > components will see the data coming in the right order.
> >
> > (FYI, I understand from the earlier comments that the total ordering is
> also achievable but would require the extra API support)
> >
> > Other comments would be welcome too
> >
> > Thanks, Sergey
> >
> > On 21/09/17 10:55, Sergey Beryozkin wrote:
> >> I noticed that the PDF and ODT parsers actually split by lines, not
> >> individual words, and I'm nearly 100% sure I saw Tika reporting
> >> individual lines when it was parsing the text files. The 'min text
> >> length' feature can help with reporting several lines at a time, etc...
> >>
> >> I'm working with this PDF all the time:
> >> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
> >>
> >> try it too if you get a chance.
> >>
> >> (and I can imagine not all PDFs etc. represent a 'story'; some can
> >> be, for example, log-like content too)
> >>
> >> That said, I don't know how a parser for the format N will behave, it
> >> depends on the individual parsers.
> >>
> >> IMHO it's an equal candidate alongside Text-based bounded IOs...
> >>
> >> I'd like to know, though, how to make a file name available to the
> >> part of the pipeline which is working with the current text fragment?
> >>
> >> Going to try and do some measurements and compare the sync vs async
> >> parsing modes...
> >>
> >> Asked the Tika team to support with some more examples...
> >>
> >> Cheers, Sergey
> >> On 20/09/17 22:17, Sergey Beryozkin wrote:
> >>> Hi,
> >>>
> >>> thanks for the explanations,
> >>>
> >>> On 20/09/17 16:41, Eugene Kirpichov wrote:
> >>>> Hi!
> >>>>
> >>>> TextIO returns an unordered soup of lines contained in all files you
> >>>> ask it to read. People usually use TextIO for reading files where 1
> >>>> line corresponds to 1 independent data element, e.g. a log entry, or
> >>>> a row of a CSV file - so discarding order is ok.
> >>> Just a side note, I'd probably want that to be ordered, though I
> >>> guess it depends...
> >>>> However, there is a number of cases where TextIO is a poor fit:
> >>>> - Cases where discarding order is not ok - e.g. if you're doing
> >>>> natural language processing and the text files contain actual prose,
> >>>> where you need to process a file as a whole. TextIO can't do that.
> >>>> - Cases where you need to remember which file each element came
> >>>> from, e.g.
> >>>> if you're creating a search index for the files: TextIO can't do
> >>>> this either.
> >>>>
> >>>> Both of these issues have been raised in the past against TextIO;
> >>>> however it seems that the overwhelming majority of users of TextIO
> >>>> use it for logs or CSV files or alike, so solving these issues has
> >>>> not been a priority.
> >>>> Currently they are solved in a general form via FileIO.read() which
> >>>> gives you access to reading a full file yourself - people who want
> >>>> more flexibility will be able to use standard Java text-parsing
> >>>> utilities on a ReadableFile, without involving TextIO.
> >>>>
> >>>> Same applies for XmlIO: it is specifically designed for the narrow
> >>>> use case where the files contain independent data entries, so
> >>>> returning an unordered soup of them, with no association to the
> >>>> original file, is the user's intention. XmlIO will not work for
> >>>> processing more complex XML files that are not simply a sequence of
> >>>> entries with the same tag, and it also does not remember the
> >>>> original filename.
> >>>>
> >>>
> >>> OK...
> >>>
> >>>> However, if my understanding of Tika use cases is correct, it is
> >>>> mainly used for extracting content from complex file formats - for
> >>>> example, extracting text and images from PDF files or Word
> >>>> documents. I believe this is the main difference between it and
> >>>> TextIO - people usually use Tika for complex use cases where the
> >>>> "unordered soup of stuff" abstraction is not useful.
> >>>>
> >>>> My suspicion about this is confirmed by the fact that the crux of
> >>>> the Tika API is ContentHandler
> >>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
> >>>> html?is-external=true
> >>>>
> >>>> whose
> >>>> documentation says "The order of events in this interface is very
> >>>> important, and mirrors the order of information in the document
> itself."
> >>> All that says is that a (Tika) ContentHandler will be a true SAX
> >>> ContentHandler...
> >>>>
> >>>> Let me give a few examples of what I think is possible with the raw
> >>>> Tika API, but I think is not currently possible with TikaIO - please
> >>>> correct me where I'm wrong, because I'm not particularly familiar
> >>>> with Tika and am judging just based on what I read about it.
> >>>> - User has 100,000 Word documents and wants to convert each of them
> >>>> to text files for future natural language processing.
> >>>> - User has 100,000 PDF files with financial statements, each
> >>>> containing a bunch of unrelated text and - the main content - a list
> >>>> of transactions in PDF tables. User wants to extract each
> >>>> transaction as a PCollection element, discarding the unrelated text.
> >>>> - User has 100,000 PDF files with scientific papers, and wants to
> >>>> extract text from them, somehow parse author and affiliation from
> >>>> the text, and compute statistics of topics and terminology usage by
> >>>> author name and affiliation.
> >>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
> >>>> observing a location over time: they want to extract metadata from
> >>>> each image using Tika, analyze the images themselves using some
> >>>> other library, and detect anomalies in the overall appearance of the
> >>>> location over time as seen from multiple cameras.
> >>>> I believe all of these cases can not be solved with TikaIO because
> >>>> the resulting PCollection<String> contains no information about
> >>>> which String comes from which document and about the order in which
> >>>> they appear in the document.
> >>> These are good use cases, thanks... I thought what you were talking
> >>> about was the unordered soup of data produced by TikaIO (and its
> >>> friends TextIO and the like :-)).
> >>> Putting the ordered vs unordered question aside for a sec, why
> >>> exactly can a Tika Reader not make the name of the file it's
> >>> currently reading from available to the pipeline, as some Beam
> >>> pipeline metadata piece?
> >>> Surely it must be possible with Beam? If not then I would be
> >>> surprised...
> >>>
> >>>>
> >>>> I am, honestly, struggling to think of a case where I would want to
> >>>> use Tika, but where I *would* be ok with getting an unordered soup
> >>>> of strings.
> >>>> So some examples would be very helpful.
> >>>>
> >>> Yes. I'll ask Tika developers to help with some examples, but I'll
> >>> give one example where it did not matter to us in what order
> >>> Tika-produced data were available to the downstream layer.
> >>>
> >>> It's a demo the Apache CXF colleague of mine showed at one of Apache
> >>> Con NAs, and we had a happy audience:
> >>>
> >>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea
> >>> se/samples/jax_rs/search
> >>>
> >>>
> >>> PDF or ODT files uploaded, Tika parses them, and all of that is put
> >>> into Lucene. We associate a file name with the indexed content and
> >>> then let users find a list of PDF files which contain a given word or
> >>> few words, details are here
> >>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea
> >>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal
> >>> og.java#L131
> >>>
> >>>
> >>> I'd say even more involved search engines would not mind supporting a
> >>> case like that :-)
> >>>
> >>> Now there we process one file at a time, and I understand now that
> >>> with TikaIO and N files it's all over the place really as far as the
> >>> ordering is concerned, which file it's coming from, etc. That's why
> >>> TikaReader must be able to associate the file name with a given piece
> >>> of text it's making available to the pipeline.
> >>>
> >>> I'd be happy to support the ParDo way of linking Tika with Beam.
> >>> If it makes things simpler then it would be good, I've just no idea
> >>> at the moment how to start the pipeline without using a
> >>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
> >>> earlier - how can one avoid it with ParDo when implementing a 'min
> >>> len chunk' feature, where the ParDo would have to concatenate several
> >>> SAX data pieces first before making a single composite piece
> available to the pipeline ?
> >>>
> >>>
> >>>> Another way to state it: currently, if I wanted to solve all of the
> >>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
> >>>> API myself on the resulting ReadableFile. How can we make TikaIO
> >>>> provide a usability improvement over such usage?
> >>>>
> >>>
> >>>
> >>> If you are actually asking, does it really make sense for Beam to
> >>> ship Tika related code, given that users can just do it themselves,
> >>> I'm not sure.
> >>>
> >>> IMHO it always works better if users have to provide just a few config
> >>> options to an integral part of the framework and see things happening.
> >>> It will bring more users.
> >>>
> >>> Whether the current Tika code (refactored or not) stays with Beam or
> >>> not - I'll let you and the team decide; believe it or not I was
> >>> seriously contemplating at the last moment to make it all part of the
> >>> Tika project itself and have a bit more flexibility over there with
> >>> tweaking things, but now that it is in the Beam snapshot - I don't
> >>> know - it's not my decision...
> >>>
> >>>> I am confused by your other comment - "Does the ordering matter ?
> >>>> Perhaps
> >>>> for some cases it does, and for some it does not. May be it makes
> >>>> sense to support running TikaIO as both the bounded reader/source
> >>>> and ParDo, with getting the common code reused." - because using
> >>>> BoundedReader or ParDo is not related to the ordering issue, only to
> >>>> the issue of asynchronous reading and complexity of implementation.
> >>>> The resulting PCollection will be unordered either way - this needs
> >>>> to be solved separately by providing a different API.
> >>> Right I see now, so ParDo is not about making Tika reported data
> >>> available to the downstream pipeline components ordered, only about
> >>> the simpler implementation.
> >>> Association with the file should be possible I hope, but I understand
> >>> it would be possible to optionally make the data coming out in the
> >>> ordered way as well...
> >>>
> >>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
> >>> let me double check: should we still give some thought to the
> >>> possible performance benefit of the current approach ? As I said, I
> >>> can easily get rid of all that polling code and use a simple
> BlockingQueue.
> >>>
> >>> Cheers, Sergey
> >>>>
> >>>> Thanks.
> >>>>
> >>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
> >>>> <sb...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi
> >>>>>
> >>>>> Glad TikaIO getting some serious attention :-), I believe one thing
> >>>>> we both agree upon is that Tika can help Beam in its own unique way.
> >>>>>
> >>>>> Before trying to reply online, I'd like to state that my main
> >>>>> assumption is that TikaIO (as far as the read side is concerned) is
> >>>>> no different to Text, XML or similar bounded reader components.
> >>>>>
> >>>>> I have to admit I don't understand your questions about TikaIO
> >>>>> usecases.
> >>>>>
> >>>>> What are the Text Input or XML input use-cases ? These use cases
> >>>>> are TikaInput cases as well, the only difference is Tika can not
> >>>>> split the individual file into a sequence of sources/etc.
> >>>>>
> >>>>> TextIO can read plain text files (possibly zipped), XmlIO is
> >>>>> optimized around reading XML files, and I thought I made
> >>>>> it clear (and it is a known fact anyway) that Tika is about
> >>>>> reading basically any file format.
> >>>>>
> >>>>> Where is the difference (apart from what I've already mentioned) ?
> >>>>>
> >>>>> Sergey
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Replies inline.
> >>>>>>
> >>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
> >>>>>> <sb...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi All
> >>>>>>>
> >>>>>>> This is my first post to the dev list, I work for Talend, I'm a
> >>>>>>> Beam novice, Apache Tika fan, and thought it would be really
> >>>>>>> great to try and link both projects together, which led me to
> >>>>>>> opening [1] where I typed some early thoughts, followed by PR
> >>>>>>> [2].
> >>>>>>>
> >>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
> >>>>>>> newer review comments from Eugene pending, so I'd like to
> >>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
> >>>>>>> decide, based on the feedback from the experts, what to do next.
> >>>>>>>
> >>>>>>> Apache Tika Parsers report the text content in chunks, via
> >>>>>>> SaxParser events. It's not possible with Tika to take a file and
> >>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
> >>>>>>> by line; the only way is to handle the SAXParser callbacks which
> >>>>>>> report the data chunks.
> >>>>>>> Some
> >>>>>>> parsers may report complete lines, some individual words,
> >>>>>>> with some being able to report the data only after they
> >>>>>>> completely parse the document.
> >>>>>>> All depends on the data format.
> >>>>>>>
> >>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
> >>>>>>> to parse the files, Beam threads will only collect the data from
> >>>>>>> the internal queue where the internal TikaReader's thread will
> >>>>>>> put the data into (note the data chunks are ordered even though
> >>>>>>> the tests might suggest otherwise).
> >>>>>>>
> >>>>>> I agree that your implementation of reader returns records in
> >>>>>> order
> >>>>>> - but
> >>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
> >>>>>> the order in which records are produced by a BoundedReader - the
> >>>>>> order produced by your reader is ignored, and when applying any
> >>>>>> transforms to the
> >>>>> PCollection
> >>>>>> produced by TikaIO, it is impossible to recover the order in which
> >>>>>> your reader returned the records.
> >>>>>>
> >>>>>> With that in mind, is PCollection<String>, containing individual
> >>>>>> Tika-detected items, still the right API for representing the
> >>>>>> result of parsing a large number of documents with Tika?
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> The reason I did it was because I thought
> >>>>>>>
> >>>>>>> 1) it would make the individual data chunks available faster to
> >>>>>>> the pipeline - the parser will continue working via the
> >>>>>>> binary/video etc file while the data will already start flowing -
> >>>>>>> I agree there should be some tests data available confirming it -
> >>>>>>> but I'm positive at the moment this approach might yield some
> >>>>>>> performance gains with the large sets. If the file is large, if
> >>>>>>> it has the embedded attachments/videos to deal with, then it may
> >>>>>>> be more effective not to have the Beam thread deal with it...
> >>>>>>>
> >>>>>>> As I said on the PR, this description contains unfounded and
> >>>>>>> potentially
> >>>>>> incorrect assumptions about how Beam runners execute (or may
> >>>>>> execute in
> >>>>> the
> >>>>>> future) a ParDo or a BoundedReader. For example, if I understand
> >>>>> correctly,
> >>>>>> you might be assuming that:
> >>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
> >>>>> complete
> >>>>>> before processing its outputs with downstream transforms
> >>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
> >>>>> *concurrently*
> >>>>>> with downstream processing of its results
> >>>>>> - Passing an element from one thread to another using a
> >>>>>> BlockingQueue is free in terms of performance. All of these are
> >>>>>> false at least in some runners, and I'm almost certain that in
> >>>>>> reality, performance of this approach is worse than a ParDo in
> >>>>> most
> >>>>>> production runners.
> >>>>>>
> >>>>>> There are other disadvantages to this approach:
> >>>>>> - Doing the bulk of the processing in a separate thread makes it
> >>>>> invisible
> >>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
> >>>>>> profiling capabilities, or the ability to get the current stack
> >>>>>> trace for stuck elements, this approach would make the real
> >>>>>> processing invisible to all of these capabilities, and a user
> >>>>>> would only see that the bulk of the time is spent waiting for the
> >>>>>> next element, but not *why* the next
> >>>>> element
> >>>>>> is taking long to compute.
> >>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
> >>>>>> invisible to Beam, will make it harder for runners to do
> >>>>>> autoscaling, binpacking
> >>>>> and
> >>>>>> other resource management magic (how much of this runners actually
> >>>>>> do is
> >>>>> a
> >>>>>> separate issue), because the runner will have no way of knowing
> >>>>>> how much CPU/IO this particular transform is actually using - all
> >>>>>> the processing happens in a thread about which the runner is
> >>>>>> unaware.
> >>>>>> - As far as I can tell, the code also hides exceptions that happen
> >>>>>> in the Tika thread
> >>>>>> - Adding the thread management makes the code much more complex,
> >>>>>> easier
> >>>>> to
> >>>>>> introduce bugs, and harder for others to contribute
> >>>>>>
> >>>>>>
> >>>>>>> 2) As I commented at the end of [2], having an option to
> >>>>>>> concatenate the data chunks first before making them available to
> >>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
> >>>>>>> introduce some synchronization issues (though not exactly sure
> >>>>>>> yet)
> >>>>>>>
> >>>>>> What are these issues?
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> One of valid concerns there is that the reader is polling the
> >>>>>>> internal queue so, in theory at least, and perhaps in some rare
> >>>>>>> cases too, we may have a case where the max polling time has been
> >>>>>>> reached, the parser is still busy, and TikaIO fails to report all
> >>>>>>> the file data. I think that it can be solved by either 2a)
> >>>>>>> configuring the max polling time to a very large number which
> >>>>>>> will never be reached for a practical case, or
> >>>>>>> 2b) simply use a blocking queue without the time limits - in the
> >>>>>>> worst case, if TikaParser spins and fails to report the end of
> >>>>>>> the document, then, Beam can heal itself if the pipeline blocks.
> >>>>>>> I propose to follow 2b).
> >>>>>>>
> >>>>>> I agree that there should be no way to unintentionally configure
> >>>>>> the transform in a way that will produce silent data loss. Another
> >>>>>> reason for not having these tuning knobs is that it goes against
> >>>>>> Beam's "no knobs"
> >>>>>> philosophy, and that in most cases users have no way of figuring
> >>>>>> out a
> >>>>> good
> >>>>>> value for tuning knobs except for manual experimentation, which is
> >>>>>> extremely brittle and typically gets immediately obsoleted by
> >>>>>> running on
> >>>>> a
> >>>>>> new dataset or updating a version of some of the involved
> >>>>>> dependencies
> >>>>> etc.
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Please let me know what you think.
> >>>>>>> My plan so far is:
> >>>>>>> 1) start addressing most of Eugene's comments which would require
> >>>>>>> some minor TikaIO updates
> >>>>>>> 2) work on removing the TikaSource internal code dealing with
> >>>>>>> File patterns which I copied from TextIO at the next stage
> >>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
> >>>>>>> users some time to try it with some real complex files and also
> >>>>>>> decide if TikaIO can continue implemented as a
> >>>>>>> BoundedSource/Reader or not
> >>>>>>>
> >>>>>>> Eugene, all, will it work if I start with 1) ?
> >>>>>>>
> >>>>>> Yes, but I think we should start by discussing the anticipated use
> >>>>>> cases
> >>>>> of
> >>>>>> TikaIO and designing an API for it based on those use cases; and
> >>>>>> then see what's the best implementation for that particular API
> >>>>>> and set of anticipated use cases.
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Thanks, Sergey
> >>>>>>>
> >>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
> >>>>>>> [2] https://github.com/apache/beam/pull/3378
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
>
>

RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Like Sergey, it’ll take me some time to understand your recommendations.  Thank you!

On one small point:
>return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata }

For this option, I’d strongly encourage using the Json output from the RecursiveParserWrapper that contains metadata and content, and captures metadata even from embedded documents.
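For readers who have not seen the RecursiveParserWrapper before: its Json output is, roughly, an array of metadata maps - one for the container document and one per embedded document - with the extracted text carried under an "X-TIKA:content" key. The exact field names below are from memory and illustrative only, not authoritative:

```json
[
  {
    "Content-Type": "application/pdf",
    "X-TIKA:content": "Text extracted from the container document..."
  },
  {
    "Content-Type": "image/jpeg",
    "X-TIKA:embedded_resource_path": "/image0.jpg",
    "X-TIKA:content": "Text (e.g. OCR output) from an embedded attachment..."
  }
]
```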

> However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea
Large documents are a problem, no doubt about it…

From: Eugene Kirpichov [mailto:kirpichov@google.com]
Sent: Thursday, September 21, 2017 4:41 PM
To: Allison, Timothy B. <ta...@mitre.org>; dev@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Thanks all for the discussion. It seems we have consensus that both within-document order and association with the original filename are necessary, but currently absent from TikaIO.

Association with original file:
Sergey - Beam does not automatically provide a way to associate an element with the file it originated from: automatically tracking data provenance is a known very hard research problem on which many papers have been written, and obvious solutions are very easy to break. See related discussion at https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E .

If you want the elements of your PCollection to contain additional information, you need the elements themselves to contain this information: the elements are self-contained and have no metadata associated with them (beyond the timestamp and windows, universal to the whole Beam model).

Order within a file:
The only way to have any kind of order within a PCollection is to have the elements of the PCollection contain something ordered, e.g. have a PCollection<List<Something>>, where each List is for one file [I'm assuming Tika, at a low level, works on a per-file basis?]. However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea. Because of this, I don't think the result of applying Tika to a single file can be encoded as a PCollection element.

Given both of these, I think that it's not possible to create a general-purpose TikaIO transform that will be better than manual invocation of Tika as a DoFn on the result of FileIO.readMatches().

However, looking at the examples at https://tika.apache.org/1.16/examples.html - almost all of the examples involve extracting a single String from each document. This use case, with the assumption that individual documents are small enough, can certainly be simplified and TikaIO could be a facade for doing just this.

E.g. TikaIO could:
- take as input a PCollection<ReadableFile>
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata }
- be configured by: a Parser (it implements Serializable so can be specified at pipeline construction time) and a ContentHandler whose toString() will go into "content". ContentHandler does not implement Serializable, so you can not specify it at construction time - however, you can let the user specify either its class (if it's a simple handler like a BodyContentHandler) or specify a lambda for creating the handler (SerializableFunction<Void, ContentHandler>), and potentially you can have a simpler facade for Tika.parseAsString() - e.g. call it TikaIO.parseAllAsStrings().

Example usage would look like:

  PCollection<KV<String, ParseResult>> parseResults = p.apply(FileIO.match().filepattern(...))
    .apply(FileIO.readMatches())
    .apply(TikaIO.parseAllAsStrings())

or:

    .apply(TikaIO.parseAll()
        .withParser(new AutoDetectParser())
        .withContentHandler(() -> new BodyContentHandler(new ToXMLContentHandler())))

You could also have shorthands for letting the user avoid using FileIO directly in simple cases, for example:
    p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and you'll be able to share the code between parseAll and regular parse.
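To make this a bit more concrete - hedged, since neither class exists in Beam at the time of writing - here is a minimal JDK-only sketch of what the proposed ParseResult value type and the serializable handler-factory could look like. SerializableSupplier stands in for Beam's SerializableFunction<Void, ContentHandler>, a plain Map stands in for Tika's Metadata, and a StringBuilder stands in for a real ContentHandler:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class TikaIoSketch {
  // Hypothetical value type for TikaIO's output, as proposed above:
  // the extracted text plus the document metadata (modeled here as a
  // plain Map to avoid a dependency on Tika's Metadata class).
  public static final class ParseResult implements Serializable {
    public final String content;
    public final Map<String, String> metadata;
    public ParseResult(String content, Map<String, String> metadata) {
      this.content = content;
      this.metadata = Collections.unmodifiableMap(new HashMap<>(metadata));
    }
  }

  // ContentHandler is not Serializable, so the transform would be
  // configured with a serializable *factory* instead, invoked per element.
  // A lambda is serializable iff its target interface is Serializable.
  public interface SerializableSupplier<T> extends Serializable {
    T get();
  }

  // Round-trips an object through Java serialization - effectively the
  // check a runner performs when shipping a transform's configuration.
  @SuppressWarnings("unchecked")
  public static <T> T roundTrip(T value) throws Exception {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(value);
    }
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      return (T) in.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    // The handler factory survives serialization; a ContentHandler
    // instance itself would not.
    SerializableSupplier<StringBuilder> handlerFactory = StringBuilder::new;
    SerializableSupplier<StringBuilder> revived = roundTrip(handlerFactory);
    System.out.println(revived.get() != null);

    ParseResult result = new ParseResult("Extracted text...",
        Collections.singletonMap("Content-Type", "application/pdf"));
    System.out.println(result.metadata.get("Content-Type"));
  }
}
```

The factory indirection is the key design point: construction-time configuration must be serializable, while the non-serializable handler is created lazily on the worker.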

On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sb...@gmail.com> wrote:
Hi Tim
On 21/09/17 14:33, Allison, Timothy B. wrote:
> Thank you, Sergey.
>
> My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a chance to work with it yet.
>
>  From my perspective, if I understand this thread (and I may not!), getting unordered text from _a given file_ is a non-starter for most applications.  The implementation needs to guarantee order per file, and the user has to be able to link the "extract" back to a unique identifier for the document.  If the current implementation doesn't do those things, we need to change it, IMHO.
>
Right now the Tika-related reader does not associate a given text fragment
with the file name, so a function looking at some text and trying to
find where it came from won't be able to do so.

So I asked how to do it in Beam, how to attach some context to the given
piece of data. I hope it can be done and if not - then perhaps some
improvement can be applied.

Re the unordered text - yes - this is what we currently have with Beam +
TikaIO :-).

The use case I referred to earlier in this thread (PDFs are uploaded,
the possibly unordered text is saved to Lucene with the file name
'attached', and users then search for the files containing some words or
phrases - this works OK given that I can see, for example, the PDF
parser reporting whole lines) can be supported with the current TikaIO,
provided we find a way to 'attach' a file name to the flow.

I see, though, that supporting total ordering can be a big deal in other
cases. Eugene, can you please explain how it can be done - is it
achievable in principle, without the users having to do some custom
coding ?

> To the question of -- why is this in Beam at all; why don't we let users call it if they want it?...
>
> No matter how much we do to Tika, it will behave badly sometimes -- permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks using Beam -- folks likely with large batches of unruly/noisy documents -- are more likely to run into these problems than your average couple-of-thousand-docs-from-our-own-company user. So, if there are things we can do in Beam to prevent developers around the world from having to reinvent the wheel for defenses against these problems, then I'd be enormously grateful if we could put Tika into Beam.  That means:
>
> 1) a process-level timeout (because you can't actually kill a thread in Java)
> 2) a process-level restart on OOM
> 3) avoid trying to reprocess a badly behaving document
>
> If Beam automatically handles those problems, then I'd say, y, let users write their own code.  If there is so much as a single configuration knob (and it sounds like Beam is against complex configuration...yay!) to get that working in Beam, then I'd say, please integrate Tika into Beam.  From a safety perspective, it is critical to keep the extraction process entirely separate (jvm, vm, m, rack, data center!) from the transformation+loading steps.  IMHO, very few devs realize this because Tika works well lots of the time...which is why it is critical for us to make it easy for people to get it right all of the time.
>
> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch mode first in one jvm, and then I kick off another process to do transform/loading into Lucene/Solr from the .json files that Tika generates for each input file.  If I were to scale up, I'd want to maintain this complete separation of steps.
>
> Apologies if I've derailed the conversation or misunderstood this thread.
>
Major thanks for your input :-)

Cheers, Sergey
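For what it's worth, Tim's first defensive measure above (a process-level timeout) can at least be approximated with plain JDK primitives. A hedged sketch only: as Tim says, a truly stuck thread cannot be killed in Java, so a real defense would run the parser in a separate process - this in-JVM version just illustrates the control flow:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseTimeoutSketch {
  // Wraps an arbitrary parse call with a hard deadline. On timeout the
  // worker thread is interrupted (best effort - an uncooperative parser
  // may ignore it, which is why process isolation is the real answer).
  public static <T> T callWithTimeout(Callable<T> parseCall,
                                      long timeout, TimeUnit unit)
      throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<T> result = executor.submit(parseCall);
      try {
        return result.get(timeout, unit);
      } catch (TimeoutException e) {
        result.cancel(true); // interrupts; cannot force-kill the thread
        throw e;
      }
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // A well-behaved "parse" finishes in time:
    System.out.println(callWithTimeout(() -> "parsed", 1, TimeUnit.SECONDS));
    // A hanging "parse" is abandoned after the deadline:
    try {
      callWithTimeout(() -> { Thread.sleep(60_000); return "never"; },
                      100, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      System.out.println("timed out");
    }
  }
}
```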

> Cheers,
>
>                 Tim
>
> -----Original Message-----
> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
> Sent: Thursday, September 21, 2017 9:07 AM
> To: dev@beam.apache.org
> Cc: Allison, Timothy B. <ta...@mitre.org>
> Subject: Re: TikaIO concerns
>
> Hi All
>
> Please welcome Tim, one of Apache Tika leads and practitioners.
>
> Tim, thanks for joining in :-). If you have some great Apache Tika stories to share (preferably involving cases where the order in which
> Tika-produced data were dealt with by the consumers did not really matter) then please do so :-).
>
> At the moment, even though Tika ContentHandler will emit the ordered data, the Beam runtime will have no guarantees that the downstream pipeline components will see the data coming in the right order.
>
> (FYI, I understand from the earlier comments that the total ordering is also achievable but would require the extra API support)
>
> Other comments would be welcome too
>
> Thanks, Sergey
>
> On 21/09/17 10:55, Sergey Beryozkin wrote:
>> I noticed that the PDF and ODT parsers actually split by lines, not
>> individual words, and I'm nearly 100% sure I saw Tika reporting
>> individual lines when it was parsing text files. The 'min text length'
>> feature can help with reporting several lines at a time, etc...
>>
>> I'm working with this PDF all the time:
>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
>>
>> try it too if you get a chance.
>>
>> (and I can imagine that not all PDFs/etc. represent a 'story' - some
>> can, for example, contain log-like content too)
>>
>> That said, I don't know how a parser for the format N will behave, it
>> depends on the individual parsers.
>>
>> IMHO it's an equal candidate alongside Text-based bounded IOs...
>>
>> I'd like to know though how to make a file name available to the
>> pipeline which is working with the current text fragment ?
>>
>> Going to try and do some measurements and compare the sync vs async
>> parsing modes...
>>
>> Asked the Tika team to support with some more examples...
>>
>> Cheers, Sergey
>> On 20/09/17 22:17, Sergey Beryozkin wrote:
>>> Hi,
>>>
>>> thanks for the explanations,
>>>
>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>>> Hi!
>>>>
>>>> TextIO returns an unordered soup of lines contained in all files you
>>>> ask it to read. People usually use TextIO for reading files where 1
>>>> line corresponds to 1 independent data element, e.g. a log entry, or
>>>> a row of a CSV file - so discarding order is ok.
>>> Just a side note, I'd probably want that to be ordered, though I guess
>>> it depends...
>>>> However, there is a number of cases where TextIO is a poor fit:
>>>> - Cases where discarding order is not ok - e.g. if you're doing
>>>> natural language processing and the text files contain actual prose,
>>>> where you need to process a file as a whole. TextIO can't do that.
>>>> - Cases where you need to remember which file each element came
>>>> from, e.g.
>>>> if you're creating a search index for the files: TextIO can't do
>>>> this either.
>>>>
>>>> Both of these issues have been raised in the past against TextIO;
>>>> however it seems that the overwhelming majority of users of TextIO
>>>> use it for logs or CSV files or alike, so solving these issues has
>>>> not been a priority.
>>>> Currently they are solved in a general form via FileIO.read() which
>>>> gives you access to reading a full file yourself - people who want
>>>> more flexibility will be able to use standard Java text-parsing
>>>> utilities on a ReadableFile, without involving TextIO.
>>>>
>>>> Same applies for XmlIO: it is specifically designed for the narrow
>>>> use case where the files contain independent data entries, so
>>>> returning an unordered soup of them, with no association to the
>>>> original file, is the user's intention. XmlIO will not work for
>>>> processing more complex XML files that are not simply a sequence of
>>>> entries with the same tag, and it also does not remember the
>>>> original filename.
>>>>
>>>
>>> OK...
>>>
>>>> However, if my understanding of Tika use cases is correct, it is
>>>> mainly used for extracting content from complex file formats - for
>>>> example, extracting text and images from PDF files or Word
>>>> documents. I believe this is the main difference between it and
>>>> TextIO - people usually use Tika for complex use cases where the
>>>> "unordered soup of stuff" abstraction is not useful.
>>>>
>>>> My suspicion about this is confirmed by the fact that the crux of
>>>> the Tika API is ContentHandler
>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
>>>> html?is-external=true
>>>>
>>>> whose
>>>> documentation says "The order of events in this interface is very
>>>> important, and mirrors the order of information in the document itself."
>>> All that says is that a (Tika) ContentHandler will be a true SAX
>>> ContentHandler...
>>>>
>>>> Let me give a few examples of what I think is possible with the raw
>>>> Tika API, but I think is not currently possible with TikaIO - please
>>>> correct me where I'm wrong, because I'm not particularly familiar
>>>> with Tika and am judging just based on what I read about it.
>>>> - User has 100,000 Word documents and wants to convert each of them
>>>> to text files for future natural language processing.
>>>> - User has 100,000 PDF files with financial statements, each
>>>> containing a bunch of unrelated text and - the main content - a list
>>>> of transactions in PDF tables. User wants to extract each
>>>> transaction as a PCollection element, discarding the unrelated text.
>>>> - User has 100,000 PDF files with scientific papers, and wants to
>>>> extract text from them, somehow parse author and affiliation from
>>>> the text, and compute statistics of topics and terminology usage by
>>>> author name and affiliation.
>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>>>> observing a location over time: they want to extract metadata from
>>>> each image using Tika, analyze the images themselves using some
>>>> other library, and detect anomalies in the overall appearance of the
>>>> location over time as seen from multiple cameras.
>>>> I believe all of these cases can not be solved with TikaIO because
>>>> the resulting PCollection<String> contains no information about
>>>> which String comes from which document and about the order in which
>>>> they appear in the document.
>>> These are good use cases, thanks... I thought you were talking
>>> about the unordered soup of data produced by TikaIO (and its friends
>>> TextIO and alike :-)).
>>> Putting the ordered vs unordered question aside for a sec, why
>>> exactly a Tika Reader can not make the name of the file it's
>>> currently reading from available to the pipeline, as some Beam pipeline metadata piece ?
>>> Surely it can be possible with Beam ? If not then I would be surprised...
>>>
>>>>
>>>> I am, honestly, struggling to think of a case where I would want to
>>>> use Tika, but where I *would* be ok with getting an unordered soup
>>>> of strings.
>>>> So some examples would be very helpful.
>>>>
>>> Yes. I'll ask Tika developers to help with some examples, but I'll
>>> give one example where it did not matter to us in what order
>>> Tika-produced data were available to the downstream layer.
>>>
>>> It's a demo the Apache CXF colleague of mine showed at one of Apache
>>> Con NAs, and we had a happy audience:
>>>
>>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea
>>> se/samples/jax_rs/search
>>>
>>>
>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
>>> into Lucene. We associate a file name with the indexed content and
>>> then let users find a list of PDF files which contain a given word or
>>> few words, details are here
>>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea
>>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal
>>> og.java#L131
>>>
>>>
>>> I'd say even more involved search engines would not mind supporting a
>>> case like that :-)
>>>
>>> Now there we process one file at a time, and I understand now that
>>> with TikaIO and N files it's all over the place really as far as the
>>> ordering is concerned, which file it's coming from, etc. That's why
>>> TikaReader must be able to associate the file name with a given piece
>>> of text it's making available to the pipeline.
>>>
>>> I'd be happy to support the ParDo way of linking Tika with Beam.
>>> If it makes things simpler then it would be good, I've just no idea
>>> at the moment how to start the pipeline without using a
>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
>>> earlier - how can one avoid it with ParDo when implementing a 'min
>>> len chunk' feature, where the ParDo would have to concatenate several
>>> SAX data pieces first before making a single composite piece available to the pipeline ?
>>>
>>>
>>>> Another way to state it: currently, if I wanted to solve all of the
>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
>>>> API myself on the resulting ReadableFile. How can we make TikaIO
>>>> provide a usability improvement over such usage?
>>>>
>>>
>>>
>>> If you are actually asking, does it really make sense for Beam to
>>> ship Tika related code, given that users can just do it themselves,
>>> I'm not sure.
>>>
>>> IMHO it always works better if users have to provide just a few config
>>> options to an integral part of the framework and see things happening.
>>> It will bring more users.
>>>
>>> Whether the current Tika code (refactored or not) stays with Beam or
>>> not - I'll let you and the team decide; believe it or not I was
>>> seriously contemplating at the last moment to make it all part of the
>>> Tika project itself and have a bit more flexibility over there with
>>> tweaking things, but now that it is in the Beam snapshot - I don't
>>> know - it's not my decision...
>>>
>>>> I am confused by your other comment - "Does the ordering matter ?
>>>> Perhaps
>>>> for some cases it does, and for some it does not. May be it makes
>>>> sense to support running TikaIO as both the bounded reader/source
>>>> and ParDo, with getting the common code reused." - because using
>>>> BoundedReader or ParDo is not related to the ordering issue, only to
>>>> the issue of asynchronous reading and complexity of implementation.
>>>> The resulting PCollection will be unordered either way - this needs
>>>> to be solved separately by providing a different API.
>>> Right I see now, so ParDo is not about making Tika reported data
>>> available to the downstream pipeline components ordered, only about
>>> the simpler implementation.
>>> Association with the file should be possible I hope, but I understand
>>> it would be possible to optionally make the data coming out in the
>>> ordered way as well...
>>>
>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
>>> let me double check: should we still give some thought to the
>>> possible performance benefit of the current approach ? As I said, I
>>> can easily get rid of all that polling code, use a simple BlockingQueue.
>>>
>>> Cheers, Sergey
>>>>
>>>> Thanks.
>>>>
>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
>>>> <sb...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Glad TikaIO getting some serious attention :-), I believe one thing
>>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>>
>>>>> Before trying to reply online, I'd like to state that my main
>>>>> assumption is that TikaIO (as far as the read side is concerned) is
>>>>> no different to Text, XML or similar bounded reader components.
>>>>>
>>>>> I have to admit I don't understand your questions about TikaIO
>>>>> usecases.
>>>>>
>>>>> What are the Text Input or XML input use-cases ? These use cases
>>>>> are TikaInput cases as well; the only difference is that Tika can not
>>>>> split the individual file into a sequence of sources/etc.
>>>>>
>>>>> TextIO can read from the plain text files (possibly zipped), XML -
>>>>> optimized around reading from the XML files, and I thought I made
>>>>> it clear (and it is a known fact anyway) Tika was about reading
>>>>> basically from any file format.
>>>>>
>>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>>
>>>>> Sergey
>>>>>
>>>>>
>>>>>
>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Replies inline.
>>>>>>
>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
>>>>>> <sb...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All
>>>>>>>
>>>>>>> This is my first post to the dev list, I work for Talend, I'm a
>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
>>>>>>> great to try and link both projects together, which led me to
>>>>>>> opening [1] where I typed some early thoughts, followed by PR
>>>>>>> [2].
>>>>>>>
>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
>>>>>>> newer review comments from Eugene pending, so I'd like to
>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
>>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>>
>>>>>>> Apache Tika Parsers report the text content in chunks, via
>>>>>>> SaxParser events. It's not possible with Tika to take a file and
>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
>>>>>>> by line, the only way is to handle the SAXParser callbacks which
>>>>>>> report the data chunks.
>>>>>>> Some
>>>>>>> parsers may report the complete lines, some individual words,
>>>>>>> with some being able to report the data only after they completely
>>>>>>> parse the document.
>>>>>>> All depends on the data format.
>>>>>>>
>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
>>>>>>> to parse the files, Beam threads will only collect the data from
>>>>>>> the internal queue where the internal TikaReader's thread will
>>>>>>> put the data into (note the data chunks are ordered even though
>>>>>>> the tests might suggest otherwise).
>>>>>>>
>>>>>> I agree that your implementation of reader returns records in
>>>>>> order
>>>>>> - but
>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
>>>>>> the order in which records are produced by a BoundedReader - the
>>>>>> order produced by your reader is ignored, and when applying any
>>>>>> transforms to the
>>>>> PCollection
>>>>>> produced by TikaIO, it is impossible to recover the order in which
>>>>>> your reader returned the records.
>>>>>>
>>>>>> With that in mind, is PCollection<String>, containing individual
>>>>>> Tika-detected items, still the right API for representing the
>>>>>> result of parsing a large number of documents with Tika?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> The reason I did it was because I thought
>>>>>>>
>>>>>>> 1) it would make the individual data chunks available faster to
>>>>>>> the pipeline - the parser will continue working via the
>>>>>>> binary/video etc file while the data will already start flowing -
>>>>>>> I agree there should be some tests data available confirming it -
>>>>>>> but I'm positive at the moment this approach might yield some
>>>>>>> performance gains with the large sets. If the file is large, if
>>>>>>> it has the embedded attachments/videos to deal with, then it may
>>>>>>> be more effective not to have the Beam thread deal with it...
>>>>>>>
>>>>>>> As I said on the PR, this description contains unfounded and
>>>>>>> potentially
>>>>>> incorrect assumptions about how Beam runners execute (or may
>>>>>> execute in
>>>>> the
>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>>> correctly,
>>>>>> you might be assuming that:
>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>>> complete
>>>>>> before processing its outputs with downstream transforms
>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>>> *concurrently*
>>>>>> with downstream processing of its results
>>>>>> - Passing an element from one thread to another using a
>>>>>> BlockingQueue is free in terms of performance. All of these are
>>>>>> false at least in some runners, and I'm almost certain that in
>>>>>> reality, performance of this approach is worse than a ParDo in
>>>>> most
>>>>>> production runners.
>>>>>>
>>>>>> There are other disadvantages to this approach:
>>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>>> invisible
>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>>>> profiling capabilities, or the ability to get the current stack
>>>>>> trace for stuck elements, this approach would make the real
>>>>>> processing invisible to all of these capabilities, and a user
>>>>>> would only see that the bulk of the time is spent waiting for the
>>>>>> next element, but not *why* the next
>>>>> element
>>>>>> is taking long to compute.
>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>>>> invisible to Beam, will make it harder for runners to do
>>>>>> autoscaling, binpacking
>>>>> and
>>>>>> other resource management magic (how much of this runners actually
>>>>>> do is
>>>>> a
>>>>>> separate issue), because the runner will have no way of knowing
>>>>>> how much CPU/IO this particular transform is actually using - all
>>>>>> the processing happens in a thread about which the runner is
>>>>>> unaware.
>>>>>> - As far as I can tell, the code also hides exceptions that happen
>>>>>> in the Tika thread
>>>>>> - Adding the thread management makes the code much more complex,
>>>>>> easier
>>>>> to
>>>>>> introduce bugs, and harder for others to contribute
>>>>>>
>>>>>>
>>>>>>> 2) As I commented at the end of [2], having an option to
>>>>>>> concatenate the data chunks first before making them available to
>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
>>>>>>> introduce some synchronization issues (though not exactly sure
>>>>>>> yet)
>>>>>>>
>>>>>> What are these issues?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> One of valid concerns there is that the reader is polling the
>>>>>>> internal queue so, in theory at least, and perhaps in some rare
>>>>>>> cases too, we may have a case where the max polling time has been
>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
>>>>>>> the file data. I think that it can be solved by either 2a)
>>>>>>> configuring the max polling time to a very large number which
>>>>>>> will never be reached for a practical case, or
>>>>>>> 2b) simply use a blocking queue without the time limits - in the
>>>>>>> worst case, if TikaParser spins and fails to report the end of
>>>>>>> the document, then Beam can heal itself if the pipeline blocks.
>>>>>>> I propose to follow 2b).
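[Editor's note: a minimal, JDK-only sketch of option 2b) - an internal parser thread handing chunks to the reader over an unbounded BlockingQueue, with a sentinel ("poison pill") marking the end of the document so no polling timeout is needed. The names and the sentinel value are illustrative, not the actual TikaReader code:]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ChunkHandoff {
    // Sentinel marking end-of-document; any value that cannot be real content works.
    static final String END_OF_DOCUMENT = "\u0000EOD";

    // Consumer side: block until the parser thread reports the next chunk.
    // No polling timeout - in the worst case the pipeline simply blocks.
    public static List<String> drain(BlockingQueue<String> queue) throws InterruptedException {
        List<String> chunks = new ArrayList<>();
        for (String chunk = queue.take(); !END_OF_DOCUMENT.equals(chunk); chunk = queue.take()) {
            chunks.add(chunk);
        }
        return chunks;
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        // Producer stands in for the internal Tika parser thread emitting SAX chunks.
        Thread parserThread = new Thread(() -> {
            try {
                queue.put("chunk-1");
                queue.put("chunk-2");
                queue.put(END_OF_DOCUMENT); // must be signalled on failure paths too
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        parserThread.start();
        List<String> chunks = drain(queue);
        parserThread.join();
        System.out.println(chunks); // [chunk-1, chunk-2]
    }
}
```

Note the failure-path caveat: if the parser thread dies without enqueueing the sentinel (or an error marker), the consumer blocks forever - which is exactly the exception-hiding concern raised above.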
>>>>>>>
>>>>>> I agree that there should be no way to unintentionally configure
>>>>>> the transform in a way that will produce silent data loss. Another
>>>>>> reason for not having these tuning knobs is that it goes against
>>>>>> Beam's "no knobs"
>>>>>> philosophy, and that in most cases users have no way of figuring
>>>>>> out a
>>>>> good
>>>>>> value for tuning knobs except for manual experimentation, which is
>>>>>> extremely brittle and typically gets immediately obsoleted by
>>>>>> running on
>>>>> a
>>>>>> new dataset or updating a version of some of the involved
>>>>>> dependencies
>>>>> etc.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Please let me know what you think.
>>>>>>> My plan so far is:
>>>>>>> 1) start addressing most of Eugene's comments which would require
>>>>>>> some minor TikaIO updates
>>>>>>> 2) work on removing the TikaSource internal code dealing with
>>>>>>> File patterns which I copied from TextIO at the next stage
>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
>>>>>>> users some time to try it with some real complex files and also
>>>>>>> decide if TikaIO can continue implemented as a
>>>>>>> BoundedSource/Reader or not
>>>>>>>
>>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>>
>>>>>> Yes, but I think we should start by discussing the anticipated use
>>>>>> cases
>>>>> of
>>>>>> TikaIO and designing an API for it based on those use cases; and
>>>>>> then see what's the best implementation for that particular API
>>>>>> and set of anticipated use cases.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks, Sergey
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>

RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Like Sergey, it’ll take me some time to understand your recommendations.  Thank you!

On one small point:
>return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata }

For this option, I’d strongly encourage using the Json output from the RecursiveParserWrapper that contains metadata and content, and captures metadata even from embedded documents.

> However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea
Large documents are a problem, no doubt about it…

From: Eugene Kirpichov [mailto:kirpichov@google.com]
Sent: Thursday, September 21, 2017 4:41 PM
To: Allison, Timothy B. <ta...@mitre.org>; dev@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Thanks all for the discussion. It seems we have consensus that both within-document order and association with the original filename are necessary, but currently absent from TikaIO.

Association with original file:
Sergey - Beam does not automatically provide a way to associate an element with the file it originated from: automatically tracking data provenance is a known very hard research problem on which many papers have been written, and obvious solutions are very easy to break. See related discussion at https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E .

If you want the elements of your PCollection to contain additional information, you need the elements themselves to contain this information: the elements are self-contained and have no metadata associated with them (beyond the timestamp and windows, universal to the whole Beam model).

Order within a file:
The only way to have any kind of order within a PCollection is to have the elements of the PCollection contain something ordered, e.g. have a PCollection<List<Something>>, where each List is for one file [I'm assuming Tika, at a low level, works on a per-file basis?]. However, since TikaIO can be applied to very large files, this could produce very large elements, which is a bad idea. Because of this, I don't think the result of applying Tika to a single file can be encoded as a PCollection element.

Given both of these, I think that it's not possible to create a general-purpose TikaIO transform that will be better than manual invocation of Tika as a DoFn on the result of FileIO.readMatches().

However, looking at the examples at https://tika.apache.org/1.16/examples.html - almost all of the examples involve extracting a single String from each document. This use case, with the assumption that individual documents are small enough, can certainly be simplified and TikaIO could be a facade for doing just this.

E.g. TikaIO could:
- take as input a PCollection<ReadableFile>
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a class with properties { String content, Metadata metadata }
- be configured by: a Parser (it implements Serializable so can be specified at pipeline construction time) and a ContentHandler whose toString() will go into "content". ContentHandler does not implement Serializable, so you can not specify it at construction time - however, you can let the user specify either its class (if it's a simple handler like a BodyContentHandler) or specify a lambda for creating the handler (SerializableFunction<Void, ContentHandler>), and potentially you can have a simpler facade for Tika.parseAsString() - e.g. call it TikaIO.parseAllAsStrings().

Example usage would look like:

  PCollection<KV<String, ParseResult>> parseResults = p.apply(FileIO.match().filepattern(...))
    .apply(FileIO.readMatches())
    .apply(TikaIO.parseAllAsStrings())

or:

    .apply(TikaIO.parseAll()
        .withParser(new AutoDetectParser())
        .withContentHandler(() -> new BodyContentHandler(new ToXMLContentHandler())))

You could also have shorthands for letting the user avoid using FileIO directly in simple cases, for example:
    p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and you'll be able to share the code between parseAll and regular parse.

On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sb...@gmail.com> wrote:
Hi Tim
On 21/09/17 14:33, Allison, Timothy B. wrote:
> Thank you, Sergey.
>
> My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a chance to work with it yet.
>
>  From my perspective, if I understand this thread (and I may not!), getting unordered text from _a given file_ is a non-starter for most applications.  The implementation needs to guarantee order per file, and the user has to be able to link the "extract" back to a unique identifier for the document.  If the current implementation doesn't do those things, we need to change it, IMHO.
>
Right now the Tika-related reader does not associate a given text fragment
with the file name, so a function looking at some text and trying to
find where it came from won't be able to do so.

So I asked how to do it in Beam, how to attach some context to the given
piece of data. I hope it can be done and if not - then perhaps some
improvement can be applied.

Re the unordered text - yes - this is what we currently have with Beam +
TikaIO :-).

The use-case I referred to earlier in this thread (upload PDFs, save
the possibly unordered text to Lucene with the file name 'attached', let
users search for the files containing some words or phrases; this works
OK given that I can see, for example, the PDF parser reporting complete
lines) can be supported with the current TikaIO (provided we find a way
to 'attach' a file name to the flow).

I see though supporting the total ordering can be a big deal in other
cases. Eugene, can you please explain how it can be done, is it
achievable in principle, without the users having to do some custom
coding ?

> To the question of -- why is this in Beam at all; why don't we let users call it if they want it?...
>
> No matter how much we do to Tika, it will behave badly sometimes -- permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks using Beam -- folks likely with large batches of unruly/noisy documents -- are more likely to run into these problems than your average couple-of-thousand-docs-from-our-own-company user. So, if there are things we can do in Beam to prevent developers around the world from having to reinvent the wheel for defenses against these problems, then I'd be enormously grateful if we could put Tika into Beam.  That means:
>
> 1) a process-level timeout (because you can't actually kill a thread in Java)
> 2) a process-level restart on OOM
> 3) avoid trying to reprocess a badly behaving document
>
> If Beam automatically handles those problems, then I'd say, y, let users write their own code.  If there is so much as a single configuration knob (and it sounds like Beam is against complex configuration...yay!) to get that working in Beam, then I'd say, please integrate Tika into Beam.  From a safety perspective, it is critical to keep the extraction process entirely separate (jvm, vm, m, rack, data center!) from the transformation+loading steps.  IMHO, very few devs realize this because Tika works well lots of the time...which is why it is critical for us to make it easy for people to get it right all of the time.
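[Editor's note: Tim's point 1) - a process-level timeout, since a hung thread cannot be killed in Java - can be sketched with nothing but the JDK: run the extraction in a child process and destroy it forcibly if it exceeds a wall-clock budget. The command line below is a placeholder for whatever wrapper actually invokes Tika; `sleep`/`true` in the example assume a POSIX system:]

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ProcessTimeout {
    /**
     * Runs the given command in a child process and kills it (kill -9 style)
     * if it does not finish within timeoutSeconds. Returns true if the
     * process completed on its own, false if it had to be destroyed.
     */
    public static boolean runWithTimeout(List<String> command, long timeoutSeconds)
            throws Exception {
        Process process = new ProcessBuilder(command).inheritIO().start();
        if (process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            return true; // finished; check process.exitValue() for success/failure
        }
        process.destroyForcibly().waitFor(); // SIGKILL equivalent, then reap the child
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder command - in a real setup this would be the JVM running Tika,
        // e.g. "java -cp ... SomeTikaBatchWrapper <input> <output>" (hypothetical).
        boolean finished = runWithTimeout(Arrays.asList("sleep", "5"), 1);
        System.out.println("finished in time: " + finished); // finished in time: false
    }
}
```

This only addresses hangs; a restart-on-OOM policy (point 2) would wrap the same call in a retry loop keyed on the child's exit code.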
>
> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch mode first in one jvm, and then I kick off another process to do transform/loading into Lucene/Solr from the .json files that Tika generates for each input file.  If I were to scale up, I'd want to maintain this complete separation of steps.
>
> Apologies if I've derailed the conversation or misunderstood this thread.
>
Major thanks for your input :-)

Cheers, Sergey

> Cheers,
>
>                 Tim
>
> -----Original Message-----
> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
> Sent: Thursday, September 21, 2017 9:07 AM
> To: dev@beam.apache.org
> Cc: Allison, Timothy B. <ta...@mitre.org>
> Subject: Re: TikaIO concerns
>
> Hi All
>
> Please welcome Tim, one of Apache Tika leads and practitioners.
>
> Tim, thanks for joining in :-). If you have some great Apache Tika stories to share (preferably involving the cases where the order in which Tika-produced data were dealt with by the
> consumers did not really matter) then please do so :-).
>
> At the moment, even though Tika ContentHandler will emit the ordered data, the Beam runtime will have no guarantees that the downstream pipeline components will see the data coming in the right order.
>
> (FYI, I understand from the earlier comments that the total ordering is also achievable but would require the extra API support)
>
> Other comments would be welcome too
>
> Thanks, Sergey
>
> On 21/09/17 10:55, Sergey Beryozkin wrote:
>> I noticed that the PDF and ODT parsers actually split by lines, not
>> individual words, and I'm nearly 100% sure I saw Tika reporting individual
>> lines when it was parsing the text files. The 'min text length'
>> feature can help with reporting several lines at a time, etc...
>>
>> I'm working with this PDF all the time:
>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
>>
>> try it too if you get a chance.
>>
>> (and I can imagine not all PDFs/etc representing the 'story' but can
>> be for ex a log-like content too)
>>
>> That said, I don't know how a parser for the format N will behave, it
>> depends on the individual parsers.
>>
>> IMHO it's an equal candidate alongside Text-based bounded IOs...
>>
>> I'd like to know, though, how to make the file name available to the
>> part of the pipeline which is working with the current text fragment ?
>>
>> Going to try and do some measurements and compare the sync vs async
>> parsing modes...
>>
>> Asked the Tika team to support with some more examples...
>>
>> Cheers, Sergey
>> On 20/09/17 22:17, Sergey Beryozkin wrote:
>>> Hi,
>>>
>>> thanks for the explanations,
>>>
>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>>> Hi!
>>>>
>>>> TextIO returns an unordered soup of lines contained in all files you
>>>> ask it to read. People usually use TextIO for reading files where 1
>>>> line corresponds to 1 independent data element, e.g. a log entry, or
>>>> a row of a CSV file - so discarding order is ok.
>>> Just a side note, I'd probably want that to be ordered, though I guess
>>> it depends...
>>>> However, there is a number of cases where TextIO is a poor fit:
>>>> - Cases where discarding order is not ok - e.g. if you're doing
>>>> natural language processing and the text files contain actual prose,
>>>> where you need to process a file as a whole. TextIO can't do that.
>>>> - Cases where you need to remember which file each element came
>>>> from, e.g.
>>>> if you're creating a search index for the files: TextIO can't do
>>>> this either.
>>>>
>>>> Both of these issues have been raised in the past against TextIO;
>>>> however it seems that the overwhelming majority of users of TextIO
>>>> use it for logs or CSV files or alike, so solving these issues has
>>>> not been a priority.
>>>> Currently they are solved in a general form via FileIO.read() which
>>>> gives you access to reading a full file yourself - people who want
>>>> more flexibility will be able to use standard Java text-parsing
>>>> utilities on a ReadableFile, without involving TextIO.
>>>>
>>>> Same applies for XmlIO: it is specifically designed for the narrow
>>>> use case where the files contain independent data entries, so
>>>> returning an unordered soup of them, with no association to the
>>>> original file, is the user's intention. XmlIO will not work for
>>>> processing more complex XML files that are not simply a sequence of
>>>> entries with the same tag, and it also does not remember the
>>>> original filename.
>>>>
>>>
>>> OK...
>>>
>>>> However, if my understanding of Tika use cases is correct, it is
>>>> mainly used for extracting content from complex file formats - for
>>>> example, extracting text and images from PDF files or Word
>>>> documents. I believe this is the main difference between it and
>>>> TextIO - people usually use Tika for complex use cases where the
>>>> "unordered soup of stuff" abstraction is not useful.
>>>>
>>>> My suspicion about this is confirmed by the fact that the crux of
>>>> the Tika API is ContentHandler
>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
>>>> html?is-external=true
>>>>
>>>> whose
>>>> documentation says "The order of events in this interface is very
>>>> important, and mirrors the order of information in the document itself."
>>> All that says is that a (Tika) ContentHandler will be a true SAX
>>> ContentHandler...
>>>>
>>>> Let me give a few examples of what I think is possible with the raw
>>>> Tika API, but I think is not currently possible with TikaIO - please
>>>> correct me where I'm wrong, because I'm not particularly familiar
>>>> with Tika and am judging just based on what I read about it.
>>>> - User has 100,000 Word documents and wants to convert each of them
>>>> to text files for future natural language processing.
>>>> - User has 100,000 PDF files with financial statements, each
>>>> containing a bunch of unrelated text and - the main content - a list
>>>> of transactions in PDF tables. User wants to extract each
>>>> transaction as a PCollection element, discarding the unrelated text.
>>>> - User has 100,000 PDF files with scientific papers, and wants to
>>>> extract text from them, somehow parse author and affiliation from
>>>> the text, and compute statistics of topics and terminology usage by
>>>> author name and affiliation.
>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>>>> observing a location over time: they want to extract metadata from
>>>> each image using Tika, analyze the images themselves using some
>>>> other library, and detect anomalies in the overall appearance of the
>>>> location over time as seen from multiple cameras.
>>>> I believe all of these cases can not be solved with TikaIO because
>>>> the resulting PCollection<String> contains no information about
>>>> which String comes from which document and about the order in which
>>>> they appear in the document.
>>> These are good use cases, thanks... I thought you were talking
>>> about the unordered soup of data produced by TikaIO (and its friends
>>> TextIO and alike :-)).
>>> Putting the ordered vs unordered question aside for a sec, why
>>> exactly a Tika Reader can not make the name of the file it's
>>> currently reading from available to the pipeline, as a piece of Beam pipeline metadata ?
>>> Surely it can be possible with Beam ? If not then I would be surprised...
>>>
>>>>
>>>> I am, honestly, struggling to think of a case where I would want to
>>>> use Tika, but where I *would* be ok with getting an unordered soup
>>>> of strings.
>>>> So some examples would be very helpful.
>>>>
>>> Yes. I'll ask Tika developers to help with some examples, but I'll
>>> give one example where it did not matter to us in what order
>>> Tika-produced data were available to the downstream layer.
>>>
>>> It's a demo the Apache CXF colleague of mine showed at one of Apache
>>> Con NAs, and we had a happy audience:
>>>
>>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea
>>> se/samples/jax_rs/search
>>>
>>>
>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
>>> into Lucene. We associate a file name with the indexed content and
>>> then let users find a list of PDF files which contain a given word or
>>> few words, details are here
>>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea
>>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal
>>> og.java#L131
>>>
>>>
>>> I'd say even more involved search engines would not mind supporting a
>>> case like that :-)
>>>
>>> Now there we process one file at a time, and I understand now that
>>> with TikaIO and N files it's all over the place really as far as the
>>> ordering is concerned, which file it's coming from. etc. That's why
>>> TikaReader must be able to associate the file name with a given piece
>>> of text it's making available to the pipeline.
>>>
>>> I'd be happy to support the ParDo way of linking Tika with Beam.
>>> If it makes things simpler then it would be good, I've just no idea
>>> at the moment how to start the pipeline without using a
>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
>>> earlier - how can one avoid it with ParDo when implementing a 'min
>>> len chunk' feature, where the ParDo would have to concatenate several
>>> SAX data pieces first before making a single composite piece available to the pipeline ?
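[Editor's note: for what it's worth, the 'min len chunk' concatenation needs no cross-thread synchronization if it lives inside the ContentHandler itself: the handler buffers characters() callbacks until a minimum size is reached, then emits one composite chunk. A JDK-only sketch (org.xml.sax ships with the JDK); the Consumer is a stand-in for whatever hands chunks to the pipeline, and the class name is illustrative:]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class MinLengthChunkHandler extends DefaultHandler {
    private final int minChunkLength;
    private final Consumer<String> chunkConsumer; // stand-in for the pipeline hand-off
    private final StringBuilder buffer = new StringBuilder();

    public MinLengthChunkHandler(int minChunkLength, Consumer<String> chunkConsumer) {
        this.minChunkLength = minChunkLength;
        this.chunkConsumer = chunkConsumer;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        buffer.append(ch, start, length);
        if (buffer.length() >= minChunkLength) {
            flush();
        }
    }

    @Override
    public void endDocument() throws SAXException {
        flush(); // emit whatever is left, even if below the minimum
    }

    private void flush() {
        if (buffer.length() > 0) {
            chunkConsumer.accept(buffer.toString());
            buffer.setLength(0);
        }
    }

    public static void main(String[] args) throws SAXException {
        List<String> chunks = new ArrayList<>();
        MinLengthChunkHandler handler = new MinLengthChunkHandler(10, chunks::add);
        // Simulate a parser reporting small SAX fragments (words/lines).
        for (String piece : new String[] {"Hello ", "Tika ", "and ", "Beam"}) {
            handler.characters(piece.toCharArray(), 0, piece.length());
        }
        handler.endDocument();
        System.out.println(chunks); // [Hello Tika , and Beam]
    }
}
```

Since Tika drives a single ContentHandler per parse call, the buffering happens on the parsing thread and no queue or lock is involved.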
>>>
>>>
>>>> Another way to state it: currently, if I wanted to solve all of the
>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
>>>> API myself on the resulting ReadableFile. How can we make TikaIO
>>>> provide a usability improvement over such usage?
>>>>
>>>
>>>
>>> If you are actually asking whether it really makes sense for Beam to
>>> ship Tika-related code, given that users can just do it themselves -
>>> I'm not sure.
>>>
>>> IMHO it always works better if users have to provide just a few config
>>> options to an integral part of the framework and see things happening.
>>> It will bring more users.
>>>
>>> Whether the current Tika code (refactored or not) stays with Beam or
>>> not - I'll let you and the team decide; believe it or not I was
>>> seriously contemplating at the last moment to make it all part of the
>>> Tika project itself and have a bit more flexibility over there with
>>> tweaking things, but now that it is in the Beam snapshot - I don't
>>> know - it's not my decision...
>>>
>>>> I am confused by your other comment - "Does the ordering matter ?
>>>> Perhaps
>>>> for some cases it does, and for some it does not. May be it makes
>>>> sense to support running TikaIO as both the bounded reader/source
>>>> and ParDo, with getting the common code reused." - because using
>>>> BoundedReader or ParDo is not related to the ordering issue, only to
>>>> the issue of asynchronous reading and complexity of implementation.
>>>> The resulting PCollection will be unordered either way - this needs
>>>> to be solved separately by providing a different API.
>>> Right, I see now: ParDo is not about making Tika-reported data
>>> available to the downstream pipeline components in order, only about
>>> a simpler implementation.
>>> Association with the file should be possible I hope, but I understand
>>> it would be possible to optionally make the data come out in the
>>> ordered way as well...
>>>
>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
>>> let me double check: should we still give some thought to the
>>> possible performance benefit of the current approach ? As I said, I
>>> can easily get rid of all that polling code, use a simple BlockingQueue.
>>>
>>> Cheers, Sergey
>>>>
>>>> Thanks.
>>>>
>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
>>>> <sb...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Glad TikaIO getting some serious attention :-), I believe one thing
>>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>>
>>>>> Before trying to reply online, I'd like to state that my main
>>>>> assumption is that TikaIO (as far as the read side is concerned) is
>>>>> no different to Text, XML or similar bounded reader components.
>>>>>
>>>>> I have to admit I don't understand your questions about TikaIO
>>>>> usecases.
>>>>>
>>>>> What are the Text Input or XML input use-cases ? These use cases
>>>>> are TikaInput cases as well; the only difference is that Tika can not
>>>>> split the individual file into a sequence of sources/etc.
>>>>>
>>>>> TextIO can read from the plain text files (possibly zipped), XML -
>>>>> optimized around reading from the XML files, and I thought I made
>>>>> it clear (and it is a known fact anyway) Tika was about reading
>>>>> basically from any file format.
>>>>>
>>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>>
>>>>> Sergey
>>>>>
>>>>>
>>>>>
>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Replies inline.
>>>>>>
>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
>>>>>> <sb...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All
>>>>>>>
>>>>>>> This is my first post to the dev list, I work for Talend, I'm a
>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
>>>>>>> great to try and link both projects together, which led me to
>>>>>>> opening [1] where I typed some early thoughts, followed by PR
>>>>>>> [2].
>>>>>>>
>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
>>>>>>> newer review comments from Eugene pending, so I'd like to
>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
>>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>>
>>>>>>> Apache Tika Parsers report the text content in chunks, via
>>>>>>> SaxParser events. It's not possible with Tika to take a file and
>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
>>>>>>> by line, the only way is to handle the SAXParser callbacks which
>>>>>>> report the data chunks.
>>>>>>> Some
>>>>>>> parsers may report the complete lines, some individual words,
>>>>>>> with some being able to report the data only after they completely
>>>>>>> parse the document.
>>>>>>> All depends on the data format.
>>>>>>>
>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
>>>>>>> to parse the files, Beam threads will only collect the data from
>>>>>>> the internal queue where the internal TikaReader's thread will
>>>>>>> put the data into (note the data chunks are ordered even though
>>>>>>> the tests might suggest otherwise).
>>>>>>>
>>>>>> I agree that your implementation of reader returns records in
>>>>>> order
>>>>>> - but
>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
>>>>>> the order in which records are produced by a BoundedReader - the
>>>>>> order produced by your reader is ignored, and when applying any
>>>>>> transforms to the
>>>>> PCollection
>>>>>> produced by TikaIO, it is impossible to recover the order in which
>>>>>> your reader returned the records.
>>>>>>
>>>>>> With that in mind, is PCollection<String>, containing individual
>>>>>> Tika-detected items, still the right API for representing the
>>>>>> result of parsing a large number of documents with Tika?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> The reason I did it was because I thought
>>>>>>>
>>>>>>> 1) it would make the individual data chunks available faster to
>>>>>>> the pipeline - the parser will continue working via the
>>>>>>> binary/video etc file while the data will already start flowing -
>>>>>>> I agree there should be some tests data available confirming it -
>>>>>>> but I'm positive at the moment this approach might yield some
>>>>>>> performance gains with the large sets. If the file is large, if
>>>>>>> it has the embedded attachments/videos to deal with, then it may
>>>>>>> be more effective not to get the Beam thread deal with it...
>>>>>>>
>>>>>> As I said on the PR, this description contains unfounded and
>>>>>> potentially
>>>>>> incorrect assumptions about how Beam runners execute (or may
>>>>>> execute in
>>>>> the
>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>>> correctly,
>>>>>> you might be assuming that:
>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>>> complete
>>>>>> before processing its outputs with downstream transforms
>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>>> *concurrently*
>>>>>> with downstream processing of its results
>>>>>> - Passing an element from one thread to another using a
>>>>>> BlockingQueue is free in terms of performance.
>>>>>> All of these are
>>>>>> false at least in some runners, and I'm almost certain that in
>>>>>> reality, performance of this approach is worse than a ParDo in
>>>>> most
>>>>>> production runners.
>>>>>>
>>>>>> There are other disadvantages to this approach:
>>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>>> invisible
>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>>>> profiling capabilities, or the ability to get the current stack
>>>>>> trace for stuck elements, this approach would make the real
>>>>>> processing invisible to all of these capabilities, and a user
>>>>>> would only see that the bulk of the time is spent waiting for the
>>>>>> next element, but not *why* the next
>>>>> element
>>>>>> is taking long to compute.
>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>>>> invisible to Beam, will make it harder for runners to do
>>>>>> autoscaling, binpacking
>>>>> and
>>>>>> other resource management magic (how much of this runners actually
>>>>>> do is
>>>>> a
>>>>>> separate issue), because the runner will have no way of knowing
>>>>>> how much CPU/IO this particular transform is actually using - all
>>>>>> the processing happens in a thread about which the runner is
>>>>>> unaware.
>>>>>> - As far as I can tell, the code also hides exceptions that happen
>>>>>> in the Tika thread
>>>>>> - Adding the thread management makes the code much more complex,
>>>>>> easier
>>>>> to
>>>>>> introduce bugs, and harder for others to contribute
>>>>>>
>>>>>>
>>>>>>> 2) As I commented at the end of [2], having an option to
>>>>>>> concatenate the data chunks first before making them available to
>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
>>>>>>> introduce some synchronization issues (though not exactly sure
>>>>>>> yet)
>>>>>>>
>>>>>> What are these issues?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> One of valid concerns there is that the reader is polling the
>>>>>>> internal queue so, in theory at least, and perhaps in some rare
>>>>>>> cases too, we may have a case where the max polling time has been
>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
>>>>>>> the file data. I think that it can be solved by either 2a)
>>>>>>> configuring the max polling time to a very large number which
>>>>>>> will never be reached for a practical case, or
>>>>>>> 2b) simply use a blocking queue without the time limits - in the
>>>>>>> worst case, if TikaParser spins and fails to report the end of
>>>>>>> the document, then Beam can heal itself if the pipeline blocks.
>>>>>>> I propose to follow 2b).
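
The 2a/2b distinction above comes down to java.util.concurrent semantics: poll(timeout) can return null while the parser is still working, silently dropping the rest of the file, whereas take() blocks until the next chunk arrives. A minimal stdlib-only sketch of the difference (not TikaIO code; the parser is simulated with a sleep):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PollVsTake {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);
        ExecutorService exec = Executors.newSingleThreadExecutor();
        // Simulates a slow parser thread that reports a chunk after 200 ms.
        exec.submit(() -> {
            try {
                Thread.sleep(200);
                queue.put("chunk");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return null;
        });

        // Option 2a: a too-small poll timeout returns null while the parser
        // is still busy - the reader would wrongly conclude the file is done.
        String polled = queue.poll(50, TimeUnit.MILLISECONDS);
        System.out.println("poll(50ms): " + polled); // prints "poll(50ms): null"

        // Option 2b: take() blocks until the parser reports the chunk.
        String taken = queue.take();
        System.out.println("take(): " + taken); // prints "take(): chunk"

        exec.shutdown();
    }
}
```

With take(), the worst case is a blocked pipeline (visible and diagnosable) rather than silent truncation, which is the argument for 2b.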
>>>>>>>
>>>>>> I agree that there should be no way to unintentionally configure
>>>>>> the transform in a way that will produce silent data loss. Another
>>>>>> reason for not having these tuning knobs is that it goes against
>>>>>> Beam's "no knobs"
>>>>>> philosophy, and that in most cases users have no way of figuring
>>>>>> out a
>>>>> good
>>>>>> value for tuning knobs except for manual experimentation, which is
>>>>>> extremely brittle and typically gets immediately obsoleted by
>>>>>> running on
>>>>> a
>>>>>> new dataset or updating a version of some of the involved
>>>>>> dependencies
>>>>> etc.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Please let me know what you think.
>>>>>>> My plan so far is:
>>>>>>> 1) start addressing most of Eugene's comments which would require
>>>>>>> some minor TikaIO updates
>>>>>>> 2) work on removing the TikaSource internal code dealing with
>>>>>>> File patterns which I copied from TextIO at the next stage
>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
>>>>>>> users some time to try it with some real complex files and also
>>>>>>> decide if TikaIO can continue implemented as a
>>>>>>> BoundedSource/Reader or not
>>>>>>>
>>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>>
>>>>>> Yes, but I think we should start by discussing the anticipated use
>>>>>> cases
>>>>> of
>>>>>> TikaIO and designing an API for it based on those use cases; and
>>>>>> then see what's the best implementation for that particular API
>>>>>> and set of anticipated use cases.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks, Sergey
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>

Re: TikaIO concerns

Posted by Chris Mattmann <ma...@apache.org>.
Hi all,

One other thing is that Tika extracts metadata and language information, for which order
doesn't matter (keys can be out of order).

Would this be useful?

Cheers,
Chris




On 9/21/17, 2:10 PM, "Sergey Beryozkin" <sb...@gmail.com> wrote:

    Hi Eugene
    
    Thank you, very helpful, let me read it a few times before I get what 
    exactly I need to clarify :-), two questions so far:
    
    On 21/09/17 21:40, Eugene Kirpichov wrote:
    > Thanks all for the discussion. It seems we have consensus that both
    > within-document order and association with the original filename are
    > necessary, but currently absent from TikaIO.
    > 
    > *Association with original file:*
    > Sergey - Beam does not *automatically* provide a way to associate an
    > element with the file it originated from: automatically tracking data
    > provenance is a known very hard research problem on which many papers have
    > been written, and obvious solutions are very easy to break. See related
    > discussion at
    > https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
    >   .
    > 
    > If you want the elements of your PCollection to contain additional
    > information, you need the elements themselves to contain this information:
    > the elements are self-contained and have no metadata associated with them
    > (beyond the timestamp and windows, universal to the whole Beam model).
    > 
    > *Order within a file:*
    > The only way to have any kind of order within a PCollection is to have the
    > elements of the PCollection contain something ordered, e.g. have a
    > PCollection<List<Something>>, where each List is for one file [I'm assuming
    > Tika, at a low level, works on a per-file basis?]. However, since TikaIO
    > can be applied to very large files, this could produce very large elements,
    > which is a bad idea. Because of this, I don't think the result of applying
    > Tika to a single file can be encoded as a PCollection element.
    > 
    > Given both of these, I think that it's not possible to create a
    > *general-purpose* TikaIO transform that will be better than manual
    > invocation of Tika as a DoFn on the result of FileIO.readMatches().
    > 
    > However, looking at the examples at
    > https://tika.apache.org/1.16/examples.html - almost all of the examples
    > involve extracting a single String from each document. This use case, with
    > the assumption that individual documents are small enough, can certainly be
    > simplified and TikaIO could be a facade for doing just this.
    > 
    > E.g. TikaIO could:
    > - take as input a PCollection<ReadableFile>
    > - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
    > is a class with properties { String content, Metadata metadata }
    
    and what is the 'String' in KV<String,...> given that TikaIO.ParseResult 
    represents the content + (Tika) Metadata of the file such as the author 
    name, etc ? Is it the file name ?
    > - be configured by: a Parser (it implements Serializable so can be
    > specified at pipeline construction time) and a ContentHandler whose
    > toString() will go into "content". ContentHandler does not implement
    > Serializable, so you can not specify it at construction time - however, you
    > can let the user specify either its class (if it's a simple handler like a
    > BodyContentHandler) or specify a lambda for creating the handler
    > (SerializableFunction<Void, ContentHandler>), and potentially you can have
    > a simpler facade for Tika.parseAsString() - e.g. call it
    > TikaIO.parseAllAsStrings().
    > 
    > Example usage would look like:
    > 
    >    PCollection<KV<String, ParseResult>> parseResults =
    > p.apply(FileIO.match().filepattern(...))
    >      .apply(FileIO.readMatches())
    >      .apply(TikaIO.parseAllAsStrings())
    > 
    > or:
    > 
    >      .apply(TikaIO.parseAll()
    >          .withParser(new AutoDetectParser())
    >          .withContentHandler(() -> new BodyContentHandler(new
    > ToXMLContentHandler())))
    > 
    > You could also have shorthands for letting the user avoid using FileIO
    > directly in simple cases, for example:
    >      p.apply(TikaIO.parseAsStrings().from(filepattern))
    > 
    > This would of course be implemented as a ParDo or even MapElements, and
    > you'll be able to share the code between parseAll and regular parse.
    > 
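
A minimal sketch of what the proposed ParseResult value class might look like - hypothetical, with a plain Map standing in for Tika's org.apache.tika.metadata.Metadata class:

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the proposed TikaIO.ParseResult value class. */
public class ParseResult implements Serializable {
    private final String content;               // ContentHandler.toString()
    private final Map<String, String> metadata; // stand-in for Tika's Metadata

    public ParseResult(String content, Map<String, String> metadata) {
        this.content = content;
        this.metadata = Collections.unmodifiableMap(new HashMap<>(metadata));
    }

    public String getContent() { return content; }
    public Map<String, String> getMetadata() { return metadata; }
}
```

In the KV<String, ParseResult> shape, the String key would presumably be the file name, so each parse result stays associated with the file it came from.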
    OK. What about the current source on master - should it be marked 
    Experimental till I manage to write something new with the above ideas 
    in mind ? Or is there enough time till 2.2.0 gets released ?
    
    Thanks, Sergey
    > On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sb...@gmail.com>
    > wrote:
    > 
    >> Hi Tim
    >> On 21/09/17 14:33, Allison, Timothy B. wrote:
    >>> Thank you, Sergey.
    >>>
    >>> My knowledge of Apache Beam is limited -- I saw Davor and
    >> Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally
    >> impressed, but I haven't had a chance to work with it yet.
    >>>
    >>>   From my perspective, if I understand this thread (and I may not!),
    >> getting unordered text from _a given file_ is a non-starter for most
    >> applications.  The implementation needs to guarantee order per file, and
    >> the user has to be able to link the "extract" back to a unique identifier
    >> for the document.  If the current implementation doesn't do those things,
    >> we need to change it, IMHO.
    >>>
    >> Right now the Tika-related reader does not associate a given text fragment
    >> with the file name, so a function looking at some text and trying to
    >> find where it came from won't be able to do so.
    >>
    >> So I asked how to do it in Beam, how to attach some context to the given
    >> piece of data. I hope it can be done and if not - then perhaps some
    >> improvement can be applied.
    >>
    >> Re the unordered text - yes - this is what we currently have with Beam +
    >> TikaIO :-).
    >>
    >> The use-case I referred to earlier in this thread (upload PDFs - save
    >> the possibly unordered text to Lucene with the file name 'attached', let
    >> users search for the files containing some words - phrases, this works
    >> OK given that I can see PDF parser for ex reporting the lines) can be
    >> supported OK with the current TikaIO (provided we find a way to 'attach'
    >> a file name to the flow).
    >>
    >> I see though supporting the total ordering can be a big deal in other
    >> cases. Eugene, can you please explain how it can be done, is it
    >> achievable in principle, without the users having to do some custom
    >> coding ?
    >>
    >>> To the question of -- why is this in Beam at all; why don't we let users
    >> call it if they want it?...
    >>>
    >>> No matter how much we do to Tika, it will behave badly sometimes --
    >> permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks
    >> using Beam -- folks likely with large batches of unruly/noisy documents --
    >> are more likely to run into these problems than your average
    >> couple-of-thousand-docs-from-our-own-company user. So, if there are things
    >> we can do in Beam to prevent developers around the world from having to
    >> reinvent the wheel for defenses against these problems, then I'd be
    >> enormously grateful if we could put Tika into Beam.  That means:
    >>>
    >>> 1) a process-level timeout (because you can't actually kill a thread in
    >> Java)
    >>> 2) a process-level restart on OOM
    >>> 3) avoid trying to reprocess a badly behaving document
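
Point 1 can be illustrated with plain java.util.concurrent: a Future timeout detects a stuck parse, but cancel(true) only requests interruption and cannot forcibly kill the thread - hence the need for a process-level timeout. A stdlib-only sketch (the hang is simulated with a long sleep):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ThreadTimeoutDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        // Stand-in for a parse that hangs (here it just sleeps "forever").
        Future<String> parse = exec.submit(() -> {
            Thread.sleep(Long.MAX_VALUE);
            return "text";
        });
        try {
            parse.get(100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("parse timed out"); // prints "parse timed out"
            // cancel(true) only *requests* interruption; a parser stuck in
            // native code or a tight loop that ignores interrupts keeps running,
            // which is why only killing the whole process is a reliable timeout.
            parse.cancel(true);
        }
        exec.shutdownNow();
    }
}
```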
    >>>
    >>> If Beam automatically handles those problems, then I'd say, y, let users
    >> write their own code.  If there is so much as a single configuration knob
    >> (and it sounds like Beam is against complex configuration...yay!) to get
    >> that working in Beam, then I'd say, please integrate Tika into Beam.  From
    >> a safety perspective, it is critical to keep the extraction process
    >> entirely separate (jvm, vm, m, rack, data center!) from the
    >> transformation+loading steps.  IMHO, very few devs realize this because
    >> Tika works well lots of the time...which is why it is critical for us to
    >> make it easy for people to get it right all of the time.
    >>>
    >>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch
    >> mode first in one jvm, and then I kick off another process to do
    >> transform/loading into Lucene/Solr from the .json files that Tika generates
    >> for each input file.  If I were to scale up, I'd want to maintain this
    >> complete separation of steps.
    >>>
    >>> Apologies if I've derailed the conversation or misunderstood this thread.
    >>>
    >> Major thanks for your input :-)
    >>
    >> Cheers, Sergey
    >>
    >>> Cheers,
    >>>
    >>>                  Tim
    >>>
    >>> -----Original Message-----
    >>> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
    >>> Sent: Thursday, September 21, 2017 9:07 AM
    >>> To: dev@beam.apache.org
    >>> Cc: Allison, Timothy B. <ta...@mitre.org>
    >>> Subject: Re: TikaIO concerns
    >>>
    >>> Hi All
    >>>
    >>> Please welcome Tim, one of Apache Tika leads and practitioners.
    >>>
    >>> Tim, thanks for joining in :-). If you have some great Apache Tika
    >> stories to share (preferably involving the cases where it did not really
    >> matter the ordering in which Tika-produced data were dealt with by the
    >>> consumers) then please do so :-).
    >>>
    >>> At the moment, even though Tika ContentHandler will emit the ordered
    >> data, the Beam runtime will have no guarantees that the downstream pipeline
    >> components will see the data coming in the right order.
    >>>
    >>> (FYI, I understand from the earlier comments that the total ordering is
    >> also achievable but would require the extra API support)
    >>>
    >>> Other comments would be welcome too
    >>>
    >>> Thanks, Sergey
    >>>
    >>> On 21/09/17 10:55, Sergey Beryozkin wrote:
    >>>> I noticed that the PDF and ODT parsers actually split by lines, not
    >>>> individual words, and I'm nearly 100% sure I saw Tika reporting individual
    >>>> lines when it was parsing the text files. The 'min text length'
    >>>> feature can help with reporting several lines at a time, etc...
    >>>>
    >>>> I'm working with this PDF all the time:
    >>>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
    >>>>
    >>>> try it too if you get a chance.
    >>>>
    >>>> (and I can imagine not all PDFs/etc represent a 'story' - some can
    >>>> be, for example, log-like content too)
    >>>>
    >>>> That said, I don't know how a parser for the format N will behave, it
    >>>> depends on the individual parsers.
    >>>>
    >>>> IMHO it's an equal candidate alongside Text-based bounded IOs...
    >>>>
    >>>> I'd like to know though how to make a file name available to the
    >>>> pipeline which is working with the current text fragment ?
    >>>>
    >>>> Going to try and do some measurements and compare the sync vs async
    >>>> parsing modes...
    >>>>
    >>>> Asked the Tika team to support with some more examples...
    >>>>
    >>>> Cheers, Sergey
    >>>> On 20/09/17 22:17, Sergey Beryozkin wrote:
    >>>>> Hi,
    >>>>>
    >>>>> thanks for the explanations,
    >>>>>
    >>>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
    >>>>>> Hi!
    >>>>>>
    >>>>>> TextIO returns an unordered soup of lines contained in all files you
    >>>>>> ask it to read. People usually use TextIO for reading files where 1
    >>>>>> line corresponds to 1 independent data element, e.g. a log entry, or
    >>>>>> a row of a CSV file - so discarding order is ok.
    >>>>> Just a side note, I'd probably want that to be ordered, though I guess
    >>>>> it depends...
    >>>>>> However, there is a number of cases where TextIO is a poor fit:
    >>>>>> - Cases where discarding order is not ok - e.g. if you're doing
    >>>>>> natural language processing and the text files contain actual prose,
    >>>>>> where you need to process a file as a whole. TextIO can't do that.
    >>>>>> - Cases where you need to remember which file each element came
    >>>>>> from, e.g.
    >>>>>> if you're creating a search index for the files: TextIO can't do
    >>>>>> this either.
    >>>>>>
    >>>>>> Both of these issues have been raised in the past against TextIO;
    >>>>>> however it seems that the overwhelming majority of users of TextIO
    >>>>>> use it for logs or CSV files or alike, so solving these issues has
    >>>>>> not been a priority.
    >>>>>> Currently they are solved in a general form via FileIO.read() which
    >>>>>> gives you access to reading a full file yourself - people who want
    >>>>>> more flexibility will be able to use standard Java text-parsing
    >>>>>> utilities on a ReadableFile, without involving TextIO.
    >>>>>>
    >>>>>> Same applies for XmlIO: it is specifically designed for the narrow
    >>>>>> use case where the files contain independent data entries, so
    >>>>>> returning an unordered soup of them, with no association to the
    >>>>>> original file, is the user's intention. XmlIO will not work for
    >>>>>> processing more complex XML files that are not simply a sequence of
    >>>>>> entries with the same tag, and it also does not remember the
    >>>>>> original filename.
    >>>>>>
    >>>>>
    >>>>> OK...
    >>>>>
    >>>>>> However, if my understanding of Tika use cases is correct, it is
    >>>>>> mainly used for extracting content from complex file formats - for
    >>>>>> example, extracting text and images from PDF files or Word
    >>>>>> documents. I believe this is the main difference between it and
    >>>>>> TextIO - people usually use Tika for complex use cases where the
    >>>>>> "unordered soup of stuff" abstraction is not useful.
    >>>>>>
    >>>>>> My suspicion about this is confirmed by the fact that the crux of
    >>>>>> the Tika API is ContentHandler
    >>>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
    >>>>>> html?is-external=true
    >>>>>>
    >>>>>> whose
    >>>>>> documentation says "The order of events in this interface is very
    >>>>>> important, and mirrors the order of information in the document
    >> itself."
    >>>>> All that says is that a (Tika) ContentHandler will be a true SAX
    >>>>> ContentHandler...
    >>>>>>
    >>>>>> Let me give a few examples of what I think is possible with the raw
    >>>>>> Tika API, but I think is not currently possible with TikaIO - please
    >>>>>> correct me where I'm wrong, because I'm not particularly familiar
    >>>>>> with Tika and am judging just based on what I read about it.
    >>>>>> - User has 100,000 Word documents and wants to convert each of them
    >>>>>> to text files for future natural language processing.
    >>>>>> - User has 100,000 PDF files with financial statements, each
    >>>>>> containing a bunch of unrelated text and - the main content - a list
    >>>>>> of transactions in PDF tables. User wants to extract each
    >>>>>> transaction as a PCollection element, discarding the unrelated text.
    >>>>>> - User has 100,000 PDF files with scientific papers, and wants to
    >>>>>> extract text from them, somehow parse author and affiliation from
    >>>>>> the text, and compute statistics of topics and terminology usage by
    >>>>>> author name and affiliation.
    >>>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
    >>>>>> observing a location over time: they want to extract metadata from
    >>>>>> each image using Tika, analyze the images themselves using some
    >>>>>> other library, and detect anomalies in the overall appearance of the
    >>>>>> location over time as seen from multiple cameras.
    >>>>>> I believe all of these cases can not be solved with TikaIO because
    >>>>>> the resulting PCollection<String> contains no information about
    >>>>>> which String comes from which document and about the order in which
    >>>>>> they appear in the document.
    >>>>> These are good use cases, thanks... I thought you were talking
    >>>>> about the unordered soup of data produced by TikaIO (and its friends
    >>>>> TextIO and alike :-)).
    >>>>> Putting the ordered vs unordered question aside for a sec, why
    >>>>> exactly a Tika Reader can not make the name of the file it's
    >>>>> currently reading from available to the pipeline, as some Beam
    >> pipeline metadata piece ?
    >>>>> Surely it can be possible with Beam ? If not then I would be
    >> surprised...
    >>>>>
    >>>>>>
    >>>>>> I am, honestly, struggling to think of a case where I would want to
    >>>>>> use Tika, but where I *would* be ok with getting an unordered soup
    >>>>>> of strings.
    >>>>>> So some examples would be very helpful.
    >>>>>>
    >>>>> Yes. I'll ask Tika developers to help with some examples, but I'll
    >>>>> give one example where it did not matter to us in what order
    >>>>> Tika-produced data were available to the downstream layer.
    >>>>>
    >>>>> It's a demo the Apache CXF colleague of mine showed at one of Apache
    >>>>> Con NAs, and we had a happy audience:
    >>>>>
    >>>>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea
    >>>>> se/samples/jax_rs/search
    >>>>>
    >>>>>
    >>>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
    >>>>> into Lucene. We associate a file name with the indexed content and
    >>>>> then let users find a list of PDF files which contain a given word or
    >>>>> few words, details are here
    >>>>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea
    >>>>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal
    >>>>> og.java#L131
    >>>>>
    >>>>>
    >>>>> I'd say even more involved search engines would not mind supporting a
    >>>>> case like that :-)
    >>>>>
    >>>>> Now there we process one file at a time, and I understand now that
    >>>>> with TikaIO and N files it's all over the place really as far as the
    >>>>> ordering is concerned, which file a chunk is coming from, etc. That's why
    >>>>> the TikaReader must be able to associate the file name with a given piece
    >>>>> of text it's making available to the pipeline.
    >>>>>
    >>>>> I'd be happy to support the ParDo way of linking Tika with Beam.
    >>>>> If it makes things simpler then it would be good, I've just no idea
    >>>>> at the moment how to start the pipeline without using a
    >>>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
    >>>>> earlier - how can one avoid it with ParDo when implementing a 'min
    >>>>> len chunk' feature, where the ParDo would have to concatenate several
    >>>>> SAX data pieces first before making a single composite piece
    >>>>> available to the pipeline ?
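
For the 'min len chunk' feature, no cross-thread synchronization is actually needed inside a ParDo: each @ProcessElement call parses one file on one thread, so a per-call buffer suffices. A hypothetical sketch of such a buffer (names are illustrative, not TikaIO API):

```java
import java.util.ArrayList;
import java.util.List;

/** Buffers SAX-reported text chunks until a minimum length is reached. */
public class MinLengthChunker {
    private final int minLen;
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> out = new ArrayList<>();

    public MinLengthChunker(int minLen) { this.minLen = minLen; }

    /** Called for each chunk the ContentHandler reports, in document order. */
    public void add(String chunk) {
        buffer.append(chunk);
        if (buffer.length() >= minLen) {
            out.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    /** Flushes any trailing text shorter than minLen; call at end of document. */
    public List<String> finish() {
        if (buffer.length() > 0) {
            out.add(buffer.toString());
            buffer.setLength(0);
        }
        return out;
    }
}
```

Because the buffer lives entirely within one @ProcessElement call, there is no shared state between files and nothing to lock.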
    >>>>>
    >>>>>
    >>>>>> Another way to state it: currently, if I wanted to solve all of the
    >>>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
    >>>>>> API myself on the resulting ReadableFile. How can we make TikaIO
    >>>>>> provide a usability improvement over such usage?
    >>>>>>
    >>>>>
    >>>>>
    >>>>> If you are actually asking, does it really make sense for Beam to
    >>>>> ship Tika related code, given that users can just do it themselves,
    >>>>> I'm not sure.
    >>>>>
    >>>>> IMHO it always works better if users have to provide just few config
    >>>>> options to an integral part of the framework and see things happening.
    >>>>> It will bring more users.
    >>>>>
    >>>>> Whether the current Tika code (refactored or not) stays with Beam or
    >>>>> not - I'll let you and the team decide; believe it or not I was
    >>>>> seriously contemplating at the last moment to make it all part of the
    >>>>> Tika project itself and have a bit more flexibility over there with
    >>>>> tweaking things, but now that it is in the Beam snapshot - I don't
    >>>>> know - it's not my decision...
    >>>>>
    >>>>>> I am confused by your other comment - "Does the ordering matter ?
    >>>>>> Perhaps
    >>>>>> for some cases it does, and for some it does not. May be it makes
    >>>>>> sense to support running TikaIO as both the bounded reader/source
    >>>>>> and ParDo, with getting the common code reused." - because using
    >>>>>> BoundedReader or ParDo is not related to the ordering issue, only to
    >>>>>> the issue of asynchronous reading and complexity of implementation.
    >>>>>> The resulting PCollection will be unordered either way - this needs
    >>>>>> to be solved separately by providing a different API.
    >>>>> Right I see now, so ParDo is not about making Tika reported data
    >>>>> available to the downstream pipeline components ordered, only about
    >>>>> the simpler implementation.
    >>>>> Association with the file should be possible I hope, but I understand
    >>>>> it would be possible to optionally make the data coming out in the
    >>>>> ordered way as well...
    >>>>>
    >>>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
    >>>>> let me double check: should we still give some thought to the
    >>>>> possible performance benefit of the current approach ? As I said, I
    >>>>> can easily get rid of all that polling code, use a simple
    >>>>> BlockingQueue.
    >>>>>
    >>>>> Cheers, Sergey
    >>>>>>
    >>>>>> Thanks.
    >>>>>>
    >>>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
    >>>>>> <sb...@gmail.com>
    >>>>>> wrote:
    >>>>>>
    >>>>>>> Hi
    >>>>>>>
    >>>>>>> Glad TikaIO getting some serious attention :-), I believe one thing
    >>>>>>> we both agree upon is that Tika can help Beam in its own unique way.
    >>>>>>>
    >>>>>>> Before trying to reply online, I'd like to state that my main
    >>>>>>> assumption is that TikaIO (as far as the read side is concerned) is
    >>>>>>> no different to Text, XML or similar bounded reader components.
    >>>>>>>
    >>>>>>> I have to admit I don't understand your questions about TikaIO
    >>>>>>> usecases.
    >>>>>>>
    >>>>>>> What are the Text Input or XML input use-cases ? These use cases
    >>>>>>> are Tika input cases as well, the only difference is Tika can not
    >>>>>>> split an individual file into a sequence of sources, etc.
    >>>>>>>
    >>>>>>> TextIO can read from the plain text files (possibly zipped), XML -
    >>>>>>> optimized around reading from the XML files, and I thought I made
    >>>>>>> it clear (and it is a known fact anyway) Tika was about reading
    >>>>>>> basically from any file format.
    >>>>>>>
    >>>>>>> Where is the difference (apart from what I've already mentioned) ?
    >>>>>>>
    >>>>>>> Sergey
    >>>>>>>
    >>>>>>>
    >>>>>>>
    >>>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
    >>>>>>>> Hi,
    >>>>>>>>
    >>>>>>>> Replies inline.
    >>>>>>>>
    >>>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
    >>>>>>>> <sb...@gmail.com>
    >>>>>>>> wrote:
    >>>>>>>>
    >>>>>>>>> Hi All
    >>>>>>>>>
    >>>>>>>>> This is my first post the the dev list, I work for Talend, I'm a
    >>>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
    >>>>>>>>> great to try and link both projects together, which led me to
    >>>>>>>>> opening [1] where I typed some early thoughts, followed by PR
    >>>>>>>>> [2].
    >>>>>>>>>
    >>>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
    >>>>>>>>> newer review comments from Eugene pending, so I'd like to
    >>>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
    >>>>>>>>> decide, based on the feedback from the experts, what to do next.
    >>>>>>>>>
    >>>>>>>>> Apache Tika Parsers report the text content in chunks, via
    >>>>>>>>> SaxParser events. It's not possible with Tika to take a file and
    >>>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
    >>>>>>>>> by line, the only way is to handle the SAXParser callbacks which
    >>>>>>>>> report the data chunks.
    >>>>>>>>> Some
    >>>>>>>>> parsers may report the complete lines, some individual words,
    >>>>>>>>> with some being able report the data only after the completely
    >>>>>>>>> parse the document.
    >>>>>>>>> All depends on the data format.
    >>>>>>>>>
    >>>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
    >>>>>>>>> to parse the files, Beam threads will only collect the data from
    >>>>>>>>> the internal queue where the internal TikaReader's thread will
    >>>>>>>>> put the data into (note the data chunks are ordered even though
    >>>>>>>>> the tests might suggest otherwise).
    >>>>>>>>>
    >>>>>>>> I agree that your implementation of reader returns records in
    >>>>>>>> order
    >>>>>>>> - but
    >>>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
    >>>>>>>> the order in which records are produced by a BoundedReader - the
    >>>>>>>> order produced by your reader is ignored, and when applying any
    >>>>>>>> transforms to the
    >>>>>>> PCollection
    >>>>>>>> produced by TikaIO, it is impossible to recover the order in which
    >>>>>>>> your reader returned the records.
    >>>>>>>>
    >>>>>>>> With that in mind, is PCollection<String>, containing individual
    >>>>>>>> Tika-detected items, still the right API for representing the
    >>>>>>>> result of parsing a large number of documents with Tika?
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>>
    >>>>>>>>> The reason I did it was because I thought
    >>>>>>>>>
    >>>>>>>>> 1) it would make the individual data chunks available faster to
    >>>>>>>>> the pipeline - the parser will continue working via the
    >>>>>>>>> binary/video etc file while the data will already start flowing -
    >>>>>>>>> I agree there should be some tests data available confirming it -
    >>>>>>>>> but I'm positive at the moment this approach might yield some
    >>>>>>>>> performance gains with the large sets. If the file is large, if
    >>>>>>>>> it has the embedded attachments/videos to deal with, then it may
    >>>>>>>>> be more effective not to get the Beam thread deal with it...
    >>>>>>>>>
    >>>>>>>>> As I said on the PR, this description contains unfounded and
    >>>>>>>>> potentially
    >>>>>>>> incorrect assumptions about how Beam runners execute (or may
    >>>>>>>> execute in
    >>>>>>> the
    >>>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
    >>>>>>> correctly,
    >>>>>>>> you might be assuming that:
    >>>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
    >>>>>>> complete
    >>>>>>>> before processing its outputs with downstream transforms
    >>>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
    >>>>>>> *concurrently*
    >>>>>>>> with downstream processing of its results
    >>>>>>>> - Passing an element from one thread to another using a
    >>>>>>>> BlockingQueue is free in terms of performance. All of these are
    >>>>>>>> false at least in some runners, and I'm almost certain that in
    >>>>>>>> reality, performance of this approach is worse than a ParDo in
    >>>>>>> most
    >>>>>>>> production runners.
    >>>>>>>>
    >>>>>>>> There are other disadvantages to this approach:
    >>>>>>>> - Doing the bulk of the processing in a separate thread makes it
    >>>>>>> invisible
    >>>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
    >>>>>>>> profiling capabilities, or the ability to get the current stack
    >>>>>>>> trace for stuck elements, this approach would make the real
    >>>>>>>> processing invisible to all of these capabilities, and a user
    >>>>>>>> would only see that the bulk of the time is spent waiting for the
    >>>>>>>> next element, but not *why* the next
    >>>>>>> element
    >>>>>>>> is taking long to compute.
    >>>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
    >>>>>>>> invisible to Beam, will make it harder for runners to do
    >>>>>>>> autoscaling, binpacking
    >>>>>>> and
    >>>>>>>> other resource management magic (how much of this runners actually
    >>>>>>>> do is
    >>>>>>> a
    >>>>>>>> separate issue), because the runner will have no way of knowing
    >>>>>>>> how much CPU/IO this particular transform is actually using - all
    >>>>>>>> the processing happens in a thread about which the runner is
    >>>>>>>> unaware.
    >>>>>>>> - As far as I can tell, the code also hides exceptions that happen
    >>>>>>>> in the Tika thread
    >>>>>>>> - Adding the thread management makes the code much more complex,
    >>>>>>>> easier
    >>>>>>> to
    >>>>>>>> introduce bugs, and harder for others to contribute
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>> 2) As I commented at the end of [2], having an option to
    >>>>>>>>> concatenate the data chunks first before making them available to
    >>>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
    >>>>>>>>> introduce some synchronization issues (though I'm not exactly
    >>>>>>>>> sure yet)
    >>>>>>>>>
    >>>>>>>> What are these issues?
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>>
    >>>>>>>>> One of valid concerns there is that the reader is polling the
    >>>>>>>>> internal queue so, in theory at least, and perhaps in some rare
    >>>>>>>>> cases too, we may have a case where the max polling time has been
    >>>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
    >>>>>>>>> the file data. I think that it can be solved by either 2a)
    >>>>>>>>> configuring the max polling time to a very large number which
    >>>>>>>>> will never be reached for a practical case, or
    >>>>>>>>> 2b) simply use a blocking queue without the time limits - in the
    >>>>>>>>> worst case, if TikaParser spins and fails to report the end of
    >>>>>>>>> the document, then Beam can heal itself if the pipeline blocks.
    >>>>>>>>> I propose to follow 2b).
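A minimal sketch of the 2b) approach - an unbounded blocking handoff with an end-of-document sentinel instead of timed polling (names are illustrative, not the actual TikaReader code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

public class BlockingHandoff {

    // Unique sentinel marking end-of-document; the identity comparison below
    // means a document that literally contains "END" cannot be confused with it.
    public static final String END = new String("END");

    // Blocks until the sentinel arrives: no polling timeout, hence no window
    // in which a slow parser could cause silently truncated output.
    public static List<String> drain(BlockingQueue<String> queue) throws InterruptedException {
        List<String> out = new ArrayList<>();
        while (true) {
            String chunk = queue.take();
            if (chunk == END) {   // identity check, on purpose
                return out;
            }
            out.add(chunk);
        }
    }
}
```

In the worst case (a parser that never reports the end of the document) `drain` blocks forever, which is exactly the "pipeline blocks and Beam can heal itself" behaviour described above.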
    >>>>>>>>>
    >>>>>>>> I agree that there should be no way to unintentionally configure
    >>>>>>>> the transform in a way that will produce silent data loss. Another
    >>>>>>>> reason for not having these tuning knobs is that it goes against
    >>>>>>>> Beam's "no knobs"
    >>>>>>>> philosophy, and that in most cases users have no way of figuring
    >>>>>>>> out a
    >>>>>>> good
    >>>>>>>> value for tuning knobs except for manual experimentation, which is
    >>>>>>>> extremely brittle and typically gets immediately obsoleted by
    >>>>>>>> running on
    >>>>>>> a
    >>>>>>>> new dataset or updating a version of some of the involved
    >>>>>>>> dependencies
    >>>>>>> etc.
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>>
    >>>>>>>>>
    >>>>>>>>> Please let me know what you think.
    >>>>>>>>> My plan so far is:
    >>>>>>>>> 1) start addressing most of Eugene's comments which would require
    >>>>>>>>> some minor TikaIO updates
    >>>>>>>>> 2) at the next stage, work on removing the TikaSource internal
    >>>>>>>>> code dealing with file patterns which I copied from TextIO
    >>>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
    >>>>>>>>> users some time to try it with some real complex files and also
    >>>>>>>>> decide if TikaIO can continue implemented as a
    >>>>>>>>> BoundedSource/Reader or not
    >>>>>>>>>
    >>>>>>>>> Eugene, all, will it work if I start with 1) ?
    >>>>>>>>>
    >>>>>>>> Yes, but I think we should start by discussing the anticipated use
    >>>>>>>> cases
    >>>>>>> of
    >>>>>>>> TikaIO and designing an API for it based on those use cases; and
    >>>>>>>> then see what's the best implementation for that particular API
    >>>>>>>> and set of anticipated use cases.
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>>
    >>>>>>>>> Thanks, Sergey
    >>>>>>>>>
    >>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
    >>>>>>>>> [2] https://github.com/apache/beam/pull/3378
    >>>>>>>>>
    >>>>>>>>
    >>>>>>>
    >>>>>>
    >>>>>
    >>
    > 
    
    
    -- 
    Sergey Beryozkin
    
    Talend Community Coders
    http://coders.talend.com/
    



Re: TikaIO concerns

Posted by Chris Mattmann <ma...@apache.org>.
Hi all,

One other thing is that Tika extracts metadata and language information,
for which the order doesn't matter (keys can be out of order).

Would this be useful?

Cheers,
Chris




On 9/21/17, 2:10 PM, "Sergey Beryozkin" <sb...@gmail.com> wrote:

    Hi Eugene
    
    Thank you, very helpful; let me read it a few times before I work out
    what exactly I need to clarify :-). Two questions so far:
    
    On 21/09/17 21:40, Eugene Kirpichov wrote:
    > Thanks all for the discussion. It seems we have consensus that both
    > within-document order and association with the original filename are
    > necessary, but currently absent from TikaIO.
    > 
    > *Association with original file:*
    > Sergey - Beam does not *automatically* provide a way to associate an
    > element with the file it originated from: automatically tracking data
    > provenance is a known very hard research problem on which many papers have
    > been written, and obvious solutions are very easy to break. See related
    > discussion at
    > https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
    >   .
    > 
    > If you want the elements of your PCollection to contain additional
    > information, you need the elements themselves to contain this information:
    > the elements are self-contained and have no metadata associated with them
    > (beyond the timestamp and windows, universal to the whole Beam model).
    > 
    > *Order within a file:*
    > The only way to have any kind of order within a PCollection is to have the
    > elements of the PCollection contain something ordered, e.g. have a
    > PCollection<List<Something>>, where each List is for one file [I'm assuming
    > Tika, at a low level, works on a per-file basis?]. However, since TikaIO
    > can be applied to very large files, this could produce very large elements,
    > which is a bad idea. Because of this, I don't think the result of applying
    > Tika to a single file can be encoded as a PCollection element.
    > 
    > Given both of these, I think that it's not possible to create a
    > *general-purpose* TikaIO transform that will be better than manual
    > invocation of Tika as a DoFn on the result of FileIO.readMatches().
    > 
    > However, looking at the examples at
    > https://tika.apache.org/1.16/examples.html - almost all of the examples
    > involve extracting a single String from each document. This use case, with
    > the assumption that individual documents are small enough, can certainly be
    > simplified and TikaIO could be a facade for doing just this.
    > 
    > E.g. TikaIO could:
    > - take as input a PCollection<ReadableFile>
    > - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
    > is a class with properties { String content, Metadata metadata }
    
    and what is the 'String' in KV<String,...> given that TikaIO.ParseResult 
    represents the content + (Tika) Metadata of the file such as the author 
    name, etc ? Is it the file name ?
    > - be configured by: a Parser (it implements Serializable so can be
    > specified at pipeline construction time) and a ContentHandler whose
    > toString() will go into "content". ContentHandler does not implement
    > Serializable, so you can not specify it at construction time - however, you
    > can let the user specify either its class (if it's a simple handler like a
    > BodyContentHandler) or specify a lambda for creating the handler
    > (SerializableFunction<Void, ContentHandler>), and potentially you can have
    > a simpler facade for Tika.parseAsString() - e.g. call it
    > TikaIO.parseAllAsStrings().
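The reason for a handler factory rather than a single handler instance is that a ContentHandler is stateful; the contract can be sketched without Tika, using a hypothetical TextHandler stand-in whose toString() yields the content:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

public class HandlerFactoryDemo {

    // Stand-in for a ContentHandler: stateful, accumulates text, and its
    // toString() yields the extracted content (hypothetical, not a Tika class).
    public static class TextHandler {
        private final StringBuilder text = new StringBuilder();
        public void characters(String s) { text.append(s); }
        @Override public String toString() { return text.toString(); }
    }

    // One fresh handler per document, obtained from the factory; reusing a
    // single instance would concatenate every document's content together.
    public static Map<String, String> parseAll(Map<String, String> docs,
                                               Supplier<TextHandler> factory) {
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            TextHandler handler = factory.get();
            handler.characters(doc.getValue()); // a real parser drives this callback
            results.put(doc.getKey(), handler.toString());
        }
        return results;
    }
}
```

The `Supplier` plays the role of the `SerializableFunction<Void, ContentHandler>` mentioned above: the lambda can be serialized at pipeline construction time even though the handler it creates cannot.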
    > 
    > Example usage would look like:
    > 
    >    PCollection<KV<String, ParseResult>> parseResults =
    > p.apply(FileIO.match().filepattern(...))
    >      .apply(FileIO.readMatches())
    >      .apply(TikaIO.parseAllAsStrings())
    > 
    > or:
    > 
    >      .apply(TikaIO.parseAll()
    >          .withParser(new AutoDetectParser())
    >          .withContentHandler(() -> new BodyContentHandler(new
    > ToXMLContentHandler())))
    > 
    > You could also have shorthands for letting the user avoid using FileIO
    > directly in simple cases, for example:
    >      p.apply(TikaIO.parseAsStrings().from(filepattern))
    > 
    > This would of course be implemented as a ParDo or even MapElements, and
    > you'll be able to share the code between parseAll and regular parse.
    > 
    OK. What about the current source on master - should it be marked
    Experimental till I manage to write something new with the above ideas
    in mind ? Or is there enough time till 2.2.0 gets released ?
    
    Thanks, Sergey
    > On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sb...@gmail.com>
    > wrote:
    > 
    >> Hi Tim
    >> On 21/09/17 14:33, Allison, Timothy B. wrote:
    >>> Thank you, Sergey.
    >>>
    >>> My knowledge of Apache Beam is limited -- I saw Davor and
    >> Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally
    >> impressed, but I haven't had a chance to work with it yet.
    >>>
    >>>   From my perspective, if I understand this thread (and I may not!),
    >> getting unordered text from _a given file_ is a non-starter for most
    >> applications.  The implementation needs to guarantee order per file, and
    >> the user has to be able to link the "extract" back to a unique identifier
    >> for the document.  If the current implementation doesn't do those things,
    >> we need to change it, IMHO.
    >>>
    >> Right now the Tika-related reader does not associate a given text fragment
    >> with the file name, so a function looking at some text and trying to
    >> find where it came from won't be able to do so.
    >>
    >> So I asked how to do it in Beam, how to attach some context to the given
    >> piece of data. I hope it can be done and if not - then perhaps some
    >> improvement can be applied.
    >>
    >> Re the unordered text - yes - this is what we currently have with Beam +
    >> TikaIO :-).
    >>
    >> The use-case I referred to earlier in this thread (upload PDFs, save
    >> the possibly unordered text to Lucene with the file name 'attached',
    >> and let users search for the files containing some words or phrases -
    >> this works OK given that I can see the PDF parser, for example,
    >> reporting whole lines) can be supported with the current TikaIO
    >> (provided we find a way to 'attach' a file name to the flow).
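The 'attach a file name' idea boils down to emitting (file name, chunk) pairs rather than bare strings; a minimal stdlib sketch, with Map.Entry standing in for Beam's KV:

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FileTagging {

    // Pairs every extracted chunk with its source file name, so that a
    // downstream indexer (e.g. Lucene) can group chunks per document even
    // when the collection itself is unordered.
    public static List<Map.Entry<String, String>> tagChunks(String fileName,
                                                            List<String> chunks) {
        List<Map.Entry<String, String>> tagged = new ArrayList<>();
        for (String chunk : chunks) {
            tagged.add(new SimpleImmutableEntry<>(fileName, chunk));
        }
        return tagged;
    }
}
```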
    >>
    >> I see, though, that supporting total ordering can be a big deal in
    >> other cases. Eugene, can you please explain how it can be done - is it
    >> achievable in principle, without the users having to do some custom
    >> coding ?
    >>
    >>> To the question of -- why is this in Beam at all; why don't we let users
    >> call it if they want it?...
    >>>
    >>> No matter how much we do to Tika, it will behave badly sometimes --
    >> permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks
    >> using Beam -- folks likely with large batches of unruly/noisy documents --
    >> are more likely to run into these problems than your average
    >> couple-of-thousand-docs-from-our-own-company user. So, if there are things
    >> we can do in Beam to prevent developers around the world from having to
    >> reinvent the wheel for defenses against these problems, then I'd be
    >> enormously grateful if we could put Tika into Beam.  That means:
    >>>
    >>> 1) a process-level timeout (because you can't actually kill a thread in
    >> Java)
    >>> 2) a process-level restart on OOM
    >>> 3) avoid trying to reprocess a badly behaving document
    >>>
    >>> If Beam automatically handles those problems, then I'd say, y, let users
    >> write their own code.  If there is so much as a single configuration knob
    >> (and it sounds like Beam is against complex configuration...yay!) to get
    >> that working in Beam, then I'd say, please integrate Tika into Beam.  From
    >> a safety perspective, it is critical to keep the extraction process
    >> entirely separate (jvm, vm, m, rack, data center!) from the
    >> transformation+loading steps.  IMHO, very few devs realize this because
    >> Tika works well lots of the time...which is why it is critical for us to
    >> make it easy for people to get it right all of the time.
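Point 1) above - that a wedged parse can only be reliably stopped at the process level - can be sketched with the JDK's ProcessBuilder; the command to run is an assumption (a real setup might fork tika-app or a dedicated worker JVM), and the sketch assumes a POSIX environment:

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ProcessTimeout {

    // Runs a command under a hard wall-clock limit. Unlike Thread.interrupt(),
    // destroyForcibly() actually reclaims the resources of a wedged parse -
    // this is the process-level equivalent of "kill -9".
    public static boolean runWithTimeout(List<String> command, long timeoutMillis)
            throws Exception {
        Process p = new ProcessBuilder(command).inheritIO().start();
        if (!p.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            p.destroyForcibly();
            p.waitFor();        // reap the killed process
            return false;       // caller should mark the document bad, not retry it
        }
        return p.exitValue() == 0;
    }
}
```

Returning false (rather than throwing and retrying) matches point 3): a document that hung once should be recorded as poisoned, not reprocessed.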
    >>>
    >>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch
    >> mode first in one jvm, and then I kick off another process to do
    >> transform/loading into Lucene/Solr from the .json files that Tika generates
    >> for each input file.  If I were to scale up, I'd want to maintain this
    >> complete separation of steps.
    >>>
    >>> Apologies if I've derailed the conversation or misunderstood this thread.
    >>>
    >> Major thanks for your input :-)
    >>
    >> Cheers, Sergey
    >>
    >>> Cheers,
    >>>
    >>>                  Tim
    >>>
    >>> -----Original Message-----
    >>> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
    >>> Sent: Thursday, September 21, 2017 9:07 AM
    >>> To: dev@beam.apache.org
    >>> Cc: Allison, Timothy B. <ta...@mitre.org>
    >>> Subject: Re: TikaIO concerns
    >>>
    >>> Hi All
    >>>
    >>> Please welcome Tim, one of Apache Tika leads and practitioners.
    >>>
    >>> Tim, thanks for joining in :-). If you have some great Apache Tika
    >>> stories to share (preferably involving cases where the ordering in
    >>> which Tika-produced data were dealt with by the consumers did not
    >>> really matter) then please do so :-).
    >>>
    >>> At the moment, even though a Tika ContentHandler will emit ordered
    >>> data, the Beam runtime provides no guarantee that the downstream
    >>> pipeline components will see the data coming in the right order.
    >>>
    >>> (FYI, I understand from the earlier comments that the total ordering is
    >> also achievable but would require the extra API support)
    >>>
    >>> Other comments would be welcome too
    >>>
    >>> Thanks, Sergey
    >>>
    >>> On 21/09/17 10:55, Sergey Beryozkin wrote:
    >>>> I noticed that the PDF and ODT parsers actually split by lines, not
    >>>> individual words, and I'm nearly 100% sure I saw Tika reporting
    >>>> individual lines when it was parsing text files. The 'min text
    >>>> length' feature can help with reporting several lines at a time,
    >>>> etc...
    >>>>
    >>>> I'm working with this PDF all the time:
    >>>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
    >>>>
    >>>> try it too if you get a chance.
    >>>>
    >>>> (and I can imagine that not all PDFs etc. represent a 'story' - some
    >>>> can, for example, have log-like content too)
    >>>>
    >>>> That said, I don't know how a parser for a given format N will
    >>>> behave; it depends on the individual parsers.
    >>>>
    >>>> IMHO it's an equal candidate alongside Text-based bounded IOs...
    >>>>
    >>>> I'd like to know, though, how to make the file name available to the
    >>>> part of the pipeline which is working with the current text fragment.
    >>>>
    >>>> Going to try and do some measurements and compare the sync vs async
    >>>> parsing modes...
    >>>>
    >>>> Asked the Tika team to support with some more examples...
    >>>>
    >>>> Cheers, Sergey
    >>>> On 20/09/17 22:17, Sergey Beryozkin wrote:
    >>>>> Hi,
    >>>>>
    >>>>> thanks for the explanations,
    >>>>>
    >>>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
    >>>>>> Hi!
    >>>>>>
    >>>>>> TextIO returns an unordered soup of lines contained in all files you
    >>>>>> ask it to read. People usually use TextIO for reading files where 1
    >>>>>> line corresponds to 1 independent data element, e.g. a log entry, or
    >>>>>> a row of a CSV file - so discarding order is ok.
    >>>>> Just a side note, I'd probably want that to be ordered, though I
    >>>>> guess it depends...
    >>>>>> However, there is a number of cases where TextIO is a poor fit:
    >>>>>> - Cases where discarding order is not ok - e.g. if you're doing
    >>>>>> natural language processing and the text files contain actual prose,
    >>>>>> where you need to process a file as a whole. TextIO can't do that.
    >>>>>> - Cases where you need to remember which file each element came
    >>>>>> from, e.g.
    >>>>>> if you're creating a search index for the files: TextIO can't do
    >>>>>> this either.
    >>>>>>
    >>>>>> Both of these issues have been raised in the past against TextIO;
    >>>>>> however it seems that the overwhelming majority of users of TextIO
    >>>>>> use it for logs or CSV files or alike, so solving these issues has
    >>>>>> not been a priority.
    >>>>>> Currently they are solved in a general form via FileIO.read() which
    >>>>>> gives you access to reading a full file yourself - people who want
    >>>>>> more flexibility will be able to use standard Java text-parsing
    >>>>>> utilities on a ReadableFile, without involving TextIO.
    >>>>>>
    >>>>>> Same applies for XmlIO: it is specifically designed for the narrow
    >>>>>> use case where the files contain independent data entries, so
    >>>>>> returning an unordered soup of them, with no association to the
    >>>>>> original file, is the user's intention. XmlIO will not work for
    >>>>>> processing more complex XML files that are not simply a sequence of
    >>>>>> entries with the same tag, and it also does not remember the
    >>>>>> original filename.
    >>>>>>
    >>>>>
    >>>>> OK...
    >>>>>
    >>>>>> However, if my understanding of Tika use cases is correct, it is
    >>>>>> mainly used for extracting content from complex file formats - for
    >>>>>> example, extracting text and images from PDF files or Word
    >>>>>> documents. I believe this is the main difference between it and
    >>>>>> TextIO - people usually use Tika for complex use cases where the
    >>>>>> "unordered soup of stuff" abstraction is not useful.
    >>>>>>
    >>>>>> My suspicion about this is confirmed by the fact that the crux of
    >>>>>> the Tika API is ContentHandler
    >>>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true
    >>>>>>
    >>>>>> whose
    >>>>>> documentation says "The order of events in this interface is very
    >>>>>> important, and mirrors the order of information in the document
    >> itself."
    >>>>> All that says is that a (Tika) ContentHandler will be a true SAX
    >>>>> ContentHandler...
    >>>>>>
    >>>>>> Let me give a few examples of what I think is possible with the raw
    >>>>>> Tika API, but I think is not currently possible with TikaIO - please
    >>>>>> correct me where I'm wrong, because I'm not particularly familiar
    >>>>>> with Tika and am judging just based on what I read about it.
    >>>>>> - User has 100,000 Word documents and wants to convert each of them
    >>>>>> to text files for future natural language processing.
    >>>>>> - User has 100,000 PDF files with financial statements, each
    >>>>>> containing a bunch of unrelated text and - the main content - a list
    >>>>>> of transactions in PDF tables. User wants to extract each
    >>>>>> transaction as a PCollection element, discarding the unrelated text.
    >>>>>> - User has 100,000 PDF files with scientific papers, and wants to
    >>>>>> extract text from them, somehow parse author and affiliation from
    >>>>>> the text, and compute statistics of topics and terminology usage by
    >>>>>> author name and affiliation.
    >>>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
    >>>>>> observing a location over time: they want to extract metadata from
    >>>>>> each image using Tika, analyze the images themselves using some
    >>>>>> other library, and detect anomalies in the overall appearance of the
    >>>>>> location over time as seen from multiple cameras.
    >>>>>> I believe all of these cases can not be solved with TikaIO because
    >>>>>> the resulting PCollection<String> contains no information about
    >>>>>> which String comes from which document and about the order in which
    >>>>>> they appear in the document.
    >>>>> These are good use cases, thanks... I thought you were talking
    >>>>> about the unordered soup of data produced by TikaIO (and its friends
    >>>>> TextIO and the like :-)).
    >>>>> Putting the ordered vs unordered question aside for a sec, why
    >>>>> exactly can a Tika Reader not make the name of the file it's
    >>>>> currently reading from available to the pipeline, as some piece of
    >>>>> Beam pipeline metadata ?
    >>>>> Surely it must be possible with Beam ? If not, I would be
    >>>>> surprised...
    >>>>>
    >>>>>>
    >>>>>> I am, honestly, struggling to think of a case where I would want to
    >>>>>> use Tika, but where I *would* be ok with getting an unordered soup
    >>>>>> of strings.
    >>>>>> So some examples would be very helpful.
    >>>>>>
    >>>>> Yes. I'll ask Tika developers to help with some examples, but I'll
    >>>>> give one example where it did not matter to us in what order
    >>>>> Tika-produced data were available to the downstream layer.
    >>>>>
    >>>>> It's a demo the Apache CXF colleague of mine showed at one of Apache
    >>>>> Con NAs, and we had a happy audience:
    >>>>>
    >>>>> https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search
    >>>>>
    >>>>>
    >>>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
    >>>>> into Lucene. We associate a file name with the indexed content and
    >>>>> then let users find a list of PDF files which contain a given word or
    >>>>> few words, details are here
    >>>>> https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L131
    >>>>>
    >>>>>
    >>>>> I'd say even more involved search engines would not mind supporting a
    >>>>> case like that :-)
    >>>>>
    >>>>> Now, there we process one file at a time, and I understand now that
    >>>>> with TikaIO and N files it's all over the place really as far as the
    >>>>> ordering is concerned, and which file the text is coming from, etc.
    >>>>> That's why TikaReader must be able to associate the file name with a
    >>>>> given piece of text it's making available to the pipeline.
    >>>>>
    >>>>> I'd be happy to support the ParDo way of linking Tika with Beam.
    >>>>> If it makes things simpler then that would be good; I've just no
    >>>>> idea at the moment how to start the pipeline without using a
    >>>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
    >>>>> earlier - how can one avoid it with ParDo when implementing a 'min
    >>>>> len chunk' feature, where the ParDo would have to concatenate
    >>>>> several SAX data pieces before making a single composite piece
    >>>>> available to the pipeline ?
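For what it's worth, a 'min len chunk' buffer needs no synchronization if it runs on the single thread that receives the SAX callbacks - a hypothetical sketch (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class MinLenBuffer {

    // Concatenates incoming SAX chunks until minLen characters have
    // accumulated, then emits one composite chunk. Everything runs on the
    // single thread that receives the parser callbacks, so no locking is
    // required.
    private final int minLen;
    private final StringBuilder buf = new StringBuilder();
    private final List<String> emitted = new ArrayList<>();

    public MinLenBuffer(int minLen) { this.minLen = minLen; }

    public void onChunk(String chunk) {
        buf.append(chunk);
        if (buf.length() >= minLen) {
            emitted.add(buf.toString());
            buf.setLength(0);
        }
    }

    // Flushes any remaining tail at end-of-document and returns all chunks.
    public List<String> finish() {
        if (buf.length() > 0) {
            emitted.add(buf.toString());
            buf.setLength(0);
        }
        return emitted;
    }
}
```

Inside a ParDo, `onChunk` would be called from the ContentHandler and the result of `finish` emitted per document, so the concatenation never crosses threads.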
    >>>>>
    >>>>>
    >>>>>> Another way to state it: currently, if I wanted to solve all of the
    >>>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
    >>>>>> API myself on the resulting ReadableFile. How can we make TikaIO
    >>>>>> provide a usability improvement over such usage?
    >>>>>>
    >>>>>
    >>>>>
    >>>>> If you are actually asking whether it really makes sense for Beam
    >>>>> to ship Tika-related code, given that users can just do it
    >>>>> themselves - I'm not sure.
    >>>>>
    >>>>> IMHO it always works better if users have to provide just a few config
    >>>>> options to an integral part of the framework and see things happening.
    >>>>> It will bring more users.
    >>>>>
    >>>>> Whether the current Tika code (refactored or not) stays with Beam or
    >>>>> not - I'll let you and the team decide; believe it or not I was
    >>>>> seriously contemplating at the last moment to make it all part of the
    >>>>> Tika project itself and have a bit more flexibility over there with
    >>>>> tweaking things, but now that it is in the Beam snapshot - I don't
    >>>>> know - it's not my decision...
    >>>>>
    >>>>>> I am confused by your other comment - "Does the ordering matter ?
    >>>>>> Perhaps
    >>>>>> for some cases it does, and for some it does not. May be it makes
    >>>>>> sense to support running TikaIO as both the bounded reader/source
    >>>>>> and ParDo, with getting the common code reused." - because using
    >>>>>> BoundedReader or ParDo is not related to the ordering issue, only to
    >>>>>> the issue of asynchronous reading and complexity of implementation.
    >>>>>> The resulting PCollection will be unordered either way - this needs
    >>>>>> to be solved separately by providing a different API.
    >>>>> Right, I see now: so ParDo is not about making Tika-reported data
    >>>>> available to the downstream pipeline components in order, only about
    >>>>> the simpler implementation.
    >>>>> Association with the file should be possible, I hope, but I
    >>>>> understand it would also be possible to optionally make the data
    >>>>> come out in an ordered way...
    >>>>>
    >>>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
    >>>>> let me double check: should we still give some thought to the
    >>>>> possible performance benefit of the current approach ? As I said, I
    >>>>> can easily get rid of all that polling code and use a simple
    >>>>> BlockingQueue.
    >>>>>
    >>>>> Cheers, Sergey
    >>>>>>
    >>>>>> Thanks.
    >>>>>>
    >>>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
    >>>>>> <sb...@gmail.com>
    >>>>>> wrote:
    >>>>>>
    >>>>>>> Hi
    >>>>>>>
    >>>>>>> Glad TikaIO getting some serious attention :-), I believe one thing
    >>>>>>> we both agree upon is that Tika can help Beam in its own unique way.
    >>>>>>>
    >>>>>>> Before trying to reply online, I'd like to state that my main
    >>>>>>> assumption is that TikaIO (as far as the read side is concerned) is
    >>>>>>> no different to Text, XML or similar bounded reader components.
    >>>>>>>
    >>>>>>> I have to admit I don't understand your questions about TikaIO
    >>>>>>> usecases.
    >>>>>>>
    >>>>>>> What are the Text input or XML input use-cases ? These use cases
    >>>>>>> are Tika input cases as well; the only difference is that Tika can
    >>>>>>> not split the individual file into a sequence of sources, etc.
    >>>>>>>
    >>>>>>> TextIO can read from plain text files (possibly zipped), XmlIO is
    >>>>>>> optimized around reading from XML files, and I thought I made it
    >>>>>>> clear (and it is a known fact anyway) that Tika is about reading
    >>>>>>> from basically any file format.
    >>>>>>>
    >>>>>>> Where is the difference (apart from what I've already mentioned) ?
    >>>>>>>
    >>>>>>> Sergey
    >>>>>>>
    >>>>>>>
    >>>>>>>
    >>>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
    >>>>>>>> Hi,
    >>>>>>>>
    >>>>>>>> Replies inline.
    >>>>>>>>
    >>>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
    >>>>>>>> <sb...@gmail.com>
    >>>>>>>> wrote:
    >>>>>>>>
    >>>>>>>>> Hi All
    >>>>>>>>>
    >>>>>>>>> This is my first post to the dev list; I work for Talend, I'm a
    >>>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
    >>>>>>>>> great to try and link both projects together, which led me to
    >>>>>>>>> opening [1] where I typed some early thoughts, followed by PR
    >>>>>>>>> [2].
    >>>>>>>>>
    >>>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
    >>>>>>>>> newer review comments from Eugene pending, so I'd like to
    >>>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
    >>>>>>>>> decide, based on the feedback from the experts, what to do next.
    >>>>>>>>>
    >>>>>>>>> Apache Tika Parsers report the text content in chunks, via
    >>>>>>>>> SaxParser events. It's not possible with Tika to take a file and
    >>>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
    >>>>>>>>> by line, the only way is to handle the SAXParser callbacks which
    >>>>>>>>> report the data chunks.
    >>>>>>>>> Some
    >>>>>>>>> parsers may report complete lines, some individual words, and
    >>>>>>>>> some are able to report the data only after they have completely
    >>>>>>>>> parsed the document.
    >>>>>>>>> It all depends on the data format.
    >>>>>>>>>
    >>>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
    >>>>>>>>> to parse the files, Beam threads will only collect the data from
    >>>>>>>>> the internal queue where the internal TikaReader's thread will
    >>>>>>>>> put the data into (note the data chunks are ordered even though
    >>>>>>>>> the tests might suggest otherwise).
    >>>>>>>>>
    >>>>>>>> I agree that your implementation of reader returns records in
    >>>>>>>> order
    >>>>>>>> - but
    >>>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
    >>>>>>>> the order in which records are produced by a BoundedReader - the
    >>>>>>>> order produced by your reader is ignored, and when applying any
    >>>>>>>> transforms to the
    >>>>>>> PCollection
    >>>>>>>> produced by TikaIO, it is impossible to recover the order in which
    >>>>>>>> your reader returned the records.
    >>>>>>>>
    >>>>>>>> With that in mind, is PCollection<String>, containing individual
    >>>>>>>> Tika-detected items, still the right API for representing the
    >>>>>>>> result of parsing a large number of documents with Tika?
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>>
    >>>>>>>>> The reason I did it was because I thought
    >>>>>>>>>
    >>>>>>>>> 1) it would make the individual data chunks available to the
    >>>>>>>>> pipeline faster - the parser will keep working through the
    >>>>>>>>> binary/video/etc. file while the data already starts flowing -
    >>>>>>>>> I agree there should be some test data available confirming it -
    >>>>>>>>> but I'm positive at the moment this approach might yield some
    >>>>>>>>> performance gains with large sets. If the file is large, or has
    >>>>>>>>> embedded attachments/videos to deal with, then it may be more
    >>>>>>>>> effective not to have the Beam thread deal with it...
    >>>>>>>>>
    >>>>>>>> As I said on the PR, this description contains unfounded and
    >>>>>>>> potentially
    >>>>>>>> incorrect assumptions about how Beam runners execute (or may
    >>>>>>>> execute in
    >>>>>>> the
    >>>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
    >>>>>>> correctly,
    >>>>>>>> you might be assuming that:
    >>>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
    >>>>>>> complete
    >>>>>>>> before processing its outputs with downstream transforms
    >>>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
    >>>>>>> *concurrently*
    >>>>>>>> with downstream processing of its results
    >>>>>>>> - Passing an element from one thread to another using a
    >>>>>>>> BlockingQueue is free in terms of performance.
    >>>>>>>> All of these are false in at least some runners, and I'm almost
    >>>>>>>> certain that in reality, the performance of this approach is
    >>>>>>>> worse than a ParDo in most production runners.
    >>>>>>>>
    >>>>>>>> There are other disadvantages to this approach:
    >>>>>>>> - Doing the bulk of the processing in a separate thread makes it
    >>>>>>> invisible
    >>>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
    >>>>>>>> profiling capabilities, or the ability to get the current stack
    >>>>>>>> trace for stuck elements, this approach would make the real
    >>>>>>>> processing invisible to all of these capabilities, and a user
    >>>>>>>> would only see that the bulk of the time is spent waiting for the
    >>>>>>>> next element, but not *why* the next
    >>>>>>> element
    >>>>>>>> is taking long to compute.
    >>>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
    >>>>>>>> invisible to Beam, will make it harder for runners to do
    >>>>>>>> autoscaling, binpacking
    >>>>>>> and
    >>>>>>>> other resource management magic (how much of this runners actually
    >>>>>>>> do is
    >>>>>>> a
    >>>>>>>> separate issue), because the runner will have no way of knowing
    >>>>>>>> how much CPU/IO this particular transform is actually using - all
    >>>>>>>> the processing happens in a thread about which the runner is
    >>>>>>>> unaware.
    >>>>>>>> - As far as I can tell, the code also hides exceptions that happen
    >>>>>>>> in the Tika thread
    >>>>>>>> - Adding the thread management makes the code much more complex,
    >>>>>>>> easier
    >>>>>>> to
    >>>>>>>> introduce bugs, and harder for others to contribute
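(The exception-hiding point is easy to demonstrate with plain JDK threading - a generic sketch, not TikaIO's actual code: an exception thrown on a hand-rolled worker thread never reaches the consumer joining on it, whereas work submitted through an ExecutorService surfaces the failure at the call site.)

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class HiddenExceptionDemo {
  // A hand-rolled worker thread: the failure dies with the thread and
  // join() gives the caller no way to observe it.
  public static boolean plainThreadSeesFailure() throws InterruptedException {
    Thread worker = new Thread(() -> {
      throw new RuntimeException("parse failure"); // silently lost
    });
    worker.setUncaughtExceptionHandler((t, e) -> { /* swallowed for the demo */ });
    worker.start();
    worker.join(); // returns normally; the caller never learns of the error
    return false;
  }

  // The same failing work submitted through an ExecutorService: the
  // failure is rethrown to the caller from Future.get().
  public static String futureSurfacesFailure() throws InterruptedException {
    ExecutorService exec = Executors.newSingleThreadExecutor();
    try {
      Future<?> f = exec.submit((Runnable) () -> {
        throw new RuntimeException("parse failure");
      });
      f.get();
      return "no error";
    } catch (ExecutionException e) {
      return e.getCause().getMessage();
    } finally {
      exec.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(plainThreadSeesFailure()); // false
    System.out.println(futureSurfacesFailure());  // parse failure
  }
}
```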
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>> 2) As I commented at the end of [2], having an option to
    >>>>>>>>> concatenate the data chunks first before making them available to
    >>>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
    >>>>>>>>> introduce some synchronization issues (though I'm not exactly
    >>>>>>>>> sure yet)
    >>>>>>>>>
    >>>>>>>> What are these issues?
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>>
    >>>>>>>>> One of valid concerns there is that the reader is polling the
    >>>>>>>>> internal queue so, in theory at least, and perhaps in some rare
    >>>>>>>>> cases too, we may have a case where the max polling time has been
    >>>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
    >>>>>>>>> the file data. I think that it can be solved by either 2a)
    >>>>>>>>> configuring the max polling time to a very large number which
    >>>>>>>>> will never be reached for a practical case, or
    >>>>>>>>> 2b) simply use a blocking queue without the time limits - in the
    >>>>>>>>> worst case, if TikaParser spins and fails to report the end of
    >>>>>>>>> the document, then Beam can heal itself if the pipeline blocks.
    >>>>>>>>> I propose to follow 2b).
    >>>>>>>>>
    >>>>>>>> I agree that there should be no way to unintentionally configure
    >>>>>>>> the transform in a way that will produce silent data loss. Another
    >>>>>>>> reason for not having these tuning knobs is that it goes against
    >>>>>>>> Beam's "no knobs"
    >>>>>>>> philosophy, and that in most cases users have no way of figuring
    >>>>>>>> out a
    >>>>>>> good
    >>>>>>>> value for tuning knobs except for manual experimentation, which is
    >>>>>>>> extremely brittle and typically gets immediately obsoleted by
    >>>>>>>> running on
    >>>>>>> a
    >>>>>>>> new dataset or updating a version of some of the involved
    >>>>>>>> dependencies
    >>>>>>> etc.
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>>
    >>>>>>>>>
    >>>>>>>>> Please let me know what you think.
    >>>>>>>>> My plan so far is:
    >>>>>>>>> 1) start addressing most of Eugene's comments which would require
    >>>>>>>>> some minor TikaIO updates
    >>>>>>>>> 2) at the next stage, work on removing the TikaSource internal
    >>>>>>>>> code dealing with file patterns which I copied from TextIO
    >>>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
    >>>>>>>>> users some time to try it with some real complex files and also
    >>>>>>>>> decide if TikaIO can continue to be implemented as a
    >>>>>>>>> BoundedSource/Reader or not
    >>>>>>>>>
    >>>>>>>>> Eugene, all, will it work if I start with 1) ?
    >>>>>>>>>
    >>>>>>>> Yes, but I think we should start by discussing the anticipated use
    >>>>>>>> cases
    >>>>>>> of
    >>>>>>>> TikaIO and designing an API for it based on those use cases; and
    >>>>>>>> then see what's the best implementation for that particular API
    >>>>>>>> and set of anticipated use cases.
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>>
    >>>>>>>>> Thanks, Sergey
    >>>>>>>>>
    >>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
    >>>>>>>>> [2] https://github.com/apache/beam/pull/3378
    >>>>>>>>>
    >>>>>>>>
    >>>>>>>
    >>>>>>
    >>>>>
    >>
    > 
    
    
    -- 
    Sergey Beryozkin
    
    Talend Community Coders
    http://coders.talend.com/
    



Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Eugene

Thank you, very helpful; let me read it a few times before I get what
exactly I need to clarify :-). Two questions so far:

On 21/09/17 21:40, Eugene Kirpichov wrote:
> Thanks all for the discussion. It seems we have consensus that both
> within-document order and association with the original filename are
> necessary, but currently absent from TikaIO.
> 
> *Association with original file:*
> Sergey - Beam does not *automatically* provide a way to associate an
> element with the file it originated from: automatically tracking data
> provenance is a known very hard research problem on which many papers have
> been written, and obvious solutions are very easy to break. See related
> discussion at
> https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
>   .
> 
> If you want the elements of your PCollection to contain additional
> information, you need the elements themselves to contain this information:
> the elements are self-contained and have no metadata associated with them
> (beyond the timestamp and windows, universal to the whole Beam model).
> 
> *Order within a file:*
> The only way to have any kind of order within a PCollection is to have the
> elements of the PCollection contain something ordered, e.g. have a
> PCollection<List<Something>>, where each List is for one file [I'm assuming
> Tika, at a low level, works on a per-file basis?]. However, since TikaIO
> can be applied to very large files, this could produce very large elements,
> which is a bad idea. Because of this, I don't think the result of applying
> Tika to a single file can be encoded as a PCollection element.
> 
> Given both of these, I think that it's not possible to create a
> *general-purpose* TikaIO transform that will be better than manual
> invocation of Tika as a DoFn on the result of FileIO.readMatches().
> 
> However, looking at the examples at
> https://tika.apache.org/1.16/examples.html - almost all of the examples
> involve extracting a single String from each document. This use case, with
> the assumption that individual documents are small enough, can certainly be
> simplified and TikaIO could be a facade for doing just this.
> 
> E.g. TikaIO could:
> - take as input a PCollection<ReadableFile>
> - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
> is a class with properties { String content, Metadata metadata }

and what is the 'String' in KV<String,...>, given that TikaIO.ParseResult
represents the content + (Tika) Metadata of the file, such as the author
name, etc? Is it the file name?
> - be configured by: a Parser (it implements Serializable so can be
> specified at pipeline construction time) and a ContentHandler whose
> toString() will go into "content". ContentHandler does not implement
> Serializable, so you can not specify it at construction time - however, you
> can let the user specify either its class (if it's a simple handler like a
> BodyContentHandler) or specify a lambda for creating the handler
> (SerializableFunction<Void, ContentHandler>), and potentially you can have
> a simpler facade for Tika.parseAsString() - e.g. call it
> TikaIO.parseAllAsStrings().
> 
> Example usage would look like:
> 
>    PCollection<KV<String, ParseResult>> parseResults =
> p.apply(FileIO.match().filepattern(...))
>      .apply(FileIO.readMatches())
>      .apply(TikaIO.parseAllAsStrings())
> 
> or:
> 
>      .apply(TikaIO.parseAll()
>          .withParser(new AutoDetectParser())
>          .withContentHandler(() -> new BodyContentHandler(new
> ToXMLContentHandler())))
> 
> You could also have shorthands for letting the user avoid using FileIO
> directly in simple cases, for example:
>      p.apply(TikaIO.parseAsStrings().from(filepattern))
> 
> This would of course be implemented as a ParDo or even MapElements, and
> you'll be able to share the code between parseAll and regular parse.
> 
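(For illustration, the parseAll transform sketched above could be built around a DoFn roughly like the following. This is a rough sketch only - ParseResult and the surrounding builder API are hypothetical names from this thread, not an existing Beam API; error handling and the configurable ContentHandler are omitted.)

```java
// Sketch, assuming Beam's FileIO and Tika on the classpath. ParseResult
// is the hypothetical { String content, Metadata metadata } class from
// this thread.
class ParseWithTikaFn extends DoFn<FileIO.ReadableFile, KV<String, ParseResult>> {
  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    FileIO.ReadableFile file = c.element();
    String filename = file.getMetadata().resourceId().toString();
    Metadata tikaMetadata = new Metadata();
    // -1 lifts BodyContentHandler's default character cap.
    BodyContentHandler handler = new BodyContentHandler(-1);
    try (InputStream is = Channels.newInputStream(file.open())) {
      new AutoDetectParser().parse(is, handler, tikaMetadata);
    }
    c.output(KV.of(filename, new ParseResult(handler.toString(), tikaMetadata)));
  }
}
```

This keeps all parsing inside a single @ProcessElement call, visible to the runner, with the filename carried in the KV key.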
OK. What about the current source on master - should it be marked
Experimental till I manage to write something new with the above ideas
in mind? Or is there enough time till 2.2.0 gets released?

Thanks, Sergey
> On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sb...@gmail.com>
> wrote:
> 
>> Hi Tim
>> On 21/09/17 14:33, Allison, Timothy B. wrote:
>>> Thank you, Sergey.
>>>
>>> My knowledge of Apache Beam is limited -- I saw Davor and
>> Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally
>> impressed, but I haven't had a chance to work with it yet.
>>>
>>>   From my perspective, if I understand this thread (and I may not!),
>> getting unordered text from _a given file_ is a non-starter for most
>> applications.  The implementation needs to guarantee order per file, and
>> the user has to be able to link the "extract" back to a unique identifier
>> for the document.  If the current implementation doesn't do those things,
>> we need to change it, IMHO.
>>>
>> Right now the Tika-related reader does not associate a given text
>> fragment with the file name, so a function looking at some text and
>> trying to find where it came from won't be able to do so.
>>
>> So I asked how to do it in Beam, how to attach some context to the given
>> piece of data. I hope it can be done and if not - then perhaps some
>> improvement can be applied.
>>
>> Re the unordered text - yes - this is what we currently have with Beam +
>> TikaIO :-).
>>
>> The use-case I referred to earlier in this thread (upload PDFs, save
>> the possibly unordered text to Lucene with the file name 'attached',
>> let users search for the files containing some words or phrases - this
>> works OK given that I can see the PDF parser, for example, reporting
>> the lines) can be supported with the current TikaIO (provided we find
>> a way to 'attach' a file name to the flow).
>>
>> I see though supporting the total ordering can be a big deal in other
>> cases. Eugene, can you please explain how it can be done, is it
>> achievable in principle, without the users having to do some custom
>> coding ?
>>
>>> To the question of -- why is this in Beam at all; why don't we let users
>> call it if they want it?...
>>>
>>> No matter how much we do to Tika, it will behave badly sometimes --
>> permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks
>> using Beam -- folks likely with large batches of unruly/noisy documents --
>> are more likely to run into these problems than your average
>> couple-of-thousand-docs-from-our-own-company user. So, if there are things
>> we can do in Beam to prevent developers around the world from having to
>> reinvent the wheel for defenses against these problems, then I'd be
>> enormously grateful if we could put Tika into Beam.  That means:
>>>
>>> 1) a process-level timeout (because you can't actually kill a thread in
>> Java)
>>> 2) a process-level restart on OOM
>>> 3) avoid trying to reprocess a badly behaving document
>>>
>>> If Beam automatically handles those problems, then I'd say, y, let users
>> write their own code.  If there is so much as a single configuration knob
>> (and it sounds like Beam is against complex configuration...yay!) to get
>> that working in Beam, then I'd say, please integrate Tika into Beam.  From
>> a safety perspective, it is critical to keep the extraction process
>> entirely separate (jvm, vm, m, rack, data center!) from the
>> transformation+loading steps.  IMHO, very few devs realize this because
>> Tika works well lots of the time...which is why it is critical for us to
>> make it easy for people to get it right all of the time.
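(A process-level timeout, point 1 above, can be sketched with the JDK alone. This is a sketch under the assumption of a POSIX environment with `sleep` and `true` binaries on the PATH; in a real system the command would be the external Tika invocation.)

```java
import java.util.concurrent.TimeUnit;

public class ProcessTimeoutDemo {
  // Runs a command in a child process and kills it if it exceeds the
  // timeout - unlike a Java thread, a process can always be terminated.
  public static boolean finishedInTime(long seconds, String... cmd) throws Exception {
    Process p = new ProcessBuilder(cmd).start();
    if (p.waitFor(seconds, TimeUnit.SECONDS)) {
      return true;
    }
    p.destroyForcibly().waitFor(); // the programmatic equivalent of kill -9
    return false;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(finishedInTime(1, "sleep", "5")); // false: timed out
  }
}
```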
>>>
>>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch
>> mode first in one jvm, and then I kick off another process to do
>> transform/loading into Lucene/Solr from the .json files that Tika generates
>> for each input file.  If I were to scale up, I'd want to maintain this
>> complete separation of steps.
>>>
>>> Apologies if I've derailed the conversation or misunderstood this thread.
>>>
>> Major thanks for your input :-)
>>
>> Cheers, Sergey
>>
>>> Cheers,
>>>
>>>                  Tim
>>>
>>> -----Original Message-----
>>> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
>>> Sent: Thursday, September 21, 2017 9:07 AM
>>> To: dev@beam.apache.org
>>> Cc: Allison, Timothy B. <ta...@mitre.org>
>>> Subject: Re: TikaIO concerns
>>>
>>> Hi All
>>>
>>> Please welcome Tim, one of Apache Tika leads and practitioners.
>>>
>>> Tim, thanks for joining in :-). If you have some great Apache Tika
>> stories to share (preferably involving the cases where it did not really
>> matter the ordering in which Tika-produced data were dealt with by the
>>> consumers) then please do so :-).
>>>
>>> At the moment, even though Tika ContentHandler will emit the ordered
>> data, the Beam runtime will have no guarantees that the downstream pipeline
>> components will see the data coming in the right order.
>>>
>>> (FYI, I understand from the earlier comments that the total ordering is
>> also achievable but would require the extra API support)
>>>
>>> Other comments would be welcome too
>>>
>>> Thanks, Sergey
>>>
>>> On 21/09/17 10:55, Sergey Beryozkin wrote:
>>>> I noticed that the PDF and ODT parsers actually split by lines, not
>>>> individual words, and I'm nearly 100% sure I saw Tika reporting
>>>> individual lines when it was parsing text files. The 'min text length'
>>>> feature can help with reporting several lines at a time, etc...
>>>>
>>>> I'm working with this PDF all the time:
>>>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
>>>>
>>>> try it too if you get a chance.
>>>>
>>>> (and I can imagine not all PDFs etc. represent a 'story' - some can
>>>> have log-like content too)
>>>>
>>>> That said, I don't know how a parser for the format N will behave, it
>>>> depends on the individual parsers.
>>>>
>>>> IMHO it's an equal candidate alongside Text-based bounded IOs...
>>>>
>>>> I'd like to know though how to make a file name available to the
>>>> pipeline which is working with the current text fragment ?
>>>>
>>>> Going to try and do some measurements and compare the sync vs async
>>>> parsing modes...
>>>>
>>>> Asked the Tika team to support with some more examples...
>>>>
>>>> Cheers, Sergey
>>>> On 20/09/17 22:17, Sergey Beryozkin wrote:
>>>>> Hi,
>>>>>
>>>>> thanks for the explanations,
>>>>>
>>>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>>>>> Hi!
>>>>>>
>>>>>> TextIO returns an unordered soup of lines contained in all files you
>>>>>> ask it to read. People usually use TextIO for reading files where 1
>>>>>> line corresponds to 1 independent data element, e.g. a log entry, or
>>>>>> a row of a CSV file - so discarding order is ok.
>>>>> Just a side note, I'd probably want that to be ordered, though I
>>>>> guess it depends...
>>>>>> However, there is a number of cases where TextIO is a poor fit:
>>>>>> - Cases where discarding order is not ok - e.g. if you're doing
>>>>>> natural language processing and the text files contain actual prose,
>>>>>> where you need to process a file as a whole. TextIO can't do that.
>>>>>> - Cases where you need to remember which file each element came
>>>>>> from, e.g.
>>>>>> if you're creating a search index for the files: TextIO can't do
>>>>>> this either.
>>>>>>
>>>>>> Both of these issues have been raised in the past against TextIO;
>>>>>> however it seems that the overwhelming majority of users of TextIO
>>>>>> use it for logs or CSV files or alike, so solving these issues has
>>>>>> not been a priority.
>>>>>> Currently they are solved in a general form via FileIO.read() which
>>>>>> gives you access to reading a full file yourself - people who want
>>>>>> more flexibility will be able to use standard Java text-parsing
>>>>>> utilities on a ReadableFile, without involving TextIO.
>>>>>>
>>>>>> Same applies for XmlIO: it is specifically designed for the narrow
>>>>>> use case where the files contain independent data entries, so
>>>>>> returning an unordered soup of them, with no association to the
>>>>>> original file, is the user's intention. XmlIO will not work for
>>>>>> processing more complex XML files that are not simply a sequence of
>>>>>> entries with the same tag, and it also does not remember the
>>>>>> original filename.
>>>>>>
>>>>>
>>>>> OK...
>>>>>
>>>>>> However, if my understanding of Tika use cases is correct, it is
>>>>>> mainly used for extracting content from complex file formats - for
>>>>>> example, extracting text and images from PDF files or Word
>>>>>> documents. I believe this is the main difference between it and
>>>>>> TextIO - people usually use Tika for complex use cases where the
>>>>>> "unordered soup of stuff" abstraction is not useful.
>>>>>>
>>>>>> My suspicion about this is confirmed by the fact that the crux of
>>>>>> the Tika API is ContentHandler
>>>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
>>>>>> html?is-external=true
>>>>>>
>>>>>> whose
>>>>>> documentation says "The order of events in this interface is very
>>>>>> important, and mirrors the order of information in the document
>> itself."
>>>>> All that says is that a (Tika) ContentHandler will be a true SAX
>>>>> ContentHandler...
>>>>>>
>>>>>> Let me give a few examples of what I think is possible with the raw
>>>>>> Tika API, but I think is not currently possible with TikaIO - please
>>>>>> correct me where I'm wrong, because I'm not particularly familiar
>>>>>> with Tika and am judging just based on what I read about it.
>>>>>> - User has 100,000 Word documents and wants to convert each of them
>>>>>> to text files for future natural language processing.
>>>>>> - User has 100,000 PDF files with financial statements, each
>>>>>> containing a bunch of unrelated text and - the main content - a list
>>>>>> of transactions in PDF tables. User wants to extract each
>>>>>> transaction as a PCollection element, discarding the unrelated text.
>>>>>> - User has 100,000 PDF files with scientific papers, and wants to
>>>>>> extract text from them, somehow parse author and affiliation from
>>>>>> the text, and compute statistics of topics and terminology usage by
>>>>>> author name and affiliation.
>>>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>>>>>> observing a location over time: they want to extract metadata from
>>>>>> each image using Tika, analyze the images themselves using some
>>>>>> other library, and detect anomalies in the overall appearance of the
>>>>>> location over time as seen from multiple cameras.
>>>>>> I believe all of these cases can not be solved with TikaIO because
>>>>>> the resulting PCollection<String> contains no information about
>>>>>> which String comes from which document and about the order in which
>>>>>> they appear in the document.
>>>>> These are good use cases, thanks... I thought you were talking
>>>>> about the unordered soup of data produced by TikaIO (and its
>>>>> friends, TextIO and the like :-)).
>>>>> Putting the ordered vs unordered question aside for a sec, why
>>>>> exactly can a Tika Reader not make the name of the file it's
>>>>> currently reading from available to the pipeline, as some Beam
>>>>> pipeline metadata piece?
>>>>> Surely it can be possible with Beam ? If not then I would be
>> surprised...
>>>>>
>>>>>>
>>>>>> I am, honestly, struggling to think of a case where I would want to
>>>>>> use Tika, but where I *would* be ok with getting an unordered soup
>>>>>> of strings.
>>>>>> So some examples would be very helpful.
>>>>>>
>>>>> Yes. I'll ask Tika developers to help with some examples, but I'll
>>>>> give one example where it did not matter to us in what order
>>>>> Tika-produced data were available to the downstream layer.
>>>>>
>>>>> It's a demo an Apache CXF colleague of mine showed at one of the
>>>>> ApacheCon NAs, and we had a happy audience:
>>>>>
>>>>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea
>>>>> se/samples/jax_rs/search
>>>>>
>>>>>
>>>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
>>>>> into Lucene. We associate a file name with the indexed content and
>>>>> then let users find a list of PDF files which contain a given word or
>>>>> few words, details are here
>>>>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea
>>>>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal
>>>>> og.java#L131
>>>>>
>>>>>
>>>>> I'd say even more involved search engines would not mind supporting a
>>>>> case like that :-)
>>>>>
>>>>> Now there we process one file at a time, and I understand now that
>>>>> with TikaIO and N files it's all over the place really as far as the
>>>>> ordering is concerned, which file it's coming from. etc. That's why
>>>>> TikaReader must be able to associate the file name with a given piece
>>>>> of text it's making available to the pipeline.
>>>>>
>>>>> I'd be happy to support the ParDo way of linking Tika with Beam.
>>>>> If it makes things simpler then it would be good, I've just no idea
>>>>> at the moment how to start the pipeline without using a
>>>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
>>>>> earlier - how can one avoid it with ParDo when implementing a 'min
>>>>> len chunk' feature, where the ParDo would have to concatenate several
>>>>> SAX data pieces first before making a single composite piece
>>>>> available to the pipeline?
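(For what it's worth, with a ParDo each document would be parsed inside a single @ProcessElement call, so the 'min length' concatenation is just local per-document state and needs no cross-thread synchronization. The buffering logic alone could look like this - a sketch, with made-up names:)

```java
import java.util.ArrayList;
import java.util.List;

public class MinLengthChunker {
  // Concatenates parser-pushed fragments, emitting a chunk only once it
  // reaches minLen characters; the tail is flushed at end of document.
  public static List<String> rechunk(List<String> saxPieces, int minLen) {
    List<String> out = new ArrayList<>();
    StringBuilder buf = new StringBuilder();
    for (String piece : saxPieces) {
      buf.append(piece);
      if (buf.length() >= minLen) {
        out.add(buf.toString());
        buf.setLength(0);
      }
    }
    if (buf.length() > 0) {
      out.add(buf.toString()); // flush whatever is left
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(rechunk(List.of("ab", "cd", "e"), 3)); // [abcd, e]
  }
}
```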
>>>>>
>>>>>
>>>>>> Another way to state it: currently, if I wanted to solve all of the
>>>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
>>>>>> API myself on the resulting ReadableFile. How can we make TikaIO
>>>>>> provide a usability improvement over such usage?
>>>>>>
>>>>>
>>>>>
>>>>> If you are actually asking whether it really makes sense for Beam
>>>>> to ship Tika-related code, given that users can just do it
>>>>> themselves - I'm not sure.
>>>>>
>>>>> IMHO it always works better if users have to provide just a few config
>>>>> options to an integral part of the framework and see things happening.
>>>>> It will bring more users.
>>>>>
>>>>> Whether the current Tika code (refactored or not) stays with Beam or
>>>>> not - I'll let you and the team decide; believe it or not I was
>>>>> seriously contemplating at the last moment to make it all part of the
>>>>> Tika project itself and have a bit more flexibility over there with
>>>>> tweaking things, but now that it is in the Beam snapshot - I don't
>>>>> know - it's not my decision...
>>>>>
>>>>>> I am confused by your other comment - "Does the ordering matter ?
>>>>>> Perhaps
>>>>>> for some cases it does, and for some it does not. May be it makes
>>>>>> sense to support running TikaIO as both the bounded reader/source
>>>>>> and ParDo, with getting the common code reused." - because using
>>>>>> BoundedReader or ParDo is not related to the ordering issue, only to
>>>>>> the issue of asynchronous reading and complexity of implementation.
>>>>>> The resulting PCollection will be unordered either way - this needs
>>>>>> to be solved separately by providing a different API.
>>>>> Right, I see now - so ParDo is not about making Tika-reported data
>>>>> available to the downstream pipeline components in order, only
>>>>> about a simpler implementation.
>>>>> Association with the file should be possible I hope, but I understand
>>>>> it would be possible to optionally make the data come out in an
>>>>> ordered way as well...
>>>>>
>>>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
>>>>> let me double check: should we still give some thought to the
>>>>> possible performance benefit of the current approach? As I said, I
>>>>> can easily get rid of all that polling code and use a simple
>>>>> BlockingQueue.
>>>>>
>>>>> Cheers, Sergey
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
>>>>>> <sb...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> Glad TikaIO is getting some serious attention :-). I believe one thing
>>>>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>>>>
>>>>>>> Before trying to reply online, I'd like to state that my main
>>>>>>> assumption is that TikaIO (as far as the read side is concerned) is
>>>>>>> no different to Text, XML or similar bounded reader components.
>>>>>>>
>>>>>>> I have to admit I don't understand your questions about TikaIO
>>>>>>> usecases.
>>>>>>>
>>>>>>> What are the Text input or XML input use-cases? These use cases
>>>>>>> are Tika input cases as well; the only difference is Tika can not
>>>>>>> split the individual file into a sequence of sources, etc.
>>>>>>>
>>>>>>> TextIO can read from plain text files (possibly zipped), XmlIO is
>>>>>>> optimized around reading from XML files, and I thought I made it
>>>>>>> clear (and it is a known fact anyway) that Tika is about reading
>>>>>>> from basically any file format.
>>>>>>>
>>>>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>>>>
>>>>>>> Sergey
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Replies inline.
>>>>>>>>
>>>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
>>>>>>>> <sb...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi All
>>>>>>>>>
>>>>>>>>> This is my first post the the dev list, I work for Talend, I'm a
>>>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
>>>>>>>>> great to try and link both projects together, which led me to
>>>>>>>>> opening [1] where I typed some early thoughts, followed by PR
>>>>>>>>> [2].
>>>>>>>>>
>>>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
>>>>>>>>> newer review comments from Eugene pending, so I'd like to
>>>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
>>>>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>>>>
>>>>>>>>> Apache Tika Parsers report the text content in chunks, via
>>>>>>>>> SaxParser events. It's not possible with Tika to take a file and
>>>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
>>>>>>>>> by line, the only way is to handle the SAXParser callbacks which
>>>>>>>>> report the data chunks.
>>>>>>>>> Some
>>>>>>>>> parsers may report the complete lines, some individual words,
>>>>>>>>> with some being able report the data only after the completely
>>>>>>>>> parse the document.
>>>>>>>>> All depends on the data format.
>>>>>>>>>
>>>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
>>>>>>>>> to parse the files, Beam threads will only collect the data from
>>>>>>>>> the internal queue where the internal TikaReader's thread will
>>>>>>>>> put the data into (note the data chunks are ordered even though
>>>>>>>>> the tests might suggest otherwise).
>>>>>>>>>
>>>>>>>> I agree that your implementation of reader returns records in
>>>>>>>> order
>>>>>>>> - but
>>>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
>>>>>>>> the order in which records are produced by a BoundedReader - the
>>>>>>>> order produced by your reader is ignored, and when applying any
>>>>>>>> transforms to the
>>>>>>> PCollection
>>>>>>>> produced by TikaIO, it is impossible to recover the order in which
>>>>>>>> your reader returned the records.
>>>>>>>>
>>>>>>>> With that in mind, is PCollection<String>, containing individual
>>>>>>>> Tika-detected items, still the right API for representing the
>>>>>>>> result of parsing a large number of documents with Tika?
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The reason I did it was because I thought
>>>>>>>>>
>>>>>>>>> 1) it would make the individual data chunks available faster to
>>>>>>>>> the pipeline - the parser will continue working via the
>>>>>>>>> binary/video etc file while the data will already start flowing -
>>>>>>>>> I agree there should be some tests data available confirming it -
>>>>>>>>> but I'm positive at the moment this approach might yield some
>>>>>>>>> performance gains with the large sets. If the file is large, if
>>>>>>>>> it has the embedded attachments/videos to deal with, then it may
>>>>>>>>> be more effective not to get the Beam thread deal with it...
>>>>>>>>>
>>>>>>>>> As I said on the PR, this description contains unfounded and
>>>>>>>>> potentially
>>>>>>>> incorrect assumptions about how Beam runners execute (or may
>>>>>>>> execute in
>>>>>>> the
>>>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>>>>> correctly,
>>>>>>>> you might be assuming that:
>>>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>>>>> complete
>>>>>>>> before processing its outputs with downstream transforms
>>>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>>>>> *concurrently*
>>>>>>>> with downstream processing of its results
>>>>>>>> - Passing an element from one thread to another using a
>>>>>>>> BlockingQueue is free in terms of performance. All of these are
>>>>>>>> false at least in some runners, and I'm almost certain that in
>>>>>>>> reality, performance of this approach is worse than a ParDo in
>>>>>>> most
>>>>>>>> production runners.
>>>>>>>>
>>>>>>>> There are other disadvantages to this approach:
>>>>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>>>>> invisible
>>>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>>>>>> profiling capabilities, or the ability to get the current stack
>>>>>>>> trace for stuck elements, this approach would make the real
>>>>>>>> processing invisible to all of these capabilities, and a user
>>>>>>>> would only see that the bulk of the time is spent waiting for the
>>>>>>>> next element, but not *why* the next
>>>>>>> element
>>>>>>>> is taking long to compute.
>>>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>>>>>> invisible to Beam, will make it harder for runners to do
>>>>>>>> autoscaling, binpacking
>>>>>>> and
>>>>>>>> other resource management magic (how much of this runners actually
>>>>>>>> do is
>>>>>>> a
>>>>>>>> separate issue), because the runner will have no way of knowing
>>>>>>>> how much CPU/IO this particular transform is actually using - all
>>>>>>>> the processing happens in a thread about which the runner is
>>>>>>>> unaware.
>>>>>>>> - As far as I can tell, the code also hides exceptions that happen
>>>>>>>> in the Tika thread
>>>>>>>> - Adding the thread management makes the code much more complex,
>>>>>>>> easier
>>>>>>> to
>>>>>>>> introduce bugs, and harder for others to contribute
>>>>>>>>
>>>>>>>>
>>>>>>>>> 2) As I commented at the end of [2], having an option to
>>>>>>>>> concatenate the data chunks first before making them available to
>>>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
>>>>>>>>> introduce some synchronization issues (though not exactly sure
>>>>>>>>> yet)
>>>>>>>>>
>>>>>>>> What are these issues?
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> One of valid concerns there is that the reader is polling the
>>>>>>>>> internal queue so, in theory at least, and perhaps in some rare
>>>>>>>>> cases too, we may have a case where the max polling time has been
>>>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
>>>>>>>>> the file data. I think that it can be solved by either 2a)
>>>>>>>>> configuring the max polling time to a very large number which
>>>>>>>>> will never be reached for a practical case, or
>>>>>>>>> 2b) simply use a blocking queue without the time limits - in the
>>>>>>>>> worst case, if TikaParser spins and fails to report the end of
>>>>>>>>> the document, then, Beam can heal itself if the pipeline blocks.
>>>>>>>>> I propose to follow 2b).
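For illustration, option 2b) in plain Java: an unbounded LinkedBlockingQueue where the consumer blocks on take() with no timeout, and a "poison pill" object marks the end of the document, so no data can be lost to an expired polling limit. This is a standalone sketch, not the actual TikaReader code - the class and names below are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BlockingQueueSketch {
    // Poison pill marking the end of the parsed document; a unique instance,
    // so it can be recognized by reference comparison.
    public static final String END_OF_DOCUMENT = new String("EOD");

    /** Drains chunks until the poison pill arrives; take() blocks with no time limit. */
    public static List<String> consumeAll(BlockingQueue<String> queue) {
        List<String> chunks = new ArrayList<>();
        try {
            while (true) {
                // take() blocks indefinitely, so no data is silently dropped
                // by an expired polling timeout.
                String chunk = queue.take();
                if (chunk == END_OF_DOCUMENT) { // reference comparison on the pill
                    return chunks;
                }
                chunks.add(chunk);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while draining the queue", e);
        }
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        // The producer thread stands in for the internal Tika parser thread.
        Thread producer = new Thread(() -> {
            queue.add("chunk-1");
            queue.add("chunk-2");
            queue.add(END_OF_DOCUMENT);
        });
        producer.start();
        List<String> chunks = consumeAll(queue);
        producer.join();
        System.out.println(chunks); // [chunk-1, chunk-2]
    }
}
```

If the parser spins and never enqueues the pill, the pipeline blocks visibly rather than losing data, which is the trade-off 2b) accepts.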
>>>>>>>>>
>>>>>>>> I agree that there should be no way to unintentionally configure
>>>>>>>> the transform in a way that will produce silent data loss. Another
>>>>>>>> reason for not having these tuning knobs is that it goes against
>>>>>>>> Beam's "no knobs"
>>>>>>>> philosophy, and that in most cases users have no way of figuring
>>>>>>>> out a
>>>>>>> good
>>>>>>>> value for tuning knobs except for manual experimentation, which is
>>>>>>>> extremely brittle and typically gets immediately obsoleted by
>>>>>>>> running on
>>>>>>> a
>>>>>>>> new dataset or updating a version of some of the involved
>>>>>>>> dependencies
>>>>>>> etc.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please let me know what you think.
>>>>>>>>> My plan so far is:
>>>>>>>>> 1) start addressing most of Eugene's comments which would require
>>>>>>>>> some minor TikaIO updates
>>>>>>>>> 2) work on removing the TikaSource internal code dealing with
>>>>>>>>> File patterns which I copied from TextIO at the next stage
>>>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
>>>>>>>>> users some time to try it with some real complex files and also
>>>>>>>>> decide if TikaIO can continue implemented as a
>>>>>>>>> BoundedSource/Reader or not
>>>>>>>>>
>>>>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>>>>
>>>>>>>> Yes, but I think we should start by discussing the anticipated use
>>>>>>>> cases
>>>>>>> of
>>>>>>>> TikaIO and designing an API for it based on those use cases; and
>>>>>>>> then see what's the best implementation for that particular API
>>>>>>>> and set of anticipated use cases.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks, Sergey
>>>>>>>>>
>>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>
> 


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Eugene

Thank you, very helpful, let me read it a few times before I get what
exactly I need to clarify :-), two questions so far:

On 21/09/17 21:40, Eugene Kirpichov wrote:
> Thanks all for the discussion. It seems we have consensus that both
> within-document order and association with the original filename are
> necessary, but currently absent from TikaIO.
> 
> *Association with original file:*
> Sergey - Beam does not *automatically* provide a way to associate an
> element with the file it originated from: automatically tracking data
> provenance is a known very hard research problem on which many papers have
> been written, and obvious solutions are very easy to break. See related
> discussion at
> https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
>   .
> 
> If you want the elements of your PCollection to contain additional
> information, you need the elements themselves to contain this information:
> the elements are self-contained and have no metadata associated with them
> (beyond the timestamp and windows, universal to the whole Beam model).
> 
> *Order within a file:*
> The only way to have any kind of order within a PCollection is to have the
> elements of the PCollection contain something ordered, e.g. have a
> PCollection<List<Something>>, where each List is for one file [I'm assuming
> Tika, at a low level, works on a per-file basis?]. However, since TikaIO
> can be applied to very large files, this could produce very large elements,
> which is a bad idea. Because of this, I don't think the result of applying
> Tika to a single file can be encoded as a PCollection element.
> 
> Given both of these, I think that it's not possible to create a
> *general-purpose* TikaIO transform that will be better than manual
> invocation of Tika as a DoFn on the result of FileIO.readMatches().
> 
> However, looking at the examples at
> https://tika.apache.org/1.16/examples.html - almost all of the examples
> involve extracting a single String from each document. This use case, with
> the assumption that individual documents are small enough, can certainly be
> simplified and TikaIO could be a facade for doing just this.
> 
> E.g. TikaIO could:
> - take as input a PCollection<ReadableFile>
> - return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
> is a class with properties { String content, Metadata metadata }

and what is the 'String' in KV<String,...> given that TikaIO.ParseResult 
represents the content + (Tika) Metadata of the file such as the author 
name, etc ? Is it the file name ?
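For concreteness, a rough sketch of what such a ParseResult value class might look like. The name and the { content, metadata } shape come from the message above; using a plain Map<String, String> in place of Tika's Metadata class, and all other details, are my assumptions:

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Sketch of the proposed TikaIO.ParseResult: extracted content plus metadata. */
public class ParseResult implements Serializable {
    private final String content;
    // Stand-in for Tika's org.apache.tika.metadata.Metadata.
    private final Map<String, String> metadata;

    public ParseResult(String content, Map<String, String> metadata) {
        this.content = Objects.requireNonNull(content);
        // Defensive copy so the result is immutable once constructed.
        this.metadata = Collections.unmodifiableMap(new HashMap<>(metadata));
    }

    public String getContent() { return content; }

    public Map<String, String> getMetadata() { return metadata; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ParseResult)) {
            return false;
        }
        ParseResult other = (ParseResult) o;
        return content.equals(other.content) && metadata.equals(other.metadata);
    }

    @Override
    public int hashCode() { return Objects.hash(content, metadata); }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        meta.put("Author", "N. Quan");
        ParseResult result = new ParseResult("extracted text", meta);
        System.out.println(result.getMetadata().get("Author")); // N. Quan
    }
}
```

If the String key in KV<String, ParseResult> is indeed the file name, a value class like this would carry everything else per document.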
> - be configured by: a Parser (it implements Serializable so can be
> specified at pipeline construction time) and a ContentHandler whose
> toString() will go into "content". ContentHandler does not implement
> Serializable, so you can not specify it at construction time - however, you
> can let the user specify either its class (if it's a simple handler like a
> BodyContentHandler) or specify a lambda for creating the handler
> (SerializableFunction<Void, ContentHandler>), and potentially you can have
> a simpler facade for Tika.parseAsString() - e.g. call it
> TikaIO.parseAllAsStrings().
> 
> Example usage would look like:
> 
>    PCollection<KV<String, ParseResult>> parseResults =
> p.apply(FileIO.match().filepattern(...))
>      .apply(FileIO.readMatches())
>      .apply(TikaIO.parseAllAsStrings())
> 
> or:
> 
>      .apply(TikaIO.parseAll()
>          .withParser(new AutoDetectParser())
>          .withContentHandler(() -> new BodyContentHandler(new
> ToXMLContentHandler())))
> 
> You could also have shorthands for letting the user avoid using FileIO
> directly in simple cases, for example:
>      p.apply(TikaIO.parseAsStrings().from(filepattern))
> 
> This would of course be implemented as a ParDo or even MapElements, and
> you'll be able to share the code between parseAll and regular parse.
> 
OK. What about the current source on master - should it be marked 
Experimental until I manage to write something new with the above ideas 
in mind? Or is there enough time before 2.2.0 gets released?

Thanks, Sergey
> On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sb...@gmail.com>
> wrote:
> 
>> Hi Tim
>> On 21/09/17 14:33, Allison, Timothy B. wrote:
>>> Thank you, Sergey.
>>>
>>> My knowledge of Apache Beam is limited -- I saw Davor and
>> Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally
>> impressed, but I haven't had a chance to work with it yet.
>>>
>>>   From my perspective, if I understand this thread (and I may not!),
>> getting unordered text from _a given file_ is a non-starter for most
>> applications.  The implementation needs to guarantee order per file, and
>> the user has to be able to link the "extract" back to a unique identifier
>> for the document.  If the current implementation doesn't do those things,
>> we need to change it, IMHO.
>>>
>> Right now the Tika-related reader does not associate a given text
>> fragment with the file name, so a function looking at some text and
>> trying to find where it came from won't be able to do so.
>>
>> So I asked how to do it in Beam, how to attach some context to the given
>> piece of data. I hope it can be done and if not - then perhaps some
>> improvement can be applied.
>>
>> Re the unordered text - yes - this is what we currently have with Beam +
>> TikaIO :-).
>>
>> The use case I referred to earlier in this thread (upload PDFs, save
>> the possibly unordered text to Lucene with the file name 'attached',
>> and let users search for the files containing some words or phrases;
>> this works OK given that I can see the PDF parser, for example,
>> reporting the lines) can be supported with the current TikaIO,
>> provided we find a way to 'attach' a file name to the flow.
>>
>> I see, though, that supporting total ordering can be a big deal in
>> other cases. Eugene, can you please explain how it can be done - is it
>> achievable in principle, without the users having to do some custom
>> coding?
>>
>>> To the question of -- why is this in Beam at all; why don't we let users
>> call it if they want it?...
>>>
>>> No matter how much we do to Tika, it will behave badly sometimes --
>> permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks
>> using Beam -- folks likely with large batches of unruly/noisy documents --
>> are more likely to run into these problems than your average
>> couple-of-thousand-docs-from-our-own-company user. So, if there are things
>> we can do in Beam to prevent developers around the world from having to
>> reinvent the wheel for defenses against these problems, then I'd be
>> enormously grateful if we could put Tika into Beam.  That means:
>>>
>>> 1) a process-level timeout (because you can't actually kill a thread in
>> Java)
>>> 2) a process-level restart on OOM
>>> 3) avoid trying to reprocess a badly behaving document
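A process-level timeout of the kind described in 1) can be sketched with plain Java's ProcessBuilder; the "sleep" command below is just a stand-in for a forked parser process that hangs, and the class and method names are illustrative only:

```java
import java.util.concurrent.TimeUnit;

public class ProcessTimeoutSketch {
    /**
     * Runs a command in a child process; if it does not finish within
     * timeoutSeconds, forcibly kills the whole process (which, unlike a
     * thread, actually can be killed). Returns true if the process
     * completed in time, false if it had to be killed.
     */
    public static boolean runWithTimeout(long timeoutSeconds, String... command)
            throws Exception {
        Process process = new ProcessBuilder(command).start();
        if (process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            return true;
        }
        // The programmatic equivalent of "kill -9" for the hung parser.
        process.destroyForcibly().waitFor();
        return false;
    }

    public static void main(String[] args) throws Exception {
        // "sleep 5" stands in for a parse that hangs permanently.
        System.out.println(runWithTimeout(1, "sleep", "5")); // expected: false (killed)
        System.out.println(runWithTimeout(5, "sleep", "0")); // expected: true (completed)
    }
}
```

Restart-on-OOM (point 2) would similarly be handled by the parent re-launching the child process, which is exactly the separation of extraction from transform/load described below.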
>>>
>>> If Beam automatically handles those problems, then I'd say, y, let users
>> write their own code.  If there is so much as a single configuration knob
>> (and it sounds like Beam is against complex configuration...yay!) to get
>> that working in Beam, then I'd say, please integrate Tika into Beam.  From
>> a safety perspective, it is critical to keep the extraction process
>> entirely separate (jvm, vm, m, rack, data center!) from the
>> transformation+loading steps.  IMHO, very few devs realize this because
>> Tika works well lots of the time...which is why it is critical for us to
>> make it easy for people to get it right all of the time.
>>>
>>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch
>> mode first in one jvm, and then I kick off another process to do
>> transform/loading into Lucene/Solr from the .json files that Tika generates
>> for each input file.  If I were to scale up, I'd want to maintain this
>> complete separation of steps.
>>>
>>> Apologies if I've derailed the conversation or misunderstood this thread.
>>>
>> Major thanks for your input :-)
>>
>> Cheers, Sergey
>>
>>> Cheers,
>>>
>>>                  Tim
>>>
>>> -----Original Message-----
>>> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
>>> Sent: Thursday, September 21, 2017 9:07 AM
>>> To: dev@beam.apache.org
>>> Cc: Allison, Timothy B. <ta...@mitre.org>
>>> Subject: Re: TikaIO concerns
>>>
>>> Hi All
>>>
>>> Please welcome Tim, one of Apache Tika leads and practitioners.
>>>
>>> Tim, thanks for joining in :-). If you have some great Apache Tika
>> stories to share (preferably involving the cases where it did not really
>> matter the ordering in which Tika-produced data were dealt with by the
>>> consumers) then please do so :-).
>>>
>>> At the moment, even though Tika ContentHandler will emit the ordered
>> data, the Beam runtime will have no guarantees that the downstream pipeline
>> components will see the data coming in the right order.
>>>
>>> (FYI, I understand from the earlier comments that the total ordering is
>> also achievable but would require the extra API support)
>>>
>>> Other comments would be welcome too
>>>
>>> Thanks, Sergey
>>>
>>> On 21/09/17 10:55, Sergey Beryozkin wrote:
>>>> I noticed that the PDF and ODT parsers actually split by lines, not
>>>> individual words, and I'm nearly 100% sure I saw Tika reporting
>>>> individual lines when it was parsing the text files. The 'min text
>>>> length' feature can help with reporting several lines at a time, etc...
>>>>
>>>> I'm working with this PDF all the time:
>>>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
>>>>
>>>> try it too if you get a chance.
>>>>
>>>> (and I can imagine that not all PDFs etc. represent a 'story' - some
>>>> can be, for example, log-like content too)
>>>>
>>>> That said, I don't know how a parser for the format N will behave, it
>>>> depends on the individual parsers.
>>>>
>>>> IMHO it's an equal candidate alongside Text-based bounded IOs...
>>>>
>>>> I'd like to know though how to make a file name available to the
>>>> pipeline which is working with the current text fragment ?
>>>>
>>>> Going to try and do some measurements and compare the sync vs async
>>>> parsing modes...
>>>>
>>>> Asked the Tika team to support with some more examples...
>>>>
>>>> Cheers, Sergey
>>>> On 20/09/17 22:17, Sergey Beryozkin wrote:
>>>>> Hi,
>>>>>
>>>>> thanks for the explanations,
>>>>>
>>>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>>>>> Hi!
>>>>>>
>>>>>> TextIO returns an unordered soup of lines contained in all files you
>>>>>> ask it to read. People usually use TextIO for reading files where 1
>>>>>> line corresponds to 1 independent data element, e.g. a log entry, or
>>>>>> a row of a CSV file - so discarding order is ok.
>>>>> Just a side note, I'd probably want that to be ordered, though I
>>>>> guess it depends...
>>>>>> However, there is a number of cases where TextIO is a poor fit:
>>>>>> - Cases where discarding order is not ok - e.g. if you're doing
>>>>>> natural language processing and the text files contain actual prose,
>>>>>> where you need to process a file as a whole. TextIO can't do that.
>>>>>> - Cases where you need to remember which file each element came
>>>>>> from, e.g.
>>>>>> if you're creating a search index for the files: TextIO can't do
>>>>>> this either.
>>>>>>
>>>>>> Both of these issues have been raised in the past against TextIO;
>>>>>> however it seems that the overwhelming majority of users of TextIO
>>>>>> use it for logs or CSV files or alike, so solving these issues has
>>>>>> not been a priority.
>>>>>> Currently they are solved in a general form via FileIO.read() which
>>>>>> gives you access to reading a full file yourself - people who want
>>>>>> more flexibility will be able to use standard Java text-parsing
>>>>>> utilities on a ReadableFile, without involving TextIO.
>>>>>>
>>>>>> Same applies for XmlIO: it is specifically designed for the narrow
>>>>>> use case where the files contain independent data entries, so
>>>>>> returning an unordered soup of them, with no association to the
>>>>>> original file, is the user's intention. XmlIO will not work for
>>>>>> processing more complex XML files that are not simply a sequence of
>>>>>> entries with the same tag, and it also does not remember the
>>>>>> original filename.
>>>>>>
>>>>>
>>>>> OK...
>>>>>
>>>>>> However, if my understanding of Tika use cases is correct, it is
>>>>>> mainly used for extracting content from complex file formats - for
>>>>>> example, extracting text and images from PDF files or Word
>>>>>> documents. I believe this is the main difference between it and
>>>>>> TextIO - people usually use Tika for complex use cases where the
>>>>>> "unordered soup of stuff" abstraction is not useful.
>>>>>>
>>>>>> My suspicion about this is confirmed by the fact that the crux of
>>>>>> the Tika API is ContentHandler
>>>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
>>>>>> html?is-external=true
>>>>>>
>>>>>> whose
>>>>>> documentation says "The order of events in this interface is very
>>>>>> important, and mirrors the order of information in the document
>> itself."
>>>>> All that says is that a (Tika) ContentHandler will be a true SAX
>>>>> ContentHandler...
>>>>>>
>>>>>> Let me give a few examples of what I think is possible with the raw
>>>>>> Tika API, but I think is not currently possible with TikaIO - please
>>>>>> correct me where I'm wrong, because I'm not particularly familiar
>>>>>> with Tika and am judging just based on what I read about it.
>>>>>> - User has 100,000 Word documents and wants to convert each of them
>>>>>> to text files for future natural language processing.
>>>>>> - User has 100,000 PDF files with financial statements, each
>>>>>> containing a bunch of unrelated text and - the main content - a list
>>>>>> of transactions in PDF tables. User wants to extract each
>>>>>> transaction as a PCollection element, discarding the unrelated text.
>>>>>> - User has 100,000 PDF files with scientific papers, and wants to
>>>>>> extract text from them, somehow parse author and affiliation from
>>>>>> the text, and compute statistics of topics and terminology usage by
>>>>>> author name and affiliation.
>>>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>>>>>> observing a location over time: they want to extract metadata from
>>>>>> each image using Tika, analyze the images themselves using some
>>>>>> other library, and detect anomalies in the overall appearance of the
>>>>>> location over time as seen from multiple cameras.
>>>>>> I believe all of these cases can not be solved with TikaIO because
>>>>>> the resulting PCollection<String> contains no information about
>>>>>> which String comes from which document and about the order in which
>>>>>> they appear in the document.
>>>>> These are good use cases, thanks... I thought you were talking
>>>>> about the unordered soup of data produced by TikaIO (and its
>>>>> friends TextIO and the like :-)).
>>>>> Putting the ordered vs unordered question aside for a sec, why
>>>>> exactly can a Tika Reader not make the name of the file it's
>>>>> currently reading from available to the pipeline, as a piece of
>>>>> Beam pipeline metadata?
>>>>> Surely it is possible with Beam? If not, then I would be
>>>>> surprised...
>>>>>
>>>>>>
>>>>>> I am, honestly, struggling to think of a case where I would want to
>>>>>> use Tika, but where I *would* be ok with getting an unordered soup
>>>>>> of strings.
>>>>>> So some examples would be very helpful.
>>>>>>
>>>>> Yes. I'll ask Tika developers to help with some examples, but I'll
>>>>> give one example where it did not matter to us in what order
>>>>> Tika-produced data were available to the downstream layer.
>>>>>
>>>>> It's a demo an Apache CXF colleague of mine showed at one of the
>>>>> ApacheCon NAs, and we had a happy audience:
>>>>>
>>>>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea
>>>>> se/samples/jax_rs/search
>>>>>
>>>>>
>>>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
>>>>> into Lucene. We associate a file name with the indexed content and
>>>>> then let users find a list of PDF files which contain a given word or
>>>>> few words, details are here
>>>>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea
>>>>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal
>>>>> og.java#L131
>>>>>
>>>>>
>>>>> I'd say even more involved search engines would not mind supporting a
>>>>> case like that :-)
>>>>>
>>>>> Now, there we process one file at a time, and I understand now that
>>>>> with TikaIO and N files it's all over the place really as far as
>>>>> the ordering, and which file a chunk is coming from, are concerned.
>>>>> That's why the TikaReader must be able to associate the file name
>>>>> with a given piece of text it makes available to the pipeline.
>>>>>
>>>>> I'd be happy to support the ParDo way of linking Tika with Beam.
>>>>> If it makes things simpler then it would be good; I've just no idea
>>>>> at the moment how to start the pipeline without using a
>>>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
>>>>> earlier - how can one avoid it with ParDo when implementing a 'min
>>>>> len chunk' feature, where the ParDo would have to concatenate
>>>>> several SAX data pieces first before making a single composite
>>>>> piece available to the pipeline?
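One way the 'min len chunk' concatenation could avoid cross-thread synchronization entirely is to do the buffering inside the ContentHandler itself, before anything is handed to the pipeline. A JDK-only sketch (the handler and its chunk list are assumptions for illustration, not TikaIO code):

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

/** Buffers SAX character events until at least minChunkLength chars are collected. */
public class MinLengthChunkHandler extends DefaultHandler {
    private final int minChunkLength;
    private final List<String> chunks = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();

    public MinLengthChunkHandler(int minChunkLength) {
        this.minChunkLength = minChunkLength;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        // Emit a composite chunk only once the minimum length is reached.
        if (buffer.length() >= minChunkLength) {
            flush();
        }
    }

    @Override
    public void endDocument() {
        flush(); // emit whatever remains, however short
    }

    private void flush() {
        if (buffer.length() > 0) {
            chunks.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    public List<String> getChunks() { return chunks; }

    public static void main(String[] args) {
        MinLengthChunkHandler handler = new MinLengthChunkHandler(5);
        handler.characters("ab".toCharArray(), 0, 2);
        handler.characters("cde".toCharArray(), 0, 3);
        handler.characters("f".toCharArray(), 0, 1);
        handler.endDocument();
        System.out.println(handler.getChunks()); // [abcde, f]
    }
}
```

Since the SAX callbacks are invoked sequentially by the parser on a single thread, no locking is needed - the ParDo would just collect getChunks() after parse() returns.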
>>>>>
>>>>>
>>>>>> Another way to state it: currently, if I wanted to solve all of the
>>>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
>>>>>> API myself on the resulting ReadableFile. How can we make TikaIO
>>>>>> provide a usability improvement over such usage?
>>>>>>
>>>>>
>>>>>
>>>>> If you are actually asking whether it really makes sense for Beam
>>>>> to ship Tika-related code, given that users can just do it
>>>>> themselves - I'm not sure.
>>>>>
>>>>> IMHO it always works better if users have to provide just a few
>>>>> config options to an integral part of the framework and see things
>>>>> happening. It will bring more users.
>>>>>
>>>>> Whether the current Tika code (refactored or not) stays with Beam or
>>>>> not - I'll let you and the team decide; believe it or not, I was
>>>>> seriously contemplating at the last moment making it all part of the
>>>>> Tika project itself and having a bit more flexibility over there with
>>>>> tweaking things, but now that it is in the Beam snapshot - I don't
>>>>> know - it's not my decision...
>>>>>
>>>>>> I am confused by your other comment - "Does the ordering matter ?
>>>>>> Perhaps
>>>>>> for some cases it does, and for some it does not. May be it makes
>>>>>> sense to support running TikaIO as both the bounded reader/source
>>>>>> and ParDo, with getting the common code reused." - because using
>>>>>> BoundedReader or ParDo is not related to the ordering issue, only to
>>>>>> the issue of asynchronous reading and complexity of implementation.
>>>>>> The resulting PCollection will be unordered either way - this needs
>>>>>> to be solved separately by providing a different API.
>>>>> Right, I see now - so ParDo is not about making Tika-reported data
>>>>> available to the downstream pipeline components in order, only about
>>>>> a simpler implementation.
>>>>> Association with the file should be possible, I hope, and I
>>>>> understand it would also be possible to optionally make the data
>>>>> come out in an ordered way...
>>>>>
>>>>> Assuming TikaIO stays, and before trying to re-implement it as a
>>>>> ParDo, let me double check: should we still give some thought to the
>>>>> possible performance benefit of the current approach? As I said, I
>>>>> can easily get rid of all that polling code and use a simple
>>>>> blocking queue.
>>>>>
>>>>> Cheers, Sergey
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
>>>>>> <sb...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> Glad TikaIO getting some serious attention :-), I believe one thing
>>>>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>>>>
>>>>>>> Before trying to reply online, I'd like to state that my main
>>>>>>> assumption is that TikaIO (as far as the read side is concerned) is
>>>>>>> no different to Text, XML or similar bounded reader components.
>>>>>>>
>>>>>>> I have to admit I don't understand your questions about TikaIO
>>>>>>> usecases.
>>>>>>>
>>>>>>> What are the Text input or XML input use cases? These use cases
>>>>>>> are Tika input cases as well; the only difference is that Tika can
>>>>>>> not split an individual file into a sequence of sources, etc.
>>>>>>>
>>>>>>> TextIO can read from plain text files (possibly zipped), XmlIO is
>>>>>>> optimized around reading from XML files, and I thought I made it
>>>>>>> clear (and it is a known fact anyway) that Tika is about reading
>>>>>>> basically any file format.
>>>>>>>
>>>>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>>>>
>>>>>>> Sergey
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Replies inline.
>>>>>>>>
>>>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
>>>>>>>> <sb...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi All
>>>>>>>>>
>>>>>>>>> This is my first post to the dev list, I work for Talend, I'm a
>>>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
>>>>>>>>> great to try and link both projects together, which led me to
>>>>>>>>> opening [1] where I typed some early thoughts, followed by PR
>>>>>>>>> [2].
>>>>>>>>>
>>>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
>>>>>>>>> newer review comments from Eugene pending, so I'd like to
>>>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
>>>>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>>>>
>>>>>>>>> Apache Tika Parsers report the text content in chunks, via
>>>>>>>>> SaxParser events. It's not possible with Tika to take a file and
>>>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
>>>>>>>>> by line, the only way is to handle the SAXParser callbacks which
>>>>>>>>> report the data chunks.
>>>>>>>>> Some
>>>>>>>>> parsers may report the complete lines, some individual words,
>>>>>>>>> with some being able to report the data only after they
>>>>>>>>> completely parse the document.
>>>>>>>>> All depends on the data format.
>>>>>>>>>
>>>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
>>>>>>>>> to parse the files, Beam threads will only collect the data from
>>>>>>>>> the internal queue where the internal TikaReader's thread will
>>>>>>>>> put the data into (note the data chunks are ordered even though
>>>>>>>>> the tests might suggest otherwise).
>>>>>>>>>
>>>>>>>> I agree that your implementation of reader returns records in
>>>>>>>> order
>>>>>>>> - but
>>>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
>>>>>>>> the order in which records are produced by a BoundedReader - the
>>>>>>>> order produced by your reader is ignored, and when applying any
>>>>>>>> transforms to the
>>>>>>> PCollection
>>>>>>>> produced by TikaIO, it is impossible to recover the order in which
>>>>>>>> your reader returned the records.
>>>>>>>>
>>>>>>>> With that in mind, is PCollection<String>, containing individual
>>>>>>>> Tika-detected items, still the right API for representing the
>>>>>>>> result of parsing a large number of documents with Tika?
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The reason I did it was because I thought
>>>>>>>>>
>>>>>>>>> 1) it would make the individual data chunks available faster to
>>>>>>>>> the pipeline - the parser will continue working via the
>>>>>>>>> binary/video etc file while the data will already start flowing -
>>>>>>>>> I agree there should be some tests data available confirming it -
>>>>>>>>> but I'm positive at the moment this approach might yield some
>>>>>>>>> performance gains with the large sets. If the file is large, if
>>>>>>>>> it has the embedded attachments/videos to deal with, then it may
>>>>>>>>> be more effective not to get the Beam thread deal with it...
>>>>>>>>>
>>>>>>>>> As I said on the PR, this description contains unfounded and
>>>>>>>>> potentially
>>>>>>>> incorrect assumptions about how Beam runners execute (or may
>>>>>>>> execute in
>>>>>>> the
>>>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>>>>> correctly,
>>>>>>>> you might be assuming that:
>>>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>>>>> complete
>>>>>>>> before processing its outputs with downstream transforms
>>>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>>>>> *concurrently*
>>>>>>>> with downstream processing of its results
>>>>>>>> - Passing an element from one thread to another using a
>>>>>>>> BlockingQueue is free in terms of performance. All of these are
>>>>>>>> false at least in some runners, and I'm almost certain that in
>>>>>>>> reality, performance of this approach is worse than a ParDo in
>>>>>>> most
>>>>>>>> production runners.
>>>>>>>>
>>>>>>>> There are other disadvantages to this approach:
>>>>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>>>>> invisible
>>>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>>>>>> profiling capabilities, or the ability to get the current stack
>>>>>>>> trace for stuck elements, this approach would make the real
>>>>>>>> processing invisible to all of these capabilities, and a user
>>>>>>>> would only see that the bulk of the time is spent waiting for the
>>>>>>>> next element, but not *why* the next
>>>>>>> element
>>>>>>>> is taking long to compute.
>>>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>>>>>> invisible to Beam, will make it harder for runners to do
>>>>>>>> autoscaling, binpacking
>>>>>>> and
>>>>>>>> other resource management magic (how much of this runners actually
>>>>>>>> do is
>>>>>>> a
>>>>>>>> separate issue), because the runner will have no way of knowing
>>>>>>>> how much CPU/IO this particular transform is actually using - all
>>>>>>>> the processing happens in a thread about which the runner is
>>>>>>>> unaware.
>>>>>>>> - As far as I can tell, the code also hides exceptions that happen
>>>>>>>> in the Tika thread
>>>>>>>> - Adding the thread management makes the code much more complex,
>>>>>>>> easier
>>>>>>> to
>>>>>>>> introduce bugs, and harder for others to contribute
>>>>>>>>
>>>>>>>>
>>>>>>>>> 2) As I commented at the end of [2], having an option to
>>>>>>>>> concatenate the data chunks first before making them available to
>>>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
>>>>>>>>> introduce some synchronization issues (though not exactly sure
>>>>>>>>> yet)
>>>>>>>>>
>>>>>>>> What are these issues?
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> One of valid concerns there is that the reader is polling the
>>>>>>>>> internal queue so, in theory at least, and perhaps in some rare
>>>>>>>>> cases too, we may have a case where the max polling time has been
>>>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
>>>>>>>>> the file data. I think that it can be solved by either 2a)
>>>>>>>>> configuring the max polling time to a very large number which
>>>>>>>>> will never be reached for a practical case, or
>>>>>>>>> 2b) simply use a blocking queue without the time limits - in the
>>>>>>>>> worst case, if TikaParser spins and fails to report the end of
>>>>>>>>> the document, then Beam can heal itself if the pipeline blocks.
>>>>>>>>> I propose to follow 2b).
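
To make 2b) concrete, here is a minimal, self-contained JDK sketch (not
the actual TikaIO code) of a producer thread handing parsed chunks to a
consumer through a blocking queue with an explicit end-of-stream
sentinel, so that no polling timeout can silently truncate a document:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustration of option 2b: the producer (standing in for the Tika
// parsing thread) puts chunks on a bounded blocking queue and finishes
// with an end-of-stream sentinel; the consumer (standing in for the
// reader) blocks on take() and never has to guess a polling timeout.
public class SentinelQueueDemo {
  private static final String EOF = "\u0000EOF"; // sentinel marker

  public static void main(String[] args) throws InterruptedException {
    BlockingQueue<String> queue = new ArrayBlockingQueue<>(2);

    Thread producer = new Thread(() -> {
      try {
        for (String chunk : new String[] {"chunk-1", "chunk-2", "chunk-3"}) {
          queue.put(chunk);  // blocks if the consumer is slow
        }
        queue.put(EOF);      // always signal completion explicitly
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    producer.start();

    StringBuilder all = new StringBuilder();
    while (true) {
      String chunk = queue.take(); // blocks until data or sentinel arrives
      if (chunk.equals(EOF)) {
        break;                     // end of document, nothing dropped
      }
      all.append(chunk).append(' ');
    }
    producer.join();
    System.out.println(all.toString().trim());
  }
}
```

With a sentinel there is no "max polling time" knob at all: the consumer
either gets data or the end-of-stream marker, never a spurious timeout.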
>>>>>>>>>
>>>>>>>> I agree that there should be no way to unintentionally configure
>>>>>>>> the transform in a way that will produce silent data loss. Another
>>>>>>>> reason for not having these tuning knobs is that it goes against
>>>>>>>> Beam's "no knobs"
>>>>>>>> philosophy, and that in most cases users have no way of figuring
>>>>>>>> out a
>>>>>>> good
>>>>>>>> value for tuning knobs except for manual experimentation, which is
>>>>>>>> extremely brittle and typically gets immediately obsoleted by
>>>>>>>> running on
>>>>>>> a
>>>>>>>> new dataset or updating a version of some of the involved
>>>>>>>> dependencies
>>>>>>> etc.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please let me know what you think.
>>>>>>>>> My plan so far is:
>>>>>>>>> 1) start addressing most of Eugene's comments which would require
>>>>>>>>> some minor TikaIO updates
>>>>>>>>> 2) work on removing the TikaSource internal code dealing with
>>>>>>>>> File patterns which I copied from TextIO at the next stage
>>>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
>>>>>>>>> users some time to try it with some real complex files and also
>>>>>>>>> decide if TikaIO can continue to be implemented as a
>>>>>>>>> BoundedSource/Reader or not
>>>>>>>>>
>>>>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>>>>
>>>>>>>> Yes, but I think we should start by discussing the anticipated use
>>>>>>>> cases
>>>>>>> of
>>>>>>>> TikaIO and designing an API for it based on those use cases; and
>>>>>>>> then see what's the best implementation for that particular API
>>>>>>>> and set of anticipated use cases.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks, Sergey
>>>>>>>>>
>>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>
> 


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: TikaIO concerns

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Thanks all for the discussion. It seems we have consensus that both
within-document order and association with the original filename are
necessary, but currently absent from TikaIO.

*Association with original file:*
Sergey - Beam does not *automatically* provide a way to associate an
element with the file it originated from: automatically tracking data
provenance is a known very hard research problem on which many papers have
been written, and obvious solutions are very easy to break. See related
discussion at
https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
 .

If you want the elements of your PCollection to contain additional
information, you need the elements themselves to contain this information:
the elements are self-contained and have no metadata associated with them
(beyond the timestamp and windows, universal to the whole Beam model).
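
In other words, any per-file context has to travel inside the element
type itself. A plain-Java sketch of such an element type (the names here
are illustrative, not an existing Beam or Tika API):

```java
import java.util.Collections;
import java.util.Map;

// Illustrative element type: since Beam elements carry no metadata
// beyond timestamp and windows, provenance such as the source file name
// must be an explicit field of the element itself.
public final class ParsedChunk {
  public final String fileName;            // provenance, carried explicitly
  public final String content;             // the Tika-extracted text
  public final Map<String, String> metadata;

  public ParsedChunk(String fileName, String content,
                     Map<String, String> metadata) {
    this.fileName = fileName;
    this.content = content;
    this.metadata = Collections.unmodifiableMap(metadata);
  }

  public static void main(String[] args) {
    ParsedChunk chunk = new ParsedChunk(
        "reports/q1.pdf", "Total revenue ...",
        Collections.singletonMap("author", "Alice"));
    System.out.println(chunk.fileName + " -> " + chunk.metadata.get("author"));
  }
}
```

A PCollection<ParsedChunk> (or a KV keyed by file name) then keeps the
association through every downstream transform.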

*Order within a file:*
The only way to have any kind of order within a PCollection is to have the
elements of the PCollection contain something ordered, e.g. have a
PCollection<List<Something>>, where each List is for one file [I'm assuming
Tika, at a low level, works on a per-file basis?]. However, since TikaIO
can be applied to very large files, this could produce very large elements,
which is a bad idea. Because of this, I don't think the result of applying
Tika to a single file can be encoded as a PCollection element.

Given both of these, I think that it's not possible to create a
*general-purpose* TikaIO transform that will be better than manual
invocation of Tika as a DoFn on the result of FileIO.readMatches().
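
For reference, that manual approach might look roughly like the
following sketch (assuming Beam's FileIO and Tika's AutoDetectParser are
on the classpath; untested, and the exact DoFn shape is an assumption):

```java
import java.io.InputStream;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Sketch of parsing each matched file with Tika inside a plain DoFn,
// keying the extracted text by the originating file so the association
// survives into the PCollection.
class TikaParseFn extends DoFn<FileIO.ReadableFile, KV<String, String>> {
  @ProcessElement
  public void process(ProcessContext c) throws Exception {
    FileIO.ReadableFile file = c.element();
    String fileName = file.getMetadata().resourceId().toString();
    try (InputStream is =
        java.nio.channels.Channels.newInputStream(file.open())) {
      BodyContentHandler handler = new BodyContentHandler(-1); // no limit
      Metadata metadata = new Metadata();
      new AutoDetectParser().parse(is, handler, metadata);
      c.output(KV.of(fileName, handler.toString()));
    }
  }
}
```

It would be applied as something like
readMatches.apply(ParDo.of(new TikaParseFn())); the question in the
thread is what TikaIO should add on top of this.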

However, looking at the examples at
https://tika.apache.org/1.16/examples.html - almost all of the examples
involve extracting a single String from each document. This use case, with
the assumption that individual documents are small enough, can certainly be
simplified and TikaIO could be a facade for doing just this.

E.g. TikaIO could:
- take as input a PCollection<ReadableFile>
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
is a class with properties { String content, Metadata metadata }
- be configured by: a Parser (it implements Serializable so can be
specified at pipeline construction time) and a ContentHandler whose
toString() will go into "content". ContentHandler does not implement
Serializable, so you can not specify it at construction time - however, you
can let the user specify either its class (if it's a simple handler like a
BodyContentHandler) or specify a lambda for creating the handler
(SerializableFunction<Void, ContentHandler>), and potentially you can have
a simpler facade for Tika.parseAsString() - e.g. call it
TikaIO.parseAllAsStrings().

Example usage would look like:

  PCollection<KV<String, ParseResult>> parseResults =
p.apply(FileIO.match().filepattern(...))
    .apply(FileIO.readMatches())
    .apply(TikaIO.parseAllAsStrings())

or:

    .apply(TikaIO.parseAll()
        .withParser(new AutoDetectParser())
        .withContentHandler(() -> new BodyContentHandler(new
ToXMLContentHandler())))

You could also have shorthands for letting the user avoid using FileIO
directly in simple cases, for example:
    p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and
you'll be able to share the code between parseAll and regular parse.

On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sb...@gmail.com>
wrote:

> Hi Tim
> On 21/09/17 14:33, Allison, Timothy B. wrote:
> > Thank you, Sergey.
> >
> > My knowledge of Apache Beam is limited -- I saw Davor and
> Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally
> impressed, but I haven't had a chance to work with it yet.
> >
> >  From my perspective, if I understand this thread (and I may not!),
> getting unordered text from _a given file_ is a non-starter for most
> applications.  The implementation needs to guarantee order per file, and
> the user has to be able to link the "extract" back to a unique identifier
> for the document.  If the current implementation doesn't do those things,
> we need to change it, IMHO.
> >
> Right now Tika-related reader does not associate a given text fragment
> with the file name, so a function looking at some text and trying to
> find where it came from won't be able to do so.
>
> So I asked how to do it in Beam, how to attach some context to the given
> piece of data. I hope it can be done and if not - then perhaps some
> improvement can be applied.
>
> Re the unordered text - yes - this is what we currently have with Beam +
> TikaIO :-).
>
> The use-case I referred to earlier in this thread (upload PDFs - save
> the possibly unordered text to Lucene with the file name 'attached', let
> users search for the files containing some words - phrases, this works
> OK given that I can see PDF parser for ex reporting the lines) can be
> supported OK with the current TikaIO (provided we find a way to 'attach'
> a file name to the flow).
>
> I see though supporting the total ordering can be a big deal in other
> cases. Eugene, can you please explain how it can be done, is it
> achievable in principle, without the users having to do some custom
> coding ?
>
> > To the question of -- why is this in Beam at all; why don't we let users
> call it if they want it?...
> >
> > No matter how much we do to Tika, it will behave badly sometimes --
> permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks
> using Beam -- folks likely with large batches of unruly/noisy documents --
> are more likely to run into these problems than your average
> couple-of-thousand-docs-from-our-own-company user. So, if there are things
> we can do in Beam to prevent developers around the world from having to
> reinvent the wheel for defenses against these problems, then I'd be
> enormously grateful if we could put Tika into Beam.  That means:
> >
> > 1) a process-level timeout (because you can't actually kill a thread in
> Java)
> > 2) a process-level restart on OOM
> > 3) avoid trying to reprocess a badly behaving document
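
Point 1) above, the process-level timeout, is essentially the following
JDK pattern (a sketch; "sleep 30" stands in for launching Tika in a
child JVM, e.g. via tika-app, which is an assumption about deployment):

```java
import java.util.concurrent.TimeUnit;

// Sketch of a process-level timeout: run the extraction in a child
// process so a permanent hang can be handled with a hard kill, which is
// impossible for a Java thread inside the same JVM.
public class ProcessTimeoutDemo {
  public static void main(String[] args) throws Exception {
    // "sleep 30" simulates a hung external parser invocation.
    Process p = new ProcessBuilder("sleep", "30").start();
    boolean finished = p.waitFor(2, TimeUnit.SECONDS); // hard deadline
    if (!finished) {
      p.destroyForcibly(); // the moral equivalent of kill -9
      p.waitFor();         // reap the killed child
      System.out.println("timed out, killed");
    } else {
      System.out.println("exit=" + p.exitValue());
    }
  }
}
```

The same pattern also covers point 2): an OOM in the child process kills
only the child, and the parent can restart it and skip the bad document.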
> >
> > If Beam automatically handles those problems, then I'd say, y, let users
> write their own code.  If there is so much as a single configuration knob
> (and it sounds like Beam is against complex configuration...yay!) to get
> that working in Beam, then I'd say, please integrate Tika into Beam.  From
> a safety perspective, it is critical to keep the extraction process
> entirely separate (jvm, vm, m, rack, data center!) from the
> transformation+loading steps.  IMHO, very few devs realize this because
> Tika works well lots of the time...which is why it is critical for us to
> make it easy for people to get it right all of the time.
> >
> > Even in my desktop (gah, y, desktop!) search app, I run Tika in batch
> mode first in one jvm, and then I kick off another process to do
> transform/loading into Lucene/Solr from the .json files that Tika generates
> for each input file.  If I were to scale up, I'd want to maintain this
> complete separation of steps.
> >
> > Apologies if I've derailed the conversation or misunderstood this thread.
> >
> Major thanks for your input :-)
>
> Cheers, Sergey
>
> > Cheers,
> >
> >                 Tim
> >
> > -----Original Message-----
> > From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
> > Sent: Thursday, September 21, 2017 9:07 AM
> > To: dev@beam.apache.org
> > Cc: Allison, Timothy B. <ta...@mitre.org>
> > Subject: Re: TikaIO concerns
> >
> > Hi All
> >
> > Please welcome Tim, one of Apache Tika leads and practitioners.
> >
> > Tim, thanks for joining in :-). If you have some great Apache Tika
> stories to share (preferably involving the cases where it did not really
> matter the ordering in which Tika-produced data were dealt with by the
> > consumers) then please do so :-).
> >
> > At the moment, even though Tika ContentHandler will emit the ordered
> data, the Beam runtime will have no guarantees that the downstream pipeline
> components will see the data coming in the right order.
> >
> > (FYI, I understand from the earlier comments that the total ordering is
> also achievable but would require the extra API support)
> >
> > Other comments would be welcome too
> >
> > Thanks, Sergey
> >
> > On 21/09/17 10:55, Sergey Beryozkin wrote:
> >> I noticed that the PDF and ODT parsers actually split by lines, not
> >> individual words, and I'm nearly 100% sure I saw Tika reporting individual
> >> lines when it was parsing the text files. The 'min text length'
> >> feature can help with reporting several lines at a time, etc...
> >>
> >> I'm working with this PDF all the time:
> >> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
> >>
> >> try it too if you get a chance.
> >>
> >> (and I can imagine not all PDFs/etc representing the 'story' but can
> >> be for ex a log-like content too)
> >>
> >> That said, I don't know how a parser for the format N will behave, it
> >> depends on the individual parsers.
> >>
> >> IMHO it's an equal candidate alongside Text-based bounded IOs...
> >>
> >> I'd like to know though how to make a file name available to the
> >> pipeline which is working with the current text fragment ?
> >>
> >> Going to try and do some measurements and compare the sync vs async
> >> parsing modes...
> >>
> >> Asked the Tika team to support with some more examples...
> >>
> >> Cheers, Sergey
> >> On 20/09/17 22:17, Sergey Beryozkin wrote:
> >>> Hi,
> >>>
> >>> thanks for the explanations,
> >>>
> >>> On 20/09/17 16:41, Eugene Kirpichov wrote:
> >>>> Hi!
> >>>>
> >>>> TextIO returns an unordered soup of lines contained in all files you
> >>>> ask it to read. People usually use TextIO for reading files where 1
> >>>> line corresponds to 1 independent data element, e.g. a log entry, or
> >>>> a row of a CSV file - so discarding order is ok.
> >>> Just a side note, I'd probably want that be ordered, though I guess
> >>> it depends...
> >>>> However, there is a number of cases where TextIO is a poor fit:
> >>>> - Cases where discarding order is not ok - e.g. if you're doing
> >>>> natural language processing and the text files contain actual prose,
> >>>> where you need to process a file as a whole. TextIO can't do that.
> >>>> - Cases where you need to remember which file each element came
> >>>> from, e.g.
> >>>> if you're creating a search index for the files: TextIO can't do
> >>>> this either.
> >>>>
> >>>> Both of these issues have been raised in the past against TextIO;
> >>>> however it seems that the overwhelming majority of users of TextIO
> >>>> use it for logs or CSV files or alike, so solving these issues has
> >>>> not been a priority.
> >>>> Currently they are solved in a general form via FileIO.read() which
> >>>> gives you access to reading a full file yourself - people who want
> >>>> more flexibility will be able to use standard Java text-parsing
> >>>> utilities on a ReadableFile, without involving TextIO.
> >>>>
> >>>> Same applies for XmlIO: it is specifically designed for the narrow
> >>>> use case where the files contain independent data entries, so
> >>>> returning an unordered soup of them, with no association to the
> >>>> original file, is the user's intention. XmlIO will not work for
> >>>> processing more complex XML files that are not simply a sequence of
> >>>> entries with the same tag, and it also does not remember the
> >>>> original filename.
> >>>>
> >>>
> >>> OK...
> >>>
> >>>> However, if my understanding of Tika use cases is correct, it is
> >>>> mainly used for extracting content from complex file formats - for
> >>>> example, extracting text and images from PDF files or Word
> >>>> documents. I believe this is the main difference between it and
> >>>> TextIO - people usually use Tika for complex use cases where the
> >>>> "unordered soup of stuff" abstraction is not useful.
> >>>>
> >>>> My suspicion about this is confirmed by the fact that the crux of
> >>>> the Tika API is ContentHandler
> >>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
> >>>> html?is-external=true
> >>>>
> >>>> whose
> >>>> documentation says "The order of events in this interface is very
> >>>> important, and mirrors the order of information in the document
> itself."
> >>> All that says is that a (Tika) ContentHandler will be a true SAX
> >>> ContentHandler...
> >>>>
> >>>> Let me give a few examples of what I think is possible with the raw
> >>>> Tika API, but I think is not currently possible with TikaIO - please
> >>>> correct me where I'm wrong, because I'm not particularly familiar
> >>>> with Tika and am judging just based on what I read about it.
> >>>> - User has 100,000 Word documents and wants to convert each of them
> >>>> to text files for future natural language processing.
> >>>> - User has 100,000 PDF files with financial statements, each
> >>>> containing a bunch of unrelated text and - the main content - a list
> >>>> of transactions in PDF tables. User wants to extract each
> >>>> transaction as a PCollection element, discarding the unrelated text.
> >>>> - User has 100,000 PDF files with scientific papers, and wants to
> >>>> extract text from them, somehow parse author and affiliation from
> >>>> the text, and compute statistics of topics and terminology usage by
> >>>> author name and affiliation.
> >>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
> >>>> observing a location over time: they want to extract metadata from
> >>>> each image using Tika, analyze the images themselves using some
> >>>> other library, and detect anomalies in the overall appearance of the
> >>>> location over time as seen from multiple cameras.
> >>>> I believe all of these cases can not be solved with TikaIO because
> >>>> the resulting PCollection<String> contains no information about
> >>>> which String comes from which document and about the order in which
> >>>> they appear in the document.
> >>> These are good use cases, thanks... I thought what you were talking
> >>> about the unordered soup of data produced by TikaIO (and its friends
> >>> TextIO and alike :-)).
> >>> Putting the ordered vs unordered question aside for a sec, why
> >>> exactly a Tika Reader can not make the name of the file it's
> >>> currently reading from available to the pipeline, as some Beam
> pipeline metadata piece ?
> >>> Surely it can be possible with Beam ? If not then I would be
> surprised...
> >>>
> >>>>
> >>>> I am, honestly, struggling to think of a case where I would want to
> >>>> use Tika, but where I *would* be ok with getting an unordered soup
> >>>> of strings.
> >>>> So some examples would be very helpful.
> >>>>
> >>> Yes. I'll ask Tika developers to help with some examples, but I'll
> >>> give one example where it did not matter to us in what order
> >>> Tika-produced data were available to the downstream layer.
> >>>
> >>> It's a demo the Apache CXF colleague of mine showed at one of Apache
> >>> Con NAs, and we had a happy audience:
> >>>
> >>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea
> >>> se/samples/jax_rs/search
> >>>
> >>>
> >>> PDF or ODT files uploaded, Tika parses them, and all of that is put
> >>> into Lucene. We associate a file name with the indexed content and
> >>> then let users find a list of PDF files which contain a given word or
> >>> few words, details are here
> >>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea
> >>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal
> >>> og.java#L131
> >>>
> >>>
> >>> I'd say even more involved search engines would not mind supporting a
> >>> case like that :-)
> >>>
> >>> Now there we process one file at a time, and I understand now that
> >>> with TikaIO and N files it's all over the place really as far as the
> >>> ordering is concerned, which file a chunk is coming from, etc. That's why
> >>> TikaReader must be able to associate the file name with a given piece
> >>> of text it's making available to the pipeline.
> >>>
> >>> I'd be happy to support the ParDo way of linking Tika with Beam.
> >>> If it makes things simpler then it would be good, I've just no idea
> >>> at the moment how to start the pipeline without using a
> >>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
> >>> earlier - how can one avoid it with ParDo when implementing a 'min
> >>> len chunk' feature, where the ParDo would have to concatenate several
> >>> SAX data pieces first before making a single composite piece to the
> pipeline ?
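
One way to see why the 'min len chunk' concatenation needs no
synchronization in a ParDo: all SAX callbacks for one parse arrive on
the single thread running that parse, so a buffering ContentHandler
local to one @ProcessElement call is enough. A JDK-only sketch (using
javax.xml SAX rather than Tika, but the callback contract is the same):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// JDK-only sketch: buffer SAX text callbacks until a minimum chunk
// length is reached. Callbacks for one parse run on the calling thread,
// so no locking is needed when this lives inside one DoFn call.
public class MinLengthHandlerDemo extends DefaultHandler {
  private final int minLength;
  private final StringBuilder buffer = new StringBuilder();
  final List<String> chunks = new ArrayList<>();

  MinLengthHandlerDemo(int minLength) { this.minLength = minLength; }

  @Override
  public void characters(char[] ch, int start, int length) {
    buffer.append(ch, start, length);
    if (buffer.length() >= minLength) { // emit a composite chunk
      chunks.add(buffer.toString());
      buffer.setLength(0);
    }
  }

  @Override
  public void endDocument() {
    if (buffer.length() > 0) {          // flush whatever is left
      chunks.add(buffer.toString());
      buffer.setLength(0);
    }
  }

  public static void main(String[] args) throws Exception {
    MinLengthHandlerDemo handler = new MinLengthHandlerDemo(8);
    byte[] doc = "<d>hello world from tika</d>".getBytes(StandardCharsets.UTF_8);
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new ByteArrayInputStream(doc), handler);
    // Chunk boundaries depend on how the parser splits characters(),
    // but concatenating the chunks always restores the full text.
    System.out.println(String.join("", handler.chunks));
  }
}
```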
> >>>
> >>>
> >>>> Another way to state it: currently, if I wanted to solve all of the
> >>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
> >>>> API myself on the resulting ReadableFile. How can we make TikaIO
> >>>> provide a usability improvement over such usage?
> >>>>
> >>>
> >>>
> >>> If you are actually asking, does it really make sense for Beam to
> >>> ship Tika related code, given that users can just do it themselves,
> >>> I'm not sure.
> >>>
> >>> IMHO it always works better if users have to provide just few config
> >>> options to an integral part of the framework and see things happening.
> >>> It will bring more users.
> >>>
> >>> Whether the current Tika code (refactored or not) stays with Beam or
> >>> not - I'll let you and the team decide; believe it or not I was
> >>> seriously contemplating at the last moment to make it all part of the
> >>> Tika project itself and have a bit more flexibility over there with
> >>> tweaking things, but now that it is in the Beam snapshot - I don't
> >>> know - it's no my decision...
> >>>
> >>>> I am confused by your other comment - "Does the ordering matter ?
> >>>> Perhaps
> >>>> for some cases it does, and for some it does not. May be it makes
> >>>> sense to support running TikaIO as both the bounded reader/source
> >>>> and ParDo, with getting the common code reused." - because using
> >>>> BoundedReader or ParDo is not related to the ordering issue, only to
> >>>> the issue of asynchronous reading and complexity of implementation.
> >>>> The resulting PCollection will be unordered either way - this needs
> >>>> to be solved separately by providing a different API.
> >>> Right I see now, so ParDo is not about making Tika reported data
> >>> available to the downstream pipeline components ordered, only about
> >>> the simpler implementation.
> >>> Association with the file should be possible I hope, but I understand
> >>> it would be possible to optionally make the data coming out in the
> >>> ordered way as well...
> >>>
> >>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
> >>> let me double check: should we still give some thought to the
> >>> possible performance benefit of the current approach ? As I said, I
> >>> can easily get rid of all that polling code, use a simple BlockingQueue.
> >>>
> >>> Cheers, Sergey
> >>>>
> >>>> Thanks.
> >>>>
> >>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
> >>>> <sb...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi
> >>>>>
> >>>>> Glad TikaIO getting some serious attention :-), I believe one thing
> >>>>> we both agree upon is that Tika can help Beam in its own unique way.
> >>>>>
> >>>>> Before trying to reply online, I'd like to state that my main
> >>>>> assumption is that TikaIO (as far as the read side is concerned) is
> >>>>> no different to Text, XML or similar bounded reader components.
> >>>>>
> >>>>> I have to admit I don't understand your questions about TikaIO
> >>>>> usecases.
> >>>>>
> >>>>> What are the Text Input or XML input use-cases ? These use cases
> >>>>> are TikaInput cases as well, the only difference is Tika can not
> >>>>> split the individual file into a sequence of sources/etc,
> >>>>>
> >>>>> TextIO can read from the plain text files (possibly zipped), XML -
> >>>>> optimized around reading from the XML files, and I thought I made
> >>>>> it clear (and it is a known fact anyway) Tika was about reading
> >>>>> basically from any file format.
> >>>>>
> >>>>> Where is the difference (apart from what I've already mentioned) ?
> >>>>>
> >>>>> Sergey
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Replies inline.
> >>>>>>
> >>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
> >>>>>> <sb...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi All
> >>>>>>>
> >>>>>>> This is my first post to the dev list, I work for Talend, I'm a
> >>>>>>> Beam novice, Apache Tika fan, and thought it would be really
> >>>>>>> great to try and link both projects together, which led me to
> >>>>>>> opening [1] where I typed some early thoughts, followed by PR
> >>>>>>> [2].
> >>>>>>>
> >>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
> >>>>>>> newer review comments from Eugene pending, so I'd like to
> >>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
> >>>>>>> decide, based on the feedback from the experts, what to do next.
> >>>>>>>
> >>>>>>> Apache Tika Parsers report the text content in chunks, via
> >>>>>>> SaxParser events. It's not possible with Tika to take a file and
> >>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
> >>>>>>> by line, the only way is to handle the SAXParser callbacks which
> >>>>>>> report the data chunks.
> >>>>>>> Some
> >>>>>>> parsers may report the complete lines, some individual words,
> >>>>>>> with some being able to report the data only after they completely
> >>>>>>> parse the document.
> >>>>>>> All depends on the data format.
> >>>>>>>
> >>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
> >>>>>>> to parse the files, Beam threads will only collect the data from
> >>>>>>> the internal queue where the internal TikaReader's thread will
> >>>>>>> put the data into (note the data chunks are ordered even though
> >>>>>>> the tests might suggest otherwise).
> >>>>>>>
> >>>>>> I agree that your implementation of reader returns records in
> >>>>>> order
> >>>>>> - but
> >>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
> >>>>>> the order in which records are produced by a BoundedReader - the
> >>>>>> order produced by your reader is ignored, and when applying any
> >>>>>> transforms to the
> >>>>> PCollection
> >>>>>> produced by TikaIO, it is impossible to recover the order in which
> >>>>>> your reader returned the records.
> >>>>>>
> >>>>>> With that in mind, is PCollection<String>, containing individual
> >>>>>> Tika-detected items, still the right API for representing the
> >>>>>> result of parsing a large number of documents with Tika?
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> The reason I did it was because I thought
> >>>>>>>
> >>>>>>> 1) it would make the individual data chunks available faster to
> >>>>>>> the pipeline - the parser will continue working via the
> >>>>>>> binary/video etc file while the data will already start flowing -
> >>>>>>> I agree there should be some tests data available confirming it -
> >>>>>>> but I'm positive at the moment this approach might yield some
> >>>>>>> performance gains with the large sets. If the file is large, if
> >>>>>>> it has the embedded attachments/videos to deal with, then it may
> >>>>>>> be more effective not to get the Beam thread deal with it...
> >>>>>>>
> >>>>>>> As I said on the PR, this description contains unfounded and
> >>>>>>> potentially
> >>>>>> incorrect assumptions about how Beam runners execute (or may
> >>>>>> execute in
> >>>>> the
> >>>>>> future) a ParDo or a BoundedReader. For example, if I understand
> >>>>> correctly,
> >>>>>> you might be assuming that:
> >>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
> >>>>> complete
> >>>>>> before processing its outputs with downstream transforms
> >>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
> >>>>> *concurrently*
> >>>>>> with downstream processing of its results
> >>>>>> - Passing an element from one thread to another using a
> >>>>>> BlockingQueue is free in terms of performance All of these are
> >>>>>> false at least in some runners, and I'm almost certain that in
> >>>>>> reality, performance of this approach is worse than a ParDo in
> >>>>> most
> >>>>>> production runners.
> >>>>>>
> >>>>>> There are other disadvantages to this approach:
> >>>>>> - Doing the bulk of the processing in a separate thread makes it
> >>>>> invisible
> >>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
> >>>>>> profiling capabilities, or the ability to get the current stack
> >>>>>> trace for stuck elements, this approach would make the real
> >>>>>> processing invisible to all of these capabilities, and a user
> >>>>>> would only see that the bulk of the time is spent waiting for the
> >>>>>> next element, but not *why* the next
> >>>>> element
> >>>>>> is taking long to compute.
> >>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
> >>>>>> invisible to Beam, will make it harder for runners to do
> >>>>>> autoscaling, binpacking
> >>>>> and
> >>>>>> other resource management magic (how much of this runners actually
> >>>>>> do is
> >>>>> a
> >>>>>> separate issue), because the runner will have no way of knowing
> >>>>>> how much CPU/IO this particular transform is actually using - all
> >>>>>> the processing happens in a thread about which the runner is
> >>>>>> unaware.
> >>>>>> - As far as I can tell, the code also hides exceptions that happen
> >>>>>> in the Tika thread
> >>>>>> - Adding the thread management makes the code much more complex,
> >>>>>> easier
> >>>>> to
> >>>>>> introduce bugs, and harder for others to contribute
> >>>>>>
> >>>>>>
> >>>>>>> 2) As I commented at the end of [2], having an option to
> >>>>>>> concatenate the data chunks first before making them available to
> >>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
> >>>>>>> introduce some synchronization issues (though not exactly sure
> >>>>>>> yet)
> >>>>>>>
> >>>>>> What are these issues?
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> One of valid concerns there is that the reader is polling the
> >>>>>>> internal queue so, in theory at least, and perhaps in some rare
> >>>>>>> cases too, we may have a case where the max polling time has been
> >>>>>>> reached, the parser is still busy, and TikaIO fails to report all
> >>>>>>> the file data. I think that it can be solved by either 2a)
> >>>>>>> configuring the max polling time to a very large number which
> >>>>>>> will never be reached for a practical case, or
> >>>>>>> 2b) simply use a blocking queue without the time limits - in the
> >>>>>>> worst case, if TikaParser spins and fails to report the end of
> >>>>>>> the document, then Beam can heal itself if the pipeline blocks.
> >>>>>>> I propose to follow 2b).
> >>>>>>>
> >>>>>> I agree that there should be no way to unintentionally configure
> >>>>>> the transform in a way that will produce silent data loss. Another
> >>>>>> reason for not having these tuning knobs is that it goes against
> >>>>>> Beam's "no knobs"
> >>>>>> philosophy, and that in most cases users have no way of figuring
> >>>>>> out a
> >>>>> good
> >>>>>> value for tuning knobs except for manual experimentation, which is
> >>>>>> extremely brittle and typically gets immediately obsoleted by
> >>>>>> running on
> >>>>> a
> >>>>>> new dataset or updating a version of some of the involved
> >>>>>> dependencies
> >>>>> etc.
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Please let me know what you think.
> >>>>>>> My plan so far is:
> >>>>>>> 1) start addressing most of Eugene's comments which would require
> >>>>>>> some minor TikaIO updates
> >>>>>>> 2) work on removing the TikaSource internal code dealing with
> >>>>>>> File patterns which I copied from TextIO at the next stage
> >>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
> >>>>>>> users some time to try it with some real complex files and also
> >>>>>>> decide if TikaIO can continue to be implemented as a
> >>>>>>> BoundedSource/Reader or not
> >>>>>>>
> >>>>>>> Eugene, all, will it work if I start with 1) ?
> >>>>>>>
> >>>>>> Yes, but I think we should start by discussing the anticipated use
> >>>>>> cases
> >>>>> of
> >>>>>> TikaIO and designing an API for it based on those use cases; and
> >>>>>> then see what's the best implementation for that particular API
> >>>>>> and set of anticipated use cases.
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Thanks, Sergey
> >>>>>>>
> >>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
> >>>>>>> [2] https://github.com/apache/beam/pull/3378
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
>

Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi all,

Please also welcome Chris to this thread,

Chris, thanks for joining in :-), FYI, the main concern that was raised 
is that it is not obvious when to use TikaIO in its current form, given 
that Beam+TikaIO will have a totally unordered sequence of data 
(originally extracted by Tika in the right order) flowing through the 
pipeline; thus the question is, what is the practical value of a 
native Beam TikaIO component as opposed to the users manually doing some 
Tika coding in a Beam function.

According to Tim, the ordering is often very important; I referred to 
one of CXF demos where it did not matter, but some more practical 
examples would be of interest.

FYI, the file metadata is also reported, but optionally, and only after 
the content has been reported. The metadata will flow in the form 
"author=Alice" etc, though the '=' separator can easily be made 
customizable...
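
For concreteness, a hedged sketch of what that could look like (the 
render helper below is hypothetical, not actual TikaIO code - it just 
shows the "key<separator>value" shape with the separator configurable):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MetadataFormat {
    // Renders metadata entries as "key<sep>value" strings; '=' is just
    // one possible separator, passed in rather than hard-coded.
    static List<String> render(Map<String, String> meta, String sep) {
        return meta.entrySet().stream()
            .map(e -> e.getKey() + sep + e.getValue())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("author", "Alice");
        meta.put("title", "Report");
        System.out.println(render(meta, "="));   // author=Alice style
        System.out.println(render(meta, ": "));  // alternative separator
    }
}
```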

Tim suggested having higher-level recovery support in Beam, since the 
cases where the parser goes OOM, fails in some other way, or spins 
would be a big deal for the users.
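
To illustrate why Tim's points are about *process*-level recovery: the 
best one can do in-JVM is a thread-level timeout, sketched below with 
plain java.util.concurrent (class and method names are illustrative, 
not a proposed API). Future.cancel() only *interrupts* the worker - it 
cannot reclaim a thread stuck in a tight loop or native code, and it 
does nothing for an OOM, which is why a separate, killable process is 
the safer isolation boundary:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseWithTimeout {
    // Runs a parse task with a wall-clock limit. On timeout we stop
    // waiting and interrupt the worker, but a truly stuck thread may
    // keep running - hence the process-level isolation Tim describes.
    static String parseOrGiveUp(Callable<String> parse, long millis) {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        try {
            return ex.submit(parse).get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "<timed out>";
        } catch (Exception e) {
            return "<failed: " + e.getCause() + ">";
        } finally {
            ex.shutdownNow();  // interrupts the worker; not a guaranteed kill
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrGiveUp(() -> "ok", 1000));
        System.out.println(parseOrGiveUp(() -> {
            Thread.sleep(5000);  // stands in for a hung parser
            return "never";
        }, 100));
    }
}
```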

Implementation-wise, the open question for making TikaIO capable of 
covering more cases is how to help it, at the Beam level, keep the data 
ordered all the way through - but that would be phase 2, assuming TikaIO 
stays...
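
For reference, the hand-off pattern discussed further down in this 
thread (an internal parser thread feeding a queue that the reader 
drains) can be sketched with a plain BlockingQueue and a poison pill - 
stdlib only, with illustrative names, not the actual TikaReader code. 
This is the blocking variant (option 2b below), which avoids the max 
polling time knob entirely:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueHandoff {
    // Sentinel marking end-of-document; chosen so it cannot collide
    // with real text in this sketch.
    private static final String EOF = "\u0000EOF";

    // The producer thread stands in for the internal Tika parsing thread.
    // The caller drains chunks in order, blocking on take() rather than
    // polling with a time limit, so no data can be silently dropped.
    static List<String> drain(List<String> parsedChunks) throws InterruptedException {
        BlockingQueue<String> q = new ArrayBlockingQueue<>(16);
        Thread producer = new Thread(() -> {
            try {
                for (String c : parsedChunks) q.put(c);  // SAX callbacks would land here
                q.put(EOF);
            } catch (InterruptedException ignored) { }
        });
        producer.start();
        List<String> out = new ArrayList<>();
        for (String c = q.take(); !c.equals(EOF); c = q.take()) {
            out.add(c);  // chunks come out in the order they were put in
        }
        producer.join();
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(drain(List.of("line 1", "line 2", "line 3")));
    }
}
```

Note this only preserves order between the two threads of one reader; 
it says nothing about ordering across the downstream PCollection, which 
is the separate API question above.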

Thanks, Sergey


On 21/09/17 15:38, Sergey Beryozkin wrote:
> Hi Tim
> On 21/09/17 14:33, Allison, Timothy B. wrote:
>> Thank you, Sergey.
>>
>> My knowledge of Apache Beam is limited -- I saw Davor and 
>> Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally 
>> impressed, but I haven't had a chance to work with it yet.
>>
>>  From my perspective, if I understand this thread (and I may not!), 
>> getting unordered text from _a given file_ is a non-starter for most 
>> applications.  The implementation needs to guarantee order per file, 
>> and the user has to be able to link the "extract" back to a unique 
>> identifier for the document.  If the current implementation doesn't do 
>> those things, we need to change it, IMHO.
>>
> Right now Tika-related reader does not associate a given text fragment 
> with the file name, so a function looking at some text and trying to 
> find where it came from won't be able to do so.
> 
> So I asked how to do it in Beam, how to attach some context to the given 
> piece of data. I hope it can be done and if not - then perhaps some 
> improvement can be applied.
> 
> Re the unordered text - yes - this is what we currently have with Beam + 
> TikaIO :-).
> 
> The use-case I referred to earlier in this thread (upload PDFs - save 
> the possibly unordered text to Lucene with the file name 'attached', let 
> users search for the files containing some words - phrases, this works 
> OK given that I can see PDF parser for ex reporting the lines) can be 
> supported OK with the current TikaIO (provided we find a way to 'attach' 
> a file name to the flow).
> 
> I see though supporting the total ordering can be a big deal in other 
> cases. Eugene, can you please explain how it can be done, is it 
> achievable in principle, without the users having to do some custom 
> coding ?
> 
>> To the question of -- why is this in Beam at all; why don't we let 
>> users call it if they want it?...
>>
>> No matter how much we do to Tika, it will behave badly sometimes -- 
>> permanent hangs requiring kill -9 and OOMs to name a few.  I imagine 
>> folks using Beam -- folks likely with large batches of unruly/noisy 
>> documents -- are more likely to run into these problems than your 
>> average couple-of-thousand-docs-from-our-own-company user. So, if 
>> there are things we can do in Beam to prevent developers around the 
>> world from having to reinvent the wheel for defenses against these 
>> problems, then I'd be enormously grateful if we could put Tika into 
>> Beam.  That means:
>>
>> 1) a process-level timeout (because you can't actually kill a thread 
>> in Java)
>> 2) a process-level restart on OOM
>> 3) avoid trying to reprocess a badly behaving document
>>
>> If Beam automatically handles those problems, then I'd say, y, let 
>> users write their own code.  If there is so much as a single 
>> configuration knob (and it sounds like Beam is against complex 
>> configuration...yay!) to get that working in Beam, then I'd say, 
>> please integrate Tika into Beam.  From a safety perspective, it is 
>> critical to keep the extraction process entirely separate (jvm, vm, m, 
>> rack, data center!) from the transformation+loading steps.  IMHO, very 
>> few devs realize this because Tika works well lots of the time...which 
>> is why it is critical for us to make it easy for people to get it 
>> right all of the time.
>>
>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch 
>> mode first in one jvm, and then I kick off another process to do 
>> transform/loading into Lucene/Solr from the .json files that Tika 
>> generates for each input file.  If I were to scale up, I'd want to 
>> maintain this complete separation of steps.
>>
>> Apologies if I've derailed the conversation or misunderstood this thread.
>>
> Major thanks for your input :-)
> 
> Cheers, Sergey
> 
>> Cheers,
>>
>>                 Tim
>>
>> -----Original Message-----
>> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
>> Sent: Thursday, September 21, 2017 9:07 AM
>> To: dev@beam.apache.org
>> Cc: Allison, Timothy B. <ta...@mitre.org>
>> Subject: Re: TikaIO concerns
>>
>> Hi All
>>
>> Please welcome Tim, one of Apache Tika leads and practitioners.
>>
>> Tim, thanks for joining in :-). If you have some great Apache Tika 
>> stories to share (preferably involving the cases where it did not 
>> really matter the ordering in which Tika-produced data were dealt with 
>> by the
>> consumers) then please do so :-).
>>
>> At the moment, even though Tika ContentHandler will emit the ordered 
>> data, the Beam runtime will have no guarantees that the downstream 
>> pipeline components will see the data coming in the right order.
>>
>> (FYI, I understand from the earlier comments that the total ordering 
>> is also achievable but would require the extra API support)
>>
>> Other comments would be welcome too
>>
>> Thanks, Sergey
>>
>> On 21/09/17 10:55, Sergey Beryozkin wrote:
>>> I noticed that the PDF and ODT parsers actually split by lines, not
>>> individual words and nearly 100% sure I saw Tika reporting individual
>>> lines when it was parsing the text files. The 'min text length'
>>> feature can help with reporting several lines at a time, etc...
>>>
>>> I'm working with this PDF all the time:
>>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
>>>
>>> try it too if you get a chance.
>>>
>>> (and I can imagine not all PDFs/etc representing the 'story' but can
>>> be for ex a log-like content too)
>>>
>>> That said, I don't know how a parser for the format N will behave, it
>>> depends on the individual parsers.
>>>
>>> IMHO it's an equal candidate alongside Text-based bounded IOs...
>>>
>>> I'd like to know though how to make a file name available to the
>>> pipeline which is working with the current text fragment ?
>>>
>>> Going to try and do some measurements and compare the sync vs async
>>> parsing modes...
>>>
>>> Asked the Tika team to support with some more examples...
>>>
>>> Cheers, Sergey
>>> On 20/09/17 22:17, Sergey Beryozkin wrote:
>>>> Hi,
>>>>
>>>> thanks for the explanations,
>>>>
>>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>>>> Hi!
>>>>>
>>>>> TextIO returns an unordered soup of lines contained in all files you
>>>>> ask it to read. People usually use TextIO for reading files where 1
>>>>> line corresponds to 1 independent data element, e.g. a log entry, or
>>>>> a row of a CSV file - so discarding order is ok.
>>>> Just a side note, I'd probably want that be ordered, though I guess
>>>> it depends...
>>>>> However, there is a number of cases where TextIO is a poor fit:
>>>>> - Cases where discarding order is not ok - e.g. if you're doing
>>>>> natural language processing and the text files contain actual prose,
>>>>> where you need to process a file as a whole. TextIO can't do that.
>>>>> - Cases where you need to remember which file each element came
>>>>> from, e.g.
>>>>> if you're creating a search index for the files: TextIO can't do
>>>>> this either.
>>>>>
>>>>> Both of these issues have been raised in the past against TextIO;
>>>>> however it seems that the overwhelming majority of users of TextIO
>>>>> use it for logs or CSV files or alike, so solving these issues has
>>>>> not been a priority.
>>>>> Currently they are solved in a general form via FileIO.read() which
>>>>> gives you access to reading a full file yourself - people who want
>>>>> more flexibility will be able to use standard Java text-parsing
>>>>> utilities on a ReadableFile, without involving TextIO.
>>>>>
>>>>> Same applies for XmlIO: it is specifically designed for the narrow
>>>>> use case where the files contain independent data entries, so
>>>>> returning an unordered soup of them, with no association to the
>>>>> original file, is the user's intention. XmlIO will not work for
>>>>> processing more complex XML files that are not simply a sequence of
>>>>> entries with the same tag, and it also does not remember the
>>>>> original filename.
>>>>>
>>>>
>>>> OK...
>>>>
>>>>> However, if my understanding of Tika use cases is correct, it is
>>>>> mainly used for extracting content from complex file formats - for
>>>>> example, extracting text and images from PDF files or Word
>>>>> documents. I believe this is the main difference between it and
>>>>> TextIO - people usually use Tika for complex use cases where the
>>>>> "unordered soup of stuff" abstraction is not useful.
>>>>>
>>>>> My suspicion about this is confirmed by the fact that the crux of
>>>>> the Tika API is ContentHandler
>>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
>>>>> html?is-external=true
>>>>>
>>>>> whose
>>>>> documentation says "The order of events in this interface is very
>>>>> important, and mirrors the order of information in the document 
>>>>> itself."
>>>> All that says is that a (Tika) ContentHandler will be a true SAX
>>>> ContentHandler...
>>>>>
>>>>> Let me give a few examples of what I think is possible with the raw
>>>>> Tika API, but I think is not currently possible with TikaIO - please
>>>>> correct me where I'm wrong, because I'm not particularly familiar
>>>>> with Tika and am judging just based on what I read about it.
>>>>> - User has 100,000 Word documents and wants to convert each of them
>>>>> to text files for future natural language processing.
>>>>> - User has 100,000 PDF files with financial statements, each
>>>>> containing a bunch of unrelated text and - the main content - a list
>>>>> of transactions in PDF tables. User wants to extract each
>>>>> transaction as a PCollection element, discarding the unrelated text.
>>>>> - User has 100,000 PDF files with scientific papers, and wants to
>>>>> extract text from them, somehow parse author and affiliation from
>>>>> the text, and compute statistics of topics and terminology usage by
>>>>> author name and affiliation.
>>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>>>>> observing a location over time: they want to extract metadata from
>>>>> each image using Tika, analyze the images themselves using some
>>>>> other library, and detect anomalies in the overall appearance of the
>>>>> location over time as seen from multiple cameras.
>>>>> I believe all of these cases can not be solved with TikaIO because
>>>>> the resulting PCollection<String> contains no information about
>>>>> which String comes from which document and about the order in which
>>>>> they appear in the document.
>>>> These are good use cases, thanks... I thought what you were talking
>>>> about the unordered soup of data produced by TikaIO (and its friends
>>>> TextIO and alike :-)).
>>>> Putting the ordered vs unordered question aside for a sec, why
>>>> exactly a Tika Reader can not make the name of the file it's
>>>> currently reading from available to the pipeline, as some Beam 
>>>> pipeline metadata piece ?
>>>> Surely it can be possible with Beam ? If not then I would be 
>>>> surprised...
>>>>
>>>>>
>>>>> I am, honestly, struggling to think of a case where I would want to
>>>>> use Tika, but where I *would* be ok with getting an unordered soup
>>>>> of strings.
>>>>> So some examples would be very helpful.
>>>>>
>>>> Yes. I'll ask Tika developers to help with some examples, but I'll
>>>> give one example where it did not matter to us in what order
>>>> Tika-produced data were available to the downstream layer.
>>>>
>>>> It's a demo the Apache CXF colleague of mine showed at one of Apache
>>>> Con NAs, and we had a happy audience:
>>>>
>>>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea
>>>> se/samples/jax_rs/search
>>>>
>>>>
>>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
>>>> into Lucene. We associate a file name with the indexed content and
>>>> then let users find a list of PDF files which contain a given word or
>>>> few words, details are here
>>>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea
>>>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal
>>>> og.java#L131
>>>>
>>>>
>>>> I'd say even more involved search engines would not mind supporting a
>>>> case like that :-)
>>>>
>>>> Now there we process one file at a time, and I understand now that
>>>> with TikaIO and N files it's all over the place really as far as the
>>>> ordering is concerned, which file it's coming from. etc. That's why
>>>> TikaReader must be able to associate the file name with a given piece
>>>> of text it's making available to the pipeline.
>>>>
>>>> I'd be happy to support the ParDo way of linking Tika with Beam.
>>>> If it makes things simpler then it would be good, I've just no idea
>>>> at the moment how to start the pipeline without using a
>>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
>>>> earlier - how can one avoid it with ParDo when implementing a 'min
>>>> len chunk' feature, where the ParDo would have to concatenate several
>>>> SAX data pieces first before making a single composite piece to the 
>>>> pipeline ?
>>>>
>>>>
>>>>> Another way to state it: currently, if I wanted to solve all of the
>>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
>>>>> API myself on the resulting ReadableFile. How can we make TikaIO
>>>>> provide a usability improvement over such usage?
>>>>>
>>>>
>>>>
>>>> If you are actually asking, does it really make sense for Beam to
>>>> ship Tika related code, given that users can just do it themselves,
>>>> I'm not sure.
>>>>
>>>> IMHO it always works better if users have to provide just few config
>>>> options to an integral part of the framework and see things happening.
>>>> It will bring more users.
>>>>
>>>> Whether the current Tika code (refactored or not) stays with Beam or
>>>> not - I'll let you and the team decide; believe it or not I was
>>>> seriously contemplating at the last moment to make it all part of the
>>>> Tika project itself and have a bit more flexibility over there with
>>>> tweaking things, but now that it is in the Beam snapshot - I don't
>>>> know - it's no my decision...
>>>>
>>>>> I am confused by your other comment - "Does the ordering matter ?
>>>>> Perhaps
>>>>> for some cases it does, and for some it does not. May be it makes
>>>>> sense to support running TikaIO as both the bounded reader/source
>>>>> and ParDo, with getting the common code reused." - because using
>>>>> BoundedReader or ParDo is not related to the ordering issue, only to
>>>>> the issue of asynchronous reading and complexity of implementation.
>>>>> The resulting PCollection will be unordered either way - this needs
>>>>> to be solved separately by providing a different API.
>>>> Right I see now, so ParDo is not about making Tika reported data
>>>> available to the downstream pipeline components ordered, only about
>>>> the simpler implementation.
>>>> Association with the file should be possible I hope, but I understand
>>>> it would be possible to optionally make the data coming out in the
>>>> ordered way as well...
>>>>
>>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
>>>> let me double check: should we still give some thought to the
>>>> possible performance benefit of the current approach ? As I said, I
>>>> can easily get rid of all that polling code, use a simple Blocking 
>>>> queue.
>>>>
>>>> Cheers, Sergey
>>>>>
>>>>> Thanks.
>>>>>
>>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
>>>>> <sb...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> Glad TikaIO getting some serious attention :-), I believe one thing
>>>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>>>
>>>>>> Before trying to reply online, I'd like to state that my main
>>>>>> assumption is that TikaIO (as far as the read side is concerned) is
>>>>>> no different to Text, XML or similar bounded reader components.
>>>>>>
>>>>>> I have to admit I don't understand your questions about TikaIO
>>>>>> usecases.
>>>>>>
>>>>>> What are the Text Input or XML input use-cases ? These use cases
>>>>>> are TikaInput cases as well, the only difference is Tika can not
>>>>>> split the individual file into a sequence of sources/etc,
>>>>>>
>>>>>> TextIO can read from the plain text files (possibly zipped), XML -
>>>>>> optimized around reading from the XML files, and I thought I made
>>>>>> it clear (and it is a known fact anyway) Tika was about reading
>>>>>> basically from any file format.
>>>>>>
>>>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>>>
>>>>>> Sergey
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Replies inline.
>>>>>>>
>>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
>>>>>>> <sb...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi All
>>>>>>>>
> >>>>>>>> This is my first post to the dev list, I work for Talend, I'm a
>>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
>>>>>>>> great to try and link both projects together, which led me to
>>>>>>>> opening [1] where I typed some early thoughts, followed by PR
>>>>>>>> [2].
>>>>>>>>
>>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
>>>>>>>> newer review comments from Eugene pending, so I'd like to
>>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
>>>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>>>
>>>>>>>> Apache Tika Parsers report the text content in chunks, via
>>>>>>>> SaxParser events. It's not possible with Tika to take a file and
>>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
>>>>>>>> by line, the only way is to handle the SAXParser callbacks which
>>>>>>>> report the data chunks.
>>>>>>>> Some
>>>>>>>> parsers may report the complete lines, some individual words,
> >>>>>>>> with some being able to report the data only after they
> >>>>>>>> completely parse the document.
>>>>>>>> All depends on the data format.
>>>>>>>>
>>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
>>>>>>>> to parse the files, Beam threads will only collect the data from
>>>>>>>> the internal queue where the internal TikaReader's thread will
>>>>>>>> put the data into (note the data chunks are ordered even though
>>>>>>>> the tests might suggest otherwise).
>>>>>>>>
>>>>>>> I agree that your implementation of reader returns records in
>>>>>>> order
>>>>>>> - but
>>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
>>>>>>> the order in which records are produced by a BoundedReader - the
>>>>>>> order produced by your reader is ignored, and when applying any
>>>>>>> transforms to the
>>>>>> PCollection
>>>>>>> produced by TikaIO, it is impossible to recover the order in which
>>>>>>> your reader returned the records.
>>>>>>>
>>>>>>> With that in mind, is PCollection<String>, containing individual
>>>>>>> Tika-detected items, still the right API for representing the
>>>>>>> result of parsing a large number of documents with Tika?
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> The reason I did it was because I thought
>>>>>>>>
>>>>>>>> 1) it would make the individual data chunks available faster to
>>>>>>>> the pipeline - the parser will continue working via the
>>>>>>>> binary/video etc file while the data will already start flowing -
>>>>>>>> I agree there should be some tests data available confirming it -
>>>>>>>> but I'm positive at the moment this approach might yield some
>>>>>>>> performance gains with the large sets. If the file is large, if
>>>>>>>> it has the embedded attachments/videos to deal with, then it may
>>>>>>>> be more effective not to get the Beam thread deal with it...
>>>>>>>>
>>>>>>>> As I said on the PR, this description contains unfounded and
>>>>>>>> potentially
>>>>>>> incorrect assumptions about how Beam runners execute (or may
>>>>>>> execute in
>>>>>> the
>>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>>>> correctly,
>>>>>>> you might be assuming that:
>>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>>>> complete
>>>>>>> before processing its outputs with downstream transforms
>>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>>>> *concurrently*
>>>>>>> with downstream processing of its results
>>>>>>> - Passing an element from one thread to another using a
>>>>>>> BlockingQueue is free in terms of performance All of these are
>>>>>>> false at least in some runners, and I'm almost certain that in
>>>>>>> reality, performance of this approach is worse than a ParDo in
>>>>>> most
>>>>>>> production runners.
>>>>>>>
>>>>>>> There are other disadvantages to this approach:
>>>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>>>> invisible
>>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>>>>> profiling capabilities, or the ability to get the current stack
>>>>>>> trace for stuck elements, this approach would make the real
>>>>>>> processing invisible to all of these capabilities, and a user
>>>>>>> would only see that the bulk of the time is spent waiting for the
>>>>>>> next element, but not *why* the next
>>>>>> element
>>>>>>> is taking long to compute.
>>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>>>>> invisible to Beam, will make it harder for runners to do
>>>>>>> autoscaling, binpacking
>>>>>> and
>>>>>>> other resource management magic (how much of this runners actually
>>>>>>> do is
>>>>>> a
>>>>>>> separate issue), because the runner will have no way of knowing
>>>>>>> how much CPU/IO this particular transform is actually using - all
>>>>>>> the processing happens in a thread about which the runner is
>>>>>>> unaware.
>>>>>>> - As far as I can tell, the code also hides exceptions that happen
>>>>>>> in the Tika thread
>>>>>>> - Adding the thread management makes the code much more complex,
>>>>>>> easier
>>>>>> to
>>>>>>> introduce bugs, and harder for others to contribute
>>>>>>>
>>>>>>>
>>>>>>>> 2) As I commented at the end of [2], having an option to
>>>>>>>> concatenate the data chunks first before making them available to
>>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
>>>>>>>> introduce some synchronization issues (though not exactly sure
>>>>>>>> yet)
>>>>>>>>
>>>>>>> What are these issues?
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> One of valid concerns there is that the reader is polling the
>>>>>>>> internal queue so, in theory at least, and perhaps in some rare
>>>>>>>> cases too, we may have a case where the max polling time has been
>>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
>>>>>>>> the file data. I think that it can be solved by either 2a)
>>>>>>>> configuring the max polling time to a very large number which
>>>>>>>> will never be reached for a practical case, or
>>>>>>>> 2b) simply use a blocking queue without the time limits - in the
>>>>>>>> worst case, if TikaParser spins and fails to report the end of
> >>>>>>>> the document, then Beam can heal itself if the pipeline blocks.
>>>>>>>> I propose to follow 2b).
>>>>>>>>
>>>>>>> I agree that there should be no way to unintentionally configure
>>>>>>> the transform in a way that will produce silent data loss. Another
>>>>>>> reason for not having these tuning knobs is that it goes against
>>>>>>> Beam's "no knobs"
>>>>>>> philosophy, and that in most cases users have no way of figuring
>>>>>>> out a
>>>>>> good
>>>>>>> value for tuning knobs except for manual experimentation, which is
>>>>>>> extremely brittle and typically gets immediately obsoleted by
>>>>>>> running on
>>>>>> a
>>>>>>> new dataset or updating a version of some of the involved
>>>>>>> dependencies
>>>>>> etc.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Please let me know what you think.
>>>>>>>> My plan so far is:
>>>>>>>> 1) start addressing most of Eugene's comments which would require
>>>>>>>> some minor TikaIO updates
>>>>>>>> 2) work on removing the TikaSource internal code dealing with
>>>>>>>> File patterns which I copied from TextIO at the next stage
>>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
>>>>>>>> users some time to try it with some real complex files and also
>>>>>>>> decide if TikaIO can continue implemented as a
>>>>>>>> BoundedSource/Reader or not
>>>>>>>>
>>>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>>>
>>>>>>> Yes, but I think we should start by discussing the anticipated use
>>>>>>> cases
>>>>>> of
>>>>>>> TikaIO and designing an API for it based on those use cases; and
>>>>>>> then see what's the best implementation for that particular API
>>>>>>> and set of anticipated use cases.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks, Sergey
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi all,

Please also welcome Chris to this thread,

Chris, thanks for joining in :-), FYI, the main concern that was raised 
is that it was not obvious when to use TikaIO in the current form, given 
that Beam+TikaIO will have a totally unordered sequence of data 
(originally extracted by Tika in the right order) flowing through the 
pipeline, thus the question is, what is the practical value of having a 
native Bean TikaIO component as opposed to the users manually doing some 
Tika coding in the Beam function.

According to Tim, the ordering is often very important; I referred to 
one of CXF demos where it did not matter, but some more practical 
examples would be of interest.

FYI, the file metadata are also reported, but optionally, and after the 
content has been reported. The metadata will flow in the form 
"author=Alice" etc, though a '=' can be easily made customizable...

Tim suggested having a higher-level Beam recovery support in the cases 
where the parser go OOM or fail somehow else and spin would be a big 
deal for the users.

Implementation wise, the open question is, to make TikaIO capable of 
covering more cases is to how to help it at the Beam level to get the 
data ordered all the way, but that would be the phase 2, assuming TikaIO 
stays...

Thanks, Sergey


On 21/09/17 15:38, Sergey Beryozkin wrote:
> Hi Tim
> On 21/09/17 14:33, Allison, Timothy B. wrote:
>> Thank you, Sergey.
>>
>> My knowledge of Apache Beam is limited -- I saw Davor and 
>> Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally 
>> impressed, but I haven't had a chance to work with it yet.
>>
>>  From my perspective, if I understand this thread (and I may not!), 
>> getting unordered text from _a given file_ is a non-starter for most 
>> applications.  The implementation needs to guarantee order per file, 
>> and the user has to be able to link the "extract" back to a unique 
>> identifier for the document.  If the current implementation doesn't do 
>> those things, we need to change it, IMHO.
>>
> Right now Tika-related reader does not associate a given text fragment 
> with the file name, so a function looking at some text and trying to 
> find where it came from won't be able to do so.
> 
> So I asked how to do it in Beam, how to attach some context to the given 
> piece of data. I hope it can be done and if not - then perhaps some 
> improvement can be applied.
> 
> Re the unordered text - yes - this is what we currently have with Beam + 
> TikaIO :-).
> 
> The use-case I referred to earlier in this thread (upload PDFs - save 
> the possibly unordered text to Lucene with the file name 'attached', let 
> users search for the files containing some words - phrases, this works 
> OK given that I can see PDF parser for ex reporting the lines) can be 
> supported OK with the current TikaIO (provided we find a way to 'attach' 
> a file name to the flow).
> 
> I see though supporting the total ordering can be a big deal in other 
> cases. Eugene, can you please explain how it can be done, is it 
> achievable in principle, without the users having to do some custom 
> coding ?
> 
>> To the question of -- why is this in Beam at all; why don't we let 
>> users call it if they want it?...
>>
>> No matter how much we do to Tika, it will behave badly sometimes -- 
>> permanent hangs requiring kill -9 and OOMs to name a few.  I imagine 
>> folks using Beam -- folks likely with large batches of unruly/noisy 
>> documents -- are more likely to run into these problems than your 
>> average couple-of-thousand-docs-from-our-own-company user. So, if 
>> there are things we can do in Beam to prevent developers around the 
>> world from having to reinvent the wheel for defenses against these 
>> problems, then I'd be enormously grateful if we could put Tika into 
>> Beam.  That means:
>>
>> 1) a process-level timeout (because you can't actually kill a thread 
>> in Java)
>> 2) a process-level restart on OOM
>> 3) avoid trying to reprocess a badly behaving document
>>
>> If Beam automatically handles those problems, then I'd say, y, let 
>> users write their own code.  If there is so much as a single 
>> configuration knob (and it sounds like Beam is against complex 
>> configuration...yay!) to get that working in Beam, then I'd say, 
>> please integrate Tika into Beam.  From a safety perspective, it is 
>> critical to keep the extraction process entirely separate (jvm, vm, m, 
>> rack, data center!) from the transformation+loading steps.  IMHO, very 
>> few devs realize this because Tika works well lots of the time...which 
>> is why it is critical for us to make it easy for people to get it 
>> right all of the time.
>>
>> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch 
>> mode first in one jvm, and then I kick off another process to do 
>> transform/loading into Lucene/Solr from the .json files that Tika 
>> generates for each input file.  If I were to scale up, I'd want to 
>> maintain this complete separation of steps.
>>
>> Apologies if I've derailed the conversation or misunderstood this thread.
>>
> Major thanks for your input :-)
> 
> Cheers, Sergey
> 
>> Cheers,
>>
>>                 Tim
>>
>> -----Original Message-----
>> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
>> Sent: Thursday, September 21, 2017 9:07 AM
>> To: dev@beam.apache.org
>> Cc: Allison, Timothy B. <ta...@mitre.org>
>> Subject: Re: TikaIO concerns
>>
>> Hi All
>>
>> Please welcome Tim, one of Apache Tika leads and practitioners.
>>
>> Tim, thanks for joining in :-). If you have some great Apache Tika 
>> stories to share (preferably involving the cases where it did not 
>> really matter the ordering in which Tika-produced data were dealt with 
>> by the
>> consumers) then please do so :-).
>>
>> At the moment, even though Tika ContentHandler will emit the ordered 
>> data, the Beam runtime will have no guarantees that the downstream 
>> pipeline components will see the data coming in the right order.
>>
>> (FYI, I understand from the earlier comments that the total ordering 
>> is also achievable but would require the extra API support)
>>
>> Other comments would be welcome too
>>
>> Thanks, Sergey
>>
>> On 21/09/17 10:55, Sergey Beryozkin wrote:
>>> I noticed that the PDF and ODT parsers actually split by lines, not
>>> individual words and nearly 100% sure I saw Tika reporting individual
>>> lines when it was parsing the text files. The 'min text length'
>>> feature can help with reporting several lines at a time, etc...
>>>
>>> I'm working with this PDF all the time:
>>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
>>>
>>> try it too if you get a chance.
>>>
>>> (and I can imagine not all PDFs/etc representing the 'story' but can
>>> be for ex a log-like content too)
>>>
>>> That said, I don't know how a parser for the format N will behave, it
>>> depends on the individual parsers.
>>>
>>> IMHO it's an equal candidate alongside Text-based bounded IOs...
>>>
>>> I'd like to know though how to make a file name available to the
>>> pipeline which is working with the current text fragment ?
>>>
>>> Going to try and do some measurements and compare the sync vs async
>>> parsing modes...
>>>
>>> Asked the Tika team to support with some more examples...
>>>
>>> Cheers, Sergey
>>> On 20/09/17 22:17, Sergey Beryozkin wrote:
>>>> Hi,
>>>>
>>>> thanks for the explanations,
>>>>
>>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>>>> Hi!
>>>>>
>>>>> TextIO returns an unordered soup of lines contained in all files you
>>>>> ask it to read. People usually use TextIO for reading files where 1
>>>>> line corresponds to 1 independent data element, e.g. a log entry, or
>>>>> a row of a CSV file - so discarding order is ok.
>>>> Just a side note, I'd probably want that to be ordered, though I guess
>>>> it depends...
>>>>> However, there are a number of cases where TextIO is a poor fit:
>>>>> - Cases where discarding order is not ok - e.g. if you're doing
>>>>> natural language processing and the text files contain actual prose,
>>>>> where you need to process a file as a whole. TextIO can't do that.
>>>>> - Cases where you need to remember which file each element came
>>>>> from, e.g.
>>>>> if you're creating a search index for the files: TextIO can't do
>>>>> this either.
>>>>>
>>>>> Both of these issues have been raised in the past against TextIO;
>>>>> however it seems that the overwhelming majority of users of TextIO
>>>>> use it for logs or CSV files or alike, so solving these issues has
>>>>> not been a priority.
>>>>> Currently they are solved in a general form via FileIO.read() which
>>>>> gives you access to reading a full file yourself - people who want
>>>>> more flexibility will be able to use standard Java text-parsing
>>>>> utilities on a ReadableFile, without involving TextIO.
>>>>>
>>>>> Same applies for XmlIO: it is specifically designed for the narrow
>>>>> use case where the files contain independent data entries, so
>>>>> returning an unordered soup of them, with no association to the
>>>>> original file, is the user's intention. XmlIO will not work for
>>>>> processing more complex XML files that are not simply a sequence of
>>>>> entries with the same tag, and it also does not remember the
>>>>> original filename.
>>>>>
>>>>
>>>> OK...
>>>>
>>>>> However, if my understanding of Tika use cases is correct, it is
>>>>> mainly used for extracting content from complex file formats - for
>>>>> example, extracting text and images from PDF files or Word
>>>>> documents. I believe this is the main difference between it and
>>>>> TextIO - people usually use Tika for complex use cases where the
>>>>> "unordered soup of stuff" abstraction is not useful.
>>>>>
>>>>> My suspicion about this is confirmed by the fact that the crux of
>>>>> the Tika API is ContentHandler
>>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
>>>>> html?is-external=true
>>>>>
>>>>> whose
>>>>> documentation says "The order of events in this interface is very
>>>>> important, and mirrors the order of information in the document 
>>>>> itself."
>>>> All that says is that a (Tika) ContentHandler will be a true SAX
>>>> ContentHandler...
>>>>>
>>>>> Let me give a few examples of what I think is possible with the raw
>>>>> Tika API, but I think is not currently possible with TikaIO - please
>>>>> correct me where I'm wrong, because I'm not particularly familiar
>>>>> with Tika and am judging just based on what I read about it.
>>>>> - User has 100,000 Word documents and wants to convert each of them
>>>>> to text files for future natural language processing.
>>>>> - User has 100,000 PDF files with financial statements, each
>>>>> containing a bunch of unrelated text and - the main content - a list
>>>>> of transactions in PDF tables. User wants to extract each
>>>>> transaction as a PCollection element, discarding the unrelated text.
>>>>> - User has 100,000 PDF files with scientific papers, and wants to
>>>>> extract text from them, somehow parse author and affiliation from
>>>>> the text, and compute statistics of topics and terminology usage by
>>>>> author name and affiliation.
>>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>>>>> observing a location over time: they want to extract metadata from
>>>>> each image using Tika, analyze the images themselves using some
>>>>> other library, and detect anomalies in the overall appearance of the
>>>>> location over time as seen from multiple cameras.
>>>>> I believe all of these cases can not be solved with TikaIO because
>>>>> the resulting PCollection<String> contains no information about
>>>>> which String comes from which document and about the order in which
>>>>> they appear in the document.
>>>> These are good use cases, thanks... I thought you were talking about
>>>> the unordered soup of data produced by TikaIO (and its friends
>>>> TextIO and the like :-)).
>>>> Putting the ordered vs unordered question aside for a sec, why
>>>> exactly a Tika Reader can not make the name of the file it's
>>>> currently reading from available to the pipeline, as some Beam 
>>>> pipeline metadata piece ?
>>>> Surely it can be possible with Beam ? If not then I would be 
>>>> surprised...
>>>>
>>>>>
>>>>> I am, honestly, struggling to think of a case where I would want to
>>>>> use Tika, but where I *would* be ok with getting an unordered soup
>>>>> of strings.
>>>>> So some examples would be very helpful.
>>>>>
>>>> Yes. I'll ask Tika developers to help with some examples, but I'll
>>>> give one example where it did not matter to us in what order
>>>> Tika-produced data were available to the downstream layer.
>>>>
>>>> It's a demo an Apache CXF colleague of mine showed at one of the
>>>> ApacheCon NAs, and we had a happy audience:
>>>>
>>>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea
>>>> se/samples/jax_rs/search
>>>>
>>>>
>>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
>>>> into Lucene. We associate a file name with the indexed content and
>>>> then let users find a list of PDF files which contain a given word or
>>>> few words, details are here
>>>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea
>>>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal
>>>> og.java#L131
>>>>
>>>>
>>>> I'd say even more involved search engines would not mind supporting a
>>>> case like that :-)
>>>>
>>>> Now there we process one file at a time, and I understand now that
>>>> with TikaIO and N files it's all over the place really as far as the
>>>> ordering is concerned, which file it's coming from, etc. That's why
>>>> TikaReader must be able to associate the file name with a given piece
>>>> of text it's making available to the pipeline.
>>>>
>>>> I'd be happy to support the ParDo way of linking Tika with Beam.
>>>> If it makes things simpler then it would be good, I've just no idea
>>>> at the moment how to start the pipeline without using a
>>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
>>>> earlier - how can one avoid it with ParDo when implementing a 'min
>>>> len chunk' feature, where the ParDo would have to concatenate several
>>>> SAX data pieces first before making a single composite piece to the 
>>>> pipeline ?
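Thinking about it more: the 'min len chunk' concatenation itself may not need any cross-thread synchronization if it lives inside the SAX handler, since the parser drives the handler from a single thread. A rough sketch, using only the JDK's org.xml.sax types, with a plain Consumer standing in for whatever would feed the pipeline (the class and parameter names are just illustrative):

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

// Buffers SAX character events and emits them in chunks of at least
// minLength characters; everything happens on the single thread that
// drives the parser, so no locking is needed.
public class MinLengthContentHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final int minLength;
    private final Consumer<String> downstream;

    public MinLengthContentHandler(int minLength, Consumer<String> downstream) {
        this.minLength = minLength;
        this.downstream = downstream;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        if (buffer.length() >= minLength) {
            flush();
        }
    }

    @Override
    public void endDocument() {
        if (buffer.length() > 0) {
            flush(); // emit whatever is left, even if shorter than minLength
        }
    }

    private void flush() {
        downstream.accept(buffer.toString());
        buffer.setLength(0);
    }
}
```

If the flush target is the ParDo's own output, the whole thing stays on the one thread Beam gave us.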
>>>>
>>>>
>>>>> Another way to state it: currently, if I wanted to solve all of the
>>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
>>>>> API myself on the resulting ReadableFile. How can we make TikaIO
>>>>> provide a usability improvement over such usage?
>>>>>
>>>>
>>>>
>>>> If you are actually asking, does it really make sense for Beam to
>>>> ship Tika related code, given that users can just do it themselves,
>>>> I'm not sure.
>>>>
>>>> IMHO it always works better if users have to provide just a few config
>>>> options to an integral part of the framework and see things happening.
>>>> It will bring more users.
>>>>
>>>> Whether the current Tika code (refactored or not) stays with Beam or
>>>> not - I'll let you and the team decide; believe it or not I was
>>>> seriously contemplating at the last moment to make it all part of the
>>>> Tika project itself and have a bit more flexibility over there with
>>>> tweaking things, but now that it is in the Beam snapshot - I don't
>>>> know - it's not my decision...
>>>>
>>>>> I am confused by your other comment - "Does the ordering matter ?
>>>>> Perhaps
>>>>> for some cases it does, and for some it does not. Maybe it makes
>>>>> sense to support running TikaIO as both the bounded reader/source
>>>>> and ParDo, with getting the common code reused." - because using
>>>>> BoundedReader or ParDo is not related to the ordering issue, only to
>>>>> the issue of asynchronous reading and complexity of implementation.
>>>>> The resulting PCollection will be unordered either way - this needs
>>>>> to be solved separately by providing a different API.
>>>> Right, I see now, so ParDo is not about making Tika-reported data
>>>> available to the downstream pipeline components ordered, only about
>>>> the simpler implementation.
>>>> Association with the file should be possible I hope, but I understand
>>>> it would also be possible to optionally make the data come out in an
>>>> ordered way as well...
>>>>
>>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
>>>> let me double check: should we still give some thought to the
>>>> possible performance benefit of the current approach ? As I said, I
>>>> can easily get rid of all that polling code and use a simple
>>>> BlockingQueue.
>>>>
>>>> Cheers, Sergey
>>>>>
>>>>> Thanks.
>>>>>
>>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
>>>>> <sb...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> Glad TikaIO is getting some serious attention :-), I believe one thing
>>>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>>>
>>>>>> Before trying to reply online, I'd like to state that my main
>>>>>> assumption is that TikaIO (as far as the read side is concerned) is
>>>>>> no different to Text, XML or similar bounded reader components.
>>>>>>
>>>>>> I have to admit I don't understand your questions about TikaIO
>>>>>> usecases.
>>>>>>
>>>>>> What are the Text input or XML input use-cases ? These use cases
>>>>>> are Tika input cases as well, the only difference is that Tika can
>>>>>> not split an individual file into a sequence of sources, etc.
>>>>>>
>>>>>> TextIO can read from plain text files (possibly zipped), XmlIO is
>>>>>> optimized around reading from XML files, and I thought I made it
>>>>>> clear (and it is a known fact anyway) that Tika is about reading
>>>>>> basically from any file format.
>>>>>>
>>>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>>>
>>>>>> Sergey
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Replies inline.
>>>>>>>
>>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
>>>>>>> <sb...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi All
>>>>>>>>
>>>>>>>> This is my first post to the dev list, I work for Talend, I'm a
>>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
>>>>>>>> great to try and link both projects together, which led me to
>>>>>>>> opening [1] where I typed some early thoughts, followed by PR
>>>>>>>> [2].
>>>>>>>>
>>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
>>>>>>>> newer review comments from Eugene pending, so I'd like to
>>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
>>>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>>>
>>>>>>>> Apache Tika Parsers report the text content in chunks, via
>>>>>>>> SaxParser events. It's not possible with Tika to take a file and
>>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
>>>>>>>> by line, the only way is to handle the SAXParser callbacks which
>>>>>>>> report the data chunks.
>>>>>>>> Some
>>>>>>>> parsers may report complete lines, some individual words, with
>>>>>>>> some being able to report the data only after they completely
>>>>>>>> parse the document.
>>>>>>>> All depends on the data format.
>>>>>>>>
>>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
>>>>>>>> to parse the files, Beam threads will only collect the data from
>>>>>>>> the internal queue where the internal TikaReader's thread will
>>>>>>>> put the data into (note the data chunks are ordered even though
>>>>>>>> the tests might suggest otherwise).
>>>>>>>>
>>>>>>> I agree that your implementation of reader returns records in
>>>>>>> order
>>>>>>> - but
>>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
>>>>>>> the order in which records are produced by a BoundedReader - the
>>>>>>> order produced by your reader is ignored, and when applying any
>>>>>>> transforms to the
>>>>>> PCollection
>>>>>>> produced by TikaIO, it is impossible to recover the order in which
>>>>>>> your reader returned the records.
>>>>>>>
>>>>>>> With that in mind, is PCollection<String>, containing individual
>>>>>>> Tika-detected items, still the right API for representing the
>>>>>>> result of parsing a large number of documents with Tika?
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> The reason I did it was because I thought
>>>>>>>>
>>>>>>>> 1) it would make the individual data chunks available faster to
>>>>>>>> the pipeline - the parser will continue working via the
>>>>>>>> binary/video etc file while the data will already start flowing -
>>>>>>>> I agree there should be some tests data available confirming it -
>>>>>>>> but I'm positive at the moment this approach might yield some
>>>>>>>> performance gains with the large sets. If the file is large, if
>>>>>>>> it has the embedded attachments/videos to deal with, then it may
>>>>>>>> be more effective not to get the Beam thread deal with it...
>>>>>>>>
>>>>>>>> As I said on the PR, this description contains unfounded and
>>>>>>>> potentially
>>>>>>> incorrect assumptions about how Beam runners execute (or may
>>>>>>> execute in
>>>>>> the
>>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>>>> correctly,
>>>>>>> you might be assuming that:
>>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>>>> complete
>>>>>>> before processing its outputs with downstream transforms
>>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>>>> *concurrently*
>>>>>>> with downstream processing of its results
>>>>>>> - Passing an element from one thread to another using a
>>>>>>> BlockingQueue is free in terms of performance All of these are
>>>>>>> false at least in some runners, and I'm almost certain that in
>>>>>>> reality, performance of this approach is worse than a ParDo in
>>>>>> most
>>>>>>> production runners.
>>>>>>>
>>>>>>> There are other disadvantages to this approach:
>>>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>>>> invisible
>>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>>>>> profiling capabilities, or the ability to get the current stack
>>>>>>> trace for stuck elements, this approach would make the real
>>>>>>> processing invisible to all of these capabilities, and a user
>>>>>>> would only see that the bulk of the time is spent waiting for the
>>>>>>> next element, but not *why* the next
>>>>>> element
>>>>>>> is taking long to compute.
>>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>>>>> invisible to Beam, will make it harder for runners to do
>>>>>>> autoscaling, binpacking
>>>>>> and
>>>>>>> other resource management magic (how much of this runners actually
>>>>>>> do is
>>>>>> a
>>>>>>> separate issue), because the runner will have no way of knowing
>>>>>>> how much CPU/IO this particular transform is actually using - all
>>>>>>> the processing happens in a thread about which the runner is
>>>>>>> unaware.
>>>>>>> - As far as I can tell, the code also hides exceptions that happen
>>>>>>> in the Tika thread
>>>>>>> - Adding the thread management makes the code much more complex,
>>>>>>> easier
>>>>>> to
>>>>>>> introduce bugs, and harder for others to contribute
>>>>>>>
>>>>>>>
>>>>>>>> 2) As I commented at the end of [2], having an option to
>>>>>>>> concatenate the data chunks first before making them available to
>>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
>>>>>>>> introduce some synchronization issues (though not exactly sure
>>>>>>>> yet)
>>>>>>>>
>>>>>>> What are these issues?
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> One of valid concerns there is that the reader is polling the
>>>>>>>> internal queue so, in theory at least, and perhaps in some rare
>>>>>>>> cases too, we may have a case where the max polling time has been
>>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
>>>>>>>> the file data. I think that it can be solved by either 2a)
>>>>>>>> configuring the max polling time to a very large number which
>>>>>>>> will never be reached for a practical case, or
>>>>>>>> 2b) simply use a blocking queue without the time limits - in the
>>>>>>>> worst case, if TikaParser spins and fails to report the end of
>>>>>>>> the document, then, Beam can heal itself if the pipeline blocks.
>>>>>>>> I propose to follow 2b).
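To illustrate 2b) in plain JDK terms: put() and take() block indefinitely, so no chunk can be silently dropped by a poll timeout, and a sentinel object marks the end of the document. The class and method names below are illustrative only, not the actual TikaIO code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Unbounded blocking hand-off between a producer thread (the parser)
// and a consumer (the reader): no poll timeouts, so no data loss.
public class BlockingHandoff {
    // Poison pill, compared by identity, so it can never collide with real data.
    private static final String END_OF_DOCUMENT = new String("EOD");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public void produce(Iterable<String> chunks) throws InterruptedException {
        for (String chunk : chunks) {
            queue.put(chunk); // blocks if the queue is ever bounded; never drops
        }
        queue.put(END_OF_DOCUMENT);
    }

    public List<String> consumeAll() throws InterruptedException {
        List<String> out = new ArrayList<>();
        while (true) {
            String chunk = queue.take(); // blocks until the parser reports more
            if (chunk == END_OF_DOCUMENT) {
                return out;
            }
            out.add(chunk);
        }
    }
}
```

The worst case is then exactly as described: if the parser spins and never reports end-of-document, the consumer blocks rather than losing data.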
>>>>>>>>
>>>>>>> I agree that there should be no way to unintentionally configure
>>>>>>> the transform in a way that will produce silent data loss. Another
>>>>>>> reason for not having these tuning knobs is that it goes against
>>>>>>> Beam's "no knobs"
>>>>>>> philosophy, and that in most cases users have no way of figuring
>>>>>>> out a
>>>>>> good
>>>>>>> value for tuning knobs except for manual experimentation, which is
>>>>>>> extremely brittle and typically gets immediately obsoleted by
>>>>>>> running on
>>>>>> a
>>>>>>> new dataset or updating a version of some of the involved
>>>>>>> dependencies
>>>>>> etc.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Please let me know what you think.
>>>>>>>> My plan so far is:
>>>>>>>> 1) start addressing most of Eugene's comments which would require
>>>>>>>> some minor TikaIO updates
>>>>>>>> 2) work on removing the TikaSource internal code dealing with
>>>>>>>> File patterns which I copied from TextIO at the next stage
>>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
>>>>>>>> users some time to try it with some real complex files and also
>>>>>>>> decide if TikaIO can continue implemented as a
>>>>>>>> BoundedSource/Reader or not
>>>>>>>>
>>>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>>>
>>>>>>> Yes, but I think we should start by discussing the anticipated use
>>>>>>> cases
>>>>>> of
>>>>>>> TikaIO and designing an API for it based on those use cases; and
>>>>>>> then see what's the best implementation for that particular API
>>>>>>> and set of anticipated use cases.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks, Sergey
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Tim
On 21/09/17 14:33, Allison, Timothy B. wrote:
> Thank you, Sergey.
> 
> My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a chance to work with it yet.
> 
>  From my perspective, if I understand this thread (and I may not!), getting unordered text from _a given file_ is a non-starter for most applications.  The implementation needs to guarantee order per file, and the user has to be able to link the "extract" back to a unique identifier for the document.  If the current implementation doesn't do those things, we need to change it, IMHO.
> 
Right now the Tika-related reader does not associate a given text fragment 
with the file name, so a function looking at some text and trying to 
find where it came from won't be able to do so.

So I asked how to do it in Beam, how to attach some context to the given 
piece of data. I hope it can be done and if not - then perhaps some 
improvement can be applied.
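For reference, here is roughly how I imagine the ParDo variant could carry the file name along, assuming Beam's FileIO.readMatches() API; tikaToText() is purely an illustrative stand-in (no such method exists in TikaIO today), so please treat this as a sketch, not working code:

```java
// Sketch only: pairs each extract with the file it came from.
PCollection<KV<String, String>> nameAndText =
    p.apply(FileIO.match().filepattern("/data/docs/*.pdf"))
     .apply(FileIO.readMatches())
     .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
       @ProcessElement
       public void process(ProcessContext c) throws Exception {
         FileIO.ReadableFile file = c.element();
         String name = file.getMetadata().resourceId().toString();
         // tikaToText() stands in for running a Tika parser over the
         // file's contents on the worker thread.
         c.output(KV.of(name, tikaToText(file.open())));
       }
     }));
```

The KV key then travels with every extract, which is the 'attach a file name to the flow' part; a downstream function can always tell which document a piece of text came from.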

Re the unordered text - yes - this is what we currently have with Beam + 
TikaIO :-).

The use-case I referred to earlier in this thread (upload PDFs, save 
the possibly unordered text to Lucene with the file name 'attached', let 
users search for the files containing some words or phrases; this works 
OK given that I can see the PDF parser, for example, reporting whole 
lines) can be supported with the current TikaIO (provided we find a way 
to 'attach' a file name to the flow).

I can see though that supporting total ordering can be a big deal in 
other cases. Eugene, can you please explain how it can be done? Is it 
achievable in principle, without the users having to do some custom 
coding ?
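In case it helps the discussion, the only workaround I can think of myself is indeed custom coding: tag every chunk with (file, index) at parse time and re-assemble per file downstream. The bookkeeping itself is trivial in plain Java (names below are illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Each extracted chunk is tagged with its source file and its position
// within that file; with (file, index) attached, per-file order can be
// rebuilt downstream no matter how the chunks were shuffled in between.
public class OrderedChunks {

    public static final class Chunk {
        final String file;
        final int index;
        final String text;

        public Chunk(String file, int index, String text) {
            this.file = file;
            this.index = index;
            this.text = text;
        }
    }

    // Groups shuffled chunks by file and rebuilds each file's text in order.
    public static Map<String, String> reassemble(List<Chunk> shuffled) {
        Map<String, TreeMap<Integer, String>> byFile = new TreeMap<>();
        for (Chunk c : shuffled) {
            byFile.computeIfAbsent(c.file, f -> new TreeMap<>())
                  .put(c.index, c.text);
        }
        Map<String, String> out = new TreeMap<>();
        byFile.forEach((file, parts) ->
            out.put(file, String.join("", parts.values())));
        return out;
    }
}
```

Whether Beam can hide that kind of tagging behind an API, so users do not have to write it themselves, is exactly my question.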

> To the question of -- why is this in Beam at all; why don't we let users call it if they want it?...
> 
> No matter how much we do to Tika, it will behave badly sometimes -- permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks using Beam -- folks likely with large batches of unruly/noisy documents -- are more likely to run into these problems than your average couple-of-thousand-docs-from-our-own-company user. So, if there are things we can do in Beam to prevent developers around the world from having to reinvent the wheel for defenses against these problems, then I'd be enormously grateful if we could put Tika into Beam.  That means:
> 
> 1) a process-level timeout (because you can't actually kill a thread in Java)
> 2) a process-level restart on OOM
> 3) avoid trying to reprocess a badly behaving document
> 
> If Beam automatically handles those problems, then I'd say, y, let users write their own code.  If there is so much as a single configuration knob (and it sounds like Beam is against complex configuration...yay!) to get that working in Beam, then I'd say, please integrate Tika into Beam.  From a safety perspective, it is critical to keep the extraction process entirely separate (jvm, vm, m, rack, data center!) from the transformation+loading steps.  IMHO, very few devs realize this because Tika works well lots of the time...which is why it is critical for us to make it easy for people to get it right all of the time.
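Tim's point 1) is worth sketching, because the JDK makes the process-level timeout itself quite small: the parse runs in a child process, and if it hangs past the deadline the whole process is destroyed, which is the only reliable 'kill' available in Java. The command passed in is up to the caller (for example, a one-file Tika parse in its own JVM); the sample commands in the usage below assume a Unix-like system:

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

// Runs an external command (e.g. a one-file parse in its own JVM) with a
// hard wall-clock timeout. Unlike Thread.interrupt(), destroying the
// child process reliably stops a hung native or looping parse.
public class ProcessWithTimeout {
    /** Returns true if the command finished within the timeout. */
    public static boolean runWithTimeout(List<String> command, long timeoutSeconds)
            throws Exception {
        Process process = new ProcessBuilder(command).inheritIO().start();
        boolean finished = process.waitFor(timeoutSeconds, TimeUnit.SECONDS);
        if (!finished) {
            process.destroyForcibly().waitFor(); // the 'kill -9' equivalent
        }
        return finished;
    }
}
```

Point 2) falls out of the same structure for free: an OOM kills only the child JVM, and the parent simply restarts it, skipping the document that caused it (point 3).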
> 
> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch mode first in one jvm, and then I kick off another process to do transform/loading into Lucene/Solr from the .json files that Tika generates for each input file.  If I were to scale up, I'd want to maintain this complete separation of steps.
> 
> Apologies if I've derailed the conversation or misunderstood this thread.
> 
Major thanks for your input :-)

Cheers, Sergey

> Cheers,
> 
>                 Tim
> 
> -----Original Message-----
> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
> Sent: Thursday, September 21, 2017 9:07 AM
> To: dev@beam.apache.org
> Cc: Allison, Timothy B. <ta...@mitre.org>
> Subject: Re: TikaIO concerns
> 
> Hi All
> 
> Please welcome Tim, one of Apache Tika leads and practitioners.
> 
> Tim, thanks for joining in :-). If you have some great Apache Tika stories to share (preferably involving the cases where it did not really matter the ordering in which Tika-produced data were dealt with by the
> consumers) then please do so :-).
> 
> At the moment, even though Tika ContentHandler will emit the ordered data, the Beam runtime will have no guarantees that the downstream pipeline components will see the data coming in the right order.
> 
> (FYI, I understand from the earlier comments that the total ordering is also achievable but would require the extra API support)
> 
> Other comments would be welcome too
> 
> Thanks, Sergey
> 
> On 21/09/17 10:55, Sergey Beryozkin wrote:
>> I noticed that the PDF and ODT parsers actually split by lines, not
>> individual words and nearly 100% sure I saw Tika reporting individual
>> lines when it was parsing the text files. The 'min text length'
>> feature can help with reporting several lines at a time, etc...
>>
>> I'm working with this PDF all the time:
>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
>>
>> try it too if you get a chance.
>>
>> (and I can imagine not all PDFs/etc representing the 'story' but can
>> be for ex a log-like content too)
>>
>> That said, I don't know how a parser for the format N will behave, it
>> depends on the individual parsers.
>>
>> IMHO it's an equal candidate alongside Text-based bounded IOs...
>>
>> I'd like to know though how to make a file name available to the
>> pipeline which is working with the current text fragment ?
>>
>> Going to try and do some measurements and compare the sync vs async
>> parsing modes...
>>
>> Asked the Tika team to support with some more examples...
>>
>> Cheers, Sergey
>> On 20/09/17 22:17, Sergey Beryozkin wrote:
>>> Hi,
>>>
>>> thanks for the explanations,
>>>
>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>>> Hi!
>>>>
>>>> TextIO returns an unordered soup of lines contained in all files you
>>>> ask it to read. People usually use TextIO for reading files where 1
>>>> line corresponds to 1 independent data element, e.g. a log entry, or
>>>> a row of a CSV file - so discarding order is ok.
>>> Just a side note, I'd probably want that be ordered, though I guess
>>> it depends...
>>>> However, there is a number of cases where TextIO is a poor fit:
>>>> - Cases where discarding order is not ok - e.g. if you're doing
>>>> natural language processing and the text files contain actual prose,
>>>> where you need to process a file as a whole. TextIO can't do that.
>>>> - Cases where you need to remember which file each element came
>>>> from, e.g.
>>>> if you're creating a search index for the files: TextIO can't do
>>>> this either.
>>>>
>>>> Both of these issues have been raised in the past against TextIO;
>>>> however it seems that the overwhelming majority of users of TextIO
>>>> use it for logs or CSV files or alike, so solving these issues has
>>>> not been a priority.
>>>> Currently they are solved in a general form via FileIO.read() which
>>>> gives you access to reading a full file yourself - people who want
>>>> more flexibility will be able to use standard Java text-parsing
>>>> utilities on a ReadableFile, without involving TextIO.
>>>>
>>>> Same applies for XmlIO: it is specifically designed for the narrow
>>>> use case where the files contain independent data entries, so
>>>> returning an unordered soup of them, with no association to the
>>>> original file, is the user's intention. XmlIO will not work for
>>>> processing more complex XML files that are not simply a sequence of
>>>> entries with the same tag, and it also does not remember the
>>>> original filename.
>>>>
>>>
>>> OK...
>>>
>>>> However, if my understanding of Tika use cases is correct, it is
>>>> mainly used for extracting content from complex file formats - for
>>>> example, extracting text and images from PDF files or Word
>>>> documents. I believe this is the main difference between it and
>>>> TextIO - people usually use Tika for complex use cases where the
>>>> "unordered soup of stuff" abstraction is not useful.
>>>>
>>>> My suspicion about this is confirmed by the fact that the crux of
>>>> the Tika API is ContentHandler
>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
>>>> html?is-external=true
>>>>
>>>> whose
>>>> documentation says "The order of events in this interface is very
>>>> important, and mirrors the order of information in the document itself."
>>> All that says is that a (Tika) ContentHandler will be a true SAX
>>> ContentHandler...
>>>>
>>>> Let me give a few examples of what I think is possible with the raw
>>>> Tika API, but I think is not currently possible with TikaIO - please
>>>> correct me where I'm wrong, because I'm not particularly familiar
>>>> with Tika and am judging just based on what I read about it.
>>>> - User has 100,000 Word documents and wants to convert each of them
>>>> to text files for future natural language processing.
>>>> - User has 100,000 PDF files with financial statements, each
>>>> containing a bunch of unrelated text and - the main content - a list
>>>> of transactions in PDF tables. User wants to extract each
>>>> transaction as a PCollection element, discarding the unrelated text.
>>>> - User has 100,000 PDF files with scientific papers, and wants to
>>>> extract text from them, somehow parse author and affiliation from
>>>> the text, and compute statistics of topics and terminology usage by
>>>> author name and affiliation.
>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>>>> observing a location over time: they want to extract metadata from
>>>> each image using Tika, analyze the images themselves using some
>>>> other library, and detect anomalies in the overall appearance of the
>>>> location over time as seen from multiple cameras.
>>>> I believe all of these cases can not be solved with TikaIO because
>>>> the resulting PCollection<String> contains no information about
>>>> which String comes from which document and about the order in which
>>>> they appear in the document.
>>> These are good use cases, thanks... I thought what you were talking
>>> about the unordered soup of data produced by TikaIO (and its friends
>>> TextIO and alike :-)).
>>> Putting the ordered vs unordered question aside for a sec, why
>>> exactly a Tika Reader can not make the name of the file it's
>>> currently reading from available to the pipeline, as some Beam pipeline metadata piece ?
>>> Surely it can be possible with Beam ? If not then I would be surprised...
>>>
>>>>
>>>> I am, honestly, struggling to think of a case where I would want to
>>>> use Tika, but where I *would* be ok with getting an unordered soup
>>>> of strings.
>>>> So some examples would be very helpful.
>>>>
>>> Yes. I'll ask Tika developers to help with some examples, but I'll
>>> give one example where it did not matter to us in what order
>>> Tika-produced data were available to the downstream layer.
>>>
>>> It's a demo the Apache CXF colleague of mine showed at one of Apache
>>> Con NAs, and we had a happy audience:
>>>
>>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea
>>> se/samples/jax_rs/search
>>>
>>>
>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
>>> into Lucene. We associate a file name with the indexed content and
>>> then let users find a list of PDF files which contain a given word or
>>> few words, details are here
>>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea
>>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal
>>> og.java#L131
>>>
>>>
>>> I'd say even more involved search engines would not mind supporting a
>>> case like that :-)
>>>
>>> Now there we process one file at a time, and I understand now that
>>> with TikaIO and N files it's all over the place really as far as the
>>> ordering is concerned, which file it's coming from. etc. That's why
>>> TikaReader must be able to associate the file name with a given piece
>>> of text it's making available to the pipeline.
>>>
>>> I'd be happy to support the ParDo way of linking Tika with Beam.
>>> If it makes things simpler then it would be good, I've just no idea
>>> at the moment how to start the pipeline without using a
>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
>>> earlier - how can one avoid it with ParDo when implementing a 'min
>>> len chunk' feature, where the ParDo would have to concatenate several
>>> SAX data pieces first before making a single composite piece available to the pipeline ?
>>>
>>>
>>>> Another way to state it: currently, if I wanted to solve all of the
>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
>>>> API myself on the resulting ReadableFile. How can we make TikaIO
>>>> provide a usability improvement over such usage?
>>>>
>>>
>>>
>>> If you are actually asking, does it really make sense for Beam to
>>> ship Tika related code, given that users can just do it themselves,
>>> I'm not sure.
>>>
>>> IMHO it always works better if users have to provide just a few config
>>> options to an integral part of the framework and see things happening.
>>> It will bring more users.
>>>
>>> Whether the current Tika code (refactored or not) stays with Beam or
>>> not - I'll let you and the team decide; believe it or not I was
>>> seriously contemplating at the last moment to make it all part of the
>>> Tika project itself and have a bit more flexibility over there with
>>> tweaking things, but now that it is in the Beam snapshot - I don't
>>> know - it's not my decision...
>>>
>>>> I am confused by your other comment - "Does the ordering matter ?
>>>> Perhaps
>>>> for some cases it does, and for some it does not. May be it makes
>>>> sense to support running TikaIO as both the bounded reader/source
>>>> and ParDo, with getting the common code reused." - because using
>>>> BoundedReader or ParDo is not related to the ordering issue, only to
>>>> the issue of asynchronous reading and complexity of implementation.
>>>> The resulting PCollection will be unordered either way - this needs
>>>> to be solved separately by providing a different API.
>>> Right I see now, so ParDo is not about making Tika reported data
>>> available to the downstream pipeline components ordered, only about
>>> the simpler implementation.
>>> Association with the file should be possible I hope, but I understand
>>> it would be possible to optionally make the data coming out in the
>>> ordered way as well...
>>>
>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
>>> let me double check: should we still give some thought to the
>>> possible performance benefit of the current approach ? As I said, I
>>> can easily get rid of all that polling code, use a simple BlockingQueue.
>>>
>>> Cheers, Sergey
>>>>
>>>> Thanks.
>>>>
>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
>>>> <sb...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Glad TikaIO is getting some serious attention :-), I believe one thing
>>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>>
>>>>> Before trying to reply online, I'd like to state that my main
>>>>> assumption is that TikaIO (as far as the read side is concerned) is
>>>>> no different to Text, XML or similar bounded reader components.
>>>>>
>>>>> I have to admit I don't understand your questions about TikaIO
>>>>> usecases.
>>>>>
>>>>> What are the Text Input or XML input use-cases ? These use cases
>>>>> are Tika input cases as well; the only difference is Tika can not
>>>>> split the individual file into a sequence of sources, etc.
>>>>>
>>>>> TextIO can read plain text files (possibly zipped), XmlIO is
>>>>> optimized for reading XML files, and I thought I made
>>>>> it clear (and it is a known fact anyway) Tika was about reading
>>>>> basically from any file format.
>>>>>
>>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>>
>>>>> Sergey
>>>>>
>>>>>
>>>>>
>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Replies inline.
>>>>>>
>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
>>>>>> <sb...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All
>>>>>>>
>>>>>>> This is my first post to the dev list, I work for Talend, I'm a
>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
>>>>>>> great to try and link both projects together, which led me to
>>>>>>> opening [1] where I typed some early thoughts, followed by PR
>>>>>>> [2].
>>>>>>>
>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
>>>>>>> newer review comments from Eugene pending, so I'd like to
>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
>>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>>
>>>>>>> Apache Tika Parsers report the text content in chunks, via
>>>>>>> SaxParser events. It's not possible with Tika to take a file and
>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
>>>>>>> by line, the only way is to handle the SAXParser callbacks which
>>>>>>> report the data chunks.
>>>>>>> Some
>>>>>>> parsers may report complete lines, some individual words, and
>>>>>>> some are only able to report the data after they completely
>>>>>>> parse the document.
>>>>>>> All depends on the data format.
>>>>>>>
>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
>>>>>>> to parse the files, Beam threads will only collect the data from
>>>>>>> the internal queue where the internal TikaReader's thread will
>>>>>>> put the data into (note the data chunks are ordered even though
>>>>>>> the tests might suggest otherwise).
>>>>>>>
>>>>>> I agree that your implementation of reader returns records in
>>>>>> order
>>>>>> - but
>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
>>>>>> the order in which records are produced by a BoundedReader - the
>>>>>> order produced by your reader is ignored, and when applying any
>>>>>> transforms to the
>>>>> PCollection
>>>>>> produced by TikaIO, it is impossible to recover the order in which
>>>>>> your reader returned the records.
>>>>>>
>>>>>> With that in mind, is PCollection<String>, containing individual
>>>>>> Tika-detected items, still the right API for representing the
>>>>>> result of parsing a large number of documents with Tika?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> The reason I did it was because I thought
>>>>>>>
>>>>>>> 1) it would make the individual data chunks available faster to
>>>>>>> the pipeline - the parser will continue working via the
>>>>>>> binary/video etc file while the data will already start flowing -
>>>>>>> I agree there should be some tests data available confirming it -
>>>>>>> but I'm positive at the moment this approach might yield some
>>>>>>> performance gains with the large sets. If the file is large, if
>>>>>>> it has the embedded attachments/videos to deal with, then it may
>>>>>>> be more effective not to get the Beam thread deal with it...
>>>>>>>
>>>>>>> As I said on the PR, this description contains unfounded and
>>>>>>> potentially
>>>>>> incorrect assumptions about how Beam runners execute (or may
>>>>>> execute in
>>>>> the
>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>>> correctly,
>>>>>> you might be assuming that:
>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>>> complete
>>>>>> before processing its outputs with downstream transforms
>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>>> *concurrently*
>>>>>> with downstream processing of its results
>>>>>> - Passing an element from one thread to another using a
>>>>>> BlockingQueue is free in terms of performance. All of these are
>>>>>> false at least in some runners, and I'm almost certain that in
>>>>>> reality, performance of this approach is worse than a ParDo in
>>>>> most
>>>>>> production runners.
>>>>>>
>>>>>> There are other disadvantages to this approach:
>>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>>> invisible
>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>>>> profiling capabilities, or the ability to get the current stack
>>>>>> trace for stuck elements, this approach would make the real
>>>>>> processing invisible to all of these capabilities, and a user
>>>>>> would only see that the bulk of the time is spent waiting for the
>>>>>> next element, but not *why* the next
>>>>> element
>>>>>> is taking long to compute.
>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>>>> invisible to Beam, will make it harder for runners to do
>>>>>> autoscaling, binpacking
>>>>> and
>>>>>> other resource management magic (how much of this runners actually
>>>>>> do is
>>>>> a
>>>>>> separate issue), because the runner will have no way of knowing
>>>>>> how much CPU/IO this particular transform is actually using - all
>>>>>> the processing happens in a thread about which the runner is
>>>>>> unaware.
>>>>>> - As far as I can tell, the code also hides exceptions that happen
>>>>>> in the Tika thread
>>>>>> - Adding the thread management makes the code much more complex,
>>>>>> easier
>>>>> to
>>>>>> introduce bugs, and harder for others to contribute
>>>>>>
>>>>>>
>>>>>>> 2) As I commented at the end of [2], having an option to
>>>>>>> concatenate the data chunks first before making them available to
>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
>>>>>>> introduce some synchronization issues (though not exactly sure
>>>>>>> yet)
>>>>>>>
>>>>>> What are these issues?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> One of valid concerns there is that the reader is polling the
>>>>>>> internal queue so, in theory at least, and perhaps in some rare
>>>>>>> cases too, we may have a case where the max polling time has been
>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
>>>>>>> the file data. I think that it can be solved by either 2a)
>>>>>>> configuring the max polling time to a very large number which
>>>>>>> will never be reached for a practical case, or
>>>>>>> 2b) simply use a blocking queue without the time limits - in the
>>>>>>> worst case, if TikaParser spins and fails to report the end of
>>>>>>> the document, then Beam can heal itself if the pipeline blocks.
>>>>>>> I propose to follow 2b).
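[To make 2b) concrete, here is a minimal, dependency-free sketch of the hand-off between the parser thread and the reader. The class and method names are made up for illustration; this is not the actual TikaReader code. With a bounded BlockingQueue and a sentinel value there is no polling timeout to mis-tune, so no data can be silently dropped:]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of option 2b: the parser thread puts chunks into a BlockingQueue and
// finishes with a sentinel; the reader blocks on take(), so no polling timeout
// can silently drop data. Names are illustrative, not TikaIO's actual code.
final class BlockingHandoff {
    private static final String END_OF_DOCUMENT = "\u0000EOD";

    static List<String> readAll(List<String> parserChunks) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        Thread parser = new Thread(() -> {
            try {
                for (String chunk : parserChunks) {
                    queue.put(chunk);          // blocks if the reader falls behind
                }
                queue.put(END_OF_DOCUMENT);    // sentinel: parsing finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        parser.start();
        List<String> out = new ArrayList<>();
        for (String chunk = queue.take(); !chunk.equals(END_OF_DOCUMENT); chunk = queue.take()) {
            out.add(chunk);                    // blocks until the parser produces more
        }
        parser.join();
        return out;
    }
}
```

[If the parser spins and never puts the sentinel, the reader simply blocks, which is the "pipeline blocks and Beam can heal itself" behaviour described above.]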
>>>>>>>
>>>>>> I agree that there should be no way to unintentionally configure
>>>>>> the transform in a way that will produce silent data loss. Another
>>>>>> reason for not having these tuning knobs is that it goes against
>>>>>> Beam's "no knobs"
>>>>>> philosophy, and that in most cases users have no way of figuring
>>>>>> out a
>>>>> good
>>>>>> value for tuning knobs except for manual experimentation, which is
>>>>>> extremely brittle and typically gets immediately obsoleted by
>>>>>> running on
>>>>> a
>>>>>> new dataset or updating a version of some of the involved
>>>>>> dependencies
>>>>> etc.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Please let me know what you think.
>>>>>>> My plan so far is:
>>>>>>> 1) start addressing most of Eugene's comments which would require
>>>>>>> some minor TikaIO updates
>>>>>>> 2) work on removing the TikaSource internal code dealing with
>>>>>>> File patterns which I copied from TextIO at the next stage
>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
>>>>>>> users some time to try it with some real complex files and also
>>>>>>> decide if TikaIO can continue implemented as a
>>>>>>> BoundedSource/Reader or not
>>>>>>>
>>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>>
>>>>>> Yes, but I think we should start by discussing the anticipated use
>>>>>> cases
>>>>> of
>>>>>> TikaIO and designing an API for it based on those use cases; and
>>>>>> then see what's the best implementation for that particular API
>>>>>> and set of anticipated use cases.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks, Sergey
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>

Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Tim
On 21/09/17 14:33, Allison, Timothy B. wrote:
> Thank you, Sergey.
> 
> My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a chance to work with it yet.
> 
>  From my perspective, if I understand this thread (and I may not!), getting unordered text from _a given file_ is a non-starter for most applications.  The implementation needs to guarantee order per file, and the user has to be able to link the "extract" back to a unique identifier for the document.  If the current implementation doesn't do those things, we need to change it, IMHO.
> 
Right now the Tika-related reader does not associate a given text
fragment with the file name, so a function looking at some text and
trying to find which file it came from won't be able to do so.

So I asked how to do it in Beam, how to attach some context to the given 
piece of data. I hope it can be done and if not - then perhaps some 
improvement can be applied.
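[For illustration only: the association could be as small as carrying a (file name, text) pair through the pipeline instead of a bare String. The type below is a made-up sketch of such an element, not the current TikaIO API:]

```java
// Hypothetical element type: pairs a source file name with one extracted
// text chunk. A sketch of what TikaIO could emit instead of bare Strings;
// this is not TikaIO's actual API.
final class FileTextChunk {
    private final String fileName;
    private final String text;

    FileTextChunk(String fileName, String text) {
        this.fileName = fileName;
        this.text = text;
    }

    String fileName() { return fileName; }
    String text() { return text; }

    @Override
    public String toString() {
        return fileName + ": " + text;
    }
}
```

[A collection of such pairs would let a downstream function recover the source file of every fragment, even if the fragments themselves arrive unordered.]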

Re the unordered text - yes - this is what we currently have with Beam + 
TikaIO :-).

The use-case I referred to earlier in this thread (upload PDFs, save
the possibly unordered text to Lucene with the file name 'attached',
let users search for the files containing some words or phrases - this
works OK given that I can see the PDF parser, for example, reporting
whole lines) can be supported with the current TikaIO (provided we
find a way to 'attach' a file name to the flow).

I can see though that supporting total ordering may be a big deal in other
cases. Eugene, can you please explain how it can be done, is it
achievable in principle, without the users having to do some custom 
coding ?

> To the question of -- why is this in Beam at all; why don't we let users call it if they want it?...
> 
> No matter how much we do to Tika, it will behave badly sometimes -- permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks using Beam -- folks likely with large batches of unruly/noisy documents -- are more likely to run into these problems than your average couple-of-thousand-docs-from-our-own-company user. So, if there are things we can do in Beam to prevent developers around the world from having to reinvent the wheel for defenses against these problems, then I'd be enormously grateful if we could put Tika into Beam.  That means:
> 
> 1) a process-level timeout (because you can't actually kill a thread in Java)
> 2) a process-level restart on OOM
> 3) avoid trying to reprocess a badly behaving document
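[For what it's worth, safeguard 1) can be sketched with just the JDK: run the extraction in a child process and kill the whole process if it exceeds a deadline, which a thread-based timeout cannot guarantee. The command line is purely for illustration (a POSIX `sleep` is assumed in the example), not TikaIO code:]

```java
import java.util.concurrent.TimeUnit;

// Sketch of a process-level timeout: run the command in a child process and
// destroy the whole process if it does not finish within the deadline.
// The helper name and the commands used are illustrative only.
final class ProcessTimeout {
    /** Runs the command; returns true if it finished in time, false if it was killed. */
    static boolean runWithTimeout(long timeoutSeconds, String... command) throws Exception {
        Process p = new ProcessBuilder(command).start();
        if (p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            return true;                // finished normally within the deadline
        }
        p.destroyForcibly();            // the process-level equivalent of kill -9
        p.waitFor();                    // reap the killed child
        return false;                   // caller can then skip the bad document (safeguard 3)
    }
}
```

[The same pattern extends to safeguard 2): if the child dies with an OOM, only the extraction process is lost and the parent can restart it or move on.]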
> 
> If Beam automatically handles those problems, then I'd say, y, let users write their own code.  If there is so much as a single configuration knob (and it sounds like Beam is against complex configuration...yay!) to get that working in Beam, then I'd say, please integrate Tika into Beam.  From a safety perspective, it is critical to keep the extraction process entirely separate (jvm, vm, m, rack, data center!) from the transformation+loading steps.  IMHO, very few devs realize this because Tika works well lots of the time...which is why it is critical for us to make it easy for people to get it right all of the time.
> 
> Even in my desktop (gah, y, desktop!) search app, I run Tika in batch mode first in one jvm, and then I kick off another process to do transform/loading into Lucene/Solr from the .json files that Tika generates for each input file.  If I were to scale up, I'd want to maintain this complete separation of steps.
> 
> Apologies if I've derailed the conversation or misunderstood this thread.
> 
Major thanks for your input :-)

Cheers, Sergey

> Cheers,
> 
>                 Tim
> 
> -----Original Message-----
> From: Sergey Beryozkin [mailto:sberyozkin@gmail.com]
> Sent: Thursday, September 21, 2017 9:07 AM
> To: dev@beam.apache.org
> Cc: Allison, Timothy B. <ta...@mitre.org>
> Subject: Re: TikaIO concerns
> 
> Hi All
> 
> Please welcome Tim, one of Apache Tika leads and practitioners.
> 
> Tim, thanks for joining in :-). If you have some great Apache Tika stories to share (preferably involving cases where the order in which Tika-produced data were dealt with by the
> consumers did not really matter) then please do so :-).
> 
> At the moment, even though Tika ContentHandler will emit the ordered data, the Beam runtime will have no guarantees that the downstream pipeline components will see the data coming in the right order.
> 
> (FYI, I understand from the earlier comments that the total ordering is also achievable but would require the extra API support)
> 
> Other comments would be welcome too
> 
> Thanks, Sergey
> 
> On 21/09/17 10:55, Sergey Beryozkin wrote:
>> I noticed that the PDF and ODT parsers actually split by lines, not
>> individual words, and I'm nearly 100% sure I saw Tika reporting individual
>> lines when it was parsing the text files. The 'min text length'
>> feature can help with reporting several lines at a time, etc...
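A minimal sketch of that 'min text length' idea: buffer the pieces the parser reports until they add up to a threshold, then emit one composite chunk. The class and method names are made up; this is not TikaIO's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a 'min text length' feature: accumulate SAX-reported text pieces
// until at least minLen characters are buffered, then emit one composite
// chunk. Names are illustrative, not TikaIO's actual API.
final class MinLengthConcatenator {
    private final int minLen;
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> chunks = new ArrayList<>();

    MinLengthConcatenator(int minLen) { this.minLen = minLen; }

    void onTextPiece(String piece) {       // called from the parser callback
        buffer.append(piece);
        if (buffer.length() >= minLen) {
            chunks.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    void endOfDocument() {                 // flush whatever is left over
        if (buffer.length() > 0) {
            chunks.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    List<String> chunks() { return chunks; }
}
```

Since the callbacks for one document come from a single parser thread, no extra synchronization is needed inside this buffer itself.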
>>
>> I'm working with this PDF all the time:
>> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
>>
>> try it too if you get a chance.
>>
>> (and I can imagine not all PDFs/etc representing the 'story' but can
>> be for ex a log-like content too)
>>
>> That said, I don't know how a parser for the format N will behave, it
>> depends on the individual parsers.
>>
>> IMHO it's an equal candidate alongside Text-based bounded IOs...
>>
>> I'd like to know though how to make a file name available to the
>> pipeline which is working with the current text fragment ?
>>
>> Going to try and do some measurements and compare the sync vs async
>> parsing modes...
>>
>> Asked the Tika team to support with some more examples...
>>
>> Cheers, Sergey
>> On 20/09/17 22:17, Sergey Beryozkin wrote:
>>> Hi,
>>>
>>> thanks for the explanations,
>>>
>>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>>> Hi!
>>>>
>>>> TextIO returns an unordered soup of lines contained in all files you
>>>> ask it to read. People usually use TextIO for reading files where 1
>>>> line corresponds to 1 independent data element, e.g. a log entry, or
>>>> a row of a CSV file - so discarding order is ok.
>>> Just a side note, I'd probably want that to be ordered, though I guess
>>> it depends...
>>>> However, there is a number of cases where TextIO is a poor fit:
>>>> - Cases where discarding order is not ok - e.g. if you're doing
>>>> natural language processing and the text files contain actual prose,
>>>> where you need to process a file as a whole. TextIO can't do that.
>>>> - Cases where you need to remember which file each element came
>>>> from, e.g.
>>>> if you're creating a search index for the files: TextIO can't do
>>>> this either.
>>>>
>>>> Both of these issues have been raised in the past against TextIO;
>>>> however it seems that the overwhelming majority of users of TextIO
>>>> use it for logs or CSV files or alike, so solving these issues has
>>>> not been a priority.
>>>> Currently they are solved in a general form via FileIO.read() which
>>>> gives you access to reading a full file yourself - people who want
>>>> more flexibility will be able to use standard Java text-parsing
>>>> utilities on a ReadableFile, without involving TextIO.
>>>>
>>>> Same applies for XmlIO: it is specifically designed for the narrow
>>>> use case where the files contain independent data entries, so
>>>> returning an unordered soup of them, with no association to the
>>>> original file, is the user's intention. XmlIO will not work for
>>>> processing more complex XML files that are not simply a sequence of
>>>> entries with the same tag, and it also does not remember the
>>>> original filename.
>>>>
>>>
>>> OK...
>>>
>>>> However, if my understanding of Tika use cases is correct, it is
>>>> mainly used for extracting content from complex file formats - for
>>>> example, extracting text and images from PDF files or Word
>>>> documents. I believe this is the main difference between it and
>>>> TextIO - people usually use Tika for complex use cases where the
>>>> "unordered soup of stuff" abstraction is not useful.
>>>>
>>>> My suspicion about this is confirmed by the fact that the crux of
>>>> the Tika API is ContentHandler
>>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
>>>> html?is-external=true
>>>>
>>>> whose
>>>> documentation says "The order of events in this interface is very
>>>> important, and mirrors the order of information in the document itself."
>>> All that says is that a (Tika) ContentHandler will be a true SAX
>>> ContentHandler...
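[To make that concrete: a plain JDK SAX parse (not Tika, and the demo class name is made up) shows that characters() callbacks arrive in document order, in chunks of the parser's choosing, at the parser's initiative; the consumer cannot pull "line by line":]

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Minimal demonstration (JDK SAX, not Tika) that a ContentHandler receives
// text via characters() callbacks in document order. The parser decides how
// the text is chunked, so we assert on the concatenation, not on chunk count.
final class SaxOrderDemo {
    static List<String> textEvents(String xml) throws Exception {
        List<String> pieces = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                pieces.add(new String(ch, start, length));
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return pieces;
    }
}
```

[Tika's parsers drive a ContentHandler the same way, just for PDF/ODT/etc. instead of XML.]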
>>>>
>>>> Let me give a few examples of what I think is possible with the raw
>>>> Tika API, but I think is not currently possible with TikaIO - please
>>>> correct me where I'm wrong, because I'm not particularly familiar
>>>> with Tika and am judging just based on what I read about it.
>>>> - User has 100,000 Word documents and wants to convert each of them
>>>> to text files for future natural language processing.
>>>> - User has 100,000 PDF files with financial statements, each
>>>> containing a bunch of unrelated text and - the main content - a list
>>>> of transactions in PDF tables. User wants to extract each
>>>> transaction as a PCollection element, discarding the unrelated text.
>>>> - User has 100,000 PDF files with scientific papers, and wants to
>>>> extract text from them, somehow parse author and affiliation from
>>>> the text, and compute statistics of topics and terminology usage by
>>>> author name and affiliation.
>>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>>>> observing a location over time: they want to extract metadata from
>>>> each image using Tika, analyze the images themselves using some
>>>> other library, and detect anomalies in the overall appearance of the
>>>> location over time as seen from multiple cameras.
>>>> I believe all of these cases can not be solved with TikaIO because
>>>> the resulting PCollection<String> contains no information about
>>>> which String comes from which document and about the order in which
>>>> they appear in the document.
>>> These are good use cases, thanks... I thought you were talking
>>> about the unordered soup of data produced by TikaIO (and its friends
>>> TextIO and the like :-)).
>>> Putting the ordered vs unordered question aside for a sec, why
>>> exactly a Tika Reader can not make the name of the file it's
>>> currently reading from available to the pipeline, as some Beam pipeline metadata piece ?
>>> Surely it can be possible with Beam ? If not then I would be surprised...
>>>
>>>>
>>>> I am, honestly, struggling to think of a case where I would want to
>>>> use Tika, but where I *would* be ok with getting an unordered soup
>>>> of strings.
>>>> So some examples would be very helpful.
>>>>
>>> Yes. I'll ask Tika developers to help with some examples, but I'll
>>> give one example where it did not matter to us in what order
>>> Tika-produced data were available to the downstream layer.
>>>
>>> It's a demo an Apache CXF colleague of mine showed at one of the
>>> ApacheCon NAs, and we had a happy audience:
>>>
>>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea
>>> se/samples/jax_rs/search
>>>
>>>
>>> PDF or ODT files uploaded, Tika parses them, and all of that is put
>>> into Lucene. We associate a file name with the indexed content and
>>> then let users find a list of PDF files which contain a given word or
>>> few words, details are here
>>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea
>>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal
>>> og.java#L131
>>>
>>>
>>> I'd say even more involved search engines would not mind supporting a
>>> case like that :-)
>>>
>>> Now there we process one file at a time, and I understand now that
>>> with TikaIO and N files it's all over the place really as far as the
>>> ordering is concerned, which file it's coming from, etc. That's why
>>> TikaReader must be able to associate the file name with a given piece
>>> of text it's making available to the pipeline.
>>>
>>> I'd be happy to support the ParDo way of linking Tika with Beam.
>>> If it makes things simpler then it would be good, I've just no idea
>>> at the moment how to start the pipeline without using a
>>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned
>>> earlier - how can one avoid it with ParDo when implementing a 'min
>>> len chunk' feature, where the ParDo would have to concatenate several
>>> SAX data pieces first before making a single composite piece available to the pipeline ?
>>>
>>>
>>>> Another way to state it: currently, if I wanted to solve all of the
>>>> use cases above, I'd just use FileIO.readMatches() and use the Tika
>>>> API myself on the resulting ReadableFile. How can we make TikaIO
>>>> provide a usability improvement over such usage?
>>>>
>>>
>>>
>>> If you are actually asking, does it really make sense for Beam to
>>> ship Tika related code, given that users can just do it themselves,
>>> I'm not sure.
>>>
>>> IMHO it always works better if users have to provide just a few config
>>> options to an integral part of the framework and see things happening.
>>> It will bring more users.
>>>
>>> Whether the current Tika code (refactored or not) stays with Beam or
>>> not - I'll let you and the team decide; believe it or not I was
>>> seriously contemplating at the last moment to make it all part of the
>>> Tika project itself and have a bit more flexibility over there with
>>> tweaking things, but now that it is in the Beam snapshot - I don't
>>> know - it's not my decision...
>>>
>>>> I am confused by your other comment - "Does the ordering matter ?
>>>> Perhaps
>>>> for some cases it does, and for some it does not. May be it makes
>>>> sense to support running TikaIO as both the bounded reader/source
>>>> and ParDo, with getting the common code reused." - because using
>>>> BoundedReader or ParDo is not related to the ordering issue, only to
>>>> the issue of asynchronous reading and complexity of implementation.
>>>> The resulting PCollection will be unordered either way - this needs
>>>> to be solved separately by providing a different API.
>>> Right I see now, so ParDo is not about making Tika reported data
>>> available to the downstream pipeline components ordered, only about
>>> the simpler implementation.
>>> Association with the file should be possible I hope, but I understand
>>> it would be possible to optionally make the data coming out in the
>>> ordered way as well...
>>>
>>> Assuming TikaIO stays, and before trying to re-implement as ParDo,
>>> let me double check: should we still give some thought to the
>>> possible performance benefit of the current approach ? As I said, I
>>> can easily get rid of all that polling code, use a simple BlockingQueue.
>>>
>>> Cheers, Sergey
>>>>
>>>> Thanks.
>>>>
>>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin
>>>> <sb...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Glad TikaIO is getting some serious attention :-), I believe one thing
>>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>>
>>>>> Before trying to reply online, I'd like to state that my main
>>>>> assumption is that TikaIO (as far as the read side is concerned) is
>>>>> no different to Text, XML or similar bounded reader components.
>>>>>
>>>>> I have to admit I don't understand your questions about TikaIO
>>>>> usecases.
>>>>>
>>>>> What are the Text Input or XML input use-cases ? These use cases
>>>>> are Tika input cases as well; the only difference is Tika can not
>>>>> split the individual file into a sequence of sources, etc.
>>>>>
>>>>> TextIO can read plain text files (possibly zipped), XmlIO is
>>>>> optimized for reading XML files, and I thought I made
>>>>> it clear (and it is a known fact anyway) Tika was about reading
>>>>> basically from any file format.
>>>>>
>>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>>
>>>>> Sergey
>>>>>
>>>>>
>>>>>
>>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Replies inline.
>>>>>>
>>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin
>>>>>> <sb...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All
>>>>>>>
>>>>>>> This is my first post to the dev list, I work for Talend, I'm a
>>>>>>> Beam novice, Apache Tika fan, and thought it would be really
>>>>>>> great to try and link both projects together, which led me to
>>>>>>> opening [1] where I typed some early thoughts, followed by PR
>>>>>>> [2].
>>>>>>>
>>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful)
>>>>>>> newer review comments from Eugene pending, so I'd like to
>>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then
>>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>>
>>>>>>> Apache Tika Parsers report the text content in chunks, via
>>>>>>> SaxParser events. It's not possible with Tika to take a file and
>>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line
>>>>>>> by line, the only way is to handle the SAXParser callbacks which
>>>>>>> report the data chunks.
>>>>>>> Some
>>>>>>> parsers may report complete lines, some individual words, and
>>>>>>> some are only able to report the data after they completely
>>>>>>> parse the document.
>>>>>>> All depends on the data format.
>>>>>>>
>>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads
>>>>>>> to parse the files, Beam threads will only collect the data from
>>>>>>> the internal queue where the internal TikaReader's thread will
>>>>>>> put the data into (note the data chunks are ordered even though
>>>>>>> the tests might suggest otherwise).
>>>>>>>
>>>>>> I agree that your implementation of reader returns records in
>>>>>> order
>>>>>> - but
>>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about
>>>>>> the order in which records are produced by a BoundedReader - the
>>>>>> order produced by your reader is ignored, and when applying any
>>>>>> transforms to the
>>>>> PCollection
>>>>>> produced by TikaIO, it is impossible to recover the order in which
>>>>>> your reader returned the records.
>>>>>>
>>>>>> With that in mind, is PCollection<String>, containing individual
>>>>>> Tika-detected items, still the right API for representing the
>>>>>> result of parsing a large number of documents with Tika?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> The reason I did it was because I thought
>>>>>>>
>>>>>>> 1) it would make the individual data chunks available faster to
>>>>>>> the pipeline - the parser will continue working via the
>>>>>>> binary/video etc file while the data will already start flowing -
>>>>>>> I agree there should be some tests data available confirming it -
>>>>>>> but I'm positive at the moment this approach might yield some
>>>>>>> performance gains with the large sets. If the file is large, if
>>>>>>> it has the embedded attachments/videos to deal with, then it may
>>>>>>> be more effective not to get the Beam thread deal with it...
>>>>>>>
>>>>>>> As I said on the PR, this description contains unfounded and
>>>>>>> potentially
>>>>>> incorrect assumptions about how Beam runners execute (or may
>>>>>> execute in
>>>>> the
>>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>>> correctly,
>>>>>> you might be assuming that:
>>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>>> complete
>>>>>> before processing its outputs with downstream transforms
>>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>>> *concurrently*
>>>>>> with downstream processing of its results
>>>>>> - Passing an element from one thread to another using a
>>>>>> BlockingQueue is free in terms of performance. All of these are
>>>>>> false at least in some runners, and I'm almost certain that in
>>>>>> reality, performance of this approach is worse than a ParDo in
>>>>> most
>>>>>> production runners.
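[Editor's note: to make the hand-off cost discussed above concrete, here is a minimal JDK-only sketch (no Beam involved; all names are illustrative) contrasting producing elements directly in the calling thread with handing them off through a BlockingQueue from a producer thread. The elements come out the same either way; the queue version just adds an extra thread plus per-element blocking that the framework cannot observe.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch: emitting elements directly in the calling thread
// vs. handing them off through a BlockingQueue from a producer thread.
// Both yield the same elements in the same order; the queue adds an
// extra thread and per-element synchronization.
public class HandoffSketch {
  private static final String POISON = "\u0000EOF"; // end-of-input sentinel

  static List<String> direct(List<String> chunks) {
    return new ArrayList<>(chunks); // downstream consumes in the calling thread
  }

  static List<String> viaQueue(List<String> chunks) {
    BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
    Thread producer = new Thread(() -> {
      try {
        for (String c : chunks) {
          queue.put(c);
        }
        queue.put(POISON); // mark end of input
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    producer.start();
    List<String> out = new ArrayList<>();
    try {
      String c;
      while (!(c = queue.take()).equals(POISON)) {
        out.add(c);
      }
      producer.join();
    } catch (InterruptedException e) {
      throw new RuntimeException(e);
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> chunks = List.of("chunk1", "chunk2", "chunk3");
    System.out.println(direct(chunks).equals(viaQueue(chunks)));
  }
}
```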
>>>>>>
>>>>>> There are other disadvantages to this approach:
>>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>>> invisible
>>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>>>> profiling capabilities, or the ability to get the current stack
>>>>>> trace for stuck elements, this approach would make the real
>>>>>> processing invisible to all of these capabilities, and a user
>>>>>> would only see that the bulk of the time is spent waiting for the
>>>>>> next element, but not *why* the next
>>>>> element
>>>>>> is taking long to compute.
>>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>>>> invisible to Beam, will make it harder for runners to do
>>>>>> autoscaling, binpacking
>>>>> and
>>>>>> other resource management magic (how much of this runners actually
>>>>>> do is
>>>>> a
>>>>>> separate issue), because the runner will have no way of knowing
>>>>>> how much CPU/IO this particular transform is actually using - all
>>>>>> the processing happens in a thread about which the runner is
>>>>>> unaware.
>>>>>> - As far as I can tell, the code also hides exceptions that happen
>>>>>> in the Tika thread
>>>>>> - Adding the thread management makes the code much more complex,
>>>>>> easier
>>>>> to
>>>>>> introduce bugs, and harder for others to contribute
>>>>>>
>>>>>>
>>>>>>> 2) As I commented at the end of [2], having an option to
>>>>>>> concatenate the data chunks first before making them available to
>>>>>>> the pipeline is useful, and I guess doing the same in ParDo would
>>>>>>> introduce some synchronization issues (though not exactly sure
>>>>>>> yet)
>>>>>>>
>>>>>> What are these issues?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> One of valid concerns there is that the reader is polling the
>>>>>>> internal queue so, in theory at least, and perhaps in some rare
>>>>>>> cases too, we may have a case where the max polling time has been
>>>>>>> reached, the parser is still busy, and TikaIO fails to report all
>>>>>>> the file data. I think that it can be solved by either 2a)
>>>>>>> configuring the max polling time to a very large number which
>>>>>>> will never be reached for a practical case, or
>>>>>>> 2b) simply use a blocking queue without the time limits - in the
>>>>>>> worst case, if TikaParser spins and fails to report the end of
>>>>>>> the document, then Beam can heal itself if the pipeline blocks.
>>>>>>> I propose to follow 2b).
>>>>>>>
>>>>>> I agree that there should be no way to unintentionally configure
>>>>>> the transform in a way that will produce silent data loss. Another
>>>>>> reason for not having these tuning knobs is that it goes against
>>>>>> Beam's "no knobs"
>>>>>> philosophy, and that in most cases users have no way of figuring
>>>>>> out a
>>>>> good
>>>>>> value for tuning knobs except for manual experimentation, which is
>>>>>> extremely brittle and typically gets immediately obsoleted by
>>>>>> running on
>>>>> a
>>>>>> new dataset or updating a version of some of the involved
>>>>>> dependencies
>>>>> etc.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Please let me know what you think.
>>>>>>> My plan so far is:
>>>>>>> 1) start addressing most of Eugene's comments which would require
>>>>>>> some minor TikaIO updates
>>>>>>> 2) work on removing the TikaSource internal code dealing with
>>>>>>> File patterns which I copied from TextIO at the next stage
>>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam
>>>>>>> users some time to try it with some real complex files and also
>>>>>>> decide if TikaIO can continue to be implemented as a
>>>>>>> BoundedSource/Reader or not
>>>>>>>
>>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>>
>>>>>> Yes, but I think we should start by discussing the anticipated use
>>>>>> cases
>>>>> of
>>>>>> TikaIO and designing an API for it based on those use cases; and
>>>>>> then see what's the best implementation for that particular API
>>>>>> and set of anticipated use cases.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks, Sergey
>>>>>>>
>>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>

RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
@Eugene: What's the best way to have Beam help us with these issues, or do these come for free with the Beam framework? 

1) a process-level timeout (because you can't actually kill a thread in Java)
2) a process-level restart on OOM
3) avoid trying to reprocess a badly behaving document


RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you, Sergey.

My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a chance to work with it yet.

From my perspective, if I understand this thread (and I may not!), getting unordered text from _a given file_ is a non-starter for most applications.  The implementation needs to guarantee order per file, and the user has to be able to link the "extract" back to a unique identifier for the document.  If the current implementation doesn't do those things, we need to change it, IMHO.

To the question of -- why is this in Beam at all; why don't we let users call it if they want it?... 

No matter how much we do to Tika, it will behave badly sometimes -- permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks using Beam -- folks likely with large batches of unruly/noisy documents -- are more likely to run into these problems than your average couple-of-thousand-docs-from-our-own-company user. So, if there are things we can do in Beam to prevent developers around the world from having to reinvent the wheel for defenses against these problems, then I'd be enormously grateful if we could put Tika into Beam.  That means: 

1) a process-level timeout (because you can't actually kill a thread in Java)
2) a process-level restart on OOM
3) avoid trying to reprocess a badly behaving document

If Beam automatically handles those problems, then I'd say, y, let users write their own code.  If there is so much as a single configuration knob (and it sounds like Beam is against complex configuration...yay!) to get that working in Beam, then I'd say, please integrate Tika into Beam.  From a safety perspective, it is critical to keep the extraction process entirely separate (jvm, vm, m, rack, data center!) from the transformation+loading steps.  IMHO, very few devs realize this because Tika works well lots of the time...which is why it is critical for us to make it easy for people to get it right all of the time.

Even in my desktop (gah, y, desktop!) search app, I run Tika in batch mode first in one jvm, and then I kick off another process to do transform/loading into Lucene/Solr from the .json files that Tika generates for each input file.  If I were to scale up, I'd want to maintain this complete separation of steps.
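[Editor's note: for readers wondering what the process-level timeout Tim asks for looks like in practice, here is a JDK-only sketch of the isolation pattern he describes: run the parser in a child JVM and force-kill it on a hang. The tika-app invocation shown in the comment is hypothetical.]

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Sketch of process-level isolation: run the parse in a child JVM so a
// permanent hang or OOM can be handled by killing and restarting the
// process, and the offending document can be skipped rather than retried.
public class ProcessTimeoutSketch {

  /** Runs a command, force-killing the child if it exceeds timeoutSeconds.
      Returns the exit code, or -1 on timeout. */
  static int runWithTimeout(long timeoutSeconds, String... command) {
    try {
      Process p = new ProcessBuilder(command).inheritIO().start();
      if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
        p.destroyForcibly(); // the moral equivalent of kill -9
        p.waitFor();         // reap the killed child
        return -1;           // caller can skip or quarantine the document
      }
      return p.exitValue();
    } catch (IOException | InterruptedException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    // Hypothetical invocation: parse one document in its own JVM, e.g.
    // runWithTimeout(60, "java", "-Xmx512m", "-jar", "tika-app.jar", "--text", "doc.pdf");
    System.out.println(runWithTimeout(10, "java", "-version"));
  }
}
```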

Apologies if I've derailed the conversation or misunderstood this thread.

Cheers,

               Tim

-----Original Message-----
From: Sergey Beryozkin [mailto:sberyozkin@gmail.com] 
Sent: Thursday, September 21, 2017 9:07 AM
To: dev@beam.apache.org
Cc: Allison, Timothy B. <ta...@mitre.org>
Subject: Re: TikaIO concerns

Hi All

Please welcome Tim, one of Apache Tika leads and practitioners.

Tim, thanks for joining in :-). If you have some great Apache Tika stories to share (preferably involving cases where the order in which Tika-produced data were dealt with by the
consumers did not really matter) then please do so :-).

At the moment, even though Tika ContentHandler will emit the ordered data, the Beam runtime will have no guarantees that the downstream pipeline components will see the data coming in the right order.

(FYI, I understand from the earlier comments that the total ordering is also achievable but would require the extra API support)

Other comments would be welcome too

Thanks, Sergey

On 21/09/17 10:55, Sergey Beryozkin wrote:
> I noticed that the PDF and ODT parsers actually split by lines, not 
> individual words, and I'm nearly 100% sure I saw Tika reporting individual 
> lines when it was parsing the text files. The 'min text length' 
> feature can help with reporting several lines at a time, etc...
> 
> I'm working with this PDF all the time:
> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
> 
> try it too if you get a chance.
> 
> (and I can imagine not all PDFs/etc representing the 'story' but can 
> be for ex a log-like content too)
> 
> That said, I don't know how a parser for the format N will behave, it 
> depends on the individual parsers.
> 
> IMHO it's an equal candidate alongside Text-based bounded IOs...
> 
> I'd like to know though how to make a file name available to the 
> pipeline which is working with the current text fragment ?
> 
> Going to try and do some measurements and compare the sync vs async 
> parsing modes...
> 
> Asked the Tika team to support with some more examples...
> 
> Cheers, Sergey
> On 20/09/17 22:17, Sergey Beryozkin wrote:
>> Hi,
>>
>> thanks for the explanations,
>>
>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>> Hi!
>>>
>>> TextIO returns an unordered soup of lines contained in all files you 
>>> ask it to read. People usually use TextIO for reading files where 1 
>>> line corresponds to 1 independent data element, e.g. a log entry, or 
>>> a row of a CSV file - so discarding order is ok.
>> Just a side note, I'd probably want that to be ordered, though I guess 
>> it depends...
>>> However, there is a number of cases where TextIO is a poor fit:
>>> - Cases where discarding order is not ok - e.g. if you're doing 
>>> natural language processing and the text files contain actual prose, 
>>> where you need to process a file as a whole. TextIO can't do that.
>>> - Cases where you need to remember which file each element came 
>>> from, e.g.
>>> if you're creating a search index for the files: TextIO can't do 
>>> this either.
>>>
>>> Both of these issues have been raised in the past against TextIO; 
>>> however it seems that the overwhelming majority of users of TextIO 
>>> use it for logs or CSV files or alike, so solving these issues has 
>>> not been a priority.
>>> Currently they are solved in a general form via FileIO.read() which 
>>> gives you access to reading a full file yourself - people who want 
>>> more flexibility will be able to use standard Java text-parsing 
>>> utilities on a ReadableFile, without involving TextIO.
>>>
>>> Same applies for XmlIO: it is specifically designed for the narrow 
>>> use case where the files contain independent data entries, so 
>>> returning an unordered soup of them, with no association to the 
>>> original file, is the user's intention. XmlIO will not work for 
>>> processing more complex XML files that are not simply a sequence of 
>>> entries with the same tag, and it also does not remember the 
>>> original filename.
>>>
>>
>> OK...
>>
>>> However, if my understanding of Tika use cases is correct, it is 
>>> mainly used for extracting content from complex file formats - for 
>>> example, extracting text and images from PDF files or Word 
>>> documents. I believe this is the main difference between it and 
>>> TextIO - people usually use Tika for complex use cases where the 
>>> "unordered soup of stuff" abstraction is not useful.
>>>
>>> My suspicion about this is confirmed by the fact that the crux of 
>>> the Tika API is ContentHandler 
>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
>>> html?is-external=true
>>>
>>> whose
>>> documentation says "The order of events in this interface is very 
>>> important, and mirrors the order of information in the document itself."
>> All that says is that a (Tika) ContentHandler will be a true SAX 
>> ContentHandler...
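[Editor's note: the SAX ordering contract being discussed can be seen with nothing but the JDK. The sketch below uses plain JAXP on a tiny XML string — no Tika — to show that characters() callbacks arrive in document order, which is the same contract Tika's parsers honor when driving a ContentHandler.]

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// JDK-only sketch: collect characters() callbacks from a SAX parse and
// observe that they mirror the order of the text in the document.
public class SaxOrderSketch {

  static List<String> textChunks(String xml) {
    List<String> chunks = new ArrayList<>();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void characters(char[] ch, int start, int length) {
        chunks.add(new String(ch, start, length).trim());
      }
    };
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    chunks.removeIf(String::isEmpty); // drop whitespace-only events
    return chunks;
  }

  public static void main(String[] args) {
    System.out.println(textChunks("<doc><p>first</p><p>second</p><p>third</p></doc>"));
  }
}
```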
>>>
>>> Let me give a few examples of what I think is possible with the raw 
>>> Tika API, but I think is not currently possible with TikaIO - please 
>>> correct me where I'm wrong, because I'm not particularly familiar 
>>> with Tika and am judging just based on what I read about it.
>>> - User has 100,000 Word documents and wants to convert each of them 
>>> to text files for future natural language processing.
>>> - User has 100,000 PDF files with financial statements, each 
>>> containing a bunch of unrelated text and - the main content - a list 
>>> of transactions in PDF tables. User wants to extract each 
>>> transaction as a PCollection element, discarding the unrelated text.
>>> - User has 100,000 PDF files with scientific papers, and wants to 
>>> extract text from them, somehow parse author and affiliation from 
>>> the text, and compute statistics of topics and terminology usage by 
>>> author name and affiliation.
>>> - User has 100,000 photos in JPEG made by a set of automatic cameras 
>>> observing a location over time: they want to extract metadata from 
>>> each image using Tika, analyze the images themselves using some 
>>> other library, and detect anomalies in the overall appearance of the 
>>> location over time as seen from multiple cameras.
>>> I believe all of these cases can not be solved with TikaIO because 
>>> the resulting PCollection<String> contains no information about 
>>> which String comes from which document and about the order in which 
>>> they appear in the document.
>> These are good use cases, thanks... I thought what you were talking 
>> about the unordered soup of data produced by TikaIO (and its friends 
>> TextIO and alike :-)).
>> Putting the ordered vs unordered question aside for a sec, why 
>> exactly can a Tika Reader not make the name of the file it's 
>> currently reading from available to the pipeline, as a piece of Beam pipeline metadata ?
>> Surely it must be possible with Beam ? If not then I would be surprised...
>>
>>>
>>> I am, honestly, struggling to think of a case where I would want to 
>>> use Tika, but where I *would* be ok with getting an unordered soup 
>>> of strings.
>>> So some examples would be very helpful.
>>>
>> Yes. I'll ask Tika developers to help with some examples, but I'll 
>> give one example where it did not matter to us in what order 
>> Tika-produced data were available to the downstream layer.
>>
>> It's a demo the Apache CXF colleague of mine showed at one of Apache 
>> Con NAs, and we had a happy audience:
>>
>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea
>> se/samples/jax_rs/search
>>
>>
>> PDF or ODT files uploaded, Tika parses them, and all of that is put 
>> into Lucene. We associate a file name with the indexed content and 
>> then let users find a list of PDF files which contain a given word or 
>> few words, details are here
>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea
>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal
>> og.java#L131
>>
>>
>> I'd say even more involved search engines would not mind supporting a 
>> case like that :-)
>>
>> Now there we process one file at a time, and I understand now that 
>> with TikaIO and N files it's all over the place really as far as the 
>> ordering is concerned, which file a chunk is coming from, etc. That's why 
>> the TikaReader must be able to associate the file name with a given piece 
>> of text it's making available to the pipeline.
>>
>> I'd be happy to support the ParDo way of linking Tika with Beam.
>> If it makes things simpler then it would be good, I've just no idea 
>> at the moment how to start the pipeline without using a 
>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned 
>> earlier - how can one avoid it with ParDo when implementing a 'min 
>> len chunk' feature, where the ParDo would have to concatenate several 
>> SAX data pieces first before making a single composite piece available to the pipeline ?
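[Editor's note: the concatenation itself need not involve any synchronization if it happens inside the parse call: a DoFn processes one element at a time per instance, so a per-parse buffer is confined to one thread. A minimal JDK-only sketch of the idea follows; the names are illustrative, not TikaIO's actual API.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the 'min text length' idea: concatenate SAX character events
// into composite chunks of at least minLen characters before emitting.
// All state lives in one instance used by one parse call, so no
// synchronization is needed when driven from a single-threaded ParDo.
public class MinLenChunkBuffer {
  private final int minLen;
  private final Consumer<String> downstream;
  private final StringBuilder buffer = new StringBuilder();

  MinLenChunkBuffer(int minLen, Consumer<String> downstream) {
    this.minLen = minLen;
    this.downstream = downstream;
  }

  /** Called for each SAX characters() event. */
  void onText(String piece) {
    buffer.append(piece);
    if (buffer.length() >= minLen) {
      downstream.accept(buffer.toString());
      buffer.setLength(0);
    }
  }

  /** Called at endDocument(): flush whatever is left. */
  void endOfDocument() {
    if (buffer.length() > 0) {
      downstream.accept(buffer.toString());
      buffer.setLength(0);
    }
  }

  public static void main(String[] args) {
    List<String> out = new ArrayList<>();
    MinLenChunkBuffer b = new MinLenChunkBuffer(10, out::add);
    for (String piece : new String[] {"one ", "two ", "three ", "four"}) {
      b.onText(piece);
    }
    b.endOfDocument();
    System.out.println(out);
  }
}
```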
>>
>>
>>> Another way to state it: currently, if I wanted to solve all of the 
>>> use cases above, I'd just use FileIO.readMatches() and use the Tika 
>>> API myself on the resulting ReadableFile. How can we make TikaIO 
>>> provide a usability improvement over such usage?
>>>
>>
>>
>> If you are actually asking whether it really makes sense for Beam to 
>> ship Tika-related code, given that users can just do it themselves, 
>> then I'm not sure.
>>
>> IMHO it always works better if users have to provide just few config 
>> options to an integral part of the framework and see things happening.
>> It will bring more users.
>>
>> Whether the current Tika code (refactored or not) stays with Beam or 
>> not - I'll let you and the team decide; believe it or not I was 
>> seriously contemplating at the last moment to make it all part of the 
>> Tika project itself and have a bit more flexibility over there with 
>> tweaking things, but now that it is in the Beam snapshot - I don't 
>> know - it's not my decision...
>>
>>> I am confused by your other comment - "Does the ordering matter ? 
>>> Perhaps
>>> for some cases it does, and for some it does not. May be it makes 
>>> sense to support running TikaIO as both the bounded reader/source 
>>> and ParDo, with getting the common code reused." - because using 
>>> BoundedReader or ParDo is not related to the ordering issue, only to 
>>> the issue of asynchronous reading and complexity of implementation. 
>>> The resulting PCollection will be unordered either way - this needs 
>>> to be solved separately by providing a different API.
>> Right I see now, so ParDo is not about making Tika reported data 
>> available to the downstream pipeline components ordered, only about 
>> the simpler implementation.
>> Association with the file should be possible I hope, but I understand 
>> it would be possible to optionally make the data coming out in the 
>> ordered way as well...
>>
>> Assuming TikaIO stays, and before trying to re-implement as ParDo, 
>> let me double check: should we still give some thought to the 
>> possible performance benefit of the current approach ? As I said, I 
>> can easily get rid of all that polling code and use a simple BlockingQueue.
>>
>> Cheers, Sergey
>>>
>>> Thanks.
>>>
>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin 
>>> <sb...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> Glad TikaIO getting some serious attention :-), I believe one thing 
>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>
>>>> Before trying to reply online, I'd like to state that my main 
>>>> assumption is that TikaIO (as far as the read side is concerned) is 
>>>> no different to Text, XML or similar bounded reader components.
>>>>
>>>> I have to admit I don't understand your questions about TikaIO 
>>>> usecases.
>>>>
>>>> What are the Text Input or XML input use-cases ? These use cases 
>>>> are TikaInput cases as well; the only difference is that Tika cannot 
>>>> split the individual file into a sequence of sources/etc.
>>>>
>>>> TextIO can read from the plain text files (possibly zipped), XML - 
>>>> optimized around reading from the XML files, and I thought I made 
>>>> it clear (and it is a known fact anyway) Tika was about reading 
>>>> basically from any file format.
>>>>
>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>
>>>> Sergey
>>>>
>>>>
>>>>
>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>> Hi,
>>>>>
>>>>> Replies inline.
>>>>>
>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin 
>>>>> <sb...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi All
>>>>>>
>>>>>> This is my first post to the dev list, I work for Talend, I'm a 
>>>>>> Beam novice, Apache Tika fan, and thought it would be really 
>>>>>> great to try and link both projects together, which led me to 
>>>>>> opening [1] where I typed some early thoughts, followed by PR 
>>>>>> [2].
>>>>>>
>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful) 
>>>>>> newer review comments from Eugene pending, so I'd like to 
>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then 
>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>
>>>>>> Apache Tika Parsers report the text content in chunks, via 
>>>>>> SaxParser events. It's not possible with Tika to take a file and 
>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line 
>>>>>> by line, the only way is to handle the SAXParser callbacks which 
>>>>>> report the data chunks.
>>>>>> Some
>>>>>> parsers may report the complete lines, some individual words, 
>>>>>> with some being able to report the data only after they completely 
>>>>>> parse the document.
>>>>>> All depends on the data format.
>>>>>>
>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads 
>>>>>> to parse the files, Beam threads will only collect the data from 
>>>>>> the internal queue where the internal TikaReader's thread will 
>>>>>> put the data into (note the data chunks are ordered even though 
>>>>>> the tests might suggest otherwise).
>>>>>>
>>>>> I agree that your implementation of reader returns records in 
>>>>> order
>>>>> - but
>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about 
>>>>> the order in which records are produced by a BoundedReader - the 
>>>>> order produced by your reader is ignored, and when applying any 
>>>>> transforms to the
>>>> PCollection
>>>>> produced by TikaIO, it is impossible to recover the order in which 
>>>>> your reader returned the records.
>>>>>
>>>>> With that in mind, is PCollection<String>, containing individual 
>>>>> Tika-detected items, still the right API for representing the 
>>>>> result of parsing a large number of documents with Tika?
>>>>>
>>>>>
>>>>>>
>>>>>> The reason I did it was because I thought
>>>>>>
>>>>>> 1) it would make the individual data chunks available faster to 
>>>>>> the pipeline - the parser will continue working via the 
>>>>>> binary/video etc file while the data will already start flowing - 
>>>>>> I agree there should be some tests data available confirming it - 
>>>>>> but I'm positive at the moment this approach might yield some 
>>>>>> performance gains with the large sets. If the file is large, if 
>>>>>> it has the embedded attachments/videos to deal with, then it may 
>>>>>> be more effective not to get the Beam thread deal with it...
>>>>>>
>>>>>> As I said on the PR, this description contains unfounded and 
>>>>>> potentially
>>>>> incorrect assumptions about how Beam runners execute (or may 
>>>>> execute in
>>>> the
>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>> correctly,
>>>>> you might be assuming that:
>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>> complete
>>>>> before processing its outputs with downstream transforms
>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>> *concurrently*
>>>>> with downstream processing of its results
>>>>> - Passing an element from one thread to another using a 
>>>>> BlockingQueue is free in terms of performance.
>>>>> All of these are false at least in some runners, and I'm almost certain that in 
>>>>> reality, performance of this approach is worse than a ParDo in
>>>> most
>>>>> production runners.
>>>>>
>>>>> There are other disadvantages to this approach:
>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>> invisible
>>>>> to Beam's instrumentation. If a Beam runner provided per-transform 
>>>>> profiling capabilities, or the ability to get the current stack 
>>>>> trace for stuck elements, this approach would make the real 
>>>>> processing invisible to all of these capabilities, and a user 
>>>>> would only see that the bulk of the time is spent waiting for the 
>>>>> next element, but not *why* the next
>>>> element
>>>>> is taking long to compute.
>>>>> - Likewise, offloading all the CPU and IO to a separate thread, 
>>>>> invisible to Beam, will make it harder for runners to do 
>>>>> autoscaling, binpacking
>>>> and
>>>>> other resource management magic (how much of this runners actually 
>>>>> do is
>>>> a
>>>>> separate issue), because the runner will have no way of knowing 
>>>>> how much CPU/IO this particular transform is actually using - all 
>>>>> the processing happens in a thread about which the runner is 
>>>>> unaware.
>>>>> - As far as I can tell, the code also hides exceptions that happen 
>>>>> in the Tika thread
>>>>> - Adding the thread management makes the code much more complex, 
>>>>> easier
>>>> to
>>>>> introduce bugs, and harder for others to contribute
>>>>>
>>>>>
>>>>>> 2) As I commented at the end of [2], having an option to 
>>>>>> concatenate the data chunks first before making them available to 
>>>>>> the pipeline is useful, and I guess doing the same in ParDo would 
>>>>>> introduce some synchronization issues (though not exactly sure 
>>>>>> yet)
>>>>>>
>>>>> What are these issues?
>>>>>
>>>>>
>>>>>>
>>>>>> One of valid concerns there is that the reader is polling the 
>>>>>> internal queue so, in theory at least, and perhaps in some rare 
>>>>>> cases too, we may have a case where the max polling time has been 
>>>>>> reached, the parser is still busy, and TikaIO fails to report all 
>>>>>> the file data. I think that it can be solved by either 2a) 
>>>>>> configuring the max polling time to a very large number which 
>>>>>> will never be reached for a practical case, or
>>>>>> 2b) simply use a blocking queue without the time limits - in the 
>>>>>> worst case, if TikaParser spins and fails to report the end of 
>>>>>> the document, then Beam can heal itself if the pipeline blocks.
>>>>>> I propose to follow 2b).
>>>>>>
>>>>> I agree that there should be no way to unintentionally configure 
>>>>> the transform in a way that will produce silent data loss. Another 
>>>>> reason for not having these tuning knobs is that it goes against 
>>>>> Beam's "no knobs"
>>>>> philosophy, and that in most cases users have no way of figuring 
>>>>> out a
>>>> good
>>>>> value for tuning knobs except for manual experimentation, which is 
>>>>> extremely brittle and typically gets immediately obsoleted by 
>>>>> running on
>>>> a
>>>>> new dataset or updating a version of some of the involved 
>>>>> dependencies
>>>> etc.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Please let me know what you think.
>>>>>> My plan so far is:
>>>>>> 1) start addressing most of Eugene's comments which would require 
>>>>>> some minor TikaIO updates
>>>>>> 2) work on removing the TikaSource internal code dealing with 
>>>>>> File patterns which I copied from TextIO at the next stage
>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam 
>>>>>> users some time to try it with some real complex files and also 
>>>>>> decide if TikaIO can continue to be implemented as a 
>>>>>> BoundedSource/Reader or not
>>>>>>
>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>
>>>>> Yes, but I think we should start by discussing the anticipated use 
>>>>> cases
>>>> of
>>>>> TikaIO and designing an API for it based on those use cases; and 
>>>>> then see what's the best implementation for that particular API 
>>>>> and set of anticipated use cases.
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks, Sergey
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>
>>>>>
>>>>
>>>
>>

RE: TikaIO concerns

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you, Sergey.

My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a chance to work with it yet.

From my perspective, if I understand this thread (and I may not!), getting unordered text from _a given file_ is a non-starter for most applications.  The implementation needs to guarantee order per file, and the user has to be able to link the "extract" back to a unique identifier for the document.  If the current implementation doesn't do those things, we need to change it, IMHO.

To the question of -- why is this in Beam at all; why don't we let users call it if they want it?... 

No matter how much we do to Tika, it will behave badly sometimes -- permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks using Beam -- folks likely with large batches of unruly/noisy documents -- are more likely to run into these problems than your average couple-of-thousand-docs-from-our-own-company user. So, if there are things we can do in Beam to prevent developers around the world from having to reinvent the wheel for defenses against these problems, then I'd be enormously grateful if we could put Tika into Beam.  That means: 

1) a process-level timeout (because you can't actually kill a thread in Java)
2) a process-level restart on OOM
3) avoid trying to reprocess a badly behaving document

If Beam automatically handles those problems, then I'd say, y, let users write their own code.  If there is so much as a single configuration knob (and it sounds like Beam is against complex configuration...yay!) to get that working in Beam, then I'd say, please integrate Tika into Beam.  From a safety perspective, it is critical to keep the extraction process entirely separate (jvm, vm, m, rack, data center!) from the transformation+loading steps.  IMHO, very few devs realize this because Tika works well lots of the time...which is why it is critical for us to make it easy for people to get it right all of the time.

Even in my desktop (gah, y, desktop!) search app, I run Tika in batch mode first in one jvm, and then I kick off another process to do transform/loading into Lucene/Solr from the .json files that Tika generates for each input file.  If I were to scale up, I'd want to maintain this complete separation of steps.

Apologies if I've derailed the conversation or misunderstood this thread.

Cheers,

               Tim

-----Original Message-----
From: Sergey Beryozkin [mailto:sberyozkin@gmail.com] 
Sent: Thursday, September 21, 2017 9:07 AM
To: dev@beam.apache.org
Cc: Allison, Timothy B. <ta...@mitre.org>
Subject: Re: TikaIO concerns

Hi All

Please welcome Tim, one of Apache Tika leads and practitioners.

Tim, thanks for joining in :-). If you have some great Apache Tika stories to share (preferably involving the cases where it did not really matter the ordering in which Tika-produced data were dealt with by the
consumers) then please do so :-).

At the moment, even though Tika ContentHandler will emit the ordered data, the Beam runtime will have no guarantees that the downstream pipeline components will see the data coming in the right order.

(FYI, I understand from the earlier comments that the total ordering is also achievable but would require the extra API support)

Other comments would be welcome too

Thanks, Sergey

On 21/09/17 10:55, Sergey Beryozkin wrote:
> I noticed that the PDF and ODT parsers actually split by lines, not 
> individual words and nearly 100% sure I saw Tika reporting individual 
> lines when it was parsing the text files. The 'min text length' 
> feature can help with reporting several lines at a time, etc...
> 
> I'm working with this PDF all the time:
> https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf
> 
> try it too if you get a chance.
> 
> (and I can imagine not all PDFs/etc representing the 'story' but can 
> be for ex a log-like content too)
> 
> That said, I don't know how a parser for the format N will behave, it 
> depends on the individual parsers.
> 
> IMHO it's an equal candidate alongside Text-based bounded IOs...
> 
> I'd like to know though how to make a file name available to the 
> pipeline which is working with the current text fragment ?
> 
> Going to try and do some measurements and compare the sync vs async 
> parsing modes...
> 
> Asked the Tika team to support with some more examples...
> 
> Cheers, Sergey
> On 20/09/17 22:17, Sergey Beryozkin wrote:
>> Hi,
>>
>> thanks for the explanations,
>>
>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>> Hi!
>>>
>>> TextIO returns an unordered soup of lines contained in all files you 
>>> ask it to read. People usually use TextIO for reading files where 1 
>>> line corresponds to 1 independent data element, e.g. a log entry, or 
>>> a row of a CSV file - so discarding order is ok.
>> Just a side note, I'd probably want that to be ordered, though I guess 
>> it depends...
>>> However, there are a number of cases where TextIO is a poor fit:
>>> - Cases where discarding order is not ok - e.g. if you're doing 
>>> natural language processing and the text files contain actual prose, 
>>> where you need to process a file as a whole. TextIO can't do that.
>>> - Cases where you need to remember which file each element came 
>>> from, e.g.
>>> if you're creating a search index for the files: TextIO can't do 
>>> this either.
>>>
>>> Both of these issues have been raised in the past against TextIO; 
>>> however it seems that the overwhelming majority of users of TextIO 
>>> use it for logs or CSV files or alike, so solving these issues has 
>>> not been a priority.
>>> Currently they are solved in a general form via FileIO.read() which 
>>> gives you access to reading a full file yourself - people who want 
>>> more flexibility will be able to use standard Java text-parsing 
>>> utilities on a ReadableFile, without involving TextIO.
>>>
>>> Same applies for XmlIO: it is specifically designed for the narrow 
>>> use case where the files contain independent data entries, so 
>>> returning an unordered soup of them, with no association to the 
>>> original file, is the user's intention. XmlIO will not work for 
>>> processing more complex XML files that are not simply a sequence of 
>>> entries with the same tag, and it also does not remember the 
>>> original filename.
>>>
>>
>> OK...
>>
>>> However, if my understanding of Tika use cases is correct, it is 
>>> mainly used for extracting content from complex file formats - for 
>>> example, extracting text and images from PDF files or Word 
>>> documents. I believe this is the main difference between it and 
>>> TextIO - people usually use Tika for complex use cases where the 
>>> "unordered soup of stuff" abstraction is not useful.
>>>
>>> My suspicion about this is confirmed by the fact that the crux of 
>>> the Tika API is ContentHandler 
>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.
>>> html?is-external=true
>>>
>>> whose
>>> documentation says "The order of events in this interface is very 
>>> important, and mirrors the order of information in the document itself."
>> All that says is that a (Tika) ContentHandler will be a true SAX 
>> ContentHandler...
>>>
>>> Let me give a few examples of what I think is possible with the raw 
>>> Tika API, but I think is not currently possible with TikaIO - please 
>>> correct me where I'm wrong, because I'm not particularly familiar 
>>> with Tika and am judging just based on what I read about it.
>>> - User has 100,000 Word documents and wants to convert each of them 
>>> to text files for future natural language processing.
>>> - User has 100,000 PDF files with financial statements, each 
>>> containing a bunch of unrelated text and - the main content - a list 
>>> of transactions in PDF tables. User wants to extract each 
>>> transaction as a PCollection element, discarding the unrelated text.
>>> - User has 100,000 PDF files with scientific papers, and wants to 
>>> extract text from them, somehow parse author and affiliation from 
>>> the text, and compute statistics of topics and terminology usage by 
>>> author name and affiliation.
>>> - User has 100,000 photos in JPEG made by a set of automatic cameras 
>>> observing a location over time: they want to extract metadata from 
>>> each image using Tika, analyze the images themselves using some 
>>> other library, and detect anomalies in the overall appearance of the 
>>> location over time as seen from multiple cameras.
>>> I believe all of these cases can not be solved with TikaIO because 
>>> the resulting PCollection<String> contains no information about 
>>> which String comes from which document and about the order in which 
>>> they appear in the document.
>> These are good use cases, thanks... I thought you were talking 
>> about the unordered soup of data produced by TikaIO (and its friends 
>> TextIO and alike :-)).
>> Putting the ordered vs unordered question aside for a sec, why 
>> exactly can a Tika Reader not make the name of the file it's 
>> currently reading from available to the pipeline, as a piece of Beam pipeline metadata ?
>> Surely that must be possible with Beam ? If not then I would be surprised...
>>
>>>
>>> I am, honestly, struggling to think of a case where I would want to 
>>> use Tika, but where I *would* be ok with getting an unordered soup 
>>> of strings.
>>> So some examples would be very helpful.
>>>
>> Yes. I'll ask Tika developers to help with some examples, but I'll 
>> give one example where it did not matter to us in what order 
>> Tika-produced data were available to the downstream layer.
>>
>> It's a demo the Apache CXF colleague of mine showed at one of Apache 
>> Con NAs, and we had a happy audience:
>>
>> https://github.com/apache/cxf/tree/master/distribution/src/main/relea
>> se/samples/jax_rs/search
>>
>>
>> PDF or ODT files uploaded, Tika parses them, and all of that is put 
>> into Lucene. We associate a file name with the indexed content and 
>> then let users find a list of PDF files which contain a given word or 
>> few words, details are here
>> https://github.com/apache/cxf/blob/master/distribution/src/main/relea
>> se/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catal
>> og.java#L131
>>
>>
>> I'd say even more involved search engines would not mind supporting a 
>> case like that :-)
>>
>> Now there we process one file at a time, and I understand now that 
>> with TikaIO and N files it's all over the place really as far as the 
>> ordering is concerned, and which file a fragment is coming from, etc. 
>> That's why the TikaReader must be able to associate the file name with 
>> a given piece of text it makes available to the pipeline.
>>
>> I'd be happy to support the ParDo way of linking Tika with Beam.
>> If it makes things simpler then it would be good; I've just no idea 
>> at the moment how to start the pipeline without using a 
>> Source/Reader, but I'll learn :-). Re the sync issue I mentioned 
>> earlier - how can one avoid it with ParDo when implementing a 'min 
>> len chunk' feature, where the ParDo would have to concatenate several 
>> SAX data pieces first before passing a single composite piece to the pipeline ?
>>
>>
>>> Another way to state it: currently, if I wanted to solve all of the 
>>> use cases above, I'd just use FileIO.readMatches() and use the Tika 
>>> API myself on the resulting ReadableFile. How can we make TikaIO 
>>> provide a usability improvement over such usage?
>>>
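For reference, Eugene's FileIO.readMatches() suggestion above could be sketched roughly as follows (a hedged sketch against the Beam 2.x Java SDK and Tika APIs; the method and class names I introduce, such as parseFiles, are illustrative only):

```java
import java.io.InputStream;
import java.nio.channels.Channels;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaViaParDo {
  // Reads every file matching the pattern and emits (fileName, extractedText).
  public static PCollection<KV<String, String>> parseFiles(Pipeline p, String filepattern) {
    return p
        .apply(FileIO.match().filepattern(filepattern))
        .apply(FileIO.readMatches())
        .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
          @ProcessElement
          public void process(ProcessContext c) throws Exception {
            FileIO.ReadableFile file = c.element();
            String fileName = file.getMetadata().resourceId().toString();
            try (InputStream is = Channels.newInputStream(file.open())) {
              BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
              new AutoDetectParser().parse(is, handler, new Metadata());
              // The whole document is emitted as one element, keyed by file
              // name, so both the in-file ordering and the file association
              // survive the unordered PCollection.
              c.output(KV.of(fileName, handler.toString()));
            }
          }
        }));
  }
}
```

Emitting one element per file is of course only one choice; a DoFn could equally emit smaller keyed chunks, which is where the min-length discussion in this thread comes in.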
>>
>>
>> If you are actually asking whether it really makes sense for Beam to 
>> ship Tika-related code, given that users can just do it themselves - 
>> I'm not sure.
>>
>> IMHO it always works better if users have to provide just a few config 
>> options to an integral part of the framework and see things happening.
>> It will bring more users.
>>
>> Whether the current Tika code (refactored or not) stays with Beam or 
>> not - I'll let you and the team decide; believe it or not, I was 
>> seriously contemplating at the last moment making it all part of the 
>> Tika project itself, to have a bit more flexibility over there with 
>> tweaking things, but now that it is in the Beam snapshot - I don't 
>> know - it's not my decision...
>>
>>> I am confused by your other comment - "Does the ordering matter ? 
>>> Perhaps
>>> for some cases it does, and for some it does not. May be it makes 
>>> sense to support running TikaIO as both the bounded reader/source 
>>> and ParDo, with getting the common code reused." - because using 
>>> BoundedReader or ParDo is not related to the ordering issue, only to 
>>> the issue of asynchronous reading and complexity of implementation. 
>>> The resulting PCollection will be unordered either way - this needs 
>>> to be solved separately by providing a different API.
>> Right, I see now: ParDo is not about making Tika-reported data 
>> available to the downstream pipeline components in order, only about 
>> the simpler implementation.
>> Association with the file should be possible I hope, and I understand 
>> it would also be possible to optionally make the data come out in an 
>> ordered way...
>>
>> Assuming TikaIO stays, and before trying to re-implement it as ParDo, 
>> let me double check: should we still give some thought to the 
>> possible performance benefit of the current approach ? As I said, I 
>> can easily get rid of all that polling code and use a simple blocking queue.
>>
>> Cheers, Sergey
>>>
>>> Thanks.
>>>
>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin 
>>> <sb...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> Glad TikaIO is getting some serious attention :-), I believe one thing 
>>>> we both agree upon is that Tika can help Beam in its own unique way.
>>>>
>>>> Before trying to reply online, I'd like to state that my main 
>>>> assumption is that TikaIO (as far as the read side is concerned) is 
>>>> no different to Text, XML or similar bounded reader components.
>>>>
>>>> I have to admit I don't understand your questions about TikaIO 
>>>> usecases.
>>>>
>>>> What are the Text input or XML input use cases ? These use cases 
>>>> are TikaInput cases as well; the only difference is that Tika can 
>>>> not split an individual file into a sequence of sources/etc.
>>>>
>>>> TextIO can read from plain text files (possibly zipped), XmlIO is 
>>>> optimized around reading from XML files, and I thought I made it 
>>>> clear (and it is a known fact anyway) that Tika is about reading 
>>>> from basically any file format.
>>>>
>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>
>>>> Sergey
>>>>
>>>>
>>>>
>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>> Hi,
>>>>>
>>>>> Replies inline.
>>>>>
>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin 
>>>>> <sb...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi All
>>>>>>
>>>>>> This is my first post to the dev list, I work for Talend, I'm a 
>>>>>> Beam novice, Apache Tika fan, and thought it would be really 
>>>>>> great to try and link both projects together, which led me to 
>>>>>> opening [1] where I typed some early thoughts, followed by PR 
>>>>>> [2].
>>>>>>
>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful) 
>>>>>> newer review comments from Eugene pending, so I'd like to 
>>>>>> summarize a bit why I did TikaIO (reader) the way I did, and then 
>>>>>> decide, based on the feedback from the experts, what to do next.
>>>>>>
>>>>>> Apache Tika Parsers report the text content in chunks, via 
>>>>>> SaxParser events. It's not possible with Tika to take a file and 
>>>>>> read it bit by bit at the 'initiative' of the Beam Reader, line 
>>>>>> by line; the only way is to handle the SAXParser callbacks which 
>>>>>> report the data chunks.
>>>>>> Some
>>>>>> parsers may report complete lines, some individual words, and 
>>>>>> some are able to report the data only after they completely 
>>>>>> parse the document.
>>>>>> All depends on the data format.
>>>>>>
>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads 
>>>>>> to parse the files, Beam threads will only collect the data from 
>>>>>> the internal queue where the internal TikaReader's thread will 
>>>>>> put the data into (note the data chunks are ordered even though 
>>>>>> the tests might suggest otherwise).
>>>>>>
>>>>> I agree that your implementation of reader returns records in 
>>>>> order
>>>>> - but
>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about 
>>>>> the order in which records are produced by a BoundedReader - the 
>>>>> order produced by your reader is ignored, and when applying any 
>>>>> transforms to the
>>>> PCollection
>>>>> produced by TikaIO, it is impossible to recover the order in which 
>>>>> your reader returned the records.
>>>>>
>>>>> With that in mind, is PCollection<String>, containing individual 
>>>>> Tika-detected items, still the right API for representing the 
>>>>> result of parsing a large number of documents with Tika?
>>>>>
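One possible answer to the question above is a richer element type than String (the type below is purely illustrative, not an existing Beam or Tika class): each element carries its source file name and its position within that file, so downstream transforms can re-group by file and re-sort by index even though the PCollection itself is unordered.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Illustrative value type pairing a Tika-extracted text chunk with its
// source file and its position within that file.
public final class ParsedChunk {
    private final String fileName;
    private final int index;   // position of this chunk within the file
    private final String text;

    public ParsedChunk(String fileName, int index, String text) {
        this.fileName = fileName;
        this.index = index;
        this.text = text;
    }

    public String getFileName() { return fileName; }
    public int getIndex() { return index; }
    public String getText() { return text; }

    // Recovers the per-file order from an unordered collection of chunks.
    public static List<String> inOrder(List<ParsedChunk> unordered, String fileName) {
        return unordered.stream()
            .filter(c -> c.getFileName().equals(fileName))
            .sorted(Comparator.comparingInt(ParsedChunk::getIndex))
            .map(ParsedChunk::getText)
            .collect(Collectors.toList());
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ParsedChunk)) return false;
        ParsedChunk other = (ParsedChunk) o;
        return index == other.index
            && fileName.equals(other.fileName)
            && text.equals(other.text);
    }

    @Override
    public int hashCode() {
        return Objects.hash(fileName, index, text);
    }
}
```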
>>>>>
>>>>>>
>>>>>> The reason I did it was because I thought
>>>>>>
>>>>>> 1) it would make the individual data chunks available faster to 
>>>>>> the pipeline - the parser will continue working through the 
>>>>>> binary/video etc. file while the data will already start flowing - 
>>>>>> I agree there should be some test data available confirming it - 
>>>>>> but I'm positive at the moment this approach might yield some 
>>>>>> performance gains with large sets. If the file is large, or if 
>>>>>> it has embedded attachments/videos to deal with, then it may 
>>>>>> be more effective not to have the Beam thread deal with it...
>>>>>>
>>>>>> As I said on the PR, this description contains unfounded and 
>>>>>> potentially
>>>>> incorrect assumptions about how Beam runners execute (or may 
>>>>> execute in
>>>> the
>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>> correctly,
>>>>> you might be assuming that:
>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>> complete
>>>>> before processing its outputs with downstream transforms
>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>> *concurrently*
>>>>> with downstream processing of its results
>>>>> - Passing an element from one thread to another using a 
>>>>> BlockingQueue is free in terms of performance.
>>>>> All of these are false at least in some runners, and I'm almost certain that in 
>>>>> reality, performance of this approach is worse than a ParDo in
>>>> most
>>>>> production runners.
>>>>>
>>>>> There are other disadvantages to this approach:
>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>> invisible
>>>>> to Beam's instrumentation. If a Beam runner provided per-transform 
>>>>> profiling capabilities, or the ability to get the current stack 
>>>>> trace for stuck elements, this approach would make the real 
>>>>> processing invisible to all of these capabilities, and a user 
>>>>> would only see that the bulk of the time is spent waiting for the 
>>>>> next element, but not *why* the next
>>>> element
>>>>> is taking long to compute.
>>>>> - Likewise, offloading all the CPU and IO to a separate thread, 
>>>>> invisible to Beam, will make it harder for runners to do 
>>>>> autoscaling, binpacking
>>>> and
>>>>> other resource management magic (how much of this runners actually 
>>>>> do is
>>>> a
>>>>> separate issue), because the runner will have no way of knowing 
>>>>> how much CPU/IO this particular transform is actually using - all 
>>>>> the processing happens in a thread about which the runner is 
>>>>> unaware.
>>>>> - As far as I can tell, the code also hides exceptions that happen 
>>>>> in the Tika thread
>>>>> - Adding the thread management makes the code much more complex, 
>>>>> easier
>>>> to
>>>>> introduce bugs, and harder for others to contribute
>>>>>
>>>>>
>>>>>> 2) As I commented at the end of [2], having an option to 
>>>>>> concatenate the data chunks first before making them available to 
>>>>>> the pipeline is useful, and I guess doing the same in ParDo would 
>>>>>> introduce some synchronization issues (though not exactly sure 
>>>>>> yet)
>>>>>>
>>>>> What are these issues?
>>>>>
>>>>>
>>>>>>
>>>>>> One of valid concerns there is that the reader is polling the 
>>>>>> internal queue so, in theory at least, and perhaps in some rare 
>>>>>> cases too, we may have a case where the max polling time has been 
>>>>>> reached, the parser is still busy, and TikaIO fails to report all 
>>>>>> the file data. I think that it can be solved by either 2a) 
>>>>>> configuring the max polling time to a very large number which 
>>>>>> will never be reached for a practical case, or
>>>>>> 2b) simply use a blocking queue without the time limits - in the 
>>>>>> worst case, if TikaParser spins and fails to report the end of 
>>>>>> the document, then Beam can heal itself if the pipeline blocks.
>>>>>> I propose to follow 2b).
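Option 2b) can be sketched with a self-contained example (class and constant names are mine, for illustration): a sentinel object marks end-of-document, so the consumer blocks on take() and can never lose data to an expired poll timeout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SentinelQueueSketch {
    // Sentinel marking the end of the parsed document; compared by
    // identity, so it can never collide with a real text chunk.
    public static final String END_OF_DOC = new String("\u0000EOD");

    // Consumer side: block on take() until the sentinel arrives -
    // no poll timeout, hence no silent data loss.
    public static List<String> drain(BlockingQueue<String> queue) throws InterruptedException {
        List<String> chunks = new ArrayList<>();
        for (String chunk = queue.take(); chunk != END_OF_DOC; chunk = queue.take()) {
            chunks.add(chunk);
        }
        return chunks;
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        // Producer side: the parsing thread puts chunks, then the sentinel.
        Thread parserThread = new Thread(() -> {
            try {
                queue.put("first chunk");
                queue.put("second chunk");
                queue.put(END_OF_DOC);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        parserThread.start();
        System.out.println(drain(queue));
        parserThread.join();
    }
}
```

If the parser spins and never puts the sentinel, the consumer blocks - which, as proposed above, the pipeline runtime can then surface as a stuck element rather than as silently truncated output.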
>>>>>>
>>>>> I agree that there should be no way to unintentionally configure 
>>>>> the transform in a way that will produce silent data loss. Another 
>>>>> reason for not having these tuning knobs is that it goes against 
>>>>> Beam's "no knobs"
>>>>> philosophy, and that in most cases users have no way of figuring 
>>>>> out a
>>>> good
>>>>> value for tuning knobs except for manual experimentation, which is 
>>>>> extremely brittle and typically gets immediately obsoleted by 
>>>>> running on
>>>> a
>>>>> new dataset or updating a version of some of the involved 
>>>>> dependencies
>>>> etc.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Please let me know what you think.
>>>>>> My plan so far is:
>>>>>> 1) start addressing most of Eugene's comments which would require 
>>>>>> some minor TikaIO updates
>>>>>> 2) work on removing the TikaSource internal code dealing with 
>>>>>> File patterns which I copied from TextIO at the next stage
>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam 
>>>>>> users some time to try it with some real complex files and also 
>>>>>> decide if TikaIO can continue to be implemented as a 
>>>>>> BoundedSource/Reader or not
>>>>>>
>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>
>>>>> Yes, but I think we should start by discussing the anticipated use 
>>>>> cases
>>>> of
>>>>> TikaIO and designing an API for it based on those use cases; and 
>>>>> then see what's the best implementation for that particular API 
>>>>> and set of anticipated use cases.
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks, Sergey
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>
>>>>>
>>>>
>>>
>>

>>>> most
>>>>> production runners.
>>>>>
>>>>> There are other disadvantages to this approach:
>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>> invisible
>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>>> profiling capabilities, or the ability to get the current stack 
>>>>> trace for
>>>>> stuck elements, this approach would make the real processing 
>>>>> invisible to
>>>>> all of these capabilities, and a user would only see that the bulk 
>>>>> of the
>>>>> time is spent waiting for the next element, but not *why* the next
>>>> element
>>>>> is taking long to compute.
>>>>> - Likewise, offloading all the CPU and IO to a separate thread, 
>>>>> invisible
>>>>> to Beam, will make it harder for runners to do autoscaling, binpacking
>>>> and
>>>>> other resource management magic (how much of this runners actually 
>>>>> do is
>>>> a
>>>>> separate issue), because the runner will have no way of knowing how 
>>>>> much
>>>>> CPU/IO this particular transform is actually using - all the 
>>>>> processing
>>>>> happens in a thread about which the runner is unaware.
>>>>> - As far as I can tell, the code also hides exceptions that happen 
>>>>> in the
>>>>> Tika thread
>>>>> - Adding the thread management makes the code much more complex, 
>>>>> easier
>>>> to
>>>>> introduce bugs, and harder for others to contribute
>>>>>
>>>>>
>>>>>> 2) As I commented at the end of [2], having an option to 
>>>>>> concatenate the
>>>>>> data chunks first before making them available to the pipeline is
>>>>>> useful, and I guess doing the same in ParDo would introduce some
>>>>>> synchronization issues (though not exactly sure yet)
>>>>>>
>>>>> What are these issues?
>>>>>
>>>>>
>>>>>>
>>>>>> One of valid concerns there is that the reader is polling the 
>>>>>> internal
>>>>>> queue so, in theory at least, and perhaps in some rare cases too, 
>>>>>> we may
>>>>>> have a case where the max polling time has been reached, the 
>>>>>> parser is
>>>>>> still busy, and TikaIO fails to report all the file data. I think 
>>>>>> that
>>>>>> it can be solved by either 2a) configuring the max polling time to a
>>>>>> very large number which will never be reached for a practical 
>>>>>> case, or
>>>>>> 2b) simply use a blocking queue without the time limits - in the 
>>>>>> worst
>>>>>> case, if TikaParser spins and fails to report the end of the 
>>>>>> document,
>>>>>> then, Beam can heal itself if the pipeline blocks.
>>>>>> I propose to follow 2b).
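The 2a)/2b) trade-off above can be illustrated with a small stdlib-only sketch (hypothetical names, not the actual TikaReader code): a timed poll() can give up and return null while the producer is still busy, which is exactly where silent data loss could creep in, whereas an untimed take() simply blocks until the next chunk arrives.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the internal queue between a parser thread
// and a reader thread.
class PollVsTakeDemo {

    static String demo() {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10);
        Thread parser = new Thread(() -> {
            try {
                Thread.sleep(200); // simulates a parser busy with a large file
                queue.put("chunk");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        parser.start();
        try {
            // 2a) a short timed poll can give up while the parser is still
            // busy (here it almost certainly returns null):
            String polled = queue.poll(20, TimeUnit.MILLISECONDS);
            // 2b) take() has no time limit and blocks until the chunk arrives:
            String chunk = (polled != null) ? polled : queue.take();
            parser.join();
            return chunk;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Here 2b) costs nothing extra in the common case, and if the parser never terminates the pipeline blocks visibly instead of finishing with missing data.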
>>>>>>
>>>>> I agree that there should be no way to unintentionally configure the
>>>>> transform in a way that will produce silent data loss. Another 
>>>>> reason for
>>>>> not having these tuning knobs is that it goes against Beam's "no 
>>>>> knobs"
>>>>> philosophy, and that in most cases users have no way of figuring out a
>>>> good
>>>>> value for tuning knobs except for manual experimentation, which is
>>>>> extremely brittle and typically gets immediately obsoleted by 
>>>>> running on
>>>> a
>>>>> new dataset or updating a version of some of the involved dependencies
>>>> etc.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Please let me know what you think.
>>>>>> My plan so far is:
>>>>>> 1) start addressing most of Eugene's comments which would require 
>>>>>> some
>>>>>> minor TikaIO updates
>>>>>> 2) work on removing the TikaSource internal code dealing with File
>>>>>> patterns which I copied from TextIO at the next stage
>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam 
>>>>>> users some
>>>>>> time to try it with some real complex files and also decide if TikaIO
>>>>>> can continue implemented as a BoundedSource/Reader or not
>>>>>>
>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>
>>>>> Yes, but I think we should start by discussing the anticipated use 
>>>>> cases
>>>> of
>>>>> TikaIO and designing an API for it based on those use cases; and 
>>>>> then see
>>>>> what's the best implementation for that particular API and set of
>>>>> anticipated use cases.
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks, Sergey
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>
>>>>>
>>>>
>>>
>>

Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
I noticed that the PDF and ODT parsers actually split by lines, not 
individual words, and I'm nearly 100% sure I saw Tika reporting individual 
lines when it was parsing text files. The 'min text length' feature 
can help with reporting several lines at a time, etc...

I'm working with this PDF all the time:
https://rwc.iacr.org/2017/Slides/nguyen.quan.pdf

Try it too if you get a chance.

(and I can imagine that not all PDFs etc. represent a 'story'; some can 
contain log-like content too)

That said, I don't know how a parser for a given format will behave; it 
depends on the individual parser.

IMHO it's an equal candidate alongside Text-based bounded IOs...

I'd like to know, though, how to make a file name available to the part 
of the pipeline which is working with the current text fragment.
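One possible shape for that (a sketch with hypothetical names, not the actual TikaIO API): pair each Tika-produced chunk with the name of its source file and emit those pairs instead of bare strings; in Beam terms that would suggest a PCollection of KV<String, String> rather than PCollection<String>.

```java
import java.util.List;

// Hypothetical element type: each parsed chunk carries the name of the
// file it came from, so downstream transforms (e.g. building a search
// index) can recover the association that a bare String loses.
record FileChunk(String fileName, String text) {}

class FileChunkDemo {
    // Pairs every chunk produced for one file with that file's name.
    static List<FileChunk> chunksOf(String fileName, List<String> pieces) {
        return pieces.stream()
                .map(piece -> new FileChunk(fileName, piece))
                .toList();
    }

    public static void main(String[] args) {
        System.out.println(chunksOf("report.pdf", List.of("line 1", "line 2")));
    }
}
```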

Going to try and do some measurements and compare the sync vs async 
parsing modes...

Asked the Tika team to support with some more examples...

Cheers, Sergey
On 20/09/17 22:17, Sergey Beryozkin wrote:
> Hi,
> 
> thanks for the explanations,
> 
> On 20/09/17 16:41, Eugene Kirpichov wrote:
>> Hi!
>>
>> TextIO returns an unordered soup of lines contained in all files you 
>> ask it
>> to read. People usually use TextIO for reading files where 1 line
>> corresponds to 1 independent data element, e.g. a log entry, or a row 
>> of a
>> CSV file - so discarding order is ok.
> Just a side note, I'd probably want that to be ordered, though I guess it 
> depends...
>> However, there is a number of cases where TextIO is a poor fit:
>> - Cases where discarding order is not ok - e.g. if you're doing natural
>> language processing and the text files contain actual prose, where you 
>> need
>> to process a file as a whole. TextIO can't do that.
>> - Cases where you need to remember which file each element came from, 
>> e.g.
>> if you're creating a search index for the files: TextIO can't do this
>> either.
>>
>> Both of these issues have been raised in the past against TextIO; however
>> it seems that the overwhelming majority of users of TextIO use it for 
>> logs
>> or CSV files or alike, so solving these issues has not been a priority.
>> Currently they are solved in a general form via FileIO.read() which gives
>> you access to reading a full file yourself - people who want more
>> flexibility will be able to use standard Java text-parsing utilities on a
>> ReadableFile, without involving TextIO.
>>
>> Same applies for XmlIO: it is specifically designed for the narrow use 
>> case
>> where the files contain independent data entries, so returning an 
>> unordered
>> soup of them, with no association to the original file, is the user's
>> intention. XmlIO will not work for processing more complex XML files that
>> are not simply a sequence of entries with the same tag, and it also does
>> not remember the original filename.
>>
> 
> OK...
> 
>> However, if my understanding of Tika use cases is correct, it is mainly
>> used for extracting content from complex file formats - for example,
>> extracting text and images from PDF files or Word documents. I believe 
>> this
>> is the main difference between it and TextIO - people usually use Tika 
>> for
>> complex use cases where the "unordered soup of stuff" abstraction is not
>> useful.
>>
>> My suspicion about this is confirmed by the fact that the crux of the 
>> Tika
>> API is ContentHandler
>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true 
>>
>> whose
>> documentation says "The order of events in this interface is very
>> important, and mirrors the order of information in the document itself."
> All that says is that a (Tika) ContentHandler will be a true SAX 
> ContentHandler...
>>
>> Let me give a few examples of what I think is possible with the raw Tika
>> API, but I think is not currently possible with TikaIO - please 
>> correct me
>> where I'm wrong, because I'm not particularly familiar with Tika and am
>> judging just based on what I read about it.
>> - User has 100,000 Word documents and wants to convert each of them to 
>> text
>> files for future natural language processing.
>> - User has 100,000 PDF files with financial statements, each containing a
>> bunch of unrelated text and - the main content - a list of 
>> transactions in
>> PDF tables. User wants to extract each transaction as a PCollection
>> element, discarding the unrelated text.
>> - User has 100,000 PDF files with scientific papers, and wants to extract
>> text from them, somehow parse author and affiliation from the text, and
>> compute statistics of topics and terminology usage by author name and
>> affiliation.
>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>> observing a location over time: they want to extract metadata from each
>> image using Tika, analyze the images themselves using some other library,
>> and detect anomalies in the overall appearance of the location over 
>> time as
>> seen from multiple cameras.
>> I believe all of these cases can not be solved with TikaIO because the
>> resulting PCollection<String> contains no information about which String
>> comes from which document and about the order in which they appear in the
>> document.
> These are good use cases, thanks... I thought you were talking about 
> the unordered soup of data produced by TikaIO (and its friends 
> TextIO and the like :-)).
> Putting the ordered vs unordered question aside for a sec, why exactly a 
> Tika Reader can not make the name of the file it's currently reading 
> from available to the pipeline, as some Beam pipeline metadata piece ?
> Surely it can be possible with Beam ? If not then I would be surprised...
> 
>>
>> I am, honestly, struggling to think of a case where I would want to use
>> Tika, but where I *would* be ok with getting an unordered soup of 
>> strings.
>> So some examples would be very helpful.
>>
> Yes. I'll ask Tika developers to help with some examples, but I'll give 
> one example where it did not matter to us in what order Tika-produced 
> data were available to the downstream layer.
> 
> It's a demo the Apache CXF colleague of mine showed at one of Apache Con 
> NAs, and we had a happy audience:
> 
> https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search 
> 
> 
> PDF or ODT files uploaded, Tika parses them, and all of that is put into 
> Lucene. We associate a file name with the indexed content and then let 
> users find a list of PDF files which contain a given word or few words, 
> details are here
> https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L131 
> 
> 
> I'd say even more involved search engines would not mind supporting a 
> case like that :-)
> 
> Now there we process one file at a time, and I understand now that with 
> TikaIO and N files it's all over the place really, as far as the ordering 
> is concerned and which file a chunk is coming from, etc. That's why the 
> TikaReader must be able to associate the file name with a given piece of 
> text it's making available to the pipeline.
> 
> I'd be happy to support the ParDo way of linking Tika with Beam.
> If it makes things simpler then it would be good, I've just no idea at 
> the moment how to start the pipeline without using a Source/Reader,
> but I'll learn :-). Re the sync issue I mentioned earlier - how can one 
> avoid it with ParDo when implementing a 'min len chunk' feature, where 
> the ParDo would have to concatenate several SAX data pieces first before 
> making a single composite piece to the pipeline ?
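On the synchronization question above: in a ParDo, each file would be parsed inside a single @ProcessElement call with its own handler instance, so concatenating SAX pieces into a 'min len chunk' is purely local state and needs no cross-thread synchronization. A stdlib-only sketch of such an accumulator (hypothetical names, not the actual TikaIO code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical accumulator: buffers SAX character events and emits a
// chunk only once at least minLength characters have been gathered.
// One instance is used per file, so no synchronization is required.
class MinLengthChunker {
    private final int minLength;
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> chunks = new ArrayList<>();

    MinLengthChunker(int minLength) {
        this.minLength = minLength;
    }

    // Would be called from the SAX ContentHandler's characters() callback.
    void characters(String piece) {
        buffer.append(piece);
        if (buffer.length() >= minLength) {
            chunks.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    // Would be called at endDocument(): flush whatever remains.
    List<String> finish() {
        if (buffer.length() > 0) {
            chunks.add(buffer.toString());
            buffer.setLength(0);
        }
        return chunks;
    }
}
```

The same instance receives every characters() callback for one document, flushes a chunk whenever the minimum length is reached, and emits the remainder at end of document.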
> 
> 
>> Another way to state it: currently, if I wanted to solve all of the use
>> cases above, I'd just use FileIO.readMatches() and use the Tika API 
>> myself
>> on the resulting ReadableFile. How can we make TikaIO provide a usability
>> improvement over such usage?
>>
> 
> 
> If you are actually asking, does it really make sense for Beam to ship
> Tika related code, given that users can just do it themselves, I'm not 
> sure.
> 
> IMHO it always works better if users have to provide just few config 
> options to an integral part of the framework and see things happening.
> It will bring more users.
> 
> Whether the current Tika code (refactored or not) stays with Beam or not 
> - I'll let you and the team decide; believe it or not I was seriously 
> contemplating at the last moment to make it all part of the Tika project 
> itself and have a bit more flexibility over there with tweaking things, 
> but now that it is in the Beam snapshot - I don't know - it's not my 
> decision...
> 
>> I am confused by your other comment - "Does the ordering matter ?  
>> Perhaps
>> for some cases it does, and for some it does not. May be it makes 
>> sense to
>> support running TikaIO as both the bounded reader/source and ParDo, with
>> getting the common code reused." - because using BoundedReader or 
>> ParDo is
>> not related to the ordering issue, only to the issue of asynchronous
>> reading and complexity of implementation. The resulting PCollection 
>> will be
>> unordered either way - this needs to be solved separately by providing a
>> different API.
> Right I see now, so ParDo is not about making Tika reported data 
> available to the downstream pipeline components ordered, only about the 
> simpler implementation.
> Association with the file should be possible, I hope, and I understand it 
> would also be possible to optionally make the data come out in an ordered 
> way as well...
> 
> Assuming TikaIO stays, and before trying to re-implement as ParDo, let 
> me double check: should we still give some thought to the possible 
> performance benefit of the current approach ? As I said, I can easily 
> get rid of all that polling code, use a simple Blocking queue.
> 
> Cheers, Sergey
>>
>> Thanks.
>>
>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin <sb...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> Glad TikaIO getting some serious attention :-), I believe one thing we
>>> both agree upon is that Tika can help Beam in its own unique way.
>>>
>>> Before trying to reply online, I'd like to state that my main assumption
>>> is that TikaIO (as far as the read side is concerned) is no different to
>>> Text, XML or similar bounded reader components.
>>>
>>> I have to admit I don't understand your questions about TikaIO usecases.
>>>
>>> What are the Text Input or XML input use-cases ? These use cases are
>>> TikaInput cases as well, the only difference is Tika can not split the
>>> individual file into a sequence of sources/etc,
>>>
>>> TextIO can read from the plain text files (possibly zipped), XML -
>>> optimized around reading from the XML files, and I thought I made it
>>> clear (and it is a known fact anyway) Tika was about reading basically
>>> from any file format.
>>>
>>> Where is the difference (apart from what I've already mentioned) ?
>>>
>>> Sergey
>>>
>>>
>>>
> 

Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Thanks for the comments,

On 20/09/17 22:46, Robert Bradshaw wrote:
> On Wed, Sep 20, 2017 at 2:17 PM, Sergey Beryozkin <sb...@gmail.com> wrote:
>> Hi,
>>
>> thanks for the explanations,
>>
>> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>>
>>> Hi!
>>>
>>> TextIO returns an unordered soup of lines contained in all files you ask
>>> it
>>> to read. People usually use TextIO for reading files where 1 line
>>> corresponds to 1 independent data element, e.g. a log entry, or a row of a
>>> CSV file - so discarding order is ok.
>>
>> Just a side note, I'd probably want that to be ordered, though I guess it
>> depends...
>>>
>>> However, there is a number of cases where TextIO is a poor fit:
>>> - Cases where discarding order is not ok - e.g. if you're doing natural
>>> language processing and the text files contain actual prose, where you
>>> need
>>> to process a file as a whole. TextIO can't do that.
>>> - Cases where you need to remember which file each element came from, e.g.
>>> if you're creating a search index for the files: TextIO can't do this
>>> either.
>>>
>>> Both of these issues have been raised in the past against TextIO; however
>>> it seems that the overwhelming majority of users of TextIO use it for logs
>>> or CSV files or alike, so solving these issues has not been a priority.
>>> Currently they are solved in a general form via FileIO.read() which gives
>>> you access to reading a full file yourself - people who want more
>>> flexibility will be able to use standard Java text-parsing utilities on a
>>> ReadableFile, without involving TextIO.
>>>
>>> Same applies for XmlIO: it is specifically designed for the narrow use
>>> case
>>> where the files contain independent data entries, so returning an
>>> unordered
>>> soup of them, with no association to the original file, is the user's
>>> intention. XmlIO will not work for processing more complex XML files that
>>> are not simply a sequence of entries with the same tag, and it also does
>>> not remember the original filename.
>>>
>>
>> OK...
>>
>>> However, if my understanding of Tika use cases is correct, it is mainly
>>> used for extracting content from complex file formats - for example,
>>> extracting text and images from PDF files or Word documents. I believe
>>> this
>>> is the main difference between it and TextIO - people usually use Tika for
>>> complex use cases where the "unordered soup of stuff" abstraction is not
>>> useful.
>>>
>>> My suspicion about this is confirmed by the fact that the crux of the Tika
>>> API is ContentHandler
>>>
>>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true
>>> whose
>>> documentation says "The order of events in this interface is very
>>> important, and mirrors the order of information in the document itself."
>>
>> All that says is that a (Tika) ContentHandler will be a true SAX
>> ContentHandler...
>>>
>>>
>>> Let me give a few examples of what I think is possible with the raw Tika
>>> API, but I think is not currently possible with TikaIO - please correct me
>>> where I'm wrong, because I'm not particularly familiar with Tika and am
>>> judging just based on what I read about it.
>>> - User has 100,000 Word documents and wants to convert each of them to
>>> text
>>> files for future natural language processing.
>>> - User has 100,000 PDF files with financial statements, each containing a
>>> bunch of unrelated text and - the main content - a list of transactions in
>>> PDF tables. User wants to extract each transaction as a PCollection
>>> element, discarding the unrelated text.
>>> - User has 100,000 PDF files with scientific papers, and wants to extract
>>> text from them, somehow parse author and affiliation from the text, and
>>> compute statistics of topics and terminology usage by author name and
>>> affiliation.
>>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>>> observing a location over time: they want to extract metadata from each
>>> image using Tika, analyze the images themselves using some other library,
>>> and detect anomalies in the overall appearance of the location over time
>>> as
>>> seen from multiple cameras.
>>> I believe all of these cases can not be solved with TikaIO because the
>>> resulting PCollection<String> contains no information about which String
>>> comes from which document and about the order in which they appear in the
>>> document.
>>
>> These are good use cases, thanks... I thought you were talking about
>> the unordered soup of data produced by TikaIO (and its friends TextIO and
>> the like :-)).
>> Putting the ordered vs unordered question aside for a sec, why exactly a
>> Tika Reader can not make the name of the file it's currently reading from
>> available to the pipeline, as some Beam pipeline metadata piece ?
>> Surely it can be possible with Beam ? If not then I would be surprised...
>>
>>>
>>> I am, honestly, struggling to think of a case where I would want to use
>>> Tika, but where I *would* be ok with getting an unordered soup of strings.
>>> So some examples would be very helpful.
>>>
>> Yes. I'll ask Tika developers to help with some examples, but I'll give one
>> example where it did not matter to us in what order Tika-produced data were
>> available to the downstream layer.
>>
>> It's a demo the Apache CXF colleague of mine showed at one of Apache Con
>> NAs, and we had a happy audience:
>>
>> https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search
>>
>> PDF or ODT files uploaded, Tika parses them, and all of that is put into
>> Lucene. We associate a file name with the indexed content and then let users
>> find a list of PDF files which contain a given word or few words, details
>> are here
>> https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L131
>>
>> I'd say even more involved search engines would not mind supporting a case
>> like that :-)
>>
>> Now there we process one file at a time, and I understand now that with
>> TikaIO and N files it's all over the place really as far as the ordering is
>> concerned, which file it's coming from. etc. That's why TikaReader must be
>> able to associate the file name with a given piece of text it's making
>> available to the pipeline.
>>
>> I'd be happy to support the ParDo way of linking Tika with Beam.
>> If it makes things simpler then it would be good, I've just no idea at the
>> moment how to start the pipeline without using a Source/Reader,
>> but I'll learn :-).
> 
> This would be the (as yet unreleased) FileIO.readMatches and friends:
> 
> https://github.com/apache/beam/blob/6d4a78517708db3bd89cfeff5a7e62fb6b948e1d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L88

OK, thanks;
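Jotting down how I currently imagine that would look - a rough, untested sketch that assumes the FileIO.match()/readMatches() names from the linked source and Tika's AutoDetectParser/BodyContentHandler from Tika 1.x; the file pattern and variable names are mine:

```java
// Rough, untested sketch: FileIO.match()/readMatches() per the linked source,
// AutoDetectParser/BodyContentHandler per Tika 1.x; names are assumptions.
PCollection<KV<String, String>> docs = p
    .apply(FileIO.match().filepattern("/data/docs/*"))
    .apply(FileIO.readMatches())
    .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        FileIO.ReadableFile f = c.element();
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        new AutoDetectParser().parse(
            Channels.newInputStream(f.open()), handler, new Metadata());
        // Keep the file name together with the extracted text.
        c.output(KV.of(f.getMetadata().resourceId().toString(), handler.toString()));
      }
    }));
```

Outputting KV pairs like this would also keep the association between a document and its extracted text.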
> 
>> Re the sync issue I mentioned earlier - how can one
>> avoid it with ParDo when implementing a 'min len chunk' feature, where the
>> ParDo would have to concatenate several SAX data pieces first before making
>> a single composite piece to the pipeline ?
>>
>>
>>> Another way to state it: currently, if I wanted to solve all of the use
>>> cases above, I'd just use FileIO.readMatches() and use the Tika API myself
>>> on the resulting ReadableFile. How can we make TikaIO provide a usability
>>> improvement over such usage?
> 
> +1, this was exactly the same question I had.

The TikaIO PR was more than 3 months old by the time it got merged. I'm 
pretty sure that in one of my JIRA comments I mentioned I'd welcome 
feedback from the whole team.

I realize that one can just start a pipeline with the soon-to-be-released 
FileIO and do something very specific with some files in the functions.
Jumping a bit ahead, but IMHO it's still useful to have utility 
support for working with Tika. In my own work I see users adopting a 
certain feature much faster when there's utility support, even 
though in our project we have all the support for people writing their 
own custom features...

> 
>> If you are actually asking whether it really makes sense for Beam to ship
>> Tika-related code, given that users can just do it themselves - I'm not sure.
>>
>> IMHO it always works better if users have to provide just a few config options
>> to an integral part of the framework and see things happening.
>> It will bring more users.
>>
>> Whether the current Tika code (refactored or not) stays with Beam or not -
>> I'll let you and the team decide; believe it or not I was seriously
>> contemplating at the last moment to make it all part of the Tika project
>> itself and have a bit more flexibility over there with tweaking things, but
>> now that it is in the Beam snapshot - I don't know - it's not my decision...
> 
> It is always an interesting question when one has two libraries X and
> Y, plus some utility code that makes X work well with Y, where this
> utility code should live. If this can be expressed primarily as X
> which calls functions using Y (in this particular example, Tika being
> invoked in the body of a DoFn) there might not even be much such
> library code (short of examples and documentation which can go a long
> way here). On the other hand, in some cases there are advantages to
> having a hybrid XY component that interleaves or otherwise joins
> together the libraries in common or non-trivial ways--worth exploring
> if that's the case here.
+1
> 
>>> I am confused by your other comment - "Does the ordering matter ?  Perhaps
>>> for some cases it does, and for some it does not. May be it makes sense to
>>> support running TikaIO as both the bounded reader/source and ParDo, with
>>> getting the common code reused." - because using BoundedReader or ParDo is
>>> not related to the ordering issue, only to the issue of asynchronous
>>> reading and complexity of implementation. The resulting PCollection will
>>> be
>>> unordered either way - this needs to be solved separately by providing a
>>> different API.
>>
>> Right, I see now: ParDo is not about making Tika-reported data available
>> to the downstream pipeline components in order, only about the simpler
>> implementation.
>> Association with the file should be possible I hope, but I understand it
>> would be possible to optionally make the data coming out in the ordered way
>> as well...
>>
>> Assuming TikaIO stays, and before trying to re-implement as ParDo, let me
>> double check: should we still give some thought to the possible performance
>> benefit of the current approach ? As I said, I can easily get rid of all
>> that polling code, use a simple Blocking queue.
> 
> It's also a model and API question. For example, as mentioned above,
> if it makes sense to invoke Tika entirely within the body of a DoFn
> (where the input is a filename, and the output is interesting
> data/chunks/whatever) to achieve the desired results this means one
> doesn't need to worry about plumbing all the (likely evolving)
> configuration and other options through from some Beam API through to
> whatever interacts with the Tika objects. This helps with tooling,
> documentation, user support, etc. as well as simply being more modular
> and there being less code to write and maintain.
Well, as far as Tika is concerned, the way it is configured is not 
going to change; I can't think of a reason why it would.
Speaking about the tooling: IMHO it will be easier for teams 
considering wiring Tika into Beam to have a Beam TikaIO component.
The custom approach won't really make it into the tooling...
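For the record, the kind of "few config options" usage I mean is a one-liner like the following - purely illustrative, since TikaIO's API surface is exactly what is under review here:

```java
// Illustrative only: read()/from() reflect the TikaIO PR at the time of
// writing, not a settled API.
PCollection<String> content = p.apply(TikaIO.read().from("/data/docs/*"));
```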

Thanks, Sergey
> 
>>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin <sb...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> Glad TikaIO is getting some serious attention :-), I believe one thing we
>>>> both agree upon is that Tika can help Beam in its own unique way.
>>>>
>>>> Before trying to reply online, I'd like to state that my main assumption
>>>> is that TikaIO (as far as the read side is concerned) is no different to
>>>> Text, XML or similar bounded reader components.
>>>>
>>>> I have to admit I don't understand your questions about TikaIO usecases.
>>>>
>>>> What are the TextIO or XML input use cases? These use cases are
>>>> Tika input cases as well; the only difference is Tika cannot split an
>>>> individual file into a sequence of sources, etc.
>>>>
>>>> TextIO can read from the plain text files (possibly zipped), XML -
>>>> optimized around reading from the XML files, and I thought I made it
>>>> clear (and it is a known fact anyway) Tika was about reading basically
>>>> from any file format.
>>>>
>>>> Where is the difference (apart from what I've already mentioned) ?
>>>>
>>>> Sergey
>>>>
>>>>
>>>>
>>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Replies inline.
>>>>>
>>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin <sb...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi All
>>>>>>
>>>>>> This is my first post to the dev list, I work for Talend, I'm a Beam
>>>>>> novice, Apache Tika fan, and thought it would be really great to try
>>>>>> and
>>>>>> link both projects together, which led me to opening [1] where I typed
>>>>>> some early thoughts, followed by PR [2].
>>>>>>
>>>>>> I noticed yesterday I had the robust :-) (but useful and helpful) newer
>>>>>> review comments from Eugene pending, so I'd like to summarize a bit why
>>>>>> I did TikaIO (reader) the way I did, and then decide, based on the
>>>>>> feedback from the experts, what to do next.
>>>>>>
>>>>>> Apache Tika Parsers report the text content in chunks, via SaxParser
>>>>>> events. It's not possible with Tika to take a file and read it bit by
>>>>>> bit at the 'initiative' of the Beam Reader, line by line, the only way
>>>>>> is to handle the SAXParser callbacks which report the data chunks. Some
>>>>>> parsers may report the complete lines, some individual words, with some
>>>>>> being able to report the data only after they completely parse the
>>>>>> document.
>>>>>> All depends on the data format.
>>>>>>
>>>>>> At the moment TikaIO's TikaReader does not use the Beam threads to
>>>>>> parse
>>>>>> the files, Beam threads will only collect the data from the internal
>>>>>> queue where the internal TikaReader's thread will put the data into
>>>>>> (note the data chunks are ordered even though the tests might suggest
>>>>>> otherwise).
>>>>>>
>>>>> I agree that your implementation of reader returns records in order -
>>>>> but
>>>>> Beam PCollection's are not ordered. Nothing in Beam cares about the
>>>>> order
>>>>> in which records are produced by a BoundedReader - the order produced by
>>>>> your reader is ignored, and when applying any transforms to the
>>>>
>>>> PCollection
>>>>>
>>>>> produced by TikaIO, it is impossible to recover the order in which your
>>>>> reader returned the records.
>>>>>
>>>>> With that in mind, is PCollection<String>, containing individual
>>>>> Tika-detected items, still the right API for representing the result of
>>>>> parsing a large number of documents with Tika?
>>>>>
>>>>>
>>>>>>
>>>>>> The reason I did it was because I thought
>>>>>>
>>>>>> 1) it would make the individual data chunks available faster to the
>>>>>> pipeline - the parser will continue working via the binary/video etc
>>>>>> file while the data will already start flowing - I agree there should
>>>>>> be
>>>>>> some tests data available confirming it - but I'm positive at the
>>>>>> moment
>>>>>> this approach might yield some performance gains with the large sets.
>>>>>> If
>>>>>> the file is large, if it has the embedded attachments/videos to deal
>>>>>> with, then it may be more effective not to get the Beam thread deal
>>>>>> with
>>>>>> it...
>>>>>>
>>>>>> As I said on the PR, this description contains unfounded and
>>>>>> potentially
>>>>>
>>>>> incorrect assumptions about how Beam runners execute (or may execute in
>>>>
>>>> the
>>>>>
>>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>>
>>>> correctly,
>>>>>
>>>>> you might be assuming that:
>>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>>
>>>> complete
>>>>>
>>>>> before processing its outputs with downstream transforms
>>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>>
>>>> *concurrently*
>>>>>
>>>>> with downstream processing of its results
>>>>> - Passing an element from one thread to another using a BlockingQueue is
>>>>> free in terms of performance
>>>>> All of these are false at least in some runners, and I'm almost certain
>>>>> that in reality, performance of this approach is worse than a ParDo in
>>>>
>>>> most
>>>>>
>>>>> production runners.
>>>>>
>>>>> There are other disadvantages to this approach:
>>>>> - Doing the bulk of the processing in a separate thread makes it
>>>>
>>>> invisible
>>>>>
>>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>>> profiling capabilities, or the ability to get the current stack trace
>>>>> for
>>>>> stuck elements, this approach would make the real processing invisible
>>>>> to
>>>>> all of these capabilities, and a user would only see that the bulk of
>>>>> the
>>>>> time is spent waiting for the next element, but not *why* the next
>>>>
>>>> element
>>>>>
>>>>> is taking long to compute.
>>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>>> invisible
>>>>> to Beam, will make it harder for runners to do autoscaling, binpacking
>>>>
>>>> and
>>>>>
>>>>> other resource management magic (how much of this runners actually do is
>>>>
>>>> a
>>>>>
>>>>> separate issue), because the runner will have no way of knowing how much
>>>>> CPU/IO this particular transform is actually using - all the processing
>>>>> happens in a thread about which the runner is unaware.
>>>>> - As far as I can tell, the code also hides exceptions that happen in
>>>>> the
>>>>> Tika thread
>>>>> - Adding the thread management makes the code much more complex, easier
>>>>
>>>> to
>>>>>
>>>>> introduce bugs, and harder for others to contribute
>>>>>
>>>>>
>>>>>> 2) As I commented at the end of [2], having an option to concatenate
>>>>>> the
>>>>>> data chunks first before making them available to the pipeline is
>>>>>> useful, and I guess doing the same in ParDo would introduce some
>>>>>> synchronization issues (though not exactly sure yet)
>>>>>>
>>>>> What are these issues?
>>>>>
>>>>>
>>>>>>
>>>>>> One of valid concerns there is that the reader is polling the internal
>>>>>> queue so, in theory at least, and perhaps in some rare cases too, we
>>>>>> may
>>>>>> have a case where the max polling time has been reached, the parser is
>>>>>> still busy, and TikaIO fails to report all the file data. I think that
>>>>>> it can be solved by either 2a) configuring the max polling time to a
>>>>>> very large number which will never be reached for a practical case, or
>>>>>> 2b) simply use a blocking queue without the time limits - in the worst
>>>>>> case, if TikaParser spins and fails to report the end of the document,
>>>>>> then, Beam can heal itself if the pipeline blocks.
>>>>>> I propose to follow 2b).
>>>>>>
>>>>> I agree that there should be no way to unintentionally configure the
>>>>> transform in a way that will produce silent data loss. Another reason
>>>>> for
>>>>> not having these tuning knobs is that it goes against Beam's "no knobs"
>>>>> philosophy, and that in most cases users have no way of figuring out a
>>>>
>>>> good
>>>>>
>>>>> value for tuning knobs except for manual experimentation, which is
>>>>> extremely brittle and typically gets immediately obsoleted by running on
>>>>
>>>> a
>>>>>
>>>>> new dataset or updating a version of some of the involved dependencies
>>>>
>>>> etc.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Please let me know what you think.
>>>>>> My plan so far is:
>>>>>> 1) start addressing most of Eugene's comments which would require some
>>>>>> minor TikaIO updates
>>>>>> 2) work on removing the TikaSource internal code dealing with File
>>>>>> patterns which I copied from TextIO at the next stage
>>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam users
>>>>>> some
>>>>>> time to try it with some real complex files and also decide if TikaIO
>>>>>> can continue implemented as a BoundedSource/Reader or not
>>>>>>
>>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>>
>>>>> Yes, but I think we should start by discussing the anticipated use cases
>>>>
>>>> of
>>>>>
>>>>> TikaIO and designing an API for it based on those use cases; and then
>>>>> see
>>>>> what's the best implementation for that particular API and set of
>>>>> anticipated use cases.
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks, Sergey
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>>
>>>>>
>>>>
>>>
>>

Re: TikaIO concerns

Posted by Robert Bradshaw <ro...@google.com.INVALID>.
On Wed, Sep 20, 2017 at 2:17 PM, Sergey Beryozkin <sb...@gmail.com> wrote:
> Hi,
>
> thanks for the explanations,
>
> On 20/09/17 16:41, Eugene Kirpichov wrote:
>>
>> Hi!
>>
>> TextIO returns an unordered soup of lines contained in all files you ask
>> it
>> to read. People usually use TextIO for reading files where 1 line
>> corresponds to 1 independent data element, e.g. a log entry, or a row of a
>> CSV file - so discarding order is ok.
>
> Just a side note, I'd probably want that to be ordered, though I guess it
> depends...
>>
>> However, there is a number of cases where TextIO is a poor fit:
>> - Cases where discarding order is not ok - e.g. if you're doing natural
>> language processing and the text files contain actual prose, where you
>> need
>> to process a file as a whole. TextIO can't do that.
>> - Cases where you need to remember which file each element came from, e.g.
>> if you're creating a search index for the files: TextIO can't do this
>> either.
>>
>> Both of these issues have been raised in the past against TextIO; however
>> it seems that the overwhelming majority of users of TextIO use it for logs
>> or CSV files or alike, so solving these issues has not been a priority.
>> Currently they are solved in a general form via FileIO.read() which gives
>> you access to reading a full file yourself - people who want more
>> flexibility will be able to use standard Java text-parsing utilities on a
>> ReadableFile, without involving TextIO.
>>
>> Same applies for XmlIO: it is specifically designed for the narrow use
>> case
>> where the files contain independent data entries, so returning an
>> unordered
>> soup of them, with no association to the original file, is the user's
>> intention. XmlIO will not work for processing more complex XML files that
>> are not simply a sequence of entries with the same tag, and it also does
>> not remember the original filename.
>>
>
> OK...
>
>> However, if my understanding of Tika use cases is correct, it is mainly
>> used for extracting content from complex file formats - for example,
>> extracting text and images from PDF files or Word documents. I believe
>> this
>> is the main difference between it and TextIO - people usually use Tika for
>> complex use cases where the "unordered soup of stuff" abstraction is not
>> useful.
>>
>> My suspicion about this is confirmed by the fact that the crux of the Tika
>> API is ContentHandler
>>
>> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true
>> whose
>> documentation says "The order of events in this interface is very
>> important, and mirrors the order of information in the document itself."
>
> All that says is that a (Tika) ContentHandler will be a true SAX
> ContentHandler...
>>
>>
>> Let me give a few examples of what I think is possible with the raw Tika
>> API, but I think is not currently possible with TikaIO - please correct me
>> where I'm wrong, because I'm not particularly familiar with Tika and am
>> judging just based on what I read about it.
>> - User has 100,000 Word documents and wants to convert each of them to
>> text
>> files for future natural language processing.
>> - User has 100,000 PDF files with financial statements, each containing a
>> bunch of unrelated text and - the main content - a list of transactions in
>> PDF tables. User wants to extract each transaction as a PCollection
>> element, discarding the unrelated text.
>> - User has 100,000 PDF files with scientific papers, and wants to extract
>> text from them, somehow parse author and affiliation from the text, and
>> compute statistics of topics and terminology usage by author name and
>> affiliation.
>> - User has 100,000 photos in JPEG made by a set of automatic cameras
>> observing a location over time: they want to extract metadata from each
>> image using Tika, analyze the images themselves using some other library,
>> and detect anomalies in the overall appearance of the location over time
>> as
>> seen from multiple cameras.
>> I believe all of these cases can not be solved with TikaIO because the
>> resulting PCollection<String> contains no information about which String
>> comes from which document and about the order in which they appear in the
>> document.
>
> These are good use cases, thanks... I thought you were talking about
> the unordered soup of data produced by TikaIO (and its friends TextIO and
> alike :-)).
> Putting the ordered vs unordered question aside for a sec, why exactly can a
> Tika Reader not make the name of the file it's currently reading from
> available to the pipeline, as some piece of Beam pipeline metadata?
> Surely it must be possible with Beam? If not then I would be surprised...
>
>>
>> I am, honestly, struggling to think of a case where I would want to use
>> Tika, but where I *would* be ok with getting an unordered soup of strings.
>> So some examples would be very helpful.
>>
> Yes. I'll ask Tika developers to help with some examples, but I'll give one
> example where it did not matter to us in what order Tika-produced data were
> available to the downstream layer.
>
> It's a demo the Apache CXF colleague of mine showed at one of Apache Con
> NAs, and we had a happy audience:
>
> https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search
>
> PDF or ODT files uploaded, Tika parses them, and all of that is put into
> Lucene. We associate a file name with the indexed content and then let users
> find a list of PDF files which contain a given word or few words, details
> are here
> https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L131
>
> I'd say even more involved search engines would not mind supporting a case
> like that :-)
>
> Now there we process one file at a time, and I understand now that with
> TikaIO and N files it's all over the place really as far as the ordering is
> concerned, which file it's coming from, etc. That's why TikaReader must be
> able to associate the file name with a given piece of text it's making
> available to the pipeline.
>
> I'd be happy to support the ParDo way of linking Tika with Beam.
> If it makes things simpler then it would be good, I've just no idea at the
> moment how to start the pipeline without using a Source/Reader,
> but I'll learn :-).

This would be the (as yet unreleased) FileIO.readMatches and friends:

https://github.com/apache/beam/blob/6d4a78517708db3bd89cfeff5a7e62fb6b948e1d/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L88

> Re the sync issue I mentioned earlier - how can one
> avoid it with ParDo when implementing a 'min len chunk' feature, where the
> ParDo would have to concatenate several SAX data pieces first before making
> a single composite piece to the pipeline ?
>
>
>> Another way to state it: currently, if I wanted to solve all of the use
>> cases above, I'd just use FileIO.readMatches() and use the Tika API myself
>> on the resulting ReadableFile. How can we make TikaIO provide a usability
>> improvement over such usage?

+1, this was exactly the same question I had.

> If you are actually asking whether it really makes sense for Beam to ship
> Tika-related code, given that users can just do it themselves - I'm not sure.
>
> IMHO it always works better if users have to provide just a few config options
> to an integral part of the framework and see things happening.
> It will bring more users.
>
> Whether the current Tika code (refactored or not) stays with Beam or not -
> I'll let you and the team decide; believe it or not I was seriously
> contemplating at the last moment to make it all part of the Tika project
> itself and have a bit more flexibility over there with tweaking things, but
> now that it is in the Beam snapshot - I don't know - it's not my decision...

It is always an interesting question when one has two libraries X and
Y, plus some utility code that makes X work well with Y, where this
utility code should live. If this can be expressed primarily as X
which calls functions using Y (in this particular example, Tika being
invoked in the body of a DoFn) there might not even be much such
library code (short of examples and documentation which can go a long
way here). On the other hand, in some cases there are advantages to
having a hybrid XY component that interleaves or otherwise joins
together the libraries in common or non-trivial ways--worth exploring
if that's the case here.

>> I am confused by your other comment - "Does the ordering matter ?  Perhaps
>> for some cases it does, and for some it does not. May be it makes sense to
>> support running TikaIO as both the bounded reader/source and ParDo, with
>> getting the common code reused." - because using BoundedReader or ParDo is
>> not related to the ordering issue, only to the issue of asynchronous
>> reading and complexity of implementation. The resulting PCollection will
>> be
>> unordered either way - this needs to be solved separately by providing a
>> different API.
>
> Right, I see now: ParDo is not about making Tika-reported data available
> to the downstream pipeline components in order, only about the simpler
> implementation.
> Association with the file should be possible I hope, but I understand it
> would be possible to optionally make the data coming out in the ordered way
> as well...
>
> Assuming TikaIO stays, and before trying to re-implement as ParDo, let me
> double check: should we still give some thought to the possible performance
> benefit of the current approach ? As I said, I can easily get rid of all
> that polling code, use a simple Blocking queue.

It's also a model and API question. For example, as mentioned above,
if it makes sense to invoke Tika entirely within the body of a DoFn
(where the input is a filename, and the output is interesting
data/chunks/whatever) to achieve the desired results this means one
doesn't need to worry about plumbing all the (likely evolving)
configuration and other options through from some Beam API through to
whatever interacts with the Tika objects. This helps with tooling,
documentation, user support, etc. as well as simply being more modular
and there being less code to write and maintain.
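For instance, the Tika side of such a DoFn could plausibly be no more than a small helper like this (untested sketch, assuming Tika's AutoDetectParser and BodyContentHandler):

```java
// Untested sketch: every Tika configuration choice stays local to the DoFn,
// so nothing has to be plumbed through a Beam-level API.
static String extractText(InputStream in) throws Exception {
  BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
  new AutoDetectParser().parse(in, handler, new Metadata());
  return handler.toString();
}
```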

>> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin <sb...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> Glad TikaIO is getting some serious attention :-), I believe one thing we
>>> both agree upon is that Tika can help Beam in its own unique way.
>>>
>>> Before trying to reply online, I'd like to state that my main assumption
>>> is that TikaIO (as far as the read side is concerned) is no different to
>>> Text, XML or similar bounded reader components.
>>>
>>> I have to admit I don't understand your questions about TikaIO usecases.
>>>
>>> What are the TextIO or XML input use cases? These use cases are
>>> Tika input cases as well; the only difference is Tika cannot split an
>>> individual file into a sequence of sources, etc.
>>>
>>> TextIO can read from the plain text files (possibly zipped), XML -
>>> optimized around reading from the XML files, and I thought I made it
>>> clear (and it is a known fact anyway) Tika was about reading basically
>>> from any file format.
>>>
>>> Where is the difference (apart from what I've already mentioned) ?
>>>
>>> Sergey
>>>
>>>
>>>
>>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>>>
>>>> Hi,
>>>>
>>>> Replies inline.
>>>>
>>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin <sb...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi All
>>>>>
>>>>> This is my first post to the dev list, I work for Talend, I'm a Beam
>>>>> novice, Apache Tika fan, and thought it would be really great to try
>>>>> and
>>>>> link both projects together, which led me to opening [1] where I typed
>>>>> some early thoughts, followed by PR [2].
>>>>>
>>>>> I noticed yesterday I had the robust :-) (but useful and helpful) newer
>>>>> review comments from Eugene pending, so I'd like to summarize a bit why
>>>>> I did TikaIO (reader) the way I did, and then decide, based on the
>>>>> feedback from the experts, what to do next.
>>>>>
>>>>> Apache Tika Parsers report the text content in chunks, via SaxParser
>>>>> events. It's not possible with Tika to take a file and read it bit by
>>>>> bit at the 'initiative' of the Beam Reader, line by line, the only way
>>>>> is to handle the SAXParser callbacks which report the data chunks. Some
>>>>> parsers may report the complete lines, some individual words, with some
>>>>> being able to report the data only after they completely parse the
>>>>> document.
>>>>> All depends on the data format.
>>>>>
>>>>> At the moment TikaIO's TikaReader does not use the Beam threads to
>>>>> parse
>>>>> the files, Beam threads will only collect the data from the internal
>>>>> queue where the internal TikaReader's thread will put the data into
>>>>> (note the data chunks are ordered even though the tests might suggest
>>>>> otherwise).
>>>>>
>>>> I agree that your implementation of reader returns records in order -
>>>> but
>>>> Beam PCollection's are not ordered. Nothing in Beam cares about the
>>>> order
>>>> in which records are produced by a BoundedReader - the order produced by
>>>> your reader is ignored, and when applying any transforms to the
>>>
>>> PCollection
>>>>
>>>> produced by TikaIO, it is impossible to recover the order in which your
>>>> reader returned the records.
>>>>
>>>> With that in mind, is PCollection<String>, containing individual
>>>> Tika-detected items, still the right API for representing the result of
>>>> parsing a large number of documents with Tika?
>>>>
>>>>
>>>>>
>>>>> The reason I did it was because I thought
>>>>>
>>>>> 1) it would make the individual data chunks available faster to the
>>>>> pipeline - the parser will continue working via the binary/video etc
>>>>> file while the data will already start flowing - I agree there should
>>>>> be
>>>>> some tests data available confirming it - but I'm positive at the
>>>>> moment
>>>>> this approach might yield some performance gains with the large sets.
>>>>> If
>>>>> the file is large, if it has the embedded attachments/videos to deal
>>>>> with, then it may be more effective not to get the Beam thread deal
>>>>> with
>>>>> it...
>>>>>
>>>>> As I said on the PR, this description contains unfounded and
>>>>> potentially
>>>>
>>>> incorrect assumptions about how Beam runners execute (or may execute in
>>>
>>> the
>>>>
>>>> future) a ParDo or a BoundedReader. For example, if I understand
>>>
>>> correctly,
>>>>
>>>> you might be assuming that:
>>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>>>
>>> complete
>>>>
>>>> before processing its outputs with downstream transforms
>>>> - Beam runners can not run a @ProcessElement call of a ParDo
>>>
>>> *concurrently*
>>>>
>>>> with downstream processing of its results
>>>> - Passing an element from one thread to another using a BlockingQueue is
>>>> free in terms of performance
>>>> All of these are false at least in some runners, and I'm almost certain
>>>> that in reality, performance of this approach is worse than a ParDo in
>>>
>>> most
>>>>
>>>> production runners.
>>>>
>>>> There are other disadvantages to this approach:
>>>> - Doing the bulk of the processing in a separate thread makes it
>>>
>>> invisible
>>>>
>>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>>> profiling capabilities, or the ability to get the current stack trace
>>>> for
>>>> stuck elements, this approach would make the real processing invisible
>>>> to
>>>> all of these capabilities, and a user would only see that the bulk of
>>>> the
>>>> time is spent waiting for the next element, but not *why* the next
>>>
>>> element
>>>>
>>>> is taking long to compute.
>>>> - Likewise, offloading all the CPU and IO to a separate thread,
>>>> invisible
>>>> to Beam, will make it harder for runners to do autoscaling, binpacking
>>>
>>> and
>>>>
>>>> other resource management magic (how much of this runners actually do is
>>>
>>> a
>>>>
>>>> separate issue), because the runner will have no way of knowing how much
>>>> CPU/IO this particular transform is actually using - all the processing
>>>> happens in a thread about which the runner is unaware.
>>>> - As far as I can tell, the code also hides exceptions that happen in
>>>> the
>>>> Tika thread
>>>> - Adding the thread management makes the code much more complex, easier
>>>
>>> to
>>>>
>>>> introduce bugs, and harder for others to contribute
>>>>
>>>>
>>>>> 2) As I commented at the end of [2], having an option to concatenate
>>>>> the
>>>>> data chunks first before making them available to the pipeline is
>>>>> useful, and I guess doing the same in ParDo would introduce some
>>>>> synchronization issues (though not exactly sure yet)
>>>>>
>>>> What are these issues?
>>>>
>>>>
>>>>>
>>>>> One of valid concerns there is that the reader is polling the internal
>>>>> queue so, in theory at least, and perhaps in some rare cases too, we
>>>>> may
>>>>> have a case where the max polling time has been reached, the parser is
>>>>> still busy, and TikaIO fails to report all the file data. I think that
>>>>> it can be solved by either 2a) configuring the max polling time to a
>>>>> very large number which will never be reached for a practical case, or
>>>>> 2b) simply use a blocking queue without the time limits - in the worst
>>>>> case, if TikaParser spins and fails to report the end of the document,
>>>>> then, Beam can heal itself if the pipeline blocks.
>>>>> I propose to follow 2b).
>>>>>
>>>> I agree that there should be no way to unintentionally configure the
>>>> transform in a way that will produce silent data loss. Another reason
>>>> for
>>>> not having these tuning knobs is that it goes against Beam's "no knobs"
>>>> philosophy, and that in most cases users have no way of figuring out a
>>>
>>> good
>>>>
>>>> value for tuning knobs except for manual experimentation, which is
>>>> extremely brittle and typically gets immediately obsoleted by running on a
>>>> new dataset or updating a version of some of the involved dependencies etc.
>>>>
>>>>
>>>>>
>>>>>
>>>>> Please let me know what you think.
>>>>> My plan so far is:
>>>>> 1) start addressing most of Eugene's comments which would require some
>>>>> minor TikaIO updates
>>>>> 2) work on removing the TikaSource internal code dealing with File
>>>>> patterns which I copied from TextIO at the next stage
>>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam users
>>>>> some
>>>>> time to try it with some real complex files and also decide if TikaIO
>>>>> can continue implemented as a BoundedSource/Reader or not
>>>>>
>>>>> Eugene, all, will it work if I start with 1) ?
>>>>>
>>>> Yes, but I think we should start by discussing the anticipated use cases of
>>>> TikaIO and designing an API for it based on those use cases; and then
>>>> see
>>>> what's the best implementation for that particular API and set of
>>>> anticipated use cases.
>>>>
>>>>
>>>>>
>>>>> Thanks, Sergey
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>>> [2] https://github.com/apache/beam/pull/3378
>>>>>
>>>>
>>>
>>
>

Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi,

thanks for the explanations,

On 20/09/17 16:41, Eugene Kirpichov wrote:
> Hi!
> 
> TextIO returns an unordered soup of lines contained in all files you ask it
> to read. People usually use TextIO for reading files where 1 line
> corresponds to 1 independent data element, e.g. a log entry, or a row of a
> CSV file - so discarding order is ok.
Just a side note: I'd probably want that to be ordered, though I guess it 
depends...
> However, there is a number of cases where TextIO is a poor fit:
> - Cases where discarding order is not ok - e.g. if you're doing natural
> language processing and the text files contain actual prose, where you need
> to process a file as a whole. TextIO can't do that.
> - Cases where you need to remember which file each element came from, e.g.
> if you're creating a search index for the files: TextIO can't do this
> either.
> 
> Both of these issues have been raised in the past against TextIO; however
> it seems that the overwhelming majority of users of TextIO use it for logs
> or CSV files or alike, so solving these issues has not been a priority.
> Currently they are solved in a general form via FileIO.read() which gives
> you access to reading a full file yourself - people who want more
> flexibility will be able to use standard Java text-parsing utilities on a
> ReadableFile, without involving TextIO.
> 
> Same applies for XmlIO: it is specifically designed for the narrow use case
> where the files contain independent data entries, so returning an unordered
> soup of them, with no association to the original file, is the user's
> intention. XmlIO will not work for processing more complex XML files that
> are not simply a sequence of entries with the same tag, and it also does
> not remember the original filename.
> 

OK...

> However, if my understanding of Tika use cases is correct, it is mainly
> used for extracting content from complex file formats - for example,
> extracting text and images from PDF files or Word documents. I believe this
> is the main difference between it and TextIO - people usually use Tika for
> complex use cases where the "unordered soup of stuff" abstraction is not
> useful.
> 
> My suspicion about this is confirmed by the fact that the crux of the Tika
> API is ContentHandler
> http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true
> whose
> documentation says "The order of events in this interface is very
> important, and mirrors the order of information in the document itself."
All that says is that a (Tika) ContentHandler will be a true SAX 
ContentHandler...
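For reference, here is a minimal self-contained sketch using only the JDK's 
SAX classes (no Tika dependency, but Tika's parsers drive this same 
interface): the characters() callbacks arrive strictly in document order, 
whatever granularity the parser chooses.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// A minimal SAX handler that records text chunks in the order the parser
// reports them - the same ordered-callback contract Tika's parsers follow.
public class ChunkCollectingHandler extends DefaultHandler {
    private final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        chunks.add(new String(ch, start, length));
    }

    public List<String> getChunks() {
        return chunks;
    }

    public static List<String> parse(String xml) throws Exception {
        ChunkCollectingHandler handler = new ChunkCollectingHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.getChunks();
    }

    public static void main(String[] args) throws Exception {
        // The chunks come out in document order; a parser may split one
        // text node into several characters() calls, so only the order and
        // the concatenation are guaranteed, not the chunk boundaries.
        System.out.println(parse("<doc><p>one</p><p>two</p></doc>"));
    }
}
```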
> 
> Let me give a few examples of what I think is possible with the raw Tika
> API, but I think is not currently possible with TikaIO - please correct me
> where I'm wrong, because I'm not particularly familiar with Tika and am
> judging just based on what I read about it.
> - User has 100,000 Word documents and wants to convert each of them to text
> files for future natural language processing.
> - User has 100,000 PDF files with financial statements, each containing a
> bunch of unrelated text and - the main content - a list of transactions in
> PDF tables. User wants to extract each transaction as a PCollection
> element, discarding the unrelated text.
> - User has 100,000 PDF files with scientific papers, and wants to extract
> text from them, somehow parse author and affiliation from the text, and
> compute statistics of topics and terminology usage by author name and
> affiliation.
> - User has 100,000 photos in JPEG made by a set of automatic cameras
> observing a location over time: they want to extract metadata from each
> image using Tika, analyze the images themselves using some other library,
> and detect anomalies in the overall appearance of the location over time as
> seen from multiple cameras.
> I believe all of these cases can not be solved with TikaIO because the
> resulting PCollection<String> contains no information about which String
> comes from which document and about the order in which they appear in the
> document.
These are good use cases, thanks... I thought you were talking only 
about the unordered soup of data produced by TikaIO (and its friends 
TextIO and alike :-)).
Putting the ordered vs unordered question aside for a second: why exactly 
can a Tika Reader not make the name of the file it is currently reading 
available to the pipeline, as a piece of Beam pipeline metadata ?
Surely that is possible with Beam ? If not then I would be surprised...
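Conceptually I'd expect something as simple as pairing each chunk with its 
file name, i.e. a PCollection of KV<fileName, text>-like values rather than 
bare Strings. A trivial sketch of the shape I mean (a hypothetical class, 
not the actual TikaIO API):

```java
// Hypothetical value class (not the actual TikaIO API): each piece of
// parsed text paired with the name of the file it came from, akin to a
// Beam KV<String, String>.
public class FileChunk {
    private final String fileName;
    private final String text;

    public FileChunk(String fileName, String text) {
        this.fileName = fileName;
        this.text = text;
    }

    public String getFileName() {
        return fileName;
    }

    public String getText() {
        return text;
    }

    @Override
    public String toString() {
        return fileName + ": " + text;
    }

    public static void main(String[] args) {
        // A downstream transform could then group chunks by file name again.
        FileChunk chunk = new FileChunk("statement.pdf", "some extracted text");
        System.out.println(chunk);
    }
}
```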

> 
> I am, honestly, struggling to think of a case where I would want to use
> Tika, but where I *would* be ok with getting an unordered soup of strings.
> So some examples would be very helpful.
> 
Yes. I'll ask Tika developers to help with some examples, but I'll give 
one example where it did not matter to us in what order Tika-produced 
data were available to the downstream layer.

It's a demo an Apache CXF colleague of mine showed at one of the 
ApacheCon NAs, and we had a happy audience:

https://github.com/apache/cxf/tree/master/distribution/src/main/release/samples/jax_rs/search

PDF or ODT files are uploaded, Tika parses them, and all of that is put 
into Lucene. We associate a file name with the indexed content and then 
let users find a list of PDF files which contain a given word or a few 
words; the details are here:
https://github.com/apache/cxf/blob/master/distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Catalog.java#L131

I'd say even more involved search engines would not mind supporting a 
case like that :-)

Now, there we process one file at a time, and I understand now that with 
TikaIO and N files it's all over the place really, as far as the ordering 
is concerned and which file a given chunk is coming from, etc. That's why 
the TikaReader must be able to associate the file name with each piece of 
text it makes available to the pipeline.

I'd be happy to support the ParDo way of linking Tika with Beam.
If it makes things simpler then it would be good; I've just no idea at 
the moment how to start the pipeline without using a Source/Reader,
but I'll learn :-). Re the sync issue I mentioned earlier: how can one 
avoid it with a ParDo when implementing a 'min len chunk' feature, where 
the ParDo would have to concatenate several SAX data pieces before 
emitting a single composite piece to the pipeline ?
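To make the 'min len chunk' idea concrete, here is a rough sketch (with 
hypothetical names) of the concatenation logic. Note that if each file is 
parsed inside a single @ProcessElement call, this is ordinary 
single-threaded state, so no synchronization would be needed:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical accumulator: buffers SAX text chunks and emits a composite
// piece once at least minLen characters have been collected. Driven from
// a single parsing thread, it needs no synchronization.
public class MinLenChunkAccumulator {
    private final int minLen;
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> emitted = new ArrayList<>();

    public MinLenChunkAccumulator(int minLen) {
        this.minLen = minLen;
    }

    // Called from the SAX characters() callback.
    public void add(String chunk) {
        buffer.append(chunk);
        if (buffer.length() >= minLen) {
            emitted.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    // Called once the parser reports the end of the document.
    public void finish() {
        if (buffer.length() > 0) {
            emitted.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    public List<String> getEmitted() {
        return emitted;
    }

    public static void main(String[] args) {
        MinLenChunkAccumulator acc = new MinLenChunkAccumulator(5);
        acc.add("ab");
        acc.add("cde");   // buffer reaches minLen, composite piece emitted
        acc.add("fg");
        acc.finish();     // flushes the remaining tail
        System.out.println(acc.getEmitted());
    }
}
```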


> Another way to state it: currently, if I wanted to solve all of the use
> cases above, I'd just use FileIO.readMatches() and use the Tika API myself
> on the resulting ReadableFile. How can we make TikaIO provide a usability
> improvement over such usage?
> 


If you are actually asking whether it really makes sense for Beam to ship
Tika-related code, given that users can just do it themselves - I'm not sure.

IMHO it always works better if users have to provide just a few config 
options to an integral part of the framework and see things happening.
It will bring more users.

Whether the current Tika code (refactored or not) stays with Beam or not 
- I'll let you and the team decide; believe it or not, I was seriously 
contemplating at the last moment making it all part of the Tika project 
itself, to have a bit more flexibility over there with tweaking things, 
but now that it is in the Beam snapshot - I don't know - it's not my 
decision...

> I am confused by your other comment - "Does the ordering matter ?  Perhaps
> for some cases it does, and for some it does not. May be it makes sense to
> support running TikaIO as both the bounded reader/source and ParDo, with
> getting the common code reused." - because using BoundedReader or ParDo is
> not related to the ordering issue, only to the issue of asynchronous
> reading and complexity of implementation. The resulting PCollection will be
> unordered either way - this needs to be solved separately by providing a
> different API.
Right, I see now: so ParDo is not about making Tika-reported data 
available to the downstream pipeline components in order, only about the 
simpler implementation.
Association with the file should be possible, I hope, and I understand it 
would also be possible to optionally make the data come out in an ordered 
way as well...

Assuming TikaIO stays, and before trying to re-implement it as a ParDo, 
let me double-check: should we still give some thought to the possible 
performance benefit of the current approach ? As I said, I can easily 
get rid of all that polling code and use a simple blocking queue.
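Roughly this shape - a sketch of option 2b with a hypothetical 
end-of-document marker, where put()/take() block instead of timing out, so 
no data can be silently dropped:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of option 2b: the parser thread put()s chunks into a blocking
// queue and finishes with a poison-pill marker; the reader take()s until
// it sees the marker. No poll timeouts, so no silent data loss.
public class BlockingHandoff {
    private static final String END_OF_DOCUMENT = "\u0000EOD"; // hypothetical marker

    public static List<String> run(List<String> parserChunks) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread parserThread = new Thread(() -> {
            try {
                for (String chunk : parserChunks) {
                    queue.put(chunk);          // would block if the queue were bounded and full
                }
                queue.put(END_OF_DOCUMENT);    // always signal completion
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        parserThread.start();

        List<String> received = new ArrayList<>();
        while (true) {
            String chunk = queue.take();       // blocks until data arrives
            if (END_OF_DOCUMENT.equals(chunk)) {
                break;
            }
            received.add(chunk);
        }
        parserThread.join();
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(List.of("chunk1", "chunk2", "chunk3")));
    }
}
```

If the parser spins and never reports the end of the document, the pipeline 
simply blocks, which is the failure mode discussed above.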

Cheers, Sergey
> 
> Thanks.
> 
> On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin <sb...@gmail.com>
> wrote:
> 
>> Hi
>>
>> Glad TikaIO getting some serious attention :-), I believe one thing we
>> both agree upon is that Tika can help Beam in its own unique way.
>>
>> Before trying to reply online, I'd like to state that my main assumption
>> is that TikaIO (as far as the read side is concerned) is no different to
>> Text, XML or similar bounded reader components.
>>
>> I have to admit I don't understand your questions about TikaIO usecases.
>>
>> What are the Text Input or XML input use-cases ? These use cases are
>> TikaInput cases as well, the only difference is Tika can not split the
>> individual file into a sequence of sources/etc,
>>
>> TextIO can read from the plain text files (possibly zipped), XML -
>> optimized around reading from the XML files, and I thought I made it
>> clear (and it is a known fact anyway) Tika was about reading basically
>> from any file format.
>>
>> Where is the difference (apart from what I've already mentioned) ?
>>
>> Sergey
>>
>>
>>
>> On 19/09/17 23:29, Eugene Kirpichov wrote:
>>> Hi,
>>>
>>> Replies inline.
>>>
>>> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin <sb...@gmail.com>
>>> wrote:
>>>
>>>> Hi All
>>>>
>>>> This is my first post to the dev list, I work for Talend, I'm a Beam
>>>> novice, Apache Tika fan, and thought it would be really great to try and
>>>> link both projects together, which led me to opening [1] where I typed
>>>> some early thoughts, followed by PR [2].
>>>>
>>>> I noticed yesterday I had the robust :-) (but useful and helpful) newer
>>>> review comments from Eugene pending, so I'd like to summarize a bit why
>>>> I did TikaIO (reader) the way I did, and then decide, based on the
>>>> feedback from the experts, what to do next.
>>>>
>>>> Apache Tika Parsers report the text content in chunks, via SaxParser
>>>> events. It's not possible with Tika to take a file and read it bit by
>>>> bit at the 'initiative' of the Beam Reader, line by line, the only way
>>>> is to handle the SAXParser callbacks which report the data chunks. Some
>>>> parsers may report the complete lines, some individual words, with some
>>>> being able to report the data only after they completely parse the document.
>>>> All depends on the data format.
>>>>
>>>> At the moment TikaIO's TikaReader does not use the Beam threads to parse
>>>> the files, Beam threads will only collect the data from the internal
>>>> queue where the internal TikaReader's thread will put the data into
>>>> (note the data chunks are ordered even though the tests might suggest
>>>> otherwise).
>>>>
>>> I agree that your implementation of reader returns records in order - but
>>> Beam PCollection's are not ordered. Nothing in Beam cares about the order
>>> in which records are produced by a BoundedReader - the order produced by
>>> your reader is ignored, and when applying any transforms to the
>> PCollection
>>> produced by TikaIO, it is impossible to recover the order in which your
>>> reader returned the records.
>>>
>>> With that in mind, is PCollection<String>, containing individual
>>> Tika-detected items, still the right API for representing the result of
>>> parsing a large number of documents with Tika?
>>>
>>>
>>>>
>>>> The reason I did it was because I thought
>>>>
>>>> 1) it would make the individual data chunks available faster to the
>>>> pipeline - the parser will continue working via the binary/video etc
>>>> file while the data will already start flowing - I agree there should be
>>>> some tests data available confirming it - but I'm positive at the moment
>>>> this approach might yield some performance gains with the large sets. If
>>>> the file is large, if it has the embedded attachments/videos to deal
>>>> with, then it may be more effective not to get the Beam thread deal with
>>>> it...
>>>>
>>>> As I said on the PR, this description contains unfounded and potentially
>>> incorrect assumptions about how Beam runners execute (or may execute in
>> the
>>> future) a ParDo or a BoundedReader. For example, if I understand
>> correctly,
>>> you might be assuming that:
>>> - Beam runners wait for a full @ProcessElement call of a ParDo to
>> complete
>>> before processing its outputs with downstream transforms
>>> - Beam runners can not run a @ProcessElement call of a ParDo
>> *concurrently*
>>> with downstream processing of its results
>>> - Passing an element from one thread to another using a BlockingQueue is
>>> free in terms of performance
>>> All of these are false at least in some runners, and I'm almost certain
>>> that in reality, performance of this approach is worse than a ParDo in
>> most
>>> production runners.
>>>
>>> There are other disadvantages to this approach:
>>> - Doing the bulk of the processing in a separate thread makes it
>> invisible
>>> to Beam's instrumentation. If a Beam runner provided per-transform
>>> profiling capabilities, or the ability to get the current stack trace for
>>> stuck elements, this approach would make the real processing invisible to
>>> all of these capabilities, and a user would only see that the bulk of the
>>> time is spent waiting for the next element, but not *why* the next
>> element
>>> is taking long to compute.
>>> - Likewise, offloading all the CPU and IO to a separate thread, invisible
>>> to Beam, will make it harder for runners to do autoscaling, binpacking
>> and
>>> other resource management magic (how much of this runners actually do is
>> a
>>> separate issue), because the runner will have no way of knowing how much
>>> CPU/IO this particular transform is actually using - all the processing
>>> happens in a thread about which the runner is unaware.
>>> - As far as I can tell, the code also hides exceptions that happen in the
>>> Tika thread
>>> - Adding the thread management makes the code much more complex, easier
>> to
>>> introduce bugs, and harder for others to contribute
>>>
>>>
>>>> 2) As I commented at the end of [2], having an option to concatenate the
>>>> data chunks first before making them available to the pipeline is
>>>> useful, and I guess doing the same in ParDo would introduce some
>>>> synchronization issues (though not exactly sure yet)
>>>>
>>> What are these issues?
>>>
>>>
>>>>
>>>> One of valid concerns there is that the reader is polling the internal
>>>> queue so, in theory at least, and perhaps in some rare cases too, we may
>>>> have a case where the max polling time has been reached, the parser is
>>>> still busy, and TikaIO fails to report all the file data. I think that
>>>> it can be solved by either 2a) configuring the max polling time to a
>>>> very large number which will never be reached for a practical case, or
>>>> 2b) simply use a blocking queue without the time limits - in the worst
>>>> case, if TikaParser spins and fails to report the end of the document,
>>>> then, Beam can heal itself if the pipeline blocks.
>>>> I propose to follow 2b).
>>>>
>>> I agree that there should be no way to unintentionally configure the
>>> transform in a way that will produce silent data loss. Another reason for
>>> not having these tuning knobs is that it goes against Beam's "no knobs"
>>> philosophy, and that in most cases users have no way of figuring out a
>> good
>>> value for tuning knobs except for manual experimentation, which is
>>> extremely brittle and typically gets immediately obsoleted by running on
>> a
>>> new dataset or updating a version of some of the involved dependencies
>> etc.
>>>
>>>
>>>>
>>>>
>>>> Please let me know what you think.
>>>> My plan so far is:
>>>> 1) start addressing most of Eugene's comments which would require some
>>>> minor TikaIO updates
>>>> 2) work on removing the TikaSource internal code dealing with File
>>>> patterns which I copied from TextIO at the next stage
>>>> 3) If needed - mark TikaIO Experimental to give Tika and Beam users some
>>>> time to try it with some real complex files and also decide if TikaIO
>>>> can continue implemented as a BoundedSource/Reader or not
>>>>
>>>> Eugene, all, will it work if I start with 1) ?
>>>>
>>> Yes, but I think we should start by discussing the anticipated use cases
>> of
>>> TikaIO and designing an API for it based on those use cases; and then see
>>> what's the best implementation for that particular API and set of
>>> anticipated use cases.
>>>
>>>
>>>>
>>>> Thanks, Sergey
>>>>
>>>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>>>> [2] https://github.com/apache/beam/pull/3378
>>>>
>>>
>>
> 


Re: TikaIO concerns

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Hi!

TextIO returns an unordered soup of lines contained in all files you ask it
to read. People usually use TextIO for reading files where 1 line
corresponds to 1 independent data element, e.g. a log entry, or a row of a
CSV file - so discarding order is ok.
However, there is a number of cases where TextIO is a poor fit:
- Cases where discarding order is not ok - e.g. if you're doing natural
language processing and the text files contain actual prose, where you need
to process a file as a whole. TextIO can't do that.
- Cases where you need to remember which file each element came from, e.g.
if you're creating a search index for the files: TextIO can't do this
either.

Both of these issues have been raised in the past against TextIO; however
it seems that the overwhelming majority of users of TextIO use it for logs
or CSV files or alike, so solving these issues has not been a priority.
Currently they are solved in a general form via FileIO.read() which gives
you access to reading a full file yourself - people who want more
flexibility will be able to use standard Java text-parsing utilities on a
ReadableFile, without involving TextIO.

Same applies for XmlIO: it is specifically designed for the narrow use case
where the files contain independent data entries, so returning an unordered
soup of them, with no association to the original file, is the user's
intention. XmlIO will not work for processing more complex XML files that
are not simply a sequence of entries with the same tag, and it also does
not remember the original filename.

However, if my understanding of Tika use cases is correct, it is mainly
used for extracting content from complex file formats - for example,
extracting text and images from PDF files or Word documents. I believe this
is the main difference between it and TextIO - people usually use Tika for
complex use cases where the "unordered soup of stuff" abstraction is not
useful.

My suspicion about this is confirmed by the fact that the crux of the Tika
API is ContentHandler
http://docs.oracle.com/javase/6/docs/api/org/xml/sax/ContentHandler.html?is-external=true
whose
documentation says "The order of events in this interface is very
important, and mirrors the order of information in the document itself."

Let me give a few examples of what I think is possible with the raw Tika
API, but I think is not currently possible with TikaIO - please correct me
where I'm wrong, because I'm not particularly familiar with Tika and am
judging just based on what I read about it.
- User has 100,000 Word documents and wants to convert each of them to text
files for future natural language processing.
- User has 100,000 PDF files with financial statements, each containing a
bunch of unrelated text and - the main content - a list of transactions in
PDF tables. User wants to extract each transaction as a PCollection
element, discarding the unrelated text.
- User has 100,000 PDF files with scientific papers, and wants to extract
text from them, somehow parse author and affiliation from the text, and
compute statistics of topics and terminology usage by author name and
affiliation.
- User has 100,000 photos in JPEG made by a set of automatic cameras
observing a location over time: they want to extract metadata from each
image using Tika, analyze the images themselves using some other library,
and detect anomalies in the overall appearance of the location over time as
seen from multiple cameras.
I believe all of these cases can not be solved with TikaIO because the
resulting PCollection<String> contains no information about which String
comes from which document and about the order in which they appear in the
document.

I am, honestly, struggling to think of a case where I would want to use
Tika, but where I *would* be ok with getting an unordered soup of strings.
So some examples would be very helpful.

Another way to state it: currently, if I wanted to solve all of the use
cases above, I'd just use FileIO.readMatches() and use the Tika API myself
on the resulting ReadableFile. How can we make TikaIO provide a usability
improvement over such usage?

I am confused by your other comment - "Does the ordering matter ?  Perhaps
for some cases it does, and for some it does not. May be it makes sense to
support running TikaIO as both the bounded reader/source and ParDo, with
getting the common code reused." - because using BoundedReader or ParDo is
not related to the ordering issue, only to the issue of asynchronous
reading and complexity of implementation. The resulting PCollection will be
unordered either way - this needs to be solved separately by providing a
different API.

Thanks.

On Wed, Sep 20, 2017 at 1:51 AM Sergey Beryozkin <sb...@gmail.com>
wrote:

> Hi
>
> Glad TikaIO getting some serious attention :-), I believe one thing we
> both agree upon is that Tika can help Beam in its own unique way.
>
> Before trying to reply online, I'd like to state that my main assumption
> is that TikaIO (as far as the read side is concerned) is no different to
> Text, XML or similar bounded reader components.
>
> I have to admit I don't understand your questions about TikaIO usecases.
>
> What are the Text Input or XML input use-cases ? These use cases are
> TikaInput cases as well, the only difference is Tika can not split the
> individual file into a sequence of sources/etc,
>
> TextIO can read from the plain text files (possibly zipped), XML -
> optimized around reading from the XML files, and I thought I made it
> clear (and it is a known fact anyway) Tika was about reading basically
> from any file format.
>
> Where is the difference (apart from what I've already mentioned) ?
>
> Sergey
>
>
>
> On 19/09/17 23:29, Eugene Kirpichov wrote:
> > Hi,
> >
> > Replies inline.
> >
> > On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin <sb...@gmail.com>
> > wrote:
> >
> >> Hi All
> >>
> >> This is my first post to the dev list, I work for Talend, I'm a Beam
> >> novice, Apache Tika fan, and thought it would be really great to try and
> >> link both projects together, which led me to opening [1] where I typed
> >> some early thoughts, followed by PR [2].
> >>
> >> I noticed yesterday I had the robust :-) (but useful and helpful) newer
> >> review comments from Eugene pending, so I'd like to summarize a bit why
> >> I did TikaIO (reader) the way I did, and then decide, based on the
> >> feedback from the experts, what to do next.
> >>
> >> Apache Tika Parsers report the text content in chunks, via SaxParser
> >> events. It's not possible with Tika to take a file and read it bit by
> >> bit at the 'initiative' of the Beam Reader, line by line, the only way
> >> is to handle the SAXParser callbacks which report the data chunks. Some
> >> parsers may report the complete lines, some individual words, with some
> >> being able to report the data only after they completely parse the document.
> >> All depends on the data format.
> >>
> >> At the moment TikaIO's TikaReader does not use the Beam threads to parse
> >> the files, Beam threads will only collect the data from the internal
> >> queue where the internal TikaReader's thread will put the data into
> >> (note the data chunks are ordered even though the tests might suggest
> >> otherwise).
> >>
> > I agree that your implementation of reader returns records in order - but
> > Beam PCollection's are not ordered. Nothing in Beam cares about the order
> > in which records are produced by a BoundedReader - the order produced by
> > your reader is ignored, and when applying any transforms to the
> PCollection
> > produced by TikaIO, it is impossible to recover the order in which your
> > reader returned the records.
> >
> > With that in mind, is PCollection<String>, containing individual
> > Tika-detected items, still the right API for representing the result of
> > parsing a large number of documents with Tika?
> >
> >
> >>
> >> The reason I did it was because I thought
> >>
> >> 1) it would make the individual data chunks available faster to the
> >> pipeline - the parser will continue working via the binary/video etc
> >> file while the data will already start flowing - I agree there should be
> >> some tests data available confirming it - but I'm positive at the moment
> >> this approach might yield some performance gains with the large sets. If
> >> the file is large, if it has the embedded attachments/videos to deal
> >> with, then it may be more effective not to get the Beam thread deal with
> >> it...
> >>
> >> As I said on the PR, this description contains unfounded and potentially
> > incorrect assumptions about how Beam runners execute (or may execute in
> the
> > future) a ParDo or a BoundedReader. For example, if I understand
> correctly,
> > you might be assuming that:
> > - Beam runners wait for a full @ProcessElement call of a ParDo to
> complete
> > before processing its outputs with downstream transforms
> > - Beam runners can not run a @ProcessElement call of a ParDo
> *concurrently*
> > with downstream processing of its results
> > - Passing an element from one thread to another using a BlockingQueue is
> > free in terms of performance
> > All of these are false at least in some runners, and I'm almost certain
> > that in reality, performance of this approach is worse than a ParDo in
> most
> > production runners.
> >
> > There are other disadvantages to this approach:
> > - Doing the bulk of the processing in a separate thread makes it
> invisible
> > to Beam's instrumentation. If a Beam runner provided per-transform
> > profiling capabilities, or the ability to get the current stack trace for
> > stuck elements, this approach would make the real processing invisible to
> > all of these capabilities, and a user would only see that the bulk of the
> > time is spent waiting for the next element, but not *why* the next
> element
> > is taking long to compute.
> > - Likewise, offloading all the CPU and IO to a separate thread, invisible
> > to Beam, will make it harder for runners to do autoscaling, binpacking
> and
> > other resource management magic (how much of this runners actually do is
> a
> > separate issue), because the runner will have no way of knowing how much
> > CPU/IO this particular transform is actually using - all the processing
> > happens in a thread about which the runner is unaware.
> > - As far as I can tell, the code also hides exceptions that happen in the
> > Tika thread
> > - Adding the thread management makes the code much more complex, easier
> to
> > introduce bugs, and harder for others to contribute
> >
> >
> >> 2) As I commented at the end of [2], having an option to concatenate the
> >> data chunks first before making them available to the pipeline is
> >> useful, and I guess doing the same in ParDo would introduce some
> >> synchronization issues (though not exactly sure yet)
> >>
> > What are these issues?
> >
> >
> >>
> >> One of valid concerns there is that the reader is polling the internal
> >> queue so, in theory at least, and perhaps in some rare cases too, we may
> >> have a case where the max polling time has been reached, the parser is
> >> still busy, and TikaIO fails to report all the file data. I think that
> >> it can be solved by either 2a) configuring the max polling time to a
> >> very large number which will never be reached for a practical case, or
> >> 2b) simply use a blocking queue without the time limits - in the worst
> >> case, if TikaParser spins and fails to report the end of the document,
> >> then, Beam can heal itself if the pipeline blocks.
> >> I propose to follow 2b).
> >>
> > I agree that there should be no way to unintentionally configure the
> > transform in a way that will produce silent data loss. Another reason for
> > not having these tuning knobs is that it goes against Beam's "no knobs"
> > philosophy, and that in most cases users have no way of figuring out a
> good
> > value for tuning knobs except for manual experimentation, which is
> > extremely brittle and typically gets immediately obsoleted by running on
> a
> > new dataset or updating a version of some of the involved dependencies
> etc.
> >
> >
> >>
> >>
> >> Please let me know what you think.
> >> My plan so far is:
> >> 1) start addressing most of Eugene's comments which would require some
> >> minor TikaIO updates
> >> 2) work on removing the TikaSource internal code dealing with File
> >> patterns which I copied from TextIO at the next stage
> >> 3) If needed - mark TikaIO Experimental to give Tika and Beam users some
> >> time to try it with some real complex files and also decide if TikaIO
> >> can continue implemented as a BoundedSource/Reader or not
> >>
> >> Eugene, all, will it work if I start with 1) ?
> >>
> > Yes, but I think we should start by discussing the anticipated use cases
> of
> > TikaIO and designing an API for it based on those use cases; and then see
> > what's the best implementation for that particular API and set of
> > anticipated use cases.
> >
> >
> >>
> >> Thanks, Sergey
> >>
> >> [1] https://issues.apache.org/jira/browse/BEAM-2328
> >> [2] https://github.com/apache/beam/pull/3378
> >>
> >
>

Re: TikaIO concerns

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi

Glad TikaIO is getting some serious attention :-); I believe one thing we 
both agree upon is that Tika can help Beam in its own unique way.

Before trying to reply online, I'd like to state that my main assumption 
is that TikaIO (as far as the read side is concerned) is no different to 
Text, XML or similar bounded reader components.

I have to admit I don't understand your questions about TikaIO use cases.

What are the TextIO or XML input use cases? They are Tika input use 
cases as well; the only difference is that Tika cannot split an 
individual file into a sequence of sources, etc.

TextIO can read from plain text files (possibly zipped), the XML IO is 
optimized around reading from XML files, and I thought I made it clear 
(and it is a known fact anyway) that Tika is about reading from 
basically any file format.

Where is the difference (apart from what I've already mentioned)?

Sergey



On 19/09/17 23:29, Eugene Kirpichov wrote:
> Hi,
> 
> Replies inline.
> 
> On Tue, Sep 19, 2017 at 3:41 AM Sergey Beryozkin <sb...@gmail.com>
> wrote:
> 
>> Hi All
>>
>> This is my first post to the dev list, I work for Talend, I'm a Beam
>> novice, Apache Tika fan, and thought it would be really great to try and
>> link both projects together, which led me to opening [1] where I typed
>> some early thoughts, followed by PR [2].
>>
>> I noticed yesterday I had the robust :-) (but useful and helpful) newer
>> review comments from Eugene pending, so I'd like to summarize a bit why
>> I did TikaIO (reader) the way I did, and then decide, based on the
>> feedback from the experts, what to do next.
>>
>> Apache Tika parsers report the text content in chunks, via SAXParser
>> events. It's not possible with Tika to take a file and read it bit by
>> bit, line by line, at the 'initiative' of the Beam Reader; the only way
>> is to handle the SAXParser callbacks which report the data chunks. Some
>> parsers may report complete lines, some individual words, and some can
>> report the data only after they completely parse the document.
>> All depends on the data format.
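This chunk-driven (push) model is not unique to Tika; even the JDK's built-in SAX parser delivers text the same way, via callbacks rather than a pull-style read. A minimal JDK-only sketch (the class name `SaxChunks` is made up for illustration, no Tika or Beam involved):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxChunks {
    public static String parse(String xml) throws Exception {
        StringBuilder collected = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // The parser pushes text to us via callbacks; the caller cannot
        // "pull" the content line by line at its own initiative.
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Each invocation delivers one chunk; chunk boundaries are
                // parser-dependent, not line-oriented.
                collected.append(ch, start, length);
            }
        });
        return collected.toString();
    }
}
```

Tika's `Parser.parse(...)` follows the same `ContentHandler` callback contract, which is why a wrapping reader has to either buffer the chunks or hand them off to another thread.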
>>
>> At the moment TikaIO's TikaReader does not use the Beam threads to parse
>> the files; the Beam threads only collect the data from the internal
>> queue which the internal TikaReader's thread puts the data into
>> (note the data chunks are ordered even though the tests might suggest
>> otherwise).
>>
> I agree that your implementation of the reader returns records in order - but
> Beam PCollection's are not ordered. Nothing in Beam cares about the order
> in which records are produced by a BoundedReader - the order produced by
> your reader is ignored, and when applying any transforms to the PCollection
> produced by TikaIO, it is impossible to recover the order in which your
> reader returned the records.
> 
> With that in mind, is PCollection<String>, containing individual
> Tika-detected items, still the right API for representing the result of
> parsing a large number of documents with Tika?
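Purely as an illustration of one possible alternative to PCollection&lt;String&gt; (all names below are hypothetical, not an agreed API), each element could carry the source file and the chunk position together, so that downstream transforms can regroup and reorder chunks per document even though the PCollection itself is unordered:

```java
// Hypothetical element type for TikaIO output; names are illustrative only.
public final class ParseResult {
    private final String fileName;  // which document the chunk came from
    private final int chunkIndex;   // position of the chunk within that document
    private final String content;   // the text chunk reported by the parser

    public ParseResult(String fileName, int chunkIndex, String content) {
        this.fileName = fileName;
        this.chunkIndex = chunkIndex;
        this.content = content;
    }

    public String getFileName() { return fileName; }
    public int getChunkIndex() { return chunkIndex; }
    public String getContent() { return content; }
}
```

With such a type, grouping by file name and sorting by chunk index could reconstruct document order downstream, which a bare PCollection&lt;String&gt; cannot.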
> 
> 
>>
>> The reason I did it was because I thought
>>
>> 1) it would make the individual data chunks available faster to the
>> pipeline - the parser will continue working through the binary/video etc.
>> file while the data will already start flowing - I agree there should be
>> some tests data available confirming it - but I'm positive at the moment
>> this approach might yield some performance gains with the large sets. If
>> the file is large, if it has the embedded attachments/videos to deal
>> with, then it may be more effective not to have the Beam thread deal
>> with it...
>>
>> As I said on the PR, this description contains unfounded and potentially
> incorrect assumptions about how Beam runners execute (or may execute in the
> future) a ParDo or a BoundedReader. For example, if I understand correctly,
> you might be assuming that:
> - Beam runners wait for a full @ProcessElement call of a ParDo to complete
> before processing its outputs with downstream transforms
> - Beam runners can not run a @ProcessElement call of a ParDo *concurrently*
> with downstream processing of its results
> - Passing an element from one thread to another using a BlockingQueue is
> free in terms of performance
> All of these are false at least in some runners, and I'm almost certain
> that in reality, performance of this approach is worse than a ParDo in most
> production runners.
> 
> There are other disadvantages to this approach:
> - Doing the bulk of the processing in a separate thread makes it invisible
> to Beam's instrumentation. If a Beam runner provided per-transform
> profiling capabilities, or the ability to get the current stack trace for
> stuck elements, this approach would make the real processing invisible to
> all of these capabilities, and a user would only see that the bulk of the
> time is spent waiting for the next element, but not *why* the next element
> is taking long to compute.
> - Likewise, offloading all the CPU and IO to a separate thread, invisible
> to Beam, will make it harder for runners to do autoscaling, binpacking and
> other resource management magic (how much of this runners actually do is a
> separate issue), because the runner will have no way of knowing how much
> CPU/IO this particular transform is actually using - all the processing
> happens in a thread about which the runner is unaware.
> - As far as I can tell, the code also hides exceptions that happen in the
> Tika thread
> - Adding the thread management makes the code much more complex, easier to
> introduce bugs, and harder for others to contribute
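The exception-hiding problem is easy to reproduce with plain JDK threads and a queue (no Beam or Tika involved; the class name is made up): unless the producing thread explicitly forwards its failure through the queue, the consuming side only observes an absence of data.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class HiddenFailure {
    public static String consumeOnce() throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread producer = new Thread(() -> {
            // Simulates a parser thread failing before producing anything.
            throw new RuntimeException("parser blew up");
        });
        // Swallow the uncaught exception for this demo; a correct
        // implementation would capture it and rethrow it on the consumer side.
        producer.setUncaughtExceptionHandler((t, e) -> { });
        producer.start();
        producer.join();
        // The consumer has no idea the producer died: it just sees "no data".
        return queue.poll();
    }
}
```

Without an explicit hand-off of the throwable (e.g. a poison-pill element carrying the exception), the failure never reaches the reader, let alone the runner's error reporting.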
> 
> 
>> 2) As I commented at the end of [2], having an option to concatenate the
>> data chunks first before making them available to the pipeline is
>> useful, and I guess doing the same in ParDo would introduce some
>> synchronization issues (though not exactly sure yet)
>>
> What are these issues?
> 
> 
>>
>> One of valid concerns there is that the reader is polling the internal
>> queue so, in theory at least, and perhaps in some rare cases too, we may
>> have a case where the max polling time has been reached, the parser is
>> still busy, and TikaIO fails to report all the file data. I think that
>> it can be solved by either 2a) configuring the max polling time to a
>> very large number which will never be reached for a practical case, or
>> 2b) simply use a blocking queue without the time limits - in the worst
>> case, if TikaParser spins and fails to report the end of the document,
>> then, Beam can heal itself if the pipeline blocks.
>> I propose to follow 2b).
>>
> I agree that there should be no way to unintentionally configure the
> transform in a way that will produce silent data loss. Another reason for
> not having these tuning knobs is that it goes against Beam's "no knobs"
> philosophy, and that in most cases users have no way of figuring out a good
> value for tuning knobs except for manual experimentation, which is
> extremely brittle and typically gets immediately obsoleted by running on a
> new dataset or updating a version of some of the involved dependencies etc.
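The hazard of option 2a) versus the behavior of 2b) can be shown with a JDK-only sketch (made-up class name): a timed poll can give up before a slow parser has produced its final chunk, whereas a blocking take cannot silently lose it.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class PollVsTake {
    public static String[] demo() throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread slowParser = new Thread(() -> {
            try {
                Thread.sleep(500);        // parser is still busy...
                queue.put("last chunk");  // ...but it does eventually finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        slowParser.start();
        // 2a-style bounded poll: gives up too early, which in a reader
        // would translate into silently dropping the final chunk.
        String timedOut = queue.poll(50, TimeUnit.MILLISECONDS);  // null
        // 2b-style blocking take: waits as long as needed, no silent loss.
        String received = queue.take();
        slowParser.join();
        return new String[] { timedOut, received };
    }
}
```

The trade-off, as noted above, is that a blocking take turns a spinning parser into a hung pipeline, which the runner can at least observe and heal, rather than into silent data loss, which it cannot.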
> 
> 
>>
>>
>> Please let me know what you think.
>> My plan so far is:
>> 1) start addressing most of Eugene's comments which would require some
>> minor TikaIO updates
>> 2) work on removing the TikaSource internal code dealing with File
>> patterns which I copied from TextIO at the next stage
>> 3) If needed - mark TikaIO Experimental to give Tika and Beam users some
>> time to try it with some real complex files and also decide if TikaIO
>> can continue implemented as a BoundedSource/Reader or not
>>
>> Eugene, all, will it work if I start with 1) ?
>>
> Yes, but I think we should start by discussing the anticipated use cases of
> TikaIO and designing an API for it based on those use cases; and then see
> what's the best implementation for that particular API and set of
> anticipated use cases.
> 
> 
>>
>> Thanks, Sergey
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-2328
>> [2] https://github.com/apache/beam/pull/3378
>>
> 
