Posted to dev@beam.apache.org by Sergey Beryozkin <sb...@gmail.com> on 2017/10/02 11:03:48 UTC

Re: TikaIO Refactoring

Thanks for the review, please see the last comment:

https://github.com/apache/beam/pull/3835#issuecomment-333502388

(sorry for the possible duplication - but I'm not sure that GitHub will 
propagate it, as I cannot see a comment there that I left on Saturday).

Cheers, Sergey
On 29/09/17 10:21, Sergey Beryozkin wrote:
> Hi
> On 28/09/17 17:09, Eugene Kirpichov wrote:
>> Hi! Glad the refactoring is happening, thanks!
> 
> Thanks for getting me focused on having TikaIO supporting the simpler 
> (and practical) cases first :-)
>> It was auto-assigned to Reuven as formal owner of the component. I
>> reassigned it to you.
> OK, thanks...
>>
>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sb...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> I started looking at
>>> https://issues.apache.org/jira/browse/BEAM-2994
>>>
>>> and pushed some initial code to my tikaio branch, introducing ParseResult
>>> and updating the tests, but keeping the BoundedSource/Reader, dropping
>>> the asynchronous parsing code, and a few other bits.
>>>
>>> Just noticed it is assigned to Reuven - does it mean Reuven is looking
>>> into it too or was it auto-assigned ?
>>>
>>> I don't mind, but would it make sense for me to do an 'interim' PR on
>>> what I've done so far before completely removing the BoundedSource/Reader
>>> based code ?
>>>
>> Yes :)
>>
> I committed to my branch yesterday, and it made its way to the pending 
> PR (which I had forgotten about), where I only tweaked a couple of doc 
> typos, so I renamed that PR:
> 
> https://github.com/apache/beam/pull/3835
> 
> (The build failures are apparently due to the build timeouts)
> 
> As I mentioned, in this PR I updated the existing TikaIO test to work 
> with ParseResult, which at the moment has a file location as its 
> property. While only a file name could easily be saved, I thought it 
> might be important to record where on the network the file is - maybe 
> to copy it afterwards if needed, etc. I'd also have no problem with 
> having it typed as a K key; I was only trying to keep it a bit simpler 
> at the start.
> 
> I'll deal with the new configurations after the switch. TikaConfig would 
> most likely still need to be supported, but I recall you mentioned that 
> the way it's done now will make it work only with the direct runner. I 
> guess I can load it as a URL resource... The other bits, like providing 
> custom content handlers, parsers, input metadata, maybe setting the max 
> size of the files, etc, can all be added after the switch.
> 
> Note I haven't dealt with a number of your comments on the original 
> code; they could still be addressed in the current code, but most of 
> that code will go with the next PR anyway.
> 
> Please review or merge if it looks like it is a step in the right 
> direction...
> 
>>
>>>
>>> I have another question anyway,
>>>
>>>
>>>> E.g. TikaIO could:
>>>> - take as input a PCollection<ReadableFile>
>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where 
>>>> ParseResult
>>>> is a class with properties { String content, Metadata metadata }
>>>> - be configured by: a Parser (it implements Serializable so can be
>>>> specified at pipeline construction time) and a ContentHandler whose
>>>> toString() will go into "content". ContentHandler does not implement
>>>> Serializable, so you can not specify it at construction time - however,
>>> you
>>>> can let the user specify either its class (if it's a simple handler 
>>>> like
>>> a
>>>> BodyContentHandler) or specify a lambda for creating the handler
>>>> (SerializableFunction<Void, ContentHandler>), and potentially you can
>>> have
>>>> a simpler facade for Tika.parseAsString() - e.g. call it
>>>> TikaIO.parseAllAsStrings().
>>>>
>>>> Example usage would look like:
>>>>
>>>>     PCollection<KV<String, ParseResult>> parseResults =
>>>> p.apply(FileIO.match().filepattern(...))
>>>>       .apply(FileIO.readMatches())
>>>>       .apply(TikaIO.parseAllAsStrings())
>>>>
>>>> or:
>>>>
>>>>       .apply(TikaIO.parseAll()
>>>>           .withParser(new AutoDetectParser())
>>>>           .withContentHandler(() -> new BodyContentHandler(new
>>>> ToXMLContentHandler())))
>>>>
>>>> You could also have shorthands for letting the user avoid using FileIO
>>>> directly in simple cases, for example:
>>>>       p.apply(TikaIO.parseAsStrings().from(filepattern))
>>>>
>>>> This would of course be implemented as a ParDo or even MapElements, and
>>>> you'll be able to share the code between parseAll and regular parse.
>>>>
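The serializable-factory idea above can be demonstrated with the JDK alone. A minimal sketch, assuming nothing from Beam: `SerializableSupplier` is a hypothetical stand-in for `SerializableFunction<Void, ContentHandler>`, and `StringBuilder` stands in for a non-Serializable handler:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Supplier;

public class SerializableFactory {

    // Hypothetical stand-in for Beam's SerializableFunction<Void, ContentHandler>:
    // a factory that is itself Serializable, even though what it creates is not.
    interface SerializableSupplier<T> extends Supplier<T>, Serializable {}

    // Round-trip a factory through Java serialization, the way a runner
    // ships pipeline construction-time objects to workers.
    @SuppressWarnings("unchecked")
    static <T> SerializableSupplier<T> roundTrip(SerializableSupplier<T> factory)
            throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(factory);
        }
        try (ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return (SerializableSupplier<T>) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // The handler instance cannot be shipped, but the factory lambda can.
        SerializableSupplier<StringBuilder> factory = StringBuilder::new;
        SerializableSupplier<StringBuilder> revived = roundTrip(factory);
        System.out.println(revived.get().append("fresh handler per element"));
    }
}
```

The lambda serializes because its target interface extends Serializable; each worker can then build a fresh handler per element.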
>>> I'd like to understand how to do
>>>
>>> TikaIO.parse().from(filepattern)
>>>
>>> Right now I have TikaIO.Read extending
>>> PTransform<PBegin, PCollection<ParseResult>>
>>>
>>> and then the boilerplate code which builds Read when I do something like
>>>
>>> TikaIO.read().from(filepattern).
>>>
>>> What is the convention for supporting something like
>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I see
>>> some example ?
>>>
>> There are a number of IOs that don't use Source - e.g. DatastoreIO and
>> JdbcIO. TextIO.readMatches() might be an even better transform to mimic.
>> Note that in TikaIO you probably won't need a fusion break after the 
>> ParDo
>> since there's 1 result per input file.
>>
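To make that pointer concrete, here is a rough, uncompiled sketch of the Source-free convention (Beam 2.x API names assumed; `Parse`, `ParseResult` and `ParseFn` are illustrative placeholders, not the final TikaIO API):

```java
// Rough, uncompiled sketch only - Parse / ParseResult / ParseFn are
// illustrative placeholders, not the final TikaIO API.
public class TikaIO {
  public static Parse parse() { return new Parse(null); }

  public static class Parse extends PTransform<PBegin, PCollection<ParseResult>> {
    private final String filepattern;

    private Parse(String filepattern) { this.filepattern = filepattern; }

    // The .from(filepattern) shorthand just stores the pattern...
    public Parse from(String filepattern) { return new Parse(filepattern); }

    // ...and expand() composes FileIO with a plain ParDo - no
    // BoundedSource/Reader anywhere.
    @Override
    public PCollection<ParseResult> expand(PBegin input) {
      return input
          .apply(FileIO.match().filepattern(filepattern))
          .apply(FileIO.readMatches())
          // One output per input file, so no fusion break is needed.
          .apply(ParDo.of(new ParseFn()));
    }
  }
}
```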
> 
> OK, I'll have a look
> 
> Cheers, Sergey
> 
>>
>>>
>>> Many thanks, Sergey
>>>
>>

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Ben - yes, something like that would work, letting ReadableFile 
consumers choose.

Cheers, Sergey
On 04/10/17 22:51, Ben Chambers wrote:
> It looks like ReadableFile#open does currently decompress the stream, but
> it seems like we could add a ReadableFile#openRaw(...) or something like
> that which didn't implicitly decompress. Then libraries such as Tika which
> want the *actual* file content could use that method. Would that address
> your concerns?
> 
> https://github.com/apache/beam/blob/393e5631054a81ae1fdcd304f81cc68cf53d3422/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L131
> 
> On Wed, Oct 4, 2017 at 2:42 PM Sergey Beryozkin <sb...@gmail.com>
> wrote:
> 
>> Wait, but what about Tika doing checks for things like Zip bombs, etc.?
>> Tika is expected to do the decompression itself, while ReadableFile
>> hands over the content already decompressed.
>>
>> The other point is that Tika reports the names of the zipped files too,
>> in the content, as you can see from TikaIOTest#readZippedPdfFile.
>>
>> Can we assume that if Metadata does not point to the local file then it
>> can be opened as a URL stream ? The same issue affects TikaConfig, so
>> I'd rather have a solution which will work for MatchResult.Metadata and
>> TikaConfig
>>
>> Thanks, Sergey
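The kind of safety check being entrusted to Tika here can be sketched with the JDK alone - an illustrative streaming ratio guard, not Tika's actual implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ZipBombGuard {

    // Illustrative limit: refuse to inflate more than 100x the stored size.
    static final long MAX_RATIO = 100;

    static byte[] inflateWithGuard(byte[] compressed) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in =
                new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buf)) > 0) {
                total += n;
                // Abort as soon as the ratio becomes suspicious, before
                // the full payload is materialized in memory.
                if (total > (long) compressed.length * MAX_RATIO) {
                    throw new IllegalStateException("suspicious compression ratio");
                }
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // 1 MiB of zeros gzips to roughly 1 KiB - a bomb-like ratio.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(new byte[1 << 20]);
        }
        try {
            inflateWithGuard(bos.toByteArray());
            System.out.println("no bomb detected");
        } catch (IllegalStateException e) {
            System.out.println("bomb detected");
        }
    }
}
```

Tika's real checks are more involved (nested archives, entry counts, and so on), which is exactly the argument for handing it the raw stream.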
>> On 04/10/17 22:02, Sergey Beryozkin wrote:
>>> Good point...
>>>
>>> Sergey
>>>
>>> On 04/10/17 18:24, Eugene Kirpichov wrote:
>>>> Can TikaInputStream consume a regular InputStream? If so, you can
>>>> apply it
>>>> to Channels.newInputStream(channel). If not, applying it to the filename
>>>> extracted from Metadata won't work either because it can point to a file
>>>> that's not on the local disk.
>>>>
>>>> On Wed, Oct 4, 2017, 10:08 AM Sergey Beryozkin <sb...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm starting to move toward
>>>>>
>>>>> class TikaIO {
>>>>>      public static ParseAllToString parseAllToString() {..}
>>>>>      class ParseAllToString extends
>> PTransform<PCollection<ReadableFile>,
>>>>> PCollection<ParseResult>> {
>>>>>        ...configuration properties...
>>>>>        expand {
>>>>>          return input.apply(ParDo.of(new ParseToStringFn))
>>>>>        }
>>>>>        class ParseToStringFn extends DoFn<...> {...}
>>>>>      }
>>>>> }
>>>>>
>>>>> as suggested by Eugene
>>>>>
>>>>> The initial migration seems to work fine, except that ReadableFile,
>>>>> and in particular ReadableByteChannel, can not be consumed by
>>>>> TikaInputStream yet (I'll open an enhancement request); besides, it's
>>>>> better to let Tika unzip if needed, given that a lot of effort has
>>>>> gone into Tika's detection of zip security issues...
>>>>>
>>>>> So I'm typing it as
>>>>>
>>>>> class ParseAllToString extends
>>>>> PTransform<PCollection<MatchResult.Metadata>, PCollection<ParseResult>>
>>>>>
>>>>> Cheers, Sergey



Re: TikaIO Refactoring

Posted by Ben Chambers <bc...@apache.org>.
It looks like ReadableFile#open does currently decompress the stream, but
it seems like we could add a ReadableFile#openRaw(...) or something like
that which didn't implicitly decompress. Then libraries such as Tika which
want the *actual* file content could use that method. Would that address
your concerns?

https://github.com/apache/beam/blob/393e5631054a81ae1fdcd304f81cc68cf53d3422/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L131
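The distinction this proposes (a hypothetical openRaw next to the existing open) can be illustrated with plain java.util.zip: reading the stored bytes raw preserves the gzip magic number that type detection relies on, while implicit decompression discards the container before a parser ever sees it. A JDK-only sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class RawVsDecompressed {

    static byte[] gzip(byte[] data) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    static byte[] gunzip(byte[] stored) throws Exception {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(stored))) {
            return in.readAllBytes();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] stored = gzip("hello tika".getBytes("UTF-8"));

        // "openRaw" view: the bytes as stored. The gzip magic number
        // (1f 8b) is visible, so type detection can identify the
        // container and the parser can apply its own safety checks.
        System.out.printf("raw magic: %02x %02x%n", stored[0], stored[1] & 0xff);

        // "open" view: implicit decompression - the container is gone.
        System.out.println("decompressed: " + new String(gunzip(stored), "UTF-8"));
    }
}
```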

On Wed, Oct 4, 2017 at 2:42 PM Sergey Beryozkin <sb...@gmail.com>
wrote:

> Wait, but what about Tika doing checks for things like Zip bombs, etc.?
> Tika is expected to do the decompression itself, while ReadableFile
> hands over the content already decompressed.
>
> The other point is that Tika reports the names of the zipped files too,
> in the content, as you can see from TikaIOTest#readZippedPdfFile.
>
> Can we assume that if Metadata does not point to the local file then it
> can be opened as a URL stream ? The same issue affects TikaConfig, so
> I'd rather have a solution which will work for MatchResult.Metadata and
> TikaConfig
>
> Thanks, Sergey

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi

Yes, using .withCompression(UNCOMPRESSED) works, but the test code looks 
funny:

p.apply("ParseFiles", FileIO.match().filepattern(resourcePath))
    .apply(FileIO.readMatches().withCompression(
        compressed ? Compression.UNCOMPRESSED : Compression.AUTO))
    .apply(TikaIO.read())

One can obviously always set UNCOMPRESSED for Tika and avoid the if/else, 
and it will work fine (with some minor confusion expected when someone 
attempts to process compressed files), but IMHO a more strongly typed 
approach for getting the raw content would be useful to have as well.

Thanks, Sergey
On 05/10/17 12:56, Sergey Beryozkin wrote:
> Hi Eugene
> 
> On 04/10/17 22:52, Eugene Kirpichov wrote:
>> You can avoid automatic decompression by using
>> FileIO.readMatches().withCompression(UNCOMPRESSED) (default is AUTO).
> 
> It's nice that it was already anticipated that the auto-decompression 
> would not always be needed, and it would help the readZippedPdfFile test 
> pass :-). But I'd rather TikaIO users did not have to worry about doing 
> .withCompression(UNCOMPRESSED) for it to work correctly, given that the 
> raw content is always needed by Tika - having users assert it may be a 
> bit suboptimal.
> 
> It is only useful to auto-decompress when that is indeed what a user 
> wants (a single file only, with no risk of malicious compression); 
> otherwise Tika should do it itself.
> 
> So ideally, after the initial refactoring is complete, we'd only have
> 
> p.apply(FileIO.readMatches()).apply(TikaIO.parseAll()) for reading 
> compressed or uncompressed files, where Tika would just call 
> ReadableFile.openRaw as suggested by Ben, or at least have another 
> shortcut at the FileIO level, something like
> 
> FileIO.readRawMatches()
> 
> which would be equivalent to specifying the UNCOMPRESSED option explicitly.
> 
> 
> Thanks, Sergey
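A sketch of what that hypothetical shortcut might look like (readRawMatches() does not exist in FileIO today; this is only the sugar proposed above):

```java
// Hypothetical - not part of FileIO. Equivalent to spelling out the
// UNCOMPRESSED option at each call site.
public static FileIO.ReadMatches readRawMatches() {
  return FileIO.readMatches().withCompression(Compression.UNCOMPRESSED);
}
```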
>>
>> On Wed, Oct 4, 2017 at 2:42 PM Sergey Beryozkin <sb...@gmail.com>
>> wrote:
>>
>>> Wait, but what about Tika doing checks like Zip bombs, etc ? Tika is
>>> expected to decompress itself, while ReadableFile has the content
>>> decompressed.
>>>
>>> The other point is that Tika reports the names of the zipped files too,
>>> in the content, as you can see from TikaIOTest#readZippedPdfFile.
>>>
>>> Can we assume that if Metadata does not point to the local file then it
>>> can be opened as a URL stream ? The same issue affects TikaConfig, so
>>> I'd rather have a solution which will work for MatchResult.Metadata and
>>> TikaConfig
>>>
>>> Thanks, Sergey
>>> On 04/10/17 22:02, Sergey Beryozkin wrote:
>>>> Good point...
>>>>
>>>> Sergey
>>>>
>>>> On 04/10/17 18:24, Eugene Kirpichov wrote:
>>>>> Can TikaInputStream consume a regular InputStream? If so, you can
>>>>> apply it
>>>>> to Channels.newInputStream(channel). If not, applying it to the 
>>>>> filename
>>>>> extracted from Metadata won't work either because it can point to a 
>>>>> file
>>>>> that's not on the local disk.
>>>>>
>>>>> On Wed, Oct 4, 2017, 10:08 AM Sergey Beryozkin <sb...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I'm starting moving toward
>>>>>>
>>>>>> class TikaIO {
>>>>>>      public static ParseAllToString parseAllToString() {..}
>>>>>>      class ParseAllToString extends
>>> PTransform<PCollection<ReadableFile>,
>>>>>> PCollection<ParseResult>> {
>>>>>>        ...configuration properties...
>>>>>>        expand {
>>>>>>          return input.apply(ParDo.of(new ParseToStringFn))
>>>>>>        }
>>>>>>        class ParseToStringFn extends DoFn<...> {...}
>>>>>>      }
>>>>>> }
>>>>>>
>>>>>> as suggested by Eugene
>>>>>>
>>>>>> The initial migration seems to work fine, except that ReadableFile 
>>>>>> and
>>>>>> in particular, ReadableByteChannel can not be consumed by
>>>>>> TikaInputStream yet (I'll open an enhancement request), besides, it's
>>>>>> better let Tika to unzip if needed given that a lot of effort went in
>>>>>> Tika into detecting zip security issues...
>>>>>>
>>>>>> So I'm typing it as
>>>>>>
>>>>>> class ParseAllToString extends
>>>>>> PTransform<PCollection<MatchResult.Metadata>, 
>>>>>> PCollection<ParseResult>>
>>>>>>
>>>>>> Cheers, Sergey
>>>>>>
>>>>>> On 02/10/17 12:03, Sergey Beryozkin wrote:
>>>>>>> Thanks for the review, please see the last comment:
>>>>>>>
>>>>>>> https://github.com/apache/beam/pull/3835#issuecomment-333502388
>>>>>>>
>>>>>>> (sorry for the possible duplication - but I'm not sure that GitHub
>>> will
>>>>>>> propagate it - as I can not see a comment there that I left on
>>>>>>> Saturday).
>>>>>>>
>>>>>>> Cheers, Sergey
>>>>>>> On 29/09/17 10:21, Sergey Beryozkin wrote:
>>>>>>>> Hi
>>>>>>>> On 28/09/17 17:09, Eugene Kirpichov wrote:
>>>>>>>>> Hi! Glad the refactoring is happening, thanks!
>>>>>>>>
>>>>>>>> Thanks for getting me focused on having TikaIO supporting the 
>>>>>>>> simpler
>>>>>>>> (and practical) cases first :-)
>>>>>>>>> It was auto-assigned to Reuven as formal owner of the component. I
>>>>>>>>> reassigned it to you.
>>>>>>>> OK, thanks...
>>>>>>>>>
>>>>>>>>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin
>>>>>>>>> <sberyozkin@gmail.com
>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi
>>>>>>>>>>
>>>>>>>>>> I started looking at
>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-2994
>>>>>>>>>>
>>>>>>>>>> and pushed some initial code to my tikaio branch introducing
>>>>>>>>>> ParseResult
>>>>>>>>>> and updating the tests but keeping the BounderSource/Reader,
>>>>>>>>>> dropping
>>>>>>>>>> the asynchronous parsing code, and few other bits.
>>>>>>>>>>
>>>>>>>>>> Just noticed it is assigned to Reuven - does it mean Reuven is
>>>>>>>>>> looking
>>>>>>>>>> into it too or was it auto-assigned ?
>>>>>>>>>>
>>>>>>>>>> I don't mind, would it make sense for me to do an 'interim' PR on
>>>>>>>>>> what've done so far before completely removing 
>>>>>>>>>> BoundedSource/Reader
>>>>>>>>>> based code ?
>>>>>>>>>>
>>>>>>>>> Yes :)
>>>>>>>>>
>>>>>>>> I did commit yesterday to my branch, and it made its way to the
>>>>>>>> pending PR (which I forgot about) where I only tweaked a couple of
>>> doc
>>>>>>>> typos, so I renamed that PR:
>>>>>>>>
>>>>>>>> https://github.com/apache/beam/pull/3835
>>>>>>>>
>>>>>>>> (The build failures are apparently due to the build timeouts)
>>>>>>>>
>>>>>>>> As I mentioned, in this PR I updated the existing TikaIO test to 
>>>>>>>> work
>>>>>>>> with ParseResult, at the moment a file location as its property. 
>>>>>>>> Only
>>>>>>>> a file name can easily be saved, I thought it might be important
>>> where
>>>>>>>> on the network the file is - may be copy it afterwards if needed,
>>> etc.
>>>>>>>> I'd also have no problems with having it typed as a K key, was only
>>>>>>>> trying to make it a bit simpler at the start.
>>>>>>>>
>>>>>>>> I'll deal with the new configurations after a switch. TikaConfig
>>> would
>>>>>>>> most likely still need to be supported but I recall you 
>>>>>>>> mentioned the
>>>>>>>> way it's done now will make it work only with the direct runner. I
>>>>>>>> guess I can load it as a URL resource... The other bits like
>>> providing
>>>>>>>> custom content handlers, parsers, input metadata, may be setting 
>>>>>>>> the
>>>>>>>> max size of the files, etc, can all be added after a switch.
>>>>>>>>
>>>>>>>> Note I haven't dealt with a number of your comments to the original
>>>>>>>> code which can still be dealt with in the current code - given that
>>>>>>>> most of that code will go with the next PR anyway.
>>>>>>>>
>>>>>>>> Please review or merge if it looks like it is a step in the right
>>>>>>>> direction...
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I have another question anyway,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> E.g. TikaIO could:
>>>>>>>>>>> - take as input a PCollection<ReadableFile>
>>>>>>>>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where
>>>>>>>>>>> ParseResult
>>>>>>>>>>> is a class with properties { String content, Metadata metadata }
>>>>>>>>>>> - be configured by: a Parser (it implements Serializable so 
>>>>>>>>>>> can be
>>>>>>>>>>> specified at pipeline construction time) and a ContentHandler
>>> whose
>>>>>>>>>>> toString() will go into "content". ContentHandler does not
>>>>>>>>>>> implement
>>>>>>>>>>> Serializable, so you can not specify it at construction time -
>>>>>>>>>>> however,
>>>>>>>>>> you
>>>>>>>>>>> can let the user specify either its class (if it's a simple
>>> handler
>>>>>>>>>>> like
>>>>>>>>>> a
>>>>>>>>>>> BodyContentHandler) or specify a lambda for creating the handler
>>>>>>>>>>> (SerializableFunction<Void, ContentHandler>), and potentially
>>>>>>>>>>> you can
>>>>>>>>>> have
>>>>>>>>>>> a simpler facade for Tika.parseAsString() - e.g. call it
>>>>>>>>>>> TikaIO.parseAllAsStrings().
>>>>>>>>>>>
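A quick JDK-only illustration of the factory-lambda idea above - a 
non-Serializable object cannot travel with the pipeline, but a 
capture-free factory cast to a Serializable functional interface can 
(StringBuilder stands in for the ContentHandler; the class and interface 
names are made up for this sketch):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Supplier;

public class HandlerFactoryDemo {
    // A Supplier that is also Serializable, so a capture-free lambda or
    // method reference targeting it can be serialized with a transform.
    public interface SerializableSupplier<T> extends Supplier<T>, Serializable {}

    // Serializes obj to bytes and reads it back, mimicking what happens
    // when a pipeline is shipped to workers.
    @SuppressWarnings("unchecked")
    public static <T> T roundTrip(T obj) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        try (ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return (T) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // StringBuilder itself is not Serializable, but the factory is.
        SerializableSupplier<StringBuilder> factory = StringBuilder::new;
        SerializableSupplier<StringBuilder> revived = roundTrip(factory);
        // Each worker gets a fresh handler instance from the revived factory.
        System.out.println(revived.get().append("ok"));
    }
}
```

Beam's SerializableFunction<Void, ContentHandler> would play the same 
role as the SerializableSupplier here: the lambda travels with the 
transform, and each worker invokes it to obtain a fresh handler.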
>>>>>>>>>>> Example usage would look like:
>>>>>>>>>>>
>>>>>>>>>>>       PCollection<KV<String, ParseResult>> parseResults =
>>>>>>>>>>> p.apply(FileIO.match().filepattern(...))
>>>>>>>>>>>         .apply(FileIO.readMatches())
>>>>>>>>>>>         .apply(TikaIO.parseAllAsStrings())
>>>>>>>>>>>
>>>>>>>>>>> or:
>>>>>>>>>>>
>>>>>>>>>>>         .apply(TikaIO.parseAll()
>>>>>>>>>>>             .withParser(new AutoDetectParser())
>>>>>>>>>>>             .withContentHandler(() -> new BodyContentHandler(new
>>>>>>>>>>> ToXMLContentHandler())))
>>>>>>>>>>>
>>>>>>>>>>> You could also have shorthands for letting the user avoid using
>>>>>> FileIO
>>>>>>>>>>> directly in simple cases, for example:
>>>>>>>>>>>         p.apply(TikaIO.parseAsStrings().from(filepattern))
>>>>>>>>>>>
>>>>>>>>>>> This would of course be implemented as a ParDo or even
>>> MapElements,
>>>>>>>>>>> and
>>>>>>>>>>> you'll be able to share the code between parseAll and regular
>>>>>>>>>>> parse.
>>>>>>>>>>>
>>>>>>>>>> I'd like to understand how to do
>>>>>>>>>>
>>>>>>>>>> TikaIO.parse().from(filepattern)
>>>>>>>>>>
>>>>>>>>>> Right now I have TikaIO.Read extending
>>>>>>>>>> PTransform<PBegin, PCollection<ParseResult>
>>>>>>>>>>
>>>>>>>>>> and then the boilerplate code which builds Read when I do 
>>>>>>>>>> something
>>>>>>>>>> like
>>>>>>>>>>
>>>>>>>>>> TikaIO.read().from(filepattern).
>>>>>>>>>>
>>>>>>>>>> What is the convention for supporting something like
>>>>>>>>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, 
>>>>>>>>>> can
>>> I
>>>>>>>>>> see
>>>>>>>>>> some example ?
>>>>>>>>>>
>>>>>>>>> There are a number of IOs that don't use Source - e.g. DatastoreIO
>>>>>>>>> and
>>>>>>>>> JdbcIO. TextIO.readMatches() might be an even better transform to
>>>>>> mimic.
>>>>>>>>> Note that in TikaIO you probably won't need a fusion break 
>>>>>>>>> after the
>>>>>>>>> ParDo
>>>>>>>>> since there's 1 result per input file.
>>>>>>>>>
>>>>>>>>
>>>>>>>> OK, I'll have a look
>>>>>>>>
>>>>>>>> Cheers, Sergey
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Many thanks, Sergey
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>
>>>
>>
> 
> 

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Hi Eugene

On 04/10/17 22:52, Eugene Kirpichov wrote:
> You can avoid automatic decompression by using
> FileIO.readMatches().withCompression(UNCOMPRESSED) (default is AUTO).

This is nice - it was already anticipated earlier that auto-decompression 
would not always be needed, and it would help a readZippedPdfFile test 
pass :-). But I'd rather TikaIO users did not have to worry about adding 
.withCompression(UNCOMPRESSED) for it to work correctly, given that Tika 
always needs the raw content - requiring the users to assert it 
themselves seems a bit suboptimal.

Auto-decompressing is only useful when it is indeed what a user wants 
(a single compressed file, with no risk of a malicious archive); 
otherwise Tika should do the decompression itself.

So ideally, after the initial refactoring is complete, we'd only have

p.apply(FileIO.readMatches()).apply(TikaIO.parseAll())

for reading compressed or uncompressed files, where TikaIO would just do 
ReadableFile.openRaw as suggested by Ben, or at least have another 
shortcut at the FileIO level, something like

FileIO.readRawMatches()

which would be equivalent to specifying the UNCOMPRESSED option explicitly.
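In JDK terms, the AUTO vs UNCOMPRESSED distinction is just whether a 
decompressing wrapper is inserted before the consumer sees the bytes; a 
self-contained sketch (the demo class is hypothetical, not Beam API):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressionDemo {
    // Gzips a string in memory, standing in for a compressed input file.
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    // "AUTO"-style read: a decompressing wrapper is applied, so the
    // consumer sees plain text and never the raw compressed bytes.
    public static String readDecompressed(byte[] compressed) throws IOException {
        try (GZIPInputStream in =
                new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // "UNCOMPRESSED"-style read: hand over the raw bytes untouched, so a
    // downstream parser (e.g. Tika) can detect and decompress itself.
    public static boolean looksLikeGzip(byte[] raw) {
        return raw.length > 2 && (raw[0] & 0xff) == 0x1f && (raw[1] & 0xff) == 0x8b;
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = gzip("hello");
        System.out.println(readDecompressed(raw)); // hello
        System.out.println(looksLikeGzip(raw));    // true
    }
}
```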


Thanks, Sergey
> 
> On Wed, Oct 4, 2017 at 2:42 PM Sergey Beryozkin <sb...@gmail.com>
> wrote:
> 
>> Wait, but what about Tika doing checks like Zip bombs, etc ? Tika is
>> expected to decompress itself, while ReadableFile has the content
>> decompressed.
>>
>> The other point is that Tika reports the names of the zipped files too,
>> in the content, as you can see from TikaIOTest#readZippedPdfFile.
>>
>> Can we assume that if Metadata does not point to the local file then it
>> can be opened as a URL stream ? The same issue affects TikaConfig, so
>> I'd rather have a solution which will work for MatchResult.Metadata and
>> TikaConfig
>>
>> Thanks, Sergey
>> On 04/10/17 22:02, Sergey Beryozkin wrote:
>>> Good point...
>>>
>>> Sergey
>>>
>>> On 04/10/17 18:24, Eugene Kirpichov wrote:
>>>> Can TikaInputStream consume a regular InputStream? If so, you can
>>>> apply it
>>>> to Channels.newInputStream(channel). If not, applying it to the filename
>>>> extracted from Metadata won't work either because it can point to a file
>>>> that's not on the local disk.
>>>>
>>>> On Wed, Oct 4, 2017, 10:08 AM Sergey Beryozkin <sb...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm starting to move toward
>>>>>
>>>>> class TikaIO {
>>>>>      public static ParseAllToString parseAllToString() {..}
>>>>>      class ParseAllToString extends
>> PTransform<PCollection<ReadableFile>,
>>>>> PCollection<ParseResult>> {
>>>>>        ...configuration properties...
>>>>>        expand {
>>>>>          return input.apply(ParDo.of(new ParseToStringFn))
>>>>>        }
>>>>>        class ParseToStringFn extends DoFn<...> {...}
>>>>>      }
>>>>> }
>>>>>
>>>>> as suggested by Eugene
>>>>>
>>>>> The initial migration seems to work fine, except that ReadableFile and
>>>>> in particular, ReadableByteChannel can not be consumed by
>>>>> TikaInputStream yet (I'll open an enhancement request); besides, it's
>>>>> better to let Tika unzip if needed, given that a lot of effort has gone
>>>>> into Tika's detection of zip security issues...
>>>>>
>>>>> So I'm typing it as
>>>>>
>>>>> class ParseAllToString extends
>>>>> PTransform<PCollection<MatchResult.Metadata>, PCollection<ParseResult>>
>>>>>
>>>>> Cheers, Sergey
>>>>>
>>>>> On 02/10/17 12:03, Sergey Beryozkin wrote:
>>>>>> Thanks for the review, please see the last comment:
>>>>>>
>>>>>> https://github.com/apache/beam/pull/3835#issuecomment-333502388
>>>>>>
>>>>>> (sorry for the possible duplication - but I'm not sure that GitHub
>> will
>>>>>> propagate it - as I can not see a comment there that I left on
>>>>>> Saturday).
>>>>>>
>>>>>> Cheers, Sergey
>>>>>> On 29/09/17 10:21, Sergey Beryozkin wrote:
>>>>>>> Hi
>>>>>>> On 28/09/17 17:09, Eugene Kirpichov wrote:
>>>>>>>> Hi! Glad the refactoring is happening, thanks!
>>>>>>>
>>>>>>> Thanks for getting me focused on having TikaIO supporting the simpler
>>>>>>> (and practical) cases first :-)
>>>>>>>> It was auto-assigned to Reuven as formal owner of the component. I
>>>>>>>> reassigned it to you.
>>>>>>> OK, thanks...
>>>>>>>>
>>>>>>>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin
>>>>>>>> <sberyozkin@gmail.com
>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> I started looking at
>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-2994
>>>>>>>>>
>>>>>>>>> and pushed some initial code to my tikaio branch introducing
>>>>>>>>> ParseResult
>>>>>>>>> and updating the tests but keeping the BoundedSource/Reader,
>>>>>>>>> dropping
>>>>>>>>> the asynchronous parsing code, and a few other bits.
>>>>>>>>>
>>>>>>>>> Just noticed it is assigned to Reuven - does it mean Reuven is
>>>>>>>>> looking
>>>>>>>>> into it too or was it auto-assigned ?
>>>>>>>>>
>>>>>>>>> I don't mind, would it make sense for me to do an 'interim' PR on
>>>>>>>>> what I've done so far before completely removing BoundedSource/Reader
>>>>>>>>> based code ?
>>>>>>>>>
>>>>>>>> Yes :)
>>>>>>>>
>>>>>>> I did commit yesterday to my branch, and it made its way to the
>>>>>>> pending PR (which I forgot about) where I only tweaked a couple of
>> doc
>>>>>>> typos, so I renamed that PR:
>>>>>>>
>>>>>>> https://github.com/apache/beam/pull/3835
>>>>>>>
>>>>>>> (The build failures are apparently due to the build timeouts)
>>>>>>>
>>>>>>> As I mentioned, in this PR I updated the existing TikaIO test to work
>>>>>>> with ParseResult, at the moment a file location as its property. Only
>>>>>>> a file name can easily be saved, I thought it might be important
>> where
>>>>>>> on the network the file is - maybe copy it afterwards if needed,
>> etc.
>>>>>>> I'd also have no problems with having it typed as a K key, was only
>>>>>>> trying to make it a bit simpler at the start.
>>>>>>>
>>>>>>> I'll deal with the new configurations after a switch. TikaConfig
>> would
>>>>>>> most likely still need to be supported but I recall you mentioned the
>>>>>>> way it's done now will make it work only with the direct runner. I
>>>>>>> guess I can load it as a URL resource... The other bits like
>> providing
>>>>>>> custom content handlers, parsers, input metadata, may be setting the
>>>>>>> max size of the files, etc, can all be added after a switch.
>>>>>>>
>>>>>>> Note I haven't dealt with a number of your comments to the original
>>>>>>> code which can still be dealt with in the current code - given that
>>>>>>> most of that code will go with the next PR anyway.
>>>>>>>
>>>>>>> Please review or merge if it looks like it is a step in the right
>>>>>>> direction...
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have another question anyway,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> E.g. TikaIO could:
>>>>>>>>>> - take as input a PCollection<ReadableFile>
>>>>>>>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where
>>>>>>>>>> ParseResult
>>>>>>>>>> is a class with properties { String content, Metadata metadata }
>>>>>>>>>> - be configured by: a Parser (it implements Serializable so can be
>>>>>>>>>> specified at pipeline construction time) and a ContentHandler
>> whose
>>>>>>>>>> toString() will go into "content". ContentHandler does not
>>>>>>>>>> implement
>>>>>>>>>> Serializable, so you can not specify it at construction time -
>>>>>>>>>> however,
>>>>>>>>> you
>>>>>>>>>> can let the user specify either its class (if it's a simple
>> handler
>>>>>>>>>> like
>>>>>>>>> a
>>>>>>>>>> BodyContentHandler) or specify a lambda for creating the handler
>>>>>>>>>> (SerializableFunction<Void, ContentHandler>), and potentially
>>>>>>>>>> you can
>>>>>>>>> have
>>>>>>>>>> a simpler facade for Tika.parseAsString() - e.g. call it
>>>>>>>>>> TikaIO.parseAllAsStrings().
>>>>>>>>>>
>>>>>>>>>> Example usage would look like:
>>>>>>>>>>
>>>>>>>>>>       PCollection<KV<String, ParseResult>> parseResults =
>>>>>>>>>> p.apply(FileIO.match().filepattern(...))
>>>>>>>>>>         .apply(FileIO.readMatches())
>>>>>>>>>>         .apply(TikaIO.parseAllAsStrings())
>>>>>>>>>>
>>>>>>>>>> or:
>>>>>>>>>>
>>>>>>>>>>         .apply(TikaIO.parseAll()
>>>>>>>>>>             .withParser(new AutoDetectParser())
>>>>>>>>>>             .withContentHandler(() -> new BodyContentHandler(new
>>>>>>>>>> ToXMLContentHandler())))
>>>>>>>>>>
>>>>>>>>>> You could also have shorthands for letting the user avoid using
>>>>> FileIO
>>>>>>>>>> directly in simple cases, for example:
>>>>>>>>>>         p.apply(TikaIO.parseAsStrings().from(filepattern))
>>>>>>>>>>
>>>>>>>>>> This would of course be implemented as a ParDo or even
>> MapElements,
>>>>>>>>>> and
>>>>>>>>>> you'll be able to share the code between parseAll and regular
>>>>>>>>>> parse.
>>>>>>>>>>
>>>>>>>>> I'd like to understand how to do
>>>>>>>>>
>>>>>>>>> TikaIO.parse().from(filepattern)
>>>>>>>>>
>>>>>>>>> Right now I have TikaIO.Read extending
>>>>>>>>> PTransform<PBegin, PCollection<ParseResult>
>>>>>>>>>
>>>>>>>>> and then the boilerplate code which builds Read when I do something
>>>>>>>>> like
>>>>>>>>>
>>>>>>>>> TikaIO.read().from(filepattern).
>>>>>>>>>
>>>>>>>>> What is the convention for supporting something like
>>>>>>>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can
>> I
>>>>>>>>> see
>>>>>>>>> some example ?
>>>>>>>>>
>>>>>>>> There are a number of IOs that don't use Source - e.g. DatastoreIO
>>>>>>>> and
>>>>>>>> JdbcIO. TextIO.readMatches() might be an even better transform to
>>>>> mimic.
>>>>>>>> Note that in TikaIO you probably won't need a fusion break after the
>>>>>>>> ParDo
>>>>>>>> since there's 1 result per input file.
>>>>>>>>
>>>>>>>
>>>>>>> OK, I'll have a look
>>>>>>>
>>>>>>> Cheers, Sergey
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Many thanks, Sergey
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>
>>
> 


-- 
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/

Re: TikaIO Refactoring

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
You can avoid automatic decompression by using
FileIO.readMatches().withCompression(UNCOMPRESSED) (default is AUTO).

On Wed, Oct 4, 2017 at 2:42 PM Sergey Beryozkin <sb...@gmail.com>
wrote:

> Wait, but what about Tika doing checks like Zip bombs, etc ? Tika is
> expected to decompress itself, while ReadableFile has the content
> decompressed.
>
> The other point is that Tika reports the names of the zipped files too,
> in the content, as you can see from TikaIOTest#readZippedPdfFile.
>
> Can we assume that if Metadata does not point to the local file then it
> can be opened as a URL stream ? The same issue affects TikaConfig, so
> I'd rather have a solution which will work for MatchResult.Metadata and
> TikaConfig
>
> Thanks, Sergey
> On 04/10/17 22:02, Sergey Beryozkin wrote:
> > Good point...
> >
> > Sergey
> >
> > On 04/10/17 18:24, Eugene Kirpichov wrote:
> >> Can TikaInputStream consume a regular InputStream? If so, you can
> >> apply it
> >> to Channels.newInputStream(channel). If not, applying it to the filename
> >> extracted from Metadata won't work either because it can point to a file
> >> that's not on the local disk.
> >>
> >> On Wed, Oct 4, 2017, 10:08 AM Sergey Beryozkin <sb...@gmail.com>
> >> wrote:
> >>
> >>> I'm starting to move toward
> >>>
> >>> class TikaIO {
> >>>     public static ParseAllToString parseAllToString() {..}
> >>>     class ParseAllToString extends
> PTransform<PCollection<ReadableFile>,
> >>> PCollection<ParseResult>> {
> >>>       ...configuration properties...
> >>>       expand {
> >>>         return input.apply(ParDo.of(new ParseToStringFn))
> >>>       }
> >>>       class ParseToStringFn extends DoFn<...> {...}
> >>>     }
> >>> }
> >>>
> >>> as suggested by Eugene
> >>>
> >>> The initial migration seems to work fine, except that ReadableFile and
> >>> in particular, ReadableByteChannel can not be consumed by
> >>> TikaInputStream yet (I'll open an enhancement request); besides, it's
> >>> better to let Tika unzip if needed, given that a lot of effort has gone
> >>> into Tika's detection of zip security issues...
> >>>
> >>> So I'm typing it as
> >>>
> >>> class ParseAllToString extends
> >>> PTransform<PCollection<MatchResult.Metadata>, PCollection<ParseResult>>
> >>>
> >>> Cheers, Sergey
> >>>
> >>> On 02/10/17 12:03, Sergey Beryozkin wrote:
> >>>> Thanks for the review, please see the last comment:
> >>>>
> >>>> https://github.com/apache/beam/pull/3835#issuecomment-333502388
> >>>>
> >>>> (sorry for the possible duplication - but I'm not sure that GitHub
> will
> >>>> propagate it - as I can not see a comment there that I left on
> >>>> Saturday).
> >>>>
> >>>> Cheers, Sergey
> >>>> On 29/09/17 10:21, Sergey Beryozkin wrote:
> >>>>> Hi
> >>>>> On 28/09/17 17:09, Eugene Kirpichov wrote:
> >>>>>> Hi! Glad the refactoring is happening, thanks!
> >>>>>
> >>>>> Thanks for getting me focused on having TikaIO supporting the simpler
> >>>>> (and practical) cases first :-)
> >>>>>> It was auto-assigned to Reuven as formal owner of the component. I
> >>>>>> reassigned it to you.
> >>>>> OK, thanks...
> >>>>>>
> >>>>>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin
> >>>>>> <sberyozkin@gmail.com
> >>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi
> >>>>>>>
> >>>>>>> I started looking at
> >>>>>>> https://issues.apache.org/jira/browse/BEAM-2994
> >>>>>>>
> >>>>>>> and pushed some initial code to my tikaio branch introducing
> >>>>>>> ParseResult
> >>>>>>> and updating the tests but keeping the BoundedSource/Reader,
> >>>>>>> dropping
> >>>>>>> the asynchronous parsing code, and a few other bits.
> >>>>>>>
> >>>>>>> Just noticed it is assigned to Reuven - does it mean Reuven is
> >>>>>>> looking
> >>>>>>> into it too or was it auto-assigned ?
> >>>>>>>
> >>>>>>> I don't mind, would it make sense for me to do an 'interim' PR on
> >>>>>>> what I've done so far before completely removing BoundedSource/Reader
> >>>>>>> based code ?
> >>>>>>>
> >>>>>> Yes :)
> >>>>>>
> >>>>> I did commit yesterday to my branch, and it made its way to the
> >>>>> pending PR (which I forgot about) where I only tweaked a couple of
> doc
> >>>>> typos, so I renamed that PR:
> >>>>>
> >>>>> https://github.com/apache/beam/pull/3835
> >>>>>
> >>>>> (The build failures are apparently due to the build timeouts)
> >>>>>
> >>>>> As I mentioned, in this PR I updated the existing TikaIO test to work
> >>>>> with ParseResult, at the moment a file location as its property. Only
> >>>>> a file name can easily be saved, I thought it might be important
> where
> >>>>> on the network the file is - maybe copy it afterwards if needed,
> etc.
> >>>>> I'd also have no problems with having it typed as a K key, was only
> >>>>> trying to make it a bit simpler at the start.
> >>>>>
> >>>>> I'll deal with the new configurations after a switch. TikaConfig
> would
> >>>>> most likely still need to be supported but I recall you mentioned the
> >>>>> way it's done now will make it work only with the direct runner. I
> >>>>> guess I can load it as a URL resource... The other bits like
> providing
> >>>>> custom content handlers, parsers, input metadata, may be setting the
> >>>>> max size of the files, etc, can all be added after a switch.
> >>>>>
> >>>>> Note I haven't dealt with a number of your comments to the original
> >>>>> code which can still be dealt with in the current code - given that
> >>>>> most of that code will go with the next PR anyway.
> >>>>>
> >>>>> Please review or merge if it looks like it is a step in the right
> >>>>> direction...
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> I have another question anyway,
> >>>>>>>
> >>>>>>>
> >>>>>>>> E.g. TikaIO could:
> >>>>>>>> - take as input a PCollection<ReadableFile>
> >>>>>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where
> >>>>>>>> ParseResult
> >>>>>>>> is a class with properties { String content, Metadata metadata }
> >>>>>>>> - be configured by: a Parser (it implements Serializable so can be
> >>>>>>>> specified at pipeline construction time) and a ContentHandler
> whose
> >>>>>>>> toString() will go into "content". ContentHandler does not
> >>>>>>>> implement
> >>>>>>>> Serializable, so you can not specify it at construction time -
> >>>>>>>> however,
> >>>>>>> you
> >>>>>>>> can let the user specify either its class (if it's a simple
> handler
> >>>>>>>> like
> >>>>>>> a
> >>>>>>>> BodyContentHandler) or specify a lambda for creating the handler
> >>>>>>>> (SerializableFunction<Void, ContentHandler>), and potentially
> >>>>>>>> you can
> >>>>>>> have
> >>>>>>>> a simpler facade for Tika.parseAsString() - e.g. call it
> >>>>>>>> TikaIO.parseAllAsStrings().
> >>>>>>>>
> >>>>>>>> Example usage would look like:
> >>>>>>>>
> >>>>>>>>      PCollection<KV<String, ParseResult>> parseResults =
> >>>>>>>> p.apply(FileIO.match().filepattern(...))
> >>>>>>>>        .apply(FileIO.readMatches())
> >>>>>>>>        .apply(TikaIO.parseAllAsStrings())
> >>>>>>>>
> >>>>>>>> or:
> >>>>>>>>
> >>>>>>>>        .apply(TikaIO.parseAll()
> >>>>>>>>            .withParser(new AutoDetectParser())
> >>>>>>>>            .withContentHandler(() -> new BodyContentHandler(new
> >>>>>>>> ToXMLContentHandler())))
> >>>>>>>>
> >>>>>>>> You could also have shorthands for letting the user avoid using
> >>> FileIO
> >>>>>>>> directly in simple cases, for example:
> >>>>>>>>        p.apply(TikaIO.parseAsStrings().from(filepattern))
> >>>>>>>>
> >>>>>>>> This would of course be implemented as a ParDo or even
> MapElements,
> >>>>>>>> and
> >>>>>>>> you'll be able to share the code between parseAll and regular
> >>>>>>>> parse.
> >>>>>>>>
> >>>>>>> I'd like to understand how to do
> >>>>>>>
> >>>>>>> TikaIO.parse().from(filepattern)
> >>>>>>>
> >>>>>>> Right now I have TikaIO.Read extending
> >>>>>>> PTransform<PBegin, PCollection<ParseResult>
> >>>>>>>
> >>>>>>> and then the boilerplate code which builds Read when I do something
> >>>>>>> like
> >>>>>>>
> >>>>>>> TikaIO.read().from(filepattern).
> >>>>>>>
> >>>>>>> What is the convention for supporting something like
> >>>>>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can
> I
> >>>>>>> see
> >>>>>>> some example ?
> >>>>>>>
> >>>>>> There are a number of IOs that don't use Source - e.g. DatastoreIO
> >>>>>> and
> >>>>>> JdbcIO. TextIO.readMatches() might be an even better transform to
> >>> mimic.
> >>>>>> Note that in TikaIO you probably won't need a fusion break after the
> >>>>>> ParDo
> >>>>>> since there's 1 result per input file.
> >>>>>>
> >>>>>
> >>>>> OK, I'll have a look
> >>>>>
> >>>>> Cheers, Sergey
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Many thanks, Sergey
> >>>>>>>
> >>>>>>
> >>>
> >>
>

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Wait, but what about Tika doing checks like zip bombs, etc? Tika is 
expected to do the decompression itself, while ReadableFile hands over 
the content already decompressed.

The other point is that Tika reports the names of the zipped files too, 
in the content, as you can see from TikaIOTest#readZippedPdfFile.
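For reference, those entry names are the same ones plain java.util.zip 
would surface; a self-contained JDK-only sketch (the demo class is made 
up), independent of Tika:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipNamesDemo {
    // Builds an in-memory zip containing empty entries with the given names,
    // then reads the names back - mimicking how an archive-aware parser can
    // report the names of contained files as part of the extracted content.
    public static List<String> roundTripEntryNames(String... names) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            for (String name : names) {
                zos.putNextEntry(new ZipEntry(name));
                zos.closeEntry();
            }
        }
        List<String> found = new ArrayList<>();
        try (ZipInputStream zis =
                new ZipInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            ZipEntry e;
            while ((e = zis.getNextEntry()) != null) {
                found.add(e.getName());
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTripEntryNames("doc.pdf", "images/a.png"));
    }
}
```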

Can we assume that if Metadata does not point to a local file then it 
can be opened as a URL stream? The same issue affects TikaConfig, so 
I'd rather have a solution which will work for both MatchResult.Metadata 
and TikaConfig.

Thanks, Sergey
On 04/10/17 22:02, Sergey Beryozkin wrote:
> Good point...
> 
> Sergey
> 
> On 04/10/17 18:24, Eugene Kirpichov wrote:
>> Can TikaInputStream consume a regular InputStream? If so, you can 
>> apply it
>> to Channels.newInputStream(channel). If not, applying it to the filename
>> extracted from Metadata won't work either because it can point to a file
>> that's not on the local disk.
>>
>> On Wed, Oct 4, 2017, 10:08 AM Sergey Beryozkin <sb...@gmail.com> 
>> wrote:
>>
>>> I'm starting to move toward
>>>
>>> class TikaIO {
>>>     public static ParseAllToString parseAllToString() {..}
>>>     class ParseAllToString extends PTransform<PCollection<ReadableFile>,
>>> PCollection<ParseResult>> {
>>>       ...configuration properties...
>>>       expand {
>>>         return input.apply(ParDo.of(new ParseToStringFn))
>>>       }
>>>       class ParseToStringFn extends DoFn<...> {...}
>>>     }
>>> }
>>>
>>> as suggested by Eugene
>>>
>>> The initial migration seems to work fine, except that ReadableFile and
>>> in particular, ReadableByteChannel can not be consumed by
>>> TikaInputStream yet (I'll open an enhancement request); besides, it's
>>> better to let Tika unzip if needed, given that a lot of effort has gone
>>> into Tika's detection of zip security issues...
>>>
>>> So I'm typing it as
>>>
>>> class ParseAllToString extends
>>> PTransform<PCollection<MatchResult.Metadata>, PCollection<ParseResult>>
>>>
>>> Cheers, Sergey
>>>
>>> On 02/10/17 12:03, Sergey Beryozkin wrote:
>>>> Thanks for the review, please see the last comment:
>>>>
>>>> https://github.com/apache/beam/pull/3835#issuecomment-333502388
>>>>
>>>> (sorry for the possible duplication - but I'm not sure that GitHub will
>>>> propagate it - as I can not see a comment there that I left on 
>>>> Saturday).
>>>>
>>>> Cheers, Sergey
>>>> On 29/09/17 10:21, Sergey Beryozkin wrote:
>>>>> Hi
>>>>> On 28/09/17 17:09, Eugene Kirpichov wrote:
>>>>>> Hi! Glad the refactoring is happening, thanks!
>>>>>
>>>>> Thanks for getting me focused on having TikaIO supporting the simpler
>>>>> (and practical) cases first :-)
>>>>>> It was auto-assigned to Reuven as formal owner of the component. I
>>>>>> reassigned it to you.
>>>>> OK, thanks...
>>>>>>
>>>>>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin 
>>>>>> <sberyozkin@gmail.com
>>>>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> I started looking at
>>>>>>> https://issues.apache.org/jira/browse/BEAM-2994
>>>>>>>
>>>>>>> and pushed some initial code to my tikaio branch introducing
>>>>>>> ParseResult
>>>>>>> and updating the tests but keeping the BoundedSource/Reader, 
>>>>>>> dropping
>>>>>>> the asynchronous parsing code, and a few other bits.
>>>>>>>
>>>>>>> Just noticed it is assigned to Reuven - does it mean Reuven is 
>>>>>>> looking
>>>>>>> into it too or was it auto-assigned ?
>>>>>>>
>>>>>>> I don't mind, would it make sense for me to do an 'interim' PR on
>>>>>>> what I've done so far before completely removing BoundedSource/Reader
>>>>>>> based code ?
>>>>>>>
>>>>>> Yes :)
>>>>>>
>>>>> I did commit yesterday to my branch, and it made its way to the
>>>>> pending PR (which I forgot about) where I only tweaked a couple of doc
>>>>> typos, so I renamed that PR:
>>>>>
>>>>> https://github.com/apache/beam/pull/3835
>>>>>
>>>>> (The build failures are apparently due to the build timeouts)
>>>>>
>>>>> As I mentioned, in this PR I updated the existing TikaIO test to work
>>>>> with ParseResult, at the moment a file location as its property. Only
>>>>> a file name can easily be saved, I thought it might be important where
>>>>> on the network the file is - maybe copy it afterwards if needed, etc.
>>>>> I'd also have no problems with having it typed as a K key, was only
>>>>> trying to make it a bit simpler at the start.
>>>>>
>>>>> I'll deal with the new configurations after a switch. TikaConfig would
>>>>> most likely still need to be supported but I recall you mentioned the
>>>>> way it's done now will make it work only with the direct runner. I
>>>>> guess I can load it as a URL resource... The other bits like providing
>>>>> custom content handlers, parsers, input metadata, may be setting the
>>>>> max size of the files, etc, can all be added after a switch.
>>>>>
>>>>> Note I haven't dealt with a number of your comments to the original
>>>>> code which can still be dealt with in the current code - given that
>>>>> most of that code will go with the next PR anyway.
>>>>>
>>>>> Please review or merge if it looks like it is a step in the right
>>>>> direction...
>>>>>
>>>>>>
>>>>>>>
>>>>>>> I have another question anyway,
>>>>>>>
>>>>>>>
>>>>>>>> E.g. TikaIO could:
>>>>>>>> - take as input a PCollection<ReadableFile>
>>>>>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where
>>>>>>>> ParseResult
>>>>>>>> is a class with properties { String content, Metadata metadata }
>>>>>>>> - be configured by: a Parser (it implements Serializable so can be
>>>>>>>> specified at pipeline construction time) and a ContentHandler whose
>>>>>>>> toString() will go into "content". ContentHandler does not 
>>>>>>>> implement
>>>>>>>> Serializable, so you can not specify it at construction time -
>>>>>>>> however,
>>>>>>> you
>>>>>>>> can let the user specify either its class (if it's a simple handler
>>>>>>>> like
>>>>>>> a
>>>>>>>> BodyContentHandler) or specify a lambda for creating the handler
>>>>>>>> (SerializableFunction<Void, ContentHandler>), and potentially 
>>>>>>>> you can
>>>>>>> have
>>>>>>>> a simpler facade for Tika.parseAsString() - e.g. call it
>>>>>>>> TikaIO.parseAllAsStrings().
>>>>>>>>
>>>>>>>> Example usage would look like:
>>>>>>>>
>>>>>>>>      PCollection<KV<String, ParseResult>> parseResults =
>>>>>>>> p.apply(FileIO.match().filepattern(...))
>>>>>>>>        .apply(FileIO.readMatches())
>>>>>>>>        .apply(TikaIO.parseAllAsStrings())
>>>>>>>>
>>>>>>>> or:
>>>>>>>>
>>>>>>>>        .apply(TikaIO.parseAll()
>>>>>>>>            .withParser(new AutoDetectParser())
>>>>>>>>            .withContentHandler(() -> new BodyContentHandler(new
>>>>>>>> ToXMLContentHandler())))
>>>>>>>>
>>>>>>>> You could also have shorthands for letting the user avoid using
>>> FileIO
>>>>>>>> directly in simple cases, for example:
>>>>>>>>        p.apply(TikaIO.parseAsStrings().from(filepattern))
>>>>>>>>
>>>>>>>> This would of course be implemented as a ParDo or even MapElements,
>>>>>>>> and
>>>>>>>> you'll be able to share the code between parseAll and regular 
>>>>>>>> parse.
>>>>>>>>
>>>>>>> I'd like to understand how to do
>>>>>>>
>>>>>>> TikaIO.parse().from(filepattern)
>>>>>>>
>>>>>>> Right now I have TikaIO.Read extending
>>>>>>> PTransform<PBegin, PCollection<ParseResult>
>>>>>>>
>>>>>>> and then the boilerplate code which builds Read when I do something
>>>>>>> like
>>>>>>>
>>>>>>> TikaIO.read().from(filepattern).
>>>>>>>
>>>>>>> What is the convention for supporting something like
>>>>>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I
>>>>>>> see
>>>>>>> some example ?
>>>>>>>
>>>>>> There are a number of IOs that don't use Source - e.g. DatastoreIO 
>>>>>> and
>>>>>> JdbcIO. TextIO.readMatches() might be an even better transform to
>>> mimic.
>>>>>> Note that in TikaIO you probably won't need a fusion break after the
>>>>>> ParDo
>>>>>> since there's 1 result per input file.
>>>>>>
>>>>>
>>>>> OK, I'll have a look
>>>>>
>>>>> Cheers, Sergey
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Many thanks, Sergey
>>>>>>>
>>>>>>
>>>
>>

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
Good point...

Sergey

On 04/10/17 18:24, Eugene Kirpichov wrote:
> Can TikaInputStream consume a regular InputStream? If so, you can apply it
> to Channels.newInputStream(channel). If not, applying it to the filename
> extracted from Metadata won't work either because it can point to a file
> that's not on the local disk.
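For what it's worth, the wrapping Eugene suggests here is plain java.nio 
and needs no Beam or Tika types; a minimal self-contained sketch (the 
demo class name is made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;

public class ChannelAdapterDemo {
    // Adapts a ReadableByteChannel (the sort of thing ReadableFile.open()
    // returns) back into a plain InputStream, which is what stream-oriented
    // parsers ultimately consume.
    public static String readAll(ReadableByteChannel channel) throws IOException {
        try (InputStream in = Channels.newInputStream(channel)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        ReadableByteChannel ch = Channels.newChannel(
            new ByteArrayInputStream("hello tika".getBytes(StandardCharsets.UTF_8)));
        System.out.println(readAll(ch)); // prints "hello tika"
    }
}
```

The adapted stream could then, in principle, be handed to Tika - 
assuming TikaInputStream accepts a generic InputStream, which is exactly 
the open question in this thread.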
> 
> On Wed, Oct 4, 2017, 10:08 AM Sergey Beryozkin <sb...@gmail.com> wrote:
> 
>> I'm starting to move toward
>>
>> class TikaIO {
>>     public static ParseAllToString parseAllToString() {..}
>>     class ParseAllToString extends PTransform<PCollection<ReadableFile>,
>> PCollection<ParseResult>> {
>>       ...configuration properties...
>>       expand {
>>         return input.apply(ParDo.of(new ParseToStringFn))
>>       }
>>       class ParseToStringFn extends DoFn<...> {...}
>>     }
>> }
>>
>> as suggested by Eugene
>>
>> The initial migration seems to work fine, except that ReadableFile and
>> in particular, ReadableByteChannel can not be consumed by
>> TikaInputStream yet (I'll open an enhancement request); besides, it's
>> better to let Tika unzip if needed, given that a lot of effort has gone
>> into Tika's detection of zip security issues...
>>
>> So I'm typing it as
>>
>> class ParseAllToString extends
>> PTransform<PCollection<MatchResult.Metadata>, PCollection<ParseResult>>
>>
>> Cheers, Sergey
>>
>> On 02/10/17 12:03, Sergey Beryozkin wrote:
>>> Thanks for the review, please see the last comment:
>>>
>>> https://github.com/apache/beam/pull/3835#issuecomment-333502388
>>>
>>> (sorry for the possible duplication - but I'm not sure that GitHub will
>>> propagate it - as I can not see a comment there that I left on Saturday).
>>>
>>> Cheers, Sergey
>>> On 29/09/17 10:21, Sergey Beryozkin wrote:
>>>> Hi
>>>> On 28/09/17 17:09, Eugene Kirpichov wrote:
>>>>> Hi! Glad the refactoring is happening, thanks!
>>>>
>>>> Thanks for getting me focused on having TikaIO supporting the simpler
>>>> (and practical) cases first :-)
>>>>> It was auto-assigned to Reuven as formal owner of the component. I
>>>>> reassigned it to you.
>>>> OK, thanks...
>>>>>
>>>>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sberyozkin@gmail.com
>>>
>>>>> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I started looking at
>>>>>> https://issues.apache.org/jira/browse/BEAM-2994
>>>>>>
>>>>>> and pushed some initial code to my tikaio branch introducing
>>>>>> ParseResult
>>>>>> and updating the tests but keeping the BoundedSource/Reader, dropping
>>>>>> the asynchronous parsing code, and a few other bits.
>>>>>>
>>>>>> Just noticed it is assigned to Reuven - does it mean Reuven is looking
>>>>>> into it too or was it auto-assigned ?
>>>>>>
>>>>>> I don't mind, would it make sense for me to do an 'interim' PR on
>>>>>> what I've done so far before completely removing BoundedSource/Reader
>>>>>> based code ?
>>>>>>
>>>>> Yes :)
>>>>>
>>>> I did commit yesterday to my branch, and it made its way to the
>>>> pending PR (which I forgot about) where I only tweaked a couple of doc
>>>> typos, so I renamed that PR:
>>>>
>>>> https://github.com/apache/beam/pull/3835
>>>>
>>>> (The build failures are apparently due to the build timeouts)
>>>>
>>>> As I mentioned, in this PR I updated the existing TikaIO test to work
>>>> with ParseResult, which at the moment has a file location as its
>>>> property. Only a file name can easily be saved; I thought it might be
>>>> important where on the network the file is - maybe copy it afterwards if
>>>> needed, etc. I'd also have no problem with having it typed as a K key; I
>>>> was only trying to make it a bit simpler at the start.
>>>>
>>>> I'll deal with the new configurations after a switch. TikaConfig would
>>>> most likely still need to be supported but I recall you mentioned the
>>>> way it's done now will make it work only with the direct runner. I
>>>> guess I can load it as a URL resource... The other bits like providing
>>>> custom content handlers, parsers, input metadata, maybe setting the
>>>> max size of the files, etc., can all be added after the switch.
>>>>
>>>> Note I haven't dealt with a number of your comments to the original
>>>> code which can still be dealt with in the current code - given that
>>>> most of that code will go with the next PR anyway.
>>>>
>>>> Please review or merge if it looks like it is a step in the right
>>>> direction...
>>>>
>>>>>
>>>>>>
>>>>>> I have another question anyway,
>>>>>>
>>>>>>
>>>>>>> E.g. TikaIO could:
>>>>>>> - take as input a PCollection<ReadableFile>
>>>>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where
>>>>>>> ParseResult
>>>>>>> is a class with properties { String content, Metadata metadata }
>>>>>>> - be configured by: a Parser (it implements Serializable so can be
>>>>>>> specified at pipeline construction time) and a ContentHandler whose
>>>>>>> toString() will go into "content". ContentHandler does not implement
>>>>>>> Serializable, so you can not specify it at construction time -
>>>>>>> however,
>>>>>> you
>>>>>>> can let the user specify either its class (if it's a simple handler
>>>>>>> like
>>>>>> a
>>>>>>> BodyContentHandler) or specify a lambda for creating the handler
>>>>>>> (SerializableFunction<Void, ContentHandler>), and potentially you can
>>>>>> have
>>>>>>> a simpler facade for Tika.parseAsString() - e.g. call it
>>>>>>> TikaIO.parseAllAsStrings().
>>>>>>>
>>>>>>> Example usage would look like:
>>>>>>>
>>>>>>>      PCollection<KV<String, ParseResult>> parseResults =
>>>>>>> p.apply(FileIO.match().filepattern(...))
>>>>>>>        .apply(FileIO.readMatches())
>>>>>>>        .apply(TikaIO.parseAllAsStrings())
>>>>>>>
>>>>>>> or:
>>>>>>>
>>>>>>>        .apply(TikaIO.parseAll()
>>>>>>>            .withParser(new AutoDetectParser())
>>>>>>>            .withContentHandler(() -> new BodyContentHandler(new
>>>>>>> ToXMLContentHandler())))
>>>>>>>
>>>>>>> You could also have shorthands for letting the user avoid using
>> FileIO
>>>>>>> directly in simple cases, for example:
>>>>>>>        p.apply(TikaIO.parseAsStrings().from(filepattern))
>>>>>>>
>>>>>>> This would of course be implemented as a ParDo or even MapElements,
>>>>>>> and
>>>>>>> you'll be able to share the code between parseAll and regular parse.
>>>>>>>
>>>>>> I'd like to understand how to do
>>>>>>
>>>>>> TikaIO.parse().from(filepattern)
>>>>>>
>>>>>> Right now I have TikaIO.Read extending
>>>>>> PTransform<PBegin, PCollection<ParseResult>>
>>>>>>
>>>>>> and then the boilerplate code which builds Read when I do something
>>>>>> like
>>>>>>
>>>>>> TikaIO.read().from(filepattern).
>>>>>>
>>>>>> What is the convention for supporting something like
>>>>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I
>>>>>> see
>>>>>> some example ?
>>>>>>
>>>>> There are a number of IOs that don't use Source - e.g. DatastoreIO and
>>>>> JdbcIO. TextIO.readMatches() might be an even better transform to
>> mimic.
>>>>> Note that in TikaIO you probably won't need a fusion break after the
>>>>> ParDo
>>>>> since there's 1 result per input file.
>>>>>
>>>>
>>>> OK, I'll have a look
>>>>
>>>> Cheers, Sergey
>>>>
>>>>>
>>>>>>
>>>>>> Many thanks, Sergey
>>>>>>
>>>>>
>>
> 

Re: TikaIO Refactoring

Posted by Eugene Kirpichov <ki...@google.com.INVALID>.
Can TikaInputStream consume a regular InputStream? If so, you can apply it
to Channels.newInputStream(channel). If not, applying it to the filename
extracted from Metadata won't work either because it can point to a file
that's not on the local disk.
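
For reference, the adaptation Eugene describes can be sketched JDK-only. The real call on the Tika side would be TikaInputStream.get(InputStream); it is not invoked here so the sketch stays dependency-free, and the channel is faked from a byte array rather than coming from a Beam ReadableFile:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

public class ChannelAdapter {
    // Wraps a ReadableByteChannel (such as the one a Beam ReadableFile opens)
    // into a plain InputStream. Tika's TikaInputStream.get(InputStream) could
    // then consume it; Tika itself is deliberately not referenced here.
    public static InputStream asInputStream(ReadableByteChannel channel) {
        return Channels.newInputStream(channel);
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello tika".getBytes("UTF-8");
        // Stand-in for a channel opened from a matched file.
        ReadableByteChannel ch = Channels.newChannel(new ByteArrayInputStream(data));
        InputStream in = asInputStream(ch);
        byte[] out = new byte[data.length];
        int n = in.read(out);
        System.out.println(n + ":" + new String(out, 0, n, "UTF-8"));
    }
}
```

Note this only addresses streams; as discussed below, handing Tika the file itself may still be preferable so its own zip handling applies.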

On Wed, Oct 4, 2017, 10:08 AM Sergey Beryozkin <sb...@gmail.com> wrote:

> I'm starting to move toward
>
> class TikaIO {
>    public static ParseAllToString parseAllToString() {..}
>    class ParseAllToString extends PTransform<PCollection<ReadableFile>,
> PCollection<ParseResult>> {
>      ...configuration properties...
>      expand {
>        return input.apply(ParDo.of(new ParseToStringFn))
>      }
>      class ParseToStringFn extends DoFn<...> {...}
>    }
> }
>
> as suggested by Eugene
>
> The initial migration seems to work fine, except that ReadableFile and
> in particular, ReadableByteChannel can not be consumed by
> TikaInputStream yet (I'll open an enhancement request); besides, it's
> better to let Tika unzip if needed, given that a lot of effort went
> into Tika for detecting zip security issues...
>
> So I'm typing it as
>
> class ParseAllToString extends
> PTransform<PCollection<MatchResult.Metadata>, PCollection<ParseResult>>
>
> Cheers, Sergey
>
> On 02/10/17 12:03, Sergey Beryozkin wrote:
> > Thanks for the review, please see the last comment:
> >
> > https://github.com/apache/beam/pull/3835#issuecomment-333502388
> >
> > (sorry for the possible duplication - but I'm not sure that GitHub will
> > propagate it - as I can not see a comment there that I left on Saturday).
> >
> > Cheers, Sergey
> > On 29/09/17 10:21, Sergey Beryozkin wrote:
> >> Hi
> >> On 28/09/17 17:09, Eugene Kirpichov wrote:
> >>> Hi! Glad the refactoring is happening, thanks!
> >>
> >> Thanks for getting me focused on having TikaIO supporting the simpler
> >> (and practical) cases first :-)
> >>> It was auto-assigned to Reuven as formal owner of the component. I
> >>> reassigned it to you.
> >> OK, thanks...
> >>>
> >>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sberyozkin@gmail.com
> >
> >>> wrote:
> >>>
> >>>> Hi
> >>>>
> >>>> I started looking at
> >>>> https://issues.apache.org/jira/browse/BEAM-2994
> >>>>
> >>>> and pushed some initial code to my tikaio branch introducing
> >>>> ParseResult
> >>>> and updating the tests but keeping the BoundedSource/Reader, dropping
> >>>> the asynchronous parsing code, and a few other bits.
> >>>>
> >>>> Just noticed it is assigned to Reuven - does it mean Reuven is looking
> >>>> into it too or was it auto-assigned ?
> >>>>
> >>>> I don't mind, would it make sense for me to do an 'interim' PR on
> >>>> what I've done so far before completely removing BoundedSource/Reader
> >>>> based code ?
> >>>>
> >>> Yes :)
> >>>
> >> I did commit yesterday to my branch, and it made its way to the
> >> pending PR (which I forgot about) where I only tweaked a couple of doc
> >> typos, so I renamed that PR:
> >>
> >> https://github.com/apache/beam/pull/3835
> >>
> >> (The build failures are apparently due to the build timeouts)
> >>
> >> As I mentioned, in this PR I updated the existing TikaIO test to work
> >> with ParseResult, which at the moment has a file location as its
> >> property. Only a file name can easily be saved; I thought it might be
> >> important where on the network the file is - maybe copy it afterwards if
> >> needed, etc. I'd also have no problem with having it typed as a K key; I
> >> was only trying to make it a bit simpler at the start.
> >>
> >> I'll deal with the new configurations after a switch. TikaConfig would
> >> most likely still need to be supported but I recall you mentioned the
> >> way it's done now will make it work only with the direct runner. I
> >> guess I can load it as a URL resource... The other bits like providing
> >> custom content handlers, parsers, input metadata, maybe setting the
> >> max size of the files, etc., can all be added after the switch.
> >>
> >> Note I haven't dealt with a number of your comments to the original
> >> code which can still be dealt with in the current code - given that
> >> most of that code will go with the next PR anyway.
> >>
> >> Please review or merge if it looks like it is a step in the right
> >> direction...
> >>
> >>>
> >>>>
> >>>> I have another question anyway,
> >>>>
> >>>>
> >>>>> E.g. TikaIO could:
> >>>>> - take as input a PCollection<ReadableFile>
> >>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where
> >>>>> ParseResult
> >>>>> is a class with properties { String content, Metadata metadata }
> >>>>> - be configured by: a Parser (it implements Serializable so can be
> >>>>> specified at pipeline construction time) and a ContentHandler whose
> >>>>> toString() will go into "content". ContentHandler does not implement
> >>>>> Serializable, so you can not specify it at construction time -
> >>>>> however,
> >>>> you
> >>>>> can let the user specify either its class (if it's a simple handler
> >>>>> like
> >>>> a
> >>>>> BodyContentHandler) or specify a lambda for creating the handler
> >>>>> (SerializableFunction<Void, ContentHandler>), and potentially you can
> >>>> have
> >>>>> a simpler facade for Tika.parseAsString() - e.g. call it
> >>>>> TikaIO.parseAllAsStrings().
> >>>>>
> >>>>> Example usage would look like:
> >>>>>
> >>>>>     PCollection<KV<String, ParseResult>> parseResults =
> >>>>> p.apply(FileIO.match().filepattern(...))
> >>>>>       .apply(FileIO.readMatches())
> >>>>>       .apply(TikaIO.parseAllAsStrings())
> >>>>>
> >>>>> or:
> >>>>>
> >>>>>       .apply(TikaIO.parseAll()
> >>>>>           .withParser(new AutoDetectParser())
> >>>>>           .withContentHandler(() -> new BodyContentHandler(new
> >>>>> ToXMLContentHandler())))
> >>>>>
> >>>>> You could also have shorthands for letting the user avoid using
> FileIO
> >>>>> directly in simple cases, for example:
> >>>>>       p.apply(TikaIO.parseAsStrings().from(filepattern))
> >>>>>
> >>>>> This would of course be implemented as a ParDo or even MapElements,
> >>>>> and
> >>>>> you'll be able to share the code between parseAll and regular parse.
> >>>>>
> >>>> I'd like to understand how to do
> >>>>
> >>>> TikaIO.parse().from(filepattern)
> >>>>
> >>>> Right now I have TikaIO.Read extending
> >>>> PTransform<PBegin, PCollection<ParseResult>>
> >>>>
> >>>> and then the boilerplate code which builds Read when I do something
> >>>> like
> >>>>
> >>>> TikaIO.read().from(filepattern).
> >>>>
> >>>> What is the convention for supporting something like
> >>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I
> >>>> see
> >>>> some example ?
> >>>>
> >>> There are a number of IOs that don't use Source - e.g. DatastoreIO and
> >>> JdbcIO. TextIO.readMatches() might be an even better transform to
> mimic.
> >>> Note that in TikaIO you probably won't need a fusion break after the
> >>> ParDo
> >>> since there's 1 result per input file.
> >>>
> >>
> >> OK, I'll have a look
> >>
> >> Cheers, Sergey
> >>
> >>>
> >>>>
> >>>> Many thanks, Sergey
> >>>>
> >>>
>

Re: TikaIO Refactoring

Posted by Sergey Beryozkin <sb...@gmail.com>.
I'm starting to move toward

class TikaIO {
   public static ParseAllToString parseAllToString() {..}
   class ParseAllToString extends PTransform<PCollection<ReadableFile>, 
PCollection<ParseResult>> {
     ...configuration properties...
     expand {
       return input.apply(ParDo.of(new ParseToStringFn))
     }
     class ParseToStringFn extends DoFn<...> {...}
   }
}

as suggested by Eugene

The initial migration seems to work fine, except that ReadableFile and 
in particular, ReadableByteChannel can not be consumed by 
TikaInputStream yet (I'll open an enhancement request); besides, it's 
better to let Tika unzip if needed, given that a lot of effort went 
into Tika for detecting zip security issues...

So I'm typing it as

class ParseAllToString extends 
PTransform<PCollection<MatchResult.Metadata>, PCollection<ParseResult>>
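
As a dependency-free sketch of the ParseResult value type discussed in this thread (the property names are assumptions, and Tika's Metadata class is stood in for by a plain Map so the example stays JDK-only):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ParseResultSketch {
    // Hypothetical shape of ParseResult: the file's location plus the
    // extracted content and metadata. In TikaIO proper, "content" would be
    // the ContentHandler's toString() and "metadata" a Tika Metadata object.
    static final class ParseResult {
        final String fileLocation;
        final String content;
        final Map<String, String> metadata;

        ParseResult(String fileLocation, String content, Map<String, String> metadata) {
            this.fileLocation = fileLocation;
            this.content = content;
            this.metadata = Collections.unmodifiableMap(new HashMap<>(metadata));
        }
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        meta.put("Content-Type", "application/pdf");
        // The network location is kept so the file can be found/copied later.
        ParseResult r = new ParseResult("/shared/docs/report.pdf", "Hello", meta);
        System.out.println(r.fileLocation + "|" + r.content + "|" + r.metadata.get("Content-Type"));
    }
}
```

Keeping the location as a String (rather than a generic K key) matches the simpler starting point mentioned later in this message.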

Cheers, Sergey

On 02/10/17 12:03, Sergey Beryozkin wrote:
> Thanks for the review, please see the last comment:
> 
> https://github.com/apache/beam/pull/3835#issuecomment-333502388
> 
> (sorry for the possible duplication - but I'm not sure that GitHub will 
> propagate it - as I can not see a comment there that I left on Saturday).
> 
> Cheers, Sergey
> On 29/09/17 10:21, Sergey Beryozkin wrote:
>> Hi
>> On 28/09/17 17:09, Eugene Kirpichov wrote:
>>> Hi! Glad the refactoring is happening, thanks!
>>
>> Thanks for getting me focused on having TikaIO supporting the simpler 
>> (and practical) cases first :-)
>>> It was auto-assigned to Reuven as formal owner of the component. I
>>> reassigned it to you.
>> OK, thanks...
>>>
>>> On Thu, Sep 28, 2017 at 7:57 AM Sergey Beryozkin <sb...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I started looking at
>>>> https://issues.apache.org/jira/browse/BEAM-2994
>>>>
>>>> and pushed some initial code to my tikaio branch introducing 
>>>> ParseResult
>>>> and updating the tests but keeping the BoundedSource/Reader, dropping
>>>> the asynchronous parsing code, and a few other bits.
>>>>
>>>> Just noticed it is assigned to Reuven - does it mean Reuven is looking
>>>> into it too or was it auto-assigned ?
>>>>
>>>> I don't mind, would it make sense for me to do an 'interim' PR on
>>>> what I've done so far before completely removing BoundedSource/Reader
>>>> based code ?
>>>>
>>> Yes :)
>>>
>> I did commit yesterday to my branch, and it made its way to the 
>> pending PR (which I forgot about) where I only tweaked a couple of doc 
>> typos, so I renamed that PR:
>>
>> https://github.com/apache/beam/pull/3835
>>
>> (The build failures are apparently due to the build timeouts)
>>
>> As I mentioned, in this PR I updated the existing TikaIO test to work 
>> with ParseResult, which at the moment has a file location as its 
>> property. Only a file name can easily be saved; I thought it might be 
>> important where on the network the file is - maybe copy it afterwards if 
>> needed, etc. I'd also have no problem with having it typed as a K key; I 
>> was only trying to make it a bit simpler at the start.
>>
>> I'll deal with the new configurations after a switch. TikaConfig would 
>> most likely still need to be supported but I recall you mentioned the 
>> way it's done now will make it work only with the direct runner. I 
>> guess I can load it as a URL resource... The other bits like providing 
>> custom content handlers, parsers, input metadata, maybe setting the 
>> max size of the files, etc., can all be added after the switch.
>>
>> Note I haven't dealt with a number of your comments to the original 
>> code which can still be dealt with in the current code - given that 
>> most of that code will go with the next PR anyway.
>>
>> Please review or merge if it looks like it is a step in the right 
>> direction...
>>
>>>
>>>>
>>>> I have another question anyway,
>>>>
>>>>
>>>>> E.g. TikaIO could:
>>>>> - take as input a PCollection<ReadableFile>
>>>>> - return a PCollection<KV<String, TikaIO.ParseResult>>, where 
>>>>> ParseResult
>>>>> is a class with properties { String content, Metadata metadata }
>>>>> - be configured by: a Parser (it implements Serializable so can be
>>>>> specified at pipeline construction time) and a ContentHandler whose
>>>>> toString() will go into "content". ContentHandler does not implement
>>>>> Serializable, so you can not specify it at construction time - 
>>>>> however,
>>>> you
>>>>> can let the user specify either its class (if it's a simple handler 
>>>>> like
>>>> a
>>>>> BodyContentHandler) or specify a lambda for creating the handler
>>>>> (SerializableFunction<Void, ContentHandler>), and potentially you can
>>>> have
>>>>> a simpler facade for Tika.parseAsString() - e.g. call it
>>>>> TikaIO.parseAllAsStrings().
>>>>>
>>>>> Example usage would look like:
>>>>>
>>>>>     PCollection<KV<String, ParseResult>> parseResults =
>>>>> p.apply(FileIO.match().filepattern(...))
>>>>>       .apply(FileIO.readMatches())
>>>>>       .apply(TikaIO.parseAllAsStrings())
>>>>>
>>>>> or:
>>>>>
>>>>>       .apply(TikaIO.parseAll()
>>>>>           .withParser(new AutoDetectParser())
>>>>>           .withContentHandler(() -> new BodyContentHandler(new
>>>>> ToXMLContentHandler())))
>>>>>
>>>>> You could also have shorthands for letting the user avoid using FileIO
>>>>> directly in simple cases, for example:
>>>>>       p.apply(TikaIO.parseAsStrings().from(filepattern))
>>>>>
>>>>> This would of course be implemented as a ParDo or even MapElements, 
>>>>> and
>>>>> you'll be able to share the code between parseAll and regular parse.
>>>>>
>>>> I'd like to understand how to do
>>>>
>>>> TikaIO.parse().from(filepattern)
>>>>
>>>> Right now I have TikaIO.Read extending
>>>> PTransform<PBegin, PCollection<ParseResult>>
>>>>
>>>> and then the boilerplate code which builds Read when I do something 
>>>> like
>>>>
>>>> TikaIO.read().from(filepattern).
>>>>
>>>> What is the convention for supporting something like
>>>> TikaIO.parse().from(filepattern) to be implemented as a ParDo, can I 
>>>> see
>>>> some example ?
>>>>
>>> There are a number of IOs that don't use Source - e.g. DatastoreIO and
>>> JdbcIO. TextIO.readMatches() might be an even better transform to mimic.
>>> Note that in TikaIO you probably won't need a fusion break after the 
>>> ParDo
>>> since there's 1 result per input file.
>>>
>>
>> OK, I'll have a look
>>
>> Cheers, Sergey
>>
>>>
>>>>
>>>> Many thanks, Sergey
>>>>
>>>