
Posted to user@beam.apache.org by Cameron Bateman <cb...@gmail.com> on 2020/04/08 05:30:07 UTC

Handling imperfect data

I am trying to create a pipeline that ingests PDF files, parses them with
Tika, and processes the extracted data.  One problem is that Tika sometimes
converts certain pieces of text incorrectly.

I can detect this and would like to fork the output of my pipeline: for
correctly converted PDF files, I want to continue processing the data.
For the ones that have errors, I'd like to dump the intermediate XML data
to a directory and raise an alert.  I will then manually fix those files
and effectively restart the pipeline from where it failed, as if the
conversion had been correct in the first place.

Is there any facility for this sort of handling of imperfect input data?
I see that I can use MultiOutputReceiver and TupleTags to fork the data,
but I'm a little at a loss as to how to proceed.

Thanks,
Cameron

Re: Handling imperfect data

Posted by Cameron Bateman <cb...@gmail.com>.

Thanks Varun, that worked.  A small note for anyone following this: the
API seems to have changed slightly since the blog post was written.  In
particular, processElement is no longer a method inherited from the DoFn
parent class in recent versions.  Instead, you mark your own method with
the @ProcessElement annotation.  Check the up-to-date API documentation
for more info.
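
For anyone comparing against the blog post, the annotation style might look
roughly like the sketch below. It assumes the Beam 2.x Java SDK is on the
classpath. The class name, the DEAD_LETTER tag, and the isEmptyConversion
check are hypothetical stand-ins, not taken from the blog post or from the
original pipeline.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.TupleTag;

public class AnnotatedDoFnSketch {
  // Hypothetical tag for the dead-letter output. The anonymous subclass
  // ({}) keeps the String type parameter available for coder inference.
  static final TupleTag<String> DEAD_LETTER = new TupleTag<String>() {};

  // Hypothetical failure check: treat a blank conversion result as bad.
  // A real pipeline would substitute its own detection logic here.
  static boolean isEmptyConversion(String xml) {
    return xml.trim().isEmpty();
  }

  static class DivertEmpty extends DoFn<String, String> {
    // Not an @Override: DoFn declares no processElement method to
    // override. Beam locates this method through the annotation.
    @ProcessElement
    public void processElement(ProcessContext c) {
      String xml = c.element();
      if (isEmptyConversion(xml)) {
        c.output(DEAD_LETTER, xml); // tagged (additional) output
      } else {
        c.output(xml);              // main output
      }
    }
  }
}
```

The method name is free to choose; only the annotation matters. The tagged
output is only usable if the ParDo that applies this DoFn declares
DEAD_LETTER through withOutputTags.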

On Tue, Apr 7, 2020 at 11:05 PM Varun Dhussa <va...@google.com> wrote:

> TupleTags is a good way to proceed. You can add a dead letter side output
> for the tag. A sample implementation is here
> <https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow>
> .
>
> Varun

Re: Handling imperfect data

Posted by Varun Dhussa <va...@google.com>.

TupleTags are a good way to proceed. You can add a dead-letter side output
with its own tag. A sample implementation is here:
<https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow>

Varun
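
As a rough illustration of the TupleTag approach with MultiOutputReceiver,
here is a minimal sketch. It assumes the Beam 2.x Java SDK plus the direct
runner on the classpath. The class name, the CLEAN and DEAD_LETTER tags, the
sample inputs, and the looksBroken check (which just looks for the Unicode
replacement character) are all hypothetical; the real detection logic would
come from the pipeline in question.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

public class DeadLetterSketch {
  // Anonymous subclasses ({}) let Beam infer each output's element type.
  static final TupleTag<String> CLEAN = new TupleTag<String>() {};
  static final TupleTag<String> DEAD_LETTER = new TupleTag<String>() {};

  // Hypothetical check: flag empty output or a U+FFFD replacement char.
  static boolean looksBroken(String xml) {
    return xml.isEmpty() || xml.indexOf('\uFFFD') >= 0;
  }

  static class RouteByQuality extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String xml, MultiOutputReceiver out) {
      if (looksBroken(xml)) {
        out.get(DEAD_LETTER).output(xml); // divert for manual repair
      } else {
        out.get(CLEAN).output(xml);       // continue normal processing
      }
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Stand-in for the XML strings produced by the Tika parsing step.
    PCollection<String> parsed =
        p.apply(Create.of("<p>ok</p>", "<p>bad \uFFFD</p>"));

    // CLEAN is the main output; DEAD_LETTER is declared as additional.
    PCollectionTuple routed = parsed.apply(
        ParDo.of(new RouteByQuality())
            .withOutputTags(CLEAN, TupleTagList.of(DEAD_LETTER)));

    PCollection<String> clean = routed.get(CLEAN);
    PCollection<String> deadLetters = routed.get(DEAD_LETTER);
    // Downstream transforms would attach here: further processing on
    // `clean`, and a file sink plus alerting on `deadLetters`.

    p.run().waitUntilFinish();
  }
}
```

Each branch is an ordinary PCollection, so the dead-letter branch can be
written to a directory while the clean branch carries on unchanged. Manually
fixed files could later be fed into the same downstream transforms by a
separate run that reads from that directory.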

