Posted to users@nifi.apache.org by Shawn Weeks <sw...@weeksconsulting.us> on 2018/10/26 15:36:35 UTC

ScriptedRecordReader Error Handling

Is there any way for a ScriptedRecordReader to set an attribute on a FlowFile when there is an error? I have a situation where I've written a Groovy script to parse XML into a specific record structure, and occasionally the incoming data has characters not allowed in XML. Unfortunately, the system that generates the XML does so through string manipulation instead of actually understanding XML, so it crams all kinds of junk characters into the data. I'd rather not scrub every file, as some of them can be large, so I was trying to figure out a way to scrub them only on exception.


Thanks

Shawn Weeks

RE: ScriptedRecordReader Error Handling

Posted by Shawn Weeks <sw...@weeksconsulting.us>.
I was trying to avoid having to check every single file since that will impact performance. I could run a ReplaceText on each file prior to parsing the records, but the files may be 100-200 MB and that slows things down a bit.
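
For reference, a rough sketch of what a streaming scrub could look like, written in Java since that can be called from a Groovy script. It drops every character outside the Char production of XML 1.0 (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF) and reads through a buffer rather than loading the whole file. The names XmlCharScrubber, scrub and isLegalXmlChar are made up for illustration and are not part of NiFi; how this would be wired into a flow so it only runs on failed files is not shown.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper, not a NiFi class.
public class XmlCharScrubber {

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Copies in to out, dropping illegal characters; returns how many were dropped. */
    public static long scrub(Reader in, Writer out) throws IOException {
        PushbackReader reader = new PushbackReader(new BufferedReader(in), 1);
        long dropped = 0;
        int c;
        while ((c = reader.read()) != -1) {
            if (Character.isHighSurrogate((char) c)) {
                int next = reader.read();
                if (next != -1 && Character.isLowSurrogate((char) next)) {
                    // A valid surrogate pair always encodes U+10000..U+10FFFF, which is legal.
                    out.write(c);
                    out.write(next);
                } else {
                    // Unpaired high surrogate: drop it and re-examine the following char.
                    dropped++;
                    if (next != -1) {
                        reader.unread(next);
                    }
                }
            } else if (isLegalXmlChar(c)) {
                out.write(c);
            } else {
                // Control characters, lone low surrogates, U+FFFE and U+FFFF end up here.
                dropped++;
            }
        }
        out.flush();
        return dropped;
    }

    /** Convenience wrapper for small inputs and quick experiments. */
    public static String scrub(String s) {
        StringWriter out = new StringWriter();
        try {
            scrub(new StringReader(s), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "<a>bad" + (char) 0x01 + "data</a>";
        System.out.println(scrub(dirty));
    }
}
```

This only removes characters that can never appear in an XML 1.0 document; it does not repair other damage such as unescaped ampersands or unbalanced tags.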

Thanks
Shawn

From: Mike Thomsen <mi...@gmail.com>
Sent: Friday, October 26, 2018 11:38 AM
To: users@nifi.apache.org
Subject: Re: ScriptedRecordReader Error Handling

As a backup to that, you can also write a Groovy script for ExecuteScript that uses StAX to iterate over the XML data. It won't care about schemas (Avro or XML) and the like; it just checks for basic validity.

On Fri, Oct 26, 2018 at 11:42 AM Joe Witt <jo...@gmail.com> wrote:
Can't your logic detect the strange characters and then apply its
behavior?  Alternatively, you could perhaps use ValidateRecord and
have its reader only understand the good records.  It should kick out
the bad records and you can then apply deeper processing on them.

Thanks

Re: ScriptedRecordReader Error Handling

Posted by Mike Thomsen <mi...@gmail.com>.
As a backup to that, you can also write a Groovy script for ExecuteScript
that uses StAX to iterate over the XML data. It won't care about schemas
(Avro or XML) and the like; it just checks for basic validity.
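
A minimal sketch of that kind of check, written in Java since StAX (javax.xml.stream) is a Java API that a Groovy script can call directly. It walks every event in the stream and treats any XMLStreamException as "not well-formed". The class and method names (XmlWellFormedCheck, isWellFormed) are hypothetical, and the ExecuteScript plumbing (reading the FlowFile content, setting an attribute, routing) is deliberately left out.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not a NiFi class.
public class XmlWellFormedCheck {

    /** Returns true if the stream parses as well-formed XML from start to end. */
    public static boolean isWellFormed(InputStream in) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // No DTD or external entity handling; only basic well-formedness is of interest.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = null;
        try {
            reader = factory.createXMLStreamReader(in);
            // Pull every event; the parser throws on the first problem it finds.
            while (reader.hasNext()) {
                reader.next();
            }
            return true;
        } catch (XMLStreamException e) {
            return false;
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (XMLStreamException ignored) {
                    // Nothing useful to do if closing the reader fails.
                }
            }
        }
    }

    public static void main(String[] args) {
        byte[] good = "<a><b>ok</b></a>".getBytes(StandardCharsets.UTF_8);
        byte[] bad = "<a><b>unclosed</a>".getBytes(StandardCharsets.UTF_8);
        System.out.println(isWellFormed(new ByteArrayInputStream(good)));
        System.out.println(isWellFormed(new ByteArrayInputStream(bad)));
    }
}
```

Because the parser streams through the input, memory use stays flat even for large files, but the whole document still has to be read once.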

On Fri, Oct 26, 2018 at 11:42 AM Joe Witt <jo...@gmail.com> wrote:

> Can't your logic detect the strange characters and then apply its
> behavior?  Alternatively, you could perhaps use ValidateRecord and
> have its reader only understand the good records.  It should kick out
> the bad records and you can then apply deeper processing on them.
>
> Thanks

RE: ScriptedRecordReader Error Handling

Posted by Shawn Weeks <sw...@weeksconsulting.us>.
Since it's XML it fails on the initial parse of the document as the entire FlowFile is a single XML Document.

Thanks
Shawn

-----Original Message-----
From: Joe Witt <jo...@gmail.com> 
Sent: Friday, October 26, 2018 10:43 AM
To: users@nifi.apache.org
Subject: Re: ScriptedRecordReader Error Handling

Can't your logic detect the strange characters and then apply its
behavior?  Alternatively, you could perhaps use ValidateRecord and
have its reader only understand the good records.  It should kick out
the bad records and you can then apply deeper processing on them.

Thanks

Re: ScriptedRecordReader Error Handling

Posted by Joe Witt <jo...@gmail.com>.
Can't your logic detect the strange characters and then apply its
behavior?  Alternatively, you could perhaps use ValidateRecord and
have its reader only understand the good records.  It should kick out
the bad records and you can then apply deeper processing on them.

Thanks