You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@nifi.apache.org by naga satish <cn...@gmail.com> on 2020/11/27 15:57:40 UTC

Extracting fields that are enveloped in quotes.

I have some pipe delimited files. Each field is bounded by quotes like
this.

"Created_Date__c"|"CreatedById"|"CreatedDate"|"Guid_c"
"2020-03-02 00:00:00"|"0053i000002XCpAAG"|"2020-03-02
16:01:34"|"94bf83ccf9daf610VgnVCM100000307882a2RCRD"
"2020-03-03 00:00:00"|"0053i000002XCpAAG"|"2020-03-03
09:15:56"|"1a4bb238cdedd610VgnVCM100000307882a2RCRD"
"2020-03-03 00:00:00"|"0053i000002XCpAAG"|"2020-03-03
09:52:33"|"22408baca6fee610VgnVCM100000307882a2RCRD"

I need to cleanse this data and the needs to look like this.

Created_Date__c|CreatedById|CreatedDate|Guid_c
2020-03-02 00:00:00|0053i000002XCpAAG|2020-03-02
16:01:34|94bf83ccf9daf610VgnVCM100000307882a2RCRD
2020-03-03 00:00:00|0053i000002XCpAAG|2020-03-03
09:15:56|1a4bb238cdedd610VgnVCM100000307882a2RCRD
2020-03-03 00:00:00|0053i000002XCpAAG|2020-03-03
09:52:33|22408baca6fee610VgnVCM100000307882a2RCRD

I tried using ReplaceText with these configurations.
search value - ^"(.*)"$ and Replacement Value - $1. But these
configurations is not working and the file is routing to failure. not sure
what might be the issue. What can i do to solve this issue?

open to other suggestions as well. Thanks in advance.

Re: Extracting fields that are enveloped in quotes.

Posted by naga satish <cn...@gmail.com>.
Hi mark,tried UpdateRecord with CSVReader and CSVRecordSetWriter.
configured delimiter as "|" and schema property to infer schema. file is
being sent to failure.



On Sat, Nov 28, 2020 at 9:08 AM Mark Payne <ma...@hotmail.com> wrote:

> You wouldn’t configure that in the csv reader - it will read whatever it
> is given. You’d set that in the writer. The writer has a property named
> Quote Mode, which you can set to Minimal.
>
> Sent from my iPhone
>
> On Nov 27, 2020, at 10:09 PM, naga satish <cn...@gmail.com> wrote:
>
> 
> Hi Mark,
> Few fields contains special characters (quotes and pipe). I need to retain
> them. How to mention in CSV reader to use quotes when necessary?
>
> On Fri, 27 Nov 2020, 9:41 pm Mark Payne, <ma...@hotmail.com> wrote:
>
>> Hello,
>>
>> The given regex is definitely not what you want. It should remove the
>> first quote and last quote in every line but not the quotes in between. Not
>> sure why it would go to failure without having the error message.
>>
>> You have a few different options. Firstly you could use replace text and
>> use a search value of just double quotes (“) and a replacement value of
>> empty string and a Replacement Strategy of Literal. This would just blindly
>> remove any quote characters.
>>
>> Perhaps a better solution would be to use ConvertRecord with a CSV Reader
>> and CSV Writer. Use pipe as the delimiter. On the writer you can tell it to
>> only use quotes where necessary. So if you had any cells that have pipes or
>> other special characters those would get quotes. Given your data set it may
>> be ok to assume that won’t be the case but it’s worth mentioning. It may
>> perform better but I can’t say for sure without testing.
>>
>> Thanks
>> -Mark
>>
>> Sent from my iPhone
>>
>> > On Nov 27, 2020, at 10:58 AM, naga satish <cn...@gmail.com>
>> wrote:
>> >
>> > 
>> > I have some pipe delimited files. Each field is bounded by quotes like
>> this.
>> >
>> > "Created_Date__c"|"CreatedById"|"CreatedDate"|"Guid_c"
>> > "2020-03-02 00:00:00"|"0053i000002XCpAAG"|"2020-03-02
>> 16:01:34"|"94bf83ccf9daf610VgnVCM100000307882a2RCRD"
>> > "2020-03-03 00:00:00"|"0053i000002XCpAAG"|"2020-03-03
>> 09:15:56"|"1a4bb238cdedd610VgnVCM100000307882a2RCRD"
>> > "2020-03-03 00:00:00"|"0053i000002XCpAAG"|"2020-03-03
>> 09:52:33"|"22408baca6fee610VgnVCM100000307882a2RCRD"
>> >
>> > I need to cleanse this data and the needs to look like this.
>> >
>> > Created_Date__c|CreatedById|CreatedDate|Guid_c
>> > 2020-03-02 00:00:00|0053i000002XCpAAG|2020-03-02
>> 16:01:34|94bf83ccf9daf610VgnVCM100000307882a2RCRD
>> > 2020-03-03 00:00:00|0053i000002XCpAAG|2020-03-03
>> 09:15:56|1a4bb238cdedd610VgnVCM100000307882a2RCRD
>> > 2020-03-03 00:00:00|0053i000002XCpAAG|2020-03-03
>> 09:52:33|22408baca6fee610VgnVCM100000307882a2RCRD
>> >
>> > I tried using ReplaceText with these configurations.
>> > search value - ^"(.*)"$ and Replacement Value - $1. But these
>> configurations is not working and the file is routing to failure. not sure
>> what might be the issue. What can i do to solve this issue?
>> >
>> > open to other suggestions as well. Thanks in advance.
>>
>

Re: Extracting fields that are enveloped in quotes.

Posted by Mark Payne <ma...@hotmail.com>.
You wouldn’t configure that in the csv reader - it will read whatever it is given. You’d set that in the writer. The writer has a property named Quote Mode, which you can set to Minimal.

Sent from my iPhone

On Nov 27, 2020, at 10:09 PM, naga satish <cn...@gmail.com> wrote:


Hi Mark,
Few fields contains special characters (quotes and pipe). I need to retain them. How to mention in CSV reader to use quotes when necessary?

On Fri, 27 Nov 2020, 9:41 pm Mark Payne, <ma...@hotmail.com>> wrote:
Hello,

The given regex is definitely not what you want. It should remove the first quote and last quote in every line but not the quotes in between. Not sure why it would go to failure without having the error message.

You have a few different options. Firstly you could use replace text and use a search value of just double quotes (“) and a replacement value of empty string and a Replacement Strategy of Literal. This would just blindly remove any quote characters.

Perhaps a better solution would be to use ConvertRecord with a CSV Reader and CSV Writer. Use pipe as the delimiter. On the writer you can tell it to only use quotes where necessary. So if you had any cells that have pipes or other special characters those would get quotes. Given your data set it may be ok to assume that won’t be the case but it’s worth mentioning. It may perform better but I can’t say for sure without testing.

Thanks
-Mark

Sent from my iPhone

> On Nov 27, 2020, at 10:58 AM, naga satish <cn...@gmail.com>> wrote:
>
> 
> I have some pipe delimited files. Each field is bounded by quotes like this.
>
> "Created_Date__c"|"CreatedById"|"CreatedDate"|"Guid_c"
> "2020-03-02 00:00:00"|"0053i000002XCpAAG"|"2020-03-02 16:01:34"|"94bf83ccf9daf610VgnVCM100000307882a2RCRD"
> "2020-03-03 00:00:00"|"0053i000002XCpAAG"|"2020-03-03 09:15:56"|"1a4bb238cdedd610VgnVCM100000307882a2RCRD"
> "2020-03-03 00:00:00"|"0053i000002XCpAAG"|"2020-03-03 09:52:33"|"22408baca6fee610VgnVCM100000307882a2RCRD"
>
> I need to cleanse this data and the needs to look like this.
>
> Created_Date__c|CreatedById|CreatedDate|Guid_c
> 2020-03-02 00:00:00|0053i000002XCpAAG|2020-03-02 16:01:34|94bf83ccf9daf610VgnVCM100000307882a2RCRD
> 2020-03-03 00:00:00|0053i000002XCpAAG|2020-03-03 09:15:56|1a4bb238cdedd610VgnVCM100000307882a2RCRD
> 2020-03-03 00:00:00|0053i000002XCpAAG|2020-03-03 09:52:33|22408baca6fee610VgnVCM100000307882a2RCRD
>
> I tried using ReplaceText with these configurations.
> search value - ^"(.*)"$ and Replacement Value - $1. But these configurations is not working and the file is routing to failure. not sure what might be the issue. What can i do to solve this issue?
>
> open to other suggestions as well. Thanks in advance.

Re: Extracting fields that are enveloped in quotes.

Posted by naga satish <cn...@gmail.com>.
Hi Mark,
Few fields contains special characters (quotes and pipe). I need to retain
them. How to mention in CSV reader to use quotes when necessary?

On Fri, 27 Nov 2020, 9:41 pm Mark Payne, <ma...@hotmail.com> wrote:

> Hello,
>
> The given regex is definitely not what you want. It should remove the
> first quote and last quote in every line but not the quotes in between. Not
> sure why it would go to failure without having the error message.
>
> You have a few different options. Firstly you could use replace text and
> use a search value of just double quotes (“) and a replacement value of
> empty string and a Replacement Strategy of Literal. This would just blindly
> remove any quote characters.
>
> Perhaps a better solution would be to use ConvertRecord with a CSV Reader
> and CSV Writer. Use pipe as the delimiter. On the writer you can tell it to
> only use quotes where necessary. So if you had any cells that have pipes or
> other special characters those would get quotes. Given your data set it may
> be ok to assume that won’t be the case but it’s worth mentioning. It may
> perform better but I can’t say for sure without testing.
>
> Thanks
> -Mark
>
> Sent from my iPhone
>
> > On Nov 27, 2020, at 10:58 AM, naga satish <cn...@gmail.com>
> wrote:
> >
> > 
> > I have some pipe delimited files. Each field is bounded by quotes like
> this.
> >
> > "Created_Date__c"|"CreatedById"|"CreatedDate"|"Guid_c"
> > "2020-03-02 00:00:00"|"0053i000002XCpAAG"|"2020-03-02
> 16:01:34"|"94bf83ccf9daf610VgnVCM100000307882a2RCRD"
> > "2020-03-03 00:00:00"|"0053i000002XCpAAG"|"2020-03-03
> 09:15:56"|"1a4bb238cdedd610VgnVCM100000307882a2RCRD"
> > "2020-03-03 00:00:00"|"0053i000002XCpAAG"|"2020-03-03
> 09:52:33"|"22408baca6fee610VgnVCM100000307882a2RCRD"
> >
> > I need to cleanse this data and the needs to look like this.
> >
> > Created_Date__c|CreatedById|CreatedDate|Guid_c
> > 2020-03-02 00:00:00|0053i000002XCpAAG|2020-03-02
> 16:01:34|94bf83ccf9daf610VgnVCM100000307882a2RCRD
> > 2020-03-03 00:00:00|0053i000002XCpAAG|2020-03-03
> 09:15:56|1a4bb238cdedd610VgnVCM100000307882a2RCRD
> > 2020-03-03 00:00:00|0053i000002XCpAAG|2020-03-03
> 09:52:33|22408baca6fee610VgnVCM100000307882a2RCRD
> >
> > I tried using ReplaceText with these configurations.
> > search value - ^"(.*)"$ and Replacement Value - $1. But these
> configurations is not working and the file is routing to failure. not sure
> what might be the issue. What can i do to solve this issue?
> >
> > open to other suggestions as well. Thanks in advance.
>

Re: Extracting fields that are enveloped in quotes.

Posted by Mark Payne <ma...@hotmail.com>.
Hello,

The given regex is definitely not what you want. It should remove the first quote and last quote in every line but not the quotes in between. Not sure why it would go to failure without having the error message. 

You have a few different options. Firstly you could use replace text and use a search value of just double quotes (“) and a replacement value of empty string and a Replacement Strategy of Literal. This would just blindly remove any quote characters.

Perhaps a better solution would be to use ConvertRecord with a CSV Reader and CSV Writer. Use pipe as the delimiter. On the writer you can tell it to only use quotes where necessary. So if you had any cells that have pipes or other special characters those would get quotes. Given your data set it may be ok to assume that won’t be the case but it’s worth mentioning. It may perform better but I can’t say for sure without testing.

Thanks
-Mark

Sent from my iPhone

> On Nov 27, 2020, at 10:58 AM, naga satish <cn...@gmail.com> wrote:
> 
> 
> I have some pipe delimited files. Each field is bounded by quotes like this. 
> 
> "Created_Date__c"|"CreatedById"|"CreatedDate"|"Guid_c"
> "2020-03-02 00:00:00"|"0053i000002XCpAAG"|"2020-03-02 16:01:34"|"94bf83ccf9daf610VgnVCM100000307882a2RCRD"
> "2020-03-03 00:00:00"|"0053i000002XCpAAG"|"2020-03-03 09:15:56"|"1a4bb238cdedd610VgnVCM100000307882a2RCRD"
> "2020-03-03 00:00:00"|"0053i000002XCpAAG"|"2020-03-03 09:52:33"|"22408baca6fee610VgnVCM100000307882a2RCRD"
> 
> I need to cleanse this data and the needs to look like this.
> 
> Created_Date__c|CreatedById|CreatedDate|Guid_c
> 2020-03-02 00:00:00|0053i000002XCpAAG|2020-03-02 16:01:34|94bf83ccf9daf610VgnVCM100000307882a2RCRD
> 2020-03-03 00:00:00|0053i000002XCpAAG|2020-03-03 09:15:56|1a4bb238cdedd610VgnVCM100000307882a2RCRD
> 2020-03-03 00:00:00|0053i000002XCpAAG|2020-03-03 09:52:33|22408baca6fee610VgnVCM100000307882a2RCRD
> 
> I tried using ReplaceText with these configurations.
> search value - ^"(.*)"$ and Replacement Value - $1. But these configurations is not working and the file is routing to failure. not sure what might be the issue. What can i do to solve this issue?
> 
> open to other suggestions as well. Thanks in advance.