Posted to dev@nifi.apache.org by Eric FALK <er...@uni.lu> on 2016/04/05 12:36:50 UTC

Filtering large CSV files

Dear all,

I need to filter large CSV files in a data flow. By filtering I mean: reducing the file to a subset of its columns, and checking a particular column for a value that matches a parameter. I looked into the CSV-to-JSON example, and I have a couple of questions:

- First, I use a SplitText processor to get each line of the file. This makes things slow, as it seems to generate a flow file for each line. Do I have to proceed this way, or is there an alternative? My CSV files are really large and can have millions of lines.

- In a second step I extract the values with the (.+),(.+),….,(.+) technique, before using a processor to check for a match on ${csv.146}, for instance. Now I have a problem: my CSV has 233 fields, so I am getting the message: “RegEx is required to have between 1 and 40 capturing groups but has 233”. Again, is there another way to proceed, or am I missing something?
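
As an aside, a single column can be pulled from a line without a 233-group regex. A minimal standalone Java sketch of the same per-line check using opencsv (assuming opencsv 3.x on the classpath; the sample line, column index, and parameter value are invented for illustration):

    import com.opencsv.CSVParser;

    public class ColumnMatchDemo {
        public static void main(String[] args) throws Exception {
            // Illustrative only: one CSV record, as produced by a line
            // splitter; the real data would have 233 fields.
            String line = "a,b,c,needle,e";
            CSVParser parser = new CSVParser();        // default separator ',' and quote '"'
            String[] fields = parser.parseLine(line);  // no capturing-group limit here
            int column = 3;                            // zero-based index; 146 in the flow above
            String wanted = "needle";                  // the parameter value to match
            boolean match = column < fields.length && wanted.equals(fields[column]);
            System.out.println(match ? "route to 'matched'" : "route to 'unmatched'");
        }
    }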

Best regards,
Eric 

Re: Re: Re: Filtering large CSV files

Posted by Dmitry Goldenberg <dg...@hexastax.com>.
Uwe,

The Velocity-based transformer sounds like a cool feature.  As far as the
splitter, I'm not quite grokking why it treats its input as a single row to
split.  Shouldn't the input be a full CSV which you'd want to split?  I
guess you already have a splitter, perhaps based on SplitText.  What I want
to do is implement a SplitCSV (and GetCSV) that uses OpenCSV to split a
full CSV into individual rows.

- Dmitry
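
A minimal sketch of the row splitting such a SplitCSV could do with opencsv's CSVReader (a standalone illustration under assumed opencsv 3.x APIs, not the actual processor):

    import com.opencsv.CSVReader;
    import com.opencsv.CSVWriter;

    import java.io.StringReader;
    import java.io.StringWriter;

    public class SplitCsvDemo {
        public static void main(String[] args) throws Exception {
            // Invented sample input standing in for a full CSV flow file.
            String csv = "id,name\n1,alpha\n2,beta\n";
            try (CSVReader reader = new CSVReader(new StringReader(csv))) {
                String[] record;
                while ((record = reader.readNext()) != null) {
                    // In a real SplitCSV, each parsed record would become one flow file.
                    StringWriter out = new StringWriter();
                    try (CSVWriter writer = new CSVWriter(out)) {
                        writer.writeNext(record);
                    }
                    System.out.print(out);
                }
            }
        }
    }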

Re: Re: Re: Filtering large CSV files

Posted by Uwe Geercken <uw...@web.de>.
Dmitry,

what I have at the moment is this:

https://github.com/uwegeercken/nifi_processors

Two processors: one that splits a CSV row and assigns the values to flowfile attributes, and one that merges the attributes with a template (Apache Velocity) to produce a different output.

I wanted to start with opencsv but ran into problems and had no time afterwards.

Rgds,

Uwe
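
For readers unfamiliar with Velocity, a rough sketch of the attribute-to-template merge technique (this is not the code from the repository above, just an illustration; the attribute names and template are invented):

    import org.apache.velocity.VelocityContext;
    import org.apache.velocity.app.VelocityEngine;

    import java.io.StringWriter;

    public class TemplateMergeDemo {
        public static void main(String[] args) {
            VelocityEngine engine = new VelocityEngine();
            engine.init();

            // Invented stand-ins for the flowfile attributes set by the
            // CSV-splitting processor.
            VelocityContext context = new VelocityContext();
            context.put("column_0", "4711");
            context.put("column_1", "Geercken");

            // The template reshapes the attributes into a different output format.
            String template = "{ \"id\": \"$column_0\", \"name\": \"$column_1\" }";
            StringWriter out = new StringWriter();
            engine.evaluate(context, out, "csv-to-json", template);
            System.out.println(out);  // { "id": "4711", "name": "Geercken" }
        }
    }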

Re: Re: Filtering large CSV files

Posted by Dmitry Goldenberg <dg...@hexastax.com>.
Hi Uwe,

Yes, that is what I was thinking of using for the CSV processor.  Will you
be committing your version?

- Dmitry

Re: Re: Filtering large CSV files

Posted by Uwe Geercken <uw...@web.de>.
Dmitry,

I was working on a processor for CSV files, and one remark that came up was that we might want to use the opencsv library for parsing the file.

Here is the link: http://opencsv.sourceforge.net/

Greetings,

Uwe

Re: Filtering large CSV files

Posted by Dmitry Goldenberg <dg...@hexastax.com>.
Hi Eric,

Thinking about exactly these use-cases, I filed the following JIRA ticket:
NIFI-1716 <https://issues.apache.org/jira/browse/NIFI-1716>. It asks for a
SplitCSV processor, and also for a GetCSV ingress, which would address the
issue of reading out of a large CSV by treating it as a "data source". I was
thinking of implementing both and committing them.

NIFI-1280 <https://issues.apache.org/jira/browse/NIFI-1280> asks for a way
to filter the CSV columns. I believe this is best achieved while the CSV is
being parsed, in other words in GetCSV/SplitCSV, and not as a separate step.
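
A sketch of what that parse-time column filtering might look like (illustrative only; opencsv is assumed, and the kept-column indices and file names are arbitrary):

    import com.opencsv.CSVReader;
    import com.opencsv.CSVWriter;

    import java.io.FileReader;
    import java.io.FileWriter;

    public class FilterColumnsDemo {
        public static void main(String[] args) throws Exception {
            int[] keep = {0, 5, 146};  // columns to retain, by zero-based index (arbitrary)
            try (CSVReader reader = new CSVReader(new FileReader("in.csv"));
                 CSVWriter writer = new CSVWriter(new FileWriter("out.csv"))) {
                String[] record;
                while ((record = reader.readNext()) != null) {
                    String[] projected = new String[keep.length];
                    for (int i = 0; i < keep.length; i++) {
                        projected[i] = keep[i] < record.length ? record[keep[i]] : "";
                    }
                    writer.writeNext(projected);  // columns are dropped as the file streams through
                }
            }
        }
    }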

I'm not sure that SplitText is the best way to process CSV data to begin
with, because with a CSV there's a chance that a given cell spills over into
multiple lines, as with embedded newlines within a single, quoted cell. I
don't think SplitText handles that, which would be one reason to implement
GetCSV/SplitCSV using proper CSV parsing semantics; the other reason is
efficiency of reading.
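
To make the embedded-newline point concrete, a small comparison (the sample data is invented; opencsv is assumed):

    import com.opencsv.CSVReader;

    import java.io.StringReader;

    public class EmbeddedNewlineDemo {
        public static void main(String[] args) throws Exception {
            // Invented sample: a header plus ONE data record whose quoted
            // cell spans two lines.
            String csv = "id,comment\n1,\"first line\nsecond line\"\n";

            // Naive line splitting sees three pieces, cutting the quoted cell in half.
            System.out.println("lines:   " + csv.split("\n").length);  // 3

            // A CSV parser sees two records and keeps the quoted cell intact.
            int records = 0;
            try (CSVReader reader = new CSVReader(new StringReader(csv))) {
                while (reader.readNext() != null) {
                    records++;
                }
            }
            System.out.println("records: " + records);                 // 2
        }
    }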

As far as the limit on the capturing groups goes, that seems arbitrary. I
think that if GetCSV/SplitCSV give you a way to identify the filtered-out
columns by their number (index), that should go a long way; perhaps a regex
is also a good option. I know it may seem that filtering should be a separate
step in a given dataflow, but from the point of view of efficiency, I believe
it belongs right in the GetCSV/SplitCSV processors, as the CSV records are
being read and processed.

- Dmitry