Posted to dev@nifi.apache.org by Russell Whitaker <ru...@gmail.com> on 2015/10/05 22:14:28 UTC

A flow to PutSQL the lines of a CSV file?

Use case I'm attempting:

1.) ingest a CSV file with header lines;
2.) remove the header lines (i.e. remove N lines at the head);
3.) SQL INSERT each remaining line as a row into an existing MySQL table.

My thinking so far:

#1 is given (CSV fetched already);
#2 is simple, and should be handled by ExecuteStreamCommand (sketched below);

#3 is where I'm scratching my head: I keep re-reading the Description field for
the PutSQL processor in http://nifi.apache.org/docs.html but can't work out
from it how to turn a flowfile of comma-separated lines of text into a
series of INSERT statements:

"Executes a SQL UPDATE or INSERT command. The content of an incoming
FlowFile is expected to be the SQL command to execute. The SQL command
may use the ? to escape parameters. In this case, the parameters to
use must exist as FlowFile attributes with the naming convention
sql.args.N.type and sql.args.N.value, where N is a positive integer.
The sql.args.N.type is expected to be a number indicating the JDBC
Type."

Of related interest: there seems to be only one CSV-relevant processor type in
v0.3.0, ConvertCSVToAvro; I fear I'll need to do something like this:

ConvertCSVToAvro->ConvertAvroToJSON->ConvertJSONToSQL->PutSQL

Guidance, suggestions? Thanks!

Russell

-- 
Russell Whitaker
http://twitter.com/OrthoNormalRuss
http://www.linkedin.com/pub/russell-whitaker/0/b86/329

Re: A flow to PutSQL the lines of a CSV file?

Posted by Russell Whitaker <ru...@gmail.com>.
On Tue, Oct 6, 2015 at 7:17 AM, Bryan Bende <bb...@gmail.com> wrote:
> Hi Russell,
>
> I understand what you are getting at... I don't think the current
> processors we have are designed to handle this bulk load scenario.
>

That's what I'm gathering so far myself, having pored over the history of
related processors in Jira. I'm surprised it hasn't come up as an issue
before, but there's a first time for everything.

> The series of processors you outlined in your original email would likely
> work, but I *think* it would suffer the same problem of producing a lot of
> FlowFiles because ConvertJSONToSQL will produce a FlowFile with one SQL
> statement for each entry of the JSON array. You would have to try this out
> to see if you are getting the performance you want with your data set.

Looks like that's what I'll need to do in the short term.

> PutSQL does have a configurable batch size that lets multiple inserts
> happen through a single database transaction.
>

That's good to know.

> I'm not that familiar with the "LOAD DATA INFILE" command, but I'm
> wondering if another option is to somehow use ExecuteStreamCommand to
> execute that statement based on the filename from an incoming FlowFile.
> Just thinking out loud about what else could be done.
>

I'm not sure how ExecuteStreamCommand would help here, though: the statement
above is SQL DML, and MySQL-specific at that, not an external command.
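
Unless the idea is to shell out to the mysql command-line client itself?
Maybe something like the following, entirely untested, with credentials,
database, and table names invented, ExecuteStreamCommand property names
from memory, the usual absolute.path/filename attributes from GetFile
assumed, and LOCAL infile assumed enabled on both client and server:

Command Path:      /usr/bin/mysql
Command Arguments: --local-infile=1;-u;loader;-pSECRET;-D;mydb;-e;LOAD
DATA LOCAL INFILE '${absolute.path}${filename}' INTO TABLE
my_destination_table FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES
TERMINATED BY '\n' IGNORE 3 LINES

(That leans on the original file still sitting on disk, since it passes
the path rather than the FlowFile content.)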

So, I'll have a go at working with the approach you've suggested, and if in the
meantime I discover - or write - a workaround, I'll post it in a follow-up. My
overarching concern, I should mention here, has been to provide our business
analysts with a template for doing Sqoop-like egress of data from processing
into endpoint storage, something they can reason about given their experiences
with Hadoop.

Thanks Bryan,
Russell

-- 
Russell Whitaker
http://twitter.com/OrthoNormalRuss
http://www.linkedin.com/pub/russell-whitaker/0/b86/329

Re: A flow to PutSQL the lines of a CSV file?

Posted by Bryan Bende <bb...@gmail.com>.
Hi Russell,

I understand what you are getting at... I don't think the current
processors we have are designed to handle this bulk load scenario.

The series of processors you outlined in your original email would likely
work, but I *think* it would suffer the same problem of producing a lot of
FlowFiles because ConvertJSONToSQL will produce a FlowFile with one SQL
statement for each entry of the JSON array. You would have to try this out
to see if you are getting the performance you want with your data set.
PutSQL does have a configurable batch size that lets multiple inserts
happen through a single database transaction.

I'm not that familiar with the "LOAD DATA INFILE" command, but I'm
wondering if another option is to somehow use ExecuteStreamCommand to
execute that statement based on the filename from an incoming FlowFile.
Just thinking out loud about what else could be done.

-Bryan



Re: A flow to PutSQL the lines of a CSV file?

Posted by Russell Whitaker <ru...@gmail.com>.
Really, what I'd like to do is this type of MySQL bread 'n' butter task:

LOAD DATA INFILE <my_csv_file_ingested_as_a_flowfile>
INTO TABLE <my_destination_table>
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 3 ROWS;
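
(One wrinkle: if NiFi isn't running on the database host, this needs the
LOCAL variant, which reads the file from the client side, and the server
has to have local_infile enabled:

LOAD DATA LOCAL INFILE <my_csv_file_ingested_as_a_flowfile>
...rest as above...)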

Russell


-- 
Russell Whitaker
http://twitter.com/OrthoNormalRuss
http://www.linkedin.com/pub/russell-whitaker/0/b86/329

Re: A flow to PutSQL the lines of a CSV file?

Posted by Russell Whitaker <ru...@gmail.com>.
Bryan,

Some of the CSV files are as small as 6 columns and a thousand or so lines;
some have many more columns and thousands of lines. I'm hoping to avoid
spawning a flowfile per line; ideally there'd be a NiFi equivalent of the
SQL DML statement LOAD DATA INFILE. (Relatedly, being able to toggle off
foreign key & uniqueness checks and transaction isolation guarantees
during a bulk load would be very nice...)
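
On the MySQL side those toggles are at least easy to express; a sketch,
with the load statement elided (foreign_key_checks and unique_checks are
standard MySQL session variables):

SET SESSION foreign_key_checks = 0;
SET SESSION unique_checks = 0;
-- LOAD DATA INFILE ... goes here
SET SESSION unique_checks = 1;
SET SESSION foreign_key_checks = 1;

It'd be nice if a bulk-load processor wrapped that up.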

Russell

-- 
Russell Whitaker
http://twitter.com/OrthoNormalRuss
http://www.linkedin.com/pub/russell-whitaker/0/b86/329

Re: A flow to PutSQL the lines of a CSV file?

Posted by Bryan Bende <bb...@gmail.com>.
Russell,

How big are these CSVs in terms of rows and columns?

If they aren't too big, another option could be to use SplitText +
ReplaceText: split the CSV into a FlowFile per line, then convert each
line into SQL with ReplaceText, as sketched below. The downside is that
this would create a lot of FlowFiles for very large CSVs.
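
For a hypothetical three-column table, the ReplaceText step might look
something like this (property names from memory, and the regex is naive,
so it breaks on commas inside quoted fields):

    Regular Expression:  ^(.*),(.*),(.*)$
    Replacement Value:   INSERT INTO my_table (col_a, col_b, col_c) VALUES ('$1', '$2', '$3')

Note this inlines the values rather than using PutSQL's sql.args
parameters, so escaping any quotes in the data is on you.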

-Bryan
