Posted to dev@nifi.apache.org by François Prunier <fr...@hurence.com> on 2016/10/19 09:10:07 UTC

CsvToAttributes processor

Hello Nifi folks,

I've built a processor to parse CSV files with headers and turn each 
line into a flowfile. Each resulting flowfile has as many attributes as 
there are columns. Each attribute is named after a column and holds the 
corresponding value for that line.

For example, this CSV file:

col1,col2,col3
a,b,c
d,e,f

would generate two flowfiles with the following attributes:

col1 = a
col2 = b
col3 = c

and

col1 = d
col2 = e
col3 = f

As of now, you can configure the charset as well as the delimiter, quote 
and escape characters. It's based on the commons-csv parser.

It's very handy if you want to, for example, index a CSV file into 
elasticsearch.
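
As a rough sketch of how that row-to-attributes mapping can be built on
commons-csv and the NiFi processor API (an illustration only, not the actual
processor code, which isn't posted in this thread; the class name, the
hard-coded format options, and the single relationship are all assumptions):

    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.nio.charset.StandardCharsets;
    import java.util.Collections;
    import java.util.Set;

    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVParser;
    import org.apache.commons.csv.CSVRecord;
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;

    public class CsvToAttributes extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success")
                .description("One flowfile per CSV data row")
                .build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(ProcessContext context, ProcessSession session) {
            FlowFile original = session.get();
            if (original == null) {
                return;
            }
            // Charset, delimiter, quote and escape would come from processor
            // properties; they are hard-coded here to keep the sketch short.
            CSVFormat format = CSVFormat.DEFAULT
                    .withDelimiter(',')
                    .withQuote('"')
                    .withEscape('\\')
                    .withFirstRecordAsHeader();

            session.read(original, in -> {
                try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
                     CSVParser parser = new CSVParser(reader, format)) {
                    for (CSVRecord record : parser) {
                        // One child flowfile per data row; record.toMap() maps
                        // each header name to that row's value.
                        FlowFile child = session.create(original);
                        child = session.putAllAttributes(child, record.toMap());
                        session.transfer(child, REL_SUCCESS);
                    }
                }
            });
            session.remove(original);
        }
    }

Note that every cell value ends up on the heap as a flowfile attribute,
which is exactly the performance concern raised later in this thread.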

Would you guys be interested in a pull request to add this processor to 
the main code base? It needs a bit more documentation and cleanup, which 
I would add, but it's already used successfully in production.

Best regards,
-- 
*François Prunier
* *Hurence* - /Vos experts Big Data/
http://www.hurence.com
*mobile:* +33 6 38 68 60 50


Re: CsvToAttributes processor

Posted by Andy LoPresto <al...@apache.org>.
And according to the IETF RFC 2822 (Email), the reply-to field can hold multiple mailboxes, so we will investigate if we can get the dev@ and users@ lists to reply to the list *and* the sender by default. This might really clog people’s inboxes though, so it needs to be evaluated.

Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Oct 27, 2016, at 8:34 PM, Andy LoPresto <al...@apache.org> wrote:
> 
> Hi François,
> 
> I hope this is what you were looking for. If you do not get the entire thread via this email, you can see the thread in a web view here [1].
> 
> [1] https://lists.apache.org/thread.html/ffa390534d35056d3ad8ab5116f25665b73687855214afe95fcf6cab@%3Cdev.nifi.apache.org%3E <https://lists.apache.org/thread.html/ffa390534d35056d3ad8ab5116f25665b73687855214afe95fcf6cab@%3Cdev.nifi.apache.org%3E>
> 
> Andy LoPresto
> alopresto@apache.org <ma...@apache.org>
> alopresto.apache@gmail.com <ma...@gmail.com>
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> 
>> On Oct 27, 2016, at 6:31 AM, François Prunier <francois.prunier@hurence.com <ma...@hurence.com>> wrote:
>> 
>> Hello again nifi folks,
>> 
>> I did not get a direct reply to my email below. However, I've since
>> noticed in the mailing list archive that some of you have kindly
>> replied, although the emails did not make it to my inbox!
>> 
>> I wasn't subscribed to the mailing list at the time (I am now), which I
>> guess is why I did not get the responses; it still seems a bit weird
>> though... (*).
>> 
>> Anyway, could someone reply to the thread and include my email address
>> so I can answer each of your comments while keeping the threading
>> 'clean'?
>> 
>> Thanks!
>> 
>> François
>> 
>> *: Maybe something the admins should look into, as some people might
>> fire off an email to the list, see no answers, and assume no one
>> replied to them!
>> 
>> On 19/10/2016 11:10, François Prunier wrote:
>>> 
>>> Hello Nifi folks,
>>> 
>>> I've built a processor to parse CSV files with headers and turn each
>>> line into a flowfile. Each resulting flowfile has as many attributes as
>>> there are columns. Each attribute is named after a column and holds
>>> the corresponding value for that line.
>>>
>>> For example, this CSV file:
>>>
>>> col1,col2,col3
>>> a,b,c
>>> d,e,f
>>>
>>> would generate two flowfiles with the following attributes:
>>>
>>> col1 = a
>>> col2 = b
>>> col3 = c
>>>
>>> and
>>>
>>> col1 = d
>>> col2 = e
>>> col3 = f
>>>
>>> As of now, you can configure the charset as well as the delimiter, quote
>>> and escape characters. It's based on the commons-csv parser.
>>>
>>> It's very handy if you want to, for example, index a CSV file into
>>> elasticsearch.
>>>
>>> Would you guys be interested in a pull request to add this processor
>>> to the main code base? It needs a bit more documentation and cleanup,
>>> which I would add, but it's already used successfully in production.
>>> 
>>> Best regards,
>>> --
>>> *François Prunier
>>> * *Hurence* - /Vos experts Big Data/
>>> http://www.hurence.com <http://www.hurence.com/>
>>> *mobile:* +33 6 38 68 60 50
>>> 
>> 
>> --
>> *François Prunier
>> * *Hurence* - /Vos experts Big Data/
>> http://www.hurence.com <http://www.hurence.com/>
>> *mobile:* +33
> 


Re: CsvToAttributes processor

Posted by Andy LoPresto <al...@apache.org>.
Hi François,

I hope this is what you were looking for. If you do not get the entire thread via this email, you can see the thread in a web view here [1].

[1] https://lists.apache.org/thread.html/ffa390534d35056d3ad8ab5116f25665b73687855214afe95fcf6cab@%3Cdev.nifi.apache.org%3E <https://lists.apache.org/thread.html/ffa390534d35056d3ad8ab5116f25665b73687855214afe95fcf6cab@%3Cdev.nifi.apache.org%3E>

Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Oct 27, 2016, at 6:31 AM, François Prunier <fr...@hurence.com> wrote:
> 
> Hello again nifi folks,
> 
> I did not get a direct reply to my email below. However, I've since
> noticed in the mailing list archive that some of you have kindly
> replied, although the emails did not make it to my inbox!
> 
> I wasn't subscribed to the mailing list at the time (I am now), which I
> guess is why I did not get the responses; it still seems a bit weird
> though... (*).
> 
> Anyway, could someone reply to the thread and include my email address
> so I can answer each of your comments while keeping the threading
> 'clean'?
> 
> Thanks!
> 
> François
> 
> *: Maybe something the admins should look into, as some people might
> fire off an email to the list, see no answers, and assume no one
> replied to them!
> 
> On 19/10/2016 11:10, François Prunier wrote:
>> 
>> Hello Nifi folks,
>> 
>> I've built a processor to parse CSV files with headers and turn each
>> line into a flowfile. Each resulting flowfile has as many attributes as
>> there are columns. Each attribute is named after a column and holds
>> the corresponding value for that line.
>>
>> For example, this CSV file:
>>
>> col1,col2,col3
>> a,b,c
>> d,e,f
>>
>> would generate two flowfiles with the following attributes:
>>
>> col1 = a
>> col2 = b
>> col3 = c
>>
>> and
>>
>> col1 = d
>> col2 = e
>> col3 = f
>>
>> As of now, you can configure the charset as well as the delimiter, quote
>> and escape characters. It's based on the commons-csv parser.
>>
>> It's very handy if you want to, for example, index a CSV file into
>> elasticsearch.
>>
>> Would you guys be interested in a pull request to add this processor
>> to the main code base? It needs a bit more documentation and cleanup,
>> which I would add, but it's already used successfully in production.
>> 
>> Best regards,
>> --
>> *François Prunier
>> * *Hurence* - /Vos experts Big Data/
>> http://www.hurence.com
>> *mobile:* +33 6 38 68 60 50
>> 
> 
> --
> *François Prunier
> * *Hurence* - /Vos experts Big Data/
> http://www.hurence.com
> *mobile:* +33


Re: CsvToAttributes processor

Posted by François Prunier <fr...@hurence.com>.
Hello again nifi folks,

I did not get a direct reply to my email below. However, I've since 
noticed in the mailing list archive that some of you have kindly 
replied, although the emails did not make it to my inbox!

I wasn't subscribed to the mailing list at the time (I am now), which I 
guess is why I did not get the responses; it still seems a bit weird 
though... (*).

Anyway, could someone reply to the thread and include my email address 
so I can answer each of your comments while keeping the threading 
'clean'?

Thanks!

François

*: Maybe something the admins should look into, as some people might 
fire off an email to the list, see no answers, and assume no one 
replied to them!

On 19/10/2016 11:10, François Prunier wrote:
>
> Hello Nifi folks,
>
> I've built a processor to parse CSV files with headers and turn each 
> line into a flowfile. Each resulting flowfile has as many attributes as 
> there are columns. Each attribute is named after a column and holds 
> the corresponding value for that line.
>
> For example, this CSV file:
>
> col1,col2,col3
> a,b,c
> d,e,f
>
> would generate two flowfiles with the following attributes:
>
> col1 = a
> col2 = b
> col3 = c
>
> and
>
> col1 = d
> col2 = e
> col3 = f
>
> As of now, you can configure the charset as well as the delimiter, quote 
> and escape characters. It's based on the commons-csv parser.
>
> It's very handy if you want to, for example, index a CSV file into 
> elasticsearch.
>
> Would you guys be interested in a pull request to add this processor 
> to the main code base? It needs a bit more documentation and cleanup, 
> which I would add, but it's already used successfully in production.
>
> Best regards,
> -- 
> *François Prunier
> * *Hurence* - /Vos experts Big Data/
> http://www.hurence.com
> *mobile:* +33 6 38 68 60 50
>

-- 
*François Prunier
* *Hurence* - /Vos experts Big Data/
http://www.hurence.com
*mobile:* +33 6 38 68 60 50


Re: CsvToAttributes processor

Posted by Matt Burgess <ma...@apache.org>.
As an alternative to n^2 processors, there was some discussion a little
while back about having Controller Service instances do data format
conversions [1]. However, that's a complex issue and might not get
integrated in the near term. I agree with Andy that CSV->JSON is a
useful task, and that when we get the extension registry (and/or the
controller services), we can update the processors accordingly.

Regards,
Matt

[1] http://apache-nifi-developer-list.39713.n7.nabble.com/Looking-for-feedback-on-my-WIP-Design-td13097.html

On Wed, Oct 19, 2016 at 1:58 PM, Andy LoPresto <al...@apache.org> wrote:
> I like Matt’s idea. Currently there are ConvertCSVToAvro and
> ConvertAvroToJSON processors, but no processor that directly converts CSV to
> JSON. Keeping the content in the content claim, as Joe and Matt pointed out,
> will greatly improve performance over loading it into attributes. If
> attribute-based routing is desired, an UpdateAttribute processor can follow
> on to update a single attribute from the content without polluting it with
> unnecessary data.
>
> While I am not a proponent of creating n^2 processors just to do format
> conversions, I think CSV to JSON is a common-enough and useful-enough task
> that this would be beneficial. And once we get the extension registry,
> people can go nuts with n^2 conversion processors.
>
>
> Andy LoPresto
> alopresto@apache.org
> alopresto.apache@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Oct 19, 2016, at 1:14 PM, Matt Foley <ma...@apache.org> wrote:
>
> For the specific use case of processing CSV files (and possibly “flat” db
> tables), would many of the same goals be met if the simple list of “bare”
> values in each record were turned into easily parsable key/value pairs?
> Perhaps either JSON or YAML format?  But still left in the content rather
> than moved into the attribute list, so as to avoid the problems Joe stated.
> Granted, each downstream processor will have to re-parse the content, but
> it’s fast and easy - for instance, in Python one can read such content into
> a dictionary with just a couple of lines of code.  Indexers consume it well,
> too, or can be taught to do so.
>
> Thanks,
> --Matt
>
>
> On 10/19/16, 6:02 AM, "Joe Witt" <jo...@gmail.com> wrote:
>
>    Francois
>
>    Thanks for starting the discussion; this is indeed the type of
>    thing people would find helpful. One thing I'd want to flag with
>    this approach is the impact it will have on performance at higher
>    rates. We're starting to see people wanting to do this more and
>    more, where they'll take the content of a flowfile and turn it into
>    attributes. This can put a lot of pressure on the heap and garbage
>    collection, and is best avoided if you want to achieve sustained
>    high performance. Keeping the content in its native form, or
>    converting it to another form, will yield much higher sustained
>    throughput, as we can stream those things from their underlying
>    storage in the content repository to their new form in the
>    repository, or to another system, while only ever having as much in
>    memory as your technique for operating on them requires. So, for
>    example, we can compress a 1GB file while holding only, say, 1KB in
>    memory. But by taking the content and turning it into attributes on
>    the flowfile, the flowfile object (not its content) will be in
>    memory most of the time, and this is where problems can occur. It
>    would be better to have pushing to Elasticsearch be driven off the
>    content, though this admittedly introduces a different challenge:
>    'well, what format of that content does it expect?' We have some
>    examples of this pattern now in our SQL processors, for instance,
>    which are built around a specific data format, but we need to do
>    better and offer generic or pluggable ways to read record-oriented
>    data from a variety of formats, and not have the processors be
>    specific to the underlying format where possible and appropriate.
>    The key is to do this without forcing some goofy normalization
>    format that would kill performance and make things more brittle.
>
>    So, anyway, I said all that to say that it is great you've offered
>    to contribute it, and I think you certainly should. We should just
>    take care to document its intended use and the performance
>    limitations to consider, and enable it to limit how many
>    columns/fields get turned into attributes, maybe by setting a max
>    or by having a whitelist/blacklist type model. Even if it won't
>    achieve the highest sustained performance, I suspect this will be
>    quite helpful for people as is.
>
>    Thanks!
>    Joe
>
>    On Wed, Oct 19, 2016 at 6:50 AM, Uwe Geercken <uw...@web.de> wrote:
>
> Francois,
>
> very nice. Thanks.
>
> I was working on a simple version a while ago, but it had a different
> scope: I wanted to have a Nifi processor that merges CSV data with a
> template from a template engine (e.g. Apache Velocity). I will review my
> code and have a look at your processor.
>
> Where can we get it? GitHub?
>
> Rgds,
>
> Uwe
>
> Sent: Wednesday, 19 October 2016 at 11:10
> From: "François Prunier" <fr...@hurence.com>
> To: dev@nifi.apache.org
> Subject: CsvToAttributes processor
>
> Hello Nifi folks,
>
> I've built a processor to parse CSV files with headers and turn each
> line into a flowfile. Each resulting flowfile has as many attributes as
> there are columns. Each attribute is named after a column and holds
> the corresponding value for that line.
>
> For example, this CSV file:
>
> col1,col2,col3
> a,b,c
> d,e,f
>
> would generate two flowfiles with the following attributes:
>
> col1 = a
> col2 = b
> col3 = c
>
> and
>
> col1 = d
> col2 = e
> col3 = f
>
> As of now, you can configure the charset as well as the delimiter, quote
> and escape characters. It's based on the commons-csv parser.
>
> It's very handy if you want to, for example, index a CSV file into
> elasticsearch.
>
> Would you guys be interested in a pull request to add this processor
> to the main code base? It needs a bit more documentation and cleanup,
> which I would add, but it's already used successfully in production.
>
> Best regards,
> --
> *François Prunier
> * *Hurence* - /Vos experts Big Data/
> http://www.hurence.com
> *mobile:* +33 6 38 68 60 50
>

Re: CsvToAttributes processor

Posted by Andy LoPresto <al...@apache.org>.
I like Matt’s idea. Currently there are ConvertCSVToAvro and ConvertAvroToJSON processors, but no processor that directly converts CSV to JSON. Keeping the content in the content claim, as Joe and Matt pointed out, will greatly improve performance over loading it into attributes. If attribute-based routing is desired, an UpdateAttribute processor can follow on to update a single attribute from the content without polluting it with unnecessary data.

While I am not a proponent of creating n^2 processors just to do format conversions, I think CSV to JSON is a common-enough and useful-enough task that this would be beneficial. And once we get the extension registry, people can go nuts with n^2 conversion processors.
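
As a minimal sketch of what such a direct CSV-to-JSON conversion could look
like while keeping the data in the content claim (this assumes commons-csv
plus Jackson and is not code from an existing NiFi processor):

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;

    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVParser;
    import org.apache.commons.csv.CSVRecord;
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.io.StreamCallback;

    public class CsvToJsonSketch {

        // Rewrites CSV content as line-delimited JSON, one object per record,
        // streaming record by record so the whole file is never in memory.
        public static FlowFile convert(ProcessSession session, FlowFile flowFile) {
            final ObjectMapper mapper = new ObjectMapper();
            return session.write(flowFile, new StreamCallback() {
                @Override
                public void process(InputStream in, OutputStream out) throws IOException {
                    try (CSVParser parser = new CSVParser(
                                 new InputStreamReader(in, StandardCharsets.UTF_8),
                                 CSVFormat.DEFAULT.withFirstRecordAsHeader());
                         BufferedWriter writer = new BufferedWriter(
                                 new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
                        for (CSVRecord record : parser) {
                            writer.write(mapper.writeValueAsString(record.toMap()));
                            writer.newLine();
                        }
                    }
                }
            });
        }
    }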


Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Oct 19, 2016, at 1:14 PM, Matt Foley <ma...@apache.org> wrote:
> 
> For the specific use case of processing CSV files (and possibly “flat” db tables), would many of the same goals be met if the simple list of “bare” values in each record were turned into easily parsable key/value pairs?  Perhaps either JSON or YAML format?  But still left in the content rather than moved into the attribute list, so as to avoid the problems Joe stated.  Granted, each downstream processor will have to re-parse the content, but it’s fast and easy - for instance, in Python one can read such content into a dictionary with just a couple of lines of code.  Indexers consume it well, too, or can be taught to do so.
> 
> Thanks,
> --Matt
> 
> 
> On 10/19/16, 6:02 AM, "Joe Witt" <jo...@gmail.com> wrote:
> 
>    Francois
> 
>    Thanks for starting the discussion; this is indeed the type of
>    thing people would find helpful. One thing I'd want to flag with
>    this approach is the impact it will have on performance at higher
>    rates. We're starting to see people wanting to do this more and
>    more, where they'll take the content of a flowfile and turn it into
>    attributes. This can put a lot of pressure on the heap and garbage
>    collection, and is best avoided if you want to achieve sustained
>    high performance. Keeping the content in its native form, or
>    converting it to another form, will yield much higher sustained
>    throughput, as we can stream those things from their underlying
>    storage in the content repository to their new form in the
>    repository, or to another system, while only ever having as much in
>    memory as your technique for operating on them requires. So, for
>    example, we can compress a 1GB file while holding only, say, 1KB in
>    memory. But by taking the content and turning it into attributes on
>    the flowfile, the flowfile object (not its content) will be in
>    memory most of the time, and this is where problems can occur. It
>    would be better to have pushing to Elasticsearch be driven off the
>    content, though this admittedly introduces a different challenge:
>    'well, what format of that content does it expect?' We have some
>    examples of this pattern now in our SQL processors, for instance,
>    which are built around a specific data format, but we need to do
>    better and offer generic or pluggable ways to read record-oriented
>    data from a variety of formats, and not have the processors be
>    specific to the underlying format where possible and appropriate.
>    The key is to do this without forcing some goofy normalization
>    format that would kill performance and make things more brittle.
>
>    So, anyway, I said all that to say that it is great you've offered
>    to contribute it, and I think you certainly should. We should just
>    take care to document its intended use and the performance
>    limitations to consider, and enable it to limit how many
>    columns/fields get turned into attributes, maybe by setting a max
>    or by having a whitelist/blacklist type model. Even if it won't
>    achieve the highest sustained performance, I suspect this will be
>    quite helpful for people as is.
> 
>    Thanks!
>    Joe
> 
>    On Wed, Oct 19, 2016 at 6:50 AM, Uwe Geercken <uw...@web.de> wrote:
>> Francois,
>> 
>> very nice. Thanks.
>> 
>> I was working on a simple version a while ago, but it had a different scope: I wanted to have a Nifi processor that merges CSV data with a template from a template engine (e.g. Apache Velocity). I will review my code and have a look at your processor.
>> 
>> Where can we get it? GitHub?
>> 
>> Rgds,
>> 
>> Uwe
>> 
>>> Sent: Wednesday, 19 October 2016 at 11:10
>>> From: "François Prunier" <fr...@hurence.com>
>>> To: dev@nifi.apache.org
>>> Subject: CsvToAttributes processor
>>> 
>>> Hello Nifi folks,
>>> 
>>> I've built a processor to parse CSV files with headers and turn each
>>> line into a flowfile. Each resulting flowfile has as many attributes as
>>> there are columns. Each attribute is named after a column and holds
>>> the corresponding value for that line.
>>>
>>> For example, this CSV file:
>>>
>>> col1,col2,col3
>>> a,b,c
>>> d,e,f
>>>
>>> would generate two flowfiles with the following attributes:
>>>
>>> col1 = a
>>> col2 = b
>>> col3 = c
>>>
>>> and
>>>
>>> col1 = d
>>> col2 = e
>>> col3 = f
>>>
>>> As of now, you can configure the charset as well as the delimiter, quote
>>> and escape characters. It's based on the commons-csv parser.
>>>
>>> It's very handy if you want to, for example, index a CSV file into
>>> elasticsearch.
>>>
>>> Would you guys be interested in a pull request to add this processor
>>> to the main code base? It needs a bit more documentation and cleanup,
>>> which I would add, but it's already used successfully in production.
>>> 
>>> Best regards,
>>> --
>>> *François Prunier
>>> * *Hurence* - /Vos experts Big Data/
>>> http://www.hurence.com
>>> *mobile:* +33 6 38 68 60 50
>>> 
>>> 


Re: CsvToAttributes processor

Posted by Matt Foley <ma...@apache.org>.
For the specific use case of processing CSV files (and possibly “flat” db tables), would many of the same goals be met if the simple list of “bare” values in each record were turned into easily parsable key/value pairs?  Perhaps either JSON or YAML format?  But still left in the content rather than moved into the attribute list, so as to avoid the problems Joe stated.  Granted, each downstream processor will have to re-parse the content, but it’s fast and easy - for instance, in Python one can read such content into a dictionary with just a couple of lines of code.  Indexers consume it well, too, or can be taught to do so.
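
The same "couple of lines" idea holds in Java too; here is a small sketch
(Jackson is an assumed dependency, not something this thread prescribes) of a
downstream consumer re-parsing one record that was rewritten as a JSON object:

    import java.util.Map;

    import com.fasterxml.jackson.core.type.TypeReference;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class ReparseRecordExample {
        public static void main(String[] args) throws Exception {
            // One CSV record, rewritten as key/value pairs kept in the content.
            String line = "{\"col1\":\"a\",\"col2\":\"b\",\"col3\":\"c\"}";

            // Re-parsing downstream is cheap: one call turns it into a map.
            ObjectMapper mapper = new ObjectMapper();
            Map<String, String> record =
                    mapper.readValue(line, new TypeReference<Map<String, String>>() {});

            System.out.println(record.get("col2")); // prints "b"
        }
    }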

Thanks,
--Matt


On 10/19/16, 6:02 AM, "Joe Witt" <jo...@gmail.com> wrote:

    Francois
    
    Thanks for starting the discussion; this is indeed the type of
    thing people would find helpful. One thing I'd want to flag with
    this approach is the impact it will have on performance at higher
    rates. We're starting to see people wanting to do this more and
    more, where they'll take the content of a flowfile and turn it into
    attributes. This can put a lot of pressure on the heap and garbage
    collection, and is best avoided if you want to achieve sustained
    high performance. Keeping the content in its native form, or
    converting it to another form, will yield much higher sustained
    throughput, as we can stream those things from their underlying
    storage in the content repository to their new form in the
    repository, or to another system, while only ever having as much in
    memory as your technique for operating on them requires. So, for
    example, we can compress a 1GB file while holding only, say, 1KB in
    memory. But by taking the content and turning it into attributes on
    the flowfile, the flowfile object (not its content) will be in
    memory most of the time, and this is where problems can occur. It
    would be better to have pushing to Elasticsearch be driven off the
    content, though this admittedly introduces a different challenge:
    'well, what format of that content does it expect?' We have some
    examples of this pattern now in our SQL processors, for instance,
    which are built around a specific data format, but we need to do
    better and offer generic or pluggable ways to read record-oriented
    data from a variety of formats, and not have the processors be
    specific to the underlying format where possible and appropriate.
    The key is to do this without forcing some goofy normalization
    format that would kill performance and make things more brittle.

    So, anyway, I said all that to say that it is great you've offered
    to contribute it, and I think you certainly should. We should just
    take care to document its intended use and the performance
    limitations to consider, and enable it to limit how many
    columns/fields get turned into attributes, maybe by setting a max
    or by having a whitelist/blacklist type model. Even if it won't
    achieve the highest sustained performance, I suspect this will be
    quite helpful for people as is.
    
    Thanks!
    Joe
    
    On Wed, Oct 19, 2016 at 6:50 AM, Uwe Geercken <uw...@web.de> wrote:
    > Francois,
    >
    > very nice. Thanks.
    >
    > I was working on a simple version a while ago, but it had a different scope: I wanted to have a Nifi processor that merges CSV data with a template from a template engine (e.g. Apache Velocity). I will review my code and have a look at your processor.
    >
    > Where can we get it? GitHub?
    >
    > Rgds,
    >
    > Uwe
    >
    >> Sent: Wednesday, 19 October 2016 at 11:10
    >> From: "François Prunier" <fr...@hurence.com>
    >> To: dev@nifi.apache.org
    >> Subject: CsvToAttributes processor
    >>
    >> Hello Nifi folks,
    >>
    >> I've built a processor to parse CSV files with headers and turn each
    >> line into a flowfile. Each resulting flowfile has as many attributes as
    >> there are columns. Each attribute is named after a column and holds
    >> the corresponding value for that line.
    >>
    >> For example, this CSV file:
    >>
    >> col1,col2,col3
    >> a,b,c
    >> d,e,f
    >>
    >> would generate two flowfiles with the following attributes:
    >>
    >> col1 = a
    >> col2 = b
    >> col3 = c
    >>
    >> and
    >>
    >> col1 = d
    >> col2 = e
    >> col3 = f
    >>
    >> As of now, you can configure the charset as well as the delimiter, quote
    >> and escape characters. It's based on the commons-csv parser.
    >>
    >> It's very handy if you want to, for example, index a CSV file into
    >> elasticsearch.
    >>
    >> Would you guys be interested in a pull request to add this processor
    >> to the main code base? It needs a bit more documentation and cleanup,
    >> which I would add, but it's already used successfully in production.
    >>
    >> Best regards,
    >> --
    >> *François Prunier
    >> * *Hurence* - /Vos experts Big Data/
    >> http://www.hurence.com
    >> *mobile:* +33 6 38 68 60 50
    >>
    >>
    
    



Aw: Re: CsvToAttributes processor

Posted by Uwe Geercken <uw...@web.de>.
Joe,

thanks for the clarifying words. It was exactly what I asked about a while ago.

Rgds,

Uwe

> Sent: Wednesday, 19 October 2016 at 15:02
> From: "Joe Witt" <jo...@gmail.com>
> To: dev@nifi.apache.org
> Subject: Re: CsvToAttributes processor
>
> Francois
> 
> Thanks for starting the discussion; this is indeed the type of
> thing people would find helpful. One thing I'd want to flag with
> this approach is the impact it will have on performance at higher
> rates. We're starting to see people wanting to do this more and
> more, where they'll take the content of a flowfile and turn it into
> attributes. This can put a lot of pressure on the heap and garbage
> collection, and is best avoided if you want to achieve sustained
> high performance. Keeping the content in its native form, or
> converting it to another form, will yield much higher sustained
> throughput, as we can stream those things from their underlying
> storage in the content repository to their new form in the
> repository, or to another system, while only ever having as much in
> memory as your technique for operating on them requires. So, for
> example, we can compress a 1GB file while holding only, say, 1KB in
> memory. But by taking the content and turning it into attributes on
> the flowfile, the flowfile object (not its content) will be in
> memory most of the time, and this is where problems can occur. It
> would be better to have pushing to Elasticsearch be driven off the
> content, though this admittedly introduces a different challenge:
> 'well, what format of that content does it expect?' We have some
> examples of this pattern now in our SQL processors, for instance,
> which are built around a specific data format, but we need to do
> better and offer generic or pluggable ways to read record-oriented
> data from a variety of formats, and not have the processors be
> specific to the underlying format where possible and appropriate.
> The key is to do this without forcing some goofy normalization
> format that would kill performance and make things more brittle.
>
> So, anyway, I said all that to say that it is great you've offered
> to contribute it, and I think you certainly should. We should just
> take care to document its intended use and the performance
> limitations to consider, and enable it to limit how many
> columns/fields get turned into attributes, maybe by setting a max
> or by having a whitelist/blacklist type model. Even if it won't
> achieve the highest sustained performance, I suspect this will be
> quite helpful for people as is.
> 
> Thanks!
> Joe
> 
> On Wed, Oct 19, 2016 at 6:50 AM, Uwe Geercken <uw...@web.de> wrote:
> > Francois,
> >
> > very nice. Thanks.
> >
> > I was working on a simple version a while ago, but it had a different scope: I wanted to have a Nifi processor that merges CSV data with a template from a template engine (e.g. Apache Velocity). I will review my code and have a look at your processor.
> >
> > Where can we get it? GitHub?
> >
> > Rgds,
> >
> > Uwe
> >
> >> Sent: Wednesday, 19 October 2016 at 11:10
> >> From: "François Prunier" <fr...@hurence.com>
> >> To: dev@nifi.apache.org
> >> Subject: CsvToAttributes processor
> >>
> >> Hello Nifi folks,
> >>
> >> I've built a processor to parse CSV files with headers and turn each
> >> line into a flowfile. Each resulting flowfile has as many attributes as
> >> there are columns. Each attribute is named after a column and holds
> >> the corresponding value for that line.
> >>
> >> For example, this CSV file:
> >>
> >> col1,col2,col3
> >> a,b,c
> >> d,e,f
> >>
> >> would generate two flowfiles with the following attributes:
> >>
> >> col1 = a
> >> col2 = b
> >> col3 = c
> >>
> >> and
> >>
> >> col1 = d
> >> col2 = e
> >> col3 = f
> >>
> >> As of now, you can configure the charset as well as the delimiter, quote
> >> and escape characters. It's based on the commons-csv parser.
> >>
> >> It's very handy if you want to, for example, index a CSV file into
> >> elasticsearch.
> >>
> >> Would you guys be interested in a pull request to add this processor
> >> to the main code base? It needs a bit more documentation and cleanup,
> >> which I would add, but it's already used successfully in production.
> >>
> >> Best regards,
> >> --
> >> *François Prunier
> >> * *Hurence* - /Vos experts Big Data/
> >> http://www.hurence.com
> >> *mobile:* +33 6 38 68 60 50
> >>
> >>
>

Re: CsvToAttributes processor

Posted by Joe Witt <jo...@gmail.com>.
Francois

Thanks for starting the discussion; this is indeed the type of
thing people would find helpful. One thing I'd want to flag with
this approach is the impact it will have on performance at higher
rates. We're starting to see people wanting to do this more and
more, where they'll take the content of a flowfile and turn it into
attributes. This can put a lot of pressure on the heap and garbage
collection, and is best avoided if you want to achieve sustained
high performance. Keeping the content in its native form, or
converting it to another form, will yield much higher sustained
throughput, as we can stream those things from their underlying
storage in the content repository to their new form in the
repository, or to another system, while only ever having as much in
memory as your technique for operating on them requires. So, for
example, we can compress a 1GB file while holding only, say, 1KB in
memory. But by taking the content and turning it into attributes on
the flowfile, the flowfile object (not its content) will be in
memory most of the time, and this is where problems can occur. It
would be better to have pushing to Elasticsearch be driven off the
content, though this admittedly introduces a different challenge:
'well, what format of that content does it expect?' We have some
examples of this pattern now in our SQL processors, for instance,
which are built around a specific data format, but we need to do
better and offer generic or pluggable ways to read record-oriented
data from a variety of formats, and not have the processors be
specific to the underlying format where possible and appropriate.
The key is to do this without forcing some goofy normalization
format that would kill performance and make things more brittle.
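
To make the streaming point concrete, here is a hedged sketch (not code from
NiFi itself) of how a processor can rewrite arbitrarily large content through
the ProcessSession streaming API while holding only a small buffer on the
heap; gzip stands in for any streaming transform:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.zip.GZIPOutputStream;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.io.StreamCallback;

    public class StreamingCompressSketch {

        // Streams the flowfile's content through gzip. The content repository
        // streams bytes in and out; only the 1KB buffer below is ever held in
        // memory, even for a 1GB flowfile.
        public static FlowFile compress(ProcessSession session, FlowFile flowFile) {
            return session.write(flowFile, new StreamCallback() {
                @Override
                public void process(InputStream in, OutputStream out) throws IOException {
                    try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
                        byte[] buffer = new byte[1024];
                        int len;
                        while ((len = in.read(buffer)) != -1) {
                            gzip.write(buffer, 0, len);
                        }
                    }
                }
            });
        }
    }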

So, anyway, I said all that to say that it is great you've offered
to contribute it, and I think you certainly should. We should just
take care to document its intended use and the performance
limitations to consider, and enable it to limit how many
columns/fields get turned into attributes, maybe by setting a max
or by having a whitelist/blacklist type model. Even if it won't
achieve the highest sustained performance, I suspect this will be
quite helpful for people as is.

Thanks!
Joe

On Wed, Oct 19, 2016 at 6:50 AM, Uwe Geercken <uw...@web.de> wrote:
> Francois,
>
> very nice. Thanks.
>
> I was working on a simple version a while ago, but it had a different scope: I wanted to have a Nifi processor that merges CSV data with a template from a template engine (e.g. Apache Velocity). I will review my code and have a look at your processor.
>
> Where can we get it? GitHub?
>
> Rgds,
>
> Uwe
>
>> Sent: Wednesday, 19 October 2016 at 11:10
>> From: "François Prunier" <fr...@hurence.com>
>> To: dev@nifi.apache.org
>> Subject: CsvToAttributes processor
>>
>> Hello Nifi folks,
>>
>> I've built a processor to parse CSV files with headers and turn each
>> line into a flowfile. Each resulting flowfile has as many attributes as
>> there are columns. Each attribute is named after a column and holds
>> the corresponding value for that line.
>>
>> For example, this CSV file:
>>
>> col1,col2,col3
>> a,b,c
>> d,e,f
>>
>> would generate two flowfiles with the following attributes:
>>
>> col1 = a
>> col2 = b
>> col3 = c
>>
>> and
>>
>> col1 = d
>> col2 = e
>> col3 = f
>>
>> As of now, you can configure the charset as well as the delimiter, quote
>> and escape characters. It's based on the commons-csv parser.
>>
>> It's very handy if you want to, for example, index a CSV file into
>> elasticsearch.
>>
>> Would you guys be interested in a pull request to add this processor
>> to the main code base? It needs a bit more documentation and cleanup,
>> which I would add, but it's already used successfully in production.
>>
>> Best regards,
>> --
>> *François Prunier
>> * *Hurence* - /Vos experts Big Data/
>> http://www.hurence.com
>> *mobile:* +33 6 38 68 60 50
>>
>>

Aw: CsvToAttributes processor

Posted by Uwe Geercken <uw...@web.de>.
Francois,

very nice. Thanks.

I was working on a simple version a while ago, but it had a different scope: I wanted to have a Nifi processor that merges CSV data with a template from a template engine (e.g. Apache Velocity). I will review my code and have a look at your processor.

Where can we get it? GitHub?

Rgds,

Uwe

> Sent: Wednesday, 19 October 2016 at 11:10
> From: "François Prunier" <fr...@hurence.com>
> To: dev@nifi.apache.org
> Subject: CsvToAttributes processor
>
> Hello Nifi folks,
> 
> I've built a processor to parse CSV files with headers and turn each 
> line into a flowfile. Each resulting flowfile has as many attributes as 
> there are columns. Each attribute is named after a column and holds 
> the corresponding value for that line.
> 
> For example, this CSV file:
> 
> col1,col2,col3
> a,b,c
> d,e,f
> 
> would generate two flowfiles with the following attributes:
> 
> col1 = a
> col2 = b
> col3 = c
> 
> and
> 
> col1 = d
> col2 = e
> col3 = f
> 
> As of now, you can configure the charset as well as the delimiter, quote 
> and escape characters. It's based on the commons-csv parser.
> 
> It's very handy if you want to, for example, index a CSV file into 
> elasticsearch.
> 
> Would you guys be interested in a pull request to add this processor 
> to the main code base? It needs a bit more documentation and cleanup, 
> which I would add, but it's already used successfully in production.
> 
> Best regards,
> -- 
> *François Prunier
> * *Hurence* - /Vos experts Big Data/
> http://www.hurence.com
> *mobile:* +33 6 38 68 60 50
> 
>