Posted to dev@nifi.apache.org by Toivo Adams <to...@gmail.com> on 2015/11/02 11:12:33 UTC

Common data exchange formats and tabular data

All,
Some processors get/put data in tabular form (PutSQL, ExecuteSQL, soon
Cassandra).
It would be very nice to be able to use such processors in a pipeline – the
previous processor's output is the next processor's input. To achieve this,
processors should use a common data exchange format.

JSON is the most widely used; it’s simple and readable. But JSON lacks a
schema, and a schema can be very useful to automate data inserts/updates.
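
To illustrate the kind of automation a schema enables, here is a tiny
hypothetical helper (a sketch, not code from any NiFi processor) that derives
a parameterized INSERT statement from an Avro record schema:

    import java.util.stream.Collectors;

    import org.apache.avro.Schema;

    public class SchemaToSql {

        // Hypothetical helper: given an Avro record schema, build a
        // parameterized INSERT whose columns come straight from the
        // schema's fields.
        public static String insertSql(final Schema recordSchema, final String table) {
            final String columns = recordSchema.getFields().stream()
                    .map(Schema.Field::name)
                    .collect(Collectors.joining(", "));
            final String placeholders = recordSchema.getFields().stream()
                    .map(f -> "?")
                    .collect(Collectors.joining(", "));
            return "INSERT INTO " + table + " (" + columns + ") VALUES (" + placeholders + ")";
        }
    }

With a schema in hand, PreparedStatement parameters could then be bound by
walking the same field list; without one, every flow has to hard-code this.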

Avro has a schema, but it is somewhat more complicated and not widely used
(yet?).

Please see also:

https://issues.apache.org/jira/browse/NIFI-978

https://issues.apache.org/jira/browse/NIFI-901

Opinions?

Thanks
Toivo





Re: Common data exchange formats and tabular data

Posted by Joe Witt <jo...@gmail.com>.
Toivo,

At a framework level NiFi itself is format/schema agnostic.  It holds
maps of strings (attributes) and a chunk of zero or more bytes
(payload) on a per-flowfile basis.  Those bytes could be anything.
The majority of processors exist to deal with protocols of exchange
between systems and are generally also type agnostic.  Others work
with JSON, XML, Avro, CSV, text files, etc.  These processors are
inherently meant to deal with those formats.
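
To make that concrete, here is a minimal sketch of a custom processor that
touches only the generic parts of the API, the attribute map and the payload
bytes. The processor name and attribute key are made up; the
AbstractProcessor/ProcessSession calls are the standard ones:

    import java.io.InputStream;
    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.atomic.AtomicLong;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    // Made-up processor: it never interprets the payload, it only counts
    // bytes and records the result in an attribute. This is the sense in
    // which the framework is format/schema agnostic.
    public class CountPayloadBytes extends AbstractProcessor {

        static final Relationship REL_SUCCESS =
                new Relationship.Builder().name("success").build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(final ProcessContext context, final ProcessSession session)
                throws ProcessException {
            FlowFile flowFile = session.get();
            if (flowFile == null) {
                return;
            }
            final AtomicLong byteCount = new AtomicLong(0L);
            // The payload is a chunk of zero or more bytes; nothing here
            // assumes JSON, Avro, CSV or anything else.
            session.read(flowFile, (final InputStream in) -> {
                final byte[] buffer = new byte[8192];
                int len;
                while ((len = in.read(buffer)) != -1) {
                    byteCount.addAndGet(len);
                }
            });
            // Attributes are just a map of strings riding along with the payload.
            flowFile = session.putAttribute(flowFile, "payload.byte.count",
                    String.valueOf(byteCount.get()));
            session.transfer(flowFile, REL_SUCCESS);
        }
    }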

Now, with the inclusion of processors that pull from and send to
databases we've hit the need for a way to serialize that data while it
lives in NiFi.  We also need this for things like writing to Accumulo
or HBase and presumably other systems as well.  These are systems for
which the 'structure' of the data is in many ways controlled by their
model (mutations, rows, etc.).  So, for these I definitely see what
you mean about us centering on some recommended and fully tooled
formats.  The devil, of course, is always in the schema details.

When pulling from Solr or a database, or pushing to Accumulo or HBase
or other such systems, I do think we can/should find a standard.

Joe


Re: Common data exchange formats and tabular data

Posted by Toivo Adams <to...@gmail.com>.
I was occupied with other things lately and didn't have time to deal with
this.

In my opinion, a clear statement of which formats NiFi recommends would help
newcomers and processor writers a lot. It would also help to create Lego-like
pieces which fit together without incompatibility worries.

In an ideal world I'd like to have one or two “correct” data exchange
formats. But reality is different.
Different users have different needs and habits, so there is a need for many
formats.

I prefer to have a few 'recommended core formats' which processors should
support in one way or another,
and Convert processors for all other formats.

But I am not sure how to reach consensus.


Thanks
Toivo





Re: Common data exchange formats and tabular data

Posted by Joe Witt <jo...@gmail.com>.
Toivo - this thread seems important and does not appear to have come
to a resolution.  Do you want to pick this back up, or are you
comfortable with where it is for now?


Re: Common data exchange formats and tabular data

Posted by dcave <dc...@ssglimited.com>.
Adding multiple input and output format support would complicate the
usability and ongoing maintenance of the SQL/NoSQL processors.
Additionally, as you suggested, it is impossible to select a "correct" format
or set of formats that can handle all potential needs.

A simpler and more streamlined solution is to put the emphasis on having
Convert processors available that can handle specific cases as they come up,
as your last comment suggested.  This also keeps each processor focused on
one specific task rather than having Get/Put/Convert hybrids that can lead to
unneeded complexity and code bloat.

This is in line with Benjamin's line of work.




Re: Common data exchange formats and tabular data

Posted by Toivo Adams <to...@gmail.com>.
All,

Benjamin has already done a lot of good work, and it would be very helpful
if we can agree on how to move forward.
https://issues.apache.org/jira/browse/NIFI-901

My first post was naive; there are many more things to consider.

It is probably impossible to select only one “correct data exchange format”
that all processors should use.

But can we agree on one or two preferred data formats that SQL and NoSQL
processors should support?
All other formats would then be supported using converter processors.

In my opinion, a preferred data exchange format should:

1. Support a schema in one way or another.

2. Support streaming.

3. Support different data types (strings, numeric types, date/time, binary).

4. Be fast and efficient to serialize/deserialize.

5. Be widely used and have strong supporters.

6. Be usable in transformations, filtering, joins, splits, etc.

7. Be convertible to and from other formats relatively easily.

Nice to have:

1. Nested data structures. For example, orders can contain order rows.


Or maybe we should recommend that all SQL and NoSQL processors support two
or more input/output formats, and the user can select the format through
configuration (see the property sketch below)?
Or separate sets of processors for different formats?
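
For the configuration option, a minimal sketch of what such a property could
look like, using NiFi's standard PropertyDescriptor builder. The property
name and the particular formats listed are just placeholders:

    import org.apache.nifi.components.PropertyDescriptor;

    public class FormatSelectionExample {

        // Hypothetical property; "Avro", "JSON" and "CSV" stand in for
        // whatever core formats the community settles on.
        public static final PropertyDescriptor OUTPUT_FORMAT =
                new PropertyDescriptor.Builder()
                        .name("Output Format")
                        .description("Serialization format for rows written to the payload")
                        .allowableValues("Avro", "JSON", "CSV")
                        .defaultValue("Avro")
                        .required(true)
                        .build();
    }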


Thanks
Toivo





Re: Common data exchange formats and tabular data

Posted by Toivo Adams <to...@gmail.com>.
Matt,

Good overview.

> Avro is similar, but the schema must always be provided with the data. In
> the case of NiFi DataFlows, it's likely more efficient to send the schema
> once as an initialization packet (I can't remember the real term in NiFi),
> then the rows can be streamed individually, in batches of user-defined
> size, sampled, etc.

Do you mean an "Initial Information Packet" (IIP)?
Mr. Morrison's classical FBP includes such functionality, often used for
configuration.

As far as I know, NiFi doesn't have such a concept.

But NiFi's ExecuteSQL uses Avro with a schema for the query result.
The result is one big FlowFile which includes both the schema and all rows.
The processor creates a schema from the JDBC metadata, writes it to the Avro
container, and then writes all rows to the same container.
Writing and reading such a file is done using streaming, so the result can
be very big.
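
As a rough illustration of that write path (a sketch, not the actual
ExecuteSQL code), using the plain Avro Java API; the schema and field names
are made up, while ExecuteSQL derives the real ones from JDBC metadata:

    import java.io.IOException;
    import java.io.OutputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroRowStream {

        // Hypothetical schema standing in for one built from JDBC metadata.
        private static final Schema SCHEMA = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"},"
                + "{\"name\":\"name\",\"type\":\"string\"}]}");

        public static void writeRows(final OutputStream out, final long rowCount)
                throws IOException {
            final DataFileWriter<GenericRecord> writer =
                    new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA));
            // The schema is written once into the container header ...
            writer.create(SCHEMA, out);
            // ... and rows are appended one at a time, so memory use stays
            // flat no matter how large the result set is.
            for (long i = 0; i < rowCount; i++) {
                final GenericRecord record = new GenericData.Record(SCHEMA);
                record.put("id", i);
                record.put("name", "row-" + i);
                writer.append(record);
            }
            writer.close();
        }
    }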

Thanks
Toivo





Re: Common data exchange formats and tabular data

Posted by Adam Taft <ad...@adamtaft.com>.
CSV (and friends like TSV, PSV, etc.) is obviously very naturally oriented
to representing tabular data.  I don't know that there would be a lot of
value in using/inventing a JSON or Avro format in place of CSV for tabular
data.

The only slight advantage might be maintaining type information, which JSON
or Avro could carry (since CSV is basically all strings).  But the type
information alone might make dealing with data in follow-on processors a
bit more difficult.

That being said, I do like the concept of having the tabular data payload
processed separately from the original remote "fetch" call.  This definitely
follows the Unix philosophy better, favoring data flow composition.  A lot
of data can be converted to a string-based tabular form, regardless of the
original source, which could enable interesting possibilities for many data
flows.

Adam

Re: Common data exchange formats and tabular data

Posted by Matthew Burgess <ma...@gmail.com>.
Hello all,

I am new to the NiFi community but I have a good amount of experience with
ETL tools and applications that process lots of tabular data. In my
experience, JSON is only useful as the common format for tabular data if it
has a "flat" schema, in which case there aren't any advantages for JSON over
other formats such as CSV. However, I've seen lots of "CSV" files that don't
seem to adhere to any standard, so I would presume NiFi would need a rigid
schema such as RFC-4180 (http://www.rfc-base.org/txt/rfc-4180.txt).
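
For reference, a small sketch of what a strict RFC-4180 parse could look like
using Apache Commons CSV (one possible library choice among several; the file
name is made up):

    import java.io.Reader;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVParser;
    import org.apache.commons.csv.CSVRecord;

    public class Rfc4180Example {
        public static void main(final String[] args) throws Exception {
            // CSVFormat.RFC4180 enforces the RFC's rules (quoting, embedded
            // commas/quotes/newlines) rather than the loose ad-hoc "CSV"
            // dialects found in the wild. "rows.csv" is a made-up file name.
            try (Reader reader = Files.newBufferedReader(Paths.get("rows.csv"));
                 CSVParser parser = CSVFormat.RFC4180.parse(reader)) {
                for (CSVRecord record : parser) {
                    // Every value comes back as a string; any typing has to
                    // be layered on top, e.g. via a separate schema.
                    System.out.println(record.get(0) + " | " + record.get(1));
                }
            }
        }
    }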

However, CSV isn't a natural way to express the schema of the rows, so JSON
or YAML is probably a better choice for that. There's a format called Tabular
Data Package that combines CSV and JSON for tabular data serialization:
http://dataprotocols.org/tabular-data-package/

Avro is similar, but the schema must always be provided with the data. In
the case of NiFi DataFlows, it's likely more efficient to send the schema
once as an initialization packet (I can't remember the real term in NiFi),
then the rows can be streamed individually, in batches of user-defined size,
sampled, etc.
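
One hypothetical way to realize that "schema once" idea with today's
primitives would be to carry the schema in a flowfile attribute and only the
rows in the payload. The attribute key below is invented; nothing in NiFi
reserves it:

    import java.io.OutputStream;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.ProcessSession;

    public final class SchemaOnceExample {

        // Invented attribute key, purely illustrative.
        public static final String SCHEMA_ATTRIBUTE = "tabular.schema";

        // Attach the schema once as an attribute and write only the rows
        // into the payload, so downstream processors can stream, batch or
        // sample rows without re-parsing a schema out of every payload.
        public static FlowFile writeRowsWithSchema(final ProcessSession session,
                FlowFile flowFile, final String schemaJson, final byte[] serializedRows) {
            flowFile = session.putAttribute(flowFile, SCHEMA_ATTRIBUTE, schemaJson);
            flowFile = session.write(flowFile,
                    (final OutputStream out) -> out.write(serializedRows));
            return flowFile;
        }
    }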

Having said all that, there are projects like Apache Drill that can handle
non-flat JSON files and still present them in tabular format. They have
functions like KVGEN and FLATTEN to transform the document(s) into tabular
format. In the use cases you present, you already know the data is tabular,
and as such the extra data model transformation is not needed.  If this is
desired, it should be apparent that a streaming JSON processor would be
necessary; otherwise, for large tabular datasets you'd have to read the
whole JSON file into memory to parse individual rows.
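
To illustrate the streaming point, a sketch using Jackson's streaming API
that walks a top-level JSON array of row objects one token at a time, so only
the current row is ever held in memory (the file name and the array-of-rows
structure are assumptions):

    import java.io.File;

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.JsonToken;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class StreamingJsonRows {
        public static void main(final String[] args) throws Exception {
            final ObjectMapper mapper = new ObjectMapper();
            final JsonFactory factory = mapper.getFactory();
            // Assumes the payload is a top-level array of row objects:
            // [ {...}, {...}, ... ]  ("rows.json" is a made-up file name).
            try (JsonParser parser = factory.createParser(new File("rows.json"))) {
                if (parser.nextToken() != JsonToken.START_ARRAY) {
                    throw new IllegalStateException("expected a JSON array of rows");
                }
                // Advance one row at a time; the full file is never
                // materialized in memory.
                while (parser.nextToken() == JsonToken.START_OBJECT) {
                    final JsonNode row = mapper.readTree(parser);
                    System.out.println(row);
                }
            }
        }
    }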

Regards,
Matt
