Posted to commits@nifi.apache.org by "Benjamin Janssen (JIRA)" <ji...@apache.org> on 2015/10/14 06:31:05 UTC

[jira] [Comment Edited] (NIFI-901) Create processors to get/put data with Apache Cassandra

    [ https://issues.apache.org/jira/browse/NIFI-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956226#comment-14956226 ] 

Benjamin Janssen edited comment on NIFI-901 at 10/14/15 4:30 AM:
-----------------------------------------------------------------

I've been brushing up on CQL and I'm starting to foresee some difficulties.

The first issue is that with CQL, Cassandra loses some of the schemaless flexibility of NoSQL databases.  There is no longer a way (as far as I've been able to gather) to refer to a cell by row name + column name.  Instead, each table must have a schema assigned to it, with the row and column identifiers constructed from the fields that make up the "primary key" of the SQL-like language.  This makes it difficult to build a simple, generic processor that reads the row and column from FlowFile attributes and dumps the content into the cell; the processor would somehow have to be schema-aware.
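To make the schema-awareness problem concrete, here is a minimal sketch (the table and column names are hypothetical, not from any proposed design): a CQL INSERT has to name the schema's actual columns, so a put processor can't just accept arbitrary row/column attributes the way a thrift-era client could.

```python
# Sketch: building a CQL INSERT from FlowFile-style attributes.
# The table name and column names are hypothetical; in CQL they must
# match a schema that already exists on the cluster, which is why the
# processor needs the column list (known_columns) up front.

def build_insert(table, attributes, known_columns):
    """Build a parameterized CQL INSERT, keeping only the attributes
    that correspond to columns the table's schema actually defines."""
    cols = [c for c in known_columns if c in attributes]
    if not cols:
        raise ValueError("no attributes match the table's schema")
    placeholders = ", ".join("?" for _ in cols)
    stmt = f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
    values = [attributes[c] for c in cols]
    return stmt, values

# NiFi attaches attributes (like "filename") that the schema knows
# nothing about; only schema columns survive into the statement.
attrs = {"sensor_id": "s1", "reading_time": "2015-10-14", "value": "42",
         "filename": "ignored.txt"}
stmt, values = build_insert("sensors", attrs,
                            known_columns=["sensor_id", "reading_time", "value"])
print(stmt)  # INSERT INTO sensors (sensor_id, reading_time, value) VALUES (?, ?, ?)
```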

Regarding batching on the Put side: the CQL3 documentation seems to imply that batching should not be used for performance (http://docs.datastax.com/en/cql/3.1/cql/cql_using/useBatch.html), but that advice appears to be directed mostly at the BATCH construct.  I think it would be fine to batch (without using the BATCH keyword) by buffering updates to a single primary key (note that in CQL, "primary key" refers to the combination of fields that defines both the row AND the column that will be written to).  I'm not sure this level of buffering is worth doing, though.
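As a sketch of that buffering idea (field names are hypothetical, and this only groups pending updates in memory rather than talking to Cassandra): updates that share the same partition key are collected together so they could be flushed as one logical write.

```python
from collections import defaultdict

# Sketch: buffer pending updates by partition key (the first component
# of the CQL primary key) so writes hitting the same row can be
# flushed together instead of issued one statement at a time.
def buffer_by_partition(updates, partition_field):
    buffered = defaultdict(list)
    for update in updates:
        buffered[update[partition_field]].append(update)
    return dict(buffered)

updates = [
    {"sensor_id": "s1", "reading_time": "t1", "value": 1},
    {"sensor_id": "s2", "reading_time": "t1", "value": 7},
    {"sensor_id": "s1", "reading_time": "t2", "value": 3},
]
grouped = buffer_by_partition(updates, "sensor_id")
print(len(grouped["s1"]))  # 2 updates buffered for partition s1
```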

Combining these two issues, I'm wondering whether FlowFiles should be structured so that they have no content and the information to insert is carried solely in the attributes, or whether the content should be required to be a JSON-style document that defines the information needed for the update.  Either approach would limit the overall size of the entry that could be inserted, but I'm not sure we want to load particularly huge objects into Cassandra anyway.
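A sketch of what the JSON-content option might look like (the envelope fields "table" and "columns" are my own invention, not a proposed format): the FlowFile body carries everything the processor needs to issue the insert, so no schema knowledge has to live in processor configuration.

```python
import json

# Sketch: FlowFile content as a self-describing JSON insert request.
# The envelope fields ("table", "columns") are hypothetical.
def parse_insert_request(flowfile_content):
    doc = json.loads(flowfile_content)
    cols = list(doc["columns"].keys())
    stmt = (f"INSERT INTO {doc['table']} ({', '.join(cols)}) "
            f"VALUES ({', '.join('?' for _ in cols)})")
    return stmt, list(doc["columns"].values())

content = '{"table": "sensors", "columns": {"sensor_id": "s1", "value": 42}}'
stmt, values = parse_insert_request(content)
print(stmt)  # INSERT INTO sensors (sensor_id, value) VALUES (?, ?)
```

One consequence of this layout is that the whole entry has to fit in memory as parsed JSON, which is part of why it effectively caps the size of what gets inserted.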

Thoughts?

Background for those not familiar with Cassandra and CQL:

Cassandra's original data model dealt with keyspaces, column families, row keys, column keys, and cells.  Its new data model (the CQL API, which attempts to mimic SQL) essentially abstracts away all of these underlying constructs.

Column families are replaced by "tables" in CQL.  The row key and column key are both replaced by the "primary key" concept from SQL: the first entry in the primary key is treated as the legacy row key, and the remaining entries are combined to form the legacy column key.  So the typical SQL-style columns in the CQL language are not necessarily columns at all in the Cassandra backend.  They could be part of the row key, part of the column key, or just part of a cell value.
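That mapping can be sketched as follows (illustrative only; real Cassandra also supports composite partition keys, which this simplification ignores):

```python
# Sketch: how a CQL compound primary key maps onto the legacy
# (row key, column key) model. Real Cassandra also allows composite
# partition keys, which this deliberately ignores.
def legacy_view(primary_key_values):
    row_key = primary_key_values[0]             # partition key -> legacy row key
    column_key = tuple(primary_key_values[1:])  # clustering columns -> legacy column key
    return row_key, column_key

row, col = legacy_view(["sensor_s1", "2015-10-14", "temperature"])
print(row)  # sensor_s1
print(col)  # ('2015-10-14', 'temperature')
```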

The big thing is that the concept of a "cell" is really no longer present in the CQL data model, and writing a processor designed to write the contents of a FlowFile to a single cell does not really work if we want to use modern Cassandra clients to interact with the cluster.



> Create processors to get/put data with Apache Cassandra
> -------------------------------------------------------
>
>                 Key: NIFI-901
>                 URL: https://issues.apache.org/jira/browse/NIFI-901
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>            Reporter: Joseph Witt
>              Labels: beginner
>             Fix For: 0.4.0
>
>
> Develop processors to interact with Apache Cassandra.  The current http processors may actually support this as is but such configuration may be too complex to provide the quality user experience desired.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)