You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Robert Stupp (JIRA)" <ji...@apache.org> on 2015/04/26 15:05:38 UTC

[jira] [Commented] (DRILL-92) Cassandra storage engine

    [ https://issues.apache.org/jira/browse/DRILL-92?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513026#comment-14513026 ] 

Robert Stupp commented on DRILL-92:
-----------------------------------

[~yash360@gmail.com] sorry for the late response. Here are my comments:

Tables & Keyspace cache - you don’t need to cache it. It is already cached - so it’s duplicate effort. Makes your code less complex if you remove the cache - e.g. all those cache related {{catch ExecutionException}} stuff.

I’m not sure whether the connection cache in {{CassandraConnectionManager}} really works. For example, if you have hosts 127.0.0.1 and 127.0.0.2 using the same {{Cluster}} instance, and the cache decides to evict 127.0.0.1, the instance for 127.0.0.2 no longer works.
Beside that you query the cache using {{List<String>}} but add using {{String}} for the key.
As a proposal: use the cluster name as the cache key and query for that.
I’m not sure whether you can always close the {{Cluster}} instance - in respect whether such an instance is still in use during a long running operation.

Both {{CassandraConnectionManager}} and {{CassandraSchemaFactory}} create individual {{Cluster}} instances and therefore independent resources (connections, threads, etc). Should be merged. If both classes are used in completely different contexts, please ignore this comment.
The {{Cluster}} instance in {{CassandraSchemaFactory}} is never closed.

bq. Endpoint affinity & Partition token
If you’re using the code just to assign ranges to Drill hosts, then that should be fine.
But do not assume anything about tokens assigned to a C* host. That code heavily depends on the individual cluster configuration (partitioner, topology, node placement (DC, rack)) and keyspace configuration. It’s not that easy, but manageable.

In {{org.apache.drill.exec.store.cassandra.CassandraRecordReader#setup}} you’re using {{QueryBuilder.token}} for paging / slicing. Unfortunately that would not work. Assume that you have vnodes in the C* cluster (defaults to 256 vnodes per C* node). Vnode tokens are assigned randomly to endpoints (=nodes) - it’s not like old-fashioned single token per node. You just cannot slice using the {{token()}} function. Even further it’s quite difficult to nicely split slices matching both C* nodes/vnodes *and* Drill _sub scan ranges_ (is this the correct wording?). Nice slicing across all nodes/vnodes is one of the weak sides in C*. That’s why Hadoop-on-C* recommends to prevent vnodes - they have the same problem. Let me think a bit about that - maybe I can provide a solution or at least a workaround for that.

For unit tests: take a look at https://github.com/doanduyhai/Achilles - it has some nice support for unit tests, which may make all that manual work to setup and fill keyspaces/tables superfluous.

Is it a Drill requirement that {{CassandraRecordReader#updateValueVector}} only mutates using {{String}}s?

General code comments:
* there’s some unused code, that can be safely removed
* in {{CassandraRecordReader}}: you can safely replace the {{clazz.isInstance()}}-sequence with {{clazz == ClassName.class}}

Note: the patch does not apply onto the current master - but on master as of March 31st. There were some breaking API changes in Drill.

For the future: I don’t know whether the current code supports Cassandra’s User-Defined-Types or collections (maps, sets, lists). If not, it might be a nice feature for later.


> Cassandra storage engine
> ------------------------
>
>                 Key: DRILL-92
>                 URL: https://issues.apache.org/jira/browse/DRILL-92
>             Project: Apache Drill
>          Issue Type: New Feature
>            Reporter: Steven Phillips
>            Assignee: Yash Sharma
>             Fix For: Future
>
>         Attachments: DRILL-92-Cassandra-Storage.patch, DRILL-92-CassandraStorage.patch, DRILL-92.patch, DRILL-CASSANDRA.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)