Posted to user@drill.apache.org by Brian O'Neill <bo...@alumni.brown.edu> on 2013/01/21 05:37:47 UTC

Getting plugged in... (Cassandra and Drill?)

Last week, Brad Anderson came up and presented at the PhillyDB meetup.
http://www.slideshare.net/boorad/phillydb-talk-beyond-batch

He gave us an overview of Drill, and I'm curious...

Presently, we heavily use Storm + Cassandra.
http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html

We treat CRUD operations as events. Then within Storm we calculate
aggregate counts of entities flowing through the system by various
dimensions.   That works well, but we still need an ad hoc reporting
capability, and a way to report on data in the system that is not
active (historical).

Would it be possible to use the Drill engine against a Cassandra backend?
If so, what does that mean?   (implementing some API?)

I assume that performance would be terrible unless somehow the data is
stored using the columnar data format from the Dremel paper.  Is that
accurate?  Does anyone know if anyone has attempted a translation of
that format to Cassandra?

Regardless, I'm very interested in getting involved and no stranger to
getting my hands dirty.
Let me know if you can provide any direction. (our entities are
currently stored in JSON in Cassandra)

-brian


-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://brianoneill.blogspot.com/
twitter: @boneill42

Re: Getting plugged in... (Cassandra and Drill?)

Posted by Jacques Nadeau <ja...@gmail.com>.
Hey Brian,

Yeah, the storage engine APIs haven't been defined yet.  Expounding a bit
on the high-level goals we outlined in the JIRA:

The primary interface is the Storage Engine Capabilities API.  It should
describe everything that the particular storage engine supports.  This
includes whether the storage engine supports serialization and
deserialization, and what types of logical operators it can execute
internally.  It also needs to include a description of statistics
capabilities (e.g. approximate row counts, average row size, total data
size, data distribution statistics, etc.) and metadata capabilities.

Statistics API: Provide the actual statistics information that is utilized
during query planning.
Metadata API: Provide information about the available sub data sources
(tables, keyspaces, etc.) along with locality information, schema
information, type information, primary and secondary index types,
partitioning information, etc.  Portions of this information are used in
query parsing, others in query planning, and other portions in execution
planning.
Deserialization API: Convert a particular data source into one of our two
canonical in-memory formats (row-based or column-based).  Additionally,
support particular types of logical operation pushdown.
Serialization API: Serialize the in-memory format back into the persistent
storage format.
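
To make those four pieces concrete, here's a rough Java sketch of how they
might hang together.  Every name here is hypothetical (nothing like this
exists in the code yet), so treat it as a starting point for discussion
rather than a design:

// Hypothetical sketch of the storage engine interfaces described above.
import java.util.List;

public interface StorageEngine {

  // Capabilities API: everything this engine supports, queried at registration.
  Capabilities getCapabilities();

  StatisticsProvider getStatistics();          // Statistics API
  MetadataProvider getMetadata();              // Metadata API
  RecordReader getReader(ScanSpec spec);       // Deserialization API
  RecordWriter getWriter(WriteSpec spec);      // Serialization API (optional)

  interface Capabilities {
    boolean supportsSerialization();
    boolean supportsDeserialization();
    List<String> supportedPushdowns();         // e.g. "filter", "project"
    boolean providesStatistics();
  }

  interface StatisticsProvider {
    long approximateRowCount(String table);
    long averageRowSizeBytes(String table);
    long totalDataSizeBytes(String table);
  }

  interface MetadataProvider {
    List<String> listSubSources();             // tables, keyspaces, etc.
    List<String> localityHosts(String source); // where the data lives
    // schema, type, index, and partitioning info would also surface here
  }

  // Placeholders for the read/write path against the in-memory formats.
  interface RecordReader {}
  interface RecordWriter {}
  interface ScanSpec {}
  interface WriteSpec {}
}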

If you wanted to take a look at other projects' existing interfaces around
each of these things and then try to draw up a design, that would be really
helpful.

Jacques


On Mon, Jan 21, 2013 at 8:20 PM, Brian O'Neill <bo...@alumni.brown.edu> wrote:

>
> Hey crew. Thanks for all the useful replies.
>
> With respect to data model/selective queries:
> Understood.  I am open to (and anticipated) creating wide-row indexes that
> would cut down on the range queries.  With the right number of wide-row
> indexes that support the appropriate dimensions, we can probably cut down
> on the requisite full table scans.
>
> I'm even open to creating a CF/table specifically to support the Dremel
> data model.  (And I'm looking at the recent release of Cassandra native
> support for collections to see if they help with that approach)
> http://brianoneill.blogspot.com/2013/01/native-support-for-collections-in.html
>
>
> For cases where wide rows can't be constructed (e.g. we can't fully
> anticipate the dimensions needed), we might be able to handle full-table
> scans if we made the Drill API implementation aware of the
> partitions/token-space in Cassandra.  I saw that you mention locality on
> DRILL-13; vnode information from Cassandra might help there.  With that, at
> least you could send the queries to the right host.
> (thinking out loud)
>
> Regardless, I can certainly come up with a straw-man data model that I
> believe is common in the Cassandra community, and we can brainstorm to see
> what makes sense.
>
> I'm certainly game for taking on DRILL-16 and contributing to DRILL-13.
> Solving this is a priority for us and Drill seems promising.
>
> I didn't see any pointers to the Storage Engine API on the issue.  I've
> got the code down from github, but didn't see much:
> bone@zen:~/git/boneill42/incubator-drill/sandbox-> find . -name '*.java' | grep storage
> ./prototype/contrib/storage-hbase/src/main/java/org/apache/drill/App.java
> ./prototype/contrib/storage-hbase/src/test/java/org/apache/drill/AppTest.java
>
> Can anybody point me in the right direction?
>
> -brian
>
>
>
>
> ---
> Brian O'Neill
> Lead Architect, Software Development
> Health Market Science
> The Science of Better Results
> 2700 Horizon Drive • King of Prussia, PA • 19406
> M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42> •
> healthmarketscience.com
>
>
>
>
>
>
>
>
> On 1/21/13 2:23 PM, "Jacques Nadeau" <ja...@gmail.com> wrote:
>
> >Hey Brian,
> >
> >Welcome to the list!
> >
> >Here are some thoughts
> >
> >On Sun, Jan 20, 2013 at 8:37 PM, Brian O'Neill
> ><bo...@alumni.brown.edu> wrote:
> >
> >> Last week, Brad Anderson came up and presented at the PhillyDB meetup.
> >> http://www.slideshare.net/boorad/phillydb-talk-beyond-batch
> >>
> >> He gave us an overview of Drill, and I'm curious...
> >>
> >> Presently, we heavily use Storm + Cassandra.
> >>
> >>
> >>
> >> http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html
> >>
> >> We treat CRUD operations as events. Then within Storm we calculate
> >> aggregate counts of entities flowing through the system by various
> >> dimensions.   That works well, but we still need an ad hoc reporting
> >> capability, and a way to report on data in the system that is not
> >> active (historical).
> >>
> >> Seems like a great use case for Drill.
> >
> >
> >> Would it be possible to use the Drill engine against a Cassandra
> >>backend?
> >> If so, what does that mean?   (implementing some API?)
> >>
> >
> >Yes.  One of our goals is to have a defined storage engine API with
> >required and optional features to add new data sources.  In fact, we have
> >DRILL-16 which is dependent on DRILL-13 which specifically outlines this
> >goal.  DRILL-13 is the base API and DRILL-16 is the Cassandra
> >implementation.  Depending on your level of interest and time, we would
> >love to have some help on DRILL-13.
> >
> >>
> >> I assume that performance would be terrible unless somehow the data is
> >> stored using the columnar data format from the Dremel paper.  Is that
> >> accurate?  Does anyone know if anyone has attempted a translation of
> >> that format to Cassandra?
> >>
> >> One of the visions behind Dremel and Drill is that full table scans are
> >okay.  Part of the reason is the compact format of the data and the fact
> >that you only read important columns.  I'd expect that for many schema
> >designs, in-situ querying of Cassandra could be pretty effective.
> >
> >One of the things we've talked about is supporting caching
> >transformations.
> > E.g. the first time you query a source, it may be automatically
> >reorganized in a more efficient format.  This works really well with
> >HDFS's
> >write-once scheme.  Harder with something like Cassandra, depending on how
> >you're using it.
> >
> >
> >
> >> Regardless, I'm very interested in getting involved and no stranger to
> >> getting my hands dirty.
> >> Let me know if you can provide any direction. (our entities are
> >> currently stored in JSON in Cassandra)
> >>
> >>
> >As mentioned above, if you wanted to start a discussion and work on
> >DRILL-13, that would be very helpful.  Since we're still very much in
> >alpha
> >development right now, another helpful item would be to document your
> >rough
> >schema, available secondary indexes and example queries/needs on the wiki.
> > You could then translate those into Drill Logical plan syntax.  We could
> >use these as early test cases to ensure the system will support these
> >effectively.
> >
> >
> >Welcome,
> >
> >Jacques
> >
> >
> >
> >> -brian
> >>
> >>
> >> --
> >> Brian ONeill
> >> Lead Architect, Health Market Science (http://healthmarketscience.com)
> >> mobile:215.588.6024
> >> blog: http://brianoneill.blogspot.com/
> >> twitter: @boneill42
> >>
>
>
>

Re: Getting plugged in... (Cassandra and Drill?)

Posted by Brian O'Neill <bo...@alumni.brown.edu>.
Hey crew. Thanks for all the useful replies.

With respect to data model/selective queries:
Understood.  I am open to (and anticipated) creating wide-row indexes that
would cut down on the range queries.  With the right number of wide-row
indexes that support the appropriate dimensions, we can probably cut down
on the requisite full table scans.
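
As a toy illustration of the pattern (plain Java standing in for the
column family; names and layout are purely illustrative):

import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.TreeMap;

// Toy model of a wide-row index: one row per dimension value, with the
// matching entity ids as the (sorted) column names of that wide row.
public class WideRowIndex {

  private final Map<String, NavigableMap<String, String>> rows =
      new HashMap<String, NavigableMap<String, String>>();

  // rowKey is "dimension:value", e.g. "state:PA"; each indexed entity
  // becomes a column of that row.
  public void index(String dimension, String value, String entityId) {
    String rowKey = dimension + ":" + value;
    NavigableMap<String, String> row = rows.get(rowKey);
    if (row == null) {
      row = new TreeMap<String, String>();
      rows.put(rowKey, row);
    }
    row.put(entityId, "");  // the column name carries the datum; value unused
  }

  // One narrow read of a single wide row replaces a range query or full scan.
  public NavigableSet<String> entitiesFor(String dimension, String value) {
    NavigableMap<String, String> row = rows.get(dimension + ":" + value);
    return row == null
        ? new TreeMap<String, String>().navigableKeySet()
        : row.navigableKeySet();
  }
}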

I'm even open to creating a CF/table specifically to support the Dremel
data model.  (And I'm looking at the recent release of Cassandra native
support for collections to see if they help with that approach)
http://brianoneill.blogspot.com/2013/01/native-support-for-collections-in.html


For cases where wide rows can't be constructed (e.g. we can't fully
anticipate the dimensions needed), we might be able to handle full-table
scans if we made the Drill API implementation aware of the
partitions/token-space in Cassandra.  I saw that you mention locality on
DRILL-13; vnode information from Cassandra might help there.  With that, at
least you could send the queries to the right host.
(thinking out loud)
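
Roughly, I'm picturing something like this sketch.  TokenRange and
ClusterInfo are made-up stand-ins for whatever the Cassandra client
actually exposes (describe_ring, vnode metadata, etc.):

import java.util.ArrayList;
import java.util.List;

public class TokenAwareScanPlanner {

  // A contiguous slice of the ring plus a host that owns it.
  public static class TokenRange {
    final long start;
    final long end;
    final String replicaHost;
    TokenRange(long start, long end, String replicaHost) {
      this.start = start;
      this.end = end;
      this.replicaHost = replicaHost;
    }
  }

  // Stand-in for whatever the client exposes (describe_ring, vnode info).
  public interface ClusterInfo {
    List<TokenRange> describeRing(String keyspace);
  }

  // One unit of work: scan a single token range on a host that owns it.
  public static class SubScan {
    final String host;
    final long startToken;
    final long endToken;
    SubScan(String host, long startToken, long endToken) {
      this.host = host;
      this.startToken = startToken;
      this.endToken = endToken;
    }
  }

  // Split the full-table scan into one sub-scan per token range and route
  // each to a replica, so every worker reads data that is local to it.
  public List<SubScan> plan(ClusterInfo cluster, String keyspace) {
    List<SubScan> scans = new ArrayList<SubScan>();
    for (TokenRange range : cluster.describeRing(keyspace)) {
      scans.add(new SubScan(range.replicaHost, range.start, range.end));
    }
    return scans;
  }
}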

Regardless, I can certainly come up with a straw-man data model that I
believe is common in the Cassandra community, and we can brainstorm to see
what makes sense.

I'm certainly game for taking on DRILL-16 and contributing to DRILL-13.
Solving this is a priority for us and Drill seems promising.

I didn't see any pointers to the Storage Engine API on the issue.  I've
got the code down from github, but didn't see much:
bone@zen:~/git/boneill42/incubator-drill/sandbox-> find . -name '*.java' | grep storage
./prototype/contrib/storage-hbase/src/main/java/org/apache/drill/App.java
./prototype/contrib/storage-hbase/src/test/java/org/apache/drill/AppTest.java

Can anybody point me in the right direction?

-brian




---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42> •
healthmarketscience.com







On 1/21/13 2:23 PM, "Jacques Nadeau" <ja...@gmail.com> wrote:

>Hey Brian,
>
>Welcome to the list!
>
>Here are some thoughts
>
>On Sun, Jan 20, 2013 at 8:37 PM, Brian O'Neill
><bo...@alumni.brown.edu> wrote:
>
>> Last week, Brad Anderson came up and presented at the PhillyDB meetup.
>> http://www.slideshare.net/boorad/phillydb-talk-beyond-batch
>>
>> He gave us an overview of Drill, and I'm curious...
>>
>> Presently, we heavily use Storm + Cassandra.
>>
>> 
>> http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html
>>
>> We treat CRUD operations as events. Then within Storm we calculate
>> aggregate counts of entities flowing through the system by various
>> dimensions.   That works well, but we still need an ad hoc reporting
>> capability, and a way to report on data in the system that is not
>> active (historical).
>>
>> Seems like a great use case for Drill.
>
>
>> Would it be possible to use the Drill engine against a Cassandra
>>backend?
>> If so, what does that mean?   (implementing some API?)
>>
>
>Yes.  One of our goals is to have a defined storage engine API with
>required and optional features to add new data sources.  In fact, we have
>DRILL-16 which is dependent on DRILL-13 which specifically outlines this
>goal.  DRILL-13 is the base API and DRILL-16 is the Cassandra
>implementation.  Depending on your level of interest and time, we would
>love to have some help on DRILL-13.
>
>>
>> I assume that performance would be terrible unless somehow the data is
>> stored using the columnar data format from the Dremel paper.  Is that
>> accurate?  Does anyone know if anyone has attempted a translation of
>> that format to Cassandra?
>>
>> One of the visions behind Dremel and Drill is that full table scans are
>okay.  Part of the reason is the compact format of the data and the fact
>that you only read important columns.  I'd expect that for many schema
>designs, in-situ querying of Cassandra could be pretty effective.
>
>One of the things we've talked about is supporting caching
>transformations.
> E.g. the first time you query a source, it may be automatically
>reorganized in a more efficient format.  This works really well with
>HDFS's
>write-once scheme.  Harder with something like Cassandra, depending on how
>you're using it.
>
>
>
>> Regardless, I'm very interested in getting involved and no stranger to
>> getting my hands dirty.
>> Let me know if you can provide any direction. (our entities are
>> currently stored in JSON in Cassandra)
>>
>>
>As mentioned above, if you wanted to start a discussion and work on
>DRILL-13, that would be very helpful.  Since we're still very much in
>alpha
>development right now, another helpful item would be to document your
>rough
>schema, available secondary indexes and example queries/needs on the wiki.
> You could then translate those into Drill Logical plan syntax.  We could
>use these as early test cases to ensure the system will support these
>effectively.
>
>
>Welcome,
>
>Jacques
>
>
>
>> -brian
>>
>>
>> --
>> Brian ONeill
>> Lead Architect, Health Market Science (http://healthmarketscience.com)
>> mobile:215.588.6024
>> blog: http://brianoneill.blogspot.com/
>> twitter: @boneill42
>>



Re: Getting plugged in... (Cassandra and Drill?)

Posted by Jacques Nadeau <ja...@gmail.com>.
Hey Brian,

Welcome to the list!

Here are some thoughts

On Sun, Jan 20, 2013 at 8:37 PM, Brian O'Neill <bo...@alumni.brown.edu> wrote:

> Last week, Brad Anderson came up and presented at the PhillyDB meetup.
> http://www.slideshare.net/boorad/phillydb-talk-beyond-batch
>
> He gave us an overview of Drill, and I'm curious...
>
> Presently, we heavily use Storm + Cassandra.
>
> http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html
>
> We treat CRUD operations as events. Then within Storm we calculate
> aggregate counts of entities flowing through the system by various
> dimensions.   That works well, but we still need an ad hoc reporting
> capability, and a way to report on data in the system that is not
> active (historical).
>
> Seems like a great use case for Drill.


> Would it be possible to use the Drill engine against a Cassandra backend?
> If so, what does that mean?   (implementing some API?)
>

Yes.  One of our goals is to have a defined storage engine API with
required and optional features to add new data sources.  In fact, we have
DRILL-16 which is dependent on DRILL-13 which specifically outlines this
goal.  DRILL-13 is the base API and DRILL-16 is the Cassandra
implementation.  Depending on your level of interest and time, we would
love to have some help on DRILL-13.

>
> I assume that performance would be terrible unless somehow the data is
> stored using the columnar data format from the Dremel paper.  Is that
> accurate?  Does anyone know if anyone has attempted a translation of
> that format to Cassandra?
>
> One of the visions behind Dremel and Drill is that full table scans are
okay.  Part of the reason is the compact format of the data and the fact
that you only read important columns.  I'd expect that for many schema
designs, in-situ querying of Cassandra could be pretty effective.

One of the things we've talked about is supporting caching transformations.
 E.g. the first time you query a source, it may be automatically
reorganized in a more efficient format.  This works really well with HDFS's
write-once scheme.  Harder with something like Cassandra, depending on how
you're using it.
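
A minimal sketch of the idea, hand-waving the actual columnar rewrite
(all names hypothetical):

import java.util.HashMap;
import java.util.Map;

// Sketch of a caching transformation: the first scan of a source pays the
// cost of rewriting it into a columnar form; later scans hit the cache.
public abstract class CachingTransformStore<S, C> {

  private final Map<String, C> cache = new HashMap<String, C>();

  // The expensive rewrite into a columnar form; supplied per source type.
  protected abstract C toColumnar(S source);

  // The cache key includes a version token.  For HDFS the path alone works,
  // since files never change; for Cassandra you'd need some marker of "has
  // this column family changed since I cached it?", which is the hard part.
  public synchronized C scan(String sourceId, String version, S source) {
    String key = sourceId + "@" + version;
    C cached = cache.get(key);
    if (cached == null) {
      cached = toColumnar(source);
      cache.put(key, cached);
    }
    return cached;
  }
}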



> Regardless, I'm very interested in getting involved and no stranger to
> getting my hands dirty.
> Let me know if you can provide any direction. (our entities are
> currently stored in JSON in Cassandra)
>
>
As mentioned above, if you wanted to start a discussion and work on
DRILL-13, that would be very helpful.  Since we're still very much in alpha
development right now, another helpful item would be to document your rough
schema, available secondary indexes and example queries/needs on the wiki.
 You could then translate those into Drill Logical plan syntax.  We could
use these as early test cases to ensure the system will support these
effectively.


Welcome,

Jacques



> -brian
>
>
> --
> Brian ONeill
> Lead Architect, Health Market Science (http://healthmarketscience.com)
> mobile:215.588.6024
> blog: http://brianoneill.blogspot.com/
> twitter: @boneill42
>

Re: Getting plugged in... (Cassandra and Drill?)

Posted by Tomer Shiran <ts...@maprtech.com>.
With "very selective" I intended to refer to the columns, not the rows.
That is, if your query only careas about 3 columns out of 100, then a true
columnar layout works great.
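
Back-of-the-envelope, assuming fixed-width columns just to keep the
arithmetic simple:

// A row store reads every column of every row it scans; a column store
// reads only the columns the query asks for.  Numbers are illustrative.
public class ScanCost {
  public static void main(String[] args) {
    long rows = 1_000_000_000L;  // 1B rows
    int totalCols = 100, queriedCols = 3, bytesPerCol = 8;

    long rowStoreBytes = rows * totalCols * bytesPerCol;
    long colStoreBytes = rows * queriedCols * bytesPerCol;

    // Prints roughly: row store: 745 GB, column store: 22 GB (33x less I/O)
    System.out.printf("row store: %d GB, column store: %d GB (%.0fx less I/O)%n",
        rowStoreBytes >> 30, colStoreBytes >> 30,
        (double) rowStoreBytes / colStoreBytes);
  }
}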


On Sun, Jan 20, 2013 at 10:07 PM, Tomer Shiran <ts...@maprtech.com> wrote:

> Drill is being developed with the flexibility to support different data
> sources, so Cassandra support should not be a problem. Is that something
> you would be interested in building?
>
> The performance depends on the query. A query that involves a range scan
> would be very slow (assuming the default partitioner in Cassandra,
> RandomPartitioner), but point queries and queries that involve full table
> scans would provide reasonable performance. A full columnar layout would be
> faster for some queries (e.g., queries that are very selective).
>
> BTW, Drill will support nested data, so JSON is not an issue.
>
>
> On Sun, Jan 20, 2013 at 8:37 PM, Brian O'Neill <bo...@alumni.brown.edu> wrote:
>
>> Last week, Brad Anderson came up and presented at the PhillyDB meetup.
>> http://www.slideshare.net/boorad/phillydb-talk-beyond-batch
>>
>> He gave us an overview of Drill, and I'm curious...
>>
>> Presently, we heavily use Storm + Cassandra.
>>
>> http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html
>>
>> We treat CRUD operations as events. Then within Storm we calculate
>> aggregate counts of entities flowing through the system by various
>> dimensions.   That works well, but we still need an ad hoc reporting
>> capability, and a way to report on data in the system that is not
>> active (historical).
>>
>> Would it be possible to use the Drill engine against a Cassandra backend?
>> If so, what does that mean?   (implementing some API?)
>>
>> I assume that performance would be terrible unless somehow the data is
>> stored using the columnar data format from the Dremel paper.  Is that
>> accurate?  Does anyone know if anyone has attempted a translation of
>> that format to Cassandra?
>>
>> Regardless, I'm very interested in getting involved and no stranger to
>> getting my hands dirty.
>> Let me know if you can provide any direction. (our entities are
>> currently stored in JSON in Cassandra)
>>
>> -brian
>>
>>
>> --
>> Brian ONeill
>> Lead Architect, Health Market Science (http://healthmarketscience.com)
>> mobile:215.588.6024
>> blog: http://brianoneill.blogspot.com/
>> twitter: @boneill42
>>
>
>
>
> --
> Tomer Shiran
> Director of Product Management | MapR Technologies | 650-804-8657
>



-- 
Tomer Shiran
Director of Product Management | MapR Technologies | 650-804-8657

Re: Getting plugged in... (Cassandra and Drill?)

Posted by Tomer Shiran <ts...@maprtech.com>.
Drill is being developed with the flexibility to support different data
sources, so Cassandra support should not be a problem. Is that something
you would be interested in building?

The performance depends on the query. A query that involves a range scan
would be very slow (assuming the default partitioner in Cassandra,
RandomPartitioner), but point queries and queries that involve full table
scans would provide reasonable performance. A full columnar layout would be
faster for some queries (e.g., queries that are very selective).
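
To see why the range scan hurts: RandomPartitioner places each row by the
MD5 hash of its key, so lexicographically adjacent keys scatter across the
ring.  A quick demo (keys are hypothetical):

import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// RandomPartitioner derives each row's position from the MD5 of its key,
// so a range of consecutive keys maps to scattered, unrelated tokens.
public class TokenScatter {
  public static void main(String[] args) throws NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    for (String key : new String[] {"user:0001", "user:0002", "user:0003"}) {
      BigInteger token = new BigInteger(1, md5.digest(key.getBytes()));
      // Tokens share no prefix even though the keys differ by one character.
      System.out.println(key + " -> token " + token.toString(16));
    }
  }
}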

BTW, Drill will support nested data, so JSON is not an issue.


On Sun, Jan 20, 2013 at 8:37 PM, Brian O'Neill <bo...@alumni.brown.edu> wrote:

> Last week, Brad Anderson came up and presented at the PhillyDB meetup.
> http://www.slideshare.net/boorad/phillydb-talk-beyond-batch
>
> He gave us an overview of Drill, and I'm curious...
>
> Presently, we heavily use Storm + Cassandra.
>
> http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html
>
> We treat CRUD operations as events. Then within Storm we calculate
> aggregate counts of entities flowing through the system by various
> dimensions.   That works well, but we still need an ad hoc reporting
> capability, and a way to report on data in the system that is not
> active (historical).
>
> Would it be possible to use the Drill engine against a Cassandra backend?
> If so, what does that mean?   (implementing some API?)
>
> I assume that performance would be terrible unless somehow the data is
> stored using the columnar data format from the Dremel paper.  Is that
> accurate?  Does anyone know if anyone has attempted a translation of
> that format to Cassandra?
>
> Regardless, I'm very interested in getting involved and no stranger to
> getting my hands dirty.
> Let me know if you can provide any direction. (our entities are
> currently stored in JSON in Cassandra)
>
> -brian
>
>
> --
> Brian ONeill
> Lead Architect, Health Market Science (http://healthmarketscience.com)
> mobile:215.588.6024
> blog: http://brianoneill.blogspot.com/
> twitter: @boneill42
>



-- 
Tomer Shiran
Director of Product Management | MapR Technologies | 650-804-8657