Posted to dev@drill.apache.org by Yash Sharma <ya...@gmail.com> on 2015/01/08 17:29:45 UTC

[DISCUSS] Cassandra storage for Drill

Hi Folks,
This thread is to discuss a few scenarios in how Cassandra works, and how we
think they should be supported in Drill.

Cassandra does not support the following cases natively, but they are doable
on Drill's end once we fetch a superset of the data without these conditions.

1. Filtering on a non-indexed column in Cassandra
2. Filtering by a subset of the primary key
3. OR condition in the WHERE clause

Should we apply the filters at Drill's end and support these features, or
should we propagate an error back to the user asking for a valid
Cassandra-based query?

-----
Examples:
Here 'trending_now' is a dummy table with columns (id, rank, pog_id), where
(id, rank) is the composite primary key.
1.
cqlsh:recsys> select * from trending_now where pog_id=10004 ;
Bad Request: No indexed columns present in by-columns clause with Equal
operator

2.
cqlsh:recsys> select * from trending_now where rank=4;
Bad Request: Cannot execute this query as it might involve data filtering
and thus may have unpredictable performance. If you want to execute this
query despite the performance unpredictability, use ALLOW FILTERING
P.S. ALLOW FILTERING is not permitted in the Cassandra Java driver as of now.

3.
cqlsh:recsys> select * from trending_now where rank=4 or id='id0004';
Bad Request: line 1:40 missing EOF at 'or'

4. Valid Query:
cqlsh:recsys> select * from trending_now where id='id0004' and rank=4;

 id     | rank | pog_id
--------+------+--------
 id0004 |    4 |  10002

(1 rows)
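
As a rough illustration of the three cases above, here is a minimal Python
sketch that checks which of them a WHERE clause hits. It is only a sketch
under stated assumptions: TableMeta, unsupported_cases and plan are
hypothetical names (not Drill or Cassandra driver APIs), the WHERE clause is
reduced to its equality columns plus an "has OR" flag, and a key restriction
is treated as valid only when it covers a leading prefix of the primary key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMeta:
    # Hypothetical metadata holder; not a Drill or Cassandra driver class.
    primary_key: tuple                # ordered key columns, e.g. ("id", "rank")
    indexed: frozenset = frozenset()  # columns with a secondary index


def unsupported_cases(meta, eq_columns, has_or=False):
    """Return the case numbers (1, 2, 3) from the list above that a query hits.

    eq_columns: columns compared with '=' in the WHERE clause.
    has_or:     True if the WHERE clause contains an OR.
    """
    cols = set(eq_columns)
    cases = set()
    # Case 1: filtering on a column that is neither a key column nor indexed.
    if any(c not in meta.primary_key and c not in meta.indexed for c in cols):
        cases.add(1)
    # Case 2: key columns are restricted, but not as a leading prefix of the key.
    restricted = [c for c in meta.primary_key if c in cols]
    if restricted and tuple(restricted) != meta.primary_key[:len(restricted)]:
        cases.add(2)
    # Case 3: OR in the WHERE clause.
    if has_or:
        cases.add(3)
    return cases


def plan(meta, eq_columns, has_or=False):
    """Push the filter to Cassandra only if none of the three cases applies."""
    if unsupported_cases(meta, eq_columns, has_or):
        return "filter-in-drill"
    return "pushdown"


# The dummy table from the examples above.
TRENDING_NOW = TableMeta(primary_key=("id", "rank"))
```

With this sketch, plan(TRENDING_NOW, ["id", "rank"]) selects the pushdown
path, while the queries from examples 1 to 3 fall back to filtering in Drill.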

Re: [DISCUSS] Cassandra storage for Drill

Posted by Yash Sharma <ya...@gmail.com>.
Great, guys. That summarizes the questions I had in mind.

As the first phase of the implementation, I am planning to bypass the entire
pushdown if the query falls into any of these cases.
In the next phase, I will check whether we can selectively apply a few
pushdowns even if the query falls into one of these cases.

Since I am using DataStax's Java driver to query Cassandra (which itself
supports only a subset of CQL), we would have to figure out the best way to
apply range filters as well. Currently my subscan pulls the entire data set,
unlike HBase and Mongo, where we apply min/max filters to limit the data in
the subscan.
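
The two phases could be sketched roughly as follows. This is a minimal Python
illustration, not the actual plugin code: split_conjuncts and apply_residual
are hypothetical helpers, the WHERE clause is assumed to be a flat list of
AND-ed (column, op, value) conjuncts, and only '=' on a given set of columns
is assumed to be pushable. The sketch also does not verify that the pushed
subset is a valid CQL query on its own, which a real implementation would
have to check.

```python
import operator

# Comparison operators that the Drill-side (residual) filter can evaluate.
OPS = {
    "=": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def split_conjuncts(conjuncts, pushable_columns, selective):
    """Split AND-ed (column, op, value) conjuncts into (pushed, residual).

    selective=False models phase 1: if any conjunct cannot be pushed,
    nothing is pushed and Drill filters everything.
    selective=True models phase 2: pushable conjuncts go to Cassandra and
    the remaining ones are kept as a residual filter for Drill.
    """
    ok = [c for c in conjuncts if c[0] in pushable_columns and c[1] == "="]
    if len(ok) == len(conjuncts):
        return list(conjuncts), []
    if not selective:
        return [], list(conjuncts)
    return ok, [c for c in conjuncts if c not in ok]


def apply_residual(rows, residual):
    """Apply the residual conjuncts to rows (dicts) fetched by the subscan."""
    return [
        row for row in rows
        if all(OPS[op](row[col], val) for col, op, val in residual)
    ]
```

For example, with conjuncts [("id", "=", "id0004"), ("pog_id", "=", 10002)]
and pushable columns {"id", "rank"}, phase 1 pushes nothing, while phase 2
pushes the id condition and leaves the pog_id condition to Drill.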
On 09/01/2015 5:07 am, "Jacques Nadeau" <ja...@apache.org> wrote:

> Drill's framework does the same.  Drill leverages some of Calcite's
> extension capabilities to allow very easy pushdowns by allowing storage
> subsystems to expose optimizer rules (subclassed on top of Calcite's
> optimizer rule construct). On top of what Calcite can do, Drill also
> understands concepts like parallelization and data locality and lets systems
> like Cassandra expose this information to vastly improve performance,
> especially when working across multiple systems.

Re: [DISCUSS] Cassandra storage for Drill

Posted by Jacques Nadeau <ja...@apache.org>.
Drill's framework does the same.  Drill leverages some of Calcite's
extension capabilities to allow very easy pushdowns by allowing storage
subsystems to expose optimizer rules (subclassed on top of Calcite's
optimizer rule construct). On top of what Calcite can do, Drill also
understands concepts like parallelization and data locality and lets systems
like Cassandra expose this information to vastly improve performance,
especially when working across multiple systems.
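
To make the idea of a storage-specific optimizer rule concrete, here is a toy
Python sketch of a rule that folds a filter into the scan below it. It is
only an illustration of the concept: Scan, Filter and push_filter_into_scan
are hypothetical names and do not correspond to Calcite's or Drill's actual
classes, and conditions are plain strings rather than expression trees.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Scan:
    # Toy plan node: a table scan, optionally carrying a pushed-down filter.
    table: str
    pushed_filter: Optional[str] = None


@dataclass(frozen=True)
class Filter:
    # Toy plan node: a filter evaluated by the engine above its child.
    condition: str
    child: object


def push_filter_into_scan(node, can_push):
    """Toy rule: rewrite Filter(Scan) into Scan(pushed_filter=...).

    can_push is a callback supplied by the storage plugin that decides
    whether the source can evaluate the condition. If the rule does not
    match, the node is returned unchanged and the engine keeps filtering.
    """
    if (
        isinstance(node, Filter)
        and isinstance(node.child, Scan)
        and node.child.pushed_filter is None
        and can_push(node.condition)
    ):
        return replace(node.child, pushed_filter=node.condition)
    return node
```

A storage plugin for Cassandra could, for instance, pass a can_push callback
that rejects any condition containing OR, so that such filters stay in the
engine.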

On Thu, Jan 8, 2015 at 12:41 PM, Julian Hyde <ju...@gmail.com> wrote:

> Calcite’s adapter framework makes it easy to push down filters and
> aggregations to third-party sources, and to express more powerful and
> data-source-specific optimizations.
>
> Is Drill building on Calcite’s support or doing it its own way?
>
> Calcite doesn’t have a Cassandra adapter, but the same approach taken in
> the MongoDB, Splunk, and Phoenix adapters could be used.

Re: [DISCUSS] Cassandra storage for Drill

Posted by Julian Hyde <ju...@gmail.com>.
Calcite’s adapter framework makes it easy to push down filters and aggregations to third-party sources, and to express more powerful and data-source-specific optimizations.

Is Drill building on Calcite’s support or doing it its own way?

Calcite doesn’t have a Cassandra adapter, but the same approach taken in the MongoDB, Splunk, and Phoenix adapters could be used.

On Jan 8, 2015, at 9:11 AM, Tomer Shiran <ts...@gmail.com> wrote:

> I think that any valid SQL statement should work with any data source.
> Drill should:
> 
>   - Push down as much processing as possible into the data source
>   (Cassandra in this case)
>   - Maintain as much data locality as possible (i.e., spread the work so
>   that each drillbit is handling local data)
>   - In the worst case, Drill should pull the entire table from the data
>   source if that's what's needed to satisfy the query.

Re: [DISCUSS] Cassandra storage for Drill

Posted by Tomer Shiran <ts...@gmail.com>.
I think that any valid SQL statement should work with any data source.
Drill should:

   - Push down as much processing as possible into the data source
   (Cassandra in this case)
   - Maintain as much data locality as possible (i.e., spread the work so
   that each drillbit is handling local data)
   - In the worst case, Drill should pull the entire table from the data
   source if that's what's needed to satisfy the query.
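
The data locality point can be sketched as a simple assignment problem. The
Python below is a minimal illustration under assumptions: assign_ranges is a
hypothetical helper, each unit of work is a (range_id, replica_hosts) pair,
and drillbits are identified by host name. It is not how Drill actually
computes its assignments.

```python
def assign_ranges(ranges, drillbits):
    """Assign each data range to a drillbit, preferring local data.

    ranges:    list of (range_id, [hosts holding a replica of that range])
    drillbits: list of host names running a drillbit
    Returns {drillbit_host: [range_id, ...]}.
    """
    assignment = {d: [] for d in drillbits}
    for range_id, replicas in ranges:
        # Prefer drillbits co-located with a replica of this range.
        local = [d for d in drillbits if d in replicas]
        # Worst case: no local drillbit, so any drillbit may read it remotely.
        candidates = local or drillbits
        # Spread the work by picking the least-loaded candidate.
        target = min(candidates, key=lambda d: len(assignment[d]))
        assignment[target].append(range_id)
    return assignment
```

Ranges with no co-located drillbit are still assigned, which corresponds to
the worst case above where Drill pulls the data remotely.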

