Posted to dev@beam.apache.org by Vincent Marquez <vi...@gmail.com> on 2019/10/03 23:02:23 UTC

Feature addition to java CassandraIO connector

Currently the CassandraIO connector lets a user specify a table, and the
CassandraSource object generates a list of queries based on the table's
token ranges and groups those queries by token range.
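
To illustrate the token-range approach described above, here is a minimal
sketch in plain Java. It is not the actual CassandraSource code: the class
and method names are hypothetical, and it assumes the cluster uses
Murmur3Partitioner, whose tokens span the full signed 64-bit range. It
splits the ring into equal ranges and builds one query per range.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of CassandraIO).
final class TokenRangeQueries {
  // Token bounds under the assumed Murmur3Partitioner.
  static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
  static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

  /** Splits the token ring into numSplits contiguous ranges, one query each. */
  static List<String> queriesFor(
      String keyspace, String table, String partitionKey, int numSplits) {
    if (numSplits <= 0) {
      throw new IllegalArgumentException("numSplits must be positive");
    }
    BigInteger span = MAX.subtract(MIN); // BigInteger avoids long overflow
    BigInteger n = BigInteger.valueOf(numSplits);
    List<String> queries = new ArrayList<>();
    for (int i = 0; i < numSplits; i++) {
      BigInteger start = MIN.add(span.multiply(BigInteger.valueOf(i)).divide(n));
      BigInteger end =
          (i == numSplits - 1)
              ? MAX
              : MIN.add(span.multiply(BigInteger.valueOf(i + 1)).divide(n));
      // The first range includes its lower bound so the minimum token is covered;
      // later ranges exclude it because the previous range already ends there.
      String lowerOp = (i == 0) ? ">=" : ">";
      queries.add(
          String.format(
              "SELECT * FROM %s.%s WHERE token(%s) %s %d AND token(%s) <= %d",
              keyspace, table, partitionKey, lowerOp, start, partitionKey, end));
    }
    return queries;
  }

  public static void main(String[] args) {
    // Placeholder keyspace, table, and key names.
    for (String q : queriesFor("my_keyspace", "my_table", "id", 4)) {
      System.out.println(q);
    }
  }
}
```

Each generated query covers one slice of the ring, so the slices can be
read in parallel and together cover the whole table.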

I often need to run a large number of generated queries (sometimes a
million or more) against a subset of a table.  For that use case it is
easier and much more performant to supply a collection of queries along
with their tokens, which are used both to partition and to group the
queries, than to let CassandraIO naively run over the entire table or
apply a simple filter.

I propose that, in addition to the current method of supplying a table and
filter, the user also be allowed to pass in a collection of queries and
tokens.  The way CassandraSource currently breaks up the table could then
be rebuilt on top of the proposed implementation, which would also reduce
code duplication.  If this sounds like an acceptable alternative way of
using the CassandraIO connector, I don't mind giving it a shot with a pull
request.
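
A rough sketch of the grouping step in the proposal, in plain Java with no
Beam or Cassandra driver dependencies. Everything here is hypothetical:
TokenedQuery, groupByRange, and the idea of passing range lower bounds as a
list are illustrative assumptions, not an existing or agreed CassandraIO
API. Each user-supplied query carries the token of the partition it
targets, and queries are bucketed by the token range that token falls into.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of grouping user-supplied queries by token range.
final class QueryGrouping {
  /** A CQL statement paired with the token of the partition it reads. */
  static final class TokenedQuery {
    final String cql;
    final long token;

    TokenedQuery(String cql, long token) {
      this.cql = cql;
      this.token = token;
    }
  }

  /**
   * Buckets queries by range. rangeStarts holds the inclusive lower bound of
   * each range and must contain Long.MIN_VALUE so every token has a bucket.
   */
  static Map<Long, List<TokenedQuery>> groupByRange(
      List<Long> rangeStarts, List<TokenedQuery> queries) {
    TreeMap<Long, List<TokenedQuery>> groups = new TreeMap<>();
    for (Long start : rangeStarts) {
      groups.put(start, new ArrayList<>());
    }
    if (!groups.containsKey(Long.MIN_VALUE)) {
      throw new IllegalArgumentException("rangeStarts must include Long.MIN_VALUE");
    }
    for (TokenedQuery q : queries) {
      // floorEntry finds the range with the greatest lower bound <= token.
      groups.floorEntry(q.token).getValue().add(q);
    }
    return groups;
  }
}
```

Each bucket could then be handed to one worker, so queries that hit the
same part of the ring are executed together.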

If there is a better way of doing this, I'm eager to hear and learn.
Thanks for reading!

Re: Feature addition to java CassandraIO connector

Posted by Pablo Estrada <pa...@google.com>.
Hi Vincent!
Do you think you could add some code snippets / pseudocode showing what
this would look like? Feel free to do it by email, gist, Google Doc, etc.
Best
-P.
