Posted to dev@beam.apache.org by Vincent Marquez <vi...@gmail.com> on 2019/11/07 23:14:10 UTC

Re: Feature addition to java CassandraIO connector

Thanks for the response, Pablo. We're currently using our own custom ParDo
connector for Cassandra (specialized to Scylla's sharding algorithm) that
has a 'readAll' type option, and we're getting great results. Would you be
up for taking an outside contribution that refactors the current CassandraIO
connector to be of the PTransform/ParDo kind? I'm happy to give it a shot
in the next week or so and send a PR on GitHub. My username is vmarquez on
both ASF and GitHub. I'm also fine with writing up a JIRA describing how I'd
want the more flexible connector to look.
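
For illustration only: a minimal plain-Java sketch of the input a
'readAll'-style connector could take, namely a collection of queries, each
paired with a token, grouped into buckets that could then be processed
independently. This is not the code from the custom connector or from the
gist. The names TokenQuery, bucketFor and groupByBucket are hypothetical, the
token is assumed to be a signed 64-bit Murmur3-style value supplied by the
caller, and no Beam or Cassandra driver calls are made.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupQueriesByToken {
    /** Hypothetical pairing of a CQL string with the token of the partition it targets. */
    static final class TokenQuery {
        final String cql;
        final long token;

        TokenQuery(String cql, long token) {
            this.cql = cql;
            this.token = token;
        }
    }

    /** Maps a signed 64-bit token onto one of `buckets` equal-width slices of the ring. */
    static int bucketFor(long token, int buckets) {
        if (buckets <= 1) {
            return 0;
        }
        // Flip the sign bit so that Long.MIN_VALUE becomes unsigned 0 and
        // Long.MAX_VALUE becomes unsigned 2^64 - 1.
        long unsigned = token ^ Long.MIN_VALUE;
        // Slice width, rounded up so the largest token still lands in the last bucket.
        long width = Long.divideUnsigned(-1L, buckets) + 1;
        return (int) Long.divideUnsigned(unsigned, width);
    }

    /** Groups the supplied queries by bucket, preserving input order within each bucket. */
    static Map<Integer, List<String>> groupByBucket(List<TokenQuery> queries, int buckets) {
        Map<Integer, List<String>> grouped = new TreeMap<>();
        for (TokenQuery q : queries) {
            grouped.computeIfAbsent(bucketFor(q.token, buckets), k -> new ArrayList<>())
                   .add(q.cql);
        }
        return grouped;
    }
}
```

In a ParDo-based connector, each bucket (or each individual query) would be
an element flowing through the pipeline. The equal-width bucketing here is a
simplifying assumption; a real connector would more likely follow the
cluster's actual token ownership.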


--Vincent

On Wed, Oct 16, 2019 at 11:20 AM Pablo Estrada <pa...@google.com> wrote:

> Hi Vincent,
> I think it makes sense to have some sort of `readAll` for CassandraIO that
> can receive multiple queries, and execute each one of them. This would also
> be consistent with other IOs that we have such as FileIOs.
> I suspect that doing this may require rearchitecting the whole IO from a
> BoundedSource-based one to a ParDo-based one - so a large change; and we'd
> need to make sure that we don't lose scalability due to that change.
>
> Adding Ismael/JB/Etienne who've done a lot of the work on CassandraIO.
> Thoughts?
> -P.
>
>
> On Mon, Oct 14, 2019 at 3:32 PM Vincent Marquez <vi...@gmail.com>
> wrote:
>
>> Hello Pablo, thank you for the response, and apologies for the delay.  I
>> had some work and also wanted to prove out what I was proposing with our
>> own code at my workplace.
>>
>> Here is a small gist of what I'm proposing.
>>
>> https://gist.github.com/vmarquez/204b8f44b1279fdbae97b40f8681bc25
>>
>> I'm happy to explain more or even write up an official design doc if you
>> think that would be helpful in explaining things.
>>
>> --Vincent
>>
>> On 2019/10/04 18:03:23, Pablo Estrada <p....@google.com> wrote:
>> > Hi Vincent!
>> > Do you think you could add some code snippets / pseudocode as to what
>> > this looks like? Feel free to do it on email, gist, google doc, etc?
>> > Best
>> > -P.
>> >
>> > On Thu, Oct 3, 2019 at 4:16 PM Vincent Marquez <vi...@gmail.com>
>> > wrote:
>> >
>> > > Currently the CassandraIO connector allows a user to specify a table,
>> > > and the CassandraSource object generates a list of queries based on
>> > > token ranges of the table, along with grouping them by the token
>> > > ranges.
>> > >
>> > > I often need to run (generated, sometimes a million+) queries against
>> > > a subset of a table.  Instead of providing a filter, it is easier and
>> > > much more performant to supply a collection of queries along with
>> > > their tokens to both partition and group by, instead of letting
>> > > CassandraIO naively run over the entire table or with a simple filter.
>> > >
>> > > I propose in addition to the current method of supplying a table and
>> > > filter, also allowing the user to pass in a collection of queries and
>> > > tokens.   The current way CassandraSource breaks up the table could be
>> > > modified to build on top of the proposed implementation to reduce code
>> > > duplication as well.  If this sounds like an acceptable alternative
>> > > way of using the CassandraIO connector, I don't mind giving it a shot
>> > > with a pull request.
>> > >
>> > > If there is a better way of doing this, I'm eager to hear and learn.
>> > > Thanks for reading!
>> > >
>> >
>
>

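The existing behaviour described at the bottom of the quoted thread
(generating one query per token range of a table) can be sketched roughly as
follows. This is an illustrative plain-Java sketch, not the actual
CassandraSource code. It assumes the Murmur3Partitioner token space of
-2^63 to 2^63 - 1 and splits it into equal-width ranges. The class and method
names, and the table and key names in the usage note, are hypothetical.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class TokenRangeQueries {
    // Assumed token bounds (Murmur3Partitioner uses signed 64-bit tokens).
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    /** Builds one CQL query per equal-width slice of the token ring. */
    static List<String> splitQueries(String table, String partitionKey, int splits) {
        BigInteger span = MAX.subtract(MIN);
        BigInteger n = BigInteger.valueOf(splits);
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < splits; i++) {
            boolean last = (i == splits - 1);
            BigInteger start = MIN.add(span.multiply(BigInteger.valueOf(i)).divide(n));
            BigInteger end = last
                ? MAX
                : MIN.add(span.multiply(BigInteger.valueOf(i + 1)).divide(n));
            // Ranges are half-open except the last, which must include MAX.
            String upperOp = last ? "<=" : "<";
            queries.add(String.format(
                "SELECT * FROM %s WHERE token(%s) >= %s AND token(%s) %s %s",
                table, partitionKey, start, partitionKey, upperOp, end));
        }
        return queries;
    }
}
```

For example, splitQueries("ks.events", "id", 2) yields two queries that
together cover the whole ring. Under the proposal in the thread, a list like
this would be one possible input to the connector rather than something only
the connector can generate internally.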
-- 
*-Vincent*