You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@beam.apache.org by "Ismaël Mejía (Jira)" <ji...@apache.org> on 2020/11/26 21:14:00 UTC
[jira] [Updated] (BEAM-9008) Add readAll() method to CassandraIO
[ https://issues.apache.org/jira/browse/BEAM-9008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ismaël Mejía updated BEAM-9008:
-------------------------------
Fix Version/s: 2.27.0
> Add readAll() method to CassandraIO
> -----------------------------------
>
> Key: BEAM-9008
> URL: https://issues.apache.org/jira/browse/BEAM-9008
> Project: Beam
> Issue Type: New Feature
> Components: io-java-cassandra
> Affects Versions: 2.16.0
> Reporter: vincent marquez
> Assignee: vincent marquez
> Priority: P3
> Fix For: 2.27.0
>
> Time Spent: 15h 40m
> Remaining Estimate: 0h
>
> When querying a large cassandra database, it's often *much* more useful to programatically generate the queries needed to to be run rather than reading all partitions and attempting some filtering.
> As an example:
> {code:java}
> public class Event {
> @PartitionKey(0) public UUID accountId;
> @PartitionKey(1)public String yearMonthDay;
> @ClusteringKey public UUID eventId;
> //other data...
> }{code}
> If there is ten years worth of data, you may want to only query one year's worth. Here each token range would represent one 'token' but all events for the day.
> {code:java}
> Set<UUID> accounts = getRelevantAccounts();
> Set<String> dateRange = generateDateRange("2018-01-01", "2019-01-01");
> PCollection<TokenRange> tokens = generateTokens(accounts, dateRange);
> {code}
>
> I propose an additional _readAll()_ PTransform that can take a PCollection of token ranges and can return a PCollection<T> of what the query would return.
> *Question: How much code should be in common between both methods?*
> Currently the read connector already groups all partitions into a List of Token Ranges, so it would be simple to refactor the current read() based method to a 'ParDo' based one and have them both share the same function. Reasons against sharing code between read and readAll
> * Not having the read based method return a BoundedSource connector would mean losing the ability to know the size of the data returned
> * Currently the CassandraReader executes all the grouped TokenRange queries *asynchronously* which is (maybe?) fine when all that's happening is splitting up all the partition ranges but terrible for executing potentially millions of queries.
> Reasons _for_ sharing code would be simplified code base and that both of the above issues would most likely have a negligable performance impact.
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)