Posted to user@cassandra.apache.org by Adam Smith <ad...@gmail.com> on 2017/09/18 03:07:07 UTC

Wide rows splitting

Dear community,

I have a table with inlinks to URLs, i.e. many URLs point to
http://google.com, fewer URLs point to http://somesmallweb.page.

It has very wide and very skinny rows; the distribution follows a
power law. I do not know a priori how many columns a row has. Also, I can't
identify a schema that would introduce a good partitioning.

Currently, I am thinking about introducing splits: the pk would be (URL,
splitnumber), where the number of splits is initially 1 and hash(URL) mod
the number of splits determines the splitnumber on insert. I would need a
separate table to maintain the number of splits, and a spark-cassandra-connector
job that counts the columns and increases/doubles the number of splits on
demand. This means that I would have to move rows, e.g. (URL1,0) -> (URL1,1),
once the number of splits becomes 2.
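
A minimal sketch of the split assignment described above, in Python. It
assumes the hash is taken over the linking (source) URL, since hashing the
target URL alone would put every inlink of a URL into the same split. The
names split_index, partition_key and read_keys are hypothetical, and the
lookup of the current number of splits (the separate table) is left out.

```python
import hashlib


def split_index(source_url: str, num_splits: int) -> int:
    """Return a stable split index in [0, num_splits) for one inlink.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and would not be stable across writers.
    """
    digest = hashlib.md5(source_url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_splits


def partition_key(target_url: str, source_url: str, num_splits: int) -> tuple:
    """Composite partition key (URL, splitnumber) used on insert."""
    return (target_url, split_index(source_url, num_splits))


def read_keys(target_url: str, num_splits: int) -> list:
    """All partition keys that must be queried to read every inlink."""
    return [(target_url, i) for i in range(num_splits)]
```

One property of doubling under this assumption: when the number of splits
goes from n to 2n, an inlink in split i either stays in split i or moves to
split i + n, so the migration job only ever moves rows into the new splits.
Reads have to fan out over every key returned by read_keys.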

Would you do the same? Is there a better way?

Thanks!
Adam

Re: Wide rows splitting

Posted by Stefano Ortolani <os...@gmail.com>.

You might find this interesting:
https://medium.com/@foundev/synthetic-sharding-in-cassandra-to-deal-with-large-partitions-2124b2fd788b

Cheers,
Stefano
