Posted to user@cassandra.apache.org by Mark Lewis <ma...@lewisworld.org> on 2015/10/17 17:30:21 UTC

Advice for asymmetric reporting cluster architecture

I've got an existing C* cluster spread across three data centers, and I'm
wrestling with how to add some support for ad-hoc user reporting against
(ideally) near real-time data.

The type of reports I want to support basically boil down to allowing the
user to select a single highly-denormalized "Table" from a predefined list,
pick some filters (ideally with arbitrary boolean logic), project out some
columns, and allow for some simple grouping and aggregation.  I've seen
several companies expose reporting this way and it seems like a good way to
avoid the complexity of joins while still providing a good deal of
flexibility.
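A minimal sketch of the query-builder shape this report model implies, in Python. All names here (ReportSpec, the sample table) are hypothetical, and filters are combined with AND only for brevity; arbitrary boolean logic would need a small expression tree on top of this:

```python
# Translate a user-defined report spec into a single-table query string.
# Works the same whether the backend is PostgreSQL or Cassandra, since
# everything stays on one denormalized table (no joins).
from dataclasses import dataclass, field

@dataclass
class ReportSpec:
    table: str                                    # one table from the predefined list
    columns: list                                 # projected columns
    filters: list = field(default_factory=list)   # (column, operator) pairs, AND-ed
    group_by: list = field(default_factory=list)
    aggregate: str = ""                           # e.g. "count(*)"

def build_query(spec: ReportSpec) -> str:
    select = spec.columns + ([spec.aggregate] if spec.aggregate else [])
    q = f"SELECT {', '.join(select)} FROM {spec.table}"
    if spec.filters:
        # Bound with "?" placeholders so user input is never spliced in directly.
        q += " WHERE " + " AND ".join(f"{col} {op} ?" for col, op in spec.filters)
    if spec.group_by:
        q += " GROUP BY " + ", ".join(spec.group_by)
    return q
```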

Has anybody done this or have any recommendations?

My current thinking is that I'd like to have the ad-hoc reporting
infrastructure in separate data centers from our active production
OLTP-type stuff, both to isolate any load away from the OLTP infrastructure
and also because I'll likely need other stuff there (Spark?) to support
ad-hoc reporting.

So I basically have two problems:
(1) Get an eventually-consistent view of the data into a data center I can
query against relatively quickly (so no big batch imports)
(2) Be able to run ad-hoc user queries against it

If I just think about query flexibility, I might consider dumping data into
PostgreSQL nodes (practical because the data that any individual user can
query will fit onto a single node).  But then I have the problem of getting
the data there; I looked into an architecture using Kafka to pump data from
the OLTP data centers to PostgreSQL mirrors, but down that road lies the
need to manually deal with the eventual consistency.  Ugh.
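The "manually deal with the eventual consistency" part mostly means making the consumer idempotent and tolerant of out-of-order delivery. A hypothetical last-write-wins sketch of that merge rule (the event shape and in-memory store are invented; against PostgreSQL this would be an INSERT ... ON CONFLICT guarded by a timestamp comparison):

```python
# Apply a change event to the reporting store only if it is newer than what
# the store already holds. Replays and late arrivals from the three OLTP
# data centers then become harmless no-ops instead of corrupting the mirror.

def apply_event(store: dict, event: dict) -> bool:
    """Merge one change event; return True if it was applied, False if stale."""
    key = event["key"]
    current = store.get(key)
    if current is not None and current["ts"] >= event["ts"]:
        return False                     # duplicate or out-of-order: ignore
    store[key] = {"ts": event["ts"], "row": event["row"]}
    return True
```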

If I just run C* nodes in my reporting cluster, that makes the problem of
getting the data into the right place with eventual consistency easy to
solve and I like that idea quite a lot, but then I need to run reporting
against C*.  I could make the queries I need to run reasonably performant
with enough secondary-indexes or materialized views (we're upgrading to 3.0
soon), but I would need a lot of secondary-indexes and materialized views,
and I'd rather not pay to store them in all of my data centers.  I wish
there were a way to define secondary-indexes or materialized views to only
exist in one DC of a cluster, but unless I've missed something it doesn't
look possible.
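For reference, a materialized view in 3.0 is defined per keyspace and replicated everywhere that keyspace is, which is exactly the cost in question. A hypothetical example (keyspace, table, and column names invented):

```sql
-- CQL, Cassandra 3.0+: a view keyed for one report filter. It will be
-- built and stored in every data center the keyspace replicates to;
-- there is no per-DC option on the view itself.
CREATE MATERIALIZED VIEW reports.orders_by_status AS
    SELECT * FROM reports.orders
    WHERE status IS NOT NULL AND order_id IS NOT NULL
    PRIMARY KEY (status, order_id);
```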

Any advice or case studies in this area would be greatly appreciated.

-- Mark

Re: Advice for asymmetric reporting cluster architecture

Posted by Ryan Svihla <rs...@foundev.pro>.
Don't forget SSDs for indexing joy and a reasonable amount of CPU, or those indexes will fall very far behind.
If you size the hardware correctly and avoid very silly configuration it works really well for this sort of purpose, especially when combined with Spark to do any hardcore analysis on the filtered dataset.

- Ryan Svihla

Re: Advice for asymmetric reporting cluster architecture

Posted by Jack Krupansky <ja...@gmail.com>.
Yes, you can have all your normal data centers with DSE configured for
real-time data access and then have a data center that shares the same data
but has DSE Search (Solr indexing) enabled. Your Cassandra data will get
replicated to the Search data center and then indexed there and only there.
You do need to have more RAM on the DSE Search nodes for the indexing, and
maybe more nodes as well to assure decent latency for complex queries.

-- Jack Krupansky
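For illustration, a hypothetical query against such a Search data center using DSE's solr_query CQL extension (table and field names invented):

```sql
-- DSE Search only: the Solr index exists solely on the Search-enabled DC,
-- while the table itself is ordinary replicated Cassandra data.
SELECT region, total FROM reports.orders
WHERE solr_query = '{"q": "status:shipped AND region:emea"}';
```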

On Sat, Oct 17, 2015 at 3:54 PM, Mark Lewis <ma...@lewisworld.org> wrote:

> I hadn't considered it because I didn't think it could be configured just
> for a single data center; can it?
> On Oct 17, 2015 8:50 AM, "Jack Krupansky" <ja...@gmail.com>
> wrote:
>
>> Did you consider DSE Search in a DC?
>>
>> -- Jack Krupansky

Re: Advice for asymmetric reporting cluster architecture

Posted by Mark Lewis <ma...@lewisworld.org>.
I hadn't considered it because I didn't think it could be configured just
for a single data center; can it?
On Oct 17, 2015 8:50 AM, "Jack Krupansky" <ja...@gmail.com> wrote:

> Did you consider DSE Search in a DC?
>
> -- Jack Krupansky

Re: Advice for asymmetric reporting cluster architecture

Posted by Jack Krupansky <ja...@gmail.com>.
Did you consider DSE Search in a DC?

-- Jack Krupansky
