Posted to user@cassandra.apache.org by Richard Grossman <ri...@gmail.com> on 2010/01/07 12:16:01 UTC

Advise for choice

Hi,

This message is a little different from a support request.
I'm facing a situation where people want to replace Cassandra with a Solr
server. I really think that our problem is a great case for Cassandra, but I
need more arguments.

So please, if you have some time, share some ideas on why to use Cassandra
instead of Solr.

Thanks for any help

Richard

Re: Advise for choice

Posted by scott w <sc...@gmail.com>.
Good point, although there has been very recent work integrating Solr with
Katta, so you can have your cake and eat it too:

http://developer.yahoo.net/blogs/theater/archives/2009/12/hadoop_bay_area_user_group_session_1.html


On Fri, Jan 8, 2010 at 1:09 AM, Erich Nachbar <er...@nachbar.biz> wrote:

> I can give you a few more data points. For one of my last projects, I
> built the search index of one of the largest IM aggregators. I got
> around 2.5k chat msg/s, keeping 400M messages in my index.
>
> I looked at Solr and while it is very convenient/luxurious, there was
> no way in hell I could scale it this big. I ended up using Katta to
> serve the index with Hadoop to compute my index shards.
>
> While the whole system is batch oriented, I got my latency down to
> 2min (time for a doc to show up in the index), if I got less than 8k
> chat messages/s in.
>
> Katta handles replication and node failover (uses Zookeeper) and can
> be scaled easily by adding nodes & increasing the replication factor.
> In comparison to Solr, scale was not one of the things I had to worry.
>
> Like others have said, unless you provide a lot more specifics it will
> be hard to give you detailed recommendations.
>
> Hope this help!
> -Erich

Re: Advise for choice

Posted by Erich Nachbar <er...@nachbar.biz>.
I can give you a few more data points. For one of my last projects, I
built the search index of one of the largest IM aggregators. I got
around 2.5k chat msg/s, keeping 400M messages in my index.

I looked at Solr and while it is very convenient/luxurious, there was
no way in hell I could scale it this big. I ended up using Katta to
serve the index with Hadoop to compute my index shards.

While the whole system is batch oriented, I got my latency down to
2 min (the time for a doc to show up in the index), as long as fewer
than 8k chat messages/s were coming in.

Katta handles replication and node failover (uses Zookeeper) and can
be scaled easily by adding nodes & increasing the replication factor.
In comparison to Solr, scale was not one of the things I had to worry about.

Like others have said, unless you provide a lot more specifics it will
be hard to give you detailed recommendations.

Hope this helps!
-Erich

On Thu, Jan 7, 2010 at 11:31 PM, Richard Grossman <ri...@gmail.com> wrote:
> First Thanks to all your answer it's help to really check  all the aspects.
>
> In fact the system we want to build have to manage a lot of data but not in
> an heavy transactional way. Solr can handle the data but doesn't have
> the distributed way to serve it. But it's always possible to just duplicate
> the data in my case. then we can load balancing the queries between multiple
> instance server.
>
> We load a large set of data once a week and that all this data are going to
> be used as his without modification or update or delete. In this point load
> the data into Solr is very easy because we make a csv file and that's it
> it's inside.
>
> The data need to be structured but not like a relational database. Obviously
> Solr doesn't fit the data structure required. it force us to de-normalize a
> lot of data and build like a very very big table it's force us also to build
> very difficult lucene query.
>
> The speed to query for data is critical cause the application is internet
> oriented we hope a lot of queries / minutes. With this point the problem is
> that with the same amount of data Solr have been faster than cassandra but
> of course the data structure is not the same.
>
> It seems by the end we'll go as Tatu tell to have an hybrid solution mixing
> Solr and Cassandra. I'm not sure its the best in our case
> Thanks

Re: Advise for choice

Posted by Richard Grossman <ri...@gmail.com>.
First, thanks for all your answers; they really help to check all the aspects.


   - In fact, the system we want to build has to manage a lot of data, but
   not in a heavily transactional way. Solr can handle the data but doesn't
   have a distributed way to serve it. In my case, though, it's always
   possible to just duplicate the data; then we can load-balance the queries
   between multiple server instances.


   - We load a large set of data once a week, and all of this data is going
   to be used as is, without modification, update or delete. On this point,
   loading the data into Solr is very easy: we make a CSV file and that's
   it, it's in.


   - The data needs to be structured, but not like a relational
   database. Obviously Solr doesn't fit the required data structure. It
   forces us to denormalize a lot of data and build what amounts to one very,
   very big table, and it also forces us to build very difficult Lucene
   queries.


   - Query speed is critical because the application is internet oriented
   and we expect a lot of queries per minute. On this point, the problem is
   that with the same amount of data Solr has been faster than Cassandra,
   but of course the data structure is not the same.
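
To make the "duplicate the data and load-balance the queries" idea from the
first point concrete, here is a minimal Python sketch of round-robin query
distribution over identical copies. It is only an illustration under
assumptions: RoundRobinReplicas is a hypothetical name, and each replica is
modelled as a plain callable rather than a real Solr client.

```python
import itertools


class RoundRobinReplicas:
    """Hypothetical helper: spreads read queries over identical replicas."""

    def __init__(self, replicas):
        replicas = list(replicas)
        if not replicas:
            raise ValueError("at least one replica is required")
        # cycle() yields the replicas in order, forever.
        self._cycle = itertools.cycle(replicas)

    def query(self, q):
        # Every replica holds the same weekly data load, so any of them
        # can answer; simply take the next one in turn.
        replica = next(self._cycle)
        return replica(q)
```

This only works because the data is loaded once a week and never modified
in between; with frequent updates the copies would have to be kept in sync.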

It seems that in the end we'll go, as Tatu says, with a hybrid solution mixing
Solr and Cassandra. I'm not sure it's the best in our case.

Thanks

Re: Advise for choice

Posted by Tatu Saloranta <ts...@gmail.com>.
On Thu, Jan 7, 2010 at 10:43 AM, Nathan McCall <na...@vervewireless.com> wrote:
> Agreed that there is not much to go on here in the original question.
> I will say that we very recently found a good fit with Solr and
> Cassandra in how we deal with a very heavy write volume of news
> article data. Cassandra is excellent with write throughput and high
> availability, but our search use cases are with time-dependent news
> content, so we need lots of term proximity, faceting and ordering
> functionality.
>
> We probably could store everything in Solr, but the above approach
...

I think that in many (most?) cases, optimal solutions for searching
and lookups are different.

Traditionally this has meant that instead of trying to cram everything
in Oracle (or MySQL, Postgres) with its in-built
not-quite-as-good-as-Lucene text indexer, do the right thing and use
both: DB for storing data, for lookups, aggregates; and search index
for full-text searches. For some reason it seems a very unintuitive
notion to use two tools instead of one, when they have different sweet
spots.
And going forward, similar trade-offs are needed between 'traditional'
RDBMSs, newer distributed high-availability eventually consistent data
stores (with multiple variations, from simple-lookup to sorted access),
search index processing, and batch-oriented processing (Hadoop /
map/reduce).
Trying to do too many things using just one kind of tool tends to lead
to scalability and maintenance problems.
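
As a rough illustration of the "use both" split described above, the
following Python sketch keeps records in a key-value store and only the
searchable text in a separate index. Everything here is a toy stand-in:
RecordStore, TextIndex, save and find are hypothetical names, and the
in-memory dictionaries merely play the roles of the data store and the
search index.

```python
from collections import defaultdict


class RecordStore:
    """Stand-in for the primary data store: lookups by key only."""

    def __init__(self):
        self._rows = {}

    def put(self, key, row):
        self._rows[key] = row

    def get(self, key):
        return self._rows[key]


class TextIndex:
    """Stand-in for the search index: maps each term to record keys."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, key, text):
        for term in text.lower().split():
            self._postings[term].add(key)

    def search(self, query):
        terms = query.lower().split()
        if not terms:
            return set()
        # Keys that contain every query term (simple AND semantics).
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


def save(store, index, key, row, text_field):
    # The full record goes to the store; only the text goes to the index.
    store.put(key, row)
    index.add(key, row[text_field])


def find(store, index, query):
    # The index answers "which keys match?", the store returns the data.
    return [store.get(key) for key in sorted(index.search(query))]
```

The point of the split is that each side can then be scaled and tuned for
its own access pattern.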

I am actually trying to decide, for a similar case, which tools (from a
loose set of Cassandra, Lucene/Solr, Voldemort) to use to handle processing
of large amounts of data, and I'm pretty sure I will end up using more
than just one.

-+ Tatu +-

Re: Advise for choice

Posted by Ian Holsman <ia...@holsman.net>.
things positive for solr.
- mature and stable
- lots of documentation
- a swiss army knife and can be used for a LOT of things, especially if you are manipulating a lot of text.
- the query language is easier to use (imho.. but i've been using solr for years, so I am biased)
- lots of people know it
- fast caching
- faceting

cons for solr.
- hard to update a single field (you need to fetch & re-insert the entire row)
- commits/optimizes can slow things down to a crawl
- can't store structured data easily. (for example a blog post has tags which have both a key and a value).
- scalability isn't as easy as with cassandra. sharding works, but it requires a lot of manual effort
- it's easy to get started and get something running, but if you need to do something out of the ordinary, it gets hard fast. I think cassandra is more flexible for ordinary things that don't involve text-matching.
- replication isn't instant. (this is changing.. also look at zoie which may help).
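
To illustrate the first con in the list above (fetch and re-insert the
entire row to change one field), here is a small Python sketch of that
read-modify-write cycle. It does not use any Solr API; WholeDocumentIndex
and update_field are hypothetical names for a toy index that only accepts
complete documents.

```python
class WholeDocumentIndex:
    """Toy index that only accepts complete documents, never partial updates."""

    def __init__(self):
        self._docs = {}
        self.reindex_count = 0  # how many full documents were (re)indexed

    def add(self, doc):
        # Adding a document with an existing id replaces the old one entirely.
        self._docs[doc["id"]] = dict(doc)
        self.reindex_count += 1

    def get(self, doc_id):
        return dict(self._docs[doc_id])


def update_field(index, doc_id, field, value):
    # 1. fetch the whole document, 2. change one field, 3. re-insert it all.
    doc = index.get(doc_id)
    doc[field] = value
    index.add(doc)
    return doc
```

Changing one small field therefore costs a full re-index of that document,
which adds up when documents are large or updates are frequent.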

of course, if you tell us what you're trying to do, I can be more specific.
FWIW.. we use SOLR for some of our news-content (see love.com and newsrunner.com) and it works fast enough for us.
We have an incoming doc rate of about 8-10 news articles/second.

On Jan 8, 2010, at 5:43 AM, Nathan McCall wrote:

> Agreed that there is not much to go on here in the original question.
> I will say that we very recently found a good fit with Solr and
> Cassandra in how we deal with a very heavy write volume of news
> article data. Cassandra is excellent with write throughput and high
> availability, but our search use cases are with time-dependent news
> content, so we need lots of term proximity, faceting and ordering
> functionality.
> 
> We probably could store everything in Solr, but the above approach
> will allow us to make articles immediately available in a
> fault-tolerant manner while being able to efficiently send batches at
> regular intervals to Solr and therefore scale out our ingestion of
> news articles a little smoother. Full disclosure: I am still getting
> my head around the innards of Solr replication and clustering, but so
> far I feel like we made a good choice.
> 
> Hopefully the above will be helpful to folks during their evaluations.
> 
> Cheers,
> -Nate
> 
> 
> On Thu, Jan 7, 2010 at 10:02 AM, Joseph Bowman <bo...@gmail.com> wrote:
>> I have to agree with Tatu. If you're struggling to find reasons to validate
>> that Cassandra is the better choice for your task than Solr, then perhaps
>> Solr is the correct choice. I kind of went through the same thing recently,
>> struggled to make Cassandra fit what I was doing, then realized I was doing
>> it wrong and moved to MongoDB.
>> Cassandra is great at what it tries to accomplish, which is managing
>> gigantic datasets in a distributed way. The question is, is that really what
>> you need?
>> 
>> On Thu, Jan 7, 2010 at 12:58 PM, Tatu Saloranta <ts...@gmail.com>
>> wrote:
>>> 
>>> On Thu, Jan 7, 2010 at 3:16 AM, Richard Grossman <ri...@gmail.com>
>>> wrote:
>>>> Hi,
>>>> 
>>>> This message is little different than support.
>>>> I'm confronted to problem where people want to change Cassandra with
>>>> Solr
>>>> server. I really think that our problem is a great case for cassandra
>>>> but I
>>>> need more arguments.
>>>> 
>>>> So please if you've some time just put some idea why to use cassandra
>>>> instead solr.
>>> 
>>> Solution is generally applicable to a problem... so what is the (main) use
>>> case?
>>> 
>>> That would make it easier to find arguments for or against proposed
>>> solution.
>>> 
>>> -+ Tatu +-
>> 
>> 

--
Ian Holsman
Ian@Holsman.net




Re: Advise for choice

Posted by Nathan McCall <na...@vervewireless.com>.
Agreed that there is not much to go on here in the original question.
I will say that we very recently found a good fit with Solr and
Cassandra in how we deal with a very heavy write volume of news
article data. Cassandra is excellent with write throughput and high
availability, but our search use cases are with time-dependent news
content, so we need lots of term proximity, faceting and ordering
functionality.

We probably could store everything in Solr, but the above approach
will allow us to make articles immediately available in a
fault-tolerant manner while being able to efficiently send batches at
regular intervals to Solr and therefore scale out our ingestion of
news articles a little more smoothly. Full disclosure: I am still getting
my head around the innards of Solr replication and clustering, but so
far I feel like we made a good choice.
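
The pattern described above (write to the store right away, send batches
to the search index at intervals) could be sketched in Python roughly as
follows. This is not our actual code, just a minimal illustration under
assumptions: BufferedIngest is a hypothetical name, the store is any
dict-like object, and send_batch is whatever callable delivers a batch to
the index.

```python
class BufferedIngest:
    """Hypothetical sketch: immediate writes to a store, batched indexing."""

    def __init__(self, store, send_batch, max_pending=100):
        self.store = store            # dict-like primary store
        self.send_batch = send_batch  # callable taking a list of (key, doc)
        self.max_pending = max_pending
        self.pending = []

    def write(self, key, doc):
        # The document is readable from the store as soon as this returns.
        self.store[key] = doc
        # Indexing is deferred until the next flush.
        self.pending.append((key, doc))
        if len(self.pending) >= self.max_pending:
            self.flush()

    def flush(self):
        # Meant to be called at regular intervals, e.g. from a timer.
        if not self.pending:
            return 0
        batch, self.pending = self.pending, []
        self.send_batch(batch)
        return len(batch)
```

In this sketch flush() also runs when the buffer fills up, so a burst of
writes cannot grow the pending list without bound.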

Hopefully the above will be helpful to folks during their evaluations.

Cheers,
-Nate


On Thu, Jan 7, 2010 at 10:02 AM, Joseph Bowman <bo...@gmail.com> wrote:
> I have to agree with Tatu. If you're struggling to find reasons to validate
> that Cassandra is the better choice for your task than Solr, then perhaps
> Solr is the correct choice. I kind of went through the same thing recently,
> struggled to make Cassandra fit what I was doing, then realized I was doing
> it wrong and moved to MongoDB.
> Cassandra is great at what it tries to accomplish, which is managing
> gigantic datasets in a distributed way. The question is, is that really what
> you need?
>
> On Thu, Jan 7, 2010 at 12:58 PM, Tatu Saloranta <ts...@gmail.com>
> wrote:
>>
>> On Thu, Jan 7, 2010 at 3:16 AM, Richard Grossman <ri...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > This message is little different than support.
>> > I'm confronted to problem where people want to change Cassandra with
>> > Solr
>> > server. I really think that our problem is a great case for cassandra
>> > but I
>> > need more arguments.
>> >
>> > So please if you've some time just put some idea why to use cassandra
>> > instead solr.
>>
>> Solution is generally applicable to a problem... so what is the (main) use
>> case?
>>
>> That would make it easier to find arguments for or against proposed
>> solution.
>>
>> -+ Tatu +-
>
>

Re: Advise for choice

Posted by Joseph Bowman <bo...@gmail.com>.
I have to agree with Tatu. If you're struggling to find reasons to validate
that Cassandra is the better choice for your task than Solr, then perhaps
Solr is the correct choice. I kind of went through the same thing recently,
struggled to make Cassandra fit what I was doing, then realized I was doing
it wrong and moved to MongoDB.

Cassandra is great at what it tries to accomplish, which is managing
gigantic datasets in a distributed way. The question is, is that really what
you need?

On Thu, Jan 7, 2010 at 12:58 PM, Tatu Saloranta <ts...@gmail.com>wrote:

> On Thu, Jan 7, 2010 at 3:16 AM, Richard Grossman <ri...@gmail.com>
> wrote:
> > Hi,
> >
> > This message is little different than support.
> > I'm confronted to problem where people want to change Cassandra with Solr
> > server. I really think that our problem is a great case for cassandra but
> I
> > need more arguments.
> >
> > So please if you've some time just put some idea why to use cassandra
> > instead solr.
>
> Solution is generally applicable to a problem... so what is the (main) use
> case?
>
> That would make it easier to find arguments for or against proposed
> solution.
>
> -+ Tatu +-
>

Re: Advise for choice

Posted by Tatu Saloranta <ts...@gmail.com>.
On Thu, Jan 7, 2010 at 3:16 AM, Richard Grossman <ri...@gmail.com> wrote:
> Hi,
>
> This message is little different than support.
> I'm confronted to problem where people want to change Cassandra with Solr
> server. I really think that our problem is a great case for cassandra but I
> need more arguments.
>
> So please if you've some time just put some idea why to use cassandra
> instead solr.

A solution is generally applicable to a problem... so what is the (main) use case?

That would make it easier to find arguments for or against the proposed solution.

-+ Tatu +-