Posted to solr-user@lucene.apache.org by AJ Asver <aj...@scoopler.com> on 2010/02/04 02:41:53 UTC

Using solr to store data

Hi all,

I work on search at Scoopler.com, a real-time search engine that uses Solr.
We currently use Solr for indexing but then fetch data from our CouchDB
cluster using the IDs Solr returns.  We are now considering storing a larger
portion of the data in Solr's index itself so we don't have to hit the DB too.
Assuming that we still store the data in the DB (for backend and backup
purposes), are there any significant disadvantages to using Solr as a data
store too?

We currently run a master-slave setup on EC2 using x-large slave instances
to allow the disk cache to use as much memory as possible.  I imagine we
would definitely have to add more slave instances to accommodate the extra
data we're storing (and make sure it stays in memory).

Any tips would be really helpful.
--
AJ Asver
Co-founder, Scoopler.com

+44 (0) 7834 609830 / +1 (415) 670 9152
aj@scoopler.com


Follow me on Twitter: http://www.twitter.com/_aj
Add me on Linkedin: http://www.linkedin.com/in/ajasver
or YouNoodle: http://younoodle.com/people/ajmal_asver

My Blog: http://ajasver.com

Re: Using solr to store data

Posted by Lance Norskog <go...@gmail.com>.

Even if you're happy with disk sizes and with indexing and search
performance, there are still holes:

Solr updates whole documents, not individual fields, so when you have a
million documents that say "German" and should say "French", you have to
reindex a million documents.
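
A rough sketch of what that means in practice, using a plain Python dict
as a stand-in for the index. Nothing here talks to Solr, and the function
and field names are made up for illustration. The point is only that
correcting one field means rebuilding and re-adding every affected
document in full:

```python
# Toy stand-in for an index that can only add or replace whole documents.
# Nothing here talks to Solr; every name below is hypothetical.

def reindex_with_fix(index, source_docs, field, wrong, right):
    """Correct one field value everywhere; return how many documents were re-added."""
    reindexed = 0
    for doc_id, doc in source_docs.items():
        if doc.get(field) != wrong:
            continue
        fixed = dict(doc)          # the full document is needed, not just the changed field
        fixed[field] = right
        source_docs[doc_id] = fixed
        index[doc_id] = fixed      # whole-document replace, keyed by unique id
        reindexed += 1
    return reindexed


# Source of truth (for example a database) and the index built from it.
source_docs = {
    1: {"id": 1, "title": "doc one", "language": "German"},
    2: {"id": 2, "title": "doc two", "language": "German"},
    3: {"id": 3, "title": "doc three", "language": "English"},
}
index = {doc_id: dict(doc) for doc_id, doc in source_docs.items()}

count = reindex_with_fix(index, source_docs, "language", "German", "French")
print(count, index[1])
```

This also shows why it helps to keep the data in the DB as well: the full
document has to come from somewhere when it is re-added.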

There are no tools for managing distributed indexes, so you're on your own.

Distributed TF/IDF is coming, but will never be perfect. So managing
your own distributed relevance strategies is a must.

-- 
Lance Norskog
goksron@gmail.com

Re: Using solr to store data

Posted by Tim Underwood <ti...@gmail.com>.

We just switched over to storing our data directly in Solr as
compressed JSON fields at http://frugalmechanic.com.  So far it's
working out great.  Our detail pages (e.g.
http://frugalmechanic.com/auto-part/817453-33-2084-kn-high-performance-air-filter)
now make a single Solr request to grab the part data, pricing data,
and fitment data.  Before, we'd make a call to Solr and then probably
3-4 DB calls to load the data.

As Lance pointed out, the downside is that whenever any of our part
data changes we have to re-index the entire document.  So updating
pricing for some of our larger retailers means reindexing a large
portion of our dataset.  But that's the tradeoff we were willing to
make going into the change, and so far a daily re-index of the data
that takes 30-60 minutes isn't a big deal.  Later on we may split out
the data that changes frequently from the data that doesn't change
often.

We're working with about 2 million documents and our optimized index
files are currently at 3.2 GB.  Using compression on the large text
fields really helps keep the size down.
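
For anyone wondering what a compressed JSON field could look like, here
is a minimal sketch that does the compression on the client side with
Python's standard library. This is an assumption for illustration, not a
description of the frugalmechanic.com setup: the field name `data_json`,
the record layout, and the choice of zlib plus base64 are all made up.

```python
import base64
import json
import zlib


def pack(record):
    """Serialize a record to JSON, compress it, and return ASCII text for a stored field."""
    raw = json.dumps(record, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return base64.b64encode(zlib.compress(raw, 9)).decode("ascii")


def unpack(blob):
    """Reverse pack(): decode, decompress, and parse the JSON."""
    return json.loads(zlib.decompress(base64.b64decode(blob)).decode("utf-8"))


# Hypothetical record with repetitive fitment data, which compresses well.
part = {
    "id": "part-1",
    "pricing": [{"retailer": "example-retailer", "price": 49.99}],
    "fitment": [
        {"make": "ExampleMake", "model": "ExampleModel", "year": year}
        for year in range(1990, 2010)
    ],
}

# Searchable fields would be indexed separately; the blob is only stored.
doc = {"id": part["id"], "data_json": pack(part)}

plain_size = len(json.dumps(part))
packed_size = len(doc["data_json"])
print(plain_size, packed_size)
```

A detail page can then load one document and call `unpack()` on the
stored field instead of making several DB calls.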

-Tim

Re: Using solr to store data

Posted by Tommy Chheng <to...@gmail.com>.

Hey AJ,
For simplicity's sake, I am using Solr to serve as both storage and search
for http://researchwatch.net.
The dataset is 110K NSF grants from 1999 to 2009. The faceting is all on
dynamic fields, and I use a catch-all to copy all fields to a default
text field. All fields are also stored and used for the individual grant view.
The performance seems fine for my purposes, but I haven't done any extensive
benchmarking with it. The site was built using a light RoR/rsolr layer
on a small EC2 instance.

Feel free to bang against the site with jmeter if you want to stress 
test a sample server to failure.  :)

--
Tommy Chheng
Developer & UC Irvine Graduate Student
http://tommy.chheng.com
