Posted to solr-user@lucene.apache.org by Tracey Jaquith <tr...@archive.org> on 2007/02/01 22:35:20 UTC

Re: INTERNET ARCHIVE goes SOLR!

Yes, any of our search bars on our site will use Solr.
So your example is using Solr.  8-)


Otis Gospodnetic wrote:
> Hi Tracey,
>
> Thanks for sharing.  Which search exactly is powered by Solr now?
> http://www.archive.org/search.php?query=middlebury for example?
>
> Thanks,
> Otis
>
> ----- Original Message ----
> From: Tracey Jaquith <tr...@archive.org>
> To: solr-user@lucene.apache.org
> Sent: Sunday, January 28, 2007 5:12:44 AM
> Subject: INTERNET ARCHIVE goes SOLR!
>
>
> Internet Archive on Monday afternoon switched over to SOLR!
>
>   We converted from a badly deteriorating "home grown" server that
>   was made up of java + jetty ( + rsync for replication) + an older
>   version of lucene.
>   I make some comparisons of SOLR vs. "prior" using "[]" notes below.
>
>   I parsed 2 days' worth of SOLR logs to determine:
>     Max queries/sec: 8.8
>     Avg queries/sec: 5.4
>     Number (re)indexed / day: 3372
>
>   Index size: 1.1gb [vs. 26gb]
>   Number of document fields searched on a quoted unqualified query:
>     5 [vs. 677] *
>
>   Horsepower:
>     one 4gb RAM dual core cpu 
>     [vs. three 4gb RAM dual core cpu (readers) and
>     one 8gb RAM 2 dual core cpus (writer)]
>
>   Solr hardly touches our disks, load avg stays around 0.5, typically.
>   "sar" shows we average 85% idle!
>   Solr seems quicker to respond, overall, and much more stable.
>   We can reindex our entire set of 575K items in about 2 hours
>      (where we are limited more by the "crawling" of our 190 servers
>      for XML than Solr).
>
>   With our current configuration, we can show index changes on our live
>   site in < 15 minutes (compared to our last SE, which could take 4+ hours).
>   Related to the above point, we commit every 15 minutes; we optimize
>   once/day, late at night.
>
>   * To be fair, Michael StAck (our greatest help for prior SE "life support")
>   has smartly pointed out that by making a smarter schema and strategy,
>   I could reduce the number of fields searched from 677 to 5, with the same
>   overall functionality.  Searching 677 fields on most queries was surely
>   part of the bucket of nails in the coffin of our prior SE.
>
>
>   [Some information and configuration]
>
>   We've done essentially no optimizing outside of focusing on a "smart"
>   schema.
>   We do query-time boosting (more on that follows).
>   We (presently) do not use replication.
>   We do (server-side) XSLT of output into our prior SE's XML format.
>   We don't use DisMax and (as of now) do not use faceting.
>   We override defaultOperator of "OR" to "AND".
>   We increased our commitLockTimeout to 5 minutes, and enabled unlockOnStartup.
>   We enabled useCompoundFile (for the index).
>   External to Solr, we use XSLT to transform our item XML into a
>   post-able form for Solr to (re)index.
>
>   And finally, the hardest part of converting to Solr:
>   I had to write a custom PHP front-end converter that takes our query strings,
>   parses the clauses and lucene syntax into pieces, and "expands" any clause
>   that does not search a specific field into our query-time boosting.
>   E.g., if someone were to look for "tracey pooh" on our site, we expand it to:
>   (title:"tracey pooh"^100 OR description:"tracey pooh"^15 OR
>    collection:"tracey pooh"^10 OR language:"tracey pooh"^10 OR
>    text:"tracey pooh"^1)
>   (but 'creator:"tracey pooh"' would pass to SOLR as is).
>
>   Lastly, a feelgood. All of Internet Archive's written code is open source,
>   as is *all* the third-party code we use!
>   So go SOLR and thank you SO much for keeping it open, keeping it real,
>   and for *saving our site*!
>   Thanks for the great mailing list and all the continual work, updating,
>   and thinking the Solr team continues to do.  We have all been greatly
>   impressed by this project and it has worked out better than we had hoped!
>
>   
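
For anyone curious what the query-time boosting expansion in the quoted message might look like, here is a rough sketch in Python (the real converter is PHP and is not shown in this thread). The field names and boosts come from the "tracey pooh" example; the names `BOOSTS`, `CLAUSE`, `expand_clause` and `expand_query` are made up for illustration, and the clause splitting is a simplifying assumption that ignores parentheses, ranges and other lucene syntax.

```python
import re

# Field boosts copied from the example in the quoted message.
BOOSTS = [("title", 100), ("description", 15), ("collection", 10),
          ("language", 10), ("text", 1)]

# Assumed clause shapes: field:"phrase", field:term, "phrase", or a bare term.
CLAUSE = re.compile(r'(?:(\w+):)?("[^"]*"|\S+)')
OPERATORS = {"AND", "OR", "NOT"}


def expand_clause(value):
    """Expand one unqualified clause into a boosted OR across all fields."""
    phrase = '"%s"' % value.strip('"')
    return "(" + " OR ".join(
        f"{field}:{phrase}^{boost}" for field, boost in BOOSTS) + ")"


def expand_query(query):
    """Pass fielded clauses and operators through; expand everything else."""
    parts = []
    for match in CLAUSE.finditer(query):
        field, value = match.group(1), match.group(2)
        if field is not None or value in OPERATORS:
            parts.append(match.group(0))
        else:
            parts.append(expand_clause(value))
    return " ".join(parts)


if __name__ == "__main__":
    print(expand_query('"tracey pooh"'))
    print(expand_query('creator:"tracey pooh"'))
```

With this sketch, the quoted query "tracey pooh" expands to the five-field boosted clause shown in the message, while creator:"tracey pooh" passes through unchanged.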

-- 
--Tracey Jaquith - http://www.archive.org/~tracey --