Posted to solr-user@lucene.apache.org by David Welton <da...@gmail.com> on 2007/09/17 12:06:22 UTC

largish test data set?

Hi,

I'm in the process of evaluating solr and sphinx, and have come to
realize that actually having a large data set to run them against
would be handy.  However, I'm pretty new to both systems, so thought
that perhaps asking around may produce something useful.

What *I* mean by largish is something that won't fit into memory - say
5 or 6 gigs, which is probably puny for some and huge for others.

BTW, I would also welcome any input from others who have done the
above comparison, although what we'll be using it for is specific
enough that of course I'll need to do my own testing.

Thanks!
-- 
David N. Welton
http://www.welton.it/davidw/

Re: largish test data set?

Posted by Grant Ingersoll <gs...@apache.org>.
You might be interested in the Lucene Java contrib/Benchmark task,  
which provides an indexing implementation of a download of Wikipedia  
(available at http://people.apache.org/~gsingers/wikipedia/)

It is pretty trivial to convert the indexing code to send add  
commands to Solr.
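
As a rough illustration of what "send add commands to Solr" can look like, here is a minimal Python sketch (not the contrib/Benchmark code itself). It builds Solr's XML `<add>` update message from plain dictionaries and POSTs it to the update handler. The URL, the helper names, and the field names (`id`, `title`) are assumptions for illustration; the fields must match whatever your schema.xml defines.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed location of the update handler; adjust host, port and path to your install.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_add_message(docs):
    """Serialize an iterable of {field_name: value} dicts as one <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="utf-8")


def post_update(body, url=SOLR_UPDATE_URL):
    """POST a raw XML update message (bytes) and return the response body."""
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "text/xml; charset=utf-8"}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


if __name__ == "__main__":
    # Hypothetical documents; in practice these would come from the Wikipedia dump.
    batch = [
        {"id": "1", "title": "Apache Lucene"},
        {"id": "2", "title": "Apache Solr"},
    ]
    post_update(build_add_message(batch))
    post_update(b"<commit/>")  # documents become searchable only after a commit
```

Sending documents in batches and committing once at the end, rather than per document, is usually much faster for a large load.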

HTH,
Grant

On Sep 17, 2007, at 6:06 AM, David Welton wrote:

> Hi,
>
> I'm in the process of evaluating solr and sphinx, and have come to
> realize that actually having a large data set to run them against
> would be handy.  However, I'm pretty new to both systems, so thought
> that perhaps asking around may produce something useful.
>
> What *I* mean by largish is something that won't fit into memory - say
> 5 or 6 gigs, which is probably puny for some and huge for others.
>
> BTW, I would also welcome any input from others who have done the
> above comparison, although what we'll be using it for is specific
> enough that of course I'll need to do my own testing.
>
> Thanks!
> -- 
> David N. Welton
> http://www.welton.it/davidw/



Re: largish test data set?

Posted by Karl Wettin <ka...@gmail.com>.
On 17 Sep 2007, at 12:06, David Welton wrote:

>
> I'm in the process of evaluating solr and sphinx, and have come to
> realize that actually having a large data set to run them against
> would be handy.  However, I'm pretty new to both systems, so thought
> that perhaps asking around may produce something useful.
>
> What *I* mean by largish is something that won't fit into memory - say
> 5 or 6 gigs, which is probably puny for some and huge for others.

IMDB is about 1.2GB of data:

<http://www.imdb.com/interfaces#plain>

You can extract real queries from the TPB data collection; it should
contain about 1M queries in the movie category:

<http://torrents.thepiratebay.org/3783572/db_dump_and_query_log_from_piratebay.org__summer_of_2006.3783572.TPB.torrent>


-- 
karl

Re: largish test data set?

Posted by Daniel Alheiros <Da...@bbc.co.uk>.
Hi Yonik.

Do you have any performance statistics about those changes?
Is it possible to upgrade to this new Lucene version using the Solr 1.2
stable version?

Regards,
Daniel


On 17/9/07 17:37, "Yonik Seeley" <yo...@apache.org> wrote:

> If you want to see what performance will be like on the next release,
> you could try upgrading Solr's internal version of lucene to trunk
> (current dev version)... there have been some fantastic improvements
> in indexing speed.
> 
> For query speed/throughput, Solr 1.2 or trunk should do fine.
> 
> -Yonik
> 
> On 9/17/07, David Welton <da...@gmail.com> wrote:
>> Hi,
>> 
>> I'm in the process of evaluating solr and sphinx, and have come to
>> realize that actually having a large data set to run them against
>> would be handy.  However, I'm pretty new to both systems, so thought
>> that perhaps asking around may produce something useful.
>> 
>> What *I* mean by largish is something that won't fit into memory - say
>> 5 or 6 gigs, which is probably puny for some and huge for others.
>> 
>> BTW, I would also welcome any input from others who have done the
>> above comparison, although what we'll be using it for is specific
>> enough that of course I'll need to do my own testing.
>> 
>> Thanks!
>> --
>> David N. Welton
>> http://www.welton.it/davidw/
>> 



Re: largish test data set?

Posted by Yonik Seeley <yo...@apache.org>.
If you want to see what performance will be like on the next release,
you could try upgrading Solr's internal version of lucene to trunk
(current dev version)... there have been some fantastic improvements
in indexing speed.

For query speed/throughput, Solr 1.2 or trunk should do fine.

-Yonik

On 9/17/07, David Welton <da...@gmail.com> wrote:
> Hi,
>
> I'm in the process of evaluating solr and sphinx, and have come to
> realize that actually having a large data set to run them against
> would be handy.  However, I'm pretty new to both systems, so thought
> that perhaps asking around may produce something useful.
>
> What *I* mean by largish is something that won't fit into memory - say
> 5 or 6 gigs, which is probably puny for some and huge for others.
>
> BTW, I would also welcome any input from others who have done the
> above comparison, although what we'll be using it for is specific
> enough that of course I'll need to do my own testing.
>
> Thanks!
> --
> David N. Welton
> http://www.welton.it/davidw/
>