Posted to solr-user@lucene.apache.org by Jason Rutherglen <ja...@gmail.com> on 2009/07/14 21:36:38 UTC

Wikipedia or reuters like index for testing facets?

Is there a standard index, like what Lucene uses in contrib/benchmark, for
executing faceted queries over? Or maybe we can randomly generate one that
works in conjunction with Wikipedia? That way we can execute real-world
queries against faceted data. Or we could use the Lucene/Solr mailing lists
and other data (à la Lucid's faceted site) as a standard index?

Re: Wikipedia or reuters like index for testing facets?

Posted by Grant Ingersoll <gs...@apache.org>.
It's only really effective if the number of tokens in the Sink is
expected to be significantly less than the full stream (my various tests
showed around < 50%, but YMMV), so it isn't likely useful for most
copy-field situations. For Solr to utilize it, the schema would have to
allow for giving ids to the various TokenFilters so that you could
identify the Tees and the Sinks. At least that was my first thought on it.

-Grant
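The tee/sink idea Grant describes can be sketched in plain Java. This is a conceptual illustration only, not the Lucene TeeSinkTokenFilter API: the point is that the text is tokenized once, every token reaches the primary field, and only a (hopefully small) filtered subset is diverted into the sink field. The class and token shapes here are invented for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Conceptual sketch (not the Lucene API): one tokenization pass feeds the
// main field with every token, while tokens matching a predicate are also
// "teed" into a sink field. This beats re-analyzing the text per copy field
// only when the sink keeps a small fraction of the stream.
public class TeeSinkSketch {
    final List<String> mainField = new ArrayList<>();
    final List<String> sinkField = new ArrayList<>();
    private final Predicate<String> sinkFilter;

    TeeSinkSketch(Predicate<String> sinkFilter) {
        this.sinkFilter = sinkFilter;
    }

    // Single pass over the token stream.
    void consume(List<String> tokens) {
        for (String t : tokens) {
            mainField.add(t);                     // every token reaches the main field
            if (sinkFilter.test(t)) {
                sinkField.add(t);                 // only matching tokens reach the sink
            }
        }
    }

    public static void main(String[] args) {
        TeeSinkSketch tee = new TeeSinkSketch(t -> t.startsWith("cat:"));
        tee.consume(Arrays.asList("lucene", "cat:search", "solr", "cat:java"));
        System.out.println(tee.mainField); // all four tokens
        System.out.println(tee.sinkField); // just the cat:-prefixed ones
    }
}
```

As Grant notes, if the sink filter passes most tokens through, you have gained little over a plain copyField that re-analyzes the source text.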

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Wikipedia or reuters like index for testing facets?

Posted by Jason Rutherglen <ja...@gmail.com>.
I saw the discussion about TeeSinkTokenFilter on java-user and was
wondering how Solr performs copy fields. Couldn't Solr by default
utilize a TeeSinkTokenFilter-like class for copying fields?

> That link is meant to be stable for benchmarking purposes within Lucene.

The fields are different?


Re: Wikipedia or reuters like index for testing facets?

Posted by Grant Ingersoll <gs...@apache.org>.
It's likely quite different.  That link is meant to be stable for
benchmarking purposes within Lucene.

Note, one thing I wish I had time for: hook Tee/Sink capabilities into
Solr such that one could use the WikipediaTokenizer and then Tee the
Categories, etc. off to separate fields automatically for faceting, etc.

-Grant



Re: Wikipedia or reuters like index for testing facets?

Posted by Jason Rutherglen <ja...@gmail.com>.
The question that comes to mind is how it's different from
http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2

Guess we'd need to download it and take a look!


Re: Wikipedia or reuters like index for testing facets?

Posted by Peter Wolanin <pe...@acquia.com>.
AWS provides some standard data sets, including an extract of all
Wikipedia content:

http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2345&categoryID=249

Looks like it's not being updated often, so this or another AWS data
set could be a consistent basis for benchmarking?

-Peter




-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia, Inc.
peter.wolanin@acquia.com

Re: Wikipedia or reuters like index for testing facets?

Posted by Jason Rutherglen <ja...@gmail.com>.
Yeah, that's what I was thinking of as an alternative: use enwiki and
randomly generate facet data along with it. However, for consistent
benchmarking the random data would need to stay the same, so that
people could execute the same benchmark consistently in their own
environment.
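The requirement Jason raises (identical "random" data in every environment) can be met without distributing the generated data at all: derive each document's facet values deterministically from a fixed seed plus the doc id. A minimal sketch, assuming a hypothetical field name and cardinality chosen just for illustration:

```java
import java.util.Random;

// Sketch: deterministic "random" facet assignment. Seeding from a fixed
// benchmark seed combined with the (field, docId) pair means every run in
// every environment produces identical facet data, so only the generator
// and seed need to be shared, not the data itself.
public class FacetDataGenerator {
    private final long seed;

    FacetDataGenerator(long seed) {
        this.seed = seed;
    }

    // Pick one of `uniques` possible values for this field/doc, reproducibly.
    String facetValue(String field, int docId, int uniques) {
        Random r = new Random(seed ^ (31L * field.hashCode() + docId));
        return field + "_" + r.nextInt(uniques);
    }

    public static void main(String[] args) {
        FacetDataGenerator gen = new FacetDataGenerator(42L);
        // Same (field, docId) always yields the same value, run after run.
        System.out.println(gen.facetValue("category", 0, 100));
        System.out.println(gen.facetValue("category", 1, 100));
    }
}
```

Because each value depends only on the seed and the (field, docId) pair, the assignment is stable even if documents are indexed in a different order.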


Re: Wikipedia or reuters like index for testing facets?

Posted by Mark Miller <ma...@gmail.com>.
Why don't you just randomly generate the facet data? That's probably the
best way, right? You can control the uniques and ranges.



--
- Mark

http://www.lucidimagination.com

Re: Wikipedia or reuters like index for testing facets?

Posted by Grant Ingersoll <gs...@apache.org>.
Probably not as generated by the EnwikiDocMaker, but the  
WikipediaTokenizer in Lucene can pull out richer syntax which could  
then be Teed/Sinked to other fields.  Things like categories, related  
links, etc.  Mostly, though, I was just commenting on the fact that it  
isn't hard to at least use it for getting docs into Solr.

-Grant
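The kind of markup Grant means can be illustrated with a plain-regex sketch: pulling `[[Category:...]]` links out of raw wiki text so they can be indexed into a separate facet field. This is a hypothetical stand-in for the purpose of the example, not how WikipediaTokenizer itself works internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch (not WikipediaTokenizer itself): extract [[Category:...]] links
// from raw wiki markup; the resulting values are natural candidates for a
// separate facet field.
public class CategoryExtractor {
    // Capture the category name, stopping at "]" or a "|" sort key.
    private static final Pattern CATEGORY =
        Pattern.compile("\\[\\[Category:([^\\]|]+)");

    static List<String> categories(String wikiText) {
        List<String> out = new ArrayList<>();
        Matcher m = CATEGORY.matcher(wikiText);
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Apache Lucene is a search library. "
            + "[[Category:Free software]] [[Category:Java libraries|Lucene]]";
        System.out.println(categories(text)); // [Free software, Java libraries]
    }
}
```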


Re: Wikipedia or reuters like index for testing facets?

Posted by Jason Rutherglen <ja...@gmail.com>.
You think enwiki has enough data for faceting?

On Tue, Jul 14, 2009 at 2:56 PM, Grant Ingersoll<gs...@apache.org> wrote:
> At a min, it is trivial to use the EnwikiDocMaker and then send the doc over
> SolrJ...
>
> On Jul 14, 2009, at 4:07 PM, Mark Miller wrote:
>
>> On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen <
>> jason.rutherglen@gmail.com> wrote:
>>
>>> Is there a standard index like what Lucene uses for contrib/benchmark for
>>> executing faceted queries over? Or maybe we can randomly generate one
>>> that
>>> works in conjunction with wikipedia? That way we can execute real world
>>> queries against faceted data. Or we could use the Lucene/Solr mailing
>>> lists
>>> and other data (ala Lucid's faceted site) as a standard index?
>>>
>>
>> I don't think there is any standard set of docs for solr testing - there
>> is
>> not a real benchmark contrib - though I know more than a few of us have
>> hacked up pieces of Lucene benchmark to work with Solr - I think I've done
>> it twice now ;)
>>
>> Would be nice to get things going. I was thinking the other day: I wonder
>> how hard it would be to make Lucene Benchmark generic enough to accept
>> Solr
>> impls and Solr algs?
>>
>> It does a lot that would suck to duplicate.
>>
>> --
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

Re: Wikipedia or reuters like index for testing facets?

Posted by Grant Ingersoll <gs...@apache.org>.
At a min, it is trivial to use the EnwikiDocMaker and then send the  
doc over SolrJ...
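A stripped-down, stdlib-only sketch of that pipeline: in the real setup, benchmark's EnwikiDocMaker would iterate the full dump and each doc would go to Solr via SolrJ's server.add(...); here a tiny dump-shaped XML fragment is parsed with the JDK's DOM parser, and plain field maps stand in for SolrInputDocument.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EnwikiToSolrSketch {
    // Turn an enwiki-dump-style fragment into title/body field maps.
    // These maps stand in for SolrJ's SolrInputDocument.
    public static List<Map<String, String>> parse(String xml) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        List<Map<String, String>> docs = new ArrayList<>();
        NodeList pages = dom.getElementsByTagName("page");
        for (int i = 0; i < pages.getLength(); i++) {
            Element page = (Element) pages.item(i);
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("title", text(page, "title"));
            doc.put("body", text(page, "text")); // <text> is nested under <revision>
            docs.add(doc);
        }
        return docs;
    }

    private static String text(Element parent, String tag) {
        NodeList n = parent.getElementsByTagName(tag);
        return n.getLength() > 0 ? n.item(0).getTextContent() : "";
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>Apache Lucene</title>"
                   + "<revision><text>Lucene is a search library.</text></revision>"
                   + "</page></mediawiki>";
        for (Map<String, String> d : parse(xml)) {
            System.out.println(d);
        }
    }
}
```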

On Jul 14, 2009, at 4:07 PM, Mark Miller wrote:

> On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen <
> jason.rutherglen@gmail.com> wrote:
>
>> Is there a standard index like what Lucene uses for contrib/ 
>> benchmark for
>> executing faceted queries over? Or maybe we can randomly generate  
>> one that
>> works in conjunction with wikipedia? That way we can execute real  
>> world
>> queries against faceted data. Or we could use the Lucene/Solr  
>> mailing lists
>> and other data (ala Lucid's faceted site) as a standard index?
>>
>
> I don't think there is any standard set of docs for solr testing -  
> there is
> not a real benchmark contrib - though I know more than a few of us  
> have
> hacked up pieces of Lucene benchmark to work with Solr - I think  
> I've done
> it twice now ;)
>
> Would be nice to get things going. I was thinking the other day: I  
> wonder
> how hard it would be to make Lucene Benchmark generic enough to  
> accept Solr
> impls and Solr algs?
>
> It does a lot that would suck to duplicate.
>
> -- 
> -- 
> - Mark
>
> http://www.lucidimagination.com

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Wikipedia or reuters like index for testing facets?

Posted by Mark Miller <ma...@gmail.com>.
On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> Is there a standard index like what Lucene uses for contrib/benchmark for
> executing faceted queries over? Or maybe we can randomly generate one that
> works in conjunction with wikipedia? That way we can execute real world
> queries against faceted data. Or we could use the Lucene/Solr mailing lists
> and other data (ala Lucid's faceted site) as a standard index?
>

I don't think there is any standard set of docs for solr testing - there is
not a real benchmark contrib - though I know more than a few of us have
hacked up pieces of Lucene benchmark to work with Solr - I think I've done
it twice now ;)

Would be nice to get things going. I was thinking the other day: I wonder
how hard it would be to make Lucene Benchmark generic enough to accept Solr
impls and Solr algs?

It does a lot that would suck to duplicate.

-- 
-- 
- Mark

http://www.lucidimagination.com

Re: Wikipedia or reuters like index for testing facets?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I have something that maybe could be made into one: http://uncorpora.org/

It is resolutions of the United Nations General Assembly in 6 official
languages aligned on a paragraph level in an XML (Translation Memory
eXchange) format. The 6 languages are: English, French, Spanish,
Arabic, Chinese, Russian.

Facets could be derived from already encoded information for:
1) Session number: 55-62
2) Committee number: 0-6
3) Operative/preambulatory phrase (for some of the paragraphs)
4) Resolution number (which is part of the record ID)
5) Cross-reference information that is embedded in the text, but is
marked off with XML tags

Markup and all, it is about 170 Mbytes between 6 languages.
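Deriving those facet fields at index time could look roughly like this. Note the record-id layout ("A/RES/&lt;session&gt;/&lt;resolution&gt;") is an illustrative assumption, not the corpus's documented format; the real data encodes session, committee, and resolution number in its own metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UnFacetSketch {
    // Split an (assumed) UN record id of the form "A/RES/<session>/<resolution>"
    // into facet field values, plus the language the record came from.
    public static Map<String, String> facets(String recordId, String language) {
        String[] parts = recordId.split("/");
        Map<String, String> f = new LinkedHashMap<>();
        f.put("session", parts[2]);     // e.g. 55-62
        f.put("resolution", parts[3]);  // resolution number, part of the record id
        f.put("language", language);    // one of the 6 official languages
        return f;
    }

    public static void main(String[] args) {
        System.out.println(facets("A/RES/55/123", "English"));
    }
}
```

Each map would become the facet fields of one Solr document, with the paragraph text in the body field.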

If that looks useful, I would be happy to work with more experienced
Solr users to beat it into the right shape.

Regards,
    Alex.

Personal blog: http://blog.outerthoughts.com/
Research group: http://www.clt.mq.edu.au/Research/
- I think age is a very high price to pay for maturity (Tom Stoppard)

On Tue, Jul 14, 2009 at 3:36 PM, Jason
Rutherglen<ja...@gmail.com> wrote:
> Is there a standard index like what Lucene uses for contrib/benchmark for
> executing faceted queries over? Or maybe we can randomly generate one that
> works in conjunction with wikipedia? That way we can execute real world
> queries against faceted data. Or we could use the Lucene/Solr mailing lists
> and other data (ala Lucid's faceted site) as a standard index?