Posted to user@lucenenet.apache.org by "M. Kaufmann" <ka...@hispeed.ch> on 2010/07/27 17:06:18 UTC

Re[4]: Counting occurrences with HitCollector

Hello Jokin / Ben,
First, thanks for the answer.
Sorry for the long response time; some other projects got in between.
I've set up the search with a subsearch:

- Collect all subcategories with a HitCollector (Hashtable)
- Search for the count of each subcategory within its category

Now the next problem is that the subcategory is a phrase (e.g. "English Books").

Depending on the main search, there can be up to 500 different subcategories per main category. The faceting search currently takes ~10-20 times as long as the main search.

Detailed pseudo code:
- search for occurring categories (results in counts per category -> Hashtable)
- search for subcategories with HitCollector (Hashtable of all subcategories)
- for each subcategory
      search BooleanQuery:
             TermQuery category - $category$ - Occur.MUST
             Parsed query subcategory - $subcategory$ - Occur.MUST
      get hits of result and add to DataTable
- create DataView for each category to get top 10 results (sorted by count)
...
main search for displaying results
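
One alternative to running a separate BooleanQuery per subcategory is to count (category, subcategory) pairs in a single pass over the hits and then pick the top entries with a priority queue. The following is only a minimal sketch of that idea, written in Python rather than .NET. It assumes each hit's category and subcategory values are already available in memory (for example from a field cache). The function names and the sample data are hypothetical and are not part of Lucene.Net.

```python
from collections import Counter, defaultdict
import heapq


def facet_counts(hits):
    """Count subcategory occurrences per main category in one pass over the hits."""
    counts = defaultdict(Counter)
    for category, subcategories in hits:
        # Each hit contributes one count per subcategory it carries.
        counts[category].update(subcategories)
    return counts


def top_subcategories(counts, n=10):
    """Return the n most frequent (subcategory, count) pairs for each category."""
    return {
        category: heapq.nlargest(n, subs.items(), key=lambda kv: kv[1])
        for category, subs in counts.items()
    }


# Hypothetical sample data: (main category, subcategories of one hit).
hits = [
    (1, ["English Books", "Fiction"]),
    (1, ["English Books"]),
    (2, ["Music"]),
]
counts = facet_counts(hits)
top = top_subcategories(counts, n=10)
```

With this layout the subcategory never has to be parsed as a query, so phrases such as "English Books" are treated as plain dictionary keys.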

Time measurements (whole search, including main search and debug result displays)
- 341 hits (64 subcategories) - 0.808 seconds
- 2163 hits (601 subcategories) - 7.106 seconds
- 42004 hits (561 subcategories) - 6.994 seconds
- 816218 hits (3411 subcategories) - 47.449 seconds
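
Dividing each total by the number of subcategories gives a rough per-subcategory cost. The short Python sketch below does that arithmetic on the four measurements above. Because the totals also include the main search and the debug display, the results are upper bounds rather than exact query times.

```python
# (hits, subcategories, total seconds) as reported in the measurements above.
measurements = [
    (341, 64, 0.808),
    (2163, 601, 7.106),
    (42004, 561, 6.994),
    (816218, 3411, 47.449),
]


def ms_per_subcategory(measurements):
    """Total time divided by the number of subcategories, in milliseconds."""
    return [round(seconds / subcats * 1000, 1) for _, subcats, seconds in measurements]


per_subcategory = ms_per_subcategory(measurements)
for (hits, subcats, _), ms in zip(measurements, per_subcategory):
    print(f"{hits:>7} hits, {subcats:>4} subcategories: {ms} ms per subcategory")
```

The cost stays at roughly 12 to 14 ms per subcategory in all four cases, which suggests the total time grows with the number of subcategories rather than with the number of hits.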

Most of the time is spent on the faceting and on adding the data to the DataTable.
Best Regards, Marc

Friday, July 9, 2010, 11:45:57 AM, you wrote:

> Well, the SimpleFacets code is very old, and I only suggested it as a way to
> get the values instead of calling document.doc(). If you are already getting
> them from the FieldCache, that's fine.

> I don't know what you mean by "pre-define any facets"; you just
> have to call it with the facet field. However, in the case of your
> subcategories, how are they indexed? Is it a tokenized field, or is it a string
> that you process in the HitCollector?

> However they are indexed, once you have all the subcategories from
> a search, inserting them into a priority queue will always be faster than
> creating the file, inserting into SQL, and counting.


> On Thu, Jul 8, 2010 at 4:14 PM, Kaufmann M. <ka...@gmail.com> wrote:

>> Hello Jokin,
>> I'm currently precaching all counting values with the following function:
>> Search.FieldCache_Fields.DEFAULT.GetInts
>>
>> This works pretty fast - except that the startup time is pretty slow (but
>> that's only once).
>> The problem is the counting itself (I tried different models, from arrays to
>> dictionaries): it gets really slow when the result count is above
>> 10'000.
>>
>> I found SimpleFacets and tried it before, but as far as I
>> understand, you would have to pre-define every facet to be counted, and
>> that would be >30000 in my case.
>>
>> The fastest possible solution I've found for a big set of different
>> categories is:
>> - write them to a file
>> - bulk insert into a sql server
>> - count in sql server
>> - return top 20 of those categories
>>
>> But I'd prefer a pure .NET solution without writing to the
>> hard disk.
>>
>> Best Regards, Marc
>>
>> Thursday, July 8, 2010, 10:17:05 AM, you wrote:
>>
>> > If you are getting the data via the "doc" property, it's a very bad idea:
>> > it has to get the whole document from the hard disk, and that is terribly
>> > slow.
>>
>> > The best approach in this case is to get a FieldCache and read the field
>> > values from there. You can see this approach in a simple facets
>> > application that I wrote a long time ago:
>> > http://lucene.apache.org/~digy/files/SimpleFacets.zip
>>
>> > You can also take a look at the various discussions about facets on this
>> > same list.
>>
>> > On Wed, Jul 7, 2010 at 2:18 PM, Kaufmann M. <ka...@gmail.com> wrote:
>>
>> >> Hello everybody,
>> >> I have a running project in which I'd like to build an overview table of
>> >> the search results (similar to faceted searching).
>> >> So far I've tried two approaches:
>> >>
>> >> - DataTable in HitCollector to count occurrences
>> >> - Faceted Boolean queries
>> >>
>> >> Now in both cases I have a problem:
>> >> I have multiple fields I'd like to count:
>> >> - Main category (numerical value between 0 and 50)
>> >> - Subcategories (string values, 5-15 per result)
>> >>
>> >> With the DataTable method I can count both categories, but if the
>> >> results reach a big number it gets miserably slow.
>> >> With the faceted Boolean queries I cannot search for the subcategories (I
>> >> would have to search for thousands of different strings).
>> >>
>> >> Does anybody have an idea how to solve this?
>> >>
>> >> Concerning the usage in the end:
>> >> I'd like to display an overview like:
>> >> Maincategory 1 [50 Hits]
>> >>  - Subcategory 1 [20 Hits]
>> >>  - Subcategory 2 [10 Hits]
>> >>  ... Top 10 subcategories
>> >> ... all Maincategories
>> >>
>> >> Any help would be greatly appreciated.
>> >> Best Regards
>> >>
>>







Re[5]: Counting occurrences with HitCollector

Posted by "M. Kaufmann" <ka...@hispeed.ch>.
Is it really possible that the query parsing is taking that long??
If the facetsearcher has to parse 64 subqueries it takes 0.36 seconds(!).

Best Regards, Marc
