You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@lucenenet.apache.org by Matt Honeycutt <mb...@gmail.com> on 2010/04/30 17:35:11 UTC

Handling Hierarchical Document Categories

I have a requirement that I'm not sure how to handle.  Documents in our
system can be mapped to zero or more categories from a list of potentially
thousands of categories.  The categories themselves are hierarchical in
nature, so the category structure might look like this:

- Root
-- Child 1
---- Grandchild 1
---- Grandchild 2
-- Child 2
----- Grandchild 3
----- Grandchild 4

(note that the real hierarchy is *much* more complex than that)

A document could map to any of those categories.  When an end-user searches
the system by category, they should see documents that fall in the category
as well as documents that fall into a "descendant" category.  So for
example, if they searched for "Category:Root", they should see documents
that fall in any of the categories described above.

Right now, we're implementing this functionality by dynamically OR-ing the
subcategories to their query at query time.  Unfortunately, this doesn't
scale for thousands of categories due to limits built in to Lucene.

I have come up with three alternatives.  The first is still performed at
query time. Instead of OR-ing the subcategories, I would implement a custom
Query class that would internally know about how categories are related, and
it would handle the categories internally.  I'm thinking it would take an
approach similar to TermQuery, so in it's Scorer method implementation, it
would call reader.TermDocs for the user's category as well as for each
descendant category, then merge the results.

The other two strategies would be performed at indexing time.  The first is
to go ahead and store each descendant category as an additional copy of the
"Category" field with each document.  So, if a document was assigned the
category "Child 1", during indexing 3 "Category" fields would be added: one
for "Child 1", one for "Grandchild 1", and one for "Grandchild 2".

The second strategy is similar, except instead of storing each category as a
separate instance of the "Category" field, concatenate all the categories
into a single string, and only add one "Category" field.  The field would be
stored with 'Field.Index.NOT_ANALYZED_NO_NORMS' set.  A related strategy
could be to create the field using a token stream so that each category
appears at the same position in the stream (effectively making the
categories synonymous for a given document).

So, any thoughts?  I'm leaning towards either the custom Query class or the
token stream approach, but if there's a better way, please do share.

Thanks!

Re: Handling Hierarchical Document Categories

Posted by Matt Honeycutt <mb...@gmail.com>.

Thanks for the feedback, everyone.  I'm going to go with the delimited field
approach.

On Fri, Apr 30, 2010 at 11:44 AM, Digy <di...@gmail.com> wrote:

> I would exclude second approach since you may end up with too many fields
> for top level documents
>
>
> As another alternative, you can also store your docs in an ANALYZED field
> like
>  Doc1 : Category="Root"
>  Doc2 : Category="Root Child1 "
>  Doc3 : Category="Root Child1 Grandchild1"
>  Doc4 : Category="Root Child1 Grandchild2"
>  Doc5 : Category="Root Child2"
>  Doc6 : Category="Root Child2 Grandchild3"
>  Doc7 : Category="Root Child2 Grandchild4"
>
> DIGY
>
> -----Original Message-----
> From: Matt Honeycutt [mailto:mbhoneycutt@gmail.com]
> Sent: Friday, April 30, 2010 6:35 PM
> To: lucene-net-user@lucene.apache.org
> Subject: Handling Hierarchical Document Categories
>
> I have a requirement that I'm not sure how to handle.  Documents in our
> system can be mapped to zero or more categories from a list of potentially
> thousands of categories.  The categories themselves are hierarchical in
> nature, so the category structure might look like this:
>
> - Root
> -- Child 1
> ---- Grandchild 1
> ---- Grandchild 2
> -- Child 2
> ----- Grandchild 3
> ----- Grandchild 4
>
> (note that the real hierarchy is *much* more complex than that)
>
> A document could map to any of those categories.  When an end-user searches
> the system by category, they should see documents that fall in the category
> as well as documents that fall into a "descendant" category.  So for
> example, if they searched for "Category:Root", they should see documents
> that fall in any of the categories described above.
>
> Right now, we're implementing this functionality by dynamically OR-ing the
> subcategories to their query at query time.  Unfortunately, this doesn't
> scale for thousands of categories due to limits built in to Lucene.
>
> I have come up with three alternatives.  The first is still performed at
> query time. Instead of OR-ing the subcategories, I would implement a custom
> Query class that would internally know about how categories are related,
> and
> it would handle the categories internally.  I'm thinking it would take an
> approach similar to TermQuery, so in it's Scorer method implementation, it
> would call reader.TermDocs for the user's category as well as for each
> descendant category, then merge the results.
>
> The other two strategies would be performed at indexing time.  The first is
> to go ahead and store each descendant category as an additional copy of the
> "Category" field with each document.  So, if a document was assigned the
> category "Child 1", during indexing 3 "Category" fields would be added: one
> for "Child 1", one for "Grandchild 1", and one for "Grandchild 2".
>
> The second strategy is similar, except instead of storing each category as
> a
> separate instance of the "Category" field, concatenate all the categories
> into a single string, and only add one "Category" field.  The field would
> be
> stored with 'Field.Index.NOT_ANALYZED_NO_NORMS' set.  A related strategy
> could be to create the field using a token stream so that each category
> appears at the same position in the stream (effectively making the
> categories synonymous for a given document).
>
> So, any thoughts?  I'm leaning towards either the custom Query class or the
> token stream approach, but if there's a better way, please do share.
>
> Thanks!
>
>

RE: Handling Hierarchical Document Categories

Posted by Digy <di...@gmail.com>.

I would exclude second approach since you may end up with too many fields
for top level documents


As another alternative, you can also store your docs in an ANALYZED field
like
  Doc1 : Category="Root"
  Doc2 : Category="Root Child1 "
  Doc3 : Category="Root Child1 Grandchild1"
  Doc4 : Category="Root Child1 Grandchild2"
  Doc5 : Category="Root Child2"
  Doc6 : Category="Root Child2 Grandchild3"
  Doc7 : Category="Root Child2 Grandchild4"

DIGY

-----Original Message-----
From: Matt Honeycutt [mailto:mbhoneycutt@gmail.com] 
Sent: Friday, April 30, 2010 6:35 PM
To: lucene-net-user@lucene.apache.org
Subject: Handling Hierarchical Document Categories

I have a requirement that I'm not sure how to handle.  Documents in our
system can be mapped to zero or more categories from a list of potentially
thousands of categories.  The categories themselves are hierarchical in
nature, so the category structure might look like this:

- Root
-- Child 1
---- Grandchild 1
---- Grandchild 2
-- Child 2
----- Grandchild 3
----- Grandchild 4

(note that the real hierarchy is *much* more complex than that)

A document could map to any of those categories.  When an end-user searches
the system by category, they should see documents that fall in the category
as well as documents that fall into a "descendant" category.  So for
example, if they searched for "Category:Root", they should see documents
that fall in any of the categories described above.

Right now, we're implementing this functionality by dynamically OR-ing the
subcategories to their query at query time.  Unfortunately, this doesn't
scale for thousands of categories due to limits built in to Lucene.

I have come up with three alternatives.  The first is still performed at
query time. Instead of OR-ing the subcategories, I would implement a custom
Query class that would internally know about how categories are related, and
it would handle the categories internally.  I'm thinking it would take an
approach similar to TermQuery, so in it's Scorer method implementation, it
would call reader.TermDocs for the user's category as well as for each
descendant category, then merge the results.

The other two strategies would be performed at indexing time.  The first is
to go ahead and store each descendant category as an additional copy of the
"Category" field with each document.  So, if a document was assigned the
category "Child 1", during indexing 3 "Category" fields would be added: one
for "Child 1", one for "Grandchild 1", and one for "Grandchild 2".

The second strategy is similar, except instead of storing each category as a
separate instance of the "Category" field, concatenate all the categories
into a single string, and only add one "Category" field.  The field would be
stored with 'Field.Index.NOT_ANALYZED_NO_NORMS' set.  A related strategy
could be to create the field using a token stream so that each category
appears at the same position in the stream (effectively making the
categories synonymous for a given document).

So, any thoughts?  I'm leaning towards either the custom Query class or the
token stream approach, but if there's a better way, please do share.

Thanks!

RE: Handling Hierarchical Document Categories

Posted by Heath Aldrich <ha...@aes2.com>.

We have done something similar with *millions* of categories.

Our approach was to add two extra fields at indexing time.
One of the fields is "category".  It holds a comma separated list of every category where the document actually lives.  We can search this field if we only want items that are *in* a category, not descendant of another category.  That field may contain something like "12,4,14" (we use numeric category ids in the index and this item lives in 3 different categories).

The 2nd field is "CategoryList".  This field holds a unique list of categories that lead to the item.  
This field would look like "1,2,6,12|4,5,14|1,3,4"

Once you index this way, the search is easy.
We simply do a Boolean query for where the category list field contains "1" and the keywords contains "xxxxx".

Now we'll get all the results constrained to everything in or below category 1.

In the instance where we have the millions of categories, the most time consuming part of the query is generating the category trees that go along with each item.  Still that generation only takes a couple minutes and then we can start indexing from our database.


Hope that helps.
Heath


-----Original Message-----
From: Matt Honeycutt [mailto:mbhoneycutt@gmail.com] 
Sent: Friday, April 30, 2010 10:35 AM
To: lucene-net-user@lucene.apache.org
Subject: Handling Hierarchical Document Categories

I have a requirement that I'm not sure how to handle.  Documents in our
system can be mapped to zero or more categories from a list of potentially
thousands of categories.  The categories themselves are hierarchical in
nature, so the category structure might look like this:

- Root
-- Child 1
---- Grandchild 1
---- Grandchild 2
-- Child 2
----- Grandchild 3
----- Grandchild 4

(note that the real hierarchy is *much* more complex than that)

A document could map to any of those categories.  When an end-user searches
the system by category, they should see documents that fall in the category
as well as documents that fall into a "descendant" category.  So for
example, if they searched for "Category:Root", they should see documents
that fall in any of the categories described above.

Right now, we're implementing this functionality by dynamically OR-ing the
subcategories to their query at query time.  Unfortunately, this doesn't
scale for thousands of categories due to limits built in to Lucene.

I have come up with three alternatives.  The first is still performed at
query time. Instead of OR-ing the subcategories, I would implement a custom
Query class that would internally know about how categories are related, and
it would handle the categories internally.  I'm thinking it would take an
approach similar to TermQuery, so in it's Scorer method implementation, it
would call reader.TermDocs for the user's category as well as for each
descendant category, then merge the results.

The other two strategies would be performed at indexing time.  The first is
to go ahead and store each descendant category as an additional copy of the
"Category" field with each document.  So, if a document was assigned the
category "Child 1", during indexing 3 "Category" fields would be added: one
for "Child 1", one for "Grandchild 1", and one for "Grandchild 2".

The second strategy is similar, except instead of storing each category as a
separate instance of the "Category" field, concatenate all the categories
into a single string, and only add one "Category" field.  The field would be
stored with 'Field.Index.NOT_ANALYZED_NO_NORMS' set.  A related strategy
could be to create the field using a token stream so that each category
appears at the same position in the stream (effectively making the
categories synonymous for a given document).

So, any thoughts?  I'm leaning towards either the custom Query class or the
token stream approach, but if there's a better way, please do share.

Thanks!

Re: Handling Hierarchical Document Categories

Posted by Robert Jordan <ro...@gmx.net>.

On 30.04.2010 17:35, Matt Honeycutt wrote:
> I have a requirement that I'm not sure how to handle.  Documents in our
> system can be mapped to zero or more categories from a list of potentially
> thousands of categories.  The categories themselves are hierarchical in
> nature, so the category structure might look like this:
>
> - Root
> -- Child 1
> ---- Grandchild 1
> ---- Grandchild 2
> -- Child 2
> ----- Grandchild 3
> ----- Grandchild 4
>
> (note that the real hierarchy is *much* more complex than that)
>
> A document could map to any of those categories.  When an end-user searches
> the system by category, they should see documents that fall in the category
> as well as documents that fall into a "descendant" category.  So for
> example, if they searched for "Category:Root", they should see documents
> that fall in any of the categories described above.
>
> Right now, we're implementing this functionality by dynamically OR-ing the
> subcategories to their query at query time.  Unfortunately, this doesn't
> scale for thousands of categories due to limits built in to Lucene.

You could solve this at indexing time (at the cost of space) by
using a multivalued category field:

Document 1
	Category: "Root"
	Category: "Node"
	Category: "Subnode"
	...

Document 2
	Category: "Root"
	Category: "Node"
	Category: "Subnode"
	...

Document 3
	Category: "Root"
	Category: "Node"
	Category: "Subnode"
	...


This assumes that the category names are unique within
the tree, but this rule can be easily enforced.

Robert