You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Shai Erera (JIRA)" <ji...@apache.org> on 2013/01/01 19:16:12 UTC

[jira] [Comment Edited] (LUCENE-4647) Simplify CategoryDocumentBuilder

    [ https://issues.apache.org/jira/browse/LUCENE-4647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541894#comment-13541894 ] 

Shai Erera edited comment on LUCENE-4647 at 1/1/13 6:15 PM:
------------------------------------------------------------

Attached patch kinda overhauls how facets are indexed. The concept stays the same, but I sort of rewrote it all. Here's how the code in the patch works:

* {{FacetFields}} (previously {{CategoryDocumentBuilder}}) takes an {{Iterable<CategoryPath>}} and indexes them in two steps:
*# {{DrillDownStream}} indexes the drill-down tokens, e.g. {{$facets:Author}}, {{$facets:Author/Bob}}
*# {{CategoryListBuilder}} encodes the category ordinals into a {{Map<String,BytesRef>}} (more on this later), which is later set as the payload of {{$facets:$fulltree$}}
** Both these steps work per {{CategoryListParams}} (in case the application wishes to index groups of facets in different category lists)

* {{AssociationsFacetFields}} (previously {{EnhancementsDocumentBuilder}}) extends {{FacetFields}} and takes a {{CategoryAssociationsContainer}} (which implements {{Iterable<CategoryPath>}}) and holds a mapping from a {{CategoryPath}} to {{CategoryAssociation}}
** {{AssociationsDrillDownStream}} extends {{DrillDownStream}} and adds association values to the drill-down tokens' payloads
** {{AssociationsCategoryListBuilder}} extends {{CategoryListBuilder}} and encodes {{<category,association-value>}} pairs into their own {{BytesRef}}

* {{CategoryAssociation}} replaces {{CategoryEnhancement}} and simplifies the usage (end-user wise) a lot !
** Two implementations {{CategoryIntAssociation}} and {{CategoryFloatAssociation}} support the previously {{AssociationEnhancement}} + {{AssociationInt/FloatProperty}} and allow associating an int/float value with a category
** Every {{CategoryAssociation}} impl is responsible for serialization of its value to a {{ByteArrayDataOutput}} (and de-serialize from {{ByteArrayDataInput}})
** Every implementation needs to specify its {{categoryListID}}, since it determines the term payload under which the association values are encoded
** The two {{FacetRequests}}, {{AssociationIntSumFacetRequest}} and {{AssociationFloatSumFacetRequest}}, work with {{CategoryAssociation}} rather than the enhancement
** {{EnhancementsIndexingParams}} were removed, and togeher with them all the 'attributes', 'enhancements' and 'streaming' packages

* The {{Map<String,BytesRef>}} easily supports partitions and associations:
** When simple categories are indexed, no partitions, a single entry exists in the map
** When simple categories are indexed with partitions, an entry per partition exists in the map, e.g. {{$facets:$fulltree$1}}, {{$facets:$fulltree$2}} etc.
** When associations are indexed, the map contains the ordinals list (as described above) and the association values per {{CategoryAssociation.getCategoryListID()}}, so e.g. an int association is encoded into a different {{BytesRef}} than a float one

I chose to implement it all from scratch because the code was very intertwined and complicated, much because of a very complicated desing for enhancements. At least to me, the code is now much simpler. Migrating facets from this code to DocValues should be an easy task - all that needs to be done is to write the output of {{CategoryListBuilder}} to a DocValues field, rather than a TokenStream payload.

The patch is huge, but mostly because of all the code that's been removed. When you review it, focus on the classes mentioned above.

*NOTE:* the new associations code breaks backwards compatibility with old indexes:
# Previously both the int and float associations were written to the same associations list as integers, and the float one used {{Float.intBitsToFloat}} and vice versa. Now they are written to two separate lists
# Previously the code supported multiple enhancements which affected how they were encoded (e.g. {{#ENHANCEMENTS, #ENH_LENGTHS, #ENH_BYTES}}). But we always had only one enhancement ({{AssociationEnhancement}}), so that encoding was really redundant.
# In order to support multiple {{CategoryAssociations}} per {{CategoryPath}}, one can easily write a {{CompoundAssociation}} and take care of its serialization.

Since enhancements/associations are quite an advanced feature, I think this break makes sense. We can always add a backwards layer later if someone complains (and cannot reindex). Keeping the previous code, which was prepared to handle multiple {{CategoryEnhancement}} types, while only one enhancement was available to our users did not make sense to me.
                
      was (Author: shaie):
    Attached patch kinda overhauls how facets are indexed. The concept stays the same, but I sort of rewrote it all. Here's how the code in the patch works:

* {{FacetFields}} (previously {{CategoryDocumentBuilder}}) takes an {{Iterable<CategoryPath>}} and indexes them in two steps:
*# {{DrillDownStream}} indexes the drill-down tokens, e.g. {{$facets:Author}}, {{$facets:Author/Bob}}
*# {{CategoryListBuilder}} encodes the category ordinals into a {{Map<String,BytesRef>}} (more on this later), which is later set as the payload of {{$facets:$fulltree$}}
** Both these steps work per {{CategoryListParams}} (in case the application wishes to index groups of facets in different category lists)

* {{AssociationsFacetFields}} (previously {{EnhancementsDocumentBuilder}}) extends {{FacetFields}} and takes a {{CategoryAssociationsContainer}} (which implements {{Iterable<CategoryPath>}}) and holds a mapping from a {{CategoryPath}} to {{CategoryAssociation}}
** {{AssociationsDrillDownStream}} extends {{DrillDownStream}} and adds association values to the drill-down tokens' payloads
** {{AssociationsCategoryListBuilder}} extends {{CategoryListBuilder}} and encodes {{category,association-value}} pairs into their own {{BytesRef}}

* {{CategoryAssociation}} replaces {{CategoryEnhancement}} and simplifies the usage (end-user wise) a lot !
** Two implementations {{CategoryIntAssociation}} and {{CategoryFloatAssociation}} support the previously {{AssociationEnhancement}} + {{AssociationInt/FloatProperty}} and allow associating an int/float value with a category
** Every {{CategoryAssociation}} impl is responsible for serialization of its value to a {{ByteArrayDataOutput}} (and de-serialize from {{ByteArrayDataInput}}
** Every implementation needs to specify its {{categoryListID}}, since it determines the term payload under which the association values are encoded
** The two {{FacetRequests}}, {{AssociationIntSumFacetRequest}} and {{AssociationFloatSumFacetRequest}}, work with {{CategoryAssociation}} rather than the enhancement
** {{EnhancementsIndexingParams}} were removed, and togeher with them all the 'attributes', 'enhancements' and 'streaming' packages

* The {{Map<String,BytesRef>}} easily supports partitions and associations:
** When simple categories are indexed, no partitions, a single entry exists in the map
** When simple categories are indexed with partitions, an entry per partition exists in the map, e.g. {{$facets:$fulltree$1}}, {{$facets:$fulltree$2}} etc.
** When associations are indexed, the map contains the ordinals list (as described above) and the association values per {{CategoryAssociation.getCategoryListID()}}, so e.g. an int association is encoded into a different {{BytesRef}} than a float one

I chose to implement it all from scratch because the code was very intertwined and complicated, much because of a very complicated desing for enhancements. At least to me, the code is now much simpler. Migrating facets from this code to DocValues should be an easy task - all that needs to be done is to write the output of {{CategoryListBuilder}} to a DocValues field, rather than a TokenStream payload.

The patch is huge, but mostly because of all the code that's been removed. When you review it, focus on the classes mentioned above.

*NOTE:* the new associations code breaks backwards compatibility with old indexes:
# Previously both the int and float associations were written to the same associations list as integers, and the float one used {{Float.intBitsToFloat}} and vice versa. Now they are written to two separate lists
# Previously the code supported multiple enhancements which affected how they were encoded (e.g. {{#ENHANCEMENTS, #ENH_LENGTHS, #ENH_BYTES}}). But we always had only one enhancement ({{AssociationEnhancement}}), so that encoding was really redundant.
# In order to support multiple {{CategoryAssociations}} per {{CategoryPath}}, one can easily write a {{CompoundAssociation}} and take care of its serialization.

Since enhancements/associations are quite an advanced feature, I think this break makes sense. We can always add a backwards layer later if someone complains (and cannot reindex). Keeping the previous code, which was prepared to handle multiple {{CategoryEnhancement}} types, while only one enhancement was available to our users did not make sense to me.
                  
> Simplify CategoryDocumentBuilder
> --------------------------------
>
>                 Key: LUCENE-4647
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4647
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>         Attachments: LUCENE-4647.patch
>
>
> CategoryDocumentBuilder is used to add facet fields to a document. Today the usage is not so straightforward, and I'd like to simplify it. First, to improve usage but also to make cutover to DocValues easier.
> This clsas does two operations: (1) adds drill-down terms and (2) creates the fulltree payload. Today, since it does it all on terms, there's a hairy TokenStream which does both these operations in one go. For simplicity, I'd like to break this into two steps:
> # Write a TokenStream which adds the drill-down terms
> #* For no associations, terms should be indexed w/ DOCS_ONLY (see LUCENE-4623).
> #* With associations, drill-down terms have payload too.
> #* So EnhancementsDocumentBuilder should be able to extend this stream.
> # Write some API which can be used to build the fulltree payload (i.e. populate a byte[]). Currently that byte[] will be written to a payload and later to DV.
> Hopefully, I'd like to have FacetsDocumentBuilder (maybe just FacetsBuilder?) which only handles things with no associations, and EnhancementsDocBuilder which extends whatever is needed to add associations. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org