Posted to dev@lucene.apache.org by "John Berryman (JIRA)" <ji...@apache.org> on 2013/06/16 22:33:20 UTC

[jira] [Created] (SOLR-4930) Make PathHierarchyTokenizer use regex and optionally prefix the depth of the path.

John Berryman created SOLR-4930:
-----------------------------------

             Summary: Make PathHierarchyTokenizer use regex and optionally prefix the depth of the path.
                 Key: SOLR-4930
                 URL: https://issues.apache.org/jira/browse/SOLR-4930
             Project: Solr
          Issue Type: Improvement
          Components: Schema and Analysis
            Reporter: John Berryman
            Priority: Minor


The PathHierarchyTokenizer lacks a couple of features that I think are commonly needed.

1. Split and replace based upon regex.
2. Optionally prefix each token with its depth in the path.

Motivation: I recently had a client who asked me to index laws that were organized into chapters, sections, subsections, etc. The problem was that the section numbers used a mixture of delimiters (e.g. 13.4-64.2), so I had to use pattern replacement to map either delimiter to a tilde. But the next problem was that these could no longer be displayed as facets (at least not without extra code on the front end). I also wanted to prefix each token with the depth of its path. Again, I can achieve this with pattern replacement, but it is ugly and performs poorly.

I propose we:

* update PathHierarchyTokenizer so that if the delimiter or replacement parameter is a single character, the current behavior of PathHierarchyTokenizer is unchanged, but if an argument is longer than one character, it is interpreted as a regex.
* add a new parameter called depthPrefixNumChars that indicates how many characters will be used for a depth prefix; this defaults to zero (no prefix).
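
To make the proposal concrete, here is a rough, standalone sketch of the intended output. It is not the Lucene tokenizer and not the linked implementation. The class and method names are hypothetical, and it assumes that the depth is 1-based, zero-padded to depthPrefixNumChars digits, and that the original delimiters are kept in the tokens (no replacement).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Lucene or Solr.
public class RegexPathSketch {

    // Emits one token per path level, splitting on a regex delimiter.
    // Assumption: depth starts at 1 and is zero-padded to depthPrefixNumChars digits.
    static List<String> tokenize(String path, String delimiterRegex, int depthPrefixNumChars) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(delimiterRegex).matcher(path);
        int depth = 1;
        while (m.find()) {
            // Skip a leading delimiter and any empty match.
            if (m.start() == 0 || m.end() == m.start()) {
                continue;
            }
            tokens.add(prefix(depth++, depthPrefixNumChars) + path.substring(0, m.start()));
        }
        // The full path is always the deepest token.
        tokens.add(prefix(depth, depthPrefixNumChars) + path);
        return tokens;
    }

    // Returns "" when numChars is zero, which mirrors the proposed default.
    static String prefix(int depth, int numChars) {
        if (numChars <= 0) {
            return "";
        }
        return String.format("%0" + numChars + "d", depth);
    }

    public static void main(String[] args) {
        // Prints [0113, 0213.4, 0313.4-64, 0413.4-64.2]
        System.out.println(tokenize("13.4-64.2", "[.-]", 2));
    }
}
```

With depthPrefixNumChars set to 0 the same call yields 13, 13.4, 13.4-64 and 13.4-64.2, which is what I would expect from the existing tokenizer if it accepted both delimiters.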

Here's my current first stab at it:
https://github.com/o19s/statedecoded/blob/master/solr_home/statedecoded/src/src/main/java/com/o19s/RegexPathHierarchyTokenizer.java

It doesn't support the replacement or skip parameters yet. Before I go the rest of the way, I wanted to gauge interest and see if others need this.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
