Posted to dev@lucene.apache.org by "Prashant Saraswat (JIRA)" <ji...@apache.org> on 2013/11/27 21:37:35 UTC

[jira] [Commented] (SOLR-1690) JSONKeyValueTokenizerFactory -- JSON Tokenizer

    [ https://issues.apache.org/jira/browse/SOLR-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13834119#comment-13834119 ] 

Prashant Saraswat commented on SOLR-1690:
-----------------------------------------

@Ryan McKinley: Many thanks for attaching the patch here. It is most useful.

@Hoss Man: Consider this use case. Take your favorite e-commerce site (say newegg.com or ebay.com). Notice that it has some kind of category hierarchy. Each category has category attributes (say Brand) with category-sensitive possible values (Apple/Samsung for cell phones, Sharp/Samsung for HDTVs). In these cases the number of category-specific attributes runs into the tens of thousands. It is not realistically possible to create a schema in which every category-specific attribute is mapped to its own Solr field. However, you can store the category-specific attributes per category as a JSON string.

Now, you do need to filter by category-specific attributes. Say you are searching for HDTVs and only want to see those manufactured by Samsung. As is, Solr will not let you search within a field whose value looks like this:
{"name":"Brand", "value":"Samsung"}

Something like fq=categoryattribute:"name":"brand","value":"samsung" (properly escaped) doesn't work.

Enter the awesome JSON tokenizer written by Ryan McKinley. It allows the same field to be indexed as JSON, and
something like fq=categoryattribute:"name:brand" AND categoryattribute:"value:Samsung" works.
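
To make the matching concrete, here is a rough Python sketch of what I understand the analysis chain from the issue description to do. It is not the patch's code (the patch is Java and uses noggit). The function names are made up for illustration, only flat JSON objects are handled, and the keepArray and hierarchicalKey options are not modeled.

```python
import json


def index_tokens(field_value):
    """Approximate the index analyzer: one "key:value" token per
    top-level pair, then trim and lowercase (assumed behavior)."""
    pairs = json.loads(field_value)
    return {f"{key}:{value}".strip().lower() for key, value in pairs.items()}


def query_token(clause):
    """Approximate the query analyzer: the keyword tokenizer keeps the
    whole clause as a single token, which is then trimmed and lowercased."""
    return clause.strip().lower()


def matches(field_value, clauses):
    """True if every clause (ANDed together) hits an indexed token."""
    tokens = index_tokens(field_value)
    return all(query_token(clause) in tokens for clause in clauses)


doc = '{"name":"Brand", "value":"Samsung"}'
print(sorted(index_tokens(doc)))
print(matches(doc, ["name:brand", "value:Samsung"]))
print(matches(doc, ["name:brand", "value:Sharp"]))
```

Because both sides are lowercased, "value:Samsung" in the filter query lines up with the indexed token for "Samsung", while a clause for a different brand finds nothing.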

Happy to provide more information if needed. Also happy to take the slap if I'm missing something obvious here.

> JSONKeyValueTokenizerFactory -- JSON Tokenizer
> ----------------------------------------------
>
>                 Key: SOLR-1690
>                 URL: https://issues.apache.org/jira/browse/SOLR-1690
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Ryan McKinley
>            Priority: Minor
>         Attachments: SOLR-1690-JSONKeyValueTokenizerFactory.patch, noggit-1.0-A1.jar
>
>
> Sometimes it is nice to group structured data into a single field.
> This (rough) patch takes JSON input and indexes tokens based on the key/value pairs in the JSON.
> {code:xml|title=schema.xml}
> <!-- JSON Field Type -->
>     <fieldtype name="json" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
>       <analyzer type="index">
>         <tokenizer class="solr.JSONKeyValueTokenizerFactory" keepArray="true" hierarchicalKey="false"/>
>         <filter class="solr.TrimFilterFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.TrimFilterFactory" />
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldtype>
> {code}
> Given text:
> {code}
>  { "hello": "world", "rank":5 }
> {code}
> indexed as two tokens:
> || term position | 1 | 2 |
> || term text | hello:world | rank:5 |
> || term type | word | word |
> || source start,end | 12,17 | 27,28 |


