Posted to solr-user@lucene.apache.org by Ryan McKinley <ry...@gmail.com> on 2008/01/26 00:20:02 UTC
JSON tokenizer? tagging ideas
I've been struggling with how to get various bits of structured data
into solr documents. In various projects I have tried various ideas,
but none feel great.
Take a simple example where I want a document field to hold a list of
linked items, each with a name, ID, and path. I have tried things like:
<doc>
<field name="id">ID</field>
<field name="link">IDA nameA pathA</field>
<field name="link">IDB nameB pathB</field>
<field name="link">IDC nameC pathC</field>
</doc>
This is OK (when spaces are a problem, I've tokenized on \n), but it
feels very brittle.
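To make the brittleness concrete, here is a minimal Python sketch (not
Solr code; parse_link is a hypothetical helper, and the second input is
an invented value) of splitting such a field on whitespace. It works
only as long as no part contains a space:

```python
def parse_link(value):
    """Split a space-delimited 'ID name path' field value into its parts."""
    parts = value.split()
    if len(parts) != 3:
        # A name or path containing a space lands here.
        raise ValueError("expected 3 parts, got %d: %r" % (len(parts), value))
    link_id, name, path = parts
    return {"id": link_id, "name": name, "path": path}


print(parse_link("IDA nameA pathA"))
try:
    parse_link("IDB name with spaces pathB")
except ValueError as err:
    print("brittle:", err)
```

Switching the delimiter to \n only moves the problem: the format still
depends on field order and on a character that must never appear in the
data.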
I'm considering a general JSON tokenizer and want to know what you all
think. Consider:
<doc>
<field name="id">ID</field>
<field name="link">{ "id":10, "name":"nameA", "path":"/..." }</field>
<field name="link">{ "id":11, "name":"nameB", "path":"/..." }</field>
<field name="link">{ "id":12, "name":"nameB", "path":"/..." }</field>
</doc>
The tokenizer can make a token for each key:value pair, that is:
id:10, name:nameA, path:..., id:11, ...
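As a rough illustration of that idea, here is a Python sketch using the
standard json module rather than a real Solr/Lucene tokenizer.
json_tokens is a hypothetical name, the paths "/a" and "/b" are
placeholder values, and the sketch assumes strict JSON (commas between
pairs) with flat objects only:

```python
import json


def json_tokens(value):
    """Yield one 'key:value' token per top-level pair of a JSON object."""
    obj = json.loads(value)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    for key, val in obj.items():
        yield "%s:%s" % (key, val)


# Two values of a multi-valued "link" field (placeholder paths).
links = [
    '{ "id":10, "name":"nameA", "path":"/a" }',
    '{ "id":11, "name":"nameB", "path":"/b" }',
]
tokens = [t for link in links for t in json_tokens(link)]
print(tokens)
# ['id:10', 'name:nameA', 'path:/a', 'id:11', 'name:nameB', 'path:/b']
```

Nested objects or arrays would need a decision about how to flatten
them; the sketch leaves that open.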
Perhaps this could be part of the general 'tag' design:
http://wiki.apache.org/solr/UserTagDesign
Rather than having fixed prefixes like "~erik#lucene", we could use JSON syntax:
{user:erik, text:lucene, date:20071112 }
With noggit (http://svn.apache.org/repos/asf/labs/noggit/), the JSON
parsing is super fast. Prefix queries are probably slower with a
longer string, but I guess you could just use shorter keys:
{u:erik, t:lucene, d:20071112 }
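One way to picture how prefix queries could work against such a string,
as a Python sketch only: encode_tag is a hypothetical helper, the fixed
key order (user, text, date) is my assumption so that prefixes line up,
and the second and third tags are invented sample data:

```python
def encode_tag(user, text, date):
    """Serialize a tag with short keys in a fixed order: user, text, date."""
    return "{u:%s, t:%s, d:%s }" % (user, text, date)


tags = [
    encode_tag("erik", "lucene", "20071112"),
    encode_tag("erik", "solr", "20071113"),   # invented sample
    encode_tag("ryan", "lucene", "20071114"),  # invented sample
]

# "All tags by erik" becomes a starts-with test on the encoded string.
prefix = "{u:erik, "
by_erik = [t for t in tags if t.startswith(prefix)]
print(by_erik)
```

With this layout only the leading key can be matched by prefix; finding
every tag with t:lucene regardless of user would need a different key
order or separate tokens per pair.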
Thoughts?
ryan