Posted to solr-user@lucene.apache.org by Ryan McKinley <ry...@gmail.com> on 2008/01/26 00:20:02 UTC

JSON tokenizer? tagging ideas

I've been struggling with how to get various bits of structured data 
into Solr documents.  I have tried different approaches in several 
projects, but none feels great.

Take a simple example where I want a document field to be the list of 
linked data with name, ID, and path.  I have tried things like:

<doc>
   <field name="id">ID</field>
   <field name="link">IDA nameA pathA</field>
   <field name="link">IDB nameB pathB</field>
   <field name="link">IDC nameC pathC</field>
</doc>

This is ok -- when spaces are a problem, I've tokenized on \n -- but 
it feels very brittle.

I'm considering a general JSON tokenizer and want to know what you all 
think.  Consider:
<doc>
   <field name="id">ID</field>
   <field name="link">{ "id":10, "name":"nameA", "path":"/..." }</field>
   <field name="link">{ "id":11, "name":"nameB", "path":"/..." }</field>
   <field name="link">{ "id":12, "name":"nameC", "path":"/..." }</field>
</doc>

The tokenizer can make a token for each key:value pair, that is:
  id:10, name:nameA, path:..., id:11, ...
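
To make the idea concrete, here is a rough sketch of the token stream I have in mind. It is plain Python rather than a real Solr Tokenizer, it uses the standard json module (so the input has to be strict JSON, with commas between pairs), and the function name and the "/a" and "/b" paths are made up for illustration:

```python
import json

def json_field_tokens(field_value):
    """Return one "key:value" token per top-level pair in a JSON object string."""
    obj = json.loads(field_value)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    # dicts keep insertion order, so tokens come out in document order
    return [f"{key}:{value}" for key, value in obj.items()]

# Two values of a multi-valued "link" field (paths are hypothetical)
link_values = [
    '{ "id":10, "name":"nameA", "path":"/a" }',
    '{ "id":11, "name":"nameB", "path":"/b" }',
]

# Flatten the tokens of every value into one stream, as a tokenizer would
tokens = [tok for value in link_values for tok in json_field_tokens(value)]
print(tokens)
```

Nested objects and arrays are not handled here; a real tokenizer would have to decide whether to flatten or skip them.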

Perhaps this could be part of the general 'tag' design:
http://wiki.apache.org/solr/UserTagDesign

Rather than having fixed prefixes like "~erik#lucene", we could use JSON syntax:
  {user:erik, text:lucene, date:20071112 }

Using noggit (http://svn.apache.org/repos/asf/labs/noggit/), the JSON 
parsing is super fast.  Prefix queries are probably slower with a 
longer string, but I guess you could just use shorter keys:
  {u:erik, t:lucene, d:20071112 }
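
As a rough model of why the prefix still works with this syntax: if the keys are always written in the same order, every tag by one user shares a leading string, and a prefix lookup over the sorted terms finds them all. The sketch below is plain Python, not Lucene's API; the helper names, the second and third tags, and their dates are made up for illustration:

```python
from bisect import bisect_left

def tag_term(user, text, date):
    # Fixed key order, so all terms for one user share the same prefix
    return f"{{u:{user}, t:{text}, d:{date}}}"

def prefix_matches(sorted_terms, prefix):
    """Collect the terms that start with prefix from a sorted term list."""
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches

terms = sorted([
    tag_term("erik", "lucene", 20071112),
    tag_term("erik", "solr", 20071201),    # hypothetical tag
    tag_term("ryan", "lucene", 20080125),  # hypothetical tag
])

erik_terms = prefix_matches(terms, "{u:erik,")
print(erik_terms)
```

The catch is that this only works for the key that comes first; finding everything tagged "lucene" regardless of user would need a different key order or separate tokens per key.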

Thoughts?

ryan