Posted to solr-dev@lucene.apache.org by Yonik Seeley <ys...@gmail.com> on 2006/02/21 18:47:50 UTC

example schema/docs

I'm coming up a little short on ideas for documents for the example.
I first thought of DVDs or Books, but the titles are very regular (not
much to demonstrate in the way of text analysis), and bigger portions
of text like reviews are most likely copyrighted (fair use?).

So I've fallen back to electronics, which has an advantage of product
names and jargon that shows off the need for text analysis, such as
word splitting and combining, and synonyms.

Here is an example of what I might limit the fields to:

id  (could be non-tokenized version of sku?)

sku: MA147LL/A

name: Apple 60 GB iPod with Video Playback Black

manu: Apple

features: Stores up to 15,000 songs, 25,000 photos, or 150 hours of video

features: iTunes, Podcasts, Audiobooks

features: 2.5-inch, 320x240 color TFT LCD display with LED backlight

features: Up to 20 hours of battery life

features: Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video

features: Notes, Calendar, Phone book, Hold button, Date display,
Photo wallet, Built-in games, JPEG photo playback, Upgradeable
firmware, USB 2.0 compatibility, Playback speed control, Rechargeable
capability, Battery level indication

includes: earbud headphones, USB cable

weight: 5.5

price: 399.00

popularity: 10
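
For illustration only, here is a rough Python sketch of how a document like the one above could be turned into an XML add message. It assumes the update message has the shape <add><doc><field name="...">value</field></doc></add>, and it assumes id simply reuses the sku value (still an open question above). Only two of the features values are included to keep it short; build_add_message is a made-up helper name, not part of Solr.

```python
import xml.etree.ElementTree as ET

# Field values copied from the example document in this message.
# A field that repeats (features) is given as a list so that it
# becomes several <field> elements with the same name.
doc = {
    "id": "MA147LL/A",  # assumption: id reuses the sku value
    "sku": "MA147LL/A",
    "name": "Apple 60 GB iPod with Video Playback Black",
    "manu": "Apple",
    "features": [
        "iTunes, Podcasts, Audiobooks",
        "Up to 20 hours of battery life",
    ],
    "includes": "earbud headphones, USB cable",
    "weight": "5.5",
    "price": "399.00",
    "popularity": "10",
}


def build_add_message(document):
    """Serialize one document as an <add><doc>...</doc></add> message.

    Hypothetical helper: the element names are an assumption about the
    update format, not something confirmed in this thread.
    """
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in document.items():
        # Normalize single values to a one-element list.
        values = value if isinstance(value, list) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", {"name": name})
            field.text = v
    return ET.tostring(add, encoding="unicode")


message = build_add_message(doc)
print(message)
```

Repeating the features field, rather than joining the values into one string, is what would exercise a multi-valued field in the example.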


-Yonik

Re: example schema/docs

Posted by Chris Hostetter <ho...@fucit.org>.
: So I've fallen back to electronics, which has an advantage of product
: names and jargon that shows off the need for text analysis, such as
: word splitting and combining, and synonyms.
:
: Here is an example of what I might limit the fields to:

A few suggestions:

1) Demonstrating dynamic fields would be good, perhaps a non-tokenized
"category" field with a pop* field (one number per category)

2) Make sure some records don't have values for some fields to
demonstrate missing values (and sort missing last).  Demonstrating docs
with heterogeneous fields might be nice ... but isn't crucial.

3) Have multiple versions of the name field with different analyzers to
show off copy field.

4) Having a boolean field for completeness of primary datatypes would be
nice ... doesn't really matter what the field is (inStock perhaps?)

I'm assuming you were already planning on mixing up things like
omitNorms, multiValued, etc...
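
To make suggestions 1, 2 and 4 concrete, here is a small Python sketch of what such example records might look like and how they would behave. It only illustrates the intended behavior and is not how Solr implements any of it. The ids, values, and the pop_music / pop_cables field names are made up for illustration, and the glob match on pop* stands in for whatever pattern matching dynamic fields actually use.

```python
from fnmatch import fnmatchcase

# Hypothetical records: not every record has every field.
docs = [
    {"id": "A", "price": 399.00, "inStock": True, "pop_music": 10},
    {"id": "B", "inStock": False},  # no price: a missing value
    {"id": "C", "price": 19.95, "pop_cables": 3},
]


def sort_missing_last(records, field):
    """Sort ascending by field, with records lacking the field at the end."""
    # The first key element is False for records that have the field,
    # so they sort ahead of records that do not.
    return sorted(records, key=lambda r: (field not in r, r.get(field, 0)))


def dynamic_fields(record, pattern="pop*"):
    """Return the field names in a record matching a glob-style pattern."""
    return [name for name in record if fnmatchcase(name, pattern)]


by_price = [d["id"] for d in sort_missing_last(docs, "price")]
print(by_price)
print(dynamic_fields(docs[0]))
```

Note that a plain pop* pattern would also match a field named popularity, so the example schema may want to keep the dynamic field prefix distinct from the fixed field names.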



-Hoss