Posted to solr-dev@lucene.apache.org by Yonik Seeley <ys...@gmail.com> on 2006/02/21 18:47:50 UTC
example schema/docs
I'm coming up a little short on ideas for documents for the example.
I first thought of DVDs or Books, but the titles are very regular (not
much to demonstrate in the way of text analysis), and bigger portions
of text like reviews are most likely copyrighted (fair use?).
So I've fallen back to electronics, which has an advantage of product
names and jargon that shows off the need for text analysis, such as
word splitting and combining, and synonyms.
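To make "word splitting and combining" and synonyms concrete, here is a rough sketch in Python. It is not Solr's analyzer code; the function names and the synonym table are hypothetical and only illustrate the idea: split a token at punctuation, case changes, and letter/digit boundaries, also index the re-combined form, then expand synonyms.

```python
import re

def split_words(token):
    """Split one token at punctuation, case changes and letter/digit boundaries."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", token)
    return [p.lower() for p in parts]

def analyze(text, synonyms=None):
    """Return index tokens: the split parts, the combined form, and any synonyms.

    `synonyms` is a hypothetical dict mapping a token to a list of extra tokens.
    """
    synonyms = synonyms or {}
    tokens = []
    for raw in text.split():
        parts = split_words(raw)
        if not parts:
            continue
        tokens.extend(parts)
        if len(parts) > 1:
            # Combined form, so a query for "ipod" can match "iPod" or "i-Pod".
            tokens.append("".join(parts))
    expanded = []
    for tok in tokens:
        expanded.append(tok)
        expanded.extend(synonyms.get(tok, []))
    return expanded

if __name__ == "__main__":
    print(analyze("Apple iPod"))
    print(split_words("MA147LL/A"))
```

Product names like "iPod" and SKUs like "MA147LL/A" are the kind of input where plain whitespace tokenizing falls short, which is the argument for electronics as sample data.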
Here is an example of what I might limit the fields to:
id (could be non-tokenized version of sku?)
sku: MA147LL/A
name: Apple 60 GB iPod with Video Playback Black
manu: Apple
features: Stores up to 15,000 songs, 25,000 photos, or 150 hours of video
features: iTunes, Podcasts, Audiobooks
features: 2.5-inch, 320x240 color TFT LCD display with LED backlight
features: Up to 20 hours of battery life
features: Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video
features: Notes, Calendar, Phone book, Hold button, Date display,
Photo wallet, Built-in games, JPEG photo playback, Upgradeable
firmware, USB 2.0 compatibility, Playback speed control, Rechargeable
capability, Battery level indication
includes: earbud headphones, USB cable
weight: 5.5
price: 399.00
popularity: 10
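As a sketch of how a record like this might be turned into an XML update message (an <add> element wrapping a <doc> of <field name="..."> elements), the Python below builds one from (name, value) pairs. Assumptions: the id simply reuses the sku value, only a few of the features are included to keep it short, and the helper name build_add_xml is made up for illustration.

```python
import xml.etree.ElementTree as ET

# (name, value) pairs rather than a dict, so a multi-valued field such as
# "features" can simply repeat. The id value is an assumption (same as sku).
EXAMPLE_FIELDS = [
    ("id", "MA147LL/A"),
    ("sku", "MA147LL/A"),
    ("name", "Apple 60 GB iPod with Video Playback Black"),
    ("manu", "Apple"),
    ("features", "iTunes, Podcasts, Audiobooks"),
    ("features", "Up to 20 hours of battery life"),
    ("includes", "earbud headphones, USB cable"),
    ("weight", 5.5),
    ("price", "399.00"),
    ("popularity", 10),
]

def build_add_xml(fields):
    """Return an <add><doc>...</doc></add> string for one document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

if __name__ == "__main__":
    print(build_add_xml(EXAMPLE_FIELDS))
```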
-Yonik
Re: example schema/docs
Posted by Chris Hostetter <ho...@fucit.org>.
: So I've fallen back to electronics, which has an advantage of product
: names and jargon that shows off the need for text analysis, such as
: word splitting and combining, and synonyms.
:
: Here is an example of what I might limit the fields to:
A few suggestions:
1) Demonstrating dynamic fields would be good, perhaps a non-tokenized
"category" field with a pop* field (one number per category).
2) Make sure some records don't have values for some fields to
demonstrate missing values (and sort missing last). Demonstrating docs
with heterogeneous fields might be nice ... but isn't crucial.
3) Have multiple versions of the name field with different analyzers to
show off copy field.
4) Having a boolean field for completeness of primary datatypes would be
nice ... doesn't really matter what the field is (inStock perhaps?)
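Two of these are easy to illustrate outside of Solr. The Python below is only a sketch of the behavior, not how Solr implements it: a field name is checked against a "pop*" style pattern, and documents are sorted on a field with the docs missing that field placed last. The function names, the sample docs, and the name "pop_music" are all made up.

```python
from fnmatch import fnmatchcase

# Pattern taken from suggestion 1; any name starting with "pop" matches.
DYNAMIC_PATTERNS = ["pop*"]

def is_dynamic_field(name, patterns=DYNAMIC_PATTERNS):
    """True if the field name matches one of the glob-style patterns."""
    return any(fnmatchcase(name, p) for p in patterns)

def sort_missing_last(docs, field, reverse=False):
    """Sort docs on `field`; docs without a value always come last."""
    present = [d for d in docs if d.get(field) is not None]
    missing = [d for d in docs if d.get(field) is None]
    return sorted(present, key=lambda d: d[field], reverse=reverse) + missing

if __name__ == "__main__":
    # Hypothetical records; "b" has no price on purpose.
    docs = [
        {"id": "a", "price": 399.00},
        {"id": "b"},
        {"id": "c", "price": 19.95},
    ]
    print([d["id"] for d in sort_missing_last(docs, "price")])
    print(is_dynamic_field("pop_music"))
```

Note that the doc without a price stays at the end for both ascending and descending sorts, which is the point of having a few incomplete records in the sample data.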
I'm assuming you were already planning on mixing up things like
omitNorms, multiValued, etc...
-Hoss