Posted to solr-user@lucene.apache.org by Harry Hochheiser <hs...@gmail.com> on 2010/08/06 17:41:00 UTC

help with tokenizer/filter

I'm relatively new to Solr, and I'm having trouble indexing some
fields coming out of the Solr Cell extraction handler.

First question: what does the extraction handler do with text? For
example, if I throw it an Excel file, what am I going to get back as
input to Solr processing? Is anything done to the text at all?

Second question: I've got field values like AA_12345 (two letters,
underscore, several digits) and AA.44.A3 (two letters, period, two
digits, period, letter, digit).

I'd like these to match in a variety of different ways. For example,
the first should match AA, AA12345, AA_12345, and 12345. The second
should match AA.44.A3, AA44A3, 44, A3, etc., all in both upper and
lower case, of course.

What's the best way to filter and index? I've tried the following workflow:
1) whitespace tokenizer
2) trim filter
3) word delimiter filter, with generate number parts, generate word
parts, catenate numbers, catenate words, split on case change, and
preserve original all set
4) lowercase filter
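
As a schema.xml field type, the chain above would look roughly like
the sketch below. The name "text_code" is hypothetical, and the sketch
assumes the same analyzer is used at both index and query time; only
the options listed above are set, so everything else is left at its
default.

```xml
<!-- Sketch only: "text_code" is a made-up name for illustration. -->
<fieldType name="text_code" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer element applies to both indexing and querying. -->
  <analyzer>
    <!-- 1) split the input on whitespace only -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- 2) strip leading/trailing whitespace from each token -->
    <filter class="solr.TrimFilterFactory"/>
    <!-- 3) split on delimiters such as "_" and ".", keep the original token,
            and also emit the concatenated word and number runs -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            splitOnCaseChange="1"
            preserveOriginal="1"/>
    <!-- 4) lowercase everything so matching is case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that catenateAll is not set here, so a fully joined form such as
AA12345 (letters and digits together) would presumably not be produced
by this chain.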

but I get very mixed results. AA_12345 doesn't match in any form,
and AA.44.A3 is mixed: the whole thing matches, and "A3" matches,
but "AA" does not.

I've also got a simple string ("abcde") that goes into the same field
type, and that string doesn't match at all.

Any help would be appreciated. Thanks!

-harry