Posted to solr-user@lucene.apache.org by PeterKerk <ve...@hotmail.com> on 2010/10/31 17:12:50 UTC

indexing '-

I have a city named 's-Hertogenbosch

I want it to be indexed exactly like that, so "'s-Hertogenbosch" (without
the quotes).

But now I get:
<lst name="city">
	<int name="hertogenbosch">1</int>
	<int name="s">1</int>
	<int name="shertogenbosch">1</int>
</lst>

What filter should I add/remove from my field definition?

I already tried a new fieldtype with just this, but no luck:
    <fieldType name="exacttext" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="false"/>
      </analyzer>
    </fieldType>


My schema.xml

    <fieldType name="textTight" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords_dutch.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="0" generateNumberParts="0" catenateWords="1"
                catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="Dutch"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

<field name="city" type="textTight" indexed="true" stored="true"/>






-- 
View this message in context: http://lucene.472066.n3.nabble.com/indexing-tp1816969p1816969.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: indexing '-

Posted by Erick Erickson <er...@gmail.com>.

Did you restart Solr after the changes? Did you reindex? Because the string
type should do what you want.

And you've shown us <fieldType> definitions. What <field> are you using with
them?

Best
Erick

On Sun, Oct 31, 2010 at 1:13 PM, PeterKerk <ve...@hotmail.com> wrote:

>
> I already tried the normal string type, but that doesn't work either.
> I now use this:
>    <fieldType name="mytype" class="solr.TextField" sortMissingLast="true"
>               omitNorms="true">
>      <analyzer>
>        <tokenizer class="solr.KeywordTokenizerFactory"/>
>      </analyzer>
>    </fieldType>
>
> But that doesn't do it either... what else can I try?
>
> Thanks!

Re: indexing '-

Posted by Savvas-Andreas Moysidis <sa...@googlemail.com>.

One way to see how your tokenizer/filter chain transforms your input terms
is to use the analysis page of the Solr admin web application. This is very
handy when troubleshooting issues related to how terms are indexed.

RE: indexing '-

Posted by PeterKerk <ve...@hotmail.com>.

Guys, the "string" type did the trick :)

Thanks

RE: indexing '-

Posted by Jonathan Rochkind <ro...@jhu.edu>.

What do you actually want to do? Give an example of a string that would be found in the source document (to index), and a few queries that you want to match it (and that presumably aren't matching it with the methods you've tried, since you say "it doesn't work").

Both a string type and a text type that uses only KeywordTokenizer (with no filters, as in your example) will index exactly what is in your source document.

My guess is that you aren't happy with this because in fact you DO want tokenization, which neither of those options will get you. But you haven't given enough information for us to know what you actually want to do, and without knowing that we can't tell you why what you've tried doesn't do it, or brainstorm ways to do it differently. What "doesn't work"?
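
If both tokenized search and an exact value are wanted, one possible arrangement is to keep the analyzed field and copy its input into a second, untokenized field. This is only a sketch under assumptions, not something confirmed in this thread: the field name `city_exact` is hypothetical, and a `string` field type (a `solr.StrField`) is assumed to be defined in the schema.

```xml
<!-- analyzed field for searching, as in the original schema -->
<field name="city" type="textTight" indexed="true" stored="true"/>

<!-- hypothetical companion field holding the exact, untokenized value -->
<field name="city_exact" type="string" indexed="true" stored="false"/>

<!-- copyField copies the raw input value, before any analysis is applied -->
<copyField source="city" dest="city_exact"/>
```

Faceting or exact matching would then go against `city_exact`, while ordinary queries keep using `city`.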

Re: indexing '-

Posted by PeterKerk <ve...@hotmail.com>.

I already tried the normal string type, but that doesn't work either.
I now use this:
    <fieldType name="mytype" class="solr.TextField" sortMissingLast="true"
               omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldType>

But that doesn't do it either... what else can I try?

Thanks!

Re: indexing '-

Posted by Ken Stanley <do...@gmail.com>.

For exact text, you should try using either the string type, or a type that
only uses the KeywordTokenizer. Other field types may perform
transformations on the text similar to what you are seeing.
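
As a minimal sketch of the first option, assuming the `string` type is declared as a `solr.StrField` (as it is in the example schema.xml shipped with Solr), the city field could be defined as below. Solr needs a restart and the documents need to be reindexed before the indexed terms change.

```xml
<!-- StrField indexes the value verbatim as a single term: no tokenizer, no filters -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

<!-- city uses the untokenized type, so 's-Hertogenbosch stays one term -->
<field name="city" type="string" indexed="true" stored="true"/>
```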

- Ken