Posted to solr-user@lucene.apache.org by MitchK <mi...@web.de> on 2010/01/06 21:48:26 UTC

No Analyzer, tokenizer or stemmer works at Solr

I have tested a lot, and the whole time I thought I had set the wrong
options for my custom analyzer.
Now I have noticed that Solr isn't using ANY analyzer, filter or stemmer.
It seems to store only the original input.

I am using the example-configuration of the current Solr 1.4 release.
What's wrong?

Thank you!
-- 
View this message in context: http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27026959.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by Ryan McKinley <ry...@gmail.com>.

On Jan 7, 2010, at 12:11 PM, MitchK wrote:

>
> Thank you, Ryan. I will have a look at the Lucene material and Luke.
>
> I think I got it. :)
>
> Sometimes there will be a need to return both the stored value and
> the indexed version of that value.
> How can I fulfill such needs? By using copyField on indexed-only fields?
>

see erik's response on 'analysis request handler'
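
For what it's worth, here is a minimal sketch of how one might ask a field
analysis handler what an analyzer produces for a given value. It only builds
the request URL and does not contact a server. The base URL, the
/analysis/field path and the field name "title" are assumptions; check them
against your own solrconfig.xml and schema.xml before relying on them.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, field_name, field_value):
    """Build a URL that asks Solr to analyze field_value as field_name.

    Assumes a field analysis handler is registered at /analysis/field;
    the path depends on your solrconfig.xml.
    """
    params = urlencode({
        "analysis.fieldname": field_name,    # hypothetical field in schema.xml
        "analysis.fieldvalue": field_value,  # text to run through the index-time analyzer
    })
    return f"{base_url}/analysis/field?{params}"


# Assumed local Solr instance and an assumed "title" field.
url = field_analysis_url(
    "http://localhost:8983/solr",
    "title",
    "The Good, the Bad and the Ugly",
)
print(url)
```

Opening that URL in a browser (or fetching it with any HTTP client) should
return the token stream per analysis stage, which is the "indexed version"
of the value, without storing anything extra in the index.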


>
>
> ryantxu wrote:
>>
>>
>> On Jan 7, 2010, at 10:50 AM, MitchK wrote:
>>
>>>
>>> Eric,
>>>
>>> you mean, everything is okay, but I do not see it?
>>>
>>>>> Internally for searching the analysis takes place and writes to the
>>>>> index in an inverted fashion, but the stored stuff is left alone.
>>>
>>> If I use an analyzer, does Solr "store" its output in two ways?
>>> One public output, which is similar to the original input,
>>> and one "hidden" or internal output, which is based on the
>>> analyzer's work?
>>> Did I understand that right?
>>
>> yes.
>>
>> indexed fields and stored fields are different.
>>
>> Solr results show stored fields in the results (however facets are
>> based on indexed fields)
>>
>> Take a look at Lucene in Action for a better description of what is
>> happening.  The best tool to get your head around what is happening
>> is probably luke (http://www.getopt.org/luke/)
>>
>>
>>>
>>> If yes, I have another problem:
>>> I don't want to waste any disk space.
>>
>> You have control over what is stored and what is indexed -- how that
>> is configured is up to you.
>>
>> ryan
>>
>>
>

Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by Erick Erickson <er...@gmail.com>.

Somewhere, you have to create the document XML you
send to Solr. Just add the calculated data to
your new field there...
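
A minimal sketch of that idea, using the price example from further down
the thread: compute the extra value in the code that builds the document,
then emit it as one more field in the add XML. The field name "price_group",
the helper names and the 500 threshold are hypothetical, and the field
would still have to be declared in schema.xml.

```python
from xml.etree import ElementTree as ET


def with_price_group(item, threshold=500):
    """Return a copy of item with a computed, hypothetical 'price_group' field."""
    enriched = dict(item)
    enriched["price_group"] = "expensive" if item["price"] > threshold else "cheap"
    return enriched


def build_add_xml(doc_fields):
    """Serialize one document as a Solr <add><doc>...</doc></add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in doc_fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Example input; the values are made up for illustration.
item = {"id": "1", "title": "The Good, the Bad and the Ugly", "price": 650}
xml_payload = build_add_xml(with_price_group(item))
print(xml_payload)
```

The resulting string is what you would POST to the update handler. Whether
"price_group" is indexed, stored or both is then purely a schema decision.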

HTH
Erick

On Fri, Jan 8, 2010 at 9:30 AM, MitchK <mi...@web.de> wrote:

>
> Okay, you're right. It really would be cleaner to do this in the
> code that populates the document for Solr.
>
> Is there a way to prepare a document in the way described with Lucene/Solr,
> before it is analyzed?
> My use case is to categorize documents automatically, which
> means I have to "create" data from the given input by doing some
> information retrieval.
>
> The problem is I am really new to Solr and Lucene, as you can see, and I
> do not know whether there are classes that fit my needs.
>
> Any idea?
>
>
> Erick Erickson wrote:
> >
> > Well, I'd approach either of these use cases
> > by simply performing my computations on
> > the input and storing the result in another
> > (non-indexed unless I wanted to search it)
> > field. This wouldn't happen in the Analyzer,
> > but in the code that populated the document
> > fields.....
> >
> > Which is a much cleaner solution IMO than creating
> > some sort of "index this but store that" capability.
> > The purpose of analysis is to produce *searchable*
> > tokens after all.
> >
> > But we're getting into angels dancing on pins here. Do
> > you actually have a use case you're trying to implement
> > or is this mostly theoretical?
> >
> > Erick
> >
> > On Thu, Jan 7, 2010 at 2:08 PM, MitchK <mi...@web.de> wrote:
> >
> >>
> >> The difference between stored and indexed is clear now.
> >>
> >> You are right, if you are responding only to "normal users".
> >>
> >> Use case:
> >> You have a stored field "The good, the bad and the ugly".
> >> And you have a really fantastic analyzer, which does some magic to
> >> this movie title.
> >> Let's say the analyzer translates the title into md5 or into another
> >> abstract expression.
> >> Instead of applying the same magical function on the client side again
> >> and again, the client only needs to take the prepared data from your
> >> response.
> >>
> >> Another use case could be:
> >> Imagine you have two categories, cheap and expensive, and your
> >> document has a title, a label, an owner and a price field.
> >> Imagine you analyze, index and store them as you normally do, and
> >> afterwards you want to set whether the document belongs to the
> >> expensive item group or not.
> >> If the price of the item is higher than $500, it belongs to the
> >> expensive ones, otherwise not.
> >> I think this would be a job for a special analyzer, and this only
> >> makes sense if I also store the analyzed data.
> >>
> >> I think information retrieval is a really interesting use case.
> >>
> >>
> >> Erick Erickson wrote:
> >> >
> >> > What is your use case for "responding sometimes with the indexed
> >> > value"? Other than reconstructing a field that hasn't been stored,
> >> > I can't think of one.
> >> >
> >> > I still think you're missing the point. Indexing and storing are
> >> > orthogonal operations that have (almost) nothing to do with each
> >> > other, for all that they happen at the same time on the same field.
> >> >
> >> > You never search against the stored data in a field. You *always*
> >> > search against the indexed data.
> >> >
> >> > Contrariwise, you never display the indexed form to the user, you
> >> > *always* show the stored data (unless you come up with
> >> > a really interesting use case).
> >> >
> >> > Step back and consider what happens when you index data:
> >> > it gets broken up in all kinds of ways. Stop words are removed,
> >> > case may change, etc. It makes no sense to
> >> > then display this data to a user. Take, say, a movie title "The
> >> > Good, The Bad, and The Ugly". Remove stopwords and punctuation,
> >> > lowercase it, and you index three tokens: "good", "bad", "ugly".
> >> > Even if you reconstruct this field, the user would see
> >> > "good bad ugly". Bad, very bad.
> >> >
> >> > Yet I want to display the original title to the user in
> >> > response to searching on "ugly", so I need the
> >> > original, unanalyzed data.
> >> >
> >> > Perhaps it would help to think of it this way.
> >> > 1> take some data and index it in f1
> >> >     but do NOT store it in f1. Store it in f2
> >> >     but do NOT index it in f2.
> >> > 2> take that same data, index AND store
> >> >     it in f3.
> >> >
> >> > <1> is almost entirely equivalent to <2>
> >> > in terms of index resources.
> >> >
> >> > Practically though, <1> is harder to use,
> >> > because you have to remember
> >> > to use f1 for searching and f2 for getting
> >> > the raw data.
> >> >
> >> > HTH
> >> > Erick
> >> >
> >>
> >
> >
>