Posted to solr-user@lucene.apache.org by MitchK <mi...@web.de> on 2010/01/06 21:48:26 UTC

No Analyzer, tokenizer or stemmer works at Solr

I have tested a lot, and the whole time I thought I had set the wrong options
for my custom analyzer.
Now I have noticed that Solr doesn't seem to be using ANY analyzer, filter or
stemmer. It looks like it only stores the original input.

I am using the example-configuration of the current Solr 1.4 release.
What's wrong?

Thank you!
-- 
View this message in context: http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27026959.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by Ryan McKinley <ry...@gmail.com>.
On Jan 7, 2010, at 12:11 PM, MitchK wrote:

>
> Thank you, Ryan. I will have a look at Lucene's material and Luke.
>
> I think I got it. :)
>
> Sometimes there will be a need to return both the stored value and the
> indexed version of the value.
> How can I fulfill such needs? By using copyField on indexed-only fields?
>

See Erik's response on the 'analysis request handler'.


>
>
> ryantxu wrote:
>>
>>
>> On Jan 7, 2010, at 10:50 AM, MitchK wrote:
>>
>>>
>>> Eric,
>>>
>>> you mean, everything is okay, but I do not see it?
>>>
>>>>> Internally for searching the analysis takes place and writes to the
>>>>> index in an inverted fashion, but the stored stuff is left alone.
>>>
>>> If I use an analyzer, Solr "stores" its output in two ways?
>>> One public output, which is similar to the original input,
>>> and one "hidden" or internal output, which is based on the
>>> analyzer's work?
>>> Did I understand that right?
>>
>> yes.
>>
>> indexed fields and stored fields are different.
>>
>> Solr results show stored fields in the results (however facets are
>> based on indexed fields)
>>
>> Take a look at Lucene in Action for a better description of what is
>> happening.  The best tool to get your head around what is happening is
>> probably luke (http://www.getopt.org/luke/)
>>
>>
>>>
>>> If yes, I have got another problem:
>>> I don't want to waste any disk space.
>>
>> You have control over what is stored and what is indexed -- how that
>> is configured is up to you.
>>
>> ryan
>>
>>


Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by Erick Erickson <er...@gmail.com>.
Somewhere, you have to create the document XML you
send to Solr. Just add the calculated data to
your new field there...

HTH
Erick
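
A minimal sketch of that suggestion, in Python: compute the derived value in
the code that builds the update XML, then send it as an ordinary field. The
field names ("id", "title", "price", "cat") and the 500 threshold are
assumptions for illustration; only the <add><doc><field name="..."> layout is
Solr's XML update format.

```python
import xml.etree.ElementTree as ET

# Hypothetical cut-off, borrowed from the "higher than 500" example in this thread.
EXPENSIVE_THRESHOLD = 500


def categorize(price):
    """Return 'expensive' for prices above the threshold, otherwise 'cheap'."""
    return "expensive" if price > EXPENSIVE_THRESHOLD else "cheap"


def build_add_xml(docs):
    """Serialize a list of dicts into an <add><doc>...</doc></add> update message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        fields = dict(doc)
        # The derived field is computed here, before the document reaches Solr.
        fields["cat"] = categorize(doc["price"])
        for name, value in fields.items():
            field_el = ET.SubElement(doc_el, "field", name=name)
            field_el.text = str(value)
    return ET.tostring(add, encoding="unicode")


xml_payload = build_add_xml(
    [{"id": "1", "title": "Harry Potter DVD collection", "price": 29.99}]
)
print(xml_payload)
```

The "cat" field would still need a matching definition in schema.xml; whether
it is indexed, stored, or both depends on whether you want to search on it,
return it, or both.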

On Fri, Jan 8, 2010 at 9:30 AM, MitchK <mi...@web.de> wrote:

>
> Okay, you're right. It really would be cleaner to do such work in the
> code that sends the document to Solr.
>
> Is there a way to prepare a document in the described way with Lucene/Solr
> before I analyze it?
> My use case is to categorize several documents automatically, which
> means that I have to "create" data from the given input by doing some
> information retrieval.
>
> The problem is that I am really new to Solr and Lucene, as you can see, and I
> do not know whether there are classes that fit my needs.
>
> Any idea?

Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by Chris Hostetter <ho...@fucit.org>.
: Imagine there is a query like "harry potter dvd-collection cheap" or "cheap
: Harry Potter dvd-collection".
: How can I arrange that, if the query mentions the category
: "cheap", Solr uses a faceting query on "cat:cheap"? To do so, I have to
: alter the original query - how can I do that?

TMTOWTDI (there's more than one way to do it).

One solution would be a QParserPlugin ... it's utilized by the 
QueryComponent to decide how to parse the query string.

Or you could write your own SearchComponent to use in place of the
QueryComponent, then you could not only modify the way the string is
parsed, but you could also modify the DocSet/DocList any way you want.


-Hoss
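
As a rough illustration of the rewriting itself (not of a QParserPlugin or
SearchComponent, which are Java classes inside Solr), here is a small Python
sketch that does the same thing on the client side: it pulls category words
out of the user's query and turns them into filter queries. The word-to-filter
mapping and the choice of the "fq" parameter are assumptions; a server-side
plugin could apply the same logic.

```python
from urllib.parse import urlencode

# Hypothetical mapping from trigger words to filter queries on a "cat" field.
CATEGORY_WORDS = {"cheap": "cat:cheap", "expensive": "cat:expensive"}


def rewrite_query(user_query):
    """Split a raw query into the remaining search terms and category filters."""
    terms, filters = [], []
    for word in user_query.split():
        fq = CATEGORY_WORDS.get(word.lower())
        if fq is None:
            terms.append(word)
        elif fq not in filters:
            filters.append(fq)
    # Fall back to match-all if the query consisted only of category words.
    params = [("q", " ".join(terms) or "*:*")]
    params += [("fq", fq) for fq in filters]
    return params


params = rewrite_query("cheap Harry Potter dvd-collection")
print(urlencode(params))
```

Both word orders from the question produce the same parameters, because the
category word is removed wherever it appears.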


Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by MitchK <mi...@web.de>.
Is there a diagram that explains which class is responsible for which
stage of processing my data on its way into the index?

My example was: I have categorized whether something is cheap or expensive.
Let's say I didn't do that on the fly, but with the help of the
UpdateRequestProcessor.
Imagine there is a query like "harry potter dvd-collection cheap" or "cheap
Harry Potter dvd-collection".
How can I arrange that, if the query mentions the category
"cheap", Solr uses a faceting query on "cat:cheap"? To do so, I have to
alter the original query - how can I do that?
 

Erik Hatcher-4 wrote:
> 
> 
> On Jan 11, 2010, at 7:33 AM, MitchK wrote:
>> Is the UpdateProcessor something that comes from Lucene itself or from
>> Solr?
> 
> It's at the Solr level -
> <http://lucene.apache.org/solr/api/org/apache/solr/update/processor/UpdateRequestProcessor.html>
> 
> 	Erik
> 
> 
> 



Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by Erik Hatcher <er...@gmail.com>.
On Jan 11, 2010, at 7:33 AM, MitchK wrote:
> Is the UpdateProcessor something that comes from Lucene itself or from
> Solr?

It's at the Solr level - <http://lucene.apache.org/solr/api/org/apache/solr/update/processor/UpdateRequestProcessor.html>

	Erik


Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by MitchK <mi...@web.de>.
Hello Hossman,

sorry for my late response.

For this specific case, you are right. It makes more sense to do such work
"on the fly".
However, at the moment I am only testing what one can and cannot do with
Solr.

Is the UpdateProcessor something that comes from Lucene itself or from
Solr?

Thanks!


hossman wrote:
> 
> 
> : Is there a way to prepare a document the described way with Lucene/Solr,
> : before I analyze it?
> : My use case is to categorize several documents in an automatic way, which
> : includes that I have to "create" data from the given input doing some
> : information retrieval.
> 
> As Ryan mentioned earlier: this is what the UpdateRequestProcessor API 
> is for -- it allows you to modify Documents (regardless of how they were 
> added: csv, xml, dih) prior to Solr processing them...
> 
> http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-to27026739.html
> 
> Personally, i think you may be looking at your problem from the wrong
> direction...
> 
> : >> Imagine you would analyze, index and store them like you normally do and
> : >> afterwards you want to set, whether the document belongs to the expensive
> : >> item-group or not.
> : >> If the price for the item is higher than 500$, it belongs to the
> : >> expensive
> : >> ones, otherwise not.
> 
> ...for a situation like that, i wouldn't attempt to "classify" the docs as
> "expensive" or "cheap" when adding them.  instead i would use numeric
> ranges for faceting and filtering to show me how many docs were
> "expensive" or "cheap" at query time -- that way when the economy tanks i
> can redefine my definition of "expensive" on the fly w/o needing to
> reindex a million documents.
> 
> 
> 
> -Hoss
> 
> 
> 



Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by Chris Hostetter <ho...@fucit.org>.
: Is there a way to prepare a document the described way with Lucene/Solr,
: before I analyze it?
: My use case is to categorize several documents in an automatic way, which
: includes that I have to "create" data from the given input doing some
: information retrieval.

As Ryan mentioned earlier: this is what the UpdateRequestProcessor API 
is for -- it allows you to modify Documents (regardless of how they were 
added: csv, xml, dih) prior to Solr processing them...

http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-to27026739.html

Personally, i think you may be looking at your problem from the wrong
direction...

: >> Imagine you would analyze, index and store them like you normally do and
: >> afterwards you want to set, whether the document belongs to the expensive
: >> item-group or not.
: >> If the price for the item is higher than 500$, it belongs to the
: >> expensive
: >> ones, otherwise not.

...for a situation like that, i wouldn't attempt to "classify" the docs as
"expensive" or "cheap" when adding them.  instead i would use numeric
ranges for faceting and filtering to show me how many docs were
"expensive" or "cheap" at query time -- that way when the economy tanks i
can redefine my definition of "expensive" on the fly w/o needing to
reindex a million documents.



-Hoss
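
The query-time approach described above could look roughly like the Python
sketch below, which only builds request parameters. The field name "price"
and the threshold are assumptions; "facet", "facet.query", "q" and "rows" are
standard Solr request parameters. Changing the definition of "expensive" then
means changing one argument rather than reindexing.

```python
from urllib.parse import urlencode


def price_facet_params(threshold, field="price"):
    """Build parameters that count cheap vs. expensive docs at query time."""
    return [
        ("q", "*:*"),
        ("rows", "0"),  # only the facet counts are wanted, not the documents
        ("facet", "true"),
        # "cheap": everything up to and including the threshold
        ("facet.query", f"{field}:[* TO {threshold}]"),
        # "expensive": everything strictly above the threshold
        ("facet.query", f"{field}:{{{threshold} TO *}}"),
    ]


facet_params = price_facet_params(500)
print(urlencode(facet_params))
```

The same range expressions can be passed as "fq" filter queries to restrict
results to one group instead of counting both.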


Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by MitchK <mi...@web.de>.
Okay, you're right. It really would be cleaner to do such work in the
code that sends the document to Solr.

Is there a way to prepare a document in the described way with Lucene/Solr
before I analyze it?
My use case is to categorize several documents automatically, which
means that I have to "create" data from the given input by doing some
information retrieval.

The problem is that I am really new to Solr and Lucene, as you can see, and I
do not know whether there are classes that fit my needs.

Any idea?


Erick Erickson wrote:
> 
> Well, I'd approach either of these use cases
> by simply performing my computations on
> the input and storing the result in another
> (non-indexed unless I wanted to search it)
> field. This wouldn't happen in the Analyzer,
> but in the code that populated the document
> fields.....
> 
> Which is a much cleaner solution IMO than creating
> some sort of "index this but store that" capability.
> The purpose of analysis is to produce *searchable*
> tokens after all.
> 
> But we're getting into angels dancing on pins here. Do
> you actually have a use case you're trying to implement
> or is this mostly theoretical?
> 
> Erick


Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by Erick Erickson <er...@gmail.com>.
Well, I'd approach either of these use cases
by simply performing my computations on
the input and storing the result in another
(non-indexed unless I wanted to search it)
field. This wouldn't happen in the Analyzer,
but in the code that populated the document
fields.....

Which is a much cleaner solution IMO than creating
some sort of "index this but store that" capability.
The purpose of analysis is to produce *searchable*
tokens after all.

But we're getting into angels dancing on pins here. Do
you actually have a use case you're trying to implement
or is this mostly theoretical?

Erick

On Thu, Jan 7, 2010 at 2:08 PM, MitchK <mi...@web.de> wrote:

>
> The difference between stored and indexed is clear now.
>
> You are right, if you are only responding to "normal users".
>
> Use case:
> You have a stored field "The good, the bad and the ugly".
> And you have a really fantastic analyzer, which does some magic to this
> movie title.
> Let's say the analyzer translates the title into an MD5 hash or into another
> abstract expression.
> Instead of running the same magical function on the client side again and
> again, the client only needs to take the prepared data from your response.
>
> Another use case could be:
> Imagine you have two categories, cheap and expensive, and your document
> has a title, a label, an owner and a price field.
> Imagine you analyze, index and store them as you normally do, and
> afterwards you want to set whether the document belongs to the expensive
> item group or not.
> If the price of the item is higher than $500, it belongs to the expensive
> ones, otherwise not.
> I think this would be a job for a special analyzer, and this only makes
> sense if I also store the analyzed data.
>
> I think information retrieval is a really interesting use case.

Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by MitchK <mi...@web.de>.
The difference between stored and indexed is clear now.

You are right, if you are only responding to "normal users".

Use case:
You have a stored field "The good, the bad and the ugly".
And you have a really fantastic analyzer, which does some magic to this
movie title.
Let's say the analyzer translates the title into an MD5 hash or into another
abstract expression.
Instead of running the same magical function on the client side again and
again, the client only needs to take the prepared data from your response.

Another use case could be:
Imagine you have two categories, cheap and expensive, and your document
has a title, a label, an owner and a price field.
Imagine you analyze, index and store them as you normally do, and
afterwards you want to set whether the document belongs to the expensive
item group or not.
If the price of the item is higher than $500, it belongs to the expensive
ones, otherwise not.
I think this would be a job for a special analyzer, and this only makes
sense if I also store the analyzed data.

I think information retrieval is a really interesting use case.


Erick Erickson wrote:
> 
> What is your use case for "responding sometimes with the indexed value"?
> Other than reconstructing a field that hasn't been stored, I can't think
> of one.
> 
> I still think you're missing the point. Indexing and storing are
> orthogonal operations that have (almost) nothing to do with each
> other, for all that they happen at the same time on the same field.
> 
> You never search against the stored data in a field. You *always*
> search against the indexed data.
> 
> Contrariwise, you never display the indexed form to the user, you
> *always* show the stored data (unless you come up with
> a really interesting use case).
> 
> Step back and consider what happens when you index data:
> it gets broken up in all kinds of ways. Stop words are removed,
> case may change, etc, etc, etc. It makes no sense to
> then display this data to a user. Take, say, the movie title
> "The Good, The Bad, and The Ugly". Remove stopwords and
> punctuation, lowercase it, and you index three tokens:
> "good", "bad", "ugly". Even if you reconstruct this field,
> the user would see "good bad ugly". Bad, very bad.
> 
> Yet I want to display the original title to the user in
> response to searching on "ugly", so I need the
> original, unanalyzed data.
> 
> Perhaps it would help to think of it this way.
> 1> take some data and index it in f1
>     but do NOT store it in f1. Store it in f2
>     but do NOT index it in f2.
> 2> take that same data, index AND store
>     it in f3.
> 
> <1> is almost entirely equivalent to <2>
> in terms of index resources.
> 
> Practically though, <1> is harder to use,
> because you have to remember
> to use f1 for searching and f2 for getting
> the raw data.
> 
> HTH
> Erick
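
The indexed/stored split quoted above can be mimicked with a toy in-memory
index. This Python sketch is not how Lucene works internally; the analyzer
(lowercasing, punctuation stripping, a tiny stop list) and the ToyIndex class
are made up for illustration. It only shows that searching goes through the
analyzed tokens while results come back from the untouched stored text.

```python
import re

# Tiny illustrative stop list; real analyzers use configurable lists.
STOPWORDS = {"the", "and", "a", "an", "of"}


def analyze(text):
    """Lowercase, drop punctuation, and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


class ToyIndex:
    def __init__(self):
        self.postings = {}  # token -> set of doc ids (the "indexed" side)
        self.stored = {}    # doc id -> original text (the "stored" side)

    def add(self, doc_id, text):
        self.stored[doc_id] = text  # stored verbatim, never analyzed
        for token in analyze(text):
            self.postings.setdefault(token, set()).add(doc_id)

    def search(self, term):
        """Match against indexed tokens, but return the stored originals."""
        hits = set()
        for token in analyze(term):
            hits |= self.postings.get(token, set())
        return [self.stored[d] for d in sorted(hits)]


index = ToyIndex()
index.add(1, "The Good, The Bad, and The Ugly")
print(analyze("The Good, The Bad, and The Ugly"))
print(index.search("ugly"))
```

Searching for "ugly" matches the indexed token, yet the caller gets the
original title back, which is why a response never shows the analyzer's
output unless it is written to a field of its own.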
> 
> On Thu, Jan 7, 2010 at 12:11 PM, MitchK <mi...@web.de> wrote:
> 
>>
>> Thank you, Ryan. I will have a look on lucene's material and luke.
>>
>> I think I got it. :)
>>
>> Sometimes there will be the need, to response on the one hand the value
>> and
>> on the other hand the indexed version of the value.
>> How can I fullfill such needs? Doing copyfield on indexed-only fields?
>>
>>
>>
>> ryantxu wrote:
>> >
>> >
>> > On Jan 7, 2010, at 10:50 AM, MitchK wrote:
>> >
>> >>
>> >> Eric,
>> >>
>> >> you mean, everything is okay, but I do not see it?
>> >>
>> >>>> Internally for searching the analysis takes place and writes to the
>> >>>> index in an inverted fashion, but the stored stuff is left alone.
>> >>
>> >> if I use an analyzer, Solr "stores" it's output two ways?
>> >> One public output, which is similar to the original input
>> >> and one "hidden" or internal output, which is based on the
>> >> analyzer's work?
>> >> Did I understand that right?
>> >
>> > yes.
>> >
>> > indexed fields and stored fields are different.
>> >
>> > Solr results show stored fields in the results (however facets are
>> > based on indexed fields)
>> >
>> > Take a look at Lucene in Action for a better description of what is
>> > happening.  The best tool to get your head around what is happening is
>> > probably luke (http://www.getopt.org/luke/)
>> >
>> >
>> >>
>> >> If yes, I have got another problem:
>> >> I don't want to waste any diskspace.
>> >
>> > You have control over what is stored and what is indexed -- how that
>> > is configured is up to you.
>> >
>> > ryan
>> >
>> >
>>
>> --
>> View this message in context:
>> http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27063452.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 



Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by Erick Erickson <er...@gmail.com>.
What is your use case for "responding sometimes with the indexed value"?
Other than reconstructing a field that hasn't been stored, I can't think of
one.

I still think you're missing the point. Indexing and storing are
orthogonal operations that have (almost) nothing to do with each
other, for all that they happen at the same time on the same field.

You never search against the stored data in a field. You *always*
search against the indexed data.

Contrariwise, you never display the indexed form to the user; you
*always* show the stored data (unless you come up with
a really interesting use case).

Step back and consider what happens when you index data:
it gets broken up in all kinds of ways. Stop words are removed,
case may change, etc., etc. It makes no sense to
then display this data to a user. Say you have a movie
title, "The Good, The Bad, and The Ugly". Remove stopwords
and punctuation, lowercase what is left, and you index
three tokens: "good", "bad", "ugly".
Even if you reconstruct this field, the user would see
"good bad ugly". Bad, very bad.

Yet I want to display the original title to the user in
response to searching on "ugly", so I need the
original, unanalyzed data.

Perhaps it would help to think of it this way.
1> take some data and index it in f1
    but do NOT store it in f1. Store it in f2
    but do NOT index it in f2.
2> take that same data, index AND store
    it in f3.

<1> is almost entirely equivalent to <2>
in terms of index resources.

Practically though, <1> is harder to use,
because you have to remember
to use f1 for searching and f2 for getting
the raw data.
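
To make the f1/f2/f3 comparison concrete, here is a toy model in Python.
It is only a sketch, not Lucene or Solr code: the ToyIndex class, the
analyze function and the stop word list are all made up for illustration,
and real analysis chains do much more than this.

```python
import re

# Illustrative stop word list (an assumption, not Solr's stopwords.txt).
STOPWORDS = {"the", "and", "a", "an", "of", "for"}


def analyze(text):
    """Toy analyzer: lowercase, strip punctuation, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


class ToyIndex:
    """Hypothetical index that keeps indexed and stored data apart."""

    def __init__(self, schema):
        # schema maps field name -> (indexed, stored)
        self.schema = schema
        self.inverted = {}  # (field, token) -> set of doc ids
        self.stored = {}    # doc id -> {field: original text}

    def add(self, doc_id, fields):
        for name, text in fields.items():
            indexed, stored = self.schema[name]
            if indexed:
                # Only the analyzed tokens go into the inverted index.
                for token in analyze(text):
                    self.inverted.setdefault((name, token), set()).add(doc_id)
            if stored:
                # The stored copy is the untouched original input.
                self.stored.setdefault(doc_id, {})[name] = text

    def search(self, field, word):
        # The query term goes through the same analysis as the data.
        hits = set()
        for token in analyze(word):
            hits |= self.inverted.get((field, token), set())
        return sorted(hits)

    def fetch(self, doc_id):
        return self.stored.get(doc_id, {})


title = "The Good, The Bad, and The Ugly"
index = ToyIndex({"f1": (True, False), "f2": (False, True), "f3": (True, True)})
index.add(1, {"f1": title, "f2": title, "f3": title})

print(analyze(title))             # tokens that end up in the index
print(index.search("f1", "ugly")) # f1 is searchable but has nothing to show
print(index.search("f2", "ugly")) # f2 is stored only, so no hits
print(index.fetch(1))             # original titles for f2 and f3 only
```

In this model, searching f1 or f3 for "ugly" finds the document, searching
f2 finds nothing, and fetching the document returns the original title
for f2 and f3 but nothing for f1.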

HTH
Erick

On Thu, Jan 7, 2010 at 12:11 PM, MitchK <mi...@web.de> wrote:

>
> Thank you, Ryan. I will have a look on lucene's material and luke.
>
> I think I got it. :)
>
> Sometimes there will be the need, to response on the one hand the value and
> on the other hand the indexed version of the value.
> How can I fullfill such needs? Doing copyfield on indexed-only fields?
>
>
>
> ryantxu wrote:
> >
> >
> > On Jan 7, 2010, at 10:50 AM, MitchK wrote:
> >
> >>
> >> Eric,
> >>
> >> you mean, everything is okay, but I do not see it?
> >>
> >>>> Internally for searching the analysis takes place and writes to the
> >>>> index in an inverted fashion, but the stored stuff is left alone.
> >>
> >> if I use an analyzer, Solr "stores" it's output two ways?
> >> One public output, which is similar to the original input
> >> and one "hidden" or internal output, which is based on the
> >> analyzer's work?
> >> Did I understand that right?
> >
> > yes.
> >
> > indexed fields and stored fields are different.
> >
> > Solr results show stored fields in the results (however facets are
> > based on indexed fields)
> >
> > Take a look at Lucene in Action for a better description of what is
> > happening.  The best tool to get your head around what is happening is
> > probably luke (http://www.getopt.org/luke/)
> >
> >
> >>
> >> If yes, I have got another problem:
> >> I don't want to waste any diskspace.
> >
> > You have control over what is stored and what is indexed -- how that
> > is configured is up to you.
> >
> > ryan
> >
> >
>
> --
> View this message in context:
> http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27063452.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>

Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by MitchK <mi...@web.de>.
Thank you, Ryan. I will have a look at the Lucene material and Luke.

I think I got it. :)

Sometimes I will need to return both the original value and the
indexed version of the value in a response.
How can I fulfill such needs? By using copyField on indexed-only fields?



ryantxu wrote:
> 
> 
> On Jan 7, 2010, at 10:50 AM, MitchK wrote:
> 
>>
>> Eric,
>>
>> you mean, everything is okay, but I do not see it?
>>
>>>> Internally for searching the analysis takes place and writes to the
>>>> index in an inverted fashion, but the stored stuff is left alone.
>>
>> if I use an analyzer, Solr "stores" it's output two ways?
>> One public output, which is similar to the original input
>> and one "hidden" or internal output, which is based on the  
>> analyzer's work?
>> Did I understand that right?
> 
> yes.
> 
> indexed fields and stored fields are different.
> 
> Solr results show stored fields in the results (however facets are  
> based on indexed fields)
> 
> Take a look at Lucene in Action for a better description of what is  
> happening.  The best tool to get your head around what is happening is  
> probably luke (http://www.getopt.org/luke/)
> 
> 
>>
>> If yes, I have got another problem:
>> I don't want to waste any diskspace.
> 
> You have control over what is stored and what is indexed -- how that  
> is configured is up to you.
> 
> ryan
> 
> 



Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by Ryan McKinley <ry...@gmail.com>.
On Jan 7, 2010, at 10:50 AM, MitchK wrote:

>
> Eric,
>
> you mean, everything is okay, but I do not see it?
>
>>> Internally for searching the analysis takes place and writes to the
>>> index in an inverted fashion, but the stored stuff is left alone.
>
> if I use an analyzer, Solr "stores" it's output two ways?
> One public output, which is similar to the original input
> and one "hidden" or internal output, which is based on the  
> analyzer's work?
> Did I understand that right?

yes.

Indexed fields and stored fields are different.

Solr shows stored fields in the results (facets, however, are
based on indexed fields).

Take a look at Lucene in Action for a better description of what is
happening. The best tool to get your head around what is happening is
probably Luke (http://www.getopt.org/luke/).


>
> If yes, I have got another problem:
> I don't want to waste any diskspace.

You have control over what is stored and what is indexed -- how that  
is configured is up to you.

ryan

Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by MitchK <mi...@web.de>.
Erik,

you mean everything is okay, but I just do not see it?

>>Internally for searching the analysis takes place and writes to the  
>>index in an inverted fashion, but the stored stuff is left alone.

So if I use an analyzer, Solr "stores" its output in two ways?
One public output, which is similar to the original input,
and one "hidden" or internal output, which is based on the analyzer's work?
Did I understand that right?

If yes, I have got another problem:
I don't want to waste any disk space. Does the copyField directive store the
same data twice?
I mean: I have got originalField and copiedField. originalField gets indexed
with text_analyzer and copiedField with a stemmer. Does this mean I am
storing the original data twice publicly and once analyzed per analyzer?
Or does Solr store the original input only once and make a reference to
the public data of the originalField?

Thank you
Mitch


Erik Hatcher-4 wrote:
> 
> Mitch,
> 
> Again, I think you're misunderstanding what analysis does.  You must  
> be expecting we think, though you've not provided exact duplication  
> steps to be sure, that the value you get back from Solr is the  
> analyzer processed output.  It's not, it's exactly what you provide.   
> Internally for searching the analysis takes place and writes to the  
> index in an inverted fashion, but the stored stuff is left alone.
> 
> There's some thinking going on implementing it such that analyzed  
> output is stored.
> 
> You can, however, use the analysis request handler componentry to get  
> analyzed stuff back as you see it in analysis.jsp on a per-document or  
> per-field text basis - if you're looking to leverage the analyzer  
> output in that fashion from a client.
> 
> 	Erik
> 
> On Jan 7, 2010, at 1:21 AM, MitchK wrote:
> 
>>
>> Hello Erick,
>>
>> thank you for answering.
>>
>> I can do whatever I want - Solr does nothing.
>> For example: If I use the textgen-fieldtype which is predefined,  
>> nothing
>> happens to the text. Even the stopFilter is not working - no  
>> stopword from
>> stopword.txt was replaced. I think, that this only affects the index,
>> because, if I query for "for" he returns nothing, which is quietly  
>> correct,
>> due to the work of the stopFilter.
>>
>> Everything works fine on analysis.jsp, but not in "reality".
>>
>> If you have got any testcase-data you want me to add, please, tell  
>> me and I
>> will show you the saved data afterwards.
>>
>> Thank you.
>>
>> Mitch
>>
>>
>> Erick Erickson wrote:
>>>
>>> <<<Well, I have noticed that Solr isn't using ANY analyzer>>>
>>>
>>> How do you know this? Because it's highly unlikely that SOLR
>>> is completely broken on that level.....
>>>
>>> Erick
>>>
>>> On Wed, Jan 6, 2010 at 3:48 PM, MitchK <mi...@web.de> wrote:
>>>
>>>>
>>>> I have tested a lot and all the time I thought I set wrong options  
>>>> for my
>>>> custom analyzer.
>>>> Well, I have noticed that Solr isn't using ANY analyzer, filter or
>>>> stemmer.
>>>> It seems like it only stores the original input.
>>>>
>>>> I am using the example-configuration of the current Solr 1.4  
>>>> release.
>>>> What's wrong?
>>>>
>>>> Thank you!
>>>> --
>>>> View this message in context:
>>>> http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27026959.html
>>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>>
>>>>
>>>
>>>
>>
>> -- 
>> View this message in context:
>> http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27055510.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
> 
> 
> 



Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by Erik Hatcher <er...@gmail.com>.
Mitch,

Again, I think you're misunderstanding what analysis does. We think you
must be expecting (though you've not provided exact reproduction steps, so
we can't be sure) that the value you get back from Solr is the
analyzer-processed output. It's not; it's exactly what you provided.
Internally for searching the analysis takes place and writes to the
index in an inverted fashion, but the stored stuff is left alone.

There's some thinking going on about implementing it such that analyzed
output is stored.

You can, however, use the analysis request handler componentry to get  
analyzed stuff back as you see it in analysis.jsp on a per-document or  
per-field text basis - if you're looking to leverage the analyzer  
output in that fashion from a client.
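
As a rough sketch of what such a client call might look like, the Python
below only builds the request URL; it does not contact a server. It
assumes the field analysis handler is registered at /analysis/field (check
your solrconfig.xml), that Solr runs at http://localhost:8983/solr, and
that the handler accepts the analysis.fieldtype and analysis.fieldvalue
parameters. The function name field_analysis_url is made up for this
example.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, field_type, value, handler="/analysis/field"):
    """Build a URL asking Solr to analyze `value` with a given field type.

    The handler path and parameter names are assumptions about the
    field analysis request handler; verify them against your setup.
    """
    params = {
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": value,
    }
    return base_url.rstrip("/") + handler + "?" + urlencode(params)


url = field_analysis_url(
    "http://localhost:8983/solr",       # assumed base URL
    "textgen",                          # field type mentioned in this thread
    "The Good, The Bad, and The Ugly",  # sample text to analyze
)
print(url)
```

Fetching that URL (for example from a browser) should return the token
stream per analysis stage, much like analysis.jsp shows it, provided the
handler is configured as assumed.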

	Erik

On Jan 7, 2010, at 1:21 AM, MitchK wrote:

>
> Hello Erick,
>
> thank you for answering.
>
> I can do whatever I want - Solr does nothing.
> For example: If I use the textgen-fieldtype which is predefined,  
> nothing
> happens to the text. Even the stopFilter is not working - no  
> stopword from
> stopword.txt was replaced. I think, that this only affects the index,
> because, if I query for "for" he returns nothing, which is quietly  
> correct,
> due to the work of the stopFilter.
>
> Everything works fine on analysis.jsp, but not in "reality".
>
> If you have got any testcase-data you want me to add, please, tell  
> me and I
> will show you the saved data afterwards.
>
> Thank you.
>
> Mitch
>
>
> Erick Erickson wrote:
>>
>> <<<Well, I have noticed that Solr isn't using ANY analyzer>>>
>>
>> How do you know this? Because it's highly unlikely that SOLR
>> is completely broken on that level.....
>>
>> Erick
>>
>> On Wed, Jan 6, 2010 at 3:48 PM, MitchK <mi...@web.de> wrote:
>>
>>>
>>> I have tested a lot and all the time I thought I set wrong options  
>>> for my
>>> custom analyzer.
>>> Well, I have noticed that Solr isn't using ANY analyzer, filter or
>>> stemmer.
>>> It seems like it only stores the original input.
>>>
>>> I am using the example-configuration of the current Solr 1.4  
>>> release.
>>> What's wrong?
>>>
>>> Thank you!
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27026959.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>
> -- 
> View this message in context: http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27055510.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by MitchK <mi...@web.de>.
Hello Erick,

thank you for answering.

I can do whatever I want - Solr does nothing.
For example: if I use the predefined textgen field type, nothing
happens to the text. Even the StopFilter seems not to work - no stopword from
stopword.txt was removed. I think that this only affects the index,
because if I query for "for", it returns nothing, which is quite correct,
due to the work of the StopFilter.

Everything works fine on analysis.jsp, but not in "reality".

If you have any test data you want me to add, please tell me and I
will show you the stored data afterwards.

Thank you.

Mitch


Erick Erickson wrote:
> 
> <<<Well, I have noticed that Solr isn't using ANY analyzer>>>
> 
> How do you know this? Because it's highly unlikely that SOLR
> is completely broken on that level.....
> 
> Erick
> 
> On Wed, Jan 6, 2010 at 3:48 PM, MitchK <mi...@web.de> wrote:
> 
>>
>> I have tested a lot and all the time I thought I set wrong options for my
>> custom analyzer.
>> Well, I have noticed that Solr isn't using ANY analyzer, filter or
>> stemmer.
>> It seems like it only stores the original input.
>>
>> I am using the example-configuration of the current Solr 1.4 release.
>> What's wrong?
>>
>> Thank you!
>> --
>> View this message in context:
>> http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27026959.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 



Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by Erick Erickson <er...@gmail.com>.
<<<Well, I have noticed that Solr isn't using ANY analyzer>>>

How do you know this? It's highly unlikely that Solr
is completely broken at that level...

Erick

On Wed, Jan 6, 2010 at 3:48 PM, MitchK <mi...@web.de> wrote:

>
> I have tested a lot and all the time I thought I set wrong options for my
> custom analyzer.
> Well, I have noticed that Solr isn't using ANY analyzer, filter or stemmer.
> It seems like it only stores the original input.
>
> I am using the example-configuration of the current Solr 1.4 release.
> What's wrong?
>
> Thank you!
> --
> View this message in context:
> http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27026959.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>

Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by MitchK <mi...@web.de>.
Hello Ryan,

thank you for answering.

In my schema.xml I am defining the field with indexed="true".
The problem is that nothing works, not even the original predefined
analyzers.
Please have a look at my response to Erick.

Mitch

P.S.
Oh, I see what you mean. The field is indexed="true". My wording was a
little bit unclear ;).


ryantxu wrote:
> 
> 
> On Jan 6, 2010, at 3:48 PM, MitchK wrote:
> 
>>
>> I have tested a lot and all the time I thought I set wrong options  
>> for my
>> custom analyzer.
>> Well, I have noticed that Solr isn't using ANY analyzer, filter or  
>> stemmer.
>> It seems like it only stores the original input.
> 
> The stored value is always the original input.
> 
> The *indexed* values are transformed by analysis.
> 
> If you really need to store the analyzed fields, that may be possible  
> with an UpdateRequestProcessor.  also see:
> https://issues.apache.org/jira/browse/SOLR-314
> 
> ryan
> 
> 



Re: No Analyzer, tokenizer or stemmer works at Solr

Posted by Ryan McKinley <ry...@gmail.com>.
On Jan 6, 2010, at 3:48 PM, MitchK wrote:

>
> I have tested a lot and all the time I thought I set wrong options  
> for my
> custom analyzer.
> Well, I have noticed that Solr isn't using ANY analyzer, filter or  
> stemmer.
> It seems like it only stores the original input.

The stored value is always the original input.

The *indexed* values are transformed by analysis.

If you really need to store the analyzed fields, that may be possible
with an UpdateRequestProcessor. Also see:
https://issues.apache.org/jira/browse/SOLR-314

ryan