
Posted to solr-user@lucene.apache.org by Patrick Sauts <pa...@gmail.com> on 2011/09/10 03:24:35 UTC

Stemming and other tokenizers

Hello,

 

I want to implement some kind of auto-stemming that detects the language
of a field from a tag at the start of the field value, such as #en#. The
field is stored on disk, but I don't want the tag itself to be stored. Is
there a way to keep the tag out of the stored value?

As I understand it, filters and tokenizers act only on the indexed value
and not on the stored one.

Am I wrong?

Would it be possible to write such a filter?

 

Patrick.
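
One way to keep the tag out of the stored value is to strip it on the client before the document is sent to Solr, since analysis does not change what is stored. The following is a minimal sketch of that idea only. The tag format (a two-letter code between '#' characters at the very start of the value) is inferred from the "#en#" example, and the function name is hypothetical.

```python
import re

# Assumed tag format: "#xx#" at the very start of the value, where xx is a
# two-letter lowercase language code (inferred from the "#en#" example).
TAG_RE = re.compile(r"^#([a-z]{2})#\s*")


def split_language_tag(value, default_lang=None):
    """Return (language, text_without_tag) for a possibly tagged value."""
    match = TAG_RE.match(value)
    if match is None:
        # No tag found: leave the text untouched and report the default.
        return default_lang, value
    return match.group(1), value[match.end():]


if __name__ == "__main__":
    lang, text = split_language_tag("#en# The quick brown fox")
    print(lang, "|", text)
```

The returned language code can then be used to choose a field, and only the untagged text is ever indexed or stored.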

Re: Stemming and other tokenizers

Posted by Pranav Prakash <pr...@gmail.com>.

I have a similar use case, but a slightly more flexible and straightforward
one. In my case, I have a field "language" which stores 'en', 'es' or
whatever the language of the document is. The field 'transcript' then stores
the actual content, which is in the language given by the language field.
Following up on the conversation, is this how I am supposed to proceed?

   1. Create one field type in my schema per supported language. This would
   mean creating ~30 fields.
   2. Since I already know the language of my content, I can skip SOLR-1979
   (which is expected in Solr 3.5).

The point where I am unclear is: how do I specify, at index time, which
field to use for a certain language?
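
When the language is already known, one option is to choose the target field on the client before indexing. The sketch below only illustrates that routing step. The field naming convention (transcript_en, transcript_es, ...), the fallback field and the function name are assumptions, not something the thread or Solr prescribes.

```python
# Assumed convention: one field per language, e.g. transcript_en, each bound
# in the schema to a fieldType with that language's analysis chain.
SUPPORTED = {"en", "es", "no"}          # illustrative subset only
FALLBACK_FIELD = "transcript_general"   # hypothetical catch-all field


def route_by_language(doc, source="transcript", lang_field="language"):
    """Copy doc, moving `source` into a language-specific field."""
    routed = {k: v for k, v in doc.items() if k != source}
    if source not in doc:
        return routed
    lang = doc.get(lang_field)
    target = f"{source}_{lang}" if lang in SUPPORTED else FALLBACK_FIELD
    routed[target] = doc[source]
    return routed


if __name__ == "__main__":
    print(route_by_language(
        {"id": "1", "language": "es", "transcript": "hola mundo"}))
```

The routed dictionary is what would be sent to Solr, so the schema needs a matching field (or dynamic field) for every supported language code.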

*Pranav Prakash*

"temet nosce"

Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
Google <http://www.google.com/profiles/pranny>


On Mon, Sep 12, 2011 at 20:55, Jan Høydahl <ja...@cominvent.com> wrote:

> Hi,
>
> Do they? Can you explain the layout of the documents?
>
> There are two ways to handle multilingual docs. If all your docs have both
> an English and a Norwegian version, you may either split these into two
> separate documents, each with the "language" field filled by LangId - which
> then also lets you filter by language. Or you may assign a title_en and
> title_no to the same document (expand with more fields if you have more
> languages per document), and keep it as one document. Your client will then
> be adapted to search the language(s) that the user wants.
>
> If one document has multiple languages within the same field, e.g. "body",
> say one paragraph of English and the next is Norwegian, then we currently do
> not have any capability in Solr to apply different analysis (tokenization,
> stemming etc) to each paragraph.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>

Re: Stemming and other tokenizers

Posted by Jan Høydahl <ja...@cominvent.com>.

Hi,

Do they? Can you explain the layout of the documents? 

There are two ways to handle multilingual docs. If all your docs have both an English and a Norwegian version, you may either split these into two separate documents, each with the "language" field filled by LangId, which then also lets you filter by language. Or you may assign a title_en and title_no to the same document (expand with more fields if you have more languages per document), and keep it as one document. Your client will then need to be adapted to search the language(s) that the user wants.

If one document has multiple languages within the same field, e.g. "body", say one paragraph of English and the next is Norwegian, then we currently do not have any capability in Solr to apply different analysis (tokenization, stemming etc) to each paragraph.
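
The two layouts described above can be sketched as plain document dictionaries. This is only an illustration: the id suffix scheme and the helper names are hypothetical. The string built by query_fields could, for example, be passed as the qf parameter of a dismax-style query, which is one possible way for a client to search only the languages the user wants.

```python
def split_per_language(doc_id, titles):
    """Layout 1: one document per language version, tagged with `language`."""
    return [
        {"id": f"{doc_id}_{lang}", "language": lang, "title": title}
        for lang, title in sorted(titles.items())
    ]


def merge_per_language(doc_id, titles):
    """Layout 2: a single document with one title_<lang> field per language."""
    doc = {"id": doc_id}
    for lang, title in titles.items():
        doc[f"title_{lang}"] = title
    return doc


def query_fields(langs, base="title"):
    """Space-separated list of the fields to search for the given languages."""
    return " ".join(f"{base}_{lang}" for lang in langs)


if __name__ == "__main__":
    titles = {"en": "Hello", "no": "Hei"}
    print(split_per_language("42", titles))
    print(merge_per_language("42", titles))
    print(query_fields(["en", "no"]))
```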

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 12. sep. 2011, at 11:37, Manish Bafna wrote:

> What if a single document has multiple languages?

Re: Stemming and other tokenizers

Posted by Manish Bafna <ma...@gmail.com>.

What if a single document has multiple languages?

On Mon, Sep 12, 2011 at 2:23 PM, Jan Høydahl <ja...@cominvent.com> wrote:

> Hi
>
> Everybody else uses a dedicated field per language, so why can't you?
> Please explain your use case, and perhaps we can better help understand
> what you're trying to do.
> Do you always know the query language in advance?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com

Re: Stemming and other tokenizers

Posted by Jan Høydahl <ja...@cominvent.com>.

Hi

Everybody else uses a dedicated field per language, so why can't you?
Please explain your use case, and perhaps we can better help understand what you're trying to do.
Do you always know the query language in advance?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 12. sep. 2011, at 08:28, Patrick Sauts wrote:

> I can't create one field per language; that is the problem. But I'll dig
> into it following your suggestions.
> I'll let you know what I come up with.
> 
> Patrick.

Re: Stemming and other tokenizers

Posted by Patrick Sauts <pa...@gmail.com>.

I can't create one field per language; that is the problem. But I'll dig
into it following your suggestions.
I'll let you know what I come up with.

Patrick.

2011/9/11 Jan Høydahl <ja...@cominvent.com>

> Hi,
>
> You'll not be able to detect language and change stemmer on the same field
> in one go. You need to create one fieldType in your schema per language you
> want to use, and then use LanguageIdentification (SOLR-1979) to do the magic
> of detecting language and renaming the field. If you set
> langid.override=false, langid.map=true and populate your "language" field
> with the known language, you will probably get the desired effect.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com

Re: Stemming and other tokenizers

Posted by Jan Høydahl <ja...@cominvent.com>.

Hi,

You won't be able to detect the language and change the stemmer on the same field in one go. You need to create one fieldType in your schema per language you want to use, and then use LanguageIdentification (SOLR-1979) to do the magic of detecting the language and renaming the field. If you set langid.override=false, langid.map=true and populate your "language" field with the known language, you will probably get the desired effect.
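
To make the described flow concrete, here is a rough Python imitation of it: keep a language that is already present, otherwise detect one, then rename the field so that it picks up a per-language fieldType. This is not the SOLR-1979 code and does not call Solr. The function names, the detect callback and the "text" field name are all assumptions made for illustration, and the exact parameter names in a released Solr should be checked against its documentation.

```python
def identify_and_map(doc, detect, field="text", lang_field="language",
                     overwrite=False):
    """Imitate 'detect unless known, then rename field to field_<lang>'."""
    out = dict(doc)
    # With overwrite=False a language supplied by the client is trusted and
    # detection runs only when the language field is missing or empty.
    if overwrite or not out.get(lang_field):
        out[lang_field] = detect(out[field])
    # Rename e.g. "text" to "text_en" so the schema can apply English analysis.
    out[f"{field}_{out[lang_field]}"] = out.pop(field)
    return out


def fake_detect(text):
    """Placeholder for a real language detector; always answers 'en'."""
    return "en"


if __name__ == "__main__":
    print(identify_and_map({"id": "1", "language": "no", "text": "hei"},
                           fake_detect))
    print(identify_and_map({"id": "2", "text": "hello"}, fake_detect))
```

In the first call the known language "no" wins and the text ends up in text_no. In the second call the stand-in detector fills in the language.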

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 10. sep. 2011, at 03:24, Patrick Sauts wrote:

> Hello,
> 
> 
> 
> I want to implement some kind of auto-stemming that detects the language
> of a field from a tag at the start of the field value, such as #en#. The
> field is stored on disk, but I don't want the tag itself to be stored. Is
> there a way to keep the tag out of the stored value?
> 
> As I understand it, filters and tokenizers act only on the indexed value
> and not on the stored one.
> 
> Am I wrong?
> 
> Would it be possible to write such a filter?
> 
> 
> 
> Patrick.
>