Posted to solr-user@lucene.apache.org by Mahmoud Almokadem <pr...@gmail.com> on 2015/11/09 10:48:38 UTC

Arabic analyser

Hello,

We are indexing Arabic content and have a problem tokenizing multi-term
phrases such as 'عبد الله' ('Abd Allah'). Users search for 'عبدالله'
('Abdallah') without the space and need to get the results for 'عبد الله'
with the space. We are using StandardTokenizer.

Is there any configuration that handles this case?

Thank you,
Mahmoud

Re: Arabic analyser

Posted by Mahmoud Almokadem <pr...@gmail.com>.

Thank you very much, David. It's wonderful and I will try it.

On Wed, Nov 11, 2015 at 1:37 PM, David Murgatroyd <dm...@gmail.com> wrote:

> >So BasisTech works for the latest version of solr?
>
> Yes, our latest Arabic analyzer supports up through 5.3.x. But since the
> examples you give are names, it sounds like you might instead/also want our
> fuzzy name matcher which will find "عبد الله" not only with "عبدالله" but
> also with typos like "عبالله" or even translations into 'English' like
> "abdollah". You can visit http://www.basistech.com/solutions/search/solr/
> and fill out the form there to learn more (mentioning this thread). See
> also http://www.slideshare.net/dmurga/simple-fuzzy-name-matching-in-solr
> for a talk I gave at the San Francisco Solr Meet-up in April on how it
> plugs in to Solr by creating a special field type you can query just like
> any other; this was also presented at Lucene/Solr Revolution last month (
> http://lucenerevolution.org/sessions/simple-fuzzy-name-matching-in-solr/).
>
> Best,
> David Murgatroyd
> (VP, Engineering, Basis Technology)

Re: Arabic analyser

Posted by David Murgatroyd <dm...@gmail.com>.

>So BasisTech works for the latest version of solr?

Yes, our latest Arabic analyzer supports up through 5.3.x. But since the
examples you give are names, it sounds like you might instead/also want our
fuzzy name matcher which will find "عبد الله" not only with "عبدالله" but
also with typos like "عبالله" or even translations into 'English' like
"abdollah". You can visit http://www.basistech.com/solutions/search/solr/
and fill out the form there to learn more (mentioning this thread). See
also http://www.slideshare.net/dmurga/simple-fuzzy-name-matching-in-solr
for a talk I gave at the San Francisco Solr Meet-up in April on how it
plugs in to Solr by creating a special field type you can query just like
any other; this was also presented at Lucene/Solr Revolution last month (
http://lucenerevolution.org/sessions/simple-fuzzy-name-matching-in-solr/).

Best,
David Murgatroyd
(VP, Engineering, Basis Technology)

On Wed, Nov 11, 2015 at 4:31 AM, Mahmoud Almokadem <pr...@gmail.com>
wrote:

> Thanks Alex,
>
> So BasisTech works for the latest version of solr?
>
> Sincerely,
> Mahmoud

Re: Arabic analyser

Posted by Mahmoud Almokadem <pr...@gmail.com>.

Thanks Alex,

So BasisTech works for the latest version of solr?

Sincerely,
Mahmoud

On Tue, Nov 10, 2015 at 5:28 PM, Alexandre Rafalovitch <ar...@gmail.com>
wrote:

> If this is for a significant project and you are ready to pay for it,
> BasisTech has commercial solutions in this area I believe.
>
> Regards,
>    Alex.
> ----
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/

Re: Arabic analyser

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

If this is for a significant project and you are ready to pay for it,
BasisTech has commercial solutions in this area I believe.

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 10 November 2015 at 08:46, Mahmoud Almokadem <pr...@gmail.com> wrote:
> Thanks Paul,
>
> The Arabic analyser applies its normalisation and stemming filters only to
> the single terms that come out of the standard tokenizer.
> Gathering all the synonyms will be hard work. Should I customise my Tokenizer
> to handle this case?
>
> Sincerely,
> Mahmoud

Re: Arabic analyser

Posted by Mahmoud Almokadem <pr...@gmail.com>.

Thanks Paul,

The Arabic analyser applies its normalisation and stemming filters only to
the single terms that come out of the standard tokenizer.
Gathering all the synonyms will be hard work. Should I customise my Tokenizer
to handle this case?

Sincerely,
Mahmoud
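
The limitation described above can be illustrated with a small Python sketch. It assumes a simplified subset of the folding that Arabic normalisation filters typically do (alef variants to bare alef, removal of diacritics and tatweel); FOLD_TABLE, normalize and analyze are hypothetical names, and whitespace splitting stands in for StandardTokenizer. Each token is normalised on its own, so the spaced and unspaced spellings still produce different token streams.

```python
import re
from typing import List

# Simplified, assumed subset of Arabic character folding (not Lucene's code).
FOLD_TABLE = str.maketrans({
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0649": "\u064A",  # alef maksura          -> yeh
    "\u0629": "\u0647",  # teh marbuta           -> heh
})
# Short-vowel marks and related diacritics (U+064B..U+0652) plus tatweel (U+0640).
DIACRITICS_AND_TATWEEL = re.compile("[\u064B-\u0652\u0640]")


def normalize(term: str) -> str:
    """Fold one token; neighbouring tokens are never looked at."""
    return DIACRITICS_AND_TATWEEL.sub("", term.translate(FOLD_TABLE))


def analyze(text: str) -> List[str]:
    """Whitespace split (stand-in for the tokenizer), then per-token folding."""
    return [normalize(token) for token in text.split()]


spaced = analyze("\u0623\u0628\u0648 \u0628\u0643\u0631")   # 'Abo Bakr' with a space
joined = analyze("\u0623\u0628\u0648\u0628\u0643\u0631")    # the same name without a space
print(len(spaced), len(joined))
```

The spaced form yields two normalised tokens and the joined form yields one, so no amount of per-token filtering makes them equal; something has to combine tokens.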


On Tue, Nov 10, 2015 at 3:06 PM, Paul Libbrecht <pa...@hoplahup.net> wrote:

> Mahmoud,
>
> There is an Arabic analyzer:
>   https://wiki.apache.org/solr/LanguageAnalysis#Arabic
> Doesn't it do what you describe?
> Synonyms probably work there too.
>
> Paul

Re: Arabic analyser

Posted by Paul Libbrecht <pa...@hoplahup.net>.

Mahmoud,

There is an Arabic analyzer:
  https://wiki.apache.org/solr/LanguageAnalysis#Arabic
Doesn't it do what you describe?
Synonyms probably work there too.

Paul

> Mahmoud Almokadem <ma...@gmail.com>
> 9 November 2015 17:47
> Thanks Jack,
>
> This is a good solution, but we have more combinations that I think
> can’t be handled as synonyms, such as every name that starts with
> ‘عبد’ (‘Abd’) or ‘أبو’ (‘Abo’). With the Standard tokenizer,
> ‘أبو بكر’ (‘Abo Bakr’) is tokenised into ‘أبو’ and ‘بكر’, and the
> filters are applied to each term separately.
>
> Is there a tokeniser available that tokenises ‘أبو *’ or ‘عبد *’ as a
> single term?
>
> Thanks,
> Mahmoud

Re: Arabic analyser

Posted by Mahmoud Almokadem <pr...@gmail.com>.

Thanks Jack, 

This is a good solution, but we have more combinations that I think can’t be handled as synonyms, such as every name that starts with ‘عبد’ (‘Abd’) or ‘أبو’ (‘Abo’). With the Standard tokenizer, ‘أبو بكر’ (‘Abo Bakr’) is tokenised into ‘أبو’ and ‘بكر’, and the filters are applied to each term separately.

Is there a tokeniser available that tokenises ‘أبو *’ or ‘عبد *’ as a single term?

Thanks,
Mahmoud 
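
One way to picture the tokeniser asked about here is a post-tokenisation step that glues a name particle to the word that follows it. The Python sketch below is only an illustration under that assumption: PARTICLES and merge_particles are hypothetical, and this is not an existing Solr component. In Solr the same idea would need a custom token filter, or possibly a pattern-based char filter placed before the tokenizer.

```python
from typing import List

# Hypothetical particle list; a real deployment would need a fuller one.
PARTICLES = {
    "\u0639\u0628\u062F",  # 'Abd'
    "\u0623\u0628\u0648",  # 'Abo'
}


def merge_particles(tokens: List[str], joiner: str = "") -> List[str]:
    """Join each particle with the token after it into a single term."""
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # A particle at the very end of the stream has nothing to bind to.
        if tokens[i] in PARTICLES and i + 1 < len(tokens):
            merged.append(tokens[i] + joiner + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# 'Abo Bakr' as two tokens becomes one term, equal to the unspaced spelling.
print(merge_particles(["\u0623\u0628\u0648", "\u0628\u0643\u0631"]))
```

The merged form is what a user typing the name without a space produces, so both spellings end up as the same term. A real filter would probably also keep the original tokens so that a search for the second word alone still matches.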

> On Nov 9, 2015, at 5:47 PM, Jack Krupansky <ja...@gmail.com> wrote:
> 
> Use an index-time (but not query time) synonym filter with a rule like:
> 
> Abd Allah,Abdallah
> 
> This will index the combined word in addition to the separate words.
> 
> -- Jack Krupansky

Re: Arabic analyser

Posted by Jack Krupansky <ja...@gmail.com>.

Use an index-time (but not query time) synonym filter with a rule like:

Abd Allah,Abdallah

This will index the combined word in addition to the separate words.

-- Jack Krupansky
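
For readers who want to see the effect of such a rule, here is a minimal Python sketch. It is a toy simulation, not Solr's synonym filter: the RULES table and the expand_at_index_time function are hypothetical names, terms are assumed to be lowercased already, and whitespace splitting stands in for the real tokenizer. It shows the joined spelling being indexed at the same position as the first separate word.

```python
from typing import Dict, List, Tuple

# Hypothetical rule table: a multi-word phrase mapped to its joined spelling,
# mirroring the rule "Abd Allah,Abdallah" after lowercasing.
RULES: Dict[Tuple[str, ...], str] = {
    ("abd", "allah"): "abdallah",
}


def expand_at_index_time(tokens: List[str]) -> List[Tuple[int, str]]:
    """Return (position, term) pairs; a matched phrase also emits its joined form."""
    indexed: List[Tuple[int, str]] = []
    for pos, token in enumerate(tokens):
        indexed.append((pos, token))
        for phrase, joined in RULES.items():
            # Compare the window starting at this position with the phrase.
            if tuple(tokens[pos:pos + len(phrase)]) == phrase:
                indexed.append((pos, joined))
    return indexed


print(expand_at_index_time("abd allah".split()))
```

Because the expansion happens only at index time, a query for the joined spelling stays a single token and matches the extra term, while a query for the two separate words still matches the original tokens.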

On Mon, Nov 9, 2015 at 4:48 AM, Mahmoud Almokadem <pr...@gmail.com>
wrote:

> Hello,
>
> We are indexing Arabic content and have a problem tokenizing multi-term
> phrases such as 'عبد الله' ('Abd Allah'). Users search for 'عبدالله'
> ('Abdallah') without the space and need to get the results for 'عبد الله'
> with the space. We are using StandardTokenizer.
>
> Is there any configuration that handles this case?
>
> Thank you,
> Mahmoud
>