Posted to java-user@lucene.apache.org by Elie Choueiri <el...@elementn.com> on 2007/07/24 09:21:24 UTC

Multiple Languages with Lucene (Arabic & English)

Hi

 

I'm new to searching and am trying to use Lucene to search English & Arabic
documents.  I've got a bunch of questions (hopefully you'll find some
interesting!) and am hoping someone's gone through some of them and has some
answers for me!

 

First, do I have to worry about the Arabic Analyzer overwriting the index
files of the English analyzer? (Or vice versa?)

I.e., when I index documents a second time, will data be overwritten?

 

I could just store the index files for different languages in different
locations, but it would be good to know, and I'd rather not if I don't have to :)

 

Also, on the same note, if I'm indexing documents that contain both Arabic
and English, will the index files created by the English (or Arabic)
analyzer contain garbage or become corrupted because of the language
difference?

 

It is possible to index (using an English/Latin/Standard analyzer) a file
that contains both English and Arabic words, and expect searches in
English using the same analyzer to be valid, right?

 

In an Arabic document with a single English word (the name of a corporation,
for example), will the English word even be indexed and located by a search?
I could test something like this with a small subset of documents, but I
doubt the actual usefulness of a test with such a tiny (relatively
speaking!) amount of data. I know we can tell Lucene to store the full copy
of the document, but does that affect the index itself?

 

Finally, and here's the tricky one, are searches that contain both English
and Arabic words valid?  My limited understanding of the way search engines
work tells me that a search engine analyzes the context of words as well as
statistical data to decide the relevance of hits; is this still valid for
multi-lingual searches?

 

 

Thanks for any help you can provide!

 


RE: Multiple Languages with Lucene (Arabic & English)

Posted by Elie Choueiri <el...@elementn.com>.
Thanks for the clarification. I'll play around with it and come back if
things don't work according to plan.

Thanks again,
e


Re: Multiple Languages with Lucene (Arabic & English)

Posted by Erick Erickson <er...@gmail.com>.
You'll also find lots of discussion about indexing multiple
languages if you search the mail archive for terms like "multiple
language".

I think one thing you're missing is that Lucene indexes data however
you tell it to. You have both total control over and total responsibility
for how things are indexed. You could, say, determine that a document
was English, and index all of the fields like
title_english
text_english
etc., and then write that document out to your index. Say the next
document was Arabic. You could index
title_arabic
text_arabic
and write that document to your index. Now if you're searching
in Arabic, you search on the title_arabic and text_arabic fields and
you'd never get any hits on English documents.
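
For what it's worth, a minimal sketch of that per-language-field idea against
the Lucene 2.x API of the day follows. The field names, the sample text, the
index path, and the use of StandardAnalyzer as a stand-in for whatever
analyzer(s) you actually pick are all just assumptions for illustration:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class PerLanguageIndexer {

    // Route a document's text into language-specific fields.
    static Document buildDocument(String title, String body, boolean isArabic) {
        String suffix = isArabic ? "_arabic" : "_english";
        Document doc = new Document();
        doc.add(new Field("title" + suffix, title, Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("text" + suffix, body, Field.Store.NO, Field.Index.TOKENIZED));
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        // 'true' creates a fresh index in ./index, replacing any existing one.
        IndexWriter writer = new IndexWriter("index", analyzer, true);
        writer.addDocument(buildDocument("Annual report", "Revenue grew this year", false));
        writer.addDocument(buildDocument("تقرير سنوي", "نمت الإيرادات هذا العام", true));
        writer.close();
        // A search against title_arabic/text_arabic then never matches English documents.
    }
}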

Not saying you want to do it this way, but you need to be clear that
Lucene will do what you tell it to. Nothing more and nothing less <G>.

Best
Erick



Re: Multiple Languages with Lucene (Arabic & English)

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 24, 2007, at 3:21 AM, Elie Choueiri wrote:

> Hi
>
>
>
> I'm new to searching and am trying to use Lucene to search English  
> & Arabic
> documents.  I've got a bunch of questions (hopefully you'll find some
> interesting!) and am hoping someone's gone through some of them and  
> has some
> answers for me!
>
>
>
> First, do I have to worry about the Arabic Analyzer overwriting the  
> index
> files of the English analyzer? (Or vice versa?)
>
> i.e. When I index documents a second time, will data be overwritten?
>

That depends on whether you tell Lucene to create a new index or not.
See the IndexWriter API for your options.
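
To make that concrete, here is a rough sketch against the Lucene 2.x
constructors of the time; the index path and the analyzer choice are just
placeholders:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class CreateFlagDemo {
    public static void main(String[] args) throws Exception {
        // create == true: start a brand new index at this path, replacing whatever is there.
        IndexWriter fresh = new IndexWriter("/data/index", new StandardAnalyzer(), true);
        fresh.close();

        // create == false: open the existing index and append to it,
        // so previously indexed documents are kept.
        IndexWriter appending = new IndexWriter("/data/index", new StandardAnalyzer(), false);
        // appending.addDocument(...);
        appending.close();
    }
}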

>
>
> I could just store the index files for different languages in a  
> different
> location, but it's good to know and I'd rather not if I don't have  
> to :)
>
>
>
> Also, on the same note, if I'm indexing documents that contain both  
> Arabic
> and English, will the index files created by the English (or Arabic)
> analyzer contain garbage or become corrupted because of the language
> difference?
>

I don't know if it will be corrupted, but it probably won't be all that
useful, either. You may find the PerFieldAnalyzerWrapper to be helpful.
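
In case it helps, a minimal sketch of PerFieldAnalyzerWrapper as it exists in
Lucene 2.x; the field name is made up, and WhitespaceAnalyzer merely stands in
for whatever Arabic analyzer you end up using (core Lucene doesn't ship one):

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class WrapperDemo {
    public static void main(String[] args) throws Exception {
        // StandardAnalyzer is the default; fields registered below get their own analyzer.
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        wrapper.addAnalyzer("text_arabic", new WhitespaceAnalyzer()); // swap in your Arabic analyzer
        // Hand the wrapper to IndexWriter (and to the query parser at search time).
        IndexWriter writer = new IndexWriter("index", wrapper, true);
        writer.close();
    }
}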

>
>
> It is possible to index (using an English/Latin/Standard analyzer)  
> a file
> that contains both english and arabic words, and expect the  
> searches in
> English using the same analyzer to be valid, right?

I should think so.  I don't recall running across this case too much,
but I do remember the reverse: Arabic files with some English, where the
Arabic analyzer usually just skipped over the English and left it
intact, so searching for those English terms in the Arabic index worked
just fine.

>
>
>
> In an Arabic document with a single English word (the name of a  
> corporation,
> for example) will the English word even be indexed and located by a  
> search?
> I could test something like this with a small subset of documents,  
> but I
> doubt the actual usefulness of a test with such a tiny (relatively
> speaking!) amount of data.. I know we can tell Lucene to store the  
> full copy
> of the document, but does that affect the index itself?
>
>
>
> Finally, and here's the tricky one, are searches that contain both  
> English
> and Arabic words valid?  My limited understanding of the way search  
> engines
> work tells me the search analyzes the context of words as well as
> statistical data to decide the relevance of hits, is this still  
> valid for
> multi-lingual searches?
>
>

They are valid; I'm just not sure how useful, but that is for your app to
decide.  I guess if your users know both Arabic and English, it
probably isn't a big deal.  Lucene just tries to match up what is in
the query with what is in the index, so if you have validly analyzed
tokens in both the query and the index, then Lucene should find them.
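
As a rough illustration of that matching behavior with the 2.x API (the field
name, index path, and query text are made up, and this assumes the same
analyzer was used at index time and that it emits tokens for both scripts, as
StandardAnalyzer should):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class MixedQueryDemo {
    public static void main(String[] args) throws Exception {
        // Analyze the mixed English/Arabic query text with the analyzer used for indexing.
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query query = parser.parse("Apache شركة");

        IndexSearcher searcher = new IndexSearcher("index");
        Hits hits = searcher.search(query);
        System.out.println("Matches: " + hits.length());
        searcher.close();
    }
}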

HTH,
Grant

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ


