Posted to java-user@lucene.apache.org by bhecht <bh...@ams-sys.com> on 2007/05/07 10:02:24 UTC

Multi language indexing

Hello all,

I need to index a table containing company details (name, address, city ...
country).
Each record contains data written in the language appropriate to the record's
country.
I was thinking of indexing each record using an analyzer chosen according to
the record's country value.
Then, when searching, I would again use the appropriate analyzer according to
the entered country.
This means I index and search using the same analyzer.

I was interested to know whether this is the way to go.

I am trying to implement this using "Hibernate Search", and it seems to be a
problem to change analyzers according to a specific value in the record being
indexed.

Before I break my head understanding how this can be implemented, I wanted to
know whether this approach is correct.

Thanks in advance.

-- 
View this message in context: http://www.nabble.com/Multi-language-indexing-tf3702402.html#a10353549
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Is it necessary to optimize?

Posted by Grant Ingersoll <gs...@apache.org>.
The contrib/benchmark addition can help you characterize many of  
these scenarios, especially if you write a DocMaker and QueryMaker  
for your collection.

On May 8, 2007, at 5:30 AM, Stadler Hans-Christian wrote:

> If mergeFactor is set to 2 and no optimize() is ever done on the index,
> what is the impact on
>
> 1) the number of opened files during indexing
> 2) the number of opened files during searching
> 3) the search speed
> 4) the indexing speed
>
> ??
>
> HC
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Is it necessary to optimize?

Posted by "Aleksander M. Stensby" <al...@integrasco.no>.
I would say that, over time, the number of files will grow, and will continue
growing if you never perform an optimize(). After some very helpful mails
from Erick I settled on a mergeFactor of 30, and since I do the indexing in
large batches, I perform an optimize() only at the end of the indexing run.
This works really well and reduced the indexing time greatly!

As to your points, the number of open files is closely related to the
merge factor, since:
"With the default value of 10, Lucene will store 10 documents in memory  
before writing them to a single segment on the disk. The mergeFactor value  
of 10 also means that once the number of segments on the disk has reached  
the power of 10, Lucene will merge these segments into a single segment."

It is also a fact that a higher mergeFactor means less file I/O, hence
faster indexing.
"MergeFactor - Determines the minimal number of documents required before  
the buffered in-memory documents are merged and a new Segment is created.  
Since Documents are merged in a RAMDirectory, large value gives faster  
indexing. "

The downside is that when using too large a mergeFactor you may hit the "Too
many open files" exception.
"For instance, with a default mergeFactor of 10 and an index of 1 million  
documents, Lucene will require 110 open files on an unoptimized index.  
When IndexWriter's optimize() method is called, all segments are merged
into a single segment, which minimizes the number of open files that  
Lucene needs."

"using a higher value for mergeFactor will cause Lucene to use more RAM,  
but will let Lucene write data to disk less frequently, which will speed  
up the indexing process. A smaller mergeFactor will use less memory and  
will cause the index to be updated more frequently, which will make it  
more up-to-date, but will also slow down the indexing process. Similarly,  
a larger maxMergeDocs is better suited for batch indexing, and a smaller  
maxMergeDocs is better for more interactive indexing."
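
To make that concrete, here is a minimal sketch of such a batch-indexing run,
assuming the Lucene 2.x IndexWriter API; the index path, batch size and tuning
values below are just placeholders, not recommendations:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.getDirectory("/tmp/index");   // placeholder path
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        // Batch-indexing tuning: merge less often during the run,
        // then pay the merge cost once with optimize() at the end.
        writer.setMergeFactor(30);         // higher value -> fewer merges, more segments/open files
        writer.setMaxBufferedDocs(1000);   // buffer more documents in RAM before flushing a segment

        for (int i = 0; i < 100000; i++) { // placeholder batch size
            Document doc = new Document();
            doc.add(new Field("body", "document text " + i,
                              Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }

        writer.optimize();                 // merge everything into a single segment, once
        writer.close();
    }
}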

- Aleksander

On Tue, 08 May 2007 11:30:00 +0200, Stadler Hans-Christian  
<ha...@psi.ch> wrote:

> If mergeFactor is set to 2 and no optimize() is ever done on the index,
> what is the impact on
>
> 1) the number of opened files during indexing
> 2) the number of opened files during searching
> 3) the search speed
> 4) the indexing speed
>
> ??
>
> HC
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



-- 
Aleksander M. Stensby
Software Developer
Integrasco A/S
aleksander.stensby@integrasco.no

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Is it necessary to optimize?

Posted by Stadler Hans-Christian <ha...@psi.ch>.
If mergeFactor is set to 2 and no optimize() is ever done on the index,
what is the impact on

1) the number of opened files during indexing
2) the number of opened files during searching
3) the search speed
4) the indexing speed

??

HC

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Multi language indexing

Posted by bhecht <bh...@ams-sys.com>.
Hi Doron,

Thank you very much for your time and for the detailed explanations. This is
exactly what I meant and I am happy to see I understood correctly.

I am now using the Snowball analyzer, which seems to work very well.

Thanks again and good day,

Barak Hecht.



-- 
View this message in context: http://www.nabble.com/Multi-language-indexing-tf3702402.html#a10372286
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Multi language indexing

Posted by Doron Cohen <DO...@il.ibm.com>.
bhecht <bh...@ams-sys.com> wrote on 07/05/2007 10:26:27:

> I have implemented my own analyzer for each country.
> So as I see it, when I index these records, I want to
> provide Lucene with a specific analyzer per record
> I'm indexing.
>
> When a user performs a query in my JSF form, I will
> use the country value he entered, to get the needed
> analyzer, and query lucene with the users query and
> the needed analyzer.
>
> The user may also choose not to enter a country value
> to his search, and here comes in the solution you gave
> me, to duplicate each field, and index it using a non
> stemming analyzer (A standard analyzer without stop
> words defined).
>
> Am I going in the right direction?

Sounds OK to me, except that there seems to be a mix-up
between stemming and stop-word elimination. Perhaps it is
just a typo in the above text, but anyhow: while the
StandardAnalyzer constructor takes a stop-words list
parameter and would eliminate those words (e.g. "is"),
it would not do stemming (e.g. "knives" --> "knive").
(Though both a stop list and a stemming algorithm
are language specific.)
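
To make the distinction concrete, here is a minimal sketch of a per-language
analyzer, assuming the Lucene 2.x analysis API and the contrib Snowball filter;
the class name and the stop-word list are just placeholders. Stop-word
elimination and stemming are two separate filter steps:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Hypothetical per-language analyzer: stop-word removal and stemming are
// two separate, language-specific filter steps.
public class EnglishStemmingAnalyzer extends Analyzer {
    private static final String[] STOP_WORDS = { "is", "the", "a", "an" }; // placeholder list

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(reader);
        stream = new StandardFilter(stream);
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(stream, STOP_WORDS);     // drops "is", "the", ...
        stream = new SnowballFilter(stream, "English");  // stems "knives" --> "knive"
        return stream;
    }
}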

So, rephrasing the discussion so far, assuming:

1) a single field "F" (for simplicity),
2) (doc) language always known at indexing
3) (user) language sometimes known at search

I think a reasonable solution might be:

1) use PerFieldAnalyzerWrapper
2) index each doc to F and to F_LAN
3) F would be language neutral - no
   stemming and no stop words elimination
4) F_LAN (e.g. F_en) would be language specific,
   so a specific language stopwords list would be
   used, and a specific stemmer would be used.
5) Search would go to F_LAN when the language is
   known and to F when the language is not known,
   using language specific analysis as while indexing.
6) Note Karl's suggestion of searching both F and F_LAN,
   assigning a higher boost to F_LAN. This is useful when
   there is some uncertainty about the "marked language".

There can be other considerations - for instance (1) the
certainty of language id; (2) fallback to English when the
language is unknown...

Note that SnowballFilter can be used for applying
stemming on the output of StandardAnalyzer.
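
A rough sketch of points 1-4, assuming the Lucene 2.x PerFieldAnalyzerWrapper
and the contrib SnowballAnalyzer; the WhitespaceAnalyzer standing in for the
"language neutral" analysis, the stop list and the index path are only
illustrative:

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class DualFieldIndexer {
    public static void main(String[] args) throws Exception {
        // F    : language neutral - no stemming, no stop-word elimination
        // F_en : English specific - Snowball stemming plus an English stop list
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer());
        analyzer.addAnalyzer("F_en",
            new SnowballAnalyzer("English", new String[] { "is", "the", "a" })); // placeholder stop list

        IndexWriter writer = new IndexWriter("/tmp/index", analyzer, true); // placeholder path

        // The same text goes into both fields; the wrapper picks the
        // analyzer by field name at indexing time.
        String text = "Knives and forks";
        Document doc = new Document();
        doc.add(new Field("F",    text, Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("F_en", text, Field.Store.NO,  Field.Index.TOKENIZED));
        writer.addDocument(doc);

        writer.optimize();
        writer.close();
    }
}

At search time the query would then be parsed against F_en with the same
English analyzer when the language is known, and against F with the neutral
analyzer otherwise.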

Doron



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Multi language indexing

Posted by bhecht <bh...@ams-sys.com>.
Sorry,

I didn't understand that I need to use the PerFieldAnalyzerWrapper for this
task, and tried to index the document twice.
Sorry for the previous post.
Thanks for the great help.

But since you asked, I will be happy to explain what my goal is, and
maybe see if I'm approaching this correctly:

I have a database table containing records of company information, like
company name, address, city, state ... country.
The companies' information may be written in different languages, but I can
determine the language from the country field of each record (an
exception to this is countries that use more than one language).

I have a JSF form containing input fields for each column, so users can
search for companies.
I have my own metadata (stop words ...) and matching algorithms for each
different country, which I want to use during the analysis process of
Lucene. I have implemented my own analyzer for each country.
So as I see it, when I index these records, I want to provide Lucene with a
specific analyzer per record I'm indexing.
When a user performs a query in my JSF form, I will use the country value he
entered to get the needed analyzer, and query Lucene with the user's query
and that analyzer.
The user may also choose not to enter a country value for his search, and
here comes in the solution you gave me: to duplicate each field and index
it using a non-stemming analyzer (a standard analyzer without stop words
defined).
Then, with no country entered in a search, I will use the non-stemming
analyzer.

Am I going in the right direction?

-- 
View this message in context: http://www.nabble.com/Multi-language-indexing-tf3702402.html#a10361747
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Multi language indexing

Posted by karl wettin <ka...@gmail.com>.
On 7 May 2007, at 15:45, bhecht wrote:

>
> OK, thanks, I think I got it.
>
> Just to see if I understood correctly:
>
> When I do the search on both stemmed and unstemmed fields, I will do the
> following:
>
> 1) If I know the country of the requested search - I will use the stemmed
> analyzer, and then the unstemmed field might not be found (the stemmed
> field will be found).
>
> 2) if I don't know the country of the requested search - I will use the
> unstemmed analyzer, and then the stemmed field might not be found.
>
> Am I correct?

Above sounds very confused and I'm afraid you got it all mixed up.  
Please explain in detail what your data looks like and what effect  
you are looking for. It will make it easier for all parties.

I have no idea how Hibernate Search works; perhaps that is why I do not
understand what you are trying to do.


-- 
karl

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Multi language indexing

Posted by bhecht <bh...@ams-sys.com>.
OK, thanks, I think I got it.

Just to see if I understood correctly:

When I do the search on both stemmed and unstemmed fields, I will do the
following:

1) If I know the country of the requested search - I will use the stemmed
analyzer, and then the unstemmed field might not be found (the stemmed
field will be found).

2) if I don't know the country of the requested search - I will use the
unstemmed analyzer, and then the stemmed field might not be found.

Am I correct?
-- 
View this message in context: http://www.nabble.com/Multi-language-indexing-tf3702402.html#a10357611
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Multi language indexing

Posted by karl wettin <ka...@gmail.com>.
On 7 May 2007, at 13:27, bhecht wrote:

> The last option seems to be the right one for me, using a stemmed and
> unstemmed field.
> I assume that by "unstemmed" you mean indexing the field using the
> UN_TOKENIZED parameter.

No, I mean TOKENIZED, but not using a stemmer analyzer.
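
In other words, something like this (a minimal sketch assuming the Lucene 2.x
Field API; the field names and the helper are only illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldSetup {
    // Both fields are TOKENIZED, i.e. run through an analyzer; the difference
    // between them is only which analyzer handles each field at index time.
    public static Document makeDoc(String text) {
        Document doc = new Document();
        doc.add(new Field("stemmed_text",   text, Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("unstemmed_text", text, Field.Store.NO, Field.Index.TOKENIZED));
        // UN_TOKENIZED would skip analysis entirely and index the whole value
        // as a single term, which is not what you want here.
        return doc;
    }
}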

-- 
karl

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Multi language indexing

Posted by bhecht <bh...@ams-sys.com>.
OK, thanks for the reply.
The last option seems to be the right one for me, using a stemmed and
unstemmed field.
I assume that by "unstemmed" you mean indexing the field using the
UN_TOKENIZED parameter.

Now my problem starts when trying to implement this with "Hibernate
Search", which allows only one analyzer to be defined.

Thanks, I will now post my problem in the Hibernate Search forum.

Good day.
-- 
View this message in context: http://www.nabble.com/Multi-language-indexing-tf3702402.html#a10355770
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Multi language indexing

Posted by karl wettin <ka...@gmail.com>.
On 7 May 2007, at 12:16, bhecht wrote:

> My question regarding "the way to go" was whether it is a good solution
> to index the contents of a table using more than one analyzer, determining
> the analyzer by the country value of each record.

I'm not sure what you mean, but I'll try.

Are you asking whether it makes sense to stem text based on the language of
the text and put it in the same field, no matter what language it is?

For the record, it usually makes very little sense to search in text  
stemmed for one language with a query stemmed for another language.  
This is what you will do if you store the stemmed text, no matter the  
language, in the same field. You could add another field called  
"language_iso" and add a boolean clause, but that would just be  
overkill and will increase the response time.

In essence, it depends on your needs. For instance, are users supposed to
find documents written in languages other than the language specified? Do
you want to limit searches to a content language?

My guess is that you probably want to index unstemmed text in
"unstemmed_text" and stemmed text in a language-specific field such as
"stemmed_text_[language iso]", then query the unstemmed field and the
user's language-specific field when searching, boosting the stemmed
field.
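
Roughly like this, as a sketch only - assuming the Lucene 2.x query API and
the contrib SnowballAnalyzer; the field names follow the naming above, while
the analyzers, the boost value and the index path are only illustrative:

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class BoostedSearch {
    public static void main(String[] args) throws Exception {
        String userInput = "knives";   // placeholder user query

        // Unstemmed field, analyzed the same neutral way it was indexed.
        Query unstemmed = new QueryParser("unstemmed_text",
                new WhitespaceAnalyzer()).parse(userInput);

        // Language-specific field, analyzed with the same stemming analyzer
        // that was used at indexing time.
        Query stemmed = new QueryParser("stemmed_text_en",
                new SnowballAnalyzer("English")).parse(userInput);
        stemmed.setBoost(2.0f);        // favour the language-specific match

        BooleanQuery query = new BooleanQuery();
        query.add(unstemmed, BooleanClause.Occur.SHOULD);
        query.add(stemmed, BooleanClause.Occur.SHOULD);

        IndexSearcher searcher = new IndexSearcher("/tmp/index"); // placeholder path
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " hits");
        searcher.close();
    }
}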

I hope this helps.

-- 
karl


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Multi language indexing

Posted by bhecht <bh...@ams-sys.com>.
I know indexing and searching need to use the same analyzer.

My question regarding "the way to go" was whether it is a good solution to
index the contents of a table using more than one analyzer, determining the
analyzer by the country value of each record.

I couldn't find a post that describes exactly my problem, and I just want to
be sure this is how people with Lucene experience would approach it.

Thanks
-- 
View this message in context: http://www.nabble.com/Multi-language-indexing-tf3702402.html#a10354930
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Multi language indexing

Posted by karl wettin <ka...@gmail.com>.
On 7 May 2007, at 10:02, bhecht wrote:

> This means I index and search using the same analyzer.
>
> I was interested to know whether this is the way to go.

That would be the way to go (unless you are really sure what you're  
doing).

-- 
karl

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org