Posted to dev@lucene.apache.org by Koren Krupko <kr...@gmail.com> on 2009/02/26 02:21:41 UTC

Integrating Language Models into Lucene

Hello Lucene Developers!

My name is Koren Krupko. I'm quite new to Lucene, but I have research
experience in information retrieval. After reviewing Lucene's capabilities,
I understand that one of its major strengths is scalability (as opposed to
other frameworks such as Lemur). However, the retrieval and scoring models
used by Lucene are based on the traditional, rather dated Vector Space
Model (VSM). I'm interested in adding newer, state-of-the-art retrieval
models based on Language Models (LM); see
http://www.nabble.com/file/p22215790/LM-review.pdf for more details.
In recent years, retrieval systems based on LM have consistently
outperformed their VSM-based counterparts in well-recognized competitions
such as TREC. Thus, to make Lucene more attractive to IR researchers, I
would like to implement the following LM scoring functions, each with both
Jelinek-Mercer and Dirichlet prior smoothing: Query Likelihood,
KL-Divergence and Cross Entropy.
Integrating Language Models into Lucene, alongside its proven performance
and ease of use, will undoubtedly help Lucene become the leading open
source IR framework.
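
For readers less familiar with these scoring functions, the three are closely related. The derivation below is a sketch in assumed notation that does not appear in the original post: \theta_Q and \theta_D are the query and document language models, and c(w, Q) is the count of word w in query Q. Ranking by negative KL-divergence differs from ranking by negative cross entropy only by the query entropy H(\theta_Q), which is the same for every document:

```latex
\begin{align*}
-\mathrm{KL}(\theta_Q \,\|\, \theta_D)
  &= -\sum_{w} P(w \mid \theta_Q) \log \frac{P(w \mid \theta_Q)}{P(w \mid \theta_D)} \\
  &= \underbrace{\sum_{w} P(w \mid \theta_Q) \log P(w \mid \theta_D)}_{\text{negative cross entropy}}
     \;+\; \underbrace{H(\theta_Q)}_{\text{constant per query}}
\end{align*}
% With the maximum-likelihood query model P(w | \theta_Q) = c(w, Q) / |Q|,
% the cross-entropy term becomes a scaled log query likelihood:
\begin{equation*}
\sum_{w} \frac{c(w, Q)}{|Q|} \log P(w \mid \theta_D)
  = \frac{1}{|Q|} \log P(Q \mid \theta_D)
\end{equation*}
```

Under that assumption, all three functions produce the same ranking; they start to differ once the query model is itself estimated or expanded.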

Assuming an inverted index holding posting lists, I need the following
information available at query time in order to implement basic LM scoring
functions:
1.	For each term in the inverted index – 
a.	Frequency in every document.
b.	Frequency in the corpus.
2.	For each document – its size.
3.	Total size of the corpus.
As I understand it, 1a is available in Lucene, but 1b, 2 and 3 are a
problem because these statistics are not calculated during indexing. As I
see it, one could use the payload to store document size. However, adding
the corpus statistics requires fundamental changes to the index file
format. At first glance, this addition isn't substantial space-wise, since
all we need is one more parameter per term. My eventual goal is to build a
more complete and comprehensive index once, one that allows running
multiple retrieval sessions with different scoring models later.
I did a survey of the forum but didn't find anything similar to my ideas
(the closest I got was https://issues.apache.org/jira/browse/LUCENE-965). I
also understand that there are thoughts regarding changing the index format
in the future ("flexible indexing" -
https://issues.apache.org/jira/browse/LUCENE-1458).
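
The four statistics listed above (1a, 1b, 2 and 3) are the inputs to both smoothing methods. The following is a minimal sketch, assuming the textbook forms of Jelinek-Mercer and Dirichlet prior smoothing. The class QueryLikelihood and the names tf, docLen, cf and corpusLen are hypothetical and are not Lucene APIs; the statistics are passed in as plain values rather than read from an index.

```java
import java.util.Map;

/** Hypothetical helper (not part of Lucene): query-likelihood scoring from raw statistics. */
final class QueryLikelihood {

    /** Jelinek-Mercer smoothing: (1 - lambda) * tf/docLen + lambda * cf/corpusLen. */
    static double jelinekMercer(long tf, long docLen, long cf, long corpusLen, double lambda) {
        double pDoc = docLen == 0 ? 0.0 : (double) tf / docLen; // 1a and 2
        double pCorpus = (double) cf / corpusLen;                // 1b and 3
        return (1.0 - lambda) * pDoc + lambda * pCorpus;
    }

    /** Dirichlet prior smoothing: (tf + mu * cf/corpusLen) / (docLen + mu). */
    static double dirichlet(long tf, long docLen, long cf, long corpusLen, double mu) {
        double pCorpus = (double) cf / corpusLen;
        return (tf + mu * pCorpus) / (docLen + mu);
    }

    /** Log query likelihood of one document, using Dirichlet smoothing for each query term. */
    static double logLikelihood(String[] query, Map<String, Long> docTf, long docLen,
                                Map<String, Long> corpusCf, long corpusLen, double mu) {
        double score = 0.0;
        for (String term : query) {
            long tf = docTf.getOrDefault(term, 0L);
            long cf = corpusCf.getOrDefault(term, 0L);
            if (cf == 0) {
                continue; // term never occurs in the corpus: skipped in this sketch
            }
            score += Math.log(dirichlet(tf, docLen, cf, corpusLen, mu));
        }
        return score;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: tf=2, docLen=10, cf=100, corpusLen=1000.
        System.out.println(jelinekMercer(2, 10, 100, 1000, 0.5));
        System.out.println(dirichlet(2, 10, 100, 1000, 10.0));
    }
}
```

Note that only the per-term corpus frequency and the two length totals are needed beyond what the postings already hold, which matches the "one more parameter per term" estimate above.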

My questions are:
1.	Has anyone tried to do something similar in the past?
2.	Is anyone working on something similar at the moment?
3.	Do you think LM can/should become a part of official future Lucene
versions?
4.	How would you recommend implementing the index additions with minimal
changes as a temporary patch?

Koren

-- 
View this message in context: http://www.nabble.com/Integrating-Language-Models-into-Lucene-tp22215790p22215790.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Integrating Language Models into Lucene

Posted by José Ramón Pérez Agüera <jo...@gmail.com>.
There is a Lucene LM implementation, intended only for research purposes, at

http://ilps.science.uva.nl/resources/lm-lucene

It is a very old implementation, but it may be useful to you.

jose




-- 
José Ramón Pérez Agüera

Dept. de Ingeniería del Software e Inteligencia Artificial
Despacho 411 tlf. 913947599
Facultad de Informática
Universidad Complutense de Madrid



Re: Integrating Language Models into Lucene

Posted by Paul Elschot <pa...@xs4all.nl>.
On Thursday 26 February 2009 02:21:41 Koren Krupko wrote:
> 
> Hello Lucene Developers!
> 
> My name is Koren Krupko. I'm quite new to Lucene but I do have experience in
> research in the fields of information retrieval. After reviewing Lucene's
> capabilities I understand that one of its major strengths is its scalability
> (as opposed to other frameworks such as Lemur). However, the retrieval and
> scoring models used by Lucene are based upon the rather obsolete traditional
> Vector Space Model. I'm interested in adding newer, state of the art,
> retrieval models based on the notion of Language Models (see  
> http://www.nabble.com/file/p22215790/LM-review.pdf LM-review.pdf  for more
> details).
> During the last years, retrieval systems based on LM have outperformed their
> VSM based counterparts consistently in well recognized competitions such as
> TREC. Thus, in order to make Lucene more attractive to IR researchers, I
> would like to implement the following LM scoring functions using both
> Jelinek-Mercer and Dirichlet priors smoothing functions: Query Likelihood,
> KL-Divergence and Cross Entropy.
> Integrating Language Models into Lucene in addition to its proven
> performance capabilities and ease of use, will undoubtedly advance Lucene
> into becoming the leading open source IR framework.
> 
> Assuming the usage of an Inverted Index holding posting lists, in order to
> implement  basic LM scoring functions, I need the following information
> available during query time:
> 1.	For each term in the inverted index – 
> a.	Frequency in every document.
> b.	Frequency in the corpus.
> 2.	For each document – its size.
> 3.	Total size of the corpus.
> As I understand, 1a is implemented in Lucene but the problem is getting 1b,
> 2 and 3 since these details are not calculated during indexing. As I see it,
> one could use the Payload to store document size.

The field size is encoded in the norms.

> However, adding the Corpus
> statistics requires fundamental changes in the index file format. From first
> glance, this addition isn't substantial space-wise since all we need is one
> more parameter per term. My eventual goal is to build a more complete and
> comprehensive index once that will allow running multiple sessions of
> retrieval using different scoring models later.
> I did a survey of the forum but didn't find anything similar to my ideas
> (the closest I got was https://issues.apache.org/jira/browse/LUCENE-965). I
> also understand that there are thoughts regarding changing the index format
> in the future ("flexible indexing" -
> https://issues.apache.org/jira/browse/LUCENE-1458).
> 
> My questions are:
> 1.	Has anyone tried to do something similar in the past?

This is a term scorer that simply divides term frequency by field length:
https://issues.apache.org/jira/browse/LUCENE-293
A better field length encoding would be welcome, but it's a start.
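
A minimal sketch of that idea, assuming the default length norm of the time, 1/sqrt(number of terms in the field), with all boosts equal to 1 and ignoring the lossy one-byte norm encoding. The class and method names are hypothetical and are not taken from LUCENE-293.

```java
/** Hypothetical sketch (not the LUCENE-293 code): term frequency divided by field length. */
final class LengthNormalizedTf {

    /** Assumed default length norm: 1 / sqrt(numTerms). Requires numTerms > 0. */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Inverts the assumed formula; exact only when boosts are 1 and the norm is not quantized. */
    static double approximateLength(float norm) {
        return 1.0 / ((double) norm * norm);
    }

    /** Maximum-likelihood estimate of P(term | field): tf / field length. */
    static double score(int tf, float norm) {
        return tf / approximateLength(norm);
    }

    public static void main(String[] args) {
        float norm = lengthNorm(100); // a field with 100 terms
        System.out.println(approximateLength(norm));
        System.out.println(score(3, norm));
    }
}
```

Because a stored norm is rounded to a single byte, the recovered length is only approximate in practice, which is why a better field length encoding would help.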

> 2.	Is anyone working on something similar at the moment?

I was, but not any more, for reasons unrelated to the qualities of LM.

> 3.	Do you think LM can/should become a part of official future Lucene
> versions?

A contrib module with an alternative set of scorers would be a nice goal,
for example starting from the one referenced above.

> 4.	How would you recommend implementing the index additions with minimal
> changes as a temporary patch?

No need for a temporary patch; just create a separate issue for each index
addition and see what happens.

Regards,
Paul Elschot

Re: Integrating Language Models into Lucene

Posted by Grant Ingersoll <gs...@apache.org>.

On Feb 26, 2009, at 12:07 PM, Koren Krupko wrote:

>
> I'm familiar with their work. They implemented only one model and made very
> "model proprietary" changes in the index. In addition, it was done a long
> time ago (not compatible to current versions of Lucene).

I figured as much

> My goal is to build
> basic LM software infrastructure (indexing + scoring) to allow future
> implementation of different other models without touching the core.
> I would have liked to insert these changes in a fundamental manner fully
> integrating them into the Lucene project thus allowing code maintenance and
> backward compatibility of future releases.

+1.  Like I said, this fits well with the goal of "flexible  
indexing" (which should also be called "flexible scoring" or whatever)  
so those are things to keep in mind.

Have a look at the How To Contribute section on the wiki and keep in  
mind that small incremental, back-compatible patches are often easier  
to swallow than very large ones that massively change things.

At any rate, what you are proposing is very interesting and exciting  
(to me, anyway).

-Grant




Re: Integrating Language Models into Lucene

Posted by Koren Krupko <kr...@gmail.com>.
I'm familiar with their work. They implemented only one model and made very
"model proprietary" changes in the index. In addition, it was done a long
time ago and is not compatible with current versions of Lucene. My goal is
to build a basic LM software infrastructure (indexing + scoring) that allows
other models to be implemented in the future without touching the core.
I would like to make these changes in a fundamental way, fully integrating
them into the Lucene project, so that the code can be maintained and stay
backward compatible in future releases.

Koren


Paul Elschot wrote:
> 
> On Thursday 26 February 2009 13:41:30 Grant Ingersoll wrote:
>> I think there is a group in the Netherlands that has open sourced a  
>> version of Lucene using Language Models.
> 
> http://ilps.science.uva.nl/resources/lm-lucene
> 
> Regards,
> Paul Elschot
> 
> 



Re: Integrating Language Models into Lucene

Posted by Paul Elschot <pa...@xs4all.nl>.
On Thursday 26 February 2009 13:41:30 Grant Ingersoll wrote:
> I think there is a group in the Netherlands that has open sourced a  
> version of Lucene using Language Models.

http://ilps.science.uva.nl/resources/lm-lucene

Regards,
Paul Elschot

Re: Integrating Language Models into Lucene

Posted by Grant Ingersoll <gs...@apache.org>.
I think there is a group in the Netherlands that has open sourced a  
version of Lucene using Language Models.

I'd certainly welcome alternate implementations.  There have been  
many, many discussions about "flexible indexing"
(http://www.lucidimagination.com/search/?q=flexible+indexing, and I know
there are a bunch of related JIRA issues too) on the list  
here that you might look at.  In fact, several people have made some  
progress towards it, such that we are getting close to being able to  
more easily plug in different scoring models.   With flex. indexing,  
you should be able to do #3 below, and I believe all the others are  
already possible.



On Feb 25, 2009, at 8:21 PM, Koren Krupko wrote:

>
> Hello Lucene Developers!
>
> My name is Koren Krupko. I'm quite new to Lucene but I do have experience in
> research in the fields of information retrieval. After reviewing Lucene's
> capabilities I understand that one of its major strengths is its scalability
> (as opposed to other frameworks such as Lemur). However, the retrieval and
> scoring models used by Lucene are based upon the rather obsolete traditional
> Vector Space Model. I'm interested in adding newer, state of the art,
> retrieval models based on the notion of Language Models (see
> http://www.nabble.com/file/p22215790/LM-review.pdf for more details).
> During the last years, retrieval systems based on LM have outperformed their
> VSM based counterparts consistently in well recognized competitions such as
> TREC. Thus, in order to make Lucene more attractive to IR researchers, I
> would like to implement the following LM scoring functions using both
> Jelinek-Mercer and Dirichlet priors smoothing functions: Query Likelihood,
> KL-Divergence and Cross Entropy.
> Integrating Language Models into Lucene in addition to its proven
> performance capabilities and ease of use, will undoubtedly advance Lucene
> into becoming the leading open source IR framework.
>
> Assuming the usage of an Inverted Index holding posting lists, in order to
> implement basic LM scoring functions, I need the following information
> available during query time:
> 1.	For each term in the inverted index –
> a.	Frequency in every document.
> b.	Frequency in the corpus.
> 2.	For each document – its size.
> 3.	Total size of the corpus.
> As I understand, 1a is implemented in Lucene but the problem is getting 1b,
> 2 and 3 since these details are not calculated during indexing. As I see it,
> one could use the Payload to store document size. However, adding the Corpus
> statistics requires fundamental changes in the index file format. From first
> glance, this addition isn't substantial space-wise since all we need is one
> more parameter per term. My eventual goal is to build a more complete and
> comprehensive index once that will allow running multiple sessions of
> retrieval using different scoring models later.
> I did a survey of the forum but didn't find anything similar to my ideas
> (the closest I got was https://issues.apache.org/jira/browse/LUCENE-965). I
> also understand that there are thoughts regarding changing the index format
> in the future ("flexible indexing" -
> https://issues.apache.org/jira/browse/LUCENE-1458).
>
> My questions are:
> 1.	Has anyone tried to do something similar in the past?
> 2.	Is anyone working on something similar at the moment?
> 3.	Do you think LM can/should become a part of official future Lucene
> versions?
> 4.	How would you recommend implementing the index additions with minimal
> changes as a temporary patch?
>
> Koren

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search




Re: Integrating Language Models into Lucene

Posted by Earwin Burrfoot <ea...@gmail.com>.
Have you looked at MG4J (http://mg4j.dsi.unimi.it/)?
Last time I did, it looked like the opposite of Lucene: nice and
up-to-date algorithmics, but hard to apply to complex real-world
tasks.




-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
