Posted to java-user@lucene.apache.org by Ian Soboroff <ia...@nist.gov> on 2005/01/26 16:25:15 UTC

Suggestions for documentation or LIA

Erik Hatcher <er...@ehatchersolutions.com> writes:

> By all means, if you have other suggestions for our site, let us know 
> at authors@lucenebook.com.

One of the things I would like to see, but which isn't in the
Lucene site, the documentation, or "Lucene in Action", is a complete
description of how the retrieval algorithm works.  That is, how the
HitCollector, Scorers, Similarity, etc. all fit together.

I'm involved in a project which to some degree is looking at poking
deeply into this part of the Lucene code.  We have a nice (non-Lucene)
framework for working with more different kinds of similarity
functions (beyond tf-idf) which should also be expandable to include
query expansion, relevance feedback, and the like.  

I used to think that integrating it would be as simple as hacking in
Similarity, but I'm beginning to think it might need broader changes.
I could obviously hook in our whole retrieval setup by just diving for
an IndexReader and doing it all by hand, but then I would have to redo
the incremental search and possibly the rich query structure, which
would be a lose.

So anyway, I got LIA hoping for a good explanation (not a good
Explanation) on this bit, but it wasn't there.  There are some hints
on the Lucene site, but nothing complete.  If I muddle it out before
anything gets contributed, I'll try to write something up, but don't
expect anything too soon...

Ian

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: text highlighting

Posted by Robert Koberg <ro...@koberg.com>.

A relevant link :)

http://www.tbray.org/ongoing/When/200x/2005/01/26/PatentFunnies


Erik Hatcher wrote:
> Also, there are some examples in the Lucene in Action source code (grab  
> it from http://www.lucenebook.com) (see HighlightIt.java).
> 
>     Erik
> 
> On Jan 26, 2005, at 5:52 PM, markharw00d wrote:
> 
>> Michael Celona wrote:
>>
>>> Does anyone have a working example of the highlighter class found in the
>>> sandbox?
>>>
>>>
>> There are several in the accompanying Junit test:
>> http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/ 
>> contributions/highlighter/src/test/org/apache/lucene/search/highlight/
>>
>>
>> Cheers
>> Mark

Re: text highlighting

Posted by Youngho Cho <yo...@nannet.co.kr>.

More test results:

If the text contains "... Family ...", then the query string
"family" works OK.
But if the query string is "Family", then the highlighter returns nothing.


Thanks.

Youngho
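
One possible explanation for the behaviour reported above, offered as an assumption rather than a confirmed diagnosis: if the analyzer lowercases Latin tokens while the TermQuery is built from the raw, unanalyzed query string, then "Family" can never equal the token "family". The sketch below models this in plain Java with no Lucene calls. The class CaseMismatchSketch and its methods are hypothetical stand-ins for an analyzer and for exact term matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class CaseMismatchSketch {
    // Hypothetical stand-in for an analyzer: split on non-alphanumerics, lowercase each token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Models a term query built from the raw string: exact comparison against analyzed tokens.
    static boolean matchesRaw(String text, String queryTerm) {
        return analyze(text).contains(queryTerm);
    }

    // Models the workaround: run the query string through the same analysis first.
    static boolean matchesAnalyzed(String text, String queryString) {
        List<String> queryTokens = analyze(queryString);
        return !queryTokens.isEmpty() && analyze(text).containsAll(queryTokens);
    }
}
```

Under this assumption, passing the query string through the same analyzer as the text (or through a query parser configured with that analyzer) before building the query would make both spellings behave the same way.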


Re: text highlighting

Posted by Youngho Cho <yo...@nannet.co.kr>.

Hello,

When I use the code with CJKAnalyzer and search English text
(the text is a mix of Korean and English),
sometimes nothing is returned.
Other searches work well.

Is the code analyzer-dependent?

Thanks.

Youngho

-------  Test code (just a copy of the book code) ---------

    private static final String HIGH_LIGHT_OPEN = "<span class=\"highlight\">";
    private static final String HIGH_LIGHT_CLOSE = "</span>";

    public static String highLight(String value, String queryString)
        throws IOException
    {
        if (StringUtils.isEmpty(value) || StringUtils.isEmpty(queryString))
        {
            return value;
        }

        TermQuery query = new TermQuery(new Term("h", queryString));
        QueryScorer scorer = new QueryScorer(query);
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter(HIGH_LIGHT_OPEN,
                HIGH_LIGHT_CLOSE);
        Highlighter highlighter = new Highlighter(formatter, scorer);

        Fragmenter fragmenter = new SimpleFragmenter(50);

        highlighter.setTextFragmenter(fragmenter);

        TokenStream tokenStream = new CJKAnalyzer().tokenStream("h",
                new StringReader(value));

        return highlighter.getBestFragments(tokenStream, value, 5, "...");
    }


Re: text highlighting

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

Also, there are some examples in the Lucene in Action source code (grab  
it from http://www.lucenebook.com) (see HighlightIt.java).

	Erik

On Jan 26, 2005, at 5:52 PM, markharw00d wrote:

> Michael Celona wrote:
>
>> Does anyone have a working example of the highlighter class found in the
>> sandbox?
>>
>>
> There are several in the accompanying Junit test:
> http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/ 
> contributions/highlighter/src/test/org/apache/lucene/search/highlight/
>
>
> Cheers
> Mark

Re: text highlighting

Posted by markharw00d <ma...@yahoo.co.uk>.

Michael Celona wrote:

>Does anyone have a working example of the highlighter class found in the
>sandbox?
>
>  
>
There are several in the accompanying Junit test:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/highlighter/src/test/org/apache/lucene/search/highlight/


Cheers
Mark



text highlighting

Posted by Michael Celona <mc...@criticalmention.com>.

Does anyone have a working example of the highlighter class found in the
sandbox?


Re: Search Engine review article/book

Posted by Jason Polites <ja...@tpg.com.au>.

Also:

http://labs.google.com/papers.html
http://research.microsoft.com/wsm/


Re: Search Engine review article/book

Posted by Stefan Groschupf <sg...@media-style.com>.

+  the lucene in action book. :-)
+  scholar.google.com
+ acm.org ir group
+ ieee.org has ir group as well
You may find http://searchenginewatch.com/ useful as well.

HTH
Stefan


On 26.01.2005 at 23:18, Xiaohong Yang ((Sharon)) wrote:

> Hi all,
>
> I am looking for good review articles or books regarding latest search 
> engine development trend and practices.  Any suggestions would be very 
> helpful.  Any comments not covered by articles are also welcome.
>
> Thanks a lot,
>
> Sharon
>
---------------------------------------------------------------
company:		http://www.media-style.com
forum:		http://www.text-mining.org
blog:			http://www.find23.net



Search Engine review article/book

Posted by "Xiaohong Yang (Sharon)" <sh...@yahoo.com>.

Hi all,
 
I am looking for good review articles or books on the latest search engine development trends and practices.  Any suggestions would be very helpful.  Any comments not covered by articles are also welcome.
 
Thanks a lot,
 
Sharon

Re: Suggestions for documentation or LIA

Posted by Paul Elschot <pa...@xs4all.nl>.

On Wednesday 26 January 2005 18:40, Ian Soboroff wrote:
> jian chen <ch...@gmail.com> writes:
> 
> > Just to continue this discussion. I think right now Lucene's retrieval
> > algorithm is based purely on Vector Space Model, which is simple and
> > efficient.
> 
> As I understand it, it's indeed a tf-idf vector space approach, except
> that the queries are structured and as such, the tf-idf weights are
> totaled as a straight cosine among siblings of a BooleanQuery, but
> other query nodes may do things differently, for example, I haven't
> read it but I assume PhraseQueries require all terms present and
> adjacent to contribute to the score.
> 
> There is also a document-specific boost factor in the equation which
> is essentially a hook for document things like recency, PageRank, etc
> etc.
> 
> You can tweak this by defining custom Similarity classes which can say
> what the tf, idf, norm, and boost mean.  You can also affect the
> term normalization at the query end in BooleanScorer (I think? through
> the sumOfSquares method?).
> 
> We've implemented something kind of like the Similarity class but
> based on a model which describes a larger family of "similarity
> functions".  (For the curious or similarly IR-geeky, it's from Justin
> Zobel's paper from a few years ago in SIGIR Forum.)  Essentially I
> need more general hooks than the Lucene Similarity provides.  I think
> those hooks might exist, but I'm not sure I know which classes they're
> in.
> 
> I'm also interested in things like relevance feedback which can affect
> term weights as well as adding terms to the query... just how many
> places in the code do I have to subclass or change?

None. Create your own TermQuery instances, set their boosts,
and add them to a BooleanQuery.
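
As a rough illustration of where such boosts could come from, here is a minimal sketch of a Rocchio-style feedback step in plain Java. It makes no Lucene calls. The class FeedbackBoostSketch, the alpha and beta weights, and the token lists are all assumptions made for this example. Each entry of the returned map would correspond to one TermQuery with that boost, added to a BooleanQuery as an optional clause.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FeedbackBoostSketch {
    /**
     * Returns term -> boost. Every original query term gets weight alpha;
     * every occurrence of a term in a relevant document adds beta / (number of relevant docs).
     */
    static Map<String, Float> expand(List<String> queryTerms,
                                     List<List<String>> relevantDocs,
                                     float alpha, float beta) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        for (String term : queryTerms) {
            boosts.merge(term, alpha, Float::sum);
        }
        if (relevantDocs.isEmpty()) {
            return boosts; // no feedback available: keep the original query weights
        }
        float share = beta / relevantDocs.size();
        for (List<String> doc : relevantDocs) {
            for (String term : doc) {
                boosts.merge(term, share, Float::sum); // may introduce new expansion terms
            }
        }
        return boosts;
    }
}
```

The point of the sketch is that both halves of relevance feedback, reweighting existing terms and adding new ones, reduce to choosing a set of (term, boost) pairs, which is why no subclassing is needed for this case.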
 
> It's clear that if I'm interested in a completely different model like
> language modeling the IndexReader is the way to go.  In which case,
> what parts of the Lucene class structure should I adapt to maintain
> the incremental-results-return, inverted list skips, and other
> features which make the inverted search fast?

To keep the speed, the one thing you should keep is the performance of
TermQuery. In case you're interested in changing proximity scores,
the same holds for SpanTermQuery.
For a variation on TermQuery that scores query terms by their density in a
document field you can have a look here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=31784

On top of these you can implement your own Scorers, but for Zobel's
similarities you probably won't need much more than what BooleanQuery
provides.
To use the inverted list skips, make sure to implement and use skipTo()
on your scorers.
In case you need larger queries in conjunctive normal form:
+(synA1 synA2 ....) +(synB1 synB2 ...) +(synC1 synC2 ...) ....
the development version of BooleanQuery might be a bit faster
than the current one.
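
To make the skipTo() remark concrete, here is a minimal sketch of a two-way conjunction over sorted document-id lists in plain Java. It is not Lucene's implementation: the class SkipToSketch and its Postings cursor are hypothetical, and a binary search over an in-memory array stands in for the skip data stored in the index. Document ids are assumed to be non-negative and sorted ascending.

```java
import java.util.ArrayList;
import java.util.List;

final class SkipToSketch {
    /** Cursor over a sorted array of document ids (a stand-in for a posting list). */
    static final class Postings {
        private final int[] docs;
        private int pos = -1;

        Postings(int... sortedDocs) {
            this.docs = sortedDocs;
        }

        /** Current document id; only valid after skipTo() has returned true. */
        int doc() {
            return docs[pos];
        }

        /** Moves to the first entry >= target; returns false once the list is exhausted. */
        boolean skipTo(int target) {
            int lo = Math.max(pos, 0);
            int hi = docs.length;
            while (lo < hi) { // binary search over the remaining entries
                int mid = (lo + hi) >>> 1;
                if (docs[mid] < target) {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            pos = lo;
            return pos < docs.length;
        }
    }

    /** Documents present in both lists, found by leapfrogging each cursor to the other's position. */
    static List<Integer> intersect(Postings a, Postings b) {
        List<Integer> hits = new ArrayList<>();
        if (!a.skipTo(0) || !b.skipTo(0)) {
            return hits;
        }
        while (true) {
            if (a.doc() == b.doc()) {
                hits.add(a.doc());
                if (!a.skipTo(a.doc() + 1) || !b.skipTo(a.doc())) {
                    break;
                }
            } else if (a.doc() < b.doc()) {
                if (!a.skipTo(b.doc())) {
                    break;
                }
            } else if (!b.skipTo(a.doc())) {
                break;
            }
        }
        return hits;
    }
}
```

The cursor that is behind never steps through documents one at a time; it jumps straight to the other cursor's position, which is where the speed-up for required clauses comes from.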

For an interesting twist in the use of idf please search
for "fuzzy scoring changes" on lucene-dev at the end of 2004.

Regards,
Paul Elschot



Re: Suggestions for documentation or LIA

Posted by jian chen <ch...@gmail.com>.

Hi, Ian,

Thanks for your information. It would be really helpful to have some
documentation, maybe on the wiki, about the retrieval algorithm and how
to hack it. Even a few paragraphs would be enough to get
started...

Thanks,

Jian


Re: Suggestions for documentation or LIA

Posted by Ian Soboroff <ia...@nist.gov>.

jian chen <ch...@gmail.com> writes:

> Just to continue this discussion. I think right now Lucene's retrieval
> algorithm is based purely on Vector Space Model, which is simple and
> efficient.

As I understand it, it's indeed a tf-idf vector space approach, except
that the queries are structured. As such, the tf-idf weights are
totaled as a straight cosine among siblings of a BooleanQuery, but
other query nodes may do things differently. For example, I haven't
read the code, but I assume PhraseQueries require all terms present and
adjacent to contribute to the score.

There is also a document-specific boost factor in the equation which
is essentially a hook for document things like recency, PageRank, etc
etc.

You can tweak this by defining custom Similarity classes which can say
what the tf, idf, norm, and boost mean.  You can also affect the
term normalization at the query end in BooleanScorer (I think? through
the sumOfSquares method?).
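
For readers who want the pieces spelled out, here is a minimal plain-Java sketch of how tf, idf, norm, and boost could combine into a single term's score. The formulas (square-root tf, log-based idf, inverse-square-root length norm, idf counted once on the query side and once on the document side) are what I believe the default Similarity uses, but treat them as assumptions and check them against the source. The class TfIdfSketch is hypothetical, and the query normalization and coordination factors are left out.

```java
final class TfIdfSketch {
    /** Term-frequency factor: dampens repeated occurrences. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: rarer terms get larger weights. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /** Length normalization: shorter fields count for more. */
    static double lengthNorm(int numTokens) {
        return 1.0 / Math.sqrt(numTokens);
    }

    /** Score contribution of one query term in one document field. */
    static double termScore(int freq, int docFreq, int numDocs, int numTokens, double boost) {
        double weight = idf(docFreq, numDocs);
        // idf appears twice: once in the query weight, once in the document weight (assumed).
        return tf(freq) * weight * weight * boost * lengthNorm(numTokens);
    }
}
```

In this simplified view, a BooleanQuery's score for a document would be the sum of termScore over its matching clauses, before the omitted normalization and coordination factors are applied.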

We've implemented something kind of like the Similarity class but
based on a model which describes a larger family of "similarity
functions".  (For the curious or similarly IR-geeky, it's from Justin
Zobel's paper from a few years ago in SIGIR Forum.)  Essentially I
need more general hooks than the Lucene Similarity provides.  I think
those hooks might exist, but I'm not sure I know which classes they're
in.

I'm also interested in things like relevance feedback which can affect
term weights as well as adding terms to the query... just how many
places in the code do I have to subclass or change?

It's clear that if I'm interested in a completely different model like
language modeling the IndexReader is the way to go.  In which case,
what parts of the Lucene class structure should I adapt to maintain
the incremental-results-return, inverted list skips, and other
features which make the inverted search fast?

Ian


Re: Suggestions for documentation or LIA

Posted by jian chen <ch...@gmail.com>.

Hi,

Just to continue this discussion. I think right now Lucene's retrieval
algorithm is based purely on the Vector Space Model, which is simple and
efficient.

However, there may be cases where folks like me want to use a
completely different set of ranking algorithms, ones which do not even
use tf/idf.

For example, I am thinking about adding a Cover Density ranking
algorithm to Lucene, which for now is based purely on proximity
information and does not require any global ranking variables. But
looking into the Lucene code, it does not seem very easy to hack that
in. At least not for me, a novice Lucene user.

I read on the Lucene whiteboard 2.0 that Lucene will accommodate more
in terms of what can be indexed and such. That move might be good for
implementing other or ad hoc ranking algorithms.

Cheers,

Jian



Re: Suggestions for documentation or LIA

Posted by Ian Soboroff <ia...@nist.gov>.

Erik Hatcher <er...@ehatchersolutions.com> writes:

> Hacking Similarity wasn't covered in LIA for one simple reason - 
> Lucene's built-in scoring mechanism really is good enough for almost 
> all projects.  The book was written for developers of those projects.
>
> Personally, I've not had to hack Similarity, though I've toyed with it 
> in prototypes and am using a minor tweak (turning off length 
> normalization for the "title" field) for the lucenebook.com book 
> indexing.
>
>>   There are some hints
>> on the Lucene site, but nothing complete.  If I muddle it out before
>> anything gets contributed, I'll try to write something up, but don't
>> expect anything too soon...
>
> And maybe you'd contribute what you write to LIA 2nd edition :)

Maybe that too.  ;-) What we're working on isn't aimed at the site
admin who wants to tweak site search, it's more aimed at the IR
researcher.  Among other things it handles Cranfield-style batch
experiments and many standard IR test collections, for example.

Ian




Re: Suggestions for documentation or LIA

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jan 26, 2005, at 10:25 AM, Ian Soboroff wrote:
> Erik Hatcher <er...@ehatchersolutions.com> writes:
>
>> By all means, if you have other suggestions for our site, let us know
>> at authors@lucenebook.com.
>
> One of the things I would like to see, but which isn't either in the
> Lucene site, documentation, or "Lucene in Action", is a complete
> description of how the retrieval algorithm works.  That is, how the
> HitCollector, Scorers, Similarity, etc all fit together.
>
> I'm involved in a project which to some degree is looking at poking
> deeply into this part of the Lucene code.  We have a nice (non-Lucene)
> framework for working with more different kinds of similarity
> functions (beyond tf-idf) which should also be expandable to include
> query expansion, relevance feedback, and the like.
>
> I used to think that integrating it would be as simple as hacking in
> Similarity, but I'm beginning to think it might need broader changes.
> I could obviously hook in our whole retrieval setup by just diving for
> an IndexReader and doing it all by hand, but then I would have to redo
> the incremental search and possibly the rich query structure, which
> would be a lose.
>
> So anyway, I got LIA hoping for a good explanation (not a good
> Explanation) on this bit, but it wasn't there.

Hacking Similarity wasn't covered in LIA for one simple reason - 
Lucene's built-in scoring mechanism really is good enough for almost 
all projects.  The book was written for developers of those projects.

Personally, I've not had to hack Similarity, though I've toyed with it 
in prototypes and am using a minor tweak (turning off length 
normalization for the "title" field) for the lucenebook.com book 
indexing.
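
A rough sketch of what such a tweak amounts to, in plain Java rather than as an actual Similarity subclass: the length-normalization function takes the field name and the token count, and returns a constant for the "title" field. The class TitleNormSketch is hypothetical, and the inverse-square-root default is an assumption about the stock behaviour.

```java
final class TitleNormSketch {
    /** Length norm that ignores field length for "title" and uses 1/sqrt(n) elsewhere. */
    static float lengthNorm(String fieldName, int numTokens) {
        if ("title".equals(fieldName)) {
            return 1.0f; // long and short titles are treated alike
        }
        return (float) (1.0 / Math.sqrt(numTokens));
    }
}
```

With this in place, a match in a long title is no longer penalized relative to a match in a short one, while other fields keep their usual normalization.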

>   There are some hints
> on the Lucene site, but nothing complete.  If I muddle it out before
> anything gets contributed, I'll try to write something up, but don't
> expect anything too soon...

And maybe you'd contribute what you write to LIA 2nd edition :)

	Erik

