Posted to java-user@lucene.apache.org by Terry Steichen <te...@net-frame.com> on 2003/08/18 19:11:43 UTC

Similar Document Search

Is it possible without extensive additional coding to use Lucene to conduct a search based on a document rather than a query?  (One use of this would be to refine a search by selecting one of the hits returned from the initial query and subsequently retrieving other documents "like" the selected one.)

Regards,

Terry

Re: Similar Document Search

Posted by Terry Steichen <te...@net-frame.com>.
Hi Peter,

What got me thinking about this was the way that Lucene computes similarity
(or scoring).  After the boolean keyword matches have been found, Lucene
then computes relevance.  What Lucene does, I think, is process the query
into some intermediate internal representation and compute the similarity
between the query (now a kind of pseudo-document) and each of the matching
hits.

I was wondering if there might not be a way to internally process a selected
document (rather than the query per se) and then, in effect, compute the
similarity between that document and all the other documents (which have
already been pre-processed in the indexing process).  So, what you'd be
doing is not a boolean keyword match, but a ranking of all the documents in
the repository on the basis of relevance or similarity to the target
document.
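
A minimal sketch of the ranking described above, not Lucene's actual scoring: each document is reduced to a term-frequency map, and every other document is ranked by cosine similarity to the selected one. The class name `DocumentSimilarity`, the simple tokenizer, and the in-memory list of documents are assumptions made for illustration; a real index would use its own analyzer and stored statistics.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: ranks documents by the cosine
// similarity of their raw term-frequency vectors.
final class DocumentSimilarity {

    // Lower-cases the text and splits on anything that is not a letter or
    // digit (a stand-in for a real analyzer).
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine of the angle between two term-frequency vectors;
    // 0.0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue().doubleValue() * other.doubleValue();
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    // Returns the positions of all documents except the target,
    // most similar first.
    static List<Integer> rankBySimilarity(List<String> docs, int target) {
        List<Map<String, Integer>> vectors = new ArrayList<>();
        for (String d : docs) {
            vectors.add(termFrequencies(d));
        }
        Map<String, Integer> t = vectors.get(target);
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (i != target) {
                order.add(i);
            }
        }
        order.sort((x, y) -> Double.compare(cosine(t, vectors.get(y)),
                                            cosine(t, vectors.get(x))));
        return order;
    }
}
```

Comparing the target against every document is linear in the collection size, which is why the replies in this thread look at extracting a few keywords and reusing the inverted index instead.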

(If that's not too far off in terms of reality, maybe Doug could comment?)

Regards,

Terry

----- Original Message -----
From: "Peter Becker" <pb...@dstc.edu.au>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Monday, August 18, 2003 9:05 PM
Subject: Re: Similar Document Search


> Hi Terry,
>
> we have been thinking about the same problem and in the end we decided
> that most likely the only good solution to this is to keep a
> non-inverted index, i.e. a map from the documents to the terms. Then you
> can look up the terms of a document and query for other documents
> matching some of them (where you get the usual question of what is
> actually interesting: high frequency, low frequency or the mid range).
>
> Indexing would probably be quite expensive since Lucene doesn't seem to
> support changes in the index, and the index for the terms would change
> all the time. We haven't implemented it yet, but it shouldn't be hard to
> code. I just wouldn't expect good performance when indexing large
> collections.
>
>   Peter
>
>
> Terry Steichen wrote:
>
> >Is it possible without extensive additional coding to use Lucene to
> >conduct a search based on a document rather than a query?  (One use of
> >this would be to refine a search by selecting one of the hits returned
> >from the initial query and subsequently retrieving other documents
> >"like" the selected one.)
> >
> >Regards,
> >
> >Terry
> >
> >
> >
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
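
The non-inverted index Peter describes in the quoted message can be pictured as a plain map from document id to that document's terms, kept next to the usual term-to-documents map. The sketch below is an in-memory illustration under that assumption only. `ForwardIndex` and its methods are hypothetical names, not Lucene API, and the document-frequency band used to pick "interesting" terms is an arbitrary choice left to the caller.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory structure: a forward (document -> terms) map
// maintained alongside an inverted (term -> documents) map.
final class ForwardIndex {
    private final Map<Integer, Set<String>> termsByDoc = new HashMap<>();
    private final Map<String, Set<Integer>> docsByTerm = new HashMap<>();

    // Assumes each docId is added once; re-adding is not handled here.
    void add(int docId, String text) {
        Set<String> terms = new TreeSet<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        termsByDoc.put(docId, terms);
        for (String term : terms) {
            docsByTerm.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
        }
    }

    // Terms of the document whose document frequency lies in [minDf, maxDf],
    // in alphabetical order.
    List<String> interestingTerms(int docId, int minDf, int maxDf) {
        List<String> result = new ArrayList<>();
        Set<String> terms = termsByDoc.getOrDefault(docId, Collections.emptySet());
        for (String term : terms) {
            int df = docsByTerm.get(term).size();
            if (df >= minDf && df <= maxDf) {
                result.add(term);
            }
        }
        return result;
    }

    // Other documents sharing at least one interesting term, mapped to the
    // number of interesting terms they share with docId.
    Map<Integer, Integer> similarTo(int docId, int minDf, int maxDf) {
        Map<Integer, Integer> overlap = new HashMap<>();
        for (String term : interestingTerms(docId, minDf, maxDf)) {
            for (int other : docsByTerm.get(term)) {
                if (other != docId) {
                    overlap.merge(other, 1, Integer::sum);
                }
            }
        }
        return overlap;
    }
}
```

The forward map makes the "which terms does this document contain" lookup cheap at query time, at the price of extra work and storage every time a document is added, which is the trade-off raised in the message.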


Re: Similar Document Search

Posted by Magnus Johansson <ma...@technohuman.com>.
Hi Peter,

I guess you are right.

I've implemented this for an index with ten million really small
documents that are all stored in the index. The documents are never more
than a thousand words, so re-indexing is quick enough. However, it is
probably not advisable to do this with bigger documents or documents that
need additional parsing.

/magnus


Peter Becker wrote:

> Hi Magnus,
>
> thanks for the offer, but unfortunately I can't/don't want to make the 
> assumption that I can easily access the documents to re-index them. 
> And I don't think this approach would be feasible unless you can keep 
> the documents in memory somehow.
>
> Storing the other/non-inverted/normal/whatever index would be 
> expensive for indexing, but querying should be a lot faster than 
> having to re-index documents. That is in our situation preferable.
>
>  Peter
>
>
> Magnus Johansson wrote:
>
>> Hi Peter
>>
>> If the original document is available, you could extract keywords
>> from the document at query time. That is, when someone asks for
>> documents similar to document a, you re-analyze document a and, in
>> combination with statistics from the Lucene index, extract keywords
>> from document a that can then be used as a query for finding similar
>> documents.
>>
>> I've got some sample code if anyone is interested.
>>
>> /magnus
>>
>>





Re: Similar Document Search

Posted by Peter Becker <pb...@dstc.edu.au>.
Hi Magnus,

thanks for the offer, but unfortunately I can't/don't want to make the 
assumption that I can easily access the documents to re-index them. And 
I don't think this approach would be feasible unless you can keep the 
documents in memory somehow.

Storing the other/non-inverted/normal/whatever index would be expensive
for indexing, but querying should be a lot faster than having to
re-index documents. In our situation that is preferable.

  Peter


Magnus Johansson wrote:

> Hi Peter
>
> If the original document is available, you could extract keywords from
> the document at query time. That is, when someone asks for documents
> similar to document a, you re-analyze document a and, in combination
> with statistics from the Lucene index, extract keywords from document a
> that can then be used as a query for finding similar documents.
>
> I've got some sample code if anyone is interested.
>
> /magnus
>
>
> Peter Becker wrote:
>
>> Hi Terry,
>>
>> we have been thinking about the same problem and in the end we 
>> decided that most likely the only good solution to this is to keep a 
>> non-inverted index, i.e. a map from the documents to the terms. Then 
>> you can query the most terms for the documents and query other 
>> documents matching parts of this (where you get the usual question of 
>> what is actually interesting: high frequency, low frequency or the 
>> mid range).
>>
>> Indexing would probably be quite expensive since Lucene doesn't seem 
>> to support changes in the index, and the index for the terms would 
>> change all the time. We haven't implemented it yet, but it shouldn't 
>> be hard to code. I just wouldn't expect good performance when 
>> indexing large collections.
>>
>>  Peter
>>
>>
>> Terry Steichen wrote:
>>
>>> Is it possible without extensive additional coding to use Lucene to 
>>> conduct a search based on a document rather than a query?  (One use 
>>> of this would be to refine a search by selecting one of the hits 
>>> returned from the initial query and subsequently retrieving other 
>>> documents "like" the selected one.)
>>>
>>> Regards,
>>>
>>> Terry
>>>
>>>  
>>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org






Re: Similar Document Search

Posted by Magnus Johansson <ma...@technohuman.com>.
Ok, here it is. It's part of a JSP that prints out all keywords in a 
document.

/magnus


<%@ page import="org.apache.lucene.index.IndexReader,
                 org.apache.lucene.document.Document,
                 com.technohuman.search.language.SwedishAnalyzer,
                 java.io.StringReader,
                 org.apache.lucene.analysis.TokenStream,
                 org.apache.lucene.analysis.Token,
                 org.apache.lucene.index.Term,
                 org.apache.lucene.index.TermEnum,
                 java.util.*"%>
<%!
    class Entry implements Comparable {
        public double score;
        public String termText;

        public Entry(double score, String termText) {
            this.score = score;
            this.termText = termText;
        }

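        // Sort by descending score. Double.compare returns 0 for equal
        // scores, which the Comparable contract requires.
        public int compareTo(Object o) {
            Entry e = (Entry) o;
            return Double.compare(e.score, score);
        }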
    }
%>
<%
    IndexReader reader = IndexReader.open(application.getRealPath("/WEB-INF/index"));
    Document d = reader.document(Integer.parseInt(request.getParameter("docId")));

    Map m = new HashMap();

    // Count all terms in the description field of the given document
    String description = d.getField("Parser.DESCRIPTION").stringValue();
    final java.io.Reader r = new StringReader(description);
    final TokenStream in = new SwedishAnalyzer().tokenStream(r);

    for (; ;) {
        final Token token = in.next();

        if (token == null) {
            break;
        }

        if (m.containsKey(token.termText())) {
            int a = ((Integer)m.get(token.termText())).intValue();
            m.put(token.termText(), new Integer(a + 1));
        } else {
            m.put(token.termText(), new Integer(1));
        }
    }


    ArrayList tm = new ArrayList();

    // Calculate inverse document frequency * term frequency
    Iterator it = m.keySet().iterator();
    while (it.hasNext()) {
        String termText = (String) it.next();
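        Term term = new Term("Parser.DESCRIPTION", termText);

        // Cast to double so the ratio is not truncated by integer division.
        // reader.docFreq(term) is 0 for a term that is not in the index.
        double idf = Math.log((double) reader.numDocs() / (reader.docFreq(term) + 1)) + 1;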
        double tf = Math.sqrt(((Integer)m.get(termText)).intValue());

        tm.add(new Entry(idf * tf, termText));
    }


    Collections.sort(tm);

    // Print the keywords and the score for each keyword
    Iterator it2 = tm.iterator();
    while (it2.hasNext()) {
        Entry e = (Entry) it2.next();
        out.println(e.score + " " + e.termText + "<br />");
    }

    reader.close();
%>
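
The JSP above stops at printing the scored keywords. One way to turn them into a search, sketched here under stated assumptions, is to keep the top N terms and join them into a query string using the `term^boost` form of Lucene's query syntax, which could then be passed to a query parser with the description field as the default field. `KeywordQueryBuilder` and the cutoff `maxTerms` are hypothetical, and the terms are assumed to contain no characters that are special in the query syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: builds a boosted query string from keyword scores.
final class KeywordQueryBuilder {

    // Returns the maxTerms highest-scoring terms as "term^score",
    // separated by spaces; an empty string if there are no terms.
    static String buildQuery(Map<String, Double> scores, int maxTerms) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        // Highest score first; ties are broken alphabetically so the
        // result does not depend on map iteration order.
        entries.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        StringBuilder query = new StringBuilder();
        for (int i = 0; i < entries.size() && i < maxTerms; i++) {
            if (i > 0) {
                query.append(' ');
            }
            query.append(entries.get(i).getKey())
                 .append('^')
                 .append(String.format(Locale.ROOT, "%.2f", entries.get(i).getValue()));
        }
        return query.toString();
    }
}
```

Limiting the query to a handful of terms keeps the search cheap and avoids matching documents that only share very common words with the selected one.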

Rociel Buico wrote:

>Hello Magnus,
>
>Can I ask for your sample script?
> 
>--buics
> 
>Hi Peter
>
>If the original document is available, you could extract keywords from
>the document at query time. That is, when someone asks for documents
>similar to document a, you re-analyze document a and, in combination
>with statistics from the Lucene index, extract keywords from document a
>that can then be used as a query for finding similar documents.
>
>I've got some sample code if anyone is interested.
>
>/magnus
>
>
>Peter Becker wrote:
>
>  
>
>>Hi Terry,
>>
>>we have been thinking about the same problem and in the end we decided 
>>that most likely the only good solution to this is to keep a 
>>non-inverted index, i.e. a map from the documents to the terms. Then 
>>you can query the most terms for the documents and query other 
>>documents matching parts of this (where you get the usual question of 
>>what is actually interesting: high frequency, low frequency or the mid 
>>range).
>>
>>Indexing would probably be quite expensive since Lucene doesn't seem 
>>to support changes in the index, and the index for the terms would 
>>change all the time. We haven't implemented it yet, but it shouldn't 
>>be hard to code. I just wouldn't expect good performance when indexing 
>>large collections.
>>
>>Peter
>>
>>
>>Terry Steichen wrote:
>>
>>    
>>
>>>Is it possible without extensive additional coding to use Lucene to 
>>>conduct a search based on a document rather than a query? (One use 
>>>of this would be to refine a search by selecting one of the hits 
>>>returned from the initial query and subsequently retrieving other 
>>>documents "like" the selected one.)
>>>
>>>Regards,
>>>
>>>Terry
>>>
>>>
>>>
>>>      
>>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>    
>>
>
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
>  
>



Re: Similar Document Search

Posted by Rociel Buico <bu...@yahoo.com>.
Hello Magnus,

Can I ask for your sample script?
 
--buics
 
Hi Peter

If the original document is available, you could extract keywords from
the document at query time. That is, when someone asks for documents
similar to document a, you re-analyze document a and, in combination
with statistics from the Lucene index, extract keywords from document a
that can then be used as a query for finding similar documents.

I've got some sample code if anyone is interested.

/magnus


Peter Becker wrote:

> Hi Terry,
>
> we have been thinking about the same problem and in the end we decided 
> that most likely the only good solution to this is to keep a 
> non-inverted index, i.e. a map from the documents to the terms. Then 
> you can query the most terms for the documents and query other 
> documents matching parts of this (where you get the usual question of 
> what is actually interesting: high frequency, low frequency or the mid 
> range).
>
> Indexing would probably be quite expensive since Lucene doesn't seem 
> to support changes in the index, and the index for the terms would 
> change all the time. We haven't implemented it yet, but it shouldn't 
> be hard to code. I just wouldn't expect good performance when indexing 
> large collections.
>
> Peter
>
>
> Terry Steichen wrote:
>
>> Is it possible without extensive additional coding to use Lucene to 
>> conduct a search based on a document rather than a query? (One use 
>> of this would be to refine a search by selecting one of the hits 
>> returned from the initial query and subsequently retrieving other 
>> documents "like" the selected one.)
>>
>> Regards,
>>
>> Terry
>>
>> 
>>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>






Re: Similar Document Search

Posted by Magnus Johansson <ma...@technohuman.com>.
Hi Peter

If the original document is available, you could extract keywords from
the document at query time. That is, when someone asks for documents
similar to document a, you re-analyze document a and, in combination
with statistics from the Lucene index, extract keywords from document a
that can then be used as a query for finding similar documents.

I've got some sample code if anyone is interested.

/magnus
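
As a rough, Lucene-free illustration of the keyword extraction described above: term counts from the selected document are combined with document frequencies from the collection into a tf-idf weight, and the terms are returned heaviest first. The weighting sqrt(tf) * (ln(N / (df + 1)) + 1) mirrors the one in the JSP posted in this thread, computed with floating-point division. The class name `KeywordExtractor`, the simple tokenizer, and the precomputed `docFreq` map standing in for index statistics are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: ranks the terms of one document by tf-idf weight.
final class KeywordExtractor {

    // sqrt(tf) * (ln(numDocs / (docFreq + 1)) + 1), using double division.
    static double weight(int termFreq, int docFreq, int numDocs) {
        double idf = Math.log((double) numDocs / (docFreq + 1)) + 1.0;
        return Math.sqrt(termFreq) * idf;
    }

    // Returns the distinct terms of the text, highest weight first.
    // Terms missing from docFreq are treated as having document frequency 0.
    static List<String> keywords(String text, Map<String, Integer> docFreq, int numDocs) {
        Map<String, Integer> termFreq = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                termFreq.merge(token, 1, Integer::sum);
            }
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            weights.put(e.getKey(), weight(e.getValue(), df, numDocs));
        }
        List<String> terms = new ArrayList<>(weights.keySet());
        // Ties are broken alphabetically to keep the order deterministic.
        terms.sort((a, b) -> {
            int byWeight = Double.compare(weights.get(b), weights.get(a));
            return byWeight != 0 ? byWeight : a.compareTo(b);
        });
        return terms;
    }
}
```

With this weighting a word that occurs in nearly every document scores close to sqrt(tf), while a word that is frequent in the selected document but rare in the collection rises to the top of the list.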


Peter Becker wrote:

> Hi Terry,
>
> we have been thinking about the same problem and in the end we decided 
> that most likely the only good solution to this is to keep a 
> non-inverted index, i.e. a map from the documents to the terms. Then 
> you can query the most terms for the documents and query other 
> documents matching parts of this (where you get the usual question of 
> what is actually interesting: high frequency, low frequency or the mid 
> range).
>
> Indexing would probably be quite expensive since Lucene doesn't seem 
> to support changes in the index, and the index for the terms would 
> change all the time. We haven't implemented it yet, but it shouldn't 
> be hard to code. I just wouldn't expect good performance when indexing 
> large collections.
>
>  Peter
>
>
> Terry Steichen wrote:
>
>> Is it possible without extensive additional coding to use Lucene to 
>> conduct a search based on a document rather than a query?  (One use 
>> of this would be to refine a search by selecting one of the hits 
>> returned from the initial query and subsequently retrieving other 
>> documents "like" the selected one.)
>>
>> Regards,
>>
>> Terry
>>
>>  
>>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>





Re: Similar Document Search

Posted by Brian Mila <bm...@iastate.edu>.
> As a user of Lucene I missed some features. Part of the OSS culture is
> for me to tell others about this and maybe to try to find solutions.
> Mark's code seems to be one, so I proposed to consider adding it into
> some spot with better exposure for testing. And I don't seem to be the
> only person with the need for these features. I think Lucene would be
> better if these features were easily available. If the Lucene team
> doesn't think so -- fair enough, it is their project. But asking me to
> stop requesting features in a (hopefully) sensible way is pretty much
> against the spirit of OSS and hacker culture as far as I understand it.
>
> Does that answer your questions?
>

Yep.  I guess we have different philosophies but are more or less heading
toward the same goal :)  I've been on other projects where the response to
questions has been "look at the source", so maybe I have gotten into a bad
habit of hacking first and asking questions second.  Thanks for the
excellent reply, btw.

brian

>   Peter
>





Re: Similar Document Search

Posted by Peter Becker <pb...@dstc.edu.au>.
Brian Mila wrote:

>>amounts). I failed to find a way to get Lucene to give me this
>>information without hacking this or that. Considering the attention IR
>>    
>>
>
>Excuse me if this is off-topic, but isn't hacking the code what open source
>software is all about?  
>
Not always, but quite often :-)

>I mean, it's always better to try to do it with
>existing methods but if it can't, why not hack the source?  
>
Because you might need to put quite some effort into getting it right? 
Because you might do something someone else already did better -- which 
is not really against the spirit of hackerism, but I have so many other 
things to hack where I think I can do better than most people. Inverted 
file indexes is not my particular domain.

>If it works and
>people use it then it should probably be incorporated into the main source
>tree.  If people don't use it (or the hack is terribly ugly, which may be
>what you were referring to) then it doesn't make the cut.  
>
That needs exposure. If some Lucene code is hidden in the Haystack 
project, it won't get enough exposure IMO.

>In either case,
>I'm just wondering why I see many questions or answers include this almost
>standard reply.  I hack the source regularly to achieve a needed goal.
>Sure, it's not forward-compatible, but if I waited for the feature to be added
>on its own, our project would never get off the ground.
>  
>
One of the important things about OSS for me is reuse and 
collaboration. If you hack things again and again without trying to turn 
it into something reusable, I'd say you constantly create small 
proprietary forks based on open source code but you are not part of any 
OSS effort. That's of course my point of view on OSS, but then you asked 
for it :-)

As a user of Lucene I missed some features. Part of the OSS culture is 
for me to tell others about this and maybe to try to find solutions. 
Mark's code seems to be one, so I proposed to consider adding it into 
some spot with better exposure for testing. And I don't seem to be the 
only person with the need for these features. I think Lucene would be 
better if these features were easily available. If the Lucene team 
doesn't think so -- fair enough, it is their project. But asking me to 
stop requesting features in a (hopefully) sensible way is pretty much 
against the spirit of OSS and hacker culture as far as I understand it.

Does that answer your questions?

  Peter


Re: Similar Document Search

Posted by Brian Mila <bm...@iastate.edu>.
> amounts). I failed to find a way to get Lucene to give me this
> information without hacking this or that. Considering the attention IR

Excuse me if this is off-topic, but isn't hacking the code what open source
software is all about?  I mean, it's always better to try to do it with
existing methods, but if you can't, why not hack the source?  If it works and
people use it then it should probably be incorporated into the main source
tree.  If people don't use it (or the hack is terribly ugly, which may be
what you were referring to) then it doesn't make the cut.  In either case,
I'm just wondering why I see many questions or answers include this almost
standard reply.  I hack the source regularly to achieve a needed goal.
Sure, it's not forward-compatible, but if I waited for the feature to be added
on its own, our project would never get off the ground.

Brian




Re: Similar Document Search

Posted by Peter Becker <pb...@dstc.edu.au>.
Hi Terry,

exactly these two features, (a) having a unique identifier and (b) 
easily finding the term frequencies for a document, are what we (i.e. 
our working group and seemingly others) are missing.

(a) As far as I understand Lucene, there is no such notion as value 
identity on Document instances. This is a problem if you want to do 
things like applying set-theory as we did. The workaround is easy: store 
a unique id in the index and wrap the documents in an object using this 
field as base for equals/hashCode. But it still has to be done, you need 
the unique id in the index and it is not really elegant since you have 
to go through the wrapper all the time.

(b) Lucene allows you to find term frequencies in the index, but not for 
subsets or single items. Many information retrieval approaches define 
document similarity using metrics on the term frequencies. The more 
similar the term frequencies, the more similar the documents are 
considered. You get different details and levels of complexity (esp. if 
you try to mix in background knowledge like knowing synonyms and 
generalizations of the terms), but the basic idea is that documents are 
similar if they contain the same terms (and maybe even in the same 
amounts). I failed to find a way to get Lucene to give me this 
information without hacking this or that. Considering the attention IR 
techniques like latent semantic analysis (LSA) and others get nowadays 
(and rightfully so I think), not finding these features in Lucene was a 
bit of a surprise. I still haven't looked at Mark's code, but I would be 
surprised if he had to do much. But you still have to do something.
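
The workaround described in (a) can be sketched roughly as follows. This is
a minimal sketch, not code from Lucene or Haystack: the class name
DocumentHandle, its fields and the intersection helper are made up for
illustration, and the wrapped object is typed as Object only to keep the
example free of a Lucene dependency (in practice it would hold the Lucene
Document, with the id read from a unique field stored in the index).

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical wrapper that gives a search hit value identity via a unique id field.
final class DocumentHandle {
    private final String uniqueId;  // value of the unique id field stored in the index
    private final Object document;  // the wrapped hit (assumed to be a Lucene Document in practice)

    DocumentHandle(String uniqueId, Object document) {
        this.uniqueId = Objects.requireNonNull(uniqueId, "uniqueId");
        this.document = document;
    }

    String getUniqueId() { return uniqueId; }

    Object getDocument() { return document; }

    // Two handles are equal when they carry the same unique id,
    // regardless of which Document instance they wrap.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof DocumentHandle)) return false;
        return uniqueId.equals(((DocumentHandle) other).uniqueId);
    }

    @Override
    public int hashCode() { return uniqueId.hashCode(); }

    // Example of the set-theoretic use: hits present in both result sets.
    static Set<DocumentHandle> intersection(Set<DocumentHandle> a, Set<DocumentHandle> b) {
        Set<DocumentHandle> result = new HashSet<>(a);
        result.retainAll(b);
        return result;
    }
}
```

With equals/hashCode defined this way, the results of two separate searches
can be put into ordinary java.util sets and combined with retainAll, addAll
or removeAll, at the cost of going through the wrapper everywhere.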

After the more abstract talk, a more concrete answer to your question: 
one simple way of defining document similarity is to treat the term 
frequencies of some (or all) terms as vectors in a vector space and 
then use a metric on that space to define distance. If you have two 
frequency maps, you can for example go through all keys in them, take 
the absolute differences of the attached values (assuming zero if a 
term is not in a map) and sum them up (giving you the Manhattan metric 
in R^n). Then you divide by the number of terms to normalize (the 
frequency maps are probably of different sizes), and that might give 
you a reasonable first try. If the result is zero, you consider the 
documents to be extremely similar; the higher the value, the more 
different they are supposed to be.
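
The calculation above can be sketched in plain Java as follows. This is only
an illustration under stated assumptions: the class and method names are
invented, the frequency maps are assumed to be already available as
Map<String, Integer> (Lucene does not hand them out in this form), a term
missing from one map counts as zero, and "number of terms" is taken to mean
the size of the union of both key sets.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: normalized Manhattan distance between two term-frequency maps.
final class TermFrequencyDistance {
    private TermFrequencyDistance() {}

    static double normalizedManhattan(Map<String, Integer> a, Map<String, Integer> b) {
        // Every term that occurs in at least one of the two documents.
        Set<String> terms = new HashSet<>(a.keySet());
        terms.addAll(b.keySet());
        if (terms.isEmpty()) {
            return 0.0; // two empty documents: treated as identical
        }
        long sum = 0;
        for (String term : terms) {
            int fa = a.getOrDefault(term, 0); // missing term counts as frequency zero
            int fb = b.getOrDefault(term, 0);
            sum += Math.abs(fa - fb);
        }
        // Normalize by the number of distinct terms considered.
        return (double) sum / terms.size();
    }
}
```

A result of 0.0 means the two maps are identical; larger values mean the
term frequencies differ more. The measure is symmetric, so the argument
order does not matter.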

The approach I described is a bit too naive to be really good -- for 
example I'd expect some bias towards more similarity on documents with 
fewer terms. And there are so many other enhancements you could do. 
Actually the whole idea is a field of research, and one I am not really 
an expert in; I just sometimes work with people who are. This might help:

  
http://citeseer.nj.nec.com/cs?q=latent+semantic+analysis&submit=Search+Documents&cs=1

Maybe someone else on the list has better pointers.

  Peter




Terry Steichen wrote:

>Hi Peter,
>
>I took a look at Mark's thesis and briefly at some of his code.  It appears
>to me that what he's done with the so-called forward indexing is to (a)
>include a unique id with each document (allowing retrieval by id rather than
>by a standard query), and to (b) include a frequency map class with each
>document (allowing easier retrieval of term frequency information).
>
>Now I may be missing something very obvious, but it seems to me that both of
>these functions can be done rather easily with the standard (unmodified)
>version of Lucene.  Moreover, I don't understand how use of these functions
>will facilitate retrieval of documents that are "similar" to a selected
>document, as outlined in my original question on this topic.
>
>Could you (or anyone else, of course) perhaps elaborate just a bit on how
>using this approach will help achieve that end?
>
>Regards,
>
>Terry
>
>----- Original Message -----
>From: "Peter Becker" <pb...@dstc.edu.au>
>To: "Lucene Users List" <lu...@jakarta.apache.org>
>Sent: Thursday, August 21, 2003 1:37 AM
>Subject: Re: Similar Document Search
>
>
>  
>
>>Hi all,
>>
>>it seems there are quite a few people looking for similar features, i.e.
>>(a) document identity and (b) forward indexing. So far we hacked (a) by
>>using a wrapper implementing equals/hashcode based on a unique field,
>>but of course that assumes maintaining a unique field in the index. (b)
>>is something we haven't tackled yet, but plan to.
>>
>>The source code for Mark's thesis seems to be part of the Haystack
>>distribution. The comments in the files put it under the Apache license. This
>>seems to make it a good candidate to be included at least in the Lucene
>>sandbox -- although I haven't tried it myself yet. But it sounds like a
>>good candidate for us to use.
>>
>>Since the haystack source is a bit larger and I actually couldn't get
>>the download at the moment, here is a copy of the relevant bit grabbed
>>from one of my colleague's machines:
>>
>>  http://www.itee.uq.edu.au/~pbecker/luceneHaystack.tar.gz (22kb)
>>
>>Note that this is just a tarball of src/org/apache/lucene out of some
>>Haystack source. Untested, unmodified.
>>
>>I'd love to see something like this supported in the Lucene context where
>>people might actually find it :-)
>>
>>  Peter
>>
>>
>>Gregor Heinrich wrote:
>>
>>    
>>
>>>Hello Terry,
>>>
>>>Lucene can do forward indexing, as Mark Rosen outlines in his Master's
>>>thesis: http://citeseer.nj.nec.com/rosen03email.html.
>>>
>>>We use a similar approach for (probabilistic) latent semantic analysis and
>>>vector space searches. However, the solution is not really completely fixed
>>>yet, therefore no code at this time...
>>>
>>>Best regards,
>>>
>>>Gregor
>>>
>>>
>>>
>>>
>>>-----Original Message-----
>>>From: Peter Becker [mailto:pbecker@dstc.edu.au]
>>>Sent: Tuesday, August 19, 2003 3:06 AM
>>>To: Lucene Users List
>>>Subject: Re: Similar Document Search
>>>
>>>
>>>Hi Terry,
>>>
>>>we have been thinking about the same problem and in the end we decided
>>>that most likely the only good solution to this is to keep a
>>>non-inverted index, i.e. a map from the documents to the terms. Then you
>>>can query the most terms for the documents and query other documents
>>>matching parts of this (where you get the usual question of what is
>>>actually interesting: high frequency, low frequency or the mid range).
>>>
>>>Indexing would probably be quite expensive since Lucene doesn't seem to
>>>support changes in the index, and the index for the terms would change
>>>all the time. We haven't implemented it yet, but it shouldn't be hard to
>>>code. I just wouldn't expect good performance when indexing large
>>>collections.
>>>
>>> Peter
>>>
>>>
>>>Terry Steichen wrote:
>>>
>>>>Is it possible without extensive additional coding to use Lucene to conduct
>>>>a search based on a document rather than a query?  (One use of this would be
>>>>to refine a search by selecting one of the hits returned from the initial
>>>>query and subsequently retrieving other documents "like" the selected one.)
>>>>
>>>>Regards,
>>>>
>>>>Terry
>>>>



RE: Similar Document Search

Posted by Gregor Heinrich <Gr...@igd.fhg.de>.
Hi Terry,

the suggestion of Haystack's Lucene was a hint to give you an additional
alternative to reach your goal.

Depending on the definition of your notion "similar document", this solution
does or does not make sense. My definition of similar document (and term) is
maybe more general than yours: It supports rather generic similarity metrics
and needs to cover cosine similarity according to vector-space model (VSM;
can be achieved using unmodified Lucene code), semantic similarity according
to a generative model like latent semantic indexing or Bayesian approaches
etc. and even semantic similarity according to a taxonomy. If you want such
a flexibility (like I do for my research), you should consider this approach
because you can relatively easily work on the forward document vectors.

If all you need is vanilla VSM cosine similarity, you are probably best off
with the suggestion that was sent in this list, to submit the document
content in the query and throw it through the same Analyzer that was used to
create the index, thus finding best matches using Lucene's standard matching
scheme.
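
For reference, plain cosine similarity over two term-frequency maps can be
sketched as below. This is an assumption-laden illustration and not what
Lucene computes internally: Lucene's scoring also weights terms (for example
by inverse document frequency), whereas this sketch uses raw term counts
only. The class and method names are invented, and the maps are assumed to
be already available as Map<String, Integer>.

```java
import java.util.Map;

// Hypothetical helper: cosine similarity of two raw term-frequency vectors.
final class CosineSimilarity {
    private CosineSimilarity() {}

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            int fa = entry.getValue();
            normA += (double) fa * fa;
            Integer fb = b.get(entry.getKey());
            if (fb != null) {
                dot += (double) fa * fb; // only shared terms contribute to the dot product
            }
        }
        for (int fb : b.values()) {
            normB += (double) fb * fb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty document is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Unlike a distance, a higher value here means more similar: documents with no
terms in common score 0.0, and a document compared with itself scores 1.0
(up to floating-point rounding).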

Good luck,

Gregor





-----Original Message-----
From: Terry Steichen [mailto:terry@net-frame.com]
Sent: Thursday, August 21, 2003 2:54 PM
To: Lucene Users List
Subject: Re: Similar Document Search


Hi Peter,

I took a look at Mark's thesis and briefly at some of his code.  It appears
to me that what he's done with the so-called forward indexing is to (a)
include a unique id with each document (allowing retrieval by id rather than
by a standard query), and to (b) include a frequency map class with each
document (allowing easier retrieval of term frequency information).

Now I may be missing something very obvious, but it seems to me that both of
these functions can be done rather easily with the standard (unmodified)
version of Lucene.  Moreover, I don't understand how use of these functions
will facilitate retrieval of documents that are "similar" to a selected
document, as outlined in my original question on this topic.

Could you (or anyone else, of course) perhaps elaborate just a bit on how
using this approach will help achieve that end?

Regards,

Terry

----- Original Message -----
From: "Peter Becker" <pb...@dstc.edu.au>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Thursday, August 21, 2003 1:37 AM
Subject: Re: Similar Document Search


> Hi all,
>
> it seems there are quite a few people looking for similar features, i.e.
> (a) document identity and (b) forward indexing. So far we hacked (a) by
> using a wrapper implementing equals/hashcode based on a unique field,
> but of course that assumes maintaining a unique field in the index. (b)
> is something we haven't tackled yet, but plan to.
>
> The source code for Mark's thesis seems to be part of the Haystack
> distribution. The comments in the files put it under the Apache license. This
> seems to make it a good candidate to be included at least in the Lucene
> sandbox -- although I haven't tried it myself yet. But it sounds like a
> good candidate for us to use.
>
> Since the haystack source is a bit larger and I actually couldn't get
> the download at the moment, here is a copy of the relevant bit grabbed
> from one of my colleague's machines:
>
>   http://www.itee.uq.edu.au/~pbecker/luceneHaystack.tar.gz (22kb)
>
> Note that this is just a tarball of src/org/apache/lucene out of some
> Haystack source. Untested, unmodified.
>
> I'd love to see something like this supported in the Lucene context where
> people might actually find it :-)
>
>   Peter
>
>
> Gregor Heinrich wrote:
>
> >Hello Terry,
> >
> >Lucene can do forward indexing, as Mark Rosen outlines in his Master's
> >thesis: http://citeseer.nj.nec.com/rosen03email.html.
> >
> >We use a similar approach for (probabilistic) latent semantic analysis and
> >vector space searches. However, the solution is not really completely fixed
> >yet, therefore no code at this time...
> >
> >Best regards,
> >
> >Gregor
> >
> >
> >
> >
> >-----Original Message-----
> >From: Peter Becker [mailto:pbecker@dstc.edu.au]
> >Sent: Tuesday, August 19, 2003 3:06 AM
> >To: Lucene Users List
> >Subject: Re: Similar Document Search
> >
> >
> >Hi Terry,
> >
> >we have been thinking about the same problem and in the end we decided
> >that most likely the only good solution to this is to keep a
> >non-inverted index, i.e. a map from the documents to the terms. Then you
> >can query the most terms for the documents and query other documents
> >matching parts of this (where you get the usual question of what is
> >actually interesting: high frequency, low frequency or the mid range).
> >
> >Indexing would probably be quite expensive since Lucene doesn't seem to
> >support changes in the index, and the index for the terms would change
> >all the time. We haven't implemented it yet, but it shouldn't be hard to
> >code. I just wouldn't expect good performance when indexing large
> >collections.
> >
> >  Peter
> >
> >
> >Terry Steichen wrote:
> >
> >
> >
> >>Is it possible without extensive additional coding to use Lucene to conduct
> >>a search based on a document rather than a query?  (One use of this would be
> >>to refine a search by selecting one of the hits returned from the initial
> >>query and subsequently retrieving other documents "like" the selected one.)
> >>
> >>Regards,
> >>
> >>Terry
> >>



Re: Similar Document Search

Posted by Terry Steichen <te...@net-frame.com>.
Hi Peter,

I took a look at Mark's thesis and briefly at some of his code.  It appears
to me that what he's done with the so-called forward indexing is to (a)
include a unique id with each document (allowing retrieval by id rather than
by a standard query), and to (b) include a frequency map class with each
document (allowing easier retrieval of term frequency information).
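
To make (a) and (b) concrete, here is a minimal sketch of such a forward
index: a map from a unique document id to that document's term-frequency
map. It is not taken from Mark's code or from Lucene. The class name
ForwardIndex and its methods are hypothetical, and the simple lowercase
split on non-letter, non-digit characters is only a stand-in for whatever
Analyzer was used to build the real index.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical forward index: unique document id -> (term -> frequency).
final class ForwardIndex {
    private final Map<String, Map<String, Integer>> termsByDocId = new HashMap<>();

    // Tokenizes the text naively and stores the resulting frequency map under the id.
    void add(String docId, String text) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                frequencies.merge(token, 1, Integer::sum);
            }
        }
        termsByDocId.put(docId, frequencies);
    }

    // Returns a read-only view of the frequency map, or an empty map for unknown ids.
    Map<String, Integer> termFrequencies(String docId) {
        Map<String, Integer> frequencies = termsByDocId.get(docId);
        if (frequencies == null) {
            return Collections.emptyMap();
        }
        return Collections.unmodifiableMap(frequencies);
    }
}
```

Once the frequency map of a selected document can be looked up by id like
this, it can be compared against the maps of other documents, or its terms
can be used to build a follow-up query.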

Now I may be missing something very obvious, but it seems to me that both of
these functions can be done rather easily with the standard (unmodified)
version of Lucene.  Moreover, I don't understand how use of these functions
will facilitate retrieval of documents that are "similar" to a selected
document, as outlined in my original question on this topic.

Could you (or anyone else, of course) perhaps elaborate just a bit on how
using this approach will help achieve that end?

Regards,

Terry

----- Original Message -----
From: "Peter Becker" <pb...@dstc.edu.au>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Thursday, August 21, 2003 1:37 AM
Subject: Re: Similar Document Search


> Hi all,
>
> it seems there are quite a few people looking for similar features, i.e.
> (a) document identity and (b) forward indexing. So far we hacked (a) by
> using a wrapper implementing equals/hashcode based on a unique field,
> but of course that assumes maintaining a unique field in the index. (b)
> is something we haven't tackled yet, but plan to.
>
> The source code for Mark's thesis seems to be part of the Haystack
> distribution. The comments in the files put it under the Apache license. This
> seems to make it a good candidate to be included at least in the Lucene
> sandbox -- although I haven't tried it myself yet. But it sounds like a
> good candidate for us to use.
>
> Since the haystack source is a bit larger and I actually couldn't get
> the download at the moment, here is a copy of the relevant bit grabbed
> from one of my colleague's machines:
>
>   http://www.itee.uq.edu.au/~pbecker/luceneHaystack.tar.gz (22kb)
>
> Note that this is just a tarball of src/org/apache/lucene out of some
> Haystack source. Untested, unmodified.
>
> I'd love to see something like this supported in the Lucene context where
> people might actually find it :-)
>
>   Peter
>
>
> Gregor Heinrich wrote:
>
> >Hello Terry,
> >
> >Lucene can do forward indexing, as Mark Rosen outlines in his Master's
> >thesis: http://citeseer.nj.nec.com/rosen03email.html.
> >
> >We use a similar approach for (probabilistic) latent semantic analysis and
> >vector space searches. However, the solution is not really completely fixed
> >yet, therefore no code at this time...
> >
> >Best regards,
> >
> >Gregor
> >
> >
> >
> >
> >-----Original Message-----
> >From: Peter Becker [mailto:pbecker@dstc.edu.au]
> >Sent: Tuesday, August 19, 2003 3:06 AM
> >To: Lucene Users List
> >Subject: Re: Similar Document Search
> >
> >
> >Hi Terry,
> >
> >we have been thinking about the same problem and in the end we decided
> >that most likely the only good solution to this is to keep a
> >non-inverted index, i.e. a map from the documents to the terms. Then you
> >can query the most terms for the documents and query other documents
> >matching parts of this (where you get the usual question of what is
> >actually interesting: high frequency, low frequency or the mid range).
> >
> >Indexing would probably be quite expensive since Lucene doesn't seem to
> >support changes in the index, and the index for the terms would change
> >all the time. We haven't implemented it yet, but it shouldn't be hard to
> >code. I just wouldn't expect good performance when indexing large
> >collections.
> >
> >  Peter
> >
> >
> >Terry Steichen wrote:
> >
> >
> >
> >>Is it possible without extensive additional coding to use Lucene to conduct
> >>a search based on a document rather than a query?  (One use of this would be
> >>to refine a search by selecting one of the hits returned from the initial
> >>query and subsequently retrieving other documents "like" the selected one.)
> >>
> >>Regards,
> >>
> >>Terry
> >>


Re: Similar Document Search

Posted by Peter Becker <pb...@dstc.edu.au>.
Hi all,

it seems there are quite a few people looking for similar features, i.e. 
(a) document identity and (b) forward indexing. So far we hacked (a) by 
using a wrapper implementing equals/hashcode based on a unique field, 
but of course that assumes maintaining a unique field in the index. (b) 
is something we haven't tackled yet, but plan to.

The source code for Mark's thesis seems to be part of the Haystack 
distribution. The comments in the files put it under the Apache license. This 
seems to make it a good candidate to be included at least in the Lucene 
sandbox -- although I haven't tried it myself yet. But it sounds like a 
good candidate for us to use.

Since the haystack source is a bit larger and I actually couldn't get 
the download at the moment, here is a copy of the relevant bit grabbed 
from one of my colleague's machines:

  http://www.itee.uq.edu.au/~pbecker/luceneHaystack.tar.gz (22kb)

Note that this is just a tarball of src/org/apache/lucene out of some 
Haystack source. Untested, unmodified.

I'd love to see something like this supported in the Lucene context where 
people might actually find it :-)

  Peter


Gregor Heinrich wrote:

>Hello Terry,
>
>Lucene can do forward indexing, as Mark Rosen outlines in his Master's
>thesis: http://citeseer.nj.nec.com/rosen03email.html.
>
>We use a similar approach for (probabilistic) latent semantic analysis and
>vector space searches. However, the solution is not really completely fixed
>yet, therefore no code at this time...
>
>Best regards,
>
>Gregor
>
>
>
>
>-----Original Message-----
>From: Peter Becker [mailto:pbecker@dstc.edu.au]
>Sent: Tuesday, August 19, 2003 3:06 AM
>To: Lucene Users List
>Subject: Re: Similar Document Search
>
>
>Hi Terry,
>
>we have been thinking about the same problem and in the end we decided
>that most likely the only good solution to this is to keep a
>non-inverted index, i.e. a map from the documents to the terms. Then you
>can query the most terms for the documents and query other documents
>matching parts of this (where you get the usual question of what is
>actually interesting: high frequency, low frequency or the mid range).
>
>Indexing would probably be quite expensive since Lucene doesn't seem to
>support changes in the index, and the index for the terms would change
>all the time. We haven't implemented it yet, but it shouldn't be hard to
>code. I just wouldn't expect good performance when indexing large
>collections.
>
>  Peter
>
>
>Terry Steichen wrote:
>
>>Is it possible without extensive additional coding to use Lucene to conduct
>>a search based on a document rather than a query?  (One use of this would be
>>to refine a search by selecting one of the hits returned from the initial
>>query and subsequently retrieving other documents "like" the selected one.)
>>
>>Regards,
>>
>>Terry
>>


Re: Similar Document Search

Posted by Peter Becker <pb...@dstc.edu.au>.
Hi all,

it seems there are quite a few people looking for similar features, i.e. 
(a) document identity and (b) forward indexing. So far we hacked (a) by 
using a wrapper implementing equals/hashcode based on a unique field, 
but of course that assumes maintaining a unique field in the index. (b) 
is something we haven't tackled yet, but plan to.
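
A minimal sketch of what such an identity wrapper might look like, assuming the index carries a unique field. The class name DocumentKey, the field name "uid", and the plain Map standing in for a Lucene Document's stored fields are all hypothetical and only for illustration; this is not the code referred to above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical wrapper: identity is defined by one unique field only.
// The Map stands in for the stored fields of a Lucene Document.
final class DocumentKey {
    private final String id;
    private final Map<String, String> fields;

    DocumentKey(String uniqueFieldName, Map<String, String> fields) {
        String value = fields.get(uniqueFieldName);
        if (value == null) {
            throw new IllegalArgumentException("missing unique field: " + uniqueFieldName);
        }
        this.id = value;
        this.fields = fields;
    }

    Map<String, String> fields() {
        return fields;
    }

    // Two wrappers are equal when their unique field values are equal,
    // regardless of what the other fields contain.
    @Override
    public boolean equals(Object other) {
        return other instanceof DocumentKey && id.equals(((DocumentKey) other).id);
    }

    @Override
    public int hashCode() {
        return id.hashCode();
    }

    public static void main(String[] args) {
        Map<String, String> first = new HashMap<>();
        first.put("uid", "42");
        first.put("title", "hit returned by query one");
        Map<String, String> second = new HashMap<>();
        second.put("uid", "42");
        second.put("title", "same document returned by query two");

        Set<DocumentKey> seen = new HashSet<>();
        seen.add(new DocumentKey("uid", first));
        seen.add(new DocumentKey("uid", second));
        System.out.println(seen.size()); // prints 1
    }
}
```

With a wrapper like this, hits from different searches can be collected in ordinary sets and maps without duplicates, at the cost of having to maintain that unique field.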

The source code for Mark's thesis seems to be part of the Haystack 
distribution. The comments in the files put it under the Apache license. 
This seems to make it a good candidate to be included at least in the 
Lucene sandbox -- although I haven't tried it myself yet. But it sounds 
like a good candidate for us to use.

Since the haystack source is a bit larger and I actually couldn't get 
the download at the moment, here is a copy of the relevant bit grabbed 
from one of my colleague's machines:

  http://www.itee.uq.edu.au/~pbecker/luceneHaystack.tar.gz (22kb)

Note that this is just a tarball of src/org/apache/lucene out of some 
Haystack source. Untested, unmodified.

I'd love to see something like this supported in the Lucene context where 
people might actually find it :-)

  Peter


Gregor Heinrich wrote:

>Hello Terry,
>
>Lucene can do forward indexing, as Mark Rosen outlines in his Master's
>thesis: http://citeseer.nj.nec.com/rosen03email.html.
>
>We use a similar approach for (probabilistic) latent semantic analysis and
>vector space searches. However, the solution is not really completely fixed
>yet, therefore no code at this time...
>
>Best regards,
>
>Gregor
>
>
>
>
>-----Original Message-----
>From: Peter Becker [mailto:pbecker@dstc.edu.au]
>Sent: Tuesday, August 19, 2003 3:06 AM
>To: Lucene Users List
>Subject: Re: Similar Document Search
>
>
>Hi Terry,
>
>we have been thinking about the same problem and in the end we decided
>that most likely the only good solution to this is to keep a
>non-inverted index, i.e. a map from the documents to the terms. Then you
>can query the most terms for the documents and query other documents
>matching parts of this (where you get the usual question of what is
>actually interesting: high frequency, low frequency or the mid range).
>
>Indexing would probably be quite expensive since Lucene doesn't seem to
>support changes in the index, and the index for the terms would change
>all the time. We haven't implemented it yet, but it shouldn't be hard to
>code. I just wouldn't expect good performance when indexing large
>collections.
>
>  Peter
>
>
>Terry Steichen wrote:
>
>>Is it possible without extensive additional coding to use Lucene to conduct
>>a search based on a document rather than a query?  (One use of this would be
>>to refine a search by selecting one of the hits returned from the initial
>>query and subsequently retrieving other documents "like" the selected one.)
>>
>>Regards,
>>
>>Terry
>
>
>



RE: Similar Document Search

Posted by Gregor Heinrich <Gr...@igd.fhg.de>.
Hello Terry,

Lucene can do forward indexing, as Mark Rosen outlines in his Master's
thesis: http://citeseer.nj.nec.com/rosen03email.html.

We use a similar approach for (probabilistic) latent semantic analysis and
vector space searches. However, the solution is not really completely fixed
yet, therefore no code at this time...

Best regards,

Gregor




-----Original Message-----
From: Peter Becker [mailto:pbecker@dstc.edu.au]
Sent: Tuesday, August 19, 2003 3:06 AM
To: Lucene Users List
Subject: Re: Similar Document Search


Hi Terry,

we have been thinking about the same problem and in the end we decided
that most likely the only good solution to this is to keep a
non-inverted index, i.e. a map from the documents to the terms. Then you
can query the most terms for the documents and query other documents
matching parts of this (where you get the usual question of what is
actually interesting: high frequency, low frequency or the mid range).

Indexing would probably be quite expensive since Lucene doesn't seem to
support changes in the index, and the index for the terms would change
all the time. We haven't implemented it yet, but it shouldn't be hard to
code. I just wouldn't expect good performance when indexing large
collections.

  Peter


Terry Steichen wrote:

>Is it possible without extensive additional coding to use Lucene to conduct
>a search based on a document rather than a query?  (One use of this would be
>to refine a search by selecting one of the hits returned from the initial
>query and subsequently retrieving other documents "like" the selected one.)
>
>Regards,
>
>Terry
>
>
>





Re: Similar Document Search

Posted by Peter Becker <pb...@dstc.edu.au>.
Hi Terry,

we have been thinking about the same problem and in the end we decided 
that most likely the only good solution to this is to keep a 
non-inverted index, i.e. a map from the documents to the terms. Then you 
can look up the most significant terms for a document and query for other 
documents matching some of them (where you get the usual question of what 
is actually interesting: high frequency, low frequency or the mid range).
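
As a rough illustration of the idea (not an implementation that exists in Lucene), the sketch below keeps a map from document ids to term frequencies, picks a document's highest-weighted terms, and ranks the other documents by how many of those terms they contain. The class ForwardIndexSketch, its method names, the whitespace-style tokenizer and the tf * idf weighting are all assumptions chosen for the example; which frequency range is actually interesting remains the open question mentioned above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a forward (non-inverted) index; not part of Lucene.
class ForwardIndexSketch {
    // Forward index: document id -> (term -> frequency within that document).
    private final Map<String, Map<String, Integer>> termsByDoc = new HashMap<>();
    // Document frequency: term -> number of documents containing it.
    private final Map<String, Integer> docFreq = new HashMap<>();

    // Assumes each document id is added exactly once.
    void add(String docId, String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        for (String term : freqs.keySet()) {
            docFreq.merge(term, 1, Integer::sum);
        }
        termsByDoc.put(docId, freqs);
    }

    // Weight of a term in a document: term frequency times log(N / document frequency).
    private double weight(Map<String, Integer> freqs, String term) {
        double idf = Math.log((double) termsByDoc.size() / docFreq.get(term));
        return freqs.get(term) * idf;
    }

    // The n highest-weighted terms of a document; ties are broken alphabetically.
    List<String> topTerms(String docId, int n) {
        Map<String, Integer> freqs = termsByDoc.getOrDefault(docId, new HashMap<>());
        List<String> terms = new ArrayList<>(freqs.keySet());
        terms.sort(Comparator.comparingDouble((String t) -> -weight(freqs, t))
                .thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    // Other documents ranked by how many of the selected document's top terms they contain.
    List<String> similar(String docId, int nTerms) {
        List<String> query = topTerms(docId, nTerms);
        Map<String, Integer> overlap = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> entry : termsByDoc.entrySet()) {
            if (entry.getKey().equals(docId)) {
                continue;
            }
            int hits = 0;
            for (String term : query) {
                if (entry.getValue().containsKey(term)) {
                    hits++;
                }
            }
            if (hits > 0) {
                overlap.put(entry.getKey(), hits);
            }
        }
        List<String> ranked = new ArrayList<>(overlap.keySet());
        ranked.sort(Comparator.comparingInt((String d) -> -overlap.get(d))
                .thenComparing(Comparator.<String>naturalOrder()));
        return ranked;
    }

    public static void main(String[] args) {
        ForwardIndexSketch index = new ForwardIndexSketch();
        index.add("a", "lucene index search engine");
        index.add("b", "search engine index tuning");
        index.add("c", "cooking pasta recipe");
        index.add("d", "lucene merge segments");
        System.out.println(index.similar("a", 3));
    }
}
```

In a real setup the second step would more likely be an ordinary Lucene query built from the selected terms, so that the existing inverted index does the matching and scoring; the in-memory loop here only keeps the sketch self-contained.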

Indexing would probably be quite expensive since Lucene doesn't seem to 
support changes in the index, and the index for the terms would change 
all the time. We haven't implemented it yet, but it shouldn't be hard to 
code. I just wouldn't expect good performance when indexing large 
collections.

  Peter


Terry Steichen wrote:

>Is it possible without extensive additional coding to use Lucene to conduct a search based on a document rather than a query?  (One use of this would be to refine a search by selecting one of the hits returned from the initial query and subsequently retrieving other documents "like" the selected one.)
>
>Regards,
>
>Terry
>
>  
>



Re: Similar Document Search

Posted by Erik Hatcher <li...@ehatchersolutions.com>.
Using the QueryFilter would help with refining a search based on 
hits from a previous search, but it wouldn't help with the "like" part 
you asked about.
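
To show the refinement idea in isolation, here is a small sketch that does not use the Lucene API at all. It assumes, as a simplification, that a filter of this kind boils down to a bit set of the document numbers matched by the first query, which is then intersected with the matches of the second query. The class FilterSketch and its methods are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration of filtering one search by the hits of another.
// Documents are identified by their position in the array, loosely mirroring
// integer document numbers.
class FilterSketch {
    // Bit i is set when document i contains the term as a whole word.
    static BitSet matching(String[] docs, String term) {
        BitSet bits = new BitSet(docs.length);
        for (int i = 0; i < docs.length; i++) {
            if (Arrays.asList(docs[i].toLowerCase().split("\\s+")).contains(term)) {
                bits.set(i);
            }
        }
        return bits;
    }

    // Runs a second search but keeps only documents allowed by the filter.
    static List<Integer> search(String[] docs, String term, BitSet filter) {
        BitSet hits = matching(docs, term);
        hits.and(filter);
        List<Integer> result = new ArrayList<>();
        for (int i = hits.nextSetBit(0); i >= 0; i = hits.nextSetBit(i + 1)) {
            result.add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] docs = {"lucene search", "lucene index", "pasta search"};
        BitSet firstQueryHits = matching(docs, "lucene");
        System.out.println(search(docs, "search", firstQueryHits)); // prints [0]
    }
}
```

This narrows a result set but, as noted, says nothing about which documents are "like" a selected one.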

I'm interested in what you turn up with this though.

	Erik

On Monday, August 18, 2003, at 01:11  PM, Terry Steichen wrote:

> Is it possible without extensive additional coding to use Lucene to 
> conduct a search based on a document rather than a query?  (One use of 
> this would be to refine a search by selecting one of the hits returned 
> from the initial query and subsequently retrieving other documents 
> "like" the selected one.)
>
> Regards,
>
> Terry



