Posted to java-user@lucene.apache.org by Giovanni Gherdovich <g....@gmail.com> on 2012/07/15 17:56:32 UTC

from docID to terms enumerator in O(1) ?

Hi all,

I'd like to know if I can get the list of indexed terms in a document
from its document ID in constant time
(say, in a time independent of the size of the index).

The reason I ask might be relevant
(you might suggest a totally different way to achieve my goal).

I want to present the search results of a query as a word cloud,
i.e. no scoring, no sorting, no nothing, just a visual representation
of the array of pairs (term, docFreq) for all terms appearing in
at least one of the docs that matched my query.
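As a side note on the bookkeeping: the per-document term lists still have to be folded into one (term, docFreq) table for the cloud. A minimal, Lucene-independent sketch of that step (helper name hypothetical; assumes each per-document list holds unique terms, as a term enumerator would yield):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not from the book: folds one term list per
// matching document into the (term, docFreq) map behind the word cloud.
// Each term counts once per document it appears in, which is what
// docFreq means.
class WordCloudCounts {
    static Map<String, Integer> aggregate(String[][] termsPerMatchingDoc) {
        Map<String, Integer> docFreq = new HashMap<String, Integer>();
        for (String[] docTerms : termsPerMatchingDoc) {
            for (String term : docTerms) {
                Integer seen = docFreq.get(term);
                docFreq.put(term, seen == null ? 1 : seen + 1);
            }
        }
        return docFreq;
    }
}
```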

Skimming through the pages of "Lucene in Action"
I found that I might need to call the method

void IndexSearcher.search(Query query, Collector results)

i.e. pass my own Collector class to that method,
one that fetches and cooks results the way I want.

The author provides a very clear code example for
the Collector,

-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
public class BookLinkCollector extends Collector {
    private Map<String,String> documents = new HashMap<String,String>();
    private Scorer scorer;
    private String[] urls;
    private String[] titles;

    public BookLinkCollector(IndexSearcher searcher) {
        // the searcher is unused here; kept to match the test below
    }

    public boolean acceptsDocsOutOfOrder() {
        return true;
    }

    public void setScorer(Scorer scorer) {
        this.scorer = scorer;
    }

    public void setNextReader(IndexReader reader, int docBase)
            throws IOException {
        urls = FieldCache.DEFAULT.getStrings(reader, "url");
        titles = FieldCache.DEFAULT.getStrings(reader, "title2");
    }

    public void collect(int docID) {
        try {
            String url = urls[docID];
            String title = titles[docID];
            documents.put(url, title);
            System.out.println(title + ":" + scorer.score());
        } catch (IOException e) {
            // ignore scoring errors in this example
        }
    }

    public Map<String,String> getLinks() {
        return Collections.unmodifiableMap(documents);
    }
}
-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8

which is then used like

-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8
public void testCollecting() throws Exception {
    Directory dir = TestUtil.getBookIndexDirectory();
    TermQuery query = new TermQuery(new Term("contents", "junit"));
    IndexSearcher searcher = new IndexSearcher(dir);
    BookLinkCollector collector = new BookLinkCollector(searcher);

    searcher.search(query, collector);
    Map<String,String> linkMap = collector.getLinks();
    assertEquals("ant in action",
                 linkMap.get("http://www.manning.com/loughran"));
    searcher.close();
    dir.close();
}
-- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8 -- -- >8

What might not work for me is the use of FieldCache
on the IndexReader to retrieve all field values on the current segment;
those values are returned as String[],

while for me it would be more convenient to get a term enumerator:
all the tokenizing and stopword removal work has already been
done at indexing time, and I would like to leverage that.

How does that sound?

Cheers,
Giovanni

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: from docID to terms enumerator in O(1) ?

Posted by Giovanni Gherdovich <g....@gmail.com>.
2012/7/15 Uwe Schindler <uw...@thetaphi.de>:
> Enable term vectors while indexing and use the TermVector API.
>

Thank you very much Uwe!

I just got back to chapter 2 of "Lucene in Action", where it says

"If it’s indexed, the field may also optionally store term vectors,
which are collectively a miniature inverted index for that one field,
allowing you to retrieve all of its tokens."

Just as you say, this is exactly what I need.
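For the archives, a rough sketch of what this looks like against the Lucene 3.x API of the time (the field name "contents" and the surrounding setup are hypothetical):

```java
import java.io.IOException;

import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

class TermVectorSketch {
    // Indexing time: ask Lucene to store a term vector for the field.
    static Field contentsField(String text) {
        return new Field("contents", text,
                         Field.Store.NO,
                         Field.Index.ANALYZED,
                         Field.TermVector.YES);
    }

    // Search time: go from a docID straight to that document's terms.
    // The lookup reads only that document's stored vector, so the cost
    // does not grow with the size of the index.
    static void printTerms(IndexReader reader, int docID) throws IOException {
        TermFreqVector tfv = reader.getTermFreqVector(docID, "contents");
        if (tfv == null) {
            return; // no term vector was stored for this field
        }
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + ": " + freqs[i]);
        }
    }
}
```

This needs the Lucene 3.x jars on the classpath; the frequencies returned here are within-document term frequencies, so a word cloud over docFreq would count each returned term once per document.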

Cheers,
GGhh



RE: from docID to terms enumerator in O(1) ?

Posted by Uwe Schindler <uw...@thetaphi.de>.
Enable term vectors while indexing and use the TermVector API.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

