Posted to user@lucenenet.apache.org by Nicholas Petersen <np...@gmail.com> on 2011/02/06 02:14:56 UTC

Is it Possible to Get Full Hit Counts within each Document?

Two burning questions by this Lucene beginner:

1) Is it possible to get the number of hits *within* each Document, not just
the number of documents? Consider a scenario like this: you are indexing a
book by page, where each page represents an indexed Document. But
ultimately, you want to know how many times the query term occurs in the
entire book ... (e.g. "Silverlight" 17 times). Is this feasible? Ultimately,
I am hoping to go beyond just "Document hits," to actual phrase hits (see
next point).

2) Besides getting the entire count of actual hits, is it possible to get
the surrounding text of a query hit? I know this would entail storing the
inserted string content values in the index, but what I am after is having
the phrase of the hit. Like in Adobe Reader, in a search on the side, all
hits in the entire document get displayed with the found text. So if someone
searches for "scanner", the first Document might have 2 hits, for each of
which I'd want to display a snippet of text surrounding the hit:
Document 1:
"... a top notch *scanner* is the Canon ..."
"... with Xerox's 411 *scanner* your options ..."

I'm currently reading through Lucene in Action, but I'm burning to know if
this scenario is ultimately possible before I make it through those 500
pages.

Thanks a million for any help or advice,
Nick

Re: Is it Possible to Get Full Hit Counts within each Document?

Posted by Nicholas Petersen <np...@gmail.com>.
You rock Heath and DIGY, thanks, I'll be looking into these things and will
share back any good stuff I find...

Nick

On Sun, Feb 6, 2011 at 1:38 PM, Digy <di...@gmail.com> wrote:

> For your second question, you can use FastVectorHighlighter in contrib.
>
> DIGY

RE: Is it Possible to Get Full Hit Counts within each Document?

Posted by Digy <di...@gmail.com>.
For your second question, you can use FastVectorHighlighter in contrib.
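
To make the goal concrete before wiring up the contrib highlighter, here is a minimal sketch in plain C#. It is not FastVectorHighlighter and uses no Lucene.Net API at all: it simply scans text you already hold (for example, the stored content of a hit Document) and cuts a window around each match. The names SnippetSketch, FindSnippets and the context parameter are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class SnippetSketch
{
    // Returns one snippet per case-insensitive occurrence of term in text,
    // keeping up to `context` characters on each side of the hit.
    // Assumption: plain substring matching, so "scanner" also matches inside
    // "scanners"; a real highlighter works on analyzed terms instead.
    public static List<string> FindSnippets(string text, string term, int context)
    {
        var snippets = new List<string>();
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(term) || context < 0)
            return snippets;

        int pos = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
        while (pos >= 0)
        {
            int start = Math.Max(0, pos - context);
            int end = Math.Min(text.Length, pos + term.Length + context);

            string before = text.Substring(start, pos - start);
            string hit = text.Substring(pos, term.Length);
            string after = text.Substring(pos + term.Length, end - pos - term.Length);

            // Mark the hit with asterisks, as in the examples in the question.
            snippets.Add("... " + before + "*" + hit + "*" + after + " ...");

            // Continue searching after the current hit.
            pos = text.IndexOf(term, pos + term.Length, StringComparison.OrdinalIgnoreCase);
        }
        return snippets;
    }

    public static void Main()
    {
        string page = "For this budget a top notch scanner is the Canon, "
                    + "but with Xerox's 411 scanner your options grow.";
        foreach (string snippet in FindSnippets(page, "scanner", 16))
            Console.WriteLine(snippet);
    }
}
```

The window here is cut by character count, so a snippet can start or end mid-word; snapping to whitespace would be an easy refinement.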

DIGY



Re: Is it Possible to Get Full Hit Counts within each Document?

Posted by Heath Aldrich <ha...@aes2.com>.
Hi Nick.

Google "faceted searching" and look up the simplefacet library.

Facets are what you see on many e-commerce sites where you narrow your search by some property of an item (e.g., size when you're shopping for TVs).

In your case you can make page number a property and then get your counts based on that facet.

I don't consider it an easy thing, but if you google around you will find some direction.

This is why some people choose Solr, as it is built on top of Lucene and has much of this built in so you don't have to build it.



Sent from my mobile phone. 
Please excuse any spelling or grammatical errors.

On Feb 5, 2011, at 9:59 PM, "Nicholas Petersen" <np...@gmail.com> wrote:

> Thanks Heath, but I'm struggling with this:
> 
> <Effectively, what you would do is store the searchable text for each page
> as a record along with the page number.>
> 
> No problem, just have Field.Store.YES on your content field:
> doc.Add(new Field(contentFieldName, content, Field.Store.YES,
> Field.Index.ANALYZED));
> 
> while having another Field in that document with the Page #. But here is the
> problem:
> 
> <*Then you could do a search with faceted results by page*. So you'd see
> each page that has the results, and a count of how many results per page.>
> 
> ? Would you mind sharing a bit more of what that means? Do you mean this:
> now that you got back hit documents, you yourself manually retrieve and then
> search on the content text that you stored in the document (thanks to
> Field.Store.YES)?
> 
> Thanks for your help, and I would most appreciate any advice you others may
> have too!
> 
> Nick


Re: Is it Possible to Get Full Hit Counts within each Document?

Posted by Nicholas Petersen <np...@gmail.com>.
Thanks Heath, but I'm struggling with this:

<Effectively, what you would do is store the searchable text for each page
as a record along with the page number.>

No problem, just have Field.Store.YES on your content field:
doc.Add(new Field(contentFieldName, content, Field.Store.YES,
Field.Index.ANALYZED));

while having another Field in that document with the Page #. But here is the
problem:

<*Then you could do a search with faceted results by page*. So you'd see
each page that has the results, and a count of how many results per page.>

? Would you mind sharing a bit more of what that means? Do you mean this:
now that you got back hit documents, you yourself manually retrieve and then
search on the content text that you stored in the document (thanks to
Field.Store.YES)?

Thanks for your help, and I would most appreciate any advice you others may
have too!

Nick

RE: Is it Possible to Get Full Hit Counts within each Document?

Posted by Heath Aldrich <ha...@aes2.com>.
Hi Nicholas, 

For your first question, it sounds like you are looking for faceted searching.
Effectively, what you would do is store the searchable text for each page as a record along with the page number.
Then you could do a search with faceted results by page.
So you'd see each page that has the results, and a count of how many results per page.
Then for the full count of results (if you cared), you could either add up the numbers from the faceted pages, or do a search again without the page facet.

Facets are not native in Lucene.net, but there is a library we've used... I think it was called "Simple Facets" (that's off the cuff, so you may have to google it)
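
As a rough illustration of the arithmetic only (count per page, then add the pages up for the whole book), here is a small sketch in plain C#. It does not use the facet library or any Lucene.Net API, and the names HitCountSketch, CountOccurrences and CountPerPage are invented for the example. One thing to watch: a facet count normally tells you how many matching documents fall under each facet value, whereas this sketch counts occurrences of the term inside each page's text, which is what the "Silverlight 17 times" case asks for.

```csharp
using System;
using System.Collections.Generic;

public static class HitCountSketch
{
    // Counts non-overlapping, case-insensitive occurrences of term in text.
    // Assumption: plain substring matching, not analyzed terms.
    public static int CountOccurrences(string text, string term)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(term)) return 0;

        int count = 0;
        int pos = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
        while (pos >= 0)
        {
            count++;
            pos = text.IndexOf(term, pos + term.Length, StringComparison.OrdinalIgnoreCase);
        }
        return count;
    }

    // Maps page number -> hit count, keeping only pages with at least one hit.
    public static SortedDictionary<int, int> CountPerPage(IDictionary<int, string> pages, string term)
    {
        var perPage = new SortedDictionary<int, int>();
        foreach (KeyValuePair<int, string> page in pages)
        {
            int hits = CountOccurrences(page.Value, term);
            if (hits > 0) perPage[page.Key] = hits;
        }
        return perPage;
    }

    public static void Main()
    {
        // Hypothetical three-page "book"; in practice the text would come from
        // the stored content field of each page Document.
        var pages = new Dictionary<int, string>
        {
            { 1, "Silverlight basics. Why Silverlight?" },
            { 2, "No match on this page." },
            { 3, "Deploying a Silverlight app." }
        };

        int total = 0;
        foreach (KeyValuePair<int, int> entry in CountPerPage(pages, "Silverlight"))
        {
            Console.WriteLine("Page " + entry.Key + ": " + entry.Value + " hit(s)");
            total += entry.Value;
        }
        Console.WriteLine("Whole book: " + total + " hit(s)");
        // Prints:
        // Page 1: 2 hit(s)
        // Page 3: 1 hit(s)
        // Whole book: 3 hit(s)
    }
}
```

Rescanning stored text like this is fine for a handful of hit pages; Lucene's index also records a per-document term frequency, which would avoid the rescan for single-term queries.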

Good luck
Heath

