Posted to java-user@lucene.apache.org by KK <di...@gmail.com> on 2009/05/25 07:03:55 UTC

How to extract 15/20 words around the matched query after getting results from lucene searcher?

Hi All,
I'm trying to index some non-English web pages. I keep all the content of
each page in a single field, and searches are working fine. When I run a
query, each hit returns the complete page, which is expected. Now I want to
restrict what is shown to, say, 20 words around the match, something like
Google does; otherwise users have to look for the match in the whole page
content [I'll use the highlighter after this is done]. Getting the positions
of the matched word/phrase might help, so that I can extract some words
before and some words after it and show that to the end user. Any ideas on
how to do this would be very helpful. Thank
you.

KK.

Re: How to extract 15/20 words around the matched query after getting results from lucene searcher?

Posted by Grant Ingersoll <gs...@apache.org>.

On May 25, 2009, at 1:34 AM, KK wrote:

> Also people are talking about something called spanQueries/termvectors etc.
> to use for this purpose. I'm still to get the exact idea of how to do this.

I just blogged up a quick little demo (including full code) of this at http://www.lucidimagination.com/blog/2009/05/26/accessing-words-around-a-positional-match-in-lucene/

Please excuse any mistakes on my part as I threw it together pretty  
quickly today.

HTH,
Grant


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: How to extract 15/20 words around the matched query after getting results from lucene searcher?

Posted by Hasan Diwan <ha...@gmail.com>.

2009/5/24 KK <di...@gmail.com>:
> There is one more mail I found in the archive [3/4 days old] where someone
> asked about extracting 3 neighboring words around the match. I think once you
> have the position of the matching term/phrase, extracting 3 or 30 neighbors
> won't be different, right? You just have to move back/forward and get
> the words. This sounds logically simple, but I don't know how simple it is
> implementation-wise.
> Also people are talking about something called spanQueries/termvectors etc. to
> use for this purpose. I'm still to get the exact idea of how to do this.
> As per your mail, you used Java to extract the neighbors. Is that using the
> standard techniques, i.e. those spanqueries/termvectors, or something
> else?
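// query contains the query string, and doc contains the string
// corresponding to the document contents (both are String fields)
// needs: import java.util.Arrays; import java.util.List;
public String resultPhrase() {
   int numberOfWords = 20; // get 20 words on either side of query
   String[] words = doc.split("\\s+");
   String[] queryWords = query.split("\\s+");
   List<String> wordList = Arrays.asList(words);
   // word position of the first query word, or -1 if it does not occur
   int match = wordList.indexOf(queryWords[0]);
   if (match < 0) return "";
   // clamp the window so it never runs off either end of the array
   int start = Math.max(0, match - numberOfWords);
   int end = Math.min(words.length, match + queryWords.length + numberOfWords);
   StringBuilder ret = new StringBuilder();
   for (int i = start; i < end; i++) {
      if (ret.length() > 0) ret.append(' '); // keep words separated
      ret.append(words[i]);
   }
   return ret.toString();
}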
This isn't tested, but let me know how it works. It doesn't use
anything beyond what is in the JDK.
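
The method above locates only the first word of the query. As a minimal
sketch of the same JDK-only idea that matches the whole phrase, something
like the following could work. The class name SnippetWindow and the method
around() are hypothetical, and splitting on whitespace is an assumption that
will not suit languages written without spaces between words.

```java
import java.util.Arrays;

public class SnippetWindow {
    /**
     * Returns up to `radius` words on each side of the first occurrence of
     * the phrase `query` in `doc`, or "" if the phrase does not occur.
     * Words are whatever whitespace splitting yields (an assumption).
     */
    static String around(String doc, String query, int radius) {
        String[] words = doc.trim().split("\\s+");
        String[] queryWords = query.trim().split("\\s+");
        for (int i = 0; i + queryWords.length <= words.length; i++) {
            // Compare the query words against the document words starting at i.
            String[] candidate = Arrays.copyOfRange(words, i, i + queryWords.length);
            if (Arrays.equals(candidate, queryWords)) {
                // Clamp the window to the bounds of the document.
                int from = Math.max(0, i - radius);
                int to = Math.min(words.length, i + queryWords.length + radius);
                return String.join(" ", Arrays.copyOfRange(words, from, to));
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String doc = "the quick brown fox jumps over the lazy dog near the river bank";
        System.out.println(around(doc, "lazy dog", 2)); // over the lazy dog near the
    }
}
```

The comparison is exact and case-sensitive, so it only finds text that
appears in the stored page exactly as typed in the query.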
-- 
Sent from my mobile device


Re: How to extract 15/20 words around the matched query after getting results from lucene searcher?

Posted by KK <di...@gmail.com>.

Thank you very much @Michael.
I googled and didn't find much, but I grabbed the book LIA 2nd Edn, went
through it, and found a very good example in Sec 8.7 that helped me
solve the problem. Now I'm able to do highlighting for English text, but for
non-English text no luck yet. I've posted new mails about that [I don't
want to cross-post]. If possible, please have a look.
And yes, I must thank you and the other authors for the wonderful book,
LIA. Going through it cleared up a lot of things for me; it is a really
nice book. Surprisingly, there is not much information on the net about
the advanced features of Lucene, all of which are clearly explained in this
book with examples.

Thank you.
KK.

On Mon, May 25, 2009 at 7:47 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> I would do some googling to find examples, or read the javadocs for
> the highlighter package?
>
> Or pick up copy of the early-release of Lucene in Action 2nd edition
> from http://manning.com [disclaimer: I'm one of the authors on that
> book!].  We've revamped the highlighter coverage (in chapter 8)...
>
> Mike

Re: How to extract 15/20 words around the matched query after getting results from lucene searcher?

Posted by Michael McCandless <lu...@mikemccandless.com>.

I would do some googling to find examples, or read the javadocs for
the highlighter package?

Or pick up a copy of the early-release of Lucene in Action 2nd edition
from http://manning.com [disclaimer: I'm one of the authors on that
book!].  We've revamped the highlighter coverage (in chapter 8)...

Mike

On Mon, May 25, 2009 at 6:19 AM, KK <di...@gmail.com> wrote:
> Thanks @Michael.
> I've no idea about this contrib though I'm looking into highlighter. Can you
> throw some lights on the same. The steps to be taken for achieving the same.
> I'm completely new to this thing. Can you point me to some examples for the
> same? Thank you.
>
> KK.


Re: How to extract 15/20 words around the matched query after getting results from lucene searcher?

Posted by KK <di...@gmail.com>.

Thanks @Michael.
I have no idea about this contrib, though I'm looking into the highlighter.
Can you shed some light on it and on the steps needed to achieve this?
I'm completely new to this. Can you point me to some examples?
Thank you.

KK.

On Mon, May 25, 2009 at 3:26 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Can't you use contrib/highlighter to achieve this?
>
> It can do both excerpting (grabbing chunk of text around each hit) and
> highlighting (highlighting the specific tokens that matched, within
> that excerpt).
>
> Mike

Re: How to extract 15/20 words around the matched query after getting results from lucene searcher?

Posted by Michael McCandless <lu...@mikemccandless.com>.

Can't you use contrib/highlighter to achieve this?

It can do both excerpting (grabbing a chunk of text around each hit) and
highlighting (highlighting the specific tokens that matched, within
that excerpt).

Mike
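
To make the two steps concrete, here is a rough plain-JDK sketch of
excerpting followed by highlighting. It is not the contrib/highlighter API
and does not use Lucene at all; the class ExcerptHighlight, the method
snippet(), and the contextChars parameter are hypothetical names chosen for
illustration. Unlike the highlighter, it matches the raw query string rather
than analyzed tokens.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExcerptHighlight {
    /**
     * Excerpting: keeps `contextChars` characters on each side of the first
     * case-insensitive hit of `term`. Highlighting: wraps every hit inside
     * that excerpt in <b>...</b>. Returns "" when `term` does not occur.
     */
    static String snippet(String text, String term, int contextChars) {
        // Pattern.quote treats the term literally, not as a regex.
        Pattern p = Pattern.compile(Pattern.quote(term),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher m = p.matcher(text);
        if (!m.find()) {
            return "";
        }
        // Step 1: cut the excerpt, clamped to the bounds of the text.
        int from = Math.max(0, m.start() - contextChars);
        int to = Math.min(text.length(), m.end() + contextChars);
        String excerpt = text.substring(from, to);
        // Step 2: mark the matches; $0 stands for the matched text itself.
        return p.matcher(excerpt).replaceAll("<b>$0</b>");
    }

    public static void main(String[] args) {
        System.out.println(snippet("abc Fox def", "fox", 2)); // c <b>Fox</b> d
    }
}
```

The output is not HTML-escaped and the cut points can fall in the middle of
a word, both of which a real excerpting component would need to handle.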

On Mon, May 25, 2009 at 5:20 AM, KK <di...@gmail.com> wrote:
> Thanks for your response @Seid.
> Can any Lucene user give me directions on this regard? I'm stuck.
>  Really appreciate your help.
>
> Thanks,
> KK


Re: How to extract 15/20 words around the matched query after getting results from lucene searcher?

Posted by KK <di...@gmail.com>.

Thanks for your response @Seid.
Can any Lucene user give me directions in this regard? I'm stuck.
I'd really appreciate your help.

Thanks,
KK

On Mon, May 25, 2009 at 2:43 PM, Seid Muhie <se...@gmail.com> wrote:

> actually I used the normal java standard libraries for this work. I
> used lucene only to retrieve the relevant document.
> what you will do is, thought it is to manuall, as i don't know the way
> it can be done by the Lucene API, you just record the location of the
> query terms in the document (it is as easy as indexOf(query terms)).
> But you have to be very aware of the speed of the system. then you can
> go ahead or back, as you want.
>
> Once again, I think there will be also a LUcene workaround, that I am
> not aware of it at all.
>
> Seid M

Re: How to extract 15/20 words around the matched query after getting results from lucene searcher?

Posted by Seid Muhie <se...@gmail.com>.

Actually, I used the normal Java standard libraries for this work. I
used Lucene only to retrieve the relevant document.
What you can do, though it is too manual (I don't know how
it can be done with the Lucene API), is record the location of the
query terms in the document (it is as easy as indexOf(query terms)).
But you have to be very aware of the speed of the system. Then you can
move forward or back as far as you want.
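
A minimal sketch of this indexOf-based approach, assuming words are
separated by whitespace: find the character offset of the term, then walk
outward a fixed number of words without splitting the whole document. The
class OffsetExcerpt and the method excerpt() are hypothetical names, not
part of Lucene or of the code used in the thesis work.

```java
public class OffsetExcerpt {
    /**
     * Returns the text covering up to `radius` words on each side of the
     * first occurrence of `term`, or "" if `term` does not occur.
     */
    static String excerpt(String text, String term, int radius) {
        int hit = text.indexOf(term);
        if (hit < 0) {
            return "";
        }
        int start = hit;
        int end = hit + term.length();
        // indexOf can land inside a word, so widen to whole-word boundaries.
        while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) start--;
        while (end < text.length() && !Character.isWhitespace(text.charAt(end))) end++;
        // Walk left: skip whitespace, then one word, `radius` times.
        for (int n = 0; n < radius && start > 0; n++) {
            while (start > 0 && Character.isWhitespace(text.charAt(start - 1))) start--;
            while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) start--;
        }
        // Walk right in the same way.
        for (int n = 0; n < radius && end < text.length(); n++) {
            while (end < text.length() && Character.isWhitespace(text.charAt(end))) end++;
            while (end < text.length() && !Character.isWhitespace(text.charAt(end))) end++;
        }
        return text.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String text = "one two three four five six seven";
        System.out.println(excerpt(text, "four", 2)); // two three four five six
    }
}
```

Only the characters near the hit are scanned after the initial indexOf,
which keeps the per-document cost low for long pages.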

Once again, I think there is also a Lucene way to do this that I am
not aware of.

Seid M

On 5/25/09, KK <di...@gmail.com> wrote:
> Thanks for your quick response, Seid.
>
> There is one more mail I found in the archive[3/4 days old] where someone
> asked about extracting 3 neighbors words around the match. I think once you
> have the position of matching term/phrase then extracting 3 or 30 neighbors
> wont be different, right? because you just have to move back/forward and get
> the words, this sounds logically simple but I dont know how simple is this
> implementation-wise.
> Also people are talking about someting called spanQueries/termvectors etc to
> use for this purpose. I'm still to get the exact idea of how to do this.
> As per your mail, you used Java to extract the neighbors, Is that using the
> standard techniques i.e using those spanqueries/termvectors or something
> else.
> If you can elaborate all this a bit It'd be very helpful.
>
> Thank you.
> KK>


-- 
"RABI ZIDNI ILMA"


Re: How to extract 15/20 words around the matched query after getting results from lucene searcher?

Posted by KK <di...@gmail.com>.

Thanks for your quick response, Seid.

There is one more mail I found in the archive [3/4 days old] where someone
asked about extracting 3 neighboring words around the match. I think once you
have the position of the matching term/phrase, extracting 3 or 30 neighbors
won't be any different, right? You just have to move back/forward and get
the words. This sounds logically simple, but I don't know how simple it is
implementation-wise.
Also, people are talking about something called SpanQueries/term vectors etc.
to use for this purpose. I have yet to get a clear idea of how to do this.
As per your mail, you used Java to extract the neighbors. Is that using the
standard techniques, i.e. those SpanQueries/term vectors, or something
else?
If you can elaborate on all this a bit, it'd be very helpful.

Thank you.
KK

On Mon, May 25, 2009 at 10:51 AM, Seid Muhie <se...@gmail.com> wrote:

> For my thesis work (question answering), I first retrieved the
> document and then used plain Java to extract the needed answer.
> In your case, first locate the positions of the query terms in the
> document (they may be scattered throughout the document, which makes
> it harder to get the 15/20 words), then count about 10 words forward
> and backward and extract the match.
> This is how I handled my problem. I hope there are other ideas
> too.
>
> Seid M.

Re: How to extract 15/20 words around the matched query after getting results from lucene searcher?

Posted by Seid Muhie <se...@gmail.com>.

For my thesis work (question answering), I first retrieved the
document and then used plain Java to extract the needed answer.
In your case, first locate the positions of the query terms in the
document (they may be scattered throughout the document, which makes
it harder to get the 15/20 words), then count about 10 words forward
and backward and extract the match.
This is how I handled my problem. I hope there are other ideas
too.
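
One way to handle the scattered-terms case above is to work on word positions rather than character offsets, and to merge windows that overlap. The following is only a sketch under assumptions: the names (WordWindowSnippet, snippets, matches) are made up for illustration, words are split on whitespace, only ASCII punctuation is stripped before comparing, and it needs Java 8 or later for String.join:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WordWindowSnippet {

    // Returns one snippet per group of nearby matches. Each match contributes
    // `window` words on both sides; overlapping windows are merged into one.
    static List<String> snippets(String text, List<String> queryTerms, int window) {
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        int from = -1; // start of the current merged window, -1 if none yet
        int to = -1;   // exclusive end of the current merged window
        for (int i = 0; i < words.length; i++) {
            if (!matches(words[i], queryTerms)) {
                continue;
            }
            int lo = Math.max(0, i - window);
            int hi = Math.min(words.length, i + window + 1);
            if (from >= 0 && lo <= to) {
                to = hi; // touches or overlaps the previous window: extend it
            } else {
                if (from >= 0) {
                    out.add(join(words, from, to));
                }
                from = lo;
                to = hi;
            }
        }
        if (from >= 0) {
            out.add(join(words, from, to));
        }
        return out;
    }

    // Compares a word to each query term, ignoring case and any ASCII
    // punctuation attached to the start or end of the word.
    static boolean matches(String word, List<String> queryTerms) {
        String bare = word.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
        for (String term : queryTerms) {
            if (bare.equalsIgnoreCase(term)) {
                return true;
            }
        }
        return false;
    }

    static String join(String[] words, int from, int to) {
        return String.join(" ", Arrays.asList(words).subList(from, to));
    }

    public static void main(String[] args) {
        String page = "Lucene returns the whole stored page, but users only "
                + "want a few words around the match in each result.";
        for (String s : snippets(page, Arrays.asList("page", "match"), 3)) {
            System.out.println("... " + s + " ...");
        }
    }
}
```

A simple whitespace split will not match what a Lucene analyzer produced at index time (stemming, stop words, or non-English tokenization), so the words found here may differ from the terms that actually matched the query.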

Seid M.

On 5/25/09, KK <di...@gmail.com> wrote:
> Hi All,
> I'm trying to index some non-English web pages. I'm keeping all the
> content of each page in a single field, and the searches are working fine.
> Now when I search for some query it returns the complete page, which is
> expected. I want to restrict the displayed results to, say, 20 words
> around the match, something like Google does; otherwise we can't expect users
> to look for a match in the whole page content [I'll use a highlighter after
> this is done]. So getting the positions of the matched word/phrase might help,
> so that I can extract some words before and some words after it and show
> that to the end user. Any ideas on doing this would be very helpful. Thank
> you.
>
> KK.
>

-- 
"RABI ZIDNI ILMA"
