Posted to java-user@lucene.apache.org by Sirish Vadala <si...@gmail.com> on 2010/09/16 19:39:18 UTC

Problem searching in the same sentence

Hello All: 

Can anyone suggest the best way to perform a sentence-specific
phrase search?

Eg: Let the indexed text be: 

If you are posting a question, please try search first. Your question may
have already been answered. Don't post repeatedly. Wait for a few days.
People will read your post by email.

Now if I search for the phrase 'post repeatedly Wait for a few', the document
is still retrieved even though the words span two different sentences.

Currently I am using StandardAnalyzer, and this is how I am generating Lucene
documents:

Field field = new Field(fieldName, validFieldValue, Field.Store.YES,
Field.Index.ANALYZED);
document.add(field);

Is there a way to keep track of sentence boundaries while indexing the
content?
Any hint would be appreciated. 
Thanks. 
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Problem-searching-in-the-same-sentence-tp1501269p1501269.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Problem searching in the same sentence

Posted by Sirish Vadala <si...@gmail.com>.

I have tried the below code:

Field field = new Field(fieldName, validFieldValue,
	(store) ? Field.Store.YES : Field.Store.NO,
	(tokenize) ? Field.Index.ANALYZED : Field.Index.NOT_ANALYZED,
	Field.TermVector.WITH_POSITIONS_OFFSETS);

However, I still have the same problem: the highlighter does not return any
snippets. Any other hints would be highly appreciated.

Thanks.
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Problem-searching-in-the-same-sentence-tp1501269p1611118.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


RE: Problem searching in the same sentence

Posted by Jagdish Vasani IN <jv...@in.Capitallegals.com>.

For highlighting to work, you need to store the position information of each
token.
So when creating the field, call the following constructor:

	Field field = new Field(fieldName, validFieldValue,
			(store) ? Field.Store.YES : Field.Store.NO,
			(tokenize) ? Field.Index.ANALYZED : Field.Index.NOT_ANALYZED,
			Field.TermVector.WITH_POSITIONS_OFFSETS);

Hope this will solve your issue..

Thanks,
Jagdish
-----Original Message-----
From: Sirish Vadala [mailto:sirishreddy@gmail.com] 
Sent: Thursday, September 30, 2010 5:51 AM
To: java-user@lucene.apache.org
Subject: Re: Problem searching in the same sentence


Hello All:

I am now doing the sentence-specific phrase search by adding the text sentence
by sentence to the same field, as suggested below. Everything works fine, but
when I display my results, the highlighter is not able to find the search
phrase.

The following is my code:

SentenceScanner sentenceScanner = new
SentenceScanner(doc.getText().replaceAll("\\s+", " "));
ArrayList<String> sentencesList = sentenceScanner.getAllSentences();
for (String sentence : sentencesList){
	addFieldToDocument(document, IFIELD_TEXT, sentence, true, true);
}

private void addFieldToDocument(Document document, String fieldName,
	String fieldValue, Boolean store, Boolean tokenize) {
	String validFieldValue = Utility.validateString(fieldValue);
	Field field = new Field(fieldName, validFieldValue,
			(store) ? Field.Store.YES : Field.Store.NO,
			(tokenize) ? Field.Index.ANALYZED : Field.Index.NOT_ANALYZED);
	document.add(field);
}

My custom standard analyzer:

public class MyStandardAnalyzer extends StandardAnalyzer implements
IndexFields {
	public MyStandardAnalyzer(Version matchVersion) {
		super(matchVersion);
	}
	public int getPositionIncrementGap(String fieldName) {
		int incrementGap = super.getPositionIncrementGap(fieldName);
		if (fieldName.equals(IFIELD_TEXT)) {
			incrementGap += 10;
		}
		return incrementGap;
	}
}

My highlighter code:

//analyzer instantiated as 'MyStandardAnalyzer' in the constructor
public String highlight(String text) {
   String highlightedText = "";
   TokenStream tokenStream = analyzer.tokenStream(IndexFields.IFIELD_TEXT,
new StringReader(text));
   highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);
   try {
	return highlighter.getBestFragments(tokenStream, text,
			maxFragments, delimiter);
   } catch (Exception e) {
	e.printStackTrace();
   } 
	return highlightedText;
}

Everything works fine except for the highlighter, which no longer returns
text snippets when I retrieve the results. Before this sentence-specific
implementation, it worked well.

Any hints or help on this would be highly appreciated.
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Problem-searching-in-the-same-sentence-tp1501269p1605904.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


Re: Problem searching in the same sentence

Posted by Sirish Vadala <si...@gmail.com>.

Hello All:

I am now doing the sentence-specific phrase search by adding the text sentence
by sentence to the same field, as suggested below. Everything works fine, but
when I display my results, the highlighter is not able to find the search
phrase.

The following is my code:

SentenceScanner sentenceScanner = new
SentenceScanner(doc.getText().replaceAll("\\s+", " "));
ArrayList<String> sentencesList = sentenceScanner.getAllSentences();
for (String sentence : sentencesList){
	addFieldToDocument(document, IFIELD_TEXT, sentence, true, true);
}

private void addFieldToDocument(Document document, String fieldName,
	String fieldValue, Boolean store, Boolean tokenize) {
	String validFieldValue = Utility.validateString(fieldValue);
	Field field = new Field(fieldName, validFieldValue,
			(store) ? Field.Store.YES : Field.Store.NO,
			(tokenize) ? Field.Index.ANALYZED : Field.Index.NOT_ANALYZED);
	document.add(field);
}

My custom standard analyzer:

public class MyStandardAnalyzer extends StandardAnalyzer implements
IndexFields {
	public MyStandardAnalyzer(Version matchVersion) {
		super(matchVersion);
	}
	public int getPositionIncrementGap(String fieldName) {
		int incrementGap = super.getPositionIncrementGap(fieldName);
		if (fieldName.equals(IFIELD_TEXT)) {
			incrementGap += 10;
		}
		return incrementGap;
	}
}

My highlighter code:

//analyzer instantiated as 'MyStandardAnalyzer' in the constructor
public String highlight(String text) {
   String highlightedText = "";
   TokenStream tokenStream = analyzer.tokenStream(IndexFields.IFIELD_TEXT,
new StringReader(text));
   highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);
   try {
	return highlighter.getBestFragments(tokenStream, text,
			maxFragments, delimiter);
   } catch (Exception e) {
	e.printStackTrace();
   } 
	return highlightedText;
}

Everything works fine except for the highlighter, which no longer returns
text snippets when I retrieve the results. Before this sentence-specific
implementation, it worked well.

Any hints or help on this would be highly appreciated.
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Problem-searching-in-the-same-sentence-tp1501269p1605904.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


Re: Problem searching in the same sentence

Posted by Simon Willnauer <si...@googlemail.com>.

Hi Sirish,

see my comments inline...

On Thu, Sep 16, 2010 at 7:39 PM, Sirish Vadala <si...@gmail.com> wrote:
>
> Hello All:
>
> Can any one suggest me the best way to allow me to perform a sentence
> specific phrase search?
>
> Eg: Let the indexed text be:
>
> If you are posting a question, please try search first. Your question may
> have already been answered. Don't post repeatedly. Wait for a few days.
> People will read your post by email.
>
> Now if I search for the phrase 'post repeatedly Wait for a few', still I am
> able to retrieve the document even though they are in different sentences.
>
> Currently I am using StandardAnalyzer and this is how I am generating lucene
> documents:
>
> Field field = new Field(fieldName, validFieldValue, Field.Store.YES,
> Field.Index.ANALYZED);
> document.add(field);
>
> Is there a way to keep track of different sentences while indexing the
> content.
What you essentially need to do is tell Lucene where the sentences
end while you are indexing your text. PhraseQuery uses positional
information to find phrase matches, and that positional information is
created by the TokenStream you get from the analyzer. The TokenStream
sets the PositionIncrementAttribute for each term to the delta between
the current term and the previous term. If you index "hello world. Here
am I", the posIncrement between "world" and "here" will be 1, since
StandardTokenizer throws away the punctuation. So you need to introduce
a larger posIncrement at sentence borders so that PhraseQuery does not
consider "world" and "here" to be a phrase.

You can either write your own Tokenizer, which is the more advanced
option, or simply add multiple fields with the same name to your
document, one for each sentence. If you do that, you need to set the
position increment gap, which is done by subclassing Analyzer and
overriding Analyzer#getPositionIncrementGap() to return 100 or something
like that.
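
To make the effect of the gap concrete, here is a rough plain-Java sketch
that models token positions without using Lucene at all. The class
PhraseGapDemo and its methods index and matchesPhrase are invented for this
illustration. It keeps every word (StandardAnalyzer would also drop stop
words) and only checks exact phrases, so treat it as a toy model of the idea
and not as how PhraseQuery is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of token positions; this is NOT Lucene code.
final class PhraseGapDemo {

    /** A term together with the position assigned to it at index time. */
    static final class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /**
     * Assigns positions to the words of each sentence. Every word advances
     * the position by 1; the first word of each later sentence advances it
     * by 1 + gap, mimicking a position increment gap between field instances.
     */
    static List<Token> index(List<String> sentences, int gap) {
        List<Token> tokens = new ArrayList<>();
        int position = -1;
        for (int s = 0; s < sentences.size(); s++) {
            boolean firstWord = true;
            for (String word : sentences.get(s).toLowerCase().split("[^a-z0-9']+")) {
                if (word.isEmpty()) {
                    continue; // split() can yield a leading empty string
                }
                position += (firstWord && s > 0) ? 1 + gap : 1;
                firstWord = false;
                tokens.add(new Token(word, position));
            }
        }
        return tokens;
    }

    /** Exact phrase check: terms must match and sit at consecutive positions. */
    static boolean matchesPhrase(List<Token> tokens, String phrase) {
        String[] words = phrase.toLowerCase().split("\\s+");
        for (int start = 0; start + words.length <= tokens.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < words.length && ok; i++) {
                Token t = tokens.get(start + i);
                ok = t.term.equals(words[i])
                        && t.position == tokens.get(start).position + i;
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "Don't post repeatedly.", "Wait for a few days.");
        String crossSentence = "post repeatedly Wait for a few";
        // Without a gap the phrase matches across the sentence border.
        System.out.println(matchesPhrase(index(sentences, 0), crossSentence));
        // With a gap of 10 the positions are no longer consecutive.
        System.out.println(matchesPhrase(index(sentences, 10), crossSentence));
        // A phrase inside a single sentence is unaffected by the gap.
        System.out.println(matchesPhrase(index(sentences, 10), "wait for a few"));
    }
}
```

In this model any gap of at least 1 blocks an exact phrase across a sentence
border. A larger value such as 100 also leaves room for sloppy phrase
queries, which tolerate some distance between terms.
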
The document you need to build would then look like the following pseudocode:

doc = Document()
doc.addField(Field("foo", "hello world.", ...))
doc.addField(Field("foo", "here am I", ...))
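
The per-sentence strings for those fields have to come from some sentence
splitter. As one possible sketch, assuming the JDK's java.text.BreakIterator
is accurate enough for the text being indexed, the splitting could look like
this. The class SentenceSplitDemo and the method splitSentences are names
invented here, and BreakIterator can mis-split around abbreviations, so this
is only a starting point.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: splits text into sentences with the JDK BreakIterator.
final class SentenceSplitDemo {

    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk the boundaries; each [start, end) range is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "If you are posting a question, please try search first. "
                + "Your question may have already been answered. "
                + "Don't post repeatedly. Wait for a few days. "
                + "People will read your post by email.";
        // Each printed line would become the value of one field instance.
        for (String sentence : splitSentences(text)) {
            System.out.println(sentence);
        }
    }
}
```
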

I hope that helps

simon




There are several possibilities to do what you want and maybe the
easiest would be to split you
> Any hint would be appreciated.
> Thanks.