Posted to java-user@lucene.apache.org by VIGNESH S <vi...@gmail.com> on 2013/10/03 16:07:43 UTC

Problem with MultiPhrase Query in Lucene 4.3

Hi,

I am trying to run a MultiPhraseQuery in Lucene 4.3. It works perfectly
in all scenarios except the one below.
When I search for a phrase that is preceded by punctuation in the indexed
text, it does not match.

TextContent:  Dremel is a scalable, interactive ad-hoc query system for
analysis
of read-only nested data. By combining multi-level execution
trees and columnar data layout, it is capable of running aggregation

Search phrase :  interactive adhoc

The above search fails because "interactive adhoc" is preceded by ","
in the original text.


I index like this (sample code below). I use the WhitespaceAnalyzer.

// mWriter is the IndexWriter (its setup is not shown in this post).
Document doc = new Document();

String contents = "Dremel is a scalable, interactive ad-hoc query system for analysis "
    + "of read-only nested data. By combining multi-level execution "
    + "trees and columnar data layout, it is capable of running aggregation";

FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
Field field = new Field("content", "", offsetsType);
doc.add(field);
field.setStringValue(contents);

mWriter.addDocument(doc);
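
To see roughly which terms end up in the index, here is a minimal sketch that splits the sample text on whitespace. It uses plain String.split as a stand-in for whitespace-only tokenization; this is an assumption for illustration, not Lucene code, and the class and method names are hypothetical. The point is that punctuation stays attached to its token, so the indexed term would be "scalable," rather than "scalable".

```java
import java.util.Arrays;
import java.util.List;

class WhitespaceTokensSketch {

    // Stand-in for whitespace tokenization: split on runs of whitespace only.
    // Punctuation is not stripped, so it stays glued to the neighbouring word.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String contents = "Dremel is a scalable, interactive ad-hoc query system";
        for (String token : tokenize(contents)) {
            System.out.println(token);
        }
    }
}
```

With this kind of tokenization, "ad-hoc" is also kept as a single term, hyphen included, so a search word has to agree with it character for character (or be a prefix of it).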

In the search I build a MultiPhraseQuery object and add the tokens of
the search phrase.

Before adding the tokens, I validate them like this:

LinkedList<Term> termsWithPrefix = new LinkedList<Term>();
trm.seekCeil(new BytesRef(word));
do {
    String s = trm.term().utf8ToString();
    if (s.startsWith(word)) {
        termsWithPrefix.add(new Term("content", s));
    } else {
        break;
    }
} while (trm.next() != null);
mpquery.add(termsWithPrefix.toArray(new Term[0]));
} // closes an enclosing block that is not shown in this post

It works in all scenarios except where the search phrase is preceded by
punctuation.

In that case, trm.seekCeil(new BytesRef(word)) positions the enum on a
different word, which is what causes the problem.
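
The seek-then-scan pattern can be tried out without Lucene. The sketch below is an illustration under stated assumptions: a java.util.TreeSet stands in for the sorted term dictionary, ceiling() plays the role of seekCeil, and the class and method names are hypothetical. TreeSet orders strings by UTF-16 code units, which agrees with byte order for the ASCII text used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class SeekCeilSketch {

    // Collect all terms starting with prefix: jump to the smallest term >= prefix,
    // then walk forward while the current term still starts with the prefix.
    static List<String> termsWithPrefix(TreeSet<String> dictionary, String prefix) {
        List<String> matches = new ArrayList<>();
        // ceiling() returns null when no term >= prefix exists (the "end" case).
        String term = dictionary.ceiling(prefix);
        while (term != null && term.startsWith(prefix)) {
            matches.add(term);
            term = dictionary.higher(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Whitespace-split terms; punctuation stays attached ("scalable,").
        TreeSet<String> dictionary = new TreeSet<>(Arrays.asList(
            "Dremel is a scalable, interactive ad-hoc query system".split("\\s+")));

        System.out.println("ceiling(adhoc) = " + dictionary.ceiling("adhoc"));
        System.out.println("adhoc    -> " + termsWithPrefix(dictionary, "adhoc"));
        System.out.println("ad-hoc   -> " + termsWithPrefix(dictionary, "ad-hoc"));
        System.out.println("scalable -> " + termsWithPrefix(dictionary, "scalable"));
    }
}
```

In this sketch, seeking "adhoc" lands on "interactive", because no term starts with "adhoc" and "ad-hoc" sorts before it. Landing on a different word is therefore the expected result of a ceiling seek when the prefix is absent. A leading comma in the text does not get in the way either: "scalable" still finds "scalable," as a prefix match.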

Please kindly help..


-- 
Thanks and Regards
Vignesh Srinivasan

Re: Problem with MultiPhrase Query in Lucene 4.3

Posted by VIGNESH S <vi...@gmail.com>.

Thanks for your Reply Ian.

I will check and let you know.


-- 
Thanks and Regards
Vignesh Srinivasan
9739135640

Re: Problem with MultiPhrase Query in Lucene 4.3

Posted by Ian Lea <ia...@gmail.com>.

Below is a little self-contained test program.  You may recognise some
of the code.

Here's the output from a couple of runs using Lucene 4.4.0.

$ java ian.G1 "Dremel is a scalable, interactive ad-hoc query system"
"interactive ad-hoc"
term=interactive
term=ad-hoc
+content:"interactive" +content:"ad-hoc": totalHits=1


$ java ian.G1 "Dremel is a scalable, interactive ad-hoc query system"
"interactive adhoc"
term=interactive
+content:"interactive": totalHits=1

All looks OK to me.  Maybe you can make it fail, or use it to help fix
your problem.

--
Ian.

package ian;

import java.util.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.core.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.*;

public class G1 {

    void test(String contents, String words) throws Exception {
        // Index a single document with WhitespaceAnalyzer.
        RAMDirectory dir = new RAMDirectory();
        Analyzer anl = new WhitespaceAnalyzer(Version.LUCENE_44);
        IndexWriterConfig iwcfg = new IndexWriterConfig(Version.LUCENE_44, anl);
        IndexWriter iw = new IndexWriter(dir, iwcfg);

        FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
        Field field = new Field("content", "", offsetsType);
        Document doc = new Document();
        doc.add(field);
        field.setStringValue(contents);
        iw.addDocument(doc);
        iw.close();

        IndexReader rdr = DirectoryReader.open(dir);
        Fields fields = MultiFields.getFields(rdr);
        Terms terms = fields.terms("content");

        BooleanQuery bq = new BooleanQuery();
        for (String w : words.split(" ")) {
            // Collect every indexed term that starts with w.
            LinkedList<Term> termsWithPrefix = new LinkedList<Term>();
            TermsEnum trm = terms.iterator(null);
            // seekCeil returns END when no term >= w exists; the enum is
            // then unpositioned, so term() must not be called.
            if (trm.seekCeil(new BytesRef(w)) != TermsEnum.SeekStatus.END) {
                do {
                    String s = trm.term().utf8ToString();
                    if (!s.startsWith(w)) {
                        break;
                    }
                    termsWithPrefix.add(new Term("content", s));
                    System.out.printf("term=%s\n", s);
                } while (trm.next() != null);
            }

            // Words with no matching term add no clause to the query.
            if (!termsWithPrefix.isEmpty()) {
                MultiPhraseQuery mpquery = new MultiPhraseQuery();
                mpquery.add(termsWithPrefix.toArray(new Term[0]));
                bq.add(mpquery, BooleanClause.Occur.MUST);
            }
        }

        IndexSearcher searcher = new IndexSearcher(rdr);
        TopDocs results = searcher.search(bq, 10);
        System.out.printf("%s: totalHits=%s\n", bq, results.totalHits);
        rdr.close();
    }

    public static void main(String[] args) throws Exception {
        new G1().test(args[0], args[1]);
    }
}


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Problem with MultiPhrase Query in Lucene 4.3

Posted by VIGNESH S <vi...@gmail.com>.

Hi,

Sorry, that's my typo.

It's not failing because of that.


-- 
Thanks and Regards
Vignesh Srinivasan
9739135640

Re: Problem with MultiPhrase Query in Lucene 4.3

Posted by Ian Lea <ia...@gmail.com>.

Are you sure it's not failing because "adhoc" != "ad-hoc"?


--
Ian.

