Posted to java-user@lucene.apache.org by Milind <mi...@gmail.com> on 2014/08/26 18:24:22 UTC

Why does this search fail?

I have a field with the value C0001.DevNm001.  If I search for

    C0001.DevNm001 --> Get Hit
    DevNm00*       --> Get Hit
    C0001.DevNm00* --> Get No Hit

The field gets tokenized on the period since it's surrounded by a letter
and a number.  The query gets evaluated as a prefix query.  I'd have
thought that this should have found the document.  Any clues on why this
doesn't work?

The full code is below.

        Directory theDirectory = new RAMDirectory();
        Version theVersion = Version.LUCENE_47;
        Analyzer theAnalyzer = new StandardAnalyzer(theVersion);
        IndexWriterConfig theConfig =
            new IndexWriterConfig(theVersion, theAnalyzer);
        IndexWriter theWriter = new IndexWriter(theDirectory, theConfig);

        String theFieldName = "Name";
        String theFieldValue = "C0001.DevNm001";
        Document theDocument = new Document();
        theDocument.add(
            new TextField(theFieldName, theFieldValue, Field.Store.YES));
        theWriter.addDocument(theDocument);
        theWriter.close();

        String theQueryStr = theFieldName + ":C0001.DevNm00*";
        Query theQuery =
            new QueryParser(theVersion, theFieldName, theAnalyzer)
                .parse(theQueryStr);
        System.out.println(theQuery.getClass() + ", " + theQuery);
        IndexReader theIndexReader = DirectoryReader.open(theDirectory);
        IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
        TopScoreDocCollector collector =
            TopScoreDocCollector.create(10, true);
        theSearcher.search(theQuery, collector);
        ScoreDoc[] theHits = collector.topDocs().scoreDocs;
        System.out.println("Hits found: " + theHits.length);

Output:

class org.apache.lucene.search.PrefixQuery, Name:c0001.devnm00*
Hits found: 0


-- 
Regards
Milind

Re: Why does this search fail?

Posted by Milind <mi...@gmail.com>.

I just wrote this small test case.  Do you mean adding some more field values
and searching for them?  I added a whole bunch of strings with the same
pattern, C000x.DevNm00y, changing the x and y values to different numbers.

I changed the code to add some similar and different patterns, and this is
what I get:

    Directory theDirectory = new RAMDirectory();
    Version theVersion = Version.LUCENE_47;
    Analyzer theAnalyzer = new StandardAnalyzer(theVersion);
    IndexWriterConfig theConfig =
        new IndexWriterConfig(theVersion, theAnalyzer);
    IndexWriter theWriter =
        new IndexWriter(theDirectory, theConfig);

    String theFieldName = "Name";
    String[] theFieldValues = new String[]
        {"C0001.DevNm001", "C0001.DevNm002",
         "John-Appleseed", "JohnAppleseed"};
    Document theDocument = new Document();
    for (int i = 0; i < theFieldValues.length; i++) {
        theDocument.add(new TextField(theFieldName,
                                      theFieldValues[i],
                                      Field.Store.YES));
    }
    theWriter.addDocument(theDocument);
    theWriter.close();

    String[] theQueryStr = new String[]
        {"C0001.DevNm00*", "John-Applesee*", "JohnApplesee*"};
    QueryParser theQueryParser =
        new QueryParser(theVersion, theFieldName, theAnalyzer);
    IndexReader theIndexReader = DirectoryReader.open(theDirectory);
    IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
    for (int i = 0; i < theQueryStr.length; i++) {
        Query theQuery =
            theQueryParser.parse(theFieldName + ":" + theQueryStr[i]);
        System.out.println(theQuery.getClass() + ", " + theQuery);
        TopScoreDocCollector theCollector =
            TopScoreDocCollector.create(10, true);
        theSearcher.search(theQuery, theCollector);
        ScoreDoc[] theHits = theCollector.topDocs().scoreDocs;
        System.out.println("Hits found: " + theHits.length);
    }


Output:

class org.apache.lucene.search.PrefixQuery, Name:c0001.devnm00*
Hits found: 0
class org.apache.lucene.search.PrefixQuery, Name:john-applesee*
Hits found: 0
class org.apache.lucene.search.PrefixQuery, Name:johnapplesee*
Hits found: 1
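
The pattern in this output matches the explanation given elsewhere in this
thread: a prefix query is compared against individual indexed terms, and the
wildcard text itself is lowercased but not tokenized.  The following is only a
rough sketch of that behavior using the plain Java standard library, not
Lucene.  The class and method names (PrefixOnTermsDemo, indexTerms,
prefixHits) are made up for illustration, and splitting on every
non-alphanumeric character is an assumption that merely approximates what
StandardTokenizer does for these particular values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class PrefixOnTermsDemo {

    // Crude stand-in for index-time analysis (assumption, not Lucene code):
    // lowercase each value, then split on runs of non-alphanumeric characters.
    static List<String> indexTerms(String... values) {
        List<String> terms = new ArrayList<>();
        for (String value : values) {
            for (String term : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!term.isEmpty()) {
                    terms.add(term);
                }
            }
        }
        return terms;
    }

    // Models a prefix query: the query text is lowercased but never split,
    // and it hits only if a single indexed term starts with it.
    static boolean prefixHits(List<String> terms, String query) {
        String prefix = query.toLowerCase(Locale.ROOT).replaceAll("\\*$", "");
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> terms = indexTerms(
            "C0001.DevNm001", "C0001.DevNm002", "John-Appleseed", "JohnAppleseed");
        System.out.println("Indexed terms: " + terms);
        String[] queries = {"C0001.DevNm00*", "John-Applesee*", "JohnApplesee*", "DevNm00*"};
        for (String query : queries) {
            System.out.println(query + " -> " + (prefixHits(terms, query) ? "hit" : "no hit"));
        }
    }
}
```

In this model no single term begins with "c0001.devnm00" or "john-applesee",
because the period and the hyphen were dropped at index time, while
"johnapplesee" and "devnm00" are each the start of one indexed term.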



On Tue, Aug 26, 2014 at 12:30 PM, Ralf Heyde <ra...@gmx.de> wrote:

> Can you post the result of the query parser for the other queries too?


-- 
Regards
Milind

Re: Why does this search fail?

Posted by Ralf Heyde <ra...@gmx.de>.

Can you post the result of the query parser for the other queries too?


Re: Why does this search fail?

Posted by Milind <mi...@gmail.com>.

Thanks Jack.  I'll try this out.  I'll have to see if that creates other
side effects :-(.  Tokenization is already causing a great deal of
confusion.  I want to make it as intuitive as possible.



On Wed, Aug 27, 2014 at 10:45 AM, Jack Krupansky <ja...@basetechnology.com>
wrote:

> Yes, the white space tokenizer will preserve all punctuation, but... then
> the query for DevNm00* will fail. A "smarter" set of filters is probably
> needed here... start with white space tokenization, keep that overall
> token, then trim external punctuation and keep that token as well, and then
> use word delimiter filter to split out the embedded words, like DevNm00,
> and add them.
>
> The word delimiter filter will do most of that, but not the part of
> trimming out external punctuation. But depending on your use case, it may
> be close enough.
>
> See:
> http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
>
> -- Jack Krupansky


-- 
Regards
Milind

Re: Why does this search fail?

Posted by Jack Krupansky <ja...@basetechnology.com>.

Yes, the white space tokenizer will preserve all punctuation, but... then 
the query for DevNm00* will fail. A "smarter" set of filters is probably 
needed here... start with white space tokenization, keep that overall token, 
then trim external punctuation and keep that token as well, and then use 
word delimiter filter to split out the embedded words, like DevNm00, and add 
them.

The word delimiter filter will do most of that, but not the part of trimming 
out external punctuation. But depending on your use case, it may be close 
enough.

See:
http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html

-- Jack Krupansky
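
The layered scheme described above (keep the whole whitespace token, also keep
it with external punctuation trimmed, and add the embedded words) can be
sketched roughly as follows.  This is a plain Java standard library model
under stated assumptions, not WordDelimiterFilter itself: the names
LayeredTermsDemo, layeredTerms and prefixHits are hypothetical, only ASCII
letters and digits count as word characters, and splitting on case changes or
letter/digit boundaries is left out.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class LayeredTermsDemo {

    // For each whitespace-delimited token emit, lowercased:
    //   1. the whole token, punctuation included,
    //   2. the token with leading and trailing punctuation trimmed,
    //   3. the embedded alphanumeric parts of the trimmed token.
    // Assumption: anything outside a-z and 0-9 is treated as punctuation.
    static Set<String> layeredTerms(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            terms.add(token);
            String trimmed = token.replaceAll("^[^a-z0-9]+|[^a-z0-9]+$", "");
            if (!trimmed.isEmpty()) {
                terms.add(trimmed);
            }
            for (String part : trimmed.split("[^a-z0-9]+")) {
                if (!part.isEmpty()) {
                    terms.add(part);
                }
            }
        }
        return terms;
    }

    // Models a prefix query against single terms; the query is not tokenized.
    static boolean prefixHits(Set<String> terms, String query) {
        String prefix = query.toLowerCase(Locale.ROOT).replaceAll("\\*$", "");
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Whitespace tokenization alone: one term that still contains the dot.
        Set<String> whitespaceOnly = Collections.singleton("c0001.devnm001");
        System.out.println("whitespace only, DevNm00*: "
            + prefixHits(whitespaceOnly, "DevNm00*"));

        // Layered terms for a value wrapped in external punctuation.
        Set<String> layered = layeredTerms("(C0001.DevNm001),");
        System.out.println("layered terms: " + layered);
        System.out.println("layered, C0001.DevNm00*: "
            + prefixHits(layered, "C0001.DevNm00*"));
        System.out.println("layered, DevNm00*: " + prefixHits(layered, "DevNm00*"));
    }
}
```

With only the whole token indexed, DevNm00* finds nothing, which is the
failure mentioned above.  With the layered terms, both wildcard queries find a
matching term, and the trimmed copy is what lets C0001.DevNm00* match even
though the raw token starts with a parenthesis.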


Re: Why does this search fail?

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

Tokenization is tricky.  You might consider using a whitespace tokenizer
followed by a word delimiter filter (instead of the standard tokenizer); it
does a kind of secondary tokenization pass that can preserve the
original token in addition to its component parts. There are some weird
side effects to do with term frequencies and phrase-like queries, but I
think it would make all these wildcard queries work.

-Mike

On 08/27/2014 09:54 AM, Milind wrote:
> I see.  This is going to be extremely difficult to explain to end users.
> It doesn't work as they would expect.  Some of the tokenizing rules are
> already somewhat confusing.  Their expectation is that it should work the
> way their searches work in Google.
>
> It's difficult enough to recognize that because the period is surrounded by
> a digit and a letter (as opposed to two digits or two letters), it gets
> tokenized.  So I'd have expected that C0001.DevNm00* would effectively
> become a search for C0001 OR DevNm00*.  But now, because of the presence of
> the wildcard, it's considered as 1 term and the period is not a token
> separator.  That's actually good, but now the fact that it's still
> considered as 2 terms for wildcard searches makes it very unintuitive.  I
> don't suppose that I can do anything about making wildcard search use
> multiple terms if joined together with a token separator.  But is there any
> way that I can force it to go through an analyzer prior to doing the search?
>
>
>
>
> On Tue, Aug 26, 2014 at 4:21 PM, Jack Krupansky <ja...@basetechnology.com>
> wrote:
>
>> Sorry, but you can only use a wildcard on a single term. "C0001.DevNm001"
>> gets indexed as two terms, "c0001" and "devnm001", so your wildcard won't
>> match any term (at least in this case.)
>>
>> Also, if your query term includes a wildcard, it will not be fully
>> analyzed. Some filters such as lower case are defined as "multi-term", so
>> they will be performed, but the standard tokenizer is not being called, so
>> the dot remains and this whole term is treated as one term, unlike the
>> index analysis.
>>
>> -- Jack Krupansky

Re: Why does this search fail?

Posted by Milind <mi...@gmail.com>.

Thanks for the Google link.  I wasn't aware of it.  Most of it is very
intuitive.  And most importantly consistent.


On Wed, Aug 27, 2014 at 11:07 AM, Jack Krupansky <ja...@basetechnology.com>
wrote:

> It's not documented, but Google does seem to support trailing wildcard,
> but only if the prefix has at least six characters. For shorter prefixes,
> it seems to just drop the wildcard.
>
> Google also uses "*" in quoted phrases to mean a placeholder for any
> single term. That's documented.
>
> See:
> https://support.google.com/websearch/answer/136861?hl=en
>
> It also seems to support "**" in a quoted phrase to mean one or more
> arbitrary terms. This isn't documented, but seems to work.
>
>
> -- Jack Krupansky


-- 
Regards
Milind

Re: Why does this search fail?

Posted by Jack Krupansky <ja...@basetechnology.com>.

It's not documented, but Google does seem to support trailing wildcard, but 
only if the prefix has at least six characters. For shorter prefixes, it 
seems to just drop the wildcard.

Google also uses "*" in quoted phrases to mean a placeholder for any single 
term. That's documented.

See:
https://support.google.com/websearch/answer/136861?hl=en

It also seems to support "**" in a quoted phrase to mean one or more 
arbitrary terms. This isn't documented, but seems to work.

-- Jack Krupansky

-----Original Message----- 
From: Milind
Sent: Wednesday, August 27, 2014 10:51 AM
To: java-user@lucene.apache.org
Subject: Re: Why does this search fail?

Yes.  If you search Google for alphare and for alphare*, you get two
different results.  Sorry for the contrived example.  I just tried searching
for alpharetta and went backwards deleting characters.



Re: Why does this search fail?

Posted by Milind <mi...@gmail.com>.

Yes.  If you search Google for alphare and for alphare*, you get two
different results.  Sorry for the contrived example.  I just tried searching
for alpharetta and went backwards deleting characters.


On Wed, Aug 27, 2014 at 10:01 AM, Benson Margulies <be...@basistech.com>
wrote:

> Does google actually support "*"?



-- 
Regards
Milind

Re: Why does this search fail?

Posted by Benson Margulies <be...@basistech.com>.

Does google actually support "*"?



On Wed, Aug 27, 2014 at 9:54 AM, Milind <mi...@gmail.com> wrote:

> I see.  This is going to be extremely difficult to explain to end users.
> It doesn't work as they would expect.  Some of the tokenizing rules are
> already somewhat confusing.  Their expectation is that it should work the
> way their searches work in Google.
>
> It's difficult enough to recognize that the value gets tokenized because
> the period sits between a digit and a letter (as opposed to two digits or
> two letters).  So I'd have expected that C0001.DevNm00* would effectively
> become a search for C0001 OR DevNm00*.  But because of the wildcard, the
> query is treated as one term and the period does not split it.  That's
> actually good, but the fact that the indexed value is still split into two
> terms makes wildcard searches very unintuitive.  I don't suppose that I can
> do anything to make a wildcard search use multiple terms when they are
> joined by a separator character.  But is there any way that I can force the
> query to go through an analyzer prior to doing the search?

Re: Why does this search fail?

Posted by Milind <mi...@gmail.com>.

I see.  This is going to be extremely difficult to explain to end users.
It doesn't work as they would expect.  Some of the tokenizing rules are
already somewhat confusing.  Their expectation is that it should work the
way their searches work in Google.

It's difficult enough to recognize that the value gets tokenized because
the period sits between a digit and a letter (as opposed to two digits or
two letters).  So I'd have expected that C0001.DevNm00* would effectively
become a search for C0001 OR DevNm00*.  But because of the wildcard, the
query is treated as one term and the period does not split it.  That's
actually good, but the fact that the indexed value is still split into two
terms makes wildcard searches very unintuitive.  I don't suppose that I can
do anything to make a wildcard search use multiple terms when they are
joined by a separator character.  But is there any way that I can force the
query to go through an analyzer prior to doing the search?
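
To make the question concrete, here is a rough sketch in plain Java (no
Lucene classes) of what "going through an analyzer first" could mean: the
wildcard text is split and lowercased the same way as the indexed value, the
leading pieces are required as whole terms, and only the last piece is
treated as a prefix.  The class name AnalyzedPrefixSketch, the split-on-period
rule, and the choice to AND the pieces are all assumptions made for
illustration; this is not how QueryParser behaves today.  In Lucene itself
something similar might be done by overriding the parser's getPrefixQuery
hook, but that is not shown here and has not been confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy model only: a hypothetical "analyze the wildcard query first" strategy. */
final class AnalyzedPrefixSketch {

    /** Stand-in analyzer (assumption): split on periods, then lowercase. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.split("\\.")) {
            if (!part.isEmpty()) {
                terms.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    /**
     * Every analyzed piece must be found in the document's terms.
     * If the query ends in '*', the last piece only has to be a prefix.
     * Term positions are ignored in this sketch.
     */
    static boolean matches(List<String> docTerms, String query) {
        boolean isPrefix = query.endsWith("*");
        String body = isPrefix ? query.substring(0, query.length() - 1) : query;
        List<String> parts = analyze(body);
        if (parts.isEmpty()) {
            return false;
        }
        for (int i = 0; i < parts.size(); i++) {
            boolean prefixPiece = isPrefix && (i == parts.size() - 1);
            String part = parts.get(i);
            boolean found = false;
            for (String term : docTerms) {
                if (prefixPiece ? term.startsWith(part) : term.equals(part)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> docTerms = analyze("C0001.DevNm001");
        System.out.println(matches(docTerms, "C0001.DevNm00*")); // true
        System.out.println(matches(docTerms, "C0002.DevNm00*")); // false
    }
}
```

With this approach C0001.DevNm00* would find the document, which is the
behaviour end users seem to expect.  Using OR instead of AND between the
pieces would be a one-line change in the loop.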




On Tue, Aug 26, 2014 at 4:21 PM, Jack Krupansky <ja...@basetechnology.com>
wrote:

> Sorry, but you can only use a wildcard on a single term. "C0001.DevNm001"
> gets indexed as two terms, "c0001" and "devnm001", so your wildcard won't
> match any term (at least in this case.)
>
> Also, if your query term includes a wildcard, it will not be fully
> analyzed. Some filters such as lower case are defined as "multi-term", so
> they will be performed, but the standard tokenizer is not being called, so
> the dot remains and this whole term is treated as one term, unlike the
> index analysis.
>
> -- Jack Krupansky


-- 
Regards
Milind

Re: Why does this search fail?

Posted by Jack Krupansky <ja...@basetechnology.com>.

Sorry, but you can only use a wildcard on a single term. "C0001.DevNm001" 
gets indexed as two terms, "c0001" and "devnm001", so your wildcard won't 
match any term (at least in this case.)

Also, if your query term includes a wildcard, it will not be fully analyzed. 
Some filters such as lower case are defined as "multi-term", so they will be 
performed, but the standard tokenizer is not being called, so the dot 
remains and this whole term is treated as one term, unlike the index 
analysis.
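
The mismatch can be reproduced with a small toy model in plain Java.  This is
only a sketch and not Lucene code: the class PrefixMismatchSketch and its
methods are made up for illustration, and splitting on every period is a
simplification of what the standard tokenizer really does.  Index-time
analysis splits and lowercases the value, while the wildcard query is only
lowercased and then compared against each indexed term on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy model of the index/query mismatch; not Lucene code. */
final class PrefixMismatchSketch {

    /** Stand-in for index-time analysis (assumption): split on periods, lowercase. */
    static List<String> indexTerms(String fieldValue) {
        List<String> terms = new ArrayList<>();
        for (String part : fieldValue.split("\\.")) {
            if (!part.isEmpty()) {
                terms.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    /** Stand-in for query-time handling of a wildcard term: lowercase only, no splitting. */
    static String prefixOf(String wildcardQuery) {
        String lowered = wildcardQuery.toLowerCase(Locale.ROOT);
        return lowered.endsWith("*")
                ? lowered.substring(0, lowered.length() - 1)
                : lowered;
    }

    /** A prefix query matches only if a single indexed term starts with the prefix. */
    static boolean prefixMatches(List<String> terms, String wildcardQuery) {
        String prefix = prefixOf(wildcardQuery);
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> terms = indexTerms("C0001.DevNm001");
        System.out.println(terms);                                  // [c0001, devnm001]
        System.out.println(prefixMatches(terms, "DevNm00*"));       // true
        System.out.println(prefixMatches(terms, "C0001.DevNm00*")); // false
    }
}
```

The last call fails because no single indexed term starts with
"c0001.devnm00": the period survives in the query but never reaches the index.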

-- Jack Krupansky

-----Original Message----- 
From: Milind
Sent: Tuesday, August 26, 2014 12:24 PM
To: java-user@lucene.apache.org
Subject: Why does this search fail?

I have a field with the value C0001.DevNm001.  If I search for

    C0001.DevNm001 --> Get Hit
    DevNm00*       --> Get Hit
    C0001.DevNm00*  --> Get No Hit

The field gets tokenized on the period since it's surrounded by a letter
and a number.  The query gets evaluated as a prefix query.  I'd have
thought that this should have found the document.  Any clues on why this
doesn't work?

The full code is below.

        Directory theDirectory = new RAMDirectory();
        Version theVersion = Version.LUCENE_47;
        Analyzer theAnalyzer = new StandardAnalyzer(theVersion);
        IndexWriterConfig theConfig =
                            new IndexWriterConfig(theVersion, theAnalyzer);
        IndexWriter theWriter = new IndexWriter(theDirectory, theConfig);

        String theFieldName = "Name";
        String theFieldValue = "C0001.DevNm001";
        Document theDocument = new Document();
        theDocument.add(new TextField(theFieldName, theFieldValue,
                                      Field.Store.YES));
        theWriter.addDocument(theDocument);
        theWriter.close();

        String theQueryStr = theFieldName + ":C0001.DevNm00*";
        Query theQuery =
            new QueryParser(theVersion, theFieldName,
                            theAnalyzer).parse(theQueryStr);
        System.out.println(theQuery.getClass() + ", " + theQuery);
        IndexReader theIndexReader = DirectoryReader.open(theDirectory);
        IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(10,
                                                                     true);
        theSearcher.search(theQuery, collector);
        ScoreDoc[] theHits = collector.topDocs().scoreDocs;
        System.out.println("Hits found: " + theHits.length);

Output:

class org.apache.lucene.search.PrefixQuery, Name:c0001.devnm00*
Hits found: 0


-- 
Regards
Milind 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org