Posted to java-user@lucene.apache.org by oh...@cox.net on 2009/08/06 22:03:47 UTC

Why does this search succeed with web app, but not Luke?

Hi,

In my indexer app (based on the IndexFiles.java demo), I am adding the "path" field:

    doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.ANALYZED));

Per Luke, the full path (e.g., "c:\....\xxxx.yyy") gets parsed, and one of the terms (again, per Luke) is "xxxx", i.e., the actual file name, but without the extension.

Then, when I search with Luke for "path:xxxx", that succeeds, as expected, and when I search with Luke for "path:xxxx.yyy", that fails, as expected.

But, if I search using the demo web app, for "path:xxxx.yyy", it succeeds.

Since the Luke search for "path:xxxx.yyy" fails, I don't understand why the web app search for "path:xxxx.yyy" would succeed?

Thanks,
Jim




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Why does this search succeed with web app, but not Luke?

Posted by oh...@cox.net.

Ian,

I just re-confirmed that StandardAnalyzer is used in both my indexer app and in the query/search web app.

The actual file paths look like:

C:\lucene-devel\dat\xxxxxxxxxxxxxxxx.dat
or
C:\lucene-devel\data\testdir\\xxxxxxxxxxxxxxxx.dat

For field "path", Luke shows:

lucene
data
c
devel
dat
testdir
xxxxxxxxxxxxxxxxxxxxxxxxxxx
.
.
zzzzzzzzzzzzzzzzzzzzzzzzzzzz


where "xxxxxxxxxxxxxxxx" and "zzzzzzzzzzzzzzzzz" are the left part (to the left of the ".") of filenames.

So, it seems like you're correct, that what you're seeing is the opposite from what I'm seeing :(??

Again, the actual code in my indexer has:

 doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.ANALYZED));

(and again, the indexer uses StandardAnalyzer).

Is that different from what you did in your "little" index test?

Jim







Re: Why does this search succeed with web app, but not Luke?

Posted by Ian Lea <ia...@gmail.com>.

It is a good general assumption that Luke is correct.

Can you confirm that you are using StandardAnalyzer everywhere, for
indexing and searching?  This sort of issue is often caused by using
different analyzers.

What does Luke show as the indexed terms for path?  In a little index
I've just created with StandardAnalyzer and file paths, Luke is showing
xxx.yyy as a term and not xxx, which is the opposite of what you have.

There was a thread yesterday about acronyms which might be relevant.
As might writing a tiny self-contained program that indexes a few
paths and displays the terms that have been indexed and runs a few
searches.


--
Ian.




Re: Why does this search succeed with web app, but not Luke?

Posted by oh...@cox.net.

Hi Matt,

Good catch!  As I just posted, I *just* noticed that (Luke was using KeywordAnalyzer) :)!!!

Once I switched Luke to using Standard Analyzer, the Luke search results matched my web query results.

Thanks!

Jim






Re: Why does this search succeed with web app, but not Luke?

Posted by Matthew Hall <mh...@informatics.jax.org>.

Luke defaults to KeywordAnalyzer when you run a search in it.  You have
to specifically choose StandardAnalyzer.  You are probably already doing
this, but I figure it's worth a check.

Matt
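
The effect described above can be sketched in plain Java. This is only an illustration under stated assumptions, not Lucene code: keywordStyle and splittingStyle are hypothetical stand-ins for a keyword-type analyzer (whole input kept as one term) and a splitting analyzer, and the split rule (break on ".", "\", "/" and ":") is made up for the example rather than taken from StandardAnalyzer's grammar. The indexed terms are the ones Jim reported seeing for the "path" field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class QueryAnalyzerSketch {

    // Keyword-style analysis (assumption): the whole query text is one term.
    static List<String> keywordStyle(String queryText) {
        return Collections.singletonList(queryText);
    }

    // Splitting analysis (hypothetical rule): break on '.', '\', '/' and ':'.
    static List<String> splittingStyle(String queryText) {
        List<String> terms = new ArrayList<>();
        for (String part : queryText.split("[\\\\/:.]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // Necessary condition only: every query term must exist among the
    // indexed terms. Term positions are ignored here.
    static boolean canMatch(Set<String> indexedTerms, List<String> queryTerms) {
        return !queryTerms.isEmpty() && indexedTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        Set<String> indexed = new HashSet<>(Arrays.asList(
                "c", "dir1", "dir2", "dat", "file-1-1", "file-1-2"));
        String[] queries = {"file-1-2", "file-1-2.dat"};
        for (String q : queries) {
            System.out.println(q + " keyword-style: " + canMatch(indexed, keywordStyle(q)));
            System.out.println(q + " splitting:     " + canMatch(indexed, splittingStyle(q)));
        }
    }
}
```

With keyword-style analysis the query "file-1-2.dat" stays a single term that is not among the indexed terms, so it cannot hit, while "file-1-2" can. With the splitting analysis both queries reduce to terms that are indexed.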



-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012




Re: Why does this search succeed with web app, but not Luke?

Posted by oh...@cox.net.

Andrzej,

Hah!  

I tried as you suggested using Luke, and I found at least part of my problem.  Luke was defaulting to KeywordAnalyzer.  

I changed that to StandardAnalyzer, and did queries for:

path:xxxxxxxxxxxxxxxxxxxxx

and

path:xxxxxxxxxxxxxxxxxxxxxx.dat

For the first, the Rewritten was:

path:xxxxxxxxxxxxxxxxxxxxx

and found 1 document.

For the 2nd, the Rewritten was:

path:"xxxxxxxxxxxxxxxxxxxxxx.dat"

and found 1 document.

So, at least now the Luke search results are the same as what I'm seeing in the luceneweb web query.

With the 2nd query, I did "Explain structure" and it shows:

Term 0: field='path' text='xxxxxxxxxxxxxxxxxx'
Term 1: field='path' text='dat'

So, going back to Phil Whelan's explanation in his email yesterday:

====================================
This query will also pass through the same (hopefully) Analyzer and 
will be broken into terms. So the query will actually be for 
"file-1-2" and "dat" where "file-1-2" is followed immediately by 
"dat". 

In indexing the terms position is stored, so 
"C:\dir1\dir2\file-1-1.dat" becomes... 
[0] c 
[1] dir1 
[2] dir2 
[3] file-1-1 
[4] dat 

"file-1-1" is followed by "dat", so there is a match. 
========================================

I think the above explains things.

So, the bottom line was that Luke was using KeywordAnalyzer for its searches.

When I switched Luke to using StandardAnalyzer, the Luke query results matched my web query results.

THANKS!!  I feel better now :)...

Later,
Jim



Re: Why does this search succeed with web app, but not Luke?

Posted by Andrzej Bialecki <ab...@getopt.org>.

ohaya@cox.net wrote:
> Hi Phil,
> 
> Well, kind of... but...
> 
> Then, why, when I do the search in Luke, do I get the results I cited:
> 
> xxxx  ==> succeeds
> 
> xxxx.yyy  ==> fails (no results)
> 
> I guess that I've been assuming that the search in Luke is "correct" and I've been using that to "test my understanding", but maybe that's an invalid assumption?

Luke has some bugs, that's for sure, but not as many as one would think 
;) I recommend the following exercise:

* first, check what the "Rewritten" query looks like, in both cases. 
This could be enlightening, because depending on the choice of default 
field and query analyzer results could differ dramatically.

* then, if a query succeeds in matching one or more documents, open this 
document and view its fields using "Reconstruct & edit", especially the 
"Tokenized" version of the field. At this point any potential mismatch 
in query terms vs. analyzed tokens in the field should become apparent.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Why does this search succeed with web app, but not Luke?

Posted by oh...@cox.net.

Hi Phil,

Well, kind of... but...

Then, why, when I do the search in Luke, do I get the results I cited:

xxxx  ==> succeeds

xxxx.yyy  ==> fails (no results)

I guess that I've been assuming that the search in Luke is "correct" and I've been using that to "test my understanding", but maybe that's an invalid assumption?

Jim





> 



Re: Why does this search succeed with web app, but not Luke?

Posted by Phil Whelan <ph...@gmail.com>.

Hi Jim,

> As I said, based on the terms in Luke, I would have expected a web app query on:
>
> path:file-1-2
>
> to succeed, and a query on:
>
> path:file-1-2.dat
> to fail.
>
> But, instead both of those succeed when I do a web query.

This query will also pass through the same (hopefully) Analyzer and
will be broken into terms. So the query will actually be for
"file-1-2" and "dat" where "file-1-2" is followed immediately by
"dat".

At index time each term's position is stored, so
"C:\dir1\dir2\file-1-1.dat" becomes...
[0] c
[1] dir1
[2] dir2
[3] file-1-1
[4] dat

"file-1-1" is followed by "dat", so there is a match.

Does that make sense?
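
The positional argument above can be sketched in plain Java. This is a toy model under stated assumptions, not Lucene code: tokenize is a hypothetical stand-in for the analyzer (it lowercases and splits on "\", "/", ":" and "."), and phraseMatches checks only that the query terms appear at consecutive positions, which is the idea behind the phrase match described here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PhraseMatchSketch {

    // Hypothetical analyzer stand-in: lowercase, then split on '\', '/',
    // ':' and '.'. The list index of each term is its position.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[\\\\/:.]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // True if the query terms occur in the field at consecutive positions.
    static boolean phraseMatches(List<String> fieldTerms, List<String> queryTerms) {
        if (queryTerms.isEmpty()) {
            return false;
        }
        for (int start = 0; start + queryTerms.size() <= fieldTerms.size(); start++) {
            if (fieldTerms.subList(start, start + queryTerms.size()).equals(queryTerms)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> field = tokenize("C:\\dir1\\dir2\\file-1-1.dat");
        System.out.println(field);
        // "file-1-1" immediately followed by "dat": a match.
        System.out.println(phraseMatches(field, tokenize("file-1-1.dat")));
        // Same two terms in the wrong order: no match.
        System.out.println(phraseMatches(field, tokenize("dat.file-1-1")));
    }
}
```

The point of the sketch is that the analyzed query "file-1-1.dat" never needs a single indexed term "file-1-1.dat". It only needs "file-1-1" and "dat" at adjacent positions.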

Cheers,
Phil


Re: Why does this search succeed with web app, but not Luke?

Posted by oh...@cox.net.

Phil,

I need to be more precise...

The files that I have are at, say:

C:\dir1\dir2\

so, for example, I have

C:\dir1\dir2\file-1-1.dat
C:\dir1\dir2\file-1-2.dat
C:\dir1\dir2\file-1-3.dat
C:\dir1\dir2\file-1-4.dat
C:\dir1\dir2\file-1-5.dat

After indexing, and, using Luke, I look at the "path" field, and I see terms like (not in this order):

C
dir1
dir2
dat
file-1-1
file-1-2
file-1-3
file-1-4
file-1-5

Notice that there is no (for example) term like "file-1-2.dat".

I'm assuming this is because the analyzer is breaking that into "file-1-2" and "dat".

As I said, based on the terms in Luke, I would have expected a web app query on:

path:file-1-2

to succeed, and a query on:

path:file-1-2.dat

to fail.

But, instead both of those succeed when I do a web query.

Jim





Re: Why does this search succeed with web app, but not Luke?

Posted by oh...@cox.net.

Phil,

Both my indexer and the webapp are basically from the Lucene demos, the indexer starting with the IndexFiles.java demo code, so I think they're both using the StandardAnalyzer.

What appears in Luke, when I select "path" is just the filename part, without the extension, i.e., the "xxxx" part.  

That's why I said in my original post that I was kind of surprised that a web query for "path:xxxx.yyy" succeeded, i.e., in the path field in the index, there is no "xxxx.yyy", just "xxxx".

Jim




Re: Why does this search succeed with web app, but not Luke?

Posted by Phil Whelan <ph...@gmail.com>.

Hi Jim,

Are you using the same Analyzer for indexing and searching? xxxx.yyy
will be seen as a HOSTNAME by StandardAnalyzer and kept as one
term, whereas another analyzer might split this into 2 terms. This
should not matter either way as long as you are using the same
Analyzer for both indexing and searching.

I would expect this to pass unless you are using NOT_ANALYZED, or the
WhitespaceAnalyzer, or something else that would not split on "/".
    path:xxxx.yyy

In Luke, do you see 2 terms "xxxx" and "yyy", or just "xxxx.yyy", or
something else?

Thanks,
Phil

