Posted to java-user@lucene.apache.org by Terry Steichen <te...@net-frame.com> on 2002/06/17 01:00:22 UTC

Peculiar Behavior with Field queries

Hello,

I'm using Lucene (1.2RC5) and, when indexing, I include a field called "headline" using the following line of code in the document I create to use for indexing:

      addField("headline", root.elementText("headline"), true, true, true, doc);

When I search on headline:term1, it works just fine.  But I've noticed that if I query using, for example, 

        headline:"on the job"

I get back every item that has the term 'job' in its headline.

I presume I've overlooked something and would appreciate any suggestions on what that might be.

Regards,

Terry Steichen


Re: Peculiar Behavior with Field queries

Posted by Karl Øie <ka...@gan.no>.
One of the reasons to use stop words is to reduce index size, so an analyzer
that doesn't strip stop words from the index, but stops you from searching on
them, would give the worst of both worlds, I think. If size reduction is the
reason for wanting stop words, it somewhat contradicts the idea of phrase
searches.

If you accept some losses, a phrase search could still be useful if the query
is also passed through the same stop analyzer.

Original text: "nearly all the kings men", stripped of the stop words "the" and
"all", would perhaps still match a phrase search for "nearly all the kings" if
the same words were stripped out of the query as well. I haven't tested it, but
it sounds logical.
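
The idea above can be sketched without Lucene at all. What follows is only a
toy model, assuming a hypothetical three-word stop list and whitespace
tokenization; the class and method names are made up for illustration and are
not part of any Lucene API. It runs both the text and the phrase through the
same stop filter and then checks for a contiguous run of tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified stand-in for a stop analyzer; NOT Lucene's implementation.
final class StopPhraseSketch {
    // Hypothetical stop list, chosen only for this example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("all", "the", "on"));

    // Lowercase the text, split on whitespace, and drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // True if the analyzed phrase occurs as a contiguous run of tokens
    // in the analyzed text. An all-stop-word phrase never matches.
    static boolean phraseMatches(String text, String phrase) {
        List<String> doc = analyze(text);
        List<String> query = analyze(phrase);
        return !query.isEmpty() && Collections.indexOfSubList(doc, query) >= 0;
    }

    public static void main(String[] args) {
        String text = "nearly all the kings men";
        System.out.println(analyze(text));
        System.out.println(phraseMatches(text, "nearly all the kings"));
        System.out.println(phraseMatches(text, "kings nearly"));
    }
}
```

The "losses" show up as false positives: in this model the phrase "nearly
kings" matches the same text, because the stripped words leave no trace in the
token sequence.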

Regards, Karl Øie

> I guess one option would be to create an Analyzer to use when creating the
> index that would not eliminate the stop words, then change the
> QueryParser.jj to use this analyzer when searching for phrases.
> For all other queries you could use a different analyzer that would
> eliminate the stop words.
>
> I don't find this a problem personally as long as you tell the person that
> you have eliminated these terms from what they are searching for. As an
> example, in Google they tell you which terms were just common words that
> have been eliminated from your query string.
>
> --Peter


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Peculiar Behavior with Field queries

Posted by Peter Carlson <ca...@bookandhammer.com>.
Hi Terry,

This is strange. I'll have to check it out.
BTW, the best way to debug the QueryParser and learn it is to look at the
generated QueryParser.java file. It is in the bin directory if you build
Lucene from scratch.

--Peter


On 6/19/02 8:24 AM, "Terry Steichen" <te...@net-frame.com> wrote:

> Peter,
> 
> 1) I was using precisely that spelling of the search string, misspellings
> and case matched.
> 
> 2) I dumped the Query.toString() and it showed that the entered term ("The
> Knockout Paunch") was converted to lower case (l_headline:"the knockout
> paunch").  So, I just tried modifying WPDocument so that when indexing, the
> contents of 'l_headline' would be processed/saved as lowercase.  Didn't
> change anything - still didn't match.
> 
> 3) Why doesn't the '?' wildcard work?
> 
> 4) Also, (related to 3 above) how does Lucene choose which type of query to
> employ?  I've tried examining the contents of QueryParser.jj, but don't
> really understand its structure.
> 
> Regards,
> 
> Terry


Re: Peculiar Behavior with Field queries

Posted by Terry Steichen <te...@net-frame.com>.
Peter,

1) I was using precisely that spelling of the search string, misspellings
and case matched.

2) I dumped the Query.toString() and it showed that the entered term ("The
Knockout Paunch") was converted to lower case (l_headline:"the knockout
paunch").  So, I just tried modifying WPDocument so that when indexing, the
contents of 'l_headline' would be processed/saved as lowercase.  Didn't
change anything - still didn't match.

3) Why doesn't the '?' wildcard work?

4) Also, (related to 3 above) how does Lucene choose which type of query to
employ?  I've tried examining the contents of QueryParser.jj, but don't
really understand its structure.
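
One possible reading of point 2, not confirmed anywhere in this thread: an
untokenized field is indexed as a single term holding the whole value, while
the parsed query is a phrase of separate lowercased words, so the two can
never line up, whether or not the stored value is lowercased. The sketch below
is a toy model of that mismatch only. It assumes query-side analysis simply
lowercases and splits on whitespace, and every name in it is hypothetical, not
a Lucene API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of one field in one document; NOT Lucene's index format.
final class LiteralFieldSketch {
    // An untokenized field contributes its whole value as a single term.
    static Set<String> indexUntokenized(String value) {
        return Collections.singleton(value);
    }

    // Assumed query-side analysis: lowercase, then split on whitespace.
    static List<String> analyzePhrase(String phrase) {
        return Arrays.asList(phrase.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // A phrase can only match if every analyzed token is itself an indexed term.
    static boolean phraseCouldMatch(Set<String> terms, String phrase) {
        return terms.containsAll(analyzePhrase(phrase));
    }

    // Exact lookup of the raw, unanalyzed value as one term.
    static boolean exactTermMatches(Set<String> terms, String value) {
        return terms.contains(value);
    }

    public static void main(String[] args) {
        Set<String> asIndexed = indexUntokenized("The Knockout Paunch");
        Set<String> lowercased = indexUntokenized("the knockout paunch");
        System.out.println(phraseCouldMatch(asIndexed, "The Knockout Paunch"));
        System.out.println(phraseCouldMatch(lowercased, "The Knockout Paunch"));
        System.out.println(exactTermMatches(asIndexed, "The Knockout Paunch"));
    }
}
```

If this reading is right, the literal match would need a single-term query
built in code from the exact stored value (Lucene has a TermQuery class for
that) rather than a phrase passed through QueryParser. That has not been
checked against 1.2rc5 here.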

Regards,

Terry

----- Original Message -----
From: "Peter Carlson" <ca...@bookandhammer.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Wednesday, June 19, 2002 11:06 AM
Subject: Re: Peculiar Behavior with Field queries


> So just to be clear, the search string you are using is exactly
>
> L_headline:"The Knockout Paunch"
>
> Note the misspelling of Punch and the case sensitive specifics.
>
> If this doesn't work, please output the results of the Query object you
> create. That is Query.toString([defaultField]).
>
>
> Also, for the wildcard issue, this is an FAQ. The wildcard query does not
> tokenize the query term and therefore it does not lowercase the "N". Since
> you used the StandardAnalyzer, all indexed terms are lowercase.
>
>
> --Peter


Re: Peculiar Behavior with Field queries

Posted by Peter Carlson <ca...@bookandhammer.com>.
So just to be clear, the search string you are using is exactly

L_headline:"The Knockout Paunch"

Note the misspelling of Punch and the case sensitive specifics.

If this doesn't work, please output the results of the Query object you
create. That is Query.toString([defaultField]).


Also, for the wildcard issue, this is an FAQ. The wildcard query does not
tokenize the query term and therefore it does not lowercase the "N". Since
you used the StandardAnalyzer, all indexed terms are lowercase.
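
To make the case mismatch concrete, here is a small sketch that does not use
Lucene. It assumes the indexed term was lowercased at index time and that the
wildcard pattern is matched as typed. The regex-based matcher and all names
are illustrative assumptions, not how Lucene implements wildcard queries.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Toy wildcard matcher; NOT Lucene's WildcardQuery implementation.
final class WildcardCaseSketch {
    // Translate a wildcard pattern ('*' = any run, '?' = one char) to a regex.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // Match the pattern exactly as the user typed it (no analysis).
    static boolean matchesRaw(String indexedTerm, String wildcard) {
        return toRegex(wildcard).matcher(indexedTerm).matches();
    }

    // Workaround: lowercase the pattern before matching.
    static boolean matchesLowercased(String indexedTerm, String wildcard) {
        return matchesRaw(indexedTerm, wildcard.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // "NAT" as stored after index-time lowercasing.
        String indexed = "NAT".toLowerCase(Locale.ROOT);
        System.out.println(matchesRaw(indexed, "N*"));        // pattern keeps its capital N
        System.out.println(matchesRaw(indexed, "n*"));
        System.out.println(matchesLowercased(indexed, "N*"));
    }
}
```

The practical workaround in this model is to lowercase wildcard patterns in
the application before they reach the parser. The sketch treats '?' as a
single-character wildcard, but it says nothing about why '?' fails in
1.2rc5.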


--Peter



On 6/19/02 7:27 AM, "Terry Steichen" <te...@net-frame.com> wrote:

> Peter,
> 
> Enclosed is an xml file which reflects the structure of the documents I
> index.  Note that it has a 'headline' field.  In my WPDocument class (used
> by the indexer), I parse this xml file into its components and insert them
> as Fields into the Document class.  Specifically, I put the contents of the
> 'headline' xml field into a Field called "headline" and also into a Field
> called "l_headline".  The former is stored, indexed and tokenized.  The
> latter is stored, indexed and *not* tokenized.
> 
> Upon retrieval, I am able to readily display both the "headline" and
> "l_headline" fields.  But I am able to search *only* on the headline field.
> (BTW, I realize  that I must include the entire, literal headline to match
> "l_headline".)
> 
> As long as I'm mentioning problems/observations, I find that I am able to
> search on all fields (other than the 'l_headline' field) using the "*"
> wildcard - but *only* when the preceding letter is lower case.  For example,
> I have another field called "category" and one such value is "NAT".  I can
> match this with "category:NAT", "category:nat", or "category:n*".  But I
> cannot match with "category:N*".
> 
> Also, while the "*" wildcard works fine (at the end and/or in the middle of
> a term), the '?' wildcard doesn't work at all.
> 
> Regards,
> 
> Terry
> 
> PS: I am using the StandardAnalyzer and QueryParser that comes with Lucene
> 1.2rc5.
> 
> ------------ Example XML file that I index --------------------
> <?xml version="1.0" encoding="iso-8859-1"?>
> 
> <article>
> <headline>The Knockout Paunch</headline>
> <author>Peter Piper</author>
> <category>FAT</category>
> <pub_date create_date="20020616" event_date="20020616" timestamp="22:23
> PM">20020616</pub_date>
> <placement edition="EE" section="EZ" page="F01 " slug="POTBELLIES16"/>
> <origin sourcenumber="6">Post</origin>
> <webexec created="Mon Jun 17 23:15:33 EDT 2002" module="v_wp13"/>
> <summary><![CDATA[<p>This Father's Day, let us praise Dad by celebrating
> that ever-expanding, much-maligned monument to the good life that he always
> carries close to his heart -- his paunch, his shelf, his spare tire, his
> front porch, his Buddha, his bay window, his beer gut, his
> potbelly.</p>]]></summary>
> <body paras="74"><![CDATA[ <p>This Father's Day, let us praise Dad by
> celebrating that ever-expanding, much-maligned monument to the good life
> that he always carries close to his heart -- his paunch, his shelf, his
> spare tire, his front porch, his Buddha, his bay window, his beer gut, his
> potbelly.</p> <p>The potbelly is the essence of distilled Dadness. It's as
> much a part of the architecture of middle-aged masculinity as creaky knees
> or hairy ears or the bald spot that keeps growing, wiping out wilderness
> faster than the Sahara.</p>
> 
> ---Stuff snipped for brevity --
> 
> <p>What does the perfect potbelly say?</p> <p>"It says, 'God, that guy's got
> a great beer gut,' " Decaire declares. "I saw a guy with a great gut in the
> store today. He had on a Hawaiian shirt and white shorts. The Hawaiian shirt
> just gave great form to his gut, the way a good bra gives form to breasts.
> It was just perfect. It was holding itself up -- nothing was hanging over
> the belt. I said, 'Great gut.' He said, 'Thanks.'</p> <p>"It was
> beautiful."</p>]]></body>
> <doc_name>A51288-2002Jun14</doc_name>
> <references>
>   <ref_articles>
>     <ref_article/>
>   </ref_articles>
>   <urls>
>     <url/>
>   </urls>
>   <graphics>
>     <graphic/>
>   </graphics>
> </references>
> </article>


Re: Peculiar Behavior with Field queries

Posted by Terry Steichen <te...@net-frame.com>.
Peter,

Enclosed is an xml file which reflects the structure of the documents I
index.  Note that it has a 'headline' field.  In my WPDocument class (used
by the indexer), I parse this xml file into its components and insert them
as Fields into the Document class.  Specifically, I put the contents of the
'headline' xml field into a Field called "headline" and also into a Field
called "l_headline".  The former is stored, indexed and tokenized.  The
latter is stored, indexed and *not* tokenized.

Upon retrieval, I am able to readily display both the "headline" and
"l_headline" fields.  But I am able to search *only* on the headline field.
(BTW, I realize  that I must include the entire, literal headline to match
"l_headline".)

As long as I'm mentioning problems/observations, I find that I am able to
search on all fields (other than the 'l_headline' field) using the "*"
wildcard - but *only* when the preceding letter is lower case.  For example,
I have another field called "category" and one such value is "NAT".  I can
match this with "category:NAT", "category:nat", or "category:n*".  But I
cannot match with "category:N*".

Also, while the "*" wildcard works fine (at the end and/or in the middle of
a term), the '?' wildcard doesn't work at all.

Regards,

Terry

PS: I am using the StandardAnalyzer and QueryParser that comes with Lucene
1.2rc5.

------------ Example XML file that I index --------------------
<?xml version="1.0" encoding="iso-8859-1"?>

<article>
  <headline>The Knockout Paunch</headline>
  <author>Peter Piper</author>
  <category>FAT</category>
  <pub_date create_date="20020616" event_date="20020616" timestamp="22:23
PM">20020616</pub_date>
  <placement edition="EE" section="EZ" page="F01 " slug="POTBELLIES16"/>
  <origin sourcenumber="6">Post</origin>
  <webexec created="Mon Jun 17 23:15:33 EDT 2002" module="v_wp13"/>
  <summary><![CDATA[<p>This Father's Day, let us praise Dad by celebrating
that ever-expanding, much-maligned monument to the good life that he always
carries close to his heart -- his paunch, his shelf, his spare tire, his
front porch, his Buddha, his bay window, his beer gut, his
potbelly.</p>]]></summary>
  <body paras="74"><![CDATA[ <p>This Father's Day, let us praise Dad by
celebrating that ever-expanding, much-maligned monument to the good life
that he always carries close to his heart -- his paunch, his shelf, his
spare tire, his front porch, his Buddha, his bay window, his beer gut, his
potbelly.</p> <p>The potbelly is the essence of distilled Dadness. It's as
much a part of the architecture of middle-aged masculinity as creaky knees
or hairy ears or the bald spot that keeps growing, wiping out wilderness
faster than the Sahara.</p>

---Stuff snipped for brevity --

<p>What does the perfect potbelly say?</p> <p>"It says, 'God, that guy's got
a great beer gut,' " Decaire declares. "I saw a guy with a great gut in the
store today. He had on a Hawaiian shirt and white shorts. The Hawaiian shirt
just gave great form to his gut, the way a good bra gives form to breasts.
It was just perfect. It was holding itself up -- nothing was hanging over
the belt. I said, 'Great gut.' He said, 'Thanks.'</p> <p>"It was
beautiful."</p>]]></body>
  <doc_name>A51288-2002Jun14</doc_name>
  <references>
    <ref_articles>
      <ref_article/>
    </ref_articles>
    <urls>
      <url/>
    </urls>
    <graphics>
      <graphic/>
    </graphics>
  </references>
</article>


----- Original Message -----
From: "Peter Carlson" <ca...@bookandhammer.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Wednesday, June 19, 2002 9:47 AM
Subject: Re: Peculiar Behavior with Field queries


> Terry,
>
> Please provide the exact example of the text so we can look at it and
> evaluate what's going on.
>
> -Peter


Re: Peculiar Behavior with Field queries

Posted by Peter Carlson <ca...@bookandhammer.com>.
Terry,

Please provide the exact example of the text so we can look at it and
evaluate what's going on.

-Peter


On 6/19/02 5:20 AM, "Terry Steichen" <te...@net-frame.com> wrote:

> Peter,
> 
> I added a new field called 'l_headline' (for literal headline) which I set
> so it was searchable and included in the index and not tokenized.  But the
> query (using a phrase that is an exact match for the headline, but which may
> include stop words) still fails.  Even when I apply this to an article whose
> headline contains no stop words (so the headline:"phrase"' returns the
> article), the 'l_headline' fails to produce anything.
> 
> I can do a 'doc.get("l_headline")' and it shows the proper phrase has been
> included.
> 
> Any ideas why this won't let me do a literal match?  Seems like it should
> work fine.
> 
> Regards,
> 
> Terry


Re: Peculiar Behavior with Field queries

Posted by Terry Steichen <te...@net-frame.com>.
Peter,

I added a new field called 'l_headline' (for literal headline) which I set
so it was searchable and included in the index and not tokenized.  But the
query (using a phrase that is an exact match for the headline, but which may
include stop words) still fails.  Even when I apply this to an article whose
headline contains no stop words (so the headline:"phrase"' returns the
article), the 'l_headline' fails to produce anything.

I can do a 'doc.get("l_headline")' and it shows the proper phrase has been
included.

Any ideas why this won't let me do a literal match?  Seems like it should
work fine.

Regards,

Terry


----- Original Message -----
From: "Peter Carlson" <ca...@bookandhammer.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Monday, June 17, 2002 10:32 AM
Subject: Re: Peculiar Behavior with Field queries


> You could do this, but you would also have to match case exactly.
>
> --Peter
>
>
> On 6/17/02 7:04 AM, "Terry Steichen" <te...@net-frame.com> wrote:
>
> > Could you handle this by including extra fields?  Let's say we're dealing
> > with a database of articles, and that we wanted to do regular as well as
> > literal phrase searches on the 'headline' field.  What would happen if,
> > during indexing, you created a second headline field which you defined as
> > searchable but *not* tokenized.  Could you then apply the phrase query
> > (which included stop words) successfully against the second headline field
> > (recognizing that you would have to match the second headline's text
> > exactly)?
> >
> > Terry
> >
> > ----- Original Message -----
> > From: "Peter Carlson" <ca...@bookandhammer.com>
> > To: "Lucene Users List" <lu...@jakarta.apache.org>
> > Sent: Monday, June 17, 2002 10:03 AM
> > Subject: Re: Peculiar Behavior with Field queries
> >
> >
> >> I don't think there is a work around without eliminating stop words, because
> >> when you index, you will not include the stop words in the index.
> >>
> >> I guess one option would be to create an Analyzer to use when creating the
> >> index that would not eliminate the stop words, then change the
> >> QueryParser.jj to use this analyzer when searching for phrases.
> >> For all other queries you could use a different analyzer that would
> >> eliminate the stop words.
> >>
> >> I don't find this a problem personally as long as you tell the person that
> >> you have eliminated these terms from what they are searching for. As an
> >> example, in Google they tell you which terms were just common words that
> >> have been eliminated from your query string.
> >>
> >> --Peter
> >>
> >> On 6/17/02 5:32 AM, "Terry Steichen" <te...@net-frame.com> wrote:
> >>
> >>> This apparent inability of Lucene to find articles containing literal
> >>> phrases (if the phrase contains stop words) can be, I think, a severe
> >>> limitation.  I wonder if there is any workaround (short of eliminating
> >>> stop words)?
> >>>
> >>> Terry
> >>>
> >>> ----- Original Message -----
> >>> From: "Otis Gospodnetic" <ot...@yahoo.com>
> >>> To: "Lucene Users List" <lu...@jakarta.apache.org>
> >>> Sent: Sunday, June 16, 2002 11:04 PM
> >>> Subject: Re: Peculiar Behavior with Field queries
> >>>
> >>>
> >>>> I believe you are correct.
> >>>>
> >>>> istO tisO sitO itSo Otsi Osit Otis
> >>>>
> >>>> --- Terry Steichen <te...@net-frame.com> wrote:
> >>>>> Oits,
> >>>>>
> >>>>> You may be right that the stop words are at work here.  What I was
> >>>>> expecting
> >>>>> to do is be able to match a specific phrase ("on the job"), even if
> >>>>> it
> >>>>> includes stop words.  But I guess that may not be the way that
Lucene
> >>>>> works,
> >>>>> right?
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> Terry
> >>>>>
> >>>>> ----- Original Message -----
> >>>>> From: "Otis Gospodnetic" <ot...@yahoo.com>
> >>>>> To: "Lucene Users List" <lu...@jakarta.apache.org>
> >>>>> Sent: Sunday, June 16, 2002 7:13 PM
> >>>>> Subject: Re: Peculiar Behavior with Field queries
> >>>>>
> >>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> I'm not sure what the problem is :)
> >>>>>> What do you expect to get back?
> >>>>>> Are you wondering why 'on the' part is not matched?
> >>>>>> If so, it's probably because both 'on' and 'the' are in the list of
> >>>>>> stop words, which are thrown out when/before indexing.
> >>>>>>
> >>>>>> Otis
> >>>>>>
> >>>>>> --- Terry Steichen <te...@net-frame.com> wrote:
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> I'm using Lucene (1.2RC5) and, when indexing, I include a field
> >>>>>>> called "headline" using the following line of code in the
> >>>>> document I
> >>>>>>> create to use for indexing:
> >>>>>>>
> >>>>>>>       addField("headline", root.elementText("headline"), true,
> >>>>> true,
> >>>>>>> true, doc);
> >>>>>>>
> >>>>>>> When I search on headline:term1, it works just fine.  But I've
> >>>>>>> noticed that if I query using, for example,
> >>>>>>>
> >>>>>>>         headline:"on the job"
> >>>>>>>
> >>>>>>> I will get returned all items that have the term 'job' in their
> >>>>>>> headline.
> >>>>>>>
> >>>>>>> I presume I've overlooked something and would appreciate any
> >>>>>>> suggestions on what that might be.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>>
> >>>>>>> Terry Steichen
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> __________________________________________________
> >>>>>> Do You Yahoo!?
> >>>>>> Yahoo! - Official partner of 2002 FIFA World Cup
> >>>>>> http://fifaworldcup.yahoo.com
> >>>>>>
> >>>>>> --
> >>>>>> To unsubscribe, e-mail:
> >>>>> <ma...@jakarta.apache.org>
> >>>>>> For additional commands, e-mail:
> >>>>> <ma...@jakarta.apache.org>
> >>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> To unsubscribe, e-mail:
> >>>>> <ma...@jakarta.apache.org>
> >>>>> For additional commands, e-mail:
> >>>>> <ma...@jakarta.apache.org>
> >>>>>
> >>>>
> >>>>
> >>>> __________________________________________________
> >>>> Do You Yahoo!?
> >>>> Yahoo! - Official partner of 2002 FIFA World Cup
> >>>> http://fifaworldcup.yahoo.com
> >>>>
> >>>> --
> >>>> To unsubscribe, e-mail:
> >>> <ma...@jakarta.apache.org>
> >>>> For additional commands, e-mail:
> >>> <ma...@jakarta.apache.org>
> >>>>
> >>>
> >>>
> >>> --
> >>> To unsubscribe, e-mail:
> > <ma...@jakarta.apache.org>
> >>> For additional commands, e-mail:
> > <ma...@jakarta.apache.org>
> >>>
> >>>
> >>
> >>
> >> --
> >> To unsubscribe, e-mail:
> > <ma...@jakarta.apache.org>
> >> For additional commands, e-mail:
> > <ma...@jakarta.apache.org>
> >>
> >>
> >
> >
> > --
> > To unsubscribe, e-mail:
<ma...@jakarta.apache.org>
> > For additional commands, e-mail:
<ma...@jakarta.apache.org>
> >
> >
>
>
> --
> To unsubscribe, e-mail:
<ma...@jakarta.apache.org>
> For additional commands, e-mail:
<ma...@jakarta.apache.org>
>
>


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Peculiar Behavior with Field queries

Posted by Peter Carlson <ca...@bookandhammer.com>.
You could do this, but you would also have to match case exactly.
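
A rough sketch of one way the case restriction could be relaxed: normalize the headline to a canonical key both when indexing and when querying. This is an assumption for illustration, not something proposed in the thread and not Lucene code; the class NormalizedKeywordSketch and its methods are hypothetical, and a plain map stands in for the untokenized field.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for an untokenized (whole-value) field; not Lucene code.
class NormalizedKeywordSketch {
    // normalized headline -> ids of documents whose headline normalizes to that key
    private final Map<String, Set<Integer>> exact = new HashMap<>();

    // Trim, lowercase, and collapse runs of whitespace so that case and spacing
    // differences no longer prevent a whole-value match.
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    void add(int docId, String headline) {
        exact.computeIfAbsent(normalize(headline), k -> new TreeSet<>()).add(docId);
    }

    // Still a whole-headline match: the query must cover the entire headline text.
    Set<Integer> search(String headline) {
        return exact.getOrDefault(normalize(headline), Collections.emptySet());
    }
}
```

The same normalization has to be applied on both sides, otherwise the keys will not line up. A query for only part of the headline still finds nothing, because the whole value is a single term.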

--Peter


On 6/17/02 7:04 AM, "Terry Steichen" <te...@net-frame.com> wrote:

> Could you handle this by including extra fields?  Let's say we're dealing
> with a database of articles, and that we wanted to do regular as well as
> literal phrase searches on the 'headline' field.  What would happen if,
> during indexing, you created a second headline field which you defined as
> searchable but *not* tokenized.  Could you then apply the phrase query
> (which included stop words) successfully against the second headline field
> (recognizing that you would have to match the second headline's text
> exactly)?
> 
> Terry


Re: Peculiar Behavior with Field queries

Posted by Terry Steichen <te...@net-frame.com>.
Could you handle this by including extra fields?  Let's say we're dealing
with a database of articles, and that we wanted to do regular as well as
literal phrase searches on the 'headline' field.  What would happen if,
during indexing, you created a second headline field which you defined as
searchable but *not* tokenized?  Could you then apply the phrase query
(which included stop words) successfully against the second headline field
(recognizing that you would have to match the second headline's text
exactly)?
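
To make the idea concrete, here is a minimal sketch in plain Java, with maps standing in for the two index fields. It is not Lucene code: the class TwoFieldSketch, its stop list, and its method names are all hypothetical. It only illustrates what "searchable but not tokenized" implies, namely that the whole headline becomes one term.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical two-field index: one tokenized field, one untokenized field.
class TwoFieldSketch {
    // Assumed stop list, for this example only.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "on", "the", "to"));

    // Tokenized field: lowercased terms with stop words removed.
    private final Map<String, Set<Integer>> headlineTerms = new HashMap<>();
    // Untokenized field: the entire headline, unchanged, is the only term.
    private final Map<String, Set<Integer>> headlineExact = new HashMap<>();

    void add(int docId, String headline) {
        for (String token : headline.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                headlineTerms.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
        headlineExact.computeIfAbsent(headline, k -> new TreeSet<>()).add(docId);
    }

    // Regular search against the tokenized field.
    Set<Integer> searchTerm(String term) {
        return headlineTerms.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    // Literal search against the untokenized field: whole value, same case.
    Set<Integer> searchExact(String headline) {
        return headlineExact.getOrDefault(headline, Collections.emptySet());
    }
}
```

With headlines "On the Job" and "Safety on the Job" indexed, searchExact("On the Job") finds only the first, while searchTerm("job") finds both. The second field therefore gives exact headline lookup rather than phrase search inside a longer headline.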

Terry

----- Original Message -----
From: "Peter Carlson" <ca...@bookandhammer.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Monday, June 17, 2002 10:03 AM
Subject: Re: Peculiar Behavior with Field queries


> I don't think there is a work around without eliminating stop words, because
> when you index, you will not include the stop words in the index.
>
> I guess one option would be to create an Analyzer to use when creating the
> index that would not eliminate the stop words, then a change the
> QueryParser.jj to use this analyzer when searching for phrases.
> For all other queries you could use a different analyzer that would
> eliminate the stop words.
>
> I don't find this a problem personally as long as you tell the person that
> you have eliminated these terms from what they are searching for. As an
> example, in Google they tell you which terms were just common words that
> have been eliminated from your query string.
>
> --Peter


Re: Peculiar Behavior with Field queries

Posted by Peter Carlson <ca...@bookandhammer.com>.
I don't think there is a workaround other than not removing stop words,
because when you index, the stop words are not included in the index.

I guess one option would be to create an Analyzer to use when creating the
index that would not eliminate the stop words, then change
QueryParser.jj to use this analyzer when searching for phrases.
For all other queries you could use a different analyzer that would
eliminate the stop words.
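
The option above can be sketched roughly as follows. This is a self-contained toy positional index in plain Java, not Lucene code and not a QueryParser.jj change; the class PhraseIndexSketch, its stop list, and its method names are hypothetical. The index keeps every word with its position; phrase queries are analyzed without stop-word removal, while other queries drop stop words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class PhraseIndexSketch {
    // Assumed stop list, for this example only.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "on", "the", "to"));

    // term -> (document id -> positions of the term in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    // Index-time analysis: lowercase and split, keeping stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    void add(int docId, String text) {
        List<String> tokens = tokenize(text);
        for (int pos = 0; pos < tokens.size(); pos++) {
            postings.computeIfAbsent(tokens.get(pos), k -> new HashMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Phrase query: stop words kept; all tokens must sit at consecutive positions.
    Set<Integer> phrase(String phrase) {
        List<String> tokens = tokenize(phrase);
        Set<Integer> hits = new TreeSet<>();
        if (tokens.isEmpty()) return hits;
        Map<Integer, List<Integer>> first =
            postings.getOrDefault(tokens.get(0), Collections.emptyMap());
        for (Map.Entry<Integer, List<Integer>> e : first.entrySet()) {
            for (int start : e.getValue()) {
                if (matchesAt(e.getKey(), start, tokens)) {
                    hits.add(e.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    private boolean matchesAt(int docId, int start, List<String> tokens) {
        for (int i = 1; i < tokens.size(); i++) {
            List<Integer> positions =
                postings.getOrDefault(tokens.get(i), Collections.emptyMap()).get(docId);
            if (positions == null || !positions.contains(start + i)) return false;
        }
        return true;
    }

    // Non-phrase query: stop words dropped; documents must contain every remaining term.
    Set<Integer> allTerms(String query) {
        Set<Integer> hits = null;
        for (String t : tokenize(query)) {
            if (STOP_WORDS.contains(t)) continue;
            Set<Integer> docs = postings.getOrDefault(t, Collections.emptyMap()).keySet();
            if (hits == null) hits = new TreeSet<>(docs); else hits.retainAll(docs);
        }
        return hits == null ? new TreeSet<>() : hits;
    }

    public static void main(String[] args) {
        PhraseIndexSketch idx = new PhraseIndexSketch();
        idx.add(1, "Safety on the Job");
        idx.add(2, "Job Market Improves");
        idx.add(3, "On the Road");
        System.out.println(idx.phrase("on the job"));   // [1]
        System.out.println(idx.allTerms("on the job")); // [1, 2]
    }
}
```

The cost of this design is a larger index, since the stop words and their positions are stored for every document.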

I don't find this a problem personally as long as you tell the person that
you have eliminated these terms from what they are searching for. As an
example, in Google they tell you which terms were just common words that
have been eliminated from your query string.

--Peter

On 6/17/02 5:32 AM, "Terry Steichen" <te...@net-frame.com> wrote:

> This apparent inability of Lucene to find articles containing literal
> phrases (if the phrase contains stop words) can be, I think, a severe
> limitation.  I wonder if there is any workaround (short of eliminating stop
> words)?
> 
> Terry


Re: Peculiar Behavior with Field queries

Posted by Terry Steichen <te...@net-frame.com>.
This apparent inability of Lucene to find articles containing literal
phrases (if the phrase contains stop words) can be, I think, a severe
limitation.  I wonder if there is any workaround (short of eliminating stop
words)?

Terry

----- Original Message -----
From: "Otis Gospodnetic" <ot...@yahoo.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Sunday, June 16, 2002 11:04 PM
Subject: Re: Peculiar Behavior with Field queries


> I believe you are correct.
>
> istO tisO sitO itSo Otsi Osit Otis


Re: Peculiar Behavior with Field queries

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I believe you are correct.

istO tisO sitO itSo Otsi Osit Otis

--- Terry Steichen <te...@net-frame.com> wrote:
> Oits,
> 
> You may be right that the stop words are at work here.  What I was
> expecting to do is be able to match a specific phrase ("on the job"),
> even if it includes stop words.  But I guess that may not be the way
> that Lucene works, right?
> 
> Regards,
> 
> Terry


Re: Peculiar Behavior with Field queries

Posted by Terry Steichen <te...@net-frame.com>.
Oits,

You may be right that the stop words are at work here.  What I was expecting
was to be able to match a specific phrase ("on the job"), even if it
includes stop words.  But I guess that may not be the way that Lucene works,
right?

Regards,

Terry

----- Original Message -----
From: "Otis Gospodnetic" <ot...@yahoo.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Sunday, June 16, 2002 7:13 PM
Subject: Re: Peculiar Behavior with Field queries


> Hello,
>
> I'm not sure what the problem is :)
> What do you expect to get back?
> Are you wondering why 'on the' part is not matched?
> If so, it's probably because both 'on' and 'the' are in the list of
> stop words, which are thrown out when/before indexing.
>
> Otis


Re: Peculiar Behavior with Field queries

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

I'm not sure what the problem is :)
What do you expect to get back?
Are you wondering why the 'on the' part is not matched?
If so, it's probably because both 'on' and 'the' are in the list of
stop words, which are thrown out when/before indexing.
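
A tiny illustration of the effect, in plain Java rather than Lucene code: the class StopWordSketch and its stop list are hypothetical stand-ins for a stop-word-removing analyzer. If the same analysis is applied to the query, the phrase "on the job" is left with the single term "job".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative stand-in for a stop-word-removing analyzer; not Lucene code.
class StopWordSketch {
    // Assumed stop list, for this example only.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "on", "the", "to"));

    // Lowercase, split on non-alphanumeric characters, and drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("on the job"));        // [job]
        System.out.println(analyze("Safety on the Job")); // [safety, job]
    }
}
```

Under these assumptions, a query for "on the job" and a query for "job" are indistinguishable after analysis, which would explain why every headline containing 'job' comes back.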

Otis

--- Terry Steichen <te...@net-frame.com> wrote:
> Hello,
> 
> I'm using Lucene (1.2RC5) and, when indexing, I include a field
> called "headline" using the following line of code in the document I
> create to use for indexing:
> 
>       addField("headline", root.elementText("headline"), true, true, true, doc);
> 
> When I search on headline:term1, it works just fine.  But I've
> noticed that if I query using, for example, 
> 
>         headline:"on the job"
> 
> I will get returned all items that have the term 'job' in their
> headline.
> 
> I presume I've overlooked something and would appreciate any
> suggestions on what that might be.
> 
> Regards,
> 
> Terry Steichen