Posted to dev@lucene.apache.org by "Harish Kayarohanam (JIRA)" <ji...@apache.org> on 2015/07/28 06:47:04 UTC
[jira] [Comment Edited] (LUCENE-373) Query parts ending with a colon are handled badly

    [ https://issues.apache.org/jira/browse/LUCENE-373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643840#comment-14643840 ] 

Harish Kayarohanam edited comment on LUCENE-373 at 7/28/15 4:46 AM:
--------------------------------------------------------------------

My understanding of the above issue, with an analysis of whether it really needs a fix (and if so, where) or whether it is an enhancement.

section 1:
==========
>>> If queryString is "title: search", there's no exception. However, the parsed
>>> query which is returned is "title:search". 
This is as expected.

section 2:
==========
>>> If queryString is "title: contents: text", 
>>> the parsed query is "title:contents" and the "text" part is ignored completely. 
This needs a revisit. Maybe we should bring in something like chained assignment:
a = b = 2 in Java, Python, JavaScript, or Ruby means 2 is assigned to both a and b.
A similar approach can be followed here. This is discussed in detail later in my answer (see sections 5 & 7).

section 3:
==========
>>> When queryString is "title: text contents:" the above exception is
>>> produced again.
This is also expected. It breaks the syntax.
Why? And why might this not be considered a bug?
We should accept that the Lucene query language is a language of
its own, with its own syntax, so we should obey that syntax.
And I would say that it has a meaningful syntax. It is not weird.
Why do I say that?
Let us see what happens in other programming languages (say Python, Java, JavaScript, or Ruby).
a = ;   (or just  a = )
is an error (unexpected end of input).
Similarly,
 = 2;
is an error.
This is common to almost all languages, and expected.
Why is this the expected behaviour?
The idea is:
1) If you assign something to nothing, it is a bug:  = 2
2) If you assign nothing to something, it is a bug:  a = 
 
Now let us come to the Lucene context:
 = something ...
This raises the question "which field should we search 'something' in: the default field, or something else?" The query is meaningless, so the Lucene developers made the best choice by treating it as an error and throwing a ParseException.
something = 
What should we search for in the field 'something'? We should not infer any value unless told explicitly, so here too the Lucene developers made the best choice by treating it as an error and throwing a ParseException. I personally like the decision made.

section 4:
==========
>>> This seems inconsistent. Given that it's pointless searching for an empty
>>> string (since it has no tokens), I'd expect both "search title:" & "title:
>>> search" to be parsed as "search" (or, given the default field I specified,
>>> "contents:search"), 
search title:
is the case explained above. I like the present syntax, as it is best for a syntax not to assume anything unless
told explicitly, as in the cases
 = 2 
a = 
where we cannot assume either the field or the term. So it should be a ParseException, and that is what we get now.

"title: search" overrides the default field and searches in the title field. This is as per design; it cannot just search for "search" in the default field, as that would break the original design. Please refer to the Fields section in http://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description. 

section 5:
==========
>>> and "title: contents: text" 
This seems meaningful, at least to me. I would not say it is right or wrong; it is about what we want,
what most people want, and what seems meaningful.
If we want, we can bring in a new syntax. Again, I would like to look at how other programming languages handle a similar syntax.
In Java, Python, JavaScript, and Ruby,
a = b = c = 2
is allowed,
and what it does is assign the value 2 to a, b, and c.
So here too we can have a syntax that makes the term "text" be searched in both the title and contents fields. This is a choice which
we can make if the present behaviour is confusing.
I feel, as does the person who reported this issue, that silently ignoring something the user gave seems
unfair. This is just my point of view.
If the community takes the stand that this breaks the syntax and we don't want this new syntax, we should at least throw an exception.
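
For reference, the chained assignment used as the analogy in this section can be sketched in Java as below. The class and method names (ChainedAssignmentDemo, assignAll) are hypothetical and exist only for illustration; the sketch shows the language feature, not anything in Lucene.

```java
// Illustration of the assignment analogy only; unrelated to Lucene's parser.
final class ChainedAssignmentDemo {

    /** Assigns 2 to three variables in one chained statement and returns them. */
    static int[] assignAll() {
        int a, b, c;
        // Assignment is right-associative: this reads as a = (b = (c = 2)).
        a = b = c = 2;
        return new int[] {a, b, c};
    }

    public static void main(String[] args) {
        int[] values = assignAll();
        System.out.println(values[0] + " " + values[1] + " " + values[2]);
    }
}
```

By contrast, a statement such as `a = ;` or `= 2;` does not compile in Java, which is the behaviour section 3 compares the ParseException to.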

section 6:
==========
>>> "title: text contents:" to
>>> parse as "text" ("contents:text") i.e. parts which have no term are ignored. At
>>> worst I'd expect them all to throw a ParseException rather than just the ones
>>> with the colon at the end of the string.
Please see my explanation above. By my reasoning, this need not be considered a bug.

Note: I am looking at the syntax of other programming languages just to see which design has stood the test of time, so that I can infer what most people expect and find least confusing. These programming languages have evolved over time, so we can take their
syntax as a reference for what is most widely expected. I personally would like to go by the most common
expectations. Please correct me if I am wrong.



section 7:
==========
Further discussion on section 5:
Let us see whether the new syntax can work in the Lucene query language, and how it can work without ambiguity.
a : b : hello world h: when
hello will be searched in the fields named a and b
world will be searched in the default field
when will be searched in the field named h.

Whenever and wherever there are statements like the following
1) field names but no terms --   a:
2) terms with the intention to assign (with :) but no field name --  : hello
they should be flagged as errors.
(The query parser already does this. That is, it does not just look for a : at the beginning or end of the query and flag the
error. This is good. Even if I have statements within brackets, like (fieldname:) or (:termvalue), it flags an error.)
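
To make the proposal concrete, here is a minimal toy sketch in Java of how such chained field prefixes could be resolved. It is not Lucene's QueryParser and does not use any Lucene API: the class ChainedFieldSketch, its parse method, and the use of IllegalArgumentException in place of ParseException are all assumptions made for illustration. It also ignores grouping, quoting, boolean operators, and analysis.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the chained-field proposal; hypothetical, NOT Lucene's QueryParser.
final class ChainedFieldSketch {

    /**
     * Returns the clauses of the query as "field:term" strings.
     * Throws if a field has no term ("a:") or a colon has no field name (": hello").
     */
    static List<String> parse(String query, String defaultField) {
        // Put spaces around every colon so "h:" and "a : b" tokenize the same way.
        String[] tokens = query.replace(":", " : ").trim().split("\\s+");
        List<String> clauses = new ArrayList<>();
        List<String> pendingFields = new ArrayList<>(); // fields still waiting for a term
        int i = 0;
        while (i < tokens.length) {
            String tok = tokens[i];
            if (tok.isEmpty()) { // only happens for an empty query
                i++;
                continue;
            }
            if (tok.equals(":")) {
                throw new IllegalArgumentException("colon without a field name");
            }
            boolean followedByColon = i + 1 < tokens.length && tokens[i + 1].equals(":");
            if (followedByColon) {
                pendingFields.add(tok); // a field name; remember it and skip its colon
                i += 2;
            } else {
                if (pendingFields.isEmpty()) {
                    clauses.add(defaultField + ":" + tok); // bare term: default field
                } else {
                    for (String field : pendingFields) {
                        clauses.add(field + ":" + tok); // term applies to every chained field
                    }
                    pendingFields.clear();
                }
                i++;
            }
        }
        if (!pendingFields.isEmpty()) {
            throw new IllegalArgumentException("field without a term: " + pendingFields);
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("a : b : hello world h: when", "contents"));
    }
}
```

Under these assumptions, "title: contents: text" would resolve to a search for text in both title and contents, while "search title:" and ": hello" would both be rejected, which matches the two error cases listed above.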

The above in sections 5 & 7 is just a proposal. Please give your comments. Feel free to point out mistakes.
If this syntax is expected to have a bad impact on performance, then it need not go in.

I referred to http://lucene.apache.org/core/5_2_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description for a better understanding.



> Query parts ending with a colon are handled badly
> -------------------------------------------------
>
>                 Key: LUCENE-373
>                 URL: https://issues.apache.org/jira/browse/LUCENE-373
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/queryparser
>    Affects Versions: 1.4
>         Environment: Operating System: Windows 2000
> Platform: PC
>            Reporter: Andrew Stevens
>            Priority: Minor
>              Labels: newdev
>
> I'm using Lucene 1.4.3, running
> Query query = QueryParser.parse(queryString, "contents", new StandardAnalyzer());
> If queryString is "search title:" i.e. specifying a field name without a
> corresponding value, I get a parsing exception:
> Encountered "<EOF>" at line 1, column 8.
> Was expecting one of:
>     "(" ...
>     <QUOTED> ...
>     <TERM> ...
>     <PREFIXTERM> ...
>     <WILDTERM> ...
>     "[" ...
>     "{" ...
>     <NUMBER> ...
> If queryString is "title: search", there's no exception.  However, the parsed
> query which is returned is "title:search".  If queryString is "title: contents:
> text", the parsed query is "title:contents" and the "text" part is ignored
> completely.  When queryString is "title: text contents:" the above exception is
> produced again.
> This seems inconsistent.  Given that it's pointless searching for an empty
> string (since it has no tokens), I'd expect both "search title:" & "title:
> search" to be parsed as "search" (or, given the default field I specified,
> "contents:search"), and "title: contents: text" & "title: text contents:" to
> parse as "text" ("contents:text") i.e. parts which have no term are ignored.  At
> worst I'd expect them all to throw a ParseException rather than just the ones
> with the colon at the end of the string.


