Posted to dev@nutch.apache.org by Enrico Triolo <en...@gmail.com> on 2006/09/26 16:10:24 UTC

Searching on fields with uppercase letters

Hi all, I'm trying to implement a search plugin to search on the
'subType' field added by index-more plugin. It's a very simple plugin,
copied almost entirely from query-basic.

The problem is that when I perform a query on that field I get no
results at all. Other fields are handled by the same plugin, and I'm
able to search over them. Moreover, when I query the subType field
with Luke I get the expected results.

Looking at the source code, I found that when a query string is
parsed all field names are converted to lower case, so the query
'subType:html' becomes 'subtype:html' (see method 'getNextToken' in
org.apache.nutch.analysis.NutchAnalysisTokenManager).
Could this be the cause of the empty result set? Is there a reason
why fields are treated this way?

Thanks,
Enrico
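
The mismatch described above can be reproduced in isolation. The
following is only a minimal sketch, not Nutch or Lucene code: the
class FieldCaseDemo, its toy in-memory index, and the parseClause
helper are all hypothetical stand-ins. It assumes only what the
message states (the parser lower-cases the field name) plus the fact
that an index keyed by exact field names treats 'subType' and
'subtype' as different fields.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class FieldCaseDemo {
    // Toy stand-in for an index: exact (case-sensitive) field name -> terms.
    static final Map<String, Set<String>> INDEX = new HashMap<>();

    // Store a term under the field name exactly as given.
    static void addField(String field, String term) {
        INDEX.computeIfAbsent(field, k -> new HashSet<>()).add(term);
    }

    // Look up a term under the field name exactly as given.
    static boolean matches(String field, String term) {
        return INDEX.getOrDefault(field, Collections.emptySet()).contains(term);
    }

    // Hypothetical simplification of the query parser: it lower-cases
    // the field name of a "field:term" clause and leaves the term alone.
    static String[] parseClause(String clause) {
        int colon = clause.indexOf(':');
        String field = clause.substring(0, colon).toLowerCase(Locale.ROOT);
        String term = clause.substring(colon + 1);
        return new String[] { field, term };
    }

    public static void main(String[] args) {
        // Index time: the field keeps its mixed-case name.
        addField("subType", "html");

        // Query time: the field name is lower-cased before lookup.
        String[] parsed = parseClause("subType:html");
        System.out.println(parsed[0] + ":" + parsed[1]);      // prints subtype:html
        System.out.println(matches(parsed[0], parsed[1]));    // prints false
        System.out.println(matches("subType", "html"));       // prints true
    }
}
```

The last lookup corresponds to querying with the exact field name,
which is why a tool that does not rewrite the field name can still
find the documents.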

Re: Searching on fields with uppercase letters

Posted by Enrico Triolo <en...@gmail.com>.
That sounds ok... So I think we should modify the index-more plugin
(and maybe others?) to add 'primarytype' and 'subtype' fields instead
of 'primaryType' and 'subType'.

Cheers,
Enrico
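
The change proposed above amounts to normalizing field names at
index time so that they agree with what the query parser produces.
A minimal sketch of that idea follows; it is not the index-more
code. The class LowercaseFieldNames and its helpers are hypothetical,
and the field values 'text' and 'html' are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class LowercaseFieldNames {
    // Lower-case a field name in a locale-independent way.
    static String normalizeFieldName(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    // Return a copy of the document with every field name lower-cased.
    // Values are left untouched; insertion order is preserved.
    static Map<String, String> normalizeDoc(Map<String, String> fields) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            out.put(normalizeFieldName(e.getKey()), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical document with the mixed-case names from the thread.
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("primaryType", "text");
        doc.put("subType", "html");

        System.out.println(normalizeDoc(doc)); // prints {primarytype=text, subtype=html}
    }
}
```

With names stored this way, a query for 'subType:html' that has been
rewritten to 'subtype:html' refers to a field that actually exists.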

On 9/26/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> For simplicity and user-friendliness. While in Lucene we can reasonably
> expect that sophisticated users will construct sophisticated queries,
> paying attention to lower/upper-case, we need to lower the barrier for a
> general-purpose search engine frontend.

Re: Searching on fields with uppercase letters

Posted by Andrzej Bialecki <ab...@getopt.org>.
Enrico Triolo wrote:
> Looking at the source code, I found that when a query string is
> parsed all field names are converted to lower case, so the query
> 'subType:html' becomes 'subtype:html' (see method 'getNextToken' in
> org.apache.nutch.analysis.NutchAnalysisTokenManager).
> Could this be the cause of the empty result set? Is there a reason
> why fields are treated this way?

For simplicity and user-friendliness. While in Lucene we can reasonably 
expect that sophisticated users will construct sophisticated queries, 
paying attention to lower/upper-case, we need to lower the barrier for a 
general-purpose search engine frontend.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com