Posted to users@jena.apache.org by Paul Tyson <ph...@sbcglobal.net> on 2014/09/19 03:30:36 UTC

text query analyzer problem

I've been using the configurable text query analyzer in jena 2.11.2
(fuseki 1.0.2) since it was provided by JENA-654. I use the
KeywordAnalyzer to index a field that contains part numbers, which are
mostly composed of digits and dashes, but also include a fair number of
alphabetic characters.

I just noticed a problem searching for alphabetic strings. I can't even
detect a consistent failure mode, but it is definitely doing
case-insensitive searching. I understand this is how the Lucene
StandardAnalyzer works, which leads me to suspect the query parser is
using StandardAnalyzer instead of KeywordAnalyzer. There are other
problems too, which might be due to the query parser tokenizing the
query string incompatibly with the index.
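
To illustrate the suspicion above, here is a minimal sketch in plain Python
(not Lucene) of how the two kinds of analysis would treat one part number.
The function names and the part number are hypothetical, and
standard_like_tokens is only a rough stand-in for a word-splitting,
lowercasing analyzer; the real StandardAnalyzer uses a more elaborate
tokenizer.

```python
import re


def keyword_tokens(text):
    # Keyword-style analysis: the whole input is kept as a single
    # token, with case and punctuation unchanged.
    return [text]


def standard_like_tokens(text):
    # Rough stand-in (an assumption, not Lucene's algorithm): split on
    # anything that is not an ASCII letter or digit, then lowercase.
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]


part_number = "AB-1234-C"  # hypothetical part number

print(keyword_tokens(part_number))        # ['AB-1234-C']
print(standard_like_tokens(part_number))  # ['ab', '1234', 'c']
```

If the index held the single term AB-1234-C while the query side produced
ab, 1234 and c, none of the query terms would equal the indexed term, which
is the kind of incompatibility described above.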

Here's the bit of the assembler file with the entity map:

<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ;
    text:map (
    [ text:field "text" ;
      text:predicate skosxl:literalForm;
      text:analyzer [a text:KeywordAnalyzer;]
    ]) .

Text queries work as expected when the search string consists only of
digits and dashes. They go wrong when it contains alphabetic characters.

Has anyone else noticed this problem? Do I need a different entity
mapping to make the query parser use KeywordAnalyzer? Where else could I
look for the cause of this problem?

Thanks in advance.

Regards,
--Paul

Re: text query analyzer problem

Posted by Andy Seaborne <an...@apache.org>.

On 19/09/14 14:50, Paul Tyson wrote:
> Ok, I will work up an example. But you should be able to duplicate
> the symptoms by seeing if a field indexed with KeywordAnalyzer
> returns case insensitive hits. Also if the indexed keywords contain
> punctuation that StandardAnalyzer would use for tokenizing, you might
> see odd results.
>
> Regards. --Paul

Paul,

A test case would be very helpful. It not only defines the problem but
also defines the expectation. It then goes in the standard test
suite and hence is fixed for all time.

Is this Solr or Lucene text search?

	Andy


Re: text query analyzer problem

Posted by Paul Tyson <ph...@sbcglobal.net>.

Ok, I will work up an example. But you should be able to duplicate the symptoms by checking whether a field indexed with KeywordAnalyzer returns case-insensitive hits. Also, if the indexed keywords contain punctuation that StandardAnalyzer would use for tokenizing, you might see odd results.
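
The expected behaviour can be written down as a small check. This is a toy sketch in plain Python, not Jena or Lucene code: ExactTermIndex, the example.org URIs and the part numbers are all made up for illustration. It only states what a keyword-indexed field should do, namely match whole values exactly and case-sensitively.

```python
class ExactTermIndex:
    """Toy index: each value is stored as one unmodified term."""

    def __init__(self):
        self._terms = {}

    def add(self, uri, value):
        # Store the value as-is: no splitting, no lowercasing.
        self._terms.setdefault(value, set()).add(uri)

    def search(self, query):
        # Only an exact, case-sensitive match on the whole value hits.
        return sorted(self._terms.get(query, set()))


index = ExactTermIndex()
index.add("http://example.org/part/1", "12-AB-34")  # hypothetical data
index.add("http://example.org/part/2", "12-ab-34")

print(index.search("12-AB-34"))  # ['http://example.org/part/1']
print(index.search("12-ab-34"))  # ['http://example.org/part/2']
print(index.search("12-Ab-34"))  # []
```

A real test case would load equivalent data into the text index and assert the same three results from text queries; getting both parts back for any of these searches would be the case-insensitive symptom.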

Regards.
--Paul




Re: text query analyzer problem

Posted by Osma Suominen <os...@helsinki.fi>.

Hi Paul!

Can you provide an example query, some data, and the results you got 
instead of what you were expecting?

I've been testing different analyzer configurations (though not yet in
production) and I think I'd have noticed if the query parser had
used a different Analyzer than the indexer.

-Osma


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi