Posted to solr-user@lucene.apache.org by Rajani Maski <ra...@gmail.com> on 2012/03/30 10:01:10 UTC

Trouble handling Unit symbol

Hi,

Our data contains unit symbols such as: µ

Indexed data contains:  Dose:"0 µL"
Language type: "English"

When we search for Dose:"0 µL", the number of documents matched is 0.

Observed q parameter value: <str name="q">S257:"0 µL/injection"</str>

*Is there any way to handle such cases?*

Thanks & Regards,
Rajani

Re: Trouble handling Unit symbol

Posted by Rajani Maski <ra...@gmail.com>.
Fine. Thank you. I will look at it.


On Fri, Apr 13, 2012 at 5:21 PM, Erick Erickson <er...@gmail.com> wrote:

> Please review:
> http://wiki.apache.org/solr/UsingMailingLists
>
> Especially the bit about adding &debugQuery=on
> and showing the results. You're asking people
> to guess at solutions without providing much
> in the way of context.
>
> You might try looking at your index with Luke to
> see what's actually in your index, or perhaps
> TermsComponent
>
>
> Best
> Erick

Re: Trouble handling Unit symbol

Posted by Erick Erickson <er...@gmail.com>.
Please review:
http://wiki.apache.org/solr/UsingMailingLists

Especially the bit about adding &debugQuery=on
and showing the results. You're asking people
to guess at solutions without providing much
in the way of context.

You might try looking at your index with Luke to
see what's actually in your index, or perhaps
TermsComponent
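
As a rough sketch of the two requests suggested above, the snippet below builds a /select URL with debugQuery=on and a TermsComponent URL for one field. The base URL, the helper names, and the presence of a /terms handler (as in the example solrconfig.xml) are assumptions for illustration, not details taken from this thread.

```python
from urllib.parse import urlencode

# Assumed base URL; host, port, and path depend on the installation.
SOLR_BASE = "http://localhost:8983/solr"


def debug_query_url(query, base=SOLR_BASE):
    """Build a /select URL that asks Solr to explain how it parsed the query."""
    params = {"q": query, "debugQuery": "on", "echoParams": "all"}
    # urlencode percent-encodes non-ASCII characters as UTF-8 bytes.
    return f"{base}/select?{urlencode(params)}"


def terms_url(field, base=SOLR_BASE, limit=20):
    """Build a TermsComponent URL listing indexed terms for one field.

    Assumes a /terms request handler is configured.
    """
    params = {"terms.fl": field, "terms.limit": limit}
    return f"{base}/terms?{urlencode(params)}"


if __name__ == "__main__":
    print(debug_query_url("BODY:\u00b5"))
    print(terms_url("BODY"))
```

Comparing the parsed query in the debug output with the terms actually listed for the field should show whether the mismatch comes from encoding or from analysis.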


Best
Erick

On Fri, Apr 13, 2012 at 2:29 AM, Rajani Maski <ra...@gmail.com> wrote:
> Hi All,
>
>   I tried indexing with UTF-8 encoding, but the issue is still not fixed.
> Please see my inputs below.
>
> *Indexed XML:*
> <?xml version="1.0" encoding="UTF-8" ?>
> <add>
>  <doc>
>    <field name="ID">0.1000000</field>
>    <field name="BODY">µ</field>
>  </doc>
> </add>
>
> *Search query:* BODY:µ
>
> numFound: 0 (no results).
>
> *What could be the reason for this? How should I write the search query so
> that the above document is found?*
>
>
> Thanks & Regards,
> Rajani

Re: Trouble handling Unit symbol

Posted by Rajani Maski <ra...@gmail.com>.
Hi All,

   I tried indexing with UTF-8 encoding, but the issue is still not fixed.
Please see my inputs below.

*Indexed XML:*
<?xml version="1.0" encoding="UTF-8" ?>
<add>
  <doc>
    <field name="ID">0.1000000</field>
    <field name="BODY">µ</field>
  </doc>
</add>

*Search query:* BODY:µ

numFound: 0 (no results).

*What could be the reason for this? How should I write the search query so
that the above document is found?*


Thanks & Regards,
Rajani
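
One way to rule out the indexing side is to check the bytes of the update message before it is sent. The sketch below rebuilds the XML shown above and confirms that the micro sign is serialized as the UTF-8 bytes 0xC2 0xB5. The helper name and the Content-Type header are illustrative assumptions, and the HTTP POST to Solr is deliberately left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc_id, body):
    """Serialize a one-document Solr <add> message as UTF-8 bytes."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="ID").text = doc_id
    ET.SubElement(doc, "field", name="BODY").text = body
    # With an explicit byte encoding, tostring() returns bytes and
    # prepends an XML declaration naming that encoding.
    return ET.tostring(add, encoding="UTF-8")


payload = build_add_xml("0.1000000", "\u00b5")

# Assumed header for the update request: declare the charset explicitly.
headers = {"Content-Type": "text/xml; charset=utf-8"}

if __name__ == "__main__":
    print(payload)
    print(b"\xc2\xb5" in payload)
```

If the payload bytes are correct, the remaining suspects are the query side (URL encoding and the container's URI charset) and the field's analysis chain.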



2012/4/2 Rajani Maski <ra...@gmail.com>

> Thank you for the reply.

Re: Trouble handling Unit symbol

Posted by Rajani Maski <ra...@gmail.com>.
Thank you for the reply.



On Sat, Mar 31, 2012 at 3:38 AM, Chris Hostetter <ho...@fucit.org> wrote:

>
> : We have data having such symbols like :  µ
> : Indexed data has  -    Dose:"0 µL"
> : Now , when  it is searched as  - Dose:"0 µL"
>        ...
> : Query Q value observed  : <str name="q">S257:"0 ÂµL/injection"</str>
>
> First off: your "when searched as" example does not match your
> "Query Q" observed value (i.e., a different field is queried and there is
> extra "/injection" text at the end), suggesting that you may have cut and
> pasted something you didn't mean to -- so take the rest of this advice with
> a grain of salt.
>
> If I ignore your "when it is searched as" example and focus entirely on
> what you say you've indexed the data as, and the q value you are using (in
> what looks like the echoParams output), then the first thing that jumps out
> at me is that your servlet container (or perhaps your web browser, if
> that's where you tested this) does not appear to be dealing with the Unicode
> correctly -- because although I see a "µ" in the first three lines I
> quoted above (UTF-8: 0xC2 0xB5), in your observed value I'm seeing it
> preceded by a "Â" (UTF-8: 0xC3 0x82) ... suggesting that perhaps the "µ"
> did not get URL-encoded properly when the request was made to your servlet
> container?
>
> In particular, you might want to take a look at...
>
>
> https://wiki.apache.org/solr/FAQ#Why_don.27t_International_Characters_Work.3F
> http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config
> The example/exampledocs/test_utf8.sh script included with solr
>
>
>
>
> -Hoss

Re: Trouble handling Unit symbol

Posted by Chris Hostetter <ho...@fucit.org>.
: We have data having such symbols like :  µ
: Indexed data has  -    Dose:"0 µL"
: Now , when  it is searched as  - Dose:"0 µL"
	...
: Query Q value observed  : <str name="q">S257:"0 ÂµL/injection"</str>

First off: your "when searched as" example does not match your
"Query Q" observed value (i.e., a different field is queried and there is
extra "/injection" text at the end), suggesting that you may have cut and
pasted something you didn't mean to -- so take the rest of this advice with
a grain of salt.

If I ignore your "when it is searched as" example and focus entirely on
what you say you've indexed the data as, and the q value you are using (in
what looks like the echoParams output), then the first thing that jumps out
at me is that your servlet container (or perhaps your web browser, if
that's where you tested this) does not appear to be dealing with the Unicode
correctly -- because although I see a "µ" in the first three lines I
quoted above (UTF-8: 0xC2 0xB5), in your observed value I'm seeing it
preceded by a "Â" (UTF-8: 0xC3 0x82) ... suggesting that perhaps the "µ"
did not get URL-encoded properly when the request was made to your servlet
container?

In particular, you might want to take a look at...

https://wiki.apache.org/solr/FAQ#Why_don.27t_International_Characters_Work.3F
http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config
The example/exampledocs/test_utf8.sh script included with solr
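
To make the byte values above concrete, here is a small sketch that reproduces the symptom: the UTF-8 bytes of the micro sign, read back as ISO-8859-1, come out as "Â" followed by "µ". The helper names are made up for illustration, and ISO-8859-1 is only an assumed default for the misbehaving component.

```python
from urllib.parse import quote

MICRO = "\u00b5"  # MICRO SIGN


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex values."""
    return " ".join(f"0x{b:02X}" for b in text.encode("utf-8"))


def misdecode_as_latin1(text):
    """Simulate a component that reads UTF-8 request bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


if __name__ == "__main__":
    print(utf8_hex(MICRO))                        # 0xC2 0xB5
    print(misdecode_as_latin1("0 " + MICRO + "L"))  # 0 ÂµL
    print(utf8_hex("\u00c2"))                     # 0xC3 0x82
    # Percent-encoding the UTF-8 bytes is what a correct client would send.
    print(quote("0 " + MICRO + "L"))              # 0%20%C2%B5L
```

If the q value echoed back by Solr contains the extra "Â", the request bytes were most likely decoded with the wrong charset somewhere between the client and the servlet container.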




-Hoss

Re: Trouble handling Unit symbol

Posted by Paul Libbrecht <pa...@hoplahup.net>.
Rajani,

You may want to look at the analysis tools in the Solr admin UI, or even Luke, to help you.

paul


On 30 March 2012 at 10:01, Rajani Maski wrote:

> Hi,
> 
> Our data contains unit symbols such as: µ
> 
> Indexed data contains:  Dose:"0 µL"
> Language type: "English"
> 
> When we search for Dose:"0 µL", the number of documents matched is 0.
> 
> Observed q parameter value: <str name="q">S257:"0 µL/injection"</str>
> 
> *Is there any way to handle such cases?*
> 
> Thanks & Regards,
> Rajani