Posted to solr-user@lucene.apache.org by Christian Klinger <ck...@novareto.de> on 2007/09/06 11:54:37 UTC

solr.py problems with german "Umlaute"

Hi all,

I am trying to add/update documents with
the Python solr.py API.

Everything works fine so far,
but when I try to add a document that contains
German umlauts (ö, ä, ü, ...) I get errors.

Maybe someone has an idea how I could convert
my data?
Should I post this to JIRA?

Thanks for your help.

Btw: I have no sitecustomize.py.

This is my script:
------------------------------------------------------
from solr import *
title="Übersicht"
kw = {'id':'12','title':title,'system':'plone','url':'http://www.google.de'}
c = SolrConnection('http://192.168.2.13:8080/solr')
c.add_many([kw,])
c.commit()
------------------------------------------------------

This is the error:

   File "t.py", line 5, in ?
     c.add_many([kw,])
   File "/usr/local/lib/python2.4/site-packages/solr.py", line 596, in add_many
     self.__add(lst, doc)
   File "/usr/local/lib/python2.4/site-packages/solr.py", line 710, in __add
     lst.append('<field name=%s>%s</field>' % (
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

RE: solr.py problems with german "Umlaute"

Posted by Lance Norskog <go...@gmail.com>.

I have researched this problem before. What I found is that Python strings
are not Unicode by default; you have to convert them to Unicode explicitly.
Here are the links I found:

http://www.reportlab.com/i18n/python_unicode_tutorial.html
 
http://evanjones.ca/python-utf8.html
 
http://jjinux.blogspot.com/2006/04/python-protecting-utf-8-strings-from.html


We encode to UTF-8 and submit, and our strings end up badly encoded when
stored. We are seeing the problem described under "Marc-Andre Lemburg" in the
reportlab.com link: an accented e becomes some Japanese character.
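
The error from the original report can be reproduced without solr.py. This is a minimal sketch, assuming the cause is the implicit ASCII decode that Python 2 performs when a byte string is mixed with a unicode string (as presumably happens in the '%' formatting shown in the traceback). The bytes are written out explicitly so that the snippet does not depend on any editor or terminal encoding, and it uses syntax from Python 2.6 onward rather than the 2.4 in the report.

```python
# UTF-8 bytes for "Übersicht", spelled out so no editor encoding is involved.
raw_title = b'\xc3\x9cbersicht'

# Decoding with the ASCII codec fails on the first byte (0xc3),
# which is the same failure the traceback reports.
try:
    raw_title.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Decoding with the encoding the bytes were actually written in succeeds.
title = raw_title.decode('utf-8')

print(message)
print(repr(title))
```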

Re: solr.py problems with german "Umlaute"

Posted by Mike Klaas <mi...@gmail.com>.

On 6-Sep-07, at 12:13 PM, Yonik Seeley wrote:

> On 9/6/07, Brian Carmalt <bc...@contact.de> wrote:
>> Try it with title.encode('utf-8').
>> As in: kw =
>> {'id':'12','title':title.encode('utf-8'),'system':'plone','url':'http://www.google.de'}
>
> It seems like the client library should be responsible for encoding,
> not the user.
> So try changing
> title="Übersicht"
>   into a unicode string via
> title=u"Übersicht"
>
> And that should hopefully get your test program working.
> If it doesn't it's probably a solr.py bug and should be fixed there.

It may or may not, depending on the vagaries of the encoding in his  
text editor.

What python gets when you enter u'é' is the byte sequence  
corresponding to the encoding of your editor.  For instance, my  
terminal is set to utf-8 and when I type in é it is equivalent to  
entering the bytes C3 A9:

In [5]: 'é'
Out[5]: '\xc3\xa9'

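Prepending u does not help here, because you are then telling Python to
treat each of these two bytes as a separate unicode character.  Note that
this could be fixed by setting Python's default encoding to match.  In a
source file, the corresponding fix is a coding declaration (PEP 263) that
matches the encoding your editor saves in.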

In [1]: u'é'
Out[1]: u'\xc3\xa9'
In [11]: print u'é'
Ã©

The proper thing to do is to interpret the byte sequence given the  
proper encoding:

In [3]: 'é'.decode('utf-8')
Out[3]: u'\xe9'

or enter the desired unicode character directly:

 >>> u'\u00e9'
u'\xe9'
 >>> print u'\u00e9'
é
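
The two outcomes above can be compared side by side. This is only a sketch: the bytes C3 A9 are spelled out as a bytes literal instead of being typed as é, so the result does not depend on the terminal, and the latin-1 decode stands in for "treat each byte as one character".

```python
# The two bytes a UTF-8 terminal sends for the character é.
raw = b'\xc3\xa9'

# Treating each byte as its own character gives two code points,
# U+00C3 and U+00A9, which display as the mojibake pair shown above.
wrong = raw.decode('latin-1')

# Interpreting the same bytes as UTF-8 gives the single intended
# code point U+00E9.
right = raw.decode('utf-8')

print(repr(wrong), len(wrong))
print(repr(right), len(right))
```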

This is less complicated in the usual case of reading data from a  
file, because the encoding should be known (terminal encoding issues  
are much trickier).  Use codecs.open() to get a unicode-output text  
stream.
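
A minimal sketch of the codecs.open() approach, assuming the file is known to be UTF-8. The temporary file and its contents are made up for illustration, and the snippet uses the with statement, so it needs a newer Python than the 2.4 in the original report.

```python
import codecs
import os
import tempfile

# Create a throwaway file holding the UTF-8 bytes for "Übersicht".
fd, path = tempfile.mkstemp(suffix='.txt')
try:
    with os.fdopen(fd, 'wb') as raw:
        raw.write(b'\xc3\x9cbersicht\n')

    # codecs.open() decodes while reading, so the caller only ever
    # sees unicode text, never the raw bytes.
    with codecs.open(path, 'r', encoding='utf-8') as stream:
        title = stream.readline().rstrip(u'\n')
finally:
    os.remove(path)

print(repr(title))
```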

-Mike

Re: solr.py problems with german "Umlaute"

Posted by Yonik Seeley <yo...@apache.org>.

On 9/6/07, Brian Carmalt <bc...@contact.de> wrote:
> Try it with title.encode('utf-8').
> As in: kw =
> {'id':'12','title':title.encode('utf-8'),'system':'plone','url':'http://www.google.de'}

It seems like the client library should be responsible for encoding,
not the user.
So try changing
title="Übersicht"
  into a unicode string via
title=u"Übersicht"

And that should hopefully get your test program working.
If it doesn't it's probably a solr.py bug and should be fixed there.
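
To illustrate what "the client library handles encoding" could look like, here is a rough sketch. It is not solr.py's actual code: to_text and field_xml are hypothetical helper names, and the assumption that incoming byte strings are UTF-8 is mine. The idea is to work in unicode internally and encode exactly once, when the request body is built.

```python
from xml.sax.saxutils import escape, quoteattr


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return unicode text, decoding byte strings.

    Assumes byte strings are encoded as `encoding` (UTF-8 by default).
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def field_xml(name, value):
    """Hypothetical helper: build one <field> element as unicode text."""
    return u'<field name=%s>%s</field>' % (
        quoteattr(to_text(name)), escape(to_text(value)))


# The title arrives as UTF-8 bytes, as in the original report.
doc = {'id': '12', 'title': b'\xc3\x9cbersicht'}

# Build the body as unicode, then encode once at the boundary.
body = u''.join(field_xml(k, v) for k, v in sorted(doc.items()))
payload = body.encode('utf-8')

print(repr(payload))
```

With this arrangement the caller may pass either unicode or UTF-8 byte strings, and the implicit ASCII decode never happens because nothing is mixed before the explicit decode.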

-Yonik

Re: solr.py problems with german "Umlaute"

Posted by Brian Carmalt <bc...@contact.de>.

Hello Christian,

Try it with title.encode('utf-8').
As in: kw = 
{'id':'12','title':title.encode('utf-8'),'system':'plone','url':'http://www.google.de'}

