Posted to solr-user@lucene.apache.org by Christian Klinger <ck...@novareto.de> on 2007/09/06 11:54:37 UTC
solr.py problems with german "Umlaute"
Hi all,
I am trying to add/update documents with
the Python solr.py API.
Everything works fine so far,
but when I try to add a document that contains
German umlauts (ö, ä, ü, ...) I get errors.
Maybe someone has an idea how I could convert
my data?
Should I post this to JIRA?
Thanks for your help.
Btw: I have no sitecustomize.py.
This is my script:
------------------------------------------------------
from solr import *
title="Übersicht"
kw = {'id':'12','title':title,'system':'plone','url':'http://www.google.de'}
c = SolrConnection('http://192.168.2.13:8080/solr')
c.add_many([kw,])
c.commit()
------------------------------------------------------
This is the error:
  File "t.py", line 5, in ?
    c.add_many([kw,])
  File "/usr/local/lib/python2.4/site-packages/solr.py", line 596, in add_many
    self.__add(lst, doc)
  File "/usr/local/lib/python2.4/site-packages/solr.py", line 710, in __add
    lst.append('<field name=%s>%s</field>' % (
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
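The last frame of the traceback can be reproduced in isolation. This is a minimal sketch, assuming the failure comes from Python 2 implicitly decoding the UTF-8 byte string with the ASCII codec when it is combined with a unicode string inside solr.py (the traceback suggests this but does not show the whole statement). It is written for Python 3, where that decode step has to be spelled out:

```python
# "Übersicht" as the bytes a UTF-8 editor would save for the literal.
raw_title = u"Übersicht".encode("utf-8")

try:
    # Python 2 performs this decode implicitly when str and unicode are mixed.
    raw_title.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The first byte of the UTF-8 form of "Ü" is 0xc3, which is why the error points at position 0.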
RE: solr.py problems with german "Umlaute"
Posted by Lance Norskog <go...@gmail.com>.
I have looked into this problem before. What I found is that Python strings
are byte strings by default, not Unicode. You have to convert them to Unicode explicitly.
Here are the links I found:
http://www.reportlab.com/i18n/python_unicode_tutorial.html
http://evanjones.ca/python-utf8.html
http://jjinux.blogspot.com/2006/04/python-protecting-utf-8-strings-from.html
We do the UTF-8 encode-and-submit, and our strings end up badly encoded when
stored. We are seeing the problem described under "Marc-Andre Lemburg" in the
reportlab.com link: an e with an acute accent becomes some Japanese character.
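One plausible way an accented e can turn into Japanese characters is sketched below. It assumes that UTF-8 bytes are read back with a Japanese codec somewhere along the way; Shift_JIS is picked purely for illustration, since the actual setup is not described in this thread:

```python
original = u"\u00e9"                    # é
utf8_bytes = original.encode("utf-8")   # the two bytes C3 A9

# Misreading the two UTF-8 bytes as Shift_JIS yields two half-width katakana.
garbled = utf8_bytes.decode("shift_jis")

# As long as nothing was lost, reversing the wrong step recovers the text.
repaired = garbled.encode("shift_jis").decode("utf-8")

print(repr(garbled), repr(repaired))
```

If the garbled form is encoded to UTF-8 again before being stored, the damage is baked into the index and has to be undone in the same order.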
Re: solr.py problems with german "Umlaute"
Posted by Mike Klaas <mi...@gmail.com>.
On 6-Sep-07, at 12:13 PM, Yonik Seeley wrote:
> On 9/6/07, Brian Carmalt <bc...@contact.de> wrote:
>> Try it with title.encode('utf-8').
>> As in: kw = {'id':'12','title':title.encode('utf-8'),'system':'plone','url':'http://www.google.de'}
>
> It seems like the client library should be responsible for encoding,
> not the user.
> So try changing
> title="Übersicht"
> into a unicode string via
> title=u"Übersicht"
>
> And that should hopefully get your test program working.
> If it doesn't it's probably a solr.py bug and should be fixed there.
It may or may not, depending on the vagaries of the encoding in his
text editor.
What python gets when you enter u'é' is the byte sequence
corresponding to the encoding of your editor. For instance, my
terminal is set to utf-8 and when I type in é it is equivalent to
entering the bytes C3 A9:
In [5]: 'é'
Out[5]: '\xc3\xa9'
Prepending u does not work, because you are telling Python to treat
these two bytes as two separate Unicode characters. Note that this could be
fixed by setting Python's default encoding to match.
In [1]: u'é'
Out[1]: u'\xc3\xa9'
In [11]: print u'é'
é
The proper thing to do is to decode the byte sequence using the
correct encoding:
In [3]: 'é'.decode('utf-8')
Out[3]: u'\xe9'
or enter the desired unicode character directly:
>>> u'\u00e9'
u'\xe9'
>>> print u'\u00e9'
é
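The same distinction, restated for Python 3 as a minimal sketch. Latin-1 is used here only to model "one character per byte", which is what the u'\xc3\xa9' result above amounts to:

```python
raw = b"\xc3\xa9"                 # the bytes a UTF-8 terminal sends for é

per_byte = raw.decode("latin-1")  # each byte becomes its own character
decoded = raw.decode("utf-8")     # the two bytes are read as one character

assert per_byte == u"\xc3\xa9" and len(per_byte) == 2
assert decoded == u"\xe9" and len(decoded) == 1
assert decoded == u"\u00e9"
```

Only the second form is the single character U+00E9 that should reach Solr.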
This is less complicated in the usual case of reading data from a
file, because the encoding should be known (terminal encoding issues
are much trickier). Use codecs.open() to get a unicode-output text
stream.
-Mike
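Reading from a file with codecs.open() might look like the sketch below. The helper name read_titles and the sample file contents are made up for illustration, and the file is assumed to be UTF-8. On current Python versions the built-in open() with an encoding argument does the same job:

```python
import codecs
import os
import tempfile


def read_titles(path, encoding="utf-8"):
    """Return the non-empty lines of a text file as decoded (unicode) strings."""
    stream = codecs.open(path, "r", encoding=encoding)
    try:
        return [line.strip() for line in stream if line.strip()]
    finally:
        stream.close()


# Write a small UTF-8 sample file, then read it back as unicode text.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "wb") as out:
    out.write(u"Übersicht\nÄpfel\n".encode("utf-8"))

titles = read_titles(path)
os.remove(path)
print(titles)
```

Because the stream already yields unicode strings, no manual decode() call is needed before handing the values to the client library.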
Re: solr.py problems with german "Umlaute"
Posted by Yonik Seeley <yo...@apache.org>.
On 9/6/07, Brian Carmalt <bc...@contact.de> wrote:
> Try it with title.encode('utf-8').
> As in: kw =
> {'id':'12','title':title.encode('utf-8'),'system':'plone','url':'http://www.google.de'}
It seems like the client library should be responsible for encoding,
not the user.
So try changing
title="Übersicht"
into a unicode string via
title=u"Übersicht"
And that should hopefully get your test program working.
If it doesn't it's probably a solr.py bug and should be fixed there.
-Yonik
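A rough sketch of what encoding inside the client library could look like. The helper names to_text and field_xml are hypothetical and are not part of solr.py. The sketch is written for Python 3 and assumes that any byte strings passed in are UTF-8:

```python
from xml.sax.saxutils import escape, quoteattr


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings with the given encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


def field_xml(name, value):
    """Build one <field> element as text; encode only when sending."""
    return "<field name=%s>%s</field>" % (
        quoteattr(to_text(name)),
        escape(to_text(value)),
    )


# Callers may pass bytes or text; the wire format is encoded exactly once.
xml_text = field_xml("title", u"Übersicht".encode("utf-8"))
payload = xml_text.encode("utf-8")
print(xml_text)
```

Keeping everything as text until the final encode() means the caller never has to decide which strings need converting.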
Re: solr.py problems with german "Umlaute"
Posted by Brian Carmalt <bc...@contact.de>.
Hello Christian,
Try it with title.encode('utf-8').
As in: kw =
{'id':'12','title':title.encode('utf-8'),'system':'plone','url':'http://www.google.de'}