Posted to solr-user@lucene.apache.org by Walid ABDELKABIR <ab...@gmail.com> on 2009/03/18 12:46:20 UTC

solrj : probleme with utf-8 content

When I execute this code, the field "includes" ends up in my index with this
value: "????? ???? ????????????? ?????":
---------------------------
String content ="eaiou with circumflexes: êâîôû";
SolrInputDocument doc = new SolrInputDocument();
doc.addField( "id", "123", 1.0f );
doc.addField( "includes", content, 1.0f );
server.add( doc );
---------------------------

But this code works fine:

-------------------------------
// Inner double quotes must be escaped inside a Java string literal.
String addContent = "<add><doc boost=\"1.0\">"
        + "<field name=\"id\">123</field>"
        + "<field name=\"includes\">eaiou with circumflexes:âîôû</field>"
        + "</doc></add>";
DirectXmlRequest up = new DirectXmlRequest( "/update", addContent );
server.request( up );
-------------------------------
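
For comparison, a hand-built update body like the one above can be sketched without solrj at all. This is a minimal sketch, not solrj's implementation: the class `AddXmlSketch` and its methods `escapeXml` and `buildAddXml` are hypothetical names introduced for illustration. It escapes the field text, assembles the `<add>` document, and encodes it explicitly as UTF-8 so that the bytes do not depend on the platform default charset.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of solrj: builds an <add> body by hand.
class AddXmlSketch {

    // Escape the characters that are special in XML text and attribute values.
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Assemble the same document shape as the hand-written string in the post.
    static String buildAddXml(String id, String includes) {
        return "<add><doc boost=\"1.0\">"
                + "<field name=\"id\">" + escapeXml(id) + "</field>"
                + "<field name=\"includes\">" + escapeXml(includes) + "</field>"
                + "</doc></add>";
    }

    public static void main(String[] args) {
        // Unicode escapes keep this literal independent of the source file encoding.
        String xml = buildAddXml("123",
                "eaiou with circumflexes: \u00ea\u00e2\u00ee\u00f4\u00fb");

        // Encode explicitly instead of relying on the platform default charset.
        byte[] body = xml.getBytes(StandardCharsets.UTF_8);

        // Decoding with the same charset must give back the original string.
        String roundTrip = new String(body, StandardCharsets.UTF_8);
        System.out.println("round trip intact: " + roundTrip.equals(xml));
        System.out.println("chars: " + xml.length() + ", bytes: " + body.length);
    }
}
```

The byte count is larger than the character count because each accented character takes two bytes in UTF-8.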

Thanks for your help.

Re: solrj : probleme with utf-8 content

Posted by Pascal Dimassimo <th...@hotmail.com>.
Hi,

I have that problem too, but I noticed that it only happens if I send my data
via solrj. If I send it via the solr-ruby gem, everything is fine
(http://wiki.apache.org/solr/solr-ruby).

Here is my jruby script:
-------------------------------
require 'rubygems'

require 'solr'
require 'rexml/document'

include Java

def send_via_solrj(text, url)
  doc = org.apache.solr.common.SolrInputDocument.new
  doc.addField('id', '1')
  doc.addField('text', text)

  server = org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.new(url)
  server.add(doc);
  server.commit();
end

def send_via_gem(text, url)
  solr_doc = Solr::Document.new
  solr_doc['id'] = '2'
  solr_doc['text'] = text

  options = {
    :autocommit => :on
  }

  conn = Solr::Connection.new(url, options)
  conn.add(solr_doc)
end

host = 'localhost'
port = '8888'
path = '/solr/core0'
url = "http://#{host}:#{port}#{path}"

text = "eaiou with circumflexes: êâîôû"

send_via_solrj(text, url)
send_via_gem(text, url)

puts "done!"
-------------------------------

If I watch the HTTP messages with tcpmon, I see that the data sent via solrj
is encoded in cp1252, while the data sent via the gem is UTF-8.

Does anyone have an idea of how we can configure solrj to send in UTF-8?
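
The difference seen in tcpmon can be reproduced in isolation. The sketch below is only an illustration, not solrj code: it encodes the accented characters once as UTF-8 and once in a single-byte Latin charset, then decodes the single-byte output as UTF-8, which is what a receiver expecting UTF-8 would do. ISO-8859-1 is used as a stand-in for cp1252 here (an assumption made for portability: every JVM must support it, and both charsets map these five characters to the same bytes). The class name `WireEncodingDemo` is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares the bytes produced by two encodings.
class WireEncodingDemo {

    // Render bytes as space-separated lowercase hex, e.g. "c3 aa".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The five accented characters from the post, written as escapes.
        String accents = "\u00ea\u00e2\u00ee\u00f4\u00fb";

        byte[] utf8 = accents.getBytes(StandardCharsets.UTF_8);
        byte[] latin = accents.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 bytes:       " + hex(utf8));
        System.out.println("single-byte bytes: " + hex(latin));

        // A receiver that assumes UTF-8 cannot decode the single-byte form:
        // the invalid sequences are replaced with U+FFFD.
        String misread = new String(latin, StandardCharsets.UTF_8);
        System.out.println("decoded correctly: " + misread.equals(accents));
        System.out.println("contains U+FFFD:   " + (misread.indexOf('\uFFFD') >= 0));
    }
}
```

If the client and the server disagree about the charset in this way, the text is already damaged before it reaches the index.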

Thanks in advance.



-- 
View this message in context: http://www.nabble.com/solrj-%3A-probleme-with-utf-8-content-tp22577377p22620317.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solrj : probleme with utf-8 content

Posted by Pascal Dimassimo <th...@hotmail.com>.
Yes, now it works fine with the trunk sources.

thanks!


Noble Paul നോബിള്‍  नोब्ळ् wrote:
> 
> SOLR-973 seems to have caused the problem



Re: solrj : probleme with utf-8 content

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
SOLR-973 seems to have caused the problem

On Fri, Mar 20, 2009 at 11:01 PM, Ryan McKinley <ry...@gmail.com> wrote:
> do you know if your java file is encoded with utf-8?
>
> sometimes it will be encoded as something different and that can cause funny
> problems..



-- 
--Noble Paul

Re: solrj : probleme with utf-8 content

Posted by Ryan McKinley <ry...@gmail.com>.
Do you know if your Java file is encoded with UTF-8?

Sometimes it will be encoded as something different, and that can cause
funny problems.
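
One way to rule out the source-file encoding is to write the non-ASCII characters as Unicode escapes, which the compiler reads the same way however the file was saved (passing `-encoding UTF-8` to `javac` is another option). The sketch below, with a hypothetical class name and helper methods, prints code points rather than the characters themselves so that the console encoding does not matter. It also simulates what a compiler would see if a UTF-8 file were read as a single-byte charset, with ISO-8859-1 standing in for cp1252 as an assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical check class: inspects what a string literal really contains.
class SourceEncodingCheck {

    // Simulate reading UTF-8 encoded source text with a single-byte charset.
    static String misreadUtf8AsLatin(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // List the code points of a string, e.g. "U+00EA U+00E2".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Escapes are resolved by the compiler independently of the file encoding.
        String content = "\u00ea\u00e2\u00ee\u00f4\u00fb";

        System.out.println("intended: " + codePoints(content));
        // Each accented character turns into two unrelated characters.
        System.out.println("misread:  " + codePoints(misreadUtf8AsLatin(content)));
    }
}
```

Printing the code points of the string just before `doc.addField(...)` in the real program would show whether the literal was already wrong before solrj handled it.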

