Posted to solr-user@lucene.apache.org by ahmed baseet <ah...@gmail.com> on 2009/04/29 14:01:37 UTC

Problem adding unicoded docs to Solr through SolrJ

Hi All,
I'm trying to automate the process of posting XML documents to Solr
using SolrJ. Essentially I'm extracting the text from a given URL, then
creating a SolrInputDocument and posting it using the following function,

public void postToSolrUsingSolrj(String rawText, String pageId) {
    String url = "http://localhost:8983/solr";
    CommonsHttpSolrServer server;

    try {
        // Get a connection to the Solr server
        server = new CommonsHttpSolrServer(url);

        // Set the XMLResponseParser: required to talk to older
        // Solr versions such as 1.3
        server.setParser(new XMLResponseParser());

        server.setSoTimeout(1000);            // socket read timeout
        server.setConnectionTimeout(100);
        server.setDefaultMaxConnectionsPerHost(100);
        server.setMaxTotalConnections(100);
        server.setFollowRedirects(false);     // defaults to false
        // allowCompression defaults to false.
        // The server side must support gzip or deflate for this
        // to have any effect.
        server.setAllowCompression(true);
        server.setMaxRetries(1);              // defaults to 0; > 1 not recommended

        // WARNING: this would delete the entire pre-existing Solr index
        //server.deleteByQuery( "*:*" );      // delete everything!

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", pageId);
        doc.addField("features", rawText);

        // Add the document to the Solr server
        server.add(doc);

        // Commit the changes
        server.commit();

    } catch (Exception e) {
        // Don't swallow failures silently; at least log them
        e.printStackTrace();
    }
}

In the above, the param rawText is just the HTML stripped of all its
tags, JS, CSS, etc., and pageId is the URL of that page. When I'm using
this for English pages it works perfectly fine, but the problem comes up
when I try to index some non-English pages. For those, say pages in
Tamil, the Unicode/UTF-8 encoding seems to create a problem: after
indexing some non-English pages, when I search for them from the Solr
admin search interface, I get the result, but the content is not shown
in that language, i.e. Tamil; it just displays some characters, I think
raw Unicode. The same thing worked fine for pages in English.

Now what I did was just extract the raw text from that HTML page and
manually create an XML document like this:

<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="id">UTF2TEST</field>
    <field name="name">Test with some UTF-8 encoded characters</field>
    <field name="features">*some tamil unicode text here*</field>
   </doc>
</add>

and posted it from the command line using the post.jar file. Now
searching gives me the result, and unlike last time the browser shows
the indexed text in Tamil itself, not the raw Unicode. So this clearly
shows that the string I'm using to create the SolrInputDocument has some
encoding issue, right? Or something else? I also tried doing something
like this,

// Encode in Unicode UTF-8
 utfEncodedText = new String(rawText.getBytes("UTF-8"));

but even this didn't help either.
It seems to be some silly problem somewhere which I'm not able to catch. :-)

I'd appreciate it if someone can point me to the bug...

Thanks,
Ahmed.

Re: Problem adding unicoded docs to Solr through SolrJ

Posted by Gunnar Wagenknecht <gu...@wagenknecht.org>.
ahmed baseet wrote:
> I first converted the whole string to
> byte array and then used that byte array to create a new utf-8 encoded sting
> like this,

I'm not sure that this is required at all. Java strings have the same
internal representation no matter what they were created from. Thus,
the code snippet you posted is wrong.

> // Encode in Unicode UTF-8
>                 byte [] utfEncodeByteArray = textOnly.getBytes();
>                 String utfString = new String(utfEncodeByteArray,
> Charset.forName("UTF-8"));

Especially the expression "textOnly.getBytes()" is wrong. It converts
the String to a sequence of bytes using the JVM's default encoding. Then
you convert those bytes back to a string using the UTF-8 encoding.

You have to check carefully *how* the string "textOnly" is created in
the first place. That's where your UTF-8 issues might come from.
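
For example, if "textOnly" is read from an HTTP response, the bytes
should be decoded with the charset the server declares rather than the
platform default. A minimal sketch, assuming a hypothetical URL and a
UTF-8 fallback (the actual fetching code wasn't posted):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class FetchWithCharset {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL; substitute the page you are crawling.
        URLConnection conn = new URL("http://localhost/page.html").openConnection();

        // Take the charset from the Content-Type header, e.g.
        // "text/html; charset=ISO-8859-1"; fall back to UTF-8 (an assumption).
        String charset = "UTF-8";
        String contentType = conn.getContentType();
        if (contentType != null) {
            int i = contentType.toLowerCase().indexOf("charset=");
            if (i != -1) {
                charset = contentType.substring(i + "charset=".length()).trim();
            }
        }

        // Decode the bytes explicitly; from here on we deal in characters only.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), charset));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            sb.append(line).append('\n');
        }
        in.close();
        String textOnly = sb.toString();
        System.out.println(textOnly.length() + " chars decoded as " + charset);
    }
}

The naive "charset=" parsing above is good enough for a sketch; a real
crawler would use an HTTP client library that handles the header properly.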

-Gunnar

-- 
Gunnar Wagenknecht
gunnar@wagenknecht.org
http://wagenknecht.org/


Re: Problem adding unicoded docs to Solr through SolrJ

Posted by Michael Ludwig <ml...@as-guides.com>.
ahmed baseet wrote:

> I tried something stupid but working though. I first converted the
> whole string to a byte array and then used that byte array to create
> a new UTF-8 encoded string like this,
>
> // Encode in Unicode UTF-8
>                 byte [] utfEncodeByteArray = textOnly.getBytes();

This yields a sequence of bytes using the platform's default charset,
which may not be UTF-8. Check:

* String#getBytes()
* String#getBytes(String charsetName)

>                 String utfString = new String(utfEncodeByteArray,
> Charset.forName("UTF-8"));

Note that strings in Java are always internally encoded in UTF-16, so it
doesn't make much sense to call it utfString, especially if you think
that it is encoded in UTF-8, which it is not.

The above operation is only guaranteed to succeed without losing data
(lost characters come out as ? in the output) when the sequence of
bytes is valid UTF-8, i.e. in this case when your platform encoding,
which you've relied upon, is UTF-8.
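
You can make the data loss visible without relying on the platform
default at all. A minimal sketch, using Tamil characters and ISO-8859-1
standing in for a non-UTF-8 default charset (String#getBytes(Charset)
requires Java 6):

import java.nio.charset.Charset;

public class LossyRoundTrip {
    public static void main(String[] args) {
        String tamil = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"; // "தமிழ்"
        // A charset that cannot represent the characters silently loses them:
        byte[] latin1 = tamil.getBytes(Charset.forName("ISO-8859-1"));
        System.out.println(new String(latin1, Charset.forName("ISO-8859-1"))); // ?????
        // UTF-8 round-trips losslessly:
        byte[] utf8 = tamil.getBytes(Charset.forName("UTF-8"));
        System.out.println(new String(utf8, Charset.forName("UTF-8")));        // தமிழ்
    }
}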

> then passed the utfString to the function for posting to Solr and it
> works perfectly.
> But is there any intelligent way of doing all this, like going
> straight from a default-encoded string to a UTF-8 encoded string,
> without going via a byte array?

It is a feature of java.lang.String that you don't need to know the
encoding, as the string contains characters, not bytes. Only for input
and output are you concerned with encoding. So wherever you're dealing
with encodings, you're dealing with bytes.

And when dealing with bytes on the wire, you're likely concerned with
encodings, for example when the page you read via HTTP comes with a
Content-Type header specifying the encoding, or when you send documents
to the Solr indexer.

For more "intelligent" ways, you could take a look at the class
java.nio.charset.Charset and the methods encode, decode, newEncoder,
newDecoder.
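
For example, a decoder configured to report errors instead of silently
replacing characters will tell you right away when a byte sequence is
not valid UTF-8. A minimal sketch:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrictDecode {
    public static void main(String[] args) {
        // "Käse" encoded in Latin-1; the 0xE4 byte is not valid UTF-8 here.
        byte[] bytes = { 'K', (byte) 228, 's', 'e' };
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            System.out.println("valid UTF-8: " + decoder.decode(ByteBuffer.wrap(bytes)));
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e);
        }
    }
}

That way bad input fails fast instead of being replaced with ?
characters behind your back.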

Michael Ludwig

Re: Problem adding unicoded docs to Solr through SolrJ

Posted by ahmed baseet <ah...@gmail.com>.
Thanks a lot for your quick and detailed response.
I got the point. But as I've mentioned earlier, I have a string of raw
text (in the default encoding) that needs to be encoded in UTF-8, so I
tried something stupid but working though. I first converted the whole
string to a byte array and then used that byte array to create a new
UTF-8 encoded string like this,

// Encode in Unicode UTF-8
                byte [] utfEncodeByteArray = textOnly.getBytes();
                String utfString = new String(utfEncodeByteArray,
Charset.forName("UTF-8"));

then passed the utfString to the function for posting to Solr and it
works perfectly.
But is there any intelligent way of doing all this, like going straight
from a default-encoded string to a UTF-8 encoded string, without going
via a byte array?
Thank you very much.

--Ahmed.

Re: Problem adding unicoded docs to Solr through SolrJ

Posted by Michael Ludwig <ml...@as-guides.com>.
ahmed baseet wrote:

> public void postToSolrUsingSolrj(String rawText, String pageId) {

>             doc.addField("features", rawText );

> In the above, the param rawText is just the HTML stripped of all its
> tags, JS, CSS, etc., and pageId is the URL of that page. When I'm
> using this for English pages it works perfectly fine, but the problem
> comes up when I try to index some non-English pages.

Maybe you're constructing a string without specifying the encoding, so
Java uses your default platform encoding?

String(byte[] bytes)
   Constructs a new String by decoding the specified array of
   bytes using the platform's default charset.

String(byte[] bytes, Charset charset)
   Constructs a new String by decoding the specified array of bytes using
   the specified charset.

> Now what I did was just extract the raw text from that HTML page and
> manually create an XML document like this
>
> <?xml version="1.0" encoding="UTF-8"?>
> <add>
>   <doc>
>     <field name="id">UTF2TEST</field>
>     <field name="name">Test with some UTF-8 encoded characters</field>
>     <field name="features">*some tamil unicode text here*</field>
>    </doc>
> </add>
>
> and posted it from the command line using the post.jar file. Now
> searching gives me the result, and unlike last time the browser shows
> the indexed text in Tamil itself, not the raw Unicode.

Now that's perfect, isn't it?

> I also tried doing something like this,

> // Encode in Unicode UTF-8
>  utfEncodedText = new String(rawText.getBytes("UTF-8"));
>
> but even this didn't help either.

No encoding specified, so the default platform encoding is used, which
is likely not what you want. Consider the following example:

package milu;
import java.nio.charset.Charset;
public class StringAndCharset {
    public static void main(String[] args) {
        byte[] bytes = { 'K', (byte) 195, (byte) 164, 's', 'e' }; // "Käse" in UTF-8
        System.out.println(Charset.defaultCharset().displayName());
        System.out.println(new String(bytes)); // decoded with the platform default
        System.out.println(new String(bytes, Charset.forName("UTF-8")));
    }
}

Output:

windows-1252
KÃ¤se (bad)
Käse (good)

Michael Ludwig