Posted to solr-user@lucene.apache.org by prasad deshpande <pr...@gmail.com> on 2011/01/27 07:31:08 UTC

Does solr supports indexing of files other than UTF-8

Hello,


I am able to successfully index and search non-English data (such as Hebrew
and Japanese) that is encoded in UTF-8.
However, when I tried to index data in a local encoding such as Big5, I did
not get the desired results.
When I searched for all indexed documents, the contents of the Big5-encoded
document looked garbled.
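
One plausible cause of the garbling (an assumption; the thread never confirms how the files were posted) is that the Big5 bytes were decoded as if they were UTF-8 somewhere on the way into the index. A minimal Java sketch of that effect follows. The class name GarbledDecodingDemo and the sample text are hypothetical, and it assumes the JRE ships the Big5 charset.

```java
import java.nio.charset.Charset;

class GarbledDecodingDemo {
    // Decodes raw bytes using the named charset; malformed input is replaced, not rejected.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two CJK characters written as Unicode escapes so the source file encoding does not matter.
        String original = "\u4e2d\u6587";
        byte[] big5Bytes = original.getBytes(Charset.forName("Big5"));

        // Wrong: treat the Big5 bytes as UTF-8 (the bytes are not valid UTF-8, so text is lost).
        String wrong = decode(big5Bytes, "UTF-8");
        // Right: decode with the charset the bytes were actually written in.
        String right = decode(big5Bytes, "Big5");

        System.out.println("decoded as UTF-8 matches original: " + wrong.equals(original));
        System.out.println("decoded as Big5 matches original:  " + right.equals(original));
    }
}
```

If this is what happens, the fix is to declare or detect the real charset at the point where bytes become text, rather than to change anything in the analysis chain.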

Converting a complete document to UTF-8 is not feasible.
I am not clear about how Solr supports localized content in encodings other
than UTF-8.


I checked the links below:
1. http://lucene.apache.org/java/3_0_3/api/all/index.html
2. http://wiki.apache.org/solr/LanguageAnalysis

Thanks and Regards,
Prasad

Re: Does solr supports indexing of files other than UTF-8

Posted by Dennis Gearon <ge...@sbcglobal.net>.
Use the iconv library from your server-side language.

Convert the content to UTF-8, store it with a field describing what encoding
it was in, and re-encode it if you wish.

 Dennis Gearon
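
The same idea can be sketched in Java with the built-in charset API instead of iconv (a substitution, not what was suggested above). The class and method names, and the field names "content" and "original_encoding" mentioned in the comments, are hypothetical; the sketch also assumes the JRE ships the Big5 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TranscodeWithOrigin {
    // Converts bytes from a known source charset into UTF-8 bytes for indexing.
    static byte[] toUtf8(byte[] raw, String sourceCharset) {
        String text = new String(raw, Charset.forName(sourceCharset));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Re-encodes UTF-8 bytes into the charset that was recorded at index time.
    static byte[] fromUtf8(byte[] utf8, String targetCharset) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        return text.getBytes(Charset.forName(targetCharset));
    }

    public static void main(String[] args) {
        String sourceCharset = "Big5";
        byte[] raw = "\u4e2d\u6587".getBytes(Charset.forName(sourceCharset));

        // Index side: the UTF-8 bytes would go into a hypothetical "content" field and
        // the charset name into a hypothetical "original_encoding" field.
        byte[] utf8 = toUtf8(raw, sourceCharset);

        // Retrieval side: read the stored charset name and re-encode on demand.
        byte[] restored = fromUtf8(utf8, sourceCharset);
        System.out.println("round trip intact: " + Arrays.equals(raw, restored));
    }
}
```

Storing the charset name next to the text keeps the index uniformly UTF-8 while still letting a client receive the document in its local encoding.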





----- Original Message ----
From: prasad deshpande <pr...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Fri, January 28, 2011 12:41:29 AM
Subject: Re: Does solr supports indexing of files other than UTF-8

Thanks, Paul.

However, I want to support indexing files in local encodings. How would I
achieve that?



Re: Does solr supports indexing of files other than UTF-8

Posted by prasad deshpande <pr...@gmail.com>.
Thanks, Paul.

However, I want to support indexing files in local encodings. How would I
achieve that?

On Thu, Jan 27, 2011 at 2:46 PM, Paul Libbrecht <pa...@hoplahup.net> wrote:

> At least in Java, transcoding to UTF-8 is done on a stream basis. No issue
> there.
>
> paul

Re: Does solr supports indexing of files other than UTF-8

Posted by Paul Libbrecht <pa...@hoplahup.net>.
At least in Java, transcoding to UTF-8 is done on a stream basis. No issue there.

paul
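
A minimal sketch of stream-based transcoding in Java, to illustrate the point above: only a fixed-size buffer is held in memory, whatever the size of the input. The class name StreamingTranscoder is hypothetical, the in-memory streams stand in for real file or network streams, and the JRE is assumed to ship the Big5 charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamingTranscoder {
    // Copies text from in (encoded in sourceCharset) to out (encoded as UTF-8),
    // one buffer at a time, so memory use does not grow with document size.
    static void transcodeToUtf8(InputStream in, String sourceCharset, OutputStream out)
            throws IOException {
        Reader reader = new InputStreamReader(in, Charset.forName(sourceCharset));
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buffer = new char[8192];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush(); // push any buffered characters to the underlying stream
    }

    public static void main(String[] args) throws IOException {
        String text = "\u4e2d\u6587";
        byte[] big5 = text.getBytes(Charset.forName("Big5"));

        ByteArrayOutputStream utf8 = new ByteArrayOutputStream();
        transcodeToUtf8(new ByteArrayInputStream(big5), "Big5", utf8);

        String decoded = new String(utf8.toByteArray(), StandardCharsets.UTF_8);
        System.out.println("transcoded text matches: " + decoded.equals(text));
    }
}
```

In a real upload path the output stream could be the HTTP request body sent to Solr, so the converted document never needs to exist as a whole in memory or on disk.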


On 27 Jan 2011, at 09:51, prasad deshpande wrote:

> The documents can be huge. Suppose there is an 800 MB PDF file: to index
> it, I would need to convert it to UTF-8 and then send the file to be indexed.
> Now suppose any number of clients may be uploading files; at that point the
> conversion will affect performance. Also, our product already supports
> localization with local encodings.
> 
> Thanks,
> Prasad


Re: Does solr supports indexing of files other than UTF-8

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Jan 27, 2011 at 3:51 AM, prasad deshpande
<pr...@gmail.com> wrote:
> The documents can be huge. Suppose there is an 800 MB PDF file: to index
> it, I would need to convert it to UTF-8 and then send the file to be indexed.

PDF is binary AFAIK... you shouldn't need to do any charset
translation before sending it to Solr, or to any other extraction
library.  If you're using Solr Cell, then it's the Tika component that
is responsible for pulling out the text in the right format.

-Yonik
http://lucidimagination.com
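
A rough Java sketch of posting a PDF to Solr Cell as raw bytes, with no charset conversion on the client. It assumes the extracting handler is registered at /update/extract, as in the example solrconfig.xml, and that Solr runs at http://localhost:8983/solr; the class name ExtractPoster, the document id and the file path are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ExtractPoster {
    // Builds the request URL; literal.id supplies the unique key for the document.
    static String buildExtractUrl(String solrBase, String docId) {
        try {
            return solrBase + "/update/extract?literal.id="
                    + URLEncoder.encode(docId, "UTF-8") + "&commit=true";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Streams the file to Solr as an opaque binary body and returns the HTTP status.
    static int postPdf(String url, Path pdf) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/pdf");
        conn.setChunkedStreamingMode(8192); // avoid buffering the whole file in memory
        try (InputStream in = Files.newInputStream(pdf);
             OutputStream out = conn.getOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) throws IOException {
        String url = buildExtractUrl("http://localhost:8983/solr", "doc1");
        System.out.println(url);
        if (args.length == 1) { // only contact Solr when a file path is given
            System.out.println("HTTP status: " + postPdf(url, Paths.get(args[0])));
        }
    }
}
```

The client only moves bytes here; text extraction and any charset handling happen on the Solr side, inside Tika.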

Re: Does solr supports indexing of files other than UTF-8

Posted by prasad deshpande <pr...@gmail.com>.
The documents can be huge. Suppose there is an 800 MB PDF file: to index
it, I would need to convert it to UTF-8 and then send the file to be indexed.
Now suppose any number of clients may be uploading files; at that point the
conversion will affect performance. Also, our product already supports
localization with local encodings.

Thanks,
Prasad

On Thu, Jan 27, 2011 at 2:04 PM, Paul Libbrecht <pa...@hoplahup.net> wrote:

> Why is converting documents to UTF-8 not feasible?
> Nowadays any platform offers such services.
>
> Can you give a detailed failure description (maybe with the URL of a sample
> document you are posting)?
>
> paul

Re: Does solr supports indexing of files other than UTF-8

Posted by Paul Libbrecht <pa...@hoplahup.net>.
Why is converting documents to UTF-8 not feasible?
Nowadays any platform offers such services.

Can you give a detailed failure description (maybe with the URL of a sample document you are posting)?

paul

