Posted to solr-user@lucene.apache.org by kushti <sa...@gmail.com> on 2011/03/25 05:31:01 UTC

SOLR - problems with non-english symbols when extracting HTML

When I send plain UTF-8 text (non-English) for indexing, everything is OK, but
with HTML I get wrong characters instead of the non-ASCII symbols. So


$this->solr->extractContents($url,  strip_tags($code),
array("literal.url"=>$url,"fmap.content"=>"body"));

Works well, but just

$this->solr->extractContents($url,  $code,
array("literal.url"=>$url,"fmap.content"=>"body"));

does not! What's the problem?

I'm using the solr-php-client (code.google.com/p/solr-php-client/), but I don't
think the problem is there.

In both cases the request specifies the "text/plain" content type (I've updated
the standard library code).

SOLR 1.4.1 / Tomcat 6 / Fedora 12

--
View this message in context: http://lucene.472066.n3.nabble.com/SOLR-problems-with-non-english-symbols-when-extracting-HTML-tp2729126p2729126.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: SOLR - problems with non-english symbols when extracting HTML

Posted by Lance Norskog <go...@gmail.com>.

Tomcat has to be configured to use UTF-8.

http://wiki.apache.org/solr/SolrTomcat?highlight=%28tomcat%29#URI_Charset_Config
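
For reference, the setting that wiki section describes is the URIEncoding
attribute on the HTTP connector in Tomcat's conf/server.xml. A minimal sketch,
assuming a stock Tomcat 6 connector (the port, protocol, and timeout values
below are assumptions; keep whatever your install already has and only add
URIEncoding):

```xml
<!-- conf/server.xml: HTTP connector with UTF-8 URI decoding.
     port, protocol, connectionTimeout and redirectPort are assumed
     defaults; URIEncoding is the attribute the wiki page refers to. -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>
```

Tomcat has to be restarted to pick up the change. This attribute controls how
the request URI and its query-string parameters are decoded.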

On Fri, Mar 25, 2011 at 6:58 PM, kushti <sa...@gmail.com> wrote:
>
> Grijesh wrote:
>>
>> Try to send HTML data using format CDATA .
>>
> Doesn't work with
>
>
>> $content = "";
>>
>
> And my goal is not to avoid extraction, but to have no problems with
> non-English chars



-- 
Lance Norskog
goksron@gmail.com

Re: SOLR - problems with non-english symbols when extracting HTML

Posted by kushti <sa...@gmail.com>.

Grijesh wrote:
> 
> Try to send HTML data using format CDATA .
> 
Doesn't work with 


> $content = "";
> 

And my goal is not to avoid extraction, but to have no problems with
non-English chars


--
View this message in context: http://lucene.472066.n3.nabble.com/SOLR-problems-with-non-english-symbols-when-extracting-HTML-tp2729126p2733858.html

Re: SOLR - problems with non-english symbols when extracting HTML

Posted by Grijesh <pi...@gmail.com>.

Try sending the HTML data in CDATA format.

-----
Thanx: 
Grijesh 
www.gettinhahead.co.in 
--
View this message in context: http://lucene.472066.n3.nabble.com/SOLR-problems-with-non-english-symbols-when-extracting-HTML-tp2729126p2729923.html