Posted to solr-user@lucene.apache.org by kushti <sa...@gmail.com> on 2011/03/25 05:31:01 UTC
SOLR - problems with non-english symbols when extracting HTML
When I send plain UTF-8 text (non-English) to the index, everything is fine, but
with HTML I get wrong characters in place of the non-ASCII symbols. So
$this->solr->extractContents($url, strip_tags($code),
array("literal.url"=>$url,"fmap.content"=>"body"));
Works well, but just
$this->solr->extractContents($url, $code,
array("literal.url"=>$url,"fmap.content"=>"body"));
does not! What's the problem?
I'm using the SOLR-PHP client (code.google.com/p/solr-php-client/), but I don't
think the problem is there.
In both cases the request declares a "text/plain" content type (I've modified
the standard library code).
SOLR 1.4.1 / Tomcat 6 / Fedora 12
--
View this message in context: http://lucene.472066.n3.nabble.com/SOLR-problems-with-non-english-symbols-when-extracting-HTML-tp2729126p2729126.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLR - problems with non-english symbols when extracting HTML
Posted by Lance Norskog <go...@gmail.com>.
Tomcat has to be configured to use UTF-8.
http://wiki.apache.org/solr/SolrTomcat?highlight=%28tomcat%29#URI_Charset_Config
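For reference, a minimal sketch of the setting that wiki section describes,
assuming a stock conf/server.xml. The port, protocol, timeout, and redirect
values below are placeholders, not taken from this thread; only the
URIEncoding attribute matters here.

```xml
<!-- conf/server.xml: illustrative Connector element. Keep whatever
     attributes your existing Connector already has and add URIEncoding. -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>
```

URIEncoding controls how Tomcat decodes the request URI and its query string.
Tomcat has to be restarted for the change to take effect.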
On Fri, Mar 25, 2011 at 6:58 PM, kushti <sa...@gmail.com> wrote:
>
> Grijesh wrote:
>>
>> Try sending the HTML data in CDATA format.
>>
> Doesn't work with
>
>
>> $content = "";
>>
>
> And my goal is not to avoid extraction, but to have no problems with
> non-English characters.
--
Lance Norskog
goksron@gmail.com
Re: SOLR - problems with non-english symbols when extracting HTML
Posted by kushti <sa...@gmail.com>.
Grijesh wrote:
>
> Try sending the HTML data in CDATA format.
>
Doesn't work with
> $content = "";
>
And my goal is not to avoid extraction, but to have no problems with
non-English characters.
--
View this message in context: http://lucene.472066.n3.nabble.com/SOLR-problems-with-non-english-symbols-when-extracting-HTML-tp2729126p2733858.html
Re: SOLR - problems with non-english symbols when extracting HTML
Posted by Grijesh <pi...@gmail.com>.
Try sending the HTML data in CDATA format.
-----
Thanx:
Grijesh
www.gettinhahead.co.in
--
View this message in context: http://lucene.472066.n3.nabble.com/SOLR-problems-with-non-english-symbols-when-extracting-HTML-tp2729126p2729923.html