Posted to user@nutch.apache.org by reinhard schwab <re...@aon.at> on 2009/12/22 02:00:09 UTC

Unicode U+2029 paragraph separator

http://www.fileformat.info/info/unicode/char/2029/index.htm

I have found that this Unicode character (U+2029, PARAGRAPH SEPARATOR) breaks
JSON deserialization when using Solr with AJAX.
It comes from text extracted from a PDF.
Where should this character be filtered out or replaced: in the PDF
parser/text extractor, or in the Solr indexer?
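
Wherever the filter ends up, the replacement itself is small. Below is a minimal sketch in plain Java, not tied to any Nutch or Solr API. The class name SeparatorFilter and the method clean are hypothetical, and mapping the separator to a newline is an assumption (a space would work equally well). It also handles U+2028 (LINE SEPARATOR), which I have not seen in my data but which older JavaScript engines treat the same way inside string literals.

```java
// Hypothetical helper, not part of Nutch or Solr.
final class SeparatorFilter {

    private SeparatorFilter() {
        // static utility, no instances
    }

    // Replaces U+2028 (line separator) and U+2029 (paragraph separator)
    // with an ordinary newline. Returns null for null input.
    // Assumption: a newline is an acceptable substitute for both.
    static String clean(String text) {
        if (text == null) {
            return null;
        }
        return text.replace('\u2028', '\n').replace('\u2029', '\n');
    }
}
```

Calling SeparatorFilter.clean(extractedText) on the parsed text before it is handed to the indexer would be one place to apply it.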
Regards,
reinhard