Posted to dev@nutch.apache.org by "Laurent Hervaud (JIRA)" <ji...@apache.org> on 2019/02/21 09:58:00 UTC

[jira] [Created] (NUTCH-2696) Nutch SegmentReader does not dump non-ASCII characters with Hadoop 3.x

Laurent Hervaud created NUTCH-2696:
--------------------------------------

             Summary: Nutch SegmentReader does not dump non-ASCII characters with Hadoop 3.x
                 Key: NUTCH-2696
                 URL: https://issues.apache.org/jira/browse/NUTCH-2696
             Project: Nutch
          Issue Type: Bug
          Components: segment
         Environment: Hadoop version: 3.0.0 (CDH 6.1)
                      Nutch: 1.15
                      Mode: distributed
            Reporter: Laurent Hervaud


All Nutch tasks work properly with Hadoop 3.x except SegmentReader:
 SegmentReader with the -get option works fine.
 SegmentReader with the -dump option replaces every non-ASCII character with ?

Example URL: [http://www.wikipedia.fr/index.php]

 
{code:java}
command : ./runtime/deploy/bin/nutch readseg -dump /user/nutch/crawl1.15/segments/20190221093756 /tmp/dump1.15 -nocontent -nogenerate -noparse -noparsedata
ParseText::
 Wikipedia.fr - Portail de recherche sur les projets Wikim?dia
 Chercher sur Wikip?dia en fran?ais
 L?encyclop?die librement r?utilisable que chacun peut am?liorer.
{code}
 

 
{code:java}
command : ./runtime/deploy/bin/nutch readseg -get /user/nutch/crawl1.15/segments/20190221093756 http://www.wikipedia.fr/index.php -nocontent -nogenerate -noparse -noparsedata
ParseText::
 Wikipedia.fr - Portail de recherche sur les projets Wikimédia
 Chercher sur Wikipédia en français
 L’encyclopédie librement réutilisable que chacun peut améliorer.
{code}
 

I tried building Nutch with the Hadoop 3.0.0 dependencies in ivy.xml, but the result is the same.

It works fine in local mode.
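
For reference, the ? substitution seen with -dump is what Java produces when text is encoded with a charset that cannot represent a character. The following is only a minimal sketch of that symptom, assuming (not confirmed here) that the dump output is encoded with the JVM default charset and that this default is ASCII on the cluster task JVMs. The class name CharsetDumpSketch and the method roundTrip are hypothetical and not part of Nutch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Nutch code: shows how an ASCII encoder
// turns accented characters into '?'.
public class CharsetDumpSketch {

    // Encode and decode with the same charset, like a writer/reader pair would.
    static String roundTrip(String text, Charset charset) {
        // String.getBytes(Charset) replaces unmappable characters with the
        // charset's replacement byte, which is '?' for US-ASCII.
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file itself pure ASCII.
        String line = "Chercher sur Wikip\u00e9dia en fran\u00e7ais";

        // Accented characters are lost, as in the -dump output above.
        System.out.println(roundTrip(line, StandardCharsets.US_ASCII));

        // The text survives unchanged, as in the -get output above.
        System.out.println(roundTrip(line, StandardCharsets.UTF_8));

        // What the current JVM would use when no charset is given explicitly.
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```

Running the same class once locally and once inside a task on the cluster would show whether the default charset differs between the two modes.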

 


