Posted to dev@nutch.apache.org by "Sami Siren (JIRA)" <ji...@apache.org> on 2006/08/16 16:19:14 UTC

[jira] Commented: (NUTCH-349) Port Nutch to use Hadoop Text instead of UTF8

    [ http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428399 ] 
            
Sami Siren commented on NUTCH-349:
----------------------------------

If anything at all should be done, then I'd go for #2. There was also a total incompatibility from 0.7 to 0.8, and I didn't see that many complaints.

I also noticed that some file formats contain code that checks the format version and reads the data differently from version to version. That code could also be removed if we go for #2.

> Port Nutch to use Hadoop Text instead of UTF8
> ---------------------------------------------
>
>                 Key: NUTCH-349
>                 URL: http://issues.apache.org/jira/browse/NUTCH-349
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>
> Currently Nutch uses the org.apache.hadoop.io.UTF8 class to store/read Strings. This class was deprecated in Hadoop 0.5.0, and the Text class should be used instead. Sooner or later we will need to move Nutch to use this class instead of UTF8.
> This raises numerous issues regarding the compatibility of existing data in CrawlDB, LinkDB and segments. I can see two ways to solve this:
> * add code in readers of respective formats to convert UTF8->Text on the fly. New writers would only use Text. This is less than ideal, because it complicates the code, and also at some point in time the UTF8 class will be removed.
> * create a converter (to be maintained as long as UTF8 exists), which converts existing data in bulk from UTF8 to Text. This requires an additional processing step when upgrading to convert all existing data to the new format.
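
The two options listed above can be contrasted with a minimal, self-contained sketch. It deliberately avoids the Hadoop classes: StringFormatMigration, the version constants and both byte layouts are hypothetical stand-ins (the "old" layout uses DataOutput.writeUTF, the "new" one an int length followed by standard UTF-8 bytes), not the actual UTF8 or Text wire formats, and real Nutch data structures may track versions differently. readAny roughly corresponds to the first option (convert on the fly in the reader), while convert plus readNewOnly corresponds to the second (bulk conversion, after which the old branch can be dropped).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: the class, constants and byte layouts are illustrative
// stand-ins, not the real Nutch or Hadoop formats.
final class StringFormatMigration {
    static final byte VERSION_OLD = 1; // stands in for data written via UTF8
    static final byte VERSION_NEW = 2; // stands in for data written via Text

    // Old layout: version byte, then DataOutput.writeUTF (2-byte length prefix).
    static byte[] writeOld(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeByte(VERSION_OLD);
            out.writeUTF(s);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // New layout: version byte, int length, then standard UTF-8 bytes.
    static byte[] writeNew(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            out.writeByte(VERSION_NEW);
            out.writeInt(bytes.length);
            out.write(bytes);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Option 1: the reader accepts both layouts and branches on the version.
    static String readAny(byte[] record) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
            byte version = in.readByte();
            if (version == VERSION_OLD) {
                return in.readUTF();
            }
            if (version == VERSION_NEW) {
                return readNewBody(in);
            }
            throw new IllegalArgumentException("Unknown version: " + version);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Option 2, step 1: one-off bulk conversion of an old record.
    static byte[] convert(byte[] record) {
        return writeNew(readAny(record));
    }

    // Option 2, step 2: after conversion the reader needs no legacy branch.
    static String readNewOnly(byte[] record) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
            byte version = in.readByte();
            if (version != VERSION_NEW) {
                throw new IllegalStateException("Record not converted, version " + version);
            }
            return readNewBody(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static String readNewBody(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] legacy = writeOld("http://example.com/");
        System.out.println(readAny(legacy));               // option 1
        System.out.println(readNewOnly(convert(legacy)));  // option 2
    }
}
```

In this sketch the cost of the first option is the extra branch that every reader has to carry, while the cost of the second is the one-off convert pass over all existing data before the simpler reader can be used.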
