Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/10/10 21:26:19 UTC

[jira] Created: (NUTCH-383) Upgrade Nutch to Hadoop 0.7

Upgrade Nutch to Hadoop 0.7
---------------------------

                 Key: NUTCH-383
                 URL: http://issues.apache.org/jira/browse/NUTCH-383
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.9.0
            Reporter: Andrzej Bialecki 
         Assigned To: Andrzej Bialecki 


Upgrade Nutch to Hadoop 0.7, and replace all occurrences of UTF8 with Text. UTF8 is deprecated and its use is discouraged due to its limitations.

This change will break the API, in the sense that all third-party additions will have to be updated to use the new APIs, which take Text instead of UTF8 in method parameters.

This change also breaks backward compatibility of data in CrawlDb, LinkDb and segments. A tool to upgrade CrawlDb, LinkDb and segments can be created to facilitate the upgrade path.
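
A note on the "limitations" mentioned above, which this issue does not spell out. One that I believe applies (an assumption on my part, not something stated here) is that UTF8 stores the string length in a 2-byte prefix, so an encoded string cannot exceed 65535 bytes. The JDK's DataOutputStream.writeUTF uses the same kind of prefix, so it can stand in for a quick illustration. LengthPrefixDemo and its methods are made-up names, and the 4-byte prefix in writeWithIntPrefix is only a stand-in for a wider length field, not Text's actual encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use or reproduce any Hadoop code.
final class LengthPrefixDemo {

    // Returns true if the string can be written by writeUTF, whose length
    // prefix is an unsigned 16-bit value (at most 65535 encoded bytes).
    static boolean fitsShortPrefix(String s) throws IOException {
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        try {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException tooLong) {
            return false;
        }
    }

    // Writes a 4-byte length followed by standard UTF-8 bytes.
    // Assumption for illustration only: any wider prefix lifts the 64K cap.
    static byte[] writeWithIntPrefix(String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(utf8.length);
        out.write(utf8);
        out.flush();
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Build a 70000-character ASCII string, larger than 65535 bytes.
        String longValue = new String(new char[70000]).replace('\0', 'a');
        System.out.println("short URL fits 2-byte prefix: "
                + fitsShortPrefix("http://example.com/"));
        System.out.println("long value fits 2-byte prefix: "
                + fitsShortPrefix(longValue));
        System.out.println("long value with 4-byte prefix, total bytes: "
                + writeWithIntPrefix(longValue).length);
    }
}
```

URLs used as CrawlDb keys rarely get that long, but page text and metadata values can, which is presumably one reason a type without the short prefix is preferred.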

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Closed: (NUTCH-383) Upgrade Nutch to Hadoop 0.7

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-383?page=all ]

Andrzej Bialecki  closed NUTCH-383.
-----------------------------------

    Fix Version/s: 0.9.0
       Resolution: Fixed

Committed to trunk as rev. 464654.


Re: [jira] Updated: (NUTCH-383) Upgrade Nutch to Hadoop 0.7

Posted by Andrzej Bialecki <ab...@getopt.org>.
Sami Siren wrote:
> I quickly screened through the <<massive>> patch, didn't pick anything 
> special except the stuff from 

Yes, unfortunately UTF8 was used widely throughout the whole code base, 
but most diffs are trivial substitutions of UTF8 with Text.

> clustering-carrot2, formatting changes?
>

Hmm. Good catch. These are purely Windows->Unix line ending differences 
- I'm not sure why I'm getting these ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] Updated: (NUTCH-383) Upgrade Nutch to Hadoop 0.7

Posted by Sami Siren <ss...@gmail.com>.
I quickly screened through the <<massive>> patch, didn't pick anything 
special except the stuff from clustering-carrot2, formatting changes?


+1

--
  Sami Siren

Andrzej Bialecki wrote:
> Andrzej Bialecki (JIRA) wrote:
>>      [ http://issues.apache.org/jira/browse/NUTCH-383?page=all ]
>>
>> Andrzej Bialecki  updated NUTCH-383:
>> ------------------------------------
>>
>>     Attachment: patch-v2.txt
>>
>> This patch uses Hadoop 0.7.1. New changes:
>>   
> 
> [..]
> 
> I would appreciate a review. I plan to commit this soon, while the 
> trunk/ is still in sync.
> 


Re: [jira] Updated: (NUTCH-383) Upgrade Nutch to Hadoop 0.7

Posted by Andrzej Bialecki <ab...@getopt.org>.
Andrzej Bialecki (JIRA) wrote:
>      [ http://issues.apache.org/jira/browse/NUTCH-383?page=all ]
>
> Andrzej Bialecki  updated NUTCH-383:
> ------------------------------------
>
>     Attachment: patch-v2.txt
>
> This patch uses Hadoop 0.7.1. New changes:
>   

[..]

I would appreciate a review. I plan to commit this soon, while the 
trunk/ is still in sync.




[jira] Updated: (NUTCH-383) Upgrade Nutch to Hadoop 0.7

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-383?page=all ]

Andrzej Bialecki  updated NUTCH-383:
------------------------------------

    Attachment: patch-v2.txt

This patch uses Hadoop 0.7.1. New changes:

* upgrade to use Hadoop 0.7.1

* upgrade to use Lucene 2.0.0 jars

* use Hadoop's ToolBase instead of Nutch ToolBase

This patch also provides limited backward compatibility, namely:

* existing crawldbs can be converted to the new format using the CrawlDbConverter tool

* existing segments can be partially converted using SegmentMerger. However, the segment parts related to parsing will NOT be converted and have to be removed before converting, i.e. crawl_parse, parse_data and parse_text need to be removed and, after conversion, re-created with 'nutch parse'.

* other Nutch data, such as linkdb and indexes, need to be re-created.
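
The removal step for segment parts described above could be scripted roughly as follows. This is only a sketch: SegmentParseCleaner is a hypothetical helper, not something in the patch, and it assumes the segment lives on the local file system (a segment on DFS would need Hadoop's FileSystem API instead). The three directory names are the ones listed above.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: removes the parse-related parts
// of a segment stored on the local file system.
final class SegmentParseCleaner {

    // Subdirectory names taken from the message above.
    static final List<String> PARSE_PARTS =
            Arrays.asList("crawl_parse", "parse_data", "parse_text");

    // Deletes each parse-related subdirectory of the segment, if present,
    // and returns the names of the ones that were actually removed.
    static List<String> removeParseParts(Path segment) throws IOException {
        List<String> removed = new ArrayList<>();
        for (String part : PARSE_PARTS) {
            Path dir = segment.resolve(part);
            if (Files.isDirectory(dir)) {
                deleteRecursively(dir);
                removed.add(part);
            }
        }
        return removed;
    }

    // Deletes files first, then each directory once it is empty.
    private static void deleteRecursively(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.delete(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: SegmentParseCleaner <segment_dir>");
            System.exit(1);
        }
        System.out.println("Removed: " + removeParseParts(Paths.get(args[0])));
    }
}
```

After running it on each segment, the segment can be converted with SegmentMerger and the parse parts re-created with 'nutch parse', as described in the list above.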


[jira] Updated: (NUTCH-383) Upgrade Nutch to Hadoop 0.7

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-383?page=all ]

Andrzej Bialecki  updated NUTCH-383:
------------------------------------

    Attachment: patch.txt

This patch includes all changes needed to use Hadoop 0.7.0. Additionally, a CrawlDbConverter tool is included, which converts the currently used CrawlDb format, based on <UTF8, CrawlDatum>, to the new format, <Text, CrawlDatum>.

All JUnit tests pass.


[jira] Updated: (NUTCH-383) Upgrade Nutch to Hadoop 0.7

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-383?page=all ]

Andrzej Bialecki  updated NUTCH-383:
------------------------------------

    Attachment: patch-v3.txt

Cleaned up the patch by removing accidental changes.

If there are no further objections I'd like to commit this.
