Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/10/10 21:26:19 UTC
[jira] Created: (NUTCH-383) Upgrade Nutch to Hadoop 0.7
Upgrade Nutch to Hadoop 0.7
---------------------------
Key: NUTCH-383
URL: http://issues.apache.org/jira/browse/NUTCH-383
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Andrzej Bialecki
Assigned To: Andrzej Bialecki
Upgrade Nutch to Hadoop 0.7, and replace all occurrences of UTF8 with Text. UTF8 is deprecated and its use is discouraged due to its limitations.
This change will break the API, in the sense that all third-party additions will have to be updated to use the new APIs, which take Text instead of UTF8 in method parameters.
This change also breaks backward compatibility of data in CrawlDb, LinkDb and segments. A tool to upgrade CrawlDb, LinkDb and segments can be created to facilitate the upgrade path.
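The description above does not say which limitations of UTF8 are meant. One that is commonly cited is a 16-bit length prefix, which caps the encoded size of a string at 64 KB. The sketch below illustrates that kind of limit using only the JDK: java.io.DataOutput.writeUTF has the same 16-bit prefix, so it stands in for the legacy encoding here. It does not use the Hadoop classes. The class and method names (Utf8LimitDemo, fitsLegacy, writeLengthPrefixed) are hypothetical, and the fixed 32-bit length prefix is a simplification chosen for illustration, not Text's actual wire format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; JDK-only stand-in, not Hadoop's UTF8 or Text.
class Utf8LimitDemo {

    /** Returns true if the string fits an encoding with a 16-bit length prefix. */
    static boolean fitsLegacy(String s) {
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        try {
            out.writeUTF(s); // rejects strings whose encoded form exceeds 65535 bytes
            return true;
        } catch (UTFDataFormatException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a 32-bit length prefix followed by standard UTF-8 bytes (assumed layout). */
    static byte[] writeLengthPrefixed(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(utf8.length);
            out.write(utf8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        char[] chars = new char[70_000];
        Arrays.fill(chars, 'x');
        String longValue = new String(chars);

        System.out.println(fitsLegacy("http://example.com/"));      // true
        System.out.println(fitsLegacy(longValue));                  // false
        System.out.println(writeLengthPrefixed(longValue).length);  // 70004
    }
}
```

A URL key rarely comes near 64 KB, but the same string type is also used for other values, where long strings are more likely.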
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Closed: (NUTCH-383) Upgrade Nutch to Hadoop 0.7
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-383?page=all ]
Andrzej Bialecki closed NUTCH-383.
-----------------------------------
Fix Version/s: 0.9.0
Resolution: Fixed
Committed to trunk as rev. 464654.
> Upgrade Nutch to Hadoop 0.7
> ---------------------------
>
> Key: NUTCH-383
> URL: http://issues.apache.org/jira/browse/NUTCH-383
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Reporter: Andrzej Bialecki
> Assigned To: Andrzej Bialecki
> Fix For: 0.9.0
>
> Attachments: patch-v2.txt, patch-v3.txt, patch.txt
>
>
> Upgrade Nutch to Hadoop 0.7, and replace all occurrences of UTF8 with Text. UTF8 is deprecated and its use is discouraged due to its limitations.
> This change will break the API, in the sense that all third-party additions will have to be updated to use the new APIs, which take Text instead of UTF8 in method parameters.
> This change also breaks backward compatibility of data in CrawlDb, LinkDb and segments. A tool to upgrade CrawlDb, LinkDb and segments can be created to facilitate the upgrade path.
Re: [jira] Updated: (NUTCH-383) Upgrade Nutch to Hadoop 0.7
Posted by Andrzej Bialecki <ab...@getopt.org>.
Sami Siren wrote:
> I quickly screened through the <<massive>> patch, didn't pick anything
> special except the stuff from
Yes, unfortunately UTF8 was used widely throughout the whole code base,
but most diffs are trivial substitutions of UTF8 with Text.
> clustering-carrot2, formatting changes?
>
Hmm. Good catch. These are purely Windows->Unix line ending differences
- I'm not sure why I'm getting these ...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: [jira] Updated: (NUTCH-383) Upgrade Nutch to Hadoop 0.7
Posted by Sami Siren <ss...@gmail.com>.
I quickly screened through the <<massive>> patch, didn't pick anything
special except the stuff from clustering-carrot2, formatting changes?
+1
--
Sami Siren
Andrzej Bialecki wrote:
> Andrzej Bialecki (JIRA) wrote:
>> [ http://issues.apache.org/jira/browse/NUTCH-383?page=all ]
>>
>> Andrzej Bialecki updated NUTCH-383:
>> ------------------------------------
>>
>> Attachment: patch-v2.txt
>>
>> This patch uses Hadoop 0.7.1. New changes:
>>
>
> [..]
>
> I would appreciate a review. I plan to commit this soon, while the
> trunk/ is still in sync.
>
Re: [jira] Updated: (NUTCH-383) Upgrade Nutch to Hadoop 0.7
Posted by Andrzej Bialecki <ab...@getopt.org>.
Andrzej Bialecki (JIRA) wrote:
> [ http://issues.apache.org/jira/browse/NUTCH-383?page=all ]
>
> Andrzej Bialecki updated NUTCH-383:
> ------------------------------------
>
> Attachment: patch-v2.txt
>
> This patch uses Hadoop 0.7.1. New changes:
>
[..]
I would appreciate a review. I plan to commit this soon, while the
trunk/ is still in sync.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
[jira] Updated: (NUTCH-383) Upgrade Nutch to Hadoop 0.7
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-383?page=all ]
Andrzej Bialecki updated NUTCH-383:
------------------------------------
Attachment: patch-v2.txt
This patch uses Hadoop 0.7.1. New changes:
* upgrade to use Hadoop 0.7.1
* upgrade to use Lucene 2.0.0 jars
* use Hadoop's ToolBase instead of Nutch ToolBase
This patch also provides limited backward compatibility, namely:
* existing CrawlDbs can be converted to the new format using the CrawlDbConverter tool.
* existing segments can be partially converted using SegmentMerger. However, the segment parts related to parsing will NOT be converted and have to be removed prior to converting, i.e. crawl_parse, parse_data and parse_text need to be removed, and re-created after conversion with 'nutch parse'.
* other Nutch data, such as linkdb and indexes, need to be re-created.
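The segment step above (remove crawl_parse, parse_data and parse_text before converting) could be scripted along these lines. This is a minimal sketch, assuming the segment sits on the local filesystem and has been backed up. The class and method names (SegmentParseCleaner, removeParseParts) are hypothetical and not part of Nutch; only the three directory names come from the message above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nutch; local filesystem only.
class SegmentParseCleaner {

    // Parse-related segment parts named in the patch notes.
    static final String[] PARSE_PARTS = {"crawl_parse", "parse_data", "parse_text"};

    /** Deletes the parse-related parts of one segment and returns the names removed. */
    static List<String> removeParseParts(Path segmentDir) {
        List<String> removed = new ArrayList<>();
        for (String part : PARSE_PARTS) {
            Path dir = segmentDir.resolve(part);
            if (!Files.exists(dir)) {
                continue; // nothing to remove for this part
            }
            try (Stream<Path> walk = Files.walk(dir)) {
                // Reverse order visits children before their parent directory.
                walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            removed.add(part);
        }
        return removed;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: SegmentParseCleaner <segment-dir>");
            return;
        }
        System.out.println("removed: " + removeParseParts(Paths.get(args[0])));
    }
}
```

Any other subdirectory of the segment is left untouched. Segments stored on a distributed filesystem would need that filesystem's own tools instead.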
> Upgrade Nutch to Hadoop 0.7
> ---------------------------
>
> Key: NUTCH-383
> URL: http://issues.apache.org/jira/browse/NUTCH-383
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Reporter: Andrzej Bialecki
> Assigned To: Andrzej Bialecki
> Attachments: patch-v2.txt, patch.txt
>
>
> Upgrade Nutch to Hadoop 0.7, and replace all occurrences of UTF8 with Text. UTF8 is deprecated and its use is discouraged due to its limitations.
> This change will break the API, in the sense that all third-party additions will have to be updated to use the new APIs, which take Text instead of UTF8 in method parameters.
> This change also breaks backward compatibility of data in CrawlDb, LinkDb and segments. A tool to upgrade CrawlDb, LinkDb and segments can be created to facilitate the upgrade path.
[jira] Updated: (NUTCH-383) Upgrade Nutch to Hadoop 0.7
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-383?page=all ]
Andrzej Bialecki updated NUTCH-383:
------------------------------------
Attachment: patch.txt
This patch includes all changes needed to use Hadoop 0.7.0. Additionally, a CrawlDbConverter tool is included, which converts the current CrawlDb format, which uses <UTF8, CrawlDatum>, to the new format, <Text, CrawlDatum>.
All JUnit tests pass.
> Upgrade Nutch to Hadoop 0.7
> ---------------------------
>
> Key: NUTCH-383
> URL: http://issues.apache.org/jira/browse/NUTCH-383
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Reporter: Andrzej Bialecki
> Assigned To: Andrzej Bialecki
> Attachments: patch.txt
>
>
> Upgrade Nutch to Hadoop 0.7, and replace all occurrences of UTF8 with Text. UTF8 is deprecated and its use is discouraged due to its limitations.
> This change will break the API, in the sense that all third-party additions will have to be updated to use the new APIs, which take Text instead of UTF8 in method parameters.
> This change also breaks backward compatibility of data in CrawlDb, LinkDb and segments. A tool to upgrade CrawlDb, LinkDb and segments can be created to facilitate the upgrade path.
[jira] Updated: (NUTCH-383) Upgrade Nutch to Hadoop 0.7
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-383?page=all ]
Andrzej Bialecki updated NUTCH-383:
------------------------------------
Attachment: patch-v3.txt
Cleaned up the patch by removing accidental changes.
If there are no further objections I'd like to commit this.
> Upgrade Nutch to Hadoop 0.7
> ---------------------------
>
> Key: NUTCH-383
> URL: http://issues.apache.org/jira/browse/NUTCH-383
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Reporter: Andrzej Bialecki
> Assigned To: Andrzej Bialecki
> Attachments: patch-v2.txt, patch-v3.txt, patch.txt
>
>
> Upgrade Nutch to Hadoop 0.7, and replace all occurrences of UTF8 with Text. UTF8 is deprecated and its use is discouraged due to its limitations.
> This change will break the API, in the sense that all third-party additions will have to be updated to use the new APIs, which take Text instead of UTF8 in method parameters.
> This change also breaks backward compatibility of data in CrawlDb, LinkDb and segments. A tool to upgrade CrawlDb, LinkDb and segments can be created to facilitate the upgrade path.