Posted to dev@nutch.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2008/12/30 21:56:44 UTC
[jira] Created: (NUTCH-676) MapWritable is written inefficiently and confusingly
MapWritable is written inefficiently and confusingly
----------------------------------------------------
Key: NUTCH-676
URL: https://issues.apache.org/jira/browse/NUTCH-676
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Priority: Minor
The MapWritable implementation in o.a.n.crawl is written confusingly - it maintains its own internal linked list, which I think may have a bug somewhere (I'm getting an NPE in certain cases in the code, though it's hard to track down).
Can anyone comment as to why MapWritable is written the way it is, rather than just using a HashMap or a LinkedHashMap if consistent ordering is important? I imagine that would improve performance.
What about just using the Hadoop MapWritable? Obviously that would break some backwards compatibility but it may be a good idea at some point to reduce confusion (I didn't realize that Nutch had its own impl until a few minutes ago)
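To make the HashMap versus LinkedHashMap question concrete, here is a minimal sketch using only java.util. The class name OrderingSketch and the three metadata keys are made up for illustration; nothing here is taken from Nutch or Hadoop, and it says nothing about performance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: not code from Nutch or Hadoop.
class OrderingSketch {
    // Returns the keys of the given map in its iteration order.
    static List<String> keysInIterationOrder(Map<String, Integer> map) {
        return new ArrayList<>(map.keySet());
    }

    // Fills a map with three hypothetical metadata keys, in this order.
    static void fill(Map<String, Integer> map) {
        map.put("score", 1);
        map.put("fetchTime", 2);
        map.put("signature", 3);
    }

    public static void main(String[] args) {
        Map<String, Integer> linked = new LinkedHashMap<>();
        Map<String, Integer> hashed = new HashMap<>();
        fill(linked);
        fill(hashed);
        // LinkedHashMap iterates in insertion order.
        System.out.println(keysInIterationOrder(linked));
        // HashMap gives no ordering guarantee; the order may differ.
        System.out.println(keysInIterationOrder(hashed));
    }
}
```

If a serialized form has to list entries in a predictable order, LinkedHashMap provides that without a hand-written linked list.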
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Closed: (NUTCH-676) MapWritable is written inefficiently and confusingly
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney closed NUTCH-676.
-------------------------------
Resolution: Fixed
Fix Version/s: 1.0.0
Assignee: Doğacan Güney
Patch committed as of rev. 736385.
> MapWritable is written inefficiently and confusingly
> ----------------------------------------------------
>
> Key: NUTCH-676
> URL: https://issues.apache.org/jira/browse/NUTCH-676
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Reporter: Todd Lipcon
> Assignee: Doğacan Güney
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: 0001-NUTCH-676-Replace-MapWritable-implementation-with-t.patch, NUTCH-676_v2.patch, NUTCH-676_v3.patch
>
>
> The MapWritable implementation in o.a.n.crawl is written confusingly - it maintains its own internal linked list, which I think may have a bug somewhere (I'm getting an NPE in certain cases in the code, though it's hard to track down).
> Can anyone comment as to why MapWritable is written the way it is, rather than just using a HashMap or a LinkedHashMap if consistent ordering is important? I imagine that would improve performance.
> What about just using the Hadoop MapWritable? Obviously that would break some backwards compatibility but it may be a good idea at some point to reduce confusion (I didn't realize that Nutch had its own impl until a few minutes ago)
[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665921#action_12665921 ]
Doğacan Güney commented on NUTCH-676:
-------------------------------------
No, actually the cause is that we have to create a new MapWritable in CrawlDatum#readFields. MapWritable "remembers" the id-class mappings it has already written and does not rewrite them in a later #write call. So if the order in which you output keys in map differs from the order in which you receive them in reduce, it fails: MapWritable tries to map an id to a class, but that id-class mapping has not been read yet.
Sorry if the description is not very clear.
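The failure described above can be reproduced with a toy model. The classes IdEncoder, IdDecoder and IdMappingDemo and the "def"/"ref" text format are hypothetical; this is not the Nutch or Hadoop MapWritable, only a sketch of the mechanism, assuming that a writer which emits an id-class definition once per instance behaves roughly like this.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy writer: emits the id-class definition only the first time it sees a class.
class IdEncoder {
    private final Map<String, Integer> ids = new HashMap<>();

    String encode(String className) {
        Integer id = ids.get(className);
        if (id == null) {
            id = ids.size() + 1;
            ids.put(className, id);
            return "def " + id + " " + className + ";ref " + id;
        }
        // Already written once by this instance: only the id is emitted.
        return "ref " + id;
    }
}

// Toy reader: can resolve an id only if it has already read its definition.
class IdDecoder {
    private final Map<Integer, String> classes = new HashMap<>();

    String decode(String record) {
        String resolved = null;
        for (String part : record.split(";")) {
            String[] tokens = part.split(" ");
            int id = Integer.parseInt(tokens[1]);
            if (tokens[0].equals("def")) {
                classes.put(id, tokens[2]);
            } else {
                resolved = classes.get(id);
                if (resolved == null) {
                    throw new IllegalStateException("no class known for id " + id);
                }
            }
        }
        return resolved;
    }
}

class IdMappingDemo {
    // Returns true if one decoder can read all records in the given order.
    static boolean decodesCleanly(List<String> records) {
        IdDecoder decoder = new IdDecoder();
        try {
            for (String record : records) {
                decoder.decode(record);
            }
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        IdEncoder shared = new IdEncoder();
        String first = shared.encode("Text");   // carries the definition
        String second = shared.encode("Text");  // carries only the id
        System.out.println(decodesCleanly(Arrays.asList(first, second)));
        System.out.println(decodesCleanly(Arrays.asList(second, first)));
        // A fresh encoder per record makes every record self-contained.
        String a = new IdEncoder().encode("Text");
        String b = new IdEncoder().encode("Text");
        System.out.println(decodesCleanly(Arrays.asList(b, a)));
    }
}
```

In this model, records written by one long-lived encoder are readable only in the order they were written, while records written by a fresh encoder each time can be read in any order. That is one reading of why a new map instance per readFields call avoids the problem.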
[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly
Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665889#action_12665889 ]
Todd Lipcon commented on NUTCH-676:
-----------------------------------
Hmm, I can't seem to find the bug I thought I remembered. Maybe the bug I ran into was actually due to the hashCode/equals issue.
If a crawl seems to go OK, I'm all for this.
[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly
Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665855#action_12665855 ]
Todd Lipcon commented on NUTCH-676:
-----------------------------------
Have you run some full crawls yet? I wrote pretty much this same patch but ran into a lot of issues when actually trying to run it in production. It seems like there's a bug in nutch's MapWritable where the classes of the keys are used for keys rather than the actual keys. I'll try to hunt down what I'm referring to and post back later today.
[jira] Updated: (NUTCH-676) MapWritable is written inefficiently and confusingly
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney updated NUTCH-676:
--------------------------------
Attachment: NUTCH-676_v2.patch
Patch for the issue.
Bumps CrawlDatum version and starts using o.a.h.io.MapWritable in CrawlDatum. Compatibility is preserved by keeping nutch's MapWritable around and adding extra code for reading from nutch MapWritable if CrawlDatum version is 6.
Also changes CrawlDatum#toString as hadoop's MapWritable does not have a good toString method.
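The version-gated read described in this comment can be sketched in plain Java. Everything below is an assumption made for illustration: the class VersionedRecordSketch, the value 7 for the bumped version, and both byte layouts are invented, and the real CrawlDatum format is not shown here. Only the idea of branching on the stored version (with 6 meaning the old format) comes from the comment.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical record format, not the real CrawlDatum layout.
class VersionedRecordSketch {
    static final byte OLD_VERSION = 6; // old metadata layout (version number from the comment)
    static final byte CUR_VERSION = 7; // assumed value of the bumped version

    // Writes the version byte, then the metadata in the layout for that version.
    static byte[] write(byte version, Map<String, String> meta) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeByte(version);
            if (version == OLD_VERSION) {
                out.writeInt(meta.size());   // made-up old layout: int count
            } else {
                out.writeShort(meta.size()); // made-up new layout: short count
            }
            for (Map.Entry<String, String> e : meta.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the version byte first and picks the matching reader for the rest.
    static Map<String, String> read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            byte version = in.readByte();
            int count;
            if (version == OLD_VERSION) {
                count = in.readInt();        // compatibility path for old data
            } else if (version == CUR_VERSION) {
                count = in.readShort();
            } else {
                throw new IllegalArgumentException("unsupported version " + version);
            }
            Map<String, String> meta = new LinkedHashMap<>();
            for (int i = 0; i < count; i++) {
                String key = in.readUTF();
                String value = in.readUTF();
                meta.put(key, value);
            }
            return meta;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("score", "1.0");
        System.out.println(read(write(OLD_VERSION, meta)));
        System.out.println(read(write(CUR_VERSION, meta)));
    }
}
```

New data is always written in the current layout, so old records migrate to the new format the first time they are read and rewritten.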
[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666046#action_12666046 ]
Hudson commented on NUTCH-676:
------------------------------
Integrated in Nutch-trunk #701 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/701/])
- MapWritable is written inefficiently and confusingly.
[jira] Updated: (NUTCH-676) MapWritable is written inefficiently and confusingly
Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Todd Lipcon updated NUTCH-676:
------------------------------
Attachment: 0001-NUTCH-676-Replace-MapWritable-implementation-with-t.patch
NUTCH-676: Replace MapWritable implementation with the one from Hadoop, but retaining old class IDs from nutch
Change to the test because the test assumes broken behavior in MapWritable
[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly
Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659961#action_12659961 ]
Todd Lipcon commented on NUTCH-676:
-----------------------------------
Oops - please disregard above patch - it breaks backwards compatibility. Will send in a new one that is compatible later.
[jira] Updated: (NUTCH-676) MapWritable is written inefficiently and confusingly
Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney updated NUTCH-676:
--------------------------------
Attachment: NUTCH-676_v3.patch
New patch.
It seems we have to create a new MapWritable in every CrawlDatum#readFields call, otherwise we run into a similar problem (in nutch's MapWritable).
Also, updates CrawlDatum#equals and CrawlDatum#hashCode as hadoop's MapWritable does not have an equals method (whereas nutch's MapWritable compares every entry with the other MapWritable).
I am going to commit this soon if no objections.
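For readers wondering what the equals/hashCode change amounts to, here is a minimal sketch in plain Java. BareMap is a hypothetical stand-in for a map type that does not override equals (as the comment says of hadoop's MapWritable), and DatumSketch is a made-up holder class. This is not the code from the patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical map wrapper that inherits identity-based equals/hashCode from Object.
class BareMap {
    private final Map<String, String> entries = new HashMap<>();

    void put(String key, String value) {
        entries.put(key, value);
    }

    Set<Map.Entry<String, String>> entrySet() {
        return entries.entrySet();
    }
}

// Hypothetical holder that compares its metadata by content.
class DatumSketch {
    final BareMap meta = new BareMap();

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof DatumSketch)) {
            return false;
        }
        DatumSketch other = (DatumSketch) o;
        // Compare the entry sets, since BareMap.equals would only test identity.
        return meta.entrySet().equals(other.meta.entrySet());
    }

    @Override
    public int hashCode() {
        // A Set's hash code is the sum of its entries' hash codes, so it does not
        // depend on iteration order and stays consistent with equals above.
        return meta.entrySet().hashCode();
    }

    public static void main(String[] args) {
        DatumSketch a = new DatumSketch();
        DatumSketch b = new DatumSketch();
        a.meta.put("k", "v");
        b.meta.put("k", "v");
        System.out.println(a.equals(b));           // content comparison
        System.out.println(a.meta.equals(b.meta)); // identity comparison
    }
}
```

Deriving hashCode from the same entry set keeps the two methods consistent, which matters as soon as such objects are used as keys in hash-based collections.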
[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672870#action_12672870 ]
Hudson commented on NUTCH-676:
------------------------------
Integrated in Nutch-trunk #722 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/722/])
NUTCH-683 - broke CrawlDbMerger