Posted to dev@nutch.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2008/12/30 21:56:44 UTC

[jira] Created: (NUTCH-676) MapWritable is written inefficiently and confusingly

MapWritable is written inefficiently and confusingly
----------------------------------------------------

                 Key: NUTCH-676
                 URL: https://issues.apache.org/jira/browse/NUTCH-676
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 0.9.0
            Reporter: Todd Lipcon
            Priority: Minor


The MapWritable implementation in o.a.n.crawl is written confusingly - it maintains its own internal linked list, which I think may have a bug somewhere (I'm getting an NPE in certain cases in the code, though it's hard to track down).

Can anyone comment as to why MapWritable is written the way it is, rather than just using a HashMap or a LinkedHashMap if consistent ordering is important? I imagine that would improve performance.

What about just using the Hadoop MapWritable? Obviously that would break some backwards compatibility but it may be a good idea at some point to reduce confusion (I didn't realize that Nutch had its own impl until a few minutes ago)
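
As a small illustration of the HashMap vs. LinkedHashMap point above, the following sketch uses only the JDK (no Nutch or Hadoop classes). The class name OrderingSketch and the sample keys are made up for the example. It shows that LinkedHashMap iterates in insertion order, while HashMap gives no ordering guarantee.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OrderingSketch {

    // Inserts the keys in the given order and returns the order seen on iteration.
    static List<String> iterationOrder(Map<String, String> map, String... keys) {
        for (String key : keys) {
            map.put(key, "value-of-" + key);
        }
        return new ArrayList<String>(map.keySet());
    }

    public static void main(String[] args) {
        // LinkedHashMap: iteration order equals insertion order.
        System.out.println(iterationOrder(new LinkedHashMap<String, String>(), "zeta", "alpha", "mid"));
        // HashMap: iteration order is unspecified and may differ from insertion order.
        System.out.println(iterationOrder(new HashMap<String, String>(), "zeta", "alpha", "mid"));
    }
}
```

If serialized output has to be byte-for-byte reproducible, insertion order alone may not be enough; a sorted map would be another option, which this sketch does not cover.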

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Closed: (NUTCH-676) MapWritable is written inefficiently and confusingly

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-676.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0
         Assignee: Doğacan Güney

Patch committed as of rev. 736385. 

> MapWritable is written inefficiently and confusingly
> ----------------------------------------------------
>
>                 Key: NUTCH-676
>                 URL: https://issues.apache.org/jira/browse/NUTCH-676
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Todd Lipcon
>            Assignee: Doğacan Güney
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: 0001-NUTCH-676-Replace-MapWritable-implementation-with-t.patch, NUTCH-676_v2.patch, NUTCH-676_v3.patch
>
>
> The MapWritable implementation in o.a.n.crawl is written confusingly - it maintains its own internal linked list which I think may have a bug somewhere (I'm getting an NPE in certain cases in the code, though it's hard to track down)
> Can anyone comment as to why MapWritable is written the way it is, rather than just using a HashMap or a LinkedHashMap if consistent ordering is important? I imagine that would improve performance.
> What about just using the Hadoop MapWritable? Obviously that would break some backwards compatibility but it may be a good idea at some point to reduce confusion (I didn't realize that Nutch had its own impl until a few minutes ago)



[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665921#action_12665921 ] 

Doğacan Güney commented on NUTCH-676:
-------------------------------------

No, actually it is because we should create a new MapWritable in CrawlDatum#readFields. MapWritable "remembers" the id-class mappings it has already written and does not rewrite them in a later #write call. So if the order in which you output keys in map differs from the order in which you receive keys in reduce, it fails: MapWritable tries to map an id to a class, but that id-class mapping has not been read yet.

Sorry if the description is not very clear. 
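
The failure mode described above can be sketched with a toy model. This is not the Nutch or Hadoop code: Encoder, Decoder, and the record layout are hypothetical stand-ins that only mimic the "declare an id-class mapping once, then reuse the id" behaviour. The sketch writes two records and reads them back in the opposite order, once with a reused encoder and once with a fresh encoder per record (standing in for a new MapWritable per readFields call).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

class IdClassMappingSketch {

    // Hypothetical encoder: emits an id-class mapping only the first time a class is written.
    static class Encoder {
        private final Map<String, Byte> declared = new HashMap<String, Byte>();

        byte[] write(String className, String value) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            Byte id = declared.get(className);
            if (id == null) {
                id = (byte) (declared.size() + 1);
                declared.put(className, id);
                out.writeBoolean(true);   // a mapping declaration follows
                out.writeByte(id);
                out.writeUTF(className);
            } else {
                out.writeBoolean(false);  // mapping is assumed to be known already
                out.writeByte(id);
            }
            out.writeUTF(value);
            return bytes.toByteArray();
        }
    }

    // Hypothetical decoder: can resolve an id only after it has read its declaration.
    static class Decoder {
        private final Map<Byte, String> known = new HashMap<Byte, String>();

        String read(byte[] record) throws IOException {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
            boolean hasMapping = in.readBoolean();
            byte id = in.readByte();
            if (hasMapping) {
                known.put(id, in.readUTF());
            }
            String className = known.get(id);
            if (className == null) {
                throw new IllegalStateException("no class mapping read yet for id " + id);
            }
            return className + ":" + in.readUTF();
        }
    }

    // Writes two records, then reads them in reverse order; returns true if both decode.
    static boolean decodesOutOfOrder(boolean freshEncoderPerRecord) {
        try {
            Encoder shared = new Encoder();
            byte[] first = (freshEncoderPerRecord ? new Encoder() : shared).write("Text", "a");
            byte[] second = (freshEncoderPerRecord ? new Encoder() : shared).write("Text", "b");
            Decoder decoder = new Decoder();
            try {
                decoder.read(second);
                decoder.read(first);
                return true;
            } catch (IllegalStateException e) {
                return false;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("reused encoder decodes out of order: " + decodesOutOfOrder(false));
        System.out.println("fresh encoder decodes out of order:  " + decodesOutOfOrder(true));
    }
}
```

With the reused encoder only the first record carries the mapping, so reading the second record first fails. With a fresh encoder every record is self-describing, at the cost of repeating the mapping in each record.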



[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665889#action_12665889 ] 

Todd Lipcon commented on NUTCH-676:
-----------------------------------

Hmm, I can't seem to find the bug I thought I remembered. Maybe the bug I ran into was actually due to the hashCode/equals issue.

If a crawl seems to go OK, I'm all for this.



[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665855#action_12665855 ] 

Todd Lipcon commented on NUTCH-676:
-----------------------------------

Have you run some full crawls yet? I wrote pretty much this same patch but ran into a lot of issues when actually trying to run it in production. It seems like there's a bug in nutch's MapWritable where the classes of the keys are used for keys rather than the actual keys. I'll try to hunt down what I'm referring to and post back later today.



[jira] Updated: (NUTCH-676) MapWritable is written inefficiently and confusingly

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-676:
--------------------------------

    Attachment: NUTCH-676_v2.patch

Patch for the issue.

Bumps CrawlDatum version and starts using o.a.h.io.MapWritable in CrawlDatum. Compatibility is preserved by keeping nutch's MapWritable around and adding extra code for reading from nutch's MapWritable if the CrawlDatum version is 6.

Also changes CrawlDatum#toString as hadoop's MapWritable does not have a good toString method.
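
The version-based compatibility approach described above follows a common pattern: write a version byte first, then choose the reader from it. The sketch below is not the patch itself. Both record layouts, the class name VersionedReadSketch, and the value 7 for the bumped version are assumptions made for illustration; only the legacy version 6 comes from the comment above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

class VersionedReadSketch {
    static final byte LEGACY_VERSION = 6;   // version named in the comment above
    static final byte CURRENT_VERSION = 7;  // assumption: the bumped version

    // Hypothetical legacy layout: version, then key/value pairs ended by an empty key
    // (so keys must be non-empty in this toy format).
    static byte[] writeLegacy(Map<String, String> meta) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeByte(LEGACY_VERSION);
            for (Map.Entry<String, String> entry : meta.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeUTF(entry.getValue());
            }
            out.writeUTF("");
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical current layout: version, entry count, then key/value pairs.
    static byte[] writeCurrent(Map<String, String> meta) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeByte(CURRENT_VERSION);
            out.writeInt(meta.size());
            for (Map.Entry<String, String> entry : meta.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeUTF(entry.getValue());
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads either layout, dispatching on the leading version byte.
    static Map<String, String> read(byte[] record) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
            byte version = in.readByte();
            Map<String, String> meta = new LinkedHashMap<String, String>();
            if (version == LEGACY_VERSION) {
                for (String key = in.readUTF(); !key.isEmpty(); key = in.readUTF()) {
                    meta.put(key, in.readUTF());
                }
            } else if (version == CURRENT_VERSION) {
                int count = in.readInt();
                for (int i = 0; i < count; i++) {
                    String key = in.readUTF();
                    meta.put(key, in.readUTF());
                }
            } else {
                throw new IOException("unsupported version " + version);
            }
            return meta;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<String, String>();
        meta.put("score", "1.0");
        System.out.println(read(writeLegacy(meta)));
        System.out.println(read(writeCurrent(meta)));
    }
}
```

Only the reader needs to know about the old layout; new data is always written in the current one, so old records are upgraded the next time they are rewritten.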



[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666046#action_12666046 ] 

Hudson commented on NUTCH-676:
------------------------------

Integrated in Nutch-trunk #701 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/701/])
     - MapWritable is written inefficiently and confusingly.




[jira] Updated: (NUTCH-676) MapWritable is written inefficiently and confusingly

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated NUTCH-676:
------------------------------

    Attachment: 0001-NUTCH-676-Replace-MapWritable-implementation-with-t.patch

    NUTCH-676: Replace MapWritable implementation with the one from Hadoop, but retaining old class IDs from nutch
    
    Change to the test because the test assumes broken behavior in MapWritable




[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659961#action_12659961 ] 

Todd Lipcon commented on NUTCH-676:
-----------------------------------

Oops - please disregard above patch - it breaks backwards compatibility. Will send in a new one that is compatible later.



[jira] Updated: (NUTCH-676) MapWritable is written inefficiently and confusingly

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-676:
--------------------------------

    Attachment: NUTCH-676_v3.patch

New patch.

It seems we have to create a new MapWritable in every CrawlDatum#readFields call; otherwise we run into a similar problem (in nutch's MapWritable).

Also, updates CrawlDatum#equals and CrawlDatum#hashCode as hadoop's MapWritable does not have an equals method (whereas nutch's MapWritable compares every entry with the other MapWritable).

I am going to commit this soon if no objections. 
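
To make the equals/hashCode point above concrete, here is a minimal sketch of entry-by-entry comparison when the map type itself only offers identity equality. It is not the code from the patch: OpaqueMeta and Record are hypothetical stand-ins for a map-like type without equals and for the class that owns it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

class EntryEqualitySketch {

    // Stand-in for a map-like type that inherits identity-based equals/hashCode.
    static class OpaqueMeta {
        private final Map<String, String> entries = new HashMap<String, String>();

        void put(String key, String value) {
            entries.put(Objects.requireNonNull(key), Objects.requireNonNull(value));
        }
        String get(String key) { return entries.get(key); }
        Set<String> keys() { return entries.keySet(); }
        int size() { return entries.size(); }
    }

    // Entry-by-entry comparison: same size, and every key maps to an equal value.
    static boolean sameEntries(OpaqueMeta a, OpaqueMeta b) {
        if (a.size() != b.size()) {
            return false;
        }
        for (String key : a.keys()) {
            if (!a.get(key).equals(b.get(key))) {
                return false;
            }
        }
        return true;
    }

    // Order-independent hash over the entries, consistent with sameEntries.
    static int entriesHash(OpaqueMeta meta) {
        int hash = 0;
        for (String key : meta.keys()) {
            hash += key.hashCode() ^ meta.get(key).hashCode();
        }
        return hash;
    }

    static class Record {
        final int status;
        final OpaqueMeta meta = new OpaqueMeta();

        Record(int status) { this.status = status; }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Record)) return false;
            Record other = (Record) o;
            return status == other.status && sameEntries(meta, other.meta);
        }

        @Override
        public int hashCode() {
            return 31 * status + entriesHash(meta);
        }
    }

    public static void main(String[] args) {
        Record a = new Record(1);
        Record b = new Record(1);
        a.meta.put("k", "v");
        b.meta.put("k", "v");
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode());
    }
}
```

The hash sums per-entry hashes so that two records with the same entries hash alike regardless of insertion order, which keeps hashCode consistent with equals.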



[jira] Commented: (NUTCH-676) MapWritable is written inefficiently and confusingly

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12672870#action_12672870 ] 

Hudson commented on NUTCH-676:
------------------------------

Integrated in Nutch-trunk #722 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/722/])
    NUTCH-683 -  broke CrawlDbMerger

