Posted to dev@nutch.apache.org by "Dan Rosher (Created) (JIRA)" <ji...@apache.org> on 2012/02/27 12:07:48 UTC

[jira] [Created] (NUTCH-1289) In distributed mode URL's are not partitioned

In distributed mode URL's are not partitioned
---------------------------------------------

                 Key: NUTCH-1289
                 URL: https://issues.apache.org/jira/browse/NUTCH-1289
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: nutchgora
            Reporter: Dan Rosher
             Fix For: nutchgora


In distributed mode, URLs are not partitioned to a specific machine, which means the politeness policy is voided.
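
To illustrate the report, here is a minimal sketch (not Nutch code; the class and method names are hypothetical) that contrasts choosing a partition from the whole URL with choosing it from the host only. Only the host-based variant guarantees that a single machine talks to a given host, which is what a per-host politeness delay relies on.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical demo, not Nutch code: two ways of choosing a partition for a URL. */
final class PolitenessDemo {

    private PolitenessDemo() {}

    /** Partition on the whole URL string: pages of one host may land on any partition. */
    static int partitionByUrl(String url, int numPartitions) {
        return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Partition on the host only: every page of a host lands on the same partition. */
    static int partitionByHost(String url, int numPartitions) {
        return (hostOf(url).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Lower-cased host, or the raw URL when no host can be parsed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return (host == null) ? url : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return url;
        }
    }

    /** Number of distinct partitions (machines) that would contact the given URLs. */
    static int partitionsTouched(List<String> urls, int numPartitions, boolean byHost) {
        Set<Integer> seen = new HashSet<>();
        for (String url : urls) {
            seen.add(byHost ? partitionByHost(url, numPartitions)
                            : partitionByUrl(url, numPartitions));
        }
        return seen.size();
    }
}
```

With host-based partitioning, partitionsTouched returns 1 for any list of URLs from the same host; with whole-URL hashing it can be larger, and each extra partition is another machine fetching from that host at the same time.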

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1289) In distributed mode URL's are not partitioned

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222324#comment-13222324 ] 

Ferdy Galema commented on NUTCH-1289:
-------------------------------------

Committed. 

Dan, could you verify this issue for closing?
                


        

[jira] [Updated] (NUTCH-1289) In distributed mode URL's are not partitioned

Posted by "Ferdy Galema (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1289:
--------------------------------

    Attachment: NUTCH-1289-v2.patch

Done with patch v2. It fixes the problem as described above. It also includes a minor improvement: the partition code is skipped entirely when there is just one partition (for example, in local mode).

It includes several tests, including the seed function, the different modes and signature partitioners.
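
The comment mentions a single-partition shortcut and a seed function without showing them. A minimal sketch of how the two could fit together follows. It is not the committed patch: the class name is hypothetical, and mixing the seed into the host hash with XOR is an assumption made for illustration.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical sketch, not the committed patch. */
final class SeededHostPartitioner {

    // Assumption: the seed is mixed into the hash so that different runs
    // can map hosts to partitions differently.
    private final int seed;

    SeededHostPartitioner(int seed) {
        this.seed = seed;
    }

    int getPartition(String url, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        if (numPartitions == 1) {
            return 0; // single partition (e.g. local mode): skip parsing and hashing
        }
        int hash = hostOf(url).hashCode() ^ seed;
        // Mask the sign bit so the modulo result is never negative.
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    /** Lower-cased host, or the raw URL when no host can be parsed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return (host == null) ? url : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return url;
        }
    }
}
```

The shortcut returns before any URL parsing happens, so a job with one partition pays nothing for the partitioner.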
                


        

[jira] [Commented] (NUTCH-1289) In distributed mode URL's are not partitioned

Posted by "Mathijs Homminga (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217172#comment-13217172 ] 

Mathijs Homminga commented on NUTCH-1289:
-----------------------------------------

Nice catch. The PartitionUrlByHost seems broken indeed.
I would suggest that we use the existing o.a.n.crawl.URLPartitioner class which has support for three URL partition modes (host, domain, IP) and which is used by the GeneratorJob too.

Pros: support for different partition modes in the Fetcher + no duplicate code.
Or is there a reason why the Fetcher has its own partition logic?

The URLPartitioner class is a Partitioner<SelectorEntry, WebPage> rather than a Partitioner<IntWritable, FetchEntry>, but you could perhaps extract a method and use it from both classes, or create one URLPartitioner with two specific inner classes for the Generator and Fetcher.
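
The second option above (one class holding the shared logic, with two inner classes for the different key/value types) could look roughly like the sketch below. To keep it compilable without Hadoop or Nutch, SimplePartitioner stands in for Hadoop's Partitioner, and SelectorKey and FetchItem are simplified placeholders for SelectorEntry and FetchEntry. All of these names, and the assumption about where each job carries the URL, are hypothetical. Only host mode is shown.

```java
import java.net.URI;
import java.util.Locale;

/** Stand-in for Hadoop's Partitioner so the sketch compiles on its own. */
interface SimplePartitioner<K, V> {
    int getPartition(K key, V value, int numPartitions);
}

/** Simplified placeholder for the generator's key type; the real class differs. */
final class SelectorKey {
    final String url;
    SelectorKey(String url) { this.url = url; }
}

/** Simplified placeholder for the fetcher's value type; the real class differs. */
final class FetchItem {
    final String url;
    FetchItem(String url) { this.url = url; }
}

final class SharedUrlPartitioner {

    private SharedUrlPartitioner() {}

    /** The single extracted method that both jobs delegate to. */
    static int partitionForUrl(String url, int numPartitions) {
        String key;
        try {
            String host = URI.create(url).getHost();
            key = (host == null) ? url : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            key = url; // unparseable URL: fall back to the raw string
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Generator side: assumed here to carry the URL in the key. */
    static final class ForGenerator implements SimplePartitioner<SelectorKey, Object> {
        @Override
        public int getPartition(SelectorKey key, Object value, int numPartitions) {
            return partitionForUrl(key.url, numPartitions);
        }
    }

    /** Fetcher side: assumed here to carry the URL in the value. */
    static final class ForFetcher implements SimplePartitioner<Integer, FetchItem> {
        @Override
        public int getPartition(Integer key, FetchItem value, int numPartitions) {
            return partitionForUrl(value.url, numPartitions);
        }
    }
}
```

Because both inner classes delegate to the same method, a URL's host maps to the same partition number in either job, and any additional partition mode only has to be implemented once.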

                


        

[jira] [Closed] (NUTCH-1289) In distributed mode URL's are not partitioned

Posted by "Ferdy Galema (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema closed NUTCH-1289.
-------------------------------

    Resolution: Fixed
    


        

[jira] [Commented] (NUTCH-1289) In distributed mode URL's are not partitioned

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217155#comment-13217155 ] 

Lewis John McGibbney commented on NUTCH-1289:
---------------------------------------------

Hi Dan, thanks for opening this issue and for the patch. Are you using trunk at all? If so, could you confirm whether this functionality is already running in trunk? If not, we can get a patch cooked up.
                


        

[jira] [Commented] (NUTCH-1289) In distributed mode URL's are not partitioned

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222982#comment-13222982 ] 

Hudson commented on NUTCH-1289:
-------------------------------

Integrated in Nutch-nutchgora #184 (See [https://builds.apache.org/job/Nutch-nutchgora/184/])
    NUTCH-1289 In distributed mode URL's are not partitioned (Revision 1297039)

     Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/crawl/GeneratorJob.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/crawl/URLPartitioner.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherJob.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/PartitionUrlByHost.java
* /nutch/branches/nutchgora/src/test/org/apache/nutch/crawl/TestURLPartitioner.java

                


        

[jira] [Commented] (NUTCH-1289) In distributed mode URL's are not partitioned

Posted by "Markus Jelsma (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217157#comment-13217157 ] 

Markus Jelsma commented on NUTCH-1289:
--------------------------------------

In trunk, records of the same queue end up in the same fetch list, which corresponds to a single mapper.
                


        

[jira] [Commented] (NUTCH-1289) In distributed mode URL's are not partitioned

Posted by "Dan Rosher (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222415#comment-13222415 ] 

Dan Rosher commented on NUTCH-1289:
-----------------------------------

Hi Ferdy,

Thanks for adding the tests, looks good to me,

Cheers,
Dan
                


        

[jira] [Updated] (NUTCH-1289) In distributed mode URL's are not partitioned

Posted by "Dan Rosher (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dan Rosher updated NUTCH-1289:
------------------------------

    Patch Info: Patch Available
    


        

[jira] [Commented] (NUTCH-1289) In distributed mode URL's are not partitioned

Posted by "Ferdy Galema (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13222214#comment-13222214 ] 

Ferdy Galema commented on NUTCH-1289:
-------------------------------------

This is a showstopper for the upcoming release. I will cook up a patch using your input and commit it ASAP.
                


        

[jira] [Commented] (NUTCH-1289) In distributed mode URL's are not partitioned

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217159#comment-13217159 ] 

Lewis John McGibbney commented on NUTCH-1289:
---------------------------------------------

Markus, what is your opinion as to which suits best? Or is it the case in Nutchgora that Dan's patch is more appropriate?
                


        

[jira] [Updated] (NUTCH-1289) In distributed mode URL's are not partitioned

Posted by "Dan Rosher (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dan Rosher updated NUTCH-1289:
------------------------------

    Attachment: NUTCH-1289.patch
    
