Posted to dev@nutch.apache.org by "Ruslan Ermilov (JIRA)" <ji...@apache.org> on 2007/11/28 16:57:43 UTC

[jira] Created: (NUTCH-584) urls missing from fetchlist

urls missing from fetchlist
---------------------------

                 Key: NUTCH-584
                 URL: https://issues.apache.org/jira/browse/NUTCH-584
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 0.9.0, 1.0.0
         Environment: FreeBSD 7.0, JDK 1.5.0, Nu
            Reporter: Ruslan Ermilov


When generating an initial set of ~100k URLs for fetching, I've noticed that some URLs are missing from the fetchlist.
The test case below has only 2 URLs, and I've used the FreeGenerator tool instead of the standard inject/generate,
which saves me time when experimenting. It doesn't matter whether I run it in clustered or local mode.

Somehow only one of two URLs ends up in the fetchlist:

$ rm -rf segments
$ cat urls/x
http://tkd.ru/
http://t-f.ru/
$ nutch org.apache.nutch.tools.FreeGenerator urls segments
$ nutch readseg -dump segments/* xxx -nocontent -noparse -noparsedata -noparsetext -nofetch
SegmentReader: dump segment: segments/20071128195720
SegmentReader: done
$ cat xxx/dump

Recno:: 0
URL:: http://tkd.ru/

CrawlDatum::
Version: 5
Status: 0 (unknown)
Fetch time: Wed Nov 28 19:57:20 GMT 2007
Modified time: Thu Jan 01 00:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 0.0 days
Score: 1.0
Signature: null
Metadata: null

$ 


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-584) urls missing from fetchlist

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559319#action_12559319 ] 

Andrzej Bialecki  commented on NUTCH-584:
-----------------------------------------

Thank you for the simple test case! I believe I found the problem - it was quite tricky.

During the final step of generation we partition the urls by host and then sort them by a simple hash(url). At least that was the intention. The HashComparator class produces many collisions in hash values, which by itself is ok, since we only need to roughly randomize the urls within a partition. However, this comparator is also used when sorting the data submitted to reduce(), so these collisions caused several distinct urls to compare as "equal".

Consequently Hadoop did what it was meant to do: it collected all values that matched "equal" keys under a single iterator, and then invoked Reducer.reduce() with a single key picked from among the "equal" keys, so all the other "equal" urls were dropped. At the same time we were still getting multiple records in the output fetchlist, because the default IdentityReducer produced as many output records as there were input values to reduce, all with the same key. The final result was that several CrawlDatum-s originally coming from different urls were stored under the same url.

This is a serious bug, which may have caused numerous problems in fetchlist generation, such as missing urls, multiple fetches of the same url, or CrawlDatum-s paired with wrong urls.
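
The effect described above can be reproduced outside Hadoop with a minimal sketch. Everything in it is an assumption made for illustration: the class name HashCollisionDemo, the hash() function (a 31-based hash that walks the url back to front, a stand-in rather than a copy of HashComparator), the run() helper that imitates the sort, the grouping of "equal" keys and an identity reduce, and the HASH_THEN_URL tie-break, which is one possible remedy and not necessarily what the attached patch does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class HashCollisionDemo {

    // Stand-in hash (assumption): 31-based, characters at the end of the url are processed first.
    static int hash(String url) {
        int h = 1;
        for (int i = url.length() - 1; i >= 0; i--) {
            h = 31 * h + url.charAt(i);
        }
        return h;
    }

    // Orders urls by hash only, so distinct urls with the same hash compare as equal.
    static final Comparator<String> HASH_ONLY =
        (a, b) -> Integer.compare(hash(a), hash(b));

    // Hypothetical remedy: break hash ties on the url itself, so only identical urls are equal.
    static final Comparator<String> HASH_THEN_URL =
        HASH_ONLY.thenComparing(Comparator.<String>naturalOrder());

    // Imitates sort + group + identity reduce: keys that compare as equal form one group,
    // and every value in the group is written out under the first key of that group.
    static List<String> run(List<String> urls, Comparator<String> cmp) {
        List<String> sorted = new ArrayList<>(urls);
        sorted.sort(cmp);
        List<String> out = new ArrayList<>();
        String groupKey = null;
        for (String url : sorted) {
            if (groupKey == null || cmp.compare(groupKey, url) != 0) {
                groupKey = url; // a new group starts here
            }
            out.add(groupKey); // one output record per input value, keyed by the group key
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://tkd.ru/", "http://t-f.ru/");
        System.out.println(run(urls, HASH_ONLY));
        System.out.println(run(urls, HASH_THEN_URL));
    }
}
```

With the two urls from the test case the stand-in hash happens to collide, so the hash-only run prints [http://tkd.ru/, http://tkd.ru/] (two records, one url lost), while the tie-breaking run prints [http://t-f.ru/, http://tkd.ru/].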



[jira] Resolved: (NUTCH-584) urls missing from fetchlist

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  resolved NUTCH-584.
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0
         Assignee: Andrzej Bialecki 

Patch applied (sans System.out.println ;) ) in rev. 612505. Thanks for the review and testing!



[jira] Commented: (NUTCH-584) urls missing from fetchlist

Posted by "Ruslan Ermilov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559569#action_12559569 ] 

Ruslan Ermilov commented on NUTCH-584:
--------------------------------------

Andrzej,

I've tested your patch both with the simple test case mentioned above and on real data (~100k urls). It now works as expected, thank you!



[jira] Commented: (NUTCH-584) urls missing from fetchlist

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559394#action_12559394 ] 

Doğacan Güney commented on NUTCH-584:
-------------------------------------

Andrzej,

Nice analysis :)

+1 for the patch (it seems you forgot a System.out.println there, though)



[jira] Closed: (NUTCH-584) urls missing from fetchlist

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-584.
-----------------------------------




[jira] Updated: (NUTCH-584) urls missing from fetchlist

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-584:
------------------------------------

    Attachment: generator.patch

Patch to address this problem - your test case executes fine with this patch. Please test.
