Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2015/11/01 13:19:27 UTC

[jira] [Reopened] (NUTCH-2155) Create a "crawl completeness" utility

     [ https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel reopened NUTCH-2155:
------------------------------------

Running the completion statistics on a CrawlDb throws an exception:
{noformat}
% nutch crawlcomplete
usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]
% nutch crawlcomplete ./crawl/crawldb completion_stats domain
Exception in thread "main" java.io.FileNotFoundException: File file:.../crawl/crawldb/old/data does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
{noformat}
I had to look into the code to figure out that the inputDirs parameter is expected to be a comma-separated list of CrawlDb sequence files. The following command works:
{noformat}
% nutch crawlcomplete ./crawl/crawldb/current completion_stats domain
{noformat}
All Nutch tools and utils operating on CrawlDb take just the bare path without the current/ subdirectory. Shouldn't the crawlcomplete command behave the same?
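
For illustration only, here is a minimal Python sketch of the behaviour suggested above: accept bare CrawlDb paths and resolve each one to its current/ subdirectory. The helper name resolve_crawldb_inputs is hypothetical and this is not the CrawlCompletionStats code; it only assumes the comma-separated inputDirs format and the current/ layout mentioned in this message.

```python
import posixpath
from typing import List

CURRENT_NAME = "current"  # CrawlDb subdirectory named in the message above


def resolve_crawldb_inputs(input_dirs: str) -> List[str]:
    """Hypothetical helper: map bare CrawlDb paths to their current/ subdirectory."""
    resolved = []
    for raw in input_dirs.split(","):
        path = raw.strip().rstrip("/")
        if not path:
            continue  # ignore empty entries such as a trailing comma
        # Paths that already point at current/ are left unchanged.
        if posixpath.basename(path) != CURRENT_NAME:
            path = posixpath.join(path, CURRENT_NAME)
        resolved.append(path)
    return resolved


if __name__ == "__main__":
    print(resolve_crawldb_inputs("./crawl/crawldb,./other/crawldb/current"))
```

With something along these lines, both the bare path and the explicit current/ path would be accepted, so existing invocations would keep working.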
Passing more than one CrawlDb may sometimes be useful. However, crawls (and their dbs) are usually disjoint. If they are not, the completeness statistics are probably incorrect because of duplicates.
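
A small Python sketch with made-up data shows how overlapping CrawlDbs could skew the totals. The status labels "fetched" and "unfetched" are simplified placeholders rather than real CrawlDb status values, and the merge rule (a fetched record wins over an unfetched one for the same URL) is an assumption made for this example only.

```python
from collections import Counter


def status_totals(records):
    """Count records per status label without any deduplication."""
    return Counter(status for _url, status in records)


def merge_by_url(*dbs):
    """Keep one record per URL; a 'fetched' record is never overwritten (assumed rule)."""
    merged = {}
    for db in dbs:
        for url, status in db:
            if merged.get(url) != "fetched":
                merged[url] = status
    return list(merged.items())


# Two hypothetical CrawlDbs that both contain http://example.org/b
db_a = [("http://example.org/a", "fetched"), ("http://example.org/b", "unfetched")]
db_b = [("http://example.org/b", "fetched"), ("http://example.org/c", "unfetched")]

naive = status_totals(db_a + db_b)               # the shared URL is counted twice
deduped = status_totals(merge_by_url(db_a, db_b))

if __name__ == "__main__":
    print("naive:  ", dict(naive))
    print("deduped:", dict(deduped))
```

In this toy case the naive count reports one unfetched URL too many, because the shared URL appears once as unfetched and once as fetched.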

> Create a "crawl completeness" utility
> -------------------------------------
>
>                 Key: NUTCH-2155
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2155
>             Project: Nutch
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 1.10
>            Reporter: Michael Joyce
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.11
>
>
> I've found it useful to have a tool for dumping some "completeness" information from a crawl, similar to what domainstats does but including fetched and unfetched counts per domain/host. This is especially nice when doing vertical crawls over a few domains, or just to see how much of a host/domain you've covered with your crawl so far.
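
As a rough illustration of the kind of output described in the issue, the following Python sketch counts fetched and unfetched URLs per host or per domain. It is not the Nutch implementation: the function names and the record format (URL, fetched flag) are hypothetical, and the domain is approximated by the last two host labels, which is wrong for suffixes such as co.uk.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_key(url, mode):
    """Return the host, or a crude domain approximation, for a URL."""
    host = urlsplit(url).hostname or ""
    if mode == "host":
        return host
    if mode == "domain":
        # Crude assumption: domain = last two labels of the host name.
        return ".".join(host.split(".")[-2:])
    raise ValueError("mode must be 'host' or 'domain'")


def completion_stats(records, mode="host"):
    """Count fetched/unfetched URLs per host or domain."""
    stats = defaultdict(lambda: {"fetched": 0, "unfetched": 0})
    for url, fetched in records:
        stats[group_key(url, mode)]["fetched" if fetched else "unfetched"] += 1
    return dict(stats)


if __name__ == "__main__":
    sample = [
        ("http://www.example.org/a", True),
        ("http://blog.example.org/b", False),
        ("http://www.example.org/c", False),
    ]
    print(completion_stats(sample, "host"))
    print(completion_stats(sample, "domain"))
```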



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [jira] [Created] (NUTCH-2155) Create a "crawl completeness" utility

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Hey Aron, it isn’t yet - @MikeJ and @Sujen want to give it a whack?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Aron Ahmadia <aa...@continuum.io>
Reply-To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Date: Sunday, November 1, 2015 at 7:01 AM
To: "dev@nutch.apache.org" <de...@nutch.apache.org>
Subject: Re: [jira] [Created] (NUTCH-2155) Create a "crawl completeness"
utility

>
>
>
>Is this exposed to the REST API?  I might be able to plot this in memex
>explorer. 
>

Re: [jira] [Created] (NUTCH-2155) Create a "crawl completeness" utility

Posted by Aron Ahmadia <aa...@continuum.io>.

Is this exposed to the REST API?  I might be able to plot this in memex
explorer.


-- 
_______________________________

Aron Ahmadia
Computational and Data Scientist

Continuum Analytics <http://continuum.io>