Posted to dev@nutch.apache.org by Aron Ahmadia <aa...@continuum.io> on 2015/11/01 15:01:12 UTC

Re: [jira] [Created] (NUTCH-2155) Create a "crawl completeness" utility

Is this exposed through the REST API? I might be able to plot this in Memex
Explorer.

On Sunday, November 1, 2015, Sebastian Nagel (JIRA) <ji...@apache.org> wrote:

>
>      [
> https://issues.apache.org/jira/browse/NUTCH-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Sebastian Nagel reopened NUTCH-2155:
> ------------------------------------
>
> When running the completion statistics on a CrawlDb, an exception is thrown
> {noformat}
> % nutch crawlcomplete
> usage: CrawlCompletionStats inputDirs outDir host|domain [numOfReducer]
> % nutch crawlcomplete ./crawl/crawldb completion_stats domain
> Exception in thread "main" java.io.FileNotFoundException: File
> file:.../crawl/crawldb/old/data does not exist
>         at
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
> {noformat}
> I had to look into the code to figure out that the parameter
> inputDirs is expected to be a comma-separated list of CrawlDb sequence files.
> The following command works:
> {noformat}
> % nutch crawlcomplete ./crawl/crawldb/current completion_stats domain
> {noformat}
> All Nutch tools and utils operating on CrawlDb take just the bare path
> without the current/ subdirectory. Shouldn't the crawlcomplete command
> behave the same?
> Passing more than one CrawlDb may be useful sometimes. However, crawls
> (and their dbs) are usually disjoint. If they are not, the completeness
> statistics are probably incorrect due to duplicates.
>
> > Create a "crawl completeness" utility
> > -------------------------------------
> >
> >                 Key: NUTCH-2155
> >                 URL: https://issues.apache.org/jira/browse/NUTCH-2155
> >             Project: Nutch
> >          Issue Type: Improvement
> >          Components: util
> >    Affects Versions: 1.10
> >            Reporter: Michael Joyce
> >            Assignee: Chris A. Mattmann
> >              Labels: memex
> >             Fix For: 1.11
> >
> >
> > I've found it useful to have a tool for dumping some "completeness"
> > information from a crawl, similar to what domainstats does but including
> > fetched and unfetched counts per domain/host. This is especially nice when
> > doing vertical crawls over a few domains, or just to see how much of a
> > host/domain you've covered with your crawl so far.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>

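For illustration only, the path handling Sebastian suggests above (accept the
bare CrawlDb path and add current/ when it is missing) could look roughly like
the sketch below. This is not the CrawlCompletionStats code: the class name
CrawlDbPaths and the method resolveInputDirs are hypothetical, and plain
java.nio paths stand in for Hadoop paths so the example is self-contained.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: shows one way a tool could accept
// bare CrawlDb paths in a comma-separated inputDirs argument.
class CrawlDbPaths {
    // Name of the subdirectory holding the active CrawlDb data.
    static final String CURRENT = "current";

    // Split the comma-separated argument and make sure every entry
    // points at the current/ subdirectory of its CrawlDb.
    static List<Path> resolveInputDirs(String inputDirs) {
        List<Path> resolved = new ArrayList<>();
        for (String dir : inputDirs.split(",")) {
            String trimmed = dir.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore empty entries such as a trailing comma
            }
            Path path = Paths.get(trimmed);
            Path last = path.getFileName();
            boolean alreadyCurrent =
                last != null && last.toString().equals(CURRENT);
            // Leave explicit .../current paths alone, extend bare ones.
            resolved.add(alreadyCurrent ? path : path.resolve(CURRENT));
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Both spellings of the argument end up at the same directory.
        System.out.println(resolveInputDirs("./crawl/crawldb"));
        System.out.println(resolveInputDirs("./crawl/crawldb/current"));
    }
}
```

With a check like this, both invocations shown in the report would read from
the same directory, and a path that already ends in current/ keeps working.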

-- 
_______________________________

Aron Ahmadia
Computational and Data Scientist

[image: Continuum Analytics] <http://continuum.io>

Re: [jira] [Created] (NUTCH-2155) Create a "crawl completeness" utility

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Hey Aron, it isn’t yet - @MikeJ and @Sujen want to give it a whack?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




