Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2005/10/16 07:14:44 UTC

[jira] Created: (NUTCH-114) getting number of urls and links from crawldb

getting number of urls and links from crawldb
---------------------------------------------

         Key: NUTCH-114
         URL: http://issues.apache.org/jira/browse/NUTCH-114
     Project: Nutch
        Type: New Feature
    Versions: 0.8-dev    
    Reporter: Stefan Groschupf
    Priority: Minor
     Fix For: 0.8-dev


We need a tool that provides basic statistics about the crawldb.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-114) getting number of urls and links from crawldb

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-114?page=all ]

Stefan Groschupf updated NUTCH-114:
-----------------------------------

    Attachment: CrawlDbStat.java

A class that counts entries and links in the crawldb. It does not use MapReduce, but it is fast.



[jira] Updated: (NUTCH-114) getting number of urls and links from crawldb

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-114?page=all ]

Stefan Groschupf updated NUTCH-114:
-----------------------------------

    Attachment: CrawlDbStatMapper.java

A class that counts entries and links in a crawldb. This tool uses MapReduce, so it scales well,
but it is much slower than the first tool, which runs locally.
Please add one of these tools to the Nutch sources. :-)



[jira] Updated: (NUTCH-114) getting number of urls and links from crawldb

Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-114?page=all ]

Stefan Groschupf updated NUTCH-114:
-----------------------------------

    Attachment: CrawlDbStatMapper.java

As discussed, now with UTF8 keys and the text-based output format.



[jira] Commented: (NUTCH-114) getting number of urls and links from crawldb

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-114?page=comments#action_12332267 ] 

Doug Cutting commented on NUTCH-114:
------------------------------------

You could use UTF8 as the output key type, map to keys like "links" and "entries", and use TextOutputFormat. Then the output would be a text file with the link and entry counts.
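
To make the suggestion concrete, here is a minimal in-memory sketch of the idea. It is not the attached CrawlDbStatMapper.java and it does not use the Nutch or MapReduce APIs: the CrawlEntry and Pair classes, the method names, and the sample URLs are all hypothetical, chosen only to show how mapping every record to the fixed keys "entries" and "links" and summing per key yields a small text result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CrawlDbStatSketch {

    /** Hypothetical stand-in for one crawldb record: a URL and its outgoing links. */
    public static final class CrawlEntry {
        final String url;
        final List<String> outlinks;

        public CrawlEntry(String url, List<String> outlinks) {
            this.url = url;
            this.outlinks = outlinks;
        }
    }

    /** A (key, count) pair as emitted by the map step. */
    public static final class Pair {
        final String key;
        final long count;

        public Pair(String key, long count) {
            this.key = key;
            this.count = count;
        }
    }

    /** Map step: each record adds 1 under "entries" and its link count under "links". */
    public static List<Pair> map(CrawlEntry entry) {
        List<Pair> out = new ArrayList<>();
        out.add(new Pair("entries", 1L));
        out.add(new Pair("links", entry.outlinks.size()));
        return out;
    }

    /** Reduce step: sum the counts emitted under each key. */
    public static Map<String, Long> reduce(List<Pair> pairs) {
        Map<String, Long> totals = new TreeMap<>();
        for (Pair p : pairs) {
            totals.merge(p.key, p.count, Long::sum);
        }
        return totals;
    }

    /** Text output: one "key<TAB>count" line per key, in sorted key order. */
    public static String format(Map<String, Long> totals) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    /** Runs map over every record, then reduce, then formats the totals as text. */
    public static String run(List<CrawlEntry> db) {
        List<Pair> emitted = new ArrayList<>();
        for (CrawlEntry entry : db) {
            emitted.addAll(map(entry));
        }
        return format(reduce(emitted));
    }

    public static void main(String[] args) {
        // Made-up sample data: three records with 2, 1 and 0 outgoing links.
        List<CrawlEntry> db = List.of(
            new CrawlEntry("http://example.org/a",
                List.of("http://example.org/b", "http://example.org/c")),
            new CrawlEntry("http://example.org/b", List.of("http://example.org/a")),
            new CrawlEntry("http://example.org/c", List.of()));
        System.out.print(run(db));
    }
}
```

Because only two distinct keys are ever emitted, the reduce output stays tiny no matter how large the input is, which is why a plain text file is a convenient result format here.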



[jira] Resolved: (NUTCH-114) getting number of urls and links from crawldb

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-114?page=all ]
     
Andrzej Bialecki  resolved NUTCH-114:
-------------------------------------

    Resolution: Fixed

Applied with changes. Thanks!
