Posted to dev@nutch.apache.org by "Stefan Groschupf (JIRA)" <ji...@apache.org> on 2005/10/16 07:14:44 UTC
[jira] Created: (NUTCH-114) getting number of urls and links from crawldb
getting number of urls and links from crawldb
---------------------------------------------
Key: NUTCH-114
URL: http://issues.apache.org/jira/browse/NUTCH-114
Project: Nutch
Type: New Feature
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
Fix For: 0.8-dev
We need a tool that provides basic statistics about the crawldb.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-114) getting number of urls and links from crawldb
Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-114?page=all ]
Stefan Groschupf updated NUTCH-114:
-----------------------------------
Attachment: CrawlDbStat.java
A class that counts entries and links in the crawldb. It does not use map reduce, but it is fast.
[jira] Updated: (NUTCH-114) getting number of urls and links from crawldb
Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-114?page=all ]
Stefan Groschupf updated NUTCH-114:
-----------------------------------
Attachment: CrawlDbStatMapper.java
A class that counts entries and links in a crawldb. This tool uses map reduce, so it scales well,
but it is much slower than the first tool, which runs locally.
Please add one of these tools to the Nutch sources. :-)
[jira] Updated: (NUTCH-114) getting number of urls and links from crawldb
Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-114?page=all ]
Stefan Groschupf updated NUTCH-114:
-----------------------------------
Attachment: CrawlDbStatMapper.java
As discussed, now with UTF8 keys and the text-based output format.
[jira] Commented: (NUTCH-114) getting number of urls and links from crawldb
Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-114?page=comments#action_12332267 ]
Doug Cutting commented on NUTCH-114:
------------------------------------
You could use UTF8 as the output key type, map to keys like "links" and "entries", and use TextOutputFormat. Then the output would be a text file with the link and entry counts.
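The idea in this comment can be sketched roughly as follows. This is plain Java that only imitates the map and reduce steps in memory; it does not use the Nutch mapred classes, UTF8, or TextOutputFormat. The class name CrawlDbStatSketch, the url-to-outlink-count input map, and the tab-separated "key<TAB>count" line format are assumptions made for illustration, not the code attached to this issue.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: counts entries and links with a map step and a reduce step.
public class CrawlDbStatSketch {

    // Map step: for every record emit ("entries", 1) and ("links", outlink count).
    // The input map (URL -> number of outlinks) is an assumed stand-in for the crawldb.
    static List<Map.Entry<String, Long>> map(Map<String, Integer> crawlDb) {
        List<Map.Entry<String, Long>> emitted = new ArrayList<>();
        for (Map.Entry<String, Integer> record : crawlDb.entrySet()) {
            emitted.add(new AbstractMap.SimpleImmutableEntry<>("entries", 1L));
            emitted.add(new AbstractMap.SimpleImmutableEntry<>("links", (long) record.getValue()));
        }
        return emitted;
    }

    // Reduce step: sum all values that share the same key.
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> emitted) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> pair : emitted) {
            totals.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return totals;
    }

    // Text output: one "key<TAB>count" line per key (assumed layout).
    static String format(Map<String, Long> totals) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Long> total : totals.entrySet()) {
            out.append(total.getKey()).append('\t').append(total.getValue()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data, not real crawl results.
        Map<String, Integer> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://a.example/", 2);
        crawlDb.put("http://b.example/", 0);
        crawlDb.put("http://c.example/", 5);
        System.out.print(format(reduce(map(crawlDb))));
    }
}
```

Because every record maps to the same two string keys, the reduce step only ever sees two groups, and the final text output has one line per key.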
[jira] Resolved: (NUTCH-114) getting number of urls and links from crawldb
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-114?page=all ]
Andrzej Bialecki resolved NUTCH-114:
-------------------------------------
Resolution: Fixed
Applied with changes. Thanks!