Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2010/02/01 15:33:51 UTC
[jira] Created: (NUTCH-784) CrawlDBScanner
CrawlDBScanner
---------------
Key: NUTCH-784
URL: https://issues.apache.org/jira/browse/NUTCH-784
Project: Nutch
Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Julien Nioche
Attachments: NUTCH-784.patch
The attached patch contains a utility which dumps all crawldb entries whose URL matches a regular expression. The dump mechanism of the crawldb reader is not very useful on large crawldbs, as the output can be extremely large, and the -url option does not help when we don't know which URL we want to look at.
The CrawlDBScanner can generate either a text representation of the CrawlDatum entries or binary objects which can then be used as a new crawldb.
Usage: CrawlDBScanner <crawldb> <output> <regex> [-s <status>] [-text]
regex: regular expression applied to the crawldb key (the URL)
-s status: constraint on the status of the crawldb entries, e.g. db_fetched, db_unfetched
-text: if this parameter is used, the output is written with TextOutputFormat; otherwise the tool generates a 'normal' crawldb with MapFileOutputFormat
For instance, the command below:
./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* -s db_fetched -text
will generate a text file /tmp/amazon-dump containing all the crawldb entries whose URL matches the regexp .+amazon.com.* and whose status is db_fetched.
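To illustrate the selection described above, here is a minimal, self-contained sketch of the per-entry filtering logic: keep a record only if its URL key matches the regex and, when a -s constraint is given, its status matches too. The class and method names are hypothetical, not taken from the patch, and the real tool runs this inside a MapReduce job rather than in plain Java.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of CrawlDBScanner's filtering rule, outside MapReduce.
public class CrawlDbFilterSketch {

    private final Pattern urlPattern;
    private final String requiredStatus; // e.g. "db_fetched"; null means no -s constraint

    public CrawlDbFilterSketch(String regex, String status) {
        this.urlPattern = Pattern.compile(regex);
        this.requiredStatus = status;
    }

    // Returns true if the entry would be written to the output.
    public boolean accept(String url, String status) {
        if (!urlPattern.matcher(url).matches()) {
            return false; // URL key does not match the regex
        }
        return requiredStatus == null || requiredStatus.equalsIgnoreCase(status);
    }

    public static void main(String[] args) {
        CrawlDbFilterSketch filter =
            new CrawlDbFilterSketch(".+amazon.com.*", "db_fetched");
        // Matches regex and status: kept.
        System.out.println(filter.accept("http://www.amazon.com/dp/123", "db_fetched"));
        // URL does not match the regex: dropped.
        System.out.println(filter.accept("http://example.org/", "db_fetched"));
        // URL matches but status differs: dropped.
        System.out.println(filter.accept("http://www.amazon.com/", "db_unfetched"));
    }
}
```

Note that the regex is anchored implicitly by Pattern.matches(), which is why the example pattern uses a leading .+ and trailing .* to match anywhere in the URL.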
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-784) CrawlDBScanner
Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850896#action_12850896 ]
Andrzej Bialecki commented on NUTCH-784:
-----------------------------------------
This should have been reviewed first. I don't question the usefulness of this class, but I think it should have been added as an option to CrawlDbReader. As it stands, we get a new tool with a cryptic name that performs a variant of what an existing tool already does...
[jira] Updated: (NUTCH-784) CrawlDBScanner
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-784:
--------------------------------
Fix Version/s: 1.1
[jira] Commented: (NUTCH-784) CrawlDBScanner
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851238#action_12851238 ]
Hudson commented on NUTCH-784:
------------------------------
Integrated in Nutch-trunk #1111 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1111/])
: CrawlDBScanner
[jira] Updated: (NUTCH-784) CrawlDBScanner
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-784:
--------------------------------
Attachment: NUTCH-784.patch
[jira] Closed: (NUTCH-784) CrawlDBScanner
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche closed NUTCH-784.
-------------------------------
Resolution: Fixed
Committed revision 928746