Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2013/03/20 19:05:46 UTC
[Nutch Wiki] Update of "bin/nutch readdb" by kiranchitturi
The "bin/nutch readdb" page has been changed by kiranchitturi:
http://wiki.apache.org/nutch/bin/nutch%20readdb
New page:
readdb is an alias for org.apache.nutch.crawl.CrawlDbReader.
CrawlDbReader implements all the read-only parts of accessing our web database. In other words, it is our read utility for the crawldb.
Usage:
{{{
bin/nutch readdb <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
}}}
'''<crawldb>''': The location of the crawldb directory we wish to read and obtain information from.
'''-stats''': This prints the overall statistics to System.out.
'''-dump <out_dir>''': Enables us to dump the whole crawldb to a text file in any <out_dir> we wish to specify.
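'''[-regex <expr>]''': Optional, used with -dump. Only records matching the regular expression <expr> are dumped.
'''[-status <status>]''': Optional, used with -dump. Only records whose CrawlDatum status is <status> are dumped.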
'''-topN <nnnn> <out_dir> [<min>]''': This dumps the top <nnnn> urls, sorted by score, to any <out_dir> we wish to specify. If the optional [<min>] parameter is passed, the reader skips records with scores below this particular value, which can significantly improve performance.
'''-url <url>''': This simply prints information about one particular <url> to System.out.
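As a rough illustration of the options above, the following sketch calls each mode once. Several things in it are assumptions and not part of the page: the crawldb path crawl/crawldb, the output directory readdb_out, the small readdb helper function, the URL pattern, and the status name db_fetched (assumed to be one of the CrawlDatum status names). When the bin/nutch launcher cannot be found, the helper only prints the command it would have run.

```shell
#!/bin/sh
# Hypothetical locations; adjust to your own installation and crawl directory.
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
OUT="${OUT:-readdb_out}"

# Hypothetical helper: run readdb if the launcher is executable,
# otherwise print the command line as a dry run.
readdb() {
  if [ -x "$NUTCH" ]; then
    "$NUTCH" readdb "$CRAWLDB" "$@"
  else
    printf '%s\n' "$NUTCH readdb $CRAWLDB $*"
  fi
}

# Overall statistics, printed to standard output.
readdb -stats

# Dump the whole crawldb as text into a directory.
readdb -dump "$OUT/dump"

# Dump only records with a given status (status name is an assumption).
readdb -dump "$OUT/fetched" -status db_fetched

# Dump only records matching a regular expression (hypothetical pattern).
readdb -dump "$OUT/example" -regex '.*example\.org.*'

# Top 10 urls by score, skipping records with scores below 1.0.
readdb -topN 10 "$OUT/top10" 1.0

# Information about a single url.
readdb -url http://example.org/
```

Each -dump and -topN call gets its own output directory here so that the results of one run do not mix with another.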