Posted to user@nutch.apache.org by Michael Lee <ml...@sugs.net> on 2011/02/28 12:57:04 UTC
using nutch with indri (outputting to WARC?)
Hi, I was looking at Nutch as a crawler for indexing into Indri. Indri's
docs list "warc" as a corpus class option, described as "WARC (Web
ARChive) format, such as is output by the Nutch webcrawler" -- cf.
http://lemur.sourceforge.net/indri/IndriIndexer.html
After finishing a short crawl using Nutch (v1.2), I found no way to produce
WARC output -- neither the native data store nor any of the export/dump
options appears to be WARC. I've asked on the Indri/Lemur forums about this,
but I thought I'd also check here in case anyone knows what the docs might be
referring to... or how else I might proceed.
Thanks!
-Michael
Re: using nutch with indri (outputting to WARC?)
Posted by Alexander Aristov <al...@gmail.com>.
As far as I know, this feature is not implemented, but it would be possible to add.
If you are a developer, I would suggest taking a look at the Solr indexer.
It can give you a general idea of how to read Nutch data and transform it into
another format.
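
To give a rough idea of the "transform" half, here is a minimal Python sketch of writing WARC/1.0 "response" records. It is not Nutch code and does not read Nutch segments: it assumes you have already extracted, by whatever means, a URL, a timezone-aware fetch time and the raw HTTP response bytes for each page. The function names (warc_response_record, write_warc) are made up for illustration, and whether Indri's "warc" class accepts exactly this variant of the format has not been tested here.

```python
import uuid
from datetime import timezone


def warc_response_record(url, fetched_at, http_bytes):
    """Build one WARC/1.0 'response' record as bytes.

    url        -- the fetched URL (str)
    fetched_at -- timezone-aware datetime of the fetch (assumption: the
                  caller supplies this; it is converted to UTC here)
    http_bytes -- raw HTTP response: status line, headers, blank line, body
    """
    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        "WARC-Target-URI: " + url,
        "WARC-Date: "
        + fetched_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "WARC-Record-ID: <urn:uuid:%s>" % uuid.uuid4(),
        "Content-Type: application/http; msgtype=response",
        "Content-Length: %d" % len(http_bytes),
    ]
    # Header lines end in CRLF; a blank line separates them from the block.
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Every record is terminated by two CRLFs.
    return head + http_bytes + b"\r\n\r\n"


def write_warc(pages, out):
    """Write an iterable of (url, fetched_at, http_bytes) to a binary file."""
    for url, fetched_at, http_bytes in pages:
        out.write(warc_response_record(url, fetched_at, http_bytes))
```

Something like write_warc(pages, open("crawl.warc", "wb")) would then produce a single uncompressed file. Getting the (url, time, bytes) tuples out of a crawl is the part that would follow the Solr indexer's example, and the HTTP headers may need to be rebuilt from stored metadata if the raw response is not kept.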
Best Regards
Alexander Aristov
On 28 February 2011 14:57, Michael Lee <ml...@sugs.net> wrote:
> Hi, I was looking at nutch as a crawler for indexing into Indri. In
> Indri's
> docs, it lists "warc" as a corpus class option described as "WARC (Web
> ARChive) format, such as is output by the Nutch webcrawler" -- c.f.
> http://lemur.sourceforge.net/indri/IndriIndexer.html
>
> After finishing a short crawl using nutch (v1.2), I found no way to produce
> WARC output -- neither the native data store nor any of the export/dump
> options appear to be WARC. I've inquired on Indri/Lemur forums about this,
> but I thought I'd check here also if anyone knows what the docs might be
> referring to... or how else I might proceed.
>
> Thanks!
> -Michael
>