Posted to user@nutch.apache.org by Michael Lee <ml...@sugs.net> on 2011/02/28 12:57:04 UTC

using nutch with indri (outputting to WARC?)

Hi, I was looking at Nutch as a crawler for indexing into Indri.  In Indri's
docs, it lists "warc" as a corpus class option described as "WARC (Web
ARChive) format, such as is output by the Nutch webcrawler" -- see
http://lemur.sourceforge.net/indri/IndriIndexer.html

After finishing a short crawl using nutch (v1.2), I found no way to produce
WARC output -- neither the native data store nor any of the export/dump
options appear to be WARC.  I've inquired on Indri/Lemur forums about this,
but I thought I'd check here also if anyone knows what the docs might be
referring to...  or how else I might proceed.

Thanks!
-Michael

Re: using nutch with indri (outputting to WARC?)

Posted by Alexander Aristov <al...@gmail.com>.
As far as I know, this feature is not implemented, but it should be possible
to build.

If you are a developer, I would suggest taking a look at the Solr indexer.
It can give you a general idea of how to read Nutch data and transform it
into another format.
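
To make the "transform into another format" step more concrete, here is a
rough sketch of what serializing one fetched page as a WARC record might look
like. It is not Nutch or Indri code: the function name warc_response_record
and its parameters are hypothetical, it assumes the WARC/1.0 record layout,
and it assumes you have already read the URL, fetch time, and raw HTTP
response out of the crawl data by some other means. Whether Indri's "warc"
class accepts exactly this layout would need to be checked against Indri
itself.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def warc_response_record(url: str,
                         http_payload: bytes,
                         fetched_at: datetime,
                         record_id: Optional[str] = None) -> bytes:
    """Serialize one fetched page as a WARC/1.0 'response' record.

    Hypothetical helper, not part of Nutch or Indri.
    http_payload is the raw HTTP response (status line, headers, body).
    fetched_at must be a timezone-aware datetime.
    """
    if record_id is None:
        # Record IDs only need to be unique URIs; a random UUID URN is one option.
        record_id = f"<urn:uuid:{uuid.uuid4()}>"

    # WARC-Date is written in UTC, formatted as YYYY-MM-DDThh:mm:ssZ.
    warc_date = fetched_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {warc_date}",
        f"WARC-Record-ID: {record_id}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ]

    # Header lines end with CRLF, a blank line separates the header from the
    # block, and two CRLFs close the record.
    header = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return header + http_payload + b"\r\n\r\n"
```

Records built this way can simply be appended one after another to a single
output file opened in binary mode, since each record carries its own length.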


Best Regards
Alexander Aristov

