Posted to dev@nutch.apache.org by Shay Lawless <se...@gmail.com> on 2006/10/06 18:22:48 UTC

NutchWax

Hi all,

I'm new to the list so not sure if you even discuss extensions to Nutch or
if the list is exclusively for discussions on Nutch itself.

Have any of you ever used NutchWax? I'm attempting to use NutchWax to index
a number of .arc files generated by a web crawl. I can get the indexing step
to run fine, and when I perform a keyword search results are returned and
ranked by nutch. However when I click on any of the results, the content
cannot be displayed. The message returned is as follows,

Not Found The requested URL /null/20060930150000/http://blah.blah.com/ was
not found on this server.

Additionally, a 404 Not Found error was encountered while trying to use an
ErrorDocument to handle the request.

Any help you can provide would really be appreciated,

Thanks,

Séamus Lawless

Re: NutchWax

Posted by Gordon Mohr <go...@archive.org>.
A better place to ask NutchWAX- and ARC-specific questions would be its 
project discussion list, archive-access-discuss. See:

   http://archive-access.sourceforge.net/projects/nutch/mail-lists.html
   https://lists.sourceforge.net/lists/listinfo/archive-access-discuss

I think your problem is this:

When used as a traditional search engine, Nutch assumes it will be 
sending you elsewhere to view the indexed content -- for example, at its 
original URL.

When used with archived web content, as in NutchWAX, you typically don't 
want to refer searchers to the original URLs, but rather the archived 
content. Redisplaying the original content from another URL is not a 
core capability of Nutch/NutchWAX (except in the limited manner of 
showing 'cached' versions). So NutchWAX expects to refer result-clicks 
to some other ARC-access redisplay tool.

One example of such an ARC-access redisplay tool is the 'Wayback 
Machine', an early open-source version of which is available at:

   http://archive-access.sourceforge.net/projects/wayback/

So when providing access to archived web material in ARCs, you would 
typically use both:
  - Nutch/NutchWAX for the indexing/querying
  - Wayback for the redisplay/browsing of old versions
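
As a rough illustration of that hand-off, here is a minimal Python sketch of 
how a search hit could be turned into a link to a redisplay service. It is 
not NutchWAX's actual code: the function name, the example host, and the 
prefix value are hypothetical, and the prefix/timestamp/URL layout is only 
inferred from the failing URL in the error message. The leading /null/ in 
that URL would be consistent with a redisplay prefix that was never 
configured, but that is an assumption.

```python
import re

# 14-digit capture timestamp (YYYYMMDDhhmmss), as seen in the failing URL.
TIMESTAMP_RE = re.compile(r"[0-9]{14}")


def build_replay_url(prefix, timestamp, original_url):
    """Join a redisplay-service prefix, a capture timestamp, and the original URL.

    Hypothetical helper for illustration only; the layout
    <prefix>/<timestamp>/<original URL> is an assumption.
    """
    if not prefix:
        # Fail loudly instead of emitting a link such as /null/<timestamp>/<url>.
        raise ValueError("no redisplay prefix configured")
    if not TIMESTAMP_RE.fullmatch(timestamp):
        raise ValueError("timestamp must be 14 digits: %r" % (timestamp,))
    return "%s/%s/%s" % (prefix.rstrip("/"), timestamp, original_url)


if __name__ == "__main__":
    # Assumed local Wayback install; substitute the real location.
    print(build_replay_url("http://localhost:8080/wayback",
                           "20060930150000",
                           "http://blah.blah.com/"))
```

With a prefix pointing at a running Wayback install, the link for the result 
in the error message would carry that prefix in place of /null.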

Hope this helps,

- Gordon @ IA

Shay Lawless wrote:
> Hi all,
> 
> I'm new to the list so not sure if you even discuss extensions to Nutch or
> if the list is exclusively for discussions on Nutch itself.
> 
> Have any of you ever used NutchWax? I'm attempting to use NutchWax to index
> a number of .arc files generated by a web crawl. I can get the indexing step
> to run fine, and when I perform a keyword search results are returned and
> ranked by nutch. However when I click on any of the results, the content
> cannot be displayed. The message returned is as follows,
> 
> Not Found The requested URL /null/20060930150000/http://blah.blah.com/ was
> not found on this server.
> 
> Additionally, a 404 Not Found error was encountered while trying to use an
> ErrorDocument to handle the request.
> 
> Any help you can provide would really be appreciated,
> 
> Thanks,
> 
> Séamus Lawless
> 
