
Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2013/08/07 04:24:26 UTC

protocol-file org.apache.nutch.protocol.file.FileError: File Error: 404

Hi,
Now using Nutch trunk 1.8-SNAPSHOT HEAD
Back at this tonight. When attempting to fetch

file://home/law/Downloads/asf/solr-4.3.1/example/e001 (notice two slashes)

which contains loads of HTML files, I get the error as below.


Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
org.apache.nutch.protocol.file.FileError: File Error: 404
    at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
fetch of file://home/law/Downloads/asf/solr-4.3.1/example/e001 failed with:
org.apache.nutch.protocol.file.FileError: File Error: 404
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-08-06 18:59:00, elapsed: 00:00:02

I then deleted the crawldb and changed the seed URL to

file:/home/law/Downloads/asf/solr-4.3.1/example/e001 (notice one slash)

But when I eventually get to fetching after a few rounds of generate, fetch,
parse, updatedb, I end up with

fetching file:/home/law/Downloads/asf/solr-4.3.1/example/5428_03.html
(queue crawl delay=500ms)
org.apache.nutch.protocol.file.FileError: File Error: 404
    at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
fetch of file:/home/law/Downloads/asf/solr-4.3.1/example/5428_03.html
failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
fetching file:/home/law/Downloads/asf/solr-4.3.1/example/5094_08.html
(queue crawl delay=500ms)
org.apache.nutch.protocol.file.FileError: File Error: 404
    at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
fetch of file:/home/law/Downloads/asf/solr-4.3.1/example/5094_08.html
failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

Same as before... this happens with every single URL in the directory I am
trying to crawl.

Any advice here please?
Thanks
Lewis

-- 
*Lewis*

Re: protocol-file org.apache.nutch.protocol.file.FileError: File Error: 404

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Thanks for the explanation, Sebastian.
To be honest, I hadn't encountered this before, as I was never crawling my
local system.
I am working to a deadline for Sunday and really don't have time to dig
into this right now; however, I am going to apply the patch, try it out and
move on with the results.
Hopefully we can reach consensus about how protocol-file should behave, and
also on how strictly it should comply with the RFCs, as it seems there may
be some overlap there, as you stated, Seb.
Thanks
Lewis


On Wed, Aug 7, 2013 at 12:01 AM, Sebastian Nagel <wastl.nagel@googlemail.com
> wrote:

> Hi Lewis, hi Tejas,
>
> using protocol-file is currently a pain:
>
> fail with NPE:
>   (1 slash)   nutch indexchecker file:$PWD/test.html
>   (3 slashes) nutch indexchecker file://$PWD/test.html
> fail with protocol error:
>   (2 slashes) nutch indexchecker file:/$PWD/test.html
>
> URLs look different, e.g.
> - URL with 3 slashes (normalized by URLUtil.toASCII, called by
> indexchecker)
> - URL with one slash in Content and ParseResult
>
> We must take care that internally one normalized URL is used.
> Otherwise ParseResult.get(Text url) etc. will fail.
> Normalization to this form must take care of all legacy stuff:
>  file://C:/Documents/...
>  file://localhost/home/user/...
>
> Sebastian



-- 
*Lewis*

Re: protocol-file org.apache.nutch.protocol.file.FileError: File Error: 404

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Lewis, hi Tejas,

using protocol-file is currently a pain:

fail with NPE:
  (1 slash)   nutch indexchecker file:$PWD/test.html
  (3 slashes) nutch indexchecker file://$PWD/test.html
fail with protocol error:
  (2 slashes) nutch indexchecker file:/$PWD/test.html
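
As a rough illustration of why the slash count matters (a sketch only, not
code from Nutch): java.net.URL treats whatever sits between "//" and the
next "/" as the host, so the two-slash form loses its first path segment.
The class name FileUrlParts and the /home/user/test.html path are made up
for this example.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of Nutch: shows how java.net.URL splits file: URLs.
class FileUrlParts {

    // Parses the spec, converting the checked exception so callers stay simple.
    static URL parse(String spec) {
        try {
            return new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("not a valid URL: " + spec, e);
        }
    }

    // Host component as java.net.URL sees it.
    static String hostOf(String spec) {
        return parse(spec).getHost();
    }

    // Path component as java.net.URL sees it.
    static String pathOf(String spec) {
        return parse(spec).getPath();
    }

    public static void main(String[] args) {
        String[] specs = {
            "file:/home/user/test.html",   // 1 slash
            "file://home/user/test.html",  // 2 slashes: "home" is parsed as the host
            "file:///home/user/test.html", // 3 slashes: empty authority
        };
        for (String spec : specs) {
            System.out.println(spec + " -> host='" + hostOf(spec)
                    + "' path='" + pathOf(spec) + "'");
        }
    }
}
```

If protocol-file opens the local file from the path component alone (an
assumption about File.java, not something checked here), the two-slash form
would be looked up without its first directory, which could explain a 404
for that form.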

The URL takes different forms at different stages, e.g.
- three slashes (as normalized by URLUtil.toASCII, called by indexchecker)
- one slash in Content and ParseResult

We must make sure that a single normalized URL form is used internally.
Otherwise lookups by URL such as ParseResult.get(Text url) will fail.
Normalization to this form must also handle all the legacy forms:
 file://C:/Documents/...
 file://localhost/home/user/...
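
A minimal sketch of what such a normalization could look like, assuming the
one-slash form (file:/path) is picked as the internal one. That choice, the
class name FileUrlNormalizer and the decision to leave unknown hosts alone
are assumptions for illustration; this is not what the NUTCH-1483 patch
does.

```java
// Hypothetical sketch, not Nutch code: maps common file: URL spellings to file:/path.
class FileUrlNormalizer {

    static String normalize(String url) {
        if (!url.startsWith("file:")) {
            return url; // only file: URLs are handled here
        }
        String rest = url.substring("file:".length());
        if (!rest.startsWith("//")) {
            return url; // file:/path is already in the chosen form
        }
        // Split "//authority/path" into its authority and path parts.
        String afterSlashes = rest.substring(2);
        int slash = afterSlashes.indexOf('/');
        String authority = slash < 0 ? afterSlashes : afterSlashes.substring(0, slash);
        String path = slash < 0 ? "/" : afterSlashes.substring(slash);
        // Collapse any run of leading slashes in the path to a single one.
        path = path.replaceFirst("^/+", "/");

        if (authority.isEmpty() || authority.equalsIgnoreCase("localhost")) {
            return "file:" + path;              // file:///p and file://localhost/p
        }
        if (authority.matches("[A-Za-z]:")) {
            return "file:/" + authority + path; // file://C:/p becomes file:/C:/p
        }
        return url; // some other host: ambiguous, so do not guess
    }

    public static void main(String[] args) {
        String[] samples = {
            "file:///home/user/a.html",
            "file://localhost/home/user/a.html",
            "file://C:/Documents/a.html",
            "file:/home/user/a.html",
            "file://home/user/a.html",
        };
        for (String s : samples) {
            System.out.println(s + " -> " + normalize(s));
        }
    }
}
```

A two-slash URL with a non-local first segment is deliberately left alone
here, since it cannot be told apart from a URL that really names a remote
host.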

Sebastian



2013/8/7 Tejas Patil <te...@gmail.com>

> Hi Lewis,
> Can you try the patch attached over here:
> https://issues.apache.org/jira/browse/NUTCH-1483
>
> Thanks,
> Tejas

Re: protocol-file org.apache.nutch.protocol.file.FileError: File Error: 404

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Tejas, thanks, this looks like the key ;)


On Tue, Aug 6, 2013 at 9:51 PM, Tejas Patil <te...@gmail.com> wrote:

> Hi Lewis,
> Can you try the patch attached over here:
> https://issues.apache.org/jira/browse/NUTCH-1483
>
> Thanks,
> Tejas



-- 
*Lewis*

Re: protocol-file org.apache.nutch.protocol.file.FileError: File Error: 404

Posted by Tejas Patil <te...@gmail.com>.

Hi Lewis,
Can you try the patch attached over here:
https://issues.apache.org/jira/browse/NUTCH-1483

Thanks,
Tejas

