Posted to user@nutch.apache.org by Bai Shen <ba...@gmail.com> on 2013/03/27 20:26:10 UTC

Root slash being stripped from file path

I'm trying to crawl a local file system.  I've made the changes to not
ignore file URLs and added protocol-file to the plugins list.  I've
included file:///data/mydir in my URL file.

However, when I run the fetch, Nutch tries to connect to file://data/mydir
and therefore returns a 404 error.  I think the root slash is being
stripped during the injection, but I can't seem to find out why.

Anybody have any suggestions or ideas?

Thanks.

Re: Root slash being stripped from file path

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Please also see

https://issues.apache.org/jira/browse/NUTCH-1484

Sebastian resolved this one and, AFAIK, fixed the problem.

-- 
*Lewis*

Re: Root slash being stripped from file path

Posted by Bai Shen <ba...@gmail.com>.
Finally found it in JIRA.

https://issues.apache.org/jira/browse/NUTCH-1483

I'll give the patch a try and see if that fixes my issue.

Re: Root slash being stripped from file path

Posted by Bai Shen <ba...@gmail.com>.
Sorry, I'm using Nutch 2.1.  I did a general web search and didn't find any
instances of the problem.  I found a couple of tutorials using the
file:///data/mydir format with no mention of any issues.

The problem is that the normalizers (not sure which one) strip out that
leading /, which changes the URL from absolute to relative.  I turned off
the normalizers, but now I'm getting an index out of bounds exception from
unreverseUrl.  I haven't dug through the code yet, but I'm betting that it
doesn't like the slash, since that's not something that would show up in an
HTTP URL.
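
For anyone following along, here is a minimal sketch of why the missing
slash matters. It uses only the JDK's java.net.URI and is not Nutch code;
the class name FileUrlSlashes and its two helper methods are made up for
illustration. With three slashes the authority is empty and the whole
/data/mydir is the path; with two slashes "data" is parsed as the host and
only /mydir is left as the path.

```java
import java.net.URI;

public class FileUrlSlashes {
    // Returns the host component of the URL, or null when it has none.
    public static String hostOf(String url) {
        return URI.create(url).getHost();
    }

    // Returns the path component of the URL.
    public static String pathOf(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        String[] urls = {"file:///data/mydir", "file://data/mydir"};
        for (String url : urls) {
            // Print how each form is split into host and path.
            System.out.println(url + " -> host=" + hostOf(url) + ", path=" + pathOf(url));
        }
    }
}
```

Running main prints host=null, path=/data/mydir for the three-slash form
and host=data, path=/mydir for the two-slash form, which would be
consistent with the fetcher looking in the wrong place once the slash is
gone.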

Re: Root slash being stripped from file path

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Which Nutch version, please?
Sebastian and others worked on this a while ago.
I don't know how far it has progressed. There are most certainly
open or resolved tickets for it in Jira; please look there.
Thank you
Lewis

-- 
*Lewis*