Posted to user@nutch.apache.org by "gaurav.gupta" <ga...@edynamic.info> on 2012/09/05 11:16:58 UTC

Re: Malformed URL: '', skipping (java.net.MalformedURLException

Hello

I'm getting the following exception while indexing my site, which is hosted
on my local machine.


Skipping -^(file|ftp|mailto)::java.net.MalformedURLException: no protocol:
-^(file|ftp|mailto):
Skipping
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|css|CSS|wmv|WMV)$:java.net.MalformedURLException:
no protocol:
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|css|CSS|wmv|WMV)$
Skipping +^http://women.net/*:java.net.MalformedURLException: no protocol:
+^http://women.net/*
Skipping -.:java.net.MalformedURLException: no protocol: -.
Skipping -^(file|ftp|mailto)::java.net.MalformedURLException: no protocol:
-^(file|ftp|mailto):
Skipping
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|swf|SWF|ashx|css|CSS|wmv|WMV
)$:java.net.MalformedURLException: no protocol:
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF
|pdf|js|JS|swf|SWF|ashx|css|CSS|wmv|WMV)$

Skipping +^http://women.net/*:java.net.MalformedURLException: no protocol:
+^http://women.net/*

LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: file:/C:/Nutch/local/crawl/segments/20120905010233/parse_data
        at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at
org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
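
The LinkDb error above says that the segment has no parse_data directory, which would be expected if nothing in that segment was ever fetched and parsed. As a rough sketch (not part of Nutch), the following Python lists the segments under a crawl directory that lack parse_data. It assumes the usual Nutch 1.x layout of one timestamped directory per segment; the function name and the second segment name in the demo are made up for illustration.

```python
from pathlib import Path
import tempfile


def segments_missing_parse_data(segments_dir):
    """Return names of segment directories that have no parse_data subdirectory."""
    root = Path(segments_dir)
    return sorted(
        seg.name
        for seg in root.iterdir()
        if seg.is_dir() and not (seg / "parse_data").is_dir()
    )


# Demo on a throwaway directory that imitates crawl/segments.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "20120905010233").mkdir()                          # segment from the log
    (Path(tmp) / "20120905020000" / "parse_data").mkdir(parents=True)  # hypothetical segment
    print(segments_missing_parse_data(tmp))
```

On a real install the function would be pointed at the segments directory from the error message, for example C:/Nutch/local/crawl/segments.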


I added the following line in conf\regex-urlfilter.txt

#+^http://([a-z0-9]*\.)*women.com/
+http://women.net/*

and in conf\crawl-urlfilter.txt
 # accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+http://women.net/*
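
For reference, here is a rough Python sketch of how rules of this kind are usually evaluated, going by the comments in the stock regex-urlfilter.txt: each rule is a regular expression prefixed with + or -, the first rule that matches decides, and a URL that matches no rule is ignored. This is an illustration under those assumptions, not Nutch's code. Python's re module stands in for Java regular expressions, the function names are invented, and the rule set is a tidied variant of the rules in this post (with the dot escaped), not a recommended configuration.

```python
import re


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Illustrative rule set (assumption): skip some schemes, accept one host, reject the rest.
RULES = r"""
-^(file|ftp|mailto):
+^http://women\.net/
-.
"""
rules = parse_rules(RULES)
for url in ("http://women.net/about.html", "ftp://women.net/file", "http://example.org/"):
    print(url, "->", "accept" if accepts(rules, url) else "reject")
```

Note that in a regular expression a trailing /* means "zero or more slashes", not "anything after the slash", so a rule such as +^http://women.net/* also matches hosts that merely begin with women.net.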

Please let me know what else I need to do in order to index the data.

Thanks in advance



--
View this message in context: http://lucene.472066.n3.nabble.com/Malformed-URL-skipping-java-net-MalformedURLException-tp3590159p4005547.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Malformed URL: '', skipping (java.net.MalformedURLException

Posted by "gaurav.gupta" <ga...@edynamic.info>.
I've updated Nutch from 1.5 to 2.2.




Re: Malformed URL: '', skipping (java.net.MalformedURLException

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

On Thu, Sep 6, 2012 at 5:50 AM, gaurav.gupta <ga...@edynamic.info> wrote:

> C:\nutch\local\conf\crawl-urlfilter.txt as specified in my above post.

This no longer exists... that might be a problem


-- 
Lewis

Re: Malformed URL: '', skipping (java.net.MalformedURLException

Posted by "gaurav.gupta" <ga...@edynamic.info>.
I'm using Apache Nutch version 1.5. I've just hosted the site in my local
environment and am trying to index it. I provided the entries in
C:\nutch\local\conf\regex-urlfilter.txt and
C:\nutch\local\conf\crawl-urlfilter.txt as specified in my above post.

 




Re: Malformed URL: '', skipping (java.net.MalformedURLException

Posted by Lewis John Mcgibbney <le...@gmail.com>.
I think you've incorrectly passed your regex- as your seed URL list
when you've injected.
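
One quick way to check this is to look at whether every non-comment line in the seed file starts with a URL scheme. The "no protocol" messages in the log are what a URL parser reports when a line such as -^(file|ftp|mailto): has no valid scheme in front of it. Below is a small Python sketch of that check, offered as an illustration only: the function name is invented, the scheme test follows the RFC 3986 syntax rather than Nutch's own code, skipping # lines is an assumption, and http://women.net/ is a guess at what the real seed would be.

```python
import re

# RFC 3986 scheme syntax: a letter, then letters, digits, "+", "-" or ".", then ":".
SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def split_seed_lines(lines):
    """Separate lines that start with a URL scheme from lines that do not."""
    urls, suspects = [], []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank lines and comments are ignored
        (urls if SCHEME.match(line) else suspects).append(line)
    return urls, suspects


# Lines taken from the log in this thread, plus one hypothetical real seed URL.
sample = [
    "# accept hosts in MY.DOMAIN.NAME",
    "-^(file|ftp|mailto):",
    "+^http://women.net/*",
    "-.",
    "http://women.net/",
]
urls, suspects = split_seed_lines(sample)
print("usable seeds:", urls)
print("look like filter rules:", suspects)
```

If the second list is not empty, the directory given to the inject step probably contains a filter file rather than (or in addition to) a plain list of URLs, one per line.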

As a side note it is always VERY helpful to provide basic info such as
the Nutch version, the steps you took to reproduce the error, etc...
basic stuff.

hth

Lewis
