Posted to user@nutch.apache.org by mina <ta...@gmail.com> on 2011/12/15 23:48:50 UTC

Malformed URL: '', skipping (java.net.MalformedURLException

I crawl sites with Nutch 1.3, and I see this exception in my log while Nutch
crawls my sites:

    Malformed URL: '', skipping (java.net.MalformedURLException: no protocol:
	at java.net.URL.<init>(URL.java:567)
	at java.net.URL.<init>(URL.java:464)
	at java.net.URL.<init>(URL.java:413)
	at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:247)
	at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:109)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
)



--
View this message in context: http://lucene.472066.n3.nabble.com/Malformed-URL-skipping-java-net-MalformedURLException-tp3590159p3590159.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Malformed URL: '', skipping (java.net.MalformedURLException

Posted by "gaurav.gupta" <ga...@edynamic.info>.
I updated Nutch from 1.5 to 2.2.




Re: Malformed URL: '', skipping (java.net.MalformedURLException

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

On Thu, Sep 6, 2012 at 5:50 AM, gaurav.gupta <ga...@edynamic.info> wrote:

> C:\nutch\local\conf\crawl-urlfilter.txt as specified in my above post.

This file no longer exists... that might be the problem.


-- 
Lewis

Re: Malformed URL: '', skipping (java.net.MalformedURLException

Posted by "gaurav.gupta" <ga...@edynamic.info>.
I'm using Apache Nutch version 1.5. I have just hosted the site in my local
environment and am trying to index it, providing the entries in
C:\nutch\local\conf\regex-urlfilter.txt and
C:\nutch\local\conf\crawl-urlfilter.txt as specified in my post above.

 




Re: Malformed URL: '', skipping (java.net.MalformedURLException

Posted by Lewis John Mcgibbney <le...@gmail.com>.
I think you've incorrectly passed your regex- as your seed URL list
when you've injected.
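
To see why such lines get skipped, here is a minimal Java sketch that feeds a
few strings to java.net.URL, the class that appears in the stack traces in this
thread. The class name SeedLineCheck and the helper parseError are made up for
illustration and are not part of Nutch; the sample lines are taken from the log
output quoted in this thread. Filter rules start with '+' or '-', so they carry
no valid protocol and should be rejected much like an empty string.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.List;

// Hypothetical helper, not Nutch code: shows which lines java.net.URL rejects.
public class SeedLineCheck {

    /** Returns null if the line parses as a URL, otherwise the exception message. */
    public static String parseError(String line) {
        try {
            // Note: this constructor is deprecated in recent JDKs but still available.
            new URL(line);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "http://women.net/",      // a real seed URL
            "+^http://women.net/*",   // a filter rule, not a URL
            "-^(file|ftp|mailto):",   // a filter rule, not a URL
            "-.",                     // a filter rule, not a URL
            "");                      // an empty line
        for (String line : lines) {
            String error = parseError(line);
            System.out.println("'" + line + "' -> " + (error == null ? "ok" : error));
        }
    }
}
```

If only the first line reports "ok", the file passed to the injector most
likely contains filter rules rather than seed URLs.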

As a side note, it is always VERY helpful to provide basic info such as
the Nutch version and the steps you took to reproduce the error.

hth

Lewis


Re: Malformed URL: '', skipping (java.net.MalformedURLException

Posted by "gaurav.gupta" <ga...@edynamic.info>.
Hello

I'm getting the following exception while indexing my sites, which are hosted
on my local machine.


Skipping -^(file|ftp|mailto)::java.net.MalformedURLException: no protocol: -^(file|ftp|mailto):
Skipping -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|css|CSS|wmv|WMV)$:java.net.MalformedURLException: no protocol: -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|css|CSS|wmv|WMV)$
Skipping +^http://women.net/*:java.net.MalformedURLException: no protocol: +^http://women.net/*
Skipping -.:java.net.MalformedURLException: no protocol: -.
Skipping -^(file|ftp|mailto)::java.net.MalformedURLException: no protocol: -^(file|ftp|mailto):
Skipping -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|swf|SWF|ashx|css|CSS|wmv|WMV)$:java.net.MalformedURLException: no protocol: -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|PDF|pdf|js|JS|swf|SWF|ashx|css|CSS|wmv|WMV)$

Skipping +^http://women.net/*:java.net.MalformedURLException: no protocol: +^http://women.net/*

LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Nutch/local/crawl/segments/20120905010233/parse_data
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)


I added the following lines to conf\regex-urlfilter.txt

#+^http://([a-z0-9]*\.)*women.com/
+http://women.net/*

and in conf\crawl-urlfilter.txt
 # accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+http://women.net/*

Please let me know what else I need to do in order to index the data.

Thanks in advance




Re: Malformed URL: '', skipping (java.net.MalformedURLException

Posted by Markus Jelsma <ma...@openindex.io>.
Using the regex URL filter plugin you can, for example, pass only http:// URLs.

+http://
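
As a rough illustration of how such rules behave, here is a minimal Java
sketch. It is not the Nutch plugin itself; it assumes the behaviour described
in the comments of the stock regex-urlfilter.txt, where the first matching
pattern decides and a URL that matches no pattern is ignored. The class name
RegexFilterSketch, the shortened suffix list, and the leading ^ anchor on the
http rule are my own assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only: mimics first-match-wins +/- regex rules.
public class RegexFilterSketch {

    /** One rule: '+' accepts, '-' rejects; the rest of the line is a regex. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    public RegexFilterSketch(List<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    /** The first rule whose pattern is found in the URL decides; no match means reject. */
    public boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RegexFilterSketch filter = new RegexFilterSketch(List.of(
            "# skip a few static file suffixes (shortened, hypothetical list)",
            "-\\.(gif|jpg|png|css|js)$",
            "# pass only http:// URLs",
            "+^http://",
            "# reject anything else",
            "-."));
        for (String url : List.of("http://women.net/", "http://women.net/logo.png",
                                  "ftp://women.net/file", "")) {
            System.out.println("'" + url + "' accepted: " + filter.accepts(url));
        }
    }
}
```

Because the first match wins, the order of the rules matters: a broad accept
rule placed before the reject rules would let everything through.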


-- 
Markus Jelsma - CTO - Openindex

Re: Malformed URL: '', skipping (java.net.MalformedURLException

Posted by mina <ta...@gmail.com>.
Thanks for your answer. How do I set up proper URL filters?


Re: Malformed URL: '', skipping (java.net.MalformedURLException

Posted by Markus Jelsma <ma...@openindex.io>.
You haven't set up proper URL filters. You'd typically have URL filters that
only pass the protocols you need.


-- 
Markus Jelsma - CTO - Openindex