Posted to user@nutch.apache.org by Aïcha <ai...@yahoo.com> on 2006/09/25 16:16:52 UTC

Re : problem with web site indexing

Hi,

I'm sorry, but I still can't get all the content of my web site indexed.
The log contains errors like this one:

2006-09-25 15:35:42,859 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: javascript
 at java.net.URL.<init>(URL.java:574)
 at java.net.URL.<init>(URL.java:464)
 at java.net.URL.<init>(URL.java:413)
 at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
 at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
 at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
 at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:70)
 at org.apache.nutch.parse.text.TextParser.getParse(TextParser.java:47)
 at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
 at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:276)
 at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)

I also don't clearly understand what configuration I have to set for the agent in the nutch-site.xml file.

Could someone help me?


----- Original Message ----
From: Aïcha <ai...@yahoo.com>
To: nutch-user@lucene.apache.org
Sent: Tuesday, 19 September 2006, 16:16:32
Subject: problem with web site indexing


Hi,

I am trying to index a web site, including all of its pages,
but the only page that ends up in the index is the first page, that is, the page at the URL I put in the input file for the crawl.
Do I have to do something more to make it work?

Thanks in advance!
Aïcha


	

	
		

Re: Re : problem with web site indexing

Posted by David Podunavac <da...@wyona.com>.
I think the file being indexed contains a link of the form
javascript:something
Nutch takes "javascript" to be the protocol of a URL, and because
java.net.URL does not know that protocol it throws a MalformedURLException.
Try "javascript: something" instead,
or go into the code and ignore the MalformedURLException thrown here:

at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
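
To illustrate the two points above, here is a minimal standalone sketch in plain Java. It is not Nutch code: the class name OutlinkFilterSketch and its methods are hypothetical, and it only assumes that outlinks arrive as plain strings. It shows that java.net.URL rejects a javascript: link, and how such links could be skipped instead of letting the exception surface.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: filters out links that
// java.net.URL cannot parse (for example "javascript:" pseudo-links).
public class OutlinkFilterSketch {

    // Returns true if java.net.URL accepts the link, false if it
    // throws MalformedURLException (e.g. for an unknown protocol).
    public static boolean isParsableUrl(String link) {
        try {
            new URL(link);
            return true;
        } catch (MalformedURLException e) {
            // This is the same exception type that appears in the log above.
            return false;
        }
    }

    // Keeps only the links that can be turned into a java.net.URL.
    public static List<String> keepParsable(List<String> links) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (isParsableUrl(link)) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "http://example.com/page.html",   // ordinary link
                "javascript:void(0)");            // pseudo-link found in many pages
        for (String link : keepParsable(links)) {
            System.out.println("would follow: " + link);
        }
    }
}
```

Skipping such links only silences the error. A javascript: link never points to a fetchable page anyway, so dropping it should not by itself change how many pages end up in the index.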


hth david
