Posted to user@nutch.apache.org by Aïcha <ai...@yahoo.com> on 2006/09/25 18:02:41 UTC

Re: Re: problem with web site indexing

In fact, the site I want to index is on the web.
When I run the crawl, many other sites are indexed as well; they are referenced in pages of the site I want to index.
Up to this point, that seems fine.
The problem is that for all these other sites I have many pages in the index,
but for my own site I only have the page corresponding to the URL I supplied.

I don't understand what is happening or what I have to do to make it work.


----- Original Message ----
From: David Podunavac <da...@wyona.com>
To: nutch-user@lucene.apache.org
Sent: Monday, 25 September 2006, 16:34:13
Subject: Re: Re : problem with web site indexing


I think the file being indexed contains something like
javascript:something
This makes Nutch treat "javascript" as a protocol, so it throws a
MalformedURLException.
Try "javascript: something" instead,
or go into the code and ignore the MalformedURLException

at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
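
To illustrate the point, here is a minimal standalone sketch (not Nutch's actual code) showing that java.net.URL rejects a "javascript:" link and how such links could simply be skipped rather than logged as errors. The class name OutlinkFilterSketch and the methods isParsableUrl and keepParsable are hypothetical names made up for this example; only java.net.URL and MalformedURLException come from the stack trace.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; this is not part of Nutch.
class OutlinkFilterSketch {

    // Returns true if java.net.URL can parse the candidate link.
    // A link such as "javascript:something" has no registered protocol
    // handler, so the constructor throws MalformedURLException.
    static boolean isParsableUrl(String candidate) {
        try {
            new URL(candidate);
            return true;
        } catch (MalformedURLException e) {
            // Swallow the exception: the link is simply not crawlable.
            return false;
        }
    }

    // Keeps only the links that java.net.URL accepts.
    static List<String> keepParsable(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (isParsableUrl(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "http://example.com/page.html",
                "javascript:something");
        // Prints only the http link; the javascript: link is dropped.
        System.out.println(keepParsable(links));
    }
}
```

Skipping these links only silences the error for pseudo-links that could never be fetched anyway; it would not by itself explain why other pages of a site are missing from the index.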


HTH,
David

> Hi,
>
> I'm sorry, but I still haven't succeeded in indexing all the content of my web site.
> In the log I have some errors:
>
> 2006-09-25 15:35:42,859 ERROR parse.OutlinkExtractor - getOutlinks
> java.net.MalformedURLException: unknown protocol: javascript
>  at java.net.URL.<init>(URL.java:574)
>  at java.net.URL.<init>(URL.java:464)
>  at java.net.URL.<init>(URL.java:413)
>  at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
>  at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
>  at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
>  at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:70)
>  at org.apache.nutch.parse.text.TextParser.getParse(TextParser.java:47)
>  at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
>  at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:276)
>  at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)
>
> I also don't clearly understand the configuration I have to make for the agent in the nutch-site.xml file.
>
> Could someone help me?
>
>
> ----- Original Message ----
> From: Aïcha <ai...@yahoo.com>
> To: nutch-user@lucene.apache.org
> Sent: Tuesday, 19 September 2006, 16:16:32
> Subject: problem with web site indexing
>
>
> Hi,
>
> I am trying to index a web site with all of its pages,
> but the only page I have in the index is the first page, that is, the page of the URL I put in the input file for the crawl.
> In the end I have only one page in the index.
> Do I have to do something to make it work?
>
> Thanks in advance!
> Aïcha
>
> ___________________________________________________________________________
> Discover a new way to ask all your questions, whatever the subject!
> Yahoo! Questions/Réponses: share your knowledge, your opinions, and your experiences.
> http://fr.answers.yahoo.com


	

	
		