Posted to user@nutch.apache.org by mos <mo...@gmail.com> on 2006/02/03 15:55:10 UTC

Re: crawler

The problem at www.gildemeister.com is the use of JavaScript for link
generation.
That's the reason why Nutch can't find the other pages (the links are
invisible to it).
Two ideas:
- You need something like a sitemap that links to the other main pages.
  If it's not available right now, you should try to generate it
  (e.g. from the Apache log file).
- Enhance the Nutch HTML parser so that it is able to interpret the
  JavaScript links.
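
A minimal sketch of the first idea, assuming the server writes its access
log in the Apache Common or Combined Log Format. It collects every path that
was served with status 200 and writes a plain HTML page of ordinary links
that a crawler can follow. The function names, the file names in the usage
note, and the list of skipped suffixes are hypothetical and only serve as
an illustration.

```python
import re
from html import escape

# Matches the request and status fields of a Common/Combined Log Format line,
# e.g. ... "GET /en/products.html HTTP/1.1" 200 5120
LOG_LINE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" (\d{3})')

# Assumption: images, stylesheets and scripts are not wanted in the sitemap.
SKIP_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")


def paths_from_log(lines):
    """Return the sorted, de-duplicated paths that were served with status 200."""
    paths = set()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # not a recognisable request line
        path, status = match.group(1), match.group(2)
        if status != "200":
            continue
        # Ignore the query string when checking the file suffix.
        if path.split("?", 1)[0].lower().endswith(SKIP_SUFFIXES):
            continue
        paths.add(path)
    return sorted(paths)


def build_sitemap(base_url, paths):
    """Render a plain HTML page with one ordinary <a href> link per path."""
    base = base_url.rstrip("/")
    items = "\n".join(
        '<li><a href="{0}">{0}</a></li>'.format(escape(base + path, quote=True))
        for path in paths
    )
    return "<html><body><ul>\n" + items + "\n</ul></body></html>\n"


def write_sitemap(log_path, out_path, base_url):
    """Read an access log and write the generated sitemap page."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        page = build_sitemap(base_url, paths_from_log(log))
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(page)
```

It could be called as write_sitemap("access.log", "sitemap.html",
"http://www.gildemeister.com"); the resulting page then has to be published
on the site or added to the crawl's seed URLs so that the crawler sees it.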

Greetings
mos, from Munich



On 2/3/06, Poettgen@acocon.de <Po...@acocon.de> wrote:
> Hello,
>
> I have problems indexing a particular website:
> http://www.gildemeister.com
>
> Nutch fetches only 14 pages, not the complete site.
>
> I'm using the default parameters and the intranet crawl command.
>
> I get no errors. Could someone try to index the site and send me a
> hint, or a config that works? I am new to Nutch and I don't know where
> to start fixing it.
>
> thanks
>
> wombat
>
>
>

Re: crawler

Posted by Po...@acocon.de.

OK, JavaScript seems to be one problem. Thank you, Andrzej.

I activated the JavaScript parser and some more pages are being indexed, but
the entries of the left menu are still missing.

Is there another solution besides building a 'sitemap'?



Andrzej Bialecki <ab...@getopt.org> wrote on 03.02.2006 16:15:37:

> mos wrote:
> > The problem at www.gildemeister.com is the use of JavaScript for link
> > generation.
> > That's the reason why Nutch can't find the other pages (the links are
> > invisible to it).
> > Two ideas:
> > - You need something like a sitemap that links to the other main pages.
> >   If it's not available right now, you should try to generate it
> >   (e.g. from the Apache log file).
> > - Enhance the Nutch HTML parser so that it is able to interpret the
> >   JavaScript links.
> >
>
> You can try activating parse-js - it can extract JavaScript snippets
> embedded in HTML actions, and figure out the links. It works reasonably
> well, at least most of the time... ;-)
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: crawler

Posted by Andrzej Bialecki <ab...@getopt.org>.

mos wrote:
> The problem at www.gildemeister.com is the use of JavaScript for link
> generation.
> That's the reason why Nutch can't find the other pages (the links are
> invisible to it).
> Two ideas:
> - You need something like a sitemap that links to the other main pages.
>   If it's not available right now, you should try to generate it
>   (e.g. from the Apache log file).
> - Enhance the Nutch HTML parser so that it is able to interpret the
>   JavaScript links.
>

You can try activating parse-js - it can extract JavaScript snippets 
embedded in HTML actions, and figure out the links. It works reasonably 
well, at least most of the time... ;-)
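
To illustrate the kind of heuristic involved, here is a rough sketch in
Python. It is not the parse-js plugin's actual code, only an assumption about
the general approach: look at event-handler attributes and javascript: hrefs,
pull out the quoted string literals, and keep those that look like URLs. The
class and function names and the "looks like a URL" rule are hypothetical.
In Nutch itself, plugins are normally switched on through the
plugin.includes property in the site configuration; check the documentation
of your version for the exact value.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Quoted string literals inside a JavaScript snippet.
STRING_LITERAL = re.compile(r"""(['"])(.*?)\1""")

# Assumption: a literal is a link candidate if it is absolute, root-relative,
# or ends in a typical page extension.
LOOKS_LIKE_URL = re.compile(
    r"^(https?://|/)|\.(html?|php|jsp|asp)([?#].*)?$", re.IGNORECASE
)


class JsLinkExtractor(HTMLParser):
    """Collect link candidates hidden in JavaScript attribute values."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases attribute names, so "onClick" arrives as "onclick".
        for name, value in attrs:
            if not value:
                continue
            is_handler = name.startswith("on")
            is_js_href = name == "href" and value.lower().startswith("javascript:")
            if not (is_handler or is_js_href):
                continue
            for _quote, literal in STRING_LITERAL.findall(value):
                if LOOKS_LIKE_URL.search(literal):
                    link = urljoin(self.base_url, literal)
                    if link not in self.links:
                        self.links.append(link)


def extract_js_links(html_text, base_url):
    """Return the resolved link candidates found in html_text, in document order."""
    parser = JsLinkExtractor(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links
```

A heuristic like this only sees URLs that appear as literal strings. Links
that a script assembles at run time (for example a menu built from variables)
stay invisible to it, which is one reason such extraction works only most of
the time.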

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: crawler

Posted by Stefan Groschupf <sg...@media-style.com>.

There is already a JavaScript parser; you only need to switch it on.

On 03.02.2006 at 15:55, mos wrote:

> The problem at www.gildemeister.com is the use of JavaScript for link
> generation.
> That's the reason why Nutch can't find the other pages (the links are
> invisible to it).
> Two ideas:
> - You need something like a sitemap that links to the other main pages.
>   If it's not available right now, you should try to generate it
>   (e.g. from the Apache log file).
> - Enhance the Nutch HTML parser so that it is able to interpret the
>   JavaScript links.
>
> Greetings
> mos, from Munich
>
>
>
> On 2/3/06, Poettgen@acocon.de <Po...@acocon.de> wrote:
>> Hello,
>>
>> I have problems indexing a particular website:
>> http://www.gildemeister.com
>>
>> Nutch fetches only 14 pages, not the complete site.
>>
>> I'm using the default parameters and the intranet crawl command.
>>
>> I get no errors. Could someone try to index the site and send me a
>> hint, or a config that works? I am new to Nutch and I don't know where
>> to start fixing it.
>>
>> thanks
>>
>> wombat
>>
>>
>>
>