Posted to user@nutch.apache.org by Florent Gluck <fl...@busytonight.com> on 2006/03/14 00:51:14 UTC

Buggy fetchlist's urls

Hi,

I'm using nutch revision 385671 from the trunk.  I'm running it on a
single machine using the local filesystem.
I started with a seed of one single url: http://www.osnews.com
Then I ran a crawl cycle of depth 2 (generate/fetch/updatedb) and
dumped the crawl db.  Here is where I got quite surprised:

florent@florent-dev:~/tmp$ nutch readdb crawldb -dump dump
florent@florent-dev:~/tmp$ grep "^http" dump/part-00000
http://a.ads.t-online.de/       Version: 4
http://a.as-eu.falkag.net/      Version: 4
http://a.as-rh4.falkag.net/     Version: 4
http://a.as-rh4.falkag.net/server/asldata.js    Version: 4
http://a.as-test.falkag.net/    Version: 4
http://a.as-us.falkag.net/      Version: 4
http://a.as-us.falkag.net/dat/bfx/      Version: 4
http://a.as-us.falkag.net/dat/bgf/      Version: 4
http://a.as-us.falkag.net/dat/bgf/trpix.gif&    Version: 4
http://a.as-us.falkag.net/dat/bjf/      Version: 4
http://a.as-us.falkag.net/dat/brf/      Version: 4
http://a.as-us.falkag.net/dat/cjf/      Version: 4
http://a.as-us.falkag.net/dat/cjf/00/13/60/94.js        Version: 4
http://a.as-us.falkag.net/dat/cjf/00/13/60/96.js        Version: 4
http://a.as-us.falkag.net/dat/dlv/);QQt.document.write( Version: 4
http://a.as-us.falkag.net/dat/dlv/);document.write(     Version: 4
http://a.as-us.falkag.net/dat/dlv/+((QQPc-QQwA)/1000)+  Version: 4
http://a.as-us.falkag.net/dat/dlv/.ads.t-online.de      Version: 4
http://a.as-us.falkag.net/dat/dlv/.as-eu.falkag.net     Version: 4
http://a.as-us.falkag.net/dat/dlv/.as-rh4.falkag.net    Version: 4
http://a.as-us.falkag.net/dat/dlv/.as-us.falkag.net     Version: 4
http://a.as-us.falkag.net/dat/dlv/://   Version: 4
http://a.as-us.falkag.net/dat/dlv/</b><br>      Version: 4
http://a.as-us.falkag.net/dat/dlv/</big></b><br>        Version: 4
http://a.as-us.falkag.net/dat/dlv/</center></td></tr></table></body></html>    Version: 4
http://a.as-us.falkag.net/dat/dlv/</div>        Version: 4
http://a.as-us.falkag.net/dat/dlv/Banner-Typ/PopUp      Version: 4
http://a.as-us.falkag.net/dat/dlv/ShockwaveFlash.ShockwaveFlash.        Version: 4
http://a.as-us.falkag.net/dat/dlv/afxplay.js    Version: 4
http://a.as-us.falkag.net/dat/dlv/application/x-shockwave-flash Version: 4
http://a.as-us.falkag.net/dat/dlv/aslmain.js    Version: 4
http://a.as-us.falkag.net/dat/dlv/text/javascript       Version: 4
http://a.as-us.falkag.net/dat/dlv/window.blur();        Version: 4
http://a.as-us.falkag.net/dat/njf/      Version: 4
http://bilbo.counted.com/0/42699/       Version: 4
http://bilbo.counted.com/7/42699/       Version: 4
http://bw.ads.t-online.de/      Version: 4
http://bw.as-eu.falkag.net/     Version: 4
http://bw.as-us.falkag.net/     Version: 4
http://data.as-us.falkag.net/server/asldata.js  Version: 4
http://denux.org/       Version: 4
...

Some of these urls are totally bogus.  I haven't investigated the cause
yet, but it looks like a parsing issue: some urls contain javascript
code and others contain html tags.

Is anyone aware of this?
I can open a bug if needed.

Thanks,
--Flo

Re: Buggy fetchlist's urls

Posted by Florent Gluck <fl...@busytonight.com>.
Hi Andrzej,

Well, I think for now I'll just disable the parse-js plugin, since I
don't really need it anyway.
I'll let you know if I ever work on it (I may need it in the future).

Thanks,
--Flo

Andrzej Bialecki wrote:

> Florent Gluck wrote:
>
>> Some of these urls are totally bogus.  I haven't investigated the cause
>> yet, but it looks like a parsing issue: some urls contain javascript
>> code and others contain html tags.
>>   
>
>
> This is a side-effect of our primitive parse-js, which doesn't really
> parse anything; it just uses heuristics to extract possible URLs.
> Unfortunately, as often as not the strings it extracts have nothing
> to do with URLs.
>
> If you have suggestions on how to improve it I'm all ears.
>


Re: Buggy fetchlist's urls

Posted by Jack Tang <hi...@gmail.com>.
On 3/15/06, Jérôme Charron <je...@gmail.com> wrote:
> >
> > I am not familiar with the Rhino engine, but it is said that jdk 6
> > adopted it as its embedded javascript engine.  Could we build a
> > RhinoInterpreter first and then evaluate the javascript functions to
> > get their results, rather than just extracting raw text as we do now?
>
> Hi Jack,
>
> I recently wrote a small article about search engines and javascript
> (in French, sorry):
> http://www.moteurzine.com/archives/2006/moteurzine127.html#2
>
> My conclusion is simple: ok, you could use a javascript interpreter to
> extract URLs.  But in practice, how could you simulate all the user
> interaction?  How could you make the nutch crawler act like a human
> user?  Interpreting javascript is one thing; knowing all the possible
> outputs of a javascript is another.
> No?
Hi Jérôme,

Thanks for your article, even though I don't know French at all.
I agree with you that the nutch crawler cannot simulate all the user
interaction, such as onClick and onKeyDown events.  And I don't yet
know how a RhinoInterpreter would deal with form submits and
xmlhttprequest (I need more time to get to know Rhino).

> Regards
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Buggy fetchlist's urls

Posted by Jérôme Charron <je...@gmail.com>.
>
> I am not familiar with the Rhino engine, but it is said that jdk 6
> adopted it as its embedded javascript engine.  Could we build a
> RhinoInterpreter first and then evaluate the javascript functions to
> get their results, rather than just extracting raw text as we do now?

Hi Jack,

I recently wrote a small article about search engines and javascript
(in French, sorry):
http://www.moteurzine.com/archives/2006/moteurzine127.html#2

My conclusion is simple: ok, you could use a javascript interpreter to
extract URLs.  But in practice, how could you simulate all the user
interaction?  How could you make the nutch crawler act like a human
user?  Interpreting javascript is one thing; knowing all the possible
outputs of a javascript is another.
No?

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Buggy fetchlist's urls

Posted by Jack Tang <hi...@gmail.com>.
Hi Andrzej.

In my previous projects I bound javascript functions to a central url,
but I knew that idea doesn't fit nutch.

I am not familiar with the Rhino engine, but it is said that jdk 6
adopted it as its embedded javascript engine.  Could we build a
RhinoInterpreter first and then evaluate the javascript functions to
get their results, rather than just extracting raw text as we do now?

You can find javadoc about Rhino here:
http://xmlgraphics.apache.org/batik/javadoc/index.html
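
As a rough illustration of the idea (everything below is hypothetical:
the class name and wiring are made up, not working Rhino integration):
instead of scanning the script text, you would hand the interpreter a
stub document object and collect whatever the script actually writes,
then pull URLs out of the generated markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stub "document" to bind into a javascript interpreter
// (Rhino directly, or the javax.script engine shipped with jdk 6).
// The script calls write(); we collect the markup and extract hrefs
// afterwards, instead of guessing URLs from the raw script source.
public class WriteCapturingDocument {
    private final StringBuilder markup = new StringBuilder();

    // Bound into the script scope so the script can call document.write(...)
    public void write(String html) {
        markup.append(html);
    }

    // Extract href targets from whatever the script actually emitted.
    public List<String> hrefs() {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(markup);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        WriteCapturingDocument doc = new WriteCapturingDocument();
        // With Rhino this call would come from the evaluated script, e.g.:
        //   scope.put("document", scope, Context.javaToJS(doc, scope));
        //   cx.evaluateString(scope, jsSource, "ad.js", 1, null);
        doc.write("<a href=\"http://a.as-us.falkag.net/dat/dlv/aslmain.js\">x</a>");
        System.out.println(doc.hrefs());
    }
}
```

This only recovers URLs the script writes on its default path, of
course; branches taken only after user interaction would still be
missed, which is Jérôme's point.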

Regards
/Jack

On 3/14/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> Florent Gluck wrote:
> > Some of these urls are totally bogus.  I haven't investigated the cause
> > yet, but it looks like a parsing issue: some urls contain javascript
> > code and others contain html tags.
> >
>
> This is a side-effect of our primitive parse-js, which doesn't really
> parse anything; it just uses heuristics to extract possible URLs.
> Unfortunately, as often as not the strings it extracts have nothing
> to do with URLs.
>
> If you have suggestions on how to improve it I'm all ears.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Buggy fetchlist's urls

Posted by Andrzej Bialecki <ab...@getopt.org>.
Florent Gluck wrote:
> Some of these urls are totally bogus.  I haven't investigated the cause
> yet, but it looks like a parsing issue: some urls contain javascript
> code and others contain html tags.
>   

This is a side-effect of our primitive parse-js, which doesn't really
parse anything; it just uses heuristics to extract possible URLs.
Unfortunately, as often as not the strings it extracts have nothing
to do with URLs.

If you have suggestions on how to improve it I'm all ears.
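
To make that concrete, here is a rough sketch of this kind of
heuristic (the class name and regex are illustrative only, not the
actual parse-js code).  Fed a typical ad-serving snippet, it returns a
cookie-domain fragment with trailing punctuation alongside the real
script URL with markup glued on, much like the entries in the dump:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a string-scanning heuristic: grab anything in
// the script source that looks vaguely URL-shaped.
public class JsUrlHeuristic {
    // Matches absolute URLs, or bare tokens ending in a common TLD.
    private static final Pattern URL_LIKE =
        Pattern.compile("(\\w+://\\S+)|([\\w.-]+\\.(?:com|net|org|de)\\S*)");

    public static List<String> extract(String script) {
        List<String> out = new ArrayList<>();
        Matcher m = URL_LIKE.matcher(script);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // A snippet in the style of the falkag.net ad code seen above:
        // a cookie-domain string literal plus a document.write call.
        String js = "var d = '.as-us.falkag.net'; "
                  + "QQt.document.write('<a href=\"http://a.as-us.falkag.net/dat/dlv/aslmain.js\"></a>');";
        // Both the domain fragment (with the trailing ';) and the URL
        // (with markup attached) come out as "links".
        System.out.println(extract(js));
    }
}
```

Since the scanner has no notion of string boundaries or expression
structure, any literal that happens to contain a dot-and-TLD, or any
markup adjacent to a URL, ends up in the crawldb.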

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com