Posted to dev@nutch.apache.org by Ned Rockson <ne...@discoveryengine.com> on 2007/11/16 03:18:58 UTC

Nutch trunk js-parser problem with extremely long and meaningless Elements

I've run into this problem before: while running the parser, it gets
caught in really deep regex loops.  As a quick fix I changed
urlfilter-prefix to reject URLs over 300 characters and to make sure
none of the characters have ASCII values <32 (control characters).  I
just ran into another one today, but this time it's in the JS parser.
Take a look at the source for http://www.magic-cadeaux.fr/ where it
lists the function swap(image, num).  If it weren't for all of the
slashes it would be well-formed JavaScript, but unfortunately the
parse-js plugin doesn't deal with it correctly; it just hangs in a very
deep loop.  A browser such as Firefox, however, seems to deal with it
okay.  Is there a way we can deal with these cases rather than limiting
the size of an Element?
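
For reference, the quick fix to the URL filter described above amounts to roughly the check below. This is only a sketch, not the actual change to urlfilter-prefix: the class name UrlSanityCheck, the constant MAX_URL_LENGTH and the standalone filter() method are made up for illustration. Returning null for a rejected URL is assumed here to mirror how Nutch URL filters signal rejection.

```java
// Illustrative sketch only; names are hypothetical, not taken from the Nutch source.
class UrlSanityCheck {
    // Length limit from the quick fix described in the message.
    static final int MAX_URL_LENGTH = 300;

    // Returns the URL unchanged if it passes, or null if it should be rejected.
    static String filter(String url) {
        if (url == null || url.length() > MAX_URL_LENGTH) {
            return null; // missing or too long
        }
        for (int i = 0; i < url.length(); i++) {
            if (url.charAt(i) < 32) {
                return null; // contains an ASCII control character
            }
        }
        return url;
    }

    public static void main(String[] args) {
        String ok = filter("http://www.magic-cadeaux.fr/");
        String bad = filter("http://example.com/a\tb");
        System.out.println(ok != null ? "accepted" : "rejected");
        System.out.println(bad != null ? "accepted" : "rejected");
    }
}
```

This only keeps pathological URLs out of the filter; it does nothing for the parse-js hang, which happens on the page content itself.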