Posted to dev@nutch.apache.org by Ned Rockson <ne...@discoveryengine.com> on 2007/11/16 03:18:58 UTC
Nutch trunk js-parser problem with extremely long and meaningless Elements
I've run into this problem before: while running the parser, it gets
caught in really deep regex loops. As a quick fix I changed
urlfilter-prefix to reject URLs over 300 characters and to make sure
none of the characters have ASCII values below 32 (control characters).
I just ran into another case today, but this time it's in the JS parser.
Take a look at the source for http://www.magic-cadeaux.fr/ where it
lists the function swap(image, num). If it weren't for all of the
slashes it would be well-formed JavaScript, but unfortunately the
parse-js plugin doesn't deal with it correctly: it just hangs in a very
deep loop. A browser such as Firefox, however, seems to handle it fine.
Is there a way we can deal with these cases other than limiting the
size of an Element?
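
For reference, the URL quick fix mentioned above amounts to roughly the
following. This is only a minimal standalone sketch, not the actual
change to urlfilter-prefix: the class name UrlSanityCheck, the method
name filter, and the convention of returning null to reject a URL are
all assumptions made for illustration.

```java
// Hypothetical standalone helper; not part of Nutch or urlfilter-prefix.
final class UrlSanityCheck {
    // Cutoff taken from the 300-character limit described in the post.
    static final int MAX_URL_LENGTH = 300;

    private UrlSanityCheck() {}

    /**
     * Returns the URL unchanged if it passes both checks,
     * or null to signal that it should be rejected (assumed convention).
     */
    static String filter(String url) {
        // Reject missing URLs and URLs longer than the cutoff.
        if (url == null || url.length() > MAX_URL_LENGTH) {
            return null;
        }
        // Reject any URL containing an ASCII control character (value < 32).
        for (int i = 0; i < url.length(); i++) {
            if (url.charAt(i) < 32) {
                return null;
            }
        }
        return url;
    }
}
```

Calling UrlSanityCheck.filter("http://www.magic-cadeaux.fr/") returns
the URL unchanged, while a URL containing a tab character or running to
301 characters comes back as null. This only guards the URL itself; it
does nothing for long Elements inside a page, which is the js-parser
case.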