Posted to user@nutch.apache.org by Earl Cahill <ca...@yahoo.com> on 2005/10/20 20:06:53 UTC

Re: crawl problems (a bug/patch)

Still tracking down a solution, but my problems appear
to be parsing based.

My page has this tag

<div class="content" id="content"
style="display:none;">

The div starts without display and then javascript
brings in a template and displays the div.  I think
this is totally legitimate, and Yahoo!, Google and MSN
all seem to agree.

For whatever reason, nutch extracts display:none as a
url.  I'm still digging, but haven't figured that part
out.

display:none gets passed to
org.apache.nutch.net.BasicUrlNormalizer, where this
line

URL url = new URL(urlString);

appears to throw a MalformedURLException, likely
because display:none isn't much of a URL.
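
Here is a quick standalone check (my own throwaway snippet, not Nutch
code; BadUrlDemo is just a made-up name) that reproduces the exception:

import java.net.MalformedURLException;
import java.net.URL;

public class BadUrlDemo {
  public static void main(String[] args) {
    try {
      // java.net.URL has no protocol handler for "display"
      URL url = new URL("display:none");
      System.out.println(url);
    } catch (MalformedURLException e) {
      // prints something like: unknown protocol: display
      System.out.println(e);
    }
  }
}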

After this, no other links get processed on the page. 


The try block is around extracting links for the whole
page, so as soon as an exception is thrown, link
extraction stops.  This seems a little harsh,
especially since nutch seems perhaps a little naive
here.  I propose wrapping each call to outlinks.add(new
Outlink(url, anchor)) in its own try/catch.  Then if
there is a problem with any single URL, parsing
continues.  The patch below does exactly that.

Many more links on my page get processed, but nutch
still doesn't find

<a href=/sitemap.html>browse</a>

and I am not sure why.

This little patch seems like a pretty huge deal, and I
really can't believe that no one else has discovered
it.  One "bad" link and the rest of the page gets
thrown away?  If nothing else, doesn't anyone else use
styles?  It seems like any page with a div whose
style attribute isn't a real link would have the
same result.

Maybe the thinking was that if a page has a bad link,
that is reason enough to skip ahead.  I could buy
that a whole lot more if the parsing were more mature.

I just looked and the mapreduce branch has the exact
same code, so the patch should work for both.

So, three open questions:

1.  Why doesn't my link (<a
href=/sitemap.html>browse</a>) get parsed?
2.  Why does my style get followed?
3.  Where do I look for a list of all the failed
links?

Thanks,
Earl

Index: src/java/org/apache/nutch/parse/OutlinkExtractor.java
===================================================================
--- src/java/org/apache/nutch/parse/OutlinkExtractor.java	(revision 326762)
+++ src/java/org/apache/nutch/parse/OutlinkExtractor.java	(working copy)
@@ -97,7 +97,11 @@
       while (matcher.contains(input, pattern)) {
         result = matcher.getMatch();
         url = result.group(0);
-        outlinks.add(new Outlink(url, anchor));
+        try {
+          outlinks.add(new Outlink(url, anchor));
+        } catch (Exception ex) {
+          LOG.throwing(OutlinkExtractor.class.getName(), "getOutlinks", ex);
+        }
       }
     } catch (Exception ex) {
       // if it is a malformed URL we just throw it away and continue with



		

Re: crawl problems (a bug/patch)

Posted by Earl Cahill <ca...@yahoo.com>.
Should I submit this through JIRA?

Earl


Re: crawl problems (a bug/patch)

Posted by Earl Cahill <ca...@yahoo.com>.
Jérôme,

> which Nutch version do you use?

Kind of gave up on mapred for a while, so I am using
trunk.

> There was a bug concerning content-types with
> parameters such as
> "text/html; charset=iso-8859-1".

Yeah, when I telnet in and GET / on shopthar.com, I get

Content-Type: text/html; charset=iso-8859-1

> This issue is fixed in trunk and mapred.

Hmm, well, I was seeing something earlier in trunk. 
That said, something happened and I now seem to get a
partial crawl started.  How very strange.  I did catch
a few updates today, but the commits sure didn't seem
related.

Now I crawl for a while, and then it just stops.  I
still get new segments starting, but no new http hits
to the server.  So looks like I have something new to
track down.  But yeah, when it is going, it can hammer
pretty good.

Earl


		

Re: crawl problems (a bug/patch)

Posted by Jérôme Charron <je...@gmail.com>.
> By investigating further, I've found that for parse-html, the links are
> extracted differently: the links are returned by
> DOMContentUtils.getOutlinks based upon Neko, which therefore makes me
> wonder how you get to extract links with OutlinkExtractor instead...

Earl,

which Nutch version do you use?
If your links are extracted by the OutlinkExtractor, it seems that it is not
the HtmlParser that is used to parse your document, but the TextParser
instead (the default one).
There was a bug concerning content-types with parameters such as
"text/html; charset=iso-8859-1".
Moreover, your site returns such a content-type, so the ParserFactory
doesn't find the right parser (HtmlParser), but uses the default
one.
This issue is fixed in trunk and mapred.
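
As a rough illustration of what the fix has to do (a sketch only, with a
hypothetical helper name, not the actual Nutch code), the parameters have
to be stripped from the content-type before the parser lookup:

public class ContentTypeCleaner {
  // hypothetical helper: strip the parameters so that
  // "text/html; charset=iso-8859-1" maps to the "text/html" parser
  static String cleanContentType(String contentType) {
    int semicolon = contentType.indexOf(';');
    String mime = (semicolon < 0) ? contentType
                                  : contentType.substring(0, semicolon);
    return mime.trim().toLowerCase();
  }

  public static void main(String[] args) {
    // prints: text/html
    System.out.println(cleanContentType("text/html; charset=iso-8859-1"));
  }
}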
For further details, see the thread on the nutch-dev mailing list:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00791.html
or the NUTCH-88 issue : http://issues.apache.org/jira/browse/NUTCH-88

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: crawl problems (a bug/patch)

Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
Hi Earl,


--- Earl Cahill <ca...@yahoo.com> wrote:

> I know it's minor, but if you later apply
> Perl5Compiler.CASE_INSENSITIVE_MASK, you don't need to
> do a-zA-Z.


Indeed.

> 
> For me, there are two groups of issues.
> 
> 1.  URL_PATTERN issues.  URL_PATTERN matches things
> that aren't really links, and doesn't match things
> that are links.  This is what you talk about below,
> and what your JIRA issue covers.
> 
> It appears that URL_PATTERN aims to match anything
> that looks like a link on the page, whether in a tag
> or not.    Wondering about finding tags, and then
> looking for links in the tags.  Seems a little much to
> crawl urls that are plain text and not in a tag, since
> they aren't really links.  A simple tag regex in perl
> looks like this /<[^>]+?>/.

Indeed, but you're thinking HTML there.  The parsed text can come from
somewhere else: it could be text parsed from a Word document or a PDF
file (you can find references to getOutlinks in the relevant plugins).

By investigating further, I've found that for parse-html, the links are
extracted differently: the links are returned by
DOMContentUtils.getOutlinks based upon Neko, which therefore makes me
wonder how you get to extract links with OutlinkExtractor instead...

Regards,
Sébastien.


	

	
		

Re: crawl problems (a bug/patch)

Posted by Earl Cahill <ca...@yahoo.com>.
Hi Sébastien,

Yahoo! just hosed my message, glad I had it elsewhere.

> As you probably saw in the OutlinkExtractor class,
> the links are
> extracted with a Regexp.  

Ahh, didn't see it before, but I now see URL_PATTERN. 


I know it's minor, but if you later apply
Perl5Compiler.CASE_INSENSITIVE_MASK, you don't need to
do a-zA-Z.
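
For instance, a minimal sketch using the same Jakarta ORO classes (the
pattern and class name here are made up for illustration):

import org.apache.oro.text.regex.MalformedPatternException;
import org.apache.oro.text.regex.Pattern;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.Perl5Matcher;

public class CaseMaskDemo {
  public static void main(String[] args) throws MalformedPatternException {
    Perl5Compiler compiler = new Perl5Compiler();
    Perl5Matcher matcher = new Perl5Matcher();
    // with the mask applied, [a-z] already behaves like [a-zA-Z]
    Pattern pattern =
        compiler.compile("[a-z]+://", Perl5Compiler.CASE_INSENSITIVE_MASK);
    System.out.println(matcher.contains("HTTP://example.com/", pattern)); // true
  }
}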

For me, there are two groups of issues.

1.  URL_PATTERN issues.  URL_PATTERN matches things
that aren't really links, and doesn't match things
that are links.  This is what you talk about below,
and what your JIRA issue covers.

It appears that URL_PATTERN aims to match anything
that looks like a link on the page, whether in a tag
or not.  I'm wondering about finding tags first, and
then looking for links inside the tags.  It seems a
little much to crawl URLs that are plain text and not
in a tag, since they aren't really links.  A simple
tag regex in Perl looks like this: /<[^>]+?>/ (a Java
sketch of the tag-then-href idea follows the href
regex below).

Extracting key/value pairs is a little harder, but
given $key, the second capture group of this regex

m/(?<!\w|\.)\Q$key\E\s*=(["']?)(|.*?[^\\])\1(\s|>)/is

holds the value.  This regex has gone through some
rather rigorous testing/use.

Getting $key interpolated into the regex seems a
little harder in Java, but it seems we would mostly be
looking for href, and src, and maybe not even src.

The regex for href looks like this

m/(?<!\w|\.)href\s*=(["']?)(|.*?[^\\])\1(\s|>)/is
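
Translated into java.util.regex syntax, a rough sketch of the
tag-then-href idea (my own untested snippet; HrefSketch is a throwaway
name) looks like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefSketch {
  // the simple tag regex from above
  private static final Pattern TAG = Pattern.compile("<[^>]+?>");
  // the href regex from above, with /is expressed as flags; group 2 is the value
  private static final Pattern HREF = Pattern.compile(
      "(?<!\\w|\\.)href\\s*=([\"']?)(|.*?[^\\\\])\\1(\\s|>)",
      Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

  public static void main(String[] args) {
    String html =
        "<div style=\"display:none;\"><a href=/sitemap.html>browse</a></div>";
    Matcher tags = TAG.matcher(html);
    while (tags.find()) {
      Matcher href = HREF.matcher(tags.group());
      if (href.find()) {
        // prints /sitemap.html; the style attribute never matches
        System.out.println(href.group(2));
      }
    }
  }
}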

If you commit your TestPattern.java stuff, I would be
happy to add cases and play with the regexes until all
the cases work.

2.  If a link is encountered that throws an exception
anywhere in this call

outlinks.add(new Outlink(url, anchor))

then no more links will get extracted on the page. 
That is what my JIRA issue
(http://issues.apache.org/jira/browse/NUTCH-120)
covers.

Thanks for looking into this.

Earl


		


Re: crawl problems (a bug/patch)

Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
Hi Earl,

Please, see my responses below.


--- Earl Cahill <ca...@yahoo.com> wrote:


As you probably saw in the OutlinkExtractor class, the links are
extracted with a Regexp.  I'm no expert in the matter, but that will
certainly answer your questions below...

> So, three open questions
> 
> 1.  Why doesn't my link (<a
> href=/sitemap.html>browse</a>) get parsed?

Because it doesn't match the aforementioned regexp.

> 2.  Why does my style get followed?

Because it matches the regexp.

> 3.  Where do I look for a list of all the failed
> links?

I don't think there is any.

I have just created the issue in JIRA:
http://issues.apache.org/jira/browse/NUTCH-119


Regards,
Sébastien.




	

	
		