Posted to user@nutch.apache.org by Earl Cahill <ca...@yahoo.com> on 2005/10/19 10:20:55 UTC

crawl problems

I am trying to do a crawl on trunk of one of my sites,
and it isn't working.  I made a file called urls that
just contains the site

http://shopthar.com/

in my conf/crawl-urlfilter.txt I have

+^http://shopthar.com/

I then do

bin/nutch crawl urls -dir crawl.test -depth 100
-threads 20

it kicks in and I get repeating chunks like

051019 010450 Updating
/home/nutch/nutch/trunk/crawl.test/db
051019 010450 Updating for
/home/nutch/nutch/trunk/crawl.test/segments/20051019010449
051019 010450 Finishing update
051019 010450 Update finished
051019 010450 FetchListTool started
051019 010450 Overall processing: Sorted 0 entries in
0.0 seconds.
051019 010450 Overall processing: Sorted NaN
entries/second
051019 010450 FetchListTool completed
051019 010450 logging at INFO

For ages, but I only see two nutch hits in my access
log: one for my robots.txt and one for my front page. 
Nothing else.

The "crawl" finishes, then I do a search and can only
get a hits for the front page.  When I do the search
via lynx, I get a momentary

Bad partial reference!  Stripping lead dots.

I can't imagine this is really the problem, but pretty
well all my links are relative.  I mean nutch has to
be able to follow relative links, right?

Ideas?

Thanks,
Earl


		

Re: crawl problems

Posted by Earl Cahill <ca...@yahoo.com>.
Yeah, the big link on the homepage is

<a href=/sitemap.html>browse</a>

which then opens several other pages.  All the links
on the site start with /.

So I tried

+^/

in my conf/crawl-urlfilter.txt with no luck.  

Against my better judgement, I also tried

+^/.*

which also didn't work.
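If the filter is applied to the fully resolved, absolute
form of each link (that is my assumption here, I haven't
checked the code yet), that would explain it: by the time
the filter sees /sitemap.html it has already been resolved
against the page url, so ^/ can never match.  A quick
standalone illustration, not nutch code:

import java.net.URL;
import java.util.regex.Pattern;

// Standalone illustration (not nutch code): relative hrefs get resolved
// against the page url before filtering, so the filter only ever sees
// absolute urls, and a pattern anchored at "/" can never match one.
public class RelativeLinkCheck {
  public static void main(String[] args) throws Exception {
    URL base = new URL("http://shopthar.com/");
    URL resolved = new URL(base, "/sitemap.html");
    System.out.println(resolved);
    // false: "^/" never matches an absolute url
    System.out.println(Pattern.compile("^/").matcher(resolved.toString()).find());
    // true: the prefix rule is the one that can match
    System.out.println(Pattern.compile("^http://shopthar\\.com/").matcher(resolved.toString()).find());
  }
}

So the +^http://shopthar.com/ rule should be the one that
matters, if my assumption about absolute urls is right.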

Thanks,
Earl

--- Doug Cutting <cu...@nutch.org> wrote:

> The only link on http://shopthar.com/ to the domain
> shopthar.com is a 
> link to http://shopthar.com/.  So a crawl starting
> from that page that 
> only visits pages in shopthar.com will only find
> that one page.
> 
> % wget -q -O - http://shopthar.com/ | grep
> shopthar.com
>    <tr><td colspan=2>Welcome to
> shopthar.com</td></td></tr>
> <a href=http://shopthar.com/>shopthar.com</a> |
> 
> Doug
> 



		

Re: crawl problems

Posted by Doug Cutting <cu...@nutch.org>.
The only link on http://shopthar.com/ to the domain shopthar.com is a 
link to http://shopthar.com/.  So a crawl starting from that page that 
only visits pages in shopthar.com will only find that one page.

% wget -q -O - http://shopthar.com/ | grep shopthar.com
   <tr><td colspan=2>Welcome to shopthar.com</td></td></tr>
<a href=http://shopthar.com/>shopthar.com</a> |

Doug


Re: crawl problems

Posted by Earl Cahill <ca...@yahoo.com>.
I tried this as well, and it didn't help.  It does
look like I want to have at least both of these in my
filter

+^/
+^http://shopthar.com/

though.  I will have to dig through a little code
tonight to see why my urls are getting skipped.

Thanks,
Earl

--- Miguel A Paraz <mp...@gmail.com> wrote:

> On 10/19/05, Earl Cahill <ca...@yahoo.com> wrote:
> > I can't imagine this is really the problem, but pretty
> > well all my links are relative.  I mean nutch has to
> > be able to follow relative links, right?
> 
> I'm just guessing: is this affected by the
> db.ignore.internal.links property?
> 



	
		

Re: crawl problems

Posted by Miguel A Paraz <mp...@gmail.com>.
On 10/19/05, Earl Cahill <ca...@yahoo.com> wrote:
> I can't imagine this is really the problem, but pretty
> well all my links are relative.  I mean nutch has to
> be able to follow relative links, right?

I'm just guessing: is this affected by the db.ignore.internal.links property?
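For reference, if it turns out to matter, the property can be overridden in
conf/nutch-site.xml.  The snippet below is only an illustration of the usual
property format, and the value shown is a guess, not the shipped default:

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>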

Re: search return 0 hit

Posted by Kai Hagemeister <kh...@planoweb.de>.
Hello,

> I replace ROOT.WAR file in tomcat by nutch's and
> launch tomcat in nutch's segment directory ( parallel
> to index subdir )

IMHO you should start tomcat in the parent directory of segment.
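If I remember the tutorial right, the layout the crawl tool produces
(crawl.test in Earl's logs) looks roughly like

crawl.test/
  db/
  segments/
    20051019010449/
  index/

and the webapp looks for the index relative to the directory tomcat is
started from, so starting it in crawl.test itself (the parent of segments)
should do it.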

Kai


Re: crawl problems (a bug/patch)

Posted by Earl Cahill <ca...@yahoo.com>.
Should I submit this through JIRA?

Earl



	
		

Re: crawl problems (a bug/patch)

Posted by Earl Cahill <ca...@yahoo.com>.
Jérôme,

> which Nutch version do you use?

Kind of gave up on mapred for a while, so I am using
trunk.

> There were a bug concerning the content-types with
> parameters such as
> "text/html; charset=iso-8859-1".

Yeah, when I telnet in and GET / on shopthar.com, I get

Content-Type: text/html; charset=iso-8859-1

> This issue is fixed in trunk and mapred.

Hmm, well, I was seeing something earlier in trunk. 
That said, something happened and I now seem to get a
partial crawl started.  How very strange.  I did catch
a few updates today, but the commits sure didn't seem
related.

Now it crawls for a while, and then it just stops.  I
still get new segments starting, but no new http hits
to the server.  So it looks like I have something new to
track down.  But yeah, when it is going, it can hammer
pretty good.

Earl


		

Re: crawl problems (a bug/patch)

Posted by Jérôme Charron <je...@gmail.com>.
> By investing further, I've found that for parse-html, the links are
> extracted differently: the links are returned by
> DOMContentUtils.getOutlinks based upon Neko, which therefore makes me
> wonder how you get to extract links with OutlinkExtractor instead...

Earl,

which Nutch version do you use?
If your links are extracted by the OutlinkExtractor, it seems that it is not
the HtmlParser that is used to parse your document, but the TextParser
instead (the default one).
There was a bug concerning content-types with parameters such as
"text/html; charset=iso-8859-1".
Moreover, your site returns such a content-type, so the ParserFactory
doesn't find the right parser (HtmlParser) and uses the default
one instead.
This issue is fixed in trunk and mapred.
For further details,
see the thread on the nutch-dev mailing list:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00791.html
or the NUTCH-88 issue: http://issues.apache.org/jira/browse/NUTCH-88
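For anyone still on an older checkout, the gist of the problem (as I
understand it, this is not the actual patch) is that parser lookup has to
key on the bare mime type, with any parameters after the ";" stripped off
first.  A rough sketch:

public class ContentTypeSketch {
  // Illustration only: reduce "text/html; charset=iso-8859-1" to "text/html"
  // before looking up a parser for it.
  static String bareMimeType(String contentType) {
    if (contentType == null) return null;
    int semi = contentType.indexOf(';');
    String mime = (semi >= 0) ? contentType.substring(0, semi) : contentType;
    return mime.trim().toLowerCase();
  }

  public static void main(String[] args) {
    System.out.println(bareMimeType("text/html; charset=iso-8859-1"));
  }
}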

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: crawl problems (a bug/patch)

Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
Hi Earl,


--- Earl Cahill <ca...@yahoo.com> wrote:

> I know it's minor, but if you later apply
> Perl5Compiler.CASE_INSENSITIVE_MASK, you don't need to
> do a-zA-Z.


Indeed.

> 
> For me, there are two groups of issues.
> 
> 1.  URL_PATTERN issues.  URL_PATTERN matches things
> that aren't really links, and doesn't match things
> that are links.  This is what you talk about below,
> and what your JIRA issue covers.
> 
> It appears that URL_PATTERN aims to match anything
> that looks like a link on the page, whether in a tag
> or not.    Wondering about finding tags, and then
> looking for links in the tags.  Seems a little much to
> crawl urls that are plain text and not in a tag, since
> they aren't really links.  A simple tag regex in perl
> looks like this /<[^>]+?>/.

Indeed, but you're thinking HTML there.  The parsed text can come from
somewhere else:  it could be text parsed from a Word document or a PDF
file (you can find references to getOutlinks in the relevant plugins).

By investigating further, I've found that for parse-html, the links are
extracted differently: the links are returned by
DOMContentUtils.getOutlinks based upon Neko, which therefore makes me
wonder how you get to extract links with OutlinkExtractor instead...

Regards,
Sébastien.


	

	
		

Re: crawl problems (a bug/patch)

Posted by Earl Cahill <ca...@yahoo.com>.
Hi Sébastien,

Yahoo! just hosed my message, glad I had it elsewhere.

> As you probably saw in the OutlinkExtractor class,
> the links are
> extracted with a Regexp.  

Ahh, didn't see it before, but I now see URL_PATTERN. 


I know it's minor, but if you later apply
Perl5Compiler.CASE_INSENSITIVE_MASK, you don't need to
do a-zA-Z.

For me, there are two groups of issues.

1.  URL_PATTERN issues.  URL_PATTERN matches things
that aren't really links, and doesn't match things
that are links.  This is what you talk about below,
and what your JIRA issue covers.

It appears that URL_PATTERN aims to match anything
that looks like a link on the page, whether in a tag
or not.  I am wondering about finding tags first, and then
looking for links inside the tags.  It seems a little much to
crawl urls that are plain text and not in a tag, since
they aren't really links.  A simple tag regex in perl
looks like this: /<[^>]+?>/.
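Here is the same idea in java, just as a sketch of what "only look
inside tags" would mean (not anything from nutch):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: find tags with the regex above, so link extraction could be
// limited to what is inside a tag instead of anywhere in the page text.
public class TagRegexSketch {
  public static void main(String[] args) {
    String page = "plain text http://example.com <a href=/sitemap.html>browse</a>";
    Matcher m = Pattern.compile("<[^>]+?>").matcher(page);
    while (m.find()) {
      System.out.println(m.group()); // prints <a href=/sitemap.html> and </a>
    }
  }
}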

Extracting key/value pairs is a little harder, but
given $key, the second capture group from this regex

m/(?<!\w|\.)\Q$key\E\s*=(["']?)(|.*?[^\\])\1(\s|>)/is

returns the value.  This regex has gone through some
rather rigorous testing/use.

Getting $key interpolated into the regex seems a
little harder in java, but it seems we would mostly be
looking for href, and src, and maybe not even src.

The regex for href looks like this

m/(?<!\w|\.)href\s*=(["']?)(|.*?[^\\])\1(\s|>)/is
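For what it's worth, here is a rough java translation of that (untested
against nutch itself; Pattern.quote plays the role of \Q...\E, which takes
care of the interpolation problem):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough translation of the perl regex above; group 2 is the attribute value.
public class AttrRegexSketch {
  static Pattern attrPattern(String key) {
    return Pattern.compile(
        "(?<![\\w.])" + Pattern.quote(key) + "\\s*=([\"']?)(|.*?[^\\\\])\\1(\\s|>)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
  }

  public static void main(String[] args) {
    Matcher m = attrPattern("href").matcher("<a href=/sitemap.html>browse</a>");
    if (m.find()) {
      System.out.println(m.group(2)); // prints /sitemap.html
    }
  }
}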

If you commit your TestPattern.java stuff, I would be
happy to add cases and play with the regexes until all
the cases work.

2.  If a link is encountered that throws an exception
anywhere in this call

outlinks.add(new Outlink(url, anchor))

then no more links will get extracted on the page. 
That is what my JIRA issue
(http://issues.apache.org/jira/browse/NUTCH-120)
covers.

Thanks for looking into this.

Earl


		

Re: crawl problems (a bug/patch)

Posted by Earl Cahill <ca...@yahoo.com>.

--- Sébastien LE CALLONNEC <sl...@yahoo.ie> wrote:

> Hi Earl,
> 
> Please, see my responses below.
> 
> 
> --- Earl Cahill <ca...@yahoo.com> wrote:
> 
> 
> As you probably saw in the OutlinkExtractor class,
> the links are
> extracted with a Regexp.  I'm no expert in the
> matter, but that will
> certainly answer your questions below...
> 
> > So, three open questions
> > 
> > 1.  Why doesn't my link (<a
> > href=/sitemap.html>browse</a>) get parsed?
> 
> Because it doesn't match the aforementioned regexp.
> 
> > 2.  Why does my style get followed?
> 
> Because it matches the regexp.
> 
> > 3.  Where do I look for a list of all the failed
> > links?
> 
> I don't think there is any.
> 
> I have just created the issue in JIRA:
> http://issues.apache.org/jira/browse/NUTCH-119
> 
> 
> Regards,
> Sébastien.
> 



		

Re: crawl problems (a bug/patch)

Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
Hi Earl,

Please, see my responses below.


--- Earl Cahill <ca...@yahoo.com> wrote:


As you probably saw in the OutlinkExtractor class, the links are
extracted with a Regexp.  I'm no expert in the matter, but that will
certainly answer your questions below...

> So, three open questions
> 
> 1.  Why doesn't my link (<a
> href=/sitemap.html>browse</a>) get parsed?

Because it doesn't match the aforementioned regexp.

> 2.  Why does my style get followed?

Because it matches the regexp.

> 3.  Where do I look for a list of all the failed
> links?

I don't think there is any.

I have just created the issue in JIRA:
http://issues.apache.org/jira/browse/NUTCH-119


Regards,
Sébastien.




	

	
		

Re: crawl problems (a bug/patch)

Posted by Earl Cahill <ca...@yahoo.com>.
Still tracking down a solution, but my problems appear
to be parsing based.

My page has this tag

<div class="content" id="content"
style="display:none;">

The div starts without display and then javascript
brings in a template and displays the div.  I think
this is totally legitimate, and yahoo!, google and msn
all seem to agree.

For whatever reason, nutch extracts display:none as a
url.  I am still digging, but haven't figured that part
out.

display:none gets passed to
org.apache.nutch.net.BasicUrlNormalizer, where this
line

URL url = new URL(urlString);

appears to throw a MalformedURLException, likely
because display:none isn't much of a URL.
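It is easy to reproduce outside of nutch:

import java.net.MalformedURLException;
import java.net.URL;

// "display:none" parses as scheme "display", and there is no protocol
// handler for that, so the URL constructor throws.
public class BadUrlDemo {
  public static void main(String[] args) {
    try {
      URL url = new URL("display:none");
      System.out.println(url);
    } catch (MalformedURLException e) {
      System.out.println(e.getMessage()); // e.g. "unknown protocol: display"
    }
  }
}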

After this, no other links get processed on the page. 


The try is around extracting links for the whole page,
and as soon as an exception is thrown, the link
extraction stops.  This seems a little harsh,
especially since nutch seems perhaps a little naive
here.  I propose wrapping each call to outlinks.add(new
Outlink(url, anchor)) in its own try/catch.  Then if there
is a problem with any single url, parsing continues.  The
patch below does such a thing.

Many more links on my page get processed, but nutch
still doesn't find

<a href=/sitemap.html>browse</a>

and I am not sure why.

This little patch seems like a pretty huge deal, and I
really can't believe that no one else has discovered
it.  One "bad" link and the rest of the page gets
thrown away?  If nothing else, doesn't anyone else use
styles?  It seems like any page with a div whose
style attribute isn't a real link would have the
same result.

Maybe the thinking was that if a page has a bad link,
that is reason enough to skip ahead.  I could buy
that a whole lot more if the parsing were more mature.

I just looked and the mapreduce branch has the exact
same code, so the patch should work for both.

So, three open questions

1.  Why doesn't my link (<a
href=/sitemap.html>browse</a>) get parsed?
2.  Why does my style get followed?
3.  Where do I look for a list of all the failed
links?

Thanks,
Earl

Index: src/java/org/apache/nutch/parse/OutlinkExtractor.java
===================================================================
--- src/java/org/apache/nutch/parse/OutlinkExtractor.java	(revision 326762)
+++ src/java/org/apache/nutch/parse/OutlinkExtractor.java	(working copy)
@@ -97,7 +97,11 @@
       while (matcher.contains(input, pattern)) {
         result = matcher.getMatch();
         url = result.group(0);
-        outlinks.add(new Outlink(url, anchor));
+        try {
+          outlinks.add(new Outlink(url, anchor));
+        } catch (Exception ex) {
+          LOG.throwing(OutlinkExtractor.class.getName(), "getOutlinks", ex);
+        }
       }
     } catch (Exception ex) {
       // if it is a malformed URL we just throw it away and continue with



		

search return 0 hit

Posted by Michael Ji <fj...@yahoo.com>.
hi,

I sent the following message to nutch-dev before, and
I realized that this group might be the better place;
sorry if you get a duplicate message.

Michael Ji,

-----------------------------------------------

Somehow, my search engine doesn't show any results,
even though I can see the index from LukeAll.  (It
worked fine before.)

I replaced the ROOT.WAR file in tomcat with nutch's and
I launch tomcat in nutch's segment directory (parallel
to the index subdir).

Should I reinstall Tomcat?  Or could it be a nutch
indexing issue?  My system is running on Linux.

thanks,

Michael Ji,
-----------------

051019 215411 11 query: com
051019 215411 11 searching for 20 raw hits
051019 215411 11 total hits: 0
051019 215449 12 query request from 65.34.213.205
051019 215449 12 query: net
051019 215449 12 searching for 20 raw hits



	
		