Posted to user@nutch.apache.org by consultas <co...@qualidade.eng.br> on 2009/02/14 18:31:13 UTC

Can't index a site

I have indexed about 600 sites on some specific subjects, including the nuclear area, that have resulted in about 500,000 indexed pages.  One important "seed site" is www.nrc.gov, but no matter what (and this has been true since version 0.3 of Nutch) I am not able to index more than about 100 pages for this site.  If you go to Google or Yahoo, they show more than 20,000 results.  In past years I used another program, Aspseek, and with it I was able to index as many pages as I wanted.  I have looked at the source code of some of the NRC pages and could not find any mention of any robots rule.
Any ideas about this behaviour?
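For what it's worth, crawler exclusions are normally declared in the site's 
/robots.txt file (in addition to any per-page robots meta tags), so a quick 
way to check is to fetch that file directly.  A throwaway Java sketch of 
such a check, nothing Nutch-specific, just for illustration:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Prints a site's robots.txt so any Disallow rules are easy to spot.
    public class PrintRobotsTxt {
        public static void main(String[] args) throws Exception {
            URL robots = new URL("http://www.nrc.gov/robots.txt");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(robots.openStream(), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // look for Disallow: entries here
            }
            in.close();
        }
    }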

Thanks

URL normalization ...

Posted by "David M. Cole" <dm...@colegroup.com>.
Hi:

I'm running Build #722 on a Macintosh under Mac OS X 10.4.11 and am 
indexing about 10,000 URLs from a single site. All is well, except I 
am getting duplicate index entries for some files.

For example

http://www.newsinc.net/morgue/2003/ni031110.html

and

http://www.newsinc.net/morgue/2003/NI031110.html

Because the web server is also a Mac-based system, from the Apache 
(and file system) viewpoint, these are the same file. Nutch sees them 
as two different files and indexes them twice. Search results present 
both URLs.

Ideally, there is a parameter somewhere that I can change to make 
URLs case-insensitive. I have Googled "Nutch URL normalization", but 
those postings seem to deal with issues such as 
http://my.domain.com:80/ vs. http://my.domain.com/ ...

Any thoughts about how to resolve this (admittedly minor) problem 
would be appreciated.
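What I'm after is roughly the behaviour in the sketch below (plain Java, 
just to illustrate what I mean by case-insensitive URLs; the class and 
method names are made up and aren't part of any actual Nutch plugin):

    import java.net.URL;

    // Illustration only: lowercase the host and path so that URLs which
    // differ only in case collapse to a single form before indexing.
    public class LowercasePathNormalizer {
        public static String normalize(String urlString) throws Exception {
            URL u = new URL(urlString);
            String path = u.getPath() == null ? "" : u.getPath().toLowerCase();
            String query = u.getQuery() == null ? "" : "?" + u.getQuery();
            return u.getProtocol() + "://" + u.getHost().toLowerCase()
                    + (u.getPort() == -1 ? "" : ":" + u.getPort())
                    + path + query;
        }

        public static void main(String[] args) throws Exception {
            // Both of the URLs above should come out identical.
            System.out.println(normalize("http://www.newsinc.net/morgue/2003/ni031110.html"));
            System.out.println(normalize("http://www.newsinc.net/morgue/2003/NI031110.html"));
        }
    }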

Thanks.

\dmc

-- 
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
    David M. Cole                                            dmc@colegroup.com
    Editor & Publisher, NewsInc. <http://newsinc.net>        V: (650) 557-2993
    Consultant: The Cole Group <http://colegroup.com/>       F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+

Re: Build #722 won't start on Mac OS X, 10.4.11

Posted by "David M. Cole" <dm...@colegroup.com>.
At 7:39 AM -0600 2/15/09, Eric Christeson wrote:
>Later versions of nutch-dev use Hadoop 0.19, which requires Java 1.6. 
>They use some features introduced in 1.6.  If you ask Google about 
>the 'Bad version number' error you'll find that it refers to cases exactly 
>like this, where a library needs a (usually) newer JVM.

Yes, I had found those references, just wanted to confirm the bad news.

While Apple doesn't support 1.6 on 10.4 (or on a PPC, for that 
matter), some more Googling turned up a build for that platform. I am 
now successfully using Build #722.

Thanks.

\dmc

-- 
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
    David M. Cole                                            dmc@colegroup.com
    Editor & Publisher, NewsInc. <http://newsinc.net>        V: (650) 557-2993
    Consultant: The Cole Group <http://colegroup.com/>       F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+

Re: Build #722 won't start on Mac OS X, 10.4.11

Posted by Eric Christeson <Er...@ndsu.edu>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


On Feb 14, 2009, at 20:16, David M. Cole wrote:

> Hiya:
>
> Brand new to Nutch. Was able to get it to work with Tomcat on a Mac  
> OS X PPC machine, 450MHz, dual processor, running OS X 10.4.11 with  
> the latest version of Java (1.5.0_16-132). Indexed great, am able  
> to search via OpenSearch option with zero problems.
>
> Unfortunately, I need HTTP authorization (basic, not digest or  
> NTLM) for the site I'm trying to index.
>
> I downloaded nightly build #722 the other day, added the  
> credentials info into 'conf/httpclient-auth.xml' and have not been  
> able to get it to launch -- I receive the error "Bad version number  
> in .class file" on the command line when I run a crawl command.

Later versions of nutch-dev use Hadoop 0.19, which requires Java 1.6. 
They use some features introduced in 1.6.  If you ask Google about 
the 'Bad version number' error you'll find that it refers to cases exactly 
like this, where a library needs a (usually) newer JVM.
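If you want to check which Java release a particular jar or class was 
compiled for, the class file header tells you: major version 49 means 
Java 5 and 50 means Java 6.  A quick, unofficial sketch that reads it:

    import java.io.DataInputStream;
    import java.io.FileInputStream;

    // Prints the class file version from a .class file header.
    // Major 49 = Java 5, 50 = Java 6; "Bad version number in .class file"
    // means the running JVM is older than the class's major version.
    public class ClassVersion {
        public static void main(String[] args) throws Exception {
            DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
            int magic = in.readInt();           // should be 0xcafebabe
            int minor = in.readUnsignedShort();
            int major = in.readUnsignedShort();
            in.close();
            System.out.println("magic=" + Integer.toHexString(magic)
                    + " major=" + major + " minor=" + minor);
        }
    }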

Eric

- --
Eric J. Christeson   <Er...@ndsu.edu>
Enterprise Computing and Infrastructure      (701) 231-8693
North Dakota State University, Fargo, North Dakota
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.1 (Darwin)

iD8DBQFJmBsxCnMyGd/wX/sRAjT3AJ9URHF4uw+6wpeV6aWDreLjOSD/hgCdFsJv
of/zE4M/RcBZbeg1nvBp+PE=
=PsdR
-----END PGP SIGNATURE-----

Build #722 won't start on Mac OS X, 10.4.11

Posted by "David M. Cole" <dm...@colegroup.com>.
Hiya:

Brand new to Nutch. Was able to get it to work with Tomcat on a Mac 
OS X PPC machine, 450MHz, dual processor, running OS X 10.4.11 with 
the latest version of Java (1.5.0_16-132). Indexed great, am able to 
search via OpenSearch option with zero problems.

Unfortunately, I need HTTP authorization (basic, not digest or NTLM) 
for the site I'm trying to index.

I downloaded nightly build #722 the other day, added the credentials 
info into 'conf/httpclient-auth.xml' and have not been able to get it 
to launch -- I receive the error "Bad version number in .class file" 
on the command line when I run a crawl command.
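For reference, what I put into 'conf/httpclient-auth.xml' looks roughly 
like the snippet below.  The host, realm and credentials here are 
placeholders, and I'm going by the commented-out example in the sample 
file, so treat the exact format as my best guess rather than gospel:

    <?xml version="1.0"?>
    <auth-configuration>
      <credentials username="myuser" password="mypass">
        <!-- scope the credentials to the server that requires basic auth -->
        <authscope host="intranet.example.com" port="80" realm="Restricted"/>
      </credentials>
    </auth-configuration>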

Do the latest builds require a higher version of Java than I have? Or 
is there something somewhere that I need to point to my JAVA_HOME? 
Or, shudder, do I have to download Ant and compile from source on the 
Mac to make sure the Java versions line up?

Alternatively, is there a way to get basic HTTP authorization without 
using httpclient-auth?

Your thoughts would be appreciated.

Thanks.

\dmc



-- 
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
    David M. Cole                                            dmc@colegroup.com
    Editor & Publisher, NewsInc. <http://newsinc.net>        V: (650) 557-2993
    Consultant: The Cole Group <http://colegroup.com/>       F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+

Re: Can't index a site

Posted by consultas <co...@qualidade.eng.br>.
Yes, that could be it.  This is the nrc.gov robots.txt:

User-agent: *
Sitemap: <http://www.nrc.gov/sitemapindex.xml>
Disallow: .......

As a matter of fact, the old Aspseek was not very polite, as it allowed 
you to bypass robots.txt if you wished.  On the other hand, there is an 
increasing number of sites adopting sitemaps, so I think this is 
something to consider.  Thank you for the answer.
----- Original Message ----- 
From: "Frank McCown" <fm...@harding.edu>
To: <nu...@lucene.apache.org>
Sent: Saturday, February 14, 2009 4:52 PM
Subject: Re: Can't index a site


One possibility is that nrc.gov is using the sitemap protocol which
allows Google et al. to find more pages than would be found with
traditional web crawling:

http://www.nrc.gov/sitemapindex.xml

I don't think Nutch supports the sitemap protocol.  It could be that
Aspseek supports sitemaps, or that the link structure of nrc.gov has
changed, or that they have added more exclusions to their robots.txt
file.

Frank


On Sat, Feb 14, 2009 at 11:31 AM, consultas <co...@qualidade.eng.br> 
wrote:
> I have indexed about 600 sites on some specific subjects, including the 
> nuclear area, that have resulted in about 500,000 indexed pages.  One 
> important "seed site" is www.nrc.gov, but no matter what (and this has 
> been true since version 0.3 of Nutch) I am not able to index more than 
> about 100 pages for this site.  If you go to Google or Yahoo, they show 
> more than 20,000 results.  In past years I used another program, Aspseek, 
> and with it I was able to index as many pages as I wanted.  I have looked 
> at the source code of some of the NRC pages and could not find any mention 
> of any robots rule.
> Any ideas about this behaviour?
>
> Thanks




Re: Can't index a site

Posted by Frank McCown <fm...@harding.edu>.
One possibility is that nrc.gov is using the sitemap protocol which
allows Google et al. to find more pages than would be found with
traditional web crawling:

http://www.nrc.gov/sitemapindex.xml

I don't think Nutch supports the sitemap protocol.  It could be that
Aspseek supports sitemaps, or that the link structure of nrc.gov has
changed, or that they have added more exclusions to their robots.txt
file.
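If you wanted to approximate what Google is doing, one workaround would 
be to pull the URLs out of the sitemap index with a small program and add 
them to your Nutch seed list by hand.  A rough Java sketch (element names 
follow the sitemap protocol; error handling omitted):

    import java.net.URL;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    // Prints the <loc> entries from a sitemap index.  Each one points at
    // a child sitemap, which in turn lists the actual page URLs.
    public class SitemapLocs {
        public static void main(String[] args) throws Exception {
            URL index = new URL("http://www.nrc.gov/sitemapindex.xml");
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(index.openStream());
            NodeList locs = doc.getElementsByTagName("loc");
            for (int i = 0; i < locs.getLength(); i++) {
                System.out.println(locs.item(i).getTextContent());
            }
        }
    }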

Frank


On Sat, Feb 14, 2009 at 11:31 AM, consultas <co...@qualidade.eng.br> wrote:
> I have indexed about 600 sites on some specific subjects, including the nuclear area, that have resulted in about 500,000 indexed pages.  One important "seed site" is www.nrc.gov, but no matter what (and this has been true since version 0.3 of Nutch) I am not able to index more than about 100 pages for this site.  If you go to Google or Yahoo, they show more than 20,000 results.  In past years I used another program, Aspseek, and with it I was able to index as many pages as I wanted.  I have looked at the source code of some of the NRC pages and could not find any mention of any robots rule.
> Any ideas about this behaviour?
>
> Thanks