Posted to user@nutch.apache.org by consultas <co...@qualidade.eng.br> on 2009/02/14 18:31:13 UTC
Can't index a site
I have indexed about 600 sites on some specific subjects, including the nuclear area, resulting in about 500,000 indexed pages. One important "seed site" is www.nrc.gov, but no matter what (and this since version 0.3 of Nutch), I am not able to index more than about 100 pages from this site. If you go to Google or Yahoo, they show more than 20,000 results. In past years I used another program, Aspseek, and with it I was able to index as many pages as I wanted. I have looked at the source code of some of the NRC pages and could not find any mention of robots rules.
Any ideas about this behaviour?
Thanks
URL normalization ...
Posted by "David M. Cole" <dm...@colegroup.com>.
Hi:
I'm running Build #722 on a Macintosh, using 10.4.11 and am indexing
about 10,000 URLs from a single site. All is well, except I am
getting double-indexes of some files.
For example
http://www.newsinc.net/morgue/2003/ni031110.html
and
http://www.newsinc.net/morgue/2003/NI031110.html
Because the web server is also a Mac-based system, from the Apache
(and file system) viewpoint, these are the same file. Nutch sees them
as two different files and indexes them twice. Search results present
both URLs.
Ideally, there is a parameter somewhere that I can change to make
URLs case-insensitive. I have Google'd Nutch URL normalization, but
those postings seem to deal with issues such as
http://my.domain.com:80/ vs. http://my.domain.com/ ...
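What I'm after, in effect, is a rule that lowercases just the path (host names are already case-insensitive), which the stock regex normalizer doesn't appear to do, so it would probably take a custom normalizer plugin. A sketch of the logic in Python — the function name is illustrative, not a Nutch API:

```python
from urllib.parse import urlsplit, urlunsplit

def lowercase_path(url: str) -> str:
    """Lowercase only the path of a URL so case variants collapse to one form.

    This mirrors what a custom URL-normalizer plugin could do for sites served
    from a case-insensitive filesystem; query strings and hosts are untouched.
    """
    scheme, netloc, path, query, fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path.lower(), query, fragment))

print(lowercase_path("http://www.newsinc.net/morgue/2003/NI031110.html"))
# → http://www.newsinc.net/morgue/2003/ni031110.html
```

A Nutch plugin would apply the same transformation in Java during normalization, so both case variants collapse before fetching and indexing.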
Any thoughts about how to resolve this (admittedly minor) problem
would be appreciated.
Thanks.
\dmc
--
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
David M. Cole dmc@colegroup.com
Editor & Publisher, NewsInc. <http://newsinc.net> V: (650) 557-2993
Consultant: The Cole Group <http://colegroup.com/> F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
Re: Build #722 won't start on Mac OS X, 10.4.11
Posted by "David M. Cole" <dm...@colegroup.com>.
At 7:39 AM -0600 2/15/09, Eric Christeson wrote:
>Later versions of nutch-dev use Hadoop 0.19, which requires Java 1.6.
>They used some features introduced in 1.6. If you ask Google about
>the 'Bad version number' error, you'll find that it refers to cases exactly
>like this, where a library needs a (usually) newer JVM.
Yes, I had found those references, just wanted to confirm the bad news.
While Apple doesn't support 1.6 on 10.4 (or on a PPC, for that
matter), some more Googling found a build for that platform. I am now
successfully using Build #722.
Thanks.
\dmc
--
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
David M. Cole dmc@colegroup.com
Editor & Publisher, NewsInc. <http://newsinc.net> V: (650) 557-2993
Consultant: The Cole Group <http://colegroup.com/> F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
Re: Build #722 won't start on Mac OS X, 10.4.11
Posted by Eric Christeson <Er...@ndsu.edu>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On Feb 14, 2009, at 20:16, David M. Cole wrote:
> Hiya:
>
> Brand new to Nutch. Was able to get it to work with Tomcat on a Mac
> OS X PPC machine, 450MHz, dual processor, running OS X 10.4.11 with
> the latest version of Java (1.5.0_16-132). Indexed great, am able
> to search via OpenSearch option with zero problems.
>
> Unfortunately, I need HTTP authorization (basic, not digest or
> NTLM) for the site I'm trying to index.
>
> I downloaded nightly build #722 the other day, added the
> credentials info into 'conf/httpclient-auth.xml' and have not been
> able to get it to launch -- I receive the error "Bad version number
> in .class file" on the command line when I run a crawl command.
Later versions of nutch-dev use Hadoop 0.19, which requires Java 1.6.
They used some features introduced in 1.6. If you ask Google about
the 'Bad version number' error, you'll find that it refers to cases exactly
like this, where a library needs a (usually) newer JVM.
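The version in question lives right in the class-file header: bytes 4-7 hold the minor and major version numbers (major 49 = Java 5, 50 = Java 6), and a JVM rejects any class whose major version is newer than it supports. A quick illustration of reading that header, sketched here in Python:

```python
import struct

def class_file_major_version(data: bytes) -> int:
    """Return the major version from a .class file header.

    The header is: 4-byte magic 0xCAFEBABE, 2-byte minor, 2-byte major,
    all big-endian. Major 49 = Java 5, 50 = Java 6, 51 = Java 7, ...
    """
    magic, minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a .class file")
    return major

# The first eight bytes of a class compiled for Java 6:
header = bytes.fromhex("cafebabe00000032")
print(class_file_major_version(header))  # → 50
```

Running this over the first eight bytes of a class inside the Hadoop jar would confirm which Java release it was compiled for.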
Eric
- --
Eric J. Christeson <Er...@ndsu.edu>
Enterprise Computing and Infrastructure (701) 231-8693
North Dakota State University, Fargo, North Dakota
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.1 (Darwin)
iD8DBQFJmBsxCnMyGd/wX/sRAjT3AJ9URHF4uw+6wpeV6aWDreLjOSD/hgCdFsJv
of/zE4M/RcBZbeg1nvBp+PE=
=PsdR
-----END PGP SIGNATURE-----
Build #722 won't start on Mac OS X, 10.4.11
Posted by "David M. Cole" <dm...@colegroup.com>.
Hiya:
Brand new to Nutch. Was able to get it to work with Tomcat on a Mac
OS X PPC machine, 450MHz, dual processor, running OS X 10.4.11 with
the latest version of Java (1.5.0_16-132). Indexed great, am able to
search via OpenSearch option with zero problems.
Unfortunately, I need HTTP authorization (basic, not digest or NTLM)
for the site I'm trying to index.
I downloaded nightly build #722 the other day, added the credentials
info into 'conf/httpclient-auth.xml' and have not been able to get it
to launch -- I receive the error "Bad version number in .class file"
on the command line when I run a crawl command.
Do the latest builds require a higher version of Java than I have? Or
is there something somewhere that I need to point to my JAVA_HOME?
Or, shudder, do I have to download Ant and compile from source on the
Mac to make sure the Javas line up?
Alternately, is there a way to get basic HTTP authorization without
using httpclient-auth?
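For reference, the entry I added follows the commented sample in 'conf/httpclient-auth.xml' and looks roughly like this; the host, username, and password here are placeholders, and the exact attribute names should be checked against the sample in that file:

```xml
<auth-configuration>
  <!-- Placeholder basic-auth credentials, scoped to one host and port -->
  <credentials username="myuser" password="mypassword">
    <authscope host="www.example.com" port="80"/>
  </credentials>
</auth-configuration>
```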
Your thoughts would be appreciated.
Thanks.
\dmc
--
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
David M. Cole dmc@colegroup.com
Editor & Publisher, NewsInc. <http://newsinc.net> V: (650) 557-2993
Consultant: The Cole Group <http://colegroup.com/> F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
Re: Can't index a site
Posted by consultas <co...@qualidade.eng.br>.
Yes, it can be. This is the nrc robots.txt:
User-agent: *
Sitemap: http://www.nrc.gov/sitemapindex.xml
Disallow: .......
As a matter of fact, the old Aspseek was not very polite, as it allowed you to bypass robots.txt if you wished. On the other hand, an increasing number of sites are adopting sitemaps, so I think this is something to consider.
Thank you for the answer.
----- Original Message -----
From: "Frank McCown" <fm...@harding.edu>
To: <nu...@lucene.apache.org>
Sent: Saturday, February 14, 2009 4:52 PM
Subject: Re: Can't index a site
One possibility is that nrc.gov is using the sitemap protocol which
allows Google et al. to find more pages than would be found with
traditional web crawling:
http://www.nrc.gov/sitemapindex.xml
I don't think Nutch supports the sitemap protocol. It could be
Aspseek supports sitemap or that the link structure of nrc.gov has
changed or that they have added more exclusions to their robots.txt
file.
Frank
On Sat, Feb 14, 2009 at 11:31 AM, consultas <co...@qualidade.eng.br>
wrote:
> I have indexed about 600 sites on some specific subjects, including the
> nuclear area, that have resulted in about 500,000 indexed pages. One
> important "seed site" is the www.nrc.gov, but no matter what (and this,
> since version 0.3 of Nutch) I am not able to index more than about 100
> pages for this site. If you go to Google or Yahoo, they show more than
> 20000 results. In past years I have used another program, Aspseek, and
> with it I was able to index as many pages as I wanted. I have looked at
> the source code of some of the nrc pages and could not find any mention of
> robots rules.
> Any ideas about this behaviour?
>
> Thanks
Re: Can't index a site
Posted by Frank McCown <fm...@harding.edu>.
One possibility is that nrc.gov is using the sitemap protocol which
allows Google et al. to find more pages than would be found with
traditional web crawling:
http://www.nrc.gov/sitemapindex.xml
I don't think Nutch supports the sitemap protocol. It could be
Aspseek supports sitemap or that the link structure of nrc.gov has
changed or that they have added more exclusions to their robots.txt
file.
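In the meantime, one workaround is to harvest the sitemap's URLs outside of Nutch and inject them as seeds. An illustrative script (not part of Nutch) that pulls the <loc> entries out of a sitemap or sitemap-index document:

```python
import xml.etree.ElementTree as ET

# Sitemap documents use this namespace per the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_locs(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL from a sitemap or sitemap-index document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(NS + "loc")]

# A tiny index in the same shape as http://www.nrc.gov/sitemapindex.xml
# (the entries below are made up for illustration).
example = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://www.nrc.gov/sitemap1.xml</loc></sitemap>
  <sitemap><loc>http://www.nrc.gov/sitemap2.xml</loc></sitemap>
</sitemapindex>"""

print(extract_locs(example))
# → ['http://www.nrc.gov/sitemap1.xml', 'http://www.nrc.gov/sitemap2.xml']
```

Writing those URLs to a seed file and injecting them into the crawldb would let Nutch fetch pages that link-following alone never discovers.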
Frank
On Sat, Feb 14, 2009 at 11:31 AM, consultas <co...@qualidade.eng.br> wrote:
> I have indexed about 600 sites on some specific subjects, including the nuclear area, that have resulted in about 500,000 indexed pages. One important "seed site" is the www.nrc.gov, but no matter what (and this, since version 0.3 of Nutch) I am not able to index more than about 100 pages for this site. If you go to Google or Yahoo, they show more than 20000 results. In past years I have used another program, Aspseek, and with it I was able to index as many pages as I wanted. I have looked at the source code of some of the nrc pages and could not find any mention of robots rules.
> Any ideas about this behaviour?
>
> Thanks