Posted to user@nutch.apache.org by Bill Arduino <ro...@gmail.com> on 2010/08/16 23:11:07 UTC

Not getting all documents

Hi all,

I have set up Nutch 1.1 and supplied it with a list of URLs in a urls/nutch flat
file.  Each line is a directory on the same server, like so:
http://myserver.mydomain.com/docs/SC-09
http://myserver.mydomain.com/docs/SC-10

In each of these directories are anywhere from 1 to 15,000 PDF files.  The
index is dynamically generated by Apache for each directory.  In total there
are 1.2 million PDF files I need to index.

Running the command:
bin/nutch crawl urls -dir crawl -depth 5 -topN 50000

seems to work and I get data that I can search, but I know I am not getting
all of the PDFs fetched or indexed.  If I do this:

grep pdf logs/hadoop.log | grep fetching | wc -l
12386

I know there are 276,867 PDFs in the URLs I provided in the nutch file, yet
it fetched only 12,386 of them.
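
A quick way to gauge the gap is to compare, for a single directory, the number
of PDF links Apache lists against the number Nutch reports fetching.  A rough
sketch (SC-09 is just one of the seeded directories above; the patterns assume
Apache's default index markup and the usual "fetching" lines in hadoop.log):

# PDF links on the Apache-generated listing for one directory
curl -s http://myserver.mydomain.com/docs/SC-09/ | grep -o 'href="[^"]*\.pdf"' | wc -l

# PDFs from that same directory that Nutch actually fetched
grep fetching logs/hadoop.log | grep '/docs/SC-09/' | grep -c '\.pdf'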

I'm not sure about the -topN parameter, but it seems to run the same no matter
what I put in it.  I have these settings in my nutch-site.xml:

file.content.limit -1
http.content.limit -1
fetcher.threads.fetch 100
fetcher.threads.per.host 100

PDF parser is working.  I also have this in nutch-site:

<!-- plugin properties -->
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-pdf</value>

</property>

Any ideas?
Thanks!

Re: Not getting all documents

Posted by Bill Arduino <ro...@gmail.com>.
Yes I did.  Well, Jean-François sent me the answer:

*Hi,

You may want to look for the db.max.outlinks.per.page property in your
nutch-[default|site].xml configuration file. The default is 100 outlinks in
Nutch 1.0. So, if an index page contains more than 100 links to PDF files,
then only a maximum of 100 will be processed for each index page.

Also, you may need to adjust http.content.limit if your index pages are
bigger than 65536 bytes (the default value); otherwise Nutch will truncate the
content and will not process links beyond the first 65536 bytes.

I hope this will help

Jean-François Gingras*

I set db.max.outlinks.per.page to an appropriate number and all is well.
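
For reference, the corresponding nutch-site.xml entries would look roughly like
this (a sketch; -1 is commonly used to mean "no limit", and any sufficiently
large value works as well):

<!-- follow every outlink on a page instead of the default 100, so large
directory listings are processed in full -->
<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
</property>

<!-- do not truncate fetched pages, so links beyond the first 64 KB of a
listing are still seen by the parser -->
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>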

On Fri, Oct 1, 2010 at 7:41 AM, webdev1977 <we...@gmail.com> wrote:

>
> Good Morning..
>
> I was wondering if you ever found a solution to your problem?  I am facing
> the same problem.  I am missing about 300,000 fetched files.  I can't for
> the life of me figure out why it is not getting all the urls?
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Not-getting-all-documents-tp1178079p1614122.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

Re: Not getting all documents

Posted by webdev1977 <we...@gmail.com>.
Good Morning..

I was wondering if you ever found a solution to your problem?  I am facing
the same problem.  I am missing about 300,000 fetched files.  I can't for
the life of me figure out why it is not getting all the URLs.
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Not-getting-all-documents-tp1178079p1614122.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Not getting all documents

Posted by Bill Arduino <ro...@gmail.com>.
Thanks again for your time Markus,

There were no timeouts; in fact, the only things other than the regular
fetching-info messages were entries like these:

WARN  regex.RegexURLNormalizer - can't find rules for scope 'outlink', using
default
and
WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments.
Applications should implement Tool for the same.

Both of which I have read are pretty harmless and can be safely ignored.

I supplied all 428 directories in a new urls/nutch file, deleted the crawl dir,
increased the timeout value to 60000, and reran.  There were no errors.  The
stats are:

CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls:    34176
retry 0:    34176
min score:    0.0
avg score:    0.012523408
max score:    1.0
status 1 (db_unfetched):    1
status 2 (db_fetched):    33747
status 5 (db_redir_perm):    428
CrawlDb statistics: done

There should easily be over 1 million files fetched.  Is there something obvious
that I am missing?  I don't have that many changes in nutch-site:

<name>http.agent.name</name> = machine
<name>http.agent.url</name> = http://nutch.apache.org
<name>http.agent.email</name> = webamster@machine.mycompany.com
<name>http.timeout</name> = 60000
<name>http.content.limit</name> = -1
<name>http.verbose</name> = true
<name>file.content.limit</name> = -1
<name>http.agent.description</name> = Nutch
<name>fetcher.threads.fetch</name> = 100
<name>fetcher.threads.per.host</name> = 100
<name>plugin.includes</name> =
protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-pdf


Does anyone see anything wrong with these values?
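
One thing worth checking against the stats above: the 428 db_redir_perm entries
happen to match the 428 seeded directories, and Apache normally answers a
directory URL without a trailing slash with a 301 redirect to the slash form.
A rough sketch for confirming how a seed responds and which entries carry that
status (the URL is one from this thread; the grep assumes the CrawlDb dump
prints the status name next to each record):

# check one seed exactly as it appears in urls/nutch; a 301 here means the
# listing itself sits behind a redirect
curl -sI http://server.example.com/docs/DF-09 | head -n 1

# dump the CrawlDb and list the entries stuck in redirect / unfetched states
bin/nutch readdb crawl/crawldb -dump crawldb-dump
grep -B1 'db_redir_perm' crawldb-dump/part-* | head
grep -B1 'db_unfetched' crawldb-dump/part-* | head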

On Tue, Aug 17, 2010 at 8:18 AM, Markus Jelsma <ma...@buyways.nl>wrote:

> Check logs/hadoop.log for connection time out errors.
>
> On Tuesday 17 August 2010 14:07:22 Bill Arduino wrote:
> > There are 128 entries in url/nutch formatted as so:
> > http://server.example.com/docs/DF-09/
> > http://server.example.com/docs/DF-10/
> > http://server.example.com/docs/EG-02/
> > http://server.example.com/docs/EG-03/
> > http://server.example.com/docs/EG-04/
> >
> > There are 428 directories in http://server.example.com/docs  I only
> wanted
> > to start out with a small number to reduce the wait times while
> >  configuring. I am wondering if it is timing out waiting for apache to
> >  generate the index page and just taking whatever it gets before moving
> on.
> >   Maybe I should increase my wait times...
> >
> > On Tue, Aug 17, 2010 at 4:56 AM, Markus Jelsma
> <ma...@buyways.nl>wrote:
> > > Well, the CrawlDB tells us you only got ~9000 URL's in total. Perhaps
> the
> > > seeding didn't go too well? Make sure that all your Apache directory
> > > listings
> > > are injected into the CrawlDB. If you then generate, fetch, parse and
> > > update
> > > the DB, you should have all URL's in your DB.
> > >
> > > How many directory listing pages do you have anyway?
> > >
> > > On Tuesday 17 August 2010 03:52:31 Bill Arduino wrote:
> > > > Thanks for your reply, Markus.
> > > >
> > > > I ran the command several times.  Each subsequent run finished in a
> few
> > > > seconds with only this output:
> > > >
> > > > crawl started in: crawl
> > > > rootUrlDir = urls
> > > > threads = 100
> > > > depth = 5
> > > > indexer=lucene
> > > > topN = 5000
> > > > Injector: starting
> > > > Injector: crawlDb: crawl/crawldb
> > > > Injector: urlDir: urls
> > > > Injector: Converting injected urls to crawl db entries.
> > > > Injector: Merging injected urls into crawl db.
> > > > Injector: done
> > > > Generator: Selecting best-scoring urls due for fetch.
> > > > Generator: starting
> > > > Generator: filtering: true
> > > > Generator: normalizing: true
> > > > Generator: topN: 5000
> > > > Generator: jobtracker is 'local', generating exactly one partition.
> > > > Generator: 0 records selected for fetching, exiting ...
> > > > Stopping at depth=0 - no more URLs to fetch.
> > > > No URLs to fetch - check your seed list and URL filters.
> > > > crawl finished: crawl
> > > >
> > > >
> > > > The query shows all URLs fetched:
> > > > #bin/nutch readdb crawl/crawldb/ -stats
> > > > CrawlDb statistics start: crawl/crawldb/
> > > > Statistics for CrawlDb: crawl/crawldb/
> > > > TOTAL urls:     8795
> > > > retry 0:        8795
> > > > min score:      0.0090
> > > > avg score:      0.028536895
> > > > max score:      13.42
> > > > status 2 (db_fetched):  8795
> > > > CrawlDb statistics: done
> > > >
> > > > I have tried deleteing the crawl dir and starting from scratch with
> the
> > > >  same results.  I'm at a loss.  I've been over all of the values in
> > > > nutch-default.xml but I can't really see anything that seems wrong.
> > > >
> > > > On Mon, Aug 16, 2010 at 6:05 PM, Markus Jelsma
> > >
> > > <ma...@buyways.nl>wrote:
> > > > > Hi,
> > > > >
> > > > >
> > > > >
> > > > > Quite hard to debug, but lets try to make this a lucky guess: how
> > > > > many times did you crawl? If you have all the Apache directory
> > > > > listing pages injected by seeding, you'll only need one generate
> > > > > command. But, depending on different settings, you might need to
> > > > > fetch and parse multiple times.
> > > > >
> > > > >
> > > > >
> > > > > Also, you can check how many URL's are yet to be fetched by using
> the
> > > > > readdb command:
> > > > >
> > > > > # bin/nutch readdb crawl/crawldb/ -stats
> > > > >
> > > > >
> > > > >
> > > > > Cheers,
> > > > >
> > > > > -----Original message-----
> > > > > From: Bill Arduino <ro...@gmail.com>
> > > > > Sent: Mon 16-08-2010 23:11
> > > > > To: user@nutch.apache.org;
> > > > > Subject: Not getting all documents
> > > > >
> > > > > Hi all,
> > > > >
> > > > > I have setup Nutch 1.1 and supplied it a list of URLs in urls/nutch
> > >
> > > flat
> > >
> > > > > file.  Each line is a dir on the same server like so:
> > > > > http://myserver.mydomain.com/docs/SC-09
> > > > > http://myserver.mydomain.com/docs/SC-10
> > > > >
> > > > > In each of these dirs are anywhere from 1 to 15,000 PDF files.  The
> > >
> > > index
> > >
> > > > > is
> > > > > dynamically generated by apache for each dir.  In total there are
> 1.2
> > > > > million PDF files I need to index.
> > > > >
> > > > > Running the command:
> > > > > bin/nutch crawl urls -dir crawl -depth 5 -topN 50000
> > > > >
> > > > > seems to work and I get data that I can search, but I know I am not
> > > > > getting all of the PDFs fetched or indexed.  If I do this:
> > > > >
> > > > > grep pdf logs/hadoop.log | grep fetching | wc -l
> > > > > 12386
> > > > >
> > > > > I know there are 276,867  PDFs in the URLs I provided in the nutch
> > >
> > > file,
> > >
> > > > > yet
> > > > > it fetched only 12,386 of them.
> > > > >
> > > > > I'm not sure on the -topN parameter, but it seems to run the same
> no
> > > > > matter what I put in it.  I have these settings in my
> nutch-site.xml:
> > > > >
> > > > > file.content.limit -1
> > > > > http.content.limit -1
> > > > > fetcher.threads.fetch 100
> > > > > fetcher.threads.per.host 100
> > > > >
> > > > > PDF parser is working.  I also have this in nutch-site:
> > > > >
> > > > > <!-- plugin properties -->
> > > > > <property>
> > > > > <name>plugin.includes</name>
> > > > > <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-pdf</value>
> > > > >
> > > > > </property>
> > > > >
> > > > > Any ideas?
> > > > > Thanks!
> > >
> > > Markus Jelsma - Technisch Architect - Buyways BV
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
> >
>
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
>

Re: Not getting all documents

Posted by Markus Jelsma <ma...@buyways.nl>.
Check logs/hadoop.log for connection time out errors.

On Tuesday 17 August 2010 14:07:22 Bill Arduino wrote:
> There are 128 entries in url/nutch formatted as so:
> http://server.example.com/docs/DF-09/
> http://server.example.com/docs/DF-10/
> http://server.example.com/docs/EG-02/
> http://server.example.com/docs/EG-03/
> http://server.example.com/docs/EG-04/
> 
> There are 428 directories in http://server.example.com/docs  I only wanted
> to start out with a small number to reduce the wait times while
>  configuring. I am wondering if it is timing out waiting for apache to
>  generate the index page and just taking whatever it gets before moving on.
>   Maybe I should increase my wait times...
> 
> On Tue, Aug 17, 2010 at 4:56 AM, Markus Jelsma 
<ma...@buyways.nl>wrote:
> > Well, the CrawlDB tells us you only got ~9000 URL's in total. Perhaps the
> > seeding didn't go too well? Make sure that all your Apache directory
> > listings
> > are injected into the CrawlDB. If you then generate, fetch, parse and
> > update
> > the DB, you should have all URL's in your DB.
> >
> > How many directory listing pages do you have anyway?
> >
> > On Tuesday 17 August 2010 03:52:31 Bill Arduino wrote:
> > > Thanks for your reply, Markus.
> > >
> > > I ran the command several times.  Each subsequent run finished in a few
> > > seconds with only this output:
> > >
> > > crawl started in: crawl
> > > rootUrlDir = urls
> > > threads = 100
> > > depth = 5
> > > indexer=lucene
> > > topN = 5000
> > > Injector: starting
> > > Injector: crawlDb: crawl/crawldb
> > > Injector: urlDir: urls
> > > Injector: Converting injected urls to crawl db entries.
> > > Injector: Merging injected urls into crawl db.
> > > Injector: done
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: starting
> > > Generator: filtering: true
> > > Generator: normalizing: true
> > > Generator: topN: 5000
> > > Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: 0 records selected for fetching, exiting ...
> > > Stopping at depth=0 - no more URLs to fetch.
> > > No URLs to fetch - check your seed list and URL filters.
> > > crawl finished: crawl
> > >
> > >
> > > The query shows all URLs fetched:
> > > #bin/nutch readdb crawl/crawldb/ -stats
> > > CrawlDb statistics start: crawl/crawldb/
> > > Statistics for CrawlDb: crawl/crawldb/
> > > TOTAL urls:     8795
> > > retry 0:        8795
> > > min score:      0.0090
> > > avg score:      0.028536895
> > > max score:      13.42
> > > status 2 (db_fetched):  8795
> > > CrawlDb statistics: done
> > >
> > > I have tried deleteing the crawl dir and starting from scratch with the
> > >  same results.  I'm at a loss.  I've been over all of the values in
> > > nutch-default.xml but I can't really see anything that seems wrong.
> > >
> > > On Mon, Aug 16, 2010 at 6:05 PM, Markus Jelsma
> >
> > <ma...@buyways.nl>wrote:
> > > > Hi,
> > > >
> > > >
> > > >
> > > > Quite hard to debug, but lets try to make this a lucky guess: how
> > > > many times did you crawl? If you have all the Apache directory
> > > > listing pages injected by seeding, you'll only need one generate
> > > > command. But, depending on different settings, you might need to
> > > > fetch and parse multiple times.
> > > >
> > > >
> > > >
> > > > Also, you can check how many URL's are yet to be fetched by using the
> > > > readdb command:
> > > >
> > > > # bin/nutch readdb crawl/crawldb/ -stats
> > > >
> > > >
> > > >
> > > > Cheers,
> > > >
> > > > -----Original message-----
> > > > From: Bill Arduino <ro...@gmail.com>
> > > > Sent: Mon 16-08-2010 23:11
> > > > To: user@nutch.apache.org;
> > > > Subject: Not getting all documents
> > > >
> > > > Hi all,
> > > >
> > > > I have setup Nutch 1.1 and supplied it a list of URLs in urls/nutch
> >
> > flat
> >
> > > > file.  Each line is a dir on the same server like so:
> > > > http://myserver.mydomain.com/docs/SC-09
> > > > http://myserver.mydomain.com/docs/SC-10
> > > >
> > > > In each of these dirs are anywhere from 1 to 15,000 PDF files.  The
> >
> > index
> >
> > > > is
> > > > dynamically generated by apache for each dir.  In total there are 1.2
> > > > million PDF files I need to index.
> > > >
> > > > Running the command:
> > > > bin/nutch crawl urls -dir crawl -depth 5 -topN 50000
> > > >
> > > > seems to work and I get data that I can search, but I know I am not
> > > > getting all of the PDFs fetched or indexed.  If I do this:
> > > >
> > > > grep pdf logs/hadoop.log | grep fetching | wc -l
> > > > 12386
> > > >
> > > > I know there are 276,867  PDFs in the URLs I provided in the nutch
> >
> > file,
> >
> > > > yet
> > > > it fetched only 12,386 of them.
> > > >
> > > > I'm not sure on the -topN parameter, but it seems to run the same no
> > > > matter what I put in it.  I have these settings in my nutch-site.xml:
> > > >
> > > > file.content.limit -1
> > > > http.content.limit -1
> > > > fetcher.threads.fetch 100
> > > > fetcher.threads.per.host 100
> > > >
> > > > PDF parser is working.  I also have this in nutch-site:
> > > >
> > > > <!-- plugin properties -->
> > > > <property>
> > > > <name>plugin.includes</name>
> > > > <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-pdf</value>
> > > >
> > > > </property>
> > > >
> > > > Any ideas?
> > > > Thanks!
> >
> > Markus Jelsma - Technisch Architect - Buyways BV
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Not getting all documents

Posted by Bill Arduino <ro...@gmail.com>.
There are 128 entries in urls/nutch, formatted like so:
http://server.example.com/docs/DF-09/
http://server.example.com/docs/DF-10/
http://server.example.com/docs/EG-02/
http://server.example.com/docs/EG-03/
http://server.example.com/docs/EG-04/

There are 428 directories under http://server.example.com/docs in total.  I only wanted
to start out with a small number to reduce the wait times while configuring.
I am wondering if it is timing out waiting for Apache to generate the index
page and just taking whatever it gets before moving on.  Maybe I should
increase my wait times...

On Tue, Aug 17, 2010 at 4:56 AM, Markus Jelsma <ma...@buyways.nl>wrote:

> Well, the CrawlDB tells us you only got ~9000 URL's in total. Perhaps the
> seeding didn't go too well? Make sure that all your Apache directory
> listings
> are injected into the CrawlDB. If you then generate, fetch, parse and
> update
> the DB, you should have all URL's in your DB.
>
> How many directory listing pages do you have anyway?
>
>
> On Tuesday 17 August 2010 03:52:31 Bill Arduino wrote:
> > Thanks for your reply, Markus.
> >
> > I ran the command several times.  Each subsequent run finished in a few
> > seconds with only this output:
> >
> > crawl started in: crawl
> > rootUrlDir = urls
> > threads = 100
> > depth = 5
> > indexer=lucene
> > topN = 5000
> > Injector: starting
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: topN: 5000
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=0 - no more URLs to fetch.
> > No URLs to fetch - check your seed list and URL filters.
> > crawl finished: crawl
> >
> >
> > The query shows all URLs fetched:
> > #bin/nutch readdb crawl/crawldb/ -stats
> > CrawlDb statistics start: crawl/crawldb/
> > Statistics for CrawlDb: crawl/crawldb/
> > TOTAL urls:     8795
> > retry 0:        8795
> > min score:      0.0090
> > avg score:      0.028536895
> > max score:      13.42
> > status 2 (db_fetched):  8795
> > CrawlDb statistics: done
> >
> > I have tried deleteing the crawl dir and starting from scratch with the
> >  same results.  I'm at a loss.  I've been over all of the values in
> > nutch-default.xml but I can't really see anything that seems wrong.
> >
> > On Mon, Aug 16, 2010 at 6:05 PM, Markus Jelsma
> <ma...@buyways.nl>wrote:
> > > Hi,
> > >
> > >
> > >
> > > Quite hard to debug, but lets try to make this a lucky guess: how many
> > > times did you crawl? If you have all the Apache directory listing pages
> > > injected by seeding, you'll only need one generate command. But,
> > > depending on different settings, you might need to fetch and parse
> > > multiple times.
> > >
> > >
> > >
> > > Also, you can check how many URL's are yet to be fetched by using the
> > > readdb command:
> > >
> > > # bin/nutch readdb crawl/crawldb/ -stats
> > >
> > >
> > >
> > > Cheers,
> > >
> > > -----Original message-----
> > > From: Bill Arduino <ro...@gmail.com>
> > > Sent: Mon 16-08-2010 23:11
> > > To: user@nutch.apache.org;
> > > Subject: Not getting all documents
> > >
> > > Hi all,
> > >
> > > I have setup Nutch 1.1 and supplied it a list of URLs in urls/nutch
> flat
> > > file.  Each line is a dir on the same server like so:
> > > http://myserver.mydomain.com/docs/SC-09
> > > http://myserver.mydomain.com/docs/SC-10
> > >
> > > In each of these dirs are anywhere from 1 to 15,000 PDF files.  The
> index
> > > is
> > > dynamically generated by apache for each dir.  In total there are 1.2
> > > million PDF files I need to index.
> > >
> > > Running the command:
> > > bin/nutch crawl urls -dir crawl -depth 5 -topN 50000
> > >
> > > seems to work and I get data that I can search, but I know I am not
> > > getting all of the PDFs fetched or indexed.  If I do this:
> > >
> > > grep pdf logs/hadoop.log | grep fetching | wc -l
> > > 12386
> > >
> > > I know there are 276,867  PDFs in the URLs I provided in the nutch
> file,
> > > yet
> > > it fetched only 12,386 of them.
> > >
> > > I'm not sure on the -topN parameter, but it seems to run the same no
> > > matter what I put in it.  I have these settings in my nutch-site.xml:
> > >
> > > file.content.limit -1
> > > http.content.limit -1
> > > fetcher.threads.fetch 100
> > > fetcher.threads.per.host 100
> > >
> > > PDF parser is working.  I also have this in nutch-site:
> > >
> > > <!-- plugin properties -->
> > > <property>
> > > <name>plugin.includes</name>
> > >
> > > <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-pdf</value>
> > >
> > > </property>
> > >
> > > Any ideas?
> > > Thanks!
> >
>
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
>

Re: Not getting all documents

Posted by Markus Jelsma <ma...@buyways.nl>.
Well, the CrawlDB tells us you only got ~9000 URLs in total. Perhaps the
seeding didn't go too well? Make sure that all your Apache directory listings
are injected into the CrawlDB. If you then generate, fetch, parse and update
the DB, you should have all URLs in your DB.

How many directory listing pages do you have anyway?
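
A quick way to verify the seeding on its own is to inject the seed list into an
empty CrawlDb and look at the stats before any generate/fetch round; TOTAL urls
should then equal the number of directory-listing pages in the seed file.  A
sketch, using the paths from this thread:

# start from an empty crawl/ directory so the count reflects only the seeds
bin/nutch inject crawl/crawldb urls
bin/nutch readdb crawl/crawldb -stats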


On Tuesday 17 August 2010 03:52:31 Bill Arduino wrote:
> Thanks for your reply, Markus.
> 
> I ran the command several times.  Each subsequent run finished in a few
> seconds with only this output:
> 
> crawl started in: crawl
> rootUrlDir = urls
> threads = 100
> depth = 5
> indexer=lucene
> topN = 5000
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5000
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=0 - no more URLs to fetch.
> No URLs to fetch - check your seed list and URL filters.
> crawl finished: crawl
> 
> 
> The query shows all URLs fetched:
> #bin/nutch readdb crawl/crawldb/ -stats
> CrawlDb statistics start: crawl/crawldb/
> Statistics for CrawlDb: crawl/crawldb/
> TOTAL urls:     8795
> retry 0:        8795
> min score:      0.0090
> avg score:      0.028536895
> max score:      13.42
> status 2 (db_fetched):  8795
> CrawlDb statistics: done
> 
> I have tried deleteing the crawl dir and starting from scratch with the
>  same results.  I'm at a loss.  I've been over all of the values in
> nutch-default.xml but I can't really see anything that seems wrong.
> 
> On Mon, Aug 16, 2010 at 6:05 PM, Markus Jelsma 
<ma...@buyways.nl>wrote:
> > Hi,
> >
> >
> >
> > Quite hard to debug, but lets try to make this a lucky guess: how many
> > times did you crawl? If you have all the Apache directory listing pages
> > injected by seeding, you'll only need one generate command. But,
> > depending on different settings, you might need to fetch and parse
> > multiple times.
> >
> >
> >
> > Also, you can check how many URL's are yet to be fetched by using the
> > readdb command:
> >
> > # bin/nutch readdb crawl/crawldb/ -stats
> >
> >
> >
> > Cheers,
> >
> > -----Original message-----
> > From: Bill Arduino <ro...@gmail.com>
> > Sent: Mon 16-08-2010 23:11
> > To: user@nutch.apache.org;
> > Subject: Not getting all documents
> >
> > Hi all,
> >
> > I have setup Nutch 1.1 and supplied it a list of URLs in urls/nutch flat
> > file.  Each line is a dir on the same server like so:
> > http://myserver.mydomain.com/docs/SC-09
> > http://myserver.mydomain.com/docs/SC-10
> >
> > In each of these dirs are anywhere from 1 to 15,000 PDF files.  The index
> > is
> > dynamically generated by apache for each dir.  In total there are 1.2
> > million PDF files I need to index.
> >
> > Running the command:
> > bin/nutch crawl urls -dir crawl -depth 5 -topN 50000
> >
> > seems to work and I get data that I can search, but I know I am not
> > getting all of the PDFs fetched or indexed.  If I do this:
> >
> > grep pdf logs/hadoop.log | grep fetching | wc -l
> > 12386
> >
> > I know there are 276,867  PDFs in the URLs I provided in the nutch file,
> > yet
> > it fetched only 12,386 of them.
> >
> > I'm not sure on the -topN parameter, but it seems to run the same no
> > matter what I put in it.  I have these settings in my nutch-site.xml:
> >
> > file.content.limit -1
> > http.content.limit -1
> > fetcher.threads.fetch 100
> > fetcher.threads.per.host 100
> >
> > PDF parser is working.  I also have this in nutch-site:
> >
> > <!-- plugin properties -->
> > <property>
> > <name>plugin.includes</name>
> >
> > <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-pdf</value>
> >
> > </property>
> >
> > Any ideas?
> > Thanks!
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Not getting all documents

Posted by Bill Arduino <ro...@gmail.com>.
Thanks for your reply, Markus.

I ran the command several times.  Each subsequent run finished in a few
seconds with only this output:

crawl started in: crawl
rootUrlDir = urls
threads = 100
depth = 5
indexer=lucene
topN = 5000
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5000
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl


The query shows all URLs fetched:
#bin/nutch readdb crawl/crawldb/ -stats
CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls:     8795
retry 0:        8795
min score:      0.0090
avg score:      0.028536895
max score:      13.42
status 2 (db_fetched):  8795
CrawlDb statistics: done

I have tried deleting the crawl dir and starting from scratch with the same
results.  I'm at a loss.  I've been over all of the values in
nutch-default.xml but I can't really see anything that seems wrong.

On Mon, Aug 16, 2010 at 6:05 PM, Markus Jelsma <ma...@buyways.nl>wrote:

> Hi,
>
>
>
> Quite hard to debug, but lets try to make this a lucky guess: how many
> times did you crawl? If you have all the Apache directory listing pages
> injected by seeding, you'll only need one generate command. But, depending
> on different settings, you might need to fetch and parse multiple times.
>
>
>
> Also, you can check how many URL's are yet to be fetched by using the
> readdb command:
>
> # bin/nutch readdb crawl/crawldb/ -stats
>
>
>
> Cheers,
>
> -----Original message-----
> From: Bill Arduino <ro...@gmail.com>
> Sent: Mon 16-08-2010 23:11
> To: user@nutch.apache.org;
> Subject: Not getting all documents
>
> Hi all,
>
> I have setup Nutch 1.1 and supplied it a list of URLs in urls/nutch flat
> file.  Each line is a dir on the same server like so:
> http://myserver.mydomain.com/docs/SC-09
> http://myserver.mydomain.com/docs/SC-10
>
> In each of these dirs are anywhere from 1 to 15,000 PDF files.  The index
> is
> dynamically generated by apache for each dir.  In total there are 1.2
> million PDF files I need to index.
>
> Running the command:
> bin/nutch crawl urls -dir crawl -depth 5 -topN 50000
>
> seems to work and I get data that I can search, but I know I am not getting
> all of the PDFs fetched or indexed.  If I do this:
>
> grep pdf logs/hadoop.log | grep fetching | wc -l
> 12386
>
> I know there are 276,867  PDFs in the URLs I provided in the nutch file,
> yet
> it fetched only 12,386 of them.
>
> I'm not sure on the -topN parameter, but it seems to run the same no matter
> what I put in it.  I have these settings in my nutch-site.xml:
>
> file.content.limit -1
> http.content.limit -1
> fetcher.threads.fetch 100
> fetcher.threads.per.host 100
>
> PDF parser is working.  I also have this in nutch-site:
>
> <!-- plugin properties -->
> <property>
> <name>plugin.includes</name>
>
> <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-pdf</value>
>
> </property>
>
> Any ideas?
> Thanks!
>

RE: Not getting all documents

Posted by Markus Jelsma <ma...@buyways.nl>.
Hi,

 

Quite hard to debug, but let's try to make this a lucky guess: how many times did you crawl? If you have all the Apache directory-listing pages injected by seeding, you'll only need one generate command. But, depending on different settings, you might need to fetch and parse multiple times.

 

Also, you can check how many URLs are yet to be fetched by using the readdb command:

# bin/nutch readdb crawl/crawldb/ -stats
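
For reference, a sketch of the step-by-step equivalent of the one-shot crawl
command, along the lines described above (Nutch 1.x command names; the -topN
value and the segment handling are illustrative):

# seed the CrawlDb with the directory-listing URLs
bin/nutch inject crawl/crawldb urls

# one round: select URLs, fetch, parse, and fold the results back in;
# repeat this block until -stats shows no db_unfetched entries left
bin/nutch generate crawl/crawldb crawl/segments -topN 50000
SEGMENT=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $SEGMENT
# the separate parse step applies when fetcher.parse is false; skip it otherwise
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT

bin/nutch readdb crawl/crawldb -stats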

 

Cheers,
 
-----Original message-----
From: Bill Arduino <ro...@gmail.com>
Sent: Mon 16-08-2010 23:11
To: user@nutch.apache.org; 
Subject: Not getting all documents

Hi all,

I have setup Nutch 1.1 and supplied it a list of URLs in urls/nutch flat
file.  Each line is a dir on the same server like so:
http://myserver.mydomain.com/docs/SC-09
http://myserver.mydomain.com/docs/SC-10

In each of these dirs are anywhere from 1 to 15,000 PDF files.  The index is
dynamically generated by apache for each dir.  In total there are 1.2
million PDF files I need to index.

Running the command:
bin/nutch crawl urls -dir crawl -depth 5 -topN 50000

seems to work and I get data that I can search, but I know I am not getting
all of the PDFs fetched or indexed.  If I do this:

grep pdf logs/hadoop.log | grep fetching | wc -l
12386

I know there are 276,867  PDFs in the URLs I provided in the nutch file, yet
it fetched only 12,386 of them.

I'm not sure on the -topN parameter, but it seems to run the same no matter
what I put in it.  I have these settings in my nutch-site.xml:

file.content.limit -1
http.content.limit -1
fetcher.threads.fetch 100
fetcher.threads.per.host 100

PDF parser is working.  I also have this in nutch-site:

<!-- plugin properties -->
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-pdf</value>

</property>

Any ideas?
Thanks!