Posted to user@nutch.apache.org by Leo Subscriptions <ll...@zudiewiener.com> on 2011/07/16 02:28:24 UTC

skipping invalid segments nutch 1.3

I'm running Nutch 1.3 on 64-bit Ubuntu; following are the commands and
relevant output.

----------------------------------
llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
Injector: starting at 2011-07-15 18:32:10
Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
Injector: urlDir: /home/llist/nutchData/seed
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-15 18:32:13, elapsed: 00:00:02
=================
llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments
Generator: starting at 2011-07-15 18:32:41
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /home/llist/nutchData/crawl/segments/20110715183244
Generator: finished at 2011-07-15 18:32:45, elapsed: 00:00:03
==================
llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch fetch /home/llist/nutchData/crawl/segments/20110715183244
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2011-07-15 18:34:55
Fetcher: segment: /home/llist/nutchData/crawl/segments/20110715183244
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://www.seek.com.au/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-07-15 18:34:59, elapsed: 00:00:03
=================
llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110715183244
CrawlDb update: starting at 2011-07-15 18:36:00
CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
CrawlDb update: segments:
[file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch,
file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate,
file:/home/llist/nutchData/crawl/segments/20110715183244/content]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
- skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
- skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
- skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110715183244/content
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-07-15 18:36:01, elapsed: 00:00:01
-----------------------------------

Appreciate any hints on what I'm missing.
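
A side note on the Fetcher warning in the log above ("Your 'http.agent.name' value should be listed first in 'http.robots.agents' property"): it can be addressed by also declaring the agent name in the http.robots.agents property in nutch-site.xml. A minimal sketch, assuming the agent name used later in this thread:

<property>
 <name>http.robots.agents</name>
 <value>listers spider,*</value>
 <description>Agent strings checked against robots.txt, comma-separated, in decreasing order of precedence; the http.agent.name value should come first.</description>
</property>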


Re: skipping invalid segments nutch 1.3

Posted by Leo Subscriptions <ll...@zudiewiener.com>.
Hi Sebastian,

I think the problem is with the fetch not returning any results. I
checked your suggestion, but it did not work.

Cheers,

Leo




Re: skipping invalid segments nutch 1.3

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Leo, hi Lewis,

> From the times both the fetching and parsing took, I suspect that maybe
> Nutch didn't actually fetch the URL,

This may be the reason. "Empty" segments may break some of the crawler steps.

But if I'm not wrong, it looks like the updatedb command is not quite
correct:

 > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110721122519
 > CrawlDb update: starting at 2011-07-21 12:28:03
 > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
 > CrawlDb update: segments:
 > [file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text,
 > file:/home/llist/nutchData/crawl/segments/20110721122519/content,
 > file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse,
 > file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data,
 > file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch,
 > file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate]
 > CrawlDb update: additions allowed: true

As for other commands reading segments, there are two ways to add segments
as arguments: 1) all segments enumerated, or 2) via -dir the parent directory
of all segments. See:

% $NUTCH_HOME/bin/nutch updatedb
Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
         crawldb CrawlDb to update
         -dir segments   parent directory containing all segments to update from
         seg1 seg2 ...   list of segment names to update from

Try your updatedb command without -dir, it should work.
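
For example, either of these forms should work (a sketch using the paths from the log above; note that -dir must point at the parent segments directory, not at a single segment):

% bin/nutch updatedb /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments/20110721122519
% bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments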

Sebastian

Re: skipping invalid segments nutch 1.3

Posted by Leo Subscriptions <ll...@zudiewiener.com>.
Hi Lewis,

Following are the things I tried and the relevant source/logs:


1. Ran 'crawl' without the ending "/" in the URL http://www.seek.com.au ;
result OK.
2. Ran 'crawl' with the ending "/" in the URL http://www.seek.com.au/ ;
result OK.
3. Had a look at regex-urlfilter.txt; the relevant entries are as follows:

----------- regex-urlfilter.txt -----------------
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
----------------------------------------------------------
4. I think you are correct in that fetch does not actually fetch anything.
Following are the relevant sections from hadoop.log: first the log when
'crawl' was running, then the log for 'inject, generate, fetch'. The rest
of the log up to the fetch is pretty much identical. One thing I did notice
is that the QueueFeeder returns 10 records for 'crawl' but only 1 record
for 'fetch' (see the readseg check sketched after the logs).

--------- hadoop.log for 'crawl' -----------

2011-07-22 10:02:27,226 INFO  crawl.Generator - Generator: finished at 2011-07-22 10:02:27, elapsed: 00:00:03
2011-07-22 10:02:27,227 WARN  fetcher.Fetcher - Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
2011-07-22 10:02:27,228 INFO  fetcher.Fetcher - Fetcher: starting at 2011-07-22 10:02:27
2011-07-22 10:02:27,228 INFO  fetcher.Fetcher - Fetcher: segment: /home/llist/nutchData/crawl/segments/20110722100225
2011-07-22 10:02:27,910 INFO  fetcher.Fetcher - Fetcher: threads: 10
2011-07-22 10:02:27,918 INFO  fetcher.Fetcher - QueueFeeder finished: total 10 records + hit by time limit :0
2011-07-22 10:02:27,926 INFO  fetcher.Fetcher - fetching http://www.seek.com.au/sales-jobs
2011-07-22 10:02:27,940 INFO  http.Http - http.proxy.host = null
2011-07-22 10:02:27,940 INFO  http.Http - http.proxy.port = 8080
2011-07-22 10:02:27,940 INFO  http.Http - http.timeout = 10000
2011-07-22 10:02:27,940 INFO  http.Http - http.content.limit = 65536
2011-07-22 10:02:27,940 INFO  http.Http - http.agent = listers spider/Nutch-1.3
2011-07-22 10:02:27,940 INFO  http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2011-07-22 10:02:28,929 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=9
2011-07-22 10:02:29,929 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=9
2011-07-22 10:02:30,930 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:31,930 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:32,931 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:33,931 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:34,932 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=9
2011-07-22 10:02:35,091 INFO  fetcher.Fetcher - fetching http://www.seek.com.au/mining-resources-energy-jobs/
2011-07-22 10:02:35,933 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:36,933 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:37,933 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:38,934 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:39,934 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=8
2011-07-22 10:02:40,363 INFO  fetcher.Fetcher - fetching http://www.seek.com.au/marketing-communications-jobs/
2011-07-22 10:02:40,934 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=7


etc.

-----------------------------------------------------------------------------------------

------- hadoop.log for 'fetch' -------------------------------------------
2011-07-22 10:14:37,645 INFO  crawl.Generator - Generator: finished at 2011-07-22 10:14:37, elapsed: 00:00:03
2011-07-22 10:16:46,088 WARN  fetcher.Fetcher - Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
2011-07-22 10:16:46,089 INFO  fetcher.Fetcher - Fetcher: starting at 2011-07-22 10:16:46
2011-07-22 10:16:46,089 INFO  fetcher.Fetcher - Fetcher: segment: /home/llist/nutchData/crawl/segments/20110722101436
2011-07-22 10:16:46,720 INFO  fetcher.Fetcher - Fetcher: threads: 10
2011-07-22 10:16:46,741 INFO  plugin.PluginRepository - Plugins: looking in: /usr/share/nutch/runtime/local/plugins
2011-07-22 10:16:46,746 INFO  fetcher.Fetcher - QueueFeeder finished: total 1 records + hit by time limit :0
2011-07-22 10:16:46,815 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]

---------------------------------------------------------------------------------------
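
To check whether that fetch actually wrote anything into the segment, the readseg and readdb tools can be used, as Lewis suggested (a sketch, reusing the segment name from the log above; a FETCHED count of zero in the readseg listing would confirm that nothing was fetched):

% bin/nutch readseg -list /home/llist/nutchData/crawl/segments/20110722101436
% bin/nutch readdb /home/llist/nutchData/crawl/crawldb -stats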

Cheers,

Leo


On Fri, 2011-07-22 at 09:51 +1000, Leo Subscriptions wrote:

> Hi Lewis,
> 
> Will try your suggestion shortly, but am still puzzled why the crawl
> command works. Isn't it using the same filter, etc?
> 
> Cheers,
> 
> Leo
> 
> On Thu, 2011-07-21 at 20:55 +0100, lewis john mcgibbney wrote:
> 
> > Hi Leo,
> > 
> > From the times both the fetching and parsing took, I suspect that
> > maybe Nutch didn't actually fetch the URL, however this may not be the
> > case as I have nothing to benchmark it on. Unfortunately, on this
> > occasion the URL http://wiki.apache.org actually redirects to
> > http://wiki.apache.org/general/ so I'm going to post my log output
> > from the last URL you specified in an attempt to clear this one up. The
> > following confirms that you are accurate with your observations that
> > not only does this produce invalid segments but also nothing is
> > fetched in the process.
> > 
> > Therefore the reason that we are getting the  - skipping invalid
> > segment message is that we are not actually fetching any content. My
> > initial thoughts were that your urlfilters were not set properly and I
> > think that this is part of the case.
> > 
> > Please follow the syntax very carefully and it will work perfectly for
> > you as follows
> > 
> > regex-urlfilter.txt
> > --------------------------
> > 
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > 
> > # crawl URLs in the following domains.
> > +^http://([a-z0-9]*\.)*seek.com.au/
> > 
> > # accept anything else
> > #+.
> > 
> > seed file
> > ----------------------
> > http://www.seek.com.au
> > 
> > It sounds really trivial but I think that the trailing '/' in your
> > seed file may have been making all of the difference.
> > 
> > Please try, test with readdb and readseg and comment back.
> > 
> > Sorry for the delayed posts on this one I have not had much time to
> > get to it. Hope all goes to plan. Evidence can be seen below
> > 
> > lewis@lewis-01:~/ASF/branch-1.4/runtime/local$ bin/nutch readdb crawldb -stats
> > CrawlDb statistics start: crawldb
> > Statistics for CrawlDb: crawldb
> > TOTAL urls:    48
> > retry 0:    48
> > min score:    0.017
> > avg score:    0.041125
> > max score:    1.175
> > status 1 (db_unfetched):    47
> > status 2 (db_fetched):    1
> > CrawlDb statistics: done
> > 
> > On Thu, Jul 21, 2011 at 3:30 AM, Leo Subscriptions <ll...@zudiewiener.com> wrote:
> >
> > > Following are the suggested commands and the result as suggested.
> > > I left the redirect as 0 as 'crawl' works without any issues. The
> > > problem only occurs when running the individual commands.
> > >
> > > ------- nutch-site.xml -------------------------------
> > > <?xml version="1.0"?>
> > > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> > >
> > > <!-- Put site-specific property overrides in this file. -->
> > >
> > > <configuration>
> > >
> > > <property>
> > >  <name>http.agent.name</name>
> > >  <value>listers spider</value>
> > > </property>
> > >
> > > <property>
> > >  <name>fetcher.verbose</name>
> > >  <value>true</value>
> > >  <description>If true, fetcher will log more verbosely.</description>
> > > </property>
> > >
> > > <property>
> > >  <name>http.verbose</name>
> > >  <value>true</value>
> > >  <description>If true, HTTP will log more verbosely.</description>
> > > </property>
> > >
> > > </configuration>
> > > ---------------------------------------------------------------
> > >
> > > ------ Individual commands and results -------------------------
> > >
> > > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed/urls
> > > Injector: starting at 2011-07-21 12:24:52
> > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > Injector: urlDir: /home/llist/nutchData/seed/urls
> > > Injector: Converting injected urls to crawl db entries.
> > > Injector: Merging injected urls into crawl db.
> > > Injector: finished at 2011-07-21 12:24:55, elapsed: 00:00:02
> > >
> > > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments -topN 100
> > > Generator: starting at 2011-07-21 12:25:16
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: filtering: true
> > > Generator: normalizing: true
> > > Generator: topN: 100
> > > Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: Partitioning selected urls for politeness.
> > > Generator: segment: /home/llist/nutchData/crawl/segments/20110721122519
> > > Generator: finished at 2011-07-21 12:25:20, elapsed: 00:00:03
> > >
> > > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch fetch /home/llist/nutchData/crawl/segments/20110721122519
> > > Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> > > Fetcher: starting at 2011-07-21 12:26:36
> > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110721122519
> > > Fetcher: threads: 10
> > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > -finishing thread FetcherThread, activeThreads=1
> > > fetching http://wiki.apache.org/
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > -finishing thread FetcherThread, activeThreads=0
> > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=0
> > > Fetcher: finished at 2011-07-21 12:26:40, elapsed: 00:00:04
> > >
> > > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch parse /home/llist/nutchData/crawl/segments/20110721122519
> > > ParseSegment: starting at 2011-07-21 12:27:22
> > > ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110721122519
> > > ParseSegment: finished at 2011-07-21 12:27:24, elapsed: 00:00:01
> > >
> > > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110721122519
> > > CrawlDb update: starting at 2011-07-21 12:28:03
> > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > CrawlDb update: segments:
> > > [file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text,
> > > file:/home/llist/nutchData/crawl/segments/20110721122519/content,
> > > file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse,
> > > file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data,
> > > file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch,
> > > file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate]
> > > CrawlDb update: additions allowed: true
> > > CrawlDb update: URL normalizing: false
> > > CrawlDb update: URL filtering: false
> > >  - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text
> > >  - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/content
> > >  - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse
> > >  - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data
> > >  - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch
> > >  - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate
> > > CrawlDb update: Merging segment data into db.
> > > CrawlDb update: finished at 2011-07-21 12:28:04, elapsed: 00:00:01
> > > ------------------------------------------------------------------------------------
> >
> > > On Wed, 2011-07-20 at 21:58 +0100, lewis john mcgibbney wrote:
> > >
> > > > There is no documentation for individual commands used to run a Nutch 1.3
> > > > crawl so I'm not sure where there has been a mislead. In the instance that
> > > > this was required I would direct newer users to the legacy documentation
> > > > for the time being.
> > > >
> > > > My comment to Leo was to understand whether he managed to correct the
> > > > invalid segments problem.
> > > >
> > > > Leo, if this still persists may I ask you to try again, I will do the
> > > > same and will be happy to provide feedback
> > > >
> > > > May I suggest the following
> > > >
> > > > use the following commands
> > > >
> > > > inject
> > > > generate
> > > > fetch
> > > > parse
> > > > updatedb
> > > >
> > > > At this stage we should be able to ascertain if something is correct and
> > > > hopefully debug. May I add the following... please make the following
> > > > additions to nutch-site.
> > > >
> > > > fetcher verbose - true
> > > > http verbose - true
> > > > check for redirects and set accordingly
> > > >
> > > > On Wed, Jul 20, 2011 at 1:39 PM, Julien Nioche <lists.digitalpebble@gmail.com> wrote:
> > > >
> > > > > The wiki can be edited and you are welcome to suggest improvements
> > > > > if there is something missing
> > > > >
> > > > > On 20 July 2011 13:31, Cam Bazz <ca...@gmail.com> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I think there is a mislead in the documentation, it does not tell
> > > > > > us that we have to parse.
> > > > > >
> > > > > > On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche <li...@gmail.com> wrote:
> > > > > > > Haven't you forgotten to call parse?
> > > > > > >
> > > > > > > On 19 July 2011 23:40, Leo Subscriptions <ll...@zudiewiener.com> wrote:
> > > > > > >
> > > > > > > > Hi Lewis,
> > > > > > > >
> > > > > > > > You are correct about the last post not showing any errors. I
> > > > > > > > just wanted to show that I don't get any errors if I use 'crawl'
> > > > > > > > and to prove that I do not have any faults in the conf files or
> > > > > > > > the directories.
> > > > > > > >
> > > > > > > > I still get the errors if I use the individual commands inject,
> > > > > > > > generate, fetch....
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Leo
> > > > > > > >
> > > > > > > > On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote:
> > > > > > > >
> > > > > > > > > Hi Leo
> > > > > > > > >
> > > > > > > > > Did you resolve?
> > > > > > > > >
> > > > > > > > > Your second log data doesn't appear to show any errors,
> > > > > > > > > however the problem you specify is one I have witnessed
> > > > > > > > > myself a while ago. Since you posted have you been able to
> > > > > > > > > replicate... or resolve?
> > > > > > > > >
> > > > > > > > > On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions <ll...@zudiewiener.com> wrote:
> > > > > > > > >
> > > > > > > > > > I've used crawl to ensure config is correct and I don't get
> > > > > > > > > > any errors, so I must be doing something wrong with the
> > > > > > > > > > individual steps, but can't see what.
> > > > > > > > > >
> > > > > > > > > > --------------------------------------------------------------------
> > > > > > > > > > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch crawl /home/llist/nutchData/seed/urls -dir /home/llist/nutchData/crawl -depth 3 -topN 5
> > > > > > > > > > solrUrl is not set, indexing will be skipped...
> > > > > > > > > > crawl started in: /home/llist/nutchData/crawl
> > > > > > > > > > rootUrlDir = /home/llist/nutchData/seed/urls
> > > > > > > > > > threads = 10
> > > > > > > > > > depth = 3
> > > > > > > > > > solrUrl=null
> > > > > > > > > > topN = 5
> > > > > > > > > > Injector: starting at 2011-07-17 09:31:19
> > > > > > > > > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > > > > > > > > Injector: urlDir: /home/llist/nutchData/seed/urls
> > > > > > > > > > Injector: Converting injected urls to crawl db entries.
> > > > > > > > > > Injector: Merging injected urls into crawl db.
> > > > > > > > > > Injector: finished at 2011-07-17 09:31:22, elapsed: 00:00:02
> > > > > > > > > > Generator: starting at 2011-07-17 09:31:22
> > > > > > > > > > Generator: Selecting best-scoring urls due for fetch.
> > > > > > > > > > Generator: filtering: true
> > > > > > > > > > Generator: normalizing: true
> > > > > > > > > > Generator: topN: 5
> > > > > > > > > > Generator: jobtracker is 'local', generating exactly one partition.
> > > > > > > > > > Generator: Partitioning selected urls for politeness.
> > > > > > > > > > Generator: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > > > > > > > > Generator: finished at 2011-07-17 09:31:26, elapsed: 00:00:04
> > > > > > > > > > Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
> > > > > > > > > > Fetcher: starting at 2011-07-17 09:31:26
> > > > > > > > > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > > > > > > > > Fetcher: threads: 10
> > > > > > > > > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > > > > > > > > fetching http://www.seek.com.au/
> > > > > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > > > > -finishing thread FetcherThread, activeThreads=1
> > > > > > > > > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > > > > > > > > -finishing thread FetcherThread, activeThreads=0
> > > > > > > > > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > > > > > > > > -activeThreads=0
> > > > > > > > > > Fetcher: finished at 2011-07-17 09:31:29, elapsed: 00:00:03
> > > > > > > > > > ParseSegment: starting at 2011-07-17 09:31:29
> > > > > > > > > > ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > > > > > > > > ParseSegment: finished at 2011-07-17 09:31:32, elapsed: 00:00:02
> > > > > > > > > > CrawlDb update: starting at 2011-07-17 09:31:32
> > > > > > > > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > > > > > > > > CrawlDb update: segments: [/home/llist/nutchData/crawl/segments/20110717093124]
> > > > > > > > > > CrawlDb update: additions allowed: true
> > > > > > > > > > CrawlDb update: URL normalizing: true
> > > > > > > > > > CrawlDb update: URL filtering: true
> > > > > > > > > > CrawlDb update: Merging segment data into db.
> > > > > > > > > > CrawlDb update: finished at 2011-07-17 09:31:34, elapsed: 00:00:02
> > > > > > > > > > :
> > > > > > > > > > :
> > > > > > > > > > --------------------------------------------------------------------
> > > > > > > > > >
> > > > > > > > > > On Sat, 2011-07-16 at 12:14 +1000, Leo Subscriptions wrote:
> > > > > > > > > >
> > > > > > > > > > > Done, but now get additional errors:
> > > > > > > > > > >
> > > > > > > > > > > -------------------
> > > > > > > > > > > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments/20110716105826
> > > > > > > > > > > CrawlDb update: starting at 2011-07-16 11:03:56
> > > > > > > > > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > > > > > > > > > CrawlDb update: segments:
> > > > > > > > > > > [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch,
> > > > > > > > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/content,
> > > > > > > > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse,
> > > > > > > > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data,
> > > > > > > > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate,
> > > > > > > > > > > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
> > > > > > > > > > > CrawlDb update: additions allowed: true
> > > > > > > > > > > CrawlDb update: URL normalizing: false
> > > > > > > > > > > CrawlDb update: URL filtering: false
> > > > > > > > > > >  - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
> > > > > > > > > > >  - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/content
> > > > > > > > > > >  - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
> > > > > > > > > > >  - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
> > > > > > > > > > >  - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
> > > > > > > > > > >  - skipping invalid segment file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
> > > > > > > > > > > CrawlDb update: Merging segment data into db.
> > > > > > > > > > > CrawlDb update: finished at 2011-07-16 11:03:57, elapsed: 00:00:01
> > > > > > > > > > > -------------------------------------------
> > > > > > > > > > >
> > > > > > > > > > > On Sat, 2011-07-16 at 02:36 +0200, Markus Jelsma wrote:
> > > > > > > > > > >
> > > > > > > > > > > > fetch, then parse.
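
Putting the advice in this thread together: parse needs to run between fetch and updatedb, and updatedb should either name the segment directly or be given -dir with the parent segments directory. A full round with the individual commands would then look like this (a sketch reusing the paths from the logs above; the segment name comes from the generate output):

% bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed/urls
% bin/nutch generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments -topN 100
% bin/nutch fetch /home/llist/nutchData/crawl/segments/20110721122519
% bin/nutch parse /home/llist/nutchData/crawl/segments/20110721122519
% bin/nutch updatedb /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments/20110721122519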



Re: skipping invalid segments nutch 1.3

Posted by Leo Subscriptions <ll...@zudiewiener.com>.
Hi Lewis,

Will try your suggestion shortly, but am still puzzled why the crawl
command works. Isn't it using the same filter, etc?
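
One way to see what the URL filters make of a given seed, assuming your build ships the URLFilterChecker class (a sketch; it reads URLs on stdin and echoes them back prefixed with + for accepted or - for rejected):

% echo "http://www.seek.com.au/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined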

Cheers,

Leo

>         for fetch.
>         > > > >> >         Generator: filtering: true
>         > > > >> >         Generator: normalizing: true
>         > > > >> >
>         > > > >> >
>         > > > >> >         Generator: topN: 5
>         > > > >> >
>         > > > >> >         Generator: jobtracker is 'local',
>         generating exactly one
>         > > > >> >         partition.
>         > > > >> >         Generator: Partitioning selected urls for
>         politeness.
>         > > > >> >
>         > > > >> >
>         > > > >> >         Generator:
>         > > > >> >
>         segment: /home/llist/nutchData/crawl/segments/20110717093124
>         > > > >> >         Generator: finished at 2011-07-17 09:31:26,
>         elapsed:
>         > > 00:00:04
>         > > > >> >
>         > > > >> >         Fetcher: Your 'http.agent.name' value
>         should be listed
>         > > first
>         > > > >> >         in
>         > > > >> >         'http.robots.agents' property.
>         > > > >> >
>         > > > >> >
>         > > > >> >         Fetcher: starting at 2011-07-17 09:31:26
>         > > > >> >         Fetcher:
>         > > > >> >
>         segment: /home/llist/nutchData/crawl/segments/20110717093124
>         > > > >> >
>         > > > >> >         Fetcher: threads: 10
>         > > > >> >         QueueFeeder finished: total 1 records + hit
>         by time limit :0
>         > > > >> >         fetching http://www.seek.com.au/
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         -activeThreads=1, spinWaiting=0,
>         fetchQueues.totalSize=0
>         > > > >> >         -finishing thread FetcherThread,
>         activeThreads=0
>         > > > >> >         -activeThreads=0, spinWaiting=0,
>         fetchQueues.totalSize=0
>         > > > >> >         -activeThreads=0
>         > > > >> >
>         > > > >> >
>         > > > >> >         Fetcher: finished at 2011-07-17 09:31:29,
>         elapsed: 00:00:03
>         > > > >> >         ParseSegment: starting at 2011-07-17
>         09:31:29
>         > > > >> >         ParseSegment:
>         > > > >> >
>         segment: /home/llist/nutchData/crawl/segments/20110717093124
>         > > > >> >         ParseSegment: finished at 2011-07-17
>         09:31:32, elapsed:
>         > > > >> >         00:00:02
>         > > > >> >         CrawlDb update: starting at 2011-07-17
>         09:31:32
>         > > > >> >
>         > > > >> >         CrawlDb update:
>         db: /home/llist/nutchData/crawl/crawldb
>         > > > >> >         CrawlDb update: segments:
>         > > > >> >
>         > > > >> >
>         > > > >> >
>         [/home/llist/nutchData/crawl/segments/20110717093124]
>         > > > >> >
>         > > > >> >         CrawlDb update: additions allowed: true
>         > > > >> >
>         > > > >> >
>         > > > >> >         CrawlDb update: URL normalizing: true
>         > > > >> >         CrawlDb update: URL filtering: true
>         > > > >> >
>         > > > >> >         CrawlDb update: Merging segment data into
>         db.
>         > > > >> >
>         > > > >> >
>         > > > >> >         CrawlDb update: finished at 2011-07-17
>         09:31:34, elapsed:
>         > > > >> >         00:00:02
>         > > > >> >         :
>         > > > >> >         :
>         > > > >> >         :
>         > > > >> >         :
>         > > > >> >
>         > > > >>
>         > > >
>         > >
>         -----------------------------------------------------------------------------------------------
>         > > > >> >
>         > > > >> >
>         > > > >> >
>         > > > >> >         On Sat, 2011-07-16 at 12:14 +1000, Leo
>         Subscriptions wrote:
>         > > > >> >
>         > > > >> >         > Done, but now get additional errors:
>         > > > >> >         >
>         > > > >> >         > -------------------
>         > > > >> >         > llist@LeosLinux:~/nutchData
>         > > > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
>         > > > >> >         >
>         updatedb /home/llist/nutchData/crawl/crawldb
>         > > > >> >         >
>         -dir /home/llist/nutchData/crawl/segments/20110716105826
>         > > > >> >         > CrawlDb update: starting at 2011-07-16
>         11:03:56
>         > > > >> >         > CrawlDb update:
>         db: /home/llist/nutchData/crawl/crawldb
>         > > > >> >         > CrawlDb update: segments:
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch,
>         > > > >> >         >
>         > > > >> >
>         > > >
>         file:/home/llist/nutchData/crawl/segments/20110716105826/content,
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse,
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data,
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         > >
>         file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate,
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
>         > > > >> >         > CrawlDb update: additions allowed: true
>         > > > >> >         > CrawlDb update: URL normalizing: false
>         > > > >> >         > CrawlDb update: URL filtering: false
>         > > > >> >         >  - skipping invalid segment
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
>         > > > >> >         >  - skipping invalid segment
>         > > > >> >         >
>         > > > >> >
>         > > >
>         file:/home/llist/nutchData/crawl/segments/20110716105826/content
>         > > > >> >         >  - skipping invalid segment
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
>         > > > >> >         >  - skipping invalid segment
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
>         > > > >> >         >  - skipping invalid segment
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         > >
>         file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
>         > > > >> >         >  - skipping invalid segment
>         > > > >> >         >
>         > > > >> >
>         > > > >>
>         file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
>         > > > >> >         > CrawlDb update: Merging segment data into
>         db.
>         > > > >> >         > CrawlDb update: finished at 2011-07-16
>         11:03:57, elapsed:
>         > > > >> >         00:00:01
>         > > > >> >         >
>         -------------------------------------------
>         > > > >> >         >
>         > > > >> >         > On Sat, 2011-07-16 at 02:36 +0200, Markus
>         Jelsma wrote:
>         > > > >> >         >
>         > > > >> >         > > fetch, then parse.
>         > > > >> >         > >
>         > > > >> >         > > > I'm running nutch 1.3 on 64 bit
>         Ubuntu, following are
>         > > > >> >         the commands and
>         > > > >> >         > > > relevant output.
>         > > > >> >         > > >
>         > > > >> >         > > > ----------------------------------
>         > > > >> >         > > > llist@LeosLinux:~
>         > > > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
>         > > > >> >         > > >
>         > > > >> >         inject /home/llist/nutchData/crawl/crawldb
>         > > > >> /home/llist/nutchData/seed
>         > > > >> >         > > > Injector: starting at 2011-07-15
>         18:32:10
>         > > > >> >         > > > Injector:
>         crawlDb: /home/llist/nutchData/crawl/crawldb
>         > > > >> >         > > > Injector:
>         urlDir: /home/llist/nutchData/seed
>         > > > >> >         > > > Injector: Converting injected urls to
>         crawl db
>         > > entries.
>         > > > >> >         > > > Injector: Merging injected urls into
>         crawl db.
>         > > > >> >         > > > Injector: finished at 2011-07-15
>         18:32:13, elapsed:
>         > > > >> >         00:00:02
>         > > > >> >         > > > =================
>         > > > >> >         > > > llist@LeosLinux:~
>         > > > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
>         > > > >> >         > > >
>         generate /home/llist/nutchData/crawl/crawldb
>         > > > >> >         > > > /home/llist/nutchData/crawl/segments
>         Generator:
>         > > starting
>         > > > >> >         at 2011-07-15
>         > > > >> >         > > > 18:32:41
>         > > > >> >         > > > Generator: Selecting best-scoring
>         urls due for fetch.
>         > > > >> >         > > > Generator: filtering: true
>         > > > >> >         > > > Generator: normalizing: true
>         > > > >> >         > > > Generator: jobtracker is 'local',
>         generating exactly
>         > > one
>         > > > >> >         partition.
>         > > > >> >         > > > Generator: Partitioning selected urls
>         for politeness.
>         > > > >> >         > > > Generator:
>         > > > >> >
>         segment: /home/llist/nutchData/crawl/segments/20110715183244
>         > > > >> >         > > > Generator: finished at 2011-07-15
>         18:32:45, elapsed:
>         > > > >> >         00:00:03
>         > > > >> >         > > > ==================
>         > > > >> >         > > > llist@LeosLinux:~
>         > > > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
>         > > > >> >         > > >
>         > > > >> >
>         fetch /home/llist/nutchData/crawl/segments/20110715183244
>         > > > >> >         > > > Fetcher: Your 'http.agent.name' value
>         should be
>         > > listed
>         > > > >> >         first in
>         > > > >> >         > > > 'http.robots.agents' property.
>         > > > >> >         > > > Fetcher: starting at 2011-07-15
>         18:34:55
>         > > > >> >         > > > Fetcher:
>         > > > >> >
>         segment: /home/llist/nutchData/crawl/segments/20110715183244
>         > > > >> >         > > > Fetcher: threads: 10
>         > > > >> >         > > > QueueFeeder finished: total 1 records
>         + hit by time
>         > > > >> >         limit :0
>         > > > >> >         > > > fetching http://www.seek.com.au/
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=2
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=1
>         > > > >> >         > > > -activeThreads=1, spinWaiting=0,
>         > > fetchQueues.totalSize=0
>         > > > >> >         > > > -finishing thread FetcherThread,
>         activeThreads=0
>         > > > >> >         > > > -activeThreads=0, spinWaiting=0,
>         > > fetchQueues.totalSize=0
>         > > > >> >         > > > -activeThreads=0
>         > > > >> >         > > > Fetcher: finished at 2011-07-15
>         18:34:59, elapsed:
>         > > > >> >         00:00:03
>         > > > >> >         > > > =================
>         > > > >> >         > > > llist@LeosLinux:~
>         > > > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
>         > > > >> >         > > >
>         updatedb /home/llist/nutchData/crawl/crawldb
>         > > > >> >         > > > -dir
>         > > /home/llist/nutchData/crawl/segments/20110715183244
>         > > > >> >         > > > CrawlDb update: starting at
>         2011-07-15 18:36:00
>         > > > >> >         > > > CrawlDb update: db:
>         > > /home/llist/nutchData/crawl/crawldb
>         > > > >> >         > > > CrawlDb update: segments:
>         > > > >> >         > > >
>         > > > >> >
>         > > > >>
>         [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch,
>         > > > >> >         > > >
>         > > > >> >
>         > > > >>
>         > >
>         file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate,
>         > > > >> >         > > >
>         > > > >> >
>         > > >
>         file:/home/llist/nutchData/crawl/segments/20110715183244/content]
>         > > > >> >         > > > CrawlDb update: additions allowed:
>         true
>         > > > >> >         > > > CrawlDb update: URL normalizing:
>         false
>         > > > >> >         > > > CrawlDb update: URL filtering: false
>         > > > >> >         > > > - skipping invalid segment
>         > > > >> >         > > >
>         > > > >> >
>         > > > >>
>         file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
>         > > > >> >         > > > - skipping invalid segment
>         > > > >> >         > > >
>         > > > >> >
>         > > > >>
>         > >
>         file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
>         > > > >> >         > > > - skipping invalid segment
>         > > > >> >         > > >
>         > > > >> >
>         > > >
>         file:/home/llist/nutchData/crawl/segments/20110715183244/content
>         > > > >> >         > > > CrawlDb update: Merging segment data
>         into db.
>         > > > >> >         > > > CrawlDb update: finished at
>         2011-07-15 18:36:01,
>         > > > >> >         elapsed: 00:00:01
>         > > > >> >         > > > -----------------------------------
>         > > > >> >         > > >
>         > > > >> >         > > > Appreciate any hints on what I'm
>         missing.
>         > > > >> >         >
>         > > > >> >         >
>         > > > >> >
>         > > > >> >
>         > > > >> >
>         > > > >> >
>         > > > >> >
>         > > > >> >
>         > > > >> >
>         > > > >> > --
>         > > > >> > Lewis
>         > > > >> >
>         > > > >>
>         > > > >>
>         > > > >>
>         > > > >
>         > > > >
>         > > > > --
>         > > > > *
>         > > > > *Open Source Solutions for Text Engineering
>         > > > >
>         > > > > http://digitalpebble.blogspot.com/
>         > > > > http://www.digitalpebble.com
>         > > > >
>         > > >
>         > >
>         > >
>         > >
>         > > --
>         > > *
>         > > *Open Source Solutions for Text Engineering
>         > >
>         > > http://digitalpebble.blogspot.com/
>         > > http://www.digitalpebble.com
>         > >
>         >
>         >
>         >
>         
>         
>         
> 
> 
> 
> 
> -- 
> Lewis 
> 



Re: skipping invalid segments nutch 1.3

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Leo,

From the times both the fetching and parsing took, I suspect that maybe
Nutch didn't actually fetch the URL; however, this may not be the case as I
have nothing to benchmark it against. Unfortunately the URL
http://wiki.apache.org actually redirects to http://wiki.apache.org/general/,
so I'm going to post my log output from the last URL you specified in an
attempt to clear this one up. The following confirms that your observations
are accurate: not only does this produce invalid segments, but also
nothing is fetched in the process.

Therefore the reason we are getting the '- skipping invalid segment'
message is that we are not actually fetching any content. My initial
thought was that your urlfilters were not set properly, and I think that
this is part of the problem.
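
One other thing I noticed in your transcript: you are passing the segment
itself to updatedb via -dir. As far as I can tell, -dir expects the parent
segments directory and treats every child of the directory you give it as a
segment, so pointing it at /home/llist/nutchData/crawl/segments/20110721122519
makes updatedb test each subdirectory (crawl_fetch, content, parse_data and
so on) and report it as an invalid segment. Either of the following forms
should avoid that (a sketch reusing the paths from your logs; a single
segment is passed as a plain argument, while -dir is reserved for the
directory that contains the segments):

llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
updatedb /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments/20110721122519

llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments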

On the urlfilter side, please follow the syntax below very carefully and it
should work for you:

regex-urlfilter.txt
--------------------------

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# crawl URLs in the following domains.
+^http://([a-z0-9]*\.)*seek.com.au/

# accept anything else
#+.

seed file
----------------------
http://www.seek.com.au

It sounds really trivial, but I think the trailing '/' in your seed
file may have been making all of the difference.

Please try, test with readdb and readseg and comment back.
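
For reference, these are the sort of invocations I mean (a sketch; readdb
-stats is exactly what I run below, while the readseg -list option is taken
from my own checkout, so double-check bin/nutch readseg's usage message on
your build):

llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
readdb /home/llist/nutchData/crawl/crawldb -stats

llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
readseg -list /home/llist/nutchData/crawl/segments/20110721122519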

Sorry for the delayed posts on this one; I have not had much time to get to
it. Hope all goes to plan. Evidence can be seen below.

lewis@lewis-01:~/ASF/branch-1.4/runtime/local$ bin/nutch readdb crawldb
-stats
CrawlDb statistics start: crawldb
Statistics for CrawlDb: crawldb
TOTAL urls:    48
retry 0:    48
min score:    0.017
avg score:    0.041125
max score:    1.175
status 1 (db_unfetched):    47
status 2 (db_fetched):    1
CrawlDb statistics: done
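
(The single db_fetched entry is the seed itself, and the 47 db_unfetched
entries are the outlinks parsed from it and merged into the crawldb, which
is what I would expect to see after one successful
inject/generate/fetch/parse/updatedb round.)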





On Thu, Jul 21, 2011 at 3:30 AM, Leo Subscriptions <llsubscr@zudiewiener.com> wrote:

> Following are the suggested commands and the result as suggested
>  I left the redirect as 0 as 'crawl' works without any issues. The
> problem only occurs when running the individual commands.
>
> ------- nutch-site.xml -------------------------------
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <property>
>  <name>http.agent.name</name>
>  <value>listers spider</value>
> </property>
>
> <property>
>  <name>fetcher.verbose</name>
>  <value>true</value>
>  <description>If true, fetcher will log more verbosely.</description>
> </property>
>
> <property>
>  <name>http.verbose</name>
>  <value>true</value>
>  <description>If true, HTTP will log more verbosely.</description>
> </property>
>
> </configuration>
> ---------------------------------------------------------------
>
> ------ Individual commands and results-------------------------
>
> llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed/urls
> Injector: starting at 2011-07-21 12:24:52
> Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> Injector: urlDir: /home/llist/nutchData/seed/urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2011-07-21 12:24:55, elapsed: 00:00:02
>
>
> llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> generate /home/llist/nutchData/crawl/crawldb
> /home/llist/nutchData/crawl/segments -topN 100
> Generator: starting at 2011-07-21 12:25:16
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 100
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: /home/llist/nutchData/crawl/segments/20110721122519
> Generator: finished at 2011-07-21 12:25:20, elapsed: 00:00:03
>
>
> llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> fetch /home/llist/nutchData/crawl/segments/20110721122519
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2011-07-21 12:26:36
> Fetcher: segment: /home/llist/nutchData/crawl/segments/20110721122519
> Fetcher: threads: 10
> QueueFeeder finished: total 1 records + hit by time limit :0
> -finishing thread FetcherThread, activeThreads=1
> fetching http://wiki.apache.org/
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-07-21 12:26:40, elapsed: 00:00:04
>
>
> llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> parse /home/llist/nutchData/crawl/segments/20110721122519
> ParseSegment: starting at 2011-07-21 12:27:22
> ParseSegment:
> segment: /home/llist/nutchData/crawl/segments/20110721122519
> ParseSegment: finished at 2011-07-21 12:27:24, elapsed: 00:00:01
>
>
> llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> updatedb /home/llist/nutchData/crawl/crawldb
> -dir /home/llist/nutchData/crawl/segments/20110721122519
> CrawlDb update: starting at 2011-07-21 12:28:03
> CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> CrawlDb update: segments:
> [file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text,
> file:/home/llist/nutchData/crawl/segments/20110721122519/content,
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse,
> file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data,
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch,
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110721122519/content
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-07-21 12:28:04, elapsed: 00:00:01
>
>
> ------------------------------------------------------------------------------------


-- 
*Lewis*

Re: skipping invalid segments nutch 1.3

Posted by Leo Subscriptions <ll...@zudiewiener.com>.
Following are the suggested commands and their results. I left the
redirect setting at 0 (see the note after the config below), as 'crawl'
works without any issues. The problem only occurs when running the
individual commands.

------- nutch-site.xml -------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>http.agent.name</name>
  <value>listers spider</value>
</property>

<property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>

<property>
  <name>http.verbose</name>
  <value>true</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>

</configuration>
---------------------------------------------------------------
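
(For completeness: if I do end up needing redirects, my understanding is it
would be an override like the one below. http.redirect.max is the property
defined in nutch-default.xml; its default of 0 means redirects are recorded
for a later fetch cycle rather than followed immediately, and the value 2
here is only an example:)

<property>
  <name>http.redirect.max</name>
  <value>2</value>
  <description>The maximum number of redirects the fetcher will follow
  when trying to fetch a page.</description>
</property>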

------ Individual commands and results-------------------------

llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed/urls
Injector: starting at 2011-07-21 12:24:52
Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
Injector: urlDir: /home/llist/nutchData/seed/urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-21 12:24:55, elapsed: 00:00:02


llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments -topN 100
Generator: starting at 2011-07-21 12:25:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /home/llist/nutchData/crawl/segments/20110721122519
Generator: finished at 2011-07-21 12:25:20, elapsed: 00:00:03


llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
fetch /home/llist/nutchData/crawl/segments/20110721122519
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2011-07-21 12:26:36
Fetcher: segment: /home/llist/nutchData/crawl/segments/20110721122519
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
-finishing thread FetcherThread, activeThreads=1
fetching http://wiki.apache.org/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-07-21 12:26:40, elapsed: 00:00:04


llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
parse /home/llist/nutchData/crawl/segments/20110721122519
ParseSegment: starting at 2011-07-21 12:27:22
ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110721122519
ParseSegment: finished at 2011-07-21 12:27:24, elapsed: 00:00:01


llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
updatedb /home/llist/nutchData/crawl/crawldb
-dir /home/llist/nutchData/crawl/segments/20110721122519
CrawlDb update: starting at 2011-07-21 12:28:03
CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
CrawlDb update: segments:
[file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text,
file:/home/llist/nutchData/crawl/segments/20110721122519/content,
file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse,
file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data,
file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch,
file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
 - skipping invalid segment
file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text
 - skipping invalid segment
file:/home/llist/nutchData/crawl/segments/20110721122519/content
 - skipping invalid segment
file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse
 - skipping invalid segment
file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data
 - skipping invalid segment
file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch
 - skipping invalid segment
file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-07-21 12:28:04, elapsed: 00:00:01

------------------------------------------------------------------------------------



On Wed, 2011-07-20 at 21:58 +0100, lewis john mcgibbney wrote:

> There is no documentation for individual commands used to run a Nutch 1.3
> crawl so I'm not sure where there has been a mislead. In the instance that
> this was required I would direct newer users to the legacy documentation for
> the time being.
> 
> My comment to Leo was to understand whether he managed to correct the
> invalid segments problem.
> 
> Leo, if this still persists may I ask you to try again, I will do the same
> and will be happy to provide feedback
> 
> May I suggest the following
> 
> 
> use the following commands
> 
> inject
> generate
> fetch
> parse
> updatedb
> 
> At this stage we should be able to ascertain if something is correct and
> hopefully debug. May I add the following... please make the following
> additions to nutch-site.
> 
> fetcher verbose - true
> http verbose - true
> check for redirects and set accordingly
> 
> 
> On Wed, Jul 20, 2011 at 1:39 PM, Julien Nioche <
> lists.digitalpebble@gmail.com> wrote:
> 
> > The wiki can be edited and you are welcome to suggest improvements if there
> > is something missing
> >
> > On 20 July 2011 13:31, Cam Bazz <ca...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I think there is a mislead in the documentation, it does not tell us
> > > that we have to parse.
> > >
> > > On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche
> > > <li...@gmail.com> wrote:
> > > > Haven't you forgotten to call parse?
> > > >
> > > > On 19 July 2011 23:40, Leo Subscriptions <ll...@zudiewiener.com>
> > > wrote:
> > > >
> > > >> Hi Lewis,
> > > >>
> > > >> You are correct about the last post not showing any errors. I just
> > > >> wanted to show that I don't get any errors if I use 'crawl' and to
> > prove
> > > >> that I do not have any faults in the conf files or the directories.
> > > >>
> > > >> I still get the errors if I use the individual commands inject,
> > > >> generate, fetch....
> > > >>
> > > >> Cheers,
> > > >>
> > > >> Leo
> > > >>
> > > >>
> > > >>
> > > >>  On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote:
> > > >>
> > > >> > Hi Leo
> > > >> >
> > > >> > Did you resolve?
> > > >> >
> > > >> > Your second log data doesn't appear to show any errors however the
> > > >> > problem you specify if one I have witnessed myself while ago. Since
> > > >> > you posted have you been able to replicate... or resolve?
> > > >> >
> > > >> >
> > > >> > On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions
> > > >> > <ll...@zudiewiener.com> wrote:
> > > >> >
> > > >> >         I've used crawl to ensure config is correct and I don't get
> > > >> >         any errors,
> > > >> >         so I must be doing something wrong with the individual
> > steps,
> > > >> >         but can;t
> > > >> >         see what.
> > > >> >
> > > >> >
> > > >>
> > >
> > --------------------------------------------------------------------------------------------------------------------
> > > >> >
> > > >> >         llist@LeosLinux:~/nutchData
> > > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
> > > >> >
> > > >> >
> > > >> >         crawl /home/llist/nutchData/seed/urls
> > > >> >         -dir /home/llist/nutchData/crawl
> > > >> >         -depth 3 -topN 5
> > > >> >         solrUrl is not set, indexing will be skipped...
> [snip: quoted crawl transcript and earlier messages trimmed]



Re: skipping invalid segments nutch 1.3

Posted by lewis john mcgibbney <le...@gmail.com>.
There is no documentation for the individual commands used to run a Nutch 1.3
crawl, so I'm not sure where anyone could have been misled. If such
documentation were required, I would direct newer users to the legacy
documentation for the time being.

My comment to Leo was to find out whether he managed to correct the invalid
segments problem.

Leo, if the problem still persists, may I ask you to try again? I will do the
same and will be happy to provide feedback.

May I suggest running the following commands individually (a sketch using the
paths from your transcripts follows the list):

inject
generate
fetch
parse
updatedb
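
A minimal sketch of that sequence, reusing the paths from your transcripts
(replace <segment> with whatever segment directory generate prints):

/usr/share/nutch/runtime/local/bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
/usr/share/nutch/runtime/local/bin/nutch generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments
/usr/share/nutch/runtime/local/bin/nutch fetch /home/llist/nutchData/crawl/segments/<segment>
/usr/share/nutch/runtime/local/bin/nutch parse /home/llist/nutchData/crawl/segments/<segment>
/usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments/<segment>

Note that updatedb is given the segment itself as a positional argument here.
The -dir option expects the parent segments directory and treats each
subdirectory of its argument as a segment, which would explain the "skipping
invalid segment" messages for crawl_fetch, content and the rest when -dir
points at a single segment.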

At this stage we should be able to ascertain whether something is wrong and
hopefully debug it. May I also ask you to make the following additions to
nutch-site.xml:

fetcher.verbose - true
http.verbose - true
check for redirects and set accordingly
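
As a rough sketch of those nutch-site.xml entries (property names as found in
nutch-default.xml; http.redirect.max is the redirect knob, and the value of 3
below is only an example to tune):

<property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <description>If true, the fetcher logs more verbosely.</description>
</property>
<property>
  <name>http.verbose</name>
  <value>true</value>
  <description>If true, the HTTP plugin logs more verbosely.</description>
</property>
<property>
  <name>http.redirect.max</name>
  <value>3</value>
  <description>Example: follow up to 3 redirects at fetch time rather than
  recording them for a later round.</description>
</property>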


On Wed, Jul 20, 2011 at 1:39 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> The wiki can be edited and you are welcome to suggest improvements if there
> is something missing
>
> On 20 July 2011 13:31, Cam Bazz <ca...@gmail.com> wrote:
>
> > Hello,
> >
> > I think there is a mislead in the documentation, it does not tell us
> > that we have to parse.
> >
> > On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche
> > <li...@gmail.com> wrote:
> > > Haven't you forgotten to call parse?
> > >
> > > On 19 July 2011 23:40, Leo Subscriptions <ll...@zudiewiener.com>
> > wrote:
> > >
> > >> Hi Lewis,
> > >>
> > >> You are correct about the last post not showing any errors. I just
> > >> wanted to show that I don't get any errors if I use 'crawl' and to
> prove
> > >> that I do not have any faults in the conf files or the directories.
> > >>
> > >> I still get the errors if I use the individual commands inject,
> > >> generate, fetch....
> > >>
> > >> Cheers,
> > >>
> > >> Leo
> > >>
> > >>
> > >>
> > >>  On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote:
> > >>
> > >> > Hi Leo
> > >> >
> > >> > Did you resolve?
> > >> >
> > >> > Your second log data doesn't appear to show any errors however the
> > >> > problem you specify if one I have witnessed myself while ago. Since
> > >> > you posted have you been able to replicate... or resolve?
> > >> >
> > >> >
> > >> > On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions
> > >> > <ll...@zudiewiener.com> wrote:
> > >> >
> > >> >         I've used crawl to ensure config is correct and I don't get
> > >> >         any errors,
> > >> >         so I must be doing something wrong with the individual
> steps,
> > >> >         but can;t
> > >> >         see what.
> > >> >
> > >> >
> > >>
> >
> --------------------------------------------------------------------------------------------------------------------
> > >> >
> > >> >         llist@LeosLinux:~/nutchData
> > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
> > >> >
> > >> >
> > >> >         crawl /home/llist/nutchData/seed/urls
> > >> >         -dir /home/llist/nutchData/crawl
> > >> >         -depth 3 -topN 5
> > >> >         solrUrl is not set, indexing will be skipped...
> > >> >         crawl started in: /home/llist/nutchData/crawl
> > >> >         rootUrlDir = /home/llist/nutchData/seed/urls
> > >> >         threads = 10
> > >> >         depth = 3
> > >> >         solrUrl=null
> > >> >         topN = 5
> > >> >         Injector: starting at 2011-07-17 09:31:19
> > >> >
> > >> >         Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > >> >
> > >> >
> > >> >         Injector: urlDir: /home/llist/nutchData/seed/urls
> > >> >
> > >> >         Injector: Converting injected urls to crawl db entries.
> > >> >         Injector: Merging injected urls into crawl db.
> > >> >
> > >> >
> > >> >         Injector: finished at 2011-07-17 09:31:22, elapsed: 00:00:02
> > >> >         Generator: starting at 2011-07-17 09:31:22
> > >> >
> > >> >         Generator: Selecting best-scoring urls due for fetch.
> > >> >         Generator: filtering: true
> > >> >         Generator: normalizing: true
> > >> >
> > >> >
> > >> >         Generator: topN: 5
> > >> >
> > >> >         Generator: jobtracker is 'local', generating exactly one
> > >> >         partition.
> > >> >         Generator: Partitioning selected urls for politeness.
> > >> >
> > >> >
> > >> >         Generator:
> > >> >         segment: /home/llist/nutchData/crawl/segments/20110717093124
> > >> >         Generator: finished at 2011-07-17 09:31:26, elapsed:
> 00:00:04
> > >> >
> > >> >         Fetcher: Your 'http.agent.name' value should be listed
> first
> > >> >         in
> > >> >         'http.robots.agents' property.
> > >> >
> > >> >
> > >> >         Fetcher: starting at 2011-07-17 09:31:26
> > >> >         Fetcher:
> > >> >         segment: /home/llist/nutchData/crawl/segments/20110717093124
> > >> >
> > >> >         Fetcher: threads: 10
> > >> >         QueueFeeder finished: total 1 records + hit by time limit :0
> > >> >         fetching http://www.seek.com.au/
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >         -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > >> >         -finishing thread FetcherThread, activeThreads=0
> > >> >         -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > >> >         -activeThreads=0
> > >> >
> > >> >
> > >> >         Fetcher: finished at 2011-07-17 09:31:29, elapsed: 00:00:03
> > >> >         ParseSegment: starting at 2011-07-17 09:31:29
> > >> >         ParseSegment:
> > >> >         segment: /home/llist/nutchData/crawl/segments/20110717093124
> > >> >         ParseSegment: finished at 2011-07-17 09:31:32, elapsed:
> > >> >         00:00:02
> > >> >         CrawlDb update: starting at 2011-07-17 09:31:32
> > >> >
> > >> >         CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > >> >         CrawlDb update: segments:
> > >> >
> > >> >
> > >> >         [/home/llist/nutchData/crawl/segments/20110717093124]
> > >> >
> > >> >         CrawlDb update: additions allowed: true
> > >> >
> > >> >
> > >> >         CrawlDb update: URL normalizing: true
> > >> >         CrawlDb update: URL filtering: true
> > >> >
> > >> >         CrawlDb update: Merging segment data into db.
> > >> >
> > >> >
> > >> >         CrawlDb update: finished at 2011-07-17 09:31:34, elapsed:
> > >> >         00:00:02
> > >> >         :
> > >> >         :
> > >> >         :
> > >> >         :
> > >> >
> > >>
> >
> -----------------------------------------------------------------------------------------------
> > >> >
> > >> >
> > >> >
> > >> >         On Sat, 2011-07-16 at 12:14 +1000, Leo Subscriptions wrote:
> > >> >
> > >> >         > Done, but now get additional errors:
> > >> >         >
> > >> >         > -------------------
> > >> >         > llist@LeosLinux:~/nutchData
> > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
> > >> >         > updatedb /home/llist/nutchData/crawl/crawldb
> > >> >         > -dir /home/llist/nutchData/crawl/segments/20110716105826
> > >> >         > CrawlDb update: starting at 2011-07-16 11:03:56
> > >> >         > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > >> >         > CrawlDb update: segments:
> > >> >         >
> > >> >
> > >> [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch,
> > >> >         >
> > >> >
> > file:/home/llist/nutchData/crawl/segments/20110716105826/content,
> > >> >         >
> > >> >
> > >> file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse,
> > >> >         >
> > >> >
> > >> file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data,
> > >> >         >
> > >> >
> > >>
> file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate,
> > >> >         >
> > >> >
> > >> file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
> > >> >         > CrawlDb update: additions allowed: true
> > >> >         > CrawlDb update: URL normalizing: false
> > >> >         > CrawlDb update: URL filtering: false
> > >> >         >  - skipping invalid segment
> > >> >         >
> > >> >
> > >> file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
> > >> >         >  - skipping invalid segment
> > >> >         >
> > >> >
> > file:/home/llist/nutchData/crawl/segments/20110716105826/content
> > >> >         >  - skipping invalid segment
> > >> >         >
> > >> >
> > >> file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
> > >> >         >  - skipping invalid segment
> > >> >         >
> > >> >
> > >> file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
> > >> >         >  - skipping invalid segment
> > >> >         >
> > >> >
> > >>
> file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
> > >> >         >  - skipping invalid segment
> > >> >         >
> > >> >
> > >> file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
> > >> >         > CrawlDb update: Merging segment data into db.
> > >> >         > CrawlDb update: finished at 2011-07-16 11:03:57, elapsed:
> > >> >         00:00:01
> > >> >         > -------------------------------------------
> > >> >         >
> > >> >         > On Sat, 2011-07-16 at 02:36 +0200, Markus Jelsma wrote:
> > >> >         >
> > >> >         > > fetch, then parse.
> > >> >         > >
> > >> >         > > > I'm running nutch 1.3 on 64 bit Ubuntu, following are
> > >> >         the commands and
> > >> >         > > > relevant output.
> > >> >         > > >
> > >> >         > > > ----------------------------------
> > >> >         > > > llist@LeosLinux:~
> > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
> > >> >         > > >
> > >> >         inject /home/llist/nutchData/crawl/crawldb
> > >> /home/llist/nutchData/seed
> > >> >         > > > Injector: starting at 2011-07-15 18:32:10
> > >> >         > > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > >> >         > > > Injector: urlDir: /home/llist/nutchData/seed
> > >> >         > > > Injector: Converting injected urls to crawl db
> entries.
> > >> >         > > > Injector: Merging injected urls into crawl db.
> > >> >         > > > Injector: finished at 2011-07-15 18:32:13, elapsed:
> > >> >         00:00:02
> > >> >         > > > =================
> > >> >         > > > llist@LeosLinux:~
> > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
> > >> >         > > > generate /home/llist/nutchData/crawl/crawldb
> > >> >         > > > /home/llist/nutchData/crawl/segments Generator:
> starting
> > >> >         at 2011-07-15
> > >> >         > > > 18:32:41
> > >> >         > > > Generator: Selecting best-scoring urls due for fetch.
> > >> >         > > > Generator: filtering: true
> > >> >         > > > Generator: normalizing: true
> > >> >         > > > Generator: jobtracker is 'local', generating exactly
> one
> > >> >         partition.
> > >> >         > > > Generator: Partitioning selected urls for politeness.
> > >> >         > > > Generator:
> > >> >         segment: /home/llist/nutchData/crawl/segments/20110715183244
> > >> >         > > > Generator: finished at 2011-07-15 18:32:45, elapsed:
> > >> >         00:00:03
> > >> >         > > > ==================
> > >> >         > > > llist@LeosLinux:~
> > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
> > >> >         > > >
> > >> >         fetch /home/llist/nutchData/crawl/segments/20110715183244
> > >> >         > > > Fetcher: Your 'http.agent.name' value should be
> listed
> > >> >         first in
> > >> >         > > > 'http.robots.agents' property.
> > >> >         > > > Fetcher: starting at 2011-07-15 18:34:55
> > >> >         > > > Fetcher:
> > >> >         segment: /home/llist/nutchData/crawl/segments/20110715183244
> > >> >         > > > Fetcher: threads: 10
> > >> >         > > > QueueFeeder finished: total 1 records + hit by time
> > >> >         limit :0
> > >> >         > > > fetching http://www.seek.com.au/
> > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > >> >         > > > -finishing thread FetcherThread, activeThreads=2
> > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > >> >         > > > -activeThreads=1, spinWaiting=0,
> fetchQueues.totalSize=0
> > >> >         > > > -finishing thread FetcherThread, activeThreads=0
> > >> >         > > > -activeThreads=0, spinWaiting=0,
> fetchQueues.totalSize=0
> > >> >         > > > -activeThreads=0
> > >> >         > > > Fetcher: finished at 2011-07-15 18:34:59, elapsed:
> > >> >         00:00:03
> > >> >         > > > =================
> > >> >         > > > llist@LeosLinux:~
> > >> >         $ /usr/share/nutch/runtime/local/bin/nutch
> > >> >         > > > updatedb /home/llist/nutchData/crawl/crawldb
> > >> >         > > > -dir
> /home/llist/nutchData/crawl/segments/20110715183244
> > >> >         > > > CrawlDb update: starting at 2011-07-15 18:36:00
> > >> >         > > > CrawlDb update: db:
> /home/llist/nutchData/crawl/crawldb
> > >> >         > > > CrawlDb update: segments:
> > >> >         > > >
> > >> >
> > >> [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch,
> > >> >         > > >
> > >> >
> > >>
> file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate,
> > >> >         > > >
> > >> >
> > file:/home/llist/nutchData/crawl/segments/20110715183244/content]
> > >> >         > > > CrawlDb update: additions allowed: true
> > >> >         > > > CrawlDb update: URL normalizing: false
> > >> >         > > > CrawlDb update: URL filtering: false
> > >> >         > > > - skipping invalid segment
> > >> >         > > >
> > >> >
> > >> file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
> > >> >         > > > - skipping invalid segment
> > >> >         > > >
> > >> >
> > >>
> file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
> > >> >         > > > - skipping invalid segment
> > >> >         > > >
> > >> >
> > file:/home/llist/nutchData/crawl/segments/20110715183244/content
> > >> >         > > > CrawlDb update: Merging segment data into db.
> > >> >         > > > CrawlDb update: finished at 2011-07-15 18:36:01,
> > >> >         elapsed: 00:00:01
> > >> >         > > > -----------------------------------
> > >> >         > > >
> > >> >         > > > Appreciate any hints on what I'm missing.
> > >> >         >
> > >> >         >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > Lewis
> > >> >
> > >>
> > >>
> > >>
> > >
> > >
> > > --
> > > *
> > > *Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > >
> >
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>



-- 
Lewis

Re: skipping invalid segments nutch 1.3

Posted by Julien Nioche <li...@gmail.com>.
The wiki can be edited and you are welcome to suggest improvements if there
is something missing.

On 20 July 2011 13:31, Cam Bazz <ca...@gmail.com> wrote:

> Hello,
>
> I think there is a mislead in the documentation, it does not tell us
> that we have to parse.
>
> [snip: earlier quoted messages trimmed]



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: skipping invalid segments nutch 1.3

Posted by Cam Bazz <ca...@gmail.com>.
Hello,

I think the documentation is misleading: it does not tell us
that we have to parse.

On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche
<li...@gmail.com> wrote:
> Haven't you forgotten to call parse?
>
> [snip: earlier quoted messages trimmed]

Re: skipping invalid segments nutch 1.3

Posted by Julien Nioche <li...@gmail.com>.
Haven't you forgotten to call parse?
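
For reference, the missing step against the segment from the earlier
transcript would be along these lines:

/usr/share/nutch/runtime/local/bin/nutch parse /home/llist/nutchData/crawl/segments/20110716105826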

On 19 July 2011 23:40, Leo Subscriptions <ll...@zudiewiener.com> wrote:

> Hi Lewis,
>
> You are correct about the last post not showing any errors. I just
> wanted to show that I don't get any errors if I use 'crawl' and to prove
> that I do not have any faults in the conf files or the directories.
>
> I still get the errors if I use the individual commands inject,
> generate, fetch....
>
> Cheers,
>
> Leo
>
>
>
>  On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote:
>
> > Hi Leo
> >
> > Did you resolve?
> >
> > Your second log data doesn't appear to show any errors however the
> > problem you specify if one I have witnessed myself while ago. Since
> > you posted have you been able to replicate... or resolve?
> >
> >
> > On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions
> > <ll...@zudiewiener.com> wrote:
> >
> >         I've used crawl to ensure config is correct and I don't get
> >         any errors,
> >         so I must be doing something wrong with the individual steps,
> >         but can;t
> >         see what.


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: skipping invalid segments nutch 1.3

Posted by Leo Subscriptions <ll...@zudiewiener.com>.
Hi Lewis,

You are correct that the last post doesn't show any errors. I included
it to show that 'crawl' runs cleanly, and to prove that there are no
faults in the conf files or the directories.

I still get the errors if I use the individual commands inject,
generate, fetch....
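
For reference, a sketch of what the full step-by-step sequence should
look like, assuming the same paths as above (the parse step is the one
the all-in-one 'crawl' runs between fetch and updatedb, and <segment>
stands for the timestamped directory that generate creates):

/usr/share/nutch/runtime/local/bin/nutch inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed/urls
/usr/share/nutch/runtime/local/bin/nutch generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments -topN 5
/usr/share/nutch/runtime/local/bin/nutch fetch /home/llist/nutchData/crawl/segments/<segment>
/usr/share/nutch/runtime/local/bin/nutch parse /home/llist/nutchData/crawl/segments/<segment>
/usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments/<segment>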

Cheers,

Leo



 On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote:

> Hi Leo
> 
> Did you resolve?
> 
> Your second log data doesn't appear to show any errors; however, the
> problem you specify is one I have witnessed myself a while ago. Since
> you posted, have you been able to replicate... or resolve?
> 
> 
> On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions
> <ll...@zudiewiener.com> wrote:
> 
>         I've used crawl to ensure config is correct and I don't get
>         any errors, so I must be doing something wrong with the
>         individual steps, but can't see what.



Re: skipping invalid segments nutch 1.3

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Leo

Did you resolve?

Your second log data doesn't appear to show any errors; however, the
problem you specify is one I have witnessed myself a while ago. Since you
posted, have you been able to replicate... or resolve?

On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions <llsubscr@zudiewiener.com
> wrote:

> I've used crawl to ensure config is correct and I don't get any errors,
> so I must be doing something wrong with the individual steps, but can't
> see what.
>
>
> --------------------------------------------------------------------------------------------------------------------
> llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> crawl /home/llist/nutchData/seed/urls -dir /home/llist/nutchData/crawl
> -depth 3 -topN 5
> solrUrl is not set, indexing will be skipped...
> crawl started in: /home/llist/nutchData/crawl
> rootUrlDir = /home/llist/nutchData/seed/urls
> threads = 10
> depth = 3
> solrUrl=null
> topN = 5
> Injector: starting at 2011-07-17 09:31:19
> Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> Injector: urlDir: /home/llist/nutchData/seed/urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2011-07-17 09:31:22, elapsed: 00:00:02
> Generator: starting at 2011-07-17 09:31:22
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: /home/llist/nutchData/crawl/segments/20110717093124
> Generator: finished at 2011-07-17 09:31:26, elapsed: 00:00:04
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2011-07-17 09:31:26
> Fetcher: segment: /home/llist/nutchData/crawl/segments/20110717093124
> Fetcher: threads: 10
> QueueFeeder finished: total 1 records + hit by time limit :0
> fetching http://www.seek.com.au/
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-07-17 09:31:29, elapsed: 00:00:03
> ParseSegment: starting at 2011-07-17 09:31:29
> ParseSegment:
> segment: /home/llist/nutchData/crawl/segments/20110717093124
> ParseSegment: finished at 2011-07-17 09:31:32, elapsed: 00:00:02
> CrawlDb update: starting at 2011-07-17 09:31:32
> CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> CrawlDb update: segments:
> [/home/llist/nutchData/crawl/segments/20110717093124]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-07-17 09:31:34, elapsed: 00:00:02
> :
> :
> :
> :
>
> -----------------------------------------------------------------------------------------------
>
> On Sat, 2011-07-16 at 12:14 +1000, Leo Subscriptions wrote:
>
> > Done, but now get additional errors:
> >
> > -------------------
> > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> > updatedb /home/llist/nutchData/crawl/crawldb
> > -dir /home/llist/nutchData/crawl/segments/20110716105826
> > CrawlDb update: starting at 2011-07-16 11:03:56
> > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > CrawlDb update: segments:
> > [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch,
> > file:/home/llist/nutchData/crawl/segments/20110716105826/content,
> > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse,
> > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data,
> > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate,
> > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> >  - skipping invalid segment
> > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
> >  - skipping invalid segment
> > file:/home/llist/nutchData/crawl/segments/20110716105826/content
> >  - skipping invalid segment
> > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
> >  - skipping invalid segment
> > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
> >  - skipping invalid segment
> > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
> >  - skipping invalid segment
> > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2011-07-16 11:03:57, elapsed: 00:00:01
> > -------------------------------------------
> >
> > On Sat, 2011-07-16 at 02:36 +0200, Markus Jelsma wrote:
> >
> > > fetch, then parse.
> > >
> > > > I'm running nutch 1.3 on 64 bit Ubuntu, following are the commands
> and
> > > > relevant output.
> > > >
> > > > ----------------------------------
> > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
> > > > Injector: starting at 2011-07-15 18:32:10
> > > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > > Injector: urlDir: /home/llist/nutchData/seed
> > > > Injector: Converting injected urls to crawl db entries.
> > > > Injector: Merging injected urls into crawl db.
> > > > Injector: finished at 2011-07-15 18:32:13, elapsed: 00:00:02
> > > > =================
> > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > generate /home/llist/nutchData/crawl/crawldb
> > > > /home/llist/nutchData/crawl/segments Generator: starting at
> 2011-07-15
> > > > 18:32:41
> > > > Generator: Selecting best-scoring urls due for fetch.
> > > > Generator: filtering: true
> > > > Generator: normalizing: true
> > > > Generator: jobtracker is 'local', generating exactly one partition.
> > > > Generator: Partitioning selected urls for politeness.
> > > > Generator: segment:
> /home/llist/nutchData/crawl/segments/20110715183244
> > > > Generator: finished at 2011-07-15 18:32:45, elapsed: 00:00:03
> > > > ==================
> > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > fetch /home/llist/nutchData/crawl/segments/20110715183244
> > > > Fetcher: Your 'http.agent.name' value should be listed first in
> > > > 'http.robots.agents' property.
> > > > Fetcher: starting at 2011-07-15 18:34:55
> > > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > > Fetcher: threads: 10
> > > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > > fetching http://www.seek.com.au/
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=2
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -finishing thread FetcherThread, activeThreads=1
> > > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > > -finishing thread FetcherThread, activeThreads=0
> > > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > > -activeThreads=0
> > > > Fetcher: finished at 2011-07-15 18:34:59, elapsed: 00:00:03
> > > > =================
> > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > updatedb /home/llist/nutchData/crawl/crawldb
> > > > -dir /home/llist/nutchData/crawl/segments/20110715183244
> > > > CrawlDb update: starting at 2011-07-15 18:36:00
> > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > > CrawlDb update: segments:
> > > >
> [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch,
> > > >
> file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate,
> > > > file:/home/llist/nutchData/crawl/segments/20110715183244/content]
> > > > CrawlDb update: additions allowed: true
> > > > CrawlDb update: URL normalizing: false
> > > > CrawlDb update: URL filtering: false
> > > > - skipping invalid segment
> > > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
> > > > - skipping invalid segment
> > > >
> file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
> > > > - skipping invalid segment
> > > > file:/home/llist/nutchData/crawl/segments/20110715183244/content
> > > > CrawlDb update: Merging segment data into db.
> > > > CrawlDb update: finished at 2011-07-15 18:36:01, elapsed: 00:00:01
> > > > -----------------------------------
> > > >
> > > > Appreciate any hints on what I'm missing.
> >
> >
>
>
>


-- 
Lewis

Re: skipping invalid segments nutch 1.3

Posted by Leo Subscriptions <ll...@zudiewiener.com>.
I've used crawl to ensure config is correct and I don't get any errors,
so I must be doing something wrong with the individual steps, but can't
see what.

--------------------------------------------------------------------------------------------------------------------
llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
crawl /home/llist/nutchData/seed/urls -dir /home/llist/nutchData/crawl
-depth 3 -topN 5
solrUrl is not set, indexing will be skipped...
crawl started in: /home/llist/nutchData/crawl
rootUrlDir = /home/llist/nutchData/seed/urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2011-07-17 09:31:19
Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
Injector: urlDir: /home/llist/nutchData/seed/urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-07-17 09:31:22, elapsed: 00:00:02
Generator: starting at 2011-07-17 09:31:22
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: /home/llist/nutchData/crawl/segments/20110717093124
Generator: finished at 2011-07-17 09:31:26, elapsed: 00:00:04
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2011-07-17 09:31:26
Fetcher: segment: /home/llist/nutchData/crawl/segments/20110717093124
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://www.seek.com.au/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-07-17 09:31:29, elapsed: 00:00:03
ParseSegment: starting at 2011-07-17 09:31:29
ParseSegment:
segment: /home/llist/nutchData/crawl/segments/20110717093124
ParseSegment: finished at 2011-07-17 09:31:32, elapsed: 00:00:02
CrawlDb update: starting at 2011-07-17 09:31:32
CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
CrawlDb update: segments:
[/home/llist/nutchData/crawl/segments/20110717093124]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-07-17 09:31:34, elapsed: 00:00:02
:
:
:
:
-----------------------------------------------------------------------------------------------

On Sat, 2011-07-16 at 12:14 +1000, Leo Subscriptions wrote:

> Done, but now get additional errors:
> 
> -------------------
> llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> updatedb /home/llist/nutchData/crawl/crawldb
> -dir /home/llist/nutchData/crawl/segments/20110716105826
> CrawlDb update: starting at 2011-07-16 11:03:56
> CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> CrawlDb update: segments:
> [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch,
> file:/home/llist/nutchData/crawl/segments/20110716105826/content,
> file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse,
> file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data,
> file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate,
> file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110716105826/content
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-07-16 11:03:57, elapsed: 00:00:01
> -------------------------------------------
> 
> On Sat, 2011-07-16 at 02:36 +0200, Markus Jelsma wrote:
> 
> > fetch, then parse.
> > 
> > > I'm running nutch 1.3 on 64 bit Ubuntu, following are the commands and
> > > relevant output.
> > > 
> > > ----------------------------------
> > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
> > > Injector: starting at 2011-07-15 18:32:10
> > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > Injector: urlDir: /home/llist/nutchData/seed
> > > Injector: Converting injected urls to crawl db entries.
> > > Injector: Merging injected urls into crawl db.
> > > Injector: finished at 2011-07-15 18:32:13, elapsed: 00:00:02
> > > =================
> > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > generate /home/llist/nutchData/crawl/crawldb
> > > /home/llist/nutchData/crawl/segments Generator: starting at 2011-07-15
> > > 18:32:41
> > > Generator: Selecting best-scoring urls due for fetch.
> > > Generator: filtering: true
> > > Generator: normalizing: true
> > > Generator: jobtracker is 'local', generating exactly one partition.
> > > Generator: Partitioning selected urls for politeness.
> > > Generator: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > Generator: finished at 2011-07-15 18:32:45, elapsed: 00:00:03
> > > ==================
> > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > fetch /home/llist/nutchData/crawl/segments/20110715183244
> > > Fetcher: Your 'http.agent.name' value should be listed first in
> > > 'http.robots.agents' property.
> > > Fetcher: starting at 2011-07-15 18:34:55
> > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > Fetcher: threads: 10
> > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > fetching http://www.seek.com.au/
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=2
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -finishing thread FetcherThread, activeThreads=1
> > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > -finishing thread FetcherThread, activeThreads=0
> > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > -activeThreads=0
> > > Fetcher: finished at 2011-07-15 18:34:59, elapsed: 00:00:03
> > > =================
> > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > updatedb /home/llist/nutchData/crawl/crawldb
> > > -dir /home/llist/nutchData/crawl/segments/20110715183244
> > > CrawlDb update: starting at 2011-07-15 18:36:00
> > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > CrawlDb update: segments:
> > > [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch,
> > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate,
> > > file:/home/llist/nutchData/crawl/segments/20110715183244/content]
> > > CrawlDb update: additions allowed: true
> > > CrawlDb update: URL normalizing: false
> > > CrawlDb update: URL filtering: false
> > > - skipping invalid segment
> > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
> > > - skipping invalid segment
> > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
> > > - skipping invalid segment
> > > file:/home/llist/nutchData/crawl/segments/20110715183244/content
> > > CrawlDb update: Merging segment data into db.
> > > CrawlDb update: finished at 2011-07-15 18:36:01, elapsed: 00:00:01
> > > -----------------------------------
> > > 
> > > Appreciate any hints on what I'm missing.
> 
> 



Re: skipping invalid segments nutch 1.3

Posted by Leo Subscriptions <ll...@zudiewiener.com>.
Done, but now get additional errors:

-------------------
llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
updatedb /home/llist/nutchData/crawl/crawldb
-dir /home/llist/nutchData/crawl/segments/20110716105826
CrawlDb update: starting at 2011-07-16 11:03:56
CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
CrawlDb update: segments:
[file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch,
file:/home/llist/nutchData/crawl/segments/20110716105826/content,
file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse,
file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data,
file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate,
file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
 - skipping invalid segment
file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
 - skipping invalid segment
file:/home/llist/nutchData/crawl/segments/20110716105826/content
 - skipping invalid segment
file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
 - skipping invalid segment
file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
 - skipping invalid segment
file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
 - skipping invalid segment
file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-07-16 11:03:57, elapsed: 00:00:01
-------------------------------------------
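
Note the pattern in the log above: because -dir points at a single
segment, updatedb treats each of that segment's subdirectories
(crawl_fetch, content, crawl_parse, and so on) as if it were itself a
segment, and skips every one as invalid. A sketch of the two
invocations that should avoid this, assuming the same paths:

# either pass the segment itself as a positional argument:
/usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments/20110716105826

# or point -dir at the directory that contains the segments:
/usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb -dir /home/llist/nutchData/crawl/segments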

On Sat, 2011-07-16 at 02:36 +0200, Markus Jelsma wrote:

> fetch, then parse.
> 
> > I'm running nutch 1.3 on 64 bit Ubuntu, following are the commands and
> > relevant output.
> > 
> > ----------------------------------
> > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
> > Injector: starting at 2011-07-15 18:32:10
> > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > Injector: urlDir: /home/llist/nutchData/seed
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: finished at 2011-07-15 18:32:13, elapsed: 00:00:02
> > =================
> > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > generate /home/llist/nutchData/crawl/crawldb
> > /home/llist/nutchData/crawl/segments Generator: starting at 2011-07-15
> > 18:32:41
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: true
> > Generator: normalizing: true
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > Generator: finished at 2011-07-15 18:32:45, elapsed: 00:00:03
> > ==================
> > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > fetch /home/llist/nutchData/crawl/segments/20110715183244
> > Fetcher: Your 'http.agent.name' value should be listed first in
> > 'http.robots.agents' property.
> > Fetcher: starting at 2011-07-15 18:34:55
> > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > Fetcher: threads: 10
> > QueueFeeder finished: total 1 records + hit by time limit :0
> > fetching http://www.seek.com.au/
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=2
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -finishing thread FetcherThread, activeThreads=1
> > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > -finishing thread FetcherThread, activeThreads=0
> > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > -activeThreads=0
> > Fetcher: finished at 2011-07-15 18:34:59, elapsed: 00:00:03
> > =================
> > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > updatedb /home/llist/nutchData/crawl/crawldb
> > -dir /home/llist/nutchData/crawl/segments/20110715183244
> > CrawlDb update: starting at 2011-07-15 18:36:00
> > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > CrawlDb update: segments:
> > [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch,
> > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate,
> > file:/home/llist/nutchData/crawl/segments/20110715183244/content]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: false
> > CrawlDb update: URL filtering: false
> > - skipping invalid segment
> > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
> > - skipping invalid segment
> > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
> > - skipping invalid segment
> > file:/home/llist/nutchData/crawl/segments/20110715183244/content
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: finished at 2011-07-15 18:36:01, elapsed: 00:00:01
> > -----------------------------------
> > 
> > Appreciate any hints on what I'm missing.



Re: skipping invalid segments nutch 1.3

Posted by Markus Jelsma <ma...@openindex.io>.
fetch, then parse.
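
That is, run the parse step between fetch and updatedb; it is what
writes crawl_parse, parse_data and parse_text into the segment. A
minimal sketch against the segment from the log below:

/usr/share/nutch/runtime/local/bin/nutch fetch /home/llist/nutchData/crawl/segments/20110715183244
/usr/share/nutch/runtime/local/bin/nutch parse /home/llist/nutchData/crawl/segments/20110715183244
/usr/share/nutch/runtime/local/bin/nutch updatedb /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments/20110715183244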

> I'm running nutch 1.3 on 64 bit Ubuntu, following are the commands and
> relevant output.
> 
> ----------------------------------
> llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
> Injector: starting at 2011-07-15 18:32:10
> Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> Injector: urlDir: /home/llist/nutchData/seed
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2011-07-15 18:32:13, elapsed: 00:00:02
> =================
> llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> generate /home/llist/nutchData/crawl/crawldb
> /home/llist/nutchData/crawl/segments Generator: starting at 2011-07-15
> 18:32:41
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: /home/llist/nutchData/crawl/segments/20110715183244
> Generator: finished at 2011-07-15 18:32:45, elapsed: 00:00:03
> ==================
> llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> fetch /home/llist/nutchData/crawl/segments/20110715183244
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2011-07-15 18:34:55
> Fetcher: segment: /home/llist/nutchData/crawl/segments/20110715183244
> Fetcher: threads: 10
> QueueFeeder finished: total 1 records + hit by time limit :0
> fetching http://www.seek.com.au/
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=2
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-07-15 18:34:59, elapsed: 00:00:03
> =================
> llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> updatedb /home/llist/nutchData/crawl/crawldb
> -dir /home/llist/nutchData/crawl/segments/20110715183244
> CrawlDb update: starting at 2011-07-15 18:36:00
> CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> CrawlDb update: segments:
> [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch,
> file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate,
> file:/home/llist/nutchData/crawl/segments/20110715183244/content]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
> - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
> - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
> - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110715183244/content
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-07-15 18:36:01, elapsed: 00:00:01
> -----------------------------------
> 
> Appreciate any hints on what I'm missing.