Posted to user@nutch.apache.org by Christopher Gross <co...@gmail.com> on 2011/12/19 17:16:58 UTC
Missing document
I'm trying to crawl a SharePoint 2010 site, and I'm confused as to why
a document isn't getting added to my Solr Index.
I can use the parsechecker and indexchecker to verify the link to the
docx file, and they both can get to it and parse it just fine. But
when I use the crawl command, it doesn't appear. What config file
should I be checking? Do those tools use the same settings, or is
there something different about the way they operate?
Any help would be appreciated!
-- Chris
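For reference, the two checker invocations in Nutch 1.x look roughly like this (the host and path here are placeholders for the actual SharePoint URL):

```shell
# Fetch and parse a single URL without touching the crawldb
bin/nutch parsechecker "http://sharepoint-host/sites/site/Shared Documents/Alpha.docx"

# Same, but also run the indexing filters and print the resulting document fields
bin/nutch indexchecker "http://sharepoint-host/sites/site/Shared Documents/Alpha.docx"
```

Note that these tools run standalone with the active conf/ settings, which is why they can succeed while a full crawl cycle fails.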
Re: Missing document
Posted by Christopher Gross <co...@gmail.com>.
I don't think it's a redirect, unless SharePoint made it one. Any
idea how to check for that?
-- Chris
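One way to check for a redirect outside Nutch is to inspect the raw response status line; a quick sketch with curl (URL is a placeholder):

```shell
# -s: silent; -I: send a HEAD request and print only the headers.
# An "HTTP/1.1 301" or "302" status line (plus a Location header) indicates a redirect.
curl -sI "http://sharepoint-host/sites/site/Shared Documents/Alpha.docx" | head -n 5
```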
On Mon, Dec 19, 2011 at 5:15 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> Half-way; it's clear in the log. Is your document a redirect? I've not yet
> seen such a log line before.
>
> * haven't double-checked source code
>
>
>
>> Not sure where fetching starts...
>>
>> 2011-12-19 20:13:53,223 INFO crawl.FetchScheduleFactory - Using
>> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule -
>> defaultInterval=2592000
>> 2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule -
>> maxInterval=7776000 2011-12-19 20:13:53,261 INFO regex.RegexURLNormalizer
>> - can't find rules for scope 'partition', using default
>> 2011-12-19 20:13:53,394 INFO crawl.FetchScheduleFactory - Using
>> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule -
>> defaultInterval=2592000
>> 2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule -
>> maxInterval=7776000 2011-12-19 20:13:53,399 INFO regex.RegexURLNormalizer
>> - can't find rules for scope 'generate_host_count', using default
>> 2011-12-19 20:13:54,474 INFO crawl.Generator - Generator:
>> Partitioning selected urls for politeness.
>> 2011-12-19 20:13:55,479 INFO crawl.Generator - Generator: segment:
>> /cdda/nutch/crawl/segments/20111219201355
>> 2011-12-19 20:13:56,537 INFO regex.RegexURLNormalizer - can't find
>> rules for scope 'partition', using default
>> 2011-12-19 20:13:56,939 INFO crawl.Generator - Generator: finished at
>> 2011-12-19 20:13:56, elapsed: 00:00:05
>> 2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: starting at
>> 2011-12-19 20:13:57
>> 2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: segment:
>> /nutch/crawl/segments/20111219201355
>> 2011-12-19 20:13:58,743 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: threads: 10
>> 2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: time-out divisor:
>> 2 2011-12-19 20:13:58,749 DEBUG fetcher.Fetcher - -feeding 500 input urls
>> ... 2011-12-19 20:13:58,756 INFO plugin.PluginRepository - Plugins:
>> looking in: /nutch/plugins
>> 2011-12-19 20:13:58,774 INFO fetcher.Fetcher - QueueFeeder finished:
>> total 1 records + hit by time limit :0
>> <cut plugin loader stuff, can push this if you need it>
>> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - fetching
>> http://wnspstg8o.imostg.intelink.gov/sites/mlogic/Shared
>> Documents/Alpha.docx
>> 2011-12-19 20:13:59,036 DEBUG fetcher.Fetcher - redirectCount=0
>> 2011-12-19 20:13:59,038 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput
>> threshold: -1
>> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput
>> threshold retries: 5
>> 2011-12-19 20:13:59,043 INFO http.Http - http.proxy.host = null
>> 2011-12-19 20:13:59,043 INFO http.Http - http.proxy.port = 8080
>> 2011-12-19 20:13:59,043 INFO http.Http - http.timeout = 10000
>> 2011-12-19 20:13:59,043 INFO http.Http - http.content.limit = -1
>> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,043 INFO http.Http - http.agent =
>> google-robot-intelink/Nutch-1.4 (CDDA Crawler; search.intelink.gov;
>> dni-ices-search@ugov.gov)
>> 2011-12-19 20:13:59,043 INFO http.Http - http.accept.language =
>> en-us,en-gb,en;q=0.7,*;q=0.3
>> 2011-12-19 20:13:59,380 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=0
>> 2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0,
>> spinWaiting=0, fetchQueues.totalSize=0
>> 2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0
>> 2011-12-19 20:14:00,372 WARN util.NativeCodeLoader - Unable to load
>> native-hadoop library for your platform... using builtin-java classes
>> where applicable
>> 2011-12-19 20:14:01,451 INFO fetcher.Fetcher - Fetcher: finished at
>> 2011-12-19 20:14:01, elapsed: 00:00:03
>> 2011-12-19 20:14:02,197 INFO parse.ParseSegment - ParseSegment:
>> starting at 2011-12-19 20:14:02
>> 2011-12-19 20:14:02,198 INFO parse.ParseSegment - ParseSegment:
>> segment: /cdda/nutch/crawl/segments/20111219201355
>> 2011-12-19 20:14:03,062 WARN util.NativeCodeLoader - Unable to load
>> native-hadoop library for your platform... using builtin-java classes
>> where applicable
>> 2
>> ...is that enough for the fetch logs? It's all crawl/generator
>> messages after that.
>>
>>
>> I ran:
>> ./nutch freegen ../urls/ ./test-segments
>> ./nutch readseg -dump ./test-segments/ ./segment-output
>>
>> I got an error:
>> Exception in thread "main"
>> org.apache.hadoop.mapred.InvalidInputException: Input path does not
>> exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_generate
>> Input path does not exist:
>> file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_fetch
>> Input path does not exist:
>> file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_parse
>> Input path does not exist:
>> file:/data/search/cdda/nutch-1.4/bin/test-segments/content
>> Input path does not exist:
>> file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_data
>> Input path does not exist:
>> file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_text
>> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>> at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:225)
>> at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:564)
>>
>> So do I need to run the generator step in the middle? How is this
>> different than just doing a crawl?
>>
>> Thanks!
>>
>> -- Chris
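The readseg error above is consistent with pointing -dump at the parent directory: freegen writes a timestamped segment under the given output dir, and readseg -dump expects that segment directory itself. A sketch of the corrected pair of commands (the timestamp is hypothetical):

```shell
./nutch freegen ../urls/ ./test-segments
# dump the generated segment itself, not its parent directory
./nutch readseg -dump ./test-segments/20111219201355 ./segment-output
```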
>>
>>
>>
>> On Mon, Dec 19, 2011 at 3:22 PM, Markus Jelsma
>>
>> <ma...@openindex.io> wrote:
>> >> I'm a little confused -- should I set up a whole other instance of
>> >> nutch, crawldb, etc?
>> >
>> > Yes, I use clean instances for quick testing. It makes things easier
>> > sometimes.
>> >
>> >> Set the log to trace, I think this helps to tell why.....
>> >>
>> >> 2011-12-19 20:14:10,716 INFO crawl.FetchScheduleFactory - Using
>> >> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> >> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> >> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> >> 2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected
>> >> 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
>> >> 2011-12-19 20:14:10,843 INFO crawl.FetchScheduleFactory - Using
>> >> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> >> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> >> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> >> 2011-12-19 20:14:11,145 WARN crawl.Generator - Generator: 0 records
>> >> selected for fetching, exiting ...
>> >
>> > Now, this is indeed the generator, but you need the fetcher logs.
>> >
>> >> Now, before I ran this I cleared the crawldb, linkdb & segments, but I
>> >> still got a rejected because it is before the next fetch time...why do
>> >> I get that? How do I set it up to always crawl all the docs? (Not
>> >> practical for production, but it's what I want when testing...)
>> >
>> > As I said, create segments using the freegen tool. It takes an input dir
>> > with seed files, just as your initial inject does. Or you can inject URLs
>> > and give them metadata with a very low fetch interval so Nutch will
>> > crawl them each time; I usually take this approach in small tests.
>> >
>> > http://url<TAB>nutch.fetchInterval=10
>> >
>> > The URL will be selected by the generator all the time because of this
>> > low fetch interval.
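A minimal sketch of that seed-file approach (the URL and paths are placeholders; nutch.fetchInterval is the per-URL metadata key read by the injector):

```shell
# one seed per line: URL, a tab, then key=value metadata
printf 'http://url\tnutch.fetchInterval=10\n' > urls/seed.txt
bin/nutch inject crawl/crawldb urls/
```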
>> >
>> >> -- Chris
>> >>
>> >>
>> >>
>> >> On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma
>> >>
>> >> <ma...@openindex.io> wrote:
>> >> >> > Hmm, the status db_gone prevents it from being indexed, of course.
>> >> >> > It is perfectly possible for the checkers to pass but for the
>> >> >> > fetcher to fail. There may have been an error, and I remember you
>> >> >> > using a proxy earlier; that's likely the problem here too. The
>> >> >> > checkers don't use proxy configurations.
>> >> >> >
>> >> >> > Check the logs to make sure.
>> >> >>
>> >> >> I cut out the proxy, and that let me get as far as I have now.
>> >> >> Having that in place prevents me from crawling the local
>> >> >> source...is there any way to be able to crawl both the inside &
>> >> >> outside networks? [separate issue, but something that I'll need this
>> >> >> to do]
>> >> >
>> >> > Not that I know of. You can use separate configs, but this is tricky.
>> >> > Better to use separate crawldbs, configs, etc.
>> >> >
>> >> >> > That's good. But remember, to pass it _must_ match regex prefixed
>> >> >> > by a +. This, however, is not your problem because in that case it
>> >> >> > wouldn't have ended up in the CrawlDB at all.
>> >> >>
>> >> >> I have two +'s that it should match on, including +.*
>> >> >
>> >> > That'll do.
>> >> >
>> >> >> > Check the fetcher output thoroughly. Grep around. You should find
>> >> >> > it.
>> >> >>
>> >> >> What exactly am I grepping for?
>> >> >> This is the block between the doc and the next one that it tries to
>> >> >> crawl....
>> >> >
>> >> > Hmm, that looks fine, but it can still indicate a 404, because a 404
>> >> > is not an error. Does debug say anything? You can set the level for
>> >> > the Fetcher in conf/log4j.properties. You can use the freegen tool to
>> >> > generate a segment from some input text for tests.
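For example, raising those log levels in conf/log4j.properties might look like this (the logger names are taken from the log output quoted in this thread):

```
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG
log4j.logger.org.apache.nutch.protocol.http.Http=DEBUG
```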
>> >> >
>> >> >> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching http://url/Alpha.docx
>> >> >> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> >> 2011-12-19 18:42:19,539 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> >> 2011-12-19 18:42:19,540 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> >> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> >> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> >> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> >> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> >> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> >> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Fetcher: throughput
>> >> >> threshold: -1
>> >> >> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput
>> >> >> threshold retries: 5
>> >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
>> >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
>> >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
>> >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
>> >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.agent =
>> >> >> crawler-nutch/Nutch-1.4 (Crawler; email@site.com)
>> >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language =
>> >> >> en-us,en-gb,en;q=0.7,*;q=0.3
>> >> >> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10,
>> >> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> >> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10,
>> >> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> >> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10,
>> >> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> >> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10,
>> >> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> >> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10,
>> >> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >> --Chris
Re: Missing document
Posted by Markus Jelsma <ma...@openindex.io>.
Half-way; it's clear in the log. Is your document a redirect? I've not yet
seen such a log line before.
* haven't double-checked source code
Re: Missing document
Posted by Christopher Gross <co...@gmail.com>.
Not sure where fetching starts...
2011-12-19 20:13:53,223 INFO crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 20:13:53,261 INFO regex.RegexURLNormalizer - can't find
rules for scope 'partition', using default
2011-12-19 20:13:53,394 INFO crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 20:13:53,399 INFO regex.RegexURLNormalizer - can't find
rules for scope 'generate_host_count', using default
2011-12-19 20:13:54,474 INFO crawl.Generator - Generator:
Partitioning selected urls for politeness.
2011-12-19 20:13:55,479 INFO crawl.Generator - Generator: segment:
/cdda/nutch/crawl/segments/20111219201355
2011-12-19 20:13:56,537 INFO regex.RegexURLNormalizer - can't find
rules for scope 'partition', using default
2011-12-19 20:13:56,939 INFO crawl.Generator - Generator: finished at
2011-12-19 20:13:56, elapsed: 00:00:05
2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: starting at
2011-12-19 20:13:57
2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: segment:
/nutch/crawl/segments/20111219201355
2011-12-19 20:13:58,743 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: threads: 10
2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
2011-12-19 20:13:58,749 DEBUG fetcher.Fetcher - -feeding 500 input urls ...
2011-12-19 20:13:58,756 INFO plugin.PluginRepository - Plugins:
looking in: /nutch/plugins
2011-12-19 20:13:58,774 INFO fetcher.Fetcher - QueueFeeder finished:
total 1 records + hit by time limit :0
<cut plugin loader stuff, can push this if you need it>
2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,036 INFO fetcher.Fetcher - fetching
http://wnspstg8o.imostg.intelink.gov/sites/mlogic/Shared
Documents/Alpha.docx
2011-12-19 20:13:59,036 DEBUG fetcher.Fetcher - redirectCount=0
2011-12-19 20:13:59,038 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,039 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,040 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput
threshold: -1
2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput
threshold retries: 5
2011-12-19 20:13:59,043 INFO http.Http - http.proxy.host = null
2011-12-19 20:13:59,043 INFO http.Http - http.proxy.port = 8080
2011-12-19 20:13:59,043 INFO http.Http - http.timeout = 10000
2011-12-19 20:13:59,043 INFO http.Http - http.content.limit = -1
2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,043 INFO http.Http - http.agent =
google-robot-intelink/Nutch-1.4 (CDDA Crawler; search.intelink.gov;
dni-ices-search@ugov.gov)
2011-12-19 20:13:59,043 INFO http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2011-12-19 20:13:59,380 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=0
2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0,
spinWaiting=0, fetchQueues.totalSize=0
2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0
2011-12-19 20:14:00,372 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
2011-12-19 20:14:01,451 INFO fetcher.Fetcher - Fetcher: finished at
2011-12-19 20:14:01, elapsed: 00:00:03
2011-12-19 20:14:02,197 INFO parse.ParseSegment - ParseSegment:
starting at 2011-12-19 20:14:02
2011-12-19 20:14:02,198 INFO parse.ParseSegment - ParseSegment:
segment: /cdda/nutch/crawl/segments/20111219201355
2011-12-19 20:14:03,062 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
...is that enough for the fetch logs? It's all crawl/generator
messages after that.
I ran:
./nutch freegen ../urls/ ./test-segments
./nutch readseg -dump ./test-segments/ ./segment-output
I got an error:
Exception in thread "main"
org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_generate
Input path does not exist:
file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_fetch
Input path does not exist:
file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_parse
Input path does not exist:
file:/data/search/cdda/nutch-1.4/bin/test-segments/content
Input path does not exist:
file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_data
Input path does not exist:
file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_text
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:225)
at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:564)
So do I need to run the generator step in the middle? How is this
different from just doing a crawl?
Thanks!
-- Chris
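The InvalidInputException above is consistent with two things: freegen writes its segment into a timestamped subdirectory of the output dir, and that fresh segment contains only crawl_generate, so it has to be fetched and parsed before readseg -dump can find the other subdirectories. A sketch of the full sequence (paths are this thread's placeholders; guarded so it is a no-op outside a Nutch install):

```shell
# Sketch: exercise one URL end-to-end with freegen. freegen creates a
# timestamped segment dir under ./test-segments containing only
# crawl_generate, so fetch and parse must run before readseg can dump
# the remaining subdirectories (crawl_fetch, content, parse_*).
if [ -x ./nutch ]; then                       # guard: needs a Nutch install
    ./nutch freegen ../urls/ ./test-segments
    SEG=$(ls -d ./test-segments/2* | tail -1) # newest timestamped segment
    ./nutch fetch "$SEG"                      # creates crawl_fetch + content
    ./nutch parse "$SEG"                      # creates crawl_parse + parse_data/text
    ./nutch readseg -dump "$SEG" ./segment-output
fi
```

The segment-selection line is the key detail: readseg must be pointed at the timestamped directory itself, not at its parent.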
On Mon, Dec 19, 2011 at 3:22 PM, Markus Jelsma
<ma...@openindex.io> wrote:
>
>> I'm a little confused -- should I set up a whole other instance of
>> nutch, crawldb, etc?
>
> Yes, I use clean instances for quick testing. Makes things easy sometimes.
>
>>
>> Set the log to trace, I think this helps to tell why.....
>>
>> 2011-12-19 20:14:10,716 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
>> 2011-12-19 20:14:10,843 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2011-12-19 20:14:11,145 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
>
> Now, this is indeed the generator, but you need the fetcher logs.
>
>> Now, before I ran this I cleared the crawldb, linkdb & segments, but I
>> still got a rejection because it is before the next fetch time...why do
>> I get that? How do I set it up to always crawl all the docs? (Not
>> practical for production, but it's what I want when testing...)
>
> As I said, create segments using the freegen tool. It takes an input dir with
> seed files, just like your initial inject. Or you can inject files and give
> them metadata with a very low fetch interval so Nutch will crawl them each
> time; I usually take this approach in small tests.
>
> http://url<TAB>nutch.fetchInterval=10
>
> The URL will be selected by the generator all the time because of this low
> fetch interval.
>
>> -- Chris
>>
>>
>>
>> On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma
>>
>> <ma...@openindex.io> wrote:
>> >> > Hmm, the status db_gone prevents it from being indexed, of course. It
>> >> > is perfectly possible for the checkers to pass but the fetcher to
>> >> > fail. There may have been an error, and I remember you using a
>> >> > proxy earlier; that's likely the problem here too. The checkers don't
>> >> > use proxy configurations.
>> >> >
>> >> > Check the logs to make sure.
>> >>
>> >> I cut out the proxy, and that let me get as far as I have now. Having
>> >> that in place prevents me from crawling the local source...is there
>> >> any way to be able to crawl both the inside & outside networks?
>> >> [separate issue, but something that I'll need this to do]
>> >
>> > Not that I know of. You can use separate configs, but this is tricky.
>> > Better to use separate crawldbs, configs, etc.
>> >
>> >> > That's good. But remember, to pass it _must_ match regex prefixed by a
>> >> > +. This, however, is not your problem because in that case it
>> >> > wouldn't have ended up in the CrawlDB at all.
>> >>
>> >> I have two +'s that it should match on, including +.*
>> >
>> > That'll do.
>> >
>> >> > Check the fetcher output thoroughly. Grep around. You should find it.
>> >>
>> >> What exactly am I grepping for?
>> >> This is the block between the doc and the next one that it tries to
>> >> crawl....
>> >
>> > Hmm, that looks fine but can still indicate a 404 because a 404 is not an
>> > error. Does debug say anything? You can set the level for the Fetcher in
>> > conf/log4j.properties. You can use the freegen tool to generate a segment
>> > from some input text for tests.
>> >
>> >> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching
>> >> http://url/Alpha.docx 2011-12-19 18:42:19,538 INFO fetcher.Fetcher -
>> >> Using queue mode : byHost 2011-12-19 18:42:19,539 INFO fetcher.Fetcher
>> >> - Using queue mode : byHost 2011-12-19 18:42:19,540 INFO
>> >> fetcher.Fetcher - Using queue mode : byHost 2011-12-19 18:42:19,541
>> >> INFO fetcher.Fetcher - Using queue mode : byHost 2011-12-19
>> >> 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode :
>> >> byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode
>> >> : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue
>> >> mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Fetcher:
>> >> throughput threshold: -1
>> >> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput
>> >> threshold retries: 5
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.agent =
>> >> crawler-nutch/Nutch-1.4 (Crawler; email@site.com)
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language =
>> >> en-us,en-gb,en;q=0.7,*;q=0.3
>> >> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10,
>> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10,
>> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10,
>> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10,
>> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10,
>> >> spinWaiting=10, fetchQueues.totalSize=13
>> >>
>> >> Thanks!
>> >>
>> >> --Chris
Re: Missing document
Posted by Markus Jelsma <ma...@openindex.io>.
> I'm a little confused -- should I set up a whole other instance of
> nutch, crawldb, etc?
Yes, I use clean instances for quick testing. Makes things easy sometimes.
>
> Set the log to trace, I think this helps to tell why.....
>
> 2011-12-19 20:14:10,716 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
> 2011-12-19 20:14:10,843 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 20:14:11,145 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
Now, this is indeed the generator, but you need the fetcher logs.
> Now, before I ran this I cleared the crawldb, linkdb & segments, but I
> still got a rejection because it is before the next fetch time...why do
> I get that? How do I set it up to always crawl all the docs? (Not
> practical for production, but it's what I want when testing...)
As I said, create segments using the freegen tool. It takes an input dir with
seed files, just like your initial inject. Or you can inject files and give
them metadata with a very low fetch interval so Nutch will crawl them each
time; I usually take this approach in small tests.
http://url<TAB>nutch.fetchInterval=10
The URL will be selected by the generator all the time because of this low
fetch interval.
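The seed line above can be written out like this; a minimal sketch, assuming the usual seed-directory layout and using this thread's placeholder URL. The only subtle part is that the separator between the URL and the key=value metadata must be a literal tab:

```shell
# Sketch: a seed file carrying per-URL metadata. printf's \t emits the
# literal TAB that Nutch's injector expects between URL and metadata.
mkdir -p seeds
printf 'http://url/Alpha.docx\tnutch.fetchInterval=10\n' > seeds/urls.txt

if [ -x ./nutch ]; then               # guard: needs a Nutch install
    ./nutch inject crawl/crawldb seeds/
fi
```

With a 10-second interval, the generator will select the URL on every run, which is what you want for a repeatable test.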
> -- Chris
>
>
>
> On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma
>
> <ma...@openindex.io> wrote:
> >> > Hmm, the status db_gone prevents it from being indexed, of course. It
> >> > is perfectly possible for the checkers to pass but the fetcher to
> >> > fail. There may have been an error, and I remember you using a
> >> > proxy earlier; that's likely the problem here too. The checkers don't
> >> > use proxy configurations.
> >> >
> >> > Check the logs to make sure.
> >>
> >> I cut out the proxy, and that let me get as far as I have now. Having
> >> that in place prevents me from crawling the local source...is there
> >> any way to be able to crawl both the inside & outside networks?
> >> [separate issue, but something that I'll need this to do]
> >
> > Not that I know of. You can use separate configs, but this is tricky.
> > Better to use separate crawldbs, configs, etc.
> >
> >> > That's good. But remember, to pass it _must_ match regex prefixed by a
> >> > +. This, however, is not your problem because in that case it
> >> > wouldn't have ended up in the CrawlDB at all.
> >>
> >> I have two +'s that it should match on, including +.*
> >
> > That'll do.
> >
> >> > Check the fetcher output thoroughly. Grep around. You should find it.
> >>
> >> What exactly am I grepping for?
> >> This is the block between the doc and the next one that it tries to
> >> crawl....
> >
> > Hmm, that looks fine but can still indicate a 404 because a 404 is not an
> > error. Does debug say anything? You can set the level for the Fetcher in
> > conf/log4j.properties. You can use the freegen tool to generate a segment
> > from some input text for tests.
> >
> >> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching
> >> http://url/Alpha.docx 2011-12-19 18:42:19,538 INFO fetcher.Fetcher -
> >> Using queue mode : byHost 2011-12-19 18:42:19,539 INFO fetcher.Fetcher
> >> - Using queue mode : byHost 2011-12-19 18:42:19,540 INFO
> >> fetcher.Fetcher - Using queue mode : byHost 2011-12-19 18:42:19,541
> >> INFO fetcher.Fetcher - Using queue mode : byHost 2011-12-19
> >> 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
> >> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode :
> >> byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode
> >> : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue
> >> mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Fetcher:
> >> throughput threshold: -1
> >> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput
> >> threshold retries: 5
> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
> >> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
> >> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
> >> 2011-12-19 18:42:19,545 INFO http.Http - http.agent =
> >> crawler-nutch/Nutch-1.4 (Crawler; email@site.com)
> >> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language =
> >> en-us,en-gb,en;q=0.7,*;q=0.3
> >> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10,
> >> spinWaiting=10, fetchQueues.totalSize=13
> >> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10,
> >> spinWaiting=10, fetchQueues.totalSize=13
> >> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10,
> >> spinWaiting=10, fetchQueues.totalSize=13
> >> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10,
> >> spinWaiting=10, fetchQueues.totalSize=13
> >> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10,
> >> spinWaiting=10, fetchQueues.totalSize=13
> >>
> >> Thanks!
> >>
> >> --Chris
Re: Missing document
Posted by Christopher Gross <co...@gmail.com>.
I'm a little confused -- should I set up a whole other instance of
nutch, crawldb, etc?
Set the log to trace, I think this helps to tell why.....
2011-12-19 20:14:10,716 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
2011-12-19 20:14:10,843 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 20:14:11,145 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
Now, before I ran this I cleared the crawldb, linkdb & segments, but I
still got a rejection because it is before the next fetch time...why do
I get that? How do I set it up to always crawl all the docs? (Not
practical for production, but it's what I want when testing...)
-- Chris
On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma
<ma...@openindex.io> wrote:
>
>> > Hmm, the status db_gone prevents it from being indexed, of course. It is
>> > perfectly possible for the checkers to pass but the fetcher to
>> > fail. There may have been an error, and I remember you using a proxy
>> > earlier; that's likely the problem here too. The checkers don't use
>> > proxy configurations.
>> >
>> > Check the logs to make sure.
>>
>> I cut out the proxy, and that let me get as far as I have now. Having
>> that in place prevents me from crawling the local source...is there
>> any way to be able to crawl both the inside & outside networks?
>> [separate issue, but something that I'll need this to do]
>
> Not that I know of. You can use separate configs, but this is tricky. Better
> to use separate crawldbs, configs, etc.
>
>>
>> > That's good. But remember, to pass it _must_ match regex prefixed by a +.
>> > This, however, is not your problem because in that case it wouldn't have
>> > ended up in the CrawlDB at all.
>>
>> I have two +'s that it should match on, including +.*
>
> That'll do.
>
>>
>> > Check the fetcher output thoroughly. Grep around. You should find it.
>>
>> What exactly am I grepping for?
>> This is the block between the doc and the next one that it tries to
>> crawl....
>
> Hmm, that looks fine but can still indicate a 404 because a 404 is not an
> error. Does debug say anything? You can set the level for the Fetcher in
> conf/log4j.properties. You can use the freegen tool to generate a segment
> from some input text for tests.
>
>>
>> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching
>> http://url/Alpha.docx 2011-12-19 18:42:19,538 INFO fetcher.Fetcher -
>> Using queue mode : byHost 2011-12-19 18:42:19,539 INFO fetcher.Fetcher -
>> Using queue mode : byHost 2011-12-19 18:42:19,540 INFO fetcher.Fetcher -
>> Using queue mode : byHost 2011-12-19 18:42:19,541 INFO fetcher.Fetcher -
>> Using queue mode : byHost 2011-12-19 18:42:19,541 INFO fetcher.Fetcher -
>> Using queue mode : byHost 2011-12-19 18:42:19,541 INFO fetcher.Fetcher -
>> Using queue mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher -
>> Using queue mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher -
>> Using queue mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher -
>> Fetcher: throughput threshold: -1
>> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput
>> threshold retries: 5
>> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
>> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
>> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
>> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
>> 2011-12-19 18:42:19,545 INFO http.Http - http.agent =
>> crawler-nutch/Nutch-1.4 (Crawler; email@site.com)
>> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language =
>> en-us,en-gb,en;q=0.7,*;q=0.3
>> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10,
>> spinWaiting=10, fetchQueues.totalSize=13
>> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10,
>> spinWaiting=10, fetchQueues.totalSize=13
>> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10,
>> spinWaiting=10, fetchQueues.totalSize=13
>> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10,
>> spinWaiting=10, fetchQueues.totalSize=13
>> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10,
>> spinWaiting=10, fetchQueues.totalSize=13
>>
>> Thanks!
>>
>> --Chris
Re: Missing document
Posted by Markus Jelsma <ma...@openindex.io>.
> > Hmm, the status db_gone prevents it from being indexed, of course. It is
> > perfectly possible for the checkers to pass but the fetcher to
> > fail. There may have been an error, and I remember you using a proxy
> > earlier; that's likely the problem here too. The checkers don't use
> > proxy configurations.
> >
> > Check the logs to make sure.
>
> I cut out the proxy, and that let me get as far as I have now. Having
> that in place prevents me from crawling the local source...is there
> any way to be able to crawl both the inside & outside networks?
> [separate issue, but something that I'll need this to do]
Not that I know of. You can use separate configs, but this is tricky. Better
to use separate crawldbs, configs, etc.
>
> > That's good. But remember, to pass it _must_ match regex prefixed by a +.
> > This, however, is not your problem because in that case it wouldn't have
> > ended up in the CrawlDB at all.
>
> I have two +'s that it should match on, including +.*
That'll do.
>
> > Check the fetcher output thoroughly. Grep around. You should find it.
>
> What exactly am I grepping for?
> This is the block between the doc and the next one that it tries to
> crawl....
Hmm, that looks fine but can still indicate a 404 because a 404 is not an
error. Does debug say anything? You can set the level for the Fetcher in
conf/log4j.properties. You can use the freegen tool to generate a segment
from some input text for tests.
>
> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching
> http://url/Alpha.docx 2011-12-19 18:42:19,538 INFO fetcher.Fetcher -
> Using queue mode : byHost 2011-12-19 18:42:19,539 INFO fetcher.Fetcher -
> Using queue mode : byHost 2011-12-19 18:42:19,540 INFO fetcher.Fetcher -
> Using queue mode : byHost 2011-12-19 18:42:19,541 INFO fetcher.Fetcher -
> Using queue mode : byHost 2011-12-19 18:42:19,541 INFO fetcher.Fetcher -
> Using queue mode : byHost 2011-12-19 18:42:19,541 INFO fetcher.Fetcher -
> Using queue mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher -
> Using queue mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher -
> Using queue mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher -
> Fetcher: throughput threshold: -1
> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput
> threshold retries: 5
> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
> 2011-12-19 18:42:19,545 INFO http.Http - http.agent =
> crawler-nutch/Nutch-1.4 (Crawler; email@site.com)
> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10,
> spinWaiting=10, fetchQueues.totalSize=13
> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10,
> spinWaiting=10, fetchQueues.totalSize=13
> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10,
> spinWaiting=10, fetchQueues.totalSize=13
> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10,
> spinWaiting=10, fetchQueues.totalSize=13
> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10,
> spinWaiting=10, fetchQueues.totalSize=13
>
> Thanks!
>
> --Chris
Re: Missing document
Posted by Christopher Gross <co...@gmail.com>.
>
> Hmm, the status db_gone prevents it from being indexed, of course. It is
> perfectly possible for the checkers to pass but the fetcher to fail.
> There may have been an error, and I remember you using a proxy earlier; that's
> likely the problem here too. The checkers don't use proxy configurations.
>
> Check the logs to make sure.
>
I cut out the proxy, and that let me get as far as I have now. Having
that in place prevents me from crawling the local source...is there
any way to be able to crawl both the inside & outside networks?
[separate issue, but something that I'll need this to do]
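One way to handle the inside/outside-network split (a sketch, not something this thread confirms): keep a second conf directory with different proxy settings and select it per crawl via NUTCH_CONF_DIR, which bin/nutch honors. The property name below is a real Nutch setting; the directory names, and using an empty value to mean "no proxy", are assumptions for this sketch:

```shell
# Sketch: one Nutch install, two configs. conf-internal disables the
# proxy for the inside network; the default conf keeps it for outside.
mkdir -p conf-internal
cat > conf-internal/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.proxy.host</name>
    <value></value>
  </property>
</configuration>
EOF

if [ -x ./nutch ]; then               # guard: needs a Nutch install
    # run the internal crawl with its own config (and its own crawldb)
    NUTCH_CONF_DIR=$PWD/conf-internal ./nutch crawl urls-internal -dir crawl-internal
fi
```

As Markus notes below, separate crawldbs per config keep the two crawls from interfering with each other.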
>
> That's good. But remember, to pass it _must_ match regex prefixed by a +.
> This, however, is not your problem because in that case it wouldn't have ended
> up in the CrawlDB at all.
I have two +'s that it should match on, including +.*
>
> Check the fetcher output thoroughly. Grep around. You should find it.
>
What exactly am I grepping for?
This is the block between the doc and the next one that it tries to crawl....
2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching http://url/Alpha.docx
2011-12-19 18:42:19,538 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,539 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,540 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Fetcher: throughput
threshold: -1
2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput
threshold retries: 5
2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
2011-12-19 18:42:19,545 INFO http.Http - http.agent =
crawler-nutch/Nutch-1.4 (Crawler; email@site.com)
2011-12-19 18:42:19,545 INFO http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=13
2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=13
2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=13
2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=13
2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=13
Thanks!
--Chris
Re: Missing document
Posted by Markus Jelsma <ma...@openindex.io>.
> I ran the nutch readdb and dumped it to a text file, I found the entry
> for one of them:
>
> http://url/Alpha.docx Version: 7
> Status: 3 (db_gone)
> Fetch time: Thu Feb 02 18:42:19 GMT 2012
> Modified time: Thu Jan 01 00:00:00 GMT 1970
> Retries since fetch: 0
> Retry interval: 3888000 seconds (45 days)
> Score: 0.21058823
> Signature: null
> Metadata: _pst_: gone(11), lastModified=0: http://url/Alpha.docx
>
> I guess the problem is that it is "gone" but I really don't know why
> -- the file does exist, and nutch seems able to find/parse it in the
> checker runs.
Hmm, the status db_gone prevents it from being indexed, of course. It is
perfectly possible for the checkers to pass but the fetcher to fail.
There may have been an error, and I remember you using a proxy earlier; that's
likely the problem here too. The checkers don't use proxy configurations.
Check the logs to make sure.
> Wouldn't the URL filter block it at that level? In any
> case, it doesn't match on anything that has a - in the
> regex-urlfilter.xml file, so I don't think it is being filtered out
> there.
That's good. But remember, to pass it _must_ match regex prefixed by a +.
This, however, is not your problem because in that case it wouldn't have ended
up in the CrawlDB at all.
> Is there another thing that I could look at?
>
> The only thing that dumps out errors is the hadoop logs, and there is
> a lot going on there...is there anything in particular that I should
> look for near where it crawls that file?
> I don't see anything
> error-related near it or the other missing files.
Check the fetcher output thoroughly. Grep around. You should find it.
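Grepping for the document's URL and for the fetcher's error line usually turns up the cause of a db_gone; a sketch, assuming the default logs/hadoop.log location of a 1.4 install:

```shell
# Sketch: search the hadoop log around the missing document. The
# "failed with" pattern matches the fetcher's error line format,
# e.g. "fetch of <url> failed with: ...".
LOG=logs/hadoop.log
if [ -f "$LOG" ]; then                    # guard: only run where the log exists
    grep -n -C 5 'Alpha.docx' "$LOG"      # context around every mention
    grep -n 'failed with' "$LOG"          # any fetch errors at all
fi
```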
>
> -- Chris
>
>
>
> On Mon, Dec 19, 2011 at 2:00 PM, Markus Jelsma
>
> <ma...@openindex.io> wrote:
> > Check if it is in your CrawlDB at all. Debug further from that point on.
> > If it is not, then why? Perhaps some URL filter? If it is, did it get an
> > error?
> >
> >> I'm trying to crawl a SharePoint 2010 site, and I'm confused as to why
> >> a document isn't getting added to my Solr Index.
> >>
> >> I can use the parsechecker and indexchecker to verify the link to the
> >> docx file, and they both can get to it and parse it just fine. But
> >> when I use the crawl command, it doesn't appear. What config file
> >> should I be checking? Do those tools use the same settings, or is
> >> there something different about the way they operate?
> >>
> >> Any help would be appreciated!
> >>
> >> -- Chris
Re: Missing document
Posted by Christopher Gross <co...@gmail.com>.
I ran the nutch readdb and dumped it to a text file, I found the entry
for one of them:
http://url/Alpha.docx Version: 7
Status: 3 (db_gone)
Fetch time: Thu Feb 02 18:42:19 GMT 2012
Modified time: Thu Jan 01 00:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 0.21058823
Signature: null
Metadata: _pst_: gone(11), lastModified=0: http://url/Alpha.docx
I guess the problem is that it is "gone" but I really don't know why
-- the file does exist, and nutch seems able to find/parse it in the
checker runs. Wouldn't the URL filter block it at that level? In any
case, it doesn't match on anything that has a - in the
regex-urlfilter.xml file, so I don't think it is being filtered out
there. Is there another thing that I could look at?
The only thing that dumps out errors is the hadoop logs, and there is
a lot going on there...is there anything in particular that I should
look for near where it crawls that file? I don't see anything
error-related near it or the other missing files.
-- Chris
On Mon, Dec 19, 2011 at 2:00 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> Check if it is in your CrawlDB at all. Debug further from that point on. If it
> is not, then why? Perhaps some URL filter? If it is, did it get an error?
>
>> I'm trying to crawl a SharePoint 2010 site, and I'm confused as to why
>> a document isn't getting added to my Solr Index.
>>
>> I can use the parsechecker and indexchecker to verify the link to the
>> docx file, and they both can get to it and parse it just fine. But
>> when I use the crawl command, it doesn't appear. What config file
>> should I be checking? Do those tools use the same settings, or is
>> there something different about the way they operate?
>>
>> Any help would be appreciated!
>>
>> -- Chris
Re: Missing document
Posted by Markus Jelsma <ma...@openindex.io>.
Check if it is in your CrawlDB at all. Debug further from that point on. If it
is not, then why? Perhaps some URL filter? If it is, did it get an error?
> I'm trying to crawl a SharePoint 2010 site, and I'm confused as to why
> a document isn't getting added to my Solr Index.
>
> I can use the parsechecker and indexchecker to verify the link to the
> docx file, and they both can get to it and parse it just fine. But
> when I use the crawl command, it doesn't appear. What config file
> should I be checking? Do those tools use the same settings, or is
> there something different about the way they operate?
>
> Any help would be appreciated!
>
> -- Chris