Posted to user@nutch.apache.org by Christopher Gross <co...@gmail.com> on 2011/12/19 17:16:58 UTC
Missing document
I'm trying to crawl a SharePoint 2010 site, and I'm confused as to why
a document isn't getting added to my Solr Index.
I can use the parsechecker and indexchecker to verify the link to the
docx file, and they both can get to it and parse it just fine. But
when I use the crawl command, it doesn't appear. What config file
should I be checking? Do those tools use the same settings, or is
there something different about the way they operate?
Any help would be appreciated!
-- Chris
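For reference, the two checker invocations in Nutch 1.x look roughly like this (the host and path here are placeholders for the actual SharePoint URL):

```shell
# Fetch and parse a single URL without touching the crawldb
bin/nutch parsechecker "http://sharepoint-host/sites/site/Shared Documents/Alpha.docx"

# Same, but also run the indexing filters and print the resulting document fields
bin/nutch indexchecker "http://sharepoint-host/sites/site/Shared Documents/Alpha.docx"
```

Note that these tools run standalone with the active conf/ settings, which is why they can succeed while a full crawl cycle fails.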
Re: Missing document
Posted by Christopher Gross <co...@gmail.com>.
I don't think it's a redirect, unless SharePoint made it one. Any
idea how to check for that?
-- Chris
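One way to check for a redirect outside Nutch is to inspect the raw response status line; a quick sketch with curl (URL is a placeholder):

```shell
# -s: silent; -I: send a HEAD request and print only the headers.
# An "HTTP/1.1 301" or "302" status line (plus a Location header) indicates a redirect.
curl -sI "http://sharepoint-host/sites/site/Shared Documents/Alpha.docx" | head -n 5
```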
On Mon, Dec 19, 2011 at 5:15 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> Half-way; it's clear in the log. Is your document a redirect? I've not yet
> seen such a log line before.
>
> * haven't double-checked source code
>
>
>
>> Not sure where fetching starts...
>>
>> 2011-12-19 20:13:53,223 INFO crawl.FetchScheduleFactory - Using
>> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule -
>> defaultInterval=2592000
>> 2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule -
>> maxInterval=7776000 2011-12-19 20:13:53,261 INFO regex.RegexURLNormalizer
>> - can't find rules for scope 'partition', using default
>> 2011-12-19 20:13:53,394 INFO crawl.FetchScheduleFactory - Using
>> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule -
>> defaultInterval=2592000
>> 2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule -
>> maxInterval=7776000 2011-12-19 20:13:53,399 INFO regex.RegexURLNormalizer
>> - can't find rules for scope 'generate_host_count', using default
>> 2011-12-19 20:13:54,474 INFO crawl.Generator - Generator:
>> Partitioning selected urls for politeness.
>> 2011-12-19 20:13:55,479 INFO crawl.Generator - Generator: segment:
>> /cdda/nutch/crawl/segments/20111219201355
>> 2011-12-19 20:13:56,537 INFO regex.RegexURLNormalizer - can't find
>> rules for scope 'partition', using default
>> 2011-12-19 20:13:56,939 INFO crawl.Generator - Generator: finished at
>> 2011-12-19 20:13:56, elapsed: 00:00:05
>> 2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: starting at
>> 2011-12-19 20:13:57
>> 2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: segment:
>> /nutch/crawl/segments/20111219201355
>> 2011-12-19 20:13:58,743 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: threads: 10
>> 2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: time-out divisor:
>> 2 2011-12-19 20:13:58,749 DEBUG fetcher.Fetcher - -feeding 500 input urls
>> ... 2011-12-19 20:13:58,756 INFO plugin.PluginRepository - Plugins:
>> looking in: /nutch/plugins
>> 2011-12-19 20:13:58,774 INFO fetcher.Fetcher - QueueFeeder finished:
>> total 1 records + hit by time limit :0
>> <cut plugin loader stuff, can push this if you need it>
>> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,036 INFO fetcher.Fetcher - fetching
>> http://wnspstg8o.imostg.intelink.gov/sites/mlogic/Shared
>> Documents/Alpha.docx
>> 2011-12-19 20:13:59,036 DEBUG fetcher.Fetcher - redirectCount=0
>> 2011-12-19 20:13:59,038 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
>> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput
>> threshold: -1
>> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput
>> threshold retries: 5
>> 2011-12-19 20:13:59,043 INFO http.Http - http.proxy.host = null
>> 2011-12-19 20:13:59,043 INFO http.Http - http.proxy.port = 8080
>> 2011-12-19 20:13:59,043 INFO http.Http - http.timeout = 10000
>> 2011-12-19 20:13:59,043 INFO http.Http - http.content.limit = -1
>> 2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=1
>> 2011-12-19 20:13:59,043 INFO http.Http - http.agent =
>> google-robot-intelink/Nutch-1.4 (CDDA Crawler; search.intelink.gov;
>> dni-ices-search@ugov.gov)
>> 2011-12-19 20:13:59,043 INFO http.Http - http.accept.language =
>> en-us,en-gb,en;q=0.7,*;q=0.3
>> 2011-12-19 20:13:59,380 INFO fetcher.Fetcher - -finishing thread
>> FetcherThread, activeThreads=0
>> 2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0,
>> spinWaiting=0, fetchQueues.totalSize=0
>> 2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0
>> 2011-12-19 20:14:00,372 WARN util.NativeCodeLoader - Unable to load
>> native-hadoop library for your platform... using builtin-java classes
>> where applicable
>> 2011-12-19 20:14:01,451 INFO fetcher.Fetcher - Fetcher: finished at
>> 2011-12-19 20:14:01, elapsed: 00:00:03
>> 2011-12-19 20:14:02,197 INFO parse.ParseSegment - ParseSegment:
>> starting at 2011-12-19 20:14:02
>> 2011-12-19 20:14:02,198 INFO parse.ParseSegment - ParseSegment:
>> segment: /cdda/nutch/crawl/segments/20111219201355
>> 2011-12-19 20:14:03,062 WARN util.NativeCodeLoader - Unable to load
>> native-hadoop library for your platform... using builtin-java classes
>> where applicable
>> 2
>> ...is that enough for the fetch logs? It's all crawl/generator
>> messages after that.
>>
>>
>> I ran:
>> ./nutch freegen ../urls/ ./test-segments
>> ./nutch readseg -dump ./test-segments/ ./segment-output
>>
>> I got an error:
>> Exception in thread "main"
>> org.apache.hadoop.mapred.InvalidInputException: Input path does not
>> exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_generate
>> Input path does not exist:
>> file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_fetch
>> Input path does not exist:
>> file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_parse
>> Input path does not exist:
>> file:/data/search/cdda/nutch-1.4/bin/test-segments/content
>> Input path does not exist:
>> file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_data
>> Input path does not exist:
>> file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_text
>> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>> at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:225)
>> at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:564)
>>
>> So do I need to run the generator step in the middle? How is this
>> different than just doing a crawl?
>>
>> Thanks!
>>
>> -- Chris
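The readseg error above is consistent with pointing -dump at the parent directory: freegen writes a timestamped segment under the given output dir, and readseg -dump expects that segment directory itself. A sketch of the corrected pair of commands (the timestamp is hypothetical):

```shell
./nutch freegen ../urls/ ./test-segments
# dump the generated segment itself, not its parent directory
./nutch readseg -dump ./test-segments/20111219201355 ./segment-output
```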
>>
>>
>>
>> On Mon, Dec 19, 2011 at 3:22 PM, Markus Jelsma
>>
>> <ma...@openindex.io> wrote:
>> >> I'm a little confused -- should I set up a whole other instance of
>> >> nutch, crawldb, etc?
>> >
>> > Yes, I use clean instances for quick testing. It makes things easier
>> > sometimes.
>> >
>> >> Set the log to trace, I think this helps to tell why.....
>> >>
>> >> 2011-12-19 20:14:10,716 INFO crawl.FetchScheduleFactory - Using
>> >> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> >> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> >> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> >> 2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected
>> >> 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
>> >> 2011-12-19 20:14:10,843 INFO crawl.FetchScheduleFactory - Using
>> >> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> >> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> >> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> >> 2011-12-19 20:14:11,145 WARN crawl.Generator - Generator: 0 records
>> >> selected for fetching, exiting ...
>> >
>> > Now, this is indeed the generator, but you need the fetcher logs.
>> >
>> >> Now, before I ran this I cleared the crawldb, linkdb & segments, but I
>> >> still got a rejected because it is before the next fetch time...why do
>> >> I get that? How do I set it up to always crawl all the docs? (Not
>> >> practical for production, but it's what I want when testing...)
>> >
>> > As I said, create segments using the freegen tool. It takes an input dir
>> > with seed files, just as your initial inject does. Or you can inject URLs
>> > and give them metadata with a very low fetch interval so Nutch will
>> > crawl them each time; I usually take this approach in small tests.
>> >
>> > http://url<TAB>nutch.fetchInterval=10
>> >
>> > The URL will be selected by the generator all the time because of this
>> > low fetch interval.
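A minimal sketch of that seed-file approach (the URL and paths are placeholders; nutch.fetchInterval is the per-URL metadata key read by the injector):

```shell
# one seed per line: URL, a tab, then key=value metadata
printf 'http://url\tnutch.fetchInterval=10\n' > urls/seed.txt
bin/nutch inject crawl/crawldb urls/
```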
>> >
>> >> -- Chris
>> >>
>> >>
>> >>
>> >> On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma
>> >>
>> >> <ma...@openindex.io> wrote:
>> >> >> > Hmm, the status db_gone prevents it from being indexed, of course.
>> >> >> > It is perfectly possible for the checkers to pass but for the
>> >> >> > fetcher to fail. There may have been an error, and I remember you
>> >> >> > using a proxy earlier; that's likely the problem here too. The
>> >> >> > checkers don't use proxy configurations.
>> >> >> >
>> >> >> > Check the logs to make sure.
>> >> >>
>> >> >> I cut out the proxy, and that let me get as far as I have now.
>> >> >> Having that in place prevents me from crawling the local
>> >> >> source...is there any way to be able to crawl both the inside &
>> >> >> outside networks? [separate issue, but something that I'll need this
>> >> >> to do]
>> >> >
>> >> > Not that I know of. You can use separate configs, but this is tricky.
>> >> > Better to use separate crawldbs, configs, etc.
>> >> >
>> >> >> > That's good. But remember, to pass it _must_ match regex prefixed
>> >> >> > by a +. This, however, is not your problem because in that case it
>> >> >> > wouldn't have ended up in the CrawlDB at all.
>> >> >>
>> >> >> I have two +'s that it should match on, including +.*
>> >> >
>> >> > That'll do.
>> >> >
>> >> >> > Check the fetcher output thoroughly. Grep around. You should find
>> >> >> > it.
>> >> >>
>> >> >> What exactly am I grepping for?
>> >> >> This is the block between the doc and the next one that it tries to
>> >> >> crawl....
>> >> >
>> >> > Hmm, that looks fine, but it can still indicate a 404, because a 404
>> >> > is not an error. Does debug say anything? You can set the level for
>> >> > the Fetcher in conf/log4j.properties. You can use the freegen tool to
>> >> > generate a segment from some input text for tests.
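For example, raising those log levels in conf/log4j.properties might look like this (the logger names are taken from the log output quoted in this thread):

```
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG
log4j.logger.org.apache.nutch.protocol.http.Http=DEBUG
```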
>> >> >
>> >> >> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching http://url/Alpha.docx
>> >> >> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> >> 2011-12-19 18:42:19,539 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> >> 2011-12-19 18:42:19,540 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> >> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> >> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> >> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> >> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> >> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> >> 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Fetcher: throughput
>> >> >> threshold: -1
>> >> >> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput
>> >> >> threshold retries: 5
>> >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
>> >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
>> >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
>> >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
>> >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.agent =
>> >> >> crawler-nutch/Nutch-1.4 (Crawler; email@site.com)
>> >> >> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language =
>> >> >> en-us,en-gb,en;q=0.7,*;q=0.3
>> >> >> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10,
>> >> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> >> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10,
>> >> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> >> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10,
>> >> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> >> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10,
>> >> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> >> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10,
>> >> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >> --Chris
Re: Missing document
Posted by Markus Jelsma <ma...@openindex.io>.
Half-way; it's clear in the log. Is your document a redirect? I've not yet
seen such a log line before.
* haven't double-checked source code
Re: Missing document
Posted by Christopher Gross <co...@gmail.com>.
Not sure where fetching starts...
2011-12-19 20:13:53,223 INFO crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2011-12-19 20:13:53,223 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 20:13:53,261 INFO regex.RegexURLNormalizer - can't find
rules for scope 'partition', using default
2011-12-19 20:13:53,394 INFO crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2011-12-19 20:13:53,395 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 20:13:53,399 INFO regex.RegexURLNormalizer - can't find
rules for scope 'generate_host_count', using default
2011-12-19 20:13:54,474 INFO crawl.Generator - Generator:
Partitioning selected urls for politeness.
2011-12-19 20:13:55,479 INFO crawl.Generator - Generator: segment:
/cdda/nutch/crawl/segments/20111219201355
2011-12-19 20:13:56,537 INFO regex.RegexURLNormalizer - can't find
rules for scope 'partition', using default
2011-12-19 20:13:56,939 INFO crawl.Generator - Generator: finished at
2011-12-19 20:13:56, elapsed: 00:00:05
2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: starting at
2011-12-19 20:13:57
2011-12-19 20:13:57,695 INFO fetcher.Fetcher - Fetcher: segment:
/nutch/crawl/segments/20111219201355
2011-12-19 20:13:58,743 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: threads: 10
2011-12-19 20:13:58,744 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
2011-12-19 20:13:58,749 DEBUG fetcher.Fetcher - -feeding 500 input urls ...
2011-12-19 20:13:58,756 INFO plugin.PluginRepository - Plugins:
looking in: /nutch/plugins
2011-12-19 20:13:58,774 INFO fetcher.Fetcher - QueueFeeder finished:
total 1 records + hit by time limit :0
<cut plugin loader stuff, can push this if you need it>
2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,036 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,036 INFO fetcher.Fetcher - fetching
http://wnspstg8o.imostg.intelink.gov/sites/mlogic/Shared
Documents/Alpha.docx
2011-12-19 20:13:59,036 DEBUG fetcher.Fetcher - redirectCount=0
2011-12-19 20:13:59,038 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,039 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,039 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,040 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,040 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,041 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,041 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,042 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,042 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput
threshold: -1
2011-12-19 20:13:59,043 INFO fetcher.Fetcher - Fetcher: throughput
threshold retries: 5
2011-12-19 20:13:59,043 INFO http.Http - http.proxy.host = null
2011-12-19 20:13:59,043 INFO http.Http - http.proxy.port = 8080
2011-12-19 20:13:59,043 INFO http.Http - http.timeout = 10000
2011-12-19 20:13:59,043 INFO http.Http - http.content.limit = -1
2011-12-19 20:13:59,043 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=1
2011-12-19 20:13:59,043 INFO http.Http - http.agent =
google-robot-intelink/Nutch-1.4 (CDDA Crawler; search.intelink.gov;
dni-ices-search@ugov.gov)
2011-12-19 20:13:59,043 INFO http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2011-12-19 20:13:59,380 INFO fetcher.Fetcher - -finishing thread
FetcherThread, activeThreads=0
2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0,
spinWaiting=0, fetchQueues.totalSize=0
2011-12-19 20:14:00,050 INFO fetcher.Fetcher - -activeThreads=0
2011-12-19 20:14:00,372 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
2011-12-19 20:14:01,451 INFO fetcher.Fetcher - Fetcher: finished at
2011-12-19 20:14:01, elapsed: 00:00:03
2011-12-19 20:14:02,197 INFO parse.ParseSegment - ParseSegment:
starting at 2011-12-19 20:14:02
2011-12-19 20:14:02,198 INFO parse.ParseSegment - ParseSegment:
segment: /cdda/nutch/crawl/segments/20111219201355
2011-12-19 20:14:03,062 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
...is that enough for the fetch logs? It's all crawl/generator
messages after that.
I ran:
./nutch freegen ../urls/ ./test-segments
./nutch readseg -dump ./test-segments/ ./segment-output
I got an error:
Exception in thread "main"
org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_generate
Input path does not exist:
file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_fetch
Input path does not exist:
file:/data/search/cdda/nutch-1.4/bin/test-segments/crawl_parse
Input path does not exist:
file:/data/search/cdda/nutch-1.4/bin/test-segments/content
Input path does not exist:
file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_data
Input path does not exist:
file:/data/search/cdda/nutch-1.4/bin/test-segments/parse_text
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.segment.SegmentReader.dump(SegmentReader.java:225)
at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:564)
So do I need to run the generator step in the middle? How is this
different from just doing a crawl?
Thanks!
-- Chris
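The InvalidInputException above is consistent with two things: freegen writes its segment into a timestamped subdirectory of the output dir, and that fresh segment contains only crawl_generate, so it has to be fetched and parsed before readseg -dump can find the other subdirectories. A sketch of the full sequence (paths are this thread's placeholders; guarded so it is a no-op outside a Nutch install):

```shell
# Sketch: exercise one URL end-to-end with freegen. freegen creates a
# timestamped segment dir under ./test-segments containing only
# crawl_generate, so fetch and parse must run before readseg can dump
# the remaining subdirectories (crawl_fetch, content, parse_*).
if [ -x ./nutch ]; then                       # guard: needs a Nutch install
    ./nutch freegen ../urls/ ./test-segments
    SEG=$(ls -d ./test-segments/2* | tail -1) # newest timestamped segment
    ./nutch fetch "$SEG"                      # creates crawl_fetch + content
    ./nutch parse "$SEG"                      # creates crawl_parse + parse_data/text
    ./nutch readseg -dump "$SEG" ./segment-output
fi
```

The segment-selection line is the key detail: readseg must be pointed at the timestamped directory itself, not at its parent.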
On Mon, Dec 19, 2011 at 3:22 PM, Markus Jelsma
<ma...@openindex.io> wrote:
>
>> I'm a little confused -- should I set up a whole other instance of
>> nutch, crawldb, etc?
>
> Yes, I use clean instances for quick testing. Makes things easy sometimes.
>
>>
>> Set the log to trace, I think this helps to tell why.....
>>
>> 2011-12-19 20:14:10,716 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
>> 2011-12-19 20:14:10,843 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>> 2011-12-19 20:14:11,145 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
>
> Now, this is indeed the generator, but you need the fetcher logs.
>
>> Now, before I ran this I cleared the crawldb, linkdb & segments, but I
>> still got a rejection because it is before the next fetch time...why do
>> I get that? How do I set it up to always crawl all the docs? (Not
>> practical for production, but it's what I want when testing...)
>
> As I said, create segments using the freegen tool. It takes an input dir with
> seed files, just like your initial inject. Or you can inject files and give
> them metadata with a very low fetch interval so Nutch will crawl them each
> time; I usually take this approach in small tests.
>
> http://url<TAB>nutch.fetchInterval=10
>
> The URL will be selected by the generator all the time because of this low
> fetch interval.
>
>> -- Chris
>>
>>
>>
>> On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma
>>
>> <ma...@openindex.io> wrote:
>> >> > Hmm, the status db_gone prevents it from being indexed, of course. It
>> >> > is perfectly possible for the checkers to pass but the fetcher to
>> >> > fail. There may have been an error, and I remember you using a
>> >> > proxy earlier; that's likely the problem here too. The checkers don't
>> >> > use proxy configurations.
>> >> >
>> >> > Check the logs to make sure.
>> >>
>> >> I cut out the proxy, and that let me get as far as I have now. Having
>> >> that in place prevents me from crawling the local source...is there
>> >> any way to be able to crawl both the inside & outside networks?
>> >> [separate issue, but something that I'll need this to do]
>> >
>> > Not that I know of. You can use separate configs, but this is tricky.
>> > Better to use separate crawldbs, configs, etc.
>> >
>> >> > That's good. But remember, to pass it _must_ match regex prefixed by a
>> >> > +. This, however, is not your problem because in that case it
>> >> > wouldn't have ended up in the CrawlDB at all.
>> >>
>> >> I have two +'s that it should match on, including +.*
>> >
>> > That'll do.
>> >
>> >> > Check the fetcher output thoroughly. Grep around. You should find it.
>> >>
>> >> What exactly am I grepping for?
>> >> This is the block between the doc and the next one that it tries to
>> >> crawl....
>> >
>> > Hmm, that looks fine but can still indicate a 404 because a 404 is not an
>> > error. Does debug say anything? You can set the level for the Fetcher in
>> > conf/log4j.properties. You can use the freegen tool to generate a segment
>> > from some input text for tests.
>> >
>> >> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching
>> >> http://url/Alpha.docx 2011-12-19 18:42:19,538 INFO fetcher.Fetcher -
>> >> Using queue mode : byHost 2011-12-19 18:42:19,539 INFO fetcher.Fetcher
>> >> - Using queue mode : byHost 2011-12-19 18:42:19,540 INFO
>> >> fetcher.Fetcher - Using queue mode : byHost 2011-12-19 18:42:19,541
>> >> INFO fetcher.Fetcher - Using queue mode : byHost 2011-12-19
>> >> 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
>> >> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode :
>> >> byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode
>> >> : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue
>> >> mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Fetcher:
>> >> throughput threshold: -1
>> >> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput
>> >> threshold retries: 5
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.agent =
>> >> crawler-nutch/Nutch-1.4 (Crawler; email@site.com)
>> >> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language =
>> >> en-us,en-gb,en;q=0.7,*;q=0.3
>> >> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10,
>> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10,
>> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10,
>> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10,
>> >> spinWaiting=10, fetchQueues.totalSize=13
>> >> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10,
>> >> spinWaiting=10, fetchQueues.totalSize=13
>> >>
>> >> Thanks!
>> >>
>> >> --Chris
Re: Missing document
Posted by Markus Jelsma <ma...@openindex.io>.
> I'm a little confused -- should I set up a whole other instance of
> nutch, crawldb, etc?
Yes, I use clean instances for quick testing. Makes things easy sometimes.
>
> Set the log to trace, I think this helps to tell why.....
>
> 2011-12-19 20:14:10,716 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
> 2011-12-19 20:14:10,843 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> 2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 2011-12-19 20:14:11,145 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
Now, this is indeed the generator, but you need the fetcher logs.
> Now, before I ran this I cleared the crawldb, linkdb & segments, but I
> still got a rejection because it is before the next fetch time...why do
> I get that? How do I set it up to always crawl all the docs? (Not
> practical for production, but it's what I want when testing...)
As I said, create segments using the freegen tool. It takes an input dir with
seed files, just like your initial inject. Or you can inject files and give
them metadata with a very low fetch interval so Nutch will crawl them each
time; I usually take this approach in small tests.
http://url<TAB>nutch.fetchInterval=10
The URL will be selected by the generator all the time because of this low
fetch interval.
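The seed line above can be written out like this; a minimal sketch, assuming the usual seed-directory layout and using this thread's placeholder URL. The only subtle part is that the separator between the URL and the key=value metadata must be a literal tab:

```shell
# Sketch: a seed file carrying per-URL metadata. printf's \t emits the
# literal TAB that Nutch's injector expects between URL and metadata.
mkdir -p seeds
printf 'http://url/Alpha.docx\tnutch.fetchInterval=10\n' > seeds/urls.txt

if [ -x ./nutch ]; then               # guard: needs a Nutch install
    ./nutch inject crawl/crawldb seeds/
fi
```

With a 10-second interval, the generator will select the URL on every run, which is what you want for a repeatable test.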
> -- Chris
>
>
>
> On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma
>
> <ma...@openindex.io> wrote:
> >> > Hmm, the status db_gone prevents it from being indexed, of course. It
> >> > is perfectly possible for the checkers to pass but the fetcher to
> >> > fail. There may have been an error, and I remember you using a
> >> > proxy earlier; that's likely the problem here too. The checkers don't
> >> > use proxy configurations.
> >> >
> >> > Check the logs to make sure.
> >>
> >> I cut out the proxy, and that let me get as far as I have now. Having
> >> that in place prevents me from crawling the local source...is there
> >> any way to be able to crawl both the inside & outside networks?
> >> [separate issue, but something that I'll need this to do]
> >
> > Not that I know of. You can use separate configs, but this is tricky.
> > Better to use separate crawldbs, configs, etc.
> >
> >> > That's good. But remember, to pass it _must_ match regex prefixed by a
> >> > +. This, however, is not your problem because in that case it
> >> > wouldn't have ended up in the CrawlDB at all.
> >>
> >> I have two +'s that it should match on, including +.*
> >
> > That'll do.
> >
> >> > Check the fetcher output thoroughly. Grep around. You should find it.
> >>
> >> What exactly am I grepping for?
> >> This is the block between the doc and the next one that it tries to
> >> crawl....
> >
> > Hmm, that looks fine but can still indicate a 404 because a 404 is not an
> > error. Does debug say anything? You can set the level for the Fetcher in
> > conf/log4j.properties. You can use the freegen tool to generate a segment
> > from some input text for tests.
> >
> >> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching
> >> http://url/Alpha.docx 2011-12-19 18:42:19,538 INFO fetcher.Fetcher -
> >> Using queue mode : byHost 2011-12-19 18:42:19,539 INFO fetcher.Fetcher
> >> - Using queue mode : byHost 2011-12-19 18:42:19,540 INFO
> >> fetcher.Fetcher - Using queue mode : byHost 2011-12-19 18:42:19,541
> >> INFO fetcher.Fetcher - Using queue mode : byHost 2011-12-19
> >> 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
> >> 2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode :
> >> byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode
> >> : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue
> >> mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Fetcher:
> >> throughput threshold: -1
> >> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput
> >> threshold retries: 5
> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
> >> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
> >> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
> >> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
> >> 2011-12-19 18:42:19,545 INFO http.Http - http.agent =
> >> crawler-nutch/Nutch-1.4 (Crawler; email@site.com)
> >> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language =
> >> en-us,en-gb,en;q=0.7,*;q=0.3
> >> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10,
> >> spinWaiting=10, fetchQueues.totalSize=13
> >> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10,
> >> spinWaiting=10, fetchQueues.totalSize=13
> >> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10,
> >> spinWaiting=10, fetchQueues.totalSize=13
> >> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10,
> >> spinWaiting=10, fetchQueues.totalSize=13
> >> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10,
> >> spinWaiting=10, fetchQueues.totalSize=13
> >>
> >> Thanks!
> >>
> >> --Chris
Re: Missing document
Posted by Christopher Gross <co...@gmail.com>.
I'm a little confused -- should I set up a whole other instance of
nutch, crawldb, etc?
Set the log to trace, I think this helps to tell why.....
2011-12-19 20:14:10,716 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2011-12-19 20:14:10,716 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 20:14:10,738 DEBUG crawl.Generator - -shouldFetch rejected 'http://url/Alpha.docx', fetchTime=1328213639379, curTime=1324757649378
2011-12-19 20:14:10,843 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2011-12-19 20:14:10,843 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2011-12-19 20:14:11,145 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
Now, before I ran this I cleared the crawldb, linkdb & segments, but I
still got a rejection because it is before the next fetch time...why do
I get that? How do I set it up to always crawl all the docs? (Not
practical for production, but it's what I want when testing...)
-- Chris
On Mon, Dec 19, 2011 at 2:42 PM, Markus Jelsma
<ma...@openindex.io> wrote:
>
>> > Hmm, the status db_gone prevents it from being indexed, of course. It is
>> > perfectly possible for the checkers to pass but the fetcher to
>> > fail. There may have been an error, and I remember you using a proxy
>> > earlier; that's likely the problem here too. The checkers don't use
>> > proxy configurations.
>> >
>> > Check the logs to make sure.
>>
>> I cut out the proxy, and that let me get as far as I have now. Having
>> that in place prevents me from crawling the local source...is there
>> any way to be able to crawl both the inside & outside networks?
>> [separate issue, but something that I'll need this to do]
>
> Not that I know of. You can use separate configs, but this is tricky. Better
> to use separate crawldbs, configs, etc.
>
>>
>> > That's good. But remember, to pass it _must_ match regex prefixed by a +.
>> > This, however, is not your problem because in that case it wouldn't have
>> > ended up in the CrawlDB at all.
>>
>> I have two +'s that it should match on, including +.*
>
> That'll do.
>
>>
>> > Check the fetcher output thoroughly. Grep around. You should find it.
>>
>> What exactly am I grepping for?
>> This is the block between the doc and the next one that it tries to
>> crawl....
>
> Hmm, that looks fine but can still indicate a 404 because a 404 is not an
> error. Does debug say anything? You can set the level for the Fetcher in
> conf/log4j.properties. You can use the freegen tool to generate a segment
> from some input text for tests.
>
>>
>> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching
>> http://url/Alpha.docx 2011-12-19 18:42:19,538 INFO fetcher.Fetcher -
>> Using queue mode : byHost 2011-12-19 18:42:19,539 INFO fetcher.Fetcher -
>> Using queue mode : byHost 2011-12-19 18:42:19,540 INFO fetcher.Fetcher -
>> Using queue mode : byHost 2011-12-19 18:42:19,541 INFO fetcher.Fetcher -
>> Using queue mode : byHost 2011-12-19 18:42:19,541 INFO fetcher.Fetcher -
>> Using queue mode : byHost 2011-12-19 18:42:19,541 INFO fetcher.Fetcher -
>> Using queue mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher -
>> Using queue mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher -
>> Using queue mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher -
>> Fetcher: throughput threshold: -1
>> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput
>> threshold retries: 5
>> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
>> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
>> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
>> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
>> 2011-12-19 18:42:19,545 INFO http.Http - http.agent =
>> crawler-nutch/Nutch-1.4 (Crawler; email@site.com)
>> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language =
>> en-us,en-gb,en;q=0.7,*;q=0.3
>> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10,
>> spinWaiting=10, fetchQueues.totalSize=13
>> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10,
>> spinWaiting=10, fetchQueues.totalSize=13
>> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10,
>> spinWaiting=10, fetchQueues.totalSize=13
>> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10,
>> spinWaiting=10, fetchQueues.totalSize=13
>> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10,
>> spinWaiting=10, fetchQueues.totalSize=13
>>
>> Thanks!
>>
>> --Chris
Re: Missing document
Posted by Markus Jelsma <ma...@openindex.io>.
> > Hmm, the status db_gone prevents it from being indexed, of course. It is
> > perfectly possible for the checkers to pass but the fetcher to
> > fail. There may have been an error, and I remember you using a proxy
> > earlier; that's likely the problem here too. The checkers don't use
> > proxy configurations.
> >
> > Check the logs to make sure.
>
> I cut out the proxy, and that let me get as far as I have now. Having
> that in place prevents me from crawling the local source...is there
> any way to be able to crawl both the inside & outside networks?
> [separate issue, but something that I'll need this to do]
Not that I know of. You can use separate configs, but this is tricky. Better
to use separate crawldbs, configs, etc.
>
> > That's good. But remember, to pass it _must_ match regex prefixed by a +.
> > This, however, is not your problem because in that case it wouldn't have
> > ended up in the CrawlDB at all.
>
> I have two +'s that it should match on, including +.*
That'll do.
>
> > Check the fetcher output thoroughly. Grep around. You should find it.
>
> What exactly am I grepping for?
> This is the block between the doc and the next one that it tries to
> crawl....
Hmm, that looks fine but can still indicate a 404 because a 404 is not an
error. Does debug say anything? You can set the level for the Fetcher in
conf/log4j.properties. You can use the freegen tool to generate a segment
from some input text for tests.
>
> 2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching
> http://url/Alpha.docx 2011-12-19 18:42:19,538 INFO fetcher.Fetcher -
> Using queue mode : byHost 2011-12-19 18:42:19,539 INFO fetcher.Fetcher -
> Using queue mode : byHost 2011-12-19 18:42:19,540 INFO fetcher.Fetcher -
> Using queue mode : byHost 2011-12-19 18:42:19,541 INFO fetcher.Fetcher -
> Using queue mode : byHost 2011-12-19 18:42:19,541 INFO fetcher.Fetcher -
> Using queue mode : byHost 2011-12-19 18:42:19,541 INFO fetcher.Fetcher -
> Using queue mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher -
> Using queue mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher -
> Using queue mode : byHost 2011-12-19 18:42:19,542 INFO fetcher.Fetcher -
> Fetcher: throughput threshold: -1
> 2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput
> threshold retries: 5
> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
> 2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
> 2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
> 2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
> 2011-12-19 18:42:19,545 INFO http.Http - http.agent =
> crawler-nutch/Nutch-1.4 (Crawler; email@site.com)
> 2011-12-19 18:42:19,545 INFO http.Http - http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10,
> spinWaiting=10, fetchQueues.totalSize=13
> 2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10,
> spinWaiting=10, fetchQueues.totalSize=13
> 2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10,
> spinWaiting=10, fetchQueues.totalSize=13
> 2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10,
> spinWaiting=10, fetchQueues.totalSize=13
> 2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10,
> spinWaiting=10, fetchQueues.totalSize=13
>
> Thanks!
>
> --Chris
Re: Missing document
Posted by Christopher Gross <co...@gmail.com>.
>
> Hmm, the status db_gone prevents it from being indexed, of course. It is
> perfectly possible for the checkers to pass but the fetcher to fail.
> There may have been an error, and I remember you using a proxy earlier; that's
> likely the problem here too. The checkers don't use proxy configurations.
>
> Check the logs to make sure.
>
I cut out the proxy, and that let me get as far as I have now. Having
that in place prevents me from crawling the local source...is there
any way to be able to crawl both the inside & outside networks?
[separate issue, but something that I'll need this to do]
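One way to handle the inside/outside-network split (a sketch, not something this thread confirms): keep a second conf directory with different proxy settings and select it per crawl via NUTCH_CONF_DIR, which bin/nutch honors. The property name below is a real Nutch setting; the directory names, and using an empty value to mean "no proxy", are assumptions for this sketch:

```shell
# Sketch: one Nutch install, two configs. conf-internal disables the
# proxy for the inside network; the default conf keeps it for outside.
mkdir -p conf-internal
cat > conf-internal/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.proxy.host</name>
    <value></value>
  </property>
</configuration>
EOF

if [ -x ./nutch ]; then               # guard: needs a Nutch install
    # run the internal crawl with its own config (and its own crawldb)
    NUTCH_CONF_DIR=$PWD/conf-internal ./nutch crawl urls-internal -dir crawl-internal
fi
```

As Markus notes below, separate crawldbs per config keep the two crawls from interfering with each other.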
>
> That's good. But remember, to pass it _must_ match regex prefixed by a +.
> This, however, is not your problem because in that case it wouldn't have ended
> up in the CrawlDB at all.
I have two +'s that it should match on, including +.*
>
> Check the fetcher output thoroughly. Grep around. You should find it.
>
What exactly am I grepping for?
This is the block between the doc and the next one that it tries to crawl....
2011-12-19 18:42:19,538 INFO fetcher.Fetcher - fetching http://url/Alpha.docx
2011-12-19 18:42:19,538 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,539 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,540 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,541 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Using queue mode : byHost
2011-12-19 18:42:19,542 INFO fetcher.Fetcher - Fetcher: throughput
threshold: -1
2011-12-19 18:42:19,543 INFO fetcher.Fetcher - Fetcher: throughput
threshold retries: 5
2011-12-19 18:42:19,545 INFO http.Http - http.proxy.host = null
2011-12-19 18:42:19,545 INFO http.Http - http.proxy.port = 8080
2011-12-19 18:42:19,545 INFO http.Http - http.timeout = 10000
2011-12-19 18:42:19,545 INFO http.Http - http.content.limit = -1
2011-12-19 18:42:19,545 INFO http.Http - http.agent =
crawler-nutch/Nutch-1.4 (Crawler; email@site.com)
2011-12-19 18:42:19,545 INFO http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2011-12-19 18:42:20,545 INFO fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=13
2011-12-19 18:42:21,548 INFO fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=13
2011-12-19 18:42:22,550 INFO fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=13
2011-12-19 18:42:23,552 INFO fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=13
2011-12-19 18:42:24,554 INFO fetcher.Fetcher - -activeThreads=10,
spinWaiting=10, fetchQueues.totalSize=13
Thanks!
--Chris
Re: Missing document
Posted by Markus Jelsma <ma...@openindex.io>.
> I ran the nutch readdb and dumped it to a text file, I found the entry
> for one of them:
>
> http://url/Alpha.docx Version: 7
> Status: 3 (db_gone)
> Fetch time: Thu Feb 02 18:42:19 GMT 2012
> Modified time: Thu Jan 01 00:00:00 GMT 1970
> Retries since fetch: 0
> Retry interval: 3888000 seconds (45 days)
> Score: 0.21058823
> Signature: null
> Metadata: _pst_: gone(11), lastModified=0: http://url/Alpha.docx
>
> I guess the problem is that it is "gone" but I really don't know why
> -- the file does exist, and nutch seems able to find/parse it in the
> checker runs.
Hmm, the status db_gone prevents it from being indexed, of course. It is
perfectly possible for the checkers to pass but the fetcher to fail.
There may have been an error, and I remember you using a proxy earlier; that's
likely the problem here too. The checkers don't use proxy configurations.
Check the logs to make sure.
> Wouldn't the URL filter block it at that level? In any
> case, it doesn't match on anything that has a - in the
> regex-urlfilter.xml file, so I don't think it is being filtered out
> there.
That's good. But remember, to pass it _must_ match regex prefixed by a +.
This, however, is not your problem because in that case it wouldn't have ended
up in the CrawlDB at all.
> Is there another thing that I could look at?
>
> The only thing that dumps out errors is the hadoop logs, and there is
> a lot going on there...is there anything in particular that I should
> look for near where it crawls that file?
> I don't see anything
> error-related near it or the other missing files.
Check the fetcher output thoroughly. Grep around. You should find it.
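Grepping for the document's URL and for the fetcher's error line usually turns up the cause of a db_gone; a sketch, assuming the default logs/hadoop.log location of a 1.4 install:

```shell
# Sketch: search the hadoop log around the missing document. The
# "failed with" pattern matches the fetcher's error line format,
# e.g. "fetch of <url> failed with: ...".
LOG=logs/hadoop.log
if [ -f "$LOG" ]; then                    # guard: only run where the log exists
    grep -n -C 5 'Alpha.docx' "$LOG"      # context around every mention
    grep -n 'failed with' "$LOG"          # any fetch errors at all
fi
```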
>
> -- Chris
>
>
>
> On Mon, Dec 19, 2011 at 2:00 PM, Markus Jelsma
>
> <ma...@openindex.io> wrote:
> > Check if it is in your CrawlDB at all. Debug further from that point on.
> > If it is not, then why? Perhaps some URL filter? If it is, did it get an
> > error?
> >
> >> I'm trying to crawl a SharePoint 2010 site, and I'm confused as to why
> >> a document isn't getting added to my Solr Index.
> >>
> >> I can use the parsechecker and indexchecker to verify the link to the
> >> docx file, and they both can get to it and parse it just fine. But
> >> when I use the crawl command, it doesn't appear. What config file
> >> should I be checking? Do those tools use the same settings, or is
> >> there something different about the way they operate?
> >>
> >> Any help would be appreciated!
> >>
> >> -- Chris
Re: Missing document
Posted by Christopher Gross <co...@gmail.com>.
I ran the nutch readdb and dumped it to a text file, I found the entry
for one of them:
http://url/Alpha.docx Version: 7
Status: 3 (db_gone)
Fetch time: Thu Feb 02 18:42:19 GMT 2012
Modified time: Thu Jan 01 00:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 0.21058823
Signature: null
Metadata: _pst_: gone(11), lastModified=0: http://url/Alpha.docx
I guess the problem is that it is "gone" but I really don't know why
-- the file does exist, and nutch seems able to find/parse it in the
checker runs. Wouldn't the URL filter block it at that level? In any
case, it doesn't match on anything that has a - in the
regex-urlfilter.xml file, so I don't think it is being filtered out
there. Is there another thing that I could look at?
The only thing that dumps out errors is the hadoop logs, and there is
a lot going on there...is there anything in particular that I should
look for near where it crawls that file? I don't see anything
error-related near it or the other missing files.
-- Chris
On Mon, Dec 19, 2011 at 2:00 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> Check if it is in your CrawlDB at all. Debug further from that point on. If it
> is not, then why? Perhaps some URL filter? If it is, did it get an error?
>
>> I'm trying to crawl a SharePoint 2010 site, and I'm confused as to why
>> a document isn't getting added to my Solr Index.
>>
>> I can use the parsechecker and indexchecker to verify the link to the
>> docx file, and they both can get to it and parse it just fine. But
>> when I use the crawl command, it doesn't appear. What config file
>> should I be checking? Do those tools use the same settings, or is
>> there something different about the way they operate?
>>
>> Any help would be appreciated!
>>
>> -- Chris
Re: Missing document
Posted by Markus Jelsma <ma...@openindex.io>.
Check if it is in your CrawlDB at all. Debug further from that point on. If it
is not, then why? Perhaps some URL filter? If it is, did it get an error?
> I'm trying to crawl a SharePoint 2010 site, and I'm confused as to why
> a document isn't getting added to my Solr Index.
>
> I can use the parsechecker and indexchecker to verify the link to the
> docx file, and they both can get to it and parse it just fine. But
> when I use the crawl command, it doesn't appear. What config file
> should I be checking? Do those tools use the same settings, or is
> there something different about the way they operate?
>
> Any help would be appreciated!
>
> -- Chris