Posted to user@nutch.apache.org by Frumpus <fr...@yahoo.com.INVALID> on 2015/10/29 19:31:01 UTC

Nutch 1.10 won't crawl subdirectories on my site

I'm new to Nutch and Solr, so I've probably just got something configured incorrectly, but I can't find a setting for this in any conf files. I'm testing Nutch 1.10 on a relatively small site and it will crawl any page in the root of the site, but nothing in a subdirectory. So when I look at the core in Solr (5.3.1) and search, I can find a page, www.somesite.com/somepage.php, but none of the pages with URLs like www.somesite.com/somedir/somepage.php are there.

I am using the following command to run the crawl script:

sudo -E bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TestCrawlCore urls/ TestCrawl/ 5

This should take it through 5 iterations, but it only runs one, reports that there are no more URLs to fetch, and exits. There are no errors in the console or the hadoop log.

Result:

Injecting seed URLs
/opt/apache-nutch-1.10/bin/nutch inject TestCrawl//crawldb urls/
Injector: starting at 2015-10-29 09:51:55
Injector: crawlDb: TestCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: URLs merged: 1
Injector: Total new urls injected: 0
Injector: finished at 2015-10-29 09:51:58, elapsed: 00:00:02
Thu Oct 29 09:51:58 CDT 2015 : Iteration 1 of 5
Generating a new segment
/opt/apache-nutch-1.10/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true TestCrawl//crawldb TestCrawl//segments -topN 50000 -numFetchers 1 -noFilter
Generator: starting at 2015-10-29 09:51:58
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

seed.txt:

http://www.somesite.com

(I have also tried adding a trailing '/' but that didn't change anything.)

I have tried all of the following in regex-urlfilter.txt and none seem to work any differently than the others. I have a poor understanding of these filters, though.

+^http://([a-z0-9\]*\.)*www.somesite.com/
+^http://([a-z0-9\-A-Z]*\.)*www.somesite.com/
+^http://([a-z0-9\-A-Z]*\.)*www.somesite.com/([a-z0-9\-A-Z]*\/)*
+^http://([a-z0-9\]*\.)*www.somesite.com/([a-z0-9\]*\/)*

I've gone through the hadoop log extensively just to be sure they didn't get crawled in an earlier run, thinking this may be a problem with indexing in Solr, but it looks like they have just never been crawled and are being ignored. Can someone point me in the right direction here to troubleshoot this thing? I'm out of ideas and googles.
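
A few commands help narrow down where the URLs are being dropped. This is only a sketch, run from the Nutch install directory, reusing the TestCrawl paths and seed URL from the post above (crawldb_dump is just an arbitrary output directory):

# fetch and parse the seed page on its own; the output lists the outlinks Nutch
# extracted, which shows whether the subdirectory links are being discovered at all
bin/nutch parsechecker http://www.somesite.com/

# after a crawl round, inspect what the crawldb actually holds
bin/nutch readdb TestCrawl/crawldb -stats
bin/nutch readdb TestCrawl/crawldb -dump crawldb_dump

# "URLs merged: 1 / Total new urls injected: 0" in the injector output suggests the seed
# was already in the crawldb from an earlier run; remove TestCrawl/ to re-test from scratch
rm -r TestCrawl

If parsechecker lists the subdirectory pages as outlinks but they never appear in the crawldb dump, the URLs are being dropped by the URL filters/normalizers or by robots handling rather than on the Solr side.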

Re: Nutch 1.10 won't crawl subdirectories on my site

Posted by Frumpus <fr...@yahoo.com.INVALID>.
Thank you, this does appear to be the problem. I should have thought of that. 
      From: Robbe Roels <Ro...@knowliah.com>
 To: "user@nutch.apache.org" <us...@nutch.apache.org>; Frumpus <fr...@yahoo.com> 
 Sent: Friday, October 30, 2015 2:41 AM
 Subject: RE: Nutch 1.10 won't crawl subdirectories on my site
   
Have you checked that your robot has rights and is not being denied from crawling the site's subdirectories by robots.txt?

Kind regards

Robbe 




Re: Nutch 1.10 won't crawl subdirectories on my site

Posted by Michael Joyce <jo...@apache.org>.
Sorry that should probably be
+^http://([a-z0-9]*\.)*somesite.com.*


-- Jimmy

On Thu, Oct 29, 2015 at 1:25 PM, Michael Joyce <jo...@apache.org> wrote:

> If you're not getting subdirectories try .* on the end of the filter to
> grab everything and that should get you what you need.
>
> So
> +^http://([a-z0-9\-A-Z]*\.)*www.somesite.com.*
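
One way to see what these patterns actually match is to test them outside Nutch with grep -E (the leading + is Nutch's include marker, not part of the regex); the URL below is the subdirectory example from the original post:

# prints the URL, i.e. the corrected pattern does accept subdirectory pages
echo "http://www.somesite.com/somedir/somepage.php" | grep -E '^http://([a-z0-9]*\.)*somesite.com.*'

Note also that the variants written with [a-z0-9\] are probably not doing anything at all: in the java.util.regex syntax used by the urlfilter-regex plugin, \] is an escaped literal bracket, so the character class is never closed and the pattern fails to compile.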

Re: Nutch 1.10 won't crawl subdirectories on my site

Posted by Frumpus <fr...@yahoo.com.INVALID>.
Tried but got same result. 
      From: Michael Joyce <jo...@apache.org>
 To: 
Cc: user@nutch.apache.org; Frumpus <fr...@yahoo.com> 
 Sent: Thursday, October 29, 2015 3:27 PM
 Subject: Re: Nutch 1.10 won't crawl subdirectories on my site
   
Sorry that should probably be
+^http://([a-z0-9]*\.)*somesite.com.*


-- Jimmy

On Thu, Oct 29, 2015 at 1:25 PM, Michael Joyce <jo...@apache.org> wrote:

> If you're not getting subdirectories try .* on the end of the filter to
> grab everything and that should get you what you need.
>
> So
> +^http://([a-z0-9\-A-Z]*\.)*www.somesite.com.*

RE: Nutch 1.10 won't crawl subdirectories on my site

Posted by Robbe Roels <Ro...@knowliah.com>.
Have you checked that your robot has rights and is not being denied from crawling the site's subdirectories by robots.txt?

Kind regards

Robbe 
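
The follow-up earlier in the thread confirms this was the cause. For illustration only, robots.txt rules along these lines would give exactly the reported behaviour, with pages in the root crawlable and everything under the listed directories blocked (the directory names are placeholders; check the real file first):

# fetch the live robots.txt
curl http://www.somesite.com/robots.txt

# hypothetical rules that would block the subdirectories but not the root pages
User-agent: *
Disallow: /somedir/
Disallow: /otherdir/

Nutch honours robots.txt by default, so URLs blocked this way are skipped quietly rather than reported as errors, which matches the clean console and hadoop log described above.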
