Posted to user@nutch.apache.org by "Lukas, Ray" <Ra...@idearc.com> on 2009/02/25 21:39:56 UTC

Does not locate my urls or filter problem.

Invalid indexes are generated {newbie question}

Please help if you can. I am trying to get Nutch to work from Java. I
wish to crawl a web page, generate Lucene indexes, and then use the
NutchBean to query them. I located an example in the Nutch distribution
and have it working, or so I thought.
I am executing org.apache.nutch.crawl.Crawl. The code seems to run
fine, but it does not seem to use my URLs.
I get the following directory:
C:\EclipseWorkspaces\Nutch\crawl-20090225143955\crawldb\current\part-00000
This directory contains the files .data.crc, .index.crc, data, and
index, all about 1 MB. I was doing a very shallow crawl, but even so...
Excited, I then attempted to open them in Luke, but I am not able to:
"There is no valid Lucene index in this directory".
The output ends with:
2009-02-25 15:08:56,899 WARN  crawl.Generator
(Generator.java:generate(425)) - Generator: 0 records selected for
fetching, exiting ...
2009-02-25 15:10:50,670 INFO  crawl.Crawl (Crawl.java:main(144)) -
Stopping at depth=0 - no more URLs to fetch.
2009-02-25 15:11:20,280 WARN  crawl.Crawl (Crawl.java:main(161)) - No
URLs to fetch - check your seed list and URL filters.

My crawlURLFilter.txt file contains: 
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*apache.org/
This is my filter, right?

And I have a directory, urlsDir, containing one file that holds the
string "http://ant.apache.org/" followed by a blank line. This is my
seed list, right?

I know Nutch is reading my urlsDir file: if I remove the http:// prefix,
it complains about an unknown protocol.

I am running v0.9. I know it is something small, but I just don't see
it.
Thanks in advance.
ray
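
A quick sanity check of that filter line against that seed, using only
the JDK (the class name here is illustrative, not from the thread):

    import java.util.regex.Pattern;

    public class FilterRegexCheck {
        public static void main(String[] args) {
            // The accept rule from crawlURLFilter.txt, minus the leading '+'
            Pattern accept = Pattern.compile("^http://([a-z0-9]*\\.)*apache.org/");
            String seed = "http://ant.apache.org/";
            // find() mirrors the filter's anchored prefix match; prints true,
            // so the pattern itself does accept the seed URL
            System.out.println(accept.matcher(seed).find());
        }
    }

The pattern accepts http://ant.apache.org/, so the rule itself is fine;
if the Generator still selects nothing, the question becomes which
filter file Nutch is actually consulting (see the urlfilter.regex.file
discussion later in the thread).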


Re: Does not locate my urls or filter problem.

Posted by Bartosz Gadzimski <ba...@o2.pl>.
Lukas,

I meant that it happened in both the 0.9 and trunk versions of Nutch, in a
default installation (compiled with no changes to the source code)

Thanks,
Bartosz



RE: Does not locate my urls or filter problem.

Posted by "Lukas, Ray" <Ra...@idearc.com>.
I get two interesting things from the Generator. First,
"!readers[0].next(new FloatWritable())" at line 424 evaluates to true, forcing the
"Generator: 0 records selected for fetching, exiting ..." message.

Prior to that we got the
LOG.info("Generator: jobtracker is 'local', generating exactly one partition.");
message, which means I am not running in a distributed environment, correct? I am indeed running locally.

I am thinking my config is so messed up that I should push the reset button and reload Nutch 0.9. Hey guys, let me do that and see if it helps instead of dragging us all down a rat hole.

I will let you know what happens. Bailing out of this burning jet and trading it in for a new one. Learned a bunch; time to take that to a new, clean environment.

Thanks, guys.

ray
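
Rather than stepping through the Generator, it may be quicker to look at
what actually landed in the crawldb after the inject step. Nutch 0.9 has
a crawldb reader for this (bin/nutch readdb <crawldb> -stats); the sketch
below reads the MapFile's data file directly instead. A minimal sketch,
assuming the Nutch 0.9-era API; the path is the one from the first
message, and the key type is assumed to be Text:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.util.NutchConfiguration;

    public class CrawlDbPeek {
        public static void main(String[] args) throws Exception {
            Configuration conf = NutchConfiguration.create();
            FileSystem fs = FileSystem.get(conf);
            // The "data" half of a Hadoop MapFile is a plain SequenceFile
            Path part = new Path(
                "crawl-20090225143955/crawldb/current/part-00000/data");
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
            Text url = new Text();               // assumed key type in 0.9
            CrawlDatum datum = new CrawlDatum();
            int n = 0;
            while (reader.next(url, datum)) {    // false once the db is exhausted
                System.out.println(url + "\t" + datum);
                n++;
            }
            reader.close();
            System.out.println(n + " records");
        }
    }

Zero records printed would mean the injector filtered everything out.
This also explains the Luke error in the first message:
crawldb/current/part-00000 is a Hadoop MapFile, not a Lucene index, so
Luke rightly refuses it; the Lucene index a successful crawl produces
lives under crawl-*/indexes.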





RE: Does not locate my urls or filter problem.

Posted by "Lukas, Ray" <Ra...@idearc.com>.
No no, this is not stupid; your idea is not stupid. There are places in other systems where I have seen this. And... ah, yeah, I did try that. If I recall, constructing a jar file requires a blank line after the driver path spec in the manifest file, does it not? So your thinking is not stupid; clever, actually. Ha ha. Oh man. Thanks. I am going to mess around with this idea a little more. Hey, thanks for helping me out on this.

ray 



Re: Does not locate my urls or filter problem.

Posted by Bartosz Gadzimski <ba...@o2.pl>.
Hello,

It might sound stupid, but try adding a few spaces and a few new lines in
your myURLS.txt (this has happened a few times on different computers,
both Linux and Windows).

Thanks,
Bartosz
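
One JDK-only way to check what the seed file actually ends with, since
an invisible missing newline is exactly the kind of thing this
suggestion is about (the path is the one from the thread; the class name
is illustrative):

    import java.io.FileInputStream;

    public class SeedFileBytes {
        public static void main(String[] args) throws Exception {
            FileInputStream in = new FileInputStream("urlsDir/myURLS.txt");
            int b;
            while ((b = in.read()) != -1) {
                // dump raw bytes; a final 10 (LF) means the file
                // ends with a newline
                System.out.print(b + " ");
            }
            in.close();
            System.out.println();
        }
    }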



RE: Does not locate my urls or filter problem.

Posted by "Lukas, Ray" <Ra...@idearc.com>.
Thanks for your help. I have neither of these properties defined in my
nutch-site.xml file, which is minimal (I am including it below, by the
way). If I muck up the URL in my URL "shopping list", Nutch sees and
reports the bad URL.

My "crawl-urlfilter.txt" (the default setting, I believe) now contains
"+^http://([a-z0-9]*\.)*(yahoo.com|cnn.com|amazon.com|msn.com|google.com)/".

Crawl.java contains (I disabled the command-line params and hardcoded them for now):
    Path rootUrlDir = new Path("urlsDir");     // points to my shopping list
    Path dir = new Path("crawl-" + getDate()); // this is where my Lucene index files will go
    int threads = job.getInt("fetcher.threads.fetch", 10);
    int depth = 5;                             // just do something for me
    int topN = Integer.MAX_VALUE;

The urlsDir contains one file called myURLS.txt, and in that file I have
"http://www.msn.com/" followed by a blank line.

That URL is seen, because if I mess it up by saying simply www.msn.com I
get an unknown protocol exception. So Nutch does go there and find what
I have there; it just does not use it, and I don't know why. Nothing
pops out at me.

Thanks, ray



-----Original Message-----
From: Koch Martina [mailto:Koch@huberverlag.de] 
Sent: Thursday, February 26, 2009 1:11 AM
To: nutch-user@lucene.apache.org
Subject: AW: Does not locate my urls or filter problem.

Please check your nutch-site.xml. If the property "urlfilter.regex.file" there points to a different file than your "crawl-urlfilter.txt", that setting takes precedence.
You can also disable the urlfilter-regex plugin by removing it from the "plugin.includes" property of nutch-site.xml and check whether your crawl starts fetching URLs.

Kind regards,
Martina


My nutch-site.xml file (the one mentioned above):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
	<name>http.agent.name</name>
	<value>ideacrMediaInc</value>
	<description>ideacrMediaInc</description>
</property>

<property>
	<name>http.agent.description</name>
	<value>Testing out Nutch</value>
	<description>Testing out Nutch</description>
</property>

<property>
	<name>http.agent.url</name>
	<value>www.idearc.com</value>
	<description>www.idearc.com</description>
</property>

<property>
	<name>http.agent.email</name>
	<value>ray dot lukas at idearc dot com</value>
	<description>ray dot lukas at idearc dot com</description>
</property>

<property>
  <name>plugin.folders</name>
  <value>/plugins</value>
  <description />
</property>

<property>
  <name>searcher.dir</name>
  <value>/crawl.test</value>
  <description />
</property>

</configuration>
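
One detail worth flagging in this file (an observation, not something
confirmed later in the thread): plugin.folders and searcher.dir are
absolute paths. In nutch-default.xml, plugin.folders is the relative
path "plugins", resolved against the classpath, so "/plugins" may well
mean that no plugins at all (urlfilter-regex, protocol-http, the
parsers) are found when running inside Eclipse. A sketch of the stock
setting for comparison:

<property>
  <name>plugin.folders</name>
  <value>plugins</value>
  <description>Relative path resolved against the classpath; compare the
  absolute "/plugins" above.</description>
</property>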




RE: AW: Does not locate my urls or filter problem.

Posted by "Lukas, Ray" <Ra...@idearc.com>.
You are correct... hmm. In there I have what I believe are the default
settings:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept anything else
+.

So... this is good then, right? I think so: accept any URL mentioned in
the location I pass into Crawl.java via the rootUrlDir variable. So I
think I am good, although you did correctly point out a flaw in my
understanding, thanks. Now this gets more interesting: Nutch just seems
to hate my URLs :) But why? Ha ha. Oh man. Sigh. Sorry, this should be
easy; I am missing something. The failure is the generator.generate()
method returning a null segment. Debugging that now. Any help gratefully
appreciated. Thanks.
ray
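
Since the URLs are disappearing before the Generator ever selects them,
one way to test the whole configured filter chain exactly as Nutch
applies it is the URLFilters wrapper; a minimal sketch against the
Nutch 0.9 API (the class name is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilters;
    import org.apache.nutch.util.NutchConfiguration;

    public class FilterChainCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = NutchConfiguration.create();
            // Runs a URL through every urlfilter plugin enabled
            // in plugin.includes
            URLFilters filters = new URLFilters(conf);
            String url = "http://www.msn.com/";
            // filter() returns the URL if accepted, or null if
            // some filter rejects it
            System.out.println(url + " -> " + filters.filter(url));
        }
    }

If this prints null while the rules above look fine, the filter is
reading a different rules file than the one being edited, which is where
the next message points.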



Re: AW: Does not locate my urls or filter problem.

Posted by "Eric J. Christeson" <Er...@ndsu.edu>.
Koch Martina wrote:
> Please check your nutch-site.xml. If the property "urlfilter.regex.file" there points to a different file than your "crawl-urlfilter.txt", that setting takes precedence.
> You can also disable the urlfilter-regex plugin by removing it from the "plugin.includes" property of nutch-site.xml and check whether your crawl starts fetching URLs.
> 
> Kind regards,
> Martina

In nutch-0.9, nutch-default.xml has urlfilter.regex.file set to
regex-urlfilter.txt

Thanks,
Eric
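
So if the filter rules being edited live in crawl-urlfilter.txt but the
plugin is reading regex-urlfilter.txt, one hedged fix when driving the
crawl from Java is an explicit override in nutch-site.xml (a sketch, not
the thread's confirmed resolution):

<property>
  <name>urlfilter.regex.file</name>
  <value>crawl-urlfilter.txt</value>
  <description>Point the urlfilter-regex plugin at the crawl filter file
  explicitly.</description>
</property>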


AW: Does not locate my urls or filter problem.

Posted by Koch Martina <Ko...@huberverlag.de>.
Please check your nutch-site.xml. If the property "urlfilter.regex.file" there points to a different file than your "crawl-urlfilter.txt", that setting takes precedence.
You can also disable the urlfilter-regex plugin by removing it from the "plugin.includes" property of nutch-site.xml and check whether your crawl starts fetching URLs.

Kind regards,
Martina
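
For the second suggestion, disabling the filter means deleting
urlfilter-regex from the plugin.includes value in nutch-site.xml. A
sketch of what that could look like; the exact default list varies by
release, so copy the value from your own nutch-default.xml and remove
just the urlfilter-regex entry:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  <description>Plugin list with urlfilter-regex removed, for testing
  only.</description>
</property>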


