Posted to user@nutch.apache.org by ajaxtrend <te...@yahoo.com> on 2007/12/17 17:54:11 UTC
URL filter help
Hello Group,
I need to index URLs that match a particular pattern, and I have added the pattern to crawl-urlfilter.txt. For example, I want to index all URLs on www.text.com that are under the products subdirectory, so my regex is
+^http://www.text.com/products/.*
urls/my.txt contains the following entry:
http://www.text.com
That means I want to start crawling from the main page of www.text.com. However, Nutch does not index anything, and when I run it, it says:
No URLs to fetch - check your seed list and URL filters.
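A quick standalone check of the seed URL against the filter regex (an editor's sketch, not part of the original mail; the class name is illustrative):

```java
import java.util.regex.Pattern;

// Sketch: test whether the seed URL passes the crawl-urlfilter.txt pattern.
// The pattern is the one quoted above, minus the leading '+'
// (which is Nutch's accept marker, not regex syntax).
public class SeedFilterCheck {
    public static void main(String[] args) {
        Pattern filter = Pattern.compile("^http://www.text.com/products/.*");
        String seed = "http://www.text.com";
        // The bare homepage does not contain /products/, so it fails the filter
        System.out.println(filter.matcher(seed).find()); // prints false
        // A page under /products/ passes
        System.out.println(filter.matcher("http://www.text.com/products/item1").find()); // prints true
    }
}
```

A seed that fails the only accept rule leaves the generator with nothing to fetch, which is consistent with the "No URLs to fetch" message above.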
I am sure this must have been answered before; I have already searched the archive but could not find a suggestion.
I would really appreciate your suggestions, or pointers to the classes I should look into.
Thanks in advance.
- BR
Re: URL filter help
Posted by ajaxtrend <te...@yahoo.com>.
Can anybody help me with this exception? Because of it, the index gets corrupted.
Re: URL filter help
Posted by ajaxtrend <te...@yahoo.com>.
Now I realized that if the index contains no documents but there are URLs in the DB, it generates an error while removing duplicates. To get rid of the error, I did a hack in the dedup() method of the DeleteDuplicates class:
// Removing duplicates; swallow a failure so the crawl can continue
try {
    JobClient.runJob(job);
} catch (Exception e) {
    LOG.info("Dedup: Error occurred: " + e.getMessage());
}
This solves my problem.
-BR
Re: URL filter help
Posted by ajaxtrend <te...@yahoo.com>.
I tried to add a meta-tag in my customized indexing filter. Based on the URL pattern, I add a meta-tag called 'indexit' with the value true or false.
In the indexer's reduce() method, I check this meta-tag and decide whether to index the document. At the end of reduce() I do something like this:
// Skip indexing when the 'indexit' meta-tag is present and not "true"
String indexIt = parse.getData().getMeta("indexit");
if (indexIt != null && !Boolean.parseBoolean(indexIt)) {
    // parseBoolean parses the string itself; getBoolean would
    // look up a JVM system property of that name instead
    return;
}
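One detail worth noting here: Boolean.getBoolean(String) looks up a JVM system property of that name rather than parsing its argument; Boolean.parseBoolean is the string parser. A quick standalone check (editor's sketch, not from the original mail):

```java
// Sketch contrasting the two java.lang.Boolean lookups.
public class BooleanLookup {
    public static void main(String[] args) {
        // Parses the argument string directly
        System.out.println(Boolean.parseBoolean("true")); // prints true
        // Reads the system property named "true", which is unset here
        System.out.println(Boolean.getBoolean("true")); // prints false
    }
}
```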
This works, as the document does not get indexed. However, it gives an IOException:
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
I understand that DeleteDuplicates tries to remove duplicates of some URLs, and since there are no documents indexed for those URLs, it throws the exception.
Any suggestion on how to run this gracefully? I mean, is this the right way of controlling whether a document gets indexed?
I would really appreciate your suggestions.
- BR