Posted to user@nutch.apache.org by ajaxtrend <te...@yahoo.com> on 2007/12/17 17:54:11 UTC

URL filter help

Hello Group,
                   I need to index URLs that match a particular URL pattern, and I have added the pattern to crawl-urlfilter.txt. For example, I want to index all URLs of www.text.com that are under the products subdirectory, so my regex is
   
  +^http://www.text.com/products/.*
   
  urls/my.txt contains the following entry:
   
  http://www.text.com, which means I want to start crawling from the main page of www.text.com. However, Nutch does not index anything, and when I run it, it says:
   
  No URLs to fetch - check your seed list and URL filters.
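   
  (For reference, a crawl-urlfilter.txt sketch along these lines might be what is needed. It is only a sketch: it assumes that the filter is also applied to the seed URL at inject time, so the seed home page needs its own allow rule, and that rules are matched top to bottom with the first match winning.)
   
  # allow the product pages that should be indexed
  +^http://www.text.com/products/.*
  # also allow the seed home page itself, so its outlinks can be discovered
  +^http://www.text.com/?$
  # skip everything else
  -.
   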
  I am sure this must have been answered before. I have already searched the archive but was not able to find any suggestions. 
  I would really appreciate it if you could share your suggestions or let me know which classes to look into.
   
  Thanks in advance.
   
  - BR

       

Re: URL filter help

Posted by ajaxtrend <te...@yahoo.com>.
Can anybody help me with this exception? Because of this exception the index gets corrupted.


Re: URL filter help

Posted by ajaxtrend <te...@yahoo.com>.
I have now realized that if the index contains no documents but there are URLs in the DB, then an error is generated while removing duplicates. To get rid of the error, I did a hack in the dedup method of the DeleteDuplicates class:
   
  // Removing duplicates: log and ignore a failed dedup job so the crawl does not abort
  try {
      JobClient.runJob(job);
  } catch (Exception e) {
      LOG.info("Dedup: Error occurred: " + e.getMessage());
  }
   
  This solves my problem.
   
  -BR
  


Re: URL filter help

Posted by ajaxtrend <te...@yahoo.com>.
I tried to add a meta-tag in my customized indexing filter. Based on the URL pattern, I add a meta-tag called 'indexit' with the value true or false. 
   
  In the indexer's reduce() method, I check this meta-tag and decide whether or not to index the document, with something like this at the end of reduce():
   
  String indexIt = parse.getData().getMeta("indexit");
  if (indexIt != null) {
      // Boolean.parseBoolean parses the string value itself;
      // Boolean.getBoolean would look up a JVM system property instead.
      if (!Boolean.parseBoolean(indexIt)) {
          return;
      }
  }
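   
  (A quick standalone check, outside of Nutch, of the difference the comment above points out: Boolean.getBoolean(String) reads a JVM system property named by its argument, while Boolean.parseBoolean(String) parses the string itself. The BooleanCheck class is just a throwaway illustration, not part of Nutch.)
   
  public class BooleanCheck {
      public static void main(String[] args) {
          String indexIt = "true";
          // prints false: there is no system property named "true"
          System.out.println(Boolean.getBoolean(indexIt));
          // prints true: parses the literal string "true"
          System.out.println(Boolean.parseBoolean(indexIt));
      }
  }
   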
   
  This works, in that the document does not get indexed. However, it throws an IOException:
   
  Exception in thread "main" java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
  at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
   
  I understand that DeleteDuplicates tries to remove duplicates of some URLs, and since there are no documents indexed for those URLs, it throws the exception.
   
  Any suggestions on how to handle this gracefully? I mean, is this the right way of controlling whether a document gets indexed?
   
  I would really appreciate your suggestions.
   
  - BR
  
