Posted to user@nutch.apache.org by Arun Kumar Sharma <sh...@yahoo.co.in> on 2006/04/24 09:53:01 UTC

unable to filter different file format like .java,.jar,.class with nutch version 0.7.2

Hi,
I am crawling a filesystem with Nutch 0.7.2 on Windows. I have enabled the parse plugins for text and HTML.
To my surprise, the search results include files with extensions such as .java, .class, .jar, .dll and so on.
I can add these to the ignore list in regex-urlfilter.txt, but that is not a solution. Since there are a number of file formats, I can't add each of them to the ignore list.
An alternative would be for Nutch to fetch and show results only for parsable documents.
Can anybody help me in this regard?
   


    Regards, 
Arun Sharma (Tech Lead-Java/J2EE ) 
  www.voltix.com, www.voltixindia.com
  SCO 13-15, Sector 34A
  Chandigarh




				

Re: unable to filter different file format like .java,.jar,.class with nutch version 0.7.2

Posted by TDLN <di...@gmail.com>.
> Since there are a number of file formats, I can't add each of them to the ignore list.

Why not? You can add something like

-\.(java|class|jar|dll)$

etc.
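
To see how such a rule behaves, here is a minimal Python sketch that imitates the filter. It is not Nutch code. It assumes what the comments in the stock regex-urlfilter.txt describe: each line is "+" or "-" followed by a regex, the first pattern that matches anywhere in the URL decides, and a URL that matches nothing is ignored. The helper names (parse_rules, accepts) and the sample file: URLs are made up for illustration. Note that the alternatives inside the group carry no leading dot, because the "\." in front of the group already matches it. The second rule set is a guess at the "only parsable documents" idea: accept directories and the extensions the enabled parsers handle, and reject everything else.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments.

    Any line that does not start with '+' is treated as a reject rule
    (a simplification for this sketch).
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is ignored."""
    for allow, pattern in rules:
        if pattern.search(url):  # partial match, assumed to mirror the filter
            return allow
    return False

# Deny-list style: skip the listed extensions, accept everything else.
DENY_LIST = r"""
-\.(java|class|jar|dll)$
+.
"""

# Allow-list style (hypothetical): keep directories so the crawl can descend,
# keep text and HTML files, and reject everything else.
ALLOW_LIST = r"""
+/$
+\.(txt|html|htm)$
-.
"""

if __name__ == "__main__":
    for name, text in (("deny-list", DENY_LIST), ("allow-list", ALLOW_LIST)):
        rules = parse_rules(text)
        for url in ("file:/C:/src/Foo.java", "file:/C:/docs/readme.txt"):
            print(name, url, accepts(rules, url))
```

With the allow-list variant there is no need to enumerate every unwanted format; only the formats that can be parsed are listed.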

Rgrds, Thomas


