Posted to user@nutch.apache.org by Mark Stephenson <ms...@us.ibm.com> on 2010/09/30 01:28:41 UTC

Excluding javascript files from indexing and search results.

Hi,

I'm wondering if there's a way to prevent nutch from indexing  
javascript files.  I still would like to fetch and parse javascript  
files to find valuable outlinks, but I don't want them to show up in  
my search results.  Is there a good way to do this?

Thanks a lot,
Mark

RE: Excluding javascript files from indexing and search results.

Posted by "Nemani, Raj" <Ra...@turner.com>.
Sorry, I did not read your requirement completely (that you wanted to
parse the JS files for outlinks).  My bad.

Thanks
Raj


-----Original Message-----
From: Mark Stephenson [mailto:mstephen@us.ibm.com] 
Sent: Thursday, September 30, 2010 4:49 PM
To: user@nutch.apache.org
Subject: Re: Excluding javascript files from indexing and search
results.

Thanks a lot Arkadi.  I implemented the approach you suggested and it  
seems to be doing exactly what I want.

Thanks again,
Mark

On Sep 29, 2010, at 6:35 PM, <Ar...@csiro.au> wrote:

> Hi Mark,
>
> I am not sure; maybe there is a simpler way, but if you want  
> something to be fetched and processed but not indexed, you can write  
> an index filter plugin and return null for documents that you don't  
> want in the index. This is relatively easy to do; just use the  
> index-basic filter as an example.
>
> Regards,
>
> Arkadi
>
>> -----Original Message-----
>> From: Mark Stephenson [mailto:mstephen@us.ibm.com]
>> Sent: Thursday, September 30, 2010 9:29 AM
>> To: user@nutch.apache.org
>> Subject: Excluding javascript files from indexing and search results.
>>
>> Hi,
>>
>> I'm wondering if there's a way to prevent nutch from indexing
>> javascript files.  I still would like to fetch and parse javascript
>> files to find valuable outlinks, but I don't want them to show up in
>> my search results.  Is there a good way to do this?
>>
>> Thanks a lot,
>> Mark


Re: Excluding javascript files from indexing and search results.

Posted by Mark Stephenson <ms...@us.ibm.com>.
Thanks a lot Arkadi.  I implemented the approach you suggested and it  
seems to be doing exactly what I want.

Thanks again,
Mark



RE: Excluding javascript files from indexing and search results.

Posted by Ar...@csiro.au.
Hi Mark,

I am not sure; maybe there is a simpler way, but if you want something to be fetched and processed but not indexed, you can write an index filter plugin and return null for documents that you don't want in the index. This is relatively easy to do; just use the index-basic filter as an example.
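
A minimal sketch of the decision such a filter would make, under stated assumptions. Doc and SkipJavascriptFilter are hypothetical stand-ins so that the example runs on its own; an actual plugin would implement Nutch's IndexingFilter interface and be enabled through the plugin configuration. The rule used here (content type contains "javascript", or the URL path ends in ".js") is also an assumption and may need tuning for a given crawl.

```java
import java.util.Locale;

// Hypothetical stand-in for the document handed to an indexing filter.
final class Doc {
    final String url;
    final String contentType;

    Doc(String url, String contentType) {
        this.url = url;
        this.contentType = contentType;
    }
}

final class SkipJavascriptFilter {

    // Returns the URL without its query string or fragment.
    static String pathOf(String url) {
        int end = url.length();
        int q = url.indexOf('?');
        if (q >= 0) end = Math.min(end, q);
        int h = url.indexOf('#');
        if (h >= 0) end = Math.min(end, h);
        return url.substring(0, end);
    }

    // Returns the document unchanged to keep it, or null to leave it out of the index.
    static Doc filter(Doc doc) {
        String type = doc.contentType == null
                ? ""
                : doc.contentType.toLowerCase(Locale.ROOT);
        String path = pathOf(doc.url).toLowerCase(Locale.ROOT);
        boolean isJavascript = type.contains("javascript") || path.endsWith(".js");
        return isJavascript ? null : doc;
    }

    public static void main(String[] args) {
        Doc page = new Doc("http://example.com/index.html", "text/html");
        Doc script = new Doc("http://example.com/app.js?v=2", "application/x-javascript");
        System.out.println(filter(page) != null);   // true: kept in the index
        System.out.println(filter(script) != null); // false: dropped from the index
    }
}
```

Because the check runs at indexing time, the JavaScript file has already been fetched and parsed by then, so its outlinks are still followed.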

Regards,

Arkadi


RE: Excluding javascript files from indexing and search results.

Posted by "Nemani, Raj" <Ra...@turner.com>.
Look in the crawl URL filter file and/or the regex URL filter file
(conf/crawl-urlfilter.txt, conf/regex-urlfilter.txt).  There is a
section in there that specifies the file extensions that you don't
want to be processed, just like the one below.  Note that the comment
is misleading: some of the extensions can indeed be parsed.  I chose
not to parse them (e.g. pdf, rtf, txt, doc).

# skip image and other suffixes we can't yet parse
-\.(swf|SWF|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|wma|WMA|PSD|psd|dll|DLL|exe|EXE|chm|CHM|db|DB|doc|DOC|pdf|PDF|wpd|WPD)$
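
A rough sketch of how a suffix rule like this behaves, using plain java.util.regex rather than Nutch itself. SuffixRuleDemo and its shortened pattern are hypothetical, and the sketch assumes that a URL is rejected when the pattern is found anywhere in it; the leading '-' in the filter file marks the rule as an exclusion and is not part of the regular expression.

```java
import java.util.regex.Pattern;

final class SuffixRuleDemo {

    // Shortened, illustrative version of the suffix rule; not the full list.
    static final Pattern SKIP = Pattern.compile(
            "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|js|JS|zip|exe|EXE)$");

    // Assumption: a URL passes the filter only if the exclusion pattern is not found in it.
    static boolean accepted(String url) {
        return !SKIP.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accepted("http://example.com/index.html")); // true
        System.out.println(accepted("http://example.com/app.js"));     // false
        // The '$' anchor only matches at the end, so a query string defeats the rule.
        System.out.println(accepted("http://example.com/app.js?v=2")); // true
    }
}
```

A URL rejected by this kind of rule is never fetched, so no outlinks are extracted from it either.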

Hope this helps
