Posted to user@nutch.apache.org by jay jiang <jj...@bbn.com> on 2006/03/01 22:00:13 UTC
Re: not indexing path names

Jake,

That's exactly the case. I have a workaround by using the "file://" 
protocol since all data is in our intranet.

Ideally, and it should not be hard to do, the index-basic plugin would be 
allowed not just to add more fields to a document, but also to nullify 
the document (i.e. not index it).
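
To make the idea concrete, here is a rough sketch of the kind of check I 
have in mind.  It is plain Java, not actual Nutch code: the class 
IndexUrlFilterSketch and the method shouldIndex are made-up names, and the 
"+regex" / "-regex" rule syntax is only borrowed from 
crawl-urlfilter.txt.  I am also assuming that a URL matching no rule 
still gets indexed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Nutch: decides whether an already
// fetched URL should be indexed, using "+regex" / "-regex" rule lines.
class IndexUrlFilterSketch {

    private static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(boolean include, Pattern pattern) {
            this.include = include;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    IndexUrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            // Skip blank lines and comments.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException(
                    "Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+',
                               Pattern.compile(trimmed.substring(1))));
        }
    }

    // The first rule whose pattern is found in the URL decides.
    boolean shouldIndex(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.include;
            }
        }
        return true; // assumption: index anything no rule mentions
    }

    public static void main(String[] args) {
        IndexUrlFilterSketch filter = new IndexUrlFilterSketch(
            Arrays.asList("# skip directory listings", "-/$", "+."));
        // Directory listing: crawled for its links, but not indexed.
        System.out.println(filter.shouldIndex(
            "http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/08/"));
        // Actual document: indexed.
        System.out.println(filter.shouldIndex(
            "http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/08/TWN-2005-08-06.html"));
    }
}
```

With a rule like "-/$", the directory URLs would still be fetched and 
their links followed, but an indexing plugin that consulted such a check 
could drop them before they reach the index.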

--Jay Jiang
 
Vanderdray, Jacob wrote:

>Jay,
>
>	Sorry, I didn't understand what you were trying to do.  I think
>I get it now.  You've got directory listing turned on and you're using
>that to list out the content of the site, but you don't want the
>directory listings returned as search results.  Does that sound right?
>
>	I don't know of any search filters that would do quite what
>you're looking to do.  If you control the site, you might be able to
>switch from using directory listings for your content to using actual
>html pages.  At that point you could add robot meta tags on those pages
>to follow, but not index them.
>
>Jake.
>
>-----Original Message-----
>From: jay jiang [mailto:jjiang@bbn.com] 
>Sent: Friday, February 17, 2006 11:31 AM
>To: nutch-user@lucene.apache.org
>Subject: Re: not indexing path names
>
>Thanks, Jake.  This does not work.  I guess I did not describe my 
>problem clearly.  I'll try again.
>
>My startup url is:   http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/
>
>And here are some of the entries in the crawl log:
>
>060217 105511 fetching
>http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/08/
>060217 105511 fetching
>http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/10/
>060217 105511 fetching
>http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/12/
>...
>060217 105519 fetching 
>http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/08/TWN-2005-08-06.html
>060217 105519 fetching 
>http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/10/TWN-2005-03-27.html
>060217 105519 fetching 
>http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/12/TWN-2005-12-03.html
>
>If my search query is "nerds", it will bring up those three path names 
>as individual results as well.  For example:
>
>*Index of /meta-data/0/4/The Word Nerds/2005/08 
><http://pod-master-001.bbn.com/meta-data/pods/pod9/0/4/The%20Word%20Nerds/2005/08/>*
>
>* ... *meta-data/0/4/The Word *Nerds*/2005/08 Index of /meta-data/0/4/* 
>... *
>
>So my question is how I can filter those path names out of the result 
>list.  I think there should be an option somewhere in the configuration 
>file that lets us NOT index certain files based on the url pattern.  I 
>know we have similar options in crawl-urlfilter.txt.  But in my case 
>these directories do need to be crawled.  However, the directory name 
>should not be indexed as a single document.  It's more like we'd have a 
>file called index-urlfilter.txt.
>
>Thanks,
>--Jay 
>
>
>Vanderdray, Jacob wrote:
>
>  
>
>>Jay,
>>
>>	The url field is handled by the query-basic filter.  There is a
>>setting inside conf/nutch-default.xml that controls the weighting
>>(boost) for that field.  You can reduce the influence of this field by
>>putting a new value in your conf/nutch-site.xml file.  You may even be
>>able to completely nullify it by setting the value to 0.0.  I've pasted
>>what I think you'd need to put in nutch-site.xml below.  I haven't
>>tested this.  Let me know how it goes if you give it a try.
>>
>>Thanks,
>>Jake.
>>
>><property>
>> <name>query.url.boost</name>
>> <value>0.0</value>
>> <description> Used as a boost for url field in Lucene query.
>> </description>
>></property>
>>
>>-----Original Message-----
>>From: jay jiang [mailto:jjiang@bbn.com] 
>>Sent: Thursday, February 16, 2006 2:08 PM
>>To: nutch-user@lucene.apache.org
>>Subject: not indexing path names
>>
>>I am crawling an intranet.  Apparently Nutch also indexes the url path 
>>names (as documents) as it crawls.  So if a query word appears in the 
>>path name, the entire url path name would be one result.  Since this 
>>kind of info would typically be of no value to users, I want to filter 
>>it out.
>>
>>I think we have to crawl them since we need to get the actual document 
>>urls underneath the path.  But we do not want to index them.  Is there 
>>any way to configure Nutch not to index path names during the crawling 
>>step?  If not, can we configure it in the search step?  I know we can 
>>always filter it using getDetails().  But that does not seem very clean.
>>
>>Thanks,
>>--Jay