Posted to user@nutch.apache.org by jay jiang <jj...@bbn.com> on 2006/03/01 22:00:13 UTC
Re: not indexing path names
Jake,
That's exactly the case. I have a workaround by using the "file://"
protocol since all data is in our intranet.
Ideally, and it should not be hard to do, the index-basic plugin would
be allowed not just to add more fields to a document, but also to
nullify the document (i.e. not index it).
--Jay Jiang
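The index-time filter idea discussed in this thread can be sketched
roughly as follows. This is a standalone Python illustration, not a Nutch
plugin and not Nutch's API: the rule text, parse_rules() and
should_index() are all hypothetical names, and the "+"/"-" regex syntax
only imitates the style of crawl-urlfilter.txt. The fallback for URLs
that match no rule (index them) is an assumption of this sketch.

```python
import re

# Hypothetical "index-urlfilter.txt" content, written in the style of
# crawl-urlfilter.txt: "-" means do not index, "+" means index.
INDEX_RULES = """
# skip directory listings (URLs that end in a slash)
-/$
# index everything else
+.
"""


def parse_rules(text):
    """Turn rule lines into (allow, compiled_regex) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def should_index(url, rules):
    """First matching rule decides; unmatched URLs are indexed (assumption)."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return True


rules = parse_rules(INDEX_RULES)
for url in (
    "http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/08/",
    "http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/08/TWN-2005-08-06.html",
):
    print(url, "->", "index" if should_index(url, rules) else "skip")
```

With rules like these the directory URL would still be fetched and its
links followed by the crawler; only the decision to add it to the index
changes.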
Vanderdray, Jacob wrote:
>Jay,
>
> Sorry, I didn't understand what you were trying to do. I think
>I get it now. You've got directory listing turned on and you're using
>that to list out the content of the site, but you don't want the
>directory listings returned as search results. Does that sound right?
>
> I don't know of any search filters that would do quite what
>you're looking to do. If you control the site, you might be able to
>switch from using directory listings for your content to using actual
>html pages. At that point you could add robots meta tags to those
>pages so that their links are followed but the pages are not indexed.
>
>Jake.
>
>-----Original Message-----
>From: jay jiang [mailto:jjiang@bbn.com]
>Sent: Friday, February 17, 2006 11:31 AM
>To: nutch-user@lucene.apache.org
>Subject: Re: not indexing path names
>
>Thanks, Jake. This does not work. I guess I did not describe my
>problem clearly. I'll try again.
>
>My startup url is: http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/
>
>And here are some of the entries in the crawl log:
>
>060217 105511 fetching
>http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/08/
>060217 105511 fetching
>http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/10/
>060217 105511 fetching
>http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/12/
>...
>060217 105519 fetching
>http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/08/TWN-2005-08-06.html
>060217 105519 fetching
>http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/10/TWN-2005-03-27.html
>060217 105519 fetching
>http://xxx/meta-data/0/4/The%20Word%20Nerds/2005/12/TWN-2005-12-03.html
>
>If my search query is "nerds", it will bring up those three path names
>as individual results as well. For example:
>
>*Index of /meta-data/0/4/The Word Nerds/2005/08
><http://pod-master-001.bbn.com/meta-data/pods/pod9/0/4/The%20Word%20Nerds/2005/08/>*
>
>* ... *meta-data/0/4/The Word *Nerds*/2005/08 Index of /meta-data/0/4/*
>... *
>
>So my question is how I can filter those path names out of the result
>list. I think there should be an option somewhere in the configuration
>file to allow NOT indexing certain files based on the url pattern. I
>know we have similar options in crawl-urlfilter.txt. But in my case
>these directories do need to be crawled. However, the directory name
>should not be indexed as a single document. It's more like we'd have a
>file called index-urlfilter.txt.
>
>Thanks,
>--Jay
>
>
>Vanderdray, Jacob wrote:
>
>>Jay,
>>
>> The url field is handled by the query-basic filter. There is a
>>setting inside conf/nutch-default.xml that controls the weighting
>>(boost) for that field. You can reduce the influence of this field by
>>putting a new value in your conf/nutch-site.xml file. You may even be
>>able to completely nullify it by setting the value to 0.0. I've pasted
>>what I think you'd need to put in nutch-site.xml below. I haven't
>>tested this. Let me know how it goes if you give it a try.
>>
>>Thanks,
>>Jake.
>>
>><property>
>> <name>query.url.boost</name>
>> <value>0.0</value>
>> <description> Used as a boost for url field in Lucene query.
>> </description>
>></property>
>>
>>-----Original Message-----
>>From: jay jiang [mailto:jjiang@bbn.com]
>>Sent: Thursday, February 16, 2006 2:08 PM
>>To: nutch-user@lucene.apache.org
>>Subject: not indexing path names
>>
>>I am crawling an intranet. Apparently Nutch also indexes the url path
>>names (as a document) as it crawls. So if a query word appears in the
>>path name, the entire url path name would be one result. Since this
>>kind of info would typically be of no value to users, I want to filter
>>them out.
>>
>>I think we have to crawl them since we need to get the actual document
>>urls underneath the path. But we do not want to index them. Is there
>>any way to configure Nutch not to index path names during the crawling
>>step? If not, can we configure it in the search step? I know we can
>>always filter them using getDetails(). But that does not seem a very
>>clean way.
>>
>>Thanks,
>>--Jay
>>