Posted to user@nutch.apache.org by purpureleaf <pu...@gmail.com> on 2007/08/29 08:37:34 UTC

invisible (not chosen) drop-down list options are included in index

I found that everything between <select> and </select> is included in
Nutch's index, whether or not it is selected.
Since drop-down lists usually contain many options, this reduces the
quality of the index.
For example, a typical registration form lists every country name, or even
all US state names.

In my case, this makes all country names useless as keywords: some
of our pages have a registration form on them, and no matter whether you
search for Italy, Spain, France or Iran, those pages show up.

I checked Nutch's code (parse-html), and this seems to be by design, since
Nutch just takes the parse result from NekoHTML.

In my project, I removed all of them with a custom parser; actually, I
hacked the indexer.
But does Nutch have a general solution to this problem?
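
For illustration only, the kind of filtering described above could look
roughly like the sketch below: walk the parsed DOM and skip every <select>
subtree while collecting text. It uses only the JDK's org.w3c.dom API. The
class and method names (SelectSkippingText, extractText,
extractTextFromXhtml) are hypothetical; this is not code from Nutch,
NekoHTML, or the project mentioned above, and the JDK XML parser stands in
for a real HTML parser, so the sample input has to be well-formed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class SelectSkippingText {

    /** Collects the text under root, ignoring everything inside <select>. */
    public static String extractText(Node root) {
        StringBuilder sb = new StringBuilder();
        collect(root, sb);
        return sb.toString();
    }

    private static void collect(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "select".equalsIgnoreCase(node.getNodeName())) {
            return; // skip the whole drop-down, selected or not
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' '); // separate text runs with a single space
                }
                sb.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, sb);
        }
    }

    /** Hypothetical helper: parses well-formed markup with the JDK XML parser. */
    public static String extractTextFromXhtml(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return extractText(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Register here</p>"
                + "<select name=\"country\"><option>Italy</option>"
                + "<option>Spain</option></select>"
                + "<p>Thanks</p></body></html>";
        System.out.println(extractTextFromXhtml(page)); // prints "Register here Thanks"
    }
}
```

In a real crawl the same traversal would have to be applied to whatever DOM
the HTML parser hands back, before the text reaches the indexer.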

Regards
Pan

-- 
View this message in context: http://www.nabble.com/invisible-%28not-choosed%29-drop-down-list-options-are-included-in-index-tf4345960.html#a12381455
Sent from the Nutch - User mailing list archive at Nabble.com.