Posted to dev@nutch.apache.org by Douglas Brunner <he...@gmail.com> on 2006/05/07 18:10:22 UTC

Feature idea - Indexing Text Lengths

Sorry I can't give more than an idea. I'm not a Java developer, but I think the idea could prove useful.

I'm not completely sure how the spider works while indexing, but I've noticed that a site like w3schools.com lists a lot of keywords in its side menus. So if I index just that one site (which many people may use Nutch for) and search for "asp", I get a lot of pages that have little to do with ASP but get high placement because the keyword is repeated over and over in the side menu.

The idea is to set a minimum length for the pieces of text that get entered into the index. After parsing a page, any words that don't make up what appears to be a complete sentence get ignored.

Hopefully I can properly explain what I'm thinking with this example:

A typical webpage may look like this.
----------------------------
<table>
	<tr>
		<td><a href='manual/'>php manual</a></td>
		<td><a href='functions/'>php functions</a></td>
		<td><a href='arrays/'>php arrays</a></td> 
		<td><a href='variables/'>php variables</a></td>
		<td><a href='modules/'>php modules</a></td>
	</tr>
</table>
<table>
	<tr>
		<td>This page gives detailed information about how to compile php with aspell capabilities. First, you need a computer,......</td>
	</tr>
</table>
----------------------------

Once the HTML is stripped (but not the line breaks), it may look something like this:

----------------------------


php manual
php functions
php arrays
php variables
php modules




This page gives detailed information about how to compile php with aspell capabilities. First, you need a computer,......


----------------------------
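As a rough illustration of that stripping step, here is a minimal sketch in Python (not Nutch's Java, and not how Nutch's own HTML parser works). It uses the standard library's html.parser and treats a hand-picked set of block-level tags as line boundaries; the names BLOCK_TAGS, BlockTextExtractor and strip_html are made up for this example. Unlike the listing above, it drops the empty lines rather than keeping them.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags that end a "line" of text; adjust as needed.
BLOCK_TAGS = {"p", "div", "td", "th", "tr", "li", "br", "table", "h1", "h2", "h3"}


class BlockTextExtractor(HTMLParser):
    """Collects the text of each block-level element as one line."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def flush(self):
        # Join buffered text, collapse whitespace, and keep non-empty lines.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        # Inline tags such as <a> do not flush, so their text stays on one line.
        self._buf.append(data)


def strip_html(html):
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up any trailing text after the last tag
    return parser.blocks


page = """<table>
<tr>
<td><a href='manual/'>php manual</a></td>
<td><a href='functions/'>php functions</a></td>
</tr>
</table>
<table><tr><td>This page gives detailed information about how to compile php with aspell capabilities.</td></tr></table>"""

blocks = strip_html(page)
for line in blocks:
    print(line)
```

Each menu link ends up on its own short line, while the body text stays together as one long line.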

So, there are a lot of leftover words from the side-column menus. Since they're no more than two words long, I would love to be able to ignore them, since I don't believe they're always related to the content of the page. Being able to configure the threshold at 3 words, 5 words, 20 words, etc. could help increase relevancy, since users will be visiting that page to read the content, not the side menu.
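To make the proposed setting concrete, here is a minimal sketch in Python (again, not Nutch code). The function name filter_short_lines and the min_words parameter are hypothetical; they only stand in for the configurable word threshold described above, and "word" here simply means a whitespace-separated token.

```python
def filter_short_lines(lines, min_words=3):
    """Keep only the lines that contain at least min_words words."""
    return [line for line in lines if len(line.split()) >= min_words]


stripped = [
    "php manual",
    "php functions",
    "php arrays",
    "php variables",
    "php modules",
    "This page gives detailed information about how to compile php with aspell capabilities.",
]

kept = filter_short_lines(stripped, min_words=3)
for line in kept:
    print(line)
```

With a threshold of 3, the two-word menu entries are dropped and only the content sentence is kept. The same rule would also drop any short heading or title, so the threshold is a trade-off rather than a clean fix.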




Re: Feature idea - Indexing Text Lengths

Posted by Jérôme Charron <je...@gmail.com>.
> Sorry I can't give more than an idea. I'm not a Java developer, but I think
> the idea could prove useful.
> The idea is to limit the length of sentences that get entered into the
> index. So, after parsing a page, any words that don't make what appears to
> be a complete sentence get ignored.

Douglas,

Here is a previous discussion about this subject on the list:
http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg03070.html
Take a look at that thread; this problem is not so easy.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/