Posted to dev@lucene.apache.org by Aditya <ad...@gmail.com> on 2009/05/02 12:48:37 UTC

Question related to improving search results

Hi,

I'm new to this group.

Question:

Sites like Wikipedia generally use a template that every page follows.
These templates contain words that occur on every page.

For example, the Wikipedia template has a list of languages in the left panel.
These words get indexed for every page, since they are not (and cannot be)
stop words.

If a user searches for "Galego", for example, every Wikipedia page will be in
the search results, which is wrong because not every Wikipedia page talks
about "Galego".

Any suggestions on how to solve this problem?

Best Regards,

Aditya


Re: Question related to improving search results

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
I suppose you're talking about content that is indexed from web  
crawling.  It's a messy problem.  Extraneous junk needs to be filtered  
out and not indexed, so some form of header/footer/sidebar detection  
and exclusion definitely makes searching crawled pages much better.
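
One simple form of such detection can be sketched as follows. This is only an illustration, assuming each crawled page has already been reduced to plain text with one block of text per line; lines that recur on most pages are treated as template text and dropped before indexing. The class name TemplateLineFilter, its methods, and the minFraction threshold are hypothetical and not part of Lucene.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: flags lines that recur across most pages as template text. */
public class TemplateLineFilter {

    /** Returns the trimmed, non-empty lines that occur on at least minFraction of the pages. */
    public static Set<String> findTemplateLines(List<String> pages, double minFraction) {
        Map<String, Integer> pageCounts = new HashMap<>();
        for (String page : pages) {
            // Count each distinct line at most once per page.
            Set<String> seenOnThisPage = new HashSet<>();
            for (String line : page.split("\n")) {
                String key = line.trim();
                if (!key.isEmpty() && seenOnThisPage.add(key)) {
                    pageCounts.merge(key, 1, Integer::sum);
                }
            }
        }
        Set<String> templateLines = new HashSet<>();
        for (Map.Entry<String, Integer> entry : pageCounts.entrySet()) {
            if (entry.getValue() >= minFraction * pages.size()) {
                templateLines.add(entry.getKey());
            }
        }
        return templateLines;
    }

    /** Returns the page text with template lines (and blank lines) removed. */
    public static String stripTemplate(String page, Set<String> templateLines) {
        StringBuilder content = new StringBuilder();
        for (String line : page.split("\n")) {
            String key = line.trim();
            if (!key.isEmpty() && !templateLines.contains(key)) {
                if (content.length() > 0) {
                    content.append('\n');
                }
                content.append(key);
            }
        }
        return content.toString();
    }
}
```

Only the output of stripTemplate would then be handed to the indexer, so a sidebar entry such as "Galego" no longer matches every page, while a page whose body text actually mentions "Galego" still does. A real crawler would likely need something more robust (for example, working on the HTML structure rather than on lines), so treat this as a starting point.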

When possible, index clean content.  In the case of Wikipedia, you can
get full dumps of the article content alone, without the templates.

	Erik

