Posted to java-user@lucene.apache.org by starz10de <fa...@yahoo.com> on 2009/07/26 13:24:26 UTC

Index html sites using IndexHtml

Hi,

I am indexing a set of HTML websites using Lucene (IndexHtml). The indexer
works fine and I can find the indexed terms, but the problem is that this
class (IndexHtml) indexes all the text in an HTML page, even the
advertisements. I am interested only in the body text, not in the
advertisements or the side-link text.

Any suggestions on how to solve this problem? Did I use the class wrongly?





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Index html sites using IndexHtml

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 26, 2009, at 7:24 AM, starz10de wrote:

>
> Hi,
>
> I am indexing a set of HTML websites using Lucene (IndexHtml). The indexer
> works fine and I can find the indexed terms, but the problem is that this
> class (IndexHtml) indexes all the text in an HTML page, even the
> advertisements. I am interested only in the body text, not in the
> advertisements or the side-link text.
>
> Any suggestions on how to solve this problem? Did I use the class wrongly?
>


No, you didn't do anything wrong.  That class has no capability for the
kind of filtering you want (in fact, it's a pretty basic bit of demo
code).  You might look into some of the more robust HTML parsing
libraries out there.
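
To make the idea concrete, here is a minimal sketch of filtering a page
before it reaches the indexer. It is not part of Lucene or its demo code.
BodyTextExtractor, extract() and the noise names "ad" and "sidebar" are
all hypothetical, and the sketch assumes the unwanted blocks are <div>
elements that can be recognised by their class or id. It uses the HTML
parser bundled with the JDK only to stay dependency-free; a dedicated
HTML parsing library would likely cope better with messy real-world
markup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: not part of Lucene or its demo classes.
final class BodyTextExtractor {

    // Returns the page text, leaving out <head> and every <div> whose
    // class or id contains one of noiseNames (an assumed convention).
    static String extract(String html, Set<String> noiseNames) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int skipDepth = 0;      // > 0 while inside a noise <div>
            private boolean inHead = false; // true while inside <head>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.HEAD) {
                    inHead = true;
                } else if (tag == HTML.Tag.DIV
                        && (skipDepth > 0 || isNoise(attrs, noiseNames))) {
                    skipDepth++; // count nested divs so we know when the block ends
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.HEAD) {
                    inHead = false;
                } else if (tag == HTML.Tag.DIV && skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (!inHead && skipDepth == 0) {
                    out.append(data).append(' ');
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }

    // True if any whitespace-separated token of the class or id attribute
    // is listed in noiseNames.
    private static boolean isNoise(MutableAttributeSet attrs, Set<String> noiseNames) {
        Object cls = attrs.getAttribute(HTML.Attribute.CLASS);
        Object id = attrs.getAttribute(HTML.Attribute.ID);
        String joined = (cls == null ? "" : cls.toString()) + " "
                + (id == null ? "" : id.toString());
        for (String token : joined.trim().split("\\s+")) {
            if (noiseNames.contains(token.toLowerCase())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Demo</title></head><body>"
                + "<div class=\"ad\">Buy now</div>"
                + "<div id=\"content\"><p>Lucene is a search library.</p></div>"
                + "</body></html>";
        // Prints only the text that survived the filter.
        System.out.println(extract(page, Set.of("ad", "sidebar")));
    }
}
```

The returned string is what you would put into the indexed contents
field in place of the raw page text. Which class or id names mark
advertisements and side links differs from site to site, so the noise
set has to be chosen per site.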

-Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

