You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Sebastian Ho <se...@bii.a-star.edu.sg> on 2004/05/13 03:25:42 UTC

clean up html before indexing or add tags to ignore list

Hi

This is a typical web crawler, indexing and search application
development. I have wrote my crawler and planning to add lucene in next.
One questions pop to my mind, in terms of performance, do i clean up the
html removing all tags before indexing, or i add all tags into the
ignore list during indexing/search stage. 

Which is better?

Thanks

Sebastian Ho


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: clean up html before indexing or add tags to ignore list

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Lynx rulez!

	<http://www.blogscene.org/erik?q=lynx>


On May 13, 2004, at 12:05 PM, Andy Goodell wrote:

> If you are running linux, i recommend before indexing with lucene, you
> use the program lynx with the option -dump which dumps the formatted
> text without the tags, and runs really really fast in most cases.
>
> - andy g
>
> On Thu, 13 May 2004 03:46:37 -0700 (PDT), Otis Gospodnetic
> <ot...@yahoo.com> wrote:
>>
>> Clean up seems cleaner.  Just extract the textual information from 
>> HTML
>> using NekoHTML or JTidy or HTMLParser (.sf.net) or some such.
>>
>> You can also get fancy and preserve the 'structural' information (e.g.
>> H1 text is more important that H2, which is more important than BODY,
>> which is more important that DIV, etc.) and combine it with field
>> boosting at index time.
>>
>> Otis
>>
>>
>>
>> --- Sebastian Ho <se...@bii.a-star.edu.sg> wrote:
>>> Hi
>>>
>>> This is a typical web crawler, indexing and search application
>>> development. I have wrote my crawler and planning to add lucene in
>>> next.
>>> One questions pop to my mind, in terms of performance, do i clean up
>>> the
>>> html removing all tags before indexing, or i add all tags into the
>>> ignore list during indexing/search stage.
>>>
>>> Which is better?
>>>
>>> Thanks
>>>
>>> Sebastian Ho
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: clean up html before indexing or add tags to ignore list

Posted by Andy Goodell <go...@gmail.com>.
If you are running linux, i recommend before indexing with lucene, you
use the program lynx with the option -dump which dumps the formatted
text without the tags, and runs really really fast in most cases.

- andy g

On Thu, 13 May 2004 03:46:37 -0700 (PDT), Otis Gospodnetic
<ot...@yahoo.com> wrote:
> 
> Clean up seems cleaner.  Just extract the textual information from HTML
> using NekoHTML or JTidy or HTMLParser (.sf.net) or some such.
> 
> You can also get fancy and preserve the 'structural' information (e.g.
> H1 text is more important that H2, which is more important than BODY,
> which is more important that DIV, etc.) and combine it with field
> boosting at index time.
> 
> Otis
> 
> 
> 
> --- Sebastian Ho <se...@bii.a-star.edu.sg> wrote:
> > Hi
> >
> > This is a typical web crawler, indexing and search application
> > development. I have wrote my crawler and planning to add lucene in
> > next.
> > One questions pop to my mind, in terms of performance, do i clean up
> > the
> > html removing all tags before indexing, or i add all tags into the
> > ignore list during indexing/search stage.
> >
> > Which is better?
> >
> > Thanks
> >
> > Sebastian Ho
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
>

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: clean up html before indexing or add tags to ignore list

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Clean up seems cleaner.  Just extract the textual information from HTML
using NekoHTML or JTidy or HTMLParser (.sf.net) or some such.

You can also get fancy and preserve the 'structural' information (e.g.
H1 text is more important that H2, which is more important than BODY,
which is more important that DIV, etc.) and combine it with field
boosting at index time.

Otis

--- Sebastian Ho <se...@bii.a-star.edu.sg> wrote:
> Hi
> 
> This is a typical web crawler, indexing and search application
> development. I have wrote my crawler and planning to add lucene in
> next.
> One questions pop to my mind, in terms of performance, do i clean up
> the
> html removing all tags before indexing, or i add all tags into the
> ignore list during indexing/search stage. 
> 
> Which is better?
> 
> Thanks
> 
> Sebastian Ho
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org