
Posted to solr-user@lucene.apache.org by Blargy <zm...@hotmail.com> on 2010/06/10 05:38:00 UTC

Indexing HTML

What is the preferred way to index HTML using DIH (my HTML is stored in a
blob field in our database)?

I know there is the built-in HTMLStripTransformer, but it doesn't seem to
work well with malformed/incomplete HTML. I've created a custom transformer
that first tidies up the HTML using JTidy, and then I pass the result to the
HTMLStripTransformer like so:

<field column="description" name="description" tidy="true"
ignoreErrors="true" propertiesFile="config/tidy.properties"/>
<field column="description" name="description" stripHTML="true"/>

However, this method isn't foolproof, as you can see from my ignoreErrors
option.

I took a quick peek at Tika and noticed that it has its own HtmlParser.
Is this something I should look into? Are there any alternatives that deal
with malformed/incomplete HTML? Thanks
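
For reference, the tidy-then-strip chain described above could be wired up in
the DIH config roughly as follows. This is only a sketch: the class name
com.example.TidyTransformer, the entity name, and the query are placeholders
(the custom transformer's real name is not given in this thread), and the
conversion of the blob column to a string is not shown. DIH applies the
transformers in the order they are listed in the transformer attribute.

```xml
<!-- Sketch only: com.example.TidyTransformer is a hypothetical name for the
     custom JTidy-based transformer; entity name and query are made up. -->
<entity name="item"
        query="SELECT id, description FROM item"
        transformer="com.example.TidyTransformer,HTMLStripTransformer">
  <!-- tidy/ignoreErrors/propertiesFile are read by the custom transformer;
       stripHTML is read by the built-in HTMLStripTransformer. -->
  <field column="description" name="description"
         tidy="true" ignoreErrors="true"
         propertiesFile="config/tidy.properties"
         stripHTML="true"/>
</entity>
```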

-- 
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884497.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Indexing HTML

Posted by Lance Norskog <go...@gmail.com>.

Looking at it again, there appears to be only one HTML stripper. Your
alternative is to use the regex PatternReplace stuff with some custom
patterns. Or make a stopword list of all HTML keywords.
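
A rough sketch of the regex route, assuming a Solr version that ships
solr.PatternReplaceCharFilterFactory (check your release). The type name
text_regex_strip and the pattern are illustrative only; a single regex like
this is crude and will not cope with comments, scripts, or a ">" inside an
attribute value.

```xml
<!-- Sketch only: text_regex_strip is a made-up type name. -->
<fieldType name="text_regex_strip" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Replace anything that looks like a tag with a space before tokenizing. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="&lt;[^&gt;]*&gt;" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```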

On Thu, Jun 10, 2010 at 8:00 AM, Blargy <zm...@hotmail.com> wrote:
>
> Do I even need to tidy/clean up the html if I use the
> HTMLStripCharFilterFactory?

-- 
Lance Norskog
goksron@gmail.com

Re: Indexing HTML

Posted by Blargy <zm...@hotmail.com>.

Do I even need to tidy/clean up the html if I use the
HTMLStripCharFilterFactory?

Re: Indexing HTML

Posted by Blargy <zm...@hotmail.com>.

Wait... do you mean I should try the HTMLStripCharFilterFactory analyzer at
index time?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory 
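
For context, a minimal sketch of what an index-time setup with that char
filter might look like in schema.xml. The type name text_html and the choice
of tokenizer and lowercase filter are assumptions for illustration, not
something taken from this thread.

```xml
<!-- Sketch only: text_html is a made-up type name. -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Char filters run before the tokenizer, so markup is removed
         from the character stream before any tokens are produced. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Queries normally contain no markup, so no char filter here. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the char filter only affects analysis, the stored value of a field
using this type is still whatever text was submitted.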

Re: Indexing HTML

Posted by Blargy <zm...@hotmail.com>.

Does the HTMLStripCharFilterFactory apply at index time or query time? Would
it matter to use one over the other?

As a side question, if I want to perform highlighter summaries against this
field do I need to store the whole field or just index it with
TermVector.WITH_POSITIONS_OFFSETS? 
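
On the side question, as far as I know Solr's highlighter builds snippets
from the stored value, so the field has to be stored; term vectors are an
optional speed-up rather than a replacement for storing. A sketch of such a
field definition follows; the type name text_html is a placeholder for
whatever analyzed type the field actually uses.

```xml
<!-- Sketch only: text_html is a placeholder type name. -->
<field name="description" type="text_html"
       indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

Note that the stored value is the text as submitted, so snippets may still
contain markup unless the HTML is stripped before it reaches Solr (for
example in DIH).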

Re: Indexing HTML

Posted by Lance Norskog <go...@gmail.com>.

The HTMLStripChar variants are newer and might work better.

Re: Indexing HTML

Posted by Ken Krugler <kk...@transpac.com>.

On Jun 9, 2010, at 8:38pm, Blargy wrote:

> I quickly took a peek at Tika and I noticed that it has its own HtmlParser.
> Is this something I should look into? Are there any alternatives that deal
> with malformed/incomplete html? Thanks

Actually, the Tika HtmlParser just wraps TagSoup, which is a good option
for cleaning up busted HTML.

-- Ken

--------------------------------------------
<http://ken-blog.krugler.org>
+1 530-265-2225

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g