Posted to solr-user@lucene.apache.org by Blargy <zm...@hotmail.com> on 2010/06/10 05:38:00 UTC
Indexing HTML
What is the preferred way to index HTML using DIH (my HTML is stored in a
blob field in our database)?
I know there is the built-in HTMLStripTransformer, but it doesn't seem to
work well with malformed/incomplete HTML. I've created a custom transformer
that first tidies up the HTML using JTidy, and then I pass the result to the
HTMLStripTransformer like so:
<field column="description" name="description" tidy="true"
ignoreErrors="true" propertiesFile="config/tidy.properties"/>
<field column="description" name="description" stripHTML="true"/>
However, this method isn't foolproof, as you can see from my ignoreErrors
option.
I took a quick peek at Tika and noticed that it has its own HtmlParser.
Is this something I should look into? Are there any alternatives that deal
with malformed/incomplete HTML? Thanks
--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884497.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Indexing HTML
Posted by Lance Norskog <go...@gmail.com>.
Looking at it again, there appears to be only one HTML stripper. Your
alternative is to use the regex PatternReplace stuff with some custom
patterns. Or make a stopword list of all HTML keywords.
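A minimal sketch of the regex route mentioned above, assuming a Solr version that ships solr.PatternReplaceCharFilterFactory. The field type name "text_html_regex" and the tag pattern are illustrative assumptions, not something from this thread, and a naive pattern like this will not handle comments, script bodies, or entities:

```xml
<!-- schema.xml fragment; "text_html_regex" is a hypothetical name -->
<fieldType name="text_html_regex" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Replace anything that looks like a tag with a space before tokenizing.
         The pattern is <[^>]*> with the angle brackets XML-escaped. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="&lt;[^&gt;]*&gt;" replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the pattern only needs a matching pair of angle brackets, it does not care whether tags are properly nested or closed, but an unterminated tag (a "<" with no ">") would be left in the text.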
On Thu, Jun 10, 2010 at 8:00 AM, Blargy <zm...@hotmail.com> wrote:
>
> Do I even need to tidy/clean up the html if I use the
> HTMLStripCharFilterFactory?
--
Lance Norskog
goksron@gmail.com
Re: Indexing HTML
Posted by Blargy <zm...@hotmail.com>.
Do I even need to tidy/clean up the html if I use the
HTMLStripCharFilterFactory?
Re: Indexing HTML
Posted by Blargy <zm...@hotmail.com>.
Wait... do you mean I should try the HTMLStripCharFilterFactory analyzer at
index time?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
Re: Indexing HTML
Posted by Blargy <zm...@hotmail.com>.
Does the HTMLStripChar apply at index time or query time? Would it matter if
I used one over the other?
As a side question, if I want to perform highlighter summaries against this
field, do I need to store the whole field or just index it with
TermVector.WITH_POSITIONS_OFFSETS?
Re: Indexing HTML
Posted by Lance Norskog <go...@gmail.com>.
The HTMLStripChar variants are newer and might work better.
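For reference, a minimal sketch of a field type that uses the char filter, assuming a schema.xml-based setup. The field type name "text_html" and the choice of tokenizer and lowercase filter are illustrative assumptions:

```xml
<!-- schema.xml fragment; "text_html" is a hypothetical name -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <!-- A single <analyzer> is used at both index and query time.
       Use <analyzer type="index"> and <analyzer type="query">
       if the two chains need to differ. -->
  <analyzer>
    <!-- Strips markup from the character stream before tokenizing -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="description" type="text_html" indexed="true" stored="true"/>
```

Note that char filters only change what gets indexed. The stored value of the field remains the original HTML, so anything that displays the stored content will still see the markup unless it is also stripped upstream (for example in DIH).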
--
Lance Norskog
goksron@gmail.com
Re: Indexing HTML
Posted by Ken Krugler <kk...@transpac.com>.
On Jun 9, 2010, at 8:38pm, Blargy wrote:
> I quickly took a peek at Tika and I noticed that it has its own HtmlParser.
> Is this something I should look into? Are there any alternatives that deal
> with malformed/incomplete html? Thanks
Actually the Tika HtmlParser just wraps TagSoup - that's a good option
for cleaning up busted HTML.
-- Ken