Posted to solr-user@lucene.apache.org by Paul deGrandis <pa...@gmail.com> on 2008/02/22 18:19:49 UTC

Indexing content, storing html

Hi all,

I'm working on a Solr app that pulls HTML from an embedded JavaScript
WYSIWYG editor, and I need to index the content, but store and
reproduce the HTML.  The problem I have is that when I try to add and
commit, the HTML gets interpreted as XML.  Is the proper way to do
this to create an HTMLTokenFilterFactory?  And if so, is there a
collection of plugins (like filters and such) that someone can point
me to?

Regards,
Paul

Re: Indexing content, storing html

Posted by Paul deGrandis <pa...@gmail.com>.

Thanks, this is perfect for what I'm trying to do.

Paul

Re: Indexing content, storing html

Posted by Reece <li...@gmail.com>.

Well, I don't remember the specific name of it; I just wrote that
because it sounded close :)

There is a list of them here though:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

-Reece



Re: Indexing content, storing html

Posted by Paul deGrandis <pa...@gmail.com>.

Thanks!

Does Solr include an HTMLTokenFilterFactory?

Paul

Re: Indexing content, storing html

Posted by Reece <li...@gmail.com>.

I did this as well, but found problems when searching (tags in between
words caused searching nightmares).  I recommend stripping out all the
tags using the HTMLTokenFilterFactory or your own regex when indexing,
and storing the actual HTML in an actual database.
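
One way to do the stripping on the client side, before the document is
sent to Solr, is sketched below.  It assumes Python's standard-library
html.parser rather than a regex; strip_tags and _TextExtractor are
hypothetical names used for illustration and are not part of Solr.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards tags (illustrative helper, not a Solr API)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self._chunks.append(data)

    def text(self):
        # Join with spaces so words on either side of a tag do not fuse,
        # then collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the plain text of an HTML fragment, suitable for an indexed field."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The stripped text would go into the indexed field, with the original
HTML kept elsewhere.  Joining chunks with a space keeps words on either
side of a tag apart, at the cost of splitting a word that has a tag in
the middle of it.  Text inside <script> or <style> elements is not
filtered out by this sketch.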

If you really want to store the HTML though, you can wrap it in a
CDATA section in the XML like this:

<?xml version="1.0" encoding="UTF-8" ?>
<add>
    <doc>
        <field name="id">123</field>
        <field name="title"><![CDATA[yourbightmlstring]]></field>
    </doc>
</add>

The CDATA section tells the XML parser to treat everything between its
delimiters as literal text for the field value.  It only breaks if your
HTML string contains "]]>", which ends the CDATA section early.
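
The "]]>" case can be handled by splitting the string across two CDATA
sections wherever that sequence occurs.  The following is a minimal
sketch under that assumption, again in Python with only the standard
library; to_cdata and build_add_xml are hypothetical helper names, and
the field names are the ones from the example above.

```python
from xml.sax.saxutils import escape


def to_cdata(text):
    """Wrap text in CDATA, splitting any embedded "]]>" across two sections."""
    # "]]>" becomes "]]" + end of section + start of a new section + ">",
    # so the parser never sees the terminator inside the data itself.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def build_add_xml(doc_id, html):
    """Build an <add> message with the HTML carried verbatim in the title field."""
    return (
        "<add><doc>"
        '<field name="id">' + escape(str(doc_id)) + "</field>"
        '<field name="title">' + to_cdata(html) + "</field>"
        "</doc></add>"
    )
```

An XML parser joins the adjacent CDATA sections back into one string, so
the field value comes out identical to the original HTML.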

-Reece
