Posted to solr-user@lucene.apache.org by aseem cheema <as...@gmail.com> on 2009/11/10 19:56:33 UTC

HTMLStripCharFilterFactory not working when using SolrJ java client

Hey Guys,
I have the HTMLStripCharFilterFactory char filter declared in my
schema.xml for the fieldType text (code below), and I am using this
field type for the body field of my schema. I am seeing different
behavior when I post a document with SolrJ (code below) and when I use
analysis.jsp. The text I am putting in the field is
<center>content</center>.

When SolrJ is used, the field gets the whole value
<center>content</center>, but when analysis.jsp is used, it shows only
"content" being used for the field.

What am I possibly doing wrong here? How do I get
HTMLStripCharFilterFactory to work when I am pushing data using
SolrJ?

Your help is highly appreciated.
Thanks
-- 
Aseem

############# schema.xml ######################
        <analyzer type="index">
          <charFilter class="solr.HTMLStripCharFilterFactory"/>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.StopFilterFactory"
                  ignoreCase="true"
                  words="stopwords.txt"
                  enablePositionIncrements="true"
                  />
          <filter class="solr.WordDelimiterFilterFactory"
                  generateWordParts="1" generateNumberParts="1"
                  catenateWords="1" catenateNumbers="1" catenateAll="0"
                  splitOnCaseChange="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.SynonymFilterFactory"
                  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.EnglishPorterFilterFactory"
                  protected="protwords.txt"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        </analyzer>

################## SolrJ Code ######################
      CommonsHttpSolrServer server =
          new CommonsHttpSolrServer("http://aseem.desktop.amazon.com:8983/solr/sharepoint");
      SolrInputDocument doc = new SolrInputDocument();
      UpdateRequest req = new UpdateRequest();
      doc.addField("url", "http://haha.com");
      // Leftover from a commented-out block; replaced by the literal below for this test.
      // doc.addField("body", sbr.toString());
      doc.addField("body", "<center>content</center>");
      req.add(doc);
      // Commit right away, without waiting for the flush or for a new searcher.
      req.setAction(ACTION.COMMIT, false, false);
      UpdateResponse resp = req.process(server);
      System.out.println(resp);

Re: HTMLStripCharFilterFactory not working when using SolrJ java client

Posted by aseem cheema <as...@gmail.com>.

I printed the UpdateRequest object (getXML) and the XML is:
<add><doc boost="1.0"><field name="url">http://haha.com</field><field
name="body">&lt;center&gt;content&lt;/center&gt;</field></doc></add>

I can see that the issue is that the HTML/XML angle brackets < and >
are replaced by &lt; and &gt;.
I understand that this is required to keep them from interfering with
the Solr XML document, but how do I accomplish what I want? I need the
HTML in the body field stripped out.
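
For what it's worth, here is a small sketch of what the escaping does on
the wire. It uses only the JDK (no Solr classes), the helper names escape
and fieldValue are made up for illustration, and it assumes the receiving
side reads the update message with a standard XML parser. Under that
assumption the parser turns &lt; and &gt; back into < and >, so the
escaping by itself may not be what keeps the tags in the field.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class EscapeRoundTrip {
    // Hypothetical helper: minimal XML text escaping, similar in effect
    // to what shows up in the getXML output above.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Hypothetical helper: parse an <add><doc><field .../></doc></add>
    // message and return the text of the first field with the given name,
    // or null if there is no such field.
    static String fieldValue(String xml, String name) {
        try {
            Document d = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList fields = d.getElementsByTagName("field");
            for (int i = 0; i < fields.getLength(); i++) {
                Element f = (Element) fields.item(i);
                if (name.equals(f.getAttribute("name"))) {
                    return f.getTextContent(); // entities are already decoded here
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse update XML", e);
        }
    }

    public static void main(String[] args) {
        String body = "<center>content</center>";
        String xml = "<add><doc boost=\"1.0\">"
                + "<field name=\"url\">http://haha.com</field>"
                + "<field name=\"body\">" + escape(body) + "</field>"
                + "</doc></add>";
        System.out.println(xml);                     // escaped form, as sent
        System.out.println(fieldValue(xml, "body")); // value after parsing
    }
}
```

The second println shows the value a parser hands back for the body
field, which is the original string with its tags intact.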

Any help is highly appreciated.
Thanks
Aseem

-- 
Aseem

Re: HTMLStripCharFilterFactory not working when using SolrJ java client

Posted by aseem cheema <as...@gmail.com>.

The HTMLStripCharFilterFactory class has a constructor that accepts
escapedTags. I believe this will solve my problem, but I am not sure
how to pass this from the schema.xml file. I have tried <charFilter
class="solr.HTMLStripCharFilterFactory" escapedTags="&lt;,&gt;"/> but
that didn't work.
Anybody?
Thanks

-- 
Aseem