Posted to solr-user@lucene.apache.org by aseem cheema <as...@gmail.com> on 2009/11/10 19:56:33 UTC
HTMLStripCharFilterFactory not working when using SolrJ java client
Hey Guys,
I have the HTMLStripCharFilterFactory char filter declared in my
schema.xml for the fieldType text (code below). I am using this field
type for the body field of my schema. I am seeing different behavior
when I use SolrJ to post a document (code below) and when I use
analysis.jsp.
The text I am putting in the field is <center>content</center>.
When SolrJ is used, the field gets the whole value
<center>content</center>, but when analysis.jsp is used, it shows only
"content" being used for the field.
What am I possibly doing wrong here? How do I get
HTMLStripCharFilterFactory to work even when I am pushing data using
SolrJ?
Your help is highly appreciated.
Thanks
--
Aseem
############# schema.xml ######################
<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory"
          ignoreCase="true"
          words="stopwords.txt"
          enablePositionIncrements="true"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0"
          splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SynonymFilterFactory"
          synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.EnglishPorterFilterFactory"
          protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
################## SolrJ Code ######################
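import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

CommonsHttpSolrServer server = new CommonsHttpSolrServer(
    "http://aseem.desktop.amazon.com:8983/solr/sharepoint");
SolrInputDocument doc = new SolrInputDocument();
UpdateRequest req = new UpdateRequest();
doc.addField("url", "http://haha.com");
// doc.addField("body", sbr.toString());
doc.addField("body", "<center>content</center>");  // literal HTML test value
req.add(doc);
// commit with the add; waitFlush=false, waitSearcher=false
req.setAction(UpdateRequest.ACTION.COMMIT, false, false);
UpdateResponse resp = req.process(server);
System.out.println(resp);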
Re: HTMLStripCharFilterFactory not working when using SolrJ java
client
Posted by aseem cheema <as...@gmail.com>.
I printed the UpdateRequest object (getXML) and the XML is:
<add><doc boost="1.0"><field name="url">http://haha.com</field><field
name="body">&lt;center&gt;content&lt;/center&gt;</field></doc></add>
I can see that the issue is that the HTML/XML < and > characters are
replaced by &lt; and &gt;. I understand that this is required to keep
them from interfering with the Solr XML document, but how do I
accomplish what I want? I need to get the HTML in the body field
stripped out.
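For what it is worth, XML escaping in an update message is normally undone by
whatever XML parser reads the message, so the receiving side should see the
literal tags again. Below is a minimal sketch of that round trip. It uses the
JDK's DOM parser as a stand-in (an assumption; it is not the parser Solr uses
internally), and the class name XmlEscapeRoundTrip and the helper fieldValue
are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlEscapeRoundTrip {

    // Hypothetical helper: parse an <add><doc>...</doc></add> message and
    // return the text of the <field> whose name attribute matches, or null.
    static String fieldValue(String xml, String fieldName) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(new InputSource(new StringReader(xml)));
            NodeList fields = dom.getElementsByTagName("field");
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                if (fieldName.equals(field.getAttribute("name"))) {
                    // The parser has already turned &lt; and &gt; back into < and >.
                    return field.getTextContent();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse update XML", e);
        }
    }

    public static void main(String[] args) {
        // Same shape as the getXML() output, with the body value escaped.
        String xml = "<add><doc boost=\"1.0\">"
                + "<field name=\"url\">http://haha.com</field>"
                + "<field name=\"body\">&lt;center&gt;content&lt;/center&gt;</field>"
                + "</doc></add>";
        System.out.println(fieldValue(xml, "body")); // prints <center>content</center>
    }
}
```

If this holds for the server as well, the escaping seen in getXML() is only a
transport detail, and the value handed to the analyzer would still contain the
raw <center> tags.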
Any help is highly appreciated.
Thanks
Aseem
Re: HTMLStripCharFilterFactory not working when using SolrJ java
client
Posted by aseem cheema <as...@gmail.com>.
The HTMLStripCharFilterFactory class has a constructor that accepts
escapedTags. I believe this will solve my problem, but I am not sure
how to pass it from the schema.xml file. I have tried <charFilter
class="solr.HTMLStripCharFilterFactory" escapedTags="<,>"/> but
that didn't work.
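As a stopgap that does not depend on the char filter configuration at all, the
markup could be removed on the client before the value is handed to addField.
The following is only a rough sketch: stripTags is a hypothetical helper, not
part of SolrJ, and a single regex is a much cruder technique than what
HTMLStripCharFilter does (it does not decode entities or handle comments and
script blocks).

```java
import java.util.regex.Pattern;

public class ClientSideStrip {

    // Matches anything that looks like a tag: '<', then any non-'>' chars, then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Hypothetical helper: drop tag-like spans and keep the text between them.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<center>content</center>")); // prints content
    }
}
```

With a helper like this, the SolrJ call would become
doc.addField("body", stripTags("<center>content</center>")), so the value sent
to the server no longer contains tags in the first place.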
Anybody?
Thanks
--
Aseem