Posted to solr-user@lucene.apache.org by mike topper <mt...@riseup.net> on 2007/03/06 16:03:19 UTC
problem with solr.HTMLStripWhitespaceTokenizerFactory
I'm trying to use the html stripping factory in order to strip html tags
from my description field when indexing.
I added this fieldtype:
<fieldtype name="text_html" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldtype>
And then in my schema I have this:
<field name="description" type="text_html" indexed="true" stored="true"/>
When inserting, it seems like nothing happens, i.e. when I do a query,
here is the response for a test description:
<str name="description">
<br>hi<br>my<br>name<br>is<br>topper<br>and this <b> blahblah</b> is a <b>test</b>
</str>
Any ideas?
-Mike
Re: Time after snapshot is "visible" on the slave
Posted by galo <ga...@last.fm>.
Yep, the snapinstaller was failing and it was the same problem as Jeff
posted this morning about bin/optimize, but this time with bin/commit,
not using ${webapp_name}.
I fixed that and it worked normally. I've submitted a bug to JIRA as I
think Jeff didn't submit it yet.
Mm now I see your other email.. oh well..
Thanks for your help,
Graham Stead wrote:
> Hi Galo,
>
> The snapinstaller actually performs a commit as its last step, so if that
> didn't work, it's not surprising that running commit separately didn't work,
> either.
>
> I would suggest running the snapinstaller and/or commit scripts with the -V
> option. This will produce verbose debugging information and allow you to see
> where they encounter problems.
>
> Hope this helps,
> -Graham
>
>
>
>
--
Galo Navarro, Developer
galo@last.fm
t. +44 (0)20 7780 7080
Last.fm | http://www.last.fm
Karen House 1-11 Baches Street
London N1 6DL
http://www.last.fm/user/galeote
RE: Time after snapshot is "visible" on the slave
Posted by Graham Stead <gs...@ieee.org>.
I forgot to mention that the admin page (solr/admin/stats.jsp) is an
excellent way to see when the last searcher was opened. After running
commit, you should see the openedAt and registeredAt timestamps update,
e.g.:
openedAt : Tue Mar 06 08:14:19 PST 2007
registeredAt : Tue Mar 06 08:15:55 PST 2007
If you have added documents, you'll see numDocs and/or maxDoc change as well.
If you don't see these update then something isn't right. If you see them
update but cannot find your documents in the index, then your indexing
process may not be working correctly.
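As a rough sketch of the check described above, the openedAt value can be compared before and after a commit. The helper name opened_at is hypothetical, the "after" timestamp is made up for illustration, and the input is assumed to be plain "openedAt : ..." lines like the ones shown above; the real stats page markup may differ and would need its own extraction step.

```shell
#!/bin/sh
# Hypothetical helper: print the value following "openedAt :" from text on
# stdin. Assumes plain "name : value" lines, as in the example above.
opened_at() {
    sed -n 's/^[[:space:]]*openedAt[[:space:]]*:[[:space:]]*//p' | head -n 1
}

# "before" is the timestamp quoted in this message; "after" is invented.
before='openedAt : Tue Mar 06 08:14:19 PST 2007'
after='openedAt : Tue Mar 06 08:20:02 PST 2007'

old=$(printf '%s\n' "$before" | opened_at)
new=$(printf '%s\n' "$after" | opened_at)

if [ "$old" != "$new" ]; then
    echo "new searcher opened"
else
    echo "searcher unchanged: the commit may not have taken effect"
fi
```

In practice the two inputs would be saved copies of the stats page taken before and after running the commit script.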
Hope this helps,
-Graham
PS: If you are running replication with multiple solr instances, your
problem may be caused by a simple bug in the commit, optimize, and
readercycle scripts. Replace the /solr/ in the curl statement with
${webapp_name}:
From:
rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s -d "<commit/>"`
To:
rs=`curl http://${solr_hostname}:${solr_port}/${webapp_name}/update -s -d "<commit/>"`
I haven't had time to commit these bug fixes yet.
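The substitution above can be illustrated with a small helper that builds the update URL from the same three variables, so the webapp name is never hard-coded. This is only a sketch: the function name update_url and the example values are assumptions, not part of the shipped scripts.

```shell
#!/bin/sh
# Hypothetical helper: build the update URL from host, port, and webapp name
# instead of hard-coding "/solr/" in the path.
update_url() {
    # $1 = host, $2 = port, $3 = webapp name
    printf 'http://%s:%s/%s/update\n' "$1" "$2" "$3"
}

# Example values (assumptions, not taken from a real deployment).
solr_hostname=localhost
solr_port=8983
webapp_name=mysolr

url=$(update_url "$solr_hostname" "$solr_port" "$webapp_name")
echo "$url"

# The scripts would then post the commit along these lines (not run here):
#   rs=`curl "$url" -s -d "<commit/>"`
```

With a webapp deployed under a name other than "solr", the hard-coded path would point at the wrong URL, which matches the symptom of commits silently having no effect.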
RE: Time after snapshot is "visible" on the slave
Posted by Graham Stead <gs...@ieee.org>.
Hi Galo,
The snapinstaller actually performs a commit as its last step, so if that
didn't work, it's not surprising that running commit separately didn't work,
either.
I would suggest running the snapinstaller and/or commit scripts with the -V
option. This will produce verbose debugging information and allow you to see
where they encounter problems.
Hope this helps,
-Graham
Time after snapshot is "visible" on the slave
Posted by galo <ga...@last.fm>.
Hi,
I've been testing index replication and after snappulling and installing
the latest version of the master index, if I run a query on the slave I
don't get any results back (tried a commit in despair, which didn't work
either). If I restart the web server (tomcat) then it works.
Am I missing any steps or just being too impatient sending queries?
Cheers
--
Galo Navarro, Developer
galo@last.fm
t. +44 (0)20 7780 7080
Last.fm | http://www.last.fm
Karen House 1-11 Baches Street
London N1 6DL
http://www.last.fm/user/galeote
Re: problem with solr.HTMLStripWhitespaceTokenizerFactory
Posted by Yonik Seeley <yo...@apache.org>.
On 3/6/07, mike topper <mt...@riseup.net> wrote:
> when inserting it it seems like nothing happens ie when i do a query
> here is the response for a test description:
>
> <str name="description">
>
> <br>hi<br>my<br>name<br>is<br>topper<br>and this <b> blahblah</b> is a <b>test</b>
>
> </str>
The tag stripping happens during the analysis phase, and affects what
gets indexed.
For returned field values, you get what you put in.
-Yonik
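The distinction between the stored value and the indexed tokens can be illustrated with a rough shell approximation. This is not Solr's implementation: the function index_tokens is hypothetical and uses a naive regular expression in place of the real HTML stripping, followed by whitespace splitting, and it skips the later filters in the analyzer chain.

```shell
#!/bin/sh
# Hypothetical stand-in for "strip HTML, then split on whitespace".
# A naive tag regex is used purely for illustration.
index_tokens() {
    sed 's/<[^>]*>/ /g' | awk '{ for (i = 1; i <= NF; i++) print $i }'
}

# The test description from the original question.
stored='<br>hi<br>my<br>name<br>is<br>topper<br>and this <b> blahblah</b> is a <b>test</b>'

echo "stored value (what a query response returns):"
printf '%s\n' "$stored"

echo "indexed tokens (what searches are matched against):"
printf '%s\n' "$stored" | index_tokens
```

The stored string keeps its tags, while the token list contains only words. To return tag-free text in responses, the markup would have to be removed before the document is sent for indexing.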