Posted to user@lenya.apache.org by Åsmund Tharaldsen <as...@netcom-gsm.no> on 2004/09/28 16:05:47 UTC
Excluding the sitetree in the search
Hi!
We are using Lucene for search. When we run a search, the results also
include the menu (data from sitetree.xml), so we get more hits than
necessary.
Since the crawler uses sitetree.xml, we can't exclude it before crawling.
So we are trying to solve this by using
"org.apache.lenya.lucene.index.ConfigurableIndexer" instead of the
DefaultIndexer. Our crawler produces HTML files. We are trying to filter
when generating the index by reading only the title and the content, where
the content we want starts at <td id="content">.
Our "lucene-live.xconf" looks like this:
<?xml version="1.0"?>
<lucene>
  <update-index type="new"/>
  <index-dir src="../../work/search/lucene/index/live/index"/>
  <htdocs-dump-dir src="../../work/search/lucene/htdocs_dump/live"/>
  <indexer class="org.apache.lenya.lucene.index.ConfigurableIndexer">
    <configuration src="cmfs-luceneDoc.xconf"/>
    <extensions src="html"/>
  </indexer>
</lucene>
And our "cmfs-luceneDoc.xconf" looks like this:
<?xml version="1.0"?>
<luc:document xmlns:luc="http://apache.org/cocoon/lenya/lucene/1.0">
  <luc:field name="title" type="Text" xpath="/html/head/title"/>
  <luc:field name="contents" type="Text"
             xpath="/html/body/table/td[@id='content']"/>
</luc:document>
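The XPath expressions in a field configuration like this can be checked
outside Lenya before re-indexing. The following is only a minimal sketch
using the standard Java XPath API. It assumes the dumped pages are
well-formed XHTML without a default namespace. The class name XPathCheck,
the method extract, and the sample page are hypothetical and not part of
Lenya.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Lenya: evaluates an XPath expression
// against a well-formed XHTML string and returns the matched text.
public class XPathCheck {

    static String extract(String xhtml, String xpath) throws Exception {
        // Parse the page into a DOM tree (assumes well-formed markup).
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        // String value of the first matching node, or "" if nothing matches.
        return XPathFactory.newInstance().newXPath().evaluate(xpath, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page with a menu cell and a content cell.
        String page = "<html><head><title>Demo</title></head><body>"
                + "<table><tr><td id=\"menu\">Home News</td>"
                + "<td id=\"content\">Body text</td></tr></table>"
                + "</body></html>";

        System.out.println(extract(page, "/html/head/title"));
        System.out.println(extract(page, "/html/body/table/tr/td[@id='content']"));
    }
}
```

In this sample the cell sits inside a <tr>, so the path has to name every
element on the way down. A path that skips an intermediate element matches
nothing and yields an empty string, so the expression should mirror the
actual nesting of the crawled pages.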
Has anybody tried anything like this?
Is there a better way to solve it?
Regards,
Åsmund