Posted to user@lenya.apache.org by Åsmund Tharaldsen <as...@netcom-gsm.no> on 2004/09/28 16:05:47 UTC

Excluding the sitetree from the search

Hi!
We are using Lucene for our search. The search results also include the menu
(data from sitetree.xml), so we get more hits than we should.

Since the crawler uses sitetree.xml, we can't exclude it before crawling.
We are therefore trying to solve this by using
"org.apache.lenya.lucene.index.ConfigurableIndexer" instead of the
DefaultIndexer. Our crawler produces HTML files. When generating the index,
we try to filter by reading only the title and the content, where the content
we want starts at <td id="content">.
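
To make the intended filtering concrete, here is a minimal Python sketch of
the idea. It is not Lenya or Lucene code; the sample page, the class name
ContentExtractor and the function extract are hypothetical and only
illustrate which text should reach the index and which should not.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect the <title> text and the text inside <td id="content">."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.content_parts = []
        self._in_title = False
        self._td_depth = 0  # > 0 while we are inside the content cell

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "td":
            if self._td_depth > 0:
                self._td_depth += 1  # a nested cell inside the content cell
            elif dict(attrs).get("id") == "content":
                self._td_depth = 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "td" and self._td_depth > 0:
            self._td_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._td_depth > 0:
            self.content_parts.append(data)


def extract(html):
    """Return (title, content) with whitespace collapsed to single spaces."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    title = " ".join(" ".join(parser.title_parts).split())
    content = " ".join(" ".join(parser.content_parts).split())
    return title, content


# Hypothetical crawled page: a menu cell next to the content cell.
SAMPLE = """<html><head><title>Prices</title></head><body>
<table><tr>
<td id="menu"><a href="/a">Home</a> <a href="/b">Products</a></td>
<td id="content"><h1>Price list</h1><p>Valid from October.</p></td>
</tr></table></body></html>"""

title, content = extract(SAMPLE)
```

With this sample, only the title and the text of the content cell are
returned; the menu links "Home" and "Products" are left out.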

Our "lucene-live.xconf" looks like this:

<?xml version="1.0"?>
<lucene>
	<update-index type="new"/>
	<index-dir src="../../work/search/lucene/index/live/index"/>
	<htdocs-dump-dir src="../../work/search/lucene/htdocs_dump/live"/>
	<indexer class="org.apache.lenya.lucene.index.ConfigurableIndexer">
		<configuration src="cmfs-luceneDoc.xconf"/>
		<extensions src="html"/>
	</indexer>
</lucene>

And our "cmfs-luceneDoc.xconf" looks like this:

<?xml version="1.0"?>
<luc:document xmlns:luc="http://apache.org/cocoon/lenya/lucene/1.0">
	<luc:field name="title" type="Text" xpath="/html/head/title"/>
	<luc:field name="contents" type="Text" xpath="/html/body/table/td[@id='content']"/>
</luc:document>
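
One thing that may be worth checking in the "contents" field: in XPath, each
"/" step matches direct children only, so /html/body/table/td[...] finds
nothing if the cell sits inside a <tr>, as HTML table cells normally do.
Whether that applies here depends on the crawled HTML and on how
ConfigurableIndexer evaluates the path, neither of which is verified in this
sketch. The example uses Python's xml.etree.ElementTree, which supports only
a subset of XPath, so the paths are written relative to the <html> root
element; the sample page is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one crawled page.
PAGE = """<html><head><title>Prices</title></head><body>
<table><tr>
<td id="menu">Home Products</td>
<td id="content">Price list</td>
</tr></table></body></html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Counterpart of /html/body/table/td[@id='content']:
# td must be a direct child of table.
direct = root.find("body/table/td[@id='content']")

# Counterpart of /html/body/table/tr/td[@id='content']:
# one extra step for the table row.
via_row = root.find("body/table/tr/td[@id='content']")

# Counterpart of //td[@id='content']: match the cell at any depth.
anywhere = root.find(".//td[@id='content']")

print("direct child of table:", direct)
print("via tr:", None if via_row is None else via_row.text)
print("any depth:", None if anywhere is None else anywhere.text)
```

For this sample the first lookup finds no element, while the other two find
the content cell. If the real pages have further wrappers (for example a
<tbody>), the any-depth form is the more forgiving choice.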

Has anybody tried anything like this?
Is there a better way to solve this?

Regards,
Åsmund

