Posted to user@lenya.apache.org by Jann Forrer <ja...@id.unizh.ch> on 2004/01/20 15:00:39 UTC
crawl and index
Hi
As described in the Lenya documentation, I crawled and indexed a
website, which worked really well. I noticed that PDF and HTML can be
crawled. However, we are also interested in indexing PHP files. I started
a little test and slightly modified these two files:
org/apache/lenya/search/crawler/IterativeHTMLCrawler.java
org/apache/lenya/lucene/index/AbstractIndexer.java
That is, I only added the php extension, then crawled and indexed a simple
site with two PHP files, and that worked. However, I don't know whether the
current classes can also be used to crawl and index PHP files in general?
Jann
---------------------------------------------------------------
Jann Forrer
Informatikdienste
Universität Zürich
Winterthurerstr. 190
CH-8057 Zuerich
oooO mail: jann.forrer@id.unizh.ch
( ) phone: +41 1 63 56772
\ ( fax: +41 1 63 54505
\_) http://www.id.unizh.ch
---------------------------------------------------------------------
To unsubscribe, e-mail: lenya-user-unsubscribe@cocoon.apache.org
For additional commands, e-mail: lenya-user-help@cocoon.apache.org
Re: crawl and index
Posted by Michael Wechner <mi...@wyona.com>.
Jann Forrer wrote:
>Hi
>
>As described in the Lenya documentation, I crawled and indexed a
>website, which worked really well. I noticed that PDF and HTML can be
>crawled. However, we are also interested in indexing PHP files. I started
>a little test and slightly modified these two files:
>
> org/apache/lenya/search/crawler/IterativeHTMLCrawler.java
>
It would make sense to make the crawler configurable, so that a specific
parser can be added for each MIME type or suffix.
In most cases the HTML parser could be used.
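
To illustrate the idea, here is a minimal sketch of a suffix-to-parser
lookup. It is not part of Lenya: the class SuffixParserRegistry, its
methods, and the parser names "html-parser" and "pdf-parser" are all
hypothetical, and a real crawler would probably also check the MIME type
reported by the server.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lenya: picks a parser name by URL suffix.
final class SuffixParserRegistry {
    private final Map<String, String> parserBySuffix = new HashMap<>();

    // Register a parser name for a suffix such as "php" (without the dot).
    void register(String suffix, String parserName) {
        parserBySuffix.put(suffix.toLowerCase(Locale.ROOT), parserName);
    }

    // Returns the parser name for the URL, or null if the suffix is unknown.
    String parserFor(String url) {
        String path = url;
        int cut = path.indexOf('?');            // drop the query string
        if (cut >= 0) path = path.substring(0, cut);
        cut = path.indexOf('#');                // drop the fragment
        if (cut >= 0) path = path.substring(0, cut);
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot <= slash) return null;          // last path segment has no suffix
        String suffix = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return parserBySuffix.get(suffix);
    }
}
```

With such a registry, supporting PHP pages would only require registering
"php" with the same parser as "html", instead of editing the crawler source.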
> org/apache/lenya/lucene/index/AbstractIndexer.java
>
>That is, I only added the php extension, then crawled and indexed a simple
>site with two PHP files, and that worked. However, I don't know whether the
>current classes can also be used to crawl and index PHP files in general?
>
As long as the dumped PHP files are plain text or HTML, it will be no problem.
The DefaultIndexer could be made configurable, just like the
"ConfigurableIndexer".
You might want to take a look at
src/webapp/lenya/pubs/oscom/config/search/lucene-cmfsMatrix.xconf
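
On the indexing side, a configurable indexer could decide which dumped
files to index from a list of extensions rather than from hard-coded
values. The following is only a sketch under that assumption: the class
ExtensionFileFilter is hypothetical and does not reflect how
AbstractIndexer or the ConfigurableIndexer are actually implemented.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical filter, not Lenya code: accepts files by configurable extension.
final class ExtensionFileFilter implements FileFilter {
    private final Set<String> extensions = new HashSet<>();

    // Extensions are given without the dot, e.g. "html", "pdf", "php".
    ExtensionFileFilter(String... allowed) {
        for (String e : allowed) {
            extensions.add(e.toLowerCase(Locale.ROOT));
        }
    }

    @Override
    public boolean accept(File file) {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        // dot > 0 skips names without an extension and hidden files like ".php"
        return dot > 0
            && extensions.contains(name.substring(dot + 1).toLowerCase(Locale.ROOT));
    }
}
```

Such a filter could be passed to File.listFiles(FileFilter) when walking
the dump directory, with the extension list read from a configuration file.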
HTH
Michi