Posted to user@lenya.apache.org by Jann Forrer <ja...@id.unizh.ch> on 2004/01/20 15:00:39 UTC

crawl and index

Hi

As described in the Lenya documentation, I crawled and indexed a
website, which works really well. I noticed that PDF and HTML can be
crawled. However, we are also interested in indexing PHP files. I started
a small test and slightly modified these two files:

 org/apache/lenya/search/crawler/IterativeHTMLCrawler.java
 org/apache/lenya/lucene/index/AbstractIndexer.java

That is, I only added the php extension and then crawled and indexed a
simple site with two PHP files, and that worked. However, I am not sure:
can the current classes also be used to crawl and index PHP files in general?

Jann


---------------------------------------------------------------
Jann Forrer
Informatikdienste
Universität Zürich
Winterthurerstr. 190
CH-8057 Zuerich

oooO   mail:  jann.forrer@id.unizh.ch
(  )   phone: +41 1 63 56772
 \ (   fax:   +41 1 63 54505
  \_)  http://www.id.unizh.ch

---------------------------------------------------------------------
To unsubscribe, e-mail: lenya-user-unsubscribe@cocoon.apache.org
For additional commands, e-mail: lenya-user-help@cocoon.apache.org


Re: crawl and index

Posted by Michael Wechner <mi...@wyona.com>.
Jann Forrer wrote:

>Hi
>
>As described in the Lenya documentation, I crawled and indexed a
>website, which works really well. I noticed that PDF and HTML can be
>crawled. However, we are also interested in indexing PHP files. I started
>a small test and slightly modified these two files:
>
> org/apache/lenya/search/crawler/IterativeHTMLCrawler.java
>

It would make sense to make the crawler configurable, so that a specific
parser can be registered for each MIME type or file suffix.
In most cases the HTML parser could be used.
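
A minimal sketch of what such a suffix-to-parser configuration could look
like. This is not existing Lenya code: the class ParserRegistry, the enum
ParserType and both method names are hypothetical and only illustrate the
idea of registering the HTML parser for the php suffix.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; none of these names are part of the Lenya API.
class ParserRegistry {

    /** Hypothetical parser kinds; a real crawler would map to parser classes. */
    enum ParserType { HTML, PDF, NONE }

    private final Map<String, ParserType> bySuffix = new HashMap<>();

    /** Registers a parser for a file suffix such as "php" (case-insensitive). */
    void register(String suffix, ParserType type) {
        bySuffix.put(suffix.toLowerCase(Locale.ROOT), type);
    }

    /** Returns the parser registered for the suffix of the given URL path. */
    ParserType lookup(String urlPath) {
        // Cut off a query string or fragment before looking at the suffix.
        int cut = urlPath.length();
        int query = urlPath.indexOf('?');
        if (query >= 0) cut = Math.min(cut, query);
        int fragment = urlPath.indexOf('#');
        if (fragment >= 0) cut = Math.min(cut, fragment);
        String path = urlPath.substring(0, cut);

        // Only a dot inside the last path segment counts as a suffix separator.
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot <= slash) {
            return ParserType.NONE;
        }
        String suffix = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return bySuffix.getOrDefault(suffix, ParserType.NONE);
    }

    public static void main(String[] args) {
        ParserRegistry registry = new ParserRegistry();
        registry.register("html", ParserType.HTML);
        registry.register("php", ParserType.HTML); // PHP output is parsed as HTML
        registry.register("pdf", ParserType.PDF);
        System.out.println(registry.lookup("/news/index.php?id=3"));
    }
}
```

In a real setup the register calls would be driven by a configuration file
instead of being hard-coded, and a lookup by MIME type could be added next to
the suffix lookup.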

> org/apache/lenya/lucene/index/AbstractIndexer.java
>
>That is, I only added the php extension and then crawled and indexed a
>simple site with two PHP files, and that worked. However, I am not sure:
>can the current classes also be used to crawl and index PHP files in general?
>

As long as the dumped PHP files are plain text or HTML, it will be no problem.
The DefaultIndexer could be made configurable, just like the
"ConfigurableIndexer".
You might want to take a look at

src/webapp/lenya/pubs/oscom/config/search/lucene-cmfsMatrix.xconf

HTH

Michi



