Posted to java-user@lucene.apache.org by John Bresnik <jb...@auditintegrity.com> on 2003/03/24 23:46:16 UTC
org.apache.lucene.demo.IndexHTML - parse JSP files?
Does anyone know of a quick and easy way to get this demo
[org.apache.lucene.demo.IndexHTML] to parse JSP files as well? I used a
crawler to create a local [static] version of the site [i.e. they are no
longer "JSP" files, just the HTML output from the original JSP files], but in
the interest of keeping the URLs intact, I need to parse the files that still
carry the .jsp extension. The short question is: does anyone know of a way to
*not* ignore the *.jsp files?
Thanks.
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: org.apache.lucene.demo.IndexHTML - parse JSP files?
Posted by Tatu Saloranta <ta...@hypermall.net>.
On Monday 24 March 2003 18:03, Michael Wechner wrote:
> John Bresnik wrote:
> >anyone know of a quick and easy way to get this demo
> >[org.apache.lucene.demo.IndexHTML] to parse JSP files as well? I used a
> >crawler to create a local [static] version of the site [i.e. they are no
> >longer "JSP" files, just the HTML output from the original JSP files], but
> >in the interest of keeping the URL intact, I need to parse the JSP
> >extensions - the short question is, does anyone know of a way to *not*
> >ignore the *.jsp files?
>
> just modify IndexHTML: there is one line in there which decides what
> extension it will index.
There is another question I was wondering about. JSP is not XML: it cannot
be reliably parsed with an XML or even an HTML parser (or, for that matter,
even with the simplest XML markup tokenizer that ignores nesting), so it needs
a lower-level scanner. Has anyone tried connecting an actual JSP processor to
Lucene? Or writing a simple one meant just for indexing, without having to
execute the embedded code?
[The problem with JSP compared to XML is that it need not nest properly with
the HTML content around it; one can use JSP inside attribute values, for
example. Thus the JSP first has to be processed into HTML, and then the HTML
needs to be further tokenized.]
Jakarta has to have at least one such processor (I haven't looked at whether
there's a separate component or whether Tomcat just has one embedded). Of
course, parsing JSP is problematic in many ways, not just in stripping out the
JSP tagging; dynamic portions probably just have to be ignored, and all the
remaining text included (except for things inside comments).
-+ Tatu +-
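A minimal sketch of the kind of low-level scanner described above, for readers who want to index raw JSP source without executing it. It is only an illustration under stated assumptions: the class name JspStripper and the method strip are hypothetical, not part of Lucene or Tomcat. It drops "<%-- ... --%>" comments and every other "<% ... %>" block (scriptlets, expressions, directives) and keeps the surrounding template text, which could then be handed to an HTML parser. It does not handle "<jsp:...>" actions, custom tag libraries, or escaped "<\%" sequences.

```java
public class JspStripper {

    /**
     * Removes JSP comments and all other <% ... %> blocks from the input,
     * returning only the static template text. Hypothetical helper for
     * illustration; not a full JSP parser.
     */
    public static String strip(String jsp) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < jsp.length()) {
            int open = jsp.indexOf("<%", pos);
            if (open < 0) {
                // No more JSP blocks: keep the rest of the text as-is.
                out.append(jsp, pos, jsp.length());
                break;
            }
            // Keep the static text that precedes the JSP block.
            out.append(jsp, pos, open);
            boolean comment = jsp.startsWith("<%--", open);
            String closer = comment ? "--%>" : "%>";
            int close = jsp.indexOf(closer, open + (comment ? 4 : 2));
            if (close < 0) {
                // Unterminated block: drop everything after the opener.
                break;
            }
            pos = close + closer.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // JSP inside an attribute value, followed by a JSP comment.
        String jsp = "<a href=\"<%= url %>\">Home</a><%-- hidden --%> text";
        System.out.println(strip(jsp)); // prints: <a href="">Home</a> text
    }
}
```

Because the scan works on raw characters rather than on markup structure, it copes with JSP embedded inside attribute values, which is exactly the case where an XML or HTML tokenizer breaks down. The dynamic portions are simply lost, so only the static text ends up in the index.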
Re: org.apache.lucene.demo.IndexHTML - parse JSP files?
Posted by John Bresnik <jb...@auditintegrity.com>.
Ah, thanks. I couldn't find the demo classes [turns out they were in a
different directory].
Re: org.apache.lucene.demo.IndexHTML - parse JSP files?
Posted by Michael Wechner <mi...@wyona.org>.
John Bresnik wrote:
>anyone know of a quick and easy way to get this demo
>[org.apache.lucene.demo.IndexHTML] to parse JSP files as well? I used a
>crawler to create a local [static] version of the site [i.e. they are no
>longer "JSP" files, just the HTML output from the original JSP files], but in
>the interest of keeping the URL intact, I need to parse the JSP extensions -
>the short question is, does anyone know of a way to *not* ignore the *.jsp
>files?
>
just modify IndexHTML: there is one line in there which decides what
extension it will index.
HTH
Michael
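The change suggested here might look roughly like the sketch below. It is a standalone illustration, not the demo's actual source: the class name ExtensionFilter, the method shouldIndex, and the extension list are assumptions, and the real check in IndexHTML may be written differently. The idea is simply to add ".jsp" to whatever test decides which files get handed to the HTML parser. The case-insensitive comparison is an addition of this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ExtensionFilter {

    // Hypothetical extension list; ".jsp" is the addition discussed in this thread.
    private static final List<String> INDEXED_EXTENSIONS =
            Arrays.asList(".html", ".htm", ".txt", ".jsp");

    /** Returns true if the path ends with one of the extensions to index. */
    public static boolean shouldIndex(String path) {
        // Lower-case the path so "PAGE.JSP" matches as well (sketch's own choice).
        String lower = path.toLowerCase(Locale.ROOT);
        for (String ext : INDEXED_EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("site/products/list.jsp")); // prints: true
        System.out.println(shouldIndex("site/images/logo.gif"));   // prints: false
    }
}
```

Since the crawled files already contain plain HTML output and only keep the .jsp name, passing them to the same HTML parser as the .html files should be enough in this case.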