Posted to java-user@lucene.apache.org by w i l l i a m__b o y d <wi...@javafreelancer.com> on 2002/02/23 19:27:01 UTC
RE: JSP Parser class wanted
i have had some success in solving my problem. mind you, it is a hack; a quick fix. it may or may not work for everyone. also the jsp pages i am indexing/searching have very little dynamically generated content. they are mostly static.
my problem was that too much gobbledygook was turning up in the summary. i only wanted content from the main body of the document to appear there. since all of my relevant body content is inside <p> tags, my approach was to have the parser add only text found inside <p> tags to the summary. to do that, in the HtmlParser.jj file that comes with the lucene demo, I added the following line amongst the other variable declarations:
...
boolean inPTag = false;
...
then i changed the addText() method to:
void addText(String text) throws IOException {
  if (inScript)
    return;
  if (inTitle)
    title.append(text);
  else {
    if (!inPTag)  // I added this line...
      return;     // ... and this line
    addToSummary(text);
    if (!titleComplete && !title.equals("")) { // finished title
      synchronized(this) {
        titleComplete = true; // tell waiting threads
        notifyAll();
      } // end synchronized block
    } // if
  } // end else
  length += text.length();
  pipeOut.write(text);
  afterSpace = false;
}
then i changed the Tag() method to:
void Tag() throws IOException :
{
  Token t1, t2;
  boolean inImg = false;
}
{
  t1=<TagName> {
    inTitle = t1.image.equalsIgnoreCase("<title"); // keep track if in <TITLE>
    inImg = t1.image.equalsIgnoreCase("<img"); // keep track if in <IMG>
    if (inScript) { // keep track if in <SCRIPT>
      inScript = !t1.image.equalsIgnoreCase("</script");
    } else {
      inScript = t1.image.equalsIgnoreCase("<script");
    }
    // i added the following if conditional:
    if (inPTag) { // keep track if in <P>
      inPTag = !t1.image.equalsIgnoreCase("</p");
    } else {
      inPTag = t1.image.equalsIgnoreCase("<p");
    }
  }
  (t1=<ArgName>
   (<ArgEquals>
    (t2=ArgValue() // save ALT text in IMG tag
     {
       // I commented the next two lines out because I didn't want the contents
       // of alt attributes showing up in the summary:
       // if (inImg && t1.image.equalsIgnoreCase("alt") && t2 != null)
       //   addText("[" + t2.image + "]");
     }
    )?
   )?
  )*
  <TagEnd>
}
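For anyone who wants to try the flag-toggling idea without rebuilding the grammar, here is a minimal standalone sketch in plain Java. It is not the Lucene demo parser: the class name PTagSummarySketch, the summarize() method, and the regex tokenizer are all hypothetical stand-ins, assumed only for illustration. The inPTag toggle mirrors the conditional added to Tag() above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the generated parser; not part of the Lucene demo. */
class PTagSummarySketch {
    // Matches either a tag (group 1 = optional "/", group 2 = tag name)
    // or a run of plain text between tags (group 3).
    private static final Pattern TOKEN =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)");

    /** Returns only the text that appears while the inPTag flag is set. */
    static String summarize(String html) {
        StringBuilder summary = new StringBuilder();
        boolean inPTag = false;
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(2) != null) {
                // A tag: apply the same toggle as the Tag() change above.
                boolean closing = !m.group(1).isEmpty();
                boolean isP = m.group(2).equalsIgnoreCase("p");
                if (inPTag) {
                    inPTag = !(closing && isP); // stay in until </p>
                } else {
                    inPTag = !closing && isP;   // enter on <p>
                }
            } else if (inPTag) {
                // Text: keep it only while inside a <p> element.
                summary.append(m.group(3));
            }
        }
        return summary.toString().trim();
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Demo</title></head><body>"
            + "<div>menu | login</div><p>First <b>bold</b> paragraph.</p>"
            + "<img alt=\"logo\"><P>Second one.</P></body></html>";
        System.out.println(summarize(page));
    }
}
```

As with the hack itself, the flag only clears on an explicit </p>, so a page that omits its closing </p> tags would keep collecting text until the next one turns up.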
all of the above is in addition to the other changes i mentioned in my earlier posts.
Then I recompiled HtmlParser.jj with javacc; compiled the java files that javacc produced; stuffed those class files into a jar; then placed the jar in the classpath so that the lucene indexer could see the new parser.
hope this helps. if anyone has a better solution please post it here. as i said, it's a hack. but with my deadline, it is all i have time for. one day i would love to spend the time really learning javacc and lucene inside and out. then maybe i could build a proper parser. today is just not that day ;¬)
Re: JSP Parser class wanted
Posted by Chris Opler <ch...@free.fr>.
Hi,
this is a great tool to retrieve and scrape html pages (rendered or not)...
http://www.research.compaq.com/SRC/WebL/
:-)
Chris Opler
w i l l i a m__b o y d wrote:
> > If they're mostly static, why not just code a little crawler to
> > request the pages via the web-server and parse the rendered HTML?
> >
>
> right then. i've added that onto my list of things to do. immediately after
> "meet project deadline" and "...learning javacc and lucene inside and
> out..." ;¬) if anyone has such code they're willing to contribute i would
> put it to good use.
>
> ----- Original Message -----
> From: Steven J. Owens <pu...@darksleep.com>
> To: Lucene Users List <lu...@jakarta.apache.org>; w i l l i a m__b o y
> d <wi...@javafreelancer.com>
> Sent: Sunday, February 24, 2002 1:25 AM
> Subject: Re: JSP Parser class wanted
>
> > w i l l i a m__b o y d <wi...@javafreelancer.com> writes:
> >
> > > i have had some success in solving my problem. mind you, it is a
> > > hack; a quick fix. it may or may not work for everyone. also the jsp
> > > pages i am indexing/searching have very little dynamically generated
> > > content. they are mostly static.
> >
> > If they're mostly static, why not just code a little crawler to
> > request the pages via the web-server and parse the rendered HTML?
> >
> > Steven J. Owens
> > puff@darksleep.com
>
> --
> To unsubscribe, e-mail: <ma...@jakarta.apache.org>
> For additional commands, e-mail: <ma...@jakarta.apache.org>
--
=======================
http://www.openwine.org
Re: JSP Parser class wanted
Posted by w i l l i a m__b o y d <wi...@javafreelancer.com>.
> If they're mostly static, why not just code a little crawler to
> request the pages via the web-server and parse the rendered HTML?
>
right then. i've added that onto my list of things to do. immediately after
"meet project deadline" and "...learning javacc and lucene inside and
out..." ;¬) if anyone has such code they're willing to contribute i would
put it to good use.
Re: JSP Parser class wanted
Posted by "Steven J. Owens" <pu...@darksleep.com>.
w i l l i a m__b o y d <wi...@javafreelancer.com> writes:
> i have had some success in solving my problem. mind you, it is a
> hack; a quick fix. it may or may not work for everyone. also the jsp
> pages i am indexing/searching have very little dynamically generated
> content. they are mostly static.
If they're mostly static, why not just code a little crawler to
request the pages via the web-server and parse the rendered HTML?
Steven J. Owens
puff@darksleep.com
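
A rough sketch of the "little crawler" idea, in case it is useful as a starting point. It assumes the page URLs are already known and are passed on the command line; it does not follow links. The class name RenderedPageFetcher and the fetch() helper are hypothetical, and the UTF-8 decoding is an assumption about the pages being served. Only standard java.net and java.io calls are used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: requests pages and returns the HTML the server sends. */
class RenderedPageFetcher {
    /** Reads the whole response body for the URL, decoded as UTF-8 (an assumption). */
    static String fetch(URL url) throws IOException {
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                body.append(buf, 0, n);
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java RenderedPageFetcher <url> [<url> ...]");
            return;
        }
        for (String arg : args) {
            // The server has already executed the JSP, so this is plain HTML
            // that could be handed to the HTML parser instead of the JSP source.
            String html = fetch(URI.create(arg).toURL());
            System.out.println(arg + ": " + html.length() + " chars");
        }
    }
}
```

Because the request goes through the web server, whatever dynamic content the JSP produces arrives as ordinary HTML, which sidesteps parsing the JSP source at all.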