Posted to java-user@lucene.apache.org by w i l l i a m__b o y d <wi...@javafreelancer.com> on 2002/02/23 19:27:01 UTC

RE: JSP Parser class wanted

i have had some success in solving my problem. mind you, it is a hack; a quick fix. it may or may not work for everyone. also the jsp pages i am indexing/searching have very little dynamically generated content. they are mostly static. 

my problem was that too much gobbledygook was turning up in the summary. i only wanted content from the main body of the document to appear there. since all of my relevant body content is inside <p> tags, my approach was to have the parser add only text that is inside <p> tags to the summary. to do that, in the HTMLParser.jj file that comes with the lucene demo, i added the following line amongst the other variable declarations:

    ...

    boolean inPTag = false;

    ...


then i changed the addText() method to: 
  void addText(String text) throws IOException {
    if (inScript)
      return;
    if (inTitle)
      title.append(text);
    else {
      if (!inPTag)   // I added this line...
        return;      // ... and this line
      addToSummary(text);
      if (!titleComplete && !title.equals("")) {  // finished title
        synchronized(this) {
          titleComplete = true;                   // tell waiting threads
          notifyAll();
        } // end synchronized block
      } // if
    } // end else

    length += text.length();
    pipeOut.write(text);

    afterSpace = false;
  }


then i changed the Tag() method to: 
void Tag() throws IOException :
{
  Token t1, t2;
  boolean inImg = false;
}
{
  t1=<TagName> {
    inTitle = t1.image.equalsIgnoreCase("<title"); // keep track if in <TITLE>
    inImg = t1.image.equalsIgnoreCase("<img");     // keep track if in <IMG>
    if (inScript) {                                // keep track if in <SCRIPT>
      inScript = !t1.image.equalsIgnoreCase("</script");
    } else {
      inScript = t1.image.equalsIgnoreCase("<script");
    }
    // i added the following if conditional:
    if (inPTag) {                                  // keep track if in <P>
      inPTag = !t1.image.equalsIgnoreCase("</p");
    } else {
      inPTag = t1.image.equalsIgnoreCase("<p");
    }
  }
  (t1=<ArgName>
   (<ArgEquals>
    (t2=ArgValue()                                 // save ALT text in IMG tag
     {
       // I commented the next two lines out because I didn't want the contents
       // of alt attributes showing up in the summary:
       // if (inImg && t1.image.equalsIgnoreCase("alt") && t2 != null)
       //   addText("[" + t2.image + "]");
     }
    )?
   )?
  )*
  <TagEnd>
}
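
For readers who want to try the flag logic outside the grammar, here is a minimal standalone sketch in plain Java. It is not part of the Lucene demo: the class name PTagSummarySketch and the methods tag(), text() and summary() are made up for illustration. Only the inPTag flag and the two conditionals are taken from the post above.

```java
// Hypothetical standalone sketch of the inPTag toggle from the post.
// It does not use JavaCC or any Lucene class.
public class PTagSummarySketch {
    private boolean inPTag = false;                       // same flag name as in the post
    private final StringBuilder summary = new StringBuilder();

    // Mirrors the conditional added to Tag(); tagName looks like "<p" or "</p".
    public void tag(String tagName) {
        if (inPTag) {
            inPTag = !tagName.equalsIgnoreCase("</p");    // stay on until </p> is seen
        } else {
            inPTag = tagName.equalsIgnoreCase("<p");      // switch on at <p>
        }
    }

    // Mirrors the guard added to addText(): text outside <p> is dropped.
    public void text(String text) {
        if (!inPTag) {
            return;
        }
        summary.append(text);
    }

    public String summary() {
        return summary.toString();
    }

    public static void main(String[] args) {
        PTagSummarySketch s = new PTagSummarySketch();
        s.tag("<div");
        s.text("nav junk ");          // dropped: not inside <p>
        s.tag("<p");
        s.text("Body text");          // kept
        s.tag("<b");
        s.text(" bold");              // kept: nested tags do not reset the flag
        s.tag("</b");
        s.tag("</p");
        s.text(" footer");            // dropped again
        System.out.println(s.summary());
    }
}
```

Two things the sketch makes easy to see. First, the flag only switches off on a "</p" tag, so a page that omits closing </p> tags (which HTML permits) keeps feeding the summary until the next one appears. Second, in the addText() shown in the post, the early return sits above length += text.length() and pipeOut.write(text), so text outside <p> is skipped by those statements as well, not just by addToSummary().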

all of the above is in addition to the other changes i mentioned in my earlier posts. 

Then I recompiled HTMLParser.jj with javacc; compiled the java files that javacc produced; stuffed those class files into a jar; then placed the jar in the classpath so that the lucene indexer could see the new parser. 

hope this helps. if anyone has a better solution please post it here. as i said, it's a hack. but with my deadline, it is all i have time for. one day i would love to spend the time really learning javacc and lucene inside and out. then maybe i could build a proper parser. today is just not that day ;¬)



Re: JSP Parser class wanted

Posted by Chris Opler <ch...@free.fr>.
Hi,

this is a great tool to retrieve and scrape html pages (rendered or not)...

http://www.research.compaq.com/SRC/WebL/

:-)

Chris Opler

w i l l i a m__b o y d wrote:

> >      If they're mostly static, why not just code a little crawler to
> > request the pages via the web-server and parse the rendered HTML?
> >
>
> right then. i've added that onto my list of things to do. immediately after
> "meet project deadline" and "...learning javacc and lucene inside and
> out..." ;¬) if anyone has such code they're willing to contribute i would
> put it to good use.
>
> ----- Original Message -----
> From: Steven J. Owens <pu...@darksleep.com>
> To: Lucene Users List <lu...@jakarta.apache.org>; w i l l i a m__b o y
> d <wi...@javafreelancer.com>
> Sent: Sunday, February 24, 2002 1:25 AM
> Subject: Re: JSP Parser class wanted
>
> > w i l l i a m__b o y d <wi...@javafreelancer.com> writes:
> >
> > > i have had some success in solving my problem. mind you, it is a
> > > hack; a quick fix. it may or may not work for everyone. also the jsp
> > > pages i am indexing/searching have very little dynamically generated
> > > content. they are mostly static.
> >
> >      If they're mostly static, why not just code a little crawler to
> > request the pages via the web-server and parse the rendered HTML?
> >
> > Steven J. Owens
> > puff@darksleep.com
>
> --
> To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
> For additional commands, e-mail: <ma...@jakarta.apache.org>

--
=======================
http://www.openwine.org



Re: JSP Parser class wanted

Posted by w i l l i a m__b o y d <wi...@javafreelancer.com>.
>      If they're mostly static, why not just code a little crawler to
> request the pages via the web-server and parse the rendered HTML?
>

right then. i've added that onto my list of things to do. immediately after
"meet project deadline" and "...learning javacc and lucene inside and
out..." ;¬) if anyone has such code they're willing to contribute i would
put it to good use.

----- Original Message -----
From: Steven J. Owens <pu...@darksleep.com>
To: Lucene Users List <lu...@jakarta.apache.org>; w i l l i a m__b o y
d <wi...@javafreelancer.com>
Sent: Sunday, February 24, 2002 1:25 AM
Subject: Re: JSP Parser class wanted


> w i l l i a m__b o y d <wi...@javafreelancer.com> writes:
>
> > i have had some success in solving my problem. mind you, it is a
> > hack; a quick fix. it may or may not work for everyone. also the jsp
> > pages i am indexing/searching have very little dynamically generated
> > content. they are mostly static.
>
>      If they're mostly static, why not just code a little crawler to
> request the pages via the web-server and parse the rendered HTML?
>
> Steven J. Owens
> puff@darksleep.com


Re: JSP Parser class wanted

Posted by "Steven J. Owens" <pu...@darksleep.com>.
w i l l i a m__b o y d <wi...@javafreelancer.com> writes:

> i have had some success in solving my problem. mind you, it is a
> hack; a quick fix. it may or may not work for everyone. also the jsp
> pages i am indexing/searching have very little dynamically generated
> content. they are mostly static.

     If they're mostly static, why not just code a little crawler to
request the pages via the web-server and parse the rendered HTML?
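
A rough sketch of what such a "little crawler" might look like, using only the standard java.net classes. Everything here is an assumption for illustration: the class name RenderedPageFetcher, the method fetch(), the 5 second timeouts, and the choice to read the response as UTF-8. It only fetches the rendered HTML for each URL given on the command line; handing the result to an HTML parser and to the indexer is left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: request pages through the web server so that the
// JSP has already been rendered to plain HTML by the time we read it.
public class RenderedPageFetcher {

    // Returns the body served for pageUrl as one string.
    // Assumes the server sends UTF-8; adjust the charset if it does not.
    public static String fetch(String pageUrl) throws IOException {
        URLConnection conn = URI.create(pageUrl).toURL().openConnection();
        conn.setConnectTimeout(5000);   // illustrative timeouts, in milliseconds
        conn.setReadTimeout(5000);
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        // Each argument is a full URL, e.g. the address of one JSP page.
        for (String pageUrl : args) {
            String html = fetch(pageUrl);
            System.out.println(pageUrl + ": " + html.length() + " characters");
            // The rendered HTML could now be passed to an HTML parser for indexing.
        }
    }
}
```

This does not follow links; it expects the list of page URLs to be supplied, which may be enough for a small, mostly static site.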
 
Steven J. Owens
puff@darksleep.com
