Posted to java-user@lucene.apache.org by Ross Rankin <ro...@commercescience.com> on 2006/07/13 22:33:44 UTC

HTMLParser

Since I cannot seem to access the HTMLParser mailing list and I saw the
library recommended here, I thought someone here who has used it
successfully could help me out.

I have HTML text stored in a database field which I want to add to a
Lucene document, but I want to remove the HTML tags, so HTMLParser
looked like it would fit the bill.

 

First, it does not seem to be parsing... hence my first problem and it
also is throwing an exception along with this phrase sprinkled around
"(No such file or directory)".

 

I think I may be using it wrong, so here's what I have done.  In my
object where I create my document, I have the following code:

        StringExtractor extract = new StringExtractor(record.get("column14").toString().trim());
        try {
            value = extract.extractStrings(false);
        } catch (ParserException pe) {
            System.out.println("Index Long Description Parser Exception:" + pe.getMessage());
            value = "";
        }

 

What I get out in value is like the following:

<LI><FONT size=2>Crystal Clear III and 3D combfilter for natural, sharp
images with enhanced quality </FONT>

<LI><FONT size=2>Compact and sleek design </FONT>

<LI><FONT size=2>Incredible Surround (No such file or directory)

 

So the tags are still there and oddly the "(No such file or directory)"
phrase is added which is not in the original text.

 

Then I get a ParserException.

 

What am I doing wrong?

 

Thanks,

Ross

Re: HTMLParser

Posted by Yonik Seeley <ys...@gmail.com>.

I've never used HTMLParser, but if you have malformed, incomplete, or
optional HTML that would otherwise choke an HTML parser, you could use
Solr's HTMLStripping:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e

It's pretty stand-alone, so it should be trivial to rip it out of Solr
and re-use it in your Lucene project.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
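
To illustrate the general idea of stripping tags without a strict parser, here is a minimal sketch. It is not Solr's code: the class name TagStripSketch and its strip method are hypothetical, and the approach is a deliberately simple character scan that drops everything between '<' and '>'. It tolerates unclosed elements such as the <LI> items in the question, but it does not decode entities and does not handle comments, scripts, or a '>' inside an attribute value.

```java
// Hypothetical illustration only; not taken from Solr or HTMLParser.
final class TagStripSketch {

    /** Removes anything between '<' and '>' and collapses whitespace. */
    static String strip(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') {
                    inTag = false;
                    out.append(' '); // keep words on either side of a tag apart
                }
            } else if (c == '<') {
                inTag = true;
            } else {
                out.append(c);
            }
        }
        // Collapse runs of whitespace left behind by removed tags.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<LI><FONT size=2>Compact and sleek design </FONT>"));
    }
}
```

A scan like this never throws on malformed input, which is the property that matters for database fields holding HTML fragments. For anything beyond plain tag removal, a real HTML-aware stripper is the safer choice.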


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: HTMLParser

Posted by Charles Bell <ch...@bellsouth.net>.

The following little program should do the job for you.


/* HTMLTextStripper.java
 * July 15, 2006
 */

import java.io.*;

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.xml.parsers.*;

/** HTMLTextStripper
 *  @author  Charles Bell
 *  @version July 15, 2006
 */
public class HTMLTextStripper extends DefaultHandler {

    private SAXParser parser = null;

    private String tempString = "";
    private boolean debug = false;
    private String errorMessage = "";

    public static void main(String[] args) {
        new HTMLTextStripper().test();
    }

    public void test() {
        // A SAX parser expects well-formed XML, so the unclosed <p> below
        // should surface as a SAXException in getErrorMessage().
        System.out.println(stripText("<html><body>This is body text. "
                + "<p>This is paragraph text.</p>"
                + "<p>This is malformed html because of no end p tag.</body></html>"));
        System.out.println("error: " + getErrorMessage());
    }

    public String stripText(String html) {
        try {
            parser = SAXParserFactory.newInstance().newSAXParser();
        } catch (ParserConfigurationException pce) {
            System.err.println("ParserConfigurationException: " + pce.getMessage());
        } catch (SAXException saxe) {
            System.err.println("SAXException: " + saxe.getMessage());
        }
        if (parser != null) {
            try {
                InputSource inputsource = new InputSource(new StringReader(html));
                parser.parse(inputsource, this);
            } catch (IOException ioe) {
                errorMessage = errorMessage + ("IOException: " + ioe.getMessage());
            } catch (SAXException saxe) {
                errorMessage = errorMessage + ("SAXException: " + saxe.getMessage());
            }
        } else {
            errorMessage = errorMessage + ("XML Reader not initialized.");
        }

        return tempString;
    }

    public String getErrorMessage() {
        return errorMessage;
    }

    /** characters is called by the SAXParser when it
     *  encounters character data in an xml document.
     */
    public void characters(char[] ch, int start, int length) throws SAXException {
        tempString = tempString + new String(ch, start, length);
    }

}


RE: HTMLParser

Posted by Ross Rankin <ro...@commercescience.com>.

OK, I got it fixed and thought I would respond back so it is in the archive for the next poor soul...

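Here's the code I used:
        // Classes used, from the HTMLParser library: org.htmlparser.Parser,
        // org.htmlparser.lexer.Lexer, org.htmlparser.beans.StringBean,
        // org.htmlparser.util.ParserException
        StringBean sb = new StringBean();
        String htmlSource = record.get("column14").toString().trim();
        // Wrapping the text in a Lexer makes the Parser read it as HTML content.
        Parser parser = new Parser(new Lexer(htmlSource));
        try {
            parser.visitAllNodesWith(sb);
        } catch (ParserException pe) {
            System.out.println("Index ParserException:" + pe.getMessage());
        }
        value = sb.getStrings();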
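
A likely explanation for the original symptom (an assumption; the thread does not confirm it): the StringExtractor constructor appears to take a resource locator, meaning a URL or a file name, rather than HTML content, so the text from the database was treated as a file name. The JDK-only sketch below shows how opening HTML text as a file yields a message of that shape. The class ResourceVsContentDemo and its openAsFile method are hypothetical and do not use HTMLParser itself.

```java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical illustration only; it mimics "treat the string as a file name".
final class ResourceVsContentDemo {

    /** Returns the failure message if the string cannot be opened as a file, else null. */
    static String openAsFile(String resource) throws IOException {
        try (FileInputStream in = new FileInputStream(resource)) {
            return null; // the string really did name an existing file
        } catch (FileNotFoundException e) {
            // The message contains the attempted path followed by the OS reason;
            // on Unix-like systems the reason is typically "No such file or directory".
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(openAsFile("<LI>Compact and sleek design"));
    }
}
```

If that is what happened, it would also explain why the tags were still present: the string printed was the failed "file name" itself, not the result of any parsing. Passing the text through a Lexer, as in the fix above, avoids the lookup.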
