Posted to java-user@lucene.apache.org by Giovanni Novelli <gi...@gmail.com> on 2005/07/29 09:17:45 UTC

Text extraction from HTML

Hello,
I'm developing multi-agent software that involves information
indexing, information retrieval, and information categorization
tasks. I want to build the training set for categorization from a set
of HTML pages fetched from the DMOZ RDF dumps. I have tried the
HtmlParser that comes with Nutch, but I wasn't able to make it work
without adjusting Nutch's global XML configuration; is that perhaps
the only way to make the plugin work? Does Lucene provide a good HTML
parser in its contrib section for parsing web pages found in the
wild?

Best regards,
Giovanni Novelli

P.S.: This is a crosspost as I'm relying on both Lucene and Nutch.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Text extraction from HTML

Posted by Jack Tang <hi...@gmail.com>.
Hi Novelli

Do you need to use the HtmlParser in Nutch specifically?
If alternatives are acceptable, you could try the htmlparser project
hosted on sf.net:

http://htmlparser.sourceforge.net/

Regards
/Jack

-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars



Re: Text extraction from HTML

Posted by Giovanni Novelli <gi...@gmail.com>.
I have tried both HtmlParser v1.5 and NekoHTML. With the former, my
implementation doesn't work: for example, it extracts text from
JavaScript. I followed the hint from
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/visitors/TextExtractingVisitor.html

The following is my non-working implementation, which relies on HtmlParser v1.5:

import org.htmlparser.Parser;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.TextExtractingVisitor;

public class HtmlFilter {
    // Returns the text collected by TextExtractingVisitor for the page.
    // Known problem (see above): the result also includes script text.
    public static String getText(String html) {
        Parser parser = Parser.createParser(html, "UTF-8");
        TextExtractingVisitor visitor = new TextExtractingVisitor();
        try {
            parser.visitAllNodesWith(visitor);
        } catch (ParserException e) {
            // Parsing failed; whatever was extracted so far is still returned.
            e.printStackTrace();
        }
        return visitor.getExtractedText();
    }
}
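
For comparison, here is a minimal sketch of one way to keep script and
style content out of the extracted text. It does not use HtmlParser or
NekoHTML; it assumes the HTML parser bundled with the JDK
(javax.swing.text.html) is acceptable, and the class name
VisibleTextExtractor and its helper TextCollector are hypothetical. The
idea is to track whether the parser is currently inside a script or
style element and to ignore text reported while it is. It has not been
tried against pages from the DMOZ dumps.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Lucene, Nutch, HtmlParser or NekoHTML.
class VisibleTextExtractor {

    // Collects text callbacks, skipping anything inside <script> or <style>.
    private static final class TextCollector extends HTMLEditorKit.ParserCallback {
        private final StringBuilder text = new StringBuilder();
        private int skipDepth = 0; // > 0 while inside a script or style element

        private static boolean isSkipped(HTML.Tag tag) {
            return tag == HTML.Tag.SCRIPT || tag == HTML.Tag.STYLE;
        }

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (isSkipped(tag)) {
                skipDepth++;
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (isSkipped(tag) && skipDepth > 0) {
                skipDepth--;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (skipDepth == 0) {
                if (text.length() > 0) {
                    text.append(' '); // separate consecutive text runs
                }
                text.append(data);
            }
        }

        String result() {
            return text.toString();
        }
    }

    // Parses the given HTML string and returns the text outside script/style.
    static String getText(String html) {
        TextCollector collector = new TextCollector();
        try {
            // 'true' tells the parser to ignore charset declarations in the page,
            // since the input is already a decoded String.
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.result();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello</p>"
                + "<script>document.write('hidden');</script>"
                + "<p>World</p></body></html>";
        System.out.println(getText(html));
    }
}
```

The same depth-counting idea should carry over to a visitor or SAX-style
handler in other parsers, but the callback names would differ and that
is untested here.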



Re: Text extraction from HTML

Posted by Patrick Kimber <ma...@gmail.com>.
Hi Giovanni
We are using the Neko HTML parser.  Some simple example code can be
found in the "Lucene in Action" book.

For more information:
http://www.manning.com/books/hatcher2
http://www.apache.org/~andyc/neko/doc/html/

Patrick


