Posted to java-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2009/04/10 01:27:27 UTC

[Lucene-java Wiki] Trivial Update of "LuceneFAQ" by MartinJericho

The following page has been changed by MartinJericho:
http://wiki.apache.org/jakarta-lucene/LuceneFAQ

The comment on the change is:
Updated link to Jericho HTML Parser TextExtractor javadoc

------------------------------------------------------------------------------
  
  The author of [http://furl.net FURL] recommends [http://www.tagsoup.info TagSoup].
  
- [http://jerichohtml.sourceforge.net/ Jericho HTML Parser] provides a simple [http://jerichohtml.sourceforge.net/doc/api/au/id/jericho/lib/html/TextExtractor.html TextExtractor] class that converts any segment of an HTML document into a string of space-separated words, optionally including the values from title, alt, label, and summary attributes.  The parser is also very tolerant of badly formatted HTML and can also handle server-based source tags such as JSP, ASP, PHP etc.
+ [http://jerichohtml.sourceforge.net/ Jericho HTML Parser] provides a simple [http://jericho.htmlparser.net/docs/javadoc/index.html?net/htmlparser/jericho/TextExtractor.html TextExtractor] class that converts any segment of an HTML document into a string of space-separated words, optionally including the values from title, alt, label, and summary attributes.  The parser is also very tolerant of badly formatted HTML and can also handle server-based source tags such as JSP, ASP, PHP etc.
  
  
  ==== How can I index XML documents? ====