Posted to commits@poi.apache.org by ni...@apache.org on 2014/08/05 00:56:46 UTC

svn commit: r1615820 - /poi/site/publish/text-extraction.html

Author: nick
Date: Mon Aug  4 22:56:46 2014
New Revision: 1615820

URL: http://svn.apache.org/r1615820
Log:
Site republish

Modified:
    poi/site/publish/text-extraction.html

Modified: poi/site/publish/text-extraction.html
URL: http://svn.apache.org/viewvc/poi/site/publish/text-extraction.html?rev=1615820&r1=1615819&r2=1615820&view=diff
==============================================================================
--- poi/site/publish/text-extraction.html (original)
+++ poi/site/publish/text-extraction.html Mon Aug  4 22:56:46 2014
@@ -292,8 +292,7 @@ if (VERSION > 3) {
       provides common methods to get at the <a href="hpfs/">HPSF
       document metadata</a>.</p>
      
-<p>All OOXML based text extractors (available in POI 3.5 and later) 
-      also extend from
+<p>All OOXML based text extractors also extend from
       <em>org.apache.poi.POIOOXMLTextExtractor</em>. This additionally
       provides common methods to get at the OOXML metadata.</p>
     
@@ -304,9 +303,9 @@ if (VERSION > 3) {
 <h3>Text Extractor Factory</h3>
 </div>
      
-<p>As part of the addition of OOXML support in Apache POI 3.5, there
-      is a common class to select the appropriate POI text extractor for 
-      you. <em>org.apache.poi.extractor.ExtractorFactory</em> provides a
+<p>POI provides a common class to select the appropriate text extractor 
+      for you, based on the supplied document's contents. 
+      <em>org.apache.poi.extractor.ExtractorFactory</em> provides a
       similar function to WorkbookFactory. You simply pass it an
       InputStream, a File, a POIFSFileSystem or an OOXML Package. It
       figures out the correct text extractor for you, and returns it.</p>
@@ -325,17 +324,20 @@ if (VERSION > 3) {
 <p>For .xls files, there is 
       <em>org.apache.poi.hssf.extractor.ExcelExtractor</em>, which will 
       return text, optionally with formulas instead of their contents. 
-      Those using POI 3.5 can also use 
-      <em>org.apache.poi.xssf.extractor.XSSFExcelExtractor</em>, to perform
-      a similar task for .xlsx files.</p>
-     
-<p>In addition, there is a second text extractor for .xls files,
-      <em>org.apache.poi.hssf.extractor.EventBasedExcelExtractor</em>. This
-      is based on the streaming EventUserModel code, and will generally
+      Similarly, for .xlsx files there is
+      <em>org.apache.poi.xssf.extractor.XSSFExcelExtractor</em>, which 
+      provides the same functionality.</p>
+     
+<p>For those working in constrained memory footprints, there are
+      two more Excel text extractors available. For .xls files, it's
+      <em>org.apache.poi.hssf.extractor.EventBasedExcelExtractor</em>,
+      which is based on the streaming EventUserModel code, and will generally
       deliver a lower memory footprint for extraction. However, it will
       have problems correctly outputting more complex formulas, as it 
       works with records as they pass, and so doesn't have access to all
-      parts of complex and shared formulas.</p>
+      parts of complex and shared formulas. For .xlsx files the equivalent is
+      <em>org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor</em>, 
+      which is based on the XSSF SAX Event codebase.</p>
     
 
     
@@ -352,9 +354,9 @@ if (VERSION > 3) {
       older Word 6 and Word 95 files, using the scratchpad class
       <em>org.apache.poi.hwpf.extractor.Word6Extractor</em>.</p>
      
-<p>Since POI 3.5, it is possible to use
-      <em>org.apache.poi.xwpf.extractor.XWPFWordExtractor</em>, to perform
-      text extraction for .docx files.</p> 
+<p>For .docx files, the relevant class is 
+      <em>org.apache.poi.xwpf.extractor.XWPFWordExtractor</em>.
+</p>
     
 
     
@@ -366,9 +368,9 @@ if (VERSION > 3) {
 <p>For .ppt files, in scratchpad there is 
       <em>org.apache.poi.hslf.extractor.PowerPointExtractor</em>, which 
       will return text for your slideshow, optionally restricted to just
-      slides text or notes text. Those using POI 3.5 can also use 
-      <em>org.apache.poi.xslf.extractor.XSLFPowerPointExtractor</em>, to 
-      perform a similar task for .pptx files.</p>
+      slides text or notes text. For .pptx files, the class to use is
+      <em>org.apache.poi.xslf.extractor.XSLFPowerPointExtractor</em>.
+</p>
     
 
     


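Editor's sketch for the Text Extractor Factory paragraph in the diff above, which says the extractor is chosen "based on the supplied document's contents". One way such a choice can start is by looking at the first bytes of the file: OLE2 files (.xls, .doc, .ppt) begin with a fixed 8 byte signature, while OOXML files (.xlsx, .docx, .pptx) are ZIP archives and begin with "PK\3\4". The class and method names below (FormatSniffer, Kind, sniff) are hypothetical and use only the JDK. This is not POI's implementation, and a real selector would have to look further, for example inside the ZIP, to tell .xlsx from .docx.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.util.Arrays;

// Hypothetical helper, not part of Apache POI.
public class FormatSniffer {
    enum Kind { OLE2, OOXML_ZIP, UNKNOWN }

    // Signature at the start of every OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Local file header signature at the start of a ZIP archive.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    // Classify a document from its leading bytes.
    static Kind sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Kind.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Kind.OOXML_ZIP;
        return Kind.UNKNOWN;
    }

    // Peek at a stream without consuming it. The stream must have been
    // created with a pushback buffer of at least 8 bytes.
    static Kind sniff(PushbackInputStream in) throws IOException {
        byte[] buf = new byte[8];
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) break;
            total += n;
        }
        if (total > 0) in.unread(buf, 0, total); // put the bytes back
        return sniff(Arrays.copyOf(buf, total));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = { 'P', 'K', 3, 4, 0, 0, 0, 0, 0, 0 };
        PushbackInputStream in =
            new PushbackInputStream(new ByteArrayInputStream(zip), 8);
        System.out.println(sniff(in)); // prints OOXML_ZIP
    }
}
```

Because the stream overload pushes the bytes back, the same stream can afterwards be handed to whichever parser was selected.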

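Editor's sketch for the paragraph on the event based Excel extractors in the diff above. The idea it describes is to handle content as it streams past rather than building the whole document in memory, which lowers the memory footprint but means later parts of the file are not available when an earlier part is processed. The example below illustrates that trade-off with the JDK's SAX parser on a made-up, simplified sheet layout (row, c and v elements). The class name StreamingCellText and the input format are assumptions for illustration; this is not the XSSF SAX Event code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration, not part of Apache POI.
public class StreamingCellText {

    // Emit each <v> value followed by a tab, and a newline at the end of
    // each <row>, holding only the output text and one flag in memory.
    static String extract(String xml) {
        final StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inValue;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                inValue = "v".equals(qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inValue) out.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("v".equals(qName)) {
                    inValue = false;
                    out.append('\t');
                } else if ("row".equals(qName)) {
                    out.append('\n');
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sheet XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<sheet><row><c><v>1</v></c><c><v>2</v></c></row>"
                   + "<row><c><v>3</v></c></row></sheet>";
        System.out.print(extract(xml));
    }
}
```

The handler sees each element exactly once and keeps no reference to earlier cells, which is the same reason the diff text gives for event based extraction having trouble with complex and shared formulas.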