Posted to commits@poi.apache.org by ni...@apache.org on 2014/08/05 00:56:46 UTC
svn commit: r1615820 - /poi/site/publish/text-extraction.html
Author: nick
Date: Mon Aug 4 22:56:46 2014
New Revision: 1615820
URL: http://svn.apache.org/r1615820
Log:
Site republish
Modified:
poi/site/publish/text-extraction.html
Modified: poi/site/publish/text-extraction.html
URL: http://svn.apache.org/viewvc/poi/site/publish/text-extraction.html?rev=1615820&r1=1615819&r2=1615820&view=diff
==============================================================================
--- poi/site/publish/text-extraction.html (original)
+++ poi/site/publish/text-extraction.html Mon Aug 4 22:56:46 2014
@@ -292,8 +292,7 @@ if (VERSION > 3) {
provides common methods to get at the <a href="hpfs/">HPSF
document metadata</a>.</p>
-<p>All OOXML based text extractors (available in POI 3.5 and later)
- also extend from
+<p>All OOXML based text extractors also extend from
<em>org.apache.poi.POIOOXMLTextExtractor</em>. This additionally
provides common methods to get at the OOXML metadata.</p>
@@ -304,9 +303,9 @@ if (VERSION > 3) {
<h3>Text Extractor Factory</h3>
</div>
-<p>As part of the addition of OOXML support in Apache POI 3.5, there
- is a common class to select the appropriate POI text extractor for
- you. <em>org.apache.poi.extractor.ExtractorFactory</em> provides a
+<p>POI provides a common class to select the appropriate text extractor
+ for you, based on the supplied document's contents.
+ <em>org.apache.poi.extractor.ExtractorFactory</em> provides a
similar function to WorkbookFactory. You simply pass it an
InputStream, a File, a POIFSFileSystem or an OOXML Package. It
figures out the correct text extractor for you, and returns it.</p>
@@ -325,17 +324,20 @@ if (VERSION > 3) {
<p>For .xls files, there is
<em>org.apache.poi.hssf.extractor.ExcelExtractor</em>, which will
return text, optionally with formulas instead of their contents.
- Those using POI 3.5 can also use
- <em>org.apache.poi.xssf.extractor.XSSFExcelExtractor</em>, to perform
- a similar task for .xlsx files.</p>
-
-<p>In addition, there is a second text extractor for .xls files,
- <em>org.apache.poi.hssf.extractor.EventBasedExcelExtractor</em>. This
- is based on the streaming EventUserModel code, and will generally
+ Similarly, for .xlsx files there is
+ <em>org.apache.poi.xssf.extractor.XSSFExcelExtractor</em>, which
+ provides the same functionality.</p>
+
+<p>For those working in constrained memory footprints, there are
+ two more Excel text extractors available. For .xls files, it's
+ <em>org.apache.poi.hssf.extractor.EventBasedExcelExtractor</em>, which is
+ based on the streaming EventUserModel code, and will generally
deliver a lower memory footprint for extraction. However, it will
have problems correctly outputting more complex formulas, as it
works with records as they pass, and so doesn't have access to all
- parts of complex and shared formulas.</p>
+ parts of complex and shared formulas. For .xlsx files the equivalent is
+ <em>org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor</em>,
+ which is based on the XSSF SAX Event codebase.</p>
@@ -352,9 +354,9 @@ if (VERSION > 3) {
older Word 6 and Word 95 files, using the scratchpad class
<em>org.apache.poi.hwpf.extractor.Word6Extractor</em>.</p>
-<p>Since POI 3.5, it is possible to use
- <em>org.apache.poi.xwpf.extractor.XPFFWordExtractor</em>, to perform
- text extraction for .docx files.</p>
+<p>For .docx files, the relevant class is
+ <em>org.apache.poi.xwpf.extractor.XWPFWordExtractor</em>.
+</p>
@@ -366,9 +368,9 @@ if (VERSION > 3) {
<p>For .ppt files, in scratchpad there is
<em>org.apache.poi.hslf.extractor.PowerPointExtractor</em>, which
will return text for your slideshow, optionally restricted to just
- slides text or notes text. Those using POI 3.5 can also use
- <em>org.apache.poi.xslf.extractor.XSLFPowerPointExtractor</em>, to
- perform a similar task for .pptx files.</p>
+ slides text or notes text. For .pptx files, the class to use is
+ <em>org.apache.poi.xslf.extractor.XSLFPowerPointExtractor</em>.
+</p>
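
The Text Extractor Factory hunk says the factory picks an extractor "based on the supplied document's contents". As a rough illustration of what content-based selection can look like, here is a minimal sketch that uses only the JDK. It is not POI's implementation: the class FormatSniffer, its Kind enum and the sniff method are hypothetical names made up for this example. It only assumes two well-known file signatures: OLE2 compound documents (.xls, .doc, .ppt) start with the bytes D0 CF 11 E0 A1 B1 1A E1, and OOXML files (.xlsx, .docx, .pptx) are ZIP archives that start with "PK" 03 04.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, NOT part of Apache POI: guesses the container
// family of a document from its first bytes.
final class FormatSniffer {
    enum Kind { OLE2, OOXML_ZIP, UNKNOWN }

    // Signature of an OLE2 compound document (.xls, .doc, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature, used by OOXML packages.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    // Peeks at up to 8 bytes and rewinds, so the caller can still hand
    // the untouched stream to whichever extractor it chooses.
    static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(OLE2_MAGIC.length);
        byte[] head = new byte[OLE2_MAGIC.length];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // stream shorter than the signature
            }
            n += r;
        }
        in.reset();

        if (n == head.length && Arrays.equals(head, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (n >= ZIP_MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, ZIP_MAGIC.length), ZIP_MAGIC)) {
            return Kind.OOXML_ZIP;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) throws IOException {
        byte[] zipLike = { 'P', 'K', 3, 4, 0, 0, 0, 0 };
        System.out.println(sniff(new ByteArrayInputStream(zipLike)));
    }
}
```

In real code you would not need any of this: passing the File or InputStream to ExtractorFactory, as the page describes, leaves the detection to POI. A ZIP signature alone also does not prove a file is OOXML, so treat the OOXML_ZIP result as a first guess only.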