Posted to commits@poi.apache.org by ni...@apache.org on 2014/08/05 00:51:25 UTC
svn commit: r1615818 -
/poi/site/src/documentation/content/xdocs/text-extraction.xml
Author: nick
Date: Mon Aug 4 22:51:25 2014
New Revision: 1615818
URL: http://svn.apache.org/r1615818
Log:
Since 3.5 was so long ago now, update the docs to show the ooxml text extractors as standard
Modified:
poi/site/src/documentation/content/xdocs/text-extraction.xml
Modified: poi/site/src/documentation/content/xdocs/text-extraction.xml
URL: http://svn.apache.org/viewvc/poi/site/src/documentation/content/xdocs/text-extraction.xml?rev=1615818&r1=1615817&r2=1615818&view=diff
==============================================================================
--- poi/site/src/documentation/content/xdocs/text-extraction.xml (original)
+++ poi/site/src/documentation/content/xdocs/text-extraction.xml Mon Aug 4 22:51:25 2014
@@ -59,16 +59,15 @@
<em>org.apache.poi.POIOLE2TextExtractor</em>. This additionally
provides common methods to get at the <link href="hpfs/">HPSF
document metadata</link>.</p>
- <p>All OOXML based text extractors (available in POI 3.5 and later)
- also extend from
+ <p>All OOXML based text extractors also extend from
<em>org.apache.poi.POIOOXMLTextExtractor</em>. This additionally
provides common methods to get at the OOXML metadata.</p>
</section>
<section><title>Text Extractor Factory</title>
- <p>As part of the addition of OOXML support in Apache POI 3.5, there
- is a common class to select the appropriate POI text extractor for
- you. <em>org.apache.poi.extractor.ExtractorFactory</em> provides a
+ <p>POI provides a common class to select the appropriate text extractor
+ for you, based on the supplied document's contents.
+ <em>org.apache.poi.extractor.ExtractorFactory</em> provides a
similar function to WorkbookFactory. You simply pass it an
InputStream, a File, a POIFSFileSystem or an OOXML Package. It
figures out the correct text extractor for you, and returns it.</p>
@@ -81,16 +80,19 @@
<p>For .xls files, there is
<em>org.apache.poi.hssf.extractor.ExcelExtractor</em>, which will
return text, optionally with formulas instead of their contents.
- Those using POI 3.5 can also use
- <em>org.apache.poi.xssf.extractor.XSSFExcelExtractor</em>, to perform
- a similar task for .xlsx files.</p>
- <p>In addition, there is a second text extractor for .xls files,
- <em>org.apache.poi.hssf.extractor.EventBasedExcelExtractor</em>. This
- is based on the streaming EventUserModel code, and will generally
+ Similarly, for .xlsx files there is
+ <em>org.apache.poi.xssf.extractor.XSSFExcelExtractor</em>, which
+ provides the same functionality.</p>
+ <p>For those working with a constrained memory footprint, there are
+ two more Excel text extractors available. For .xls files, there is
+ <em>org.apache.poi.hssf.extractor.EventBasedExcelExtractor</em>, which is
+ based on the streaming EventUserModel code, and will generally
deliver a lower memory footprint for extraction. However, it will
have problems correctly outputting more complex formulas, as it
works with records as they pass, and so doesn't have access to all
- parts of complex and shared formulas.</p>
+ parts of complex and shared formulas. For .xlsx files the equivalent is
+ <em>org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor</em>,
+ which is based on the XSSF SAX Event codebase.</p>
</section>
<section><title>Word</title>
@@ -100,18 +102,16 @@
<p>Those using POI 3.7 can also extract simple textual content from
older Word 6 and Word 95 files, using the scratchpad class
<em>org.apache.poi.hwpf.extractor.Word6Extractor</em>.</p>
- <p>Since POI 3.5, it is possible to use
- <em>org.apache.poi.xwpf.extractor.XPFFWordExtractor</em>, to perform
- text extraction for .docx files.</p>
+ <p>For .docx files, the relevant class is
+ <em>org.apache.poi.xwpf.extractor.XWPFWordExtractor</em>.</p>
</section>
<section><title>PowerPoint</title>
<p>For .ppt files, in scratchpad there is
<em>org.apache.poi.hslf.extractor.PowerPointExtractor</em>, which
will return text for your slideshow, optionally restricted to just
- slides text or notes text. Those using POI 3.5 can also use
- <em>org.apache.poi.xslf.extractor.XSLFPowerPointExtractor</em>, to
- perform a similar task for .pptx files.</p>
+ slides text or notes text. For .pptx files, the class to use is
+ <em>org.apache.poi.xslf.extractor.XSLFPowerPointExtractor</em>.</p>
</section>
<section><title>Publisher</title>
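
Note on the "Text Extractor Factory" paragraph in the diff above: it says the factory picks an extractor "based on the supplied document's contents". The sketch below illustrates that general idea using only the JDK. It is not POI's implementation and calls no POI classes; the names FormatSniffer, Kind, classify and sniff are hypothetical. It relies on two well-known signatures: OLE2 compound documents (.xls, .doc, .ppt) start with the bytes D0 CF 11 E0 A1 B1 1A E1, and OOXML files (.xlsx, .docx, .pptx) are ZIP packages that start with "PK" 03 04.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper, not part of POI: guesses the container format from leading bytes. */
public class FormatSniffer {
    public enum Kind { OLE2, ZIP, UNKNOWN }

    // Signature found at the start of every OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature; OOXML packages are ZIP archives.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    /** Classifies a header that was already read into memory. */
    public static Kind classify(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Kind.OLE2;
        // A ZIP signature only means "possibly OOXML"; other ZIP files match too.
        if (startsWith(header, ZIP_MAGIC)) return Kind.ZIP;
        return Kind.UNKNOWN;
    }

    /** Reads up to 8 bytes from the stream and classifies them. Consumes those bytes. */
    public static Kind sniff(InputStream in) throws IOException {
        byte[] buf = new byte[8];
        int n = 0;
        while (n < buf.length) {
            int r = in.read(buf, n, buf.length - n);
            if (r < 0) break;          // end of stream before 8 bytes
            n += r;
        }
        return classify(Arrays.copyOf(buf, n));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A caller would map Kind.OLE2 to the HSSF/HWPF/HSLF extractors and Kind.ZIP to the OOXML ones after inspecting the package contents. In practice ExtractorFactory does this selection for you, so the sketch is only meant to show why no file extension is needed.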