Posted to commits@poi.apache.org by fa...@apache.org on 2021/10/14 19:15:15 UTC
svn commit: r1894264 - in /poi/site/src/documentation/content/xdocs:
components/configuration.xml site.xml
Author: fanningpj
Date: Thu Oct 14 19:15:15 2021
New Revision: 1894264
URL: http://svn.apache.org/viewvc?rev=1894264&view=rev
Log:
config options
Added:
poi/site/src/documentation/content/xdocs/components/configuration.xml
- copied, changed from r1894261, poi/site/src/documentation/content/xdocs/text-extraction.xml
Modified:
poi/site/src/documentation/content/xdocs/site.xml
Copied: poi/site/src/documentation/content/xdocs/components/configuration.xml (from r1894261, poi/site/src/documentation/content/xdocs/text-extraction.xml)
URL: http://svn.apache.org/viewvc/poi/site/src/documentation/content/xdocs/components/configuration.xml?p2=poi/site/src/documentation/content/xdocs/components/configuration.xml&p1=poi/site/src/documentation/content/xdocs/text-extraction.xml&r1=1894261&r2=1894264&rev=1894264&view=diff
==============================================================================
--- poi/site/src/documentation/content/xdocs/text-extraction.xml (original)
+++ poi/site/src/documentation/content/xdocs/components/configuration.xml Thu Oct 14 19:15:15 2021
@@ -21,157 +21,92 @@
<document>
<header>
- <title>Apache POI - Text Extraction</title>
+ <title>Apache POI - Configuration</title>
<authors>
- <person id="NB" name="Nick Burch" email="nick@apache.org"/>
+ <person id="POI" name="POI Developers" email="dev@poi.apache.org"/>
</authors>
</header>
<body>
<section><title>Overview</title>
- <p>For a number of years now, Apache POI has provided basic
- text extraction for all the project supported file formats. In
- addition, as well as the (plain) text, these provides access to
- the metadata associated with a given file, such as title and
- author.</p>
- <p>For more advanced text extraction needs, including Rich Text
- extraction (such as formatting and styling), along with XML and
- HTML output, Apache POI works closely with
- <a href="https://tika.apache.org/">Apache Tika</a> to deliver
- POI-powered Tika Parsers for all the project supported file formats.</p>
- <p>If you are after turn-key text extraction, including the latest
- support, styles etc, you are strongly advised to make use of
- <a href="https://tika.apache.org/">Apache Tika</a>, which builds
- on top of POI to provide Text and Metadata extraction. If you wish
- to have something very simple and stand-alone, or you wish to make
- heavy modifications, then the POI provided text extractors documented
- below might be a better fit for your needs.</p>
- </section>
-
- <section><title>Common functionality</title>
- <p>All of the POI text extractors extend from
- <em>org.apache.poi.extractor.POITextExtractor</em>. This provides a common
- method across all extractors, getText(). For many cases, the text
- returned will be all you need. However, many extractors do provide
- more targeted text extraction methods, so you may wish to use
- these in some cases.</p>
- <p>All POIFS / OLE 2 based text extractors also extend from
- <em>org.apache.poi.extractor.POIOLE2TextExtractor</em>. This additionally
- provides common methods to get at the <a href="hpfs/">HPFS
- document metadata</a>.</p>
- <p>All OOXML based text extractors also extend from
- <em>org.apache.poi.POIOOXMLTextExtractor</em>. This additionally
- provides common methods to get at the OOXML metadata.</p>
- </section>
-
- <section><title>Text Extractor Factory</title>
- <p>POI provides a a common class to select the appropriate text extractor
- for you, based on the supplied document's contents.
- <em>ExtractorFactory</em> provides a
- similar function to WorkbookFactory. You simply pass it an
- InputStream, a File, a POIFSFileSystem or a OOXML Package. It
- figures out the correct text extractor for you, and returns it.</p>
- <p>For complete detection and text extractor auto-selection, users
- are strongly encouraged to investigate
- <a href="https://tika.apache.org/">Apache Tika</a>.</p>
- </section>
-
- <section><title>Excel</title>
- <p>For .xls files, there is
- <em>org.apache.poi.hssf.extractor.ExcelExtractor</em>, which will
- return text, optionally with formulas instead of their contents.
- Similarly, for .xlsx files there is
- <em>org.apache.poi.xssf.extractor.XSSFExcelExtractor</em>, which
- provides the same functionality.</p>
- <p>For those working in constrained memory footprints, there are
- two more Excel text extractors available. For .xls files, it's
- <em>org.apache.poi.hssf.extractor.EventBasedExcelExtractor</em>,
- based on the streaming EventUserModel code, and will generally
- deliver a lower memory footprint for extraction. However, it will
- have problems correctly outputting more complex formulas, as it
- works with records as they pass, and so doesn't have access to all
- parts of complex and shared formulas. For .xlsx files the equivalent is
- <em>org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor</em>,
- which is based on the XSSF SAX Event codebase.</p>
- </section>
-
- <section><title>Word</title>
- <p>For .doc files from Word 97 - Word 2003, in scratchpad there is
- <em>org.apache.poi.hwpf.extractor.WordExtractor</em>, which will
- return text for your document.</p>
- <p>Those using POI 3.7 can also extract simple textual content from
- older Word 6 and Word 95 files, using the scratchpad class
- <em>org.apache.poi.hwpf.extractor.Word6Extractor</em>.</p>
- <p>For .docx files, the relevant class is
- <em>org.apache.poi.xwpf.extractor.XPFFWordExtractor</em></p>
- </section>
-
- <section><title>PowerPoint</title>
- <p>For .ppt and .pptx files, there is common extractor
- <em>org.apache.poi.sl.extractor.SlideShowExtractor.SlideShowExtractor</em>, which
- will return text for your slideshow, optionally restricted to just
- slides text or notes text. For .ppt you need to add the poi-scratchpad.jar
- and for .pptx the poi-ooxml.jar and its dependencies are needed</p>
- </section>
-
- <section><title>Publisher</title>
- <p>For .pub files, in scratchpad there is
- <em>org.apache.poi.hpbf.extractor.PublisherExtractor</em>, which
- will return text for your file.</p>
- </section>
-
- <section><title>Visio</title>
- <p>For .vsd files, in scratchpad there is
- <em>org.apache.poi.hdgf.extractor.VisioTextExtractor</em>, which
- will return text for your file.</p>
- </section>
-
- <section><title>Embedded Objects</title>
- <p>Extractors already exist for Excel, Word, PowerPoint and Visio;
- if one of these objects is embedded into a worksheet, the ExtractorFactory class can be used to recover an extractor for it.
+ <p>The best way to learn about using Apache POI is to read through the <a href="index.html">feature documentation</a>
+ and other examples available online.
</p>
- <source>
-FileInputStream fis = new FileInputStream(inputFile);
-POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
-// Firstly, get an extractor for the Workbook
-POIOLE2TextExtractor oleTextExtractor =
- ExtractorFactory.createExtractor(fileSystem);
-// Then a List of extractors for any embedded Excel, Word, PowerPoint
-// or Visio objects embedded into it.
-POITextExtractor[] embeddedExtractors =
- ExtractorFactory.getEmbededDocsTextExtractors(oleTextExtractor);
-for (POITextExtractor textExtractor : embeddedExtractors) {
- // If the embedded object was an Excel spreadsheet.
- if (textExtractor instanceof ExcelExtractor) {
- ExcelExtractor excelExtractor = (ExcelExtractor) textExtractor;
- System.out.println(excelExtractor.getText());
- }
- // A Word Document
- else if (textExtractor instanceof WordExtractor) {
- WordExtractor wordExtractor = (WordExtractor) textExtractor;
- String[] paragraphText = wordExtractor.getParagraphText();
- for (String paragraph : paragraphText) {
- System.out.println(paragraph);
- }
- // Display the document's header and footer text
- System.out.println("Footer text: " + wordExtractor.getFooterText());
- System.out.println("Header text: " + wordExtractor.getHeaderText());
- }
- // PowerPoint Presentation.
- else if (textExtractor instanceof PowerPointExtractor) {
- PowerPointExtractor powerPointExtractor =
- (PowerPointExtractor) textExtractor;
- System.out.println("Text: " + powerPointExtractor.getText());
- System.out.println("Notes: " + powerPointExtractor.getNotes());
- }
- // Visio Drawing
- else if (textExtractor instanceof VisioTextExtractor) {
- VisioTextExtractor visioTextExtractor =
- (VisioTextExtractor) textExtractor;
- System.out.println("Text: " + visioTextExtractor.getText());
- }
-}
- </source>
+ <p>To keep the feature documentation focused on the APIs, it says little about the configuration
+ settings that can prove useful to users who have to handle very large documents or very
+ high throughput.
+ </p>
+ <table>
+ <tr>
+ <th>Configuration Setting</th>
+ <th>Description</th>
+ </tr>
+
+ <tr>
+ <td><a href="https://poi.apache.org/apidocs/5.0/org/apache/poi/openxml4j/util/ZipSecureFile.html#setMinInflateRatio-double-">
+ org.apache.poi.openxml4j.util.ZipSecureFile.setMinInflateRatio(double ratio)</a>
+ </td>
+ <td>Sets the minimum allowed ratio of deflated (compressed) to inflated (uncompressed) bytes, which is used to detect zip bombs.
+ It defaults to 1% (= 0.01d), i.e. when any package part that is read compresses to less than 1% of its inflated size, parsing fails and reports a zip bomb.
+ </td>
+ </tr>
+
+ <tr>
+ <td><a href="https://poi.apache.org/apidocs/5.0/org/apache/poi/openxml4j/util/ZipSecureFile.html#setMaxEntrySize-long-">
+ org.apache.poi.openxml4j.util.ZipSecureFile.setMaxEntrySize(long maxEntrySize)</a>
+ </td>
+ <td>Sets the maximum file size of a single zip entry. It defaults to 4GB, i.e. the 32-bit zip format maximum.
+ This can be used to limit memory consumption and protect against security vulnerabilities when documents are provided by users.
+ POI 5.1.0 removes the previous limit of 4GB on this setting.
+ </td>
+ </tr>
+
+ <tr>
+ <td><a href="https://poi.apache.org/apidocs/5.0/org/apache/poi/openxml4j/util/ZipSecureFile.html#setMaxTextSize-long-">
+ org.apache.poi.openxml4j.util.ZipSecureFile.setMaxTextSize(long maxTextSize)</a>
+ </td>
+ <td>Sets the maximum number of characters that can be extracted from a document before text extraction throws an exception.
+ This can be used to limit memory consumption and protect against security vulnerabilities when documents are provided by users.
+ The default is approx 10 million chars. Prior to POI 5.1.0, the max allowed was approx 4 billion chars.
+ </td>
+ </tr>
+
+ <tr>
+ <td>org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(int thresholdBytes)
+ </td>
+ <td><strong>Coming in POI 5.1.0.</strong>
+ Number of bytes at which a zip entry is regarded as too large to hold in memory,
+ so that its data is put in a temp file instead. Defaults to -1, meaning temp files are not used
+ and zip entries with more than 2GB of data after decompression will fail; 0 means all
+ zip entries are stored in temp files. A threshold like 50000000 (approx 50MB) is recommended.
+ </td>
+ </tr>
+
+ <tr>
+ <td>org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.setEncryptTempFiles(boolean encrypt)
+ </td>
+ <td><strong>Coming in POI 5.1.0.</strong>
+ Whether temp files should be encrypted (default false). Only affects temp files related to zip entries.
+ </td>
+ </tr>
+
+ <tr>
+ <td>org.apache.poi.openxml4j.opc.ZipPackage.setUseTempFilePackageParts(boolean tempFilePackageParts)
+ </td>
+ <td><strong>Coming in POI 5.1.0.</strong>
+ Whether to save package part data in temp files to save memory (default=false).
+ </td>
+ </tr>
+
+ <tr>
+ <td>org.apache.poi.openxml4j.opc.ZipPackage.setEncryptTempFilePackageParts(boolean encryptTempFiles)
+ </td>
+ <td><strong>Coming in POI 5.1.0.</strong>
+ Whether to encrypt package part temp files (default=false).
+ </td>
+ </tr>
+
+ </table>
</section>
</body>
Modified: poi/site/src/documentation/content/xdocs/site.xml
URL: http://svn.apache.org/viewvc/poi/site/src/documentation/content/xdocs/site.xml?rev=1894264&r1=1894263&r2=1894264&view=diff
==============================================================================
--- poi/site/src/documentation/content/xdocs/site.xml (original)
+++ poi/site/src/documentation/content/xdocs/site.xml Thu Oct 14 19:15:15 2021
@@ -111,6 +111,7 @@ See http://xml.apache.org/forrest/linkin
<hmef label="TNEF (HMEF) for winmail.dat" href="hmef/index.html"/>
<oxml4j label="OpenXML4J (OOXML)" href="oxml4j/index.html"/>
<log label="Logging framework" href="logging.html"/>
+ <config label="Configuration" href="configuration.html"/>
</components>
<help label="Help" tab="help" href="help/">
<mailinglists label="Mailing Lists" href="index.html"/>
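
For readers of the new configuration.xml table, the inflate-ratio idea behind ZipSecureFile.setMinInflateRatio can be illustrated with the JDK alone. The following is only a rough sketch of the concept, not POI's implementation: the class InflateRatioSketch and its methods deflatedSize and looksLikeZipBomb are hypothetical names introduced here, and the comparison is assumed to be "deflated bytes / inflated bytes below the minimum ratio", as the table's description suggests. Only the 0.01d default comes from the commit.

```java
import java.util.zip.Deflater;

// Hypothetical illustration only; this is NOT how POI implements the check.
class InflateRatioSketch {
    // Default minimum ratio quoted in the configuration table (1% = 0.01d).
    static final double DEFAULT_MIN_INFLATE_RATIO = 0.01d;

    // Compresses the data with java.util.zip.Deflater and returns the compressed size in bytes.
    public static long deflatedSize(byte[] data) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(data);
            deflater.finish();
            byte[] buffer = new byte[8192];
            long total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            return total;
        } finally {
            deflater.end();
        }
    }

    // Assumed rule: flag an entry when deflated/inflated falls below the minimum ratio.
    public static boolean looksLikeZipBomb(long deflatedBytes, long inflatedBytes, double minInflateRatio) {
        if (inflatedBytes <= 0) {
            return false; // nothing was inflated, so there is no ratio to check
        }
        return ((double) deflatedBytes / inflatedBytes) < minInflateRatio;
    }

    public static void main(String[] args) {
        // One million zero bytes compress extremely well, which is the pattern a zip bomb exploits.
        byte[] zeros = new byte[1_000_000];
        long deflated = deflatedSize(zeros);
        System.out.println("deflated bytes: " + deflated);
        System.out.println("flagged: " + looksLikeZipBomb(deflated, zeros.length, DEFAULT_MIN_INFLATE_RATIO));
    }
}
```

Under this reading, lowering the ratio passed to setMinInflateRatio would let more highly compressible (but legitimate) documents through, at the cost of weaker protection against untrusted input.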