Posted to commits@poi.apache.org by fa...@apache.org on 2021/10/14 19:15:15 UTC

svn commit: r1894264 - in /poi/site/src/documentation/content/xdocs: components/configuration.xml site.xml

Author: fanningpj
Date: Thu Oct 14 19:15:15 2021
New Revision: 1894264

URL: http://svn.apache.org/viewvc?rev=1894264&view=rev
Log:
config options

Added:
    poi/site/src/documentation/content/xdocs/components/configuration.xml
      - copied, changed from r1894261, poi/site/src/documentation/content/xdocs/text-extraction.xml
Modified:
    poi/site/src/documentation/content/xdocs/site.xml

Copied: poi/site/src/documentation/content/xdocs/components/configuration.xml (from r1894261, poi/site/src/documentation/content/xdocs/text-extraction.xml)
URL: http://svn.apache.org/viewvc/poi/site/src/documentation/content/xdocs/components/configuration.xml?p2=poi/site/src/documentation/content/xdocs/components/configuration.xml&p1=poi/site/src/documentation/content/xdocs/text-extraction.xml&r1=1894261&r2=1894264&rev=1894264&view=diff
==============================================================================
--- poi/site/src/documentation/content/xdocs/text-extraction.xml (original)
+++ poi/site/src/documentation/content/xdocs/components/configuration.xml Thu Oct 14 19:15:15 2021
@@ -21,157 +21,92 @@
 
 <document>
   <header>
-    <title>Apache POI - Text Extraction</title>
+    <title>Apache POI - Configuration</title>
     <authors>
-      <person id="NB" name="Nick Burch" email="nick@apache.org"/>
+      <person id="POI" name="POI Developers" email="dev@poi.apache.org"/>
     </authors>
   </header>
   
   <body>
     <section><title>Overview</title>
-      <p>For a number of years now, Apache POI has provided basic 
-       text extraction for all the project supported file formats. In 
-       addition, as well as the (plain) text, these provides access to 
-       the metadata associated with a given file, such as title and 
-       author.</p>
-      <p>For more advanced text extraction needs, including Rich Text
-       extraction (such as formatting and styling), along with XML and
-       HTML output, Apache POI works closely with 
-       <a href="https://tika.apache.org/">Apache Tika</a> to deliver 
-       POI-powered Tika Parsers for all the project supported file formats.</p>
-      <p>If you are after turn-key text extraction, including the latest
-       support, styles etc, you are strongly advised to make use of 
-       <a href="https://tika.apache.org/">Apache Tika</a>, which builds 
-       on top of POI to provide Text and Metadata extraction. If you wish
-       to have something very simple and stand-alone, or you wish to make
-       heavy modifications, then the POI provided text extractors documented
-       below might be a better fit for your needs.</p>
-    </section>
-
-    <section><title>Common functionality</title>
-     <p>All of the POI text extractors extend from
-      <em>org.apache.poi.extractor.POITextExtractor</em>. This provides a common
-      method across all extractors, getText(). For many cases, the text
-      returned will be all you need. However, many extractors do provide
-      more targeted text extraction methods, so you may wish to use
-      these in some cases.</p>
-     <p>All POIFS / OLE 2 based text extractors also extend from
-      <em>org.apache.poi.extractor.POIOLE2TextExtractor</em>. This additionally
-      provides common methods to get at the <a href="hpfs/">HPFS
-      document metadata</a>.</p>
-     <p>All OOXML based text extractors also extend from
-      <em>org.apache.poi.POIOOXMLTextExtractor</em>. This additionally
-      provides common methods to get at the OOXML metadata.</p>
-    </section>
-
-    <section><title>Text Extractor Factory</title>
-     <p>POI provides a a common class to select the appropriate text extractor 
-      for you, based on the supplied document's contents. 
-      <em>ExtractorFactory</em> provides a
-      similar function to WorkbookFactory. You simply pass it an
-      InputStream, a File, a POIFSFileSystem or a OOXML Package. It
-      figures out the correct text extractor for you, and returns it.</p>
-     <p>For complete detection and text extractor auto-selection, users
-      are strongly encouraged to investigate
-      <a href="https://tika.apache.org/">Apache Tika</a>.</p>
-    </section>
-
-    <section><title>Excel</title>
-     <p>For .xls files, there is 
-      <em>org.apache.poi.hssf.extractor.ExcelExtractor</em>, which will 
-      return text, optionally with formulas instead of their contents. 
-      Similarly, for .xlsx files there is
-      <em>org.apache.poi.xssf.extractor.XSSFExcelExtractor</em>, which 
-      provides the same functionality.</p>
-     <p>For those working in constrained memory footprints, there are
-      two more Excel text extractors available. For .xls files, it's
-      <em>org.apache.poi.hssf.extractor.EventBasedExcelExtractor</em>,
-      based on the streaming EventUserModel code, and will generally
-      deliver a lower memory footprint for extraction. However, it will
-      have problems correctly outputting more complex formulas, as it 
-      works with records as they pass, and so doesn't have access to all
-      parts of complex and shared formulas. For .xlsx files the equivalent is
-      <em>org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor</em>, 
-      which is based on the XSSF SAX Event codebase.</p>
-    </section>
-
-    <section><title>Word</title>
-     <p>For .doc files from Word 97 - Word 2003, in scratchpad there is 
-      <em>org.apache.poi.hwpf.extractor.WordExtractor</em>, which will 
-      return text for your document.</p>
-     <p>Those using POI 3.7 can also extract simple textual content from
-      older Word 6 and Word 95 files, using the scratchpad class
-      <em>org.apache.poi.hwpf.extractor.Word6Extractor</em>.</p>
-     <p>For .docx files, the relevant class is 
-      <em>org.apache.poi.xwpf.extractor.XPFFWordExtractor</em></p>
-    </section>
-
-    <section><title>PowerPoint</title>
-     <p>For .ppt and .pptx files, there is common extractor
-      <em>org.apache.poi.sl.extractor.SlideShowExtractor.SlideShowExtractor</em>, which
-      will return text for your slideshow, optionally restricted to just
-      slides text or notes text. For .ppt you need to add the poi-scratchpad.jar
-      and for .pptx the poi-ooxml.jar and its dependencies are needed</p>
-    </section>
-
-    <section><title>Publisher</title>
-     <p>For .pub files, in scratchpad there is 
-      <em>org.apache.poi.hpbf.extractor.PublisherExtractor</em>, which 
-      will return text for your file.</p>
-    </section>
-
-    <section><title>Visio</title>
-     <p>For .vsd files, in scratchpad there is 
-      <em>org.apache.poi.hdgf.extractor.VisioTextExtractor</em>, which 
-      will return text for your file.</p>
-    </section>
-
-    <section><title>Embedded Objects</title>
-      <p>Extractors already exist for Excel, Word, PowerPoint and Visio; 
-        if one of these objects is embedded into a worksheet, the ExtractorFactory class can be used to recover an extractor for it.     
+      <p>The best way to learn about using Apache POI is to read through the <a href="index.html">feature documentation</a>
+          and other examples online.
       </p>
-      <source>
-FileInputStream fis = new FileInputStream(inputFile);
-POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
-// Firstly, get an extractor for the Workbook
-POIOLE2TextExtractor oleTextExtractor = 
-   ExtractorFactory.createExtractor(fileSystem);
-// Then a List of extractors for any embedded Excel, Word, PowerPoint
-// or Visio objects embedded into it.
-POITextExtractor[] embeddedExtractors =
-   ExtractorFactory.getEmbededDocsTextExtractors(oleTextExtractor);
-for (POITextExtractor textExtractor : embeddedExtractors) {
-   // If the embedded object was an Excel spreadsheet.
-   if (textExtractor instanceof ExcelExtractor) {
-      ExcelExtractor excelExtractor = (ExcelExtractor) textExtractor;
-      System.out.println(excelExtractor.getText());
-   }
-   // A Word Document
-   else if (textExtractor instanceof WordExtractor) {
-      WordExtractor wordExtractor = (WordExtractor) textExtractor;
-      String[] paragraphText = wordExtractor.getParagraphText();
-      for (String paragraph : paragraphText) {
-         System.out.println(paragraph);
-      }
-      // Display the document's header and footer text
-      System.out.println("Footer text: " + wordExtractor.getFooterText());
-      System.out.println("Header text: " + wordExtractor.getHeaderText());
-   }
-   // PowerPoint Presentation.
-   else if (textExtractor instanceof PowerPointExtractor) {
-      PowerPointExtractor powerPointExtractor =
-         (PowerPointExtractor) textExtractor;
-      System.out.println("Text: " + powerPointExtractor.getText());
-      System.out.println("Notes: " + powerPointExtractor.getNotes());
-   }
-   // Visio Drawing
-   else if (textExtractor instanceof VisioTextExtractor) {
-      VisioTextExtractor visioTextExtractor = 
-         (VisioTextExtractor) textExtractor;
-      System.out.println("Text: " + visioTextExtractor.getText());
-   }
-}
-      </source>
+      <p>To keep the feature documentation focused on the APIs, it makes little mention of the configuration
+          settings that may prove useful to users who have to handle very large documents or very
+          high throughput.
+      </p>
+      <table>
+        <tr>
+          <th>Configuration Setting</th>
+          <th>Description</th>
+        </tr>
+
+        <tr>
+          <td><a href="https://poi.apache.org/apidocs/5.0/org/apache/poi/openxml4j/util/ZipSecureFile.html#setMinInflateRatio-double-">
+              org.apache.poi.openxml4j.util.ZipSecureFile.setMinInflateRatio(double ratio)</a>
+          </td>
+          <td>Sets the minimum ratio of deflated (compressed) to inflated (uncompressed) bytes, which is used to detect zip bombs.
+              It defaults to 1% (= 0.01d), i.e. when the compression is better than 1% for any given read package part, parsing fails with an error indicating a zip bomb.
+          </td>
+        </tr>
+
+        <tr>
+          <td><a href="https://poi.apache.org/apidocs/5.0/org/apache/poi/openxml4j/util/ZipSecureFile.html#setMaxEntrySize-long-">
+            org.apache.poi.openxml4j.util.ZipSecureFile.setMaxEntrySize(long maxEntrySize)</a>
+          </td>
+          <td>Sets the maximum file size of a single zip entry. It defaults to 4GB, i.e. the 32-bit zip format maximum.
+            This can be used to limit memory consumption and protect against security vulnerabilities when documents are provided by users.
+            POI 5.1.0 removes the previous limit of 4GB on this setting.
+          </td>
+        </tr>
+
+        <tr>
+          <td><a href="https://poi.apache.org/apidocs/5.0/org/apache/poi/openxml4j/util/ZipSecureFile.html#setMaxTextSize-long-">
+            org.apache.poi.openxml4j.util.ZipSecureFile.setMaxTextSize(long maxTextSize)</a>
+          </td>
+          <td>Sets the maximum number of characters of text that can be extracted from a document before an exception is thrown.
+            This can be used to limit memory consumption and protect against security vulnerabilities when documents are provided by users.
+            The default is approx 10 million chars. Prior to POI 5.1.0, the max allowed was approx 4 billion chars.
+          </td>
+        </tr>
+
+        <tr>
+          <td>org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(int thresholdBytes)
+          </td>
+          <td><strong>Coming in POI 5.1.0.</strong>
+            Number of bytes at which a zip entry is regarded as too large to hold in memory
+            and its data is put in a temp file instead. Defaults to -1, meaning temp files are not used
+            and zip entries with more than 2GB of data after decompressing will fail; 0 means all
+            zip entries are stored in temp files. A threshold like 50000000 (approx 50MB) is recommended.
+          </td>
+        </tr>
+
+        <tr>
+          <td>org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.setEncryptTempFiles(boolean encrypt)
+          </td>
+          <td><strong>Coming in POI 5.1.0.</strong>
+            Whether temp files should be encrypted (default false). Only affects temp files related to zip entries.
+          </td>
+        </tr>
+
+        <tr>
+          <td>org.apache.poi.openxml4j.opc.ZipPackage.setUseTempFilePackageParts(boolean tempFilePackageParts)
+          </td>
+          <td><strong>Coming in POI 5.1.0.</strong>
+            Whether to save package part data in temp files to save memory (default=false).
+          </td>
+        </tr>
+
+        <tr>
+          <td>org.apache.poi.openxml4j.opc.ZipPackage.setEncryptTempFilePackageParts(boolean encryptTempFiles)
+          </td>
+          <td><strong>Coming in POI 5.1.0.</strong>
+            Whether to encrypt package part temp files (default=false).
+          </td>
+        </tr>
+
+      </table>
     </section>
   </body>
 

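For readers of this commit, the settings listed in the new configuration.xml table might be applied once at application start-up, roughly as sketched below. This is a minimal sketch, not taken from the commit: it assumes POI 5.1.0 or later with poi-ooxml on the classpath and that the listed setters are static, process-wide settings. The class name PoiLargeFileConfig and every numeric value are illustrative assumptions only.

```java
import org.apache.poi.openxml4j.opc.ZipPackage;
import org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource;
import org.apache.poi.openxml4j.util.ZipSecureFile;

// Hypothetical helper class; the name and all values below are assumptions.
final class PoiLargeFileConfig {

    private PoiLargeFileConfig() {
    }

    /** Applies the settings from the table; call once before opening documents. */
    static void apply() {
        // Zip bomb detection: 0.01d is the default quoted in the table.
        ZipSecureFile.setMinInflateRatio(0.01d);

        // Illustrative cap of 500 MiB per zip entry (the table gives a 4GB default).
        ZipSecureFile.setMaxEntrySize(500L * 1024 * 1024);

        // Illustrative cap on extracted text, close to the approx 10 million chars default.
        ZipSecureFile.setMaxTextSize(10_000_000L);

        // POI 5.1.0 and later: spill zip entries of about 50MB or more to temp files,
        // using the threshold the table recommends.
        ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(50_000_000);
        ZipInputStreamZipEntrySource.setEncryptTempFiles(true);

        // POI 5.1.0 and later: keep package part data in (encrypted) temp files to save memory.
        ZipPackage.setUseTempFilePackageParts(true);
        ZipPackage.setEncryptTempFilePackageParts(true);
    }
}
```

Turning on temp files trades memory for disk I/O, so these switches are probably only worth enabling for very large documents or very high throughput, as the overview text says.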
Modified: poi/site/src/documentation/content/xdocs/site.xml
URL: http://svn.apache.org/viewvc/poi/site/src/documentation/content/xdocs/site.xml?rev=1894264&r1=1894263&r2=1894264&view=diff
==============================================================================
--- poi/site/src/documentation/content/xdocs/site.xml (original)
+++ poi/site/src/documentation/content/xdocs/site.xml Thu Oct 14 19:15:15 2021
@@ -111,6 +111,7 @@ See http://xml.apache.org/forrest/linkin
         <hmef label="TNEF (HMEF) for winmail.dat" href="hmef/index.html"/>
         <oxml4j label="OpenXML4J (OOXML)" href="oxml4j/index.html"/>
         <log label="Logging framework" href="logging.html"/>
+        <config label="Configuration" href="configuration.html"/>
     </components>
     <help label="Help" tab="help" href="help/">
         <mailinglists label="Mailing Lists" href="index.html"/>
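
The inflate-ratio check described for setMinInflateRatio in the configuration.xml diff above can be illustrated without POI. The sketch below is a simplified stand-in using only java.util.zip, not POI's actual implementation (which may apply further conditions): it compresses a highly repetitive buffer and compares compressed size to inflated size against the 1% default. The names InflateRatioSketch, deflate and looksLikeZipBomb are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Hypothetical illustration of a compressed/inflated ratio check; not POI code.
final class InflateRatioSketch {

    /** The default threshold quoted in the documentation (1%). */
    static final double DEFAULT_MIN_INFLATE_RATIO = 0.01d;

    private InflateRatioSketch() {
    }

    /** Compresses the input with java.util.zip.Deflater and returns the compressed bytes. */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int written = deflater.deflate(buffer);
            out.write(buffer, 0, written);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Returns true when compressed/inflated falls below the minimum ratio. */
    static boolean looksLikeZipBomb(long compressedBytes, long inflatedBytes, double minInflateRatio) {
        if (inflatedBytes <= 0) {
            return false; // nothing was inflated, so there is no ratio to check
        }
        double ratio = (double) compressedBytes / inflatedBytes;
        return ratio < minInflateRatio;
    }

    public static void main(String[] args) {
        byte[] zeros = new byte[1_000_000]; // one million zero bytes compress extremely well
        byte[] packed = deflate(zeros);
        System.out.println("compressed bytes: " + packed.length);
        System.out.println("flagged: "
                + looksLikeZipBomb(packed.length, zeros.length, DEFAULT_MIN_INFLATE_RATIO));
    }
}
```

Legitimate documents with long runs of identical data can trip a check like this too, which is presumably why the ratio is configurable rather than fixed.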


