You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2021/12/10 11:57:59 UTC

[GitHub] [drill] dzamo commented on a change in pull request #2359: DRILL-8028: Add PDF Format Plugin

dzamo commented on a change in pull request #2359:
URL: https://github.com/apache/drill/pull/2359#discussion_r766622138



##########
File path: contrib/format-pdf/src/main/java/org/apache/drill/exec/store/pdf/Utils.java
##########
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.store.pdf;
+
+import org.apache.pdfbox.pdmodel.PDDocument;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import technology.tabula.ObjectExtractor;
+import technology.tabula.Page;
+import technology.tabula.PageIterator;
+import technology.tabula.Rectangle;
+import technology.tabula.RectangularTextContainer;
+import technology.tabula.Table;
+import technology.tabula.detectors.NurminenDetectionAlgorithm;
+import technology.tabula.extractors.BasicExtractionAlgorithm;
+import technology.tabula.extractors.ExtractionAlgorithm;
+import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class Utils {
+
+  public static final ExtractionAlgorithm DEFAULT_ALGORITHM = new BasicExtractionAlgorithm();
+  private static final Logger logger = LoggerFactory.getLogger(Utils.class);
+
+  /**
+   * Returns a list of tables found in a given PDF document.  There are several extraction algorithms
+   * available and this function uses the default Basic Extraction Algorithm.
+   * @param document The input PDF document to search for tables
+   * @return A list of tables found in the document.
+   */
+  public static List<Table> extractTablesFromPDF(PDDocument document) {
+    return extractTablesFromPDF(document, DEFAULT_ALGORITHM);
+  }
+
+  public static List<Table> extractTablesFromPDF(PDDocument document, ExtractionAlgorithm algorithm) {
+    System.setProperty("java.awt.headless", "true");
+
+    NurminenDetectionAlgorithm detectionAlgorithm = new NurminenDetectionAlgorithm();
+
+    ExtractionAlgorithm algExtractor;
+
+    SpreadsheetExtractionAlgorithm extractor = new SpreadsheetExtractionAlgorithm();
+
+    ObjectExtractor objectExtractor = new ObjectExtractor(document);
+    PageIterator pages = objectExtractor.extract();
+    List<Table> tables= new ArrayList<>();
+    while (pages.hasNext()) {
+      Page page = pages.next();
+
+      algExtractor = algorithm;
+      List<Rectangle> tablesOnPage = detectionAlgorithm.detect(page);
+
+      for (Rectangle guessRect : tablesOnPage) {
+        Page guess = page.getArea(guessRect);
+        tables.addAll(algExtractor.extract(guess));
+      }
+    }
+
+    try {
+      objectExtractor.close();
+    } catch (Exception e) {
+      logger.debug("Error closing Object extractor.");

Review comment:
       @cgivre I have just added a description of an option like that to the Drill 2.0 wiki page.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@drill.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org