Posted to commits@tika.apache.org by ta...@apache.org on 2020/07/16 14:14:15 UTC

[tika] branch branch_1x updated (941a150 -> d8d4af1)

This is an automated email from the ASF dual-hosted git repository.

tallison pushed a change to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git.


    from 941a150  fix merge conflicts and unit test
     new 6fb39c9  TIKA-3131 -- swap default values of averageCharTolerance and spacingTolerance to match PDFBox defaults (#325)
     new d8d4af1  TIKA-3088 - fix NPE in OpenDocumentContentParser caused by com.sun.org.apache.xml.internal.serializer.ToHTMLStream

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../java/org/apache/tika/parser/odf/OpenDocumentContentParser.java    | 4 ++--
 .../src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java     | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)


[tika] 02/02: TIKA-3088 - fix NPE in OpenDocumentContentParser caused by com.sun.org.apache.xml.internal.serializer.ToHTMLStream

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit d8d4af1650b62952b55cf25560ed64b33ce32e0f
Author: tallison <ta...@apache.org>
AuthorDate: Thu Jul 16 10:12:04 2020 -0400

    TIKA-3088 - fix NPE in OpenDocumentContentParser caused by com.sun.org.apache.xml.internal.serializer.ToHTMLStream
---
 .../java/org/apache/tika/parser/odf/OpenDocumentContentParser.java    | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentContentParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentContentParser.java
index 066f3e9..2e40c68 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentContentParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentContentParser.java
@@ -411,7 +411,7 @@ public class OpenDocumentContentParser extends AbstractParser {
                 // to incoming handler
                 if (TEXT_NS.equals(namespaceURI) && "h".equals(localName)) {
                     final String el = headingStack.pop();
-                    handler.endElement(XHTMLContentHandler.XHTML, el, el);
+                    handler.endElement(namespaceURI, el, el);
                 } else if (TEXT_NS.equals(namespaceURI) && "list".equals(localName)) {
                     endList();
                 } else if (TEXT_NS.equals(namespaceURI) && "span".equals(localName)) {
@@ -422,7 +422,7 @@ public class OpenDocumentContentParser extends AbstractParser {
                 } else if ("annotation".equals(localName) || "note".equals(localName) ||
                         "notes".equals(localName)) {
                         closeStyleTags();
-                        handler.endElement("", localName, localName);
+                        handler.endElement(namespaceURI, localName, localName);
                 } else {
                     super.endElement(namespaceURI, localName, qName);
                 }
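
Note on the change above: the hunk does not show the matching startElement calls, so the following is only one reading of it. Assuming the intent is that each endElement reports the same namespace as the startElement that opened the element, the invariant can be sketched with a small checking handler. BalancedElementChecker, its accepts helper, and the "urn:example:text" namespace are hypothetical names for illustration and are not part of Tika; only the org.xml.sax classes from the JDK are used.

```java
import java.util.ArrayDeque;
import java.util.Deque;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper, not part of Tika: verifies that endElement matches startElement. */
public class BalancedElementChecker extends DefaultHandler {

    // Each open element is remembered as "{namespaceURI}localName".
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(key(uri, localName));
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        String expected = open.isEmpty() ? null : open.pop();
        String actual = key(uri, localName);
        if (!actual.equals(expected)) {
            throw new SAXException(
                    "endElement " + actual + " does not match startElement " + expected);
        }
    }

    private static String key(String uri, String localName) {
        return "{" + (uri == null ? "" : uri) + "}" + localName;
    }

    /** Returns true if a start/end pair using the given namespaces is accepted. */
    public static boolean accepts(String startNs, String endNs) {
        BalancedElementChecker checker = new BalancedElementChecker();
        try {
            checker.startElement(startNs, "note", "note", new AttributesImpl());
            checker.endElement(endNs, "note", "note");
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String ns = "urn:example:text"; // placeholder, not the real TEXT_NS value
        System.out.println(accepts(ns, ns)); // same namespace on both calls: true
        System.out.println(accepts(ns, "")); // empty namespace on endElement: false
    }
}
```

Wrapping a downstream handler with a check of this kind in a unit test is one way to catch mismatched start/end events before they reach a serializer.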


[tika] 01/02: TIKA-3131 -- swap default values of averageCharTolerance and spacingTolerance to match PDFBox defaults (#325)

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 6fb39c9583e04edf72bd19f800b591b1f49c6497
Author: Clark Perkins <cl...@users.noreply.github.com>
AuthorDate: Wed Jul 15 14:08:01 2020 -0500

    TIKA-3131 -- swap default values of averageCharTolerance and spacingTolerance to match PDFBox defaults (#325)
---
 .../src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java     | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
index 9613781..f88ff0f 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
@@ -119,11 +119,11 @@ public class PDFParserConfig implements Serializable {
 
     //The character width-based tolerance value used to estimate where spaces in text should be added
     //Default taken from PDFBox.
-    private Float averageCharTolerance = 0.5f;
+    private Float averageCharTolerance = 0.3f;
 
     //The space width-based tolerance value used to estimate where spaces in text should be added
     //Default taken from PDFBox.
-    private Float spacingTolerance = 0.3f;
+    private Float spacingTolerance = 0.5f;
 
     // The multiplication factor for line height to decide when a new paragraph starts.
     //Default taken from PDFBox.
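
Note on the change above: after this commit the two defaults are swapped, so code that relied on the old values without setting them explicitly may see different word spacing in extracted text. The sketch below uses a hypothetical stand-in class, ToleranceDefaults, that mirrors only the two fields in the diff; the getter and setter names are assumed by analogy with the field names and should be checked against the real PDFParserConfig before use.

```java
/** Hypothetical stand-in mirroring only the two fields touched by TIKA-3131. */
public class ToleranceDefaults {

    // Defaults after the commit, copied from the "+" lines of the diff.
    private Float averageCharTolerance = 0.3f;
    private Float spacingTolerance = 0.5f;

    public Float getAverageCharTolerance() {
        return averageCharTolerance;
    }

    public void setAverageCharTolerance(Float averageCharTolerance) {
        this.averageCharTolerance = averageCharTolerance;
    }

    public Float getSpacingTolerance() {
        return spacingTolerance;
    }

    public void setSpacingTolerance(Float spacingTolerance) {
        this.spacingTolerance = spacingTolerance;
    }

    /** Builds a config pinned to the pre-commit values from the "-" lines of the diff. */
    public static ToleranceDefaults withPreviousDefaults() {
        ToleranceDefaults config = new ToleranceDefaults();
        config.setAverageCharTolerance(0.5f);
        config.setSpacingTolerance(0.3f);
        return config;
    }

    public static void main(String[] args) {
        ToleranceDefaults current = new ToleranceDefaults();
        ToleranceDefaults previous = withPreviousDefaults();
        System.out.println("current:  avgChar=" + current.getAverageCharTolerance()
                + " spacing=" + current.getSpacingTolerance());
        System.out.println("previous: avgChar=" + previous.getAverageCharTolerance()
                + " spacing=" + previous.getSpacingTolerance());
    }
}
```

Setting both values explicitly, rather than relying on defaults, keeps extraction behaviour stable across upgrades.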