Posted to commits@tika.apache.org by ta...@apache.org on 2018/08/03 15:26:05 UTC

[tika] branch master updated (7e477e3 -> 6ef6672)

This is an automated email from the ASF dual-hosted git repository.

tallison pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.


    from 7e477e3  TIKA-2701 change test file name
     add a6a7667  Update StrictHtmlEncodingDetector and rename it to StandardHtmlEncodingDetector
     add 7d96565  TIKA-2673 Fix race condition in CharsetAliases
     add 82a1c61  TIKA-2673 Remove wildcard imports
     add c27f53b  TIKA-2673 PreScanner: use read() instead of skip(long)
     add e7cda26  TIKA-2673 Make the read limit in StandardHtmlEncodingDetector configurable
     new 7323225  Merge branch 'TIKA-2673' of https://github.com/GerardBouchar/tika into GerardBouchar-TIKA-2673
     new f8f5e23  TIKA-2673 -- small modifications
     new 6ef6672  Merge branch 'GerardBouchar-TIKA-2673'

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../parser/html/StrictHtmlEncodingDetector.java    | 491 ---------------------
 .../html/charsetdetector/CharsetAliases.java       | 145 ++++++
 .../charsetdetector/CharsetDetectionResult.java    |  62 +++
 .../parser/html/charsetdetector/MetaProcessor.java |  74 ++++
 .../parser/html/charsetdetector/PreScanner.java    | 270 +++++++++++
 .../StandardHtmlEncodingDetector.java              | 104 +++++
 .../charsets/ReplacementCharset.java               |  65 +++
 .../charsets/XUserDefinedCharset.java              |  57 +++
 .../tika/parser/html/whatwg-encoding-labels.tsv    | 234 ----------
 ....java => StandardHtmlEncodingDetectorTest.java} |  94 +++-
 10 files changed, 863 insertions(+), 733 deletions(-)
 delete mode 100644 tika-parsers/src/main/java/org/apache/tika/parser/html/StrictHtmlEncodingDetector.java
 create mode 100644 tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetAliases.java
 create mode 100644 tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/CharsetDetectionResult.java
 create mode 100644 tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/MetaProcessor.java
 create mode 100644 tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/PreScanner.java
 create mode 100644 tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java
 create mode 100644 tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/charsets/ReplacementCharset.java
 create mode 100644 tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/charsets/XUserDefinedCharset.java
 delete mode 100644 tika-parsers/src/main/resources/org/apache/tika/parser/html/whatwg-encoding-labels.tsv
 rename tika-parsers/src/test/java/org/apache/tika/parser/html/{StrictHtmlEncodingDetectorTest.java => StandardHtmlEncodingDetectorTest.java} (77%)


[tika] 01/03: Merge branch 'TIKA-2673' of https://github.com/GerardBouchar/tika into GerardBouchar-TIKA-2673

tallison pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 73232252c84a74c598014a7ba9d6ef63732a75d4
Merge: 7e477e3 e7cda26
Author: TALLISON <ta...@apache.org>
AuthorDate: Fri Aug 3 11:06:58 2018 -0400

    Merge branch 'TIKA-2673' of https://github.com/GerardBouchar/tika into GerardBouchar-TIKA-2673

 .../parser/html/StrictHtmlEncodingDetector.java    | 491 ---------------------
 .../html/charsetdetector/CharsetAliases.java       | 145 ++++++
 .../charsetdetector/CharsetDetectionResult.java    |  62 +++
 .../FullStandardEncodingDetector.java              |  20 +
 .../parser/html/charsetdetector/MetaProcessor.java |  74 ++++
 .../parser/html/charsetdetector/PreScanner.java    | 268 +++++++++++
 .../StandardHtmlEncodingDetector.java              | 104 +++++
 .../StandardIcu4JEncodingDetector.java             |  51 +++
 .../charsets/ReplacementCharset.java               |  65 +++
 .../charsets/XUserDefinedCharset.java              |  57 +++
 .../tika/parser/html/whatwg-encoding-labels.tsv    | 234 ----------
 ....java => StandardHtmlEncodingDetectorTest.java} |  94 +++-
 12 files changed, 932 insertions(+), 733 deletions(-)


[tika] 02/03: TIKA-2673 -- small modifications

tallison pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit f8f5e23841d23cfcaa13bfba6dccf7f44f33fdd5
Author: TALLISON <ta...@apache.org>
AuthorDate: Fri Aug 3 11:25:08 2018 -0400

    TIKA-2673 -- small modifications
---
 .../FullStandardEncodingDetector.java              | 20 ---------
 .../parser/html/charsetdetector/PreScanner.java    |  6 ++-
 .../StandardHtmlEncodingDetector.java              |  2 +-
 .../StandardIcu4JEncodingDetector.java             | 51 ----------------------
 4 files changed, 5 insertions(+), 74 deletions(-)

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/FullStandardEncodingDetector.java b/tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/FullStandardEncodingDetector.java
deleted file mode 100644
index ab1edad..0000000
--- a/tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/FullStandardEncodingDetector.java
+++ /dev/null
@@ -1,20 +0,0 @@
-package org.apache.tika.parser.html.charsetdetector;
-
-import org.apache.tika.detect.CompositeEncodingDetector;
-
-import static java.util.Arrays.asList;
-
-/**
- * A composite encoding detector chaining a {@link StandardHtmlEncodingDetector}
- * (that may return null) and a {@link StandardIcu4JEncodingDetector} (that always returns a value)
- * This full detector thus always returns an encoding, and still works very well with data coming
- * from the web.
- */
-public class FullStandardEncodingDetector extends CompositeEncodingDetector {
-    public FullStandardEncodingDetector() {
-        super(asList(
-                new StandardHtmlEncodingDetector(),
-                StandardIcu4JEncodingDetector.STANDARD_ICU4J_ENCODING_DETECTOR
-        ));
-    }
-}
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/PreScanner.java b/tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/PreScanner.java
index 85b35a0..a00aeb1 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/PreScanner.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/PreScanner.java
@@ -85,9 +85,11 @@ class PreScanner {
     }
 
     Charset scan() {
-        while (processAtLeastOneByte())
-            if (detectedCharset.isFound())
+        while (processAtLeastOneByte()) {
+            if (detectedCharset.isFound()) {
                 return detectedCharset.getCharset();
+            }
+        }
         return null;
     }
 
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java b/tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java
index 0418270..f9d1a1b 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardHtmlEncodingDetector.java
@@ -36,7 +36,7 @@ import static org.apache.tika.parser.html.charsetdetector.CharsetAliases.getChar
  * https://html.spec.whatwg.org/multipage/parsing.html#the-input-byte-stream
  * <p>
  * If a resource was fetched over HTTP, then HTTP headers should be added to tika metadata
- * when using {@link #detect}, especially {@link Metadata.CONTENT_TYPE}, as it may contain charset information.
+ * when using {@link #detect}, especially {@link Metadata#CONTENT_TYPE}, as it may contain charset information.
  * <p>
  * This encoding detector may return null if no encoding is detected.
  * It is meant to be used inside a {@link org.apache.tika.detect.CompositeDetector}.
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardIcu4JEncodingDetector.java b/tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardIcu4JEncodingDetector.java
deleted file mode 100644
index f7ed53f..0000000
--- a/tika-parsers/src/main/java/org/apache/tika/parser/html/charsetdetector/StandardIcu4JEncodingDetector.java
+++ /dev/null
@@ -1,51 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.tika.parser.html.charsetdetector;
-
-import org.apache.tika.detect.EncodingDetector;
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.parser.txt.CharsetDetector;
-import org.apache.tika.parser.txt.CharsetMatch;
-
-import java.io.IOException;
-import java.io.InputStream;
-import java.nio.charset.Charset;
-import java.nio.charset.StandardCharsets;
-
-/**
- * Last resort detector, that never returns null.
- * Uses ICU4J for sniffing the charset, and uses standard charset aliases in {@link CharsetAliases}
- * to convert the charset name detected by ICU to a java charset.
- * This detector is stateless and a single instance can be used several times for different streams.
- */
-public class StandardIcu4JEncodingDetector implements EncodingDetector {
-    public static EncodingDetector STANDARD_ICU4J_ENCODING_DETECTOR = new StandardIcu4JEncodingDetector();
-
-    public Charset detect(InputStream input, Metadata metadata) throws IOException {
-        CharsetDetector detector = new CharsetDetector();
-        detector.enableInputFilter(true); // enabling input filtering (stripping of HTML tags)
-        detector.setText(input);
-        for (CharsetMatch match : detector.detectAll()) {
-            Charset detected = CharsetAliases.getCharsetByLabel(match.getName());
-            if (detected != null) return detected;
-        }
-        // This detector is meant to be used in last resort. It should never return null
-        // So if no charset was found, decode the input as simple ASCII.
-        // The ASCII charset is guaranteed to be present in all JVMs.
-        return StandardCharsets.US_ASCII;
-    }
-}
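
Note on the two classes removed above: their Javadoc describes a chain in which a first detector may return null and a last-resort detector always returns a value, so the chain as a whole always yields a charset. The sketch below illustrates that pattern using only the JDK. It is not Tika code: the class FallbackChainSketch and its methods are hypothetical stand-ins, and it assumes the composite simply returns the first non-null result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical illustration of a "first non-null wins" detector chain; not Tika's API. */
public class FallbackChainSketch {

    /** Returns the first non-null charset produced by the detectors, or null if all give up. */
    static Charset detectFirst(List<Function<byte[], Charset>> detectors, byte[] input) {
        for (Function<byte[], Charset> detector : detectors) {
            Charset result = detector.apply(input);
            if (result != null) {
                return result;
            }
        }
        return null;
    }

    /** Stand-in for a detector that may return null: it only recognizes a UTF-8 byte order mark. */
    static Charset bomDetector(byte[] input) {
        boolean hasUtf8Bom = input.length >= 3
                && (input[0] & 0xFF) == 0xEF
                && (input[1] & 0xFF) == 0xBB
                && (input[2] & 0xFF) == 0xBF;
        return hasUtf8Bom ? StandardCharsets.UTF_8 : null;
    }

    /** Stand-in for a last-resort detector: it always answers, here with US-ASCII. */
    static Charset lastResortDetector(byte[] input) {
        return StandardCharsets.US_ASCII;
    }

    /** Chains the two stand-ins; the last-resort entry guarantees a non-null result. */
    static Charset detect(byte[] input) {
        List<Function<byte[], Charset>> chain = Arrays.asList(
                FallbackChainSketch::bomDetector,
                FallbackChainSketch::lastResortDetector);
        return detectFirst(chain, input);
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'}));
        System.out.println(detect("plain".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

Placing the always-answering detector last is what makes the order matter: any detector listed after it would never be consulted.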


[tika] 03/03: Merge branch 'GerardBouchar-TIKA-2673'

tallison pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 6ef66723174ea15d4e168769dea6f76f0ad7bd91
Merge: 7e477e3 f8f5e23
Author: TALLISON <ta...@apache.org>
AuthorDate: Fri Aug 3 11:25:50 2018 -0400

    Merge branch 'GerardBouchar-TIKA-2673'

 .../parser/html/StrictHtmlEncodingDetector.java    | 491 ---------------------
 .../html/charsetdetector/CharsetAliases.java       | 145 ++++++
 .../charsetdetector/CharsetDetectionResult.java    |  62 +++
 .../parser/html/charsetdetector/MetaProcessor.java |  74 ++++
 .../parser/html/charsetdetector/PreScanner.java    | 270 +++++++++++
 .../StandardHtmlEncodingDetector.java              | 104 +++++
 .../charsets/ReplacementCharset.java               |  65 +++
 .../charsets/XUserDefinedCharset.java              |  57 +++
 .../tika/parser/html/whatwg-encoding-labels.tsv    | 234 ----------
 ....java => StandardHtmlEncodingDetectorTest.java} |  94 +++-
 10 files changed, 863 insertions(+), 733 deletions(-)