Posted to commits@tika.apache.org by ni...@apache.org on 2022/07/05 10:33:16 UTC

[tika] branch main updated (0d7a42f34 -> fc887690a)

This is an automated email from the ASF dual-hosted git repository.

nick pushed a change to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git


    from 0d7a42f34 TIKA-3795: update protobuf
     new 9d928bbf9 TIKA-3810 VTT with UTF-8 BOM
     new ec4cb612d WebVTT is text based, so check for both line endings on the BOM cases like we do for no-BOM
     new fc887690a Merge branch 'main' of https://github.com/apache/tika into main

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../org/apache/tika/mime/tika-mimetypes.xml        |  6 ++++
 .../java/org/apache/tika/mime/TestMimeTypes.java   |  4 +++
 .../resources/test-documents/testWebVTT_utf8.vtt   | 42 ++++++++++++++++++++++
 3 files changed, 52 insertions(+)
 create mode 100644 tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testWebVTT_utf8.vtt


[tika] 02/03: WebVTT is text based, so check for both line endings on the BOM cases like we do for no-BOM

Posted by ni...@apache.org.

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit ec4cb612d1cda09907c88f2c5a06cc3cb7a839ef
Author: Nick Burch <ni...@gagravarr.org>
AuthorDate: Tue Jul 5 11:22:59 2022 +0100

    WebVTT is text based, so check for both line endings on the BOM cases like we do for no-BOM
---
 tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 7b4ac0d7d..9b24ae3f4 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -7004,11 +7004,14 @@
       <!-- With Byte Order Mark -->
       <match value="0xfeff" offset="0">
          <match value="WEBVTT\r" type="string" offset="2"/>
+         <match value="WEBVTT\n" type="string" offset="2"/>
       </match>
       <match value="0xfeff" offset="0">
+         <match value="WEBVTT\r" type="string" offset="2"/>
          <match value="WEBVTT\n" type="string" offset="2"/>
       </match>
       <match value="0xefbbbf" offset="0">
+         <match value="WEBVTT\r" type="string" offset="3"/>
          <match value="WEBVTT\n" type="string" offset="3"/>
       </match>
       <!-- Common Header -->
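
As an illustration of what the rules above check (this is not Tika's detector code): a match needs a byte order mark at offset 0, then `WEBVTT` at the offset equal to the BOM's length, then either a CR or an LF. Below is a minimal Java sketch under that reading. The class and method names (`WebVttBomSketch`, `matchesAt`, `headerAfterBom`, `looksLikeWebVttWithBom`) are hypothetical and exist only for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; Tika itself evaluates the XML rules.
class WebVttBomSketch {
    // Byte order marks as written in the match rules: 0xfeff and 0xefbbbf.
    private static final byte[] BOM_FEFF = {(byte) 0xFE, (byte) 0xFF};
    private static final byte[] BOM_UTF8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    private static final byte[] HEADER = "WEBVTT".getBytes(StandardCharsets.US_ASCII);

    // True if 'expected' occurs in 'data' starting at 'offset'.
    public static boolean matchesAt(byte[] data, int offset, byte[] expected) {
        if (offset < 0 || data.length - offset < expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    // True if 'data' is: bom, then "WEBVTT", then a CR or an LF.
    public static boolean headerAfterBom(byte[] data, byte[] bom) {
        int end = bom.length + HEADER.length;
        if (!matchesAt(data, 0, bom) || !matchesAt(data, bom.length, HEADER)) {
            return false;
        }
        // Accept both line endings, mirroring the "\r" and "\n" rules.
        return data.length > end && (data[end] == '\r' || data[end] == '\n');
    }

    public static boolean looksLikeWebVttWithBom(byte[] data) {
        return headerAfterBom(data, BOM_FEFF) || headerAfterBom(data, BOM_UTF8);
    }

    public static void main(String[] args) {
        byte[] lf = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'W', 'E', 'B', 'V', 'T', 'T', '\n'};
        byte[] crlf = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'W', 'E', 'B', 'V', 'T', 'T', '\r', '\n'};
        System.out.println("LF after UTF-8 BOM:   " + looksLikeWebVttWithBom(lf));
        System.out.println("CRLF after UTF-8 BOM: " + looksLikeWebVttWithBom(crlf));
    }
}
```

Checking the single byte after `WEBVTT` for CR or LF covers LF-only, CR-only and CRLF files, which is what the paired `\r` and `\n` rules achieve in the XML.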


[tika] 01/03: TIKA-3810 VTT with UTF-8 BOM

Posted by ni...@apache.org.

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 9d928bbf9e93131d5021d4e5afddb4ba18df6531
Author: Nick Burch <ni...@gagravarr.org>
AuthorDate: Tue Jul 5 11:21:17 2022 +0100

    TIKA-3810 VTT with UTF-8 BOM
---
 .../org/apache/tika/mime/tika-mimetypes.xml        |  3 ++
 .../java/org/apache/tika/mime/TestMimeTypes.java   |  4 +++
 .../resources/test-documents/testWebVTT_utf8.vtt   | 42 ++++++++++++++++++++++
 3 files changed, 49 insertions(+)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 2329c0a3b..7b4ac0d7d 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -7008,6 +7008,9 @@
       <match value="0xfeff" offset="0">
          <match value="WEBVTT\n" type="string" offset="2"/>
       </match>
+      <match value="0xefbbbf" offset="0">
+         <match value="WEBVTT\n" type="string" offset="3"/>
+      </match>
       <!-- Common Header -->
       <match value="WEBVTT FILE\r" type="string" offset="0"/>
       <match value="WEBVTT FILE\n" type="string" offset="0"/>
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index 2a2936bae..ea2ecbeff 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -1140,6 +1140,10 @@ public class TestMimeTypes {
         // With a custom text header
         assertType("text/vtt", "testWebVTT_header.vtt");
         assertTypeByData("text/vtt", "testWebVTT_header.vtt");
+
+        // With a UTF-8 BOM before the header
+        assertType("text/vtt", "testWebVTT_utf8.vtt");
+        assertTypeByData("text/vtt", "testWebVTT_utf8.vtt");
     }
 
     @Test
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testWebVTT_utf8.vtt b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testWebVTT_utf8.vtt
new file mode 100644
index 000000000..722a923fc
--- /dev/null
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testWebVTT_utf8.vtt
@@ -0,0 +1,42 @@
+WEBVTT
+
+1
+00:00:00.350 --> 00:00:02.010
+Well, the feedback indicates
+
+2
+00:00:02.010 --> 00:00:03.880
+that many new hires aren't sure
+
+3
+00:00:03.880 --> 00:00:05.560
+where to find information related
+
+4
+00:00:05.560 --> 00:00:09.390
+to HR, benefits and other onboarding processes
+
+5
+00:00:09.390 --> 00:00:11.050
+or who to ask.
+
+6
+00:00:11.050 --> 00:00:13.850
+Also, they're not always sure where they belong
+
+7
+00:00:13.850 --> 00:00:15.740
+in the structure of the company.
+
+8
+00:00:15.740 --> 00:00:18.470
+Because the company is growing and changing,
+
+9
+00:00:18.470 --> 00:00:20.890
+even tenured employees are getting confused
+
+10
+00:00:20.890 --> 00:00:23.663
+about who does what and who reports to whom.
+
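
Going by the commit title and the test comment, the new fixture presumably begins with a UTF-8 byte order mark, which does not show up in this plain-text rendering of the diff. Below is a minimal Java sketch of how such a file could be produced for local experiments. The class name `VttFixtureSketch`, the helper `utf8WithBom`, the temp-file prefix, and the use of LF line endings are assumptions for illustration, not details taken from the commit.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for building a BOM-prefixed WebVTT file by hand.
class VttFixtureSketch {
    // Prepends the three UTF-8 BOM bytes (EF BB BF) to the UTF-8 encoded text.
    public static byte[] utf8WithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[3 + body.length];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // First cue copied from the fixture above; LF endings are an assumption.
        String cues = "WEBVTT\n\n1\n00:00:00.350 --> 00:00:02.010\n"
                + "Well, the feedback indicates\n\n";
        Path file = Files.createTempFile("sketch_utf8", ".vtt");
        Files.write(file, utf8WithBom(cues));

        // Read the file back and show the leading bytes in hex.
        byte[] head = Files.readAllBytes(file);
        System.out.printf("%02x %02x %02x%n", head[0] & 0xFF, head[1] & 0xFF, head[2] & 0xFF);
        Files.delete(file);
    }
}
```

With a file built this way, the header text starts at byte offset 3, matching the `offset="3"` used by the `0xefbbbf` rule.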


[tika] 03/03: Merge branch 'main' of https://github.com/apache/tika into main

Posted by ni...@apache.org.

nick pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git

commit fc887690a91a4b689a40a0be11d68dcdeb45a66f
Merge: ec4cb612d 0d7a42f34
Author: Nick Burch <ni...@gagravarr.org>
AuthorDate: Tue Jul 5 11:32:57 2022 +0100

    Merge branch 'main' of https://github.com/apache/tika into main

 tika-parent/pom.xml | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)