Posted to commits@tika.apache.org by ta...@apache.org on 2024/03/26 12:37:33 UTC

(tika) branch branch_2x updated (fd9b62467 -> 88b582f25)

This is an automated email from the ASF dual-hosted git repository.

tallison pushed a change to branch branch_2x
in repository https://gitbox.apache.org/repos/asf/tika.git


    from fd9b62467 TIKA-4162: update aws
     new 361e0de09 TIKA-4225 -- add detection for amf (#1688)
     new 722aaaf00 TIKA-4224 -- add detection for 3mf (#1689)
     new 922970203 TIKA-4222 -- add openscad glob (#1690)
     new de408df02 TIKA-4223 -- add detection of stl (#1691)
     new 800b551ff update CHANGES.txt
     new 88b582f25 Merge remote-tracking branch 'origin/branch_2x' into branch_2x

The 6 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 CHANGES.txt                                        |   7 +++
 .../org/apache/tika/mime/tika-mimetypes.xml        |  34 ++++++++++++++-
 .../java/org/apache/tika/TikaDetectionTest.java    |   2 +-
 .../detect/microsoft/ooxml/OPCPackageDetector.java |  47 +++++++++++++--------
 .../tika/detect/TestContainerAwareDetector.java    |   5 +++
 .../java/org/apache/tika/mime/TestMimeTypes.java   |   6 +++
 .../src/test/resources/test-documents/test3mf.3mf  | Bin 0 -> 28243 bytes
 .../resources/test-documents/testSTL-ascii.stl     |  16 +++++++
 .../resources/test-documents/testSTL-binary.stl    | Bin 0 -> 160 bytes
 9 files changed, 97 insertions(+), 20 deletions(-)
 create mode 100644 tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/test3mf.3mf
 create mode 100644 tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testSTL-ascii.stl
 create mode 100644 tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testSTL-binary.stl


(tika) 03/06: TIKA-4222 -- add openscad glob (#1690)

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_2x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 92297020375b15c57c47799d5dcb5b82bd38b981
Author: Tim Allison <ta...@apache.org>
AuthorDate: Mon Mar 25 17:07:01 2024 -0400

    TIKA-4222 -- add openscad glob (#1690)
    
    (cherry picked from commit c5693624cbd43d0d76357b9f21705991d6f3a4ff)
---
 tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index de95917bb..09b8e3821 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -3898,6 +3898,11 @@
       </match>
     </magic>
   </mime-type>
+  <mime-type type="application/x-openscad">
+    <tika:link>https://openscad.org/index.html</tika:link>
+    <glob pattern="*.scad"/>
+    <sub-class-of type="text/plain"/>
+  </mime-type>
   <mime-type type="application/x-executable">
     <sub-class-of type="application/x-elf"/>
     <magic priority="50">
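
Note on the glob rule above: the new entry carries no magic bytes, so a .scad file is recognized by name only. A minimal sketch of that kind of suffix lookup follows, assuming a plain case-sensitive suffix match. GlobSketch and detectByName are hypothetical names for illustration, not Tika classes, and the octet-stream fallback is an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for glob-based detection; this is not Tika's implementation.
final class GlobSketch {
    // Suffix -> media type, taken from the <glob pattern="*.scad"/> entry in the diff.
    private static final Map<String, String> SUFFIX_TO_TYPE = new LinkedHashMap<>();

    static {
        SUFFIX_TO_TYPE.put(".scad", "application/x-openscad");
    }

    // Returns the media type registered for the file name's suffix,
    // or a generic fallback when no glob matches (fallback is an assumption).
    static String detectByName(String fileName) {
        for (Map.Entry<String, String> entry : SUFFIX_TO_TYPE.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "application/octet-stream";
    }
}
```

Because the type is declared as a sub-class of text/plain, a consumer that only understands plain text could still treat the file as text.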


(tika) 05/06: update CHANGES.txt

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_2x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 800b551ffeae297641f986ed01db7d9c3c9456fd
Author: tallison <ta...@apache.org>
AuthorDate: Tue Mar 26 08:37:18 2024 -0400

    update CHANGES.txt
---
 CHANGES.txt | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/CHANGES.txt b/CHANGES.txt
index 4367faa6a..612bc4b8d 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,3 +1,10 @@
+Release 2.9.2 - ???
+
+   * Dependency upgrades including temporary workarounds for regressions in commons-compress.
+
+   * Add detection for OpenSCAD, 3MF, AMF, STL file formats via Robin Schimpf (TIKA-4222, TIKA-4223,
+     TIKA-4224, TIKA-4225).
+
 Release 2.9.1 - 10/17/2023
 
    * Dependency upgrades including commons-compress to fix CVE-2023-42503.


(tika) 02/06: TIKA-4224 -- add detection for 3mf (#1689)

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_2x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 722aaaf006f02d2e5c6deaaae1dcbf08a2e331e4
Author: Tim Allison <ta...@apache.org>
AuthorDate: Mon Mar 25 17:06:45 2024 -0400

    TIKA-4224 -- add detection for 3mf (#1689)
    
    (cherry picked from commit 3ffbc04f7a1023aa8e6d5ea22d19feb2a7e61a8f)
---
 .../org/apache/tika/mime/tika-mimetypes.xml        |   6 +++
 .../detect/microsoft/ooxml/OPCPackageDetector.java |  47 +++++++++++++--------
 .../tika/detect/TestContainerAwareDetector.java    |   5 +++
 .../src/test/resources/test-documents/test3mf.3mf  | Bin 0 -> 28243 bytes
 4 files changed, 41 insertions(+), 17 deletions(-)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 81a0af3c9..de95917bb 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -2062,6 +2062,12 @@
     <glob pattern="*.ost"/>
   </mime-type>
 
+  <mime-type type="application/vnd.ms-package.3dmanufacturing-3dmodel+xml">
+    <tika:link>https://en.wikipedia.org/wiki/3D_Manufacturing_Format</tika:link>
+    <_comment>3D manufacturing format</_comment>
+    <glob pattern="*.3mf"/>
+  </mime-type>
+
   <mime-type type="application/vnd.ms-pki.seccat">
     <glob pattern="*.cat"/>
   </mime-type>
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/detect/microsoft/ooxml/OPCPackageDetector.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/detect/microsoft/ooxml/OPCPackageDetector.java
index cdef864e0..369ba475c 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/detect/microsoft/ooxml/OPCPackageDetector.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/detect/microsoft/ooxml/OPCPackageDetector.java
@@ -88,6 +88,9 @@ public class OPCPackageDetector implements ZipContainerDetector {
             MediaType.application("vnd.openxmlformats-officedocument.spreadsheetml.template");
     static final MediaType XLAM = MediaType.application("vnd.ms-excel.addin.macroEnabled.12");
     static final MediaType XPS = MediaType.application("vnd.ms-xpsdocument");
+
+    static final MediaType THREE_MF = MediaType.application("vnd.ms-package.3dmanufacturing-3dmodel+xml");
+
     static final Set<String> OOXML_HINTS =
             fillSet("word/document.xml", "_rels/.rels", "[Content_Types].xml",
                     "ppt/presentation.xml", "ppt/slides/slide1.xml", "xl/workbook.xml",
@@ -100,6 +103,8 @@ public class OPCPackageDetector implements ZipContainerDetector {
             "http://schemas.openxps.org/oxps/v1.0/fixedrepresentation";
     private static final String STAR_OFFICE_6_WRITER = "application/vnd.sun.xml.writer";
 
+    private static final String THREE_MF_DOCUMENT =
+            "http://schemas.microsoft.com/3dmanufacturing/2013/01/3dmodel";
     static Map<String, MediaType> OOXML_CONTENT_TYPES = new ConcurrentHashMap<>();
 
     static {
@@ -153,29 +158,37 @@ public class OPCPackageDetector implements ZipContainerDetector {
         // Check for the normal Office core document
         PackageRelationshipCollection core =
                 pkg.getRelationshipsByType(PackageRelationshipTypes.CORE_DOCUMENT);
+
         // Otherwise check for some other Office core document types
         if (core.size() == 0) {
             core = pkg.getRelationshipsByType(PackageRelationshipTypes.STRICT_CORE_DOCUMENT);
-        }
-        if (core.size() == 0) {
-            core = pkg.getRelationshipsByType(PackageRelationshipTypes.VISIO_CORE_DOCUMENT);
-        }
-        if (core.size() == 0) {
-            core = pkg.getRelationshipsByType(XPS_DOCUMENT);
-            if (core.size() == 1) {
-                return MediaType.application("vnd.ms-xpsdocument");
+
+            if (core.size() == 0) {
+                core = pkg.getRelationshipsByType(PackageRelationshipTypes.VISIO_CORE_DOCUMENT);
             }
-            core = pkg.getRelationshipsByType(OPEN_XPS_DOCUMENT);
-            if (core.size() == 1) {
-                return MediaType.application("vnd.ms-xpsdocument");
+            if (core.size() == 0) {
+                core = pkg.getRelationshipsByType(XPS_DOCUMENT);
+                if (core.size() == 1) {
+                    return MediaType.application("vnd.ms-xpsdocument");
+                }
+                core = pkg.getRelationshipsByType(OPEN_XPS_DOCUMENT);
+                if (core.size() == 1) {
+                    return MediaType.application("vnd.ms-xpsdocument");
+                }
             }
-        }
 
-        if (core.size() == 0) {
-            core = pkg.getRelationshipsByType(
-                    "http://schemas.autodesk.com/dwfx/2007/relationships/documentsequence");
-            if (core.size() == 1) {
-                return MediaType.parse("model/vnd.dwfx+xps");
+            if (core.size() == 0) {
+                core = pkg.getRelationshipsByType(
+                        "http://schemas.autodesk.com/dwfx/2007/relationships/documentsequence");
+                if (core.size() == 1) {
+                    return MediaType.parse("model/vnd.dwfx+xps");
+                }
+            }
+            if (core.size() == 0) {
+                core = pkg.getRelationshipsByType(THREE_MF_DOCUMENT);
+                if (core.size() == 1) {
+                    return THREE_MF;
+                }
             }
         }
         // If we didn't find a single core document of any type, skip detection
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
index 9ad968b9c..d35df67bf 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java
@@ -262,6 +262,11 @@ public class TestContainerAwareDetector extends MultiThreadedTikaTest {
         assertTypeByData("testODTnotaZipFile.odt", "text/plain");
     }
 
+    @Test
+    public void test3MF() throws Exception {
+        assertTypeByData("test3mf.3mf", "application/vnd.ms-package.3dmanufacturing-3dmodel+xml");
+        assertTypeByNameAndData("test3mf.3mf", "application/vnd.ms-package.3dmanufacturing-3dmodel+xml");
+    }
     @Test
     public void testODFDifferentOrder() throws Exception {
         //TIKA-3356
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/test3mf.3mf b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/test3mf.3mf
new file mode 100644
index 000000000..f7d0cf5a7
Binary files /dev/null and b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/test3mf.3mf differ
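
Note on the OPCPackageDetector change above: the patch nests the fallback checks and appends a 3MF check at the end of the chain. The ordered lookup can be sketched as below, under simplifying assumptions. OpcFallbackSketch and its relationshipCount callback are hypothetical; the real detector queries an OPC package, handles the Office core document types first, and stops falling through once any relationship type has a nonzero count. Only relationship URIs that appear verbatim in the diff are used here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Simplified, hypothetical model of the fallback chain; not the Tika code itself.
final class OpcFallbackSketch {
    // Relationship type -> media type, in the order the patched code checks them.
    private static final Map<String, String> FALLBACKS = new LinkedHashMap<>();

    static {
        FALLBACKS.put("http://schemas.openxps.org/oxps/v1.0/fixedrepresentation",
                "application/vnd.ms-xpsdocument");
        FALLBACKS.put("http://schemas.autodesk.com/dwfx/2007/relationships/documentsequence",
                "model/vnd.dwfx+xps");
        FALLBACKS.put("http://schemas.microsoft.com/3dmanufacturing/2013/01/3dmodel",
                "application/vnd.ms-package.3dmanufacturing-3dmodel+xml");
    }

    // relationshipCount reports how many relationships of a given type the package has.
    // Returns the first media type whose relationship occurs exactly once, else null.
    static String detect(ToIntFunction<String> relationshipCount) {
        for (Map.Entry<String, String> entry : FALLBACKS.entrySet()) {
            if (relationshipCount.applyAsInt(entry.getKey()) == 1) {
                return entry.getValue();
            }
        }
        return null;
    }
}
```

Keeping the pairs in an ordered map makes the precedence explicit: a package that somehow carried both a DWFx and a 3MF relationship would be reported as DWFx in this sketch, matching the order of the checks in the diff.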


(tika) 06/06: Merge remote-tracking branch 'origin/branch_2x' into branch_2x

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_2x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 88b582f25f823475268b90a14e01e4c0c10f661d
Merge: 800b551ff fd9b62467
Author: tallison <ta...@apache.org>
AuthorDate: Tue Mar 26 08:37:25 2024 -0400

    Merge remote-tracking branch 'origin/branch_2x' into branch_2x

 tika-parent/pom.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


(tika) 01/06: TIKA-4225 -- add detection for amf (#1688)

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_2x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 361e0de09532818cf2491a370329084b1d992952
Author: Tim Allison <ta...@apache.org>
AuthorDate: Mon Mar 25 15:02:57 2024 -0400

    TIKA-4225 -- add detection for amf (#1688)
    
    * TIKA-4225 -- add detection for amf
    
    (cherry picked from commit 36e3ba8cd6f489be1241536661f6f1821458b902)
---
 .../src/main/resources/org/apache/tika/mime/tika-mimetypes.xml      | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index ca2dcaa6f..81a0af3c9 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -3283,6 +3283,12 @@
     </match>
     </magic>
   </mime-type>
+  <mime-type type="application/x-amf">
+    <tika:link>https://en.wikipedia.org/wiki/Additive_manufacturing_file_format</tika:link>
+    <root-XML localName="amf"/>
+    <glob pattern="*.amf"/>
+    <sub-class-of type="application/xml"/>
+  </mime-type>
   <mime-type type="application/x-atari-floppy-disk-image">
     <tika:link>http://fileformats.archiveteam.org/wiki/ATR</tika:link>
     <magic priority="50">
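
Note on the root-XML rule above: AMF is identified by the local name of the document's root element ("amf") in addition to the *.amf glob. A rough illustration of root-element detection using the JDK's StAX API is given below. RootXmlSketch is a hypothetical class, and the fallback types it returns for non-AMF input are assumptions of the sketch rather than Tika's behavior.

```java
import java.io.ByteArrayInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration of root-element based detection; not Tika's implementation.
final class RootXmlSketch {
    static String detect(byte[] data) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Disable DTD processing, since the input is untrusted.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new ByteArrayInputStream(data));
            while (reader.hasNext()) {
                // The first START_ELEMENT event is the root element.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return "amf".equals(reader.getLocalName())
                            ? "application/x-amf"
                            : "application/xml";
                }
            }
        } catch (XMLStreamException e) {
            // Not well-formed XML; fall through to the generic type.
        }
        return "application/octet-stream";
    }
}
```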


(tika) 04/06: TIKA-4223 -- add detection of stl (#1691)

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_2x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit de408df0212ee3a5a3e6f6a5467940f3957be25f
Author: Tim Allison <ta...@apache.org>
AuthorDate: Tue Mar 26 08:28:50 2024 -0400

    TIKA-4223 -- add detection of stl (#1691)
    
    * TIKA-4223 -- add detection for binary and text based stl
    
    (cherry picked from commit 9d45b69dab2016342e44ee2b8bf5ed508676b38b)
---
 .../resources/org/apache/tika/mime/tika-mimetypes.xml  |  17 +++++++++++++++--
 .../test/java/org/apache/tika/TikaDetectionTest.java   |   2 +-
 .../test/java/org/apache/tika/mime/TestMimeTypes.java  |   6 ++++++
 .../test/resources/test-documents/testSTL-ascii.stl    |  16 ++++++++++++++++
 .../test/resources/test-documents/testSTL-binary.stl   | Bin 0 -> 160 bytes
 5 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 09b8e3821..54f4b2051 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -2072,7 +2072,10 @@
     <glob pattern="*.cat"/>
   </mime-type>
   <mime-type type="application/vnd.ms-pki.stl">
-    <glob pattern="*.stl"/>
+    <!-- on TIKA-4223, we moved this glob to model/x.stl-binary.
+    We think this pki.stl is a subtype of pkcs7-signature?!
+    -->
+    <!--<glob pattern="*.stl"/> -->
   </mime-type>
   <mime-type type="application/vnd.ms-playready.initiator+xml"/>
 
@@ -7041,7 +7044,17 @@
     <glob pattern="*.mesh"/>
     <glob pattern="*.silo"/>
   </mime-type>
-
+  <mime-type type="model/x.stl-ascii">
+    <magic priority="60">
+      <match value="solid " offset="0" type="string">
+        <match value="facet " offset="7:256" type="string"/>
+      </match>
+    </magic>
+  </mime-type>
+  <mime-type type="model/x.stl-binary">
+    <_comment>no magic available</_comment>
+    <glob pattern="*.stl"/>
+  </mime-type>
   <mime-type type="model/vnd.dwf">
     <acronym>DWF</acronym>
     <_comment>AutoCAD Design Web Format</_comment>
diff --git a/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java b/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
index 1cd0f40a2..79fd61f34 100644
--- a/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
+++ b/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
@@ -354,7 +354,7 @@ public class TikaDetectionTest {
         assertEquals("application/vnd.ms-ims", tika.detect("x.ims"));
         assertEquals("application/vnd.ms-lrm", tika.detect("x.lrm"));
         assertEquals("application/vnd.ms-pki.seccat", tika.detect("x.cat"));
-        assertEquals("application/vnd.ms-pki.stl", tika.detect("x.stl"));
+        assertEquals("model/x.stl-binary", tika.detect("x.stl"));
         assertEquals("application/vnd.ms-powerpoint", tika.detect("x.ppt"));
         assertEquals("application/vnd.ms-powerpoint", tika.detect("x.pps"));
         assertEquals("application/vnd.ms-powerpoint", tika.detect("x.pot"));
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index 3dad7d6af..886fe4ad6 100644
--- a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -212,6 +212,12 @@ public class TestMimeTypes {
         assertTypeByNameAndData("application/x-subrip", "test_subrip.srt");
     }
 
+    @Test
+    public void testSTL() throws Exception {
+        assertTypeByNameAndData("model/x.stl-binary", "testSTL-binary.stl");
+        assertTypeByNameAndData("model/x.stl-ascii", "testSTL-ascii.stl");
+    }
+
     @Test
     public void testTTML() throws Exception {
         assertTypeByData("application/ttml+xml", "test_ttml.ttml");
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testSTL-ascii.stl b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testSTL-ascii.stl
new file mode 100644
index 000000000..9d5bfe085
--- /dev/null
+++ b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testSTL-ascii.stl
@@ -0,0 +1,16 @@
+solid OpenSCAD_Model
+  facet normal 0 0 -1
+    outer loop
+      vertex -10 -35 0
+      vertex 10 -25 0
+      vertex 10 -35 0
+    endloop
+  endfacet
+  facet normal -0 0 -1
+    outer loop
+      vertex 10 -25 0
+      vertex -10 -35 0
+      vertex -10 -25 0
+    endloop
+  endfacet
+endsolid OpenSCAD_Model
diff --git a/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testSTL-binary.stl b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testSTL-binary.stl
new file mode 100644
index 000000000..e76f48fd1
Binary files /dev/null and b/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testSTL-binary.stl differ
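
Note on the STL magic rule above: ASCII STL is matched by "solid " at offset 0 plus a nested match for "facet " starting somewhere in the offset range 7:256, while binary STL has no magic and is reached only through the *.stl glob. The nested match can be sketched as follows, assuming the range bounds are inclusive start offsets. StlMagicSketch is a hypothetical class for illustration, not part of Tika.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical re-implementation of the two-level magic match; not Tika's code.
final class StlMagicSketch {
    private static final byte[] SOLID = "solid ".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] FACET = "facet ".getBytes(StandardCharsets.US_ASCII);

    // True if pattern occurs in data starting exactly at offset.
    static boolean matchesAt(byte[] data, byte[] pattern, int offset) {
        if (offset < 0 || offset + pattern.length > data.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // Outer match: "solid " at offset 0.
    // Inner match: "facet " starting at any offset from 7 to 256 (assumed inclusive).
    static boolean looksLikeAsciiStl(byte[] data) {
        if (!matchesAt(data, SOLID, 0)) {
            return false;
        }
        for (int start = 7; start <= 256; start++) {
            if (matchesAt(data, FACET, start)) {
                return true;
            }
        }
        return false;
    }
}
```

With the testSTL-ascii.stl content shown in the diff, "facet " begins at offset 23 (20 characters of "solid OpenSCAD_Model", a newline, and two spaces), which falls inside the range. Requiring the inner match means a file that merely begins with "solid " is not reported as ASCII STL.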