Posted to commits@tika.apache.org by ta...@apache.org on 2020/06/02 13:52:42 UTC

[tika] branch branch_1x updated (1f2eae8 -> 5d1be35)

This is an automated email from the ASF dual-hosted git repository.

tallison pushed a change to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git.


    from 1f2eae8  Merge remote-tracking branch 'origin/branch_1x' into branch_1x
     new f6b0770  TIKA-3094 add ignored unit test that runs the bundle against all of the test files.
     new 098256b  TIKA-3094 -- new metadata for every parse :(
     new b7c5d2e  TIKA-3094: add javax.xml.bind to system packages.  Fix java 11 jaxb.
     new 7a141d1  TIKA-2961 Make the CAF mime magic more specific to avoid false positives, by checking for a version number after the "caff" header text
     new 5b852f1  Make the bplist magic more specific where possible, keep version catch-all as now otherwise
     new 73b475f  Add glob for Xcode Memgraph files, which are bplist-based
     new d921031  Tweak whitespace to be consistent
     new 3e32418  TIKA-3101 -- extract metadata from XMP basic schema
     new c9957b9  TIKA-3101 -- extract metadata from XMP basic schema, cleanup
     new 1825d83  TIKA-3104 -- addition of rudimentary bplist parser
     new 6a73350  TIKA-3104 -- fix markup to be closer to plist xml, add embedded document extractor
     new 5d1be35  TIKA-3104 -- fix localization thanks to forbiddenapis

The 12 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 CHANGES.txt                                        |   4 +
 LICENSE.txt                                        |  22 +++
 tika-bundle/pom.xml                                |  47 ++++-
 .../test/java/org/apache/tika/bundle/BundleIT.java |  61 ++++++-
 tika-bundle/test-bundles.xml                       |  14 ++
 .../main/java/org/apache/tika/metadata/XMP.java    |  17 +-
 .../org/apache/tika/mime/tika-mimetypes.xml        |  24 ++-
 tika-parsers/pom.xml                               |   5 +
 .../org/apache/tika/parser/apple/PListParser.java  | 191 +++++++++++++++++++++
 .../tika/parser/pdf/PDMetadataExtractor.java       |  74 +++++++-
 .../services/org.apache.tika.parser.Parser         |   1 +
 ...gleFileParserTest.java => PListParserTest.java} |  38 ++--
 .../org/apache/tika/parser/pdf/PDFParserTest.java  |  14 ++
 .../resources/test-documents/testBPList.bplist     | Bin 0 -> 24433 bytes
 .../test-documents/testPDF_XMPBasicSchema.pdf      | Bin 0 -> 1577 bytes
 15 files changed, 480 insertions(+), 32 deletions(-)
 create mode 100644 tika-parsers/src/main/java/org/apache/tika/parser/apple/PListParser.java
 copy tika-parsers/src/test/java/org/apache/tika/parser/apple/{AppleSingleFileParserTest.java => PListParserTest.java} (54%)
 create mode 100644 tika-parsers/src/test/resources/test-documents/testBPList.bplist
 create mode 100644 tika-parsers/src/test/resources/test-documents/testPDF_XMPBasicSchema.pdf


[tika] 05/12: Make the bplist magic more specific where possible, keep version catch-all as now otherwise

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 5b852f1c5091f37ec5549790e512ee4b1d7a1280
Author: Nick Burch <ni...@gagravarr.org>
AuthorDate: Thu May 28 07:05:30 2020 +0100

    Make the bplist magic more specific where possible, keep version catch-all as now otherwise
---
 .../main/resources/org/apache/tika/mime/tika-mimetypes.xml    | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 1343bc0..88604a0 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -3206,6 +3206,17 @@
   </mime-type>
 
   <mime-type type="application/x-bplist">
+    <!-- Check for well-known bplist versions -->
+    <magic priority="70">
+      <match value="bplist\000\000" type="string" offset="0"/>
+      <match value="bplist\000\001" type="string" offset="0"/>
+      <match value="bplist\100\000" type="string" offset="0"/>
+      <match value="bplist00" type="string" offset="0"/>
+      <match value="bplist01" type="string" offset="0"/>
+      <match value="bplist10" type="string" offset="0"/>
+      <match value="bplist15" type="string" offset="0"/>
+      <match value="bplist16" type="string" offset="0"/>
+    </magic>
     <!-- The priority is 60, as .webarchive files often contain
          (X)HTML content. The bplist magic must trump the XHTML
          magics further within the file. This must also be
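
The two-tier matching this commit sets up (well-known versions at priority 70, the plain "bplist" catch-all kept at the existing lower priority) can be illustrated with a small stand-alone check. This is only a sketch, not Tika's detector: the class and method names are hypothetical, and only the ASCII version markers from the diff are covered (the octal-escaped variants are left out).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika: mimics the idea of trying
// specific magic values before a generic catch-all.
class BplistMagicSketch {

    // ASCII version markers copied from the priority-70 <match> entries in the diff.
    private static final List<String> KNOWN_VERSIONS =
            Arrays.asList("bplist00", "bplist01", "bplist10", "bplist15", "bplist16");

    // The pre-existing catch-all magic.
    private static final String CATCH_ALL = "bplist";

    // True if data begins with the ASCII bytes of prefix (offset 0).
    static boolean startsWith(byte[] data, String prefix) {
        byte[] p = prefix.getBytes(StandardCharsets.US_ASCII);
        if (data.length < p.length) {
            return false;
        }
        for (int i = 0; i < p.length; i++) {
            if (data[i] != p[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns which tier matched: "known-version", "catch-all" or "no-match".
    static String classify(byte[] header) {
        for (String version : KNOWN_VERSIONS) {
            if (startsWith(header, version)) {
                return "known-version";
            }
        }
        return startsWith(header, CATCH_ALL) ? "catch-all" : "no-match";
    }

    public static void main(String[] args) {
        System.out.println(classify("bplist00...".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(classify("bplist99...".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(classify("<html>".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

A header with an unrecognised version still falls through to the catch-all, which matches the "keep version catch-all as now otherwise" wording of the commit message.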


[tika] 08/12: TIKA-3101 -- extract metadata from XMP basic schema

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 3e32418dfa15a3fe19e219d4aaafdc82bb7e25f3
Author: tallison <ta...@apache.org>
AuthorDate: Mon Jun 1 10:03:00 2020 -0400

    TIKA-3101 -- extract metadata from XMP basic schema
    
    # Conflicts:
    #	tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
---
 .../main/java/org/apache/tika/metadata/XMP.java    |  15 ++++-
 .../tika/parser/pdf/PDMetadataExtractor.java       |  69 ++++++++++++++++++++-
 .../org/apache/tika/parser/pdf/PDFParserTest.java  |  14 +++++
 .../test-documents/testPDF_XMPBasicSchema.pdf      | Bin 0 -> 1577 bytes
 4 files changed, 94 insertions(+), 4 deletions(-)

diff --git a/tika-core/src/main/java/org/apache/tika/metadata/XMP.java b/tika-core/src/main/java/org/apache/tika/metadata/XMP.java
index 0f8c7fc..9a26920 100644
--- a/tika-core/src/main/java/org/apache/tika/metadata/XMP.java
+++ b/tika-core/src/main/java/org/apache/tika/metadata/XMP.java
@@ -26,6 +26,14 @@ public interface XMP {
     String PREFIX_ = PREFIX + Metadata.NAMESPACE_PREFIX_DELIMITER;
 
     /**
+     * An unordered array of text strings that unambiguously identify the resource
+     * within a given context. An array item may be qualified with xmpidq:Scheme
+     * (see 8.7, “xmpidq namespace”) to denote the formal identification system to
+     * which that identifier conforms.
+     */
+    Property ADVISORY = Property.externalTextBag(PREFIX_ + "Advisory");
+
+    /**
      * The date and time the resource was created. For a digital file, this need not
      * match a file-system creation time. For a freshly created resource, it should
      * be close to that time, modulo the time taken to write the file. Later file
@@ -49,7 +57,7 @@ public interface XMP {
     /**
      * A word or short phrase that identifies a resource as a member of a userdefined collection.
      */
-    Property LABEL = Property.externalDate(PREFIX_ + "Label");
+    Property LABEL = Property.externalText(PREFIX_ + "Label");
 
     /**
      * The date and time that any metadata for this resource was last changed. It
@@ -63,6 +71,11 @@ public interface XMP {
     Property MODIFY_DATE = Property.externalDate(PREFIX_ + "ModifyDate");
 
     /**
+     * A word or short phrase that identifies a resource as a member of a userdefined collection.
+     */
+    Property NICKNAME = Property.externalText(PREFIX_ + "NickName");
+
+    /**
      * A user-assigned rating for this file. The value shall be -1 or in the range
      * [0..5], where -1 indicates “rejected” and 0 indicates “unrated”. If xmp:Rating
      * is not present, a value of 0 should be assumed.
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDMetadataExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDMetadataExtractor.java
index 374471b..16605cb 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDMetadataExtractor.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDMetadataExtractor.java
@@ -22,8 +22,10 @@ import java.util.Calendar;
 import java.util.List;
 import java.util.Locale;
 
+import org.apache.commons.lang3.StringUtils;
 import org.apache.jempbox.xmp.XMPMetadata;
 import org.apache.jempbox.xmp.XMPSchema;
+import org.apache.jempbox.xmp.XMPSchemaBasic;
 import org.apache.jempbox.xmp.XMPSchemaDublinCore;
 import org.apache.jempbox.xmp.pdfa.XMPSchemaPDFAId;
 import org.apache.pdfbox.cos.COSArray;
@@ -38,6 +40,7 @@ import org.apache.tika.metadata.Metadata;
 import org.apache.tika.metadata.PDF;
 import org.apache.tika.metadata.Property;
 import org.apache.tika.metadata.TikaCoreProperties;
+import org.apache.tika.metadata.XMP;
 import org.apache.tika.mime.MediaType;
 import org.apache.tika.parser.ParseContext;
 import org.apache.tika.parser.image.xmp.JempboxExtractor;
@@ -66,21 +69,26 @@ class PDMetadataExtractor {
             xmp = new XMPMetadata(dom);
         }
         XMPSchemaDublinCore dcSchema = null;
-
+        XMPSchemaBasic basic = null;
         if (xmp != null) {
             try {
                 dcSchema = xmp.getDublinCoreSchema();
             } catch (IOException e) {
             }
-
+            try {
+                basic = xmp.getBasicSchema();
+            } catch (IOException e) {
+                //swallow
+            }
             JempboxExtractor.extractXMPMM(xmp, metadata);
         }
-
         extractMultilingualItems(metadata, TikaCoreProperties.DESCRIPTION, null, dcSchema);
         extractDublinCoreListItems(metadata, TikaCoreProperties.CONTRIBUTOR, dcSchema);
         extractDublinCoreListItems(metadata, TikaCoreProperties.CREATOR, dcSchema);
         extractMultilingualItems(metadata, TikaCoreProperties.TITLE, null, dcSchema);
 
+        extractBasic(basic, metadata);
+
         try {
             if (xmp != null) {
                 xmp.addXMLNSMapping(XMPSchemaPDFAId.NAMESPACE, XMPSchemaPDFAId.class);
@@ -104,6 +112,61 @@ class PDMetadataExtractor {
         }
     }
 
+    private static void extractBasic(XMPSchemaBasic basic, Metadata metadata) {
+        if (basic == null) {
+            return;
+        }
+        //add the elements from the basic schema if they haven't already
+        //been extracted from dublin core
+        setNotNull(XMP.CREATOR_TOOL, basic.getCreatorTool(), metadata);
+        setNotNull(XMP.LABEL, basic.getLabel(), metadata);
+        try {
+            setNotNull(XMP.CREATE_DATE, basic.getCreateDate(), metadata);
+        } catch (IOException e) {
+        }
+        try {
+            setNotNull(XMP.MODIFY_DATE, basic.getModifyDate(), metadata);
+        } catch (IOException e) {
+        }
+        try {
+            setNotNull(XMP.METADATA_DATE, basic.getMetadataDate(), metadata);
+        } catch (IOException e) {
+        }
+
+        List<String> identifiers = basic.getIdentifiers();
+        if (identifiers != null) {
+            for (String identifier : identifiers) {
+                metadata.add(XMP.IDENTIFIER, identifier);
+            }
+        }
+        List<String> advisories = basic.getAdvisories();
+        if (advisories != null) {
+            for (String advisory : advisories) {
+                metadata.add(XMP.ADVISORY, advisory);
+            }
+        }
+        setNotNull(XMP.NICKNAME, basic.getNickname(), metadata);
+        setNotNull(XMP.RATING, basic.getRating(), metadata);
+    }
+
+    private static void setNotNull(Property property, String value, Metadata metadata) {
+        if (metadata.get(property) == null && ! StringUtils.isEmpty(value)) {
+            metadata.set(property, value);
+        }
+    }
+
+    private static void setNotNull(Property property, Calendar value, Metadata metadata) {
+        if (metadata.get(property) == null && value != null) {
+            metadata.set(property, value);
+        }
+    }
+
+    private static void setNotNull(Property property, Integer value, Metadata metadata) {
+        if (metadata.get(property) == null && value != null) {
+            metadata.set(property, value);
+        }
+    }
+
     /**
      * As of this writing, XMPSchema can contain bags or sequence lists
      * for some attributes...despite standards documentation.
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
index 4e2e3c5..7547208 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
@@ -26,6 +26,8 @@ import static org.junit.Assert.fail;
 import static org.junit.Assume.assumeTrue;
 
 import java.io.InputStream;
+import java.nio.file.Paths;
+import java.util.Arrays;
 import java.util.HashMap;
 import java.util.HashSet;
 import java.util.List;
@@ -54,6 +56,7 @@ import org.apache.tika.metadata.Metadata;
 import org.apache.tika.metadata.OfficeOpenXMLCore;
 import org.apache.tika.metadata.PDF;
 import org.apache.tika.metadata.TikaCoreProperties;
+import org.apache.tika.metadata.XMP;
 import org.apache.tika.metadata.XMPMM;
 import org.apache.tika.mime.MediaType;
 import org.apache.tika.parser.AutoDetectParser;
@@ -1579,6 +1582,15 @@ public class PDFParserTest extends TikaTest {
 
     }
 
+    @Test
+    public void testXMPBasicSchema() throws Exception {
+        //TIKA-3101
+        List<Metadata> metadataList = getRecursiveMetadata("testPDF_XMPBasicSchema.pdf");
+        Metadata m = metadataList.get(0);
+        //these two fields derive from the basic schema in the XMP, not dublin core
+        assertEquals("Hewlett-Packard MFP", m.get(XMP.CREATOR_TOOL));
+        assertEquals("1998-08-29T13:53:15Z", m.get(XMP.CREATE_DATE));
+    }
     /**
      * Simple class to count end of document events.  If functionality is useful,
      * move to org.apache.tika in src/test
@@ -1607,4 +1619,6 @@ public class PDFParserTest extends TikaTest {
             return true;
         }
     }
+
+
 }
diff --git a/tika-parsers/src/test/resources/test-documents/testPDF_XMPBasicSchema.pdf b/tika-parsers/src/test/resources/test-documents/testPDF_XMPBasicSchema.pdf
new file mode 100644
index 0000000..69c912e
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testPDF_XMPBasicSchema.pdf differ
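
The setNotNull helpers in this commit only write a value when the property has not already been populated (for example from Dublin Core). The same "first writer wins" rule can be sketched with a plain map. The class name and the keys below are placeholders for illustration, not Tika's Metadata API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the metadata object; keys are illustrative only.
class SetIfAbsentSketch {

    // Store value under key only if nothing is there yet and value is non-empty.
    static void setNotNull(Map<String, String> metadata, String key, String value) {
        if (metadata.get(key) == null && value != null && !value.isEmpty()) {
            metadata.put(key, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        // Pretend this value was extracted earlier from another schema.
        metadata.put("title", "From Dublin Core");

        setNotNull(metadata, "title", "From XMP basic");      // ignored: already set
        setNotNull(metadata, "creatorTool", "Example Tool");  // stored: was missing
        setNotNull(metadata, "label", "");                    // ignored: empty value

        System.out.println(metadata);
    }
}
```

The ordering of extraction therefore matters: whichever schema is read first determines the value that survives.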


[tika] 12/12: TIKA-3104 -- fix localization thanks to forbiddenapis

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 5d1be35c317a69b53dfeca8a3e6398d6ff321a46
Author: tallison <ta...@apache.org>
AuthorDate: Tue Jun 2 09:22:29 2020 -0400

    TIKA-3104 -- fix localization thanks to forbiddenapis
---
 .../src/main/java/org/apache/tika/parser/apple/PListParser.java        | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/apple/PListParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/apple/PListParser.java
index 643a611..5d4cc3e 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/apple/PListParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/apple/PListParser.java
@@ -46,6 +46,7 @@ import java.text.DateFormat;
 import java.text.ParseException;
 import java.text.SimpleDateFormat;
 import java.util.Collections;
+import java.util.Locale;
 import java.util.Map;
 import java.util.Set;
 
@@ -79,7 +80,7 @@ public class PListParser extends AbstractParser {
 
         EmbeddedDocumentExtractor embeddedDocumentExtractor =
                 EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
-        DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ");
+        DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ", Locale.US);
         NSObject rootObj = null;
         try {
             if (stream instanceof TikaInputStream && ((TikaInputStream) stream).hasFile()) {
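
For context on why the explicit Locale matters: the one-argument SimpleDateFormat constructor picks up the JVM's default locale, so output can vary between machines, which is the kind of call forbiddenapis is designed to flag. A minimal sketch follows; the class and method names are hypothetical, and the time zone is pinned to UTC only to keep the output reproducible.

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical demo class, not part of Tika.
class LocaleDateFormatSketch {

    // Format a date with the pattern used in PListParser, for a given locale.
    static String formatUtc(Date date, Locale locale) {
        DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ", locale);
        df.setTimeZone(TimeZone.getTimeZone("UTC")); // fixed zone for repeatable output
        return df.format(date);
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L);
        // Explicit locale: the result does not depend on the host configuration.
        System.out.println(formatUtc(epoch, Locale.US));
        // Default locale: the result may differ from machine to machine.
        System.out.println(formatUtc(epoch, Locale.getDefault()));
    }
}
```

With Locale.US the epoch formats as 1970-01-01T00:00:00+0000 on any host.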


[tika] 11/12: TIKA-3104 -- fix markup to be closer to plist xml, add embedded document extractor

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 6a733506581d263b545a4a40767b116a3659b12e
Author: tallison <ta...@apache.org>
AuthorDate: Tue Jun 2 08:50:54 2020 -0400

    TIKA-3104 -- fix markup to be closer to plist xml, add embedded document extractor
---
 .../org/apache/tika/parser/apple/PListParser.java  | 116 ++++++++++++++++-----
 .../apache/tika/parser/apple/PListParserTest.java  |   8 +-
 2 files changed, 99 insertions(+), 25 deletions(-)

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/apple/PListParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/apple/PListParser.java
index ff56efe..643a611 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/apple/PListParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/apple/PListParser.java
@@ -26,8 +26,9 @@ import com.dd.plist.NSSet;
 import com.dd.plist.NSString;
 import com.dd.plist.PropertyListFormatException;
 import com.dd.plist.PropertyListParser;
-import com.lexicalscope.jewelcli.internal.cglib.asm.$MethodAdapter;
 import org.apache.tika.exception.TikaException;
+import org.apache.tika.extractor.EmbeddedDocumentExtractor;
+import org.apache.tika.extractor.EmbeddedDocumentUtil;
 import org.apache.tika.io.TikaInputStream;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.mime.MediaType;
@@ -41,30 +42,44 @@ import org.xml.sax.SAXException;
 import javax.xml.parsers.ParserConfigurationException;
 import java.io.IOException;
 import java.io.InputStream;
+import java.text.DateFormat;
 import java.text.ParseException;
+import java.text.SimpleDateFormat;
 import java.util.Collections;
 import java.util.Map;
 import java.util.Set;
 
 /**
  * Parser for Apple's plist and bplist.  This is a wrapper around
- *       <groupId>com.googlecode.plist</groupId>
- *       <artifactId>dd-plist</artifactId>
- *       <version>1.23</version>
+ *       com.googlecode.plist:dd-plist
  */
 public class PListParser extends AbstractParser {
 
+    private static final String ARR = "array";
+    private static final String DATA = "data";
+    private static final String DATE = "date";
+    private static final String DICT = "dict";
+    private static final String KEY = "key";
+    private static final String NUMBER = "number";
+    private static final String PLIST = "plist";
+    private static final String SET = "set";
+    private static final String STRING = "string";
+
+
     private static final Set<MediaType> SUPPORTED_TYPES =
             Collections.singleton(MediaType.application("x-bplist"));
-
     @Override
     public Set<MediaType> getSupportedTypes(ParseContext context) {
         return SUPPORTED_TYPES;
     }
 
     @Override
-    public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
+    public void parse(InputStream stream, ContentHandler handler, Metadata metadata,
+                      ParseContext context) throws IOException, SAXException, TikaException {
 
+        EmbeddedDocumentExtractor embeddedDocumentExtractor =
+                EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
+        DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ");
         NSObject rootObj = null;
         try {
             if (stream instanceof TikaInputStream && ((TikaInputStream) stream).hasFile()) {
@@ -76,47 +91,100 @@ public class PListParser extends AbstractParser {
             throw new TikaException("problem parsing root", e);
         }
         XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
+        State state = new State(xhtml, metadata, embeddedDocumentExtractor, df);
         xhtml.startDocument();
-        parseObject(rootObj, xhtml, metadata);
+        xhtml.startElement(PLIST);
+        parseObject(rootObj, state);
+        xhtml.endElement(PLIST);
         xhtml.endDocument();
     }
 
-    private void parseObject(NSObject obj, XHTMLContentHandler handler, Metadata metadata)
-            throws SAXException {
+    private void parseObject(NSObject obj, State state)
+            throws SAXException, IOException {
 
         if (obj instanceof NSDictionary) {
-            parseDict((NSDictionary)obj, handler, metadata);
+            parseDict((NSDictionary)obj, state);
         } else if (obj instanceof NSArray) {
             NSArray nsArray = (NSArray)obj;
+            state.xhtml.startElement(ARR);
             for (NSObject child : nsArray.getArray()) {
-                parseObject(child, handler, metadata);
+                parseObject(child, state);
             }
+            state.xhtml.endElement(ARR);
         } else if (obj instanceof NSString) {
-            handler.characters(((NSString)obj).toString());
+            state.xhtml.startElement(STRING);
+            state.xhtml.characters(((NSString)obj).getContent());
+            state.xhtml.endElement(STRING);
         } else if (obj instanceof NSNumber) {
-            handler.characters(((NSNumber) obj).toString());
+            state.xhtml.startElement(NUMBER);
+            state.xhtml.characters(((NSNumber) obj).toString());
+            state.xhtml.endElement(NUMBER);
         } else if (obj instanceof NSData) {
-            handleData((NSData) obj, handler, metadata);
+            state.xhtml.startElement(DATA);
+            handleData((NSData) obj, state);
+            state.xhtml.endElement(DATA);
         } else if (obj instanceof NSDate) {
-            handler.characters(((NSDate)obj).toString());
-        } else{
-            throw new UnsupportedOperationException("don't know baout: "+obj.getClass());
+            state.xhtml.startElement(DATE);
+            String dateString = state.dateFormat.format(((NSDate)obj).getDate());
+            state.xhtml.characters(dateString);
+            state.xhtml.endElement(DATE);
+        } else if (obj instanceof NSSet) {
+            state.xhtml.startElement(SET);
+            parseSet((NSSet)obj, state);
+            state.xhtml.endElement(SET);
+        } else {
+            throw new UnsupportedOperationException("don't yet support this type of object: "+obj.getClass());
+        }
+    }
 
+    private void parseSet(NSSet obj, State state)
+            throws SAXException, IOException {
+        state.xhtml.startElement(SET);
+        for (NSObject child : obj.allObjects()) {
+            parseObject(child, state);
         }
+        state.xhtml.endElement(SET);
     }
 
-    private void parseDict(NSDictionary obj, XHTMLContentHandler xhtml, Metadata metadata) throws SAXException {
+    private void parseDict(NSDictionary obj, State state)
+            throws SAXException, IOException {
+        state.xhtml.startElement(DICT);
         for (Map.Entry<String, NSObject> mapEntry : obj.getHashMap().entrySet()) {
             String key = mapEntry.getKey();
             NSObject value = mapEntry.getValue();
-            xhtml.startElement("div", "class", key);
-            parseObject(value, xhtml, metadata);
-            xhtml.endElement("div");
+            state.xhtml.element(KEY, key);
+            parseObject(value, state);
+        }
+        state.xhtml.endElement(DICT);
+    }
+
+    private void handleData(NSData value, State state) throws IOException,
+            SAXException {
+        state.xhtml.characters(value.getBase64EncodedData());
+        Metadata embeddedMetadata = new Metadata();
+        if (! state.embeddedDocumentExtractor.shouldParseEmbedded(embeddedMetadata)) {
+            return;
+        }
+
+        try (TikaInputStream tis = TikaInputStream.get(value.bytes())) {
+            state.embeddedDocumentExtractor.parseEmbedded(tis, state.xhtml, embeddedMetadata, false);
         }
     }
 
-    private void handleData(NSData value, XHTMLContentHandler handler, Metadata metadata) {
-        byte[] bytes = value.bytes();
-        //TODO handle embedded file
+    private static class State {
+        final XHTMLContentHandler xhtml;
+        final Metadata metadata;
+        final EmbeddedDocumentExtractor embeddedDocumentExtractor;
+        final DateFormat dateFormat;
+
+        public State(XHTMLContentHandler xhtml,
+                     Metadata metadata,
+                     EmbeddedDocumentExtractor embeddedDocumentExtractor,
+                     DateFormat df) {
+            this.xhtml = xhtml;
+            this.metadata = metadata;
+            this.embeddedDocumentExtractor = embeddedDocumentExtractor;
+            this.dateFormat = df;
+        }
     }
 }
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/apple/PListParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/apple/PListParserTest.java
index 534f65b..9d78548 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/apple/PListParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/apple/PListParserTest.java
@@ -23,6 +23,8 @@ import org.junit.Test;
 
 import java.util.List;
 
+import static org.junit.Assert.assertEquals;
+
 
 public class PListParserTest extends TikaTest {
 
@@ -31,8 +33,12 @@ public class PListParserTest extends TikaTest {
         //test file is MIT licensed:
         // https://github.com/joeferner/node-bplist-parser/blob/master/test/iTunes-small.bplist
         List<Metadata> metadataList = getRecursiveMetadata("testBPList.bplist");
+        assertEquals(21, metadataList.size());
         Metadata m = metadataList.get(0);
         String content = m.get(AbstractRecursiveParserWrapperHandler.TIKA_CONTENT);
-        assertContains("<div class=\"Application Version\">9.0.3</div>", content);
+        assertContains("<key>Application Version</key><string>9.0", content);
+
+        //TODO -- bad encoding right after this...smart quote?
+        assertContains("<string>90", content);
     }
 }
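
The recursive shape of parseObject (a dict emits key/value pairs, an array emits its children, leaves emit typed elements) can be sketched without dd-plist by walking ordinary Java collections. Everything below is a hypothetical stand-in: Map, List, String and Number replace the NS* types, and a StringBuilder replaces the XHTML handler. Unlike the diff, where parseObject and parseSet each open a set element, the sketch opens every wrapper exactly once.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain Java collections stand in for NSDictionary, NSArray, etc.
class PlistMarkupSketch {

    // Wrap the root object in a <plist> element, as the parser does.
    static String toMarkup(Object root) {
        StringBuilder out = new StringBuilder("<plist>");
        write(root, out);
        return out.append("</plist>").toString();
    }

    // Recursively emit one object; containers recurse into their children.
    static void write(Object obj, StringBuilder out) {
        if (obj instanceof Map) {
            out.append("<dict>");
            for (Map.Entry<?, ?> e : ((Map<?, ?>) obj).entrySet()) {
                out.append("<key>").append(escape(String.valueOf(e.getKey()))).append("</key>");
                write(e.getValue(), out);
            }
            out.append("</dict>");
        } else if (obj instanceof List) {
            out.append("<array>");
            for (Object child : (List<?>) obj) {
                write(child, out);
            }
            out.append("</array>");
        } else if (obj instanceof Number) {
            out.append("<number>").append(obj).append("</number>");
        } else if (obj instanceof String) {
            out.append("<string>").append(escape((String) obj)).append("</string>");
        } else {
            throw new UnsupportedOperationException("unsupported type: "
                    + (obj == null ? "null" : obj.getClass().getName()));
        }
    }

    // Minimal XML escaping for text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("Application Version", "9.0.3");
        root.put("Tracks", Arrays.asList(1, 2));
        System.out.println(toMarkup(root));
    }
}
```

Emitting key and value as siblings inside dict mirrors the plist XML layout that the updated test asserts on.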


[tika] 09/12: TIKA-3101 -- extract metadata from XMP basic schema, cleanup

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit c9957b9e237fe6aca4e23f4c8b25fc981c8703ae
Author: tallison <ta...@apache.org>
AuthorDate: Mon Jun 1 10:12:19 2020 -0400

    TIKA-3101 -- extract metadata from XMP basic schema, cleanup
---
 tika-core/src/main/java/org/apache/tika/metadata/XMP.java    | 12 +++++++-----
 .../java/org/apache/tika/parser/pdf/PDMetadataExtractor.java |  5 +++++
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/tika-core/src/main/java/org/apache/tika/metadata/XMP.java b/tika-core/src/main/java/org/apache/tika/metadata/XMP.java
index 9a26920..867248e 100644
--- a/tika-core/src/main/java/org/apache/tika/metadata/XMP.java
+++ b/tika-core/src/main/java/org/apache/tika/metadata/XMP.java
@@ -26,10 +26,12 @@ public interface XMP {
     String PREFIX_ = PREFIX + Metadata.NAMESPACE_PREFIX_DELIMITER;
 
     /**
-     * An unordered array of text strings that unambiguously identify the resource
-     * within a given context. An array item may be qualified with xmpidq:Scheme
-     * (see 8.7, “xmpidq namespace”) to denote the formal identification system to
-     * which that identifier conforms.
+     * Unordered text strings of advisories.
+     */
+    Property ABOUT = Property.externalTextBag(PREFIX_ + "About");
+
+    /**
+     * Unordered text strings of advisories.
      */
     Property ADVISORY = Property.externalTextBag(PREFIX_ + "Advisory");
 
@@ -71,7 +73,7 @@ public interface XMP {
     Property MODIFY_DATE = Property.externalDate(PREFIX_ + "ModifyDate");
 
     /**
-     * A word or short phrase that identifies a resource as a member of a userdefined collection.
+     * A word or short phrase that represents the nickname of the file.
      */
     Property NICKNAME = Property.externalText(PREFIX_ + "NickName");
 
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDMetadataExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDMetadataExtractor.java
index 16605cb..0d3f59d 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDMetadataExtractor.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDMetadataExtractor.java
@@ -36,6 +36,7 @@ import org.apache.pdfbox.pdmodel.common.PDMetadata;
 import org.apache.poi.util.IOUtils;
 import org.apache.tika.exception.TikaException;
 import org.apache.tika.extractor.EmbeddedDocumentUtil;
+import org.apache.tika.metadata.DublinCore;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.metadata.PDF;
 import org.apache.tika.metadata.Property;
@@ -119,6 +120,8 @@ class PDMetadataExtractor {
         //add the elements from the basic schema if they haven't already
         //been extracted from dublin core
         setNotNull(XMP.CREATOR_TOOL, basic.getCreatorTool(), metadata);
+        setNotNull(DublinCore.TITLE, basic.getTitle(), metadata);
+        setNotNull(XMP.ABOUT, basic.getAbout(), metadata);
         setNotNull(XMP.LABEL, basic.getLabel(), metadata);
         try {
             setNotNull(XMP.CREATE_DATE, basic.getCreateDate(), metadata);
@@ -147,6 +150,8 @@ class PDMetadataExtractor {
         }
         setNotNull(XMP.NICKNAME, basic.getNickname(), metadata);
         setNotNull(XMP.RATING, basic.getRating(), metadata);
+        //TODO: find an example where basic.getThumbNail is not null
+        //and figure out how to add that info
     }
 
     private static void setNotNull(Property property, String value, Metadata metadata) {


[tika] 04/12: TIKA-2961 Make the CAF mime magic more specific to avoid false positives, by checking for a version number after the "caff" header text

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 7a141d136578af5ce8cc5dd73e097566d7cbe1aa
Author: Nick Burch <ni...@gagravarr.org>
AuthorDate: Mon May 18 05:06:27 2020 +0100

    TIKA-2961 Make the CAF mime magic more specific to avoid false positives, by checking for a version number after the "caff" header text
---
 .../src/main/resources/org/apache/tika/mime/tika-mimetypes.xml      | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 06005f1..1343bc0 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -4958,7 +4958,11 @@
      <_comment>Core Audio Format</_comment>
      <_comment>com.apple.coreaudio-format</_comment>
      <magic priority="60">
-        <match value="caff" type="string" offset="0" />
+        <match value="caff\000\000" type="string" offset="0" />
+        <match value="caff\000\001" type="string" offset="0" />
+        <match value="caff\000\002" type="string" offset="0" />
+        <match value="caff\100\000" type="string" offset="0" />
+        <match value="caff\200\000" type="string" offset="0" />
      </magic>
      <glob pattern="*.caf"/>
   </mime-type>


[tika] 06/12: Add glob for Xcode Memgraph files, which are bplist-based

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 73b475f73bbed54db3e9a1ec000020f337b90e7c
Author: Nick Burch <ni...@gagravarr.org>
AuthorDate: Thu May 28 07:06:14 2020 +0100

    Add glob for Xcode Memgraph files, which are bplist-based
---
 .../src/main/resources/org/apache/tika/mime/tika-mimetypes.xml     | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index 88604a0..f6a108e 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -3226,6 +3226,7 @@
       <match value="bplist" type="string" offset="0"/>
     </magic>
   </mime-type>
+
   <mime-type type="application/x-gtar">
     <_comment>GNU tar Compressed File Archive (GNU Tape Archive)</_comment>
     <magic priority="50">
@@ -3777,6 +3778,12 @@
     <glob pattern="*.lzma"/>
   </mime-type>  
 
+  <mime-type type="application/x-memgraph">
+    <_comment>Apple Xcode Memgraph</_comment>
+	  <sub-class-of type="application/x-bplist"/>
+     <glob pattern="*.memgraph"/>
+  </mime-type>
+
   <mime-type type="application/x-mobipocket-ebook">
     <acronym>MOBI</acronym>
     <_comment>Mobipocket Ebook</_comment>
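
Note on the new entry: the memgraph type carries no magic of its own, only a glob on the file name plus a sub-class-of link to application/x-bplist. A rough illustration of glob-on-name matching follows, using the JDK's PathMatcher as a stand-in; Tika has its own glob handling, so treat this as an analogy, and MemgraphGlobSketch is a hypothetical name.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical illustration, not Tika code: matches a file name against "*.memgraph".
final class MemgraphGlobSketch {

    private static final PathMatcher MEMGRAPH =
            FileSystems.getDefault().getPathMatcher("glob:*.memgraph");

    static boolean isMemgraphName(String path) {
        // Match on the last path segment only, since "*" does not cross directories.
        return MEMGRAPH.matches(Paths.get(path).getFileName());
    }

    public static void main(String[] args) {
        System.out.println(isMemgraphName("dumps/leak.memgraph")); // true
        System.out.println(isMemgraphName("dumps/leak.bplist"));   // false
    }
}
```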


[tika] 02/12: TIKA-3094 -- new metadata for every parse :(

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 098256bd8eaba266f959c3478c7c9812dbf6e114
Author: tballison <ta...@apache.org>
AuthorDate: Tue May 5 10:42:12 2020 -0400

    TIKA-3094 -- new metadata for every parse :(
---
 .../src/test/java/org/apache/tika/bundle/BundleIT.java    | 15 ++-------------
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java b/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
index 2cab1d5..517aa0a 100644
--- a/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
+++ b/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
@@ -317,29 +317,18 @@ public class BundleIT {
         Parser parser = tika.getParser();
         ParseContext context = new ParseContext();
         context.set(Parser.class, parser);
-        Metadata metadata = new Metadata();
         Set<String> needToFix = new HashSet<>();
         needToFix.add("testAccess2_encrypted.accdb");
-
-        Set<String> unknownProblem = new HashSet<>();
-        //these all trigger org.apache.tika.metadata.PropertyTypeException
-        //which for some reason we can't catch (?!)
-        //We don't see problems with these files in tika-parsers?!
-/*        unknownProblem.add("testPPT_embedded_two_slides.pptx");
-        unknownProblem.add("testWORD_multi_authors.docx");
-        unknownProblem.add("testEXCEL_embeded.xlsx");
-        unknownProblem.add("testVORBIS.ogg");
-        unknownProblem.add("testWORD_2006ml.docx");
-        unknownProblem.add("testRTFEmbeddedLink.rtf");*/
         System.out.println(getTestDir());
         for (File f : getTestDir().listFiles()) {
             if (f.isDirectory()) {
                 continue;
             }
-            if (needToFix.contains(f.getName()) || unknownProblem.contains(f.getName())) {
+            if (needToFix.contains(f.getName())) {
                 continue;
             }
             System.out.println("about to parse "+f);
+            Metadata metadata = new Metadata();
             try (InputStream is = TikaInputStream.get(f)) {
                 parser.parse(is, handler, metadata, context);
             } catch (EncryptedDocumentException e) {
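
Note on the change above: the diff moves the Metadata allocation inside the loop so that each file starts with an empty object. The commit does not spell out the failure mode, so the sketch below only assumes that values carried over from one file to the next were the problem. It uses a plain Map as a stand-in; toyParse and FreshMetadataSketch are hypothetical and not Tika classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the test loop: shows how a shared metadata object accumulates state.
final class FreshMetadataSketch {

    // Hypothetical "parser": records the file name under a single key.
    static void toyParse(String fileName, Map<String, List<String>> metadata) {
        metadata.computeIfAbsent("resourceName", k -> new ArrayList<>()).add(fileName);
    }

    // Returns how many values the key holds after the last file was parsed.
    static int valuesAfterLastFile(List<String> files, boolean freshPerFile) {
        Map<String, List<String>> metadata = new HashMap<>();
        for (String f : files) {
            if (freshPerFile) {
                metadata = new HashMap<>(); // same idea as "new Metadata()" inside the loop
            }
            toyParse(f, metadata);
        }
        return metadata.get("resourceName").size();
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.docx", "b.pptx", "c.ogg");
        System.out.println(valuesAfterLastFile(files, false)); // 3: earlier files leak into later ones
        System.out.println(valuesAfterLastFile(files, true));  // 1: each parse sees only its own values
    }
}
```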


[tika] 10/12: TIKA-3104 -- addition of rudimentary bplist parser

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 1825d83961d3db5bcb2ff80b93d791dab0533fdf
Author: tallison <ta...@apache.org>
AuthorDate: Mon Jun 1 16:43:09 2020 -0400

    TIKA-3104 -- addition of rudimentary bplist parser
---
 CHANGES.txt                                        |   4 +
 LICENSE.txt                                        |  22 ++++
 tika-bundle/pom.xml                                |   2 +
 tika-parsers/pom.xml                               |   5 +
 .../org/apache/tika/parser/apple/PListParser.java  | 122 +++++++++++++++++++++
 .../services/org.apache.tika.parser.Parser         |   1 +
 .../apache/tika/parser/apple/PListParserTest.java  |  38 +++++++
 .../resources/test-documents/testBPList.bplist     | Bin 0 -> 24433 bytes
 8 files changed, 194 insertions(+)

diff --git a/CHANGES.txt b/CHANGES.txt
index 668d132..6081b4f 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,5 +1,9 @@
 Release 1.24.1 - 4/17/2020
 
+   * Add a basic parser for plist files based on com.googlecode.plist:dd-plist (TIKA-3104).
+
+Release 1.24.1 - 4/17/2020
+
    * Allow gzip compression of input and output streams for tika-server (TIKA-3073).
 
 Release 1.24 - 3/11/2020
diff --git a/LICENSE.txt b/LICENSE.txt
index e998546..17ea384 100644
--- a/LICENSE.txt
+++ b/LICENSE.txt
@@ -437,3 +437,25 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 THE SOFTWARE.
+
+com.googlecode.plist:dd-plist
+dd-plist - An open source library to parse and generate property lists
+Copyright (C) 2016 Daniel Dreibrodt
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
\ No newline at end of file
diff --git a/tika-bundle/pom.xml b/tika-bundle/pom.xml
index c2966ac..0ced4bf 100644
--- a/tika-bundle/pom.xml
+++ b/tika-bundle/pom.xml
@@ -199,6 +199,7 @@
               commons-io|
               commons-exec|
               commons-collections4|
+              dd-plist|
               junrar|
               pdfbox|
               pdfbox-tools|
@@ -280,6 +281,7 @@
               com.adobe.xmp;resolution:=optional,
               com.adobe.xmp.impl;resolution:=optional,
               com.adobe.xmp.options;resolution:=optional,
+              com.dd.plist;resolution:=optional,
               com.adobe.xmp.properties;resolution:=optional,
               com.github.luben.zstd;resolution:=optional,
               com.github.openjson;resolution:=optional,
diff --git a/tika-parsers/pom.xml b/tika-parsers/pom.xml
index 418db5b..6fc97f2 100644
--- a/tika-parsers/pom.xml
+++ b/tika-parsers/pom.xml
@@ -160,6 +160,11 @@
       <version>${mime4j.version}</version>
     </dependency>
     <dependency>
+      <groupId>com.googlecode.plist</groupId>
+      <artifactId>dd-plist</artifactId>
+      <version>1.23</version>
+    </dependency>
+    <dependency>
       <groupId>org.apache.commons</groupId>
       <artifactId>commons-compress</artifactId>
       <version>${commons.compress.version}</version>
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/apple/PListParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/apple/PListParser.java
new file mode 100644
index 0000000..ff56efe
--- /dev/null
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/apple/PListParser.java
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.parser.apple;
+
+import com.dd.plist.NSArray;
+import com.dd.plist.NSData;
+import com.dd.plist.NSDate;
+import com.dd.plist.NSDictionary;
+import com.dd.plist.NSNumber;
+import com.dd.plist.NSObject;
+import com.dd.plist.NSSet;
+import com.dd.plist.NSString;
+import com.dd.plist.PropertyListFormatException;
+import com.dd.plist.PropertyListParser;
+import com.lexicalscope.jewelcli.internal.cglib.asm.$MethodAdapter;
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.io.TikaInputStream;
+import org.apache.tika.metadata.Metadata;
+import org.apache.tika.mime.MediaType;
+import org.apache.tika.parser.AbstractParser;
+import org.apache.tika.parser.ParseContext;
+import org.apache.tika.sax.XHTMLContentHandler;
+import org.xml.sax.ContentHandler;
+import org.xml.sax.SAXException;
+
+
+import javax.xml.parsers.ParserConfigurationException;
+import java.io.IOException;
+import java.io.InputStream;
+import java.text.ParseException;
+import java.util.Collections;
+import java.util.Map;
+import java.util.Set;
+
+/**
+ * Parser for Apple's plist and bplist.  This is a wrapper around
+ *       <groupId>com.googlecode.plist</groupId>
+ *       <artifactId>dd-plist</artifactId>
+ *       <version>1.23</version>
+ */
+public class PListParser extends AbstractParser {
+
+    private static final Set<MediaType> SUPPORTED_TYPES =
+            Collections.singleton(MediaType.application("x-bplist"));
+
+    @Override
+    public Set<MediaType> getSupportedTypes(ParseContext context) {
+        return SUPPORTED_TYPES;
+    }
+
+    @Override
+    public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
+
+        NSObject rootObj = null;
+        try {
+            if (stream instanceof TikaInputStream && ((TikaInputStream) stream).hasFile()) {
+                rootObj = PropertyListParser.parse(((TikaInputStream) stream).getFile());
+            } else {
+                rootObj = PropertyListParser.parse(stream);
+            }
+        } catch (PropertyListFormatException|ParseException|ParserConfigurationException e) {
+            throw new TikaException("problem parsing root", e);
+        }
+        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
+        xhtml.startDocument();
+        parseObject(rootObj, xhtml, metadata);
+        xhtml.endDocument();
+    }
+
+    private void parseObject(NSObject obj, XHTMLContentHandler handler, Metadata metadata)
+            throws SAXException {
+
+        if (obj instanceof NSDictionary) {
+            parseDict((NSDictionary)obj, handler, metadata);
+        } else if (obj instanceof NSArray) {
+            NSArray nsArray = (NSArray)obj;
+            for (NSObject child : nsArray.getArray()) {
+                parseObject(child, handler, metadata);
+            }
+        } else if (obj instanceof NSString) {
+            handler.characters(((NSString)obj).toString());
+        } else if (obj instanceof NSNumber) {
+            handler.characters(((NSNumber) obj).toString());
+        } else if (obj instanceof NSData) {
+            handleData((NSData) obj, handler, metadata);
+        } else if (obj instanceof NSDate) {
+            handler.characters(((NSDate)obj).toString());
+        } else{
+            throw new UnsupportedOperationException("don't know about: "+obj.getClass());
+
+        }
+    }
+
+    private void parseDict(NSDictionary obj, XHTMLContentHandler xhtml, Metadata metadata) throws SAXException {
+        for (Map.Entry<String, NSObject> mapEntry : obj.getHashMap().entrySet()) {
+            String key = mapEntry.getKey();
+            NSObject value = mapEntry.getValue();
+            xhtml.startElement("div", "class", key);
+            parseObject(value, xhtml, metadata);
+            xhtml.endElement("div");
+        }
+    }
+
+    private void handleData(NSData value, XHTMLContentHandler handler, Metadata metadata) {
+        byte[] bytes = value.bytes();
+        //TODO handle embedded file
+    }
+}
diff --git a/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser b/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
index 52805fc..028de26 100644
--- a/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
+++ b/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
@@ -14,6 +14,7 @@
 #  limitations under the License.
 
 org.apache.tika.parser.apple.AppleSingleFileParser
+org.apache.tika.parser.apple.PListParser
 org.apache.tika.parser.asm.ClassParser
 org.apache.tika.parser.audio.AudioParser
 org.apache.tika.parser.audio.MidiParser
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/apple/PListParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/apple/PListParserTest.java
new file mode 100644
index 0000000..534f65b
--- /dev/null
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/apple/PListParserTest.java
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.parser.apple;
+
+import org.apache.tika.TikaTest;
+import org.apache.tika.metadata.Metadata;
+import org.apache.tika.sax.AbstractRecursiveParserWrapperHandler;
+import org.junit.Test;
+
+import java.util.List;
+
+
+public class PListParserTest extends TikaTest {
+
+    @Test
+    public void testBasicBinaryPList() throws Exception {
+        //test file is MIT licensed:
+        // https://github.com/joeferner/node-bplist-parser/blob/master/test/iTunes-small.bplist
+        List<Metadata> metadataList = getRecursiveMetadata("testBPList.bplist");
+        Metadata m = metadataList.get(0);
+        String content = m.get(AbstractRecursiveParserWrapperHandler.TIKA_CONTENT);
+        assertContains("<div class=\"Application Version\">9.0.3</div>", content);
+    }
+}
diff --git a/tika-parsers/src/test/resources/test-documents/testBPList.bplist b/tika-parsers/src/test/resources/test-documents/testBPList.bplist
new file mode 100644
index 0000000..b7edb14
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testBPList.bplist differ
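
Note on the parser above: parseObject walks the plist tree recursively, wrapping each dictionary entry in a div whose class attribute is the key and writing strings and numbers as text. The same traversal can be sketched without dd-plist, using plain Map and List values in place of NSDictionary and NSArray. PlistWalkSketch and its methods are hypothetical, and the string building here stands in for the SAX events that XHTMLContentHandler emits.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the recursive walk: Map plays NSDictionary, List plays NSArray.
final class PlistWalkSketch {

    static String render(Object node) {
        StringBuilder out = new StringBuilder();
        walk(node, out);
        return out.toString();
    }

    private static void walk(Object node, StringBuilder out) {
        if (node instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                out.append("<div class=\"").append(escape(String.valueOf(e.getKey()))).append("\">");
                walk(e.getValue(), out);
                out.append("</div>");
            }
        } else if (node instanceof List) {
            for (Object child : (List<?>) node) {
                walk(child, out);
            }
        } else if (node instanceof String || node instanceof Number) {
            out.append(escape(node.toString()));
        } else {
            throw new UnsupportedOperationException(
                    "don't know about: " + (node == null ? "null" : node.getClass()));
        }
    }

    // Minimal escaping so keys and values cannot break the markup.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("Application Version", "9.0.3");
        root.put("Tracks", Arrays.asList("a", 2));
        // prints <div class="Application Version">9.0.3</div><div class="Tracks">a2</div>
        System.out.println(render(root));
    }
}
```

The first div in the output has the same shape as the string asserted in PListParserTest. Array children are written back to back with no separator, which matches the array branch in the parser as committed.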


[tika] 07/12: Tweak whitespace to be consistent

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit d921031a8c8c694f87d43cdb481a0efae4093dbd
Author: Nick Burch <ni...@gagravarr.org>
AuthorDate: Thu May 28 07:15:16 2020 +0100

    Tweak whitespace to be consistent
---
 tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
index f6a108e..f16ae5a 100644
--- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
+++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
@@ -3780,8 +3780,8 @@
 
   <mime-type type="application/x-memgraph">
     <_comment>Apple Xcode Memgraph</_comment>
-	  <sub-class-of type="application/x-bplist"/>
-     <glob pattern="*.memgraph"/>
+    <sub-class-of type="application/x-bplist"/>
+    <glob pattern="*.memgraph"/>
   </mime-type>
 
   <mime-type type="application/x-mobipocket-ebook">


[tika] 01/12: TIKA-3094 add ignored unit test that runs the bundle against all of the test files.

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit f6b07702895af9c12a9c5f91a20db50d506a8bbd
Author: tallison <ta...@apache.org>
AuthorDate: Mon May 4 21:21:44 2020 -0400

    TIKA-3094 add ignored unit test that runs the bundle against all of the test files.
---
 tika-bundle/pom.xml                                |  3 +-
 .../test/java/org/apache/tika/bundle/BundleIT.java | 57 ++++++++++++++++++++++
 2 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/tika-bundle/pom.xml b/tika-bundle/pom.xml
index 3628cfa..dfe8a36 100644
--- a/tika-bundle/pom.xml
+++ b/tika-bundle/pom.xml
@@ -178,7 +178,6 @@
               xmlbeans|
               jackcess|
               jackcess-encrypt|
-              commons-lang|
               commons-lang3|
               tagsoup|
               asm|
@@ -192,6 +191,7 @@
               boilerpipe|
               rome|
               rome-utils|
+              jdom2|
               sentiment-analysis-parser|
               opennlp-tools|
               geoapi|
@@ -372,6 +372,7 @@
               org.jaxen.dom4j;resolution:=optional,
               org.jaxen.pattern;resolution:=optional,
               org.jaxen.saxpath;resolution:=optional,
+              org.jaxen.util;resolution:=optional,
               org.jdom;resolution:=optional,
               org.jdom.input;resolution:=optional,
               org.jdom.output;resolution:=optional,
diff --git a/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java b/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
index 4cefffb..2cab1d5 100644
--- a/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
+++ b/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
@@ -45,6 +45,8 @@ import javax.inject.Inject;
 import org.apache.tika.Tika;
 import org.apache.tika.detect.DefaultDetector;
 import org.apache.tika.detect.Detector;
+import org.apache.tika.exception.EncryptedDocumentException;
+import org.apache.tika.exception.TikaException;
 import org.apache.tika.fork.ForkParser;
 import org.apache.tika.io.TikaInputStream;
 import org.apache.tika.metadata.Metadata;
@@ -56,6 +58,7 @@ import org.apache.tika.parser.Parser;
 import org.apache.tika.parser.internal.Activator;
 import org.apache.tika.parser.ocr.TesseractOCRParser;
 import org.apache.tika.sax.BodyContentHandler;
+import org.junit.Ignore;
 import org.junit.Test;
 import org.junit.runner.RunWith;
 import org.ops4j.pax.exam.Configuration;
@@ -67,6 +70,7 @@ import org.osgi.framework.Bundle;
 import org.osgi.framework.BundleContext;
 import org.osgi.framework.ServiceReference;
 import org.xml.sax.ContentHandler;
+import org.xml.sax.SAXException;
 
 @RunWith(PaxExam.class)
 @ExamReactorStrategy(PerMethod.class)
@@ -301,4 +305,57 @@ public class BundleIT {
         String content = handler.toString();
         assertTrue(content.contains("Attachment Test"));
     }
+
+    @Test
+    @Ignore
+    public void testAll() throws Exception {
+        Tika tika = new Tika();
+
+        // Package extraction
+        ContentHandler handler = new BodyContentHandler();
+
+        Parser parser = tika.getParser();
+        ParseContext context = new ParseContext();
+        context.set(Parser.class, parser);
+        Metadata metadata = new Metadata();
+        Set<String> needToFix = new HashSet<>();
+        needToFix.add("testAccess2_encrypted.accdb");
+
+        Set<String> unknownProblem = new HashSet<>();
+        //these all trigger org.apache.tika.metadata.PropertyTypeException
+        //which for some reason we can't catch (?!)
+        //We don't see problems with these files in tika-parsers?!
+/*        unknownProblem.add("testPPT_embedded_two_slides.pptx");
+        unknownProblem.add("testWORD_multi_authors.docx");
+        unknownProblem.add("testEXCEL_embeded.xlsx");
+        unknownProblem.add("testVORBIS.ogg");
+        unknownProblem.add("testWORD_2006ml.docx");
+        unknownProblem.add("testRTFEmbeddedLink.rtf");*/
+        System.out.println(getTestDir());
+        for (File f : getTestDir().listFiles()) {
+            if (f.isDirectory()) {
+                continue;
+            }
+            if (needToFix.contains(f.getName()) || unknownProblem.contains(f.getName())) {
+                continue;
+            }
+            System.out.println("about to parse "+f);
+            try (InputStream is = TikaInputStream.get(f)) {
+                parser.parse(is, handler, metadata, context);
+            } catch (EncryptedDocumentException e) {
+                //swallow
+            } catch (SAXException e) {
+                //swallow
+            } catch (TikaException e) {
+                System.err.println("tika Exception "+f.getName());
+                e.printStackTrace();
+            }
+        }
+    }
+
+    private File getTestDir() {
+        return new File("../tika-parsers/src/test/resources/test-documents");
+    }
+
+
 }


[tika] 03/12: TIKA-3094: add javax.xml.bind to system packages. Fix java 11 jaxb.

Posted by ta...@apache.org.

tallison pushed a commit to branch branch_1x
in repository https://gitbox.apache.org/repos/asf/tika.git

commit b7c5d2ed1d43430dd29d25f1d7e8954ba48bb46d
Author: Bob Paulin <bo...@bobpaulin.com>
AuthorDate: Thu May 7 20:01:22 2020 -0500

    TIKA-3094: add javax.xml.bind to system packages.  Fix java 11 jaxb.
---
 tika-bundle/pom.xml                                | 42 ++++++++++++++++++++++
 .../test/java/org/apache/tika/bundle/BundleIT.java | 17 +++++----
 tika-bundle/test-bundles.xml                       | 14 ++++++++
 3 files changed, 66 insertions(+), 7 deletions(-)

diff --git a/tika-bundle/pom.xml b/tika-bundle/pom.xml
index dfe8a36..c2966ac 100644
--- a/tika-bundle/pom.xml
+++ b/tika-bundle/pom.xml
@@ -135,6 +135,45 @@
       <artifactId>slf4j-simple</artifactId>
       <scope>test</scope>
     </dependency>
+    
+    <dependency>
+	    <groupId>org.glassfish.jaxb</groupId>
+	    <artifactId>jaxb-runtime</artifactId>
+	    <version>2.3.2</version>
+	    <scope>test</scope>
+	</dependency>
+	<dependency>
+	    <groupId>com.sun.istack</groupId>
+	    <artifactId>istack-commons-runtime</artifactId>
+	    <version>3.0.8</version>
+	    <scope>test</scope>
+	</dependency>
+	<dependency>
+	    <groupId>com.sun.xml.fastinfoset</groupId>
+	    <artifactId>FastInfoset</artifactId>
+	    <version>1.2.16</version>
+	    <scope>test</scope>
+	</dependency>
+	<dependency>
+    <groupId>jakarta.activation</groupId>
+	  <artifactId>jakarta.activation-api</artifactId>
+	  <version>1.2.1</version>
+	  <scope>test</scope>
+	</dependency>
+	<dependency>
+	  <groupId>jakarta.xml.bind</groupId>
+	  <artifactId>jakarta.xml.bind-api</artifactId>
+	  <version>2.3.2</version>
+	  <scope>test</scope>
+	</dependency>
+    <dependency>
+	  <groupId>org.glassfish.jaxb</groupId>
+	  <artifactId>txw2</artifactId>
+	  <version>2.3.2</version>
+	  <scope>test</scope>
+	</dependency>
+	
+    
   </dependencies>
 
   <build>
@@ -561,6 +600,9 @@
           </execution>
         </executions>
         <configuration>
+          <additionalClasspathElements>
+          	<additionalClasspathElement>${project.build.directory}/test-bundles/jdk9plus</additionalClasspathElement>
+          </additionalClasspathElements>
           <systemPropertyVariables>
             <org.ops4j.pax.logging.DefaultServiceLog.level>
               INFO
diff --git a/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java b/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
index 517aa0a..af7cc50 100644
--- a/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
+++ b/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java
@@ -24,6 +24,7 @@ import static org.ops4j.pax.exam.CoreOptions.bundle;
 import static org.ops4j.pax.exam.CoreOptions.junitBundles;
 import static org.ops4j.pax.exam.CoreOptions.mavenBundle;
 import static org.ops4j.pax.exam.CoreOptions.options;
+import static org.ops4j.pax.exam.CoreOptions.systemPackages;
 
 import java.io.ByteArrayInputStream;
 import java.io.File;
@@ -90,12 +91,14 @@ public class BundleIT {
     @Configuration
     public Option[] configuration() throws IOException, URISyntaxException, ClassNotFoundException {
     	File base = new File(TARGET, "test-bundles");
-        return options(
-        		bundle(new File(base, "tika-core.jar").toURI().toURL().toString()),
-        		mavenBundle("org.ops4j.pax.logging", "pax-logging-api", "1.8.5"),
-        		mavenBundle("org.ops4j.pax.logging", "pax-logging-service", "1.8.5"),
-        		junitBundles(),
-        		bundle(new File(base, "tika-bundle.jar").toURI().toURL().toString()));
+    	 return options(
+         		systemPackages("javax.xml.bind"),
+         		bundle(new File(base, "tika-core.jar").toURI().toURL().toString()),
+         		mavenBundle("org.ops4j.pax.logging", "pax-logging-api", "1.8.5"),
+         		mavenBundle("org.ops4j.pax.logging", "pax-logging-service", "1.8.5"),
+         		junitBundles(),
+         		bundle(new File(base, "tika-bundle.jar").toURI().toURL().toString())
+                 );
     }
 
     @Test
@@ -318,7 +321,7 @@ public class BundleIT {
         ParseContext context = new ParseContext();
         context.set(Parser.class, parser);
         Set<String> needToFix = new HashSet<>();
-        needToFix.add("testAccess2_encrypted.accdb");
+        //needToFix.add("testAccess2_encrypted.accdb");
         System.out.println(getTestDir());
         for (File f : getTestDir().listFiles()) {
             if (f.isDirectory()) {
diff --git a/tika-bundle/test-bundles.xml b/tika-bundle/test-bundles.xml
index 0e60103..e07d1a3 100644
--- a/tika-bundle/test-bundles.xml
+++ b/tika-bundle/test-bundles.xml
@@ -31,5 +31,19 @@
         <include>org.apache.tika:tika-bundle</include>
       </includes>
     </dependencySet>
+    <dependencySet>
+      <outputDirectory>jdk9plus</outputDirectory>
+      <outputFileNameMapping>${artifact.artifactId}.jar</outputFileNameMapping>
+      <includes>
+      	<include>org.glassfish.jaxb:jaxb-runtime</include>
+        <include>com.sun.istack:istack-commons-runtime</include>
+        <include>jakarta.activation:jakarta.activation-api</include>
+        <include>com.sun.xml.fastinfoset:FastInfoset</include>
+        <include>jakarta.xml.bind:jakarta.xml.bind-api</include>
+        <include>org.glassfish.jaxb:txw2</include>
+        <include>org.jvnet.staxex:stax-ex</include>
+      </includes>
+      <scope>test</scope>
+    </dependencySet>
   </dependencySets>
 </assembly>