Posted to commits@tika.apache.org by to...@apache.org on 2017/09/14 00:35:22 UTC

[tika] branch master updated (d1a8bff -> db89ab3)

This is an automated email from the ASF dual-hosted git repository.

totaro pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.


    from d1a8bff  TIKA-2459 -- fix special character handling
     add e763021  Improvement for TIKA-2449 contributed by Giuseppe Totaro
     add 7b869c0  Added a regular expression to match standard word within a pattern for TIKA-2449 contributed by Giuseppe Totaro
     add 31625a2  Used the alphabetical order for the list of the standard organizations by relying on TreeMap. Thanks to Lewis McGibbney for this insightful suggestion (TIKA-2449 contributed by Giuseppe Totaro).
     new 7dd38d5  Merge branch 'master' of https://github.com/apache/tika
     new db89ab3  TIKA-2449: Enabling extraction of standard references from text

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
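
The TIKA-2449 commit messages above mention two ideas: a regular expression that matches a standard within a pattern, and a TreeMap that keeps the list of standard organizations in alphabetical order. The following is only a minimal sketch of how those two ideas could fit together. It is not the code added in this push: the class name StandardReferenceSketch, the three sample organizations, and the pattern itself are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the Tika implementation.
public class StandardReferenceSketch {

    // A TreeMap iterates its keys in natural (alphabetical) order,
    // regardless of the order in which entries are inserted.
    static final Map<String, String> ORGANIZATIONS = new TreeMap<>();
    static {
        // Assumed sample entries, deliberately inserted out of order.
        ORGANIZATIONS.put("ISO", "International Organization for Standardization");
        ORGANIZATIONS.put("ANSI", "American National Standards Institute");
        ORGANIZATIONS.put("IEEE", "Institute of Electrical and Electronics Engineers");
    }

    // Assumed pattern: an organization acronym, whitespace, then an identifier
    // that starts with a digit and ends with a word character, so a trailing
    // sentence period is not captured.
    static Pattern buildPattern() {
        String acronyms = String.join("|", ORGANIZATIONS.keySet());
        return Pattern.compile("\\b(" + acronyms + ")\\s+(\\d(?:[\\w.:-]*\\w)?)");
    }

    // Returns every "ACRONYM identifier" reference found in the text.
    static List<String> extract(String text) {
        List<String> references = new ArrayList<>();
        Matcher matcher = buildPattern().matcher(text);
        while (matcher.find()) {
            references.add(matcher.group(1) + " " + matcher.group(2));
        }
        return references;
    }

    public static void main(String[] args) {
        System.out.println(ORGANIZATIONS.keySet());
        System.out.println(extract("Complies with ISO 9001:2015 and IEEE 802.11."));
    }
}
```

Building the alternation from the map's key set means that adding an organization to the map also extends the pattern, and the sorted keys give a stable listing order without a separate sort step.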


Summary of changes:
 CHANGES.txt                                        |   2 +
 .../org/apache/tika/sax/StandardOrganizations.java | 166 ++++++++++++++++++
 .../org/apache/tika/sax/StandardReference.java     | 124 ++++++++++++++
 .../sax/StandardsExtractingContentHandler.java     | 116 +++++++++++++
 .../java/org/apache/tika/sax/StandardsText.java    | 188 +++++++++++++++++++++
 .../tika/example/StandardsExtractionExample.java   | 109 ++++++++++++
 .../sax/StandardsExtractingContentHandlerTest.java |  55 ++++++
 .../test-documents/testStandardsExtractor.pdf      | Bin 0 -> 143659 bytes
 8 files changed, 760 insertions(+)
 create mode 100644 tika-core/src/main/java/org/apache/tika/sax/StandardOrganizations.java
 create mode 100644 tika-core/src/main/java/org/apache/tika/sax/StandardReference.java
 create mode 100644 tika-core/src/main/java/org/apache/tika/sax/StandardsExtractingContentHandler.java
 create mode 100644 tika-core/src/main/java/org/apache/tika/sax/StandardsText.java
 create mode 100644 tika-example/src/main/java/org/apache/tika/example/StandardsExtractionExample.java
 create mode 100644 tika-parsers/src/test/java/org/apache/tika/sax/StandardsExtractingContentHandlerTest.java
 create mode 100644 tika-parsers/src/test/resources/test-documents/testStandardsExtractor.pdf

-- 
To stop receiving notification emails like this one, please contact
"commits@tika.apache.org" <co...@tika.apache.org>.

[tika] 02/02: TIKA-2449: Enabling extraction of standard references from text

Posted by to...@apache.org.

totaro pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit db89ab3ca701077f2615647667d868ca1cf9a728
Author: Giuseppe Totaro <to...@gmail.com>
AuthorDate: Wed Sep 13 17:35:10 2017 -0700

    TIKA-2449: Enabling extraction of standard references from text
---
 CHANGES.txt | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/CHANGES.txt b/CHANGES.txt
index f7d0521..26ad26e 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,5 +1,7 @@
 Release 1.17 - ???
 
+  * Enabling extraction of standard references from text (TIKA-2449).
+
   * Load external custom mimetypes XML from system property 
     tika.custom-mimetypes (TIKA-2460). 
 


[tika] 01/02: Merge branch 'master' of https://github.com/apache/tika

Posted by to...@apache.org.

totaro pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git

commit 7dd38d5500c539328e2b0a083e1998a28e680539
Merge: 31625a2 d1a8bff
Author: Giuseppe Totaro <to...@gmail.com>
AuthorDate: Wed Sep 13 17:33:22 2017 -0700

    Merge branch 'master' of https://github.com/apache/tika

 CHANGES.txt                                        |  18 +++++-
 .../org/apache/tika/batch/fs/BatchProcessTest.java |   7 +--
 .../org/apache/tika/detect/CompositeDetector.java  |   7 +++
 .../org/apache/tika/detect/OverrideDetector.java   |  41 +++++++++++++
 .../tika/exception/ZeroByteFileException.java      |  11 ++++
 .../main/java/org/apache/tika/metadata/TIFF.java   |   3 +
 .../apache/tika/metadata/TikaCoreProperties.java   |   3 +
 .../org/apache/tika/mime/MimeTypesFactory.java     |  19 ++++++
 .../org/apache/tika/parser/AutoDetectParser.java   |  10 +++-
 .../org/apache/tika/mime/MimeTypesReaderTest.java  |  16 ++++++
 .../org/apache/tika/mime/external-mimetypes.xml    |  22 +++++++
 .../apache/tika/eval/reports/ResultsReporter.java  |  10 +++-
 .../tika/parser/image/ImageMetadataExtractor.java  |  24 ++++++++
 .../org/apache/tika/parser/mbox/MboxParser.java    |   1 +
 .../apache/tika/parser/mbox/OutlookPSTParser.java  |  14 +++++
 .../tika/parser/microsoft/WordExtractor.java       |   5 ++
 .../ooxml/OOXMLWordAndPowerPointTextHandler.java   |  51 ++++++++++++----
 .../ooxml/SXWPFWordExtractorDecorator.java         |   2 +-
 .../recognition/tf/TensorflowRESTRecogniser.java   |  12 +++-
 .../services/org.apache.tika.detect.Detector       |   1 +
 .../parser/recognition/tf/InceptionRestDockerfile  |   4 +-
 .../tika/parser/recognition/tf/inceptionapi.py     |  29 ++++++++--
 .../apache/tika/parser/AutoDetectParserTest.java   |  64 +++++++++++++++------
 .../apache/tika/parser/image/TiffParserTest.java   |  13 ++++-
 .../apache/tika/parser/mbox/MboxParserTest.java    |  15 +++++
 .../tika/parser/mbox/OutlookPSTParserTest.java     |  15 +++++
 .../tika/parser/microsoft/WordParserTest.java      |  16 ++++++
 .../parser/microsoft/ooxml/OOXMLParserTest.java    |  17 ++++++
 .../parser/microsoft/ooxml/SXWPFExtractorTest.java |  16 ++++++
 .../tika/parser/ocr/TesseractOCRParserTest.java    |   9 +++
 .../recognition/ObjectRecognitionParserTest.java   |  35 +++++++++++
 .../parser/recognition/tika-config-tflow-rest.xml  |   2 +
 .../test/resources/test-documents/single_mail.mbox |  25 ++++++++
 .../test-documents/testPST_variousBodyTypes.pst    | Bin 0 -> 271360 bytes
 .../test-documents/testTIFF_multipage.tif          | Bin 0 -> 156867 bytes
 .../resources/test-documents/testWORD_phonetic.doc | Bin 0 -> 27136 bytes
 .../test-documents/testWORD_phonetic.docx          | Bin 0 -> 12523 bytes
 .../testWORD_specialControlCharacter1415.doc       | Bin 0 -> 25600 bytes
 38 files changed, 492 insertions(+), 45 deletions(-)
