Posted to commits@tika.apache.org by to...@apache.org on 2017/09/14 00:35:22 UTC
[tika] branch master updated (d1a8bff -> db89ab3)
This is an automated email from the ASF dual-hosted git repository.
totaro pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git.
from d1a8bff TIKA-2459 -- fix special character handling
add e763021 Improvement for TIKA-2449 contributed by Giuseppe Totaro
add 7b869c0 Added a regular expression to match standard word within a pattern for TIKA-2449 contributed by Giuseppe Totaro
add 31625a2 Used the alphabetical order for the list of the standard organizations by relying on TreeMap. Thanks to Lewis McGibbney for this insightful suggestion (TIKA-2449 contributed by Giuseppe Totaro).
new 7dd38d5 Merge branch 'master' of https://github.com/apache/tika
new db89ab3 TIKA-2449: Enabling extraction of standard references from text
The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
Summary of changes:
CHANGES.txt | 2 +
.../org/apache/tika/sax/StandardOrganizations.java | 166 ++++++++++++++++++
.../org/apache/tika/sax/StandardReference.java | 124 ++++++++++++++
.../sax/StandardsExtractingContentHandler.java | 116 +++++++++++++
.../java/org/apache/tika/sax/StandardsText.java | 188 +++++++++++++++++++++
.../tika/example/StandardsExtractionExample.java | 109 ++++++++++++
.../sax/StandardsExtractingContentHandlerTest.java | 55 ++++++
.../test-documents/testStandardsExtractor.pdf | Bin 0 -> 143659 bytes
8 files changed, 760 insertions(+)
create mode 100644 tika-core/src/main/java/org/apache/tika/sax/StandardOrganizations.java
create mode 100644 tika-core/src/main/java/org/apache/tika/sax/StandardReference.java
create mode 100644 tika-core/src/main/java/org/apache/tika/sax/StandardsExtractingContentHandler.java
create mode 100644 tika-core/src/main/java/org/apache/tika/sax/StandardsText.java
create mode 100644 tika-example/src/main/java/org/apache/tika/example/StandardsExtractionExample.java
create mode 100644 tika-parsers/src/test/java/org/apache/tika/sax/StandardsExtractingContentHandlerTest.java
create mode 100644 tika-parsers/src/test/resources/test-documents/testStandardsExtractor.pdf
--
To stop receiving notification emails like this one, please contact
"commits@tika.apache.org" <co...@tika.apache.org>.
[tika] 02/02: TIKA-2449: Enabling extraction of standard references from text
Posted by to...@apache.org.
This is an automated email from the ASF dual-hosted git repository.
totaro pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git
commit db89ab3ca701077f2615647667d868ca1cf9a728
Author: Giuseppe Totaro <to...@gmail.com>
AuthorDate: Wed Sep 13 17:35:10 2017 -0700
TIKA-2449: Enabling extraction of standard references from text
---
CHANGES.txt | 2 ++
1 file changed, 2 insertions(+)
diff --git a/CHANGES.txt b/CHANGES.txt
index f7d0521..26ad26e 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,5 +1,7 @@
Release 1.17 - ???
+ * Enabling extraction of standard references from text (TIKA-2449).
+
* Load external custom mimetypes XML from system property
tika.custom-mimetypes (TIKA-2460).
[tika] 01/02: Merge branch 'master' of https://github.com/apache/tika
Posted by to...@apache.org.
This is an automated email from the ASF dual-hosted git repository.
totaro pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/tika.git
commit 7dd38d5500c539328e2b0a083e1998a28e680539
Merge: 31625a2 d1a8bff
Author: Giuseppe Totaro <to...@gmail.com>
AuthorDate: Wed Sep 13 17:33:22 2017 -0700
Merge branch 'master' of https://github.com/apache/tika
CHANGES.txt | 18 +++++-
.../org/apache/tika/batch/fs/BatchProcessTest.java | 7 +--
.../org/apache/tika/detect/CompositeDetector.java | 7 +++
.../org/apache/tika/detect/OverrideDetector.java | 41 +++++++++++++
.../tika/exception/ZeroByteFileException.java | 11 ++++
.../main/java/org/apache/tika/metadata/TIFF.java | 3 +
.../apache/tika/metadata/TikaCoreProperties.java | 3 +
.../org/apache/tika/mime/MimeTypesFactory.java | 19 ++++++
.../org/apache/tika/parser/AutoDetectParser.java | 10 +++-
.../org/apache/tika/mime/MimeTypesReaderTest.java | 16 ++++++
.../org/apache/tika/mime/external-mimetypes.xml | 22 +++++++
.../apache/tika/eval/reports/ResultsReporter.java | 10 +++-
.../tika/parser/image/ImageMetadataExtractor.java | 24 ++++++++
.../org/apache/tika/parser/mbox/MboxParser.java | 1 +
.../apache/tika/parser/mbox/OutlookPSTParser.java | 14 +++++
.../tika/parser/microsoft/WordExtractor.java | 5 ++
.../ooxml/OOXMLWordAndPowerPointTextHandler.java | 51 ++++++++++++----
.../ooxml/SXWPFWordExtractorDecorator.java | 2 +-
.../recognition/tf/TensorflowRESTRecogniser.java | 12 +++-
.../services/org.apache.tika.detect.Detector | 1 +
.../parser/recognition/tf/InceptionRestDockerfile | 4 +-
.../tika/parser/recognition/tf/inceptionapi.py | 29 ++++++++--
.../apache/tika/parser/AutoDetectParserTest.java | 64 +++++++++++++++------
.../apache/tika/parser/image/TiffParserTest.java | 13 ++++-
.../apache/tika/parser/mbox/MboxParserTest.java | 15 +++++
.../tika/parser/mbox/OutlookPSTParserTest.java | 15 +++++
.../tika/parser/microsoft/WordParserTest.java | 16 ++++++
.../parser/microsoft/ooxml/OOXMLParserTest.java | 17 ++++++
.../parser/microsoft/ooxml/SXWPFExtractorTest.java | 16 ++++++
.../tika/parser/ocr/TesseractOCRParserTest.java | 9 +++
.../recognition/ObjectRecognitionParserTest.java | 35 +++++++++++
.../parser/recognition/tika-config-tflow-rest.xml | 2 +
.../test/resources/test-documents/single_mail.mbox | 25 ++++++++
.../test-documents/testPST_variousBodyTypes.pst | Bin 0 -> 271360 bytes
.../test-documents/testTIFF_multipage.tif | Bin 0 -> 156867 bytes
.../resources/test-documents/testWORD_phonetic.doc | Bin 0 -> 27136 bytes
.../test-documents/testWORD_phonetic.docx | Bin 0 -> 12523 bytes
.../testWORD_specialControlCharacter1415.doc | Bin 0 -> 25600 bytes
38 files changed, 492 insertions(+), 45 deletions(-)