Posted to commits@tika.apache.org by ta...@apache.org on 2019/02/28 02:53:03 UTC
[tika] branch TIKA-2833 created (now d3317f9)
This is an automated email from the ASF dual-hosted git repository.
tallison pushed a change to branch TIKA-2833
in repository https://gitbox.apache.org/repos/asf/tika.git.
at d3317f9 TIKA-2833 -- initial commit with csv detection and swapping out the TXTParser in favor of the CSVParser
This branch includes the following new commits:
new 19ab44f TIKA-1: Standard {trunk,branches,tags} setup
new f10cd42 TIKA-1: Standard README, NOTICE, and LICENSE files.
new 3794e9a TIKA-4: Basic Maven 2 POM and source tree for Tika.
new 6e750bb TIKA-4: Ignore Eclipse project files.
new c5417be TIKA-2: Basic web site based on Maven 2.
new c914984 TIKA-2: The site is deployed to the incubator/tika/site directory in svn.
new 99b1e06 TIKA-4: Added brief Maven build instructions and some other project documentation.
new 747610b - update POM to include additional developer attributes for mattmann
new 3aff751 - placeholder for unit tests
new c88c4da Add Rida Benjelloun id and email
new 7b7ac41 Changelog for Tika.
new 77428b0 patch for TIKA-5
new 6e3ee16 TIKA-7: Added the Lius Lite code from Rida. External dependencies are not included, need to update the POM with proper dependency settings.
new 4e56c5e pom.xml: Replaced tabs with spaces, fixed indentation.
new b27dfe0 TIKA-7: Added missing dependencies to POM.
new 2b86daf TIKA-8: Replaced the jmimeinfo dependency with a trivial mime type detector.
new d363b82 TIKA-10 Remove MimeInfoException catch clauses and import from TestParsers. Contributed by Keith R. Bennett.
new f42035e TIKA-13 Fix obsolete package names in config.xml. Contributed by Keith R. Bennett.
new 3c85d05 - fix for TIKA-11
new eb9d2e9 - addendum to TIKA-11 (move /src/main/test -> moved to /src/test)
new a2b47b8 - fixed typo (K. Bennett via mattmann)
new c47f57c TIKA-12: Support MIME type detection based on a URL. Patch from Keith Bennett.
new 74f807d TIKA-12: Added MimeTypesUtils test case contributed by Keith Bennett.
new d7dabee TIKA-19: fix org.apache.tika.TestParsers, test more file types and improve exception handling in LiusConfig and ParserFactory. Includes fixes from TIKA-16 and TIKA-14 which were contributed by Keith R. Bennett, thanks!
new 1e2373c remove redundant sourceDirectory statements, we're using the standard Maven layout now
new 346d584 TIKA-15: Applied patch from Keith Bennett.
new 9dab155 - bring CHANGES.txt up to date
new 21cf8be - fix for TIKA-18
new 780f13d TIKA-12 - Decouple Parser from ParserConfig
new 3762413 - patch for TIKA-6
new 64c1e0e - patch for TIKA-6 (cont.)
new 773fbd2 - fix license header: d'oh
new 53f61c8 TIKA-25 - Removed hardcoded reference to C:\oo.xml, as suggested by Keith Bennett.
new 9f4c38c - fix for TIKA-17
new 32fda63 - fix for TIKA-22
new 033a07c TIKA-21 - Simplified configuration code - LiusConfig is now instantiated as: new LiusConfig("config.file"); - Dropped use of static caching and maps for config objects - Made configuration objects immutable (except for Content values)
new f8183f2 TIKA-17 - Rename all "Lius" classes to be "Tika" classes
new 0c20384 TIKA-27 - Replaced more "lius" references with "tika"
new 83cb301 - remove Hadoop/Nutch Configuration and Configurable interfaces - wire together MimeUtils and tika config.xml file - wire together MimeUtils and TikaConfig
new b2c0c6d TIKA-30 - Added utility constructors to TikaConfig - TikaConfig(String), calls TikaConfig(File) - TikaConfig(File), calls TikaConfig(Document) - TikaConfig(URL), calls TikaConfig(Document) - TikaConfig(InputStream), calls TikaConfig(Document) - TikaConfig(Document), calls TikaConfig(Element) - TikaConfig(Element), the base implementation
new bcc9f0c - fix for TIKA-28
new 4707526 TIKA-26 - Implemented Parser.getContent(String) in the base class
new 6244bec TIKA-26 - Implemented Parser.getStrContent() in the base class
new 522f4f3 TIKA-26 - Use Map<String, Content> instead of List<Content>
new da4fde4 typo
new cfe3527 TIKA-31 - protected Parser.parse(InputStream stream, Iterable<Content> contents)
new d3e678b TIKA-32 - remove useless CDATA clauses, and code cleanup - contributed by Keith R. Bennett, thanks!
new 6db2b1d - fix for TIKA-36
new 23c4ddd Tika-38. TXTParser Keith contribution
new c30bbcb TIKA-33 - Stateless parsers
new 3db2089 Update CHANGES.txt file
new 7bdb1c8 TIKA-35 Extract MsOffice properties. I have implemented a method in the Utils class that copies an InputStream into memory.
new 29eedb3 - make Tika=>TIKA to be consistent with JIRA key names - use apache committer ID if Tika committer
new 701d470 - fix for TIKA-34 (contributed by K. Bennett)
new a03498c TIKA-35 - Extract MsOffice properties, use RereadableInputStream developed by K. Bennett
new 53d14c5 TIKA-35 Close RereadableInputStream in MSExtractor and RereadableInputStreamTest classes.
new 9a00212 ZIP extraction. Three methods have been added to the ParseUtils class. The getParsersFromZip() methods return a list of parsers. Consult the unit test class to see how it works.
new d4bb41f TIKA-44 - Spaces for indentation
new aab5b10 TIKA-42 - Content class needs (String, String, String) constructor - Patch from Keith Bennett.
new ee2c3b9 TIKA-43 - Parser interface
new dcd8975 TIKA-43 - Parser interface
new f00e6fb TIKA-47 - Remove TikaLogger - Removed org.apache.tika.log - Moved log4j/log4j.properties to src/test/resources - Use a system property instead of code to configure Log4J
new 6d37de9 TIKA-46 - Use Metadata in Parser - With improvements by Chris Mattmann
new d840df1 Set svn:eol-style to native
new 58c2360 TIKA-46 - Use Metadata in Parser - Use Metadata.TITLE as suggested by Chris
new 62e58ea TIKA-46 - Use Metadata in Parser - Moved metadata configuration to the Parser classes - Removed the Content class
new aceff84 TIKA-48 - Merge MS Extractors and Parsers - Moved MSExtractor base class to org.apache.tika.ms.MSParser - Extracted the PropertiesReaderListener class to a top level class - Merged MS Extractor classes to MS Parsers - Refactored the Excel parsing functionality into smaller methods - Various cleanups (indentation, formatting, etc.)
new 810b1d4 TIKA-45 - RereadableInputStream needs to be able to read to the end of the original stream on first rewind - Committed patch from Keith Bennett
new e01051b TIKA-41 - Resource files occur twice in jar file - Use declarative constructs to put the resources in the correct place
new 09b699a - update to include Jukka's update for TIKA-41
new 54aa413 TIKA-49 - use the correct Apache license headers, thanks to Robert Burrell Donkin
new fa555fa TIKA-49 - use the correct Apache license headers, thanks to Robert Burrell Donkin
new aefc60f svn:ignore more files
new 838fe5f TIKA-51, Leftover temp files after running Tika tests, fixed. Also added TIKA_ prefix to all File.createTempFile() calls
new 7d91d37 TIKA-40 - Tika needs to support diverse character encodings - Use ICU4J to parse text content - Support Metadata.CONTENT_ENCODING hints in TXTParser - Added specific test cases for TXTParser
new e7d7a1c - fix for TIKA-55 (contributed by K. Bennett)
new 423f67e TIKA-52 - RereadableInputStream needs to support not closing the input stream it wraps
new c943f69 TIKA-53 - XHTML SAX events from parsers
new d064cb2 TIKA-57 - Rename org.apache.tika.ms to org.apache.tika.parser.ms
new a6ca816 update issueManagement section
new ee39fc8 TIKA-62 - Use TikaConfig.getDefaultConfig() instead of a hardcoded config path in TestParsers
new 70517c3 TIKA-58 - Replace jtidy html parser with nekohtml based parser
new 5e0a2b3 add acknowledgment as required by NekoHTML license
new 3fb58b7 TIKA-60 - Rename Microsoft parser classes
new e759bbb TIKA-60 - Rename Microsoft parser classes
new 1081cb5 TIKA-63 - Avoid multiple passes over the input stream in Microsoft parsers - Use POIFSFileSystem as the source of both metadata and text content - Added separate test case classes for the Microsoft parsers - Got rid of some extra listeners and exceptions
new 9fce256 TIKA-66 - Use Java 5 features in org.apache.tika.mime - Use Java 5 generics and foreach constructs to simplify code - Removed some unused variables and method parameters - Other minor cleanups
new 9477c5e - make test case class name consistent with other names (i.e., start with "Test...")
new b12c01d - fix for TIKA-56
new a328f9c TIKA-65 - Add encode detection support for HTML parser
new b0a87ad remove failing test temporarily
new 1d2e41f TIKA-68 - Add dummy parser classes to be used as sentinels
new 703c4b0 TIKA-67 - Add an auto-detecting Parser implementation
new 8004791 TIKA-70 - Better MIME information for the Open Document formats
new 0038570 TIKA-70 - Better MIME information for the Open Document formats
new e1da9a1 Removed an extra debug print
new 580824e TIKA-71 - Remove ParserConfig and ParserFactory
new 67c79ba Testing new committer status; added my name.
new f7079fd Moved name to its correct position in alphabetical order. (Sorry!)
new 9ffdd54 TIKA-72: Added Metadata.RESOURCE_NAME_KEY, and changed uses of "filename" to it.
new a8d1e67 TIKA-72: As per Chris' suggestion, moved RESOURCE_NAME_KEY from Metadata to new interface TikaMetadataKeys, and changed Metadata to implement TikaMetadataKeys.
new bd68bd6 Added clearer error message if a stream cannot be opened from a URL.
new a73e6cc TIKA-78 - AutoDetectParserTest should include tests for bad MIME types and resource names.
new fb3290d TIKA-77. The ParserPostProcessor is no longer used to wrap the parser.
new 2e72c03 The use of Utils was there because the method was originally in the Utils class. Now that it is in TikaConfig, using TikaConfig is preferable.
new 2b09fac TIKA-72: The use of "filename" is replaced with "resource name", since we may be dealing with file names, URL's, etc.
new 30d4072 TIKA-78: In AutoDetectParserTest, put each document type test in its own method so that one failure would not prevent the other document types from being tested.
new 579bee4 Correct indenting (four spaces instead of one as the first indent on line)
new c176570 Set svn:eol-style to native
new 91f76e6 TIKA-75: Provides a MimeUtils.getType(URL) method that will determine MIME type based on the stream and, if necessary, the name.
new a7d091b TIKA-81. Added default constructor to MimeUtils.
new 076d9ef TIKA-82. Disabling a log call.
new 4c8f0b8 TIKA-83 - Create a org.apache.tika.sax package for SAX utilities
new 9db6b38 TIKA-84 - Add MimeTypes.getMimeType(InputStream)
new 6b827ba Add news about Keith's committership, and document the website update steps
new 001e1f7 TIKA-84 - Add MimeTypes.getMimeType(InputStream) - Added also getMimeType(String, InputStream) - Extracted common code to readMagicHeader(InputStream) - Javadoc improvements
new f383585 TIKA-85 - Add glob patterns from the ASF svn:eol-style documentation - Added patterns based on svn:eol-style and svn:mime-type defaults - Many of the patterns should be assigned to appropriate MIME subtypes
new 5087833 TIKA-87 - MimeTypes should allow modification of MIME types - Reversed the MimeTypes -> MimeTypesReader dependency - Work in progress
new 2291cab TIKA-88: Moved all nonredundant functionality from MimeUtils to MimeTypes. Moved test code from MimeUtilsTest to MimeTypesTest accordingly. Deleted MimeUtils class and its test class. Modified URL for MIME type config file in default tika-config.xml to have leading "/". Created MimeTypesFactory class as a public factory and adapter to package protected MimeTypesReader.
new b1bcf42 TIKA-87 - MimeTypes should allow modification of MIME types - Merged MimeInfo and MimeType - Made MimeType Comparable
new 05b7bb7 TIKA-87 - MimeTypes should allow modification of MIME types - Made Magic Comparable
new 4b126e8 TIKA-87 - MimeTypes should allow modification of MIME types - MimeType.addAlias(String) can now be used to add new aliases - MimeType.addPattern(String) can now be used to add new patterns - MimeTypes.forName(String) validates the name - MimeTypes.forName(String) creates and registers the type if needed - Simplified type name handling and validation - New test cases
new 3be8f6c TIKA-87 - MimeTypes should allow modification of MIME types - MimeType.setSuperType(MimeType) can now be used to modify inheritance
new a01dcbe TIKA-87 - MimeTypes should allow modification of MIME types - Streamlined pattern handling
new 5cbfe94 TIKA-100 - Structured PDF parsing - Customized the PdfTextStripper class to produce XHTML SAX events (there's a somewhat similar PdfText2HTML class in PDFBox, but that class produces a character stream instead of SAX events)
new c13e78d - fix for JDK 6 reliance introduced in TIKA-100 commit
new add1d56 - fix for TIKA-101 (contributed by Niall Pemberton)
new b6ba8b7 set svn:eol-style to native
new c839f7e - move bin.xml, src.xml to src/main/assembly to make mvn assembly:assembly work correctly (by default)
new 1bf1bed TIKA-91: Add proper attribution for code from textmining.org
new 91913d3 TIKA-102 - Parser implementations loading a large amount of content into a single String could be problematic - Patch by Niall Pemberton
new 0dd035b TIKA-102 - Parser implementations loading a large amount of content into a single String could be problematic - Forgot to include the new files in Niall Pemberton's patch
new fa6bb7e TIKA-107 - Remove use of assertions for argument checking - Committed patch from Niall Pemberton
new 2e18733 TIKA-104 - Add utility methods to throw IOException with the cause initialized - Added an IOException subclass instead - Adapted code from Niall Pemberton's patch
new 1f2b716 TIKA-106 - Remove dependency on Jakarta ORO - use JDK 1.4 Regex - Patch from Niall Pemberton
new ae5915a TIKA-105 - Excel parser implementation based on POI's Event API - New class contributed by Niall Pemberton
new a424f6c - prep for 0.1-incubating release
new 4ca79ab TIKA-110: Add KEYS file for Tika
new c1fec08 TIKA-111: Missing license headers - Added license headers where needed - Merged src/site/SITE-README.txt to README.txt - Added a HEADER.txt file with the standard header
new 1cb374a - add my gpg key to KEYS file in prep for release
new f722083 - prep for release
new 08e456a pom.xml: Updated trunk version to 0.2-SNAPSHOT
new 9c97684 - add download link for tika releases to website
new b04f706 - update site with news of first tika release (0.1-incubating)
new 67dbe57 - update CHANGES.txt to reflect new Tika dev version (0.2-incubating)
new 0da63b1 - Replace XMLParser by XMLParserUtils - Create Class DcXMLParser that extends XMLParserUtils and implements Parser. This class allows DublinCore metadata parsing - Add method setXMLParserNameSpaceContext() in XMLParserUtils. - Improvement of OpenOfficeParser to extract document content from office:body. - OpenOfficeParser extends XMLParserUtils - Modification to tika-config to use DcXMLParser instead of XMLParser
new ebe7868 TIKA-112 XMLParser improvement - Replace XMLParser by XMLParserUtils - Create Class DcXMLParser that extends XMLParserUtils and implements Parser. This class allows DublinCore metadata parsing - Add method setXMLParserNameSpaceContext() in XMLParserUtils. - Improvement of OpenOfficeParser to extract document content from office:body. - OpenOfficeParser extends XMLParserUtils - Modification to tika-config to use DcXMLParser instead of XMLParser
new 314a53b add license header
new 7f80b3f remove unused imports
new d795c5f TIKA-109: WordParser fails on some Word files - Applied WordParser patch from Dave Meikle - Removed the now unused WordTextPiece class
new 7fdba7e TIKA-105: Excel parser implementation based on POI's Event API - Replaced ExcelParser with ExcelEventParser - Use a setter for listenForAllRecords (JavaBean properties are more flexible than constructor arguments) - Use debug logging for all output - Removed some of the explicit log.isDebugEnabled() checks (simplicity over insignificant performance gains) - Inlined the trivial debug(Record) method
new 93411a7 TIKA-105: Excel parser implementation based on POI's Event API - Added a changelog entry for revisions 606141 and 613566.
new 4b33a0d TIKA-109: WordParser fails on some Word files - The patch was from Dave Meikle and not from Mats. I'm sorry for the mistake.
new 5e97d46 TIKA-116: Streaming parser for OpenDocument files - Streaming XPath implementation in o.a.tika.sax.xpath - New o.a.tika.sax utility classes - Streaming XML parser - Avoid closing the input stream while parsing XML - Streaming OpenDocument parser - Extract correct OpenDocument MIME type while parsing
new 88d8173 TIKA-117: Drop JDOM and Jaxen dependencies - Note the signature changes in TikaConfig constructors! - Dropped a few obsolete Utils methods
new 110abef TIKA-115: Tika package with all the dependencies - The bin assembly now contains all runtime dependencies - Reviewed the dependency licenses and updated the NOTICE and LICENSE files accordingly
new e448a95 TIKA-97: Tika GUI - Added a simple Swing GUI for Tika
new ed76d40 TIKA-97: Tika GUI - Dropped Java 6 methods
new 50dd486 TIKA-97: Tika GUI - Make the extracted text content scrollable
new 4b0936c TIKA-97: Tika GUI - Dropped another Java 6 dependency
new 5030b11 TIKA-116 - isolate test that uses accented chars, which currently fails
new c4d64f8 TIKA-116 - DcXMLParserTest.testXMLParserNonAsciiChars fixed
new b0ddfce TIKA-96: Tika CLI - Added the o.a.tika.cli.TikaCLI command line class - Initial features: + four output formats (xml, html, text, metadata) + three input sources (files, URLs, standard input) + two logging levels (info and debug) + usage message + GUI mode - Added simple Unix and DOS start scripts - Added required packaging and manifest settings
new 6774cd7 TIKA-118: Bouncy Castle binaries require US exports regulation compliance - Added export control information in the README
new 4ed2d4b TIKA-123: Structured MS Office parsing - Changed OfficeParser to allow structured parsing in subclasses - ExcelParser now outputs XHTML tables with nice tabs and line breaks - Dropped unused formatting code from ExcelParser (TODO fix that) - Streamlined PowerPointParser and started using Java 5 features - No functional changes (yet) in PowerPointParser - No functional changes (yet) in WordParser
new b2b79ce TIKA-123: Structured MS Office parsing - New utility methods in XHTMLContentHandler
new 8e3a5f5 TIKA-123: Structured MS Office parsing - Fixed incorrect test case
new 71aee51 TIKA-123: Structured MS Office parsing - Close the PowerPoint <p/> element properly
new b8bad51 TIKA-103: Excel parsing ignores cell formatting - Added test document contributed by Niall Pemberton
new e6fa719 TIKA-123: Structured MS Office parsing - Upgraded POI dependency to 3.0.2-FINAL and added poi-scratchpad
new 9398c06 TIKA-123: Structured MS Office parsing - Replaced custom PowerPoint parser with PowerPointExtractor from POI HSLF
new 1cd1b27 TIKA-123: Structured MS Office parsing - Replaced custom Word parsing code with WordExtractor from POI HWPF
new 80866ff TIKA-122: Use Commons IO 1.4 - Introduced Commons IO 1.4 dependency - Use the new dependency in the obvious places
new a738abc TIKA-123: Structured MS Office parsing - We no longer use the textmining.org code
new 7158881 TIKA-123: Structured MS Office parsing - Moved property file parsing to a separate Parser class
new 75fb47b TIKA-123: Structured MS Office parsing - Consolidated all MS Office parsing to a single class - Reliable MIME magic for pseudo type application/x-tika-msoffice - Added MIME magic for RTF
new 3a743ef TIKA-126: Add Parser.parse(InputStream, Metadata) for metadata extraction
new c26e7b3 TIKA-127: Add support for Visio files
new 5cb14de TIKA-129: node() support for the streaming XPath utility
new 68628ea TIKA-130: self-or-descendant axis does not match self in streaming XPath - Also added @Override annotations to SubtreeMatcher
new aa271ef TIKA-131: Lazy XHTML prefix generation
new c370c52 TIKA-128: HTML parser should produce XHTML SAX events
new 2907d8f TIKA-133: TeeContentHandler constructor should use varargs
new 3e05c07 TIKA-97: Tika GUI - New tabs for different views of the parser output - Improved drag-and-drop support - Improved error handling
new 49028e3 TIKA-132: Refactor Excel extractor to parse per sheet and add hyperlink support - Patch by Niall Pemberton
new f81d990 TIKA-132: Refactor Excel extractor to parse per sheet and add hyperlink support - Use a TreeMap instead of custom linked lists for the sparse matrix
new cdd4cf3 TIKA-132: Refactor Excel extractor to parse per sheet and add hyperlink support - Replace TikaExcelCell with a modular/extensible set of classes that encapsulate the functionality of rendering the cell content to XHTML
new a8c7b38 TIKA-132: Refactor Excel extractor to parse per sheet and add hyperlink support - Refactored processCellValue to a getCellValue factory method
new 474c19f TIKA-132: Refactor Excel extractor to parse per sheet and add hyperlink support - Added NumberCell for formatted numbers
new 9da6dd5 TIKA-132: Refactor Excel extractor to parse per sheet and add hyperlink support - Further refactoring to simplify cell value handling
new f6d4c07 TIKA-132: Refactor Excel extractor to parse per sheet and add hyperlink support - Merged the two sid case statements to one
new 9028d6f TIKA-132: Refactor Excel extractor to parse per sheet and add hyperlink support - Improved formatting of internalProcessRecord
new 6f20d1a TIKA-132: Refactor Excel extractor to parse per sheet and add hyperlink support - Improved exception handling, now all subsequent HSSF events are simply ignored
new f198ac1 TIKA-132: Refactor Excel extractor to parse per sheet and add hyperlink support - Removed the insideWorksheet flag - Improved javadocs - Extracted PointComparator to an explicit utility class
new 77e1d38 TIKA-97: Tika GUI - Simplify the HTML output for JEditorPane to better understand it
new 9624d5b Reformatted NOTICE to be less verbose
new 8740cb5 TIKA-132: Refactor Excel extractor to parse per sheet and add hyperlink support - The numbers are now correctly formatted thanks to the default NumberFormat being used instead of Double.toString() - Updated the test case accordingly, and added assertions to prevent regressions - TODO: Proper number formatting based on Excel formatting settings
new 4228d37 TIKA-123: Structured MS Office parsing - Commented out failing test case. - TODO: Improve getMimeType to better support MS Office files
new ddfe1a5 TIKA-123: Structured MS Office parsing - More failing MS Office detection test cases
new f45c4b7 TIKA-134: mvn package does not produce packages for bin/src - Based on a patch by Karl Heinz Marbaise
new 315b7e1 TIKA-138: Ignore HTML style and script content - Added a set of elements to discard, currently style and script
new 64e1e51 TIKA-113: Metadata (such as title) should not be part of content - Added BodyContentHandler that only processes XHTML body events - Added utility constructors for WriteOutContentHandler and BodyContentHandler - Updated test cases and related code to use BodyContentHandler where appropriate - Removed AppendableAdaptor class as it's not used anymore
new 759cf17 Replaced tabs with spaces in tika-mimetypes.xml
new 25150e0 TIKA-139: Add a composite parser
new a5c897e TIKA-87: MimeTypes should allow modification of MIME types TIKA-89: Rename MimeType and MimeTypes - Trying to decouple the MIME type registry from Tika configuration - Work in progress
new 7c83be9 TIKA-92: Image metadata extraction - Added a simple ImageParser based on ImageIO - Currently only supports custom "width" and "height" metadata fields - Included a few test images
new 068d81f Simplified log4j configuration for unit tests
new d368623 - fix for TIKA-142
new 284f644 TIKA-143: Add ParsingReader
new 83d6421 Modified svn:ignore to cover things like ".checkstyle". Also there is no longer need to ignore log files.
new fd62fb0 TIKA-115: Tika package with all the dependencies - Create a runnable standalone jar instead of a bin package
new 6d73404 TIKA-115: Tika package with all the dependencies - Shell scripts no longer needed as we have a runnable jar
new e0a48ab typo
new e3631d0 TIKA-118: Bouncy Castle binaries require US exports regulation compliance - Added a download page with an export notice
new 6c031eb TIKA-144: Upgrade nekohtml dependency - Upgraded to version 1.9.7 - This version is ALv2 and has no NOTICE file
new fa3b197 TIKA-145: Separate NOTICEs and LICENSEs for binary and source packages
new 73cffdb TIKA-146: Upgrade to POI 3.1 - Upgraded POI dependency
new 91d1077 TIKA-146: Upgrade to POI 3.1 - Enable Excel hyperlink support available in POI 3.1
new 73603f4 TIKA-54: Outlook msg parser - Patch by Dave Meikle - Test file by Rida Benjelloun
new c9f08c2 TIKA-99: Support external parser programs
new eb79973 TIKA-149: Parser for zip files
new 28c1369 TIKA-149: Parser for zip files
new 70ca6ce TIKA-149: Parser for zip files
new 9b506be TIKA-149: Parser for zip files
new 39ab46a TIKA-149: Parser for zip files
new e12b04d TIKA-149: Parser for zip files
new d1ab05c TIKA-149: Parser for zip files
new 424c708 TIKA-150: Parser for tar files
new 46f2fb8 TIKA-151: Stream compression support
new c22581f TIKA-156: Some MIME magic patterns are ignored by MimeTypes
new e889dfa TIKA-155: Java class file parser
new 19f04ad Removed debug prints from test cases.
new 2a15448 Disabled the spell checking performance test as there is no assertion to check.
new 6abd44b TIKA-155: Java class file parser
new 2a405d3 TIKA-155: Java class file parser
new b2a441f TIKA-108: New Tika logos
new 982c97b Documentation, first draft...
new bd23146 Add the documentation page (just one so far :-) to navigation.
new c2b41c4 Missing license header.
new 9ff3f2d TIKA-120: Add support for retrieving ID3 tags from MP3 files
new 72b945e TIKA-120: Add support for retrieving ID3 tags from MP3 files
new 8080fca TIKA-120: Add support for retrieving ID3 tags from MP3 files
new e97ee49 TIKA-120: Add support for retrieving ID3 tags from MP3 files
new e0c59b5 TIKA-120: Add support for retrieving ID3 tags from MP3 files
new ce4095f TIKA-120: Add support for retrieving ID3 tags from MP3 files
new 58401bc TIKA-120: Add support for retrieving ID3 tags from MP3 files
new 058e1fe TIKA-120: Add support for retrieving ID3 tags from MP3 files
new d2df838 TIKA-120: Add support for retrieving ID3 tags from MP3 files
new 79a5048 TIKA-54: Outlook msg parser
new df019d9 TIKA-54: Outlook msg parser
new 4d37376 TIKA-157: List all the document formats supported by Tika
new 892eebd TIKA-151: Stream compression support
new 8956153 TIKA-157: List all the document formats supported by Tika
new 0482529 TIKA-114: PDFParser : Getting content of the document using "writer.ToString ()" , some words are stuck together
new a766f5c TIKA-157: List all the document formats supported by Tika
new ea4bcc3 TIKA-157: List all the document formats supported by Tika
new 57260ff TIKA-157: List all the document formats supported by Tika
new 50eaa3b TIKA-157: List all the document formats supported by Tika
new 5f6e66f TIKA-108: New Tika logos
new f792637 TIKA-161: Enable PMD reports
new 7d8686c TIKA-126: Add Parser.parse(InputStream, Metadata) for metadata extraction
new ca427a1 TIKA-159 - Add support for parsing basic audio types: wav, aiff, au, midi
new 4937653 Removed the unused <distributionManagement/> entry
new 836c4b9 TIKA-140: HTML parser unable to extract text
new c6d2f09 TIKA-163: GUI does not support drag and drop in Gnome or KDE
new 032f752 Improved/extended documentation
new 871a44c Fix typo, merge paragraphs.
new d7fc922 Updated index page with documentation and download links in body text.
new 90fb111 Tested new committer status by adding my name.
new 9d5c3d8 TIKA-166: Updated HTMLParser to parse HTML meta tags into Metadata
new 3a7faf9 TIKA-166: Updated HTMLParser to parse HTML meta tags into Metadata
new 6e87254 Graduate Tika from Incubator to Lucene
new 669b0e8 TIKA-170: Graduate Tika
new 0dc4fa2 TIKA-170: Graduate Tika
new 8cb0cb8 Added code-signing key in preparation for a release
new b5ff62e TIKA-170: Graduate Tika
new 3a58cc2 Updated formats page to finish some todos on supported formats
new 1b9fd8c TIKA-170 - updated version number to reflect graduated status
new da3388c TIKA-170: Graduate Tika
new 40ba713 TIKA-170: Graduate Tika
new a094d5c Added some recent news.
new 37c987a Upgraded version number to 0.3-SNAPSHOT now that 0.2 is branched.
new 9c86a5d TIKA-175: Retrotranslate Tika for use in Java 1.4 environments
new 647fdac TIKA-176: Getting Started guide
new 8dfa33c TIKA-171: New ContentHandler for plain text output that has no problem with missing white space after XHTML block tags
new 1fb92c3 TIKA-164: Update nekohtml version
new 00253e0 TIKA-165: Update icu4j
new c1b35b7 TIKA-176: Getting Started guide
new 6adc68b Added missing license information on HTML, XML and SVG files
new 2b4b0f7 TIKA-177: Improved build instructions in README
new 3079789 Add missing svn:eol-style settings.
new 49d2f2b Add missing svn:eol-style settings.
new 031a1bd TIKA-176: Getting Started guide
new 27fdf2c TIKA-172: New Open Document Parser that emits structured XHTML content
new f969a26 TIKA-172: New Open Document Parser that emits structured XHTML content
new 3c15d9c TIKA-172: New Open Document Parser that emits structured XHTML content
new aab5571 Updates to CHANGES.txt to reflect re-creation of 0.2 release from trunk
new ed92e58 TIKA-170 : Updated mailing lists to reflect graduation
new a0886c3 TIKA-170 : Updated tika-user address
new aac081e TIKA-152: Support for Office XML files
new e43a19e TIKA-179: Tika stand alone CLI --text output mostly not working, other output formats are fine
new 1e81622 TIKA-181: Retrotranslator plugin fails if using a 1.0-SNAPSHOT version
new 8e8d2f8 Updated website, CHANGES.txt and README.txt for 0.2 release
new 2e937c5 TIKA-183: Fix Maven plugin versions
new 9506373 TIKA-184: Avoid the <resource/> entry on ${basedir}
new 717833a TIKA-180: XHTMLContentHandler unable to extract text from MSWord file
new 235e807 TIKA-180: XHTMLContentHandler unable to extract text from MSWord file
new aefe614 TIKA-180: XHTMLContentHandler unable to extract text from MSWord file
new a9fb614 TIKA-188: Automatic whitespace for block elements in XHTMLContentHandler
new e77e233 TIKA-185: XML files with (unsatisfied) SYSTEM entities can not be extracted
new aef10d5 CHANGES.txt: Added a higher level summary of some of the more notable changes in the upcoming 0.3 release.
new 312721c CHANGES.txt: Added credits for all people who show up in the 0.3 contribution report.
new 9b7b2af CHANGES.txt: Added a pointer to the contribution report.
new 8757a35 TIKA-154: Better detection of plain text versus binary formats with a text header
new a77cf07 TIKA-95: Pluggable magic header detectors
new 7d67ade TIKA-95: Pluggable magic header detectors
new 006831e TIKA-95: Pluggable magic header detectors
new 2e25dc0 TIKA-95: Pluggable magic header detectors
new 2a6d726 TIKA-95: Pluggable magic header detectors
new e11ead4 TIKA-95: Pluggable magic header detectors
new d806c4d TIKA-95: Pluggable magic header detectors
new 816eb26 TIKA-95: Pluggable magic header detectors
new 2699cf4 TIKA-190: wrong handling of ignorableWhitespace/characters in SafeContentHandler and WriteoutContentHandler
new dd352d2 TIKA-189: Text extraction from Excel files juxtaposes cells
new b6bcbd9 TIKA-95: Pluggable magic header detectors
new 59ec6ce TIKA-95: Pluggable magic header detectors
new de1a353 TIKA-95: Pluggable magic header detectors
new 32d2989 TIKA-95: Pluggable magic header detectors
new 073a36a TIKA-95: Pluggable magic header detectors
new e167f83 TIKA-95: Pluggable magic header detectors
new 6397cde TIKA-189: Text extraction from Excel files juxtaposes cells
new 8b7a1d4 TIKA-95: Pluggable magic header detectors
new 7789252 TIKA-95: Pluggable magic header detectors
new 16669e3 TIKA-95: Pluggable magic header detectors
new f807af0 TIKA-192: Add glob and magic patterns for image types
new ab6aec8 TIKA-192: Add glob and magic patterns for image types
new 741c575 TIKA-192: Add glob and magic patterns for image types
new 820841c TIKA-192: Add glob and magic patterns for image types
new de1c077 TIKA-192: Add glob and magic patterns for image types
new b8a131f TIKA-192: Add glob and magic patterns for image types
new 16f436c TIKA-192: Add glob and magic patterns for image types
new 418284d TIKA-192: Add glob and magic patterns for image types
new d966d35 TIKA-192: Add glob and magic patterns for image types
new a93976f TIKA-192: Add glob and magic patterns for image types
new 411c880 TIKA-192: Add glob and magic patterns for image types
new 4429727 TIKA-192: Add glob and magic patterns for image types
new d3a3286 TIKA-192: Add glob and magic patterns for image types
new 8dbb649 TIKA-192: Add glob and magic patterns for image types
new 0fa61dc TIKA-192: Add glob and magic patterns for image types
new ce6ac91 TIKA-196: Configuration parser fails in Java 1.4
new 9758d20 TIKA-199: Improved audio detection and parsing
new 215473a Reverted changes that were accidentally included in revision 741674 (TIKA-199).
new 922fb5f TIKA-199: Improved audio detection and parsing
new 2e85556 TIKA-199: Improved audio detection and parsing
new e678e32 TIKA-201: Extract lyrics and other text from MIDI audio files
new 2ba3c7e TIKA-201: Extract lyrics and other text from MIDI audio files
new 2a77cf3 TIKA-202: Warnings during Site generation
new f6f9d4f TIKA-197: Microsoft Outlook (msg) files get parsed multiple times
new b669279 add apachecon promo
new d9575ae TIKA-203: Earlier metadata extraction in ParsingReader
new 8a80ee5 TIKA-152: Support for Office XML files
new a981a84 TIKA-192: Add glob and magic patterns for image types
new 91146f5 Acknowledge more Tika 0.3 contributors
new 1588c38 TIKA-186: Refactor the MS Office property names to MSOffice.java
new 78ac637 TIKA-152: Support for Office XML files
new 3b01147 TIKA-152: Support for Office XML files
new 74df811 TIKA-152: Support for Office XML files
new 0688994 Updated year in copyright notices.
new cc9a94f - fix for TIKA-194
new 29c5b52 - fix for TIKA-205
new c5f0d09 - remove extraneous System.out
new 8c2c6bc - 0.3 RC version bump
new 59402a9 - reflect 0.3 RC (even though release date will change, will make final updates in branch)
new f322f4c TIKA-206: Improved pipe mode in Tika CLI
new 9b541f4 TIKA-200 Allow drag and drop of URLs in TikaGUI
new 3b4986c Updated trunk version to 0.4-SNAPSHOT
new b75cfe3 - update to 0.4 unreleased changes
new 972e57a apache tika 0.3 docs update
new 035d8d3 TIKA-211: memory issue in ExcelExtractor
new b7e960e TIKA-211: memory issue in ExcelExtractor
new 60c9b0e TIKA-210: html content directly under body node not parsed correctly
new d016d32 TIKA-208: Special characters in HTML file are not parsed correctly
new 23e3bae TIKA-217: TikaConfig fails when a parser can't be loaded due to an Error
new dd22d43 TIKA-216: Zip bomb prevention
new f29a52a TIKA-216: Zip bomb prevention
new 896aaaf TIKA-216: Zip bomb prevention
new b22616e TIKA-216: Zip bomb prevention
new 1c0109f TIKA-216: Zip bomb prevention
new 4b64d9a TIKA-216: Zip bomb prevention
new 3040a36 TIKA-215: Use a thread pool in ParsingReader
new 12ec68e TIKA-209: Language detection is weak
new 4ab12d7 Improved documentation about support for audio formats
new 8ab4ea1 Improved documentation formatting.
new 5ff52d6 TIKA-219: Split Tika to separate modules
new 378ccee TIKA-219: Split Tika to separate modules
new 9b7e6ea TIKA-219: Split Tika to separate modules
new 14f7707 TIKA-219: Split Tika to separate modules
new a2371ab TIKA-219: Split Tika to separate modules
new 1caa599 TIKA-219: Split Tika to separate modules
new 77abee6 TIKA-221: Drop log4j dependency from tika-core
new 049cace TIKA-222: Drop commons-codec dependency from tika-core
new cd088d3 TIKA-226 - Change to generate javadocs and source references for each module
new 6be2c4a - TIKA-227 Make MimeType JavaDoc match behaviour (Robert Burrell Donkin via mattmann)
new 8189f14 TIKA-220: Remove obsolete utility code
new 368e6fa TIKA-225: [PATCH] Various bugfixes for MIME detection
new e5d034d TIKA-225: [PATCH] Various bugfixes for MIME detection
new 27a100a TIKA-225: [PATCH] Various bugfixes for MIME detection
new e10b21c TIKA-230 : Addition of Parent POM File. Patch by Robert Burrell Donkin
new 199fd5f TIKA-230: Add the parent POM to the multimodule build
new ea35b47 TIKA-230: [PATCH] Parent pom
new d96821e TIKA-230: More POM cleanups.
new 985a604 Ignore generated and hidden files.
new fd995b9 TIKA-229: Per-component LICENSE and NOTICE files
new d9493ee TIKA-233: Inline the ICU4J charset detection logic
new 7151759 TIKA-228: Add OSGi metadata to Tika
new fc3c880 TIKA-228: Add OSGi metadata to Tika
new f86934f TIKA-228: Add OSGi metadata to Tika
new f9d22f8 TIKA-219: Split Tika to separate modules
new 1cd949d TIKA-228: Add OSGi metadata to Tika
new e5bf650 TIKA-204: Use commons-compress for parsing packages
new f6ead4d TIKA-233: Inline the ICU4J charset detection logic
new 206dc7a TIKA-198: Better distinction between IOException and TikaException
new 3864f42 TIKA-193: PDFParser adds mime-type twice
new 1991c32 TIKA-231: Difference between Web-Site and real working code
new 3864ebb TIKA-231: Difference between Web-Site and real working code
new 6aaa141 TIKA-237: Better distinction between SAXException and TikaException
new 26a8225 TIKA-237: Better distinction between SAXException and TikaException
new 3a6036e TIKA-87: MimeTypes should allow modification of MIME types
new fd4a584 TIKA-234: Drop SpellCheckedMetadata
new fa0955e TIKA-238: Better handling of delegating parser implementations
new bb1ed46 Replace a tab with spaces.
new 2fa2699 TIKA-238: Better handling of delegating parser implementations
new f670711 TIKA-235: Site search powered by Lucene/Solr
new fd0470f TIKA-225: [PATCH] Various bugfixes for MIME detection
new 3efc2cf TIKA-204: Use commons-compress for parsing packages
new a70b99f TIKA-204: Use commons-compress for parsing packages
new 96742ab TIKA-248: No logging in tika-core
new 87a45f9 TIKA-249: Inline key commons-io classes
new 2d893ab TIKA-247: parse language and category from MS Office properties
new 410d493 TIKA-148: The ExcelParsing should scan the cell comments
new 74947b6 TIKA-244: Missing Header/Footer text for Word'97 documents
new 226aaf4 TIKA-255: Embedded Visio Content Crashes PPT Parser
new 0236e01 TIKA-253: Better mime type for ooxml files
new cccb797 TIKA-254: parse ooxml templates and macro-enabled formats
new f04edc5 TIKA-240: Drop the BOM when extracting plain text
new 15c8343 TIKA-258: AutoDetectParser does not allow to use alternative mime detector
new acf76a6 - fix for TIKA-121 MimeType.clean method no longer exists as a capability
new f676cc0 - fix for TIKA-74 Test Resources should be loaded by the class loader (e.g. getResourceAsStream())
new 2be0726 - cleanup javadoc
new 3d8a58a TIKA-257: Uncorrect mime-type detection for ooxml
new 471655e TIKA-260: Weird transitive dependencies from commons-logging
new 7da2b34 - prep for release
new fc8bdb7 - prep for 0.4 RC (wow there are a lot of these poms now!)
new 401af9c Update version number to 0.5-SNAPSHOT
new 9b4ca8f Improved instructions for getting started with Tika.
new ace4928 TIKA-262: ParsingReader does not parse metadata for larger MS Office documents
new f4ebc4b - update site documentation to reflect release of 0.4
new 458a03d - prepare for next release
new 576101f - update to reflect 0.4 release
new 8c130ba TIKA-209: Language detection is weak.
new 29af90d TIKA-263: Core parser classes duplicated in the tika-parser and tika-core jar files.
new 62697af TIKA-209: Language detection is weak.
new 27e6e47 TIKA-209: Language detection is weak.
new 655afb5 TIKA-209: Language detection is weak.
new 0c8e965 TIKA-209: Language detection is weak.
new b4e1a57 TIKA-265: Web-Site http://lucene.apache.org/tika/gettingstarted.html does not correspond to current release
new 985e93d TIKA-264: Getting Started: change "source directory" to "base directory" or similar
new 5099b33 TIKA-250: XLS parser does not extract empty sheet names
new 3acc27c TIKA-266: Empty tika-core jar
new 290bbf8 Code style: Reindent at four spaces, remove unused access modifiers, inline singleton classes.
new 84587e3 TIKA-209: Language detection is weak
new 02519da - update web site news per grant's comment at: http://www.lucidimagination.com/search/document/e6e888e48060d38c/apachecon_promo
new a09f5af TIKA-268: HTMLParser omits necessary space-characters when parsing table-data
new 00ae4b3 TIKA-267: encrypted pdf files aren't handled properly
new d34e550 TIKA-217: secure-processing not supported by some JAXP implementations
new 4abeb2d TIKA-217: secure-processing not supported by some JAXP implementations
new 296e279 TIKA-274: CharsetDetector.setDeclaredEncoding has no effect
new 4307560 TIKA-273: Content encoding in HtmlParser
new 3922b01 TIKA-275: Parse context
new 41ca17f TIKA-275: Parse context
new f091ceb TIKA-275: Parse context
new d785334 TIKA-275: Parse context
new 1b6a563 TIKA-276: Drop the StringUtils class
new d75342b TIKA-269: Ease of use -facade for Tika
new a6cb65c TIKA-269: Ease of use -facade for Tika
new 06d4d21 TIKA-269: Ease of use -facade for Tika
new 0a60ac0 TIKA-269: Ease of use -facade for Tika
new 4acfd54 TIKA-275: Parse context
new 02b1eaa TIKA-277: Tika stand alone CLI --possibility to specify output encoding (--text)
new 934fd0e TIKA-277: Tika stand alone CLI --possibility to specify output encoding (--text)
new d634072 TIKA-158: Upgrade to Apache PDFBox
new c98e188 TIKA-158: Upgrade to Apache PDFBox
new c5abb68 TIKA-280: Fix NOTICE files to match consensus from legal team
new 231fac1 TIKA-281: Use repository.apache.org to deploy snapshots and releases
new 95071ec TIKA-281: Use repository.apache.org to deploy snapshots and releases
new ee13b1a TIKA-281: Use repository.apache.org to deploy snapshots and releases
new 4c04a49 TIKA-158: Upgrade to Apache PDFBox
new 43bb4c6 TIKA-283: XWPFWordExtractorDecorator does not extract links in tables
new a68d61c TIKA-283: XWPFWordExtractorDecorator does not extract links in tables
new 120c238 TIKA-275: Parse context
new 4130936 TIKA-275: Parse context
new f4f7f72 TIKA-285: Update media type registry to the latest httpd mime type database
new 8f90597 TIKA-285: Update media type registry to the latest httpd mime type database
new 659945d TIKA-285: Update media type registry to the latest httpd mime type database
new 2ad8d40 TIKA-285: Update media type registry to the latest httpd mime type database
new d77f5cf TIKA-285: Update media type registry to the latest httpd mime type database
new a1d21f4 TIKA-285: Update media type registry to the latest httpd mime type database
new 439f7ed TIKA-285: Update media type registry to the latest httpd mime type database
new cce94cb TIKA-285: Update media type registry to the latest httpd mime type database
new 0cc002f TIKA-285: Update media type registry to the latest httpd mime type database
new 6661e11 TIKA-285: Update media type registry to the latest httpd mime type database
new e99d737 TIKA-285: Update media type registry to the latest httpd mime type database
new 2ec2eb1 TIKA-285: Update media type registry to the latest httpd mime type database
new ead42a4 TIKA-285: Update media type registry to the latest httpd mime type database
new b19fe7e TIKA-285: Update media type registry to the latest httpd mime type database
new 710f9bd TIKA-285: Update media type registry to the latest httpd mime type database
new 5649aaa TIKA-275: Parse context
new 9cfb4e1 TIKA-269: Ease of use -facade for Tika
new da2fb0c TIKA-269: Ease of use -facade for Tika
new 7990a84 TIKA-269: Ease of use -facade for Tika
new e59923b TIKA-285: Update media type registry to the latest httpd mime type database
new d7a498f TIKA-285: Update media type registry to the latest httpd mime type database
new 7240438 TIKA-285: Update media type registry to the latest httpd mime type database
new d7b1952 TIKA-285: Update media type registry to the latest httpd mime type database
new 7b0b425 TIKA-291: Adobe InDesign support
new 18d0767 TIKA-281: Use repository.apache.org to deploy snapshots and releases
new b6ced30 TIKA-281: Use repository.apache.org to deploy snapshots and releases
new 6b9a82f TIKA-292: PDFBox is too verbose
new a459b77 TIKA-284: Upgrade to POI 3.5-FINAL
new dc95913 TIKA-299: Update Geronimo dependency in tika-parsers pom.xml to 1.0.1
new 26293c0 TIKA-297: The HtmlParser ignores <menu> tags, resulting in invalid XHTML
new a463b9f TIKA-297: The HtmlParser ignores <menu> tags, resulting in invalid XHTML
new c5038b8 TIKA-296: Automatically set the supertype for "+xml" mimetypes
new 1cc319f TIKA-294: TikaCLI always uses System.in for input
new c9dfb89 TIKA-290: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.txt.TXTParser@6caf16
new 0486ffb TIKA-256: MSWord parser does not extract footnotes and comments
new 0297f44 TIKA-279: XWPFWordExtractorDecorator does not extract some headers/footers
new 6c6c17d TIKA-293: XWPFWordExtractorDecorator does not extract bookmarks
new cb4e59d Add svn:eol-style
new 1e7a874 TIKA-295: Rough cut of mbox parser
new d509069 TIKA-310: Use TagSoup to parse HTML
new 99c2dd5 TIKA-311: Broken handling of <a name="..."/> tags
new 645850a TIKA-287: HtmlParser should resolve relative paths in <a href="xxx"> elements
new b42b4b8 TIKA-287: HtmlParser should resolve relative paths in <a href="xxx"> elements
new 732fadb TIKA-287: HtmlParser should resolve relative paths in <a href="xxx"> elements
new 987ec1d TIKA-309: Mime type application/rdf+xml not correctly detected
new 16924c3 TIKA-306: patch: OOXMLParserTest uses OpenOfficeParser
new 03be219 TIKA-305: XHTML href attributes end up in the wrong namespace
new 29c9a27 TIKA-304: HtmlParser could be easier to subclass
new 52ac68f TIKA-302: patch: initial support for ePUB
new e7524be TIKA-302: patch: initial support for ePUB
new 01bf6b5 TIKA-302: patch: initial support for ePUB
new e7f48bf TIKA-301: patch: embedded ODF and office:annotation
new e2049a1 TIKA-312: TikaCLI can't print metadata
new 5ab52bf TIKA-300: rename openoffice.. parser classes to odf..
new 7a3b506 TIKA-314: Initial support for JPEG EXIF metadata extraction
new bda2bb7 TIKA-314: Initial support for JPEG EXIF metadata extraction
new f11860e TIKA-209: Language detection is weak.
new 3476a72 TIKA-209: Language detection is weak.
new 7a6089c TIKA-209: Language detection is weak
new ef1cd4d Add change log entries for TIKA-209 and TIKA-275
new a9e8732 TIKA-269: Ease of use -facade for Tika
new 6c2d654 TIKA-320: Allow disabling language detection in AutoDetectParser
new 084dcb8 TIKA-320: Allow disabling language detection in AutoDetectParser
new 958c208 TIKA-209: Language detection is weak.
new 9458719 TIKA-319: HtmlParser - use encoding hint only if charset is supported
new dd1ddf9 TIKA-313: patch: ODF improvements for svg:desc, presentation notes
new 39469ba - fix for TIKA-309: Mime type application/rdf+xml not correctly detected
new ad11aac TIKA-275: Parse context
new 995d275 - remove duplicate glob: TIKA-309
new e5b3736 - increasing the offset to 4k bytes for an appearing <html tag seems to have fixed the unstable build issue introduced by TIKA-309
new bc54bef RE: TIKA-309, yes I can't count (4*1024 = 4096).
new 22c4ea3 - prep for release
new 2fbe85b - change back to SNAPSHOT: mvn release:prepare will take care of this
new 330bd83 test of command line commit (needed by mvn release:prepare)
new 6f0312f [maven-release-plugin] prepare release 0.5
new 62a539e [maven-release-plugin] prepare for next development iteration
new d56ccaa - make CHANGES.txt "release"-ified
new 4b815bc - undo the m2 release plugin's magic
new 6e420b2 [maven-release-plugin] prepare release 0.5
new 3c28408 [maven-release-plugin] prepare for next development iteration
new 98a96da TIKA-309: Mime type application/rdf+xml not correctly detected
new 246ab61 TIKA-309: Mime type application/rdf+xml not correctly detected
new 5d8b457 TIKA-321: Optimize type detection speed
new b93c5d7 TIKA-321: Optimize type detection speed
new efdda0a TIKA-321: Optimize type detection speed
new b4405b7 TIKA-326: Map javax.imageio.IIOException to TikaException
new 3ee9be7 TIKA-321: Optimize type detection speed
new ac74696 TIKA-324: Tika CLI mangles utf-8 content in text (-t) mode (on Mac OS X)
new 0118770 TIKA-325: tika-parent/pom.xml missing <inceptionYear>2007</inceptionYear>
new 7d5d6c7 TIKA-321: Optimize type detection speed
new 95163d2 - update for current development
new c47d81b TIKA-330: Better HWP (Hangul Word Processor) detection pattern
new 59d94cb TIKA-321: Optimize type detection speed
new 51c6242 - fix for TIKA-336 More issues with RDF mime detection
new 8cbebbf TIKA-321: Optimize type detection speed
new 6886317 TIKA-321: Optimize type detection speed
new c1f3579 TIKA-334: HtmlParser should use CharsetDetector whenever no charset is specified via meta http-equiv tag
new 61d64d2 TIKA-329: secure-processing not supported by some JAXP implementations (2)
new ce27fbd TIKA-340: Provide full Tika bundle
new 6a8a48c TIKA-340: Provide full Tika bundle
new 496afa9 TIKA-340: Provide full Tika bundle
new 43e945b TIKA-340: Provide full Tika bundle
new 4545895 TIKA-332: Use http-equiv meta tag charset info when processing HTML documents
new dbc2ef4 TIKA-334: HtmlParser should use CharsetDetector whenever no charset is specified via meta http-equiv tag
new afa3c1f TIKA-335: TXTParser should use incoming charset
new 2354fe6 TIKA-341: Use charset in CONTENT_TYPE metadata when detecting the character encoding
new 533dd67 Add a change log entry about the character encoding improvements.
new 0f10bd2 TIKA-345: Add application/vnd.wap.xhtml+xml to list of mimetypes handled by HtmlParser
new 8bd0ccb TIKA-347: Make HtmlParser customizable through ParseContext
new 2f0fa3b TIKA-343: some parsers produces glued words
new 7ed838c TIKA-343: some parsers produces glued words
new ebe7995 TIKA-343: some parsers produces glued words
new 5df8f87 TIKA-343: some parsers produces glued words
new d569761 TIKA-343: some parsers produces glued words
new 16db668 TIKA-339: HtmlParser & TXTParser should not use language returned by CharsetDetector if language hint has been provided
new 43c83e7 TIKA-328: Add parser for .flv videos
new e3c68b4 TIKA-328: Add parser for .flv videos
new 1ea1cf0 TIKA-342: Improve OSGi bundling
new bf856a4 TIKA-125: Pass Locale information to parsers
new cacb6b8 TIKA-282: RTF parser expects a GUI environment
new cee95ed TIKA-349: HtmlParser's http-equiv code needs to be more flexible
new 3f27155 TIKA-350: HtmlParser's content-type handling code needs to be more flexible
new 8d09544 TIKA-351: MediaType.parse should be more forgiving of broken input
new 0678c44 TIKA-352: Use MediaType.parse when extracting charset from content-type metadata in parsers
new 9f406c3 TIKA-352: Use MediaType.parse when extracting charset from content-type metadata in parsers
new 4c09850 TIKA-352: Use MediaType.parse when extracting charset from content-type metadata in parsers
new dfb6447 TIKA-353: Upgrade to POI 3.6
new 0b6a7cd Update change log, minor readme improvement
new 7bb3530 TIKA-347: Make HtmlParser customizable through ParseContext
new 312ec4a Added my info to project team in pom.xml
new 895590e TIKA-103: Addition of POI supported number/date formatting handling within ExcelParser
new e543c5b TIKA-103: Addition of POI supported number/date formatting handling within XSSFExcelExtractorDecorator
new 5bffa84 TIKA-103: Corrected XSSFExcelExtractorDecorator to use document style table.
new a5e9584 TIKA-103: Corrected XSSFExcelExtractorDecorator to use correct style index.
new e207c87 TIKA-103: Updated CHANGES.txt with details of new features.
new 7ece6ff - fix for TIKA-327
new 8d76fd9 - fix for TIKA-366 Increase buffer size for mime type sniffing
new 28327a8 - fix for TIKA-367 Mime type rootXML equality improvement
new cecaa12 - fix for TIKA-357: Increase buffer size for meta tag sniffing. Patch contributed by Ken Krugler.
new 6f85dd5 - prep for release
new f20cdb9 - include contributors (always forget to do this on the first try!)
new 72f2665 [maven-release-plugin] prepare release 0.6
new dfcc01d [maven-release-plugin] prepare for next development iteration
new 2cd366d - bump CHANGES.txt
new 093a62b TIKA-317: Excel formatting depends on the default locale
new 8602783 TIKA-368: ID3v2 support for mp3 parser
new fa286ee TIKA-365: Extract more OpenDocument metadata
new 4a8ca9c TIKA-362: Add publisher support
new 9544c3a TIKA-364: [PATCH] Metadata mark for xlsx documents with protected sheets
new 7c51e89 TIKA-372: Channel and SampleRate information for MP3 files
new 96a15bc TIKA-141: Mime Content Type detection of a web document from its URL.
new 4437342 TIKA-141: Mime Content Type detection of a web document from its URL.
new 57cfc63 TIKA-374: AutoDetectParser not thread-safe
new 0fa888e TIKA-199: Improved audio detection and parsing
new 3ad79a6 TIKA-375: Improve code quality metrics
new 4ae3021 TIKA-278: Move Tika site sources outside trunk
new 530149b TIKA-278: Move Tika site sources outside trunk
new a4013fa - fix for TIKA-376 Typo in parse-rtf spec in tika-config.xml
new 6a7f39d TIKA-377: Error parsing HTML partial with AutoDetect parser
new bfc53af TIKA-377: Error parsing HTML partial with AutoDetect parser
new 70bfbe8 TIKA-380: Upgrade to PDFBox 1.0.0
new 101a5c3 TIKA-370: Tika pom.xml is missing dependencies on bouncycastle jars needed by PDFBox
new f721fac TIKA-370: Tika pom.xml is missing dependencies on bouncycastle jars needed by PDFBox
new 33171a5 TIKA-317: Annotation-based Tika configuration
new 3c807da TIKA-317: Annotation-based Tika configuration
new c9d44db TIKA-378: TikaConfig should notify users if it cannot initialize some parser
new 090001d TIKA-378: TikaConfig should notify users if it cannot initialize some parser
new e64cad2 TIKA-378: TikaConfig should notify users if it cannot initialize some parser
new 1cad9b3 TIKA-378: TikaConfig should notify users if it cannot initialize some parser
new 3caba57 TIKA-378: TikaConfig should notify users if it cannot initialize some parser
new 6e4a971 TIKA-378: TikaConfig should notify users if it cannot initialize some parser
new 6ef60c2 TIKA-378: TikaConfig should notify users if it cannot initialize some parser
new 80c39e9 TIKA-317: Service provider -based Tika configuration
new fad2211 TIKA-382: No textextraction in tika-app
new 9a049b4 TIKA-386: Tika relies on X11
new 567841a - add Ken to the committers list
new 55fb7c9 TIKA-388: Don't trust streams that claim mark support
new 4400e3a TIKA-261: Ability to limit the amount of extracted text
new 9c7c2c2 TIKA-282: RTF parser expects a GUI environment
new 29ab691 TIKA-282: RTF parser expects a GUI environment
new bbd3cd3 TIKA-392: RTF parser smashes words together in subsequent table cells
new 939fa08 Updated KEYS to include my new RSA code-signing key
new cca6e63 TIKA-395: Update to allow OutlookParser to support new format Outlook messages.
new f60ce11 TIKA-393: Upgrade to PDFBOX 1.1.0
new f66c262 - prep for release 0.7
new 8de4115 - consistency with sentence endings
new aa01b10 [maven-release-plugin] prepare release 0.7
new 27dc9b5 [maven-release-plugin] prepare for next development iteration
new c7b924c - prep for 0.8 development - incorporate comment from gsingers RE: mentioning the major versions of libraries used in tika-parser
new 229b78f - patch for TIKA-398 TestParsers fails when classpathh contains special characters like spaces (Uwe Schindler via mattmann)
new 14a229e - basic support for netCDF parsing, as specified in TIKA-400 netCDF Tika Parser. Can extend more later, but enough support right now to commit. Includes basic unit tests.
new d06aed2 TIKA-396: Parse Attachement included within Outlook Message.
new 648b5bf TIKA-404: Media-type handling depends on the locale
new a918dd6 TIKA-404: Media-type handling depends on the locale
new 550a2dc TIKA-92: Image metadata extraction
new 6464286 TIKA-92: Image metadata extraction
new afc0523 TIKA-396: Parser Attachements from Outlook Messages
new 9f10da2 TIKA-379: Html elements and attributes not available in XHTML representation
new ff494ed TIKA-403: Refactor log library usage in tika-parsers
new baafe0e TIKA-403: Refactor log library usage in tika-parsers
new 1f3c0a6 Use spaces instead of tabs for indentation
new e8c71a7 TIKA-403: Refactor log library usage in tika-parsers
new c25430c TIKA-400: netCDF Tika Parser
new 5ec7c51 TIKA-409: Missing poi-ooxml-schemas-3.6.jar in tika-bundle
new bdf52f7 TIKA-153: Allow passing of files or memory buffers to parsers
new 8658485 TIKA-412: Exclude the xml-apis dependency
new f962509 TIKA-153: Allow passing of files or memory buffers to parsers
new cecdcea TIKA-153: Allow passing of files or memory buffers to parsers
new 3af35a9 TIKA-153: Allow passing of files or memory buffers to parsers
new 571fc14 TIKA-153: Allow passing of files or memory buffers to parsers
new 9d31098 TIKA-404: Media-type handling depends on the locale
new 7736aaa TIKA-153: Allow passing of files or memory buffers to parsers
new af2aabb TIKA-298: CompositeParser.getParser() should use mimetype hierarchy when falling back
new 807b024 TIKA-298: CompositeParser.getParser() should use mimetype hierarchy when falling back
new 8a5cfcb TIKA-89: Rename MimeType and MimeTypes
new ccd0340 TIKA-89: Rename MimeType and MimeTypes
new c760e5d TIKA-419: Allow parser lookup from a custom class loader
new ae52cb8 tika now a tlp, moved svn
new 6f758f8 TIKA-415: Findbugs: XHTMLDowngradeHandler equals() comparing different types
new c0a9227 TIKA-417: Unable to parse the content for UCS2 Litte Endian encoded file
new a576fb3 TIKA-153: Allow passing of files or memory buffers to parsers
new 64ed199 TIKA-402: Support for Keynote and Pages documents
new 86735b2 TIKA-416: Out-of-process text extraction
new 2135910 TIKA-400: netCDF Tika Parser
new a2d2c1a - fix for TIKA-432 Include NOTICE and LICENSE file updates for NCAR NetCDF parser lib
new 67748e3 TIKA-416: Out-of-process text extraction
new b122281 TIKA-425: Exception parsing mp3
new b2b3685 TIKA-418: RuntimeException while getting content for ppsx, ppsm, pptm, thmx and xps file types
new 890bcb5 TIKA-424: Avoid ArrayIndexOutOfBoundsException on some mp3 files
new f1f45a7 TIKA-413: DWG Parser
new 19439b7 TIKA-413: DWG Parser
new d4dd400 TIKA-413: DWG Parser
new 8451402 TIKA-402: Support for Keynote and Pages documents
new dd6330f - fix for TIKA-379 Html elements and attributes not available in XHTML representation
new f4c4123 - consistency, Chris, consistency.
new eb7aed1 TIKA-402: Support for Keynote and Pages documents
new a2d3c00 TIKA-402: Support for Keynote and Pages documents
new a4a6c02 TIKA-402: Support for Keynote and Pages documents
new d533a2d TIKA-402: Support for iWork documents
new d0f703f TIKA-269: Ease of use -facade for Tika
new d38860c - test SVN auth
new 9f5ae40 - revert SVN auth test
new ead95c8 TIKA-95: Pluggable magic header detectors
new 97549f6 TIKA-441: Sometimes, tika not working (crashed) because of null classloader
new 2846c74 TIKA-440: [Patch] Fetch the composer information in the MP3 Parser
new c1fe863 TIKA-439: DWGParser (and some others) not used by AutoDetectParser
new 8d14ba6 TIKA-308: Improve supertype handling in type registry
new 0bda933 TIKA-308: Improve supertype handling in type registry
new 72cc380 TIKA-308: Improve supertype handling in type registry
new 414c592 TIKA-298: CompositeParser.getParser() should use mimetype hierarchy when falling back
new f809f45 TIKA-89: Rename MimeType and MimeTypes
new c705865 TIKA-89: Rename MimeType and MimeTypes
new 157fcc9 TIKA-308: Improve supertype handling in type registry
new 2b73f6b TIKA-442: Image extractors use inconsistent metadata keys and formats for common features
new 11cb267 - fix for TIKA-444 Tika sites refers to incorrect svn repo URL
new af37f82 Add myself to the committers list, and remove Ken Krugler's duplicate entry
new ed8497d Upgrade to POI 3.7 beta 1 (TIKA-373). Includes patch from TIKA-361 to update the Outlook parser to match the new HSMF API + update to the MBox parser to capture equivalent metadata
new 85a8a76 Apply patch from Maxim Valyanskiy from TIKA-437 - support encrypted OOXML office files which use the default password.
new c02b152 Use the new TIFF Metadata entries for image width/length/sampling from the TIFF, JPEG and general Image (ImageIO) parsers. Gives a small number of consistent image related metadata entries across all formats. (TIKA-442)
new 1197065 Apply Jukka's patch from TIKA-371 - now we're on POI 3.7 beta 1, do the locale handling in unit tests better
new 189557f MP3 Lyrics text extraction support. Updates the MP3 parser to detect a LyricsV3 block before the ID3v1 tags block. If found, the lyrics text will be captured and output.
new b471e61 Unit test to show that we support pptx, pptm, ppsx and ppsm (TIKA-418). .thmx will need a POI upgrade, but the file format lacks any text! .xps is still unsupported by POI
new 9677c50 Add geographic metadata namespace (TIKA-445)
new 59d2e57 Enable extraction of longitude and latitude from JPEG/Tiff files (via the EXIF tags), and HTML (via the ICBM meta tag), to the new geographic metadata namespace
new b8223f9 TIKA-371: Excel formatting depends on the default locale
new a4322a1 TIKA-446: Upgrade to PDFBox 1.2.0
new 423eabd Fix for TIKA-449 (Update parsers to extract geographic metadata) from Jukka to ensure that the lats and longs are correctly formatted in all locales
new 61751d2 TIKA-452 - Extract custom pdf metadata
new ccf1216 Test for TIKA-452 - Extract custom pdf metadata
new 2f78035 TIKA-442 follow-on - map another Exif/JPEG tag (comment) onto a standard tika metadata key
new 3cebb67 TIKA-454 Illegal Charset Name crashes HTMLParser
new 007f2d7 TIKA-402: Support for iWork documents
new 76afce1 TIKA-375: Improve code quality metrics
new 035dab3 TIKA-292: PDFBox is too verbose
new dce07df TIKA-292: PDFBox is too verbose
new 66882d3 TIKA-402: Support for iWork documents
new a87d3e6 TIKA-402: Support for iWork documents
new ca465e3 TIKA-402: Support for iWork documents
new 9ba3b58 TIKA-459: Improve handling for invalid charset names.
new ed34e70 TIKA-453: Fix Estonian language identifier.
new 7bee75a TIKA-446: Upgrade to PDFBox 1.2.1
new 74a432d TIKA-420: Integration of Boilerpipe.
new 4658fd7 TIKA-451 - Inconsistent date format for Metadata.CREATION_DATE and Metadata.LAST_MODIFIED. Make CREATION_DATE and LAST_MODIFIED Date property instances, and add support for getting and setting Dates (+getting ints), as discussed in TIKA-451. Unit tests for getting and settings ints and dates are included. Work to update the existing parsers to make use of the new Date setter is still outstanding
new 18f82c7 When building ISO8601 dates, ensure we're always working in UTC (for TIKA-451)
new 11bbc01 Update parsers to fix problems with new style Date properties, for TIKA-451
new 6cd30a3 Update parsers to fix problems with new style Date properties, for TIKA-451 (file was missed from last commit)
new 8497182 The bundle needs to include boilerplate, as it's a required dependency
new ad22596 Accept a wider range of ISO8601 date formats when turning a Property from a String into a Date, for parsers which do set(Property,String) - for TIKA-451
new efdc32a - update index doc to include 0.7 skeleton
new ab5353a - more skeleton
new 16fe44c - update for Tika 0.8
new 28eafc8 - fix for TIKA-464 Contribute a "get Tika parsing up and running in 5 minutes" quick start guide
new 85aa4d2 - fix for TIKA-466 Feed Parser contributed by jnioche
new a5aff98 Add the new rome dependency to the bundle (TIKA-466)
new 5da4b6a TIKA-470 - New tika-app option to list the supported parsers, and their mime types, via options of --list-parsers and --list-parser-details
new 3ca5da9 TIKA-447 - Container aware mimetype detection. Initial implementation of container aware detection. New ContainerAwareDetector class, which is a Detector, which will open and handle OLE2 and Zip files to detect the mimetype, falling back on a specified default detector for non-container formats. Some work remains - Not all Zip file based things are detected yet, and the Zip based parsers don't yet take advantage of the already open zip stream. (OLE2 ones can)
new 5fcd987 Add Office Open XML (OOXML) support to the Zip container aware detector (TIKA-447). If an OOXML zip file entry is found, passes this to POI and fetches the content type through that. Also updates the OOXML text extractor to take advantage of the open package if detection was already done.
new 852dbed Container aware detection for Jars, and add stub TODOs for iWork files (TIKA-447)
new bc5f04e Fix 1.6ism in recent TIKA-447 commit
new 21b54bf Slightly improve OLE2 file type matches, for cases where the OLE2 properties stream is in one of the first couple of blocks in the file. Add a note about using the ContainerAwareDetector for better results. (TIKA-447)
new 6f48f57 Make mime type detection a little bit more stable (TIKA-391) Make the comparison operator work better on Magic types, and ensure that the type is present on the magic to help debugging and sorting. Also add tests to show that we can detect the same file multiple times, and get the same answer each time.
new 66077e1 Apply patch from TIKA-472 - Extract JPEG title, description and author Also fix a few indents to follow tika standard of space not tab
new f49c4bd Excel parsing improvements for files with charts (TIKA-214) Support chart based sheets, outputting chart labels, not over-writing sheet entries with chart ones, and outputting extra sheet text inside the sheet but outside the table. Also adds unit test based on file from TIKA-214, along with a few toString() methods to aid with debugging
new 9fe414c TIKA-358: Auto-detection of HTML fails with common auto-generated template
new 3ab1096 Don't break on MP3 files where the ID3v2.4 tags are broken, and lie about their size (TIKA-424) (The unit test for this will not normally be run, unless you explicitly download the sample file, as we can't re-distribute it as part of Tika)
new 871bea9 The id3v2.4 spec doc has a bug - the layout section says 4*size to bytes, but the description is just size=bytes. Switch to the latter, which is what the other programs use, and add a unit test based on a mid3v2 generated file. (TIKA-424)
new c0527e0 TIKA-447: Container aware mimetype detection
new 955611f TIKA-473: Prepare Tika site for svnpubsub
new 1b3de63 TIKA-474 - Do what we can with MP3 files where the ID3 header is truncated
new 8fa36e1 - fix for TIKA-476 Add page count to metadata
new a1cd688 TIKA-89: Rename MimeType and MimeTypes
new c730e0d TIKA-475: MBoxParser class (inside tika-parsers-0.7-jdk14.jar) calls method getSimpleName from java.lang.Class (jdk1.5)
new 458753c TIKA-476: Add page count to metadata
new 40de6ed TIKA-476: Add page count to metadata
new f9e1ddd TIKA-468: Missing Slide-Count metadata for PPT files
new 621950e TIKA-477: Add GUI support for Boilerpipe, and improve output from Boilerpipe content handler.
new f8a3d0b TIKA-477: Also commit fix for BP test
new 486281b TIKA-478: Fix handling of <head> elements in HTML parser, and improve robustness of XHTMLContentHandler.
new a1c0503 TIKA-478: Fix up missing end </body> and </html> tags for document with no real content.
new 31345be TIKA-463: emit <img> tags with resolved URLs for src attribute.
new 163d93c TIKA-457: Fix frameset handling (both general, and for broken HTML)
new 2a21313 - fix for TIKA-479 Post link to Tika in Action on Tika website
new 28410f6 - docs for TIKA-447 should be part of trunk site build too, so that when 0.8 (and beyond) docs go up, detection is included.
new 52664ff TIKA-460 : A elements never reached with IdentityMapper
new e347b98 TIKA-463 & TIKA-457: Fixed issue with emitting <meta> elements that had null content values (these are valid in Metadata, but not in <meta>).
new fed8c46 TIKA-480: Don't emit empty attributes, and include full set of standard HTML elements around <p>...</p> output.
new 4f6a589 TIKA-481: Resolve href in <link> element.
new ed7a1d1 TIKA-463: Get all URLs (resolved) from XHTML-valid elements.
new 98eb0b0 TIKA-483: Empty file detected as text/plain
new 2e6dc43 TIKA-401: Tika hangs on corrupt zip files
new d32c1d8 TIKA-495: Metadata constructor is slow
new fc3a086 TIKA-487 - ContainerAwareDetector fallback support for truncated zip files
new 97cd04c When building tika-app, embed de.* classes too, along with org and com, as boilerpipe uses that namespace (TIKA-420)
new 31452f9 - fix for TIKA-498 HTML parser fails on turkish locale
new 05acb51 TIKA-501: Remove ICU-based language detection from plain text parser (TXTParser)
new f72c1fb - fix for TIKA-488 Add alternative search provider on site
new f878d8e Add missing svn:eol-style settings
new 61f707f TIKA-503: Add a ContentHandler for collecting links from parser output
new c1f7fd7 Add missing svn:eol-style settings
new bc3fd07 Add missing svn:eol-style settings
new 9a2bef7 Apply Staffan Olsson's patch from TIKA-482 (with a few tweaks), which improves how EXIF metadata is processed from TIFF and JPEG files, and moves more of the Date properties to be real ISO8601 dates internally.
new 729bca5 Add several more common EXIF tags to the TIFF metadata namespace, and have the EXIF parser also output property-typed tags for these (TIKA-504)
new 806ccd1 Add TIFF/Exif Flash property and support (TIKA-504)
new 86f9f62 TIKA-416: Out-of-process text extraction
new 7012400 Add basic extension based detection of corel formats, and works, along with the submitted sample files (TIKA-486)
new 4bf74e9 Add support to the ContainerAwareDetector for Corel OLE2 formats, and Microsoft Works (TIKA-486). Also slightly refactor the child container detectors, so we can do common fallback logic when the container detector can't figure it out
new e33c7d5 Apply (with slight tweaks) Antoni Mylka's container aware detector patch for truncated OLE2 documents - TIKA-485
new acc172c TIKA-153: Allow passing of files or memory buffers to parsers
new 51e72bb TIKA-153: Allow passing of files or memory buffers to parsers
new c0b380d TIKA-153: Allow passing of files or memory buffers to parsers
new 30d9660 Add various test office files which have images and other office files embedded in them. (Will be used for unit tests for TIKA-509)
new d2877b7 Initial work on Container Extractors (TIKA-509) Basics of the interfaces and key classes are included, along with a partial POIFS extractor implementation.
new d98bc51 TIKA-507: Parser for font files
new a20e0c1 Add missing svn:eol-style setting. Looks like I need to fix the auto-prop configuration on my laptop...
new 8dd9c86 Refactor how container extraction works - Jukka's patch from TIKA-509 Replace the AutoContainerExtractor with ParserContainerExtractor, and push more of the work to the Parsers
new c3859d2 Support for container extraction of Images in .xls, and OOXML files embedded in OLE2 documents (TIKA-509). Also rename ContainerEmbededResourceHandler to EmbededResourceHandler as suggested by Jukka, fix ParserContainerExtractor recursion, and remove ContainerExtractor from TikaConfig now we have ParserContainerExtractor.
new c6505ef Add support for extracting images embedded in Word .doc files - TIKA-509
new 14be8fe Have the ooxml container aware detector use the file, not the input stream, as it's more efficient (TIKA-447)
new 0c40fd5 We don't need to wrap our stream in a BufferedInputStream for mark/reset to work if it is already one (identified in TIKA-509 work)
new ca80cad OOXML support for embedded resource extraction for .docx and .xlsx. (TIKA-509) Also fix a spelling mistake in a class name + comments
new 6d9cf87 TIKA-509: Container contents extraction
new 2d822bc Make the emf/wmf mimetypes returned for the OLE2 office files match that stored in the OOXML files, as well as refactoring the container tests to reduce duplication (TIKA-509)
new 274acac TIKA-509: Container contents extraction
new 90f60e3 - fix for TIKA-512 Print the supported Metadata models and their associated met keys in tika-app
new cdd1aa5 More Office embedded resource extraction support (TIKA-509). Existing Outlook code has been updated to the new style, and tests added. XSLF .pptx support has been added with tests. POI version bumped from 3.7 beta 1 to 3.7 beta 2, as required for better Outlook attachment support
new 3864c2e Container extraction tests for package based parsers (TIKA-509)
new 3620143 Tidy up OOXML unit tests by removing TODOs, and make the sample word document contain a bit more so we can later improve the unit test (TIKA-506)
new af99a31 TIKA-514: Provide constructor for AutoDetectParser that has explicit list of supported parsers
new bb7ab51 Enable word6 / word95 support via the new POI Word6Extractor class (TIKA-408)
new 01a1067 Move the AutoDetectParser tests from TIKA-514 into the existing AutoDetectParserTest class
new 488e132 Make our sample .doc file more complex, to match the sample .docx, so our tests for TIKA-506 will have more to work on
new d0cd679 Apply patch from TIKA-506 - Improve the html generated from .doc and .docx to include more things This patch includes an upgrade to POI 3.7 beta 3 For .docx, we now return headers where appropriate, tables, hyperlinks, non standard styles as classes, and images in the correct place For .doc, we also do headers, hyperlinks and non standard styles. Tables only work for 1st level ones, nested tables just come out as paragraphs for now (Lists are not yet supported in either [...]
new 1bd3087 Add missing table close tags for .doc (TIKA-506)
new c442efe From suggestion in TIKA-506, make Word paragraphs formatted in the style of "HTML Preformatted" use pre tags
new f9495f1 When processing .doc files, handle matching the embedded images to the character runs better. Somewhat copes with \u0001 real images vs \u0008 floating escher images, and is as good as we can probably get given the current Microsoft docs... Avoids NullPointers though! (TIKA-506)
new 2d396c8 Not all XWPFParagraphs have the root document, so check for this to avoid a NPE (TIKA-506)
new 13bd4cd TIKA-519 - Display embedded images in the GUI Formatted Text pane where they occur in the document. Applies updated patch from TIKA-519 as discussed
new 0abbfee TIKA-520 - Apply patch from Sjoerd Smeets for DWG files that lack a header, which avoids ArrayIndexOutOfBoundException
new c6ab1a9 TIKA-383: new option for TIKA CLI to get only the languages of a document
new 1eee081 TIKA-383: new option for TIKA CLI to get only the languages of a document
new 29972fd TIKA-426: Parsing javascript as XML
new fbc7bac TIKA-426: Parsing javascript as XML
new ba18170 TIKA-411: Generate list of supported and detected types automatically
new dd178c1 TIKA-411: Generate list of supported and detected types automatically
new 6d7804f TIKA-527: Allow override mapping mime<-->parsers through config
new 36cc24e Remove extra System.out prints from a test case, clean whitespace
new d4e7bfd TIKA-528: Reuse TagSoup HtmlSchema instance across HtmlParsers (performance improvement)
new 1a564ef TIKA-533: Mis-detection of zip-within-zip as application/vnd.apple.iwork, with no output by CLI app
new d60651e Add iWork support to the Container Aware Detector (TIKA-533) It's a bit icky for now, but it works and it's quick...
new befb3db Add --container-aware-detector option to the Tika CLI, which will switch the detector used by the auto parser
new 3e08e27 TIKA-535: Implement Apache project branding requirements
new 8a5f288 TIKA-394: Missing spaces on html parsing
new 1c04a00 TIKA-503: Add a ContentHandler for collecting links from parser output
new 024bcc4 TIKA-503: Add a ContentHandler for collecting links from parser output
new e947211 - progress towards TIKA-407 Push NetCDF4 lib dependency to Maven Central and Update Tika POM: upgrade tika-parsers to depend on eventual Maven Central group/artifactId. Also temporarily change M2-forge to Sonatype OSS (will remove when Central sync is loaded)
new aacb3b4 - progress towards TIKA-407 Push NetCDF4 lib dependency to Maven Central and Update Tika POM: netcdf is now available from Maven Central: see http://repo1.maven.org/maven2/edu/ucar/netcdf/4.2/
new c496ffa - fix for TIKA-515 MimeType.getDescription() often returns nothing when "tika-mimetypes.xml" has a useful description already available.
new b3c8dc8 - fix for TIKA-399 HDF4/5 Tika Parser
new 7153f6b - fix for TIKA-399 HDF4/5 Tika Parser
new 92188c4 TIKA-446: Upgrade to PDFBox 1.3.1
new f3f1f15 - fix for TIKA-399 HDF4/5 Tika Parser
new 83a6efa - fix for TIKA-490 Support for adding language profiles dynamically
new 195f1d6 - suggestion by jukkaz: make sure we're not using JDK5, before we run the NetCDF and HDF tests
new ab41dbd - fix for TIKA-399 HDF4/5 Tika Parser: add he5 file extension for application/x-hdf
new 4b5ec9a - don't put another application/ in front of the existing application/x-hdf and application/x-netcdf
new 3cbfe69 - update export dependency; artifact now named netcdf
new 5a02577 TIKA-373: Upgrade to POI 3.7
new 2cdb437 - add NetCDF classes to tika-app bundle: TIKA-400
new d4a4014 TIKA-462: Get Boilerpipe into Maven.
new 8ee313a Fix build problem caused by removing the java.net Maven repository.
new 2326440 XWPFWordExtractorDecorator: extract text from footnotes
new e35b48c TIKA-462: Remove java.net repository from parser pom
new 6789916 TIKA-543: Remove rome 1.0 dependency on java.net repository
new 4d5eb38 - fix for TIKA-537 Command line option --list-parsers should list 2nd level parsers below CompositeParsers
new 86e32d9 - fix for TIKA-523 Add application/ms-tnef as alias to application/vnd.ms-tnef
new 4162253 - prep for 0.8 RC
new 26e79a5 [maven-release-plugin] prepare release 0.8
new 191a3f5 [maven-release-plugin] prepare for next development iteration
new 7e2de56 TIKA-510: Use POI usermodel API for text extraction from XSLF shape
new 453ae26 TIKA-511: NPE when POI is configured to prefer event extractors
new 3bc4231 add unit-test on parsing write-protected xlsx
new 7cb1b4a PackageExtractor: javadoc fix
new 31e60d9 Improved extraction of EXIF and IPTC metadata from JPEG and TIFF Images (TIKA-482) (Applies patch from Staffan Olsson from TIKA-482)
new cfe575a Missing new directory from previous TIKA-482 patch commit
new 1585135 Extract interface for EmbeddedDocumentExtractor
new 67695eb Extract embedded Ole10Native files from POIFS
new 7413cfd TIKA-549: support for extracting OLE-shapes from PPT
new bbd3acf TIKA-549: support for extracting OLE-shapes from PPT
new 9dbebc4 TIKA-550 - Add stable filenames for extracted embedded files from Office binaries
new 21c35f7 TIKA-552 - Handle word styles like "heading 4" just like "Heading 4", and in .docx files insert bookmarks as anchor tags, along with relative hyperlinks for the text that references them. (Updates the .doc test file to include bookmarks, but there's no .doc handling of them yet)
new 3dd1211 TIKA-553: Automatic license header checks
new 45576d7 - get ready for next dev cycle
new 3d244de - add in detail link on contributions for 0.8
new 98cf861 TikaInputStream: do not wrap ByteArrayInputStream/BufferedInputStream in BufferedInputStream
new e54acb4 OOXMLExtractor: use EmbeddedDocumentExtractor
new 7c8ac0b XSLFPowerPointExtractorDecorator: imports cleanup
new 8e58142 TIKA-548: PDF content extracted as single line
new 4939ac6 - progress towards TIKA-556 Problems with the NetCDF jar: update to NetCDF 4.2-min jar, but include temporary repository definition before sync to Central. Once available in central, will remove tika-parent/pom.xml mod.
new cbdd84e - progress towards TIKA-556 Problems with the NetCDF jar: and voila, the netcdf-4.2.-min jar is in Central and we're set!
new 00afd53 If we hit the write limit, give a helpful error message in case you hadn't been expecting it (TIKA-557)
new eb949a4 When detecting macro enabled OOXML files, return the same format media type as in mimetypes.xml. Adds unit tests for a few of these. (TIKA-560)
new f409f28 New test files from TIKA-560
new 96a8b1d Apply mimetype updates from TIKA-560
new 1573be6 TIKA-564: Support returning original markup in BoilerpipeContentHandler
new 2a92008 TIKA-560: Improve detection of .mht, Foxmail, and OOXML files
new dac8ec8 TIKA-562: In tika-mimetypes.xml OpenXML types should have x-tika-ooxml as their parent
new d82bca5 TIKA-563: .vor files are Staroffice Templates, not Staroffice Writer documents
new 98c2833 TIKA-555: image/bmp mime type does not exist
new 29ef6fd TIKA-555: image/bmp mime type does not exist
new adf4620 TIKA-461: RFC822 messages not parsed
new 070b583 TIKA-555: image/bmp mime type does not exist
new 9e9a9d9 TIKA-461: RFC822 messages not parsed
new a7d6cf4 TIKA-461: RFC822 messages not parsed
new 1613373 TIKA-565: Improved OSGi bundling
new c58d8ce TIKA-565: Improved OSGi bundling
new 0349400 TIKA-565: Improved OSGi bundling
new 5af3281 TIKA-565: Improved OSGi bundling
new 6059044 TIKA-461: RFC822 messages not parsed
new f0033c0 TIKA-566: Better convenience methods for type detection
new edd122d TIKA-548: PDF content extracted as single line
new bf42aac TIKA-567: Temporary file leak in TikaInputStream
new 0df83fb Add missing svn:eol-style
new 12622b4 TIKA-447: Container aware mimetype detection
new 0fa3727 TIKA-447: Container aware mimetype detection
new 201dd27 TIKA-447: Container aware mimetype detection
new d20903f Add a PDF file that is protected with the default (Empty) password. Taken from PDFBOX-858 but for work on TIKA-389
new 5451e28 Fix TIKA-389 - If the PDF is protected (aka encrypted), then always try to decrypt it. Otherwise, we can end up with garbage metadata. Includes a unit test that shows we now get the metadata correctly.
new d5b8fde TIKA-569: More fault-tolerant loading of parsers and detectors
new 5f1e9d0 TIKA-569: More fault-tolerant loading of parsers and detectors
new a78e298 Apply patch from TIKA-570 from Benson Margulies - stricter BMP detection and unit test
new 05e6069 TikaCLI: add attachment extraction option
new ded569d TIKA-573: add MimeType.getExtension(). Extensions are taken from filename patterns
new 9f5593e TIKA-574: Support for IBM866 (CP866) encoding in TXTParser Submitted by: Kostya Gribov, grossws at gmail.com
new ea4b6e4 tika-parent/pom.xml: added myself to committers list
new 02f945f TIKA-574: add missing test file
new 841ede4 fixed compilation in Java 1.5.0
new 42458b1 TIKA-569: More fault-tolerant loading of parsers and detectors
new 6b51782 TIKA-416: Out-of-process text extraction
new 42eecdb TIKA-416: Out-of-process text extraction
new 2680077 TIKA-416: Out-of-process text extraction
new 1d0ef6b TIKA-416: Out-of-process text extraction
new ab0921a TIKA-416: Out-of-process text extraction
new 1812a86 TIKA-416: Out-of-process text extraction
new cb6b98e TIKA-416: Out-of-process text extraction
new fa6ee73 Add test access mdb file from TIKA-586
new c187e0d Add test true type font file from TIKA-586
new 34bb845 Access mdb detection and test from Martijn in TIKA-586
new bb31716 TIKA-416: Out-of-process text extraction
new ba8a969 TIKA-416: Out-of-process text extraction
new bdc9e61 TIKA-416: Out-of-process text extraction
new dba5043 TIKA-567: Temporary file leak in TikaInputStream
new ff74808 TIKA-567: Temporary file leak in TikaInputStream
new 05f333c TIKA-567: Temporary file leak in TikaInputStream
new b6f45ba TIKA-567: Temporary file leak in TikaInputStream
new b65fad1 TIKA-587: NullPointerException in OutlookExtractor on missing chunks
new 3e79510 TIKA-585: AudioParser Fails with NPE on fileFormat.properties
new 164302f TIKA-375: Improve code quality metrics
new 728a226 TIKA-582: Lithuanian language identification
new 46dce5b TIKA-581: Parser fails on files that parsed with v0.7
new 8133622 TIKA-587: XMLParser ContentHandler: multiple endDocument calls
new 663b274 TIKA-416: Out-of-process text extraction
new aec08ff TIKA-416: Out-of-process text extraction
new 3b3845a TIKA-416: Out-of-process text extraction
new 7391243 TIKA-416: Out-of-process text extraction
new 4a9b1f4 TIKA-416: Out-of-process text extraction
new 9ffe396 OutlookExtractor: fix NPE on messages without 'from' field
new 1466606 TIKA-416: Out-of-process text extraction
new cb91494 Fix up the iwork mime types with the patch from TIKA-588, and also add a unit test for the detection using the non-container detector (we already had container aware detector iwork tests)
new f4098f1 TIKA-416: Out-of-process text extraction
new b1e9764 TIKA-416: Out-of-process text extraction
new cd4dff4 TIKA-422: Wrong charset conversion in some RTF documents.
new ca5aa21 TIKA-593: Tika network server
new 7b15af5 TIKA-372 follow-on: Set two more XMPDM metadata values for MP3, and add unit tests
new 23aa290 Update copyright year to 2011
new 6fd7bb8 TIKA-525: Mismatched start and end elements in HtmlParser
new 568868c TIKA-594: Upgrade Tika to pdfbox 1.4.0
new 981a123 TIKA-508: HtmlParser link processing should skip usemap and codebase attributes
new c139c2d TIKA-497: HtmlHandler should fix up incorrect capitalization of names in <meta http-equiv="xxx"> attributes before putting into metadata
new 88a0189 - fix for TIKA-596 NetCDF and HDF files don't parse correctly from the command line via tika-app
new 71e1271 - 0.9 RC release prep
new 69ebcc7 Whitespace tickle.
new fec548c [maven-release-plugin] prepare release 0.9
new 30e45df [maven-release-plugin] prepare for next development iteration
new c292f1a [maven-release-plugin] prepare release 0.9
new 6eaa7cc [maven-release-plugin] prepare for next development iteration
new 3272ece - bumpity
new 453479f - bumpity
new 92573a1 TIKA-593: JAX-RS network server
new df1e4cb tika-server: fix compilation under Java 1.5
new bbc97d8 tika-server: fix compilation under Java 1.5 - remove @Override on interface implementations
new 2b34017 tika-server: add license header to commons-logging.properties
new ba5671c TIKA-597 : Bogus exception handler in org.apache.tika.parser.mail.MailContentHandler
new 2bf987b TIKA-606 - MP3 lyrics tags use a 6 digit length for the overall size, but only 5 digits for each tag
new 4fca144 TIKA-611 : setSortByPosition reverted to the default value (false) in PDFTextStripper so that columns are separated
new 2129a0f TIKA-609: IOException from jempbox
new e15598f TIKA-597: Bogus exception handler in org.apache.tika.parser.mail.MailContentHandler.body(BodyDescriptor, InputStream)
new 1bbaa90 TIKA-607: ParseUtils.getStringContent( ) of a text file - parser is null
new 2c5efd4 TIKA-600: [patch] suspect transferable code
new 2313022 TIKA-601: [patch] objects that compareTo each other, should also equals each other
new 7857da1 TIKA-602: [patch] use short-circuiting rel ops
new d7601cc TIKA-599: Thread issue with autodetect parser
new 4d96672 TIKA-593: Tika network server
new 7da803f TIKA-594: Upgrade Tika to pdfbox 1.5.0
new d27c244 TIKA-593: Tika network server
new 22c94c9 - fix for TIKA-614 Support hdf5 data file with file extension *.h5 contributed by Cynthia L. Wong
new 9fbce43 Update the OOXML Excel (.xlsx) extractor to be largely SAX based, to reduce the memory use (it now works in a similar-ish way to the .xls one). Bumps the POI dependency up to 3.8 beta 1. (TIKA-521)
new b13d07e Fix the mime magic detection of TNEF files, and add a unit test for it. (The rest of the TNEF support will be committed when POI 3.8 beta 2 is out). (TIKA-615)
new 8b87a38 When parsing an RFC822 file, don't assume that the from address is always in a certain format. Fixes TIKA-618 and adds a unit test for it.
new cd45bd2 TIKA-534 - When parsing a jpeg file with unhandled tags in it, skip these
new ca6b5ca TIKA-592 - Support AutoCad DWG files from AutoCad 2000 (version 1015), and add Custom Properties support across all versions. Adds unit tests for various other "versions" (where the file format doesn't seem to have changed even if the product version has)
new f45b0f8 Turning an ASCII string into static final bytes without exceptions shouldn't be this hard.... Fix 1.6ism for TIKA-492
new b206755 update Excel number-format tests for use with latest POI trunk
new 2ebf678 DOCX: rich text parsing for DOCX headers / footers
new cf73252 OOXMLParserTest.testWordPicturesInHeader disabled due to bug in POI - wait for 3.8-beta2 or 3.8-final
new 8ff0334 Add some more detection tests, which show that for container formats the addition of the filename lets us specialise from eg tika-msoffice to msword
new 8eb3b88 Fix deprecated warnings
new 0c0cbf7 TIKA-620 - When trying to identify a parser for a media type in AutoDetect and similar, if the Parser claims to support an alias of the media type but not the canonical one (eg someone changed the mimetype file but not the parser), then have the parser accepted on the alias. Also adds AutoDetectParser tests for images (the bmp one of which didn't work before)
new c7ece7f TIKA-620 - When creating a default TikaConfig instance with a DefaultParser, have the newly created parser wired up with the Mime Type Registry we create. This allows the parser to resolve media type aliases and supertypes as it assumes it can.
new 683ac82 TIKA-555 fallout - While image/bmp isn't the official mimetype, it is what Java thinks it is. So, switch from the official to the unofficial one before asking Java to give us image processors
new dca2ee6 TIKA-620 - Have CompositeParser always use the canonical mimetype internally, via suitable calls to registry.normalise, rather than trying to handle the aliases individually
new 4cbd467 TIKA-160: Support encryption formats
new 8b643b6 TIKA-160: Support encryption formats
new e09dc30 Fix deprecated warnings
new 58f7914 TIKA-625: Easier XML parser extensibility
new 3b61cea TIKA-626: Add an AbstractParser class
new 5e1afea TIKA-629 - Add the sample .asf, .wmv and .wma files from Microsoft
new 46cda3d TIKA-629 - Add detection for .asf, .wmv and .wma (including tests) Adds support for unicodeLE and unicodeBE strings in the mimetypes reader
new 1cb70f3 TIKA-631 - Sample Chinese outlook file
new 0ea4a15 TIKA-631 - Stub out the work for improving the outlook parsing WRT html body content and better encoding detection
new 7dc399b TIKA-633: NPE in XWPFWordExtractorDecorator.extractHeaders
new 553d769 OOXMLParserTest: fix compilation on java 1.5
new 2821b70 TIKA-634 - Initial work on supporting more flexible ExternalParser loading (via XML, part done), and external parser metadata extraction
new 20f881f TIKA-634 - Example external parsers config file
new 92b0f12 TIKA-634 - Add support for checking if the external command is there, for collecting the output from a file, and a wrapper CompositeParser that loads all available External Parsers
new b94b086 TIKA-635: Tika GUI improvements
new 6aa8fc2 TIKA-635: Tika GUI improvements
new 6acca4a TIKA-635: Tika GUI improvements
new 2ffde96 TIKA-615 - Outlook parsing update for POI 3.8 beta 2
new 336bf67 TIKA-615 - POI powered TNEF parser
new f6a4a01 TIKA-615 - Update the new parser to use AbstractParser
new 90b0f8e TIKA-622 - Switch the POI based parser from the old POIFS to the new, lower memory NPOIFS
new 5f8737a TIKA-639: Maximum pool size for ForkParser
new 51989e7 TIKA-635: Tika GUI improvements
new 15229b6 TIKA-639: Maximum pool size for ForkParser
new f182496 TIKA-639: Maximum pool size for ForkParser
new d029485 OOXMLParserTest: enable testWordPicturesInHeader that was disabled due to bug in POI 3.8beta1
new d75b366 docx: extract image description in alt attribute
new 75c5bcf TIKA-593: Tika network server
new b6d67d3 TIKA-621: RTF parsing fails with Java 7 early access on 64bit platforms
new cfb1779 TIKA-461: RFC822 messages not parsed
new e67859b TIKA-461: RFC822 messages not parsed
new 6b711d2 <?xml version="1.0" encoding="UTF-8"?>
new 9c9d6f8 TIKA-644 - When generating html headings from word, h6 is the highest the xhtml allows, so don't try generating h7 (or higher) even if Word has a 'Heading 7' style
new 0a7b8ce TIKA-643 - Now that we're using NPOIFS which takes files, simplify the code as we don't need to use an InputStream
new a5df877 TIKA-643 - Change TaggedInputStream to work like TikaInputStream for creation, with a static get, to avoid double wrapping. Also adds toString methods on the two
new 03752ab TIKA-643 - Add toString() to another of our InputStreams
new cac8d4c Remove un-used import
new 1fe4980 Office: SummaryExtractor: do not fail on files without property stream (original fault file was generated by Java Excel API library)
new e6e7650 OfficeParser: HWPF: ignore invalid style references
new 907880f TIKA-649 Fix for .docx files with no header or footer policy defined
new ee817f0 TIKA-647 - Fix inverted logic on --list-met-models
new fc9e8a0 TIKA-213 JSON metadata output support, using the GSON library to do most of the work
new 3ab6160 TIKA-619 - Apply patch from Alexander Chow to ignore errors from a JRE GIF bug
new 3e0191d TIKA-654 - Open the OOXML OPCPackage as read only, and fix serial version warning
new 5119a98 TIKA-654 - If we have an open container that can be closed, close it when closing the stream
new 87fd9af TIKA-655 - Push the iWorks detection logic from ZipContainerDetector to IWorkPackageParser, and make that detect similar to OfficeParser does. Then, put the content handler selection logic into IWorkPackageParser, and remove IWorkParser (which claimed to be a regular parser but in fact only worked when called from IWorkPackageParser) The result is that tika app can then parse iWork files, and unit tests still work
new 6af91c6 TIKA-656 RFC822 and MBox parsers should output the same date metadata keys
new 5716665 TIKA-656 Switch two more Office metadata keys that hold dates to being typed date properties
new 469aa10 TIKA-656 Correct general POI date to metadata handling, plus test
new 8ef4561 TIKA-656 Update the Outlook parser to handle dates the same way as the other mail parsers
new 8f4b2ae TIKA-652 Add a few more external properties to match the internal ones
new 1835ad3 TIKA-652 Update the POIFS parser to handle custom metadata entries in the same way that the Open Document one already does
new d18f3f0 INFRA-3583: Add Tika to Sonar
new c2aed15 TIKA-658 TCPDump pcap mime matching
new 8c58587 TIKA-659 Merge the ODF parser tests, and put them in the new package
new ac4b6ef TIKA-646 Helper class to allow us to avoid calling endDocument until a later time
new 1fcb10a TIKA-646 Avoid calling endDocument for OOXML and ODF parsers until after we have extracted the metadata
new ddceb68 TIKA-655: IWorkPackageParser / IWorkParser not registering properly
new c69ff98 TIKA-640: RFC822Parser should configure Mime4j not to fail reading mails containing more than 1000 chars in one headers text (even if folded)
new 77c7847 TIKA-650: Missing required alt attribute on img tag
new c772872 TIKA-645: Parsers can't get at an underlying TikaInputStream to get the file if they wanted one
new 0b7fa58 tika-parsers/pom.xml: remove unused commons-httpclient dependency
new 08859fc TIKA-213 Remove leading zeros from integers when outputting JSON
new 35fc876 TIKA-660: Remove logging of duplicate parser definitions
new 49a9c4f TIKA-661: MimeType class does contain a String with accessor named Extension. This should be a List<String> Extensions due to several reasons.
new c2f7fe0 TIKA-645: Parsers can't get at an underlying TikaInputStream to get the file if they wanted one
new 49a3991 TIKA-662: OOXMLExtractorFactory: use file when stream is TikaInputStream and it .hasFile()
new 9f57e91 TIKA-416: Out-of-process text extraction
new 5943753 TIKA-416: Out-of-process text extraction
new 329ebc5 TIKA-642: Few of RTF files not extracting properly
new 3b10c70 TIKA-660 Merge the two CompositeParserTests and PatternsTests into one each in core
new 406cca3 TIKA-645: Parsers can't get at an underlying TikaInputStream to get the file if they wanted one
new c5f2c92 TIKA-572: Update plugin versions in the POM structure
new 40fd8c1 TIKA-628: Binary distribution for releases
new 15bd1d4 TIKA-664 Add mime entries for Adobe Premiere (PPJ) and Adobe SoundBooth (ASND), plus a common PhotoShop alias
new 9b54a1c TIKA-375: Improve code quality metrics
new 2ff952a TIKA-375: Improve code quality metrics
new 355bfe2 TIKA-160: Support encryption formats
new 7fe73eb TIKA-259: Safe parsing of droste.zip
new d64aa92 TIKA-527: Allow override mapping mime<-->parsers through config
new 3f618f6 TIKA-527: Allow override mapping mime<-->parsers through config
new 4d89ea2 TIKA-346: ZipParser throws "invalid compression method" error for some archives
new 49692de TIKA-665: NullPointerException from com.sun.org.apache.xml.internal.serializer.ToStream.writeAttrString on some excel files from the CLI
new deeff20 TIKA-668: Better handling of XML parse errors
new fb732d6 TIKA-671 - initial support for FictionBook document (fb2) format
new 33e5529 added 3 chm files, testChm, testChm2, testChm3
new e152e26 added chm patch included tests, changed tika-bundle-it pom.xml, version 0.9 --> 1.0
new ab666f2 - progress towards TIKA-245 Support of CHM Format (Oleg's patch, in parts, as suggested by Jukka)
new cbba597 - progress towards TIKA-245 Support of CHM Format (Oleg's patch, in parts, as suggested by Jukka)
new 82a4e49 - progress towards TIKA-245 Support of CHM Format (Oleg's patch, in parts, as suggested by Jukka)
new 8c6010b - progress towards TIKA-245 Support of CHM Format (Oleg's patch, in parts, as suggested by Jukka)
new 5619f19 - progress towards TIKA-245 Support of CHM Format (Oleg's patch, in parts, as suggested by Jukka)
new 1d3fe57 - progress towards TIKA-245 Support of CHM Format (Oleg's patch, in parts, as suggested by Jukka)
new 75e5f97 TIKA-631 Apply Outlook extraction enhancement to better extract html and rtf versions using POI 3.8 beta 3
new 69f49a9 support of Java 5
new 42d7de9 support of Java 5
new f2e35d0 added missing Apache license header
new 3dee434 added missing Apache license headers to the chm tests
new cf8cb44 (TIKA-672) Proper error handling in the CHM parser
new 6a62fe0 (TIKA-672) added to the chm tests more sophisticated error handling
new 608a608 TIKA-466: Feed Parser
new 9ffaf87 OfficeParser: choose correct Decryptor for document
new 7b5f83a List the commons compress dependency in the bundle too (TIKA-671)
new 31ce68f TIKA-434 - Pushback buffer overflow in TagSoup
new 77c6b73 TIKA-679 Add detector support for CADKey PRT files
new f5a9438 TIKA-679 CADKey PRT parser
new afb429c TIKA-679 Remove un-used import, and fix warning
new 5ad2798 TIKA-679 Add missing license header
new 96f592a TIKA-679 Update the CADKEY PRT parser to get the description, and tweak the text encoding based on work by Troy
new fc13845 TIKA-678 Add unit test using supplied test file that shows the problem with option headers no longer exists
new 92b2619 TIKA-683 Rename TikaTest to something more specific, so we can use that name for a parent superclass of our tests
new f7607ec TIKA-683 New TikaTest parent class for tests (which RTF test will shortly use)
new ca890e4 TIKA-683 Create a dedicated RTF parser test, based on the existing checks in TestParsers
new f5f1ef1 TIKA-683 Unit test for Japanese RTF text
new 2780869 TIKA-507 Split the mime type entries for AFM and PFM (font metrics) out from the fonts themselves, and add magic detection patterns for them
new d27bad8 TIKA-507 Add byte based detection tests for .pfa/.pfb/.pfm (which we currently lack free sample files for)
new 61c1637 added ngram profiler and its tests, also added an option to the TikaCLI.java for lang.profile creation and its test
new 5bbd7ab changed verification point of the testListParsers()
new 24b8cfb commented an assert temporarily
new b303c6a Add quick test to validate that RSS feeds will be processed by the appropriate parser (see https://issues.apache.org/jira/browse/NUTCH-1053).
new 90b1063 added license headers to alice.cli.test & welsh_corpus files
new 30897dc added license header to welsh_corpus file
new 25ca130 TIKA-593: Update tika-server
new 1dc0a2e TIKA-527: Allow override mapping mime<-->parsers through config
new d7b8d74 TIKA-565: Improved OSGi bundling
new 2af24f4 - patch for TIKA-422 contributed by Mike McCandless.
new e6defb6 - accidentally committed in progress geo folder, removing it.
new 62abba6 - commit unit test patch for TIKA-683 from Mike McCandless.
new 8efad87 TIKA-692: TikaCLI -x or -h on a Word doc sometimes adds newline after </b> tag
new b16c898 TIKA-692: TikaCLI -x or -h on a Word doc sometimes adds newline after </b> tag
new 332e498 TIKA-692: TikaCLI -x or -h on a Word doc sometimes adds newline after </b> tag
new c3318a1 TIKA-447: Container aware mimetype detection
new 1be0e52 TIKA-667: Changes to RFC822Parser to support turning off strict parsing
new e7b51f0 TIKA-693 - Incorrect mime-type for .pptm, .ppsm and .ppsx in OOXMLParser
new 5aa2fc4 XSLFPowerPointExtractorDecorator: remove unnecessary call to slide._getCTSlide()
new 3c3e6d1 TIKA-434: Bug in TagSoup causes IOException
new ea2cfb0 ZipContainerDetector: fix file descriptor leak
new f74ff48 TIKA-697 Sample files in the AR archive format
new 4316d8d TIKA-700 Upgrade the POI dependency to 3.8 Beta 4
new 600e69d Fix TIKA-700 related 1.6ism
new d64fb55 TIKA-392: add 3 RTF test cases
new 728a218 TIKA-392: use unicode escapes for non-ascii chars
new 3359566 TIKA-701: Fix problems with TemporaryFiles
new 8ef85fc TIKA-701: Fix problems with TemporaryFiles
new 3e832b0 TIKA-701: Fix problems with TemporaryFiles
new 6bc0e05 TIKA-701: Fix problems with TemporaryFiles
new 6bd95ee TIKA-701: Fix problems with TemporaryFiles
new ddf56ce TIKA-701: Fix problems with TemporaryFiles
new 6862cc1 typo
new 0ac8f50 TIKA-687: Temporary file not removed after detection
new 411d5d8 TIKA-207: MS word doc containing tracked changes produces incorrect text
new 9d73a10 TIKA-704: PDF and Outlook docs embedded in MS Word documents not parsed
new 9d5cf32 TIKA-702: Cannot compile Tika with Java 7 (ImageMetadataExtractor.java)
new 9c9763a TIKA-698: "Invalid UTF-16 surrogate detected:" parsing PowerPoint 97-2003
new e25a454 Embedded file extraction is broken for some OOXML files (bug introduced few commits ago)
new b4bfb99 TIKA-704: PDF and Outlook docs embedded in MS Word documents not parsed
new 07f6f63 add several test cases, derived from test case coming in TIKA-683
new 04d1fb7 TIKA-704 Tweak detection of embedded non-office documents in OLE2 streams
new 4c5599b TIKA-698: use the unicode replacement char (U+fffd) when replacing invalid XML chars in SafeContentHandler
new 4639ee2 TIKA-710: Expose the Parser and Detector instances within the Tika facade
new 9cb9439 TIKA-704: PDF and Outlook docs embedded in MS Word documents not parsed
new e924f82 TIKA-704: PDF and Outlook docs embedded in MS Word documents not parsed
new 505a835 PPT: avoid NPE when OLEShape.getObjectData() is null (see POI bug#51771). Patch by Yegor Kozlov
new 5c22595 TIKA-683: new RTF parser that performs its own direct shallow parse (instead of using RTFEditorKit from javax.swing)
new 1121dfe TIKA-594: Upgrade Tika to pdfbox 1.6.0
new 24ef103 TIKA-692: TikaCLI -x or -h on a Word doc sometimes adds newline after </b> tag
new ad30687 TIKA-692: TikaCLI -x or -h on a Word doc sometimes adds newline after </b> tag
new d30f1ff TIKA-598: Update HDF parser and NetCDF parser to emit minimal XHTML
new 717b173 TIKA-688: Enhance content-type detector to recognize almost plain text
new 0fdbd74 TIKA-698: explain that Unicode repl. char is now used for invalid chars
new 2f7241d TIKA-717: add testComment test
new fdb902b TIKA-717: fix RTF parser to extract annotations (comments)
new 9dc084c TIKA-688: Enhance content-type detector to recognize almost plain text
new c4c8b22 TIKA-565: Improved OSGi bundling
new de35385 Drop svn:executable properties.
new 3f9f0fe TIKA-700 Upgrade the POI dependency to 3.8 Beta 4
new 7b36616 TIKA-565: Improved OSGi bundling
new 8089184 Add missing svn:eol-style settings
new e679c0e Some CHANGES.txt updates
new 5e9164d Add a more specific mime magic pattern for detecting single stream Ogg Vorbis files
new ea21418 TIKA-705 Temporary workaround for the relative links issue, pending upgrade to POI 3.8 beta 5
new 6187e03 TIKA-705: re-enable test case
new ab1aea6 TIKA-725: Empty title element makes Tika-generated HTML documents not open in Chromium
new ebb0338 TIKA-716: Upgrade apache-Mime4J to Version 0.7
new 39be977 TIKA-726: add EncryptedDocumentException for situations when extraction cannot be done due to an unknown or wrong password
new 111fcd5 TIKA-726: use EncryptedDocumentException in OfficeParser and CryptoParser
new e39bff0 TIKA-726: add Apache license header
new a5429c5 TIKA-726: throw EncryptedDocumentException in ExcelExtractor
new c9e37f6 TIKA-716 Fix tika-bundle dependency list following apache-Mime4J upgrade
new acbefdf TIKA-712 Fetch Master Slide text for PPT and PPTX text extraction
new 1de7bfe TIKA-712: master slide's text is now extracted
new ec9d7d2 TIKA-709: Tika network server does not print anything in response to, for example, Word documents
new 0d2676c TIKA-727 Improve the HSLF PPT parser by using HSLF usermodel classes to generate more specific XHTML events
new 1109b5a TIKA-508: HtmlParser link processing should skip usemap and codebase attributes
new 2bbf5ca TIKA-720 Sample EBCDIC (IBM-500/CP500) text file
new eb73a4b TIKA-720 Add Charset Detector for the IBM500 (EBCDIC) charset
new b62ce33 Make the error message more helpful when this test assert fails
new aa944e8 TIKA-720 Add documentation for some of CharsetRecog_sbcs, and tweak the EBCDIC bit to avoid false matches for short snippets of HTML
new 7aeed0b Add a disabled Outlook RTF related test, pending a fix for TIKA-632. (We're nearly there with the recent RTF improvements, but not quite....)
new 683c9a8 TIKA-508: HtmlParser link processing should skip usemap and codebase attributes
new a682c4c TIKA-712: temporarily turn off pulling text from master slide until we can figure out how not to pull out the boilerplate text
new 9836592 TIKA-712: add (disabled) test cases showing the bug
new fcde81c TIKA-651: Unescaped attribute value generated
new 16de4aa Use spaces for indentation
new 23109a8 Avoid closing stdout in TikaCLITest
new 77ebfcd TIKA-651: Unescaped attribute value generated
new cbd8a31 Replace 1.0 with 0.10 in @since statements
new 525ded2 - prep for 0.10 RC1
new 4ef2b04 - prep for RC 0.10 #1
new e4ffb45 [maven-release-plugin] prepare release 0.10
new 15f13bd [maven-release-plugin] prepare for next development iteration
new fb8a540 TIKA-731: NPE in WordExtractor.handleParagraph()
new 9c609e6 TIKA-732: Upgrade to Commons Codec 1.5
new 71a0b7e - update CHANGES.txt changelog with contribution report for 0.10 and with dependency tree.
new 3b9fbf9 HSLF Extractor improvements from Pablo from TIKA-727
new c80eee3 TIKA-632: extract hyperlinks from RTF docs
new 0406ad0 TIKA-632: temporarily disable RTF hyperlink test
new add1168 TIKA-632: enable test case
new 70d59d1 TIKA-711: add test for optional hyphen across doc types; leave .doc turned off until we can fix it
new 65a5b20 typo in comment
new f60f184 TIKA-717: PPT is extracting comments correctly
new 0396c4f TIKA-738: move (ignored) test case to PDFParserTest
new dc2b9a1 TIKA-733: try to be robust when RTF doc has too many closing }'s vs opening {'s
new af7df6b TIKA-711: correctly handle optional hyphen from Word docs (.doc)
new 4e0b3e6 TIKA-742: extract paragraphs inside PDF pages
new da8cc2a TIKA-742: fix Java 1.6 only code
new 3a70dc8 TIKA-743: Upgrade to Apache parent POM version 10
new 609cfe7 TIKA-739: For certain DWG files, the Tika content parser outputs garbage
new 84d1ee3 TIKA-741: "Zip bomb" (XML nesting) detection is too strict
new 03fa932 TIKA-699: Automatic checks against backwards-incompatible API changes
new 4e22d4b TIKA-744: Drop support for Java 1.4
new 8f3097f Tweak the MP3 parser to put a class on the lyrics paragraph, so it could be filtered for if required
new b5d5065 TIKA-745 If we find a ID3v2 Genre that isn't one of the ones in v1, use it as-is
new 7d873ab TIKA-746 Allow MimeTypesFactory to take more than one resource to load, and update the default to be to load tika-mimetypes.xml followed by any custom-mimetypes.xml files found
new aea844e TIKA-448: Tika FLVParser hangs
new a2f30df TIKA-682 Add mime magic detection for PSD files
new f5bc2a7 TIKA-749 Add EndianUtils, which provides a way to read small and big endian numbers from streams, based on the version in POI
new 01bbb05 TIKA-682 Add a basic PSD metadata extracting Parser
new bec14db TIKA-749 Convert the DWG and PRT parsers to use the Tika endian util, rather than the POI one
new a5cc41a TIKA-682 Fix 1.6ism
new c256022 TIKA-748: ignore \* if it's not right after group start {
new 578050c TIKA-750: JavaDoc of Tika XPathParser should mention descendant:node()
new bc49afc TIKA-681: eight new n-gram language profiles
new 66396ca TIKA-681: eight new n-gram language profiles
new e2f5b4c TIKA-751: some initial improvements to embedded office doc handling in AbstractPOIFSExtractor
new d2ac303 Add a common alias for the WordPerfect mimetype
new c3f9910 TIKA-752: Typo in timezone used in Metadata.iso8601Format
new 02ec12f TIKA-657: Email parser gets into trouble on malformed html in enron corpus
new 1aaac7f TIKA-657: Email parser gets into trouble on malformed html in enron corpus
new 773b46e TIKA-753: speed up processing of embedded office docs
new 30a22be TIKA-755 Have TikaConfig create a DefaultDetector instance based on the supplied MimeTypes and/or ClassLoader, and switch Tika+AutoDetectParser to get their detector from there, rather than create their own DefaultDetector instance
new 5cc8b40 TIKA-756: XMP output from Tika CLI
new 9769f86 TIKA-738: optionally extract PDF annotations
new 94356cb TIKA-724: add option to PDFParser to control auto-space behavior
new f22152d TIKA-761: Provide version number by CLI argument -V
new eab9318 TIKA-565: Improved OSGi bundling
new 82e053d TIKA-746 Allow MimeTypesFactory to take more than one resource to load, and update the default to be to load tika-mimetypes.xml followed by any custom-mimetypes.xml files found
new 70c5417 TIKA-582: remove extra quotes from Lithuanian 3gram tables
new c05a585 TIKA-565: Improved OSGi bundling
new 1076bb7 Summarize changelog entries by feature rather than by issue
new 6df67a8 Add a few more 1.0 changelog entries based on notable issues in Jira
new 9adbcd0 Normalize CHANGES.txt to use UTF-8
new dcc6e5b Include only compile-scope tika-parsers dependencies in CHANGES.txt. The other dependencies aren't really of interest to normal users.
new c25ac97 Uniform formatting of the CHANGES.txt file
new 25f7a4d TIKA-703: Drop deprecated methods/classes/interfaces
new 1d55d3a TIKA-703: Drop deprecated methods/classes/interfaces
new 6603aa6 TIKA-761: Provide version number by CLI argument -V
new 08af8cd TIKA-763: Update license metadata
new 8e434ae TIKA-763: Update license metadata
new 3083e91 TIKA-736: extract header/footer text for OpenOffice docs
new 693d3c6 TIKA-764 Update OpenDocumentMetaParser to use the common Metadata keys for document statistics, and remove use of a deprecated class in fetching the stats
new 9ba4191 TIKA-565: Improved OSGi bundling
new 309a9a5 Remove unused import
new 2e976ca TIKA-565: Improved OSGi bundling
new 8f39338 TIKA-565: Improved OSGi bundling
new 1ef02db TIKA-565: Improved OSGi bundling
new e910698 TIKA-565: Improved OSGi bundling
new 5fc2030 TIKA-699: Automatic checks against backwards-incompatible API changes
new 6673d51 TIKA-565: Improved OSGi bundling
new 3d12327 TIKA-565: Improved OSGi bundling
new 1d7259d TIKA-565: Improved OSGi bundling
new d482bd0 TIKA-565: Improved OSGi bundling
new 43bcd67 TIKA-565: Improved OSGi bundling
new 52d6efd TIKA-565: Improved OSGi bundling
new bc3f524 massage CHANGES.txt: inline issue numbers so we can match to the right description
new eea0229 TIKA-763: Update license metadata
new a0aec54 TIKA-763: Update license metadata
new 044cd54 TIKA-769: Upgrade to Commons Compress 1.3
new 2a6eca0 TIKA-761: Provide version number by CLI argument -V
new 8797f12 TIKA-703: Drop deprecated methods/classes/interfaces
new 7d30b7a - prep for release
new 15f252d [maven-release-plugin] prepare release 1.0
new 0121288 [maven-release-plugin] prepare for next development iteration
new 2c128e8 - add release date: updated on RC vote area, and will push to dist.apache.org on release (if successful).
new 7f970ef TIKA-767: allow controlling whether PDFBox should try to remove overlapped duplicated text; default to disabled
new 1a7dfa1 TIKA-712: strengthen the test cases here to not only validate the text came through but also to make sure boilerplate text did not
new a39d582 TIKA-714: add test case for PPTX to extract text from word art
new a157a6f TIKA-529: don't allocate byte[] for each byte when detecting IBM420 charset
new 2c3dbde TIKA-777: process buffered bytes/text on font change
new 79150b7 TIKA-773: .NET version of Tika
new 6d7b5d4 TIKA-780: Optimize loading of the media type registry
new d57de4c TIKA-780: Optimize loading of the media type registry
new 72646ef Adjust clirr checks to use Tika 1.0 as the baseline.
new c57d94d TIKA-780: Optimize loading of the media type registry
new 2e5cd68 TIKA-780: Optimize loading of the media type registry
new 85878f6 fix typo
new 9f8c762 TIKA-780: Optimize loading of the media type registry
new b7a58ef TIKA-780: Optimize loading of the media type registry
new 350d1be Exclude the enum types from clirr checks.
new ad78ecd TIKA-781: don't output whitespace when we are in an ignored GroupState
new 5e91e71 TIKA-663 Mimetype entry for JSP with magic
new 1752947 TIKA-779 Works 2000 container aware detection, plus test
new 7b80961 TIKA-612: enable controlling PDFBox's setSortByPosition from PDFParser
new c631563 TIKA-782: properly handle \bin control word
new a7e1524 TIKA-784 Mimetype entry and glob for DITA
new 0edc3ad TIKA-784 Sample DITA task, concept and map files. (Based on some Alfresco documentation, with content replaced with Tika info)
new 9ae596c TIKA-784 DITA mimetype entries for the 3 subtypes, plus tests
new 1cbd62a TIKA-785 Add a --list-detectors method to TikaCLI, along the lines of the existing --list-parsers one
new e3f9af7 TIKA-784 Switch the DITA types to be format specialisations, rather than their own dedicated mimetypes, to match the OASIS recommendation
new 78f6eac Expand container detection tests, and added disabled (failing) tests for TIKA-786
new ba101bb A few more TIKA-786 related tests
new fa077bf Add basic JavaDoc for a few MediaType methods that lacked it
new 2b39c75 TIKA-786 Control the ordering of detectors in DefaultDetector, so that user supplied detectors come first, then Tika ones, and finally MimeTypes. This ensures that more specific detectors get to try first
new 482cf37 Add a note about TIKA-786 to Changes
new f8e3364 TIKA-787: Improve charset detection for UTF-8 HTML fragment
new 4073be5 TIKA-789 Sample Microsoft Project (MPP) files
new 25fcdf9 TIKA-789 More consistent naming for sample MPP files
new a141b27 TIKA-789 Microsoft Project (MPP) is OLE2 based
new 1b01e4f TIKA-789 POIFS Container Detection support for MPP files
new 39a1bac TIKA-789 Improve MPP detection based on info from Alex Ott
new f05edfe TIKA-789 Add (metadata only) Project support to OfficeParser, and add a unit test that checks we correctly get Project metadata back from our sample files
new 8e449f9 Add CHANGES entry for TIKA-789
new 00801cb TIKA-789 Add the project type to the OfficeParser mimetype list, and add a note on why Works is missing from the list
new 6ef64dd TIKA-778: fix cases where PDFParser produced too many </p> tags
new 31f96ed TIKA-697 Test CPIO file
new e72a9c4 TIKA-697 Archive formats mimetype tests (not all of which work yet)
new 4885ea7 TIKA-697 Correct mime match for .ar unix archives, add the suggested extra filetypes and aliases, and list .deb as being ar based
new 2d42f07 TIKA-697 Add mime magic for .deb files, which are based on .ar but have a specific first entry
new 96c92f5 TIKA-794 Correct Little16 mime magic logic, and enable the CPIO test now that the detection is correct
new 9a93719 TIKA-791 Sample protected Microsoft Office documents
new e735a65 TIKA-791 POIFS Container Detector support for encrypted OOXML files, plus tests and new (tika specific) mimetype
new a78309d TIKA-790 Remove the duplicated detection code between OfficeParser and POIFSContainerDetector, by following the pattern from TIKA-791 and adding a type for OLE10Native, then pushing the rest of the detection work to POIFSContainerDetector
new 90d327c Patch+Test from Antoni from TIKA-797 - Correct the default PPT extension
new 513dbdb TIKA-798 - EMF and WMF metafiles aren't the same, so split the mimetypes, plus add magic+tests for them (patch from Antoni)
new 82ddc53 TIKA-410 Word Parser support for extracting textbox content (Patch from John Mastarone)
new 374538b TIKA-800 Wrap the ArchiveInputStream in PackageExtractor so that it can be used with Detectors
new 852f148 TIKA-565: Improved OSGi bundling
new 7e2326a TIKA-800: mark/reset not supported from POIFSContainerDetector
new 7e2393f TIKA-567: Temporary file leak in TikaInputStream
new 707b8f2 Add TikaCLI help for the -f/--fork option previously added for TIKA-416
new b73dccb TIKA-801: fixed NPE when filtering Outlook docs with RTF or HTML content
new f8e572b Add disabled tests for TIKA-808 (parser needs fixing so that tests can pass)
new eec596c TIKA-809 Handle embedded files with no file extension
new 50d6e58 TIKA-808 Remove un-used fork test imports
new 4287c50 TIKA-803 Wrap the outlook message body in a special div
new 50437f3 TIKA-812 Support for detection of MS Works 7.0 Spreadsheet files
new 4bd426e TIKA-813 Support for detection of Apple "bplist" files (Binary Property List) and webarchive files - a special case of bplists.
new 853fea0 TIKA-814 MimeTypes detects plain text based on a larger sample of bytes.
new 60711a9 TIKA-700 Upgrade to POI 3.8 beta 5
new ce78a81 TIKA-757 Tidy Excel extractor code after POI upgrade
new ac13e8e TIKA-757 Tidy the Word Extractor picture locating code
new b21fa6a TIKA-757 Tidy the OLE10Native extractor code now that POI has been upgraded
new 03e6113 TIKA-705 / TIKA-757 - Simplify the OOXML related parts code, following POI upgrade
new 3d6dbb3 TIKA-816 The Excel (XLS) Parser should format numeric formula cell values, and handle string formula cell values
new 75d487d TIKA-821 Added support for detection of old MS Works Word Processor files
new e3a1831 TIKA-812 Clarified the javadoc of the test method I introduced with ver2 of my patch. In ver2 I added a magic which allows pure MimeTypes to detect works-spreadsheet files, but forgot to update the javadoc for the test method. It was untrue. Now it's OK.
new 48b28a6 TIKA-822 - Handle quoted parameters on media types
new 57a99d0 TIKA-823 support for detecting StarOffice types, both in MimeTypes and POIFSContainerDetector
new edb6775 TIKA-828: TaggedIOException can be passed non Serializable objects
new 486d767 TIKA-808: Fork Parser doesn't work for PDF files
new 0ee5935 TIKA-808: Fork Parser doesn't work for PDF files
new dd8480f TIKA-829 Validate inputs to the ForkParser constructor (must not be another ForkParser) and TikaInputStream get (must not be null) - patch from Jerome Lacoste
new 915be4f TIKA-831 Fix the data type when comparing errors from the forked server, and add some more Forked unit tests (one disabled) - patch originally from Jerome Lacoste
new a90b827 TIKA-827 Handle sending non serializable exceptions back from the ForkServer
new e2720e4 TIKA-831 Fix test warnings, and enable the last test (needs to not use the Tika facade)
new b93fc6f TIKA-793 Correct the null termination stripping in the ID3 tag code, when dealing with double byte encoded strings
new a739ddd TIKA-793 Sample MP3 with UTF-16 text in the comments (others are all ISO8859-1 text)
new b20beb6 TIKA-833 Mark some more Excel formatting tests as passing (with tweaks to match what actually gets stored)
new bc07570 TIKA-830 Improved error message in the ForkParser if we are unable to serialize the parse objects
new 5a1ebc7 TIKA-831 Start on a test for the ForkParser with a parser exception that isn't serializable (currently not working so disabled)
new 554605b TIKA-793 Unit test for i18n MP3 tags (excluding comments)
new 3bdf69a TIKA-793 Correctly handle the COM/COMM tag in MP3s, which is in a different form - encoding+language+desc+text
new f6001f7 Add TIKA-793 to the changelog
new 52d51b9 - apply patch from TIKA-824 contributed by Markus Jelsma
new 9e22f46 TIKA-826 We don't currently support .xps or .xlsb files (which are OOXML based), so ensure we don't explicitly claim them, and have the OOXML parser decline if it gets them on the basis of the parent type
new 83a8a97 TIKA-826 For an OOXML type we can't handle, use EmptyParser for valid, empty xml
new e303705 Patch from Fabian Lange from TIKA-837 - Make inner classes static where possible
new fa076c5 TIKA-375: EmptyParser Singleton should be final
new 0f54339 TIKA-793 follow-on: MP3 files can have more than one comment, as long as the language+description pair is unique, so support capturing multiple comments
new ed4a055 TIKA-695 Sample office files with custom properties
new 022f18c TIKA-695 Add unit tests for XLS/DOC/PPT custom properties extraction
new 4b51bea TIKA-695 Support for Custom OOXML properties, plus start on tests for it
new 0c52b66 TIKA-695 Support some more OOXML custom property types, and expand the unit test coverage
new f28f380 TIKA-840 Expose the logic for detecting the type of an OOXML file from an open container
new 3698626 TIKA-840 Update the OOXML parsers, so that rather than hard coding the content type, the file specific one is fetched and set
new 36bbcfc TIKA-805 Improved .pptx XSLF extraction, patch from Yegor Kozlov
new 80d36eb TIKA-841 User supplied parsers should be preferred over built in ones in DefaultParser
new a2555c6 TIKA-507 FontBox powered .afm font metrics parser, patch from Fernando Arreola
new 83eb5b8 TIKA-843 Metadata support for dates without times (treated as midnight UTC)
new f6adfbd TIKA-843 Switch dates without times to noon UTC
new c657ecd TIKA-846 Patch from Ray Gauss to parse RDF Bag Elements to multi-valued metadata
new 45ac50a TIKA-846 Fix indent
new 92f4446 TIKA-844 Add an internal TagBag property type constructor, patch from Ray Gauss
new c95bbb3 TIKA-845 Correct the conversion of XML tags to multi-valued metadata values, and avoid duplicating existing values
new 7c43bff TIKA-849 Add a sample iBooks epub file from Andrew Jackson, and add a unit test for the Zip Container Detector of epub zip formats
new 1118426 TIKA-849 iBooks epub mimetype entry, and fix a few comments
new c877ecc TIKA-849 Initial ibooks epub support and test, from Andrew Jackson. Metadata only for now though, text isn't coming through as it's within <object> tags
new 84e8fc3 TIKA-849 EPub (and iBooks) files typically have multiple xhtml documents making up the whole, so avoid repeatedly starting/ending the document for each part
new 8f05b4e TIKA-839 Update the .potm test file to match the others, and enable testing of it - file+patch from John Mastarone
new f52da71 TIKA-760 Avoid NPE in XHTMLContentHandler if a null string is passed to the characters method
new 85ae815 TIKA-770 Convert the remaining ODF document statistics to be defined properties, and update all of the Office Count statistics to be integer typed properties
new b754b45 TIKA-770 Set all document statistics with Properties rather than Strings, now they are all typed
new 8584631 TIKA-851 More specific quicktime/mp4 matches, for the common subtypes, based on the ftyp atom
new b802cc1 TIKA-851 Another mp4 audio alias
new 34c0293 TIKA-842 IPTC Metadata Properties, including full descriptions of all the properties taken from the Specification, along with appropriate License/Notice information for this.
new 7901a55 TIKA-842 Avoid property name clash with IPTC and the old-style values from DublinCore
new bd764cc TIKA-851 Add another MP4 audio extension
new 20dfa5a TIKA-852 Sample MP4 Audio (M4A) file
new fa10b69 TIKA-852 Initial MP4 Parser, powered by MP4Parser from Google Code
new d53216a Update the bundling definitions to include MP4Parser and dependencies
new fe9897c TIKA-852 MP4 files can be very large, so avoid trying to buffer them in memory
new 24bc3dc TIKA-853 Close the stream in the MP4 Parser, and use a cleaner way to get two of the metadata boxes
new ab54c21 TIKA-757 Remove POI related TODO, now that we have upgraded to a version with the fix
new a170fbe TIKA-850 Add a new interface, PasswordProvider, which can be set on the ParseContext to provide a way to supply document passwords. Updates PDFParser to use this in preference
new 5498c0e TIKA-854: No text extraction for Word macroenabled template
new 53ef1b9 TIKA-852 Avoid NPE on missing metadata boxes
new 6454566 TIKA-852 Support setting the channel type from a channel count in mp4, via a couple of different possible routes (see dev@tika discussions)
new 5a42442 TIKA-747 Add Vorbis and FLAC test files, for integration tests
new 5e104cc TIKA-747 Add the Vorbis and FLAC parsers, along with a simple integration test
new 4cf7614 Re-enable the iWorks tests disabled in r1023712 as part of TIKA-533, as they work properly again now
new 0ec2c05 Update CHANGES with recent new parsers
new 59cab5f TIKA-818 Use a temp file for PDFBox resource processing, if the input is a file based TikaInputStream
new 72564dd Mark text/javascript as an alias of the official application/javascript mimetype, rather than being separate, and add the application/x-javascript which is sometimes used in older things too
new f68f1a4 Add the older audio/x-mpeg alias for audio/mpeg
new dc9ffcc TIKA-850 Update OfficeParser to support the new style password fetching via PasswordProvider
new eb62078 TIKA-865 Reduce the amount of MimeTypes.forName that needs to be synchronized
new 1f16bf8 TIKA-866: Invalid configuration file causes OutOfMemoryException
new 84a2a27 TIKA-865 Tweak what we lock on
new 929f753 TIKA-864: Metadata.formatDate causes blocking in concurrent use
new a547dde TIKA-866: Invalid configuration file causes OutOfMemoryException
new fbd2d6d TIKA-866: Invalid configuration file causes OutOfMemoryException
new 8cdd3dc - upgrade to 1.1-SNAPSHOT dependency.
new 8c111b9 update CHANGES.txt in prep for release.
new 963aa1b [maven-release-plugin] prepare release 1.1
new bbd71ac [maven-release-plugin] prepare for next development iteration
new 6a8b23b TIKA-870: allow setting maxStringLength per-call to Tika.parseToString
new f1f4b03 - patch for TIKA-874 Identify FITS (Flexible Image Transport System) files contributed by Peter May
new b11403d TIKA-875: fix file handle leak in ImageParser
new e91ba0c TIKA-877 - fix extraction for OLE-attachments in TikaCli
new e488767 TIKA-882 - ignore incorrect part references in OOXML Extractor
new 49613a1 TikaResource: remove obsolete Parser.parse() implementation
new 8618f64 TikaResource: force UTF-8 output
new 2c25c2a TikaResource: improve exception logging/processing
new 830f702 TikaResource: extract anonymous class
new 44d14a5 New rewritten UnpackerResource for TIKA-593:
new 6d5a229 tika-server: configure surefire plugin
new 6153280 TIKA-883 - Extract embedded images in PPT
new fcfdec7 - update template for 1.1
new f1a9287 TIKA-593 - enable jax-rs network server module
new 2e21373 tika-server: remove java.net repository
new a2bf1f8 tika-server: update java version because jersey-core requires java6
new 66bdd0b TIKA-866: Invalid configuration file causes OutOfMemoryException
new 16c92d0 TIKA-884: Dynamic loading of Parser and Detector services
new 93ab990 - progress towards TIKA-593: replace Jersey with CXF. Checking in to reduce the need to review patches. Disabled 3 tests for now that aren't passing. Will work with Max to make them pass.
new 6d7753e - ignore
new 27163d3 - set Content-Type field: that's what the test is actually doing. TIKA-593
new 892f9f9 - TIKA-593: try with 1.5
new abb7e35 - TIKA-593: improvement: use CXF client for test harnesses, remove all extraneous pom.xml dependencies and remove dep on commons-httpclient
new 71772d2 - update and configure logging.properties
new f906d34 - accept should be */* since by default CXF client sets to an XML accept (yikes), thanks to pramirez for identification, TIKA-593, see: http://cxf.547215.n5.nabble.com/Why-is-the-default-accept-for-WebClient-text-xml-td5013707.html
new a1e4fed - TIKA-593: forgot X method
new 232cc5d TIKA-886 If we open the OPCPackage from a File on a TikaInputStream, have it tracked (+closed) by the TikaInputStream the same way that ZipContainerDetector opened ones are
new d442f36 Bump the Apache James Mime4J version from 0.7 to 0.7.2, for recent bugfixes
new 0b8f59f TIKA-593: add ExceptionMapper for TikaException for providers list (this fixes test415)
new 46f295f TIKA-593 - trying to fix .jar build
new 7b42120 - TIKA-593: remove FIXME and uncomment @Test, per max's comments.
new 4f5c6b5 - TIKA-593 note.
new a9e1304 TIKA-700 Upgrade to POI 3.8 Final
new 5caebac TIKA-593: share/bundle plugin configuration
new b63d7e6 TIKA-593: fix java5 compatibility
new 1c777da TIKA-890 Sample APK file, along with sample EAR and WAR files (related)
new 83cee2a TIKA-890 Update the APK mimetype entry to mark it as JAR derived, and add entries for WAR and EAR (also JAR derived)
new d8e91b6 TIKA-890 Container Aware detection of JAR derived types such as WAR, EAR and APK, with tests
new f433242 TIKA-896: OSGi deployment without declarative services
new 07e105b TIKA-896: OSGi deployment without declarative services
new 0eaf0f5 TIKA-896: OSGi deployment without declarative services
new 67f1be5 Ignore Eclipse project settings and other hidden files.
new 1f5a278 tika-server/pom.xml: add svn:eol-style, remove duplicate license header
new aeeb366 TIKA-897 Detect XML files that start with the UTF-8 BOM, plus test
new d53f02f - apply patch from TIKA-901: Provide version number in tika-server contributed by Ingo Renner
new 35539ab TIKA-861 Patch from Ryan Quam to enable extracting PDF Links. (Links are extracted for now at the end of the page, further work will be needed to match them to the text they apply to)
new 0f729e6 - disable until shade plugin is fixed.
new 27fbb0a TIKA-903 Avoid breaking on Password Protected iWorks files. We can't parse them yet though, as we don't know how the encryption works
new aa901df TIKA-906 Support extracting Headers, Footers and Footnotes in iWorks Pages files. As part of this, make the parser a little more aware of where in the file it is, and start tracking some of the earlier parts of the file ready for when we hit the main text
new 7d8b3ea Magic for PKCS7 in PEM format, and DER format (probably...)
new c66a8ba TIKA-876 Slight PKCS7 der magic tweak
new c98a421 TIKA-907 Comments in iWorks Pages files
new 7ef3548 TIKA-852 Upgrade the MP4 parser to 1.0 RC1, which allows us to enable the MP4 unit test (patch from Sebastian Annies)
new de70921 TIKA-858 Patch from Craig Stires to add support for parsing IPTC ANPA News Wire Feeds
new f010744 TIKA-858 Fix Java 1.6isms
new 9543dfa TIKA-858 Fix Java 1.6isms
new 0204adc TIKA-858 Fix Java 1.6isms
new 824a2b4 Add a .gitignore file for people using the git mirrors
new 9b1fcfd Add Adobe AfterEffects mimetypes, fix up the Adobe Premier detection, and give .AEP to AfterEffects as it seems much more common now than AudioGraph
new fdec3f3 remove stale nocommit
new b90d72d Patch from Ray Gauss II from TIKA-915 - add a disabled unit and a small sample file for the geo rounding problem
new e2f1ef5 Whoops, properly disable the test for TIKA-915 this time...
new 28ae612 TIKA-913 Mime Magic for PE, PE32 and PE64 executables
new e59e664 TIKA-915 related - add mime magic for the elf format too, based on the mimetypes in the httpd magic file
new 47a66ff TIKA-917 A few sample files for Linux-ELF, and a PE32 one, plus the C file
new cd6533b TIKA-917 Start on a parser for PE and ELF executables, to output metadata
new a42f88a add iWork test case
new 4a06417 remove leftover sop
new 22e9376 TIKA-917 Pull the property definitions out to their own class, add more machine types, and define the platform
new 291e167 TIKA-917 Expand platform and architecture parsing
new 385c993 TIKA-925 - Patch from Ray Gauss to start on improving how the common metadata is stored/fetched
new 5c32a67 TIKA-917 Some more sample elf files
new c0d60f0 TIKA-926 Patch from Ray Gauss - Data Typed Metadata.set(...) Value Methods Should Call Metadata.set(Property...)
new 9e79399 Update JavaDoc following TIKA-864 change
new 49229f7 TIKA-927 - Patch from Ray Gauss to support Composite Properties (useful for backwards compatibility, and mapping between application and core properties)
new 912357d TIKA-917 Get the elf OS, if that bit of the header is set (but it often gets left as null....)
new 00a14bb TIKA-916 Correctly bail out early for .xps and .thmx files, which are an unsupported variant of PPTX, plus tests
new 5796aa6 TIKA-928 Patch from Ray Gauss (plus extra JavaDocs) - start to define the set of common consistent metadata that all parsers will try to provide, no matter what their individual file format may term things
new 08ef9c1 TIKA-928 Include the Geographic details to the common set of properties, and group slightly
new 7488dcc Add some simple JavaDoc descriptions of the property types, to help people who don't natively speak xmp! (TIKA-926 related)
new ffe9d51 TIKA-926 Patch from Ray Gauss to allow set(Property,String[]) and add(Property,String), to mirror the string key based methods but with type safety
new 362aa5e TIKA-876 Another pkcs7 magic pattern
new ce9ffb3 TIKA-929 Start to replace the old non-prefixed, largely non-property MSOffice metadata definitions with new style ones
new a8755b0 TIKA-929 Bring some of the key parts of the Office metadata into TikaCoreProperties, with composites to support the previous (now deprecated) ones in MSOffice
new d22c2ea TIKA-929 Update the ODF Parser to use the new style Office properties
new e2c5e08 TIKA-929 Use the preferred constant rather than the IPTC imported one
new 9b94789 TIKA-928 Patch from Ray Gauss to improve metadata properties setting/getting
new 34dff7b Make the composite test more explicit in what it does, fix up some deprecated warnings, and fix the typed getters for composites
new 70c85b4 Fix setter to be by property not name for add(Property)
new e269e9c TIKA-928 Fix up the DWG parser and tests to use the new style properties
new 89dc4d7 TIKA-928 Epic patch from Ray Gauss - Update parsers and unit tests to use the new style TikaCoreProperties for setting (which supports aliasing for backwards compatibility), rather than old string based ones
new a63b299 TIKA-842 Patch from Ray Gauss to split out the Photoshop and XMP Rights namespaces, and updates IPTC to use the new DublinCore properties (plus fix inconsistent indents)
new a0e2a5e TIKA-929 Bring across MSOffice.AUTHOR in the same way as initial and last authors
new 923ef5d TIKA-929 Fix up parsers to use the new style TikaCoreProperties.AUTHOR, along with fixing a few other deprecated bits in the process
new c508447 TIKA-929 Ensure backwards compatibility on the Office document statistics
new 32033de TIKA-842 Patch from Ray Gauss to tidy up a few property names
new 4a01b48 TIKA-903, TIKA-906, TIKA-907: add some CHANGES.txt entries
new edf26d4 TIKA-923: add test case
new 5eb3dc1 TIKA-910: fix text in Keynote text boxes and bullet points to not run together
new fb7f940 TIKA-904: handle iWork Pages documents created in layout mode
new 527f3d7 TIKA-924: extract table names from iWork Numbers docs
new ddf2c2a TIKA-931: Tika's PDFParser fails to parse documents embedded in a PDF Package
new 2e6b1bc PDFBOX-1320: fix NPE when visiting embedded files
new 8fd2a52 TIKA-923: extract items from Keynote master slides too
new 914cafd - fix for TIKA-935 TikaException thrown when trying to parse archive (*.ar) files contributed by Josh Mastarone
new 6e8f9ea TIKA-939 Another WMV codec to look out for, to specialise an ASF to WMV
new ffe67b3 Fix the case of the .ar files in the unit tests (TIKA-935) - case must match that stored in SVN or tests will fail on case-sensitive file systems
new e58890a TIKA-940 Sample 7zip (7z) file, based on the zip example
new 46f742f TIKA-940 Mime Magic and unit test for 7zip
new 12c4ad2 TIKA-935: TikaException thrown when trying to parse archive (*.ar) files
new 689a717 TIKA-932: Upgrade to Commons Compress 1.4.1
new b279121 TIKA-932: Upgrade to Commons Compress 1.4.1
new 1dfc65f TIKA-932: Upgrade to Commons Compress 1.4.1
new f15a7fc TIKA-941: Detecting KML / KMZ files
new e99be72 TIKA-941: Detecting KML / KMZ files
new 328b935 TIKA-929: Consistent, namespaced definitions for office file related metadata
new 14113de TIKA-943: Add parameter to tika-app to supply password for decryption
new 4c58d09 TIKA-934: Tika in server mode stops responding and reports NPE over and over in logs
new bb535a0 TIKA-876: Signed pdf parsing
new 1d1f292 TIKA-908: Adding XMP specification part one namespaces and properties
new 19a8688 TIKA-908: Adding XMP specification part one namespaces and properties
new c181ed9 TIKA-900: Tika fails to detect ISO9660 disk images
new 3725dbf TIKA-747: Ogg Vorbis and FLAC Parsers
new 78b1b5a TIKA-810: Upgrade to PDFbox 1.7.0 as available
new d22a1fd TIKA-747: Ogg Vorbis and FLAC Parsers
new 3ffc69f TIKA-847: Add regular expression support to the MagicDetector
new 833b5e1 TIKA-832: ForkParser is unfriendly to code that prints things to its output
new 411e942 TIKA-941 Sample KML and KMZ files, KML sample file from Google from the file format documentation
new 5c51b48 TIKA-941 Mark KMZ as being Zip based, so data only detection works properly
new 9e38af6 TIKA-941 KML/KMZ detection unit tests
new f502178 TIKA-788 Some DWG files have an implausible header offset. Avoid problems and just skip over them, pending a better understanding of the file format
new 5c29442 TIKA-863 Avoid creating a new AutoDetectParser (and implicit TikaConfig) for each part in a RFC822 message. Instead, check for one on the ParseContext, otherwise cache the TikaConfig for the lifetime of the message being parsed
new fad14f3 TIKA-593: Tika network server
new 89d98a2 TIKA-593: Tika network server
new 59e93d7 TIKA-593: Tika network server
new 1051da0 TIKA-773: .NET version of Tika
new b8fe97e Added rgauss as developer to tika-parent/pom.xml First commit
new e154de9 TIKA-773: .NET version of Tika
new ce112b2 TIKA-773: .NET version of Tika
new aa6c45e TIKA-756: XMP output from Tika CLI
new a63b9b8 TIKA-756: XMP output from Tika CLI
new 87f40d2 TIKA-756: XMP output from Tika CLI
new 6bc6c88 TIKA-756: XMP output from Tika CLI
new dfd8dc5 TIKA-756: XMP output from Tika CLI
new 23eb923 TIKA-947: AbstractMetadataHandler addMetadata Does not Check Property.isMultiValuePermitted - Added check for isMultiValuePermitted, if false call metadata.set instead of metadata.add
new 75585bd TIKA-773: .NET version of Tika
new 63c2c69 TIKA-756: XMP output from Tika CLI
new 352210b TIKA-773: .NET version of Tika
new 07a1590 TIKA-930: Consolidation of Some Tika Core Properties - Added the Dublin Core Terms namespace and prefix
new 196a61f TIKA-773: .NET version of Tika
new 0b9fc00 TIKA-756: XMP output from Tika CLI
new 5696d7b TIKA-756: XMP output from Tika CLI
new 50f56b6 TIKA-949 Mimetype entries for some zip-based process/mapping formats
new 843569a Test file from TIKA-948
new 1543cb0 TIKA-948 There is more than one way to embed things in OLE2, so add subtypes for both
new 218352a TIKA-948 Start to be able to correctly detect different things embedded in CompObj, such as PDF files, and also to be able to extract the contents
new 03fd5d5 Fix the extraction test for the file type, and check for one additional file
new 0ce2171 TIKA-951: Bundle activation policy for Eclipse
new 270853e TIKA-951: Bundle activation policy for Eclipse
new f28f04a TIKA-948 Look up the file extension for the mimetype detected for embedded resources, and fix unit tests for this
new 3e2a652 TIKA-948 Add mime magic for ChemDraw .cdx files, then fix the Cli extraction test so it has the correct extension
new 0a472d2 TIKA-561: Support EMLX file detection
new 6bc0537 TIKA-322: Improve encoding detection speed and accuracy
new 8756776 TIKA-322: Improve encoding detection speed and accuracy
new 929c897 TIKA-471: Avoid Charset name bottleneck when multiple threads are using HtmlParser
new 6621297 TIKA-471: Avoid Charset name bottleneck when multiple threads are using HtmlParser
new 9ead13f TIKA-471: Avoid Charset name bottleneck when multiple threads are using HtmlParser
new 12a9747 TIKA-471: Avoid Charset name bottleneck when multiple threads are using HtmlParser
new 89941a2 TIKA-471: Avoid Charset name bottleneck when multiple threads are using HtmlParser
new b8beefd TIKA-471: Avoid Charset name bottleneck when multiple threads are using HtmlParser
new ddb997a TIKA-502: Add programming language mime-types
new c6bcd32 Add an entry for TIKA-948
new 95a1cf9 TIKA-906: Added basic support for AutoPageNumbers and their formats
new 7d89a5e TIKA-431: Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly.
new ee57f95 set svn:eol-style to avoid test failures on Windows
new 09c6122 TIKA-431: Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly.
new b7ada46 TIKA-431: Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly.
new 2ecd434 TIKA-892: Tika does not use the HTML5 meta charset tag when determining charset
new dcd7050 Fix for TIKA-945 Upgrade tika-server to CXF 2.6.1
new 8d7a5b7 Prep for 1.2 RC #1
new a0444db [maven-release-plugin] prepare release 1.2
new e09b77f [maven-release-plugin] prepare for next development iteration
new 0f0b041 Cleanup of javadoc
new 5615ba6 TIKA-957 NTIF mime entry and magic
new 8630e56 TIKA-811: Upgrade metadatExtractor version for OpenJDK 7 support - Upgraded metadata-extractor to 2.6.2 - Refactored calls to metadata-extractor library methods and tags for new API - Simplified use of JpegMetadataReader to use readMetadata method - Updated TIFF parsing to utilize a temp File since metadata-extractor method accepting InputStream is now deprecated TIKA-915: Image geodata being rounded to integers - Refactored GeotagHandler to use metadata-e [...]
new 99ad680 TIKA-915: Image geodata being rounded to integers - Added decimal formatting to GeotagHandler rather than test since the metadata-extractor is adding false precision
new efdb467 TIKA-962: Backwards Compatibility for Metadata.LAST_AUTHOR is Broken - Added tests for backwards compatibility of Metadata.LAST_AUTHOR - Changed TikaCoreProperties.MODIFIER to be a composite property containing Metadata.LAST_AUTHOR
new c36e5a0 TIKA-963: Backwards Compatibility for Metadata.DATE is Incorrect - Added tests for backwards compatibility for Metadata.DATE and Metadata.CREATION_DATE - Moved Metadata.DATE to be part of the TikaCoreProperties.MODIFIED composite property - Added setting of Metadata.DATE to PRTParser
new f58064e TIKA-963: Backwards Compatibility for Metadata.DATE is Incorrect - Added a few more tests for backwards compatibility for Metadata.DATE and Metadata.CREATION_DATE
new 49e18e0 TIKA-906: Added missing licence header in AutoPageNumberUtilsTest.java
new dfcf3d4 TIKA-965: Text Detection Fails on Mostly Non-ASCII UTF-8 Files - Added looksLikeUTF8 method to TextStatistics - Added check to TextDetector.detect for looksLikeUTF8 - Added testTextNonASCIIUTF8 to AutoDetectParserTest and testTextNonASCIIUTF8.txt test resource
new c70b12f TIKA-969: TikaException Thrown When Handling Unknown Fields for Some JPEGs - Added check for null tag description
new afcde8c add 1.3 section to CHANGES
new eda4abd TIKA-970: Full identification of the JPEG 2000 family of formats
new 8e23098 TIKA-966: org.apache.tika.Tika missing from tika-bundle-1.2.jar
new 199368c TIKA-968: tika-bundle missing org.apache.commons.logging.LogFactory
new 609efd4 TIKA-956: show where embedded docs occurred when extracting processing Word (.doc) documents
new 56237e3 TIKA-869: IdentityHtmlMapper.mapSafeElement() needs to return lower-cased incoming name
new 08d52ec TIKA-889: XHTMLContentHandler wont emit newline when html element matches ENDLINE set
new 069305b TIKA-771: "Hello, World!" in UTF-8/ASCII gets detected as IBM500
new df53713 TIKA-975: LinkBuilder to optionally collapse anchor whitespace
new 0f1934c TIKA-983: HTML parser should add Open Graph meta tag data to Metadata returned by parser
new 4cdcefd TIKA-981: also extract from PDF pop-up annotations
new 0ce025c TIKA-982: handle Wordpad/RTF docs embedded in Word doc
new 4f920c7 TIKA-986: don't throw NullPointerException on detached PKCS7 signature
new bf24bc0 TIKA-918: extract chart name for charts embedded in Numbers documents
new 8d38e36 TIKA-920: handle multi-valued metadata keys
new 9e21496 TIKA-989: leave placeholder where embedded document appears in .docx files
new 6630079 remove system.out.println
new d3989e6 Add a test Opus audio file (ogg based, should eventually be supportable similar to Vorbis via TIKA-747)
new a06209b TIKA-999: extract page, word, character count metadata from RTF docs
new 22b1235 TIKA-997: leave placeholder at end of slide where embedded document appears in .pptx documents
new 9c0f4ca TIKA-999: also extract CREATION_DATE from RTF
new 8b68138 TIKA-999: fix false test failure
new a846f7d TIKA-997: also leave placeholder for embedded images
new 433f21c TIKA-1006: don't NPE if style is null
new 4e09c63 TIKA-1005: also extract text from text boxes in .docx documents
new 3a76e71 Add test CSS and JS files taken from the Tika website, and use these to add additional detection unit tests for these two formats
new a64d33c TIKA-1011: fix NPE when charset isn't recognized in .mhtml files
new 7b76a7f TIKA-984: JpegParserTest fails for some locales - Changed GEO_DECIMAL_FORMAT to simple String GEO_DECIMAL_FORMAT_STRING - Changed GeotagHandler to create a new, Locale-specific DecimalFormat object using the GEO_DECIMAL_FORMAT_STRING
new 7d61675 TIKA-984: JpegParserTest fails for some locales - Changed from DecimalFormatSymbols.getInstance to constructor for Java 5 support
new 60e5d2c TIKA-775: Embed Capabilities - Added an Embedder interface, similar to Parser, which defines getSupportedEmbedTypes and an embed method - Added a base ExternalEmbedder implementation of the Embedder interface, similar to ExternalParser, which can call a command line executable, the default being sed, to perform embedding - Added a base ExternalEmbedderTest which 'embeds' lines in a text file then uses a TXTParser to verify the expected embedded metadata exists
new acdb2b0 TIKA-1015: include rel id in Metadata when parsing embedded documents inside Word (.doc)
new 6aba74b TIKA-799: ForkParser does not populate metadata object after completing a parse
new 71629f0 TIKA-1009: Expose TextDocument in BoilerpipeContentHandler
new a2ddd8c TIKA-1019: also leave placeholder for links inside .doc
new 1d637c8 TIKA-1019: revert for now: the test file is too large
new 355eba9 TIKA-1022: DWG Custom properties not extracted - Added testDWG2010_custom_props.dwg - Added CUSTOM_PROPERTIES_ALT_PADDING_VALUES constant for values found in test file - Added check for alternate padding values in skipToCustomProperties - Added testDWG2010CustomPropertiesParser unit test
new 363ee3b TIKA-1019: also leave placeholder for links inside .doc
new a3f156b TIKA-1024: don't returned naked BOM for MP3 ID3 tag values
new 963100e TIKA-1025: leave placeholder where embedded docs appear in .ppt extraction
new 60ca6ad TIKA-1026: ServiceLoader should respect OSGi service ranking
new 02e550a TIKA-1027: Allow null values when setting metadata
new 2e9d1ef TIKA-775: Embed Capabilities
new d84781a remove stale comment
new e10c23b TIKA-1031: create parent dirs when extracting embedded files
new 9ec7153 TIKA-1032: dedup relID by slideN_ for embedded files in .pptx
new 51a7c9c TIKA-712: extract master text, except for title/body
new 771e368 TIKA-1035: extract text from PDF bookmarks
new 12f4084 TIKA-1036: leave placeholders when we extract embedded archive members
new 1781506 TIKA-1031: TikaCLI doesn't create sub-dirs when extracting Zip files
new 9db581a create temp files under tika-app/target for this test
new ad9512d TIKA-1036: also set EMBEDDED_RELATIONSHIP_ID in the Metadata when extracting the embedded document
new 01c02e6 TIKA-1035: move bookmarks before </body>, use <ul>,<li>
new 95b8975 TIKA-1042: Lotus Notes .eml Files Not Always Detected Properly - Added testLotusEml.eml which demonstrates the problem (with some info redacted) - Added testDetectLotusNotesEml method to TestContainerAwareDetector - Added new match to the message/rfc822 mime-type which looks for X-Notes-Item and Message-ID
new 7260050 TIKA-1041: Tika 1.2 universalcharset errors
new 978cc18 Added .xmp extension to application/rdf+xml mime-type for better detection and parsing - This mime type is indicated in the XMP spec part 3, page 7: http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/cs6/XMPSpecificationPart3.pdf
new 3e7e7c9 TIKA-990: Mp3Parser extracts wrong number of channels - Changed AudioFrame to grab the correct bits (7,6) - Updated Mp3ParserTest for the correct channels in the files
new ab9f799 TIKA-1044 Fix issue for Word extractors on text that lacks any styling, plus tests based on files from Jonas Wilhelmsson
new 2bf25c2 TIKA-775: Embed Capabilities - Removed logging in ExternalEmbedderTest - Minor formatting changes in ExternalEmbedder for better readability
new 2052fc9 TIKA-976 Excel95 files should be correctly detected, but as POI HSSF does not support them they should not generate exceptions if you try to parse one
new 39a63ca TIKA-725: Empty title element makes Tika-generated HTML documents not open in Chromium - Added an assert to TikaCLITest which verifies the issue - Added ExpandedTitleContentHandler - Changed TikaCLI to use ExpandedTitleContentHandler for html output
new 1e78515 TIKA-725: Empty title element makes Tika-generated HTML documents not open in Chromium - Added license header
new 7e3cc3a TIKA-1049: Upgrade PDFBox to 1.7.1
new 71166de TIKA-1048: add space after each extracted XML element
new b928475 TIKA-1013: Added ability to check if a mime-type is already registered from Ryan McKinley
new fb41a40 Tika-1055 patch from Bernhard Berger to add mimetypes for a number of programming languages
new ed5591c Remove three duplicated mimetype entries (keeping the one with more information in the definition each time), from Karel Zacek in TIKA-1052, and add a change entry for TIKA-1055
new 7171855 Patch from Emmanuel Hugonnet from TIKA-1021 - PSD data lengths are even padded
new e4e1f7d Add a unit test for HDF4 files, which shows that TIKA-958 was already fixed
new c8dea65 TIKA-1056: unify ImageMetadataExtractor interface - Made parseTiff public
new d1291b6 message/rfc822 pattern from Marco Quaranta from TIKA-1058
new 908791f Update CHANGES.txt for 1.3 release date
new daf615e [maven-release-plugin] prepare release tika-1.3
new 320617c Revert failed release:prepare command
new 7a0048c [maven-release-plugin] prepare release tika-1.3
new 527e4de [maven-release-plugin] prepare for next development iteration
new 03f7625 TIKA-1060: Degrade gracefully when juniversalchardet not present
new 3df2f59 TIKA-1062: parse lists from RTF documents
new f5bcf3b TIKA-852: Quicktime / MP4 Metadata Parser
new 562dd39 Apply patch from Raimund Merkert and Chris Mattmann for TIKA-1047: Provide a JAX-RS to detect only mediatype.
new 1b7871d Apply patch from Raimund Merkert and Chris Mattmann for TIKA-1047: Provide a JAX-RS to detect only mediatype.
new 3e18897 Apply patch from Raimund Merkert and Chris Mattmann for TIKA-1047: Provide a JAX-RS to detect only mediatype.
new 1489edc TIKA-1065 Mimetype entries for SAS file types
new 86a6536 TIKA-1065 SAS subtype and mime magic
new 452b6e6 TIKA-1076 Upgrade to Apache POI 3.9. Commit disables some HSLF related unit test checks, they need re-enabling along with a fix soon
new 5016a83 Support tika:link and tika:uti mimetype extensions, along with unit tests. Modified version of the patch from TIKA-1012
new d45386e FileMaker Pro mime entry from Marco Quaranta from TIKA-1061
new 00522e3 TIKA-991 Enable the DURATION property
new 68e5671 Apply patch from Oliver Heger from TIKA-991 - Re-work MP3 parser to capture audio duration by processing more of the audio frames
new d25f52d Add missing license header
new 956e640 TIKA-1053: upgrade to ASM 4.1
new a6d2100 comment out @Overrides
new 36f5051 fix for TIKA-1081 Error in specification of glob pattern for awk files identified by Giuseppe Totaro
new 88a0702 change Tika to accept Java 1.6 source and write Java 1.6 bytecode
new 46d0baf TIKA-1084 Merge image/x-icon (old) with the newer standard image/vnd.microsoft.icon
new 5456ae2 Patch from Ryan McKinley from TIKA-1083 - Add Link and UTI information for a number of common mimetypes
new 5e50085 ChangeLog entry for TIKA-1012 and TIKA-1083
new a6f9d9d TIKA-1087 PICT mime magic and unit test
new d23cd38 TIKA-1074: log certain exceptions and continue
new 6d8ac09 TIKA-1074: catch Exception not Throwable, and restore interrupt bit for InterruptedExc
new 2b8fdb1 TIKA-1074: remove future proofing for InterruptedException
new 5a82c4f patch for TIKA-1090 contributed by Lewis John McGibbney.
new ef226f4 Patch for TIKA-1096 CompressorParser: Add support for handling concatenated InputStreams contributed by Gregory Canan.
new 6e9e7cc Patch for TIKA-1096 CompressorParser: Add support for handling concatenated InputStreams contributed by Gregory Canan.
new 6a41a3a Word2 and Word5 mimetype magic, from investigations into TIKA-1092
new 0e4ccf6 Mimetype entries with magic for the arj and uc2 archive formats TIKA-1099
new 00ab3e0 TIKA-1104 - Upgraded PDFBox to 1.8.1
new 5f62566 Patch from Ryan McKinley from TIKA-1014 - Allow custom MimeTypesReader (with tests)
new 5ffe387 TIKA-1115: ExifHandler throws NullPointerException - Added check for null datetime - Added file exhibiting problem datetime field - Added unit test
new aa18a79 Patch by Markus Jelsma for TIKA-992 to allow OpenGraph meta tags to have multiple values.
new c4581fb New code to help for TIKA-1118, currently disabled pending a POI upgrade
new ade21f4 TIKA-1123 - Added mimetypes for additional programming languages
new f50544d TIKA-1126 - Patch by Ali Mosavian to allow Tika Server to produce text/html output
new 9d81bdd Patch for TIKA-1127 provided by Ali Mosavian.
new 390d8c3 TIKA-1102: detect fragment that starts with <div> or <DIV> as HTML.
new 14aa5bf TIKA-1128: replace line tabulation with line break
new c1dfba2 TIKA-1133: Ability to Allow Empty and Duplicate Tika Values for XML Elements - Added constructors in ElementMetadataHandler to specify allowing duplicates and empty values - Added a unit test and test data which confirms the default and override behaviors
new 7b30eb8 Fixed previous whitespace issues in separate commit for better readability of diffs.
new ec32a29 TIKA-1135: Incorrect Cardinality and Case in IPTC Metadata Definition - Fixes to cardinality - Fixes to key name case to match specification
new 281bcf5 TIKA-1130: .docx text extract leaves out some portions of text - Added test file - Added disabled unit test
new d546aa1 Fix for TIKA-1129: Test HTML file has poorly chosen GPL text in it
new ca9b4b7 Fix for TIKA-1129: Test HTML file has poorly chosen GPL text in it
new 7c2ae59 Prep for 1.4 RC #1.
new 7ca70eb [maven-release-plugin] prepare release 1.4
new 72c7c50 [maven-release-plugin] prepare for next development iteration
new 8b82be8 TIKA-1128: normalize newlines before assert
new 21a6eba Updated patch for TIKA-991 contributed by Oliver Heger
new 0df62cc Test file from Paul Brinich from TIKA-1136
new 76cfbc1 Mimetype, Zip container detector and unit test for the Apple IPA format. Original logic from Paul Brinich from TIKA-1136
new 32431aa The Office Parser has a default password it can use, so if the PasswordProvider can't provide one (i.e. if it returns null for the password), keep going with the default rather than passing a null password through to POI (which doesn't like that)
new 2db3812 Patch from Dietmar Glachs from TIKA-1070 - avoid stackoverflow in ToXMLContentHandler by resetting the parent state after the end of an element
new 5dd29c4 Patch from Daniel Bonniot from TIKA-1109 - Fetch OOXML metadata earlier, to tidy code and make it available if required during parsing
new 8f492eb TIKA-1130 Upgrade to POI 3.10 beta 1
new 970bafe Patch from Tim Allison from TIKA-1130 - Extract from .docx SDT runs as well
new 066f587 Helper class for unit testing TIKA-1145 - Test classloader that logs the resources it loads
new c7d7f66 Fix TIKA-1145 - If a specific ClassLoader was given to TikaConfig, have that used for loading the mimetypes too
new a852c4f TIKA-1147: File-Based TikaInputStreams are Deleted by ExternalEmbedder.embed - Restructured tests to be able to accept different input streams - Added test for passing in a TikaInputStream - Changed ExternalEmbedder to close the input stream rather than delete its file
new 22fc879 TIKA-1146 Support for case-insensitive string matching on magic patterns (for ASCII text only - works at a byte level). Also adds more magic detection tests covering several of the string formats
new c743609 Patch from Kai-Uwe Schmidt from TIKA-1146 - Handle rfc822 message detection with unusual (but standards ok) cases of the header strings, with test
new 18a89e9 TIKA-1156 AMR glob, subtypes, magic and unit test
new 042ec60 TIKA-1156 AMR-WB mime magic and unit test
new 9ec2aef Tika 1139 update to 1129
new ba80052 TIKA-1124, process attachments within an embedded PDF
new 7f080de Tika 1124 not 1142...sorry
new 34f7c18 TIKA-1159 Mime Magic for Adobe InDesign from Kabron Kline, plus sample file and unit test
new 62d5ad3 TIKA 1001 more flexible html meta-header encoding detector
new 51531a1 TIKA-1153 upgrade PDFBox to 1.8.2
new 2d61e41 SolidWorks mimetype entry from gunter rombauts from TIKA-1160
new a7e0353 TIKA-961: No whitespace added if BoilerpipeContentHandler.setIncludeMarkup(true)
new 1e8e3c8 TIKA-1166: FLVParser NullPointerException - Added check for null entry value
new 26b2e60 Test OSX jnilib file from Apache ActiveMQ 5.8.0 (TIKA-1169)
new 5ce1b58 Mimetype for jnilib files, which share some magic with Java classes but are actually native OSX code, plus test (TIKA-1169)
new 4c6313f TIKA-1170: Insufficiently specific magic for binary image/cgm files - Applied patch from Andrew Jackson which… - Added additional matches to image/cgm magic - Added example cgm file - Added test of image/cgm to MimeDetectionTest
new b2fcf77 TIKA-1170: Insufficiently specific magic for binary image/cgm files - Fix for incorrect application of patch - Additional test and resource from Andrew Jackson for false positive cgm matches on malformed HTML files
new 009a45a bumped poi to 3.10-beta2
new 2cf36cc tika-1100 textboxes in xlsx; modified XSSFExcelExtractorDecorator and added test in OOXMLParserTest
new e9c0166 TIKA-792 fixed by POI-3.10-beta2; added test for missing ooxml bean
new ed60a56 commented out TIKA-792 test for now
new 145bf2f second attempt to add test for detecting missing ooxml bean. Builds successfully locally. Jenkins failed last time. Stack traces didn't point to this test; but redirecting stderr may be the culprit.
new 146dcda TIKA-1076 extract text from tables in ppt.
new 2ac245f TIKA-817 -- autodates in ppt and pptx. Already fixed by TIKA-805. Added files and tests to confirm behavior specified in POI-52367 and POI-52368
new 20ab62d TIKA-1171 -- extra asterisks from master slide in PPT; added tests to TIKA-712 test files to show 1171 was fixed. Borrowed extraction code from POI PowerPointExtractor
new 29e9b15 updated CHANGES.txt to cover recent activity
new aaa547f added 1130 to CHANGES.txt
new 5e320cd TIKA-1177: Add Matroska (mkv, mka) format detection - Added Matroska video and audio mime-types and extensions - Added WebM video mime-type - Added mkv and webm test files (converted from existing testFLV.flv) - Added name detection unit tests
new d80fcf1 TIKA-1177 Add a common parent type for the Matroska container, then add data+name tests for WebM and Matroska Video which use them. (For full detection, we need TIKA-1180)
new 38b1509 When the Tika App is in Server Mode, wrap the raw Socket InputStream as a TikaInputStream so that detectors can use mark/reset on it (TIKA-1183)
new c183694 Include a workaround for PDFBOX-1749 by trying to use AWT (if we have a TikaInputStream) to check the TTF is valid. Should mostly solve TIKA-1182 until the upstream fix is done
new ac538c9 TIKA-1188 mpx mimetype
new 23a7f5c TIKA-1188 Other parsers may be working on OLE2 files, and hence may want to be able to call the common OLE2 metadata extraction code, so make it public not package
new dfee810 Pull the Date -> ISO8601 logic out of Metadata to a common utils class, so that other bits of Tika (eg TIKA-1188) can use it
new 9227e9f TIKA-817: (PPT/PPTX) Missing date/time in text content
new ceccd0b TIKA-1192: RTF: fix AIOOBE in handling list override
new 80f4692 TIKA-1200 upgrade pdfbox to 1.8.3
new 0332d28 TIKA-1201 enable parameter for NonSequentialPDFParser
new 6f7177f [TIKA-1196] Adding a host option currently set to localhost by default, thanks to Rian Stockbower for driving the discussion and patches, also doing few minor updates to minimise a number of warnings
new 75c8f1f [TIKA-1197] Switching to CXF 2.7.8
new f708b84 [TIKA-1198] Support for multipart payloads
new 8a04063 TIKA-1202 added PDFParserConfig and refactored PDFParserTest and TikaTest to reduce boilerplate
new 4a59ade TIKA-1202 -- small bug in using default or context config; added in-memory option for nonsequential parser; added more constraints to tests
new dcbe0d2 TIKA-973 added basic extraction of pdf AcroForm content. Many thanks to Ben Litchfield for org.apache.pdfbox.examples.fdf.PrintFields, on which this patch relies.
new ffc3b3c TIKA-973 reopened. Would prefer test docs unequivocally consistent with Apache License 2.0. Deleted initial test docs from trunk and commented out test case. Also added extractAcroFormContent to parameter file (should have been done in initial check in).
new 2a36fcd TIKA-1209: Upgrade Tika tests to JUnit 4.X
new 30523c0 TIKA-1211: OpenDocument (ODF) parser produces multiple startDocument() events
new 9385906 TIKA-1110: Incorrect declared SUPPORTED_TYPES in ChmParser
new 0eb87d7 TIKA-1110: Incorrect declared SUPPORTED_TYPES in ChmParser
new 8d794b5 TIKA-1110: Incorrect declared SUPPORTED_TYPES in ChmParser
new 3dec312 TIKA-1152: Process loops infinitely on parsing of a CHM file
new 84cc8dc TIKA-1210: Address tika-parsers o.a.t.mime.TestMimeTypes TODO: Need a test flash file
new 2608c79 TIKA-672: Proper error handling in the CHM parser
new bc8f90a TIKA-672: Proper error handling in the CHM parser
new 2942530 TIKA-672: Proper error handling in the CHM parser
new f1c5c69 TIKA-1193: Allow access to HtmlParser's HtmlSchema
new f570a3b TIKA-1160: Add support for SolidWorks files
new ad853f4 TIKA-1086: Added import for org.w3c.dom in tika-bundle
new 8fd76a4 TIKA-820: Added setDocumentLocator delegate call in TextContentHandler
new c44e747 TIKA-1217: Integrate with Java-7 FileTypeDetector API
new 94ff74c TIKA-1078: TikaCLI escapes invalid filename characters as hex codes
new 57583e8 Adding a server profile to Tika Server
new 1d07545 misc: remove version of junit on dependency
new e085bf2 [TIKA-1198] Updating JAX-RS server to accept multipart/form-data payloads at a dedicated path
new 7d3194a TIKA-1226: PDF TextStripper fails when it encounters PDSignature Field.
new 33280e0 TIKA-1226, removed println...doh.
new da992b2 Updating CHANGES.txt
new 78a7ceb TIKA-1224 - Adding a generic SourceCode parser for Java, Groovy, C++ with HTML render
new 1ca0efd TIKA-1228: Look for attachments under Kids node if embeddedFiles.getNames() returns null
new 77d57d5 TIKA-1230: update PDFBox to v1.8.4 and updated CHANGES.txt
new 17ef0de Add the text file from TIKA-1229, and stub out a unit test for it
new 25ffe53 Updated KEYS with revocation details
new 2aaec12 TIKA-1229 - Parse Word Headers and Footers as proper ranges, not simple text strings, and thus be able to handle hyperlinks etc in them
new 0605d6c Updated KEYS with new code signing key
new eac3711 Updated CHANGES.txt for 1.5 release
new f607a4d [maven-release-plugin] prepare release 1.5-rc1
new 6f84307 [maven-release-plugin] prepare for next development iteration
new 8282ef2 prepare for next development iteration
new e327976 prepare for next development iteration
new 0e3c51e temporary fix to TIKA-1233. Added extra catch clause to catch PDFBOX-1803 related StringIndexOutOfBoundsException. When PDFBOX-1803 is fixed, we should be able to remove these catches
new c8ab7c4 TIKA-1237 upgrade to poi-3.10-FINAL
new 2ae3e90 TIKA-1223 - Extract thumbnail of OOXML Office as attachment
new 9f869d2 got rid of brittle requirement for specific number of pdfs to be tested in PDFParserTest
new 86d1aed Test 7zip file, based on the other test archive files, for TIKA-1243
new 8316bb2 TIKA-1243 - Upgrade to Commons Compress 1.7, and add a disabled unit test for 7z support. 7z support is not enabled yet, pending a commons compress fix
new aae0f4e Remove debug line accidentally committed
new ffb52a3 Create a spanned zip test file for TIKA-1241
new d188a62 TIKA-1241 Mime Magic for empty and spanned zip files, plus spanned zip file detection unit test
new eed505e Patch from TIKA-1225 from Marco Quaranta - MDI mime magic
new a360ec5 TIKA-1248: handle empty/null declaredEncoding with call to CharsetDetector.getReader
new b6f8f9e TIKA-1249 vcard mime magic from Marco Quaranta
new 91ec256 Switch to use FileZip instead of ZipArchiveInputStream on UnpackerResourceTest to fix build.
new 7b2c4e8 Add more test of OfficeParser
new 98c6957 TIKA-623 - Integrate PST Parser with java-libpst-0.7
new beff394 TIKA-1257 - Filter control characters in output
new 416b57e TIKA-1257 - Add missing test doc
new 099c53c TIKA-1232: add fine-grained pdf version extraction
new 3cfe5ad TIKA-1252: handle multiple authors in PDF xmp metadata
new cb6c9b8 TIKA-1252 small clean up
new bb4b8f1 clean up whitespace in PDFParser components
new a504a1e cleanup whitespace in OutlookPSTParser
new bbf9de4 Add a mimetype entry for Ogg Opus audio files
new 38974e6 TIKA-1023 Upgrade the Ogg parser to 0.3, which fixes a maven issue, and adds Ogg Opus support
new e503b24 TIKA-1243 Upgrade to Commons Compress 1.8
new 5e9a8a0 TIKA-1243 Fix the test 7z file to match the order of the tar one, to simplify testing
new a8e46e9 TIKA-1243 Add 7z support now that we have upgraded to Commons Compress 1.8, but it is a little nasty until COMPRESS-269 is resolved
new 310b0bb TIKA-241 Test rar archive, based on the same contents as our other test-document archive files
new 52ca7b3 TIKA-1259 Dedicated mimetype entry for Ogg Speex
new f81efb5 TIKA-1259 Some more Ogg based mime entries, for FLAC-in-Ogg, Ogg Theora and OGM video
new ccc0770 TIKA-1259 Add Ogg mimetypes for the various uncompressed formats which can be stored in Ogg
new 41c6749 TIKA-1259 More Ogg mimetypes and fix-ups - Add dirac, tweak the default theora type, disable the FLAC-in-Ogg test pending an update, and add a common parent to all Ogg Audio types
new 348c92f TIKA-1259 Complete adding the well known Ogg related mimetypes
new d145ee6 Upgrade the Ogg plugin to 0.4, re-enable the FLAC test, and add an Opus one. Solves TIKA-1259 and TIKA-1113
new 2c34c60 Changes update for recent Ogg updates
new ef61057 TIKA-1263 An additional Atom feed xml namespace, and Atom parsing+detection unit tests
new 5b743bd TIKA-1151: Maven Build Should Automatically Produce test-jar Artifacts - Added maven-jar-plugin to relevant poms
new cc3c35e Add my signature to David's GPG key, as used in the 1.5 release
new 39e87f3 TIKA-1264 Updated Outlook PST mimetype from Luis Filipe Nassif
new 556ee2a TIKA-1196 Options keys must be unique, so leave host with -h and push help to -?
new fe90fb4 TIKA-1244 - Extract mails as attached elements. Integrate code of Luis Filipe Nassif
new fb1d4b6 Restore the test that failed because of compress-1.7
new 8a7965a TIKA-623 - Extract each mail as attachment
new 2b7cd2e TIKA-623 - Update CHANGES.txt
new 626625c TIKA-1268: Extract images from PDF documents
new 398f50f TIKA-1271: trivial refactoring of classes useful for testing embedded document handling
new c4efd41 TIKA-1270 Prep for new endpoints by having the existing ones use a common TikaConfig object
new 0d2fe29 TIKA-1270 Prep for new endpoints - refactor server unit tests to reduce duplication
new 3d60db6 TIKA-1270 Prep for new endpoints - fix eclipse warnings
new 026aad2 If a mimetype is handled by a composite parser, report the underlying parser against the type (TIKA-1270)
new 702b819 TIKA-1270 Start on support for reporting the mimetypes that are known, still partly WIP
new 3b1050e The custom assert "assertContains" can be static, to allow use from elsewhere in the codebase
new 55310ec TIKA-1010 extract embedded documents from RTF
new b17a97e TIKA-1270 Move to a common set of logic to decide what to display, so the output type bit just deals with formatting it only, and add a browser friendly html view too
new 2ddb759 TIKA-936: encoding of ZipArchiveInputStream
new 3468cb2 TIKA-1277: Magic bytes from Wikipedia
new a570bfb [TIKA-1279] Missing return lines at output of SourceCodeParser
new 5d29d84 [TIKA-1276] Add missing embedded dependencies in tika-bundle from patch of [rwesten]
new e0db416 TIKA-1278: Expose PDF Avg Char and Spacing Tolerance Config Params - Added averageCharTolerance and spacingTolerance fields to PDFParserConfig - Moved configuration of PDF2XHTML params from PDF2XHTML.process to new PDFParserConfig.configure method
new 3a9afb5 TIKA-1279 - Use System.getProperty() compatible with Java6 + tests
new 0a32931 TIKA-1279 trivial fix caps in testJAVA.java in test cases so that tests pass in *nix
new b4c8568 TIKA-1280 GZip now has an official mimetype
new 01726d3 TIKA-1281 Add an alias application/x-xml for the XML mimetype (canonical application/xml)
new f23511c TIKA-1270 Provide a Tika Server endpoint to report on Detectors, modelled on the Tika App --list-detectors method
new 5954a1d TIKA-1270 Unit test for the Detectors server endpoint
new 61617ab TIKA-1270 Provide a slightly better way to handle the user-facing HTML output, we may well want to replace it with something either better or more JAXRS like shortly!
new 1e2af4e TIKA-1269 Stub out a human readable welcome page for the Tika Server
new 306cf13 TIKA-1269 Have the human-facing welcome page tell you roughly what the different endpoints are
new dd20615 Add a note for why TIKA-1269 is not yet working properly
new 1428d63 TIKA-1290 - Upgrade to PDFBOX 1.8.5
new 7d524cb TIKA-1269 Sort welcome output, and add test coverage of it
new fafcf75 Patch from Annie Bryant Burgess contributed to address TIKA-1265.
new 90f5d76 Changes record for TIKA-1265.
new f1aee6f TIKA-1270 WIP parser details endpoint, similar to --list-parsers and --list-parser-details from the Tika CLI
new 14991ad TIKA-1270 Complete parser details endpoint, and tests
new c9530d0 Changelog update
new 5c3070c TIKA-1175 Magic for MS-Money files from Boris Naguet
new e652afa Some more KML namespace definitions, from Marco Quaranta from TIKA-941
new 54afc7a Mimetype for the OPC based DWFX format, and detector support for it. TIKA-1204
new 359aae5 TIKA-1221 Add Zip Container Detector support for the OPC-based XPS format, along with more mimetype details of it and a unit test
new d02d033 TIKA-1259 One more Ogg mimetype
new be6e237 Bump Vorbis Java version to 0.6, to solve TIKA-1112
new 969804d TIKA-1282 Some more common gzip aliases
new 6b7db99 TIKA-1233: removed catch blocks after upgrade to PDFBOX-1.8.5; see PDFBOX-1803
new f033bcf TIKA-1231: added more null checks after underlying fix was made in PDFBox-1.8.5
new 821e197 Ignore a test until TIKA-1298 is fixed
new c2da98e temporary bug fix until TIKA-1295 is resolved
new 3d146b4 test doc actually added for r1594957 temporary bug fix until TIKA-1295 is resolved
new 2f4fbc6 More OSX Mach-O file magic, for TIKA-1169, from Matthias Kruegger. This closes #8 github pull request
new ef71795 TIKA-1292 Sample (Apache v2 Licensed) Jar with HTML in it
new ee8bc66 Allow the DefaultDetector unit test to also check how MimeTypes would detect this, and add a commented-out check that uses that for TIKA-1292 (currently failing)
new aacc74d [TIKA-1272] Removing redundant Tika server properties file, patch from Lewis John McGibbney applied
new 2d51ad2 Add some notes on entries, to help people maintaining the file know what to do, related to TIKA-1292
new 9eaa4fc Container formats with specific, low-false-positive magic matches need a slightly higher priority, so that they don't accidentally end up being matched based on the contents of the container near the start of the file. Partly solves TIKA-1292. This closes #6 github pull request
new 8e70572 Add a disabled unit test for TIKA-1292, which when working will ensure that if we have two matching magics at the same priority, the name is used to specialise if possible, first defined if not
new 3902226 Set an explicit priority on the OLE2 match, remove two MS Word matches which were OLE2 ones in disguise, and add an intermediate staroffice parent on the staroffice types. Helps with TIKA-1292 testing
new f689fa7 TIKA-1292 If there is more than one mime magic which matches at the highest priority, keep track and then try to pick based on filename or type hint later
new 8c0119b add license header to RTFObjDataParser and clean up whitespace in RTFEmbObjHandler
new 0e7ea0a TIKA-1294 add ability to turn off image extraction from PDFs
new 0b57b5c TIKA-1291/TIKA-1310 fix bug in JSON output from CLI
new 08a4029 fix to TIKA-1294, uppercase enum
new 051bf86 TIKA-1312 and TIKA-1313 - FDF and XSL-FO mimetypes from Marco Quaranta
new d0ced96 TIKA-1305: make RTF list handling slightly more robust against corrupt list metadata
new 6d7a36b TIKA-1305: test file added to svn...argh.
new 65748ed Patch from Lewis from TIKA-1258 - Bump the NetCDF dependency from 4.2 to 4.2.20
new 4fa1d2a TIKA-1311 centralize serialization
new 8f27da6 contribution for TIKA-1319: Translation module contributed by Tyler Palsulich.
new 01c9cb5 TIKA-1324 As discussed on the mailing lists, use a common url prefix for the unpacker resources
new 55d95c5 TIKA-1269 Some endpoints may lack a produces annotation
new d14fa52 fix for TIKA-1316 identified by Tyler Palsulich.
new 375fde6 Patch from Matthias Krueger from TIKA-1322 - XMLParser opens a p tag at the start, so always close it (not just on valid files), to avoid triggering the SecureContentHandler depth check on multiple xml errors. This closes #9 from github
new 841c587 Remove the temporary PDFBox workaround for TIKA-1182, now that we have upgraded to a version with the fix
new c1a9be0 TIKA-1325 Temp fix - pull out some of the font metadata keys to string constants, and rename the test
new d70cbe0 TIKA-1326 MSI files are, rather improbably, based on OLE2 documents not Windows PE files. Patch from Luis Filipe Nassif plus test updates
new 7cd4050 TIKA-1325 Have the TTF parser pull out a little bit more, and have it do so in a similar way to the AFM one, plus add some TTF tests
new 68f513f fix potential null pointer exception in PDFParser; found while working on TIKA-1302
new 15bc1b3 Provide explicit DateUtil support for formatting Dates in an unknown timezone, matching what TestMetadata checks detail, and allow for setting a Metadata date value from a Calendar. Finally, use this for the TTF dates, to hopefully solve the TIKA-1325 test problem
new 02b5424 To support OSGi testing, allow for a test class to find out what class names ServiceLoader.loadStaticServiceProviders will try, helps TIKA-1276
new b97ef32 TIKA-1303 skip bogus second title tag
new 02afb8b Start trying to get the Tika OSGi tests to run, by splitting them out by area, explicitly using an OSGi test runner, and beginning to check that in-bundle and non-bundle have the same parsers and detectors. Many tests disabled though as broken... TIKA-1276
new 1e443c4 Patch from Michal Hlavac from TIKA-1258 - Correct Tika Bundle OSGi build list for newer NetCDF
new d6307b7 Add a test that ensures that the Tika Bundles are found + started, and clarify a bit why the detectors test should work but isn't
new d943c10 Comment out the date related tests for TIKA-1325, to avoid problems with timezone matching in dates while we await a PDFBox fix
new 4a1cbdd Another dependency for netcdf for TIKA-1258
new 468698d TIKA-1325: small workaround until we can integrate PDFBOX-2122. Default timezone is now set and then unset for ttf test in FontParsers test.
new 1646005 Fix for TIKA-1327: New parser for Matlab .mat files contributed by Annie Burgess.
new c9ad965 partially revert r1601805: didn't need to change the slf4j dep. extraneous commit.
new ccaa9a3 Add in the Java code to go along with r1601805: TIKA-1327: Matlab parser from Annie Burgess.
new c18b83a Partial revert of r1601805 - reset NetCDF dep to 4.2.20
new df7b18e Add a test CSV file, based on the Excel one, and a unit test shows it gets detected correctly. Also mark CSV as explicitly being a child of text/plain, rather than the previous implicit definition, to avoid confusion. TIKA-1335
new 583af26 - fix for TIKA-1336 This closes #10
new af6788a Update docs for TIKA-1335 TIKA-1336.
new 4357682 fix for TIKA-1337: LanguageProfile for Persian/Farsi contributed by Omid Pourhadi This closes #11.
new a3c1c9e fix for TIKA-1338 Converted README to Markdown contributed by Kyle Maxwell (krmaxwell@gmail.com) This closes #1.
new 43dde08 Fix for TIKA-1339: Upgrade rome dependency to 1.0 contributed by Pradeep Singh (Github user: pksinghus) This closes #2.
new 8477468 convert README to markdown TIKA-1338.
new 6a306fb TIKA-1341: fix double endDocument in PDFParser
new 4a7cdda - ignores
new 7fa9772 - missing license headers
new 3a50244 TIKA-1352 upgrade to PDFBox 1.8.6
new 23adca3 TIKA-1352 update CHANGES.txt
new a9a5c03 TIKA-1353 If a File is available, parse ODF documents with it, so that the metadata can always be processed first
new 8c0667b Bump the java-libpst version to 0.8.1 for TIKA-1350. This closes #12 from github
new 5f27b97 - apply patch for TIKA-1274 ENVI Header parser contributed by Ann Burgess
new 8dcf13f - this should be in the translate package
new 22feb27 - getters and setters.
new 631711b - remove extraneous print
new 59c7e23 - use path prefix to load the properties file.
new 4a75eef - fix for TIKA-1362: Add GoogleTranslate implementation of Translation API
new ba74309 [TIKA-1351] Updating AutoDetect, Composite and PDF parsers to guard against null content handlers
new 9d189e6 Patch from Tyler Palsulich from TIKA-1327 - More enhancements to the Matlab parser
new 3aa7738 updated patch for TIKA-1363 from Annie Burgess: enables Mat parser in META-INF and fixes unit test to use AutoDetectParser to validate it.
new d3d2629 Fix for JIRA issue TIKA-411, generate list of supported types automatically.
new 9860b74 Fix for JIRA issue TIKA-1105, pass along ParseContext in CompositeParser.
new 070b30c Add check for possible NPE in CompositeParser.
new 15521f1 Fix for JIRA issue TIKA-1370, adding a CachedTranslator.
new efd7147 Fix for TIKA-1357: Use BufferedReader to parse ENVI files (from Ann Burgess).
new adeab55 Fix whitespace in EnviHeaderParser.
new 8e0c444 Fix for TIKA-1251: RuntimeException with certain word docs (contributed by Vadim Roizman).
new 904223a Fix TIKA-411 entry in CHANGES.txt. Was in 1.5 section.
new 7ce1b27 Remove extraneous print statement from MatParserTest.
new f086cdf Remove redundant array initialization.
new b9e5ee1 Remove more redundant array initialization.
new 424c0eb Remove several redundant type casts.
new 50177e6 Remove caught and immediately rethrown IOException.
new 628842d Remove various unused imports.
new 79211cc Use foreach loops instead of for/while, when possible.
new 51377f1 Use String.contains instead of String.indexOf > -1
new 3b74476 Remove unnecessary boxing and unboxing.
new d4fc8dd Remove extraneous print statement from EnviHeaderParser.
new f1d00db TIKA-1373 - Send html content to SAX events by using TagSoup
new 1c7f1bc Fix potential NPE and fix javadoc refs for PDFParser
new 4fefb49 Patch from Matthias Krueger from TIKA-1361 - Upgrade MP4Parser to 1.0.2, add a custom Data Source and use that for explicit temp handling. This closes #14 from Github
new cc4898b Update imports following TIKA-1361 changes, to match our current preference for explicit (not wildcard) imports
new 25867f6 I don't know what the android.util package is, nor why we would/wouldn't want it, but without marking it as optional the bundle tests cry...
new aeebd70 TIKA-1375: decrease memory consumption when extracting images in PDFs
new e00b441 TIKA-1376: improve embedded file name extraction in PDFParser
new 5c0089a TIKA-1374: Try to extract OS-specific embedded files within PDFs
new 081afb0 Update change log for 1.6 rc1
new 301460e [maven-release-plugin] prepare release 1.6
new 3cbfb59 [maven-release-plugin] prepare for next development iteration
new c28e3b0 Remove unused src directory for TIKA-1316.
new b5a860d Remove old build id-less build config from root pom.xml.
new cb9cbd5 Remove unused imports and redundant throws declarations in tika-server.
new b7e764d Use assertTrue instead of assertEquals(true...
new ed355dd Chain together StringBuilder.append calls instead of using String concatenation.
new 65aea2b - TIKA-1378: MicrosoftTranslator setClient and setId NPE (thanks to tpalsulich for the review!)
new 2045326 Partial TIKA-1377 patch from Dan Becker, with changes - add more XMPDM keys (in order), and add ID3v1 stubs for new tags which ID3v1 does not contain (in the same way as others)
new b488880 Partial TIKA-1377 patch from Dan Becker, with changes - ID3v2 support for more keys, MP3 Parser support to use that, and tests
new d0dfe79 Partial TIKA-1377 patch from Dan Becker, with changes - Extract more XMPDM data from MP4, with tests
new 8c1ff56 Include the tool used to create the MP4 in the XMP output, fixes a TODO spotted while working on TIKA-1377
new 2336b0c The MP4 parser has extracted the channel count for some time, so enable the test for that
new 73e942a TIKA-1381 - Added Lingo24Translator implementation
new 4689c26 TIKA-1381 - Added Lingo24Translator implementation to CHANGES.txt
new 9b475d3 Make the Tika CLI Extraction test more robust, with better failure messages
new 9b57be7 TIKA-1380: staging an updated test file for the actual patch once POI 3.11-beta-1 is released
new 4153045 added test and test docs for comments in xls and xlsx; lack of tests detected during work on TIKA-1380
new 2cf27a7 TIKA-1380 Upgrade to Apache POI 3.11 beta 1
new 85137c5 Found existing comments test in TestParsers; clean up earlier tests for comments in xls and xlsx
new cf4ee8f Upgrade the Commons Codec version to match that in Apache POI, upgraded in TIKA-1380
new 7446992 Update svn:ignore on newer modules to match that on the existing ones
new 2936d55 Enable the check for TIKA-1118, now that we have upgraded POI
new be7476e Convert from assertTrue(contains) to assertContains, to make addressing TODOs and failures much easier, for TIKA-1380
new 62979fa Enable the POI fraction test, as it now passes with the latest POI release (TIKA-1380)
new a13fdac TIKA-1317 extract contents from SDTs within cells in tables in XWPF (docx) files
new e07922d Switch from assertTrue(contains) to assertContains, to give better failure messages, and enable one now-passing test following TIKA-1380
new bf9d5c4 Another assertTrue(contains) to assertContains change
new 2624321 Enable more tests / TODOs now that we have upgraded POI with TIKA-1380
new e27454a Address the remainder of the test TODOs that we can following the POI upgrade in TIKA-1380
new e6a6083 Use the tika-parent version of junit via Maven dependency management.
new 90c7245 TIKA-1275 upgrade commons compress to 1.8.1; updated CHANGES.txt, too
new 4e1ce48 TIKA-1383: Minor simplification in the way Tika server is set up
new 3de8dbf TIKA-1383 Fixing TikaWelcome issues
new aa2024a Restore the HTML test of the Tika Welcome page, accidentally zapped in the TIKA-1383 changes
new 080b036 TIKA-1380; fix cases where ole.getLabel() == null for ole attachments
new 95051f2 [TIKA-1371] Optional registration of new TikaLoggingFilter
new 3b00450 [TIKA-1371] Minor update to TikaLoggingFilter
new 6cc0e11 Fix for TIKA-1387 (thanks Uwe Schindler). Adding the Maven forbidden-apis plugin and fixing identified errors.
new 7e64483 Fix for TIKA-1389, remove wildcard imports from project.
new e285888 Create tika-example module for TIKA-1390.
new add88fd Fix scm links in the example and translate pom.xml files.
new fa97e9a Add in Apache license header to tika-example pom.xml.
new eb9495e Fix for TIKA-1385, creating an ExternalTranslator. Also creates a MosesTranslator.
new 12c6b36 Update CHANGES.txt with TIKA-1385 and TIKA-1390 entries.
new 0832bcd Disabling the forbidden-apis plugin until TIKA-1387 is resolved.
new a4cf450 Update 7z related comments
new 61eccc1 Re-enabling forbidden-apis and working around identified errors in External and Moses Translators.
new 50dcb7b For places formatting numbers in fixed formats, or case-insensitive comparing Ascii strings, use Locale.ROOT not Locale.getDefault() to ensure predictable behaviour, and avoid issues in locales like Turkish. TIKA-1387
new 91d1f00 For places formatting numbers in fixed formats, or case-insensitive comparing Ascii strings, use Locale.ROOT not Locale.getDefault() to ensure predictable behaviour, and avoid issues in locales like Turkish. TIKA-1387
new 1c660bc Review SimpleDateFormat use, adding comments where OK or potentially an issue, for TIKA-1387
new a07a70a Finish thread safety fix for TIKA-1387
new 7f8c2d5 Fix typo in name of tika-example module.
new 1036f5a Initial commit for TIKA-1391, parsing examples.
new 23eb15c Second initial commit for TIKA-1391, parsing examples.
new 979c7d8 LanguageIdentifierExample for TIKA-1392.
new e5717ac Microsoft Translator example for TIKA-1393.
new e011883 AxCrypt mimetype, test file and test TIKA-1399
new 0d87ac9 Bump the POI dependency to 3.11-beta2, and remove the Geronimo stax one which is no longer required by anything now we are on Java 1.6 TIKA-1380
new b15e67f Start on examples of using different Content Handlers to get differing output
new e6821da TIKA-1259 Add the Ogg Daala video mimetype, and remove an incorrect vorbis magic (it was actually a general Ogg one, which is already on the parent)
new 25dc052 More content handler examples
new 295658d ContentHandler example showing how to break the resulting text up by size
new 9a8b11e Pull in the Tika Core tests as a dependency for the examples, some of the examples tests rely on asserts defined in the Core tests
new 82c3744 Patch from Uwe - disable the forbidden API check on the Tika Bundle, which has no java code of its own, as the way we unpack classes before bundling confuses the checker. TIKA-1387
new 4e36e66 correct examples pom to pull test-jar from tika-parsers
new 32ebc7a Fix for TIKA-674: CompositeParser should indicate which parser was actually selected for parsing contributed by Andrzej Bialecki.
new 79db92b TIKA-1404 The tika-app in server mode needs to close the TikaInputStream when done with it, to avoid leaking temp files
new 5d3ad0e Prep for 1.6 RC #2.
new 2d08e52 [maven-release-plugin] prepare release 1.6-rc2
new b64bd85 [maven-release-plugin] prepare for next development iteration
new befbc6c prep changes for next development iteration.
new 1bafbb8 If we open a new NPOIFS object from a TikaInputStream, attach the opened container to the stream so it gets auto-closed when parsing is complete TIKA-1410
new 2fedbba Fix warnings in Eclipse about un-handled types in the switch statement
new a9406a5 Patch from TIKA-1412 from Andrzej Bialecki - Handle ODF setup case of TikaInputStream from stream with no open container
new d0f46bc Have PackageParser include the last-modified date from the archive in the metadata, when handling embedded entries TIKA-1246
new 783594d Fix inconsistent whitespace/indents, spotted while working on TIKA-1246
new cc922f4 Patch from Luis Filipe Nassif from TIKA-1411 - Avoid 7z file leak through use of TemporaryResources
new 0365ed2 TIKA-1413 - Remove embedded thumbnail from body
new d7f6323 surround in plugin management to resolve http://stackoverflow.com/questions/6352208/how-to-solve-plugin-execution-not-covered-by-lifecycle-configuration-for-sprin
new 1bdd91b Fix a TODO by adding in the PowerPoint .ppt embedded resources extraction unit tests
new fa93128 TIKA-1418 add example for how to dump tika config; and add --config to CLI
new 85b96e0 TIKA-1418 add files
new d26f2a3 TIKA-1418 remove println...the horror.
new 464a908 TIKA-93, create a new Tesseract OCR Parser.
new 853df89 TIKA-1329 add RecursiveParserWrapper
new 3fc408e Add TesseractOCRParser to the META-INF services list.
new 718776d Simple JavaDoc fix for AutoDetectParser.
new 190e17a Fix Tika Mime Type post TIKA-93.
new ec40cb6 TIKA-1412 - Add UnitTest
new c206307 Fix for TIKA-1421 Check if Tesseract is installed before attempting OCR Contributed by tpalsulich,mattmann.
new b32b061 TIKA-1424: clear PDFont's resources after each document
new 6fa4168 TIKA-1419: upgrade to PDFBox 1.8.7 and update CHANGES.txt for this and a few recent changes
new fdd4969 TIKA-1420, create an example of a PhoneNumberContentExtractor.
new a8fee2d Add license headers to PhoneExtractingContentHandler and its test.
new 5976d31 TIKA-1420, refactor the phone number extraction to use a custom method of de-obfuscating numbers.
new b871795 TIKA-1420, move the PhoneExtractingContentHandler to tika-core. Tests in tika-parsers.
new 5be919d Use TikaTest.assertContains in PhoneExtractorContentHandlerTest.
new af4bca2 TIKA-1433 : extract documents embedded within annotations in PDFs
new 4f42198 TIKA-1427: add markup for documents embedded in pdfs
new e8fac5b TIKA-1427 cleanup. Handle inline images with same markup as Word parser
new ff8e253 TIKA-1427, small clean up to ensure that inline image number tracks with extracted file
new c53ad42 Fix for TIKA-1435: Upgrade Rome to 1.5 contributed by Johannes Mockenhaupt <gi...@jotomo.de>. This closes #16.
new d600793 Fix for TIKA-1354 Register ForkParser Service in Activator. Contributed by Michal Hlavac <hl...@hlavki.eu>. This closes #13.
new ccb4bee This closes #3. Looks like it's already merged, so tickling to get ASF PR to close at Github.
new 2335991 Fix for TIKA-1369: Resolve thread safety issue in ImageMetadataExtractor. Contributed by Vilmos Papp <pa...@gmail.com>. This closes #15.
new bac4ce4 Revert TIKA-1435 until we figure out the Rome/JDOM/HDFParser issue merge 1629338:1629337
new ace2ad9 - fix for TIKA-1441 ExternalParsers should allow dynamic keys to be specified for Regexs
new 3c68cf0 TIKA-1441 change log.
new a331b5e - fix for TIKA-605: GDAL Parser
new 8f7d739 Update for TIKA-605
new 63a2a6b - TIKA-605: fix remainder of tpalsulich comments from https://reviews.apache.org/r/26542
new fa4d2bf - TIKA-605: deal with heading boundaries; add associated unit tests to expose and prove fixed for regression
new eb7202d Fix for TIKA-1422 contributed by tpalsulich and mattmann.
new 1056aa2 TIKA-1444 Virtual PC Virtual Hard Disk mimetype
new eb1e098 [TIKA-1242] Update to CXF 3.0.2
new ab0065e [TIKA-1242] Moving the CXF rt/rs/service/description dep into a test scope
new 3d1a87e WEBP mimetype from Nelson Monterroso TIKA-1450
new f3718fa WEBP sample file from Nelson Monterroso, and associated unit test for TIKA-1450
new cf55425 TIKA-1422 - Apply fix of [~olegt] in Windows
new fe06af6 TIKA-1422 - Fixing build & minor refactoring of test class naming
new 8f57b93 clean up from TIKA-1311
new d42e013 TIKA-1451 add RecursiveParserWrapper output to CLI and GUI
new eb7e348 move pretty print metadata key sorter into standalone class
new 45da28c move pretty print metadata key sorter into standalone class, with added PrettyMetadataKeyComparator...argh
new 669565c upgrade gson to 2.2.4
new ccd5946 - getParsers is never called.
new 42cebed TIKA-1422. Skip checking the number of some handler invocations in the RFC822ParserTest if Tesseract is installed.
new e2293d3 TIKA-1459 fix write limit bug in BasicContentHandlerFactory when creating a BodyContentHandler
new 2cad861 cleanup tika-app pom, remove unnecessary gson dependency
new 63fb75d Very small Windows exe for TIKA-1461, generated with Visual Studio 2008 with advice from http://www.phreedom.org/research/tinype/
new 2377a7c TIKA-1461 PE files must also have the MZ header at the start, so tweak magic and add positive and negative mime magic detection tests for it
new 4f4ce4c If this test fails at all, have it report which test file it failed on to assist debugging
new eacb983 TIKA-1463 - Fix tesseractPath in Windows
new 0aa8bca TIKA-1467: in PDFParser, move metadata set isEncrypted() to before decryption step.
new ce51005 Fix for TIKA-1472 Warning on Tika Server startup - Failed to load class org.slf4j.impl.StaticLoggerBinder contributed by Konstantin Gribov <gr...@gmail.com> this closes #22.
new 78c8213 TIKA-1475. Reformat pom.xml files.
new 3b2fba4 TIKA-1476 - Updated TesseractOCRConfig to read from property file if present on classpath
new 069ac2d TIKA-1476 - Added tests for TesseractOCRConfig external configuration through properties files
new 994d7f5 Add .svn to .gitignore for people using git/svn
new 2d05a8a TIKA-1476 - Fix test on Windows env.
new 65977bb TIKA-1446 - Integration of [binhawking]'s work on CHM parser
new b6fda2a TIKA-1446 - Revert CRLF on profile language files
new ccc1fc0 TIKA-1476: Added default configuration file
new 1cc2725 TIKA-1446: Updated test so it loads the test documents from the classpath
new 746de4d Reverting incorrect commit whilst fixing test on TIKA-1446
new a0d5a4f TIKA-595: Adding Julien Nioche's patch to enable Multivalue Metadata for Html
new 9005c67 TIKA-1477: Added new custom header to Tika resource override Tesseract OCR language
new 74c231a TIKA-1477: Updated Tika resource to dynamically set TesseractOCRConfig and PDFParserConfig files from custom headers
new 739ac45 TIKA-1486 Remove duplicated mimetype definitions
new e500f6a TIKA-1486 Make DITA Task be a subclass of DITA, and not of itself
new bb9380c TIKA-1488 add X-Tika as namespace
new f64c9e5 TIKA-1487 Test Excel 4 file from govdocs, and an AOO generated Excel 5 file
new a495a10 TIKA-1487 Based on the file format docs from OpenOffice, add detection and mime types for the older Excel 2, 3 and 4 pre-ole2 formats
new 64419f7 Add a TODO for TIKA-1490
new ccd0c4b Add back return, while we're pending TIKA-1490
new ed77de4 TIKA-1218, prevent negative array size exceptions for corrupted mp3s.
new 9f1b421 TIKA-1491 BPG magic from Johan van der Knijff
new 49a6266 Some (but not all) test BPG files from Johan van der Knijff from TIKA-1491
new 9565b06 TIKA-1491 BPG detection test
new d66f8d7 TIKA-1384 and TIKA-1496. Upgrade slf4j-log4j12 to version 1.7.7 and manage it with tika-parent dependency management.
new d3bea08 Fix tika-bundle slf4j dependency issue.
new f233237 TIKA-1495 Decoder for UE7 integers, which work a bit like UTF8 does for strings, used in BPG
new 0de5e6f TIKA-1495 Start on a BPG parser, so far just covering width and height (rest to follow later)
new a4be4be Update the link to the XMP Spec, which seems to have moved on the Adobe site
new 1f32c70 TIKA-1495 Fetch the BPG colour information, and have a rough go at storing that as metadata for both BPG and PSD
new 04253d4 Prepare for TIKA-1494 - also provide the ParseContext to TikaResource.fillMetadata
new 8b06655 Fix indents / whitespace
new 2129abf TIKA-1494 Support fetching the password for Excel .xls files from the ParseContext where given
new 8b433c7 TIKA-1494 Allow supplying a password on a per-request basis via the Password header
new 3f89b48 Update the change list
new 1a317d1 Start processing extension data for BPG files
new d6d1d22 TIKA-1495 Start on BPG Exif and XMP handling, but for some reason the drewnoakes Exif code gives silent errors
new 2109402 TIKA-1442, upgrade to PDFBox 1.8.8
new 57db5ef TIKA-1498 add recursive parser wrapper output to tika-server
new 84c5d24 Update CHANGES.txt with recently resolved issues.
new 05eef80 TIKA-1498: now actually add providers to cli...argh
new f776fc0 TIKA-1497: add JSON and XMP output to tika-server's /meta
new 26b66a9 TIKA-1497: update changes.txt
new 1233a93 Temporary workaround for TIKA-1445 for Tika 1.7 - always pass the image to the regular parser to get the metadata set. Will be replaced in 1.8 with composite parsers + user selected config with strategy
new a8edc72 Fix some warnings
new 7dc73ce TIKA-1445 - Allow you to exclude certain mimetypes from a parser that would otherwise handle them, in your Tika Config xml
new f13e537 TIKA-1494. Test that decrypting with the wrong password returns a 500 error in MetadataResourceTest.
new e3fbbfb TIKA-1499: fold MetadataEP in tika-server into MetadataResource
new 8f6b48a Additional OSGi bundle definitions from Tim Allison from TIKA-1469
new a6b6ad4 Upgrade to POI 3.11 final, patch from TIKA-1469
new 75230ce Update the BPG parser following spec updates TIKA-1495
new 46b321f TIKA-1490 Parser for old Excel 2-4 files
new d398c71 TIKA-1490 Unit tests for Excel 2-4 parser
new e76299d TIKA-1490 Use the Old Excel parser for older OLE2 based formats too, like Excel 5 and 95
new e967f1d Some test database files for TIKA-1502
new dbc8f86 TIKA-1502 MySQL and SQLite3 mime types, with magic where possible
new 554922a More test database files for TIKA-1502
new f7e2453 Start on magic for subtypes of Berkeley DB TIKA-1502
new 52ccbef Split the Berkeley DB mimetypes into three levels, and add a detection test (passes) and a hierarchy test (disabled as fails) TIKA-1502
new ef0e28d Fix test for TIKA-1502 - re-order the MediaTypeRegistry logic for getting the super type, so that if an explicit inheritance has been defined between one parametered type and another, that inheritance is used in preference to "drop all parameters"
new e51064f One more media type with parameters test, for unknown parameters
new 79bb58b TIKA-879 Add a new parent mime type, for the text based message formats, of text/x-tika-text-based-message, which allows Thunderbird messages to be correctly detected as they now show up as being text based not binary based in the hierarchy
new 400ed89 Upgrade the Maven Shade plugin - slightly faster, and avoids spurious warnings about duplicate xmlbeans classes
new db03e43 Missing test file from TIKA-879
new f3df5ae TIKA-1500. Strip tags from content in FeedParser.
new 5a7216e TIKA-1503. Don't run the GDAL FITS test if FITS files aren't supported by the installed version of gdalinfo.
new 78265ea Pure whitespace change. Reformat the GDALParser and its test.
new 9277ce6 TIKA-1465. Reformat XHTML generation from NetCDFParser.
new 02f4720 Pure whitespace change. Fix formatting of NetCDFParser and its test.
new a97f811 Update CHANGES.txt for 1.7 release.
new 716c50f Add Tyler Palsulich's key to KEYS.
new 001406d [maven-release-plugin] prepare release 1.7-rc1
new cbe900f [maven-release-plugin] prepare for next development iteration
new f5ad49d TIKA-1506: close PSTFile's file handle after parsing
new 5c694fd [maven-release-plugin] prepare release 1.7-rc2
new 0a6bd7a [maven-release-plugin] prepare for next development iteration
new bddf67d Shorten the ParseContext fetching of the TesseractOCRConfig
new 5f68452 TIKA-1445 If Tesseract isn't available, don't offer any supported mime types, so the parser avoids being picked by DefaultParser or similar
new 895ebb7 Cleaner workaround parser call from Tim Allison from TIKA-1445
new 1843bb5 TIKA-1445 Unit test to show that when an invalid tesseract config is given, and tesseract cannot be found, TesseractOCRParser will return no types and will not be selected by DefaultParser
new 5cf7785 TIKA-1445 Unit test to check a JPEG via Tesseract gets both OCR text and normal JPEG metadata
new 0a176a5 TIKA-1445 Use assertContains, and fix a problem with the ForkParser integration tests
new d7f253f Shorter supportedTypes initialisation
new 192c0d0 TIKA-1445 Cache if Tesseract is present at a given path or not
new 4533e06 Temporary workaround for the TIKA-1507 ForkParser / OGI issue
new 9f8fc77 Disabled exif related bpg tests for TIKA-1495
new f433700 TIKA-1445: need to fix TikaMimeTypesTest in tika-server to accommodate two options for parser
new 41c09d7 TIKA-1445: add tests to TesseractOCRParserTest to ensure metadata is extracted
new 9ba366d Fix indenting in TesseractOCRParser.
new 5d3fc7f Remove unused variables from TesseractOCRParser.
new c9887c1 TIKA-1445. Split TesseractOCRParser#offersNoTypesIfNotFound in two. Small import and comment changes.
new 31fd0dc Add Tyler Palsulich to parent pom developers list.
new 9192474 Add tika-server to maven-release-plugin configuration.
new 484fe4e TIKA-1412: Fixed test issue on Windows build
new 539d511 Update release date in CHANGES.txt.
new 6e18dde Remove redundant release config in root pom.xml.
new dcc2511 [maven-release-plugin] prepare release 1.7-rc3
new 221f10d [maven-release-plugin] prepare for next development iteration
new 9b4a088 Add missing svn:ignore on new-ish sub-projects
new e09cb44 TIKA-241 Unrar parser from Luis Filipe Nassif
new b9b8537 TIKA-241 Refactor to use common logic between PackageParser and RarParser for populating xhtml+metadata of embedded resources
new eebeab1 Use assertContains(needle,haystack) rather than assertTrue(haystack.contains(needle)), to get more helpful failure error messages
new a729485 Fix indents/whitespace, and use assertContains
new 1b462d3 Tweak comments/whitespace, and use assertContains
new a7bfac6 Use the common assertContains method, and use it more widely
new 89eea5f Use assertContains(needle,haystack) rather than assertTrue(haystack.contains(needle)), to get more helpful failure error messages
new 7ef8091 Add the unrar license to the main and parsers license files (junrar uses the same, category B, license)
new 389d935 Update vote.txt template.
new c2f825f Add 1.8 current development section to CHANGES.txt.
new 7cd31ee TIKA-1028 Have PackageParser report encrypted zips via EncryptedDocumentException rather than commons compress UnsupportedZipFeatureException
new 0c9237e Test rfc822 file with an encrypted zip file attached from TIKA-1028
new a9f559d TIKA-1028 If an encrypted attachment is found in a RFC822 email, silently skip it and carry on, so the rest of the email can be processed (may need more work!)
new 32985ed Partial unit test for TIKA-1028
new 7b7ad95 TIKA-1222 For RFC822 mails, start to prefer a EmbeddedDocumentExtractor to a Parser for handling embedded resources, but retain the Parser use if not for backwards compatibility
new 3b1d249 TIKA-1222 Move further towards EmbeddedDocumentExtractor, but keeping backwards compatibility
new 8abbe99 TIKA-1222 Unit test for rfc822 embedded resources
new 24faaa9 TIKA-1028 New rfc822 mail with encrypted zip of known password, from Juha Haaga, and matching unit test changes, but I'm not sure it's quite right just yet... (See TODOs)
new 58b7c17 TIKA-1028 Refactor the RFC822 parser to setup recursion once per file, not once per attachment, and get it so that a non-encrypted zip attachment is correctly extracted. (Commons Compress currently lacks password protected zip support)
new ab00944 TIKA-1521 Support password protected 7zip files
new 2f5148a Add references to the Tika issue for upgrading for the fix
new 9398ed2 TIKA-1526: initial fix for jvm bug that can affect users with a default Locale of tr running on MACOSX or BSD. We still need to confirm that this fixes the problem and/or add a unit test.
new 83b8f79 TIKA-1529: step 1...get rid of toLowerCase in BasicContentHandlerFactoryTest
new 892f3a8 TIKA-1529: turn forbidden-apis back on and clean up all mentions of UTF-8
new ff34714 fix for TIKA-1530: Include parsed mp4 duration in metadata contributed by Oskar Wickström <os...@live.com> This closes #25.
new bb0822a Use a locale-consistent DecimalFormat to set the mp4 duration, avoiding rounding issues TIKA-1530
new 07892ac TIKA-1521: follow commons-compress and require installation of jce before testing password on 7z file
new 493567d TIKA-1534: Upgrade to Commons Compress 1.9
new 6778d5d TIKA-1329, added examples for the RecursiveParserWrapper
new c4b1393 TIKA-1423 Build a parser to extract data from GRIB formats
new 1c9b450 TIKA-1518: Added Dockerfile to support building a Tika Server image
new 418fdcd Fix for TIKA-1537 Installation on OSX 10.10.2 generates OutOfMemory Error during parser tests contributed by Andrew Hwang. This closes #26
new 1493ca7 Update readme with correct Java and Maven version requirements.
new c2b4463 TIKA-1423: Added exclusion to avoid duplicate JCL dependency
new 8d9ddf9 TIKA-1542 substitute Apache friendly TTF test file for our current copyrighted file
new 0462e2d Patch for TIKA-936 Fix for RarParser for handling Chinese characters contributed by kongxianghe1234 <ko...@gmail.com>. This closes #27.
new 1916f12 TIKA-1542 substitute Apache friendly TTF test file for our current copyrighted file, take 2. See PDFBOX-2383
new 12bdd1c Fix for TIKA-1539 GRB file magic bytes and extension matching contributed by Luke Sh LukeLiush <ha...@gmail.com>.
new 6f788cb Fix for TIKA-1539 GRB file magic bytes and extension matching contributed by Luke Sh LukeLiush <ha...@gmail.com>. This closes #28
new 14386dc TIKA-1539 Fix indent, and move the GRIB and XQuery mime entries to the right place in the sorted list
new 415e35c ParserDecorator unit tests
new 1cdc9ae TIKA-1509 Provide a possible "parser with fallback" implementation, with lots of questions!
new a39fc18 TIKA-1547. Use POST for HTML form input.
new 169ecc8 Fix for TIKA-1541: StringsParser: a simple strings-based parser for Tika Contributed by Giuseppe Totaro.
new 73da44c TIKA-1547. Update CHANGES.txt.
new ad6a794 TIKA-1269. Add Miredot documentation for tika-server.
new 8dbed10 TIKA-1269. Add link to tika.apache.org Miredot documentation on the tika-server index.
new e12bb7a TIKA-1544 consecutive new lines not preserved in rtf
new 8fc3a09 TIKA-1548 improve handling of encrypted pdfs when wrong password is offered
new 961a8f2 TIKA-1511 add parser for sqlite3
new 8ecbab9 TIKA-1511, with new files added...doh
new e66ec97 TIKA-1511, third time is the charm...many apologies
new dfb59c3 TIKA-1511 try to revert to earlier version of sqlite-jdbc to avoid UnsatisfiedLinkError on ubuntu
new 8494d93 Fix for TIKA-1549 Increased the speed of language identification by a factor of two. Fix contributed by Toke Eskildsen <te...@ekot.dk>. This closes #29.
new a6708c1 Bump up the core maven plugin versions to the latest available
new c19701b TIKA-1553: add an EvilParser for testing purposes
new dadd869 TIKA-1323: allow tika-server to return stack traces from parse exceptions for easier analysis of parser exceptions via tika-server.
new 538de86 TIKA-1556 clean up whitespace in tika-server
new 99b53ce TIKA-1558. Enable blacklisting of Parsers and other services with a servicename.blacklist META-INF file.
new 11787f5 Apply rollback patch for TIKA-1354 contributed by Michal Hlavac <hl...@hlavki.eu> this closes #30.
new bab243e Updated tests for TIKA-1541 simple strings parser from Giuseppe Totaro.
new de5df17 Updated tests for TIKA-1541 simple strings parser from Giuseppe Totaro.
new 4f562f3 instructions for how to contribute via Github. This closes #31.
new a39e891 Fix for TIKA-1483 Create a Latin1 charset raw string parser contributed by Luis Filipe Nassif.
new 960cd68 Fix for TIKA-1483 Create a Latin1 charset raw string parser contributed by Luis Filipe Nassif.
new 547166e More assert containing methods
new 8b02f6a Start on unit testing for the new TIKA-1558 style parser blacklisting
new 43d7a8b Start to prepare for child parser definitions within a composite parser
new 9ee3a34 TIKA-1558 Support excluding (blacklisting) parsers from config, so you can use DefaultParser for all except certain parsers. Also supports child parsers of a composite parser from config, towards TIKA-1509
new c56b275 Fix for TIKA-1561 GCMD Directory Interchange Format (.dif) identification contributed by LukeLiush <ha...@gmail.com>. This closes #32.
new deefc83 Add the config blacklisting to the Changelog
new 07597af Sereal, CBOR and WinInf mime magic from file(1)
new 1eb9f97 A few more mimetype updates inspired by file(1)
new fcfec8b Add a Tika CLI option for comparing with the File(1) magic directory, to report types to consider adding, and types we may be able to get magic for TIKA-289
new d3217c7 Add getChildTypes(MediaType) support to MediaTypeRegistry, to allow you to navigate the hierarchy the other way too
new 114b481 When looking at the file(1) magic dir, check children for magic too, as sometimes they have it, and update the changelog
new d7156b1 TIKA-1563 Put the more common gzip file extension (.gz) first in the glob list
new c63ae23 Use ${project.version} for tika-core dependency in tika-translate.
new ecd7ce8 TIKA-758. Remove PDFBOX workarounds in PDF2XHTML.
new aca3dfa TIKA-758 clean up after remembering PDFBOX-1130
new 2697354 TIKA-995. Properly output XHTML body attributes, contributed by Markus Jelsma.
new bd19155 Update CHANGES.txt for TIKA-995.
new d1e8d71 TIKA-1489 add optional accessibility checking to PDF files
new fe76f57 TIKA-1000. Ignore an invalid SAXNotRecognizedException.
new 15303fe TIKA-1038. Fix possible infinite recursion while parsing some PDFs.
new 0c81e7a TIKA-1553 change EvilParser to MockParser and move to core
new 0a92ecf turn off pdfbox logging in PDFParserTest
new 4aab1c1 Fix for TIKA-1567 WelcomeResource in TikaServer doesn't print PathParam prefix. This closes #33.
new db07978 TIKA-1553, add action types for printing to stdout and stderr
new fa5eb09 Remove println from XHTMLContentHandlerTest.
new 86b897d TIKA-1571 Upgrade UCAR dependencies to 4.5.5
new e559b33 TIKA-1286 Sample Visio OOXML VSDX files from Pascal Essiembre
new 44032de TIKA-1286 Visio OOXML mimetypes, and non-container detection unit tests
new 3f6d06f TIKA-1286 Bring the overall file mime types into line with the other OOXML formats, and add container aware detection + tests for the visio ooxml types
new 257a53b Support detection of OOXML-Strict files, and add a disabled unit test for OOXML-Strict xlsx parsing (not yet supported by POI)
new 4112484 TIKA-1564. Move tika-server resources and writers to their own packages.
new 63445bd TIKA-1564. Fix package visibility compiling issue in tika-server.
new 794595c TIKA-1063. Add basic ODF style support, contributed by Axel Dörfler.
new ce710eb TIKA-1063. Add ODF style test resource file.
new 54ea090 TIKA-1564. Move TarWriter to the server writers package.
new c46f545 TIKA-1117. Don't let iWorkPackageParser close the given InputStream.
new 19c87d9 TIKA-1137. Break early when possible in ForkParserIntegrationTest, contributed by Adrian Nistor.
new 4cf0ab4 TIKA-1576. Upgrade metadata-extractor to version 2.7.2.
new bd4798a Pure whitespace change. Reformat ImageMetadataExtractor.
new d3be3ea Fix for TIKA-1365 Lower priority for XML starting with comment, allow HTML starting with comment to be detected as text/html contributed by Matthias Krueger <mk...@mkr.io> this closes #35.
new 355da43 Adding EMF magic as per Microsoft's EMF specification, and thanks to Luis Filipe Nassif contributed by Matthias Krueger <mk...@mkr.io> this closes #34.
new 05ad16e Update tika-dotnet version to 1.8-SNAPSHOT.
new fc87e4a TIKA-1416. Refactor Translator Exception handling.
new e5ca726 Style changes for GDALParser.
new 832ba70 Log Tesseract messages.
new fe4cd58 initial commit of TIKA-1330
new 5a6626c updated patch for TIKA-1579: outputs NetCDF file type in metadata
new 375274a updated patch for TIKA-1578: outputs file type in metadata
new bc34c83 TIKA-1531 upgrade to POI 3.12-beta1
new d524ba6 update to CHANGES.txt
new dbc5eb7 TIKA-1330 clean up logging and some dependencies. Still some log4j dependencies for now
new de00e27 TIKA-1581 - Switch using Jhighlight on CDDL/LGPL dual-licensed and update notices
new 8a94581 TIKA-1581 - Typo & CHANGES.txt
new b765db2 TIKA-1583. Convert tika-server README to markdown.
new a4decc5 TIKA-1583. Small formatting changes for tika-server README.
new 849eb31 TIKA-1583. Remove old README.
new a606c94 TIKA-1586. Enable CORS requests on Tika server.
new 7b979a8 Fix for TIKA-1580: Support IsaTab MIME identification and parsing. Thanks to Giuseppe Totaro for all the great work!
new 62de8ee Update pdfbox to 1.8.9
new 6e5056a TIKA-1511 include xerial and native libs; some cleanup of README in preparation for 1.8 release
new 990c0c2 TIKA-1512 temporary workaround. Currently not including test docs or tests that derive from govdocs1
new 90bc595 TIKA-1584: fixed regression in Tika 1.7 that prevents processing of embedded docs with /tika service
new 6ce4a16 ForkParser.setJavaCommand takes List<String> now
new 4c1834e Fix broken ForkParser APIs
new 383208c TIKA-1581 - Mention @kkrugler thanks in CHANGES.txt
new 8054ddd TIKA-1330, trivial fixes to avoid NPE with consumersManagerMaxMillis parameter
new 4ae33b7 TIKA-1330: add integration tests to TikaCLITest
new 3134278 TIKA-1423: exclude pdfs and readme.txt files from tika-app and tika-server jars. Anything else we can exclude?
new 6648ed8 TIKA-1511: add public domain license notice for Sqlite to main License.txt
new 671b314 TIKA-1589 - Patch from Max Daniline to extract MP3 duration from files with no ID3 tags. This closes #38 from github
new 5f506c1 TIKA-1558. Refactor Parser blacklisting.
new b32a233 TIKA-1558. Better error message and fix typo.
new 42d0111 TIKA-1586. Change CORS short option to -C.
new d888256 Reformat 1.8 section of CHANGES.txt in preparation of 1.8 release.
new 6a013f5 TIKA-1330 clean up logging in tika-batch and tika-app integration of tika-batch
new b45a092 TIKA-1330 clean up logging in tika-batch and tika-app integration of tika-batch, take 2
new a19c06f TIKA-1330 flush stacktrace writers
new 18d0536 Remove blacklist custom mimetypes.
new 3fff336 Updated bouncycastle to 1.52
new fe31952 TIKA-1330 fix logging in TikaCLI to avoid adding multiple appenders
new d2dde3e TIKA-1323: flush writer when printing stack trace
new 6194e68 TIKA-1519 - don't allow potentially erroneous http-equiv Content-Type to overwrite Content-Type in HtmlParser
new 43086f4 TIKA-1519 change underscore to dash
new de74bb3 tika-batch cosmetics
new 6d184fc tika-parent: added myself to committer list
new 39644a9 tika-parent: slf4j updated and adapters added
new c88345a tika-app: pass all logging through slf4j
new 6906f35 Update release date for Tika 1.8 in CHANGES.txt.
new 1c07108 [maven-release-plugin] prepare release 1.8-rc1
new d176c6c [maven-release-plugin] prepare for next development iteration
new aecf60c TIKA-1594 upgrade metadata-extractor to 2.8.0 and add parser for webp metadata
new 1e33ed3 Fix for TIKA-1597
new 13a280b Cosmetics: remove trailing whitespace (RTFParser)
new 7d231d0 TIKA-1519: add charset information for the non-html formats, too: XHTML(s) and x-asp
new ece254a TIKA-6000 - Fixing NPE when having style in footnote
new f2c263d TIKA-1600. Reformat ODF Parser files and move OpenDocumentParserTest tests to ODFParserTest.
new 7af403b Update CHANGES.txt in preparation for Tika 1.8-RC2.
new a9a967f [maven-release-plugin] prepare release 1.8-rc2
new 6df7f40 [maven-release-plugin] prepare for next development iteration
new 3e9db56 TIKA-1605
new 35090b3 another npe in PDFParser
new b46164b TIKA-1606: update Guava version to something slightly more recent
new 52a2eaf TIKA-1511, move xerial dependency to 'provided'
new 6d28756 Add 1.9 section to CHANGES.txt.
new 88a5a85 Update version of tika-dotnet.
new f25385e fix documentation of when we moved the sqlite dependencies to provided
new 3c5905e TIKA-1501: Fix disabled OSGi related unit tests. Fixes from Bob Paulin.
new 7ada216 TIKA-1611 -- allow RecursiveParserWrapper to catch exceptions caused by embedded documents
new d9f730c WIP Fix for TIKA-1610: Support MIME extension for CBOR files contributed by LukeLiush <ha...@gmail.com> this closes #42
new b40b92e TIKA-1580: Fix to allow test to run on Windows with space in folder
new 60059ae TIKA-1610 Bump the CBOR mime magic priority to 60, to be more specific than (x)html, which is what CBOR often contains, and add a detection unit test
new c6ed613 tickle to close #44.
new 54e6717 Revert r1676165 as it included local changes.
new b9617fb Patch from Bob Paulin from TIKA-1617 - Change OSGi Detection test to use OSGi Service
new 40e0757 Spotted when looking at TIKA-1617 - DefaultParser should override getAllComponentParsers to mirror getParsers behaviour when a dynamic service loader exists
new f192ed7 Add a parsers equivalent OSGi test to mirror the detectors one, spotted while working on TIKA-1617
new 7ca5ec0 Fix for TIKA-1532 DIF Parser contributed by HyperDunk <aa...@gmail.com>. This closes #46.
new 2a26a5d Fix for TIKA-1535 Inheritance modification for the class MIMETypes contributed by LukeLiush <ha...@gmail.com> this closes #45.
new 79c07df Fix for TIKA-443 Geographic Information Parser contributed by unknown <ga...@gmail.com> this closes #47.
new 281fedf Fix for TIKA-1571 add probabilistic mime selection contributed by LukeLiush <ha...@gmail.com> this closes #41.
new 39352e9 Fix for TIKA-1582 Mime Detection based on neural networks with Byte-frequency-histogram contributed by LukeLiush <ha...@gmail.com>. This closes #36
new 3210d79 - fix for TIKA-1621: TikaResource should log errors determining ContentDisposition
new 7f9d4c2 Update CHANGES entry for TIKA-1621.
new cd91ee4 Add some javadocs explaining what this does, and use a proper version UID to avoid serialisation problems (eg with Forked mode)
new 3784cac Add some javadocs explaining what this does, and use a proper version UID to avoid serialisation problems (eg with Forked mode) TIKA-1517
new 66d39b7 Update whitespace to match coding conventions
new c3d82a8 TIKA-1517 Pull the ordering logic for loaded classes (non-Tika first etc) out into a util class
new ca0676f - DIFParser has been moved to o.a.tika.parser.dif
new 761467a - DIFParser has been moved to o.a.tika.parser.dif
new f54cb9e TIKA-1562: Add examples from the Tika in Action book
new 2867f73 TIKA-1562: Add examples from the Tika in Action book
new d3c776c - fix for TIKA-1622 Expose Tika LanguageIdentifier via Tika Server
new d4f437c fix for TIKA-1622 Expose Tika LanguageIdentifier via Tika Server
new a54b37a - fix for TIKA-1622 Expose Tika LanguageIdentifier via Tika Server
new 3982066 - fix for TIKA-1622 Expose Tika LanguageIdentifier via Tika Server
new 18902db - typo
new e533a22 - provide no arg constructor for service loading - this will get selected by default per its ordering by name - TODO: fix this later.
new 9f255bc - fix for TIKA-1620: OUTPUT_FILE_TOKEN not being replaced in ExternalParser contributed by Pascal Essiembre
new 497445e OUTPUT_FILE_TOKEN not being replaced in ExternalParser contributed by Pascal Essiembre TIKA-1620
new fa8b7b6 properties file doesn't require package prefix (will default to this package)
new 2280fd6 - no need for class prefix
new 3241830 - use class classloader
new 6d21945 - don't need the package prefix for Lingo24 translator
new 072980c - no need for pkg prefix for properties file
new 6481195 - use class classloader to load config
new a850812 TIKA-1623 Expose Translation Interface from Tika Server
new 0a1ffa9 Fix for TIKA-1625 Add support to Tika Server for parsing remote file URLs and for providing language detection contributed by junwei1229 <ll...@tradeshift.com> this closes #48.
new 73741e7 - be consistent and set language in /rmeta as well per TIKA-1625
new 24c6004 Updated Apache POI to 3.12
new 90ff7e3 TIKA-1628: ExternalParser.check now returns false if SecurityException is thrown
new 88f140f TIKA-1629 fix eol-style to LF in *.java *.properties and select *.xml
new b5a79c8 Update to TIKA-1622 with corrected French language example contributed by Thomas Ledoux.
new 345d33f TIKA-1085 Treat a PDF with a leading Byte Order Mark the same for detection, and add low-priority matches for the PDF magic coming in 1-1024 bytes of the start (may give false positives if too high), plus tests
new 03203f1 TIKA-1632 zlib mime magic from Pavel Micka
new 77a5e12 TIKA-1632 Add some test zlib compressed files, another magic for it, and detection unit tests
new 8df88b0 TIKA-1635 Disabled zlib parser support, not yet enabled pending a fix for a commons compress bug
new 2259e44 Add an alternate zlib mimetype found in some places
new 13c38d7 TIKA-1634 Add some sample matlab files
new fbd9142 TIKA-1634 Two more kinds of matlab magic, and tests
new 84b7ca2 fix for TIKA-1614 Geo Topic Parser contributed by aranyali <ar...@gmail.com> and modified and updated by Chris Mattmann this closes #43.
new 3ab3c34 - formatting
new c385761 - clean up imports/formatting
new 1b23ca5 - missing Apache header
new 2aa8b31 - fix for TIKA-1636: Toggle loading error warn logs in Tika Service Loading from the Command Line
new 7728826 Update CHANGES.txt for TIKA-1636.
new 3097527 - fix for TIKA-1638: Make ExternalParser actually work
new 5c664b9 Update changes for TIKA-1638.
new ba78bef - fix for TIKA-1510: FFMpeg installed but not parsing video files
new 5a1f1f3 Update changes for TIKA-1510.
new 227ef01 TIKA-1510: fix videoColorSpace
new 65127bf - fix unit tests associated with TIKA-1638
new 7ec0a0b - fix for TIKA-1639: Add EXIFTool as an ExternalParser
new be0e09c CHANGES update for TIKA-1639.
new a7e0a56 - fix unit test if CompositeExternalParser isn't available b/c exiftool and/or ffmpeg aren't installed.
new a7b50ca TIKA-1315 -- basic list support for WordExtractor; still need to add in override behavior once we add a class to ooxml via POI
new 5239ee7 TIKA-1643: clean up code in tika-parsers -- changed all newlines to lf and autocorrected code for most parsers that I've mis-styled.
new 7212465 TIKA-1643: reverted tika-server's pom to re-include qmino. i still need to delete this locally for every build because i can't figure out how to get to qmino through a proxy
new bea6f06 upgrade release plugin to get around [ERROR] Failed to execute goal org.apache.maven.plugins:maven-release-plugin: error.
new 48d76bd [maven-release-plugin] prepare release 1.9-rc1
new 741be83 [maven-release-plugin] prepare for next development iteration
new e549ec8 Try to make the low-priority padded PDF magic match more specific, as it looks to have incorrectly triggered on a few of the govdocs text files
new fa1c631 Bibtex entries are case insensitive, and might start with a comment, so tweak magic and add a test file. (Spotted in govdocs1)
new db9ac56 TIKA-1634 Few more matlab and other code related tests
new cc32605 TIKA-1646 fix RecursiveParserWrapper to add Metadata object even if an exception is hit while parsing the container
new 9413f45 TIKA-1646 small cleanup
new 826467e TIKA-1315 cleanup after run against govdocs1
new a1b849f Fix for TIKA-1634 Detecting problem with Matlab source code contributed by Jihyun Oh <ma...@gmail.com> this closes #49.
new 04be328 TIKA-1233 reopened
new 482c7f8 Mark the Tex formats as subtypes of text, so that if there isn't a dedicated parser for them, then they still get some basic text extracted via the text parser. Improves govdocs1 coverage
new 209b648 Fix for TIKA-1652, TIKA-1426: Tika Server should allow config file override from the command line like Tika App
new 6199820 Fix for TIKA-1645 & TIKA-1642: Extraction of biomedical information using CTAKESParser contributed by Selina Chu, Giuseppe Totaro and mattmann.
new c68c280 CTAKESParser: don't enable via SPI since enabled via config.
new d170261 [maven-release-plugin] prepare release 1.9-rc2
new 4fab75b [maven-release-plugin] prepare for next development iteration
new fa5e666 Fix indents to match http://tika.apache.org/contribute.html#Code_Formatting TIKA-1642
new 952ba27 Improve how the Tika CLI reports decorated parsers in --list-parsers
new 5132638 Include parser decoration details in the Tika Server parser listings as well
new b93a7d2 TIKA-1653 Re-do the XML parsing in the Tika Config, so that a parser tag with another inside it doesn't get accidentally duplicated at the top level
new 342dd8f Make the nesting more visually obvious in the Server HTML parsers listing
new 086501e Allow Tika Config xml to have a ParserDecorator with child parsers, and note about how this can work in the javadocs
new 826bae2 cTAKES config xml example and code example in JavaDocs TIKA-1642
new 80a1126 TIKA-1654 Reset cTAKES CAS into CTAKESParser (Fix for TIKA-1645)
new fdaedd0 Adjusted indentation in pom.xml file to match rest of file
new 84964ec Reformat tika-parsers pom
new e846610 Reformatted POMs
new 78ac030 Add a mime type definition for Java properties files, after a discussion on stackoverflow showed we didn't have one
new 64a268a TIKA-1660 Java Properties sample file and detection test, follows on from r1686199
new 5cdc797 TIKA-1654: Reset cTAKES CAS into CTAKESParser
new cc9fbcf add test to ensure that the list reader for tika-batch properly creates subdirectories
new 1a3749f Fix for TIKA-1659 ZipContainerDetector does not detect all IPA files contributed by Rami Shomali <ra...@lookout.com> this closes #51.
new 90a2202 TIKA-1663 add a DigestingParser
new 444dadd Fix for TIKA-1664: GDALParser now correctly sets nitf as a supported media type contributed by Joseph North <jo...@gmail.com> this closes #53.
new 761273f Fix for TIKA-1669: xpath node test ./node() should match all contained nodes contributed by WulfB <wu...@inacta.ch> this closes #52
new fd8514c Rollback r1688087 as it seems to cause some tests to fail.
new 2a47d9a TIKA-1601: integrate Jackcess to parse MSAccess files
new 06cfbaa Fix for TIKA-1602: Detecting standards-non-compliant emails as message/rfc822 contributed by Jeremy B. Merrill <je...@nytimes.com> this closes #40.
new 425506e TIKA-1536. Upgrade to Java 1.7.
new 4695df5 TIKA-1536. Update CHANGES.txt with upgrade to Java 7.
new de5a2de Remove change comment, TIKA-1602
new 2764fb8 TIKA-1673 drop source file name from embedded file path; made a few java 7 updates; added timing for embedded docs
new f2218da TIKA-1673 -- doh, add back dropped qmino in server's pom
new 9688c77 TIKA-1400
new 165eebc TIKA-1674: initial commit to add example of how to extract embedded files
new 6029c0d TIKA-1676
new 320b289 TIKA-1681
new 898f300 TIKA-1684
new 4dcfb74 TIKA-1685 clean up easily cleaned up deprecations
new 4a20585 TIKA-1687 upgrade xerial.org's sqlite-jdbc to 3.8.10.1
new c11eee5 TIKA-1238: Update OutlookExtractor's codepoint detection algorithm
new 9a8798b TIKA-1678 -- initial commit. Need to wait for fix to PDFBOX-2896 to generate test file.
new 9c04fa6 TIKA-1683 -- add encryption support for Jackcess
new 0658ee6 TIKA-1683 -- add encryption support for Jackcess, this time with test document
new d2e68d4 TIKA-1692 : allow MimeTypes to look for a registered mime type that may or may not have parameters.
new 28149a6 Tweak the getRegisteredMimeType javadocs a little bit, to try to make it clearer
new 5f3acd2 Fix some javadoc warnings
new 7cdf08c Fix some javadoc warnings
new a4baebe TIKA-1588 upgrade to PDFBox 1.8.10
new 98672cd TIKA-1690: revert changes made in r1678515 that added fileUrl capability in tika-server
new 194a301 TIKA-1667: upgrade to POI 3.13-beta1
new 1ecb5f9 TIKA-1678: clean up and add test file and unit test
new 58a757e TIKA-1689: revert mistakenly flipped sort order of parsers from r1677328
new 1ff04b1 TIKA-1689: with mention in Changes.txt
new 797b0e8 Exclude junit compile-time dep from json-simple
new 70dbd04 Remove junit from OSGi bundle deps
new d8d05e0 Start on detector config tests for TIKA-1702
new f018a43 Set missing svn:ignore
new 0eeab40 TIKA-1702 Refactor some of the config parser loading to be more re-usable for detectors, and bring the method signature in line WRT Composite vs not (must always be composite)
new 1db02ed More TIKA-1702 refactoring to bring detectors in line with parsers
new cb89a82 TIKA-1702 Start moving to a loader class pattern for common Detector and Parser (+later others)
new 62428f2 TIKA-1702 CompositeDetector support for excludes, along the lines of the CompositeParser support
new e3f48bd TIKA-1702 Move the parser and detector creation logic to the config loader classes
new b1091b0 Allow Detectors to be defined as excluded in Tika Config XML TIKA-1702
new 79f20f6 If DefaultTranslator has multiple translators loaded, use the first available, not just blindly the first
new 06fefb3 Empty Translator, similar to the ones for Parser and Detector, for use in testing etc
new 727528c Convert Translator config to the new pattern for TIKA-1702, and add unit tests for Translator xml config
new 2cb29c2 Changelog
new 9489bfd Move AbstractTikaConfigTest to Core, and use that to shorten TikaConfigTest TIKA-1700
new cd129b5 TIKA-1700 Add TikaConfig constructors that take a ServiceLoader, and add a unit test that shows we (now) use the LoadErrorHandler on that properly for reporting problems with listed class names
new 3533866 Fix up the Probabilistic Mime Detection Test
new ff0e5c3 Update CHANGES.txt for 1.10 release
new abd6b87 Patch from Bob Paulin from TIKA-1700 - Allow setting the Service Loader dynamic flag and load error handler from the tika config xml
new a57f21e [maven-release-plugin] prepare release 1.10-rc1
new 1569b3f [maven-release-plugin] prepare for next development iteration
new 38697cb Fix for TIKA-1703: Can't Specify Tesseract Data Folder Distinct from Tesseract Executable Path Contributed by Christian Wolfe <ta...@gmail.com> this closes #56.
new 3857bd5 Move to the most recent org.apache parent pom
new 4023120 More Tika Core rat excludes
new 8045b05 License headers and Apache Rat excludes
new 347d7d2 License headers and Apache Rat excludes
new 89d4674 Updated tika-dotnet POM for Apache 1.10 release
new ee94feb Updated tika-dotnet POM for Apache 1.10 release
new 532de24 Fix indents/whitespace
new 650d01c Several people on StackOverflow are getting confused by this example, show how to use AutoDetectParser first, all the components second
new d5796b9 Replace deprecated method use and outdated practice from the example
new f67c93c One more improvement
new ef88dd9 TIKA-1705: Upgraded ASM to 5.0.4. Patch from Uwe Schindler.
new c8edc83 TIKA-1705: Changed dependency from asm-debug-all to asm
new c7c166a - fix for TIKA-1699: Integrate the GROBID PDF extractor in Tika contributed by Sujen Shah <su...@gmail.com> this closes #55.
new 1c67639 Changes.txt for TIKA-1699.
new 8bb18b1 - statically initialize the context once (so in Tika server it can be reused)
new 99aaf11 Back out r1695816, so the build can pass again, pending a fix of the broken grobid poms. Fix being tracked in TIKA-1699
new 7f5ef18 Use a consistent version of Commons IO everywhere, enable the Forbidden APIs check for it, and fix problems it found TIKA-1706
new 8e03be7 Move the parent test class of many Tika tests to core/test, so core tests can use it too
new 98b088d Outlook detection with custom config tests, based on work by Justin Palmer TIKA-1708
new 3f784c6 TIKA-1708 If the Tika Config detector entry calls for MimeTypes, use the already created one, avoid creating a new empty one
new 4e7851d Tweak text to avoid a false match from the tika-core test dummy mimetype, and try to make constants use clearer
new 5e486df - ignore *.log.*
new 1139986 TIKA-1699: refactored GROBID parser to use GROBID rest API. Only introduced 2 deps, CXF client, and also org.json. very small and works great. Thanks to Sujen Shah for his initial work on the GROBID patch.
new b4f1c29 - fix typo: TIKA-1699
new a47881f - further guards
new 3e9ccb1 - TIKA-1699: statically load the rest URL properties inside of GROBIDRESTParser
new 31c5d2d - fix for TIKA-1712: GROBID parser fails in tika-app thanks to Sergey Beryozkin and Daniel Kulp for the idea for the fix.
new 4753ec6 Update changes for TIKA-1712
new d783608 TIKA-1699: fix bundle for GROBID parser deps.
new feda994 Changelog update
new 2861eb3 TIKA-1714 Allow --host=* to easily trigger listening on all addresses for the Tika Server
new f4bdbbe TIKA-1710 patch from Yaniv Kunda - Use java.nio.charset.StandardCharsets
new f71910e TIKA-1710 patch from Yaniv Kunda - Use java.nio.charset.StandardCharsets
new 19eb444 TIKA-1710 patch from Yaniv Kunda - Use Commons IO instead of the Tika Core IO copies, and java.nio.charset.StandardCharsets
new d5981c7 TIKA-1710 patch from Yaniv Kunda - Use Commons IO instead of the Tika Core IO copies, and java.nio.charset.StandardCharsets
new a15819d TIKA-1710 patch from Yaniv Kunda - Use java.nio.charset.StandardCharsets
new 9b82372 TIKA-1710 patch from Yaniv Kunda - Use Commons IO instead of the Tika Core IO copies, and java.nio.charset.StandardCharsets
new 8f598d7 TIKA-1710 patch from Yaniv Kunda - Use Commons IO instead of the Tika Core IO copies, and java.nio.charset.StandardCharsets
new b3c9e3f TIKA-1710 patch from Yaniv Kunda - Use Commons IO instead of the Tika Core IO copies, and java.nio.charset.StandardCharsets
new 7695a01 TIKA-1710 patch from Yaniv Kunda - Use Commons IO instead of the Tika Core IO copies, and java.nio.charset.StandardCharsets
new 613a391 Import fix
new 3e31179 TIKA-1711 As Tika needs 1.7, remove 1.6 specific bits of the bundle build. Patch from Yaniv Kunda
new d9c469f TIKA-1718 Enforce a consistent commons compress version between components
new 2d55f1d TIKA-1718 Upgrade to Commons Compress 1.10, and fix various TODOs that this permits
new e349b4f TIKA-1718 Add more Commons Compress supported formats
new a406151 One more format to add support for
new 2bf1790 TIKA-1710 Guava is no longer required, we have StandardCharsets instead now
new 93f8d19 Changelog update
new fe421fd Bring in line with other parsers with special InputStream requirements, by using TikaInputStream TIKA-1710
new 9e19740 TIKA-1722: Tika methods that accept a File needlessly convert it to a URL
new be5f57d TIKA-1721: Replace IOExceptionWithCause in ForkClient
new 6c2abfe TIKA-1720: Collect multiple exceptions in TemporaryResources.close() using Throwable.addSuppressed()
new 24b8c2a TIKA-1716 change default /rmeta content handler to xml and allow users to specify which content handler to use for content
new 8938fdf TIKA-1719: Utilize try-with-resources where it is trivial
new 3497ea6 Test HWP files from Mungeol Heo from TIKA-1728
new e7a3a49 TIKA-1728 HWP v5(+?) detection
new 193b5bb TIKA-1728 Fix the HWP v5 mime type hierarchy
new 03af7fb Fix license headers and reformat in tika-example
new dba2033 Migrate phone numbers example to file walk API
new b3665cd Add interruptable parsing example
new c31fe55 tika-example: explicit locale in String#toLowerCase
new 0ae1f64 TIKA-1734 via Yaniv Kunda -- use java.nio.file.Path in TemporaryResources
new ffb46d6 TIKA-1657 Update the example of dumping a Tika Config to support different output modes, for Translators and Detectors
new d6d180c Parser updates for config dumping
new e607d65 Patch from Yaniv Kunda from TIKA-1750 - avoid NPE in CachedTranslator if no underlying translator is available
new 83a3bb9 Expand the Tika Config dumping support for parsers
new f3cd6ad Expose the ServiceLoader used by TikaConfig, and use that to support serialising the service loader config xml section
new cec1530 TIKA-1744 via Yaniv Kunda -- upgrade TikaInputStream to use Path. Thank you, Yaniv.
new 71d72df TIKA-1747: migrate to Path from File in tika-batch
new e07bbf8 TIKA-1752: use j.n.f.Path in o.a.tika.detect
new 44462fb Reformat to avoid tabs and use JUL for logging
new 3f99f45 - Files isn't always present (just found test case on older version of GDAL)
new 285a462 TIKA-1707: upgrade to POI 3.13
new 191c03d TIKA-1742 prevent infinite recursion while processing inline images in PDFs by limiting extraction to unique images per page...following Tilman Hausherr's solution on PDFBox
new 36c2ea4 can't have assumeTrue in a try{}catch{} block: http://stackoverflow.com/questions/8736506/custom-junit-runner-doesnt-ignore-tests-with-assume-assumetruesomefalseconditi
new 5d50c2e clean up from TIKA-1742 and TIKA-1748
new 91422cb TIKA-1757 and TIKA-1758. Mea culpa. Thank you Uwe Schindler and Yaniv Kunda
new bbbfaac fix two unchecked operations
new e51adae TIKA-1756: update forbiddenapis to 2.0 via Uwe Schindler
new 5bf930e TIKA-1744 tidying up via Yaniv Kunda
new 0c86583 TIKA-1741 Include CTAKESConfig.properties within tika-parsers resources by default
new 8d8823a TIKA-1765
new 2a560eb TIKA-1755 make div and other formatting more consistent btwn PPT and PPTX
new 2aa5b9e TIKA-1736 upgrade jackcess-encrypt
new 3d9454b TIKA-1762 - Create a Configurable ExecutorService in TikaConfig
new 1996cdf TIKA-1772 WebVTT mime entry from Alexander Widera
new ae80661 Test JP2 (JPEG2000) file from Andreas Hirtzel from TIKA-1773
new 96980ec JPEG2000 (jp2) detection tests
new f949ddf TIKA-1772 Test WebVTT file from Alexander Widera, mime magic for it, and detection tests
new 4bb76d3 Fix for TIKA-1771: lower xhtml magic priority to ensure emails are detected as message/rfc822 contributed by Jeremy B. Merrill <je...@nytimes.com> this closes #58.
new 6e6359e Fix for TIKA-1772: Mimetype of VTT files contributed by Alexander Widera <wi...@chemmedia.de> this closes #59.
new 6f6a764 Fix for TIKA-1745 Add methods accepting java.nio.file.Path to org.apache.tika.Tika and org.apache.tika.parser.ParsingReader contributed by Yaniv Kunda.
new 29f0bf0 Fix for TIKA-1746: modify TikaFileTypeDetector to use new detect method accepting java.nio.file.Path contributed by Yaniv Kunda.
new 0a1a884 Fix for TIKA-1751: Use java.nio.file.Path in TikaConfig contributed by Yaniv Kunda.
new 424da1d Prep CHANGES.txt for 1.11 rc 1
new b66fca8 [maven-release-plugin] prepare release 1.11-rc1
new 986d9f9 [maven-release-plugin] prepare for next development iteration
new 5a7027a TIKA-1777 fix regression in ppt spacing, patch from Andreas Beeker
new ff31d82 Bump changes.
new 63351d1 TIKA-1777 wasn't fixed in time for 1.11. Fixing CHANGES.txt to reflect my earlier entry error
new f43de5a TIKA-1782 allow XHTMLContentHandler to pass attributes of html element via Markus Jelsma
new 9f680b0 Add Tika Facade parse methods for Path and File which take a Metadata object, to mirror the existing InputStream one. This closes #60 from GitHub
new 2e60efb TIKA-1507 - Moved tika-external-parsers.xml to tika-core to prevent OSGi split package issue.
new a960709 TIKA-1786 -- clean up logging in tika-batch
new 9e9ea27 TIKA-1792 ASiC E and S mimetypes, detection and tests. Files and mimetype from Roberto Benedetti
new 9c6f81c Tweak ASiC comment and priority based on feedback from the spec
new f73655a TIKA-1793 Add rfc822 email detection for common thunderbird message first headers
new c2895d5 TIKA-1791 GeoParser fix for models in a jar file, from Thamme Gowda N. This closes #63 from GitHub
new 58df156 Fix inconsistent whitespace
new 02bcf20 TIKA-1791 Comments and logging
new 6750a05 Fix inconsistent whitespace
new 74665c0 Changelog update
new f682eee TIKA-1795 RTFParser should set, not add, mime type
new 702b6c4 Fix for TIKA-1787: Include Stanford Name Entity Recognition in Tika contributed by Thamme Gowda N and Yueheng He this closes #61 this closes #62
new 9de407a Fix for TIKA-1787: Include Stanford Name Entity Recognition in Tika contributed by Thamme Gowda N and Yueheng He this closes #61 this closes #62
new a147d7c Fix for TIKA-1798 Parser for Video Similarity using PooledTimeSeries metric contributed by Aditya Dhulipala and Chris Mattmann this closes #64.
new 88b0581 Add an ignore for the on-demand downloaded NER files
new ded5294 Fix for TIKA-1803 Use lucene-geo-gazetteer REST API in GeoTopicParser contributed by Madhav Sharan msharan@usc.edu this closes #65
new 3574c9c Fix for TIKA-1815 Text content from parser is empty when NamedEntityParser is enabled contributed by Thamme Gowda <tg...@gmail.com> this closes #67
new caa2b5a Fix for TIKA-1816: Lenient testing for NamedEntityParser contributed by Thamme Gowda <tg...@gmail.com> this closes #68
new efaf482 TIKA-1817 Mime magic for AutoCAD DXF in Ascii and Binary, plus the related DXB
new e9718f9 TIKA-1817 Test DXF ASCII file, and detection unit test
new 61671f5 TIKA-1820 Upgrade rome to 1.5.1 && TIKA-1516 Downgrade Rome dependency to 0.9 to avoid nasty NPE
new 784c0a4 TIKA-1821 Support for 1 and 3 byte length PKCS7 DER encoded magic (needs neater way, to follow)
new f35351e Try to make the common parts clearer for the DER-encoded PKCS7 signature (length comes between 0x308. and the pkcs7 object)
new 3e7e335 TIKA-1826: should use getValues instead of just get in Tika gui
new 40b4a6d Fix for TIKA-1816: Lenient testing for NamedEntityParser contributed by Thamme Gowda <tg...@gmail.com>
new 489ab93 Fix for TIKA-1834: Fix for GeoTopic parser holding state while running Tika server contributed by smadha <ms...@usc.edu> this closes #71.
new fe841bc TIKA-1835: LinkContentHandler skips iframe and rel tags
new 52b82bd fix for TIKA-1840 contributed by zetisam
new 7d43bd7 fix for TIKA-1840 contributed by zetisam -- fixed indentation
new c42b5ad The testGetJSON() method had a strange cast to (Object) that I removed to improve readability and maintainability. This was identified by findbugs rule BC_IMPOSSIBLE_CAST.
new efb645e Update to record change for GH #73 contributed by Marc Breslow <ma...@devfactory.com> this closes #73.
new b4b5316 Merge branch 'TIKA-1840' of https://github.com/zetisam/tika into TIKA-1840
new 1bc6176 Fix for TIKA-1840 contributed by Sam Heijens <sa...@zeticon.com> this closes #72
new 9fa7a4d Changes.txt for 1.12 release.
new d0d9013 Prep pom.xmls for release - remove all SCM tags except for tika-parent. Update scm tags to Git. Prep for 1.12 release.
new 4b4246c Update SCM connection.
new 39b9c1c Rollback release.
new 2eb6715 Update SCM url.
new 809370e Upgrade Git SCM provider and Maven release plugin to 2.4.2 and 1.8.1 respectively to get around http://stackoverflow.com/questions/15166781/mvn-releaseprepare-not-committing-changes-to-pom-xml
new c0d2b4f [maven-release-plugin] prepare release 1.12-rc1
new 5c0ef63 [maven-release-plugin] prepare for next development iteration
new 38fbc50 TIKA-1823 Sample AutoCAD 2010 DWF file
new 6a09233 TIKA-1823 AutoCAD DWF mime magic and subtypes
new 6ef9c94 TIKA-1830 upgrade pdfbox to 1.8.11
new 256209a TIKA-1830 upgrade pdfbox to 1.8.11...updated CHANGES.txt file. doh!
new d685742 Added NLTK NER
new 2b99eea Merge remote-tracking branch 'upstream/master' Integrated NLTK into Tika Parsers by using endpoint as NLTKRest
new db2b475 Update NLTKNERecogniser.java
new 59ddcaa Update NLTKNERecogniserTest.java
new 892beca Update NLTKNERecogniserTest.java
new 25cee54 TIKA-1799: upgrade to POI 3.14-beta1
new 6ac99bf TIKA-1799: upgrade to POI 3.14-beta1, cleanup
new 57ae2c5 Test PKCS7 Signature files produced by CADES, from Alessandro De Angelis TIKA-1821
new 046e43f PKCS7 signature detection tests, using test files from TIKA-1821
new fabeac9 minimal cleanup while working on TIKA-1849: turn test back on.
new 1e0159b TIKA-1845: fix npe created by combination of error in RTFEmbObjHandler and failure to handle null in TikaResource
new d8a2fc0 Test JS file that includes <html in it, based on JS from the ComDev website TIKA-1141
new d740f5d Lower the priority of <html later in the file header
new 557b370 Unit test for detecting JS files
new 6c0b790 Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tika
new 559557a TIKA-1854: add handling for embeddedStorageClassId in MSOffice docs (patch from Daniel Bonniot de Ruisselet)
new c5b9cb7 clean up tests in server that used to rely on EvilParser (early name for MockParser)...discovered while working on 2.x branch TIKA-1 8 5 1.
new 14ca320 Used Apache CXF WebClient
new 542bebc Record change for TIKA-1835.
new 2eb49a7 TIKA-1856 Upgrade the Ogg dependency for the truncated files fix
new fc801d1 upgrade sqlite-jdbc to 3.8.11.2
new 28b9a66 Briefly describe the parser, and link to the wiki for more details
new 1b14b39 nltk modification
new f054bcd Merge remote-tracking branch 'upstream/master'
new 13d772a TIKA-1869 update Jackson to latest version 2.7.1
new 0bd05ce TIKA-1870 refactor RichTextContentHandler into tika-core from tika-server so users if needing it don't need to depend upon tika-server
new 08e38bb Update changelog for Jackson upgrade from John Patrick from TIKA-1869. This closes #75 from github
new 3b7922d TIKA-1870 JavaDoc and Test coverage for RichTextContentHandler that lived in tika-server
new ac4c0b2 created NLTK host server properties
new 6c595fb Merge remote-tracking branch 'upstream/master'
new 0c03008 TIKA-1874 fix potential npe
new 1882def Merge branch 'bugfix/TIKA-1870' of https://github.com/nhojpatrick/tika
new ed762b7 TIKA-1870 Move RichTextContentHandler from Server to Core, contributed by John Patrick. This closes #77 from Github
new a13369b fix for TIKA-1876 contributed by manalishah
new 7ebe007 fix for TIKA-1876 contributed by manalishah
new c809690 fix for TIKA-1876 contributed by manalishah
new 3a7e24c fix for TIKA-1876 contributed by manalishah
new 114d0ff fix for TIKA-1876 contributed by manalishah
new cdb684d fix for TIKA-1876 contributed by manalishah
new 602d237 fix for TIKA-1877 contributed by prasadns14
new 7801007 Update tika-mimetypes.xml
new 0dbd69c updated with changes
new e147de3 resolved conflicts
new dbefe98 TIKA-1857: add basic XFA extraction support via Pascal Essiembre.
new 7c245fa TIKA-1857: add basic XFA extraction support via Pascal Essiembre.
new 3fbc03c Fix for TIKA-1876 Integrate Natural Language Toolkit (NLTK) into Tika to perform Named Entity Recognition contributed by Manali Shah <ma...@gmail.com> this closes #80
new 9056894 Fix merge conflict.
new 3aa1dca TIKA-1657 move xmlification of TikaConfig to tika-core. Thank you, Nick!
new 5a34107 TIKA-1657 move xmlification of TikaConfig to tika-core. Thank you, Nick!
new 9a1ba94 Fix for side effect of TIKA-1857-- javax.xml.stream is no longer optional. Thank you, Bob Paulin, for diagnosing this!
new 46deb4d Added .hfa mime type mime-type.xml
new 86579ec updated mime magic for cab, quicktime, fits and netcdf based on fht analysis on polar-data dump
new e5d348d Fix for TIKA-1886 provided by nandan-pc
new 355a7d1 Patch from prasadns14 from TIKA-1875: NetCDF mime magic, and detection unit test. This closes #78 from github
new e105088 Merge branch 'TIKA-1877' of https://github.com/prasadns14/tika
new 1299c9e Rename the test file for TIKA-1877 to better match our test file naming pattern. This closes #81 from github
new 963a916 TIKA-1878 Make the Apache SIS version a property, to allow for easier upgrades
new 1b7009d TIKA-1878 - Upgrade Apache SIS to 0.6. This closes #79 from github
new bee1a87 Better express the MP4/QuickTime relationship in our mime type hierarchy
new b878281 Test CAB file and CAB file descriptor, based on the existing archive test file set TIKA-1890
new f7d3097 TIKA-1890 Mime magic for CAB files, and unit tests for detection
new 74e71eb Magic for Mobipocket Ebook and ESRI Shapefiles from TIKA-1892 from Suman Kashyap
new c5d4ec6 TIKA-1894: Add XMPMM support to PDFParser and JpegParser via Jempbox
new b9d5c22 Add missing dependency on tika-test-resources
new 973204e Roll in new lang detect support in new module
new e9f5f42 Add project.build.sourceEncoding to properties
new e38512e Add tika-langdetect dependency in other modules
new 3a7a94c Remove built-in lang detector
new f9113be Move base lang detect classes to core
new 3bee1d9 Make detector "discoverable", use that everywhere
new 1caa4fb TIKA-1853: upgrade to POI 3.14-final
new 68225a9 fix for TIKA-1872 contributed by trevorlewis
new 260d77b Update tika-mimetypes.xml
new b2cf231 Add uniformity to parser parameter configuration.
new ae51417 remove unwanted TODO:
new db1c0e6 Rearranged the order of mime-type x-erdas-hfa in tika-mimetypes.xml, changed the test file name and reduced size of test file
new bf2d405 TIKA-1899 -- didn't add a test because triggering file was larger than the fix, metaphorically.
new 64db961 Added a TikaConfigException, params getter
new 0d69ca7 Test Case updated with newer exception and getter
new a1243c7 TIKA-1900 - Max Pool size must be set before core pool size in java 9.
new a991394 Add missing license headers
new eafe280 Added missing license headers
new 4a40cf5 TIKA-1905 - Fix JavaDoc Failures on Java 8
new 73aaa1b TIKA-1906: ExternalParser No Longer Supports Commands in Array Format - Added check for command length and reintroduced copying all arguments for arrays
new 6a8bb0d resource leaks discovered while working on 2.x (TIKA-1855).
new 1924c3f Merge remote-tracking branch 'origin/master'
new 98eb56e TIKA-1285 -- upgrade to PDFBox 2.0.0
new 9ebf066 TIKA-1285 -- upgrade to PDFBox 2.0.0 -- for now turn off tests with different results in different OS. TODO: figure out the cause and turn back on!
new 959c9ad Test vnd.mif file from Steffen Netz from TIKA-1898
new c94236a Remove erroneous backslashes before already-escaped < entries in vnd.mif mime magic, plus unit tests. Thanks to Steffen Netz in TIKA-1898 for help with this
new 9cdfc4a Merge branch 'TIKA-1872' of https://github.com/trevorlewis/tika into TIKA-1872
new 3279a11 Depend on 1.13-SNAPSHOT, not 2.0.
new 34db935 TIKA-1918: make outputSuffix optional in tika-batch
new 01109c8 Merge remote-tracking branch 'origin/master'
new 67fac45 TIKA-1919
new 404d420 TIKA-1920
new 30e4e61 TIKA-1921 -- note: need to set default timezone to UTC both programmatically and in surefire plugin at least with Java 8.
new 6950dcf TIKA-1922 -- upgrade jackcess to 2.1.3
new 8580fc2 TIKA-1923 -- upgrade bouncy castle to 1.54
new 9a71fa7 TIKA-1924 - upgrade iso parser to 1.1.18
new fc725b9 Upgrade changes.txt to reflect upgrades.
new 53f29e0 Upgrade changes.txt to reflect upgrades -- fix spacing and small clean up
new 63bb154 fix for TIKA-1926 contributed by hasanayesha
new 5e170d4 TIKA-1916 -- Thanks to fxfixer (Nick C) for opening the issue and submitting a patch and test file. This closes #94.
new 4f22b08 TIKA-1927 -- Thanks to fxfixer (Nick C) for opening the issue and submitting a patch. This closes #98. xerial's sqlite now stores timestamps differently -- varchar. I got rid of the hack to handle that. I also added more robust handling of nulls throughout. The parser should return no value/empty string for null values of all types.
new 48d17b6 TIKA-1927 -- update changes . This closes #98
new 5a74add TIKA-1927 -- add catch for UnsupportedOperationException, which SerialBlob can throw in some versions/OS of java.
new 6a3aa1a TIKA-1929 -- ensure deletion of temp file no matter which type of InputStream is used. Clean up resources correctly in sqlite unit tests.
new 1c85418 TIKA-1929 insignificant cleanup
new e0005b3 TIKA-1929 cleanup more tests
new 7a4301e TIKA-1914 -- need to startDocument in ExecutableParser. Thanks to Nick C (fsfixer) for this. This closes #93
new e920647 TIKA-1033 -- add detection for embedded MSGraph.Chart objects. Also add two convenience methods to TikaTest.
new 27f1b8f TIKA-1932
new e2ef2e9 TIKA-1932 - with correct pattern
new ca1c265 TIKA-1932 - with correct pattern, third time's the charm...argh. This will avoid incorrectly closing a TikaInputStream if a TIS is passed in.
new cffda85 TIKA-1934
new d692390 TIKA-1934
new f89a19f TIKA-1935
new 71f8423 TIKA-1936 -- clean up parsers and tests that aren't cleaning up tmp files, with heavy refactoring of PDFParser tests.
new 270b8a9 TIKA-1936 -- small cleanup in pdfparser test
new 61d8ec7 TIKA-1933
new d184e9b TIKA-1944 Magic for apple single/double files from Nick C
new 7e2c089 Grobid NER
new 5f859fb mitie ner parser added
new 2210c81 runtime binding to mitie
new ab09b0c read all entities from NLTKRest
new f9a716a remove starred imports
new 26081ca TIKA-1949 -- upgrade commons compress to 1.11
new 81bc3cd Merge remote-tracking branch 'origin/master'
new 08e932b fix for TIKA-1943 contributed by Mark Duske
new f509917 fix for TIKA-1943 contributed by Mark Duske
new 86145d9 fix for TIKA-1943 contributed by Mark Duske
new b5e246f code cleanup
new b4404c3 TIKA-1948 -- handle per page IOExceptions more robustly in PDFParser
new e032ac6 TIKA-1948...not sure why these weren't committed..argh.
new 3d49131 Add DOM and Stax parsers to ParseContext
new 172c584 clean up to "Add DOM and Stax parsers to ParseContext"
new dfea473 TIKA-1931 -- revert isoparser to 1.1.7 because of rare permanent hangs on some files starting with 1.1.9
new 8eac1c5 TIKA-1931 -- revert isoparser to 1.1.7 because of rare permanent hangs on some files starting with 1.1.9, with update to CHANGES.txt
new 1c5e96c TIKA-1937 LinkContentHandler wasn't extracting links from script tags via Joseph Naegele
new 52851a4 TIKA-1895 upgrade to POI 3.15-beta1
new da1fe24 Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tika into TIKA-1872
new b47c364 Merge branch 'master' of https://github.com/NamithaGS/tika into TIKA-1881
new e4dc21c Record entry for TIKA-1881.
new d40f8d4 Update changes.txt and record #85 and fix conflicts.
new 2e0c9bd Skip PooledTimeSeriesParser if it's not available
new 1e2bd89 Added CompositeParser workaround
new 3f1d6ae Resolve TIKA-1883: add SFDU MIME magic and type.
new 9b92734 Record TIKA-1883.
new acad4cb This closes #87.
new 9730a36 Merge branch 'TIKA-1886' of https://github.com/nandan-pc/tika into TIKA-1886
new f61a4ed Record change for TIKA-1886
new 1f96a0e Fix for TIKA-1882: .cab, .xar, .mobi and .mov files from the TREC-DD-Polar dataset. This closes #82.
new 3d59471 Record change for TIKA-1882 this closes #82.
new dab1039 TIKA-1956 -- prevent NPE when trying to get embedded image offset in WordParser
new 483c162 EPub mimes can end in new lines; let's trim the mime for easier comparability. Too small for jira or test...imho.
new 9f3a32a Added imports
new 443ecd3 Merge branch 'TIKA-1844' of https://github.com/cafed00d4j/tika
new 72d76f8 TIKA-1844 pass through POT if it isn't available -- via Aditya Dhulipala. This closes #107.
new 2ec36ff TIKA-1844 clean up indentation, clean up streams in case of exceptions, make isAvailable check happen only once. This closes #107.
new 8487fa7 TIKA-1844 clean up indentation, clean up streams in case of exceptions, make isAvailable check happen only once. This closes #107.
new 0dc29d0 TIKA-1950 -- clean up jdom version conflicts
new f39c087 removed logs
new 80b27e6 Merge remote-tracking branch 'upstream/master' into TIKA-1913
new 2cce66d TIKA-1924 -- upgrade to isoparser 1.1.18. Add workaround to prevent infinite loop. Remove aspectjrt license because isoparser 1.1.18 no longer relies on it.
new c9d508d TIKA-1924 -- add back qmino from last commit and clean up earlier work on ParseContext.
new ea0e68b Updated TextLangDetector and fixed build errors
new e558f5d fix: changed to post request from get request
new 055bbd4 Merge branch 'TIKA-1872' of https://github.com/trevorlewis/tika into TIKA-1872
new a484e5e fix: remove test and handle null quantities
new 20c1cfe Merge branch 'TIKA-1872'
new 2c72c42 Update with information about TIKA-1872, TIKA-1696 and TIKA-1723.
new 2caf3da Resolve conflicts in CHANGES.txt
new b5fe00e removed starred imports
new cd06762 model path flaw
new 5a8e269 Merge pull request #1 from yashtanna93/master
new a30b143 Merge branch 'TIKA-1926' of https://github.com/hasanayesha/tika into TIKA-1926
new e0ca3b5 Update changes for TIKA-1926.
new eb51d9b Merge branch 'master' of https://github.com/AravindRam/tika into grobid-quantities
new a353200 Update changes.txt to reflect Grobid Quantities NER merge.
new 9ecb183 Merge branch 'TIKA-1917' of https://github.com/manalishah/tika into TIKA-1917
new 2d06bc2 Update changes for TIKA-1917.
new aebaee6 Merge branch 'master' of https://github.com/reevapp/tika into TIKA-1943
new e2fdcaa record changes for TIKA-1943.
new d711ac9 Merge branch 'TIKA-1913' of https://github.com/manalishah/tika into TIKA-1913
new f827026 record changes for MITIE and TIKA-1913.
new d0c5259 Fix compile error.
new 792c7cf use assumeTrue to pass tests if not connected to Yandex.
new ab7c325 clean up earlier work on ParseContext.
new 0cdf17d Adding parser for ICNS files
new 7a543c8 TIKA-1924 - needed to add sanity check in map()
new 92a4835 TIKA-1894 -- fix potential NPE in XMPMM extraction
new ee60bc6 TIKA-1894 -- fix potential NPE in XMPMM extraction
new 64b7ded TIKA-1960 -- put legacy language detection code back in so that we don't break the API without some warning. Deprecate it thoroughly.
new 19ed261 TIKA-1960: Added Welsh Corpus back into test resources to support deprecation of legacy language code.
new b6d23c1 fix for TIKA-1938 contributed by naegelejd
new d4fb28f TIKA-1343 Create a Tika Translator implementation that uses JoshuaDecoder
new 6a19918 whitespace cleanup of SourceCodeParser
new 9f09a55 TIKA-1964 -- clean up incorrect use of BigInteger.add()
new e2e10f9 Grobid Quantities parser extracts types
new eb05d04 fixed small whitespace issue
new 585ab9b Merge github pull #110 for TIKA-1893
new 567f7f7 Fix whitespace and keep ordering
new fba811b Whitespace
new 7f486e6 Comment updates
new c93ff3e TIKA-1966 Converted versions of test iWorks files from latest iWorks for iPad
new 4aff483 Merge branch 'master' into TIKA-1343
new 8e4c3ff Merge branch 'master' of https://github.com/cmenekse/tika into TIKA-1965
new c9bc910 TIKA-1965: Added types to Grobid quantities parser. Pull Request by Can Menekse.
new 91313e3 Reverting incorrect addition of : to header
new d447193 TIKA-1885: Addition of ZeroSizeFileDetector based on Pull Request from Adesh Gupta.
new 5f0e930 Added CHANGE information for TIKA-1885 and TIKA-1965
new eede044 TIKA-1885: Updated test to specify charset in getBytes()
new c991452 Update CHANGES.txt for 1.13 release.
new ff066dd Added ASL2.0 Headers
new cc5f841 [maven-release-plugin] prepare release 1.13-rc1
new be66dac [maven-release-plugin] prepare for next development iteration
new a91d083 TIKA-1955: Updated check to read from stream to avoid misreporting due to blocking
new 1caba4d [maven-release-plugin] rollback the release of 1.13-rc1
new 114b604 TIKA-1885: Updated MimeType to application/x-empty to match Unix file command
new 54c5909 Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tika
new 4b2abf6 [maven-release-plugin] prepare release 1.13-rc1
new 6ed2129 [maven-release-plugin] prepare for next development iteration
new 1180dc3 Revert "[maven-release-plugin] prepare for next development iteration"
new e7e0886 Revert "[maven-release-plugin] prepare release 1.13-rc1"
new 75fc12b TIKA-1955: Updated to use mark() and reset()
new 58c82ca [maven-release-plugin] prepare release 1.13-rc1
new 079c25b [maven-release-plugin] prepare for next development iteration
new 84619e0 Revert "[maven-release-plugin] prepare for next development iteration"
new b216b87 Revert "[maven-release-plugin] prepare release 1.13-rc1"
new 386b68b [maven-release-plugin] prepare release 1.13-rc1
new da5bbbe [maven-release-plugin] prepare for next development iteration
new 46d5775 fix for TIKA-1938 contributed by naegelejd
new 69852e4 TIKA-1454 -- added initial hyperlink extraction for ppt, pptx, xlsx. Areas for improvement: limit links to external links for ppt and pptx. Fix NPE in cell.getHyperlink within ppt in POI
new c1a3ce0 Merge remote-tracking branch 'origin/master'
new bb78082 TIKA-1454 -- clean up and add entry to CHANGES.txt
new bc0b1f7 TIKA-1958 - add mime detection and parsers for 2003 MSWord XML (wordml) and MSExcel XML (spreadsheetML)
new b2821d9 TIKA-1928 Fix detection for filenames containing a #, avoid mis-detecting that part as a page anchor
new e08d006 TIKA-1971 - add another magic for rfc822
new 09cc658 TIKA-1970 - Mac Mail date of interesting format not parsed by james mime4j
new de6dbd0 Merge remote-tracking branch 'origin/master'
new 7408531 TIKA-1971 - failed to revert earlier work on dbf - this is a clean up from the TIKA-1971. Thank you, Nick.
new 89881ce TIKA-1977 set title vs add title -- also clean up of javadoc href and whitespace from TIKA-1970
new 4a324ff TIKA-1977 avoid use of forbiddenapi
new 534347d TIKA-1976 improve date parsing in rfc822 parser
new 6e42062 TIKA-1976 fix forbiddenapis
new bb46c0e TIKA-1979 -- add message on stdout when tika-app's server. Add deprecation message on stderr for tika-app's server
new e780d56 merged upstream changes and resolved conflicts
new b64612d Update javadoc with @since
new fc4f13d TIKA-1978 Invocation of java.net.URL.equals(Object), which blocks to do domain name resolution, in org.apache.tika.parser.geo.topic.GeoParser.initialize(URL)
new e74f663 TIKA-1513 -- add mime detection and parsing for dbf files. Thanks to Nick C for the mime definition and Luis Filipe Nassif for collaboration.
new 167966e Merge remote-tracking branch 'origin/master'
new cb492f4 TIKA-1513 -- add mime detection and parsing for dbf files. Thanks to Nick C for the mime definition and Luis Filipe Nassif for collaboration.
new 608fbf5 TIKA-1985 -- add charset handling to field names; add datetime processing; rework calculation of number of columns to handle extra zero-padding at end of header. Waiting on permission for test file.
new b47f162 TIKA-1985 -- remove errant printstacktrace
new 0186992 Added support for type for runtime parameters
new 9e08a6b Updated test case with type checking
new dcaeccb TIKA-1513 -- update mime type according to Nick Burch's recommendation, other small import clean up
new 1a04f80 TIKA-1985: fix auto-import mis-clean up
new 16290d8 TIKA-1837 -- strip comments before trying to find encoding in HTMLEncodingDetector
new c6eefbd FIX: return value typo
new 6ad18f4 TIKA-1990 -- make sure to include JPEG filters when exporting jpegs embedded in PDFs
new aad23d9 Added @Field Annotation to support auto initialize params from config
new 67941a6 Using TikaConfigException instead of RuntimeException
new 40f8ec9 TIKA-1992 -- check for duplicate images via COSStream not object name.
new 3ca9214 TIKA-1992 -- check for duplicate images via COSStream not object name.
new fbd766e TIKA-1992 -- check for duplicate images via COSStream not object name....sorry.
new a20c46c TIKA-1992 -- check for duplicate images via COSStream not object name...fourth time is the charm...ugh.
new ea47b71 Merge branch 'TIKA-1508' of https://git-wip-us.apache.org/repos/asf/tika into TIKA-1508
new 7aeb95d TIKA-1994 -- integrate OCR with PDFParser
new 27d9290 Remove duplicate code
new 1202f45 Merge branch 'TIKA-1986' into TIKA-1508
new 1af1078 TIKA-1994 -- integrate OCR with PDFParser, update CHANGES.txt
new 49ddf6e Merge remote-tracking branch 'origin/TIKA-1508' into TIKA-1508
new 18ab8f9 TIKA-1508 add some unit tests for ParameterizedParserTest
new 853750d TIKA-1508 proof of concept with on parameter on PDFParser
new 3e14505 TIKA-1999 add limit to number of events extracted from the XMPMM section by the JempboxExtractor
new 99aa587 TIKA-1999 small fix and update CHANGES.txt
new 71726bc Fix configure issue with decorated parsers
new e48d191 Add test case config file
new 9b5dc7f External Parser now have consumer for ignored lines, Fix TIKA-2002
new eccc153 Added a utility to load and instantiate classes
new 2184e2c Object recognition parser, tensorflow based implementation, and test cases for these
new 0305cfb Explicit Locale
new f3e9d82 Feedback incorporated - longer keynote + confidence output to metadata
new 6b56ad1 Added the script file for Tensorflow classify images
new 2a61205 Script copied from class path rather than HTTP GET and Code conventions correction for static final constants
new 06633cc TIKA-1996 -- upgrade to PDFBox 2.0.2
new ecdc403 Merge remote-tracking branch 'origin/master' into TIKA-1508
new 338db90 Start factoring out "configurable"; change signature of ParseContext's setParam to (Class, Param); add check for illegal field being specified in TikaConfig.
new 2140858 Merge remote-tracking branch 'origin/TIKA-1508' into TIKA-1508
new ef1f7b9 fix conflict
new 03d3824 Remove Configurable entirely; update PDFParser example for one field.
new 4d308fd TIKA-2006 -- add mime definitions for iCal and vCalendar
new d405172 Add mime definition for Windows Media Metafile (TIKA-2004).
new 01a9b6d Add mime detection and parser for Microsoft Owner File (TIKA-2008).
new f7fe685 TIKA-2008 -- change owner metadata key from TikaCoreProperties.CREATOR to TikaCoreProperties.MODIFIER
new 592ae6a TIKA-2008 -- change owner metadata key from TikaCoreProperties.CREATOR to TikaCoreProperties.MODIFIER
new acf031a TIKA-2009 -- add mime magic for djvu
new 6291648 make sure to test magic for vcs/ics/asx
new ade60ed TIKA-2011 -- add mime detection for Endnote Import file
new d9dcd59 Merge remote-tracking branch 'origin/master' into TIKA-1508
new 0132037 TIKA-1986 -- add Initializable, strip out handling of params passed via ParseContext in PDFParser
new 30b0f66 TIKA-1986 -- revert parsecontext to ab7c325 and update PDFParser to handle non-primitive parameter setting
new f8fe50a TIKA-1986 -- allow params to be passed into initializable, delete configurableparser
new 4c7481e Added Documented Annotation
new 027625d Pulled upstream changes (TIKA-1508) and resolved merge Conflicts
new 31cf12d updated with the new changes of TIKA-1508
new 5101023 Tesseract may see the t in haystack as a ! some times...
new a46ffac Upgrade Commons Compress to 1.12 (supports progress on TIKA-1358)
new d6981ad TIKA-1358 add preliminary detection for iWorks 2013 file types
new 52ea9ba Detection magic for POI-generated OOXML files, which have _rels before content type, plus test
new 7ae760e TIKA-2019 -- parsers for 2003 MS xml files fail to add spaces/tabs correctly when using the ToTextHandler
new 81279a1 Merge remote-tracking branch 'origin/master'
new 2031de7 TIKA-2019 -- clean up -- move state variables to inner classes, convert protected to package private, add @Override on parse
new 48b27d2 fix for TIKA-2021 contributed by Zarana Parekh
new de84d71 fix for TIKA-2021 contributed by Zarana Parekh
new be8b433 creation of TIKA-2016 contributed by amensiko
new 47221b9 TIKA-2022 -- add applefile parser
new 0f3b0bd TIKA-2022 -- add applefile parser
new e6c2839 TIKA-2022 -- clean up test, change dependency on CloseShieldInputStream to commons.io
new c1cea20 TIKA-2023 -- clean up RTFParser to use EndianUtils
new 7db0ab6 TIKA-2023 -- clean up newlines and indenting
new b10f250 optional processing enabled
new 37695d4 added validation tests for new processing features
new 2c4670e TIKA-2022 -- clean up AppleSingleFileParser to use EndianUtils, shorten test file, make field types private
new b6d55ae TIKA-2024 -- extract original file name/location, initial patch: rtf, applefile, word2003, word, pdf
new 69d8250 TIKA-2024 -- remove debugging test
new 7cc610e TIKA-2026 -- improve extraction of embedded files from ppt, pptx and xlsx
new 52f04be TIKA-2026 --fix caps on test files
new a57d836 TIKA-2024 add location extraction for OLE1.0 embedded files
new 23a11ef fix getRecursiveJson -> getRecursiveMetadata in TikaTest, no json is involved here...
new bc6667c updated property name, removed orthogonal changes
new 6773d42 updated Javadoc for Tesseract config and parser
new fa30edd updated scope in pom.xml
new fe559b8 Merge master into TIKA-1343
new 27e999d updated config file
new c2a8ac1 formatting changes
new 12b0aee rebasing pom.xml for tika-bundle
new 6809282 formatting changes
new 1a46c59 added check for non-UNIX OS
new 95b2cd1 TIKA-2029: add some content for links so that we don't generate bad html <a href="http://tika.apache.org/"/>
new b3c09b4 formatting changes
new 4fd3e68 fix orthogonal changes
new 1c6cff8 Merge branch 'TIKA-2021' of https://github.com/Zarana-Parekh/tika
new 6f16480 Fix to work if ImageMagick isn't present. Fix forbidden APIs.
new 636060e Record TIKA-2021 change.
new c0320f1 TIKA-2030 - add processing for <text:s/> element in odt, thanks to David Pilato for identifying this.
new 8d29f7a Merge remote-tracking branch 'origin/master'
new ff187a0 TIKA-2030 -- fix test document so that it is correctly detected.
new 3ecdc0c Email with attachment for testing extraction issues
new 952fb54 TIKA-2037 RFC822Parser should wrap the James InputStream of embedded resources to avoid problems with downstream detection or extraction
new a383567 TIKA-2025 -- override general format in excel to extract 15 digit integers
new f00ab04 TIKA-2029 -- upgrade jackcess to 2.1.4
new 9e0a87e TIKA-1993 -- High throughput Tensorflow Inception based image classifier via: (1) GRPC and (2) REST API
new 3deea1b TIKA-1993 -- improve usability of docker container
new 72d2d88 TIKA-2042 MBOX magic and detection unit test
new 53c461a Changelog update
new d698d49 TIKA-2041, step 1, upgrade icu4j components; add back ebcdic and bump bytesRead back to 12000 from the "modern" 8000
new 7dc5c67 TIKA-2041 -- add unit test in HTMLParserTest
new f5b04b6 Merge remote-tracking branch 'origin/master'
new 71cb936 TIKA-2040 - prevent permanent hang/oom on corrupt chm file
new e539802 removed error print statements, static changes
new 85e5385 TIKA-2048
new 8a68b5d clean up MatParser
new bd9a9b9 TIKA-2041 - add important diffs between new copy/paste from ICU4J and legacy code which may have included Tika-specific mods.
new 9024f12 fix for TIKA-1980 contributed by naegelejd
new 5495ffc TIKA-1980 - via Joseph Naegele. This closes #121
new f77eb2b Removed GRPC implementation as it is redundant over REST
new a1d1a81 TIKA-1986 Fixed the outdated Bad parameter test case and removed dead code in comments
new 0096dd7 Log warn when confidence is not equal to size of objects.
new 52be425 1. use start/End document in handler; 2. populate metadata before handler is called. 3. make topN 2 in both REST and script configs.
new afb7e36 Merge branch 'TIKA-1508'
new da82df5 Update changes for TIKA-1508, TIKA-1993, TIKA-1986.
new c3fc92f Tickle to close Github issues. This closes all GROBID recogniser's - they have already been committed. This closes #116 this closes #117 this closes #118 this closes #119
new fde6717 Tickle to close. This closes #96.
new 6213cc1 This closes #115. Empty MimeType subsumes application/zerosize.
new 33dc408 This closes #109. No need for PersonaParser. Going to handle this in USC IR repo, no need for a parser.
new a5add3e Tickle this closes #120. Orthogonal changes that I don't understand to Lingo24 API.
new bf072bb Merge branch 'master' of https://github.com/Zarana-Parekh/tika into TIKA-2031
new 830685e record change for TIKA-2031.
new 5e4678e remove dependency on edu.usc.sentiment
new bec2f9b remove sysout
new c71e0b2 TIKA-2007 -- upgrade jackson, clean up unused dependency in tika-parsers
new 173ff59 TIKA-2065 -- upgrade forbiddenapis
new 069fa86 TIKA-2066 -- upgrade commons-io
new 52be682 TIKA-2067 -- upgrade maven plugins
new 80efc84 TIKA-1255 -- fix hyperlinks in doc/docx if there is formatting TIKA-2078 -- handle multiple runs within a hyperlink (docx)
new 3c0abc8 TIKA-2064 Mime types, with magic, for mostly-xml Stata DTA files. (Awaiting suitably licensed file for testing)
new 2222fe0 TIKA-2064 Test Stata DTA files from Michael Stepner, plus detection unit test
new 9130bbc Changelog update
new 27b9cf5 TIKA-2055 catch exception when totalTime out of unsigned int range in ooxml
new 1c0e600 Merge remote-tracking branch 'origin/master'
new 6ebbd40 clean up triplicate commons-exec defs...not sure how these got in here.
new 084379c Remove unused variable
new 07aea36 TIKA-2051 -- upgrade to PDFBox 2.0.3
new a1250ff Improve logging and trivial code conventions
new d50a693 Merge branch 'master' into TIKA-1343
new cc6f6dc TIKA-2013 -- upgrade to POI 3.15-final, make sure to add new close() throughout for MAPIMessage and NPOIFS
new 4153812 TIKA-2047 -- maintain mime info for mimes that are subtype of text/plain handled by TXTParser
new 2ae7206 TIKA-2069 -- extract macros from MSOffice docs
new 8a45f67 TIKA-2069 -- extract macros from MSOffice docs, fix tests to find target metadata object in any order
new 10507d0 add hOCR output format to TesseractParser
new 3a5431e TIKA-2093 -- add option for Tesseract's hOCR output, thanks to Eric Pugh! This closes #133.
new d612aea TIKA-2081 -- add fileUrl back into tika-server
new b58368f TIKA-2081 -- add fileUrl back into tika-server, update changes.txt
new e9e8d3b TIKA-2081 -- add fileUrl back into tika-server -- fix commandline options not to include '-'
new 98d75f6 TIKA-2095 -- include Tika version in GREETING
new ce07d8a TIKA-2057 - maintain DocInfo metadata in PDFs
new 5466468 typo in changes.txt
new 308d26f TIKA-2097 fix npe in MboxParser
new c33ac04 fix for TIKA-2098 contributed by alexshadow007
new 0a4b0e8 Merge branch 'TIKA-2098' of https://github.com/alexshadow007/tika
new 9b497d1 TIKA-2098 small clean up. Test for writelimitreached for each catchable IOException. Many thanks to Alexander Kazakov for finding this and submitting https://github.com/apache/tika/pull/134
new feac58b TIKA-2101 -- Don't call MAPIMessage.close()
new 5af482e TIKA-2106 -- need to lowercase hocr/txt suffix; thanks to Eric Pugh. This closes #136
new 1b72a38 TIKA-2110 -- include exception along with message
new bfd1d91 * Upgrade metadata extractor to 2.9.1 (TIKA-2113).
new 8e819c3 TIKA-2122 : add all headers from MSG and RFC822 files
new bf08ba9 TIKA-2122 : add all headers from MSG and RFC822 files, update changes file
new 02425b2 TIKA-2123 : digester fails when multiple digest values on large files; add more robust tests
new 88058a0 Prep for release of 1.14 RC #1.
new dbb6baa [maven-release-plugin] prepare release 1.14-rc1
new 687d770 [maven-release-plugin] prepare for next development iteration
new b3f1497 TIKA-2127 : npe if there is no notes master
new f19be22 Merge remote-tracking branch 'origin/master'
new a6b8e04 ugh...remove println statement from AppleSingleFileParser
new bc7216f TIKA-2133
new 7ca105e TIKA-2130
new 7fbf0f3 TIKA-2090 -- first draft
new 5657ae6 Merge branch 'TIKA-1343' of https://github.com/lewismc/tika into TIKA-1343
new dadbf55 TIKA-1343 Create a Tika Translator implementation that uses JoshuaDecoder
new 8d5eaaa TIKA-1343 Stabilize check for JoshuaNetworkTranslator availability.
new 4dd6fd1 TIKA-2090 -- add more areas where javascript might live and add ability to turn action extraction on/off
new 01163e2 TIKA-2144 - avoid npe if styles doesn't exist (odd, indeed, but if MSWord can handle it, we should, too).
new e215b9d Merge remote-tracking branch 'origin/master'
new b67373f TIKA-2056 Make ExternalParser.LineConsumer Serializable
new 011f338 TIKA-2056 Fixup (forgotten import)
new 15a9230 TIKA-2111 - set instead of add "Content-Type" in the ExecutableParser
new 2df68c8 improve test for TIKA-2098
new 75fa138 TIKA-2157 -- handle zip exception in embedded stream
new aadccbf TIKA-1933 -- one more place where we weren't properly closing the ForkParser and were leaving behind a tmp ForkParser jar
new 7b45c7c TIKA-1896 -- add test files and unit tests, no fix yet
new ab53bdc TIKA-2171 --upgrade SQLite to 3.15.1
new 7dda921 TIKA-2173 add other setters to PDFParser so that they can be configured
new c17d1b8 TIKA-2174 add jp2 and jpx to file formats handled by TesseractOCRParser
new 47ba703 TIKA-2159 -- first step
new 91cdce4 TIKA-2175 -- add extraction for inline jp2/jpx from PDFParser
new 1aff638 TIKA-2174 -- add .ppm to tesseract
new 98de288 TIKA-2174 -- fix jp2
new b97045a TIKA-2174/TIKA-2175 -- clean up
new e8bf985 TIKA-2170 -- allow users to configure timeout for ForkServer
new 2e325cb TIKA-2170 -- fix unit test to allow for different exceptions to be thrown for losing connection to server
new d647a23 Mimetype for SAS Xport (XPT) files
new a9a9e08 TIKA-2116 upgrade to POI 3.16-beta1
new 81fad8c TIKA-2179 -- add detection and parsing for word2006ml files
new 2df8567 TIKA-2169 -- fix xhtml markup caused by bug in OCR parser
new 361ffa4 TIKA-2096 -- automatically add AutoDetectParser for embedded documents if the user forgets
new 1cfd250 TIKA-2096 -- update CHANGES.txt
new 7b4f6fa TIKA-2096 -- fix example of not including embedded docs
new d19e472 TIKA-1321 -- add SAX based docx parser and integrate it with the recent 2006ml parser work -- initial commit
new fe20ecd TIKA-2187 -- change default behavior in experimental .docx parser to ignore deleted text. Allow configuration of including or not including text from .doc files.
new 09931fe TIKA-2187 -- fixed test
new 0e0f30d Merge branch 'pdf_javascript'
new 99b5924 TIKA-2090 -- add ability to extract PDActions from PDF files
new 40401e5 [TIKA-2189] fix for Default value mismatch for "enableImageProcessing" in TesseractOCRConfig.properties and TesseractOCRConfig.java
new 8943013 TIKA-2191 -- step1 -- add other docx tests and comment/ignore where appropriate
new f93d4e1 TIKA-2191 -- step2 -- add handling for docm files...extract macros
new 1aca10a TIKA-2191 -- step 3 -- clean up <b> and <i> tag handling
new 806eaf8 TIKA-2191 -- step 4-- add markup for embedded pics
new 4469ca2 TIKA-2191 -- step 5 actually extract images embedded in areas besides the body of docx/m
new 615bf75 TIKA-2192 - add extraction of embedded objects in DOM docx parser from more than just main document
new 5425d02 update changes for TIKA-2191 and TIKA-2192
new 3ee9fd5 TIKA-2191 - step 6(?) add list numbering, bookmarks and styles
new 192e3ca remove println...the horror...ugh
new faf6c2b TIKA-2191: fixes after regression testing on TIKA_1302 corpus: 1) add 'cr' and 'br' and 2) add 'template' to potential main story body parts
new 0f3fe38 TIKA-2191: fixes after regression testing on TIKA_1302 corpus: 1) add 'cr' and 'br' and 2) add 'template' to potential main story body parts -- git add test file.
new 0f78a31 TIKA-2191: convert Styles reader to SAX and store only styleId->styleName map.
new 533572b TIKA-2195: refactor MockParser to consolidate service loading and custom mime type into tika-core/src/text
new 85c3457 TIKA-2173: improve configuration of PDFParser via @Field
new 653b980 TIKA-2191 -- optimize branching in start and endElement based on corpus statistics
new 1d9445b Update to PDFBox 2.0.4
new 90cdf1f TIKA-2210 -- add experimental SAX parser for pptx -- this is a first cut. More refactoring is in order.
new ca37313 TIKA-2218 -- add a few more places where .pptx can include embedded objects
new 376318f TIKA-2220 - refactor new sax pptx and docx to reduce code duplication.
new c624104 TIKA-2221 -- correctly catch and convert encrypted document exception to EncryptedDocumentException in WordParser via Matthew Caruana Galizia
new bf12e5a remove printlns in ZeroSizeFileDetectorTest
new 2dbd651 TIKA-2219 - make sure to transmit encoding name in detectAll() via Pascal Essiembre
new 8e12ebe Merge branch 'bug_TIKA-2189' of https://github.com/dasbipulkumar/tika
new c83f87b clean up tabs
new 87c2ef3 New WordPerfect and QuattroPro parsers for TIKA-1946 contributed by pascal.essiembre
new ae44b9e TIKA-2190 -- add configurability for preserve interword spacing
new aa2407a add comment on outputType and trigger close of TIKA-2189. This closes #139.
new 0de63a1 Now throwing TikaException on unsupported QuattroPro format instead of logging a message, as suggested by Luis Filipe Nassif.
new 202f137 TIKA-2211 -- make sure that <head> content doesn't appear as content in epub
new 7a5b983 Merge branch 'TIKA-1946' of https://github.com/essiembre/tika
new d011d70 TIKA-1946 -- initial commit of QuattroPro and WordPerfect parsers. Many thanks to Pascal Essiembre for contributing these!!!
new df14f78 TIKA-2224 Mime magic for OneNote
new 009c143 TIKA-2224 Mime sub-entry for .onepkg, a cab file holding other onenote files
new 135f326 Changelog
new 9546bd3 TIKA-2224 We now differ from HTTPD on onenote formats, as we have subtypes they lack
new aa448a3 Merge
new 2013d33 TIKA-2226 -- add UnsupportedFormatException
new 84a3720 TIKA-1946 -- updates, add detection for wp 5.0 and 5.1, and quattropro 7-8 vs quattropro 9
new 6c3e5db Merge remote-tracking branch 'origin/master'
new ef1d907 TIKA-2224 Test OneNote file from Krishnan Narayan plus unit test
new 6dc442d TIKA-2228 - WordPerfect parser update to handle 5.x from Pascal Essiembre. This closes #142.
new 940e6f4 TIKA-2230 -- add paragraph markup to WordPerfect parsers
new e6cbaa0 Fix for TIKA-2232 contributed by pascal.essiembre
new 09632ef Now failing on any type of X-TIKA:EXCEPTION
new 11fe4ba Now properly decoding JBIG2 when inline in PDF (as opposed to pretend it is PNG).
new 9fdf9a8 Merge branch 'TIKA-2232' of https://github.com/essiembre/tika
new 86dbde4 TIKA-2232 -- shorten one unit test and update changes.
new 66f6310 TIKA-2234 -- get rid of ThreadLocal
new d1b1ad3 TIKA-2235 - set default dpi for OCR to 300 via Matthew Caruana Galizia
new ba26f6e TIKA-2232 add unit test for OCR of jbig2 embedded in PDF.
new a38a2b0 TIKA-2237
new 8eb7d35 TIKA-2159 handle preparse exceptions uniformly
new 526fc08 TIKA-2134 -- handle missing parts more robustly
new c9639bd TIKA-2238 -- add mime detection for embedded MSEquation files
new b9474f1 TIKA-2238 -- add mime detection for embedded MSEquation files
new 25aa2be TIKA-2232 -- ImageParser shouldn't allege that it can handle jbig2 when jbig2 library is not on class path
new 320a1f1 TIKA-2241 Add new config dumping option STATIC_FULL which lists all supported+active mime types for parsers
new 5c51534 TIKA-2231: Improved param validation of TesseractOCRConfig.setLanguage() and added more tests
new 8a04f20 TIKA-2242 -- fix style markup in ODT
new 02eae6d Merge branch 'TIKA-2231' of https://github.com/ham1/tika
new c978a11 TIKA-2231 -- update changes.txt. This closes #147
new 00bb6f4 TIKA-2240 -- improve mime detection for .wri files
new 9d97e16 TIKA-2232 -- log/warn if jbig2 parser is not on classpath
new 896c46a be more parsimonious wrapping streams
new 9477d03 Merge remote-tracking branch 'apache/master'
new 4cc15e2 TIKA-2244 -- be more parsimonious with BufferedInputStream. This closes #148.
new 847156a TIKA-2250 As of RFC7903, the official mime type for BMP is now the one without the x- prefix
new e6c0082 TIKA-2250 As of RFC7903, the official mime type for WMF is now an image one and without the x- prefix
new 90bf4f6 TIKA-2250 As of RFC7903, the official mime type for EMF is now an image one and without the x- prefix
new 836e2d9 TIKA-2244 -- be more parsimonious with BufferedInputStream -- AutoDetectReader
new 7afcfc7 Merge remote-tracking branch 'origin/master'
new fe94908 TIKA-2249 -- update javadocs to alert devs that tables are not "maintained" by the PDFParser
new 280ab87 TIKA-2251 -- make catch blocks as small as possible and improve "logging" with malformed files in new experimental SAX docx/pptx parsers.
new c09422a TIKA-2253 Obtain new Miredot license key and upgrade plugin version in tika-server
new 8213e53 Merge branch 'TIKA-2253'
new 2735942 TIKA-2255 Test SAS files
new c5130ec TIKA-2255 Magic for older sas data files
new 73a37a4 TIKA-2255 Mime detection unit tests for SAS files
new 3c0cd64 TIKA-2025 -- fix xls/x testBigIntegersWGeneralFormat to work in multiple locales. This closes #151
new da8363f Merge remote-tracking branch 'origin/master'
new 7555b13 TIKA-2259 -- improve url extraction from PDFs = copy Tilman Hausherr's code from PDFBOX-3644
new 0d54f07 TIKA-2181 - upgrade to POI 3.16.beta2
new bc3b263 TIKA-2198 - add null check to Tika after upgrade to POI 3.16.beta2
new 27e026e TIKA-2134 -- remove npe catch after upgrade to POI 3.16.beta2
new b9befb4 TIKA-2247 and TIKA-2246 -- add parsers for EMF/WMF
new aa7a0c3 TIKA-1332 -- initial commit for tika-eval module. More work remains.
new 506b572 TIKA-1332 -- fix one report for eval profiler and clean up whitespace
new d194ba4 TIKA-1332 -- downgrade Lucene to 5.x to allow for Java 7
new 6c6b77b TIKA-1332 -- clean up commons-io version mgmt
new a2d214c TIKA-1332 -- fix analyzer chain for common tokens, clean up UTF-8 references
new dc2dcd4 TIKA-1332 -- add English/Spanish common tokens, fix logging
new 9cf8258 TIKA-2267 -- add common tokens for some languages into tika-eval
new 3366bc6 Fix for Tika-2269
new 0b85460 TIKA-2269 -- Fix potential NPE in FeedParser via Julien Nioche. This closes #153
new 94fd3f6 change tika-eval default logging to INFO
new b3837a4 TIKA-2275
new 166ebb2 TIKA-2276 -- pass through TikaConfig via ParseContext in AutoDetectParser
new 1c87339 TIKA-2276 -- pass through ParseContext to prevent needless creation of TikaConfig
new e86f2d8 TIKA-2277 -- remove ParseContext field from AbstractParser
new bce5c79 TIKA-2277
new 10ca360 TIKA-2276 -- rollback...sorry.
new 6e4116b TIKA-2276 -- Have AutoDetectParser pass itself to the ParseContext for embedded documents if the user hasn't specified a parser _instead_ of keeping around a TikaConfig and passing that in.
new e3a50ba TIKA-2276 -- Try to reuse parsers from ParseContext for custom embedded handling, instead of creating a new HtmlParser/RTFParser.
new 579a92b TIKA-2276 -- Have AutoDetectParser pass itself to the ParseContext for embedded documents if the user hasn't specified a parser _instead_ of keeping around a TikaConfig and passing that in.
new 6697dcd Check for HTMLParser/create a new one just once.
new 6d022be TIKA-2273 -- two tests turned off temporarily in bundle. First draft of adding configurability to EncodingDetectors
new 500e15d TIKA-2278 -- clean up extract exception handling
new 4a4e89a TIKA-2278 -- clean up extract exception handling, add license header
new e7a0c3e TIKA-2273 -- fix configuration of encoding detectors when parsers are loaded statically.
new 5e0f926 TIKA-2273 -- cleanup, update CHANGES.txt
new 7a7887b TIKA-2279 -- simplify token counting
new efc67b8 TIKA-1857 -- fix extraction of field contents to handle hierarchical structure in <data> section
new d492657 tika-eval fix bug that stores parent file extension instead of embedded doc file extension
new 3c3e8e1 TIKA-2286
new b8fd7ee TIKA-2285 -- triggering document did not trigger an string out of bounds exception, but a corrupt/very short stylename could.
new 87745df TIKA-2285 -- don't trim before check
new c3383b0 TIKA-2281 -- extract mapi message class
new 745f13c TIKA-1865 - step 1, split out sender name from sender email/exchange info where possible in MSG files.
new 4f10801 clean up whitespace
new 2234f33 TIKA-1865 -- step 2, the other parsers.
new fecb19a TIKA-2281 applied to PSTParser
new 3bfe830 TIKA-1865 bug fix
new 24fec4d TIKA-2287 -- add general jdbc.
new 409e905 TIKA-2287 -- bug fix, improve handling when ref tables already exist
new b2f3eaf TIKA-1879 -- improve recipient email address extraction; revert the X400/500/Exchange processing for the "from" field from TIKA-1865
new 67cd6c3 TIKA-2290 - fix bug that prevented setting ocr strategy on PDFParserConfig
new 51e8320 Update mailing list archive links
new 49d6fd7 Bumped junit and slf4j versions
new abfc826 TIKA-2242 -- fix annotations with <p> elements inside of <p> elements
new 65182ee TIKA-2295 -- extract images from ODT
new 6465282 TIKA-2297: Added initial Lingo24 Language Detector
new 79b6c15 TIKA-2292: Updated CXF version to 3.0.12
new 6f9ca9d TIKA-2292: Updated CXF version to 3.0.12
new 0173a2f Merge branch 'TIKA-2297' of https://github.com/dameikle/tika
new 1bdc1a3 TIKA-2297: Added Lingo24 Language Detector
new e9ff4c0 Fix for TIKA-2303 contributed by ppalazon
new 585316d TIKA-2287 -- bug fixes
new 3d64e60 Merge remote-tracking branch 'origin/master'
new 1725007 TIKA-2236 upgrade PDFBox to 2.0.5 and JempBox to 1.8.13
new 679e460 Merge branch 'TIKA-2303' of https://github.com/ppalazon/tika
new 22f6ccf TIKA-2303 -- allow users to configure whether or not to extract bookmarks via Pablo Palazon. This closes #157
new 7894819 clean up unit tests, and modify two ODF tests based on feedback of broken build on user list
new a9145d8 modify two ODF tests based on feedback of broken build on user list -- remove parsing of embedded files.
new f55b87f TIKA-2300 include exception for streams that can't be read in pkg parser via Aeham Abushwashi
new c4660b4 TIKA-2307 avoid swallowing unsupported stream exception in PackageParser
new 58e1846 TIKA-2307 avoid swallowing unsupported stream exception, wrap in TikaException
new 256a281 TIKA-2212 ooxml parser should use finer-grained media types so that they can be filtered by users with includes/excludes
new bb82205 TIKA-1772 More WebVTT magic - for cases with no header, and with custom headers
new 3c02c4b TIKA-1772 More test WebVTT files - no text header, and a custom one
new 40647ea TIKA-1772 More WebVTT unit tests
new 8d31ab6 Changelog update
new 315a0d6 add initial support for xlsb
new adde012 Merge remote-tracking branch 'origin/master'
new fbd2a4e undo super bone-headed commit that was intended for personal fork. Wait until poi-3.16-beta3.
new a00ced9 undo super boneheaded commit, add binary back to list of unparseables in OOXMLParser
new 268a168 Added logging deps to unify it in parsers
new 5b6d997 Reformat POMs a little
new e8ee4ce TIKA-2245 Logging unification
new 6ff825c Cosmetics: extra spaces and diamond operator
new e1cc5a6 Added explicit test scope for junit
new 4bb2f70 Fixed tika-bundle integration test
new 465491f Bumped pax-exam version to 4.10.0
new 0d8f030 Merge branch 'logging-refactored'
new 7ce58d6 TIKA-2245 Updated CHANGES.txt with info about logging
new a5cd6f4 Added dependencies for DL4JImageRecogniser parser
new f777f21 Imported VGG16 model via deeplearning4j
new 236db96 fix for TIKA-2306 contributed by kranthigv
new 0c0bd4b fix for TIKA-2306 contributed by kranthigv
new cb8f8f5 fix the image
new c7f27b5 inceptionapi.py file added for REST API feature
new 1fc82e8 fix the destination directory
new 900e4cf fix no variables to save
new 0341a5d unexpected argument
new b9f496c undefined variable
new f8c51ba undefined variable
new d199692 undefined variable
new 0eedec8 Working inceptionapi.py without comments
new 19c0e91 TIKA-2302 -- make extraction of macros optional in OfficeParsers and set default to false
new 5877c4c TIKA-2302 -- make extraction of macros optional in OfficeParsers and set default to false
new 09cb2df fix for TIKA-2306 contributed by kranthigv
new f92809a fix for TIKA-2306 contributed by kranthigv
new be773ca fix for TIKA-2306 contributed by kranthigv
new 75a2ae1 Changed models repo to a forked repo for future compatibility
new cc34967 Update python code styling
new 653abaa Updated dockerfile to launch the service
new f1aad6f Fix typo in Javadoc
new 7b2b27a removed some of the dead code
new 8ebe758 Merge pull request #1 from asmehra95/asmehra95-patch-1-1
new 82f069b removed unused dependencies
new 1472a4e [TIKA-DL] Added tika-dl module to the build system
new ce28a6f Fix scheme value for file URIs
new 3cbf368 [TIKA-DL] build jar with dependencies by default
new d1c9513 [TIKA-DL] add license headers
new 81b3f32 Fix typos and unnecessary spaces
new 5834afe Fix XML format
new 1ea20c6 bump maximum tokens to 1000000
new 246133a TIKA-2317 -- warn when content string is truncated, allow easier parameterization of other limits via commandline.
new e5b0d54 Merge remote-tracking branch 'origin/master'
new 6b45621 TIKA-2318 fix exception/common count comparisons to include both mime_type_a and mime_type_b
new 3b33da2 TIKA-2319
new 8c9e02e Fixed formatting issues
new 0fb1458 fixed all formatting issues and added new customization
new e187d82 Enabled snapshots repo and upgraded DL4J to 0.8.1-SNAPSHOT
new 2a2e631 TIKA-2323
new f3db573 TIKA-2325 -- allow configuration of default language code for "common words" metric
new 3aab15f TIKA-2311 -- maintain mime information for truncated ooxml
new 6205742 Merge branch 'TIKA-2307' of https://github.com/KranthiGV/tika into KranthiGV-TIKA-2306
new dbdead5 Merge remote-tracking branch 'origin/master' into TIKA-2306
new 84fb6fe Updated model repo link to official tensorflow's
new a405fc4 Fixed SentimentParser, upgraded, using params, added test
new f1caef1 TIKA-2016 Added license header
new 84ffe8d TIKA-2016 Undo orthogonal changes
new ae06bae TIKA-2016 fix classpath URL for model
new db8c814 Reduced disk I/O
new 09698c6 Remove redundancy. Not updating classify_image.py since it has no effect on runtime performance
new d6b3ca4 Merge branch 'TIKA-2306' of https://github.com/KranthiGV/tika into KranthiGV-TIKA-2306
new a970303 mock parser's uninterruptible sleep can happen to pause for exactly 3000 millis
new 67612b8 TIKA-1195 and TIKA-2329
new 0f1034a update javadoc for Latin1StringsParser
new 75eea6e TIKA-2330 -- prevent preventable ooms in both detecting and parsing corrupt files or files that are misidentified as compressed streams.
new 9e89b44 TIKA-2331 -- Upgrade RTFParser to use new TikaMemoryLimitException
new 37d0f05 Merge remote-tracking branch 'origin/master'
new 80e6a8c TIKA-2334 -- Upgrade "provided" sqlite parser to 3.16.1
new 941d61a update CHANGES.txt in prep for release. reorder changes to most significant first...changes in default behavior then new parsers...Completely subjective, and I'm open to reordering!
new a31ed0d TIKA-2331 -- more opportunities to check the alleged length of a byte[]
new 77d5745 TIKA-2024 -- another location where the original source path might be recorded
new 834920e Updated URL to point to ASF repo
new dd51591 Merge branch 'KranthiGV-TIKA-2307'
new 80e6991 change scope of jai-imageio-core (TIKA-2338)
new 4321f77 TIKA-2339 -- remove test file flagged by one antivirus as potentially problematic. We assess that the av software had a false positive. However, to make it easier for the reporter and for others facing this issue, let's remove the offending file
new 34b630b Revert "change scope of jai-imageio-core (TIKA-2338)"
new 87a1b4a Update util file for snomeds and polarity
new 347e601 increase token counts to long from int
new 03035d6 TIKA-2309 Time Stamped Data Envelope parser
new e568bbc Merge branch 'Shinobi75-TIKA-2309'
new a194bc4 TIKA-2099 -- temporarily copy/paste commons-compress' ArchiveStreamFactory to benefit from updates that enable detection of magic-less .tar files.
new 11ad0fd TIKA-2039 -- extra unit test... ensure standard handling of exception in embedded file
new 9b5662d Refactor utils update
new 6c903f2 fix for TIKA-2322 contributed by msharan@usc.edu
new 1efe2e9 Minor edits in java comments
new 1fa7fc4 Removed local code
new 70343df Moving video imports inside classigy_video method
new 10529eb Fix URL to inceptionapi.py
new 7c431b3 Adding opencv support in Inception File
new 92a90c7 Update to v4 inception.
new 91d18a6 Merge branch 'TIKA-2322' of https://github.com/smadha/tika into TIKA-2322
new 4d3a43c TIKA-2345 Tika Config Serialisation of EncodingDetector details
new 86e821b TIKA-2345 Test for Tika Config Serialisation of EncodingDetector
new 27f3b3d ExecutorService serialisation TODOs
new d77fb59 Merge branch 'master' of https://github.com/apache/tika
new bbd4647 Changing v3 to v4, moving import at the top
new 0cb3a19 Fixed changes as per v4 PR
new aa4954f TIKA-2346 Add OfficeParserConfig support to control extraction from shapes from non-shape-based formats
new 0876aa9 TIKA-2346 OfficeParserConfig control extraction from shapes from DOCX
new 6a32d49 V3 to v4 in documentation
new ba00902 Updating v4 in Java default value
new 932a4a8 Supporting both opencv 2 and 3
new 434736b Adding opencv with ffmpeg
new 58a116c Removed personal repository
new 310fa54 Installing ffmpeg from opt and cone from bash
new 562e4fa TIKA-2346 -- add unit tests and configurability for doc, xls and SAX docx parser.
new e141640 Update Dockerfile for InceptionVideoRest to depend on ubuntu 16.04 (get ffmpeg via apt-get); build OpenCV+Python from scratch and bind to apt-get ffmpeg. Contributed by ThejanW.
new 49bb469 Merge pull request #168 from smadha/TIKA-2322
new b19b9c3 Record change for TIKA-2322.
new 27f7b24 TIKA-2322: update dockerfile
new 04f150d Update github links to apache/tika
new 3462c9d Keeping the number of parallel threads as 4 for OpenCV build process
new 104ca3e include "caused by" exceptions when catching/rethrowing in emf/wmf
new b0a4b95 Merge remote-tracking branch 'origin/master'
new 197b9ab Remove orthogonal line changes
new 1612028 TIKA-2349 -- try to use digest info to link embedded documents in tika-eval's "Compare" mode
new 0ee2fe6 Remove unneeded line change
new 5df8780 TIKA-2350 -- catch malformed open actions.
new 0b37895 TIKA-2311 -- to handle truncated files more robustly, in ZipContainerDetector, try OPCContainer before ZipFile
new dbe8a03 Merge https://github.com/ThejanW/tika into TIKA-pr-175
new 2217a8f Updating Tika Bundle POM for sentiment analysis - still getting errors.
new b26aa05 Fix Tika Bundle for TIKA-2016.
new e7b0cad Fix for TIKA-2352 contributed by pascal.essiembre
new c5da6bb Add multi-class/categorical sentiment config test file.
new 1934881 TIKA-2352 -- via Pascal Essiembre. This closes #176
new b46f20d Add categorical test.
new f073660 record change for TIKA-2016.
new 70d2455 Merge branch 'master' into TIKA-2016
new 0d3eb1f Merge pull request #169 from thammegowda/TIKA-2016
new ea19d62 TIKA-2343 -- add boilerpipe option (tika-app's "text-main") to tika-server
new 3cc4bc2 Merge remote-tracking branch 'origin/master'
new 4375a8e TIKA-2343 -- change put to post for multi-part forms
new 90cbe00 TIKA-2354 -- incorrectly skipping many images
new e56e2b2 Merge remote-tracking branch 'apache/master'
new 01ae987 [TIKA-DL] Updated model path, fixed issue with HTTP URL from XML
new 414a429 fix for TIKA-2355 contributed by msharan@usc.edu
new 46334b8 TIKA-2356 -- temporary workaround for bug I added to POI (Bug 61034) <face_palm/>
new 1e436f5 ignore inception tmp model.
new d06c521 pin to 0.8.0-1 release.
new 9dc2360 Factor out the DL4J model version.
new 0aaa121 TIKA-2357: Increased support for Tesseract PSM up to 13 from Rafael Ferreira
new 704a039 Pin to DL4J model 0.8.0-2.
new e477480 Integrate Tika-DL into Tika-Server and Tika-App
new 3207f12 Merge pull request #165 from thammegowda/tika-dl
new 5d3f36a record change for TIKA-DL.
new e31c933 TIKA-2318 -- include container file length in reports that mention file path, and add a report that compares page count.
new 2ab94fe Merge remote-tracking branch 'origin/master'
new 82fd2ff TIKA-2358 -- remove tika-dl as a dependency in tika-app and tika-server
new 9068584 Tika 2262: Create dummy classes for the client
new f73a117 Merge remote-tracking branch 'upstream/master'
new c7719fb Tika 2262: Implement image captioning server
new 806bdc9 Tika 2262: Relocating image captioning server
new 25fde02 Tika 2262 #create initial version of im2txtRESTDockerfile #minor fixes for model_info.xml & im2txtapi.py
new f68d33a Tika 2262: minor fix for model_info.xml
new 76815ea Tika 2262: fix minor error in the "/" route
new 6b174ec Merge branch 'master' of https://github.com/apache/tika into HEAD
new c95c66d TIKA-2361 upgrade to PDFBox 2.0.6
new 464fb97 TIKA-2363 skip image recognition test if network call fails
new c020e48 TIKA-2360 -- require users to turn on SentimentParser; remove glob detection for .sent; skip unit tests if network call fails.
new f78b7d0 clean up white space
new ac1791a clean up indentation
new d51227f Tika 2262: Change directories of models & .py files
new ed57e6e TIKA-2364 -- convert printstacktrace to log
new d873147 TIKA-2360 -- fix "fix" for SentimentParserTest
new 48580b0 TIKA-2367 -- avoid npe in WMF
new 477ce4c Tika 2262: # Reformat dockerfile # Update directories in model_info.xml
new cba1cb2 Tika 2262: # Remove unneeded lines # Add symbolic link to im2txtapi.py
new d8d2374 Tika 2262: Change route "/getcaptions" into "/captions"
new 614d951 Tika 2262: Change method name "get_captions()" to "gen_captions()"
new de87206 Tika 2262: Remove optional metadata to speedup beam search
new 5e19c01 Tika 2262: Add informative log messages (for advanced users to troubleshoot errors when modifying model_info.xml)
new b5b4eb6 TrueTypeParser Close Open Fonts
new 9d34cbb Merge pull request #181 from icirellik/close-fonts
new 9f4bb56 TIKA-2370: avoid potential resource leak by closing TrueTypeFont via Cameron Rollheiser. This closes #181. I removed the unit test based on Cameron's advice.
new 993382c TIKA-2368: Clean up dependencies of SentimentParser. At a bare minimum for the release of 1.15, add tika-translate to the exclusion list.
new ebc87ae TIKA-2359: Alert user that tesseract is available and will be used.
new 5a964e5 TIKA-2372 Test DMG file
new d992c5e TIKA-2318: fix typo (add comma) in common tokens by mime type report
new a9883eb add key
new b2fc478 TIKA-2373 -- fix licenses via rat in prep for 1.15 release
new a3b2ab2 Update CHANGES.txt for 1.15 release.
new 8a68c13 update scm for 1.15 release
new 8d3140b Merge remote-tracking branch 'origin/master'
new 3ba922f update scm for 1.15 release
new f604694 [maven-release-plugin] prepare release 1.15-rc1
new d806b99 [maven-release-plugin] prepare for next development iteration
new e4bfb16 undo release version update in order to fix tika-dl's parent definition and respin
new a29ae4d fix tika-dl's pom's parent definition
new 1761530 [maven-release-plugin] prepare release 1.15-rc1
new 05ccbdf [maven-release-plugin] prepare for next development iteration
new 674f67d Update changes for 1.16
new 75accba Merge remote-tracking branch 'origin/master'
new 6485b8b update pointer for sources to github in email template.
new 977314f prep for 1.15-rc2. Thanks to Oleg Tikhonov for catching my gaffes in -rc1.
new 6740a97 prep for 1.15-rc2. Change 1.15-rc1 back to HEAD in scm <tag>
new d831efb [maven-release-plugin] prepare release 1.15-rc2
new 95292dd [maven-release-plugin] prepare for next development iteration
new 49f3530 fix indents/whitespace
new f718cb9 fix indents/whitespace
new b290cd7 TIKA-1804 -- convert json parsing to SAX in TEIParser, step 1: test current output.
new 9ea855b Added the vgg16 model
new a99a8d0 added test case for vgg16 model integration
new 7ddb47f added default configuration and image to test
new 938acbd removed the earlier required dependencies
new 2d5e5b9 added the vgg16 class
new cead1d0 TIKA 2262 : Implement initial version of java client
new c35e257 TIKA 2262 : Update return type to List<? extends RecognisedObject>
new 4f06e2c TIKA 2262 : Remove unneeded return type change to List<? extends RecognisedObject>
new 8979094 TIKA 2262 : Minor fixes # Fix minor error in apiUri # add toString() to CaptionObject class
new 6be752c TIKA 2262 : Minor fixes
new 6e9f796 TIKA 2262 : Change routes to "/inception/v3/.."
new d53fa48 TIKA 2262 : Allowing to set post params through config.xml
new 34ae621 update key with signatures
new 99a6db5 Merge remote-tracking branch 'origin/master'
new dab7788 TIKA 2262 : Modify ObjectRecognitionParser to support im2txt
new 92fb768 TIKA 2262 : Add tests & test images
new 9a4f576 TIKA 2262 : Remove wildcard import
new 8ffd67c TIKA 2262 : Remove unneeded initialization of TensorflowRESTRecognizer
new f534fce TIKA 2262 : Minor fixes & code reformatting
new a963a32 TIKA-2381 - include tika-eval artifact in release
new 574fcec TIKA-2360 -- broaden allowable exceptions to IOException
new fa1d411 TIKA-2379 - Make logging in bundle optional. Fix test.
new c3cf30f TIKA-2379 - Add test scope back in for JUnit.
new cd7aa37 Merge remote-tracking branch 'origin/master'
new b3ad71b TIKA-2383 -- upgrade forbiddenapis to 2.3 and update internalRuntimeForbidden->nonPortableRuntimeForbidden
new 77900ab TIKA-2341 -- upgrade commons-compress to 1.14, added capabilities for snappy and lz4-framed
new 6d710da TIKA-2384 - TikaResource can close an InputStream twice; this revealed a bug in CommonsDigester, which this also fixes. (via Haris Osmanagic).
new a8aa2fc fix for TIKA-2835 contributed by pmweiss5
new 5410928 TIKA-2386 -- enable more options for DigestingParser
new f74b089 TIKA-2384 -- went too far in earlier commit. We should close the inputstream in parse.
new 933ae1c TIKA-2384 -- update changes to reflect fix.
new 630d1fe add files
new 7acf48d TIKA-2387 -- parameterize scale for image rendering of pages in PDFs for OCR
new 916e5ed TIKA-2388 OpenOffice database files have application/vnd.oasis.opendocument.base as their embedded mimetype, so make that the canonical one
new 7842600 TIKA-1945 -- extract text from diagrams in ooxml files.
new f8f407a Merge remote-tracking branch 'origin/master'
new d2820ce TIKA-2254 -- extract text from charts in ooxml.
new 5cbaed8 TIKA-2362 -- Allow users to turn off extraction of headers and footers from .doc, .docx, .xls, .xlsx, .xlsb
new 79f2740 TIKA-2383 -- remove nonPortableRuntimeForbidden and add new signatures
new dbafffa TIKA-2383 -- remove nonPortableRuntimeForbidden and add new signatures
new d700aa4 fix for TIKA-1988 contributed by msharan@usc.edu
new ed2ceeb TIKA-2397 -- remove circular dependencies with conflicting versions brought in by the SentimentParser.
new 132d3e7 TIKA-2391 -- extract <script> elements as embedded documents
new c9031bf Reverting committed work dir
new 8410067 TIKA 2262 : Add apiBaseURI + reformat ObjectRecognitionParser
new c53af43 TIKA 2262 : Swap json imports to json simple
new 1760249 TIKA-1804 -- remove dependency on org.json
new 00de530 Merge branch 'tika-dl-me' of https://github.com/asmehra95/tika into TIKA-2298
new c476ec1 TIKA-2298: DL4J-VGG16 simplify conf, implementation
new 134cb38 Merge pull request #2 from thammegowda/TIKA-2298
new ab21426 turn dl4j test back on.
new 2bdf783 Add all image format support to the captioning server
new 5478311 Tika 2262 : Add all image format support to the captioning client + clean imports
new 7b47239 TIKA 2262 : Minor fixes in TensorflowRESTCaptioner & Tests
new ebb4bc6 Merge remote-tracking branch 'upstream/master'
new 7e51e2b ignore snaps
new 4abd944 ignore snap build dir
new 06eb659 add snap change
new 223c006 remove old files
new 2369149 TIKA 2262 : # cleanup imports in ObjectRecognitionParser # reformat im2txtapi.py # add pillow python package install to Im2txtRestDockerfile # add gif image & gif image test
new 6a60a79 TIKA 2262 : Minor reformatting + Change log level into info to see captioning time
new 1aacfa0 Normalized the confidence range b/w 0 and 1 and fixed topN return issue
new 8ef1eb9 TIKA 2262 : Remove unused imports + spaces
new b9825c6 TIKA-2336: Upgrade to POI 3.17-beta1
new bada130 As per RFC2361, the official mimetype for WAV is audio/vnd.wav
new 443bc7d Merge remote-tracking branch 'origin/master'
new 6e3fb26 TIKA-2380: Upgrade to Jackcess 2.1.8
new 93f941e TIKA-2389 and fix CHANGES.txt file
new 4161f22 TIKA-2389 -- allow users to configure warnings for problems during initialization
new 2a43a69 move test scope image processing dependencies together in the pom
new 2deadf4 TIKA-2374 -- tika-app cli should extract inline images by default
new 5cb985e remove tika-serialization from tika-bundle
new b409ff6 TIKA-2368 -- rename SentimentParser to avoid conflict with dependency
new 91d0af2 TIKA-2410 -- turn off bold/italic on \plain
new 7fd5bd7 TIKA-2335 -- extract x15ac:absPath metadata from xlsx and xlsb when available
new 95c515d TIKA-2089 -- extract macros from ppt
new 05e334a TIKA-2411 -- clean up unneeded dependencies
new bfedea8 TIKA-2368 -- move to different package to avoid split package warning
new 15d6078 TIKA-2312 upgrade provided xerial version
new 04cc72c TIKA-2313 upgrade mime4j to 0.8.1
new d192004 Upgrade gson
new 87d8f55 Upgrade gson and libpst (TIKA-2414 and TIKA-2415).
new a7b5705 Upgrade dependencies in tika-eval TIKA-2416
new dd2149a Revert upgrade of libpst to 0.9.3 back to 0.8.1
new 9a312f2 Revert upgrade of libpst to 0.9.3 back to 0.8.1
new b618984 Merge remote-tracking branch 'upstream/master'
new 0815b21 TIKA-2418 Make the QuickTime start-of-file Atom matches a bit more specific where possible, to reduce false positives
new 8eef056 Merge branch 'master' of https://github.com/apache/tika
new d98bec0 TIKA-2419 Do all 4 html doctype variants for the same text range
new 3830152 TIKA-2419 If we detect XML but the XML is broken, try the HTML magics before declaring it to be broken xml = plain text. Needed because, to avoid false positives on html-like formats such as email, XML has a higher magic priority than HTML
new ab4ea47 Merge branch 'tika-dl-me' of https://github.com/asmehra95/tika into TIKA-2298
new 0b92a77 Fix Tika Bundle error RE: tika serialization.
new 82ab03c Implement needed object recogniser method
new cb7b84a TIKA-2089 - bug fix, check for nulls
new 24b54af Merge remote-tracking branch 'origin/master'
new 621ded8 TIKA-2089 -- clean up try/catch with autocloseable
new 4ed69a8 TIKA-2420 -- protect against unsupportedoperationexception with query.toSQLString() on unknown query types.
new 215c262 Up max memory for surefire to 3GB.
new 9dc8e21 small cleanups to sql
new b58cfcf Record change for TIKA-2298: Very Deep Convolutional Networks for Large-Scale Image Recognition
new d65c2a1 Merge branch 'TIKA-2298'
new 94f8b9f Use ${commons.compress.version} per tballison.
new 158675d TIKA-2298 -- skip test if no network connectivity. Should rework for more elegant solution at some point.
new be02273 set config equal to the new Config object.
new a00d112 Test GraphViz files
new 8d8e818 TIKA-2422 -- improve detection of Graphviz *.dot format
new da7ade6 TIKA-2422 -- improve detection of Graphviz *.dot format - allow leading C++-style comments - add unit test, incl. comments and graphs with name/ID
new dd7acbf Merge pull request #190 from sebastian-nagel/TIKA-2422-detect-graphviz
new 3891b87 Add a few recent mime/magic changes to the changelog
new 92b7b68 fix merge of tika-parsers/pom.xml for age predictor.
new 4e24a14 Implement initializable interface.
new 048dd8e Fix import statement.
new a549ec6 Get needed OpenNLP models for age detection.
new 2d48094 Configure age detector based on classpath.
new 5a653d0 formatting in ModelGetter.
new f8cfa4b Fix typo.
new ce4fdda Use static class.getResource() instead of getClass().getClassLoader()
new 6ff1147 Fix pom to work with Age Recogniser.
new 3c2595d Automatically copy the Age models to the model dir so that you can have them available for classpath tests.
new 9ad510c Use mock class for testing.
new 327ae0b Record change for TIKA-1988 contributed by msharan@usc.edu
new 9be1785 Fix Felix bundle rules for Age Prediction Parser OSGi bundle. TIKA-1988.
new 05f8f89 TIKA-2389 -- add static checks to PDFParser, Tesseract, SQLLite to make sure that potential warnings only happen once. Rework TikaCLI to build parser only once based on tikaConfig so that initialization warning settings actually work.
new 0f6449f Merge remote-tracking branch 'origin/master'
new 58a602f TIKA-1988 -- allow for failure to copy age recognition models
new f776c24 TIKA-2399 exclude jj2000 because of potential license problems with ASL 2.0
new 632f52d TIKA-1988 -- allow for errors downloading models
new e07d9e1 - add Tika-NLP module - move AgeRecogniser out of tika-parsers TIKA-1988
new f94616a add Tika-NLP to build
new 3040849 Update CHANGES.txt for 1.16 release.
new d353914 [maven-release-plugin] prepare release 1.16-rc1
new 23c98cd [maven-release-plugin] prepare for next development iteration
new e9c8794 roll back to 1.16-SNAPSHOT
new f7fe12e exclude models from src.zip
new 1963322 [maven-release-plugin] prepare release 1.16-rc1
new 0c6e1ad not sure why pom.xml.releaseBackup files are now included after last commit.
new 75944e5 Merge remote-tracking branch 'origin/master'
new 64b59fe revert to 1.16-SNAPSHOT; third time is the charm
new 2af8fb1 revert to 1.16-SNAPSHOT; third time is the charm
new b99f344 [maven-release-plugin] prepare release 1.16-rc1
new 6aa4bc6 [maven-release-plugin] prepare for next development iteration
new 44082d3 TIKA 2262 : Adopt changes in TIKA-2389
new 36d742e Merge pull request #189 from ThejanW/master
new 2e7a4f5 Merge https://github.com/apache/tika into gsoc17
new b36a5b7 - explicitly set Locale in String.format
new e9793d3 - ignore pydevproject
new 2eabced Make sure tests get run if Tensorflow is available via Docker or script
new 56ab7b2 - handle exceptions
new 65ef6d8 record change for TIKA-2262
new 9f0144c TIKA-2426 -- fix locale-sensitive test for xlsb
new bea5b9d WARC and ARC magic from Andy Jackson from https://github.com/ukwa/tika/
new c23c648 update tag
new 068c87c Merge remote-tracking branch 'origin/master'
new 0277fbb TIKA-2042 Add a few more mbox patterns, based on file supplied by mcaruanagalizia Matthew Caruana Galizia
new 55caab7 POIFSContainerDetector ASCII-encoded magic number
new 08d09a5 Update WordMLParser.java
new 9687f08 SUPPORTED_TYPES is an immutable singleton set
new 9869851 TIKA-2430 -- add a capability to allow devs to easily run parsers against randomly corrupted files.
new 898946e Merge branch 'patch-1' of https://github.com/onealj/tika
new 71a80c9 Merge branch 'patch-2' of https://github.com/onealj/tika into onealj-patch2
new 12cce58 Merge branch 'onealj-patch2'
new 13ebacf Fix conflicts. This closes #193
new f3acaed Fix conflicts. This closes #193. Thank you, Javen!
new f53a2f2 SUPPORTED_TYPES is immutable
new 3de4c4f TIKA-2430 -- allow devs to fuzz embedded files individually
new 3f86a6b TIKA-2430 -- remove dev ignore emf
new 268a815 TIKA 2262 : Update links
new d57a621 Fix a typo in log message, and adjust code indentation
new 8af9c96 Merge pull request #195 from kinow/fix-typo-and-indentation
new 30b27ab Merge pull request #194 from onealj/patch-5
new 0d5cab9 TIKA 2262 : Minor changes to dockerfile
new ef12f7d Merge remote-tracking branch 'upstream/master'
new 0579efe Two more EML header magics from Matthew Caruana Galizia from TIKA-2042
new c92730c Minor changes
new 00221ad TIKA-2042 -- fix typo.
new 1bf3a7e TIKA-2431 -- upgrade to PDFBox 2.0.7
new 63ae47a Merge pull request #196 from ThejanW/master
new f31b7f1 TIKA-2433 All non-pipe modes need configuring, otherwise the Tika Server fails
new a51add2 Forbidden APIs fix - Use a specified encoding when turning Strings into Bytes
new 4455a6f TIKA-2436 Add a mime type for EMZ, subclass of gzip, much as we have for the related WMZ
new e2f5b17 TIKA 2402 : Include pillow + include logic to convert non jpeg image
new 523f5d6 TIKA 2402 : Minor changes in client & config + Include png, gif tests
new fe7533d TIKA 2402 : Minor fix in dockerfile
new 57ee179 TIKA-2439 throw IllegalStateException in OptimaizeLangDetector.detectAll if no module has been loaded and the detector is therefore null, which avoids an unhelpful NullPointerException
new 4e03f90 TIKA-2438 -- ooxml locale should be set via POI's LocaleUtil. Fix unit tests to be robust in different locales. Many thanks to Karl Richter for raising this issue.
new dada2b7 Merge branch 'npe' of https://github.com/krichter722/tika into krichter722-npe
new e5526b5 Merge branch 'krichter722-npe'
new 1941a29 Record changes related to Image Captioning.
new b422eda Merge branch 'master' into gsoc17
new 2bcc0a7 TIKA-2268 -- add report for common_tokens/alphabetic tokens
new a288bce Merge branch 'master' into gsoc17
new 0e1abd2 Merge branch 'TIKA-2355' of https://github.com/smadha/tika into gsoc17
new 0f57ea8 Merge branch 'TIKA-2355' of https://github.com/smadha/tika
new c6b6b17 record changes for TIKA-2265.
new 3621ee4 fix merge
new e2af4bd Update snapcraft.yaml
new 10baddc TIKA-2374 and TIKA-2434 - roll back extracting inline images for pdfs in tika-app to just -z option
new 20a0cd7 add alternate reason message to skip test
new 79c52ed TIKA-2434 add headless mode to tika-batch
new 1d1119e Add a sample Windows batch file
new 67e2c5a TIKA-2445 Windows Batch .bat / .cmd need their own type, as they are text-based, with some common-ish magic, plus unit tests
new 2207cd9 HTTPD magic is once more wrong, disable one check and explain why
new 2c54f93 Changes update
new 587e4ae TIKA-2447 Inspired by the patch from Jan Burkhardt, do not bother fetching+keeping data from PSD sections we ignore
new 930e677 Changelog
new cf1c283 directly compare stderr to empty string in testRedirectionOfStreams to obtain more meaningful messages if test fails
new bf03dd4 Convert EncryptedPowerPointFileException to EncryptedDocumentException
new 921949c Merge branch 'mattch-patch-1'
new 87033d6 Add unit test for PR-202 submitted by Matthew Caruana Galizia: https://github.com/apache/tika/pull/202
new 74574e3 TIKA-2440 -- extract phonetic runs from xls and allow users to turn off extraction of phonetic runs in both xls and xlsx.
new 72772c5 bump read timeout for downloading file to 60s (1min) in tika-dl
new 6564bc8 Merge pull request #203 from boegel/bump_timeout
new aa12fc1 Merge pull request #174 from q-centrix/TIKA-2332
new 516bbaa Merge pull request #187 from buggtb/master
new e763021 Improvement for TIKA-2449 contributed by Giuseppe Totaro
new 5638ebc TIKA-2448: Extract phonetic runs in docx with experimental SAX parser
new 70de289 Merge remote-tracking branch 'origin/master'
new a153397 TIKA-2450 -- AutoDetectParser should throw a ZeroByteFileException for zero-byte files after detection on the file extension.
new 83f1afa TIKA-2454: add OverrideDetector and allow PSTParser to specify body content type as text or html -- to avoid incorrect auto-detection of rfc/mbox, etc.
new e0ff3eb TIKA-2454: don't process the htmlbody. There could be encoding conflicts. Fallback to what we were doing...just process text.
new 3eff3ac TIKA-2454: cleanup unused TEXT_PARSER thanks to Matthew Caruana Galizia
new 87e483a Better to fix the .mv.db than to complain to the user.
new 560e91a TIKA-2456: fix detection of emails inside mbox
new 9e6a91c Merge branch 'master' of https://github.com/apache/tika.git
new 4de0c66 TIKA-2455: flag the containing multipart type
new 8000cfe TIKA-2451 - Extract number of tiffs in a multi-page tiff (TIKA-2451); many thanks to Mike Cantrell for supplying a test file.
new 82ac81b Merge pull request #201 from boegel/stderr_vs_empty_string
new 083f7b8 Further refinement on PR-201
new e41c129 Duh...further refinement on PR-201
new 79f4b4e fix conflicts in CHANGES.txt during merge
new 5b57ae4 Merge branch 'gsoc17'
new 7b869c0 Added a regular expression to match standard word within a pattern for TIKA-2449 contributed by Giuseppe Totaro
new 21c0f37 PicturesSource has been copied to Apache POI, mark the class to remove once we have upgraded to a version with it in
new 31625a2 Used the alphabetical order for the list of the standard organizations by relying on TreeMap. Thanks to Lewis McGibbney for this insightful suggestion (TIKA-2449 contributed by Giuseppe Totaro).
new 70ca280 TIKA-2460: load custom mimetypes XML from sys prop
new 26d6e0d further refinement on PR-201
new 582083e Merge remote-tracking branch 'origin/master'
new d1a8bff TIKA-2459 -- fix special character handling
new b987125 Update InceptionVideoRestDockerfile
new 7dd38d5 Merge branch 'master' of https://github.com/apache/tika
new db89ab3 TIKA-2449: Enabling extraction of standard references from text
new f311188 TIKA-2465 -- add explicit unit tests for xxe vulnerabilities
new 99abe4e TIKA-2465 -- scope catch/fail more finely
new 62d5665 Merge pull request #207 from armathur/patch-1
new ed0574b TIKA-2465 -- scope catch/fail more finely
new 92fe9b8 Merge remote-tracking branch 'origin/master'
new 1b951f2 improve docs for scope of these tests
new c0c2eaf TIKA-2467 refactor creation/configuration of XML parsers/factories/readers to be static methods.
new af4ea8a TIKA-2465 -- make sure to include slides for SAX PPTX parser
new 2e8d45a TIKA-2465 -- add epub
new a78717c TIKA 2400 : # Define apiBaseUri for inceptionREST # Link wiki pages
new 4f784fc TIKA 2400 : # Fix formatting issues of inceptionapi.py # include logic for checking minimum confidence
new 6b31053 TIKA 2400 : Minor changes to im2txtapi.py
new 8a875cd TIKA 2400 : Changes to video_util.py # Fix formatting issues # Remove unused imports
new 712b697 TIKA 2400 : Minor change to im2txtapi.py
new 9b94e17 TIKA 2400 : # Adjust the Object Recognition REST clients to work with changed servers
new 07abb31 TIKA 2400 : # Few refactoring to Object Recognition REST clients
new dead956 TIKA 2400 : Update dockerfiles
new 92c65e0 TIKA 2400 : Update dockerfiles with namespace 'thejanw'
new f16bd0e TIKA-2429 -- upgrade to POI 3.17, last version of POI that runs on Java < 1.8
new 015c695 TIKA-2429 -- upgrade to POI 3.17, and get it right in tika-eval
new 2a81e97 TIKA 2400 : Update dockerfiles with namespace 'uscdatascience'
new 384e971 make strawman app driver actually work. Add ability to specify a list of files.
new ac25932 TIKA-2466 Remove JAXB for easier use with Java 9 via Robert Munteanu.
new 0e38f94 TIKA-2470 -- modernize DocumentBuilderFactory security for Java 9
new c54efd8 TIKA-2470 -- fix...add back namespace aware
new 5d41096 prevent div by 0 exception in profile-reports.xml
new 2748538 TIKA-2268 -- add more reports and fix div by 0 bug
new 33da38e TIKA-2455: test for feature; only store multipart subtype in metadata
new 766cdac [TIKA-2472] Implementing Metadata#hashCode
new abfca01 Add test PCX and DCX files, generated by ImageMagick from the Test PNG file TIKA-2473
new 450ab4b TIKA-2473 PCX and DCX mime magic and detection unit tests
new 03e7d12 TIKA 2400 : Change requests
new b1d80bd [TIKA-2472] Reimplementing Metadata.hashCode using the AbstractMap code but with Arrays.hashCode as suggested by Ken
new 2966cab [TIKA-2472] Removing a null value check as Arrays.hashCode does it
new 1f38be3 fix for TIKA-2475 contributed by seanstory
new 369a04e [TIKA-2476] Making sure the trailing space is not added
new f444fd7 add tests for xml vulnerabilities. More work remains on entity expansion...
new 40e99f9 Merge branch 'TIKA-2475' of https://github.com/seanstory/tika into seanstory-TIKA-2475
new d5d739c Merge branch 'seanstory-TIKA-2475'
new 94850f2 TIKA-2475 mods and some new tests/cleanup for CharsetDetector. This closes #210.
new ad23d84 TIKA-2469 -- narrow mime detection for ms-owner files and add detection for nls files.
new 18aa69a [TIKA-1788] RFC822Parser: provide email attachment filenames when available
new 9653e77 [TIKA-1788] Provide Content-Disposition metadata in embedded files
new 9fb3461 TIKA 2400 : # Minor reformatting # Include inception v4 definition script # Remove unwanted classify_image.py file # Get rid of the need to have tensorflow models with PYTHONPATH
new a043cef TIKA 2400 : Remove redundant functions + Minor refactoring
new f6beced TIKA 2400 : Update environment of python scripts to python3
new 17e4b66 A dummy parser unit test for iWorks 13
new 5c7547b Have the iWorks 13 parser set the content type on the metadata if possible, otherwise remains no-op
new 0d92bc8 Add notes on why we can't get the Numbers or Pages type just yet - need to call out to another library or decode the Document.iwa snappy stream ourselves
new 1f28f46 Merge branch 'TIKA-1788' of https://github.com/AarjavP/tika into aarjavp-TIKA-1788
new 6fc2b7e Merge branch 'aarjavp-TIKA-1788'
new 96a3502 update some unit tests to use the RecursiveParserWrapper
new b5f5403 Merge branch 'master' into patch-2
new a01163d Merge branch 'mattcg-patch-2'
new 877d621 update a unit tests to use the RecursiveParserWrapper. This closes 205.
new 1047f64 TIKA 2400 : Changing environment of python scripts to python2
new ff481b2 TIKA-2478 -- rfc822 parser should handle alternative parts as the Outlook parser does. Added parameter to allow for legacy behavior in RFC822Parser and a parameter to "include all alternatives" to the OutlookParser.
new c009dc7 TIKA-2485 -- Allow configuration of markLimit in EncodingDetectors via tika-config.xml
new 93411f4 TIKA-2489 -- upgrade to PDFBox 2.0.8
new eae1002 allow for greater leniency in failure to load resources from the network
new 88a5e51 TIKA-2490 and TIKA-2491 -- turn off initializable problem stderr warnings in tika-app, confirm that configuration of initializable problems works from an input file and allow for a tika-config.xml file without specifying a classloader
new 690c744 TIKA 2400 : Changing environment of python scripts back to python3 for docker testing
new d8caba1 TIKA 2400 : finalize dockerfiles + scripts
new cc08d39 TIKA 2400 : Fix minor error in im2txt dockerfile
new dfb7187 TIKA-2492 -- exclude pdfbox debugger
new 66be8e7 TIKA-2492 -- exclude pdfbox debugger, but get it right this time.
new 04b0837 TIKA-2492 -- exclude pdfbox debugger from tika-bundle
new 9c2e1b9 TIKA-2488 -- catch potential npe in getting attachment's inputstream
new 18deefa Upgrade to Jackson 2.9.2 (TIKA-2501).
new 780ab0c * Upgrade to OpenNLP 1.8.3 (TIKA-2502).
new f0b6a17 TIKA-2503. Need to confirm this doesn't break anything
new 1b48d73 TIKA-2486 upgrade metadata-extractor to avoid CVE in xmp-core to 2.10.1
new be434b1 remove unused dependency
new b19c2d7 TIKA-2502 -- rollback until we can figure out how to get the upgrade working with our OSGi bundle.
new 06486c8 TIKA-2483 -- revert loading of mime repository in PackageParser from TIKA-2311 to avoid NPE in ForkParser
new ff5d065 TIKA-2034 upgrade xmpcore
new 7d83b86 TIKA-2504 exclude dependency on old vfs2 to remove vulnerability from plexus-utils
new b6bdb67 TIKA-2502 - Upgrade opennlp-tools to 1.8.3 maven-bundle-plugin to 3.3.0
new 1e8008c TIKA-2506 - Check config for null during DL4J Test.
new 91ef9a9 Merge pull request #208 from ThejanW/master
new 3ee0aff Remove docker files now present in https://github.com/USCDataScience/tika-dockers
new 946614b Update changes with TIKA-2400 / GH-208
new d64a32c Fix for TIKA-2347 Adds underline extraction from word documents
new 93cbed6 TIKA-2347 - Added extraction of <strike> element in DOCX files
new 639f3bf TIKA-2347 - Add underline extraction from Word documents (doc/docx) from Stuart Hendren as well as strikethrough extraction in docx.
new beedc42 TIKA-2347 - Add underline extraction from Word documents (doc/docx) from Stuart Hendren as well as strikethrough extraction in docx.
new d4fd659 TIKA-2510 -- Extract media files from ooxml
new ce4d948 TIKA-2511 Cache TikaConfig in EmbeddedDocumentUtil for faster processing of files with lots of embedded files.
new 6fa83ff clean up imports, update unit tests to use assertContains, and confirm that <strike> in xhtml doesn't add spaces in extracted text.
new fb93ab1 Fix OOM when parsing very large PDFs
new ef3fc7b TIKA-2512 add underline/strikethrough extraction for docx and pptx in SAX-based parsers
new 33bf39f Merge branch 'fix-oom-when-parsing-large-pdfs' of https://github.com/shrike/tika into shrike-fix-oom-when-parsing-large-pdfs
new 2e27414 Merge branch 'shrike-fix-oom-when-parsing-large-pdfs'
new 72c4e33 Update test and add note in release notes. Many thanks, shrike! This closes 213.
new a047fa9 TIKA-2510, correct fix. Only add to seen/handledTarget _after_ processing.
new 7ca6597 Merge branch 'TIKA-2835' of https://github.com/pmweiss/tika into TIKA-2385
new 537cfca Merge branch 'TIKA-2835' of https://github.com/pmweiss/tika into TIKA-2385
new b3434cd Merge branch 'TIKA-2385'
new f3842ea TIKA-2385: Corrected Tesseract OCR rotation.py script and made it a configurable option via Peter Weiss
new 7a9411f TIKA-2385: Added check for Python dependencies
new aff782d TIKA-2385: Updated check for Python dependencies to use temporary script instead of -c switch
new f3acc8f TIKA-2385: Fixed typo in dependency checker script
new 50295be TIKA-2385: Removed deprecated call
new 88b93c1 TIKA-2516 upgrade to cxf 3.0.13
new ac9f24e Merge remote-tracking branch 'origin/master'
new e83844c TIKA-2516 upgrade to cxf 3.0.16
new 6b2b626 TIKA-2503 upgrade to httpcomponents 4.5.4
new 2169cae Fix thread-safety in ChmExtractor (TIKA-2519).
new 95baca2 TIKA-2519 clean up, fix bug in MultiThreadedTikaTest files that failed to prevent files that caused exceptions; revert new ChmBlockInfo() to private
new 90d6245 TIKA-2483 -- add in all children of zip and tar to prevent overwriting of child file types by the PackageParser. Ensure that our semi-manual list is updated when there are changes to TikaConfig.
new f57e0e7 TIKA-2521
new f983eb4 Update CHANGES.txt for 1.17 release.
new b071ab1 add missing license headers. THANK YOU RAT!
new 94777e3 [maven-release-plugin] prepare release 1.17-rc1
new c069ad5 [maven-release-plugin] prepare for next development iteration
new af3d017 update changes for next release cycle
new 6f33bae roll back to start rc2
new b054df2 [maven-release-plugin] prepare release 1.17-rc2
new 6087955 [maven-release-plugin] prepare for next development iteration
new 78c8d74 TIKA-2524 -- add an XPS parser
new d0c315c Update Changes for branch_1x
new 2922511 TIKA-2509: Updated TesseractOCRParser to use configured ImageMagick path
new 9e1cd30 upgrade geo-apis and sis (TIKA-2535).
new fe97b16 TIKA-2535 -- but get it right... in tika-bundle and tika-java7
new 54bce08 TIKA-2556: Swap out com.tdunning:json for com.github.openjson:openjson to avoid jar conflicts.
new 6829643 TIKA-2547: RFC822 with multipart/mixed, first text element should be treated as the main body of the email, not an attachment.
new 32cbe38 TIKA-2561 upgrade jsoup to avoid potential xss vuln in grib
new 504ba00 TIKA-2564 -- wrap embedded stream in a stream that supports mark/reset in --extract option in tika-app
new 0e5fded TIKA-2559: Extract language metadata item from PDF files via Matt Sheppard.
new 5314bc4 update changes for TIKA-2559
new 856a90d TIKA-2571 -- rethrow SecurityException
new d9be32c TIKA-2569 -- Extract text from grouped text boxes in PPT.
new fd7ec73 Remove java 8 String.join
new aded888 TIKA-2563 -- Extract files embedded in HTML and javascript inside HTML that are stored in the Data URI scheme.
new 287f5d1 TIKA-2563 -- mods for 1.x branch
new 69a85d2 TIKA-2570 - upgrade to more recent version of jackson to avoid CVE-2017-17485 via Ewan Mellor.
new 72e6f70 fix documentation via David Pilato on twitter.
new 35d87d1 TIKA-2580 via Ewan Mellor.
new 520e73f TIKA-2580
new 71cf654 TIKA-2588 -- extract xlsx stored within ole objects in ppt/x via Brian McColgan
new 99f4852 Merge branch 'branch_1x' of https://github.com/apache/tika into branch_1x
new 5e3e910 TIKA-2578 and TIKA-2587 -- Allow for RFC822 detection for files starting with "dkim-" and/or "x-" via Andreas Meier
new 2cbca1c TIKA-1518: Add local docker build based on dockerfile-maven-plugin
new d810fba TIKA-1518: Updated the README and changed image name to tika-server for clarity
new 2e48245 TIKA-2598 -- add enforcerplugin to fail on dependency convergence problems, and fix dependency conflicts where possible.
new 4eb8ae1 Merge branch 'branch_1x' of https://github.com/apache/tika into branch_1x
new be6e95d TIKA-2576 -- Upgrade commons compress and add detection and parsing of zstd (if user provides com.github.luben:zstd-jni... via Andreas Meier
new cf0348d TIKA-2598 -- unbreak the build (sorry!), fix problems after tika-app
new 8163b59 TIKA-2598 -- unbreak the build (sorry, again!), fix missing javacpp dependency.
new d9f63a0 turn off debug in powerpointparsertest
new 32c19de TIKA-2600 -- remove md5 checksum, and switch sha-1 to sha-512 for release artifacts
new b9e9e5b TIKA-2594 -- improve eml detection for those starting with Subject: and containing html
new 164c928 TIKA-2592 -- ignore charsets not supported by IANA in html meta-headers via Andreas Meier.
new b4047eb TIKA-2591 -- Add workaround to identify TIFFs that might confuse commons-compress's tar detection via Daniel Schmidt
new a9b4b36 TIKA-2590 -- revert listenForAllRecords = false thanks to Grigoriy Alekseev
new c566cc4 TIKA-2590 update Changes.txt
new 33f756f TIKA-2527 -- Various new mimes and typo fixes in tika-mimetypes.xml via Andreas Meier.
new e12117c TIKA-2594 improve eml detection via Luis Filipe Nassif
new 42aa774 TIKA-1518: Detach docker file build from build phase in Maven execution
new c996d01 TIKA-2568: detection of full encrypted 7z files
new 2e1a810 TIKA-2338 support for tif in pdfs
new eb35173 TIKA-2338 -- fix imageio version conflict in tika-dl
new ceee42a TIKA-2530 -- temporary workaround -- check for zero length byte array in rtf body to avoid buffer underflow from POI, via Pascal Essiembre.
new 4d75a32 TIKA-2591 -- prevent AIOOBE when haystack shorter than needle
new 17d8fe4 TIKA-2604 -- properly escape (or not) class path in windows and linux environments.
new 029715d fix cherry-pick conflict
new 3ad2274 TIKA-2614 -- treat simple body inline, not as an attachment
new 1cd565c TIKA-2616 -- preserve message/news
new c5cf55f TIKA-2617 -- handle new IOOBE on streams now parsed as npoifs in ppt embedded streams as any other IOException on an embedded stream
new e44a38d Update forbiddenapis to version 2.5 and remove commons-io hack from pom.xml
new ca9c2f5 TIKA-2618 -- avoid overwriting labels
new f9910e2 update CHANGES.txt because of conflict in cherry-pick
new d1526d0 Fix for TIKA-2582 contributed by ewanmellor.
new b2ca378 Fix for TIKA-2584 contributed by ewanmellor.
new 2efe3f9 Fix for TIKA-2613 contributed by ewanmellor.
new 04225d2 TIKA-2621 -- add support for brotli
new cfefc4c TIKA-2621 -- add support for brotli - update CHANGES.txt
new fc718f4 TIKA-2620 allow configuration of setting KCMS
new cecce45 fix cherry-picked version clash for TIKA-2621
new b928453 TIKA-2625
new d1a7cab TIKA-2626
new b85d2f8 Update CHANGES.txt for 1.18 release.
new 72db7c5 fix license issues identified by rat during prep for 1.18 release
new 0afdf50 [maven-release-plugin] prepare release 1.18-rc1
new 2362a00 [maven-release-plugin] prepare for next development iteration
new 7fb331d rollback 1.18-rc1 attempt -- error transferring data to nexus
new c551a15 Update CHANGES.txt for 1.18 release
new c44e8b6 [maven-release-plugin] prepare release 1.18-rc1
new ef85fa8 forgot to delete existing tag -- revert to 1.18-SNAPSHOT
new 5cff40f [maven-release-plugin] prepare release 1.18-rc1
new 5b12e7f [maven-release-plugin] prepare for next development iteration
new e82c2ef fix potential resource leak
new b2d3932 fix potential resource leak
new 4fdc51a followup fix
new ffb48dd fix chm parser
new 302f22a fix readUE7
new 5d983aa fix chm; remove println
new 3b6682e rollback to 1.18-SNAPSHOT in prep for rc2
new d1bc093 fix potential resource leak, continued
new c9c1844 Update CHANGES.txt for 1.18-rc2 release
new 1203862 [maven-release-plugin] prepare release 1.18-rc2
new a39b325 [maven-release-plugin] prepare for next development iteration
new bb7adac TIKA-2634 upgrade Jackson to 2.9.5
new a8b41d3 TIKA-2634 upgrade Jackson to 2.9.5
new 85b2504 Merge remote-tracking branch 'origin/branch_1x' into branch_1x
new c68994f fix broken build on *nix caused by recent fixes; improve documentation; ensure trailing slash behavior on all OS
new e84d0d5 TIKA-2635 -- require that user specify path for imagemagick on windows to avoid conflict with system util "convert.exe"
new 15410ed roll back to 1.18-SNAPSHOT in prep for RC3
new 24cd176 update CHANGES.txt in prep for RC3
new 38ff2a9 [maven-release-plugin] prepare release 1.18-rc3
new 4d2753c [maven-release-plugin] prepare for next development iteration
new 0c1909a For now, if there's a network problem grabbing dl4j's model, skip the test silently. Do the same thing in both tests.
new 426e92d Merge branch 'branch_1x' of https://github.com/apache/tika into branch_1x
new f9722b4 TIKA-2644 improve api for recursiveparserwrapper
new 64fef4e TIKA-2644 improve api for recursiveparserwrapper -- deconflicted
new c203ef3 TIKA-2645 for tika-core
new 0101164 TIKA-2645 - use a pool for SAXParsers -- tika-parsers package
new aa1a749 TIKA-2645 - use a pool for SAXParsers -- improve comments and avoid permanent hangs if a parser has forgotten to release its SAXParser.
new 017096f TIKA-2645 - remove commons math from dependencies
new c40045a update changes
new cdca0f7 TIKA-2645 -- make pool methods private for better encapsulation and add a pool for DOM building
new 124a06d TIKA-2520 optimize OptimaizeLangDetector default loadModel()
new 7e3e34c Merge branch 'TIKA-2520' of https://github.com/mbaechler/tika into branch_1x
new ac73693 allow more flexibility for OCR variations in a PDFParser test
new 62926ca fix logic in iptc parser
new c9294bf make ocr test more flexible to allow for different versions/settings of tesseract
new 8d26096 TIKA-2100 extract content language from html lang attribute
new 3aba5c4 revert changes to imports
new 70662cd improve audioparser
new e9d807d TIKA-2100 -- fix unit test
new b9cf2f3 TIKA-2655 - Allow the RecursiveParserWrapper to work with the ForkParser
new acd92ac TIKA-2655 - Allow the RecursiveParserWrapper to work with the ForkParser --merge conflicts in test with 1x
new 6c747d1 TIKA-2655 - allow handlers to be proxied back or not when the handler is an AbstractRecursiveParserWrapperHandler
new 0105869 TIKA-2655 - failed to merge changelists before last commit. Sorry!
new 4a7bf9a merge conflicts
new 0e1f4e7 merge conflicts
new 4afd8f0 TIKA-2653 -- fix debugger on ForkParser test
new 5ee06ca merge conflicts
new 85cc113 TIKA-2653 fix merge conflicts
new 12f455c ForkParser -- update to master; handful of fixes
new 00ff640 TIKA-2656 -- allow absolute timeout for ForkParser
new 12884fd TIKA-2656 -- allow absolute timeout for ForkParser -- update CHANGES.txt
new aa6a63f TIKA-2657 -- add system exit, thread interrupt and gc-triggering new Date() in MockParser
new e350488 TIKA-2446 -- prevent oom during detection of corrupt zip
new 88fe62c TIKA-2446 -- prevent oom during detection of corrupt zip -- catch POI's "could be odt exception" and POI's RuntimeException
new f2b0e5a TIKA-2659 -- add parameters for max files processed in forkclient, and improve some of the offline smoke testing infrastructure.
new 90e3387 TIKA-2662 add streaming json serializer
new 5d6e09a TIKA-2661 upgrade commons-compress to 1.17
new 2f6933f TIKA-2661 upgrade junrar to 1.0.1
new 39c85da clean up dev mess
new 92aaf22 avoid npe when lang code not found on classpath
new ecbd316 undo idiocy
new 0b62157 avoid npe when lang code not found on classpath
new ebb22dd undo idiocy
new f309935 TIKA-2668 -- fix TaggedSAXException for Java 11-ea
new 2512051 TIKA-2660 -- enable building with Java 10
new 45468aa TIKA-2670 add try/catch block into ModelGetter
new 9a56aa4 TIKA-2660 -- enable building with Java 10 -- revert tika-dl until full fix is available.
new def58f6 TIKA-2675 OpenDocumentParser should fail on invalid zip files - throw IOException if ZipInputStream is invalid or does not contain any entries
new b4cdfcf TIKA-2677 -- fix multithreaded updating/access to MediaTypeRegistry, via Yuriy Koval
new df9ed82 TIKA-2673 -- unit tests for stricter adherence to spec via Gerard Bouchar
new c6f7b45 TIKA-2673 -- unit tests for stricter adherence to spec via Gerard Bouchar -- fix illegal getBytes()...mea culpa...
new 729d29e TIKA-2679 -- bump 1.x to 1.8
new bae509c Bumped PDFBox to 2.0.11
new ad8765d TIKA-2682 -- update jempbox to 1.8.175
new 6933efd TIKA-2669 -- pdf and tesseract config set in a tika-config.xml file on server start up are always overwritten to DefaultConfig in tika-server
new a333a4a Merge pull request #240 from sebastian-nagel/TIKA-2675-OpenDocumentParser-fail-invalid-zip
new db1301d improve htmlparser
new 525889a TIKA-2673 -- add StrictHtmlEncodingDetector, contributed by Gerard Bouchar
new a09d853 TIKA-2687
new 5c78eb7 TIKA-2687
new b2973e3 TIKA-2691 -- upgrade jai-imageio-core and pdfbox's jbig2-imageio while we're at it.
new 19364b8 TIKA-2690 via Hans Brende
new fc23648 TIKA-2688 via Yury Kats
new fe2b3ae TIKA-2692 -- minimal upgrades to pass ossindex-maven module -- except for tika-nlp module, which requires significant work. fix conflicts
new 6b37754 TIKA-2692 -- minimal upgrades to allow building w Java 11-ea
new 1438d8a TIKA-2692 -- general upgrades in prep for 1.19
new 8f61126 TIKA-2692 -- general upgrades in prep for 1.19
new 6afdf19 Depend on Parso for SAS7BDAT support
new 4c5bbae Add parso to the OSGi bundle
new fa5f282 Test Columnar files - SAS7BDAT and CSV (other spreadsheet+DB formats still required)
new 2d19fe0 TIKA-2462 Initial parser for SAS7BDAT files powered by Parso (now ASLv2). Still to do: Metadata, Unit Tests, Consistency with similar format tests
new f3508f2 XHTML improvements
new 284965e Some SAS7BDAT metadata and unit testing
new 39e1194 More SAS7BDAT metadata
new c31d40f SAS7BDAT html tests
new 02bef03 Clean up imports
new aaa78a3 Stub a unit test for TIKA-2641
new b6399c6 Handle .epub files using .htm rather than .html extensions for the embedded contents (TIKA-1288)
new 95a247c Add a test .sas7bdat file with labels, and generate the columnar/tabular test file in a few more formats
new 7f68ebb Add a time column to the test columnar files
new b92f752 CSV assert as best we can (no dedicated parser), start on XLS and SAS7BDAT consistency tests
new 65af2d9 Check header contents, check data rows count, add XLSX test
new 3f2b7a5 Remaining values to check
new 5d3dd69 Ensure that empty cells are still output
new 507f59f Not all formats know about %s, dates not completely consistent either...
new 81caa71 Use patterns to handle the date format variations
new d871b1f Add disabled, currently failing ODS test
new de53df9 Mime magic for DPX and ACES, thanks to Andreas Meier (TIKA-2628 and TIKA-2629)
new 6880127 TIKA-2479 Option to request missing rows where possible in Excel-like formats
new dcfbe5a TIKA-2479 Output missing left/mid cells in XLSX and XLSB, and optionally also missing rows
new b336360 Updated Columnar output from SAS with better formats
new 65cf9f2 Formatted columns in the columnar test Excel files
new 8ea6b22 TIKA-2479 Update XLS missing cell/row handling to match XLSX and XLSB, add unit test for missing rows, and enable the Columnar tests for the Excel formats
new 060bfa5 Move some fixes that didn't make it into 1.18 into 1.19
new 08a767a Changelog update
new 3da39b8 Add the other jackcess jar to the bundle
new d811a3a Move some fixes that didn't make it into 1.18 into 1.19, clean up
new 2cf0a96 TIKA-2703 make sure to process shape's parent drawing only once.
new 2745cfd TIKA-2703 -- related...simplify XSSFB to use more of XSSF rather than copy/paste
new d66dcbb TIKA-2701 -- via Grigoriy Alekseev
new 6badaea TIKA-2673 -- add StandardHtmlEncodingDetector via Gerard Bouchar
new 4475b72 TIKA-2673 -- fix forbidden-apis failure and retro-fit for branch_1x
new f5a2fae TIKA-2648 detect interpreted server-side script languages
new bd9d75d improve xml reading
new 36fa58f TIKA-2704
new 375e3d7 TIKA-2705 -- allow parameter configuration for tesseract via tika-config.xml
new 5346cbb TIKA-2706 -- store exceptions from macroreader in child metadata
new 2cdf627 TIKA-2695 -- upgrade Lucene to 7.4.0
new b717ca6 TIKA-2667 upgrade jmatio
new 719826a fix doubled junit dependency in tika-nlp
new f44e109 TIKA-2672 -- upgrade deeplearning4j to 1.0.0-beta2 via Thejan Wijesinghe. Thank you, Thejan!!!
new b542f9b TIKA-2672 -- remove hard coded input dimensions
new ed0d3d1 TIKA-2707 -- upgrade to commons-compress 1.18
new 1f5669d TIKA-2710 - Change Tika OSGi Execution Environment to 1.8.
new 3c76b3a TIKA-2710 - Change Tika OSGi Execution Environment to 1.8. Format fix.
new 0dbf67d TIKA-2721: removed spring-* from tika-parsers deps
new 0951bf9 TIKA-2722 -- remove dead code and prevent potentially bad date.toString() call.
new 8a1392b Merge remote-tracking branch 'origin/branch_1x' into branch_1x
new 8d70109 TIKA-2722 -- clean up setting calendar values
new 2fd54ff TIKA-2722 -- clean up setting calendar values, take2
new 1ff63b0 improve xml parsing
new 39f69ef Mime magic for "MIME Encapsulation of Aggregate HTML Documents" (MHTML), pulled out from rfc822 (may not be fully correct long-term...)
new 4f85418 Changes update
new bb10dc2 TIKA-2552 -- upgrade to POI 4.0.0
new 49ed309 TIKA-2552 -- upgrade to POI 4.0.0 -- fix merge conflicts
new 92e488b TIKA-2719 -- add automatic module names
new 58dadac TIKA-2725 -- checkpoint commit ... basic child process is started...need to integrate actual statuswatcher, etc.
new e7cef35 TIKA-2725 -- first working draft...ready for commit and future cleanups
new 3af35f1 TIKA-2725 -- first working draft...include commit with conflicts resolved. :(
new 5211fc7 TIKA-2725 -- add synchronization to avoid potential NPE in watcher thread
new 153c394 update changes.txt in prep for 1.19 rc1
new 4aef777 TIKA-2692 -- upgrade a few other dependencies
new 0db4724 Update CHANGES for 1.19 release
new 10c75a1 Fix missing license headers; h/t rat!
new af04995 fix conflict
new 90285dc [maven-release-plugin] prepare release 1.19-rc1
new 7aba1c5 [maven-release-plugin] prepare for next development iteration
new 03e0942 roll back to 1.19.SNAPSHOT for second attempt of RC1
new 48e76da [maven-release-plugin] prepare release 1.19-rc1
new 82146ad roll back to 1.19.SNAPSHOT for second attempt of RC1, take 2
new 199112b [maven-release-plugin] prepare release 1.19-rc1
new 7259325 [maven-release-plugin] prepare for next development iteration
new 231fbb0 Fixed javadocs
new a24976a Cosmetics
new a366813 Removed #getDetector from ImportContextImpl
new d66c04a update changes after 1.19 release
new 49d1e82 Fixed javadocs
new e36fafe Fixed javadocs
new 8053e31 Removed #getDetector from ImportContextImpl
new b213fb3 Merge remote-tracking branch 'origin/branch_1x' into branch_1x
new ed1e2f3 TIKA-2729 -- child process should run in headless mode.
new 962c015 upgrade to forbiddenapis 2.6 https://github.com/apache/tika/pull/249 via Uwe Schindler
new 80cfd6d TIKA-2730 -- allow last frame to be truncated w/o throwing an EOF
new f6c38ef TIKA-2731 via jkakavas. This closes 250.
new c25671c TIKA-2731 via jkakavas. This closes 250.
new b29e11f Merge branch 'branch_1x' of https://github.com/apache/tika into branch_1x
new 6c2c8ad update test corrupted files
new 932ff38 TIKA-2638 -- allow multiple languages in config for OCR parser
new 88bb6ab TIKA-2732 -- allow configuration of XMLReaderUtils via TikaConfig
new f75ba63 TIKA-2733 -- improve oom unit test and error/logging when the child process can't start in tika-server
new a1f48b0 TIKA-2727
new 55742a4 TIKA-2736 -- improve reports for comparisons
new d712103 TIKA-2738 -- ForkParser option isn't working in tika-app. Make PasswordProvider serializable.
new e75554a TIKA-2739 -- ForkParser's child process should be headless
new a1d4e55 update changes for 1.19.1 release
new 60bc5d8 [maven-release-plugin] prepare release 1.19.1-rc1
new dc02c59 [maven-release-plugin] prepare for next development iteration
new 46dee91 reset to 1.19.1-SNAPSHOT after timed-out rc1 attempt
new b8d28ef [maven-release-plugin] prepare release 1.19.1-rc1
new 92a5a62 [maven-release-plugin] prepare for next development iteration
new aa9aeb7 reset to 1.19.1-SNAPSHOT after timed-out rc1 attempt, take 2
new a5162fb [maven-release-plugin] prepare release 1.19.1-rc1
new b78ed12 [maven-release-plugin] prepare for next development iteration
new 481481a reset to 1.19.1-SNAPSHOT after timed-out rc1 attempt, take 3
new 70823ad [maven-release-plugin] prepare release 1.19.1-rc1
new 5c6bbff [maven-release-plugin] prepare for next development iteration
new 019bd30 reset to 1.19.1-SNAPSHOT after timed-out rc1 attempt, take 4
new 628d34f [maven-release-plugin] prepare release 1.19.1-rc1
new 9ee3b48 [maven-release-plugin] prepare for next development iteration
new a279490 rolling back, again
new 341e359 [maven-release-plugin] prepare release 1.19.1-rc1
new 4392731 [maven-release-plugin] prepare for next development iteration
new 1d68362 TIKA-2740: Added TkInter module into Python dependency check
new 9ab9485 TIKA-2740: Updated Changes File
new ea0eb90 TIKA-2742 -- upgrade jmatio to 1.5 to avoid bringing in slf4j-log4j12
new 033758a TIKA-2473 - Replace com.sun.xml.bind:jaxb-impl and jaxb-core with org.glassfish.jaxb:jaxb-runtime and jaxb-core
new 4bb4ad6 TIKA-2478 -- maxFiles should take an argument...duh
new ad0f41c TIKA-2478 -- add preliminary pseudo test for -maxFiles
new 336c351 TIKA-2745 -- update PDFBox, jempbox and jbig2
new c6ad906 Update CHANGES for 1.19.1 release
new 3c2c410 [maven-release-plugin] prepare release 1.19.1-rc2
new b5596cb [maven-release-plugin] prepare for next development iteration
new fb849e6 Update changes file for 1.20
new c61ed55 Add logging for OOM.
new 307a8bd TIKA-2735 -- allow user to avoid extracting "master" sections and notes sections from ppt[xm]?
new 0f7e86c TIKA-2753 -- use -javaHome or $JAVA_HOME when starting child process w -spawnChild mode in tika-server
new c6cacec TIKA-2754 -- include filename in logging in tika-server
new 65d18af TIKA-2756 -- factor out code that relies on the old commons-lang... once Jackcess migrates to lang3, we'll be good to go.
new 889c2c9 TIKA-2757 -- add versions plugin
new 4076991 RSS test file is RSS v0.91, so name appropriately
new d31f568 Add a test RSS 2.0 file
new 416f996 Use the new RSS 2.0 file in tests too, alongside the current 0.91 one
new 9054732 TIKA-2764 parameterize inclusion/exclusion of deleted text, and fix '-' while you're at it.
new 6103161 TIKA-2761 -- write as much metadata as possible before writing to xhtml.
new 7a34b58 TIKA-2759 -- don't extract data uri if inside a <script/> element when not extracting <script/> content.
new eb53077 TIKA-2599: Fixed closing of styles around Hyperlinks. Contributed by Ronan O'Sullivan.
new 50a2a8f TIKA-2599: Fixed closing of styles around Hyperlinks. Contributed by Ronan O'Sullivan.
new 324cbd2 Merge pull request #253 from dameikle/TIKA-2599
new 6b2cdc9 TIKA-2762 Capture short fields (<150 chars) in EnviParserHeader Metadata
new 33d960c TIKA-2762 Capture short fields (<150 chars) in EnviParserHeader Metadata
new 0c49c85 TIKA-2773 upgrade sqlite version
new 4c564bd TIKA-2777 -- improve inefficient regex performance in Optimaize in tika-eval
new e0991f4 Corrected file name to match test
new 41608d5 Move glassfish warning from license to notice file.
new 589ee7f TIKA-2775 -- bulk upgrades
new 2ead2bb TIKA-2775 - bulk upgrade dependencies
new ccb96cd TIKA-2775 - bulk upgrade dependencies -- backoff minimum maven dependency to 3.1; clean up whitespace in tika-eval's pom
new 41dda34 prefer System.currentTimeMillis to creating a new Date object, throughout...
new 2309974 TIKA-2778 -- the shutdown method for tika-batch mode should not be typing anything on stdin of the parent process. Rather, require an interrupt and/or kill signal and then make sure the children are stopped as well.
new fe4f41b TIKA-2780 -- the shutdown method for tika-batch mode should not be typing anything on stdin of the parent process. Rather, require an interrupt and/or kill signal and then make sure the children are stopped as well.
new f9eff6f Merge branch 'branch_1x' of https://github.com/apache/tika into branch_1x
new 9fd50ed TIKA-2780 -- fix changes.txt
new c0b594e TIKA-2782 -- confirm child streams are redirected. Add workaround (shameless hack) if logger writes before streams are redirected.
new f9eec83 TIKA-2778 -- Upgrade jaxb-runtime and javax.activation for use in Java > 8
new 22f5707 TIKA-2776 -- tika-server in legacy mode should ignore oom.
new 9e2a9bb TIKA-2776 -- update CHANGES.txt
new d6938de fix for TIKA-2770 contributed by kristencheung
new 4538f1d TIKA-2770 fix merge conflicts from 81430d51c kristencheung
new f4c0b5a removing extra test
new f136f97 fix for TIKA-2770 conversion for UTM only
new 0a3a8be TIKA-2776 -- update CHANGES.txt
new 9137249 TIKA-2784 -- MockParser should allow us to simulate a parser grabbing stdout/stderr during static initialization.
new c7bb0c9 TIKA-2785 -- switch communication from child to parent to a shared memory-mapped file in -spawnChild mode in tika-server.
new 3e7e89a TIKA-2785 -- clean up logging in tika-server; redirect child stdout to parent stderr to avoid maven complaining about corrupting stdout in forked process; convert oom to fake oom
new e31e8ce TIKA-2785 -- try to fix test that is failing on Linux, but not Windows
new 5590e0a TIKA-2785 -- fix unit test that is failing in Linux but not Windows, take 3
new eec08ba cleanup accidental println
new 4141411 TIKA-2776 -- improve documentation for -maxFiles
new 690586f fix whitespace
new 4d6bc01 TIKA-2550 -- prevent content from script/style elements from being written in ToTextContentHandler
new d837e1b Upgrade to PDFBox 2.0.13 (TIKA-2788)
new 6b56ed2 TIKA-2779: Integrate/parameterize new rotated text handling in PDFBox
new 6322421 TIKA-2751 -- Upgrade to POI 4.0.1
new 44165a3 TIKA-2550 -- make sure that ToTextHandler's new behavior of ignoring script/style contents doesn't harm macro extraction in HTML parser
new 2439927 Upgrade MP4Parser to newer dependency coordinates org.mp4parser:isoparser (TIKA-2792).
new 6c122d1 put the overridden processTextPositions within the inner class -- bug fix for TIKA-2779.
new 582a1d4 TIKA-2795 -- catch IOException if child deletes shared file
new 8475ddb TIKA-2795 -- swapped memorymapped buffer for traditional open, write close of a temp file because of cross-platform challenges.
new df5792e TIKA-2637 -- ParsingReader should return -1 for a zero byte file
new 40b0427 handful of dependency upgrades
new 7696e38 TIKA-2792 -- revert mp4 parser based on large scale regression test results
new 8c88966 TIKA-2798 -- improve reporting for attachment diffs
new 4c9e38e TIKA-2791 -- add tags/structure to tika-eval
new 6a6c82a TIKA-2775 -- more updates (these were made locally before the first major regression run pre-1.20-rc-1)
new ad39610 TIKA-2798 -- revert junrar
new b2680df TIKA-2800 -- add num unique alphabetic tokens and num unique common tokens
new 1a1f980 TIKA-2799 - revert jackcess based on regression results
new 6f62b95 update CHANGES.txt and KEYS for 1.20 release.
new fc0e1a3 license fixes for rat
new adf4a1b [maven-release-plugin] prepare release 1.20-rc1
new 37407ad [maven-release-plugin] prepare for next development iteration
new 750bbfc TIKA-2801 -- add ossindex-maven-plugin and upgrade vulnerable dependencies (skipping tika-nlp for now).
new 3aa311c TIKA-2804 -- upgrade Lucene and Jackcess
new f2eb5ac TIKA-2802 -- try to clear the XMLReader's resources to avoid OOM
new 8fc1ed1 TIKA-2765
new c9ecf1c TIKA-2765 -- fix capitalization of test file.
new 35601db TIKA-2807 -- extract sdt content from within textbox in docx
new eaaf1e3 TIKA-2808 -- exclude h2 from ossindex-maven-plugin
new 73d009a TIKA-2809 -- add reports for tags; and add "b" tag.
new 2983c36 TIKA-2810 -- handle bad tags more robustly
new df99549 TIKA-2816 -- allow OCR parameter header setting in tika-server to include parameters of type long/Long
new e82382a TIKA-2816 -- fix unit test
new ea23d25 TIKA-2822 -- update common tokens lists with 7.x Lucene.
new 4a0c26b TIKA-2823
new a32234e TIKA-2822 -- remove common >=4 letter html markup entities
new 40ab7f8 rm println
new c566e65 TIKA-2717 -- upgrade jackson
new c7438e6 TIKA-2819 -- upgrade jaxb via Hans Brende -- many thanks for your patience and figuring this out!
new a1b83d2 TIKA-2802 -- bundle xerces2 with tika-parsers and upgrade CHANGES.txt
new 92c8575 TIKA-2824 -- general dependency upgrades
new 07d8277 TIKA-2824 -- general dependency/plugin upgrades and plugin cleanup
new acdc4fc TIKA-2825
new d7a5c20 TIKA-2819 -- remove activation-api from dependencies
new 2cd927a TIKA-2756 -- upgrade Jackcess and remove dependencies on commons-lang
new cfa524e TIKA-2824 update Lucene to 7.0.0
new 150d4bc TIKA-2828 -- initial CSVParser commit
new 3ad6edc TIKA-2826 - mea culpa and my apologies...fixed master vs branch_1x incompatibilities.
new fab1954 TIKA-2827 -- include both mime_a and mime_b more often in comparison diff reports
new 0252277 TIKA-2824 - general upgrades: h2
new d3317f9 TIKA-2833 -- initial commit with csv detection and swapping out the TXTParser in favor of the CSVParser
The 4366 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.