Posted to commits@tika.apache.org by ta...@apache.org on 2023/09/13 15:22:47 UTC

[tika] branch deprecated_2.x_dev created (now 0a55b4a4e)

This is an automated email from the ASF dual-hosted git repository.

tallison pushed a change to branch deprecated_2.x_dev
in repository https://gitbox.apache.org/repos/asf/tika.git


      at 0a55b4a4e TIKA-2354 -- .doc is missing many pictures

This branch includes the following new commits:

     new b1c00c050 TIKA-1985 -- ignore test until we get permission to use test file
     new e05dd5bf4 TIKA-1990 -- need to add JPEG filters to embedded stream when handling embedded jpegs in PDFParser
     new e5a7604bc TIKA-1992 -- check for duplicate inline images by COSStream not object name.
     new ebe702898 TIKA-1994 -- Integrate TesseractOCR with full page image rendering for PDFs
     new 89062edb0 TIKA-1999: add configurable limit to number of events extracted in XMPMM history.
     new ac52e5c15 TIKA-1999: fix setter, update changes.txt
     new b480d43f5 TIKA-1996 -- Upgrade to PDFBox 2.0.2
     new f90193aa0 TIKA-2006 -- add mime definitions for ical and vcal
     new ffaa4deaa TIKA-2004 -- add mime definitions for Windows Media Metafile
     new 60d4e3ff2 TIKA-2008 -- add mime definition and parser for MSOwnerFile
     new b3bf5141b TIKA-2008 -- change metadata key to TikaCoreProperties.MODIFIER
     new 73ce7681c TIKA-2009 -- add magic for djvu
     new b600b6701 make sure to test magic for vcs/ics/asx
     new 2f5537380 TIKA-2009 -- add detection for Endnote Import files
     new 1ce93ed9e TIKA-2019 -- fix WordMLParser and SpreadsheetMLParser
     new 767442614 fix indents and whitespace
     new 6bb6827e0 add startDocument and endDocument() to PRTParser so that it works with the ToXMLHandler
     new 0c71b2ffc TIKA-2020, remove 3 parameter parse() and simplify CAD tests
     new cd12917fa TIKA-2020 -- remove 3 parameter parse() and simplify CAD tests
     new b14b47e76 TIKA-2022 -- add parser for applefile
     new 5bc597dc8 TIKA-2023 -- clean up RTFParser to use EndianUtils and IOUtils.readFully
     new 865c45cd5 fix indentation
     new c84855f67 TIKA-2022 - clean up -- make entries private, move more into EndianUtils
     new e62f23057 TIKA-2024 extract original file name/path where possible, take 1
     new 933af20e8 rm inconsistently capitalized test files
     new dd3c2a486 TIKA-2026 -- improve extraction of attachments for PPT, PPTX, XLSX
     new c7a6bcac4 Convert new lines from windows to unix
     new 4678d6733 TIKA-2024 extract original path name from OLE1.0 embedded objects
     new 2eb4804d1 fix getRecursiveJson -> getRecursiveMetadata in TikaTest, no json is involved here...
     new 2a7e52ec4 fix getRecursiveJson -> getRecursiveMetadata in TikaTest, no json is involved here...not sure why Intellij didn't catch this one.  sorry.
     new 573527bbc Merge branch '2.x' of https://git-wip-us.apache.org/repos/asf/tika into 2.x
     new 87e1e23b4 TIKA-2030 - add handling for <text:s/> element to ODT parser. Thanks to David Pilato for opening this issue.
     new cdfacdb41 Merge remote-tracking branch 'origin/2.x' into 2.x
     new e27526b84 TIKA-2030 - fix test file so that it is correctly detected
     new f4bacf859 TIKA-2025 increase number of significant digits extracted in "general" format in xls/xlsx
     new 8b951a43c TIKA-2039 upgrade to jackcess 2.1.4
     new d6ce10b41 Email with attachment for testing extraction issues
     new 31374a39b TIKA-2037 RFC822Parser should wrap the James InputStream of embedded resources to avoid problems with downstream detection or extraction
     new 65cc9bcec TIKA-2042 MBOX magic and detection unit test
     new 53310facc Changelog update
     new f89887d2f TIKA-2037 Merge fixes for 2.x
     new 9f6c71fa6 TIKA-2041, upgrade ICU4j's charset detector to avoid multithreading bug.
     new 1c582aba6 TIKA-2040 - prevent permanent hang/oom on corrupt chm file
     new fc7c372f5 TIKA-2048
     new 6ebbd7ef7 cleanup MatParser
     new b41c0b2a8 TIKA-2041 - add important diffs between new copy/paste from ICU4J and legacy code which may have included Tika-specific mods.
     new 5358bf1e1 TIKA-1938 via Joseph Naegele
     new 09bd22fb4 TIKA-1938 via Joseph Naegele
     new 27bc383eb TIKA-1980 via Joseph Naegele
     new db513d6ad TIKA-2007 upgrade jackson
     new 87b6d5d7d TIKA-2007 upgrade jackson, needed to update CachedTranslator (diff btwn trunk and 2.x)
     new 59e0ca0fc TIKA-2059 - Merge multimedia and pdf parser modules and bundles
     new 4704d976c TIKA-2061 - Embed xmpcore in tika-xmp since it is not a proper bundle
     new cebf72382 TIKA-2062 - Remove bouncy castle inlining in bundles
     new dc841e6ba TIKA-2060 - Added toggle to ClassLoaderUtils for OSGi
     new fcefaae59 TIKA-2063 - Create vorbis bundle
     new 5d9db6bec TIKA-2063 - Added Vorbis bundle to bundle parent.
     new 8234b96fe TIKA-2061 - Added Adobe BSD license to tika-xmp
     new b2a7e382a TIKA-2065 upgrade forbiddenapis
     new 164bf52c8 TIKA-2066 upgrade commons-io to 2.5
     new 8ff89d419 TIKA-2067 upgrade maven plugin dependencies
     new a0f365524 TIKA-2067 upgrade maven plugin dependencies -- revert felix bundle
     new b73cd8ce8 TIKA-2074 - ServiceLoader can use Class files loaded via dynamic loading
     new d57a85274 TIKA-2070 - Add Encoding Detector and Language Detectors to Dynamic Service Loader
     new 587dcb772 TIKA-2072 - Create TikaServiceFactory for creating TikaService
     new 7a0280c77 TIKA-2071 - DefaultParser and CompositeParser does not filter excludedParsers from dynamic ServiceLoader Parsers
     new f8092d3bd TIKA-2073 - Tika Language Detect Project should include Bundle Activator and packaging consistent with other modules
     new f112c88fb TIKA-2075 - Expose Additional TikaService methods
     new 4636f95b2 TIKA-1255 and TIKA-2078 -- fix hyperlinks that include formatting and fix hyperlinks with multiple runs in docx
     new 443a21e3f TIKA-2064 Mime types, with magic, for mostly-xml Stata DTA files. (Awaiting suitably licensed file for testing)
     new e58ade381 TIKA-2064 Test Stata DTA files from Michael Stepner, plus detection unit test
     new 9f6241161 Merge changes for TIKA-2064 to 2.x
     new ae0cb3059 TIKA 2055 catch exception when totalTime out of unsigned int range in ooxml
     new 176f3aded TIKA 2055 catch exception when totalTime out of unsigned int range in ooxml, include test file
     new 92453f5e7 GitHub user haisi opened a pull request:   https://github.com/apache/tika/pull/132
     new 1b32e3186 TIKA-2015 -- upgrade to PDFBox 2.0.3
     new 12b1d435b TIKA-2013 -- upgrade to POI 3.15 -- don't forget to close new NPOIFS and MAPIMessage
     new 32d9ece8d  * Maintain passed-in mime in TXTParser (TIKA-2047).
     new 66f433471 TIKA-2069 -- extract macros from MSOffice files.
     new d543378a8 TIKA-2069 -- extract macros from MSOffice docs, fix tests to find target metadata object in any order
     new 673533d0e TIKA-2093- Add Tesseract's hOCR output format as an option, via Eric Pugh.  This commit also catches 2.x up to trunk; there were clearly some other changes to Tesseract that hadn't yet made it into 2.x.
     new ce1fc3720   * Re-enable fileUrl for tika-server (TIKA-2081).  If you choose,     to use this feature, beware of the security vulnerabilities!     See: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3271
     new bd7208929   * Re-enable fileUrl for tika-server (TIKA-2081).  Fix commandline options not to include '-'
     new 94789a963 Tika-2095 include Tika version in tika-server's GREETING
     new be78c549a Extract PDF DocInfo metadata into separate keys to prevent overwriting by XMP metadata (TIKA-2057).
     new 4392681af TIKA-2097 fix npe in mbox parser
     new cde4c0aa8 TIKA-2098 small clean up.  Test for writelimitreached for each catchable IOException.  Many thanks to Alexander Kazakov for finding this and submitting https://github.com/apache/tika/pull/134
     new b84fcc584 TIKA-2101 -- don't call MAPIMessage's close()
     new 1ab6c81ce TIKA-2106 -- need to lower case hocr/txt suffix, thanks to Eric Pugh. This closes #136
     new 1ec8c0947 Tesseract may see the t in haystack as a ! some times...
     new af74ea5c9 TIKA-2110-- log full exception throughout tika-batch
     new 3fe8ef819 Merge remote-tracking branch 'origin/2.x' into 2.x
     new 1e55953bc TIKA-2113-- upgrade metadata-extractor to 2.9.1
     new 30e03de89 TIKA-2122: Extract all headers from MSG/RFC822
     new 7e66e4979 TIKA-2123: digester fails with multiple digests on large files
     new c5f4f5263 TIKA-2127 : npe if there is no notes master)
     new 4c3bb1560 TIKA-2133
     new 936e3ac16 TIKA-2130
     new 4b393a6f9 TIKA-2144 - avoid npe if styles doesn't exist (odd, indeed, but if MSWord can handle it, we should, too).
     new a6978521f TIKA-2111 - ExecutableParser should set rather than add a Content-Type
     new 6ca74bec6 improve unit test for TIKA-2098
     new 2d5189186 TIKA-2157 - handle zip exception in embedded file
     new 1d1bc0dd7 TIKA-1933 - clean up one more place where we aren't closing the ForkParser and are leaving behind a tmp ForkParser jar
     new 2c9412ab1 TIKA-2171 - upgrade sqlite parser
     new bcd59cee7 TIKA-2171 - upgrade sqlite parser
     new 7422218eb TIKA-2173 - first steps.  Need to integrate parameter configuration into 2.x before I can do the rest
     new f2661f997 TIKA-2174 add jpx and jp2 to Tesseract
     new ab009aeb7 TIKA-2159 -- first step
     new 3f24e6c3e TIKA-2174 -- add ppm and update changes.txt
     new 9a68f4ccc TIKA-2174 -- clean up
     new 7adfe1cb5 TIKA-2170 allow configuration of timeout for ForkServer
     new 7df6fe4be TIKA-2170 fix unit test to allow for different exceptions depending on cause of timeout.
     new 8c01e4d8e TIKA-2116 upgrade to POI 3.16-beta1
     new 2f452304b Add mime detection and parser for Word 2006ML format (TIKA-2179).
     new a47a69933 TIKA-2169 fix xhtml in ocr
     new e5e4d4d91 TIKA-2096 change default to extract embedded documents even if the user forgets to specify an AutoDetectParser in the ParseContext
     new 1bb7c3384 TIKA-2179  --  add detection and parsing for word2006ml files -- this modification somehow fell to a different change list
     new de103c81f TIKA-2096 -- fix example, sorry...
     new 32162f59e TIKA 1321 initial commit
     new 3d08da79f TIKA-2187 -- make "ignore deleted" as the default in the experimental SAX .docx parser and update the WordExtractor to include extraction of deleted text if requested by the user.
     new 300100fcb  TIKA-2090: Allow extraction of PDActions (including Javascript) from PDFs (TIKA-2090).
     new d8853fe31 Update to PDFBox 2.0.4
     new 4f04b6c3e  TIKA-2218 -- add a new new locations within a pptx to check for embedded objects
     new ffb25af1b Merge remote-tracking branch 'origin/2.x' into 2.x
     new ee761ac00  TIKA-2221 -- correctly catch and rethrow encrypted document exception as EncryptedDocumentException in WordExtractor via Matthew Caruana Galizia
     new 68f305864 TIKA-2219 make sure to transmit charset name in detectAll via Pascal Essiembre
     new 54154e004 TIKA-2219 make sure to transmit charset name in detectAll via Pascal Essiembre -- fix test method to get inputstream from zip
     new 0d30aa1b2 TIKA 2190  --  add configurability for preserve interword spacing
     new 50c1dc69d update OCR config to include default for output type
     new c9fcb3315 TIKA-2211 modify test file to include style information to test that we're excluding it.
     new 337d38304 TIKA-2211 -- make sure that head (<style>) content isn't showing up in body in the EpubParser
     new 4383e3da7 TIKA-1946 -- initial commit to add parsers for WordPerfect and QuattroPro.  Many thanks to Pascal Essiembre for contributing these!!!
     new 39cf35551 TIKA_2226 add exception for unsupported formats
     new d8fa3c2a8 TIKA-1946 updates, detection of wordperfect 5.0 and 5.1 as well as quattropro 7-8 vs quattropro 9
     new bb76d986a TIKA-2224 Mime magic for OneNote
     new db21ee158 TIKA-2224 Mime sub-entry for .onepkg, a cab file holding other onenote files
     new 71584b2de TIKA-2224 We now differ from HTTPD on onenote formats, as we have subtypes they lack
     new cdb6456bb TIKA-2224 Test OneNote file from Krishnan Narayan plus unit test
     new 785e47413 Manually merge changelog
     new 4e3534da0 Move new test file to the 2.x location
     new f1a541378 TIKA-2190 -- Add test file for maintain spacing
     new f0863bcea Merge remote-tracking branch 'origin/2.x' into 2.x
     new aaa661e25 TIKA-2228 from Pascal Essiembre and TIKA-2230.
     new 850de1467 TIKA-2234    get rid of ThreadLocal
     new c14e75070 TIKA-2235 -- bump default dpi for images created via PDF for OCR to 300 dpi via Matthew Caruana Galizia
     new 0bc9bd896 TIKA-2232 -- add processing of jbig2 (with necessary non ASL 2.0 libs) via Pascal Essiembre
     new e02084cc6 TIKA-2192
     new 2d908d59b TIKA-2237
     new 681615731 TIKA-2210 -- add experimental SAX parser for pptx and update (also TIKA-2191 and TIKA-2220)
     new 28b53bd4d TIKA 2159 handle preparse/embedded IO exceptions uniformly
     new ce4e7e7d9 TIKA 2134 -- handle missing parts more robustly
     new cd98c4cf3 TIKA-2238   add mime detection for embedded MSEquation files
     new dd70fd33a TIKA-2232 -- ImageParser shouldn't allege that it can handle jbig2 when jbig2 library is not on class path
     new 45a9b77d6 TIKA-2241 Add new config dumping option STATIC_FULL which lists all supported+active mime types for parsers
     new 4374bcecf TIKA-2242  fix style markup in ODT
     new 9dbff6065 TIKA-2231 - allow underscored language codes (e.g. "chi_tra") in TesseractOCRParser via Graham Russell.
     new 161b122ba TIKA-2240 -- improve mime detection for .wri files
     new 8d783d27a TIKA-2232 -- log/warn if jbig2 is not on classpath
     new 78828176a TIKA-2244 --  be more parsimonious with BufferedInputStream via Josh Hight.
     new 58d56c33f TIKA-2250 As of RFC7903, the official mime type for BMP is now the one without the x- prefix
     new 6668d78fa TIKA-2250 As of RFC7903, the official mime type for WMF is now an image one and without the x- prefix
     new bd667acde TIKA-2250 As of RFC7903, the official mime type for EMF is now an image one and without the x- prefix
     new 985c1aef8 TIKA 2244 --   be more parsimonious with BufferedInputStream.    AutoDetectReader
     new 4599374d6 Merge remote-tracking branch 'origin/2.x' into 2.x
     new 235c2adab TIKA-2249 -- update javadocs to alert devs that tables are not "maintained" by the PDFParser
     new 3df8ce8b2 TIKA-2251 improve exception handling in SAX pptx/docx parsers
     new 6287b75b5 TIKA-2255 Test SAS files
     new 4d8feaee5 Move to Tika 2.x location
     new a79de0ccf TIKA-2255 Magic for older sas data files
     new 534a52598 TIKA-2255 Mime detection unit tests for SAS files
     new 28010d90d Mimetype for SAS Xport (XPT) files
     new 2d4889f44 TIKA 2025 --   fix xls/x testBigIntegersWGeneralFormat to work in multiple locales
     new 7b0655cc1 TIKA 2259  -- improve url extraction from PDFs = copy Tilman Hausherr's code from PDFBOX 3644
     new cf3996ed0 TIKA 2181   upgrade to POI 3 16 beta2
     new 27e81b97a TIKA-2181   upgrade to POI 3 16 beta2, make sure to upgrade overall bundle
     new 0d7f5bad0 TIKA-2198 - add null check to Tika after upgrade to POI 3.16-beta2
     new d9f376c12 TIKA-2134 - remove npe catch after upgrade to POI 3.16.beta2
     new 6bfe5d565 TIKA-2246 and TIKA-2247 -add parsers for EMF and WMF
     new 5e49c3308 TIKA-1332 initial commit of tika-eval.  More work remains.
     new 69dd0328b TIKA-1332 fix one profiler report and whitespace
     new 0d04b499a TIKA-1332 downgrade to Lucene 5.x so that this can run w/ Java 7
     new 44612ae40 TIKA-1332 fix pom for 2.0
     new 61532258f TIKA-1332 3rd time's the charm.  Fix dependencies with IOUtils.
     new 81150859b TIKA-1332 -- add English Spanish common tokens;  fix logging
     new 544ba9752 TIKA-2267 -- add common tokens for some languages into tika-eval
     new 824d176c9 TIKA-2269 -- fix potential NPE in FeedParser via Julien Nioche.
     new 0ce764915 TIKA-2275
     new 4ebc441bd TIKA-2276 -- pass through TikaConfig if not specified via ParseContext in AutoDetectParser
     new 35756b142 TIKA-2276   try to reuse parsers from ParseContext rather than creating own
     new a279d039d TIKA-2278    clean up extract exception handling
     new b2a462c6d TIKA 2276 -- cleanup
     new 6dcad8896 TIKA-2273 -- improve configuration of encoding detectors.  TODO: figure out loading in tika-app bundle and turn tests back on.
     new 5925bcb58 TIKA-2279 -   simplify token counting
     new 82509f32c TIKA-1857 xfa fix
     new d0ebfda73 fix tika-eval bug - include child file extension instead of parent
     new 4843ca157 TIKA-2286
     new 81f1591fe TIKA-2285 -- triggering file didn't actually trigger string index out of bounds exception, but there could be one with a null or very short styleName
     new 24160a1c0 TIKA-2281    add mapi message type
     new f70ea7a8f TIKA-1865 --  step 1  split out sender name from sender email exchange info where possible in msg files
     new 0274a2816 TIKA-1865    step 2  the other parsers 1
     new 2ebc90a5c TIKA-2281 applied to PSTParser
     new a12cae48f TIKA-1865 bug fix
     new 70895fcd9 TIKA-1865 clean up, deduplicate MailUtil, bug fix
     new 875c3a151 TIKA-2287 --   add jdbc
     new 5719bf788 TIKA-2287 --   bug fix, improve handling when ref tables already exist
     new 380af5b32 TIKA-2290 -- fix bug that prevents passing of ocr strategy via headers in tika-server
     new 15e22679f TIKA-1879 -- add more granularity to recipients in Outlook/PST emails
     new da2dce946 TIKA-2242 -- fix handling of annotations and <p> within a <p> in odt.
     new 93cb9717e TIKA-2295 -- extract images from odt
     new 7344209a1 clean up from sax docx work
     new 51cc80d24 TIKA-2236 upgrade PDFBox to 2.0.5 and JempBox to 1.8.13
     new 4ed7fccc3 TIKA 2287  -- bug fixes
     new 77f25f2e7 clean up unit tests
     new 29d7d7ceb TIKA-2300 record streams that can't be read via pkg's metadata via Aeham Abushwashi
     new fcccda6cc TIKA-2307
     new e3fead445 TIKA-2307 -- include finer grained supported types so that users can control includes/excludes with decorator via config
     new 2df5c536b TIKA-1772 More WebVTT magic - for cases with no header, and with custom headers
     new e34498bbe TIKA-1772 More test WebVTT files - no text header, and a custom one
     new d12c87b6d Merge 3c02c4b to the new 2.x test documents area
     new 78c31eb61 TIKA-1772 More WebVTT unit tests
     new f87948d28 Merge changelog update
     new 1826112e6 TIKA-2302 -- make macro extraction configurable and set default to false
     new 3e925166a Merge remote-tracking branch 'origin/2.x' into 2.x
     new 747b121fd Update mailing list archive links
     new 363675554 Bumped junit and slf4j versions
     new d8e4b5f6e Added explicit test scope for junit
     new 67a5e91b2 TIKA-2317 warn user if max content length is hit; allow for easier parameterization by commandline
     new c4888d59e Merge remote-tracking branch 'origin/2.x' into 2.x
     new 96a8ddd84 TIKA-2318
     new 37b8864ed TIKA-2319
     new fce6626f2 TIKA-2319 follow up
     new 6b9e36e3f TIKA-2323
     new 110247fcf turn off debug statement
     new d2907f41a TIKA-2325
     new 143efc8d9 TIKA-2311 -- maintain x-tika-ooxml mime type for truncated ooxml
     new 870ec187e In rare cases, elapsed can == 3000 exactly.  Fix this.
     new a847a863d TIKA-1195 and TIKA-2329, upgrade to POI 3.16-final and add xlsb parser
     new 73147a239 update javadoc for Latin1StringsParser
     new 51190df6e TIKA-2339 - remove test file that was identified by one av program as potentially contain MDropper.  We assess this as a false positive, but we've chosen to remove the file to allow users with this av program to build Tika.
     new 3743e4d67 TIKA-2309 Time Stamped Data Envelope parser
     new e7ad4ec15 TIKA-2309 fixed tika-parser-crypto-bundle IT
     new c67e62236 TIKA-2349 -- try to match embedded docs by digest in tika-eval "Compare"
     new 7c4258917 Merge remote-tracking branch 'origin/2.x' into 2.x
     new 4e1e87ff2 TIKA-2348 -- include caught exception in EMF/WMF rethrows
     new 6930ff025 TIKA-2311 -- try OPC before ZipFile.  This can work better on some truncated files.
     new 62e5a8477 TIKA-2350
     new babb2534e TIKA-2352 -- bug fix for WordPerfect parser via Pascal Essiembre. Pull request 176.
     new 9ef078778 TIKA 2343 -- add text-main/boilerpipe option to tika-server
     new fe3971a69 TIKA-2352 -- bug fix for WordPerfect parser via Pascal Essiembre. Pull request 176. Split to different change list...argh.
     new 21bcc5595 TIKA-2343 -- change put to post for multipart
     new 0a55b4a4e TIKA-2354 -- .doc is missing many pictures

The 251 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.