Posted to commits@tika.apache.org by ta...@apache.org on 2023/09/13 15:22:47 UTC
[tika] branch deprecated_2.x_dev created (now 0a55b4a4e)
This is an automated email from the ASF dual-hosted git repository.
tallison pushed a change to branch deprecated_2.x_dev
in repository https://gitbox.apache.org/repos/asf/tika.git
at 0a55b4a4e TIKA-2354 -- .doc is missing many pictures
This branch includes the following new commits:
new b1c00c050 TIKA-1985 -- ignore test until we get permission to use test file
new e05dd5bf4 TIKA-1990 -- need to add JPEG filters to embedded stream when handling embedded jpegs in PDFParser
new e5a7604bc TIKA-1992 -- check for duplicate inline images by COSStream not object name.
new ebe702898 TIKA-1994 -- Integrate TesseractOCR with full page image rendering for PDFs
new 89062edb0 TIKA-1999: add configurable limit to number of events extracted in XMPMM history.
new ac52e5c15 TIKA-1999: fix setter, update changes.txt
new b480d43f5 TIKA-1996 -- Upgrade to PDFBox 2.0.2
new f90193aa0 TIKA-2006 -- add mime definitions for ical and vcal
new ffaa4deaa TIKA-2004 -- add mime definitions for Windows Media Metafile
new 60d4e3ff2 TIKA-2008 -- add mime definition and parser for MSOwnerFile
new b3bf5141b TIKA-2008 -- change metadata key to TikaCoreProperties.MODIFIER
new 73ce7681c TIKA-2009 -- add magic for djvu
new b600b6701 make sure to test magic for vcs/ics/asx
new 2f5537380 TIKA-2009 -- add detection for Endnote Import files
new 1ce93ed9e TIKA-2019 -- fix WordMLParser and SpreadsheetMLParser
new 767442614 fix indents and whitespace
new 6bb6827e0 add startDocument and endDocument() to PRTParser so that it works with the ToXMLHandler
new 0c71b2ffc TIKA-2020, remove 3 parameter parse() and simplify CAD tests
new cd12917fa TIKA-2020 -- remove 3 parameter parse() and simplify CAD tests
new b14b47e76 TIKA-2022 -- add parser for applefile
new 5bc597dc8 TIKA-2023 -- clean up RTFParser to use EndianUtils and IOUtils.readFully
new 865c45cd5 fix indentation
new c84855f67 TIKA-2022 - clean up -- make entries private, move more into EndianUtils
new e62f23057 TIKA-2024 extract original file name/path where possible, take 1
new 933af20e8 rm inconsistently capitalized test files
new dd3c2a486 TIKA-2026 -- improve extraction of attachments for PPT, PPTX, XLSX
new c7a6bcac4 Convert new lines from windows to unix
new 4678d6733 TIKA-2024 extract original path name from OLE1.0 embedded objects
new 2eb4804d1 fix getRecursiveJson -> getRecursiveMetadata in TikaTest, no json is involved here...
new 2a7e52ec4 fix getRecursiveJson -> getRecursiveMetadata in TikaTest, no json is involved here...not sure why Intellij didn't catch this one. sorry.
new 573527bbc Merge branch '2.x' of https://git-wip-us.apache.org/repos/asf/tika into 2.x
new 87e1e23b4 TIKA-2030 - add handling for <text:s/> element to ODT parser. Thanks to David Pilato for opening this issue.
new cdfacdb41 Merge remote-tracking branch 'origin/2.x' into 2.x
new e27526b84 TIKA-2030 - fix test file so that it is correctly detected
new f4bacf859 TIKA-2025 increase number of significant digits extracted in "general" format in xls/xlsx
new 8b951a43c TIKA-2039 upgrade to jackcess 2.1.4
new d6ce10b41 Email with attachment for testing extraction issues
new 31374a39b TIKA-2037 RFC822Parser should wrap the James InputStream of embedded resources to avoid problems with downstream detection or extraction
new 65cc9bcec TIKA-2042 MBOX magic and detection unit test
new 53310facc Changelog update
new f89887d2f TIKA-2037 Merge fixes for 2.x
new 9f6c71fa6 TIKA-2041, upgrade ICU4j's charset detector to avoid multithreading bug.
new 1c582aba6 TIKA-2040 - prevent permanent hang/oom on corrupt chm file
new fc7c372f5 TIKA-2048
new 6ebbd7ef7 cleanup MatParser
new b41c0b2a8 TIKA-2041 - add important diffs between new copy/paste from ICU4J and legacy code which may have included Tika-specific mods.
new 5358bf1e1 TIKA-1938 via Joseph Naegele
new 09bd22fb4 TIKA-1938 via Joseph Naegele
new 27bc383eb TIKA-1980 via Joseph Naegele
new db513d6ad TIKA-2007 upgrade jackson
new 87b6d5d7d TIKA-2007 upgrade jackson, needed to update CachedTranslator (diff btwn trunk and 2.x)
new 59e0ca0fc TIKA-2059 - Merge multimedia and pdf parser modules and bundles
new 4704d976c TIKA-2061 - Embed xmpcore in tika-xmp since it is not a proper bundle
new cebf72382 TIKA-2062 - Remove bouncy castle inlining in bundles
new dc841e6ba TIKA-2060 - Added toggle to ClassLoaderUtils for OSGi
new fcefaae59 TIKA-2063 - Create vorbis bundle
new 5d9db6bec TIKA-2063 - Added Vorbis bundle to bundle parent.
new 8234b96fe TIKA-2061 - Added Adobe BSD license to tika-xmp
new b2a7e382a TIKA-2065 upgrade forbiddenapis
new 164bf52c8 TIKA-2066 upgrade commons-io to 2.5
new 8ff89d419 TIKA-2067 upgrade maven plugin dependencies
new a0f365524 TIKA-2067 upgrade maven plugin dependencies -- revert felix bundle
new b73cd8ce8 TIKA-2074 - ServiceLoader can use Class files loaded via dynamic loading
new d57a85274 TIKA-2070 - Add Encoding Detector and Language Detectors to Dynamic Service Loader
new 587dcb772 TIKA-2072 - Create TikaServiceFactory for creating TikaService
new 7a0280c77 TIKA-2071 - DefaultParser and CompositeParser does not filter excludedParsers from dynamic ServiceLoader Parsers
new f8092d3bd TIKA-2073 - Tika Language Detect Project should include Bundle Activator and packaging consistent with other modules
new f112c88fb TIKA-2075 - Expose Additional TikaService methods
new 4636f95b2 TIKA-1255 and TIKA-2078 -- fix hyperlinks that include formatting and fix hyperlinks with multiple runs in docx
new 443a21e3f TIKA-2064 Mime types, with magic, for mostly-xml Stata DTA files. (Awaiting suitably licensed file for testing)
new e58ade381 TIKA-2064 Test Stata DTA files from Michael Stepner, plus detection unit test
new 9f6241161 Merge changes for TIKA-2064 to 2.x
new ae0cb3059 TIKA 2055 catch exception when totalTime out of unsigned int range in ooxml
new 176f3aded TIKA 2055 catch exception when totalTime out of unsigned int range in ooxml, include test file
new 92453f5e7 GitHub user haisi opened a pull request: https://github.com/apache/tika/pull/132
new 1b32e3186 TIKA-2015 -- upgrade to PDFBox 2.0.3
new 12b1d435b TIKA-2013 -- upgrade to POI 3.15 -- don't forget to close new NPOIFS and MAPIMessage
new 32d9ece8d * Maintain passed-in mime in TXTParser (TIKA-2047).
new 66f433471 TIKA-2069 -- extract macros from MSOffice files.
new d543378a8 TIKA-2069 -- extract macros from MSOffice docs, fix tests to find target metadata object in any order
new 673533d0e TIKA-2093- Add Tesseract's hOCR output format as an option, via Eric Pugh. This commit also catches 2.x up to trunk; there were clearly some other changes to Tesseract that hadn't yet made it into 2.x.
new ce1fc3720 * Re-enable fileUrl for tika-server (TIKA-2081). If you choose, to use this feature, beware of the security vulnerabilities! See: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3271
new bd7208929 * Re-enable fileUrl for tika-server (TIKA-2081). Fix commandline options not to include '-'
new 94789a963 Tika-2095 include Tika version in tika-server's GREETING
new be78c549a Extract PDF DocInfo metadata into separate keys to prevent overwriting by XMP metadata (TIKA-2057).
new 4392681af TIKA-2097 fix npe in mbox parser
new cde4c0aa8 TIKA-2098 small clean up. Test for writelimitreached for each catchable IOException. Many thanks to Alexander Kazakov for finding this and submitting https://github.com/apache/tika/pull/134
new b84fcc584 TIKA-2101 -- don't call MAPIMessage's close()
new 1ab6c81ce TIKA-2106 -- need to lower case hocr/txt suffix, thanks to Eric Pugh. This closes #136
new 1ec8c0947 Tesseract may see the t in haystack as a ! some times...
new af74ea5c9 TIKA-2110-- log full exception throughout tika-batch
new 3fe8ef819 Merge remote-tracking branch 'origin/2.x' into 2.x
new 1e55953bc TIKA-2113-- upgrade metadata-extractor to 2.9.1
new 30e03de89 TIKA-2122: Extract all headers from MSG/RFC822
new 7e66e4979 TIKA-2123: digester fails with multiple digests on large files
new c5f4f5263 TIKA-2127 : npe if there is no notes master)
new 4c3bb1560 TIKA-2133
new 936e3ac16 TIKA-2130
new 4b393a6f9 TIKA-2144 - avoid npe if styles doesn't exist (odd, indeed, but if MSWord can handle it, we should, too).
new a6978521f TIKA-2111 - ExecutableParser should set rather than add a Content-Type
new 6ca74bec6 improve unit test for TIKA-2098
new 2d5189186 TIKA-2157 - handle zip exception in embedded file
new 1d1bc0dd7 TIKA-1933 - clean up one more place where we aren't closing the ForkParser and are leaving behind a tmp ForkParser jar
new 2c9412ab1 TIKA-2171 - upgrade sqlite parser
new bcd59cee7 TIKA-2171 - upgrade sqlite parser
new 7422218eb TIKA-2173 - first steps. Need to integrate parameter configuration into 2.x before I can do the rest
new f2661f997 TIKA-2174 add jpx and jp2 to Tesseract
new ab009aeb7 TIKA-2159 -- first step
new 3f24e6c3e TIKA-2174 -- add ppm and update changes.txt
new 9a68f4ccc TIKA-2174 -- clean up
new 7adfe1cb5 TIKA-2170 allow configuration of timeout for ForkServer
new 7df6fe4be TIKA-2170 fix unit test to allow for different exceptions depending on cause of timeout.
new 8c01e4d8e TIKA-2116 upgrade to POI 3.16-beta1
new 2f452304b Add mime detection and parser for Word 2006ML format (TIKA-2179).
new a47a69933 TIKA-2169 fix xhtml in ocr
new e5e4d4d91 TIKA-2096 change default to extract embedded documents even if the user forgets to specify an AutoDetectParser in the ParseContext
new 1bb7c3384 TIKA-2179 -- add detection and parsing for word2006ml files -- this modification somehow fell to a different change list
new de103c81f TIKA-2096 -- fix example, sorry...
new 32162f59e TIKA 1321 initial commit
new 3d08da79f TIKA-2187 -- make "ignore deleted" as the default in the experimental SAX .docx parser and update the WordExtractor to include extraction of deleted text if requested by the user.
new 300100fcb TIKA-2090: Allow extraction of PDActions (including Javascript) from PDFs (TIKA-2090).
new d8853fe31 Update to PDFBox 2.0.4
new 4f04b6c3e TIKA-2218 -- add a new new locations within a pptx to check for embedded objects
new ffb25af1b Merge remote-tracking branch 'origin/2.x' into 2.x
new ee761ac00 TIKA-2221 -- correctly catch and rethrow encrypted document exception as EncryptedDocumentException in WordExtractor via Matthew Caruana Galizia
new 68f305864 TIKA-2219 make sure to transmit charset name in detectAll via Pascal Essiembre
new 54154e004 TIKA-2219 make sure to transmit charset name in detectAll via Pascal Essiembre -- fix test method to get inputstream from zip
new 0d30aa1b2 TIKA 2190 -- add configurability for preserve interword spacing
new 50c1dc69d update OCR config to include default for output type
new c9fcb3315 TIKA-2211 modify test file to include style information to test that we're excluding it.
new 337d38304 TIKA-2211 -- make sure that head (<style>) content isn't showing up in body in the EpubParser
new 4383e3da7 TIKA-1946 -- initial commit to add parsers for WordPerfect and QuattroPro. Many thanks to Pascal Essiembre for contributing these!!!
new 39cf35551 TIKA_2226 add exception for unsupported formats
new d8fa3c2a8 TIKA-1946 updates, detection of wordperfect 5.0 and 5.1 as well as quattropro 7-8 vs quattropro 9
new bb76d986a TIKA-2224 Mime magic for OneNote
new db21ee158 TIKA-2224 Mime sub-entry for .onepkg, a cab file holding other onenote files
new 71584b2de TIKA-2224 We now differ from HTTPD on onenote formats, as we have subtypes they lack
new cdb6456bb TIKA-2224 Test OneNote file from Krishnan Narayan plus unit test
new 785e47413 Manually merge changelog
new 4e3534da0 Move new test file to the 2.x location
new f1a541378 TIKA-2190 -- Add test file for maintain spacing
new f0863bcea Merge remote-tracking branch 'origin/2.x' into 2.x
new aaa661e25 TIKA-2228 from Pascal Essiembre and TIKA-2230.
new 850de1467 TIKA-2234 get rid of ThreadLocal
new c14e75070 TIKA-2235 -- bump default dpi for images created via PDF for OCR to 300 dpi via Matthew Caruana Galizia
new 0bc9bd896 TIKA-2232 -- add processing of jbig2 (with necessary non ASL 2.0 libs) via Pascal Essiembre
new e02084cc6 TIKA-2192
new 2d908d59b TIKA-2237
new 681615731 TIKA-2210 -- add experimental SAX parser for pptx and update (also TIKA-2191 and TIKA-2220)
new 28b53bd4d TIKA 2159 handle preparse/embedded IO exceptions uniformly
new ce4e7e7d9 TIKA 2134 -- handle missing parts more robustly
new cd98c4cf3 TIKA-2238 add mime detection for embedded MSEquation files
new dd70fd33a TIKA-2232 -- ImageParser shouldn't allege that it can handle jbig2 when jbig2 library is not on class path
new 45a9b77d6 TIKA-2241 Add new config dumping option STATIC_FULL which lists all supported+active mime types for parsers
new 4374bcecf TIKA-2242 fix style markup in ODT
new 9dbff6065 TIKA-2231 - allow underscored language codes (e.g. "chi_tra") in TesseractOCRParser via Graham Russell.
new 161b122ba TIKA-2240 -- improve mime detection for .wri files
new 8d783d27a TIKA-2232 -- log/warn if jbig2 is not on classpath
new 78828176a TIKA-2244 -- be more parsimonious with BufferedInputStream via Josh Hight.
new 58d56c33f TIKA-2250 As of RFC7903, the official mime type for BMP is now the one without the x- prefix
new 6668d78fa TIKA-2250 As of RFC7903, the official mime type for WMF is now an image one and without the x- prefix
new bd667acde TIKA-2250 As of RFC7903, the official mime type for EMF is now an image one and without the x- prefix
new 985c1aef8 TIKA 2244 -- be more parsimonious with BufferedInputStream. AutoDetectReader
new 4599374d6 Merge remote-tracking branch 'origin/2.x' into 2.x
new 235c2adab TIKA-2249 -- update javadocs to alert devs that tables are not "maintained" by the PDFParser
new 3df8ce8b2 TIKA-2251 improve exception handling in SAX pptx/docx parsers
new 6287b75b5 TIKA-2255 Test SAS files
new 4d8feaee5 Move to Tika 2.x location
new a79de0ccf TIKA-2255 Magic for older sas data files
new 534a52598 TIKA-2255 Mime detection unit tests for SAS files
new 28010d90d Mimetype for SAS Xport (XPT) files
new 2d4889f44 TIKA 2025 -- fix xls/x testBigIntegersWGeneralFormat to work in multiple locales
new 7b0655cc1 TIKA 2259 -- improve url extraction from PDFs = copy Tilman Hausherr's code from PDFBOX 3644
new cf3996ed0 TIKA 2181 upgrade to POI 3 16 beta2
new 27e81b97a TIKA-2181 upgrade to POI 3 16 beta2, make sure to upgrade overall bundle
new 0d7f5bad0 TIKA-2198 - add null check to Tika after upgrade to POI 3.16-beta2
new d9f376c12 TIKA-2134 - remove npe catch after upgrade to POI 3.16.beta2
new 6bfe5d565 TIKA-2246 and TIKA-2247 -add parsers for EMF and WMF
new 5e49c3308 TIKA-1332 initial commit of tika-eval. More work remains.
new 69dd0328b TIKA-1332 fix one profiler report and whitespace
new 0d04b499a TIKA-1332 downgrade to Lucene 5.x so that this can run w/ Java 7
new 44612ae40 TIKA-1332 fix pom for 2.0
new 61532258f TIKA-1332 3rd time's the charm. Fix dependencies with IOUtils.
new 81150859b TIKA-1332 -- add English Spanish common tokens; fix logging
new 544ba9752 TIKA-2267 -- add common tokens for some languages into tika-eval
new 824d176c9 TIKA-2269 -- fix potential NPE in FeedParser via Julien Nioche.
new 0ce764915 TIKA-2275
new 4ebc441bd TIKA-2276 -- pass through TikaConfig if not specified via ParseContext in AutoDetectParser
new 35756b142 TIKA-2276 try to reuse parsers from ParseContext rather than creating own
new a279d039d TIKA-2278 clean up extract exception handling
new b2a462c6d TIKA 2276 -- cleanup
new 6dcad8896 TIKA-2273 -- improve configuration of encoding detectors. TODO: figure out loading in tika-app bundle and turn tests back on.
new 5925bcb58 TIKA-2279 - simplify token counting
new 82509f32c TIKA-1857 xfa fix
new d0ebfda73 fix tika-eval bug - include child file extension instead of parent
new 4843ca157 TIKA-2286
new 81f1591fe TIKA-2285 -- triggering file didn't actually trigger string index out of bounds exception, but there could be one with a null or very short styleName
new 24160a1c0 TIKA-2281 add mapi message type
new f70ea7a8f TIKA-1865 -- step 1 split out sender name from sender email exchange info where possible in msg files
new 0274a2816 TIKA-1865 step 2 the other parsers 1
new 2ebc90a5c TIKA-2281 applied to PSTParser
new a12cae48f TIKA-1865 bug fix
new 70895fcd9 TIKA-1865 clean up, deduplicate MailUtil, bug fix
new 875c3a151 TIKA-2287 -- add jdbc
new 5719bf788 TIKA-2287 -- bug fix, improve handling when ref tables already exist
new 380af5b32 TIKA-2290 -- fix bug that prevents passing of ocr strategy via headers in tika-server
new 15e22679f TIKA-1879 -- add more granularity to recipients in Outlook/PST emails
new da2dce946 TIKA-2242 -- fix handling of annotations and <p> within a <p> in odt.
new 93cb9717e TIKA-2295 -- extract images from odt
new 7344209a1 clean up from sax docx work
new 51cc80d24 TIKA-2236 upgrade PDFBox to 2.0.5 and JempBox to 1.8.13
new 4ed7fccc3 TIKA 2287 -- bug fixes
new 77f25f2e7 clean up unit tests
new 29d7d7ceb TIKA-2300 record streams that can't be read via pkg's metadata via Aeham Abushwashi
new fcccda6cc TIKA-2307
new e3fead445 TIKA-2307 -- include finer grained supported types so that users can control includes/excludes with decorator via config
new 2df5c536b TIKA-1772 More WebVTT magic - for cases with no header, and with custom headers
new e34498bbe TIKA-1772 More test WebVTT files - no text header, and a custom one
new d12c87b6d Merge 3c02c4b to the new 2.x test documents area
new 78c31eb61 TIKA-1772 More WebVTT unit tests
new f87948d28 Merge changelog update
new 1826112e6 TIKA-2302 -- make macro extraction configurable and set default to false
new 3e925166a Merge remote-tracking branch 'origin/2.x' into 2.x
new 747b121fd Update mailing list archive links
new 363675554 Bumped junit and slf4j versions
new d8e4b5f6e Added explicit test scope for junit
new 67a5e91b2 TIKA-2317 warn user if max content length is hit; allow for easier parameterization by commandline
new c4888d59e Merge remote-tracking branch 'origin/2.x' into 2.x
new 96a8ddd84 TIKA-2318
new 37b8864ed TIKA-2319
new fce6626f2 TIKA-2319 follow up
new 6b9e36e3f TIKA-2323
new 110247fcf turn off debug statement
new d2907f41a TIKA-2325
new 143efc8d9 TIKA-2311 -- maintain x-tika-ooxml mime type for truncated ooxml
new 870ec187e In rare cases, elapsed can == 3000 exactly. Fix this.
new a847a863d TIKA-1195 and TIKA-2329, upgrade to POI 3.16-final and add xlsb parser
new 73147a239 update javadoc for Latin1StringsParser
new 51190df6e TIKA-2339 - remove test file that was identified by one av program as potentially contain MDropper. We assess this as a false positive, but we've chosen to remove the file to allow users with this av program to build Tika.
new 3743e4d67 TIKA-2309 Time Stamped Data Envelope parser
new e7ad4ec15 TIKA-2309 fixed tika-parser-crypto-bundle IT
new c67e62236 TIKA-2349 -- try to match embedded docs by digest in tika-eval "Compare"
new 7c4258917 Merge remote-tracking branch 'origin/2.x' into 2.x
new 4e1e87ff2 TIKA-2348 -- include caught exception in EMF/WMF rethrows
new 6930ff025 TIKA-2311 -- try OPC before ZipFile. This can work better on some truncated files.
new 62e5a8477 TIKA-2350
new babb2534e TIKA-2352 -- bug fix for WordPerfect parser via Pascal Essiembre. Pull request 176.
new 9ef078778 TIKA 2343 -- add text-main/boilerpipe option to tika-server
new fe3971a69 TIKA-2352 -- bug fix for WordPerfect parser via Pascal Essiembre. Pull request 176. Split to different change list...argh.
new 21bcc5595 TIKA-2343 -- change put to post for multipart
new 0a55b4a4e TIKA-2354 -- .doc is missing many pictures
The 251 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.