Posted to commits@tika.apache.org by ta...@apache.org on 2023/09/13 15:21:50 UTC
[tika] branch 2.x deleted (was 0a55b4a4e)
This is an automated email from the ASF dual-hosted git repository.
tallison pushed a change to branch 2.x
in repository https://gitbox.apache.org/repos/asf/tika.git
was 0a55b4a4e TIKA-2354 -- .doc is missing many pictures
This change permanently discards the following revisions:
discard 0a55b4a4e TIKA-2354 -- .doc is missing many pictures
discard 21bcc5595 TIKA-2343 -- change put to post for multipart
discard fe3971a69 TIKA-2352 -- bug fix for WordPerfect parser via Pascal Essiembre. Pull request 176. Split to different change list...argh.
discard 9ef078778 TIKA 2343 -- add text-main/boilerpipe option to tika-server
discard babb2534e TIKA-2352 -- bug fix for WordPerfect parser via Pascal Essiembre. Pull request 176.
discard 62e5a8477 TIKA-2350
discard 6930ff025 TIKA-2311 -- try OPC before ZipFile. This can work better on some truncated files.
discard 4e1e87ff2 TIKA-2348 -- include caught exception in EMF/WMF rethrows
discard 7c4258917 Merge remote-tracking branch 'origin/2.x' into 2.x
discard c67e62236 TIKA-2349 -- try to match embedded docs by digest in tika-eval "Compare"
discard e7ad4ec15 TIKA-2309 fixed tika-parser-crypto-bundle IT
discard 3743e4d67 TIKA-2309 Time Stamped Data Envelope parser
discard 51190df6e TIKA-2339 - remove test file that was identified by one av program as potentially contain MDropper. We assess this as a false positive, but we've chosen to remove the file to allow users with this av program to build Tika.
discard 73147a239 update javadoc for Latin1StringsParser
discard a847a863d TIKA-1195 and TIKA-2329, upgrade to POI 3.16-final and add xlsb parser
discard 870ec187e In rare cases, elapsed can == 3000 exactly. Fix this.
discard 143efc8d9 TIKA-2311 -- maintain x-tika-ooxml mime type for truncated ooxml
discard d2907f41a TIKA-2325
discard 110247fcf turn off debug statement
discard 6b9e36e3f TIKA-2323
discard fce6626f2 TIKA-2319 follow up
discard 37b8864ed TIKA-2319
discard 96a8ddd84 TIKA-2318
discard c4888d59e Merge remote-tracking branch 'origin/2.x' into 2.x
discard 67a5e91b2 TIKA-2317 warn user if max content length is hit; allow for easier parameterization by commandline
discard d8e4b5f6e Added explicit test scope for junit
discard 363675554 Bumped junit and slf4j versions
discard 747b121fd Update mailing list archive links
discard 3e925166a Merge remote-tracking branch 'origin/2.x' into 2.x
discard 1826112e6 TIKA-2302 -- make macro extraction configurable and set default to false
discard f87948d28 Merge changelog update
discard 78c31eb61 TIKA-1772 More WebVTT unit tests
discard d12c87b6d Merge 3c02c4b to the new 2.x test documents area
discard e34498bbe TIKA-1772 More test WebVTT files - no text header, and a custom one
discard 2df5c536b TIKA-1772 More WebVTT magic - for cases with no header, and with custom headers
discard e3fead445 TIKA-2307 -- include finer grained supported types so that users can control includes/excludes with decorator via config
discard fcccda6cc TIKA-2307
discard 29d7d7ceb TIKA-2300 record streams that can't be read via pkg's metadata via Aeham Abushwashi
discard 77f25f2e7 clean up unit tests
discard 4ed7fccc3 TIKA 2287 -- bug fixes
discard 51cc80d24 TIKA-2236 upgrade PDFBox to 2.0.5 and JempBox to 1.8.13
discard 7344209a1 clean up from sax docx work
discard 93cb9717e TIKA-2295 -- extract images from odt
discard da2dce946 TIKA-2242 -- fix handling of annotations and <p> within a <p> in odt.
discard 15e22679f TIKA-1879 -- add more granularity to recipients in Outlook/PST emails
discard 380af5b32 TIKA-2290 -- fix bug that prevents passing of ocr strategy via headers in tika-server
discard 5719bf788 TIKA-2287 -- bug fix, improve handling when ref tables already exist
discard 875c3a151 TIKA-2287 -- add jdbc
discard 70895fcd9 TIKA-1865 clean up, deduplicate MailUtil, bug fix
discard a12cae48f TIKA-1865 bug fix
discard 2ebc90a5c TIKA-2281 applied to PSTParser
discard 0274a2816 TIKA-1865 step 2 the other parsers 1
discard f70ea7a8f TIKA-1865 -- step 1 split out sender name from sender email exchange info where possible in msg files
discard 24160a1c0 TIKA-2281 add mapi message type
discard 81f1591fe TIKA-2285 -- triggering file didn't actually trigger string index out of bounds exception, but there could be one with a null or very short styleName
discard 4843ca157 TIKA-2286
discard d0ebfda73 fix tika-eval bug - include child file extension instead of parent
discard 82509f32c TIKA-1857 xfa fix
discard 5925bcb58 TIKA-2279 - simplify token counting
discard 6dcad8896 TIKA-2273 -- improve configuration of encoding detectors. TODO: figure out loading in tika-app bundle and turn tests back on.
discard b2a462c6d TIKA 2276 -- cleanup
discard a279d039d TIKA-2278 clean up extract exception handling
discard 35756b142 TIKA-2276 try to reuse parsers from ParseContext rather than creating own
discard 4ebc441bd TIKA-2276 -- pass through TikaConfig if not specified via ParseContext in AutoDetectParser
discard 0ce764915 TIKA-2275
discard 824d176c9 TIKA-2269 -- fix potential NPE in FeedParser via Julien Nioche.
discard 544ba9752 TIKA-2267 -- add common tokens for some languages into tika-eval
discard 81150859b TIKA-1332 -- add English Spanish common tokens; fix logging
discard 61532258f TIKA-1332 3rd time's the charm. Fix dependencies with IOUtils.
discard 44612ae40 TIKA-1332 fix pom for 2.0
discard 0d04b499a TIKA-1332 downgrade to Lucene 5.x so that this can run w/ Java 7
discard 69dd0328b TIKA-1332 fix one profiler report and whitespace
discard 5e49c3308 TIKA-1332 initial commit of tika-eval. More work remains.
discard 6bfe5d565 TIKA-2246 and TIKA-2247 -add parsers for EMF and WMF
discard d9f376c12 TIKA-2134 - remove npe catch after upgrade to POI 3.16.beta2
discard 0d7f5bad0 TIKA-2198 - add null check to Tika after upgrade to POI 3.16-beta2
discard 27e81b97a TIKA-2181 upgrade to POI 3 16 beta2, make sure to upgrade overall bundle
discard cf3996ed0 TIKA 2181 upgrade to POI 3 16 beta2
discard 7b0655cc1 TIKA 2259 -- improve url extraction from PDFs = copy Tilman Hausherr's code from PDFBOX 3644
discard 2d4889f44 TIKA 2025 -- fix xls/x testBigIntegersWGeneralFormat to work in multiple locales
discard 28010d90d Mimetype for SAS Xport (XPT) files
discard 534a52598 TIKA-2255 Mime detection unit tests for SAS files
discard a79de0ccf TIKA-2255 Magic for older sas data files
discard 4d8feaee5 Move to Tika 2.x location
discard 6287b75b5 TIKA-2255 Test SAS files
discard 3df8ce8b2 TIKA-2251 improve exception handling in SAX pptx/docx parsers
discard 235c2adab TIKA-2249 -- update javadocs to alert devs that tables are not "maintained" by the PDFParser
discard 4599374d6 Merge remote-tracking branch 'origin/2.x' into 2.x
discard 985c1aef8 TIKA 2244 -- be more parsimonious with BufferedInputStream. AutoDetectReader
discard bd667acde TIKA-2250 As of RFC7903, the official mime type for EMF is now an image one and without the x- prefix
discard 6668d78fa TIKA-2250 As of RFC7903, the official mime type for WMF is now an image one and without the x- prefix
discard 58d56c33f TIKA-2250 As of RFC7903, the official mime type for BMP is now the one without the x- prefix
discard 78828176a TIKA-2244 -- be more parsimonious with BufferedInputStream via Josh Hight.
discard 8d783d27a TIKA-2232 -- log/warn if jbig2 is not on classpath
discard 161b122ba TIKA-2240 -- improve mime detection for .wri files
discard 9dbff6065 TIKA-2231 - allow underscored language codes (e.g. "chi_tra") in TesseractOCRParser via Graham Russell.
discard 4374bcecf TIKA-2242 fix style markup in ODT
discard 45a9b77d6 TIKA-2241 Add new config dumping option STATIC_FULL which lists all supported+active mime types for parsers
discard dd70fd33a TIKA-2232 -- ImageParser shouldn't allege that it can handle jbig2 when jbig2 library is not on class path
discard cd98c4cf3 TIKA-2238 add mime detection for embedded MSEquation files
discard ce4e7e7d9 TIKA 2134 -- handle missing parts more robustly
discard 28b53bd4d TIKA 2159 handle preparse/embedded IO exceptions uniformly
discard 681615731 TIKA-2210 -- add experimental SAX parser for pptx and update (also TIKA-2191 and TIKA-2220)
discard 2d908d59b TIKA-2237
discard e02084cc6 TIKA-2192
discard 0bc9bd896 TIKA-2232 -- add processing of jbig2 (with necessary non ASL 2.0 libs) via Pascal Essiembre
discard c14e75070 TIKA-2235 -- bump default dpi for images created via PDF for OCR to 300 dpi via Matthew Caruana Galizia
discard 850de1467 TIKA-2234 get rid of ThreadLocal
discard aaa661e25 TIKA-2228 from Pascal Essiembre and TIKA-2230.
discard f0863bcea Merge remote-tracking branch 'origin/2.x' into 2.x
discard f1a541378 TIKA-2190 -- Add test file for maintain spacing
discard 4e3534da0 Move new test file to the 2.x location
discard 785e47413 Manually merge changelog
discard cdb6456bb TIKA-2224 Test OneNote file from Krishnan Narayan plus unit test
discard 71584b2de TIKA-2224 We now differ from HTTPD on onenote formats, as we have subtypes they lack
discard db21ee158 TIKA-2224 Mime sub-entry for .onepkg, a cab file holding other onenote files
discard bb76d986a TIKA-2224 Mime magic for OneNote
discard d8fa3c2a8 TIKA-1946 updates, detection of wordperfect 5.0 and 5.1 as well as quattropro 7-8 vs quattropro 9
discard 39cf35551 TIKA_2226 add exception for unsupported formats
discard 4383e3da7 TIKA-1946 -- initial commit to add parsers for WordPerfect and QuattroPro. Many thanks to Pascal Essiembre for contributing these!!!
discard 337d38304 TIKA-2211 -- make sure that head (<style>) content isn't showing up in body in the EpubParser
discard c9fcb3315 TIKA-2211 modify test file to include style information to test that we're excluding it.
discard 50c1dc69d update OCR config to include default for output type
discard 0d30aa1b2 TIKA 2190 -- add configurability for preserve interword spacing
discard 54154e004 TIKA-2219 make sure to transmit charset name in detectAll via Pascal Essiembre -- fix test method to get inputstream from zip
discard 68f305864 TIKA-2219 make sure to transmit charset name in detectAll via Pascal Essiembre
discard ee761ac00 TIKA-2221 -- correctly catch and rethrow encrypted document exception as EncryptedDocumentException in WordExtractor via Matthew Caruana Galizia
discard ffb25af1b Merge remote-tracking branch 'origin/2.x' into 2.x
discard 4f04b6c3e TIKA-2218 -- add a new new locations within a pptx to check for embedded objects
discard d8853fe31 Update to PDFBox 2.0.4
discard 300100fcb TIKA-2090: Allow extraction of PDActions (including Javascript) from PDFs (TIKA-2090).
discard 3d08da79f TIKA-2187 -- make "ignore deleted" as the default in the experimental SAX .docx parser and update the WordExtractor to include extraction of deleted text if requested by the user.
discard 32162f59e TIKA 1321 initial commit
discard de103c81f TIKA-2096 -- fix example, sorry...
discard 1bb7c3384 TIKA-2179 -- add detection and parsing for word2006ml files -- this modification somehow fell to a different change list
discard e5e4d4d91 TIKA-2096 change default to extract embedded documents even if the user forgets to specify an AutoDetectParser in the ParseContext
discard a47a69933 TIKA-2169 fix xhtml in ocr
discard 2f452304b Add mime detection and parser for Word 2006ML format (TIKA-2179).
discard 8c01e4d8e TIKA-2116 upgrade to POI 3.16-beta1
discard 7df6fe4be TIKA-2170 fix unit test to allow for different exceptions depending on cause of timeout.
discard 7adfe1cb5 TIKA-2170 allow configuration of timeout for ForkServer
discard 9a68f4ccc TIKA-2174 -- clean up
discard 3f24e6c3e TIKA-2174 -- add ppm and update changes.txt
discard ab009aeb7 TIKA-2159 -- first step
discard f2661f997 TIKA-2174 add jpx and jp2 to Tesseract
discard 7422218eb TIKA-2173 - first steps. Need to integrate parameter configuration into 2.x before I can do the rest
discard bcd59cee7 TIKA-2171 - upgrade sqlite parser
discard 2c9412ab1 TIKA-2171 - upgrade sqlite parser
discard 1d1bc0dd7 TIKA-1933 - clean up one more place where we aren't closing the ForkParser and are leaving behind a tmp ForkParser jar
discard 2d5189186 TIKA-2157 - handle zip exception in embedded file
discard 6ca74bec6 improve unit test for TIKA-2098
discard a6978521f TIKA-2111 - ExecutableParser should set rather than add a Content-Type
discard 4b393a6f9 TIKA-2144 - avoid npe if styles doesn't exist (odd, indeed, but if MSWord can handle it, we should, too).
discard 936e3ac16 TIKA-2130
discard 4c3bb1560 TIKA-2133
discard c5f4f5263 TIKA-2127 : npe if there is no notes master)
discard 7e66e4979 TIKA-2123: digester fails with multiple digests on large files
discard 30e03de89 TIKA-2122: Extract all headers from MSG/RFC822
discard 1e55953bc TIKA-2113-- upgrade metadata-extractor to 2.9.1
discard 3fe8ef819 Merge remote-tracking branch 'origin/2.x' into 2.x
discard af74ea5c9 TIKA-2110-- log full exception throughout tika-batch
discard 1ec8c0947 Tesseract may see the t in haystack as a ! some times...
discard 1ab6c81ce TIKA-2106 -- need to lower case hocr/txt suffix, thanks to Eric Pugh. This closes #136
discard b84fcc584 TIKA-2101 -- don't call MAPIMessage's close()
discard cde4c0aa8 TIKA-2098 small clean up. Test for writelimitreached for each catchable IOException. Many thanks to Alexander Kazakov for finding this and submitting https://github.com/apache/tika/pull/134
discard 4392681af TIKA-2097 fix npe in mbox parser
discard be78c549a Extract PDF DocInfo metadata into separate keys to prevent overwriting by XMP metadata (TIKA-2057).
discard 94789a963 Tika-2095 include Tika version in tika-server's GREETING
discard bd7208929 * Re-enable fileUrl for tika-server (TIKA-2081). Fix commandline options not to include '-'
discard ce1fc3720 * Re-enable fileUrl for tika-server (TIKA-2081). If you choose, to use this feature, beware of the security vulnerabilities! See: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3271
discard 673533d0e TIKA-2093- Add Tesseract's hOCR output format as an option, via Eric Pugh. This commit also catches 2.x up to trunk; there were clearly some other changes to Tesseract that hadn't yet made it into 2.x.
discard d543378a8 TIKA-2069 -- extract macros from MSOffice docs, fix tests to find target metadata object in any order
discard 66f433471 TIKA-2069 -- extract macros from MSOffice files.
discard 32d9ece8d * Maintain passed-in mime in TXTParser (TIKA-2047).
discard 12b1d435b TIKA-2013 -- upgrade to POI 3.15 -- don't forget to close new NPOIFS and MAPIMessage
discard 1b32e3186 TIKA-2015 -- upgrade to PDFBox 2.0.3
discard 92453f5e7 GitHub user haisi opened a pull request: https://github.com/apache/tika/pull/132
discard 176f3aded TIKA 2055 catch exception when totalTime out of unsigned int range in ooxml, include test file
discard ae0cb3059 TIKA 2055 catch exception when totalTime out of unsigned int range in ooxml
discard 9f6241161 Merge changes for TIKA-2064 to 2.x
discard e58ade381 TIKA-2064 Test Stata DTA files from Michael Stepner, plus detection unit test
discard 443a21e3f TIKA-2064 Mime types, with magic, for mostly-xml Stata DTA files. (Awaiting suitably licensed file for testing)
discard 4636f95b2 TIKA-1255 and TIKA-2078 -- fix hyperlinks that include formatting and fix hyperlinks with multiple runs in docx
discard f112c88fb TIKA-2075 - Expose Additional TikaService methods
discard f8092d3bd TIKA-2073 - Tika Language Detect Project should include Bundle Activator and packaging consistent with other modules
discard 7a0280c77 TIKA-2071 - DefaultParser and CompositeParser does not filter excludedParsers from dynamic ServiceLoader Parsers
discard 587dcb772 TIKA-2072 - Create TikaServiceFactory for creating TikaService
discard d57a85274 TIKA-2070 - Add Encoding Detector and Language Detectors to Dynamic Service Loader
discard b73cd8ce8 TIKA-2074 - ServiceLoader can use Class files loaded via dynamic loading
discard a0f365524 TIKA-2067 upgrade maven plugin dependencies -- revert felix bundle
discard 8ff89d419 TIKA-2067 upgrade maven plugin dependencies
discard 164bf52c8 TIKA-2066 upgrade commons-io to 2.5
discard b2a7e382a TIKA-2065 upgrade forbiddenapis
discard 8234b96fe TIKA-2061 - Added Adobe BSD license to tika-xmp
discard 5d9db6bec TIKA-2063 - Added Vorbis bundle to bundle parent.
discard fcefaae59 TIKA-2063 - Create vorbis bundle
discard dc841e6ba TIKA-2060 - Added toggle to ClassLoaderUtils for OSGi
discard cebf72382 TIKA-2062 - Remove bouncy castle inlining in bundles
discard 4704d976c TIKA-2061 - Embed xmpcore in tika-xmp since it is not a proper bundle
discard 59e0ca0fc TIKA-2059 - Merge multimedia and pdf parser modules and bundles
discard 87b6d5d7d TIKA-2007 upgrade jackson, needed to update CachedTranslator (diff btwn trunk and 2.x)
discard db513d6ad TIKA-2007 upgrade jackson
discard 27bc383eb TIKA-1980 via Joseph Naegele
discard 09bd22fb4 TIKA-1938 via Joseph Naegele
discard 5358bf1e1 TIKA-1938 via Joseph Naegele
discard b41c0b2a8 TIKA-2041 - add important diffs between new copy/paste from ICU4J and legacy code which may have included Tika-specific mods.
discard 6ebbd7ef7 cleanup MatParser
discard fc7c372f5 TIKA-2048
discard 1c582aba6 TIKA-2040 - prevent permanent hang/oom on corrupt chm file
discard 9f6c71fa6 TIKA-2041, upgrade ICU4j's charset detector to avoid multithreading bug.
discard f89887d2f TIKA-2037 Merge fixes for 2.x
discard 53310facc Changelog update
discard 65cc9bcec TIKA-2042 MBOX magic and detection unit test
discard 31374a39b TIKA-2037 RFC822Parser should wrap the James InputStream of embedded resources to avoid problems with downstream detection or extraction
discard d6ce10b41 Email with attachment for testing extraction issues
discard 8b951a43c TIKA-2039 upgrade to jackcess 2.1.4
discard f4bacf859 TIKA-2025 increase number of significant digits extracted in "general" format in xls/xlsx
discard e27526b84 TIKA-2030 - fix test file so that it is correctly detected
discard cdfacdb41 Merge remote-tracking branch 'origin/2.x' into 2.x
discard 87e1e23b4 TIKA-2030 - add handling for <text:s/> element to ODT parser. Thanks to David Pilato for opening this issue.
discard 573527bbc Merge branch '2.x' of https://git-wip-us.apache.org/repos/asf/tika into 2.x
discard 2a7e52ec4 fix getRecursiveJson -> getRecursiveMetadata in TikaTest, no json is involved here...not sure why Intellij didn't catch this one. sorry.
discard 2eb4804d1 fix getRecursiveJson -> getRecursiveMetadata in TikaTest, no json is involved here...
discard 4678d6733 TIKA-2024 extract original path name from OLE1.0 embedded objects
discard c7a6bcac4 Convert new lines from windows to unix
discard dd3c2a486 TIKA-2026 -- improve extraction of attachments for PPT, PPTX, XLSX
discard 933af20e8 rm inconsistently capitalized test files
discard e62f23057 TIKA-2024 extract original file name/path where possible, take 1
discard c84855f67 TIKA-2022 - clean up -- make entries private, move more into EndianUtils
discard 865c45cd5 fix indentation
discard 5bc597dc8 TIKA-2023 -- clean up RTFParser to use EndianUtils and IOUtils.readFully
discard b14b47e76 TIKA-2022 -- add parser for applefile
discard cd12917fa TIKA-2020 -- remove 3 parameter parse() and simplify CAD tests
discard 0c71b2ffc TIKA-2020, remove 3 parameter parse() and simplify CAD tests
discard 6bb6827e0 add startDocument and endDocument() to PRTParser so that it works with the ToXMLHandler
discard 767442614 fix indents and whitespace
discard 1ce93ed9e TIKA-2019 -- fix WordMLParser and SpreadsheetMLParser
discard 2f5537380 TIKA-2009 -- add detection for Endnote Import files
discard b600b6701 make sure to test magic for vcs/ics/asx
discard 73ce7681c TIKA-2009 -- add magic for djvu
discard b3bf5141b TIKA-2008 -- change metadata key to TikaCoreProperties.MODIFIER
discard 60d4e3ff2 TIKA-2008 -- add mime definition and parser for MSOwnerFile
discard ffaa4deaa TIKA-2004 -- add mime definitions for Windows Media Metafile
discard f90193aa0 TIKA-2006 -- add mime definitions for ical and vcal
discard b480d43f5 TIKA-1996 -- Upgrade to PDFBox 2.0.2
discard ac52e5c15 TIKA-1999: fix setter, update changes.txt
discard 89062edb0 TIKA-1999: add configurable limit to number of events extracted in XMPMM history.
discard ebe702898 TIKA-1994 -- Integrate TesseractOCR with full page image rendering for PDFs
discard e5a7604bc TIKA-1992 -- check for duplicate inline images by COSStream not object name.
discard e05dd5bf4 TIKA-1990 -- need to add JPEG filters to embedded stream when handling embedded jpegs in PDFParser
discard b1c00c050 TIKA-1985 -- ignore test until we get permission to use test file