Posted to commits@tika.apache.org by ta...@apache.org on 2023/09/13 15:21:50 UTC

[tika] branch 2.x deleted (was 0a55b4a4e)

This is an automated email from the ASF dual-hosted git repository.

tallison pushed a change to branch 2.x
in repository https://gitbox.apache.org/repos/asf/tika.git


     was 0a55b4a4e TIKA-2354 -- .doc is missing many pictures

This change permanently discards the following revisions:

 discard 0a55b4a4e TIKA-2354 -- .doc is missing many pictures
 discard 21bcc5595 TIKA-2343 -- change put to post for multipart
 discard fe3971a69 TIKA-2352 -- bug fix for WordPerfect parser via Pascal Essiembre. Pull request 176. Split to different change list...argh.
 discard 9ef078778 TIKA 2343 -- add text-main/boilerpipe option to tika-server
 discard babb2534e TIKA-2352 -- bug fix for WordPerfect parser via Pascal Essiembre. Pull request 176.
 discard 62e5a8477 TIKA-2350
 discard 6930ff025 TIKA-2311 -- try OPC before ZipFile.  This can work better on some truncated files.
 discard 4e1e87ff2 TIKA-2348 -- include caught exception in EMF/WMF rethrows
 discard 7c4258917 Merge remote-tracking branch 'origin/2.x' into 2.x
 discard c67e62236 TIKA-2349 -- try to match embedded docs by digest in tika-eval "Compare"
 discard e7ad4ec15 TIKA-2309 fixed tika-parser-crypto-bundle IT
 discard 3743e4d67 TIKA-2309 Time Stamped Data Envelope parser
 discard 51190df6e TIKA-2339 - remove test file that was identified by one av program as potentially contain MDropper.  We assess this as a false positive, but we've chosen to remove the file to allow users with this av program to build Tika.
 discard 73147a239 update javadoc for Latin1StringsParser
 discard a847a863d TIKA-1195 and TIKA-2329, upgrade to POI 3.16-final and add xlsb parser
 discard 870ec187e In rare cases, elapsed can == 3000 exactly.  Fix this.
 discard 143efc8d9 TIKA-2311 -- maintain x-tika-ooxml mime type for truncated ooxml
 discard d2907f41a TIKA-2325
 discard 110247fcf turn off debug statement
 discard 6b9e36e3f TIKA-2323
 discard fce6626f2 TIKA-2319 follow up
 discard 37b8864ed TIKA-2319
 discard 96a8ddd84 TIKA-2318
 discard c4888d59e Merge remote-tracking branch 'origin/2.x' into 2.x
 discard 67a5e91b2 TIKA-2317 warn user if max content length is hit; allow for easier parameterization by commandline
 discard d8e4b5f6e Added explicit test scope for junit
 discard 363675554 Bumped junit and slf4j versions
 discard 747b121fd Update mailing list archive links
 discard 3e925166a Merge remote-tracking branch 'origin/2.x' into 2.x
 discard 1826112e6 TIKA-2302 -- make macro extraction configurable and set default to false
 discard f87948d28 Merge changelog update
 discard 78c31eb61 TIKA-1772 More WebVTT unit tests
 discard d12c87b6d Merge 3c02c4b to the new 2.x test documents area
 discard e34498bbe TIKA-1772 More test WebVTT files - no text header, and a custom one
 discard 2df5c536b TIKA-1772 More WebVTT magic - for cases with no header, and with custom headers
 discard e3fead445 TIKA-2307 -- include finer grained supported types so that users can control includes/excludes with decorator via config
 discard fcccda6cc TIKA-2307
 discard 29d7d7ceb TIKA-2300 record streams that can't be read via pkg's metadata via Aeham Abushwashi
 discard 77f25f2e7 clean up unit tests
 discard 4ed7fccc3 TIKA 2287  -- bug fixes
 discard 51cc80d24 TIKA-2236 upgrade PDFBox to 2.0.5 and JempBox to 1.8.13
 discard 7344209a1 clean up from sax docx work
 discard 93cb9717e TIKA-2295 -- extract images from odt
 discard da2dce946 TIKA-2242 -- fix handling of annotations and <p> within a <p> in odt.
 discard 15e22679f TIKA-1879 -- add more granularity to recipients in Outlook/PST emails
 discard 380af5b32 TIKA-2290 -- fix bug that prevents passing of ocr strategy via headers in tika-server
 discard 5719bf788 TIKA-2287 --   bug fix, improve handling when ref tables already exist
 discard 875c3a151 TIKA-2287 --   add jdbc
 discard 70895fcd9 TIKA-1865 clean up, deduplicate MailUtil, bug fix
 discard a12cae48f TIKA-1865 bug fix
 discard 2ebc90a5c TIKA-2281 applied to PSTParser
 discard 0274a2816 TIKA-1865    step 2  the other parsers 1
 discard f70ea7a8f TIKA-1865 --  step 1  split out sender name from sender email exchange info where possible in msg files
 discard 24160a1c0 TIKA-2281    add mapi message type
 discard 81f1591fe TIKA-2285 -- triggering file didn't actually trigger string index out of bounds exception, but there could be one with a null or very short styleName
 discard 4843ca157 TIKA-2286
 discard d0ebfda73 fix tika-eval bug - include child file extension instead of parent
 discard 82509f32c TIKA-1857 xfa fix
 discard 5925bcb58 TIKA-2279 -   simplify token counting
 discard 6dcad8896 TIKA-2273 -- improve configuration of encoding detectors.  TODO: figure out loading in tika-app bundle and turn tests back on.
 discard b2a462c6d TIKA 2276 -- cleanup
 discard a279d039d TIKA-2278    clean up extract exception handling
 discard 35756b142 TIKA-2276   try to reuse parsers from ParseContext rather than creating own
 discard 4ebc441bd TIKA-2276 -- pass through TikaConfig if not specified via ParseContext in AutoDetectParser
 discard 0ce764915 TIKA-2275
 discard 824d176c9 TIKA-2269 -- fix potential NPE in FeedParser via Julien Nioche.
 discard 544ba9752 TIKA-2267 -- add common tokens for some languages into tika-eval
 discard 81150859b TIKA-1332 -- add English Spanish common tokens;  fix logging
 discard 61532258f TIKA-1332 3rd time's the charm.  Fix dependencies with IOUtils.
 discard 44612ae40 TIKA-1332 fix pom for 2.0
 discard 0d04b499a TIKA-1332 downgrade to Lucene 5.x so that this can run w/ Java 7
 discard 69dd0328b TIKA-1332 fix one profiler report and whitespace
 discard 5e49c3308 TIKA-1332 initial commit of tika-eval.  More work remains.
 discard 6bfe5d565 TIKA-2246 and TIKA-2247 -add parsers for EMF and WMF
 discard d9f376c12 TIKA-2134 - remove npe catch after upgrade to POI 3.16.beta2
 discard 0d7f5bad0 TIKA-2198 - add null check to Tika after upgrade to POI 3.16-beta2
 discard 27e81b97a TIKA-2181   upgrade to POI 3 16 beta2, make sure to upgrade overall bundle
 discard cf3996ed0 TIKA 2181   upgrade to POI 3 16 beta2
 discard 7b0655cc1 TIKA 2259  -- improve url extraction from PDFs = copy Tilman Hausherr's code from PDFBOX 3644
 discard 2d4889f44 TIKA 2025 --   fix xls/x testBigIntegersWGeneralFormat to work in multiple locales
 discard 28010d90d Mimetype for SAS Xport (XPT) files
 discard 534a52598 TIKA-2255 Mime detection unit tests for SAS files
 discard a79de0ccf TIKA-2255 Magic for older sas data files
 discard 4d8feaee5 Move to Tika 2.x location
 discard 6287b75b5 TIKA-2255 Test SAS files
 discard 3df8ce8b2 TIKA-2251 improve exception handling in SAX pptx/docx parsers
 discard 235c2adab TIKA-2249 -- update javadocs to alert devs that tables are not "maintained" by the PDFParser
 discard 4599374d6 Merge remote-tracking branch 'origin/2.x' into 2.x
 discard 985c1aef8 TIKA 2244 --   be more parsimonious with BufferedInputStream.    AutoDetectReader
 discard bd667acde TIKA-2250 As of RFC7903, the official mime type for EMF is now an image one and without the x- prefix
 discard 6668d78fa TIKA-2250 As of RFC7903, the official mime type for WMF is now an image one and without the x- prefix
 discard 58d56c33f TIKA-2250 As of RFC7903, the official mime type for BMP is now the one without the x- prefix
 discard 78828176a TIKA-2244 --  be more parsimonious with BufferedInputStream via Josh Hight.
 discard 8d783d27a TIKA-2232 -- log/warn if jbig2 is not on classpath
 discard 161b122ba TIKA-2240 -- improve mime detection for .wri files
 discard 9dbff6065 TIKA-2231 - allow underscored language codes (e.g. "chi_tra") in TesseractOCRParser via Graham Russell.
 discard 4374bcecf TIKA-2242  fix style markup in ODT
 discard 45a9b77d6 TIKA-2241 Add new config dumping option STATIC_FULL which lists all supported+active mime types for parsers
 discard dd70fd33a TIKA-2232 -- ImageParser shouldn't allege that it can handle jbig2 when jbig2 library is not on class path
 discard cd98c4cf3 TIKA-2238   add mime detection for embedded MSEquation files
 discard ce4e7e7d9 TIKA 2134 -- handle missing parts more robustly
 discard 28b53bd4d TIKA 2159 handle preparse/embedded IO exceptions uniformly
 discard 681615731 TIKA-2210 -- add experimental SAX parser for pptx and update (also TIKA-2191 and TIKA-2220)
 discard 2d908d59b TIKA-2237
 discard e02084cc6 TIKA-2192
 discard 0bc9bd896 TIKA-2232 -- add processing of jbig2 (with necessary non ASL 2.0 libs) via Pascal Essiembre
 discard c14e75070 TIKA-2235 -- bump default dpi for images created via PDF for OCR to 300 dpi via Matthew Caruana Galizia
 discard 850de1467 TIKA-2234    get rid of ThreadLocal
 discard aaa661e25 TIKA-2228 from Pascal Essiembre and TIKA-2230.
 discard f0863bcea Merge remote-tracking branch 'origin/2.x' into 2.x
 discard f1a541378 TIKA-2190 -- Add test file for maintain spacing
 discard 4e3534da0 Move new test file to the 2.x location
 discard 785e47413 Manually merge changelog
 discard cdb6456bb TIKA-2224 Test OneNote file from Krishnan Narayan plus unit test
 discard 71584b2de TIKA-2224 We now differ from HTTPD on onenote formats, as we have subtypes they lack
 discard db21ee158 TIKA-2224 Mime sub-entry for .onepkg, a cab file holding other onenote files
 discard bb76d986a TIKA-2224 Mime magic for OneNote
 discard d8fa3c2a8 TIKA-1946 updates, detection of wordperfect 5.0 and 5.1 as well as quattropro 7-8 vs quattropro 9
 discard 39cf35551 TIKA_2226 add exception for unsupported formats
 discard 4383e3da7 TIKA-1946 -- initial commit to add parsers for WordPerfect and QuattroPro.  Many thanks to Pascal Essiembre for contributing these!!!
 discard 337d38304 TIKA-2211 -- make sure that head (<style>) content isn't showing up in body in the EpubParser
 discard c9fcb3315 TIKA-2211 modify test file to include style information to test that we're excluding it.
 discard 50c1dc69d update OCR config to include default for output type
 discard 0d30aa1b2 TIKA 2190  --  add configurability for preserve interword spacing
 discard 54154e004 TIKA-2219 make sure to transmit charset name in detectAll via Pascal Essiembre -- fix test method to get inputstream from zip
 discard 68f305864 TIKA-2219 make sure to transmit charset name in detectAll via Pascal Essiembre
 discard ee761ac00  TIKA-2221 -- correctly catch and rethrow encrypted document exception as EncryptedDocumentException in WordExtractor via Matthew Caruana Galizia
 discard ffb25af1b Merge remote-tracking branch 'origin/2.x' into 2.x
 discard 4f04b6c3e  TIKA-2218 -- add a new new locations within a pptx to check for embedded objects
 discard d8853fe31 Update to PDFBox 2.0.4
 discard 300100fcb  TIKA-2090: Allow extraction of PDActions (including Javascript) from PDFs (TIKA-2090).
 discard 3d08da79f TIKA-2187 -- make "ignore deleted" as the default in the experimental SAX .docx parser and update the WordExtractor to include extraction of deleted text if requested by the user.
 discard 32162f59e TIKA 1321 initial commit
 discard de103c81f TIKA-2096 -- fix example, sorry...
 discard 1bb7c3384 TIKA-2179  --  add detection and parsing for word2006ml files -- this modification somehow fell to a different change list
 discard e5e4d4d91 TIKA-2096 change default to extract embedded documents even if the user forgets to specify an AutoDetectParser in the ParseContext
 discard a47a69933 TIKA-2169 fix xhtml in ocr
 discard 2f452304b Add mime detection and parser for Word 2006ML format (TIKA-2179).
 discard 8c01e4d8e TIKA-2116 upgrade to POI 3.16-beta1
 discard 7df6fe4be TIKA-2170 fix unit test to allow for different exceptions depending on cause of timeout.
 discard 7adfe1cb5 TIKA-2170 allow configuration of timeout for ForkServer
 discard 9a68f4ccc TIKA-2174 -- clean up
 discard 3f24e6c3e TIKA-2174 -- add ppm and update changes.txt
 discard ab009aeb7 TIKA-2159 -- first step
 discard f2661f997 TIKA-2174 add jpx and jp2 to Tesseract
 discard 7422218eb TIKA-2173 - first steps.  Need to integrate parameter configuration into 2.x before I can do the rest
 discard bcd59cee7 TIKA-2171 - upgrade sqlite parser
 discard 2c9412ab1 TIKA-2171 - upgrade sqlite parser
 discard 1d1bc0dd7 TIKA-1933 - clean up one more place where we aren't closing the ForkParser and are leaving behind a tmp ForkParser jar
 discard 2d5189186 TIKA-2157 - handle zip exception in embedded file
 discard 6ca74bec6 improve unit test for TIKA-2098
 discard a6978521f TIKA-2111 - ExecutableParser should set rather than add a Content-Type
 discard 4b393a6f9 TIKA-2144 - avoid npe if styles doesn't exist (odd, indeed, but if MSWord can handle it, we should, too).
 discard 936e3ac16 TIKA-2130
 discard 4c3bb1560 TIKA-2133
 discard c5f4f5263 TIKA-2127 : npe if there is no notes master)
 discard 7e66e4979 TIKA-2123: digester fails with multiple digests on large files
 discard 30e03de89 TIKA-2122: Extract all headers from MSG/RFC822
 discard 1e55953bc TIKA-2113-- upgrade metadata-extractor to 2.9.1
 discard 3fe8ef819 Merge remote-tracking branch 'origin/2.x' into 2.x
 discard af74ea5c9 TIKA-2110-- log full exception throughout tika-batch
 discard 1ec8c0947 Tesseract may see the t in haystack as a ! some times...
 discard 1ab6c81ce TIKA-2106 -- need to lower case hocr/txt suffix, thanks to Eric Pugh. This closes #136
 discard b84fcc584 TIKA-2101 -- don't call MAPIMessage's close()
 discard cde4c0aa8 TIKA-2098 small clean up.  Test for writelimitreached for each catchable IOException.  Many thanks to Alexander Kazakov for finding this and submitting https://github.com/apache/tika/pull/134
 discard 4392681af TIKA-2097 fix npe in mbox parser
 discard be78c549a Extract PDF DocInfo metadata into separate keys to prevent overwriting by XMP metadata (TIKA-2057).
 discard 94789a963 Tika-2095 include Tika version in tika-server's GREETING
 discard bd7208929   * Re-enable fileUrl for tika-server (TIKA-2081).  Fix commandline options not to include '-'
 discard ce1fc3720   * Re-enable fileUrl for tika-server (TIKA-2081).  If you choose,     to use this feature, beware of the security vulnerabilities!     See: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3271
 discard 673533d0e TIKA-2093- Add Tesseract's hOCR output format as an option, via Eric Pugh.  This commit also catches 2.x up to trunk; there were clearly some other changes to Tesseract that hadn't yet made it into 2.x.
 discard d543378a8 TIKA-2069 -- extract macros from MSOffice docs, fix tests to find target metadata object in any order
 discard 66f433471 TIKA-2069 -- extract macros from MSOffice files.
 discard 32d9ece8d  * Maintain passed-in mime in TXTParser (TIKA-2047).
 discard 12b1d435b TIKA-2013 -- upgrade to POI 3.15 -- don't forget to close new NPOIFS and MAPIMessage
 discard 1b32e3186 TIKA-2015 -- upgrade to PDFBox 2.0.3
 discard 92453f5e7 GitHub user haisi opened a pull request:   https://github.com/apache/tika/pull/132
 discard 176f3aded TIKA 2055 catch exception when totalTime out of unsigned int range in ooxml, include test file
 discard ae0cb3059 TIKA 2055 catch exception when totalTime out of unsigned int range in ooxml
 discard 9f6241161 Merge changes for TIKA-2064 to 2.x
 discard e58ade381 TIKA-2064 Test Stata DTA files from Michael Stepner, plus detection unit test
 discard 443a21e3f TIKA-2064 Mime types, with magic, for mostly-xml Stata DTA files. (Awaiting suitably licensed file for testing)
 discard 4636f95b2 TIKA-1255 and TIKA-2078 -- fix hyperlinks that include formatting and fix hyperlinks with multiple runs in docx
 discard f112c88fb TIKA-2075 - Expose Additional TikaService methods
 discard f8092d3bd TIKA-2073 - Tika Language Detect Project should include Bundle Activator and packaging consistent with other modules
 discard 7a0280c77 TIKA-2071 - DefaultParser and CompositeParser does not filter excludedParsers from dynamic ServiceLoader Parsers
 discard 587dcb772 TIKA-2072 - Create TikaServiceFactory for creating TikaService
 discard d57a85274 TIKA-2070 - Add Encoding Detector and Language Detectors to Dynamic Service Loader
 discard b73cd8ce8 TIKA-2074 - ServiceLoader can use Class files loaded via dynamic loading
 discard a0f365524 TIKA-2067 upgrade maven plugin dependencies -- revert felix bundle
 discard 8ff89d419 TIKA-2067 upgrade maven plugin dependencies
 discard 164bf52c8 TIKA-2066 upgrade commons-io to 2.5
 discard b2a7e382a TIKA-2065 upgrade forbiddenapis
 discard 8234b96fe TIKA-2061 - Added Adobe BSD license to tika-xmp
 discard 5d9db6bec TIKA-2063 - Added Vorbis bundle to bundle parent.
 discard fcefaae59 TIKA-2063 - Create vorbis bundle
 discard dc841e6ba TIKA-2060 - Added toggle to ClassLoaderUtils for OSGi
 discard cebf72382 TIKA-2062 - Remove bouncy castle inlining in bundles
 discard 4704d976c TIKA-2061 - Embed xmpcore in tika-xmp since it is not a proper bundle
 discard 59e0ca0fc TIKA-2059 - Merge multimedia and pdf parser modules and bundles
 discard 87b6d5d7d TIKA-2007 upgrade jackson, needed to update CachedTranslator (diff btwn trunk and 2.x)
 discard db513d6ad TIKA-2007 upgrade jackson
 discard 27bc383eb TIKA-1980 via Joseph Naegele
 discard 09bd22fb4 TIKA-1938 via Joseph Naegele
 discard 5358bf1e1 TIKA-1938 via Joseph Naegele
 discard b41c0b2a8 TIKA-2041 - add important diffs between new copy/paste from ICU4J and legacy code which may have included Tika-specific mods.
 discard 6ebbd7ef7 cleanup MatParser
 discard fc7c372f5 TIKA-2048
 discard 1c582aba6 TIKA-2040 - prevent permanent hang/oom on corrupt chm file
 discard 9f6c71fa6 TIKA-2041, upgrade ICU4j's charset detector to avoid multithreading bug.
 discard f89887d2f TIKA-2037 Merge fixes for 2.x
 discard 53310facc Changelog update
 discard 65cc9bcec TIKA-2042 MBOX magic and detection unit test
 discard 31374a39b TIKA-2037 RFC822Parser should wrap the James InputStream of embedded resources to avoid problems with downstream detection or extraction
 discard d6ce10b41 Email with attachment for testing extraction issues
 discard 8b951a43c TIKA-2039 upgrade to jackcess 2.1.4
 discard f4bacf859 TIKA-2025 increase number of significant digits extracted in "general" format in xls/xlsx
 discard e27526b84 TIKA-2030 - fix test file so that it is correctly detected
 discard cdfacdb41 Merge remote-tracking branch 'origin/2.x' into 2.x
 discard 87e1e23b4 TIKA-2030 - add handling for <text:s/> element to ODT parser. Thanks to David Pilato for opening this issue.
 discard 573527bbc Merge branch '2.x' of https://git-wip-us.apache.org/repos/asf/tika into 2.x
 discard 2a7e52ec4 fix getRecursiveJson -> getRecursiveMetadata in TikaTest, no json is involved here...not sure why Intellij didn't catch this one.  sorry.
 discard 2eb4804d1 fix getRecursiveJson -> getRecursiveMetadata in TikaTest, no json is involved here...
 discard 4678d6733 TIKA-2024 extract original path name from OLE1.0 embedded objects
 discard c7a6bcac4 Convert new lines from windows to unix
 discard dd3c2a486 TIKA-2026 -- improve extraction of attachments for PPT, PPTX, XLSX
 discard 933af20e8 rm inconsistently capitalized test files
 discard e62f23057 TIKA-2024 extract original file name/path where possible, take 1
 discard c84855f67 TIKA-2022 - clean up -- make entries private, move more into EndianUtils
 discard 865c45cd5 fix indentation
 discard 5bc597dc8 TIKA-2023 -- clean up RTFParser to use EndianUtils and IOUtils.readFully
 discard b14b47e76 TIKA-2022 -- add parser for applefile
 discard cd12917fa TIKA-2020 -- remove 3 parameter parse() and simplify CAD tests
 discard 0c71b2ffc TIKA-2020, remove 3 parameter parse() and simplify CAD tests
 discard 6bb6827e0 add startDocument and endDocument() to PRTParser so that it works with the ToXMLHandler
 discard 767442614 fix indents and whitespace
 discard 1ce93ed9e TIKA-2019 -- fix WordMLParser and SpreadsheetMLParser
 discard 2f5537380 TIKA-2009 -- add detection for Endnote Import files
 discard b600b6701 make sure to test magic for vcs/ics/asx
 discard 73ce7681c TIKA-2009 -- add magic for djvu
 discard b3bf5141b TIKA-2008 -- change metadata key to TikaCoreProperties.MODIFIER
 discard 60d4e3ff2 TIKA-2008 -- add mime definition and parser for MSOwnerFile
 discard ffaa4deaa TIKA-2004 -- add mime definitions for Windows Media Metafile
 discard f90193aa0 TIKA-2006 -- add mime definitions for ical and vcal
 discard b480d43f5 TIKA-1996 -- Upgrade to PDFBox 2.0.2
 discard ac52e5c15 TIKA-1999: fix setter, update changes.txt
 discard 89062edb0 TIKA-1999: add configurable limit to number of events extracted in XMPMM history.
 discard ebe702898 TIKA-1994 -- Integrate TesseractOCR with full page image rendering for PDFs
 discard e5a7604bc TIKA-1992 -- check for duplicate inline images by COSStream not object name.
 discard e05dd5bf4 TIKA-1990 -- need to add JPEG filters to embedded stream when handling embedded jpegs in PDFParser
 discard b1c00c050 TIKA-1985 -- ignore test until we get permission to use test file