Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/03/28 11:09:41 UTC
[jira] [Commented] (TIKA-2313) Old Word document (Word 6.0, 1997) has a badly encoded(?) output.
[ https://issues.apache.org/jira/browse/TIKA-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944951#comment-15944951 ]
Tim Allison commented on TIKA-2313:
-----------------------------------
I can't make any promises, but I'll take a look. Can you attach the file directly to this issue? Use More -> Attach Files.
As for a junk detector, see TIKA-1443. If you have any recommendations for metrics that would robustly identify junk across all languages, file formats and genres, let us know!
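To give a sense of what a very naive metric might look like, here is a minimal Java sketch. It is not what TIKA-1443 proposes, and it is certainly not robust across languages and formats. It only measures the share of code points that rarely appear in cleanly extracted text: U+FFFD, private-use, unassigned, unpaired surrogate, and non-whitespace control characters. The class name, the method names, and the choice of categories are all assumptions made for illustration.

```java
final class JunkShare {
    /** True for code points that are unusual in cleanly extracted text. */
    static boolean isSuspicious(int cp) {
        if (cp == 0xFFFD) {
            return true; // replacement character left behind by a failed decode
        }
        switch (Character.getType(cp)) {
            case Character.UNASSIGNED:
            case Character.PRIVATE_USE:
            case Character.SURROGATE: // unpaired surrogate
                return true;
            case Character.CONTROL:
                // Tabs and line breaks are normal; other control characters are not.
                return !Character.isWhitespace(cp);
            default:
                return false;
        }
    }

    /** Share of suspicious code points in the text, from 0.0 to 1.0. */
    static double suspiciousShare(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return total == 0 ? 0.0 : (double) suspicious / total;
    }

    public static void main(String[] args) {
        System.out.println(suspiciousShare("Plain text, nothing odd."));
        System.out.println(suspiciousShare("ab\uFFFD\u0007"));
    }
}
```

A metric like this would miss much of the output quoted below, where most of the wrong characters are perfectly valid CJK code points. That is part of why a general junk detector is hard.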
> Old Word document (Word 6.0, 1997) has a badly encoded(?) output.
> -----------------------------------------------------------------
>
> Key: TIKA-2313
> URL: https://issues.apache.org/jira/browse/TIKA-2313
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.14
> Reporter: Steven Hall
> Priority: Minor
>
> I have a very old Word document (last modified in December 1997) which was written with Microsoft Word 6.0.
> When I use Tika to extract the contents of the document, I get incorrect output. The output appears to be Chinese, but I believe the document's encoding is not being mapped correctly to the output encoding, which garbles the characters. I'm a complete beginner in document encodings, so I could be wrong here!
> I did see TIKA-721 and TIKA-2038, but neither seems to be related to older documents. I've also read that Tika should support Word 6.0, so I'm not sure what is going wrong.
> My current guess is that the document's bytes are being read with the wrong character mapping, so that when Tika converts them into its UTF-16 output they come out as Chinese characters instead of the correct ones.
> What's interesting is that Tika correctly extracts all the metadata, including the document title, which is presumably in the same encoding as the document body.
> I have 2 questions:
> 1. Is there something I can pass to Tika to help out in detecting the encoding?
> 2. Is there a way of detecting this kind of bad output? In my application the number of documents like this is very small, but I don't have a very reliable way of detecting that the output is garbage.
> Like I said, I'm quite a beginner with Tika, so if there are any further commands you would like me to run, please say so.
> I've uploaded the document here (can reupload if you have another preferred provider): http://s000.tinyupload.com/?file_id=04273098555496975464
> Here is the output of:
> {noformat}java -jar tika-app-1.14.jar old.DOC{noformat}
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="cp:revision" content="3"/>
> <meta name="date" content="1997-12-12T12:57:00Z"/>
> <meta name="meta:word-count" content="38"/>
> <meta name="dc:creator" content="Preferred Customer"/>
> <meta name="meta:print-date" content="1997-12-12T11:31:00Z"/>
> <meta name="Word-Count" content="38"/>
> <meta name="dcterms:created" content="1997-12-12T11:31:00Z"/>
> <meta name="dcterms:modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Last-Modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Last-Save-Date" content="1997-12-12T12:57:00Z"/>
> <meta name="meta:character-count" content="227"/>
> <meta name="Template" content="C:\MSOFFICE\WINWORD\MODELES\FAXLYON.DOT"/>
> <meta name="meta:save-date" content="1997-12-12T12:57:00Z"/>
> <meta name="dc:title" content="KATALYSE"/>
> <meta name="Application-Name" content="Microsoft Word 6.0"/>
> <meta name="modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Edit-Time" content="8400000000"/>
> <meta name="Content-Length" content="20480"/>
> <meta name="Content-Type" content="application/msword"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser"/>
> <meta name="creator" content="Preferred Customer"/>
> <meta name="meta:author" content="Preferred Customer"/>
> <meta name="extended-properties:Application" content="Microsoft Word 6.0"/>
> <meta name="meta:creation-date" content="1997-12-12T11:31:00Z"/>
> <meta name="Last-Printed" content="1997-12-12T11:31:00Z"/>
> <meta name="meta:last-author" content="Preferred Customer"/>
> <meta name="Creation-Date" content="1997-12-12T11:31:00Z"/>
> <meta name="xmpTPg:NPages" content="1"/>
> <meta name="resourceName" content="old.DOC"/>
> <meta name="Last-Author" content="Preferred Customer"/>
> <meta name="Character Count" content="227"/>
> <meta name="Page-Count" content="1"/>
> <meta name="Revision-Number" content="3"/>
> <meta name="extended-properties:Template" content="C:\MSOFFICE\WINWORD\MODELES\FAXLYON.DOT"/>
> <meta name="Author" content="Preferred Customer"/>
> <meta name="meta:page-count" content="1"/>
> <title>KATALYSE</title>
> </head>
> <body><p>䅋䅔奌䕓䅄䕔㨠䐍瑡ݥ䐓呁⁅䁜樠⽪䵍愯ᑡ㈱ㄯ⼲㜹ܕ܇܇⁁❌呁䕔呎佉⁎䕄㨠名牍牯</p>
> <p>㈱㈱</p>
> <p>䴯</p>
> <p>㈱㜹</p>
> <p>愯</p>
> <p>㜹ᨍ</p>
> <p>ܕ܇܇⁁❌呁䕔呎佉⁎䕄㨠名牍牯䴠ݲ⹍䈠剅䡔䑏܍܇܇剏䅇䥎䵓⁅ഺ潃灭湡ݹ潓楣瓩⃩偁䱐䍉䱏剏܍܇܇끎䐠⁅䕔䕌佃䥐啅⁒ഺ慆⁸渠ް㐰㜠‹㌸㈠‰㔴܍܇܇䅐䕇ⱓ夠䌠䵏剐卉䌠䱅䕌䌭⁉ഺ慐敧ⱳ椠据畬楤杮琠楨湯ݥܱ܇܇䕄䰠⁁䅐呒䐠⁅㨠䘍潲ݭ畇⁹䕌佃䕌܇潍獮敩牵ബ畓瑩⃠潮牴散瑮挠湯慴瑣琠泩烩潨楮畱ⱥ樠愧敬瀠慬獩物搠潣普物敭潮牴敲摮穥瘭畯畤ㄠ‹散扭敲瀠潲档楡⃠㔱と‰慤獮瘠獯氠捯畡⁸敤嘠杯慬獮മ䨍潶獵瀠敳瑮牥楡氠牯敤挠瑥整爠痩楮湯氠獥挠湯汣獵潩獮搠⁵牰ⷩ楤条潮瑳捩猠牴瑡柩煩敵മ慄獮挠瑥整愠瑴湥整敪瘠畯牰敩搠牣楯敲潍獮敩牵⃠❬獡畳慲据敤洠獥猠湥楴敭瑮敬敭汩敬牵䜉奕䰠䍅䱏ㄱ畲畇汩潬摵ⴠ㘠〹㌰䰠余⁎㨠〠⸴㈷㘮⸸㠰〮‸慆⁸›㐰㜮⸲㠶〮⸳㘶匍䄮畡挠灡瑩污搠㘵‶〰‰剆⹓删䌮匮慐楲䈠㌠㤷㔠㘶㜠ㄷ�ઙ莤ꔮ䇈誦꜅֊厨꤃֊ƌ贀�㈱㈱㜹ᨍ餀ꐊ⺃좥ꙁ֊誧ꠅ͓誩谅�ƍĀǀȿăǀ☿܁܃耡Ā揠à܁聁缀」ǰăƀ⤟ἡ䏼~⨃܁Ἥ⇸'́쀁㼁Ǡ☿ἁ쀁Aǰ⤏ā܇Sೀ˿āȀ̃萀㼡⇰'༁︃ἢ䏸àćĀⅿ܀老ἡǸćↀ缥ǰ䔃️℀쀃㼁༂�㼡⏰à�༡˼Ӽā⇀️ﰃā쀋̡ǰȃćↀEǀԟāǀȏā⇠︇�쀁̅︁Ēč︉�Ⅻ́耄༤⇀️㼂︣Ā⌏þ༁FǠ━ﰟ㼂耄ἡ˸˼П⇀︇܅̅老܁考ć老Ą㼊︦ĀʀӸăƀЇƀ⤃ĉAᎀ⾁︃܃︃�ɽǀȁğ␀༂耄ﰂ༥ϼ⇼ﰏﰃ쀆ἂ�ἤϸϸↀﰏ܂&̀�̉�</p>
> <p>ᨍ</p>
> <p></p>
> <p>ㄍ</p>
> <p>ㄱ畲畇汩潬摵ⴠ㘠〹㌰䰠余⁎㨠〠⸴㈷㘮⸸㠰〮‸慆⁸›㐰㜮⸲㠶〮⸳㘶匍䄮畡挠灡瑩污搠㘵‶〰‰剆⹓删䌮匮慐楲䈠㌠㤷㔠㘶㜠ㄷ�ઙ莤ꔮ䇈誦꜅֊厨꤃֊ƌ贀�㈱㈱㜹ᨍ餀ꐊ⺃좥ꙁ֊誧ꠅ͓誩谅�ƍĀǀȿăǀ☿܁܃耡Ā揠à܁聁缀」ǰăƀ⤟ἡ䏼~⨃܁Ἥ⇸'́쀁</p>
> <p></p>
> <p></p>
> <p></p>
> <p>ᨍ</p>
> </body></html>
> {noformat}
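The reporter's guess can be checked against the first characters of the body output above. Below is a minimal Java sketch, assuming the body text is stored one byte per character and was wrongly read as UTF-16LE. ISO-8859-1 is used as a stand-in for the single-byte code page; the real one is probably windows-1252, which is also an assumption. Under those assumptions the title KATALYSE becomes U+414B U+4154 U+594C U+4553, which appear to be the four characters that open the first paragraph. The class and method names are hypothetical, and this only illustrates the suspected failure; it is not Tika's or POI's code.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    /**
     * Reproduces the suspected mistake: single-byte text read as UTF-16LE.
     * Each pair of bytes collapses into one 16-bit character; an odd
     * trailing byte would decode to U+FFFD.
     */
    static String misdecode(String text) {
        byte[] raw = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    /** Reverses the mistake: back to the raw bytes, then one byte per character. */
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.UTF_16LE);
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = misdecode("KATALYSE");
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
        System.out.println(repair(garbled));
    }
}
```

Running `main` prints `U+414B U+4154 U+594C U+4553` followed by `KATALYSE`. Using `repair` on extracted text is a diagnostic only. If it turns the garbled body back into readable French, that supports the one-byte-per-character hypothesis.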
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)