Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/03/28 11:09:41 UTC

[jira] [Commented] (TIKA-2313) Old Word document (Word 6.0, 1997) has a badly encoded(?) output.

    [ https://issues.apache.org/jira/browse/TIKA-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944951#comment-15944951 ] 

Tim Allison commented on TIKA-2313:
-----------------------------------

I can't make any promises, but I'll take a look.  Can you attach the file directly to this issue (More -> Attach Files)?

As for a junk detector, see TIKA-1443.  If you have any recommendations for metrics that would robustly identify junk across all languages, file formats and genres, let us know!
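
One crude starting point, for this kind of output only: measure how much of the extracted body falls in the CJK ideograph ranges while the title is plain ASCII. Below is a minimal Python sketch. The function names, the two code point ranges and the 0.5 threshold are assumptions chosen for illustration, not anything that exists in Tika, and the check would wrongly flag a genuinely Chinese document that happens to have an ASCII title.

```python
# Hypothetical heuristic, not part of Tika: flag extracted text whose letters
# are mostly CJK ideographs even though the document title is plain ASCII.

# Assumed ranges: CJK Unified Ideographs Extension A and CJK Unified Ideographs.
CJK_RANGES = ((0x3400, 0x4DBF), (0x4E00, 0x9FFF))


def is_cjk(ch: str) -> bool:
    """Return True if the character lies in one of the assumed CJK ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def cjk_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are CJK ideographs."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if is_cjk(ch)) / len(letters)


def looks_like_misdecoded_latin(title: str, body: str, threshold: float = 0.5) -> bool:
    """Suspicious if the title is ASCII but the body is dominated by CJK."""
    return title.isascii() and cjk_ratio(body) >= threshold
```

For example, `looks_like_misdecoded_latin("KATALYSE", body)` would compare the `dc:title` value with the extracted body text. This is far from the language-independent metric asked for above; it only covers the single failure mode of Latin text surfacing as ideographs.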

> Old Word document (Word 6.0, 1997) has a badly encoded(?) output.
> -----------------------------------------------------------------
>
>                 Key: TIKA-2313
>                 URL: https://issues.apache.org/jira/browse/TIKA-2313
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Steven Hall
>            Priority: Minor
>
> I've a really old Word document (last date of modification is December 1997) which was written with Microsoft Word 6.0.
> When I attempt to use Tika to extract the contents of the document, I receive incorrect output. The output seems to be in Chinese, but I believe the document's encoding is not being mapped correctly to the output encoding, which throws the characters off. I'm a complete beginner in document encodings, so I could be wrong here!
> I did see TIKA-721 and TIKA-2038, but neither seems to be related to older documents. I've also read that Tika should support Word 6.0, so I'm not sure.
> My guess for the moment is that the document's text is being read with an incompatible character mapping, so that when Tika converts it into its UTF-16 output, the bytes map to Chinese characters instead of the correct ones.
> What's interesting is that Tika correctly extracts all the metadata, including the document title, which is presumably in the same encoding as the document body.
> I have 2 questions:
> 1. Is there something I can pass to Tika to help out in detecting the encoding?
> 2. Is there a way of detecting this kind of bad output? In my application the number of documents like this is very small, but I don't have a very reliable way of detecting that the output is garbage.
> Like I said, I'm quite a beginner with Tika, so if there are any further commands you would like me to run, please say.
> I've uploaded the document here (can reupload if you have another preferred provider): http://s000.tinyupload.com/?file_id=04273098555496975464
> Here is the output of:
> {noformat}java -jar tika-app-1.14.jar old.DOC{noformat}
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="cp:revision" content="3"/>
> <meta name="date" content="1997-12-12T12:57:00Z"/>
> <meta name="meta:word-count" content="38"/>
> <meta name="dc:creator" content="Preferred Customer"/>
> <meta name="meta:print-date" content="1997-12-12T11:31:00Z"/>
> <meta name="Word-Count" content="38"/>
> <meta name="dcterms:created" content="1997-12-12T11:31:00Z"/>
> <meta name="dcterms:modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Last-Modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Last-Save-Date" content="1997-12-12T12:57:00Z"/>
> <meta name="meta:character-count" content="227"/>
> <meta name="Template" content="C:\MSOFFICE\WINWORD\MODELES\FAXLYON.DOT"/>
> <meta name="meta:save-date" content="1997-12-12T12:57:00Z"/>
> <meta name="dc:title" content="KATALYSE"/>
> <meta name="Application-Name" content="Microsoft Word 6.0"/>
> <meta name="modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Edit-Time" content="8400000000"/>
> <meta name="Content-Length" content="20480"/>
> <meta name="Content-Type" content="application/msword"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser"/>
> <meta name="creator" content="Preferred Customer"/>
> <meta name="meta:author" content="Preferred Customer"/>
> <meta name="extended-properties:Application" content="Microsoft Word 6.0"/>
> <meta name="meta:creation-date" content="1997-12-12T11:31:00Z"/>
> <meta name="Last-Printed" content="1997-12-12T11:31:00Z"/>
> <meta name="meta:last-author" content="Preferred Customer"/>
> <meta name="Creation-Date" content="1997-12-12T11:31:00Z"/>
> <meta name="xmpTPg:NPages" content="1"/>
> <meta name="resourceName" content="old.DOC"/>
> <meta name="Last-Author" content="Preferred Customer"/>
> <meta name="Character Count" content="227"/>
> <meta name="Page-Count" content="1"/>
> <meta name="Revision-Number" content="3"/>
> <meta name="extended-properties:Template" content="C:\MSOFFICE\WINWORD\MODELES\FAXLYON.DOT"/>
> <meta name="Author" content="Preferred Customer"/>
> <meta name="meta:page-count" content="1"/>
> <title>KATALYSE</title>
> </head>
> <body><p>䅋䅔奌䕓഍഍䅄䕔㨠䐍瑡ݥ䐓呁⁅䁜樠⽪䵍愯ᑡ㈱ㄯ⼲㜹ܕ܇܇⁁❌呁䕔呎佉⁎䕄㨠名牍⁳牯</p>
> <p>㈱㈱</p>
> <p>䴯</p>
> <p>㈱㜹</p>
> <p>愯</p>
> <p>㜹ᨍ</p>
> <p>ܕ܇܇⁁❌呁䕔呎佉⁎䕄㨠名牍⁳牯䴠ݲ⹍䈠剅䡔䑏܍܇܇剏䅇䥎䵓⁅ഺ潃灭湡ݹ潓楣瓩⃩偁䱐䍉䱏剏܍܇܇끎䐠⁅䕔䕌佃䥐啅⁒ഺ慆⁸渠ް㐰㜠‹㌸㈠‰㔴܍܇܇䅐䕇ⱓ夠䌠䵏剐卉䌠䱅䕌䌭⁉ഺ慐敧ⱳ椠据畬楤杮琠楨⁳湯ݥܱ܇܇䕄䰠⁁䅐呒䐠⁅㨠䘍潲ݭ畇⁹䕌佃䕌܇഍഍഍潍獮敩牵ബ഍畓瑩⁥⃠潮牴⁥散瑮挠湯慴瑣琠泩烩潨楮畱ⱥ樠愧⁩敬瀠慬獩物搠⁥潣普物敭⁲潮牴⁥敲摮穥瘭畯⁳畤ㄠ‹散扭敲瀠潲档楡⃠㔱と‰慤獮瘠獯氠捯畡⁸敤嘠杯慬獮മ䨍⁥潶獵瀠敳瑮牥楡氠牯⁳敤挠瑥整爠痩楮湯氠獥挠湯汣獵潩獮搠⁵牰ⷩ楤条潮瑳捩猠牴瑡柩煩敵മ഍慄獮挠瑥整愠瑴湥整‬敪瘠畯⁳牰敩搠⁥牣楯敲‬潍獮敩牵‬⃠❬獡畳慲据⁥敤洠獥猠湥楴敭瑮⁳敬⁳敭汩敬牵⹳഍഍഍഍䜉奕䰠䍅䱏൅഍ㄱ‬畲⁥畇汩潬摵ⴠ㘠〹㌰䰠余⁎‭⹬㨠〠⸴㈷㘮⸸㠰〮‸‭慆⁸›㐰㜮⸲㠶〮⸳㘶匍䄮‮畡挠灡瑩污搠⁥㘵‶〰‰剆⹓删䌮匮‮慐楲⁳䈠㌠㤷㔠㘶㜠ㄷ഍഍�ઙ莤ꔮ䇈誦꜅֊厨꤃֊ƌ贀�㈱㈱㜹ᨍ餀ꐊ⺃좥ꙁ֊誧ꠅ͓誩谅�ƍĀǀȿăǀ☿܁܃耡Ā揠à܁聁缀」ǰăƀ⤟ἡ䏼~⨃܁Ἥ⇸'́쀁㼁Ǡ☿ἁ쀁Aǰ⤏ā܇Sೀ˿āȀ̃萀㼡⇰'༁︃ἢ䏸àćĀⅿ܀老ἡǸćↀ缥ǰ䔃️℀쀃㼁༂�㼡⏰à�༡˼Ӽā⇀️ﰃā쀋̡෿ǰȃćↀEǀԟāǀȏā⇠︇�쀁̅︁Ēč︉�Ⅻ́耄༤߼⇀️㼂︣Ā⌏þ༁FǠ━ﰟ㼂耄ἡ˸˼П⇀︇܅̅老܁考ć老Ą㼊︦Ā܏ʀӸăƀЇƀ⤃ĉAᎀ⾁︃܃︃�ɽǀȁğ␀༂耄ﰂ༥ϼ⇼ﰏﰃ쀆ἂ�ἤϸϸↀﰏ܂&̀�̉�</p>
> <p>ᨍ</p>
> <p>഍</p>
> <p>ㄍ</p>
> <p>ㄱ‬畲⁥畇汩潬摵ⴠ㘠〹㌰䰠余⁎‭⹬㨠〠⸴㈷㘮⸸㠰〮‸‭慆⁸›㐰㜮⸲㠶〮⸳㘶匍䄮‮畡挠灡瑩污搠⁥㘵‶〰‰剆⹓删䌮匮‮慐楲⁳䈠㌠㤷㔠㘶㜠ㄷ഍഍�ઙ莤ꔮ䇈誦꜅֊厨꤃֊ƌ贀�㈱㈱㜹ᨍ餀ꐊ⺃좥ꙁ֊誧ꠅ͓誩谅�ƍĀǀȿăǀ☿܁܃耡Ā揠à܁聁缀」ǰăƀ⤟ἡ䏼~⨃܁Ἥ⇸'́쀁</p>
> <p>഍</p>
> <p>഍</p>
> <p>഍</p>
> <p>ᨍ</p>
> </body></html>
> {noformat}
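
The output above is consistent with single-byte text being read two bytes at a time as UTF-16LE: re-encoding the first four body characters as UTF-16LE and decoding the resulting bytes with a single-byte code page gives back the title. A minimal Python sketch of that check follows. The helper name is hypothetical, and Windows-1252 is an assumption (the French text in the body suggests a Western European code page).

```python
# Hypothetical helper: undo a "single-byte text read as UTF-16LE" mix-up by
# turning each character back into its two bytes and decoding them one by one.


def reinterpret_as_single_byte(text: str, codec: str = "cp1252") -> str:
    """Encode text as UTF-16LE, then decode the bytes with a single-byte codec."""
    raw = text.encode("utf-16-le")
    # errors="replace" keeps going if a byte has no mapping in the assumed codec.
    return raw.decode(codec, errors="replace")


# The first four characters of the extracted body, written as escapes:
# U+414B U+4154 U+594C U+4553.
garbled = "\u414b\u4154\u594c\u4553"
print(reinterpret_as_single_byte(garbled))  # prints "KATALYSE"
```

The same round trip can be tried on longer stretches of the body, although parts of the output that are not text (the run of symbols near the end, for instance) would not be expected to decode to anything readable.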



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)