Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/03/28 11:11:42 UTC

[jira] [Comment Edited] (TIKA-2313) Old Word document (Word 6.0, 1997) has a badly encoded(?) output.

    [ https://issues.apache.org/jira/browse/TIKA-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944951#comment-15944951 ] 

Tim Allison edited comment on TIKA-2313 at 3/28/17 11:10 AM:
-------------------------------------------------------------

I can't make any promises, but I'll take a look.  Can you attach the file directly to this issue (More -> Attach Files)?

As for a junk detector, see TIKA-1443.  If you have any recommendations for metrics that would robustly identify junk across all languages, file formats and genres, let us know!

You could run language identification against the output, and if you _are 100% certain_ that your documents shouldn't be in Chinese, you might get some mileage from that.
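
As a rough illustration of that idea, here is a minimal sketch that uses only the Java standard library rather than a real language identifier. It measures what fraction of the letters in the extracted text belong to the Han script. The class name, method names and the 0.5 threshold are hypothetical, and the check is only meaningful if you are certain your documents should not contain Chinese.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of Tika: flags extracted text that is
// dominated by Han-script letters when no Chinese is expected.
class HanRatioCheck {

    /** Fraction of letter code points in the text that belong to the Han script. */
    public static double hanRatio(String text) {
        long letters = text.codePoints()
                .filter(Character::isLetter)
                .count();
        if (letters == 0) {
            return 0.0; // no letters at all, nothing to judge
        }
        long han = text.codePoints()
                .filter(Character::isLetter)
                .filter(cp -> UnicodeScript.of(cp) == UnicodeScript.HAN)
                .count();
        return (double) han / letters;
    }

    /** True if the share of Han letters exceeds the given threshold (an assumed cut-off). */
    public static boolean looksLikeUnexpectedChinese(String text, double threshold) {
        return hanRatio(text) > threshold;
    }

    public static void main(String[] args) {
        String plain = "KATALYSE";
        String suspicious = "\u414B\u4154\u594C\u4553"; // first four characters of the reported body
        System.out.println(looksLikeUnexpectedChinese(plain, 0.5));
        System.out.println(looksLikeUnexpectedChinese(suspicious, 0.5));
    }
}
```

This is a script-level heuristic, so it cannot tell genuine Chinese from junk. It only helps when Chinese output is itself a sign that something went wrong.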


> Old Word document (Word 6.0, 1997) has a badly encoded(?) output.
> -----------------------------------------------------------------
>
>                 Key: TIKA-2313
>                 URL: https://issues.apache.org/jira/browse/TIKA-2313
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Steven Hall
>            Priority: Minor
>
> I have a very old Word document (last modified in December 1997) that was written with Microsoft Word 6.0.
> When I use Tika to extract the contents of the document, the output is incorrect. It appears to be in Chinese, but I believe the document's encoding is not being mapped correctly to the output encoding, which throws the characters off. I'm a complete beginner in document encodings, so I could be wrong here!
> I did see TIKA-721 and TIKA-2038, but neither seems to be related to older documents. I've also read that Tika should support Word 6.0, so I'm not sure.
> My guess for the moment is that the document's text is being read with incorrect character mappings. Possibly an incompatible mapping is used, so that when Tika converts the text into its UTF-16 output, it ends up as Chinese characters instead of the correct ones.
> What's interesting is that Tika correctly extracts all the metadata, including the document title, which is presumably in the same encoding as the document body.
> I have 2 questions:
> 1. Is there something I can pass to Tika to help out in detecting the encoding?
> 2. Is there a way of detecting this kind of bad output? In my application the number of documents like this is very small, but I don't have a very reliable way of detecting that the output is garbage.
> As I said, I'm quite a beginner with Tika, so if there are any further commands you would like me to run, please say so.
> I've uploaded the document here (can reupload if you have another preferred provider): http://s000.tinyupload.com/?file_id=04273098555496975464
> Here is the output of:
> {noformat}java -jar tika-app-1.14.jar old.DOC{noformat}
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="cp:revision" content="3"/>
> <meta name="date" content="1997-12-12T12:57:00Z"/>
> <meta name="meta:word-count" content="38"/>
> <meta name="dc:creator" content="Preferred Customer"/>
> <meta name="meta:print-date" content="1997-12-12T11:31:00Z"/>
> <meta name="Word-Count" content="38"/>
> <meta name="dcterms:created" content="1997-12-12T11:31:00Z"/>
> <meta name="dcterms:modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Last-Modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Last-Save-Date" content="1997-12-12T12:57:00Z"/>
> <meta name="meta:character-count" content="227"/>
> <meta name="Template" content="C:\MSOFFICE\WINWORD\MODELES\FAXLYON.DOT"/>
> <meta name="meta:save-date" content="1997-12-12T12:57:00Z"/>
> <meta name="dc:title" content="KATALYSE"/>
> <meta name="Application-Name" content="Microsoft Word 6.0"/>
> <meta name="modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Edit-Time" content="8400000000"/>
> <meta name="Content-Length" content="20480"/>
> <meta name="Content-Type" content="application/msword"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser"/>
> <meta name="creator" content="Preferred Customer"/>
> <meta name="meta:author" content="Preferred Customer"/>
> <meta name="extended-properties:Application" content="Microsoft Word 6.0"/>
> <meta name="meta:creation-date" content="1997-12-12T11:31:00Z"/>
> <meta name="Last-Printed" content="1997-12-12T11:31:00Z"/>
> <meta name="meta:last-author" content="Preferred Customer"/>
> <meta name="Creation-Date" content="1997-12-12T11:31:00Z"/>
> <meta name="xmpTPg:NPages" content="1"/>
> <meta name="resourceName" content="old.DOC"/>
> <meta name="Last-Author" content="Preferred Customer"/>
> <meta name="Character Count" content="227"/>
> <meta name="Page-Count" content="1"/>
> <meta name="Revision-Number" content="3"/>
> <meta name="extended-properties:Template" content="C:\MSOFFICE\WINWORD\MODELES\FAXLYON.DOT"/>
> <meta name="Author" content="Preferred Customer"/>
> <meta name="meta:page-count" content="1"/>
> <title>KATALYSE</title>
> </head>
> <body><p>䅋䅔奌䕓഍഍䅄䕔㨠䐍瑡ݥ䐓呁⁅䁜樠⽪䵍愯ᑡ㈱ㄯ⼲㜹ܕ܇܇⁁❌呁䕔呎佉⁎䕄㨠名牍⁳牯</p>
> <p>㈱㈱</p>
> <p>䴯</p>
> <p>㈱㜹</p>
> <p>愯</p>
> <p>㜹ᨍ</p>
> <p>ܕ܇܇⁁❌呁䕔呎佉⁎䕄㨠名牍⁳牯䴠ݲ⹍䈠剅䡔䑏܍܇܇剏䅇䥎䵓⁅ഺ潃灭湡ݹ潓楣瓩⃩偁䱐䍉䱏剏܍܇܇끎䐠⁅䕔䕌佃䥐啅⁒ഺ慆⁸渠ް㐰㜠‹㌸㈠‰㔴܍܇܇䅐䕇ⱓ夠䌠䵏剐卉䌠䱅䕌䌭⁉ഺ慐敧ⱳ椠据畬楤杮琠楨⁳湯ݥܱ܇܇䕄䰠⁁䅐呒䐠⁅㨠䘍潲ݭ畇⁹䕌佃䕌܇഍഍഍潍獮敩牵ബ഍畓瑩⁥⃠潮牴⁥散瑮挠湯慴瑣琠泩烩潨楮畱ⱥ樠愧⁩敬瀠慬獩物搠⁥潣普物敭⁲潮牴⁥敲摮穥瘭畯⁳畤ㄠ‹散扭敲瀠潲档楡⃠㔱と‰慤獮瘠獯氠捯畡⁸敤嘠杯慬獮മ䨍⁥潶獵瀠敳瑮牥楡氠牯⁳敤挠瑥整爠痩楮湯氠獥挠湯汣獵潩獮搠⁵牰ⷩ楤条潮瑳捩猠牴瑡柩煩敵മ഍慄獮挠瑥整愠瑴湥整‬敪瘠畯⁳牰敩搠⁥牣楯敲‬潍獮敩牵‬⃠❬獡畳慲据⁥敤洠獥猠湥楴敭瑮⁳敬⁳敭汩敬牵⹳഍഍഍഍䜉奕䰠䍅䱏൅഍ㄱ‬畲⁥畇汩潬摵ⴠ㘠〹㌰䰠余⁎‭⹬㨠〠⸴㈷㘮⸸㠰〮‸‭慆⁸›㐰㜮⸲㠶〮⸳㘶匍䄮‮畡挠灡瑩污搠⁥㘵‶〰‰剆⹓删䌮匮‮慐楲⁳䈠㌠㤷㔠㘶㜠ㄷ഍഍�ઙ莤ꔮ䇈誦꜅֊厨꤃֊ƌ贀�㈱㈱㜹ᨍ餀ꐊ⺃좥ꙁ֊誧ꠅ͓誩谅�ƍĀǀȿăǀ☿܁܃耡Ā揠à܁聁缀」ǰăƀ⤟ἡ䏼~⨃܁Ἥ⇸'́쀁㼁Ǡ☿ἁ쀁Aǰ⤏ā܇Sೀ˿āȀ̃萀㼡⇰'༁︃ἢ䏸àćĀⅿ܀老ἡǸćↀ缥ǰ䔃️℀쀃㼁༂�㼡⏰à�༡˼Ӽā⇀️ﰃā쀋̡෿ǰȃćↀEǀԟāǀȏā⇠︇�쀁̅︁Ēč︉�Ⅻ́耄༤߼⇀️㼂︣Ā⌏þ༁FǠ━ﰟ㼂耄ἡ˸˼П⇀︇܅̅老܁考ć老Ą㼊︦Ā܏ʀӸăƀЇƀ⤃ĉAᎀ⾁︃܃︃�ɽǀȁğ␀༂耄ﰂ༥ϼ⇼ﰏﰃ쀆ἂ�ἤϸϸↀﰏ܂&̀�̉�</p>
> <p>ᨍ</p>
> <p>഍</p>
> <p>ㄍ</p>
> <p>ㄱ‬畲⁥畇汩潬摵ⴠ㘠〹㌰䰠余⁎‭⹬㨠〠⸴㈷㘮⸸㠰〮‸‭慆⁸›㐰㜮⸲㠶〮⸳㘶匍䄮‮畡挠灡瑩污搠⁥㘵‶〰‰剆⹓删䌮匮‮慐楲⁳䈠㌠㤷㔠㘶㜠ㄷ഍഍�ઙ莤ꔮ䇈誦꜅֊厨꤃֊ƌ贀�㈱㈱㜹ᨍ餀ꐊ⺃좥ꙁ֊誧ꠅ͓誩谅�ƍĀǀȿăǀ☿܁܃耡Ā揠à܁聁缀」ǰăƀ⤟ἡ䏼~⨃܁Ἥ⇸'́쀁</p>
> <p>഍</p>
> <p>഍</p>
> <p>഍</p>
> <p>ᨍ</p>
> </body></html>
> {noformat}
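
The first four characters of the body above (U+414B, U+4154, U+594C, U+4553) are exactly what you get if the single-byte text "KATALYSE" (the document title) is read two bytes at a time as UTF-16LE. That is consistent with the reporter's guess about a wrong character mapping, though it does not by itself show where in the extraction the mix-up happens. The sketch below reproduces the effect with the Java standard library only. The class and method names are hypothetical, and ISO-8859-1 stands in for the document's real single-byte code page, which is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo, not Tika code: shows how single-byte text turns into
// CJK-looking characters when its bytes are decoded as UTF-16LE.
class MisdecodeDemo {

    // Assumption: stand-in for whatever single-byte code page the document uses.
    static final Charset SINGLE_BYTE = StandardCharsets.ISO_8859_1;

    /** Encode as single-byte text, then (wrongly) decode the bytes as UTF-16LE. */
    public static String misdecode(String original) {
        return new String(original.getBytes(SINGLE_BYTE), StandardCharsets.UTF_16LE);
    }

    /** Reverse the mistake: re-encode as UTF-16LE and decode as single-byte text. */
    public static String recover(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.UTF_16LE), SINGLE_BYTE);
    }

    public static void main(String[] args) {
        // 'K' = 0x4B and 'A' = 0x41 pair up as the little-endian unit U+414B, and so on.
        String garbled = misdecode("KATALYSE");
        System.out.println(garbled);          // prints the four characters U+414B U+4154 U+594C U+4553
        System.out.println(recover(garbled)); // prints KATALYSE
    }
}
```

The round trip is only reliable for text with an even number of bytes and no invalid surrogate sequences, so treat `recover` as a diagnostic aid rather than a repair step.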


