Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/03/28 11:11:42 UTC
[jira] [Comment Edited] (TIKA-2313) Old Word document (Word 6.0, 1997) has a badly encoded(?) output.
[ https://issues.apache.org/jira/browse/TIKA-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944951#comment-15944951 ]
Tim Allison edited comment on TIKA-2313 at 3/28/17 11:10 AM:
-------------------------------------------------------------
I can't make any promises, but I'll take a look. Can you attach the file directly to this issue? More->Attach Files
As for a junk detector, see TIKA-1443. If you have any recommendations for metrics that would robustly identify junk across all languages, file formats and genres, let us know!
You could run language id against the output, and if you _are 100% certain_ that your documents shouldn't be in Chinese, you might get some mileage from that.
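A rough stand-in for the language-id idea is a script check: measure what fraction of the letters in the extracted text are Han characters and flag documents where it is unexpectedly high. The sketch below is not a Tika API and not real language identification. The class name `HanRatio`, the method `hanRatio`, and the sample strings are made up for illustration, and the sample assumes the body starts with the code points U+414B U+4154 U+594C U+4553. Any cutoff for "too high" would have to be tuned on your own corpus.

```java
import java.lang.Character.UnicodeScript;

public class HanRatio {
    /** Fraction of letters in the text that are in the Han script (0.0 if there are no letters). */
    static double hanRatio(String text) {
        long letters = text.codePoints().filter(Character::isLetter).count();
        if (letters == 0) {
            return 0.0;
        }
        long han = text.codePoints()
                .filter(Character::isLetter)
                .filter(cp -> UnicodeScript.of(cp) == UnicodeScript.HAN)
                .count();
        return (double) han / letters;
    }

    public static void main(String[] args) {
        // Hypothetical samples: four Han code points vs. the document title.
        String garbled = "\u414B\u4154\u594C\u4553";
        String plain = "KATALYSE";
        System.out.println(hanRatio(garbled)); // every letter is Han
        System.out.println(hanRatio(plain));   // no letter is Han
    }
}
```

This only helps if you are sure none of your documents should contain Chinese, Japanese, or other Han-script text.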
was (Author: tallison@mitre.org):
I can't make any promises, but I'll take a look. Can you attach the file directly to this issue? More->Attach Files
As for a junk detector, see TIKA-1443. If you have any recommendations for metrics that would robustly identify junk across all languages, file formats and genres, let us know!
> Old Word document (Word 6.0, 1997) has a badly encoded(?) output.
> -----------------------------------------------------------------
>
> Key: TIKA-2313
> URL: https://issues.apache.org/jira/browse/TIKA-2313
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.14
> Reporter: Steven Hall
> Priority: Minor
>
> I've a really old Word document (last date of modification is December 1997) which was written with Microsoft Word 6.0.
> When I attempt to use Tika to extract the contents of the document, I get incorrect output. The output appears to be in Chinese, but I believe the document's encoding is not being mapped correctly to the output encoding, which throws the characters off. I'm a complete beginner with document encodings, so I could be wrong here!
> I did see TIKA-721 and TIKA-2038, but neither seems to be related to older documents. I've also read that Tika should support Word 6.0, so I'm not sure what is going wrong.
> My guess for the moment is that the encoding within the document has incorrect character mappings. It may be using an incompatible mapping that, when Tika converts to its UTF-16 output, produces Chinese characters instead of the correct ones.
> What's interesting is that Tika correctly extracts all the metadata, including the document title, which is presumably in the same encoding as the document body.
> I have 2 questions:
> 1. Is there something I can pass to Tika to help out in detecting the encoding?
> 2. Is there a way of detecting this kind of bad output? In my application the number of documents like this is very small, but I don't have a very reliable way of detecting that the output is garbage.
> Like I said, I'm quite a beginner with Tika, so if there are any further commands you would like me to run, please say.
> I've uploaded the document here (can reupload if you have another preferred provider): http://s000.tinyupload.com/?file_id=04273098555496975464
> Here is the output of:
> {noformat}java -jar tika-app-1.14.jar old.DOC{noformat}
> {noformat}
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="cp:revision" content="3"/>
> <meta name="date" content="1997-12-12T12:57:00Z"/>
> <meta name="meta:word-count" content="38"/>
> <meta name="dc:creator" content="Preferred Customer"/>
> <meta name="meta:print-date" content="1997-12-12T11:31:00Z"/>
> <meta name="Word-Count" content="38"/>
> <meta name="dcterms:created" content="1997-12-12T11:31:00Z"/>
> <meta name="dcterms:modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Last-Modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Last-Save-Date" content="1997-12-12T12:57:00Z"/>
> <meta name="meta:character-count" content="227"/>
> <meta name="Template" content="C:\MSOFFICE\WINWORD\MODELES\FAXLYON.DOT"/>
> <meta name="meta:save-date" content="1997-12-12T12:57:00Z"/>
> <meta name="dc:title" content="KATALYSE"/>
> <meta name="Application-Name" content="Microsoft Word 6.0"/>
> <meta name="modified" content="1997-12-12T12:57:00Z"/>
> <meta name="Edit-Time" content="8400000000"/>
> <meta name="Content-Length" content="20480"/>
> <meta name="Content-Type" content="application/msword"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser"/>
> <meta name="creator" content="Preferred Customer"/>
> <meta name="meta:author" content="Preferred Customer"/>
> <meta name="extended-properties:Application" content="Microsoft Word 6.0"/>
> <meta name="meta:creation-date" content="1997-12-12T11:31:00Z"/>
> <meta name="Last-Printed" content="1997-12-12T11:31:00Z"/>
> <meta name="meta:last-author" content="Preferred Customer"/>
> <meta name="Creation-Date" content="1997-12-12T11:31:00Z"/>
> <meta name="xmpTPg:NPages" content="1"/>
> <meta name="resourceName" content="old.DOC"/>
> <meta name="Last-Author" content="Preferred Customer"/>
> <meta name="Character Count" content="227"/>
> <meta name="Page-Count" content="1"/>
> <meta name="Revision-Number" content="3"/>
> <meta name="extended-properties:Template" content="C:\MSOFFICE\WINWORD\MODELES\FAXLYON.DOT"/>
> <meta name="Author" content="Preferred Customer"/>
> <meta name="meta:page-count" content="1"/>
> <title>KATALYSE</title>
> </head>
> <body><p>䅋䅔奌䕓䅄䕔㨠䐍瑡ݥ䐓呁⁅䁜樠⽪䵍愯ᑡ㈱ㄯ⼲㜹ܕ܇܇⁁❌呁䕔呎佉⁎䕄㨠名牍牯</p>
> <p>㈱㈱</p>
> <p>䴯</p>
> <p>㈱㜹</p>
> <p>愯</p>
> <p>㜹ᨍ</p>
> <p>ܕ܇܇⁁❌呁䕔呎佉⁎䕄㨠名牍牯䴠ݲ⹍䈠剅䡔䑏܍܇܇剏䅇䥎䵓⁅ഺ潃灭湡ݹ潓楣瓩⃩偁䱐䍉䱏剏܍܇܇끎䐠⁅䕔䕌佃䥐啅⁒ഺ慆⁸渠ް㐰㜠‹㌸㈠‰㔴܍܇܇䅐䕇ⱓ夠䌠䵏剐卉䌠䱅䕌䌭⁉ഺ慐敧ⱳ椠据畬楤杮琠楨湯ݥܱ܇܇䕄䰠⁁䅐呒䐠⁅㨠䘍潲ݭ畇⁹䕌佃䕌܇潍獮敩牵ബ畓瑩⃠潮牴散瑮挠湯慴瑣琠泩烩潨楮畱ⱥ樠愧敬瀠慬獩物搠潣普物敭潮牴敲摮穥瘭畯畤ㄠ‹散扭敲瀠潲档楡⃠㔱と‰慤獮瘠獯氠捯畡⁸敤嘠杯慬獮മ䨍潶獵瀠敳瑮牥楡氠牯敤挠瑥整爠痩楮湯氠獥挠湯汣獵潩獮搠⁵牰ⷩ楤条潮瑳捩猠牴瑡柩煩敵മ慄獮挠瑥整愠瑴湥整敪瘠畯牰敩搠牣楯敲潍獮敩牵⃠❬獡畳慲据敤洠獥猠湥楴敭瑮敬敭汩敬牵䜉奕䰠䍅䱏ㄱ畲畇汩潬摵ⴠ㘠〹㌰䰠余⁎㨠〠⸴㈷㘮⸸㠰〮‸慆⁸›㐰㜮⸲㠶〮⸳㘶匍䄮畡挠灡瑩污搠㘵‶〰‰剆⹓删䌮匮慐楲䈠㌠㤷㔠㘶㜠ㄷ�ઙ莤ꔮ䇈誦꜅֊厨꤃֊ƌ贀�㈱㈱㜹ᨍ餀ꐊ⺃좥ꙁ֊誧ꠅ͓誩谅�ƍĀǀȿăǀ☿܁܃耡Ā揠à܁聁缀」ǰăƀ⤟ἡ䏼~⨃܁Ἥ⇸'́쀁㼁Ǡ☿ἁ쀁Aǰ⤏ā܇Sೀ˿āȀ̃萀㼡⇰'༁︃ἢ䏸àćĀⅿ܀老ἡǸćↀ缥ǰ䔃️℀쀃㼁༂�㼡⏰à�༡˼Ӽā⇀️ﰃā쀋̡ǰȃćↀEǀԟāǀȏā⇠︇�쀁̅︁Ēč︉�Ⅻ́耄༤⇀️㼂︣Ā⌏þ༁FǠ━ﰟ㼂耄ἡ˸˼П⇀︇܅̅老܁考ć老Ą㼊︦ĀʀӸăƀЇƀ⤃ĉAᎀ⾁︃܃︃�ɽǀȁğ␀༂耄ﰂ༥ϼ⇼ﰏﰃ쀆ἂ�ἤϸϸↀﰏ܂&̀�̉�</p>
> <p>ᨍ</p>
> <p></p>
> <p>ㄍ</p>
> <p>ㄱ畲畇汩潬摵ⴠ㘠〹㌰䰠余⁎㨠〠⸴㈷㘮⸸㠰〮‸慆⁸›㐰㜮⸲㠶〮⸳㘶匍䄮畡挠灡瑩污搠㘵‶〰‰剆⹓删䌮匮慐楲䈠㌠㤷㔠㘶㜠ㄷ�ઙ莤ꔮ䇈誦꜅֊厨꤃֊ƌ贀�㈱㈱㜹ᨍ餀ꐊ⺃좥ꙁ֊誧ꠅ͓誩谅�ƍĀǀȿăǀ☿܁܃耡Ā揠à܁聁缀」ǰăƀ⤟ἡ䏼~⨃܁Ἥ⇸'́쀁</p>
> <p></p>
> <p></p>
> <p></p>
> <p>ᨍ</p>
> </body></html>
> {noformat}
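One way to probe the output above is to reverse the suspected mix-up: re-encode the garbled text as UTF-16LE bytes and decode those bytes as an 8-bit code page. The sketch below assumes the first four characters of the body are U+414B U+4154 U+594C U+4553 and that the original text was windows-1252. Both are inferences from the output, not confirmed facts about the file, and the class name `Utf16Mojibake` and method `reinterpret` are made up for illustration. This is a diagnostic aid only, not a fix in Tika.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16Mojibake {
    /** Re-encodes the text as UTF-16LE (no BOM) and decodes the resulting bytes as windows-1252. */
    static String reinterpret(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.UTF_16LE);
        // Assumption: the 8-bit source encoding was windows-1252.
        return new String(raw, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Assumed code points of the first four characters of the extracted body.
        String garbled = "\u414B\u4154\u594C\u4553";
        System.out.println(reinterpret(garbled)); // prints KATALYSE
    }
}
```

Under those assumptions the first four characters decode to the same string as the `dc:title` metadata, which would be consistent with 8-bit text being read as 16-bit characters.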