Posted to dev@tika.apache.org by "Moritz Dorka (JIRA)" <ji...@apache.org> on 2014/09/22 10:15:34 UTC

[jira] [Comment Edited] (TIKA-1315) Basic list support in WordExtractor

    [ https://issues.apache.org/jira/browse/TIKA-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142495#comment-14142495 ] 

Moritz Dorka edited comment on TIKA-1315 at 9/22/14 8:14 AM:
-------------------------------------------------------------

Hmm, apparently, files are global to a bug in Jira and are not linked to specific comments... Too bad. So this is related to [^ListManager.tar.bz2] and [^ListNumbering.patch] which I propose as substitutes for Filip's work.

----
\\
The original patch proposed by Filip is quite good, but
* it lacks true support for ListFormatOverrideLevels (which, admittedly, is a really brain-twisting feature of Word),
* it does not cope correctly with bullets / unnumbered items (i.e. stuff which has 0x17 or 0xFF as its nfc) on arbitrary levels of multilevel lists,
* it has no support for legal formatting, and
* it has no support for levels which restart at arbitrary more-significant levels.
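
To make the bullet and restart points more concrete, here is a minimal sketch of a multilevel counter. It is not the attached ListManager code and makes no claim to follow MS-DOC exactly. The class name {{ListCounterSketch}}, the {{restartLimit}} array, and the label rendering (plain decimal numbers joined by dots, empty label for unnumbered levels) are assumptions made purely for illustration; only the nfc values 0x17 and 0xFF come from the list above.

```java
import java.util.StringJoiner;

/** Simplified multilevel list counter (illustrative only, not the attached patch). */
class ListCounterSketch {
    // nfc values named above as bullets / unnumbered items.
    static final int NFC_BULLET = 0x17;
    static final int NFC_NONE = 0xFF;

    private final int[] startAt;      // first number issued on each level
    private final int[] nfc;          // number format code of each level
    // Hypothetical parameter: level i is reset whenever a paragraph appears
    // on a level smaller than restartLimit[i]. restartLimit[i] == i means
    // "restart after any more-significant level".
    private final int[] restartLimit;
    private final int[] counters;     // last number issued per level

    ListCounterSketch(int[] startAt, int[] nfc, int[] restartLimit) {
        this.startAt = startAt.clone();
        this.nfc = nfc.clone();
        this.restartLimit = restartLimit.clone();
        this.counters = new int[startAt.length];
        for (int i = 0; i < counters.length; i++) {
            counters[i] = startAt[i] - 1; // nothing issued yet
        }
    }

    static boolean isUnnumbered(int nfc) {
        return nfc == NFC_BULLET || nfc == NFC_NONE;
    }

    /** Advances the list for a paragraph on the given level and returns its label. */
    String next(int level) {
        // Reset less-significant levels whose restart limit lies above this level.
        for (int i = level + 1; i < counters.length; i++) {
            if (level < restartLimit[i]) {
                counters[i] = startAt[i] - 1;
            }
        }
        counters[level]++;
        if (isUnnumbered(nfc[level])) {
            return ""; // assumption: bullets / unnumbered items get no number text
        }
        // Join the numbers of all numbered levels up to this one; unnumbered
        // parent levels are skipped, unvisited parents fall back to startAt.
        StringJoiner label = new StringJoiner(".");
        for (int i = 0; i <= level; i++) {
            if (!isUnnumbered(nfc[i])) {
                label.add(Integer.toString(Math.max(counters[i], startAt[i])));
            }
        }
        return label.toString();
    }
}
```

With a bullet on the middle level and {{restartLimit = {0, 1, 1}}}, the third level keeps counting across an intervening bullet paragraph and only restarts once the top level advances. With {{restartLimit = {0, 1, 2}}} it restarts after every more-significant paragraph.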

Attached is an improved version of the numbering algorithm, written from scratch with the exception of two helper methods ({{intToRoman()}} + {{intToLetter()}}) which are still based on the original blog post cited by Filip. I consider them rather trivial, so it is hopefully not a problem to include them in Tika.
The code is an attempt to fully implement the algorithm outlined in MS-DOC, v20140721, [2.4.6.3|http://msdn.microsoft.com/en-us/library/dd921056%28v=office.12%29.aspx] + [2.4.6.4|http://msdn.microsoft.com/en-us/library/dd945275%28v=office.12%29.aspx].
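
For readers who have not opened the attachments, helpers of this kind usually look roughly like the sketch below. This is not the attached code: only the method names {{intToRoman()}} and {{intToLetter()}} are taken from the comment above, while the signatures, the lowercase output, the enclosing class name and the repeated-letter scheme after "z" (aa, bb, ...) are assumptions.

```java
/** Illustrative number-format helpers (assumed signatures, not the attached patch). */
class NumberFormatSketch {

    /** Converts 1..3999 to a lowercase Roman numeral. */
    static String intToRoman(int n) {
        if (n < 1 || n > 3999) {
            throw new IllegalArgumentException("out of range: " + n);
        }
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] numerals = {"m", "cm", "d", "cd", "c", "xc", "l", "xl", "x", "ix", "v", "iv", "i"};
        StringBuilder sb = new StringBuilder();
        // Greedily subtract the largest value that still fits.
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(numerals[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    /**
     * Converts a positive number to a lowercase letter label.
     * Assumption: after "z" the letter is repeated (27 gives "aa", 28 gives "bb").
     */
    static String intToLetter(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("out of range: " + n);
        }
        char letter = (char) ('a' + (n - 1) % 26);
        int repeats = (n - 1) / 26 + 1;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < repeats; i++) {
            sb.append(letter);
        }
        return sb.toString();
    }
}
```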

The downside of my approach is that, IMHO, it externalizes quite a bit of functionality which should actually live inside POI. Since those ListLevelOverrides can also influence the overall formatting of the paragraph (something which is handled by POI), this can lead to inconsistent behaviour.

The current testcase ({{WordParserTest.java}}) has rather poor coverage of the proposed new algorithm. I have a better test file here which reaches about 80% coverage (the rest being mostly error handling stuff). Give me a shout if you want that to be included in Tika as well.

Make sure to apply [this patch|https://issues.apache.org/bugzilla/show_bug.cgi?id=56998] to POI before using this.


was (Author: morido):
Hmm, apparently files are global to a bug in Jira and are not linked to specific comments... Too bad. So this is related to ListManager.tar.bz2 and ListNumbering.patch which I propose as substitutes for Filip's work.

The original patch proposed by Filip is quite good but it lacks true support for ListFormatOverrideLevels (which, allegedly, is a really brain-twisting feature of Word), it does not cope correctly with bullets / unnumbered items (i.e. stuff which has 0x17 or 0xFF as its nfc) on arbitrary levels of multilevel lists and there is no support for either legal formatting or levels which restart at arbitrary more-significant levels.

Attached is an improved version of the numbering algorithm written from scratch, with the exception of two helper methods (intToRoman() + intToLetter()) which are still based on the original blog post cited by Filip. I consider them rather trivial, so it is hopefully not a problem to include them in tika.
The code is an attempt to fully implement the algorithm outlined in [MS-DOC], v20140721, 2.4.6.3 + 2.4.6.4.

Downside of my approach is that it IMHO externalizes quite a bit of functionality which should actually be inside POI. Since those ListLevelOverrides can also influence the overall formatting of the paragraph (something which is handled by POI) this can lead to inconsistent behaviour.

The current testcase (WordParserTest.java) has rather bad coverage for the proposed new algorithm. I have a better test file here which reaches about 80% (the rest being mostly error handling stuff). Give me a shout if you want that to be included in tika as well.

Make sure to apply https://issues.apache.org/bugzilla/show_bug.cgi?id=56998 to POI before using this.

> Basic list support in WordExtractor
> -----------------------------------
>
>                 Key: TIKA-1315
>                 URL: https://issues.apache.org/jira/browse/TIKA-1315
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Filip Bednárik
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: ListManager.tar.bz2, ListNumbering.patch, ListUtils.java, WordExtractor.java.patch, WordParserTest.java.patch
>
>
> Hello guys, I am really sorry to post issue like this because I have no other way of contacting you and I don't quite understand how you manage forks and pull requests (I don't think you do that). Plus I don't know your coding styles and stuff.
> In my project I needed for tika to parse numbered lists from word .doc documents, but TIKA doesn't support it. So I looked for solution and found one here: http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/ . So I adapted this solution to Apache TIKA with few fixes and improvements. Anyway feel free to use any of it so it can help people who struggle with lists in TIKA like I did.
> Attached files are:
> Updated test
> Fixed WordExtractor
> Added ListUtils


