Posted to dev@tika.apache.org by "Pascal Magnard (JIRA)" <ji...@apache.org> on 2017/03/01 08:41:45 UTC

[jira] [Updated] (TIKA-2282) Paragraph auto-numbering is not extracted from DOCX and ODT.

     [ https://issues.apache.org/jira/browse/TIKA-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pascal Magnard updated TIKA-2282:
---------------------------------
    Summary: Paragraph auto-numbering is not extracted from DOCX and ODT.  (was: Paragraph numbering is not extracted from DOCX and ODT.)

> Paragraph auto-numbering is not extracted from DOCX and ODT.
> ------------------------------------------------------------
>
>                 Key: TIKA-2282
>                 URL: https://issues.apache.org/jira/browse/TIKA-2282
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>         Environment: Windows 10
> MS Word 2016
> LibreOffice 5.3
>            Reporter: Pascal Magnard
>            Priority: Minor
>         Attachments: sample.doc, sample.doc.tika.txt, sample.docx, sample.docx.tika.txt, sample.odt, sample.odt.tika.txt
>
>
> When extracting text with AutoDetectParser, paragraph auto-numbering is not extracted from .docx and .odt files. For .doc files, the numbering is correctly extracted (or perhaps I should say recomputed).
> I'm working on a project where the numbering information in the original document is critical for the users.
> In detail, for the provided samples, sample.doc gives:
> 1 This is the first level
> 1.1 This is the second level
> 1.2 This is still second level
> 1.2.1 First repeat of third level
> 2 First repeat of first level
> 2.1 Fist Second
> 2.1.1 Second Third
> 2.2 Second Second
> 2.2.1 Third Third
> -----------------------------------------------------------
> which seems OK.
> But sample.docx and sample.odt give:
> This is the first level
> This is the second level
> This is still second level
> First repeat of third level
> First repeat of first level
> Fist Second
> Second Third
> Second Second
> Third Third
> -----------------------------------------------------------
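
To illustrate what "recomputed" could mean here, the following is a minimal sketch of how multi-level labels such as 1, 1.1, 1.2.1 might be rebuilt from paragraph outline levels alone. It is not Tika code and does not describe how Tika handles .doc files: the class OutlineNumbering, the method labels, the 1-based level input, and the nine-level limit are all assumptions made for illustration. It also assumes plain decimal numbering starting at 1, with no custom formats or restarts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika.
public class OutlineNumbering {

    // Turns a sequence of 1-based outline levels into dotted labels.
    static List<String> labels(int[] levels) {
        // Assumption: at most nine nesting levels.
        int[] counters = new int[9];
        List<String> out = new ArrayList<>();
        for (int level : levels) {
            // Advance the counter for this level.
            counters[level - 1]++;
            // Reset every deeper level so it restarts under the new parent.
            for (int i = level; i < counters.length; i++) {
                counters[i] = 0;
            }
            // Join the counters from level 1 down to the current level.
            StringBuilder label = new StringBuilder();
            for (int i = 0; i < level; i++) {
                if (i > 0) {
                    label.append('.');
                }
                label.append(counters[i]);
            }
            out.add(label.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Levels of the nine paragraphs in the samples, read off the .doc output.
        int[] levels = {1, 2, 2, 3, 1, 2, 3, 2, 3};
        System.out.println(labels(levels));
    }
}
```

For the level sequence of the sample paragraphs, this yields the same labels as the sample.doc output quoted above. A real implementation would additionally have to read the numbering definitions stored in the document, which this sketch ignores.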



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)