Posted to dev@tika.apache.org by "Pascal Magnard (JIRA)" <ji...@apache.org> on 2017/03/01 08:39:45 UTC
[jira] [Created] (TIKA-2282) Paragraph numbering is not extracted from DOCX and ODT.
Pascal Magnard created TIKA-2282:
------------------------------------
Summary: Paragraph numbering is not extracted from DOCX and ODT.
Key: TIKA-2282
URL: https://issues.apache.org/jira/browse/TIKA-2282
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.14
Environment: Windows 10
MS Word 2016
LibreOffice 5.3
Reporter: Pascal Magnard
Priority: Minor
When extracting text with AutoDetectParser, paragraph auto-numbering is not extracted from .docx and .odt files. For .doc files, the numbering is extracted correctly (or, more precisely, recomputed).
I'm working on a project where the numbering information in the original document is critical for the users.
In detail, for the provided samples, sample.doc gives:
1 This is the first level
1.1 This is the second level
1.2 This is still second level
1.2.1 First repeat of third level
2 First repeat of first level
2.1 Fist Second
2.1.1 Second Third
2.2 Second Second
2.2.1 Third Third
-----------------------------------------------------------
which seems OK.
But sample.docx and sample.odt give:
This is the first level
This is the second level
This is still second level
First repeat of third level
First repeat of first level
Fist Second
Second Third
Second Second
Third Third
-----------------------------------------------------------
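To make the expected result concrete, here is a minimal sketch of how the labels shown for sample.doc could be recomputed from each paragraph's nesting level. It is only an illustration of the expected output and is not Tika's implementation: the class NumberingSketch, the method number, and the levels array are hypothetical, and the sketch assumes plain decimal numbering, 1-based levels, and no skipped levels.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: rebuilds multi-level labels
// ("1", "1.1", "1.2.1", ...) from each paragraph's 1-based nesting level.
class NumberingSketch {

    static List<String> number(int[] levels, String[] texts) {
        int maxDepth = 0;
        for (int level : levels) {
            maxDepth = Math.max(maxDepth, level);
        }
        int[] counters = new int[maxDepth];
        List<String> out = new ArrayList<>();
        for (int i = 0; i < levels.length; i++) {
            int level = levels[i];
            // Advance the counter for this level.
            counters[level - 1]++;
            // Restart every deeper level under the new parent.
            for (int d = level; d < maxDepth; d++) {
                counters[d] = 0;
            }
            // Join the counters from level 1 down to the current level.
            StringBuilder label = new StringBuilder();
            for (int d = 0; d < level; d++) {
                if (d > 0) {
                    label.append('.');
                }
                label.append(counters[d]);
            }
            out.add(label + " " + texts[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Levels assumed for the nine paragraphs of the sample documents.
        int[] levels = {1, 2, 2, 3, 1, 2, 3, 2, 3};
        String[] texts = {
            "This is the first level",
            "This is the second level",
            "This is still second level",
            "First repeat of third level",
            "First repeat of first level",
            "Fist Second",
            "Second Third",
            "Second Second",
            "Third Third"
        };
        number(levels, texts).forEach(System.out::println);
    }
}
```

With the levels assumed above, the sketch yields the same labels as the sample.doc output, which is what I would expect the extraction to return for sample.docx and sample.odt as well.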