Posted to dev@poi.apache.org by bu...@apache.org on 2011/06/03 19:58:14 UTC
DO NOT REPLY [Bug 51320] New: Determine whether parts other than
QuillContents may contain useful text to extract and if so, support
extraction from those
https://issues.apache.org/bugzilla/show_bug.cgi?id=51320
Bug #: 51320
Summary: Determine whether parts other than QuillContents may
contain useful text to extract and if so, support
extraction from those
Product: POI
Version: 3.2-FINAL
Platform: PC
Status: NEW
Severity: normal
Priority: P2
Component: HPBF
AssignedTo: dev@poi.apache.org
ReportedBy: dgoldenberg@attivio.com
Classification: Unclassified
Right now, only QuillContents is taken into account when extracting text.
It seems worth researching whether any useful text may be extracted from the
Main and the Escher parts.
This is related to 51317 - Need ability to stream and chunk data out of MS
Publisher documents. If any extra parts get exposed, we'd ideally want streaming
available on them.
--
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
DO NOT REPLY [Bug 51320] Determine whether parts other than
QuillContents may contain useful text to extract and if so, support
extraction from those
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=51320
Nick Burch <ni...@alfresco.com> changed:
              What|Removed                     |Added
----------------------------------------------------------------------------
            Status|NEW                         |NEEDINFO
        OS/Version|                            |All
--- Comment #1 from Nick Burch <ni...@alfresco.com> 2011-06-03 19:47:46 UTC ---
The Escher parts are being parsed by DDF. So, it should be fairly easy to walk
through them in some sample files and see if there's any useful text in there.
If there is, extending the text extractor to look for what we've identified
should be fairly straightforward. Any chance you could take a look in some
files you have to hand?
As for the main part, I seem to recall the issue is having no idea what on
earth is stored in it or the format... First up you'd want to look at hex
dumps, and see if there is handy text in there. If there is, then look at
several files to see if it's in the same place. If not, look for what might be
offsets to where the text lives, and if the offsets are in a predictable place
then we're ok.
Needs some investigation, sorry!
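
The hex-dump step described above could be partly automated. The following is
only a rough sketch, not anything that exists in POI: the class name
StringRunScanner and its methods are hypothetical, and it assumes the bytes of
the part have already been read into memory (if no stream is extracted first,
scanning the whole .pub file also works as a first pass). It reports the
offsets of printable ASCII runs and of UTF-16LE runs (a printable byte followed
by 0x00), so that offsets can be compared across several sample files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists candidate text runs and their byte offsets. */
final class StringRunScanner {

    /** One candidate text run and the byte offset where it starts. */
    static final class Run {
        final int offset;
        final String text;

        Run(int offset, String text) {
            this.offset = offset;
            this.text = text;
        }
    }

    private static boolean printable(int b) {
        return b >= 0x20 && b <= 0x7e;
    }

    /** Runs of at least minChars printable ASCII bytes. */
    static List<Run> findAscii(byte[] data, int minChars) {
        List<Run> runs = new ArrayList<>();
        int start = -1;
        // Iterate one past the end so a run touching the last byte is flushed.
        for (int i = 0; i <= data.length; i++) {
            boolean ok = i < data.length && printable(data[i] & 0xff);
            if (ok) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                if (i - start >= minChars) {
                    runs.add(new Run(start,
                            new String(data, start, i - start, StandardCharsets.US_ASCII)));
                }
                start = -1;
            }
        }
        return runs;
    }

    /** Runs of at least minChars (printable byte, 0x00) pairs, i.e. basic UTF-16LE text. */
    static List<Run> findUtf16Le(byte[] data, int minChars) {
        List<Run> runs = new ArrayList<>();
        int i = 0;
        while (i + 1 < data.length) {
            int j = i;
            while (j + 1 < data.length && printable(data[j] & 0xff) && data[j + 1] == 0) {
                j += 2;
            }
            if ((j - i) / 2 >= minChars) {
                runs.add(new Run(i, new String(data, i, j - i, StandardCharsets.UTF_16LE)));
                i = j;
            } else {
                i++; // advance one byte so odd-aligned text is still found
            }
        }
        return runs;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (Run r : findAscii(data, 6)) {
            System.out.printf("ascii  0x%08x %s%n", r.offset, r.text);
        }
        for (Run r : findUtf16Le(data, 6)) {
            System.out.printf("utf16  0x%08x %s%n", r.offset, r.text);
        }
    }
}
```

Running this over several files and diffing the offsets would show whether any
text sits at a fixed position. If it does not, the next thing to look for is
the bytes preceding each run, which may hold a length or an offset.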
DO NOT REPLY [Bug 51320] Determine whether parts other than
QuillContents may contain useful text to extract and if so, support
extraction from those
Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=51320
--- Comment #2 from Dmitry Goldenberg <dg...@attivio.com> 2011-06-03 23:41:42 UTC ---
Nick,
Sorry, I am swamped at the moment. This is not as critical since QuillContents
gets one most of the content, it seems...