Posted to dev@poi.apache.org by bu...@apache.org on 2011/06/03 19:58:14 UTC

DO NOT REPLY [Bug 51320] New: Determine whether parts other than QuillContents may contain useful text to extract and if so, support extraction from those

https://issues.apache.org/bugzilla/show_bug.cgi?id=51320

             Bug #: 51320
           Summary: Determine whether parts other than QuillContents may
                    contain useful text to extract and if so, support
                    extraction from those
           Product: POI
           Version: 3.2-FINAL
          Platform: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HPBF
        AssignedTo: dev@poi.apache.org
        ReportedBy: dgoldenberg@attivio.com
    Classification: Unclassified


Right now, only QuillContents is taken into account when extracting text.

It seems worth researching whether any useful text may be extracted from the
Main and the Escher parts.

This is related to bug 51317 - Need ability to stream and chunk data out of MS
Publisher documents. If any extra parts get exposed, we'd ideally want streaming
available on them as well.

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


DO NOT REPLY [Bug 51320] Determine whether parts other than QuillContents may contain useful text to extract and if so, support extraction from those

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=51320

Nick Burch <ni...@alfresco.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
         OS/Version|                            |All

--- Comment #1 from Nick Burch <ni...@alfresco.com> 2011-06-03 19:47:46 UTC ---
The Escher parts are already being parsed by DDF, so it should be fairly easy to
walk through them in some sample files and see if there's any useful text in
there. If there is, extending the text extractor to look for what we've
identified should be fairly straightforward. Any chance you could take a look at
some files you have to hand?
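
To make "walk through them" concrete, here is a minimal sketch of a record
walk over raw Escher bytes. It is not POI's DDF code and uses only the JDK; it
assumes the standard Escher layout of an 8-byte little-endian header (2 bytes
of options, 2 bytes of record id, 4 bytes of body length), with a record being
a container when the low 4 bits of the options are 0xF. The class and method
names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class EscherWalkSketch {
    /** Appends "depth:recordId:length" to out for every record header in data[start, end). */
    public static List<String> walk(byte[] data, int start, int end, int depth, List<String> out) {
        int pos = start;
        while (pos + 8 <= end) {
            int options = u16(data, pos);
            int recordId = u16(data, pos + 2);
            long length = u32(data, pos + 4);
            int bodyStart = pos + 8;
            if (length > end - bodyStart) {
                break; // body would run past the end: malformed, or not Escher data
            }
            int bodyEnd = bodyStart + (int) length;
            out.add(depth + ":0x" + Integer.toHexString(recordId).toUpperCase() + ":" + length);
            if ((options & 0x000F) == 0x000F) {
                // Container record: its body is itself a sequence of records.
                walk(data, bodyStart, bodyEnd, depth + 1, out);
            }
            pos = bodyEnd;
        }
        return out;
    }

    // Little-endian unsigned 16-bit read.
    private static int u16(byte[] d, int p) {
        return (d[p] & 0xFF) | ((d[p + 1] & 0xFF) << 8);
    }

    // Little-endian unsigned 32-bit read, widened to long.
    private static long u32(byte[] d, int p) {
        return (d[p] & 0xFFL) | ((d[p + 1] & 0xFFL) << 8)
                | ((d[p + 2] & 0xFFL) << 16) | ((d[p + 3] & 0xFFL) << 24);
    }
}
```

Running this over the Escher streams of a few sample files and looking at which
record ids show up would be one way to decide which records are worth checking
for text.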

As for the main part, I seem to recall the issue is having no idea what on
earth is stored in it or the format... First up you'd want to look at hex
dumps, and see if there is handy text in there. If there is, then look at
several files to see if it's in the same place. If not, look for what might be
offsets to where the text lives, and if the offsets are in a predictable place
then we're ok.
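
As a rough aid for the hex dump step, the sketch below scans a byte array for
runs of printable text and reports the byte offset of each, so the offsets can
be compared across several files. It assumes the text is stored as UTF-16LE
and only recognises ASCII-range characters; both are assumptions about the
Main part, not known facts. The class and method names are hypothetical, and
getting the part's bytes out of the file is left to the caller.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class TextRunScanner {
    /** A run of printable text and the byte offset where it starts. */
    public static final class Run {
        public final int offset;
        public final String text;

        Run(int offset, String text) {
            this.offset = offset;
            this.text = text;
        }
    }

    /** Finds runs of at least minChars printable ASCII characters encoded as UTF-16LE. */
    public static List<Run> findUtf16Runs(byte[] data, int minChars) {
        List<Run> runs = new ArrayList<>();
        int i = 0;
        while (i + 1 < data.length) {
            int start = i;
            // Extend the run while we keep seeing <printable byte, 0x00> pairs.
            while (i + 1 < data.length && isPrintable(data[i]) && data[i + 1] == 0) {
                i += 2;
            }
            int chars = (i - start) / 2;
            if (chars >= minChars) {
                String text = new String(data, start, i - start, StandardCharsets.UTF_16LE);
                runs.add(new Run(start, text));
            }
            if (i == start) {
                i++; // no pair here: step one byte so odd-aligned text is not missed
            }
        }
        return runs;
    }

    private static boolean isPrintable(byte b) {
        return b >= 0x20 && b < 0x7F;
    }
}
```

If the same strings turn up at the same offsets in several files, that points
to a fixed layout; if not, the reported offsets are the values to search for
elsewhere in the part as candidate pointers.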

Needs some investigation, sorry!



DO NOT REPLY [Bug 51320] Determine whether parts other than QuillContents may contain useful text to extract and if so, support extraction from those

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=51320

--- Comment #2 from Dmitry Goldenberg <dg...@attivio.com> 2011-06-03 23:41:42 UTC ---
Nick,

Sorry, I am swamped at the moment. This is not as critical, since QuillContents
seems to get one most of the content...
