Posted to dev@poi.apache.org by bu...@apache.org on 2011/06/03 19:33:27 UTC

DO NOT REPLY [Bug 51317] New: Need ability to stream and chunk data out of MS Publisher documents

https://issues.apache.org/bugzilla/show_bug.cgi?id=51317

             Bug #: 51317
           Summary: Need ability to stream and chunk data out of MS
                    Publisher documents
           Product: POI
           Version: 3.2-FINAL
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: critical
          Priority: P2
         Component: HPBF
        AssignedTo: dev@poi.apache.org
        ReportedBy: dgoldenberg@attivio.com
    Classification: Unclassified


This is a follow-up to bug 45602 (Add Java API for MS Publisher .pub files).

We need to be able to stream text data out of .pub files, with enough API
hooks to control how that text is chunked.

Right now, HPBFDocument doesn't support the NIO version of the POI file system,
so it loads the whole document into memory.

Text extraction is done from the QuillContents object (it probably needs to
examine the other parts as well, such as Main, Escher etc.; subject of another
ticket). QuillContents currently reads the whole document input stream into a
single byte buffer, parses it, splits it into bits, and then picks out the
text and hyperlink bits.

For streaming, we'd want a way to avoid loading everything at once and instead:
a. emit bits as they're encountered
b. make their contents streamable/chunkable, since a single bit may contain a
lot of text data
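
Points (a) and (b) could be sketched roughly as below. This is only an
illustration of the requested shape, not existing POI code: BitHandler and
ChunkedBitReader are hypothetical names, the payload stream is assumed to be
already bounded to a single bit, and the text is assumed to be UTF-16LE.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical callback interface; nothing like this exists in POI today.
interface BitHandler {
    // Called when a bit is encountered, before any of its content is read.
    void startBit(String type);
    // Called repeatedly with successive pieces of the bit's text.
    void textChunk(String chunk);
    // Called once the bit's payload has been fully consumed.
    void endBit();
}

// Hypothetical reader that hands one bit's text to a handler in bounded chunks.
final class ChunkedBitReader {
    private final int maxChunkChars;

    ChunkedBitReader(int maxChunkChars) {
        if (maxChunkChars <= 0) {
            throw new IllegalArgumentException("maxChunkChars must be positive");
        }
        this.maxChunkChars = maxChunkChars;
    }

    // Assumes 'payload' yields only this bit's bytes and that they are UTF-16LE.
    void readBit(String type, InputStream payload, BitHandler handler)
            throws IOException {
        handler.startBit(type);
        Reader reader = new InputStreamReader(payload, StandardCharsets.UTF_16LE);
        char[] buf = new char[maxChunkChars];
        int n;
        // Each read returns at most maxChunkChars characters, so the whole
        // bit is never held in memory at once.
        while ((n = reader.read(buf, 0, buf.length)) != -1) {
            handler.textChunk(new String(buf, 0, n));
        }
        handler.endBit();
    }
}
```

A caller would implement BitHandler to forward each chunk to an indexer or
writer. A chunk boundary may fall between the two halves of a surrogate pair,
so chunks are best concatenated or written to a character stream rather than
processed in isolation.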

I've attempted to implement this but ran into exceptions in
NDocumentInputStream (subject of another ticket).

Additionally, this functionality would ideally cover Publisher 2010 files,
which I don't believe are currently supported (subject of another ticket).

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


DO NOT REPLY [Bug 51317] Need ability to stream and chunk data out of MS Publisher documents

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=51317

Yegor Kozlov <ye...@dinom.ru> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|critical                    |enhancement
