Posted to dev@poi.apache.org by an...@superlinksoftware.com on 2005/04/27 18:00:37 UTC

State of the Union for HWPF

So it looks like if I create a new document or even use one of the Word 
office templates, I can add all the text I like and can even style it 
like existing text.

However, it looks at the moment like:

  * Delete is horribly broken; this needs to be fixed
  * You can do things with HWPF that are really structurally unsound
  * The usage patterns for when to create a Section, Paragraph and 
CharacterRun aren't very well defined
  * WHOA MOMMMA there are a lot of methods and constants that you can 
set on any given "thing"
  * Ryan apparently did not believe in JavaDoc, JUnit (very few tests), 
or Documentation (which is why I continually refused to let HWPF out of 
scratchpad, which is why the project floundered up until now -- gee 
Ryan...maybe that's why it's hard to use).

That being said:
  * It seems to be fairly functional even for somewhat complex documents 
especially in *reading*
  * SuperLink and its clients may put significant investment into HWPF 
in the near future to get it up to spec.

The API needs several refinements:

  * add "cloneProperties" methods to Paragraph and CharacterRun  - (done 
but not committed)
  * Why can't a characterRun be added to a paragraph?
  * Why can't a characterRun be deleted from a paragraph?
  * Groupings of similar properties should be broken down into 
compositions of objects rather than just one big Mega properties object.
  * Weird Word structural abbreviations shouldn't be exposed to the 
usermodel
  * Unicode support.

Question:
  * There are a couple of people here with some good Word knowledge. 
Can anyone give me some pointers on the difference between Unicode text 
storage and non-Unicode text storage?

Glen, Avik and Rainer are scared...commit messages directly from me 
again ;-)

-Andy

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-dev-unsubscribe@jakarta.apache.org
Mailing List:    http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta POI Project: http://jakarta.apache.org/poi/


RE: State of the Union for HWPF

Posted by Kais Dukes <k....@complexar.com>.
Dear Andy,

I can answer your question with regards to unicode and non-unicode storage.
The fundamental structure in a Word document is known as the piece table. A
piece table is a structure that maps logical parts of document text to
locations in memory (or locations in the Word file, if you want to code
directly against the file format).
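
As a rough illustration of the mapping described above, here is a minimal
sketch in Java. All names in it (PieceTableSketch, Piece, addPiece) are
hypothetical and are not part of HWPF or of Word's file format; it assumes a
single in-memory buffer and a linear scan over the pieces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from HWPF.
final class PieceTableSketch {
    /** One piece: a run of logical text stored at some offset in the buffer. */
    static final class Piece {
        final int logicalStart;  // position of the run in the document text
        final int length;        // number of characters in the run
        final int bufferOffset;  // where the run actually lives

        Piece(int logicalStart, int length, int bufferOffset) {
            this.logicalStart = logicalStart;
            this.length = length;
            this.bufferOffset = bufferOffset;
        }
    }

    private final String buffer;
    private final List<Piece> pieces = new ArrayList<>();
    private int logicalLength = 0;

    PieceTableSketch(String buffer) {
        this.buffer = buffer;
    }

    /** Appends buffer[bufferOffset, bufferOffset + length) to the logical text. */
    void addPiece(int bufferOffset, int length) {
        pieces.add(new Piece(logicalLength, length, bufferOffset));
        logicalLength += length;
    }

    /** Maps a logical character index to its storage location and reads it. */
    char charAt(int logicalIndex) {
        for (Piece p : pieces) {
            if (logicalIndex >= p.logicalStart
                    && logicalIndex < p.logicalStart + p.length) {
                return buffer.charAt(p.bufferOffset + (logicalIndex - p.logicalStart));
            }
        }
        throw new IndexOutOfBoundsException("index " + logicalIndex);
    }

    /** Rebuilds the logical text by walking the pieces in order. */
    String text() {
        StringBuilder sb = new StringBuilder(logicalLength);
        for (Piece p : pieces) {
            sb.append(buffer, p.bufferOffset, p.bufferOffset + p.length);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The buffer holds the two runs in the "wrong" physical order.
        PieceTableSketch table = new PieceTableSketch("world!Hello, ");
        table.addPiece(6, 7); // "Hello, "
        table.addPiece(0, 6); // "world!"
        System.out.println(table.text());
    }
}
```

The point of the sketch is only that the logical order of the text is
defined by the table, not by where the characters sit in storage.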

AbiWord, the open source word processor, for example, has an excellent piece
table implementation. Microsoft Word has a different implementation, which
may be more efficient for very large documents.

You cannot really manipulate lots of text without using an efficient piece
table (or some equivalent text mapping structure) -- for larger documents or
for heavy edits the memory requirements will explode (in the case of Java, you
will probably start getting JVM heap errors).
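
To make the memory argument concrete, here is a hedged sketch of how an
insert can be done by splitting a piece rather than copying the document
text. It assumes the common "original buffer plus append-only add buffer"
design; the class and method names are hypothetical and this is not how
HWPF, AbiWord or Word necessarily implement it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of editing through a piece table.
final class EditablePieceTable {
    static final class Piece {
        final boolean inAddBuffer; // false: original text, true: add buffer
        final int offset;
        final int length;

        Piece(boolean inAddBuffer, int offset, int length) {
            this.inAddBuffer = inAddBuffer;
            this.offset = offset;
            this.length = length;
        }
    }

    private final String original;                          // never modified
    private final StringBuilder added = new StringBuilder(); // append-only
    private final List<Piece> pieces = new ArrayList<>();

    EditablePieceTable(String original) {
        this.original = original;
        if (!original.isEmpty()) {
            pieces.add(new Piece(false, 0, original.length()));
        }
    }

    int length() {
        int total = 0;
        for (Piece p : pieces) total += p.length;
        return total;
    }

    int pieceCount() {
        return pieces.size();
    }

    /** Inserts s at logical position pos; existing text is never copied. */
    void insert(int pos, String s) {
        if (pos < 0 || pos > length()) {
            throw new IndexOutOfBoundsException("pos " + pos);
        }
        if (s.isEmpty()) return;
        Piece fresh = new Piece(true, added.length(), s.length());
        added.append(s);
        int start = 0;
        for (int i = 0; i < pieces.size(); i++) {
            Piece p = pieces.get(i);
            if (pos <= start + p.length) {
                int cut = pos - start;
                // Replace p with [left part][new piece][right part].
                List<Piece> replacement = new ArrayList<>();
                if (cut > 0) {
                    replacement.add(new Piece(p.inAddBuffer, p.offset, cut));
                }
                replacement.add(fresh);
                if (cut < p.length) {
                    replacement.add(
                        new Piece(p.inAddBuffer, p.offset + cut, p.length - cut));
                }
                pieces.remove(i);
                pieces.addAll(i, replacement);
                return;
            }
            start += p.length;
        }
        pieces.add(fresh); // only reached when the table was empty
    }

    String text() {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            CharSequence src = p.inAddBuffer ? added : original;
            sb.append(src, p.offset, p.offset + p.length);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        EditablePieceTable table = new EditablePieceTable("Hello world");
        table.insert(5, ",");
        table.insert(12, "!");
        System.out.println(table.text());
    }
}
```

Each insert costs a few small Piece objects plus the inserted characters, no
matter how large the document is, which is why the structure scales better
than rebuilding one big string on every edit.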

Now, with regards to unicode and non-unicode text, Microsoft Word does
something different to most other word processors. Word uses the piece table
to be able to store unicode and non-unicode text in the same file. What
this means is that Word itself judges which sequences should be unicode, and
which sequences should use a single-byte code page (usually CP1252). It uses
some internal algorithm to decide this. However, if you are talking about
writing a Word document, you can set up your unicode / non-unicode sequence
however you wish, as long as the piece table is implemented correctly.
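
The mixed storage described above can be sketched as follows. This is an
illustration only: it assumes that "unicode" pieces are UTF-16LE (two bytes
per character) and that non-unicode pieces are windows-1252 (one byte per
character), and the class, field and method names are hypothetical rather
than taken from HWPF or from the Word file format.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: each piece records its own encoding.
final class MixedEncodingPieces {
    // Assumed encodings; see the note above.
    static final Charset SINGLE_BYTE = Charset.forName("windows-1252");
    static final Charset UNICODE = StandardCharsets.UTF_16LE;

    static final class Piece {
        final int byteOffset;  // where the run starts in the text stream
        final int charCount;   // length of the run in characters
        final boolean unicode; // true: 2 bytes per char, false: 1 byte

        Piece(int byteOffset, int charCount, boolean unicode) {
            this.byteOffset = byteOffset;
            this.charCount = charCount;
            this.unicode = unicode;
        }
    }

    private final ByteArrayOutputStream stream = new ByteArrayOutputStream();
    private final List<Piece> pieces = new ArrayList<>();

    /** Writer side: the caller picks the encoding for each run. */
    void append(String text, boolean unicode) {
        byte[] bytes = text.getBytes(unicode ? UNICODE : SINGLE_BYTE);
        pieces.add(new Piece(stream.size(), text.length(), unicode));
        stream.write(bytes, 0, bytes.length);
    }

    int byteSize() {
        return stream.size();
    }

    /** Reader side: decode every piece with the encoding it declares. */
    String decode() {
        byte[] data = stream.toByteArray();
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            int byteLength = p.unicode ? p.charCount * 2 : p.charCount;
            Charset cs = p.unicode ? UNICODE : SINGLE_BYTE;
            sb.append(new String(data, p.byteOffset, byteLength, cs));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        MixedEncodingPieces doc = new MixedEncodingPieces();
        doc.append("Hello ", false);             // 6 bytes, single-byte run
        doc.append("\u043c\u0438\u0440", true);  // 6 bytes, Cyrillic run
        System.out.println(doc.byteSize() + " bytes, "
            + doc.decode().length() + " characters");
    }
}
```

A reader that ignores the per-piece flag and decodes the whole stream with
one charset would garble one kind of run or the other, which is why the
piece table has to be right before the text can be trusted.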

A "complex file" aka a "fast saved file" actually dumps the piece table
directly to disk. In this case, there are unicode and non-unicode sections
mixed up in the file, as well as text stored on disk which isn't actually part
of the document's logical text stream.

A simple (non-complex non-fast saved) Word file will have only a single
character set, usually all unicode in the text stream.

Hope this helps.

Regards
-- Kais

