Posted to dev@poi.apache.org by Ryan Ackley <sa...@cfl.rr.com> on 2003/07/16 03:10:03 UTC

I have an idea

What would everyone else think of HWPFDocument implementing
javax.swing.text.Document. This would facilitate easy conversion between
RTF, HTML, DOC and other implementations for different formats that are out
there.

thoughts?

Re: I have an idea

Posted by Ryan Ackley <sa...@cfl.rr.com>.

> 3. Low level XML-transform (generator/serializer) closely coupled to the
> format

I don't know if this would really be meaningful for Word. There are a lot of
twists and turns in the Word format. I think it's better to tell people to
use the low-level API.

IMHO, I think if we standardize on the OOo format for high level we'll be
safe because they have the same requirement as we do, which is 100%
MS-Office Compatibility. I don't know if Gnumeric had this? I can see in the
future that they may pile other junk on top of this, but I believe that the
core requirement will be there until they put Office out of business (the
year 2308).

Ryan

Re: fileformats.apache.org (was Re: I have an idea)

Posted by "Andrew C. Oliver" <ac...@apache.org>.

On 7/23/03 3:49 PM, "Ryan Ackley" <sa...@cfl.rr.com> wrote:

> What I had in mind when Andy came up with the
> idea was a central community for all libraries that read and/or write
> file formats. Kind of like http://wotsit.org on crack. So if someone had a
> requirement to read/write some format they could just go to
> formats.apache.org and see if a library was available. If they couldn't find
> it we would have a whole community available to help them write their own.
>

Exactly!  Of course we're not very helpful for people who don't want to
share code.  This is an open source community, not slave labor ;-)

> At work, one of my niches is file formats. I am a programmer on a document
> management/workflow system. We always get new file formats that need their
> text indexed for searching. Right now I have to go scour the web,
> SourceForge, groups, and try to find a) libraries or b) specs to program
> myself. Well, it has happened at least once that I have written my own when
> there was something else out there already, just because I didn't stumble
> across it (e.g. Excel). From my experience there are hardly any open source
> Java APIs for reading file formats other than the big three: Word, Excel,
> PDF. For example, right now at work, I am writing Java code to read text
> from a DWG file (AutoCAD) based on the spec at http://www.opendwg.org.
>

That's a cool idea, specs and doco as part of the project even when we don't
have a lib.  We'll of course have to be careful that it's all legit, but
hey...that's with anything ...

(I smell wiki pages)

> I have a vision of a community of programmers frolicking through fields of
> hex dumps and hard-to-understand format specs :-) I think one day it could
> be as important to the OS community as the Apache webserver. It would not
> only create a much needed community for fellow file format hackers to rally
> to, but also a one stop shop for file format APIs.
> 
> btw, what is CDF?

Common Document Format.  The idea is 1 API, 1 XML format that targets OOo,
Word, Lotus, whatever.  So in your document management system you'd work
with one file format.  Someone says "okay, give me that in Word"; you
transform CDF to Word, they edit it and upload it, and you transform it back
to CDF and store the deltas.   (Just one use case)...  Robert was referring
back to an earlier email in the "I have an idea" thread.

-andy

> 
> Ryan
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: poi-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: poi-dev-help@jakarta.apache.org
> 

-- 
Andrew C. Oliver
http://www.superlinksoftware.com/poi.jsp
Custom enhancements and Commercial Implementation for Jakarta POI

http://jakarta.apache.org/poi
For Java and Excel, Got POI?

Re: fileformats.apache.org (was Re: I have an idea)

Posted by Ryan Ackley <sa...@cfl.rr.com>.

What I had in mind when Andy came up with the
idea was a central community for all libraries that read and/or write
file formats. Kind of like http://wotsit.org on crack. So if someone had a
requirement to read/write some format they could just go to
formats.apache.org and see if a library was available. If they couldn't find
it we would have a whole community available to help them write their own.

At work, one of my niches is file formats. I am a programmer on a document
management/workflow system. We always get new file formats that need their
text indexed for searching. Right now I have to go scour the web,
SourceForge, groups, and try to find a) libraries or b) specs to program
myself. Well, it has happened at least once that I have written my own when
there was something else out there already, just because I didn't stumble
across it (e.g. Excel). From my experience there are hardly any open source
Java APIs for reading file formats other than the big three: Word, Excel,
PDF. For example, right now at work, I am writing Java code to read text
from a DWG file (AutoCAD) based on the spec at http://www.opendwg.org.

I have a vision of a community of programmers frolicking through fields of
hex dumps and hard-to-understand format specs :-) I think one day it could
be as important to the OS community as the Apache webserver. It would not
only create a much needed community for fellow file format hackers to rally
to, but also a one stop shop for file format APIs.

btw, what is CDF?

Ryan

Re: fileformats.apache.org (was Re: I have an idea)

Posted by "Andrew C. Oliver" <ac...@apache.org>.

On 7/23/03 12:42 PM, "robert_weir@us.ibm.com" <ro...@us.ibm.com>
wrote:
> 
> Intriguing idea.  How broad do you see this being?  POI plus the long-lost
> sibling projects (like the Cocoon serializers)?  Other Apache projects
> (Batik?)  Concentrating on MS Office formats, or would this be  a place
> for other Java-based office-like parsers (OpenOffice, SmartSuite, Corel)
> to hang out?   Is there any synergy by having all these projects under one
> umbrella, such as internal reuse?  Obviously, we can share core code for
> CDF.  Anything else?  Can you paint a mental picture of what you would
> want this to look like in 18 months?
> 
> -Rob

Our bold vision for Batik or siblings kind of needs to be gentle.  Apache
politics are something of a quagmire.  Theoretically, the board would like
jakarta.apache.org to disappear into thin air.  They never want to create a
project based on a language again and want all the subprojects to become top
level.

The POI project has held together and grown through tight scope.  We're
loosening on this a little ATM with the TNEF issue.  TNEF seems like it
should be part of POI even though it's not OLE 2 Compound Document based.  I
suspect there will be other file formats that will come around and "smell"
like POI but not really fit right in.  That being said I want TNEF in here
and now, not after we talk organization politics so pragmatism outweighs the
issue.

I am of the feeling that file formats should be free.  Meaning if I author a
document in Lotus, I ought to be able to get at that data and munge it into
any format I like.  It's my data, give it to me now as I like it.  I think
most folks on this project are of that opinion and of the opinion that
twisting bits and wading through hex dumps in search of the golden nugget is
FUN!  Secondly, hey, we're all in it for the money at some point.  The POI
developers are the best developers I've ever had the privilege of working
with.

Batik and other technologies often tie the encoding too close to the target
encoding.  Meaning there is no separation between the XML parsing part and
the binary encoding part of Batik in some places.  That's a problem for many
reasons.  The biggest is repurposing and flexibility.

My vision for POI is that it should keep its tight scope.  POI should be
focused on OLE 2 Compound Document formats.  Nothing more and nothing less.
I see the XML stuff living here because that's where the people working on
it live.  (The XML stuff which is directly related to POI APIs)

My vision for fileformats.apache.org is that there are many other things and
formats that aren't OLE 2 Compound Document based that should also be free
and debugged.  These are *other productivity suites*, graphics formats, etc.
POI can't swallow non-OLE 2 CDF file formats: they don't fit with the rest
of our code base, the structures are different, and they look like bits
hanging off.  POI is part of fileformats.apache.org.  It's a model for other
projects.

CDF (fileformats.apache.org/cdf) would *use* POI and potentially XHSSF or
XHWPF and contain Java APIs for manipulating it, DTDs, Schemas, etc.  What's
more is that I'd like to see code in C, C#, Java, C++ (yuck), whatever.

What I'd kind of like to do for the ApacheCon is pull our ranks together.
Come up with kind of a mission and consensus on direction.  Show kind of a
game plan.  It's a great chance to not only tell people about what we're
doing at POI and why Open Source works so well for this, but also to rally
the troops and get them excited about a new effort.  It's also a chance for
us to all meet and talk shop, find whiteboards and see the whites of each
other's eyes.

In 18 months:

FileFormats.Apache.Org - founded, PMC, etc.

I POI - Focus on OLE 2 CDF file formats
  a. POIFS - memory mapping, random access support added
  b. HSSF - memory mapping, random access, tighter memory model, image
     support, formulas finished, graphing support finished, people whining
     for pivot tables (maybe me getting a client to fund that ;-) )...
     Details filling out, syntax candy
  c. HWPF - Reading and writing documents, memory mapping, random access...
  d. HPSF - Write support added
  e. XHSSF - Serializing XHSSF format to XLS and generating XHSSF from XLS
  f. ???  - Your commonly used OLE 2 CDF based file format here.
II TNEF - or possibly a mail encodings project if it's not big enough
III CDF
  a. APIs (Java, C??) for reading CDF format, writing CDF format
  b. HWPF plugin for reading DOC as CDF or writing to DOC with CDF
  c. OOo plugin...
IV CSSF - Common Spreadsheet Format
  a. APIs (Java, C) for reading Common Spreadsheet Format...
  b. HSSF plugin
  c. OOo plugin
  d. Gnumeric plugin
V Lotus???
VI image formats...
VII

I even question whether plugins need to be written in multiple languages to
support multiple APIs... The GNU Compiler for Java (GCJ) might let us write
one plugin in Java and have it plug in to multiple places.

The synergies will find themselves really...  I anticipate that the PMC will
have like all of the current POI committers and perhaps folks from projects
like Batik...  

I know we have a bit of a manpower issue on POI ATM.  I mean this project
isn't like other projects where any bozo behind an IDE and a Java compiler
can do it...  (I tend to try and encourage people otherwise)  It takes a
twisted kind of person..  Right now we attract maybe 1 person to be
particularly active every three months and have one burn out every six
months...  We're starting to get a new breed of folks who commit patches now
and again and folks who put in contrib modules. . . I think CDF and some of
that could appeal to a wider audience (because they can dream up XML tag
languages...bores the hell out of me but most people seem to like it) and we
can suck people into the depths of the project by refusing to do things when
they need them and saying "you do it"...  (Andy's dirty trick #1... ;-) )
Besides, as I land more paid work for existing people I think others will
come in.  (My personal goal is to get all of the core guys to where they can
afford to devote more time to it)

Anyhow, I haven't thought too much about this...  I'm more interested in
what everyone else has to say about it...  I can bla bla all day if you ask
the wrong questions ;-)

So what does everyone else think?

-Andy

fileformats.apache.org (was Re: I have an idea)

Posted by ro...@us.ibm.com.

>It might be a good idea to start approaching the board about the
>fileformats.apache.org idea...  However, as I understand it they're all a
>bit busy at the moment...  So maybe in a few weeks.

Intriguing idea.  How broad do you see this being?  POI plus the long-lost 
sibling projects (like the Cocoon serializers)?  Other Apache projects 
(Batik?)  Concentrating on MS Office formats, or would this be  a place 
for other Java-based office-like parsers (OpenOffice, SmartSuite, Corel) 
to hang out?   Is there any synergy by having all these projects under one 
umbrella, such as internal reuse?  Obviously, we can share core code for 
CDF.  Anything else?  Can you paint a mental picture of what you would 
want this to look like in 18 months?

-Rob

Re: I have an idea

Posted by "Andrew C. Oliver" <ac...@apache.org>.

Hi Robert, 

My preference is this:

1. Low level Java APIs (primarily for us)
2. High level Java APIs (for users)
3. Low level XML-transform (generator/serializer) closely coupled to the
format
4. XSLT <-> The Common Format

This approach doesn't preclude what you're talking about, I think it
actually enables it.

I've come to this after working with the HSSF Serializer for Cocoon where we
took a "Common Format -> low level format" approach.

I'd *like* to think that we could make the binary format irrelevant, but it's
not, because of the capability difference, granularity difference, etc.
It's the same old problem as with AWT.

Take a look at what we did with the HSSF Cocoon Serializer.

http://cvs.apache.org/viewcvs/cocoon-2.1/src/blocks/poi/

We used the Gnumeric XML format.  It seemed like a good approach.  There was
no OpenOffice.org XML format at the time we started, and by the time we
finished the OOo format was still very, very fluid (prior to 1.0).

Unfortunately, the format didn't exactly match Excel's capabilities.
Gnumeric can do things Excel can't.  It does styles in a completely
different way that is not easy to match to Excel's and Excel can do things
Gnumeric can't.  Overall the Gnumeric way is an improvement on Excel in most
instances, but that actually makes things more problematic.  It makes things
rather lossy as well.  (especially with styling/formatting)

Now the application developer could still work with the Common Format...
There would just be ONE XSLT per format.

Quattro (I didn't know that was still around!!) -> Low Level -> QXML -> XSLT
-> TCF -> XSLT -> QXML -> Low Level -> Quattro

Excel -> Low Level -> HSSFXML -> XSLT -> TCF -> XSLT -> HSSFXML -> Low Level
-> Excel
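
One leg of the chains above (format-coupled XML -> XSLT -> TCF) might look roughly like this. It is only a sketch: the `<hssf>`, `<cell>`, `<tcf>` and `<value>` element names and the `CommonFormatPipeline` class are made up for illustration and are not an existing POI or Cocoon vocabulary. Only the `javax.xml.transform` calls are real JDK API.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class CommonFormatPipeline {
    private CommonFormatPipeline() {}

    // Hypothetical stylesheet: format-coupled <hssf> XML in, hypothetical <tcf> XML out.
    static final String HSSF_TO_TCF =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/hssf'>"
      + "<tcf><xsl:apply-templates select='cell'/></tcf>"
      + "</xsl:template>"
      + "<xsl:template match='cell'>"
      + "<value at='{@row},{@col}'><xsl:value-of select='.'/></value>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies one stylesheet to one XML string; there would be one such
    // stylesheet per direction per format.
    static String transform(String xml, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT step failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up format-coupled XML standing in for "HSSFXML".
        String hssfXml = "<hssf><cell row='0' col='0'>42</cell></hssf>";
        System.out.println(transform(hssfXml, HSSF_TO_TCF));
    }
}
```

The reverse leg (TCF -> XSLT -> HSSFXML) would be a second stylesheet passed to the same `transform` method.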

What I'm actually talking about is taking the low level "primitives" if you
will.  For HSSF these are called records:
http://cvs.apache.org/viewcvs/jakarta-poi/src/java/org/apache/poi/hssf/record/

And creating some kind of XML binding system for them.  We might even be
able to do this dynamically.  Thus get XML for free.  As the format evolves,
so does the XML capability.
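
A rough sketch of what "dynamic" binding could mean, using plain reflection over a record's getters. `DemoRowRecord` is a hypothetical stand-in and not one of the real HSSF record classes, and `RecordXmlBinder` is an illustrative name, so treat this as one possible approach under those assumptions rather than a design.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical low-level record; the real HSSF records are not reproduced here.
final class DemoRowRecord {
    public int getRowNumber() { return 3; }
    public int getHeight() { return 255; }
}

final class RecordXmlBinder {
    private RecordXmlBinder() {}

    // Emits one element per public no-arg getter declared on the record class.
    static String toXml(Object record) {
        Method[] methods = record.getClass().getDeclaredMethods();
        // Reflection order is unspecified, so sort by name for stable output.
        Arrays.sort(methods, Comparator.comparing(Method::getName));
        String tag = record.getClass().getSimpleName();
        StringBuilder xml = new StringBuilder("<").append(tag).append(">");
        for (Method m : methods) {
            boolean getter = m.getName().startsWith("get")
                    && m.getParameterCount() == 0
                    && Modifier.isPublic(m.getModifiers());
            if (!getter) {
                continue;
            }
            String name = m.getName().substring(3);
            try {
                xml.append("<").append(name).append(">")
                   .append(escape(String.valueOf(m.invoke(record))))
                   .append("</").append(name).append(">");
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read " + name, e);
            }
        }
        return xml.append("</").append(tag).append(">").toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(toXml(new DemoRowRecord()));
    }
}
```

Because the XML is derived from the class itself, adding a getter to a record would add an element without any extra mapping code, which is the "XML for free" idea.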

We are getting ahead of ourselves.  Regardless of approach, 1 and 2 have to
be done.

It might be a good idea to start approaching the board about the
fileformats.apache.org idea...  However, as I understand it they're all a
bit busy at the moment...  So maybe in a few weeks.

-Andy

On 7/21/03 11:22 AM, "robert_weir@us.ibm.com" <ro...@us.ibm.com>
wrote:

> Andy said:
> 
>>> To me the vocabulary of the XML is practically irrelevant provided the
>>> XML format is closely coupled with the binary format, you can always
>>> count on XSLT to make a transformation.
> 
> That's true and it is one way of doing it.  Another way is to have the XML
> format be independent of the underlying binary format.  That's pretty much
> what OpenOffice did with their formats.  They're not just record dumps of
> Office into XML.  They tried to make it be independent of any specific
> office suite.  So, in theory, the OpenOffice XML could come from Excel,
> OpenOffice, 123, Quattro Pro, or even be created on the fly from a web
> service without any real document.  I think there's great power in that.
> Instead of making the XML format irrelevant, it makes the binary format
> irrelevant.
> 
> In the end you probably have it both ways -- a lower-level API specific to
> a given binary format.  That is used directly for projects where
> performance is of primary importance.  Then, have a higher level project
> of XML readers and writers that adapt that API to some (hopefully)
> standards-based XML format.  The application developer would then work
> more at that level.
> 
> One step at a time...
> 
> -Rob
> 

Re: I have an idea

Posted by ro...@us.ibm.com.

Andy said:

>>To me the vocabulary of the XML is practically irrelevant provided the
>>XML format is closely coupled with the binary format, you can always
>>count on XSLT to make a transformation.

That's true and it is one way of doing it.  Another way is to have the XML 
format be independent of the underlying binary format.  That's pretty much 
what OpenOffice did with their formats.  They're not just record dumps of 
Office into XML.  They tried to make it be independent of any specific 
office suite.  So, in theory, the OpenOffice XML could come from Excel, 
OpenOffice, 123, Quattro Pro, or even be created on the fly from a web 
service without any real document.  I think there's great power in that. 
Instead of making the XML format irrelevant, it makes the binary format 
irrelevant.

In the end you probably have it both ways -- a lower-level API specific to 
a given binary format.  That is used directly for projects where 
performance is of primary importance.  Then, have a higher level project 
of XML readers and writers that adapt that API to some (hopefully)
standards-based XML format.  The application developer would then work 
more at that level.

One step at a time...

-Rob

Re: I have an idea

Posted by "Andrew C. Oliver" <ac...@apache.org>.

To me, we really are creating the lowest level required for all of this.
The API for manipulating the file format.

Re: I have an idea

Posted by ro...@us.ibm.com.

> I like the idea of finding an existing Document Object Model to use. We
> should all look around and see what's out there. My number one criterion
> for our DOM is intuitiveness. I would like to find one that fits our needs
> and is intuitive.

Intuitiveness is important.  We want users to be productive without too 
much unnecessary effort.  I think the right level of abstraction gets you 
90% of it.  Anything so the application developer can remain ignorant of 
all that gobbledygook that you and Praveen exchange:  "In the explanation 
given for 'sprmPlncLvl' it says The sprm is three bytes long and consists 
of the sprm code and a one byte two's complement value."  Ouch!  If users 
can remain as ignorant as I am of what that means, this project will be a 
success!
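
For a sense of what that quoted sentence means in code, here is a minimal sketch. It assumes a two-byte sprm code stored little-endian, followed by the one-byte operand. The byte order, the `SprmReader` name and the sample bytes are assumptions for illustration and are not taken from the Word documentation being quoted.

```java
final class SprmReader {
    private SprmReader() {}

    // Two-byte sprm code, read little-endian (assumption, see above).
    static int code(byte[] buf, int offset) {
        return (buf[offset] & 0xFF) | ((buf[offset + 1] & 0xFF) << 8);
    }

    // One-byte operand; Java's byte is already a signed two's complement value.
    static int operand(byte[] buf, int offset) {
        return buf[offset + 2];
    }

    public static void main(String[] args) {
        byte[] raw = {0x34, 0x12, (byte) 0xFE}; // made-up bytes, not a real sprm
        // prints "code=0x1234 operand=-2"
        System.out.printf("code=0x%04X operand=%d%n", code(raw, 0), operand(raw, 0));
    }
}
```

This is exactly the kind of detail a higher-level API would keep out of the application developer's way.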

Obviously, using standards, official or de facto is also a good thing. 
Existing things out there in this domain (rich text models) include:

1) HTML/XHTML DOM
2) XSL-FO
3) OpenOffice.org / OASIS Open Office XML
4) XML using some other vocabulary

Any others?

You take each one of these and weigh it against a few criteria:

1) Does it allow a clean separation of content and style?  Presumably Word 
is big on that and we don't want to lose that.
2) Is it expressive enough to represent the breadth of Word functionality 
that is important to us?
3) Is it easy to work with, lend itself to tooling, etc.?
4) Is it popular, widely adopted, etc., such that you might get some 
synergy with other projects?
5) Does it lend itself to a high-performance implementation?
6) Does it make the simple stuff simple while at the same time allowing 
more ambitious users to do more ambitious things?

HTML by itself fails 1) and 2).  Adding CSS stylesheets could remedy that, 
but it would still lack page-level features, like headers/footers, page 
numbers, or even hard or soft page breaks.

XSL-FO gives a lot more support, though it is rather complex. 

OpenOffice.org mixes XSL-FO with several other standard markups like MathML, 
SVG and XLink.  But it gets complex pretty quickly -- A simple "hello 
world" document generates XML with the following namespaces:

xmlns:office="http://openoffice.org/2000/office" 
xmlns:style="http://openoffice.org/2000/style" 
xmlns:text="http://openoffice.org/2000/text" 
xmlns:table="http://openoffice.org/2000/table" 
xmlns:draw="http://openoffice.org/2000/drawing" 
xmlns:fo="http://www.w3.org/1999/XSL/Format" 
xmlns:xlink="http://www.w3.org/1999/xlink" 
xmlns:number="http://openoffice.org/2000/datastyle" 
xmlns:svg="http://www.w3.org/2000/svg" 
xmlns:chart="http://openoffice.org/2000/chart" 
xmlns:dr3d="http://openoffice.org/2000/dr3d" 
xmlns:math="http://www.w3.org/1998/Math/MathML" 
xmlns:form="http://openoffice.org/2000/form" 
xmlns:script="http://openoffice.org/2000/script" 

But it is something of a moving target now that OASIS is drafting a format 
standard based on it.  But I think it will be an attractive and widely 
used format once it re-emerges as a standard.

I'm afraid I've raised more questions than I've answered ;-)  But in the 
end I really don't see anything out there that jumps out and says "I'm the 
API you want".

-Rob

Re: I have an idea

Posted by Ryan Ackley <sa...@cfl.rr.com>.

> Ryan, that reminds me of something I've been thinking of --  Is there
> already a good, standard API out there for manipulating abstract rich
> text?  Generally, tree/DOM-like structures are used to represent
> structured documents, like the Swing model interface you mentioned.  But
> we also have things like an HTML DOM, even an XML DOM based on the
> XSL-FO vocabulary.    Interesting thing about the XSL-FO stuff is it would
> give an immediate interop with Apache FOP to target output formats like
> PDF.  OpenOffice's text format is also a superset of XSL-FO.

I like the idea of finding an existing Document Object Model to use. We
should all look around and see what's out there. My number one criterion for
our DOM is intuitiveness. I would like to find one that fits our needs and
is intuitive.

It's an interesting idea to use an XML-like DOM. We should definitely take a
look at the OpenOffice format. I think if we do it right it shouldn't be
hard for someone to translate it into whatever format they choose: XSL-FO,
XML, HTML, etc.

HDF has its roots in XSL-FO. The original version takes a Word doc and
outputs it to XSL-FO. I don't know if XSL-FO has a future outside of FOP.
Haven't been hearing much about it lately. It is definitely very
non-intuitive. I also found that XSL-FO just can't represent some things
that can appear in a Word file. I want to see how OpenOffice overcame some
shortcomings.

Ryan

Re: I have an idea

Posted by ro...@us.ibm.com.

Ryan, that reminds me of something I've been thinking of --  Is there 
already a good, standard API out there for manipulating abstract rich 
text?  Generally, tree/DOM-like structures are used to represent 
structured documents, like the Swing model interface you mentioned.  But 
we also have things like an HTML DOM, even an XML DOM based on the 
XSL-FO vocabulary.    Interesting thing about the XSL-FO stuff is it would 
give an immediate interop with Apache FOP to target output formats like 
PDF.  OpenOffice's text format is also a superset of XSL-FO.

-Rob

Re: I have an idea

Posted by "Andrew C. Oliver" <ac...@apache.org>.

-1 - these should not be tied to GUI stuff and especially not Swing.  (AWT
is bad enough)  This makes the code less cross platform (remember some folks
are running this on AS/400s and UNIX boxes with no X libraries installed).
It would also make it very slow.

You can easily create a contrib or separate package which can decorate this.
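
A minimal sketch of that decorating approach follows. It assumes a hypothetical `TextSource` interface standing in for whatever the core, GUI-free API ends up exposing. `SwingDocumentAdapter` is likewise an illustrative name. Only `javax.swing.text.PlainDocument` and the `Document` methods used are real JDK API.

```java
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.PlainDocument;

// Hypothetical stand-in for the core document API; it has no Swing dependency.
interface TextSource {
    List<String> paragraphs();
}

// Would live in a contrib or separate package, so the core never imports javax.swing.
final class SwingDocumentAdapter {
    private SwingDocumentAdapter() {}

    // Copies each paragraph into a new Swing text model, one line per paragraph.
    static Document toSwingDocument(TextSource source) {
        PlainDocument doc = new PlainDocument();
        try {
            for (String paragraph : source.paragraphs()) {
                doc.insertString(doc.getLength(), paragraph + "\n", null);
            }
        } catch (BadLocationException e) {
            throw new IllegalStateException("Unexpected insert position", e);
        }
        return doc;
    }

    // Convenience helper that reads the whole model back as a String.
    static String textOf(Document doc) {
        try {
            return doc.getText(0, doc.getLength());
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TextSource source = () -> List.of("Hello", "World");
        System.out.print(textOf(toSwingDocument(source)));
    }
}
```

Keeping the adapter outside the core means only people who want the Swing model pay for the dependency.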

-Andy

On 7/15/03 9:10 PM, "Ryan Ackley" <sa...@cfl.rr.com> wrote:

> What would everyone else think of HWPFDocument implementing
> javax.swing.text.Document. This would facilitate easy conversion between
> RTF, HTML, DOC and other implementations for different formats that are out
> there.
> 
> thoughts?
> 
> 