Posted to dev@pdfbox.apache.org by Jeremias Maerki <de...@jeremias-maerki.ch> on 2008/08/08 12:17:40 UTC
Re: Proposed XMP handling (was: SVN structure in PDFBox)

Hmm, I've started that proposal locally but when it came to serializing
XMP I noticed that Adobe's XMP toolkit is pretty limited in that area:
they don't support serializing to SAX which is important to me in the
FOP context.

Now, I'm not sure if I should invest much more time in my proposal in
the direction I've started off. The proposal strongly borrows from JAXP:
I have a MetadataFactory (~TransformerFactory), a Metadata (~Transformer)
class abstracting the metadata container. The goal was to become
independent of the underlying data model and to provide two
implementations (XML Graphics Commons XMP support and Adobe XMP Toolkit).
I could still do that but I would not have a reason to abandon XGC's XMP
stuff in favor of Adobe's toolkit (which was one of my ideas). Since
Adobe is (in my experience) usually a black hole in terms of feedback
processing, I don't even want to try to help them improve their package.

I also noticed that with this proposal I wouldn't want to add support
for generic access to the underlying data model (at least at first). But
that would mean that you could only access values for which there are
accessors in the schema adapters. And each implementation has to provide
the actual adapter implementation when there's no generic access to the
underlying data model. Just an example of what I'm speaking about:

generic access in XGC:
        XMPProperty prop = meta.getProperty(XMPConstants.DUBLIN_CORE_NAMESPACE, "title");
        if (prop != null) {
            System.out.println("Title: " + prop.getValue());
        }
generic access in JempBox:
        XMPSchema schema = new XMPSchema(meta, "dc", "http://purl.org/dc/elements/1.1/");
        System.out.println(schema.getTextProperty("dc:title"));
schema adapter access in XGC:
        DublinCoreAdapter dc = DublinCoreSchema.getAdapter(meta);
        System.out.println("Title: " + dc.getTitle());
schema adapter access in JempBox:
        XMPSchemaDublinCore dc = new XMPSchemaDublinCore(meta);
        System.out.println("Title: " + dc.getTitle());
schema adapter access in my proposal:
        DublinCoreAdapter dc = DublinCoreSchema.INSTANCE.createAdapter(metaFactory, meta);
        System.out.println("Title: " + dc.getTitle());
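
To make the last variant more concrete, here is a minimal sketch of how
a factory could hand out implementation-specific adapters while the
container itself stays opaque. It is only an illustration under stated
assumptions: MetadataFactory, Metadata, DublinCoreAdapter,
DublinCoreSchema.INSTANCE and createAdapter(metaFactory, meta) are the
names used above; everything else (the generic createAdapter hook on the
factory, newMetadata, setTitle, MapMetadata, MapMetadataFactory) is
hypothetical, and a plain map stands in for XGC or Adobe's toolkit.

```java
import java.util.HashMap;
import java.util.Map;

// Opaque metadata container (~Transformer); deliberately no generic property access.
interface Metadata { }

// Typed accessors for one schema; setTitle is an assumption for this sketch.
interface DublinCoreAdapter {
    String getTitle();
    void setTitle(String title);
}

// Entry point (~TransformerFactory); each backend supplies its own adapters.
abstract class MetadataFactory {
    abstract Metadata newMetadata();

    // Hypothetical hook: returns this backend's implementation of the adapter type.
    abstract <T> T createAdapter(Class<T> adapterType, Metadata meta);
}

// Schema descriptor that delegates adapter creation to the chosen backend.
final class DublinCoreSchema {
    static final DublinCoreSchema INSTANCE = new DublinCoreSchema();
    static final String NAMESPACE = "http://purl.org/dc/elements/1.1/";

    private DublinCoreSchema() { }

    DublinCoreAdapter createAdapter(MetadataFactory factory, Metadata meta) {
        return factory.createAdapter(DublinCoreAdapter.class, meta);
    }
}

// Stand-in backend: a map takes the place of a real XMP data model.
final class MapMetadata implements Metadata {
    final Map<String, String> values = new HashMap<>();
}

final class MapMetadataFactory extends MetadataFactory {
    @Override
    Metadata newMetadata() {
        return new MapMetadata();
    }

    @Override
    <T> T createAdapter(Class<T> adapterType, Metadata meta) {
        // Only adapters this backend implements are reachable.
        if (!DublinCoreAdapter.class.equals(adapterType)) {
            throw new UnsupportedOperationException("No adapter for " + adapterType.getName());
        }
        final MapMetadata data = (MapMetadata) meta;
        final String key = DublinCoreSchema.NAMESPACE + "title";
        DublinCoreAdapter adapter = new DublinCoreAdapter() {
            @Override public String getTitle() { return data.values.get(key); }
            @Override public void setTitle(String title) { data.values.put(key, title); }
        };
        return adapterType.cast(adapter);
    }
}

final class AdapterSketch {
    public static void main(String[] args) {
        MetadataFactory metaFactory = new MapMetadataFactory();
        Metadata meta = metaFactory.newMetadata();
        DublinCoreAdapter dc = DublinCoreSchema.INSTANCE.createAdapter(metaFactory, meta);
        dc.setTitle("Example document");
        System.out.println("Title: " + dc.getTitle());
    }
}
```

The sketch also shows the cost described above: a value is reachable
only if the backend ships an adapter for it, so every additional backend
has to implement every adapter again.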

Basically, it doesn't make sense to keep two XMP implementations inside
the ASF. One should be selected and the other's additional features
merged into it. Therefore, if I no longer push Adobe's XMP toolkit, I
see only a single implementation of the XMP API (which is the core of
my proposal). And if there's only one implementation, it doesn't make
much sense to invest a lot of time in the additional infrastructure
needed to sustain multiple implementations.

Just to list again who is a potential user of an XMP package inside the
ASF (that I know of):
- Apache XML Graphics (FOP (today, read/write), Batik (future, read/write))
- Apache PDFBox (incubating, read/write)
- Apache Tika (incubating, read-only)
- Apache Sanselan (incubating, currently only exposing the raw stream, read/write)

I guess in the end, I just need a bit of feedback on what you think
would be the best route. So far I haven't gotten much. I'm still prepared
to make an effort here. But it feels wrong to me to simply build
something and throw it into Apache Commons, only to find that nobody
uses it because they can't identify with it. I would like to go down a
route most people (if not everybody) are comfortable with.

So, thanks in advance for any feedback!

On 06.08.2008 17:41:32 Niall Pemberton wrote:
> On Wed, Aug 6, 2008 at 3:17 PM, Jeremias Maerki <de...@jeremias-maerki.ch> wrote:
> > On 06.08.2008 14:45:23 Niall Pemberton wrote:
> >> On Tue, Aug 5, 2008 at 3:58 PM, Jeremias Maerki <de...@jeremias-maerki.ch> wrote:
> >> > The outline is ok. As an alternative ..incubator/pdfbox could be moved
> >> > to ..incubator/pdfbox/main if we don't merge all three.
> >> >
> >> > As noted some time ago, XML Graphics Commons (XGC) has an XMP facility
> >> > that's more or less equivalent to JempBox, but it's not yet available as
> >> > a separate JAR (with only XMP stuff). Gut feeling is that XGC is
> >> > slightly stronger than JempBox but I'm biased. OTOH, I believe that XGC
> >> > has a few things that PDFBox could also use in time (like the image
> >> > loader framework [1][2]). Anyway, one of my priorities, now that PDFBox
> >> > is set up, is to write a set of Metadata adapters which enable XMP
> >> > reading/writing against XGC's XMP stuff and Adobe's XMP library. If in
> >> > any way possible, I'd really like to consolidate XMP handling inside ASF
> >> > projects. I'll do that as a proposal with code. Whether this is accepted
> >> > by the various communities is a different story. I do envision an Apache
> >> > Commons component. Please stop me if this is a silly idea. Of course,
> >> > XGC could also simply be reused but the XMP classes might not be as
> >> > complete as Adobe's XMP toolkit.
> >> >
> >> > [1] http://xmlgraphics.apache.org/commons/image-loader.html
> >> > [2] Using the image loader framework, PDFBox could do things like
> >> > loading SVG and WMF images (if FOP and Batik are in the classpath) and
> >> > embed them in the PDF. Or it can easily support embedding Barcodes when
> >> > I've written image converters for Barcode4J. MathML support with JEuclid
> >> > etc. etc. This stuff is pretty powerful.
> >> >
> >> > Concerning the font stuff: FOP has extensive font code that is destined
> >> > to move to XGC (when I finally get my affairs together to start it).
> >> > Batik, too, has code to read TrueType fonts. So, some overlap with FOP
> >> > is already there. FontBox adds to that. That said, I'm not sure
> >> > consolidation is very easy since besides me I don't see many other
> >> > people who would help to push that. I have to choose my priorities
> >> > carefully. I'm stretched thin already.
> >> >
> >> > FontBox is sufficiently separate from the whole PDF topic that it makes
> >> > sense to keep it as a separate subproject. This encourages a clean
> >> > separation. If anyone can make use of FontBox outside the PDFBox context,
> >> > all the better.
> >> >
> >> > I'm currently leaning towards this: Consolidate XMP stuff from XGC and
> >> > JempBox into an Apache Commons component and discard JempBox. Or just
> >> > use XGC. Leave FontBox as subproject to PDFBox with a separate JAR as
> >> > dependency for PDFBox.
> >>
> >> If the desire is for JempBox to graduate to Apache Commons, then it
> >> would be best to raise this on the Commons dev list - the sooner the
> >> better. With my *commons* hat on, I imagine the main issues will be:
> >>  1) the JempBox committers being unknown by the commons team
> >>  2) Are the JempBox committers likely to stick around to continue to support it
> >>
> >> If we discuss the possibility of JempBox graduating to Commons early
> >> on with Commons devs and invite anyone interested to monitor
> >> incubation here then I think it will help when the time comes to ask
> >> Commons to accept JempBox as a new Commons component.
> >>
> >> Also if graduation to Commons is likely, then this is a reason to hold
> >> off on a package rename for JempBox.
> >
> > Hmm, I think I may have expressed myself badly. I apologize. What I was
> > proposing is to basically retire JempBox but use it as a reference for
> > what XMP schema adapters are needed for a metadata component that could
> > serve both PDFBox and the XML Graphics project (and potentially others
> > like Tika). I'd like to approach this in two layers:
> > 1. Underlying XMP data model (i.e. Adobe's XMP toolkit or XMLGraphics
> > Commons' XMP package).
> > 2. XMP namespace adapters (for Dublin Core, PDF/A etc. etc.) for which
> > I'd write implementations against both XMP toolkit and XGC's XMP stuff.
> >
> > I would write this off-line (within the next two weeks if possible) and
> > then present it as a proposal (at the risk of doing something in vain).
> > Mostly this is just rewriting some of the code in XGC I already have and
> > decoupling the two layers a bit. Nothing big but pretty useful and
> > versatile in the end, I believe.
> >
> > If it helps I can certainly write a proposal (without code) beforehand
> > for the Commons Wiki. I've also thought about just requesting a lab and
> > do it there. Feedback welcome.
> 
> OK, my mistake, sorry. Any Apache committer can have access to the
> Commons Sandbox, so a quick note to Commons Dev outlining the new
> component should be enough - if you decide to go that route.
> 
> Niall
> 
> 
> >> Niall
> >>
> >> > I'm eager to hear other opinions and ideas.
> >> >
> >> > On 05.08.2008 16:21:40 Jukka Zitting wrote:
> >> >> Hi,
> >> >>
> >> >> Just a quick outline of the SVN structure I came up with:
> >> >>
> >> >> * The main PDFBox codebase has its trunk,tags,branches structure right
> >> >> below https://svn.apache.org/repos/asf/incubator/pdfbox.
> >> >>
> >> >> * The FontBox codebase has a separate trunk,tags,branches structure
> >> >> below https://svn.apache.org/repos/asf/incubator/pdfbox/fontbox.
> >> >>
> >> >> * The JempBox codebase has a separate trunk,tags,branches structure
> >> >> below https://svn.apache.org/repos/asf/incubator/pdfbox/jempbox.
> >> >>
> >> >> Note that the FontBox and JempBox codebases still need to be cleaned
> >> >> for svn:eol-style settings, etc.
> >> >>
> >> >> Should we keep FontBox and JempBox as separate codebases or perhaps
> >> >> merge them into the main PDFBox codebase? In other words, are there
> >> >> many (potential) users for those projects outside PDFBox?
> >> >>
> >> >> If we keep FontBox and JempBox separate, then I guess we should also
> >> >> set up separate Jira projects for them and start planning for the
> >> >> respective initial org.apache.* releases.
> >> >>
> >> >> BR,
> >> >>
> >> >> Jukka Zitting
> >> >
> >> >
> >> >
> >> >
> >> > Jeremias Maerki
> >> >
> >> >
> >
> >
> >
> >
> > Jeremias Maerki
> >
> >




Jeremias Maerki