
Posted to general@xmlgraphics.apache.org by Jeremias Maerki <de...@jeremias-maerki.ch> on 2007/11/19 10:26:47 UTC

Metadata use by Apache Java projects

(I realize this is heavy cross-posting but it's probably the best way to
reach all the players I want to address.)

As you may know, I've started developing an XMP metadata package inside
XML Graphics Commons in order to support XMP metadata (and ultimately
PDF/A) in Apache FOP. Therefore, I have quite an interest in metadata.

What is XMP? XMP, for those who don't know about it, is based on a
subset of RDF to provide a flexible and extensible way of
storing/representing document metadata.

Yesterday, I was surprised to discover that Adobe has published an XMP
Toolkit with Java support under the BSD license. In contrast to my
effort, Adobe's toolkit is quite complete if maybe a bit more
complicated to use. That got me thinking:

Every project I'm sending this message to is using document metadata in
some form:
- Apache XML Graphics: embeds document metadata in the generated files
(just FOP at the moment, but Batik is a similar candidate)
- Tika (in incubation): has as one of its main purposes the extraction
of metadata
- Sanselan (in incubation): extracts and embeds metadata from/in bitmap
images
- PDFBox (incubation in discussion): extracts and embeds XMP metadata
from/in PDF files (see also JempBox)

Every one of these projects has its own means to represent metadata in
memory. Wouldn't it make sense to have a common approach? I've worked
with XMP for some time now and I can say it's ideal to work with. It
also defines guidelines to embed XMP metadata in various file formats.
It's also relatively easy to map metadata between different file formats
(Dublin Core, EXIF, PDF Info etc.).
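
To give an idea of what such a mapping looks like, here is a minimal sketch for the PDF Info dictionary. The property pairs follow the usual PDF/A-1 correspondence, but treat them as an assumption to verify against the XMP specification. The class and method names are made up for this example, and values are copied as plain strings (a real mapping would convert PDF dates to ISO 8601 and store dc:creator as an ordered array).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any of the libraries discussed here.
class PdfInfoToXmpMapping {

    /** PDF Info dictionary key -> qualified XMP property name (assumed pairs). */
    static final Map<String, String> MAPPING = new LinkedHashMap<>();
    static {
        MAPPING.put("Title", "dc:title");
        MAPPING.put("Author", "dc:creator");
        MAPPING.put("Subject", "dc:description");
        MAPPING.put("Keywords", "pdf:Keywords");
        MAPPING.put("Creator", "xmp:CreatorTool");
        MAPPING.put("Producer", "pdf:Producer");
        MAPPING.put("CreationDate", "xmp:CreateDate");
        MAPPING.put("ModDate", "xmp:ModifyDate");
    }

    /** Renames known Info keys to XMP property names; unknown keys are dropped. */
    static Map<String, String> toXmp(Map<String, String> pdfInfo) {
        Map<String, String> xmp = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : pdfInfo.entrySet()) {
            String target = MAPPING.get(e.getKey());
            if (target != null) {
                xmp.put(target, e.getValue());
            }
        }
        return xmp;
    }

    public static void main(String[] args) {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("Title", "Report");
        info.put("Trapped", "False"); // no entry in MAPPING, so it is skipped
        info.put("Producer", "Apache FOP");
        System.out.println(toXmp(info)); // {dc:title=Report, pdf:Producer=Apache FOP}
    }
}
```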

Sanselan and Tika have both chosen a very simple approach but is it
versatile enough for the future? While the simple Map<String, String[]> in
Tika allows for multiple authors, for example, it doesn't support
language alternatives for things such as dc:title or dc:description.
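
A minimal sketch of that limitation, using only plain Java collections. This is not Tika's or Adobe's API: the class name, the pick() helper and the sample values are made up, and the "x-default" key follows the XMP convention for the default entry of a language alternative.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not code from any of the projects mentioned.
class LangAltSketch {

    /** Returns the text for the requested language, falling back to x-default. */
    static String pick(Map<String, String> langAlt, String lang) {
        String value = langAlt.get(lang);
        return value != null ? value : langAlt.get("x-default");
    }

    public static void main(String[] args) {
        // Flat model: several values per key work fine for multiple authors.
        Map<String, String[]> flat = new HashMap<>();
        flat.put("dc:creator", new String[] {"Alice", "Bob"});
        // Two titles can be stored as well, but nothing records which
        // language each one is in.
        flat.put("dc:title", new String[] {"Annual Report", "Jahresbericht"});

        // Language alternative: each value is keyed by its language tag.
        Map<String, String> title = new LinkedHashMap<>();
        title.put("x-default", "Annual Report");
        title.put("en", "Annual Report");
        title.put("de", "Jahresbericht");

        System.out.println(pick(title, "de")); // Jahresbericht
        System.out.println(pick(title, "fr")); // Annual Report (fallback)
    }
}
```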

I'm seriously thinking about abandoning most of my XMP package work in
XML Graphics Commons in favor of Adobe's XMP Toolkit. What it doesn't
support, though:
- Metadata merging functionality (which I need for synchronizing the PDF
Info object and the XMP packet for PDF/A)
- Schema-specific adapters (for Dublin Core and many other XMP Schemas) for
easier programming (which both Ben and I have written for JempBox and
XML Graphics Commons). Adobe's toolkit only allows generic access.
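
The two missing pieces can be sketched roughly as follows. Everything here is hypothetical: GenericMetadata stands in for a generic property store and is far simpler than real XMP, DublinCoreAdapter only shows the idea of typed accessors layered on top of generic access, and mergeMissingFrom() is just one possible merge policy (existing properties win). None of this is the JempBox, XML Graphics Commons or Adobe API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a schema adapter and a simple merge.
class AdapterSketch {

    /** Generic store: qualified property name -> list of values. */
    static class GenericMetadata {
        final Map<String, List<String>> props = new LinkedHashMap<>();

        void add(String name, String value) {
            props.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        List<String> get(String name) {
            List<String> values = props.get(name);
            return values != null ? values : new ArrayList<String>();
        }

        /** Copies properties that are missing here; existing ones are kept. */
        void mergeMissingFrom(GenericMetadata other) {
            for (Map.Entry<String, List<String>> e : other.props.entrySet()) {
                props.putIfAbsent(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
    }

    /** Schema-specific view: callers never spell out "dc:creator" themselves. */
    static class DublinCoreAdapter {
        private final GenericMetadata meta;

        DublinCoreAdapter(GenericMetadata meta) {
            this.meta = meta;
        }

        void addCreator(String name) {
            meta.add("dc:creator", name);
        }

        List<String> getCreators() {
            return meta.get("dc:creator");
        }
    }

    public static void main(String[] args) {
        GenericMetadata xmp = new GenericMetadata();
        DublinCoreAdapter dc = new DublinCoreAdapter(xmp);
        dc.addCreator("Jeremias");

        // Stand-in for values derived from the PDF Info object.
        GenericMetadata fromInfo = new GenericMetadata();
        fromInfo.add("dc:creator", "Someone Else");
        fromInfo.add("pdf:Producer", "Apache FOP");

        xmp.mergeMissingFrom(fromInfo);
        System.out.println(dc.getCreators());        // [Jeremias]
        System.out.println(xmp.get("pdf:Producer")); // [Apache FOP]
    }
}
```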

Some links:
Adobe XMP website: http://www.adobe.com/products/xmp/
Adobe XMP Toolkit: http://www.adobe.com/devnet/xmp/
JempBox: http://sourceforge.net/projects/jempbox
Apache XML Graphics Commons:
http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/xmp/

My questions:
- Any interest in converging on a unified model/approach?
- If yes, where shall we develop this? As part of Tika (although it's
still in incubation)? As a separate project (maybe as an Apache Commons
subproject)? If more than XML Graphics uses this, XML Graphics is
probably not the right home.
- Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
the JempBox or XML Graphics Commons approach more interesting?
- Where's the best place to discuss this? We can't keep posting to
several mailing lists.

At any rate, I would volunteer to spearhead this effort, especially
since I have an immediate need for complete XMP functionality. I've
almost finished mapping all XMP structures in XG Commons but I haven't
committed my latest changes (for structured properties) and I may still
not cover all details of XMP.

Thanks for reading this far,
Jeremias Maerki

---------------------------------------------------------------------
Apache XML Graphics Project URL: http://xmlgraphics.apache.org/
To unsubscribe, e-mail: general-unsubscribe@xmlgraphics.apache.org
For additional commands, e-mail: general-help@xmlgraphics.apache.org

Re: Metadata use by Apache Java projects

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.

On 20.11.2007 04:39:08 Charles Matthew Chen wrote:
> Hi Jeremias & Antoine,
> 
>    Antoine, it looks like you found it pretty easy to convert
> Sanselan's metadata into XMP format.
> 
>    Jeremias, it sounds like you are considering a new project which can
> translate data from many formats (read by a variety of projects) into
> XMP.  That sounds great!

Not really. I'm proposing a unified storage/access model for metadata in
Apache projects. The necessary mapping needs to be done in the individual
projects, which have the knowledge about all the different document formats.

>    Sanselan could not use XMP internally to represent metadata,
> though.  Sanselan's goal is to read & write metadata (such as EXIF
> metadata) preserving not just tag values but directory structure,
> field order, field location, etc.  I'm in the process of refactoring
> the metadata data structures at the moment, actually, in order to
> approach binary compatibility as closely as possible.

I'm not suggesting you abandon that. That would be a bad idea. But XMP
could be the generic format for reading/writing image metadata for all
image formats Sanselan supports. Look at ImageIO: it has a generic
metadata format and native metadata formats for all the individual image
formats. At the moment, Sanselan only has the native metadata although
with a somewhat common storage model (Metadata.Item).
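
For those who haven't looked at that part of ImageIO, here is a small sketch of the two-level model, using only the standard javax.imageio API. The class and method names are made up for the example, and it assumes the JRE's built-in PNG writer plug-in is available.

```java
import java.awt.image.BufferedImage;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageTypeSpecifier;
import javax.imageio.ImageWriter;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.metadata.IIOMetadataFormatImpl;
import org.w3c.dom.Node;

// Hypothetical demo class; only the javax.imageio calls are real API.
class ImageIoFormatsSketch {

    /** Asks the PNG writer for its default metadata for an RGB image. */
    static IIOMetadata defaultPngMetadata() {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("png");
        ImageWriter writer = writers.next();
        ImageTypeSpecifier type =
                ImageTypeSpecifier.createFromBufferedImageType(BufferedImage.TYPE_INT_RGB);
        IIOMetadata metadata = writer.getDefaultImageMetadata(type, null);
        writer.dispose();
        return metadata;
    }

    public static void main(String[] args) {
        IIOMetadata metadata = defaultPngMetadata();

        // Native format: specific to the image format (here PNG).
        System.out.println("native:  " + metadata.getNativeMetadataFormatName());

        // Generic ("standard") format: the same tree shape for every plug-in.
        if (metadata.isStandardMetadataFormatSupported()) {
            Node root = metadata.getAsTree(IIOMetadataFormatImpl.standardMetadataFormatName);
            System.out.println("generic: " + root.getNodeName());
        }
    }
}
```

In this analogy, XMP would play the role of the generic tree, while Sanselan's format-specific structures remain the native ones.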

<snip/>

Jeremias Maerki

Re: Metadata use by Apache Java projects

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.

On 20.11.2007 09:51:12 Philipp Koch wrote:
<snip/>
> > But I'm not sure Tika could cover all this translation functionality for all the projects
> > using metadata. That's something the individual document format
> > libraries will be much better at. Tika is more of an aggregator.
> well, i am not sure if we can ever make sure that ALL "individual
> document format libraries" will ever support such a translation
> functionality. so having something (like tika (currently only for
> reading)) in between would definitely make sense to me.

Right, not every library will have the metadata handling that is needed,
so Tika will certainly need some code of its own to do metadata
translation. Looks like we're on the wrong mailing list by now. :-)

<snip/>

Jeremias Maerki

Re: Metadata use by Apache Java projects

Posted by Philipp Koch <ph...@day.com>.

> Philipp, I'm not talking about just reading meta data, but also writing
> it.
ok, i understand ;-). having a uniform way to access/write meta data
is indeed something worth thinking about - you are right! i have the
"digital asset management" use case in mind (that i currently develop)
which currently handles the meta data stuff for most of the formats
individually...

> Tika is a metadata extraction kit. I'm talking about something more general. If
> the common metadata storage model, if we can agree on one, at the end
> becomes a subproject/subproduct of Tika, I'm cool.
yes, this sounds interesting.

> But I'm not sure Tika could cover all this translation functionality for all the projects
> using metadata. That's something the individual document format
> libraries will be much better at. Tika is more of an aggregator.
well, i am not sure if we can ever make sure that ALL "individual
document format libraries" will ever support such a translation
functionality. so having something (like tika (currently only for
reading)) in between would definitely make sense to me.

regards,
philipp


On 11/20/07, Jeremias Maerki <de...@jeremias-maerki.ch> wrote:
> On 20.11.2007 08:24:01 Philipp Koch wrote:
> > >    Jeremias, it sounds like you are considering a new project which can
> > > translate data from many formats (read by a variety of projects) into
> > > XMP.  That sounds great!
> > hmm, i am not sure if (yet) another  new project should be set up for
> > this since the tika project already offers all the "infrastructure" to
> > read meta data from various formats. from my point of view, the tika
> > project should offer some kind of "meta data to xmp" translator.
>
> Philipp, I'm not talking about just reading metadata, but also writing
> it. Sanselan supports creating new TIFF, JPEG etc. files. FOP creates
> new PDF, SVG etc. files. These processes all need metadata. Tika is a
> metadata extraction kit. I'm talking about something more general. If
> the common metadata storage model, if we can agree on one, at the end
> becomes a subproject/subproduct of Tika, I'm cool. But I'm not sure Tika
> could cover all this translation functionality for all the projects
> using metadata. That's something the individual document format
> libraries will be much better at. Tika is more of an aggregator.
>
> > >    Sanselan could not use XMP internally to represent metadata,
> > > though.  Sanselan's goal is to read & write metadata (such as EXIF
> > > metadata) preserving not just tag values but directory structure,
> > > field order, field location, etc.
> > this makes sense to me, since i have only seen embedded xmp in adobe's
> > products that are using the pdf "file format" to store its data
> > (acrobat and illustrator at least)
>
> Sure, the adoption of XMP is somewhat limited. But I've worked with it
> for some time now and I've experienced the benefit. Our adopting it
> could actually improve acceptance elsewhere.
>
> <snip/>
>
> Jeremias Maerki
>
>

Re: Metadata use by Apache Java projects

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.

On 20.11.2007 08:24:01 Philipp Koch wrote:
> >    Jeremias, it sounds like you are considering a new project which can
> > translate data from many formats (read by a variety of projects) into
> > XMP.  That sounds great!
> hmm, i am not sure if (yet) another  new project should be set up for
> this since the tika project already offers all the "infrastructure" to
> read meta data from various formats. from my point of view, the tika
> project should offer some kind of "meta data to xmp" translator.

Philipp, I'm not talking about just reading metadata, but also writing
it. Sanselan supports creating new TIFF, JPEG etc. files. FOP creates
new PDF, SVG etc. files. These processes all need metadata. Tika is a
metadata extraction kit. I'm talking about something more general. If
the common metadata storage model, if we can agree on one, at the end
becomes a subproject/subproduct of Tika, I'm cool. But I'm not sure Tika
could cover all this translation functionality for all the projects
using metadata. That's something the individual document format
libraries will be much better at. Tika is more of an aggregator.

> >    Sanselan could not use XMP internally to represent metadata,
> > though.  Sanselan's goal is to read & write metadata (such as EXIF
> > metadata) preserving not just tag values but directory structure,
> > field order, field location, etc.
> this makes sense to me, since i have only seen embedded xmp in adobe's
> products that are using the pdf "file format" to store its data
> (acrobat and illustrator at least)

Sure, the adoption of XMP is somewhat limited. But I've worked with it
for some time now and I've experienced the benefit. Our adopting it
could actually improve acceptance elsewhere.

<snip/>

Jeremias Maerki

Re: Metadata use by Apache Java projects

Posted by Philipp Koch <ph...@day.com>.

>    Jeremias, it sounds like you are considering a new project which can
> translate data from many formats (read by a variety of projects) into
> XMP.  That sounds great!
hmm, i am not sure if (yet) another  new project should be set up for
this since the tika project already offers all the "infrastructure" to
read meta data from various formats. from my point of view, the tika
project should offer some kind of "meta data to xmp" translator.

>    Sanselan could not use XMP internally to represent metadata,
> though.  Sanselan's goal is to read & write metadata (such as EXIF
> metadata) preserving not just tag values but directory structure,
> field order, field location, etc.
this makes sense to me, since i have only seen embedded xmp in adobe's
products that are using the pdf "file format" to store its data
(acrobat and illustrator at least)

regards,
philipp

On 11/20/07, Charles Matthew Chen <ch...@gmail.com> wrote:
> Hi Jeremias & Antoine,
>
>    Antoine, it looks like you found it pretty easy to convert
> Sanselan's metadata into XMP format.
>
>    Jeremias, it sounds like you are considering a new project which can
> translate data from many formats (read by a variety of projects) into
> XMP.  That sounds great!
>
>    Sanselan could not use XMP internally to represent metadata,
> though.  Sanselan's goal is to read & write metadata (such as EXIF
> metadata) preserving not just tag values but directory structure,
> field order, field location, etc.  I'm in the process of refactoring
> the metadata data structures at the moment, actually, in order to
> approach binary compatibility as closely as possible.
>
> Charles.
>
>
>
>
> On Nov 19, 2007 8:57 AM, Antoine Moreau de Bellaing <am...@enst.fr> wrote:
> > Thank you for your advice....
> > This class might (perhaps) help in converting EXIF metadata into XMP
> > by using Adobe's Toolkit
> >
> >
> > Regards,
> > Antoine Moreau de Bellaing.
> >
> > import java.io.File;
> > import java.io.IOException;
> > import java.util.Vector;
> >
> > import org.cmc.sanselan.ImageReadException;
> > import org.cmc.sanselan.Sanselan;
> > import org.cmc.sanselan.common.IImageMetadata;
> > import org.cmc.sanselan.formats.jpeg.JpegImageMetadata;
> > import org.cmc.sanselan.formats.tiff.TiffDirectory;
> > import org.cmc.sanselan.formats.tiff.TiffField;
> > import org.cmc.sanselan.formats.tiff.TiffImageMetadata;
> >
> > import com.adobe.xmp.XMPConst;
> > import com.adobe.xmp.XMPException;
> > import com.adobe.xmp.XMPMeta;
> > import com.adobe.xmp.XMPMetaFactory;
> >
> > public class XMPMetadataExample
> > {
> >         public static void metadataExample(File file)
> >                         throws ImageReadException, IOException, XMPException
> >         {
> >                 IImageMetadata metadata = Sanselan.getMetadata(file);
> >
> >
> >                 if (metadata instanceof JpegImageMetadata)
> >                 {
> >                         JpegImageMetadata jpegMetadata = (JpegImageMetadata) metadata;
> >                         XMPMeta meta = xmpMeta(jpegMetadata);
> >                         System.out.println(XMPMetaFactory.serializeToString(meta, null));
> >
> >                 }
> >         }
> >
> >         private static XMPMeta xmpMeta(JpegImageMetadata jpegMetadata)
> >                         throws ImageReadException, IOException, XMPException
> >         {
> >                 XMPMeta meta = XMPMetaFactory.create();
> >                 Vector dirs = jpegMetadata.getExif().getDirectories();
> >                 for (int i = 0; i < dirs.size(); i++)
> >                 {
> >                         TiffImageMetadata.Directory dir = (TiffImageMetadata.Directory) dirs
> >                                         .get(i);
> >
> >                         Vector items = dir.getItems();
> >                         for (int j = 0; j < items.size(); j++)
> >                         {
> >                                 Object item = items.get(j);
> >                                 TiffImageMetadata.Item tiffItem = (TiffImageMetadata.Item) item;
> >                                 TiffField field = tiffItem.getTiffField();
> >                                 // Only directories with a known XMP namespace are mapped
> >                                 String ns = namespace(dir.type);
> >                                 if (ns != null)
> >                                 {
> >                                         meta.setProperty(ns, field.getTagName(), field.getValueDescription());
> >                                 }
> >
> >                         }
> >                 }
> >                 return meta;
> >         }
> >
> >
> >         public static final String namespace(int type)
> >         {
> >                 switch (type)
> >                 {
> >                         case TiffDirectory.DIRECTORY_TYPE_UNKNOWN :
> >                                 return null;
> >                         case TiffDirectory.DIRECTORY_TYPE_ROOT :
> >                                 return XMPConst.NS_TIFF;
> >                         case TiffDirectory.DIRECTORY_TYPE_SUB :
> >                                 return null;
> >                         case TiffDirectory.DIRECTORY_TYPE_THUMBNAIL :
> >                                 return null;
> >                         case TiffDirectory.DIRECTORY_TYPE_EXIF :
> >                                 return XMPConst.NS_EXIF;
> >                         case TiffDirectory.DIRECTORY_TYPE_GPS :
> >                                 return null;
> >                         case TiffDirectory.DIRECTORY_TYPE_INTEROPERABILITY :
> >                                 return null;
> >                         default :
> >                                 return null;
> >                 }
> >         }
> > }
> >
> > Le 19 nov. 07 à 12:00, Jeremias Maerki a écrit :
> >
> >
> > > Cool, this proves my point that XMP is useful. ;-)
> > >
> > > AFAIK, JPEG metadata is usually not embedded as XMP but as EXIF/IPTC
> > > data. In this case, the EXIF and IPTC chunks would have to be
> > > converted
> > > into the XMP representation. I guess that's what Adobe's Bridge does.
> > > That's exactly what would need to be done if my proposal would be
> > > implemented.
> > >
> > > So, if you want to do it now (i.e. before we've reached a conclusion)
> > > you'll have to extract every single value from the metadata directory
> > > and put it into the structure exposed by Adobe's XMP Toolkit. To get
> > > the
> > > individual values, see:
> > > https://svn.apache.org/repos/asf/incubator/sanselan/trunk/src/main/java/org/cmc/sanselan/sampleUsage/MetadataExample.java
> > >
> > > The right mappings are easily found in the XMP specification.
> > >
> > > Jeremias Maerki
> > >
> > >
> > >
> > > On 19.11.2007 11:43:48 Antoine Moreau de Bellaing wrote:
> > >> Hello.
> > >> I'm looking for a way to connect the  Adobe XMP Toolkit to Sanselan.
> > >> Especially with JPEG.
> > >>
> > >> I'm really a newbie, so I apologize if my response doesn't make sense to
> > >> you all...
> > >>
> > >> Here's an output of Sanselan
> > >> TiffImageMetadata.toString()
> > >>              Root:
> > >>                      Make: 'Canon'
> > >>                      Model: 'Canon EOS 350D DIGITAL'
> > >>                      Orientation: 1
> > >>                      XResolution: 72
> > >>                      YResolution: 72
> > >>                      ResolutionUnit: 2
> > >>                      DateTime: 2007-10-06T16:47:56.000+0200
> > >>                      WhitePoint: 313/1000, 329/1000
> > >>                      PrimaryChromaticities: 64/100, 33/100, 21/100, 71/100, 15/100,
> > >> 6/100
> > >>                      YCbCrCoefficients: 299/1000, 587/1000, 114/1000
> > >>                      YCbCrPositioning: 2
> > >>                      Exif_IFD_Pointer: 320
> > >>
> > >>              Exif:
> > >>                      ExposureTime: 1/60
> > >>                      FNumber: 5
> > >>                      ExposureProgram: 0
> > >>                      ISOSpeedRatings: 400
> > >>                      ExifVersion: 48, 50, 50, 49
> > >>                      DateTimeOriginal: 2007-10-06T16:47:56.000+0200
> > >>                      DateTimeDigitized: 2007-10-06T16:47:56.000+0200
> > >>                      ComponentsConfiguration: 1, 2, 3, 0
> > >>                      ShutterSpeedValue: 387114/65536
> > >>                      ApertureValue: 304340/65536
> > >>                      ExposureBiasValue: 0
> > >>                      MeteringMode: 1
> > >>                      Flash: 16
> > >>                      FocalLength: 41
> > >>                      MakerNote: 24, 0, 1, 0, 3, 0, 46, 0, 0, 0, 34, 4, 0, 0, 2, 0, 3, 0, 4, 0, 0, 0, 126, 4, 0, 0, 3, 0, 3, 0, 4, 0, 0, 0, -122, 4, 0, 0, 4, 0, 3, 0, 34, 0, 0, 0, -114, 4, 0, 0, 6... (8340)
> > >>                      UserComment: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > >> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> > >> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... (264)
> > >>                      FlashpixVersion: 48, 49, 48, 48
> > >>                      ColorSpace: 65535
> > >>                      PixelXDimension: 3456
> > >>                      PixelYDimension: 2304
> > >>                      Interoperability_IFD_Pointer: 9366
> > >>                      FocalPlaneXResolution: 3456000/874
> > >>                      FocalPlaneYResolution: 2304000/582
> > >>                      FocalPlaneResolutionUnit: 2
> > >>                      CustomRendered: 0
> > >>                      ExposureMode: 0
> > >>                      WhiteBalance: 0
> > >>                      SceneCaptureType: 0
> > >>                      Unknown: 22/10
> > >>
> > >>              Interoperability:
> > >>                      GPSLatitudeRef: 'R03'
> > >>                      GPSLatitude: 48, 49, 48, 48
> > >>
> > >>              Sub:
> > >>                      Compression: 6
> > >>                      XResolution: 72
> > >>                      YResolution: 72
> > >>                      ResolutionUnit: 2
> > >>                      JPEGInterchangeFormat: 9716
> > >>                      JPEGInterchangeFormatLength: 9176
> > >>
> > >> The same file parsed with Adobe Bridge produces this XMP file :
> > >>
> > >> <?xpacket begin="﻿" id="W5M0MpCehiHzreSzNTczkc9d"?>
> > >> <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.1-c037 46.282696, Mon Apr 02 2007 18:36:56        ">
> > >>    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
> > >>       <rdf:Description rdf:about=""
> > >>             xmlns:tiff="http://ns.adobe.com/tiff/1.0/">
> > >>          <tiff:Make>Canon</tiff:Make>
> > >>          <tiff:Model>Canon EOS 350D DIGITAL</tiff:Model>
> > >>          <tiff:Orientation>1</tiff:Orientation>
> > >>          <tiff:ImageWidth>3456</tiff:ImageWidth>
> > >>          <tiff:ImageLength>2304</tiff:ImageLength>
> > >>          <tiff:PhotometricInterpretation>2</tiff:PhotometricInterpretation>
> > >>          <tiff:SamplesPerPixel>3</tiff:SamplesPerPixel>
> > >>          <tiff:BitsPerSample>
> > >>             <rdf:Seq>
> > >>                <rdf:li>8</rdf:li>
> > >>                <rdf:li>8</rdf:li>
> > >>                <rdf:li>8</rdf:li>
> > >>             </rdf:Seq>
> > >>          </tiff:BitsPerSample>
> > >>          <tiff:XResolution>72/1</tiff:XResolution>
> > >>          <tiff:YResolution>72/1</tiff:YResolution>
> > >>          <tiff:ResolutionUnit>2</tiff:ResolutionUnit>
> > >>       </rdf:Description>
> > >>       <rdf:Description rdf:about=""
> > >>             xmlns:exif="http://ns.adobe.com/exif/1.0/">
> > >>          <exif:ExifVersion>0221</exif:ExifVersion>
> > >>          <exif:ExposureTime>1/60</exif:ExposureTime>
> > >>          <exif:ShutterSpeedValue>5906891/1000000</exif:ShutterSpeedValue>
> > >>          <exif:FNumber>5/1</exif:FNumber>
> > >>          <exif:ApertureValue>4643856/1000000</exif:ApertureValue>
> > >>          <exif:ExposureProgram>0</exif:ExposureProgram>
> > >>          <exif:ISOSpeedRatings>
> > >>             <rdf:Seq>
> > >>                <rdf:li>400</rdf:li>
> > >>             </rdf:Seq>
> > >>          </exif:ISOSpeedRatings>
> > >>          <exif:DateTimeOriginal>2007-10-06T16:47:56+02:00</exif:DateTimeOriginal>
> > >>          <exif:DateTimeDigitized>2007-10-06T16:47:56+02:00</exif:DateTimeDigitized>
> > >>          <exif:ExposureBiasValue>0/2</exif:ExposureBiasValue>
> > >>          <exif:MeteringMode>1</exif:MeteringMode>
> > >>          <exif:Flash rdf:parseType="Resource">
> > >>             <exif:Fired>False</exif:Fired>
> > >>             <exif:Return>0</exif:Return>
> > >>             <exif:Mode>2</exif:Mode>
> > >>             <exif:Function>False</exif:Function>
> > >>             <exif:RedEyeMode>False</exif:RedEyeMode>
> > >>          </exif:Flash>
> > >>          <exif:FocalLength>41/1</exif:FocalLength>
> > >>          <exif:CustomRendered>0</exif:CustomRendered>
> > >>          <exif:ExposureMode>0</exif:ExposureMode>
> > >>          <exif:WhiteBalance>0</exif:WhiteBalance>
> > >>          <exif:SceneCaptureType>0</exif:SceneCaptureType>
> > >>          <exif:FocalPlaneXResolution>3456000/874</exif:FocalPlaneXResolution>
> > >>          <exif:FocalPlaneYResolution>2304000/582</exif:FocalPlaneYResolution>
> > >>          <exif:FocalPlaneResolutionUnit>2</exif:FocalPlaneResolutionUnit>
> > >>       </rdf:Description>
> > >>       <rdf:Description rdf:about=""
> > >>             xmlns:xap="http://ns.adobe.com/xap/1.0/">
> > >>          <xap:ModifyDate>2007-10-06T16:47:56+02:00</xap:ModifyDate>
> > >>       </rdf:Description>
> > >>       <rdf:Description rdf:about=""
> > >>             xmlns:dc="http://purl.org/dc/elements/1.1/">
> > >>          <dc:creator>
> > >>             <rdf:Seq>
> > >>                <rdf:li>antoine</rdf:li>
> > >>             </rdf:Seq>
> > >>          </dc:creator>
> > >>       </rdf:Description>
> > >>       <rdf:Description rdf:about=""
> > >>             xmlns:aux="http://ns.adobe.com/exif/1.0/aux/">
> > >>          <aux:SerialNumber>1330734959</aux:SerialNumber>
> > >>          <aux:LensInfo>18/1 55/1 0/0 0/0</aux:LensInfo>
> > >>          <aux:Lens>18.0-55.0 mm</aux:Lens>
> > >>          <aux:ImageNumber>160</aux:ImageNumber>
> > >>          <aux:FlashCompensation>0/1</aux:FlashCompensation>
> > >>          <aux:OwnerName>antoine</aux:OwnerName>
> > >>          <aux:Firmware>1.0.3</aux:Firmware>
> > >>       </rdf:Description>
> > >>       <rdf:Description rdf:about=""
> > >>             xmlns:crs="http://ns.adobe.com/camera-raw-settings/1.0/">
> > >>          <crs:AlreadyApplied>True</crs:AlreadyApplied>
> > >>       </rdf:Description>
> > >>       <rdf:Description rdf:about=""
> > >>             xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/">
> > >>          <photoshop:ColorMode>3</photoshop:ColorMode>
> > >>          <photoshop:ICCProfile>Canon EOS 350D DIGITAL</photoshop:ICCProfile>
> > >>       </rdf:Description>
> > >>    </rdf:RDF>
> > >> </x:xmpmeta>
> > >> <?xpacket end="w"?>
> > >>
> > >>
> > >> Root corresponds to the tiff namespace
> > >> Exif corresponds to the exif namespace
> > >>
> > >> In Sanselan those variables are private :
> > >> TiffImageMetadata.directory
> > >>
> > >> Would an accessor to directory be useful to parse XMP with Adobe's
> > >> Toolkit?
> > >>
> > >>
> > >> Regards,
> > >> Antoine Moreau de Bellaing
> > >>
> > >>
> > >> Le 19 nov. 07 à 10:26, Jeremias Maerki a écrit :
> > >>
> > >>> (I realize this is heavy cross-posting but it's probably the best
> > >>> way to
> > >>> reach all the players I want to address.)
> > >>>
> > >>> As you may know, I've started developing an XMP metadata package
> > >>> inside
> > >>> XML Graphics Commons in order to support XMP metadata (and
> > >>> ultimately
> > >>> PDF/A) in Apache FOP. Therefore, I have quite an interest in
> > >>> metadata.
> > >>>
> > >>> What is XMP? XMP, for those who don't know about it, is based on a
> > >>> subset of RDF to provide a flexible and extensible way of
> > >>> storing/representing document metadata.
> > >>>
> > >>> Yesterday, I was surprised to discover that Adobe has published an
> > >>> XMP
> > >>> Toolkit with Java support under the BSD license. In contrast to my
> > >>> effort, Adobe's toolkit is quite complete if maybe a bit more
> > >>> complicated to use. That got me thinking:
> > >>>
> > >>> Every project I'm sending this message to is using document metadata
> > >>> in
> > >>> some form:
> > >>> - Apache XML Graphics: embeds document metadata in the generated
> > >>> files
> > >>> (just FOP at the moment, but Batik is a similar candidate)
> > >>> - Tika (in incubation): has as one of its main purposes the
> > >>> extraction
> > >>> of metadata
> > >>> - Sanselan (in incubation): extracts and embeds metadata from/in
> > >>> bitmap
> > >>> images
> > >>> - PDFBox (incubation in discussion): extracts and embeds XMP
> > >>> metadata
> > >>> from/in PDF files (see also JempBox)
> > >>>
> > >>> Every one of these projects has its own means to represent
> > >>> metadata in
> > >>> memory. Wouldn't it make sense to have a common approach? I've
> > >>> worked
> > >>> with XMP for some time now and I can say it's ideal to work with. It
> > >>> also defines guidelines to embed XMP metadata in various file
> > >>> formats.
> > >>> It's also relatively easy to map metadata between different file
> > >>> formats
> > >>> (Dublin Core, EXIF, PDF Info etc.).
> > >>>
> > >>> Sanselan and Tika have both chosen a very simple approach but is it
> > >>> versatile enough for the future? While the simple Map<String,
> > >>> String[]> in
> > >>> Tika allows for multiple authors, for example, it doesn't support
> > >>> language alternatives for things such as dc:title or dc:description.
> > >>>
> > >>> I'm seriously thinking about abandoning most of my XMP package
> > >>> work in
> > >>> XML Graphics Commons in favor of Adobe's XMP Toolkit. What it
> > >>> doesn't
> > >>> support, though:
> > >>> - Metadata merging functionality (which I need for synchronizing the
> > >>> PDF
> > >>> Info object and the XMP packet for PDF/A)
> > >>> - Schema-specific adapters (for Dublin Core and many other XMP
> > >>> Schemas) for
> > >>> easier programming (which both Ben and I have written for JempBox
> > >>> and
> > >>> XML Graphics Commons). Adobe's toolkit only allows generic access.
> > >>>
> > >>> Some links:
> > >>> Adobe XMP website: http://www.adobe.com/products/xmp/
> > >>> Adobe XMP Toolkit: http://www.adobe.com/devnet/xmp/
> > >>> JempBox: http://sourceforge.net/projects/jempbox
> > >>> Apache XML Graphics Commons:
> > >>> http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/xmp/
> > >>>
> > >>> My questions:
> > >>> - Any interest in converging on a unified model/approach?
> > >>> - If yes, where shall we develop this? As part of Tika (although
> > >>> it's
> > >>> still in incubation)? As a separate project (maybe as Apache Commons
> > >>> subproject)? If more than XML Graphics uses this, XML Graphics is
> > >>> probably not the right home.
> > >>> - Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
> > >>> the JempBox or XML Graphics Commons approach more interesting?
> > >>> - Where's the best place to discuss this? We can't keep posting to
> > >>> several mailing lists.
> > >>>
> > >>> At any rate, I would volunteer to spearhead this effort, especially
> > >>> since I have immediate need to have complete XMP functionality. I've
> > >>> almost finished mapping all XMP structures in XG Commons but I
> > >>> haven't
> > >>> committed my latest changes (for structured properties) and I may
> > >>> still
> > >>> not cover all details of XMP.
> > >>>
> > >>> Thanks for reading this far,
> > >>> Jeremias Maerki
> > >>>
> > >>>
> > >>
> > >
> > >
> >
> >
>

Re: Metadata use by Apache Java projects

Posted by Charles Matthew Chen <ch...@gmail.com>.

Hi Jeremias & Antoine,

   Antoine, it looks like you found it pretty easy to convert
Sanselan's metadata into XMP format.

   Jeremias, it sounds like you are considering a new project which can
translate data from many formats (read by a variety of projects) into
XMP.  That sounds great!

   Sanselan could not use XMP internally to represent metadata,
though.  Sanselan's goal is to read & write metadata (such as EXIF
metadata) preserving not just tag values but directory structure,
field order, field location, etc.  I'm in the process of refactoring
the metadata data structures at the moment, actually, in order to
approach binary compatibility as closely as possible.

Charles.





Re: Metadata use by Apache Java projects

Posted by Antoine Moreau de Bellaing <am...@enst.fr>.

Thank you for your advice.
This class might help convert EXIF metadata into XMP using Adobe's
XMP Toolkit.


Regards,
Antoine Moreau de Bellaing.

import java.io.File;
import java.io.IOException;
import java.util.Vector;

import org.cmc.sanselan.ImageReadException;
import org.cmc.sanselan.Sanselan;
import org.cmc.sanselan.common.IImageMetadata;
import org.cmc.sanselan.formats.jpeg.JpegImageMetadata;
import org.cmc.sanselan.formats.tiff.TiffDirectory;
import org.cmc.sanselan.formats.tiff.TiffField;
import org.cmc.sanselan.formats.tiff.TiffImageMetadata;

import com.adobe.xmp.XMPConst;
import com.adobe.xmp.XMPException;
import com.adobe.xmp.XMPMeta;
import com.adobe.xmp.XMPMetaFactory;

public class XMPMetadataExample
{
	/** Reads the metadata of a JPEG file and prints it as a serialized XMP packet. */
	public static void metadataExample(File file)
			throws ImageReadException, IOException, XMPException
	{
		IImageMetadata metadata = Sanselan.getMetadata(file);

		if (metadata instanceof JpegImageMetadata)
		{
			JpegImageMetadata jpegMetadata = (JpegImageMetadata) metadata;
			XMPMeta meta = xmpMeta(jpegMetadata);
			System.out.println(XMPMetaFactory.serializeToString(meta, null));
		}
	}

	/** Copies every EXIF field whose directory maps to an XMP namespace into an XMPMeta. */
	private static XMPMeta xmpMeta(JpegImageMetadata jpegMetadata)
			throws ImageReadException, IOException, XMPException
	{
		XMPMeta meta = XMPMetaFactory.create();

		// Defensive check: assumes getExif() may return null when the file has no EXIF data.
		TiffImageMetadata exif = jpegMetadata.getExif();
		if (exif == null) return meta;

		Vector dirs = exif.getDirectories();
		for (int i = 0; i < dirs.size(); i++)
		{
			TiffImageMetadata.Directory dir = (TiffImageMetadata.Directory) dirs.get(i);

			// Skip directories that have no XMP namespace (thumbnail, GPS, ...).
			String ns = namespace(dir.type);
			if (ns == null) continue;

			Vector items = dir.getItems();
			for (int j = 0; j < items.size(); j++)
			{
				TiffImageMetadata.Item tiffItem = (TiffImageMetadata.Item) items.get(j);
				TiffField field = tiffItem.getTiffField();
				meta.setProperty(ns, field.getTagName(), field.getValueDescription());
			}
		}
		return meta;
	}
	
	
	public static final String namespace(int type)
	{
		switch (type)
		{
			case TiffDirectory.DIRECTORY_TYPE_UNKNOWN :
				return null;
			case TiffDirectory.DIRECTORY_TYPE_ROOT :
				return XMPConst.NS_TIFF;
			case TiffDirectory.DIRECTORY_TYPE_SUB :
				return null;
			case TiffDirectory.DIRECTORY_TYPE_THUMBNAIL :
				return null;
			case TiffDirectory.DIRECTORY_TYPE_EXIF :
				return XMPConst.NS_EXIF;
			case TiffDirectory.DIRECTORY_TYPE_GPS :
				return null;
			case TiffDirectory.DIRECTORY_TYPE_INTEROPERABILITY :
				return null;
			default :
				return null;
		}
	}
}
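
As a side note, the switch above could also be written as a lookup
table. The following is only a sketch under assumptions: it keys on the
directory names printed in the Sanselan dump ("Root", "Exif") instead of
the TiffDirectory constants, the class and method names are
hypothetical, and the two namespace URIs are copied from the Adobe
Bridge packet posted in this thread.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical alternative to the switch; not part of Sanselan or Adobe's toolkit.
final class DirectoryNamespaceSketch {

    // Namespace URIs as they appear in the Adobe Bridge packet.
    static final String NS_TIFF = "http://ns.adobe.com/tiff/1.0/";
    static final String NS_EXIF = "http://ns.adobe.com/exif/1.0/";

    private static final Map<String, String> NAMESPACES;
    static {
        Map<String, String> m = new HashMap<String, String>();
        m.put("Root", NS_TIFF); // Root corresponds to the tiff namespace
        m.put("Exif", NS_EXIF); // Exif corresponds to the exif namespace
        NAMESPACES = Collections.unmodifiableMap(m);
    }

    /** Returns the XMP namespace for a directory name, or null if it has none. */
    static String namespaceFor(String directoryName) {
        return NAMESPACES.get(directoryName);
    }

    public static void main(String[] args) {
        System.out.println("Root -> " + namespaceFor("Root"));
        System.out.println("Exif -> " + namespaceFor("Exif"));
        System.out.println("Sub  -> " + namespaceFor("Sub"));
    }
}
```

Adding a further directory then means adding one map entry rather than
another case branch; directories without an entry are skipped, just as
with the null results of the switch.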


Re: Metadata use by Apache Java projects

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.

Cool, this proves my point that XMP is useful. ;-)

AFAIK, JPEG metadata is usually not embedded as XMP but as EXIF/IPTC
data. In this case, the EXIF and IPTC chunks would have to be converted
into the XMP representation. I guess that's what Adobe's Bridge does.
That's exactly what would need to be done if my proposal were
implemented.

So, if you want to do it now (i.e. before we've reached a conclusion)
you'll have to extract every single value from the metadata directory
and put it into the structure exposed by Adobe's XMP Toolkit. To get the
individual values, see:
https://svn.apache.org/repos/asf/incubator/sanselan/trunk/src/main/java/org/cmc/sanselan/sampleUsage/MetadataExample.java

The right mappings are easily found in the XMP specification.
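
To make the conversion step more concrete, here is a minimal Java sketch
that turns a few of the values from the Sanselan dump shown in this thread
into the text forms seen in the Bridge-generated XMP. The class and method
names (ExifToXmpValues, toRational, asciiCodesToText, toXmpDate) are
hypothetical and belong to neither Sanselan nor Adobe's XMP Toolkit, and
the inputs are assumed to look like the toString() output. Working from
strings also loses information: Bridge writes ExposureBiasValue as 0/2,
which this sketch cannot reproduce from a bare 0.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical helper: converts Sanselan-style EXIF text values to XMP text forms. */
final class ExifToXmpValues {

    // Assumed input format, taken from the dump: 2007-10-06T16:47:56.000+0200
    private static final DateTimeFormatter SANSELAN_DATE =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ", Locale.ROOT);

    /** "72" becomes "72/1"; values that already are fractions pass through. */
    static String toRational(String exifValue) {
        String v = exifValue.trim();
        return v.contains("/") ? v : v + "/1";
    }

    /** Decodes a list of ASCII codes, e.g. "48, 50, 50, 49", into text. */
    static String asciiCodesToText(String codes) {
        StringBuilder sb = new StringBuilder();
        for (String code : codes.split(",")) {
            sb.append((char) Integer.parseInt(code.trim()));
        }
        return sb.toString();
    }

    /** Rewrites the dump's date form as an ISO 8601 date with a colon in the offset. */
    static String toXmpDate(String sanselanDate) {
        OffsetDateTime t = OffsetDateTime.parse(sanselanDate, SANSELAN_DATE);
        return t.format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }
}
```

With the values from the dump, XResolution 72 would become 72/1 and the
ExifVersion codes 48, 50, 50, 49 would become 0221, as in the Bridge file.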

Jeremias Maerki



On 19.11.2007 11:43:48 Antoine Moreau de Bellaing wrote:
> [..snip..]

Re: Metadata use by Apache Java projects

Posted by Antoine Moreau de Bellaing <am...@enst.fr>.

Hello.
I'm looking for a way to connect the  Adobe XMP Toolkit to Sanselan.
Especially with JPEG.

I'm a real newbie, so I apologize if my response doesn't make sense to
you all...

Here's an output of Sanselan
TiffImageMetadata.toString()
		Root:
			Make: 'Canon'
			Model: 'Canon EOS 350D DIGITAL'
			Orientation: 1
			XResolution: 72
			YResolution: 72
			ResolutionUnit: 2
			DateTime: 2007-10-06T16:47:56.000+0200
			WhitePoint: 313/1000, 329/1000
			PrimaryChromaticities: 64/100, 33/100, 21/100, 71/100, 15/100, 6/100
			YCbCrCoefficients: 299/1000, 587/1000, 114/1000
			YCbCrPositioning: 2
			Exif_IFD_Pointer: 320

		Exif:
			ExposureTime: 1/60
			FNumber: 5
			ExposureProgram: 0
			ISOSpeedRatings: 400
			ExifVersion: 48, 50, 50, 49
			DateTimeOriginal: 2007-10-06T16:47:56.000+0200
			DateTimeDigitized: 2007-10-06T16:47:56.000+0200
			ComponentsConfiguration: 1, 2, 3, 0
			ShutterSpeedValue: 387114/65536
			ApertureValue: 304340/65536
			ExposureBiasValue: 0
			MeteringMode: 1
			Flash: 16
			FocalLength: 41
			MakerNote: 24, 0, 1, 0, 3, 0, 46, 0, 0, 0, 34, 4, 0, 0, 2, 0, 3, 0,  
4, 0, 0, 0, 126, 4, 0, 0, 3, 0, 3, 0, 4, 0, 0, 0, -122, 4, 0, 0, 4, 0,  
3, 0, 34, 0, 0, 0, -114, 4, 0, 0, 6... (8340)
			UserComment: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  
0, 0, 0, 0, 0, 0, 0, 0, 0, 0... (264)
			FlashpixVersion: 48, 49, 48, 48
			ColorSpace: 65535
			PixelXDimension: 3456
			PixelYDimension: 2304
			Interoperability_IFD_Pointer: 9366
			FocalPlaneXResolution: 3456000/874
			FocalPlaneYResolution: 2304000/582
			FocalPlaneResolutionUnit: 2
			CustomRendered: 0
			ExposureMode: 0
			WhiteBalance: 0
			SceneCaptureType: 0
			Unknown: 22/10

		Interoperability:
			GPSLatitudeRef: 'R03'
			GPSLatitude: 48, 49, 48, 48

		Sub:
			Compression: 6
			XResolution: 72
			YResolution: 72
			ResolutionUnit: 2
			JPEGInterchangeFormat: 9716
			JPEGInterchangeFormatLength: 9176

The same file parsed with Adobe Bridge produces this XMP file:

<?xpacket begin="﻿" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.1-c037 46.282696, Mon Apr 02 2007 18:36:56        ">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
       <rdf:Description rdf:about=""
             xmlns:tiff="http://ns.adobe.com/tiff/1.0/">
          <tiff:Make>Canon</tiff:Make>
          <tiff:Model>Canon EOS 350D DIGITAL</tiff:Model>
          <tiff:Orientation>1</tiff:Orientation>
          <tiff:ImageWidth>3456</tiff:ImageWidth>
          <tiff:ImageLength>2304</tiff:ImageLength>
          <tiff:PhotometricInterpretation>2</tiff:PhotometricInterpretation>
          <tiff:SamplesPerPixel>3</tiff:SamplesPerPixel>
          <tiff:BitsPerSample>
             <rdf:Seq>
                <rdf:li>8</rdf:li>
                <rdf:li>8</rdf:li>
                <rdf:li>8</rdf:li>
             </rdf:Seq>
          </tiff:BitsPerSample>
          <tiff:XResolution>72/1</tiff:XResolution>
          <tiff:YResolution>72/1</tiff:YResolution>
          <tiff:ResolutionUnit>2</tiff:ResolutionUnit>
       </rdf:Description>
       <rdf:Description rdf:about=""
             xmlns:exif="http://ns.adobe.com/exif/1.0/">
          <exif:ExifVersion>0221</exif:ExifVersion>
          <exif:ExposureTime>1/60</exif:ExposureTime>
          <exif:ShutterSpeedValue>5906891/1000000</exif:ShutterSpeedValue>
          <exif:FNumber>5/1</exif:FNumber>
          <exif:ApertureValue>4643856/1000000</exif:ApertureValue>
          <exif:ExposureProgram>0</exif:ExposureProgram>
          <exif:ISOSpeedRatings>
             <rdf:Seq>
                <rdf:li>400</rdf:li>
             </rdf:Seq>
          </exif:ISOSpeedRatings>
          <exif:DateTimeOriginal>2007-10-06T16:47:56+02:00</exif:DateTimeOriginal>
          <exif:DateTimeDigitized>2007-10-06T16:47:56+02:00</exif:DateTimeDigitized>
          <exif:ExposureBiasValue>0/2</exif:ExposureBiasValue>
          <exif:MeteringMode>1</exif:MeteringMode>
          <exif:Flash rdf:parseType="Resource">
             <exif:Fired>False</exif:Fired>
             <exif:Return>0</exif:Return>
             <exif:Mode>2</exif:Mode>
             <exif:Function>False</exif:Function>
             <exif:RedEyeMode>False</exif:RedEyeMode>
          </exif:Flash>
          <exif:FocalLength>41/1</exif:FocalLength>
          <exif:CustomRendered>0</exif:CustomRendered>
          <exif:ExposureMode>0</exif:ExposureMode>
          <exif:WhiteBalance>0</exif:WhiteBalance>
          <exif:SceneCaptureType>0</exif:SceneCaptureType>
          <exif:FocalPlaneXResolution>3456000/874</exif:FocalPlaneXResolution>
          <exif:FocalPlaneYResolution>2304000/582</exif:FocalPlaneYResolution>
          <exif:FocalPlaneResolutionUnit>2</exif:FocalPlaneResolutionUnit>
       </rdf:Description>
       <rdf:Description rdf:about=""
             xmlns:xap="http://ns.adobe.com/xap/1.0/">
          <xap:ModifyDate>2007-10-06T16:47:56+02:00</xap:ModifyDate>
       </rdf:Description>
       <rdf:Description rdf:about=""
             xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:creator>
             <rdf:Seq>
                <rdf:li>antoine</rdf:li>
             </rdf:Seq>
          </dc:creator>
       </rdf:Description>
       <rdf:Description rdf:about=""
             xmlns:aux="http://ns.adobe.com/exif/1.0/aux/">
          <aux:SerialNumber>1330734959</aux:SerialNumber>
          <aux:LensInfo>18/1 55/1 0/0 0/0</aux:LensInfo>
          <aux:Lens>18.0-55.0 mm</aux:Lens>
          <aux:ImageNumber>160</aux:ImageNumber>
          <aux:FlashCompensation>0/1</aux:FlashCompensation>
          <aux:OwnerName>antoine</aux:OwnerName>
          <aux:Firmware>1.0.3</aux:Firmware>
       </rdf:Description>
       <rdf:Description rdf:about=""
             xmlns:crs="http://ns.adobe.com/camera-raw-settings/1.0/">
          <crs:AlreadyApplied>True</crs:AlreadyApplied>
       </rdf:Description>
       <rdf:Description rdf:about=""
             xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/">
          <photoshop:ColorMode>3</photoshop:ColorMode>
          <photoshop:ICCProfile>Canon EOS 350D DIGITAL</photoshop:ICCProfile>
       </rdf:Description>
    </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>


Root corresponds to the tiff namespace
Exif corresponds to the exif namespace
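
The directory-to-namespace correspondence above could be captured in a
small lookup table. The following Java sketch is only an illustration: the
class DirectoryNamespaces and its methods are hypothetical, the two
namespace URIs are copied from the Bridge output above, and it assumes
that a tag keeps its name and stays a simple value when moved into XMP.
The Bridge output shows that this is not always the case (exif:Flash, for
instance, becomes a structure).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical lookup from Sanselan directory names to XMP namespaces. */
final class DirectoryNamespaces {

    // directory name -> { prefix, namespace URI }
    private static final Map<String, String[]> BY_DIRECTORY = new LinkedHashMap<>();
    static {
        BY_DIRECTORY.put("Root", new String[] {"tiff", "http://ns.adobe.com/tiff/1.0/"});
        BY_DIRECTORY.put("Exif", new String[] {"exif", "http://ns.adobe.com/exif/1.0/"});
    }

    private static String[] lookup(String directory) {
        String[] ns = BY_DIRECTORY.get(directory);
        if (ns == null) {
            throw new IllegalArgumentException(
                    "No XMP namespace known for directory: " + directory);
        }
        return ns;
    }

    /** Returns the namespace URI for a directory, e.g. "Exif". */
    static String namespaceUri(String directory) {
        return lookup(directory)[1];
    }

    /** Builds a prefixed property name, e.g. ("Root", "Make") gives "tiff:Make". */
    static String qualifiedName(String directory, String tag) {
        return lookup(directory)[0] + ":" + tag;
    }
}
```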

In Sanselan those variables are private:
TiffImageMetadata.directory

Would an accessor for directory be useful for building the XMP with
Adobe's Toolkit?


Regards,
Antoine Moreau de Bellaing


On 19 Nov 07 at 10:26, Jeremias Maerki wrote:

> [..snip..]

Re: Metadata use by Apache Java projects

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.

Hi Chris

On 20.11.2007 18:06:25 Chris Mattmann wrote:
> Hi Jeremias,
> 
> >> I'm not quite sure I understand how Tika's metadata model isn't flexible
> >> enough? Of course, I'm a bit biased, but I'm really trying to understand here
> >> and haven't been able to. I think it's important to realize that a balance
> >> must be struck between over-bloating a metadata library (and attaching on
> >> RDF support, inference, synonym support, etc.) and making sure that the
> >> smallest subset of it is actually useful.
> > 
> > I'm sorry. I didn't intend to stand on anyone's toes.
> > 
> > At any rate, I'm not talking about full RDF support. I'm talking about
> > XMP, which uses only a subset of RDF.
> 
> Great, and I wouldn't worry about stepping on anyone's toes. You certainly
> didn't step on mine. My point was, at some point, we're just building
> libraries on top of libraries on top of...well you get the picture. What I'm
> interested in is building the smallest metadata library that's actually
> useful and can be built upon to add higher level capabilities, just as Solr
> builds on top of Lucene to provide faceted search, etc. Lucene itself
> doesn't provide a means for understanding facets/etc., but provides a
> library for text/indexing: Solr adds that understanding. Similarly here, I
> think it would be great for Tika to provide a library to handle Metadata
> representation/access, and then for others, to build on top of it to provide
> higher level library support (RDF access/etc.).

I think Adobe's XMP toolkit accomplishes exactly that, at least for the
generic part. Every project will certainly have some extra needs: XML
Graphics, for example, needs metadata merging and concrete adapters (like
in my previous example) for easier programming. Other projects might need
other tools, or the same ones. If we find common parts, we can put those
in a little metadata library (Commons?!).

You keep saying that Tika should be providing a library to handle
Metadata representation/access. But is Tika really the right container?
Tika's goal is clearly metadata extraction, while the requirements for
such a library go a little beyond that focus. I think I'd have a hard
time selling Tika with all its dependencies to the XML Graphics project
for just metadata handling (but not extraction). However, if that library
were a separate product of the Tika project, fine. Then we'd only have
the problem of Tika being in the incubator at the moment. Can we use
incubator releases in non-incubator projects? I don't really know.

> > 
> >> Also, I'd be against moving Metadata support out of Tika because that was
> >> one of the project's original goals (Metadata support), and I think it's
> >> advantageous for Tika to be a provider for a Metadata capability (of course,
> >> one related to document/content extraction).
> > 
> > Metadata capability in the context of content extraction, certainly yes.
> > Nobody disputes that. But other projects have different needs (like
> > embedding metadata). So in all this there are certain common needs and
> > I'm trying to see if we can find a common ground in the form of a
> > uniform way of manipulating and storing metadata in memory while at the
> > same time working off a freely available standard.
> 
> Yep I get that. I'm all for that. Could you explain what you mean by
> "embedding" metadata? Within a document?

Again, an example is probably best: document production in FOP.
Imagine a workflow where some application generates XML files which are
formatted to PDF by FOP. Besides the actual document content, the XSLT
stylesheet builds an XMP packet from the XML data; that packet is
embedded in the fo:declarations element of the resulting XSL-FO document.
The PDFs are generated with the PDF/A-1b profile for long-term storage.
The PDFs go into a searchable archive, so metadata, especially
application-specific metadata (for example, patent bibliographic data
like a subset of ST.36 from WIPO for patent documents), needs to be
provided. During formatting, FOP needs to add its own metadata
(production time of the document, PDF producer, required PDF/A
indicators). That's where I do the merging: the XMP packet from XSL-FO
gets merged with a packet generated by FOP. The end result is an XMP
document that will be embedded in the PDF file.
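
The merge step might look roughly like the Java sketch below. It is not
the code in XML Graphics Commons: packets are reduced to flat maps from
qualified property name to value, and the conflict rule (properties owned
by the formatter always win, everything else keeps the stylesheet's value)
is an assumption made for illustration. The class name XmpMergeSketch is
likewise hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical merge of two flattened XMP packets (property name -> value). */
final class XmpMergeSketch {

    /**
     * Returns a new map; neither input is modified.
     * Assumed rule: formatter-owned properties override the stylesheet's
     * values, all other formatter properties only fill gaps.
     */
    static Map<String, String> merge(Map<String, String> fromStylesheet,
                                     Map<String, String> fromFormatter,
                                     Set<String> formatterOwned) {
        Map<String, String> merged = new LinkedHashMap<>(fromStylesheet);
        for (Map.Entry<String, String> e : fromFormatter.entrySet()) {
            if (formatterOwned.contains(e.getKey())) {
                merged.put(e.getKey(), e.getValue());          // formatter wins
            } else {
                merged.putIfAbsent(e.getKey(), e.getValue());  // stylesheet wins
            }
        }
        return merged;
    }
}
```

For example, if the stylesheet supplies dc:title and the formatter owns
pdf:Producer and pdfaid:part, the merged packet keeps the stylesheet's
title and takes the producer and the PDF/A indicator from the formatter.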

> > 
> >> I'm wondering too what it means that Tika doesn't support "language
> >> alternatives"? Do you mean synonyms?
> > 
> [..snip..]
> >       <dc:title>
> >         <rdf:Alt>
> >           <rdf:li xml:lang="x-default">Manual</rdf:li>
> >           <rdf:li xml:lang="de">Bedienungsanleitung</rdf:li>
> >           <rdf:li xml:lang="fr">Mode d'emploi</rdf:li>
> >         </rdf:Alt>
> >       </dc:title>
> [..snip..]
> 
> > 
> > You can see that the title is available in three languages. The example
> > also shows the case with multiple authors.
> > 
> > To access the title using Adobe's XMP toolkit you'd do the following:
> > 
> > XMPMeta meta = XMPMetaFactory.parse(in);
> > String s;
> > 
> > //Get default title
> > s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, XMPConst.X_DEFAULT);
> > 
> > //Get title in user language if available
> > String userLang = System.getProperty("user.language");
> > s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, userLang);
> > 
> > Easy, isn't it? :-) That's the generic access to properties as Adobe's
> > XMP toolkit provides it. But it can also be useful to have concrete
> > adapters for easier use and higher type-safety. Here's what I do in XML
> > Graphics Commons at the moment:
> > 
> > Metadata meta = XMPParser.parseXMP(url);
> > DublinCoreAdapter dc = DublinCoreSchema.getAdapter(meta);
> > String s;
> > s = dc.getTitle();
> > String userLang = System.getProperty("user.language");
> > s = dc.getTitle(userLang);
> 
> Great example Jeremias. I think that the same type of thing could be built
> into Tika, and Tika currently supports some of the functionality that you
> mention above. Instead of meta.getLocalizedText, you could make a call to
> Tika like:
> 
> /* pseudo code of course */
> Metadata meta = new Metadata();
> TikaParser p = ParserFactory.createParser();
> ContentHandler handler;
> p.parse(stream, handler, meta);
> 
> String s;
> 
> s = meta.getMetadata(DublinCore.TITLE);
> 
> /* or if you want back all the titles parsed (if more than one) */
> List<String> titles = meta.getAllMetadata(DublinCore.TITLE);

Ah, so you do get multiple titles, but you probably still lose the
information about which title is in which language, right?
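
To illustrate what gets lost, here is a minimal Java sketch of a
language-alternative value next to its flattened form. The LangAlt class
is hypothetical and far simpler than what an XMP library would do: it
only looks for an exact language match and otherwise falls back to
x-default.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical language-alternative value: one text per language tag. */
final class LangAlt {

    static final String X_DEFAULT = "x-default";

    private final Map<String, String> textByLang = new LinkedHashMap<>();

    /** Adds or replaces the text for a language; returns this for chaining. */
    LangAlt put(String lang, String text) {
        textByLang.put(lang, text);
        return this;
    }

    /** Exact match on the language tag, else the x-default text (may be null). */
    String get(String lang) {
        String text = textByLang.get(lang);
        return text != null ? text : textByLang.get(X_DEFAULT);
    }

    /** What a plain multi-valued store keeps: the texts without their languages. */
    List<String> flatten() {
        return new ArrayList<>(textByLang.values());
    }
}
```

With the dc:title example from earlier, get("de") would return
Bedienungsanleitung and get("it") would fall back to Manual, while
flatten() only yields the three strings with no way to tell them apart.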

> So, then you could build a DublinCoreAdapter on top of Tika's Metadata class
> too.
> 
> >> Also, you mention it's relatively easy
> >> in other libraries to map between different file format metadata. I think
> >> that this is fairly easy to do in Tika too, seeing as though its primary
> >> purpose is support metadata extraction from different file formats.
> > 
> > No argument there. I don't claim I know all the requirements and use
> > cases of Tika. But I would imagine it's important to preserve as much
> > metadata as possible. XMP is certainly one of the best containers I've
> > seen to achieve that goal.
> 
> Yep exactly. That's one of the key requirements of Tika's Metadata
> framework. So yeah, long story short, it would be great to collaborate: I
> just want to make sure that there is proper understanding of all the pieces
> going forward so we know where there are gaps, and where there are not.

Me happy!

Jeremias Maerki

Re: Metadata use by Apache Java projects

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.

Hi Jeremias,

>> I'm not quite sure I understand how Tika's metadata model isn't flexible
>> enough? Of course, I'm a bit biased, but I'm really trying to understand here
>> and haven't been able to. I think it's important to realize that a balance
>> must be struck between over-bloating a metadata library (and attaching on
>> RDF support, inference, synonym support, etc.) and making sure that the
>> smallest subset of it is actually useful.
> 
> I'm sorry. I didn't intend to stand on anyone's toes.
> 
> At any rate, I'm not talking about full RDF support. I'm talking about
> XMP, which uses only a subset of RDF.

Great, and I wouldn't worry about stepping on anyone's toes. You certainly
didn't step on mine. My point was, at some point, we're just building
libraries on top of libraries on top of...well you get the picture. What I'm
interested in is building the smallest metadata library that's actually
useful and can be built upon to add higher level capabilities, just as Solr
builds on top of Lucene to provide faceted search, etc. Lucene itself
doesn't provide a means for understanding facets/etc., but provides a
library for text/indexing: Solr adds that understanding. Similarly here, I
think it would be great for Tika to provide a library to handle Metadata
representation/access, and then for others, to build on top of it to provide
higher level library support (RDF access/etc.).

> 
>> Also, I'd be against moving Metadata support out of Tika because that was
>> one of the project's original goals (Metadata support), and I think it's
>> advantageous for Tika to be a provider for a Metadata capability (of course,
>> one related to document/content extraction).
> 
> Metadata capability in the context of content extraction, certainly yes.
> Nobody disputes that. But other projects have different needs (like
> embedding metadata). So in all this there are certain common needs and
> I'm trying to see if we can find a common ground in the form of a
> uniform way of manipulating and storing metadata in memory while at the
> same time working off a freely available standard.

Yep I get that. I'm all for that. Could you explain what you mean by
"embedding" metadata? Within a document?

> 
>> I'm wondering too what it means that Tika doesn't support "language
>> alternatives"? Do you mean synonyms?
> 
[..snip..]
>       <dc:title>
>         <rdf:Alt>
>           <rdf:li xml:lang="x-default">Manual</rdf:li>
>           <rdf:li xml:lang="de">Bedienungsanleitung</rdf:li>
>           <rdf:li xml:lang="fr">Mode d'emploi</rdf:li>
>         </rdf:Alt>
>       </dc:title>
[..snip..]

> 
> You can see that the title is available in three languages. The example
> also shows the case with multiple authors.
> 
> To access the title using Adobe's XMP toolkit you'd do the following:
> 
> XMPMeta meta = XMPMetaFactory.parse(in);
> String s;
> 
> //Get default title
> s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, XMPConst.X_DEFAULT);
> 
> //Get title in user language if available
> String userLang = System.getProperty("user.language");
> s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, userLang);
> 
> Easy, isn't it? :-) That's the generic access to properties as Adobe's
> XMP toolkit provides it. But it can also be useful to have concrete
> adapters for easier use and higher type-safety. Here's what I do in XML
> Graphics Commons at the moment:
> 
> Metadata meta = XMPParser.parseXMP(url);
> DublinCoreAdapter dc = DublinCoreSchema.getAdapter(meta);
> String s;
> s = dc.getTitle();
> String userLang = System.getProperty("user.language");
> s = dc.getTitle(userLang);

Great example Jeremias. I think that the same type of thing could be built
into Tika, and Tika currently supports some of the functionality that you
mention above. Instead of meta.getLocalizedText, you could make a call to
Tika like:

/* pseudo code of course */
Metadata meta = new Metadata();
TikaParser p = ParserFactory.createParser();
ContentHandler handler;
p.parse(stream, handler, meta);

String s;

s = meta.getMetadata(DublinCore.TITLE);

/* or if you want back all the titles parsed (if more than one) */
List<String> titles = meta.getAllMetadata(DublinCore.TITLE);

So, then you could build a DublinCoreAdapter on top of Tika's Metadata class
too.
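
As a rough sketch of that layering (not Tika's actual API): the class names
SimpleMetadata and DublinCoreView and the key strings "dc:title" and
"dc:creator" below are made up for illustration. A generic multi-valued
container sits underneath, and a thin typed view is built on top of it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical multi-valued metadata container; NOT Tika's real Metadata class. */
class SimpleMetadata {
    private final Map<String, List<String>> values = new HashMap<>();

    /** Appends a value to the list stored under the given key. */
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Returns the first value for the key, or null if there is none. */
    String getFirst(String key) {
        List<String> all = values.get(key);
        return (all == null || all.isEmpty()) ? null : all.get(0);
    }

    /** Returns all values for the key as a read-only list (empty if none). */
    List<String> getAll(String key) {
        List<String> all = values.get(key);
        return all == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(all);
    }
}

/** Hypothetical typed view over the generic container (assumed key names). */
class DublinCoreView {
    static final String TITLE = "dc:title";
    static final String CREATOR = "dc:creator";

    private final SimpleMetadata meta;

    DublinCoreView(SimpleMetadata meta) {
        this.meta = meta;
    }

    /** First title only; this flat model has no notion of language. */
    String getTitle() {
        return meta.getFirst(TITLE);
    }

    /** All creators in insertion order. */
    List<String> getCreators() {
        return meta.getAll(CREATOR);
    }
}
```

Note that the view can return several creators, but it has no way to tell
which of several titles belongs to which language.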

>> Also, you mention it's relatively easy
>> in other libraries to map between different file format metadata. I think
>> that this is fairly easy to do in Tika too, seeing as though its primary
>> purpose is to support metadata extraction from different file formats.
> 
> No argument there. I don't claim I know all the requirements and use
> cases of Tika. But I would imagine it's important to preserve as much
> metadata as possible. XMP is certainly one of the best containers I've
> seen to achieve that goal.

Yep exactly. That's one of the key requirements of Tika's Metadata
framework. So yeah, long story short, it would be great to collaborate: I
just want to make sure that there is proper understanding of all the pieces
going forward so we know where there are gaps, and where there are not.

Cheers,
  Chris

______________________________________________
Chris Mattmann, Ph.D.
Chris.Mattmann@jpl.nasa.gov
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

Re: Metadata use by Apache Java projects

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.

Hi Chris

On 19.11.2007 18:27:56 Chris Mattmann wrote:
> Hi Folks,
>  
> >> Sanselan and Tika have both chosen a very simple approach but is it
> >> versatile enough for the future? While the simple Map<String, String[]> in
> >> Tika allows for multiple authors, for example, it doesn't support
> >> language alternatives for things such as dc:title or dc:description.
> > 
> > IMHO it would be good to have a more flexible metadata model in Tika.
> > Better yet if it's a standard used across multiple projects. Best if
> > we don't need to implement it in Tika. :-)
> 
> I'm not quite sure I understand how Tika's metadata model isn't flexible
> enough? Of course, I'm a bit biased, but I'm really trying to understand here
> and haven't been able to. I think it's important to realize that a balance
> must be struck between over-bloating a metadata library (and attaching on
> RDF support, inference, synonym support, etc.) and making sure that the
> smallest subset of it is actually useful.

I'm sorry. I didn't intend to stand on anyone's toes.

At any rate, I'm not talking about full RDF support. I'm talking about
XMP, which uses only a subset of RDF.

> Also, I'd be against moving Metadata support out of Tika because that was
> one of the project's original goals (Metadata support), and I think it's
> advantageous for Tika to be a provider for a Metadata capability (of course,
> one related to document/content extraction).

Metadata capability in the context of content extraction, certainly yes.
Nobody disputes that. But other projects have different needs (like
embedding metadata). So in all this there are certain common needs and
I'm trying to see if we can find a common ground in the form of a
uniform way of manipulating and storing metadata in memory while at the
same time working off a freely available standard.

> I'm wondering too what it means that Tika doesn't support "language
> alternatives"? Do you mean synonyms?

Frankly, I don't know if that's synonyms. Maybe they are in RDF
terminology. The XMP spec talks about "property qualifiers" of which
"language alternatives" (using xml:lang) are a special case. The easiest
way to explain is by example:

<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
      <dc:creator>
        <rdf:Seq>
          <rdf:li>John Doe</rdf:li>
          <rdf:li>Jane Doe</rdf:li>
        </rdf:Seq>
      </dc:creator>
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Manual</rdf:li>
          <rdf:li xml:lang="de">Bedienungsanleitung</rdf:li>
          <rdf:li xml:lang="fr">Mode d'emploi</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:date>2006-06-02T10:36:40+02:00</dc:date>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>

You can see that the title is available in three languages. The example
also shows the case with multiple authors.

To access the title using Adobe's XMP toolkit you'd do the following:

XMPMeta meta = XMPMetaFactory.parse(in);
String s;

//Get default title
s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, XMPConst.X_DEFAULT);

//Get title in user language if available
String userLang = System.getProperty("user.language");
s = meta.getLocalizedText(XMPConst.NS_DC, "title", null, userLang);

Easy, isn't it? :-) That's the generic access to properties as Adobe's
XMP toolkit provides it. But it can also be useful to have concrete
adapters for easier use and higher type-safety. Here's what I do in XML
Graphics Commons at the moment:

Metadata meta = XMPParser.parseXMP(url);
DublinCoreAdapter dc = DublinCoreSchema.getAdapter(meta);
String s;
s = dc.getTitle();
String userLang = System.getProperty("user.language");
s = dc.getTitle(userLang);

(Obviously, the same could be done for Adobe's XMP toolkit.)
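
To make the lookup behaviour concrete, here is a minimal sketch of a
language alternative with fallback to x-default. The LangAlt class is
hypothetical; it is neither Adobe's nor XML Graphics Commons' implementation,
and real toolkits may apply more elaborate matching (for example on language
sub-tags).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of an XMP language alternative (rdf:Alt with xml:lang). */
class LangAlt {
    static final String X_DEFAULT = "x-default";

    // Keeps the entries in the order they were added, like the rdf:li items.
    private final Map<String, String> byLang = new LinkedHashMap<>();

    /** Adds or replaces the text for one language; returns this for chaining. */
    LangAlt put(String lang, String text) {
        byLang.put(lang, text);
        return this;
    }

    /**
     * Returns the text for the requested language if present,
     * otherwise the x-default entry, otherwise null.
     */
    String get(String lang) {
        String text = byLang.get(lang);
        return text != null ? text : byLang.get(X_DEFAULT);
    }
}
```

With the three titles from the example above, get("de") would return the
German title, while get("it") would fall back to the x-default entry.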

> Also, you mention it's relatively easy
> in other libraries to map between different file format metadata. I think
> that this is fairly easy to do in Tika too, seeing as though its primary
> purpose is to support metadata extraction from different file formats.

No argument there. I don't claim I know all the requirements and use
cases of Tika. But I would imagine it's important to preserve as much
metadata as possible. XMP is certainly one of the best containers I've
seen to achieve that goal.

> > 
> >> My questions:
> >> - Any interest in converging on a unified model/approach?
> > 
> > Certainly.
> 
> +1
> 
> > 
> >> - If yes, where shall we develop this? As part of Tika (although it's
> >> still in incubation)? As a separate project (maybe as Apache Commons
> >> subproject)? If more than XML Graphics uses this, XML Graphics is
> >> probably not the right home.
> >> - Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
> >> the JempBox or XML Graphics Commons approach more interesting?
> > 
> > If there already exists acceptably licensed good code outside the ASF,
> > then I would prefer using that instead of reinventing the wheel within
> > the foundation.
> 
> I'm not sure we're "re-inventing the wheel" here Jukka. Tika's Metadata
> framework began in Nutch, and at the time based on a short survey that
> Jerome Charron and I undertook, there was no easy-to-use, Metadata library
> framework, that met the needs of the types of things done in Nutch/Tika --
> document extraction of metadata from large corpuses, supporting many values
> for keys: mapping between keys, etc. So, in my mind, we're definitely not
> re-inventing any wheel and the framework was borne more out of need/ease of
> use than anything else.
> 
> In any case, the use of a common framework is a good one to discuss and I'm
> open to it. So long as people like me can better understand the gaps in the
> current Tika Metadata framework and the benefits of addressing those gaps to
> all the projects that would need it.


Jeremias Maerki

Re: Metadata use by Apache Java projects

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.

Hi Folks,
 
>> Sanselan and Tika have both chosen a very simple approach but is it
>> versatile enough for the future? While the simple Map<String, String[]> in
>> Tika allows for multiple authors, for example, it doesn't support
>> language alternatives for things such as dc:title or dc:description.
> 
> IMHO it would be good to have a more flexible metadata model in Tika.
> Better yet if it's a standard used across multiple projects. Best if
> we don't need to implement it in Tika. :-)

I'm not quite sure I understand how Tika's metadata model isn't flexible
enough? Of course, I'm a bit biased, but I'm really trying to understand here
and haven't been able to. I think it's important to realize that a balance
must be struck between over-bloating a metadata library (and attaching on
RDF support, inference, synonym support, etc.) and making sure that the
smallest subset of it is actually useful.

Also, I'd be against moving Metadata support out of Tika because that was
one of the project's original goals (Metadata support), and I think it's
advantageous for Tika to be a provider for a Metadata capability (of course,
one related to document/content extraction).

I'm wondering too what it means that Tika doesn't support "language
alternatives"? Do you mean synonyms? Also, you mention it's relatively easy
in other libraries to map between different file format metadata. I think
that this is fairly easy to do in Tika too, seeing as though its primary
purpose is to support metadata extraction from different file formats.

> 
>> My questions:
>> - Any interest in converging on a unified model/approach?
> 
> Certainly.

+1

> 
>> - If yes, where shall we develop this? As part of Tika (although it's
>> still in incubation)? As a separate project (maybe as Apache Commons
>> subproject)? If more than XML Graphics uses this, XML Graphics is
>> probably not the right home.
>> - Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
>> the JempBox or XML Graphics Commons approach more interesting?
> 
> If there already exists acceptably licensed good code outside the ASF,
> then I would prefer using that instead of reinventing the wheel within
> the foundation.

I'm not sure we're "re-inventing the wheel" here Jukka. Tika's Metadata
framework began in Nutch, and at the time based on a short survey that
Jerome Charron and I undertook, there was no easy-to-use, Metadata library
framework, that met the needs of the types of things done in Nutch/Tika --
document extraction of metadata from large corpuses, supporting many values
for keys: mapping between keys, etc. So, in my mind, we're definitely not
re-inventing any wheel and the framework was borne more out of need/ease of
use than anything else.

In any case, the use of a common framework is a good one to discuss and I'm
open to it. So long as people like me can better understand the gaps in the
current Tika Metadata framework and the benefits of addressing those gaps to
all the projects that would need it.

Thanks!

Cheers,
  Chris
 

> 
> BR,
> 
> Jukka Zitting

______________________________________________
Chris Mattmann, Ph.D.
Chris.Mattmann@jpl.nasa.gov
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
Phone:  818-354-8810
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

Re: Metadata use by Apache Java projects

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

[Responding just on tika-dev@. I guess Jeremias follows all these
forums, and can summarize in the end...]

On Nov 19, 2007 11:26 AM, Jeremias Maerki <de...@jeremias-maerki.ch> wrote:
> Every one of these projects has its own means to represent metadata in
> memory. Wouldn't it make sense to have a common approach?

+1

> Sanselan and Tika have both chosen a very simple approach but is it
> versatile enough for the future? While the simple Map<String, String[]> in
> Tika allows for multiple authors, for example, it doesn't support
> language alternatives for things such as dc:title or dc:description.

IMHO it would be good to have a more flexible metadata model in Tika.
Better yet if it's a standard used across multiple projects. Best if
we don't need to implement it in Tika. :-)

> My questions:
> - Any interest in converging on a unified model/approach?

Certainly.

> - If yes, where shall we develop this? As part of Tika (although it's
> still in incubation)? As a separate project (maybe as Apache Commons
> subproject)? If more than XML Graphics uses this, XML Graphics is
> probably not the right home.
> - Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
> the JempBox or XML Graphics Commons approach more interesting?

If there already exists acceptably licensed good code outside the ASF,
then I would prefer using that instead of reinventing the wheel within
the foundation.

BR,

Jukka Zitting

Re: Metadata use by Apache Java projects

Posted by Jeremias Maerki <de...@jeremias-maerki.ch>.

Hi Antoni

Thanks for the interesting information. Frankly, you've scared me there
just a bit. It's interesting to see that there are such encompassing
efforts underway in some places. To me, full RDF still has a scare
factor. At least the subset XMP provides is "manageable" for mere
mortals. :-) At least, that's my impression. Maybe I still just know too
little about RDF. IMO, XMP finds a good compromise between
expressiveness and simplicity. The positive points for Adobe's XMP
toolkit: it is in Java, available now and under a license we can easily
use in Apache projects.

In your point 4, you mention some restrictions you see for XMP. But XMP
is a subset of RDF, so does XMP really restrict you from an RDF point of
view? I didn't really understand that point.

We'll see how this works out.

Jeremias Maerki



On 20.11.2007 15:25:44 Antoni Mylka wrote:
> Hi Jeremias, tika-dev
> 
> My name is Antoni Mylka, I am involved in aperture.sourceforge.net,
> which is addressing similar things as Tika, we got your mail on the
> tika-dev mailing list. I also work for the Nepomuk Social Semantic
> Desktop project, I'm the maintainer of the Nepomuk Information Element
> Ontology. More below.
> 
> Your mail addresses four more-or-less orthogonal issues.
> 
> 1. The standardization of schemas, how the metadata should be
> represented i.e. URIs of classes and properties.
> 
> 2. The standardization of the representational language. This means the
> conventions about how to use RDF (e.g. Bags, Seqs, Alts etc) and the
> formal semantics.
> 
> 3. The standardization of the API that will work with the RDF triples
> and handle operations such as adding, deleting and querying triples.
> (And maybe the inference).
> 
> 4. The standardization of the RDF storage mechanisms.
> 
> XMP provides its answers to all these questions but they aren't the only
> ones. I know of at least two such standardization initiatives,
> 
> 1. Freedesktop.org the XESAM project. A gathering of the major
> open-source desktop search engines
> http://xesam.org/main
> 
> 2. Nepomuk Social Semantic Desktop Project. An EU-Funded research
> project with the Semantic-Web background.
> http://nepomuk.semanticdesktop.org
> 
> Many of the issues you are bound to come into have already been
> recognized and some answers have been given, naturally the requirements
> might have been different and the solutions aren't optimal, but it may
> be interesting for you to skim through the output of those projects. To
> sum it up:
> 
> 1.
> Freedesktop.org schema:
> <http://xesam.org/main/XesamOntology90>
> 
> Nepomuk schema: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie/>
> Let the pointers take you from there.
> There is also an archive of discussions around the drafts of NIE. (there
> have been 8 at the moment).
> <http://dev.nepomuk.semanticdesktop.org/query?status=new&status=assigned&status=reopened&status=closed&component=ontology-nie&order=priority>
> 
> 2.
> Freedesktop don't use any specific representational language, but they
> support property inheritance. They implement it by themselves, without
> any general-purpose RDF inference.
> 
> Nepomuk uses the Nepomuk Representational Language. It has been
> considered better for our purposes, since it employs more intuitive
> semantics (so-called closed-world assumption, in normal RDF if you say
> that the value of the nie:kisses property is a Human, and you write Antoni
> nie:kisses Frog - you can infer that the frog is a human, in NRL you can't)
> 
> 3.
> No-one tried to standardize the API, there are many libraries that work
> with both in-memory and persistent RDF repositories.
> 
> A few pointers:
> 
> There are many APIs out there:
> * jena.sourceforge.net - big api for rdf by HP
> * www.openrdf.org - rdf api optimized for client/server setups
> * http://wiki.ontoworld.org/wiki/RDF2Go - Abstraction api of above
> 
> There are many APIs generating "Schema Specific Adapters", the well
> known in Java are:
> * http://wiki.ontoworld.org/wiki/RDFReactor
> * elmo
> ** http://www.openrdf.net/doc/elmo/1.0/user-guide/index.html
> **
> http://sourceforge.net/project/showfiles.php?group_id=46509&package_id=157314
> * https://sommer.dev.java.net/
> 
> from the above, elmo is quite stable and advanced.
> 
> There are murmurs of standardization of RDF Apis,
> Max Völkel (FZI, Maintainer of RDF2Go), Henry Story (www.bblfish.net),
> and Leo Sauermann (DFKI, http://leobard.twoday.net) repeatedly thought
> about starting a JSR discussion on an RDF api, but that never happened.
> The W3C may be interested to do something like this (they did it for DOM
> I think and for XML, or?), the contact people would be the deployment group:
> http://www.w3.org/2006/07/SWD/
> 
> so, to sum it up:
> There are many things out there handling RDF in Java, but nothing
> dominates yet as a single monopoly. In my surroundings (my company,
> aperture.sourceforge.net) we prefer to use RDF2Go as "the api", it's not
> perfect but it seems to work quite well.
> 
> 4.
> XMP prescribes that the metadata be contained within the files
> themselves. There are many scenarios where this is a limitation. Each
> application will have to maintain its indexes by itself and possibly use
> a different API to work with XMP storage (in the files) and the common
> storage (e.g. an index). There is an ongoing effort to combine the
> flexibility of RDF with the search-capabilities of Lucene. Two of the
> more prominent ones are
> 
> Sesame Lucene Sail
> <https://src.aduna-software.org/svn/org.openrdf/projects/sesame2-contrib/openrdf-sail-contrib/openrdf-lucenesail/>
> AFAIK there is no project page yet, but this idea has been worked on for
> at least two years now, e.g. in the gnowsis project
> www.gnowsis.org
> 
> Boca TextIndexing feature
> Part of the IBM SLRP
> <http://ibm-slrp.sourceforge.net/wiki/index.php?title=BocaTextIndexing>
> 
> In our opinion, such an initiative deserves at least a separate mailing
> list. We have already been working on metadata standardization for some
> time now and would be happy to help. Chris Mattmann has written that it's
> necessary to strike a balance between functionality and over-bloating.
> From my own experience I can say that it is VERY difficult :).
> 
> Antoni Mylka
> antoni.mylka@gmail.com
> 
> On Nov 19, 2007 10:26 AM, Jeremias Maerki <de...@jeremias-maerki.ch> wrote:
> > (I realize this is heavy cross-posting but it's probably the best way to
> > reach all the players I want to address.)
> >
> > As you may know, I've started developing an XMP metadata package inside
> > XML Graphics Commons in order to support XMP metadata (and ultimately
> > PDF/A) in Apache FOP. Therefore, I have quite an interest in metadata.
> >
> > What is XMP? XMP, for those who don't know about it, is based on a
> > subset of RDF to provide a flexible and extensible way of
> > storing/representing document metadata.
> >
> > Yesterday, I was surprised to discover that Adobe has published an XMP
> > Toolkit with Java support under the BSD license. In contrast to my
> > effort, Adobe's toolkit is quite complete if maybe a bit more
> > complicated to use. That got me thinking:
> >
> > Every project I'm sending this message to is using document metadata in
> > some form:
> > - Apache XML Graphics: embeds document metadata in the generated files
> > (just FOP at the moment, but Batik is a similar candidate)
> > - Tika (in incubation): has as one of its main purposes the extraction
> > of metadata
> > - Sanselan (in incubation): extracts and embeds metadata from/in bitmap
> > images
> > - PDFBox (incubation in discussion): extracts and embeds XMP metadata
> > from/in PDF files (see also JempBox)
> >
> > Every one of these projects has its own means to represent metadata in
> > memory. Wouldn't it make sense to have a common approach? I've worked
> > with XMP for some time now and I can say it's ideal to work with. It
> > also defines guidelines to embed XMP metadata in various file formats.
> > It's also relatively easy to map metadata between different file formats
> > (Dublin Core, EXIF, PDF Info etc.).
> >
> > Sanselan and Tika have both chosen a very simple approach but is it
> > versatile enough for the future? While the simple Map<String, String[]> in
> > Tika allows for multiple authors, for example, it doesn't support
> > language alternatives for things such as dc:title or dc:description.
> >
> > I'm seriously thinking about abandoning most of my XMP package work in
> > XML Graphics Commons in favor of Adobe's XMP Toolkit. What it doesn't
> > support, though:
> > - Metadata merging functionality (which I need for synchronizing the PDF
> > Info object and the XMP packet for PDF/A)
> > - Schema-specific adapters (for Dublin Core and many other XMP Schemas) for
> > easier programming (which both Ben and I have written for JempBox and
> > XML Graphics Commons). Adobe's toolkit only allows generic access.
> >
> > Some links:
> > Adobe XMP website: http://www.adobe.com/products/xmp/
> > Adobe XMP Toolkit: http://www.adobe.com/devnet/xmp/
> > JempBox: http://sourceforge.net/projects/jempbox
> > Apache XML Graphics Commons:
> >   http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/xmp/
> >
> > My questions:
> > - Any interest in converging on a unified model/approach?
> > - If yes, where shall we develop this? As part of Tika (although it's
> > still in incubation)? As a separate project (maybe as Apache Commons
> > subproject)? If more than XML Graphics uses this, XML Graphics is
> > probably not the right home.
> > - Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
> > the JempBox or XML Graphics Commons approach more interesting?
> > - Where's the best place to discuss this? We can't keep posting to
> > several mailing lists.
> >
> > At any rate, I would volunteer to spearhead this effort, especially
> > since I have immediate need to have complete XMP functionality. I've
> > almost finished mapping all XMP structures in XG Commons but I haven't
> > committed my latest changes (for structured properties) and I may still
> > not cover all details of XMP.
> >
> > Thanks for reading this far,
> > Jeremias Maerki
> >
> >
> 
> 
> 
> -- 
> Antoni Myłka
> antoni.mylka@gmail.com

Re: Metadata use by Apache Java projects

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.

Hi Antoni,

> Chris Mattmann has written that it's necessary to strike a balance
> between functionality and over-bloating. From my own experience I can
> say that it is VERY difficult :).


Well from my own experience I can tell you that it *is* difficult, but
certainly doable.

I've been working with different forms of metadata (Dublin Core, ISO 11179,
RDF, OWL/etc.), been involved in international standards organizations
(CCSDS, ISO) who are developing metadata standards, and worked on several
projects that deal with metadata (Object Oriented Data Technology [OODT],
Semantic Web for Earth and Environmental Terminology [SWEET]) in different
domains (earth science, planetary science, space science, cancer
research/etc.) for almost 7 years now.

Sure, there are a lot of standards and people can talk about coming up with
a one-size-fits-all cookie cutter type library for these capabilities,
however, I think it's important to understand that developing such libraries
(rather than striking the balance) in my mind is the most difficult problem
to tackle. I think that in the end, all we can do as software developers, as
people who are trying to standardize metadata, is to try and develop core
libraries and functions that others can build upon for their own needs. I
don't think the Tika folks should be in the business of trying to develop
high capability metadata libraries, because in the end, just as everyone is
saying, those need to be tailored to a specific use-case or domain. On the
other hand, I think it's a much-more attainable goal to come up with a
simple, easy-to-use metadata library, that folks who need higher level
capability (inference, multi-language support, representation/etc.) can
build upon for their own needs. In other words, someone shouldn't have to
rewrite the ability to have metadata keys with multiple values associated
with them, with ways to map between the keys, etc.; however, it's reasonable that
someone may need to rewrite the ability to represent metadata in RDF (versus
OWL), to rewrite the ability to do language translation (e.g., using XMP
versus Adobe's toolkit), that type of thing.
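
As an illustration of how small such a core could be, here is a minimal
sketch of keys with multiple values plus a mapping between keys. The class
KeyMappedMetadata and the key names "Author" and "dc:creator" are assumptions
made up for this example; this is not the existing Tika code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical core: multi-valued keys plus a simple alias mapping between keys. */
class KeyMappedMetadata {
    private final Map<String, String> aliases = new HashMap<>();
    private final Map<String, List<String>> values = new HashMap<>();

    /** Declares that 'alias' should be treated as the key 'canonical'. */
    void mapKey(String alias, String canonical) {
        aliases.put(alias, canonical);
    }

    /** Resolves an alias to its canonical key; unmapped keys resolve to themselves. */
    private String resolve(String key) {
        return aliases.getOrDefault(key, key);
    }

    /** Adds a value under the canonical form of the key. */
    void add(String key, String value) {
        values.computeIfAbsent(resolve(key), k -> new ArrayList<>()).add(value);
    }

    /** Returns a copy of all values for the key (alias or canonical); empty if none. */
    List<String> getAll(String key) {
        return new ArrayList<>(
                values.getOrDefault(resolve(key), Collections.<String>emptyList()));
    }
}
```

Anything beyond this (RDF serialization, inference, language handling) would
then be layered on top by whoever needs it.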

In any case, I'm happy to participate in any standardization efforts wearing
my Tika hat, with the understanding that whatever gets developed needs to
"fit in" the right place, be architected for extensibility, and have
cognizance of what was done previously, what the gaps are, and why the gaps
should be addressed.

Thanks!

Cheers,
  Chris

______________________________________________
Chris Mattmann, Ph.D.
Chris.Mattmann@jpl.nasa.gov
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

Re: Metadata use by Apache Java projects

Posted by Antoni Mylka <an...@gmail.com>.

Hi Jeremias, tika-dev

My name is Antoni Mylka. I am involved in aperture.sourceforge.net,
which addresses problems similar to Tika's; we got your mail on the
tika-dev mailing list. I also work for the Nepomuk Social Semantic
Desktop project, where I maintain the Nepomuk Information Element
Ontology. More below.

Your mail addresses four more-or-less orthogonal issues.

1. The standardization of schemas, how the metadata should be
represented i.e. URIs of classes and properties.

2. The standardization of the representational language. This means the
conventions about how to use RDF (e.g. Bags, Seqs, Alts etc) and the
formal semantics.

3. The standardization of the API that will work with the RDF triples
and handle operations such as adding, deleting and querying triples.
(And maybe the inference).

4. The standardization of the RDF storage mechanisms.

XMP provides its answers to all these questions but they aren't the only
ones. I know of at least two such standardization initiatives,

1. Freedesktop.org: the XESAM project. A gathering of the major
open-source desktop search engines
http://xesam.org/main

2. Nepomuk Social Semantic Desktop Project. An EU-Funded research
project with the Semantic-Web background.
http://nepomuk.semanticdesktop.org

Many of the issues you are bound to run into have already been
recognized and some answers have been given. Naturally, the requirements
might have been different and the solutions aren't optimal, but it may
be interesting for you to skim through the output of those projects. To
sum it up:

1.
Freedesktop.org schema:
<http://xesam.org/main/XesamOntology90>

Nepomuk schema: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie/>
Let the pointers take you from there.
There is also an archive of discussions around the drafts of NIE (there
have been 8 so far).
<http://dev.nepomuk.semanticdesktop.org/query?status=new&status=assigned&status=reopened&status=closed&component=ontology-nie&order=priority>

2.
Freedesktop don't use any specific representational language, but they
support property inheritance. They implement it by themselves, without
any general-purpose RDF inference.

Nepomuk uses the Nepomuk Representational Language (NRL). It has been
considered better for our purposes, since it employs more intuitive
semantics (the so-called closed-world assumption). In normal RDF, if you
say that the value of the nie:kisses property is a Human, and you write
Antoni nie:kisses Frog, you can infer that the frog is a human; in NRL you
can't.
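
The difference between the two readings can be sketched in a few lines. This
is only an illustration: the class RangeSemantics is hypothetical, types are
plain strings rather than URIs, and treating the closed-world case as a
validation check is an assumption, not a description of how NRL is actually
specified.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of two ways to read a property's range declaration. */
class RangeSemantics {

    /**
     * Open-world (RDFS-style) reading: using a resource as the value of a
     * property lets us infer that it also has the property's range type.
     */
    static Set<String> inferTypes(Set<String> knownTypes, String rangeOfProperty) {
        Set<String> result = new HashSet<>(knownTypes);
        result.add(rangeOfProperty);
        return result;
    }

    /**
     * Closed-world reading (assumed here): nothing is inferred; the range is
     * a constraint that the already known types must satisfy.
     */
    static boolean isValid(Set<String> knownTypes, String rangeOfProperty) {
        return knownTypes.contains(rangeOfProperty);
    }
}
```

For a resource known only as a "Frog" and a property whose range is "Human",
inferTypes adds "Human" to the frog's types, while isValid simply reports
that the statement does not fit.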

3.
No one has tried to standardize the API; there are many libraries that work
with both in-memory and persistent RDF repositories.

A few pointers:

There are many APIs out there:
* jena.sourceforge.net - big api for rdf by HP
* www.openrdf.org - rdf api optimized for client/server setups
* http://wiki.ontoworld.org/wiki/RDF2Go - Abstraction api of above

There are many APIs generating "Schema Specific Adapters"; the well-known
ones in Java are:
* http://wiki.ontoworld.org/wiki/RDFReactor
* elmo
** http://www.openrdf.net/doc/elmo/1.0/user-guide/index.html
**
http://sourceforge.net/project/showfiles.php?group_id=46509&package_id=157314
* https://sommer.dev.java.net/

Of the above, elmo is quite stable and advanced.

There are murmurs of standardization of RDF APIs.
Max Völkel (FZI, maintainer of RDF2Go), Henry Story (www.bblfish.net),
and Leo Sauermann (DFKI, http://leobard.twoday.net) repeatedly thought
about starting a JSR discussion on an RDF API, but that never happened.
The W3C may be interested in doing something like this (I think they did
it for DOM and for XML); the contact people would be the deployment group:
http://www.w3.org/2006/07/SWD/

So, to sum it up:
There are many things out there handling RDF in Java, but nothing
dominates yet. In my surroundings (my company,
aperture.sourceforge.net) we prefer to use RDF2Go as "the api"; it's not
perfect but it seems to work quite well.

4.
XMP prescribes that the metadata be contained within the files
themselves. There are many scenarios where this is a limitation. Each
application will have to maintain its indexes by itself and possibly use
a different API to work with XMP storage (in the files) and the common
storage (e.g. an index). There are ongoing efforts to combine the
flexibility of RDF with the search capabilities of Lucene. Two of the
more prominent ones are:

Sesame Lucene Sail
<https://src.aduna-software.org/svn/org.openrdf/projects/sesame2-contrib/openrdf-sail-contrib/openrdf-lucenesail/>
AFAIK there is no project page yet, but this idea has been worked on for
at least two years now, e.g. in the gnowsis project
www.gnowsis.org

Boca TextIndexing feature
Part of the IBM SLRP
<http://ibm-slrp.sourceforge.net/wiki/index.php?title=BocaTextIndexing>

In our opinion, such an initiative deserves at least a separate mailing
list. We have already been working on metadata standardization for some
time now and would be happy to help. Chris Mattmann has written that it's
necessary to strike a balance between functionality and bloat.
From my own experience I can say that it is VERY difficult :).

Antoni Mylka
antoni.mylka@gmail.com

On Nov 19, 2007 10:26 AM, Jeremias Maerki <de...@jeremias-maerki.ch> wrote:
> (I realize this is heavy cross-posting but it's probably the best way to
> reach all the players I want to address.)
>
> As you may know, I've started developing an XMP metadata package inside
> XML Graphics Commons in order to support XMP metadata (and ultimately
> PDF/A) in Apache FOP. Therefore, I have quite an interest in metadata.
>
> What is XMP? XMP, for those who don't know about it, is based on a
> subset of RDF to provide a flexible and extensible way of
> storing/representing document metadata.
>
> Yesterday, I was surprised to discover that Adobe has published an XMP
> Toolkit with Java support under the BSD license. In contrast to my
> effort, Adobe's toolkit is quite complete if maybe a bit more
> complicated to use. That got me thinking:
>
> Every project I'm sending this message to is using document metadata in
> some form:
> - Apache XML Graphics: embeds document metadata in the generated files
> (just FOP at the moment, but Batik is a similar candidate)
> - Tika (in incubation): has as one of its main purposes the extraction
> of metadata
> - Sanselan (in incubation): extracts and embeds metadata from/in bitmap
> images
> - PDFBox (incubation in discussion): extracts and embeds XMP metadata
> from/in PDF files (see also JempBox)
>
> Every one of these projects has its own means to represent metadata in
> memory. Wouldn't it make sense to have a common approach? I've worked
> with XMP for some time now and I can say it's ideal to work with. It
> also defines guidelines to embed XMP metadata in various file formats.
> It's also relatively easy to map metadata between different file formats
> (Dublin Core, EXIF, PDF Info etc.).
>
> Sanselan and Tika have both chosen a very simple approach but is it
> versatile enough for the future? While the simple Map<String, String[]> in
> Tika allows for multiple authors, for example, it doesn't support
> language alternatives for things such as dc:title or dc:description.
>
> I'm seriously thinking about abandoning most of my XMP package work in
> XML Graphics Commons in favor of Adobe's XMP Toolkit. What it doesn't
> support, though:
> - Metadata merging functionality (which I need for synchronizing the PDF
> Info object and the XMP packet for PDF/A)
> - Schema-specific adapters (for Dublin Core and many other XMP Schemas) for
> easier programming (which both Ben and I have written for JempBox and
> XML Graphics Commons). Adobe's toolkit only allows generic access.
>
> Some links:
> Adobe XMP website: http://www.adobe.com/products/xmp/
> Adobe XMP Toolkit: http://www.adobe.com/devnet/xmp/
> JempBox: http://sourceforge.net/projects/jempbox
> Apache XML Graphics Commons:
>   http://svn.apache.org/viewvc/xmlgraphics/commons/trunk/src/java/org/apache/xmlgraphics/xmp/
>
> My questions:
> - Any interest in converging on a unified model/approach?
> - If yes, where shall we develop this? As part of Tika (although it's
> still in incubation)? As a separate project (maybe as Apache Commons
> subproject)? If more than XML Graphics uses this, XML Graphics is
> probably not the right home.
> - Is Adobe's XMP toolkit interesting for adoption (!=incubation)? Is
> the JempBox or XML Graphics Commons approach more interesting?
> - Where's the best place to discuss this? We can't keep posting to
> several mailing lists.
>
> At any rate, I would volunteer to spearhead this effort, especially
> since I have immediate need to have complete XMP functionality. I've
> almost finished mapping all XMP structures in XG Commons but I haven't
> committed my latest changes (for structured properties) and I may still
> not cover all details of XMP.
>
> Thanks for reading this far,
> Jeremias Maerki
>
>

-- 
Antoni Myłka
antoni.mylka@gmail.com