Posted to user@tika.apache.org by Fabrizio Giudici <fa...@tidalwave.it> on 2021/08/20 17:10:16 UTC

Tika for parsing raw XMP

Hello.

Since this is my first message here I think it probably makes sense to 
write a couple of lines of personal introduction.

I've worked with Java and metadata on and off over the years, both for
customers and for my own pleasure (to manage my own stuff), in several
stints separated by years-long gaps when I did other things. Every time I
get back to the topic I first look around and update my code to use the
latest libraries available. In the past I've worked with ImageIO, Drew
Noakes' Metadata Extractor, mp3agic and other libraries, and I even wrote
my own codecs for camera RAW files. Personal introduction ends here.

At the moment I have three pet projects dealing with both music and
photos. I'd like to get rid of old libraries (including some patches and
forks and my own stuff) and converge on Tika if possible.

Now it was pretty easy to extract metadata with Tika from JPEG files,
but after many different attempts I'm still clueless about "naked" XMP
files (I only get a handful of DC properties). Those XMP files have
been generated by camera RAW applications, such as Lightroom and Photo
Supreme, and they are packed with tons of stuff - including all the EXIF
metadata. I've searched the javadoc, StackOverflow and the Tika Wiki, but
I was unable to find a simple working example.

AutoDetectParser, which works with JPEGs, doesn't do the job; I've also
tried XMLProfiler and messed around with
ImageMetadataExtractor.parseRawXMP(), but to no avail.

So, please, let me have a hint...

Thanks.

-- 
Fabrizio Giudici - Java Architect @ Tidalwave s.a.s.
"We make Java work. Everywhere."
http://tidalwave.it/fabrizio/blog - fabrizio.giudici@tidalwave.it


Re: Tika for parsing raw XMP

Posted by Fabrizio Giudici <fa...@tidalwave.it>.
On 21/08/21 23:17, Fabrizio Giudici wrote:
>
> This is a test XMP that I'm using as a data source. It has been 
> produced by the DAM app Photo Supreme and, as a typical XMP sidecar, 
> contains both info that the application has extracted from the 
> original file (a Sony ARW) and data that I've manually entered:
>
> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/images/20180520-0261.xmp
>
Quick correction, the correct URL is:

https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/images/20210813-0091.xmp



Re: Tika for parsing raw XMP

Posted by Tim Allison <ta...@apache.org>.
If you want a bunch of XMPs to work with:
https://corpora.tika.apache.org/base/xmps/


Re: Tika for parsing raw XMP

Posted by Fabrizio Giudici <fa...@tidalwave.it>.
Thanks for all the replies, which cleared up a number of points for me.

> You are not reinventing the wheel. We only pull out what users have
> requested.

This is the most important confirmation for me, thanks.

> I’ve toyed with pulling out more than we do, but haven’t found enough
> interest to pursue it.

It makes sense. After the first version of my code, following John's
advice I refactored it and ended up with everything in a couple of
classes: the first is a plain SAX handler, the second a simple rule
evaluator that reads a config file in which XPath expressions are
mapped to metadata item names. In this way I can fill a Metadata object
with whatever I need, taken from the text content of the XMP elements or
from their attributes. Now, Tika's contribution to this part is basically
only the Metadata structure, so I focused on it. It's "flattened", so I
thought about whether this feature is a plus, neutral, or a problem.
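
A minimal sketch of what such a rule evaluator could look like, using only the JDK's DOM and XPath APIs. It is not the code from the repository: the class name, the prefix table and the rules in main() are hypothetical, the rules are passed as an in-memory map rather than read from a config file, and the result goes into a plain Map instead of a Tika Metadata object.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

final class XPathRuleEvaluator {
    // Prefixes usable in the rules (an assumed table); they are resolved by
    // namespace URI, so the prefixes used inside the XMP file do not matter.
    private static final Map<String, String> NAMESPACES = Map.of(
            "rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
            "dc", "http://purl.org/dc/elements/1.1/",
            "exif", "http://ns.adobe.com/exif/1.0/");

    private static final NamespaceContext CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return NAMESPACES.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
        }
        @Override public String getPrefix(String namespaceUri) {
            return null; // reverse lookup is not needed to evaluate expressions
        }
        @Override public Iterator<String> getPrefixes(String namespaceUri) {
            return Collections.emptyIterator();
        }
    };

    /** Evaluates each rule (XPath expression -> item name) against the XMP document. */
    static Map<String, String> evaluate(byte[] xmp, Map<String, String> rules) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document document = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmp));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(CONTEXT);
            Map<String, String> items = new LinkedHashMap<>();
            for (Map.Entry<String, String> rule : rules.entrySet()) {
                // String value of the first matching node, or "" when nothing matches.
                String value = xpath.evaluate(rule.getKey(), document);
                if (!value.isEmpty()) {
                    items.put(rule.getValue(), value);
                }
            }
            return items;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot evaluate rules", e);
        }
    }

    public static void main(String[] args) {
        String xmp = "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
                + "<rdf:Description rdf:about='' xmlns:exif='http://ns.adobe.com/exif/1.0/'"
                + " exif:FocalLength='500/10'/></rdf:RDF>";
        Map<String, String> rules = Map.of("//rdf:Description/@exif:FocalLength", "Focal Length");
        System.out.println(evaluate(xmp.getBytes(StandardCharsets.UTF_8), rules));
    }
}
```

Keeping the expressions in data rather than in code means a new metadata item only needs a new rule, at the price of building a DOM for each file.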

My first need was to write a simple tool to run a consistency check
on my photo metadata, where I had messed up the lens names and then
fixed them manually - basically I needed to check whether the focal
length was compatible with the lens name. For this task the flattened
structure of Tika Metadata was a plus, allowing me to get the job done
with a few lines of code.
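
A sketch of the kind of check described above, under stated assumptions: the lens name contains its focal range in a form like "24-70mm" or "16mm", and the focal length comes as an XMP rational such as "500/10". The class, the method names and the half-millimetre tolerance are hypothetical; the actual tool is not shown in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LensConsistencyCheck {
    // Matches "24-70mm", "24-70 mm" or "16mm"; group 2 is absent for prime lenses.
    private static final Pattern FOCAL_RANGE = Pattern.compile(
            "(\\d+(?:\\.\\d+)?)(?:-(\\d+(?:\\.\\d+)?))?\\s*mm", Pattern.CASE_INSENSITIVE);

    /** Parses an XMP rational such as "500/10"; plain numbers are accepted too. */
    static double parseRational(String value) {
        String[] parts = value.trim().split("/");
        return parts.length == 2
                ? Double.parseDouble(parts[0]) / Double.parseDouble(parts[1])
                : Double.parseDouble(parts[0]);
    }

    /** Returns false only when the lens name states a range that excludes the focal length. */
    static boolean isCompatible(String lensName, String focalLength) {
        Matcher matcher = FOCAL_RANGE.matcher(lensName);
        if (!matcher.find()) {
            return true; // no range in the name, so nothing can be checked
        }
        double min = Double.parseDouble(matcher.group(1));
        double max = matcher.group(2) != null ? Double.parseDouble(matcher.group(2)) : min;
        double focal = parseRational(focalLength);
        return focal >= min - 0.5 && focal <= max + 0.5; // tolerate rounding in the metadata
    }

    public static void main(String[] args) {
        System.out.println(isCompatible("FE 24-70mm F2.8 GM", "350/10"));
        System.out.println(isCompatible("FE 24-70mm F2.8 GM", "2000/10"));
    }
}
```

With a flattened key/value structure, the two inputs are just two lookups by item name, which is why the flat model is convenient for this task.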

But my next step is to store XMP metadata in a semantic triple store...
Given that RDF is the common ground between XMP and triple stores, passing
through Tika Metadata doesn't make sense. It can't even support metadata
that is structured by nature. Jempbox has support for things such as
history, but e.g. the Photo Supreme DAM uses a specific schema for its
hierarchical keywords that also includes attributes (e.g. you can have a
keyword that refers to a mountain and have its GPS coordinates too; you
can even have relationships between keywords, so it's more a graph thing
than a simple tree).
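
To illustrate why going straight from XMP to triples is natural, here is a minimal JDK-only sketch that reads the property attributes of each rdf:Description element as (subject, predicate, object) statements. It is an assumption-laden illustration, not a full RDF/XML parser: nested property elements, bags, sequences and literal escaping are ignored, and the class name and the N-Triples-like output format are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class XmpTriples {
    private static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns one "<subject> <predicate> "object" ." line per property attribute. */
    static List<String> triples(byte[] xmp) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document document = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmp));
            List<String> result = new ArrayList<>();
            NodeList descriptions = document.getElementsByTagNameNS(RDF_NS, "Description");
            for (int i = 0; i < descriptions.getLength(); i++) {
                Element description = (Element) descriptions.item(i);
                // rdf:about names the subject; it is often empty in sidecar files.
                String subject = description.getAttributeNS(RDF_NS, "about");
                NamedNodeMap attributes = description.getAttributes();
                for (int j = 0; j < attributes.getLength(); j++) {
                    Node attribute = attributes.item(j);
                    String namespace = attribute.getNamespaceURI();
                    // Skip rdf:* attributes, xmlns declarations and unqualified attributes.
                    if (namespace == null || namespace.equals(RDF_NS)
                            || namespace.equals(XMLConstants.XMLNS_ATTRIBUTE_NS_URI)) {
                        continue;
                    }
                    // The predicate is the namespace URI concatenated with the local name.
                    result.add("<" + subject + "> <" + namespace + attribute.getLocalName()
                            + "> \"" + attribute.getNodeValue() + "\" .");
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot read XMP", e);
        }
    }

    public static void main(String[] args) {
        String xmp = "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
                + "<rdf:Description rdf:about='urn:example:photo'"
                + " xmlns:exif='http://ns.adobe.com/exif/1.0/' exif:FocalLength='500/10'/>"
                + "</rdf:RDF>";
        triples(xmp.getBytes(StandardCharsets.UTF_8)).forEach(System.out::println);
    }
}
```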

OTOH Tika satisfies my requirements for JPEGs, so I will incorporate it
in another project. I think I'll also use it for music and video, though
I still have to test that.

In the end this is consistent with the other users' expectations about
XMP, as you said. Rather than the fact of being textual, what makes XMP
so different is the possible complexity of the data structure _and_ the
kind of use you might want to make of it...

As far as JPEG is concerned, Tika fits my needs perfectly. Music and
video: I'll test later, but I expect it to work well too.



Re: Tika for parsing raw XMP

Posted by Tim Allison <ta...@apache.org>.
Another point on XMP and Tika: XMP can contain JPEGs and other binary
formats, so it makes sense to handle XMP in the Tika framework.


Re: Tika for parsing raw XMP

Posted by Tim Allison <ta...@apache.org>.
You are not reinventing the wheel. We only pull out what users have
requested. I’ve toyed with pulling out more than we do, but haven’t found
enough interest to pursue it.

I’ve gathered a bunch of XMPs extracted from our 1TB corpus. I’ll try to
dig up a link when I’m back at a keyboard.

As for an earlier point on this thread (not made by you) that Tika is only
for binary formats, I strongly disagree, at least for XMP. XMP is integral
to PDF and PSD, and it also exists as a standalone sidecar. We should
normalize and extract what we can. Obviously, if you have custom needs,
yes, break out your own XML parser, but we should do better in Tika.


Re: Tika for parsing raw XMP

Posted by Fabrizio Giudici <fa...@tidalwave.it>.
On 21/08/21 15:48, Tim Allison wrote:

> As you saw, we’re currently parsing embedded xmp w
> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-xmp-commons/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java 
>
> I think I added hooks for custom xmp parsing when embedded in a pdf.
>
> Is your primary issue that Tika is treating unembedded xmp as regular xml?
>
> I think it would be great if we were pulling more info out of xmp 
> embedded or not and would be happy to review your code.

Thanks. So let me recap, also with the help of some code that I've just 
committed.

This is a test XMP that I'm using as a data source. It has been produced 
by the DAM app Photo Supreme and, as a typical XMP sidecar, contains 
both info that the application has extracted from the original file (a 
Sony ARW) and data that I've manually entered:

https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/images/20180520-0261.xmp

This is what I've been able to extract (in the form of a textual dump)
with - spoiler alert - a quick and dirty custom parser. It's only a subset
of the metadata items in the original XMP, given the roughness of the
parser, but it's a good start for me (and in any case it has already
solved a problem of mine).

https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-dump.txt

First approach I've tried:

metadata.set(Metadata.CONTENT_TYPE, "application/xml");
final ImageMetadataExtractor ime = new ImageMetadataExtractor(metadata);
ime.parseRawXMP(bytes);

But this only gave me a handful of DC items:

https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-alt-dump.txt

Second attempt:

try (final InputStream is = new ByteArrayInputStream(bytes))
  {
    new JempboxExtractor(metadata).parse(is);
  }

with the trick of wrapping the byte content inside an xpacket marker.
Basically the same results as above:

https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-jempbox-dump.txt
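
The wrapping code itself is not shown in the email. As a rough sketch of what such a trick could look like, the following JDK-only helper surrounds a sidecar's content with the standard XMP packet header and trailer processing instructions. The class and method names are hypothetical, UTF-8 input is assumed, and a leading XML declaration is dropped because it cannot legally follow the packet header.

```java
import java.nio.charset.StandardCharsets;

final class XPacketWrapper {
    // Packet header and trailer as defined by the XMP packet wrapper convention.
    private static final String HEADER =
            "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n";
    private static final String TRAILER = "\n<?xpacket end=\"w\"?>";

    /** Wraps sidecar content in an xpacket; content that is already wrapped is returned as is. */
    static byte[] wrap(byte[] xmp) {
        String text = new String(xmp, StandardCharsets.UTF_8); // assumes UTF-8 sidecars
        if (text.contains("<?xpacket begin=")) {
            return xmp;
        }
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1); // drop a byte order mark, the header carries its own
        }
        text = text.trim();
        if (text.startsWith("<?xml ")) {
            // An XML declaration is only valid at the very start of a document.
            text = text.substring(text.indexOf("?>") + 2).trim();
        }
        return (HEADER + text + TRAILER).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] sidecar = "<?xml version=\"1.0\"?>\n<x:xmpmeta xmlns:x=\"adobe:ns:meta/\"/>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(wrap(sidecar), StandardCharsets.UTF_8));
    }
}
```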

If I understand the Tika code correctly, Jempbox is basically used to
create a DOM that is later processed, but only DC (Dublin Core) and MM
(XMP Media Management) are copied to metadata. I see handlers whose names
seem to suggest that they copy all tags, but they are not used by parse().

So what I tried in the end is a quick and dirty custom parser that copies
all attributes of the elements in the XMP; this is the relevant code in
the handler:

public void startElement (String uri, String localName, String qName, Attributes attributes)
  {
    for (int i = 0; i < attributes.getLength(); i++)
      {
        // FIXME: this assumes QName is using the standard prefix (e.g. 'exif'). More robust code
        // should instead read the namespace and translate to a prefix.
        final String key = attributes.getQName(i);
        final String value = attributes.getValue(i);

        try
          {
            metadata.add(key, value);
          }
        catch (PropertyTypeException e)
          {
            log.error("{}: {}", e.toString(), key);
          }
      }
  }

Full code here:

https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/main/java/it/tidalwave/bluemarine2/mediascanner/impl/tika/XmpParser.java
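
One possible way to address the FIXME in the handler, sketched with the JDK's SAX parser only: read the namespace URI of each attribute and translate it to a conventional prefix, so that the key does not depend on the prefix chosen by the application that wrote the file. The class name and the prefix table are assumptions, and a plain Map stands in for Tika's Metadata so that the example runs without Tika on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class XmpAttributeCollector extends DefaultHandler {
    // Namespace URI -> conventional prefix (a small, assumed subset).
    private static final Map<String, String> STANDARD_PREFIXES = Map.of(
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#", "rdf",
            "http://purl.org/dc/elements/1.1/", "dc",
            "http://ns.adobe.com/xap/1.0/", "xmp",
            "http://ns.adobe.com/exif/1.0/", "exif",
            "http://ns.adobe.com/exif/1.0/aux/", "aux",
            "http://ns.adobe.com/tiff/1.0/", "tiff");

    private final Map<String, String> items = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        for (int i = 0; i < attributes.getLength(); i++) {
            String prefix = STANDARD_PREFIXES.get(attributes.getURI(i));
            // Fall back to the qualified name as written in the file for unknown namespaces.
            String key = (prefix != null)
                    ? prefix + ":" + attributes.getLocalName(i)
                    : attributes.getQName(i);
            items.put(key, attributes.getValue(i));
        }
    }

    /** Parses the XMP bytes and returns every element attribute keyed by normalized name. */
    static Map<String, String> collect(byte[] xmp) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // required for getURI() and getLocalName()
            XmpAttributeCollector handler = new XmpAttributeCollector();
            factory.newSAXParser().parse(new ByteArrayInputStream(xmp), handler);
            return handler.items;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse XMP", e);
        }
    }

    public static void main(String[] args) {
        // The file uses the non-standard prefix 'e' for the EXIF namespace.
        String xmp = "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
                + "<rdf:Description rdf:about='' xmlns:e='http://ns.adobe.com/exif/1.0/'"
                + " e:FocalLength='500/10'/></rdf:RDF>";
        System.out.println(collect(xmp.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Like the original handler, this only sees properties written as attributes; values stored as element text or inside rdf:Bag and rdf:Seq containers would need characters() handling as well.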

(Forgive me for the long URLs with the commit id, but this way I can
keep working on my source repo without breaking the references in this
email.)

Now what I'd basically like to confirm is that I'm not reinventing the
wheel; in other words, that there's no code inside Tika that already
extracts this information from an XMP sidecar. If this is confirmed, I
can proceed on this path.

Thanks.



Re: Tika for parsing raw XMP

Posted by Tim Allison <ta...@apache.org>.
As you saw, we’re currently parsing embedded XMP with
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-xmp-commons/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java

I think I added hooks for custom XMP parsing when embedded in a PDF.

Is your primary issue that Tika is treating unembedded XMP as regular XML?

I think it would be great if we were pulling more info out of XMP, embedded
or not, and I would be happy to review your code.


Re: Tika for parsing raw XMP

Posted by Fabrizio Giudici <fa...@tidalwave.it>.
John, thanks for your comment. Correctly understanding the scope of Tika 
is part of the things I have to do, so I'm waiting for other Tika people 
to confirm.

In my understanding Tika also supports textual files (there are XML 
parsers inside, XMP is at least partially supported when embedded e.g. 
in a JPEG file, etc...), but I could be wrong.

I know XMP is XML, but the schema is not trivial when it comes to the
representation of certain structured properties (see below), so a
specific Java data model is required and it would be nice to find one
available in a library. I know there are other libraries supporting it
(such as metadata-extractor, which is used by Tika) which I've already
used in the past. Given that I have to deal with multiple file formats
(photo, music, etc...) it would be nice to have a single "umbrella" API
- also because this is an Apache project, with the usual governance
model, so you get ample advance warning when it is going to reach
end of life - while many projects out there often come to a stop without
warning.

Back to the original topic...

At the moment I've been able to write a custom parser starting from
AbstractParser and taking advantage of XMPContentHandler. It's quite
rough, but it retrieves most of the obvious tags (including the ones I
need now for a specific task). I need to understand whether I've just
duplicated stuff that is already inside Tika, or whether I have properly
extended Tika with a missing feature, or whether I'm stretching it too far.

A potential problem - which is not urgent now - is that I don't know how
Tika should deal with complex XMP properties such as hierarchical
properties, given that it tends to flatten everything.



Re: Tika for parsing raw XMP

Posted by John Ulric <uj...@gmail.com>.
Fabrizio:

I'm not a specialist in Tika, but XMP files are plain XML, and pretty well
standardised, so you probably wouldn't need Tika to read these. Just use
any old XML parser (from the JDK's standard library, Saxon …) and filter out
the values you need. I don't know if the Tika team agree, but I see Tika
primarily as a tool to extract information from binary data.

Cheers
John