Posted to user@tika.apache.org by Fabrizio Giudici <fa...@tidalwave.it> on 2021/08/20 17:10:16 UTC
Tika for parsing raw XMP
Hello.
Since this is my first message here I think it probably makes sense to
write a couple of lines of personal introduction.
I've worked with Java and metadata in the past, both for customers
and for my own pleasure (to manage my stuff), in several stints
separated by years-long gaps when I did other things. Every time I
get back to the topic I first look around and update my code with the
latest libraries available. In the past I've worked with ImageIO, Drew
Noakes' Metadata Extractor, mp3agic and other stuff, and I even wrote my
own codecs for camera RAW files. Personal introduction ends here.
At the moment I have three pet projects dealing with both music and
photos. I'd like to get rid of old libraries (including some patches and
forks and my own stuff) and converge on Tika if possible.
Now it was pretty easy to extract metadata with Tika from JPEG files,
but after many different attempts I'm still clueless about "naked" XMP
files (I only get a very small bunch of DC stuff). Those XMP files have
been generated by camera RAW applications, such as Lightroom and Photo
Supreme, and they are packed with tons of stuff - including all the EXIF
metadata. I've searched the Javadoc, StackOverflow and the Tika Wiki, but
I was unable to find a simple working example.
AutoDetectParser, which works with JPEGs, doesn't do the job; I've also
tried XMLProfiler and messed around with
ImageMetadataExtractor.parseRawXMP(), but to no avail.
So, please, let me have a hint...
Thanks.
--
Fabrizio Giudici - Java Architect @ Tidalwave s.a.s.
"We make Java work. Everywhere."
http://tidalwave.it/fabrizio/blog - fabrizio.giudici@tidalwave.it
Re: Tika for parsing raw XMP
Posted by Fabrizio Giudici <fa...@tidalwave.it>.
On 21/08/21 23:17, Fabrizio Giudici wrote:
>
> This is a test XMP that I'm using as a data source. It has been
> produced by the DAM app Photo Supreme and, as a typical XMP sidecar,
> contains both info that the application has extracted from the
> original file (a Sony ARW) and data that I've manually entered:
>
> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/images/20180520-0261.xmp
>
Quick correction, the correct URL is:
https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/images/20210813-0091.xmp
Re: Tika for parsing raw XMP
Posted by Tim Allison <ta...@apache.org>.
If you want a bunch of XMPs to work with:
https://corpora.tika.apache.org/base/xmps/
On Sun, Aug 22, 2021 at 3:25 PM Tim Allison <ta...@apache.org> wrote:
>
> Other point on xmps and Tika… xmp can contain jpegs and other binary formats. So it makes sense to handle these in the Tika framework.
>
> On Sun, Aug 22, 2021 at 3:15 PM Tim Allison <ta...@apache.org> wrote:
>>
>> You are not reinventing the wheel. We only pull out what users have requested. I’ve toyed w pulling out more than we do, but haven’t found enough interest to pursue it.
>>
>> I’ve gathered a bunch of xmps extracted from our 1TB corpus. I’ll try to dig up a link when I’m back to a keyboard.
>>
>> As for an earlier point on this thread (not made by you) that Tika is only for binary formats, I strongly disagree at least for XMP. XMP is integral to pdf and psd and as standalone sidecar. We should normalize and extract what we can. Obv if you have custom needs, yes, break out your own xml parser, but we should do better in Tika.
>>
>> On Sat, Aug 21, 2021 at 5:17 PM Fabrizio Giudici <fa...@tidalwave.it> wrote:
>>>
>>> On 21/08/21 15:48, Tim Allison wrote:
>>>
>>> As you saw, we’re currently parsing embedded xmp w
>>> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-xmp-commons/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java
>>>
>>> I think I added hooks for custom xmp parsing when embedded in a pdf.
>>>
>>> Is your primary issue that Tika is treating unembedded xmp as regular xml?
>>>
>>> I think it would be great if we were pulling more info out of xmp embedded or not and would be happy to review your code.
>>>
>>> Thanks. So let me recap, also with the help of some code that I've just committed.
>>>
>>> This is a test XMP that I'm using as a data source. It has been produced by the DAM app Photo Supreme and, as a typical XMP sidecar, contains both info that the application has extracted from the original file (a Sony ARW) and data that I've manually entered:
>>>
>>> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/images/20180520-0261.xmp
>>>
>>> This is what I've been able to extract (in form of textual dump) with - spoiler alert - a quick and dirty custom parser. It's only a subset of the metadata items in the original XMP, given the roughness of the parser, but it's a good start for me (and in any case it already resolved a problem of mine).
>>>
>>> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-dump.txt
>>>
>>> First approach I've tried:
>>>
>>> metadata.set(Metadata.CONTENT_TYPE, "application/xml");
>>> final ImageMetadataExtractor ime = new ImageMetadataExtractor(metadata);
>>> ime.parseRawXMP(bytes);
>>>
>>> But this just made me get a small bunch of DC items:
>>>
>>> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-alt-dump.txt
>>>
>>> Second attempt:
>>>
>>> try (final InputStream is = new ByteArrayInputStream(bytes))
>>> {
>>> new JempboxExtractor(metadata).parse(is);
>>> }
>>>
>>> with the trick of wrapping the bytes content inside an xpacket marker. Basically same results as above:
>>>
>>> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-jempbox-dump.txt
>>>
>>> If I correctly understand Tika code, basically Jempbox is used to create a DOM that is later processed, but only DC and MM are copied to metadata. I see handlers whose name seem to suggest that they copy all tags, but they are not used by parse().
>>>
>>> So in the end I tried is a quick and dirty custom parser that copies all attributes of the elements in the XMP; this is the relevant code in the handler:
>>>
>>> public void startElement (String uri, String localName, String qName, Attributes attributes)
>>> {
>>> for (int i = 0; i < attributes.getLength(); i++)
>>> {
>>> // FIXME: this assumes QName is using the standard prefix (e.g. 'exif'). More robust code
>>> // should instead read the namespace and translate to a prefix.
>>> final String key = attributes.getQName(i);
>>> final String value = attributes.getValue(i);
>>>
>>> try
>>> {
>>> metadata.add(key, value);
>>> }
>>> catch (PropertyTypeException e)
>>> {
>>> log.error("{}: {}", e.toString(), key);
>>> }
>>> }
>>> }
>>>
>>> Full code here:
>>>
>>> https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/main/java/it/tidalwave/bluemarine2/mediascanner/impl/tika/XmpParser.java
>>>
>>> (Forgive me for the long URLs with the commit id, but in this way I can make further work on my source repo without jeopardizing the references of this email.)
>>>
>>> Now the basic thing that I'd like to know is that I'm not reinventing the wheel; in other words, there's no code inside Tika that is extracting this information from a XMP sidecar. If this is confirmed, I can proceed on this path.
>>>
>>> Thanks.
>>>
>>>
>>> --
>>> Fabrizio Giudici - Java Architect @ Tidalwave s.a.s.
>>> "We make Java work. Everywhere."
>>> http://tidalwave.it/fabrizio/blog - fabrizio.giudici@tidalwave.it
Re: Tika for parsing raw XMP
Posted by Fabrizio Giudici <fa...@tidalwave.it>.
Thanks for all the replies, which clarified a number of points for me.
> You are not reinventing the wheel. We only pull out what users have
> requested.
This is the most important confirmation for me, thanks.
> I’ve toyed w pulling out more than we do, but haven’t found enough
> interest to pursue it.
It makes sense. After the first version of my code, following John's
advice I refactored it and ended up with everything in a couple of
classes: the first a plain SAX handler, the second a simple rule
evaluator that reads a config file where XPath expressions are mapped
to metadata item names. In this way I can fill a Metadata with whatever
I need, from the text content of the XMP elements or from their
attributes. Now, Tika's contribution to this part is basically only the
Metadata structure, so I focused on it. It's "flattened", so I thought
about whether this feature is an advantage, neutral, or a problem.
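The rule evaluator described above could look roughly like the following. This is only a minimal sketch using the JDK's own XPath support, not the code from the bluemarine2 repository: the class name `XPathMetadataRules`, the rule map (standing in for the config file) and the namespace prefix table are all assumptions made for illustration, and the result is a plain `Map` rather than a Tika `Metadata`.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XPathMetadataRules {
    // Assumption: prefixes usable inside the rule expressions, bound to their namespace URIs.
    private static final Map<String, String> NAMESPACES = Map.of(
        "rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "exif", "http://ns.adobe.com/exif/1.0/",
        "aux", "http://ns.adobe.com/exif/1.0/aux/");

    /** Evaluates each rule (XPath expression -> metadata item name) against the XMP text. */
    static Map<String, String> apply(String xml, Map<String, String> rules) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required, otherwise prefixed XPath steps never match
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return NAMESPACES.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
                }
                @Override public String getPrefix(String uri) { return null; }
                @Override public Iterator<String> getPrefixes(String uri) { return null; }
            });

            Map<String, String> result = new LinkedHashMap<>();
            for (Map.Entry<String, String> rule : rules.entrySet()) {
                // evaluate() returns the string value of the first match, or "" when nothing matches
                String value = xpath.evaluate(rule.getKey(), doc);
                if (!value.isEmpty()) {
                    result.put(rule.getValue(), value);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot evaluate rules", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
            + "<rdf:Description rdf:about='' xmlns:exif='http://ns.adobe.com/exif/1.0/'"
            + " xmlns:aux='http://ns.adobe.com/exif/1.0/aux/' exif:FocalLength='500/10'>"
            + "<aux:Lens>FE 50mm F1.8</aux:Lens></rdf:Description></rdf:RDF>";
        System.out.println(apply(xml, Map.of(
            "//rdf:Description/@exif:FocalLength", "focalLength",
            "//rdf:Description/aux:Lens", "lens")));
    }
}
```

Because the rules are resolved by namespace URI rather than by the prefix used in the file, the same rule works whether the property is serialized as an attribute or as a child element only if a separate expression is written for each form; both forms appear in the example.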
My first need was to write a simple tool to perform a consistency check
on my photo metadata, where I had messed up the lens names and then
fixed them manually - basically I needed to check whether the focal
length was compatible with the lens name. For this task the flattened
structure of Tika Metadata was a plus, allowing me to accomplish the
task with a few lines of code.
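To give an idea of what such a check can look like once the values are available as flat strings, here is a minimal sketch. It is not the actual tool: the class name `LensConsistencyCheck`, the assumption that the lens name embeds its focal range as in "FE 24-70mm F2.8 GM", the EXIF-style rational format for the focal length ("350/10") and the half-millimetre tolerance are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LensConsistencyCheck {
    // Matches "50mm" or "24-70mm" (optionally with decimals) inside a lens name.
    private static final Pattern FOCAL_RANGE =
        Pattern.compile("(\\d+(?:\\.\\d+)?)(?:-(\\d+(?:\\.\\d+)?))?\\s*mm");

    /** Parses an EXIF-style rational such as "350/10", or a plain number such as "35". */
    static double parseRational(String value) {
        String[] parts = value.trim().split("/");
        return parts.length == 2
            ? Double.parseDouble(parts[0]) / Double.parseDouble(parts[1])
            : Double.parseDouble(parts[0]);
    }

    /** Returns false only when the lens name states a focal range that excludes the focal length. */
    static boolean isCompatible(String lensName, String focalLength) {
        Matcher m = FOCAL_RANGE.matcher(lensName);
        if (!m.find()) {
            return true; // no focal range in the name, so nothing to contradict
        }
        double min = Double.parseDouble(m.group(1));
        double max = m.group(2) != null ? Double.parseDouble(m.group(2)) : min;
        double focal = parseRational(focalLength);
        return focal >= min - 0.5 && focal <= max + 0.5; // assumed tolerance for rounding
    }

    public static void main(String[] args) {
        System.out.println(isCompatible("FE 24-70mm F2.8 GM", "350/10"));
        System.out.println(isCompatible("FE 50mm F1.8", "850/10"));
    }
}
```

With a flattened structure, the two arguments would simply be two lookups by key on the same metadata object, which is why this kind of task stays short.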
But my next step is to store XMP metadata in a semantic triple store...
Given that RDF is the common ground between XMP and triple stores,
passing through Tika Metadata doesn't make sense. It can't even support
metadata that is structured by nature (Jempbox has got support for stuff
such as history, but e.g. the Photo Supreme DAM uses a specific schema
for its hierarchical keywords that also includes attributes: you can
have a keyword that refers to a mountain and have its GPS coordinates
too; you can even have relationships between keywords, so it's more a
graph thing than a simple tree).
OTOH Tika satisfies my requirements for JPEGs, so I will incorporate it
in another project. I think I'll use it also for music and video, even
though I'll test it later.
In the end this is consistent with the other users' expectations about
XMP, as you said. Rather than the fact of being textual, what makes XMP
so different is the possible complexity of the data structure _and_ the
kind of use you might want to make of it...
As for JPEG, Tika perfectly fits my needs. Music and video:
I'll test later, but I think it will be good as well.
On 22/08/21 21:25, Tim Allison wrote:
> Other point on xmps and Tika… xmp can contain jpegs and other binary
> formats. So it makes sense to handle these in the Tika framework.
>
> On Sun, Aug 22, 2021 at 3:15 PM Tim Allison <tallison@apache.org> wrote:
>
> You are not reinventing the wheel. We only pull out what users
> have requested. I’ve toyed w pulling out more than we do, but
> haven’t found enough interest to pursue it.
>
> I’ve gathered a bunch of xmps extracted from our 1TB corpus. I’ll
> try to dig up a link when I’m back to a keyboard.
>
> As for an earlier point on this thread (not made by you) that Tika
> is only for binary formats, I strongly disagree at least for XMP.
> XMP is integral to pdf and psd and as standalone sidecar. We
> should normalize and extract what we can. Obv if you have custom
> needs, yes, break out your own xml parser, but we should do better
> in Tika.
>
Re: Tika for parsing raw XMP
Posted by Tim Allison <ta...@apache.org>.
Other point on xmps and Tika… xmp can contain jpegs and other binary
formats. So it makes sense to handle these in the Tika framework.
Re: Tika for parsing raw XMP
Posted by Tim Allison <ta...@apache.org>.
You are not reinventing the wheel. We only pull out what users have
requested. I’ve toyed w pulling out more than we do, but haven’t found
enough interest to pursue it.
I’ve gathered a bunch of xmps extracted from our 1TB corpus. I’ll try to
dig up a link when I’m back to a keyboard.
As for an earlier point on this thread (not made by you) that Tika is only
for binary formats, I strongly disagree at least for XMP. XMP is integral
to pdf and psd and as standalone sidecar. We should normalize and extract
what we can. Obv if you have custom needs, yes, break out your own xml
parser, but we should do better in Tika.
Re: Tika for parsing raw XMP
Posted by Fabrizio Giudici <fa...@tidalwave.it>.
On 21/08/21 15:48, Tim Allison wrote:
> As you saw, we’re currently parsing embedded xmp w
> https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-xmp-commons/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java
>
> I think I added hooks for custom xmp parsing when embedded in a pdf.
>
> Is your primary issue that Tika is treating unembedded xmp as regular xml?
>
> I think it would be great if we were pulling more info out of xmp
> embedded or not and would be happy to review your code.
Thanks. So let me recap, also with the help of some code that I've just
committed.
This is a test XMP that I'm using as a data source. It has been produced
by the DAM app Photo Supreme and, as a typical XMP sidecar, contains
both info that the application has extracted from the original file (a
Sony ARW) and data that I've manually entered:
https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/images/20180520-0261.xmp
This is what I've been able to extract (in the form of a textual dump)
with - spoiler alert - a quick and dirty custom parser. It's only a
subset of the metadata items in the original XMP, given the roughness
of the parser, but it's a good start for me (and in any case it already
resolved a problem of mine).
https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-dump.txt
First approach I've tried:
    metadata.set(Metadata.CONTENT_TYPE, "application/xml");
    final ImageMetadataExtractor ime = new ImageMetadataExtractor(metadata);
    ime.parseRawXMP(bytes);
But this just made me get a small bunch of DC items:
https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-alt-dump.txt
Second attempt:
    try (final InputStream is = new ByteArrayInputStream(bytes))
    {
        new JempboxExtractor(metadata).parse(is);
    }
with the trick of wrapping the bytes content inside an xpacket marker.
Basically same results as above:
https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/test/resources/expected-results/20210813-0091.xmp-jempbox-dump.txt
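The wrapping trick mentioned above can be sketched as follows. The header and trailer are the standard XMP packet wrapper processing instructions; everything else (the class name `XPacketWrapper`, the assumption that the sidecar is UTF-8, and the decision to drop a leading XML declaration because it cannot legally follow the xpacket header) is an assumption for illustration and not necessarily what the code in the repository does.

```java
import java.nio.charset.StandardCharsets;

final class XPacketWrapper {
    // Standard XMP packet header and trailer; the id value is fixed by the XMP specification.
    private static final String HEADER =
        "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n";
    private static final String TRAILER = "\n<?xpacket end=\"w\"?>";

    /** Wraps the bytes of a "naked" XMP sidecar in an xpacket, unless they already contain one. */
    static byte[] wrap(byte[] xmp) {
        String body = new String(xmp, StandardCharsets.UTF_8); // assumption: UTF-8 sidecar
        if (body.contains("<?xpacket")) {
            return xmp; // already a packet, leave untouched
        }
        if (body.startsWith("\uFEFF")) {
            body = body.substring(1); // drop a leading byte order mark
        }
        if (body.startsWith("<?xml")) {
            // An XML declaration is only valid at the very start of a document, so remove it.
            body = body.substring(body.indexOf("?>") + 2);
        }
        return (HEADER + body.trim() + TRAILER).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] wrapped = wrap("<x:xmpmeta xmlns:x='adobe:ns:meta/'/>".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(wrapped, StandardCharsets.UTF_8));
    }
}
```

The wrapped bytes can then be passed to the `ByteArrayInputStream` used in the second attempt.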
If I understand the Tika code correctly, Jempbox is basically used to
create a DOM that is later processed, but only DC and MM are copied to
metadata. I see handlers whose names seem to suggest that they copy all
tags, but they are not used by parse().
So in the end what I tried is a quick and dirty custom parser that
copies all attributes of the elements in the XMP; this is the relevant
code in the handler:
    public void startElement (String uri, String localName, String qName, Attributes attributes)
    {
        for (int i = 0; i < attributes.getLength(); i++)
        {
            // FIXME: this assumes QName is using the standard prefix (e.g. 'exif'). More robust code
            // should instead read the namespace and translate to a prefix.
            final String key = attributes.getQName(i);
            final String value = attributes.getValue(i);

            try
            {
                metadata.add(key, value);
            }
            catch (PropertyTypeException e)
            {
                log.error("{}: {}", e.toString(), key);
            }
        }
    }
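The FIXME above suggests reading the namespace instead of trusting the prefix. A minimal sketch of that idea, using only the JDK's SAX parser, could look like this. It is an illustration and not the code from the repository: the class name `XmpAttributeCollector` and the small URI-to-prefix table are assumptions, attributes in unlisted namespaces are simply skipped, and values go into a plain `Map` instead of a Tika `Metadata`.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class XmpAttributeCollector extends DefaultHandler {
    // Assumption: a hand-picked table from namespace URI to its conventional prefix.
    private static final Map<String, String> STANDARD_PREFIXES = Map.of(
        "http://ns.adobe.com/exif/1.0/", "exif",
        "http://ns.adobe.com/tiff/1.0/", "tiff",
        "http://ns.adobe.com/xap/1.0/", "xmp",
        "http://purl.org/dc/elements/1.1/", "dc");

    private final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        for (int i = 0; i < attributes.getLength(); i++) {
            // Key on the namespace URI, so a file using e.g. 'e:' instead of 'exif:' still works.
            String prefix = STANDARD_PREFIXES.get(attributes.getURI(i));
            if (prefix != null) {
                values.put(prefix + ":" + attributes.getLocalName(i), attributes.getValue(i));
            }
        }
    }

    /** Parses the XMP text and returns the collected attributes under normalized keys. */
    static Map<String, String> collect(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getURI() and getLocalName() to be filled in
            XmpAttributeCollector handler = new XmpAttributeCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.values;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse XMP", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
            + "<rdf:Description rdf:about='' xmlns:e='http://ns.adobe.com/exif/1.0/' e:FNumber='28/10'/>"
            + "</rdf:RDF>";
        System.out.println(collect(xml));
    }
}
```

Like the handler above, this only looks at attributes; properties serialized as child elements would need additional handling in `characters()` and `endElement()`.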
Full code here:
https://bitbucket.org/tidalwave/bluemarine2-src/src/12d0879cd72f504151ca625a85be1abf24e390b5/modules/MediaScanner/src/main/java/it/tidalwave/bluemarine2/mediascanner/impl/tika/XmpParser.java
(Forgive me for the long URLs with the commit id, but in this way I can
do further work on my source repo without jeopardizing the references
in this email.)
Now the basic thing I'd like to confirm is that I'm not reinventing
the wheel; in other words, that there's no code inside Tika that
already extracts this information from an XMP sidecar. If this is
confirmed, I can proceed on this path.
Thanks.
Re: Tika for parsing raw XMP
Posted by Tim Allison <ta...@apache.org>.
As you saw, we’re currently parsing embedded xmp w
https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-xmp-commons/src/main/java/org/apache/tika/parser/xmp/JempboxExtractor.java
I think I added hooks for custom xmp parsing when embedded in a pdf.
Is your primary issue that Tika is treating unembedded xmp as regular xml?
I think it would be great if we were pulling more info out of xmp embedded
or not and would be happy to review your code.
On Sat, Aug 21, 2021 at 2:53 AM Fabrizio Giudici <
fabrizio.giudici@tidalwave.it> wrote:
> John, thanks for your comment. Correctly understanding the scope of Tika
> is part of the things I have to do, so I'm waiting for other Tika people to
> confirm.
>
> In my understanding Tika also supports textual files (there are XML
> parsers inside, XMP is at least partially supported when embedded e.g. in a
> JPEG file, etc...), but I could be wrong.
>
> I know XMP is XML, but the schema is not trivial for what concerns
> representation of certain structured properties (see below), so a specific
> Java data model is required and it would be nice to find one available in a
> library. I know there are other libraries supporting it (such as
> metadata-extractor, which is used by Tika) which I've already used in the
> past. Given that I have to deal with multiple file formats (photo, music,
> etc...) it would be nice to have a single "umbrella" API - also because
> this is an Apache project, with the usual governance model, so you get a
> well anticipated warning when it is going to reach end of life - while many
> projects out there often get to a stop without a warning.
>
> Back to the original topic...
>
> At the moment I was able to write a custom parser starting from
> AbstractParser and taking advantage of XMPContentHandler. It's quite rough,
> but it retrieves most of the obvious tags (including the ones I need now
> for a specific task). I need to understand whether I've just duplicated
> stuff that is already inside Tika, or whether I have properly extended Tika
> about a missing feature, or whether I'm stressing it too far.
>
> A potential problem - which is not urgent now - is that I don't know how
> Tika should deal with complex XMP properties such as hierarchic properties,
> given that it uses to flatten everything.
> On 20/08/21 19:29, John Ulric wrote:
>
> Fabrizio:
>
> I'm not a specialist in Tika, but XMP files are plain XML, and pretty well
> standardised, so you probably wouldn't need Tika to read these. Just use
> any old XML parser (from JDKs standard library, Saxon …) and filter out the
> values you need. I don't know if the Tika team agree, but I see Tika as a
> tool to extract information from binary data primarily.
>
> Cheers
> John
>
>
> --
> Fabrizio Giudici - Java Architect @ Tidalwave s.a.s.
> "We make Java work. Everywhere."
> http://tidalwave.it/fabrizio/blog - fabrizio.giudici@tidalwave.it
>
>
Re: Tika for parsing raw XMP
Posted by Fabrizio Giudici <fa...@tidalwave.it>.
John, thanks for your comment. Correctly understanding the scope of Tika
is part of the things I have to do, so I'm waiting for other Tika people
to confirm.
In my understanding Tika also supports textual files (there are XML
parsers inside, XMP is at least partially supported when embedded e.g.
in a JPEG file, etc...), but I could be wrong.
I know XMP is XML, but the schema is not trivial when it comes to
representing certain structured properties (see below), so a specific
Java data model is required and it would be nice to find one available
in a library. I know there are other libraries supporting it (such as
metadata-extractor, which is used by Tika) which I've already used in
the past. Given that I have to deal with multiple file formats (photo,
music, etc...) it would be nice to have a single "umbrella" API - also
because this is an Apache project, with the usual governance model, so
you get advance warning when it is going to reach end of life - while
many projects out there often come to a stop without a warning.
Back to the original topic...
At the moment I was able to write a custom parser starting from
AbstractParser and taking advantage of XMPContentHandler. It's quite
rough, but it retrieves most of the obvious tags (including the ones I
need now for a specific task). I need to understand whether I've just
duplicated stuff that is already inside Tika, or whether I have properly
extended Tika with a missing feature, or whether I'm stretching it too
far.
A potential problem - which is not urgent now - is that I don't know how
Tika should deal with complex XMP properties such as hierarchical
properties, given that it flattens everything.
On 20/08/21 19:29, John Ulric wrote:
> Fabrizio:
>
> I'm not a specialist in Tika, but XMP files are plain XML, and pretty
> well standardised, so you probably wouldn't need Tika to read these.
> Just use any old XML parser (from JDKs standard library, Saxon …) and
> filter out the values you need. I don't know if the Tika team agree,
> but I see Tika as a tool to extract information from binary data
> primarily.
>
> Cheers
> John
>
>
Re: Tika for parsing raw XMP
Posted by John Ulric <uj...@gmail.com>.
Fabrizio:
I'm not a specialist in Tika, but XMP files are plain XML, and pretty well
standardised, so you probably wouldn't need Tika to read these. Just use
any old XML parser (from the JDK's standard library, Saxon …) and filter out
the values you need. I don't know if the Tika team agree, but I see Tika as a
tool to extract information from binary data primarily.
Cheers
John