Posted to corpora-dev@tika.apache.org by Tim Allison <ta...@apache.org> on 2021/03/17 15:37:29 UTC

XMPs...all you could possibly want...and more!

All,

  I'm scraping XMPs out of our corpus and placing them here as standalone files:

https://corpora.tika.apache.org/base/xmps/

  I've binned the files roughly based on the container file's mime
type, e.g. https://corpora.tika.apache.org/base/xmps/pdf/

  The process is still running, and I view this as a first draft.
Please let me know if there's anything I can do to make these data
easier to use/more useful or if you see any problems.

  Cheers,

             Tim

Re: XMPs...all you could possibly want...and more!

Posted by Tim Allison <ta...@apache.org>.

If only we had some kind of a corpus we could all share... LOL

Here are two...if I understand correctly.

https://corpora.tika.apache.org/base/docs/bug_trackers/PDFBOX/PDFBOX-4028-0.pdf

<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 3.0-28, framework 1.6'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
xmlns:iX='http://ns.adobe.com/iX/1.0/'>

 <rdf:Description rdf:about='uuid:f6b0dc62-e0f5-11da-9df8-891f95b09a7c'
  xmlns:exif='http://ns.adobe.com/exif/1.0/'>
  <exif:ColorSpace>4294967295</exif:ColorSpace>
  <exif:PixelXDimension>163</exif:PixelXDimension>
  <exif:PixelYDimension>124</exif:PixelYDimension>
 </rdf:Description>
....

https://corpora.tika.apache.org/base/docs/bug_trackers/PDFBOX/PDFBOX-3724-0.pdf
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.2-c003
61.141987, 2011/02/22-12:03:51        ">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/"
    xmlns:plus="http://ns.useplus.org/ldf/xmp/1.0/"
    xmlns:xmp="http://ns.adobe.com/xap/1.0/"
    xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
    xmlns:stEvt="http://ns.adobe.com/xap/1.0/sType/ResourceEvent#"
    xmlns:fwl="http://ns.fotoware.com/iptcxmp-legacy/1.0/"
    xmlns:fwr="http://ns.fotoware.com/iptcxmp-reserved/1.0/"
    xmlns:crs="http://ns.adobe.com/camera-raw-settings/1.0/"
    xmlns:fwc="http://ns.fotoware.com/iptcxmp-custom/1.0/"
    xmlns:fwu="http://ns.fotoware.com/iptcxmp-user/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:Iptc4xmpExt="http://iptc.org/std/Iptc4xmpExt/2008-02-29/"
   photoshop:City="London"
   photoshop:DateCreated="2015-09-24"

Re: XMPs...all you could possibly want...and more!

Posted by Tim Allison <ta...@apache.org>.

Sounds like we might be extracting that info in the following line in Tika?

https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/ImageGraphicsEngine.java#L302

Re: XMPs...all you could possibly want...and more!

Posted by "sahyoun@fileaffairs.de" <sa...@fileaffairs.de>.

Hi Leonard,

attachments won't work at the mailing list - could you upload it to a
public location or send it to me in person?

BR
Maruan 

-- 
-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahyoun@fileaffairs.de
www.fileaffairs.de

Managing Director: Maruan Sahyoun
Commercial register: Düsseldorf Local Court (Amtsgericht), HRB 53837
VAT ID: DE248275827

Re: XMPs...all you could possibly want...and more!

Posted by Leonard Rosenthol <lr...@adobe.com.INVALID>.

Here is one that I have handy where there is XMP on the image...

Re: XMPs...all you could possibly want...and more!

Posted by "sahyoun@fileaffairs.de" <sa...@fileaffairs.de>.

Hi Leonard,

if you could provide a sample document with XMPs attached to various
PDF objects you're interested in I could come up with a quick sample
for Tim.

BR
Maruan 

On Wednesday, 17.03.2021 at 13:39 -0400, Tim Allison wrote:
> Hi Leonard,
>   I'm literally just scraping bytes out of files for now without any
> parsing...so if the XMP is concealed in a compressed stream or
> something more interesting, I'm not grabbing it.  I'm also not
> tracking which XMP is associated with which object.
>   Please forgive me...if I traverse the COSDocument's objects and look
> for /Metadata and grab the stream, will that be what you're looking
> for?  Or, is there a commandline tool I can run to get what you're
> interested in?
>   Thank you.
> 
>   Cheers,
> 
>               Tim
> 
> On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
> <lr...@adobe.com.invalid> wrote:
> > 
> > Are you only pulling document-level XMP?  If so, could you extend
> > it to support object-level metadata as well?   I, for one, would
> > love to get insight into the use of object-level metadata - what
> > objects are they attached to, what are they being used for, etc.
> > 
> > Leonard
> > 

Re: XMPs...all you could possibly want...and more!

Posted by "sahyoun@fileaffairs.de" <sa...@fileaffairs.de>.

Hi Tim,

This is quick sample code that iterates over the pages of a PDF and
reports whether the page itself or any of its image resources contains
metadata.

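// display(...) stands for whatever reporting you need; getXObject() can
// throw an IOException, so call this from code that handles it.
for (PDPage page : document.getPages())
{
    // page-level /Metadata entry
    COSBase metaObj = page.getCOSObject().getDictionaryObject(COSName.METADATA);
    if (metaObj instanceof COSStream)
    {
        PDMetadata meta = new PDMetadata((COSStream) metaObj);
        display("found page with metadata", meta);
    }

    PDResources resources = page.getResources();
    if (resources == null)
    {
        continue; // page without resources
    }
    for (COSName resName : resources.getXObjectNames())
    {
        // XObjects are images or forms; either may carry a /Metadata stream
        PDXObject xObject = resources.getXObject(resName);
        if (xObject == null)
        {
            continue;
        }
        metaObj = xObject.getCOSObject().getDictionaryObject(COSName.METADATA);
        if (metaObj instanceof COSStream)
        {
            PDMetadata meta = new PDMetadata((COSStream) metaObj);
            display("found image with metadata", meta);
        }
    }
}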

This could be extended to report metadata for other resources such as
fonts.

A different approach would be to go low level: get the page's
COSDictionary, check getDictionaryObject(COSName.METADATA), then
iterate over all dictionary keys looking for values which are
themselves dictionaries, check getDictionaryObject(COSName.METADATA)
on those, and so on.

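One caveat with that approach is that you need to keep track of the
dictionaries you have already visited, as PDF objects can contain
back-references (e.g. a page's /Parent entry), which would otherwise
send the traversal into an endless loop.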
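
To illustrate the visited-set caveat, here is a minimal, library-free
sketch. It is not PDFBox code: nested java.util.Map instances stand in
for COSDictionary objects, and the class name MetadataWalkSketch, the
methods findMetadata and sample, and the key names are all hypothetical.
The same pattern should carry over to a real COSDictionary walk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Library-free model: nested Maps stand in for PDF dictionaries (assumption).
class MetadataWalkSketch {

    /** Returns the path of every dictionary that carries a "Metadata" entry. */
    static List<String> findMetadata(Map<String, Object> root) {
        List<String> hits = new ArrayList<>();
        // Identity-based set: hashing a cyclic Map by value would overflow the stack.
        Set<Object> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        walk(root, "root", visited, hits);
        return hits;
    }

    private static void walk(Map<String, Object> dict, String path,
                             Set<Object> visited, List<String> hits) {
        if (!visited.add(dict)) {
            return; // already seen, e.g. reached again through a back-reference
        }
        if (dict.containsKey("Metadata")) {
            hits.add(path);
        }
        for (Map.Entry<String, Object> entry : dict.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) value;
                walk(child, path + "/" + entry.getKey(), visited, hits);
            }
        }
    }

    /** Builds a tiny cyclic structure: the page points back to its parent. */
    static Map<String, Object> sample() {
        Map<String, Object> image = new LinkedHashMap<>();
        image.put("Metadata", "<x:xmpmeta/>");

        Map<String, Object> pages = new LinkedHashMap<>();
        Map<String, Object> page = new LinkedHashMap<>();
        page.put("Parent", pages);            // back-reference
        page.put("Metadata", "<x:xmpmeta/>");
        page.put("Im0", image);
        pages.put("Kid0", page);

        Map<String, Object> root = new LinkedHashMap<>();
        root.put("Pages", pages);
        return root;
    }

    public static void main(String[] args) {
        for (String path : findMetadata(sample())) {
            System.out.println(path);
        }
    }
}
```

Run against the sample structure, main prints root/Pages/Kid0 and
root/Pages/Kid0/Im0 and terminates despite the Parent loop. The
recursion is kept for brevity; for deeply nested real-world files an
explicit stack would probably be the safer choice.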

I could extend the ExtractMetadata example in the PDFBox example code
if that helps you get started. Otherwise please drop a quick note if I
can be of any help.

BR
Maruan


On Friday, 19.03.2021 at 11:42 -0400, Tim Allison wrote:
> All,
> 
>     The processes finished: https://corpora.tika.apache.org/base/xmps/
> 
>     Now has two subdirectories, one for the original raw byte scraping
> (1.2 million files with some junk
> https://corpora.tika.apache.org/base/xmps/scraped-xmps/) and one for
> the logical XMPs extracted by ExifTool (450k files
> https://corpora.tika.apache.org/base/xmps/exiftool-xmps/).
> 
>      I plan to write some lightweight code to traverse the DOM and
> look for all /Metadata objects and what they're attached to.
> 
>      If the XMP files are of any use or if they'd be of more use to
> you if we did further processing or packaging, please let me know.
> 
>     Cheers,
> 
>               Tim
> 
> On Wed, Mar 17, 2021 at 4:21 PM Tim Allison <ta...@apache.org>
> wrote:
> > 
> > > Ah, I wasn't aware of XMPFiles...thank you...I can run that next if
> > > that'd be of any interest.
> > 
> > If there were a commandline or a Java SDK, I could run that next if
> > that'd be of any interest. :D
> > 
> > On Wed, Mar 17, 2021 at 3:28 PM Tim Allison <ta...@apache.org>
> > wrote:
> > > 
> > > Ah, I wasn't aware of XMPFiles...thank you...I can run that next if
> > > that'd be of any interest.
> > > 
> > > I kicked off a process to run `exifTool -xmp -b` against the files.
> > > The output will go here:
> > > https://corpora.tika.apache.org/base/exiftool-xmps/
> > > 
> > > On Wed, Mar 17, 2021 at 3:24 PM Leonard Rosenthol
> > > <lr...@adobe.com.invalid> wrote:
> > > > 
> > > > Very interesting - thanks.
> > > > 
> > > > FWIW: The XMPToolkit itself has a module called "XMPFiles"
> > > > (https://github.com/adobe/XMP-Toolkit-SDK#xmpfiles) whose job it
> > > > is to read & write/update XMP (and other related metadata such as
> > > > EXIF) from various file formats.  It's what all the Adobe apps
> > > > use to handle XMP in any file format that we encounter.
> > > > 
> > > > Leonard
> > > > 
> > > > On 3/17/21, 2:48 PM, "Tim Allison" <ta...@apache.org> wrote:
> > > > 
> > > >     Wait...I'm sorry...I'm wrong on the first point.
> > > >
> > > >     1) In Tika generally, we (currently) use Jempbox to parse XMP
> > > >     when the parsers come across it, after they select the right
> > > >     one and do any joining or other modifications...i.e. the
> > > >     "right" XMP.  We use xmpcore for converting other metadata to
> > > >     XMP in our tika-xmp module, and xmpcore is a dependency of
> > > >     Drew Noakes' metadata-extractor, which is critical.
> > > > 
> > > >     On Wed, Mar 17, 2021 at 2:43 PM Tim Allison
> > > > <ta...@apache.org> wrote:
> > > >     >
> > > >     > >Isn't that why are you using the XMP Toolkit???
> > > >     >
> > > >     > Sorry, we may be talking about two different things.
> > > >     >
> > > >     > 1) In Tika generally, we use xmpcore to parse XMP after the
> > > > parsers
> > > >     > extract it and process it (correctly!) from various file
> > > > formats.
> > > >     >
> > > >     > 2) For this exercise, I wanted a quick and dirty byte
> > > > scanner to
> > > >     > extract the raw xmp packets...as much as we could find in
> > > > any file
> > > >     > format without relying on file-format specific parsers.
> > > >     >
> > > >     > I can do a second run where I modify Tika to extract the
> > > > XMP from the
> > > >     > various parsers after they do their processing (determining
> > > > most
> > > >     > recent/joining, etc) to extract the correct XMP.
> > > >     >
> > > >     > And I can do a third run where I modify Tika to extract XMP
> > > > associated
> > > >     > with embedded images in PDFs, for example.
> > > >     >
> > > >     > I hope this clarifies things.  Please let me know what
> > > > would be most
> > > >     > useful for you.
> > > >     >
> > > >     > Cheers,
> > > >     >
> > > >     >        Tim
> > > >     >
> > > >     > On Wed, Mar 17, 2021 at 2:26 PM Leonard Rosenthol
> > > >     > <lr...@adobe.com.invalid> wrote:
> > > >     > >
> > > >     > > >    The other thing is that I wanted to scrape xmp out
> > > > of files beyond PDFs.
> > > >     > > >
> > > >     > > Isn't that why are you using the XMP Toolkit???
> > > >     > >
> > > >     > > Leonard
> > > >     > >
> > > >     > > On 3/17/21, 2:10 PM, "Tim Allison" <ta...@apache.org>
> > > > wrote:
> > > >     > >
> > > >     > >     > ARGH!!!!   Please don't do this - it will get you the
> > > >     > >     > wrong results in almost all cases.  Remember that in a
> > > >     > >     > PDF with updates, there can/will be a new XMP block
> > > >     > >     > with each update.
> > > >     > >
> > > >     > >     Ha, right.  I completely understand (perhaps _only_ this
> > > >     > >     small point on PDFs).  On this pass, my goal was to see
> > > >     > >     what was in the file at all, not what was the correct
> > > >     > >     XMP.  Part of my interest is in what's available in the
> > > >     > >     file, but not available readily to the user.
> > > >     > >
> > > >     > >     The other thing is that I wanted to scrape xmp out of
> > > >     > >     files beyond PDFs.
> > > >     > >
> > > >     > >     So, I can definitely take a second run where I let a PDF
> > > >     > >     tool extract the correct XMP if there's interest in that.
> > > >     > >
> > > >     > >     On Wed, Mar 17, 2021 at 1:56 PM Leonard Rosenthol
> > > >     > >     <lr...@adobe.com.invalid> wrote:
> > > >     > >     >
> > > >     > >     > >      I'm literally just scraping bytes out of files
> > > >     > >     > >      for now without any parsing
> > > >     > >     > >
> > > >     > >     > ARGH!!!!   Please don't do this - it will get you the
> > > >     > >     > wrong results in almost all cases.  Remember that in a
> > > >     > >     > PDF with updates, there can/will be a new XMP block
> > > >     > >     > with each update.
> > > >     > >     >
> > > >     > >     > > if I traverse the COSDocument's objects and look for
> > > >     > >     > > /Metadata and grab the stream, will that be what
> > > >     > >     > > you're looking for?
> > > >     > >     > >
> > > >     > >     > Just getting those elements would be a great start.  If
> > > >     > >     > you could also include the rest of the dictionary in
> > > >     > >     > which it was found (or at least the /Type and /Subtype
> > > >     > >     > keys, if present), that would be great!
> > > >     > >     >
> > > >     > >     > Leonard
> > > >     > >     >
> > > >     > >     > On 3/17/21, 1:39 PM, "Tim Allison"
> > > > <ta...@apache.org> wrote:
> > > >     > >     >
> > > >     > >     >     Hi Leonard,
> > > >     > >     >       I'm literally just scraping bytes out of
> > > > files for now without any
> > > >     > >     >     parsing...so if the XMP is concealed in a
> > > > compressed stream or
> > > >     > >     >     something more interesting, I'm not grabbing
> > > > it.  I'm also not
> > > >     > >     >     tracking which XMP is associated with which
> > > > object.
> > > >     > >     >       Please forgive me...if I traverse the
> > > > COSDocument's objects and look
> > > >     > >     >     for /Metadata and grab the stream, will that be
> > > > what you're looking
> > > >     > >     >     for?  Or, is there a commandline tool I can run
> > > > to get what you're
> > > >     > >     >     interested in?
> > > >     > >     >       Thank you.
> > > >     > >     >
> > > >     > >     >       Cheers,
> > > >     > >     >
> > > >     > >     >                   Tim
> > > >     > >     >
> > > >     > >     >     On Wed, Mar 17, 2021 at 1:17 PM Leonard
> > > > Rosenthol
> > > >     > >     >     <lr...@adobe.com.invalid> wrote:
> > > >     > >     >     >
> > > >     > >     >     > Are you only pulling document-level XMP?  If
> > > > so, could you extend it to support object-level metadata as
> > > > well?   I, for one, would love to get insight into the use of
> > > > object-level metadata - what objects are they attached to, what
> > > > are they being used for, etc.
> > > >     > >     >     >
> > > >     > >     >     > Leonard
> > > >     > >     >     >
> > > >     > >     >     > On 3/17/21, 11:37 AM, "Tim Allison"
> > > > <ta...@apache.org> wrote:
> > > >     > >     >     >
> > > >     > >     >     >     All,
> > > >     > >     >     >
> > > >     > >     >     >       I'm scraping XMPs out of our corpus and
> > > > placing them here as standalone files:
> > > >     > >     >     >
> > > >     > >     >     >     https://corpora.tika.apache.org/base/xmps/
> > > >     > >     >     >
> > > >     > >     >     >       I've binned the files roughly based on
> > > > the container file's mime
> > > >     > >     >     >     type, e.g. https://corpora.tika.apache.org/base/xmps/pdf/
> > > >     > >     >     >
> > > >     > >     >     >       The process is still running, and I
> > > > view this as a first draft.
> > > >     > >     >     >     Please let me know if there's anything I
> > > > can do to make these data
> > > >     > >     >     >     easier to use/more useful or if you see
> > > > any problems.
> > > >     > >     >     >
> > > >     > >     >     >       Cheers,
> > > >     > >     >     >
> > > >     > >     >     >                  Tim
> > > >     > >     >     >
> > > >     > >     >
> > > >     > >
> > > > 

-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahyoun@fileaffairs.de
www.fileaffairs.de

Geschäftsführer: Maruan Sahyoun
Handelsregister: AG Düsseldorf, HRB 53837
UST.-ID: DE248275827

Re: XMPs...all you could possibly want...and more!

Posted by Tim Allison <ta...@apache.org>.

All,

    The processes finished: https://corpora.tika.apache.org/base/xmps/

    That directory now has two subdirectories: one for the original raw
byte scraping (1.2 million files, with some junk:
https://corpora.tika.apache.org/base/xmps/scraped-xmps/) and one for
the logical XMPs extracted by ExifTool (450k files:
https://corpora.tika.apache.org/base/xmps/exiftool-xmps/).
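
For anyone wondering what "raw byte scraping" can look like, here is a minimal Python sketch. It is not the scanner that produced the files above. It assumes each packet is wrapped in `<?xpacket begin=...?>` / `<?xpacket end=...?>` processing instructions and is stored uncompressed in an 8-bit encoding; the names `XPACKET` and `scrape_xmp_packets` are made up for illustration.

```python
import re
from typing import List

# Matches from an "<?xpacket begin=" header to the end of the next
# "<?xpacket end=...?>" trailer. DOTALL lets "." span newlines, and the
# non-greedy ".*?" stops at the first trailer after each header.
XPACKET = re.compile(rb"<\?xpacket begin=.*?<\?xpacket end=.*?\?>", re.DOTALL)


def scrape_xmp_packets(data: bytes) -> List[bytes]:
    """Return every raw XMP packet found in `data`, in file order."""
    return [m.group(0) for m in XPACKET.finditer(data)]
```

A scan like this returns every packet it can see, including stale ones left behind by incremental PDF updates, and it misses anything inside a compressed stream, which is one reason the scraped set is larger and noisier than the ExifTool set.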

     I plan to write some lightweight code to traverse the DOM and
look for all /Metadata objects and what they're attached to.
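
As a rough illustration of that plan, the Python sketch below lists the objects that carry a /Metadata reference, together with their /Type and /Subtype. It is a deliberately naive stand-in, not the code I intend to run: it only sees objects written directly in the file body and ignores object streams, encryption, and incremental updates, all of which a real PDF parser would handle. Every name in it (`find_metadata_holders`, `OBJ`, `META_REF`, and so on) is hypothetical.

```python
import re
from typing import Dict, List

# "N G obj ... endobj" for objects stored directly in the file body.
OBJ = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# An indirect reference such as "/Metadata 12 0 R".
META_REF = re.compile(rb"/Metadata\s+(\d+)\s+(\d+)\s+R")
# Name values of the /Type and /Subtype keys.
TYPE = re.compile(rb"/Type\s*/([A-Za-z0-9]+)")
SUBTYPE = re.compile(rb"/Subtype\s*/([A-Za-z0-9]+)")


def _name(pattern, head):
    """Return the first name matched by `pattern` in `head`, or None."""
    m = pattern.search(head)
    return m.group(1).decode("ascii") if m else None


def find_metadata_holders(pdf: bytes) -> List[Dict]:
    """List objects whose dictionary points at a /Metadata stream."""
    holders = []
    for m in OBJ.finditer(pdf):
        # Keep only the part before any stream data.
        head = m.group(3).split(b"stream", 1)[0]
        ref = META_REF.search(head)
        if ref is None:
            continue
        holders.append({
            "object": (int(m.group(1)), int(m.group(2))),
            "metadata_ref": (int(ref.group(1)), int(ref.group(2))),
            "type": _name(TYPE, head),
            "subtype": _name(SUBTYPE, head),
        })
    return holders
```

The /Type and /Subtype of the holder are reported because those were the keys requested earlier in the thread; dumping the whole dictionary would be a small extension.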

     If the XMP files are of any use or if they'd be of more use to
you if we did further processing or packaging, please let me know.

    Cheers,

              Tim

On Wed, Mar 17, 2021 at 4:21 PM Tim Allison <ta...@apache.org> wrote:
>
> > Ah, I wasn't aware of XMPFiles...thank you...I can run that next if that'd be of any interest.
>
> If there were a commandline or a Java SDK, I could run that next if
> that'd be of any interest. :D
>
> On Wed, Mar 17, 2021 at 3:28 PM Tim Allison <ta...@apache.org> wrote:
> >
> > Ah, I wasn't aware of XMPFiles...thank you...I can run that next if
> > that'd be of any interest.
> >
> > I kicked off a process to run `exiftool -xmp -b` against the files.
> > The output will go here:
> > https://corpora.tika.apache.org/base/exiftool-xmps/
> >
> > On Wed, Mar 17, 2021 at 3:24 PM Leonard Rosenthol
> > <lr...@adobe.com.invalid> wrote:
> > >
> > > Very interesting - thanks.
> > >
> > > FWIW: The XMPToolkit itself has a module called "XMPFiles" (https://github.com/adobe/XMP-Toolkit-SDK#xmpfiles) whose job it is to read & write/update XMP (and other related metadata such as EXIF) from various file formats.  It's what all the Adobe apps use to handle XMP in any file format that we encounter.
> > >
> > > Leonard
> > >
> > > On 3/17/21, 2:48 PM, "Tim Allison" <ta...@apache.org> wrote:
> > >
> > >     Wait...I'm sorry...I'm wrong on the first point.
> > >
> > >     1) in Tika generally, we use Jempbox (currently) to parse XMP when the
> > >     parsers come across it and after they select the right one and do any
> > >     joining or other modifications...i.e. the "right" xmp.  We use xmpcore
> > >     for converting other metadata to XMP in our tika-xmp module, and
> > >     xmpcore is a dependency of Drew Noakes' metadata-extractor which is
> > >     critical.
> > >
> > >     On Wed, Mar 17, 2021 at 2:43 PM Tim Allison <ta...@apache.org> wrote:
> > >     >
> > >     > >Isn't that why you are using the XMP Toolkit???
> > >     >
> > >     > Sorry, we may be talking about two different things.
> > >     >
> > >     > 1) In Tika generally, we use xmpcore to parse XMP after the parsers
> > >     > extract it and process it (correctly!) from various file formats.
> > >     >
> > >     > 2) For this exercise, I wanted a quick and dirty byte scanner to
> > >     > extract the raw xmp packets...as much as we could find in any file
> > >     > format without relying on file-format specific parsers.
> > >     >
> > >     > I can do a second run where I modify Tika to extract the XMP from the
> > >     > various parsers after they do their processing (determining most
> > >     > recent/joining, etc) to extract the correct XMP.
> > >     >
> > >     > And I can do a third run where I modify Tika to extract XMP associated
> > >     > with embedded images in PDFs, for example.
> > >     >
> > >     > I hope this clarifies things.  Please let me know what would be most
> > >     > useful for you.
> > >     >
> > >     > Cheers,
> > >     >
> > >     >        Tim
> > >     >
> > >     > On Wed, Mar 17, 2021 at 2:26 PM Leonard Rosenthol
> > >     > <lr...@adobe.com.invalid> wrote:
> > >     > >
> > >     > > >    The other thing is that I wanted to scrape xmp out of files beyond PDFs.
> > >     > > >
> > >     > > Isn't that why you are using the XMP Toolkit???
> > >     > >
> > >     > > Leonard
> > >     > >
> > >     > > On 3/17/21, 2:10 PM, "Tim Allison" <ta...@apache.org> wrote:
> > >     > >
> > >     > >     > ARGH!!!!   Please don't do this - it will get you the wrong results in almost all cases.     Remember that in a PDF with updates, there can/will be a new XMP block with each update.
> > >     > >
> > >     > >     Ha, right.  I completely understand (perhaps _only_ this small point
> > >     > >     on PDFs).  On this pass, my goal was to see what was in the file at
> > >     > >     all, not what was the correct XMP. Part of my interest is in what's
> > >     > >     available in the file, but not available readily to the user.
> > >     > >
> > >     > >     The other thing is that I wanted to scrape xmp out of files beyond PDFs.
> > >     > >
> > >     > >     So, I can definitely take a second run where I let a PDF tool extract
> > >     > >     the correct XMP if there's interest in that.
> > >     > >
> > >     > >     On Wed, Mar 17, 2021 at 1:56 PM Leonard Rosenthol
> > >     > >     <lr...@adobe.com.invalid> wrote:
> > >     > >     >
> > >     > >     > >      I'm literally just scraping bytes out of files for now without any parsing
> > >     > >     > >
> > >     > >     > ARGH!!!!   Please don't do this - it will get you the wrong results in almost all cases.     Remember that in a PDF with updates, there can/will be a new XMP block with each update.
> > >     > >     >
> > >     > >     >
> > >     > >     > > if I traverse the COSDocument's objects and look     for /Metadata and grab the stream, will that be what you're looking     for?
> > >     > >     > >
> > >     > >     > Just getting those elements would be a great start.  If you could also include the rest of the dictionary in which it was found (or at least the /Type and /Subtype keys, if present), that would be great!
> > >     > >     >
> > >     > >     > Leonard
> > >     > >     >
> > >     > >     > On 3/17/21, 1:39 PM, "Tim Allison" <ta...@apache.org> wrote:
> > >     > >     >
> > >     > >     >     Hi Leonard,
> > >     > >     >       I'm literally just scraping bytes out of files for now without any
> > >     > >     >     parsing...so if the XMP is concealed in a compressed stream or
> > >     > >     >     something more interesting, I'm not grabbing it.  I'm also not
> > >     > >     >     tracking which XMP is associated with which object.
> > >     > >     >       Please forgive me...if I traverse the COSDocument's objects and look
> > >     > >     >     for /Metadata and grab the stream, will that be what you're looking
> > >     > >     >     for?  Or, is there a commandline tool I can run to get what you're
> > >     > >     >     interested in?
> > >     > >     >       Thank you.
> > >     > >     >
> > >     > >     >       Cheers,
> > >     > >     >
> > >     > >     >                   Tim
> > >     > >     >
> > >     > >     >     On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
> > >     > >     >     <lr...@adobe.com.invalid> wrote:
> > >     > >     >     >
> > >     > >     >     > Are you only pulling document-level XMP?  If so, could you extend it to support object-level metadata as well?   I, for one, would love to get insight into the use of object-level metadata - what objects are they attached to, what are they being used for, etc.
> > >     > >     >     >
> > >     > >     >     > Leonard
> > >     > >     >     >
> > >     > >     >     > On 3/17/21, 11:37 AM, "Tim Allison" <ta...@apache.org> wrote:
> > >     > >     >     >
> > >     > >     >     >     All,
> > >     > >     >     >
> > >     > >     >     >       I'm scraping XMPs out of our corpus and placing them here as standalone files:
> > >     > >     >     >
> > >     > >     >     >     https://corpora.tika.apache.org/base/xmps/
> > >     > >     >     >
> > >     > >     >     >       I've binned the files roughly based on the container file's mime
> > >     > >     >     >     type, e.g. https://corpora.tika.apache.org/base/xmps/pdf/
> > >     > >     >     >
> > >     > >     >     >       The process is still running, and I view this as a first draft.
> > >     > >     >     >     Please let me know if there's anything I can do to make these data
> > >     > >     >     >     easier to use/more useful or if you see any problems.
> > >     > >     >     >
> > >     > >     >     >       Cheers,
> > >     > >     >     >
> > >     > >     >     >                  Tim
> > >     > >     >     >
> > >     > >     >
> > >     > >
> > >

Re: XMPs...all you could possibly want...and more!

Posted by Tim Allison <ta...@apache.org>.

> Ah, I wasn't aware of XMPFiles...thank you...I can run that next if that'd be of any interest.

If there were a commandline or a Java SDK, I could run that next if
that'd be of any interest. :D

On Wed, Mar 17, 2021 at 3:28 PM Tim Allison <ta...@apache.org> wrote:
>
> Ah, I wasn't aware of XMPFiles...thank you...I can run that next if
> that'd be of any interest.
>
> I kicked off a process to run `exiftool -xmp -b` against the files.
> The output will go here:
> https://corpora.tika.apache.org/base/exiftool-xmps/
>
> On Wed, Mar 17, 2021 at 3:24 PM Leonard Rosenthol
> <lr...@adobe.com.invalid> wrote:
> >
> > Very interesting - thanks.
> >
> > FWIW: The XMPToolkit itself has a module called "XMPFiles" (https://github.com/adobe/XMP-Toolkit-SDK#xmpfiles) whose job it is to read & write/update XMP (and other related metadata such as EXIF) from various file formats.  It's what all the Adobe apps use to handle XMP in any file format that we encounter.
> >
> > Leonard
> >
> > On 3/17/21, 2:48 PM, "Tim Allison" <ta...@apache.org> wrote:
> >
> >     Wait...I'm sorry...I'm wrong on the first point.
> >
> >     1) in Tika generally, we use Jempbox (currently) to parse XMP when the
> >     parsers come across it and after they select the right one and do any
> >     joining or other modifications...i.e. the "right" xmp.  We use xmpcore
> >     for converting other metadata to XMP in our tika-xmp module, and
> >     xmpcore is a dependency of Drew Noakes' metadata-extractor which is
> >     critical.
> >
> >     On Wed, Mar 17, 2021 at 2:43 PM Tim Allison <ta...@apache.org> wrote:
> >     >
> >     > >Isn't that why you are using the XMP Toolkit???
> >     >
> >     > Sorry, we may be talking about two different things.
> >     >
> >     > 1) In Tika generally, we use xmpcore to parse XMP after the parsers
> >     > extract it and process it (correctly!) from various file formats.
> >     >
> >     > 2) For this exercise, I wanted a quick and dirty byte scanner to
> >     > extract the raw xmp packets...as much as we could find in any file
> >     > format without relying on file-format specific parsers.
> >     >
> >     > I can do a second run where I modify Tika to extract the XMP from the
> >     > various parsers after they do their processing (determining most
> >     > recent/joining, etc) to extract the correct XMP.
> >     >
> >     > And I can do a third run where I modify Tika to extract XMP associated
> >     > with embedded images in PDFs, for example.
> >     >
> >     > I hope this clarifies things.  Please let me know what would be most
> >     > useful for you.
> >     >
> >     > Cheers,
> >     >
> >     >        Tim
> >     >
> >     > On Wed, Mar 17, 2021 at 2:26 PM Leonard Rosenthol
> >     > <lr...@adobe.com.invalid> wrote:
> >     > >
> >     > > >    The other thing is that I wanted to scrape xmp out of files beyond PDFs.
> >     > > >
> >     > > Isn't that why you are using the XMP Toolkit???
> >     > >
> >     > > Leonard
> >     > >
> >     > > On 3/17/21, 2:10 PM, "Tim Allison" <ta...@apache.org> wrote:
> >     > >
> >     > >     > ARGH!!!!   Please don't do this - it will get you the wrong results in almost all cases.     Remember that in a PDF with updates, there can/will be a new XMP block with each update.
> >     > >
> >     > >     Ha, right.  I completely understand (perhaps _only_ this small point
> >     > >     on PDFs).  On this pass, my goal was to see what was in the file at
> >     > >     all, not what was the correct XMP. Part of my interest is in what's
> >     > >     available in the file, but not available readily to the user.
> >     > >
> >     > >     The other thing is that I wanted to scrape xmp out of files beyond PDFs.
> >     > >
> >     > >     So, I can definitely take a second run where I let a PDF tool extract
> >     > >     the correct XMP if there's interest in that.
> >     > >
> >     > >     On Wed, Mar 17, 2021 at 1:56 PM Leonard Rosenthol
> >     > >     <lr...@adobe.com.invalid> wrote:
> >     > >     >
> >     > >     > >      I'm literally just scraping bytes out of files for now without any parsing
> >     > >     > >
> >     > >     > ARGH!!!!   Please don't do this - it will get you the wrong results in almost all cases.     Remember that in a PDF with updates, there can/will be a new XMP block with each update.
> >     > >     >
> >     > >     >
> >     > >     > > if I traverse the COSDocument's objects and look     for /Metadata and grab the stream, will that be what you're looking     for?
> >     > >     > >
> >     > >     > Just getting those elements would be a great start.  If you could also include the rest of the dictionary in which it was found (or at least the /Type and /Subtype keys, if present), that would be great!
> >     > >     >
> >     > >     > Leonard
> >     > >     >
> >     > >     > On 3/17/21, 1:39 PM, "Tim Allison" <ta...@apache.org> wrote:
> >     > >     >
> >     > >     >     Hi Leonard,
> >     > >     >       I'm literally just scraping bytes out of files for now without any
> >     > >     >     parsing...so if the XMP is concealed in a compressed stream or
> >     > >     >     something more interesting, I'm not grabbing it.  I'm also not
> >     > >     >     tracking which XMP is associated with which object.
> >     > >     >       Please forgive me...if I traverse the COSDocument's objects and look
> >     > >     >     for /Metadata and grab the stream, will that be what you're looking
> >     > >     >     for?  Or, is there a commandline tool I can run to get what you're
> >     > >     >     interested in?
> >     > >     >       Thank you.
> >     > >     >
> >     > >     >       Cheers,
> >     > >     >
> >     > >     >                   Tim
> >     > >     >
> >     > >     >     On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
> >     > >     >     <lr...@adobe.com.invalid> wrote:
> >     > >     >     >
> >     > >     >     > Are you only pulling document-level XMP?  If so, could you extend it to support object-level metadata as well?   I, for one, would love to get insight into the use of object-level metadata - what objects are they attached to, what are they being used for, etc.
> >     > >     >     >
> >     > >     >     > Leonard
> >     > >     >     >
> >     > >     >     > On 3/17/21, 11:37 AM, "Tim Allison" <ta...@apache.org> wrote:
> >     > >     >     >
> >     > >     >     >     All,
> >     > >     >     >
> >     > >     >     >       I'm scraping XMPs out of our corpus and placing them here as standalone files:
> >     > >     >     >
> >     > >     >     >     https://corpora.tika.apache.org/base/xmps/
> >     > >     >     >
> >     > >     >     >       I've binned the files roughly based on the container file's mime
> >     > >     >     >     type, e.g. https://corpora.tika.apache.org/base/xmps/pdf/
> >     > >     >     >
> >     > >     >     >       The process is still running, and I view this as a first draft.
> >     > >     >     >     Please let me know if there's anything I can do to make these data
> >     > >     >     >     easier to use/more useful or if you see any problems.
> >     > >     >     >
> >     > >     >     >       Cheers,
> >     > >     >     >
> >     > >     >     >                  Tim
> >     > >     >     >
> >     > >     >
> >     > >
> >

Re: XMPs...all you could possibly want...and more!

Posted by Tim Allison <ta...@apache.org>.

Ah, I wasn't aware of XMPFiles...thank you...I can run that next if
that'd be of any interest.

I kicked off a process to run `exiftool -xmp -b` against the files.
The output will go here:
https://corpora.tika.apache.org/base/exiftool-xmps/
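
For reference, a per-file wrapper around that command might look like the Python sketch below. It assumes `exiftool` is on the PATH; the output naming (`<original name>.xmp`) and the function names are invented here and are not necessarily what the real run uses.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def exiftool_xmp_command(src: Path) -> List[str]:
    """Build the argv that asks exiftool for the XMP block as raw bytes."""
    return ["exiftool", "-xmp", "-b", str(src)]


def extract_xmp(src: Path, out_dir: Path) -> Optional[Path]:
    """Run exiftool on `src`; write the XMP to out_dir if any came back."""
    result = subprocess.run(
        exiftool_xmp_command(src), capture_output=True, check=False
    )
    # No output means exiftool found no XMP block for this file.
    if result.returncode != 0 or not result.stdout:
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / (src.name + ".xmp")  # hypothetical naming scheme
    target.write_bytes(result.stdout)
    return target
```

Files for which exiftool returns nothing simply produce no output file.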

On Wed, Mar 17, 2021 at 3:24 PM Leonard Rosenthol
<lr...@adobe.com.invalid> wrote:
>
> Very interesting - thanks.
>
> FWIW: The XMPToolkit itself has a module called "XMPFiles" (https://github.com/adobe/XMP-Toolkit-SDK#xmpfiles) whose job it is to read & write/update XMP (and other related metadata such as EXIF) from various file formats.  It's what all the Adobe apps use to handle XMP in any file format that we encounter.
>
> Leonard
>
> On 3/17/21, 2:48 PM, "Tim Allison" <ta...@apache.org> wrote:
>
>     Wait...I'm sorry...I'm wrong on the first point.
>
>     1) in Tika generally, we use Jempbox (currently) to parse XMP when the
>     parsers come across it and after they select the right one and do any
>     joining or other modifications...i.e. the "right" xmp.  We use xmpcore
>     for converting other metadata to XMP in our tika-xmp module, and
>     xmpcore is a dependency of Drew Noakes' metadata-extractor which is
>     critical.
>
>     On Wed, Mar 17, 2021 at 2:43 PM Tim Allison <ta...@apache.org> wrote:
>     >
>     > >Isn't that why you are using the XMP Toolkit???
>     >
>     > Sorry, we may be talking about two different things.
>     >
>     > 1) In Tika generally, we use xmpcore to parse XMP after the parsers
>     > extract it and process it (correctly!) from various file formats.
>     >
>     > 2) For this exercise, I wanted a quick and dirty byte scanner to
>     > extract the raw xmp packets...as much as we could find in any file
>     > format without relying on file-format specific parsers.
>     >
>     > I can do a second run where I modify Tika to extract the XMP from the
>     > various parsers after they do their processing (determining most
>     > recent/joining, etc) to extract the correct XMP.
>     >
>     > And I can do a third run where I modify Tika to extract XMP associated
>     > with embedded images in PDFs, for example.
>     >
>     > I hope this clarifies things.  Please let me know what would be most
>     > useful for you.
>     >
>     > Cheers,
>     >
>     >        Tim
>     >
>     > On Wed, Mar 17, 2021 at 2:26 PM Leonard Rosenthol
>     > <lr...@adobe.com.invalid> wrote:
>     > >
>     > > >    The other thing is that I wanted to scrape xmp out of files beyond PDFs.
>     > > >
>     > > Isn't that why you are using the XMP Toolkit???
>     > >
>     > > Leonard
>     > >
>     > > On 3/17/21, 2:10 PM, "Tim Allison" <ta...@apache.org> wrote:
>     > >
>     > >     > ARGH!!!!   Please don't do this - it will get you the wrong results in almost all cases.     Remember that in a PDF with updates, there can/will be a new XMP block with each update.
>     > >
>     > >     Ha, right.  I completely understand (perhaps _only_ this small point
>     > >     on PDFs).  On this pass, my goal was to see what was in the file at
>     > >     all, not what was the correct XMP. Part of my interest is in what's
>     > >     available in the file, but not available readily to the user.
>     > >
>     > >     The other thing is that I wanted to scrape xmp out of files beyond PDFs.
>     > >
>     > >     So, I can definitely take a second run where I let a PDF tool extract
>     > >     the correct XMP if there's interest in that.
>     > >
>     > >     On Wed, Mar 17, 2021 at 1:56 PM Leonard Rosenthol
>     > >     <lr...@adobe.com.invalid> wrote:
>     > >     >
>     > >     > >      I'm literally just scraping bytes out of files for now without any parsing
>     > >     > >
>     > >     > ARGH!!!!   Please don't do this - it will get you the wrong results in almost all cases.     Remember that in a PDF with updates, there can/will be a new XMP block with each update.
>     > >     >
>     > >     >
>     > >     > > if I traverse the COSDocument's objects and look     for /Metadata and grab the stream, will that be what you're looking     for?
>     > >     > >
>     > >     > Just getting those elements would be a great start.  If you could also include the rest of the dictionary in which it was found (or at least the /Type and /Subtype keys, if present), that would be great!
>     > >     >
>     > >     > Leonard
>     > >     >
>     > >     > On 3/17/21, 1:39 PM, "Tim Allison" <ta...@apache.org> wrote:
>     > >     >
>     > >     >     Hi Leonard,
>     > >     >       I'm literally just scraping bytes out of files for now without any
>     > >     >     parsing...so if the XMP is concealed in a compressed stream or
>     > >     >     something more interesting, I'm not grabbing it.  I'm also not
>     > >     >     tracking which XMP is associated with which object.
>     > >     >       Please forgive me...if I traverse the COSDocument's objects and look
>     > >     >     for /Metadata and grab the stream, will that be what you're looking
>     > >     >     for?  Or, is there a commandline tool I can run to get what you're
>     > >     >     interested in?
>     > >     >       Thank you.
>     > >     >
>     > >     >       Cheers,
>     > >     >
>     > >     >                   Tim
>     > >     >
>     > >     >     On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
>     > >     >     <lr...@adobe.com.invalid> wrote:
>     > >     >     >
>     > >     >     > Are you only pulling document-level XMP?  If so, could you extend it to support object-level metadata as well?   I, for one, would love to get insight into the use of object-level metadata - what objects are they attached to, what are they being used for, etc.
>     > >     >     >
>     > >     >     > Leonard
>     > >     >     >
>     > >     >     > On 3/17/21, 11:37 AM, "Tim Allison" <ta...@apache.org> wrote:
>     > >     >     >
>     > >     >     >     All,
>     > >     >     >
>     > >     >     >       I'm scraping XMPs out of our corpus and placing them here as standalone files:
>     > >     >     >
>     > >     >     >     https://corpora.tika.apache.org/base/xmps/
>     > >     >     >
>     > >     >     >       I've binned the files roughly based on the container file's mime
>     > >     >     >     type, e.g. https://corpora.tika.apache.org/base/xmps/pdf/
>     > >     >     >
>     > >     >     >       The process is still running, and I view this as a first draft.
>     > >     >     >     Please let me know if there's anything I can do to make these data
>     > >     >     >     easier to use/more useful or if you see any problems.
>     > >     >     >
>     > >     >     >       Cheers,
>     > >     >     >
>     > >     >     >                  Tim
>     > >     >     >
>     > >     >
>     > >
>

Re: XMPs...all you could possibly want...and more!

Posted by Leonard Rosenthol <lr...@adobe.com.INVALID>.

Very interesting - thanks.

FWIW: The XMPToolkit itself has a module called "XMPFiles" (https://github.com/adobe/XMP-Toolkit-SDK#xmpfiles) whose job it is to read & write/update XMP (and other related metadata such as EXIF) from various file formats.  It's what all the Adobe apps use to handle XMP in any file format that we encounter.

Leonard

On 3/17/21, 2:48 PM, "Tim Allison" <ta...@apache.org> wrote:

    Wait...I'm sorry...I'm wrong on the first point.

    1) in Tika generally, we use Jempbox (currently) to parse XMP when the
    parsers come across it and after they select the right one and do any
    joining or other modifications...i.e. the "right" xmp.  We use xmpcore
    for converting other metadata to XMP in our tika-xmp module, and
    xmpcore is a dependency of Drew Noakes' metadata-extractor which is
    critical.

    On Wed, Mar 17, 2021 at 2:43 PM Tim Allison <ta...@apache.org> wrote:
    >
    > >Isn't that why you are using the XMP Toolkit???
    >
    > Sorry, we may be talking about two different things.
    >
    > 1) In Tika generally, we use xmpcore to parse XMP after the parsers
    > extract it and process it (correctly!) from various file formats.
    >
    > 2) For this exercise, I wanted a quick and dirty byte scanner to
    > extract the raw xmp packets...as much as we could find in any file
    > format without relying on file-format specific parsers.
    >
    > I can do a second run where I modify Tika to extract the XMP from the
    > various parsers after they do their processing (determining most
    > recent/joining, etc) to extract the correct XMP.
    >
    > And I can do a third run where I modify Tika to extract XMP associated
    > with embedded images in PDFs, for example.
    >
    > I hope this clarifies things.  Please let me know what would be most
    > useful for you.
    >
    > Cheers,
    >
    >        Tim
    >
    > On Wed, Mar 17, 2021 at 2:26 PM Leonard Rosenthol
    > <lr...@adobe.com.invalid> wrote:
    > >
    > > >    The other thing is that I wanted to scrape xmp out of files beyond PDFs.
    > > >
    > > Isn't that why you are using the XMP Toolkit???
    > >
    > > Leonard
    > >
    > > On 3/17/21, 2:10 PM, "Tim Allison" <ta...@apache.org> wrote:
    > >
    > >     > ARGH!!!!   Please don't do this - it will get you the wrong results in almost all cases.     Remember that in a PDF with updates, there can/will be a new XMP block with each update.
    > >
    > >     Ha, right.  I completely understand (perhaps _only_ this small point
    > >     on PDFs).  On this pass, my goal was to see what was in the file at
    > >     all, not what was the correct XMP. Part of my interest is in what's
    > >     available in the file, but not available readily to the user.
    > >
    > >     The other thing is that I wanted to scrape xmp out of files beyond PDFs.
    > >
    > >     So, I can definitely take a second run where I let a PDF tool extract
    > >     the correct XMP if there's interest in that.
    > >
    > >     On Wed, Mar 17, 2021 at 1:56 PM Leonard Rosenthol
    > >     <lr...@adobe.com.invalid> wrote:
    > >     >
    > >     > >      I'm literally just scraping bytes out of files for now without any parsing
    > >     > >
    > >     > ARGH!!!!   Please don't do this - it will get you the wrong results in almost all cases.     Remember that in a PDF with updates, there can/will be a new XMP block with each update.
    > >     >
    > >     >
    > >     > > if I traverse the COSDocument's objects and look     for /Metadata and grab the stream, will that be what you're looking     for?
    > >     > >
    > >     > Just getting those elements would be a great start.  If you could also include the rest of the dictionary in which it was found (or at least the /Type and /Subtype keys, if present) would be great!
    > >     >
    > >     > Leonard
    > >     >
    > >     > On 3/17/21, 1:39 PM, "Tim Allison" <ta...@apache.org> wrote:
    > >     >
    > >     >     Hi Leonard,
    > >     >       I'm literally just scraping bytes out of files for now without any
    > >     >     parsing...so if the XMP is concealed in a compressed stream or
    > >     >     something more interesting, I'm not grabbing it.  I'm also not
    > >     >     tracking which XMP is associated with which object.
    > >     >       Please forgive me...if I traverse the COSDocument's objects and look
    > >     >     for /Metadata and grab the stream, will that be what you're looking
    > >     >     for?  Or, is there a commandline tool I can run to get what you're
    > >     >     interested in?
    > >     >       Thank you.
    > >     >
    > >     >       Cheers,
    > >     >
    > >     >                   Tim
    > >     >
    > >     >     On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
    > >     >     <lr...@adobe.com.invalid> wrote:
    > >     >     >
    > >     >     > Are you only pulling document-level XMP?  If so, could you extend it to support object-level metadata as well?   I, for one, would love to get insight into the use of object-level metadata - what objects are they attached to, what are they being used for, etc.
    > >     >     >
    > >     >     > Leonard
    > >     >     >
    > >     >     > On 3/17/21, 11:37 AM, "Tim Allison" <ta...@apache.org> wrote:
    > >     >     >
    > >     >     >     All,
    > >     >     >
    > >     >     >       I'm scraping XMPs out of our corpus and placing them here as standalone files:
    > >     >     >
    > >     >     >     https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcorpora.tika.apache.org%2Fbase%2Fxmps%2F&amp;data=04%7C01%7Clrosenth%40adobe.com%7Cd72980268ef74c392dc008d8e97543f5%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637516037146889530%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=P52Forv9X46J%2BcecAgfJ6%2FVllEXOuJIT8LOebljRYjE%3D&amp;reserved=0
    > >     >     >
    > >     >     >       I've binned the files roughly based on the container file's mime
    > >     >     >     type, e.g. https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcorpora.tika.apache.org%2Fbase%2Fxmps%2Fpdf%2F&amp;data=04%7C01%7Clrosenth%40adobe.com%7Cd72980268ef74c392dc008d8e97543f5%7Cfa7b1b5a7b34438794aed2c178decee1%7C0%7C0%7C637516037146889530%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=l0Nz9sRuTzbF%2F122mGFilHpr3KZldEFPDb3fAZ9B0L0%3D&amp;reserved=0
    > >     >     >
    > >     >     >       The process is still running, and I view this as a first draft.
    > >     >     >     Please let me know if there's anything I can do to make these data
    > >     >     >     easier to use/more useful or if you see any problems.
    > >     >     >
    > >     >     >       Cheers,
    > >     >     >
    > >     >     >                  Tim
    > >     >     >
    > >     >
    > >

Re: XMPs...all you could possibly want...and more!

Posted by Tim Allison <ta...@apache.org>.

Wait...I'm sorry...I'm wrong on the first point.

1) In Tika generally, we (currently) use Jempbox to parse XMP when the
parsers come across it, after they select the right packet and do any
joining or other modifications, i.e. the "right" XMP.  We use xmpcore
for converting other metadata to XMP in our tika-xmp module, and
xmpcore is a dependency of Drew Noakes' metadata-extractor, which is
critical.

On Wed, Mar 17, 2021 at 2:43 PM Tim Allison <ta...@apache.org> wrote:
>
> >Isn't that why you are using the XMP Toolkit???
>
> Sorry, we may be talking about two different things.
>
> 1) In Tika generally, we use xmpcore to parse XMP after the parsers
> extract it and process it (correctly!) from various file formats.
>
> 2) For this exercise, I wanted a quick and dirty byte scanner to
> extract the raw xmp packets...as much as we could find in any file
> format without relying on file-format specific parsers.
>
> I can do a second run where I modify Tika to extract the XMP from the
> various parsers after they do their processing (determining most
> recent/joining, etc) to extract the correct XMP.
>
> And I can do a third run where I modify Tika to extract XMP associated
> with embedded images in PDFs, for example.
>
> I hope this clarifies things.  Please let me know what would be most
> useful for you.
>
> Cheers,
>
>        Tim

Re: XMPs...all you could possibly want...and more!

Posted by Tim Allison <ta...@apache.org>.

>Isn't that why you are using the XMP Toolkit???

Sorry, we may be talking about two different things.

1) In Tika generally, we use xmpcore to parse XMP after the parsers
extract it and process it (correctly!) from various file formats.

2) For this exercise, I wanted a quick and dirty byte scanner to
extract the raw xmp packets...as much as we could find in any file
format without relying on file-format specific parsers.

I can do a second run where I modify Tika to extract the XMP from the
various parsers after they do their processing (determining the most
recent packet, joining, etc.) to extract the correct XMP.

And I can do a third run where I modify Tika to extract XMP associated
with embedded images in PDFs, for example.

I hope this clarifies things.  Please let me know what would be most
useful for you.

Cheers,

       Tim
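
For readers who want to picture the "quick and dirty byte scanner" from point 2, here is a minimal Python sketch. It is not the code used for the corpus run; the function name `scan_xmp_packets` and the sample bytes are assumptions made for illustration. It looks for the `<?xpacket begin=` header and the matching `<?xpacket end=` trailer and ignores the container format entirely, so it shares the limitations raised elsewhere in this thread.

```python
import re

# Header and trailer of the XMP packet wrapper. Attribute quotes may be
# single or double, so the trailer pattern accepts either.
BEGIN = re.compile(rb"<\?xpacket\s+begin=")
END = re.compile(rb"<\?xpacket\s+end=[\"'][rw][\"']\s*\?>")


def scan_xmp_packets(data: bytes) -> list:
    """Return every raw XMP packet found in data, in file order.

    Limitations of this sketch: it only sees packets stored as plain
    8-bit text (no UTF-16, no compressed streams), and it does not know
    which object, if any, a packet belongs to.
    """
    packets = []
    pos = 0
    while True:
        start = BEGIN.search(data, pos)
        if start is None:
            break
        end = END.search(data, start.end())
        if end is None:
            break  # header without a trailer: treat as truncated and stop
        packets.append(data[start.start():end.end()])
        pos = end.end()
    return packets


# Hypothetical container bytes with one embedded packet.
sample = (
    b"%PDF-1.4\nsome binary content\n"
    b"<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>"
    b"<x:xmpmeta xmlns:x='adobe:ns:meta/'></x:xmpmeta>"
    b"<?xpacket end='w'?>\n%%EOF\n"
)
packets = scan_xmp_packets(sample)
print(len(packets))
```

Because the scan is format-agnostic, the same function can be pointed at JPEG, TIFF, or any other container, which is the trade-off described above: broad coverage, no guarantee of correctness.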

On Wed, Mar 17, 2021 at 2:26 PM Leonard Rosenthol
<lr...@adobe.com.invalid> wrote:
>
> > The other thing is that I wanted to scrape xmp out of files beyond PDFs.
> >
> Isn't that why you are using the XMP Toolkit???
>
> Leonard

Re: XMPs...all you could possibly want...and more!

Posted by Leonard Rosenthol <lr...@adobe.com.INVALID>.

> The other thing is that I wanted to scrape xmp out of files beyond PDFs.
>
Isn't that why you are using the XMP Toolkit???

Leonard

On 3/17/21, 2:10 PM, "Tim Allison" <ta...@apache.org> wrote:

    > ARGH!!!!   Please don't do this - it will get you the wrong results in almost all cases.     Remember that in a PDF with updates, there can/will be a new XMP block with each update.

    Ha, right.  I completely understand (perhaps _only_ this small point
    on PDFs).  On this pass, my goal was to see what was in the file at
    all, not what was the correct XMP. Part of my interest is in what's
    available in the file, but not available readily to the user.

    The other thing is that I wanted to scrape xmp out of files beyond PDFs.

    So, I can definitely take a second run where I let a PDF tool extract
    the correct XMP if there's interest in that.


Re: XMPs...all you could possibly want...and more!

Posted by Tim Allison <ta...@apache.org>.

> ARGH!!!! Please don't do this - it will get you the wrong results in almost all cases. Remember that in a PDF with updates, there can/will be a new XMP block with each update.

Ha, right.  I completely understand (perhaps _only_ this small point
on PDFs).  On this pass, my goal was to see what was in the file at
all, not what was the correct XMP. Part of my interest is in what's
present in the file but not readily available to the user.

The other thing is that I wanted to scrape xmp out of files beyond PDFs.

So, I can definitely take a second run where I let a PDF tool extract
the correct XMP if there's interest in that.
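
The point about updates can be made concrete with a small Python sketch. It counts `%%EOF` markers and XMP packet headers in raw bytes as a rough flag for files where a byte scan cannot say which packet is current. This is only a heuristic assumed for illustration: linearized PDFs also contain more than one `%%EOF`, the marker can occur inside stream data, and a real PDF tool decides by following the cross-reference data rather than by counting. The function names and sample bytes are hypothetical.

```python
def summarize(data: bytes) -> dict:
    """Count end-of-file markers and XMP packet headers in raw bytes."""
    return {
        "eof_markers": data.count(b"%%EOF"),
        "packet_headers": data.count(b"<?xpacket begin"),
    }


def needs_real_parser(data: bytes) -> bool:
    """True when a byte scan alone cannot tell which packet is current."""
    return summarize(data)["packet_headers"] > 1


# Hypothetical file contents: an original revision plus one appended update.
original = (
    b"%PDF-1.4\n"
    b"<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>rev 1<?xpacket end='w'?>\n"
    b"%%EOF\n"
)
updated = original + (
    b"<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>rev 2<?xpacket end='w'?>\n"
    b"%%EOF\n"
)

print(summarize(original))
print(summarize(updated))
print(needs_real_parser(updated))
```

A flag like this could be stored next to each scraped packet so that consumers of the first-draft data know which files deserve a second, parser-based pass.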

On Wed, Mar 17, 2021 at 1:56 PM Leonard Rosenthol
<lr...@adobe.com.invalid> wrote:
>
> > I'm literally just scraping bytes out of files for now without any parsing
> >
> ARGH!!!! Please don't do this - it will get you the wrong results in almost all cases. Remember that in a PDF with updates, there can/will be a new XMP block with each update.
>
> > if I traverse the COSDocument's objects and look for /Metadata and grab the stream, will that be what you're looking for?
> >
> Just getting those elements would be a great start.  If you could also include the rest of the dictionary in which it was found (or at least the /Type and /Subtype keys, if present), that would be great!
>
> Leonard

Re: XMPs...all you could possibly want...and more!

Posted by Leonard Rosenthol <lr...@adobe.com.INVALID>.

> I'm literally just scraping bytes out of files for now without any parsing
>
ARGH!!!! Please don't do this - it will get you the wrong results in almost all cases. Remember that in a PDF with updates, there can/will be a new XMP block with each update.

> if I traverse the COSDocument's objects and look for /Metadata and grab the stream, will that be what you're looking for?
>
Just getting those elements would be a great start.  If you could also include the rest of the dictionary in which it was found (or at least the /Type and /Subtype keys, if present), that would be great!

Leonard
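
One way to picture the request above is a small sidecar record written next to each extracted metadata stream. The Python sketch below is purely illustrative: the `XmpRecord` class, its field names, and the example values are assumptions, not an agreed format for the corpus. It keeps the `/Type` and `/Subtype` of the dictionary that referenced the `/Metadata` stream, plus the names of the other keys in that dictionary.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class XmpRecord:
    """One extracted metadata stream plus facts about the dictionary that referenced it."""
    source_file: str                     # container file the stream came from
    xmp_file: str                        # standalone file holding the extracted XMP
    owner_type: Optional[str] = None     # value of /Type in the owning dictionary, if present
    owner_subtype: Optional[str] = None  # value of /Subtype in the owning dictionary, if present
    owner_keys: List[str] = field(default_factory=list)  # names of the keys in that dictionary


def to_json_line(record: XmpRecord) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)


# Hypothetical example: metadata attached to an image XObject.
record = XmpRecord(
    source_file="example.pdf",
    xmp_file="example-0.xmp",
    owner_type="XObject",
    owner_subtype="Image",
    owner_keys=["Width", "Height", "ColorSpace", "Metadata"],
)
print(to_json_line(record))
```

For document-level metadata the owning dictionary would be the catalog, so `owner_type` would be `Catalog` and `owner_subtype` would stay empty.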

On 3/17/21, 1:39 PM, "Tim Allison" <ta...@apache.org> wrote:

    Hi Leonard,
      I'm literally just scraping bytes out of files for now without any
    parsing...so if the XMP is concealed in a compressed stream or
    something more interesting, I'm not grabbing it.  I'm also not
    tracking which XMP is associated with which object.
      Please forgive me...if I traverse the COSDocument's objects and look
    for /Metadata and grab the stream, will that be what you're looking
    for?  Or, is there a commandline tool I can run to get what you're
    interested in?
      Thank you.

      Cheers,

                  Tim


Re: XMPs...all you could possibly want...and more!

Posted by Tim Allison <ta...@apache.org>.

Hi Leonard,
  I'm literally just scraping bytes out of files for now without any
parsing...so if the XMP is concealed in a compressed stream or
something more interesting, I'm not grabbing it.  I'm also not
tracking which XMP is associated with which object.
  Please forgive me...if I traverse the COSDocument's objects and look
for /Metadata and grab the stream, will that be what you're looking
for?  Or, is there a command-line tool I can run to get what you're
interested in?
  Thank you.

  Cheers,

              Tim
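
The remark above about compressed streams can be demonstrated in a few lines of Python. This is a sketch under stated assumptions: the packet contents and the bare `stream`/`endstream` framing are simplified, hypothetical stand-ins for a real PDF stream object, and `zlib` stands in for a Flate filter. It shows that the `<?xpacket begin` marker is visible to a byte scan in an uncompressed stream and is no longer visible once the same bytes are deflated.

```python
import zlib

MARKER = b"<?xpacket begin"

# Hypothetical packet, with the whitespace padding that packets often carry.
packet = (
    b"<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>"
    b"<x:xmpmeta xmlns:x='adobe:ns:meta/'></x:xmpmeta>"
    + b" " * 200
    + b"<?xpacket end='w'?>"
)

# The same packet stored uncompressed and Flate-compressed.
plain_stream = b"stream\n" + packet + b"\nendstream"
flate_stream = b"stream\n" + zlib.compress(packet) + b"\nendstream"

# A byte scan only sees the marker when the stream is stored as plain text.
found_in_plain = MARKER in plain_stream
found_in_flate = MARKER in flate_stream

# Decoding the stream first, as a format-aware parser would, recovers it.
recovered = zlib.decompress(flate_stream[len(b"stream\n"):-len(b"\nendstream")])

print(found_in_plain, found_in_flate, MARKER in recovered)
```

This is why a format-aware pass that decodes streams before looking for metadata can find packets that the raw scan misses.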

On Wed, Mar 17, 2021 at 1:17 PM Leonard Rosenthol
<lr...@adobe.com.invalid> wrote:
>
> Are you only pulling document-level XMP?  If so, could you extend it to support object-level metadata as well?   I, for one, would love to get insight into the use of object-level metadata - what objects are they attached to, what are they being used for, etc.
>
> Leonard

Re: XMPs...all you could possibly want...and more!

Posted by Leonard Rosenthol <lr...@adobe.com.INVALID>.

Are you only pulling document-level XMP?  If so, could you extend it to support object-level metadata as well?  I, for one, would love to get insight into the use of object-level metadata: which objects it is attached to, what it is being used for, etc.

Leonard

On 3/17/21, 11:37 AM, "Tim Allison" <ta...@apache.org> wrote:

    All,

      I'm scraping XMPs out of our corpus and placing them here as standalone files:

    https://corpora.tika.apache.org/base/xmps/

      I've binned the files roughly based on the container file's mime
    type, e.g. https://corpora.tika.apache.org/base/xmps/pdf/

      The process is still running, and I view this as a first draft.
    Please let me know if there's anything I can do to make these data
    easier to use/more useful or if you see any problems.

      Cheers,

                 Tim