Posted to dev@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2011/09/23 02:23:01 UTC

Support for Open Graph meta tags

We were recently using Tika to process HTML pages that might have Open Graph meta tags.

The issue is that these tags get stripped out, and also aren't put into the metadata map.

The reason why is that Open Graph uses RDFa

http://stackoverflow.com/questions/2704942/html-validation-error-for-property-attribute/2705090#2705090

Since <meta property="xxx" content="yyy" /> isn't valid for XHTML 1.0, these tags can't be emitted.

But we could put them into the metadata map, by adding another test in the HtmlHandler code that currently has:

            if ("META".equals(name) && atts.getValue("content") != null) {
                // TIKA-478: For cases where we have either a name or
                // "http-equiv", assume that XHTMLContentHandler will emit
                // these in the <head>, thus passing them through safely.
                if (atts.getValue("http-equiv") != null) {
                    addHtmlMetadata(
                            atts.getValue("http-equiv"),
                            atts.getValue("content"));
                } else if (atts.getValue("name") != null) {
                    // Record the meta tag in the metadata
                    addHtmlMetadata(
                            atts.getValue("name"),
                            atts.getValue("content"));
                }

If we also catch the case where there is no name=xxx attribute but there is a property=xxx attribute, then that would take a tag like:

<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />

and put it into the metadata map as "og:url" => "http://www.imdb.com/title/tt0117500/"
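
To make the idea concrete, here is a minimal, self-contained sketch of the extra branch. It is not the actual HtmlHandler: the class MetaPropertySketch, its startMeta method and the plain Map standing in for Tika's metadata are hypothetical names used only for illustration. Only the if/else structure follows the excerpt above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical stand-in for the handler logic; not Tika's real class.
class MetaPropertySketch {

    // Assumption: a plain map plays the role of the metadata map.
    private final Map<String, String> metadata = new LinkedHashMap<>();

    Map<String, String> getMetadata() {
        return metadata;
    }

    // Same name as the helper in the excerpt, but a simplified body.
    private void addHtmlMetadata(String name, String value) {
        metadata.put(name, value);
    }

    // Mirrors the excerpt, plus one extra branch for property=xxx.
    void startMeta(String name, Attributes atts) {
        if ("META".equals(name) && atts.getValue("content") != null) {
            if (atts.getValue("http-equiv") != null) {
                addHtmlMetadata(
                        atts.getValue("http-equiv"),
                        atts.getValue("content"));
            } else if (atts.getValue("name") != null) {
                addHtmlMetadata(
                        atts.getValue("name"),
                        atts.getValue("content"));
            } else if (atts.getValue("property") != null) {
                // Proposed addition: record RDFa-style tags such as
                // Open Graph's og:url in the metadata only.
                addHtmlMetadata(
                        atts.getValue("property"),
                        atts.getValue("content"));
            }
        }
    }

    public static void main(String[] args) {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "property", "property", "CDATA", "og:url");
        atts.addAttribute("", "content", "content", "CDATA",
                "http://www.imdb.com/title/tt0117500/");

        MetaPropertySketch sketch = new MetaPropertySketch();
        sketch.startMeta("META", atts);
        System.out.println(sketch.getMetadata());
    }
}
```

Putting the property test last means existing name= and http-equiv= handling keeps its current behaviour, and the new branch only fires for tags that are dropped today.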

Thoughts on this?

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr




Re: Support for Open Graph meta tags

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Ken,

Super +1, this sounds like a great idea.

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Support for Open Graph meta tags

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Jukka,

This sounds like a good approach.

Cheers,
Chris

On Sep 23, 2011, at 3:24 AM, Jukka Zitting wrote:

> Hi,
> 
> On Fri, Sep 23, 2011 at 2:23 AM, Ken Krugler
> <kk...@transpac.com> wrote:
>> The reason why is that Open Graph uses RDFa
> 
> Instead of mapping the RDFa <meta> tags to Tika's Metadata and then
> back to normal XHTML <meta> tags, we might want to consider switching
> from plain XHTML to XHTML-with-RDFa as Tika's output format. That
> should make it easier to support more descriptive metadata and content
> annotations down the line.
> 
> In any case it would still be good to map RDFa <meta> tags also to the
> Metadata object. To do that properly (and to open the way to better
> XMP integration, my favourite TODO item :-), we'll probably need to
> extend the Metadata class to handle things like namespaces and
> structured values.
> 
> BR,
> 
> Jukka Zitting


Re: Support for Open Graph meta tags

Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 23 Sep 2011, Jukka Zitting wrote:
>> It would be great to get patches from that Mythical Someone who knows 
>> RDF
>
> Agreed. :-) As Antoni said, this is an area where we could and should
> be able to do better. There are quite a few RDF experts already at and
> around Apache, and it shouldn't be too hard to position Tika more
> prominently on their radars. The Any23 proposal that Chris is
> championing is one good chance for this.

I suggest a solution involving ApacheCon and some beer :)

Also at ApacheCon on the Tuesday is the BarCamp, so assuming a few of us 
will be there by then (I think we will be...) we could do a session there 
and hopefully get some RDF experts in to advise us.

Nick

Re: Support for Open Graph meta tags

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Sep 23, 2011 at 4:19 PM, Ken Krugler
<kk...@transpac.com> wrote:
> From my fairly naive perspective, it seems like one of the challenges
> here is that Tika tries to normalize/simplify interacting with data. [...]
> Whereas RDF is more focused on precision, in being explicit about
> the relationships between data.

Yep, as you mention that's obviously an issue that needs work and
sometimes tricky tradeoffs.

That said, I'm pretty confident that there is no fundamental
disconnect between these two goals, and I think over time (years most
likely) we will be able to work out all the details. We're already
taking steps along that road with our parsers exposing increasingly
more detailed document structure and our metadata model already
handling things like dates in a more structured manner.

At least that seems to me like an obvious candidate for inclusion in a
future roadmap for post-1.0 Tika.

> It would be great to get patches from that Mythical Someone who knows RDF

Agreed. :-) As Antoni said, this is an area where we could and should
be able to do better. There are quite a few RDF experts already at and
around Apache, and it shouldn't be too hard to position Tika more
prominently on their radars. The Any23 proposal that Chris is
championing is one good chance for this.

Also, now that I work at Adobe, my XMP itch has been growing quite a
bit, so I wouldn't be surprised if I ended up working on better XMP
(and thus RDF) support soon after Tika 1.0 is out.

BR,

Jukka Zitting

Re: Support for Open Graph meta tags

Posted by Ken Krugler <kk...@transpac.com>.
On Sep 23, 2011, at 7:00am, Antoni Mylka wrote:

> On 2011-09-23 15:12, Jukka Zitting wrote:
>>> So I think I'll just patch my local copy to do the Q&D thing, and wait for
>>> someone with more XML/RDF-fu to deal with it properly.
>> 
>> Until Someone (TM, :-) does that, I'd be very happy to see the simple
>> property=xxx mapping you described added to HtmlParser.
> 
> There seems to be a long tradition in ASF to appeal to Someone when there is talk about RDF. Chris Mattmann wrote back in November 2007:
> 
> "... it's reasonable that someone may need to rewrite the ability to represent metadata in RDF ..."
> 
> Whoever that Someone is - he has my support. ;-)
> 
> On a more serious note, though: in the four years since that metadata discussion, three separate RDF-related projects have appeared in/around ASF: Clerezza, Jena and Any23. Two are already in incubation, and the third is trying to get there. Jeremias Maerki noticed the lack of coordination in the metadata field four years ago. It's not getting any better.

Agreed.

From my fairly naive perspective, it seems like one of the challenges here is that Tika tries to normalize/simplify interacting with data. E.g. I just want the text from any document I come across. That seems to be the primary use case.

Whereas RDF is more focused on precision, in being explicit about the relationships between data. So I would expect to see many interesting tradeoffs in figuring out how best to straddle both worlds. Heck, figuring out how best to map fairly simple document elements to XHTML 1.0 has proven challenging.

It would be great to get patches from that Mythical Someone who knows RDF - versus, say, me, where the end result is likely to be horribly wrong.

For better or worse, RDF has never been an itch that I've needed to scratch.

-- Ken

Re: Support for Open Graph meta tags

Posted by Antoni Mylka <an...@gmail.com>.
On 2011-09-23 15:12, Jukka Zitting wrote:
>> So I think I'll just patch my local copy to do the Q&D thing, and wait for
>> someone with more XML/RDF-fu to deal with it properly.
>
> Until Someone (TM, :-) does that, I'd be very happy to see the simple
> property=xxx mapping you described added to HtmlParser.

There seems to be a long tradition in ASF to appeal to Someone when 
there is talk about RDF. Chris Mattmann wrote back in November 2007:

"... it's reasonable that someone may need to rewrite the ability to 
represent metadata in RDF ..."

Whoever that Someone is - he has my support. ;-)

On a more serious note, though: in the four years since that metadata 
discussion, three separate RDF-related projects have appeared in/around 
ASF: Clerezza, Jena and Any23. Two are already in incubation, and the 
third is trying to get there. Jeremias Maerki noticed the lack of 
coordination in the metadata field four years ago. It's not getting any 
better.

Antoni Myłka
antoni.mylka@gmail.com

Re: Support for Open Graph meta tags

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Sep 23, 2011 at 3:06 PM, Ken Krugler
<kk...@transpac.com> wrote:
> On Sep 23, 2011, at 3:24am, Jukka Zitting wrote:
>> In any case it would still be good to map RDFa <meta> tags also to the
>> Metadata object. To do that properly (and to open the way to better
>> XMP integration, my favourite TODO item :-), we'll probably need to
>> extend the Metadata class to handle things like namespaces and
>> structured values.
>
> That's what I was afraid of :)
>
> My head starts to hurt when I have to deal with namespaces and RDF.

From the client perspective the Metadata class should still provide a
simple key-value interface for basic things, just like the Tika facade
hides the more powerful constructs of the Parser and Detector
interfaces under a simplified API. Of course the implementation side
would be more complex...
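
As a rough illustration only, such a layering could look like the sketch below. Everything in it is hypothetical: NamespacedMetadataSketch, declarePrefix and the "{namespace}localName" storage key are made-up names, not an existing or planned Tika API. The only outside fact assumed is that the Open Graph namespace URI is http://ogp.me/ns#.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a namespace-aware store with a simple
// key-value view layered on top. Not Tika's Metadata class.
class NamespacedMetadataSketch {

    // prefix -> namespace URI, e.g. "og" -> "http://ogp.me/ns#"
    private final Map<String, String> prefixes = new LinkedHashMap<>();

    // "{namespaceUri}localName" -> value
    private final Map<String, String> values = new LinkedHashMap<>();

    void declarePrefix(String prefix, String namespaceUri) {
        prefixes.put(prefix, namespaceUri);
    }

    // Namespace-aware form, for clients that care about precision.
    void set(String namespaceUri, String localName, String value) {
        values.put(expand(namespaceUri, localName), value);
    }

    String get(String namespaceUri, String localName) {
        return values.get(expand(namespaceUri, localName));
    }

    // Simple key-value form: "og:url" or a plain key like "title".
    void set(String key, String value) {
        values.put(resolve(key), value);
    }

    String get(String key) {
        return values.get(resolve(key));
    }

    // Maps a declared prefix to its namespace; anything else is
    // stored under the empty namespace with the key left unchanged.
    private String resolve(String key) {
        int colon = key.indexOf(':');
        if (colon > 0) {
            String ns = prefixes.get(key.substring(0, colon));
            if (ns != null) {
                return expand(ns, key.substring(colon + 1));
            }
        }
        return expand("", key);
    }

    private static String expand(String namespaceUri, String localName) {
        return "{" + namespaceUri + "}" + localName;
    }

    public static void main(String[] args) {
        NamespacedMetadataSketch md = new NamespacedMetadataSketch();
        md.declarePrefix("og", "http://ogp.me/ns#");
        md.set("og:url", "http://www.imdb.com/title/tt0117500/");

        // Both views read the same stored value.
        System.out.println(md.get("og:url"));
        System.out.println(md.get("http://ogp.me/ns#", "url"));
    }
}
```

A client that only wants strings keeps calling get(key), while RDF-aware code can use the two-argument form; structured values are left out of this sketch entirely.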

> So I think I'll just patch my local copy to do the Q&D thing, and wait for
> someone with more XML/RDF-fu to deal with it properly.

Until Someone (TM, :-) does that, I'd be very happy to see the simple
property=xxx mapping you described added to HtmlParser. It's obviously
an improvement to the way Tika currently works, and I don't see any
major backwards compatibility issues caused by starting with a simple
solution like that and later on migrating to a more complete RDF-based
metadata model.

BR,

Jukka Zitting

Re: Support for Open Graph meta tags

Posted by Ken Krugler <kk...@transpac.com>.
On Sep 23, 2011, at 3:24am, Jukka Zitting wrote:

> Hi,
> 
> On Fri, Sep 23, 2011 at 2:23 AM, Ken Krugler
> <kk...@transpac.com> wrote:
>> The reason why is that Open Graph uses RDFa
> 
> Instead of mapping the RDFa <meta> tags to Tika's Metadata and then
> back to normal XHTML <meta> tags, we might want to consider switching
> from plain XHTML to XHTML-with-RDFa as Tika's output format. That
> should make it easier to support more descriptive metadata and content
> annotations down the line.
> 
> In any case it would still be good to map RDFa <meta> tags also to the
> Metadata object. To do that properly (and to open the way to better
> XMP integration, my favourite TODO item :-), we'll probably need to
> extend the Metadata class to handle things like namespaces and
> structured values.

That's what I was afraid of :)

My head starts to hurt when I have to deal with namespaces and RDF.

So I think I'll just patch my local copy to do the Q&D thing, and wait for someone with more XML/RDF-fu to deal with it properly.

-- Ken

Re: Support for Open Graph meta tags

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Sep 23, 2011 at 2:23 AM, Ken Krugler
<kk...@transpac.com> wrote:
> The reason why is that Open Graph uses RDFa

Instead of mapping the RDFa <meta> tags to Tika's Metadata and then
back to normal XHTML <meta> tags, we might want to consider switching
from plain XHTML to XHTML-with-RDFa as Tika's output format. That
should make it easier to support more descriptive metadata and content
annotations down the line.

In any case it would still be good to map RDFa <meta> tags also to the
Metadata object. To do that properly (and to open the way to better
XMP integration, my favourite TODO item :-), we'll probably need to
extend the Metadata class to handle things like namespaces and
structured values.

BR,

Jukka Zitting

Re: Support for Open Graph meta tags

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 22 Sep 2011, Ken Krugler wrote:
> The reason why is that Open Graph uses RDFa

Is it worth quickly checking what Any23 does for this kind of thing? (They 
are a hopefully soon-to-be-incubating project that a few people here are 
helping with, and which has some Tika links.) If they have a good model for 
handling this sort of RDF data, then it might make sense to do the same.

If not, I'd suggest we follow your suggested example :)

Nick