Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2011/09/23 16:24:26 UTC

[jira] [Created] (TIKA-728) Return RDFa meta tags via Metadata

Return RDFa meta tags via Metadata
----------------------------------

                 Key: TIKA-728
                 URL: https://issues.apache.org/jira/browse/TIKA-728
             Project: Tika
          Issue Type: Improvement
            Reporter: Ken Krugler
            Assignee: Ken Krugler
            Priority: Minor


Open Graph <meta> tags currently get stripped out, and also aren't put into the metadata map.

The reason is that Open Graph uses RDFa:

http://stackoverflow.com/questions/2704942/html-validation-error-for-property-attribute/2705090#2705090

Since <meta property="xxx" content="yyy" /> isn't valid XHTML 1.0, these tags can't be emitted in Tika's XHTML output.

We could take a tag like:

<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />

and put it into the metadata map as "og:url" => "http://www.imdb.com/title/tt0117500/"
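
The mapping proposed above can be sketched in a few lines. This is only an illustration of the idea, not Tika's HtmlParser code: it uses Python's standard html.parser module, and the names MetaPropertyCollector and extract_meta_properties are hypothetical.

```python
from html.parser import HTMLParser


class MetaPropertyCollector(HTMLParser):
    """Collects <meta property="..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here
        # by HTMLParser's default handle_startendtag.
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property")
        content = attributes.get("content")
        # Only RDFa-style property tags are collected in this sketch;
        # ordinary <meta name="..."> tags are skipped.
        if prop and content is not None:
            self.metadata[prop] = content


def extract_meta_properties(html):
    """Return a {property: content} map for the RDFa-style <meta> tags."""
    collector = MetaPropertyCollector()
    collector.feed(html)
    collector.close()
    return collector.metadata
```

Feeding it the tag from the example would yield a map with the single key "og:url". A later tag with the same property overwrites the earlier one here, which is a simplification.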


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (TIKA-728) Return RDFa meta tags via Metadata

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler reassigned TIKA-728:
--------------------------------

    Assignee:     (was: Ken Krugler)

Hoping Jörg picks this one up.
                

       

[jira] [Updated] (TIKA-728) Return RDFa meta tags via Metadata

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-728:
-----------------------------------

    Component/s: parser
                 metadata
    

        

[jira] [Commented] (TIKA-728) Return RDFa meta tags via Metadata

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113471#comment-13113471 ] 

Ken Krugler commented on TIKA-728:
----------------------------------

Jukka said (on the list):

{quote}
Instead of mapping the RDFa <meta> tags to Tika's Metadata and then
back to normal XHTML <meta> tags, we might want to consider switching
from plain XHTML to XHTML-with-RDFa as Tika's output format. That
should make it easier to support more descriptive metadata and content
annotations down the line.

In any case it would still be good to map RDFa <meta> tags also to the
Metadata object. To do that properly (and to open the way to better
XMP integration, my favourite TODO item :-), we'll probably need to
extend the Metadata class to handle things like namespaces and
structured values.
{quote}
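
To make the "namespaces" part of that suggestion concrete, here is a rough sketch of a metadata store that resolves a key such as "og:url" to a (namespace URI, local name) pair and keeps repeated values, while still offering a plain key-value lookup. It is not Tika's Metadata class. The class and function names are hypothetical, and the hard-coded prefix table is an assumption made for illustration; a real parser would read prefix declarations from the document.

```python
from collections import defaultdict

# Assumed prefix table for illustration only; a real implementation
# would take prefixes from declarations in the parsed document.
PREFIXES = {"og": "http://ogp.me/ns#"}


def qualify(key, prefixes=PREFIXES):
    """Split 'prefix:local' into (namespace URI, local name).

    Keys without a known prefix are returned as (None, key).
    """
    prefix, sep, local = key.partition(":")
    if not sep or prefix not in prefixes:
        return (None, key)
    return (prefixes[prefix], local)


class NamespacedMetadata:
    """Hypothetical store keyed by qualified name, with multiple values."""

    def __init__(self):
        self._values = defaultdict(list)

    def add(self, key, value):
        self._values[qualify(key)].append(value)

    def get_all(self, key):
        # Use .get so that lookups do not create empty entries.
        return list(self._values.get(qualify(key), []))

    def get(self, key):
        # Simple key-value view: first value, or None if absent.
        values = self.get_all(key)
        return values[0] if values else None
```

Callers that only want the simple view can keep using get("og:url"), while namespace-aware code can work with the qualified keys.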




        

[jira] [Commented] (TIKA-728) Return RDFa meta tags via Metadata

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113475#comment-13113475 ] 

Ken Krugler commented on TIKA-728:
----------------------------------

Antoni said (on the list):

{quote}
There seems to be a long tradition in the ASF of appealing to Someone whenever there is talk about RDF. Chris Mattmann wrote back in November 2007:

"... it's reasonable that someone may need to rewrite the ability to represent metadata in RDF ..."

Whoever that Someone is - he has my support. ;-)

On a more serious note, though: in the four years since that metadata discussion, three separate RDF-related projects have appeared in or around the ASF: Clerezza, Jena and Any23. Two are already in incubation, and the third is trying to get in. Jeremias Maerki noticed the lack of coordination in the metadata field four years ago. It's not getting any better.
{quote}


        

[jira] [Commented] (TIKA-728) Return RDFa meta tags via Metadata

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113472#comment-13113472 ] 

Ken Krugler commented on TIKA-728:
----------------------------------

That's what I was afraid of :)

My head starts to hurt when I have to deal with namespaces and RDF.

So I think I'll just patch my local copy to do the quick-and-dirty thing, and wait for someone with more XML/RDF-fu to deal with it properly.



        

[jira] [Commented] (TIKA-728) Return RDFa meta tags via Metadata

Posted by "Paolo Castagna (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144953#comment-13144953 ] 

Paolo Castagna commented on TIKA-728:
-------------------------------------

> It's not getting any better.

;-)

                

        

[jira] [Commented] (TIKA-728) Return RDFa meta tags via Metadata

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113473#comment-13113473 ] 

Ken Krugler commented on TIKA-728:
----------------------------------

Jukka said (on the list):

{quote}
From the client perspective the Metadata class should still provide a
simple key-value interface for basic things, just like the Tika facade
hides the more powerful constructs of the Parser and Detector
interfaces under a simplified API. Of course the implementation side
would be more complex...

Until Someone (TM, :-) does that, I'd be very happy to see the simple
property=xxx mapping you described added to HtmlParser. It's obviously
an improvement to the way Tika currently works, and I don't see any
major backwards compatibility issues caused by starting with a simple
solution like that and later on migrating to a more complete RDF-based
metadata model.
{quote}
