
Posted to dev@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2010/05/11 02:56:52 UTC

Attributes in XHTML output

Hi all,

I was taking another look at TIKA-379, which is the issue of "Html  
elements and attributes not available in XHTML representation"

In a comment on that issue, Jukka said:

> The reason for the default HTML mapping rules in Tika are to  
> simplify and normalize the input documents so that client  
> applications could easily process all sorts of input (HTML or not)  
> without needing type- or source-specific heuristics. The basic idea  
> has been that clients should directly use the underlying parser  
> libraries when it needs custom processing of specific content types.

It feels to me like the issue of elements is a bit different than  
attributes. When processing the response, having a well-constrained  
set of (XHTML-valid) elements would definitely make it easier for  
clients.

But I don't see how restricting valid XHTML _attributes_ helps much.  
During processing of the result, you typically care about the  
structure of the DOM, not optional attributes.

Anybody care to weigh in on this?

My specific issue has to do with lang and rel attributes, which are  
very useful during crawling.

I know that the HtmlMapper support (with some improvements) could  
address my needs, but if there's a way to propagate safe attributes  
through to everybody, that seems like a superior solution.
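
As a rough illustration of what "propagating safe attributes" could mean, here is a minimal sketch in plain Java. Everything in it is an assumption made for illustration: the class SafeAttributeFilter, the method mapSafeAttribute, and the particular whitelist are hypothetical and are not existing Tika API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not part of Tika): decides which attributes are
// allowed through on which already-normalized XHTML elements.
class SafeAttributeFilter {
    // Attributes assumed safe on any element (illustrative choice).
    private static final Set<String> GLOBAL =
            new HashSet<>(Arrays.asList("lang"));

    // Attributes assumed safe only on specific elements (illustrative choice).
    private static final Map<String, Set<String>> PER_ELEMENT = new HashMap<>();
    static {
        PER_ELEMENT.put("a", new HashSet<>(Arrays.asList("href", "rel")));
        PER_ELEMENT.put("link", new HashSet<>(Arrays.asList("href", "rel")));
    }

    /** Returns the lower-cased attribute name, or null if it should be dropped. */
    static String mapSafeAttribute(String element, String attribute) {
        String e = element.toLowerCase(Locale.ROOT);
        String a = attribute.toLowerCase(Locale.ROOT);
        if (GLOBAL.contains(a)) {
            return a;
        }
        Set<String> allowed = PER_ELEMENT.get(e);
        return (allowed != null && allowed.contains(a)) ? a : null;
    }
}
```

With these rules, rel survives on <a> and <link>, lang survives on any element, and everything else is dropped.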

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: Attributes in XHTML output

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 2010-05-11 15:22, Ken Krugler wrote:

>> If you pass through all valid attributes unchanged, then clients need to
>> be aware of "lang" and "rel" and their meaning, which poses a question:
>> what if some other format uses "language" and "function" instead? your
>> client then would have to handle all such variants of the same
>> (semantically speaking) data. It's a natural expectation that such
>> details should be handled by the library, and the library should know
>> that for this particular format "language" is semantically equivalent to
>> a better-known "lang" attribute...
> 
> If it's valid XHTML, and validates with (say) the XHTML 1.0 Strict DTD,
> then I don't think you would have this case of getting back a language
> (versus lang) attribute.

No, of course not - but XHTML is not the original data that we have, we
generate it ourselves, and we have a choice of either dropping offending
attributes, or converting them to something acceptable under XHTML.


> Or are you talking about ways to make it easier for parsers to return
> conformant attributes?

Yes.

>> +1 for a component that knows how to map common format-specific
>> attributes to abstract attributes e.g. Dublin Core, HTML, Office, etc.
>> The classes in o.a.nutch.metadata may be helpful.
> 
> So if I understand this correctly, it's not a concern about passing
> through valid XHTML attributes, but rather their value to clients -
> specifically in the context of normalizing the meaning for a variety of
> input formats.

Passing translated attributes when we can (according to a mapping), and
passing original attributes in a non-offending way when we can't
translate them.
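
A minimal sketch of that rule, using only the SAX classes that ship with the JDK. The class name AttributeTranslator, the namespace URI, the "orig" prefix, and the sample mapping are all assumptions for illustration; nothing here is existing Tika or Nutch code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical component: translate what we can, namespace what we can't.
class AttributeTranslator {
    // Made-up namespace for attributes that have no XHTML equivalent.
    static final String ORIG_NS = "http://example.org/original-attributes";

    // Attributes that are already acceptable as-is (illustrative subset).
    private final Set<String> xhtml =
            new HashSet<>(Arrays.asList("lang", "rel", "href"));

    // Format-specific name -> XHTML name (illustrative mapping).
    private final Map<String, String> mapping = new HashMap<>();

    AttributeTranslator() {
        mapping.put("language", "lang");
        mapping.put("function", "rel");
    }

    Attributes translate(Attributes in) {
        AttributesImpl out = new AttributesImpl();
        for (int i = 0; i < in.getLength(); i++) {
            String name = in.getLocalName(i);
            String target = xhtml.contains(name) ? name : mapping.get(name);
            if (target != null) {
                // Known or translatable: emit as a plain XHTML attribute.
                out.addAttribute("", target, target, in.getType(i), in.getValue(i));
            } else {
                // Unknown: keep it, but outside the XHTML attribute space.
                out.addAttribute(ORIG_NS, name, "orig:" + name,
                        in.getType(i), in.getValue(i));
            }
        }
        return out;
    }
}
```

A real emitter would also need to declare the "orig" prefix (for example via startPrefixMapping) before the element is written.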

> 
> I think the initial idea was to use the metadata map to return these in
> a generic way, which works for document-wide things...but most of what's
> interesting to me, at least, is on a per-element basis.
> 
> If we said that XHTML 1.0 Strict specified allowable attributes, would
> this address your concern about clients needing to handle multiple
> attribute names?

Can't we put any attributes that we want if they are under a different
namespace, and still be XHTML conformant? You are right that top-level
maps may not cut it - e.g. when parsing bilingual corpora (like europarl)
every other line should get a different <p lang="">.
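
To show why the per-element case matters, here is a small sketch of a client reading lang from each <p> with a plain SAX handler from the JDK. The class LangCollector and the "und" fallback for undeclared languages are assumptions for illustration, and the input is a hand-written fragment rather than real Tika output.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical client-side handler: pairs each paragraph with its language.
class LangCollector extends DefaultHandler {
    // Language in effect for each open element; children inherit from parents.
    private final Deque<String> langStack = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    final List<String> paragraphs = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        String lang = atts.getValue("lang");
        if (lang == null) {
            lang = langStack.isEmpty() ? "und" : langStack.peek();
        }
        langStack.push(lang);
        if ("p".equals(qName)) {
            text.setLength(0); // start collecting text for this paragraph
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        String lang = langStack.pop();
        if ("p".equals(qName)) {
            paragraphs.add(lang + ": " + text.toString().trim());
        }
    }

    static List<String> collect(String xhtml) {
        LangCollector handler = new LangCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.paragraphs;
    }
}
```

A document-wide metadata entry could only carry one language here, whereas the handler sees a separate value on every paragraph.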

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Attributes in XHTML output

Posted by Ken Krugler <kk...@transpac.com>.

Hi Andrzej,

Thanks for responding. See my comments/questions at the end...

On May 11, 2010, at 2:40am, Andrzej Bialecki wrote:

> On 2010-05-11 02:56, Ken Krugler wrote:
>> Hi all,
>>
>> I was taking another look at TIKA-379, which is the issue of "Html
>> elements and attributes not available in XHTML representation"
>>
>> In a comment on that issue, Jukka said:
>>
>>> The reason for the default HTML mapping rules in Tika are to simplify
>>> and normalize the input documents so that client applications could
>>> easily process all sorts of input (HTML or not) without needing type-
>>> or source-specific heuristics. The basic idea has been that clients
>>> should directly use the underlying parser libraries when it needs
>>> custom processing of specific content types.
>>
>> It feels to me like the issue of elements is a bit different than
>> attributes. When processing the response, having a well-constrained set
>> of (XHTML-valid) elements would definitely make it easier for clients.
>>
>> But I don't see how restricting valid XHTML _attributes_ helps much.
>> During processing of the result, you care about the structure of the
>> DOM, not typically optional attributes.
>>
>> Anybody care to weigh in on this?
>>
>> My specific issue has to do with lang and rel attributes, which are very
>> useful during crawling.
>
> Hi,
>
> In my opinion this has to do with the level of knowledge that you expect
> from the clients of this API, and the extent of a meaningful schema
> mapping that you can perform by default.
>
> If you pass through all valid attributes unchanged, then clients need to
> be aware of "lang" and "rel" and their meaning, which poses a question:
> what if some other format uses "language" and "function" instead? your
> client then would have to handle all such variants of the same
> (semantically speaking) data. It's a natural expectation that such
> details should be handled by the library, and the library should know
> that for this particular format "language" is semantically equivalent to
> a better-known "lang" attribute...

If it's valid XHTML, and validates with (say) the XHTML 1.0 Strict  
DTD, then I don't think you would have this case of getting back a  
language (versus lang) attribute.

Or are you talking about ways to make it easier for parsers to return  
conformant attributes?

> Such 1:1 mapping is often impossible to do, but in many useful cases it
> is possible. I think this should be a configurable component in Tika.
>
> E.g. in many Nutch plugins we map format-specific attributes to a
> "standard set" of attributes that other Nutch plugins can rely upon.
> This is currently hardcoded in plugin implementations.
>
>> I know that the HtmlMapper support (with some improvements) could
>> address my needs, but if there's a way to propagate safe attributes
>> through to everybody, that seems like a superior solution.
>
> +1 for a component that knows how to map common format-specific
> attributes to abstract attributes e.g. Dublin Core, HTML, Office, etc.
> The classes in o.a.nutch.metadata may be helpful.

So if I understand this correctly, it's not a concern about passing  
through valid XHTML attributes, but rather their value to clients -  
specifically in the context of normalizing the meaning for a variety  
of input formats.

I think the initial idea was to use the metadata map to return these  
in a generic way, which works for document-wide things...but most of  
what's interesting to me, at least, is on a per-element basis.

If we said that XHTML 1.0 Strict specified allowable attributes, would  
this address your concern about clients needing to handle multiple  
attribute names?

Thanks,

-- Ken


Re: Attributes in XHTML output

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 2010-05-11 02:56, Ken Krugler wrote:
> Hi all,
> 
> I was taking another look at TIKA-379, which is the issue of "Html
> elements and attributes not available in XHTML representation"
> 
> In a comment on that issue, Jukka said:
> 
>> The reason for the default HTML mapping rules in Tika are to simplify
>> and normalize the input documents so that client applications could
>> easily process all sorts of input (HTML or not) without needing type-
>> or source-specific heuristics. The basic idea has been that clients
>> should directly use the underlying parser libraries when it needs
>> custom processing of specific content types.
> 
> It feels to me like the issue of elements is a bit different than
> attributes. When processing the response, having a well-constrained set
> of (XHTML-valid) elements would definitely make it easier for clients.
> 
> But I don't see how restricting valid XHTML _attributes_ helps much.
> During processing of the result, you care about the structure of the
> DOM, not typically optional attributes.
> 
> Anybody care to weigh in on this?
> 
> My specific issue has to do with lang and rel attributes, which are very
> useful during crawling.

Hi,

In my opinion this has to do with the level of knowledge that you expect
from the clients of this API, and the extent of a meaningful schema
mapping that you can perform by default.

If you pass through all valid attributes unchanged, then clients need to
be aware of "lang" and "rel" and their meaning, which poses a question:
what if some other format uses "language" and "function" instead? your
client then would have to handle all such variants of the same
(semantically speaking) data. It's a natural expectation that such
details should be handled by the library, and the library should know
that for this particular format "language" is semantically equivalent to
a better-known "lang" attribute...

Such 1:1 mapping is often impossible to do, but in many useful cases it
is possible. I think this should be a configurable component in Tika.

E.g. in many Nutch plugins we map format-specific attributes to a
"standard set" of attributes that other Nutch plugins can rely upon.
This is currently hardcoded in plugin implementations.
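
One possible shape for the configurable variant, sketched with java.util.Properties so the rules live in a text file instead of plugin code. The class ConfigurableAttributeMapping, the "format.attribute = standardName" key layout, and the format name used below are assumptions for illustration, not an existing Tika or Nutch facility.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical component: attribute mapping rules loaded from configuration.
class ConfigurableAttributeMapping {
    private final Properties rules = new Properties();

    // config holds lines such as "formatA.language = lang".
    ConfigurableAttributeMapping(String config) {
        try {
            rules.load(new StringReader(config));
        } catch (IOException e) {
            throw new IllegalArgumentException("Unreadable mapping config", e);
        }
    }

    /** Returns the standard name for a format-specific attribute, or null if no rule exists. */
    String standardName(String format, String attribute) {
        return rules.getProperty(format + "." + attribute);
    }
}
```

For example, a config of "formatA.language = lang" makes standardName("formatA", "language") return "lang", while a format with no rule returns null and the caller decides what to do with the attribute.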

> 
> I know that the HtmlMapper support (with some improvements) could
> address my needs, but if there's a way to propagate safe attributes
> through to everybody, that seems like a superior solution.

+1 for a component that knows how to map common format-specific
attributes to abstract attributes e.g. Dublin Core, HTML, Office, etc.
The classes in o.a.nutch.metadata may be helpful.
