Posted to dev@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2010/08/23 19:01:26 UTC

Metadata case sensitivity

I ran into an issue recently, where the metadata after a parse had two  
versions of the same data.

One was from the HTTP response headers, and was called "Content-Type".

The other had been derived from a <meta http-equiv="content-type">  
element in the HTML content.

That brings up two questions:

1. Should Tika's Metadata ensure that keys are case-insensitive unique?

2. For the above case, who wins? Based on HTML5's approach to charset
detection (see http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html),
I think it's the response header, but based on experience, I think
it should be what's in the HTML.
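
To make question 1 concrete, here is a minimal sketch of what
"case-insensitive unique" keys could look like. It is plain Java with
no Tika dependency; the class name CaseInsensitiveMetadata and its
methods are hypothetical and are not Tika's Metadata API. In this
sketch get() returns whichever value was added first, which is only
one possible answer to question 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a metadata container; NOT Tika's Metadata class.
final class CaseInsensitiveMetadata {
    // CASE_INSENSITIVE_ORDER treats "Content-Type" and "content-type" as one key.
    private final Map<String, List<String>> values =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    // Appends a value under the name, matching existing keys without regard to case.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Replaces every value stored under the name (again ignoring case).
    void set(String name, String value) {
        List<String> list = new ArrayList<>();
        list.add(value);
        values.put(name, list);
    }

    // Returns the first value added for the name, or null if there is none.
    String get(String name) {
        List<String> list = values.get(name);
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }

    // Number of distinct keys, counted case-insensitively.
    int size() {
        return values.size();
    }
}
```

With a container like this, adding "Content-Type" from the response
headers and "content-type" from the meta element yields a single key
holding both values, and the caller decides which source takes
precedence by choosing add() or set().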

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





Re: Metadata case sensitivity

Posted by Ken Krugler <kk...@transpac.com>.
Hi Chris,

Thanks for the ref to SpellCheckedMetadata.

Based on the previous decision, I'll go ahead and add some checks in  
the HtmlHandler code to fix up capitalization issues, since Tika  
itself is the "client" in this case (it consumes the content type  
information).
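
As an illustration of the kind of capitalization fix-up described
above, the sketch below rewrites an http-equiv name into the usual
HTTP header spelling before it is stored. It is not the actual
TIKA-497 change; the class HeaderNames and the method canonicalize are
hypothetical names, and the rule (capitalize the first letter of each
hyphen-separated token) is an assumption that will not reproduce
irregular names such as "ETag".

```java
import java.util.Locale;

// Hypothetical helper; not part of Tika or of the TIKA-497 patch.
final class HeaderNames {
    private HeaderNames() {}

    // Lower-cases the name, then upper-cases the first character of each
    // hyphen-separated token, e.g. "content-type" becomes "Content-Type".
    static String canonicalize(String name) {
        StringBuilder out = new StringBuilder(name.length());
        boolean startOfToken = true;
        for (char c : name.trim().toLowerCase(Locale.ROOT).toCharArray()) {
            out.append(startOfToken ? Character.toUpperCase(c) : c);
            startOfToken = (c == '-');
        }
        return out.toString();
    }
}
```

A handler could call HeaderNames.canonicalize() on the http-equiv
attribute value before writing to the metadata, so that the entry
derived from the HTML lands on the same "Content-Type" key as the
response header.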

https://issues.apache.org/jira/browse/TIKA-497 tracks this.

-- Ken

On Aug 23, 2010, at 10:13am, Mattmann, Chris A (388J) wrote:

> Hey Ken,
>
> RE: #1, see SpellCheckedMetadata [1]. Jerome and Sami and I worked  
> on it a long time ago, and it handles exactly the case you are  
> talking about. RE: #2, ehh...not sure! :) Jukka took out [1] in  
> r780895 [2], because he felt it would best be handled in client code.
>
> Cheers,
> Chris
>
> [1] http://s.apache.org/eo
> [2] http://svn.apache.org/viewvc/?rev=780895&view=rev






Re: Metadata case sensitivity

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Ken,

RE: #1, see SpellCheckedMetadata [1]. Jerome and Sami and I worked on it a long time ago, and it handles exactly the case you are talking about. RE: #2, ehh...not sure! :) Jukka took out [1] in r780895 [2], because he felt it would best be handled in client code.

Cheers,
Chris

[1] http://s.apache.org/eo
[2] http://svn.apache.org/viewvc/?rev=780895&view=rev

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++