Posted to dev@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2010/08/23 19:01:26 UTC
Metadata case sensitivity
I ran into an issue recently, where the metadata after a parse had two
versions of the same data.
One was from the HTTP response headers, and was called "Content-Type".
The other had been derived from a <meta http-equiv="content-type">
element in the HTML content.
That brings up two questions:
1. Should Tika's Metadata ensure that keys are unique when compared case-insensitively?
2. For the above case, who wins? Based on HTML5's approach to charset
detection (see http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html),
I think it's the response header, but based on experience, I think
it should be what's in the HTML.
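To make question 1 concrete, here is a minimal sketch of what "case-insensitive unique" keys could look like. The class name CaseInsensitiveMetadata and its methods are hypothetical; this is a plain JDK illustration, not Tika's Metadata class and not a proposal for its actual API. It also hard-codes one possible answer to question 2 (the last writer wins), which is only an assumption for the sake of the example.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a metadata store; NOT Tika's Metadata class. */
final class CaseInsensitiveMetadata {
    // A TreeMap ordered by CASE_INSENSITIVE_ORDER treats "Content-Type"
    // and "content-type" as the same key.
    private final Map<String, String> values =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    /** Stores a value; a later set() under any capitalization replaces the earlier one. */
    void set(String name, String value) {
        // Remove first so the most recently used capitalization of the key is kept.
        values.remove(name);
        values.put(name, value);
    }

    /** Looks up a value regardless of how the key is capitalized. */
    String get(String name) {
        return values.get(name);
    }

    /** Number of distinct keys, compared case-insensitively. */
    int size() {
        return values.size();
    }
}
```

With this sketch, setting "Content-Type" from the response header and then "content-type" from the meta element leaves a single entry, and the second value is the one returned. Note that CASE_INSENSITIVE_ORDER ignores locale, which should be acceptable for ASCII header names.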
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
Re: Metadata case sensitivity
Posted by Ken Krugler <kk...@transpac.com>.
Hi Chris,
Thanks for the ref to SpellCheckedMetadata.
Based on the previous decision, I'll go ahead and add some checks in
the HtmlHandler code to fix up capitalization issues, since Tika
itself is the "client" in this case (it consumes the content type
information).
https://issues.apache.org/jira/browse/TIKA-497 tracks this.
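As a rough illustration of the kind of capitalization fix-up described above, the following sketch normalizes an http-equiv name to the usual HTTP header capitalization, so that a value taken from the HTML would land on the same key as the response header. The HeaderNames class and its canonicalize method are hypothetical names made up for this example; this is not the actual TIKA-497 change and does not use any Tika API.

```java
import java.util.Locale;

/** Hypothetical helper; not part of Tika. */
final class HeaderNames {
    private HeaderNames() {}

    /** Converts e.g. "content-type" or "CONTENT-TYPE" to "Content-Type". */
    static String canonicalize(String name) {
        StringBuilder out = new StringBuilder(name.length());
        boolean startOfToken = true;
        // Lowercase everything, then uppercase the first character of each
        // hyphen-separated token.
        for (char c : name.trim().toLowerCase(Locale.ROOT).toCharArray()) {
            out.append(startOfToken ? Character.toUpperCase(c) : c);
            startOfToken = (c == '-');
        }
        return out.toString();
    }
}
```

A handler could call HeaderNames.canonicalize(httpEquivValue) before storing the metadata entry. Whether the HTML value should then overwrite the header value is a separate policy decision (question 2 in the original message).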
-- Ken
On Aug 23, 2010, at 10:13am, Mattmann, Chris A (388J) wrote:
> Hey Ken,
>
> RE: #1, see SpellCheckedMetadata [1]. Jerome and Sami and I worked
> on it a long time ago, and it handles exactly the case you are
> talking about. RE: #2, ehh...not sure! :) Jukka took out [1] in
> r780895 [2], because he felt it would best be handled in client code.
>
> Cheers,
> Chris
>
> [1] http://s.apache.org/eo
> [2] http://svn.apache.org/viewvc/?rev=780895&view=rev
Re: Metadata case sensitivity
Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.
Hey Ken,
RE: #1, see SpellCheckedMetadata [1]. Jerome and Sami and I worked on it a long time ago, and it handles exactly the case you are talking about. RE: #2, ehh...not sure! :) Jukka took out [1] in r780895 [2], because he felt it would best be handled in client code.
Cheers,
Chris
[1] http://s.apache.org/eo
[2] http://svn.apache.org/viewvc/?rev=780895&view=rev
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++