Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2018/07/08 19:53:00 UTC
[jira] [Commented] (TIKA-2648) mime detection based on resource name detects resources as "text/x-php" instead of "text/html"
[ https://issues.apache.org/jira/browse/TIKA-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536396#comment-16536396 ]
Ken Krugler commented on TIKA-2648:
-----------------------------------
[~wastl-nagel] - you mentioned that you thought this solution was very specific. Do you have more input on which cases would be better served by a more general solution?
In general, what we're talking about here is processing content that is *generated* by the resource, rather than the resource itself. The only case I've run into is generation via an HTTP request, but it would be interesting to hear from others whether it happens in other ways.
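To make the distinction concrete, here is a minimal sketch of the idea in plain Java. It is not Tika's API and not the patch under discussion. The class name, the tiny extension table, and the rule "trust the content-type hint when the resource name is an http(s) URL" are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;

public class GeneratedContentSketch {

    // Hypothetical extension-to-type table; a real detector would consult a full MIME registry.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "php", "text/x-php",
            "html", "text/html",
            "pdf", "application/pdf");

    /** True when the name parses as an http(s) URL, i.e. the content was generated by a server. */
    static boolean isHttpResource(String resourceName) {
        try {
            String scheme = URI.create(resourceName).getScheme();
            return scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
        } catch (IllegalArgumentException e) {
            return false; // not a parseable URI, so treat it as a plain file name
        }
    }

    /** Looks up the text after the last dot in the extension table; null when unknown. */
    static String typeFromName(String resourceName) {
        int dot = resourceName.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        String ext = resourceName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.get(ext);
    }

    /**
     * Assumed policy: for an http(s) URL the extension names the generator,
     * not the generated content, so a supplied hint wins. Otherwise the
     * name is tried first and the hint is only a fallback.
     */
    static String detect(String resourceName, String contentTypeHint) {
        if (resourceName != null) {
            if (contentTypeHint != null && isHttpResource(resourceName)) {
                return contentTypeHint;
            }
            String byName = typeFromName(resourceName);
            if (byName != null) {
                return byName;
            }
        }
        return contentTypeHint != null ? contentTypeHint : "application/octet-stream";
    }

    public static void main(String[] args) {
        // Generated content fetched over HTTP: the hint is kept.
        System.out.println(detect("https://www.facebook.com/home.php", "text/html")); // text/html
        // The resource itself (a local file name): the extension decides.
        System.out.println(detect("home.php", "text/html")); // text/x-php
    }
}
```

The open question is whether keying the rule on the URL scheme is too narrow. If content can be generated by a resource in some way other than an HTTP request, `isHttpResource` is the part that would need to be generalized.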
> mime detection based on resource name detects resources as "text/x-php" instead of "text/html"
> -----------------------------------------------------------------------------------------------
>
> Key: TIKA-2648
> URL: https://issues.apache.org/jira/browse/TIKA-2648
> Project: Tika
> Issue Type: Bug
> Reporter: Gerard Bouchar
> Priority: Major
>
> When using Tika to detect a MIME type given only a URL containing ".php" and a content-type hint of "text/html", it guesses "text/x-php", whereas one would expect "text/html".
> {code}
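> // Imports needed by the snippet below.
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MediaType;
>
> TikaConfig tika = new TikaConfig(); // default configuration; can throw TikaException or IOException
> Metadata metadata = new Metadata();
> String url = "https://www.facebook.com/home.php";
> metadata.set(Metadata.RESOURCE_NAME_KEY, url); // resource name ends in ".php"
> metadata.set(Metadata.CONTENT_TYPE, "text/html"); // content-type hint
> MediaType type = tika.getDetector().detect(null, metadata); // null stream: detection from metadata only
> System.out.println(url + " is of type " + type.toString());
> // Prints https://www.facebook.com/home.php is of type text/x-php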
> {code}