Posted to user@tika.apache.org by Public Network Services <pu...@gmail.com> on 2012/07/25 01:05:12 UTC

Charset detection

The CHANGES.txt document of Tika 1.2 mentions that

    Tika now returns the detected character encoding as a "charset"
    parameter of the content type metadata field for text/plain and
    text/html documents. For example, instead of just "text/plain", the
    returned content type will be something like "text/plain; charset=UTF-8"
    for a UTF-8 encoded text document.


However, when running type detection on a set of plain text (ASCII) files
(some IETF RFCs), the returned type is still just "text/plain", without any
charset information.

The code I am using to detect the content type of each file is something like:

Tika tika = new Tika();
InputStream is = TikaInputStream.get(new FileInputStream(file));
System.out.println(tika.detect(is));


and the output is still "text/plain", as per previous versions of Tika.

Should that be the case?

Re: Charset detection

Posted by Public Network Services <pu...@gmail.com>.
Of course the return type is MediaType, i.e.

MediaType type = TikaConfig.getDefaultConfig().getDetector().detect(...);
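
Spelled out, a rough and untested sketch of what I mean, going through the
Detector API with an empty Metadata object (the file name is just a
placeholder):

import java.io.File;
import java.io.InputStream;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

Detector detector = TikaConfig.getDefaultConfig().getDetector();
InputStream is = TikaInputStream.get(new File("rfc2616.txt"));
try {
    // detect() returns a MediaType object rather than a String
    MediaType type = detector.detect(is, new Metadata());
    System.out.println(type.getBaseType());   // e.g. text/plain
    System.out.println(type.getParameters()); // any parameters, e.g. charset
} finally {
    is.close();
}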



On Thu, Jul 26, 2012 at 1:03 AM, Public Network Services <
publicnetworkservices@gmail.com> wrote:

> Actually, I am surprised that many people are not shouting about this
> already.
>
> All the static detect() methods of the Tika convenience class return the
> mime type as a String and, even if not the recommended approach, they are
> certainly very popular.
>
> I have always been puzzled as to why the return type of such methods
> should be String, as opposed to a MimeType object.
>
> Tika is an excellent work and all the contributors are to be
> congratulated, but, with all due respect, it seems that this modification of
> the String returned for "text/plain" will cause numerous headaches.
>
> Perhaps you should issue a directive that people should use the MimeType
> class, even if that means creating such objects by parsing the String that
> Tika.detect() returns. Or, do something like
>
> MimeType type = TikaConfig.getDefaultConfig().getDetector().detect(...);
>
>
> :-)
>
>
> On Wed, Jul 25, 2012 at 3:50 PM, Paulini, Matthew CTR USAF AFMC AFRL/RISA
> <ma...@rl.af.mil> wrote:
>
>> I can see how the encoding might be useful to some people. However, I
>> also agree that older code that checks the MIME type returned from Tika
>> for equality (i.e. .equals() or .compareTo() in Java) rather than
>> containment (i.e. contains() in Java) could run into issues if the dependent
>> code doesn't do extra processing on the MIME type before the check. Since
>> the encoding was never present before, the chances that older code does
>> such processing to grab just the MIME type portion of the returned string
>> are slim, I would assume.
>>
>> Wouldn't it be more backward compatible if you just added an "encoding"
>> field to the list of metadata attributes that are returned?
>>
>> ~Scout
>>
>> ________________________________
>>
>> From: Public Network Services [mailto:publicnetworkservices@gmail.com]
>> Sent: Wed 7/25/2012 8:31 AM
>> To: user@tika.apache.org
>> Subject: Re: Charset detection
>>
>>
>> If it does not add much to processing, then it could be run earlier, for
>> consistency purposes.
>>
>> Having said that, I am not sure about the usefulness of appending the
>> charset at the end of the detected MIME type string in the first place. It
>> is correct from a syntax point of view, but it adds one more level of string
>> processing to extract it (as opposed to just getting it from the metadata).
>> Are we sure, for instance, that older code (checking for equality to
>> "text/plain") will not be broken?
>>
>> Of course the decision has already been made and you guys know very well
>> what you are doing, but it still puzzles me. :-)
>>
>>
>> On Wed, Jul 25, 2012 at 10:55 AM, Jukka Zitting <ju...@gmail.com>
>> wrote:
>>
>>
>>         Hi,
>>
>>
>>         On Wed, Jul 25, 2012 at 1:05 AM, Public Network Services
>>         <pu...@gmail.com> wrote:
>>         > Should that be the case?
>>
>>
>>         Yes. So far the extra charset detection code is only being run when
>>         you actually parse a document, so the charset parameter gets added at
>>         that point, not yet at type detection. Perhaps we should already run
>>         charset detection at type detection time?
>>
>>         BR,
>>
>>         Jukka Zitting
>>
>>
>>
>>
>

Re: Charset detection

Posted by Public Network Services <pu...@gmail.com>.
Actually, I am surprised that many people are not shouting about this
already.

All the static detect() methods of the Tika convenience class return the
mime type as a String and, even if not the recommended approach, they are
certainly very popular.

I have always been puzzled as to why the return type of such methods should
be String, as opposed to a MimeType object.

Tika is an excellent work and all the contributors are to be congratulated,
but, with all due respect, it seems that this modification of the String
returned for "text/plain" will cause numerous headaches.

Perhaps you should issue a directive that people should use the MimeType
class, even if that means creating such objects by parsing the String that
Tika.detect() returns. Or, do something like

MimeType type = TikaConfig.getDefaultConfig().getDetector().detect(...);


:-)
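
To be concrete, something along these lines should work (untested; "file" is
whatever java.io.File is being detected):

import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeType;

Tika tika = new Tika();
String detected = tika.detect(file); // may now be e.g. "text/plain; charset=UTF-8"

// Parse the String into a MediaType; parameters such as charset stay separate
MediaType mediaType = MediaType.parse(detected);

// Or look up the corresponding MimeType entry, ignoring any parameters
// (forName throws a checked MimeTypeException)
MimeType mimeType = TikaConfig.getDefaultConfig()
        .getMimeRepository()
        .forName(mediaType.getBaseType().toString());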


On Wed, Jul 25, 2012 at 3:50 PM, Paulini, Matthew CTR USAF AFMC AFRL/RISA <
matthew.paulini.ctr@rl.af.mil> wrote:

> I can see how the encoding might be useful to some people. However, I also
> agree that older code that checks the MIME type returned from Tika for
> equality (i.e. .equals() or .compareTo() in Java) rather than containment
> (i.e. contains() in Java) could run into issues if the dependent code
> doesn't do extra processing on the MIME type before the check. Since the
> encoding was never present before, the chances that older code does such
> processing to grab just the MIME type portion of the returned string
> are slim, I would assume.
>
> Wouldn't it be more backward compatible if you just added an "encoding"
> field to the list of metadata attributes that are returned?
>
> ~Scout
>
> ________________________________
>
> From: Public Network Services [mailto:publicnetworkservices@gmail.com]
> Sent: Wed 7/25/2012 8:31 AM
> To: user@tika.apache.org
> Subject: Re: Charset detection
>
>
> If it does not add much to processing, then it could be run earlier, for
> consistency purposes.
>
> Having said that, I am not sure about the usefulness of appending the
> charset at the end of the detected MIME type string in the first place. It
> is correct from a syntax point of view, but it adds one more level of string
> processing to extract it (as opposed to just getting it from the metadata).
> Are we sure, for instance, that older code (checking for equality to
> "text/plain") will not be broken?
>
> Of course the decision has already been made and you guys know very well
> what you are doing, but it still puzzles me. :-)
>
>
> On Wed, Jul 25, 2012 at 10:55 AM, Jukka Zitting <ju...@gmail.com>
> wrote:
>
>
>         Hi,
>
>
>         On Wed, Jul 25, 2012 at 1:05 AM, Public Network Services
>         <pu...@gmail.com> wrote:
>         > Should that be the case?
>
>
>         Yes. So far the extra charset detection code is only being run when
>         you actually parse a document, so the charset parameter gets added at
>         that point, not yet at type detection. Perhaps we should already run
>         charset detection at type detection time?
>
>         BR,
>
>         Jukka Zitting
>
>
>
>

RE: Charset detection

Posted by "Paulini, Matthew CTR USAF AFMC AFRL/RISA" <ma...@rl.af.mil>.
I can see how the encoding might be useful to some people. However, I also agree that older code that checks the MIME type returned from Tika for equality (i.e. .equals() or .compareTo() in Java) rather than containment (i.e. contains() in Java) could run into issues if the dependent code doesn't do extra processing on the MIME type before the check. Since the encoding was never present before, the chances that older code does such processing to grab just the MIME type portion of the returned string are slim, I would assume.
 
Wouldn't it be more backward compatible if you just added an "encoding" field to the list of metadata attributes that are returned?
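
Roughly what I have in mind, as an untested sketch (Metadata.CONTENT_ENCODING
already exists as a key; whether every parser would populate it is the open
question here):

import org.apache.tika.metadata.Metadata;

// assuming "metadata" has been filled in by a parse
String contentType = metadata.get(Metadata.CONTENT_TYPE);   // would stay just "text/plain"
String encoding = metadata.get(Metadata.CONTENT_ENCODING);  // "UTF-8", in its own field

// so older equality checks keep working unchanged
boolean isPlainText = "text/plain".equals(contentType);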
 
~Scout

________________________________

From: Public Network Services [mailto:publicnetworkservices@gmail.com]
Sent: Wed 7/25/2012 8:31 AM
To: user@tika.apache.org
Subject: Re: Charset detection


If it does not add much to processing, then it could be run earlier, for consistency purposes.

Having said that, I am not sure about the usefulness of appending the charset at the end of the detected MIME type string in the first place. It is correct from a syntax point of view, but it adds one more level of string processing to extract it (as opposed to just getting it from the metadata). Are we sure, for instance, that older code (checking for equality to "text/plain") will not be broken?

Of course the decision has already been made and you guys know very well what you are doing, but it still puzzles me. :-)


On Wed, Jul 25, 2012 at 10:55 AM, Jukka Zitting <ju...@gmail.com> wrote:


	Hi,
	

	On Wed, Jul 25, 2012 at 1:05 AM, Public Network Services
	<pu...@gmail.com> wrote:
	> Should that be the case?
	
	
	Yes. So far the extra charset detection code is only being run when
	you actually parse a document, so the charset parameter gets added at
	that point, not yet at type detection. Perhaps we should already run
	charset detection at type detection time?
	
	BR,
	
	Jukka Zitting
	



Re: Charset detection

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Jul 25, 2012 at 2:31 PM, Public Network Services
<pu...@gmail.com> wrote:
> Having said that, I am not sure about the usefulness of appending the
> charset at the end of the detected MIME type string in the first place. It
> is correct from a syntax point of view, but it adds one more level of string
> processing to extract it (as opposed to just getting it from the metadata).
> Are we sure, for instance, that older code (checking for equality to
> "text/plain") will not be broken?

That was part of the thinking behind doing the charset detection, for
now, only when a document is actually being parsed rather than already
at type detection time. It's also why the change was described in so
much detail in CHANGES.txt.

In general I'd recommend that people dealing with media types move away
from basic string matching to using the MediaType and
MediaTypeRegistry classes. That way, code that for example checks the
type detection result against something like "text/plain" won't start
failing with a Tika version that might decide to qualify the type with
"text/plain; charset=UTF-8" or to return a more detailed media type
like "text/x-java-source".

BR,

Jukka Zitting

Re: Charset detection

Posted by Public Network Services <pu...@gmail.com>.
If it does not add much to processing, then it could be run earlier, for
consistency purposes.

Having said that, I am not sure about the usefulness of appending the
charset at the end of the detected MIME type string in the first place. It
is correct from a syntax point of view, but it adds one more level of string
processing to extract it (as opposed to just getting it from the metadata).
Are we sure, for instance, that older code (checking for equality to
"text/plain") will not be broken?

Of course the decision has already been made and you guys know very well
what you are doing, but it still puzzles me. :-)
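
For what it's worth, the extra string processing can be kept down to a couple
of lines by going through MediaType (untested):

import org.apache.tika.mime.MediaType;

MediaType type = MediaType.parse("text/plain; charset=UTF-8");
String baseType = type.getBaseType().toString();        // "text/plain"
String charset = type.getParameters().get("charset");   // "UTF-8"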


On Wed, Jul 25, 2012 at 10:55 AM, Jukka Zitting <ju...@gmail.com> wrote:

> Hi,
>
> On Wed, Jul 25, 2012 at 1:05 AM, Public Network Services
> <pu...@gmail.com> wrote:
> > Should that be the case?
>
> Yes. So far the extra charset detection code is only being run when
> you actually parse a document, so the charset parameter gets added at
> that point, not yet at type detection. Perhaps we should already run
> charset detection at type detection time?
>
> BR,
>
> Jukka Zitting
>

Re: Charset detection

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Jul 25, 2012 at 1:05 AM, Public Network Services
<pu...@gmail.com> wrote:
> Should that be the case?

Yes. So far the extra charset detection code is only being run when
you actually parse a document, so the charset parameter gets added at
that point, not yet at type detection. Perhaps we should already run
charset detection at type detection time?
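
For example, something like the following untested sketch should already show
the charset after a parse (the file name is just a placeholder):

import java.io.File;
import java.io.InputStream;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

Metadata metadata = new Metadata();
InputStream is = TikaInputStream.get(new File("rfc2616.txt"));
try {
    // the charset parameter is added to Content-Type during parsing
    new AutoDetectParser().parse(is, new BodyContentHandler(-1), metadata);
} finally {
    is.close();
}
System.out.println(metadata.get(Metadata.CONTENT_TYPE)); // e.g. "text/plain; charset=UTF-8"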

BR,

Jukka Zitting