Posted to user@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2016/04/18 16:12:43 UTC
RE: Need Help
Ha. I'm in the process of comparing mimetype detection results from DROID, Tika and 'file' on our TIKA-1302 corpus.
After that, I was going to compare our different encoding detectors on the corpus... I'll have a better answer in a few weeks.
Others on this list probably have more info, but our general encoding detector first tries to get the encoding from the HTML meta charset info, then tries the UniversalEncodingDetector and then the Icu4JDetector. It stops as soon as one of these detectors returns a non-null answer. That order was initially set in July 2012, and we haven't changed it since.
In short, this is an area for further analysis.
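As a rough illustration of the "first non-null answer wins" ordering described above, here is a minimal, self-contained Java sketch. It is not Tika's code: the class name DetectorChainSketch and the two toy detectors (a regex for an HTML meta charset and a UTF-8 BOM check) are hypothetical stand-ins for the real detectors, chosen only so the example runs with the standard library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DetectorChainSketch {

    // Hypothetical stand-in for the HTML meta step: find charset=... near the start.
    private static final Pattern META_CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_.:+-]*)",
                    Pattern.CASE_INSENSITIVE);

    static Charset fromHtmlMeta(byte[] data) {
        // ISO-8859-1 maps every byte to a char, so this decode never fails.
        String head = new String(data, 0, Math.min(data.length, 1024),
                StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            return null; // illegal or unsupported name: defer to the next detector
        }
    }

    // Hypothetical stand-in for a content-based detector: only recognizes a UTF-8 BOM.
    static Charset fromUtf8Bom(byte[] data) {
        boolean bom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return bom ? StandardCharsets.UTF_8 : null;
    }

    // The order of this list is the whole policy: earlier detectors take priority.
    static List<Function<byte[], Charset>> defaultChain() {
        return Arrays.asList(DetectorChainSketch::fromHtmlMeta,
                DetectorChainSketch::fromUtf8Bom);
    }

    // Try each detector in order and stop at the first non-null answer.
    static Charset detect(List<Function<byte[], Charset>> chain, byte[] data) {
        for (Function<byte[], Charset> detector : chain) {
            Charset charset = detector.apply(data);
            if (charset != null) {
                return charset;
            }
        }
        return null; // no detector had an opinion
    }

    public static void main(String[] args) {
        byte[] html = "<meta charset=\"ISO-8859-1\">".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(defaultChain(), html));
    }
}
```

One consequence of this design is that two detectors can disagree on the same file and only the earlier one in the chain is ever reported, which is one way to see different results when calling detectors individually.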
-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:chris.a.mattmann@jpl.nasa.gov]
Sent: Monday, April 18, 2016 9:59 AM
To: dev@tika.apache.org
Subject: Fwd: Need Help
Sent from my iPhone
Begin forwarded message:
From: harsh kumar <ku...@gmail.com>
Date: April 18, 2016 at 2:02:23 AM PDT
To: <de...@tika.apache.org>
Subject: Fwd: Need Help
Hi,
I am using Tika to detect the encoding of a file, but I found that the results are not consistent when I use CharsetDetector and UniversalEncodingDetector on the same file.
Could you please brief me on the major differences between them and their best-fit use cases?
Looking forward to your early reply.
--
Warm Regards,
Harsh Kumar
RE: Need Help
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Haven’t gotten around to this yet. Sorry.
Anyone else have any input?
From: harsh kumar [mailto:kumarharsh19@gmail.com]
Sent: Friday, May 6, 2016 8:48 AM
To: Allison, Timothy B. <ta...@mitre.org>
Subject: Re: Need Help
Hey Timothy,
Could you please share your findings on Tika with me? I would be grateful.
--Harsh
On Tue, Apr 19, 2016 at 6:51 PM, harsh kumar <ku...@gmail.com> wrote:
Hey Timothy,
Thanks for your reply.
It would be a great help if you could share your findings with me.
Could you please give me a specific email address where I can reach you about this?
---------- Forwarded message ----------
From: Allison, Timothy B. <ta...@mitre.org>
Date: Mon, Apr 18, 2016 at 7:42 PM
Subject: RE: Need Help
To: "user@tika.apache.org" <us...@tika.apache.org>
Cc: "kumarharsh19@gmail.com" <ku...@gmail.com>
--
Warm Regards,
Harsh Kumar