Posted to dev@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2019/06/18 20:26:10 UTC
Detection of plain text files
Hi devs,
I’m trying to remember the history of how Tika’s current mime-type detection has evolved, regarding handling of plain text files.
Currently if I run a Shift-JIS encoded file through Tika (suffix is “.env”) it gets returned as application/octet-stream.
I thought that previously we had something which would check whether the only bytes in the 0x00-0x1F range were tab/LF/CR (so no other control chars), and whether there was a reasonable number of line ending chars. If so, we’d return text/plain instead of application/octet-stream.
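A rough sketch of the kind of check described above, for illustration only. This is not Tika code: the class name PlainTextHeuristic, the method looksLikeText, and the "one line ending per 1000 bytes" threshold are all assumptions, since the message does not say what counts as a reasonable number of line endings.

```java
final class PlainTextHeuristic {
    // Assumed threshold (not from the thread): at least one CR or LF per 1000 bytes.
    private static final int MAX_BYTES_PER_LINE_ENDING = 1000;

    /**
     * Returns true if the only bytes below 0x20 are tab, LF or CR, and
     * line endings are reasonably frequent. Bytes >= 0x80 are allowed,
     * so multi-byte encodings such as Shift_JIS are not rejected.
     */
    static boolean looksLikeText(byte[] data) {
        if (data.length == 0) {
            return false;
        }
        int lineEndings = 0;
        for (byte b : data) {
            int v = b & 0xFF; // treat the byte as unsigned
            if (v == '\n' || v == '\r') {
                lineEndings++;
            } else if (v < 0x20 && v != '\t') {
                return false; // some other control byte: treat as binary
            }
        }
        return (long) lineEndings * MAX_BYTES_PER_LINE_ENDING >= data.length;
    }

    public static void main(String[] args) {
        // U+65E5 U+672C U+8A9E ("Japanese language") in Shift_JIS, followed by LF.
        byte[] shiftJis = {(byte) 0x93, (byte) 0xFA, (byte) 0x96, 0x7B,
                           (byte) 0x8C, (byte) 0xEA, 0x0A};
        // NUL, SOH and STX are control bytes other than tab/LF/CR.
        byte[] binary = {0x00, 0x01, 0x02, 0x0A};
        System.out.println(looksLikeText(shiftJis));
        System.out.println(looksLikeText(binary));
    }
}
```

Because the check ignores how many bytes are ASCII, the Shift_JIS sample is accepted while the sample with NUL bytes is rejected. The same rule would also reject UTF-16 text, whose bytes include 0x00.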
Thanks,
— Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra
Re: Detection of plain text files
Posted by Tim Allison <ta...@apache.org>.
It seems reasonable to me. I think this is likely going to open a can
of worms, but we should do better. It'd be interesting to look at the
stats from tika-eval on the application/octet-stream files in our regression
corpus and see if they are actually language-y...how often is this
happening?
On Tue, Jun 25, 2019 at 12:18 PM Ken Krugler
<kk...@transpac.com> wrote:
>
> Hi Tim,
>
> Seems like what we’d want is “isText()” vs what we’ve got, which is “isAscii()”
>
> Any thoughts on switching to what I thought was the older algorithm, of (a) not many unexpected control chars, and (b) a reasonable number of line ending chars?
>
> — Ken
>
Re: Detection of plain text files
Posted by Ken Krugler <kk...@transpac.com>.
Hi Tim,
Seems like what we’d want is “isText()” vs what we’ve got, which is “isAscii()”.
Any thoughts on switching to what I thought was the older algorithm, of (a) not many unexpected control chars, and (b) a reasonable number of line ending chars?
— Ken
> On Jun 25, 2019, at 6:56 AM, Tim Allison <ta...@apache.org> wrote:
>
> Hi Ken,
> I'm sorry for my delay. I took a short chunk of Japanese and
> converted it to Shift_JIS.
>
> Your memory is largely correct (or we've changed the code base a
> bit). The TextDetector makes a decision in favor of {{text/plain}} vs
> {{application/octet-stream}} via TextStatistics
> (https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/detect/TextStatistics.java#L46)
> if the bytes are:
>
> a) mostly in the ASCII range (between 0x20 and 128) and don't have too
> many control characters, or
> b) kind of look like UTF-8
>
> In the example file I used, there were 0 control, 36 ASCII (between 0x20
> and 128) and 0 safe terms, but the total character count was 218. The
> isAscii() check requires that > 90% of the characters appear between
> 0x20 and 128...so the text detector failed.
>
> In short, this is an area for improvement. I suspect our current
> mechanism would also be pretty awful on UTF-16.
>
Re: Detection of plain text files
Posted by Tim Allison <ta...@apache.org>.
Hi Ken,
I'm sorry for my delay. I took a short chunk of Japanese and
converted it to Shift_JIS.
Your memory is largely correct (or we've changed the code base a
bit). The TextDetector makes a decision in favor of {{text/plain}} vs
{{application/octet-stream}} via TextStatistics
(https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/detect/TextStatistics.java#L46)
if the bytes are:
a) mostly in the ASCII range (between 0x20 and 128) and don't have too
many control characters, or
b) kind of look like UTF-8
In the example file I used, there were 0 control, 36 ASCII (between 0x20
and 128) and 0 safe terms, but the total character count was 218. The
isAscii() check requires that > 90% of the characters appear between
0x20 and 128...so the text detector failed.
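To make the arithmetic concrete, here is a small sketch of the 90% rule as it is described in this message. It is not copied from TextStatistics: the class and method names are hypothetical, and the separate limit on control characters is left out because the message gives no number for it.

```java
final class AsciiRatioExample {
    /**
     * True if strictly more than 90% of the counted bytes fall in the
     * printable ASCII range, as the rule is described above.
     */
    static boolean passesAsciiThreshold(int asciiCount, int totalCount) {
        // Integer form of asciiCount / totalCount > 0.90, avoiding floating point.
        return totalCount > 0 && asciiCount * 100L > totalCount * 90L;
    }

    public static void main(String[] args) {
        int ascii = 36;   // bytes between 0x20 and 128 in the sample file
        int total = 218;  // all bytes counted in the sample file
        double percent = 100.0 * ascii / total;
        System.out.println("ASCII share (percent): " + percent);
        System.out.println("passes 90% rule: " + passesAsciiThreshold(ascii, total));
    }
}
```

With 36 of 218 bytes in range, the ASCII share is roughly 16.5%, far below the 90% cutoff, which matches the failure reported for the Shift_JIS sample.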
In short, this is an area for improvement. I suspect our current
mechanism would also be pretty awful on UTF-16.
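On the UTF-16 point, a quick way to see why a byte-ratio rule like the one described above would struggle is to count the bytes in a short UTF-16 sample. The sketch below is only an illustration using the standard java.nio.charset API; the class name, helper names, and sample string are made up for the example.

```java
import java.nio.charset.StandardCharsets;

final class Utf16ByteProfile {
    /** Counts bytes below 0x20 other than tab, LF and CR. */
    static int countUnsafeControl(byte[] data) {
        int n = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            if (v < 0x20 && v != '\t' && v != '\n' && v != '\r') {
                n++;
            }
        }
        return n;
    }

    /** Counts bytes in the range 0x20 (inclusive) to 0x80 (exclusive). */
    static int countAsciiRange(byte[] data) {
        int n = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            if (v >= 0x20 && v < 0x80) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Pure ASCII content, but encoded as UTF-16LE: every second byte is 0x00.
        byte[] utf16 = "plain text\n".getBytes(StandardCharsets.UTF_16LE);
        System.out.println("total bytes:      " + utf16.length);
        System.out.println("unsafe control:   " + countUnsafeControl(utf16));
        System.out.println("ASCII-range bytes: " + countAsciiRange(utf16));
    }
}
```

Even for text that is entirely ASCII, half of the UTF-16 bytes are NUL, so fewer than half can land in the printable ASCII range and the control-character count is high.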
On Tue, Jun 18, 2019 at 4:26 PM Ken Krugler <kk...@transpac.com> wrote:
>
> Hi devs,
>
> I’m trying to remember the history of how Tika’s current mime-type detection has evolved, regarding handling of plain text files.
>
> Currently if I run a Shift-JIS encoded file through Tika (suffix is “.env”) it gets returned as application/octet-stream.
>
> I thought that previously we had something which would check if the file only had tab/LF/CR bytes in the 0x00-0x1F range (so no other control chars besides these), and a reasonable number of line ending chars, and if so then we’d return text/plain instead of application/octet-stream
>
> Thanks,
>
> — Ken