Posted to user@tika.apache.org by Satinder Singh <hi...@gmail.com> on 2020/09/04 06:10:12 UTC

tika parser detecting "IBM500" for small files

Why does tika-parsers-1.24.jar detect the encoding "IBM500" for small files?
Example content of a small file:
"a d"

How can I fix this?

Re: tika parser detecting "IBM500" for small files

Posted by Satinder Singh <hi...@gmail.com>.
I am using tika-parsers-1.24.1.jar on Linux with OpenJDK version "1.8.0_222".

Re: tika parser detecting "IBM500" for small files

Posted by John Patrick <nh...@gmail.com>.
So on a Mac, using; a
I get; "Match of UTF-8 with confidence 15"

Using; a d
I get; "Match of IBM500 in fr with confidence 98"

Using; d
I get; "Match of UTF-8 with confidence 15"

So I don't think Oracle Linux is the issue, and if you had tested
it yourself on different operating systems you would have seen the
same results as I did.

I've no idea why it thinks "a d" is IBM500 in French with 98% confidence...

If you think it is wrong, raise a defect, but with a file of so
few characters I would expect some strange detection.

John
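
One way to see how little a detector has to go on with a three-byte file is to decode the same bytes under each candidate charset. The sketch below uses only the JDK, not Tika, and does not reproduce Tika's scoring; ShortInputDemo and decodeWith are hypothetical names, and IBM500 is skipped if the running JVM does not ship that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: ShortInputDemo and decodeWith are hypothetical names, not Tika APIs.
final class ShortInputDemo {

    // Decode the same bytes with each named charset that this JVM supports.
    static Map<String, String> decodeWith(byte[] bytes, String... charsetNames) {
        Map<String, String> decoded = new LinkedHashMap<>();
        for (String name : charsetNames) {
            if (Charset.isSupported(name)) {
                decoded.put(name, new String(bytes, Charset.forName(name)));
            }
        }
        return decoded;
    }

    public static void main(String[] args) {
        // The file content from this thread: bytes 0x61 0x20 0x64.
        byte[] bytes = "a d".getBytes(StandardCharsets.US_ASCII);
        decodeWith(bytes, "UTF-8", "ISO-8859-1", "IBM500").forEach((name, text) -> {
            StringBuilder codePoints = new StringBuilder();
            text.codePoints().forEach(cp -> codePoints.append(String.format("U+%04X ", cp)));
            System.out.println(name + " -> " + codePoints.toString().trim());
        });
    }
}
```

Every one of these charsets accepts the three bytes without error, so the bytes alone cannot rule any of them out; a statistical detector has to pick between them on very thin evidence.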

Re: tika parser detecting "IBM500" for small files

Posted by John Patrick <nh...@gmail.com>.
further inline comments...

On Fri, 11 Sep 2020 at 08:02, Satinder Singh <hi...@gmail.com> wrote:
>
> inline ...
>
> > 1) Did it work differently before? If so, which combination was the
> > working version... OS + Java + Tika
> Linux 6, tika-parsers 1.24.1, Java 8.
> It never worked for a file having "d" at the start or end of a word, e.g.
> "a d"
>
> > 2) Are other files working correctly?
> Yes
>
> > Have you tried your code on other environments?
> No. We need it working on Linux 6/7.
Do you mean Oracle Linux?
I know you need it working on Linux 6 or 7, but knowing whether your code
works elsewhere potentially helps track down any issues...
You might have a test environment on Debian where it passes and
another on CentOS where it passes, your developers might be on Ubuntu where
the code works fine, but your production is Oracle Linux and
it's failing there?
Is it a physical host, a virtual host (VMware or similar), or a
container host (OpenShift or Docker)?

>
> > Have you tried using the tika-app-1.24.1.jar as per my example?
> No. The requirement is only for encoding detection, so I am using only tika-parsers-1.24.1.
Again, it helps track down the issue: if my tika-app example works for
you but your tika-parsers code still doesn't, then that helps identify
where to look next.

>
> > Can you try adding this debug line;
> > System.out.println("file.encoding=" + System.getProperty("file.encoding"));
> I will try it.

Re: tika parser detecting "IBM500" for small files

Posted by Satinder Singh <hi...@gmail.com>.
inline ...

> 1) Did it work differently before? If so, which combination was the
> working version... OS + Java + Tika
Linux 6, tika-parsers 1.24.1, Java 8.
It never worked for a file having "d" at the start or end of a word, e.g.
"a d"

> 2) Are other files working correctly?
Yes

> Have you tried your code on other environments?
No. We need it working on Linux 6/7.

> Have you tried using the tika-app-1.24.1.jar as per my example?
No. The requirement is only for encoding detection, so I am using only tika-parsers-1.24.1.

> Can you try adding this debug line;
> System.out.println("file.encoding=" + System.getProperty("file.encoding"));
I will try it.

Re: tika parser detecting "IBM500" for small files

Posted by John Patrick <nh...@gmail.com>.
What about my other questions...
1) Did it work differently before? If so, which combination was the
working version... OS + Java + Tika
2) Are other files working correctly?

Have you tried your code on other environments?

Have you tried using the tika-app-1.24.1.jar as per my example?

Can you try adding this debug line;
System.out.println("file.encoding=" + System.getProperty("file.encoding"));

What does the debug line show?
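
A self-contained version of that check might look like the sketch below. It reads the property with System.getProperty (the single-argument reader; System.setProperty needs both a key and a value) and also prints Charset.defaultCharset() for comparison. EncodingDebug and describeDefaults are hypothetical names chosen for illustration.

```java
import java.nio.charset.Charset;

// Illustrative only: EncodingDebug and describeDefaults are hypothetical names.
final class EncodingDebug {

    // Report the file.encoding system property alongside the JVM's default charset.
    static String describeDefaults() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describeDefaults());
    }
}
```

Running this on each environment being compared shows whether the JVM defaults differ between them.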

Re: tika parser detecting "IBM500" for small files

Posted by Satinder Singh <hi...@gmail.com>.
and my code is:

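import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.mime.MimeTypes;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public static String detectEncoding(InputStream is) throws IOException
  {
    CharsetDetector detector = new CharsetDetector();
    // setText(InputStream) needs mark/reset support, which TikaInputStream provides.
    detector.setText(TikaInputStream.get(is));
    CharsetMatch detected = detector.detect();
    // The posted code stopped here; returning the charset name is an assumed completion.
    // detect() returns null when nothing matches.
    return detected == null ? null : detected.getName();
  }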
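
One possible workaround, sketched under assumptions rather than taken from Tika: for very short inputs, check whether the bytes are strictly valid UTF-8 and, if so, skip statistical detection altogether. The sketch uses only the JDK; ShortInputGuard, isValidUtf8, shortCircuit, and the 32-byte threshold are all hypothetical choices for illustration, and the policy wrongly labels short files in other ASCII-compatible encodings that happen to contain only ASCII bytes as UTF-8 (harmless for pure ASCII, since the decoded text is identical).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative only: not part of Tika; all names and the threshold are assumptions.
final class ShortInputGuard {

    // Arbitrary cut-off below which statistical detection is not trusted.
    static final int MIN_BYTES_FOR_DETECTION = 32;

    // Strict check: any malformed sequence makes the decoder throw instead of substituting.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns "UTF-8" for short, strictly valid UTF-8 input;
    // null means "fall through to the charset detector".
    static String shortCircuit(byte[] bytes) {
        if (bytes.length < MIN_BYTES_FOR_DETECTION && isValidUtf8(bytes)) {
            return "UTF-8";
        }
        return null;
    }
}
```

A caller would read the bytes, try shortCircuit first, and only hand the data to the detector when it returns null.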

Re: tika parser detecting "IBM500" for small files

Posted by John Patrick <nh...@gmail.com>.
Have you tried 1.24.1?
Was it detected as a different type on an older version?
Have you tried it on another machine?
Are other files being detected as expected?
What OS are you using and what Java version are you using?


I've just tried it with 1.24 and I'm getting ISO-8859-1. Here is my
output https://gist.github.com/nhojpatrick/c11c00ce35f5af26de51efca9f8e8b4e

I'm using Java 1.8.0_261 on a Mac.

John