Posted to user@tika.apache.org by Benson Margulies <bi...@gmail.com> on 2012/09/02 14:01:35 UTC

Failing to detect SJIS

I have some very simple code to call Tika:

        Parser parser = new AutoDetectParser();
        ContentHandler contentHandler = new BodyContentHandler(writer);
        ParseContext parseContext = new ParseContext();
        Metadata metadata = new Metadata();
        parser.parse(input, contentHandler, metadata, parseContext);

It has been working fine on many inputs, but I get no text in the
content handler when I feed it a file in the Shift-JIS encoding.

The metadata comes back with a content type of application/octet-stream.

I thought I'd better write here before opening a JIRA, in case I'm
missing something trivial.

Re: Failing to detect SJIS

Posted by Benson Margulies <bi...@gmail.com>.
On Mon, Sep 3, 2012 at 12:38 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Mon, Sep 3, 2012 at 5:33 PM, Benson Margulies <bi...@gmail.com> wrote:
>> Apropos of nothing, I'd offer some patches to the front page and maybe
>> even some more doc, but I'm a little confused about how you are using
>> the site plugin, particularly for the front page. For example, no link
>> points to the SCM page.

I have no special fondness for the auto-generated SCM page, but I
think you should have a page that says where Tika is in svn.

>
> The site is located and built separately from the main source
> (http://svn.apache.org/repos/asf/tika/site/) so we can better manage
> multiple versions of documentation and other things that aren't tied
> to Tika's release cycle. As a result most of the default reports
> generated by Maven aren't too useful (some are even misleading), which
> is why we're not including links to them in the site template. If
> there are individual reports (like the SCM page) that do make sense,
> then it would be a good idea to selectively add that to the template.
>
> BR,
>
> Jukka Zitting

Re: Failing to detect SJIS

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Mon, Sep 3, 2012 at 5:33 PM, Benson Margulies <bi...@gmail.com> wrote:
> Apropos of nothing, I'd offer some patches to the front page and maybe
> even some more doc, but I'm a little confused about how you are using
> the site plugin, particularly for the front page. For example, no link
> points to the SCM page.

The site is located and built separately from the main source
(http://svn.apache.org/repos/asf/tika/site/) so we can better manage
multiple versions of documentation and other things that aren't tied
to Tika's release cycle. As a result most of the default reports
generated by Maven aren't too useful (some are even misleading), which
is why we're not including links to them in the site template. If
there are individual reports (like the SCM page) that do make sense,
then it would be a good idea to selectively add that to the template.

BR,

Jukka Zitting

Re: Failing to detect SJIS

Posted by Benson Margulies <bi...@gmail.com>.
On Mon, Sep 3, 2012 at 10:48 AM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Sun, Sep 2, 2012 at 2:01 PM, Benson Margulies <bi...@gmail.com> wrote:
>> It has been working fine on many inputs, but I get no text in the
>> content handler when I feed it a file in the Shift-JIS encoding.
>
> The text detector in Tika doesn't have a reliable way to detect
> Shift-JIS, which is why you're seeing the default
> application/octet-stream type. AFAIK there is no good way to reliably
> detect Shift-JIS by looking only at the incoming byte stream.
>
> If you already know that you're dealing with text, you can give Tika a
> media type hint of "text/plain" or even "text/plain;
> charset=Shift_JIS" as input metadata along with the document to be
> parsed. That should help Tika determine how to parse the document.

thanks, that did it.

Apropos of nothing, I'd offer some patches to the front page and maybe
even some more doc, but I'm a little confused about how you are using
the site plugin, particularly for the front page. For example, no link
points to the SCM page.


>
> For example, using the Shift-JIS file from
> https://issues.alfresco.com/jira/browse/ALF-15233 we get the
> following:
>
> $ java -jar tika-app.jar --detect < shiftjs.txt # look only at the byte stream
> application/octet-stream
>
> $ java -jar tika-app.jar --detect shiftjs.txt # Give the file name with .txt ending as a type hint
> text/plain
>
> $ java -jar tika-app.jar --text shiftjs.txt # Check that the encoding is correctly detected
> 電子商取引(エレクトロニックコマース)、オンライン [...]
>
> Yes!
>
> BR,
>
> Jukka Zitting

Re: Failing to detect SJIS

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Sun, Sep 2, 2012 at 2:01 PM, Benson Margulies <bi...@gmail.com> wrote:
> It has been working fine on many inputs, but I get no text in the
> content handler when I feed it a file in the Shift-JIS encoding.

The text detector in Tika doesn't have a reliable way to detect
Shift-JIS, which is why you're seeing the default
application/octet-stream type. AFAIK there is no good way to reliably
detect Shift-JIS by looking only at the incoming byte stream.
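
As a rough illustration of why the bytes alone are ambiguous, here is a
JDK-only sketch. It is not how Tika's detector works; the class name
ShiftJisAmbiguity and the helper decodesCleanly are hypothetical names made
up for this example. It encodes a short Japanese string as Shift_JIS and
checks which charsets accept the resulting bytes without errors. It assumes
the JDK in use ships the Shift_JIS charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ShiftJisAmbiguity {

    // Hypothetical helper: true if the bytes decode under the given charset
    // with no malformed or unmappable sequences reported.
    public static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: this JDK provides the Shift_JIS charset.
        Charset sjis = Charset.forName("Shift_JIS");

        // Five kanji, written as Unicode escapes so the source file encoding
        // does not matter.
        byte[] bytes = "\u96fb\u5b50\u5546\u53d6\u5f15".getBytes(sjis);

        // The same byte sequence is checked against three candidate charsets.
        System.out.println("Shift_JIS:  " + decodesCleanly(bytes, sjis));
        System.out.println("ISO-8859-1: " + decodesCleanly(bytes, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + decodesCleanly(bytes, StandardCharsets.UTF_8));
    }
}
```

The bytes are valid as Shift_JIS and equally valid as ISO-8859-1, since
every byte value maps to some ISO-8859-1 character. Only UTF-8 rejects them.
A detector therefore has to fall back on statistics or outside hints rather
than on validity alone.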

If you already know that you're dealing with text, you can give Tika a
media type hint of "text/plain" or even "text/plain;
charset=Shift_JIS" as input metadata along with the document to be
parsed. That should help Tika determine how to parse the document.

For example, using the Shift-JIS file from
https://issues.alfresco.com/jira/browse/ALF-15233 we get the
following:

$ java -jar tika-app.jar --detect < shiftjs.txt # look only at the byte stream
application/octet-stream

$ java -jar tika-app.jar --detect shiftjs.txt # Give the file name with .txt ending as a type hint
text/plain

$ java -jar tika-app.jar --text shiftjs.txt # Check that the encoding is correctly detected
電子商取引(エレクトロニックコマース)、オンライン [...]

Yes!

BR,

Jukka Zitting