
Posted to solr-user@lucene.apache.org by Noriyuki TAKEI <nt...@sios.com> on 2017/04/10 17:45:32 UTC

Japanese character is garbled when using TikaEntityProcessor

Hi all,

I use TikaEntityProcessor to extract text content from binary and text
files.

However, when I extract Japanese characters from an HTML file whose
character encoding is SJIS, the content is garbled. With UTF-8, it
works well.

My Data Import Handler configuration is below.

--- from here ---
<dataConfig>
  <dataSource name="ds-db"
              type="JdbcDataSource" 
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/bbs" 
              user="root" 
              password="xxxx"/>
  <dataSource name="ds-file" type="BinFileDataSource"/>

  <document>
    <entity name="messages"
            dataSource="ds-db"
            pk="id"
            query="select id,title from messages">
      <field column="id" name="id"/>
      <field column="title" name="title"/>

      <entity name="contents"
              dataSource="ds-db"
              pk="id"
              query="select id,path from contents where id=${messages.id}">

        <entity name="file"
                dataSource="ds-file"
                processor="TikaEntityProcessor"
                url="${contents.path}"
                format="text">
          <field column="text" name="content" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
--- to here ---

How do I solve this?




--
View this message in context: http://lucene.472066.n3.nabble.com/Japanese-character-is-garbled-when-using-TikaEntityProcessor-tp4329217.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Japanese character is garbled when using TikaEntityProcessor

Posted by Noriyuki TAKEI <nt...@sios.com>.

Thanks! I appreciate your quick reply.




RE: Japanese character is garbled when using TikaEntityProcessor

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Please open an issue on Tika's JIRA and share the triggering file if possible.  If we can touch the file, we may be able to recommend alternate ways to configure Tika's encoding detectors.  We just added configurability to the encoding detectors and that will be available with Tika 1.15. [1]

We use a fallback set of detectors, tried in this order: html, universalchardet, icu4j. We go with the first one that returns a non-null answer. This is perhaps not the best option, but that's what we've been doing for a while. We are in the process of reassessing our current methods [2], but that will take some time.
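
The "first non-null answer wins" fallback described above can be sketched roughly as follows. This is only an illustration and not Tika's actual code: the class FallbackDetection and the two detectors (fromMetaDeclaration, fromUtf8Bom) are hypothetical stand-ins written with the Java standard library, chosen only to show how the order of the chain decides the result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Illustrative only: these detectors are hypothetical stand-ins, not Tika classes.
final class FallbackDetection {

    // Hypothetical detector 1: read a charset=... declaration near the start of the data.
    static Charset fromMetaDeclaration(byte[] data) {
        int len = Math.min(data.length, 1024);
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives this decode.
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        int at = head.indexOf("charset=");
        if (at < 0) {
            return null; // no declaration: let the next detector try
        }
        int start = at + "charset=".length();
        if (start < head.length()
                && (head.charAt(start) == '"' || head.charAt(start) == '\'')) {
            start++; // skip an opening quote
        }
        int end = start;
        while (end < head.length()
                && (Character.isLetterOrDigit(head.charAt(end))
                    || "-_.:".indexOf(head.charAt(end)) >= 0)) {
            end++;
        }
        try {
            return Charset.forName(head.substring(start, end));
        } catch (IllegalArgumentException e) {
            return null; // empty, illegal, or unsupported charset name
        }
    }

    // Hypothetical detector 2: recognise a UTF-8 byte order mark.
    static Charset fromUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && data[0] == (byte) 0xEF
                && data[1] == (byte) 0xBB
                && data[2] == (byte) 0xBF;
        return hasBom ? StandardCharsets.UTF_8 : null;
    }

    // The fallback chain: order matters, because the first non-null answer wins.
    static final List<Function<byte[], Charset>> DETECTORS =
            Arrays.asList(FallbackDetection::fromMetaDeclaration,
                          FallbackDetection::fromUtf8Bom);

    // Returns the first non-null guess, or null if no detector has an answer.
    static Charset detect(byte[] data) {
        for (Function<byte[], Charset> detector : DETECTORS) {
            Charset guess = detector.apply(data);
            if (guess != null) {
                return guess;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] html = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=Shift_JIS\">"
                .getBytes(StandardCharsets.US_ASCII);
        // Prints the declared charset if this JVM supports it.
        System.out.println(detect(html));
    }
}
```

One consequence of this design is that an early detector that answers incorrectly is never overruled by a later one, so a missing or misleading declaration in a file can lead to garbled output even when a later detector would have guessed correctly.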

[1] https://issues.apache.org/jira/browse/TIKA-2273
[2] https://issues.apache.org/jira/browse/TIKA-2038
