Posted to dev@tika.apache.org by "Takahiro Ochi (JIRA)" <ji...@apache.org> on 2017/08/22 01:40:02 UTC

[jira] [Commented] (TIKA-2440) Phonetic strings handling for multilingual environments.

    [ https://issues.apache.org/jira/browse/TIKA-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136136#comment-16136136 ] 

Takahiro Ochi commented on TIKA-2440:
-------------------------------------

Hi there,

Does anyone have any ideas on this issue?

Apache POI 3.16 and later provide the includePhoneticRuns argument I mentioned. Whether Apache Tika can set the argument therefore depends on the version of Apache POI it is built against.

The Apache POI ticket for phonetic string handling is as follows:
- BugID: [51519|https://bz.apache.org/bugzilla/show_bug.cgi?id=51519]
- Module: XSSF
- Description: Allow user to select or ignore phonetic strings in shared strings table
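
To illustrate the effect of the flag, here is a minimal stand-alone sketch using only the JDK. It is not Apache POI code: the class name PhoneticRunsSketch, the method extract, and the simplified sample item (attributes omitted) are all hypothetical. It assumes only that a shared string item (si) holds its base text in t elements and its phonetic reading in t elements nested inside rPh, and it shows how a boolean comparable to includePhoneticRuns decides whether the reading ends up in the extracted text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-alone sketch; this is NOT Apache POI code.
class PhoneticRunsSketch {

    // Simplified shared string item: base text ("Tokyo" in kanji) followed by
    // one phonetic run (its katakana reading). Written with Unicode escapes so
    // the source compiles regardless of the platform encoding.
    static final String SAMPLE_SI =
        "<si><t>\u6771\u4EAC</t>"
        + "<rPh><t>\u30C8\u30A6\u30AD\u30E7\u30A6</t></rPh></si>";

    // Collects the text of every <t> element; text inside <rPh> is kept only
    // when includePhoneticRuns is true.
    static String extract(String siXml, boolean includePhoneticRuns) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText = false;      // currently inside <t>
            private boolean inPhonetic = false;  // currently inside <rPh>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("rPh")) {
                    inPhonetic = true;
                } else if (qName.equals("t")) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("rPh")) {
                    inPhonetic = false;
                } else if (qName.equals("t")) {
                    inText = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText && (includePhoneticRuns || !inPhonetic)) {
                    out.append(ch, start, length);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(siXml.getBytes(StandardCharsets.UTF_8)), handler);
        return out.toString();
    }
}
```

With the flag set to true, the sketch returns the base text with the reading appended to it, which is the noisy form described in the issue. With false, it returns only the base text.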

Thanks,
Takahiro

> Phonetic strings handling for multilingual environments.
> --------------------------------------------------------
>
>                 Key: TIKA-2440
>                 URL: https://issues.apache.org/jira/browse/TIKA-2440
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Takahiro Ochi
>            Priority: Minor
>
> Hi there,
> I would like to propose an idea to improve phonetic strings handling for multilingual environments. I believe Tika should not concatenate phonetic strings because text with phonetic strings is recognized as noisy text in most situations of natural language processing.
> Excel files include phonetic strings in some languages such as Japanese, Chinese and so on. Apache POI concatenates phonetic strings onto the shared strings when Tika extracts text from Excel files.
> Recent versions of Apache POI have a switch flag for phonetic string concatenation, as follows:
> https://poi.apache.org/apidocs/org/apache/poi/xssf/eventusermodel/ReadOnlySharedStringsTable.html#ReadOnlySharedStringsTable(org.apache.poi.openxml4j.opc.OPCPackage,%20boolean)
> Tika should set the 2nd argument "includePhoneticRuns" to false. Here is a simple patch for my idea.
> {code:java}
> $ diff -ru XSSFExcelExtractorDecorator.java ./tika/tika-1.15/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
> --- XSSFExcelExtractorDecorator.java    2017-06-10 19:13:33.355412625 +0900
> +++ ./tika/tika-1.15/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java 2017-06-10 19:14:30.452411830 +0900
> @@ -130,7 +130,7 @@
>              styles = xssfReader.getStylesTable();
>              iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
> -            strings = new ReadOnlySharedStringsTable(container);
> +            strings = new ReadOnlySharedStringsTable(container,false);
>          } catch (InvalidFormatException e) {
>              throw new XmlException(e);
>          } catch (OpenXML4JException oe) {
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)