Posted to dev@tika.apache.org by "Grigoriy Alekseev (JIRA)" <ji...@apache.org> on 2018/02/27 15:35:00 UTC

[jira] [Updated] (TIKA-2590) ExcelExtractor: cannot choose listening to the selected records only

     [ https://issues.apache.org/jira/browse/TIKA-2590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grigoriy Alekseev updated TIKA-2590:
------------------------------------
    Description: 
The listenForAllRecords argument is always reset to 'true', so the 'else' branch is never reached. This may cause incorrect text extraction when a file contains records of certain unsupported types (e.g. SharedFormula).
{code:java}
        public void processFile(DirectoryNode root, boolean listenForAllRecords)
                throws IOException, SAXException, TikaException {

            // Set up listener and register the records we want to process
            HSSFRequest hssfRequest = new HSSFRequest();
            listenForAllRecords = true;
            if (listenForAllRecords) {
                hssfRequest.addListenerForAllRecords(formatListener);
            } else {
                hssfRequest.addListener(formatListener, BOFRecord.sid);
                hssfRequest.addListener(formatListener, EOFRecord.sid);
                hssfRequest.addListener(formatListener, DateWindow1904Record.sid);
                hssfRequest.addListener(formatListener, CountryRecord.sid);
                hssfRequest.addListener(formatListener, BoundSheetRecord.sid);
                hssfRequest.addListener(formatListener, SSTRecord.sid);
                hssfRequest.addListener(formatListener, FormulaRecord.sid);
                hssfRequest.addListener(formatListener, LabelRecord.sid);
                hssfRequest.addListener(formatListener, LabelSSTRecord.sid);
                hssfRequest.addListener(formatListener, NumberRecord.sid);
                hssfRequest.addListener(formatListener, RKRecord.sid);
                hssfRequest.addListener(formatListener, StringRecord.sid);
                hssfRequest.addListener(formatListener, HyperlinkRecord.sid);
                hssfRequest.addListener(formatListener, TextObjectRecord.sid);
                hssfRequest.addListener(formatListener, SeriesTextRecord.sid);
                hssfRequest.addListener(formatListener, FormatRecord.sid);
                hssfRequest.addListener(formatListener, ExtendedFormatRecord.sid);
                hssfRequest.addListener(formatListener, DrawingGroupRecord.sid);
                if (extractor.officeParserConfig.getIncludeHeadersAndFooters()) {
                    hssfRequest.addListener(formatListener, HeaderRecord.sid);
                    hssfRequest.addListener(formatListener, FooterRecord.sid);
                }
            }
            // ... remainder of processFile omitted
        }
{code}
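The effect of the reassignment can be reproduced without POI. The following is a minimal sketch, not the actual Tika code and not necessarily what the pull request does: FakeRequest is a hypothetical stand-in for HSSFRequest that only records what was registered, and BOF_SID / EOF_SID are placeholder values rather than the real record sids. It assumes the intended fix is simply to drop the reassignment.

```java
import java.util.ArrayList;
import java.util.List;

public class ListenerSelectionSketch {

    /** Hypothetical stand-in for HSSFRequest; it only remembers what was registered. */
    static class FakeRequest {
        boolean allRecords = false;
        final List<Short> sids = new ArrayList<>();

        void addListenerForAllRecords() { allRecords = true; }
        void addListener(short sid) { sids.add(sid); }
    }

    // Placeholder ids for illustration only, not the real POI sid constants.
    static final short BOF_SID = 1;
    static final short EOF_SID = 2;

    /** Mirrors the reported code: the parameter is overwritten, so the else branch is dead. */
    static FakeRequest registerBuggy(boolean listenForAllRecords) {
        FakeRequest request = new FakeRequest();
        listenForAllRecords = true; // the problematic reassignment
        if (listenForAllRecords) {
            request.addListenerForAllRecords();
        } else {
            request.addListener(BOF_SID);
            request.addListener(EOF_SID);
        }
        return request;
    }

    /** Same logic without the reassignment; 'final' turns any reassignment into a compile error. */
    static FakeRequest registerFixed(final boolean listenForAllRecords) {
        FakeRequest request = new FakeRequest();
        if (listenForAllRecords) {
            request.addListenerForAllRecords();
        } else {
            request.addListener(BOF_SID);
            request.addListener(EOF_SID);
        }
        return request;
    }

    public static void main(String[] args) {
        System.out.println(registerBuggy(false).allRecords); // true: the caller's choice is ignored
        System.out.println(registerFixed(false).allRecords); // false
        System.out.println(registerFixed(false).sids);       // [1, 2]
    }
}
```

Declaring the parameter final is optional, but it lets the compiler reject this kind of accidental override.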

  was:
The listenForAllRecords argument is always reset to 'true', so the 'else' branch is never reached.

{code:java}
        public void processFile(DirectoryNode root, boolean listenForAllRecords)
                throws IOException, SAXException, TikaException {

            // Set up listener and register the records we want to process
            HSSFRequest hssfRequest = new HSSFRequest();
            listenForAllRecords = true;
            if (listenForAllRecords) {
                hssfRequest.addListenerForAllRecords(formatListener);
            } else {
                hssfRequest.addListener(formatListener, BOFRecord.sid);
                hssfRequest.addListener(formatListener, EOFRecord.sid);
                hssfRequest.addListener(formatListener, DateWindow1904Record.sid);
                hssfRequest.addListener(formatListener, CountryRecord.sid);
                hssfRequest.addListener(formatListener, BoundSheetRecord.sid);
                hssfRequest.addListener(formatListener, SSTRecord.sid);
                hssfRequest.addListener(formatListener, FormulaRecord.sid);
                hssfRequest.addListener(formatListener, LabelRecord.sid);
                hssfRequest.addListener(formatListener, LabelSSTRecord.sid);
                hssfRequest.addListener(formatListener, NumberRecord.sid);
                hssfRequest.addListener(formatListener, RKRecord.sid);
                hssfRequest.addListener(formatListener, StringRecord.sid);
                hssfRequest.addListener(formatListener, HyperlinkRecord.sid);
                hssfRequest.addListener(formatListener, TextObjectRecord.sid);
                hssfRequest.addListener(formatListener, SeriesTextRecord.sid);
                hssfRequest.addListener(formatListener, FormatRecord.sid);
                hssfRequest.addListener(formatListener, ExtendedFormatRecord.sid);
                hssfRequest.addListener(formatListener, DrawingGroupRecord.sid);
                if (extractor.officeParserConfig.getIncludeHeadersAndFooters()) {
                    hssfRequest.addListener(formatListener, HeaderRecord.sid);
                    hssfRequest.addListener(formatListener, FooterRecord.sid);
                }
}
{code}

I will make a pull request with the fix on GitHub.


> ExcelExtractor: cannot choose listening to the selected records only
> --------------------------------------------------------------------
>
>                 Key: TIKA-2590
>                 URL: https://issues.apache.org/jira/browse/TIKA-2590
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>            Reporter: Grigoriy Alekseev
>            Priority: Critical
>             Fix For: 2.0.0


