Posted to dev@jackrabbit.apache.org by "Jeremy Anderson (JIRA)" <ji...@apache.org> on 2009/10/27 13:46:59 UTC

[jira] Created: (JCR-2365) HTML Text Extractor does not extract or index numerics

HTML Text Extractor does not extract or index numerics
------------------------------------------------------

                 Key: JCR-2365
                 URL: https://issues.apache.org/jira/browse/JCR-2365
             Project: Jackrabbit Content Repository
          Issue Type: Bug
          Components: indexing, jackrabbit-text-extractors
    Affects Versions: 1.6.0
         Environment: Win XP-Pro; Win 2003 Enterprise 32bit
            Reporter: Jeremy Anderson


Numeric content such as addresses, dates, and financial figures is not extracted or indexed by the current HTML text extractor. The same values are extracted correctly and are searchable when the content goes through the PlainTextExtractor.
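
One way to check for this symptom is to compare the digit runs in a page's visible text with the digit runs that survive extraction. The sketch below is only an illustration and does not use the Jackrabbit API: the class name NumericExtractionCheck, the stripTags method (a naive regex stand-in for an HTML extractor), and the sample invoice text are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericExtractionCheck {

    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Hypothetical stand-in for an HTML text extractor: replaces tags with
    // spaces and collapses whitespace. Not the Jackrabbit implementation.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Collects every run of consecutive digits found in the text.
    static List<String> digitRuns(String text) {
        List<String> runs = new ArrayList<>();
        Matcher m = DIGITS.matcher(text);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    // Returns the digit runs present in the visible text but absent from
    // the extracted text. An empty list means no numerics were lost.
    static List<String> missingNumerics(String visibleText, String extracted) {
        List<String> missing = new ArrayList<>(digitRuns(visibleText));
        missing.removeAll(digitRuns(extracted));
        return missing;
    }

    public static void main(String[] args) {
        // Sample content is made up for illustration.
        String visible = "Invoice dated 2009-10-27, total 1250.00, ship to 42 Main Street";
        String html = "<html><body><p>" + visible + "</p></body></html>";
        String extracted = stripTags(html);
        System.out.println("Extracted: " + extracted);
        System.out.println("Missing numerics: " + missingNumerics(visible, extracted));
    }
}
```

To run the same comparison against a real extractor, replace stripTags with a call that returns the extractor's output as a String. A non-empty result would match the behaviour reported here.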

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (JCR-2365) HTML Text Extractor does not extract or index numerics

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/JCR-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771340#action_12771340 ] 

Marcel Reutegger commented on JCR-2365:
---------------------------------------

Answering some follow-up questions that I got from Jeremy by email:

> Is my understanding correct in that once upgrading to 1.6.1, the current Text-extractors module will become obsolete?

No, 1.6.1 will be just a bug fix release without changes in module dependencies. It will contain a fix for the HTML text extractor.

> If so will any changes be required to the workspace.xml for the textFilterClasses parameter to enable the use of the Apache Tika
> extractors?

The Apache Tika-based text extractor will only be available in the upcoming 2.0 release, not in 1.6.x.

> Is it possible to enable this for JCR 1.6.0 so that HTML files have their numerics extracted and indexed?

It's probably easier to patch the 1.6.0 release, build jackrabbit-text-extractors from the 1.6 branch, or wait for the 1.6.1 release.

> HTML Text Extractor does not extract or index numerics
> ------------------------------------------------------
>
>                 Key: JCR-2365
>                 URL: https://issues.apache.org/jira/browse/JCR-2365
>             Project: Jackrabbit Content Repository
>          Issue Type: Bug
>          Components: indexing, jackrabbit-text-extractors
>    Affects Versions: 1.6.0
>         Environment: Win XP-Pro; Win 2003 Enterprise 32bit
>            Reporter: Jeremy Anderson
>             Fix For: 1.6.1, 2.0.0
>
>
> Numeric content such as addresses, dates, and financial figures is not extracted or indexed by the current HTML text extractor. The same values are extracted correctly and are searchable when the content goes through the PlainTextExtractor.

[jira] Resolved: (JCR-2365) HTML Text Extractor does not extract or index numerics

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/JCR-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcel Reutegger resolved JCR-2365.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 2.0.0
                   1.6.1

This issue does not occur in trunk because we are not using the text-extractors module anymore. Text extraction is now handled by Apache Tika.

Fixed in the 1.6 branch in revision 830478.
