
Posted to dev@lucene.apache.org by "Ravish Bhagdev (Created) (JIRA)" <ji...@apache.org> on 2011/11/30 10:35:40 UTC

[jira] [Created] (SOLR-2930) Allow controlling an important PDF processing parameter in Tika that splits the words in text and is now supported in version 1.0 of Tika.

Allow controlling an important PDF processing parameter in Tika that splits the words in text and is now supported in version 1.0 of Tika.
------------------------------------------------------------------------------------------------------------------------------------------

                 Key: SOLR-2930
                 URL: https://issues.apache.org/jira/browse/SOLR-2930
             Project: Solr
          Issue Type: Improvement
          Components: contrib - Solr Cell (Tika extraction)
    Affects Versions: 3.5
            Reporter: Ravish Bhagdev


Tika 1.0 has fixed a major issue in PDF processing and parsing that caused words to be split incorrectly: https://issues.apache.org/jira/browse/TIKA-724

The issue causes text to be indexed incorrectly in Solr, and it becomes especially visible when using features such as spellcheck.

Tika added a parameter, set via setEnableAutoSpace, that fixes the problem, but there is currently no way to set it when using Solr. As discussed in the thread on the above issue, it would be nice if we could control this parameter (and, in future, others) via Solr configuration.
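
As a rough illustration of what controlling this parameter (and, in future, others) from configuration could look like, here is a minimal Java sketch. It is not Solr or Tika code: ParserOptionBinder and StandInPdfParser are hypothetical names invented for this example, and the stand-in only mirrors the setEnableAutoSpace setter mentioned above so that the sketch runs without Tika on the classpath. The idea is to map each configured option name to a setXxx(boolean) call by reflection, so that new boolean parser options would need no extra glue code.

```java
import java.lang.reflect.Method;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Solr or Tika): copies boolean options from a
// configuration map onto a parser object by calling the matching setter.
class ParserOptionBinder {

    // Applies each entry, e.g. "enableAutoSpace" -> setEnableAutoSpace(boolean).
    // Returns the number of options applied; unknown option names are rejected.
    static int apply(Object parser, Map<String, String> options) {
        int applied = 0;
        for (Map.Entry<String, String> e : options.entrySet()) {
            String name = e.getKey();
            String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
            try {
                Method m = parser.getClass().getMethod(setter, boolean.class);
                m.invoke(parser, Boolean.parseBoolean(e.getValue()));
                applied++;
            } catch (ReflectiveOperationException ex) {
                throw new IllegalArgumentException("Unsupported parser option: " + name, ex);
            }
        }
        return applied;
    }

    public static void main(String[] args) {
        // Assumption: these options would come from Solr configuration.
        Map<String, String> options = new LinkedHashMap<>();
        options.put("enableAutoSpace", "false");

        StandInPdfParser parser = new StandInPdfParser();
        apply(parser, options);
        System.out.println("enableAutoSpace = " + parser.getEnableAutoSpace());
        // prints "enableAutoSpace = false"
    }
}

// Stand-in for a PDF parser exposing setEnableAutoSpace, so the sketch compiles
// without Tika. The initial value is arbitrary, not a claim about Tika's default.
class StandInPdfParser {
    private boolean enableAutoSpace = true;

    public void setEnableAutoSpace(boolean value) {
        this.enableAutoSpace = value;
    }

    public boolean getEnableAutoSpace() {
        return enableAutoSpace;
    }
}
```

In this sketch an option name with no matching setter is rejected with an IllegalArgumentException rather than silently ignored, so a typo in configuration would surface early.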

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (SOLR-2930) Allow controlling an important PDF processing parameter in Tika that splits the words in text and is now supported in version 1.0 of Tika.

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160036#comment-13160036 ] 

Robert Muir commented on SOLR-2930:
-----------------------------------

My bad, I confused this bug with the PDFBox 'character deletion' one (TIKA-767); that fix is unfortunately still not in Tika 1.0, it seems.

                
> Allow controlling an important PDF processing parameter in Tika that splits the words in text and is now supported in version 1.0 of Tika.
> ------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-2930
>                 URL: https://issues.apache.org/jira/browse/SOLR-2930
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>    Affects Versions: 3.5
>            Reporter: Ravish Bhagdev
>              Labels: pdf, text-splitting, tika
>
> Tika 1.0 has fixed a major issue in PDF processing and parsing that caused words to be split incorrectly: https://issues.apache.org/jira/browse/TIKA-724
> The issue causes text to be indexed incorrectly in Solr, and it becomes especially visible when using features such as spellcheck.
> Tika added a parameter, set via setEnableAutoSpace, that fixes the problem, but there is currently no way to set it when using Solr. As discussed in the thread on the above issue, it would be nice if we could control this parameter (and, in future, others) via Solr configuration.


[jira] [Commented] (SOLR-2930) Allow controlling an important PDF processing parameter in Tika that splits the words in text and is now supported in version 1.0 of Tika.

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160034#comment-13160034 ] 

Robert Muir commented on SOLR-2930:
-----------------------------------

I think the most important piece is that this parameter is *off* by default.

For a search engine, if some bold content gets duplicated... there could really be worse things.

But if spaces get incorrectly added to words, that's going to mess up tokenization.
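
To make the tokenization concern concrete, here is a small Java sketch. It assumes a plain whitespace split as a simplified stand-in for a real analyzer, and the class name SpuriousSpaceDemo and the sample strings are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo: shows how spaces wrongly inserted inside words change
// the tokens that a whitespace-based tokenizer produces.
class SpuriousSpaceDemo {

    // Splits on runs of whitespace; a real analyzer does more than this.
    static List<String> tokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokens("incorrect tokenization"));
        // prints "[incorrect, tokenization]"
        System.out.println(tokens("incor rect tokeni zation"));
        // prints "[incor, rect, tokeni, zation]"
    }
}
```

With the extra spaces, neither original word appears as a token any more, so a query for either one would no longer match the extracted text.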
                
