Posted to dev@lucene.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2010/11/18 21:41:13 UTC

[jira] Created: (SOLR-2244) Add Language Identification support

Add Language Identification support
-----------------------------------

                 Key: SOLR-2244
                 URL: https://issues.apache.org/jira/browse/SOLR-2244
             Project: Solr
          Issue Type: New Feature
            Reporter: Grant Ingersoll


For starters, Tika has language identification capabilities that we can likely leverage. More broadly, we should make it easier for people to plug language identification into the indexing process.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2244) Add Language Identification support

Posted by "Tommaso Teofili (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935356#action_12935356 ] 

Tommaso Teofili commented on SOLR-2244:
---------------------------------------

Thanks for the notification, Jan. My patch is very simple and straightforward, so feel free to integrate it with yours or modify it.


[jira] Commented: (SOLR-2244) Add Language Identification support

Posted by "Tommaso Teofili (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934813#action_12934813 ] 

Tommaso Teofili commented on SOLR-2244:
---------------------------------------

bq. I'm going to suggest that we rename contrib/extraction to be contrib/tika and that we just roll all of these things under one area, that way we don't have to muck with libraries, etc.

Nice suggestion.

bq. Heck, it might even make sense at this point to just move it into core.

+1


[jira] Commented: (SOLR-2244) Add Language Identification support

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934811#action_12934811 ] 

Grant Ingersoll commented on SOLR-2244:
---------------------------------------

I'm going to suggest that we rename contrib/extraction to be contrib/tika and that we just roll all of these things under one area, that way we don't have to muck with libraries, etc.

Heck, it might even make sense at this point to just move it into core.


[jira] Updated: (SOLR-2244) Add Language Identification support

Posted by "Tommaso Teofili (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tommaso Teofili updated SOLR-2244:
----------------------------------

    Attachment: solr2244.patch

I've made a patch that uses the Tika 0.8 language identification feature inside an UpdateRequestProcessor.
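
For readers following the thread, here is a rough, self-contained sketch of the general idea: a processing step that inspects a text field of a document at index time and writes a detected language code to another field. It is not the attached patch, and it uses neither Solr's UpdateRequestProcessor API nor Tika's detector. The class name, the field names, the tiny sample "profiles", and the trigram/cosine scoring are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: every name here is made up for illustration.
// It mirrors neither Solr's UpdateRequestProcessor API nor Tika's detector.
final class LangIdSketch {

    // Per-language trigram counts built from sample text (stand-in for real profiles).
    private final Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();

    // Lower-cases the text, collapses non-letters to single spaces, counts character trigrams.
    static Map<String, Integer> trigrams(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two trigram count maps; 0 when either map is empty.
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Registers a language by building its trigram profile from sample text.
    void addProfile(String lang, String sample) {
        profiles.put(lang, trigrams(sample));
    }

    // Returns the best-scoring language code, or "unknown" if nothing overlaps.
    String detect(String text) {
        Map<String, Integer> grams = trigrams(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> p : profiles.entrySet()) {
            double score = similarity(grams, p.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = p.getKey();
            }
        }
        return best;
    }

    // The "processor" step: reads inputField and, if present, writes the language to langField.
    void process(Map<String, Object> doc, String inputField, String langField) {
        Object value = doc.get(inputField);
        if (value != null) {
            doc.put(langField, detect(value.toString()));
        }
    }

    public static void main(String[] args) {
        LangIdSketch langId = new LangIdSketch();
        langId.addProfile("en", "the quick brown fox jumps over the lazy dog and this is the language of the text");
        langId.addProfile("it", "la volpe veloce salta sopra il cane pigro e questa lingua del testo si chiama italiano");

        Map<String, Object> doc = new HashMap<>();
        doc.put("text", "this is the text");
        langId.process(doc, "text", "language");
        System.out.println(doc);
    }
}
```

A real integration would swap the toy `detect` method for an actual detector and run the step before the document reaches the index, so that later stages can route or analyze fields by language.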


[jira] Commented: (SOLR-2244) Add Language Identification support

Posted by "Tommaso Teofili (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933569#action_12933569 ] 

Tommaso Teofili commented on SOLR-2244:
---------------------------------------

Cool, this would be a nice feature :)


[jira] Commented: (SOLR-2244) Add Language Identification support

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934655#action_12934655 ] 

Grant Ingersoll commented on SOLR-2244:
---------------------------------------

Cool, I will check it out.


[jira] Assigned: (SOLR-2244) Add Language Identification support

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned SOLR-2244:
-------------------------------------

    Assignee: Grant Ingersoll


[jira] Commented: (SOLR-2244) Add Language Identification support

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966682#action_12966682 ] 

Grant Ingersoll commented on SOLR-2244:
---------------------------------------

I'm going to move forward with this patch, since I don't see one for SOLR-1979.  

I'm going to keep it in contrib/langid, but have it use the Tika libs from contrib/extraction, so that we won't have to package them twice. I don't really like changing contrib/extraction to be contrib/tika, since then it is not clear what the functionality is, and we may also have other language identification tools in the future.


[jira] Commented: (SOLR-2244) Add Language Identification support

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934816#action_12934816 ] 

Robert Muir commented on SOLR-2244:
-----------------------------------

bq. Heck, it might even make sense at this point to just move it into core.

Not an option until SOLR-2088 is fixed. Solr "core" should work on Turkish computers, too.
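
The thread does not say what SOLR-2088 covers, so the following is only an illustration of the general kind of problem a Turkish default locale can cause in Java code, assuming the issue is of that sort. In Turkish, upper-case "I" lower-cases to a dotless "ı" (U+0131), so a case conversion that relies on the default locale can stop matching ASCII keys. The class and method names below are hypothetical.

```java
import java.util.Locale;

// Hypothetical illustration of locale-sensitive case conversion; not taken from Solr or Tika.
final class TurkishLocaleSketch {

    // Lower-cases with an explicit locale so the result does not depend on the machine's default.
    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        String key = "TITLE";

        // Locale.ROOT gives the locale-neutral mapping: "title".
        System.out.println(lower(key, Locale.ROOT).equals("title"));

        // Under the Turkish locale, 'I' maps to dotless i, so the comparison fails.
        System.out.println(lower(key, turkish).equals("title"));
        System.out.println(lower(key, turkish).equals("t\u0131tle"));
    }
}
```

Code that calls `toLowerCase()` or `toUpperCase()` without a locale argument picks up the default locale, which is why such bugs tend to surface only on machines configured for Turkish.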



[jira] Resolved: (SOLR-2244) Add Language Identification support

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved SOLR-2244.
-----------------------------------

    Resolution: Won't Fix

Actually, I'm going to switch back to SOLR-1979, as it is a superset of this patch.  I should have a patch up shortly.


[jira] Commented: (SOLR-2244) Add Language Identification support

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966734#action_12966734 ] 

Jan Høydahl commented on SOLR-2244:
-----------------------------------

Added my patch to SOLR-1979. The differences from this patch are that it is based on contrib/extraction, is configured in-line instead of through its own config file, and has a fallback configuration.
