Posted to user@manifoldcf.apache.org by Nikita Ahuja <ni...@smartshore.nl> on 2018/11/21 05:43:53 UTC

Language Detection for the data

Hi,

I have a query about detecting the language of the records/data that will be
ingested by the Output Connector.

According to the user documentation, the OpenNLP connector should handle this
detection, but it is not working as expected. Please suggest whether OpenNLP
has to be used (and if so, how), or whether there is another solution.

-- 
Thanks and Regards,
Nikita
Email: nikita@smartshore.nl
United Sources Service Pvt. Ltd.
a "Smartshore" Company
Mobile: +91 99 888 57720
http://www.smartshore.nl

Re: Language Detection for the data

Posted by Karl Wright <da...@gmail.com>.
Look in the manifoldcf source tree for files named
"common_en_US.properties".  For every one of these you will need to create
a similar file for your specific locale.

Thanks,
Karl
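
As a rough sketch of that step (the source-tree root path and the en_GB target locale here are assumptions; the idea is to seed each new locale file from the en_US copy and translate its values by hand afterwards):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CloneLocaleBundles {
    public static void main(String[] args) throws IOException {
        // Root of the ManifoldCF source tree (first argument, "." by default).
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(p -> p.getFileName().toString().equals("common_en_US.properties"))
                 .forEach(p -> {
                     // Seed the new locale with the en_US strings; the values
                     // still need translating by hand afterwards.
                     Path target = p.resolveSibling("common_en_GB.properties");
                     try {
                         if (!Files.exists(target)) {
                             Files.copy(p, target);
                             System.out.println("Created " + target);
                         }
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        }
    }
}
```

Run it once from the checkout root, then edit the copied files to hold the actual translations.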



Re: Language Detection for the data

Posted by Nikita Ahuja <ni...@smartshore.nl>.
Thanks Karl,

But I want to know how to add these files so that such warnings no longer
appear and the flow runs smoothly.

Is there any way to do that?

Thanks,
Nikita



Re: Language Detection for the data

Posted by Karl Wright <da...@gmail.com>.
Hi Nikita,

This is occurring because en_GB does not have a translations file.  It's a
warning and the code falls back to using en_US.

Karl
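
The lookup order behind that warning is standard java.util.ResourceBundle behavior; a small sketch showing the candidate locales the runtime tries for en_GB before falling back:

```java
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

public class FallbackOrder {
    public static void main(String[] args) {
        // ResourceBundle searches these locales in order; when only the
        // en_US (default-locale) and root bundles exist, an en_GB request
        // logs the "Missing resource bundle" warning and falls through.
        ResourceBundle.Control control =
            ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        List<Locale> candidates = control.getCandidateLocales(
            "org.apache.manifoldcf.ui.i18n.common", new Locale("en", "GB"));
        candidates.forEach(l -> System.out.println("'" + l + "'"));
        // prints 'en_GB', then 'en', then '' (the root locale)
    }
}
```

Adding a common_en_GB.properties bundle makes the first candidate resolve, which is why the warning disappears once the file exists.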



Re: Language Detection for the data

Posted by Nikita Ahuja <ni...@smartshore.nl>.
Hi Karl,

Thanks for the suggestion; the language of the data and content can now be
detected. But there is one issue while ingesting the records into the
Elasticsearch index, and the log file shows:

ERROR 2018-12-11T19:19:37,637 (qtp348148678-561) - Missing resource bundle
'org.apache.manifoldcf.ui.i18n.common' for locale 'en_GB': Can't find
bundle for base name org.apache.manifoldcf.ui.i18n.common, locale en_GB;
trying en
java.util.MissingResourceException: Can't find bundle for base name
org.apache.manifoldcf.ui.i18n.common, locale en_GB
    at
java.base/java.util.ResourceBundle.throwMissingResourceException(Unknown
Source) ~[?:?]
    at java.base/java.util.ResourceBundle.getBundleImpl(Unknown Source)
~[?:?]
    at java.base/java.util.ResourceBundle.getBundleImpl(Unknown Source)
~[?:?]
    at java.base/java.util.ResourceBundle.getBundle(Unknown Source) ~[?:?]
    at
org.apache.manifoldcf.core.i18n.Messages.getResourceBundle(Messages.java:132)
[mcf-core.jar:?]
    at
org.apache.manifoldcf.core.i18n.Messages.getMessage(Messages.java:178)
[mcf-core.jar:?]
    at
org.apache.manifoldcf.core.i18n.Messages.getString(Messages.java:216)
[mcf-core.jar:?]
    at
org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:343)
[mcf-ui-core.jar:?]
    at
org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:119)
[mcf-ui-core.jar:?]
    at
org.apache.manifoldcf.ui.i18n.Messages.getBodyJavascriptString(Messages.java:67)
[mcf-ui-core.jar:?]
    at org.apache.jsp.index_jsp._jspService(index_jsp.java:212) [jsp/:?]


Can this be resolved by adding resource files, or does some other solution
have to be adopted?



Re: Language Detection for the data

Posted by Karl Wright <da...@gmail.com>.
Hi Nikita,

The Tika transformer may well generate a language attribute.  You would
need to check with Tika, though, to know for sure, and under what
conditions it might generate this.  It should not be confused with document
format detection, which Tika definitely does in order to extract content.

It looks like language detection in Tika either comes from document
metadata already present, or via a Java interface that you need to
explicitly call to get it.  If your documents need the latter, the Tika
connector does not currently do this:

https://tika.apache.org/1.19.1/detection.html#Language_Detection

and

https://tika.apache.org/1.19.1/examples.html#Language_Identification

The documentation does not clarify whether a language attribute is actually
generated; the architecture seems more suited to plugging in machine
translation for your content.  I suspect you would need to run the output
of the Tika transformer into the NullOutputConnector in order to see for
sure what attributes are being generated.

Karl
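
The explicitly-called Java interface mentioned above is Tika's LanguageDetector; a minimal sketch of invoking it directly (assuming the tika-langdetect 1.19.x module and its bundled models are on the classpath — nothing in the ManifoldCF pipeline makes this call for you):

```java
import org.apache.tika.langdetect.OptimaizeLangDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageResult;

public class DetectLanguage {
    public static void main(String[] args) throws Exception {
        // Load the bundled statistical language models, then classify
        // a piece of already-extracted text.
        LanguageDetector detector = new OptimaizeLangDetector().loadModels();
        LanguageResult result = detector.detect("Dit is een zin in het Nederlands.");
        System.out.println(result.getLanguage() + " " + result.getConfidence());
    }
}
```

To get a language attribute into the index, something like this would have to run on the extracted text between the Tika step and the output connector.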


>

Re: Language Detection for the data

Posted by Nikita Ahuja <ni...@smartshore.nl>.
Hi All,

Thanks for the timely replies, but my basic concern is detecting the language
of the .doc, .pdf, or other documents present in the repository.

As I understand it, the Tika Transformation should provide this functionality,
but no language output is produced for the documents.

The sequence used is:
1. Repository Connector (Web)
2. Tika Transformation
3. Metadata Adjuster
4. Output Connector (Elastic)

Is there anything which is being missed here for the language detection of
the documents?
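For later readers of this thread: whichever connector does the detection, language identification boils down to comparing the text against per-language statistical profiles. Below is a minimal, self-contained sketch of that idea. The class name, the word lists, and the stopword-counting approach are invented for illustration; the real Tika and OpenNLP detectors use trained character n-gram models and their own APIs.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Minimal stopword-based language guesser. Illustration only: real
// detectors (Tika's language detector, OpenNLP's langdetect models)
// use trained character n-gram profiles, not hand-picked word lists.
public class LangGuess {

    // Tiny hand-picked function-word profiles (an assumption for this sketch).
    private static final Map<String, Set<String>> PROFILES = Map.of(
            "en", Set.of("the", "and", "of", "is", "to", "in"),
            "nl", Set.of("de", "het", "een", "en", "van", "op"),
            "fr", Set.of("le", "la", "et", "les", "des", "est"));

    public static String detect(String text) {
        String[] tokens = text.toLowerCase().split("\\W+");
        String best = "unknown";
        int bestHits = 0;
        // Score each language by how many of its function words appear.
        for (Map.Entry<String, Set<String>> profile : PROFILES.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (profile.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = profile.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        for (String sample : List.of(
                "the quick brown fox is in the garden",
                "de kat zit op een stoel en kijkt naar het raam")) {
            System.out.println(sample + " -> " + detect(sample));
        }
    }
}
```

The point of the sketch: a detector can only produce output when it has per-language reference data loaded, which is why a missing model or bundle silently yields nothing in the pipeline above.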

On Wed, Nov 21, 2018 at 2:35 PM Furkan KAMACI <fu...@gmail.com>
wrote:


-- 
Thanks and Regards,
Nikita
Email: nikita@smartshore.nl
United Sources Service Pvt. Ltd.
a "Smartshore" Company
Mobile: +91 99 888 57720
http://www.smartshore.nl

Re: Language Detection for the data

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi Nikita,

First of all, OpenNLP is a transformation connector in ManifoldCF and
should be enabled by default. It extracts named entities (people, locations,
and organizations) from documents.

You need to download trained models to run the OpenNLP connector; you can
find them here: https://opennlp.apache.org/models.html

See here for a detailed explanation:
https://github.com/ChalithaUdara/OpenNLP-Manifold-Connector

Feel free to ask any questions when you try to integrate it. Also, please
describe where you get stuck if you cannot get it running.

Kind Regards,
Furkan KAMACI


On Wed, Nov 21, 2018 at 11:54 AM Karl Wright <da...@gmail.com> wrote:


Re: Language Detection for the data

Posted by Karl Wright <da...@gmail.com>.
Hi Nikita,

Can you be more specific when you say "OpenNLP is not working"?  All that
this connector does is integrate OpenNLP as a ManifoldCF transformer.  It
uses a specific directory to deliver the models that OpenNLP uses to match
and extract content from documents.  Thus, you can provide any models you
want that are compatible with the OpenNLP version we're including.

Can you describe the steps you are taking and what you are seeing?

On Wed, Nov 21, 2018 at 12:44 AM Nikita Ahuja <ni...@smartshore.nl> wrote:
