Posted to dev@stanbol.apache.org by Moshe Recanati <mo...@kmslh.com> on 2014/06/26 08:21:08 UTC

How-to get results of comparison between documents

Hi,
I'm new to Apache Stanbol.
Until now we have used Solr as our search engine.
We would like to extend it with semantic capabilities, which is why we're trying Stanbol.

Let's assume I have several documents that describe mobile phone specifications, indexed on release date and vendor.
I want to ask these documents 'What's the latest phone made by Samsung?' and get the latest document based on release date.

Please describe how I can do this (if it is possible at all).

Regards,
Moshe Recanati
SVP Engineering
Office + 972-73-2617564
Mobile  + 972-52-6194481
Skype    :  recanati
more at www.kmslh.com

RE: How-to get results of comparison between documents

Posted by Moshe Recanati <mo...@kmslh.com>.

Hi Rupert,
Thanks a lot, I'll check your suggestions and see if I can implement them.

Regards,
Moshe

-----Original Message-----
From: Rupert Westenthaler [mailto:rupert.westenthaler@gmail.com] 
Sent: Friday, June 27, 2014 4:41 PM
To: dev@stanbol.apache.org
Subject: Re: How-to get results of comparison between documents

Hi,

It's a bit hard to answer your very generic question. But as it looks like an interesting (and demanding) use case, I will try to provide some useful information ...

The query for the latest phone made by Samsung can be easily answered by Solr if you have an index with all data (including the release date) of mobile phones. But as you are asking this on this mailing list, I assume that you do not have structured data with such information but instead intend to extract that information from unstructured text.
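
For the structured case, a minimal Python sketch of such a Solr query could look like the following. The host, the core name "phones" and the field names "vendor" and "release_date" are assumptions for illustration, not taken from your setup; q, sort, rows and wt are standard Solr select parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name; adjust to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/phones/select"


def latest_phone_query(vendor: str) -> str:
    """Build a select URL that asks for the newest document of one vendor."""
    params = {
        # 'vendor' and 'release_date' are assumed field names.
        "q": f'vendor:"{vendor}"',
        "sort": "release_date desc",  # newest release first
        "rows": 1,                    # only the latest document
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(latest_phone_query("Samsung"))
```

Sorting on the date field in descending order and limiting the result to one row is what turns "latest phone" into a plain index lookup; no semantic processing is needed once the data is structured.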

In the following I will try to summarize some possible things that might be interesting to you:

* If you want to detect new Entities - e.g. a new Smart Phone you do not yet have in your database - you will need Named Entity Recognition (NER). NER models need to be trained for specific languages, specific types of writing (news vs. forum slang) and also the types of entities. Stanbol is integrated with OpenNLP and Stanford NLP, so models trained for those frameworks can also be used with Stanbol.
* If you already have vocabularies with the Entities you are interested in (e.g. all Smart Phones, Vendors, ...) you can use Entity Linking to detect mentions of those in unstructured texts. This is also supported by Apache Stanbol.
* If you have documents describing an Entity (e.g. a fact sheet for a new smart phone) you need an engine that extracts facts. Such an engine will first need to detect a feature (e.g. "release date") in the unstructured text and then extract the value and assign it to that feature. I am currently working on such an engine, but it is not yet available in Stanbol.
* If you have a vocabulary with Entities (e.g. all Smart Phones) with some basic information, but you want to enrich your database with more facts parsed from unstructured texts such as news articles or forum posts, you need an engine that can detect Settings. A Setting is defined as a union over multiple participants, activities and parameters. To add new information to an entity you will need to extract a Setting in which this entity participates and has an assigned parameter. The sentence "The Samsung Galaxy S10 will be released in Oct. 2019" is an example of such a Setting. News articles also contain sentences such as "iPhone 4 weighs 137 grams" or "dimensions of the Galaxy Grand 2 are 146.8×75.3×8.9mm". Such an engine is currently not available in Stanbol. However, Cristian Petroaca has been working for some time on extracting Settings like that.
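
To make the idea of "detect a feature, then extract and assign its value" more concrete, here is a toy Python sketch. It is not the engine mentioned above and not part of Stanbol: the entity vocabulary, the regular expressions and the function name are all made up for illustration, and a real engine would rely on NLP analysis rather than hand-written patterns.

```python
import re
from typing import Optional, Tuple

# Hypothetical vocabulary of known entities (stands in for Entity Linking).
KNOWN_ENTITIES = ["Samsung Galaxy S10", "iPhone 4", "Galaxy Grand 2"]

# Hypothetical feature patterns: the first group of each regex captures
# the value that belongs to the feature.
FEATURE_PATTERNS = {
    "release date": re.compile(r"released in ([A-Z][a-z]{2}\.? \d{4})"),
    "weight": re.compile(r"weighs (\d+(?:\.\d+)?) ?(?:grams|g)\b"),
    "dimensions": re.compile(r"dimensions of .*? are ([\d.]+×[\d.]+×[\d.]+) ?mm"),
}


def extract_fact(sentence: str) -> Optional[Tuple[str, str, str]]:
    """Return (entity, feature, value) or None if nothing is recognized."""
    # Naive substring lookup; real linking must handle ambiguity and variants.
    entity = next((e for e in KNOWN_ENTITIES if e in sentence), None)
    if entity is None:
        return None
    for feature, pattern in FEATURE_PATTERNS.items():
        match = pattern.search(sentence)
        if match:
            return entity, feature, match.group(1)
    return None


if __name__ == "__main__":
    for text in [
        "The Samsung Galaxy S10 will be released in Oct. 2019",
        "iPhone 4 weighs 137 grams",
        "dimensions of the Galaxy Grand 2 are 146.8×75.3×8.9mm",
    ]:
        print(extract_fact(text))
```

Each returned triple corresponds to a very reduced Setting: one participant (the entity) with one assigned parameter (the feature and its value). Hand-written patterns like these break quickly on real text, which is why a trained engine is needed for this step.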

I hope this information answers your question and helps to make your use case clearer.

best
Rupert

-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO ..........................................................................
| http://redlink.co/
