Posted to dev@stanbol.apache.org by florent andré <fl...@4sengines.com> on 2011/06/10 11:07:31 UTC

Entityhub : get all composed terms

Hi Rupert, *,

As promised in Berlin, I have a question for you! :)

I have this query:

FieldQuery query = site.getQueryFactory().createFieldQuery();

query.setConstraint(NamespaceEnum.skos + "prefLabel",
        new TextConstraint(signToFind));

query.addSelectedField(NamespaceEnum.skos + "related");
query.addSelectedField(NamespaceEnum.skos + "narrower");
query.addSelectedField(NamespaceEnum.skos + "broader");
query.addSelectedField(NamespaceEnum.skos + "inScheme");

query.setLimit(this.numSuggestions);
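
For context, I run the query roughly like this (just a sketch, exception
handling omitted; I assume "site" is a ReferencedSite and that find(query)
returns a QueryResultList<Representation> as in the Entityhub servicesapi):

    QueryResultList<Representation> results = site.find(query);
    for (Representation result : results) {
        // each result carries the selected fields
        // (skos:related, skos:narrower, skos:broader, skos:inScheme)
        System.out.println(result.getId());
    }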


When signToFind is a one-word term, e.g. "Apache", I get all composed
terms that contain this word, e.g. "Apache foundation", "Apache bylaw", etc.

That can be useful in some cases, but not always.

As I read in your documentation, there is:
- patternType: one of "wildcard", "regex" or "none" (default is "none")

As I don't define a pattern type, it should default to "none", so it should
be strict matching, right?

So, in that case shouldn't I get only entities whose term matches the single
word exactly, or am I missing something?
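
For reference, this is how I would set the pattern type explicitly, in case
that changes anything (only a sketch; I assume the PatternType enum is nested
in TextConstraint with constants matching the documented lower-case names,
and a constructor taking the text, the pattern type and a case-sensitivity
flag):

    // explicitly request "none", i.e. no wildcard/regex interpretation,
    // and case-insensitive matching of signToFind
    query.setConstraint(NamespaceEnum.skos + "prefLabel",
            new TextConstraint(signToFind, TextConstraint.PatternType.none, false));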


Thanks
++

Re: Entityhub : get all composed terms

Posted by Rupert Westenthaler <ru...@gmail.com>.
On Mon, Jun 13, 2011 at 2:14 PM, Florent André <fl...@apache.org> wrote:
> Thanks for this detailed explanation.
>
> Both use cases (with or without a tokeniser) have their justification and
> uses, depending on the situation.
>
> To work around this,
> - I first tried to use a regex request like "^Apache$", but this doesn't work
> because - if I remember correctly - SolrYard doesn't accept regex requests. Is
> that true?

That's true. I think there are some regex searchers for Lucene, but I do
not know how to use them from Solr. I will have another look after
switching to the newest version of Solr.

>
> - I set up a loop that tests each entity retrieved and selects just the
> exact matching ones.
>
> IMO, the choice between the untokenised and tokenised versions would be
> better made in the code rather than at indexing time.
> For example, one could use
> if (getUntokenisedResults("Westenthaler") == null) {
>     getTokenisedResults("Westenthaler") }

The problem is that I can only search fields that are indexed. So to
allow both tokenized and un-tokenized searches one needs both versions
to be indexed in two different fields.
If both variants were available, I would rather add this feature
by adding a new option to the text constraint.

>
> Just an idea...
>
> I'm not sure I fully understand your last sentence:
>> However I could imagine that this would require a lot of changes to
>> the current code, because currently the code assumes that only one of
>> language and data type is present at the same time.

This means that adding this feature would require a lot of changes in
the SolrYard implementation.

>
> For now, the SKOS I use is only in FR, but we plan to add EN information
> to it.

If you do not pass a language in the TextConstraint it will search all
languages. If you pass languages it will limit the search to the
prefLabels in those languages.
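
For example, something like this should restrict the search to the French and
English prefLabels (a sketch; I assume here the TextConstraint constructor
that takes the languages as trailing varargs):

    query.setConstraint(NamespaceEnum.skos + "prefLabel",
            new TextConstraint(signToFind, "fr", "en"));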

> Would it be possible to index and retrieve entities for the two languages,
> or not?

If the provided SKOS files define labels in multiple languages they
will be indexed.

Here is an example taken from the IPTC subject codes:

<rdf:Description rdf:about="http://cv.iptc.org/newscodes/subjectcode/01002000">
  <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept" />
  <skos:prefLabel xml:lang="de">Architektur</skos:prefLabel>
  <skos:prefLabel xml:lang="it">Architettura</skos:prefLabel>
  <skos:prefLabel xml:lang="es">arquitectura</skos:prefLabel>
  <skos:prefLabel xml:lang="fr">Architecture</skos:prefLabel>
  <skos:prefLabel xml:lang="en-GB">architecture</skos:prefLabel>
  <skos:definition xml:lang="de">Entwurf von Gebäuden, Denkmälern und deren Umgebung.</skos:definition>
  <skos:definition xml:lang="es">Diseño de edificios, monumentos y espácios alrededor de ellos</skos:definition>
  <skos:definition xml:lang="it">Ideazione e progettazione di edifici, monumenti e degli spazi loro circostanti</skos:definition>
  <skos:definition xml:lang="en-GB">Designing of buildings, monuments and the spaces around them</skos:definition>
  <skos:definition xml:lang="fr">Conception des immeubles, des monuments et des espaces qui les entourent</skos:definition>

  <skos:broaderTransitive>
    <rdf:Description rdf:about="http://cv.iptc.org/newscodes/subjectcode/01000000">
      <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept" />
    </rdf:Description>
  </skos:broaderTransitive>

</rdf:Description>

>
> Thanks for this really great enhancement.
> Using SolrYard is so much faster than using a D2RQ link: for the same (pretty
> big) document:
> - 22 seconds for SolrYard
> - 2-3 minutes for D2RQ

great to hear!

best
Rupert Westenthaler
>
> ++
>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Entityhub : get all composed terms

Posted by Florent André <fl...@apache.org>.
Thanks for this detailed explanation.

Both use cases (with or without a tokeniser) have their justification and
uses, depending on the situation.

To work around this,
- I first tried to use a regex request like "^Apache$", but this doesn't work
because - if I remember correctly - SolrYard doesn't accept regex requests.
Is that true?

- I set up a loop that tests each entity retrieved and selects just the
exact matching ones.

IMO, the choice between the untokenised and tokenised versions would be
better made in the code rather than at indexing time.
For example, one could use
if (getUntokenisedResults("Westenthaler") == null) {
    getTokenisedResults("Westenthaler") }

Just an idea...
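
Concretely, the post-filtering loop from the first workaround looks roughly
like this (a sketch only, using java.util and the Entityhub servicesapi types;
I assume site.find(query) returns a QueryResultList<Representation> and that
Representation#getText(field) gives an Iterator<Text> over the labels;
skos:prefLabel also has to be added to the selected fields so the labels are
part of the results):

    query.addSelectedField(NamespaceEnum.skos + "prefLabel");
    QueryResultList<Representation> results = site.find(query);
    List<Representation> exactMatches = new ArrayList<Representation>();
    for (Representation entity : results) {
        Iterator<Text> labels = entity.getText(NamespaceEnum.skos + "prefLabel");
        while (labels.hasNext()) {
            // keep only entities whose prefLabel is exactly the searched term
            if (labels.next().getText().equalsIgnoreCase(signToFind)) {
                exactMatches.add(entity);
                break;
            }
        }
    }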

I'm not sure I fully understand your last sentence:
 > However I could imagine that this would require a lot of changes to
 > the current code, because currently the code assumes that only one of
 > language and data type is present at the same time.

For now, the SKOS I use is only in FR, but we plan to add EN
information to it.
Would it be possible to index and retrieve entities for the two
languages, or not?

Thanks for this really great enhancement.
Using SolrYard is so much faster than using a D2RQ link: for the same
(pretty big) document:
- 22 seconds for SolrYard
- 2-3 minutes for D2RQ

++


Re: Entityhub : get all composed terms

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi

Good Question ...

Text fields are indexed using tokenizers in Solr. Therefore a
search for "Apache" will find all documents (entities) that have this
token in the skos:prefLabel field. This is why you also
get "Apache foundation", "Apache bylaw", etc. even if the PatternType
is set to "none".
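
Roughly what happens with the default text analyzer (which splits labels into
lower-cased tokens):

    indexed label "Apache foundation"  ->  tokens [apache] [foundation]
    query "Apache"                     ->  token  [apache]  -> token match, entity returned
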
As far as I know the only way to work around this is to deactivate the
tokenizers for such fields. However, without a tokenizer a query for
"Westenthaler" would not return "Rupert Westenthaler", which would also
be seen as strange by a lot of users.

To deactivate tokenizers for a natural language field one needs to
modify the Solr schema (schema.xml). Having both (tokenized and
un-tokenized) versions is currently not possible.

Here are the necessary additions to the schema.xml to deactivate
tokenizing for skos:prefLabel.

To get this you would need to add:
   <!-- one field for each language -->
   <field name="@en/skos:prefLabel/" type="lowercase" indexed="true" stored="true" multiValued="true"/>
   <field name="@de/skos:prefLabel/" type="lowercase" indexed="true" stored="true" multiValued="true"/>
   <field name="@it/skos:prefLabel/" type="lowercase" indexed="true" stored="true" multiValued="true"/>
   <field name="@fr/skos:prefLabel/" type="lowercase" indexed="true" stored="true" multiValued="true"/>
   <field name="@/skos:prefLabel/" type="lowercase" indexed="true" stored="true" multiValued="true"/>
   <!-- used for multi lingual searches -->
   <field name="_!@/skos:prefLabel/" type="lowercase" indexed="true" stored="false" multiValued="true"/>

If this is a frequently needed feature I could modify the SolrYard to use
suffixes for languages. This would allow indexing multiple versions of
natural language text with different prefixes. The prefixes would
then indicate whether a tokenizer should be used or not.

However I could imagine that this would require a lot of changes to
the current code, because currently the code assumes that only one of
language and data type is present at the same time.

best
Rupert Westenthaler




-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen