Posted to commits@stanbol.apache.org by "Olivier Grisel (JIRA)" <ji...@apache.org> on 2011/07/01 16:44:28 UTC
[jira] [Issue Comment Edited] (STANBOL-246) Exact name match should get boosted in the entity hub SolrYard indices

    [ https://issues.apache.org/jira/browse/STANBOL-246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058570#comment-13058570 ] 

Olivier Grisel edited comment on STANBOL-246 at 7/1/11 2:42 PM:
----------------------------------------------------------------

Yes, you are right. Though if we have the Shingle analyzer enabled for that field, that might work for bigrams. I think the shingle analyzer can also be configured for trigrams. However, organization names can be 4 or 5 words long, and I guess we cannot afford to store a 4-gram or 5-gram index for DBpedia-sized knowledge bases.
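
To illustrate the limitation, here is a rough sketch in plain Python (not the actual Lucene ShingleFilter; the `shingles` helper and the sample name are made up for the example) of which word n-grams get indexed with a maximum shingle size of 3. The full 5-word name is not among them, so it cannot be matched as a single exact term:

```python
def shingles(tokens, min_size=2, max_size=3):
    """Return all word n-grams of `tokens` with min_size <= n <= max_size."""
    out = []
    for size in range(min_size, max_size + 1):
        # A sequence of len(tokens) words contains len(tokens) - size + 1 n-grams.
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out


# Hypothetical 5-word organization name, already lowercased and tokenized.
name = "united states air force academy".split()

indexed = shingles(name, 2, 3)
print(indexed)
print(" ".join(name) in indexed)
```

Raising `max_size` to 5 would cover the full name, but every extra size adds roughly one more term per token position, which is where the index size concern comes from.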

Maybe we should do some kind of post-processing of the query results to re-rank them when there is an exact match in the list. Or maybe a PageRank score will fix the issue.
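
The post-processing idea could look roughly like the sketch below. This is plain Python and not Entityhub code: the `rerank_exact_first` helper, the `(label, score)` result shape and the scores are assumptions made up for the example. It moves labels that exactly match the query (ignoring case and extra whitespace) to the top and keeps the original order otherwise:

```python
def normalize(label):
    """Lowercase and collapse whitespace so that trivial variations still match."""
    return " ".join(label.lower().split())


def rerank_exact_first(query, results):
    """Move exact label matches to the front of `results`.

    `results` is a list of (label, score) pairs already sorted by score.
    sorted() is stable, so the original order is preserved within the
    exact-match group and within the remaining results.
    """
    wanted = normalize(query)
    # The key is False (sorts first) for exact matches, True otherwise.
    return sorted(results, key=lambda r: normalize(r[0]) != wanted)


# Hypothetical scores, in the order reported in the issue description.
results = [
    ("United States Navy", 3.2),
    ("United States Air Force", 3.1),
    ("United States", 2.9),
]

for label, score in rerank_exact_first("United States", results):
    print(label, score)
```

This only helps when the exact match is already somewhere in the returned page of results, so it would probably need a larger internal limit than the one requested by the client.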

      was (Author: ogrisel):
    Yes you are right. Though if we have the Shingle analyzer enabled for that field that might work for bi-grams. I think the shingle analyzer might also be configured for trigrams. However organization names can be 4 or 5 words long and we cannot afford storing 4-grams or 5-grams index for DBpedia sized knowledge bases I guess.

Maybe we should do some kind of many post-processing of the results to re-rank then when there is an exact match in that case. Or maybe page rank score will fix the issue.
  
> Exact name match should get boosted in the entity hub SolrYard indices
> ----------------------------------------------------------------------
>
>                 Key: STANBOL-246
>                 URL: https://issues.apache.org/jira/browse/STANBOL-246
>             Project: Stanbol
>          Issue Type: Bug
>            Reporter: Olivier Grisel
>            Assignee: Rupert Westenthaler
>         Attachments: united_states_dbpedia_solrindex.json
>
>
> For instance, using the default embedded solryard index:
> {code}
>  curl -X POST -d "name=United States&limit=10&offset=0" http://localhost:8080/entityhub/site/dbpedia/find
> {code}
> The first results are "United States Navy" and "United States Air Force"; "United States" only comes in third position. See the attached JSON output.
> Exact name matches (or close to exact matches) should get a score boost. This can probably be implemented with a FuzzyQuery and a minSimilarity of 0.8f, for instance.
> https://lucene.apache.org/java/3_3_0/api/all/org/apache/lucene/search/FuzzyQuery.html
> Maybe in this case the popularity boosts are bad because they are based on a naive count of incoming links. Using a PageRank-style centrality score might work better:
> https://github.com/julienledem/Pig-scripting-examples/tree/master/Page%20Rank
> https://github.com/mesos/spark/blob/master/bagel/src/main/scala/spark/bagel/examples/WikipediaPageRank.scala

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira