Posted to dev@stanbol.apache.org by Antonio Perez <ap...@zaizi.com> on 2013/07/01 11:36:21 UTC

[GSoC] First Milestone: Freebase Disambiguation in Stanbol

Hi all

According to the project schedule, last Friday marked the first milestone
of the project: 'Complete the integration of Freebase as EntityHub
ReferencedSite in Stanbol'.
The steps to achieve this task are the following:

- Download the Freebase indexing tool (based on the Apache Stanbol Indexing
Tool) from
https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase/
- Build the jar with Maven; the resulting
org.apache.stanbol.entityhub.indexing.freebase-*.jar is placed in the target
directory.
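
For example (assuming Subversion and Maven are available; the local
directory name is arbitrary and the version in the jar name may differ):

svn checkout https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase/ freebase-indexing
cd freebase-indexing
mvn clean install
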
- Download the Freebase dump from http://download.freebaseapps.com
- Rename the Freebase dump from *.gz to *.ttl.gz (necessary for the
indexing tool to treat the dump as Turtle)
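
For example, assuming the downloaded file is called freebase-rdf-<date>.gz
(a placeholder name; use the actual file name of the dump):

mv freebase-rdf-<date>.gz freebase-rdf-<date>.ttl.gz
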
- Initialize the configuration and generate the directory structure using
the command:

java -jar org.apache.stanbol.entityhub.indexing.freebase-*.jar init

- Generate the scoring file using the fbrankings.sh script and put it in
the 'indexing/resources' directory
- Apply the fixit tool (http://people.apache.org/~andy/Freebase20121223/)
using the command:

 gunzip -c ${FB_DUMP} | fixit | gzip > ${FB_DUMP_fixed}

- Move the fixed data dump (*.ttl.gz) to 'indexing/resources/rdfdata'
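
Using the variable from the fixit command above, for example (assuming
${FB_DUMP_fixed} keeps the .ttl.gz extension):

mv ${FB_DUMP_fixed} indexing/resources/rdfdata/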

- Configure the mappings and some other information contained in the
'indexing/conf' directory

- Run the Freebase indexing tool:

java -Xmx32g -jar org.apache.stanbol.entityhub.indexing.freebase-*.jar index

(to suppress 'Bad IRI...' warnings in the log, pipe the output of the
previous command through 'grep -v "Bad IRI"', as shown below)
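
For example (the 2>&1 redirect is only needed if the warnings are written to
stderr rather than stdout):

java -Xmx32g -jar org.apache.stanbol.entityhub.indexing.freebase-*.jar index 2>&1 | grep -v "Bad IRI"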

- The indexing tool generates two files in the 'indexing/dist' directory:

  * freebase.solrindex.zip must be copied to stanbol/datafiles

  * org.apache.stanbol.data.site.freebase-*.jar must be copied to
stanbol/fileinstall

(If the Stanbol stable launcher is being used, also add the
'commons.solr.extras.kuromoji' and 'commons.solr.extras.smartcn' bundles to
the stanbol/fileinstall directory)
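
For example, with ${STANBOL_HOME} being the directory from which the Stanbol
launcher is run (a placeholder; adapt the paths to your setup):

cp indexing/dist/freebase.solrindex.zip ${STANBOL_HOME}/stanbol/datafiles/
cp indexing/dist/org.apache.stanbol.data.site.freebase-*.jar ${STANBOL_HOME}/stanbol/fileinstall/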


The indexing tool takes a very long time on a standard computer, so in
order to execute this process you'll need either a computer with an SSD or
a computer with about 200GB of RAM in order to handle the whole Freebase
data dump in memory.


For the next milestone (midterm evaluation) the following tasks need to be
done:
1.  Convert wiki-links data dump to RDF
    * Wiki-links contains a lot of disambiguation information that we want
to incorporate into the Entityhub Freebase site.
    * The wiki-links data dump will be converted to RDF so that it is easier
to process with the new Stanbol Freebase indexing tool (point 2).
    * The expanded wiki-links dataset [1] will be used because it contains
information such as the extracted context of the mentions, alignment to
Freebase entities, etc.
2.  Develop a new Stanbol indexer to join Freebase and wiki-links
information
3.  Generate a graph with the links in Freebase
    * To support graph-based disambiguation algorithms in Stanbol, a graph
will be generated using Blueprints over Neo4j, and every node in the graph
will be associated with an entry in the EntityHub so that an entity can
later be positioned directly on a node of the graph.

Comments are more than welcome

Regards

[1] http://www.iesl.cs.umass.edu/data/wiki-links


Re: [GSoC] First Milestone: Freebase Disambiguation in Stanbol

Posted by Antonio Perez <ap...@zaizi.com>.
Hi Rupert

Thanks for the advice.

Regarding the wiki-links dataset: the current dataset contains the context
and the Freebase id (in GUID format), but it doesn't contain the text of
the document.
That can be added later, because the expanded wikilinks dataset parser
(Thrift to RDF) I have developed already supports the content of the
document (when it is available).

You can download the current datasets from
http://iesl.cs.umass.edu/downloads/wiki-link/context-only/ .

I am thinking about creating a ReferencedSite with the expanded wikilinks
dataset + the Google concept dictionary, also converting the Freebase id
given by wikilinks (a GUID) into the new Freebase id format (mid), which I
have already done as a test.
This way, Freebase entities and the wikilinks information will be related.

What do you think about it?

Moreover, my intention is to allow using the wikilinks information with
DBpedia. I'm going to use the Freebase id of each mention in wikilinks to
link to Freebase, and I would like to do the same thing using the Wikipedia
URL to link to DBpedia. How could I do that?
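
(Just a sketch of what I have in mind: DBpedia resource URIs are normally
derived from the English Wikipedia article name, so, ignoring URL-encoding
corner cases, the rewrite could be as simple as

sed -e 's|http://en.wikipedia.org/wiki/|http://dbpedia.org/resource/|g' wikilinks.nt > wikilinks-dbpedia.nt

where wikilinks.nt is a placeholder for the RDF converted from the wikilinks
dump.)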

Regards



On Wed, Jul 3, 2013 at 8:20 AM, Rupert Westenthaler <
rupert.westenthaler@gmail.com> wrote:

> Hi Antonio,
>
> Thank you for the nice overview.
>
> Let me mention that, because of the following issue,
>
> On Mon, Jul 1, 2013 at 11:36 AM, Antonio Perez <ap...@zaizi.com> wrote:
> >
> > The indexing tool takes too much time in a standard computer, so in order
> > to execute this process, you'll need either a computer with SSD or
> >  a computer with 200GB of RAM in order to deal with the whole Freebase
> data
> > dump in memory.
> >
>
> Rafa has started to work on an IndexingSource that can directly
> operate on the Freebase dump (any single-file RDF dump that is sorted
> by SPO). With such a source one can index a dataset without first
> importing the data into an RDF triple store. As this is the most
> hardware-demanding part of the chain, it should greatly improve
> indexing performance.
>
> However this IndexingSource will not support LDPath and will therefore
> not support some of the available EntityProcessors.
>
> >
> > For the next milestone (midterm evaluation) the following tasks need to
> be
> > done:
> > 1.  Convert wiki-links data dump to RDF
> >     * Wiki-links contains a lot of disambiguation information which it is
> > wanted to incorporate to the Entityhub Freebase site.
> >     * The wiki-link data dump will be converted to RDF to be easier to
> > process by the new Stanbol Freebase indexing tool (point 2)
> >     * The wiki-link expanded dataset [1] will be used because it contains
> > information like extracted context for the mentions, alignment to
> Freebase
> > entities, etc.
> > 2.  Develop a new stanbol indexer to join Freebase and wiki-links
> > information
>
> The expanded dataset [1] is really great, as it allows avoiding a lot
> of very time-consuming tasks (crawling the resources and extracting the
> mention text and context, linking the DBpedia URIs to Freebase).
> Without this information, using this great dataset would not be
> feasible because of time constraints.
>
> > 3.  Generate a graph with the links in Freebase
> >     * To support Graph-based disambiguation algorithms in Stanbol, a
> graph
> > will be generated using Blueprints Neo4j and every node in the graph will
> > be associated to entries in the EntityHub to later be used to position
> > directly in a node on the graph.
> >
>
> IMO this is really interesting not only for Disambiguation. I am
> really looking forward to this. Do not forget to test the code also
> with backends that are compatible with the Apache License.
>
> best
> Rupert
>
> > Comments are more than welcome
> >
> > Regards
> >
> > [1] http://www.iesl.cs.umass.edu/data/wiki-links
> >
>
>
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>


Re: [GSoC] First Milestone: Freebase Disambiguation in Stanbol

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Antonio,

Thank you for the nice overview.

Let me mention that, because of the following issue,

On Mon, Jul 1, 2013 at 11:36 AM, Antonio Perez <ap...@zaizi.com> wrote:
>
> The indexing tool takes too much time in a standard computer, so in order
> to execute this process, you'll need either a computer with SSD or
>  a computer with 200GB of RAM in order to deal with the whole Freebase data
> dump in memory.
>

Rafa has started to work on an IndexingSource that can directly
operate on the Freebase dump (any single-file RDF dump that is sorted
by SPO). With such a source one can index a dataset without first
importing the data into an RDF triple store. As this is the most
hardware-demanding part of the chain, it should greatly improve
indexing performance.
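
For reference, if the dump is serialized with one triple per line (e.g. as
N-Triples), such an SPO-sorted dump can be produced with standard tools,
roughly like this (file names are placeholders; for a dump of this size
sort's -T and -S options and a fast temp disk help a lot):

gunzip -c freebase.nt.gz | LC_ALL=C sort | gzip > freebase-sorted.nt.gz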

However this IndexingSource will not support LDPath and will therefore
not support some of the available EntityProcessors.

>
> For the next milestone (midterm evaluation) the following tasks need to be
> done:
> 1.  Convert wiki-links data dump to RDF
>     * Wiki-links contains a lot of disambiguation information which it is
> wanted to incorporate to the Entityhub Freebase site.
>     * The wiki-link data dump will be converted to RDF to be easier to
> process by the new Stanbol Freebase indexing tool (point 2)
>     * The wiki-link expanded dataset [1] will be used because it contains
> information like extracted context for the mentions, alignment to Freebase
> entities, etc.
> 2.  Develop a new stanbol indexer to join Freebase and wiki-links
> information

The expanded dataset [1] is really great, as it allows avoiding a lot
of very time-consuming tasks (crawling the resources and extracting the
mention text and context, linking the DBpedia URIs to Freebase).
Without this information, using this great dataset would not be
feasible because of time constraints.

> 3.  Generate a graph with the links in Freebase
>     * To support Graph-based disambiguation algorithms in Stanbol, a graph
> will be generated using Blueprints Neo4j and every node in the graph will
> be associated to entries in the EntityHub to later be used to position
> directly in a node on the graph.
>

IMO this is really interesting not only for Disambiguation. I am
really looking forward to this. Do not forget to test the code also
with backends that are compatible with the Apache License.

best
Rupert

> Comments are more than welcome
>
> Regards
>
> [1] http://www.iesl.cs.umass.edu/data/wiki-links
>



--
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen