Posted to dev@stanbol.apache.org by Cristian Petroaca <cr...@gmail.com> on 2014/04/30 10:37:53 UTC

Working with dbpedia indexed data

Hi All,

I'm currently working on https://issues.apache.org/jira/browse/STANBOL-1279.

I am using the SiteManager to get a Site with referenceId = "dbpedia" and
am querying data related to some NERs (querying by NER label and type).
This works and I do get results from the dbpedia index.

What I want to do is this :

1. I want to be able to store and get yago class types in the dbpedia data.
This data is stored in the yago-types.nt file from the dbpedia 3.9
downloads. Is it possible to create a new dbpedia index with the 3.9 files
using this script
https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/fetch_data_en_int.sh
?

2. I want to access some specific dbpedia properties such as
dbpedia-owl:locationCity and others. These are already present in the
mappingbased_properties_en.nt
file which is in the fetch_data_en_int.sh script but are not in the
https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/src/main/resources/indexing/config/mappings.txt
file.
Should I include them there and do a dbpedia index rebuild?

I've already described this in the "Named entity coref resolution based on
dbpedia" mail thread but I thought of creating a new mail for visibility
and for not clogging the other thread.

Thanks,
Cristian

Re: Working with dbpedia indexed data

Posted by Rupert Westenthaler <ru...@gmail.com>.
On Mon, May 26, 2014 at 9:19 PM, Cristian Petroaca
<cr...@gmail.com> wrote:
> Thanks Rupert! The genericrdf reindexing worked.
>
> Just one thing: it seems kind of odd that my solrindex.zip grew from 796MB
> (after dbpedia indexing) to 1.5GB (after genericrdf indexing based on the
> dbpedia index) even though my yago_class_labels.nt file contains only
> around 100,000 entries.
> The only thing I changed in config was the name of the site as you
> suggested and in mappings.txt file I removed everything except "rdfs:label".
>

No Idea ... as long as all the data you need are available ^^

best
Rupert


-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO ..........................................................................
| http://redlink.co/

Re: Working with dbpedia indexed data

Posted by Cristian Petroaca <cr...@gmail.com>.
Thanks Rupert! The genericrdf reindexing worked.

Just one thing: it seems kind of odd that my solrindex.zip grew from 796MB
(after dbpedia indexing) to 1.5GB (after genericrdf indexing based on the
dbpedia index) even though my yago_class_labels.nt file contains only
around 100,000 entries.
The only thing I changed in config was the name of the site as you
suggested and in mappings.txt file I removed everything except "rdfs:label".
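
As a rough sanity check on those numbers, the growth per added entry can be
estimated with shell arithmetic. This is only a sketch: it assumes 1.5 GB
means 1536 MB and that all of the growth is attributable to the roughly
100,000 new label entries, neither of which is confirmed in this thread.

```shell
#!/bin/sh
# Rough estimate of index growth per added Yago class entry.
# Assumptions (not confirmed in the thread): 1.5 GB = 1536 MB, and all
# growth comes from the ~100,000 entries in yago_class_labels.nt.
before_mb=796
after_mb=1536
entries=100000

growth_mb=$((after_mb - before_mb))
# Integer division, so the result is rounded down to whole KB.
growth_kb_per_entry=$((growth_mb * 1024 / entries))

echo "growth: ${growth_mb} MB, about ${growth_kb_per_entry} KB per entry"
```

Under these assumptions the result is about 7 KB per entry, which is a lot
for entities that carry a single rdfs:label, so the extra size may come from
something other than the label data itself.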



Re: Working with dbpedia indexed data

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Cristian,

On Mon, May 26, 2014 at 2:33 PM, Cristian Petroaca
<cr...@gmail.com> wrote:
> I just found out that according to
> http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/README.md
> the min-score can actually be set to 0 and all entities will be indexed
> :).
> So, I'll give that a go (hopefully my dbpedia index won't become gigantic
> in size).
>

Even if you set the value to zero it will still only index entities
listed in the incoming_links.txt file. So you will need to append the
Yago types to that file.
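
A first step towards that could be listing the Yago class URIs found in the
label file. The sketch below only extracts the distinct subject URIs from an
N-Triples file; the function name and the output file name are hypothetical,
and the line format expected by incoming_links.txt is not described in this
thread, so the extracted URIs would still have to be adapted to whatever
format that file already uses.

```shell
#!/bin/sh
# List the distinct subject URIs of an N-Triples file, one per line.
# Assumes the subject is the first <...> token on each line, as in the
# yago_class_labels.nt sample quoted in this thread.
list_subjects() {
    sed -n 's/^<\([^>]*\)>.*/\1/p' "$1" | sort -u
}

# Hypothetical usage: review the class URIs before adding them to
# incoming_links.txt in the line format that file already uses.
# list_subjects yago_class_labels.nt > yago_class_uris.txt
```

Counting the lines of the result also gives the number of entities that
should show up in the index afterwards.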

Another possibility would be to first create the dbpedia index and
after that append the Yago classes by using the generic RDF indexing
tool. For that you can

1) take the destination folder of the dbpedia indexing tool and link
(or move) it to the destination of the generic indexing tool.
2) make sure the generic indexing tool is configured with the same site
name as the dbpedia indexing tool
3) add the RDF data of the Yago classes to the rdf data folder of the
generic indexing tool
4) adapt all the other configurations as needed
5) start the indexing process.
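
Steps 1 and 3 above can be sketched in shell as follows. This is a minimal
sketch, not the tool's own procedure: the function name and the two working
directory arguments are hypothetical, and the indexing/destination and
indexing/resources/rdfdata sub-folders are assumed to be laid out the same
way for both tools. Steps 2, 4 and 5 depend on the tools' configuration files
and commands, which are not spelled out in this thread, so they are left out.

```shell
#!/bin/sh
# Sketch of steps 1 and 3. All paths are hypothetical placeholders; adjust
# them to the actual working directories of the two indexing tools.
prepare_generic_indexing() {
    dbpedia_dir=$1   # working directory of the dbpedia indexing tool
    generic_dir=$2   # working directory of the generic RDF indexing tool
    yago_rdf=$3      # RDF file with the Yago class labels

    # Resolve to an absolute path so the symlink target stays valid.
    dbpedia_abs=$(cd "$dbpedia_dir" && pwd) || return 1

    # Step 1: reuse the dbpedia destination folder as the destination of
    # the generic tool. Refuse to overwrite an existing destination.
    if [ -e "$generic_dir/indexing/destination" ] ||
       [ -L "$generic_dir/indexing/destination" ]; then
        echo "destination already exists in $generic_dir" >&2
        return 1
    fi
    mkdir -p "$generic_dir/indexing/resources/rdfdata" || return 1
    ln -s "$dbpedia_abs/indexing/destination" \
        "$generic_dir/indexing/destination" || return 1

    # Step 3: add the Yago class data to the RDF data folder.
    cp "$yago_rdf" "$generic_dir/indexing/resources/rdfdata/" || return 1
}
```

Linking rather than moving keeps the original dbpedia output in place, at the
cost that the generic tool then writes into that same folder.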

The generic indexing tool will check if the target Solr index
already exists. As it is present, it will just add the additional
entities to the Solr core.

When the process completes you can use the "solrindex.zip" file
generated by the generic RDF indexing tool together with the OSGi
bundle (the jar file) generated by the dbpedia indexing tool.

Especially if you have already created a dbpedia index I would
recommend trying this out, as it avoids re-indexing the whole
dbpedia data.

best
Rupert



>
> 2014-05-25 16:58 GMT+03:00 Cristian Petroaca <cr...@gmail.com>:
>
>> Hi Rupert.
>>
>> I'm answering to your suggestions on integrating the yago class labels in
>> the dbpedia index in this thread since it's a lot shorter than the other
>> one.
>>
>> For clarity, your suggestions were :
>>
>> "1. The indexing tool does support LDPath. That means you can import
>> all the required RDF files and use LDPath to append the labels of the Yago
>> Types directly to the dbpedia entities. This would prevent additional
>> lookups to retrieve the types, but also increase the size of the index a
>> lot.
>> 2. You could also index the Yago Types and use an additional Entityhub
>> lookup to retrieve them. In this case you should first collect all types
>> referenced by Entities in the processed text and in a second step retrieve
>> the labels. While this means additional lookups it will only load the
>> labels for a type once. In addition you could use a cache for types.
>> 3. Your engine could use LDPath to retrieve the types. This would require
>> indexing the data as in option (2) and using an LDPath statement similar to
>> (1). It would be the slowest solution (as it requires an additional lookup
>> for every extracted entity) but require the least code."
>>
>> It seems that the best solution would be no 2, so I took that path. But
>> I'm having some issues with building the dbpedia index with the yago class
>> labels.
>>
>> I managed to create an .nt file from the data files on the yago site which
>> contains the yago class labels. The file has this format :
>> <http://dbpedia.org/class/yago/Floret111669786> <http://www.w3.org/2000/01/rdf-schema#label> "floret"@en .
>> <http://dbpedia.org/class/yago/Servant110582154> <http://www.w3.org/2000/01/rdf-schema#label> "retainer"@en .
>> <http://dbpedia.org/class/yago/Varietal107900225> <http://www.w3.org/2000/01/rdf-schema#label> "varietal"@en .
>>
>> I compressed this to a .bz2 archive and put it in the
>> indexing/resources/rdfdata folder with the rest of them.
>>
>> After running the indexer I got my dbpedia index but it seems the yago
>> class labels are not present in the index. The first clue was that they
>> were missing from the indexing/destination/indexed-entities-ids archive.
>> Second confirmation came when I tried to retrieve a yago class label by
>> calling site.getEntity(yago_class_uri) and the return was null. I should
>> mention that the same call works if I want to get a
>> http://dbpedia.org/resource/[id] entity.
>>
>> From what I saw, the indexing process indexes entities only if they are in
>> the incoming_links.txt file and only if their score is higher than 2 so I
>> guess that's the point where the yago classes were not inserted. From
>> looking at the code, the min-score parameter from the minincoming.config
>> file cannot be set to 0, or something that would ignore the
>> incoming_links.txt ranking and just index everything. So, in this
>> situation, is there a solution for getting these yago classes as entities
>> in the index?
>>
>> I'd like to mention that the indexing process did correctly read the
>> yago_class_labels.nt file and started to index the entities into Jena.
>>
>> Thanks,
>> Cristian
>>
>>
>>
>> 2014-05-07 14:54 GMT+03:00 Cristian Petroaca <cr...@gmail.com>:
>>
>>> Hi Rupert,
>>>
>>> Ok, I'll resend this mail in this thread. Again, out of habit I sent it
>>> in the gigantic "Named entities coreference" thread instead.
>>>
>>> So, I managed to create a dbpedia index with the yago class information
>>> but looking into the yago_types.nt file which assigns yago classes to
>>> dbpedia entities I realized that there are no yago class labels present, I
>>> just have the class uri like : <
>>> http://dbpedia/..something../President1829302/. I also need the class
>>> labels so that I can compare them to the noun token's string from the text.
>>>
>>> I can get the labels from one of the yago downloads here :
>>> http://www.mpi-inf.mpg.de/yago-naga/yago/download/yago/yagoMultilingualClassLabels.txt.
>>> I'll need another yago download file to map the yago wordnet classes to
>>> dbpedia uris. That could be done via a script maybe.
>>>
>>> Once I have the dbpedia_yago_class_uri -> label file is it possible to
>>> integrate this data in the dbpedia index and later be able to query the
>>> labels from the 'dbpedia' Site? How would that work in the dbpedia indexing
>>> process? What should I change in the mappings.txt file? At first glance it
>>> seems that the indexing is done based on the incoming_links.txt entity
>>> scoring and in my case I don't want to include triples involving the actual
>>> entity but triples involving a property of the entity (its yago class).
>>>
>>> Other than that, I saw that someone will be working on integrating YAGO
>>> as part of GSoC 2014. So maybe waiting for that is an option too but I
>>> don't know what the extent of the integration will be.
>>>
>>> Thanks,
>>> Cristi
>>>
>>>
>>> 2014-04-30 12:04 GMT+03:00 Rupert Westenthaler <rupert.westenthaler@gmail.com>:
>>>
>>>> On Wed, Apr 30, 2014 at 10:37 AM, Cristian Petroaca
>>>> <cr...@gmail.com> wrote:
>>>> > Hi All,
>>>> >
>>>> > I'm currently working on
>>>> > https://issues.apache.org/jira/browse/STANBOL-1279.
>>>> >
>>>> > I am using the SiteManager to get a Site with referenceId = "dbpedia"
>>>> > and am querying data related to some NERs (querying by NER label and
>>>> > type). This works and I do get results from the dbpedia index.
>>>> >
>>>> > What I want to do is this:
>>>> >
>>>> > 1. I want to be able to store and get yago class types in the dbpedia
>>>> > data. This data is stored in the yago-types.nt file from the dbpedia 3.9
>>>> > downloads. Is it possible to create a new dbpedia index with the 3.9
>>>> > files using this script
>>>> > https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/fetch_data_en_int.sh
>>>> > ?
>>>>
>>>> Yep. Just make sure you change
>>>>
>>>>     DBPEDIA=http://downloads.dbpedia.org/3.8
>>>>
>>>> to dbpedia 3.9
>>>>
>>>> BTW: you can also remove
>>>>
>>>>         #corrects encoding and recompress using gz
>>>>         bzcat ${filename}.bz2 \
>>>>             | sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
>>>>             | gzip -c > ${filename}.gz
>>>>         rm -f ${filename}.bz2
>>>>
>>>> as this is no longer necessary.
>>>>
>>>> >
>>>> > 2. I want to access some specific dbpedia properties such as
>>>> > dbpedia-owl:locationCity and others. These are already present in the
>>>> > mappingbased_properties_en.nt
>>>> > file which is in the fetch_data_en_int.sh script but are not in the
>>>> >
>>>> https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/src/main/resources/indexing/config/mappings.txt
>>>> > file.
>>>> > Should I include them there and do a dbpedia index rebuild?
>>>>
>>>> Exactly. If the size of the created SolrIndex is an issue I recommend
>>>> also that you remove properties you do not need.
>>>>
>>>> >
>>>> > I've already described this in the "Named entity coref resolution
>>>> based on
>>>> > dbpedia" mail thread but I thought of creating a new mail for
>>>> visibility
>>>> > and for not clogging the other thread.
>>>>
>>>> The old thread is already much too long anyway. Please make sure that
>>>> important points and decisions of that thread are also reflected in
>>>> the description of STANBOL-1279
>>>>
>>>> best
>>>> Rupert
>>>>
>>>> >
>>>> > Thanks,
>>>> > Cristian
>>>>
>>>>
>>>>
>>>> --
>>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>>> | Bodenlehenstraße 11                              ++43-699-11108907
>>>> | A-5500 Bischofshofen
>>>> | REDLINK.CO..........................................................................
>>>> | http://redlink.co/
>>>>
>>>
>>>
>>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO ..........................................................................
| http://redlink.co/

Re: Working with dbpedia indexed data

Posted by Cristian Petroaca <cr...@gmail.com>.
I just found out that according to
http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/README.md
the min-score can actually be set to 0, in which case all entities will be
indexed :).
So, I'll give that a go (hopefully my dbpedia index won't become gigantic
in size).



Re: Working with dbpedia indexed data

Posted by Cristian Petroaca <cr...@gmail.com>.
Hi Rupert.

I'm replying to your suggestions on integrating the yago class labels in
the dbpedia index in this thread since it's a lot shorter than the other
one.

For clarity, your suggestions were:

"1. The indexing tool does support LDPath. That means you can import
all the required RDF files and use LDPath to append the labels of the Yago
Types directly to the dbpedia entities. This would prevent additional
lookups to retrieve the types, but also increase the size of the index a
lot.

2. You could also index the Yago Types and use an additional Entityhub
lookup to retrieve them. In this case you should first collect all types
referenced by Entities in the processed text and in a second step retrieve
the labels. While this means additional lookups, it will only load the
labels for a type once. In addition you could use a cache for types.

3. Your engine could use LDPath to retrieve the types. This would require
indexing the data as in option (2) and using an LDPath statement similar to
(1). It would be the slowest solution (as it requires an additional lookup
for every extracted entity) but would require the least code."

It seems that the best solution would be option 2, so I took that path. But
I'm having some issues with building the dbpedia index with the yago class
labels.

I managed to create an .nt file from the data files on the yago site which
contains the yago class labels. The file has this format:
<http://dbpedia.org/class/yago/Floret111669786> <http://www.w3.org/2000/01/rdf-schema#label> "floret"@en .
<http://dbpedia.org/class/yago/Servant110582154> <http://www.w3.org/2000/01/rdf-schema#label> "retainer"@en .
<http://dbpedia.org/class/yago/Varietal107900225> <http://www.w3.org/2000/01/rdf-schema#label> "varietal"@en .

I compressed this to a .bz2 archive and put it in the
indexing/resources/rdfdata folder with the rest of them.
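
For reference, here is a minimal sketch of how a label file in that format
could be generated. It assumes a hypothetical tab-separated intermediate
file with one "class-uri<TAB>label" pair per line and assumes that every
label is English; the function name and file names are illustrative only
and are not part of the Stanbol tooling.

```shell
#!/bin/sh
# Sketch only: turn "<class-uri><TAB><label>" lines (a hypothetical
# intermediate format) into rdfs:label triples in N-Triples syntax.
RDFS_LABEL='http://www.w3.org/2000/01/rdf-schema#label'

labels_to_nt() {
    tab=$(printf '\t')
    # Escape backslashes and double quotes so the literal stays valid.
    sed 's/\\/\\\\/g; s/"/\\"/g' |
    while IFS="$tab" read -r uri label; do
        # Skip blank or incomplete lines.
        if [ -z "$uri" ] || [ -z "$label" ]; then
            continue
        fi
        printf '<%s> <%s> "%s"@en .\n' "$uri" "$RDFS_LABEL" "$label"
    done
}

# Example (file names are placeholders):
#   labels_to_nt < yago_class_labels.tsv | bzip2 -c > yago_class_labels.nt.bz2
```

The resulting .bz2 file would then be copied next to the other RDF dumps in
indexing/resources/rdfdata.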

After running the indexer I got my dbpedia index, but it seems the yago
class labels are not present in it. The first clue was that they
were missing from the indexing/destination/indexed-entities-ids archive.
The second confirmation came when I tried to retrieve a yago class by
calling site.getEntity(yago_class_uri) and it returned null. I should
mention that the same call works if I want to get a
http://dbpedia.org/resource/[id] entity.
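
A check of this kind can be sketched as follows. This is only an
illustration: it assumes the indexed entity ids have been extracted from
the archive into a plain text file with one URI per line (the actual
archive layout is not described in this thread), and the function name is
made up.

```shell
#!/bin/sh
# Sketch only: list the subjects of an N-Triples stream (stdin) that do
# not occur in a file of indexed entity ids (assumed: one URI per line).
missing_subjects() {
    ids_file=$1
    # Extract the subject URI of each triple, de-duplicate, then keep
    # only the URIs that have no exact match in the ids file.
    sed -n 's/^<\([^>]*\)>.*/\1/p' |
    sort -u |
    grep -F -x -v -f "$ids_file"
}

# Example (file names are placeholders):
#   bzcat yago_class_labels.nt.bz2 | missing_subjects indexed-entities-ids.txt
```

If every yago class URI is printed, none of them made it into the index.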

From what I saw, the indexing process indexes entities only if they are in
the incoming_links.txt file and only if their score is higher than 2, so I
guess that's the point where the yago classes were dropped. From
looking at the code, the min-score parameter in the minincoming.config
file cannot be set to 0, or to any value that would ignore the
incoming_links.txt ranking and just index everything. So, in this
situation, is there a solution for getting these yago classes as entities
in the index?

I'd like to mention that the indexing process did correctly read the
yago_class_labels.nt file and started to index the entities into Jena.

Thanks,
Cristian




Re: Working with dbpedia indexed data

Posted by Rupert Westenthaler <ru...@gmail.com>.
On Wed, Apr 30, 2014 at 10:37 AM, Cristian Petroaca
<cr...@gmail.com> wrote:
> Hi All,
>
> I'm currently working on https://issues.apache.org/jira/browse/STANBOL-1279.
>
> I am using the SiteManager to get a Site with referenceId = "dbpedia" and
> am querying data related to some NERs (querying by NER label and type).
> This works and I do get results from the dbpedia index.
>
> What I want to do is this :
>
> 1. I want to be able to store and get yago class types in the dbpedia data.
> This data is stored in the yago-types.nt file from the dbpedia 3.9
> downloads. Is it possible to create a new dbpedia index with the 3.9 files
> using this script
> https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/dbpedia-3.8/fetch_data_en_int.sh
> ?

Yep. Just make sure you change

    DBPEDIA=http://downloads.dbpedia.org/3.8

to dbpedia 3.9

BTW: you can also remove

        #corrects encoding and recompress using gz
        bzcat ${filename}.bz2 \
            | sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
            | gzip -c > ${filename}.gz
        rm -f ${filename}.bz2

as this is no longer necessary.
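
The version change could also be applied with a small filter instead of
editing the script by hand. This is a sketch, not part of the Stanbol
scripts: the function name is made up, and it only rewrites a line of the
form DBPEDIA=http://downloads.dbpedia.org/<version> as shown above.

```shell
#!/bin/sh
# Sketch only: rewrite the DBPEDIA=... line of a fetch script read from
# stdin so that it points at the given DBpedia release.
set_dbpedia_version() {
    version=$1
    # Keep the variable name and base URL, replace only the version part.
    sed "s|^\([[:space:]]*DBPEDIA=http://downloads\.dbpedia\.org/\)[0-9.]*|\1${version}|"
}

# Example (output file name is a placeholder):
#   set_dbpedia_version 3.9 < fetch_data_en_int.sh > fetch_data_en_int_3.9.sh
```

All other lines of the script pass through unchanged, so the recompression
block would still have to be removed separately.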

>
> 2. I want to access some specific dbpedia properties such as
> dbpedia-owl:locationCity and others. These are already present in the
> mappingbased_properties_en.nt
> file which is in the fetch_data_en_int.sh script but are not in the
> https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/dbpedia/src/main/resources/indexing/config/mappings.txt
> file.
> Should I include them there and do a dbpedia index rebuild?

Exactly. If the size of the created SolrIndex is an issue, I also
recommend that you remove properties you do not need.

>
> I've already described this in the "Named entity coref resolution based on
> dbpedia" mail thread but I thought of creating a new mail for visibility
> and for not clogging the other thread.

The old thread is already much too long anyway. Please make sure that
important points and decisions of that thread are also reflected in
the description of STANBOL-1279.

best
Rupert

>
> Thanks,
> Cristian



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO ..........................................................................
| http://redlink.co/