Posted to commits@stanbol.apache.org by "Juan Vargas (JIRA)" <ji...@apache.org> on 2012/11/15 10:52:14 UTC

[jira] [Resolved] (STANBOL-804) Creating a spanish Index

     [ https://issues.apache.org/jira/browse/STANBOL-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Juan Vargas resolved STANBOL-804.
---------------------------------

    Resolution: Fixed

Thanks Rupert! 

Here is a copy of the response:

The RDF dump files from DBpedia contain invalid UTF-8 characters. With DBpedia version 3.7 this affected only very few files; in version 3.8 many more files are affected. Because of that I recently
created a shell script that corrects such errors for all files.

See http://markmail.org/message/67ivlyoxfqad6xoe for details.

What this basically does is execute the following command on all files:

    bzcat ${filename}.bz2 \
        | sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g' \
        | gzip -c > ${filename}.gz
    rm -f ${filename}.bz2
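
A minimal sketch of how the command above might be applied to every dump in a directory. The function names fix_escapes and fix_dumps are hypothetical and are not taken from the linked script; the sed expression is the one quoted above, unchanged. Unlike the quoted command, this sketch deletes the .bz2 file only when gzip (the last stage of the pipeline) exits successfully.

```shell
#!/bin/sh

# Hypothetical helper: read N-Triples text on stdin and rewrite backslashes
# using the sed expression quoted above.
#   - a double backslash becomes \u005c\u005c
#   - a backslash followed by anything other than 'u' or '"' becomes \u005c
#     followed by that character
fix_escapes() {
    sed 's/\\\\/\\u005c\\u005c/g;s/\\\([^u"]\)/\\u005c\1/g'
}

# Hypothetical helper: rewrite every .bz2 file in directory $1 as a cleaned
# .gz file next to it.
fix_dumps() {
    dir=$1
    for archive in "$dir"/*.bz2; do
        [ -e "$archive" ] || continue    # the glob matched nothing
        filename=${archive%.bz2}
        # In POSIX sh the pipeline's exit status is gzip's, so the original
        # is removed only if gzip succeeded.
        bzcat "$archive" | fix_escapes | gzip -c > "$filename.gz" \
            && rm -f "$archive"
    done
}

# Example call (directory name as used in the report below):
# fix_dumps indexing/resources/rdfdata
```

Keeping the sed step in its own function makes it possible to try it on a single suspicious line before converting whole files, for example: printf '%s\n' 'x\y' | fix_escapes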
                
> Creating a spanish Index
> ------------------------
>
>                 Key: STANBOL-804
>                 URL: https://issues.apache.org/jira/browse/STANBOL-804
>             Project: Stanbol
>          Issue Type: Question
>          Components: Enhancer, Entityhub, OWL
>            Reporter: Juan Vargas
>            Priority: Minor
>
> Hello.
> I'm Juan Vargas, a web developer at Notedlinks S.L. in Spain.
> I have spent a few days trying to create a Spanish index from the DBpedia 3.8 files for use with the Stanbol Enhancer, following the instructions at https://github.com/apache/stanbol/blob/trunk/entityhub/indexing/dbpedia/README.md. The steps I followed were:
> 1. Build the indexing tool:
>    - cd {stanbol-source}/entityhub/indexing/genericrdf/  (the directory where Stanbol is installed; requires Stanbol, see http://stanbol.apache.org/docs/trunk/tutorial.html)
>    - mvn assembly:single
>    - move org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar to the target directory where I plan to build the index
> 2. Create the sub-folder in the target directory:
>    - java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar init
> 3. Download the DBpedia dump files and copy them to 'indexing/resources/rdfdata':
>     http://downloads.dbpedia.org/3.8/dbpedia_3.6.owl.bz2    (general for any language)
>     http://downloads.dbpedia.org/3.8/es/instance_types_es.nt.bz2
>     http://downloads.dbpedia.org/3.8/es/labels_es.nt.bz2
>     http://downloads.dbpedia.org/3.8/es/short_abstracts_es.nt.bz2
>     http://downloads.dbpedia.org/3.8/es/long_abstracts_es.nt.bz2
>     http://downloads.dbpedia.org/3.8/es/geo_coordinates_es.nt.bz2
>     http://downloads.dbpedia.org/3.8/es/persondata_es.nt.bz2  (does not seem to exist for Spanish; is it a problem if it is not used?)
>     http://downloads.dbpedia.org/3.8/es/article_categories_es.nt.bz2
>     http://downloads.dbpedia.org/3.8/es/category_labels_es.nt.bz2
>     http://downloads.dbpedia.org/3.8/es/skos_categories_es.nt.bz2
>     http://downloads.dbpedia.org/3.8/es/redirects_es.nt.bz2
> 4. Generate the entity scores and copy them to 'indexing/resources':
>   - curl http://downloads.dbpedia.org/3.8/es/page_links_en.nt.bz2 | bzcat | sed -e 's/.*<http\:\/\/es\.dbpedia\.org\/resource\/\([^>]*\)> ./\1/' | sort | uniq -c | sort -nr > incoming_links.txt   (changes for Spanish: the resource URL and 'en' replaced by 'es'; see the notes on the web page)
> 5. Configure the index:
>  - I left the defaults, since I do not really understand how to configure it.
> 6. Run the jar to create the index:
>   - java -jar org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar index
> The execution crashes, and the trace is as follows:
> 10:42:36,037 [Thread-3] ERROR source.ResourceLoader - Unable to load resource /home/juan/stanbol-index/indexing/resources/rdfdata/redirects_es.nt.bz2
> org.openjena.riot.RiotException: [line: 5854, col: 103] Broken token: http://es.dbpedia.org/resource/Pactos_de_
>     at org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
>     at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
>     at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:42)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
>     at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
>     at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
>     at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
>     at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
>     at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:75)
>     at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
>     at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
>     at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:272)
>     at org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
>     at java.lang.Thread.run(Thread.java:679)
> Looking at the redirects_es.nt.bz2 file:
>    5852 <http://es.dbpedia.org/resource/Tratados_Lateranos> <http://dbpedia.org/ontology/wikiPageRedirects> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
>    5853 <http://es.dbpedia.org/resource/Tratado_Laterano> <http://dbpedia.org/ontology/wikiPageRedirects> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
>    5854 <http://es.dbpedia.org/resource/Tratado_Lateranense> <http://dbpedia.org/ontology/wikiPageRedirects> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
>    5855 <http://es.dbpedia.org/resource/Tratados_Lateranenses> <http://dbpedia.org/ontology/wikiPageRedirects> <http://es.dbpedia.org/resource/Pactos_de_Letr\u00E1n> .
> I don't see any error. Could someone help me spot anything unusual?
> I also tried building a DBpedia 3.8 English version, to check whether I was doing something wrong with the Spanish one. It seemed fine at first, but a few minutes later I got:
> 11:23:32,576 [Thread-3] ERROR source.ResourceLoader - Unable to load resource /home/juan/stanbol-index/indexing/resources/rdfdata/short_abstracts_en.nt.bz2
> org.openjena.riot.RiotException: [line: 1880, col: 96] Broken token: Bambara, also known as Bamana, and Bamanankan by speakers of the language, is a language spoken in Mali, and to a lesser extent Burkina Faso, Senegal by as many as six million people (in
>     at org.openjena.riot.ErrorHandlerLib$ErrorHandlerStd.fatal(ErrorHandlerLib.java:97)
>     at org.openjena.riot.lang.LangBase.raiseException(LangBase.java:205)
>     at org.openjena.riot.lang.LangBase.nextToken(LangBase.java:152)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:42)
>     at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:22)
>     at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:58)
>     at org.openjena.riot.lang.LangBase.parse(LangBase.java:75)
>     at org.openjena.riot.RiotReader.parseQuads(RiotReader.java:173)
>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadQuads$(BulkLoader.java:154)
>     at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader.loadDataset(BulkLoader.java:113)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset$(TDBLoader.java:282)
>     at com.hp.hpl.jena.tdb.TDBLoader.loadDataset(TDBLoader.java:193)
>     at com.hp.hpl.jena.tdb.TDBLoader.load(TDBLoader.java:74)
>     at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfResourceImporter.importResource(RdfResourceImporter.java:75)
>     at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResource(ResourceLoader.java:201)
>     at org.apache.stanbol.entityhub.indexing.core.source.ResourceLoader.loadResources(ResourceLoader.java:137)
>     at org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource.initialise(RdfIndexingSource.java:272)
>     at org.apache.stanbol.entityhub.indexing.core.impl.IndexingSourceInitialiser.run(IndexingSourceInitialiser.java:43)
>     at java.lang.Thread.run(Thread.java:679)
> Looking at short_abstracts_en.nt.bz2:
>    1879 <http://dbpedia.org/resource/Bernard_of_Clairvaux> <http://www.w3.org/2000/01/rdf-schema#comment> "Bernard of Clairvaux, O. Cist (1090 \u2013 August 20, 1153) was a French abbot and the primary builder of the reforming Cistercian order. After the death of his mother, Bernard sought admission into the Cistercian order. Three years later, he was sent to found a new abbey at an isolated clearing in a glen known as the Val d'Absinthe, about 15\u00A0km southeast of Bar-sur-Aube. According to tradition, Bernard founded the monastery on 25 June 1115, naming it Claire Vall\u00E9e, which evolved into Clairvaux."@en .
>    1880 <http://dbpedia.org/resource/Bambara_language> <http://www.w3.org/2000/01/rdf-schema#comment> "Bambara, also known as Bamana, and Bamanankan by speakers of the language, is a language spoken in Mali, and to a lesser extent Burkina Faso, Senegal by as many as six million people (including second language users). The Bambara language is the language of people of the Bambara ethnic group, numbering about 4,000,000 people, but serves also as a lingua franca in Mali (it is estimated that about 80% of the population speak it as a first or second language)."@en .
>    1881 <http://dbpedia.org/resource/Bishkek> <http://www.w3.org/2000/01/rdf-schema#comment> "Bishkek, formerly Pishpek and Frunze, is the capital and the largest city of Kyrgyzstan. Bishkek is also the administrative centre of Chuy Province which surrounds the city, even though the city itself is not part of the province but rather a province-level unit of Kyrgyzstan. The name is thought to derive from a Kyrgyz word for a churn used to make fermented mare's milk, the Kyrgyz national drink."@en .
> Could someone explain why errors like "Broken token" appear, or whether I am doing something wrong? I think I followed the guide correctly. Thanks, and I hope this information helps others who try to create indexes for Apache Stanbol, which is a really great project. Nice work!
> Best,
> Juan.
