Posted to dev@stanbol.apache.org by Rupert Westenthaler <ru...@gmail.com> on 2013/01/05 15:37:28 UTC

Chinese language support for Apache Stanbol

Hi all

As those who follow the Stanbol JIRA might already know, I have been
working on Chinese support for Apache Stanbol over the last weeks (see
STANBOL-855 [1] and STANBOL-875 [2]). This is now the official
announcement.

## Processing Chinese language currently works as follows:

* Language detection (based on the langdetect engine [3])
* Sentence detection using the smartcn-sentence engine: This engine is
based on the Lucene Smartcn SentenceTokenizer [4]. Sentence detection
is optional but highly recommended when longer texts are passed to the
Stanbol Enhancer.
* Tokenization: There are two options: 1. the Smartcn Tokenizer based
on the Smartcn WordTokenFilter [5] or 2. the PaodingAnalyzer [6] based
Tokenizer Engine.
* EntityLinking by using the EntityhubLinkingEngine [7] configured
with an Entityhub Site that contains Entities with Chinese labels. To
correctly tokenize Chinese language labels of Entities one needs to
configure a LabelTokenizer [8]. There are two implementations: 1. one
based on Smartcn and 2. one based on Paoding. Make sure that one of
those two is installed. If both are available the one with the higher
service.ranking will be used.
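
To illustrate the last point, here is a small Python sketch of how the choice
between two registered LabelTokenizer services plays out. It assumes the
standard OSGi ordering (highest service.ranking first, ties broken by the
lowest service.id); that the tie-break applies here is an assumption. The
names and ranking values are invented for the example and are not the actual
defaults of the two bundles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRef:
    """Hypothetical stand-in for an OSGi service reference."""
    name: str
    ranking: int      # value of the service.ranking property
    service_id: int   # value of the service.id property


def pick_service(refs):
    """Return the reference with the highest ranking; on a tie, the lowest service id."""
    if not refs:
        raise ValueError("no LabelTokenizer service registered")
    return min(refs, key=lambda ref: (-ref.ranking, ref.service_id))


# Illustrative values only, not the real configuration of the bundles.
candidates = [
    ServiceRef("smartcn-label-tokenizer", ranking=0, service_id=41),
    ServiceRef("paoding-label-tokenizer", ranking=100, service_id=42),
]
print(pick_service(candidates).name)  # paoding-label-tokenizer
```

So if you install both and want a specific one to be used, give it the
higher service.ranking in its configuration.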

## Demo

A demo is available on the Full Launcher of the dev.iks-project.eu server

http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-zh-linking

This links Chinese language texts against a Chinese language DBpedia
index. The configuration uses the Smartcn implementation.
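
For those who want to try the chain from a script, here is a minimal sketch in
Python. It assumes the usual Stanbol Enhancer REST convention: POST the text
as text/plain and select the RDF serialization via the Accept header. The
function names and the timeout value are my own choices for illustration, and
the demo server may of course not be reachable.

```python
import urllib.request

# Chain endpoint of the demo mentioned above.
CHAIN_URL = "http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-zh-linking"


def build_enhancer_request(text, url=CHAIN_URL, accept="text/turtle"):
    """Build (but do not send) a POST request carrying `text` as UTF-8 plain text.

    Hypothetical helper for illustration; it is not part of Stanbol.
    """
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=UTF-8",
            "Accept": accept,
        },
        method="POST",
    )


def enhance(text, url=CHAIN_URL, accept="text/turtle"):
    """Send `text` to the chain and return the response body as a string.

    Needs network access and a running server at `url`.
    """
    request = build_enhancer_request(text, url=url, accept=accept)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling `enhance("北京是中华人民共和国的首都。")` should then return the
enhancement results as Turtle, provided the server is up.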

### Chinese DBpedia index download

I built two versions of the Chinese DBpedia index based on DBpedia
3.8. For one, Smartcn was used for indexing Chinese language literals.
For the other, Paoding was used.

The indexes are available for download under

http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.8/chinese/

Note that these indexes also include labels for other languages. The
other language DBpedia dumps were indexed as well in order to copy
knowledge (e.g. geo:lat, geo:long, geo:alt, foaf:homepage,
foaf:depiction, rdf:type) from them, as this information is very rare
in the Chinese DBpedia version.

### Managing Chinese language Vocabularies with the Stanbol Entityhub

Chinese language support is not enabled by default. You will need to add the

* org.apache.stanbol.commons.solr.extras.smartcn
* org.apache.stanbol.commons.solr.extras.paoding

bundles to your Stanbol Launcher. These bundles are extensions to the
Stanbol Commons Solr Core module and allow the use of text fields with
smartcn/paoding analyzers.

There are also two bundlelists for smartcn [9] and paoding [10]. If
you add those to your custom Stanbol launcher configuration (see [11]
for how to do that), then you will have all the modules available you
need to manage and process Chinese texts.

In [9] and [10] there are README.md files that provide details on how
to correctly configure Entityhub ManagedSites and the Entityhub
Indexing Tool for vocabularies with Chinese language literals. So if
you want to index your own datasets you should really read those
README files.

### What's missing & next steps

In terms of NLP processing, the next steps would be to add support for

* Part of Speech (POS) tagging
* Named Entity Recognition

I know that Harish Suvarna was working on an Engine based on FudanNLP
[12], so maybe this would be an option. However, as this framework is
LGPL licensed we could not include it directly in Stanbol.

Regarding next steps there is very little I can do as I do not speak
Chinese. I can not even evaluate the quality of the results of the
current state …

So it would be really great if someone who does speak this language
would be interested in taking over and further improving the current
state. I will try my best to support further developments.

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-855
[2] https://issues.apache.org/jira/browse/STANBOL-875
[3] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/langdetectengine
[4] http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/analysis/cn/smart/SentenceTokenizer.html
[5] http://lucene.apache.org/core/3_6_1/api/all/org/apache/lucene/analysis/cn/smart/WordTokenFilter.html
[6] https://code.google.com/p/paoding/
[7] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/entityhublinking
[8] http://stanbol.staging.apache.org/docs/trunk/components/enhancer/engines/entitylinking#labeltokenizer
[9] http://svn.apache.org/repos/asf/stanbol/trunk/launchers/bundlelists/language-extras/smartcn/
[10] http://svn.apache.org/repos/asf/stanbol/trunk/launchers/bundlelists/language-extras/paoding/
[11] http://stanbol.apache.org/production/your-launcher.html
[12] http://code.google.com/p/fudannlp/
--
| Rupert Westenthaler rupert.westenthaler@gmail.com
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen