Posted to dev@stanbol.apache.org by Rupert Westenthaler <ru...@gmail.com> on 2012/07/27 06:46:53 UTC

Re: Stanbol Chinese

Hi harish,

Note: Sorry I forgot to include the stanbol-dev mailing list in my last answer.


On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <hs...@gmail.com> wrote:
> Thanks a lot Rupert.
>
> I am weighing between options 2 and 3. What is the difference? Option 2
> sounds like enhancing the KeyWordLinkingEngine to deal with chinese
> text, as if paoding were hardcoded into the KeyWordLinkingEngine.
> Option 3 is like a separate engine.

Option (2) will require some improvements on the Stanbol side.
However, there have already been discussions on how to create a "text
processing chain" that allows splitting up things like tokenizing, POS
tagging, lemmatizing ... into different Enhancement Engines without
suffering from the disadvantages of creating high amounts of RDF triples.
One idea was to base this on the Apache Lucene TokenStream [1] API and
share the data as a ContentPart [2] of the ContentItem.

Option (3) indeed means that you will create your own
EnhancementEngine - a similar one to the KeywordLinkingEngine.

>  But will I be able to use the stanbol dbpedia lookup using option 3?

Yes. You only need to obtain an Entityhub "ReferencedSite" and use the
"FieldQuery" interface to search for entities (see [3] for an example).

best
Rupert

[1] http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
[2] http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
[3] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java


>
> Btw, I created my own enhancement engine chains and I could see them
> yesterday in localhost:8080. But today all of them have vanished and only
> the default chain shows up. Can I dig them up somewhere in the stanbol
> directory?
>
> -harish
>
> I just created the eclipse project
> On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
> <ru...@gmail.com> wrote:
>>
>> Hi,
>>
>> There are no NER (Named Entity Recognition) models for Chinese text
>> available via OpenNLP. So the default configuration of Stanbol will
>> not process Chinese text. What you can do is to configure a
>> KeywordLinking Engine for Chinese text, as this engine can also
>> process text in unknown languages (see [1] for details).
>>
>> However, the KeywordLinking Engine also requires at least a tokenizer
>> for looking up words. As there is no OpenNLP tokenizer specific to
>> Chinese text, it will use the default one, which uses a fixed set of
>> chars to split words (white spaces, hyphens ...). You may know better
>> how well this would work with Chinese texts. My assumption would be
>> that it is not sufficient - so results will be sub-optimal.
>>
>> To apply Chinese optimization I see three possibilities:
>>
>> 1. add support for Chinese to OpenNLP (Tokenizer, Sentence detection,
>> POS tagging, Named Entity Detection)
>> 2. allow the KeywordLinkingEngine to use other already available tools
>> for text processing (e.g. stuff that is already available for
>> Solr/Lucene [2], or the paoding chinese segmentor referenced in your
>> mail). Currently the KeywordLinkingEngine is hardwired to OpenNLP,
>> because representing tokens, POS tags ... as RDF would be too much of
>> an overhead.
>> 3. implement a new EnhancementEngine for processing Chinese text.
>>
>> Hope this helps to get you started.
>>
>> best
>> Rupert
>>
>> [1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
>> [2]
>> http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
>>
>> On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <hs...@gmail.com>
>> wrote:
>> > Hi Rupert,
>> > Finally I am getting some time to work on Stanbol. My job is to
>> > demonstrate
>> > Stanbol annotations for Chinese text.
>> > I am just starting on it. I am following the instructions to build an
>> > enhancement engine from Anuj's blog. dbpedia has some chinese data dump
>> > too.
>> > We may have to depend on the ngrams as keys and look them up in the
>> > dbpedia
>> > labels.
>> >
>> > I am planning to use the paoding chinese segmentor
>> > (http://code.google.com/p/paoding/) for word breaking.
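>> >
>> > As an illustration, word breaking through paoding's Lucene analyzer
>> > could look roughly like this (a sketch assuming paoding's
>> > PaodingAnalyzer class and the Lucene 3.x TokenStream API):
>> >
>> >   import java.io.StringReader;
>> >   import net.paoding.analysis.analyzer.PaodingAnalyzer;
>> >   import org.apache.lucene.analysis.Analyzer;
>> >   import org.apache.lucene.analysis.TokenStream;
>> >   import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>> >
>> >   Analyzer analyzer = new PaodingAnalyzer();
>> >   TokenStream ts = analyzer.tokenStream("text",
>> >           new StringReader("我爱北京天安门"));
>> >   CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
>> >   ts.reset();
>> >   while (ts.incrementToken()) {
>> >       System.out.println(term.toString()); // one segmented word per token
>> >   }
>> >   ts.close();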
>> >
>> > Just curious: I pasted some chinese text into the default engine of
>> > stanbol. It kind of finished the processing in no time at all. This
>> > gave me the suspicion that maybe, if the language is chinese, no
>> > further processing is done. Is that right? Any more tips for making
>> > all this work in Stanbol?
>> >
>> > -harish
>>
>>
>>
>> --
>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>
>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Stanbol Chinese

Posted by Rupert Westenthaler <ru...@gmail.com>.
Forgot to include

> As an intermediate solution you can use an embedded maven repository
> (basically this duplicates the maven repository file structure within
> your project).
>

see http://svn.apache.org/repos/asf/incubator/stanbol/trunk/contrib/reasoners/hermit/
as an example.

In the pom.xml:

  <repositories>
    <repository>
      <id>reasoners-hermit-embedded</id>
      <url>file://localhost/${project.basedir}/src/main/resources/maven/repo</url>
      <releases>
        <updatePolicy>always</updatePolicy>
      </releases>
      <snapshots>
        <updatePolicy>always</updatePolicy>
      </snapshots>
    </repository>
  </repositories>

and the files are located at

http://svn.apache.org/repos/asf/incubator/stanbol/trunk/contrib/reasoners/hermit/src/main/resources/maven/

As mentioned, this can only be an intermediate solution. The "correct"
way to deal with dependencies that are not available in Maven Central
is to provide a separate download that includes the jar file and a
shell script/bat that installs the dependency into the local repository.
However, this is rather inconvenient for developers, as their builds
will fail until they download this file and run the script.
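Such an install script is typically a single call to the Maven install
plugin, e.g. (the paoding coordinates below are made-up placeholders):

  mvn install:install-file -Dfile=paoding-analysis.jar \
      -DgroupId=net.paoding -DartifactId=paoding-analysis \
      -Dversion=2.0.4 -Dpackaging=jar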

best
Rupert

-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Stanbol Chinese

Posted by Rupert Westenthaler <ru...@gmail.com>.
> Harish>>> I have the paoding.jar file. I checked in repo1.maven.org, it has
> com.54chen (groupid), paoding-rose (artifact id). I am not sure about this.

I check whether dependencies are available via Maven Central by

1. going to search.maven.org
2. searching for the artifact id (paoding-rose in your case)
3. opening the pom.xml file for the correct dependency
4. looking at the "packaging" (in your case <packaging>jar</packaging>)
5. looking at the dependencies (in that case a long list; I wonder if
everything is really needed).

> The packages I use from paoding are something like
> net.paoding.analysis.analyzer.*. So here I am with a jar file which is
> not in the central repository.

Does

<dependency>
    <groupId>com.54chen</groupId>
    <artifactId>paoding-rose</artifactId>
    <version>1.0</version>
</dependency>


include the functionality you need, or not? If yes, you can try to
exclude all packages (of the jar and its dependencies) that you do not
need, as sketched below.
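A sketch of how that could look in the maven-bundle-plugin configuration
(the Import-Package pattern is only an example):

  <plugin>
    <groupId>org.apache.felix</groupId>
    <artifactId>maven-bundle-plugin</artifactId>
    <configuration>
      <instructions>
        <!-- embed the jar inside the engine bundle -->
        <Embed-Dependency>paoding-rose</Embed-Dependency>
        <!-- skip imports only needed by code paths you never call -->
        <Import-Package>!com.google.inject,*</Import-Package>
      </instructions>
    </configuration>
  </plugin>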

> For now, I manually created in my local .m2
> repository the path net.paoding....then I do Embed-dependency in pom.xml
> for this jar file. For now it seems to work. But for the long run, we
> need a central repository entry. In its absence, what is the process to
> integrate jar files?
>

As an intermediate solution you can use an embedded maven repository
(basically this duplicates the maven repository file structure within
your project).

The steps required to add a dependency to Maven Central are described
at [1], section "Publishing your artifacts to the Central Repository >
Other Projects".

best
Rupert


[1] http://maven.apache.org/guides/mini/guide-central-repository-upload.html


>>
>> best
>> Rupert
>>
>> --
>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Stanbol Chinese

Posted by harish suvarna <hs...@gmail.com>.
On Wed, Aug 1, 2012 at 9:16 PM, Rupert Westenthaler <
rupert.westenthaler@gmail.com> wrote:

> Hi
>
> On Thu, Aug 2, 2012 at 3:04 AM, harish suvarna <hs...@gmail.com> wrote:
> > Removing the stanbol folder in the trunk helped me. This stanbol folder
> is
> > created by the build process.
>
> Generally: The first time you start Stanbol (and the /stanbol folder
> is created), the jar files (bundles), configurations ... are copied
> from the runnable jar. On any further start the copied information is
> used. Therefore replacing the launcher jar will not have any effect!
>
> To update a component (e.g. an Engine) you need to "install/update"
> the corresponding bundle(s). This can be done in several different
> ways (e.g. by going to
> "http://{stanbol}/system/console/bundles" and using the
> '[install/update...]' button).
>
> Harish>>>> I uninstalled the bundle in the console, then went to the
enhancer/engines/langdetect folder and did a 'mvn clean install'. Then I
came back to the browser and did the install manually. Somehow this cycle
did not help me.
For the sake of learning, I renamed the index folder, bundles folder etc.
and started Stanbol again, but was surprised that none of these folders
were created again. Now I understand.

> You can also use the Sling Maven Plugin [1] or configure the Sling
> File Provider [2]. As there is no documentation, here are the steps
> needed to set up the Sling File Provider:
>
> 0) Install the "Sling File Provider" Bundle (not needed, as it is
> included by default)
> 1) Configure the "sling.fileinstall.dir" property: You can add this to
> the "{stanbol-working-dir}/stanbol/sling.properties" file or pass it
> as a system property '-Dsling.fileinstall.dir={path-to-dir}' when you
> start stanbol.
> 2) Create the referenced folder
>
> Harish>>> Will experiment with it. Thanks.

> After that the "Sling File Provider" will automatically
> install/update/delete bundles and configurations added/updated/deleted
> in that folder.
>
> [1] http://sling.apache.org/site/sling.html
> [2]
> http://svn.apache.org/repos/asf/sling/tags/org.apache.sling.installer.provider.file-1.0.2
>
> > Thanks a lot for the help.
> >
> > In general, if we have a custom jar file, how do we integrate it into
> > stanbol?
> > Does stanbol allow this?
>
> You can install OSGI bundles by using the Felix Web Console (as
> described above). Jar files that are not bundles cannot be added to
> Stanbol.
>
Harish>>> I have the paoding.jar file. I checked in repo1.maven.org; it has
com.54chen (groupid), paoding-rose (artifact id). I am not sure about this.
The packages I use from paoding are something like
net.paoding.analysis.analyzer.*. So here I am with a jar file which is not
in the central repository. For now, I manually created in my local .m2
repository the path net.paoding....then I do Embed-dependency in pom.xml
for this jar file. For now it seems to work. But for the long run, we need
a central repository entry. In its absence, what is the process to
integrate jar files?

>
> best
> Rupert
>
> --
> | Rupert Westenthaler             rupert.westenthaler@gmail.com
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>

Re: Stanbol Chinese

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi

On Thu, Aug 2, 2012 at 3:04 AM, harish suvarna <hs...@gmail.com> wrote:
> Removing the stanbol folder in the trunk helped me. This stanbol folder is
> created by the build process.

Generally: The first time you start Stanbol (and the /stanbol folder
is created), the jar files (bundles), configurations ... are copied
from the runnable jar. On any further start the copied information is
used. Therefore replacing the launcher jar will not have any effect!

To update a component (e.g. an Engine) you need to "install/update"
the corresponding bundle(s). This can be done in several different
ways, e.g. by going to
"http://{stanbol}/system/console/bundles" and using the
'[install/update...]' button, as in the example below.
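The console also accepts scripted uploads, e.g. (assuming the default
admin credentials and port; the jar name is a placeholder):

  curl -u admin:admin -F action=install -F bundlestart=start \
      -F bundlefile=@my-engine.jar \
      "http://localhost:8080/system/console/bundles"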

You can also use the Sling Maven Plugin [1] or configure the Sling
File Provider [2]. As there is no documentation, here are the steps
needed to set up the Sling File Provider:

0) Install the "Sling File Provider" Bundle (not needed, as it is
included by default)
1) Configure the "sling.fileinstall.dir" property: You can add this to
the "{stanbol-working-dir}/stanbol/sling.properties" file or pass it
as a system property '-Dsling.fileinstall.dir={path-to-dir}' when you
start stanbol.
2) Create the referenced folder

After that the "Sling File Provider" will automatically
install/update/delete bundles and configurations added/updated/deleted
in that folder.
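For example, starting the launcher with (the launcher jar name is a
placeholder)

  java -Dsling.fileinstall.dir=/path/to/deploy -jar stanbol-launcher.jar

will make Stanbol install any bundle jar copied to /path/to/deploy and
update it whenever the file changes.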

[1] http://sling.apache.org/site/sling.html
[2] http://svn.apache.org/repos/asf/sling/tags/org.apache.sling.installer.provider.file-1.0.2

> Thanks a lot for the help.
>
> In general, if we have a custom jar file how do we integrate into stanbol?
> Does stanbol allow this?

You can install OSGI bundles by using the Felix Web Console (as
described above). Jar files that are not bundles cannot be added to
Stanbol.

best
Rupert

-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Stanbol Chinese

Posted by Walter Kasper <ka...@dfki.de>.
Hi Harish,

Thanks for the evaluation table. What do the numeric columns correspond 
to? Different detection engines? Which ones?

Best regards,

Walter


-- 
Dr. Walter Kasper
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Tel.:  +49-681-85775-5300
Fax:   +49-681-85775-5338
Email: kasper@dfki.de
-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------


Re: Stanbol Chinese

Posted by Walter Kasper <ka...@dfki.de>.
Hi Harish,

I found the meaning of the numeric columns in your evaluation table. I 
had been confused by the apparently empty header fields. Thanks again.

Best regards,

Walter

-- 
Dr. Walter Kasper
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Tel.:  +49-681-85775-5300
Fax:   +49-681-85775-5338
Email: kasper@dfki.de
-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------


Re: Stanbol Chinese

Posted by harish suvarna <hs...@gmail.com>.
Removing the stanbol folder in the trunk helped me. This stanbol folder is
created by the build process.
Thanks a lot for the help.

In general, if we have a custom jar file how do we integrate into stanbol?
Does stanbol allow this?

I am attaching the language identification evaluation document we have
done at Adobe. Hope it helps you.

-harish

On Wed, Aug 1, 2012 at 11:49 AM, harish suvarna <hs...@gmail.com> wrote:

> I removed ~/stanbol folder. It is not helping. Let me clear the
> trunk/stanbol folder and see what happens. I suspect some cache
> clearance problem.
>
> -harish
>
>
> On Wed, Aug 1, 2012 at 10:48 AM, Walter Kasper <ka...@dfki.de> wrote:
>
>> harish suvarna wrote:
>>
>>> I did 'mvn clean install'.
>>> Which stanbol folder is this?
>>>
>>> $HOME/stanbol where it stores some user/config prefs, or trunk/stanbol?
>>> You mean remove the entire folder?
>>>
>>
>> I guess it is $HOME/stanbol where the runtime config data are stored. I
>> usually clear the complete folder for a clean restart.
>>
>>
>>> I restarted the machine and am doing another mvn clean install now. I
>>> will post again in another 30 mins.
>>>
>>> -harish
>>>
>>> On Wed, Aug 1, 2012 at 10:36 AM, Walter Kasper <ka...@dfki.de> wrote:
>>>
>>>> Hi again,
>>>>
>>>> It came to my mind that you should also clear the 'stanbol' folder of
>>>> the Stanbol runtime system and restart the system. The folder might
>>>> contain old bundle configuration data that don't get updated
>>>> automatically.
>>>>
>>>>
>>>> Best regards,
>>>>
>>>> Walter
>>>>
>>>> harish suvarna wrote:
>>>>
>>>>> Did a fresh build, and inside Stanbol at localhost:8080 it is
>>>>> installed but not activated. I still see the com.google.inject errors.
>>>>> I do see the pom.xml update from you.
>>>>>
>>>>> -harish
>>>>>
>>>>> On Wed, Aug 1, 2012 at 12:55 AM, Walter Kasper <ka...@dfki.de> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The OSGI bundle declared some package imports that are indeed usually
>>>>>> neither available nor required. I fixed that. Just check out the
>>>>>> corrected pom.xml.
>>>>>> On a fresh clean Stanbol installation langdetect worked fine for me.
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Walter
>>>>>>
>>>>>> harish suvarna wrote:
>>>>>>
>>>>>>> Thanks Dr Walter. langdetect is very useful. I could successfully
>>>>>>> compile it, but I am unable to load it into stanbol as I get the
>>>>>>> error:
>>>>>>> ======
>>>>>>> ERROR: Bundle org.apache.stanbol.enhancer.engines.langdetect [177]:
>>>>>>> Error starting/stopping bundle.
>>>>>>> (org.osgi.framework.BundleException: Unresolved constraint in bundle
>>>>>>> org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to
>>>>>>> resolve 177.0: missing requirement [177.0] package;
>>>>>>> (package=com.google.inject))
>>>>>>> org.osgi.framework.BundleException: Unresolved constraint in bundle
>>>>>>> org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to
>>>>>>> resolve 177.0: missing requirement [177.0] package;
>>>>>>> (package=com.google.inject)
>>>>>>>     at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>>>>>>>     at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>>>>>>>     at org.apache.felix.framework.Felix.setBundleStartLevel(Felix.java:1333)
>>>>>>>     at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:270)
>>>>>>>     at java.lang.Thread.run(Thread.java:680)
>>>>>>> ==============
>>>>>>>
>>>>>>> I added the dependency
>>>>>>>
>>>>>>> <dependency>
>>>>>>>   <groupId>com.google.inject</groupId>
>>>>>>>   <artifactId>guice</artifactId>
>>>>>>>   <version>3.0</version>
>>>>>>> </dependency>
>>>>>>>
>>>>>>> but it looks like it is looking for version 1.3.0, which I can't find
>>>>>>> in repo1.maven.org. I am not sure who is needing the inject library.
>>>>>>> The entire source of the langdetect plugin does not contain the word
>>>>>>> inject. Only the manifest file in target/classes has it listed.
>>>>>>>
>>>>>>>
>>>>>>> -harish
>>>>>>>
>>>>>>> On Tue, Jul 31, 2012 at 1:32 AM, Walter Kasper <ka...@dfki.de>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Harish,
>>>>>>>>
>>>>>>>> I checked in a new language identifier for Stanbol based on
>>>>>>>> http://code.google.com/p/language-detection/ .
>>>>>>>> Just check out from Stanbol trunk, install and try out.
>>>>>>>>
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Walter
>>>>>>>>
>>>>>>>> harish suvarna wrote:
>>>>>>>>
>>>>>>>>    Rupert,
>>>>>>>>
>>>>>>>>> My initial debugging for Chinese text told me that the language
>>>>>>>>> identification done by the langid enhancer using apache tika does
>>>>>>>>> not recognize chinese. The tika language detection does not seem to
>>>>>>>>> support the CJK languages. As a result, the chinese language is
>>>>>>>>> identified as the lithuanian language 'lt'. The apache tika group
>>>>>>>>> has an enhancement item, TIKA-856, registered for detecting cjk
>>>>>>>>> languages
>>>>>>>>> https://issues.apache.org/jira/browse/TIKA-856
>>>>>>>>>
>>>>>>>>> in Feb 2012. I am not sure about the use of language identification
>>>>>>>>> in stanbol yet. Is the language id used to select the dbpedia index
>>>>>>>>> (appropriate dbpedia language dump) for entity lookups?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am just thinking that, for my purpose, I could pick option 3,
>>>>>>>>> make sure the text is in the language of interest, and then call
>>>>>>>>> the paoding segmenter. Then iterate over the ngrams and do an
>>>>>>>>> entityhub lookup. I still need to understand the code around how
>>>>>>>>> the whole entity lookup for dbpedia works.
>>>>>>>>>
>>>>>>>>> I find that the language detection library
>>>>>>>>> http://code.google.com/p/language-detection/ is
>>>>>>>>> very good at language detection. It supports 53 languages out of
>>>>>>>>> the box and the quality seems good.
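>>>>>>>>> A minimal usage sketch of the library's Detector API (error
>>>>>>>>> handling omitted; the profile path is a placeholder for the
>>>>>>>>> profiles directory shipped with the library):
>>>>>>>>>
>>>>>>>>>   import com.cybozu.labs.langdetect.Detector;
>>>>>>>>>   import com.cybozu.labs.langdetect.DetectorFactory;
>>>>>>>>>
>>>>>>>>>   // load the shipped language profiles once at startup
>>>>>>>>>   DetectorFactory.loadProfile("/path/to/profiles");
>>>>>>>>>   Detector detector = DetectorFactory.create();
>>>>>>>>>   detector.append("今天天气很好");
>>>>>>>>>   String lang = detector.detect();  // e.g. "zh-cn"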
>>>>>>>>> It is Apache 2.0 licensed. I could volunteer to create a new langid
>>>>>>>>> engine based on this, with the stanbol community's approval. So if
>>>>>>>>> anyone sheds some light on how to add a new java library into
>>>>>>>>> stanbol, that would be great. I am a maven beginner for now.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> harish

Re: Stanbol Chinese

Posted by harish suvarna <hs...@gmail.com>.
I removed ~/stanbol folder. It is not helping. Let me clear the
trunk/stanbol folder and see what happens. I suspect some cache clearnace
problem.

-harish

On Wed, Aug 1, 2012 at 10:48 AM, Walter Kasper <ka...@dfki.de> wrote:

> harish suvarna wrote:
>
>> I did ' mvn clean install'.
>> Which stanbol folder is this ?
>>
>> $HOME/stanbol where it sores some user/config prefs or trunk/stanbol? You
>> mean remove the enitre folder?
>>
>
> I guess it is $HOME/stanbol where the runtime config data are stored. I
> usually clear the complete folder for a clean restart.
>
>
>> I restarted the machine and doing another mvn clean install now. I will
>> post you in another 30 mins.
>>
>> -harish
>>
>> On Wed, Aug 1, 2012 at 10:36 AM, Walter Kasper <ka...@dfki.de> wrote:
>>
>>  Hi again,
>>>
>>> It came to my mind that you should also clear the 'stanbol' folder of the
>>> Stanbol runtime system and restart the sysem.  The folder might contain
>>> old
>>> bundle configuration data that don't get updated automatically.
>>>
>>>
>>> Best regards,
>>>
>>> Walter
>>>
>>> harish suvarna wrote:
>>>
>>>  Did a fresh build and inside Stanbol in localhost:8080, it is installed
>>>> but
>>>> is not activated. I still see the com.google.inject errors.
>>>> I do see the pom.xml update from you.
>>>>
>>>> -harish
>>>>
>>>> On Wed, Aug 1, 2012 at 12:55 AM, Walter Kasper <ka...@dfki.de> wrote:
>>>>
>>>>   Hi,
>>>>
>>>>> The OSGI bundlöe declared some package imports that usually indeed are
>>>>> not
>>>>> available nor required. I fixed that. Just check out the corrected
>>>>> pom.xml.
>>>>> On a fresh clean Stanbol installation langdetect worked fine for me.
>>>>>
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Walter
>>>>>
>>>>> harish suvarna wrote:
>>>>>
>>>>>   Thanks Dr Walter. langdetect is very useful. I could successfully
>>>>>
>>>>>> compile
>>>>>> it but unable to load into stanbol as I get th error
>>>>>> ======
>>>>>> ERROR: Bundle org.apache.stanbol.enhancer.******engines.langdetect
>>>>>> [177]:
>>>>>> Error
>>>>>> starting/stopping bundle. (org.osgi.framework.******BundleException:
>>>>>> Unresolved
>>>>>> constraint in bundle org.apache.stanbol.enhancer.****
>>>>>> **engines.langdetect
>>>>>>
>>>>>> [177]:
>>>>>> Unable to resolve 177.0: missing requirement [177.0] package;
>>>>>> (package=com.google.inject))
>>>>>> org.osgi.framework.******BundleException: Unresolved constraint in
>>>>>> bundle
>>>>>> org.apache.stanbol.enhancer.******engines.langdetect [177]: Unable to
>>>>>>
>>>>>> resolve
>>>>>>
>>>>>> 177.0: missing requirement [177.0] package;
>>>>>> (package=com.google.inject)
>>>>>>        at org.apache.felix.framework.*****
>>>>>> *Felix.resolveBundle(Felix.**
>>>>>> java:3443)
>>>>>>        at org.apache.felix.framework.*****
>>>>>> *Felix.startBundle(Felix.java:****
>>>>>> **1727)
>>>>>>        at org.apache.felix.framework.*****
>>>>>> *Felix.setBundleStartLevel(**
>>>>>> Felix.java:1333)
>>>>>>        at
>>>>>> org.apache.felix.framework.******StartLevelImpl.run(**
>>>>>> StartLevelImpl.java:270)
>>>>>>        at java.lang.Thread.run(Thread.******java:680)
>>>>>>
>>>>>>
>>>>>> ==============
>>>>>>
>>>>>> I added the dependency
>>>>>> <dependency>
>>>>>>          <groupId>com.google.inject</******groupId>
>>>>>>
>>>>>>
>>>>>>          <artifactId>guice</artifactId>
>>>>>>          <version>3.0</version>
>>>>>>        </dependency>
>>>>>>
>>>>>> but looks like it is looking for version 1.3.0, which I can't find in
>>>>>> repo1.maven.org. I am not sure who is needing the inject library. The
>>>>>> entire source of langdetect plugin does not contain the word inject.
>>>>>> Only
>>>>>> the manifest file in target/classes has this listed.
>>>>>>
>>>>>>
>>>>>> -harish
>>>>>>
>>>>>> On Tue, Jul 31, 2012 at 1:32 AM, Walter Kasper <ka...@dfki.de>
>>>>>> wrote:
>>>>>>
>>>>>>    Hi Harish,
>>>>>>
>>>>>>  I checked in a new language identifier for Stanbol based on
>>>>>>> http://code.google.com/p/********language-detection/<http://code.google.com/p/******language-detection/>
>>>>>>> <http://**code.google.com/p/******language-detection/<http://code.google.com/p/****language-detection/>
>>>>>>> >
>>>>>>> <http://**code.google.com/p/****language-**detection/<http://code.google.com/p/**language-**detection/>
>>>>>>> <http://**code.google.com/p/**language-**detection/<http://code.google.com/p/**language-detection/>
>>>>>>> >
>>>>>>> <http://**code.google.com/p/****language-**detection/<http://code.google.com/p/**language-**detection/>
>>>>>>> <http://**code.google.com/p/language-****detection/<http://code.google.com/p/language-**detection/>
>>>>>>> >
>>>>>>>
>>>>>>> <http://**code.google.com/p/**language-**detection/<http://code.google.com/p/language-**detection/>
>>>>>>> <http://**code.google.com/p/language-**detection/<http://code.google.com/p/language-detection/>
>>>>>>> >
>>>>>>>   .
>>>>>>> Just check out from Stanbol trunk, install and try out.
>>>>>>>
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Walter
>>>>>>>
>>>>>>> harish suvarna wrote:
>>>>>>>
>>>>>>>    Rupert,
>>>>>>>
>>>>>>>  My initial debugging for Chinese text told me that the language
>>>>>>>> identification done by langid enhancer using apache tika does not
>>>>>>>> recognize
>>>>>>>> chinese. The tika language detection seems is not supporting the CJK
>>>>>>>> languages. With the result, the chinese language is identified as
>>>>>>>> lithuanian language 'lt' . The apache tika group has an enhancement
>>>>>>>> item
>>>>>>>> 856 registered for detecting cjk languages
>>>>>>>>      https://issues.apache.org/********jira/browse/TIKA-856<https://issues.apache.org/******jira/browse/TIKA-856>
>>>>>>>> <https:/**/issues.apache.org/****jira/**browse/TIKA-856<https://issues.apache.org/****jira/browse/TIKA-856>
>>>>>>>> >
>>>>>>>> <https://**issues.apache.org/****jira/**browse/TIKA-856<http://issues.apache.org/**jira/**browse/TIKA-856>
>>>>>>>> <https:**//issues.apache.org/**jira/**browse/TIKA-856<https://issues.apache.org/**jira/browse/TIKA-856>
>>>>>>>> >
>>>>>>>> <https://**issues.apache.org/****jira/browse/**TIKA-856<http://issues.apache.org/**jira/browse/**TIKA-856>
>>>>>>>> <http:/**/issues.apache.org/jira/**browse/**TIKA-856<http://issues.apache.org/jira/browse/**TIKA-856>
>>>>>>>> >
>>>>>>>> <https:/**/issues.apache.org/**jira/**browse/TIKA-856<http://issues.apache.org/jira/**browse/TIKA-856>
>>>>>>>> <https:/**/issues.apache.org/jira/**browse/TIKA-856<https://issues.apache.org/jira/browse/TIKA-856>
>>>>>>>> >
>>>>>>>>
>>>>>>>>      in Feb 2012. I am not sure about the use of language
>>>>>>>> identification
>>>>>>>> in
>>>>>>>> stanbol yet. Is the language id used to select the dbpedia  index
>>>>>>>> (approprite dbpedia language dump) for entity lookups?
>>>>>>>>
>>>>>>>>
>>>>>>>> I am just thinking that, for my purpose, pick option 3 and make sure
>>>>>>>> that
>>>>>>>> it is of my language of my interest and then call paoding segmenter.
>>>>>>>> Then
>>>>>>>> iterate over the ngrams and do an entityhub lookup. I just still
>>>>>>>> need
>>>>>>>> to
>>>>>>>> understand the code around how the whole entity lookup for dbpedia
>>>>>>>> works.
>>>>>>>>
>>>>>>>> I find that the language detection library
>>>>>>>> http://code.google.com/p/********language-detection/<http://code.google.com/p/******language-detection/>
>>>>>>>> <http://**code.google.com/p/******language-detection/<http://code.google.com/p/****language-detection/>
>>>>>>>> >
>>>>>>>> <http://**code.google.com/p/****language-**detection/<http://code.google.com/p/**language-**detection/>
>>>>>>>> <http://**code.google.com/p/**language-**detection/<http://code.google.com/p/**language-detection/>
>>>>>>>> >
>>>>>>>> <http://**code.google.com/p/****language-**detection/<http://code.google.com/p/**language-**detection/>
>>>>>>>> <http://**code.google.com/p/language-****detection/<http://code.google.com/p/language-**detection/>
>>>>>>>> >
>>>>>>>>
>>>>>>>> <http://**code.google.com/p/**language-**detection/<http://code.google.com/p/language-**detection/>
>>>>>>>> <http://**code.google.com/p/language-**detection/<http://code.google.com/p/language-detection/>
>>>>>>>> >
>>>>>>>>
>>>>>>>>> is
>>>>>>>>>>
>>>>>>>>> very good at language
>>>>>>>>
>>>>>>>> detection. It supports 53 languages out of box and the quality seems
>>>>>>>> good.
>>>>>>>> It is apache 2.0 license. I could volunteer to create a new langid
>>>>>>>> engine
>>>>>>>> based on this with the stanbol community approval. So if anyone
>>>>>>>> sheds
>>>>>>>> some
>>>>>>>> light on how to add a new java library into stanbol, that be great.
>>>>>>>> I
>>>>>>>> am a
>>>>>>>> maven beginner now.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> harish
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler <
>>>>>>>> rupert.westenthaler@gmail.com> wrote:
>>>>>>>>
>>>>>>>>     Hi harish,
>>>>>>>>
>>>>>>>>   Note: Sorry I forgot to include the stanbol-dev mailing list in my
>>>>>>>>
>>>>>>>>> last
>>>>>>>>> answer.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <
>>>>>>>>> hsuvarna@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>    Thanks a lot Rupert.
>>>>>>>>>
>>>>>>>>>  I am weighing between options 2 and 3. What is the difference?
>>>>>>>>>> Optiion 2
>>>>>>>>>> sounds like enhancing KeyWordLinkingEngine to deal with chinese
>>>>>>>>>> text.
>>>>>>>>>> It
>>>>>>>>>>
>>>>>>>>>>    may
>>>>>>>>>>
>>>>>>>>>>     be like paoding is hardcoded into KeyWordLinkingEngine.
>>>>>>>>> Option 3 is
>>>>>>>>>
>>>>>>>>>  like
>>>>>>>>>>
>>>>>>>>>>    a
>>>>>>>>>>
>>>>>>>>>>     separate engine.
>>>>>>>>>
>>>>>>>>>     Option (2) will require some work improvements on the Stanbol
>>>>>>>>>> side.
>>>>>>>>>>
>>>>>>>>>>  However there where already discussion on how to create a "text
>>>>>>>>> processing chain" that allows to split up things like tokenizing,
>>>>>>>>> POS
>>>>>>>>> tagging, Lemmatizing ... in different Enhancement Engines without
>>>>>>>>> suffering form disadvantages of creating high amounts of RDF
>>>>>>>>> triples.
>>>>>>>>> One Idea was to base this on the Apache Lucene TokenStream [1] API
>>>>>>>>> and
>>>>>>>>> share the data as ContentPart [2] of the ContentItem.
>>>>>>>>>
>>>>>>>>> Option (3) indeed means that you will create your own
>>>>>>>>> EnhancementEngine - a similar one to the KeywordLinkingEngine.
>>>>>>>>>
>>>>>>>>>       But will I be able to use the stanbol dbpedia lookup using
>>>>>>>>> option
>>>>>>>>> 3?
>>>>>>>>> Yes. You need only to obtain a Entityhub "ReferencedSite" and use
>>>>>>>>> the
>>>>>>>>> "FieldQuery" interface to search for Entities (see [1] for an
>>>>>>>>> example)
>>>>>>>>>
>>>>>>>>> best
>>>>>>>>> Rupert
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> http://blog.mikemccandless.********com/2012/04/lucenes-**
>>>>>>>>> tokenstreams-are-actually.******html<http://blog.**
>>>>>>>>> mikemccandless.com/2012/04/******lucenes-tokenstreams-are-****<http://mikemccandless.com/2012/04/****lucenes-tokenstreams-are-****>
>>>>>>>>> actually.html<http://**mikemccandless.com/2012/04/****
>>>>>>>>> lucenes-tokenstreams-are-****actually.html<http://mikemccandless.com/2012/04/**lucenes-tokenstreams-are-**actually.html>
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>>>>>>>> [2]
>>>>>>>>> http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
>>>>>>>>> [3]
>>>>>>>>> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java
>>>>>>>>>
>>>>>>>>>     Btw, I created my own enhancement engine chains and I could see
>>>>>>>>> them
>>>>>>>>>
>>>>>>>>>   yesterday in localhost:8080. But today all of them have vanished
>>>>>>>>> and
>>>>>>>>>
>>>>>>>>>> only
>>>>>>>>>> the default chain shows up. Can I dig them up somewhere in the
>>>>>>>>>> stanbol
>>>>>>>>>> directory?
>>>>>>>>>>
>>>>>>>>>> -harish
>>>>>>>>>>
>>>>>>>>>> I just created the eclipse project
>>>>>>>>>> On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
>>>>>>>>>> <rupert.westenthaler@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>    Hi,
>>>>>>>>>>
>>>>>>>>>>  There are no NER (Named Entity Recognition) models for Chinese
>>>>>>>>>>> text
>>>>>>>>>>> available via OpenNLP. So the default configuration of Stanbol
>>>>>>>>>>> will
>>>>>>>>>>> not process Chinese text. What you can do is to configure a
>>>>>>>>>>> KeywordLinking Engine for Chinese text as this engine can also
>>>>>>>>>>> process
>>>>>>>>>>> in unknown languages (see [1] for details).
>>>>>>>>>>>
>>>>>>>>>>> However the KeywordLinking Engine also requires at least a
>>>>>>>>>>> tokenizer
>>>>>>>>>>> for looking up words. As there is no OpenNLP tokenizer specific to
>>>>>>>>>>> Chinese text, it will use the default one, which uses a fixed set of
>>>>>>>>>>> chars to split words (white spaces, hyphens ...). You may know
>>>>>>>>>>> better
>>>>>>>>>>> how well this would work with Chinese texts. My assumption would be
>>>>>>>>>>> that
>>>>>>>>>>> it is not sufficient - so results will be sub-optimal.
>>>>>>>>>>>
>>>>>>>>>>> To apply Chinese optimization I see three possibilities:
>>>>>>>>>>>
>>>>>>>>>>> 1. add support for Chinese to OpenNLP (Tokenizer, Sentence
>>>>>>>>>>> detection,
>>>>>>>>>>> POS tagging, Named Entity Detection)
>>>>>>>>>>> 2. allow the KeywordLinkingEngine to use other already available
>>>>>>>>>>> tools
>>>>>>>>>>> for text processing (e.g. stuff that is already available for
>>>>>>>>>>> Solr/Lucene [2] or the paoding chinese segmenter referenced in
>>>>>>>>>>> your
>>>>>>>>>>> mail). Currently the KeywordLinkingEngine is hardwired with
>>>>>>>>>>> OpenNLP,
>>>>>>>>>>> because representing Tokens, POS ... as RDF would be too much of
>>>>>>>>>>> an
>>>>>>>>>>> overhead.
>>>>>>>>>>> 3. implement a new EnhancementEngine for processing Chinese text.
>>>>>>>>>>>
>>>>>>>>>>> Hope this helps to get you started.
>>>>>>>>>>>
>>>>>>>>>>> best
>>>>>>>>>>> Rupert
>>>>>>>>>>>
>>>>>>>>>>> [1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
>>>>>>>>>>> [2]
>>>>>>>>>>> http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
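For anyone starting on option (3) from the mail above, the bare bones of such an engine could look like the following sketch (the engine name and the language check are placeholder assumptions, and the segmentation and lookup steps are left as comments; the exact EnhancementEngine interface should be checked against the Stanbol enhancer servicesapi):

    import org.apache.stanbol.enhancer.servicesapi.ContentItem;
    import org.apache.stanbol.enhancer.servicesapi.EngineException;
    import org.apache.stanbol.enhancer.servicesapi.EnhancementEngine;

    public class ChineseLinkingEngine implements EnhancementEngine {

        public String getName() {
            return "chineseLinking"; // placeholder name
        }

        public int canEnhance(ContentItem ci) throws EngineException {
            // only accept content detected (or configured) as Chinese
            return isChinese(ci) ? ENHANCE_SYNCHRONOUS : CANNOT_ENHANCE;
        }

        public void computeEnhancements(ContentItem ci) throws EngineException {
            // 1. segment the text (e.g. with the paoding segmenter)
            // 2. look up the segments via the Entityhub FieldQuery interface
            // 3. write fise:TextAnnotation/fise:EntityAnnotation triples to the metadata
        }

        private boolean isChinese(ContentItem ci) {
            return true; // placeholder; a real engine would check the detected language
        }
    }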
>>>>>>>>>    On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <
>>>>>>>>> hsuvarna@gmail.com>
>>>>>>>>>
>>>>>>>>>  wrote:
>>>>>>>>>>
>>>>>>>>>>>    Hi Rupert,
>>>>>>>>>>>
>>>>>>>>>>>  Finally I am getting some time to work on Stanbol. My job is to
>>>>>>>>>>>> demonstrate
>>>>>>>>>>>> Stanbol annotations for Chinese text.
>>>>>>>>>>>> I am just starting on it. I am following the instructions to
>>>>>>>>>>>> build
>>>>>>>>>>>> an
>>>>>>>>>>>> enhancement engine from Anuj's blog. dbpedia has some chinese
>>>>>>>>>>>> data
>>>>>>>>>>>>
>>>>>>>>>>>>    dump
>>>>>>>>>>>>
>>>>>>>>>>>>  too.
>>>>>>>>>>>
>>>>>>>>>>   We may have to depend on the ngrams as keys and look them up in
>>>>>>>>>> the
>>>>>>>>>>
>>>>>>>>>>> dbpedia
>>>>>>>>>>>> labels.
>>>>>>>>>>>>
>>>>>>>>>>>> I am planning to use the paoding chinese segmentor
>>>>>>>>>>>> (http://code.google.com/p/paoding/
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>   )
>>>>>>>>>>>> for word breaking.
>>>>>>>>>>>>
>>>>>>>>>>>> Just curious. I pasted some chinese text in default engine of
>>>>>>>>>>>> stanbol.
>>>>>>>>>>>> It
>>>>>>>>>>>> kind of finished the processing in no time at all. This gave me the
>>>>>>>>>>>> suspicion
>>>>>>>>>>>> that maybe, if the language is chinese, no further processing is
>>>>>>>>>>>> done.
>>>>>>>>>>>> Is it
>>>>>>>>>>>> right? Any more tips for making all this work in Stanbol?
>>>>>>>>>>>>
>>>>>>>>>>>> -harish
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>   --
>>>>>>>>>>>>
>>>>>>>>>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>>>>>>>>>> | Bodenlehenstraße 11
>>>>>>>>>>> ++43-699-11108907
>>>>>>>>>>> | A-5500 Bischofshofen
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>     --
>>>>>>>>>>>
>>>>>>>>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>>>>>>>> | Bodenlehenstraße 11
>>>>>>>>> ++43-699-11108907
>>>>>>>>> | A-5500 Bischofshofen
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    --
>>>>>>>>>
>>>>>>>>>  Dr. Walter Kasper
>>>>>>>>
>>>>>>> DFKI GmbH
>>>>>>> Stuhlsatzenhausweg 3
>>>>>>> D-66123 Saarbrücken
>>>>>>> Tel.:  +49-681-85775-5300
>>>>>>> Fax:   +49-681-85775-5338
>>>>>>> Email: kasper@dfki.de
>>>>>>> -------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>>>>>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>>>>>>
>>>>>>> Geschaeftsfuehrung:
>>>>>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>>>>>> Dr. Walter Olthoff
>>>>>>>
>>>>>>> Vorsitzender des Aufsichtsrats:
>>>>>>> Prof. Dr. h.c. Hans A. Aukes
>>>>>>>
>>>>>>> Amtsgericht Kaiserslautern, HRB 2313
>>>>>>> -------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>   --
>>>>>>>
>>>>>> Dr. Walter Kasper
>>>>> DFKI GmbH
>>>>> Stuhlsatzenhausweg 3
>>>>> D-66123 Saarbrücken
>>>>> Tel.:  +49-681-85775-5300
>>>>> Fax:   +49-681-85775-5338
>>>>> Email: kasper@dfki.de
>>>>> -------------------------------------------------------------
>>>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>>>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>>>>
>>>>> Geschaeftsfuehrung:
>>>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>>>> Dr. Walter Olthoff
>>>>>
>>>>> Vorsitzender des Aufsichtsrats:
>>>>> Prof. Dr. h.c. Hans A. Aukes
>>>>>
>>>>> Amtsgericht Kaiserslautern, HRB 2313
>>>>> -------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>>>  --
>>> Dr. Walter Kasper
>>> DFKI GmbH
>>> Stuhlsatzenhausweg 3
>>> D-66123 Saarbrücken
>>> Tel.:  +49-681-85775-5300
>>> Fax:   +49-681-85775-5338
>>> Email: kasper@dfki.de
>>> -------------------------------------------------------------
>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>>
>>> Geschaeftsfuehrung:
>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>> Dr. Walter Olthoff
>>>
>>> Vorsitzender des Aufsichtsrats:
>>> Prof. Dr. h.c. Hans A. Aukes
>>>
>>> Amtsgericht Kaiserslautern, HRB 2313
>>> -------------------------------------------------------------
>>>
>>>
>>>
>
> --
> Dr. Walter Kasper
> DFKI GmbH
> Stuhlsatzenhausweg 3
> D-66123 Saarbrücken
> Tel.:  +49-681-85775-5300
> Fax:   +49-681-85775-5338
> Email: kasper@dfki.de
> -------------------------------------------------------------
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
>
> Vorsitzender des Aufsichtsrats:
> Prof. Dr. h.c. Hans A. Aukes
>
> Amtsgericht Kaiserslautern, HRB 2313
> -------------------------------------------------------------
>
>

Re: Stanbol Chinese

Posted by Walter Kasper <ka...@dfki.de>.
harish suvarna wrote:
> I did 'mvn clean install'.
> Which stanbol folder is this ?
>
> $HOME/stanbol where it stores some user/config prefs or trunk/stanbol? You
> mean remove the entire folder?

I guess it is $HOME/stanbol where the runtime config data are stored. I 
usually clear the complete folder for a clean restart.

>
> I restarted the machine and doing another mvn clean install now. I will
> post you in another 30 mins.
>
> -harish
>
> On Wed, Aug 1, 2012 at 10:36 AM, Walter Kasper <ka...@dfki.de> wrote:
>
>> Hi again,
>>
>> It came to my mind that you should also clear the 'stanbol' folder of the
>> Stanbol runtime system and restart the system. The folder might contain old
>> bundle configuration data that don't get updated automatically.
>>
>>
>> Best regards,
>>
>> Walter
>>
>> harish suvarna wrote:
>>
>>> Did a fresh build and inside Stanbol in localhost:8080, it is installed
>>> but
>>> is not activated. I still see the com.google.inject errors.
>>> I do see the pom.xml update from you.
>>>
>>> -harish
>>>
>>> On Wed, Aug 1, 2012 at 12:55 AM, Walter Kasper <ka...@dfki.de> wrote:
>>>
>>>   Hi,
>>>> The OSGI bundle declared some package imports that usually indeed are
>>>> not
>>>> available nor required. I fixed that. Just check out the corrected
>>>> pom.xml.
>>>> On a fresh clean Stanbol installation langdetect worked fine for me.
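Unresolved-constraint errors like the com.google.inject one quoted below usually mean that the maven-bundle-plugin computed an Import-Package entry for a package that no deployed bundle exports. A typical correction, shown here only as a sketch and not necessarily the exact change made in the corrected pom.xml, is to exclude the package in the bundle instructions:

    <plugin>
      <groupId>org.apache.felix</groupId>
      <artifactId>maven-bundle-plugin</artifactId>
      <extensions>true</extensions>
      <configuration>
        <instructions>
          <!-- do not import com.google.inject; keep the defaults for everything else -->
          <Import-Package>!com.google.inject,*</Import-Package>
        </instructions>
      </configuration>
    </plugin>

With such an exclusion the manifest no longer requires the package, so the bundle can resolve even though Guice is not installed.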
>>>>
>>>>
>>>> Best regards,
>>>>
>>>> Walter
>>>>
>>>> harish suvarna wrote:
>>>>
>>>>   Thanks Dr Walter. langdetect is very useful. I could successfully
>>>>> compile
>>>>> it but unable to load into stanbol as I get the error
>>>>> ======
>>>>> ERROR: Bundle org.apache.stanbol.enhancer.engines.langdetect [177]: Error
>>>>> starting/stopping bundle. (org.osgi.framework.BundleException: Unresolved
>>>>> constraint in bundle org.apache.stanbol.enhancer.engines.langdetect [177]:
>>>>> Unable to resolve 177.0: missing requirement [177.0] package;
>>>>> (package=com.google.inject))
>>>>> org.osgi.framework.BundleException: Unresolved constraint in bundle
>>>>> org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to resolve
>>>>> 177.0: missing requirement [177.0] package; (package=com.google.inject)
>>>>>        at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>>>>>        at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>>>>>        at org.apache.felix.framework.Felix.setBundleStartLevel(Felix.java:1333)
>>>>>        at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:270)
>>>>>        at java.lang.Thread.run(Thread.java:680)
>>>>>
>>>>> ==============
>>>>>
>>>>> I added the dependency
>>>>> <dependency>
>>>>>          <groupId>com.google.inject</groupId>
>>>>>          <artifactId>guice</artifactId>
>>>>>          <version>3.0</version>
>>>>>        </dependency>
>>>>>
>>>>> but looks like it is looking for version 1.3.0, which I can't find in
>>>>> repo1.maven.org. I am not sure who is needing the inject library. The
>>>>> entire source of langdetect plugin does not contain the word inject.
>>>>> Only
>>>>> the manifest file in target/classes has this listed.
>>>>>
>>>>>
>>>>> -harish
>>>>>
>>>>> On Tue, Jul 31, 2012 at 1:32 AM, Walter Kasper <ka...@dfki.de> wrote:
>>>>>
>>>>>    Hi Harish,
>>>>>
>>>>>> I checked in a new language identifier for Stanbol based on
>>>>>> http://code.google.com/p/******language-detection/<http://code.google.com/p/****language-detection/>
>>>>>> <http://**code.google.com/p/**language-**detection/<http://code.google.com/p/**language-detection/>
>>>>>> <http://**code.google.com/p/**language-**detection/<http://code.google.com/p/language-**detection/>
>>>>>> <http://**code.google.com/p/language-**detection/<http://code.google.com/p/language-detection/>
>>>>>>   .
>>>>>> Just check out from Stanbol trunk, install and try out.
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Walter
>>>>>>
>>>>>> harish suvarna wrote:
>>>>>>
>>>>>>    Rupert,
>>>>>>
>>>>>>> My initial debugging for Chinese text told me that the language
>>>>>>> identification done by langid enhancer using apache tika does not
>>>>>>> recognize
>>>>>>> chinese. The tika language detection seems is not supporting the CJK
>>>>>>> languages. With the result, the chinese language is identified as
>>>>>>> lithuanian language 'lt' . The apache tika group has an enhancement
>>>>>>> item
>>>>>>> 856 registered for detecting cjk languages
>>>>>>>      https://issues.apache.org/jira/browse/TIKA-856
>>>>>>>      in Feb 2012. I am not sure about the use of language
>>>>>>> identification
>>>>>>> in
>>>>>>> stanbol yet. Is the language id used to select the dbpedia  index
>>>>>>> (approprite dbpedia language dump) for entity lookups?
>>>>>>>
>>>>>>>
>>>>>>> I am just thinking that, for my purpose, I will pick option 3, make
>>>>>>> sure
>>>>>>> the text is in the language of my interest, and then call the paoding segmenter.
>>>>>>> Then
>>>>>>> iterate over the ngrams and do an entityhub lookup. I just still need
>>>>>>> to
>>>>>>> understand the code around how the whole entity lookup for dbpedia
>>>>>>> works.
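As a sketch of that plan (paoding ships a Lucene Analyzer; the PaodingAnalyzer class name and the Lucene attribute API used here are assumptions to verify against the paoding and Lucene versions in use):

    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;
    import net.paoding.analysis.analyzer.PaodingAnalyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class PaodingSegmentSketch {

        // segment Chinese text; each segment becomes a key for the entityhub lookup
        static List<String> segment(String chineseText) throws Exception {
            PaodingAnalyzer analyzer = new PaodingAnalyzer();
            TokenStream ts = analyzer.tokenStream("text", new StringReader(chineseText));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            List<String> segments = new ArrayList<String>();
            ts.reset();
            while (ts.incrementToken()) {
                segments.add(term.toString());
            }
            ts.end();
            ts.close();
            return segments;
        }
    }

Each returned segment would then be passed to the Entityhub lookup, instead of the character n-grams a whitespace tokenizer would force on Chinese text.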
>>>>>>>
>>>>>>> I find that the language detection library
>>>>>>> http://code.google.com/p/language-detection/ is
>>>>>>> very good at language
>>>>>>>
>>>>>>> detection. It supports 53 languages out of the box and the quality seems
>>>>>>> good.
>>>>>>> It is apache 2.0 license. I could volunteer to create a new langid
>>>>>>> engine
>>>>>>> based on this with the stanbol community approval. So if anyone sheds
>>>>>>> some
>>>>>>> light on how to add a new java library into stanbol, that would be great. I
>>>>>>> am a
>>>>>>> maven beginner now.
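For reference, the library's API itself is small; detection boils down to a few calls. A minimal sketch (the "profiles" directory path is an assumption, it must point at the language profile files shipped with the library, and loadProfile may only be called once per JVM):

    import com.cybozu.labs.langdetect.Detector;
    import com.cybozu.labs.langdetect.DetectorFactory;

    public class LangDetectSketch {

        private static boolean profilesLoaded = false;

        public static synchronized String detectLanguage(String text) throws Exception {
            if (!profilesLoaded) {
                // load the bundled language profiles exactly once
                DetectorFactory.loadProfile("profiles");
                profilesLoaded = true;
            }
            Detector detector = DetectorFactory.create();
            detector.append(text);
            return detector.detect(); // returns a code such as "zh-cn" or "zh-tw"
        }
    }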
>>>>>>>
>>>>>>> Thanks,
>>>>>>> harish
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler <
>>>>>>> rupert.westenthaler@gmail.com> wrote:
>>>>>>>
>>>>>>>     Hi harish,
>>>>>>>
>>>>>>>   Note: Sorry I forgot to include the stanbol-dev mailing list in my
>>>>>>>> last
>>>>>>>> answer.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <hs...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>    Thanks a lot Rupert.
>>>>>>>>
>>>>>>>>> I am weighing between options 2 and 3. What is the difference?
>>>>>>>>> Option 2
>>>>>>>>> sounds like enhancing KeyWordLinkingEngine to deal with chinese
>>>>>>>>> text.
>>>>>>>>> It
>>>>>>>>>
>>>>>>>>>    may
>>>>>>>>>
>>>>>>>>    be like paoding is hardcoded into KeyWordLinkingEngine. Option 3 is
>>>>>>>>
>>>>>>>>> like
>>>>>>>>>
>>>>>>>>>    a
>>>>>>>>>
>>>>>>>>    separate engine.
>>>>>>>>
>>>>>>>>>    Option (2) will require some improvement work on the Stanbol
>>>>>>>>> side.
>>>>>>>>>
>>>>>>>> However there were already discussions on how to create a "text
>>>>>>>> processing chain" that allows splitting up things like tokenizing, POS
>>>>>>>> tagging, Lemmatizing ... in different Enhancement Engines without
>>>>>>>> suffering from the disadvantages of creating high amounts of RDF triples.
>>>>>>>> One Idea was to base this on the Apache Lucene TokenStream [1] API
>>>>>>>> and
>>>>>>>> share the data as ContentPart [2] of the ContentItem.
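To make the ContentPart idea concrete, here is a sketch of how one engine could share a token stream and a later engine could reuse it (the part URI "urn:stanbol:tokens" is an invented example, and the addPart/getPart signatures should be checked against the ContentItem documentation in [2]):

    import org.apache.clerezza.rdf.core.UriRef;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.stanbol.enhancer.servicesapi.ContentItem;

    public class TokenSharingSketch {

        private static final UriRef TOKEN_PART = new UriRef("urn:stanbol:tokens");

        // producing engine: tokenize once, then share the stream as a ContentPart
        static void share(ContentItem ci, TokenStream tokens) {
            ci.addPart(TOKEN_PART, tokens);
        }

        // consuming engine (e.g. a keyword linking engine): reuse the tokens
        static TokenStream consume(ContentItem ci) {
            return ci.getPart(TOKEN_PART, TokenStream.class);
        }
    }

The point of the design is that tokens, POS tags and the like stay plain Java objects shared between engines, rather than being serialized into RDF triples.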
>>>>>>>>
>>>>>>>> Option (3) indeed means that you will create your own
>>>>>>>> EnhancementEngine - a similar one to the KeywordLinkingEngine.
>>>>>>>>
>>>>>>>>       But will I be able to use the stanbol dbpedia lookup using
>>>>>>>> option
>>>>>>>> 3?
>>>>>>>> Yes. You only need to obtain an Entityhub "ReferencedSite" and use the
>>>>>>>> "FieldQuery" interface to search for Entities (see [1] for an
>>>>>>>> example)
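A rough sketch of such a lookup (class and method names follow the Entityhub servicesapi of the time and should be verified against the EntitySearcherUtils class in [3]; the rdfs:label field, the "zh" language and the limit are arbitrary choices):

    import org.apache.stanbol.entityhub.servicesapi.model.Representation;
    import org.apache.stanbol.entityhub.servicesapi.query.FieldQuery;
    import org.apache.stanbol.entityhub.servicesapi.query.QueryResultList;
    import org.apache.stanbol.entityhub.servicesapi.query.TextConstraint;
    import org.apache.stanbol.entityhub.servicesapi.site.ReferencedSite;

    public class EntityLookupSketch {

        static final String RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label";

        // search the dbpedia ReferencedSite for entities whose label matches a keyword
        static QueryResultList<Representation> lookup(ReferencedSite dbpedia, String keyword)
                throws Exception {
            FieldQuery query = dbpedia.getQueryFactory().createFieldQuery();
            query.setConstraint(RDFS_LABEL, new TextConstraint(keyword, "zh"));
            query.addSelectedField(RDFS_LABEL);
            query.setLimit(10);
            return dbpedia.find(query);
        }
    }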
>>>>>>>>
>>>>>>>> best
>>>>>>>> Rupert
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>>>>>>> [2]
>>>>>>>> http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
>>>>>>>> [3]
>>>>>>>> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java
>>>>>>>>
>>>>>>>>     Btw, I created my own enhancement engine chains and I could see
>>>>>>>> them
>>>>>>>>
>>>>>>>>   yesterday in localhost:8080. But today all of them have vanished and
>>>>>>>>> only
>>>>>>>>> the default chain shows up. Can I dig them up somewhere in the
>>>>>>>>> stanbol
>>>>>>>>> directory?
>>>>>>>>>
>>>>>>>>> -harish
>>>>>>>>>
>>>>>>>>> I just created the eclipse project
>>>>>>>>> On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
>>>>>>>>> <rupert.westenthaler@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>    Hi,
>>>>>>>>>
>>>>>>>>>> There are no NER (Named Entity Recognition) models for Chinese text
>>>>>>>>>> available via OpenNLP. So the default configuration of Stanbol will
>>>>>>>>>> not process Chinese text. What you can do is to configure a
>>>>>>>>>> KeywordLinking Engine for Chinese text as this engine can also
>>>>>>>>>> process
>>>>>>>>>> in unknown languages (see [1] for details).
>>>>>>>>>>
>>>>>>>>>> However the KeywordLinking Engine also requires at least a
>>>>>>>>>> tokenizer
>>>>>>>>>> for looking up words. As there is no OpenNLP tokenizer specific to
>>>>>>>>>> Chinese text, it will use the default one, which uses a fixed set of
>>>>>>>>>> chars to split words (white spaces, hyphens ...). You may know better
>>>>>>>>>> how
>>>>>>>>>> well this would work with Chinese texts. My assumption would be
>>>>>>>>>> that
>>>>>>>>>> it is not sufficient - so results will be sub-optimal.
>>>>>>>>>>
>>>>>>>>>> To apply Chinese optimization I see three possibilities:
>>>>>>>>>>
>>>>>>>>>> 1. add support for Chinese to OpenNLP (Tokenizer, Sentence
>>>>>>>>>> detection,
>>>>>>>>>> POS tagging, Named Entity Detection)
>>>>>>>>>> 2. allow the KeywordLinkingEngine to use other already available
>>>>>>>>>> tools
>>>>>>>>>> for text processing (e.g. stuff that is already available for
>>>>>>>>>> Solr/Lucene [2] or the paoding chinese segmenter referenced in your
>>>>>>>>>> mail). Currently the KeywordLinkingEngine is hardwired with
>>>>>>>>>> OpenNLP,
>>>>>>>>>> because representing Tokens, POS ... as RDF would be too much of an
>>>>>>>>>> overhead.
>>>>>>>>>> 3. implement a new EnhancementEngine for processing Chinese text.
>>>>>>>>>>
>>>>>>>>>> Hope this helps to get you started.
>>>>>>>>>>
>>>>>>>>>> best
>>>>>>>>>> Rupert
>>>>>>>>>>
>>>>>>>>>> [1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
>>>>>>>>>> [2]
>>>>>>>>>> http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
>>>>>>>>    On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <
>>>>>>>> hsuvarna@gmail.com>
>>>>>>>>
>>>>>>>>> wrote:
>>>>>>>>>>    Hi Rupert,
>>>>>>>>>>
>>>>>>>>>>> Finally I am getting some time to work on Stanbol. My job is to
>>>>>>>>>>> demonstrate
>>>>>>>>>>> Stanbol annotations for Chinese text.
>>>>>>>>>>> I am just starting on it. I am following the instructions to build
>>>>>>>>>>> an
>>>>>>>>>>> enhancement engine from Anuj's blog. dbpedia has some chinese data
>>>>>>>>>>>
>>>>>>>>>>>    dump
>>>>>>>>>>>
>>>>>>>>>> too.
>>>>>>>>>   We may have to depend on the ngrams as keys and look them up in the
>>>>>>>>>>> dbpedia
>>>>>>>>>>> labels.
>>>>>>>>>>>
>>>>>>>>>>> I am planning to use the paoding chinese segmentor
>>>>>>>>>>> (http://code.google.com/p/paoding/
>>>>>>>>>>>
>>>>>>>>>>>   )
>>>>>>>>>>> for word breaking.
>>>>>>>>>>>
>>>>>>>>>>> Just curious. I pasted some chinese text in default engine of
>>>>>>>>>>> stanbol.
>>>>>>>>>>> It
>>>>>>>>>>> kind of finished the processing in no time at all. This gave me the
>>>>>>>>>>> suspicion
>>>>>>>>>>> that maybe, if the language is chinese, no further processing is
>>>>>>>>>>> done.
>>>>>>>>>>> Is it
>>>>>>>>>>> right? Any more tips for making all this work in Stanbol?
>>>>>>>>>>>
>>>>>>>>>>> -harish
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>   --
>>>>>>>>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>>>>>>>>> | Bodenlehenstraße 11
>>>>>>>>>> ++43-699-11108907
>>>>>>>>>> | A-5500 Bischofshofen
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>     --
>>>>>>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>>>>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>>>>>> | A-5500 Bischofshofen
>>>>>>>>
>>>>>>>>
>>>>>>>>    --
>>>>>>>>
>>>>>>> Dr. Walter Kasper
>>>>>> DFKI GmbH
>>>>>> Stuhlsatzenhausweg 3
>>>>>> D-66123 Saarbrücken
>>>>>> Tel.:  +49-681-85775-5300
>>>>>> Fax:   +49-681-85775-5338
>>>>>> Email: kasper@dfki.de
>>>>>> -------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>>>>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>>>>>
>>>>>> Geschaeftsfuehrung:
>>>>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>>>>> Dr. Walter Olthoff
>>>>>>
>>>>>> Vorsitzender des Aufsichtsrats:
>>>>>> Prof. Dr. h.c. Hans A. Aukes
>>>>>>
>>>>>> Amtsgericht Kaiserslautern, HRB 2313
>>>>>> -------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>>
>>>>>>   --
>>>> Dr. Walter Kasper
>>>> DFKI GmbH
>>>> Stuhlsatzenhausweg 3
>>>> D-66123 Saarbrücken
>>>> Tel.:  +49-681-85775-5300
>>>> Fax:   +49-681-85775-5338
>>>> Email: kasper@dfki.de
>>>> -------------------------------------------------------------
>>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>>>
>>>> Geschaeftsfuehrung:
>>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>>> Dr. Walter Olthoff
>>>>
>>>> Vorsitzender des Aufsichtsrats:
>>>> Prof. Dr. h.c. Hans A. Aukes
>>>>
>>>> Amtsgericht Kaiserslautern, HRB 2313
>>>> -------------------------------------------------------------
>>>>
>>>>
>>>>
>> --
>> Dr. Walter Kasper
>> DFKI GmbH
>> Stuhlsatzenhausweg 3
>> D-66123 Saarbrücken
>> Tel.:  +49-681-85775-5300
>> Fax:   +49-681-85775-5338
>> Email: kasper@dfki.de
>> -------------------------------------------------------------
>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>
>> Geschaeftsfuehrung:
>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>> Dr. Walter Olthoff
>>
>> Vorsitzender des Aufsichtsrats:
>> Prof. Dr. h.c. Hans A. Aukes
>>
>> Amtsgericht Kaiserslautern, HRB 2313
>> -------------------------------------------------------------
>>
>>


-- 
Dr. Walter Kasper
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Tel.:  +49-681-85775-5300
Fax:   +49-681-85775-5338
Email: kasper@dfki.de
-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------


Re: Stanbol Chinese

Posted by harish suvarna <hs...@gmail.com>.
I did 'mvn clean install'.
Which stanbol folder is this ?

$HOME/stanbol where it stores some user/config prefs or trunk/stanbol? You
mean remove the entire folder?

I restarted the machine and doing another mvn clean install now. I will
post you in another 30 mins.

-harish

On Wed, Aug 1, 2012 at 10:36 AM, Walter Kasper <ka...@dfki.de> wrote:

> Hi again,
>
> It came to my mind that you should also clear the 'stanbol' folder of the
> Stanbol runtime system and restart the system. The folder might contain old
> bundle configuration data that don't get updated automatically.
>
>
> Best regards,
>
> Walter
>
> harish suvarna wrote:
>
>> Did a fresh build and inside Stanbol in localhost:8080, it is installed
>> but
>> is not activated. I still see the com.google.inject errors.
>> I do see the pom.xml update from you.
>>
>> -harish
>>
>> On Wed, Aug 1, 2012 at 12:55 AM, Walter Kasper <ka...@dfki.de> wrote:
>>
>>  Hi,
>>>
>>> The OSGI bundle declared some package imports that usually indeed are
>>> not
>>> available nor required. I fixed that. Just check out the corrected
>>> pom.xml.
>>> On a fresh clean Stanbol installation langdetect worked fine for me.
>>>
>>>
>>> Best regards,
>>>
>>> Walter
>>>
>>> harish suvarna wrote:
>>>
>>>  Thanks Dr Walter. langdetect is very useful. I could successfully
>>>> compile
>>>> it but unable to load into stanbol as I get the error
>>>> ======
>>>> ERROR: Bundle org.apache.stanbol.enhancer.engines.langdetect [177]: Error
>>>> starting/stopping bundle. (org.osgi.framework.BundleException: Unresolved
>>>> constraint in bundle org.apache.stanbol.enhancer.engines.langdetect [177]:
>>>> Unable to resolve 177.0: missing requirement [177.0] package;
>>>> (package=com.google.inject))
>>>> org.osgi.framework.BundleException: Unresolved constraint in bundle
>>>> org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to resolve
>>>> 177.0: missing requirement [177.0] package; (package=com.google.inject)
>>>>       at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>>>>       at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>>>>       at org.apache.felix.framework.Felix.setBundleStartLevel(Felix.java:1333)
>>>>       at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:270)
>>>>       at java.lang.Thread.run(Thread.java:680)
>>>>
>>>> ==============
>>>>
>>>> I added the dependency
>>>> <dependency>
>>>>         <groupId>com.google.inject</groupId>
>>>>         <artifactId>guice</artifactId>
>>>>         <version>3.0</version>
>>>>       </dependency>
>>>>
>>>> but looks like it is looking for version 1.3.0, which I can't find in
>>>> repo1.maven.org. I am not sure who is needing the inject library. The
>>>> entire source of langdetect plugin does not contain the word inject.
>>>> Only
>>>> the manifest file in target/classes has this listed.
>>>>
>>>>
>>>> -harish
>>>>
>>>> On Tue, Jul 31, 2012 at 1:32 AM, Walter Kasper <ka...@dfki.de> wrote:
>>>>
>>>>   Hi Harish,
>>>>
>>>>> I checked in a new language identifier for Stanbol based on
>>>>> http://code.google.com/p/language-detection/ .
>>>>>
>>>>> Just check out from Stanbol trunk, install and try out.
>>>>>
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Walter
>>>>>
>>>>> harish suvarna wrote:
>>>>>
>>>>>   Rupert,
>>>>>
>>>>>> My initial debugging for Chinese text told me that the language
>>>>>> identification done by langid enhancer using apache tika does not
>>>>>> recognize
>>>>>> chinese. The tika language detection seems not to support the CJK
>>>>>> languages. As a result, the chinese language is identified as
>>>>>> lithuanian language 'lt' . The apache tika group has an enhancement
>>>>>> item
>>>>>> 856 registered for detecting cjk languages
>>>>>>     https://issues.apache.org/jira/browse/TIKA-856
>>>>>>
>>>>>>     in Feb 2012. I am not sure about the use of language
>>>>>> identification
>>>>>> in
>>>>>> stanbol yet. Is the language id used to select the dbpedia  index
>>>>>> (approprite dbpedia language dump) for entity lookups?
>>>>>>
>>>>>>
>>>>>> I am just thinking that, for my purpose, I will pick option 3, make
>>>>>> sure
>>>>>> the text is in the language of my interest, and then call the paoding segmenter.
>>>>>> Then
>>>>>> iterate over the ngrams and do an entityhub lookup. I just still need
>>>>>> to
>>>>>> understand the code around how the whole entity lookup for dbpedia
>>>>>> works.
>>>>>>
>>>>>> I find that the language detection library
>>>>>> http://code.google.com/p/language-detection/ is
>>>>>>
>>>>>> very good at language
>>>>>>
>>>>>> detection. It supports 53 languages out of the box and the quality seems
>>>>>> good.
>>>>>> It is apache 2.0 license. I could volunteer to create a new langid
>>>>>> engine
>>>>>> based on this with the stanbol community approval. So if anyone sheds
>>>>>> some
>>>>>> light on how to add a new java library into stanbol, that would be great. I
>>>>>> am a
>>>>>> maven beginner now.
>>>>>>
>>>>>> Thanks,
>>>>>> harish
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler <
>>>>>> rupert.westenthaler@gmail.com> wrote:
>>>>>>
>>>>>>    Hi harish,
>>>>>>
>>>>>>  Note: Sorry I forgot to include the stanbol-dev mailing list in my
>>>>>>> last
>>>>>>> answer.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <hs...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>   Thanks a lot Rupert.
>>>>>>>
>>>>>>>> I am weighing between options 2 and 3. What is the difference?
>>>>>>>> Option 2
>>>>>>>> sounds like enhancing KeyWordLinkingEngine to deal with chinese
>>>>>>>> text.
>>>>>>>> It
>>>>>>>>
>>>>>>>>   may
>>>>>>>>
>>>>>>>   be like paoding is hardcoded into KeyWordLinkingEngine. Option 3 is
>>>>>>>
>>>>>>>> like
>>>>>>>>
>>>>>>>>   a
>>>>>>>>
>>>>>>>   separate engine.
>>>>>>>
>>>>>>>>   Option (2) will require some improvement work on the Stanbol
>>>>>>>> side.
>>>>>>>>
>>>>>>> However there were already discussions on how to create a "text
>>>>>>> processing chain" that allows splitting up things like tokenizing, POS
>>>>>>> tagging, Lemmatizing ... in different Enhancement Engines without
>>>>>>> suffering from the disadvantages of creating high amounts of RDF triples.
>>>>>>> One Idea was to base this on the Apache Lucene TokenStream [1] API
>>>>>>> and
>>>>>>> share the data as ContentPart [2] of the ContentItem.
>>>>>>>
>>>>>>> Option (3) indeed means that you will create your own
>>>>>>> EnhancementEngine - a similar one to the KeywordLinkingEngine.
>>>>>>>
>>>>>>>      But will I be able to use the stanbol dbpedia lookup using
>>>>>>> option
>>>>>>> 3?
>>>>>>> Yes. You only need to obtain an Entityhub "ReferencedSite" and use the
>>>>>>> "FieldQuery" interface to search for Entities (see [1] for an
>>>>>>> example)
>>>>>>>
>>>>>>> best
>>>>>>> Rupert
>>>>>>>
>>>>>>> [1]
>>>>>>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>>>>>> [2]
>>>>>>> http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
>>>>>>> [3]
>>>>>>> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java
>>>>>>>
>>>>>>>
>>>>>>>    Btw, I created my own enhancement engine chains and I could see
>>>>>>> them
>>>>>>>
>>>>>>>  yesterday in localhost:8080. But today all of them have vanished and
>>>>>>>> only
>>>>>>>> the default chain shows up. Can I dig them up somewhere in the
>>>>>>>> stanbol
>>>>>>>> directory?
>>>>>>>>
>>>>>>>> -harish
>>>>>>>>
>>>>>>>> I just created the eclipse project
>>>>>>>> On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
>>>>>>>> <rupert.westenthaler@gmail.com> wrote:
>>>>>>>>
>>>>>>>>   Hi,
>>>>>>>>
>>>>>>>>> There are no NER (Named Entity Recognition) models for Chinese text
>>>>>>>>> available via OpenNLP. So the default configuration of Stanbol will
>>>>>>>>> not process Chinese text. What you can do is to configure a
>>>>>>>>> KeywordLinking Engine for Chinese text as this engine can also
>>>>>>>>> process
>>>>>>>>> in unknown languages (see [1] for details).
>>>>>>>>>
>>>>>>>>> However the KeywordLinking Engine also requires at least a
>>>>>>>>> tokenizer
>>>>>>>>> for looking up words. As there is no OpenNLP tokenizer specific to
>>>>>>>>> Chinese text, it will use the default one, which uses a fixed set of
>>>>>>>>> chars to split words (white spaces, hyphens ...). You may know better
>>>>>>>>> how
>>>>>>>>> well this would work with Chinese texts. My assumption would be
>>>>>>>>> that
>>>>>>>>> it is not sufficient - so results will be sub-optimal.
>>>>>>>>>
>>>>>>>>> To apply Chinese optimization I see three possibilities:
>>>>>>>>>
>>>>>>>>> 1. add support for Chinese to OpenNLP (Tokenizer, Sentence
>>>>>>>>> detection,
>>>>>>>>> POS tagging, Named Entity Detection)
>>>>>>>>> 2. allow the KeywordLinkingEngine to use other already available
>>>>>>>>> tools
>>>>>>>>> for text processing (e.g. stuff that is already available for
>>>>>>>>> Solr/Lucene [2] or the paoding chinese segmenter referenced in your
>>>>>>>>> mail). Currently the KeywordLinkingEngine is hardwired with
>>>>>>>>> OpenNLP,
>>>>>>>>> because representing Tokens, POS ... as RDF would be too much of an
>>>>>>>>> overhead.
>>>>>>>>> 3. implement a new EnhancementEngine for processing Chinese text.
>>>>>>>>>
>>>>>>>>> Hope this helps to get you started.
>>>>>>>>>
>>>>>>>>> best
>>>>>>>>> Rupert
>>>>>>>>>
>>>>>>>>> [1] http://incubator.apache.org/stanbol/docs/trunk/multilingual.html
>>>>>>>>> [2]
>>>>>>>>> http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean
>>>>>>>
>>>>>>>   On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <
>>>>>>> hsuvarna@gmail.com>
>>>>>>>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>   Hi Rupert,
>>>>>>>>>
>>>>>>>>>> Finally I am getting some time to work on Stanbol. My job is to
>>>>>>>>>> demonstrate
>>>>>>>>>> Stanbol annotations for Chinese text.
>>>>>>>>>> I am just starting on it. I am following the instructions to build
>>>>>>>>>> an
>>>>>>>>>> enhancement engine from Anuj's blog. dbpedia has some chinese data
>>>>>>>>>>
>>>>>>>>>>   dump
>>>>>>>>>>
>>>>>>>>> too.
>>>>>>>>
>>>>>>>>  We may have to depend on the ngrams as keys and look them up in the
>>>>>>>>>
>>>>>>>>>> dbpedia
>>>>>>>>>> labels.
>>>>>>>>>>
>>>>>>>>>> I am planning to use the paoding chinese segmentor
>>>>>>>>>> (http://code.google.com/p/paoding/
>>>>>>>>>>
>>>>>>>>>>  )
>>>>>>>>>>>
>>>>>>>>>> for word breaking.
>>>>>>>>>>
>>>>>>>>>> Just curious. I pasted some chinese text in default engine of
>>>>>>>>>> stanbol.
>>>>>>>>>> It
>>>>>>>>>> kind of finished the processing in no time at all. This gave me the
>>>>>>>>>> suspicion
>>>>>>>>>> that maybe, if the language is chinese, no further processing is
>>>>>>>>>> done.
>>>>>>>>>> Is it
>>>>>>>>>> right? Any more tips for making all this work in Stanbol?
>>>>>>>>>>
>>>>>>>>>> -harish
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  --
>>>>>>>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>>>>>>>> | Bodenlehenstraße 11
>>>>>>>>> ++43-699-11108907
>>>>>>>>> | A-5500 Bischofshofen
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    --
>>>>>>>>
>>>>>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>>>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>>>>> | A-5500 Bischofshofen
>>>>>>>
>>>>>>>
>>>>>>>   --
>>>>>>>
>>>>>> Dr. Walter Kasper
>>>>> DFKI GmbH
>>>>> Stuhlsatzenhausweg 3
>>>>> D-66123 Saarbrücken
>>>>> Tel.:  +49-681-85775-5300
>>>>> Fax:   +49-681-85775-5338
>>>>> Email: kasper@dfki.de
>>>>> -------------------------------------------------------------
>>>>>
>>>>>
>>>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>>>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>>>>
>>>>> Geschaeftsfuehrung:
>>>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>>>> Dr. Walter Olthoff
>>>>>
>>>>> Vorsitzender des Aufsichtsrats:
>>>>> Prof. Dr. h.c. Hans A. Aukes
>>>>>
>>>>> Amtsgericht Kaiserslautern, HRB 2313
>>>>> -------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>>>  --
>>> Dr. Walter Kasper
>>> DFKI GmbH
>>> Stuhlsatzenhausweg 3
>>> D-66123 Saarbrücken
>>> Tel.:  +49-681-85775-5300
>>> Fax:   +49-681-85775-5338
>>> Email: kasper@dfki.de
>>> -------------------------------------------------------------
>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>>
>>> Geschaeftsfuehrung:
>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>> Dr. Walter Olthoff
>>>
>>> Vorsitzender des Aufsichtsrats:
>>> Prof. Dr. h.c. Hans A. Aukes
>>>
>>> Amtsgericht Kaiserslautern, HRB 2313
>>> -------------------------------------------------------------
>>>
>>>
>>>
>
> --
> Dr. Walter Kasper
> DFKI GmbH
> Stuhlsatzenhausweg 3
> D-66123 Saarbrücken
> Tel.:  +49-681-85775-5300
> Fax:   +49-681-85775-5338
> Email: kasper@dfki.de
> -------------------------------------------------------------
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
>
> Vorsitzender des Aufsichtsrats:
> Prof. Dr. h.c. Hans A. Aukes
>
> Amtsgericht Kaiserslautern, HRB 2313
> -------------------------------------------------------------
>
>

Re: Stanbol Chinese

Posted by Walter Kasper <ka...@dfki.de>.
Hi again,

It came to my mind that you should also clear the 'stanbol' folder of 
the Stanbol runtime system and restart the system. The folder might
contain old bundle configuration data that don't get updated automatically.

Best regards,

Walter

harish suvarna wrote:
> Did a fresh build and inside Stanbol in localhost:8080, it is installed but
> is not activated. I still see the com.google.inject errors.
> I do see the pom.xml update from you.
>
> -harish
>
> On Wed, Aug 1, 2012 at 12:55 AM, Walter Kasper <ka...@dfki.de> wrote:
>
>> Hi,
>>
>> The OSGI bundle declared some package imports that usually indeed are not
>> available nor required. I fixed that. Just check out the corrected pom.xml.
>> On a fresh clean Stanbol installation langdetect worked fine for me.
>>
>>
>> Best regards,
>>
>> Walter
>>
>> harish suvarna wrote:
>>
>>> Thanks Dr Walter. langdetect is very useful. I could successfully compile
>>> it but unable to load into stanbol as I get the error
>>> ======
>>> ERROR: Bundle org.apache.stanbol.enhancer.engines.langdetect [177]: Error
>>> starting/stopping bundle. (org.osgi.framework.BundleException: Unresolved
>>> constraint in bundle org.apache.stanbol.enhancer.engines.langdetect [177]:
>>> Unable to resolve 177.0: missing requirement [177.0] package;
>>> (package=com.google.inject))
>>> org.osgi.framework.BundleException: Unresolved constraint in bundle
>>> org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to resolve
>>> 177.0: missing requirement [177.0] package; (package=com.google.inject)
>>>       at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>>>       at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>>>       at org.apache.felix.framework.Felix.setBundleStartLevel(Felix.java:1333)
>>>       at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:270)
>>>       at java.lang.Thread.run(Thread.java:680)
>>> ==============
>>>
>>> I added the dependency
>>> <dependency>
>>>         <groupId>com.google.inject</groupId>
>>>         <artifactId>guice</artifactId>
>>>         <version>3.0</version>
>>>       </dependency>
>>>
>>> but looks like it is looking for version 1.3.0, which I can't find in
>>> repo1.maven.org. I am not sure who is needing the inject library. The
>>> entire source of langdetect plugin does not contain the word inject. Only
>>> the manifest file in target/classes has this listed.
>>>
>>>
>>> -harish
>>>
>>> On Tue, Jul 31, 2012 at 1:32 AM, Walter Kasper <ka...@dfki.de> wrote:
>>>
>>>   Hi Harish,
>>>> I checked in a new language identifier for Stanbol based on
>>>> http://code.google.com/p/language-detection/ .
>>>> Just check out from Stanbol trunk, install and try out.
>>>>
>>>>
>>>> Best regards,
>>>>
>>>> Walter
>>>>
>>>> harish suvarna wrote:
>>>>
>>>>   Rupert,
>>>>> My initial debugging for Chinese text told me that the language
>>>>> identification done by langid enhancer using apache tika does not
>>>>> recognize
>>>>> chinese. The tika language detection seems not to support the CJK
>>>>> languages. As a result, the chinese language is identified as
>>>>> lithuanian language 'lt' . The apache tika group has an enhancement item
>>>>> 856 registered for detecting cjk languages
>>>>>     https://issues.apache.org/jira/browse/TIKA-856
>>>>>     in Feb 2012. I am not sure about the use of language identification
>>>>> in
>>>>> stanbol yet. Is the language id used to select the dbpedia  index
>>>>> (approprite dbpedia language dump) for entity lookups?
>>>>>
>>>>>
>>>>> I am just thinking that, for my purpose, I will pick option 3, make
>>>>> sure
>>>>> the text is in the language of my interest, and then call the paoding segmenter.
>>>>> Then
>>>>> iterate over the ngrams and do an entityhub lookup. I just still need to
>>>>> understand the code around how the whole entity lookup for dbpedia
>>>>> works.
>>>>>
>>>>> I find that the language detection library
>>>>> http://code.google.com/p/language-detection/ is
>>>>> very good at language
>>>>>
>>>>> detection. It supports 53 languages out of the box and the quality seems
>>>>> good.
>>>>> It is apache 2.0 license. I could volunteer to create a new langid
>>>>> engine
>>>>> based on this with the stanbol community approval. So if anyone sheds
>>>>> some
>>>>> light on how to add a new java library into stanbol, that would be great. I
>>>>> am a
>>>>> maven beginner now.
>>>>>
>>>>> Thanks,
>>>>> harish
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 26, 2012 at 9:46 PM, Rupert Westenthaler <
>>>>> rupert.westenthaler@gmail.com> wrote:
>>>>>
>>>>>    Hi harish,
>>>>>
>>>>>> Note: Sorry I forgot to include the stanbol-dev mailing list in my last
>>>>>> answer.
>>>>>>
>>>>>>
>>>>>> On Fri, Jul 27, 2012 at 3:33 AM, harish suvarna <hs...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>   Thanks a lot Rupert.
>>>>>>> I am weighing between options 2 and 3. What is the difference?
>>>>>>> Option 2
>>>>>>> sounds like enhancing KeyWordLinkingEngine to deal with chinese text.
>>>>>>> It
>>>>>>>
>>>>>>>   may
>>>>>>   be like paoding is hardcoded into KeyWordLinkingEngine. Option 3 is
>>>>>>> like
>>>>>>>
>>>>>>>   a
>>>>>>   separate engine.
>>>>>>>   Option (2) will require some improvement work on the Stanbol side.
>>>>>> However there were already discussions on how to create a "text
>>>>>> processing chain" that allows splitting up things like tokenizing, POS
>>>>>> tagging, Lemmatizing ... in different Enhancement Engines without
>>>>>> suffering from the disadvantages of creating high amounts of RDF triples.
>>>>>> One Idea was to base this on the Apache Lucene TokenStream [1] API and
>>>>>> share the data as ContentPart [2] of the ContentItem.
>>>>>>
>>>>>> Option (3) indeed means that you will create your own
>>>>>> EnhancementEngine - a similar one to the KeywordLinkingEngine.
>>>>>>
>>>>>>      But will I be able to use the stanbol dbpedia lookup using option
>>>>>> 3?
>>>>>> Yes. You only need to obtain an Entityhub "ReferencedSite" and use the
>>>>>> "FieldQuery" interface to search for Entities (see [1] for an example)
>>>>>>
>>>>>> best
>>>>>> Rupert
>>>>>>
>>>>>> [1]
>>>>>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>>>>> [2]
>>>>>> http://incubator.apache.org/stanbol/docs/trunk/components/enhancer/contentitem.html#content-parts
>>>>>> [3]
>>>>>> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/keywordextraction/src/main/java/org/apache/stanbol/enhancer/engines/keywordextraction/linking/impl/EntitySearcherUtils.java
>>>>>>
>>>>>>
>>>>>>    Btw, I created my own enhancement engine chains and I could see them
>>>>>>
>>>>>>> yesterday in localhost:8080. But today all of them have vanished and
>>>>>>> only
>>>>>>> the default chain shows up. Can I dig them up somewhere in the stanbol
>>>>>>> directory?
>>>>>>>
>>>>>>> -harish
>>>>>>>
>>>>>>> I just created the eclipse project
>>>>>>> On Thu, Jul 26, 2012 at 5:04 AM, Rupert Westenthaler
>>>>>>> <rupert.westenthaler@gmail.com> wrote:
>>>>>>>
>>>>>>>   Hi,
>>>>>>>> There are no NER (Named Entity Recognition) models for Chinese text
>>>>>>>> available via OpenNLP. So the default configuration of Stanbol will
>>>>>>>> not process Chinese text. What you can do is to configure a
>>>>>>>> KeywordLinking Engine for Chinese text as this engine can also
>>>>>>>> process
>>>>>>>> in unknown languages (see [1] for details).
>>>>>>>>
>>>>>>>> However also the KeywordLinking Engine requires at least n tokenizer
>>>>>>>> for looking up Words. As there is no specific Tokenizer for OpenNLP
>>>>>>>> Chinese text it will use the default one that uses a fixed set of
>>>>>>>> chars to split words (white spaces, hyphens ...). You may better how
>>>>>>>> well this would work with Chinese texts. My assumption would be that
>>>>>>>> it is not sufficient - so results will be sub-optimal.
>>>>>>>>
>>>>>>>> To apply Chinese optimization I see three possibilities:
>>>>>>>>
>>>>>>>> 1. add support for Chinese to OpenNLP (Tokenizer, Sentence detection,
>>>>>>>> POS tagging, Named Entity Detection)
>>>>>>>> 2. allow the KeywordLinkingEngine to use other already available
>>>>>>>> tools
>>>>>>>> for text processing (e.g. stuff that is already available for
>>>>>>>> Solr/Lucene [2] or the paoding chinese segment or referenced in you
>>>>>>>> mail). Currently the KeywordLinkingEngine is hardwired with OpenNLP,
>>>>>>>> because representing Tokens, POS ... as RDF would be to much of an
>>>>>>>> overhead.
>>>>>>>> 3. implement a new EnhancementEngine for processing Chinese text.
>>>>>>>>
>>>>>>>> Hope this helps to get you started.
>>>>>>>>
>>>>>>>> best
>>>>>>>> Rupert
>>>>>>>>
>>>>>>>> [1] http://incubator.apache.org/****stanbol/docs/trunk/**<http://incubator.apache.org/**stanbol/docs/trunk/**>
>>>>>>>> multilingual.html<http://**incubator.apache.org/stanbol/**
>>>>>>>> docs/trunk/multilingual.html<http://incubator.apache.org/stanbol/docs/trunk/multilingual.html>
>>>>>>>> [2]
>>>>>>>>
>>>>>>>>    http://wiki.apache.org/solr/****LanguageAnalysis#Chinese.2C_**<http://wiki.apache.org/solr/**LanguageAnalysis#Chinese.2C_**>
>>>>>>>>
>>>>>>> Japanese.2C_Korean<http://**wiki.apache.org/solr/**
>>>>>> LanguageAnalysis#Chinese.2C_**Japanese.2C_Korean<http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean>
>>>>>>   On Thu, Jul 26, 2012 at 2:00 AM, harish suvarna <hs...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>   Hi Rupert,
>>>>>>>>> Finally I am getting some time to work on Stanbol. My job is to
>>>>>>>>> demonstrate
>>>>>>>>> Stanbol annotations for Chinese text.
>>>>>>>>> I am just starting on it. I am following the instructions to build
>>>>>>>>> an
>>>>>>>>> enhancement engine from Anuj's blog. dbpedia has some chinese data
>>>>>>>>>
>>>>>>>>>   dump
>>>>>>> too.
>>>>>>>
>>>>>>>> We may have to depend on the ngrams as keys and look them up in the
>>>>>>>>> dbpedia
>>>>>>>>> labels.
>>>>>>>>>
>>>>>>>>> I am planning to use the paoding chinese segmentor
>>>>>>>>> (http://code.google.com/p/****paoding/<http://code.google.com/p/**paoding/>
>>>>>>>>> <http://code.google.**com/p/paoding/<http://code.google.com/p/paoding/>
>>>>>>>>>> )
>>>>>>>>> for word breaking.
>>>>>>>>>
>>>>>>>>> Just curious. I pasted some chinese text in default engine of
>>>>>>>>> stanbol.
>>>>>>>>> It
>>>>>>>>> kind of finished the processing in no time at all. This gave me
>>>>>>>>> suspicion
>>>>>>>>> that may be if the language is chinese, no further processing is
>>>>>>>>> done.
>>>>>>>>> Is it
>>>>>>>>> right? Any more tips for making all this work in Stanbol?
>>>>>>>>>
>>>>>>>>> -harish
>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>>>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>>>>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>>>>>> | A-5500 Bischofshofen
>>>>>>>>
>>>>>>>>
>>>>>>>   --
>>>>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>>>> | A-5500 Bischofshofen
>>>>>>
>>>>>>
>>>>>>   --
>>>> Dr. Walter Kasper
>>>> DFKI GmbH
>>>> Stuhlsatzenhausweg 3
>>>> D-66123 Saarbrücken
>>>> Tel.:  +49-681-85775-5300
>>>> Fax:   +49-681-85775-5338
>>>> Email: kasper@dfki.de
>>>> ------------------------------****----------------------------**--**-
>>>>
>>>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>>>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>>>
>>>> Geschaeftsfuehrung:
>>>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>>>> Dr. Walter Olthoff
>>>>
>>>> Vorsitzender des Aufsichtsrats:
>>>> Prof. Dr. h.c. Hans A. Aukes
>>>>
>>>> Amtsgericht Kaiserslautern, HRB 2313
>>>> ------------------------------****----------------------------**--**-
>>>>
>>>>
>>>>
>> --
>> Dr. Walter Kasper
>> DFKI GmbH
>> Stuhlsatzenhausweg 3
>> D-66123 Saarbrücken
>> Tel.:  +49-681-85775-5300
>> Fax:   +49-681-85775-5338
>> Email: kasper@dfki.de
>> ------------------------------**------------------------------**-
>> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
>> Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern
>>
>> Geschaeftsfuehrung:
>> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>> Dr. Walter Olthoff
>>
>> Vorsitzender des Aufsichtsrats:
>> Prof. Dr. h.c. Hans A. Aukes
>>
>> Amtsgericht Kaiserslautern, HRB 2313
>> ------------------------------**------------------------------**-
>>
>>


-- 
Dr. Walter Kasper
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Tel.:  +49-681-85775-5300
Fax:   +49-681-85775-5338
Email: kasper@dfki.de
-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------


Re: Stanbol Chinese

Posted by Walter Kasper <ka...@dfki.de>.
Hi,

I will look into it.

Best regards,

Walter

Rupert Westenthaler wrote:
> Hi Walter
>
> AFAIK the framework used supports confidence values and can also
> return multiple suggestions. Could you please use these features to
> create multiple language annotations that include the confidence
> values?


-- 
Dr. Walter Kasper
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Tel.:  +49-681-85775-5300
Fax:   +49-681-85775-5338
Email: kasper@dfki.de
-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------


Re: Stanbol Chinese

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Walter

On Wed, Aug 1, 2012 at 7:13 PM, Walter Kasper <ka...@dfki.de> wrote:
> <rdf:Description
> rdf:about="urn:enhancement-0fe47b47-13c6-fc7d-335f-59e48e7a2bf1">
>     <j.2:type rdf:resource="http://purl.org/dc/terms/LinguisticSystem"/>
>     <j.8:extracted-from
> rdf:resource="urn:content-item-sha1-811041df069ba48e9c4682927267e565d5ec7bd4"/>
>     <rdf:type
> rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
>     <rdf:type
> rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
>     <j.2:language>en</j.2:language>
>     <j.2:created
> rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-08-01T16:53:40.970Z</j.2:created>
>     <j.2:creator
> rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.langdetect.LanguageDetectionEnhancementEngine</j.2:creator>
>   </rdf:Description>
>

AFAIK the framework used supports confidence values and can also
return multiple suggestions. Could you please use these features to
create multiple language annotations that include the confidence
values?
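
Something along these lines should do it inside the engine's
computeEnhancements(..) - just a rough sketch that assumes the Clerezza
RDF API and the constants from
org.apache.stanbol.enhancer.servicesapi.rdf.Properties; the
LanguageGuess type only stands in for whatever the detection library
returns:

import java.util.List;
import org.apache.clerezza.rdf.core.LiteralFactory;
import org.apache.clerezza.rdf.core.MGraph;
import org.apache.clerezza.rdf.core.UriRef;
import org.apache.clerezza.rdf.core.impl.PlainLiteralImpl;
import org.apache.clerezza.rdf.core.impl.TripleImpl;
import org.apache.stanbol.enhancer.servicesapi.ContentItem;
import org.apache.stanbol.enhancer.servicesapi.helper.EnhancementEngineHelper;
import static org.apache.stanbol.enhancer.servicesapi.rdf.Properties.DC_LANGUAGE;
import static org.apache.stanbol.enhancer.servicesapi.rdf.Properties.ENHANCER_CONFIDENCE;

// called from computeEnhancements(..); 'this' is the EnhancementEngine
private void addLanguageAnnotations(ContentItem ci, List<LanguageGuess> suggestions) {
    MGraph metadata = ci.getMetadata();
    LiteralFactory lf = LiteralFactory.getInstance();
    for (LanguageGuess guess : suggestions) {
        // one fise:TextAnnotation per suggested language
        UriRef annotation = EnhancementEngineHelper.createTextEnhancement(ci, this);
        metadata.add(new TripleImpl(annotation, DC_LANGUAGE,
                new PlainLiteralImpl(guess.getLanguage())));
        metadata.add(new TripleImpl(annotation, ENHANCER_CONFIDENCE,
                lf.createTypedLiteral(guess.getConfidence())));
    }
}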

Using those is easy, as there are two helper methods:

* EnhancementEngineHelper.getLanguage(..) returns the language with
the highest confidence - suited for simple use cases
* EnhancementEngineHelper.getLanguageAnnotations(..) returns a list
of all language annotations (sorted by confidence). It returns the
subjects of the language annotations; users need to retrieve the
language, fise:confidence, creator ... themselves.
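
In client code that looks roughly like this (again only a sketch - the
exact signatures are in the STANBOL-613 patch, so treat the parameter
types as approximate; imports as in the sketch above plus
org.apache.clerezza.rdf.core.NonLiteral):

void readLanguages(ContentItem ci) {
    MGraph metadata = ci.getMetadata();
    // simple case: just the language with the highest confidence
    String language = EnhancementEngineHelper.getLanguage(ci);
    // full list, sorted by confidence; read the details per annotation yourself
    for (NonLiteral annotation : EnhancementEngineHelper.getLanguageAnnotations(metadata)) {
        String lang = EnhancementEngineHelper.getString(metadata, annotation, DC_LANGUAGE);
        Double confidence = EnhancementEngineHelper.get(metadata, annotation,
                ENHANCER_CONFIDENCE, Double.class, LiteralFactory.getInstance());
        // ... dc:creator etc. as needed
    }
}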

See STANBOL-613 [1] for details.

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-613

>
> Did you make 'mvn clean' before 'mvn install'?
>
> Walter



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Stanbol Chinese

Posted by Walter Kasper <ka...@dfki.de>.
Hi,

This is strange. I just freshly compiled and started Stanbol on another
machine, and it worked fine. The 'langdetect' component is active; there
are no errors, and no com.google.inject anywhere around. For some English
text I see these annotations from langdetect, which are just as they
should be:

<rdf:Description rdf:about="urn:enhancement-0fe47b47-13c6-fc7d-335f-59e48e7a2bf1">
     <j.2:type rdf:resource="http://purl.org/dc/terms/LinguisticSystem"/>
     <j.8:extracted-from rdf:resource="urn:content-item-sha1-811041df069ba48e9c4682927267e565d5ec7bd4"/>
     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/Enhancement"/>
     <rdf:type rdf:resource="http://fise.iks-project.eu/ontology/TextAnnotation"/>
     <j.2:language>en</j.2:language>
     <j.2:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2012-08-01T16:53:40.970Z</j.2:created>
     <j.2:creator rdf:datatype="http://www.w3.org/2001/XMLSchema#string">org.apache.stanbol.enhancer.engines.langdetect.LanguageDetectionEnhancementEngine</j.2:creator>
   </rdf:Description>
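
For what it's worth, the underlying library can return the whole list
of guesses together with their probabilities, so the engine could emit
one annotation per guess. Roughly (method names as documented by the
language-detection project, so treat them as approximate):

import java.util.ArrayList;
import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;
import com.cybozu.labs.langdetect.Language;

ArrayList<Language> detectAll(String text) throws LangDetectException {
    // the language profiles ship with the library; load them once at startup
    DetectorFactory.loadProfile("profiles");
    Detector detector = DetectorFactory.create();
    detector.append(text);
    return detector.getProbabilities(); // guesses sorted by probability, e.g. en:0.999
}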


Did you make 'mvn clean' before 'mvn install'?

Walter

harish suvarna wrote:
> I did a fresh build, and in Stanbol at localhost:8080 it is installed but
> not activated. I still see the com.google.inject errors.
> I do see the pom.xml update from you.
>
> -harish




-- 
Dr. Walter Kasper
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Tel.:  +49-681-85775-5300
Fax:   +49-681-85775-5338
Email: kasper@dfki.de
-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------


Re: Stanbol Chinese

Posted by harish suvarna <hs...@gmail.com>.
I did a fresh build, and in Stanbol at localhost:8080 it is installed but
not activated. I still see the com.google.inject errors.
I do see the pom.xml update from you.

-harish

On Wed, Aug 1, 2012 at 12:55 AM, Walter Kasper <ka...@dfki.de> wrote:

> Hi,
>
> The OSGi bundle declared some package imports that are usually neither
> available nor required. I fixed that. Just check out the corrected pom.xml.
> On a fresh, clean Stanbol installation langdetect worked fine for me.
>
>
> Best regards,
>
> Walter

Re: Stanbol Chinese

Posted by Walter Kasper <ka...@dfki.de>.
Hi,

The OSGi bundle declared some package imports that are usually neither
available nor required. I fixed that. Just check out the corrected
pom.xml. On a fresh, clean Stanbol installation langdetect worked fine
for me.
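
For reference, the fix presumably amounts to an Import-Package exclusion in
the maven-bundle-plugin section of the engine's pom.xml. The following is
only an illustrative sketch of that kind of change, not the actual commit;
the excluded package is taken from the error report, the rest is assumed:

<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <extensions>true</extensions>
  <configuration>
    <instructions>
      <!-- assumption: com.google.inject is referenced somewhere in the
           langdetect library but never used at runtime, so the generated
           bundle should not import it -->
      <Import-Package>
        !com.google.inject.*,
        *
      </Import-Package>
    </instructions>
  </configuration>
</plugin>

With such an exclusion the bundle can resolve even if no guice bundle is
deployed in the OSGi container.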

Best regards,

Walter

harish suvarna wrote:
> Thanks Dr Walter. langdetect is very useful. I could successfully compile
> it, but I am unable to load it into Stanbol as I get the error
> ======
> ERROR: Bundle org.apache.stanbol.enhancer.engines.langdetect [177]: Error
> starting/stopping bundle. (org.osgi.framework.BundleException: Unresolved
> constraint in bundle org.apache.stanbol.enhancer.engines.langdetect [177]:
> Unable to resolve 177.0: missing requirement [177.0] package;
> (package=com.google.inject))
> org.osgi.framework.BundleException: Unresolved constraint in bundle
> org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to resolve
> 177.0: missing requirement [177.0] package; (package=com.google.inject)
>      at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
>      at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
>      at org.apache.felix.framework.Felix.setBundleStartLevel(Felix.java:1333)
>      at
> org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:270)
>      at java.lang.Thread.run(Thread.java:680)
> ==============
>
> I added the dependency
> <dependency>
>        <groupId>com.google.inject</groupId>
>        <artifactId>guice</artifactId>
>        <version>3.0</version>
>      </dependency>
>
> but it looks like it is looking for version 1.3.0, which I can't find in
> repo1.maven.org. I am not sure what needs the inject library. The entire
> source of the langdetect plugin does not contain the word inject; only
> the manifest file in target/classes lists it.
>
>
> -harish


-- 
Dr. Walter Kasper
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Tel.:  +49-681-85775-5300
Fax:   +49-681-85775-5338
Email: kasper@dfki.de
-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------


Re: Stanbol Chinese

Posted by harish suvarna <hs...@gmail.com>.
Thanks Dr Walter. langdetect is very useful. I could successfully compile
it, but I am unable to load it into Stanbol as I get the error
======
ERROR: Bundle org.apache.stanbol.enhancer.engines.langdetect [177]: Error
starting/stopping bundle. (org.osgi.framework.BundleException: Unresolved
constraint in bundle org.apache.stanbol.enhancer.engines.langdetect [177]:
Unable to resolve 177.0: missing requirement [177.0] package;
(package=com.google.inject))
org.osgi.framework.BundleException: Unresolved constraint in bundle
org.apache.stanbol.enhancer.engines.langdetect [177]: Unable to resolve
177.0: missing requirement [177.0] package; (package=com.google.inject)
    at org.apache.felix.framework.Felix.resolveBundle(Felix.java:3443)
    at org.apache.felix.framework.Felix.startBundle(Felix.java:1727)
    at org.apache.felix.framework.Felix.setBundleStartLevel(Felix.java:1333)
    at
org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:270)
    at java.lang.Thread.run(Thread.java:680)
==============

I added the dependency
<dependency>
      <groupId>com.google.inject</groupId>
      <artifactId>guice</artifactId>
      <version>3.0</version>
    </dependency>

but it looks like it is looking for version 1.3.0, which I can't find in
repo1.maven.org. I am not sure what needs the inject library. The entire
source of the langdetect plugin does not contain the word inject; only
the manifest file in target/classes lists it.


-harish

On Tue, Jul 31, 2012 at 1:32 AM, Walter Kasper <ka...@dfki.de> wrote:

> Hi Harish,
>
> I checked in a new language identifier for Stanbol based on
> http://code.google.com/p/language-detection/.
> Just check out from Stanbol trunk, install and try out.
>
>
> Best regards,
>
> Walter

Re: Stanbol Chinese

Posted by Walter Kasper <ka...@dfki.de>.
Hi Harish,

I checked in a new language identifier for Stanbol based on 
http://code.google.com/p/language-detection/. Just check out from 
Stanbol trunk, install and try out.
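
For anyone who wants to try the library outside Stanbol first, its core API
is small. Below is a minimal sketch; the "profiles" argument is an
assumption and must point at the directory of language profiles shipped
with the library:

import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;

public class LangIdExample {
    public static void main(String[] args) throws LangDetectException {
        // the language profiles have to be loaded once per JVM
        DetectorFactory.loadProfile("profiles");
        Detector detector = DetectorFactory.create();
        detector.append("这是一段中文文本");
        // prints an ISO 639-1 style code such as "zh-cn"
        System.out.println(detector.detect());
    }
}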

Best regards,

Walter

harish suvarna wrote:
> Rupert,
> My initial debugging for Chinese text told me that the language
> identification done by the langid enhancer using Apache Tika does not
> recognize Chinese. Tika's language detection does not seem to support the
> CJK languages; as a result, Chinese text is identified as Lithuanian
> ('lt'). The Apache Tika group has had an enhancement item registered for
> detecting CJK languages since Feb 2012:
>   https://issues.apache.org/jira/browse/TIKA-856
> I am not sure about the use of language identification in Stanbol yet. Is
> the language id used to select the dbpedia index (the appropriate dbpedia
> language dump) for entity lookups?
>
> I am just thinking that, for my purpose, I could pick option 3, make sure
> the text is in the language of interest, and then call the paoding
> segmenter. Then iterate over the ngrams and do an entityhub lookup. I
> still need to understand how the whole entity lookup for dbpedia works.
>
> I find that the language detection library
> http://code.google.com/p/language-detection/ is very good at language
> detection. It supports 53 languages out of the box and the quality seems
> good. It is Apache 2.0 licensed. I could volunteer to create a new langid
> engine based on this, with the Stanbol community's approval. If anyone
> sheds some light on how to add a new Java library into Stanbol, that
> would be great. I am a maven beginner now.
>
> Thanks,
> harish


-- 
Dr. Walter Kasper
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Tel.:  +49-681-85775-5300
Fax:   +49-681-85775-5338
Email: kasper@dfki.de
-------------------------------------------------------------
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
Firmensitz: Trippstadter Strasse 122, D-67663 Kaiserslautern

Geschaeftsfuehrung:
Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff

Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes

Amtsgericht Kaiserslautern, HRB 2313
-------------------------------------------------------------


Re: Stanbol Chinese

Posted by harish suvarna <hs...@gmail.com>.
Dr Walter,
No problem at all. Thanks. I was trying to use this as a learning
experience for myself.
I look forward to it.
-harish

On Mon, Jul 30, 2012 at 12:18 AM, Rupert Westenthaler <
rupert.westenthaler@gmail.com> wrote:

> On Mon, Jul 30, 2012 at 8:04 AM, Walter Kasper <ka...@dfki.de> wrote:
> > Hi Harish,
> >
> > I can provide a Stanbol wrapper for the
> > http://code.google.com/p/language-detection library as an additional
> > enhancement engine in the next days. I would be interested in evaluating
> > it anyway.
> >
>
> cool thx!
>
> best
> Rupert
>
> > Best regards,
> >
> > Walter

Re: Stanbol Chinese

Posted by Rupert Westenthaler <ru...@gmail.com>.
On Mon, Jul 30, 2012 at 8:04 AM, Walter Kasper <ka...@dfki.de> wrote:
> Hi Harish,
>
> I can provide a Stanbol wrapper for the
> http://code.google.com/p/language-detection library as an additional
> enhancement engine in the next days. I would be interested in evaluating it
> anyway.
>

cool thx!

best
Rupert

> Best regards,
>
> Walter



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Stanbol Chinese

Posted by Walter Kasper <ka...@dfki.de>.
Hi Harish,

I can provide a Stanbol wrapper for the
http://code.google.com/p/language-detection library as an additional
enhancement engine in the next few days. I would be interested in
evaluating it anyway.
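
The wrapper itself should be small: run the detector over the
plain-text content and write the detected language as a dc:language
annotation. A minimal, untested sketch of what I have in mind,
assuming the com.cybozu.labs.langdetect API and the trunk
EnhancementEngine/helper interfaces (class name and profile directory
are placeholders; OSGi registration, configuration and locking are
left out):

import java.io.IOException;
import java.util.Collections;
import java.util.Map.Entry;

import org.apache.clerezza.rdf.core.MGraph;
import org.apache.clerezza.rdf.core.UriRef;
import org.apache.clerezza.rdf.core.impl.PlainLiteralImpl;
import org.apache.clerezza.rdf.core.impl.TripleImpl;
import org.apache.stanbol.enhancer.servicesapi.Blob;
import org.apache.stanbol.enhancer.servicesapi.ContentItem;
import org.apache.stanbol.enhancer.servicesapi.EngineException;
import org.apache.stanbol.enhancer.servicesapi.EnhancementEngine;
import org.apache.stanbol.enhancer.servicesapi.helper.ContentItemHelper;
import org.apache.stanbol.enhancer.servicesapi.helper.EnhancementEngineHelper;
import org.apache.stanbol.enhancer.servicesapi.rdf.Properties;

import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;

public class LangDetectEngine implements EnhancementEngine {

    public LangDetectEngine(String profileDirectory)
            throws LangDetectException {
        // the library loads its per-language profiles once per JVM
        DetectorFactory.loadProfile(profileDirectory);
    }

    public String getName() {
        return "langdetect";
    }

    public int canEnhance(ContentItem ci) throws EngineException {
        return ENHANCE_SYNCHRONOUS;
    }

    public void computeEnhancements(ContentItem ci) throws EngineException {
        try {
            // read the plain/text content of the ContentItem
            Entry<UriRef, Blob> textBlob = ContentItemHelper.getBlob(
                ci, Collections.singleton("text/plain"));
            String text = ContentItemHelper.getText(textBlob.getValue());
            // detect the language of the whole text
            Detector detector = DetectorFactory.create();
            detector.append(text);
            String lang = detector.detect(); // e.g. "zh-cn"
            // write it as a dc:language TextAnnotation
            MGraph metadata = ci.getMetadata();
            UriRef annotation =
                EnhancementEngineHelper.createTextEnhancement(ci, this);
            metadata.add(new TripleImpl(annotation,
                Properties.DC_LANGUAGE, new PlainLiteralImpl(lang)));
        } catch (IOException e) {
            throw new EngineException("Unable to read text content", e);
        } catch (LangDetectException e) {
            throw new EngineException("Language detection failed", e);
        }
    }
}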

Best regards,

Walter



-- 
Dr. Walter Kasper
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Tel.:  +49-681-85775-5300
Fax:   +49-681-85775-5338
Email: kasper@dfki.de


Re: Stanbol Chinese

Posted by harish suvarna <hs...@gmail.com>.
Rupert,
My initial debugging for Chinese text showed that the language
identification done by the langid enhancer using Apache Tika does not
recognize Chinese. Tika's language detection does not seem to support
the CJK languages; as a result, Chinese text gets identified as
Lithuanian ('lt'). The Apache Tika group has an enhancement issue for
detecting CJK languages, registered in Feb 2012:
 https://issues.apache.org/jira/browse/TIKA-856
I am not sure about the use of language identification in Stanbol yet.
Is the language id used to select the DBpedia index (the appropriate
DBpedia language dump) for entity lookups?
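
This is easy to reproduce directly against Tika's LanguageIdentifier
outside of Stanbol (Tika 1.x API; the sample sentence is just an
arbitrary Chinese string):

import org.apache.tika.language.LanguageIdentifier;

public class TikaCjkCheck {
    public static void main(String[] args) {
        // Tika ships no CJK language profiles, so Chinese input is
        // mapped to the closest known profile (observed: "lt")
        LanguageIdentifier id =
            new LanguageIdentifier("上海是中国最大的城市之一");
        System.out.println(id.getLanguage());
        System.out.println(id.isReasonablyCertain());
    }
}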


For my purpose I am thinking of picking option 3: check that the text
is in the language of interest, then call the paoding segmenter,
iterate over the ngrams, and do an Entityhub lookup for each (rough
sketch below). I still need to understand the code around how the
whole entity lookup for DBpedia works.
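
Roughly like the following untested sketch, assuming paoding is used
through its Lucene analyzer (net.paoding.analysis.analyzer.PaodingAnalyzer
with Lucene 3.x token attributes) and the Entityhub FieldQuery API as
used in EntitySearcherUtils; the rdfs:label field, the "zh" language
tag and the injected ReferencedSite are my assumptions:

import java.io.StringReader;

import net.paoding.analysis.analyzer.PaodingAnalyzer;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.stanbol.entityhub.servicesapi.model.Representation;
import org.apache.stanbol.entityhub.servicesapi.query.FieldQuery;
import org.apache.stanbol.entityhub.servicesapi.query.QueryResultList;
import org.apache.stanbol.entityhub.servicesapi.query.TextConstraint;
import org.apache.stanbol.entityhub.servicesapi.site.ReferencedSite;

public class ChineseLookupSketch {

    public static void lookup(ReferencedSite dbpedia, String text)
            throws Exception {
        // paoding is a Lucene analyzer, so segments come out of a
        // TokenStream rather than a dedicated segmenter API
        PaodingAnalyzer analyzer = new PaodingAnalyzer();
        TokenStream tokens =
            analyzer.tokenStream("text", new StringReader(text));
        CharTermAttribute term = tokens.addAttribute(CharTermAttribute.class);
        while (tokens.incrementToken()) {
            String token = term.toString();
            // look the segment up in the referenced site (e.g. dbpedia)
            FieldQuery query = dbpedia.getQueryFactory().createFieldQuery();
            query.setConstraint("http://www.w3.org/2000/01/rdf-schema#label",
                new TextConstraint(token, "zh"));
            query.setLimit(5);
            QueryResultList<Representation> results = dbpedia.find(query);
            for (Representation entity : results) {
                System.out.println(token + " -> " + entity.getId());
            }
        }
        tokens.close();
    }
}

Querying ngrams of neighbouring tokens instead of single tokens would
probably be needed to match multi-word labels.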

I find that the language detection library at
http://code.google.com/p/language-detection/ is very good. It supports
53 languages out of the box and the quality seems good. It is under
the Apache 2.0 license. I could volunteer to create a new langid
engine based on it, with the Stanbol community's approval. If anyone
could shed some light on how to add a new Java library into Stanbol,
that would be great. I am still a Maven beginner.
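
Basic usage of the library is minimal. In the sketch below the
"profiles" argument points at the language profile directory shipped
with the library, and the sample sentence is an arbitrary Chinese
string:

import java.util.ArrayList;

import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;
import com.cybozu.labs.langdetect.Language;

public class LangDetectDemo {
    public static void main(String[] args) throws LangDetectException {
        // load the bundled language profiles (once per JVM)
        DetectorFactory.loadProfile("profiles");
        Detector detector = DetectorFactory.create();
        detector.append("上海是中国最大的城市之一");
        System.out.println(detector.detect());   // e.g. "zh-cn"
        // ranked candidate languages with probabilities
        ArrayList<Language> candidates = detector.getProbabilities();
        System.out.println(candidates);
    }
}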

Thanks,
harish



