Posted to dev@tika.apache.org by Chris Mattmann <ma...@apache.org> on 2013/10/26 19:48:20 UTC

Re: Having Problem in Word Count and Language Detection

Hi Animesh,

Please detail your issue here on dev@tika.apache.org and I'm sure
someone can help.

Cheers,
Chris


-----Original Message-----
From: Animesh Kumar <an...@gmail.com>
Date: Wednesday, October 23, 2013 9:15 PM
To: "dev-owner@tika.apache.org" <de...@tika.apache.org>
Subject: Fwd: Having Problem in Word Count and Language Detection

>
>
>Sir/Madam,
>I am developing web-based software that uses Apache Tika to get the
>language and word count of an uploaded file. It works fine for English,
>Japanese, Hindi, etc., but gives the wrong word count for Chinese. I am using
>tika-app-1.4.jar.
>There is also a problem with word counting for file formats other than
>doc and docx.
>
>
>-- 
>With Thanks & Regards
>Animesh Kumar
>+918927992397 <tel:%2B918927992397>
>
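
A note on the Chinese word count reported above. The message does not say how the count is computed, so this is only a guess: if the extracted text is split on whitespace, Chinese will be undercounted, because Chinese text normally has no spaces between words. The sketch below assumes the common convention of counting each Han character as one word and each run of other letters or digits as one word. The class and method names are made up for illustration, it is not part of Tika, and it uses only the JDK.

```java
public class CjkAwareWordCount {

    // Counts words in text: every Han character counts as one word,
    // and every run of other letters/digits counts as one word.
    // Whitespace and punctuation only separate tokens (so "don't" counts as 2).
    public static int countWords(String text) {
        int count = 0;
        boolean inToken = false; // true while inside a run of non-Han letters/digits
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                count++;          // each Han character is its own word
                inToken = false;
            } else if (Character.isLetterOrDigit(cp)) {
                if (!inToken) {
                    count++;      // first character of a new token
                }
                inToken = true;
            } else {
                inToken = false;  // whitespace or punctuation ends the token
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Five Han characters, written as Unicode escapes to avoid source-encoding issues.
        String chinese = "\u6211\u559c\u6b22\u9605\u8bfb";
        System.out.println(countWords("Apache Tika extracts text")); // prints 4
        System.out.println(countWords(chinese));                     // prints 5
        System.out.println(chinese.split("\\s+").length);            // prints 1 (naive whitespace split)
    }
}
```

Counting per character is a convention, not real word segmentation. A dictionary-based segmenter would give different numbers, so the right choice depends on what the application means by "word".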



Re: Having Problem in Word Count and Language Detection

Posted by Oleg Tikhonov <ol...@apache.org>.
This one is better:
https://issues.apache.org/jira/browse/TIKA-546



On Sat, Oct 26, 2013 at 10:05 PM, Oleg Tikhonov <ol...@apache.org> wrote:

> Hi Animesh,
> My wild guess is that the N-gram profile for Chinese wasn't trained very
> well. Try recreating the Chinese language profile.
>
> Have a look here:
>
> http://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/section6.html
>
> Hope it helps.

Re: Having Problem in Word Count and Language Detection

Posted by Oleg Tikhonov <ol...@apache.org>.
Hi Animesh,
My wild guess is that the N-gram profile for Chinese wasn't trained very
well. Try recreating the Chinese language profile.

Have a look here:
http://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/section6.html

Hope it helps.
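
To make the N-gram idea concrete, here is a minimal sketch of how profile-based detection works in principle: build a character-trigram frequency table per language from training text, then score new text against each table. This is a simplified stand-in using only the JDK, not Tika's actual implementation. The class name, the cosine-similarity scoring, and the tiny training strings are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class TrigramProfileSketch {

    // Builds a character-trigram frequency table for the given text.
    public static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two profiles: close to 1.0 for identical
    // distributions, 0.0 when they share no trigram (or one is empty).
    public static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy "training" text; a usable profile needs far more data.
        Map<String, Integer> english =
            profile("the quick brown fox jumps over the lazy dog and the cat");
        Map<String, Integer> german =
            profile("der schnelle braune fuchs springt ueber den faulen hund");

        Map<String, Integer> input = profile("the dog and the fox");
        System.out.println(similarity(input, english) > similarity(input, german)); // prints true
    }
}
```

The quality of the result depends on how much representative text went into each profile, which is why a thinly trained profile can misclassify a language.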

