Posted to dev@tika.apache.org by ahmad ajiloo <ah...@gmail.com> on 2011/10/31 18:35:07 UTC

A problem in the right-to-left languages

Hello
When I use Tika to extract text from my Persian PDF files, all the
characters come out in reverse order: they run from the beginning of the
line to the end, but left to right. However, when I use the Tika GUI via
Nutch there is no problem and the output text is right-to-left!

The following text is the first line of the attached file in the first
mode (running Tika standalone):
   ﻲﻠﻋ ﺎﻳ ﻮﺗ ﻝﻼﺟ ﺯﺍ ﻢﻧﺯ ﻡﺩ ﻪﻜﻧﺁ ﺕﺭﺪﻗ ﺖﺳﺍﺮﻣ ﻪﻧ ﻲﻣﺮﻜﻣ ﺩﻮﺟ ﺩﻮﺟﻭ ﻪﺑ ﺖﻤﻳﻮﮔ ﻪﻛ
ﺖﺳﺍ ﺲﺑ ﻦﻴﻤﻫ ﻪﻧ ﻱﺪﺑﻮﻣ ﺖﺨﺗ ﻪﺑ ﻱﺍ ﻩﺩﺯ ﺖﻨﻄﻠﺳ ﻪﻴﻜﺗ ﻪﻜﻧﺁ ﻲﺋﻮﺗ

And this is the same line in the second mode (running the Tika GUI via
Nutch), which is clean Persian text:
نه مراست قدرت آنكه دم زنم از جلال تو يا علي      نه همين بس است كه گويمت به
وجود جود مكرمي توئي آنكه تكيه سلطنت زده اي به تخت موبدي

Thanks for your attention
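
For reference, the first sample above looks like the second one stored in
visual (left-to-right) order and built from Arabic presentation-form glyphs
instead of base letters. The following is a minimal Java sketch under that
assumption, using only the JDK. The class and method names are made up for
illustration and are not part of Tika or PDFBox. It reverses a line and
applies NFKC normalization to fold presentation forms back to base letters.

```java
import java.text.Normalizer;

// Hypothetical helper, not Tika or PDFBox code.
final class VisualOrderDemo {

    /**
     * Reverses a purely right-to-left line by code point, then applies
     * NFKC so that Arabic presentation forms become base letters.
     */
    static String toLogicalOrder(String visualLine) {
        String reversed = new StringBuilder(visualLine).reverse().toString();
        return Normalizer.normalize(reversed, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // HEH final form followed by NOON initial form: a two-letter word
        // stored in visual order, as in the garbled sample.
        String visual = "\uFEEA\uFEE7";
        System.out.println(toLogicalOrder(visual));
    }
}
```

This only works for lines that are entirely right-to-left. Lines that mix
in digits or Latin text need a real bidirectional algorithm, which is the
kind of work a library such as ICU4J does.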

Re: A problem in the right-to-left languages

Posted by Ahmad Ajiloo <ah...@gmail.com>.

Hi
Did your inquiry lead to any result?

On Wed, Nov 2, 2011 at 4:40 AM, Ken Krugler <kk...@transpac.com>wrote:

> I know some of the original team members - I could ask.
>
> Are there specific questions, or just "is anybody still minding the fire"?
>
> -- Ken
>
> On Nov 1, 2011, at 2:43pm, Nick Burch wrote:
>
> > On Tue, 1 Nov 2011, Robert Muir wrote:
> >> Well, as an alternative to them committing the EBCDIC detection,
> >> perhaps we could look at the charset detection APIs and propose some
> >> API additions so that users (like Tika) can plug in custom detectors?
> >
> > In theory it should be pluggable, but I seem to recall we needed to
> > tweak a few core bits to get the detector working (around negative
> > matches for control characters).
> >
> > Looking at the svn version history, the ICU4J team don't appear to have
> > done any work on their character detectors in several years. From the
> > lack of responses when I asked on their list about extending them, I
> > fear there may not be anyone left in their project who's interested in
> > charset detectors any more. I'd love to be proved wrong though, if
> > anyone has any personal contacts on the project they could prod about
> > it?
> >
> > Nick
>
> --------------------------
> Ken Krugler
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr

Re: A problem in the right-to-left languages

Posted by Ken Krugler <kk...@transpac.com>.

I know some of the original team members - I could ask.

Are there specific questions, or just "is anybody still minding the fire"?

-- Ken

On Nov 1, 2011, at 2:43pm, Nick Burch wrote:

> On Tue, 1 Nov 2011, Robert Muir wrote:
>> Well, as an alternative to them committing the EBCDIC detection, perhaps we could look at the charset detection APIs and propose some API additions so that users (like Tika) can plug in custom detectors?
> 
> In theory it should be pluggable, but I seem to recall we needed to tweak a few core bits to get the detector working (around negative matches for control characters).
> 
> Looking at the svn version history, the ICU4J team don't appear to have done any work on their character detectors in several years. From the lack of responses when I asked on their list about extending them, I fear there may not be anyone left in their project who's interested in charset detectors any more. I'd love to be proved wrong though, if anyone has any personal contacts on the project they could prod about it?
> 
> Nick

--------------------------
Ken Krugler
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

Re: A problem in the right-to-left languages

Posted by Nick Burch <ni...@alfresco.com>.

On Tue, 1 Nov 2011, Robert Muir wrote:
> Well, as an alternative to them committing the EBCDIC detection, perhaps
> we could look at the charset detection APIs and propose some API
> additions so that users (like Tika) can plug in custom detectors?

In theory it should be pluggable, but I seem to recall we needed to tweak a
few core bits to get the detector working (around negative matches for
control characters).

Looking at the svn version history, the ICU4J team don't appear to have 
done any work on their character detectors in several years. From the lack 
of responses when I asked on their list about extending them, I fear there 
may not be anyone left in their project who's interested in charset 
detectors any more. I'd love to be proved wrong though, if anyone has any 
personal contacts on the project they could prod about it?

Nick

Re: A problem in the right-to-left languages

Posted by Robert Muir <rc...@gmail.com>.

On Tue, Nov 1, 2011 at 12:47 PM, Nick Burch <ni...@alfresco.com> wrote:

> I've not had any luck with this - I tried submitting some of our changes
> back (e.g. the EBCDIC detector), but they didn't seem to want them.
>

Well, as an alternative to them committing the EBCDIC detection,
perhaps we could look at the charset detection APIs and propose some
API additions so that users (like Tika) can plug in custom detectors?

-- 
lucidimagination.com
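
The kind of plug-in point suggested here could look roughly like the sketch
below. It is purely hypothetical and is not the ICU4J or Tika API:
EncodingGuesser, Utf8BomGuesser and GuesserChain are invented names, and
the only detector shown is a trivial UTF-8 byte-order-mark check standing
in for a custom (for example EBCDIC) detector.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical extension point: each guesser either names a charset or abstains.
interface EncodingGuesser {
    Optional<Charset> guess(byte[] data);
}

// Example plug-in: recognises the UTF-8 byte order mark (EF BB BF).
final class Utf8BomGuesser implements EncodingGuesser {
    @Override
    public Optional<Charset> guess(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Optional.of(StandardCharsets.UTF_8) : Optional.empty();
    }
}

// Tries the registered guessers in order and falls back to a default charset.
final class GuesserChain {
    private final List<EncodingGuesser> guessers;
    private final Charset fallback;

    GuesserChain(List<EncodingGuesser> guessers, Charset fallback) {
        this.guessers = new ArrayList<>(guessers);
        this.fallback = fallback;
    }

    Charset detect(byte[] data) {
        for (EncodingGuesser guesser : guessers) {
            Optional<Charset> result = guesser.guess(data);
            if (result.isPresent()) {
                return result.get();
            }
        }
        return fallback;
    }
}
```

With a shape like this, a project could register its own detectors ahead
of the built-in ones without forking the detection code.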

Re: A problem in the right-to-left languages

Posted by Nick Burch <ni...@alfresco.com>.

On Tue, 1 Nov 2011, Robert Muir wrote:
> It would be nice to look at removing the forked charset detection code
> too (get whatever changes Tika has into ICU, etc.).

I've not had any luck with this - I tried submitting some of our changes 
back (e.g. the EBCDIC detector), but they didn't seem to want them.

Nick

Re: A problem in the right-to-left languages

Posted by Robert Muir <rc...@gmail.com>.

On Tue, Nov 1, 2011 at 9:14 AM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Tue, Nov 1, 2011 at 1:48 PM, Robert Muir <rc...@gmail.com> wrote:
>> I really think Tika should include the parts of ICU4J it depends on.
>> Often open source projects are hesitant to include the ICU jar because
>> of its size, but that's silly since the size is just a catch-all.
>> We can use the webapp to make a smaller one that includes the minimum
>> of stuff Tika needs. http://apps.icu-project.org/datacustom/
>
> We need a version that's available on the central Maven repository.
>

Perhaps as a start, we could include the whole ICU jar from Maven, and
look at 'trimming' as an optimization?

It would be nice to look at removing the forked charset detection code
too (get whatever changes Tika has into ICU, etc.).

-- 
lucidimagination.com

Re: A problem in the right-to-left languages

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Tue, Nov 1, 2011 at 1:48 PM, Robert Muir <rc...@gmail.com> wrote:
> I really think Tika should include the parts of ICU4J it depends on.
> Often open source projects are hesitant to include the ICU jar because
> of its size, but that's silly since the size is just a catch-all.
> We can use the webapp to make a smaller one that includes the minimum
> of stuff Tika needs. http://apps.icu-project.org/datacustom/

We need a version that's available on the central Maven repository.

> Maybe we should open a JIRA issue to fix this? I think it's a bug that
> Arabic and Persian text silently comes out corrupted if you don't have
> this on your classpath.

+1

BR,

Jukka Zitting

Re: A problem in the right-to-left languages

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Tue, Nov 1, 2011 at 8:48 AM, Robert Muir <rc...@gmail.com> wrote:

> I really think Tika should include the parts of ICU4J it depends on.
> Often open source projects are hesitant to include the ICU jar because
> of its size, but that's silly since the size is just a catch-all.
> We can use the webapp to make a smaller one that includes the minimum
> of stuff Tika needs. http://apps.icu-project.org/datacustom/
>
> Maybe we should open a JIRA issue to fix this? I think it's a bug that
> Arabic and Persian text silently comes out corrupted if you don't have
> this on your classpath.

+1

I think it's awful to just silently produce bad results.

Mike McCandless

http://blog.mikemccandless.com
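
One way to avoid silently passing on bad results is to flag suspicious
output. The sketch below is a JDK-only heuristic, not something Tika does:
it assumes that extracted text containing Arabic presentation-form
characters (as in the garbled sample earlier in this thread) was not
converted to logical order. The class and method names are hypothetical.

```java
// Hypothetical sanity check for extracted text; not part of Tika.
final class RtlSanityCheck {

    /** Returns true if the text contains Arabic presentation-form code points. */
    static boolean containsPresentationForms(String text) {
        return text.codePoints().anyMatch(cp -> {
            if (cp == 0xFEFF) {
                return false; // byte order mark shares the block but is not a glyph form
            }
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            return block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                    || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
        });
    }

    /** Fails loudly instead of returning text that looks like raw glyph output. */
    static String requireLogicalText(String text) {
        if (containsPresentationForms(text)) {
            throw new IllegalStateException(
                    "Extracted text contains Arabic presentation forms; "
                    + "right-to-left handling may be missing");
        }
        return text;
    }
}
```

Some documents may contain presentation forms legitimately, so a warning
could be a better fit than an exception in practice.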

Re: A problem in the right-to-left languages

Posted by Robert Muir <rc...@gmail.com>.

On Tue, Nov 1, 2011 at 6:24 AM, Ahmad Ajiloo <ah...@gmail.com> wrote:
> Yes, there is a difference. In Nutch we have an ICU4J library in the lib
> directory, but there is no ICU4J library or class file in a single Tika
> jar file. For example, in the pdfbox jar file we have this path:
> com.ibm.icu, but there is no com.ibm path in a Tika jar file.
> How can I add the ICU4J library to the Tika jar file?
>

I really think Tika should include the parts of ICU4J it depends on.
Often open source projects are hesitant to include the ICU jar because
of its size, but that's silly since the size is just a catch-all.
We can use the webapp to make a smaller one that includes the minimum
of stuff Tika needs. http://apps.icu-project.org/datacustom/

Maybe we should open a JIRA issue to fix this? I think it's a bug that
Arabic and Persian text silently comes out corrupted if you don't have
this on your classpath.

-- 
lucidimagination.com

Re: A problem in the right-to-left languages

Posted by Ahmad Ajiloo <ah...@gmail.com>.

Yes, there is a difference. In Nutch we have an ICU4J library in the lib
directory, but there is no ICU4J library or class file in a single Tika
jar file. For example, in the pdfbox jar file we have this path:
com.ibm.icu, but there is no com.ibm path in a Tika jar file.
How can I add the ICU4J library to the Tika jar file?
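
To confirm a difference like this at runtime, and not just by inspecting
jar contents, one option is to ask the class loader directly. This is a
minimal sketch using only the JDK. The helper class is hypothetical, and
com.ibm.icu.text.Bidi is used as the probe on the assumption that it is one
of the ICU4J classes involved in right-to-left handling.

```java
// Hypothetical runtime probe for an optional dependency.
final class Icu4jCheck {

    /** Returns true if the named class can be loaded, without initializing it. */
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, Icu4jCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: this ICU4J class is what the PDF text extraction needs.
        System.out.println("ICU4J Bidi available: "
                + isClassAvailable("com.ibm.icu.text.Bidi"));
    }
}
```

Running this with the same classpath in both setups should show whether
ICU4J is visible in each. Rebuilding the Tika jar should not be necessary:
putting the ICU4J jar on the classpath next to it is normally enough.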

On Mon, Oct 31, 2011 at 10:49 PM, Robert Muir <rc...@gmail.com> wrote:

> Do you have ICU4J jar in your classpath in both situations?
>
> On Mon, Oct 31, 2011 at 1:35 PM, ahmad ajiloo <ah...@gmail.com>
> wrote:
> > Hello
> > When I use Tika to extract text from my Persian PDF files, all the
> > characters come out in reverse order: they run from the beginning of
> > the line to the end, but left to right. However, when I use the Tika
> > GUI via Nutch there is no problem and the output text is right-to-left!
> >
> > The following text is the first line of the attached file in the first
> > mode (running Tika standalone):
> >    ﻲﻠﻋ ﺎﻳ ﻮﺗ ﻝﻼﺟ ﺯﺍ ﻢﻧﺯ ﻡﺩ ﻪﻜﻧﺁ ﺕﺭﺪﻗ ﺖﺳﺍﺮﻣ ﻪﻧ ﻲﻣﺮﻜﻣ ﺩﻮﺟ ﺩﻮﺟﻭ ﻪﺑ ﺖﻤﻳﻮﮔ ﻪﻛ
> ﺖﺳﺍ
> > ﺲﺑ ﻦﻴﻤﻫ ﻪﻧ ﻱﺪﺑﻮﻣ ﺖﺨﺗ ﻪﺑ ﻱﺍ ﻩﺩﺯ ﺖﻨﻄﻠﺳ ﻪﻴﻜﺗ ﻪﻜﻧﺁ ﻲﺋﻮﺗ
> >
> > And this is the same line in the second mode (running the Tika GUI via
> > Nutch), which is clean Persian text:
> > نه مراست قدرت آنكه دم زنم از جلال تو يا علي      نه همين بس است كه گويمت
> به
> > وجود جود مكرمي توئي آنكه تكيه سلطنت زده اي به تخت موبدي
> >
> > Thanks for your attention
>
> --
> lucidimagination.com
>

Re: A problem in the right-to-left languages

Posted by Robert Muir <rc...@gmail.com>.

Do you have the ICU4J jar on your classpath in both situations?

On Mon, Oct 31, 2011 at 1:35 PM, ahmad ajiloo <ah...@gmail.com> wrote:
> Hello
> When I use Tika to extract text from my Persian PDF files, all the
> characters come out in reverse order: they run from the beginning of the
> line to the end, but left to right. However, when I use the Tika GUI via
> Nutch there is no problem and the output text is right-to-left!
>
> The following text is the first line of the attached file in the first
> mode (running Tika standalone):
>    ﻲﻠﻋ ﺎﻳ ﻮﺗ ﻝﻼﺟ ﺯﺍ ﻢﻧﺯ ﻡﺩ ﻪﻜﻧﺁ ﺕﺭﺪﻗ ﺖﺳﺍﺮﻣ ﻪﻧ ﻲﻣﺮﻜﻣ ﺩﻮﺟ ﺩﻮﺟﻭ ﻪﺑ ﺖﻤﻳﻮﮔ ﻪﻛ ﺖﺳﺍ
> ﺲﺑ ﻦﻴﻤﻫ ﻪﻧ ﻱﺪﺑﻮﻣ ﺖﺨﺗ ﻪﺑ ﻱﺍ ﻩﺩﺯ ﺖﻨﻄﻠﺳ ﻪﻴﻜﺗ ﻪﻜﻧﺁ ﻲﺋﻮﺗ
>
> And this is the same line in the second mode (running the Tika GUI via
> Nutch), which is clean Persian text:
> نه مراست قدرت آنكه دم زنم از جلال تو يا علي      نه همين بس است كه گويمت به
> وجود جود مكرمي توئي آنكه تكيه سلطنت زده اي به تخت موبدي
>
> Thanks for your attention

-- 
lucidimagination.com