Posted to user@uima.apache.org by ro...@ncuindia.edu on 2018/07/02 11:05:38 UTC

Re: Problem in running DUCC Job for Arabic Language

Hey Eddie,

Before sending the data into the JCas, I force encode it:

String content2 = null;
content2 = new String(content.getBytes("UTF-8"), "ISO-8859-1");
jcas.setDocumentText(content2);

And when I get to my first annotator, I force decode it:

String content = null;
content = new String(jcas.getDocumentText().getBytes("ISO-8859-1"), "UTF-8");

Now the text comes through in Arabic without any problem. But I have many analysis engines in my aggregate and I can't hardcode this snippet everywhere.
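
For reference, here is a minimal, self-contained sketch of why that round trip is lossless at all: ISO-8859-1 maps every byte value to exactly one character, so the UTF-8 bytes survive any step that preserves Latin-1 text. (The sample string is arbitrary.)

import java.nio.charset.StandardCharsets;

String original = "استعرض المتحدث";
byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
// Every byte 0x00-0xFF maps to one ISO-8859-1 character, so this
// "smuggles" the UTF-8 bytes through as a pure Latin-1 string.
String smuggled = new String(utf8, StandardCharsets.ISO_8859_1);   // what goes into the CAS
byte[] restored = smuggled.getBytes(StandardCharsets.ISO_8859_1);  // what the annotator does
System.out.println(original.equals(new String(restored, StandardCharsets.UTF_8)));  // prints true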

Maybe there is a problem with the character encoding of the CAS that is sent from the collection reader to the analysis engine. I was thinking that if I could find out which encoding the CAS uses, I could just encode the content to match it and it might work fine.
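
If it would help to confirm what actually leaves the collection reader, one option is to dump the CAS to XMI yourself and inspect the raw bytes (Eddie mentioned the JD ships it as cas.xmi). A minimal sketch, where the output path is just an example and, as far as I know, the static XmiCasSerializer helper writes UTF-8 XML by default:

import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.uima.cas.impl.XmiCasSerializer;
import org.apache.uima.jcas.JCas;

void dumpXmi(JCas jcas) throws Exception {
    try (OutputStream out = new FileOutputStream("/tmp/debug-cas.xmi")) {
        // Serialize the CAS to XMI so the exact bytes can be inspected.
        XmiCasSerializer.serialize(jcas.getCas(), out);
    }
}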

Best Regards
Rohit
On 2018/06/18 17:42:04, Eddie Epstein <ea...@gmail.com> wrote: 
> Hi Rohit,
> 
> In a DUCC job the CAS created by users CR in the Job Driver is serialized
> into cas.xmi format, transported to the Job Process where it is
> deserialized and given to the users analytics. Likely the problem is in CAS
> serialization or deserialization, perhaps due to the active LANG
> environment on the JD or JP machines?
> 
> Eddie
> 
> On Thu, Jun 14, 2018 at 1:48 AM, Rohit yadav <ro...@orkash.com> wrote:
> 
> > Hey,
> >
> > I use DUCC for english language and it works without any problem.
> > But lately i tried deploying a job for Arabic Language and all the content
> > of Arabic Text is replaced by *'?'* (Question Mark).
> >
> > I am extracting Data from Accumlo and after processing i send it to ES6.
> >
> > When i checked the log files of JD it shows that arabic data is coming
> > into CR without any problem.
> > But when i check another log file it shows that the moment data enters
> > into my AE arabic content is replaced by Question mark.
> > Please find the log files attached with this mail.
> >
> > I think this may be a problem of CM because the data is fine inside CR and
> > the most interesting part is that if i try running the same pipeline
> > through CPM  it works without any problem which means DUCC is facing some
> > issue.
> >
> > I'll look forward to your reply.
> >
> > --
> > Best Regards,
> > *Rohit Yadav*
> >
> 

Re: Problem in running DUCC Job for Arabic Language

Posted by Jaroslaw Cwiklik <cw...@apache.org>.
Forgot to mention that if you have a shared file system, the best practice is not to serialize your content (the SOFA, Subject of Analysis) from the JD to the service. Instead, in the CR add the path of the file containing the Subject of Analysis to the CAS, and have the CM in the pipeline read the content from the shared file system.
-jerry
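
A minimal sketch of the CM side of that pattern; the class name and the use of a work queue are hypothetical, and the CR is assumed to have stored only a file:// URI via cas.setSofaDataURI(...), so the text itself never goes through XMI serialization:

import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.AbstractCas;
import org.apache.uima.jcas.JCas;

public class SharedFsCasMultiplier extends JCasMultiplier_ImplBase {

    private final Deque<JCas> pending = new ArrayDeque<>();

    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
        try {
            // The CR in the JD stored only a file:// URI, not the text itself.
            String uri = jcas.getSofa().getSofaURI();
            byte[] bytes = Files.readAllBytes(Paths.get(URI.create(uri)));
            JCas child = getEmptyJCas();
            // Decode with an explicit charset; never rely on the JVM default.
            child.setDocumentText(new String(bytes, StandardCharsets.UTF_8));
            pending.add(child);
        } catch (Exception e) {
            throw new AnalysisEngineProcessException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return !pending.isEmpty();
    }

    @Override
    public AbstractCas next() {
        return pending.poll();
    }
}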


On Tue, Nov 6, 2018 at 9:37 AM Jaroslaw Cwiklik <cw...@apache.org> wrote:

> Can you try setting -Dfile.encoding=ISO-8859-1 for the service (job)
> process and -Djavax.servlet.request.encoding=ISO-8859-1
> -Dfile.encoding=ISO-8859-1 for the JD process.
>
> The JD actually uses Jetty webserver to serve service requests over HTTP.
> I went as far as extracting Jetty server code from JD into a simple http
> server process and also extracted HttpClient related code from the service
> into a simple client process to be able to test.
>
> So on the server side I have:
> String text = new String("استعرض المتحدث باسم قوات «التحالف العربي
> لدعم".getBytes("UTF-8"),"ISO-8859-1");
> response.setHeader("content-type", "text/xml");
> String body = marshall(text);   // XStream serialization
> response.getWriter().write(body);
>
> On the client side:
>       System.out.println("Default Locale:   " + Locale.getDefault());
>       System.out.println("Default Charset:  " + Charset.defaultCharset());
>       System. out.println("file.encoding;    " +
> System.getProperty("file.encoding"));
>
>             HttpResponse response = httpClient.execute(postMethod);
>             HttpEntity entity = response.getEntity();
>                 String content = EntityUtils.toString(entity);
>      String result = (String) unmarshall(content); //XStream unmarshall
>   String o = new String(result.getBytes() );
> System.out.println(o);
>
> When I run with the above -D settings the client console shows:
> Default Locale:   en_US
> Default Charset:  ISO-8859-1
> file.encoding;    ISO-8859-1
>
> استعرض المتحدث باسم قوات «التحالف العربي لدعم
>
> Without the -D's I dont see arabic text and instead see garbage on the
> console.
>
> On Fri, Jul 6, 2018 at 3:00 AM rohit14csu173@ncuindia.edu <
> rohit14csu173@ncuindia.edu> wrote:
>
>> Yes if i run the AE as a DUCC UIMA-AS Service and send it CASes from
>> UIMA-AS client it works fine.
>> Infact the enviornment i.e the LANG argument is same for UIMA-AS Service
>> and DUCC JOB.
>>
>> Environ[3] = LANG=en_IN
>>
>> And if i change the LANG=ar then while getting the data coming in JD the
>> arabic text is already replaced with ???(Question Mark) and the encoding of
>> the data coming in JD or CR  shows ASCII encoding.
>> I don't understand why is this happening.
>>
>> Best
>> Rohit
>>
>>
>> On 2018/07/05 13:35:11, Eddie Epstein <ea...@gmail.com> wrote:
>> > So if you run the AE as a DUCC UIMA-AS service and send it CASes from
>> some
>> > UIMA-AS client it works OK? The full environment for all processes that
>> > DUCC launches are available via ducc-mon under the Specification or
>> > Registry tab for that job or managed reservation or service. Please see
>> if
>> > the LANG setting for the service is different from the LANG setting for
>> the
>> > job.
>> >
>> > One can also see the LANG setting for a linux process-id by doing:
>> >
>> > cat /proc/<pid>/environ
>> >
>> > The LANG to be used for a DUCC process can be set by adding to the
>> > --environment argument "LANG=xxx" as needed
>> >
>> > Thanks,
>> > Eddie
>> >
>> >
>> >
>> > On Thu, Jul 5, 2018 at 6:47 AM, rohit14csu173@ncuindia.edu <
>> > rohit14csu173@ncuindia.edu> wrote:
>> >
>> > > Hey,
>> > >  Yeah you got it right the first snippet comes in CR before the data
>> goes
>> > > in CAS.
>> > > And the second snippet is in the first annotator or analysis
>> engine(AE) of
>> > > my Aggregate Desciptor.
>> > > I am pretty sure this is an issue of the CAS used by DUCC because if
>> i use
>> > > service of DUCC in which we are supposed to send the CAS and receive
>> the
>> > > same CAS with added features from DUCC i get accurate results.
>> > >
>> > > But the only problem comes in submitting a job where the cas is
>> generated
>> > > by DUCC.
>> > > This can also be a issue of the enviornment(Language) of DUCC because
>> the
>> > > default language is english.
>> > >
>> > > Bets Regards
>> > > Rohit
>> > >
>> > > On 2018/07/03 13:11:50, Eddie Epstein <ea...@gmail.com> wrote:
>> > > > Rohit,
>> > > >
>> > > > Before sending the data into jcas if i force encode it :-
>> > > > >
>> > > > > String content2 = null;
>> > > > > content2 = new String(content.getBytes("UTF-8"), "ISO-8859-1");
>> > > > > jcas.setDocumentText(content2);
>> > > > >
>> > > >
>> > > > Where is this code, in the job CR?
>> > > >
>> > > >
>> > > >
>> > > > >
>> > > > > And when i go in my first annotator i force decode it:-
>> > > > >
>> > > > > String content = null;
>> > > > > content = new String(jcas.getDocumentText.getBytes("ISO-8859-1"),
>> > > > > "UTF-8");
>> > > > >
>> > > >
>> > > > And is this in the first annotator of the job process, i.e. the CM?
>> > > >
>> > > > Please be as specific as possible.
>> > > >
>> > > > Thanks,
>> > > > Eddie
>> > > >
>> > >
>> >
>>
>

Re: Problem in running DUCC Job for Arabic Language

Posted by Jaroslaw Cwiklik <cw...@apache.org>.
Can you try setting -Dfile.encoding=ISO-8859-1 for the service (job)
process and -Djavax.servlet.request.encoding=ISO-8859-1
-Dfile.encoding=ISO-8859-1 for the JD process?

The JD actually uses the Jetty webserver to serve service requests over HTTP. I
went as far as extracting the Jetty server code from the JD into a simple HTTP
server process, and also extracted the HttpClient-related code from the service
into a simple client process, to be able to test.

So on the server side I have:

String text = new String("استعرض المتحدث باسم قوات «التحالف العربي لدعم".getBytes("UTF-8"), "ISO-8859-1");
response.setHeader("content-type", "text/xml");
String body = marshall(text);   // XStream serialization
response.getWriter().write(body);

On the client side:

System.out.println("Default Locale:   " + Locale.getDefault());
System.out.println("Default Charset:  " + Charset.defaultCharset());
System.out.println("file.encoding;    " + System.getProperty("file.encoding"));

HttpResponse response = httpClient.execute(postMethod);
HttpEntity entity = response.getEntity();
String content = EntityUtils.toString(entity);
String result = (String) unmarshall(content);   // XStream unmarshall
String o = new String(result.getBytes());
System.out.println(o);

When I run with the above -D settings the client console shows:
Default Locale:   en_US
Default Charset:  ISO-8859-1
file.encoding;    ISO-8859-1

استعرض المتحدث باسم قوات «التحالف العربي لدعم

Without the -D's I don't see Arabic text and instead see garbage on the
console.
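
A minimal standalone check of the same effect, in case it is useful (the class name is just for illustration): new String(byte[]) and getBytes() with no charset argument both use the JVM default charset, which is what the -Dfile.encoding flag changed in the test above.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CharsetCheck {
    public static void main(String[] args) {
        System.out.println("Default Locale:  " + Locale.getDefault());
        System.out.println("Default Charset: " + Charset.defaultCharset());
        String original = "استعرض المتحدث";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // new String(byte[]) with no charset uses the default charset above,
        // so this line only reproduces the text if the default happens to fit.
        System.out.println(new String(utf8));
        // Decoding with an explicit charset is correct regardless of the JVM default.
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}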

On Fri, Jul 6, 2018 at 3:00 AM rohit14csu173@ncuindia.edu <
rohit14csu173@ncuindia.edu> wrote:

> Yes if i run the AE as a DUCC UIMA-AS Service and send it CASes from
> UIMA-AS client it works fine.
> Infact the enviornment i.e the LANG argument is same for UIMA-AS Service
> and DUCC JOB.
>
> Environ[3] = LANG=en_IN
>
> And if i change the LANG=ar then while getting the data coming in JD the
> arabic text is already replaced with ???(Question Mark) and the encoding of
> the data coming in JD or CR  shows ASCII encoding.
> I don't understand why is this happening.
>
> Best
> Rohit
>
>
> On 2018/07/05 13:35:11, Eddie Epstein <ea...@gmail.com> wrote:
> > So if you run the AE as a DUCC UIMA-AS service and send it CASes from
> some
> > UIMA-AS client it works OK? The full environment for all processes that
> > DUCC launches are available via ducc-mon under the Specification or
> > Registry tab for that job or managed reservation or service. Please see
> if
> > the LANG setting for the service is different from the LANG setting for
> the
> > job.
> >
> > One can also see the LANG setting for a linux process-id by doing:
> >
> > cat /proc/<pid>/environ
> >
> > The LANG to be used for a DUCC process can be set by adding to the
> > --environment argument "LANG=xxx" as needed
> >
> > Thanks,
> > Eddie
> >
> >
> >
> > On Thu, Jul 5, 2018 at 6:47 AM, rohit14csu173@ncuindia.edu <
> > rohit14csu173@ncuindia.edu> wrote:
> >
> > > Hey,
> > >  Yeah you got it right the first snippet comes in CR before the data
> goes
> > > in CAS.
> > > And the second snippet is in the first annotator or analysis
> engine(AE) of
> > > my Aggregate Desciptor.
> > > I am pretty sure this is an issue of the CAS used by DUCC because if i
> use
> > > service of DUCC in which we are supposed to send the CAS and receive
> the
> > > same CAS with added features from DUCC i get accurate results.
> > >
> > > But the only problem comes in submitting a job where the cas is
> generated
> > > by DUCC.
> > > This can also be a issue of the enviornment(Language) of DUCC because
> the
> > > default language is english.
> > >
> > > Bets Regards
> > > Rohit
> > >
> > > On 2018/07/03 13:11:50, Eddie Epstein <ea...@gmail.com> wrote:
> > > > Rohit,
> > > >
> > > > Before sending the data into jcas if i force encode it :-
> > > > >
> > > > > String content2 = null;
> > > > > content2 = new String(content.getBytes("UTF-8"), "ISO-8859-1");
> > > > > jcas.setDocumentText(content2);
> > > > >
> > > >
> > > > Where is this code, in the job CR?
> > > >
> > > >
> > > >
> > > > >
> > > > > And when i go in my first annotator i force decode it:-
> > > > >
> > > > > String content = null;
> > > > > content = new String(jcas.getDocumentText.getBytes("ISO-8859-1"),
> > > > > "UTF-8");
> > > > >
> > > >
> > > > And is this in the first annotator of the job process, i.e. the CM?
> > > >
> > > > Please be as specific as possible.
> > > >
> > > > Thanks,
> > > > Eddie
> > > >
> > >
> >
>

Re: Problem in running DUCC Job for Arabic Language

Posted by ro...@ncuindia.edu.
Yes, if I run the AE as a DUCC UIMA-AS service and send it CASes from a UIMA-AS client, it works fine.
In fact the environment, i.e. the LANG argument, is the same for the UIMA-AS service and the DUCC job.

Environ[3] = LANG=en_IN

And if I change to LANG=ar, then the Arabic text coming into the JD is already replaced with ??? (question marks), and the encoding of the data coming into the JD/CR shows as ASCII.
I don't understand why this is happening.

Best
Rohit 


On 2018/07/05 13:35:11, Eddie Epstein <ea...@gmail.com> wrote: 
> So if you run the AE as a DUCC UIMA-AS service and send it CASes from some
> UIMA-AS client it works OK? The full environment for all processes that
> DUCC launches are available via ducc-mon under the Specification or
> Registry tab for that job or managed reservation or service. Please see if
> the LANG setting for the service is different from the LANG setting for the
> job.
> 
> One can also see the LANG setting for a linux process-id by doing:
> 
> cat /proc/<pid>/environ
> 
> The LANG to be used for a DUCC process can be set by adding to the
> --environment argument "LANG=xxx" as needed
> 
> Thanks,
> Eddie
> 
> 
> 
> On Thu, Jul 5, 2018 at 6:47 AM, rohit14csu173@ncuindia.edu <
> rohit14csu173@ncuindia.edu> wrote:
> 
> > Hey,
> >  Yeah you got it right the first snippet comes in CR before the data goes
> > in CAS.
> > And the second snippet is in the first annotator or analysis engine(AE) of
> > my Aggregate Desciptor.
> > I am pretty sure this is an issue of the CAS used by DUCC because if i use
> > service of DUCC in which we are supposed to send the CAS and receive the
> > same CAS with added features from DUCC i get accurate results.
> >
> > But the only problem comes in submitting a job where the cas is generated
> > by DUCC.
> > This can also be a issue of the enviornment(Language) of DUCC because the
> > default language is english.
> >
> > Bets Regards
> > Rohit
> >
> > On 2018/07/03 13:11:50, Eddie Epstein <ea...@gmail.com> wrote:
> > > Rohit,
> > >
> > > Before sending the data into jcas if i force encode it :-
> > > >
> > > > String content2 = null;
> > > > content2 = new String(content.getBytes("UTF-8"), "ISO-8859-1");
> > > > jcas.setDocumentText(content2);
> > > >
> > >
> > > Where is this code, in the job CR?
> > >
> > >
> > >
> > > >
> > > > And when i go in my first annotator i force decode it:-
> > > >
> > > > String content = null;
> > > > content = new String(jcas.getDocumentText.getBytes("ISO-8859-1"),
> > > > "UTF-8");
> > > >
> > >
> > > And is this in the first annotator of the job process, i.e. the CM?
> > >
> > > Please be as specific as possible.
> > >
> > > Thanks,
> > > Eddie
> > >
> >
> 

Re: Problem in running DUCC Job for Arabic Language

Posted by Eddie Epstein <ea...@gmail.com>.
So if you run the AE as a DUCC UIMA-AS service and send it CASes from some
UIMA-AS client it works OK? The full environment for all processes that
DUCC launches is available via ducc-mon under the Specification or
Registry tab for that job, managed reservation, or service. Please see if
the LANG setting for the service is different from the LANG setting for the
job.

One can also see the LANG setting for a linux process-id by doing:

cat /proc/<pid>/environ

The LANG to be used for a DUCC process can be set by adding "LANG=xxx" to the
--environment argument as needed.
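
A quick way to confirm what the launched JVM actually picked up is to log it from inside the pipeline itself; a minimal sketch (the class name is hypothetical):

import java.nio.charset.Charset;
import java.util.Locale;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.util.Level;

public class EnvLoggingAnnotator extends JCasAnnotator_ImplBase {

    @Override
    public void initialize(UimaContext ctx) throws ResourceInitializationException {
        super.initialize(ctx);
        // Logs what this particular DUCC-launched JVM ended up with.
        ctx.getLogger().log(Level.INFO,
                "LANG=" + System.getenv("LANG")
                + " defaultLocale=" + Locale.getDefault()
                + " defaultCharset=" + Charset.defaultCharset());
    }

    @Override
    public void process(JCas jcas) {
        // No-op; this annotator exists only for the environment log above.
    }
}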

Thanks,
Eddie



On Thu, Jul 5, 2018 at 6:47 AM, rohit14csu173@ncuindia.edu <
rohit14csu173@ncuindia.edu> wrote:

> Hey,
>  Yeah you got it right the first snippet comes in CR before the data goes
> in CAS.
> And the second snippet is in the first annotator or analysis engine(AE) of
> my Aggregate Desciptor.
> I am pretty sure this is an issue of the CAS used by DUCC because if i use
> service of DUCC in which we are supposed to send the CAS and receive the
> same CAS with added features from DUCC i get accurate results.
>
> But the only problem comes in submitting a job where the cas is generated
> by DUCC.
> This can also be a issue of the enviornment(Language) of DUCC because the
> default language is english.
>
> Bets Regards
> Rohit
>
> On 2018/07/03 13:11:50, Eddie Epstein <ea...@gmail.com> wrote:
> > Rohit,
> >
> > Before sending the data into jcas if i force encode it :-
> > >
> > > String content2 = null;
> > > content2 = new String(content.getBytes("UTF-8"), "ISO-8859-1");
> > > jcas.setDocumentText(content2);
> > >
> >
> > Where is this code, in the job CR?
> >
> >
> >
> > >
> > > And when i go in my first annotator i force decode it:-
> > >
> > > String content = null;
> > > content = new String(jcas.getDocumentText.getBytes("ISO-8859-1"),
> > > "UTF-8");
> > >
> >
> > And is this in the first annotator of the job process, i.e. the CM?
> >
> > Please be as specific as possible.
> >
> > Thanks,
> > Eddie
> >
>

Re: Problem in running DUCC Job for Arabic Language

Posted by ro...@ncuindia.edu.
Hey,
 Yeah, you got it right: the first snippet is in the CR, before the data goes into the CAS.
And the second snippet is in the first annotator/analysis engine (AE) of my aggregate descriptor.
I am pretty sure this is an issue with the CAS used by DUCC, because if I use a DUCC service, where we send a CAS and receive the same CAS back with added features, I get accurate results.

But the problem only comes when submitting a job, where the CAS is generated by DUCC.
This could also be an issue with the environment (language) of DUCC, because the default language is English.

Best Regards
Rohit

On 2018/07/03 13:11:50, Eddie Epstein <ea...@gmail.com> wrote: 
> Rohit,
> 
> Before sending the data into jcas if i force encode it :-
> >
> > String content2 = null;
> > content2 = new String(content.getBytes("UTF-8"), "ISO-8859-1");
> > jcas.setDocumentText(content2);
> >
> 
> Where is this code, in the job CR?
> 
> 
> 
> >
> > And when i go in my first annotator i force decode it:-
> >
> > String content = null;
> > content = new String(jcas.getDocumentText.getBytes("ISO-8859-1"),
> > "UTF-8");
> >
> 
> And is this in the first annotator of the job process, i.e. the CM?
> 
> Please be as specific as possible.
> 
> Thanks,
> Eddie
> 

Re: Problem in running DUCC Job for Arabic Language

Posted by Eddie Epstein <ea...@gmail.com>.
Rohit,

Before sending the data into jcas if i force encode it :-
>
> String content2 = null;
> content2 = new String(content.getBytes("UTF-8"), "ISO-8859-1");
> jcas.setDocumentText(content2);
>

Where is this code, in the job CR?



>
> And when i go in my first annotator i force decode it:-
>
> String content = null;
> content = new String(jcas.getDocumentText.getBytes("ISO-8859-1"),
> "UTF-8");
>

And is this in the first annotator of the job process, i.e. the CM?

Please be as specific as possible.

Thanks,
Eddie