Posted to user@tika.apache.org by Philipp Steinkrüger <ph...@uni-koeln.de> on 2016/05/15 14:11:30 UTC

Tika response encoding problem

Dear list,

I am running Tika server 1.14-SNAPSHOT on Debian jessie. I start the server with this command:

java -jar tika-server-1.14-SNAPSHOT.jar

If I send a file for metadata extraction like this

curl -T email.txt http://localhost:9998/meta

the umlauts in the response come back garbled.

The environment variables for the shell from which I start the server as well as execute the curl command are as follows:

LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

I followed this page (https://perlgeek.de/en/article/set-up-a-clean-utf8-environment) to set up a clean Unicode environment. The test case mentioned on that page works fine.

I also tried to use tika-app, since I saw in --help that I can pass the --encoding parameter. So I ran:

(1) java -jar tika-app-1.14-SNAPSHOT.jar --encoding=unicode -m email.txt

and

(2) java -jar tika-app-1.14-SNAPSHOT.jar --encoding=UTF-8 -m email.txt

The output of umlauts does change, but in neither case is it right. For (1) the umlauts are represented by ‘??’; for (2) they are represented by ‘Ã¼’ (that is, a capital A with a tilde on top, followed by the one-quarter sign).
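
Both symptoms are consistent with UTF-8 bytes being decoded as ISO-8859-1 (Latin-1) somewhere along the way. A minimal Python sketch of that mechanism follows; it illustrates the byte-level effect only and is not a diagnosis of what Tika does internally:

```python
# The umlaut from the report, as a Unicode string.
original = "ü"

# UTF-8 encodes "ü" as the two bytes 0xC3 0xBC.
utf8_bytes = original.encode("utf-8")

# Misreading those two bytes as Latin-1 yields two separate characters.
misread = utf8_bytes.decode("latin-1")
print(misread)  # Ã¼

# Writing the misread text to a charset that has neither character
# replaces each of them with a question mark.
print(misread.encode("ascii", errors="replace"))  # b'??'
```

If this is what is happening, the choice of output encoding only changes how the already-misread characters are written; it cannot restore the original umlaut.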

How can I fix this problem? Ultimately, I want to query Tika from a Python script (with Chris Mattmann’s module). If this behaviour can be controlled from within Python, that would be fine for me. But since I also see the problem with curl and tika-app, I suspect it lies in Tika itself.

I’d be very grateful for any assistance!
Best,
Philipp
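
For the Python use case mentioned in the post, here is a minimal sketch that talks to the server using the standard library only (it does not use Chris Mattmann's module). It assumes the server from the post is listening on localhost:9998 and that /meta honours an "Accept: application/json" header and returns a UTF-8 JSON body. The names fetch_meta, build_meta_request and decode_meta are illustrative, not part of any Tika API.

```python
import json
import urllib.request

# Assumption: a Tika server is listening on the address used in the post.
TIKA_META_URL = "http://localhost:9998/meta"


def build_meta_request(payload: bytes, url: str = TIKA_META_URL) -> urllib.request.Request:
    """Build a PUT request like the one `curl -T` sends, asking for JSON."""
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        headers={"Accept": "application/json"},
    )


def decode_meta(body: bytes) -> dict:
    """Decode the response bytes explicitly as UTF-8 before parsing them."""
    return json.loads(body.decode("utf-8"))


def fetch_meta(path: str) -> dict:
    """Send a file to the server and return its metadata as a dict."""
    with open(path, "rb") as handle:
        request = build_meta_request(handle.read())
    with urllib.request.urlopen(request) as response:
        return decode_meta(response.read())
```

Calling fetch_meta("email.txt") would return the metadata as a dictionary. Decoding the response explicitly rules out the client as a source of corruption, but it cannot repair text that was already misread on the server side.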

Re: Tika response encoding problem

Posted by Philipp Steinkrüger <ph...@uni-koeln.de>.

> On 16 May 2016, at 13:04, Allison, Timothy B. <ta...@mitre.org> wrote:
> 
> >>I also tried to use tika-app, since I saw in --help that I can pass the --encoding parameter. So I ran:
>  
> To clarify (you may already understand this, sorry)…the encoding parameter specifies the output encoding; it is not a hint to Tika in encoding detection.

Hi Tim,

Yes, I do understand this. I guess the issues have become a little conflated. But judging from the response to the bug report, they have been separated and are being dealt with individually, so I guess there is nothing for me to do at the moment. If you need further testing, let me know.

Philipp




RE: Tika response encoding problem

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Our AutoDetectReader does correctly identify the encoding in this case.

Do we want to add logic that checks for a declared ?<encoding>? marker and, if none exists, falls back to our AutoDetectReader?
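
A rough Python sketch of that check-then-fall-back idea, reading ?<encoding>? as the charset marker of an RFC 2047 encoded-word (=?charset?Q?...?=). The fallback branch is a simple stand-in heuristic, not Tika's AutoDetectReader, and decode_header_value is a hypothetical name:

```python
import re
from email.header import decode_header, make_header

# Matches an RFC 2047 encoded-word such as =?utf-8?Q?Steinkr=C3=BCger?=
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=")


def decode_header_value(raw: bytes) -> str:
    """Decode a raw header value, preferring a charset the field declares."""
    # Encoded-words are pure ASCII, so non-ASCII bytes can be ignored
    # for the purpose of looking for one.
    as_ascii = raw.decode("ascii", errors="ignore")
    if ENCODED_WORD.search(as_ascii):
        # The field declares its own encoding, so honour it.
        return str(make_header(decode_header(as_ascii)))
    # No declaration: stand-in for charset detection (an assumption for
    # this sketch). Try strict UTF-8 first, then fall back to Latin-1,
    # which accepts any byte sequence.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")
```

With this ordering, a field that declares its charset is never second-guessed, and undeclared UTF-8 bytes (the .txt case in this thread) are still read as UTF-8 rather than as a single-byte charset.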


RE: Tika response encoding problem

Posted by "Allison, Timothy B." <ta...@mitre.org>.

The underlying James mime4j parser isn’t properly detecting UTF-8 in the .txt file. In the .eml file, the fields declare their encoding:

From: =?utf-8?Q?Philipp_Steinkr=C3=BCger?= philipp.steinkrueger@uni-koeln.de

Not sure how we’d want to fix this.
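
To illustrate what the .eml field declares, Python's standard email.header module can split the encoded-word above into its payload bytes and its declared charset. This only shows the header format; it says nothing about how mime4j handles it:

```python
from email.header import decode_header

# The encoded-word from the From: field quoted above.
field = "=?utf-8?Q?Philipp_Steinkr=C3=BCger?="

# decode_header returns (payload, charset) pairs. The charset is taken
# from the encoded-word itself, so no guessing is involved.
[(payload, charset)] = decode_header(field)

print(charset)                  # utf-8
print(payload)                  # b'Philipp Steinkr\xc3\xbcger'
print(payload.decode(charset))  # Philipp Steinkrüger
```

A plain-text file with raw UTF-8 bytes in the same field carries no such charset label, which is why a parser has to detect or assume one.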


RE: Tika response encoding problem

Posted by "Allison, Timothy B." <ta...@mitre.org>.

>>I also tried to use tika-app, since I saw in --help that I can pass the --encoding parameter. So I ran:

To clarify (you may already understand this, sorry)…the encoding parameter specifies the output encoding; it is not a hint to Tika in encoding detection.

With trunk and 1.12 in Tika app’s GUI, I’m getting proper extraction with “Testemail-empty-doesnotwork.eml”, but the umlauts are corrupt with “Test-email-empty-works.txt”. I get the same behavior when I redirect the output to a file:

java -jar tika-app-1.12.jar Testemail-empty-doesnotwork.eml > testOut2.txt



Bizarrely, it looks like both files are being parsed by the RFC822Parser, and when I run the “detect” command-line option -d on both files with 1.12 and trunk, both say RFC822.





