Posted to dev@tika.apache.org by Michael McCandless <lu...@mikemccandless.com> on 2011/09/01 11:42:24 UTC

Re: svn commit: r1163336 - in /tika/trunk/tika-parsers/src/test: java/org/apache/tika/parser/rtf/ resources/test-documents/

On Tue, Aug 30, 2011 at 5:35 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Tue, Aug 30, 2011 at 9:07 PM,  <mi...@apache.org> wrote:
>> +        assertContains("zażółć gęślą jaźń", content);
>> +        assertContains("ZAŻÓŁĆ GĘŚLĄ JAŹŃ", content);
>
> I think it would be best if we used \uNNNN escapes for non-ASCII
> characters in .java files. Our Maven build already standardizes to
> UTF-8, but there's no guarantee that someone who later edits the file
> uses the correct encoding settings.

Hmm, thinking more about this: are we sure we can't make full use of
UTF-8 in our source files?  It'd make the source much more readable for
those of us working on non-ASCII tests...
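
To make the readability trade-off concrete, here is a minimal sketch (not
anything in Tika; the class and method names are made up for illustration)
that converts a string to the escaped form being proposed. Its own source is
kept ASCII-only, so it should compile the same under any source encoding:

```java
public final class UnicodeEscapes {

    /**
     * Returns s with every non-ASCII UTF-16 code unit replaced by a
     * backslash-u escape with four hex digits. Characters outside the BMP
     * come out as two escapes (a surrogate pair), which Java source accepts.
     */
    static String toUnicodeEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);                                // plain ASCII: keep as-is
            } else {
                sb.append(String.format("\\u%04x", (int) c)); // escape everything else
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The lower-case test phrase from the commit, written with escapes
        String phrase = "za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144";
        System.out.println(toUnicodeEscapes(phrase));
    }
}
```

The escaped form of the 17-character phrase is
za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144, which is
correct but not something I can proofread by eye.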

Also, if someone does edit the file and (say) saves it in the wrong
encoding, their tests will suddenly fail locally and they'd know
something is up?  I think expecting Tika devs to grok source
encoding issues is reasonable?  Or are people actively using editors
that can't handle UTF-8 or something...?
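
A rough sketch of why I'd expect a loud failure rather than a silent one.
This is only an illustration with made-up names, and it models just one
kind of mismatch (UTF-8 bytes decoded as ISO-8859-1); other encoding pairs
garble the text differently, but the expected string stops matching either
way:

```java
import java.nio.charset.StandardCharsets;

public final class WrongEncodingDemo {

    /** Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1. */
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // One word of the test phrase, written with escapes so this file stays ASCII
        String expected = "g\u0119\u015bl\u0105";
        String garbled = misdecode(expected);

        // Each two-byte UTF-8 sequence turns into two unrelated characters,
        // so the garbled string is longer and no longer equals the original.
        System.out.println("equal:   " + garbled.equals(expected));
        System.out.println("lengths: " + expected.length() + " vs " + garbled.length());
    }
}
```

An assertContains on the expected text can't pass against a literal that
was garbled like this, so the person who broke the file finds out on their
next test run.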

Failing that, if we really must only use ASCII for Tika's sources...
shouldn't we fix Maven to enforce this, so that I see an error when
compiling if I use non-ASCII?
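
I don't know offhand which Maven mechanism would be the right place for
this, so the following is just a sketch of the check itself, as a
hypothetical standalone class (AsciiCheck is not an existing tool) that a
build step could run over the source files:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public final class AsciiCheck {

    /** Returns the offset of the first byte with the high bit set, or -1 if all bytes are ASCII. */
    static int firstNonAscii(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0x80) != 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        boolean failed = false;
        for (String name : args) {
            int offset = firstNonAscii(Files.readAllBytes(Paths.get(name)));
            if (offset >= 0) {
                System.err.println(name + ": non-ASCII byte at offset " + offset);
                failed = true;
            }
        }
        // A non-zero exit status lets the calling build step fail
        System.exit(failed ? 1 : 0);
    }
}
```

Checking raw bytes rather than decoded characters means the result doesn't
depend on which encoding the checker itself assumes.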

Mike McCandless

http://blog.mikemccandless.com