Posted to users@netbeans.apache.org by Emilian Bold <em...@gmail.com> on 2019/12/27 22:29:41 UTC

UTF8 input in NetBeans Output Window broken

Hello,

I can't read some text properly because of the way NetBeans configures
System.in for the running app. The bytes that arrive cannot be decoded
with any charset, so the string I read is corrupted.

I have a Maven app and I already configured globally
-Dfile.encoding=utf-8 for Maven.

I've also added -J-Dfile.encoding=utf-8 for NetBeans just to be sure.

Still, if I paste Γίνεται into the Output window I get the following
from System.in.read():

Charset.defaultCharset: UTF-8
Input Γίνεται   :
Γίνεται
147(-109) 175(-81) 189(-67) 181(-75) 196(-60) 177(-79) 185(-71)

The above are the int (and byte) values.

Below is what .getBytes returns for the actual string constant:

Internal Γίνεται: [-50, -109, -50, -81, -50, -67, -50, -75, -49, -124,
-50, -79, -50, -71]
UTF8 Γίνεται    : [-50, -109, -50, -81, -50, -67, -50, -75, -49, -124,
-50, -79, -50, -71]

So it looks like each character should be encoded as two bytes in
UTF-8, but when I paste the text each character arrives as a single
byte.

Oddly enough, it is *displayed* properly in the editor (for the
constant) and in the Output window for the input text.
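
One pattern in the numbers may help narrow this down: the seven values
read from System.in equal the low 8 bits of each character's UTF-16
code unit (Γ is U+0393, and 0x93 = 147). The sketch below only checks
that arithmetic. The class and the helper lowBytes are hypothetical
names used for illustration, and the check says nothing about where in
NetBeans the truncation would happen.

```java
import java.util.Arrays;

class LowByteCheck {
    // Hypothetical helper: keeps only the low 8 bits of each UTF-16 code unit.
    public static int[] lowBytes(String s) {
        int[] out = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = s.charAt(i) & 0xFF;
        }
        return out;
    }

    public static void main(String[] args) {
        // "Γίνεται" written with escapes so the source file encoding does not matter.
        String x = "\u0393\u03af\u03bd\u03b5\u03c4\u03b1\u03b9";
        // Values reported by System.in.read() in the post.
        int[] observed = {147, 175, 189, 181, 196, 177, 185};

        System.out.println("low bytes: " + Arrays.toString(lowBytes(x)));
        System.out.println("matches observed input: "
                + Arrays.equals(lowBytes(x), observed));
    }
}
```

If that is really what happens, the high byte (0x03 for these Greek
letters) is dropped before the app sees the data, which would explain
why no charset can decode it.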

Does anybody have UTF-8 input working with their configuration?

Sample code:

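    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Scanner;

    // Class name is arbitrary; Scanner(InputStream, Charset) needs Java 10+.
    public class Utf8InputTest {
        public static void main(String[] args) throws IOException {
            System.out.println("Charset.defaultCharset: " + Charset.defaultCharset().displayName());
            String x = "Γίνεται";
            int b;
            System.out.println("Input Γίνεται   : ");
            // Read raw values from System.in until newline (or end of stream).
            List<Integer> bytes = new ArrayList<>();
            while ((b = System.in.read()) != '\n' && b != -1) {
                bytes.add(Integer.valueOf(b));
                // Print the int value and the same value as a signed byte.
                System.out.print(b + "(" + Integer.valueOf(b).byteValue() + ") ");
            }
            System.out.println();
            // What the string constant encodes to, for comparison.
            System.out.println("Internal Γίνεται: " + Arrays.toString(x.getBytes()));
            System.out.println("UTF8 Γίνεται    : " + Arrays.toString(x.getBytes(StandardCharsets.UTF_8)));

            byte[] actualbytes = new byte[bytes.size()];
            for (int i = 0; i < bytes.size(); i++) {
                actualbytes[i] = bytes.get(i).byteValue();
            }
            // Try to decode the received bytes with every available charset.
            Charset.availableCharsets().forEach((name, charset) -> {
                Scanner s = new Scanner(new ByteArrayInputStream(actualbytes), charset);
                String r = s.next();
                // Alternative: String r = new String(actualbytes, charset);
                System.out.println(name + ": " + r);

                // Re-encode with the default charset and compare to the input.
                if (Arrays.equals(actualbytes, r.getBytes())) {
                    System.out.println("======== BINGO!!! ==========");
                }
            });
        }
    }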

Output:

Charset.defaultCharset: UTF-8
Input Γίνεται   :
Γίνεται
147(-109) 175(-81) 189(-67) 181(-75) 196(-60) 177(-79) 185(-71)
Internal Γίνεται: [-50, -109, -50, -81, -50, -67, -50, -75, -49, -124,
-50, -79, -50, -71]
UTF8 Γίνεται    : [-50, -109, -50, -81, -50, -67, -50, -75, -49, -124,
-50, -79, -50, -71]
Big5: �紗腔措
Big5-HKSCS: 𢵧蔥覺�
CESU-8: ����ı�
EUC-JP: �週脹�
EUC-KR: ��슉캇�
GB18030: 摨降谋�
GB2312: ��降谋�
GBK: 摨降谋�
IBM-Thai: lฮา๕D๑๙
IBM00858: ô»¢Á─▒╣
IBM01140: l®¨§D£¾
IBM01141: l®¨@D£¾
IBM01142: l®¨§D£¾
IBM01143: l®¨[D£¾
IBM01144: l®¨@D#¾
IBM01145: l®~§D£¾
IBM01146: l®¨§D[¾
IBM01147: l®~]D#¾
IBM01148: l®¨§D£¾
IBM01149: l®¨§D£¾
IBM037: l®¨§D£¾
IBM1026: l®¨§D£¾
IBM1047: l®]§D£¾
IBM273: l®¨@D£¾
IBM277: l®¨§D£¾
IBM278: l®¨[D£¾
IBM280: l®¨@D#¾
IBM284: l®~§D£¾
IBM285: l®¨§D[¾
IBM290: ツルンvD¢z
IBM297: l®~]D#¾
IBM420: lكنﻸDلﻼ
IBM424: l®¨§D£¾
IBM437: ô»╜╡─▒╣
IBM500: l®¨§D£¾
IBM775: ō»ĮĄ─▒╣
IBM850: ô»¢Á─▒╣
IBM852: ô»ŻÁ─▒╣
IBM855: Њ»йх─▒╣
IBM857: ô»¢Á─▒╣
IBM860: ô»╜╡─▒╣
IBM861: ô»╜╡─▒╣
IBM862: ף»╜╡─▒╣
IBM863: ô»╜╡─▒╣
IBM864: ±ﺥﺵ٥ﺅ١٩
IBM865: ô¤╜╡─▒╣
IBM866: Уп╜╡─▒╣
IBM868: ﭖ»ﺿﺷ─▒╣
IBM869: �»ΞΚ─▒╣
IBM870: lި§DĄŹ
IBM871: l®¨§D£¾
IBM918: lﻋﮔﻑDﻍﮎ
ISO-2022-CN: “¯½µÄ±¹
ISO-2022-JP: �������
ISO-2022-JP-2: �������
ISO-2022-KR: “¯½µÄ±¹
ISO-8859-1: “¯½µÄ±¹
ISO-8859-13: “ƽµÄ±¹
ISO-8859-15: “¯œµÄ±¹
ISO-8859-16: “Żœ”ıč
ISO-8859-2: “Ż˝ľÄąš
ISO-8859-3: “Ż½µÄħı
ISO-8859-4: “¯ŊĩÄąš
ISO-8859-5: “ЏНЕФБЙ
ISO-8859-6: “���ؤ��
ISO-8859-7: “―½΅Δ±Ή
ISO-8859-8: “¯½µ�±¹
ISO-8859-9: “¯½µÄ±¹
JIS_X0201: �ッスオトアケ
JIS_X0212-1990: ����
KOI8-R: ⌠╞╫╣д╠╧
KOI8-U: ⌠╞Ґ╣д╠╧
Shift_JIS: 同スオトアケ
TIS-620: �ฏฝตฤฑน
US-ASCII: �������
UTF-16: 鎯붵쒱�
UTF-16BE: 鎯붵쒱�
UTF-16LE: 꾓떽뇄�
UTF-32: ��
UTF-32BE: ��
UTF-32LE: ��
UTF-8: ����ı�
windows-1250: “Ż˝µÄ±ą
windows-1251: “ЇЅµД±№
windows-1252: “¯½µÄ±¹
windows-1253: “―½µΔ±Ή
windows-1254: “¯½µÄ±¹
windows-1255: “¯½µִ±¹
windows-1256: “¯½µؤ±¹
windows-1257: “ƽµÄ±¹
windows-1258: “¯½µÄ±¹
windows-31j: 同スオトアケ
x-Big5-HKSCS-2001: 𢵧蔥覺�
x-Big5-Solaris: �紗腔措
x-euc-jp-linux: �週脹�
x-EUC-TW: ����
x-eucJP-Open: �週脹�
x-IBM1006: “ﺁﺛﭖﺥﺎﺗ
x-IBM1025: lвЕщDыА
x-IBM1046: ﹿﺳﺿ٥ﺅ١٩
x-IBM1097: lﻎﻟﻗDﻐﮔ
x-IBM1098: ﺑ»¤ﺽ─▒╣
x-IBM1112: l®Ķ§D£¾
x-IBM1122: l®¨[D£¾
x-IBM1123: lвЕщDыА
x-IBM1124: “ЏНЕФБЙ
x-IBM1129: “¯½µÄ±¹
x-IBM1166: lвЕщDыА
x-IBM1364: lᅲ��D��
x-IBM1381: 降谋�
x-IBM1383: “的惫
x-IBM29626C: “�議厩
x-IBM300: ����
x-IBM33722: “�議厩
x-IBM737: Υψ╜╡─▒╣
x-IBM833: lᅲ��D��
x-IBM834: �쫏��
x-IBM856: ף»¢�─▒╣
x-IBM874: �ฏฝตฤฑน
x-IBM875: lσφίDάώ
x-IBM921: “ƽµÄ±¹
x-IBM922: “‾½µÄ±¹
x-IBM930: ツルンvD¢z
x-IBM933: lᅲ��D��
x-IBM935: l���D��
x-IBM937: l���D��
x-IBM939: lモ]ヨD£レ
x-IBM942: 同スオトアケ
x-IBM942C: 同スオトアケ
x-IBM943: 同スオトアケ
x-IBM943C: 同スオトアケ
x-IBM948: 胞慔燅�
x-IBM949: 슉캇�
x-IBM949C: 슉캇�
x-IBM950: 蔥覺�
x-IBM964: “���
x-IBM970: “�돨국
x-ISCII91: �ऒटगदऔछ
x-ISO-2022-CN-CNS: “¯½µÄ±¹
x-ISO-2022-CN-GB: “¯½µÄ±¹
x-iso-8859-11: “ฏฝตฤฑน
x-JIS0208: ����
x-JISAutoDetect: 同スオトアケ
x-Johab: 닖쫏캼�
x-MacArabic: …��٥ؤ١٩
x-MacCentralEurope: ďĮĹĶńĪĻ
x-MacCroatian: ìØΩµƒ±š
x-MacCyrillic: Уѓљµƒ±є
x-MacDingbat: �④❽⑩➄⑥❹
x-MacGreek: ™·ΫΒΡ±Ι
x-MacHebrew: �����
x-MacIceland: ìØΩµƒ±π
x-MacRoman: ìØΩµƒ±π
x-MacRomania: ìŞΩµƒ±π
x-MacSymbol: �↓∝⊗±≠
x-MacThai: ฏฝตฤฑน
x-MacTurkish: ìØΩµƒ±π
x-MacUkraine: Уѓљµƒ±є
x-MS932_0213: 同スオトアケ
x-MS950-HKSCS: 𢵧蔥覺�
x-MS950-HKSCS-XP: 蔥覺�
x-mswin-936: 摨降谋�
x-PCK: 同スオトアケ
x-SJIS_0213: 同スオトアケ
x-UTF-16LE-BOM: 꾓떽뇄�
X-UTF-32BE-BOM: ��
X-UTF-32LE-BOM: ��
x-windows-50220: �������
x-windows-50221: �������
x-windows-874: “ฏฝตฤฑน
x-windows-949: 벏슉캇�
x-windows-950: 蔥覺�
x-windows-iso2022jp: �������
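
For comparison, here is a minimal sketch of what I would expect to
work if stdin delivered the 14 UTF-8 bytes listed above: decode with
an explicit charset instead of relying on file.encoding. The class and
the helper readLineUtf8 are hypothetical names for illustration, and
the byte array only simulates a correctly encoded stdin. It would not
help if the bytes are already truncated before they reach the app.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class Utf8LineReader {
    // Hypothetical helper: decodes the first line of a stream as UTF-8,
    // independent of the JVM default charset.
    public static String readLineUtf8(InputStream in) {
        Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name());
        return scanner.nextLine();
    }

    public static void main(String[] args) {
        // Simulated stdin: the UTF-8 bytes of "Γίνεται" from the post, plus a newline.
        byte[] expected = {-50, -109, -50, -81, -50, -67, -50, -75,
                           -49, -124, -50, -79, -50, -71, '\n'};
        String line = readLineUtf8(new ByteArrayInputStream(expected));

        // Compare against the constant, written with escapes to avoid
        // depending on the source file encoding.
        String x = "\u0393\u03af\u03bd\u03b5\u03c4\u03b1\u03b9";
        System.out.println("decoded correctly: " + line.equals(x));

        // In a real run one would call readLineUtf8(System.in) instead.
    }
}
```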

--emi