Chinese Text Is Garbled While English and Digits Display Fine, and No Encoding Setting Fixes It: A Solution

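At work I found that in csv files exported from a certain company's BI system, none of the Chinese characters displayed correctly, while English letters, digits, line breaks, and tabs all displayed normally.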

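I tried every encoding in Word and Notepad++, and none of them displayed the text correctly, so I suspected that the data had gone through an incorrect second conversion. Repeated experiments confirmed this. The original data was most likely stored in the database as GBK. When exporting the csv file, the program wrongly applied a Windows-1252 to UTF-8 function, which does not support Chinese. It split every two-byte GBK character apart, treated each byte as an accented Latin letter (a character in the decimal range 128-255, such as ÈÕÏú), and then converted those Latin letters to UTF-8, which produced the garbled text.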
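As an illustration of the double conversion described above, the following minimal sketch reproduces it in Java. The class and method names (MojibakeDemo, corrupt) are made up for this example, and the sample text 日销 is an assumption, chosen because its GBK bytes correspond to the ÈÕÏú sequence mentioned above. It is not the BI system's actual export code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    static final Charset GBK = Charset.forName("GBK");
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Mimics the faulty export: GBK bytes are misread as Windows-1252, then saved as UTF-8. */
    static byte[] corrupt(String original) {
        byte[] gbkBytes = original.getBytes(GBK);        // two bytes per Chinese character
        String misread = new String(gbkBytes, CP1252);   // each byte becomes one Latin letter
        return misread.getBytes(StandardCharsets.UTF_8); // each Latin letter becomes two UTF-8 bytes
    }

    public static void main(String[] args) {
        // \u65E5\u9500 is the sample text (assumed for this sketch)
        byte[] fileBytes = corrupt("\u65E5\u9500");
        System.out.println(new String(fileBytes, StandardCharsets.UTF_8));
        System.out.println(fileBytes.length + " bytes in the garbled file");
    }
}
```

Because the garbled file is itself valid UTF-8, no single choice of encoding in an editor can bring the Chinese back; the conversion has to be undone step by step.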

On a pure English Windows system, Notepad++ can convert this kind of garbled text directly.

The steps are:

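1. First make sure the operating system's System Locale is also set to English: under Control Panel -- Region -- Administrative, Language for non-Unicode programs must be set to English as well.

2. Open the garbled file in Notepad++ and click Encoding -- Convert to ANSI in the menu bar. This converts the file to the system's default ANSI code page, which on a US English system is Windows-1252. On a Chinese system this step would convert the file to GBK instead, and the recovery fails. "ANSI" is a generic label whose meaning depends on the system locale; GBK is also an ANSI code page.

3. Then click Encoding -- Character sets -- Chinese -- GB2312 (Simplified Chinese) to reinterpret the underlying bytes as GB2312, and the familiar Chinese characters appear.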
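The two Notepad++ steps above (convert to Windows-1252, then reinterpret the bytes as Chinese) can be sketched in Java for text that fits in memory. This is an illustration only: the class and method names are hypothetical, and it decodes with GBK rather than GB2312 on the assumption that the source data was GBK, which is a superset of GB2312.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRecovery {
    // Assumption: the original text was GBK and was misread as Windows-1252.
    static final Charset CP1252 = Charset.forName("windows-1252");
    static final Charset GBK = Charset.forName("GBK");

    /** Undoes the double conversion for a garbled UTF-8 byte array. */
    static String recover(byte[] garbledUtf8) {
        String garbled = new String(garbledUtf8, StandardCharsets.UTF_8);
        // Like "Convert to ANSI" on an English system: one byte per Latin letter
        byte[] originalBytes = garbled.getBytes(CP1252);
        // Like choosing the Chinese character set: read the same bytes as GBK
        return new String(originalBytes, GBK);
    }

    public static void main(String[] args) {
        // The garbled sample from the text above, written with Unicode escapes
        byte[] garbled = "\u00C8\u00D5\u00CF\u00FA".getBytes(StandardCharsets.UTF_8);
        System.out.println(recover(garbled));
    }
}
```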

If you do not have a pure English Windows machine at hand, you can try Microsoft AppLocale (Win 7) or Locale Emulator (Win 10) to emulate a pure English system environment.

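In addition, I wrote a Java program to solve this problem.

How to use it:

1. Windows and Mac are both supported, but you first need a Java Virtual Machine; please download and install one;

3. After downloading, put the jar and the garbled file to be converted in the same folder;

4. Windows users: search the Start menu for "Command Prompt". Mac users: search for Terminal with Spotlight. Click it once found to open it;

5. In the command-line window, use the cd command to enter the folder that holds the jar. For example, on Windows, if your jar is in D:\文件\, first type D: in the Command Prompt and press Enter, then type cd 文件. Mac has no drive letters, so cd followed by the absolute path is enough, for example cd /Users/username/Desktop/ ;

6. Type java -cp convert_jre1.8x64.jar convert.twoTimeConvert followed by the arguments;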

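The program accepts the following arguments:

inputFilePath, outputFilePath, [inputEncoding], [middleEncoding], [originEncoding], [outputEncoding]

Arguments are separated by spaces. The first two are required; the last four are optional.

inputFilePath: file name of the garbled file to convert.

outputFilePath: file name of the converted file. If the file already exists, it is overwritten.

inputEncoding: the current encoding of the garbled file. In the example above this would be UTF-8. Default: UTF-8.

middleEncoding: the encoding to convert to in the first step. In the example above this would be Windows-1252. Default: Windows-1252.

originEncoding: the encoding the text originally had before it was garbled. In the example above this would be GBK. Default: GBK.

outputEncoding: the encoding of the final output file. In the example above this can be GBK, UTF-8, UTF-16, or any other encoding that supports Chinese characters. Default: UTF-8.

For example: java -cp convert_jre1.8x64.jar convert.twoTimeConvert garbled.txt result.txt UTF-16 Windows-1252 GBK UTF-8

Or: java -cp convert_jre1.8x64.jar convert.twoTimeConvert garbled.txt result.txt UTF-16

Press Enter after typing the command. Any error is printed to the screen. If there is none, result.txt should now be in the folder.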

package convert;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

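/**
 * This program recovers garbled Chinese text, mainly text that was corrupted by a wrong
 * conversion and therefore cannot be fixed by simply choosing a different file encoding.
 * For example, a program takes GBK-encoded text and wrongly runs it through a
 * Windows-1252 to UTF-8 function that does not support Chinese, so every Chinese
 * character turns into accented Latin letters such as Æ·Ãû. In that case the garbled
 * text can be converted from UTF-8 back to Windows-1252 and then decoded as GBK to
 * get the Chinese back.
 * The program takes 2 to 6 arguments:
 * inputFilePath, outputFilePath, [inputEncoding], [middleEncoding], [originEncoding], [outputEncoding]
 * Arguments are separated by spaces. The first two are required; the last four are optional.
 * inputFilePath: path of the garbled file to convert.
 * outputFilePath: path of the converted file. If a file already exists at this path, it is overwritten.
 * inputEncoding: the current encoding of the garbled file. In the example above this would be UTF-8. Default: UTF-8.
 * middleEncoding: the encoding to convert to in the first step. In the example above this would be Windows-1252. Default: Windows-1252.
 * originEncoding: the encoding the text originally had. In the example above this would be GBK. Default: GBK.
 * outputEncoding: the encoding of the final output file. In the example above this can be GBK, UTF-8, UTF-16,
 * or any other encoding that supports Chinese characters. Default: UTF-8.
 * Every encoding supported by Java can be used; see: acle/javase/8/docs/technotes/guides/intl/encoding.doc.html
 * A garbled test file is available on GitHub for testing: github/kind03/Job/blob/master/test
 * @author 何晶 He, Jing
 * @version 1.3 2017/11/9
 */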

public class twoTimeConvert {
    // Large files must be converted in segments. segmentSize is the segment size,
    // 4096 bytes by default; change it if needed.
    // See segmentConvert() for how the file is split.
    public static final int segmentSize = 4096;
    private static String inputCode = "UTF-8";
    // ISO-8859-1 or Windows-1252 are both fine
    private static String middleCode = "Windows-1252";
    private static String originCode = "GBK";
    private static String outputCode = "UTF-8";
    private static String inputPath;
    private static String outputPath;

    public static void main(String[] args) throws IOException {
        if (args.length >= 2) {
            inputPath = args[0];
            outputPath = args[1];
        }
        if (args.length >= 3) inputCode = args[2];
        if (args.length >= 4) middleCode = args[3];
        if (args.length >= 5) originCode = args[4];
        if (args.length >= 6) outputCode = args[5];
        if (args.length > 6 || args.length < 2) {
            System.out.println("Received " + args.length
                    + " arguments. This script requires 2 to 6 arguments: \n"
                    + "inputFilePath, outputFilePath, "
                    + "[inputEncoding], [middleEncoding], [originEncoding], [outputEncoding].\n"
                    + "Arguments should be divided by spaces.");
            return;
        }
        segmentConvert();
    }

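    /**
     * Java's CharsetEncoder engine can only process a limited number of characters at a
     * time, and a String also has limited capacity, so large files must be split up.
     * However, UTF-8 characters vary in length, and after two conversions the original
     * GBK encoding is unrecognizable, so it is hard to tell where each Chinese character
     * starts and ends.
     * The method therefore looks for standard ASCII characters in the UTF-8 data, that is,
     * single bytes with a decimal value in the range 0-127, splits the file into segments
     * right after an ASCII character, and converts the file segment by segment.
     * If not a single ASCII character is found within the segment size, the conversion fails.
     * UTF-16 is converted on the same principle. Because UTF-16 comes in big-endian (BE)
     * and little-endian (LE) variants and the file sometimes starts with a BOM, the method
     * also reads the BOM and uses it to tell BE from LE.
     * The splitting used for UTF-8 works for any other encoding that is compatible with ASCII.
     * @throws IOException if reading the input file or writing the output file fails
     */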

    public static void segmentConvert() throws IOException {
        FileInputStream fis = new FileInputStream(inputPath);
        FileOutputStream fos = new FileOutputStream(outputPath);
        byte[] buffer = new byte[segmentSize];
        int len;
        int counter = 0;
        byte[] validBuffer;
        byte[] combined;
        // left0 and left1 take turns holding the bytes left over after the cut point
        byte[] left0 = null;
        byte[] left1 = null;
        byte[] converted;
        // Read the BOM at the start of the file
        if ("UTF-16".equals(inputCode) || "UTF-16LE".equals(inputCode) || "UTF-16BE".equals(inputCode)) {
            byte[] head = new byte[2];
            fis.read(head);
            if (head[0] == -1 && head[1] == -2) {
                // FF FE: little-endian BOM
                inputCode = "UTF-16LE";
            }
            else if (head[0] == -2 && head[1] == -1) {
                // FE FF: big-endian BOM
                inputCode = "UTF-16BE";
            }
            else {
                // No BOM: these two bytes are data, so carry them into the first segment
                left0 = head;
                counter++;
                // Java decodes UTF-16 without a BOM as big-endian
                if ("UTF-16".equals(inputCode)) inputCode = "UTF-16BE";
            }
        }
        while ((len = fis.read(buffer)) == segmentSize) {
            // to check the value of len
            // System.out.println("len = " + len);
            int i = segmentSize - 1;
            if ("UTF-16LE".equals(inputCode)) {
                // Step back one 2-byte unit at a time so the cut stays aligned
                while (i > 0) {
                    if (buffer[i-1] >= 0 && buffer[i] == 0) {
                        break;
                    }
                    i -= 2;
                    if (i < 1) {
                        // report the error
                        System.out.println("Cannot find an "
                                + "ASCII character (0x0000-0x007F) in a segment size of "
                                + segmentSize + " bytes\n" + "Please adjust the segmentation size.");
                        break;
                    }
                }
            } else if ("UTF-16BE".equals(inputCode)) {
                while (i > 0) {
                    if (buffer[i] >= 0 && buffer[i-1] == 0) {
                        break;
                    }
                    i -= 2;
                    if (i < 1) {
                        // report the error
                        System.out.println("Cannot find an "
                                + "ASCII character (0x0000-0x007F) in a segment size of "
                                + segmentSize + " bytes\n" + "Please adjust the segmentation size.");
                        break;
                    }
                }
            } else {
                // the following segmentation method is not suitable for UTF-16 or UTF-32
                // since they are not compatible with ASCII code
                while (i > -1) {
                    // Java bytes are signed, so a value >= 0 is an ASCII byte (0-127)
                    if (buffer[i] >= 0) {
                        break;
                    }
                    i--;
                    if (i == 0) {
                        // report the error
                        System.out.println("Cannot find an "
                                + "ASCII character (0-127) in a segment size of "
                                + segmentSize + " bytes\n" + "Please adjust the segmentation size.");
                        break;
                    }
                }
            }
            validBuffer = Arrays.copyOf(buffer, i+1);
            if (counter%2 == 0) {
                left0 = Arrays.copyOfRange(buffer, i+1, segmentSize);
                combined = concat(left1, validBuffer);
                left1 = null;
            } else {
                left1 = Arrays.copyOfRange(buffer, i+1, segmentSize);
                combined = concat(left0, validBuffer);
                left0 = null;
            }
            counter++;
            converted = realConvert2(combined, combined.length);
            fos.write(converted);
        }
        // for the end part of the document
        // can't call fis.read(buffer) again: the last read in the while condition already filled buffer
        if (len < segmentSize) {
            // to check the value of len
            System.out.println("len = " + len);
            if (len > 0) {
                validBuffer = Arrays.copyOf(buffer, len);
            } else {
                // in case the file length is a multiple of segmentSize:
                // the last read returns no data, so only what is
                // in left0 or left1 remains to be written
                validBuffer = null;
            }
            if (counter%2 == 0) {
                // there is nothing left when dealing with the last part of the document
                // therefore no need to give value to left0 or left1
                combined = concat(left1, validBuffer);
            } else {
                combined = concat(left0, validBuffer);
            }
            // combined is null when nothing remains to be written (for example, an empty file)
            if (combined != null) {
                converted = realConvert2(combined, combined.length);
                fos.write(converted);
                // for test purpose
                System.out.println("================= last segment check =====================");
                System.out.println(new String(converted));
            }
        }
        fos.close();
        fis.close();
    }

    public static byte[] concat(byte[] a, byte[] b) {
        // for combining two arrays; a null argument is treated as empty
        if (a == null) return b;
        if (b == null) return a;
        int aLen = a.length;
        int bLen = b.length;
        byte[] c = new byte[aLen+bLen];
        System.arraycopy(a, 0, c, 0, aLen);
        System.arraycopy(b, 0, c, aLen, bLen);
        return c;
    }

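    /**
     * The realConvert method used the CharsetEncoder engine, which is cumbersome: it cannot
     * work on a byte[] directly, so the byte[] first has to become a String, then a char[],
     * then a CharBuffer. Using CharsetEncoder to produce UTF-8 also ran into a bug that left
     * a large number of null characters at the end of the result, so realConvert2() is used instead.
     * realConvert2 directly uses the String constructor String(byte[] bytes, String charsetName)
     * and the getBytes(String charsetName) method, which is simpler and clearer.
     * @param in the input byte array
     * @param len the valid length of the byte array. It covers the case where the byte[]
     * filled by FileInputStream.read(byte[] b) contains some invalid elements.
     * If every element of in is valid, pass in.length.
     * @return the converted bytes in the output encoding, or the valid input bytes unchanged
     * if one of the encoding names is not supported
     */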

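    public static byte[] realConvert2(byte[] in, int len) {
        byte[] valid = Arrays.copyOf(in, len);
        try {
            // garbled bytes -> Latin letters
            String step1 = new String(valid, inputCode);
            // Latin letters -> the original bytes, one byte per letter
            byte[] step2 = step1.getBytes(middleCode);
            // original bytes -> Chinese text
            String step3 = new String(step2, originCode);
            // Chinese text -> bytes in the output encoding
            byte[] step4 = step3.getBytes(outputCode);
            return step4;
        } catch (UnsupportedEncodingException e) {
            System.out.println("Unsupported encoding. Please check the "
                    + "supported encodings at: acle/javase/8/docs/technotes/guides/intl/encoding.doc.html");
            e.printStackTrace();
        }
        return valid;
    }
}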
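The splitting rule described in the comment on segmentConvert() can be isolated into a small helper, which makes it easier to test on its own. The following is a minimal sketch for ASCII-compatible input encodings such as UTF-8 only; the class name SegmentSplit and the method name cutAfterLastAscii are hypothetical and are not part of the original program.

```java
class SegmentSplit {
    /**
     * Returns the position just after the last ASCII byte in buf[0..len),
     * or -1 if the range contains no ASCII byte at all.
     */
    static int cutAfterLastAscii(byte[] buf, int len) {
        for (int i = len - 1; i >= 0; i--) {
            // Java bytes are signed: values 0..127 are ASCII, 128..255 appear negative
            if (buf[i] >= 0) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // 'A' followed by the two UTF-8 bytes of one accented Latin letter
        byte[] sample = {0x41, (byte) 0xC3, (byte) 0x88};
        int cut = cutAfterLastAscii(sample, sample.length);
        System.out.println("convert bytes [0, " + cut + "), carry over the rest");
    }
}
```

Cutting right after an ASCII byte is safe because an ASCII character never forms half of a GBK pair boundary: the bytes before the cut always hold whole Chinese characters, and the carried-over tail is prepended to the next segment.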
