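I. Purpose:
This document records in detail what the data cleaning work covered and how it was carried out, and evaluates and summarizes the quality of the cleaned data, so that team members have a clear picture of the data quality.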
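II. Background:
The historical data contains non-standard data, inaccurate data, and duplicate records. To keep the business running normally and efficiently, the data must be fully verified and cleaned, which lays a solid foundation for building the data center.
The functional modules of the data system are expected to go live at the end of April. Before go-live the data must be cleaned, so that clean and valid data can be migrated into the system as the base data for those modules. The data cleaning work lays the data foundation for putting the system into use.
The system will be integrated with the unified identity authentication system. Once the unified identity authentication platform goes live, the system must meet that platform's interface-related functional requirements, so the historical data, once cleaned, needs to be migrated into both systems at the same time to meet business needs.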
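III. References:
1. 《数据清洗总体需求》 (Overall Data Cleaning Requirements)
2. 《数据清洗总体方案》 (Overall Data Cleaning Plan)
3. 《数据清洗规则》 (Data Cleaning Rules)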
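IV. Problem categories covered by this data cleaning:
1. Invalid data
2. Null-value errors
3. Accuracy errors
4. Format errors
5. Duplicate records
6. One code for multiple items
7. Multiple codes for one item
8. Mixed errors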
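V. Quality standards for this data cleaning:
1. Accuracy of the data.
2. Consistency of the data.
3. Uniqueness of the data.
4. Conformity of the data to the agreed standards.
5. Timeliness of the data.
6. Completeness of the data.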
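VI. Issue annotation:
During data cleaning, the problems currently present in the data need to be recorded so that the cleaning can be evaluated. For ease of operation, dedicated columns are added to the database table to hold brief issue annotations.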
VII. Data cleaning:
1. Query the table data:
select * from t_test;
2. Count the total number of rows in the table:
SELECT count(*) FROM t_test;
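3. Add the annotation columns (cleaning round, issue, cleaning result, processing details, issue description):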
alter table t_test add column cleaning_rounds varchar(255) default null comment '清洗轮次' after SECRET;
alter table t_test add column question varchar(255) default null comment '数据问题' after cleaning_rounds;
alter table t_test add column cleaning_results varchar(255) default null comment '清洗结果' after question;
alter table t_test add column detailed_description varchar(255) default null comment '处理细节描述' after cleaning_results;
alter table t_test add column question_description varchar(255) default null comment '问题描述' after question;
4. First round of technical cleaning (marks every row with the round code):
update t_test p set p.cleaning_rounds ='T-F1RC';
5. Query rows where the table field is NULL (invalid data):
select * from t_test p where p.person_code is null;
6. Null-value errors, first query (one field per query):
select * from t_test p where p.person_code='';
select * from t_test p where p.country='';
select * from t_test p where p.name='';
select * from t_test p where p.sex='';
select * from t_test p where p.id_card='';
select * from t_test p where p.birthday='';
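The single-field checks above can also be cross-checked outside the database. The following is a minimal sketch, assuming the rows have been exported as dictionaries; the sample rows and the helper name `empty_fields` are hypothetical and not part of the original procedure. It treats NULL and the empty string as two distinct cases, because in MySQL a NULL value does not match `= ''`.

```python
from typing import Dict, List, Optional

# Columns checked by the single-field queries above.
CHECKED_COLUMNS = ["person_code", "country", "name", "sex", "id_card", "birthday"]


def empty_fields(row: Dict[str, Optional[str]]) -> List[str]:
    """Return the checked columns whose value is NULL (None) or an empty string."""
    return [col for col in CHECKED_COLUMNS if row.get(col) is None or row.get(col) == ""]


# Hypothetical sample rows, for illustration only.
rows = [
    {"person_code": "P001", "country": "CN", "name": "", "sex": "M",
     "id_card": "11010519491231002X", "birthday": "1949/12/31"},
    {"person_code": "P002", "country": None, "name": "Li", "sex": "",
     "id_card": "130503670401001", "birthday": "1967/04/01"},
]

# Map each person_code to the list of its empty columns.
report = {row["person_code"]: empty_fields(row) for row in rows}
print(report)
```

A report of this shape makes it easy to see which rows need a null-value annotation before the annotation update is run.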
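7. Annotate a null-value issue (name is empty; the description '姓名字段为空' means "the name field is empty"):
update t_test p set p.question ='NE',p.question_description='姓名字段为空',p.cleaning_results ='SUWW' where p.person_code='123456';
Verify that the update succeeded:
select * from t_test p where p.question_description='姓名字段为空';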
8. Null-value errors, second query:
select * from t_test p where p.update_ent_code='';
select * from t_test p where p.update_time is null;
9. Null-value errors, third query:
select * from t_test p where p.data_new_update_time is null;
select * from t_test p where p.data_new_update_time ='';
Verification query:
select p.data_new_update_time from t_test p;
10. Outlier query on birthday (birth dates later than 2022/04/12):
select * from t_test p where p.birthday>'2022/04/12';
birthday in descending order:
select ID_CARD,birthday from t_test p order by birthday desc;
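The outlier query compares birthday as text, which is only reliable while every value uses the same separator and zero padding. The sketch below parses the value into a real date before comparing. It assumes birthday is stored as text in either YYYY/MM/DD or YYYY-MM-DD form; the function names are illustrative only.

```python
from datetime import date, datetime
from typing import Optional


def parse_birthday(text: str) -> Optional[date]:
    """Parse 'YYYY/MM/DD' or 'YYYY-MM-DD'; return None if the text is not a valid date."""
    for fmt in ("%Y/%m/%d", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def is_future_birthday(text: str, cutoff: date) -> bool:
    """True only when the text is a valid date strictly later than the cutoff."""
    parsed = parse_birthday(text)
    return parsed is not None and parsed > cutoff


cutoff = date(2022, 4, 12)
print(is_future_birthday("2023/01/05", cutoff))
print(is_future_birthday("2022-12-01", cutoff))
print(parse_birthday("2022/4/31"))
```

In a plain character-by-character comparison, '-' sorts before '/', so a value such as '2022-12-01' would not be reported as later than '2022/04/12'. How MySQL orders the two characters depends on the column collation, so this is worth verifying on the real table.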
11. Accuracy errors
Derive gender from the ID card number ('男性' = male, '女性' = female). This query applies the 18-digit rule to every row, so its result is only meaningful for 18-digit numbers:
select id_card,IF(SUBSTRING(id_card,17,1)%2=1,'男性','女性') as sexnew,sex from t_test;
Derive gender from 18-digit ID numbers (17th digit: odd = male, even = female):
select id_card,IF(SUBSTRING(id_card,17,1)%2=1,'男性','女性') as sexnew,sex from t_test where length(id_card) =18;
Derive gender from 15-digit ID numbers (15th digit: odd = male, even = female):
select * from (select id_card, case when length(id_card)=15 and mod(substring(id_card,15,1),2)=0 then '女性'
when length(id_card)=15 and mod(substring(id_card,15,1),2)=1 then '男性' else null end sex from t_test where (LENGTH(id_card)=15)) a;
Query covering both 18-digit and 15-digit ID numbers:
select * from (
select id_card,
case when length(id_card)=15 and mod(substring(id_card,15,1),2)=0 then '女性'
when length(id_card)=15 and mod(substring(id_card,15,1),2)=1 then '男性'
when length(id_card)=18 and mod(substring(id_card,17,1),2)=0 then '女性'
when length(id_card)=18 and mod(substring(id_card,17,1),2)=1 then '男性'
else null end sex
from t_test where (LENGTH(id_card)=15 or LENGTH(id_card)=18)) a;
18-digit ID numbers:
select * from t_test where length(id_card) =18;
15-digit ID numbers:
select * from t_test where length(id_card) =15;
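The gender rule used by the queries above can be written as one small function, which is handy for spot-checking individual rows. This is a sketch of the same rule, not a replacement for the SQL; the function name is illustrative.

```python
from typing import Optional


def gender_from_id(id_card: Optional[str]) -> Optional[str]:
    """Derive gender from the sequence digit of an ID number.

    18-digit numbers use the 17th character, 15-digit numbers use the 15th.
    An odd digit means male ('男性'), an even digit means female ('女性').
    Any other length, or a non-digit character, yields None.
    """
    if not id_card:
        return None
    if len(id_card) == 18:
        digit = id_card[16]
    elif len(id_card) == 15:
        digit = id_card[14]
    else:
        return None
    if digit not in "0123456789":
        return None
    return "男性" if int(digit) % 2 == 1 else "女性"


print(gender_from_id("11010519491231002X"))
print(gender_from_id("130503670401001"))
print(gender_from_id("12345"))
```

Returning None for unexpected lengths mirrors the `else null` branch of the combined query, so rows with malformed ID numbers are not silently classified.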
Mark 15-digit ID numbers (the description '身份证号不足18位' means "ID number shorter than 18 digits"):
UPDATE t_test p set p.question=CONCAT_WS(',',p.question,'A'),p.question_description=CONCAT_WS(',',p.question_description,'身份证号不足18位'),p.cleaning_results=CONCAT_WS(',',p.cleaning_results,'PI') where p.person_code IN (select c.person_code from (select * from t_test where length(id_card) =15) c);
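The annotation updates rely on how CONCAT_WS treats missing values: NULL arguments are skipped, while empty strings are kept. The sketch below mimics that behavior for a non-NULL separator so the effect on the question column is easy to see; `concat_ws` is an illustrative helper, not a library function.

```python
from typing import Optional


def concat_ws(separator: str, *values: Optional[str]) -> str:
    """Mimic MySQL CONCAT_WS for a non-NULL separator.

    NULL (None) arguments are skipped; empty strings are kept.
    """
    return separator.join(v for v in values if v is not None)


# question is still NULL: the new code is stored on its own.
print(concat_ws(",", None, "A"))
# question already holds a code: the new code is appended.
print(concat_ws(",", "NE", "A"))
# question is an empty string: a leading separator appears.
print(concat_ws(",", "", "A"))
```

Two practical consequences follow: the annotation columns should default to NULL rather than '' (as in the column definitions in step 3), and running the same annotation update twice appends the same code twice.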
Modify annotation records (replace the 'id' placeholders with the actual row ids):
UPDATE t_test p set p.cleaning_results='CI,RW' where p.id IN (select c.id from (select * from t_test where id in ('id','id','id')) c);
Verify the data:
select * from t_test where length(id_card) !=15 and cleaning_results like '%CI,RW%';
ID card number is empty:
select * from t_test where id_card is null or id_card='';
Extract the birth date from 18-digit ID numbers alongside the birthday stored in the table:
select c.ID_CARD,CAST(SUBSTRING(c.id_card,7,8) AS DATE) as id_birthday,c.birthday from t_test c where c.id_card IS not null and LENGTH(c.id_card)=18;
Compare the birth date taken from the ID number with the stored birthday (year, month, and day are compared separately):
select * from (select c.ID_CARD,CAST(SUBSTRING(c.id_card,7,8) AS DATE) as id_birthday,c.birthday from t_test c
where c.id_card IS not null and LENGTH(c.id_card)=18) t where SUBSTRING(t.id_birthday,1,4)!= SUBSTRING(t.birthday,1,4)
or SUBSTRING(t.id_birthday,6,2)!= SUBSTRING(t.birthday,6,2) or SUBSTRING(t.id_birthday,9,2)!= SUBSTRING(t.birthday,9,2);
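The comparison above can be sketched in a few lines outside the database, which also makes the "cannot be checked" case explicit. This sketch assumes birthday is text in YYYY/MM/DD or YYYY-MM-DD form; the function names are illustrative.

```python
from datetime import date, datetime
from typing import Optional


def birth_date_from_id(id_card: str) -> Optional[date]:
    """Characters 7 to 14 of an 18-digit ID number encode the birth date as YYYYMMDD."""
    if len(id_card) != 18:
        return None
    part = id_card[6:14]
    if not (part.isascii() and part.isdigit()):
        return None
    try:
        return datetime.strptime(part, "%Y%m%d").date()
    except ValueError:
        return None


def birthday_matches(id_card: str, birthday: str) -> Optional[bool]:
    """True or False when both sides parse; None when either side cannot be checked."""
    from_id = birth_date_from_id(id_card)
    if from_id is None:
        return None
    try:
        stored = datetime.strptime(birthday.replace("/", "-"), "%Y-%m-%d").date()
    except ValueError:
        return None
    return from_id == stored


print(birthday_matches("11010519491231002X", "1949/12/31"))
print(birthday_matches("11010519491231002X", "1949-12-30"))
print(birthday_matches("130503670401001", "1967/04/01"))
```

Rows for which the result is None (wrong length, or an impossible date such as February 30) deserve a separate look, since they are neither confirmed nor refuted by the comparison.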
12. Format errors
Check whether names contain spaces:
select * from t_test where name like '% %';
Remove the spaces at both ends of the name:
update t_test set name =TRIM(name) where name like '% %';
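The two statements above do slightly different things: `like '% %'` matches a space anywhere in the name, while TRIM removes only leading and trailing spaces. A small sketch with hypothetical names shows which values still contain a space after trimming.

```python
def has_space(name: str) -> bool:
    """Equivalent of name LIKE '% %': true if a space appears anywhere in the name."""
    return " " in name


def trim_spaces(name: str) -> str:
    """Equivalent of MySQL TRIM(name): strips leading and trailing spaces only."""
    return name.strip(" ")


# Hypothetical names, for illustration only.
names = [" Zhang San", "Li Si ", "Wang  Wu", "Zhao"]
cleaned = [trim_spaces(n) for n in names]
still_flagged = [n for n in cleaned if has_space(n)]
print(cleaned)
print(still_flagged)
```

Names with an interior space therefore keep matching the check query after the update, and need a manual decision on whether the space is legitimate.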
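Check the birth date format (lists birthday rendered as YYYY/MM/DD for inspection):
select DATE_FORMAT(p.birthday ,'%Y/%m/%d') as time from t_test p;
Mark the related issue (the description '日期连接符为/' means "the date separator is /"):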
UPDATE t_test p set p.question=CONCAT_WS(',',p.question,'F'),p.question_description=CONCAT_WS(',',p.question_description,'日期连接符为/'),p.cleaning_results=CONCAT_WS(',',p.cleaning_results,'CI') where p.person_code IN (select c.person_code from (select * from t_test ) c);
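13. Correct the birthday date format
update t_test p set p.birthday=replace( p.birthday,'/','-') ;
Update the annotation (the detail text records that the birth date format was corrected from YYYY/MM/DD to YYYY-MM-DD):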
UPDATE t_test p set p.detailed_description=CONCAT_WS(',',p.detailed_description,'出生日期数据格式由YYYY/MM/DD 更正为YYYY-MM-DD') where p.person_code IN (select c.person_code from (select * from t_test ) c);
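Replacing the separator changes only the '/' characters; it neither validates the date nor pads single-digit months and days. The sketch below contrasts the plain replacement with a stricter normalization. It is an optional extra check under the assumption that birthday is stored as text, and the function names are illustrative.

```python
from datetime import datetime
from typing import Optional


def replace_separator(birthday: str) -> str:
    """Same effect as REPLACE(birthday, '/', '-')."""
    return birthday.replace("/", "-")


def to_iso_date(birthday: str) -> Optional[str]:
    """Parse the date and emit zero-padded YYYY-MM-DD, or None if the date is invalid."""
    try:
        parsed = datetime.strptime(replace_separator(birthday), "%Y-%m-%d")
    except ValueError:
        return None
    return parsed.date().isoformat()


print(replace_separator("1990/4/1"))
print(to_iso_date("1990/4/1"))
print(to_iso_date("1990/02/30"))
```

Values for which `to_iso_date` returns None, or returns something different from the plain replacement, are candidates for a further format annotation.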
Group by UPDATE_ENT_CODE:
select UPDATE_ENT_CODE from t_test GROUP BY UPDATE_ENT_CODE
14. Duplicate records
select * from t_test where person_code in (select person_code from t_test GROUP BY person_code having count(person_code)>1);
Mark the issue (the detail text '人员代码数据重复记录' means "duplicate person code record"):
UPDATE t_test p set p.question=CONCAT_WS(',',p.question,'DB'),p.detailed_description=CONCAT_WS(',',p.detailed_description,'人员代码数据重复记录') where p.person_code IN (select c.person_code from (select * from t_test where person_code in (select person_code from t_test GROUP BY person_code having count(person_code)>1)) c);
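Insert the query results into a new table (t_test_1 must already exist with the same structure as t_test):
insert into t_test_1 select * from t_test where person_code in (select person_code from t_test GROUP BY person_code having count(person_code)>1);
Sort by person_code:
select id,PERSON_CODE,ID_CARD,UPDATE_TIME from t_test_1 ORDER BY PERSON_CODE DESC;
Mark the cleaning result for the small number of duplicate rows (replace the 'id' placeholders with the actual row ids):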
UPDATE t_test p set p.cleaning_results=CONCAT_WS(',',p.cleaning_results,'DWW') where p.id IN (select c.id from (select * from t_test where id in ('id','id','id','id','id','id')) c);
Verify the update result:
SELECT * from t_test where cleaning_results LIKE '%DWW%';
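The GROUP BY ... HAVING count(...) > 1 pattern used for duplicate detection can be mirrored on exported rows to decide which row ids go into the marking statement. This is a sketch with hypothetical (id, person_code) pairs.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical rows as (id, person_code) pairs, for illustration only.
rows = [(1, "P001"), (2, "P002"), (3, "P001"), (4, "P003"), (5, "P002"), (6, "P001")]

# Count how often each person_code occurs, like GROUP BY person_code.
counts = Counter(code for _, code in rows)

# Keep only the codes that occur more than once, like HAVING count(person_code) > 1.
duplicated_codes = {code for code, n in counts.items() if n > 1}

# Collect the row ids of every duplicated code so they can be reviewed and marked.
groups: Dict[str, List[int]] = {}
for row_id, code in rows:
    if code in duplicated_codes:
        groups.setdefault(code, []).append(row_id)

print(groups)
```

Grouping the ids per code makes it easier to choose which row of each group to keep before the cleaning result is written.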
15. Import the old data into the new table
INSERT INTO t_test_1 SELECT * from t_test;
16. Pad ID numbers from 15 to 18 digits (inserts the century '19' after the sixth digit and appends the check character)
UPDATE t_test SET ID_CARD = CONCAT(
SUBSTRING(ID_CARD,1,6),'19',SUBSTRING(ID_CARD,7,9),SUBSTRING('10X98765432',
(CAST(SUBSTRING(ID_CARD,1,1)AS SIGNED)*7+
CAST(SUBSTRING(ID_CARD,2,1)AS SIGNED)*9+
CAST(SUBSTRING(ID_CARD,3,1)AS SIGNED)*10+
CAST(SUBSTRING(ID_CARD,4,1)AS SIGNED)*5+
CAST(SUBSTRING(ID_CARD,5,1)AS SIGNED)*8+
CAST(SUBSTRING(ID_CARD,6,1)AS SIGNED)*4+
1*2+
9*1+
CAST(SUBSTRING(ID_CARD,7,1)AS SIGNED)*6+
CAST(SUBSTRING(ID_CARD,8,1)AS SIGNED)*3+
CAST(SUBSTRING(ID_CARD,9,1)AS SIGNED)*7+
CAST(SUBSTRING(ID_CARD,10,1)AS SIGNED)*9+
CAST(SUBSTRING(ID_CARD,11,1)AS SIGNED)*10+
CAST(SUBSTRING(ID_CARD,12,1)AS SIGNED)*5+
CAST(SUBSTRING(ID_CARD,13,1)AS SIGNED)*8+
CAST(SUBSTRING(ID_CARD,14,1)AS SIGNED)*4+
CAST(SUBSTRING(ID_CARD,15,1)AS SIGNED)*2)%11+1,1))
WHERE LENGTH(ID_CARD)=15
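The padding statement above hard-codes the weighted sum digit by digit. The same calculation can be sketched compactly to spot-check a few converted values. The weights and the check-character table are the ones that appear in the SQL; the function names are illustrative, and the century defaults to '19' exactly as the SQL assumes.

```python
# Position weights for the first 17 digits and the check-character table,
# both taken from the SQL statement above.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"


def check_char(first17: str) -> str:
    """Weighted sum of the 17 digits, modulo 11, mapped to the check character."""
    total = sum(int(ch) * w for ch, w in zip(first17, WEIGHTS))
    return CHECK_CHARS[total % 11]


def upgrade_15_to_18(id15: str, century: str = "19") -> str:
    """Insert the century after the 6-digit area code and append the check character."""
    if len(id15) != 15 or not (id15.isascii() and id15.isdigit()):
        raise ValueError("expected a 15-digit ID number")
    first17 = id15[:6] + century + id15[6:]
    return first17 + check_char(first17)


def is_valid_18(id18: str) -> bool:
    """Check the length, the digits, and the check character of an 18-digit number."""
    body = id18[:17]
    return (len(id18) == 18 and body.isascii() and body.isdigit()
            and check_char(body) == id18[17].upper())


print(upgrade_15_to_18("130503670401001"))
print(is_valid_18("11010519491231002X"))
```

Because the century is assumed, anyone whose 15-digit number does not belong to the 1900s would be converted incorrectly, which is one reason the converted numbers are flagged for manual verification.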
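Mark the rows whose padding is complete. After the update above no row has a 15-digit ID number any more, so the rows are selected through the annotation written when the 15-digit numbers were marked (the detail text means "padding completed, please verify that the ID number is correct"):
UPDATE t_test p set p.detailed_description=CONCAT_WS(',',p.detailed_description,'已完成补位,请核对身份证号信息是否正确') where p.question_description like '%身份证号不足18位%';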
17. After padding, check that the new ID numbers do not duplicate existing records
Person ID number and person code:
select * from t_test a where (a.ID_CARD,a.PERSON_CODE) in (select ID_CARD,PERSON_CODE from t_test group by ID_CARD,PERSON_CODE HAVING count(*)>1);
18. Mark valid data
Query the rows where question_description='日期连接符为/':
select * from t_test where question_description='日期连接符为/';
Mark the valid data (assumes the vdata column already exists in t_test):
update t_test set vdata='1' where question_description='日期连接符为/';
19. Create the table for the new master data model
CREATE TABLE `t_test_copy` (
`staffsCodeOfGroup` VARCHAR(255) DEFAULT NULL COMMENT '代码id',
`staffName` VARCHAR(255) DEFAULT NULL COMMENT '姓名',
`staffGender` VARCHAR(255) DEFAULT NULL COMMENT '性别',
`dateOfBirth` DATE DEFAULT NULL COMMENT '出生日期',
`numberOfIdCertificate` VARCHAR(255) DEFAULT NULL COMMENT '身份证号码',
`politicalAffiliation` VARCHAR(255) DEFAULT NULL COMMENT '政治面貌',
`nationality` VARCHAR(255) DEFAULT NULL COMMENT '民族',
`workEmail` VARCHAR(255) DEFAULT NULL COMMENT '工作邮箱',
`workAddressroom` VARCHAR(255) DEFAULT NULL COMMENT '办公地址',
`workTelephone` VARCHAR(255) DEFAULT NULL COMMENT '工作电话',
`moveTelephone` VARCHAR(255) DEFAULT NULL COMMENT '移动电话',
`StaffCitizenship` VARCHAR(255) DEFAULT NULL COMMENT '国籍',
`staffType` VARCHAR(255) DEFAULT NULL COMMENT '员工类型',
`chineseNameOfOrganization` VARCHAR(255) DEFAULT NULL COMMENT '所属单位全称',
`codeOfOrganization` VARCHAR(255) DEFAULT NULL COMMENT '所属单位代码',
`department` VARCHAR(255) DEFAULT NULL COMMENT '所在部门',
`post` VARCHAR(255) DEFAULT NULL COMMENT '岗位名称',
`securityClassification` VARCHAR(255) DEFAULT NULL COMMENT '密级',
`business` VARCHAR(255) DEFAULT NULL COMMENT '职务',
`CN Code` VARCHAR(255) DEFAULT NULL COMMENT '证书CN号'
) ENGINE=INNODB DEFAULT CHARSET=utf8
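20. Move the old data into the new table
Ways to copy table data into a new table:
Copy both the table structure and the data into a new table:
CREATE TABLE new_table SELECT * FROM old_table
Method 1: copy only the table structure into a new table:
CREATE TABLE new_table SELECT * FROM old_table WHERE 1=2
That is, make the WHERE condition never true.
Method 2 (not supported by old MySQL versions: MySQL 4.0.25 does not support it, MySQL 5 does):
CREATE TABLE new_table LIKE old_table
Copy the old table's data into the new table (assuming both tables have the same structure):
INSERT INTO new_table SELECT * FROM old_table
Copy the old table's data into the new table (assuming the two table structures differ):
INSERT INTO new_table(column1,column2,...) SELECT column1,column2,... FROM old_table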
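The copy patterns above can be tried safely on a throwaway in-memory database before they are run against the real tables. The sketch below uses Python's built-in sqlite3 module purely as a stand-in: SQLite requires the AS keyword in CREATE TABLE ... AS SELECT, whereas MySQL allows it to be omitted. The table contents and the mapping of person_code and name onto staffsCodeOfGroup and staffName are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_table (person_code TEXT, name TEXT, id_card TEXT)")
conn.executemany(
    "INSERT INTO old_table VALUES (?, ?, ?)",
    [("P001", "Zhang San", "11010519491231002X"),
     ("P002", "Li Si", "130503196704010016")],
)

# Structure only: the WHERE condition is never true, so no rows are copied.
conn.execute("CREATE TABLE empty_copy AS SELECT * FROM old_table WHERE 1=2")

# Different structures: name the columns on both sides so the mapping is explicit.
conn.execute("CREATE TABLE new_table (staffsCodeOfGroup TEXT, staffName TEXT)")
conn.execute(
    "INSERT INTO new_table (staffsCodeOfGroup, staffName) "
    "SELECT person_code, name FROM old_table"
)

empty_count = conn.execute("SELECT COUNT(*) FROM empty_copy").fetchone()[0]
copied = conn.execute("SELECT * FROM new_table ORDER BY staffsCodeOfGroup").fetchall()
print(empty_count, copied)
```

Listing the columns explicitly is the safer choice whenever the source and target tables differ, because a positional SELECT * would silently put values into the wrong columns.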