An Interactive Visual Analysis Method for Multi-Dimensional Data Deduplication
-
Graphical Abstract
-
Abstract
Duplication in multi-dimensional data seriously interferes with data mining, analysis, and application. Traditional data deduplication methods cannot meet the requirements of large-scale data analysis in terms of cost, efficiency, and usability. This paper proposes an interactive visual analysis method for data deduplication. The method extracts high-dimensional feature vectors from multi-dimensional data through representation learning, projects the results into two-dimensional space, and analyzes them with an unsupervised clustering algorithm. Users choose the algorithm and its parameters in the visual analysis interface to gradually filter, identify, and remove duplicate data. Quantitative experiments and user studies are conducted on an extensive dataset from a supply chain integration service group company. The results show that the proposed approach is more effective on complex data deduplication problems than mainstream data cleaning software, such as Trifacta Wrangler and OpenRefine, while achieving more than twice their efficiency and offering significant advantages in learning cost and usability.
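The three stages named in the abstract (feature extraction, projection to two dimensions, unsupervised clustering) can be illustrated with a minimal sketch. It is not the implementation evaluated in this work: hashed character trigrams stand in for learned representations, a fixed random Gaussian projection stands in for the projection step, and threshold-based single-linkage grouping stands in for whichever clustering algorithm the user selects. The function names, the `DIM` and `eps` values, and the sample records are all hypothetical.

```python
import math
import random
import zlib
from typing import List, Sequence, Tuple

DIM = 64  # size of the hashed feature vector (arbitrary choice for this sketch)


def embed(record: Sequence[str], dim: int = DIM) -> List[float]:
    """Hash character trigrams of every field into a unit-length count vector.

    Assumption: this replaces the learned representation with a fixed hash.
    """
    vec = [0.0] * dim
    for field_idx, value in enumerate(record):
        text = f"  {value.strip().lower()}  "  # pad so short values still yield trigrams
        for i in range(len(text) - 2):
            token = f"{field_idx}:{text[i:i + 3]}".encode("utf-8")
            vec[zlib.crc32(token) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def project_2d(vectors: List[List[float]], seed: int = 0) -> List[Tuple[float, ...]]:
    """Map vectors to 2-D with a fixed random Gaussian projection (a stand-in)."""
    if not vectors:
        return []
    rng = random.Random(seed)
    dim = len(vectors[0])
    axes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(2)]
    return [tuple(sum(a * x for a, x in zip(axis, v)) for axis in axes) for v in vectors]


def cluster(points: Sequence[Sequence[float]], eps: float) -> List[int]:
    """Group points whose 2-D distance is at most eps (single linkage, union-find)."""
    parent = list(range(len(points)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(points))]


# Hypothetical records: the first two differ only in letter case.
records = [
    ("Acme Steel Co., Ltd.", "Shanghai", "021-5555-0101"),
    ("ACME Steel Co., Ltd.", "Shanghai", "021-5555-0101"),
    ("Northwind Logistics", "Shenzhen", "0755-5555-0199"),
]
points = project_2d([embed(r) for r in records])
labels = cluster(points, eps=0.05)  # eps is the user-tunable parameter
```

Records that share a label are candidate duplicates for review. In this sketch the first two records always receive the same label because they are identical after case normalization. Raising or lowering `eps` mimics the interactive parameter adjustment described above, with the analyst inspecting each resulting group before removing anything.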
-
-