Abstract:
In machine learning applications, the quality of training data is difficult to ensure because the data come from varied sources and some annotators are inexperienced. By tightly integrating machine learning and visualization, visual analytics techniques bring humans into the loop of data quality analysis and improvement, thereby enhancing training data quality and improving model performance. In this survey, we first summarize the main types of training data quality issues: inaccurate annotations, low coverage, and insufficient annotations. Based on these issue types, we categorize and summarize the relevant visual analytics approaches, including methods for correcting inaccurate annotations, reducing dataset biases, and enhancing the quality of unlabeled data. Finally, we discuss the opportunities and challenges in using visual analytics to improve training data quality, including scenarios involving complex tasks, large language models, multimodal data, and streaming data.