Exploratory Partition Method for Continuous Attributes in Quantitative Association Analysis
Abstract: Partitioning continuous attributes is the core problem and main difficulty of quantitative association analysis. To address it, an iterative partitioning method is proposed: the attributes are first partitioned coarsely and association analysis is performed; users then freely explore the resulting association rules and give partitioning suggestions; and the attributes are partitioned further according to those suggestions. A subregion generation algorithm is proposed whose objective function maximizes the confidence of adjacent regions, subject to the constraint that each region satisfies the support threshold, so that it supplies candidate subregions with higher confidence. A visual analysis system is provided that lets users refine the partition by selecting candidate subregions in light of the existing rule data. Users observe rule information and the relationships between rules in a scatter plot and a chord diagram to pick rules of interest, then examine the details of the selected rules in a bar chart and select subregions to form a partitioning suggestion. Because users can inspect multiple rules and select candidate subregions for each one separately, the selected subregions may differ from or contradict one another. To resolve this, a region integration algorithm based on an "over-segment first, then merge" strategy and three merging principles is proposed, which forms the partitioning result used in the next iteration. Case studies on one synthetic dataset, three public datasets, and the Yunnan Province traffic violation accident dataset all yielded high-confidence rules, verifying the effectiveness of the proposed method.
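To make the notion of a candidate subregion concrete, the following is a minimal Python sketch under stated assumptions, not the algorithm proposed in the paper. It assumes a single continuous attribute, a rule of the form "lo <= x < hi implies consequent", and the standard definitions of rule support and confidence. The function names, the fixed grid of cut points, and the ranking by confidence are all hypothetical choices made for illustration; the paper's actual objective over adjacent regions is not specified in the abstract.

```python
from typing import List, Tuple

# Assumed record layout: (attribute value, whether the rule consequent holds).
Record = Tuple[float, bool]


def support_confidence(records: List[Record], lo: float, hi: float) -> Tuple[float, float]:
    """Support and confidence of the rule 'lo <= x < hi => consequent'."""
    total = len(records)
    inside = [hit for x, hit in records if lo <= x < hi]
    if total == 0 or not inside:
        return 0.0, 0.0
    hits = sum(inside)
    support = hits / total            # fraction of all records matching the rule
    confidence = hits / len(inside)   # fraction of records in the interval that match
    return support, confidence


def candidate_subregions(
    records: List[Record], cuts: List[float], min_support: float
) -> List[Tuple[float, float, float, float]]:
    """Enumerate intervals between cut points, keep those meeting the
    support threshold, and rank them by confidence (hypothetical ranking)."""
    candidates = []
    for i in range(len(cuts)):
        for j in range(i + 1, len(cuts)):
            support, confidence = support_confidence(records, cuts[i], cuts[j])
            if support >= min_support:
                candidates.append((cuts[i], cuts[j], support, confidence))
    # Highest confidence first; ties broken by higher support.
    candidates.sort(key=lambda c: (c[3], c[2]), reverse=True)
    return candidates


if __name__ == "__main__":
    data = [(1, False), (2, False), (3, True), (4, True),
            (5, True), (6, False), (7, False), (8, False)]
    for lo, hi, sup, conf in candidate_subregions(data, [0, 3, 6, 9], 0.25):
        print(f"[{lo}, {hi}): support={sup:.3f}, confidence={conf:.3f}")
```

In this toy data the consequent holds only for values 3 to 5, so the narrow interval [3, 6) keeps the same support as the coarse interval [0, 9) while reaching a much higher confidence, which is the kind of refinement that further partitioning is meant to expose.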