Citation: Wu Zixu, Fu Fangfa, Lu Yu, Wang Jinxiang. Fault-Tolerant Strategy for Topology Reconfiguration of Manycore Systems Based on Message Passing Model[J]. Journal of Computer-Aided Design & Computer Graphics, 2014, 26(11): 2079-2090.

Fault-Tolerant Strategy for Topology Reconfiguration of Manycore Systems Based on Message Passing Model

    Abstract: System fault-recovery time is a key metric for fault tolerance in manycore systems. To accelerate recovery from faults, a fast topology reconfiguration strategy is proposed for manycore systems based on the message passing model. First, a mapping domain is defined for each core according to the fault condition of the physical topology, and the Hungarian algorithm is used to construct an initial solution quickly. Second, with the occurrence of twisted mappings restricted, tabu search rapidly optimizes the initial solution to obtain the final reconfiguration mapping solution. Finally, the node mapping table on each computational node is updated according to the reconfiguration mapping solution, which completes the topology reconfiguration and realizes core-level fault tolerance for the manycore system. Experimental results show that the proposed strategy quickly finds an optimized topology reconfiguration solution and recovers the system successfully, with low time overhead for fault tolerance.
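    The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and almost every concrete choice is an assumption made for illustration: a 3x3 logical mesh placed on a 3x4 physical mesh with one faulty core, a mapping domain defined as all healthy cores within Manhattan distance `radius` of a node's original position, displacement as the Hungarian cost, and total link stretch between logical neighbours as the tabu search objective. The restriction on twisted mappings is not modelled here; only the mapping-domain constraint is enforced. All function names are hypothetical.

```python
from itertools import product

BIG = 10**6  # finite stand-in cost for "outside the mapping domain" (assumption)

def dist(a, b):
    """Manhattan distance between two (row, col) positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def hungarian(cost):
    """Minimum-cost assignment of n rows to m >= n columns, O(n^2 * m)."""
    n, m = len(cost), len(cost[0])
    u, v = [0] * (n + 1), [0] * (m + 1)
    p, way = [0] * (m + 1), [0] * (m + 1)  # p[j]: row matched to column j (1-based)
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [float("inf")] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], float("inf"), 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the augmenting path back to the start
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assign = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assign[p[j] - 1] = j - 1
    return assign

def stretch(assign, logical, cores):
    """Sum of physical distances over all logical mesh links (assumed objective)."""
    pos = {node: cores[assign[i]] for i, node in enumerate(logical)}
    return sum(dist(pos[(r, c)], pos[nb])
               for (r, c) in logical
               for nb in ((r + 1, c), (r, c + 1)) if nb in pos)

def tabu_search(assign, logical, cores, radius, iters=50, tenure=5):
    """Improve a mapping by move/swap steps that stay inside each mapping domain."""
    cur = list(assign)
    best, best_cost = list(cur), stretch(cur, logical, cores)
    tabu = {}  # (node index, core index) -> last iteration at which it is tabu
    for it in range(iters):
        owner = {k: i for i, k in enumerate(cur)}
        cand = None
        for i, k in product(range(len(logical)), range(len(cores))):
            if k == cur[i] or dist(logical[i], cores[k]) > radius:
                continue
            j = owner.get(k)  # node currently on core k, if any (then swap)
            if j is not None and dist(logical[j], cores[cur[i]]) > radius:
                continue
            nxt = list(cur)
            nxt[i] = k
            if j is not None:
                nxt[j] = cur[i]
            c = stretch(nxt, logical, cores)
            # Tabu moves are allowed only if they beat the best cost so far.
            if tabu.get((i, k), -1) >= it and c >= best_cost:
                continue
            if cand is None or c < cand[0]:
                cand = (c, nxt, i, cur[i])
        if cand is None:
            break
        c, cur, i, old = cand
        tabu[(i, old)] = it + tenure  # forbid node i from returning to its old core
        if c < best_cost:
            best, best_cost = list(cur), c
    return best, best_cost

# Toy instance (all values are assumptions for illustration).
radius, faulty = 2, {(1, 1)}
logical = [(r, c) for r in range(3) for c in range(3)]
cores = [(r, c) for r in range(3) for c in range(4) if (r, c) not in faulty]
cost = [[dist(n, k) if dist(n, k) <= radius else BIG for k in cores] for n in logical]

initial = hungarian(cost)                                  # step 1: initial solution
init_disp = sum(cost[i][k] for i, k in enumerate(initial))
init_stretch = stretch(initial, logical, cores)
final, final_stretch = tabu_search(initial, logical, cores, radius)  # step 2
mapping_table = {node: cores[final[i]] for i, node in enumerate(logical)}  # step 3
```

    In this sketch `mapping_table` plays the role of the node mapping table that each computational node would receive. How the table is distributed, and the objective actually optimized in the paper, are not specified in the abstract and are left out.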

     
