Multiple Visual-Semantic Co-Embedding Network for Human-Object Interaction Detection
Abstract: Human-object interaction (HOI) detection is important for understanding human-centric scenes, but it faces two challenges: visual bias caused by the polysemy of verbs, and the difficulty of properly exploiting the hierarchical information and semantic relations in images. To address these, a network that jointly embeds multiple visual features and language priors is proposed, built on a two-branch visual-semantic structure. In the visual branch, the multi-level relations among the human, the object, and the interaction in each human-object pair undergo rich contextual exchange in a hierarchical visual fusion module, adding fine-grained contextual information for relational reasoning. In the semantic branch, the nouns, interaction verbs, and triplet phrases in the interaction triplet labels are jointly encoded into a semantically aggregated consistency graph attention network for message passing and polysemy awareness. Finally, a visual-semantic joint embedding module computes the degree of fit between vision and semantics to produce the interaction triplet detection results. Experimental results show that on the V-COCO dataset, the agent average precision reaches 70.7% and the role average precision reaches 72.4%; on the HICO-DET dataset, under the Default setting, the average precision on the Full, Rare, and Non-Rare categories reaches 35.91%, 33.65%, and 36.28%, respectively. The proposed network outperforms the compared networks and also performs well in few-shot and zero-shot settings.
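The final scoring step described above (projecting visual and semantic features into a shared space and measuring their fit) can be sketched as follows. This is a minimal illustration, not the paper's exact module: the feature dimensions, the linear projections `W_v`/`W_s`, and the use of cosine similarity as the compatibility score are all illustrative assumptions.

```python
import numpy as np

def joint_embedding_score(visual_feat, label_emb, W_v, W_s):
    """Sketch of a visual-semantic joint embedding head.

    visual_feat: (N, d_v) pooled features of N human-object pairs (assumed)
    label_emb:   (K, d_s) embeddings of K interaction triplet labels (assumed)
    W_v, W_s:    projections of each modality into a shared joint space

    Returns an (N, K) matrix of cosine-similarity compatibility scores.
    """
    v = visual_feat @ W_v                              # project visual features
    s = label_emb @ W_s                                # project label embeddings
    v /= np.linalg.norm(v, axis=1, keepdims=True)      # L2-normalize rows
    s /= np.linalg.norm(s, axis=1, keepdims=True)
    return v @ s.T                                     # scores in [-1, 1]

# Usage with random placeholders for features and projections:
rng = np.random.default_rng(0)
scores = joint_embedding_score(
    rng.standard_normal((2, 1024)),   # 2 human-object pairs
    rng.standard_normal((5, 300)),    # 5 candidate triplet labels
    rng.standard_normal((1024, 512)),
    rng.standard_normal((300, 512)),
)
print(scores.shape)  # (2, 5)
```

In a trained model, the projections would be learned so that a pair's visual feature lands near the embedding of its ground-truth triplet label, and the highest-scoring label per pair gives the detected interaction.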