Multiple Visual Semantic Co-Embedding for Human-Object Interaction Detection Networks
Abstract
Human-object interaction (HOI) detection is essential for understanding human-centered scenes, but it faces two challenges: visual bias caused by the polysemy of verbs, and the difficulty of making effective use of the hierarchical information and semantic relations in images. To address these problems, a network combining multiple visual features with language priors is proposed, built on a dual-branch visual-semantic structure. In the visual branch, a hierarchical visual fusion module exchanges rich contextual information across the multi-level relationships among humans, objects, and interactions, and supplies fine-grained context for relational reasoning. In the semantic branch, the nouns, verbs, and phrases in interaction triplet labels are encoded into a semantic agreement graph. A visual-semantic joint embedding module then measures the degree of fit between vision and semantics to produce the final interaction triplet detections. On the V-COCO dataset, the average precision for agent and role detection reaches 70.7% and 72.4%, respectively. On the HICO-DET dataset under the Default setting, the average precision on the Full, Rare, and Non-Rare classes reaches 35.91%, 33.65%, and 36.28%, respectively. The experimental results show that the proposed network outperforms the methods compared in this paper and performs well in few-shot and zero-shot settings.
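The abstract describes a visual-semantic joint embedding module that scores the fit between visual features and encoded triplet labels. The PyTorch sketch below illustrates one common way such a module can be realized: projecting both modalities into a shared space and scoring with cosine similarity. The class name, feature dimensions (2048-d visual features, 300-d word embeddings), and the cosine-similarity scoring are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSemanticJointEmbedding(nn.Module):
    """Hypothetical sketch of a visual-semantic joint embedding:
    project visual and semantic features into a shared space and
    score their fit with cosine similarity."""
    def __init__(self, visual_dim=2048, semantic_dim=300, embed_dim=512):
        super().__init__()
        # Dimensions are assumed for illustration only.
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        self.semantic_proj = nn.Linear(semantic_dim, embed_dim)

    def forward(self, visual_feat, semantic_feat):
        # L2-normalize both projections so the dot product equals cosine similarity.
        v = F.normalize(self.visual_proj(visual_feat), dim=-1)     # (N, D)
        s = F.normalize(self.semantic_proj(semantic_feat), dim=-1) # (C, D)
        # Fit score between every visual candidate and every triplet label.
        return v @ s.t()  # (N, C) similarity matrix

# Usage: score 4 human-object candidates against 10 triplet-label embeddings.
model = VisualSemanticJointEmbedding()
scores = model(torch.randn(4, 2048), torch.randn(10, 300))
print(scores.shape)  # torch.Size([4, 10])
```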