Temporal Bilinear Networks for Video Action Recognition


Temporal modeling in videos is a fundamental yet challenging problem in computer vision. In this paper, we propose a novel Temporal Bilinear (TB) model to capture the temporal pairwise feature interactions between adjacent frames. Whereas some existing temporal methods are limited to linear transformations, our TB model applies explicit quadratic bilinear transformations in the temporal domain to model motion evolution and sequential relations. We further build our TB blocks from a factorized bilinear model with linear complexity and a bottleneck network design, which constrains both the parameter count and the computation cost. We consider two schemes for combining TB blocks with the original 2D spatial convolutions, namely wide and deep Temporal Bilinear Networks (TBN). Finally, we perform experiments on several widely adopted datasets, including Kinetics, UCF101 and HMDB51. The effectiveness of our TBNs is validated by comprehensive ablation analyses and comparisons with various state-of-the-art methods.
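To make the idea of a factorized bilinear interaction between adjacent frames concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names, the per-frame feature layout (T, C), and the rank-R factors U and V are all hypothetical, and the sketch omits the spatial dimensions, linear terms, bottleneck layers and any normalization the real TB block may use. It only shows why the factorized form costs O(RC) per output channel instead of O(C^2).

```python
import numpy as np

def temporal_bilinear(x, U, V):
    """Factorized bilinear interaction between adjacent frames (illustrative).

    x: (T, C) array, one C-dimensional feature vector per frame (assumed layout).
    U, V: (K, R, C) arrays, hypothetical rank-R factors for K output channels.
    Returns (T-1, K) with y[t, k] = x[t]^T (U[k]^T V[k]) x[t+1].
    """
    a = np.einsum("krc,tc->tkr", U, x[:-1])  # project frame t onto R factors
    b = np.einsum("krc,tc->tkr", V, x[1:])   # project frame t+1 onto R factors
    return (a * b).sum(axis=-1)              # elementwise product, summed over R

def temporal_bilinear_full(x, U, V):
    """Reference version that materializes the full (K, C, C) bilinear weights."""
    W = np.einsum("krc,krd->kcd", U, V)      # W[k] = U[k]^T V[k]
    return np.einsum("tc,kcd,td->tk", x[:-1], W, x[1:])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((5, 8))          # 5 frames, 8 channels
    U = rng.standard_normal((4, 3, 8))       # 4 output channels, rank 3
    V = rng.standard_normal((4, 3, 8))
    y = temporal_bilinear(x, U, V)
    print(y.shape, np.allclose(y, temporal_bilinear_full(x, U, V)))
```

Both functions compute the same quadratic interaction; the factorized one never forms the C-by-C weight matrix, which is the source of the linear complexity mentioned above.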



Figure 1. The structure of the Temporal Bilinear module. Feature maps are labeled with their tensor shapes.


Figure 2. The structures of the original ResNet block, the proposed Wide TB block, and the proposed Deep TB block.


@inproceedings{li2019temporal,
  title={Temporal Bilinear Networks for Video Action Recognition},
  author={Li, Yuqi and Li, Yanghao and Yan, Hongfei and Liu, Jiaying},
  booktitle={AAAI Conference on Artificial Intelligence},
  year={2019}
}

Experimental Results

Table 1. Comparisons on the validation set of Kinetics using RGB as input.