- This report presents an efficient method for semi-supervised video object segmentation – the problem of identifying the foreground pixels occupied by a target object. The target is specified by a ground-truth mask in the first video frame. While state-of-the-art methods achieve a segmentation accuracy greater than 80%, they run relatively slowly, at less than 10 frames per second, which limits their application in many domains. In addition, the accuracy of existing approaches typically suffers when the target is occluded by moving background objects. We address these two shortcomings of prior work with a novel deep architecture that jointly tracks both foreground and background in the video in an efficient manner. Our key hypothesis is that explicitly tracking the dynamic background of the target object helps improve segmentation in cases of target occlusion. We propose using two deep neural networks that work in parallel – one for foreground object segmentation, and the other for background segmentation – sharing the same architecture. Their outputs are integrated by a third network that fuses the initial foreground and background segmentations into a more accurate target object segmentation. We perform experiments using various configurations of the proposed architecture on the DAVIS 2016 dataset. Our results support the key hypothesis: jointly tracking the dynamic foreground and background indeed outperforms a baseline that tracks only the target object. On DAVIS 2016, our accuracy is 70.61%, while operating at over 100 frames per second.
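To make the two-stream idea concrete, the sketch below shows one simple way the parallel foreground and background predictions could be combined per pixel. This is a hypothetical, hand-written fusion rule for illustration only; the paper instead learns the fusion with a third network, and the function and variable names here are assumptions, not the authors' code.

```python
def fuse_streams(fg_probs, bg_probs):
    """Combine per-pixel foreground and background probability maps
    (2D lists of floats in [0, 1]) into a binary foreground mask.

    Hypothetical rule: a pixel is foreground when the foreground
    stream's evidence outweighs the background stream's. The paper
    replaces this fixed rule with a learned fusion network.
    """
    mask = []
    for fg_row, bg_row in zip(fg_probs, bg_probs):
        row = []
        for fg_p, bg_p in zip(fg_row, bg_row):
            # Normalize the two single-stream scores; the small
            # epsilon guards against division by zero.
            joint = fg_p / (fg_p + bg_p + 1e-8)
            row.append(1 if joint > 0.5 else 0)
        mask.append(row)
    return mask

# Toy 2x2 outputs from the two parallel streams.
fg = [[0.9, 0.2], [0.6, 0.1]]
bg = [[0.1, 0.7], [0.5, 0.8]]
print(fuse_streams(fg, bg))  # [[1, 0], [1, 0]]
```

A pixel such as (1, 0), where both streams are uncertain (0.6 vs. 0.5), illustrates why a learned fusion network is preferable to a fixed rule: it can weigh conflicting foreground and background evidence using context rather than a hard threshold.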