Graduate Project


Video Object Segmentation by Jointly Tracking Foreground and Background



Abstract
  • This report presents an efficient method for semi-supervised video object segmentation – the problem of identifying the foreground pixels occupied by a target object, where the target is specified by a ground-truth mask in the first video frame. While state-of-the-art methods achieve segmentation accuracy above 80%, they run relatively slowly, at fewer than 10 frames per second, which limits their use in many domains. In addition, the accuracy of existing approaches typically suffers when the target is occluded by moving background objects. We address these two shortcomings of prior work with a novel deep architecture that jointly and efficiently tracks both the foreground and the background in the video. Our key hypothesis is that explicitly tracking the dynamic background of the target object improves segmentation in cases of target occlusion. We propose two deep neural networks that work in parallel: one for foreground object segmentation and the other for background segmentation. The two networks share the same architecture. Their outputs are integrated by a third network that fuses the initial foreground and background segmentations into a more accurate target object segmentation. We evaluate various configurations of the proposed architecture on the DAVIS 2016 dataset. Our results support the key hypothesis: jointly tracking the dynamic foreground and background indeed outperforms a baseline that tracks only the target object. On DAVIS 2016, our method achieves an accuracy of 70.61% while operating at over 100 frames per second.
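The report fuses the two streams with a learned network. As a rough, hedged illustration of the underlying idea only (a fixed probabilistic rule standing in for the learned fusion; the function name and the odds-style combination are assumptions, not the report's method), per-pixel foreground and background probabilities could be combined like this:

```python
def fuse_fg_bg(fg_prob: float, bg_prob: float, eps: float = 1e-6) -> float:
    """Combine one pixel's foreground probability (from the foreground net)
    and background probability (from the background net) into a single
    foreground estimate.

    Treats the two predictions as independent evidence: the pixel is
    foreground when the foreground net says "foreground" AND the background
    net says "not background". Illustrative only -- the report trains a
    fusion network rather than applying a fixed formula.
    """
    fg_evidence = fg_prob * (1.0 - bg_prob)
    bg_evidence = (1.0 - fg_prob) * bg_prob
    return fg_evidence / (fg_evidence + bg_evidence + eps)

# When both nets agree (high fg_prob, low bg_prob), the fused score is
# stronger than either input alone; when they conflict, it moves toward 0.5.
print(fuse_fg_bg(0.9, 0.1))
print(fuse_fg_bg(0.5, 0.5))
```

This kind of rule also hints at why background tracking helps under occlusion: a confident "this pixel is background" prediction can veto a spurious foreground response on the occluder.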


