Learning to recognize objects is a fundamental step in human perception and understanding of the world. Accordingly, research on object discovery across diverse modalities plays a pivotal role in computer vision. This field not only deepens our understanding of visual information but also offers a wealth of potential applications, such as augmented reality, e-commerce, and robotics, particularly in industrial manipulation scenarios.
We first address the task of discovering objects in still images without relying on predefined categories. We introduce a novel variational relaxation approach tailored to this task. By framing it as an optimization problem for piecewise-constant segmentation, the technique enables direct training of a fully convolutional network (FCN) to predict an object label for each pixel. Applied to instance segmentation, our approach achieves results comparable to Mask R-CNN without depending on a two-stage framework. Note that training the network does not require category labels, enabling our approach to discover objects beyond any predefined categories.
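To make the piecewise-constant formulation concrete, the sketch below evaluates a relaxed Potts-style energy on soft per-pixel label assignments, such as an FCN's softmax output. This is an illustrative baseline, not the thesis's exact objective: the grayscale data term, the total-variation smoothness term, and the weight `lam` are assumptions for the sketch.

```python
import numpy as np

def relaxed_potts_energy(probs, image, lam=0.1):
    """Relaxed piecewise-constant (Potts-style) segmentation energy.

    probs: (H, W, K) soft label assignments (sum to 1 over K), e.g. FCN softmax.
    image: (H, W) grayscale image used for the data term.
    lam:   weight of the boundary-length (smoothness) term.
    """
    H, W, K = probs.shape
    # Data term: squared deviation of each pixel from the mean intensity of its
    # label, weighted by the soft assignment (piecewise-constant image model).
    means = np.array([
        (probs[..., k] * image).sum() / max(probs[..., k].sum(), 1e-8)
        for k in range(K)
    ])
    data = (probs * (image[..., None] - means) ** 2).sum()
    # Smoothness term: total variation of the soft labels, a convex relaxation
    # of the boundary length penalized by the Potts model.
    tv = np.abs(np.diff(probs, axis=0)).sum() + np.abs(np.diff(probs, axis=1)).sum()
    return data + lam * tv
```

Because the relaxed energy is differentiable in `probs`, minimizing it by gradient descent is what allows the FCN's pixel-wise predictions to be trained directly, without category supervision.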
Next, we extend our exploration to video sequences, focusing on unsupervised video object segmentation, where we aim to discover and track objects within videos. Observing that single-frame object proposals often fail due to motion blur, occlusion, and other factors, we refine key-frame proposals using a multi-proposal graph built from proposals generated in nearby frames and propagated to the key frame. We then compute the maximal cliques of this graph, each of which contains proposals representing the same object. Pixel-level voting within each clique generates key-frame proposals that can be better than any single-frame proposal. A semi-supervised VOS algorithm then tracks these key-frame proposals across the entire video, showcasing the potential for precise and robust object tracking in dynamic visual environments.
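The clique-and-voting step can be sketched as follows on binary proposal masks already propagated to the key frame. The IoU edge threshold, the plain Bron-Kerbosch clique enumeration, and the per-pixel majority-vote rule are plausible concrete choices for the sketch, not necessarily those of the original method.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union between two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def maximal_cliques(adj):
    """Enumerate maximal cliques of an adjacency dict (Bron-Kerbosch)."""
    cliques = []
    def bk(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            bk(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    bk(set(), set(adj), set())
    return cliques

def keyframe_proposals(masks, iou_thr=0.5, vote_thr=0.5):
    """Link propagated proposals whose IoU exceeds iou_thr, group them into
    maximal cliques (proposals of the same object), and fuse each clique by
    pixel-level majority voting."""
    adj = {i: {j for j in range(len(masks))
               if j != i and iou(masks[i], masks[j]) >= iou_thr}
           for i in range(len(masks))}
    fused = []
    for clique in maximal_cliques(adj):
        stack = np.stack([masks[i] for i in clique])
        fused.append(stack.mean(axis=0) >= vote_thr)  # per-pixel majority vote
    return fused
```

Because the vote aggregates evidence across frames, a spurious region that appears in only one propagated proposal is outvoted, which is how the fused key-frame proposal can surpass every individual single-frame proposal.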
We further explore the domain of Vision-Language, where we seek to identify objects associated with a specific textual context. In this multifaceted setting, we tackle the challenge of content moderation (CM), which assesses multimodal user-generated content to detect material that is illegal, harmful, or insulting. We present a novel CM model that addresses the asymmetry in semantics between vision and language. Our model features an innovative asymmetric fusion architecture that not only fuses the knowledge common to both modalities but also leverages the information unique to each. Additionally, we introduce a novel cross-modality contrastive loss to capture knowledge that arises exclusively in the multimodal context, which is crucial for detecting harmful intent that emerges at the intersection of the two modalities.
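For reference, a common instantiation of a cross-modal contrastive objective is the symmetric InfoNCE loss used by CLIP-style models: matched image-text pairs in a batch are pulled together while mismatched pairs are pushed apart. The sketch below shows that baseline form, not the thesis's specific loss; the temperature value is an assumption.

```python
import numpy as np

def cross_modal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (B, D) arrays where row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) similarity matrix

    def xent_diagonal(l):
        # Cross-entropy with the diagonal (the matched pair) as the target.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent_diagonal(logits) + xent_diagonal(logits.T))
```

In this baseline the loss operates only on alignment between the two modalities; the thesis's cross-modality contrastive loss additionally targets knowledge that exists only in the joint multimodal context.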