This thesis consists of two major components. The first part is concerned with video object instance segmentation (VOS), the task of assigning per-pixel labels in each frame of a video sequence to indicate foreground object instance membership, given the ground-truth mask for the first frame. VOS has myriad applications, from video post-processing to action recognition, and is an active area of research. A novel end-to-end trainable, online algorithm that uses a bilinear LSTM to learn long-term appearance models is presented. The bilinear LSTM guides the learned CNN features, integrating temporal information and building more discriminative appearance features for specific objects during inference. The second part of this thesis examines potential applications of computer vision to automated ecological inference for endemic flatfish populations. Specifically, it looks at the construction of a visual tracking dataset, NHFish, consisting of underwater beam trawl videos of the Oregon coast benthos collected along the Newport Hydrographic Line, and at the application of automated video analysis methods to these videos.