In this dissertation, we address action segmentation in videos under limited supervision. The goal of action segmentation is to predict an action class for each frame of a video. Limited supervision means that frame-level ground-truth labels are not available in training. We focus on three problem settings: (1) transcript-level supervision, where the ground truth is a transcript that specifies the temporal ordering of the actions present in a training video; (2) set-level supervision, where the ground truth specifies only the set of actions present; and (3) unsupervised learning, where no ground truth is available. To address these problems, we make three hypotheses. First, we hypothesize that action segmentation under limited supervision benefits from reasoning over many candidate segmentations rather than predicting a single optimal segmentation. To this end, we efficiently represent a video by a segmentation graph whose paths are candidate segmentations. Second, we hypothesize that discriminative learning that minimizes an energy contrasting valid segmentations, which satisfy the ground truth, with invalid segmentations, which violate it, is a better learning objective than minimizing a loss defined only with respect to valid segmentations. Third, we hypothesize that regularizing action affinity for the same actions, sparsity of action activations for different actions, and orthonormality of parameter matrices helps learning under limited supervision. The dissertation presents our approaches to action segmentation that are based on these hypotheses. Our key technical contributions include versions of a constrained Viterbi algorithm for efficiently approximating the NP-hard all-color-shortest-path problem, as well as efficient Riemannian optimization on the Stiefel manifold via the Cayley transform for regularizing model parameters. Our experimental evaluation demonstrates the advantages of our approaches relative to existing work.
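
To illustrate the orthonormality regularization mentioned above, the following is a minimal sketch of a generic Cayley-transform update on the Stiefel manifold, not necessarily the exact procedure used in the dissertation. The function name `cayley_step`, the step size `tau`, and the random test matrices are assumptions introduced for illustration only. Given a parameter matrix X with orthonormal columns and a Euclidean gradient G, the update builds the skew-symmetric matrix W = G Xᵀ − X Gᵀ and applies the Cayley transform of W to X, which keeps the columns orthonormal for any step size.

```python
import numpy as np


def cayley_step(X, G, tau):
    """One descent step that stays on the Stiefel manifold (hypothetical helper).

    X   : (n, p) matrix with orthonormal columns, X.T @ X = I
    G   : (n, p) Euclidean gradient of the loss at X
    tau : step size (assumed value chosen by the caller)
    """
    n = X.shape[0]
    # Skew-symmetric matrix built from the gradient: W.T == -W
    W = G @ X.T - X @ G.T
    I = np.eye(n)
    # Cayley transform: Y = (I + tau/2 W)^{-1} (I - tau/2 W) X.
    # Solving the linear system avoids forming an explicit inverse.
    return np.linalg.solve(I + 0.5 * tau * W, (I - 0.5 * tau * W) @ X)


# Illustrative data only: a random orthonormal X and a random gradient G.
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((6, 3)))
G = rng.standard_normal((6, 3))
Y = cayley_step(X, G, tau=0.1)
```

Because the Cayley transform of a skew-symmetric matrix is orthogonal, `Y.T @ Y` stays equal to the identity up to floating-point error, so no separate re-orthonormalization step is needed after the update. This sketch solves an n-by-n system and is meant only to show the idea, not an efficient variant.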