Segmentation is one of the first steps in human visual system which helps us see the world around us. Humans pre-attentively segment scenes into regions of unique textures in around 10-20 ms. In this thesis, we address the problem of segmentation by grouping dense pixel-wise descriptors. Our work is based on the fact that human vision has a feed forward and a feed backward loop, where low level feature are used to refine high level features in forward feed, and higher level feature information is used to refine the low level features in backward feed. Most vision algorithms are based on a feed-forward loop, where low-level features are used to construct and refine high level features, but they don’t have the feed back loop. We have introduced ”Shape-Tailored Local Descriptors”, where we use the high level feature information (region approximation) to update low level features i.e. the descriptor, and the low level feature information of the descriptor is used to update the segmentation regions. Our ”Shape-Tailored Local Descriptor” are dense local descriptors which are tailored to an arbitrarily shaped region, aggregating data only within the region of interest. Since the segmentation, i.e., the regions, are not known a-priori, we propose a joint problem for Shape-Tailored Local Descriptors and Segmentation (regions). Furthermore, since natural scenes consist of multiple objects, which may have different visual textures at different scales, we propose to use a multi-scale approach to segmentation. We have used a set of discrete scales, and a continuum of scales in our experiments, both resulted in state-of-the-art performance. Lastly we have looked into the nature of the features selected, we tried handcrafted color and gradient channels and we have also introduced an algorithm to incorporate learning optimal descriptors in segmentation approaches. In the final part of this thesis we have introduced techniques for unsupervised learning of descriptors for segmentation. This eliminates the problem of deep learning methods where we need huge amounts of training data to train the networks. The optimum descriptors are learned, without any training data, on the go during segmentation.
|Date made available
|KAUST Research Repository