short introduction to image parsing

humans easily do what computers cannot – look at a scene and not only recognise the objects in it, but also understand the image. a quick example:



looking at this picture, humans take a split second to recognise that there are two dogs (most likely border collies), a girl about to throw a ball, grass, trees and a cloudy sky in the picture. we can even recognise the excitement of the dogs. for a computer, this is much more difficult [see blogpost by stefan alberts for some of the reasons computers have this difficulty].

there are various methods of segmenting an image into potentially recognisable parts. one of these methods is to parse the image. image parsing is used for, amongst other things, segmentation, detection and object recognition.

to ‘parse’ is something that most of us actually learn in school, just in the context of sentences, not images. [1] gives the following example of language parsing. if we have the sentence “the boy went home”, we can split it up into a noun phrase (np) “the boy” and a verb phrase (vp) “went home”. the noun phrase can be broken up further into a determiner (det) “the” and a noun (n) “boy”, whilst the verb phrase can be broken up further into a verb (verb) “went” and a noun (n) “home”. the determiner, nouns and verb are at the word level, since they describe specific words. the noun phrase and verb phrase are said to be at the phrase level. this can be visualised as follows (also taken from [1]):
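the same tree can also be written down as a small data structure. the sketch below is only an illustration, not something taken from [1]: the nested-tuple layout, the helper names `words` and `outline`, and the root label "s" (for sentence) are all assumptions made for this example.

```python
# a parse tree as nested tuples: (label, child, child, ...).
# a word-level node is (label, word), where word is a plain string.
# the root label "s" (sentence) is an assumption; the other labels follow the text.
tree = (
    "s",
    ("np", ("det", "the"), ("n", "boy")),
    ("vp", ("verb", "went"), ("n", "home")),
)


def words(node):
    """collect the words at the leaves, from left to right."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]  # word level: nothing left to split
    collected = []
    for child in children:
        collected.extend(words(child))
    return collected


def outline(node, depth=0):
    """return one indented line per node, phrase level above word level."""
    label, *children = node
    indent = "  " * depth
    if len(children) == 1 and isinstance(children[0], str):
        return [f"{indent}{label}: {children[0]}"]
    lines = [f"{indent}{label}"]
    for child in children:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(tree)))
print(" ".join(words(tree)))
```

reading the leaves from left to right gives back the original sentence, which is a handy sanity check that the tree really is a parse of it.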



how can we do this for images? we can find smaller patterns or pieces of the image that will become our building blocks – e.g. hair, arm, torso, legs – just as we have the words in a sentence. these are then combined to create larger parts of the image, such as a person, just as words combine to form a phrase. recognising all the small areas, the ‘words’, can in itself be a tremendous task – what does ‘hair’ really look like? how can we tell a computer that if it sees x in an image, x should be classified as hair, when hair in one image very rarely looks like hair in another image (except in shampoo ads)? there are different ways of dealing with this problem, such as describing the area by its ‘texture’, but these fall outside the scope of this brief introduction to image parsing.

before we look at an example, let’s look at a more formal definition, since that’s what we mathematicians like to do. [2] states that image parsing is “… the task of decomposing an image into its constituent visual patterns”. these patterns can be the hair, arms, nose, grass texture, tree texture, etc. [2] also gives an example that is fairly well used within the community. it is featured below:



if the blue dotted lines are removed, we are left with what is normally referred to as a parse graph. the black arrowed lines show the actual graph of how each segment is decomposed into its children: for example, the original picture has a ‘person’ child, whose own children are ‘face’ and ‘texture’. the blue lines, however, show the spatial relationships between the different segments – something that comes in handy in certain situations.
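to make the two kinds of lines concrete, here is a minimal sketch of how such a graph could be stored. it is not the representation used in [2]: the class `ParseGraph`, its method names and the "stands on" relation are made up for illustration; only the segment names come from the example in the text.

```python
from dataclasses import dataclass, field


@dataclass
class ParseGraph:
    # parent segment -> its child segments (the black arrowed lines)
    children: dict = field(default_factory=dict)
    # (segment, relation, segment) triples (the blue dotted lines)
    relations: list = field(default_factory=list)

    def decompose(self, parent, *parts):
        """record that `parent` is decomposed into `parts`."""
        self.children.setdefault(parent, []).extend(parts)
        for part in parts:
            self.children.setdefault(part, [])

    def relate(self, a, relation, b):
        """record a spatial relationship between two segments."""
        self.relations.append((a, relation, b))

    def leaves(self):
        """segments that are not decomposed any further."""
        return [seg for seg, kids in self.children.items() if not kids]


graph = ParseGraph()
graph.decompose("scene", "person", "sports field", "spectator")
graph.decompose("person", "face", "texture")
graph.relate("person", "stands on", "sports field")  # made-up relation

print(graph.leaves())
```

keeping the spatial relations in a separate list mirrors the figure: dropping `relations` leaves the decomposition itself untouched.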

two ways to build up these parse graphs are discriminative methods and generative methods (these are by no means the only possible methods).

discriminative methods work from the bottom up – they tend to first find the smaller regions, such as the arm, leg, torso, etc., that will eventually form a ‘higher’ segment (a person in this case).
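as a toy illustration of the bottom-up direction only (a real discriminative method learns its detectors from data and is far more involved), the sketch below assumes the small regions have already been found and simply checks which ‘higher’ segments they can form. the rule table `RULES` and the function name are hypothetical.

```python
# hypothetical composition rules: a 'higher' segment and the parts it needs
RULES = {"person": {"arm", "leg", "torso"}}


def compose_bottom_up(detected_parts, rules=RULES):
    """return the higher segments whose required parts were all detected."""
    found = set(detected_parts)
    return sorted(whole for whole, parts in rules.items() if parts <= found)


print(compose_bottom_up(["arm", "leg", "torso", "grass"]))
print(compose_bottom_up(["arm", "grass"]))
```

the first call has every part a person needs, so a person is formed; the second is missing the leg and torso, so nothing is built on top of the detected regions.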

generative methods work from the top down – they tend to first determine which objects are likely to be in the scene, and then break the image down further to see which of these likely objects are indeed present. they could, for instance, recognise person, sports field, spectator, and maybe road and building in the football image. when they look further down, however, these methods could realise that the road and building, although they look as if they could be present, are not actually there, and so they are not included in the final parse graph. this is one example of how a generative algorithm could work, but the main idea that makes an algorithm generative is that it works from the top down.
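the hypothesise-then-check idea in this example can be sketched in a few lines. this is a deliberately simplified illustration, not an actual generative algorithm: the function `prune_top_down`, the threshold and all the scores are invented, and `verify` stands in for whatever closer inspection a real method would perform.

```python
def prune_top_down(hypotheses, verify, threshold=0.5):
    """keep the hypothesised objects whose closer inspection scores high enough."""
    return [obj for obj in hypotheses if verify(obj) >= threshold]


# made-up scores for how well each hypothesis holds up on closer inspection
scores = {
    "person": 0.9,
    "sports field": 0.8,
    "spectator": 0.7,
    "road": 0.2,
    "building": 0.1,
}

kept = prune_top_down(list(scores), scores.get)
print(kept)
```

with these invented numbers, road and building are proposed at the top level but dropped after the check, so they would not appear in the final parse graph.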

so what is the aim of image parsing? as stated above, it is to break up the image into its visual patterns. from there we can learn different things, or make certain predictions.

we are, however, still a far cry from perfect systems, or from making a computer see and perceive the way we do. but when we think about it, there are certain vital differences between how people see and what computers have at their disposal when only looking at a 2d picture. we learn what objects are not only from a 3d perspective, but also from how they move (if they are animate objects like humans, animals or plants) and transform. what we normally give a computer as input is a whole lot of images (granted, these collections can contain a substantial number of images) that have mostly been hand segmented, and we tell it: learn that all these images fall into these (limited) categories (in supervised learning). in the situation under consideration, there are no images that follow on from one another in a sequence, showing how objects can transform from one perspective to another, and no extra information such as 3d perspectives is given. we therefore want a computer to do what we ourselves don’t do, and so we (most likely) cannot teach computers to see exactly as we do.

there is at least one other, rather interesting, application of image parsing that i have come across. here is a picture of a woman, on the left, and an artist’s painting of the picture, on the right:


[images: p2p_orig (the picture, left) and p2p_ipa (the painting, right)]

pretty good job, if this is the style of painting you like. the artist, however, was not a human being; it was an algorithm that uses image parsing to create ‘paintings’ of pictures. for the details, and more examples, see [3].



[1] c.f. schmidt, parsing example, url

[2] z. tu, x. chen, a.l. yuille, s. zhu, image parsing: unifying segmentation, detection, and recognition, international journal of computer vision, 2005.

[3] k. zeng, m. zhao, c. xiong, s. zhu, from image parsing to painterly rendering, url.

