A couple of giraffes that are next to a book: fairy tales as a medium for looking at methodologies of compiling training sets for image captioning programs

Anna Ridler & Georgia Ward Dyer

1 Introduction

Fairy tales and image captioning programs share underlying elements that can be illuminated by examining each in light of the other.

The ability of a fairy tale to be distorted and transformed, whilst still retaining a persistent identity thanks to the underlying algorithms of the story, perfectly illustrates the tension between free-flowing creativity and tight rules; we explore this in our own retellings which are mediated through a variety of machine learning tools. Through this mediation there often emerge compelling and absurd associations between the image and the text.  

Fairy tales function as reflections of their contemporary context, and here we argue that some machine learning tools (such as image captioning programs) do the same, as a result of the way in which they are built.  Each time a fairy tale is told we learn something about the temporal, cultural and social context of its telling – in the Nazi-era retelling of “Little Red Riding Hood”, she is saved from being eaten by a wolf portrayed as Semitic. The use of recognisable stock figures and conventional imagery distinctive to fairy tales is also pertinent.

In supervised learning, a form of machine learning, a program is developed by first training it on a labelled data set – for instance, in image recognition, the program is shown examples of images that have been labelled with details of what they depict.  These ‘training sets’ are compiled by researchers according to a variety of methodologies, but because human subjectivity is always involved at some point, whether in the source content or in the compilation process, they inevitably come to enshrine certain cultural or social attitudes.  Another consequence of these methodologies is that image captioning programs assimilate only the most conventional illustrations of the world.
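To make this process concrete, the sketch below shows supervised learning in its simplest form: a model is repeatedly shown labelled examples and adjusted until its predictions agree with the human-assigned labels. It is a minimal illustration assuming PyTorch and torchvision, using CIFAR-10 (one of the training sets discussed later) purely as an example, and is not a description of any particular captioning system.

```python
# Minimal sketch of supervised image classification, assuming PyTorch/torchvision.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

# Each CIFAR-10 example is a 32x32 image paired with one of ten human-assigned
# labels ('airplane', 'bird', 'horse', ...) - an image plus a caption of sorts.
train_set = torchvision.datasets.CIFAR10(
    root="data", train=True, download=True, transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

# A deliberately tiny model: flatten each image and map it to ten class scores.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for images, labels in loader:
    optimiser.zero_grad()
    loss = loss_fn(model(images), labels)  # compare predictions with the human labels
    loss.backward()                        # learn from the discrepancy
    optimiser.step()
```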

2 Retelling  

Recent research that statistically analysed folktales from Europe and Asia dates the origins of some stories to thousands of years ago, and the oldest – “The Smith and the Devil” – to the Bronze Age. These stories are told and retold, written and rewritten: “Cupid and Psyche” transformed into “Beauty and the Beast”, which in turn transformed into Angela Carter’s “The Courtship of Mr Lyon” or “The Tiger’s Bride”. Fairy tales cannot be said to have single authors – the archetype emerges through countless retellings across cultures and across time.  Writer Helen Oyeyemi observes that “when you retell a story, you’re testing what in it is relevant to all times and places. Bits of it hold up, and bits of it crumble and then new perspectives come through”.

In our own practice, we retell fairy tales by mediating them through machine learning tools and data sets to find out what holds up, what crumbles, and what new perspectives come to light when filtered through artificial intelligence. We have retold “Beauty and the Beast” (which became “YouTube & the Bass” once passed through a speech-to-text system) and created a pictorial interactive piece based on Propp’s Morphology of the Folktale, which allows users to recreate fairy tales as they wish. We have used various technologies – CaptionBot, CIFAR-10, CIFAR-100, Google Cloud Speech API, Google Image Search, Microsoft Research Cambridge Object Recognition Image Database, ImageNet – as a means to generate and reinterpret content, which we then curate into the final draft.

The unusual and compelling chance phrases or images (see Figures 1, 2) which surface from this process inevitably lead us in particular directions; our process is a collaboration between creative decisions and unexpected associations, human and machine.  We test the limits of the retelling, there are gaps in the narrative, and an important part of the meaning-making is intentionally ceded to the reader. Fairy tales, after all, are made up of small units of story which become building blocks that we instinctively know how to put together – the prince is always charming, the beautiful princess is always rescued – and this is possible even when the narrative is incomplete.

3 The conventional

In the introduction to his own retelling of the Grimms’ “Children’s and Household Tales”, Philip Pullman writes of the “conventional, stock figures” which inhabit fairy tales, and there is nothing more conventional than a definition generated by search results: a convention is a definition gradually agreed upon by a community through usage. Most of the training sets used by image recognition programs – and certainly the most famous and prevalent ones (80 Million Tiny Images, CIFAR-10 and CIFAR-100) – were put together using complex methodologies, yet they rely on image results for particular search terms from sources including Google, Flickr, AltaVista, and Baidu to compile their visual databases.  Databases compiled from web image results in this way inevitably come to reflect the prevailing attitudes of the online community at the time.  For example, when the search term ‘castle’ is entered into a search engine, the most conventional representations are ranked highest – medieval castles appear more prominently than the television show ‘Castle’. The castle of popular imagination is what gets illustrated, and from the top-ranked results it is possible to identify common characteristics across all those images. This is not limited to concrete nouns, but also occurs with abstract nouns. Take ‘beauty’ – across the highest ranked results a commonality emerges as to what ‘best’ represents it: being female, being white, being heavily made up (see fig. 3). As the training set is developed from these highest ranked results, this narrow, conventional account of ‘beauty’ is then enshrined by machine learning programs as the definitive one. The same inclination towards convention is apparent in fairy tales – Pullman notes that “[t]here is no imagery in fairy tales apart from the most obvious”. Skin is “as white as snow”, lips are ruby red, wrinkled old women are witches.
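The mechanism can be made explicit with a schematic sketch of how such a search-derived training set is assembled. Here `search_images` is a hypothetical stand-in for an image-search API (Google, Flickr, Baidu), assumed to return results in ranked order, so whatever the engine ranks highest for ‘castle’ or ‘beauty’ becomes that concept’s canonical illustration.

```python
# Schematic sketch of how a search-derived training set enshrines convention.
# `search_images` is a hypothetical stand-in for a real image-search API.
from typing import Dict, List

def search_images(term: str, k: int) -> List[str]:
    """Hypothetical: return URLs of the top-k ranked image results for `term`."""
    raise NotImplementedError("stands in for a real image-search API")

def build_training_set(terms: List[str], per_term: int = 1000) -> Dict[str, List[str]]:
    # Whatever the search engine ranks highest for each term becomes the
    # canonical illustration of that word in the resulting data set.
    return {term: search_images(term, per_term) for term in terms}

# e.g. build_training_set(["castle", "beauty", "wolf"])
```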

The methodology used to compile ImageNet is probably the most thorough in its efforts towards a multifaceted, cross-checked database.  ImageNet is a foundational and widely used visual database for testing computer vision and image recognition algorithms, and the way it was compiled illustrates how image recognition programs come to reflect the conventional and the obvious.  In brief, the methodology is as follows. ImageNet is based on the ‘ontology of concepts’ structured by WordNet, which collects words or phrases into groups describing different concepts.  ImageNet aims to provide illustrations for 80,000 of these groups (called ‘synsets’), and began by collecting image search results for each term.  Since not all image search results are accurate illustrations of the term in question, the ImageNet team then used workers from Amazon Mechanical Turk to manually confirm the classification of these images. Because of this, the images selected reflect the prevailing attitudes of the time and culture on two levels: first in taking images from the web as canonical illustrations of a term, and then in having users online confirm them as correct (or not). Furthermore, workers are provided with a definition of the concept in question through a link to its Wikipedia page, itself a further example of using the conventional to determine ‘canonical’ definitions.  Because the database assimilates only the most conventional illustrations of concepts, any more nuanced or complex forms are not recognisable to the program and so cannot be part of the world as it sees it.
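The pipeline described above can be sketched roughly as follows, assuming NLTK’s WordNet corpus is available; `collect_candidates` and `crowd_verify` are hypothetical stand-ins for the web image search and the Amazon Mechanical Turk verification step respectively.

```python
# Rough sketch of an ImageNet-style pipeline, assuming NLTK's WordNet corpus.
from nltk.corpus import wordnet as wn

def collect_candidates(query: str, n: int = 500) -> list:
    """Hypothetical: return candidate image URLs from a web image search."""
    raise NotImplementedError

def crowd_verify(image_url: str, definition: str) -> bool:
    """Hypothetical: ask crowd workers whether the image depicts the concept."""
    raise NotImplementedError

def populate_synset(word: str) -> dict:
    synset = wn.synsets(word, pos=wn.NOUN)[0]        # first noun sense of the word
    definition = synset.definition()                 # the gloss shown to verifiers
    candidates = collect_candidates(" ".join(synset.lemma_names()))
    verified = [url for url in candidates if crowd_verify(url, definition)]
    return {"synset": synset.name(), "definition": definition, "images": verified}
```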

Programs also sometimes recognise content but will not caption it. Because of the commercial aspect of image recognition, most are built with a filter that recognises, but does not caption, anything they consider to be pornographic. Many pornographic images online depict anuses, and image recognition programs are therefore taught to reliably recognise these when they feature in an image, so that the image can be flagged as inappropriate and intentionally left unlabelled. This sanitising is reminiscent of the bowdlerisation of modern fairy tales: the violence, rape and murder are glossed over or removed altogether in an attempt to make the story palatable to the widest possible audience. For example, the Korean Cinderella-type tale “Kongi and Potgi” is today most commonly told without the original ending, in which the Ugly Stepsister attempts to murder Kongi and is subsequently punished by being pickled to death by their servants.
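In code, this ‘recognise but do not caption’ behaviour amounts to something like the sketch below; `adult_content_score`, `caption_image`, and the 0.8 threshold are all hypothetical.

```python
# Hedged sketch: the program recognises the flagged content (it has been trained
# to) but deliberately withholds a caption. All names and thresholds are hypothetical.
def adult_content_score(image) -> float:
    """Hypothetical safety classifier: probability the image is pornographic."""
    raise NotImplementedError

def caption_image(image) -> str:
    """Hypothetical captioning model."""
    raise NotImplementedError

def describe(image) -> str:
    if adult_content_score(image) > 0.8:
        return "[content flagged as inappropriate]"  # recognised, but not described
    return caption_image(image)
```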

4 The absurd

But if an image captioning program’s vocabulary is constrained to ‘the conventional’, why do the phrases it produces so often have a surreal, absurdist accent to them? It is precisely because the program has only this limited vocabulary that this happens. In “YouTube & the Bass”, our retelling of Beauty and the Beast, ‘a person on a surfboard in a skate park’ greets Beauty at the castle; Beast becomes ‘a group of stuffed animals on top of a book’.  Since machine learning programs improve their accuracy by repeating tasks and acting on feedback, an image recognition program in its infancy frequently defaults back to labels it is familiar with. In “YouTube & the Bass”, ‘a couple of giraffes’ make repeat appearances ‘next to a book’.  Images of giraffes must make up part of that particular program’s training set, and must be the program’s closest match to what is being captioned.
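A toy sketch illustrates why unfamiliar images attract familiar captions: the program can only answer with phrases from its training vocabulary, so it returns whichever known concept lies closest in its feature space. The caption phrases below are those from “YouTube & the Bass”; the feature vectors are invented for illustration.

```python
# Toy nearest-match captioner: the answer is always the closest known concept.
import numpy as np

known_concepts = {
    "a couple of giraffes next to a book": np.array([0.9, 0.1, 0.3]),
    "a person on a surfboard in a skate park": np.array([0.2, 0.8, 0.5]),
    "a group of stuffed animals on top of a book": np.array([0.6, 0.4, 0.7]),
}

def caption(image_features: np.ndarray) -> str:
    # However strange the input, the output is drawn from the captions the
    # program already knows - whichever lies nearest in feature space.
    return min(known_concepts,
               key=lambda c: np.linalg.norm(known_concepts[c] - image_features))

print(caption(np.array([0.95, 0.05, 0.2])))  # -> "a couple of giraffes next to a book"
```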

Repetition is another significant factor for both fairy tales and image captioning. Multiples frequently recur in fairy tales – the six sons and six daughters, the seven dwarves, the three sons who set off one after the other on the same quest, and so on. Repetition in fairy tales is often used to reinforce or validate a theme or element of the story.  In resonance with this, image recognition programs label an item with a greater degree of certainty if it appears multiple times in an image.  For example, if an image depicts one baseball, the confidence score for the label ‘baseball’ will be high; if the image contains ten baseballs, the confidence score will be much higher still. Which is to say that the more ‘baseball-y’ the image is, the more enthusiastically the program considers it a ‘baseball’.
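One simple way to model this effect (an illustrative assumption, not the scoring rule of any particular system) is to treat each detected baseball as independent evidence: the probability that the image really contains a baseball then grows with every repetition.

```python
# Illustrative model of repetition boosting confidence: combine independent
# per-detection scores into the probability of at least one true detection.
import numpy as np

def combined_confidence(detection_scores) -> float:
    # P(at least one true detection) = 1 - P(every detection is false)
    scores = np.asarray(detection_scores, dtype=float)
    return 1.0 - np.prod(1.0 - scores)

print(combined_confidence([0.7]))        # one baseball  -> ~0.70
print(combined_confidence([0.7] * 10))   # ten baseballs -> ~0.999994
```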

As outlined above, fairy tales are distinctly bound by certain rules – just not those of realism.  As Pullman writes, ‘realism cannot cope with the notion of multiples…[fairy tales] exist in another realm altogether, between the uncanny and the absurd’.  It is perhaps this which motivated thinkers such as Freud and Jung to develop psychoanalytic theories of fairy tales, likening them to dreams. Moreover, in a mise en abyme, the characters themselves often dream in ways that are central to the plot, as in de Beaumont’s “Beauty & the Beast”.  In conversations with a research scientist at Google DeepMind, an analogy emerged between dreaming and how machine learning programs work.  Our waking life experience equates to the machine learning program’s ‘training set’. When we dream, our brain uses this sensory data as the raw material from which to build a detailed and internally coherent world, just as the program draws on its training set to build up its own picture of the world and what it means.  Although coherent with respect to the original input, both the dream and the program are warped and imperfect as reflections of the real world, generating uncanny and absurd moments.

These moments are compelling – and it is only through their occurrence that we become aware that the program’s picture of the world is an imperfect one. We do not wish to use these imperfections as criticisms of these tools – they are still developing. Those working on improving them include developmental psychologists and neuroscientists, whose research on how humans learn is used to inform the design of programs that learn. These programs will undoubtedly improve over time – a number of machine learning algorithms have successively outperformed predictions of their accuracy. We therefore have only a finite window of time in which to see their imperfections and glimpse the fascinating machinations of their workings.

5 Conclusion

We have examined the strong analogies between the process of retelling a fairy tale and the way machine learning tools, in particular image captioning programs, work.  Through reviewing the methodologies used to compile image databases we come to understand how these programs learn only the conventional, and yet how these same methodologies produce absurd and surreal moments. In the same way, fairy tales absorb the contemporary conventional yet defy realism, speaking to the fantastic and the illogical.  Ultimately, the raw material that both fairy tales and training sets draw on is a product of people – the temporal, cultural and social contexts may differ, but it is always human.  Just as each retelling of a fairy tale gives us a reflection of its time, so the captions and labels given by machine learning tools give us a reflection of ourselves.