The aesthetics of absence : awareness in the age of neural networks


The Aesthetics of Absence:

Awareness in the Age of Neural Networks

by

Matthew Groh

B.A., Middlebury College (2010)

Submitted to the Program of Media Arts and Sciences, School of

Architecture and Planning

in partial fulfillment of the requirements for the degree of

Master of Science in Media Arts and Sciences

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2019

© Massachusetts Institute of Technology 2019. All rights reserved.

Author ... Signature redacted

Program of Media Arts and Sciences, School of Architecture and Planning

May 10, 2019

Certified by ... Signature redacted

Iyad Rahwan

Associate Professor of Media Arts and Sciences

Thesis Supervisor

Accepted by ... Signature redacted


The Aesthetics of Absence:

Awareness in the Age of Neural Networks

by

Matthew Groh

Submitted to the Program of Media Arts and Sciences, School of Architecture and Planning

on May 10, 2019, in partial fulfillment of the requirements for the degree of

Master of Science in Media Arts and Sciences

Abstract

In this hurtling technological age, the world seems more lost than ever before. When we optimize only for what can be observed, we can lose sight of the mysteries that help to define us. This thesis begins with the premise that not all aspects of humanity are amenable to empirical study. Inspired by a monoprint painted by Paul Klee and vividly described by Walter Benjamin, we design and deploy a four-part intervention at the intersection of artificial intelligence and media. First, we probe the tradition of via negativa. Second, we develop an artificial intelligence (AI) model that can disappear objects in photographs and deploy it online on a website called Deep Angel. From August 2018 to April 2019, over 100,000 people visited Deep Angel. Third, we examine the precautionary principle for AI media manipulation with a randomized experiment. In this particular domain with this particular technology, we find that exposure to media manipulation improves individuals' ability to detect manipulations. Fourth, we create art. By infusing ancient wisdom traditions with modern technologies, this thesis points to a path out of digital and material clutter towards a rehabilitation and recovery of what has been lost in this Internet age: presence.

Thesis Supervisor: Iyad Rahwan

Title: Associate Professor of Media Arts and Sciences


This master's thesis has been examined by a Committee of the Department of Media Arts and Sciences as follows:

Professor Iyad Rahwan ... Signature redacted

Thesis Supervisor

Associate Professor of Media Arts and Sciences

Dr. Andrew Lippman ... Signature redacted

Thesis Reader

Senior Research Scientist

William Powers ... Signature redacted

...


Acknowledgments

Some of the text and images in this thesis proposal have been previously submitted to, will soon be submitted to, are under review at, or have been accepted in peer-reviewed journals or art galleries. I am grateful for the wonderful guidance of Manuel Cebrian, William Powers, Andrew Lippman, and Iyad Rahwan. I thank Micah Epstein and Julian Kelly for designing fantastic graphics for the Deep Angel website, Joyce Feng for providing excellent research assistance as an MIT UROP, the Harvard Cyber Law group (Mason Kortz, Jessica Fjeld, Sally Kagay, and Rebecca Rechtszaid) for expert legal assistance and guidance, Abhimanyu Dubey for brilliant technical advice, and Zivvy Epstein for outstanding collaboration and support of the development of ideas throughout this project from the technical to the creative. All errors are my own.


List of Figures

1-1 Map of Creation: The Aristotelian Elements

1-2 Map of Creativity: The Buddhabrot of Creativity.

1-3 Photographic manipulation has been a tool of fascist governments in nefarious attempts to subvert reality. On the top left, Joseph Stalin is standing next to Nikolai Yezhov, whom Stalin later ordered to be executed and disappeared from the photograph. On the top right, Mao Zedong is standing beside the "Gang of Four," who were arrested a month after Mao's death and subsequently erased from this historic photograph. On the bottom, Benito Mussolini strikes a heroic pose on a horse while his trainer holds the horse steady. The photographic manipulation showcases Mussolini's skill for manipulating the facts and covers up his lack of horsemanship.

2-1 Angelus Novus by Paul Klee. 1920.

2-2 Examples of Negative Space in Paul Pfeiffer's Four Horsemen of the Apocalypse and Adrian Piper's Everything

3-1 Comparisons of inpainting algorithms. Image graphics from Mikhail

3-2 End-to-end pipeline for Target Object Removal following [30,71]

3-3 End-to-end pipeline for Unanchored Object Conjuring following [30,34,71]

3-4 Screenshots of Deep Angel's user interface.

3-5 Diagram of Deep Angel's server architecture

4-1 Examples of original images uploaded to Deep Angel and corresponding manipulations

4-2 Probability density function displaying the accuracy of guesses over images

4-3 Accuracy of guesses over exposure to manipulated images with a 95% confidence interval.

5-1 Screenshot from The Broken Flaneur short film. Watch it on YouTube at https://youtu.be/1QCFAwuIUUE

5-2 Photographs from the AI Spirits collection.

5-3 Photographs from the Shadows sans Substance collection.

A-1 This just got meta. Mason Kortz, Joan Donovan, Jessica Fjeld, and Matt Groh are disappeared from their panel at 2019 SXSW titled

List of Tables

4.1 Top 10 Target Object Removal Selections for Uploaded Images and Targeted Instagram Crawls on Deep Angel. Each Instagram username selection initiated a targeted crawl of Instagram for the three most recently uploaded images of the selected user.

4.2 Logistic regression results on guessing accuracy with user and image fixed effects. Standard errors in parentheses. *, **, and *** indicate statistical significance at the 90, 95, and 99 percent confidence levels, respectively. All columns include user and image fixed effects. Column (1) shows all users, (2) drops all images where nothing was disappeared, (3) drops all users who submitted fewer than 10 guesses, (4) drops all observations where a user has already seen a particular image, (5) keeps

Chapter 1 Introduction

Change the perspective of your eyes and you see the whole world before you is radiant.

Joseph Campbell

At a moment in history when technology is increasingly distracting, overwhelming, and manipulating, this thesis is a clarion call to humankind to recognize what it means to be human again. Humans are technology's creators. What we create reflects who we are. If we wish to know ourselves (and we ought), we need to make the time to introspect. This is not a new problem; it is our ever-present challenge. In Charlie Chaplin's final speech in The Great Dictator, he explains the fundamental tension between modern technology and humanity: "Machinery that gives abundance has left us in want. Our knowledge has made us cynical; our cleverness, hard and unkind. We think too much and feel too little. More than machinery we need humanity." Absence is the oft-overlooked anti-medium that offers a rehabilitation. In the words of Marshall McLuhan, anti-media and counter-environments provide "a means of perceiving the dominant one [environment]" and create a wider awareness [44]. Through absence, we can explore being human with fresh eyes. This thesis is not an analysis of absence. Rather, it is a generative endeavor, a provocative probe and confrontational intervention. Following the trail of absence to its exhaustion, the aesthetics of absence is a lens to understand media, society, humanity, and the soul.

This probe considers how we can learn from the world by what it is not. This anti-approach is called via negativa and has roots in both early Christian mysticism and the Upanishads [68]. From religion to philosophy to politics to art, via negativa can be applied to build a deeper understanding of our environment. This thesis confronts technological determinism by co-opting artificial intelligence (AI) media manipulation to intervene in the default state. People should be able to access their default state network and set their own defaults. Via negativa is the thread that ties together speculative design, machine learning engineering, behavioral science, and generative algorithmic art in this thesis. Taken together, these seemingly disparate directions unite for an intervention aimed at the ultimate question of tech humanism: how do we encode the machines of the future with our best selves?

This thesis begins with three meditations: first, the relationship between four modes of creativity and absence; second, an experiment to reawaken wonder and rekindle presence; third, a confrontation with AI media manipulation. The goal of this thesis is to critique and expand on ancient wisdom, build new technology, understand human behavior, and present a fresh perspective. Drawing on metaphors from ancient civilizations to post-modern philosophers, I collaborated with colleagues to design an interactive media experiment based on an AI model. On the technical side, I engineered a neural network architecture that disappears objects from images. In other words, this neural network architecture is an AI model that can generate absence in photographs. I hosted this AI model on a website called Deep Angel where anyone on the Internet can interact with it. From August 2018 to April 2019, over 100,000 people visited Deep Angel. Users uploaded their own images and rated the quality of the manipulation of other users' uploaded images. Based on users' ratings, this thesis examines how exposure to manipulated media affects the ability to detect manipulations. Before concluding, this thesis presents three algorithmically generated artworks.

1.1 Creation and Creativity

Across ancient civilizations from Greece to India to Egypt, philosophers conceptualized Earth and its complexities as a mixture of four elements: earth, water, air, and fire. In On Generation and Corruption, Aristotle posits a further atomization of terrestrial nature. He describes two forms of change (hot and wet) and their corresponding privations (cold and dry) as underlying dimensions upon which matter attains its quality [42]. From a graphical perspective, the four classic elements manifest along points on the Cartesian plane where the X and Y axes are defined by Aristotle's dualistic forms of change. While Aristotle believed this discontinuous movement along these two dimensions could explain creation on Earth, he posited that something else explains the greater cosmos. Instead of the rectilinear motion that characterized changes between the classic four elements, he imagined the movement and materiality of the cosmos as continuous and circular with no contrary [42]. To explain what first set the cosmos in motion, Aristotle proposed a fifth element, aether, that accounts for the movement and non-dualistic nature of the greater cosmos [42]. In other cultures, this fifth element has been referred to as quintessence (pre-Renaissance Europe), Qi (ancient China), Mana (ancient Polynesia), and Akasha (ancient India). In Jewish mysticism, this element manifests as a process known as tzimtzum, the contraction of infinite light enabling an empty space in which the physical world and free will come into existence. This metaphysical non-dualistic essence is central to the Buddhist teaching of Sunyata, the teaching of emptiness. Before probing emptiness, let us consider how things come to be through creativity.

Aristotle's systematization of elements bears a striking resemblance to recent conceptualizations of creativity. In Rich Gold's 2007 book, The Plenitude: Creativity, Innovation, and Making Stuff, Gold condenses creativity into a two-by-two cartoon matrix of hats. Each hat - Science, Engineering, Design, Art - represents a cornerstone of creativity and a path to worldly production [27]. While none of these hats is strictly defined, each represents a different approach to the act of creation. In a blog post reflecting on Rich Gold's matrix, John Maeda ascribes a mission to each corner of creativity. Science is for exploration, Engineering for invention, Design for communication, and Art for expression [43]. In creative processes, these hats are all intertwined and support each other. In Age of Entanglement, Neri Oxman extends these ideas into a speculative map, the Krebs Cycle of Creativity, that addresses the questions of (1) how we travel between the four "embodiments of creativity and innovation" and (2) the results of inhabiting creativity's interstitial zones [51]. Akin to Aristotle's forms of change, Oxman identifies two axes of traversal. The first axis spans culture and nature, which divides art and design from science and engineering. The second axis bridges production and perception, which splits art and science from design and engineering. In contrast to Aristotle's conceptualization of rectilinear movement, Oxman metaphorically describes the energy and movement of creativity with the Krebs Cycle, noting the "(r)evolution[ary]", perpetual nature of creative energy [51].

Oxman implores her readers, "Granted, my determination to posit the completed circle - to assert the continuity of the [Krebs Cycle of Creativity] - may be seen as naive, or even sophomoric. Please assume the former, and suspend disbelief." [51] I listen and ask the same as I expand on her ideas. Approach this idea with a beginner's mind and assume naivete, not pretension. After all, William James once wrote that it's "only your mystic, your dreamer, or your insolvent tramp or loafer, [who] can afford so sympathetic an occupation, an occupation which will change the usual standards of human value in the twinkling of an eye, giving to foolishness a place ahead of power, and laying low in a minute the distinctions which it takes a hard-working conventional man a lifetime to build up." [35] I speculate a unification of the Aristotelian classification of elements and the Krebs Cycle of Creativity, and I propose a missing fifth element.

In both maps of creation and creativity, the X-axis represents change in entropy, the degree of disorder in a system. According to the Second Law of Thermodynamics, entropy in an isolated system never decreases.¹ Without an external source of energy, a system's entropy tends to increase. On the other hand, supplying energy to a system from outside can decrease its entropy locally. If we imagine nature as the forces (laws of physics) that act upon an open system at the cosmic scale and culture as the forces (where we develop and maintain values, institutions, art, and technology) that act upon a system within nature, then nature would represent increasing entropy and culture, reversing entropy. By reversing entropy, I mean rejecting the tendency for things to fall apart. We should remember Oxman's refrain that "nature is culture is nature" and recognize we could also imagine nature and culture representing the reverse [51].

The Y-axes of Aristotle's and Oxman's maps align based on the qualities of their respective forms. The best way to understand what Aristotle meant by wetness and dryness is to consider an example, say flour. When flour is dry, it is a light powder that is easy to separate and disperse. But, once you add water to it, it immediately sticks together. The material quality of dryness connotes separateness and objectivity. On the other hand, wetness naturally coheres and manifests in a subjective form. From the dimensions of objectivity and subjectivity, we can project production and perception. Production is a reification of concepts and objects into a defined state. The existence of that state is objective. In contrast, perception takes an input and becomes an observer's experience. The state of the observer's experience is subjective; it depends on the observer.

¹ More specifically, the Second Law of Thermodynamics states that the total entropy of an isolated system can never decrease over time.

The missing fifth element in the map of creativity is the aesthetics of absence. It is not absence itself, but instead, it is the truth and beauty of presence that absence reveals. W.B. Yeats called it spiritus mundi, the creative spirit that inspires poets [69]. In Buddhism, it is called Sunyata or simply Buddha-nature. We can imagine this fifth element as akin to aether with no contrary and perpetual motion. The point here is to recognize the underlying essence of creativity: the generative nature of absence in the "Aha!" moments of presence.

I adapt previous maps to include this fifth element at its core. Instead of the metaphor of rectilinear or cyclical motion, creativity most certainly moves in a fractal pattern where the parts resemble the whole. Rather than cyclical repetitions, fractal motion is a metaphor for pattern matching in an ever expanding and changing context. Fractals can be generated by simple equations that produce complex geometric figures. For example, the Mandelbrot Set is defined as the set of points $c$ in the complex plane for which the quadratic recurrence $z_{n+1} = z_n^2 + c$, with $z_0 = 0$, does not tend to infinity as $n$ goes to infinity. By filtering out non-escaping trajectories and plotting the remaining trajectories on the Complex Plane, we begin to see the Buddhabrot, a fractal resembling the Buddha, after only a few iterations. As a metaphor, the Buddhabrot is a visual call to the Buddha-nature in everything. By plotting the Buddhabrot on the Complex Plane and overlaying the plot with the four embodiments of creativity in the same relative positions as the previous maps of creativity, we can consider a new perspective. By moving beyond Euclidean Space to the Complex Plane, perception and production are approached with a mathematical metaphor. Specifically, I re-formulate Oxman's perception/production axis (Aristotle's wet/dry) as the Imaginary axis spanning imagination and reification. Likewise, the nature-culture (hot/cold) axis is represented by changes in entropy, which is signified by the Real axis. The fractal itself represents the entanglement of chaos and order, nature and culture, perception and production in the iterative creation process. As another perspective on creativity, I present the Buddhabrot of Creativity.
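The recurrence described above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the function names `escapes` and `buddhabrot`, the sample count, the iteration cap, the escape radius of 2, and the grid size are all illustrative assumptions, not the settings used to produce the Buddhabrot of Creativity. It follows the standard Buddhabrot recipe: sample points c at random, keep only those whose trajectories escape, and accumulate every point those trajectories visit into a two-dimensional histogram over the complex plane.

```python
import numpy as np


def escapes(c, max_iter=50, radius=2.0):
    """Return True if z -> z**2 + c, started at z = 0, leaves |z| <= radius
    within max_iter steps. Points that do not escape within the cap are
    treated as belonging to the Mandelbrot Set (an approximation)."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > radius:
            return True
    return False


def buddhabrot(samples=20000, max_iter=50, size=64, seed=0):
    """Accumulate the trajectories of escaping points into a size x size grid
    covering the square [-2, 2] x [-2, 2] of the complex plane."""
    rng = np.random.default_rng(seed)
    hist = np.zeros((size, size), dtype=np.int64)
    lo, hi = -2.0, 2.0
    for _ in range(samples):
        # Hypothetical sampling scheme: draw c uniformly from the square.
        c = complex(rng.uniform(lo, hi), rng.uniform(lo, hi))
        if not escapes(c, max_iter):
            continue  # filter out non-escaping trajectories
        z = 0j
        for _ in range(max_iter):
            z = z * z + c
            if abs(z) > 2.0:
                break
            # Real part maps to the column, imaginary part to the row.
            col = int((z.real - lo) / (hi - lo) * size)
            row = int((z.imag - lo) / (hi - lo) * size)
            if 0 <= row < size and 0 <= col < size:
                hist[row, col] += 1
    return hist


if __name__ == "__main__":
    density = buddhabrot()
    print(density.shape, int(density.sum()))
```

The resulting array can be passed to any image-plotting routine; brighter cells correspond to regions that escaping trajectories visit more often. Larger sample counts and iteration caps give a smoother figure at the cost of run time.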

[Figure 1-1: Map of Creation: The Aristotelian Elements (4th century BCE). Elements shown: water, air, earth, fire; horizontal axis: cold to hot.]

1.2 Experiment in Phenomenology

Another map for creation and creativity could be a blank canvas upon which the observer imbues his or her own meaning. In 1989, Shepard Fairey and his posse set in motion OBEY, a guerrilla marketing campaign with no call to action. Drawing inspiration from the 1988 film They Live and its signs containing single-word imperatives to "CONFORM," "OBEY," and "CONSUME," the posse designed and deployed an experiment.² They designed a stencil of Andre the Giant and the single word, OBEY, and posted stickers of the stencil and OBEY in dense, urban environments. Without social media or even the World Wide Web, these stickers went viral and became a global meme. In Fairey's online Manifesto, he describes OBEY as an "experiment in phenomenology" designed to "reawaken a sense of wonder about one's environment." [23] OBEY is jarring and frustrating. Nobody likes to be commanded to do anything, let alone simply to obey. On the surface, OBEY does not refer to anything. That is the point. If people take the time to reflect on OBEY, they realize its irony, which addresses the idea that all commercial ads are trying to get us to obey. In Fairey's words, the intention of the OBEY experiment was to "stimulate curiosity and bring people to question both the sticker and their relationship with their surroundings." [23] Deep Angel extends the essence of OBEY to AI applied to content generation and media manipulation.

1.3 Artificial Intelligence and Media Manipulation

The recent emergence of artificial intelligence (AI) powered media manipulations has widespread societal implications for all fields, and in particular for journalism [17], democracy [17], national security [3], and art [31]. On one hand, AI has the potential to scale misinformation to unprecedented levels by creating various forms of synthetic media. For example, AI systems can synthesize realistic video portraits of an individual with full control of facial expressions including eye and lip movement [26,39,59,64,65]; AI systems can clone a speaker's voice with few training samples and generate new natural sounding audio of something the speaker never previously said [5]; AI systems can synthesize visually indicated sound effects [50]; AI systems can generate high quality, relevant text based on an initial prompt [53]; AI systems can produce photo-realistic images of a variety of objects and combinations of objects from text inputs [14,38,48]; AI systems can generate photo-realistic videos of people expressing emotions from a single image [8].

On the other hand, these generative AI systems can offer new creative tools for artists and practitioners alike. For example, the Creative Adversarial Network learns art by its styles and generates new art by deviating from the styles' norms [21], while interactive GANs (iGANs) can offer applications for artists and designers to explore new ideas [16]. These examples highlight the diversity, automation, and scale of content generation in the age of AI.

Plus ça change, plus c'est la même chose (the more things change, the more they stay the same). Media manipulation is not new. In fact, it goes by many names - propaganda, fake news, misinformation, truthiness. For a particular kind of media manipulation, there's a modern Latin term, damnatio memoriae, that refers to the erasure of an individual from official accounts, often in service of dominant political agendas. The earliest known instances of damnatio memoriae were discovered in ancient Egyptian artifacts, and similar patterns of removal have appeared in most image-based societies across time and space since [25,66]. Figure 1-3 presents iconic examples from recent history of individuals removed from photographs in this same fashion with an aim towards advancing a particular political agenda.

Beyond the philosophical and political concerns, scalable media manipulation has practical concerns. In 1986, the Whole Earth Review published its 47th issue focusing on the state-of-the-art technology for image manipulation. The review includes the following excerpt from a fictional legal trial, which speaks to both the foresight of the publication and how media manipulation has long been a technological concern:

Your Honor, we cannot accept this photograph in evidence. While it purports to show my client in a hotel bedroom with a woman not his wife, there is no way to prove the photograph is real. As we know, the craft of digital retouching has advanced to the point where a "photograph" can represent anything whatever. It could show my client in bed with Your Honor.

To be sure, digital retouching is still a somewhat expensive process. A black-and-white photo like this, and the negative it's made from, might cost a few thousand dollars to concoct as fiction, but considering my client's social position and the financial stakes of this case, the cost of the technique is irrelevant here. If Your Honor prefers, the defense will state that this photograph is a fake, but that is not necessary. The photograph could be a fake; no one can prove it isn't; therefore it cannot be admitted in evidence. Photography has no place in this or any other courtroom. For that matter, neither does film, videotape, or audiotape, in case the plaintiff plans to introduce in evidence other media susceptible to digital retouching.

- Some lawyer, any day now [2]


nature of media in the age of generative neural networks. Historically, visual and audio manipulations required both skilled experts and a significant investment of time and resources. Today, AI can produce photorealistic manipulation nearly instantaneously and at scale. This new capability poses an existential threat for standards of evidence, and thus, the changing technology calls for an examination of humans' ability to discern AI-generated manipulations and how society trusts media.

Recently, research institutions have applied the precautionary principle to the dissemination of media manipulation technologies. For example, Google withheld the discriminator for their BigGAN model while publicly hosting the generator for anyone to play with [14]. BigGAN can generate realistic appearing objects in images [14]. Similarly, OpenAI restricted access to their GPT-2 model while open-sourcing a pared down model trained with fewer parameters [53]. GPT-2 can generate a plausible story given an initial prompt [53]. Withholding access to AI models prevents the general population and research community from further evaluating these AI models. Technical know-how is not enough for replication of these kinds of models. The largest barriers are the computational costs and access to appropriate data. If so desired, a well-resourced state actor could overcome these barriers. As such, important questions arise at the intersection of AI and content generation: how should we apply the precautionary principle in the field of AI research, and can we adapt our ability to detect fakes produced by increasingly sophisticated AI models?


Figure 1-3: Photographic manipulation has been a tool of fascist governments in nefarious attempts to subvert reality. On the top left, Joseph Stalin is standing next to Nikolai Yezhov, whom Stalin later ordered to be executed and disappeared from the photograph. On the top right, Mao Zedong is standing beside the "Gang of Four," who were arrested a month after Mao's death and subsequently erased from this historic photograph. On the bottom, Benito Mussolini strikes a heroic pose on a horse while his trainer holds the horse steady. The photographic manipulation showcases Mussolini's skill for manipulating the facts and covers up his lack of horsemanship.

As a medium, photography connects us to moments, places, and interactions that we might never dream of seeing. From an Earth-rise over the moon to a black hole in outer space to Martin Luther King Jr. in front of the Lincoln Memorial during the March on Washington for Jobs and Freedom, photographs offer us a chance to examine a moment frozen in space and time.

Beyond serving as evidence and insight, photography is a medium for inducing empathy. Photography can be "used to stimulate a moral response" [62]. For example, consider the moral outrage you feel when looking at the photograph of Phan Thi Kim Phuc running naked on a road after being burned by a South Vietnamese napalm attack. This visceral feeling of moral outrage in response to the atrocities of war is absolutely justified. However, this moral response can be artificially co-opted. As an example, consider the Time Magazine cover photo of O.J. Simpson that dramatically darkened his face and evoked a sinister perception among magazine readers. The problem is that Time Magazine intentionally took advantage of how we process images. In the words of Richard Misrach, "Every medium creates a primary illusion... the novel creates an illusion of memory; music creates the illusion of passing time; drama creates the illusion of history... photography creates the primary illusion of fact." [2] We need to be careful not to assume fact in photography.


Chapter 2 Design via Negativa

Does this spark joy?

Marie Kondo

If we assumed that both (a) all aspects of what it means to be human are amenable to empirical study and (b) our language is comprehensive enough to explain anything and everything, then we would be able to describe the world strictly through positive statements. But, neither assumption is realistic. There is so much we do not know and cannot know. Via negativa - describing the world by what it is not - can help us navigate this uncertainty.

In quantum mechanics, the Heisenberg uncertainty principle states that the position and momentum of a subatomic particle cannot both be known precisely at the same time. Likewise, Gödel's incompleteness theorems show that a complete and consistent set of axioms for mathematics cannot exist. If we accept the uncertainty principle or Gödel's incompleteness theorems, then we admit that uncertainty and unknowability are an intrinsic part of the universe. Furthermore, language cannot communicate everything. For example, we often describe experiences


with "you had to be there." In a 1964 case before the Supreme Court, Justice Potter Stewart explained that he could not explicitly describe obscenity. Instead, he said, "I know it when I see it." Sometimes, we do not even know something when we see it. Behavioral science is a field that studies these kinds of blind spots. One particular psychological bias is known as the endowment effect [36]. Without naming it as such, Aristotle described this bias as follows: "For most things are differently valued by those who have them and by those who wish to get them: what belongs to us, and what we give away, always seems very precious to us." [6] In other words, we attach additional value to objects that we own simply because we own the objects. Over time, the endowment effect leads to material overload because we continue to accumulate things and falsely attach value to things we do not truly value. As a solution to material overload, a Netflix-series tagline turned 2019 Internet meme asks, "Does this spark joy?" If something does not spark joy, then the answer - via negativa - is to throw it out. The essence of this meme is age-old wisdom. In Revelation 3:14-16, in the last book of the New Testament, God sends a message to the angel of the Church of Laodicea: "So, because you are lukewarm, and neither hot nor cold, I will spit you out of my mouth." Life is not worth tepid enthusiasm. When we recognize our world drifting to the lukewarm clutter, it is our moral obligation to call it out.

2.1 Angelus Novus

Angelus Novus, a monoprint by Paul Klee, embodies a call-to-arms against the lukewarm clutter of thoughtless progress. Walter Benjamin described the monoprint as follows:


away from something he is fixedly contemplating. His eyes are staring, his mouth is open, his wings are spread. This is how one pictures the angel of history. His face is turned toward the past. Where we perceive a chain of events, he sees one single catastrophe which keeps piling wreckage upon wreckage and hurls it in front of his feet. The angel would like to stay, awaken the dead, and make whole what has been smashed. But a storm is blowing from Paradise; it has got caught in his wings with such violence that the angel can no longer close them. The storm irresistibly propels him into the future to which his back is turned, while the pile of debris before him grows skyward. This storm is what we call progress. [11]

From Benjamin's perspective, technology is going awry. A monolithic catastrophe is brewing. While the angel is blind to the future, he can see the past as a growing pile of material overload. Angelus Novus' desire to warn the world about these visions sparked the conception of Deep Angel and led to research into past activist movements that revealed the state of reality by showing what it is not.

[Figure 2-1: Angelus Novus by Paul Klee. 1920.]

2.2 Détournement

In logic, the Latin phrase reductio ad absurdum names a form of counterargument that makes its case by examining the extremes to which an argument's premise leads. One way to reconsider the future is to take the present to its logical extreme and examine what you see. Détournement (French for hijacking or rerouting) is an activist technique combining reductio ad absurdum with via negativa to subvert existing power structures. In the 1950s, Guy Debord and the Situationist International saw the system of capitalism as reducing life to commodified experiences. Similar to Angelus Novus' gaze into a "single catastrophe", Debord warned humanity of the spectacle, whereby a superficial manifestation of a value system gone awry takes control of individuals' agency and rules them as passive subjects controlled by a set of commodities [19]. As a way out of the spectacle, détournement inverted the dominant media culture upon itself to create satire and provoke thought. In détournement's wake sprang the culture jamming movement of the 1980s, from which They Live and OBEY draw their subversive roots.

Before détournement, there was Dada and Surrealism. Marcel Duchamp infamously described his ready-made art as an antidote to the "retinal" art of his contemporaries that was only pleasing the eyes. [7] Duchamp's intention was to engage the mind. Today, Banksy's artwork continues the tradition of subverting traditional power structures by inverting their forms and functions.

Closely related to the counter-environment movements of the 20th century, Diego Velázquez (the original selfie-taker) pioneered the method of flipping art from the spectacle of the scene onto the spectacle of the seeing. [24] In Las Meninas, you are looking at a painting in which the artist is staring back at you. Las Meninas is not like other paintings. As a viewer of the painting, you are triggered to think differently. Rather than contemplate the scene, you start to think about how the artwork perceives you. Or you consider how the media consumes you rather than you consuming the media. The process of meta-thinking is a key aspect of the via negativa, and these thought processes frequently arise in mystical approaches to understanding the Divine.

2.3 Suchness

In theology, via negativa is referred to as the apophatic approach to the Divine. Apophatic theology describes God by what God is not. Its opposite, cataphatic theology, approaches God by affirmations about what God is.

Zen koans are seemingly paradoxical statements or questions that are designed to help truth-seekers release themselves from self-deception. In a discussion on suchness, the Buddhist philosopher D.T. Suzuki once explained that "to be absolutely nothing is to be everything. When one is in possession of something that something will keep all other somethings from coming in." [45] The underlying idea is that the exhaustion of absence is the presence of suchness. Once we tune in, we see this suchness everywhere. In order to further explain this concept, D.T. Suzuki tells a story:

Student: Am I in possession of Buddha consciousness?

Guru: No.

Student: Well, I heard that all things are in possession of Buddha consciousness. The stones, the trees, the flowers, the birds, the animals, and all beings.

Guru: Yes, you are correct. All things are in possession of Buddha consciousness. The stones, the flowers, the bees, the birds, but not you.

Student: Why not me?

Guru: Because you're asking the question. [15]

By asking the question, the student is focusing on the knowledge of himself as separate from the universe and all things. Once he begins to live in the knowledge of himself as transcendent, he will come into being with Buddha-nature. [15] Sometimes, via negativa is most effective when it is apparent and explicit.

2.4 Negative Space in Art

What happens when people are removed from photographs? Two artists, Adrian Piper and Paul Pfeiffer, reframe what we have previously seen as an act of totalitarian media manipulation.

In one of Piper's pieces from her Everything series, she creates a photographic palimpsest, which effaces two people's faces and writes "Everything will be taken away" over the effaced portion of the photograph. Her series draws from the following quote from Aleksandr Solzhenitsyn's Two Hundred Years Together: "You only have power over people so long as you don't take everything away from them. But when you've robbed a man of everything, he's no longer in your power - he's free again." Piper in her art and Solzhenitsyn in his prose suggest that once we lose everything, we become free again. In other words, absence creates presence.

From another perspective, Pfeiffer's Four Horsemen of the Apocalypse series examines what the world looks like when we remove particular aspects. In one piece, he removes the basketball and the other players on the court, leaving Bill Russell by himself in mid-air going for a jump ball. What would have been obscured by the presence of everything else becomes vivid and prominent. In the middle of an empty court in front of an audience of thousands, Russell appears as Christ on an invisible crucifix. What is notable here is that the absence of a few components offers a new perspective and an important question: why does removing things from a scene allow us to see something new in an old scene?

Figure 2-2: Examples of Negative Space in Paul Pfeiffer's Four Horsemen of the Apocalypse

Chapter 3 Engineering Computer Vision for Human Vision

The soul without imagination is what an observatory would be without a telescope.

Henry Ward Beecher

3.1 Machine Vision

Convolutional neural networks have surpassed human-level accuracy in a variety of object recognition and detection tasks [30,40]. Likewise, recent image generation variants of the generative adversarial network algorithm are capable of producing high-quality natural images [28,37]. If computers can successfully detect objects and generate new scenes, what are the limits of computers to re-write history by making subtle yet automatic changes to photographs? In particular, how well can computers replace objects with a plausible background given the context of the photograph? If computers are reasonably successful at imagining the scenery behind objects, then how well can computers conjure objects back into scenes? Are computers creative? What kind of metaphors does algorithmic omission create? These are the high-level questions that guide the machine learning portion of my thesis. I developed two end-to-end neural networks to address these questions: (1) targeted object removal and (2) unanchored object conjuring. The targeted object removal network is intended to detect objects and erase them from images. The unanchored object conjuring network is intended to reverse the targeted object removal and reconstruct objects in an image.

3.2 Related Work

3.2.1 Object Detection and Instance Segmentation

Over the past decade, convolutional neural networks have dramatically improved computational performance in object detection and instance segmentation [40]. Object detection and instance segmentation provide an automatic and scalable method to identify objects in images and separate the objects from the background and each other. Today, the state-of-the-art convolutional neural network for object detection and instance segmentation is Mask R-CNN, which builds upon a series of convolutional neural networks: R-CNN, Fast R-CNN, and Faster R-CNN [30]. The neural network is a region-based network that identifies a manageable number of potential object regions and evaluates each region with a convolutional neural network. Once the network identifies the bounding box of an object, it segments the object from its local bounding box.


3.2.2 Image Inpainting

Image inpainting refers to the filling in of missing pixels in an image. The first image inpainting algorithm introduced a directional image propagation scheme to refill selected portions of images [12]. Adobe commercialized inpainting as a feature called Content-Aware Fill in Photoshop. Adobe's Content-Aware Fill is powered by the PatchMatch algorithm, which enables portions of an image to be removed and replaced with an approximate nearest neighbor image patch [9]. PatchMatch is particularly adept at matching stationary backgrounds with uniform texture, but it often fails on non-stationary cases where objects overlap with other objects and the space could be described with a semantic representation. Previous solutions to handle non-stationary cases rely on copying patches from similar scenery from a large database of images rather than imagining a wholly new patch [29].

In the last year and a half, dilated convolutional neural networks trained with an adversarial loss function have demonstrated a dramatic improvement in image inpainting performance [33,71]. These networks leverage semantic information learned from large-scale datasets to handle stationary and non-stationary backgrounds and "imagine" missing content in the masked portion of the image. Further refinement of this architecture includes a contextual attention layer to capture non-local dependencies in images and gated convolutions to handle the inpainting of non-rectangular, freeform masks [41, 70, 71]. The size, variety, and quality of the datasets used for training these neural networks determine the inpainting quality. These neural networks can be trained on scenery or human faces, and they perform well on images similar to those on which they were trained.

One month after the launch of Deep Angel, Towards Data Science, a Medium publication, published a blog post comparing ground truth images to three versions of inpainting: (1) human artists, (2) neural network inpainting, and (3) non-neural network inpainting [22]. In subjective quality scores of image inpainting, the unaltered image has the highest score, followed by artists [22]. The highest performing computer vision inpainting model is generative inpainting trained on the MIT Places 2 dataset, which is the model used in the Target Object Removal pipeline [22].

[Figure 3-1 graphic: subjective quality scores in two panels, "Overall (3 images)" and "Overall (33 images)". Legend: ground truth, human artist, neural method, non-neural method. Methods shown: Generative Inpainting (Places2), Generative Inpainting (ImageNet), Adobe Photoshop CS5, Statistics of Patch Offsets, Exemplar-Based (patch 13 pixels), Exemplar-Based (patch 9 pixels), Partial Convolutions, High-Resolution Neural Inpainting, Globally and Locally Consistent, Shift-Net, Deep Image Prior.]

Figure 3-1: Comparisons of inpainting algorithms. Image graphics from Mikhail Erofeev's Image Inpainting: Humans vs. AI [22]

3.3 Target Object Removal

I engineered a Target Object Removal pipeline to remove objects in images and replace those objects with a plausible background. I combine a convolutional neural network (CNN) trained to detect objects with a generative adversarial network (GAN) trained to inpaint missing pixels in an image [28, 30, 37, 40]. Specifically, the model generates object masks with a CNN based on a RoIAlign bilinear interpolation on nearby points in the feature map [30]. RoIAlign bilinear interpolation is a technique used in Mask R-CNN for preserving spatial locations of object instances within a convolutional neural network [30]. Interpolation is a technique for constructing new data points within a range of discrete known data points. Bilinear interpolation is a technique for constructing these data points by interpolating a function of two variables. If we have four points, then we can write a solution to the bilinear interpolation problem as $f(x, y) = a_0 + a_1 x + a_2 y + a_3 xy$. After generating object masks, the pipeline crops the object masks from the image and applies a generative inpainting architecture to fill in the object masks [33,71]. The generative inpainting architecture is based on dilated CNNs with an adversarial loss function, which allows the generative inpainting architecture to learn semantic information from large-scale datasets and generate missing content that makes contextual sense in the masked portion of the image.
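To make the bilinear form $f(x, y) = a_0 + a_1 x + a_2 y + a_3 xy$ concrete, the following is a minimal sketch that solves for the four coefficients when the four known points are the corners of the unit square. The function names and the corner values are illustrative assumptions; this is not the RoIAlign implementation used in Mask R-CNN, only the underlying interpolation rule.

```python
def bilinear_coefficients(f00, f10, f01, f11):
    """Coefficients of f(x, y) = a0 + a1*x + a2*y + a3*x*y on the unit square.

    f00, f10, f01, f11 are the known values at (0,0), (1,0), (0,1), (1,1).
    (Hypothetical helper; the unit-square setup is an assumption.)
    """
    a0 = f00                      # f(0, 0) = a0
    a1 = f10 - f00                # f(1, 0) = a0 + a1
    a2 = f01 - f00                # f(0, 1) = a0 + a2
    a3 = f11 - f10 - f01 + f00    # f(1, 1) = a0 + a1 + a2 + a3
    return a0, a1, a2, a3


def bilinear_interpolate(f00, f10, f01, f11, x, y):
    """Evaluate the bilinear interpolant at a point (x, y) inside the square."""
    a0, a1, a2, a3 = bilinear_coefficients(f00, f10, f01, f11)
    return a0 + a1 * x + a2 * y + a3 * x * y


# Illustrative corner values, not data from the thesis.
corners = (1.0, 3.0, 5.0, 11.0)
center = bilinear_interpolate(*corners, 0.5, 0.5)
```

At the center of the square the interpolant equals the average of the four corner values, which is a quick sanity check on the coefficients.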

The end-to-end targeted object removal pipeline consists of three interfacing neural networks:

• Object Mask Generator (G): This network creates a segmentation mask $\hat{X} = G(X, y)$ given an input image $X$ and a target class $y$. In our experiments, G is initialized from a semantic segmentation network trained on the 2014 MS-COCO dataset following the Mask R-CNN algorithm [30]. The network generates masks for all object classes present in an image and the pipeline selects only the correct masks based on input $y$. This network was trained on 60 object classes.

• Generative Inpainter (I): This network creates an inpainted version $Z = I(X, \hat{X})$ of the input image $X$ and the object mask $\hat{X}$. I is initialized following the DeepFill algorithm trained on the MIT Places 2 dataset [71,72].

• Local Discriminator (D): The final discriminator network takes in the inpainted image and determines the validity of the image. Following the training of a GAN discriminator, D is trained simultaneously with I, where $X$ are images from the MIT Places 2 dataset and $\hat{X}$ are the same images with randomly assigned holes following [71,72].


For every input image and class label pair, an object mask is generated using G, which is paired with the image and inputted to the inpainting network I that produces the generated image. The inpainter is trained from the loss of the discriminator D, following the typical GAN pipeline, which can be understood as an adversarial training process between a generator and a discriminator. An illustration of our neural network architecture is provided in Figure 3-2.

[Figure 3-2 graphic: Object Mask Generation (Input, RoIAlign, Class, Box, Mask) followed by Generative Inpainting (Dilated Convolution, Coarse Result, Contextual Attention, Refinement Network, Inpainting Result), with a Spatially Discounted ℓ1 Loss and Global and Local Discriminators deciding Real or Fake.]

Figure 3-2: End-to-end pipeline for Target Object Removal following [30, 71]
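The data flow of the pipeline, a mask generator selected by target class feeding an inpainter, can be sketched in a few lines. This is only a toy illustration under loud assumptions: `generate_mask` stands in for the Mask R-CNN based generator G and works from hypothetical, already detected boxes, and `inpaint` stands in for the DeepFill based inpainter I by filling with the mean of the unmasked pixels. Neither function is a neural network, and all names are hypothetical.

```python
def generate_mask(image, detections, target_class):
    """Stand-in for G: mark pixels inside boxes whose label matches target_class.

    detections is a list of (label, (top, left, bottom, right)) pairs with
    exclusive bottom/right bounds (hypothetical detector output).
    """
    height, width = len(image), len(image[0])
    mask = [[False] * width for _ in range(height)]
    for label, (top, left, bottom, right) in detections:
        if label != target_class:
            continue  # keep only masks for the class the user selected
        for r in range(top, bottom):
            for c in range(left, right):
                mask[r][c] = True
    return mask


def inpaint(image, mask):
    """Stand-in for I: fill masked pixels with the mean of the unmasked pixels."""
    known = [px for row, mrow in zip(image, mask)
             for px, m in zip(row, mrow) if not m]
    fill = sum(known) / len(known) if known else 0.0
    return [[fill if m else px for px, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


def remove_objects(image, detections, target_class):
    """Compose the two stages: mask the target class, then fill the mask."""
    mask = generate_mask(image, detections, target_class)
    return inpaint(image, mask), mask


# Toy 4x4 grayscale image: uniform background with a bright 2x2 "person".
image = [[10.0] * 4 for _ in range(4)]
for r in (1, 2):
    for c in (1, 2):
        image[r][c] = 99.0
detections = [("person", (1, 1, 3, 3)), ("car", (0, 0, 1, 1))]
result, mask = remove_objects(image, detections, "person")
```

In this toy case the "car" box is ignored because only the selected class is masked, and the bright block is replaced by the background value.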

3.4 Unanchored Object Conjuring

If objects can be plausibly removed from images, then it is reasonable to imagine objects can be plausibly generated in an image in which they never existed. We approached adding objects to images using image-to-image translation with conditional adversarial networks [34].¹ These neural networks learn a mapping from an input image to an output image. Based on pairs of user submitted images as outputs and their resulting manipulations as inputs, we (Zivvy and I) trained a generative model that can partially bring back missing objects in images. The latent structure of the input images is encoded in information like edges, shape, size, texture, and color that are anchored across contexts. By applying image-to-image translation to the results of the Target Object Removal pipeline, we force the model to learn both the structural representation for removed objects and their contextual location. We call this process Unanchored Object Conjuring.

¹ The development of Unanchored Object Conjuring and AI Spirits was a collaboration between Zivvy Epstein and Matt Groh. Zivvy and Matt jointly conceived the idea, Zivvy developed a high-quality image curation tool and curated images, and Matt wrote the script to run the neural network. Manuel Cebrian and Iyad Rahwan were executive producers.

In October, we filtered all images uploaded to Deep Angel down to 5,634 images where people were selected to be removed. We manually filtered these images to the 1,000 best manipulations based on qualitative judgements. Then, we resized and cropped the images to 1024 x 1024. We trained on these images following the pix2pixHD image-to-image translation architecture. Figure 3-3 shows the architecture for this extended Unanchored Object Conjuring pipeline.

[Figure 3-3 graphic: the Object Mask Generation and Generative Inpainting stages of Figure 3-2, extended with an Image-to-image translation stage (generators G1 and G2 with residual blocks).]
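The text states only that training images were resized and cropped to 1024 x 1024, not how. One common recipe, assumed here purely for illustration, is to scale the shorter side to the target size and then take a center crop. The sketch below computes just the geometry of that recipe; the function name and the example dimensions are hypothetical.

```python
def resize_then_center_crop(width, height, target=1024):
    """Return the resized dimensions and the center-crop box for a square target.

    Assumption: the shorter side is scaled to `target` and the longer side is
    center-cropped. The box is (left, top, right, bottom) in resized pixels.
    """
    scale = target / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


# Example: a 4032 x 3024 phone photo (illustrative dimensions).
size, box = resize_then_center_crop(4032, 3024)
```

The returned box can be passed to whatever imaging library performs the actual resampling and cropping.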

3.5 Deep Angel

3.5.1 Interaction Design

In collaboration with lawyers, designers, and colleagues at Scalable Cooperation, I designed an interactive website called Deep Angel to make the Target Object Removal architecture publicly available.² The website is hosted at https://deepangel.media.mit.edu.

Deep Angel offers two main user interactions: (1) users can upload their own images and evaluate how the AI transformed the image and (2) users can guess which images on the website have been manipulated. Figure 3-4 contains screenshots of these two interactions.

In the first interaction, users select one of sixty objects to remove and either upload an image from their computer or select an Instagram account from which to transform the first three images. After the user submits his or her selections, Deep Angel returns both the original image and a transformation of the original image with the selected objects removed.

In the second interaction, users are presented with an image manipulated by the Target Object Removal architecture and an image from the 2014 MS-COCO dataset. Users are instructed to select the image that has something removed by Deep Angel. After the user makes a selection, Deep Angel reveals which image was manipulated and offers the user the opportunity to guess again on a new pair of images.

² We retained the Cyberlaw Clinic from the Harvard Law School and Berkman Klein Center for Internet & Society to advise and support Deep Angel. Micah Epstein and Julian Kelly designed graphics and the UI/UX framework for the website. Together, Zivvy Epstein, Manuel Cebrian, and I conceived the idea for Deep Angel. Nick Obradovich and Iyad Rahwan provided valuable insights and support throughout the process. Nick suggested the idea to include a fake detection feature on the website. I performed the machine learning engineering and backend development. I also extended Micah and Julian's frontend code for a variety of additional features.

Figure 3-4: Screenshots of Deep Angel's user interface.

3.5.2 Backend Architecture

The architecture for the Deep Angel website is diagrammed in Figure 3-5. We use NGINX, a highly stable web server, to serve a Flask application; Flask is a Python-based web framework. The Flask application has privileged access to an external API providing access to the Target Object Removal architecture. This API is hosted on a single NVIDIA GeForce GTX Titan X GPU.

When a user uploads an image, the image is uploaded to Amazon's S3 file storage system and the S3 URL for the image is sent to the Target Object Removal API. Next, the API transforms the image, saves the manipulated images to S3, returns the S3 URLs of the manipulated images to Flask, and saves all relevant data to a relational database. Likewise, when a user selects an Instagram account, the API crawls Instagram, saves the first three images of that Instagram user to S3, and repeats the same process as when an image is uploaded.
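The upload flow described above can be sketched with in-memory stand-ins. Everything here is hypothetical: `FakeStorage` replaces S3, a plain list replaces the relational database, `removal_api` replaces the Target Object Removal API with a placeholder string transform, and `handle_upload` replaces the Flask handler. The sketch shows only the order of operations, not the deployed code.

```python
import itertools


class FakeStorage:
    """In-memory stand-in for the file store (S3 in the text)."""

    def __init__(self):
        self.objects = {}
        self._ids = itertools.count(1)

    def put(self, data):
        """Store data and return a made-up URL for it."""
        url = f"memory://bucket/{next(self._ids)}"
        self.objects[url] = data
        return url

    def get(self, url):
        return self.objects[url]


def removal_api(storage, database, image_url, target_class):
    """Stand-in for the removal API: fetch, transform, store, record, return URL."""
    original = storage.get(image_url)
    manipulated = f"{original} without {target_class}"  # placeholder transform
    manipulated_url = storage.put(manipulated)
    database.append({"original": image_url,
                     "manipulated": manipulated_url,
                     "target": target_class})
    return manipulated_url


def handle_upload(storage, database, image_data, target_class):
    """Stand-in for the web handler: store the upload, then call the API by URL."""
    image_url = storage.put(image_data)
    manipulated_url = removal_api(storage, database, image_url, target_class)
    return image_url, manipulated_url


storage, database = FakeStorage(), []
original_url, manipulated_url = handle_upload(storage, database, "beach photo", "person")
```

Passing URLs rather than image bytes between the web application and the API mirrors the description above, in which only the S3 URL is sent to the API.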

When a user is interacting with the fake detection interface, a pair of images is randomly selected for display. The user selects an image, the selection is saved to the relational database, the correct selection is revealed to the user, and a new pair of images is randomly selected for display.

[Figure 3-5 graphic: Users, NGINX, Flask, PostgreSQL (AWS RDS), Target Object Removal API, File Storage (AWS S3).]

Figure 3-5: Diagram of Deep Angel's server architecture

3.5.3 Live Deployment

We publicly launched Deep Angel on August 28th, 2018 on Product Hunt, a website that curates the best new products on the Internet. 930 people upvoted Deep Angel and it was awarded #1 Product of the Day and #3 Product of the Week on Product Hunt. The next week, Deep Angel reached the top of Hacker News. Within the first few months of launching, Deep Angel was covered by the New York Times, Le Monde, Fast Company, Digg, Artsy, Aeon, and other media outlets [1,18,46,52,58,67]. From August 2018 to April 2019, over 100,000 people from across the world visited the website. With data on all these user interactions, we can begin to explore the science of deception.

Chapter 4 Science of Deception

The amount of energy needed to refute bullshit is an order of magnitude bigger than to produce it.

Alberto Brandolini

Empirically speaking, how do people interact with Deep Angel? What kind of images do people upload? How well does the Target Object Removal pipeline work? Are all manipulations plausible? Are any? How often do image removal manipulations fool people? When do they fool people? Here I describe how people used Deep Angel and apply methodologies from statistics and psychophysics to understand how people adapt to media manipulations.

4.1 User Interactions

Users uploaded 16,755 unique images from mobile phones and computers. In addition, users directed the crawling of 10,866 unique images from Instagram. The most frequently selected objects are listed in Table 4.1, which reports both the number of images from which the object was removed and the order in which the object appeared in the pull-down menu. For the image uploads and Instagram directed crawling, seven and nine, respectively, of the first ten objects listed in the user interface were in the ten most frequently selected objects. While it's difficult to disentangle the choice architecture created by an ordered list from users' preferences of what objects to disappear, there appears to be a high propensity for users to choose to remove people.

The overwhelming majority of images uploaded and Instagram accounts selected were unique. 88 percent of the usernames entered for targeted Instagram crawls were unique. The most frequently selected Instagram accounts were cats_of_instagram (25), kimkardashian (23), and realdonaldtrump (19).

Image Uploads
Object        Count    Order
Person        12293    1
Car           1195     6
Cat           1037     2
Dog           1032     3
Elephant      175      4
Bicycle       152      7
Bird          132      22
Tie           113      31
Airplane      100      13
Stop Sign     90       8

Instagram
Object        Count    Order
Person        6606     1
Cat           697      2
Dog           467      3
Elephant      162      4
Car           157      6
Bicycle       70       7
Sheep         51       5
Stop Sign     29       8
Airplane      28       13
Skateboard    24       10

Table 4.1: Top 10 Target Object Removal Selections for Uploaded Images and Targeted Instagram Crawls on Deep Angel. Each Instagram username selection initiated a targeted crawl of Instagram for the three most recently uploaded images of the selected user.
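The "seven and nine" figures can be checked directly from the Order column of Table 4.1. The short sketch below transcribes the table and counts, for each source, how many of the ten most selected objects sat in the first ten menu positions. The variable and function names are hypothetical.

```python
# (object, count, menu order) rows transcribed from Table 4.1.
uploads = [
    ("Person", 12293, 1), ("Car", 1195, 6), ("Cat", 1037, 2), ("Dog", 1032, 3),
    ("Elephant", 175, 4), ("Bicycle", 152, 7), ("Bird", 132, 22), ("Tie", 113, 31),
    ("Airplane", 100, 13), ("Stop Sign", 90, 8),
]
instagram = [
    ("Person", 6606, 1), ("Cat", 697, 2), ("Dog", 467, 3), ("Elephant", 162, 4),
    ("Car", 157, 6), ("Bicycle", 70, 7), ("Sheep", 51, 5), ("Stop Sign", 29, 8),
    ("Airplane", 28, 13), ("Skateboard", 24, 10),
]


def in_first_ten_menu_slots(rows):
    """Count top-selected objects whose menu position is among the first ten."""
    return sum(1 for _name, _count, order in rows if order <= 10)


upload_overlap = in_first_ten_menu_slots(uploads)
instagram_overlap = in_first_ten_menu_slots(instagram)
```

The overlap alone cannot separate menu position effects from genuine preference, which is the caveat raised in the text.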


Figure 4-1: Examples of original images uploaded to Deep Angel and corresponding manipulations.

4.2 Quality Evaluations

The evaluation of generative adversarial networks (GANs) for images is complicated. There is no single best quantitative metric for evaluating the performance of a GAN [13]. There exist at least 26 quantitative and qualitative measures for evaluating GANs trained on images [13]. These include metrics like the Inception Score and Fréchet Inception Distance, which work well on data from the ImageNet dataset but have been discredited for evaluation on other datasets [10, 13, 32, 55, 56, 60, 73]. In a paper that I coauthored earlier this year, we explain that the problem with evaluating a generative model on images is the assumption that the dataset of images is a reasonable proxy for "the family of distributions from which it was sampled" [49]. In light of the limitations of quantitative metrics, the evaluation of the quality of GANs relies on human judgements [14, 20, 38, 57, 60].

Drawing on methods from psychophysics, the Human eYe Perceptual Evaluation (HYPE) metric offers a structured, validated method for comparing the quality of GAN-generated images [73]. I measure performance based on HYPE∞, the rate at which people accurately identify manipulated images and real images without any time limitation. Following the HYPE method and standard practice in the psychophysics literature, the Deep Angel interface highlights which image was manipulated by revealing what was disappeared from the manipulated photograph after a user guesses [73].

By examining the most frequently misidentified images, it is possible to surface extremely plausible object removal manipulations. Figure 4-1 presents two pairs of the most frequently misidentified images. However, most images uploaded by users are not plausible manipulations. Specifically, 62% of images are correctly identified as manipulated by users for over 90% of guesses. This result should not be surprising because successful manipulations with the Target Object Removal pipeline require that the image fit several conditions, e.g. the object is relatively small and the background is not too complex. Figure 4-2 presents the distribution of accurate guessing across images, which shows that plausible manipulations are highly image-dependent.

[Figure 4-2 graphic: density plot; horizontal axis "Percent Guessed Wrong", from 0.0 to 0.8.]

Figure 4-2: Probability density function displaying the accuracy of guesses over images
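A per-image view of guessing accuracy, like the one summarized above, can be computed from a log of guesses. The sketch below assumes a hypothetical record format of (image identifier, whether the guess was correct) and uses made-up data; it is not the analysis code behind Figure 4-2.

```python
from collections import defaultdict


def per_image_accuracy(guesses):
    """Map each image id to the fraction of guesses that identified it correctly."""
    totals = defaultdict(lambda: [0, 0])  # image_id -> [correct, total]
    for image_id, correct in guesses:
        totals[image_id][0] += int(correct)
        totals[image_id][1] += 1
    return {image_id: c / n for image_id, (c, n) in totals.items()}


def share_above(accuracies, threshold=0.9):
    """Fraction of images whose accuracy is strictly above the threshold."""
    return sum(a > threshold for a in accuracies.values()) / len(accuracies)


# Made-up guess log: image "b" fools half of its viewers.
guesses = ([("a", True)] * 10
           + [("b", True)] * 5 + [("b", False)] * 5
           + [("c", True)] * 19 + [("c", False)])
accuracies = per_image_accuracy(guesses)
easy_share = share_above(accuracies)
most_misidentified = min(accuracies, key=accuracies.get)
```

Sorting images by this accuracy is one way to surface the most plausible manipulations, since those are the images guessed wrong most often.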

As users are exposed to image manipulations on Deep Angel, they learn to spot the manipulations. Figure 4-3 shows the relationship between guessing accuracy and the number of images that a user has seen. 72% of users accurately identify the manipulated image on the first guess and 88% accurately identify the manipulated image on the tenth guess.
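The relationship between accuracy and exposure can be tabulated by numbering each user's guesses in order. The sketch below assumes a hypothetical, chronologically ordered log of (user identifier, whether the guess was correct) with made-up records; it illustrates the bookkeeping only and does not reproduce Figure 4-3.

```python
from collections import defaultdict


def accuracy_by_guess_number(records):
    """Accuracy across users at their 1st, 2nd, 3rd, ... guess.

    records must be in chronological order so that each user's guesses are
    numbered in the order they were made.
    """
    seen = defaultdict(int)                # user_id -> guesses made so far
    tallies = defaultdict(lambda: [0, 0])  # guess number -> [correct, total]
    for user_id, correct in records:
        seen[user_id] += 1
        n = seen[user_id]
        tallies[n][0] += int(correct)
        tallies[n][1] += 1
    return {n: c / t for n, (c, t) in sorted(tallies.items())}


# Made-up log for two users.
records = [("u1", False), ("u2", True), ("u1", True), ("u2", True), ("u1", True)]
curve = accuracy_by_guess_number(records)
```

Later guess numbers are based on fewer users, because not everyone keeps guessing, so the right-hand end of such a curve is noisier than the left.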


