Heading image by Craig Sunter / Flickr
So we are veering into pet peeve territory, but The Next Web has a breathless article headlined “This app uses artificial intelligence to turn design mockups into source code”. Besides the misleading headline, the author conflates AI with its current implementation, i.e. deep neural networks, and conflates what those networks do with an algorithm that is merely being improved. This framing is fundamentally flawed, for the same reason that today’s AI assistants persistently fail to understand vague sentences: the neural networks of today have no understanding of what they are doing.
Imagine, if you will, that it is 1995, and the world is just now being introduced to the magical machines called computers. A user is sitting at one of these devices and has learned that to turn it on, the light on the monitor must be shining and the light on the computer box must be shining. Both lights are on, but the monitor stays black. They try pressing the keyboard. They try moving the mouse. They try pressing the buttons again, but it still will not cooperate. This is what the user has learned, and this is what they expect, and when the machine does not cooperate they have no way of knowing whether the VGA cable is unplugged, the graphics card has failed, or someone has edited the X11R6 config file and broken it.
The computer is a black box. The user has associated certain inputs with certain outputs and expects it to work. It is similar to training a parrot to press a lever for a cookie: the parrot may not know why a cookie comes out when the lever is pressed, but it will press it because it wants the cookie. A neural network is the same thing: it associates certain inputs with certain outputs, and, when presented with an input it has seen (or one that is similar enough), it produces an output similar to the one it associated with that input. And so, the UI generator described above will produce the following outputs:
Note the subtle differences. “Ground Truth” refers to the actual UI, and “Generated” refers to the output of the software based on an image of the ground truth. There are a number of cues that show what’s going on here. Ignoring the text, which is a fixable error, note that the colour scheme of the output is wrong. While the software is correctly identifying text fields, headings, titles and boxes, the contents of those boxes aren’t carried over. To understand why, we have to, for a moment, think like a neural network. The rest of this post is written against the backdrop of the previous entry, Deep Learning: A superficial look. I will use definitions from there going forward.
Let’s zoom in on the issue with colours. Specifically, let’s look at GUI 6, a relatively complex set of boxes and buttons, featured in the demo video. The generated image looks reasonably OK, but what is the neural network actually doing? Well, the output of the neural network is, according to the paper, a DSL: a domain-specific language that describes a UI in text form. The visual representation is generated from that description. This distinction is important, as it shows the “hidden” layer of transposition.
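To make that distinction concrete, here is a rough sketch of the kind of DSL text the network emits and of the separate, deterministic step that renders it into a visible UI. The token names and the token-to-markup mapping below are my own illustration, not the paper’s exact vocabulary:

```python
import re

# A made-up DSL string in the spirit of the paper's output. The token names
# (header, row, btn-green, ...) are illustrative, not the paper's exact set.
dsl_output = """
header { btn-active, btn-inactive }
row { box { small-title, text, btn-green } }
row { box { small-title, text, btn-red } }
"""

# The visible UI comes from a separate, hand-written compilation step: each
# DSL token is looked up in a fixed mapping to concrete markup. The neural
# network never touches this table; it only emits the tokens above.
TOKEN_TO_HTML = {
    "btn-green": '<button class="btn btn-success">Button</button>',
    "btn-red": '<button class="btn btn-danger">Button</button>',
    "small-title": "<h4>Title</h4>",
    "text": "<p>Lorem ipsum</p>",
}

for token in re.findall(r"[\w-]+", dsl_output):
    print(token, "->", TOKEN_TO_HTML.get(token, "<layout token>"))
```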
The DSL output is what the neural network is trained to generate. When shown certain inputs, it has been trained to produce certain outputs, but it does not necessarily know what those outputs are. For GUI 6 this means, for example, that the DNN has been trained to recognise boxes: two boxes, four boxes, one box. It has also been trained to recognise buttons in those boxes, but this is a probability game. One reason the top-left button in the generated GUI 6 is green could simply be that the DNN has learned to describe “top left box with button” as “top left box with green button”. Quite likely, regardless of what colour the button in the source image was, the generated GUI would contain a green button.
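As a toy illustration of that probability game (the numbers below are invented, not taken from the paper): if the training data mostly paired the “button in the top-left box” context with a green button token, a greedy decoder will pick green there no matter what colour the source screenshot actually shows.

```python
# Toy numbers only, invented for illustration.
# Given the decoding context "top-left box contains a button", the network
# assigns a probability to each button-colour token it saw during training.
learned_distribution = {
    "btn-green": 0.71,   # most training examples had a green button here
    "btn-orange": 0.18,
    "btn-red": 0.11,
}

# A greedy decoder simply takes the most probable token. The actual colour in
# the input screenshot never enters this choice directly; it only matters to
# the extent that it nudges the learned probabilities.
generated_token = max(learned_distribution, key=learned_distribution.get)
print(generated_token)  # -> "btn-green", regardless of the source colour
```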
In fact, going further, depending on how the DNN is trained, it is possible that it would never generate certain boxes without buttons (I am not saying this is the case here; I am illustrating). Much like the, admittedly slightly offensive, joke in The Big Bang Theory about a woman who learned English from TV commercials, and who recites a Geico commercial when all she wants to say is “15 minutes”, the DNN does not know the meaning of its output. It only knows it is the best approximation. This is also why the text content of the boxes differs between input and output: the DNN simply posts as text content whatever output it was most heavily trained on.
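The same mechanism covers the text: the network can only emit strings it has a trained association for, so the label in the source image is effectively replaced by whatever label was most common in training. A minimal sketch, with invented training counts:

```python
from collections import Counter

# Invented training statistics: how often each label string appeared as the
# text content of this kind of box during training.
label_counts = Counter({"Lorem ipsum": 912, "Submit": 43, "Cancel": 27})

def generate_label(source_label: str) -> str:
    # The text in the source screenshot is not copied over; the decoder can
    # only produce strings from its training vocabulary, and in practice it
    # falls back on the one it has seen most often.
    return label_counts.most_common(1)[0][0]

print(generate_label("Quarterly revenue"))  # -> "Lorem ipsum"
```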
In terms of input and output, these are not issues that cannot be fixed. The DNN shown in that demo video had 1,500 items to train on (with permutations of each that brought the total up to just over 300,000 entries). Note that this is artificial entropy: they made 1,500 master items and then added or removed parts, changed colours, and so on, to bring up the count. Compare that with the ImageNet challenge, which consisted of almost 500,000 images, each with different subjects, angles, colours, and so on. Furthermore, the output for that challenge was a single-word classification of an image, e.g. “cat” or “dog”. This should illustrate how difficult it is to train a neural network to produce “correct” output for a given input.
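To spell out what I mean by artificial entropy, here is a rough sketch of how 1,500 master items could be permuted into roughly 300,000 training entries. The permutation rules are my guesses, not the paper’s recipe; the point is that every derived entry shares the structure of its master, which is far less diverse than the same number of independent photographs.

```python
import itertools
import random

# Hypothetical sketch of "artificial entropy": each hand-made master mockup is
# copied many times with small mechanical changes (swap a colour, add or
# remove a box) rather than each entry being a genuinely new design.
COLOURS = ["green", "orange", "red"]

def permute(master: dict, n: int = 200) -> list[dict]:
    variants = []
    for _ in range(n):
        variant = dict(master)
        variant["button_colour"] = random.choice(COLOURS)
        variant["boxes"] = max(1, master["boxes"] + random.choice([-1, 0, 1]))
        variants.append(variant)
    return variants

masters = [{"id": i, "boxes": random.randint(1, 6), "button_colour": "green"}
           for i in range(1500)]
dataset = list(itertools.chain.from_iterable(permute(m) for m in masters))
print(len(dataset))  # 1,500 masters x 200 variants = 300,000 entries
```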
However, fixing those issues only scratches the surface; the deeper problem is that there is no understanding. For a human, or even an ordinary computer program, it is possible to write an instruction that says “use the same colour as the button”. It is a copy/paste job, requiring very little thinking. For a neural network, however, it is an almost herculean task. For example, if I trained a DNN on red and blue buttons, the moment I showed it a green button it would give me either a red one or a blue one, or maybe a purple one, but it would not know what to do with the colour green, let alone copy it over. (This, by the way, is why current best practice for DNNs is to have “comprehensive” training data.)
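For contrast, this is the whole of that “herculean task” when written as an ordinary program: read the colour from the input description, write it to the output, whatever value it happens to be. The data structures here are invented purely for the sake of the example.

```python
# Invented data structures for illustration: a parsed description of a button
# in the source mockup, and the spec being generated from it.
source_button = {"type": "button", "label": "OK", "colour": "#2ecc71"}

def generate_button(source: dict) -> dict:
    # "Use the same colour as the button" is a literal copy for a program.
    return {"type": "button", "label": source["label"], "colour": source["colour"]}

print(generate_button(source_button))
# A DNN trained only on red and blue buttons, by contrast, has no way to emit
# a colour value it never associated with any output during training.
```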
So, back to the headline: while that 77% number is impressive, it comes with several caveats. First, it is from their own test data, and I am sure that if I fed in any other sort of UI mockup it would choke immediately. Second, the current methodology is going to make it very difficult to reach more than 90-95% accuracy, as it more or less requires them to train the DNN on practically every type of UI one would want to “create” with it.