OpenAI vs. Advent of Code

This year, I decided to participate in Advent of Code, a set of 25 little coding puzzles for the Christmas season. The first couple of puzzles are fairly easy and it took me somewhere around 10-15 minutes to solve both parts. I’m not really competitively minded here, so this is purely for fun, but looking at the leaderboard, some people seem to take this rather seriously.

On day 3, however, I noticed that #1 on the leaderboard had solved the first puzzle part in TEN SECONDS. I mean, HOW? You can barely read the puzzle description in 10 seconds, let alone code a solution…

Turns out that guy had a bit of help, in the form of the OpenAI DaVinci text generation model. More details are in this GitHub repo.

I have to admit I’m impressed: of all the “AI” stuff I’ve seen so far, this is probably the closest thing to AGI (that elusive “Artificial General Intelligence”, i.e. human-level understanding of a problem).

Until now, I’ve mostly dismissed generative models like this as “stochastic parrots” - with a hat-tip to Emily M. Bender and Timnit Gebru for this wonderful expression.

So I decided to give this a try myself, got an OpenAI account and gave DaVinci the prompt: “Write a Python program that solves the following puzzle: …”, with the AoC puzzle description for day 3 appended. I did this five times (with some slight random parameter tweaking in between - not very scientific, I know) and saved the resulting Python code.

The results are quite interesting - for reference, the correct answer with the example data would be 157.

$ for file in davinci-aoc22-* ; do echo -n "$file " ; python3 $file ; done
davinci-aoc22-1.py 936
davinci-aoc22-2.py 233
davinci-aoc22-3.py 0
davinci-aoc22-4.py 157
davinci-aoc22-5.py 233

First of all, it’s already impressive that DaVinci generates completely valid and runnable Python code. OTOH, it’s probably had billions of lines of Python code in its training set, and if you look at GitHub Copilot, that’s more or less par for the course already.

Looking at the puzzle itself, there are roughly five distinct steps you need to complete to arrive at the solution:

  • Parse the input into a list of strings and loop over them
  • Split each item into two equal halves
  • Find the letter that is contained in both parts
  • Calculate the priority/score for that letter
  • Sum up all the scores and print the result
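For comparison, here is a minimal sketch of those five steps in Python. It is not any of the DaVinci outputs, just one straightforward way to write it. Two things are assumptions, since neither is spelled out in this post: the six example lines are the puzzle's sample input as I recall it, and the scoring rule is a-z → 1..26, A-Z → 27..52. The names EXAMPLE, priority and solve are made up for illustration.

```python
# Sample input recalled from the puzzle page (an assumption, not quoted in this post).
EXAMPLE = """\
vJrwpWtwJgWrhcsFMMfFFhFp
jqHRNqRjqzjGDLGLrsFMfFZSrLrFZsSL
PmmdzqPrVvPwwTWBwg
wMqvLMZHhHMvwLHjbvcjnnSBnvTQFn
ttgJtRGJQctTZtZT
CrZsJsPPZsGzwwsLwLmpwMDw
"""


def priority(letter: str) -> int:
    """Score a letter. Assumed rule: a-z -> 1..26, A-Z -> 27..52."""
    if letter.islower():
        return ord(letter) - ord("a") + 1
    return ord(letter) - ord("A") + 27


def solve(text: str) -> int:
    total = 0
    # Step 1: parse the input into a list of strings and loop over them.
    for rucksack in text.split():
        # Step 2: split each item into two equal halves.
        half = len(rucksack) // 2
        first, second = rucksack[:half], rucksack[half:]
        # Step 3: find the letter contained in both halves
        # (the unpacking fails loudly if there is not exactly one).
        (common,) = set(first) & set(second)
        # Step 4: calculate the priority/score for that letter.
        total += priority(common)
    # Step 5: return the sum of all scores.
    return total


print(solve(EXAMPLE))  # prints 157 for the sample input above
```

Each of the five steps is a line or two, which makes it easy to see where a generated program can go wrong independently: a bad split, a wrong way of matching the halves, or an off-by-something in the scoring.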

And if I compare this with the results, then the level of text analysis is pretty crazy - DaVinci manages to parse the example data into a list called rucksacks, and output a single integer, in every single run. It also gets the split into first and second half correct in four out of five runs. It figures out how to correlate first and second half in three out of five, and gets the score calculation correct in two out of five. In summary, it does everything right in one out of five attempts.

However, to me, this still sounds a lot like the previous “stochastic parrots” after all. It tosses bits and pieces of the solution around, and connects them together, until one set hits on the actual solution by chance. If this were an actual problem where you don’t know the solution up front, then which one would be correct? Don’t get me wrong, it’s still hugely impressive how close this gets to human-level performance (or actually far beyond human-level if you factor in speed), but I don’t think there’s any understanding happening - this is still randomly picking examples out of its corpus based on individual parts of the input.
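To make the "which one is correct?" question concrete: one obvious way to pick a winner without knowing the answer would be a majority vote across the runs. The sketch below applies that to the five outputs listed earlier. The voting idea is only an illustration (it was not part of the experiment), and the outputs dictionary simply restates the shell results.

```python
from collections import Counter

# Outputs of the five generated programs on the example data, as listed above.
outputs = {
    "davinci-aoc22-1.py": 936,
    "davinci-aoc22-2.py": 233,
    "davinci-aoc22-3.py": 0,
    "davinci-aoc22-4.py": 157,
    "davinci-aoc22-5.py": 233,
}

# Count how often each answer occurs and take the most frequent one.
tally = Counter(outputs.values())
answer, votes = tally.most_common(1)[0]

print(f"majority answer: {answer} ({votes} of {len(outputs)} runs)")
# prints: majority answer: 233 (2 of 5 runs)
```

With these five runs the vote lands on 233, not 157, so agreement between runs would not have pointed to the right program here.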

Disclaimer: this is a very shallow analysis, and I acknowledge that I might be somewhat biased, because the thought of this network actually understanding the problem does somehow make me a bit uneasy. Perhaps I’m turning into a crusty old luddite after all. 🤷

P.S. I was reminded that OpenAI was actually co-founded by Elon Musk (among others), with early financing from backers including Peter Thiel. With specifically those two guys involved, it sounds like something you’d want to steer clear of, TBH.