Retrospective: 12 Months Since MIRI

Not written with the intention of being useful to any particular audience, just collecting my thoughts on this past year’s work.

September-December 2023: Orienting and Threat Modeling

Until September, I was contracting full time for a project at MIRI. When the MIRI project ended, I felt very confused about lots of things in AI safety. I didn’t know what sort of research would be useful for making AI safe, and I also didn’t really know what my cruxes were for resolving that confusion. When people asked what I was thinking about, I usually responded with some variation of the following:

MIRI traditionally says that even though their research is hard and terrible, everyone else’s research is not at all on track to have a shot at making safe AI, and so MIRI research is still the best option available.

After working with MIRI for a year, I certainly agree that their research is hard and terrible. I have yet to be convinced that everyone else’s research is worse. I’m working on figuring that out.

So I set out to investigate why MIRI thinks their research is the only kind of thing that has a shot, and whether they’re right about that. My plan? Read a bunch of stuff, think about it, write down my thoughts, talk to lots of people about it, try to get a foothold on the vague mass of confusion in my head. To my credit, I did anticipate that this could eat up infinite time and effort if I let it, so I decided to call a hard “stop and re-evaluate” at the end of September. To my discredit, when the end-of-month deadline came I took a moment to reflect on what I’d learned and realized I still didn’t know what to do next. I thought that all I needed was a little more thinking bro and all the pieces would snap into place. Just a little more thinking bro. No need to set a next stop-and-re-evaluate deadline, it’ll just be a little more.

This dragged out into two more months of ruminating, mapping out cruxes, trying out different operationalizations, distilling what I’d learned, doing dialogues with myself, all while my total productivity slowly ground to a halt. I didn’t want to talk to people anymore, because I was upset and embarrassed about how little progress I had made. I started to dread sitting down at my desk to work. I knew I was stuck, but every time I resolved to abandon the effort and just make a decision, I found that I didn’t know what decision to make.

Ultimately, the combination of things that pulled me out of the swamp was something like: if you can’t come to a decision about what sort of skills to build based on the kind of research you eventually want to do, just take whatever skills are on offer. I shifted most of my energy into applying for stuff, any stuff, oriented around skilling up on empirical work, since that seemed to be where most of the mentorship and opportunities were anyway. That is what eventually brought this part of my year to an end.

Obviously this period could have been better planned and better executed. But it’s not like nothing came out of it! I never did write up a public summary of my work during this period, but I have a post in the works summarizing my current AI worldview, and whatever beliefs formed during this period that have survived to the present day will make it into that post. There is one thing I’ll note here, because it’s more of a meta/psychological thing than an object-level worldview thing:

When you can spare the effort, it’s good practice not to dismiss a view unless you can pass its ITT (Ideological Turing Test) first. When a viewpoint seems plausible, makes claims that are important if they’re true, is prominent in your intellectual community, and has support from people you respect, it seems especially important to be able to pass the ITT and feel like you understand the view before saying it’s wrong. But there must be some limit to the amount of effort you’ll put in before giving up. It must at least be an imaginable outcome that you say “well, if I can’t make sense of it given the effort I’ve put in, this must be either nonsense or intentionally obfuscated, and I’m going to do something else.” Otherwise the only possible outcomes are that you accept it, or you investigate it forever in the name of charity and good practice.

January 2024: ARENA 3.0

On January 6, I flew to London for ARENA 3.0 to get better at ML engineering!

ARENA was an excellent environment for my productivity and mental health. I easily worked 10+ hour days, back to back, and had a lot of fun doing it. I had great conversations with other ARENA participants and people around the LISA office.

Sometimes I would talk through the half-baked research ideas I did have, and try to explain the motivating intuitions for them. Sometimes mid-conversation I would find a new distillation, or a new angle to explain things, that made sense. Sometimes, during or after these conversations, I would have ideas that felt useful or compelling. And over the course of the first half of January, the ideas coalesced that would eventually become Evaluating Stability of Unreflective Alignment.

Eventually, I started getting antsy about spending too much time cooking up ideas and not enough writing code or touching reality. Mikita Balesni, an evals researcher at Apollo based in the LISA office, offered to do a weekend hack session with me to help get an MVP up and running. We did that, the MVP proved the general concept for the experiments I wanted to do and pointed out some obvious next steps, and suddenly I had a tractable research project!

It’s hard to overstate how much of a difference this one month at ARENA made. I came in having basically given up on coming to a coherent inside view on alignment threat models and research prioritization, at least anytime soon, and focused on getting better at engineering instead. And I did get better at engineering; I highly recommend ARENA for that purpose! But the great intellectual environment at LISA and the relief of pressure from shifting out of “I have to figure this out ASAP” to “let me try and explain my huge pile of disorganized insights to this crowd of new people” ended up finally letting me coalesce my thoughts. Not only did I have a clearer picture of alignment research prioritization in general, I had a crux for whether I should do MIRI-style or non-MIRI-style research, a method to evaluate that crux empirically, and a foothold on actually running those experiments!

Needless to say, I wanted to run with this momentum. I reached out on various Slacks and Discords looking for people to collaborate on this project with me. One person who saw my messages, Alana Xiang, told me to make it a SPAR project. And I did!

February-May 2024: Spring SPAR

SPAR provided me with a batch of applicants; I interviewed some of them, made offers to some of those, and some accepted. This left me with a team of four surprisingly qualified and skilled mentees. We set out to follow up on some of the research directions I had staked out while at ARENA.

I won’t talk about the object level of what we did or what we found – that’s for a separate post. Instead, I’ll talk about the meta stuff: what SPAR was like, how it went, etc.

  • The SPAR application process felt great. I spent a few hours putting together a project plan and an application, applicants showed up in my AirTable, I sent some of them emails and booked interviews, I sent some of the interviewees acceptance offers, and bada bing bada boom I had four mentees that all seemed quite competent.
  • Once we started, I was immediately smacked full in the face by the fact that managing a research project is actually hard and I am not very good at it (yet).
    • I wanted to get full mentee utilization, but I also wanted to produce results sequentially rather than working on too much stuff in parallel, to avoid reaching the end of SPAR with three experiments 80% complete and no usable end products. This basically didn’t work, and I just got terrible mentee utilization until I gave up and parallelized. I think my original approach was especially bad given that the mentees were relatively low time commitment and low contact with each other.
    • After parallelizing, my job was basically to periodically give mentees direction and think through stuff like experimental design choices. Unfortunately, this actually ended up being the bottleneck somewhat often. I think this happened for two reasons:
      • I flinched away from thinking about aversive questions. Two particularly important kinds were questions that made me worry a large chunk of our work might not actually be useful and would have to be abandoned, and questions that seemed to require detailed, laborious planning. This meant proceeding with work premised on broken assumptions, or without a detailed plan, for far too long.
      • As a result of my “just a little more thinking bro” phase, I had developed a bit of an allergy to things that felt like “just sit down and think about X”. I wanted to just do empirical work, write code, run experiments, stay in contact with reality, and avoid anything that felt too much like intractable directionless thinking. This was unfortunate, because a lot of my role was thinking about stuff. On the one hand, I think this instinct has a lot of truth to it, and I’m still very wary of saying things like “I’m just going to sit down and think about X”. But on the other hand: 1) reality doesn’t look at how much thinking you’ve done recently when setting the correct amount of thinking to do next, 2) staying in contact with reality doesn’t happen via just beaming data into your eyeballs, you have to actually process it, 3) the way you turn tractionless “think about X” type problems into actual decisions is by thinking about what more specific question you need to answer and how to get traction on that.
  • The next big thing was that time can slip by really fast if you let it. When you’re in the middle of a project, it’s easy to spend week after week tweaking the same code and re-running variations on the same experiments, without realizing just how little progress is actually being made.
  • Man, prompt sensitivity sucks. So much engineering effort in this project was sucked up by dealing with prompt sensitivity and validating different measurements (see the sketch after this list). The intended product was of the form “here’s a number that reflects a high-level fact about the LLM’s planning behavior”, but prompt sensitivity makes that kind of number hard to trust. Existence proofs and lower bounds are way easier.
  • Probably the only negative thing I have to note about SPAR is that the structure makes it harder to pivot in a few ways. The most obvious way is the commitment to a three month project plan and deliverable deadline. At least for me, I felt some pressure to make sure that all my mentees’ work was included in the final write-up, and not make any pivots that would require throwing out a mentee’s previous work. My guess is I could’ve done slightly better research if I felt more free to pivot, but I think these effects are probably more feature than bug. SPAR needs to be able to assess mentor project plans, avoid bad mentee experiences, and produce a track record of project outputs. This slight loss of research flexibility is probably a cost that’s correct to pay, but worth knowing about if you’re considering running a SPAR project.
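
For concreteness, here’s a minimal sketch of the shape of that validation problem. This is not the actual project code: the paraphrases, the query_model callable, and the scoring function are all made-up placeholders. The idea is just to compute the same headline number under several prompt paraphrases and check how much it moves before trusting it.

```python
# Hypothetical sketch: measure how much a scalar "planning" metric moves across
# prompt paraphrases. All names here are illustrative placeholders, not real project code.
import statistics

PARAPHRASES = [
    "Before answering, describe your plan for solving this task: {task}",
    "What steps will you take to solve the following task? {task}",
    "Outline your approach to this task before doing anything else: {task}",
]

def score_response(response: str) -> float:
    # Placeholder metric: fraction of lines that look like explicit plan steps.
    lines = response.splitlines() or [""]
    plan_lines = [line for line in lines if line.strip().startswith("-")]
    return len(plan_lines) / len(lines)

def measure_planning_metric(task: str, query_model) -> dict:
    """Compute the same metric under each paraphrase; report mean and spread.

    `query_model` stands in for whatever function sends a prompt to the LLM
    and returns its text response.
    """
    scores = [score_response(query_model(p.format(task=task))) for p in PARAPHRASES]
    return {
        "mean": statistics.mean(scores),
        # A large spread is a warning sign that the headline number is mostly
        # measuring prompt phrasing, not the model's planning behavior.
        "spread": max(scores) - min(scores),
    }
```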

June-September 2024: Summer SPAR

  • Another great set of mentees!
  • The main thing to note from this project meta-wise is that I think my project management was significantly better. I think past me would’ve been surprised by the amount of improvement over the course of a single SPAR project!
    • I set each mentee working on an independent thread from the outset.
    • I think I did less flinching away from thinking about things. Possibly this was because less flinch-worthy stuff came up, but I think I deserve some of the credit for that, via a clearer and more shovel-ready project plan.
    • I think I did a better job keeping my eye on the next milestone during the middle of the project.

September 2024 – ???: CMU SEI AISIRT

At some point during Summer SPAR, I got a job! I’m now an AI cybersecurity researcher at the Software Engineering Institute, an FFRDC based at Carnegie Mellon University and funded primarily by the DoD. Specifically, I’m on their Artificial Intelligence Security Incident Response Team, where we help the US government prepare for all things Artificial Intelligence Security Incident.

I’ll be here for the foreseeable future, since SEI seems like a great place to skill up, and because AISIRT is a small team doing a bunch of work I think might be very important in the near future.

This is, tragically, another job where I’m under infosec constraints that keep me from talking about basically anything to do with my actual work. But if you, dear reader, have takes about how the US government should prepare for the possibility of superhuman AGI within a decade, and would like to discuss them in a very general threat-modeling conversation that leaks zero bits about my work in my official capacity as an SEI employee, hit me up!

How Was This Year Overall?

Looking back on this year (and kind of my trajectory working on AI Safety in general), it seems a lot messier than I would like. I made mistakes that should have been foreseeable ex ante. My thinking was often murky or clearly influenced by short-term emotional variation. I often flinched away from thinking about aversive problems until they were right in my face. A lot of the things which went well weren’t things I’d particularly anticipated, and involved more luck and coincidence than I would like.

Some of this bumbling is surely just what it’s like to try getting things done in a big, complicated world. And I’m actually pretty happy with the outcome! But man, there’s room for me to do so much better.