On Contact, Part 1

Context: for fun (and profit?) Basic Contact: Contact is a lightweight many-versus-one word-guessing game. I was first introduced to it on a long bus ride several years ago, and since then it's become one of my favorite games to play casually with friends. There are a few blog posts out there about Contact, but I think it's incredibly underrated. The rules of Contact are simple, but I often tell…

Evaluating Stability of Unreflective Alignment

This post has an accompanying SPAR project! Apply here if you're interested in working on this with me. Huge thanks to Mikita Balesni for helping me implement the MVP. Regular-sized thanks to Aryan Bhatt, Rudolph Laine, Clem von Stengel, Aaron Scher, Jeremy Gillen, Peter Barnett, Stephen Casper, and David Manheim for helpful comments. 0. Key Claims: Most alignment work today doesn’t aim for alignment that is stable under value-reflection. I…

Research Retrospective, Summer 2022

Context: I keep wanting one place to refer to the research I did Summer 2022, and the two LessWrong links are kind of big and clunky. So here we go! Figured I'd add some brief commentary while I'm at it, mostly just so this isn't a totally empty linkpost. Summer 2022 I did AI Alignment research at MIRI under Evan Hubinger's mentorship. It was a lot like SERI MATS, but…

Unfinished Thoughts on ELK

Epistemic Status: posting for mostly internal reasons - to get something published even if I don't have a complete proposal yet, and to see if anything new crops up while summarizing my thoughts so far. For context, ELK is a conceptual AI safety research competition by ARC, more info here. In this post I will document some ideas I've considered, showing the general thought process, strategy, obstacles, and current state…