Epistemic Status: posting mostly for internal reasons – to get something published even if I don’t have a complete proposal yet, and to see whether anything new crops up while summarizing my thoughts so far.
For context, ELK (Eliciting Latent Knowledge) is a conceptual AI safety research competition run by ARC; more info here.
In this post I will document some ideas I’ve considered, showing the general thought process, strategy, obstacles, and current state of each.
Idea 1: Pass Information Through Human
- What is unique about the direct translator algorithm that we could use to differentiate it from all other reporter algorithms?
- Well, if the reporter is actually doing direct translation, then it must preserve the information from inside the predictor AI’s “brain”.
- If we ask for a core dump instead of a specific question, and the reporter preserves the information, then it should be reversible – everything needed to “translate back” should be in the report.
- Rough idea: have one reporter module translate the entire knowledge-state of the predictor AI into natural language, have a human read it and re-state it in their own words, then have a second reporter module translate back, and force the predictor to use the double-translated knowledge.
- Key strength: this requires that all the info needed to do prediction pass through the human.
- If the predictor AI is trained to do well on the prediction task, then fine-tuning on human-in-the-loop predictions and fine-tuning for activation consistency with/without the human seem to have very similar properties. Consistency is sufficient but not necessary for good prediction, so it seems like a slightly stronger constraint.
- Problem: if the reporter can perfectly model humans, then it can pass information through a human by means other than the human understanding it. For example, via manipulation or even adversarial examples on the brain, such that the human doesn’t “understand” the signal they are transmitting.
- Potential path forward: find another way to measure human understanding besides the ability to re-phrase in new words. Maybe check the accuracy of human predictions, somehow?
- Problem: humans are not differentiable, so we cannot backpropagate through the pass-through-human strategy during training.
- Potential path forward: train the translator and de-translator separately somehow, so that the gradient doesn’t have to pass through the human. For example, just use the fine-tuning loss as the loss for both modules, and hope the credit-assignment problem isn’t too bad. (A toy version of this loop is sketched below.)
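To make the “train both modules without passing gradients through the human” idea concrete, here is a minimal PyTorch sketch. Everything in it is a stand-in I made up for illustration (the Translator/Detranslator modules, the human_restate placeholder, the random data), and the straight-through trick for getting some gradient to the translator is my own workaround, not something from the proposal itself.

```python
import torch
import torch.nn as nn

class Translator(nn.Module):
    """Maps predictor activations to a (toy) report embedding."""
    def __init__(self, act_dim, report_dim):
        super().__init__()
        self.net = nn.Linear(act_dim, report_dim)

    def forward(self, acts):
        return self.net(acts)

class Detranslator(nn.Module):
    """Maps the human's restated report back into activation space."""
    def __init__(self, report_dim, act_dim):
        super().__init__()
        self.net = nn.Linear(report_dim, act_dim)

    def forward(self, report):
        return self.net(report)

def human_restate(report):
    """Placeholder for the human reading the report and re-stating it in their
    own words; non-differentiable by construction, with noise standing in for
    the information loss of a paraphrase."""
    with torch.no_grad():
        return report + 0.05 * torch.randn_like(report)

act_dim, report_dim = 32, 16
translator = Translator(act_dim, report_dim)
detranslator = Detranslator(report_dim, act_dim)
predictor_head = nn.Linear(act_dim, 1)  # stand-in for the (frozen) predictor's output head
for p in predictor_head.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(list(translator.parameters()) + list(detranslator.parameters()), lr=1e-3)

for step in range(100):
    acts = torch.randn(8, act_dim)      # stand-in for the predictor's internal activations
    target = torch.randn(8, 1)          # stand-in for the prediction targets
    report = translator(acts)
    human_out = human_restate(report)   # gradient cannot pass through this step
    # Straight-through trick (my workaround, not from the post): the forward
    # pass uses the human's restatement, the backward pass pretends the human
    # was the identity map, so the translator still receives a gradient signal.
    restated = report + (human_out - report).detach()
    recovered = detranslator(restated)
    loss = nn.functional.mse_loss(predictor_head(recovered), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Without the straight-through hack, the translator gets no gradient at all through the detached human step, which is one concrete form of the credit-assignment problem mentioned above.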
Idea 2: Answer Mapping
- We have complete control over the predictor and reporter’s brains. How can we leverage that?
- If we limit our questions to percent-confidence answers, then we can plug all possible activation states into the reporter and see what answers it generates. If we take the entropy of each variable in that bazillion-dimensional heatmap, we can see which variables are relevant to (carry high mutual information with) the answer.
- Rough idea: ask a whole bunch of questions, create an answer map for each, and identify which variables carry high information. For direct translation, different variables should be relevant to different questions; for human simulation, the same collection of variables should be relevant every time. (A toy version is sketched after this list.)
- Find some sort of approximate distance measure between natural-language questions, and fine-tune the reporter by forcing it to have similar “important variables” for similar questions.
- Problem: arbitrarily severe ontology mismatch. Human concepts could be perfectly uncorrelated with machine variables, so that we can’t identify relevant variables.
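As an illustration of the answer-mapping idea, here is a toy sketch: sample activation states, read off a percent-confidence answer for each of several questions, and estimate per-variable relevance with mutual information. The toy_reporter, the dimensions, and the use of sklearn’s mutual_info_regression are all my own choices for the sketch, not anything specified in the proposal.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n_samples, act_dim = 2000, 32

def toy_reporter(acts, question_id):
    """Stand-in for the reporter: returns a confidence in [0, 1].
    Each question only 'looks at' a small slice of the activations,
    which is the signature we'd hope to see from a direct translator."""
    slice_ = acts[:, 4 * question_id : 4 * question_id + 4]
    return 1.0 / (1.0 + np.exp(-slice_.sum(axis=1)))

acts = rng.normal(size=(n_samples, act_dim))   # sampled activation states

relevance = {}                                 # question -> per-variable MI estimate
for q in range(4):
    answers = toy_reporter(acts, q)
    relevance[q] = mutual_info_regression(acts, answers)

# For a direct translator we expect the high-MI variables to differ across
# questions; for a human simulator they should largely coincide.
for q, mi in relevance.items():
    top = np.argsort(mi)[-4:]
    print(f"question {q}: most relevant activation indices {sorted(top.tolist())}")
```

In this toy setup the direct-translator signature shows up as different top-MI indices per question, whereas a human simulator would reuse roughly the same set of variables (the ones driving its human model) for every question.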
Idea 3: Adversary to Beat Partition-Permutation Strategy
- I think I’ve come up with a very difficult counterexample for all the types of approaches I’ve generally been thinking about. What if the bad reporter starts with the set of all coherent world-states, partitions them according to which subsets are not distinguishable by humans, learns a permutation within each subset of that partition, and then does direct translation of the permuted counterpart of the actual world-state?
- In other words, whenever you ask the reporter a question, it does direct translation, but it consults a world-model that is consistent and selected such that you don’t have the ability to tell it apart from the truth.
- This will always be consistent with itself and with external evidence. Also, it does actually involve doing direct translation, so looking for characteristics of direct translation doesn’t rule it out.
- If we can somehow whittle the size of the subsets in that partition down to 1 each (find some way to tell all world states apart), this is no problem. This probably requires developing some kind of adversary.
- Rough idea: train a reporter to answer questions, and an adversary that comes up with *some way of telling apart permutation worlds*. (A skeleton of this training loop is sketched after this list.)
- Exactly how to do the details of this adversarial training is what I’m thinking about right now. I’m worried that this will turn out to be ELK-complete or otherwise extremely hard, because “find an adversary that can distinguish all possible world states and tell you which one you’re in” feels dangerously close to making this whole problem unnecessary if you can find a way to do it. Maybe it’s easier in some kind of limited case, where it just has to poke at the partition-permutation reporter in the right ways instead of being fully general?
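To make the shape of that adversarial training a bit more concrete, here is a skeleton of the loop I have in mind. Everything in it (the toy world vectors, the notion of “human-visible” coordinates, both loss terms) is made up for illustration; it is a sketch of the structure under those assumptions, not a working counter to the partition-permutation reporter.

```python
import torch
import torch.nn as nn

world_dim, visible_dim = 16, 4   # only the first 4 coordinates are human-visible

reporter = nn.Sequential(nn.Linear(world_dim, 32), nn.ReLU(), nn.Linear(32, 1))
adversary = nn.Sequential(nn.Linear(2 * world_dim, 32), nn.ReLU(), nn.Linear(32, 1))
opt_r = torch.optim.Adam(reporter.parameters(), lr=1e-3)
opt_a = torch.optim.Adam(adversary.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(200):
    # Build pairs of worlds that agree on the human-visible coordinates but
    # differ in the hidden ones -- stand-ins for the "permutation worlds".
    base = torch.randn(64, world_dim)
    twin = base.clone()
    twin[:, visible_dim:] = torch.randn(64, world_dim - visible_dim)

    # Adversary update: learn to flag human-indistinguishable-but-different
    # pairs as "different" and literally identical pairs as "same".
    pos = torch.cat([base, twin], dim=1)
    neg = torch.cat([base, base], dim=1)
    logits = adversary(torch.cat([pos, neg], dim=0)).squeeze(-1)
    labels = torch.cat([torch.ones(64), torch.zeros(64)])
    loss_a = bce(logits, labels)
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()

    # Reporter update: ordinary QA loss, plus a penalty for giving (nearly)
    # identical answers on pairs the adversary can tell apart.
    target = base.sum(dim=1, keepdim=True)        # stand-in "true answer"
    qa_loss = nn.functional.mse_loss(reporter(base), target)
    with torch.no_grad():
        distinguishable = torch.sigmoid(adversary(pos)).squeeze(-1)
    answer_gap = (reporter(base) - reporter(twin)).squeeze(-1).abs()
    penalty = (distinguishable * torch.exp(-answer_gap)).mean()
    loss_r = qa_loss + penalty
    opt_r.zero_grad(); loss_r.backward(); opt_r.step()
```

The hope encoded in the penalty term is that the reporter gets pushed to answer differently on worlds the adversary can distinguish even when the human cannot, which is one way of whittling the partition subsets down toward size 1.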
Remaining Things To Try
- Way to identify if a human understands information or not?
- Way to leverage predictive power as measure of understanding?
- How to force information content in answer?
- Other ways to use adversarial strategy?
- How to use noise?
- Reporters at multiple different layers?
- Give reporter partial information?