How to Tell if You’ve Instilled a False Belief in Your LLM

In the spirit of better late than never - this has been sitting in drafts for a couple months now. Big thanks to Aryan Bhatt for helpful input throughout. Thanks to Abhay Sheshadri for running a bunch of experiments on other models for me. Thanks to Rowan Wang, Stewy Slocum, Gabe Mukobi, Lauren Mangla, and assorted Claudes for feedback.

Summary

It would be useful to be able to make LLMs…