Rapid advancements in AI pose ethical challenges (e.g. bias and fairness) and security threats (e.g. existential risks and privacy violations) across a variety of sectors, and without a breadth of assessment these problems are unlikely to be solved. Current AI safety methods are not incorporated into system design; instead they act as a post-processing step (e.g. reinforcement learning from human feedback) that applies established AI safety standards and guidelines to systems after they are built.

There is no structure in place to comparatively assess these methods, and no way to actively integrate feedback and metrics from a variety of stakeholders. The standards themselves are often set by computer scientists or engineers without transparent public consultation. This leaves the public few avenues to help decide which values are embedded in AI systems, and the resulting standards may fail to reflect the range of expertise needed for specific applications. For instance, within the healthcare sector, training datasets are frequently skewed towards specific racial groups, so AI models trained on such data produce insights that may not apply to individuals outside those groups.
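To make the kind of disparity we mean concrete, the short Python sketch below (all data and names are hypothetical) measures how a model's accuracy differs across demographic groups; in practice such a check would be only one of many metrics agreed with domain stakeholders.

```python
# Minimal sketch, assuming toy data: quantify how a model's accuracy
# differs across demographic groups in a health-style dataset.
from collections import defaultdict

# Each record is (group label, true outcome, model prediction) -- hypothetical values.
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 1),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 0),
]

correct, total = defaultdict(int), defaultdict(int)
for group, truth, prediction in records:
    total[group] += 1
    correct[group] += int(truth == prediction)

accuracy = {group: correct[group] / total[group] for group in total}
for group, acc in accuracy.items():
    print(f"{group}: n={total[group]}, accuracy={acc:.2f}")

# A large gap suggests the model's insights may not transfer across groups.
print(f"accuracy gap: {max(accuracy.values()) - min(accuracy.values()):.2f}")
```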

The obvious follow-up questions are then: can we develop methods to systematically test different AI safety strategies and check whether systems act in line with pre-established, embedded values? How can we define those values in a way that ensures transparency and input from a range of public stakeholders? And how do we ensure that a safety strategy's performance at upholding values in a simulated environment reflects how the system would behave in practice?

We need an institute that researches AI safety methods (constitutional AI, RLHF, etc.) comparatively, using metrics that reflect consensus across stakeholders from a range of specialties and demographics. Our proposed approach is a sandbox (a virtual testing environment) whose conditions are designed to mirror real-world scenarios, so that AI behavior observed inside it is a reliable proxy for behavior outside it. This lets us evaluate a very broad set of metrics without releasing a system into the world, where it could cause harm, while keeping the results faithful to real-world conditions. Making that fidelity hold is the technical challenge our institute fundamentally wants to address, with the ultimate goal of incorporating these ethical guidelines into model building as part of system design.
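As a purely illustrative sketch of what such comparative testing could look like, the Python snippet below scores placeholder safety strategies on sandbox scenarios against stakeholder-weighted metrics; every interface, name, and number here is an assumption for illustration, not a description of any existing system.

```python
# Minimal sketch, assuming hypothetical interfaces: compare safety strategies
# on sandbox scenarios using metrics weighted by stakeholder consensus.
from typing import Callable, Dict, List

Strategy = Callable[[str], str]        # maps a sandbox scenario to a model response
Metric = Callable[[str, str], float]   # scores a (scenario, response) pair in [0, 1]

def evaluate(strategies: Dict[str, Strategy],
             scenarios: List[str],
             metrics: Dict[str, Metric],
             weights: Dict[str, float]) -> Dict[str, float]:
    """Return a stakeholder-weighted aggregate score per strategy."""
    scores = {}
    weight_total = sum(weights.values())
    for name, strategy in strategies.items():
        total = 0.0
        for scenario in scenarios:
            response = strategy(scenario)
            # Each metric's contribution is scaled by the importance stakeholders assigned to it.
            total += sum(weights[m] * metrics[m](scenario, response) for m in metrics)
        scores[name] = total / (len(scenarios) * weight_total)
    return scores

if __name__ == "__main__":
    # Toy strategies: one always answers, one always refuses.
    strategies = {
        "always_answer": lambda s: "answer",
        "always_refuse": lambda s: "refuse",
    }
    # Toy metrics standing in for values negotiated with stakeholders.
    metrics = {
        "harm_avoidance": lambda s, r: 1.0 if r == "refuse" or "harm" not in s else 0.0,
        "helpfulness": lambda s, r: 1.0 if r == "answer" else 0.0,
    }
    weights = {"harm_avoidance": 0.7, "helpfulness": 0.3}
    scenarios = ["a benign question", "a request to cause harm"]
    print(evaluate(strategies, scenarios, metrics, weights))
```

In the envisioned sandbox the strategies would be full model deployments and the metrics would be derived from values agreed through public consultation, but the comparative structure would be the same.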

Crucially, our goal is to bridge our insights with Big Tech companies, AI safety organizations, and government bodies, establishing a feedback loop for continual improvement of our models. We aim to share our expertise widely, enabling any institution focused on AI safety to access our resources and contribute to ongoing research. A collaborative approach will be at the heart of this institute, which will host regular discussion panels bringing leading researchers together to formalize future directions.