
Combined reflections by Jacob Nadelman and Vilom Oza
Cornell Tech
In his talk “Choices, Risks and Reward Reports: Charting Public Policy for Reinforcement Learning Systems,” Tom Gilbert (pictured above) considers a fundamental question: “what makes AI systems good?” As he regularly repeats, being good does not necessarily mean that these systems are “fair, transparent, or explainable.” AI systems may have those traits, but as a means to an end rather than as a goal in itself. Probing further, to describe a system as “good” we need a way of documenting these systems that states their goal and the criteria by which to judge them. With that in mind, Gilbert also asks “what does it mean to document systems in a way that makes it possible to deliberate about what we want them to do, how we want them to behave, and how do we make them into the kind of system … operating in a way we substantively want.” To answer that, Gilbert breaks the talk into three distinct pieces.
1. Reinforcement Learning (RL) and how it differs from static Machine Learning (ML)
Central to understanding how AI systems can be evaluated is a particular subdomain called reinforcement learning (RL). RL is an emerging subfield within AI/ML that differs from traditional approaches in a few key ways. In particular, traditional ML subareas are static: unsupervised learning summarizes unlabeled information, and supervised learning predicts values using labeled information. These settings are fixed and can be thought of as “one-off.”
Reinforcement Learning differs from traditional Machine Learning in that it solves “sequential decision-making problems” using whatever observation techniques the designer specifies for it. Put another way, reinforcement learning “navigates” a dynamic environment by selecting actions that maximize potential future rewards. Additionally, in RL the agent collects its own data. You set up an environment, define a reward, and the agent learns what actions bring it closer to the reward.
Consider an example that makes this concrete. In your home you set a thermostat and tell it how hot or cold you want your house to be. After taking a temperature measurement, the system uses a traditional ML algorithm to figure out what to do to bring the house back into the acceptable temperature range. This is called control feedback. In addition, the system will take a “trial and error” approach to understanding the environment, trying sequences of actions over time to see how much reward it can accumulate. This is behavioral feedback. For the thermostat example, Gilbert claims this would be akin to the thermostat learning how quickly to adjust the temperature so the room’s temperature remains steadier over time.
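To make the trial-and-error loop more concrete, here is a minimal sketch of a thermostat-style agent learning by reinforcement. The environment dynamics, reward, and learning rule (tabular Q-learning over coarse temperature states) are our own illustrative assumptions, not details from Gilbert’s talk: the agent observes a coarse temperature reading (control feedback) and, over many episodes of trial and error, learns which actions accumulate the most reward (behavioral feedback).

```python
import random

# Illustrative sketch only: the dynamics, reward, and hyperparameters are invented.
SETPOINT = 21.0                      # desired room temperature in Celsius
ACTIONS = ["heat", "off", "cool"]

def step(temp, action):
    """Apply an action, drift toward the outdoors, and return (new_temp, reward)."""
    effect = {"heat": +1.0, "off": 0.0, "cool": -1.0}[action]
    new_temp = temp + effect + random.uniform(-0.3, 0.3) - 0.05 * (temp - 15.0)
    reward = -abs(new_temp - SETPOINT)   # closer to the setpoint is better
    return new_temp, reward

def discretize(temp):
    """Coarse temperature reading: the state the agent actually observes."""
    return int(round(temp))

# Tabular Q-learning: the agent collects its own data by acting in the loop.
q = {}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for episode in range(2000):
    temp = random.uniform(15.0, 27.0)
    for _ in range(50):
        state = discretize(temp)
        q.setdefault(state, {a: 0.0 for a in ACTIONS})
        # Explore occasionally, otherwise exploit the current value estimates.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(q[state], key=q[state].get)
        temp, reward = step(temp, action)
        next_state = discretize(temp)
        q.setdefault(next_state, {a: 0.0 for a in ACTIONS})
        # Move the value of the chosen action toward reward plus discounted future value.
        q[state][action] += alpha * (
            reward + gamma * max(q[next_state].values()) - q[state][action]
        )

# The action the agent has come to prefer in each temperature band.
print({s: max(a, key=a.get) for s, a in sorted(q.items())})
```

Running the sketch prints the action the agent prefers at each coarse temperature, illustrating the key difference from static ML: the agent generates its own data through interaction rather than learning from a fixed labeled dataset.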

Finally, Gilbert introduces one more layer of complexity: exo-feedback. As the RL system adjusts temperature, the way other agents respond to the system changes. Perhaps our thermostat learns that it is far easier to maintain temperature if all members of the household are in one room. It could make one room pleasant and the rest far too cold, causing the family to all be in this one space. Under Gilbert’s reasoning, this amounts to a ‘shift’ in the actual dynamic of what it means to inhabit the house.

These shifts are fundamental to the overarching promise and problem of RL (and of the repeated use of traditional ML systems). The formulation and use of ML systems can reinforce or actually change dynamics over time, fundamentally altering what it means to interact in those environments. These changes can have quite pernicious effects on the user (think of how social media use has led users to focus on gathering likes rather than connecting with friends), and Gilbert next discusses strategies to deal with these risks.
2. Distinct Risks in the Formation of RL Systems
To mitigate these risks posed by AI systems, we first need to understand how they are designed. Gilbert focuses on four different design choices in particular.
1. Scoping the horizon
Determining the timescale and space of an agent’s environment has an enormous impact on behavior. For self-driving cars, for example, one design choice is how far ahead the car can see, i.e. what can be observed about the road. (A toy illustration of how the horizon shapes behavior follows after this list.)
2. Defining rewards
It is not straightforward to define rewards in any complex environment. As a result, designers have to define proxies that push the agent toward behaviors that maximize the required output. However, in the real world, this can often result in unexpected and exploitative behavior, commonly known as “reward hacking.”

3. Pruning information
A common practice in RL research is to change the environment to fit your needs. In the real world, modifying the environment means changing the information flow from the environment to your agent. Doing so can dramatically change what the reward function means for your agent and offload the risk onto external systems.

4. Training multiple agents
Multi-agent training is one of the most rapidly growing and talked-about subdomains; however, very little is known about how the learning systems will interact. For example, driving a self-driving car on a road predominantly occupied by human drivers might be very different from driving on a road where the majority of cars are self-driven.
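The first of these choices, scoping the horizon, can be made concrete with a small, self-contained example. The two “actions” and their payoffs below are invented purely for illustration; the point is that the horizon a designer picks determines which behavior the agent ends up considering optimal.

```python
# Toy illustration (not from the talk): how the chosen horizon changes the
# "best" action. Reward numbers are invented for the sketch.

def total_reward(action: str, horizon: int) -> float:
    """Cumulative reward over `horizon` steps for two stylized policies."""
    if action == "grab_now":
        return 1.0 * horizon                  # 1 unit of reward every step
    else:  # "invest"
        return 3.0 * max(0, horizon - 3)      # nothing for 3 steps, then 3 per step

for horizon in (2, 5, 10):
    best = max(["grab_now", "invest"], key=lambda a: total_reward(a, horizon))
    print(f"horizon={horizon:2d}: best action is {best}")
```

With a horizon of two steps the immediately rewarding action wins; with five or more steps the delayed payoff dominates. A seemingly innocuous scoping decision thus changes what the system learns to do.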
3. Reward Reports: Proposing a New Kind of AI Documentation
Each of the above key design choices comes with its own set of risks, and listing the pitfalls of just one can be quite complicated. When it comes to social media, reward hacking is a major design risk: a user does not join social media with the intention of being engaged, but rather to connect with friends or to network. Platforms, however, optimize for engagement as a measurable proxy for that value, and it is common to see this optimization lead to declining user wellbeing and increasing public distrust. Hence there is an increasing need to understand the dynamics of the system, what is at stake, and what it means to optimize it. Given that these dynamics are widespread, it would be useful to have a summary report that can be applied to any system.
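A toy simulation can show how this dynamic plays out. The content types, engagement numbers, and wellbeing effects below are invented assumptions for the sketch, not data from the talk: a recommender is rewarded only for a measurable proxy (time on the app) and, without ever observing the user’s wellbeing, drifts toward whatever content maximizes that proxy.

```python
import random

# Toy proxy-optimization sketch; all values are invented for illustration.
CONTENT = {
    # content type: (expected minutes of engagement, effect on wellbeing)
    "friends_updates": (3.0, +1.0),
    "neutral_news":    (4.0, 0.0),
    "outrage_bait":    (9.0, -2.0),
}

values = {c: 0.0 for c in CONTENT}   # recommender's estimate of proxy reward
counts = {c: 0 for c in CONTENT}
wellbeing = 0.0

for t in range(5000):
    # Epsilon-greedy bandit over content types, optimizing engagement only.
    if random.random() < 0.1:
        choice = random.choice(list(CONTENT))
    else:
        choice = max(values, key=values.get)
    minutes, delta_wellbeing = CONTENT[choice]
    engagement = random.gauss(minutes, 1.0)   # observed proxy reward
    wellbeing += delta_wellbeing              # true objective, never observed
    counts[choice] += 1
    values[choice] += (engagement - values[choice]) / counts[choice]

print("Recommendations served:", counts)
print("Cumulative (unobserved) wellbeing:", round(wellbeing, 1))
```

After enough rounds the recommender overwhelmingly serves the engagement-maximizing content while the unmeasured wellbeing tally declines: the proxy is optimized, the actual goal is not.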
To achieve this goal, Gilbert proposes a Reward Report. The forms of documentation that exist today only explain how accurate a model is, how (and by whom) the data was collected, or what sets of features the model operates on. They do not take into account the dynamic nature of models or the ex ante assumptions that can drastically alter their real-world effects. Consequently, reward reports would include domain assumptions, the horizon, and the specification components. In addition, they would need to be regularly updated, keeping up with the dynamic nature of these models to ensure compliance with regulation and the designer's intentions. Gilbert contends that these changes will enable better monitoring as these systems become more widespread and further allow for administrative or legal oversight if needed.
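As a rough illustration of what such documentation might capture, here is a minimal sketch of a Reward Report as a data structure. The field names and structure are our own guesses based on the components Gilbert names (domain assumptions, the horizon, the reward specification, and regular updates); they are not a published schema.

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative sketch of a Reward Report; fields are assumptions, not a standard.
@dataclass
class RewardReport:
    system_name: str
    domain_assumptions: list[str]        # what designers take for granted about the deployment domain
    horizon: str                         # the timescale and scope the agent optimizes over
    reward_specification: str            # the reward or proxy actually being optimized
    observed_feedback: list[str] = field(default_factory=list)      # control, behavioral, exo-feedback seen in deployment
    revisions: list[tuple[date, str]] = field(default_factory=list)  # the report is meant to be updated over time

    def update(self, note: str) -> None:
        """Append a dated revision so oversight bodies can track how the system changed."""
        self.revisions.append((date.today(), note))


report = RewardReport(
    system_name="smart thermostat",
    domain_assumptions=["household members can move freely between rooms"],
    horizon="minutes to hours, single household",
    reward_specification="negative absolute deviation from the setpoint",
)
report.update("observed exo-feedback: occupants congregating in the one heated room")
```

The `update` method reflects the part Gilbert stresses: the report is a living document, revised as the deployed system’s dynamics shift, so that regulators and designers see the system as it currently behaves rather than as it was first specified.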
Conclusion
Gilbert’s talk proposes some provocative ideas surrounding the implications of AI, how to understand these implications through Reinforcement Learning, and a mitigation strategy to address the risks. Should Reward Reports be adopted, legislators and oversight groups would have significantly more information and control to make informed decisions regarding the success or failure of an AI system in achieving its goal. It is only with this level of vetting that we can say with any certainty that an AI system is ‘good.’