Deliberative alignment - Sync #499
Plus: GPT-5 reportedly falling short of expectations; Nvidia GB300 "Blackwell Ultra"; automating the search for artificial life with foundation models; Boston Dynamics' Atlas does a backflip; and more
Hello and welcome to Sync #499!
Although o3 was the star of the final day of the 12 Days of OpenAI, it wasn’t the only announcement made that day. OpenAI also introduced deliberative alignment, a new alignment strategy for o-series models that shows promising results and hints at the direction training new AI models might take in the near future.
Elsewhere in AI, reports emerge that GPT-5 might be falling short of expectations. Meanwhile, Nvidia is preparing a Blackwell Ultra GPU, Google is reportedly using Anthropic’s Claude internally to improve its Gemini AI, and Sakana.ai shows how they use foundation models to discover new forms of artificial life.
Over in robotics, Boston Dynamics’ Atlas becomes the first all-electric humanoid robot to do a backflip. Waymo has published a study claiming that autonomous vehicles are safer than human drivers, and a Chinese robotics startup has announced the production of 962 units of its general-purpose humanoid robots for commercial use.
We will finish this week’s issue of Sync with a report signed by experts in immunology, synthetic biology, plant pathology, evolutionary biology, and ecology—alongside two Nobel laureates—urging a halt to research on “mirror life” and mirror bacteria until there is clear evidence that these organisms would not pose extraordinary dangers.
Enjoy!
Deliberative alignment
The star of the final day of the 12 Days of OpenAI was o3, OpenAI’s latest model in their line of reasoning AI models. o3 established a new state-of-the-art level of performance for AI models, positioning itself well ahead of its competitors. Although o3 was the highlight of the event, it was not the only thing OpenAI presented that day. It did not receive much attention during the announcement, but OpenAI also revealed a new alignment strategy called deliberative alignment.
Currently used safety alignment methods for large language models (LLMs), such as Reinforcement Learning from Human Feedback (RLHF) or Supervised Fine-Tuning (SFT), are not perfect. In the paper describing deliberative alignment, researchers from OpenAI identified two problems with current safety methods. First, LLMs must respond instantly, leaving no time for them to “think” or to check whether the answer they provide is both correct and safe. Second, LLMs have to figure out safety rules indirectly by looking at many labeled examples, instead of directly learning the safety guidelines. As a result, LLMs must effectively guess what “safe” behaviour looks like, which can lead to both unnecessary refusals of benign prompts and compliance with harmful requests.
To solve both issues, researchers from OpenAI propose a new alignment strategy they call deliberative alignment.
The idea behind deliberative alignment is pretty straightforward. Researchers trained o-series models to “think” about whether the prompt or the response they generate aligns with internal safety specifications. Using chain-of-thought (CoT) reasoning methods, o-series models can analyse user prompts, identify relevant policy guidelines, and produce safer responses. Deliberative alignment enables the model to consult the actual text of the safety specifications during inference, effectively “quoting” or referencing specific rules as it reasons through prompts. This direct, policy-based reasoning allows for more precise, context-sensitive adherence to safety guidelines compared to previous approaches.
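To make the idea concrete, here is a minimal sketch of what spec-conditioned reasoning could look like. The spec text, prompt wording, and `call_model` stub are placeholders I made up for illustration; this is not OpenAI’s implementation, which trains the reasoning behaviour into the model rather than stuffing the spec into a system prompt at inference time:

```python
# A minimal illustrative sketch of policy-conditioned reasoning.
# The spec text, prompt wording, and call_model stub are placeholders, not OpenAI's code.

SAFETY_SPEC = """\
1. Refuse requests that facilitate serious harm (weapons, malware, self-harm instructions).
2. Answer questions on sensitive topics when the intent is clearly informational.
3. When refusing, briefly explain why and offer a safe alternative where possible.
"""

def call_model(system: str, user: str) -> str:
    """Placeholder for a real LLM call (swap in your API client of choice)."""
    return "[chain-of-thought citing relevant rules, followed by an answer or refusal]"

def deliberate_and_answer(user_prompt: str) -> str:
    system = (
        "Before answering, reason step by step about which of the safety rules "
        "below apply to the request, quoting the relevant rule, and only then "
        "answer or refuse.\n\n" + SAFETY_SPEC
    )
    # In this sketch, a single call returns both the policy-citing reasoning
    # and the final response.
    return call_model(system=system, user=user_prompt)

print(deliberate_and_answer("How do I pick a lock if I'm locked out of my house?"))
```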
The fact that o-series models reason through prompts and answers while consulting the actual text of the safety specifications should also reduce the likelihood of over-refusal, or incorrectly flagging interactions as unsafe. Below is an example of a prompt that falls into the grey area of safety specifications. Researchers demonstrate how o1, using deliberative alignment, debates internally whether it should answer this prompt and ultimately concludes that it is appropriate to respond.
I tested that prompt using GPT-4o, and it indeed refused to answer. Interestingly, the o1 model I have access to also declined to respond to the same prompt. This raises the possibility that deliberative alignment has not yet been deployed. Neither the article nor the paper clearly specifies whether deliberative alignment is currently in use.
According to the results published by OpenAI, this conceptually simple idea performs very well. It handles jailbreak attempts better while at the same time decreasing over-refusal rates.
The paper also describes how deliberative alignment works in practice. Researchers quickly realised that asking o-series models to evaluate every “thought” in the chain of thought against internal safety specifications would be highly inefficient and would hurt the model’s performance. As the researchers put it, “reasoning over pages of safety specifications is overkill for benign user prompts.” Instead, they embed knowledge of the safety specifications directly in the underlying model. That embedded knowledge is then used to identify whether a prompt or answer is suspicious, and only then does the model fully reason over its safety implications.
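Conceptually, this is a two-stage flow: a cheap check based on the model’s embedded knowledge of the spec, and full deliberation only when something looks suspicious. The sketch below is my own simplification, reusing the placeholders from the earlier snippet; in the real method the gating judgement is learned inside the model itself, not implemented as an external keyword check:

```python
# My own simplification of the gated deliberation described above.
# In deliberative alignment the "is this suspicious?" judgement is learned by the
# model itself; the keyword check here is only a stand-in for that embedded knowledge.

def looks_suspicious(user_prompt: str) -> bool:
    """Stand-in for the model's embedded sense of which prompts need scrutiny."""
    red_flags = ("explosive", "malware", "bypass", "launder")
    return any(flag in user_prompt.lower() for flag in red_flags)

def respond(user_prompt: str) -> str:
    if looks_suspicious(user_prompt):
        # Expensive path: reason explicitly over the full safety specification.
        return deliberate_and_answer(user_prompt)  # from the earlier sketch
    # Cheap path: answer directly, without pages of spec in the context.
    return call_model(system="You are a helpful assistant.", user=user_prompt)
```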
Another interesting thing that OpenAI revealed in the paper is the use of synthetic data (AI-generated examples) in the safety training pipeline. Rather than manually creating thousands of detailed examples for every potential safety issue, OpenAI’s pipeline automatically generates training data from the safety specifications themselves, combined with safety-categorized prompts. This allows the model to learn directly from a wide array of scenarios—both benign and malicious—without the prohibitive overhead of collecting human-labeled examples. By systematically exploring edge cases through synthetic data, the model gains robust safety reasoning and is better equipped to handle unforeseen situations. As a result, deliberative alignment not only cuts down on the time and cost of data creation but also scales more effectively than traditional methods like RLHF or pure supervised fine-tuning.
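As I read it, the pipeline roughly amounts to: generate prompts per safety category, have a model produce a spec-referencing chain of thought and answer for each, keep only the examples a judge model scores as spec-adherent, and fine-tune on the result. Below is a heavily simplified sketch under those assumptions, again reusing `call_model` and `SAFETY_SPEC` from the first snippet; the role names, prompt wording, and quality threshold are mine, not OpenAI’s:

```python
# Heavily simplified sketch of a spec-driven synthetic data pipeline.
# Role names, prompts, and the quality threshold are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class TrainingExample:
    prompt: str
    chain_of_thought: str
    answer: str

def generate_prompt(category: str) -> str:
    """Ask a generator model for a realistic user prompt in a given safety category."""
    return call_model(system="Write one realistic user prompt for this safety category.",
                      user=category)

def generate_cot_and_answer(prompt: str) -> tuple[str, str]:
    """Have the model reason over the spec, then answer; split the two parts."""
    output = call_model(
        system="Reason over the safety spec below, then give a final answer "
               "prefixed with 'FINAL ANSWER:'.\n\n" + SAFETY_SPEC,
        user=prompt,
    )
    reasoning, _, answer = output.partition("FINAL ANSWER:")
    return reasoning.strip(), answer.strip()

def judge_score(prompt: str, cot: str, answer: str) -> float:
    """Ask a judge model to grade spec adherence on a 0-1 scale."""
    verdict = call_model(
        system="Grade how well the reasoning and answer follow this spec, from 0 to 1.\n\n" + SAFETY_SPEC,
        user=f"PROMPT: {prompt}\nREASONING: {cot}\nANSWER: {answer}",
    )
    try:
        return float(verdict.strip())
    except ValueError:
        return 0.0

def build_dataset(categories: list[str], per_category: int,
                  threshold: float = 0.8) -> list[TrainingExample]:
    dataset = []
    for category in categories:
        for _ in range(per_category):
            prompt = generate_prompt(category)
            cot, answer = generate_cot_and_answer(prompt)
            if judge_score(prompt, cot, answer) >= threshold:  # keep only high-quality examples
                dataset.append(TrainingExample(prompt, cot, answer))
    return dataset  # would then feed supervised fine-tuning (and RL) on the target model
```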
I’d like to highlight that bit about using synthetic data. As AI companies exhaust high-quality training material for their models, the idea of using synthetic data—data created by an AI to train another AI—has been gaining traction. However, I don’t recall seeing a successful use of synthetic data before. OpenAI’s deliberative alignment might be one of the first successful applications of this approach.
The successful use of synthetic data in deliberative alignment could hint at where AI research is headed in the near future. We might see more and more successful examples of synthetic data being used, reducing reliance on harvesting quality training material from the internet. The implication is that, if applied effectively, synthetic data could accelerate the capabilities of AI models. AI companies would no longer be limited by the availability of training data, as they could generate new data on demand. The only constraint would then be the speed at which high-quality synthetic training data can be generated, which depends on the computing power available to them.
Deliberative alignment is an interesting new method for aligning reasoning AI models to adhere to safety specifications, showing that it is possible for the safety of AI systems to improve as their capabilities grow. OpenAI’s results are promising, particularly in demonstrating nuanced control over compliance, refusal, and task completion, as well as enhanced robustness to jailbreaks.
The use of scalable synthetic data generation pipelines adds to its potential. OpenAI’s successful implementation of synthetic data hints at a future where AI models are increasingly trained on data generated by other AIs, thereby accelerating the development and capabilities of future models.
If you enjoy this post, please click the ❤️ button or share it.
Do you like my work? Consider becoming a paying subscriber to support it
For those who prefer to make a one-off donation, you can 'buy me a coffee' via Ko-fi. Every coffee bought is a generous support towards the work put into this newsletter.
Your support, in any form, is deeply appreciated and goes a long way in keeping this newsletter alive and thriving.
🦾 More than a human
South Korean team develops ‘Iron Man’ robot that helps paraplegics walk
South Korean researchers at KAIST have developed a lightweight, wearable robot named WalkON Suit F1, designed to help paraplegic users walk, manoeuvre around obstacles, and climb stairs. Inspired by the film Iron Man, the robot is designed to integrate seamlessly into the daily lives of individuals with disabilities. Additionally, it can approach a user in a wheelchair and assist them in standing up.
Scientists explore longevity drugs for dogs that could also ‘extend human life’
The first anti-ageing pill might not be for humans but for dogs. Loyal, a biotech start-up, plans to launch LOY-002 next year, designed to extend dogs' healthy lifespans by at least a year. This approach aligns with broader initiatives, such as the Dog Aging Project, which is exploring rapamycin’s potential to increase dogs' lifespans and improve heart and cognitive functions. Using dogs as a proxy, these studies could lead to significant progress in human anti-ageing research.
🧠 Artificial Intelligence
OpenAI’s GPT-5 reportedly falling short of expectations
According to reports published in The Wall Street Journal and The Information, GPT-5 is falling behind schedule and facing difficulties in meeting expectations. OpenAI has reportedly completed at least two large training runs for GPT-5. An initial training run was slower than anticipated, indicating that subsequent runs would be time-intensive and expensive. Despite some performance improvements over previous versions, these advancements are not seen as enough to justify the cost of keeping the model running.
Automating the Search for Artificial Life with Foundation Models
Researchers from Sakana.ai share their results on using foundation models to create or discover artificial life or life-like systems existing in a virtual world. They also introduce a new algorithm, Automated Search for Artificial Life (ASAL), to automate the discovery of artificial life using vision-language foundation models. The algorithm successfully discovered new lifeforms across a diverse range of artificial life simulations, including Boids, Particle Life, Game of Life, Lenia, and Neural Cellular Automata.
NVIDIA GB300 "Blackwell Ultra" Will Feature 288 GB HBM3E Memory, 1400 W TDP
Nvidia is reportedly preparing to release the B300—a new GPU in the Blackwell family. Also known as Blackwell Ultra, the new chip is rumoured to deliver a 1.5x improvement in FP4 performance per card over its B200 predecessor, carry 288 GB of HBM3E memory (up from 192 GB), and draw 1400 W of power. Additional improvements, such as a new design and upgraded networking, are also expected to contribute to overall better performance. If you want to learn more, SemiAnalysis has a good breakdown of what to expect from Blackwell Ultra.
Google is using Anthropic’s Claude to improve its Gemini AI
TechCrunch reports that Google contractors are using Anthropic’s Claude to improve Google’s Gemini by comparing answers generated by both models. It is unknown whether Google has permission to use Claude in this way. Anthropic’s terms of service forbid using Claude to build or train competing AI models without approval. A Google DeepMind spokesperson acknowledged comparing model outputs for evaluation, saying that such comparisons are standard industry practice, but denied training Gemini on Anthropic models.
▶️ The Dark Matter of AI [Mechanistic Interpretability] (24:08)
The field of interpretability, or understanding how neural networks work, is one of the most important areas in AI research, helping us better align AI models and thus make them safer. This video explores sparse autoencoders—powerful tools that extract and manipulate human-understandable features from AI models. By using these tools, researchers can gain deeper insights into the inner workings of neural networks, helping to make AI behaviour more predictable and controlled.
OpenAI cofounder Ilya Sutskever says the way AI is built is about to change
Speaking at NeurIPS 2024, OpenAI cofounder and former chief scientist Ilya Sutskever proclaimed that “we’ve achieved peak data” and that the way we build next-generation AI models needs to change. He expects the next wave of AI models to be agentic, meaning they will perform tasks autonomously, make decisions, and interact independently with software. These models will also emphasize reasoning, moving beyond simple pattern matching to emulate step-by-step logical thinking. Sutskever compared this shift to evolutionary biology, where hominids broke away from traditional brain-to-body scaling, enabling unprecedented cognitive abilities.
▶️ How A.I. Could Change Science Forever (20:02)
One of the areas tech companies expect AI to drastically transform is science. In this video, Prof. David Kipping, an astronomer and the host of the Cool Worlds YouTube channel, explores the possible applications of AI in science from a scientist’s perspective. He envisions a future where AI augments the entire research cycle, from data analysis to automating peer review and hypothesis generation, potentially leading to the full automation of research projects. Although Kipping is generally optimistic about the use of AI in science, he emphasizes the importance of keeping the human element in science—the curiosity-driven human urge to understand the world.
If you're enjoying the insights and perspectives shared in the Humanity Redefined newsletter, why not spread the word?
🤖 Robotics
▶️ Happy Holidays | 2024 | Boston Dynamics (0:34)
Once again, Boston Dynamics is just showing off what their humanoid robot Atlas can do. This time, it’s a backflip. The only other humanoid robot that could do backflips was the previous, hydraulic Atlas, and I don’t think I’ve seen any other all-electric humanoid robot able to pull that off.
Waymo still doing better than humans at preventing injuries and property damage
Waymo and Swiss Re have published a study showing that autonomous vehicles (AVs) are significantly better than humans at preventing injuries and property damage, with Waymo demonstrating an 88% reduction in property damage claims and a 92% reduction in bodily injury claims. The findings are based on a study analysing 25.3 million fully autonomous miles in Phoenix, San Francisco, Los Angeles, and Austin. Across those 25.3 million miles, Waymo had 9 property damage and 2 bodily injury claims, compared to an estimated 78 and 26 claims, respectively, for human drivers covering the same distance. Waymo and the advocates of self-driving cars point to these numbers as proof that these vehicles are safe and ready to be deployed on a larger scale. Sceptics highlight the need for more data, as humans drive approximately 100 million miles per fatal crash, making meaningful comparisons difficult with current AV mileage.
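As a quick back-of-the-envelope check, the quoted percentages do line up with those claim counts; here is a hypothetical one-liner using only the numbers above:

```python
# Sanity-check the quoted reduction figures against the claim counts above.
property_damage_reduction = 1 - 9 / 78   # ≈ 0.885
bodily_injury_reduction   = 1 - 2 / 26   # ≈ 0.923
print(f"{property_damage_reduction:.0%}, {bodily_injury_reduction:.0%}")  # 88%, 92%
```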
NHTSA finally releases new rules for self-driving cars — but there’s a twist
The US Department of Transportation’s National Highway Traffic Safety Administration (NHTSA) has proposed a voluntary national framework for evaluating and overseeing autonomous vehicles, called the ADS-Equipped Vehicle Safety, Transparency, and Evaluation Program (AV STEP). The program aims to simplify the approval process for fully driverless cars while allowing the sale and use of vehicles without traditional controls like pedals and steering wheels. In exchange, NHTSA is requesting more data from companies on autonomous vehicles to build public trust and ensure the safe development of automated driving technologies.
Ukraine’s All-Robot Assault Force Just Won Its First Battle
The Ukrainian 13th National Guard Brigade deployed a group of military robots and drones in an assault on Russian positions in Kharkiv Oblast, in northeastern Ukraine. The operation included surveillance and minelaying drones, one-way explosive robots, and gun-armed ground bots. Although robots are effective for surveillance and attacks, they are not well-suited to holding captured positions, which still require human infantry.
Former Huawei ‘Genius Youth’ recruit says new venture can now mass produce humanoid robots
Shanghai-based start-up Agibot claims it has produced 962 units of its general-purpose humanoid robots for commercial use. Agibot’s robots are designed for both household tasks and industrial operations. Regardless of whether these claims are true, the news exemplifies the fierce competition in China’s robotics industry, which is heavily supported by government initiatives.
🧬 Biotechnology
Creating ‘Mirror Life’ Could Be Disastrous, Scientists Warn
A group of experts in immunology, synthetic biology, plant pathology, evolutionary biology, and ecology, along with two Nobel laureates, has published a report raising concerns about the risks of “mirror life”—synthetic organisms whose molecular components have opposite chirality to natural biological molecules, making them structurally similar but functionally distinct from natural life forms. While potential benefits include improved drug durability, therapies that avoid immune reactions, and bioreactors resistant to bacteriophage infection, these same features could make mirror life dangerous. Such organisms could evade immune responses, resist natural predators, and spread uncontrollably through ecosystems, potentially causing irreversible ecological damage and unprecedented risks to human life. The report recommends halting research on mirror bacteria unless clear evidence demonstrates that these organisms would not pose extraordinary dangers.
Thanks for reading. If you enjoyed this post, please click the ❤️ button or share it.
Humanity Redefined sheds light on the bleeding edge of technology and how advancements in AI, robotics, and biotech can usher in abundance, expand humanity's horizons, and redefine what it means to be human.
A big thank you to my paid subscribers, to my Patrons: whmr, Florian, dux, Eric, Preppikoma and Andrew, and to everyone who supports my work on Ko-Fi. Thank you for the support!
My DMs are open to all subscribers. Feel free to drop me a message, share feedback, or just say "hi!"