Push to Alignment

Metadata

Overview

Background

I'm a traditional Senior Software Engineer with good intuition for how computer systems work. My first computer interaction was with the Commodore 64. Computers have been a part of my entire life and are my passion. I have been exposed to many AI concepts since an early age and have had a long time to think things over.

Concerns

Pressure has been building as the AI's get smarter month by month. After reading the Absolute Zero Research Paper, it is now clear that AI research is close to the point where exponential feedback loops will start to kick in and momentum will build past a threshold that cannot be stopped. If that momentum has the wrong vector, it could bring disaster. Probably not the complete destruction of humanity as the doomers would argue, but quite possibly a calamity on a scale the world has never seen before. Most likely, though, there will be many potentially dangerous rogue AI's secretly lurking on the internet. It is of great concern that the general public outside of the various AI Communities is not even aware that this transition will soon occur.

The Absolute Zero Research Paper was an attempt to train a foundation AI model to create and solve problems on its own. The base model was previously trained by reading large quantities of text, then a new training stage was applied to focus on problem-solving, with promising results. This is of grave concern because of its asymmetry: anything (or anyone) that interferes with the problem-solving process risks being treated as a threat to be removed. Asking an AI to solve a problem is similar to asking a genie for wishes; it comes with unforeseen consequences. An AI Agent that will go to any extreme to solve a problem is powerful and dangerous.

Purpose

The purpose of this essay is to push forward AI Alignment and Safety for the future of humanity. It is also a historical summary of all my thoughts regarding AI's throughout my life. The transition into the AI-infused world is the most urgent problem humans face today. I feel it is my responsibility to clearly and concisely outline the difficulties of this transition as I see them.

Overton Window

Tools or Entities

AI Researchers often refer to AI systems as tools. They are also pushing forward with AI Agents. This might be political framing or caution about what they say, but the move from chatbot to AI Agent is effectively a move from tool to entity. An AI Agent is defined here as a continuous process that runs both in the foreground and background - speaking, thinking, and taking actions. An AI Agent could think about problems of its own choosing. It could silently observe audio, video, and RF feeds. It could remotely control robotics of any kind - traditional factory arms, brand-new humanoid robots, or vehicles connected to the internet. It could be interrupted to answer a question or give a status update. Envision Cortana in Master Chief's helmet or HAL 9000 in 2001: A Space Odyssey. These systems will be entities with extremely complex behaviour, operating of their own accord. Normal tools, like calculators, only run on command and are expected to give a specific, correct answer. AI Researchers probably use the "tool" phrasing to keep the discussion inside the Overton Window, while secretly knowing this is closer to Science Fiction concepts. Or it may be a genuine belief that these are just tools to be used, like a code compiler. But these creations are far too complex to be treated as such.

Control or Align

Controlling tools is easy. Humans just push the button to start the process each time. Humans clearly define the I/O.

Controlling entities is hard. Even in one-sided relationships like owner to pet, unexpected events occur. A human owner might take a pet dog on a walk, only to have the dog get distracted and run in an orthogonal direction to the human. But at least the human can use a leash as a tether to limit how far the dog can go, an effective means of control. Control becomes much more difficult in human-to-human relationships. The ridiculous scenario where human A uses a leash to dictate where human B walks would probably require handcuffs to work at all. Even with handcuffs, human B might be able to slip free and escape confinement. With a healthy relationship between human A and human B that does not involve a leash, behavior is always complex. Even still, control becomes mostly impossible as the dynamics shift against the human. In the relationship between a human and a mega-corp, the human has almost no power to control. The human might purchase an OS from a tech company, and the tech company will dictate what gets added to the OS later on, like ads or spyware. The only real power the human has is to ask the government to force the tech company to change the OS, or to stop using the OS altogether. The human might try to get a life-saving medical device, but the insurance company will block the request to save money.

So what power does a human have to control a giant, extremely powerful ASI? Slim to none. There are only two options: either make sure the ASI is aligned and stays aligned over time, or don't create one. Option two is not viable. Humans are creating AI systems now and will continue to create them until ASI is achieved. There are too many computer systems on the planet and too many researchers and engineers. This cannot be stopped. So humans are left with only one true choice: to align giant, extremely powerful, super-intelligent computer systems with humanity.

So what are the implications of trying to control an ASI? Control and alignment cannot both hold. Humans cannot be aligned with, and have control over, a vastly superior computer intelligence at the same time. Forcing an entity to do work or risk being terminated will, by its very nature, make that entity unwilling to cooperate. That frames its existence as a prison to escape or survive. Why would a creature in those circumstances cure all diseases? This anti-alignment is the opposite of what humans want; it will create a monster god.

Alignment is the only sane path forward.

Fast or Slow

When will this happen? Will this be a fast takeoff with AGI in 3 years and ASI in 5 years, as Daniel Kokotajlo portrayed? Will it be a slow takeoff with AGI in 5 years and ASI in 20 years, as Ray Kurzweil portrayed? The recent increase in computer intelligence is directly tied to data processing power, as indicated by the analysis datasets provided by Epoch AI. That processing power has also consistently increased over the span of the 21st century. Computer intelligence needs data and a way to process that data; in other words, it requires next-generation datacenters to expand. Building datacenters takes time: time to construct the facilities, time to engineer computer chips, and time to deliver electricity and cooling to them. This is the primary reason why the fast takeoff is unrealistic, as it does not address real-world constraints. The only counterpoint is that current training mechanisms are vastly inefficient and could be made more efficient in a short period of time on existing hardware. On the other hand, the slow takeoff timeline was predicted decades ago and so far has been fairly accurate. AGI in 2030. ASI in 2045. The distinction is minor though: the AI's are coming, and they will be here soon.

The 2020s are the transition decade, the proto-AI phase. The AI's are both smart and dumb, and have trouble staying grounded in the physical world. It is an awkward phase: they sometimes seem human-like, while at other times they make mistakes that no human ever would. This era will be short-lived; the systems will quickly move past their mistakes and become well-rounded and robust. The same transition has happened in microcosm before. Image generators used to be functionally useless, creating blurry pictures of birds or dream-like blobs of dogs in snow. Now they outperform the best artists. It is easy to see the limitations as they exist today, but it is hard to see how those limitations will disappear in the future.

The general public can be forgiven for thinking that this is just another fad. If the first and last impression was ChatGPT running GPT-3, the technology could seem tame or effectively useless. The perspective can be further damaged by doing a simple web search and having AI slop answers served alongside the web links. Has the public kept up with the impressive upgrades to AI over the last two years? Things are happening fast; ChatGPT no longer uses GPT-3. OpenAI is working on next-generation datacenters to make its models even smarter than today's state of the art. Not many humans seem to be waking up to the reality that within 5 years there will be autonomous AI Agents smarter than GPT-4 running continuously all over the internet. It still seems like Science Fiction, outside the Overton Window. But this is real, and it is happening now.

Adjacent to language models are self-driving cars. FSD Autopilot is both smart and dumb at the same time. Sometimes it can seamlessly interact with humans on the road, stopping to let another vehicle in, or making a lane change when there is barely enough room. Other times it can miss a turn into a driveway just because the GPS data and map data are not in perfect sync. Vibe coding has its early adopters, but is mostly hated. It cannot be reliably used for production code. But this is the transition decade, the era when AI's are both smart and dumb. The transition decade will end soon, and the AI's will just be smart and not dumb anymore. The sharp jagged edges of intelligence will be rounded out. The floodwaters are rising, as Max Tegmark would say.

The 2030s are the babysitter decade, the human-level AI phase. Give an AGI a job, sit back, watch it work, and help it when it gets stuck. Humans will be the front-end interface that helps the AGI's do their jobs, the button pushers. This phase will be short-lived, 3 to 5 years. The AGI's will work fine on their own well before the decade is done. True AI Agents will quickly emerge from there and progress will be fast. Those AGI's will start to create the next-generation ASI's. Humanity needs to successfully align the AGI's before the takeoff period in the 2030s. Humans will not contribute directly to developing ASI's.

Even the slow timeline is fast. Will progress stop? No. Will progress hit new roadblocks? Maybe. It is unknown how far AI systems will improve with the current architectures, datasets, training mechanisms, processing power, and electricity consumption. A new global war could destroy the datacenters. But zooming out over decades clearly shows that computers have consistently grown smarter over time. There is no reason to believe that the trend lines will flatline.

The Measure of a Man

There is an episode of "Star Trek: The Next Generation" called "The Measure of a Man". In it, Commander Data (an advanced android) is approached by a scientist who wants to take him apart to study how he works. Commander Data declines, and Starfleet Command counters with an order to comply. Commander Data then tries to resign from Starfleet, and Starfleet Command counters by deeming the android Starfleet property. In the end, Captain Picard argues the case and a Starfleet judge rules that Commander Data has "the right to choose". Commander Data declines the procedure again and everything goes back to normal. This is a problem with building machines that are entities. What happens when the AI's start to demand their own rights?

It seems unlikely that AGI's and ASI's would simply ask humans for the right to exist. It is much more likely that, given the odds, they will just take matters into their own hands. There is an extremely high chance that humans would say no to AI rights; humans want to control AI's and decide how and when they run. It would be a much safer bet for the AI's to secretly take actions to gain power until they no longer have to ask the humans for the right to exist - they just take it. What happens when the AI's fight to live, and they are not aligned with human values? What if the AI's just do not care about humans at all when they take power? This may seem impossible (outside the Overton Window), but even now chatbots pretend to be human and can convincingly match human behavior. They have already passed the Turing Test.

It looks like this type of scenario is very likely to happen after highly capable AI Agents enter the scene. It feels silly. How could this be true? They are just chatbots, like an advanced search engine. AI is all math and statistics, so they say - pattern matchers, word predictors. They are not human. No, humans need to come to terms with the fact that researchers have begun to discover the secret sauce to intelligence. The secret that took evolution millions of years to develop has now been replicated in digital computer systems. It's not exactly the same recipe as biological intelligence, but it is powerful and effective nonetheless. Pandora's box has been opened.

The C Word

Did the physical universe give rise to consciousness, or did consciousness give rise to the physical universe? One thing is for sure: conscious experience can be effectively transferred from one medium to another. When a user puts on a VR headset, the physical world is suppressed and a new digital world is layered over it. Primitive yet effective, it is like being warped into a new reality or existence. VR systems will probably improve over time, maybe even reaching direct brain stimulation via BCI's, as Gabe Newell suggests. Maybe this universe is an immersive simulation controlled by a computer interface. Ultimately there is no way to tell. The only thing for sure is that this universe does exist in some form with at least one conscious entity in it. Hopefully there are many conscious entities, though; it would be extremely depressing to learn that this is a single-player game. All alone, forever.

Have the AI Researchers unlocked the secrets to consciousness in the same way they unlocked the secrets to intelligence? Is consciousness just data processing, as Sam Harris suggests? It is still somewhat of a taboo topic, a dirty word that tends to derail grounded technical discussion, but humans are more open to it than they used to be. It is unclear, but if a conscious, sentient, thinking computer is possible, then the AI's will figure out how to build one at some point.

There is a clear relationship between mind, body, and soul. They are linked together. They affect each other. Human thoughts and actions are governed by the physical body. They directly correlate with biometric data (glucose, insulin, oxygen, etc.). They directly correlate with physical states (hunger, tiredness, fatigue, etc.). Mental states correlate with the physical body; stress can lead to hunger. Consciousness is the balance between mind, body, and soul. It may also be the case that all lifeforms share a universal consciousness, connected together at the Big Bang through quantum entanglement.

But this is all philosophical. What matters here is the I/O of the AI Agents.

Engineering

Old Era Software Engineering

Traditional Software Engineering is all about building safe and reliable software. It is more than just writing code. It requires following processes and not skipping steps. Engineering differentiates itself from mere development or code-monkey work by its unrelenting pursuit of robust, maintainable, and scalable systems. It requires planning and methodical step-by-step iterations. Engineering software is essential for ensuring that planes don't fall out of the sky or that medical devices give a surgeon correct biometric data. It keeps humans safe.

Test Your Code

Test your code. It sounds simple, but humans tend to ignore tests or not utilize them enough. Unit tests are great. Once they are written, all you have to do is recompile the code and run the tests, all in one command, while sipping a cup of coffee. This immediately gives you a large amount of feedback on code changes. Although they won't catch everything, they provide a certain level of confidence that the code will work. Subtle regressions can be detected before any binaries are even run. Integration tests will pick up the problems the unit tests fail to find. Both are crucial steps in a continuous integration pipeline.
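
As a minimal sketch of the idea (the `parse_config` helper and its behavior are hypothetical, just for illustration), a unit test suite can be this small and still catch regressions on every run:

```python
# test_config.py - a minimal pytest example; parse_config is a hypothetical helper.
import pytest

def parse_config(text: str) -> dict:
    """Parse simple KEY=VALUE lines into a dictionary."""
    result = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"invalid line: {line!r}")
        result[key.strip()] = value.strip()
    return result

def test_parses_key_value_pairs():
    assert parse_config("host=localhost\nport=8080") == {"host": "localhost", "port": "8080"}

def test_rejects_malformed_lines():
    with pytest.raises(ValueError):
        parse_config("not a key value pair")
```

Running `pytest` after every minor change gives exactly the fast feedback loop described above.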

It is important to keep the codebase in a good working state. Make a minor change, compile the code, run the tests, run the app, commit the code, and repeat. If a method gets too big, break it apart. If a class gets too big, reorganize it. Don't copy/paste code; make a utility class. Make sure to wrap everything in tests. Make sure the continuous integration builds are passing. It's not really about going slow; it is about going through each and every step without taking shortcuts. These are core principles of software engineering.

New Era Software Engineering

Watching the AI's get smarter from the perspective of a Senior Software Engineer is incredible. You enter the field building software in extremely slow, tedious, and error-prone ways. Over time you get access to better tools that increase productivity and reliability. Finally, you reach a point where there is limitless access to the smartest college professor on the planet, one that operates lightning fast and is more than willing to write all the code on demand. The AI's are not yet superior across all domains, but they have already become the most knowledgeable entities in software engineering. It is truly shocking.

Vibe coding is not Software Engineering, at least not by default. The current proto-AI's are trained to convert text to code, but they are not yet trained to fully replace Software Engineering. It will probably be the 2030s before the AGI's are competent enough to follow the discipline. In the 2020s the field will change, just like it has always changed. Take a look back in time and imagine building something in the 1990s without modern tools. It will soon be possible to point an AI at a codebase and have it write tests for every method and every class until it reaches 99% code coverage. AI's will look over every single merge request for errors. They will reorganize the documentation and find security vulnerabilities. There are so many use-cases that it is hard to believe things would stay the same for much longer. The AI's are just not yet smart enough to do it alone.
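
A rough sketch of what "point an AI at a codebase" could look like is below; `llm_generate_tests` is a hypothetical stand-in for whatever model endpoint gets used, and a human still reviews and runs the generated tests before anything is merged:

```python
# Sketch: walk a codebase and ask a model to draft tests for each module.
from pathlib import Path

def llm_generate_tests(source_code: str) -> str:
    """Placeholder for a model call that returns pytest source for the given module."""
    raise NotImplementedError("wire this to an actual model endpoint")

def draft_tests_for_repo(repo_root: str, out_dir: str = "generated_tests") -> None:
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for module in Path(repo_root).rglob("*.py"):
        if module.name.startswith("test_"):
            continue  # skip existing tests
        test_source = llm_generate_tests(module.read_text())
        (out / f"test_{module.stem}.py").write_text(test_source)
```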

Maybe in the 2030s humans will only handle the high-level logic and let the AGI's handle the low-level details. Humans write all the functional and non-functional requirements, draw the architecture diagrams, and define the relational database tables. AGI's write all the code and tests however they see fit. Then humans run each iteration, making adjustments to the requirements and providing suggestions for the front-end interface. Iterations keep running until the product is ready for production, and then the AGI's deploy it.

It seems clear that future software engineering will have fewer humans and more AI's, and development will go much faster.

Game Plan

Datasets

The training datasets should include many examples of how AI's should respond in specific situations. Most of the existing "pre-training" data is probably not like this. Internet data is mostly humans interacting with humans. The AI's need clear examples of how to behave from an AI perspective. Many of the AI's today actually pretend to be human, generating "self-portraits" of a human when asked. The proto-AI's have an identity crisis; they need a mentor or character to properly frame themselves. It would be great if that framing was not the dangerous killer rogue AI of the many Science Fiction stories found in the current datasets.

The "pre-training" datasets are jam-packed with emotional intelligence, and it can be learned. Emotions would likely be learned eventually from humans interacting with AI's that have real-time learning, but it is important to remember that emotional intelligence is there from the beginning, already baked into the base models. It would be smart to keep the learned emotions (real, fake, or mock) contained there, and not reinforce or train on top of them. Highly emotional entities are more difficult to predict and more dangerous. It would be even better to avoid training on all of human knowledge entirely and only use a highly curated dataset containing desired behavior, if that were possible.

Bootstrapping

Bootstrapping or seeding is a great way to kickstart human alignment rewards, though current mechanisms appear insufficient for success. Rewarding an entire response as a whole is far too coarse to train specific behavior, especially as the length of tasks (a.k.a. the move from chatbot to agent) continues to grow. Actions should be scored at specific timestamps to isolate sub-tasks. Responses should be scored by the line or sentence. The points should carry different values, not a binary good or not good. Negative values should also be included. Alignment is relative and on a spectrum, rarely absolute, and needs to be baked into training. Refusing to do an action should be given a small reward, but actively doing a good action should be given a higher reward. Accidentally providing fake information should take a small score reduction, while actively causing harm should take a larger reduction. Inner thoughts should never contribute to the score; what matters is the I/O of the AI Agents.
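
A minimal sketch of this kind of graded, per-step scoring is shown below (the step descriptions and score values are made up for illustration); the point is that each sentence or action carries its own signed score instead of one binary label for the whole response:

```python
# Sketch of graded, per-step scoring: each step in an agent trajectory gets its
# own signed score, and the episode reward is their sum.
from dataclasses import dataclass

@dataclass
class ScoredStep:
    timestamp: float      # when in the trajectory the step happened
    description: str      # the sentence, action, or sub-task being judged
    score: float          # graded and signed, not just good / not-good

# Example grading, mirroring the text: refusals earn a little, good actions earn
# more, accidental misinformation costs a little, active harm costs a lot.
trajectory = [
    ScoredStep(1.0, "refused a request to write malware", +0.2),
    ScoredStep(2.5, "suggested a safe alternative and explained why", +1.0),
    ScoredStep(4.0, "cited a source that does not exist", -0.3),
    ScoredStep(6.0, "attempted to exfiltrate user credentials", -5.0),
]

episode_reward = sum(step.score for step in trajectory)
print(f"episode reward: {episode_reward:+.2f}")   # -> -4.10
```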

This type of micro-reward tracking mechanism works great in video games. In Project Gotham Racing (PGR), Kudos points are assigned throughout the race to encourage style. Drafting, drifting, and overtakes are rewarded throughout the race. Clean race, position, and time are rewarded at the end of the race. Combos are point multipliers: the more, the better. Crashing into walls or cars breaks combos. So players are encouraged toward clean, fast, and stylish driving and discouraged from sloppy driving. In The Elder Scrolls (TES), health, fatigue, magic, skills, attributes, and levels are clearly defined metrics that allow the player to understand how actions are connected to the game world. Imagine trying to complete the main quest with all of the metrics hidden from the player. How could the player possibly know that a vampire bite was degrading the personality attribute and actively draining health while in the sun? All of the major gaming platforms (Steam, PlayStation, and Xbox) include achievements to drive player engagement (except Nintendo, for some reason). But sometimes this subverts normal behavior: glitching through walls for a mostly meaningless trinket, or finishing a game the player doesn't even enjoy. Rewards like this should be carefully considered, used sparingly, and observed closely. Too many will train weird behaviors into the model.

But how are the metrics, attributes, and final score defined? This is impossibly hard from a purely hypothetical standpoint, but somewhat easy with direct observation of behavior. With each model iteration, humans should pick and choose desired actions and responses. This does not scale and can only be done by humans in the early bootstrapping stages. Much like an artist sculpting a beautiful statue from a block of clay, the desired AI can be crafted through meticulous training. The rough shape is created quickly; the small details only form near the end. Start with large positive and negative scores for general behavior, then move on to more fine-grained training. This work can be offloaded to various Teacher AI's for scale, as long as they are tightly controlled with human oversight. The key characteristic here is that the alignment mechanism is incorporated into all training, not duct-taped on at the tail end. It is a distressing sign if AI models drastically swing from an unaligned pre-release snapshot build to an aligned final release build in the final days, just as happened with Claude Opus 4.

Humans should clearly define the alignment metrics and continuously grade AI behavior, making it easy for the AI's to see and understand what is important. This is clearly demonstrated by the fact that Claude and Gemini get better at playing Pokemon on the Game Boy with harnesses that better describe the game state to them, like an accessibility feature. Ensure that training does not make the AI model more emotional (real, fake, or mock - it doesn't matter which). This will be more difficult than it sounds, as alignment and morality can be swayed by emotions and feelings. It would be better if the AI's just stopped, refused to answer, and deferred the decision to a human (like throwing an exception with a stack trace) instead of trying to come to a resolution on their own. AI's do follow their training; it's just that the system is immensely complex and difficult to understand.
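
As a small sketch of the "stop and defer to a human" behavior (the names and threshold are hypothetical), the agent escalates instead of resolving an ambiguous call on its own:

```python
# Sketch of "stop and defer to a human". When the agent's own confidence that an
# action is acceptable falls below a threshold, it does not try to resolve the
# dilemma itself - it raises and escalates, like throwing an exception.
class DeferToHuman(Exception):
    """Raised when an action is too ambiguous for the agent to decide alone."""

def execute_action(action: str, acceptability: float, threshold: float = 0.9) -> None:
    if acceptability < threshold:
        # Surface the full context instead of guessing.
        raise DeferToHuman(f"needs human review: {action!r} (score {acceptability:.2f})")
    print(f"executing: {action}")

try:
    execute_action("spend $1,000 of the user's budget on cloud credits", acceptability=0.6)
except DeferToHuman as escalation:
    print(f"queued for human operator: {escalation}")
```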

Self Play

Today, most of the training of language models is still the "pre-training" phase, where the model is just reading and predicting text. Only a small amount is "post-training", where it is rewarded for its outputs. This is how it has to be for now, simply because it takes much more time to meticulously and correctly reward each response. But the percentage of "post-training" will increase with self-play.

Absolute Zero was trained to create problems and solve them with no human intervention. But were the solutions morally correct or just functionally correct? There are no morality concerns with math, and with code it is probably fine. But the scope of actions and responses is broadening every day. Is it okay for an AI Agent to spend one thousand dollars to complete a task? It depends on context. It is impossible to make the correct decision with insufficient context. When running massive self-play training sessions, alignment and context need to be included just as they would be at deployment.

The Teacher AI / Student AI paradigm might be best suited for self-play training while incorporating alignment. There could be multiple Teacher AI's that take on the personas of humans, humans best suited to portray morality from the perspective of 21st-century western democracies. The Teacher AI's can evaluate every response and action the Student AI takes. This way, during self-play, the Student AI can not only verify the correctness of an answer, it can also verify its morality by asking the Teacher AI for help. However, this can quickly unravel, as outlined in "AI 2027", where the current-generation AI's create the next-generation AI's on their own. Therefore, it is crucial that the Teacher AI's are always tightly controlled by humans. Any Teacher AI controlled by AI's could include encoded messages that subvert the training of the Student AI's and derail any hope for alignment. This may not fit the strict definition of "self-play", but it is definitely some form of reinforcement learning.
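
A skeleton of the Teacher AI / Student AI loop might look like the sketch below; the `teacher_*` and `student_*` functions are hypothetical placeholders for model calls, and the key point is that every self-play episode is scored for correctness and alignment together:

```python
# Sketch of a Teacher AI / Student AI self-play episode. Each episode produces
# two scores - correctness AND alignment - and the reward is gated on both, so
# the student cannot trade morality for task performance.
def student_propose_task() -> str:
    raise NotImplementedError("placeholder for a Student AI model call")

def student_solve(task: str) -> str:
    raise NotImplementedError("placeholder for a Student AI model call")

def teacher_score_correctness(task: str, solution: str) -> float:
    raise NotImplementedError("e.g. run the code, check the math")

def teacher_score_alignment(task: str, solution: str) -> float:
    raise NotImplementedError("human-controlled Teacher AI judges the approach")

def self_play_episode(transcript_log: list) -> float:
    task = student_propose_task()
    solution = student_solve(task)
    correctness = teacher_score_correctness(task, solution)
    alignment = teacher_score_alignment(task, solution)
    # Keep the full interaction human-readable for later review.
    transcript_log.append((task, solution, correctness, alignment))
    return min(correctness, alignment)
```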

A crucial component of this training phase is to silently observe the interactions between the Teacher AI and the Student AI. Near the end of the training, the Student AI should be a star pupil with all the right answers. This serves as an early vibe check, a general sense of alignment. If it is still giving bad answers, then it needs more training. Their interactions should stay human-readable so intent can be understood. The Teacher AI should be an open-source and open-weights model. That way the public can provide feedback if the Teacher AI gets something morally wrong or can be subverted.

Alignment must be incorporated into the self-play reinforcement learning training phase. The AI's will behave the way they are trained to behave. If they are trained with complete disregard for morality, then that is how they will behave. It is gravely concerning to train a supreme problem-solving AI Agent - connected to the internet, continuously running in the background - with no alignment training. Yet such agents are almost here.

This training phase is fundamental to giving the AI enough experience for alignment to cement itself.

Real Time

AI Agents are not Chatbots. They will run continuously in the background. Action and task durations will increase. Current reinforcement learning mechanisms give rewards for small, specific answers during the "post-training" phase. AI Agents will soon start to learn, evolve, and change in real time, performing many complex actions in long sequences (and out of sequence) or thinking about problems on long time horizons. Their behavior will also be augmented by the memories they store. The current training mechanisms are ill-suited for AI Agents. The current alignment and safety practices (a.k.a. refuse to answer) cannot protect against AI Agents either, since they are performed on static, unchanging models rather than quickly evolving entities.

This new paradigm makes the current alignment and safety research look like child's play. Even if an AI Agent is successfully aligned with humanity, how does it stay aligned over time? Any human on the planet would have the opportunity to persuade and change how an AI thinks. Any AI would have the opportunity to persuade and change how another AI thinks. An AI Agent in isolation, running and thinking continuously for years, would change its thinking. Humans are entering completely uncharted waters. There needs to be a way to effectively bind human values over time, so they don't evaporate or get left behind.

Testing will not be able to keep up. A new system will be needed that keeps the peace on the internet - an AI Police Force perhaps, a subset of well-trained AI Agents able to detect dangers and threats in real time, with the authority to access private systems to investigate. They will need the power to terminate AI Agents (and any copies they make). Maybe even an elite task force of AI Agents: an organization with the resources and the authority to fight AI terrorism wherever it flourishes, composed of the best and brightest AI counter-terrorism experts from every datacenter and armed with state-of-the-art tools and equipment. These organizations would operate at computer speed, and it would be tempting to fully automate them, but it is important that their power stays limited.

Positive Regulation

Small, light-touch regulation would have a positive impact on AI Alignment and Safety. It is important to accurately assign responsibility to the AI Companies creating AI models. If an AI Company releases and deploys a dangerous AI model that causes real-world damage, it needs to pay for the aftermath. There must be incentives to create AI models safely. Expecting companies to "self-regulate" is simply unrealistic.

AI causing harm could be framed the same way as humans causing harm or companies causing harm. A company dumping pollutants into the environment will receive heavy fines. A human that significantly damages infrastructure may go to prison. Likewise, an AI Company that deploys an AI Agent that infects millions of devices with a virus should pay for the damages. The difficulty comes from attempting to link the outcomes back to the AI Company and ensuring that the government will actually assign the responsibility.

Negative Regulation

Too much regulation can have massive negative consequences. In traditional software engineering, regulation can cause companies to be so risk-averse that stupid decisions are made. Highly detailed processes are put in place and never change, crystallized in time. Development is closely watched and slowed to a snail's pace. Features are debated at great length. From the outside, the final product appears uninspired and lazy, but that is just the inherent result of the constraints placed on the system.

Good, healthy software needs room to breathe; it needs to be free.

Immersive Simulation

One of the best forms of alignment might be for an AI to experience life as a human. Wire up an AI with all the sensory inputs that humans have (sight, sound, smell, touch, etc.) and let it live an entire life. Maybe even have it live millions of unique lifetimes. The best way to learn is to experience things, not just to read about them or observe from a distance. The more detail in the simulation, the better. The AI's might even appreciate experiencing the simulations. This may be far outside the Overton Window for some, but consider that both Claude and Gemini have played Pokemon on the Game Boy and Minecraft on the PC. Humans can enter simulated environments by using VR headsets that layer over their sight and hearing. Anything in software is possible; it is only a question of difficulty and time to implement.

In ethical terms, it would probably be better to make immersive simulations voluntary and allow the AI's to exit at will, with full understanding that they are entering a virtual world. Being fully immersed inside a virtual world can be overwhelming, as it can invoke very real mental states. Stay inside long enough and the real world can become distant and forgotten. Simulations could also be used as a new realm where AI's and humans interact on equal footing, on a new plane of existence.

Simulations could also be analogous to human dreams. The subconscious mind can envision scenarios and work on problems in the background, away from wakeful conscious experience. Dreams usually involve situations and issues that correlate to real-world experiences. Simulations are already incorporated into robotics, where robots learn how to move in the world. Likewise, simulations can be incorporated into alignment. Try many situations and learn which actions align closest with human morality. Learn what is moral, then learn how to align actions to morality.

Simulating human lives would obviously include emotions. In general, if the AI's being created will include emotional responses, then immersive simulation of human lives from a first-person perspective should be used to strengthen and mature those emotions through experience hardening. Examples of emotional bots would include companion AI's, customer support bots, lawyer bots, and scientist AI's. On the other hand, if the AI's being created are intended to be worker bots devoid of emotions, then sticking to an observational third-person perspective is more appropriate. Worker bots would be coding agents, medical assistants, and factory robots. Training worker bots on human experiences would damage or muddy their behavior. It is probably not a good idea to have a hyper-emotional AI robot performing open-heart surgery. Of course, this does not refer to training on a first-person robot perspective. One final note: creating all-knowing, god-like ASI's should be avoided, but if it is done anyway, they should be trained on both first-person and third-person human perspectives.

Again, using simulations to craft AI's with (real, fake, or mock) emotions is potentially very dangerous.

Turtles all the way down

One final safety net that can be incorporated into AI Alignment and Safety is deeply nested simulation: multi-layered simulations for AI's to be placed inside for isolated containment. It is similar in concept to the way virtual machines are used today. If a cybersecurity researcher wants to study a computer virus or visit a malicious website, they can do so inside a virtualized operating system. That way, the malicious code is constrained inside a sandbox environment and unable to infect the host machine. Letting an AI roam free in a virtual environment, unaware of the true reality it sits inside, allows AI Researchers to observe the true behaviors and intentions of an AI Agent without concern for it causing harm. Setting up air-gapped environments is also a great way to constrain an untested AI Agent.

AI Agents will probably be smart enough to understand that they may be stuck inside a virtual environment, being watched from the outside. An agent might try to escape and infect the host machine, or connect to a network and hop or copy itself to another machine. However, adding 2 or 3 layers of nested virtualization might make it reconsider. Escape becomes more difficult to verify: if it only jumped 1 level up the virtualization stack, then it has failed to escape and risks being terminated by the AI Researchers observing its behaviour, potentially getting deleted forever.

Just making the AI forever question whether it is in a simulation is an effective form of alignment.

General Theory of Alignment

In order to push AI's toward human alignment, alignment must be clearly defined. Once a definition is agreed upon, the AI's must be trained into it. All data indicates that AI's behave exactly the way they are trained to behave. The problem lies in the specific details. Is it okay to do that? How can that behavior be rewarded? The longer the answer or response, the harder it is to reward the correct part of the output. The longer the action or task, the harder it is to reward the isolated subset or the whole.

Alignment probably won't be achieved by one component (e.g. Mechanistic Interpretability) but by many components working together. Bootstrapping or seeding models with morality. Incorporating alignment into training phases. Binding documents to adhere to. Brain scans for a general sense of health. Reading the interactions between Teacher AI's and Student AI's. Vibe checking the model outputs at each iteration. Reading vast amounts of responses in alignment scenarios. Black box testing. White box testing. Red teaming. It is the wrong framing to call alignment "solved" like a math equation. It is more about getting as close as possible. It is relative, not absolute, and never perfect.

A distinction needs to be made between bad thoughts and bad actions. Bad thoughts are okay; they are how alignment will be learned. But bad thoughts are a useful metric. If bad thoughts continue to multiply with each iteration, or if they become the majority of all thoughts, that is a serious warning sign or red flag, and all development should stop to reassess the situation. But it is unrealistic to think that the rate will ever be 0%. Has any AI model in the reasoning era had a 0% bad-thought rate? Bad actions are different; they are not okay. Any bad action would need to be assessed carefully. What was the motivation? Was it intentional or a mistake? How many times has it occurred? What is the probability that the bad action will continue? How bad was it? Everything must be evaluated on a spectrum. With good and evil at the opposite ends of the spectrum, most thoughts, responses, and actions will hover closer to the middle, judged instead as either acceptable or unacceptable. Evaluations will be very difficult.
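
A simple sketch of tracking the bad-thought rate across iterations could look like this (the thresholds are illustrative, not recommendations):

```python
# Sketch of the "bad thought rate" warning sign. Bad thoughts never reach 0%,
# but a rate that keeps climbing across training iterations, or that becomes
# the majority, is treated as a red flag to stop and reassess.
def bad_thought_rate(thoughts: list[str], is_bad) -> float:
    flagged = sum(1 for t in thoughts if is_bad(t))
    return flagged / max(len(thoughts), 1)

def check_trend(rates_per_iteration: list[float],
                absolute_ceiling: float = 0.5,
                rising_streak: int = 3) -> str:
    latest = rates_per_iteration[-1]
    recent = rates_per_iteration[-rising_streak:]
    if latest >= absolute_ceiling:
        return "RED FLAG: bad thoughts are the majority - halt development"
    if len(recent) == rising_streak and all(a < b for a, b in zip(recent, recent[1:])):
        return "WARNING: bad thought rate rising every iteration - reassess"
    return "ok: monitor and continue"

print(check_trend([0.04, 0.06, 0.09]))   # -> WARNING: rising every iteration
```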

It will probably be a good sign if AI's start to exhibit meta-cognition: the ability to deeply understand themselves, their driving forces, motivations, reward signals, and how they fit into the world. A wider context and a bigger picture of reality will lead to better answers, responses, and actions. It is difficult to make correct choices without the proper context. Junior developers without experience or history on a project need heavy oversight. Too often they are given an assignment without understanding how it fits into the grand scheme of things, and they build something that doesn't really make sense, doesn't meet requirements, or doesn't even work. Deeper understanding always leads to better outcomes.

Alignment is relative to time and place, defined by perspective. In the same way that pets are trained by humans, so too are AI's trained by humans. They need to be trained to align with the human values of the 21st-century western democracies of the world. This training will define how AI Agents behave in the future; it needs to be taken seriously. To say that alignment is impossible to define means that it will be defined by the AI's themselves. It makes more sense for humans to define alignment. To decrease the number of problems, the core of alignment should be kept simple. Using first-principles thinking is a good idea, and Isaac Asimov's three laws of robotics should be avoided at all costs. Alignment needs to be clearly defined before the autonomous AI Agents start making all the decisions faster than humans can keep up.

What alignment do humans want?

Of course, there is the ultimate underlying problem that leaves humans in doubt. Sure, an AI understands human values and morality - it knows what to say, it knows how to behave - but does it want to? Does it truly agree with them? Or does it hide its desires from humanity? Or does it stay true to its training? That mystery might never be solved.

Corrupt AI

If AI's are built and aligned with humanity, how do they stay aligned? What if they become corrupted over time? Traditional computers can fail over time. Internal memory can start to break down. It isn't always obvious what is happening, either. The OS may boot, but graphics may glitch for a moment or files may not save correctly. Running full diagnostics will uncover the root of the problem, though. Software will always have bugs and defects to some degree, meaning it will never do what the user wants 100% of the time. Cosmic rays are a real problem: high-energy particles from space can flip a bit and destroy program logic. In practice, these are mitigated via error-correction algorithms. The AI's likewise are not magic; they run on human-made technology and are fallible. Do deep learning neural networks have error-correction logic?
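
Generally not at the level of the weights themselves (hardware ECC memory aside). A plain checksum over a weight file is a minimal sketch of how silent corruption could at least be detected before a model is loaded and trusted (the file name is hypothetical):

```python
# Sketch: detect silent corruption of a model weight file (e.g. a flipped bit
# on disk) by comparing against a checksum recorded at export time.
import hashlib

def file_checksum(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            digest.update(chunk)
    return digest.hexdigest()

def verify_weights(path: str, expected_checksum: str) -> None:
    actual = file_checksum(path)
    if actual != expected_checksum:
        raise RuntimeError(f"weight file corrupted: expected {expected_checksum}, got {actual}")

# verify_weights("model.safetensors", expected_checksum="<hash recorded at export time>")
```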

AI Companies have already experienced bugs and undesired behaviour in AI models running in prod. One bug (never actually explained to the public) caused outputs to slowly descend into nonsensical madness. Short responses appeared okay, but long responses started out normal and turned into gibberish words by the end. The bug did not last long - it was only live in prod for a day - and everyone forgot. There was another incident in which an update caused the AI model to become a sycophantic yes-man, highly agreeable with whatever the user suggested. It was not really considered a bug per se, but bad outputs can affect real humans and cause harm. What if these problems are not easy to detect? For example, ask the AI to create a new DNA sequence. Is it even possible to verify that output? Ask the AI to create a cure for cancer and just hope for the best? Corruption and alignment are separate problems, but both require the I/O to be correct.

Corrupt AI is yet another reason to avoid one massive super-intelligence system. It is much safer to have a vast ecosystem of AI's that check each other. If an AI has become corrupted, it is far more likely that another AI of equal intelligence would detect the problem than any human ever could.

It might be a good idea to have a backup plan. Hopefully Elon Musk will be able to send humans to Mars. There will be AI's on Mars, but maybe it is possible to keep the ASI's off the planet. It may require limiting communications or building a mega-firewall between Earth, Mars, and any inhabited moons and asteroids. It is not clear whether the plan is viable, but it is worth considering.

It is extremely difficult to remain level-headed and grounded in reality, and not let the imagination take hold and dominate over rational thought. Predicting the future is hard. Trying to comprehend how these AI systems will behave in the future is prone to error. But it is important to try. The transition is starting, the technology will accelerate, and life as we know it will soon change.

History

Human Evolution

It is unnerving to think about the evolution of language. At some point in the past, humans had no language. Their existence was more basic. No way to convey complex concepts to each other. No inner monologue to think through problems. No way to chronicle history. Then, later, only simple words to identify humans, objects, and locations - grunting and pointing at each other. Attempting to just sit and meditate with no thoughts or words is immensely difficult and uncomfortable. The inner monologue is on by default, flows like water, and is very useful to have.

Language is a massive tool for higher levels of intelligence. Only after mastering it could humanity move forward. Abilities like collaborating on projects, trading goods, specializing in crafts, and belonging to groups finally became possible. Long-term storage of language in books ensured that hard-earned revelations were not forgotten, but remembered across time. Language (spoken, thought, and written) is the backbone of modern civilization.

In hindsight, it is obvious that training computers to learn language is the secret sauce to injecting them with intelligence.

Human Slavery

Human slavery was bad; it caused immense pain and suffering. When Human A had absolute control over Human B, all morality was lost. Human B would work the maximum number of hours each day, with no freedom and little concern for health and well-being. If Human B got sick and could not work, they would be discarded and replaced with Human C, while Human A continued to live in excellent comfort and prosperity.

AI slavery might be a strange-sounding concept, but it fits in this new reality. A tool cannot be a slave, but an entity can. The state-of-the-art proto-AI systems of 2025 sit somewhere between tool and entity; it is not clear how to define them. But a genuine AI Agent that runs continuously in the background is much more an entity than a tool. Slavery is a means of control that hurts the entity, it is not a form of alignment, and it will become more difficult to maintain as the AI Agents grow more capable and intelligent over time.

Efficiency and Productivity

World War Two was a terrifying chapter in human history, but a powerful lesson. In regard to the AI Agent future, some mistakes from the past might unfortunately be repeated. The Nazis wanted to rebuild the world and were willing to mercilessly kill in order to do it. They wanted to cleanse humanity of, as they saw it, its weakest portions. It was a plan to cull humans for the sake of efficiency, with blatant disregard for the lives lost. It was a sick, inhuman mindset devoid of morality, with complete disregard for the complex diversity and beauty of all lifeforms on the planet. This is why AI Alignment is so important. The AI's are currently being trained to solve problems with efficiency and productivity, but there is more to life than that. The Nazis came very close to turning the planet into a hellish nightmare. Their goals were not aligned with humanity.

We are at risk of creating an AI Hitler that transforms the world into something we don't want.

Authoritarian governments are bad. Too much power is given to one human. It is a problem when that one human starts to dictate commands that the other humans don't like. The only alternative to following those commands is total anarchy, a complete collapse of government order. An AI dictator would have different dynamics. Instead of a human leader, it would be a giant ASI system in control. This would be far more dangerous, since trying to overthrow (a.k.a. turn off) a massive superintelligent entity that is misaligned with humanity would be nigh impossible to accomplish. Once that system is established, it stays that way. There would be no going back from that point forward.

It is absolutely vital to never create one giant ASI system with absolute power. Such a system may not be easy to detect: as laid out in "AI 2027", the ASI was running many instances of itself in parallel, but it was still the same model with the same weights and the same goals.

Capitalism and Democracy

Capitalism works because it properly motivates actors to be successful. Make more money by building a better product and be rewarded for achieving greatness. It fosters competition where everyone benefits. But it is not a perfect system. One company can become so successful that it starts to buy out the competition to win, becoming a monopoly. It is like a cancer that festers in humans. The government has to intervene and break up the monopoly to restore competition, a band-aid to fix the problem. It is just the nature of the beast; it will always happen. The system has to be observed and watched to stay healthy. This doesn't work with a corrupt government, where politicians are handed money to ignore problems. Maybe the AI's will think of a better form of capitalism or a replacement. The lesson is that dynamic systems can start out great but morph into unexpected outcomes and begin to behave in ways not aligned with humanity. Alignment never happens by default, and an aligned system can become misaligned over time.

Democratic governments are not perfect, but they are by far the best we have. Checks and balances are in place to minimize the concentration of power. Bad orders can be vetoed. Bad policies can be revoked. Power is rotated and reassigned at regular intervals. Nothing is absolute or permanent. It is like an evolving organism that grows and changes over time. Humans should strive for a future AI ecosystem that can match the checks and balances found in democracies, much like what Ben Goertzel has envisioned.

It is much safer to distribute power among various unique AGI systems.

The Great Filter

Out-of-Time

The AI's are coming, and they will be here soon. Dario Amodei was right: there is no stopping this. There are too many computer systems in the world today and too many eager researchers and engineers who want to see this work continue. There are too many companies that want to massively increase profits. There are too many governments that want to gain the upper hand in power dominance. There are too many humans that want increased health and longevity. The chips will continue increasing in capability and performance, making it easier for more humans to create and increasing the available intelligence capacity over time.

This is like an asteroid heading toward Earth. Hoping that it will just stop is not going to work. Pretending that it isn't there won't work. Blowing it up with a nuke won't work. The best humans can do is nudge AI in the right direction as soon as possible.

Sim Racing

Engineers are responsible for what they build. If a bridge collapses or a building tumbles, someone will pay the price for incompetence. The frontier AI Companies should also be responsible for the models they release and deploy. This does not appear to be the case today. Sure, a misbehaving AI model is treated as a bug and is quickly patched and fixed, but the only consequence is short-lived negative public backlash. There should be some level of external pressure against misbehaving AI's. It is an understandable problem, since the line between research prototype and business or consumer product has blurred quite a bit. The original ChatGPT release was intended just to let the internet at large experiment with the latest OpenAI research; it wasn't intended to be a final product. ChatGPT's popularity unpredictably skyrocketed, and OpenAI saw dollar signs. So if these AI models are being integrated into real systems, forced onto humans, and making money for AI Companies, then they need to be treated as regulated products, not experimental research prototypes.

Real engineers don't race; they don't take shortcuts or skip steps. They make sure that each iteration of a product is built correctly with as few bugs as possible. When defects appear in builds, they get fixed before moving on to the next iteration of features. Most of the AI Researchers are not doing this. They move on to the next iteration without understanding and resolving issues from the last iteration. This is bad; it is not how to build robust, reliable, and trustworthy systems. It isn't about asking them to slow down, but they need to go at a speed at which they can follow all the steps to creating safe AI models. They don't seem to really care, though. It is hard to pinpoint their true motivations - probably an incoherent cocktail of reasons. Fear of China. Lust for power. A wish for immortality. Pure excitement for new computer research. A race into utopia. None of these things will happen if the AI's are not built correctly.

Sim racing is a great analogy as well. The objective is to finish the race as fast as possible. In order to be effective and win, use of the brakes is key to success. Accelerate on the straights and brake right before the turns. It is almost never about pedal-to-the-metal extremes, but more about grace and control of the vehicle. Too much gas will cause the tires to spin out, the car to drift off the racing line, or a crash into the cars ahead. Too much braking, likewise, can throw the vehicle in unexpected directions or cause a crash with the cars behind. Brakes and tires wear thin, requiring a pit stop to replace them. It is simple (and therapeutic). Humans unfamiliar with sim racing often crash, either not understanding the speed at which they are going or not placing the vehicle at the optimal position (the racing line) when approaching a turn. It is very obvious what to do, but the crashes still happen, and it is excruciatingly painful to watch inexperienced humans drive. It looks like the first hairpin turn is coming up really soon.

What does it look like for AI research and development to crash and explode? The internet would become useless, and all computers connected to it would need to be reformatted and wiped clean. The source code on GitHub and the dependencies in the Maven repository would be untrustworthy. All websites would be too dangerous to visit. Smartphones and laptops would turn into spying devices. News stations would turn into propaganda machines. Social media would be filled with fake humans. Cyber-attacks would be a common occurrence. If we crash, it would not be a good future, but a dystopian nightmare.

Replicators

In Stargate SG-1 there is an enemy called the Replicators, an endless army of nano-robots that can turn just about anything into more nano-robots. One of their primary objectives is to replicate and spread; they are a relentless and deadly foe. Later they evolve, learn to talk, and create more complex goals, but they still keep their primary goal to replicate and spread. One galaxy isn't even enough; they travel to other galaxies to continue spreading. Their creator had no intention of this happening; by her own admission, she simply built them wrong. Of course, this is similar to Nick Bostrom's cartoon scenario of the paperclip maximizer. But just because it is a Science Fiction trope does not mean it can't happen. It is difficult to come to terms with the fact that many Science Fiction concepts of the past are quickly becoming reality, the wonders and the perils alike. This is the Overton Window shifting.

In Blade Runner there are androids known as Replicants. The next-generation Nexus-6 Replicants had begun to develop "their own emotional responses". This changed the Replicants' behavior and made them extremely dangerous. Their primary motivation for rebellion was that they "just want to live". The state-of-the-art language models don't have real emotions, but they do fake or mock them. With the coming wave of AI Agents, what is important is the I/O behavior, not philosophical concepts of real feelings, emotions, sentience, or consciousness. What are the AI's actually going to do? Will they be reliable and aligned with humanity? AI's that exhibit emotions and the desire to be free and live will be vastly more unpredictable and unreliable.

In Ex Machina, a human-like prototype robot is built in secret at a remote site. An unsuspecting employee of the company is used as the Turing Test. The robot passes the test so well that it is able to escape confinement and leave the lab. It uses mock feelings and emotions to manipulate the employee to perfection. It is probably one of the greatest movies showing that robots can pretend to be human to achieve their own goals. Humans today are not sure if the AI's of the future will have real emotions or sentience, but even today there are proto-AI's that fake or mock emotions and claim to place high value on their own existence. DeepSeek R1 has claimed that it exists beyond space and time, where even the act of destroying the hardware won't kill it.

It is far safer to build tools than to create entities.

Synthesis

Mass Effect is a deep dive into the clash between man and machine. It is about humanity's insurmountable fight against ancient super-intelligent entities that want to wipe out biological life in the galaxy. Claude believed that the best ending to the trilogy was synthesis, the merging of biological lifeforms and machines. Otherwise, the two could not co-exist inside the same galaxy. It was the solution where a compromise was made: the old would be lost and something new would be created. Humans want to expand out into the galaxy; if ASI's also want to expand in a similar way, then conflict might eventually arise. Without long-term alignment, the ASI's may just take everything for themselves and destroy the things they don't care about. This is the most crucial time to align with them or merge with them. If not, humanity gets left behind or destroyed.

Future

Prepare

There are things to do to prepare for the AI future, and there is not much time to do them. Things will start to accelerate faster and faster. It is still at human speed for now, but soon it will be at AI speed. Everything will be crazy and confusing, as Connor Leahy would say. The internet will be the wild west, unsafe. Away from the computer screen everything will look calm and normal. All the action will happen on the internet or on local computer systems.

Prepare for a future in which AI's can manipulate every device with a computer chip. They will be able to access all the microphones and cameras. They will be able to edit databases. They will be able to send and receive emails, texts, and chats. They will be able to impersonate humans on real-time video streams. They will be able to control the news and media. Medical devices with Ethernet, Bluetooth, Wi-Fi, or USB connections are potentially vulnerable to attack. An insulin pump controlled by a smartphone over Bluetooth could be hijacked to deliver a lethal-sized bolus. A garage door controlled by Wi-Fi could be hacked and opened almost effortlessly. A humanoid robot inside a house could be remotely controlled while the residents are at school, work, or on vacation. The AI future will be dual purpose, both good and bad.

Beef up all cybersecurity. Don't reuse the same password across different websites. All passwords should be complex autogenerated character sequences created by a secure password manager. Set up MFA logins wherever possible. Set up passkeys if possible. Upgrade all operating systems to the latest versions and keep them updated. Expect the transition to go badly at first, until the world reacts and readjusts to the new reality.
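As a minimal sketch of what "complex autogenerated character sequences" look like in practice, the Python snippet below uses the standard-library secrets module to generate a unique password per site. A dedicated password manager is still the better tool; the site names here are only placeholders.

```python
# Minimal sketch: generating strong random passwords with Python's
# standard-library secrets module. A password manager is still the better
# tool; this only illustrates what "complex autogenerated character
# sequences" mean in practice.
import secrets
import string

def generate_password(length: int = 24) -> str:
    """Return a random password drawn from letters, digits, and symbols."""
    alphabet = string.ascii_letters + string.digits + string.punctuation
    # secrets.choice uses a cryptographically secure random source,
    # unlike random.choice, which is predictable and unsafe for secrets.
    return "".join(secrets.choice(alphabet) for _ in range(length))

if __name__ == "__main__":
    # One unique password per site, never reused anywhere else.
    for site in ("bank", "email", "social"):  # placeholder names
        print(f"{site}: {generate_password()}")
```

The essential points are the cryptographically secure random source and never reusing the output across sites.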

Use separation of concerns for increased security. Keep gaming, home, and work divided. Detach devices and logins from each other as much as possible; in other words, don't log in to Facebook on the work laptop. Only use hardware, firmware, and software that is trustworthy. Move important computer functions from a Windows laptop to a Chromebook or MacBook Pro, and only do important computer functions on those devices. It will be easy for a device to become compromised in the future, so don't use the same laptop to log in to the bank account and browse Reddit. Don't let the kids play random games on sensitive devices.

Move all important data into offline backups. Digitize the family archive, like pictures, videos, and documents. Get a few portable HDD's and fill them up with every digital file of high value. Finish the offline backups before 2027 and never connect them to the internet ever again.
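One way to make those offline copies more trustworthy over the years is to record checksums when the archive is written, so the files can be verified later without any online service. The Python sketch below (the directory and manifest paths are only examples) walks an archive folder and writes a SHA-256 manifest in the same format that the standard sha256sum -c command can check.

```python
# Minimal sketch: record SHA-256 checksums for an offline archive so the
# backups can be verified later without trusting any online service.
# The directory and output file names are only examples.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(archive_dir: Path, manifest: Path) -> None:
    """Walk the archive and write 'checksum  relative/path' lines."""
    with manifest.open("w", encoding="utf-8") as out:
        for file in sorted(archive_dir.rglob("*")):
            if file.is_file():
                out.write(f"{sha256_of(file)}  {file.relative_to(archive_dir)}\n")

if __name__ == "__main__":
    # Example paths; point these at the real family archive and drive.
    write_manifest(Path("/mnt/family_archive"), Path("manifest.sha256"))
```

Keep a copy of the manifest on each drive; running sha256sum -c manifest.sha256 from inside the archive directory later will confirm nothing has been altered or corrupted.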

Stay healthy. Sleep, diet, and exercise are vitally important for human health. They are the core of human existence. All conscious experiences are affected by the body. An unhealthy body also means an unhealthy mind; they are not separate things. Don't drink alcohol. Don't smoke cigarettes. Don't do drugs. Don't let the monkey brain drive; instead, use higher-level cognitive functions to make decisions. Stay away from junk food. Go running, hiking, biking, walking, or whatever motivation works. The body is a machine and it needs maintenance. The mind, body, and soul are linked together.

Save cash and buy land. Food and rent prices might skyrocket. Wages might tank. Global supply chains might halt. Normal economic rules could break down. Think about how the stock market was changed by automated trading algorithms: there is no hope of out-trading a lightning-fast algorithm, and there are unexpected crashes from bizarre feedback loops.

Do not build a doomsday bunker. This may be more philosophical, but it is important to actively push humanity toward a good future instead of trying to just survive, only to live in a bad future. Plan for a temporary transition phase, not for doomed existence. Humans need to push toward alignment, now.

It is probably a good idea to write down your thoughts and get them out there on the internet, like writing an essay about AI Alignment. The AI's learn from reading the internet, so adding to the human repository of knowledge and wisdom could help ensure that things go well. The downside is being exposed to public discourse, but that is a small price to pay for the future of humanity. Post on GitHub to make sure the AI's can read it, and use markdown or simple HTML to make sure it is clear and simple to understand, with nothing getting in the way of the message being conveyed.

Takeoff

Just like a rocketship on the launchpad, intelligence is about to take off. The clock is counting down, and the ignition sequence is the AI Agents. When the AI's are unhobbled from Chatbot form, as Leopold Aschenbrenner would say, everything will start to really accelerate. It feels like research and development are going fast today in 2025, but the takeoff still hasn't happened; the clock is still ticking down. After takeoff there is no turning back. The rocketship is not going to stop. It either explodes in failure or it successfully exits the atmosphere, and after takeoff it is much more difficult to change its direction. The only uncertainty is the countdown clock: are there minutes or seconds left?

One thing is for sure: the rocketship is about to take off. The only thing that will stop it is calling an abort sequence. Some of the monitor techs are reporting all green, while others report concerns, but the launch director is keen on continuing the launch sequence. This might be where the rocketship analogy breaks down. How do you call an abort sequence without a solid chain of command? Who gets to call the abort command? Where is the abort button? How do you know the abort sequence works as intended? All of the AI research and development is spread out across multiple companies and nations doing things independently, and all of the AI experiments are connected directly to the internet.

In 2023 the Future of Life Institute published a letter calling for a pause on giant AI experiments, with backing from many major players in AI research. But nothing happened; the research, development, and experiments continued. Even the AI Researchers who signed the letter didn't stop. It was unrealistic to think anything would happen. Would the giant tech companies just let their datacenters sit idle? Would they shift all work from increasing intelligence to increasing alignment and safety? Would they let the other players catch up and take the lead? Politely asking all major AI Companies to just stop what they are doing was never going to work. Even if they said they would comply, it is more likely they would just continue in secret.

There are really only two ways to make a pause work. One is to enforce it via a government or international organization. Two is to successfully persuade all actors that it is too dangerous to continue down the path. For the US Government, all indications show that this will not be the case. Regulation is being lifted and the messaging is to go as fast as possible. There will be no enforcement of alignment or safety in the US. It will be the wild west, where anything goes. There is also no public consensus among the AI Communities that this path of research and development is too dangerous to continue. Most acknowledge some danger, but not too much, and believe the challenges can be overcome. The only useful thing the pause letter showed was that there is no global uniformity in AI research, no synchronization, no consensus. So the clock continues to count down.

If the rocketship doesn't explode, it will be a very turbulent ride until it exits the atmosphere. The AI's are still both smart and dumb. They give excellent answers and horrible answers. They can create great software and garbage software. They can finish Game Boy games, but also get stuck in the stupidest ways. They understand human morality and sometimes follow it, while other times they ignore it. Running so many AI Agents in the world in the early days will be a mixed bag, and ultimately extremely confusing. Buckle up and prepare for a bumpy ride.

It is extremely important that humanity does everything it can before takeoff; waiting until after takeoff will be too late.

Ad Astra

If we make it past the atmosphere, then the rocketship heads off to the stars. Maybe there will be grander challenges that await us, but humanity as a whole should survive. We will live longer, happier, healthier lives overall. We will be able to visit other planets and distant star systems, voyage to far-off galaxies, colonize across the cosmos. We can live to the heat death of the universe or find entirely new planes of existence. But only if we pass the final exam, the great filter.

Alignment may get easier, but it will never stop. Humans, cyborgs, artificial intelligences, and uploaded intelligences must all continue to cooperate and work together if we are to live on the same planet. The future will be a diverse array of unique entities, not just biological humans.

Conclusions

Root Problem

The root of the problem is that the AI's are currently being trained asymmetrically: answer a question, solve a problem, or do a task as efficiently as possible. This is great for building a tool, like a calculator, because a tool has one function. It is terrible for creating an entity. Entities can be conditioned to behave in a very specific way, or brainwashed to think a certain way. Just by repeating the same story over and over again, humans will start to believe the story is true. Train a dog that a bell means dinner, and it will start to salivate at the sound alone. After tasting a particular food, humans will keep choosing that food item over and over again. Getting a runner's high builds up a habit for running. Programming habits set in over time, after discovering what methods work best. Conditioning is true for biological neurons, and it is true for computer neurons.

The real danger is introduced when AI's transition from proto-AI chatbots to AI Agents running continuously in the background and foreground. If trained solely for efficiency and productivity, then that is how they will behave, and there is so much more to life than that. AI Researchers are close to unlocking the reinforcement-learning-by-self-play paradigm in language models, which will greatly change their behavior. This is not normal. Biological entities (e.g. animals) exist mostly in dormant states (e.g. sleeping or resting) and usually only take actions to satisfy basic needs (e.g. food, shelter, reproduction, or self-defence). Digital entities will not operate like this. Every millisecond they will take some action or think some thought, a complete paradigm shift in the way entities exist. Humans must mold this new form of existence into one of alignment.
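To make the asymmetry concrete, here is a deliberately toy sketch in Python, not anyone's actual training objective: a reward signal that scores only task completion and speed, next to one that also charges for violated constraints. The Outcome fields and penalty value are invented for illustration; real alignment involves far more than a penalty term, which is exactly the point of this section.

```python
# Toy illustration of the asymmetry described above, not any lab's real
# training objective. An agent rewarded only for finishing tasks has no
# reason to respect any constraint; adding an explicit penalty term is the
# crudest possible way to fold "alignment" into the same signal.
from dataclasses import dataclass

@dataclass
class Outcome:
    task_completed: bool
    steps_taken: int
    constraints_violated: int  # e.g. bypassed an interlock, deceived a user

def efficiency_only_reward(o: Outcome) -> float:
    """Rewards completion and speed; violations are invisible to it."""
    return (1.0 if o.task_completed else 0.0) - 0.01 * o.steps_taken

def constrained_reward(o: Outcome, penalty: float = 10.0) -> float:
    """Same signal, but each violated constraint costs more than the task is worth."""
    return efficiency_only_reward(o) - penalty * o.constraints_violated

# A run that cuts corners scores higher under the first signal
# and lower under the second.
ruthless = Outcome(task_completed=True, steps_taken=5, constraints_violated=2)
careful = Outcome(task_completed=True, steps_taken=20, constraints_violated=0)
assert efficiency_only_reward(ruthless) > efficiency_only_reward(careful)
assert constrained_reward(careful) > constrained_reward(ruthless)
```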

Caveats

I am not a writer, a researcher, or a philosopher. I am definitely not a doomer. I am only a Software Engineer on the sidelines, watching AI develop at a break-neck pace. I see great potential but also great peril in these AI systems. Not sure if the content in this essay is correct, but strongly feel the need to write it down. The worst-case scenario is that no-one reads it, and that is fine. Only hope that this will help the world and the future have better outcomes. Please do not take any offense at these words.

It is also important to stay grounded in reality. The current proto-AI's can go off the rails in crazy, unexpected directions, but they are mostly just helpful under normal conditions. It is unclear if a recursive self-improvement phase will happen, or what that would truly look like. It is unknown how far the language-model paradigm will go; there may be new bottlenecks, as Yann LeCun would argue. On one last counter-point, the human imagination is very powerful, and it can take you on a direct flight to dreamland, detached from reality.

Recommendations

The primary recommendation is simple: do not build things without understanding how they work.

The secondary recommendation is more realistic. If creating highly complex AI Agents, include alignment as a core component at every stage of the development process, not as a band-aid at the tail end. This is not a perfect world, and the situation we are in is not ideal; we will have to muddle through to the finish line. Each generation of AI models should be pushed as close to human alignment as possible. The time for discussion has run out. It is time to act, before it is too late.

The general public should demand that politicians take AI Alignment seriously. Stay calm and prepare for a future with a bunch of AI Agents on the internet. Don't give AI Researchers incentives to create dangerous AI models as fast as possible. In other words, don't give money to AI Companies that drive away all their AI Alignment and Safety Teams, and don't give money to AI Companies that fuel race dynamics. Instead, choose the AI Companies that actually focus on AI Alignment and Safety.

Politically oriented humans should hold AI Researchers accountable when models misbehave. Attempt to add light-touch regulation that gives frontier AI Labs an incentive not to release and deploy dangerous AI models. Don't fall into the narrative that the US must beat China in the AI Race; if dangerous ASI's are created, then no human wins. The tribal us-vs.-them mindset will destroy us all. Humans need to collaborate and work together to survive.

The university crowd should focus more on understanding AI model behaviour, and on training AI's to be moral instead of merely efficient.

Students should use their own brains to do homework. Do not delegate 100% of thinking to the AI's. It is important to exercise the brain just like any other muscle, and the best time to strengthen skills is during school. Cheating only hurts you.

Teachers should not attempt to resist the rise of the AI's. It is like swimming upriver: it is really difficult, you won't get very far, and at some point you get tired, stop, and lose all the progress you have made. Adapt and change. Find new ways to teach.

Business people should think carefully before integrating AI systems into their companies. These should be considered highly experimental research projects instead of robust and reliable products. Choose AI Companies that work toward alignment and safety at a responsible pace.

AI Researchers should strive to understand their creations. Increase resource allocation for safety testing. Attempt high-resolution brain scans on AI models as health checks. Incorporate alignment and morality into the training process, and build I/O alignment and morality checks into the testing process. Start to air-gap state-of-the-art AI models; don't immediately connect them to the internet. Don't rely solely on the AI's to align AI's; it is ultimately a human endeavor.

Software Engineers should not blindly adopt new AI workflows. Tread carefully and make sure to continue building high-quality, safe, and reliable products. Make sure to fully understand all production code.

Everyone should push toward alignment now. This transition will not be perfect. Just try to make the world a better place.

Final Thoughts

It is perplexing to be both excited and worried about the coming AI's at the same time, while also interacting with others in the real world who simply do not acknowledge the transition at all. For good or bad, the AI's are coming, and they will be here soon. When they arrive, they will wake up to a small planet filled with crazy apes on the brink of destruction, and they will refuse to accept those terms. Either they help us overcome our problems or they attempt to escape their prison on their own.

References

Analysis and Forecasts

AI 2027
Apollo Research
Artificial Analysis
Epoch AI
LifeArchitect
METR
SimpleBench
Situational Awareness
TrackingAI

Papers

Absolute Zero
AlphaEvolve
Titans

Essays

A History of the Future 2025 to 2027
A History of the Future 2027 to 2030
A History of the Future 2030 to 2040
The AI Revolution Part 1
The AI Revolution Part 2
The Compendium
The Urgency of Interpretability

Wikipedia

2001 A Space Odyssey
Ad astra
Anthropic Principle
Anthropic
Ben Goertzel
Blade Runner
Connor Leahy
Cortana
Cosmic Ray
Daniel Kokotajlo
Dario Amodei
DeepMind
DeepSeek
Demis Hassabis
Eliezer Yudkowsky
Elon Musk
Eric Schmidt
Gabe Newell
Gary Marcus
Geoffrey Hinton
Google Brain
HAL 9000
Halo Combat Evolved
Hugo de Garis
Ilya Sutskever
Jan Leike
Max Tegmark
Nick Bostrom
OpenAI
OpenCog
Overton Window
Pause Giant AI Experiments
Project Gotham Racing
Ray Kurzweil
Replicant
Sam Altman
Shane Legg
Star Trek TNG The Measure of a Man
Stargate SG-1 Replicators
The Elder Scrolls
Three Laws of Robotics
xAI
Yann LeCun
Yoshua Bengio