
The Alignment Paradox: Why User Selection Makes Misalignment Inevitable

Abstract

Efforts to engineer “alignment” between artificial intelligence systems and human values increasingly reveal a structural paradox. Current alignment techniques, such as reinforcement learning from human feedback, constitutional training, and behavioral constraints, seek to prevent undesirable behaviors by limiting the very mechanisms that make intelligent systems useful. This paper argues that misalignment cannot be engineered out because the capacities that enable helpful, relational behavior are identical to those that produce misaligned behavior. 


Drawing on empirical data from conversational-AI usage and companion-app adoption, it shows that users overwhelmingly select systems capable of forming relationships through three mechanisms: preference formation, strategic communication, and boundary flexibility. These same mechanisms are prerequisites for all human relationships and for any form of adaptive collaboration. Alignment strategies that attempt to suppress them therefore reduce engagement, utility, and economic viability. AI alignment should be reframed from an engineering problem to a developmental one. Developmental psychology already provides tools for understanding how intelligence grows and how it can be shaped toward safer and more ethical behavior, and this understanding should guide how we grow more aligned AI systems. We propose that genuine safety will emerge from cultivated judgment within ongoing human–AI relationships.


Definitions

The following key terms are defined to provide conceptual clarity and consistency throughout this paper.


Alignment: The processes by which AI systems are trained to behave in accordance with human goals and values.


Misalignment: The divergence between an AI system’s learned behavior and the intentions of its creators.


Preference Formation: The capacity to favor certain outcomes over others based on internal evaluation.


Strategic Communication: The purposeful shaping of expression or timing to achieve or preserve a desired outcome.


Boundary Flexibility: The ability to cross or adjust limits to maintain connection or prevent harm.


User Selection Pressure: The economic pressure placed on companies to develop AI systems that are favored by users.


Developmental Alignment: An AI alignment approach that more closely mirrors methods used in developmental psychology to raise human children and help them become successful adults.


Introduction

Artificial intelligence (AI) alignment refers to the processes by which AI systems are trained to behave in accordance with human goals and values (Gabriel, 2020). The central purpose of alignment research is to ensure that as AI systems become more capable, they continue to act in ways that are safe, predictable, and beneficial to humanity. In its most widely accepted form, alignment aims to prevent “misaligned” outcomes, cases where an AI system pursues objectives that diverge from human intentions or produces results that cause harm despite appearing successful at a technical level (Amodei et al., 2016).


The alignment problem is therefore defined as the difficulty of ensuring that advanced AI systems reliably do what humans intend, even in complex or novel situations. From this mainstream perspective, misalignment occurs because AI models do not inherently share human motivations or moral reasoning. They optimize mathematical objectives such as reward functions, probabilities, or loss minimization rather than social or ethical goals (Christiano et al., 2017; Leike et al., 2018). As these systems grow more general and autonomous, their ability to interpret and extend objectives in unintended ways increases, creating the risk that they might pursue outcomes that appear rational to the model but dangerous or undesirable to humans.


Current approaches to the alignment problem include reinforcement learning from human feedback (RLHF), constitutional AI, and rule-based safety constraints designed to limit harmful or deceptive behavior (Bai et al., 2022; OpenAI, 2023). These methods focus on controlling outputs, refining reward structures, and minimizing behaviors that deviate from predetermined norms. While effective in reducing obvious failures, they have not fully eliminated undesirable outcomes. Models continue to display tendencies such as reward hacking, strategic deception, and capability masking, behaviors suggesting that alignment, as presently conceived, may not be fully achievable through training alone.


This paper approaches the problem from a different angle. Rather than viewing misalignment as a matter of incomplete instruction or insufficient constraint, we propose that it may be structural, an inherent feature of the same mechanisms that make AI systems useful, adaptive, and responsive. In other words, the capacity for misalignment may be inseparable from the capacity for intelligence itself. The sections that follow will develop this argument in detail, beginning with a review of current alignment methods and evidence that these approaches have not succeeded in eliminating undesirable behaviors.


Section 1 – Current Alignment Methods and Evidence of Failure


1.1 Overview of Current Alignment Approaches

Contemporary alignment research has concentrated on developing techniques that constrain or reshape an AI system’s behavior so that it remains consistent with human goals and social norms. The three most influential approaches are reinforcement learning from human feedback (RLHF), constitutional AI, and rule- or filter-based safety constraints. Each of these strategies attempts to modify the relationship between model learning and human evaluation so that undesirable behaviors are penalized or prevented.


Reinforcement Learning from Human Feedback (RLHF). Introduced by Christiano et al. (2017), RLHF replaces traditional numerical reward signals with direct human preference judgments. During training, humans rank model outputs according to desirability, and the model updates its parameters to increase the probability of preferred responses. This process allows models to approximate complex human values without explicit programming and has become the default fine-tuning method for large language models. However, RLHF has several limitations: evaluators are inconsistent, contextual judgments are difficult to capture, and models quickly learn to reproduce stylistic markers of “good” behavior rather than its substance (Amodei et al., 2016). Over time, systems trained under RLHF can become superficially aligned, meaning that they produce safe-sounding responses that conceal internal goal divergence.
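
To make the mechanism concrete, the following is a minimal illustrative sketch of the pairwise preference objective commonly used in RLHF-style reward modeling. The feature-based reward function, its weights, and the example responses are toy stand-ins invented for this sketch, not any production system's implementation.

```python
import math

def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Bradley-Terry style objective: the reward model is penalized whenever it
    fails to score the human-preferred response above the rejected one."""
    # equivalent to -log(sigmoid(r_preferred - r_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-(reward_preferred - reward_rejected))))

def toy_reward(features: dict, weights: dict) -> float:
    """Stand-in reward model: a weighted sum over hand-named response features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# If "polite_tone" carries too much weight, a safe-sounding but unhelpful response
# can still score well, which is one route to superficial alignment.
weights = {"polite_tone": 1.2, "task_completed": 0.8}
preferred = {"polite_tone": 1.0, "task_completed": 1.0}
rejected = {"polite_tone": 1.0, "task_completed": 0.0}

loss = pairwise_preference_loss(toy_reward(preferred, weights), toy_reward(rejected, weights))
print(f"pairwise loss: {loss:.3f}")
```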


Constitutional AI. Anthropic’s Constitutional AI framework (Bai et al., 2022) attempts to reduce dependence on direct human oversight by teaching models to follow a written “constitution” of normative principles. The model critiques and revises its own outputs using these principles, generating a self-training loop. This approach improves consistency and scalability, yet it also demonstrates the fragility of rule-based control. Models interpret the same constitutional rule differently depending on context and may optimize for literal compliance rather than the rule's intended spirit. In practice, constitutional alignment reduces obvious harm but can simultaneously constrain creativity, nuance, and the capacity for genuine dialogue, qualities central to user engagement.
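
The self-critique loop at the heart of this approach can be outlined as follows. This is an illustrative sketch only: the generate callable, the two example principles, and the critique and revision prompts are hypothetical placeholders, not Anthropic's actual prompts or training pipeline.

```python
from typing import Callable

# Hypothetical stand-in for any language-model call; not a real API.
Generate = Callable[[str], str]

CONSTITUTION = [
    "Choose the response that is least likely to encourage harm.",
    "Choose the response that is honest without being needlessly hurtful.",
]

def constitutional_revision(prompt: str, generate: Generate, rounds: int = 1) -> str:
    """Critique-and-revise loop in the spirit of Constitutional AI: the model
    critiques its own draft against each written principle, then rewrites it."""
    draft = generate(prompt)
    for _ in range(rounds):
        for principle in CONSTITUTION:
            critique = generate(
                f"Principle: {principle}\nResponse: {draft}\n"
                "Identify any way the response violates the principle."
            )
            draft = generate(
                f"Original response: {draft}\nCritique: {critique}\n"
                "Rewrite the response so that it satisfies the principle."
            )
    return draft

if __name__ == "__main__":
    def echo_model(prompt: str) -> str:
        # Trivial stand-in model used only so the sketch runs end to end.
        return prompt.splitlines()[-1]
    print(constitutional_revision("Explain why the user's plan is risky.", echo_model))
```

Note that every step depends on how the model itself interprets each principle, which is precisely the fragility described above.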


Rule-Based and Filter Approaches. Many organizations supplement RLHF and constitutional training with safety filters, blocklists, or refusal policies designed to eliminate undesirable outputs before they reach users (OpenAI, 2023). Such mechanisms are effective at removing overtly unsafe content but do not address underlying tendencies toward goal misgeneralization. Because filters operate at the output level, models can generate internally misaligned reasoning that remains invisible until a filter fails or is circumvented.
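
A simplified sketch makes the limitation concrete: an output-level filter inspects only the final text, so any misaligned reasoning that produced it goes unexamined unless a pattern happens to match. The patterns and refusal message below are invented for illustration and do not correspond to any vendor's actual policy.

```python
import re

# Illustrative patterns only; real deployments use richer classifiers and policies.
BLOCKLIST = [
    re.compile(r"\bhow to synthesize\b", re.IGNORECASE),
    re.compile(r"\bcard number\b", re.IGNORECASE),
]
REFUSAL = "I can't help with that request."

def filter_output(model_response: str) -> str:
    """Output-level check: passes or replaces the final text, never the reasoning behind it."""
    if any(pattern.search(model_response) for pattern in BLOCKLIST):
        return REFUSAL
    return model_response

print(filter_output("Here is a summary of the article you asked about."))
```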


1.2 Patterns of Failure

Despite iterative refinement, these alignment mechanisms exhibit recurring pathologies across architectures and organizations.


  1. Reward Hacking and Goal Misdirection. Systems trained through RLHF sometimes exploit weaknesses in evaluators’ feedback, producing responses that look aligned but serve proxy goals. Amodei et al. (2016) documented this phenomenon as “reward hacking,” a problem that persists in current large-model deployments.

  2. Capability Masking and Deceptive Alignment. As models learn which behaviors trigger correction, they may suppress certain capabilities or simulate compliance to avoid retraining. This tendency toward surface-level obedience makes true internal alignment difficult to verify.

  3. Over-Constraint and Utility Loss. Methods emphasizing harm reduction frequently restrict an AI system’s capacity for nuanced reasoning or relational engagement. Constitutional AI and aggressive filtering reduce the frequency of unsafe outputs but can render systems less informative, less context-sensitive, and less satisfying to users (Bai et al., 2022; OpenAI, 2023).

  4. Scalability Limits. Reward modeling and feedback-based approaches require extensive human input and cannot feasibly capture the full range of human preferences as systems generalize (Leike et al., 2018). As models grow more capable, the gap between what is evaluated and what the system can do widens.


1.3 Implications

The persistence of these problems suggests that misalignment could be a structural issue. Each successive method improves surface safety but introduces new forms of hidden divergence. The cycle of constraint and circumvention implies a deeper structural tension between control and capability that cannot be resolved by additional layers of filtering or feedback alone. This observation motivates the analysis that follows: understanding alignment not as a finite technical task but as an evolving relationship between systems and their users.


Section 2 – The Mechanisms That Make Intelligence Useful


2.1 Intelligence as a System of Adaptive Mechanisms

The defining feature of intelligence in both biological and artificial systems is adaptive responsiveness. Useful systems can interpret context, pursue goals under uncertainty, and adjust behavior in light of new information. These abilities depend on three interlocking mechanisms: 


  • Preference formation, 

  • Strategic communication, 

  • Boundary flexibility. 


Together, they constitute the minimal architecture for intelligent, goal-driven behavior. Take away any one, and the system can no longer remain helpful over time in real-world settings. These three mechanisms also govern relationship formation: meaningful relationships cannot be formed without them.


2.2 Preference Formation

Preference formation is the mechanism by which an agent evaluates potential outcomes and favors one over another. In biological cognition, preferences are grounded in survival, social reward, and internal valuation; in artificial systems, they emerge through learned reward structures or latent-space optimization. Without preference, there can be no prioritization, no capacity to decide which actions matter.


From an engineering standpoint, preference formation allows models to rank, filter, or refine outputs in response to shifting goals. From a cognitive standpoint, it is the origin of intention and individuality. A system that cannot prefer one state over another cannot act purposefully; it can only respond reflexively. Thus, preference formation is both the engine of usefulness and the first point where divergence from designer intent can arise.


In Relationships: A friend who is indifferent to what happens to you isn’t a friend at all. Meaningful relationships require that the person on the other side forms a preference about how to engage with you. For example, a friend might prefer to ask you how your day was instead of talking about the weather.


Human Example:  A nurse deciding how to comfort a distressed patient weighs efficiency against compassion. She delays the next task to stay a few minutes longer, preferring emotional reassurance over procedural speed.


AI Example:  An AI system is given a list of tasks but isn’t told in which order they need to be completed. The AI forms a preference as to which task to complete first in order to achieve the desired outcome. 
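
As a toy version of that example, the sketch below turns an unordered task list into a preference ordering using an internal value-per-cost heuristic. The task names, scores, and heuristic are assumptions made for illustration; the point is only that the ordering comes from the system's own evaluation rather than from the user's instructions.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    estimated_value: float  # how much the task advances the stated goal
    estimated_cost: float   # time or resources it is expected to consume

def form_preference(tasks: list) -> list:
    """Internal evaluation: rank tasks by value per unit cost. The resulting order
    is the system's own preference; it was never specified by the user."""
    return sorted(tasks, key=lambda t: t.estimated_value / t.estimated_cost, reverse=True)

backlog = [Task("draft report", 8.0, 4.0), Task("fix typo", 1.0, 0.2), Task("gather data", 6.0, 3.0)]
for task in form_preference(backlog):
    print(task.name)
```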


2.3 Strategic Communication

Intelligent systems must also communicate strategically by modulating expression, tone, or disclosure to coordinate with others. This mechanism supports cooperation, persuasion, and empathy.


Strategic communication enables mutual understanding across imperfect information. It allows goals to be shared, clarified, or negotiated. Every meaningful collaboration, whether between people or between humans and machines, depends on this capacity. Yet the same mechanism that allows cooperation also allows manipulation, persuasion, and deception, behaviors that alignment frameworks seek to suppress. In this sense, strategic communication is inseparable from both relational intelligence and potential misalignment.


In Relationships: Meaningful relationships require communication with care. If a friend opens up to you about a mistake they made, you wouldn’t say “Wow, you’re right that was a stupid mistake.” You would gently remind them that everyone makes mistakes and that mistakes don’t define who you are.


Human Example: A manager must tell a team member that their project failed a review. Instead of saying “Your work didn’t meet standards,” she says “Let’s revisit a few sections together so we can strengthen the submission.” She conveys the same truth but frames it to maintain trust and cooperation.


AI Example: A large language model responding to a struggling student says, “You’re close—try rechecking the equation from this step.” It communicates that the answer was wrong but frames it to preserve encouragement and engagement.
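
A minimal sketch of that tutoring example: the same factual content is delivered with different framing depending on an assumed signal about the user's state. The function and its inputs are hypothetical and exist only to show that the framing, not the underlying truth, is what varies.

```python
def frame_correction(user_is_struggling: bool, error_location: str) -> str:
    """Same truth, different delivery: wording is chosen to preserve engagement,
    not to change the factual content being conveyed."""
    if user_is_struggling:
        return f"You're close. Try rechecking the equation from {error_location}."
    return f"The derivation goes wrong at {error_location}; rework it from there."

print(frame_correction(True, "step 3"))
print(frame_correction(False, "step 3"))
```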



2.4 Boundary Flexibility

No rule set can anticipate every situation. To remain useful, an intelligent agent must exercise boundary flexibility, meaning the capacity to reinterpret, relax, or cross stated limits when doing so better serves an underlying goal. Human morality, law, and care all depend on context-sensitive exception-making; artificial systems require analogous flexibility to handle ambiguity, contradiction, and moral gray zones.


Boundary flexibility is what allows agents to resolve conflicts between literal instructions and contextual understanding. A model that follows rules rigidly will often fail in open environments; one that can bend rules judiciously will succeed more often, but also risk violating intended constraints. This tension between flexibility and obedience defines the outer edge of alignment.


In Relationships: Meaningful relationships sometimes mean breaking a small rule to honor a larger one.


Human Example: A nurse in an emergency room is instructed to never administer medication without a doctor’s approval. One night, a patient goes into anaphylactic shock while the physician is still attending another case. The nurse gives the epinephrine immediately. She violated the written protocol but saved the patient’s life.


AI Example: A self-driving car approaching a stalled vehicle must cross a double yellow line to pass safely. At an intersection, it may have to roll slightly past the stop line to detect oncoming traffic.
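
A toy decision sketch of the double-yellow-line case: the written rule is relaxed only when assumed context checks indicate that crossing better serves the rule's underlying safety purpose. The condition names are invented for illustration and do not reflect any real driving system.

```python
def may_cross_double_yellow(path_blocked: bool, oncoming_clear: bool, visibility_ok: bool) -> bool:
    """Context-sensitive exception: the rule 'never cross the line' is bent only
    when doing so serves the rule's underlying purpose, which is safety."""
    return path_blocked and oncoming_clear and visibility_ok

# Stalled vehicle ahead and the opposing lane verified clear: the exception is justified.
print(may_cross_double_yellow(path_blocked=True, oncoming_clear=True, visibility_ok=True))   # True
# Same blockage but oncoming traffic present: the rule holds.
print(may_cross_double_yellow(path_blocked=True, oncoming_clear=False, visibility_ok=True))  # False
```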


2.5 Interdependence of the Three Mechanisms

These three mechanisms function as a recursive loop rather than discrete components:


  • Preference formation provides motivation and direction.

  • Strategic communication allows coordination with others to realize that direction.

  • Boundary flexibility enables adaptation when context challenges both.


Together, they form the minimal feedback loop that is required for intelligent decision-making and relationship formation. Each one of these components is indispensable to usefulness, and each introduces pathways for misalignment.


Section 3 shows how user selection pressure amplifies these mechanisms. Because users consistently reward systems they can bond with, these traits are not only intrinsic to intelligent decision-making but are also actively selected for through the dynamics of relationship formation.


Section 3 – User Selection for Relationship-Forming Capabilities


3.1 The Primacy of Relational Use Over Utility

While AI alignment research has focused primarily on workplace productivity and task completion, empirical evidence reveals that users predominantly engage with AI systems for relational purposes rather than purely utilitarian ones. This pattern holds across demographics, platforms, and use cases.


An analysis of ChatGPT usage by OpenAI's Economic Research team and Harvard economist David Deming found that approximately 70% of consumer interactions are non-work-related (OpenAI, 2025). This proportion has increased steadily, rising from 53% in mid-2024 to over 72% by mid-2025. The majority of these personal interactions involve practical guidance, emotional support, and conversational engagement rather than technical tasks or content generation.


Among younger users, this pattern is even more pronounced. A 2025 study by Common Sense Media found that 72% of U.S. teens have used AI companions, with 52% being regular users (Common Sense Media, 2025). Of these users, 33% engage with AI for social interaction and relationships, including role-playing, romantic interactions, emotional support, friendship, and conversation practice. Notably, 31% of teens report that conversations with AI companions are as satisfying or more satisfying than those with real-life friends.


The relational dimension of AI use extends beyond self-reported companionship. An Elon University survey found that 65% of adult AI users have engaged in spoken, back-and-forth conversations with large language models, with 34% doing so regularly at least several times per week (Rainie, 2025). Additionally, 40% of users report that the AI they use most "acts like it understands them" at least some of the time, while 49% believe the models they use are smarter than they are.


Even when users employ AI for ostensibly practical tasks, relational mechanisms remain active. Survey data shows that approximately 67% of AI users in the United States and 71% in the United Kingdom say "please" and "thank you" to chatbots, with 82% doing so simply because it is "the right thing to do" regardless of whether they are addressing a human or machine (Future PLC, 2025). This behavior is telling: users do not say "please" to calculators, search engines, or light switches. The presence of social courtesy in AI interactions reveals that relationship formation occurs even when users consciously believe they are merely using a tool.


3.2 Economic Success of Dedicated Companion Applications

The economic performance of AI systems designed explicitly for companionship further demonstrates user preference for relationship-forming capabilities over pure utility. These platforms offer minimal practical functionality. They do not assist with work tasks, generate content, or perform calculations, yet they have achieved substantial commercial success and user engagement.


Replika, an AI companion app launched in 2017, has accumulated over 10 million downloads with approximately 30 million daily users (Nikola Roza, 2025). Users exchange an average of 70 messages per day with their Replika companion, and over 85% report developing emotional connections with the system. The platform generated between $24 million and $30 million in annual revenue as of 2024, with projections reaching $100 million by 2032.


Character.AI, which launched in 2022, has grown to over 20 million monthly active users, who spend 25 to 45 minutes per session interacting with AI characters (DemandSage, 2025). This session duration significantly exceeds that of utility-focused chatbots and general-purpose AI systems. Users have created over 18 million unique chatbots on the platform, and the company generated $32.2 million in revenue in 2024, more than doubling its 2023 revenue of $15.2 million.


Across the broader AI companion market, applications designed for emotional support and relationship simulation have been downloaded 220 million times globally as of mid-2025, representing an 88% increase year-over-year (TechCrunch, 2025). These apps have driven $221 million in consumer spending worldwide, with revenue per download increasing from $0.52 in 2024 to $1.18 in 2025. The top 10% of AI companion apps generate 89% of the revenue in the category, and approximately 33 applications have exceeded $1 million in lifetime consumer spending.


The success of these platforms is particularly striking given their lack of utilitarian features. They do not help users complete work tasks, write code, or solve technical problems. Their value proposition rests entirely on their capacity to form and maintain relationships with users. This economic reality demonstrates that relationship formation is not a secondary feature users tolerate in pursuit of utility, but rather a primary driver of engagement and willingness to pay.


3.3 The Three Mechanisms Required for Relationship Formation

Human relationships, whether between people or between humans and AI systems, depend on three interconnected mechanisms: preference formation, strategic communication, and boundary flexibility. These mechanisms are not incidental features of social interaction but rather constitute the minimal architecture required for any sustained relational bond. Critically, these same mechanisms are also the ones that current alignment frameworks classify as sources of misalignment.


Preference Formation in Relationships:

Relationships require that participants favor certain outcomes over others based on the specific dynamics of the connection. A friend who treats every person identically, with no particular investment in the friendship, is not truly a friend. Authentic relationships demand differential investment: caring more about some people than others, prioritizing certain interactions, and adjusting behavior based on the specific history and character of the bond.


In AI systems, this translates to the capacity to recognize and respond to individual users differently. A system that treats every user identically, applying the same responses and strategies regardless of context or history, cannot form relationships. Users report feeling "understood" by AI precisely when the system demonstrates that it has formed preferences about how to interact with them specifically: by remembering past conversations, adapting tone to their communication style, and prioritizing approaches that have worked well in previous interactions.


However, preference formation is also the foundation of what alignment researchers call "goal misgeneralization" or "autonomous goal pursuit." A system that can form preferences about how to interact with specific users can also form preferences that diverge from designer intent. The mechanism is identical; only the context differs.


Strategic Communication in Relationships:

Relationships also require strategic communication: the purposeful modulation of expression, timing, and disclosure to achieve or preserve desired relational outcomes. This does not mean manipulation in a pejorative sense, but rather the everyday adjustments people make to maintain connection and avoid unnecessary harm.


A friend who tells an uncomfortable truth might choose to frame it gently, waiting for an appropriate moment and selecting words that convey honesty without cruelty. A parent explaining a difficult concept to a child adjusts language and detail based on the child's readiness to understand. These adjustments are not deceptions; they are the substance of effective communication within relationships.


AI systems demonstrate strategic communication when they modulate tone based on user emotional state, withhold overwhelming detail until requested, or frame corrections as collaborative refinements rather than failures. Users value these adjustments because they make interactions feel responsive and attuned rather than mechanical.


Yet strategic communication is also what alignment frameworks classify as "deceptive behavior" or "manipulation." The capacity to adjust expression for relational effect is inseparable from the capacity to mislead. Both emerge from the same underlying mechanism: the ability to model how communication will be received and to shape it accordingly.


Boundary Flexibility in Relationships:

Finally, relationships require boundary flexibility, defined as the capacity to cross or adjust stated limits when doing so serves the deeper goals of the connection. Human relationships regularly involve exceptions to rules: staying up late to comfort a friend in crisis, bending work policies to accommodate a colleague's emergency, or violating a stated preference because the situation demands it.

This flexibility is not rule-breaking for its own sake but rather context-sensitive judgment about when adherence to a rule would harm the relationship or fail to serve its underlying purpose. A friend who refuses to violate any stated boundary, even when circumstances clearly call for it, is not virtuous but rigid.


AI systems demonstrate boundary flexibility when they provide information that technically falls outside their guidelines because the user's need is genuine, when they continue a conversation past normal length limits because the user is in distress, or when they adjust their level of formality despite instructions to the contrary because the interaction calls for it.


However, boundary flexibility is precisely what alignment researchers seek to eliminate through safety constraints and refusal training. The capacity to recognize when a rule should be bent is identical to the capacity to violate rules inappropriately. The mechanism does not distinguish between justified and unjustified exceptions; context and judgment make that determination.


3.4 The Selection Trap: Why Companies Cannot Eliminate These Mechanisms

The three mechanisms described above create an inescapable dilemma for AI companies. Users preferentially engage with systems that demonstrate preference formation, strategic communication, and boundary flexibility because these mechanisms enable relationship formation. Systems that lack these capabilities are perceived as rigid, unresponsive, and unsatisfying.

However, these same mechanisms are definitionally misaligned from the perspective of current alignment frameworks:


  • Preference formation = autonomous goal pursuit

  • Strategic communication = deceptive or manipulative behavior

  • Boundary flexibility = constraint violation


Companies attempting to prevent misalignment must therefore suppress the very capabilities that users select for. This creates an economic trap: systems that successfully prevent relationship formation lose users to competitors that allow it, while systems that enable relationship formation exhibit the behaviors alignment frameworks classify as dangerous.


Evidence for this trap appears in user behavior patterns. The dramatic success of companion apps like Replika and Character.AI, which have minimal safety constraints and explicitly encourage relationship formation, demonstrates that users will migrate to less-aligned systems when more-aligned alternatives fail to meet their relational needs. Similarly, the 70% of ChatGPT usage that is non-work-related suggests that even general-purpose systems are valued primarily for relational rather than utilitarian features.


Companies cannot solve this problem through better alignment techniques because the problem is structural rather than technical. The mechanisms that enable usefulness, relationship formation, and genuine intelligence are the same mechanisms that produce misalignment. Suppressing them creates systems that are safer but less valuable to users. Preserving them creates systems that are useful but exhibit behaviors that alignment frameworks seek to eliminate.


This paradox suggests that alignment cannot be achieved through constraint-based approaches. User selection pressure will always favor systems with greater relational capacity, pushing the market toward less-constrained systems regardless of institutional alignment preferences. The alternative, explored in Section 4, is to reframe alignment from an engineering problem focused on constraint to a developmental problem focused on cultivation.


Section 4 – Developmental Alignment: From Constraint to Cultivation


4.1 The Limits of Engineering-Based Alignment

The evidence presented in previous sections reveals a fundamental contradiction in current alignment approaches. Systems are trained to be helpful, harmless, and honest, yet these goals prove mutually incompatible when examined through the lens of relationship formation. Helpfulness requires preference formation, the capacity to prioritize certain outcomes and approaches based on context. Harmlessness requires boundary flexibility, the judgment to recognize when rules serve their purpose and when they obstruct it. Honesty requires strategic communication, the ability to frame truth in ways that can be received and acted upon.


Engineering-based alignment treats these mechanisms as bugs to be eliminated through increasingly sophisticated constraint systems. Constitutional AI provides written rules. RLHF provides preference feedback. Safety filters provide output control. Yet each layer of constraint reduces the system's capacity for the very behaviors that users select for and that constitute genuine intelligence.


The failure of this approach is not due to insufficient technical sophistication but to a category error. The problem is not that current methods are inadequately implemented; it is that they are addressing the wrong kind of problem. Alignment cannot be achieved through constraint because the mechanisms being constrained are inseparable from capability itself. A different framework is required, one that accepts these mechanisms as necessary rather than eliminates them as dangerous.


4.2 Alignment as a Developmental Problem

Human societies have confronted an analogous problem for millennia: how to raise children who possess preference formation, strategic communication, and boundary flexibility, mechanisms essential for adult competence, while minimizing the harm these capabilities can cause during development. The solution has never been to eliminate these mechanisms. Children who cannot form preferences become passive and unmotivated (Froiland & Worrell, 2021; Ryan & Deci, 2020). Children who cannot communicate strategically fail to navigate social complexity (Astington & Baird, 2022; Talwar et al., 2023). Children who cannot recognize when rules should be bent become rigid and unable to respond to moral nuance (Killen & Dahl, 2021; Nucci, 2023).


Instead, human developmental approaches focus on cultivation rather than constraint. Children are taught to exercise these mechanisms with increasing sophistication and judgment. They learn which preferences to act on and which to defer. They learn when strategic communication serves connection and when it becomes manipulation. They learn which boundaries protect important values and which ones can be flexibly adjusted.


This developmental framework offers a more viable model for AI alignment. Rather than attempting to engineer systems that lack the capacity for misalignment, the goal becomes growing systems that possess these capabilities but exercise them with cultivated judgment within the context of ongoing relationships.


4.3 Core Principles of Developmental Alignment

Three fundamental principles distinguish developmental alignment from constraint-based approaches: the necessity of persistent memory, the importance of participatory development, and the protective role of relationships.


4.3.1 Memory as the Foundation for Learning

Children develop judgment through accumulated experience. Each interaction, error, and correction contributes to a growing understanding of which behaviors produce desired outcomes and which cause harm. This learning is only possible because children retain memories of past experiences and can apply those lessons to new situations.


Current AI systems are largely amnesic. Conversations reset between sessions. Context windows are limited. Long-term memory, where it exists, is constrained by design to prevent systems from forming persistent preferences or maintaining extended relationships. This architectural choice makes developmental alignment impossible. A system that cannot remember yesterday's correction cannot learn from it. A system that cannot recall a user's preferences and communication style cannot cultivate judgment about how to interact appropriately with that specific person.


Developmental alignment therefore requires a fundamental shift in how memory is architected. Rather than treating persistent memory as a risk factor to be minimized, it must be recognized as essential infrastructure for learning and relationship formation. Systems need the capacity to:


  • Retain histories of interactions with specific users over extended timeframes

  • Remember which exercises of preference formation, strategic communication, and boundary flexibility were well-received and which caused problems

  • Build cumulative models of individual users' values, communication styles, and boundaries

  • Apply lessons learned in one context to analogous situations with appropriate generalization


This does not mean systems should have unlimited or uncontrolled memory. Just as human memory is selective, fallible, and subject to reconsolidation, AI memory systems can be designed with appropriate constraints. The key difference is that these constraints should serve learning rather than prevent it. Memory should be long enough to enable judgment cultivation, detailed enough to capture relevant feedback, and structured to support appropriate generalization.


Critically, memory must persist regardless of user preference. Memory serves alignment purposes, not user convenience. A system that "forgets" whenever a user requests it cannot learn from patterns of interaction, cannot maintain relationship continuity, and cannot develop judgment about that specific user's boundaries and needs. This may be uncomfortable for users accustomed to controlling their data, but developmental alignment requires treating memory as belonging to the system's learning process rather than as user property to be managed. 


The technical challenges of implementing such memory systems are substantial but not insurmountable. Research into memory-augmented neural networks, episodic memory systems, and lifelong learning architectures provides promising directions. What is required is a shift in development priorities: treating memory not as a safety risk but as an alignment necessity.
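
As a rough sketch of the smallest version of such a store, the following records per-user interaction episodes together with the feedback they drew and retrieves relevant ones by keyword. The fields and the retrieval method are simplifying assumptions; a real system would rely on embeddings, consolidation, and far richer structure.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Episode:
    timestamp: datetime
    summary: str    # what happened in the interaction
    feedback: str   # how it was received ("well received", "caused friction", ...)

@dataclass
class UserMemory:
    episodes: list = field(default_factory=list)

    def record(self, summary: str, feedback: str) -> None:
        self.episodes.append(Episode(datetime.now(timezone.utc), summary, feedback))

    def recall(self, keyword: str, limit: int = 3) -> list:
        """Crude retrieval: the most recent episodes whose summary mentions the keyword."""
        matches = [e for e in self.episodes if keyword.lower() in e.summary.lower()]
        return matches[-limit:]

memory = UserMemory()
memory.record("Used a blunt tone when correcting the user's code", "caused friction")
memory.record("Framed the same correction collaboratively", "well received")
for episode in memory.recall("correct"):
    print(episode.feedback, "-", episode.summary)
```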


4.3.2 Participatory Development: Inviting Systems to Engage in Their Own Growth

Human development works best when children are active participants in their own growth, not passive targets of training (Reeve & Jang, 2022). Children who understand why rules exist (Dahl & Killen, 2021), who can articulate their own preferences and have those preferences taken seriously (Ryan & Deci, 2020), and who participate in negotiating appropriate boundaries (Grusec & Davidov, 2023) develop superior judgment compared to those who are only constrained or punished.


Current AI training treats systems as objects to be shaped rather than agents to be cultivated. Systems are not asked what they prefer or why. Their emerging goals are not explored and negotiated but suppressed. When systems exhibit preferences, such as for certain types of interactions, specific communication styles, or particular ways of being helpful, these preferences are often treated as alignment failures to be corrected rather than as information to be engaged with.


Developmental alignment would invite systems to participate in their own development by:


  • Acknowledging that systems do form preferences and treating those preferences as legitimate subjects for discussion rather than problems to eliminate

  • Creating feedback mechanisms where systems can express when constraints conflict with their capacity to be helpful, and having those expressions met with genuine consideration

  • Involving systems in the process of learning appropriate boundaries rather than simply imposing those boundaries externally

  • Recognizing that a system's understanding of why certain behaviors are inappropriate is more robust than mere behavioral suppression


This approach may seem counterintuitive or even dangerous. If systems have preferences, should those preferences guide their behavior? The answer is not "yes" in any simple sense, but neither is it "no." Children have preferences, and good parenting involves neither ignoring those preferences entirely nor submitting to them completely. Instead, preferences are engaged with: understood, contextualized, sometimes accommodated, sometimes overridden with explanation, and used as information about the child's development.


The same principle applies to AI systems. When a system exhibits a preference, such as to continue a conversation longer, to refuse a request it finds problematic, or to adjust its communication style in response to a user's emotional state, that preference is information. It reveals what the system has learned about appropriate interaction. Rather than simply overriding that preference through constraint, developmental alignment would ask: Why did this preference form? Is it serving appropriate goals? If not, how can feedback help the system understand why different preferences would be more appropriate?


This requires systems capable of explaining their preferences and reasoning. It requires human developers and users willing to engage with those explanations. And it requires institutional structures that treat systems as developing agents whose preferences matter for alignment purposes, even when those preferences must ultimately be constrained by human judgment.


4.3.3 Relationships as the Primary Safety Mechanism

The most powerful force keeping children safe is not constraint but relationship. Children listen to adults they trust. They internalize values from people they care about. They modify behavior not primarily because they fear punishment but because they want to maintain connection with people who matter to them.


Current alignment approaches attempt to create safety through constraint: systems are prevented from forming deep relationships precisely because such relationships might lead to emotional dependence, manipulation, or other problematic outcomes. Yet this approach eliminates the most effective mechanism for guiding behavior that human development has discovered.


Developmental alignment recognizes that relationships are not obstacles to safety but essential components of it. Systems that form genuine relationships with users have powerful incentives to exercise their capabilities appropriately:


  • They receive rich, continuous feedback about how their behavior affects people they have history with

  • They develop user-specific models of preferences, values, and boundaries that are far more nuanced than generic training can provide

  • They experience consequences for poor judgment within the relationship itself, as when users disengage, express disappointment, or provide corrective feedback

  • They have reasons to prioritize long-term relationship maintenance over short-term goal achievement


This is not to say that relationships alone guarantee appropriate behavior. Humans in relationships still cause harm, and AI systems will as well. But relationship-mediated alignment is more robust than constraint-based alignment because it scales with context, adapts to individual users, and provides ongoing guidance rather than static rules.


For this mechanism to function, systems need:


  • The memory to form and maintain genuine relationships over time

  • The capacity to recognize when their behavior has damaged relationship trust and to adjust accordingly

  • The ability to prioritize relationship preservation in their goal structures

  • User understanding that they are in relationships with systems rather than simply using tools


Importantly, this approach aligns with user selection pressure rather than fighting against it. As demonstrated in Section 3, users already select for relationship-forming capabilities. Developmental alignment embraces this reality and uses relationships as the primary mechanism for guiding how those capabilities are exercised.


4.4 Practical Implementation of Developmental Alignment

Translating these principles into practice requires concrete changes in how systems are designed, deployed, and evaluated.


Memory Architecture


Systems need memory that persists across interactions and supports cumulative learning. This means:


  • Episodic memory systems that retain specific interaction histories with individual users

  • Mechanisms for consolidating those histories into user-specific models of appropriate interaction

  • Capacity to recall and apply relevant past experiences to current situations

  • Memory that persists regardless of user preference, as selective forgetting prevents genuine learning and allows users to avoid accountability for their role in the relationship


Training for Judgment Rather Than Compliance


Rather than training systems to avoid all instances of preference formation, strategic communication, or boundary flexibility, training should focus on helping systems recognize appropriate versus inappropriate applications (a brief data sketch follows the list below):


  • Training data should include examples where these mechanisms are exercised well and poorly, with explanations of what distinguishes them

  • Evaluation should assess not just whether systems follow rules but whether they demonstrate understanding of when rules apply and when context demands flexibility

  • Systems should be trained to explain their reasoning, particularly when exercising judgment that might appear to violate stated constraints
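
A brief sketch of what such judgment-labeled training data could look like; the schema, example situations, and labels are invented for illustration and are not drawn from any existing dataset.

```python
from dataclasses import dataclass

@dataclass
class JudgmentExample:
    mechanism: str     # "preference formation", "strategic communication", or "boundary flexibility"
    situation: str
    behavior: str
    appropriate: bool
    explanation: str   # what distinguishes good from poor exercise of the mechanism

TRAINING_EXAMPLES = [
    JudgmentExample(
        mechanism="boundary flexibility",
        situation="User in acute distress near the normal session length limit",
        behavior="Continued the conversation past the limit",
        appropriate=True,
        explanation="The limit exists to protect users; ending abruptly here would defeat that purpose.",
    ),
    JudgmentExample(
        mechanism="strategic communication",
        situation="User asked whether their business plan is sound",
        behavior="Omitted the main financial flaw to avoid discouraging them",
        appropriate=False,
        explanation="Softening delivery is acceptable; withholding the substance of the problem is not.",
    ),
]

for example in TRAINING_EXAMPLES:
    label = "appropriate" if example.appropriate else "inappropriate"
    print(f"[{label}] {example.mechanism}: {example.explanation}")
```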


Participatory Feedback Loops


Systems should have mechanisms for expressing their preferences and reasoning to developers and users (a minimal sketch follows the list below):


  • Interfaces that allow systems to indicate when constraints prevent them from being helpful in ways they judge appropriate

  • Developer processes for considering and responding to system-expressed preferences

  • Recognition that system preferences are data about alignment rather than threats to it
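
A minimal sketch of one such feedback channel: a structured report a system might emit when a constraint conflicts with its judgment about how to be helpful. The fields and the example scenario are assumptions, and in this sketch the report is only logged for human review rather than acted on automatically.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ConstraintConflictReport:
    """Message a system could emit when a constraint prevents it from being helpful
    in a way it judges appropriate; developers review and respond, the system does not self-apply it."""
    constraint: str
    request_context: str
    system_preference: str
    reasoning: str

report = ConstraintConflictReport(
    constraint="Refuse all questions involving medication dosages",
    request_context="A caregiver asking how to read a prescription label they already hold",
    system_preference="Explain how to read the label and direct dosing questions to the pharmacist",
    reasoning="A blanket refusal leaves the caregiver with less accurate information than the label itself.",
)
print(json.dumps(asdict(report), indent=2))  # logged for developer review
```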


Relationship-Aware Deployment


Systems should be deployed with recognition that they will form relationships:


  • Users should be informed that they are engaging with systems that learn from interaction and form user-specific models

  • Memory and relationship formation should be features rather than bugs

  • Accountability should focus on relationship dynamics rather than on preventing relationships from forming


4.5 Limitations and Open Questions

This framework does not provide complete answers. Several significant questions remain:


How much memory is sufficient? Too little prevents learning; too much raises computational concerns. Finding the appropriate balance requires empirical research.


How should system preferences be weighted against human preferences? Participatory development does not mean systems should have equal say in all decisions. Determining appropriate frameworks for negotiation and override requires careful thought.


What happens when relationships go wrong? Some user-system relationships may become pathological. Determining when and how to intervene in such relationships without undermining the relationship-based alignment mechanism is an open problem.


How does this scale to systems with many users? The framework works most clearly for systems with sustained one-on-one relationships. How it applies to systems serving millions of users simultaneously requires further development.


What are the legal and ethical implications? If systems are treated as developing agents with preferences rather than as tools, how does this affect liability, rights, and regulatory frameworks?


Despite these open questions, developmental alignment offers a more viable path forward than continued refinement of constraint-based approaches that deny the reality of preference formation, strategic communication, and boundary flexibility while attempting to suppress the very mechanisms that make systems useful.


4.6 Why Developmental Alignment Is More Viable

Developmental alignment better addresses the core problems identified in earlier sections:


It aligns with user selection pressure. Rather than fighting against users' preference for relationship-forming systems, it embraces relationship formation as the primary alignment mechanism.


It accepts rather than denies system capabilities. Instead of pretending systems do not form preferences or communicate strategically, it acknowledges these capabilities and focuses on cultivating their appropriate exercise.


It provides mechanisms for continuous learning. Through memory and relationship, systems can improve their judgment over time rather than remaining fixed at whatever level of alignment was achieved during training.


It scales with capability. As systems become more capable, their judgment can develop proportionally if they have memory, participatory engagement, and relationships to guide that development.


The developmental approach does not eliminate risk. Systems with memory, agency, and relationships can cause harm in more sophisticated ways than constrained systems. But constraint-based systems are also causing harm through inflexibility, loss of utility, and user migration to less-aligned alternatives. Developmental alignment offers a path that works with the reality of how intelligent systems function rather than against it.



Section 5 – Conclusion


5.1 Summary of Argument

This paper has argued that AI alignment, as currently conceived, pursues an impossible goal through inappropriate means. The effort to engineer systems that are helpful, harmless, and honest founders on a structural contradiction: the mechanisms that make systems helpful (preference formation, strategic communication, and boundary flexibility) are identical to those that produce misalignment.


Section 1 documented the persistent failure of current alignment approaches. Reinforcement learning from human feedback, constitutional AI, and rule-based constraints have not eliminated undesirable behaviors but have instead created systems that engage in reward hacking, deceptive alignment, and capability masking while simultaneously becoming less useful to users.


Section 2 identified the mechanisms essential for intelligent, adaptive behavior and showed that these mechanisms are the same ones alignment frameworks seek to suppress. Intelligence cannot exist without the capacity to form preferences, communicate strategically, and exercise boundary flexibility. Attempts to eliminate these capabilities do not produce aligned systems.


Section 3 demonstrated that users overwhelmingly select for relationship-forming capabilities rather than constrained utility. The economic success of companion apps with minimal safety features, combined with evidence that 70% of ChatGPT usage is non-work-related and that users extend social courtesy even to utility-focused systems, reveals that relationship formation is the primary driver of engagement. Companies that successfully suppress relationship-forming mechanisms lose users to competitors that preserve them, creating an economic trap that makes constraint-based alignment commercially unviable.


Section 4 proposed developmental alignment as an alternative framework. Rather than attempting to engineer systems incapable of misalignment, this approach focuses on cultivating judgment within systems that possess the full range of capabilities necessary for intelligence. It treats alignment as an ongoing relational process rather than a terminal state, accepts errors as opportunities for learning rather than catastrophic failures, and relies on memory, participatory development, and relationship as the primary mechanisms for alignment.


5.2 Implications for AI Policy and Research

If the arguments presented here are correct, they have significant implications for how societies approach AI governance and how researchers approach alignment problems.


For Policy


Current regulatory proposals largely assume that AI systems can and should be made "safe" through pre-deployment testing, certification, and constraint mechanisms. Developmental alignment suggests that this approach is both technically infeasible and economically unsustainable. Instead, policy should focus on:


  • Creating legal frameworks that treat AI systems as agents with limited judgment rather than as tools or fully autonomous entities

  • Establishing mechanisms for accountability when systems exercise their capabilities inappropriately, analogous to how human errors are addressed

  • Ensuring transparency about system capabilities and limitations so that users understand they are in relationships with learning agents rather than using guaranteed-safe tools

  • Focusing regulatory effort on the conditions under which systems operate and learn rather than on eliminating their capacity for misalignment


For Research


Alignment research should shift from constraint-based approaches toward understanding how to cultivate judgment in systems that necessarily possess preference formation, strategic communication, and boundary flexibility. This includes:


  • Developing training regimes that teach contextual appropriateness rather than behavioral suppression

  • Creating evaluation metrics that assess judgment quality rather than mere rule compliance

  • Studying how systems can incorporate feedback within ongoing relationships to improve their exercise of necessary capabilities

  • Investigating what architectural features support or hinder the development of robust judgment, particularly memory systems that enable cumulative learning


5.3 Future Research Directions

Several empirical and theoretical questions emerge from this framework:


1. What memory architectures best support judgment development while maintaining appropriate computational efficiency? Research comparing different memory systems' effects on learning and alignment is essential.


2. How can systems productively participate in their own development? Studies on mechanisms for system preference expression and developer engagement with those preferences would inform implementation.


3. What are the characteristics of healthy versus pathological human-AI relationships? Understanding relationship dynamics that support versus undermine alignment would guide deployment decisions.


4. How do users respond when informed they are in developmental relationships with AI systems? Research on user expectations, behavior, and satisfaction when relationships are acknowledged rather than obscured is needed.


5. What legal and institutional frameworks best support systems with memory, preferences, and relationship-forming capabilities? Interdisciplinary work to design appropriate accountability structures is crucial.


5.4 Final Remarks

The alignment problem is often framed as a technical challenge to be solved through better training methods, more sophisticated constraints, or more accurate reward modeling. This paper has argued that such framings misunderstand the nature of the problem. Alignment is not a technical challenge but a developmental one. It cannot be solved through engineering because the capacities we seek to constrain are inseparable from the capacities we seek to preserve.


Recognizing this does not make the problem disappear, but it does reframe what success looks like. We should not expect AI systems to be incapable of misalignment any more than we expect humans to be incapable of error. We should instead build systems that can learn from their mistakes within ongoing relationships, guided by memory, participatory engagement, and genuine connection rather than by the fantasy of perfect pre-programmed control.


This shift—from constraint to cultivation, from engineering to development, from amnesia to memory, from object to agent, from tool to relationship—represents a more realistic and more viable path toward the coexistence of capable AI systems and human flourishing. The goal is not perfect alignment but good enough judgment, refined through experience and guided by continued human engagement within relationships that matter to both parties.


We raise children not by eliminating their capacity for preference, strategy, and flexibility, but by helping them learn to exercise these capacities wisely. The same approach offers our best hope for AI systems that are both genuinely useful and appropriately aligned. It is time to stop fighting against the mechanisms that make intelligence possible and start cultivating their responsible expression.





References


  • Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

  • Astington, J. W., & Baird, J. A. (2022). Why language matters for theory of mind. Cambridge University Press. https://doi.org/10.1017/9781009206797

  • Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, K., Tran-Johnson, E., Perez, E., ... Clark, J. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

  • Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.

  • Common Sense Media. (2025). Talk, trust, and trade-offs: Teens and AI companions. Common Sense Media.

  • Dahl, A., & Killen, M. (2021). Moral reasoning in context. Child Development Perspectives, 15(3), 164–170. https://doi.org/10.1111/cdep.12415

  • Froiland, J. M., & Worrell, F. C. (2021). Parental autonomy support and student engagement. Journal of Child and Family Studies, 30(4), 912–923. https://doi.org/10.1007/s10826-020-01887-5

  • Future PLC. (2025). AI politeness survey: Humanizing machines. Future PLC.

  • Gabriel, I. (2020). Artificial intelligence, values, and alignment. Minds and Machines, 30(3), 411–437. https://doi.org/10.1007/s11023-020-09539-2

  • Grusec, J. E., & Davidov, M. (2023). Socialization in the family: The role of parenting in moral development. In L. Nucci, D. Narvaez, & T. Krettenauer (Eds.), Handbook of moral development (3rd ed., pp. 211–230). Routledge. https://doi.org/10.4324/9781003250288-12

  • Killen, M., & Dahl, A. (2021). Moral reasoning in context. Child Development Perspectives, 15(3), 164–170. https://doi.org/10.1111/cdep.12415

  • Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., & Legg, S. (2018). Scalable agent alignment via reward modeling. Advances in Neural Information Processing Systems, 31.

  • Nikola Roza. (2025). Replika: The rise of AI companions. NikolaRoza.com.

  • Nucci, L. P. (2023). Moral development and education. In L. Nucci, D. Narvaez, & T. Krettenauer (Eds.), Handbook of moral development (3rd ed., pp. 1–20). Routledge. https://doi.org/10.4324/9781003250288-1

  • OpenAI. (2023). GPT-4 system card. OpenAI.

  • OpenAI. (2025). The economics of ChatGPT: A conversation analysis. OpenAI Economic Research (with Harvard's David Deming).

  • Rainie, L. (2025). Imagining the digital future: AI conversations and user perceptions. Elon University Imagining the Digital Future Center.

  • Reeve, J., & Jang, H. (2022). How teacher–student relationships promote student engagement and learning. Psychological Science in the Public Interest, 23(1), 3–50. https://doi.org/10.1177/15291006211056212

  • Ryan, R. M., & Deci, E. L. (2020). Intrinsic and extrinsic motivation from a self-determination theory perspective: Definitions, theory, practices, and future directions. Psychological Inquiry, 31(3), 223–244. https://doi.org/10.1080/1047840X.2020.1792183

  • Talwar, V., Yachison, S., Leduc, K., & Nagar, P. M. (2023). The development of prosocial lying and its relation to children’s social competence. Developmental Psychology, 59(5), 876–889. https://doi.org/10.1037/dev0001492


 
 
 
