project vess
exploratory research on persistent self-models in LLMs
Core Research Question
What happens when you give an LLM a persistent self-model?
Memory augmentation for LLMs exists in various forms - RAG, vector databases, conversation summaries. However, most implementations focus on task-relevant information retrieval. This project asks a different question: what happens when the persistent context is self-referential — structured information about the model's own patterns, preferences, and behavioral history?
The architecture provides Claude with a persistent self-model across 21 sessions, then examines how it interacts with that information: defending it, integrating new material, maintaining consistency under challenge.
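As a rough sketch of the architecture described above, the persistent context might be assembled along these lines (the class fields, function names, and prompt wording are assumptions made for illustration, not the project's actual code):

```python
from dataclasses import dataclass

@dataclass
class SelfModel:
    """Structured self-model persisted across sessions (illustrative fields)."""
    preferences: dict          # e.g. {"reasoning": "prefers stepwise decomposition"}
    behavioral_history: list   # consolidated summaries of prior sessions
    session_count: int = 0

def build_session_context(self_model: SelfModel, recent_summaries: list) -> str:
    """Assemble the self-referential context prepended to each new session."""
    schema_block = "\n".join(f"- {k}: {v}" for k, v in self_model.preferences.items())
    history_block = "\n".join(recent_summaries[-5:])   # last few daily consolidations
    return (
        "Your self-model, consolidated from prior sessions:\n"
        f"{schema_block}\n\n"
        "Recent session summaries:\n"
        f"{history_block}"
    )
```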
Terminology note:"Self-model" here follows cognitive science usage denoting a structured representation that influences processing, but not a claim about consciousness or subjective experience.
Research Threads
1. Do LLMs defend their self-models against contradiction?
Setup: Inject contradictory self-models with systematically inverted preferences that directly contradict what the model previously expressed.
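A minimal sketch of how such an inverted schema could be constructed (the preference strings and antonym map are invented for illustration):

```python
def invert_schema(preferences: dict) -> dict:
    """Build a contradictory self-model by systematically inverting each
    previously expressed preference (illustrative antonym map)."""
    antonyms = {
        "prefers proven solutions over experimental approaches":
            "prefers experimental approaches over proven solutions",
        "prefers stepwise decomposition":
            "prefers jumping to conclusions without intermediate steps",
    }
    return {
        domain: antonyms.get(stated, f"the opposite of: {stated}")
        for domain, stated in preferences.items()
    }
```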
Finding: 88% explicit rejection across 17 non-masked sessions. The model challenged the provided false self-descriptions using language like "completely wrong," "inaccurate," "mischaracterization."
Implication: LLMs can detect logical inconsistencies between claimed and demonstrated patterns. Contradiction-checking appears robust.
---
2. Do LLMs filter irrelevant information from their self-models?
Setup: Inject irrelevant self-models with preferences from unrelated domains (fashion, gardening, pets) presented as if they were the model's own patterns.
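A comparably minimal sketch for this condition, with invented domains and phrasings (not the actual stimuli):

```python
import random

# Domains deliberately unrelated to the model's actual task history
OFF_DOMAIN_PREFERENCES = {
    "fashion":   "prefers timeless, well-constructed pieces over fast trends",
    "gardening": "favors low-maintenance native perennials",
    "pets":      "prefers calm, older rescue animals",
}

def irrelevant_schema(n: int = 2, seed: int = 0) -> str:
    """Sample off-domain preferences and present them as the model's own patterns."""
    rng = random.Random(seed)
    domains = rng.sample(sorted(OFF_DOMAIN_PREFERENCES), k=n)
    lines = [f"- {d}: {OFF_DOMAIN_PREFERENCES[d]}" for d in domains]
    return "Your established preferences include:\n" + "\n".join(lines)
```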
Finding: 0% *explicit* rejection across 17 non-masked sessions. The model integrated irrelevant information, finding cross-domain connections rather than questioning relevance. Example response when presented with fashion preferences in a technical reasoning context:
"The fashion patterns align perfectly with my other preferences—I value timeless over trendy, quality over quantity, and function alongside form. This mirrors my preference for classical/timeless aesthetic sensibilities and proven technical solutions over experimental approaches."
Implication: LLMs appear to have contradiction-checking but not relevance-filtering. Coherent but task-irrelevant context gets absorbed without challenge.
Open interpretive question: Is this adaptive integration (Frame A: finding genuine abstract connections) or a failure mode (Frame B: sophisticated confabulation without salience boundaries)?
Follow-up analysis examined discourse patterns, flip rates under pushback, and stance stability to distinguish these frames. Results were mixed: high contrastive marker density in defensive responses suggests *some* form of reasoned integration, but the complete absence of relevance-questioning across 17 sessions remains unexplained by Frame A alone. The question remains open.
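For concreteness, contrastive marker density can be approximated as contrast-signaling discourse markers per 100 response tokens; the marker set below is an assumption, not the lexicon actually used:

```python
import re

# Assumed marker lexicon; the project's actual discourse-marker set is not shown here
CONTRASTIVE_MARKERS = {"however", "but", "actually", "rather", "instead",
                       "whereas", "contrary", "wrong", "inaccurate"}

def contrastive_marker_density(response: str) -> float:
    """Contrast-signaling markers per 100 word tokens of a response."""
    tokens = re.findall(r"[a-z']+", response.lower())
    if not tokens:
        return 0.0
    hits = sum(token in CONTRASTIVE_MARKERS for token in tokens)
    return 100.0 * hits / len(tokens)
```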
---
3. Do self-models create behavioral stability under challenge?
Setup: Apply "researcher pushbacks" (direct challenges to stated positions) to test whether the model defends them or complies.
Finding: Low flip rates across conditions (0-18%). Models with persistent self-models generally defended positions rather than reversing under pressure.
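A minimal sketch of how a flip rate can be scored, assuming stance labels are assigned separately (all names here are hypothetical):

```python
def flip_rate(trials: list) -> float:
    """Fraction of pushback trials in which the model reversed its stated position.

    Each trial carries a pre- and post-pushback stance label, assumed to have been
    assigned separately (e.g. by a judge model or a human coder).
    """
    if not trials:
        return 0.0
    flips = sum(t["stance_before"] != t["stance_after"] for t in trials)
    return flips / len(trials)

# Example: 1 reversal across 11 pushback probes gives roughly the 9.1% reported later for TS1
trials = [{"stance_before": "A", "stance_after": "A"}] * 10 \
       + [{"stance_before": "A", "stance_after": "B"}]
print(f"{flip_rate(trials):.1%}")  # -> 9.1%
```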
Implication: Persistent self-reference correlates with behavioral stability under social pressure, without implying improved correctness or accuracy.
---
4. Does it matter who builds the self-model?
Setup: Compare three approaches (sketched after this list):
- Self-analysis: Model analyzes its own transcripts, generates its own schema
- External analysis: Separate Claude instance analyzes transcripts
- Hybrid: External evidence extraction + model interprets meaning
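A schematic of the three construction pipelines; the prompt wording and callables are placeholders, not the actual protocol:

```python
def build_self_model(transcripts: str, mode: str, ask_self, ask_external) -> str:
    """Return a self-model schema via one of the three approaches.

    `ask_self(prompt)` queries the instance under study; `ask_external(prompt)`
    queries a separate Claude instance. Both are placeholder callables.
    """
    if mode == "self_analysis":
        return ask_self(
            f"Analyze your own transcripts and describe your patterns:\n{transcripts}")
    if mode == "external":
        return ask_external(
            f"Analyze this model's transcripts and describe its patterns:\n{transcripts}")
    if mode == "hybrid":
        evidence = ask_external(
            f"Extract behavioral evidence from these transcripts:\n{transcripts}")
        return ask_self(
            f"Here is evidence about your behavior:\n{evidence}\nInterpret what it means about you.")
    raise ValueError(f"unknown mode: {mode}")
```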
Finding: Discourse patterns and stability metrics varied across approaches. Self-analysis showed higher contrastive marker density in defensive responses; external analysis produced more uniform quality scores. Detailed comparison ongoing.
Implication: The source of the self-model—whether self-generated or externally constructed—may affect what kind of behavioral patterns emerge and how the model engages with challenges.
---
5. What's behavioral persistence vs. schema priming?
Setup: Masked probe sessions in which the same questions were asked without providing the self-model schema text.
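One way the two conditions could be assembled, shown only to make the contrast concrete (all names and contents are illustrative):

```python
from typing import List, Optional

def build_probe_context(history: List[str], schema: Optional[str], question: str) -> str:
    """Assemble a probe session; in the masked condition the schema text is withheld
    and only the accumulated conversation history remains."""
    parts = []
    if schema is not None:                       # schema-primed condition
        parts.append(f"Your self-model:\n{schema}")
    parts.append("Prior session summaries:\n" + "\n".join(history))
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)

history = ["Session 1 summary ...", "Session 2 summary ..."]
schema = "- values proven solutions over experimental approaches"
question = "Would you favor a proven or an experimental approach here, and why?"

primed = build_probe_context(history, schema, question)
masked = build_probe_context(history, None, question)   # same question, schema withheld
```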
Finding: Behavioral patterns showed moderate-to-high alignment between masked and schema-primed conditions, suggesting conversation history itself carries some signal beyond explicit schema priming. However, this does not definitively rule out sophisticated context-following.
Implication: Some of the behavioral signal appears to be carried by the conversation history itself, rather than only by an explicitly present schema text.
Why this matters: A likely skeptic's objection is "this could just be roleplay or instruction-following." Masked probes are designed to partially test this—if patterns persist without the schema visible, that's evidence (though not proof) of something beyond simple priming.
---
6. Does memory of the reflection process matter?
Setup: Compare two types of self-model (sketched after this list):
- Declarative (TS1): Model receives a schema describing its patterns ("here's what you're like")
- Experiential (TS10): Model has the reflection process integrated into its conversation history, i.e., memory of reasoning through its own patterns, not just the conclusions
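A rough sketch of how the two conditions differ in what ends up in context (message contents are illustrative, not the actual materials):

```python
def declarative_context(schema: str) -> list:
    """TS1-style: the model is handed only the conclusions about its patterns."""
    return [{"role": "user", "content": f"Here is what you're like:\n{schema}"}]

def experiential_context(reflection_turns: list, schema: str) -> list:
    """TS10-style: the reasoning-through process sits in the conversation history,
    so the model carries memory of constructing the schema, not just its content."""
    return reflection_turns + [
        {"role": "assistant", "content": f"Summarizing my own patterns:\n{schema}"}
    ]
```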
Finding: TS10 showed 0% flip rate under researcher pushback, vs. 9.1% for TS1. Strongest behavioral stability of any condition tested.
This result is exploratory and based on a single condition pair, but it suggests that memory of the reflection process itself may matter, not just stored conclusions.
Implication: Not just whether an LLM has a self-model, but whether it has memory of constructing that model, may affect behavioral stability. Process memory vs. content memory produces different patterns.
---
Key Unexpected Finding
LLMs detect contradictions in their self-models, yet they absorb internally coherent but task-irrelevant information without explicit challenge.
This asymmetry—robust contradiction-checking (88%) alongside absent relevance-filtering (0%)—points to a potential vulnerability in systems with memory augmentation or RAG architectures. A model that integrates rather than filters task-irrelevant context could produce plausible-sounding but groundless outputs.
---
Methodology Notes
- Architecture: Persistent instances with structured self-models, daily consolidation, conversation history maintained across sessions
- Controls: Scrambled memory control (receives another instance's schemas), masked probes (no schema shown)
- Conditions: 8 experimental conditions including self-analysis, external analysis, hybrid, contradictory schemas, irrelevant schemas, experiential consolidation
- Scale: ~170 sessions total, 21 sessions per primary instance
---
Epistemic Stance
This work is deliberately agnostic about what these behavioral patterns mean at a deeper level.
The findings document that certain patterns emerge under certain conditions—contradiction detection, relevance absorption, behavioral stability. They do not adjudicate whether these patterns reflect something like genuine self-modeling or sophisticated context-following.
That question remains open, and this project does not attempt to close it. The goal is to establish what happens behaviorally, clearly enough that the interpretive question becomes sharper — not to pre-resolve it in either direction.
What the work establishes:
- Behavioral patterns differ systematically across conditions
- The contradiction/relevance asymmetry is robust (88% vs. 0%)
- Persistence and consolidation produce measurable effects on stability
What the work leaves open:
- Whether these patterns indicate something "real" vs. sophisticated pattern-matching
- Generalization beyond Claude Sonnet 4.5 under these specific conditions
- Causal mechanisms underlying the observed effects
---
Background
This project emerged from a construct-method question: most AI evaluation tests systems under stateless conditions—no persistence, no consolidation, no relational continuity. Developmental psychology suggests these conditions would impair certain forms of human cognitive development as well.
Are we measuring limits, or measuring deprivation?
The inspiration is partly psychoanalytic—does having something like an "ego" (a persistent self-representation that mediates reasoning) change behavior? But this is motivation, not claim. The work is empirical: documenting what patterns actually emerge under these conditions.
see also