
arXiv Whitepaper Links

Here's a list of whitepapers I have read and excerpted for myself. I have focused on conversational AI and LLMs, with particular interest in LLM cognitive architectures, reasoning, agent frameworks, and uses of LLMs in recommendations, question-answering search, chatbots, and interaction. I approached these papers with an interest in opportunities for design as a framework for interaction with generative AI. I have all these whitepapers downloaded and highlighted, along with another thousand that I read but haven't excerpted.

Action Models
Large Action Models: From Inception to Implementation
CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization
Agent Workflow Memory
Learning Human-Object Interaction as Groups
Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction

Agentic Research
DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
Measuring Agents in Production
A Survey on Context-Aware Multi-Agent Systems: Techniques, Challenges and Future Directions
How Far Are We from Genuinely Useful Deep Research Agents?
DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents
Single-agent or Multi-agent Systems? Why Not Both?
Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
Agentic AI and the next intelligence explosion

AI Agents
Large Language Model-Brained GUI Agents: A Survey
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
Voyager: An Open-Ended Embodied Agent with Large Language Models
OpenAgents: An Open Platform for Language Agents in the Wild
Automated Design of Agentic Systems
Language as a Cognitive Tool to Imagine Goals in Curiosity-Driven Exploration
Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization
Octopus v4: Graph of language models
Octopus v2: On-device language model for super agent
DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents
Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents
Language Agents as Optimizable Graphs
Agents Are Not Enough
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
LIMI: Less is More for Agency
Agentic Reasoning for Large Language Models

AI Assistants and Personalization
Learning To Guide Human Experts Via Personalized Large Language Models
Personalization of Large Language Models: A Survey
Personalized Dialogue Generation with Persona-Adaptive Attention
Recent Trends in Personalized Dialogue Generation: A Review of Datasets, Methodologies, and Evaluations
Active Listening: Personalized Question Generation in Open-Domain Social Conversation with User Model Based Prompting
Enhancing personalized multi-turn dialogue with curiosity reward
Language Model Personalization via Reward Factorization
ARGS: Alignment as Reward-Guided Search
Personalized Language Modeling from Personalized Human Feedback
Can LLM be a Personalized Judge?
Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback
ALIGN: Prompt-based Attribute Alignment for Reliable, Responsible, and Personalized LLM-based Decision-Making
Predictive Preference Learning from Human Interventions
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Browser Use by LLMs
Web-Browsing LLMs Can Access Social Media Profiles and Infer User Demographics

Cognitive Models for Generative AI
Turning large language models into cognitive models
In-context learning agents are asymmetric belief updaters
Human-like Category Learning by Injecting Ecological Priors from Large Language Models into Neural Networks
Mastering Diverse Domains through World Models
Large language models can segment narrative events similarly to humans
CURIOUS: Intrinsically Motivated Modular Multi-Goal Reinforcement Learning
Autotelic Agents with Intrinsically Motivated Goal-Conditioned Reinforcement Learning: a Short Survey
Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning
Thinking LLMs: General Instruction Following with Thought Generation
Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving
Latent Skill Discovery for Chain-of-Thought Reasoning
Diffusion Models are Evolutionary Algorithms
Large Language Models Reflect the Ideology of their Creators
Training Large Language Models to Reason in a Continuous Latent Space
A polar coordinate system represents syntax in large language models
Understanding Hidden Computations in Chain-of-Thought Reasoning
Converging Paradigms: The Synergy of Symbolic and Connectionist AI in LLM-Empowered Autonomous Agents
ACE: Abstractions for Communicating Efficiently
Nested Attention: Semantic-aware Attention Values for Concept Personalization
Pushdown Layers: Encoding Recursive Structure in Transformer Language Models
LatentQA: Teaching LLMs to Decode Activations Into Natural Language
From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
Chain-of-Thought Reasoning Without Prompting
Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
Thought Communication in Multiagent Collaboration
Scalable Language Models with Posterior Inference of Latent Thought Vectors
Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
LLM Reasoning Is Latent, Not the Chain of Thought
Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs

Context Engineering
Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning
Extrapolation by Association: Length Generalization Transfer in Transformers
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Activation Steering for Chain-of-Thought Compression
Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?
Foundation Priors

Conversation Topics and Dialogical Agents
Diplomat: A Dialogue Dataset for Situated PragMATic Reasoning
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues
Dialogizer: Context-aware Conversational-QA Dataset Generation from Textual Sources
Dialog Inpainting: Turning Documents into Dialogs
Evaluating Emotional Nuances In Dialogue Summarization
MemoChat: Tuning LLMs to Use Memos for Consistent Long-Range Open-Domain Conversation
OpinionConv: Conversational Product Search with Grounded Opinions
Memory Sandbox: Transparent and Interactive Memory Management for Conversational Agents
DAPIE: Interactive Step-by-Step Explanatory Dialogues to Answer Children's Why and How Questions
Conversations Gone Awry: Detecting Early Signs of Conversational Failure
Characterizing Online Discussion Using Coarse Discourse Sequences
Prompted LLMs as Chatbot Modules for Long Open-domain Conversation
Doing Personal LAPS: LLM-Augmented Dialogue Construction for Personalized Multi-Session Conversational Search
Target-Guided Open-Domain Conversation
Lexical Entrainment for Conversational Systems
Intent-calibrated Self-training for Answer Selection in Open-domain Dialogues
Aspect-oriented Opinion Alignment Network for Aspect-Based Sentiment Classification
Empirical Study of Symmetrical Reasoning in Conversational Chatbots
Learning Retrieval Augmentation for Personalized Dialogue Generation
Modeling the Quality of Dialogical Explanations
“Mama Always Had a Way of Explaining Things So I Could Understand”: A Dialogue Corpus for Learning to Construct Explanations
Discursive Socratic Questioning: Evaluating the Faithfulness of Language Models’ Understanding of Discourse Relations
SDPO: Segment-Level Direct Preference Optimization for Social Agents
IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems
Training Dialogue Systems by AI Feedback for Improving Overall Dialogue Impression
LLMs Get Lost In Multi-Turn Conversation
Do Response Selection Models Really Know What's Next? Utterance Manipulation Strategies for Multi-turn Response Selection
Clarifying the Path to User Satisfaction: An Investigation into Clarification Usefulness
Aligning LLMs to Ask Good Questions A Case Study in Clinical Reasoning
Finding Common Ground: Using Large Language Models to Detect Agreement in Multi-Agent Decision Conferences
The Levers of Political Persuasion with Conversational AI
Talk Less, Interact Better: Evaluating In-context Conversational Adaptation in Multimodal LLMs
DiscussLLM: Teaching Large Language Models When to Speak
Proactive Conversational Agents with Inner Thoughts
Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation

Conversational Agents
TREC iKAT 2023: A Test Collection for Evaluating Conversational and Interactive Knowledge Assistants
Towards Human-centered Proactive Conversational Agents
Rethinking Conversational Agents in the Era of LLMs: Proactivity, Non-collaborativity, and Beyond
Proactive Conversational Agents in the Post-ChatGPT World
Hello Again! LLM-powered Personalized Agent for Long-term Dialogue
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework
Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration
CollabLLM: From Passive Responders to Active Collaborators
Conversational DNA: A New Visual Language for Understanding Dialogue Structure in Human and AI
Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning

Conversational Agents: Architecture and Structure
Towards Conversational Recommendation over Multi-Type Dialogs
OpenDialKG: Explainable Conversational Reasoning with Attention-based Walks over Knowledge Graphs
Proactive Human-Machine Conversation with Explicit Conversation Goals
A Survey on Proactive Dialogue Systems: Problems, Methods, and Prospects
ProsocialDialog: A Prosocial Backbone for Conversational Agents
Pro-Active Systems and Influenceable Users: Simulating Pro-Activity in Task-oriented Dialogues
Enhancing Large Language Model Induced Task-Oriented Dialogue Systems Through Look-Forward Motivated Goals
Prompting and Evaluating Large Language Models for Proactive Dialogues: Clarification, Target-guided, and Non-collaboration
Knowledge-enhanced Mixed-initiative Dialogue System for Emotional Support Conversations
Incorporating External Knowledge and Goal Guidance for LLM-based Conversational Recommender Systems
Alternating Recurrent Dialog Model with Large-scale Pre-trained Language Models
KETOD: Knowledge-Enriched Task-Oriented Dialogue
A Socially-Aware Conversational Recommender System for Personalized Recipe Recommendations
Cognitive Architectures for Language Agents
Insert-expansions For Tool-enabled Conversational Agents
Sequence Organization in Interaction: A Primer in Conversation Analysis
Ask an Expert: Leveraging Language Models to Improve Strategic Reasoning in Goal-Oriented Dialogue Models
Learning to Relate to Previous Turns in Conversational Search
Learning to Select the Relevant History Turns in Conversational Question Answering
DialogueReason: Rule-Based RL Sparks Dialogue Reasoning in LLMs
When to Act, When to Wait: Modeling Structural Trajectories for Intent Triggerability in Task-Oriented Dialogue
Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager
Planning Like Human: A Dual-process Framework for Dialogue Planning
Post-training for Efficient Communication via Convention Formation
Interaction Dynamics as a Reward Signal for LLMs
Quantitative Introspection in Language Models: Tracking Internal States Across Conversation

Cowriting and Collaboration with AI
GhostWriter: Augmenting Collaborative Human-AI Writing Experiences Through Personalization and Agency
DATATALES: Investigating the use of Large Language Models for Authoring Data-Driven Articles
TaleStream: Supporting Story Ideation with Trope Knowledge
“It Felt Like Having a Second Mind”: Investigating Human-AI Co-creativity in Prewriting with Large Language Models
DOC: Improving Long Story Coherence With Detailed Outline Control
Controlling Linguistic Style Aspects in Neural Language Generation
A Framework for Collaborating a Large Language Model Tool in Brainstorming for Triggering Creative Thoughts
AI-Powered (Finance) Scholarship
AI-Researcher: Autonomous Scientific Innovation
Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback
PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing

Data and Datasets for LLMs
Agent Learning via Early Experience
TarGEN: Targeted Data Generation with Large Language Models
Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models
Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning
Reasoning to Learn from Latent Thoughts

Decision Support AI Systems
Determinants of LLM-assisted Decision-Making
Building Decision Making Models Through Language Model Regime
DeLLMa: Decision Making Under Uncertainty with Large Language Models
Thinking Assistants: LLM-Based Conversational Assistants that Help Users Think By Asking rather than Answering
Enhancing AI-Assisted Group Decision Making through LLM-Powered Devil's Advocate
Could you be wrong: Debiasing LLMs using a metacognitive prompt for improving human decision making
AInsight: Augmenting Expert Decision-Making with On-the-Fly Insights Grounded in Historical Data
Can AI Explanations Make You Change Your Mind?

Deep Research with LLMs
Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research
AgentRxiv: Towards Collaborative Autonomous Research
A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications
From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches
Characterizing Deep Research: A Benchmark and Formal Definition
PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
Virtuous Machines: Towards Artificial General Science
Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They Say
RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity
Deep Research: A Systematic Survey

Design Frameworks for Generative AI
Design Principles for Generative AI Applications
See you soon again, chatbot? A design taxonomy to characterize user-chatbot relationships with different time horizons
Large Language Models for User Interest Journeys
Towards Algorithmic Experience
Building a Stronger CASA: Extending the Computers Are Social Actors Paradigm
Considering the Context to Build Theory in HCI, HRI, and HMC: Explicating Differences in Processes of Communication and
Social Responses to Media Technologies in the 21st Century: The Media are Social Actors Paradigm
An extended framework for characterizing social robots
Social Robots for Long-Term Interaction: A Survey
Virtual Assistance in Any Context
Proactive behavior in voice assistants: A systematic review and conceptual model
Opportunities for large language models and discourse in engineering design
Systematic synthesis of design prompts for large language models in conceptual design
Conceptual Design Generation Using Large Language Models
How well can large language models explain business processes?
Trust in Human-AI Interaction: Scoping Out Models, Measures, and Methods
Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties
Bridging the gulf of envisioning: Cognitive design challenges in llm interfaces
A Call for Collaborative Intelligence: Why Human-Agent Systems Should Precede AI Autonomy
ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs
UserBench: An Interactive Gym Environment for User-Centric Agents
Magentic-UI: Towards Human-in-the-loop Agentic Systems
Generative Interfaces for Language Models
Through the Lens of Human-Human Collaboration: A Configurable Research Platform for Exploring Human-Agent Collaboration
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

Diffusion Language Models
Large Language Diffusion Models
Diffusion-LM Improves Controllable Text Generation
DeepGesture: A conversational gesture synthesis system based on emotions and semantics
Deep Researcher with Test-Time Diffusion
Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
A Survey on Diffusion Language Models
Thinking Inside the Mask: In-Place Prompting in Diffusion LLMs
Diffusion Language Models Know the Answer Before Decoding

Discourse Theory and Generative AI
Attention, Intentions, And The Structure Of Discourse
Pretrained Language Models as Containers of the Discursive Knowledge
The Hermeneutics of Artificial Text
Theory of Knowledge Based on the Idea of the Discursive Space
Semantic Change Characterization with LLMs using Rhetorics
What is a Discourse Graph?
Linguistic Blind Spots of Large Language Models
DEEM: Dynamic Experienced Expert Modeling for Stance Detection
Inspecting and Editing Knowledge Representations in Language Models
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
Do LLMs produce texts with "human-like" lexical diversity?
Beyond the Surface: Probing the Ideological Depth of Large Language Models
SciTopic: Enhancing Topic Discovery in Scientific Literature through Advanced LLM

Domain Specialization and LLMs
Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey
Empowering Domain-Specific Language Models with Graph-Oriented Databases: A Paradigm Shift in Performance and Model Maintenance
Harnessing Business and Media Insights with Large Language Models
Scaling Expert Language Models with Unsupervised Domain Discovery
Domain-specific Question Answering with Hybrid Search
On The Persona-based Summarization of Domain-Specific Documents
The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
Injecting Domain-Specific Knowledge into Large Language Models: A Comprehensive Survey
Educating LLMs like Human Students: Structure-aware Injection of Domain Knowledge
SAILER: Structure-aware Pre-trained Language Model for Legal Case Retrieval
Using LLMs to Discover Legal Factors
Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains
Can Theoretical Physics Research Benefit from Language Agents?
Evaluating Large Language Models in Exercises of UML Class Diagram Modeling
Large Language Model-based Data Science Agent: A Survey
Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications
Do LLMs Truly Understand When a Precedent Is Overruled?

Education and Generative AI
Developing Effective Educational Chatbots with ChatGPT prompts: Insights from Preliminary Tests in a Case Study on Social Media
Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12
"Is ChatGPT a Better Explainer than My Professor?": Evaluating the Explanation Capabilities of LLMs in Conversation Compared to
GPT-4 as a Homework Tutor can Improve Student Engagement and Learning Outcomes
AI Meets the Classroom: When Does ChatGPT Harm Learning?
Beyond Answers: How LLMs Can Pursue Strategic Thinking in Education
Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task
Benchmarking the Pedagogical Knowledge of Large Language Models
A Comprehensive Review of AI-based Intelligent Tutoring Systems: Applications and Challenges
AI Assistance Reduces Persistence and Hurts Independent Performance

Emotions and Generative AI
ChatGPT Reads Your Tone and Responds Accordingly -- Until It Does Not -- Emotional Framing Induces Bias in LLM Outputs

Evaluations and LLMs
Perturbation CheckLists for Evaluating NLG Evaluation Metrics
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
KoLA: Carefully Benchmarking World Knowledge of Large Language Models
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools
Off-Policy Evaluation for Large Action Spaces via Policy Convolution
Large language models surpass human experts in predicting neuroscience results
A Looming Replication Crisis in Evaluating Behavior in Language Models? Evidence and Solutions
On the Reasoning Capacity of AI Models and How to Quantify It
Survey on Evaluation of LLM-based Agents
S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models
Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness?
Self-critiquing models for assisting human evaluators
Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge
From Human to Machine Psychology: A Conceptual Framework for Understanding Well-Being in Large Language Models
Assessing adaptive world models in machines with novel games
MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
Evaluation and Benchmarking of LLM Agents: A Survey
FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming
When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs
A Survey of Calibration Process for Black-Box LLMs
Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution
Complex Logical Instruction Generation
IFEvalCode: Controlled Code Generation
The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs
OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning
NoveltyBench: Evaluating Language Models for Humanlike Diversity
When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection

Evolutionary Approaches in AI
A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
R-Zero: Self-Evolving Reasoning LLM from Zero Data
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
Learning to Discover at Test Time
Self-distillation Enables Continual Learning
Large Language Model Agents Are Not Always Faithful Self-Evolvers
Hyperagents

Flaws in LLMs
Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions
A Survey on Concept Drift Adaptation
A comprehensive analysis of concept drift locality in data streams
Are Emergent Abilities of Large Language Models a Mirage?
Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
Investigating Gender Bias in Language Models Using Causal Mediation Analysis
Hallucination is Inevitable: An Innate Limitation of Large Language Models
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
Long-form Factuality in Large Language Models
Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words
Spurious Forgetting in Continual Learning of Language Models
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
Triggering Hallucinations in LLMs: A Quantitative Study of Prompt-Induced Hallucination in Large Language Models
Language Models Learn to Mislead Humans via RLHF
Extracting memorized pieces of (copyrighted) books from open-weight language models
Has the Creativity of Large-Language Models Peaked? An Analysis of Inter- and Intra-LLM Variability
Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models
Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities
Complexity
Model Organisms for Emergent Misalignment
A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers
Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models
Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
How Many Instructions Can LLMs Follow at Once?
Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
The Invisible Leash: Why RLVR May Not Escape Its Origin
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
A comprehensive taxonomy of hallucinations in Large Language Models
RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns
Large Language Model Reasoning Failures
Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems

Foundation Models
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)

Human-Centered AI Design
Goal Alignment in LLM-Based User Simulators for Conversational AI
Beyond Hallucinations: The Illusion of Understanding in Large Language Models
AI & Human Co-Improvement for Safer Co-Superintelligence
Building Machines that Learn and Think with People
Disambiguating Anthropomorphism and Anthropomimesis in Human-Robot Interaction

Inference-Time Scaling
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning
Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
The Art of Scaling Reinforcement Learning Compute for LLMs
Recursive Language Models

Knowledge Graphs and LLMs
Large Language Models and Knowledge Graphs: Opportunities and Challenges
Exploring Large Language Models for Knowledge Graph Completion
Enhancing Dialogue Generation via Dynamic Graph Knowledge Aggregation
SpreadsheetLLM: Encoding Spreadsheets for Large Language Models
Graph of Thoughts: Solving Elaborate Problems with Large Language Models
ChatGPT is not Enough: Enhancing Large Language Models with Knowledge Graphs for Fact-aware Language Modeling
Intrinsically Motivated Graph Exploration Using Network Theories of Human Curiosity
Informed Named Entity Recognition Decoding For Generative Language Models
Knowledge Graph Prompting for Multi-Document Question Answering
JointLK: Joint Reasoning with Language Models and Knowledge Graphs for Commonsense Question Answering
Unifying Large Language Models and Knowledge Graphs: A Roadmap
Schema-learning and rebinding as mechanisms of in-context learning and emergence
Think-on-Graph: Deep and Responsible Reasoning of Large Language Model with Knowledge Graph
Boosting Logical Reasoning in Large Language Models through a New Framework: The Graph of Thought
StructGPT: A General Framework for Large Language Model to Reason over Structured Data
From Local to Global: A Graph RAG Approach to Query-Focused Summarization
From Louvain to Leiden: guaranteeing well-connected communities
Can Language Models Solve Graph Problems in Natural Language?
Talk like a Graph: Encoding Graphs for Large Language Models
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher
Interesting Scientific Idea Generation Using Knowledge Graphs and LLMs: Evaluations with 100 Research Group Leaders
StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Causal Claims in Economics
NeuroQL: A Neuro-Symbolic Language and Dataset for Inter-Subjective Reasoning
Self-Organizing Graph Reasoning Evolves into a Critical State for Continuous Discovery Through Structural-Semantic Dynamics
An Automatic Graph Construction Framework based on Large Language Models for Recommendation
Attention Mechanisms Perspective: Exploring LLM Processing of Graph-Structured Data
CEO: Corpus-based Open-Domain Event Ontology Induction
Large Language Models Meet Knowledge Graphs for Question Answering: Synthesis and Opportunities
Modeling Code: Is Text All You Need?
Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs
UniGraph: Learning a Unified Cross-Domain Foundation Model for Text-Attributed Graphs
SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs
Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need
A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models
Grounding Multilingual Multimodal LLMs With Cultural Knowledge
Affordable AI Assistants with Knowledge Graph of Thoughts
SEAL: Self-Evolving Agentic Learning for Conversational Question Answering over Knowledge Graphs

Linguistics and NLP
Do large language models resemble humans in language use?
Grounding Gaps in Language Model Generations
What does it mean to understand language?
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Large Linguistic Models: Investigating LLMs' metalinguistic abilities

LLM Alignment
CONSCENDI: A Contrastive and Scenario-Guided Distillation Approach to Guardrail Models for Virtual Assistants
LIMA: Less Is More for Alignment
OpenAssistant Conversations - Democratizing Large Language Model Alignment
A Survey of Meta-Reinforcement Learning
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Simple Synthetic Data Reduces Sycophancy In Large Language Models
Using Natural Language for Reward Shaping in Reinforcement Learning
Better Alignment with Instruction Back-and-Forth Translation
Self-Alignment with Instruction Backtranslation
KTO: Model Alignment as Prospect Theoretic Optimization
Beyond Preferences in AI Alignment
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Why Do Some Language Models Fake Alignment While Others Don't?
Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report
Auditing language models for hidden objectives
TrustLLM: Trustworthiness in Large Language Models
Training language models to be warm and empathetic makes them less reliable and more sycophantic
LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring
TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
Stress Testing Deliberative Alignment for Anti-Scheming Training
Position: Towards Bidirectional Human-AI Alignment
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
Consistency Training Helps Stop Sycophancy and Jailbreaks
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Measuring Human Preferences in RLHF is a Social Science Problem

LLM Architecture
The Unreasonable Ineffectiveness of the Deeper Layers
Generative Models as a Complex Systems Science: How can we make sense of large language model behavior?
Progress Measures For Grokking Via Mechanistic Interpretability
Retrieval Head Mechanistically Explains Long-Context Factuality
Holy Grail 2.0: From Natural Language to Constraint Models
Ask-AC: An Initiative Advisor-in-the-Loop Actor-Critic Framework
What are the Goals of Distributional Semantics?
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Probing Structured Semantics Understanding and Generation of Language Models via Question Answering
Discovering Latent Concepts Learned in BERT
On the Binding Problem in Artificial Neural Networks
A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
Neural Assistant: Joint Action Prediction, Response Generation, and Latent Knowledge Reasoning
Deep Neural Network Approach for the Dialog State Tracking Challenge
Neural Approaches to Conversational AI
SParC: Cross-Domain Semantic Parsing in Context
Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning
TextGrad: Automatic “Differentiation” via Text
Leveraging Approximate Symbolic Models for Reinforcement Learning via Skill Diversity
Can Language Models Serve as Text-Based World Simulators?
System 1 vs. System 2 Thinking
Detecting hallucinations in large language models using semantic entropy
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
Faith and Fate: Limits of Transformers on Compositionality
Large Language Model Programs
Neurosymbolic AI- Why, What, and How
Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models
Are Emergent Abilities in Large Language Models just In-Context Learning?
UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity
Everything Everywhere All at Once: LLMs Can In-Context Learn Multiple Tasks in Superposition
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
Self-reinforcing cascades: A spreading model for beliefs or products of varying intensity or quality
Byte Latent Transformer: Patches Scale Better Than Tokens
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
Titans: Learning to Memorize at Test Time
All AI Models are Wrong, but Some are Optimal
Transformer2: Self-adaptive LLMs
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
A Survey on Large Language Models with some Insights on their Capabilities and Limitations
Neuro-Symbolic AI in 2024: A Systematic Review
Self-Improving Transformers Overcome Easy-to-Hard and Length Generalization Challenges
RL + Transformer = A General-Purpose Problem Solver
QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration
Language Modeling is Compression
Foundations of Large Language Models
The Vanishing Gradient Problem for Stiff Neural Differential Equations
DataComp-LM: In search of the next generation of training sets for language models
Scaling Laws for Neural Language Models
Beyond neural scaling laws: beating power law scaling via data pruning
On the Theoretical Limitations of Embedding-Based Retrieval
Looking beyond the next token
Thinking Augmented Pre-training

LLM Reasoning as Argumentation
LLM-based Rewriting of Inappropriate Argumentation using Reinforcement Learning from Machine Feedback
DR-HAI: Argumentation-based Dialectical Reconciliation in Human-AI Interactions
How susceptible are LLMs to Logical Fallacies?
The Argument Reasoning Comprehension Task: Identification and Reconstruction of Implicit Warrants
A Hybrid Human-AI Approach for Argument Map Creation From Transcripts
A Hybrid Intelligence Method for Argument Mining
A Robustness Evaluation Framework for Argument Mining
Argument Quality Assessment in the Age of Instruction-Following Large Language Models
Modeling Appropriate Language in Argumentation
Rhetoric, Logic, and Dialectic: Advancing Theory-based Argument Quality Assessment in Natural Language Processing
The Place of Emotion in Argument
Can Language Models Recognize Convincing Arguments?
Exploring the Potential of Large Language Models in Computational Argumentation
ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs
Debating with More Persuasive LLMs Leads to More Truthful Answers
Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying
Argumentative Large Language Models for Explainable and Contestable Decision-Making
Chain of Stance: Stance Detection with Large Language Models
Argument Summarization and its Evaluation in the Era of Large Language Models
Teaching Probabilistic Logical Reasoning to Transformers
Toward Reasonable Parrots: Why Large Language Models Should Argue with Us by Design
Can Large Language Models perform Relation-based Argument Mining?
Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making
On the Adaptive Psychological Persuasion of Large Language Models
The Earth is Flat because...: Investigating LLMs' Belief towards Misinformation via Persuasive Conversation
Reasoning Models Are More Easily Gaslighted Than You Think
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
Causal Sufficiency and Necessity Improves Chain-of-Thought Reasoning
Multi-Agent Collaborative Intelligence: Dual-Dial Control for Reliable LLM Reasoning
SocraSynth: Multi-LLM Reasoning with Conditional Statistics
Prompting Large Language Models With the Socratic Method
EVINCE: Optimizing Multi-LLM Dialogues Using Conditional Statistics and Information Theory
Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards

LLM Routing and Selection
MasRouter: Learning to Route LLMs for Multi-Agent Systems
RouteLLM: Learning to Route LLMs with Preference Data
Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing
Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing
Fast, Slow, and Tool-augmented Thinking for LLMs: A Review

Mechanistic Interpretability
Open Problems in Mechanistic Interpretability
Eliciting Latent Knowledge from Quirky Language Models
Representation Engineering: A Top-Down Approach to AI Transparency
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?
Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
Representation biases: will we achieve complete understanding by analyzing representations?
Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models
How new data permeates LLM knowledge and how to dilute it
Weight-sparse transformers have interpretable circuits
Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
Scaling can lead to compositional generalization
Break It Down: Evidence for Structural Compositionality in Neural Networks
Natural Emergent Misalignment from Reward Hacking in Production RL
Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference
Mechanistic Indicators of Understanding in Large Language Models
Large Language Models Report Subjective Experience Under Self-Referential Processing
Mechanisms of Introspective Awareness

Memory in LLMs
How much do language models memorize?
The Emotion-Memory Link: Do Memorability Annotations Matter for Intelligent Systems?
Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models
Efficient Nearest Neighbor Language Models
Generalization through Memorization: Nearest Neighbor Language Models
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
ComoRAG: A Cognitive-Inspired Memory-Organized RAG for Stateful Long Narrative Reasoning
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
Artifacts as Memory Beyond the Agent Boundary

Multi-Agent Architectures
AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges
AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs
Small Language Models are the Future of Agentic AI
Latent Collaboration in Multi-Agent Systems
Federation of Agents: A Semantics-Aware Communication Fabric for Large-Scale Agentic AI
Towards a Science of Scaling Agent Systems
A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows
Multi-agent cooperation through in-context co-player inference
Intelligent AI Delegation

Multi-Agent Models and Frameworks
MetaGPT: Meta Programming for Multi-Agent Collaborative Framework
Unleashing Cognitive Synergy In Large Language Models: A Task-solving Agent Through Multi-persona Self-collaboration
AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors in Agents
CGMI: Configurable General Multi-Agent Interaction Framework
Generative Agents: Interactive Simulacra of Human Behavior
Large Language Model based Multi-Agents: A Survey of Progress and Challenges
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
Apollo's Oracle: Retrieval-Augmented Reasoning in Multi-Agent Debates
A Domain Specific Modeling Language for Multiagent Systems
PEER: Expertizing Domain-Specific Tasks with a Multi-Agent Framework and Tuning Methods
Self-Adaptive Large Language Model (LLM)-Based Multiagent Systems
ProAgent: Building Proactive Cooperative Agents with Large Language Models
Adapting LLM Agents with Universal Feedback in Communication
The Fellowship of the LLMs: Multi-Agent Workflows for Synthetic Preference Optimization Dataset Generation
Agent-as-a-Judge: Evaluate Agents with Agents
Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence
Small LLMs Are Weak Tool Learners: A Multi-LLM Agent
Don't Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
Agent Laboratory: Using LLM Agents as Research Assistants
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
Agent-Centric Projection of Prompting Techniques and Implications for Synthetic Training Data for Large Language Models
MODS: Moderating a Mixture of Document Speakers to Summarize Debatable Queries in Document Collections
MARS: A Multi-Agent Framework Incorporating Socratic Guidance for Automated Prompt Optimization
MAPS: A Multi-Agent Framework Based on Big Seven Personality and Socratic Guidance for Multimodal Scientific Problem Solving
Building Cooperative Embodied Agents Modularly with Large Language Models
Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation
FlowReasoner: Reinforcing Query-Level Meta-Agents
Beyond Brainstorming: What Drives High-Quality Scientific Ideas? Lessons from Multi-Agent Collaboration
Learning "Partner-Aware" Collaborators in Multi-Party Collaboration

Multimodal AI
Large Multimodal Agents: A Survey
Mindstorms in Natural Language-Based Societies of Mind
Explainable Multimodal Emotion Reasoning
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Emerging Properties in Unified Multimodal Pretraining
Pixel-Level Reasoning Segmentation via Multi-turn Conversations
ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering
Pixels, Patterns, but No Poetry: To See The World like Humans
Aether Weaver: Multimodal Affective Narrative Co-Generation with Dynamic Scene Graphs
MLLM-CTBench: A Benchmark for Continual Instruction Tuning with Reasoning Process Diagnosis
Continual Instruction Tuning for Large Multimodal Models
Self-Rewarding Vision-Language Model via Reasoning Decomposition
The Demon is in Ambiguity: Revisiting Situation Recognition with Single Positive Multi-Label Learning
How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding

Natural Language Inference
Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions
LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High
(QA)2: Question Answering with Questionable Assumptions
Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds
Minds versus Machines: Rethinking Entailment Verification with Language Models
Explicit Inductive Inference using Large Language Models
Neutralizing Bias in LLM Reasoning using Entailment Graphs
Automatic Extraction of Metaphoric Analogies from Literary Texts: Task Formulation, Dataset Construction, and Evaluation
A ripple in time: a discontinuity in American history
Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models
Is Sarcasm Detection A Step-by-Step Reasoning Process in Large Language Models?
LLMs are Frequency Pattern Learners in Natural Language Inference
Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure
Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations

Novel LLM Architectures
The Future of AI: Exploring the Potential of Large Concept Models
Language Modeling by Language Models
Evolving Deeper LLM Thinking
The Serial Scaling Hypothesis
Energy-Based Transformers are Scalable Learners and Thinkers
AlphaGo Moment for Model Architecture Discovery
Jamba: A Hybrid Transformer-Mamba Language Model
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
LLMatic: Neural Architecture Search via Large Language Models and Quality Diversity Optimization
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Hierarchical Reasoning Model
Post-Completion Learning for Language Models
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
DeepNet: Scaling Transformers to 1,000 Layers
Beyond Turing: Memory-Amortized Inference as a Foundation for Cognitive Computation
Multi-Token Attention
Solving a Million-Step LLM Task with Zero Errors
Large Causal Models From Large Language Models
The Missing Layer of AGI: From Pattern Alchemy to Coordination Physics
The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

Personalization with LLMs
PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time
Generation
PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes
Adaptive Learning Systems: Personalized Curriculum Design Using LLM-Powered Analytics

Personas and Personality of Generative AI
Generative Agent Simulations of 1,000 People
CloChat: Understanding How People Customize, Interact, and Experience Personas in Large Language Models
Cognitive Effects in Large Language Models
EmotionPrompt: Leveraging Psychology for Large Language Models Enhancement via Emotional Stimulus
Do LLMs Possess a Personality? Making the MBTI Test an Amazing Evaluation for Large Language Models
PersonaPKT: Building Personalized Dialogue Agents via Parameter-efficient Knowledge Transfer
Improving Dialog Systems for Negotiation with Personality Modeling
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Understanding the Role of User Profile in the Personalization of Large Language Models
Large Language Models Can Infer Psychological Dispositions of Social Media Users
Using Large Language Models to Create AI Personas for Replication and Prediction of Media Effects: An Empirical Test of 133
Designing AI Personalities: Enhancing Human-Agent Interaction Through Thoughtful Persona Design
Proxona: Leveraging LLM-Driven Personas to Enhance Creators' Understanding of Their Audience
Open Models, Closed Minds? On Agents Capabilities in Mimicking Human Personalities through Open Large Language Models
Chamain: Harmonizing Character Persona Integrity with Domain-Adaptive Knowledge in Dialogue Generation
PersonaGym: Evaluating Persona Agents and LLMs
Two Tales of Persona in LLMs: A Survey of Role-Playing and Personalization
Unlocking Varied Perspectives: A Persona-Based Multi-Agent Framework with Debate-Driven Text Planning for Argument Generation
Beyond Discrete Personas: Personality Modeling Through Journal Intensive Conversations
PsyDT: Using LLMs to Construct the Digital Twin of Psychological Counselor with Personalized Counseling Style for Psychological
From Text to Emoji: How PEFT-Driven Personality Manipulation Unleashes the Emoji Potential in LLMs
PersLLM: A Personified Training Approach for Large Language Models
Training a Generally Curious Agent
Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning
Generating Proto-Personas through Prompt Engineering: A Case Study on Efficiency, Effectiveness and Empathy
Character is Destiny: Can Role-Playing Language Agents Make Persona-Driven Decisions?
PosterMate: Audience-driven Collaborative Persona Agents for Poster Design
Assessment of Personality Dimensions Across Situations Using Conversational Speech
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Therapy Training
Will I Sound Like Me? Improving Persona Consistency in Dialogues through Pragmatic Self-Consciousness
Building Persona Consistent Dialogue Agents with Offline Reinforcement Learning
LLM Generated Persona is a Promise with a Catch
Psychologically Enhanced AI Agents
Persona Generators: Generating Diverse Synthetic Personas at Scale

Philosophy and Subjectivity in LLMs
Language Models are Pragmatic Speakers
Are you in a Masquerade? Exploring the Behavior and Impact of Large Language Model Driven Social Bots in Online Social Networks
Talking About Large Language Models
ChatGPT: towards AI subjectivity
ChatGPT: deconstructing the debate and moving it forward
A sociotechnical perspective for the future of AI: narratives, inequalities, and human control
Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data
Find the Gap: AI, Responsible Agency and Vulnerability
The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs
Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom
When Large Language Models contradict humans? Large Language Models’ Sycophantic Behaviour
“Understanding AI”: Semantic Grounding in Large Language Models
Interpretation modeling: Social grounding of sentences by reasoning over their implicit moral judgments
The Vector Grounding Problem
Grounding ‘Grounding’ in NLP
A recipe for annotating grounded clarifications
We’re Afraid Language Models Aren’t Modeling Ambiguity
Theory of Mind abilities of Large Language Models in Human-Robot Interaction : An Illusion?
Polanyi’s Revenge and AI’s New Romance with Tacit Knowledge
Expedient Assistance and Consequential Misunderstanding: Envisioning an Operationalized Mutual Theory of Mind
Machine gaze in online behavioral targeting: The effects of algorithmic human likeness on social presence and social influence
Chatbot vs. Human: The Impact of Responsive Conversational Features on Users’ Responses to Chat Advisors
Machine ex machina: A Framework Decentering the Human in AI Design Praxis
Goals, Plans, and Action Models
Machine Psychology
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
Dissociating language and thought in large language models
Large Models of What? Mistaking Engineering Achievements for Human Linguistic Agency
AI Enters Public Discourse: A Habermasian Assessment Of The Moral Status Of Large Language Models
The Method of Critical AI Studies, A Propaedeutic
Simulacra as conscious exotica
Existential Conversations with Large Language Models: Content, Community, and Culture
Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses
Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models
Tell me about yourself: LLMs are aware of their learned behaviors
Propositional Interpretability in Artificial Intelligence
Self-reflecting Large Language Models: A Hegelian Dialectical Approach
Can Language Models Represent the Past without Anachronism?
Potemkin Understanding in Large Language Models
What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models
Meanings are like Onions: a Layered Approach to Metaphor Processing
Collaborative Rational Speech Act: Pragmatic Reasoning for Multi-Turn Dialog
Humans overrely on overconfident language models, across languages
Large Language Models Do Not Simulate Human Psychology
The Moral Turing Test: Evaluating Human-LLM Alignment in Moral Decision-Making
Can Machines Think Like Humans? A Behavioral Evaluation of LLM-Agents in Dictator Games
Hallucinating with AI: AI Psychosis as Distributed Delusions
What the F*ck Is Artificial General Intelligence?
Mathematical methods and human thought in the age of AI
Deflating Deflationism: A Critical Perspective on Debunking Arguments Against LLM Mentality

Prompts and Prompting with LLMs
Prefix-Tuning: Optimizing Continuous Prompts for Generation
From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models
UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation
Deep Language Networks: Joint Prompt Training of Stacked LLMs using Variational Inference
Attribute Controlled Dialogue Prompting
Leveraging Few-Shot Data Augmentation and Waterfall Prompting for Response Generation
Metacognitive Prompting Improves Understanding in Large Language Models
Re3: Generating Longer Stories With Recursive Reprompting and Revision
Instruction Induction: From Few Examples to Natural Language Task Descriptions
Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm
Do Prompt-Based Models Really Understand the Meaning of Their Prompts?
The Prompt Report: A Systematic Survey of Prompting Techniques
Conversational Prompt Engineering
Instance-adaptive Zero-shot Chain-of-Thought Prompting
Progressive-Hint Prompting Improves Reasoning in Large Language Models
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts
Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing
Large Language Models Are Human-level Prompt Engineers
Boosted Prompt Ensembles for Large Language Models
Learning To Retrieve Prompts for In-Context Learning
Decomposed Prompting: A Modular Approach for Solving Complex Tasks
Dynamic Prompting: A Unified Framework for Prompt Tuning
KiPT: Knowledge-injected Prompt Tuning for Event Detection
Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?
ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
Automatic Prompt Optimization with "Gradient Descent" and Beam Search
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
Ask, and it shall be given: Turing completeness of prompting
From Prompt Engineering to Prompt Science With Human in the Loop
Investigating task-specific prompts and sparse autoencoders for activation monitoring
What Makes a Good Natural Language Prompt?
A Survey on Prompt Tuning
When Prompts Go Wrong: Evaluating Code Model Robustness to Ambiguous, Contradictory, and Incomplete Task Descriptions
Test-time Prompt Intervention
Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem

Psychology and AI in Therapeutic Practices
Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation
Using Linguistic Synchrony to Evaluate Large Language Models for Cognitive Behavioral Therapy
Evaluating the Efficacy of Interactive Language Therapy Based on LLM for High-Functioning Autistic Adolescent Psychological
Challenges of Large Language Models for Mental Health Counseling
Psychotherapy AI Companion with Reinforcement Learning Recommendations and Interpretable Policy Dynamics
COMPASS: Computational Mapping of Patient-Therapist Alliance Strategies with Language Modeling
SupervisorBot: NLP-Annotated Real-Time Recommendations of Psychotherapy Treatment Strategies with Deep Reinforcement Learning
Neural Topic Modeling of Psychotherapy Sessions
Measuring Alliance and Symptom Severity in Psychotherapy Transcripts Using Bert Topic Modeling
A natural language processing approach reveals first-person pronoun usage and non-fluency as markers of therapeutic alliance in
Using Topic Models to Identify Clients’ Functioning Levels and Alliance Ruptures in Psychotherapy
Prospective evaluation of a clinical decision support system in psychological therapy
The Digital Therapeutic Alliance: Prospects and Considerations
The Digital Therapeutic Alliance and Human-Computer Interaction
PATIENT-Ψ: Using Large Language Models to Simulate Patients for Training Mental Health Professionals
Empowering Psychotherapy with Large Language Models: Cognitive Distortion Detection through Diagnosis of Thought Prompting
Detecting Cognitive Distortions from Patient-Therapist Interactions
Rethinking Large Language Models in Mental Health Applications
Evaluating the Therapeutic Alliance With a Free-Text CBT Conversational Agent (Wysa): A Mixed-Methods Study
AutoCBT: An Autonomous Multi-agent Framework for Cognitive Behavioral Therapy in Psychological Counseling
Towards Understanding Counseling Conversations: Domain Knowledge and Large Language Models
PsychAdapter: Adapting LLM Transformers to Reflect Traits, Personality and Mental Health
Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review
H2HTalk: Evaluating Large Language Models as Emotional Companion
Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers
The Incomplete Bridge: How AI Research (Mis)Engages with Psychology
From Five Dimensions to Many: Large Language Models as Precise and Interpretable Psychological Profilers
MoodAngels: A Retrieval-augmented Multi-agent Framework for Psychiatry Diagnosis
Understanding the Therapeutic Relationship between Counselors and Clients in Online Text-based Counseling using LLMs

Psychology and Empathy in Generative AI
Study: Large language models can’t effectively recognize users’ motivation, but can support behavior change for those ready to
Inducing Positive Perspectives with Text Reframing
Empathetic Persuasion: Reinforcing Empathy and Persuasiveness in Dialogue Systems
Topic Modeling in Embedding Spaces
Large Language Models Understand and Can be Enhanced by Emotional Stimuli
Human-AI Collaboration Enables More Empathic Conversations in Text-based Peer-to-Peer Mental Health Support
Empathy Through Multimodality in Conversational Interfaces
Computer says “No”: The Case Against Empathetic Conversational AI
A Taxonomy of Empathetic Questions in Social Dialogs
Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset
To Tell The Truth: Language of Deception and Language Models
ChatGPT Doesn’t Trust Chargers Fans: Guardrail Sensitivity in Context
Rise of Machine Agency: A Framework for Studying the Psychology of Human–AI Interaction (HAII)
RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents
Psyche-R1: Towards Reliable Psychological LLMs through Unified Empathy, Expertise, and Reasoning

Psychology of Chatbots and Conversational AI
Can an LLM-Powered Socially Assistive Robot Effectively and Safely Deliver Cognitive Behavioral Therapy? A Study With
Dialoging Resonance: How Users Perceive, Reciprocate and React to Chatbot’s Self-Disclosure in Conversational Recommendations
The Partner Modelling Questionnaire: A validated self-report measure of perceptions toward machines as dialogue partners
Modeling Interpersonal Linguistic Coordination in Conversations using Word Mover's Distance
IMBUE: Improving Interpersonal Effectiveness through Simulation and Just-in-time Feedback with Human-Language Model Interaction
Evidence of Human-Level Bonds Established With a Digital Conversational Agent: Cross-sectional, Retrospective Observational
Supporting Physical Activity Behavior Change with LLM-Based Conversational Agents
The Challenges in Designing a Prevention Chatbot for Eating Disorders: Observational Study
Towards Healthy AI: Large Language Models Need Therapists Too
Working Alliance Transformer for Psychotherapy Dialogue Classification
Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion
A Computational Framework for Behavioral Assessment of LLM Therapists
LLM-based Conversational AI Therapist for Daily Functioning Screening and Psychotherapeutic Intervention via Everyday Smart
Revolutionizing Mental Health Support: An Innovative Affective Mobile Framework for Dynamic, Proactive, and Context-Adaptive
Can robots do therapy?: Examining the efficacy of a CBT bot in comparison with other behavioral intervention technologies in
VCounselor: A Psychological Intervention Chat Agent Based on a Knowledge-Enhanced Large Language Model
From speaking like a person to being personal: The effects of personalized, regular interactions with conversational agents
Psychological, Relational, and Emotional Effects of Self-Disclosure After Conversations With a Chatbot
"My Boyfriend is AI": A Computational Analysis of Human-AI Companionship in Reddit's AI Community

Psychology of Users of AI
Understanding, explaining, and utilizing medical artificial intelligence
Comparing emotion feature extraction approaches for predicting depression and anxiety
Discourse-Level Representations can Improve Prediction of Degree of Anxiety
Humans learn to prefer trustworthy AI over human partners
Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal Bayesians
The LLM Fallacy: Misattribution in AI-Assisted Cognitive Workflows

Question-Answer Search and AI
Generator-Retriever-Generator: A Novel Approach to Open-domain Question Answering
Query Understanding in the Age of Large Language Models
Large Language Models Know Your Contextual Search Intent: A Prompting Framework for Conversational Search
Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games
No that's not what I meant: Handling Third Position Repair in Conversational Question Answering
Editing-Based SQL Query Generation for Cross-Domain Context-Dependent Questions
Cue-CoT: Chain-of-thought Prompting for Responding to In-depth Dialogue Questions with LLMs
Asking Clarifying Questions Based on Negative Feedback in Conversational Search
Topic Shift Detection for Mixed Initiative Response
Learning to Ask Critical Questions for Assisting Product Search
Learning to Ask Appropriate Questions in Conversational Recommendation
Structured and Natural Responses Co-generation for Conversational Search
Building and Evaluating Open-Domain Dialogue Corpora with Clarifying Questions
Abg-CoQA: Clarifying Ambiguity in Conversational Question Answering
Chain-of-Questions Training with Latent Answers for Robust Multistep Question Answering
LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering
Stream of Search (SoS): Learning to Search in Language
Knowledge Retrieval Based on Generative AI
Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models
Answer is All You Need: Instruction-following Text Embedding via Answering the Question
Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation
News Source Citing Patterns in AI Search Systems
The Consensus Game: Language Model Generation via Equilibrium Search

RAG Methods
Leveraging LLMs for KPIs Retrieval from Hybrid Long-Document: A Comprehensive Framework and Dataset
Context Tuning for Retrieval Augmented Generation
Precise Zero-Shot Dense Retrieval without Relevance Labels
Dense Retrieval Adaptation using Target Domain Description
Enhancing Performance on Seen and Unseen Dialogue Scenarios using Retrieval-Augmented End-to-End Task-Oriented System
Active Retrieval Augmented Generation
RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation
RAG Does Not Work for Enterprises
Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy
LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
Searching for Best Practices in Retrieval-Augmented Generation
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning
The Insanity of Relying on Vector Embeddings: Why RAG Fails
Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments
Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home
Chain-of-Retrieval Augmented Generation
DeepRAG: Thinking to Retrieval Step by Step for Large Language Models
DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation
Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains
Typed-RAG: Type-aware Multi-Aspect Decomposition for Non-Factoid Question Answering
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
You Don't Need Pre-built Graphs for RAG: Retrieval Augmented Generation with Adaptive Reasoning Structures
UR2: Unify RAG and Reasoning through Reinforcement Learning
RAG-Gym: Systematic Optimization of Language Agents for Retrieval-Augmented Generation
Retrieval-augmented reasoning with lean language models
Revisiting RAG Ensemble: A Theoretical and Mechanistic Analysis of Multi-RAG System Collaboration
RAG-R1: Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation
Query Rewriting for Retrieval-Augmented Large Language Models
DRAGIN: Dynamic Retrieval Augmented Generation based on the Information Needs of Large Language Models
Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models
Context Embeddings for Efficient Answer Generation in RAG
HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models

Reading and Summarizing with LLMs
Assessing the Ability of ChatGPT to Screen Articles for Systematic Reviews
From Key Points to Key Point Hierarchy: Structured and Expressive Opinion Summarization
Adapter-based Selective Knowledge Distillation for Federated Multi-domain Meeting Summarization
Summaries, Highlights, and Action items: Design, implementation and evaluation of an LLM-powered meeting recap system
Understanding Before Reasoning: Enhancing Chain-of-Thought with Iterative Summarization Pre-Prompting
Reranking-based Generation for Unbiased Perspective Summarization
LLMs as Architects and Critics for Multi-Source Opinion Summarization

Reasoning Architectures of LLMs
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models
Flows: Building Blocks of Reasoning and Collaborating AI
React - Synergizing Reasoning And Acting In Language Models
Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning
Efficient Tool Use with Chain-of-Abstraction Reasoning
Can large language models explore in-context?
Strategic Reasoning with Language Models
Generalization to New Sequential Decision Making Tasks with In-Context Learning
LM2: A Simple Society of Language Models Solves Complex Reasoning
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up
Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
Thinking Forward and Backward: Effective Backward Planning with Large Language Models
Reverse Thinking Makes LLMs Stronger Reasoners
Guidance is All You Need: Temperature-Guided Reasoning in Large Language Models
Reasoning Language Models: A Blueprint
Efficient Reasoning with Hidden Thinking
Thoughts without Thinking: Reconsidering the Explanatory Value of Chain-of-Thought Reasoning in LLMs through Agentic Pipelines
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
A Tutorial on LLM Reasoning: Relevant Methods behind ChatGPT o1
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem
Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties
Beyond Single Models: Enhancing LLM Detection of Ambiguity in Requests through Debate
Base Models Know How to Reason, Thinking Models Learn When
Eliciting Reasoning in Language Models with Cognitive Tools
Do LLMs Encode Functional Importance of Reasoning Tokens?
On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

Reasoning by Reflection in LLMs
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
Reflexion: an autonomous agent with dynamic memory and self-reflection
System 2 Attention (is something you might need too)
SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions
Teaching Large Language Models to Reason with Reinforcement Learning
Answering Questions by Meta-Reasoning over Multiple Chains of Thought
OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
Humans or LLMs as the Judge? A Study on Judgement Biases
LLM Augmentations to support Analytical Reasoning over Multiple Documents
From Language to Logic: A Bi-Level Framework for Structured Reasoning
DPMT: Dual Process Multi-scale Theory of Mind Framework for Real-time Human-AI Collaboration
Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
Decoupling Knowledge and Reasoning in LLMs: An Exploration Using Cognitive Dual-System Theory
Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors
First Try Matters: Revisiting the Role of Reflection in Reasoning Models
Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models
Efficient Reasoning with Balanced Thinking

Reasoning Critiques and Evaluation
Reasoning Models Can Be Effective Without Thinking
(How) Do reasoning models reason?
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective
Can Large Language Models do Analytical Reasoning?
When More is Less: Understanding Chain-of-Thought Length in LLMs
Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize?
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models
Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
Attentive Reasoning Queries: A Systematic Method for Optimizing Instruction-Following in Large Language Models
Chain-of-Verification Reduces Hallucination in Large Language Models
Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess
LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis
Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs
Advanced Mathematical Problems
Mitigating Hallucinations in Large Language Models via Causal Reasoning
Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity
Do Cognitively Interpretable Reasoning Traces Improve LLM Performance?
What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
Reasoning Can Hurt the Inductive Abilities of Large Language Models
Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

Reasoning in o1 o3 Models
Search-o1: Agentic Search-Enhanced Large Reasoning Models
DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
Large Language Models Think Too Fast To Explore Effectively
Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs
Reasoning Models Don't Always Say What They Think
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
Thinkless: LLM Learns When to Think
Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models
When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs
Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
Reasoning LLMs are Wandering Solution Explorers
Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation
OpenThoughts: Data Recipes for Reasoning Models
Search Arena: Analyzing Search-Augmented LLMs
SSRL: Self-Search Reinforcement Learning
A Decomposition Perspective to Long-context Reasoning for LLMs

Reasoning Logic Internal to LLMs
Can LLMs Follow Simple Rules?
Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting
Abductive Reasoning with the GPT-4 Language Model: Case studies from criminal investigation, medical practice, scientific
Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners
Do Large Language Models Latently Perform Multi-Hop Reasoning?
Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
Complexity-Based Prompting for Multi-Step Reasoning
Can Large Language Models Understand Context?
Premise Order Matters in Reasoning with Large Language Models
An Overview Of Temporal Commonsense Reasoning and Acquisition
Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability
Reasoning with Large Language Models, a Survey
Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model
Chain of Thoughtlessness? An Analysis of CoT in Planning
Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning
Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning
CDW-CoT: Clustered Distance-Weighted Chain-of-Thoughts Reasoning
Logical Reasoning in Large Language Models: A Survey
Symbol-LLM: Towards Foundational Symbol-centric Interface For Large Language Models
LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models
Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions
Aligning Language Models to Explicitly Handle Ambiguity
Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models
FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning
Universe of Thoughts: Enabling Creative Reasoning with Large Language Models
How do Transformers Learn Implicit Reasoning?
LLM Strategic Reasoning: Agentic Study through Behavioral Game Theory
QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?

Reasoning Methods in LLMs: CoT, ToT, Graph of Thought, etc.
Measuring Faithfulness in Chain-of-Thought Reasoning
Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Diagnostic Reasoning Prompts Reveal the Potential for Large Language Model Interpretability in Medicine
Least-to-most Prompting Enables Complex Reasoning In Large Language Models
Large Language Model Guided Tree-of-Thought
Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data
Self-consistency Improves Chain Of Thought Reasoning In Language Models
Chain-of-thought Reasoning Is A Policy Improvement Operator
Break the Chain: Large Language Models Can be Shortcut Reasoners
Multi-hop Question Answering via Reasoning Chains
Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
Demystifying Chains, Trees, and Graphs of Thoughts
Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought
Cumulative Reasoning with Large Language Models
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective
Zero-Shot Verification-guided Chain of Thoughts
Do Large Language Models Reason Causally Like Us? Even Better?
Chain of Draft: Thinking Faster by Writing Less
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
Tina: Tiny Reasoning Models via LoRA
From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step
Implicit Chain of Thought Reasoning via Knowledge Distillation
The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think
Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models
Thought Anchors: Which LLM Reasoning Steps Matter?
ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis
Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection
Competitive Programming with Large Reasoning Models
Bounds of Chain-of-Thought Robustness: Reasoning Steps, Embed Norms, and Beyond
Soft Tokens, Hard Truths
Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones
Compositional Reasoning with Transformers, RNNs, and Chain of Thought
From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?

Recommender System LLM-based Architectures
Choosing the Right Weights: Balancing Value, Strategy, and Noise in Recommender Systems
GHRS: Graph-based Hybrid Recommendation System with Application to Movie Recommendation
Mostly Exploration-Free Algorithms for Contextual Bandits
Deep Interest Network for Click-Through Rate Prediction
Recommending What Video to Watch Next: A Multitask Ranking System
Large Scale Product Graph Construction for Recommendation in E-commerce
Methodologies for Improving Modern Industrial Recommender Systems
Collaborative Filtering Bandits
Collaborative Filtering for Implicit Feedback Datasets
HyperBandit: Contextual Bandit with Hypernetwork for Time-Varying User Preferences in Streaming Recommendation
Learning Vector-Quantized Item Representation for Transferable Sequential Recommenders
Lessons Learnt From Consolidating ML Models in a Large Scale Recommendation System
Wide & Deep Learning for Recommender Systems
KGAT: Knowledge Graph Attention Network for Recommendation
Variational Autoencoders for Collaborative Filtering
Learning Distributed Representations from Reviews for Collaborative Filtering
Deep Neural Networks for YouTube Recommendations
Content-aware Collaborative Music Recommendation Using Pre-trained Neural Networks
Explainable Recommendations via Attentive Multi-Persona Collaborative Filtering
Scalable Neural Contextual Bandit for Recommender Systems
Neural Collaborative Filtering
Neural Collaborative Filtering vs. Matrix Factorization Revisited
Situating Recommender Systems in Practice: Towards Inductive Learning and Incremental Updates
Embarrassingly Shallow Autoencoders for Sparse Data
Unifying Nearest Neighbors Collaborative Filtering
Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model
Learning to Rank for Recommender Systems
Recommender Systems with Social Regularization
Collaborative Filtering with Temporal Dynamics
The Netflix Recommender System: Algorithms, Business Value, and Innovation
Collaborative Deep Learning for Recommender Systems
InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
Augmenting Netflix Search with In-Session Adapted Recommendations
Tube2Vec: Social and Semantic Embeddings of YouTube Channels
Using Navigation to Improve Recommendations in Real-Time
Monolith: Real Time Recommendation System With Collisionless Embedding Table
Calibrated Recommendations
Dynamically Expandable Graph Convolution for Streaming Recommendation
Enabling Explainable Recommendation in E-commerce with LLM-powered Product Knowledge Graph
Consistent Explainers or Unreliable Narrators? Understanding LLM-generated Group Recommendations

Recommender Systems: General
Curse of “Low” Dimensionality in Recommender Systems
I like it... I like it not: Evaluating User Ratings Noise in Recommender Systems
Posting versus Lurking: Communicating in a Multiple Audience Context
Fast and Slow Learning From Reviews
On Information Distortions in Online Ratings
Why Do People Rate? Theory and Evidence on Online Ratings
Self Selection and Information Role of Online Product Reviews
Measuring the Value of Social Dynamics in Online Product Ratings Forums
Recommendation systems and convergence of online reviews: The type of product network matters!
Cumulated Gain-Based Evaluation of IR Techniques
Reconciling the accuracy-diversity trade-off in recommendations
A Survey on Large Language Models for Recommendation

Recommender Systems: LLMs
Comparing Apples to Apples: Generating Aspect-Aware Comparative Sentences from User Reviews
Large Language Models are Zero-Shot Rankers for Recommender Systems
Prompting Large Language Models for Recommender Systems: A Comprehensive Framework and Empirical Analysis
GenRec: Large Language Model for Generative Recommendation
Exploring the Impact of Large Language Models on Recommender Systems: An Extensive Review
On Generative Agents in Recommendation
RecExplainer: Aligning Large Language Models for Recommendation Model Interpretability
CoLLM: Integrating Collaborative Embeddings into Large Language Models for Recommendation
A Multi-facet Paradigm to Bridge Large Language Model and Recommendation
Knowledge Distillation for Enhancing Walmart E-commerce Search Relevance Using Large Language Models
Generating Query-Relevant Document Summaries via Reinforcement Learning

Recommender Systems: Personalized Recommendations
A Personalized Recommender System based-on Knowledge Graph Embeddings
Explainable Recommendation with Personalized Review Retrieval and Aspect Learning
Going Beyond Local: Global Graph-Enhanced Personalized News Recommendations
Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)
Review-LLM: Harnessing Large Language Models for Personalized Review Generation
The persuasive effects of political microtargeting in the age of generative artificial intelligence
LLM-Rec: Personalized Recommendation via Prompting Large Language Models
A Probabilistic Model for Using Social Networks in Personalized Item Recommendation
A Contextual-Bandit Approach to Personalized News Article Recommendation
The Architectural Implications of Facebook’s DNN-based Personalized Recommendation
Preference Discerning with LLM-Enhanced Generative Retrieval
Revisiting Prompt Engineering: A Comprehensive Evaluation for LLM-based Personalized Recommendation

Recommenders with Conversational AI
Conversational Recommendation: A Grand AI Challenge
Advances and Challenges in Conversational Recommender Systems: A Survey
Towards Question-based Recommender Systems
Topic-Guided Conversational Recommender in Multiple Domains
Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations
Unified Conversational Recommendation Policy Learning via Graph-based Reinforcement Learning
A Unified Multi-task Learning Framework for Multi-goal Conversational Recommender Systems
A Conversation is Worth A Thousand Recommendations: A Survey of Holistic Conversational Recommender Systems
"It doesn't look good for a date": Transforming Critiques into Preferences for Conversational Recommendation Systems
Large Language Models as Zero-Shot Conversational Recommenders
RevCore: Review-augmented Conversational Recommendation
User-Centric Conversational Recommendation with Multi-Aspect User Modeling
Improving Conversational Recommender Systems via Transformer-based Sequential Modelling
INSPIRED: Toward Sociable Recommendation Dialog Systems
Leveraging Large Language Models in Conversational Recommender Systems
Backtracing: Retrieving the Cause of the Query

Reinforcement Learning and LLMs
External Model Motivated Agents: Reinforcement Learning for Enhanced Environment Sampling
Efficient Reinforcement Learning via Large Language Model-based Search
Reflexion: Language Agents with Verbal Reinforcement Learning
Reward-Robust RLHF in LLMs
Decision Transformer: Reinforcement Learning via Sequence Modeling
Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning
SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF
A Survey of Reinforcement Learning from Human Feedback
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards
Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data
Reinforcement Learning for Optimizing RAG for Domain Chatbots
LLMs can be Fooled into Labelling a Document as Relevant
Integrating Large Language Models and Reinforcement Learning for Non-Linear Reasoning
Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
RewardBench: Evaluating Reward Models for Language Modeling
Atom of Thoughts for Markov LLM Test-Time Scaling
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
TreeRL: LLM Reinforcement Learning with On-Policy Tree Search
Pre-Trained Policy Discriminators are General Reward Models
Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries
Statistical and Algorithmic Foundations of Reinforcement Learning
A Survey of Continual Reinforcement Learning
SimPO: Simple Preference Optimization with a Reference-Free Reward
User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal
How Should We Meta-Learn Reinforcement Learning Algorithms?
Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following
Learning to Reason for Factuality
Post-Training Large Language Models via Reinforcement Learning from Self-Feedback
Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance
Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning
Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models
Train Long, Think Short: Curriculum Learning for Efficient Reasoning
Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
Revisiting LLM Reasoning via Information Bottleneck
Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Reinforced Language Models for Sequential Decision Making
On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
StepWiser: Stepwise Generative Judges for Wiser Reasoning
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
Think in Games: Learning to Reason in Games via Reinforcement Learning with Large Language Models
Bridging Offline and Online Reinforcement Learning for LLMs
Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories
When Can Model-Free Reinforcement Learning be Enough for Thinking?
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
LSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following
OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning
SERL: Self-Examining Reinforcement Learning on Open-Domain
RLP: Reinforcement as a Pretraining Objective
1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
Rethinking Thinking Tokens: LLMs as Improvement Operators
AI Can Learn Scientific Taste

Reinforcement Learning with Verifiable Rewards
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models
Learning to Reason without External Rewards
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Reinforcement Pre-Training
Can Large Language Models Capture Human Annotator Disagreements?
RLPR: Extrapolating RLVR to General Domains without Verifiers
GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning
Checklists Are Better Than Reward Models For Aligning Language Models
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains
RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR
Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning
Reinforcement Learning with Rubric Anchors
GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning
Process Reward Models That Think
Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning in RLVR
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
Escaping the Verifier: Learning to Reason via Demonstrations

Reward Models
Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment
Reward Reasoning Model
Reinforcing General Reasoning without Verifiers
Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
Inference-Time Scaling for Generalist Reward Modeling
RM-R1: Reward Modeling as Reasoning
rStar2-Agent: Agentic Reasoning Technical Report
Outcome-based Exploration for LLM Reasoning
Jointly Reinforcing Diversity and Quality in Language Model Generations
Information-Theoretic Reward Decomposition for Generalizable RLHF
RLHF Workflow: From Reward Modeling to Online RLHF
RLNVR: Reinforcement Learning from Non-Verified Real-World Rewards
Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO
Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling

Role play in LLMs
Role-Play with Large Language Models
LLMs as Method Actors: A Model for Prompt Engineering and Architecture
Cultural Evolution of Cooperation among LLM Agents
RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models
Do Role-Playing Agents Practice What They Preach? Belief-Behavior Consistency in LLM-Based Simulations of Human Trust
Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems
SPICE: Self-Play In Corpus Environments Improves Reasoning
Thinking in Character: Advancing Role-Playing Agents with Role-Aware Reasoning
Towards Safe and Honest AI Agents with Neural Self-Other Overlap

Self Refinement in LLMs
Self-Refine: Iterative Refinement with Self-Feedback
Think-in-Memory: Recalling and Post-thinking Enable LLMs with Long-Term Memory
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
RARR: Researching and Revising What Language Models Say, Using Language Models
Fine-grained Hallucination Detection and Editing for Language Models
Rethinking with Retrieval: Faithful Large Language Model Inference
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Self-Evaluation Guided Beam Search for Reasoning
When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
Self-Rewarding Language Models
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Let’s Verify Step by Step
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
A Survey on Knowledge Distillation of Large Language Models
Augmenting Autotelic Agents with Large Language Models
Self-Discover: Large Language Models Self-Compose Reasoning Structures
Self-Taught Evaluators
Evaluating Large Language Models at Evaluating Instruction Following
FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
Metacognitive Retrieval-Augmented Large Language Models
Self-Reflection in LLM Agents: Effects on Problem-Solving Performance
Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies
Training Language Models to Self-Correct via Reinforcement Learning
Recursive Introspection: Teaching Language Model Agents How to Self-Improve
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
How to Correctly do Semantic Backpropagation on Language-based Agentic Systems
Boundless Socratic Learning with Language Games
RL-STaR: Theoretical Analysis of Reinforcement Learning Frameworks for Self-Taught Reasoner
Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst
Can Large Reasoning Models Self-Train?
Truly Self-Improving Agents Require Intrinsic Metacognitive Learning
Self-Adapting Language Models
CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
Self-Questioning Language Models
SAND: Boosting LLM Agents with Self-Taught Action Deliberation
Self-Improving Model Steering

Sentiment and Semantics with LLMs
Artificial intelligence is ineffective and potentially harmful for fact checking
Can Authorship Representation Learning Capture Stylistic Features?
Atesa-bært: A Heterogeneous Ensemble Learning Model For Aspect-based Sentiment Analysis
Classifying YouTube Comments Based on Sentiment and Type of Sentence
Fake News Detectors are Biased against Texts Generated by Large Language Models
HonestBait: Forward References for Attractive but Faithful Headline Generation
Exploiting Explainability to Design Adversarial Attacks and Evaluate Attack Resilience in Hate-Speech Detection Models
Detoxify Language Model Step-by-Step
Improving Document-Level Sentiment Analysis with User and Product Context
Large Language Models Can Infer Psychological Dispositions Of Social Media Users
Leveraging AI for democratic discourse: Chat interventions can improve online political conversations at scale
Proactive Moderation of Online Discussions: Existing Practices and the Potential for Algorithmic Support
Creativity Has Left the Chat: The Price of Debiasing Language Models
Using Computational Models to Test Syntactic Learnability
Can Large Language Models Transform Computational Social Science?
Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation
A Survey on Lexical Ambiguity Detection and Word Sense Disambiguation
Semantic Specialization for Knowledge-based Word Sense Disambiguation
Large Concept Models: Language Modeling in a Sentence Representation Space
Irony in Emojis: A Comparative Study of Human and LLM Interpretation
Think Like a Person Before Responding: A Multi-Faceted Evaluation of Persona-Guided LLMs for Countering Hate
News Sentiment Embeddings for Stock Price Forecasting
Word Meanings in Transformer Language Models
Semantic Structure in Large Language Model Embeddings
Event-Aware Sentiment Factors from LLM-Augmented Financial Tweets: A Transparent Framework for Interpretable Quant Trading

Social Media and Generative AI
Durably reducing conspiracy beliefs through dialogues with AI
Quantifying Controversy on Social Media
SMILE: Evaluation and Domain Adaptation for Social Media Language Understanding
Forecasting the presence and intensity of hostility on Instagram using linguistic and social features
Large Language Models For Social Networks: Applications, Challenges, And Solutions
Attention on the brain
Uncovering Latent Arguments in Social Media Messaging by Employing LLMs-in-the-Loop Strategy
Stance Detection on Social Media with Fine-Tuned Large Language Models
Real-time News Story Identification

Social Theory and Generative AI
Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies
Is this the real life? Is this just fantasy? The Misleading Success of Simulating Social Interactions With LLMs
SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents
Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus?
CogBench: a large language model walks into a psychology lab
Should Humans Lie to Machines? The Incentive Compatibility of Lasso and General Weighted Lasso
Verbal lie detection using Large Language Models
Are Customers Lying to Your Chatbot?
Detecting Deception Using Natural Language Processing and Machine Learning in Datasets on COVID-19 and Climate Change
Truth or lie: Exploring the language of deception
Man vs machine – Detecting deception in online reviews
Transformer-based cynical expression detection in a corpus of Spanish YouTube reviews
Do We Trust ChatGPT as much as Google Search and Wikipedia?
Expanding Explainability: Towards Social Transparency in AI systems
“Hello There! Is Now a Good Time to Talk?”: Opportune Moments for Proactive Interactions with Smart Speakers
Towards Collective Superintelligence, a Pilot Study
People cannot distinguish GPT-4 from a human in a Turing test
Enhancing social cohesion with cooperative bots in societies of greedy, mobile individuals
GPT-4 is judged more human than humans in displaced and inverted Turing tests
The Return of Pseudosciences in Artificial Intelligence: Have Machine Learning and Deep Learning Forgotten Lessons from
Who’s Afraid of (Left) Hyperstitions
Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations
Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
How AI Impacts Skill Formation
We Are All Creators: Generative AI, Collective Knowledge, and the Path Towards Human-AI Synergy

Speech and Voice Modes with Generative AI
Self-Supervised Models of Speech Infer Universal Articulatory Kinematics
POMDP-based Statistical Spoken Dialogue Systems: a Review
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Voxtral

Synthetic Dialog with LLMs
Self-Directed Synthetic Dialogues and Revisions Technical Report
Suppressing Pink Elephants with Direct Principle Feedback
Synthetic Dialogue Dataset Generation using LLM Agents
DiaSynth: Synthetic Dialogue Generation Framework for Low Resource Dialogue Applications
Dynamic Task-Oriented Dialogue: A Comparative Study of Llama-2 and Bert in Slot Value Generation

Tasks and Planning with Generative AI
TaskLAMA: Probing the Complex Task Understanding of Language Models
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
Large Language Models can accomplish Business Process Management Tasks
Task Contamination: Language Models May Not Be Few-Shot Anymore
Chatbots in Knowledge-Intensive Contexts: Comparing Intent and LLM-Based Systems
Task-Oriented Dialogue with In-Context Learning
Conversational Semantic Parsing for Dialog State Tracking
Task-Oriented Dialogue as Dataflow Synthesis
Semantic Parsing for Task Oriented Dialog using Hierarchical Representations
Learning to Map Context-Dependent Sentences to Executable Formal Queries
CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases
PolyResponse: A Rank-based Approach to Task-Oriented Dialogue with Application in Restaurant Search and Booking
SOLOIST: Building Task Bots at Scale with Transfer Learning and Machine Teaching
Position: LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
Can Large Language Models Reason and Plan?
Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning
Can Large Language Models Really Improve by Self-critiquing Their Own Plans?
VAL: Automatic Plan Validation, Continuous Effects and Mixed Initiative Planning using PDDL
TwoStep: Multi-agent Task Planning using Classical Planners and Large Language Models
TDAG: A Multi-Agent Framework based on Dynamic Task Decomposition and Agent Generation
Real-World Planning with PDDL+ and Beyond
On the Roles of LLMs in Planning: Embedding LLMs into Planning Graphs
Improving Generalization in Task-oriented Dialogues with Workflows and Action Plans
Action-Based Conversations Dataset: A Corpus for Building More In-Depth Task-Oriented Dialogue Systems
Large Language Models as Planning Domain Generators
Dynamic Planning with a LLM
Plug-and-Play Policy Planner for Large Language Model Powered Dialogue Agents
PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers
Tree Search for Language Model Agents
Graph-enhanced Large Language Models in Asynchronous Plan Reasoning
AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation
Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1
Plan, Verify and Switch: Integrated Reasoning with Diverse X-of-Thoughts
ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis
Towards Machine Theory of Mind with Large Language Model-Augmented Inverse Planning
On the Limits of Innate Planning in Large Language Models

Test Time Compute with LLMs
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
A Survey on LLM Inference-Time Self-Improvement
From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision
Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models
The Surprising Effectiveness of Test-Time Training for Abstract Reasoning
Towards Large Reasoning Models: A Survey on Scaling LLM Reasoning Capabilities
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training
Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking
Sleep-time Compute: Beyond Inference Scaling at Test-time
Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods
Memorization and Knowledge Injection in Gated LLMs
A Survey on Post-training of Large Language Models
LLM Post-Training: A Deep Dive into Reasoning Large Language Models
Don't "Overthink" Passage Reranking: Is Reasoning Truly Necessary?
LLMs can implicitly learn from mistakes in-context
Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
Behavioral Exploration: Learning to Explore via In-Context Adaptation
MatFormer: Nested Transformer for Elastic Inference
Long-context LLMs Struggle with Long In-context Learning
Test-Time Scaling with Reflective Generative Model
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models
TTRL: Test-Time Reinforcement Learning
Reflect then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion
Deep Think with Confidence
A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling
End-to-End Test-Time Training for Long Context

Theory of Mind in LLMs
Does It Make Sense to Speak of Introspection in Large Language Models?
The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind
MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind
InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles
PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues
Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?
Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models
A Systematic Review on the Evaluation of Large Language Models in Theory of Mind Tasks
Evaluating Large Language Models in Theory of Mind Tasks
AI Models Exceed Individual Human Accuracy in Predicting Everyday Social Norms
Evaluating Theory of Mind and Internal Beliefs in LLM-Based Multi-Agent Systems

Tool and Computer Use by LLMs
Agent S: An Open Agentic Framework that Uses Computers Like a Human
Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks
Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling
MCP-Zero: Proactive Toolchain Construction for LLM Agents from Scratch
Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
Agentic Web: Weaving the Next Web with AI Agents
Real-Time Procedural Learning From Experience for AI Agents

Training and Fine Tuning Methods
CONTROL PREFIXES for Parameter-Efficient Text Generation
The Curse Of Recursion: Training On Generated Data Makes Models Forget
Language models are weak learners
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models
Extreme Multi-Label Skill Extraction Training using Large Language Models
Exploring Format Consistency for Instruction Tuning
Training language models to follow instructions with human feedback
Should We Fine-Tune or RAG? Evaluating Different Techniques to Adapt LLMs for Dialogue
Transcendence: Generative Models Can Outperform The Experts That Train Them
Think before you speak: Training Language Models With Pause Tokens
Fine-tuning Language Models for Factuality
Lil-Bevo: Explorations of Strategies for Training Language Models in More Humanlike Ways
Divide-or-Conquer? Which Part Should You Distill Your LLM?
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Supervised Pretraining Can Learn In-Context Reinforcement Learning
Context-PEFT: Efficient Multi-Modal, Multi-Task Fine-Tuning
Instruction Tuning for Large Language Models: A Survey
Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning
Distilling LLMs' Decomposition Abilities into Compact Language Models
Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP
Persistent Pre-Training Poisoning of LLMs
The False Promise of Imitating Proprietary LLMs
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
A Little Human Data Goes A Long Way
The Hallucination Tax of Reinforcement Finetuning
Practices, Applied Research Challenges and Opportunities
An Emulator for Fine-Tuning Large Language Models using Small Language Models
On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
Improving large language models with concept-aware fine-tuning
Unveiling the Learning Mind of Language Models: A Cognitive Framework and Empirical Study
Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs
Fine-tuning Large Language Model for Automated Algorithm Design
LESS: Selecting Influential Data for Targeted Instruction Tuning
Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs
Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance
Misaligned by Design: Incentive Failures in Machine Learning
RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
Training-Free Group Relative Policy Optimization

Uncategorized
Do Language Models Understand Time?
Survey of Knowledge Workers
Operating Multi-Client Influence Networks Across Platforms
Metadiscursive nouns in academic argument: ChatGPT vs student practices
Bigger is not always better: The importance of human-scale language modeling for psycholinguistics
Exploiting Dialogue Acts and Context to Identify Argumentative Relations in Online Debates
Fine-tuning Pre-trained Language Models for Dialogical Argument Mining with Inference Anchoring Theory
A Comprehensive Evaluation of Inductive Reasoning Capabilities and Problem Solving in Large Language Models
On the Relationship between Sentence Analogy Identification and Sentence Structure Encoding in Large Language Models
Exploring the Potential of ChatGPT on Sentence Level Relations: A Focus on Temporal, Causal, and Discourse Relations
Style Vectors for Steering Generative Large Language Models
Clustering-based Sampling for Few-Shot Cross-Domain Keyphrase Extraction
DeepCT-enhanced Lexical Argument Retrieval
Computational Modelling of Undercuts in Real-world Arguments
Identification of Propositional and Illocutionary Relations
Overview of DialAM-2024: Argument Mining in Natural Language Dialogues
Turiya at DialAM-2024: Inference Anchoring Theory Based LLM Parsers
The Labor Market Effects of Generative Artificial Intelligence
Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning
Pragmatic Implicature Processing in ChatGPT
creative problem-solving
Evaluating the psychometric properties of ChatGPT-generated questions
Spurious Rewards: Rethinking Training Signals in RLVR
Exploring LLMs Applications in Law: A Literature Review on Current Legal NLP Approaches
Thematic Maps
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Circuit Tracing: Revealing Computational Graphs in Language Models
FLEX Benchmark (False Presupposition Linguistic Evaluation eXperiment)
The social component of the projection behavior of clausal complement contents
Persuasive presuppositions
On the Conversational Basis of Some Presuppositions
Presuppositions are more persuasive than assertions if addressees accommodate them: Experimental evidence for philosophical reasoning
How Projective is Projective Content? Gradience in Projectivity and At-issueness
Sources of Hallucination by Large Language Models on Inference Tasks
Unsupervised Elicitation of Language Models
A Non-Factoid Question-Answering Taxonomy
Rethinking STS and NLI in Large Language Models
Tuning Language Models by Proxy
Discourse Structure and Dialogue Acts in Multiparty Dialogue: the STAC Corpus
Language Models' Hall of Mirrors Problem: Why AI Alignment Requires Peircean Semiosis
Toward understanding and preventing misalignment generalization
DO THEY SEE WHAT WE SEE?
How we built our multi-agent research system
BrowseComp: a benchmark for browsing agents
The Illusion of the Illusion of the Illusion of Thinking
ChatGPT codes
Chain-of-Thought Is Not Explainability
The effect of ChatGPT on students’ learning performance, learning perception, and higher-order thinking: insights from a meta-analysis
AI tutoring outperforms in-class active learning: an RCT introducing a novel research-based design in an authentic educational setting
Why Do Multi-agent LLM Systems Fail?
TiMoE: Time-Aware Mixture of Language Experts
Causal Reflection with Language Models
FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction
Artificial Intelligence and the Labor Market?
Understanding Tool-Integrated Reasoning
AI Compute Architecture and Evolution Trends
Linguistic markers of inherently false AI communication and intentionally false human communication: Evidence from hotel reviews
GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
We Won't be Missed: Work and Growth in the Era of AGI
SpikingBrain Technical Report: Spiking Brain-inspired Large Models
Quantifying Human-AI Synergy
Nested Learning: The Illusion of Deep Learning Architectures
Major AI conference flooded with peer reviews written fully by AI
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
Natural Emergent Misalignment From Reward Hacking In Production RL
Estimating AI productivity gains from Claude conversations
Making Sense of Memory in AI Agents
Collaborative Reasoner: Self-Improving Social Agents with Synthetic Conversations
PretrainZero: Reinforcement Active Pretraining
Reinforcement Learning: An Overview
Orchestrating Synthetic Data with Reasoning
Nested Learning: The Illusion of Deep Learning Architecture Expanded
Emergent Introspective Awareness in Large Language Models
The state of enterprise AI
Adaptation of Agentic AI
Equipping agents for the real world with Agent Skills
Agent Development Kit
LLM Agent
Workflow Agent
Provable Benefits of In-Tool Learning for Large Language Models
Emergent Hierarchical Reasoning In LLMs Through Reinforcement Learning
Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity
Weak-to-Strong GraphRAG: Aligning Weak Retrievers with Large Language Models for Graph-based Retrieval Augmented Generation
Thinking Fast, Slow, and Artificial: How AI is Reshaping Human Reasoning and the Rise of Cognitive Surrender
Peer-Preservation in Frontier Models
KellyBench: Can Language Models Beat the Market?
Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence
What we talk to when we talk to language models
Automated Alignment Researchers: Using large language models to scale scalable oversight
Reasoning Regimes as Attractor Basins
The Impact of AI-Generated Text on the Internet
The Abstraction Fallacy: Why AI Can Simulate But Not Instantiate Consciousness
Comparing Human and AI Therapists in Behavioral Activation for Depression: Cross-Sectional Questionnaire Study
Development and validation of large language model rating scales for automatically transcribed psychological therapy sessions
LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
Language models show human-like content effects on reasoning tasks
Adam's Law: Textual Frequency Law on Large Language Models
Navigating the State of Cognitive Flow: Context-Aware AI Interventions for Effective Reasoning Support
Rethinking Interpretability in the Era of Large Language Models
A Mechanistic Analysis of Looped Reasoning Language Models
Autogenesis: A Self-Evolving Agent Protocol
Reasoning-Driven Synthetic Data Generation and Evaluation
Open-world evaluations for measuring frontier AI capabilities
Context Engineering 2.0: The Context of Context Engineering
Scaling Latent Reasoning via Looped Language Models
Agentic Misalignment: How LLMs Could Be Insider Threats
ASI-Evolve: AI Accelerates AI
Payrolls to Prompts: Firm-Level Evidence on the Substitution of Labor for AI

Visual and GUI LLMs
OmniParser for Pure Vision Based GUI Agent
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
AutoGLM: Autonomous Foundation Agents for GUIs
Exploring Student-AI Interactions in Vibe Coding

Work Applications and Use Cases with LLMs
Social Skill Training with Large Language Models
Generative AI in Real-World Workplaces
Workplace Everyday-Creativity through a Highly-Conversational UI to Large Language Models
Using Large Language Models to Generate, Validate, and Apply User Intent Taxonomies
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions
Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce
The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas
Working with AI: Measuring the Occupational Implications of Generative AI
How Exposed Are UK Jobs to Generative AI? Developing and Applying a Novel Task-Based Index
From Articles to Code: On-Demand Generation of Core Algorithms from Scientific Publications
GRASP: Municipal Budget AI Chatbots for Enhancing Civic Engagement

World Models and LLMs
Critiques of World Models
Simulating Society Requires Simulating Thought



©2007 - 2025 by Adrian Chan. All Rights Reserved.