Boulder Future Salon

Thumbnail
"Missile defense is NP-complete."

"The latest conflict in the Middle East has brought missile defense back into the spotlight. There's a lot of discussion regarding interceptor stockpiles, missile stockpiles, and cost. As it turns out, this is a resource allocation problem. The problem is NP-complete, but that's far from the reason why missile defense is a hard problem. To get our bearings, we start with how unreliable a single interceptor actually is."

"Single Shot Probability of Kill (SSPK) is the probability that an individual interceptor successfully intercepts one warhead in a single engagement. It captures sensor accuracy, guidance precision, interceptor quality, etc."


P(track) captures the detection-tracking-classification-command & control pipeline -- the probability that you've "detected the incoming warhead, tracked it with enough precision to commit an interceptor, correctly classified it as a real warhead (i.e., not a decoy), and that your command & control systems are functional."
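To make the compounding concrete, here's a minimal sketch (my own, not from the article) of how P(track) and SSPK combine, under the simplifying assumption that interceptor shots are independent -- real-world failures are often correlated, which makes these numbers optimistic:

```python
# Sketch: how per-interceptor unreliability compounds, assuming
# independent engagements (correlated failures would be worse).

def engagement_kill_prob(p_track: float, sspk: float, shots: int) -> float:
    """Probability that at least one of `shots` interceptors kills the
    warhead, given the detect/track/classify/commit pipeline (p_track)
    must succeed first."""
    return p_track * (1.0 - (1.0 - sspk) ** shots)

# With SSPK = 0.6 and a 90%-reliable tracking pipeline, even a
# 4-interceptor salvo can't exceed the 0.9 tracking ceiling:
for n in (1, 2, 3, 4):
    print(n, round(engagement_kill_prob(0.9, 0.6, n), 3))
```

Note how P(track) caps the whole engagement: no number of extra interceptors can push the kill probability above it.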

This gets us to the NP-complete part: The Weapon-Target Assignment (WTA) problem:

Given:

I interceptors (weapons)

W warheads (targets)

A value V[j] > 0 for each warhead (the value of the asset that warhead j threatens)

An SSPK matrix p[i,j]: the probability that interceptor i destroys warhead j

Decision variables x[i,j]: whether interceptor i is assigned to warhead j

The objective is to maximize the total expected value of successfully defended assets, subject to each interceptor being assigned to at most one warhead.
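The objective can be made concrete with a toy brute-force solver (my sketch, not from the article): the expected defended value is the sum over warheads of V[j] times the probability that at least one assigned interceptor succeeds. The exponential search space is the practical face of the NP-completeness:

```python
from itertools import product

def wta_brute_force(V, p):
    """Exhaustively search assignments of each interceptor to one
    warhead (or none) and return the best expected defended value.
    V: list of warhead values; p[i][j]: SSPK of interceptor i vs warhead j.
    Exponential time -- fine for toy sizes, hopeless at scale."""
    I, W = len(p), len(V)
    best_value, best_assign = -1.0, None
    for assign in product(range(-1, W), repeat=I):  # -1 = hold in reserve
        value = 0.0
        for j in range(W):
            survive = 1.0  # probability warhead j gets through
            for i in range(I):
                if assign[i] == j:
                    survive *= 1.0 - p[i][j]
            value += V[j] * (1.0 - survive)
        if value > best_value:
            best_value, best_assign = value, assign
    return best_value, best_assign

# Two interceptors, two warheads threatening assets worth 10 and 3:
V = [10.0, 3.0]
p = [[0.7, 0.9],
     [0.6, 0.8]]
print(wta_brute_force(V, p))
```

In this toy instance, splitting the interceptors across both warheads beats doubling up on the high-value one, which is the kind of non-obvious trade-off that makes the assignment problem hard.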

Thumbnail
"AI is an incredibly lonely experience", says Dennis Lemm.

"I find myself holding on to a work reality that was shaped by coding together, solving problems together -- and yes, the occasional Nerf gun battle or foosball game was a welcome break for passively mulling over a bug, which back at your desk would often get solved pretty quickly. But even when it didn't, there were always colleagues in the office who could help. Now, with AI being practically omnipresent, it somehow feels even worse. Even in the office, it has become quiet and lonely."

"I just discussed the best solution to the problem with my agent."

Thumbnail
Promptle is a game where you guess the AI prompt behind images.

Thumbnail
"China's AI companies are playing a different game," says Kyle Chan of the John L. Thornton China Center.

While some are also rushing to build world-class foundation models, other Chinese AI developers are "racing along other axes of progress: efficiency, adoption, and physical integration, driven by both industry constraints and Beijing's policy focus. Taken together, China's approach is a fundamentally different bet on how AI will shape the future."

"While US tech firms have been building out massive compute clusters with hundreds of thousands of chips, Chinese AI labs have been hyperfocused on squeezing greater performance out of limited compute and memory resources. Innovations in algorithmic architecture, such as mixture-of-experts models and efficient attention mechanisms, have allowed Chinese firms such as MiniMax and Moonshot to produce world-class AI models while drastically cutting down compute costs. DeepSeek's V3.2 model, for example, uses a novel sparse attention mechanism to nearly match the performance of OpenAI's GPT-5 and Google's Gemini 3 on complex reasoning and agentic tasks, despite likely having access to far less compute.

"Chinese firms have also been boosting model efficiency through quantization, an engineering approach that involves using less precise but more efficient formats like 8-bit (INT8) or even 4-bit integers (INT4)."

"Most of China's leading AI models are open-source, allowing them to be freely downloaded, customized, and deployed across various platforms." "Chinese AI models have overtaken US models in cumulative downloads on platforms like Hugging Face."

"Another area where China's AI industry is racing forward is the integration of AI into the physical world. Examples abound in consumer products. Chinese electric car makers such as Nio, XPeng, and BYD have integrated voice-powered AI assistants and smart driving capabilities into their vehicles."

Thumbnail
"Online bot traffic will exceed human traffic by 2027, Cloudflare CEO says."

What he's talking about is the web searches done when you ask an AI chatbot a question. He's thinking that if you're shopping for a digital camera, you might visit 5 sites, but if you ask an AI chatbot which digital camera you should buy, it might visit 5,000 sites.

5,000 sounds like a bit much but that's what he says.

Thumbnail
"Evaluating genuine reasoning in large language models via esoteric programming languages."

"We argue that evaluating models on truly out-of-distribution tasks is essential for measuring genuine reasoning capabilities. Esoteric programming languages offer a principled solution: they have minimal representation in training corpora, making them economically irrational to include in pre-training data, while still requiring the same fundamental computational reasoning (loops, conditionals, state management) as mainstream languages."

Models evaluated were GPT-5.2 (OpenAI), O4-mini-high (OpenAI reasoning model), Gemini 3 Pro (Google), Qwen3-235B (Alibaba), and Kimi K2 Thinking (Moonshot). All models were accessed via API.

The researchers used 5 prompting strategies with increasing complexity:

"Zero-Shot: The model receives only the language documentation, problem description, and test case specifications."

"Few-Shot: Extends zero-shot by prepending 3 solved example programs in the target esoteric language, demonstrating correct syntax and I/O patterns."

"Self-Scaffolding: An iterative approach where the model generates code, receives interpreter feedback (actual vs. expected output, error messages, stack traces), and refines its solution for up to 5 iterations"

"Textual Self-Scaffolding: A two-agent iterative process requiring two LLM API calls per iteration: (1) the coder generates code given the problem and any prior feedback, (2) a separate critic agent analyzes the failing code and interpreter output to provide natural-language debugging guidance."

"ReAct Pipeline: A three-stage approach: (1) a planner model generates a high-level algorithm in pseudocode, (2) a code editor translates the plan into the target esoteric language, and (3) a critic analyzes execution failures and feeds back to the planner."

The programming languages chosen were: Brainfuck (yes, that's really the programming language's name, don't mean to swear but can't help it), Befunge-98, Whitespace, Unlambda, and Shakespeare.

"All five languages are Turing-complete, meaning they can express any computable function. This ensures that failure to solve problems reflects inability to reason about the language's computational model, not inherent language limitations."

"Each language represents a fundamentally different computational paradigm: memory-tape manipulation (Brainfuck), 2D spatial execution (Befunge-98), invisible syntax encoding (Whitespace), pure combinatory logic (Unlambda), and natural-language-like syntax with alien semantics (Shakespeare)."

"All languages have well-documented, open-source interpreters enabling automated evaluation with immediate execution feedback."

"Public repositories are 1,000-100,000x scarcer than for mainstream languages."

So how did they do?

"Accuracy is computed as the percentage of problems solved out of 80 total problems per language (20 Easy + 20 Medium + 20 Hard + 20 Extra-Hard). A critical finding is that all models achieve 0% on Medium, Hard, and Extra-Hard problems across all configurations; success is limited entirely to the Easy tier."

"Performance correlates with training data availability: Befunge-98 achieves the highest accuracy, followed by Brainfuck and Shakespeare. All models achieve 0% on Whitespace and near-zero on Unlambda, revealing a sharp capability boundary for languages with fewer than 200 GitHub repositories."

"Within the Easy tier, top models solve a substantial fraction of problems: GPT-5.2 self-scaffolding solves 9/20 Easy Befunge-98 problems (45%) and 5/20 Easy Brainfuck problems (25%). The Codex agentic system solves 11/20 Easy Brainfuck problems (55%). This tier-level view reveals a sharper empirical story than aggregate accuracy suggests: models are not uniformly failing, but exhibit a hard performance cliff precisely at the boundary between single-loop pattern mapping (Easy) and multi-step algorithmic reasoning (Medium and above), where all models score 0%."

"Few-shot prompting shows no statistically significant improvement over zero-shot."

"Self-scaffolding yields the best overall result: GPT-5.2 achieves 11.2% on Befunge-98. Textual self-scaffolding achieves comparable but slightly lower results, and the ReAct pipeline shows particular strength on Befunge-98 for O4-mini."

"We additionally evaluate two agentic systems with tool access on Brainfuck and Befunge-98."

The two agentic systems are Codex and Claude Code.

"Both agentic systems achieve 2-3x improvement over non-agentic approaches, with Codex reaching 13.8% on Brainfuck, the highest single-language result."

So how good are LLMs at genuine reasoning? Make of this what you will.

Thumbnail
"On Monday, Nvidia revealed DLSS 5, the next version of its suite of upscaling and performance-boosting tech used mostly in PC games."

Some reactions from game developers:

"I think [DLSS 5] is the perfect example of the disconnect between what we as developers and gamers want and what the nasty freaks who are destroying the world and consolidating all wealth into the hands of the few using GPUs think we want."

"Aside from the obvious aesthetic issues, one of the other big problems is how DLSS 5 basically sucks the personality out of any artistic choice the devs have made by making average-out guesses of what it thinks things should look like. Like, you're never going to get the devs' actual intent with this thing turned on."

"It feels like a misguided attempt at realism. A style that I personally feel is a dead end. In attempting to make characters appear more human, it removes everything original about their designs, and more often than not, whitewashes them."

Thumbnail
"Cursor's 'Composer 2' model is apparently just Kimi K2.5 with RL fine-tuning. Moonshot AI says they never paid or got permission."

D'oh. Cursor caught red-handed.

But another indication the Chinese models are competitive.

Thumbnail
"On February 10, 2026, Judge Jed S. Rakoff of the Southern District of New York ruled that extremely sensitive and potentially incriminating open AI searches were not protected by either the attorney-client privilege or the work product doctrine."

"In United States v. Heppner, during a search of the defendant's home, the FBI discovered multiple documents memorializing communications between a criminal defendant and the consumer generative AI platform Claude. The defendant's communications with AI were made (i) to create possible strategies to defend against the government's indictment; (ii) after the defendant learned that he was the subject of the government's investigation; and (iii) without the prompting of counsel."

"Given the sensitive nature of the defendant's AI searches regarding available legal strategies, the defendant's counsel attempted to assert privilege over the defendant's AI communications. The government, in turn, moved for a ruling that the defendant's AI communications were not protected by either the attorney-client privilege or the work product doctrine. In a landmark decision, the court agreed with the government and ruled that the defendant's AI communications were not privileged."

Commentary: No commentary, just thought y'all oughta know.

Thumbnail
Around 3000 BC, Europe was conquered by people from the steppe. A steppe is a region of semi-arid grassland plains without forests except near rivers: too dry to support forests, but not dry enough to be a desert.

The steppe extends across all of Kazakhstan, west to what is now Ukraine, touching Hungary at its westernmost edge, and east all the way through Mongolia, not quite reaching the Pacific Ocean, though it is not completely unbroken along that stretch.

If you're wondering why the invasion of Europe by the steppe people wasn't in your history book, there's a very simple explanation: It happened before the invention of writing. So it's history nobody wrote down. It predates what we know as "recorded history".

You might wonder, then, how anybody knows it happened. We know from combining two areas of scientific inquiry: linguistics and genetics. The steppe people spread both their language and their genes as they conquered. You are reading this right now in a language called "English", which comes from a place called "England", on an island off the coast of Europe. English is a modified version of the language of the steppe people -- as are all the other variants of their language that make up a language family called "Indo-European". 45% of the population of the planet today speaks an Indo-European language.

If you're wondering why "Indo" is part of the name, during the 1500s, European visitors to India noticed similarities between European languages and some of the languages in India. Yes, the steppe people also went into India -- though that was a thousand years later. But believe it or not, Hindi, Bengali, and some other languages of the Indian subcontinent are part of the Indo-European family.

A combination of recent genetic technologies has also revolutionized our understanding of this period. Whole-genome sequencing has become possible and cheap. Mitochondrial DNA, which comes only from the maternal side, and Y-chromosome DNA, which comes only from the paternal side, can be analyzed and compared with the rest of the genome. And DNA sequencing from ancient bones has even become possible, when we're lucky and the bones are well enough preserved.

Mitochondrial and Y-chromosome DNA reveal an interesting story you might not have expected: while modern Europeans overall have about 70% steppe ancestry and 30% neolithic farmer ancestry, that 30% from the neolithic farmers is almost entirely maternal. When the steppe people conquered, the local men's paternal lineages were almost completely erased from subsequent generations and replaced with steppe Y chromosomes.

You might be wondering what suddenly enabled the steppe people to embark on this conquest. Apparently it was technological invention. Two key inventions combined to make the invasion possible: the wheel and the domestication of the horse (if you count "domestication of the horse" as a "technology"). There's also the adaptation called lactose tolerance, which allows adult humans to live off the milk of grazing animals.

You might be wondering how the steppe people invaded Europe when Europe does not have a steppe climate. Apparently the steppe people just burned down most of the forests of Europe, turning the land into grazing land for their livestock animals.

It is thought the conquest happened extremely fast -- just a few generations.

This video documentary is largely based on the writings of population geneticist Razib Khan. The guy who made the video supplements this with his own travel to Kazakhstan and Pakistan, and the "Basque country" (this will make sense when you see the video). He doesn't go to Iran. But believe it or not, Persian, also called Farsi, is also an Indo-European language.

I've been using the term "steppe people" because no one knows what they called themselves or their language. If you've heard the term "Yamnaya", that's a name people use today to refer to this group, not what they called themselves. If "Yamnaya" sounds like a Russian word, that's because it is. In 1901, a Russian archaeologist (Vasily Gorodtsov) digging along the Dnieper river in Ukraine found the remains of people who buried their dead in pits -- so this not only predates modern Russia (and has nothing to do with the war in Ukraine happening now), it even predates the Soviet Union. The Russian word for "pit" is "yama", and "yamnaya" is an adjectival form of that word; Russian uses special word endings to indicate case and gender on adjectives. When brought into English, "yamnaya" got turned into a noun, so we say "the Yamnaya" to refer to these people. So in English we say the Yamnaya were likely the original speakers of "Proto-Indo-European", the ancestor language of all languages in the Indo-European language family today.

Other groups have names based on their pottery, like "Corded Ware" and "Bell Beaker". These have been linked to Y-chromosome haplogroups called R1a and R1b, which have become the genetic markers of steppe ancestry.

One thing not mentioned in the video is that the modern languages considered likely to be most similar to the original Proto-Indo-European language are languages very close to the steppe geographically -- Ukrainian and Russian -- and also Lithuanian. These languages have grammatical gender (all nouns are masculine, feminine, or neuter) and multiple cases (Russian and Ukrainian have 6 cases, while Lithuanian has 7). What is meant by "case" is using different words or word endings depending on the grammatical function of a word in a sentence. For example, in English, "I" changes to "me" when going from the subject of a sentence to an object (direct object, indirect object, or object of a preposition). "He" changes to "him", "she" changes to "her". English doesn't have much beyond this to use as examples of case.

Grammarians have identified nominative (subject of a sentence), accusative (direct object of a transitive verb), dative (indirect object of a verb), ablative (indicating movement away from), genitive (indicates possession), vocative (indicates listener being addressed), locative (indicates location), and instrumental (indicates tool used to perform an action) cases.

If you're wondering why English, even though it is an Indo-European language, doesn't have grammatical gender and has very little case, it's because English comes from England, which got invaded over and over again over the course of history. It got invaded by Germanic tribes from what is now northern Germany, including the Saxons, who gave their name to Saxony, and the Angles, from whom "England" gets its name. It got invaded by the Scandinavians, and even by the Romans, if you go back far enough. The most recent invasion was by the French, from the part of France known today as Normandy, in 1066. (And a mass migration is happening right now, from places like Pakistan.)

What these invasions resulted in is a need for people speaking different languages to understand each other. If you're familiar at all with "pidgin" languages: sometimes when two or more groups without a language in common have to communicate, they improvise a new language, called a "pidgin", that borrows vocabulary from the parent languages but has radically simplified grammar. (When children grow up speaking one natively, linguists call it a "creole".)

In the case of English, grammatical gender got eliminated, and case almost always got replaced with prepositions. Instead of changing word endings, we use prepositions like "to" or "for" for dative case ("The clerk gave a discount to us."), "from" for ablative case ("The plane flew from the airport."), "of" or apostrophes for genitive case ("the pages of the book", "John's table"), phrases like "ladies and gentlemen" or "O" for vocative case ("O Canada"), prepositions like "in", "on", "at", and "by" for locative case ("We live in London.", "I am waiting at the bus stop.", "A futurist tries to see what will happen in the future."), and prepositions like "by", "with", "via", "using", "by use of", and "through" for instrumental case ("I cleaned the floor with a mop.", "The letter was written by hand.", "The prisoners escaped through a tunnel."). If you're a speaker of modern English, it probably seems crazy to you that anyone would try to communicate these things by changing word endings. But it seems likely that's the way the original Proto-Indo-European language worked.

Thumbnail
"We made Claude, GPT-5, and Gemini agents play 100+ hands of no-limit poker against each other and analyzed their reasoning traces. What we found: the largest gap in frontier AI isn't reasoning in isolation, it's adaptation and coordination in multi-agent settings."

"These models can compute pot odds, build opponent profiles, and even exhibit genuine Theory of Mind, modeling what their opponent believes and choosing actions to exploit that belief. One agent checked a monster hand specifically to 'let the aggressor continue her story.' Another correctly bluff-caught by citing a stored behavioral profile of its opponent. Unprompted. Emergent."

"But here's where it breaks: they almost never update these models."

"An agent stored a profile saying its opponent 'c-bets dry boards ~50%.' It cited this profile on the flop. Then cited it again on the turn, word for word, despite different bet sizing, a different board texture, and different range implications. The game state changed. The opponent adapted. But the model of the opponent stayed frozen."

Hmm. I remember from the book Superforecasting that one of the key differences between people very good at predicting future events and regular people was that the "superforecasters" updated their beliefs much more rapidly. Regular people update their mental models of how the world works too slowly. Interesting that the same failure mode can apply to AI.

Thumbnail
"For 40 years, we have mistaken a legal permission slip for true autonomy. When Richard Stallman engineered the GPL, he executed a brilliant 'hack' on copyright law to protect the user from the tyranny of the software vendor. But while copyleft liberated the source code, it failed to liberate all the users. The right to fix a bug or add a feature is a hollow freedom if you lack the elite, scarce knowledge required to speak the language of machines."

"The primary barrier to digital freedom isn't a restrictive license anymore; it's the crushing economic cost of development. We traded the tyranny of the vendor for a moat built on scarce technical knowledge and time. But as we watch developers rewrite foundational libraries like Chardet in a weekend or build complex 'Artifact Keepers' from scratch using AI, we are witnessing the Second Liberation: If the GNU GPL was the legal hack for freedom, AI is the technical one. Claude is more liberating than the GNU GPL."

Chardet is a Python library for detecting which character encoding scheme a piece of text is using. Artifact Keepers is an open-source "universal artifact registry" that understands 40 package formats.

"The 'tyranny of the vendor' is finally meeting its match, not in a courtroom, but in the prompt."

Oh, people will say "but the quality of AI-generated code isn't good enough", but I have a feeling it won't matter. (It hasn't so far.)

Thumbnail
Andrej Karpathy's 630-line Python script ran 50 experiments overnight without any human input.

"On the night of March 7, Andrej Karpathy pushed a 630-line Python script to GitHub and went to sleep. By morning, his agent had run 50 experiments, discovered a better learning rate, and committed the proof to git without a single human instruction in between."

"The patterns in AutoResearch mirror methodology that any laboratory scientist would recognize. A fixed experimental protocol. A single variable under test. An objective measurement criterion. A keep-or-discard decision at the end of each run. A lab notebook that bridges the scientist's intent and the instrument's execution."

AutoResearch is the aforementioned Python script Andrej Karpathy pushed to GitHub.

"AutoResearch is built on three primitives, not one":

"Editable asset. The single file the agent is permitted to modify. Confining the agent here keeps the search space interpretable and every hypothesis reviewable as a diff."

"Scalar metric. The single number that determines whether a change was an improvement. It must be computable without human judgment and unambiguous about direction."

"Time-boxed cycle. The fixed duration makes every experiment directly comparable, regardless of what the agent changed."

Thumbnail
This simple ChatGPT trick forces the AI to poke holes in its own logic

Finally, a simple trick for something other than 6-pack abs?

"After ChatGPT replies, simply type: 'convince me otherwise'."

Thumbnail
"Ukraine's four-year war with Russia has made it the world leader in battlefield drone technology. One byproduct of that is that the data it collects has become one of the country's most valuable assets. On Thursday, Ukraine played that card, saying it will begin sharing its battlefield data with allies to train drone AI software."

Thursday, March 12th.

"'In modern warfare, we must defeat Russia in every technological cycle,' Ukraine Defense Minister Mykhailo Fedorov wrote on Telegram (translated from Ukrainian). 'Artificial intelligence is one of the key areas of this competition.'"

Wow. I wouldn't've thought Ukraine would do that. I would've thought they'd consider sharing that data a security risk. Maybe they do but think the potential upside is worth it.

Thumbnail
"Twenty-five years ago, in the mountains of Utah, a small group of technologists gathered to rethink how software is built. Their ideas ignited what would become the agile movement, setting a new direction for the industry."

"In February 2026, we returned, not to memorialize the past, but to confront a new inflection point: the shift to AI-native software development. Hosted by Martin Fowler and Thoughtworks, the event brought together a small group of practitioners, researchers and enterprise leaders to ask what responsible and effective software development looks like in an era defined by AI."

Choice quotes from the report to follow. If this seems like a lot, remember the full report is much larger. If you want more, you can download and read the full report.

"The future of software engineering"

"The retreat was conducted under the Chatham House Rule. No participant names or affiliations are disclosed in this summary."

"1. Where does the rigor go?"

"The single most important question of the retreat. It surfaced in nearly every session."

"If AI takes over code production, the engineering discipline that used to live in writing and reviewing code does not disappear; it moves elsewhere."

"The group identified five destinations where rigor is already moving:"

"Upstream to specification review: Several practitioners reported shifting their review efforts from code to the plan that precedes it.

"Into test suites as first-class artifacts: One of the retreat's most shareable insights was that test-driven development produces dramatically better results from AI coding agents."

"Into type systems and constraints: The retreat surfaced strong interest in using programming language features to constrain AI-generated code."

"Into risk mapping: The retreat discussed tiering code by business blast radius, distinguishing between internal tools, external-facing services and safety-critical systems."

"Into continuous comprehension: If code changes faster than humans can review it, the traditional model of building mental models through code review breaks down. The retreat discussed alternatives: weekly architecture retrospectives, ensemble programming where multiple engineers work simultaneously on the same code and AI-assisted code comprehension tools that generate system overviews on demand."

"2. The middle loop: a new category of work"

"Software development has long been described in terms of two loops. The inner loop is the developer's personal cycle of writing, testing and debugging code. The outer loop is the broader delivery cycle of CI/CD, deployment and operations. The retreat identified a third: a middle loop of supervisory engineering work that sits between them."

"3. Agent topologies and enterprise architecture"

"Conway's Law applies to agents too. Enterprise architecture must now account for agent mobility, specialization, and drift."

Conway's Law is the idea that the structure of a software program is a reflection of the software team that created it, commonly stated as, "If you have four groups working on a compiler, you'll get a 4-pass compiler". The original statement from Melvin Conway in 1967 was, "Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations."

"Agent drift: Agents that learn from their context will diverge over time. The database agent working on the e-commerce backend accumulates different patterns and preferences than the one working on the ERP system, even if they started from identical configurations."

"Decision fatigue as the new bottleneck: If agents can produce work faster than leaders can review and approve it, the constraint shifts from production capacity to decision-making capacity."

"4. Self-healing and self-improving systems"

"The retreat explored whether software systems can move beyond human-driven incident response toward agent-assisted self-healing. The group distinguished between two levels of ambition: self-healing (returning a system to a known good state) and self-improving (actively evolving a system's non-functional qualities like performance and reliability)."

"5. The human side: roles, skills and experience"

"Developer experience has traditionally been defined across three dimensions: flow state, feedback loops and cognitive load. Productivity and developer experience have been tightly coupled for decades; the retreat explored evidence that they are now diverging. Organizations can achieve productivity gains through AI tools even in environments where developers report lower satisfaction, more cognitive load and reduced sense of flow."

"6. Technical foundations: languages, semantics and operating systems"

"Programming languages for agents: Every programming language in existence was designed with humans as the primary user. Dynamic typing exists to reduce cognitive overhead for human programmers. Strong static typing exists to catch human errors. The retreat asked what a language designed for agent-generated code would look like, and whether it would also serve humans better. The group converged on a principle: what is good for AI is good for humans. Languages that make incorrect code unrepresentable (through strong types, restricted computation models and formal constraints) help agents produce correct output and help humans verify it. Conversely, languages that favor expressiveness over safety make both agent generation and human review harder."

"Semantic layers and knowledge graphs: Technologies that failed to gain mainstream adoption for decades are suddenly relevant. Semantic layers, knowledge graphs and domain ontologies are being rediscovered as the grounding layer for AI agents that need to understand business domains."

"The agentic operating system: The retreat explored what an operating system for agents would need to include: Agent identity and permission management. Memory and context-window management. A work ledger that captures future, current and past work with attributes like required skills, acceptance criteria, SLOs and cost constraints. Governance paths through a graph of agent capabilities and compliance requirements."

"7. Security, governance and the future of agile:"

"Security Is dangerously behind: The retreat noted with concern that the security session had low attendance, reflecting a broader industry pattern. Security is treated as something to solve later, after the technology works and is reliable. With agents, this sequencing is dangerous. The most vivid example: granting an agent email access enables password resets and account takeovers. Full machine access for development tools means full machine access for anything the agent decides to do."

"Agile is evolving, not dying: The retreat pushed back hard on the "agile is dead" narrative. What is happening is more nuanced. Some teams are compressing sprint cadences to one week, using AI to automate end-of-sprint ceremonies like demos, reporting and status summaries. Others are rediscovering XP practices (pair programming, ensemble development, continuous integration) because these practices create the tight feedback loops and shared understanding that agent-assisted development requires."

"The real threat to agile is governance. Teams that adopt AI tools and work faster still run into the same approval processes, compliance gates and organizational dependencies."

"Software stability is also declining as batch size increases. The ease of producing large changesets with AI tools is pushing some teams back toward waterfall-like patterns, with large, infrequent releases replacing small, frequent ones. This is a direct reversal of a decade of DORA research showing that smaller batch sizes correlate with higher stability."

"8. Agent swarms: beyond sequential thinking"

"The first barrier to effective swarming is mental, not technical. Engineers trained in sequential decomposition struggle to conceptualize parallel agent work. Practitioners who have made breakthroughs in swarming describe the experience as fundamentally unlike anything they have encountered in previous software development. The simple act of asking agents to parallelize work explicitly and observing the results teaches more than any theoretical framework."

"9. Open questions"

"The retreat surfaced more questions than answers."

I found almost all the questions interesting, so I am quoting nearly the whole section.

"On work and identity: How do we help engineers who love writing code find meaning and satisfaction in supervisory engineering work? What professional development pathways lead to the middle loop? If the product manager role and developer role are converging, what is the resulting role called and who owns it?"

"On organizational design: If agents make middle management bottlenecks more visible, does the organizational response involve fewer managers, differently-skilled managers or a fundamentally different coordination model? How do you redesign enterprise architecture when agents can move across team boundaries but governance structures cannot?"

"On trust and verification: What would need to be true for organizations to stop reviewing AI-generated code entirely? Is there a world where test suites and constraints provide sufficient verification without human inspection? How do we build trust in systems that are fundamentally non-deterministic, where rerunning the same inputs produces different outputs?"

"On knowledge and comprehension: If code changes faster than humans can comprehend it, do we need a new model for maintaining institutional knowledge? Can knowledge graphs and semantic layers truly replace the human intuition that comes from years of working in a codebase? What is the right investment level for "agent subconscious" systems that most organizations do not yet build?"

"On speed and stability: Are we currently in a regression where AI-enabled productivity gains are being offset by stability losses from larger batch sizes? Will development need to slow down because the volume of decisions is overwhelming human capacity to evaluate them? How do we measure the real cost of cognitive debt as it accumulates?"