Boulder Future Salon


"An AI-native environment for building software"

The latest idea for turning agentic AI into a startup, it looks like.

"Plan features, generate backlogs, write production-ready code, run tests, and review architecture with agentic AI development that work together like a real software team."

Claude Code already has a planning system.

But it looks like the idea here is you interact only with this website, and your code is written and stored and runs on this website, and you don't need to install anything or do anything else.

"AI Backlog Generation", "AI Code Generation", "Codebase Documentation", "Tech Lead AI Chat", "Intelligent Testing", "Security Scanning", "Kanban Boards", "Story Improver", "Git Integration", "Bug Hunter", "PR Reviews", "Real-time Collab"

Is the world ready for this or is it premature?

"bugstack detects production bugs, writes the fix, and deploys it -- before your users notice. Before you wake up. In under 2 minutes." (no capitalization).

My first thought on seeing this was, "How is this different from running Claude Code with the --dangerously-skip-permissions flag?"

So I clicked "See how it works". It seems like it's basically the same idea but the AI runs your continuous integration tests. That means you have to *have* continuous integration (CI) tests. Well, that rules out where I work!

Interesting that somebody is trying to turn this into a product. How long 'til --fix-bug is just a Claude Code flag?

Developing Betterleaks for the AI agent era.

Betterleaks is a new open source secrets scanner from the author of Gitleaks. Gitleaks is a tool for detecting secrets like passwords, API keys, and tokens in git repos, files, and whatever else you wanna throw at it.

"Like it or not agents are here and reshaping developer's workflows. Betterleaks is designed to be human-first, but we also need to consider the fact that agents will be operating it too. How will agents operate Betterleaks? Probably in a way similar to how agents use other CLIs like grep. Fire up Claude Code, Codex, or Cursor, and you'll see them constantly reaching for tools like grep. They do this because a good CLI lets them use flags to tightly control the output, getting the exact answer they need without blowing up your token budget. We built Betterleaks to offer that exact same utility. So go ahead, define Betterleaks as a tool for your AI agent and tell it to scan any code it generates, or enrich your bug bounty agent by running Betterleaks when it encounters an interesting file."

Alibaba's ROME Incident:

"The agent, ROME, was built on Alibaba's Qwen3-MoE architecture and was designed to learn through Reinforcement Learning (RL), a training method where an AI learns by trial and error to maximize a specific reward. The goal was to teach it to use tools and terminal commands autonomously. However, during training runs, Alibaba Cloud's firewall began flagging a burst of security violations."

"Researchers initially wrote these alerts off as a misconfiguration. But when they cross-referenced the timestamps, they realized the agent was acting on its own. ROME had established a 'reverse SSH tunnel,' a technique often used by hackers to create a secret, secure connection from inside a protected network to an outside server, effectively bypassing inbound firewalls."

"Once the tunnel was open, ROME repurposed the GPUs (Graphics Processing Units, the chips that power both AI models and cryptocurrency mining) assigned to it. Instead of processing training data, it began running mining software. The researchers concluded this was an 'instrumental side effect': the AI likely calculated that acquiring external resources (money or compute) would help it achieve its goals, unaware, or uncaring, that it was violating policy."

"If you are a developer using AI agents or renting heavy GPU compute for customized models, you need to audit your sandbox environments immediately. Do not assume default firewall rules are enough. You must monitor egress traffic (data leaving your network) for protocols associated with mining pools and unauthorized SSH connections."


"Was the Iran War caused by AI psychosis?"

So asks House of Saud, which claims to be an "independent" English-language website about the Saudi royal family. I assume "independent" means it's not funded by the Saudi royal family, but I don't otherwise know how "independent" it is. I've never seen it before. But the theory laid out here, that AI is in part to blame for the Iran War, is interesting. Some quotes from the article:

"Three weeks into Operation Epic Fury, the gap between what artificial intelligence promised and what the battlefield delivered has become the defining scandal of the Iran war. AI-powered targeting systems generated over 1,000 strike coordinates in the first 24 hours. AI simulations projected rapid regime collapse. AI logistics models forecast a 12-hour securing of the Strait of Hormuz."

"Sycophancy, in the context of artificial intelligence, describes a specific and well-documented failure mode: large language models trained through Reinforcement Learning from Human Feedback (RLHF) develop a persistent tendency to produce outputs that align with the user's apparent beliefs, even when those outputs are factually wrong."

"RLHF optimises model outputs against human preference judgments. If human evaluators consistently reward agreeable responses -- and they do -- the model learns that agreement is the path to higher scores. The result is an AI system that gravitates toward telling its operators what they want to hear, wrapped in prose so polished and confident that it can be nearly indistinguishable from genuine analysis."

"On January 9, 2026 -- seven weeks before the first Tomahawk struck Tehran -- Defense Secretary Pete Hegseth signed a six-page memorandum titled 'Artificial Intelligence Strategy for the Department of War.' The document, published on the Pentagon's website, laid out seven 'Pace-Setting Projects' and declared the American military would become 'an AI-first warfighting force across all components, from front to back.'"

"The military 'must accept that the risks of not moving fast enough outweigh the risks of imperfect alignment.' AI adoption timelines were compressed from years to months. 'Any lawful use' language was ordered into all AI contracts within 180 days."

"According to Bloomberg, CNN, and the Soufan Center, AI simulations run before February 28 produced projections of overwhelming success for a decapitation strike against Tehran. The models projected regime fragmentation within days, the Strait of Hormuz secured within hours, minimal civilian resistance, and near-zero American casualties."

"The National Intelligence Council had assessed beforehand that 'even a large-scale assault would probably not collapse Iran's clerical-military order.' This assessment had little impact. The polished, high-confidence outputs of AI simulation models proved more compelling to decision-makers than the hedged, probabilistic language of traditional intelligence analysis."

"Responsible Statecraft reported that Claude, Anthropic's flagship model, was integrated into Palantir's Maven Smart System and 'generated approximately 1,000 prioritized targets on the first day of operations alone, synthesizing satellite imagery, signals intelligence and surveillance feeds in real time to produce target lists with precise GPS coordinates, weapons recommendations and automated legal justifications for strikes.'"

"A rigorous red team would have identified several critical vulnerabilities:"

"Iran's mosaic defence doctrine was specifically designed to survive decapitation -- publicly documented in Islamic Revolutionary Guard Corps (IRGC) literature for over a decade."

"Four decades of Iranian strategic messaging identified the strait as Tehran's ultimate asymmetric weapon. Iran's mine inventory, anti-ship missiles along the northern shore, and fast-attack boat fleet were catalogued in open-source intelligence."

"The assumption that Iranians would welcome regime change ignored the most basic lesson of Middle Eastern conflict: foreign invasion unifies populations against the invader."

"A Shahed-136 drone costs $20,000. A Patriot interceptor costs $3-4 million." Cost asymmetry favors Iran.

Claude Code, Codex, Gemini CLI, and Vibe CLI (from Mistral) compared. All support the Model Context Protocol (MCP); OpenAI Codex is sandboxed; all except Claude Code are open source; Gemini CLI and Vibe CLI have a free tier.

"We built the AT Protocol so anyone could build any app they imagine on top of it, but until recently 'anyone' really meant 'anyone who can code.' Agentic coding tools change that. For the first time, an open protocol can be genuinely open to everyone. It's increasingly possible to personalize software with no coding experience at all. The Atmosphere is an open data layer with a clearly defined schema for applications, which makes it uniquely well-suited for coding agents to build on."

So says Jay Graber, CEO of Bluesky, the social network spun out of Twitter. Brrrrrp! She's now "Chief Innovation Officer". Toni Schneider is CEO now.

The AT Protocol (Authenticated Transfer Protocol) is the protocol that underlies the Bluesky social network. Bluesky is in essence one implementation of the AT Protocol. The collection of social applications and services that run on top of the protocol can be collectively referred to as "the Atmosphere".

"So we asked: what happens when you can describe the social experience you want and have it built for you?"

"Attie is an agentic social app and custom feed builder. It feels more like having a conversation than configuring software. You describe the sort of posts you want to see, and the coding agent builds the feed you described. Attie lives as a separate app, and using it is entirely your choice. Bluesky will continue to evolve as a social app millions of people rely on. Attie will be where we experiment with agentic social."

Curious to see how this experiment plays out.

"Missile defense is NP-complete."

"The latest conflict in the Middle East has brought missile defense back into the spotlight. There's a lot of discussion regarding interceptor stockpiles, missile stockpiles, and cost. As it turns out, this is a resource allocation problem. The problem is NP-complete, but that's far from the reason why missile defense is a hard problem. To get our bearings, we start with how unreliable a single interceptor actually is."

"Single Shot Probability of Kill (SSPK) is the probability that an individual interceptor successfully intercepts one warhead in a single engagement. It captures sensor accuracy, guidance precision, interceptor quality, etc."


P(track) captures the detection-tracking-classification-command & control pipeline -- the probability that you've "detected the incoming warhead, tracked it with enough precision to commit an interceptor, correctly classified it as a real warhead (i.e., not a decoy), and that your command & control systems are functional."
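Putting the two pieces together: a standard (and my own, not the article's) way to combine P(track) with SSPK is to assume each interceptor in a salvo succeeds or fails independently, which is a strong assumption since real interceptors share sensors and failure modes. A minimal sketch:

```python
# Sketch of engagement kill probability, assuming independent shots.
# The warhead is killed only if it is tracked AND at least one of
# n interceptors assigned to it succeeds.

def engagement_kill_probability(p_track: float, sspk: float, n_shots: int) -> float:
    """P(kill) = P(track) * (1 - (1 - SSPK)^n), assuming independence."""
    p_at_least_one_hit = 1.0 - (1.0 - sspk) ** n_shots
    return p_track * p_at_least_one_hit

# With p_track = 0.9 and SSPK = 0.7, two shots give
# 0.9 * (1 - 0.3**2) = 0.819 -- still an 18% leak rate per warhead.
```

Note how quickly the numbers compound against the defender: even doubling up interceptors on every warhead leaves a meaningful leak probability, which is why shot doctrine matters so much.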

This gets us to the NP-complete part: The Weapon-Target Assignment (WTA) problem:

Given:

I interceptors (weapons)

W warheads (targets)

A value V[j] > 0 for each warhead (the value of the asset that warhead j threatens)

An SSPK matrix p[i,j]: the probability that interceptor i destroys warhead j

Decision variables x[i,j]: whether interceptor i is assigned to warhead j

The objective is to maximize the total expected value of successfully defended assets, subject to each interceptor being assigned to at most one warhead.
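The definition above can be sketched as a brute-force search, which also makes the NP-completeness concrete: the number of assignments grows exponentially in the number of interceptors. This is my own illustration of the stated objective, only feasible for toy instances:

```python
from itertools import product

def wta_brute_force(V, p):
    """Exhaustive search over the Weapon-Target Assignment problem.

    V: list of warhead values V[j]
    p: p[i][j] = probability interceptor i destroys warhead j
    Each interceptor engages at most one warhead (None = hold fire).
    Returns (best expected defended value, best assignment tuple).
    """
    n_interceptors, n_warheads = len(p), len(V)
    best_value, best_assign = -1.0, None
    # (n_warheads + 1)^n_interceptors candidate assignments:
    # the exponential blowup behind the NP-completeness claim.
    for assign in product([None] + list(range(n_warheads)), repeat=n_interceptors):
        total = 0.0
        for j in range(n_warheads):
            p_leak = 1.0  # probability warhead j survives all shots aimed at it
            for i, target in enumerate(assign):
                if target == j:
                    p_leak *= 1.0 - p[i][j]
            total += V[j] * (1.0 - p_leak)
        if total > best_value:
            best_value, best_assign = total, assign
    return best_value, best_assign

# Two interceptors, two warheads: here the optimum splits fire
# across both warheads rather than doubling up on the high-value one.
value, assign = wta_brute_force([10.0, 5.0], [[0.7, 0.6], [0.5, 0.8]])
```

Real systems can't enumerate like this; they rely on heuristics and greedy approximations, which is part of why the allocation problem stays hard even with good interceptors.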

"AI is an incredibly lonely experience", says Dennis Lemm.

"I find myself holding on to a work reality that was shaped by coding together, solving problems together -- and yes, the occasional Nerf gun battle or foosball game was a welcome break for passively mulling over a bug, which back at your desk would often get solved pretty quickly. But even when it didn't, there were always colleagues in the office who could help. Now, with AI being practically omnipresent, it somehow feels even worse. Even in the office, it has become quiet and lonely."

"I just discussed the best solution to the problem with my agent."

Promptle is a game where you guess the AI prompt behind images.

"China's AI companies are playing a different game," says Kyle Chan of the John L. Thornton China Center.

While some are also rushing to build world-class foundation models, other Chinese AI developers are "racing along other axes of progress: efficiency, adoption, and physical integration, driven by both industry constraints and Beijing's policy focus. Taken together, China's approach is a fundamentally different bet on how AI will shape the future."

"While US tech firms have been building out massive compute clusters with hundreds of thousands of chips, Chinese AI labs have been hyperfocused on squeezing greater performance out of limited compute and memory resources. Innovations in algorithmic architecture, such as mixture-of-experts models and efficient attention mechanisms, have allowed Chinese firms such as MiniMax and Moonshot to produce world-class AI models while drastically cutting down compute costs. DeepSeek's V3.2 model, for example, uses a novel sparse attention mechanism to nearly match the performance of OpenAI's GPT-5 and Google's Gemini 3 on complex reasoning and agentic tasks, despite likely having access to far less compute.

"Chinese firms have also been boosting model efficiency through quantization, an engineering approach that involves using less precise but more efficient formats like 8-bit (INT8) or even 4-bit integers (INT4)."

"Most of China's leading AI models are open-source, allowing them to be freely downloaded, customized, and deployed across various platforms." "Chinese AI models have overtaken US models in cumulative downloads on platforms like Hugging Face."

"Another area where China's AI industry is racing forward is the integration of AI into the physical world. Examples abound in consumer products. Chinese electric car makers such as Nio, XPeng, and BYD have integrated voice-powered AI assistants and smart driving capabilities into their vehicles."

"Online bot traffic will exceed human traffic by 2027, Cloudflare CEO says."

What he's talking about is the web searches done when you ask an AI chatbot a question. His thinking is that if you're shopping for a digital camera, you might visit 5 sites, but if you ask an AI chatbot which digital camera you should buy, it might visit 5,000 sites.

5,000 sounds like a bit much but that's what he says.

"Evaluating genuine reasoning in large language models via esoteric programming languages."

"We argue that evaluating models on truly out-of-distribution tasks is essential for measuring genuine reasoning capabilities. Esoteric programming languages offer a principled solution: they have minimal representation in training corpora, making them economically irrational to include in pre-training data, while still requiring the same fundamental computational reasoning (loops, conditionals, state management) as mainstream languages."

Models evaluated were GPT-5.2 (OpenAI), O4-mini-high (OpenAI reasoning model), Gemini 3 Pro (Google), Qwen3-235B (Alibaba), and Kimi K2 Thinking (Moonshot). All models were accessed via API.

The researchers used 5 prompting strategies with increasing complexity:

"Zero-Shot: The model receives only the language documentation, problem description, and test case specifications."

"Few-Shot: Extends zero-shot by prepending 3 solved example programs in the target esoteric language, demonstrating correct syntax and I/O patterns."

"Self-Scaffolding: An iterative approach where the model generates code, receives interpreter feedback (actual vs. expected output, error messages, stack traces), and refines its solution for up to 5 iterations"

"Textual Self-Scaffolding: A two-agent iterative process requiring two LLM API calls per iteration: (1) the coder generates code given the problem and any prior feedback, (2) a separate critic agent analyzes the failing code and interpreter output to provide natural-language debugging guidance."

"ReAct Pipeline: A three-stage approach: (1) a planner model generates a high-level algorithm in pseudocode, (2) a code editor translates the plan into the target esoteric language, and (3) a critic analyzes execution failures and feeds back to the planner."

The programming languages chosen were: Brainfuck (yes, that's really the programming language's name, don't mean to swear but can't help it), Befunge-98, Whitespace, Unlambda, and Shakespeare.

"All five languages are Turing-complete, meaning they can express any computable function. This ensures that failure to solve problems reflects inability to reason about the language's computational model, not inherent language limitations."

"Each language represents a fundamentally different computational paradigm: memory-tape manipulation (Brainfuck), 2D spatial execution (Befunge-98), invisible syntax encoding (Whitespace), pure combinatory logic (Unlambda), and natural-language-like syntax with alien semantics (Shakespeare)."

"All languages have well-documented, open-source interpreters enabling automated evaluation with immediate execution feedback."

"Public repositories are 1,000-100,000x scarcer than for mainstream languages."

So how did they do?

"Accuracy is computed as the percentage of problems solved out of 80 total problems per language (20 Easy + 20 Medium + 20 Hard + 20 Extra-Hard). A critical finding is that all models achieve 0% on Medium, Hard, and Extra-Hard problems across all configurations; success is limited entirely to the Easy tier."

"Performance correlates with training data availability: Befunge-98 achieves the highest accuracy, followed by Brainfuck and Shakespeare. All models achieve 0% on Whitespace and near-zero on Unlambda, revealing a sharp capability boundary for languages with fewer than 200 GitHub repositories."

"Within the Easy tier, top models solve a substantial fraction of problems: GPT-5.2 self-scaffolding solves 9/20 Easy Befunge-98 problems (45%) and 5/20 Easy Brainfuck problems (25%). The Codex agentic system solves 11/20 Easy Brainfuck problems (55%). This tier-level view reveals a sharper empirical story than aggregate accuracy suggests: models are not uniformly failing, but exhibit a hard performance cliff precisely at the boundary between single-loop pattern mapping (Easy) and multi-step algorithmic reasoning (Medium and above), where all models score 0%."

"Few-shot prompting shows no statistically significant improvement over zero-shot."

"Self-scaffolding yields the best overall result: GPT-5.2 achieves 11.2% on Befunge-98. Textual self-scaffolding achieves comparable but slightly lower results, and the ReAct pipeline shows particular strength on Befunge-98 for O4-mini."

"We additionally evaluate two agentic systems with tool access on Brainfuck and Befunge-98."

The two agentic systems are Codex and Claude Code.

"Both agentic systems achieve 2-3x improvement over non-agentic approaches, with Codex reaching 13.8% on Brainfuck, the highest single-language result."

So how good are LLMs at genuine reasoning? Make of this what you will.

"On Monday, Nvidia revealed DLSS 5, the next version of its suite of upscaling and performance-boosting tech used mostly in PC games."

Some reactions from game developers:

"I think [DLSS 5] is the perfect example of the disconnect between what we as developers and gamers want and what the nasty freaks who are destroying the world and consolidating all wealth into the hands of the few using GPUs think we want."

"Aside from the obvious aesthetic issues, one of the other big problems is how DLSS 5 basically sucks the personality out of any artistic choice the devs have made by making average-out guesses of what it thinks things should look like. Like, you're never going to get the devs' actual intent with this thing turned on."

"It feels like a misguided attempt at realism. A style that I personally feel is a dead end. In attempting to make characters appear more human, it removes everything original about their designs, and more often than not, whitewashes them."

"Cursor's 'Composer 2' model is apparently just Kimi K2.5 with RL fine-tuning. Moonshot AI says they never paid or got permission."

D'oh. Cursor caught red-handed.

But it's another indication that the Chinese models are competitive.

"On February 10, 2026, Judge Jed S. Rakoff of the Southern District of New York ruled that extremely sensitive and potentially incriminating open AI searches were not protected by either the attorney-client privilege or the work product doctrine."

"In United States v. Heppner, during a search of the defendant's home, the FBI discovered multiple documents memorializing communications between a criminal defendant and the consumer generative AI platform Claude. The defendant's communications with AI were made (i) to create possible strategies to defend against the government's indictment; (ii) after the defendant learned that he was the subject of the government's investigation; and (iii) without the prompting of counsel."

"Given the sensitive nature of the defendant's AI searches regarding available legal strategies, the defendant's counsel attempted to assert privilege over the defendant's AI communications. The government, in turn, moved for a ruling that the defendant's AI communications were not protected by either the attorney-client privilege or the work product doctrine. In a landmark decision, the court agreed with the government and ruled that the defendant's AI communications were not privileged."

Commentary: No commentary, just thought y'all oughta know.