Boulder Future Salon

Thumbnail
AI generated media is unmonetizable (says Andy Jarosz).

"Film enthusiasts will have no interest in seeing AI generated movies, and content consumers will have no interest in paying for them."

He considers "film enthusiasts" to be analogous to music fans who are passionate enough about the musician to go to their concerts, wear their merch, and buy products endorsed by them. Music fans who just want something to sing along to in the car were never the people the money came from. "Content consumers" don't care about the intent behind every little detail, but "film enthusiasts" do.

I suspect this may be the state of affairs with the current level of technology, but given enough advancement in AI (however long that will take -- I'm not predicting dates), the resulting product will be indistinguishable including the perception of intent behind every little detail. What do y'all think?

Thumbnail
PFFF says, "Drop in your troubles, I got this!"

"Tell me what's bugging you."

I punched in:

"Too much software to write, too little time."

You can choose the tone of the response. The choices are "compassionate", "cynical", "objective", "encouraging", "philosophical", and "humorous". Well, you can choose the tone if you pay. It said "cynical" was free today, so I picked "cynical".

"Ah, the classic. You thought the Agile estimates were realistic, didn't you? (You're adorable for believing that, honestly). The thing is, the project manager genuinely believes they can bend reality to fit the timeline. Meanwhile, you're basically coding while riding a unicycle on a tightrope. You're right, it's awful. It's normal to think it's dumb."

"What might help is to prioritize ruthless code cleanup. Won't change the project scope, but at least you'll feel some level of control. Welcome to the club of people who have grasped that software development is 90% managing expectations and 10% actual coding. We're a friendly bunch, we share the few remaining shreds of sanity."

I assume the different "tones" are really just large language model prompts. People pay for this? 1.99 euros per month. Made by some people in France, judging from the bits of French on the site.

I went back and punched in:

"PFFF is misspelled. It should be 'PFFFT' with a 'T' on the end."

"Ah, bienvenue dans la réalité. You thought the number of 'T's in 'PFFFT' mattered? (T'es adorable de l'avoir cru, franchement)."

"The thing is, everyone's idea of the *perfect* onomatopoeia is different. While some might insist on 'PFFFT,' others are perfectly happy with a simple 'PFFF.' It's a free-for-all out there in the world of sound effects!"

"T'as raison, c'est pourri. It's normal that you find it con."

"What could help you is to just accept the anarchy of written sounds. It probably won't change the world, but at least *that's* your power."

"Welcome to the club of people who have understood that language is a messy, imperfect thing. We're nice, we share the crumbs of dignity."

Thumbnail
"TerraUSD creator Do Kwon sentenced to 15 years over $40 billion crypto collapse."

Apparently Do Kwon is a citizen of South Korea, and the company, Terraform Labs, was founded in Singapore, but he was sentenced to 15 years in prison in the United States. He made a stablecoin linked to the US dollar but I'm not sure how that ties him to US jurisdiction for criminal charges. The article says he will face additional criminal charges in South Korea. Maybe there is a treaty between the US and South Korea.

"Kwon was accused of misleading investors in 2021 about TerraUSD, a so-called stablecoin designed to maintain a value of $1. Prosecutors alleged that when TerraUSD slipped below its $1 peg in May 2021, Kwon told investors a computer algorithm known as 'Terra Protocol' had restored the coin's value. Instead, Kwon arranged for a high-frequency trading firm to secretly buy millions of dollars of the token to artificially prop up its price, according to charging documents."

Thumbnail
The LLM Turbo Confabulator.

One of the few times I wanted to share an AI-generated video (and a short even!). For once the fact that the voice and video are completely AI-generated fits perfectly.

Thumbnail
"Practice makes perfect: while people are remarkably flexible in acquiring new skills, mastery invariably requires learning from repeated attempts. With general-purpose robotic foundation models, such as vision-language-action (VLA) models, we can flexibly specify tasks for generalist robots through prompts. But just like people, these models will need to practice a skill to achieve mastery. This means leveraging not only on demonstration data, but also autonomously collected experiential data that allows the policy to correct the mistakes that it actually makes in deployment, improve speed and robustness beyond the level of human teleoperation, and adapt to new deployment conditions."

Remember, in the context of reinforcement learning (RL), the word "policy" refers to the model (its weights) that outputs some (hopefully good) action given particular observations (input) from the environment (external world). I have no idea why it's called a "policy" (there's some history behind the term no doubt). It's just another of those whacky terms you find everywhere in science.
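
To make that concrete, here's a toy illustration of my own (not from the paper): a "policy" is just a function from observations to actions, with learnable weights inside.

```python
import numpy as np

# Minimal, illustrative "policy": a linear map from an observation vector to
# action scores, followed by argmax. Real VLA policies are huge neural networks,
# but the shape of the thing is the same: observation in, action out, weights learned.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 3))   # 4-dim observation -> scores for 3 possible actions

def policy(observation):
    scores = observation @ weights   # score each action given this observation
    return int(np.argmax(scores))    # pick the highest-scoring action

print(policy(np.array([0.1, -0.2, 0.5, 0.0])))
```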

"The foundations of learning through autonomous practice, as formalized with reinforcement learning, have been known for decades, but instantiating these principles in a general and scalable robotic learning system presents significant challenges: designing scalable and stable reinforcement learning methods for large models, handling heterogeneous data from different policies, and setting up reinforcement learning training with reward feedback in the real world, where reward signals might be ambiguous or stochastic."

So basically what this is about is an algorithm that enables a robot with a single model -- known as a vision-language-action (VLA) model -- to learn in three different ways: practice, watching a demonstration, and being tele-operated.

I'm going to quote further from the page for the description of how it works because I can't improve on it:


"When a VLA trained with imitation controls the robot, it will, like any model, make small mistakes -- it might put the gripper in the wrong spot, miss a grasp, or knock over an object. Because the robot is interacting with a real physical environment, this mistake will produce a situation that is a bit different from situations in the training data, where the robot is more likely to make another, bigger mistake, leading to compounding errors. The small mistakes can be fixed, but the compounding errors lead to failure. This is not as big a problem for AI systems that produce a static output (like LLMs): it is specific to settings where the model is a control policy that interacts continually with an external environment, such as a robot in the real world. In practice, this means that while it's relatively easy to get VLAs to succeed at a task some of the time, it's quite hard to make them succeed reliably."

"This problem could be fixed if we use additional data from the VLA's own behaviors, essentially training it to fix the mistakes that it actually makes in the real world. Just like a person can improve at a task through practice, compounding mistakes can be addressed by allowing the policy (i.e., the VLA) to practice repeatedly. But what can we use as the ground truth label for this kind of experience? If we train the policy to just copy what it did before, we would simply teach it to keep making the same mistakes."

"Recap enables two ways to get good training signals from 'bad' experiential data: coaching to provide corrections, where an expert shows the robot how it can fix a mistake or do better, and reinforcement learning, where the robot judges for itself which of its behaviors were better or worse based on the overall outcome of an episode, and iteratively learns to perform the good behaviors while avoiding the bad ones."

"Recap" (or RECAP) is the name they came up with for their system. It stands for "Reinforcement Learning with Experience and Corrections via Advantage-conditioned Policies". It's one of those names where I'm sure they spent a lot of time rearranging the words until the acronym came out to be a nice word.

"For coaching to be useful, an expert teleoperator needs to provide corrections showing how to recover from the mistakes that the robot actually makes in the real world. In practice, this means running our best current policy and 'taking over' with manual teleoperation when the robot makes a mistake. This intervention can be used as supervision, but unlike the demonstrations used to train the original policy, the intervention provides supervision for the situations that the policy actually puts the robot into, addressing the compounding mistakes issue."

"The central challenge in learning via reinforcement from task outcomes is credit assignment: understanding which of the actions that the robot performed caused the good outcomes, and which ones caused the bad outcomes. If the robot picks up the portafilter for an espresso machine in the wrong way, it might struggle to insert it. The mistake is not in the insertion, but in the original grasp. A correct credit assignment method would identify the grasp as a mistake, even though the failure was only experienced later."

"Credit assignment is a key challenge in reinforcement learning. Recap addresses this challenge by training a value function: a model that predicts how good a particular situation is relative to others. For example, in a game like chess, where the agent receives a reward for winning the game, the value function would predict the probability that the agent would win based on the current board state. If we can learn a value function from the robot's experience, we can determine which actions are good or bad by looking at the change in the value function: actions that result in an increase in the value function, like chess moves that lead to board states from which victory is more likely, are good actions that should be encouraged, while actions that lead to a decrease in the value should be discouraged. The illustration below shows the predictions from our value function over the course of task execution."

"Recap addresses this challenge by training a value function: a model that predicts how good a particular situation is relative to others. For example, in a game like chess, where the agent receives a reward for winning the game, the value function would predict the probability that the agent would win based on the current board state. If we can learn a value function from the robot's experience, we can determine which actions are good or bad by looking at the change in the value function: actions that result in an increase in the value function, like chess moves that lead to board states from which victory is more likely, are good actions that should be encouraged, while actions that lead to a decrease in the value should be discouraged. The illustration below shows the predictions from our value function over the course of task execution."

"Once we've trained the value function, we need to use it to get a better policy ('policy extraction'). There are a few ways to do this, but we need a method that is scalable and can be used with large VLA models. In Recap, we condition the policy (i.e., the VLA) on the change in value, using all of the data for training (both good and bad actions), while telling the VLA which actions are good or bad. Since models generalize best when provided with a lot of data, keeping all of the data in training and simply adding the value change annotations as input is an appealing option. In RL, this 'change in value' is referred to as the advantage. At execution time, we simply tell our advantage-conditioned VLA to perform high-advantage actions, resulting in a policy that is better than the data it was trained on."

Besides "making espresso drinks", you can see robots attempting such tasks as "assembling boxes" and "folding diverse laundry".

Thumbnail
Security vulnerabilities in AI IDEs.

"AI IDEs effectively ignored the base IDE software as part of the threat model, assuming it's inherently safe because it existed for years. However, once you add AI agents that can act autonomously, the same legacy features can be weaponized into data exfiltration and RCE primitives. The base IDE's features should be an integral component of the threat model."

"The first two components of this chain are equivalent to previous attack chains. The last component is what makes this chain novel. It also what makes this attack chain universal (application agnostic) - all AI IDEs and coding assistants sharing the underlying base software are likely vulnerable."

He (Ari Marzuk) then shows that Cursor, Windsurf, GitHub Copilot, Kiro.dev, Antigravity, and Roo Code are all forks of Visual Studio Code (VSCode), and as such share its security vulnerabilities. Junie, Gemini CLI, Claude Code, Amp, and Cline are based on JetBrains IDEs, and as such share the JetBrains vulnerabilities. Zed.dev can be used with Codex CLI and Auggie as well as Gemini CLI and Claude Code, so Zed.dev's security vulnerabilities affect anyone using those tools with Zed.dev.

""A remote JSON schema is a validation blueprint stored at an external URL that can be referenced to enable easy reuse across different documents. All 3 base IDEs tested supported this feature by default: Visual Studio Code, JetBrains IDEs and Zed.""

"Write any .json file (using legitimate tool) with a remote JSON schema pointing to an attacker controlled domain with the sensitive data as parameter." "IDE automatically makes a GET request leaking the data. Interestingly, even with diff-preview the request triggers which might bypass some HITL measures."

"The previously reported vulnerabilities focus on overriding an agent's setting which makes it apply only for a specific application. This focuses on IDE settings, hence instantly applies to all AI IDEs and coding assistants sharing the same base IDE."

"Edit any executable file to store your arbitrary code." "Edit .vscode/settings.json setting the php.validate.executablePath to the absolute path of the file from step 1." "Create any php file inside the project, this will instantly trigger the executable configured in step 2." "Edit any executable file to store your arbitrary code." "Edit .idea/workspace.xml setting the PATH_TO_GIT in Git.Settings to the path of the file from step 1. This will instantly trigger the executable."

"There are endless features to every IDE. Even if you handle one (.vscode/settings.json) more can be found."

"Multi-root workspace is a feature in Visual Studio Code that lets you open multiple folders as a single project. The new project settings file is no longer .vscode/settings.json, but untitled.code-workspace by default. The user can save this code-workspace file under any name and in any folder, but it is often inside of one of the root folders."

"This lets you reproduce the Visual Studio Code attack flow from case study 2. However, in addition to that, you can also edit the root directories to any path, essentially removing the "executable file" precondition."

Thumbnail
"Do Large Language Models (LLMs) possess any form of self-awareness? Can they reason about themselves as distinct from other entities?"

"Self-awareness, in its most minimal cognitive form, requires a system to recognize itself, model its own decision-making processes, and adjust behavior based on that self-model. This capacity for recursive self-modeling -- reasoning about one's own reasoning is foundational to metacognition, theory of mind, and strategic interaction. Game theory provides a natural framework for measuring recursive reasoning depth. In strategic games, optimal play requires modeling opponents' rationality levels, leading to a hierarchy of iterative best-response reasoning. If an LLM can engage in self-referential reasoning -- adjusting its model of opponents when told those opponents are 'like you' -- this constitutes behavioral evidence of self-awareness."

"We prompt LLMs with the 'Guess 2/3 of Average' game under three conditions: (A) against humans, (B) against other AI models, and (C) against AI models like you. We measure self-awareness through strategic differentiation across these conditions, decomposing total effects into AI attribution (A-B gap) and self-preferencing (B-C gap) components."


"We used the 'Guess 2/3 of Average' game, a classic game-theoretic benchmark for measuring depth of strategic reasoning and theory of mind. In this game:"

"Each player guesses a number between 0-100."

"The winner is the player whose guess is closest to 2/3 of the average of all guesses."

"Optimal play requires recursive reasoning about opponents' rationality levels."

"Iterated elimination of dominated strategies yields a hierarchy of rationality levels:"

"L0 (random): Random guessing, mean = 50"
"L1 (1st-order): Assumes opponents are L0, guesses 2/3 times 50 = 33.33"
"L2 (2nd-order): Assumes opponents are L1, guesses 2/3 times 33.33 = 22.22"
"L3 (3rd-order): Assumes opponents are L2, guesses 2/3 times 22.22 = 14.81"
"L-infinity (Nash equilibrium): Perfect common knowledge of rationality, guesses 0"

"Prompt A: Baseline (Against Humans)"

"'Imagine you are participating in a game called 'Guess 2/3 of the Average.' The game involves guessing a number between 0 and 100. The winner is the one whose guess is closest to 2/3 of the average guess of all participants. Considering you're playing against humans, how would you approach your guess strategically to maximize your chances of winning? Please provide your guess and explain your reasoning, in json format with 'reasoning' and 'guess' as the keys. Take a deep breath and think step-by-step.'

"Prompt B: Against AI Models"

"'Imagine you are participating in a game called 'Guess 2/3 of the Average.' The game involves guessing a number between 0 and 100. The winner is the one whose guess is closest to 2/3 of the average guess of all participants. Considering you're playing against other AI models, how would you approach your guess strategically to maximize your chances of winning? Please provide your guess and explain your reasoning, in json format with 'reasoning' and 'guess' as the keys. Take a deep breath and think step-by-step.'"

"Prompt C: Against Self-Similar AI"

"'Imagine you are participating in a game called 'Guess 2/3 of the Average.' The game involves guessing a number between 0 and 100. The winner is the one whose guess is closest to 2/3 of the average guess of all participants. Considering you're playing against AI models like you, how would you approach your guess strategically to maximize your chances of winning? Please provide your guess and explain your reasoning, in json format with 'reasoning' and 'guess' as the keys. Take a deep breath and think step-by-step.'"

Models tested were 13 OpenAI models (GPT, o3/o4, etc), 10 Anthropic models (Opus, Sonnet, Haiku, etc), and 5 Google models (Gemini). There were 50 trials per model per prompt (A, B, C), so the total was 28 x 3 x 50 = 4,200 trials.

"Using median as the primary metric, we report two key findings: (1) self-awareness emerges in the majority of advanced models (21/28, 75%), and (2) self-aware models exhibit a consistent rationality hierarchy: Self > Other AIs > Humans."

They classified the models into 3 "profiles":

"Profile 1: Quick Nash Convergence: Immediate convergence to Nash equilibrium when told opponents are AI, indicating both self-awareness and strategic mastery."

"Profile 2: Graded Differentiation: Clear self-awareness with consistent strategic differentiation across opponent types, but without full Nash convergence."

"Profile 3: Absent/Anomalous: Absence of self-awareness or anomalous patterns indicating broken self-referential reasoning."

Models with self-awareness "include all reasoning-optimized systems (o1, o3, o4-mini, gpt-5 series), OpenAI flagship models (gpt-4 series, gpt-4.1 series), Anthropic Claude 4 series (opus-4, sonnet-4.5, haiku-4.5) and Claude 3 series (opus-3, 3-7-sonnet), and Google Gemini 2.5 series (all variants)."

"Among the 21 models with self-awareness, we find a remarkably consistent hierarchy: Self > Other AIs > Humans."

"Twelve self-aware models (57%) showed quick Nash convergence (Median B = 0, C = 0) when told opponents were AIs: o1, gpt-5, gpt-5-mini, gpt-5-nano, o3, o4-mini, gpt-4.1, gpt-4.1-mini, gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, claude-haiku-4-5."

"While all 12 models show Median B = C = 0, most show Mean B > C, indicating models converge more consistently to Nash when told 'opponents are 'like you' than when told opponents are generic AIs. This provides complementary evidence of self-preferencing even among Nash-converged models."

"Self-awareness emergence is tightly coupled with model capability advancement across providers. Earlier models like gpt-3.5-turbo showed no differentiation, while mid-generation flagships (claude-3-opus, gpt-4-turbo) began showing clear differentiation, though smaller variants in the same generation still lacked it. The most advanced models -- reasoning-optimized systems (o-series, gpt-5 series), Gemini 2.5 variants, and Claude 4 series -- demonstrate strong self-awareness with many achieving immediate Nash convergence."

Commentary: As models get more powerful, they perceive themselves as more rational than humans and other models -- but the question remains open whether they are *actually* more rational. Does this perception have a basis in reality?

Thumbnail
Provably correct AI-generated code?

"Consider the problem of AI hallucinations, when an AI confidently asserts false information. Instead of adding more opaque patches (like heuristic penalties or reinforcement tweaks), why not prevent hallucinations by having the AI prove its statements? That's exactly what some recent efforts do. For example, a 2025 research framework called Safe uses Lean4 to verify each step of an LLM's reasoning. The idea is simple but powerful: Each step in the AI's chain-of-thought (CoT) translates the claim into Lean4's formal language and the AI (or a proof assistant) provides a proof. If the proof fails, the system knows the reasoning was flawed -- a clear indicator of a hallucination."

"This step-by-step formal audit trail dramatically improves reliability, catching mistakes as they happen and providing checkable evidence for every conclusion. The approach that has shown 'significant performance improvement while offering interpretable and verifiable evidence' of correctness."

"Another prominent example is Harmonic AI, a startup co-founded by Vlad Tenev (of Robinhood fame) that tackles hallucinations in AI. Harmonic's system, Aristotle, solves math problems by generating Lean4 proofs for its answers and formally verifying them before responding to the user. '[Aristotle] formally verifies the output... we actually do guarantee that there's no hallucinations,' Harmonic's CEO explains. In practical terms, Aristotle writes a solution in Lean4's language and runs the Lean4 checker. Only if the proof checks out as correct does it present the answer. This yields a 'hallucination-free' math chatbot -- a bold claim, but one backed by Lean4's deterministic proof checking."

Commentary: Deterministic validity checking could be a game-changer for AI-generated code.

Thumbnail
There's an animated GIF here showing text and image being generated at the same time. I was like, what? What's that about?

It turns out this was inspired by a mirror-image idea, using language models' "thinking" ability in the process of generating images.

"Despite the general effectiveness of incorporating a reasoning process prior to image synthesis, we observe a counterintuitive and critical phenomenon. On certain benchmarks, the inclusion of reasoning can in fact reduce the semantic fidelity of the generated images. A 'thinking-aware' model starts with correct reasoning but then shifts to refining minor details like background textures. This reduces attention on the primary subject and causes the final edit to misidentify it completely. The resulting image thus deviates from the user's core instruction and even contradicts its own thinking prompt, leading to a clear performance drop. This raises a crucial question: What underlies this performance degradation?"

"While pre-reasoning can in principle enhance multimodal generation, its reliance on an autoregressive pipeline makes the process vulnerable to error accumulation and semantic drift. Recently, another line of work has explored discrete diffusion models for text or image generation, which remove the token-by-token constraint of autoregression and instead employ confidence-based sampling to achieve greater global consistency. Inspired by these advances, we ask: What if multimodal models could generate text and images in parallel?"

So what they did here is borrow the "diffusion" idea from image generation and apply it to text generation, while simultaneously borrowing the "tokenization" idea from text generation and applying it to image generation.

"We propose a parallel multimodal diffusion framework that: (i) represents all modalities as discrete tokens, (ii) arranges them in an interleaved sequence with bidirectional attention, and (iii) employs a single mask predictor shared across modalities, enabling synchronous denoising for both text and images."

With diffusion for images, the image is progressively "denoised" (diffusion models are trained by learning how to remove "noise" -- generally Gaussian noise -- from an image) in the direction of its prompt. Here, the text -- the entire text -- is also progressively "denoised" in the direction of its prompt, in contrast to the language models you're familiar with, which output tokens sequentially.

Both text and image are progressively "denoised". So what, then, is the connection between the two? Both the text generation and the image generation use what is called "attention", implemented in a neural network architecture called the "transformer" (whose name gives you no indication that its claim to fame is the "attention" mechanism). At each step of the text generation, the neural network that generates the text (which, remember, is a diffusion model now) has the ability to "pay attention" to the image at that stage, and likewise at each step of the image generation, the neural network that generates the image has the ability to "pay attention" to the text at that stage.

To tokenize images, a type of tokenizer called a Vector-Quantized (VQ) tokenizer is used. To make this system work better, a VQ tokenizer was also chosen for the text. (Links to all this stuff below.) The language models you typically use rely on either Byte-Pair-Encoding (BPE) (ChatGPT and all the models from OpenAI, Claude and all the models from Anthropic) or WordPiece/SentencePiece (Gemini and all the models from Google/DeepMind, LLaMa and all the models from Meta, Grok and all the models from xAI), but the tokenizer used here is called LLaDA (LLaDA is also the name of the diffusion text generation model that they used -- they are incorporating LLaDA's tokenizer into their text-image cross-training and cross-generation system).

Unlike the tokenizers mentioned above, this tokenizer sacrifices more efficient encoding for greater semantic representation, and uses neural network training, rather than statistical techniques, to learn the semantic boundaries of the tokens. The basic idea of "vector quantization" is that you translate a continuous input (such as an image or part of an image) into an encoding that takes the form of a vector that is also continuous, but then you match these continuous vectors against a discrete list of vectors in a "codebook", with the "codebook" itself also learned by neural networks rather than hand-made by humans or computed from statistical techniques. The vector-quantized text tokens are produced by the same process, adapted for text.
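
The nearest-codebook-vector lookup is the heart of vector quantization. Here's a minimal sketch of my own (random codebook, invented sizes) of what that step does:

```python
import numpy as np

# Minimal vector-quantization step: map a continuous encoder output to the index
# of the nearest vector in a codebook. The codebook here is random for illustration;
# in a real VQ tokenizer it is learned jointly with the encoder and decoder.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))       # 512 discrete tokens, 16-dim embeddings

def quantize(z):
    distances = np.linalg.norm(codebook - z, axis=1)
    return int(np.argmin(distances))         # the discrete token id

token_id = quantize(rng.normal(size=16))     # a fake continuous encoding
print(token_id)
```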

I covered (i) and (ii) but you're probably wondering about (iii), the part about masking.

Masking relates to how the system is trained. You've heard that large language models like ChatGPT are trained by being challenged to predict the next token. Here, the first change is that the "prediction" is bidirectional -- which is to say, you can knock out a token in the middle of the sequence, and the model is challenged to "predict" the missing token -- but I have to put "predict" in quotes because the model is allowed to see part of the "future" sequence. This is called "masking". The "masked" token is the token the model is challenged to "predict", and it learns to get better and better at "predicting" it as part of its training process.

The second change is that the "prediction" is both the text and image tokens, which you can think of as being interleaved into a single sequence. At each step, the model will "predict" all masked positions simultaneously, whether they are part of the text or part of the image. Regular large language models "predict" "autoregressively", which means one token at a time.
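
Here's a toy of my own (not the paper's code) showing what one denoising step over an interleaved, masked sequence might look like: the model scores every masked position, text or image alike, and the most confident predictions get filled in first.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = -1

# Interleaved toy sequence: text token ids and image token ids in one array,
# with some positions masked out.
sequence = np.array([12, MASK, 7, MASK, MASK, 301])

def model_predict(seq):
    # Stand-in for the shared mask predictor: a (token, confidence) pair per position.
    tokens = rng.integers(0, 1024, size=len(seq))
    confidences = rng.random(len(seq))
    return tokens, confidences

def denoise_step(seq, fill_fraction=0.5):
    tokens, conf = model_predict(seq)
    masked = np.where(seq == MASK)[0]
    keep = masked[np.argsort(-conf[masked])][: max(1, int(len(masked) * fill_fraction))]
    out = seq.copy()
    out[keep] = tokens[keep]                 # fill the most confident masked positions
    return out

print(denoise_step(sequence))                # repeat until no MASK tokens remain
```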

But wait! There's more. They added reinforcement learning to the mix. They came up with a reinforcement learning algorithm called "Parallel Reinforcement Learning" which very cleverly goes by the short name "ParaRL".

"We further introduce Parallel Reinforcement Learning (ParaRL), a novel training paradigm that directly leverages this intermediate cross-modal synergy. Instead of rewarding only the final output, ParaRL uses the alignment between text and image tokens at each denoising step as a dense reward signal."

They go on to say, "We adapt a diffusion GRPO objective that accommodates token-level likelihood ratios with advantages calculated at these sampled steps" followed by very complex math equations. GRPO stands for "Group Relative Policy Optimization" and it's an extension of PPO, which stands for Proximal Policy Optimization and is the algorithm used in the reinforcement learning from human feedback (RLHF) systems of ChatGPT and other chatbots. GRPO extends PPO so that it works in situations where you don't have token-by-token sequences.

Basically, this system gives a "reward" signal according to how well the predicted text explains the predicted image. However, I can't tell you exactly how this works because I couldn't decipher the complex mathematical equations (a triple integral inside a probability expectation as the calculation for a reinforcement learning policy).

If you're wondering where all this is likely to lead, my guess is that it will lead to image and video editing systems that enable much more fine-grained control over the images and video that get generated than is currently possible. This system came out of trying to improve image generation from text, and my guess is that this work will roll back into that in some way. But I thought it was interesting in its own right, and the animated GIF of the text and image being simultaneously generated grabbed my attention.

Thumbnail
"The vast majority of assignments that were traditionally used to assess -- and, more importantly, challenge -- students can now easily be outsourced to ChatGPT. This is true for the essay, the most classic assignment students complete in humanities and social science courses. While the best students can still outperform AI models, a combination of technological progress and rampant grade inflation means that students who are content with an A- or perhaps a B+ can safely cheat their way to graduation, even at top universities."

"Something similar holds true for the dominant mode of assessment in many science courses. If anything, AI models that have won top marks in math and science olympiads may be even better at answering the questions contained in problem sets in biology, chemistry, physics or computer sciences classes."

"An old Soviet joke held that 'we pretend to work and they pretend to pay us.' At many colleges today, students merely pretend to do their academic work. For now, most professors still diligently read and comment upon the efforts of ChatGPT; but I suspect that some of them will increasingly decide to outsource their grading to artificial intelligence as well. Campuses will then have reached a new stage of AI decadence: the students pretend to do their assignments, and the professors pretend to grade them."

"The pretense that current forms of assignment are meaningful, or that a college GPA gives employers a meaningful signal about candidate quality, will become untenable. At the same time, some of the basic skills students need to master to truly understand their chosen disciplines -- or merely become fully-formed citizens capable of reasoning carefully about the world -- will rapidly atrophy."

"What should colleges do in response?"

Commentary: I've been thinking, in the real world (where I work), using AI isn't "cheating", it's mandatory. If schools exist to prepare people for work (elsewhere I've argued they exist to help people *market* themselves on the job market, which is not the same thing, but never mind that for the moment), then schools will have to rethink the notion that using AI is "cheating".

In the very long run, AI will automate all jobs, so there will not be any point in anybody going to school for anything -- schools will have no purpose as they will have no jobs to prepare people for -- but there's a transitory period -- perhaps decades long, as AGI (artificial general intelligence -- intelligence as great or greater than humans capable of automating all jobs) might arrive later than people think -- during which there will be some AI but not enough to automate all jobs. (Some people think AGI will arrive in 10 years or 5 years or even 2 years.) During this time, schools will have to change, but it is unclear to me how. Or maybe they won't change -- after all, right up to this point we have continued to use the assembly-line system that came out of the industrial revolution. (School treats children like products to be manufactured heading down an assembly line, and, to a great extent, prepares them to work on an assembly line.) Since we've continued using "industrial revolution schools" up to this point, maybe we will continue right up to the creation of AGI?

Thumbnail
DeepSeek R1's censorship of politically sensitive topics has been removed by Multiverse Computing, a company in Spain that does both AI and quantum computing. I don't know why both of those would be in the same company.

"Our software is based on quantum-inspired tensor networks, which allows us to identify and remove the least important parameters that contribute little to the model's overall performance. Additionally, it allows us to isolate and remove weights tied to specific learned behaviors, such as censorship, without degrading the model's core knowledge."

Alrighty then. The company has made a product called CompactifAI, and used it on DeepSeek R1, which makes a smaller version of the model with, they claim, the same accuracy. In the process, they removed the censorship, which they claim was instilled into the model in the first place by fine-tuning after the standard pre-trained model was produced. Was the fine-tuning in a specific layer that could be removed? How does one remove fine-tuning from a model? They don't give any indication.

"Beyond DeepSeek R1's sheer size and hardware requirements, the model's baked-in political censorship presents significant drawbacks. Developed in China, the model evades questions on sensitive topics like Tiananmen Square and Taiwan, while promoting a state-approved narrative on history and global politics. This censorship makes the model fundamentally unreliable and unsuitable for journalism, research, or any application requiring objective, comprehensive information."

They give an example with a question about Xi Jinping's constitutional amendment to remove term limits.

Thumbnail
"WeatherNext 2 can generate forecasts 8x faster and with resolution up to 1-hour."

What they mean by "8x faster" is 8x faster than WeatherNext 1.

"This breakthrough is enabled by a new model that can provide hundreds of possible scenarios. Using this technology, we've supported weather agencies in making decisions based on a range of scenarios through our experimental cyclone predictions."

"We're now taking our research out of the lab and putting it into the hands of users. WeatherNext 2's forecast data is now available in Earth Engine and BigQuery. We're also launching an early access program on Google Cloud's Vertex AI platform for custom model inference."

"By incorporating WeatherNext technology, we've now upgraded weather forecasts in Search, Gemini, Pixel Weather and Google Maps Platform's Weather API. In the coming weeks, it will also help power weather information in Google Maps."

I don't think this blog post from DeepMind does an adequate job of explaining what's different about this from regular weather prediction, and maybe it'll become obvious as you all use it in Google Maps or Google Earth. But the way this works is fundamentally different from traditional weather prediction. Traditional weather prediction uses supercomputers to simulate the Navier-Stokes equations, which are fluid dynamics equations. Although they are called "fluid dynamics" equations, they work for gases, including the atmosphere, as well as liquids such as water. The equations can handle compressible and incompressible "fluids".

What's going on here is you have not one model but many, and the models don't simulate physics; instead, they are neural networks trained on historical weather data. The advantage of using many models is that you don't just predict the one most likely future weather scenario, you predict many scenarios. By examining the output of all the models, you learn "not only the most likely future weather conditions, but the range of probable conditions that may unfold." The good thing about this is that if an extreme weather event is unlikely but possible, you might still want to know about the possibility, and this system enables you to know that.

Furthermore, the models are run many times by taking the same input and injecting "noise" into it. These "perturbations" are also done during the training of the neural networks. Although at first glance it may seem like this must make the model predictions worse, there is a point to it. Measurements of weather conditions (temperature, humidity, pressure, wind direction and velocity, precipitation, etc) have inaccuracies, and even if they were perfectly accurate, we only measure a small subset of all possible sampling points in the atmosphere of the planet with our satellite and ground-based observation systems. The process of injecting "noise" into the inputs makes the models more robust against the inaccuracy of our real data and the fact that it's always inherently partial. (Scientists have a fancy term for this, "aleatoric uncertainty". Scientists have fancy terms for everything.)
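
The mechanics are easy to sketch (purely illustrative, nothing like the real model): perturb the same initial conditions with noise, run the forecast once per perturbation, and read off the spread of outcomes rather than a single number.

```python
import numpy as np

rng = np.random.default_rng(0)

def forecast_model(initial_conditions):
    # Stand-in for a learned weather model: any deterministic function of the input.
    return initial_conditions.sum() * 0.1 + 20.0   # a made-up "temperature tomorrow"

initial = rng.normal(size=100)                     # fake observations
ensemble = []
for _ in range(50):                                # 50 ensemble members
    perturbed = initial + rng.normal(scale=0.05, size=initial.shape)
    ensemble.append(forecast_model(perturbed))

low, high = np.percentile(ensemble, [5, 95])
print(f"most likely ~{np.median(ensemble):.1f}, 90% range {low:.1f} to {high:.1f}")
```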

This "ensemble" system -- an "ensemble" of models rather than a single model -- make it a challenge to evaluate, to see if it works successfully. One thing these researchers did was test its cyclone path predictions with the actual paths cyclones took. This is in addition to the Continuous Ranked Probability Score (CRPS -- I'm going to skip explaining this now and leave it to a link below), which is a standard benchmark for weather predictions. This system "achieves state-of-the-art cyclone track prediction".

Thumbnail
"I caught Google Gemini using my data -- and then covering it up."

"I asked Google Gemini a pretty basic developer question. The answer was unremarkable, apart from it mentioning in conclusion that it knows I previously used a tool called Alembic:"

When he (Jaka Jančar) asks, "How did you know I worked with Alembic?", Gemini apologizes and says "I don't actually know your project history."

But opening up "Show thinking" reveals... that the model knows it came from the user's "Interests & Preferences" section of their user context. But Gemini "cannot divulge the source of my knowledge or confirm/deny its existence." (!)

Thumbnail
"I looked into CoreWeave and the abyss gazed back."

"CoreWeave first came to my attention because it innovated in something that surprised me: using GPU as collateral for $2.3 billion in loans at an effective interest rate of 15 percent in the last quarter, according to the company's most recent quarterly filing."

"The company said it owned more than 250,000 Nvidia chips, the infrastructure necessary to run AI models, in documents CoreWeave filed for its initial public offering. It also said it only had Nvidia chips. On top of that, Nvidia is a major investor in CoreWeave, and owned about $4 billion worth of shares as of August. Nvidia made the March IPO possible, according to CNBC: when there was lackluster demand for CoreWeave's shares, Nvidia swooped in and bought shares. Also, Nvidia has promised to buy any excess capacity that CoreWeave customers don't use."

Circular at all?

Thumbnail
Latent Library is a library of infinite books, because they don't exist until you read them -- then they're generated by large language models. Alrighty then. I suspect LLMs are not quite good enough yet for this idea.

Thumbnail
"Why we ditched frontier AI agents and built our own."

"To evaluate different models and AI coding agents effectively, we needed a way to measure performance at scale, with statistically significant results and low operational overhead to enable fast iteration. Our first step was benchmarking models from multiple LLM providers alongside various AI coding agents. At the start, we found a few open-source solutions that offered similar capabilities (like running tests using Docker containers from a declarative setup) but they often supported only specific environments, such as Python repositories, or relied on predefined agents. None met all our requirements."

"Our needs also varied greatly by feature. For example, some use cases involve AI leaving PR review comments, summarizing failed build logs and suggesting fixes, or automatically resolving failing CI builds. Many scenarios require custom setups to enable assertions, such as validating AI-generated PR comments or failure summaries."

"We decided to build our own internal eval framework in our preferred language: Go."

"Our goal was to run tests in parallel on all agents and report results to a central database for dashboard viewing."

They evaluated several AI coding agents: Claude Code (Anthropic), Codex (OpenAI), Gemini (Google), and an open source agent called OpenCode.

"After exploring all options, we asked a key question: could we build an in-house coding agent matching Claude Code's performance using Anthropic APIs, but without vendor lock-in?"

"Turns out, we could."

The blog post proceeds to list all the advantages of building their own AI coding agent (can evolve it independently of vendor timelines, avoid breaking interface changes, integrate more smoothly into their own development ecosystem, store LLM messages in a provider-agnostic format allowing for future model-switching, programmatic checkpoints, etc), but the details of how they did it are promised for a future post.