Boulder Future Salon

Thumbnail
OlmoEarth is a new state-of-the-art Earth observation foundation model family from the Allen Institute for Artificial Intelligence.

If you're wondering how it compares with DeepMind's AlphaEarth, which I told you all about back in August, they say:

"We compared OlmoEarth to Google DeepMind's AlphaEarth Foundations. AlphaEarth Foundations required a different analysis because Google released annualized embeddings, but not the model itself. When we compared the AlphaEarth Foundations and OlmoEarth embeddings using k-Nearest-Neighbors (kNN) on three tasks, we found OlmoEarth performed on par or better than AlphaEarth Foundations. However, once we fine-tuned OlmoEarth, it outperformed AlphaEarth Foundations substantially. This underscores the importance of a platform that makes fine-tuning and model customization as accessible as possible."

And they have a graph, 'Comparison of OlmoEarth embeddings and fine-tuning performance to AlphaEarth Foundations embeddings on three real-world partner tasks spanning classification of crop types in Kenya (Nandi), land-use land-cover classification (AWF), and high precision ecosystem classification (Ecosystem)', that shows OlmoEarth Base (a 90-million-parameter model) getting higher accuracy scores on all three after fine-tuning.

"There is no standard evaluation test suite for remote sensing models. While there are some established standard practices, they are not always followed. To get a more complete picture of the state of foundation modeling we run a comprehensive evaluation effort of OlmoEarth compared to 12 other foundation models on 18 research benchmarks. Further, to evaluate real-world performance we also evaluate models on 19 datasets from 7 partner organizations that are using Earth observation modeling in their work. Following standard practice we evaluate all models using simple transfer learning techniques (kNN and linear probing) as well as full, end-to-end fine-tuning. We evaluate all models using a standard training recipe and sweeping over a variety of parameters and settings, ensuring a fair evaluation.

"OlmoEarth achieves the best performance in 15 of 24 tasks for the kNN/LP evaluation and 20 of 29 tasks for full fine-tuning."

They've also created something called OlmoEarth Platform so you can offload the work of managing GPUs to them.

"The OlmoEarth Platform is an end-to-end solution for organizations who want to harness Earth observation data for the public good."

How does OlmoEarth work?

"Existing foundation model approaches either train in a supervised or unsupervised setting. Some foundation models are trained to predict supervised labels like land cover maps from satellite observations. Other foundation models use the vast quantity of unlabeled data to train in a self-supervised manner. We present a formulation that unifies these approaches into a single task, show that it works well with only observational data, and further improves when we add labels."

"Our unified approach strikes a middle ground between two common approaches in self-supervised learning. Masked autoencoders predict pixel-level reconstructions of masked input while approaches like I-JEPA and Latent Masked Image Modeling (Latent MIM) predict reconstructions in feature space. Masked autoencoders tend to be stable but limited in their feature representations while latent approaches are unstable but produce better features (if they don't crash out during training)."

"Many foundation models build upon work in domains like image or text processing. Earth observation data differs from these domains in having spatially aligned yet highly multi-modal, multi-temporal data. We find that adjusting our masking strategy and loss to account for this unique domain gives us significantly better performance."

"In image or text modeling it is sufficient to randomly mask some portion of the input and have the model reconstruct the input from context. With remote sensing data, because we have aligned data over various modalities and timesteps, a uniform masking strategy over all tokens may be too easy of a task. Any token in the input will have many similar tokens either in space, time, or at a different aligned modality. There's almost too much context unless you use a very high masking ratio. We adjust our masking strategy to limit the amount of context present in any sample and make the problem challenging without resorting to skewed masking ratios."

"Similarly, with our loss formulation we find a small adjustment makes a large difference in downstream performance. Like other SSL approaches in latent space we use a contrastive loss instead of a reconstruction loss. However, contrasting a reconstructed token against all other tokens in a batch, or even in the same sample, leads to many easy negatives given the highly redundant nature of Earth observation data. Instead we contrast tokens only with other tokens in their respective bandset. This focuses the model training on a more challenging but more productive objective."

By "bandset", they mean grouping bands captured at the same resolution together, even if they come from different satellites. Landsat and Sentinel-2 data gets divided and grouped into bandsets.

The data comes from Sentinel-1, Sentinel-2, and Landsat. I'm guessing Sentinel-1A (launched in 2014), same as AlphaEarth -- Sentinel-1B stopped functioning in 2021 -- Sentinel-2A and 2B (launched in 2015 and 2017), and Landsat 8 and 9 (launched in 2013 and 2021). Sentinel-2 has visible light (RGB), near-infrared, and shortwave-infrared bands. Landsat 8 and 9 have visible light (RGB), near-infrared, shortwave-infrared, and thermal infrared bands.

"OlmoEarth is a Vision Transformer (ViT) based encoder-decoder style architecture. It processes a multi-modal image timeseries of aligned satellite images and derived maps. A FlexiViT-style projection layer converts the input data from pixels to tokens with a variable patch size. Positional, temporal, and modality encodings add additional context to the tokens. During training, some portions of the input tokens are masked. The encoder transformer layers attend across space, time, and between modalities to produce embeddings for the input tokens. The decoder predicts representations for the masked input tokens."

"Our pretraining dataset contains 285,288 samples from around the world. Each sample covers a 2.56km x 2.56km spatial region and a one-year time range. For multi-temporal modalities, we use up to 12 timesteps sampled monthly over the course of the year, although many samples contain only a subset of the timesteps and modalities."

"For the above modalities we resample the data to be uniformly 10 meters per pixel. We have experimented with adding NAIP data at 2.5 meter per pixel and ERA5 data at 160 meters per pixel but found no significant improvement on our evaluations."

ERA5 refers to the "fifth generation" version of a dataset of climate data produced by the European Centre for Medium-Range Weather Forecasts. The dataset was created by taking observations made from the ground and in the atmosphere (but not from space), fitting a model to that data, and generating an hour-by-hour dataset of Earth's atmosphere, land, and oceans from 1940 to the present.

NAIP refers to the US Department of Agriculture's National Agriculture Imagery Program, a dataset of aerial photos (taken from aircraft, not from space) covering the continental United States during the agricultural growing season. They didn't use it, though.

"Once the input is in token space, OlmoEarth adds in a 2D sincos positional embedding, a sinusoidal temporal embedding, and a learnable modality embedding to each token. During training, some tokens are masked out of the input, otherwise all tokens are passed to the encoder transformer which performs full self-attention across space, time, and between modalities."

"OlmoEarth uses a modality-aware masking strategy. For every example the masking strategy selects some bandsets to be encoded and also some to be decoded, non-exclusively."

"This masking strategy re-frames the problem slightly from reconstructing data that has been partially masked to reconstructing missing bandsets from partial views of other bandsets."

"This masking strategy re-frames the problem slightly from reconstructing data that has been partially masked to reconstructing missing bandsets from partial views of other bandsets. When all bandsets are encoded and decoded we find the task is too easy. Masked tokens in a bandset will likely have other tokens in the same bandset that are highly correlated with them that are visible in the input, tokens nearby spatially or temporally. Training in this easier paradigm requires using very high masking ratios (i.e. masking out 90% of the input) to get decent results. Masking some bandsets entirely makes the problem harder and we can use more balanced masking ratios."

"OlmoEarth trains on both observations and maps but at inference time we only use observations. Maps can change over time -- indeed downstream tasks are often detecting this kind of change -- so we only rely on observations for inference."

"During training OlmoEarth predicts reconstructions of the masked input in latent space. We use a randomly initialized, frozen projection layer for each modality to project masked patches in the input into token space. Thus OlmoEarth performs Latent Masked Image Modeling, but based on Linear, Invariant Token Embeddings."

"Latent MIM Lite allows us to unify supervised and self-supervised training under the same architecture. We project each modality, whether observations or maps, through a frozen random projection into token space."

"Loss is calculated the same for both types of modalities. We don't need to add on specific predictor heads for supervised data or adjust our training strategy or loss. In our ablations we see this approach gives strong results in a purely self-supervised setting and also benefits from additional supervised data."

At this point you may be confused as to the difference between "token space" and "latent space". The tokenization happens at the point of inputting the satellite images and maps: the FlexiViT-style projection layer mentioned above cuts each image into patches and projects each patch to a vector, a "token". Once that is done, the vision transformer (ViT) encoder-decoder takes over and produces an output in "latent space" (the training approach is called Latent MIM, where "MIM" stands for "Masked Image Modeling"). For training, this "latent space" prediction is compared against the masked input patches projected into token space. The system does not go all the way to trying to reproduce the original satellite images. And if you're wondering what the point is of comparing the output to the input when they're the same, they're not the same -- part of the input is masked, hence the name "Masked Image Modeling".

"Latent MIM uses a contrastive loss (Patch Discrimination) instead of reconstruction loss to incentivize diversity in the latent space predictions. Patch discrimination loss frames token reconstruction as a classification task where we want the predicted token for a patch to be similar to the target token but dissimilar from other ground truth tokens for other patches. Patch discrimination uses cosine similarity to measure token similarity and cross entropy loss to contrast between positive and negative matches."

If you're wondering what "contrastive loss" vs "reconstruction loss" is all about, remember that image generation models like Dall-E (now part of ChatGPT) learn what words go with what images in their training data through contrastive learning. You have a description with a whole bunch of words -- how do you know "tiger" is the important word that needs to be learned, and to words like "the"? With contrastive learning, the system doesn't just compare its output with the expected answer, it compares its output with all the "wrong" answers, with a negative training signal. Because "the" appears everywhere, it washes out, while "tiger" gets associated with pictures that have actual tigers in them.

So what they're doing here is not just comparing the decoder's latent-space prediction with the token for the patch it's supposed to reconstruct, it's also comparing it against the tokens for other patches -- restricted, as described above, to patches in the same bandset -- which get the negative training signal, so contrastive learning takes place.
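
Putting those pieces together, a toy patch-discrimination loss (cosine similarity plus cross entropy, with the negatives restricted to the same bandset) might look like this. Again, a sketch of the general idea, not the authors' implementation:

    # Toy patch-discrimination loss: each predicted token should be most similar
    # (by cosine similarity) to its own target token, and dissimilar from the
    # targets of the other patches in the same bandset.
    import torch
    import torch.nn.functional as F

    def patch_discrimination_loss(pred, target, temperature=0.1):
        # pred, target: (num_patches, dim) for ONE bandset, so every negative
        # comes from the same bandset -- the harder, more productive contrast.
        pred = F.normalize(pred, dim=-1)
        target = F.normalize(target, dim=-1)
        logits = pred @ target.T / temperature     # scaled cosine similarities
        labels = torch.arange(pred.shape[0])       # prediction i matches target i
        return F.cross_entropy(logits, labels)

    pred = torch.randn(64, 256)     # decoder predictions for masked patches
    target = torch.randn(64, 256)   # frozen-projection targets for the same patches
    print(patch_discrimination_loss(pred, target))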

Thumbnail
AI-powered fortune telling?

"Professional, accurate, and fast online fortune telling service, revealing the secrets of your destiny." "Combining traditional fortune telling with AI technology for deeper and more personalized destiny analysis."

Seriously? Alrighty then. I guess someone had to do it. You can add "AI fortune telling" to your list of "things that already happened".

This website is from China, if that indicates anything. It's called Rensheng Daoshi ("life mentor") though the domain name is suanmingzhun.com and the title in English is given as "Fateguide".

Thumbnail
TigerBeetle is a financial transactions database built for correctness "that offers two primitives for double-entry bookkeeping: accounts and transfers. A separate data store, such as Postgres, stores master data, such as name and address of the account holder or terms and conditions of the account."

"This separation enables transfers to scale independently of general purpose master data (for example dealing with Black Friday events) and solves different security, compliance, or retention requirements of the independent data sets (for example enforce immutability of transfers)."

"Just as a bank may have need for both a filing cabinet and a bank vault, Postgres specializes in strings and describing entities (master data), while TigerBeetle specializes in integers and moving integers between these entities."

"Since Postgres and TigerBeetle do not share a transaction boundary, the application must ensure consistency through repeated attempts at completion and coordination, not transactions."

"We must designate a:"

"System of Record. The champion. If the account exists here, the account exists on a system level."

"System of Reference. The supporter. If the account exists here but not in the system of record, the account does not exist on a system level."

"So which system is the system of record and which is the system of reference? That is an architectural decision that depends on your requirements and the properties of the subsystems. In this case, TigerBeetle is the system of record:"

"If the account is present in Postgres, the account is not able to process transfers, so the account in Postgres merely represents a staged record."

"If the account is present in TigerBeetle, the account is able to process transfers, so the account in TigerBeetle represents a committed record."

"Once the system of record is chosen, correctness depends on performing operations in the right order."

"Since the system of reference doesn't determine existence, we can safely write to it first without committing anything. Only when we write to the system of record does the account spring into existence."

"Conversely, when reading to check existence, we must consult the system of record, because reading from the system of reference tells us nothing about whether the account actually exists."

They call this principle "write last, read first" -- that is, relative to the system of record: write to the system of record last, read from the system of record first.
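
In code, the ordering rule looks something like this. The two dictionaries below are just stand-ins for the real stores; this shows the principle, not the actual TigerBeetle or Postgres client APIs:

    # "Write last, read first", with TigerBeetle as the system of record and
    # Postgres as the system of reference. In-memory dicts stand in for both.
    postgres = {}      # system of reference: master data (names, addresses, ...)
    tigerbeetle = {}   # system of record: the account exists here, or not at all

    def create_account(account_id, name, address):
        # 1. Write to the system of reference FIRST. This is only a staged
        #    record; the account does not yet exist at the system level.
        postgres[account_id] = {"name": name, "address": address}
        # 2. Write to the system of record LAST. Only now does the account
        #    exist. Because the two writes share no transaction boundary, this
        #    step must be retried until it succeeds (creation is idempotent).
        tigerbeetle[account_id] = {"balance": 0}

    def account_exists(account_id):
        # Read from the system of record FIRST: only its answer counts.
        # A row in Postgres alone says nothing about whether the account exists.
        return account_id in tigerbeetle

    create_account(42, "Alice", "1 Main St")
    print(account_exists(42))   # True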

I knew distributed transactions were difficult, but I'd never thought of this idea.

Apparently, though, there is one more requirement: serializability, which I take to mean transactions on the system of record have to be queued up single file and processed in sequence. Surely for scalability, the system must have some ability to determine which transactions don't affect one another and can be executed in parallel? Or maybe they just made the system so fast at "moving integers" that it can scale up to the whole globe while maintaining serializability?

"Remarkably, if the system of record provides strict serializability, like TigerBeetle, and if ordering is correctly applied, then the system as a whole preserves strict serializability, leading to a delightful developer experience."

Thumbnail
"Blade Manipulator" trucks use "a hydraulically operated adapter mounted on a 10 axle trailer that can manipulate a single 29,000kg turbine blade around tight corners and over vegetation by raising it on an angle to a maximum of 40 degrees."

Apparently the existing method, putting 80-metre-long turbine blades flat on massive trailers, frequently requires cutting back vegetation around roads or even constructing new, large sweeping roads.

Would be fun to see one of these 'in the wild'.

Thumbnail
"The status quo of AI chip usage, that was once almost entirely US-based, is changing. China's immense progress in open-weight AI development is now being met with rapid domestic AI chip development. In the past few months, highly performant open-weight AI models' inference in China has started to be powered by chips such as Huawei's Ascend and Cambricon, with some models starting to be trained using domestic chips."

"China's chip development correlates highly with stronger export controls from the US Under uncertainty of chip access, Chinese companies have innovated with both chip production and algorithmic advances for compute efficiency in models. Out of necessity, decreased reliance on NVIDIA has led to domestic full stack AI deployments, as seen with Alibaba."

"Compute limitations likely incentivized advancements architecturally, infrastructurally, and in training. Innovations in compute efficiency from open-weight leaders include DeepSeek's introduction of Multi-head Latent Attention (MLA) and Group Relative Policy Optimization (GRPO). A culture of openness encouraged knowledge sharing and improvements in compute efficiency contributed to lower inference costs, evolving the AI economy."

"Domestic silicon's proven sufficiency has sparked demand and models are beginning to be optimized for domestic chips. In parallel, software platforms are shifting as alternatives to NVIDIA's CUDA emerge and challenge NVIDIA at every layer; synergy between AI developers and chip vendors are creating a new, fast-evolving software ecosystem."

Commentary: One wonders about the effectiveness of export controls. The CEO of Nvidia, Jensen Huang, has said he wants to sell chips to China. It's likely his motivation is that he doesn't want to see the market bifurcate in such a way that a credible competitor has the opportunity to arise, and it looks like we may be seeing the beginning of that now.

Thumbnail
"The path to a superhuman AI mathematician."

"Mathematics is the first place where evidence of AI superintelligence is likely to appear, a theoretical computer scientist says."

"Imagine the set of all possible math theorems; only a subset has been proven by human mathematicians. 'A superhuman AI mathematician is one that can prove more theorems than humans have,' said theoretical computer scientist professor Sanjeev Arora from Princeton University."

"Arora, who was awarded the 2011 ACM Prize in Computing for his contributions to computational complexity, algorithms, and optimization, sketched a possible path to a superhuman AI mathematician. He explained that the idea traces back to David Hilbert's early 20th-century dream of automating mathematics. That dream was crushed by the work of Gödel, Turing, and Church, yet it left behind something lasting: the concept of formal proof verification -- the notion that mathematical proofs can be written in a precise language and then rigorously checked by a computer."

"The idea of self-improvement is that you give the AI a large question bank created by humans; it follows many attempts to answer these questions, and the correct answers are used for further training."

"How does it get the correct ones? That's the human feedback. Some humans have labeled them as correct answers. This is the present pipeline. But in math, you can verify the answer with Lean. So, if you just ask AI to produce its proofs in Lean, labeling the correct answers can be done automatically. Even if it is a long proof, we humans can trust that it is correct."

Commentary: Hmm seems plausible. I don't know enough about Lean and other mathematical theorem provers.
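
For a flavor of what machine-checkable proof means in practice, here's a one-line Lean 4 theorem. If the file compiles, the proof is verified, with no human refereeing required. (A trivial example, obviously, not the kind of theorem the article has in mind.)

    -- Lean checks this mechanically: addition of natural numbers is commutative.
    theorem my_add_comm (a b : Nat) : a + b = b + a := Nat.add_comm a b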

Thumbnail
"Over the last few years, Y Combinator has gone through two major transformations at the same time: the AI revolution & Garry Tan returning to YC as President & CEO."

"Since Garry took over in 2023, YC founders have gotten much younger. The average age of founders had been holding steady around ~29 years since 2015, but now it stands at ~26 years and falling."

"I think this is for two reasons: 1) Garry has generally been refocusing YC on what made it successful in the first place, and 2) whenever there is a major technology platform shift underway like AI, the balance of power between youthfulness and experience tends to shift in favor of the former."

"I" here is Jared Heyman.

"The age cohorts of YC founders that have been growing the fastest in recent years are the youngest, and for the first time the past decade, there are more YC founders under the age of 25 years than over!"

"YC founders are also nerdier. At Rebel Fund, we use AI to divide founders into 'technical' vs 'non-technical' backgrounds, and the 'technical' group has really taken over since 2023 -- another return to YC's historical roots."

"More YC founders graduated from top universities than ever in recent years. For the first time last year, more than half of YC founders graduated from a top 20 university."

"In recent years, nearly half of YC founders have worked at a top 20 employer vs only ~25% a decade ago."

"Since Garry took the helm, average YC founder personality profiles have shifted, with much lower levels of Dominance and higher levels of Conscientiousness."

Note "Dominance" and "Conscientiousness" here are references to the DISC personality traits model (which is probably why they're capitalized). DISC stands for "Dominance", "Influence", "Steadiness", "Conscientiousness". If you've heard of the Big Five or HEXACO personalities models, what's different about DISC is that it is supposed to describe how people interact in a workplace environment, not be a general personality model. (Link below if you want to know more.)

"We can also observe Garry's bias towards community-building, with a massive leap in 2023 in the percentage of YC founders who had previously worked at a YC startup."

"There has been a huge increase in the proportion of YC startups located in the Bay, now approaching 85%!"

"So in short, YC has become 'more YC' than ever, with a refocus on young, technical, pedigreed, Silicon Valley-based founders."

Commentary: What I find interesting is that with software engineers, what I constantly hear is that new college graduates and "junior" engineers have it the worst and the industry has demand for "senior" engineers, yet here we see the exact opposite: "whenever there is a major technology platform shift underway like AI, the balance of power between youthfulness and experience tends to shift in favor of the former", and "the age cohorts of YC founders that have been growing the fastest in recent years are the youngest, and for the first time in the past decade, there are more YC founders under the age of 25 years than over!"

Thumbnail
The AI deskilling paradox. Plus: AI is Dunning-Kruger as a service, and Spinning Plates.

These all seem to be about the same thing so I'm grouping them together. I'll start with the article that cites a few studies. The "AI deskilling paradox" says:

"A study published in 2025 in The Lancet of Gastroenterology & Hepatology found that endoscopists who routinely used AI for assistance in colonoscopies performed worse if access to the technology suddenly disappeared. The detection rate for precancerous lesions dropped from 28.4% to 22.4% without AI in the picture."

"Similar issues pop up in law, education, journalism, software development, and other fields. Law professors at Illinois Law School found that students who used chatbots and other forms of GenAI were more prone to critical errors."

"In a 2025 survey conducted by Microsoft Research and Hank Lee, a Carnegie Mellon PhD student, knowledge workers reported that generative AI made tasks seem cognitively easier. But there was a catch: researchers found they were ceding problem-solving expertise to the system and, instead, focusing on functional tasks like gathering and integrating responses. At the same time, they became more confident about using AI. 'It is plausible that high confidence in AI could lead to lower perceived effort.'"

See below for "AI is Dunning-Kruger as a service", which is basically a big rant, and "Spinning Plates", a personal anecdote from a developer using Claude Code.

Thumbnail
"The mechanisms of AI harm: Lessons learned from AI incidents" by Mia Hoffmann.

"AI systems are increasingly implicated in harmful events. Just since the beginning of 2025, 279 new incidents have been added to the AI Incident Database, a nonprofit effort dedicated to tracking realized harm from AI deployment. Since its launch in November 2020, the database has collected and indexed more than 1,200 incidents of harm or near misses involving algorithmic systems and AI. (Each incident in the AI Incident Database corresponds to one or more instances of harm, so the total number of discrete harm events captured in the database is higher than the number of incident IDs.)"

"AI incidents data provide valuable insights for understanding how AI systems can cause real harm. By collecting, indexing, and archiving reports from hundreds of real-world AI incidents, the AI Incident Database has created a treasure trove of data describing not only the myriads of harms AI systems have been implicated in, but also how these harms came to be."

In the "Intentional Harm" category, they have:

"Harm by design: Harm caused by AI systems designed and developed for harmful purposes",

"AI misuse: Use of AI systems for harm against the developers' intentions", and

"Attacks on AI systems: Harm resulting from AI behavior or (in)action caused by cyberattacks".

In the "Unintentional Harm" category, they have:

"AI failures: Harm caused by AI errors, malfunctions, or bias",

"Failures of human oversight: Harm resulting from the failure of human-machine-teams", and

"Integration harm: Harm resulting as an unintended consequence of deployment in a given context".

Going through each of these in turn with an example or two to give you some idea what the category is about:

Harm by Design:

"Some AI systems developed for defense and law enforcement, such as AI-enabled intelligence analysis and battlespace management systems used for targeting or autonomous weapons systems with AI-enabled navigation, computer vision, or terminal guidance, are obvious examples of harm by design systems. No malfunctions or misuse need to occur for harm to materialize when these AI systems are used, since harm is the intended outcome, both by developer and deployer. Militaries may appropriately use these systems against lawful combatants when deployed in accordance with the law of armed conflict. Recent conflicts involving Ukraine and Israel have reportedly seen AI-enabled systems capable of causing harm deployed in combat."

I was immediately thinking of Ukraine. It's estimated 80% of combat deaths are from UAVs. Surely some of those are using AI.

"But harm by design is also prevalent outside of law enforcement and defense contexts. Deepfake apps that allow users to maliciously create nonconsensual intimate imagery (NCII) abound."

"Online marketplaces such as Naver, Coupang, and Amazon India have been accused of engaging in unfair competitive practices through algorithmic manipulation."

AI Misuse:

"In 2023, users of the online forum 4chan created hateful and violent voice impersonations of celebrities using ElevenLabs' voice synthesis AI model. More recently, Microsoft and OpenAI reported on how state-sponsored hackers from North Korea, Iran, Russia, and China had misused ChatGPT for phishing and social engineering attacks targeting defense, cybersecurity, and cryptocurrency sectors. Other investigations revealed that ChatGPT had been misused by cyber criminals to create malware and other malicious software."

Attacks on AI Systems:

"Security researchers have uncovered vulnerabilities in GitHub Copilot that would enable attackers to modify Copilot's responses or leak the developer's data (confidentiality and integrity attacks). Experiments showed that flaws in Tesla's autopilot could be exploited to make the car accelerate and veer into the oncoming traffic lane (an integrity attack). Finally, an investigation found that a divergence attack on ChatGPT could force the system to leak training data, including personal identifiable information such as phone numbers and email and physical addresses (a confidentiality attack)."

"Harm incidents from the AI Incident Database show that in practice, attacks on AI systems are often carried out to evade generative AI model safeguards. This practice, called 'jailbreaking,' relies on prompt injection attacks in which users come up with text prompts that induce the AI model to behave in ways that violate its policies. Prompt injection attacks enabled users to evade ChatGPT's guardrails shortly after its release in order to produce discriminatory and violent content, as well as offer instructions on how to carry out criminal activities."

AI Failures:

"A Medicaid waiver program in Arkansas provided home care for people with disabilities, and the algorithm relied on the responses to a health assessment questionnaire to determine the number of care hours that would be allocated to the beneficiary. Although many recipients' health status and care needs were unchanged from the previous year's assessment, the introduction of the algorithm led to drastic cuts in care hours. Only scrutiny of the algorithm in court revealed errors in how it handled major health issues such as diabetes and cerebral palsy, which caused incorrect calculations for more than 19% of program beneficiaries."

"Generative AI systems have threatened users, created biased images, and spread misinformation. LinkedIn's search algorithm at one point apparently favored men's profiles over women's. AI- powered driver assistance technology has led to numerous accidents, some of them fatal. And there are at least eleven incidents in the database for wrongful arrests caused by faulty facial recognition technology."


Failures of Human Oversight:

"Immigrants to the United Kingdom must demonstrate English-language competence by passing a test called the Test of English for International Communication (TOEIC), which is administered by the international testing organization ETS. In order to detect cheating through proxy test-takers, ETS deployed voice recognition software to determine if the same voice turned up on multiple test recordings. If a test was flagged, two ETS staff had to agree for the test results to be classified as invalid. During the three years of its use, 97% of TOEIC test recordings were flagged as suspicious by the voice recognition AI. Despite this obvious abnormality, reviewers proceeded to classify more than half of these tests as invalid (and all others as questionable), passing the list of invalid tests on to the UK government. Based on these results, UK policymakers canceled accused test- takers' visas and began deportations."

"Being 'on-the-loop' in a partly self-driving car means that the system operates autonomously under the supervision of the driver, who in theory is ready to take control at any point. Humans, however, are not well-equipped to vigilantly but passively monitor a system for long periods of time. As a result, they can miss or react too slowly in situations where they should intervene, leading to accidents. This failure of human-machine teaming can be observed in the dozens of incidents involving AI-enabled cars in the AI Incident Database. The National Highway Traffic Safety Administration (NHTSA), a US government agency charged with investigating major self-driving car manufacturers, has termed the problem 'automation complacency,' a combination of excessive trust in the system's capabilities and the human susceptibility to disengage from monitoring tasks."

Integration Harm:

"An audit of Amazon's algorithms revealed that the platform not only hosts a wide range of products that promote misinformation, such as books containing conspiracy theories about vaccines, but its search engine also exhibits ranking bias in favor of those products. When users search for popular vaccine-related terms, it displays products that misinform about vaccines above those that provide accurate and debunking information. The audit also demonstrated the filter bubble effect of the platform's recommender algorithm: Users that interacted with misinforming products were more likely to encounter misinformation on their Amazon homepage and among their recommended products. In this way, the regular functioning of the search and recommendation algorithms, which intend to promote popular and relevant products to customers, had the unintended side effect of promoting misinformation to Amazon's customers."

"Unintended consequences can emerge as a result of one-sided consideration of stakeholders' needs in algorithmic design. This was the case when Starbucks deployed a scheduling algorithm across its stores. The AI system was created to optimize employees' shift allocation based on predicted store traffic. But its deployment resulted in constantly-changing shift schedules -- often delivered to workers on short notice -- and dramatically varying weekly hours."

"The deployment of AI systems can introduce opportunity costs by disrupting workflows and diverting resources. One example of this is the deployment of ShotSpotter, an acoustic gunfire detection system that was rolled out by the Chicago police department in 2016. An evaluation from 2024 found that in the six years that followed deployment, the AI system led to approximately 70 dispatches per day, a two-fold increase compared to pre-deployment. This increased demand for officer resources affected response times to 911 calls. Officers were dispatched to 911 calls more slowly, arrived at the scene later, and were less likely to arrest the perpetrator. The deployment of the gunfire detection system therefore reduced the effectiveness of the police force in responding to citizens' emergencies in Chicago."

Thumbnail
"More than a million ChatGPT users each week send messages that include 'explicit indicators of potential suicidal planning or intent', according to a blogpost published by OpenAI."

The article does not have a link to the blog post, but I found another media outlet (Techcrunch) with a link (below). The blog post never gives the 1 million figure, but says 0.05% of messages "contain explicit or implicit indicators of suicidal ideation or intent." It doesn't say how many messages a week ChatGPT handles, but someone must have seen a number somewhere and done some arithmetic.

Thumbnail
"Many of the tests used to measure the capabilities and safety of large language models (LLMs) lack scientific rigour" according to "a new study led by the Oxford Internet Institute (OII) at the University of Oxford and involving a team of 42 researchers from leading global institutions including EPFL, Stanford University, the Technical University of Munich, UC Berkeley, the UK AI Security Institute, the Weizenbaum Institute, and Yale University."

"The value of a benchmark depends on whether it is a good proxy for the real-world phenomenon it intends to measure. This property is known as construct validity."

"If a benchmark has high construct validity in measuring 'intelligence', then a model which does well is in some sense 'intelligent', but if the construct validity is low, then a high score may be irrelevant or even misleading."

"Here, we assess practices around the construct validity of LLM benchmarks through a systematic review of 445 articles from leading ML and NLP conferences."

"Almost all articles have weaknesses in at least one area across phenomena, tasks, metrics, and claims."

"The reviewing process resulted in a dataset containing responses to 21 question items on 445 benchmark articles, annotated by 29 experts in the areas of NLP and machine learning."

"The dataset contains information covering all of the stages of the benchmarks, from how they initially define their phenomenon of interest, to which tasks they select in an attempt to measure this phenomenon, to the metrics they use to estimate and compare the performance of language models on these tasks, to the claims they make about their benchmark's ability to accurately measure the phenomenon."

"The reviewed benchmarks cover a wide range of phenomena (Fig. 2 (A)), including areas such as reasoning (18.5%), alignment (8.1%) and code generation (5.7%)."

"The definitions of the phenomena also varied in whether they defined the phenomenon they tested as a composite (61.2%) or a single unified whole (36.5%). For example, some phenomena can be tested alone, (e.g. measuring the ability to traverse a 2D map), while other phenomenon are overarching abilities integrating many sub-abilities (e.g., a model's 'agentic capabilities' requiring sub-abilities such as intent recognition, alignment, and structured output generation)."

"The tasks chosen to measure the target phenomena varied widely, ranging from answering medical licensing exam questions and detecting errors in computer code to reconciling conflicting information on Wikipedia. Less than 10% of benchmarks used complete real-world tasks, such as writing a correct SQL query given a natural language query and a database structure."

"ME: Interesting. I generate SQL queries using AI on a regular basis, and they are usually correct or close enough that I can use them with a few modifications."

"Overall, 40.7% of all reviewed benchmarks make use of constructed tasks, such as reading fictional multi-party conversations and answering questions about the beliefs of the conversation participants to test 'theory of mind', with 28.5% using exclusively constructed tasks. Partially real-world tasks, such as accomplishing e-commerce tasks collected from real people on a mock website, and representative tasks, such as answering exam-style science questions, are used in 32.3% and 36.9% of reviewed benchmarks, respectively."

"Authors most commonly handcrafted new task items (43.3%), followed by reusing data from existing benchmarks (42.6%) and generating data with LLMs (31.2%). Human exams and other pre-existing sources were used in 38.2% of benchmarks."

"The most common metric used to score the benchmarking tasks was exact matching (used at least partially by 81.3%, exclusively by 40.7%). Other commonly used metrics include soft match scores, which have an exact correct answer but allow for partial credit (used at least partially by 20.9%, exclusively by 0.9%), LLM-as-a-judge (at least partially by 17.1%, exclusively by 3.1%), and human ratings (at least partially by 13.0%, exclusively by 1.8%). Once the responses were scored, 16.0% used uncertainty estimates or statistical tests to compare the results."

"To support their results, 53.4% of articles presented evidence for the construct validity of their benchmark."

"Our systematic review of 445 benchmarks reveals prevalent gaps that undermine the construct validity needed to accurately measure targeted phenomena. To address these shortcomings, which can hinder genuine progress, we propose eight recommendations and a practical checklist for designing and interpreting LLM benchmarks."

The complete checklist starts on page 23 of the paper.

Thumbnail
When given IQ tests designed for humans, large language models have increased their top score from about 95 to about 130 in the last year (allegedly). Claude had the lead a year ago, was overtaken by ChatGPT, which was overtaken by Gemini, which was overtaken by ChatGPT, which was overtaken by Grok, which is currently the world's smartest model (until the next leapfrog, which will probably happen any day now). Was DeepSeek not tested?

On another test (a MENSA test), ChatGPT makes it to 148.

Commentary: AI may exceed most humans on most IQ tests, but there still seem to be things that humans can do that AI can't.

Thumbnail
"TabPFN-2.5 Model Report"

TabPFN claims to be the world's first foundation model for tabular data, and I didn't know it existed until the 2.5th release.

"Tabular data is ubiquitous, forming the backbone of decision-making in countless domains, from finance to healthcare. For decades, traditional tabular machine learning -- built on gradient-boosted trees, random forests, and linear or additive models -- has been the workhorse of applied data science. Yet these methods remain limited: they require extensive dataset-specific tuning, often provide uncalibrated or unreliable uncertainty estimates without significant modification, and lack the generalization and transferability of modern foundation models."

"Tabular foundation models (TFMs) offer a new paradigm. They address these limitations by pretraining on large synthetic distributions of tabular tasks and performing inference via in-context learning instead of gradient descent. They are training-free predictors meta-trained to yield strong calibration, without the need for time-consuming and labor-intensive hyperparameter tuning necessary for gradient-boosted trees. Their strong generalization makes them particularly attractive for data-scarce domains."

"Our initial release, TabPFNv1, served as a proof-of-concept that a transformer could learn a Bayesian-like inference algorithm, though it was limited to small (up to 1,000 samples), clean, numerical-only data. Our successor, TabPFNv2, scaled this idea into a practical model for datasets up to 10,000 samples. TabPFNv2 handles the messy and heterogeneous data seen in the real world -- including categorical features, missing values & outliers."

What's new in TabPFN-2.5? Improved performance (outperforming tuned tree-based models like XGBoost, with low inference-time latency) and improved scalability (datasets of up to 50,000 samples with 2,000 features per sample all in one context window).
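
For what it's worth, the workflow looks like ordinary scikit-learn usage, except that "fit" just stores the labeled table as context and prediction is a forward pass of the pretrained transformer, with no gradient descent on your data. A sketch, assuming the tabpfn Python package's scikit-learn-style interface; exact class names, arguments, and defaults may differ between versions:

    # Sketch of the "training-free" in-context workflow on a small public dataset.
    from tabpfn import TabPFNClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = TabPFNClassifier()           # pretrained model, no hyperparameter tuning
    clf.fit(X_train, y_train)          # "fit" = store the table as in-context examples
    print(clf.score(X_test, y_test))   # single forward pass at prediction time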

Thumbnail
The end of the renewable energy honeymoon? asks Zoe Hilton. Starting is easy, the rest is hard?

Wind and solar energy is intermittent, and when output is high, it can exceed the local saturation point, a term defined as the point at which supply of energy exceeds demand in a given local area. At other times, it will fail to meet demand. This problem has three solutions, but all three increase costs.

Solution 1 is to simply waste the excess energy. But having energy to waste means you built out infrastructure whose output you're not fully using, which costs money.

Solution 2 is to move energy through time, also known as storage. But this means you have to build storage systems, such as batteries or pumped hydro, which costs money.

Solution 3 is to move energy through space, also known as transmission lines. Transmission lines, especially if they are transporting massive amounts of energy between far-flung places, cost money.

No country anywhere in the world has found any better solutions. So once wind and solar energy goes beyond the local saturation point, costs go up, and electricity prices go up.

Australia (where this person is) passed its local saturation point some time ago, at around 20% of the country's energy coming from wind and solar. Australia's renewable energy honeymoon is over.

For those of you who prefer to read, rather than watch a video, I've linked to the report below.

Brrrrrp! People say the Center for Independent Studies is "a right wing think tank", well known to Australians, and is "anti-renewables". The question I'm trying to figure out is: did they just make up the numbers? People in Australia are claiming the data is false. One guy in Australia claims to have 84% renewables where he lives (South Australia, around Adelaide) and yet his electricity price (if I've done the Australian dollar to US dollar conversion correctly) is about 15% below what I'm paying here in Denver in the United States.

Thumbnail
"Grokipedia: a first look", by Wikipedia's lesser-known co-founder Larry Sanger. Larry Sanger self-identifies as "conservitarian" (portmanteau of "conservative" + "libertarian"). (Wikipedia's other co-founder, Jimmy Wales, self-identifies as "centrist and gradualist", but supported Lawrence Lessig's 2016 Democratic party presidential campaign and signed on to an open letter urging American voters not to vote for Donald Trump -- according to his Wikipedia page.)

"Last night, I browsed a number of entries and did a deep-dive into an article on the topic on which I am the undisputed leading expert in the entire world: 'Larry Sanger.' I'll tell you what I think of this article, on the reasonable theory that it is fairly representative. Weighing in at 5,901 words, it is longer than the Wikipedia entry (5,592 words by my count), but that includes repetition, which I will explain below. The writing is OK. The Grokipedia generator tends to use longer sentences, leading to a style that is slightly turgid. The style is very much LLM-ese. We all recognize it now: It's readable enough, but often insipid pablum."

"In several cases, inaccuracies went back to bad sources. GIGO."

"Most errors were minor. Often, the problem wasn't so much factual error as dwelling on irrelevancies which might give a human being the wrong idea about some minor detail."

"But some errors were more serious. It says my family was only nominally religious, which is nonsense it hallucinated. It implies my father's scientific profession (seabird biology) was somehow responsible for my becoming an agnostic. It says I found it to be a 'challenge' that there were 'individuals lacking subject expertise' on Wikipedia, which is nonsense; accommodating such individuals was the whole purpose of Wikipedia. It makes it sound, at one point, as if I opposed the whole idea an 'unrestricted open editing' model, when that was the very model I brought to the table with Wikipedia. Some bad journalists have said that, but it was always a lie, and Grokipedia repeats it. There were several more of that type of thing."

"Surprisingly, there was considerable repetition within the article. In fact, the article about me would certainly have been shorter than the Wikipedia one if it had cut out the repetition. There were three summaries of my dissertation. There were two different sections about my conversion to Christianity (one three paragraphs, the other four). There were other repetitions. This seems like an easy fix."

"Vague word salads crop up, and that can be very annoying."

But that is prelude to the key question:

"Is Grokipedia neutral?"

"I built a useful system in Ruby that graded the neutrality of Wikipedia articles."

"In order to run the experiment quickly, I'm simply going to compare the neutrality of Wikipedia versus Grokipedia on a long series of article introductions."

"The data, taken from ChatGPT 4o, is compiled below. 1 is most neutral; 5 is most biased. The remarks in the second and third columns are all generated by ChatGPT 4o, not me."

I'm skipping over the titles but they are all extremely controversial topics.

At the bottom we get to "Average bias rating": 3.5 for Wikipedia, 2.1 for Grokipedia. Remember, larger means more biased.

"According to ChatGPT 4o, which is a competent LLM that is widely perceived to lean to the left, primarily on account of its training data, the Wikipedia articles on these controversial topics, on average, had a bias somewhere between 'emphasizes one side rather more heavily' and 'severely biased.' By contrast, the Grokipedia articles on these topics are said to 'exhibit minor imbalances' on average."

"On these topics, Wikipedia was never wholly neutral, while Grokipedia was entirely neutral (rating of 1) three out of ten times, and was only slightly biased (rating of 2) five other times. Meanwhile, Wikipedia's bias was heavy, severe, or wholly one-sided (rating of 3, 4, or 5) six out of ten times."

"This is not a scientific study, and if Grokipedia boosters present it as one, that will be against my own clear labeling."

Thumbnail
"I'm writing a message to my team. Three sentences. Simple update."

"I read it back. Does this sound right? Is it too direct? Will they think I'm being dismissive?"

"I paste it into Claude. Ask if it sounds okay. Get a response. Tweak it. Ask again."

"Fifteen minutes later, I hit send."

"For a three-sentence Slack message."

"It doesn't make things easier. It makes things worse."

"Because AI doesn't just validate. It rewrites."

"I write something in my voice. Run it through AI. It comes back polished. Professional. Clean."

"And completely not me anymore."

"I send it to my manager for review. He reads it and says:"

"'This is clearly AI generated. The person receiving it won't appreciate that.'"

(Alrighty then.)