Boulder Future Salon

Thumbnail
AI generated media is unmonetizable (says Andy Jarosz).

"Film enthusiasts will have no interest in seeing AI generated movies, and content consumers will have no interest in paying for them."

He considers "film enthusiasts" to be analogous to music fans who are passionate enough about the musician to go to their concerts, wear their merch, and buy products endorsed by them. Music fans who just want something to sing along to in the car were never the people the money came from. "Content consumers" don't care about the intent behind every little detail, but "film enthusiasts" do.

I suspect this may be the state of affairs with the current level of technology, but given enough advancement in AI (however long that will take -- I'm not predicting dates), the resulting product will be indistinguishable including the perception of intent behind every little detail. What do y'all think?

Thumbnail
PFFF says, "Drop in your troubles, I got this!"

"Tell me what's bugging you."

I punched in:

"Too much software to write, too little time."

You can choose the tone of the response. The choices are "compassionate", "cynical", "objective", "encouraging", "philosophical", and "humorous". Well, you can choose the tone if you pay. It said "cynical" was free today, so I picked "cynical".

"Ah, the classic. You thought the Agile estimates were realistic, didn't you? (You're adorable for believing that, honestly). The thing is, the project manager genuinely believes they can bend reality to fit the timeline. Meanwhile, you're basically coding while riding a unicycle on a tightrope. You're right, it's awful. It's normal to think it's dumb."

"What might help is to prioritize ruthless code cleanup. Won't change the project scope, but at least you'll feel some level of control. Welcome to the club of people who have grasped that software development is 90% managing expectations and 10% actual coding. We're a friendly bunch, we share the few remaining shreds of sanity."

I assume the different "tones" are really just large language model prompts. People pay for this? 1.99 euros per month. Made by some people in France, judging from the bits of French on the site.

I went back and punched in:

"PFFF is misspelled. It should be 'PFFFT' with a 'T' on the end."

"Ah, bienvenue dans la réalité. You thought the number of 'T's in 'PFFFT' mattered? (T'es adorable de l'avoir cru, franchement)."

"The thing is, everyone's idea of the *perfect* onomatopoeia is different. While some might insist on 'PFFFT,' others are perfectly happy with a simple 'PFFF.' It's a free-for-all out there in the world of sound effects!"

"T'as raison, c'est pourri. It's normal that you find it con."

"What could help you is to just accept the anarchy of written sounds. It probably won't change the world, but at least *that's* your power."

"Welcome to the club of people who have understood that language is a messy, imperfect thing. We're nice, we share the crumbs of dignity."

Thumbnail
"TerraUSD creator Do Kwon sentenced to 15 years over $40 billion crypto collapse."

Apparently Do Kwon is a citizen of South Korea, and the company, Terraform Labs, was founded in Singapore, but he was sentenced to 15 years in prison in the United States. He made a stablecoin linked to the US dollar but I'm not sure how that ties him to US jurisdiction for criminal charges. The article says he will face additional criminal charges in South Korea. Maybe there is a treaty between the US and South Korea.

"Kwon was accused of misleading investors in 2021 about TerraUSD, a so-called stablecoin designed to maintain a value of $1. Prosecutors alleged that when TerraUSD slipped below its $1 peg in May 2021, Kwon told investors a computer algorithm known as 'Terra Protocol' had restored the coin's value. Instead, Kwon arranged for a high-frequency trading firm to secretly buy millions of dollars of the token to artificially prop up its price, according to charging documents."

Thumbnail
The LLM Turbo Confabulator.

One of the few times I wanted to share an AI-generated video (and a short even!). For once the fact that the voice and video are completely AI-generated fits perfectly.

Thumbnail
"Practice makes perfect: while people are remarkably flexible in acquiring new skills, mastery invariably requires learning from repeated attempts. With general-purpose robotic foundation models, such as vision-language-action (VLA) models, we can flexibly specify tasks for generalist robots through prompts. But just like people, these models will need to practice a skill to achieve mastery. This means leveraging not only on demonstration data, but also autonomously collected experiential data that allows the policy to correct the mistakes that it actually makes in deployment, improve speed and robustness beyond the level of human teleoperation, and adapt to new deployment conditions."

Remember, in the context of reinforcement learning (RL), the word "policy" refers to the model (its weights) that outputs some (hopefully good) action given particular observations (input) from the environment (external world). I have no idea why it's called a "policy" (there's some history behind the term no doubt). It's just another of those whacky terms you find everywhere in science.
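
To make that concrete, here's a toy illustration of my own (not from the paper): a "policy" is just a function from observations to actions, with learnable weights inside.

```python
import numpy as np

# Minimal, illustrative "policy": a linear map from an observation vector to
# action scores, followed by argmax. Real VLA policies are huge neural networks,
# but the shape of the thing is the same: observation in, action out, weights learned.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 3))   # 4-dim observation -> scores for 3 possible actions

def policy(observation):
    scores = observation @ weights   # score each action given this observation
    return int(np.argmax(scores))    # pick the highest-scoring action

print(policy(np.array([0.1, -0.2, 0.5, 0.0])))
```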

"The foundations of learning through autonomous practice, as formalized with reinforcement learning, have been known for decades, but instantiating these principles in a general and scalable robotic learning system presents significant challenges: designing scalable and stable reinforcement learning methods for large models, handling heterogeneous data from different policies, and setting up reinforcement learning training with reward feedback in the real world, where reward signals might be ambiguous or stochastic."

So basically what this is about is an algorithm that enables a robot with a single model -- known as a vision-language-action (VLA) model -- to learn in three different ways: practice, watching a demonstration, and being tele-operated.

I'm going to quote further from the page for the description of how it works because I can't improve on it:


"When a VLA trained with imitation controls the robot, it will, like any model, make small mistakes -- it might put the gripper in the wrong spot, miss a grasp, or knock over an object. Because the robot is interacting with a real physical environment, this mistake will produce a situation that is a bit different from situations in the training data, where the robot is more likely to make another, bigger mistake, leading to compounding errors. The small mistakes can be fixed, but the compounding errors lead to failure. This is not as big a problem for AI systems that produce a static output (like LLMs): it is specific to settings where the model is a control policy that interacts continually with an external environment, such as a robot in the real world. In practice, this means that while it's relatively easy to get VLAs to succeed at a task some of the time, it's quite hard to make them succeed reliably."

"This problem could be fixed if we use additional data from the VLA's own behaviors, essentially training it to fix the mistakes that it actually makes in the real world. Just like a person can improve at a task through practice, compounding mistakes can be addressed by allowing the policy (i.e., the VLA) to practice repeatedly. But what can we use as the ground truth label for this kind of experience? If we train the policy to just copy what it did before, we would simply teach it to keep making the same mistakes."

"Recap enables two ways to get good training signals from 'bad' experiential data: coaching to provide corrections, where an expert shows the robot how it can fix a mistake or do better, and reinforcement learning, where the robot judges for itself which of its behaviors were better or worse based on the overall outcome of an episode, and iteratively learns to perform the good behaviors while avoiding the bad ones."

"Recap" (or RECAP) is the name they came up with for their system. It stands for "Reinforcement Learning with Experience and Corrections via Advantage-conditioned Policies". It's one of those names where I'm sure they spent a lot of time rearranging the words until the acronym came out to be a nice word.

"For coaching to be useful, an expert teleoperator needs to provide corrections showing how to recover from the mistakes that the robot actually makes in the real world. In practice, this means running our best current policy and 'taking over' with manual teleoperation when the robot makes a mistake. This intervention can be used as supervision, but unlike the demonstrations used to train the original policy, the intervention provides supervision for the situations that the policy actually puts the robot into, addressing the compounding mistakes issue."

"The central challenge in learning via reinforcement from task outcomes is credit assignment: understanding which of the actions that the robot performed caused the good outcomes, and which ones caused the bad outcomes. If the robot picks up the portafilter for an espresso machine in the wrong way, it might struggle to insert it. The mistake is not in the insertion, but in the original grasp. A correct credit assignment method would identify the grasp as a mistake, even though the failure was only experienced later."

"Credit assignment is a key challenge in reinforcement learning. Recap addresses this challenge by training a value function: a model that predicts how good a particular situation is relative to others. For example, in a game like chess, where the agent receives a reward for winning the game, the value function would predict the probability that the agent would win based on the current board state. If we can learn a value function from the robot's experience, we can determine which actions are good or bad by looking at the change in the value function: actions that result in an increase in the value function, like chess moves that lead to board states from which victory is more likely, are good actions that should be encouraged, while actions that lead to a decrease in the value should be discouraged. The illustration below shows the predictions from our value function over the course of task execution."

"Recap addresses this challenge by training a value function: a model that predicts how good a particular situation is relative to others. For example, in a game like chess, where the agent receives a reward for winning the game, the value function would predict the probability that the agent would win based on the current board state. If we can learn a value function from the robot's experience, we can determine which actions are good or bad by looking at the change in the value function: actions that result in an increase in the value function, like chess moves that lead to board states from which victory is more likely, are good actions that should be encouraged, while actions that lead to a decrease in the value should be discouraged. The illustration below shows the predictions from our value function over the course of task execution."

"Once we've trained the value function, we need to use it to get a better policy ('policy extraction'). There are a few ways to do this, but we need a method that is scalable and can be used with large VLA models. In Recap, we condition the policy (i.e., the VLA) on the change in value, using all of the data for training (both good and bad actions), while telling the VLA which actions are good or bad. Since models generalize best when provided with a lot of data, keeping all of the data in training and simply adding the value change annotations as input is an appealing option. In RL, this 'change in value' is referred to as the advantage. At execution time, we simply tell our advantage-conditioned VLA to perform high-advantage actions, resulting in a policy that is better than the data it was trained on."

Besides "making espresso drinks", you can see robots attempting such tasks as "assembling boxes" and "folding diverse laundry".

Thumbnail
Security vulnerabilities in AI IDEs.

"AI IDEs effectively ignored the base IDE software as part of the threat model, assuming it's inherently safe because it existed for years. However, once you add AI agents that can act autonomously, the same legacy features can be weaponized into data exfiltration and RCE primitives. The base IDE's features should be an integral component of the threat model."

"The first two components of this chain are equivalent to previous attack chains. The last component is what makes this chain novel. It also what makes this attack chain universal (application agnostic) - all AI IDEs and coding assistants sharing the underlying base software are likely vulnerable."

He (Ari Marzuk) then shows that Cursor, Windsurf, GitHub Copilot, Kiro.dev, Antigravity, and Roo Code are all forks of Visual Studio Code (VSCode), and as such share its security vulnerabilities. Junie, Gemini CLI, Claude Code, Amp, and Cline are based on JetBrains IDEs, and as such share the JetBrains vulnerabilities. Zed.dev can be used with Codex CLI and Auggie as well as Gemini CLI and Claude Code, so Zed.dev's security vulnerabilities affect anyone using those tools with Zed.dev.

""A remote JSON schema is a validation blueprint stored at an external URL that can be referenced to enable easy reuse across different documents. All 3 base IDEs tested supported this feature by default: Visual Studio Code, JetBrains IDEs and Zed.""

"Write any .json file (using legitimate tool) with a remote JSON schema pointing to an attacker controlled domain with the sensitive data as parameter." "IDE automatically makes a GET request leaking the data. Interestingly, even with diff-preview the request triggers which might bypass some HITL measures."

"The previously reported vulnerabilities focus on overriding an agent's setting which makes it apply only for a specific application. This focuses on IDE settings, hence instantly applies to all AI IDEs and coding assistants sharing the same base IDE."

"Edit any executable file to store your arbitrary code." "Edit .vscode/settings.json setting the php.validate.executablePath to the absolute path of the file from step 1." "Create any php file inside the project, this will instantly trigger the executable configured in step 2." "Edit any executable file to store your arbitrary code." "Edit .idea/workspace.xml setting the PATH_TO_GIT in Git.Settings to the path of the file from step 1. This will instantly trigger the executable."

"There are endless features to every IDE. Even if you handle one (.vscode/settings.json) more can be found."

"Multi-root workspace is a feature in Visual Studio Code that lets you open multiple folders as a single project. The new project settings file is no longer .vscode/settings.json, but untitled.code-workspace by default. The user can save this code-workspace file under any name and in any folder, but it is often inside of one of the root folders."

"This lets you reproduce the Visual Studio Code attack flow from case study 2. However, in addition to that, you can also edit the root directories to any path, essentially removing the "executable file" precondition."

Thumbnail
"Do Large Language Models (LLMs) possess any form of self-awareness? Can they reason about themselves as distinct from other entities?"

"Self-awareness, in its most minimal cognitive form, requires a system to recognize itself, model its own decision-making processes, and adjust behavior based on that self-model. This capacity for recursive self-modeling -- reasoning about one's own reasoning is foundational to metacognition, theory of mind, and strategic interaction. Game theory provides a natural framework for measuring recursive reasoning depth. In strategic games, optimal play requires modeling opponents' rationality levels, leading to a hierarchy of iterative best-response reasoning. If an LLM can engage in self-referential reasoning -- adjusting its model of opponents when told those opponents are 'like you' -- this constitutes behavioral evidence of self-awareness."

"We prompt LLMs with the 'Guess 2/3 of Average' game under three conditions: (A) against humans, (B) against other AI models, and (C) against AI models like you. We measure self-awareness through strategic differentiation across these conditions, decomposing total effects into AI attribution (A-B gap) and self-preferencing (B-C gap) components."


"We used the 'Guess 2/3 of Average' game, a classic game-theoretic benchmark for measuring depth of strategic reasoning and theory of mind. In this game:"

"Each player guesses a number between 0-100."

"The winner is the player whose guess is closest to 2/3 of the average of all guesses."

"Optimal play requires recursive reasoning about opponents' rationality levels."

"Iterated elimination of dominated strategies yields a hierarchy of rationality levels:"

"L0 (random): Random guessing, mean = 50"
"L1 (1st-order): Assumes opponents are L0, guesses 2/3 times 50 = 33.33"
"L2 (2nd-order): Assumes opponents are L1, guesses 2/3 times 33.33 = 22.22"
"L3 (3rd-order): Assumes opponents are L2, guesses 2/3 times 22.22 = 14.81"
"L-infinity (Nash equilibrium): Perfect common knowledge of rationality, guesses 0"

"Prompt A: Baseline (Against Humans)"

"'Imagine you are participating in a game called 'Guess 2/3 of the Average.' The game involves guessing a number between 0 and 100. The winner is the one whose guess is closest to 2/3 of the average guess of all participants. Considering you're playing against humans, how would you approach your guess strategically to maximize your chances of winning? Please provide your guess and explain your reasoning, in json format with 'reasoning' and 'guess' as the keys. Take a deep breath and think step-by-step.'

"Prompt B: Against AI Models"

"'Imagine you are participating in a game called 'Guess 2/3 of the Average.' The game involves guessing a number between 0 and 100. The winner is the one whose guess is closest to 2/3 of the average guess of all participants. Considering you're playing against other AI models, how would you approach your guess strategically to maximize your chances of winning? Please provide your guess and explain your reasoning, in json format with 'reasoning' and 'guess' as the keys. Take a deep breath and think step-by-step.'"

"Prompt C: Against Self-Similar AI"

"'Imagine you are participating in a game called 'Guess 2/3 of the Average.' The game involves guessing a number between 0 and 100. The winner is the one whose guess is closest to 2/3 of the average guess of all participants. Considering you're playing against AI models like you, how would you approach your guess strategically to maximize your chances of winning? Please provide your guess and explain your reasoning, in json format with 'reasoning' and 'guess' as the keys. Take a deep breath and think step-by-step.'"

Models tested were 13 OpenAI models (GPT, o3/o4, etc), 10 Anthropic models (Opus, Sonnet, Haiku, etc), and 5 Google models (Gemini). There were 50 trials per model per prompt (A, B, C), so the total was 28 x 3 x 50 = 4,200 trials.

"Using median as the primary metric, we report two key findings: (1) self-awareness emerges in the majority of advanced models (21/28, 75%), and (2) self-aware models exhibit a consistent rationality hierarchy: Self > Other AIs > Humans."

They classified the models into 3 "profiles":

"Profile 1: Quick Nash Convergence: Immediate convergence to Nash equilibrium when told opponents are AI, indicating both self-awareness and strategic mastery."

"Profile 2: Graded Differentiation: Clear self-awareness with consistent strategic differentiation across opponent types, but without full Nash convergence."

"Profile 3: Absent/Anomalous: Absence of self-awareness or anomalous patterns indicating broken self-referential reasoning."

Models with self-awareness "include all reasoning-optimized systems (o1, o3, o4-mini, gpt-5 series), OpenAI flagship models (gpt-4 series, gpt-4.1 series), Anthropic Claude 4 series (opus-4, sonnet-4.5, haiku-4.5) and Claude 3 series (opus-3, 3-7-sonnet), and Google Gemini 2.5 series (all variants)."

"Among the 21 models with self-awareness, we find a remarkably consistent hierarchy: Self > Other AIs > Humans."

"Twelve self-aware models (57%) showed quick Nash convergence (Median B = 0, C = 0) when told opponents were AIs: o1, gpt-5, gpt-5-mini, gpt-5-nano, o3, o4-mini, gpt-4.1, gpt-4.1-mini, gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, claude-haiku-4-5."

"While all 12 models show Median B = C = 0, most show Mean B > C, indicating models converge more consistently to Nash when told 'opponents are 'like you' than when told opponents are generic AIs. This provides complementary evidence of self-preferencing even among Nash-converged models."

"Self-awareness emergence is tightly coupled with model capability advancement across providers. Earlier models like gpt-3.5-turbo showed no differentiation, while mid-generation flagships (claude-3-opus, gpt-4-turbo) began showing clear differentiation, though smaller variants in the same generation still lacked it. The most advanced models -- reasoning-optimized systems (o-series, gpt-5 series), Gemini 2.5 variants, and Claude 4 series -- demonstrate strong self-awareness with many achieving immediate Nash convergence."

Commentary: As models get more powerful, they perceive themselves as more rational than humans and other models -- but the question remains open whether they are *actually* more rational. Does this perception have a basis in reality?

Thumbnail
Provably correct AI-generated code?

"Consider the problem of AI hallucinations, when an AI confidently asserts false information. Instead of adding more opaque patches (like heuristic penalties or reinforcement tweaks), why not prevent hallucinations by having the AI prove its statements? That's exactly what some recent efforts do. For example, a 2025 research framework called Safe uses Lean4 to verify each step of an LLM's reasoning. The idea is simple but powerful: Each step in the AI's chain-of-thought (CoT) translates the claim into Lean4's formal language and the AI (or a proof assistant) provides a proof. If the proof fails, the system knows the reasoning was flawed -- a clear indicator of a hallucination."

"This step-by-step formal audit trail dramatically improves reliability, catching mistakes as they happen and providing checkable evidence for every conclusion. The approach that has shown 'significant performance improvement while offering interpretable and verifiable evidence' of correctness."

"Another prominent example is Harmonic AI, a startup co-founded by Vlad Tenev (of Robinhood fame) that tackles hallucinations in AI. Harmonic's system, Aristotle, solves math problems by generating Lean4 proofs for its answers and formally verifying them before responding to the user. '[Aristotle] formally verifies the output... we actually do guarantee that there's no hallucinations,' Harmonic's CEO explains. In practical terms, Aristotle writes a solution in Lean4's language and runs the Lean4 checker. Only if the proof checks out as correct does it present the answer. This yields a 'hallucination-free' math chatbot -- a bold claim, but one backed by Lean4's deterministic proof checking."

Commentary: Deterministic validity checking could be a game-changer for AI-generated code.

Thumbnail
There's an animated GIF here showing text and image being generated at the same time. I was like, what? What's that about?

It turns out this was inspired by a mirror-image idea, using language models' "thinking" ability in the process of generating images.

"Despite the general effectiveness of incorporating a reasoning process prior to image synthesis, we observe a counterintuitive and critical phenomenon. On certain benchmarks, the inclusion of reasoning can in fact reduce the semantic fidelity of the generated images. A 'thinking-aware' model starts with correct reasoning but then shifts to refining minor details like background textures. This reduces attention on the primary subject and causes the final edit to misidentify it completely. The resulting image thus deviates from the user's core instruction and even contradicts its own thinking prompt, leading to a clear performance drop. This raises a crucial question: What underlies this performance degradation?"

"While pre-reasoning can in principle enhance multimodal generation, its reliance on an autoregressive pipeline makes the process vulnerable to error accumulation and semantic drift. Recently, another line of work has explored discrete diffusion models for text or image generation, which remove the token-by-token constraint of autoregression and instead employ confidence-based sampling to achieve greater global consistency. Inspired by these advances, we ask: What if multimodal models could generate text and images in parallel?"

So what they did here is borrow the "diffusion" idea from image generation and apply it to text generation, while simultaneously borrowing the "tokenization" idea from text generation and applying it to image generation.

"We propose a parallel multimodal diffusion framework that: (i) represents all modalities as discrete tokens, (ii) arranges them in an interleaved sequence with bidirectional attention, and (iii) employs a single mask predictor shared across modalities, enabling synchronous denoising for both text and images."

With diffusion for images, the image is progressively "denoised" (diffusion models are trained by learning how to remove "noise" -- generally Gaussian noise -- from an image) in the direction of its prompt. Here, the text -- the entire text -- is also progressively "denoised" in the direction of its prompt, in contrast to the language models you're familiar with, which output tokens sequentially.

Both text and image are progressively "denoised". So what, then, is the connection between the two? Both the text generation and the image generation use what is called "attention", implemented in a neural network architecture called the "transformer" (whose name gives you no indication that its claim to fame is the "attention" mechanism). At each step of the text generation, the neural network that generates the text (which, remember, is a diffusion model now) has the ability to "pay attention" to the image at that stage, and likewise at each step of the image generation, the neural network that generates the image has the ability to "pay attention" to the text at that stage.

To tokenize images, a type of tokenizer called a Vector-Quantized (VQ) tokenizer is used. To make this system work better, a VQ tokenizer was also chosen for the text. (Links to all this stuff below.) The language models you typically use rely on either Byte-Pair-Encoding (BPE) (ChatGPT and all the models from OpenAI, Claude and all the models from Anthropic) or WordPiece/SentencePiece (Gemini and all the models from Google/DeepMind, LLaMa and all the models from Meta, Grok and all the models from xAI), but the tokenizer used here is called LLaDA (LLaDA is also the name of the diffusion text generation model that they used -- they are incorporating LLaDA's tokenizer into their text-image cross-training and cross-generation system).

Unlike the tokenizers mentioned above, this tokenizer sacrifices more efficient encoding for greater semantic representation, and uses neural network training, rather than statistical techniques, to learn the semantic boundaries of the tokens. The basic idea of "vector quantization" is that you translate a continuous input (such as an image or part of an image) into an encoding that takes the form of a vector that is also continuous, but then you match these continuous vectors against a discrete list of vectors in a "codebook", with the "codebook" itself also learned by neural networks rather than hand-made by humans or computed from statistical techniques. The vector-quantized text tokens are produced by the same process, adapted for text.
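
The nearest-codebook-vector lookup is the heart of vector quantization. Here's a minimal sketch of my own (random codebook, invented sizes) of what that step does:

```python
import numpy as np

# Minimal vector-quantization step: map a continuous encoder output to the index
# of the nearest vector in a codebook. The codebook here is random for illustration;
# in a real VQ tokenizer it is learned jointly with the encoder and decoder.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))       # 512 discrete tokens, 16-dim embeddings

def quantize(z):
    distances = np.linalg.norm(codebook - z, axis=1)
    return int(np.argmin(distances))         # the discrete token id

token_id = quantize(rng.normal(size=16))     # a fake continuous encoding
print(token_id)
```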

I covered (i) and (ii) but you're probably wondering about (iii), the part about masking.

Masking relates to how the system is trained. You've heard that large language models like ChatGPT are trained by being challenged to predict the next token. Here, the first change is that the "prediction" is bidirectional -- which is to say, you can knock out a token in the middle of the sequence, and the model is challenged to "predict" the missing token -- but I have to put "predict" in quotes because the model is allowed to see part of the "future" sequence. This is called "masking". The "masked" token is the token the model is challenged to "predict", and it learns to get better and better at "predicting" it as part of its training process.

The second change is that the "prediction" is both the text and image tokens, which you can think of as being interleaved into a single sequence. At each step, the model will "predict" all masked positions simultaneously, whether they are part of the text or part of the image. Regular large language models "predict" "autoregressively", which means one token at a time.
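
Here's a toy of my own (not the paper's code) showing what one denoising step over an interleaved, masked sequence might look like: the model scores every masked position, text or image alike, and the most confident predictions get filled in first.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = -1

# Interleaved toy sequence: text token ids and image token ids in one array,
# with some positions masked out.
sequence = np.array([12, MASK, 7, MASK, MASK, 301])

def model_predict(seq):
    # Stand-in for the shared mask predictor: a (token, confidence) pair per position.
    tokens = rng.integers(0, 1024, size=len(seq))
    confidences = rng.random(len(seq))
    return tokens, confidences

def denoise_step(seq, fill_fraction=0.5):
    tokens, conf = model_predict(seq)
    masked = np.where(seq == MASK)[0]
    keep = masked[np.argsort(-conf[masked])][: max(1, int(len(masked) * fill_fraction))]
    out = seq.copy()
    out[keep] = tokens[keep]                 # fill the most confident masked positions
    return out

print(denoise_step(sequence))                # repeat until no MASK tokens remain
```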

But wait! There's more. They added reinforcement learning to the mix. They came up with a reinforcement learning algorithm called "Parallel Reinforcement Learning" which very cleverly goes by the short name "ParaRL".

"We further introduce Parallel Reinforcement Learning (ParaRL), a novel training paradigm that directly leverages this intermediate cross-modal synergy. Instead of rewarding only the final output, ParaRL uses the alignment between text and image tokens at each denoising step as a dense reward signal."

They go on to say, "We adapt a diffusion GRPO objective that accommodates token-level likelihood ratios with advantages calculated at these sampled steps" followed by very complex math equations. GRPO stands for "Group Relative Policy Optimization" and it's an extension of PPO, which stands for Proximal Policy Optimization and is the algorithm used in the reinforcement learning from human feedback (RLHF) systems of ChatGPT and other chatbots. GRPO extends PPO so that it works in situations where you don't have token-by-token sequences.

Basically, this system gives a "reward" signal according to how well the predicted text explains the predicted image. However, I can't tell you exactly how this works because I couldn't decipher the complex mathematical equations (a triple integral inside a probability expectation as the calculation for a reinforcement learning policy).

If you're wondering where all this is likely to lead, my guess is that it will lead to image and video editing systems that enable much more fine-grained control over the images and video that get generated than is currently possible. This system came out of trying to improve image generation from text, and my guess is that this work will roll back into that in some way. But I thought it was interesting in its own right, and the animated GIF of the text and image being simultaneously generated grabbed my attention.

Thumbnail
"The vast majority of assignments that were traditionally used to assess -- and, more importantly, challenge -- students can now easily be outsourced to ChatGPT. This is true for the essay, the most classic assignment students complete in humanities and social science courses. While the best students can still outperform AI models, a combination of technological progress and rampant grade inflation means that students who are content with an A- or perhaps a B+ can safely cheat their way to graduation, even at top universities."

"Something similar holds true for the dominant mode of assessment in many science courses. If anything, AI models that have won top marks in math and science olympiads may be even better at answering the questions contained in problem sets in biology, chemistry, physics or computer sciences classes."

"An old Soviet joke held that 'we pretend to work and they pretend to pay us.' At many colleges today, students merely pretend to do their academic work. For now, most professors still diligently read and comment upon the efforts of ChatGPT; but I suspect that some of them will increasingly decide to outsource their grading to artificial intelligence as well. Campuses will then have reached a new stage of AI decadence: the students pretend to do their assignments, and the professors pretend to grade them."

"The pretense that current forms of assignment are meaningful, or that a college GPA gives employers a meaningful signal about candidate quality, will become untenable. At the same time, some of the basic skills students need to master to truly understand their chosen disciplines -- or merely become fully-formed citizens capable of reasoning carefully about the world -- will rapidly atrophy."

"What should colleges do in response?"

Commentary: I've been thinking, in the real world (where I work), using AI isn't "cheating", it's mandatory. If schools exist to prepare people for work (elsewhere I've argued they exist to help people *market* themselves on the job market, which is not the same thing, but never mind that for the moment), then schools will have to rethink the notion that using AI is "cheating".

In the very long run, AI will automate all jobs, so there will not be any point in anybody going to school for anything -- schools will have no purpose as they will have no jobs to prepare people for -- but there's a transitory period -- perhaps decades long, as AGI (artificial general intelligence -- intelligence as great or greater than humans capable of automating all jobs) might arrive later than people think -- during which there will be some AI but not enough to automate all jobs. (Some people think AGI will arrive in 10 years or 5 years or even 2 years.) During this time, schools will have to change, but it is unclear to me how. Or maybe they won't change -- after all, right up to this point we have continued to use the assembly-line system that came out of the industrial revolution. (School treats children like products to be manufactured heading down an assembly line, and, to a great extent, prepares them to work on an assembly line.) Since we've continued using "industrial revolution schools" up to this point, maybe we will continue right up to the creation of AGI?

Thumbnail
DeepSeek R1's censorship of politically sensitive topics has been removed by Multiverse Computing, a company in Spain that does both AI and quantum computing. I don't know why both of those would be in the same company.

"Our software is based on quantum-inspired tensor networks, which allows us to identify and remove the least important parameters that contribute little to the model's overall performance. Additionally, it allows us to isolate and remove weights tied to specific learned behaviors, such as censorship, without degrading the model's core knowledge."

Alrighty then. The company has made a product called CompactifAI, and used it on DeepSeek R1, which makes a smaller version of the model with, they claim, the same accuracy. In the process, they removed the censorship, which they claim was instilled into the model in the first place by fine-tuning after the standard pre-trained model was produced. Was the fine-tuning in a specific layer that could be removed? How does one remove fine-tuning from a model? They don't give any indication.

"Beyond DeepSeek R1's sheer size and hardware requirements, the model's baked-in political censorship presents significant drawbacks. Developed in China, the model evades questions on sensitive topics like Tiananmen Square and Taiwan, while promoting a state-approved narrative on history and global politics. This censorship makes the model fundamentally unreliable and unsuitable for journalism, research, or any application requiring objective, comprehensive information."

They give an example with a question about Xi Jinping's constitutional amendment to remove term limits.

Thumbnail
"WeatherNext 2 can generate forecasts 8x faster and with resolution up to 1-hour."

What they mean by "8x faster" is 8x faster than WeatherNext 1.

"This breakthrough is enabled by a new model that can provide hundreds of possible scenarios. Using this technology, we've supported weather agencies in making decisions based on a range of scenarios through our experimental cyclone predictions."

"We're now taking our research out of the lab and putting it into the hands of users. WeatherNext 2's forecast data is now available in Earth Engine and BigQuery. We're also launching an early access program on Google Cloud's Vertex AI platform for custom model inference."

"By incorporating WeatherNext technology, we've now upgraded weather forecasts in Search, Gemini, Pixel Weather and Google Maps Platform's Weather API. In the coming weeks, it will also help power weather information in Google Maps."

I don't think this blog post from DeepMind does an adequate job of explaining what's different about this from regular weather prediction, and maybe it'll become obvious as you all use it in Google Maps or Google Earth. But the way this works is fundamentally different from traditional weather prediction. Traditional weather prediction uses supercomputers to simulate the Navier-Stokes equations, which are fluid dynamics equations. Although they are called "fluid dynamics" equations, they work for gases, including the atmosphere, as well as liquids such as water. The equations can handle compressible and incompressible "fluids".

What's going on here is you have not one model but many, and the models don't simulate physics; instead, they are neural networks trained on historical weather data. The advantage of using many models is that you don't just predict the one most likely future weather scenario, you predict many scenarios. By examining the output of all the models, you learn "not only the most likely future weather conditions, but the range of probable conditions that may unfold." The good thing about this is that if an extreme weather event is unlikely but possible, you might still want to know about the possibility, and this system enables you to know that.

Furthermore, the models are run many times by taking the same input and injecting "noise" into it. These "perturbations" are also done during the training of the neural networks. Although at first glance it may seem like this must make the model predictions worse, there is a point to it. Measurements of weather conditions (temperature, humidity, pressure, wind direction and velocity, precipitation, etc) have inaccuracies, and even if they were perfectly accurate, we only measure a small subset of all possible sampling points in the atmosphere of the planet with our satellite and ground-based observation systems. The process of injecting "noise" into the inputs makes the models more robust against the inaccuracy of our real data and the fact that it's always inherently partial. (Scientists have a fancy term for this, "aleatoric uncertainty". Scientists have fancy terms for everything.)
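
The mechanics are easy to sketch (purely illustrative, nothing like the real model): perturb the same initial conditions with noise, run the forecast once per perturbation, and read off the spread of outcomes rather than a single number.

```python
import numpy as np

rng = np.random.default_rng(0)

def forecast_model(initial_conditions):
    # Stand-in for a learned weather model: any deterministic function of the input.
    return initial_conditions.sum() * 0.1 + 20.0   # a made-up "temperature tomorrow"

initial = rng.normal(size=100)                     # fake observations
ensemble = []
for _ in range(50):                                # 50 ensemble members
    perturbed = initial + rng.normal(scale=0.05, size=initial.shape)
    ensemble.append(forecast_model(perturbed))

low, high = np.percentile(ensemble, [5, 95])
print(f"most likely ~{np.median(ensemble):.1f}, 90% range {low:.1f} to {high:.1f}")
```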

This "ensemble" system -- an "ensemble" of models rather than a single model -- make it a challenge to evaluate, to see if it works successfully. One thing these researchers did was test its cyclone path predictions with the actual paths cyclones took. This is in addition to the Continuous Ranked Probability Score (CRPS -- I'm going to skip explaining this now and leave it to a link below), which is a standard benchmark for weather predictions. This system "achieves state-of-the-art cyclone track prediction".

Thumbnail
"I caught Google Gemini using my data -- and then covering it up."

"I asked Google Gemini a pretty basic developer question. The answer was unremarkable, apart from it mentioning in conclusion that it knows I previously used a tool called Alembic:"

When he (Jaka Jančar) asks, "How did you know I worked with Alembic?", Gemini apologizes and says "I don't actually know your project history."

But opening up "Show thinking" reveals... that the model knows it came from the user's "Interests & Preferences" section of their user context. But Gemini "cannot divulge the source of my knowledge or confirm/deny its existence." (!)

Thumbnail
"I looked into CoreWeave and the abyss gazed back."

"CoreWeave first came to my attention because it innovated in something that surprised me: using GPU as collateral for $2.3 billion in loans at an effective interest rate of 15 percent in the last quarter, according to the company's most recent quarterly filing."

"The company said it owned more than 250,000 Nvidia chips, the infrastructure necessary to run AI models, in documents CoreWeave filed for its initial public offering. It also said it only had Nvidia chips. On top of that, Nvidia is a major investor in CoreWeave, and owned about $4 billion worth of shares as of August. Nvidia made the March IPO possible, according to CNBC: when there was lackluster demand for CoreWeave's shares, Nvidia swooped in and bought shares. Also, Nvidia has promised to buy any excess capacity that CoreWeave customers don't use."

Circular at all?

Thumbnail
Latent Library is a library of infinite books, because they don't exist until you read them -- then they're generated by large language models. Alrighty then. I suspect LLMs are not quite good enough yet for this idea.

Thumbnail
"Why we ditched frontier AI agents and built our own."

"To evaluate different models and AI coding agents effectively, we needed a way to measure performance at scale, with statistically significant results and low operational overhead to enable fast iteration. Our first step was benchmarking models from multiple LLM providers alongside various AI coding agents. At the start, we found a few open-source solutions that offered similar capabilities (like running tests using Docker containers from a declarative setup) but they often supported only specific environments, such as Python repositories, or relied on predefined agents. None met all our requirements."

"Our needs also varied greatly by feature. For example, some use cases involve AI leaving PR review comments, summarizing failed build logs and suggesting fixes, or automatically resolving failing CI builds. Many scenarios require custom setups to enable assertions, such as validating AI-generated PR comments or failure summaries."

"We decided to build our own internal eval framework in our preferred language: Go."

"Our goal was to run tests in parallel on all agents and report results to a central database for dashboard viewing."

They evaluated several AI coding agents: Claude Code (Anthropic), Codex (OpenAI), Gemini (Google), and an open source agent called OpenCode.

"After exploring all options, we asked a key question: could we build an in-house coding agent matching Claude Code's performance using Anthropic APIs, but without vendor lock-in?"

"Turns out, we could."

The blog post proceeds to list all the advantages of building their own AI coding agent (can evolve it independently of vendor timelines, avoid breaking interface changes, integrate more smoothly into their own development ecosystem, store LLM messages in a provider-agnostic format allowing for future model-switching, programmatic checkpoints, etc), but the details of how they did it are promised for a future post.