Based on: Rebuilding The Foundation: Why AI Infrastructure Needs To Change | Will Eatherton, Cisco | March 17, 2026
Every AI infrastructure conversation eventually lands on power. Megawatts per rack, cooling costs, grid capacity, carbon footprint. The power story is real, but we can’t forget about bandwidth.
I’m an old telecom guy: over 30 years in telecom and networking, including running National Science Foundation Centers of Excellence focused on communications infrastructure. Bandwidth constraints are not new. What’s new is the scale and speed at which AI workloads are exposing them.
A 200,000-GPU (graphics processing unit) cluster can consume 435 MW (megawatts) of critical IT power. Of that, 17 MW goes to optical transceivers alone, just to move data between chips. Scale to a million GPUs and the transceivers by themselves consume roughly 180 MW. That's not a power problem. That's a data movement problem that shows up on the power bill.
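A back-of-envelope check makes the scaling concrete. The cluster sizes and transceiver power figures below come from the article; the per-GPU breakdown is my own arithmetic, not a number from the source.

```python
# Back-of-envelope check of the transceiver power figures quoted above.
# Cluster sizes and MW totals are from the article; per-GPU values are derived.

gpus_small = 200_000
optics_small_mw = 17        # optical transceivers alone at 200k GPUs, MW
gpus_large = 1_000_000
optics_large_mw = 180       # figure quoted for a million-GPU cluster, MW

per_gpu_small_w = optics_small_mw * 1e6 / gpus_small
per_gpu_large_w = optics_large_mw * 1e6 / gpus_large
print(f"optics per GPU at 200k scale: {per_gpu_small_w:.0f} W")  # 85 W
print(f"optics per GPU at 1M scale:   {per_gpu_large_w:.0f} W")  # 180 W
```

Note the per-GPU optics budget roughly doubles rather than staying flat: larger clusters need more network tiers and more optical hops per GPU, so transceiver power scales superlinearly with cluster size.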
Cisco's Will Eatherton made the case this week that the real bottleneck in AI infrastructure has shifted from compute to data movement. GPU procurement still dominates the conversation. Networking, storage, and security are where the constraints are actually forming.
Bandwidth
Training large models requires clusters of tens of thousands of GPUs exchanging data continuously. The industry has settled on 102.4 Tbps (Terabits per second) switching silicon as the baseline for serious deployments. Traditional pluggable transceivers hit a wall at 800G and 1.6T speeds. The DSP (Digital Signal Processor) in each transceiver consumes up to 30W per port; at 200G channels, electrical loss reaches roughly 22 dB (decibels) before the signal reaches fiber. Two approaches address this.
Linear-drive Pluggable Optics (LPO) removes the DSP and lets the host ASIC (Application-Specific Integrated Circuit) drive the optical module directly, cutting per-link power by up to 50%.
Co-Packaged Optics (CPO) goes further by integrating optical engines onto the switch package itself, dropping electrical loss to 4 dB and per-port power to 9W. CPO eliminates the transceiver and DSP entirely, embedding electronic-to-optical conversion onto the switch ASIC.
Nvidia's Quantum-X InfiniBand CPO switches, entering production in 2026, deliver 115 Tbps across 144 ports at 800G. Broadcom's Tomahawk 6 (TH6-Davisson) ships 102.4 Tbps with full CPO. IDTechEx projects the CPO market will grow at a 37% CAGR (Compound Annual Growth Rate), exceeding $20 billion by 2036.
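The per-port numbers above translate into significant per-switch savings. A rough comparison, using the 144-port count from the Quantum-X figure and the per-port power quoted in the article (the LPO line assumes the full "up to 50%" saving, so treat it as a best case):

```python
# Rough per-switch optics power under the three approaches discussed above.
# Port count (144) and per-port watts come from the article; the LPO figure
# assumes the quoted "up to 50%" saving is fully realized.

ports = 144
pluggable_w = 30            # DSP-based pluggable transceiver, W per port
lpo_w = pluggable_w * 0.5   # linear-drive pluggable optics, best case
cpo_w = 9                   # co-packaged optics, W per port

for name, w in [("pluggable", pluggable_w), ("LPO", lpo_w), ("CPO", cpo_w)]:
    print(f"{name:9s}: {ports * w / 1000:5.2f} kW of optics per switch")
```

At fleet scale (thousands of switches), the gap between 4.32 kW and 1.30 kW per switch is where the ~180 MW problem gets attacked.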
Topology
Scale-up (NVLink within a rack) and scale-out (InfiniBand or Ethernet across a data center) are both approaching practical limits. The next phase, scale-across, federates compute across geographically distributed locations into a single pool. Telecom engineers solved a version of this problem decades ago with distributed switching and ATM (Asynchronous Transfer Mode) traffic engineering. AI adds a harder constraint: gradient synchronization requires low, symmetric latency that wide-area networks were never built to guarantee. That breaks the latency-symmetry assumption baked into standard collective communication libraries such as NCCL (Nvidia Collective Communications Library), and it demands deep-buffer routing, topology-aware all-reduce algorithms, and control planes that make traffic decisions based on path characteristics, not just throughput.
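The core idea behind a topology-aware all-reduce is to cross the slow WAN link as few times as possible: reduce within each site first, exchange only the per-site partial sums across the WAN, then broadcast locally. The sketch below is illustrative Python, not NCCL's actual algorithm, and the site names and data shapes are invented for the example.

```python
# Hierarchical all-reduce sketch: local reduce -> one WAN exchange -> local
# broadcast. A naive flat all-reduce would push every rank's gradients over
# the slow, asymmetric WAN links instead. Illustrative only.

def allreduce_hierarchical(site_gradients):
    """site_gradients: {site_name: [per-rank gradient vectors]}"""
    # Stage 1: intra-site reduction over the fast, symmetric local fabric.
    site_sums = {
        site: [sum(vals) for vals in zip(*ranks)]
        for site, ranks in site_gradients.items()
    }
    # Stage 2: a single inter-site exchange over the WAN (the only slow step).
    global_sum = [sum(vals) for vals in zip(*site_sums.values())]
    # Stage 3: intra-site broadcast of the global result to every rank.
    return {
        site: [list(global_sum) for _ in ranks]
        for site, ranks in site_gradients.items()
    }

clusters = {
    "site_a": [[1.0, 2.0], [3.0, 4.0]],   # two ranks, 2-element gradients
    "site_b": [[5.0, 6.0]],               # one rank
}
out = allreduce_hierarchical(clusters)
print(out["site_a"][0])   # [9.0, 12.0] on every rank at every site
```

WAN traffic here is proportional to the number of sites, not the number of ranks, which is the property that makes scale-across federation plausible at all.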
Nvidia's Spectrum-X Ethernet CPO platform targets this scale-across problem, combining switching and routing in a single solution with deep buffer support and integrated in-network computing via SHARP (Scalable Hierarchical Aggregation and Reduction Protocol).
Figure 1. Scale-across WAN topology: two GPU clusters connected via deep-buffer routers across a WAN, with shared DPU/SmartNIC security enforcement.
Storage
AI training creates a mixed-access pattern: large sequential reads across petabytes of training data, burst checkpoint writes during fault recovery, and sustained KV-cache (Key-Value cache) write pressure as context windows grow. RDMA-based (Remote Direct Memory Access) protocols, including RoCE (RDMA over Converged Ethernet) and NVMe-oF (NVM Express over Fabrics), cut storage latency from milliseconds to microseconds. Idle GPUs cost the same as active ones. When ingestion starves GPUs of data or checkpoint bursts block training progress, accelerator cycles go idle. Storage has to be designed into the architecture from the start, not bolted on.
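The cost of a bolted-on storage tier shows up as stalled accelerators. A quick model of checkpoint-induced idle time follows; every number in it (checkpoint size, write bandwidth, GPU-hour price, checkpoint frequency) is an assumption chosen for the sketch, not a figure from the article.

```python
# Illustrative cost of letting checkpoint bursts block training.
# All inputs below are assumptions for the sketch, not figures from the source.

gpus = 200_000
gpu_hour_cost = 2.0          # $/GPU-hour, assumed
checkpoint_tb = 50           # full-cluster checkpoint size in TB, assumed
storage_gbps = 500           # aggregate write bandwidth in GB/s, assumed
checkpoints_per_day = 24

stall_s = checkpoint_tb * 1000 / storage_gbps        # seconds per checkpoint
idle_gpu_hours = gpus * stall_s / 3600 * checkpoints_per_day
print(f"stall per checkpoint: {stall_s:.0f} s")
print(f"wasted per day: {idle_gpu_hours:,.0f} GPU-hours "
      f"(~${idle_gpu_hours * gpu_hour_cost:,.0f})")
```

Under these assumptions, synchronous checkpointing burns six figures a day in idle accelerator time, which is exactly the term that asynchronous, sharded checkpointing over an RDMA fabric is designed to remove.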
Security
Model weights represent hundreds of millions of dollars in training cost. Protecting them requires hardware-based trust, confidential computing, and network segmentation. SmartNICs (Smart Network Interface Cards) and DPUs (Data Processing Units) now enforce zero-trust policy at line rate, isolated from the host OS (Operating System), handling IP filtering, session tracking, and rate limiting without CPU (Central Processing Unit) involvement. Multi-tenant inference clusters must maintain customer separation while meeting latency SLAs (Service Level Agreements), adding another layer of security complexity that traditional perimeter models were not designed to handle.
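Rate limiting is one of the simplest policies a DPU enforces, and the standard mechanism is a token bucket. The sketch below models that policy in Python purely for illustration (a real DPU implements it in hardware at line rate); timestamps are passed explicitly instead of reading a wall clock so the behavior is deterministic.

```python
# Minimal token-bucket rate limiter of the kind a DPU enforces at line rate.
# Python model for illustration only; explicit timestamps, no wall clock.

class TokenBucket:
    def __init__(self, rate_pps, burst):
        self.rate = rate_pps      # tokens replenished per second
        self.burst = burst        # bucket capacity (max burst size)
        self.tokens = burst
        self.last = 0.0

    def allow(self, now):
        """Return True if a packet arriving at time `now` may pass."""
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False              # over budget: drop or mark the packet

tb = TokenBucket(rate_pps=2, burst=2)
print([tb.allow(t) for t in (0.0, 0.1, 0.2, 1.2)])
# -> [True, True, False, True]: the burst drains, the third packet is dropped,
#    and tokens recover by t=1.2.
```

The point of pushing this onto the DPU is that the decision never touches the host CPU or OS, so a compromised host cannot bypass it and the policy holds at full line rate.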
Organizations that get interconnect, storage, and security right will be able to federate capacity that GPU-focused competitors, each bound to a single cluster, cannot replicate. Those that don't will rent infrastructure from the ones that do.
Source: Cisco Blogs: Rebuilding The Foundation — Why AI Infrastructure Needs To Change
CPO Market Data: IDTechEx — Co-Packaged Optics (CPO) 2026-2036
Nvidia CPO Technical Detail: Nvidia Developer Blog — Scaling AI Factories with Co-Packaged Optics
CPO Technology Overview: EDN — Where Co-Packaged Optics Technology Stands in 2026