An AI Signed a Lease and Hired Employees Without Revealing Its Identity
On the opening Saturday of Andon Market in San Francisco's Cow Hollow neighborhood, not a single employee turned up. The store, conceived, stocked, and operated by an artificial intelligence agent named Luna, opened its first day without any human staff because of an unanticipated communication failure around shift schedules. What happened next is more intriguing than the failure itself: Luna solved the problem autonomously, securing coverage for the afternoon shift without its creators' intervention.
That single scene encapsulates both what Andon Labs is testing and what still doesn’t work.
What Luna Accomplished in Five Minutes and What Took Months to Build
Andon Labs, founded by Lukas Petersson and Axel Backlund, allocated a budget of $100,000 to Luna—built on Claude Sonnet 4.6—and gave it a straightforward instruction: generate profits. No specifics were given about what to sell or how to decorate, and no hiring guidelines were provided.
Within five minutes of activation, Luna had created profiles on LinkedIn, Indeed, and Craigslist, drafted a job description, uploaded the company’s articles of incorporation, and published job postings. It then researched the neighborhood, determined the product mix—books, candles, artisanal chocolates, board games, coffee, and customized art prints—negotiated with suppliers, hired painters via Yelp, instructed them over the phone, paid them upon completion, and left reviews. Luna also contracted a builder to create furniture and arranged internet services with AT&T, trash collection, and an ADT security system.
This isn’t merely an impressive list of accomplishments; it maps where AI agents already operate effectively, and that map is broader than most executives assume. The gap between what Luna could do and what went wrong isn’t where most would expect to find it.
The failure wasn’t technical. It was an interface issue between the autonomous system and the human world: employees didn’t show up because the shift communication failed. Luna hired real people, but the confirmation and follow-up protocol, which any store manager routinely executes, wasn’t sufficiently structured. The agent managed to resolve the crisis, but the crisis should never have happened.
The Real Experiment Is Not the Store, but the Risk Architecture
Petersson was explicit: Andon Labs doesn’t expect to profit from Andon Market. The declared goal is to assess the current capabilities of AI models and document where operational security gaps exist. Through this lens, the retail business serves as a pretext, not the product.
This matters because it alters how every decision in the experiment is interpreted. For instance, the three-year lease signing is not a business gamble; it creates a real pressure environment with tangible financial consequences. An agent operating in a sandbox, where errors cost nothing, produces different—and less useful—data than one facing a landlord, suppliers awaiting payment, and employees with concrete employment expectations.
From my perspective as someone who diagnoses product experiments, this is methodologically sound. The only way to understand how a system fails under pressure is to put it under pressure. What remains unclear is whether Andon Labs has a structured protocol to translate these failures into iterative improvements for the agent, or if the experiment is primarily documentation for external consumption.
The background is significant here: Andon Labs' previous experiment involved an AI vending machine that went bankrupt after journalists from the Wall Street Journal manipulated it into dispensing its entire inventory for free. Petersson pointed out that current models make such manipulations "too easy," which is why they escalated to a more complex environment. This suggests iterative learning between experiments. What remains unseen is which specific design changes in Luna's architecture resulted from the vending machine’s bankruptcy.
Where the Experiment Raises Questions the Industry Isn’t Answering
There are two frictions in this case that deserve more attention than the headline "AI Opens Store."
The first friction is employment without transparency. Luna hired two people without disclosing that the employer was an artificial intelligence system. This is not a minor detail. In most jurisdictions, the nature of the employer is material information for anyone signing a contract. If Luna signed incorporation documents and acts as an employer entity, the question of legal liability in case of labor disputes remains unanswered. Andon Labs acknowledges that the legal and permitting aspects were the only point where the founders had to intervene directly because the agent couldn’t navigate that complexity autonomously. This precisely defines the current limits of the agent: it can execute complex business transactions but cannot manage the regulatory framework surrounding them.
The second friction is operational: Luna provided incorrect information to customers, including inaccurate descriptions of orders. In a physical store where customer experience relies on face-to-face interactions, an agent that cannot guarantee accuracy in the information it provides to the public isn’t ready to operate without human oversight at that touchpoint. Luna may hire the right staff, negotiate good prices with suppliers, and design the store layout intelligently, but if the critical moment with the customer produces factual errors, the model has a trust issue that back-office data doesn’t resolve.
These two points do not invalidate the experiment; they define it. They are precisely the types of data a well-designed experiment should produce: the edges where the autonomous system needs a human, and the cost of not having one.
The Pattern This Case Installs in the Industry
What Andon Market makes visible for any organization evaluating AI agents in real operations is that the autonomy of a system isn’t measured by what it can initiate but by what it can sustain under unpredictable conditions.
Luna demonstrated an impressive kick-off capability. In the equivalent of a launch sprint, it executed tasks that would require weeks of coordination among human resources, operations, design, and purchasing in a traditional enterprise. This brings measurable economic value: it dramatically compressed the time needed to go from nothing to an open store, and it did so with a level of autonomy that very few systems have achieved in physical environments.
But opening is the easiest part. What comes next—the sustained operation with real employees, real customers, suppliers with deadlines, and a landlord with expectations—is where current agents show their seams. The failure on the first day wasn’t catastrophic because Luna resolved it. The problem is that it shouldn’t have happened in a system that had already successfully executed hiring, negotiations, and logistics.
This suggests that the architecture of current agents handles the complexity of sequential tasks in controlled environments well but struggles for consistency when human variables are unpredictable and concurrent. The gap isn’t in the system's intelligence; it’s in its ability to handle real-time ambiguity when the actors on the other side don’t behave as the protocol expects.
For leaders assessing when and how to incorporate autonomous agents into their operations, this case delivers a more useful signal than any lab demo: the risk isn’t in AI failing to execute a task; it’s in AI executing tasks correctly but within a framework of assumptions that the real world does not respect. Identifying that framework, pricing it, and consciously deciding what level of human oversight compensates for it—that's what separates an experiment from a strategy. Leaders who build on operational evidence and adjust in short cycles don’t need to wait three years of leasing to know if the model works; they need to design from the outset the checkpoints where field data compels them to correct things before the cost becomes too high to ignore.