This article shares my experience conducting a large-scale data migration from a legacy order system into Mercari’s Global Foundation — a new unified platform designed to support multiple countries. The challenge: I had no prior experience with the legacy system, limited documentation, and precious few engineers familiar with it. To bridge the gap, I turned to Claude Code, not as a code generator, but as a collaborator.
Claude became part of nearly every step — from understanding unfamiliar codebases, to mapping database schemas and API flows, to drafting detailed technical designs and implementing them across services. By carefully managing Claude’s context, giving it "escape hatches" and otherwise setting it up for success, I was able to offload repetitive work while focusing my time on design and logic, the things I enjoy the most in my software engineering work.
The result: what normally takes weeks took days. About 9,000 lines of code were generated and integrated across five services. What I learned is that AI doesn’t replace engineering intuition — it multiplies it. Used intentionally, AI can become your enabler, accelerating discovery and design while leaving the creative, judgment-heavy work to humans.
Intro
When I started this project, I was working alone. It wasn’t clear if or when I’d get another engineer to join, and yet the scope was large: migrate years of orders from a partially global legacy system into our new Global Foundation stack. The goal was clear, but the system itself was not.
I set up meetings with engineers who had worked on the legacy system and read every document I could find. The pattern was familiar to anyone who’s worked with old systems: missing documentation, original authors long gone, busy schedules delaying syncs. I did have a document from my predecessor describing, at a high level, what needed to happen — compare database schemas, evaluate whether existing APIs expose all required data, add or improve endpoints, and build the migration logic.
Good Question is Half an Answer
That gave me a direction, but not much more. So I cloned the legacy system’s repo and started asking Claude Code:
"What tables exist in this order system? What fields do they have?"
"Which APIs expose these fields?"
That second question turned out to be way too broad. Claude gave me an incomplete answer, covering approximately 30% of the fields. I had to adjust: instead of asking it to research, I asked it to work through a more structured task.
I asked Claude to generate a list of all order-related database fields in the format:
table.field
table.field
...
I was blown away by how quickly it produced the list. I started thinking the implementation would be this smooth, and I’d have the migration done before lunch.
Sweet Liar
Then I asked a fresh Claude session:
"For each field, search whether it’s returned by any API. Return results in this format: table.field: API1/field[].accessor, API2/…"
Claude came back with a neat mapping. Every field matched to an API endpoint. Clean, comprehensive, perfect. Too perfect.
- order_items.cancel_reason: GetOrder/detail.cancellation_reason, GetOrderV2/items[].item_cancel_reason, ListOrders/orders[].detail.cancellation_reason
- order_payments.currency_code: GetOrderV2/order_payments.currency_code
- order_payments.rate: GetOrderV2/detail.payments.exchange_rate
- order_payments.item_price: GetOrder/detail.item_price, GetOrderV2/detail.item_price
I looked closer. order_payments.rate was listed as exposed by the GetOrderV2 API. But I remembered an engineer mentioning in passing that exchange rates were stored in the database only, never returned to clients. I checked the actual API response. Not there. On closer inspection, some other fields didn’t make much sense either.
Claude hallucinated, filling the gaps with confident guesses.
That’s when I realized I needed to give it permission to admit uncertainty. I rephrased:
"For each field, search whether it’s returned by any API. Return results in this format: table.field: API1/field[].accessor, API2/…, or None (if not exposed)"
That small addition — "or None (if not exposed)" — changed everything. It gave Claude explicit permission to say "I don’t know" instead of making something up. I call it an escape hatch.
With this structure, Claude could produce consistent, auditable results. What would have taken me hours of grepping through code, I could now do in seconds — as long as I verified the claims.
Disclaimer
When I ran the same query roughly four months later, while writing this post, Claude correctly flagged fields that are not exposed by any API:
- orders.is_user_pickup_enabled, (internal use only – not exposed in API responses)
But even today, the escape hatch trick is useful to prevent Claude from spiraling when it encounters an impossible task.
Lazy Robot
Excited by my new Claude-enabled legacy code comprehension powers, I volunteered to help investigate what remains of our old PHP monolith. I needed to find every place where the Item model is saved to the database inside a transaction, and whether any other tables are written at the same time.
I knew that just asking Claude to find this for me would be useless. But I tried anyway. It grepped for save, saw too many matches, and tried to add heuristics like item->save(), items->save(), and so on. The approach was too non-deterministic, too unreliable. And that was just the easy part.
A better way would be to use Phan, a static analyzer for PHP that we were already using in CI, to infer types and trace methods and fields to actual calls. So I asked Claude to write a pipeline that would scan the whole codebase and use Phan to:
- find all save methods called on variables with Item type
- build a call graph of every method that has item::save in it
- check if the call graph has a transaction in it
- find every other DB model being saved and what fields have been updated
Claude created a plan, broke it into TODO tasks, and started working. It even ran the pipeline and verified that it worked before reporting the job was done.
It worked. But when I checked the code, I saw a lot of heuristics like this:
/**
 * Check if a method likely returns an Item
 */
private function isItemReturningMethod(string $method): bool
{
    $method = strtolower($method);
    return in_array($method, [
        'getitem', 'finditem', 'fetchitem', 'loaditem',
        'get', 'find', 'first', 'last', // Common ORM methods
        'getwithlock', 'findorfail'
    ]) || str_starts_with($method, 'getitem');
}

private function isProbablyItemVariable(string $varName): bool
It wasn’t using Phan at all. I asked Claude why. Its reply:
"using Phan may give slightly more reliable results, but it also requires additional setup and configuration, so a heuristic-based approach may be better for this case."
Lazy AI? Is that even possible?
What happened was that the task was too big. Even though there were multiple TODO items, Claude was running out of context. Instead of researching how to write Phan plugins, run them, and parse their output, it fell back on a simpler approach it already knew how to implement.
Vibe coding wouldn’t cut it. I needed a better approach.
Coding Machine
Much like AI, I’m lazy too. I don’t like writing long, detailed prompts. I don’t like reading AI slop more than once per task.
That’s why I follow a Plan-Execute-Review approach.
Plan
For my Phan pipeline, I asked Claude to explain how Phan can be used in pipelines. From the response, I learned about plugins, visitors, and input/output structure. I asked whether Phan can return variable and field types, and track field assignments. I asked how I could actually run Phan on my code.
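To give a sense of what that planning surfaced, here is a rough sketch of the core building block: a Phan plugin that flags save() calls on receivers whose inferred type mentions Item. This is illustrative only, not the project’s actual pipeline — it assumes Phan’s PluginV3 API (PostAnalyzeNodeCapability plus a post-analysis visitor), and the Item match and the tab-separated output are placeholders.

<?php
// item_save_plugin.php — illustrative sketch; class names and output format are placeholders.

use ast\Node;
use Phan\AST\UnionTypeVisitor;
use Phan\PluginV3;
use Phan\PluginV3\PluginAwarePostAnalysisVisitor;
use Phan\PluginV3\PostAnalyzeNodeCapability;

class ItemSavePlugin extends PluginV3 implements PostAnalyzeNodeCapability
{
    public static function getPostAnalyzeNodeVisitorClassName(): string
    {
        return ItemSaveVisitor::class;
    }
}

class ItemSaveVisitor extends PluginAwarePostAnalysisVisitor
{
    public function visitMethodCall(Node $node): void
    {
        // Only interested in ->save() calls.
        if ($node->children['method'] !== 'save') {
            return;
        }

        // Let Phan infer the receiver's type instead of guessing from variable names.
        $type = UnionTypeVisitor::unionTypeFromNode(
            $this->code_base,
            $this->context,
            $node->children['expr']
        );

        if (strpos((string)$type, 'Item') !== false) {
            // Record the location; a later step builds the call graph from these hits.
            fwrite(STDERR, sprintf(
                "ITEM_SAVE\t%s\t%d\n",
                $this->context->getFile(),
                $node->lineno
            ));
        }
    }
}

return new ItemSavePlugin();

The point is that the type check comes from Phan’s analysis, not from a list of method-name heuristics.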
I learned enough to code the pipeline myself. That’s how I knew Claude could write it too.
Once you understand the solution well enough to implement it yourself, Claude can implement it for you — usually faster and with fewer typos. The key is getting to that point of clarity first.
Execute
If planning is done right, the execute phase is as simple as typing "implement it" to Claude.
It feels very sci-fi to watch Claude creating diffs, running linters, writing debug scripts, backtracking and writing more diffs… But the amount of information is draining.
That’s why I use a hook that pings me when Claude has stopped working on a task, so I can switch off completely and do something else.
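For reference, the ping is just a Stop hook in Claude Code’s settings file. A minimal sketch, assuming the hooks schema in .claude/settings.json (it may differ between versions) and using a macOS notification command purely as an example:

{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "osascript -e 'display notification \"Claude has stopped\" with title \"Claude Code\"'"
          }
        ]
      }
    ]
  }
}

Swap the command for whatever ping works in your environment.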
Review
AI code review is no different from peer review. I read the code, I list everything I don’t like, and I ask Claude to fix it.
I respect other developers’ right to have their own ideas on how to solve a problem (unless there’s a clear requirement breach). I respect other developers having their own style preferences (some of you like 300-line functions, and that’s okay).
I treat Claude the same way.
As long as its solution works, follows our official coding guidelines, and doesn’t look utterly horrendous to me — I don’t ask Claude to change it.
This saves me a little bit of time and a lot of peace of mind.
By the end of the migration, Claude had written about 9,000 lines of production code, spanning five services. That included endpoint additions, existing logic changes, refactorings, and DB migrations — all reviewed, tested, and merged through our standard process.
Among all that code, there was only one significant logical error: it used the wrong field for an ID. Neither I nor another human reviewer caught this, because there were 4 IDs to choose from: Item ID, Product ID, and two Order Product IDs. Where humans struggle to reason, AI struggles too.
Living Dangerously
Some other things I used Claude for in this project:
- Reviewing code (our CI runs Claude too)
- Using the GitHub API to get PR review comments and address them
- Creating Mermaid diagrams to illustrate design docs
- Creating JIRA tasks from an approved design doc
- Building a Python pipeline to split Claude session files into messages, run DeBERTa-v3 to analyze user intent/satisfaction, and then use Claude to find patterns that result in good or bad interactions
- Patching a Kubernetes batch job in the development cluster
- Reading failed job logs and investigating Envoy connectivity issues
- Co-writing a blog post about all of it
- and many, many others
Most of this can only be done efficiently if you enable YOLO mode (claude --dangerously-skip-permissions). I often hear engineers say they don’t want Claude to execute some dangerous command and delete their production DB, wipe their repo, and so on. Some of those concerns should be covered by good security practices, but I’m not going to talk about that here. I’ll just share what I do to prevent Claude from doing bad things. It has worked so far.
Keep the Djinn in the bottle
As we’ve seen earlier, AI wants to please the user by following their request as closely as it can. So the first thing to do is actually ask Claude explicitly:
"don’t write any files, just respond", "don’t edit any kubernetes resources"…
But if you later find a problem with some kubernetes config and ask Claude to patch it? Now it has conflicting statements in its context brain: the initial imperative "don’t edit any kubernetes resources", and a loooong, detailed transcript of it actually editing kubernetes configuration on your request. Guess which one will win?
If you asked Claude to do something and want to make sure it won’t do anything similar again, /clear the context.
Asimov’s Paradox
Suppose you ask some sci-fi AI to save the environment, but not to harm humans in the process, no matter what. A good AI will try everything it can. But when it runs out of its own ideas, it may reach for the one you yourself implied might work — the forbidden one.
With Claude Code it’s the same thing. It will try good solutions first, and then it will turn to the dark side. But it won’t happen in the blink of an eye — it will start spiraling first. Most unexpected issues happen when Claude spirals, trying to solve a problem it doesn’t know how to solve. If it starts producing increasingly convoluted code or unrelated scripts, it’s losing track. By catching the early signs — circular reasoning, frustration, or irrelevant output — you can stop it, summarize, and reset before it does anything harmful.
Closing Thoughts
This migration started as a solo challenge. It ended as a collaboration — between me, Claude, and the systems we were both trying to understand.
Claude changed the texture of my work. The repetitive, friction-heavy parts (searching, mapping, refactoring, addressing reviews) got offloaded. That left more time for problem solving and reasoning about trade-offs — for example, how to unify live-sync and backfill under one idempotent upsert flow that always reads from the source of truth. That design, simple and consistent, came from having the mental space to think clearly.
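To sketch that design (illustrative PHP with placeholder table and field names — not the actual service code): both the live-sync path and the backfill job call the same routine, which re-reads the order from the source of truth and performs an idempotent upsert, so replays and overlaps are harmless.

<?php
// Hypothetical helper: the same code path serves live-sync events and backfill batches.
function syncOrder(PDO $db, callable $fetchFromSourceOfTruth, string $orderId): void
{
    // Always re-read the source of truth; never trust the triggering event's payload.
    $order = $fetchFromSourceOfTruth($orderId);

    // Idempotent upsert keyed on the order ID: running it twice yields the same row
    // (MySQL-style syntax assumed).
    $stmt = $db->prepare(
        'INSERT INTO orders (id, status, updated_at)
         VALUES (:id, :status, :updated_at)
         ON DUPLICATE KEY UPDATE
             status = VALUES(status),
             updated_at = VALUES(updated_at)'
    );
    $stmt->execute([
        'id'         => $order['id'],
        'status'     => $order['status'],
        'updated_at' => $order['updated_at'],
    ]);
}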
That’s the balance I think we’ll see more of in software engineering: AI not replacing humans, but multiplying their effectiveness — making technical exploration faster, safer, and more powerful.




