mercari-pm-agent Design — Automating the PM Workflow with Claude Code Skills and MCP

Introduction

Hello. I'm Shogo Kikuchi, a PM intern at Mercari.

During my internship, I developed a Claude Code Skill called mercari-pm-agent. It is an agent that can handle, within a single session, the end-to-end workflow PMs go through: “problem discovery → data gathering → PRD creation → UI mockups.”

In this article, I’ll focus on how I implemented a PM workflow on Claude Code—covering the Skill’s design as well as how I connected Notion, Slack, Looker, and Figma via MCP (Model Context Protocol).

Background: Mercari PMs’ Information-Gathering Workflow and the Challenges

When making decisions, Mercari PMs need to build a complete picture of the situation by working across multiple tools, for example:

  • Check mid-term strategy and KPI goals in Notion
  • Search internal requests and feedback in Slack
  • Review quantitative user metrics in Looker
  • Look at the current design of the relevant screens in Figma
  • Integrate all of the above into a PRD (Product Requirements Document)

Accessing each tool isn’t difficult in itself. However, organizing “which data matters for this decision right now” while moving across tools takes a non-trivial amount of time. PMs should be spending their time thinking deeply based on the information they gathered, making decisions, and having conversations with stakeholders. I wanted to shift time away from information gathering and toward thinking and decision-making—that was my motivation for building this tool.

Overview of mercari-pm-agent

mercari-pm-agent is a PM support agent implemented as a Claude Code Skill.

When a PM describes a business problem with the product in natural language, the agent works through the following steps.

Flow

[Figure: mercari-pm-agent workflow diagram]

Implementation: Defining a PM Workflow with Claude Code Skills

What are Claude Code Skills?

Claude Code Skills are a mechanism for defining Claude Code’s behavior in Markdown files. By writing an agent’s steps, constraints, and how to access tools in SKILL.md, you can build an agent dedicated to a specific workflow (official guide).

The key feature is that you can define an agent’s behavior without writing code. As an example of Skills for PMs, I also referred to phuryn/pm-skills. However, as I describe later, you won’t get good accuracy by “just writing Markdown.” Constraint design and an evaluation loop are critical.

File Structure: Applying Separation of Concerns to Prompt Design

mercari-pm-agent/
├── SKILL.md                 # Agent behavior definition (English)
└── references/
    ├── prd-template.md      # PRD template
    ├── prd-checklist.md     # PRD quality checklist (9 items)
    ├── ui-and-figma.md      # UI Spec / Figma Make prompt template
    ├── laplace-guide.md     # Data interpretation guide
    ├── data-sources.md      # Data source list / how to use
    └── quick-reference.md   # Output checklist

At first, I consolidated all definitions into a single SKILL.md file. Through scoring by an evaluation skill (described later), I confirmed a problem: the longer the file, the worse the output accuracy becomes.

This relates to LLM characteristics. As context becomes longer, there’s a known phenomenon where models fail to properly attend to relevant information in the middle of the context (the so-called “Lost in the Middle” problem). Anthropic’s prompt engineering guidance also recommends keeping prompts concise.

To address this, I separated the behavior definition (the main SKILL.md) from reference data and templates (references/). This is an approach that applies “Separation of Concerns” from software engineering to prompt design. SKILL.md keeps only “what to do and in what order,” while concrete data and templates are referenced from references/ when needed. This structural change alone led to a clear improvement in score.
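One way to keep this split from eroding over time is a small check on the skill directory. The following is a minimal sketch, not part of the original agent: the function name `check_skill_layout` and the 150-line budget are assumptions chosen for illustration. It flags a SKILL.md that has grown past the budget and any `references/*.md` path that SKILL.md mentions but that does not exist on disk.

```python
import re
from pathlib import Path

# Hypothetical line budget; tune it against your own evaluation scores.
MAX_SKILL_LINES = 150


def check_skill_layout(skill_dir: Path, max_lines: int = MAX_SKILL_LINES) -> list[str]:
    """Return human-readable problems found in a skill directory."""
    skill_md = skill_dir / "SKILL.md"
    if not skill_md.is_file():
        return [f"missing {skill_md}"]

    problems: list[str] = []
    text = skill_md.read_text(encoding="utf-8")

    # Keep the behavior definition short; bulky data belongs in references/.
    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(f"SKILL.md has {line_count} lines (budget: {max_lines})")

    # Every references/*.md path mentioned in SKILL.md should exist on disk.
    for rel in sorted(set(re.findall(r"references/[\w.-]+\.md", text))):
        if not (skill_dir / rel).is_file():
            problems.append(f"SKILL.md mentions {rel}, but the file does not exist")

    return problems
```

Calling `check_skill_layout(Path("mercari-pm-agent"))` returns an empty list when the layout is consistent, so it can run as a pre-commit step before each evaluation round.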

Note that I write SKILL.md in English, because instructions in English tend to produce higher accuracy with Claude.

MCP Connections: Connecting Multiple Tools to the Agent

The core value of mercari-pm-agent is automating data collection in Step 2. Here, I explain how I designed the tool connections using MCP (Model Context Protocol).

What is MCP?

MCP is an open protocol defined by Anthropic, specifying a standard way for LLM applications to connect to external tools and data sources. Via an MCP server, Claude Code can call external services such as Notion and Slack as tools.
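Under the hood, MCP messages are JSON-RPC 2.0, and a tool invocation is a `tools/call` request carrying the tool name and its arguments. The sketch below only builds such a message to show its shape. The tool name `search_pages` and its arguments are hypothetical and do not describe what any particular MCP server exposes.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per request


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call("search_pages", {"query": "mid-term strategy"})
print(json.dumps(request, indent=2))
```

In practice Claude Code constructs and sends these messages itself. The Skill only needs to say which tool to use and when.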

MCP Servers Connected

| MCP Server | Type | Data Retrieved | Usage |
| --- | --- | --- | --- |
| Notion MCP | Official (provided by Notion) | Strategy docs · KPI dashboard | Alignment check with mid-term strategy |
| Slack MCP | Custom (built in-house) | Posts from internal feedback channel | Collecting improvement requests and field feedback |
| Socrates | Custom (in-house BigQuery · Looker platform) | Metrics data such as CVR | Quantitative evidence for problem validation |
| Figma MCP | Custom (built in-house) | Component info from design files | Fetching existing designs for UI Spec |

Parallel Queries and Robustness by Design

In Step 2 (data collection), I query these MCPs in parallel. In data-sources.md, I wrote rules like:

- Pull in parallel during Data Enrichment — do not wait for one source
  before querying another.

- If a source is unavailable, skip silently and mark it in the output.

This reduces waiting time compared to sequential access. I also added fallback behavior so the process doesn’t stop even if some MCPs are unavailable.
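In the agent these rules are natural-language instructions that Claude Code follows, not program code. For readers who think in code, the same pattern (query everything concurrently, then mark failures instead of aborting) can be sketched in Python with `asyncio`. The fetcher functions below are stand-ins, not real MCP clients, and the `UNAVAILABLE` label is an assumption.

```python
import asyncio
from typing import Awaitable, Callable

Fetcher = Callable[[str], Awaitable[str]]


async def gather_sources(
    topic: str, fetchers: dict[str, Fetcher], timeout_s: float = 30.0
) -> dict[str, str]:
    """Query every source concurrently; a failed source is marked, not fatal."""

    async def guarded(fetch: Fetcher) -> str:
        return await asyncio.wait_for(fetch(topic), timeout=timeout_s)

    names = list(fetchers)
    # return_exceptions=True keeps one failure from cancelling the others.
    results = await asyncio.gather(
        *(guarded(fetchers[name]) for name in names), return_exceptions=True
    )

    report: dict[str, str] = {}
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            report[name] = f"UNAVAILABLE ({type(result).__name__})"
        else:
            report[name] = result
    return report


# Stand-in fetchers (hypothetical); real ones would call the MCP servers.
async def fetch_notion(topic: str) -> str:
    await asyncio.sleep(0.01)
    return f"strategy notes on {topic}"


async def fetch_slack(topic: str) -> str:
    raise ConnectionError("VPN not connected")


report = asyncio.run(
    gather_sources("listing flow", {"notion": fetch_notion, "slack": fetch_slack})
)
print(report)
```

The design point is that the unavailable source still appears in the report, so the reader of the final output can see which evidence is missing.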

Security Considerations

To set up Slack MCP, you need internal VPN connectivity and authentication via a user token. I pass the token into Claude Code’s configuration as an environment variable so the token string doesn’t get exposed in chat. Also, Slack user tokens expire in 7 days, so I prepared a separate script for refreshing them.
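As a rough illustration of the token handling described above, here is a minimal sketch of what a helper around the environment variable might look like. The variable name `SLACK_USER_TOKEN`, the one-day refresh margin, and all function names are assumptions. The article does not show the actual refresh script.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Mapping

TOKEN_ENV_VAR = "SLACK_USER_TOKEN"  # hypothetical variable name
TOKEN_LIFETIME = timedelta(days=7)  # lifetime as stated in the article


def load_token(env: Mapping[str, str] = os.environ) -> str:
    """Read the token from the environment instead of pasting it into chat."""
    token = env.get(TOKEN_ENV_VAR, "")
    if not token:
        raise RuntimeError(f"{TOKEN_ENV_VAR} is not set; run the refresh script first")
    return token


def mask(token: str) -> str:
    """Show at most the last 4 characters so logs never contain the full token."""
    visible = token[-4:] if len(token) > 8 else ""
    return "*" * (len(token) - len(visible)) + visible


def needs_refresh(
    issued_at: datetime, now: datetime, margin: timedelta = timedelta(days=1)
) -> bool:
    """True once the token is within `margin` of its expiry."""
    return now >= issued_at + TOKEN_LIFETIME - margin
```

A scheduled job could call `needs_refresh(issued_at, datetime.now(timezone.utc))` and trigger the refresh script before the token lapses.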

What I Prioritized During Development

Decide Evaluation Criteria First — Prompt TDD

Before implementing, I first defined criteria for “how to evaluate the agent’s output.”

  • Understanding accuracy (does it capture the essence of the problem?)
  • Spec specificity (is it described at an implementable level?)
  • Feasibility (is it reasonable technically and in terms of resources?)
  • UX validity (is it easy to use for customers?)

This is close to the mindset of test-driven development (TDD) in software engineering. With LLM-based agents, it’s harder to judge “whether it works correctly” than simply “whether it runs.” By defining evaluation axes first, I was able to run improvement cycles based on criteria rather than intuition. I collected real web improvement topics to create an evaluation dataset and iteratively improved accuracy.
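To make the four axes concrete, they can be written down as a small data structure before any prompt exists. This is a minimal sketch, assuming a 1 to 5 scale per axis. The scale, the class name `Evaluation`, and the axis identifiers are illustrative choices, not the author's actual scoring format.

```python
from dataclasses import dataclass
from statistics import mean

# The four evaluation axes listed above, as identifiers.
AXES = ("understanding", "specificity", "feasibility", "ux_validity")


@dataclass(frozen=True)
class Evaluation:
    case_id: str
    scores: dict[str, int]  # axis -> score on an assumed 1..5 scale

    def __post_init__(self) -> None:
        missing = set(AXES) - set(self.scores)
        if missing:
            raise ValueError(f"{self.case_id}: missing axes {sorted(missing)}")
        for axis, score in self.scores.items():
            if not 1 <= score <= 5:
                raise ValueError(f"{self.case_id}: {axis}={score} is outside 1..5")


def summarize(evaluations: list[Evaluation]) -> dict[str, float]:
    """Average score per axis across the evaluation dataset."""
    return {axis: mean(e.scores[axis] for e in evaluations) for axis in AXES}
```

Reporting per-axis averages rather than a single total makes it visible when a prompt change improves specificity at the cost of, say, feasibility.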

Prevent Plausible Hallucinations via Constraints

The most dangerous risk when embedding LLMs into a business workflow is generating “plausible but unfounded information.” Even when data doesn’t exist, models can naturally output “reasonable-looking numbers.” If a PM trusts those numbers and puts them into a PRD, the decision-making basis becomes fiction.

This can’t be solved by simply telling the model “don’t lie.” You need to design constraints for how the model should behave when it recognizes missing data.

Data integrity rules:

  • Unconfirmed data must be labeled "Not provided" or "To be validated"
  • Never fabricate numbers or sources
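The same idea can be expressed in code: a missing value is represented explicitly rather than replaced with a default. The sketch below is an analogy, not the agent's implementation. The mapping of "no value" to "Not provided" and "value without a source" to "To be validated" is an assumption about how the two labels divide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metric:
    name: str
    value: Optional[float] = None   # None means no data was retrieved
    source: Optional[str] = None    # None means the number has no confirmed origin


def render(metric: Metric) -> str:
    """Render a metric for a PRD without ever inventing a number or a source."""
    if metric.value is None:
        return f"{metric.name}: Not provided"
    if metric.source is None:
        return f"{metric.name}: {metric.value} (To be validated)"
    return f"{metric.name}: {metric.value} (source: {metric.source})"


print(render(Metric("CVR")))
print(render(Metric("CVR", 2.5)))
print(render(Metric("CVR", 2.5, "Looker")))
```

There is deliberately no branch that fills in a fallback number, which mirrors the rule that the model must state the gap instead of papering over it.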

In addition, I prohibited the agent from automatically progressing to the next step without PM confirmation.

You are NOT allowed to infer completeness. Only explicit confirmation from the PM allows progression.

This keeps the PM as the decision-making driver at all times, rather than letting the agent “auto-advance” in a plausible flow.
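Conceptually, this rule turns the workflow into a state machine whose only transition trigger is an explicit confirmation. A minimal sketch under that reading follows. The step names come from the workflow described in the introduction, while the confirmation phrases and the `Workflow` class are hypothetical.

```python
from enum import IntEnum


class Step(IntEnum):
    PROBLEM_DISCOVERY = 1
    DATA_GATHERING = 2
    PRD_CREATION = 3
    UI_MOCKUP = 4


# Hypothetical confirmation phrases; anything else counts as "not confirmed".
CONFIRMATIONS = {"confirmed", "approved", "go ahead"}


class Workflow:
    def __init__(self) -> None:
        self.step = Step.PROBLEM_DISCOVERY

    def advance(self, pm_reply: str) -> Step:
        """Move to the next step only on an explicit confirmation."""
        confirmed = pm_reply.strip().lower() in CONFIRMATIONS
        if confirmed and self.step < Step.UI_MOCKUP:
            self.step = Step(self.step + 1)
        return self.step


wf = Workflow()
print(wf.advance("looks mostly fine").name)  # ambiguous reply: stays put
print(wf.advance("Confirmed").name)          # explicit confirmation: advances
```

An ambiguous reply such as "looks mostly fine" leaves the state unchanged, which is the code equivalent of not inferring completeness.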

Using a Skill to Evaluate a Skill — An Automated Evaluation Pipeline

To verify whether the designed rules were actually working, I created a dedicated evaluation skill (skill-creator-max). It sends test cases to mercari-pm-agent and returns scores for output quality. Through iterative improvements using this score, I obtained the insight mentioned earlier—“the shorter the SKILL.md, the better the accuracy”—which led to the file-splitting design change.
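The loop behind this kind of evaluation can be sketched as a small harness that runs every test case through each skill variant and averages a judge's score. The interface of skill-creator-max is not described here, so the `Agent` and `Judge` callables below are assumptions, and the toy stand-ins exist only to make the sketch runnable. They do not reflect real outputs or scores.

```python
from statistics import mean
from typing import Callable

Agent = Callable[[str], str]         # test case -> agent output
Judge = Callable[[str, str], float]  # (test case, output) -> score


def score_variant(agent: Agent, judge: Judge, cases: list[str]) -> float:
    """Average judge score of one skill variant over all test cases."""
    return mean(judge(case, agent(case)) for case in cases)


def pick_better(
    variants: dict[str, Agent], judge: Judge, cases: list[str]
) -> tuple[str, dict[str, float]]:
    """Score every variant on the same cases and return the best one."""
    scores = {name: score_variant(agent, judge, cases) for name, agent in variants.items()}
    return max(scores, key=scores.get), scores


# Toy stand-ins (hypothetical): the judge rewards output that labels missing data.
def toy_judge(case: str, output: str) -> float:
    return 1.0 if "To be validated" in output else 0.0


variants = {
    "v1": lambda case: f"PRD for {case}: CVR is 3.2%",
    "v2": lambda case: f"PRD for {case}: CVR To be validated",
}
best, scores = pick_better(variants, toy_judge, ["search filters", "listing flow"])
print(best, scores)
```

Holding the test cases and the judge fixed while swapping only the skill variant is what lets a score difference be attributed to the change in SKILL.md.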

Conclusion

Here are the key learnings I got from building mercari-pm-agent with Claude Code Skills.

  • Designing a Skill is close to writing a “behavior specification.” Constraints matter more than commands; clearly defining what the LLM “must not do” directly improves accuracy.
  • Design MCP connections for parallelism. Sequential access worsens UX; consider fallback behavior and robustness together.
  • Separation of concerns works for prompt design too. The longer the context, the lower the accuracy; separating behavior definitions from reference data is effective.
  • Define evaluation criteria before implementation. Quality evaluation for LLM agents is prone to subjectivity; defining axes first and building an evaluation agent enables an objective improvement cycle.

Because mercari-pm-agent is implemented as a Claude Code Skill, once MCP is configured you can launch it with a single /mercari-pm-agent command. I hope this is helpful for anyone interested in improving PM efficiency or designing agents using Claude Code Skills.
