Teaching AI to Label GitHub Issues

“Is GitHub using AI to label PRs now?”

When my colleague Nate asked me this question in Slack, I had to pause. “I don’t think so?” And then: “They should though.”

tl;dr: They don’t – but you can!

I built a GitHub Action that uses LLMs to intelligently label your issues and PRs, and you can drop it into your repos right now.

Beyond just announcing a new tool, I want to share a little about what I learned about practical AI application design, the surprising effectiveness of structured reasoning with smaller models, and why I believe this represents a perfect case study in augmenting (rather than replacing) human workflows.

Why Labels Matter

As an open-source contributor, you’ve surely seen labels on issues and PRs: colorful tags that categorize work in meaningful ways.

As a maintainer, labels represent a fairly sophisticated system for repository orchestration. Similar to hashtags in early social media, labels are an extremely simple construct that have nonetheless transcended their original purpose and become a critical tool for open-source management. They:

shape contributor behavior (good first issue, help wanted)
set expectations (breaking change, duplicate, needs-mre)
route attention (security-review, needs-reproduction, frontend)
advertise features and milestones (enhancement, caching, RBAC, 3.x)
and drive automation (many human and automated workflows use labels as triggers or status indicators)

One of my favorite examples is the great writeup label on the Prefect repo, which highlights issues or resolutions that are exceptionally well-written. It’s a great way to recognize and encourage good contributor experiences, and it’s a powerful “show don’t tell” signal for new contributors.

Most labeling today falls into two categories: manual application (time-consuming and inconsistent) or static automation based on simple rules. The excellent first-party Pull Request Labeler action, for instance, can apply path-based rules such as adding the frontend label to any PR that touching files in ui/**. In fact, it was seeing this deterministic behavior that prompted Nate’s question in the first place.

But path-based labeling can’t tell you whether those changes need security review, or if they’re breaking existing APIs, or if they’d make a great first issue for new contributors. To automate labeling effectively, we need something that can actually understand content, intent, and context.

Luckily, I know a guy.

Labels, Meet AI

LLMs are a perfect fit for this problem. Classification, or mapping unstructured context onto a set of categories, is one of their most fundamental operations! Most importantly, they can understand the context and even the intent behind any changes, not just their objective surface characteristics. They can tell when a PR constitutes an enhancement, when test coverage is insufficient, or when a security review is needed.

So… let’s build an AI labeler!

Using ControlFlow, the core implementation is surprisingly simple. In fact, despite representing the entirety of this action’s “magic”, I spent only a fraction of my time orchestrating the AI logic and all the rest trying to get the action itself to work in CI.

You may draw your own conclusions about the state of developer happiness.

Here is a slightly simplified version of the core code. In this flow:

import controlflow as cf
from pydantic import BaseModel
from typing import Optional, Union


@cf.flow
def labeling_workflow(
    pr_or_issue: Union["PullRequest", "Issue"],
    labels: list["Label"],
) -> list[str]:

    class Reasoning(BaseModel):
        label_name: str
        reasoning: str
        should_apply: bool

    labeler = cf.Agent(
        name="GitHub Labeler",
        model="openai/gpt-4o-mini",
        instructions="You are an expert at labelling GitHub issues and PRs.",
    )

    decision = cf.run(
        "Consider the PR/issue and reason about potential labels",
        result_type=list[Reasoning],
        context=dict(pr_or_issue=pr_or_issue, labels=labels),
        agents=[labeler],
    )

    return [r.label_name for r in decision if r.should_apply]

In this flow:

We create a Pydantic model to hold the reasoning about each label
We create an agent that will use GPT-4o-mini to label the PR or issue
We reason about each label to ultimately produce a list of labels that should be applied

The “full” code can be seen here.

Reasoning: Show Your Work

The first version of this flow simply asked the agent to generate a list of labels. This worked well in all cases with GPT-4o, but GPT-4o-mini sometimes made mistakes with complex labeling instructions.

I experimented with a variety of solutions, including different prompts, multi-stage reasoning, and more, before settling on the approach above, in which we ask the model to explicitly output its reasoning about each label. It’s fascinating how well this approach works, permitting GPT-4o-mini to operate near the level of GPT-4o, at a tiny fraction of the cost.

(Performance is actually very slightly better in a two-step reasoning approach, but at the cost of a second pass through the LLM, so I’ve opted for the single-pass version for now)

Note that this is not the same as o1-style reasoning, as it does not involve any iterative refinement of the model’s understanding of the input. Instead, this approach forces the model to pay more attention to instructions, thereby tipping it into an operating regime that’s more likely to produce the right answer.

Configuration

Now we’ve got an AI workflow that can assign labels to a PR. That still might not be enough to mimic a human maintainer, because we ascribe norms to label application that are based on context, intent, and nuance.

For this reason, the AI labeler allows you to annotate each label with natural language instructions that clarify its purpose.

The configuration is straightforward:

labels:
  - security-review: # the label name
      description: "Needs security team review"
      instructions: |
        Apply when changes involve:
        - Authentication or authorization code
        - Cryptographic operations
        - Environment variables
        - Container or deployment config

This gives you a way to define what “good first issue” means for your project, or exactly when to flag something as a breaking change or needing tests. The AI will consider your instructions alongside the actual content, leading to remarkably nuanced decisions.

For more control, you can provide global instructions and even include additional files for context, like a contribution guide or code of conduct.

instructions: |
  Focus on identifying good first issues and security concerns.

labels:
  - good-first-issue:
      description: "Perfect for newcomers"
      instructions: |
        Apply when the changes are:
        - Well-scoped and isolated
        - Well-documented
        - Don't require deep system knowledge

  - security-review:
      description: "Needs security team review"
      instructions: |
        Apply when changes touch:
        - Authentication flows
        - Environment variables
        - Network requests

context_files:
  - .github/CODEOWNERS
  - CONTRIBUTING.md
  - CODE_OF_CONDUCT.md

No Plan Survives First Contact

I’d never created a GitHub Action before, and I have to admit – it was both more challenging and more rewarding than I expected. The documentation is comprehensive but often cryptic. Environment variables have surprising names. You can’t read repository files until you check out the repository (which seems obvious in hindsight, but took me embarrassingly long to figure out). Testing is essentially “push and pray.”

This led to an important design decision: keep most of the code as normal, testable Python and wrap it with a thin layer of GitHub-specific glue. In retrospect, this separation of concerns was crucial for maintaining my sanity during development and will make it much easier to maintain going forward.

Beyond Labels

What excites me most about this project isn’t just its utility (though I do use it on all my repositories now). It’s that it represents a perfect example of how AI can augment existing workflows without trying to replace human judgment:

It handles the routine work of initial labeling, but maintainers can always adjust or override its decisions
It learns from your repository’s context and explicit instructions, adapting to your specific needs
It’s completely transparent about its reasoning, making it easy to debug and improve
It’s fast and affordable – you could process 10,000 PRs for less than $5

The structured reasoning approach means we get sophisticated behavior from smaller models – there’s no need to step up to GPT-4o or Claude just for intelligent labeling. This keeps it practical for real-world use while still delivering genuinely helpful automation.

There’s still plenty to explore – for example, distinguishing between issue-only and PR-only labels, or learning from manual corrections over time. Want to help? The project could especially use some “good first issues” to welcome new contributors. You can find AI Labeler on GitHub, and suggestions are always welcome.

But don’t worry about picking the right labels for your issues – we’ve got that covered! 😉