Task Definition
Given an issue title and its initial description, predict the set of functions and classes that were modified by the fixing pull request.
- Input: issue title + initial body
- Output: function/class nodes
- Granularity: repository-level
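To make the input/output contract concrete, here is a toy lexical baseline. All names (`rank_nodes`, the node identifiers, the node texts) are hypothetical illustrations, not part of GREPO; a real system would rank nodes of the repository graph rather than word-overlap scores.

```python
def rank_nodes(issue_text: str, node_texts: dict[str, str]) -> list[str]:
    """Toy lexical baseline (hypothetical): rank function/class nodes by
    word overlap with the issue title + description. Real systems use
    retrieval plus graph-based reranking."""
    query = set(issue_text.lower().split())

    def score(node_id: str) -> int:
        # Count shared words between the issue text and the node's text.
        return len(query & set(node_texts[node_id].lower().split()))

    return sorted(node_texts, key=score, reverse=True)

ranked = rank_nodes(
    "Crash when config file is empty",
    {
        "config.load": "load config file parse yaml",
        "ui.render": "draw window widgets",
    },
)
```

The output is a ranking over candidate nodes; the top-K prefix of this ranking is what Hit@K evaluates against the set of nodes modified by the fixing pull request.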
A high-fidelity benchmark for repository-level bug localization, built for graph neural networks and grounded in real-world GitHub issues.
Results are reported on nine representative repositories under the GREPO evaluation protocol, using issue title and initial description as input and Hit@K as the primary metric.
Leaderboard
Scores are mean Hit@K across nine representative repositories. Use the filters to compare GNNs, LLM baselines, and retrieval methods.
| Rank | Model | Type | Avg Rank | Hit@1 | Hit@5 | Hit@10 | Hit@20 | Notes |
|---|---|---|---|---|---|---|---|---|
Performance Signature
A smooth profile of Hit@K metrics across Eval9.
Repository View
Compare model performance within a single repository. Select a repository and metric to explore detailed results.
Metric: Hit@10
Visuals
More refined views for model comparison, performance patterns, and the benchmark landscape.
Repository-level delta between two models.
Positive values indicate Model A outperforms Model B.
Average Hit@K values across Eval9.
Grouped by model family.
Selected model from the leaderboard is highlighted.
Benchmark
GREPO evaluates repository-level bug localization with a strict leakage-safe input policy and graph-native supervision.
Given an issue title and its initial description, predict the set of functions and classes that were modified by the fixing pull request.
Each repository is represented as a temporal, heterogeneous graph with explicit structural and semantic relationships.
Mean Hit@K across all test issues, calculated against full ground-truth node sets.
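The metric above can be sketched in a few lines. This is a minimal reference implementation of Hit@K as described (1 if any ground-truth node appears in the top-K, averaged over issues); the variable names and toy data are illustrative, not from the benchmark.

```python
def hit_at_k(ranked: list[str], gold: set[str], k: int) -> int:
    """Hit@K for one issue: 1 if any ground-truth node is in the top-K."""
    return int(any(node in gold for node in ranked[:k]))

def mean_hit_at_k(results: list[tuple[list[str], set[str]]], k: int) -> float:
    """Mean Hit@K across all test issues, against full ground-truth node sets."""
    return sum(hit_at_k(ranked, gold, k) for ranked, gold in results) / len(results)

results = [
    (["a", "b", "c"], {"c"}),  # hit at K=3, miss at K=1
    (["x", "y"], {"z"}),       # miss at every K
]
```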
Dataset
GREPO is built from real-world GitHub repositories and issues, with an incremental graph pipeline designed for scale.
Build a single temporal graph with per-commit lifespans, updating only changed files to keep construction efficient.
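A minimal sketch of what per-commit lifespans with incremental updates could look like. This is an assumption-laden illustration (the data layout, `apply_commit`, and the `[start, end)` convention are ours, not GREPO's): only files touched by a commit are revisited, and each node records the commit range in which it exists.

```python
def apply_commit(graph: dict, commit_id: int, changed_files: dict[str, list[str]]) -> None:
    """Incremental build (sketch): re-parse only files changed in the commit.
    graph maps file path -> {node name: (start_commit, end_commit_or_None)}."""
    for path, current_nodes in changed_files.items():
        nodes = graph.setdefault(path, {})
        for name, (start, end) in list(nodes.items()):
            if end is None and name not in current_nodes:
                nodes[name] = (start, commit_id)  # node disappeared: close lifespan
        for name in current_nodes:
            if name not in nodes or nodes[name][1] is not None:
                nodes[name] = (commit_id, None)   # new (or reborn) node: open lifespan

graph: dict = {}
apply_commit(graph, 1, {"a.py": ["f", "g"]})
apply_commit(graph, 2, {"a.py": ["f"]})  # g removed at commit 2
```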
Link issues to merged PRs via closing keywords, and use only leakage-safe issue text as model input.
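Issue-to-PR linking via closing keywords can be approximated with a small pattern over PR text. This is a simplified sketch covering the common GitHub keyword forms (`close/closes/closed`, `fix/fixes/fixed`, `resolve/resolves/resolved`) with same-repository `#N` references; cross-repository references and URL forms are omitted.

```python
import re

# Common GitHub closing keywords followed by a same-repo issue reference.
CLOSING = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)

def linked_issues(pr_text: str) -> set[int]:
    """Extract issue numbers that a merged PR closes via closing keywords."""
    return {int(num) for num in CLOSING.findall(pr_text)}

links = linked_issues("Fixes #123 and closes #7; related to #5")
```

Note that `#5` is not captured: "related to" is not a closing keyword, so that reference would not establish an issue-PR fix link.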
Encode node text with embeddings and compute query-node similarity to provide lightweight, transferable signals for GNNs.
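Query-node similarity over text embeddings typically reduces to cosine similarity. A self-contained sketch with tiny hypothetical 3-d vectors (real embeddings would come from a text encoder):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical query embedding vs. two node embeddings.
query = [1.0, 0.0, 1.0]
node_embeddings = {
    "parser.parse": [0.9, 0.1, 0.8],
    "gui.draw": [0.0, 1.0, 0.0],
}
scores = {nid: cosine(query, emb) for nid, emb in node_embeddings.items()}
```

These per-node similarity scores are the kind of lightweight, model-agnostic signal that can be attached to graph nodes as input features for a GNN.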
Methods
The GREPO pipeline combines semantic anchors, temporal signals, and graph neural reranking for multi-hop localization.
Issue reports are rewritten into structured queries and entities to retrieve high-quality anchor nodes before message passing.
An issue-conditioned temporal retriever ranks nodes based on recent co-change history without leaking future information.
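The leakage-safe temporal idea can be sketched as follows, under our own simplifying assumptions: score candidate nodes by how often they co-changed with anchor nodes in commits strictly before the issue's creation time, so no post-issue history leaks into the ranking.

```python
from collections import Counter

def cochange_scores(commits, anchor_nodes: set, issue_time: int) -> Counter:
    """Score nodes by co-change frequency with anchor nodes, using only
    commits before the issue was opened (temporal cut-off, no leakage).
    `commits` is a list of (timestamp, set_of_changed_nodes)."""
    scores: Counter = Counter()
    for timestamp, changed in commits:
        if timestamp >= issue_time:
            continue  # future commit relative to the issue: must be ignored
        if changed & anchor_nodes:
            for node in changed - anchor_nodes:
                scores[node] += 1
    return scores

commits = [
    (1, {"a", "b"}),
    (2, {"a", "c"}),
    (9, {"a", "d"}),  # after the issue; excluded by the cut-off
]
scores = cochange_scores(commits, anchor_nodes={"a"}, issue_time=5)
```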
A query-aware GNN scores nodes inside the extracted subgraph to produce the final rankings, which are evaluated with Hit@K.
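As a rough intuition for query-aware scoring over a subgraph, here is a deliberately simplified message-passing sketch in plain Python: each node starts from its affinity to the query embedding, then mixes in its neighbors' scores for a few hops. This is our toy illustration of the idea, not GREPO's model architecture.

```python
def gnn_score(adj: dict, feat: dict, query: list[float], hops: int = 2) -> dict:
    """Toy query-conditioned message passing: initialize each node with its
    query affinity (dot product), then add the mean of neighbor scores
    at every hop so relevance propagates through the subgraph."""
    score = {n: sum(a * b for a, b in zip(feat[n], query)) for n in adj}
    for _ in range(hops):
        score = {
            n: score[n]
            + (sum(score[m] for m in adj[n]) / len(adj[n]) if adj[n] else 0.0)
            for n in adj
        }
    return score

# Hypothetical 2-node-feature subgraph: "b" has no query affinity itself
# but is adjacent to the highly relevant "a"; "c" is isolated.
adj = {"a": ["b"], "b": ["a"], "c": []}
feat = {"a": [1.0, 0.0], "b": [0.0, 0.0], "c": [0.0, 1.0]}
score = gnn_score(adj, feat, query=[1.0, 0.0])
```

After message passing, "b" outranks "c" purely through its structural proximity to the relevant node "a", which is the multi-hop behavior a graph-based reranker is meant to capture.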
Resources
Open-source code, data, and ready-to-run commands for benchmarking.
- Training, evaluation, and dataset tooling for the full benchmark. (Open GitHub Repository)
- Download the benchmark dataset and graph artifacts. (Open Dataset Page)
- Use the included command templates to train and evaluate models. (See examples/commands in the repo)

Use the leaderboard as a reference, then plug in your own retrieval or GNN model to test against the same Eval9 protocol.