Leaderboard

Model Performance on GREPO (Eval9)

Scores are mean Hit@K across nine representative repositories. Use the filters to compare GNNs, LLM baselines, and retrieval methods.

Leaderboard columns: Rank · Model · Type · Avg Rank · Hit@1 · Hit@5 · Hit@10 · Hit@20 · Notes

Performance Signature

Selected Model Curve

A smooth profile of Hit@K metrics across Eval9.


Repository View

Per-Repository Breakdown

Compare model performance within a single repository. Select a repository and metric to explore detailed results.


Visuals

Visual Analytics

More refined views for model comparison, performance patterns, and the benchmark landscape.

Compare Map

Repository-level delta between two models.


Positive values indicate Model A outperforms Model B.

Model Heatmap

Average Hit@K values across Eval9.


Model Lineage

Grouped by model family.

Selected model from the leaderboard is highlighted.

Benchmark

Evaluation Protocol at a Glance

GREPO evaluates repository-level bug localization with a strict leakage-safe input policy and graph-native supervision.

Task Definition

Given an issue title and its initial description, predict the set of functions and classes that were modified by the fixing pull request.

  • Input: issue title + initial body
  • Output: function/class nodes
  • Granularity: repository-level

Graph Schema

Each repository is represented as a temporal, heterogeneous graph with explicit structural and semantic relationships.

  • Nodes: directory, file, class, function
  • Edges: contain, call, inherit, temporal
  • Temporal snapshots per commit

Metrics

Mean Hit@K across all test issues, calculated against full ground-truth node sets.

  • K = 1, 5, 10, 20
  • Eval9 repositories
  • No future information allowed
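One common reading of Hit@K, sketched below under the assumption that an issue counts as a hit when any of its ground-truth nodes appears in the top K predictions (the source does not spell out the exact aggregation, so treat this as illustrative):

```python
def hit_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """1.0 if any ground-truth node appears in the top-k predictions, else 0.0."""
    return 1.0 if any(n in gold for n in ranked[:k]) else 0.0

def mean_hit_at_k(all_ranked, all_gold, k):
    """Average Hit@K over all test issues."""
    return sum(hit_at_k(r, g, k) for r, g in zip(all_ranked, all_gold)) / len(all_ranked)

ranked = ["f.a", "f.b", "f.c", "f.d"]
gold = {"f.c"}
print(hit_at_k(ranked, gold, 1))  # 0.0: gold node is not ranked first
print(hit_at_k(ranked, gold, 5))  # 1.0: gold node is within the top 5
```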

Dataset

Dataset & Construction Pipeline

GREPO is built from real-world GitHub repositories and issues, with an incremental graph pipeline designed for scale.

01

Temporal Graph Build

Build a single temporal graph with per-commit lifespans, updating only changed files to keep construction efficient.
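The incremental update can be illustrated with a lifespan table keyed by path. This is a minimal sketch, assuming a hypothetical per-commit diff summary (`added` / `modified` / `deleted`); GREPO's real pipeline also re-parses changed files into class and function nodes.

```python
def apply_commit(lifespans: dict, commit_idx: int, changed: dict) -> None:
    """Update node lifespans for one commit, touching only changed files.

    lifespans: path -> list of [start, end] intervals; end=None means alive.
    changed:   path -> 'added' | 'modified' | 'deleted' (hypothetical diff summary).
    """
    for path, status in changed.items():
        spans = lifespans.setdefault(path, [])
        if status == "added":
            spans.append([commit_idx, None])        # open a new lifespan
        elif status == "deleted" and spans and spans[-1][1] is None:
            spans[-1][1] = commit_idx - 1           # close the current lifespan
        # 'modified' keeps the lifespan open; contents are re-parsed elsewhere

lifespans = {}
apply_commit(lifespans, 0, {"a.py": "added", "b.py": "added"})
apply_commit(lifespans, 3, {"b.py": "deleted"})  # a.py is untouched
print(lifespans)
```

Because unchanged files are never visited, the cost per commit is proportional to the diff size rather than to the repository size.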

02

Issue-PR Linking

Link issues to merged PRs via closing keywords, and use only leakage-safe issue text as model input.
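Closing-keyword linking can be sketched with a small regex over GitHub's documented keywords (`close`, `fix`, `resolve` and their variants). The function name and the exact keyword list here are a simplified stand-in for whatever matcher the pipeline actually uses:

```python
import re

# GitHub's closing keywords: close/closes/closed, fix/fixes/fixed, resolve/resolves/resolved.
CLOSING = r"(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)"
PATTERN = re.compile(CLOSING + r"\s+#(\d+)", re.IGNORECASE)

def linked_issues(pr_body: str) -> set[int]:
    """Issue numbers a merged PR claims to close via closing keywords."""
    return {int(m) for m in PATTERN.findall(pr_body)}

print(linked_issues("Fixes #12 and closes #34; see also #56"))  # {12, 34}
```

Note that a bare mention like `#56` is deliberately not linked: only keyword-prefixed references establish the issue-PR pair.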

03

Feature Construction

Encode node text with embeddings and compute query-node similarity to provide lightweight, transferable signals for GNNs.
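The query-node similarity feature reduces to a cosine score between the issue embedding and each node embedding. The sketch below uses random vectors as stand-ins for the output of a text encoder; any embedding model would slot in the same way:

```python
import numpy as np

def cosine_scores(query_vec: np.ndarray, node_mat: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and every node vector."""
    q = query_vec / np.linalg.norm(query_vec)
    n = node_mat / np.linalg.norm(node_mat, axis=1, keepdims=True)
    return n @ q

rng = np.random.default_rng(0)
query = rng.normal(size=64)        # stand-in for the encoded issue text
nodes = rng.normal(size=(5, 64))   # stand-in for encoded node texts
scores = cosine_scores(query, nodes)
top = np.argsort(-scores)[:3]      # top-3 most similar nodes as candidates
print(top)
```

Because the scores depend only on text embeddings, not on any one repository's labels, they transfer across repositories as a lightweight GNN input feature.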

Methods

Core Building Blocks

The GREPO pipeline combines semantic anchors, temporal signals, and graph neural reranking for multi-hop localization.

Semantic Anchors

Issue reports are rewritten into structured queries and entities to retrieve high-quality anchor nodes before message passing.
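The rewriting step itself is model-driven, but the entity side can be illustrated with a lightweight extractor that pulls code-like spans out of an issue. This regex-only sketch is a hypothetical stand-in for the structured-query step, not the actual anchor-retrieval component:

```python
import re

# Backticked spans, or dotted identifiers like module.func
IDENT = re.compile(r"`([^`]+)`|\b([A-Za-z_][A-Za-z0-9_]*\.[A-Za-z_][A-Za-z0-9_.]*)\b")

def extract_entities(issue_text: str) -> set[str]:
    """Pull code-like entities from raw issue text as anchor candidates."""
    out = set()
    for backtick, dotted in IDENT.findall(issue_text):
        out.add(backtick or dotted)
    return out

print(sorted(extract_entities(
    "Crash in `utils.parse_config` when config.path is None"
)))
```

Entities extracted this way can be matched against graph node names to seed the anchor set before message passing.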

Temporal Prior

An issue-conditioned temporal retriever ranks nodes based on recent co-change history without leaking future information.
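A minimal co-change prior can be sketched as follows, assuming a commit history of `(timestamp, changed-node-set)` pairs and a set of anchor files; the exponential decay and the helper name are illustrative choices, not GREPO's exact retriever:

```python
from collections import Counter

def cochange_prior(history, issue_time, anchor_files, decay=0.9):
    """Score nodes by recent co-change with anchor files, pre-issue only.

    history: list of (timestamp, set_of_changed_nodes), oldest first.
    Only commits strictly before issue_time are used (no future leakage).
    """
    scores = Counter()
    past = [(t, nodes) for t, nodes in history if t < issue_time]
    for age, (_, nodes) in enumerate(reversed(past)):  # age 0 = most recent
        if nodes & anchor_files:
            w = decay ** age                  # recent co-changes count more
            for n in nodes - anchor_files:
                scores[n] += w
    return scores

history = [
    (1, {"a.py", "b.py"}),
    (2, {"a.py", "c.py"}),
    (5, {"b.py", "c.py"}),  # after the issue: must be ignored
]
prior = cochange_prior(history, issue_time=4, anchor_files={"a.py"})
print(prior.most_common())
```

Filtering on `t < issue_time` before scoring is what enforces the leakage-safe policy: the commit at time 5 contributes nothing.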

GNN Reranker

A query-aware GNN scores nodes inside the extracted subgraph to produce the final rankings evaluated by Hit@K.
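The reranking idea can be sketched with a didactic mean-aggregation scorer: propagate node features over the subgraph for a few hops, then score each node against the query embedding. This NumPy toy is a stand-in for a trained query-aware GNN, not the actual model:

```python
import numpy as np

def gnn_rerank(node_feats, adj, query_vec, hops=2):
    """Mean-aggregate neighbor features per hop, then rank by query dot product."""
    h = node_feats
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    for _ in range(hops):
        h = 0.5 * h + 0.5 * (adj @ h) / deg  # mix self and neighbor messages
    scores = h @ query_vec                   # query-conditioned node scores
    return np.argsort(-scores)               # ranked node indices

rng = np.random.default_rng(1)
feats = rng.normal(size=(6, 16))             # subgraph node features
adj = (rng.random((6, 6)) < 0.3).astype(float)
np.fill_diagonal(adj, 0)                     # no self loops
q = rng.normal(size=16)                      # query embedding
ranking = gnn_rerank(feats, adj, q)
print(ranking)
```

Conditioning the score on the query (rather than ranking nodes by a static feature) is what lets the same subgraph yield different top-K lists for different issues.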

Resources

Everything You Need to Reproduce

Open-source code, data, and ready-to-run commands for benchmarking.

Build Your Own GREPO Experiments

Use the leaderboard as a reference, then plug in your own retrieval or GNN model to test against the same Eval9 protocol.