A Dataset for Pull-based Development Research

by Gousios, Georgios and Zaidman, Andy

A pre-print version is available at /pub/pullreqs-dataset.pdf.
The publisher's page is available at doi:10.1145/2597073.2597122.
See the paper's associated code repository: gousiosg/pullreqs

This paper received the "MSR2014: Best data showcase paper" award.

Abstract

Pull requests form a new method for collaborating in distributed software development. To study the pull request distributed development model, we constructed a dataset of almost 900 projects and 350,000 pull requests, including some of the largest users of pull requests on GitHub. In this paper, we describe how the project selection was done, analyze the selected features, and present a machine learning tool set for the R statistics environment.

Bibtex record

@inproceedings{GZ14,
  author = {Gousios, Georgios and Zaidman, Andy},
  title = {A Dataset for Pull-based Development Research},
  booktitle = {Proceedings of the 11th Working Conference on Mining Software Repositories},
  series = {MSR 2014},
  year = {2014},
  isbn = {978-1-4503-2863-0},
  location = {Hyderabad, India},
  pages = {368--371},
  numpages = {4},
  doi = {10.1145/2597073.2597122},
  acmid = {2597122},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {distributed software development, empirical software engineering, pull request, pull-based development},
  url = {/pub/pullreqs-dataset.pdf},
  award = {MSR2014: Best data showcase paper},
  github = {gousiosg/pullreqs},
  speakerdeck = {629aa910cd09013116791efd7f77c4b7}
}
