<p>Georgios Gousios home page (<a href="http://www.gousios.gr">gousios.gr</a>, gousiosg@gmail.com; last updated 2021-02-28)</p>
<h2>IN4334 - Delft seminar on Machine Learning for Software Engineering</h2>
<p><em>2019-11-08, <a href="http://www.gousios.gr/blog/machine-learning-software-engineering">permalink</a></em></p>
<p><strong>with Maurício Aniche</strong></p>
<p>One of the fun facts about teaching a master's level course is that you get to pick what to teach. In our faculty, there is almost unlimited freedom, and this leads to students being exposed to the latest and greatest research in each teacher's respective field. Exploiting this freedom, last year I taught a seminar on a topic that I was relatively familiar with: Software Analytics (read about my experiences <a href="http://gousios.org/blog/a-seminar-on-software-analytics.html">here</a>).</p>
<p>This year, partnering with <a href="https://www.mauricioaniche.com">Maurício</a>, we
decided to stretch our topic to an upcoming and exciting sub-field: Machine Learning for Software Engineering.</p>
<p>Of course, anything related to Machine Learning is bound to raise the
interest of students. Indeed, we received 80+ subscriptions for a 40-seat
course. This led us to the interesting problem of having to select students;
we did so by asking the students to write a motivation letter explaining their
ML experience so far and their interest in the course. This trick prompted
self-selection: students who were not all that interested in the
course simply did not submit a motivation letter. We ended up with 30+ students
with sufficient machine learning background.</p>
<p>Similarly to last year, we organized the students in groups of 4. Each group had
to discuss 2 papers in a 2-hour slot. The <a href="http://gousios.org/courses/ml4se/discussing-papers.html">paper reading protocol</a> was again the
same: 5 minutes of warm-up presentation and a list of questions to facilitate
discussion. What changed this year, however, is that we did not ask the students
to write a survey but to develop an end-to-end pipeline. We
provided an indicative <a href="http://gousios.org/courses/ml4se/projects.html">list of topics</a> and corresponding papers, but we encouraged the students to come up with their own ideas.</p>
<p>In addition to our discussions in the class, we also invited people from the
industry to participate. We were lucky to have <a href="https://miltos.allamanis.com">Miltos Allamanis</a> from Microsoft Research, <a href="https://www.linkedin.com/in/vmarkovtsev/">Vadim Markovtsev</a> from source{d} and our own <a href="https://www.linkedin.com/in/vladimir-kovalenko-01416b88">Vladimir Kovalenko</a> (now at JetBrains) helping lead the discussions.</p>
<p>What follows below is the list of abstracts from the student papers (with minor
edits).</p>
<h3>The papers</h3>
<p><strong>Replication Study of Tree-to-Tree Neural Networks for Program Translation</strong></p>
<p>Source code is written in a programming language specifically chosen to fit the
goal of the program. For example, Python or JavaScript can be used for scripting
and C or C++ can be used for high-performance programs. Different reasons can
motivate a decision to migrate code from one language to another, such as
performance, maintainability (e.g. deprecation) or other factors like business
decisions. While current state-of-the-art performance on Machine Translation is
achieved using sequence-to-sequence networks, a study by Chen et al. has shown
promising results with Tree-to-Tree Neural Machine Translation. In this paper we
replicated their model to investigate whether we are able to achieve
similar performance with the information provided.</p>
<p>We compare our results to the state-of-the-art results achieved by the tree2tree
model of Chen et al., and the results described in Nguyen et al. While our
replication achieves lower accuracy than Chen et al., it still beats the
sequence-to-sequence benchmarks by up to 10 points.</p>
<p><strong>AutoComment: Comment Generation in Java Code</strong></p>
<p>Commenting large codebases is crucial for code comprehension and efficient
maintenance of a code base. Therefore, automatic comment generation would be
incredibly beneficial for both the programmer and future maintainers of the code. In
this paper, we propose a comment generator model using new state-of-the-art
techniques developed in previous years, based on code2seq, for comment
generation in Java code. With DeepCom as the baseline, the paper focuses on
replicating the code2seq model with added capabilities such as predicting
natural language (Method-1) and modified ASTs (Method-2). The results show that
Method-2 is capable of understanding the syntactic and semantic meaning of Java
code to generate comments automatically, but fails to
generate longer, complete comments, hence leading to a poor BLEU-4 score when
compared to the baseline.
(<a href="https://github.com/LRNavin/AutoComments">code</a>)</p>
<p><strong>Generating Commit Messages from Git Diffs</strong></p>
<p>Commit messages help developers in their understanding of a continuously
evolving codebase. However, developers do not always document code changes
properly. Automatically generating commit messages would relieve developers of
this burden. Recently, a number of works have demonstrated the
feasibility of using methods from neural machine translation to generate commit
messages. This work aims to reproduce a prominent research paper in this field
(Jiang and McMillan), as well as attempt to improve upon their results by
proposing a novel preprocessing technique. A reproduction of the reference
neural machine translation model was able to achieve slightly better results on
the same dataset. When applying more rigorous preprocessing, however, the
performance dropped significantly. This demonstrates an inherent shortcoming of
current commit message generation models, which perform well by memorizing
certain constructs. Future research directions might include improving diff
embeddings and focusing on specific groups of commits.
(<a href="https://arxiv.org/pdf/1911.11690.pdf">paper</a>)</p>
<p><strong>Using Distributed Representation of Code for Bug Detection</strong></p>
<p>Recent advances in neural modeling for bug detection have been very promising. Specifically, using snippets of code to create continuous vectors, or embeddings, has been shown to work very well for method name prediction and has been claimed to be effective at other tasks, such as bug detection. To date, however, the method has not been empirically tested for the latter.
In this work, we use the Code2Vec model of Alon et al. and evaluate it for detecting off-by-one errors in Java source code. We define bug detection as a binary classification problem and train our model on a large Java file corpus containing likely correct code. In order to properly classify incorrect code, the model needs to be trained on negative examples as well. To achieve this, we create likely incorrect code by making simple mutations to the original corpus.</p>
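<p>To illustrate the mutation step, here is a minimal sketch (not the students' actual code) that generates likely off-by-one mutants by flipping comparison operators in a line of code; the regex and the mutation table are simplifying assumptions of mine:</p>

```python
import re

# Operator flips that tend to introduce off-by-one errors.
# (Hypothetical rules; the students' actual mutation scheme may differ.)
MUTATIONS = {"<=": "<", ">=": ">", "<": "<=", ">": ">="}

def mutate_comparisons(line):
    """Yield one mutant per comparison operator found in a line of code."""
    # Note: this naive regex would also match '<' in generics like List<String>.
    for match in re.finditer(r"<=|>=|<|>", line):
        op = match.group()
        yield line[:match.start()] + MUTATIONS[op] + line[match.end():]

# Each mutant changes exactly one operator, leaving the rest of the line intact.
mutants = list(mutate_comparisons("for (int i = 0; i < n; i++)"))
# → ["for (int i = 0; i <= n; i++)"]
```

<p>Pairing each original snippet (label: correct) with such mutants (label: incorrect) yields the balanced training set the abstract describes.</p>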
<p><strong>Code completion using Byte Pair Encoding</strong></p>
<p>In this paper, we aim to do code completion by implementing a neural
network from Li et al. Our contribution is that we use an encoding that is
in-between character and word encoding, called Byte Pair Encoding (BPE). We use
this on the source code files, treating them as natural text, without first going
through the abstract syntax tree (AST). We have implemented two models: an
attention-enhanced LSTM and a pointer network, where the pointer network was
originally introduced to solve out-of-vocabulary problems. We are interested to
see if BPE can replace the need for the pointer network in code completion.</p>
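<p>For readers unfamiliar with BPE, the core merge-learning loop fits in a few lines. The sketch below is a toy illustration on identifiers, not the setup of Li et al. or the students' implementation:</p>

```python
from collections import Counter

def learn_bpe(corpus_tokens, num_merges):
    """Learn BPE merge rules from a list of words (toy sketch)."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(word) for word in corpus_tokens)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word, fusing occurrences of the winning pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# "et" is merged first: it appears in all three identifiers.
merges = learn_bpe(["getValue", "getName", "setValue"], 3)
```

<p>The appeal for code is that frequent fragments (<code>get</code>, <code>Value</code>) become single tokens while rare identifiers still decompose into known subwords, shrinking the vocabulary without an AST.</p>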
<p><strong>DLTPy: Deep Learning Type Inference Of Python Function Signatures Using Natural Language Context</strong></p>
<p>Due to the rise of machine learning, Python is an increasingly popular
programming language. Python, however, is dynamically typed. Dynamic typing has
been shown to have drawbacks as a project grows, even though it improves
developer productivity. To have the benefits of static typing combined with
high developer productivity, types need to be inferred. In this paper, we
present DLTPy, a deep learning type inference solution for the prediction of
types in function signatures based on the natural language context (identifier
names, comments and return expressions) of a function. We found that DLTPy is
effective and has a top-3 F1-score of 91.6%. This means that in most
cases the correct type is within the top-3 predictions. We conclude that the natural
language contained in comments and return expressions is beneficial for
predicting types more accurately. DLTPy does not significantly outperform or
underperform the previous work NL2Type for JavaScript, but does show that
similar prediction is possible for Python.
(<a href="https://arxiv.org/abs/1912.00680">paper</a>)</p>
<p><strong>Identifying Approaches to Detect Logging Opportunities in Java Source Code</strong></p>
<p>Sizeable modern software projects produce massive amounts of log data. Logging
the correct information is hard, and doing it correctly can massively speed up
failure diagnosis. Due to the difficulty of logging correctly, a tool predicting
where log statements are necessary would assist developers and enhance development
productivity. Unfortunately, there is as yet no single best solution. Therefore, in
this paper, we explore several approaches to identify the most promising ones.
These approaches are tested on Apache projects, as those comply with a high-quality
logging standard. One approach we used to capture code meaning/context is
code2vec. We conclude that a custom-trained code2vec model combined with
a random forest or SVM classifier is the most promising, with a balanced accuracy of 0.71 and a recall
of 0.48. We also conclude that a pretrained code2vec model cannot always be
simply applied to different code context problems.</p>
<p><strong>Multi-label Classification for Automatic Tag Prediction in the Context of Programming Challenges</strong></p>
<p>One of the best ways for developers to test and improve their skills in a fun
and challenging way is through programming challenges, offered by a plethora of
websites. To the inexperienced, some of the problems might appear too
challenging, and suggestions may be needed to implement a solution. On the other
hand, tagging problems can be a tedious task for problem creators. In this
paper, we focus on automating the task of tagging a programming challenge
description using machine and deep learning methods. We observe that the deep
learning methods we implemented outperform well-known IR approaches such as tf-idf,
thus providing a starting point for further research on the task.
(<a href="https://arxiv.org/abs/1911.12224">paper</a>)</p>
<p><strong>Learning How to Mutate Python Source Code from Bug-Fixes</strong></p>
<p>The goal of this project was to apply the approach of an existing paper to a similar problem. The project
focuses on learning Python code mutants for mutation testing and
takes [9], which researched the same topic for Java code, as a blueprint.</p>
<h3>The experience</h3>
<p>All in all, this was the most interesting course I've ever taught, by far. I've
learned a lot, mostly on the technical side. For example, having Miltos
explain GGNNs to us was an invaluable experience! I've also come to
appreciate the state of the art in the field, which is, let's just say, emerging 😉</p>
<p>But what I've mostly come to admire is our students. What a group! It was
amazing to see that, in less than 2 months, students were able to fully replicate existing work, come
up with great new ideas and even write almost publication-quality papers. No
matter where they go after the course, I am sure that they will shine! I would
hire 90% of them on the spot, if I were in the industry.</p>
<p>As usual, you can find the course materials under a CC license <a href="http://gousios.org/courses/ml4se/">on my
homepage</a>. Feel free to reuse them for
your own courses!</p>
<h2>Teaching Software Analytics</h2>
<p><em>2018-11-10, <a href="http://www.gousios.gr/blog/a-seminar-on-software-analytics">permalink</a></em></p>
<p>At the TU Delft CS master's program, students have to take at least one seminar
course. As opposed to normal courses, where a more traditional teaching method
is the norm, in seminar courses the students have to read papers – lots of
them. This makes the format ideal for courses that teach students topics at a
field's cutting edge. In the previous quarter (Sep - Oct), I taught such a
course: Software Analytics.</p>
<p>It was the first time I taught both a seminar course and Software Analytics,
which also happens to be my <a href="https://se.ewi.tudelft.nl/research-lines/software-analytics/">field of
expertise</a>. I was therefore
relatively confident about the contents, but not so much about the format. After
lots of reading online, I've come to the conclusion that seminars are all about
the students reading and discussing interesting papers, with minimal teacher
involvement. So I decided to have students in the lead. With the course's
teaching team (<a href="https://inventitech.com">Moritz</a>,
<a href="https://mkechagia.github.io">Maria</a>,
<a href="https://ayushirastogi.github.io">Ayushi</a> and
<a href="https://jhejderup.github.io">Joseph</a>), we identified 8 high-level
topics that we believed provide a broad coverage of the area, and came up with
10 papers per topic that acted as the seed for the students' investigations. For
each topic, a student group (or groups, we do not know in advance how many
students participate in our courses) would have to i) prepare a group discussion
of an interesting paper, ii) write a systematic literature review
of cutting edge work in the field, and iii) work on a limited replication of a paper.</p>
<h3>The format</h3>
<p>The <strong>discussions</strong> were scripted. I compiled a list of guidelines on <a href="http://gousios.org/courses/ml4se/2018/discussing-papers.html">how to run
a paper discussion</a>,
partially
to help the students and partially to make sure we have a common format during our
discussions. The responsible group would announce a paper that everybody
should read one week in advance. Each discussion started with a short
presentation (4-5 slides) of the main paper points (authors, motivation,
research method, results, implications). The moderator would kick-off the
discussion asking generic questions (e.g. "what did you like about the paper"),
but then he/she would have to dive into the paper contents. At the appropriate
times, members of the teaching team were instructed to jump in to ask a
provoking question or to move the discussion towards more fruitful paths.
Importantly, one person of the moderating team would have to keep notes of the
discussion. This helped us compile extensive documentation on what went on during
the course, which you can find linked from the <a href="http://gousios.org/courses/ml4se/2018/">course's web
page</a>.</p>
<p>The paper discussions were meant to introduce all students to a particular
topic; then, each group would have to go deep in each corresponding topic by
reading the latest related work (no older than 5 years) and write a 4-5k word
survey. The <strong>systematic literature reviews</strong> were, again, scripted. The
students were instructed to follow <a href="https://dl.acm.org/citation.cfm?id=1134500">Kitchenham's
method</a> for performing reviews in
software engineering to answer 3 research questions, relating to the
state-of-the-art in research and practice and future directions about each
particular field. The initial paper selection was followed by an intermediate
presentation per team/topic, to ensure that everybody is on the same page.
Crucially, students had to peer-review each other's chapters; per week, a pull
request per group would have to be reviewed by two more groups. This ensured
that all students would have at least read all other chapters at least once,
thereby obtaining a bird's eye view of the area. All surveys were collected and
<a href="https://saltudelft.github.io/software-analytics-book/">published as a book
here</a>; we intend to keep
this book updated through future runs of the course.</p>
<p>Reading, discussing and writing about existing work should theoretically suffice for a seminar course; but knowing our students, I knew that they would be
missing something if we stopped there: hacking! So during the last part of the
course, the students had to <strong>partially replicate</strong> existing work. I think
this is where the students excelled: in just 2-3 weeks, students that had not
done any repository mining before managed to produce high-quality reports,
sometimes by replicating the <em>whole</em> data collection pipeline. I was impressed
with the results!</p>
<h3>The experience</h3>
<p>Running a seminar course was a learning experience as much for me as it
(hopefully!) was for the students. I learned about the value of such courses:
within 8 weeks, through and with my students, I got in touch with the latest and
greatest of our field. I also hope that the students developed an intuition on
what makes a research work timeless: strong motivation, razor-sharp answers to
the RQs and crystal-clear discussion of implications. I did notice that the
discussions towards the end of the course were more focused on those high-level
concepts rather than details about why the authors chose statistical test X or
modelling method Y. The course workload was a bit on the heavy side; this did
not allow the students to perfect their surveys. Finally, on revisiting the
replication results, I was glad the course was a seminar at heart: the students
already knew how to set up data pipelines, so there was not much to learn there.</p>
<p>Changes I would do next year:</p>
<ul>
<li><p>No laptops! I will ask the students to print the papers and hand annotate
them. This will hopefully help to keep the discussion participants focused.</p></li>
<li><p>I will be more silent. I felt that my urge to talk about all the wonderful
things we are doing in software analytics was sometimes overwhelming to the
student moderators.</p></li>
<li><p>Most probably, I will drop the replication part. Our students are already
pretty good at designing data pipelines, so I will devote the extra time to
perfect the literature surveys.</p></li>
</ul>
<p>As usual, you can find the course materials under a CC license <a href="http://gousios.org/courses/ml4se/2018/">on my
homepage</a>.</p>
<h2>Introducing the FASTEN project</h2>
<p><em>2018-10-03, <a href="http://www.gousios.gr/blog/Introducing-Fasten">permalink</a></em></p>
<p>A popular form of software reuse is the use of Open-Source Software (OSS) libraries, hosted on centralized code repositories, such as Maven or NPM.
Developers only need to declare dependencies to external libraries, and automated tools make them available to the workspace of the project.</p>
<p>In recent years, we have seen package management fail in spectacular ways:</p>
<ul>
<li>In the <a href="https://www.theregister.co.uk/2016/03/23/npm_left_pad_chaos/">left-pad incident</a>, a developer broke a significant part of the Internet
by just removing a package from NPM.</li>
<li>Equifax <a href="http://time.com/money/4936732/equifaxs-massive-data-breach-has-cost-the-company-4-billion-so-far/">lost $4 billion</a> because they deemed a security
update unnecessary.</li>
<li>A Linux kernel developer <a href="https://lwn.net/Articles/731941/">engaged in a series of litigation actions</a> against tens of companies claiming lost revenue, due to, in his view, improper enforcement of the GPL in a transitively derivative project.</li>
<li>A <a href="https://queue.acm.org/detail.cfm?id=3205288">recent study</a> by Lauinger et al. found that 1 out of 3 top sites uses at least one library with a known vulnerability.</li>
</ul>
<p>The list goes on. Package management, and its repercussions, is a topic that
affects the daily lives of millions of developers and users, but it has
only received moderate attention from researchers.</p>
<p>Last spring, I led a group of 7 partners towards the submission of a project
proposal to the H2020-ICT-18 call. Then, in August, we learned that
the European Commission granted our consortium a significant amount of money
to make package management more intelligent!</p>
<p>The core idea behind FASTEN is really simple: <em>instead of analyzing dependencies
at the package level, we will analyze them at the call graph level</em>! This will
allow us to be super precise when we are tracking dependencies, when
we do change impact analysis, when we recommend clients to update packages etc.
It will also open the door to new sophisticated applications, e.g. licensing compliance, dependency risk profiling and data-driven API evolution.</p>
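<p>To make the precision difference concrete, here is a toy sketch (all function names are made up; this is not FASTEN code) contrasting a package-level dependency check against a call-graph-level reachability check:</p>

```python
# Toy ecosystem call graph: caller -> callees. All names are hypothetical.
calls = {
    "myapp.main": ["jinja2.render"],
    "jinja2.render": ["jinja2.escape"],
    "jinja2.legacy_parse": ["jinja2.unsafe_eval"],  # vulnerable, but unused here
}

def reachable(start, graph):
    """Return all functions transitively reachable from `start`."""
    seen, todo = set(), [start]
    while todo:
        func = todo.pop()
        if func not in seen:
            seen.add(func)
            todo.extend(graph.get(func, []))
    return seen

vulnerable = "jinja2.unsafe_eval"
used = reachable("myapp.main", calls)

# Package-level view: myapp depends on a package with a vulnerable function.
package_alert = any(f.startswith("jinja2.") for f in used)  # True
# Call-level view: the vulnerable function is never actually reached.
call_alert = vulnerable in used                              # False
```

<p>The package-level check raises a false alarm that the call-level check avoids; scaled to whole ecosystems, that is the precision gain FASTEN is after.</p>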
<p>As is usual in those cases, while the idea sounds simple and straightforward,
its practical implementation, <a href="https://pure.tudelft.nl/portal/en/publications/prazi-from-packagebased-to-precise-callbased-dependency-network-analyses(6e9d35bd-b512-4bdd-80c1-53608d2acda6.html">as we learned in our related work on
Rust</a>,
is anything but. Static call graph generators are unsound; modern features in
programming languages (dynamic dispatch, extensible classes) complicate static
call graph generation; in many languages, projects need to be built before
constructing call-graphs; the generated graphs are huge; the queries
we will need to run will bring current graph databases to their knees.
However, the accuracy benefits of creating ecosystem-level, versioned
call-graphs outweigh the drawbacks. In our preliminary Rust study, we found
that in the case of pin-pointing vulnerable packages, accuracy can be improved
by <strong>3x</strong>, the issues mentioned above notwithstanding.</p>
<p><img src="/files/fasten.png" style="width: 70%; display: block; margin: 0 auto;"></p>
<p>Our vision for better dependency management goes beyond giant call graphs.
Our goal, and also the project's <em>raison d'être</em>, is to bring the benefits
of fine-grained dependency tracking to the hands of developers. To do this,
we will create a continuously updated service that automatically analyzes
all package versions in the Java, Python and C (Debian) ecosystems and
maintains the call-graphs centrally. On top of this, we will create processes
that read data from external sources (e.g. security disclosures, GitHub
analytics) and enrich the graph, by appending information to the graph nodes
(functions). To compensate for inefficiencies in call-graph generation,
we will allow clients to upload call-graphs generated by running a project's
test suite; this will allow us to enrich the graph edges (function calls).
We will implement custom analyses (e.g. vulnerability tracking)
as efficient traversals on our graph.
And, importantly, we will create plug-ins for Maven and
PyPI, to enable developers and CI environments to query the FASTEN knowledge
base in a way that looks like this:</p>
<pre><code>$ pip list
docutils (0.10)
Jinja2 (2.7.2)
MarkupSafe (0.18)
$ pip check-security
Jinja2 (2.7.2) has known vulnerabilities (your project is affected!)
Update to version >=2.7.3 (will not break your project)
$ pip test-upgrade Jinja2 --version 2.8
Upgrading to Jinja2 2.8 will break the following methods:
myproject.foo()
myproject.bar()
$ pip what-breaks --delete myproject.foo
The following direct dependencies will break if you *delete* function foo()
* projectA: 15 methods use foo()
* projectB: 10 methods use foo()
632 indirect dependencies will fail to work.
$ pip test --upload-dyngraph
............15 Tests run OK!
Dynamic call graph at: myproject.dot
Uploading dynamic call graph to FASTEN
</code></pre>
<p>An incredible group of researchers and practitioners with a passion for research
that has impact are collaborating in the FASTEN project. I list them below,
along with the main contact point per partner (many of them are hiring 😊):</p>
<ul>
<li><a href="https://se.ewi.tudelft.nl/softanalytics.html">Software Engineering Research group</a>, TU Delft, <a href="http://gousios.org">Georgios Gousios</a>, PI</li>
<li><a href="https://www.balab.aueb.gr/">Business Analytics Lab</a>, Athens University of
Economics and Business, <a href="http://spinellis.gr/">Diomidis Spinellis</a></li>
<li><a href="http://law.di.unimi.it/">Laboratory for Web Algorithmics</a>, U Milano, <a href="http://vigna.di.unimi.it/">Sebastiano Vigna</a></li>
<li><a href="https://www.xwiki.com/en/">XWiki SAS</a>, makers of the XWiki collaboration platform, <a href="https://about.me/vmassol">Vincent Massol</a></li>
<li><a href="https://endocode.com/">Endocode AG</a>, experts in OSS licensing, <a href="https://about.me/mirkoboehm">Mirco Boehm</a></li>
<li><a href="https://sig.eu">Software Improvement Group</a>, experts in software quality, <a href="https://nl.linkedin.com/in/magiel-bruntink-b744b11">Magiel Bruntink</a></li>
<li><a href="https://www.ow2.org">OW2 consortium</a>, experts in OSS communities, <a href="https://fr.linkedin.com/in/cedricthomas">Cédric Thomas</a></li>
</ul>
<p>At TU Delft, we have openings for <strong>2 PhD students</strong> and <strong>1 scientific
programmer</strong>. If you are up for a serious research challenge, with ample
opportunity to use your hacking skills to impact everyday software
development, <a href="mailto:g.gousios@tudelft.nl">we would like to hear from you!</a></p>
<h2>Report from ICSE 2017</h2>
<p><em>2017-05-27, <a href="http://www.gousios.gr/blog/Report-from-ICSE-2017">permalink</a></em></p>
<p>On the week of May 18-27, I travelled to Argentina to attend MSR and ICSE. For
people not familiar with software engineering research, ICSE is the flagship
conference of the field and one of the few that receive an A* rating in the
<a href="http://portal.core.edu.au/conf-ranks">Core conference rankings</a>. All aspiring
software engineering researchers aim to publish there, which makes it very
competitive (~16% acceptance rate). This year, ICSE took place in the beautiful
city of Buenos Aires, a place that reminded me of home (well, Athens) more than
any other city I have been to.</p>
<p>I really enjoyed ICSE this year. The main track content was quite technical,
while the software engineering in practice track was full of interesting tales
from the battlefront. I really wish I could have been at more than one talk at the
same time!</p>
<p>For the first time in 3 years, I did not have anything to present; this
allowed me to relax and actually attend lots of talks. Moreover, I also forced
myself to take notes during the presentations. What follows is a list of brief
summaries of some of the papers whose presentations I attended, based on those
notes.</p>
<p>In <a href="http://www.cs.cmu.edu/~aldrich/papers/icse17-glacier-immutability.pdf">Transitive class immutability in Java</a>, the authors implement a plug-in type
system for Java (based on annotations and the Checker framework) that can
check and/or enforce immutability on object hierarchies, transitively. The
authors applied it on existing programs and found bugs, and also compared its
use with Java's own <strong>final</strong> keyword (they found Glacier superior).</p>
<p>In <a href="https://users.encs.concordia.ca/~nikolaos/publications/ICSE_2017.pdf">Clone refactoring with lambda expressions</a>, the authors propose and formally
specify a simple refactoring to lambda expressions that eliminates certain types of clones. This paper stood out for its extensive evaluation: the authors
successfully tried their refactoring on 12k (!) clones covered by tests,
and it can refactor away 58% of clones in a body of 46k (!!) clones.</p>
<p>In <a href="https://pure.tudelft.nl/portal/files/11926462/TUD_SERG_2017_006.pdf">A Guided Genetic Algorithm for Automated Crash Reproduction</a>, my colleague Mozhan presented
a method that can automatically generate test cases that reproduce
crashes, using clever applications of genetic algorithms in EvoCrash.</p>
<p>In <a href="https://www.microsoft.com/en-us/research/publication/general-framework-dynamic-stub-injection/">General framework for dynamic stub injection</a>, the awesome Maria Christakis
created a framework that intercepts the Windows dynamic library loader and
inserts code that runs before and after method invocations (stubs). The authors also
constructed a DSL that allows the specification of stub locations. They used
this tool to find bugs in applications as well tested as Word and Excel.</p>
<p>In <a href="http://www0.cs.ucl.ac.uk/staff/z.gao/doc/paper/type_study.pdf">To Type or Not to Type: Quantifying Detectable Bugs in JavaScript</a>, the authors attempted to quantify
the value of pluggable type systems, such as Flow and TypeScript. The
authors empirically verify that pluggable type systems can catch at least 15%
of public bugs at the cost of 2 tokens per bug! A major win for static typing and a fantastic result overall.</p>
<p>Moving on to user studies, in <a href="http://www.infosun.fim.uni-passau.de/publications/docs/JAH+17.pdf">Classifying developers into core and peripheral</a>
the authors presented a graph-based method (basically, extracting the fully
connected component of the code collaboration graph) that identifies core developers. They validated it by emailing projects and checking each project's
responses against their network-based model. From now on, graphs it is then.</p>
<p>In <a href="http://carver.cs.ua.edu/Papers/Conference/2017/ICSE_OTC.pdf">Understanding the Impressions, Motivations, and Barriers of One Time Code
Contributors to FLOSS
Projects</a>, the
authors performed a study very similar to our <a href="/bibliography/GSB16.html">ICSE 2016 pull request contributor</a> one and found more or less the same results,
albeit NOT on GitHub. This means that topics such as developer responsiveness
and entry barriers exist beyond GitHub and are therefore crucial to fix.</p>
<p>In <a href="http://static.barik.net/barik/publications/icse2017/PID4655707.pdf">Do developers read compiler error
messages</a>,
the authors explore the issue of compiler message comprehension. They did an
eye-tracking study to identify what developers look for when reading error
messages and where their visual perception stumbles. Being one of the very
few studies that use eye tracking in our field, this is definitely worth a read.</p>
<p>On the SE in practice front, I attended a great talk about <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45794.pdf">CI at Google
scale</a>, a <a href="http://www.ifi.uzh.ch/dam/jcr:bafebc0f-ac0c-46d9-934b-4a0d5e2aab14/Characterizing_Experimentation_SEIP2017.pdf">quantitative overview of A/B experiments in Bing</a> (did you know that 50% of them
are actually deployed?) and a talk about Mike de Jong's <a href="http://quantumdb.io">QuantumDB
framework</a> for uninterrupted schema evolution in continuous delivery environments.</p>
<p>The papers above are just a selection of all the talks I attended this year; I
did not cover MSR or various workshops. All in all, a very fulfilling week!</p>
<h2>The Issue 32 incident – An update</h2>
<p><em>2016-03-08, <a href="http://www.gousios.gr/blog/Issue-thirty-two">permalink</a></em></p>
<p>Many of you are aware of the <a href="https://github.com/ghtorrent/ghtorrent.org/issues/32">GHTorrent issue
32</a>. To sum up the
discussion in a couple of lines, various developers included in GHTorrent wanted
their email removed from it (which I did) and then wanted all emails to be
excluded from the dataset (which I refused to do). The reasons behind the
requests were <em>privacy</em> and the <em>right to do whatever one wants</em> with one's
personal data (email in many jurisdictions is considered personal data). What
caused the whole thread was that researchers used GHTorrent as a source of
emails for <em>research surveys</em>, which were sent to thousands of developers; this
made some developers nervous with respect to the use of their personal data.</p>
<p>In response to the issue, I temporarily shut down access to GHTorrent
until the situation was cleared up. In the meantime, I consulted my
employer's legal department and several experts, mostly in the legal and data
protection domains. The issue even <a href="https://legalict.com/privacy/is-it-legal-for-ghtorrent-to-aggregate-github-user-data/">caught the
interest</a>
of respected ICT lawyer <a href="http://www.arnoud.engelfriet.net/">Arnoud Engelfriet</a> and
caused interesting meta-discussion on a <a href="http://blog.iusmentis.com/2016/02/29/mag-ghtorrent-openbare-data-github-aggregeren-als-onderzoeksdataset/">legal
forum</a>
(Dutch).</p>
<p>Here is what I learned.</p>
<h3>Personal data</h3>
<p>Personal data are defined as "any information relating to an identified or
identifiable natural person". From the information that GHTorrent collects,
emails and real names can be considered <em>personal data</em>.</p>
<p>As GHTorrent is processing personal data (e.g. linking users with their
actions), it must comply with data protection legislation. This includes:</p>
<ul>
<li>controlling access to and distribution of personal data</li>
<li>informing subjects about how their personal data is used</li>
<li>enabling subjects who do not agree with the use of their personal data to
<em>opt out</em> of the collection</li>
<li>enabling subjects to have their data completely removed (<a href="https://en.wikipedia.org/wiki/Right_to_be_forgotten">right to be forgotten</a>)</li>
</ul>
<p>Now, here is a catch: GHTorrent started as a project in Europe but has
since moved to the US (to the East-US Azure datacenter). Which legislative
domain should the project comply with? Most lawyers I asked said that
being on the safe side is the best way to go, which means adhering to the
union of the provisions of both European and US data protection laws.</p>
<h3>Privacy</h3>
<p>GHTorrent is not required by law to create an <em>opt-in</em> mechanism. The personal data have
been shared publicly by the subjects themselves and, as GHTorrent does not break GitHub's
terms of use, it can download and process them, provided it complies with data
protection laws.</p>
<p>Moreover, GHTorrent bears no liability for what its users do
with the data; as a top consultant in our university's legal department
put it, "a store owner is not responsible if you buy a hammer and kill
someone".</p>
<h3>Ethical considerations</h3>
<p>The root of the problem lies in the domain of ethics; are we researchers allowed
to use databases such as GHTorrent to target specific developers with our
research surveys? Can we profile developers using their public activity traces?
Can we rank developers based on the quality of the code they write? Can we recommend work
to developers (e.g. solving issues), based on their expertise on specific
project areas, programming techniques or other projects? There are no easy
answers. Personally, I have run
<a href="http://gousios.gr/blog/How-do-project-owners-use-pull-requests-on-Github/">surveys</a>
<a href="http://testroots.org">twice</a> and even <a href="http://gousios.gr/blog/Scaling-qualitative-research/">provided
guidance</a> on how to do it
at scale.</p>
<p>One safeguard that we can employ is <em>anonymity</em>: we can do research with
publicly available traces BUT we should make sure that the results (when shared)
do not reveal the true identities of developers. In turn, this might have negative
implications for the public replicability of the results or their re-use. To
address this, replication packages can be offered in a tiered manner: i)
anonymized data with public access, and ii) data that may reveal the
identities of people under agreements of non-disclosure.</p>
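<p>One way to implement the first tier is to replace each developer identity with a stable pseudonym, e.g. a keyed hash of the login, so that anonymized records from different tables still link up without revealing who is who. Below is a minimal sketch of this idea; the record layout, field names and key handling are my assumptions, not GHTorrent's actual schema:</p>

```python
import hashlib
import hmac

# Secret key kept by the data curator; without it, pseudonyms
# cannot be re-linked to GitHub logins.
SECRET_KEY = b"replace-with-a-long-random-secret"

def pseudonymize(login: str) -> str:
    """Map a login to a stable, non-reversible pseudonym."""
    digest = hmac.new(SECRET_KEY, login.encode("utf-8"), hashlib.sha256)
    return "dev_" + digest.hexdigest()[:12]

def anonymize_record(record: dict) -> dict:
    """Return a copy of a record with identifying fields scrubbed."""
    out = dict(record)
    out["login"] = pseudonymize(record["login"])
    out.pop("email", None)  # drop direct identifiers entirely
    out.pop("name", None)
    return out

record = {"login": "alice", "email": "a@example.com", "name": "Alice", "commits": 42}
print(anonymize_record(record))
```

<p>A keyed hash matters here: logins are public, so an unkeyed hash could be reversed simply by hashing every known GitHub login.</p>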
<h3>A course of action</h3>
<p>To comply with legal requirements, GHTorrent will do the following:</p>
<ul>
<li><p>No real names or emails will be shared in the MySQL dumps or the online MySQL
access services. All other database fields will be left untouched. Researchers
interested in mining GitHub users can do so using the <code>login</code> field as
a reference. Getting the email is a GitHub API call away, but it will be the
responsibility of the researcher to go the extra mile. GHTorrent itself will
continue to use emails internally, to ensure consistency of the data.</p></li>
<li><p>GHTorrent will offer to interested researchers an additional download that
links logins to personal data (essentially, a CSV file with 3 fields: <code>login</code>,
<code>email</code>, <code>name</code>). To obtain access to it, researchers will need to agree to
the terms of use and publicly state the intended use, by submitting a pull
request <a href="http://ghtorrent.org/pers-data.html">here</a>.</p></li>
<li><p>A FAQ with specific privacy-related questions/answers will be created and
shared on GitHub. People can submit their questions in <a href="http://ghtorrent.org/faq.html">this
page</a>.</p></li>
<li><p>An opt-out process will be created; developers who do not want to be tracked
will be able to opt out by submitting a form on the GHTorrent web site. This
will have the effect of their <code>name</code> and <code>email</code> fields being replaced with
random strings.</p></li>
<li><p>The research community will need to work on a code of conduct with respect to the use of
personal data for research. Grassroots efforts have started already; we
should coordinate them and will hopefully come up with a document
during this year's <a href="http://2016.msrconf.org">MSR conference</a>.</p></li>
</ul>
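<p>The opt-out step in the list above amounts to overwriting the <code>name</code> and <code>email</code> fields of a user record with random strings while leaving every other field intact. A minimal sketch of that scrubbing, assuming a flat user record rather than GHTorrent's actual MySQL schema:</p>

```python
import secrets

def opt_out(user: dict) -> dict:
    """Scrub identifying fields, as in the opt-out process above:
    name and email become random strings; other fields are untouched."""
    scrubbed = dict(user)
    for field in ("name", "email"):
        if scrubbed.get(field):
            scrubbed[field] = secrets.token_hex(16)
    return scrubbed

user = {"login": "bob", "name": "Bob B.", "email": "bob@example.com", "company": "ACME"}
print(opt_out(user))  # login/company unchanged, name/email randomized
```

<p>Using <code>secrets</code> rather than <code>random</code> makes the replacement strings unpredictable, so scrubbed values cannot be regenerated.</p>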
<p>The changes (especially the opt-out process) will need some time to implement; in
the meantime, GHTorrent will re-enable online access to the datasets (excluding
email addresses and real names in MySQL of course), but will not offer downloads
until all the measures above have been put into place.</p>
<h3>Personal reflection</h3>
<p>In the heat of the discussion, I made a couple of statements that i) were
wrong and ii) made the discussion hotter. Initially, I wrongly claimed that copyrighted
content shared on the web without a license is in the "public domain". The
opposite is actually true: copyrighted content on the web with no accompanying
license means that all rights are reserved by the copyright holder (usually the
content creator). Moreover, I said that "GHTorrent tracks public data", while
emails and names are in fact personal data, as they identify a person uniquely.
Finally, while as a researcher I
<a href="http://gousios.gr/blog/How-contributors-use-pull-requests-GitHub/">witnessed</a>
and
<a href="http://gousios.gr/blog/How-do-project-owners-use-pull-requests-on-Github/">documented</a>
the negative effect of online conflict on GitHub, I still managed to fall
into the same trap.</p>
<p>I would like to thank Efthimia Aivaloglou, Arie van Deursen, Arnoud Engelfriet,
Sean Lang, Leon Moonen, Mario van der Toorn, Bogdan Vasilescu, and Alexis Zavras
for their advice and support during those two hectic weeks.</p>
How do project contributors use pull requests on Github?2015-06-26T00:00:00+02:00http://www.gousios.gr/blog/How-contributors-use-pull-requests-GitHub<p><em>with <a href="http://sback.it">Alberto Bacchelli</a></em></p>
<p>Distributed software development projects employ collaboration models and
patterns to streamline the process of integrating incoming contributions.
Classic forms of code contributions to collaborative projects include change
sets sent to development mailing lists or issue tracking systems and direct
access to the version control system. More recently however, a big portion
of open source development happens on GitHub. One of the main reasons for
this is the fact that contributing to a GitHub project is a relatively
pain-free experience. Or is it?</p>
<p>In April 2014, we ran a survey among contributors (also: <a href="http://www.gousios.gr/blog/How-do-project-owners-use-pull-requests-on-Github/">integrators</a>) to GitHub projects,
trying to understand how they use pull requests and what issues they face
while doing so. We got <em>760</em> responses, which after extensive preprocessing
were reduced to <em>650</em>.</p>
<h3>What motivates contributors?</h3>
<p>The main motivation for contributing to a project is its usage. This usage can
be a dependency from another project the contributor is developing or fixing an
end user bug. Altruistic motives (still) play a role: 33% of the respondents
mentioned that they want to devote their time to a good cause. Developers also
contribute out of genuine interest and for personal development, for
example to <em>sharpen their programming skills</em> or for the <em>intellectually
stimulating</em> fun of it. Finally, approximately 35% of the respondents related contributions to career development, ranging from enriching their profile to attracting new customers.</p>
<blockquote>
<p>
<i>R121: Making contributions to [project] makes it easier for me to get new clients.</i>
</p>
</blockquote>
<h3>What does the pull request process look like for contributors?</h3>
<p>Initially, most (76%) contributors look for open issues related to their
changes, or check whether similar pull requests were submitted recently (59%).
Half of them try to communicate their changes to the core team. Moreover,
trying to be good citizens, they check for pull request guidelines in
the project. Only a few (32%) check for similar open PRs; this might
explain our finding
<a href="/blog/Exploration-pull-requests/">in another study</a> that many PRs are closed as duplicates.</p>
<p>After coding the change, contributors (81%) expect tests to run, while they
try to honour the project guidelines (79%). Again, only a few check whether
other pull requests were opened in the meantime (37%).</p>
<p>To communicate with the project team, contributors prefer tools that are tied
to the GitHub process, namely issues and pull requests (>70%). Fewer than
50% use email, and even fewer use synchronous channels such as
IRC and Skype/Hangouts. It seems that developers value asynchrony over
immediacy, at least when working with pull requests.</p>
<h3>How do contributors assess the quality of their contributions?</h3>
<p>While our question was specifically about how contributors evaluate quality, the analysis of the results also revealed what contributors examine in their PRs.</p>
<p>One of the top priorities for contributors when examining PR quality is <em>compliance</em>, which manifests as either compliance to the pull request guidelines
or to de facto code formatting and system architecture patterns that they implicitly extract by studying the code. <em>Code</em> and <em>commit quality</em> are
other facets of compliance: contributors try to deviate from the project norms as little as possible to increase their chances of acceptance.</p>
<p>It is interesting, however, that compliance is only checked manually; functionality,
on the other hand, is verified by means of <em>automated testing</em>. Contributors expect to find a test suite (see also above) and run those tests. Very few contributors actually run automated static analysis tools, and the majority of those who do merely lint their code (e.g. using PEP8). Contributors also mentioned that they rely on explicit <em>code reviews</em> (self-done or by asking peers for help).</p>
<h3>What are the challenges of contributing with pull requests?</h3>
<p>To find the pain points that emerge when contributing using the pull-based model, we posed a mandatory open question, asking respondents to state the biggest challenge they experience when contributing PRs. We found three broad categories of challenges: code related, tool and model related, and social ones (which are the most common!).</p>
<p>WRT code, the contributor's biggest challenge is to <em>understand the code base</em>.
Contributors have trouble <em>assessing the impact</em> of a change and obtaining and
maintaining <em>awareness</em> of what happens in the project while coding their change. Apparently, Github's pull request interface is not perfect in that respect, which
is also what many contributors have been complaining about.</p>
<blockquote>
<p>
<i>R564: (my biggest challenge is to) Read others code and get understanding of the project design.</i>
</p>
</blockquote>
<p>WRT tools, contributors have trouble on two fronts: using git, especially
<em>conflict resolution</em> in the face of multiple branches, and, more importantly, creating the required setup to build a project and run its test suite (<em>infrastructure setup</em>).</p>
<p>Finally, many problems are social. The most prominent one is <em>responsiveness</em>:
more than 15% of the survey participants find it hard to get timely feedback, if any, on their pull requests, and they mostly mention people-related causes.
Respondents specify that they would rather receive a clear reject than having no response for their PRs. If nothing else, prompt feedback reassures contributors
that their effort is not in vain and helps them predict future communication patterns.</p>
<p><em>Communication</em> is also hard over GitHub's pull request mechanism (see also above),
which makes it difficult to <em>explain the rationale</em> for a change. Typical issues prevalent in online forums are also prevalent in pull requests: <em>politics</em>, <em>divergent opinions</em>, <em>bikeshedding</em> and general <em>rudeness</em> have been reported by contributors, but admittedly not too often.</p>
<blockquote>
<p>
<i>R526: (my biggest challenge is) Politics, or project owners not wanting a fix or change, or not actively maintaining it</i>
</p>
</blockquote>
<h3>Recommendations</h3>
<p>After examining all aspects
(<a href="http://www.gousios.gr/blog/Exploration-pull-requests/">projects</a>,
<a href="http://www.gousios.gr/blog/How-do-project-owners-use-pull-requests-on-Github/">integrators</a>
and, now, contributors) of pull request work, I have the following to propose to
make the pull request process more streamlined and pleasant for everyone
involved.</p>
<h4>Integrators</h4>
<ul>
<li>Provide comprehensive contribution guidelines</li>
<li>Invest in good tests and <a href="https://www.cloudbees.com">run a</a> <a href="http://travis-ci.org">CI</a></li>
<li>Automate everything: <a href="https://puppetlabs.com">development environment setup</a>, <a href="http://www.docker.com">deployment</a> and <a href="http://scrutinizer-ci.com">quality</a> evaluation</li>
<li>Monitor the project's <a href="http://ghtorrent.org/pullreq-perf/">pull request handling performance</a>. Compare against the norms, promptly close dead pull requests.</li>
<li>Be proactive: Establish a communication etiquette and actively enforce it.</li>
</ul>
<h4>Contributors</h4>
<ul>
<li>Minimize friction: make contributions small and isolated. Restrict them to one
subsystem. Adhere to the project's code and programming style. Follow the contribution
process.</li>
<li>Build your profile: Produce contributions that are accepted, engage in
project community activities (e.g. discussions)</li>
</ul>
<p><em>This blog post is a brief account of our findings. An in-depth analysis,
including a description of our analysis methods and the original survey can be
found in our <a href="/bibliography/GB15.html">technical report</a>.</em></p>
<p><em>If you liked this post, you will also like our previous work on:</em></p>
<ul>
<li><a href="http://www.gousios.gr/blog/Exploration-pull-requests/">How projects use pull requests</a></li>
<li><a href="http://www.gousios.gr/blog/How-do-project-owners-use-pull-requests-on-Github/">How do project owners use pull requests</a></li>
</ul>
How to run a large scale survey2015-04-02T00:00:00+02:00http://www.gousios.gr/blog/Scaling-qualitative-research<p>If you know me well, this blog post might seem strange. I have always been a proponent of quantitative methods and big data.
Despite this, in April 2014, I ran a survey that was filled in by 1,500 people.
One part of the survey analysis will be presented at <a href="http://www.gousios.gr/bibliography/GZSD15.html">ICSE 2015</a> this year,
while we submitted the second part to <a href="http://gousios.gr/bibliography/GB15.html">FSE 2015</a> (still twiddling our thumbs about the results). In the wake of the ICSE 2015 publication, many colleagues asked me how I managed to get so many responses.</p>
<p>Here is how I did it.</p>
<ul>
<li><p><strong>Target an audience:</strong> The broader the audience, the more general the survey
will be, and therefore the shallower the insights you will get. With qualitative research, it usually pays off to go deep rather than broad, so it is preferable
to target an audience. What worked for me was preselecting my audience:
I queried the <a href="http://ghtorrent.org">GHTorrent</a> database for the types of projects that I wanted to examine. For many problems in software engineering
at least, GitHub is a very good source of massive numbers of potential respondents, provided one targets the appropriate ones.</p></li>
<li><p><strong>Make it lean:</strong> No one has time to spend on someone else's problem. So when
asking for help (this is what a survey is about), make sure that the helping
party's time is time well spent. Limit the number of questions to the absolute
minimum. Put all questions in one page. Make non essential questions optional.
Limit free-text questions. Configure the size of free-text question boxes to
guide the expected input size (2-3 lines should be enough). In the invitation
email, give a realistic estimate of how much time you expect the survey to
take.</p></li>
<li><p><strong>Personalize the invitation:</strong> No matter how honest the sender's intentions
are, it is usually not nice to receive generic email. What I did was to include
details such as the project's name, the correspondent's name and links
to my past work, just to assure people that I am not a robot.</p></li>
<li><p><strong>Give something back:</strong> Even if you only ask 10 mins of someone's time, it is
nice to offer something in return. Since my survey was about pull requests, I
created a <a href="http://ghtorrent.org/pullreq-perf/">customized report</a> of how pull
requests work for the Github repository. I believe this made a huge difference
in the completion rate of my survey: while the pilot was filled in by 8% of the
people I sent emails to, the actual survey completion rate was > 20%!</p></li>
<li><p><strong>Be responsive:</strong> After I sent the initial invitations (all 3,500 of them :-)), my mailbox was flooded with emails. This is something I did not expect,
but many people asked for clarifications, while others congratulated me or
complained about the spamming. In any case, you need to respond as fast as possible, otherwise you will lose your "clients". In my case, I answered 160 emails the first day; I estimate that around 100 of those replies led to survey answers.</p></li>
<li><p><strong>Automate processing of results</strong> Getting lots of responses is one thing;
processing them is another. In surveys, we typically have three types of questions: i) Multiple (or single) choice, ii) Likert scale and iii) Open ended.
Multiple choice and Likert scale questions are easier to process en masse.
Open ended questions need to be <a href="http://en.wikipedia.org/wiki/Coding_(social_sciences)">coded</a> first; then they can also be processed statistically.
Processing usually involves filtering, correlating and plotting. For all
those purposes a scripting language, such as R or Python, with support for datasets and plotting can be very handy; Excel or other GUI packages
are definitely not handy. Imagine that the last day before a deadline you
find a small error that affects all your plots and parts of the statistics.
With a scripting language, you just rerun the scripts. With a GUI tool, you
need to go through lots of painful point and clicking. You can find R code to process a CSV file with survey responses in the
GitHub repository: <a href="https://github.com/gousiosg/pullreqs-integrators">gousiosg/pullreqs-integrators</a></p></li>
<li><p><strong>Offer an option to be notified about the results:</strong> Finally, you should notify your respondents (the ones that want to be notified) about your results, when
they are available. While it is nice to send the paper itself, it is nicer for
busy people to read a blog post with the major findings. The benefit is mutual: developers learn about exciting results while our research gets spread. My <a href="http://www.gousios.gr/blog/How-do-project-owners-use-pull-requests-on-Github/">blog post about the results</a> of the ICSE 2015 paper is by far the most read one on my blog (around 8k views now). I really doubt that that many people read my paper.</p></li>
</ul>
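<p>To illustrate the scripted processing advocated above, here is a minimal Python sketch that tallies Likert-scale answers from a survey export; the column name and sample rows are hypothetical, and the same idea carries over directly to R:</p>

```python
import csv
from collections import Counter

# Likert scale as text labels, as most survey tools export it.
LIKERT = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]

def load_responses(path):
    """Load a survey CSV: one row per respondent, one column per question."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def likert_counts(rows, question):
    """Tally Likert answers for one question, keeping the scale order."""
    counts = Counter(row[question] for row in rows if row.get(question))
    return {level: counts.get(level, 0) for level in LIKERT}

# In-memory example; in practice: rows = load_responses("responses.csv")
rows = [
    {"tests_first": "Agree"},
    {"tests_first": "Agree"},
    {"tests_first": "Neutral"},
    {"tests_first": ""},  # skipped: optional question left blank
]
tally = likert_counts(rows, "tests_first")
total = sum(tally.values())
for level, n in tally.items():
    print(f"{level:>17}: {n} ({100 * n / total:.0f}%)")
```

<p>Rerunning the whole analysis after a last-minute fix is then a single script invocation, which is exactly the advantage over point-and-click tools noted above.</p>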
<p>Designing and running the survey was a great learning experience for me. First,
I understood the power of qualitative research: using the appropriate questions,
one can get tons of insight in very little time. Second, big data has its
limitations: while it is relatively straightforward to quantitatively analyze
mechanised processes (which have finite states), it is much more difficult to
get deep insights when people are involved, as people can act unpredictably and
take initiative. Third, I realised that, despite my fears (which were certainly
rooted in my ignorance), it is actually possible (and sometimes preferable) to
turn qualitative research into quantitative research.</p>
<p>Given the opportunity, I would like to thank the 1,500 developers that replied to
my survey. The combined effort, assuming that it took 10 minutes to fill the
survey in, amounts to 30 working days worth of time given out for free. If I
were to do the survey the traditional way, i.e. sending actual copies to
companies, going door to door to ask people to fill it in etc, it would have
taken me more than a year.</p>
How do project owners use pull requests on Github?2014-10-03T00:00:00+02:00http://www.gousios.gr/blog/How-do-project-owners-use-pull-requests-on-Github<p>Pull-based development as a distributed development model is a distinct way of
collaborating in software development. In this model, the project’s main
repository is not shared among potential contributors; instead, contributors
fork (clone) the repository and make their changes independent of each other. In
the pull-based model, the role of the integrator is crucial. The integrator must
act as a guardian for the project’s quality while at the same time keeping
several (often, more than ten) contributions "in-flight" by communicating
modification requirements to the original contributors. Being a part of a
development team, the integrator must facilitate consensus-reaching discussions
and timely evaluation of the contributions. In Open Source Software (OSS)
projects, the integrator is additionally taxed with enforcing an online
discussion etiquette and ensuring the project’s longevity by on-boarding new
contributors.</p>
<p>In April 2014, we ran a survey among GitHub users, trying to understand how
integrators (we also surveyed contributors; still analyzing those results) use pull requests
and to discover what challenges they may face. We got <em>750</em> responses.</p>
<p>We found three large themes in how integrators use pull requests: <strong>code
reviews</strong>, <strong>soliciting contributions from the community</strong> and <strong>discussing new
features</strong>. We also found that integrators are struggling with <strong>maintaining
quality</strong> and <strong>prioritizing work</strong>, while <strong>social challenges</strong> do not allow
them to be efficient.</p>
<h3>How do integrators use pull requests in their projects?</h3>
<h4>Overall use</h4>
<p>Overwhelmingly, 80% of the integrators use the pull-based development model for
doing code reviews and 80% resolve issues. Half of the integrators use pull
requests to discuss new features. 60% of the integrators use pull requests to
solicit contributions from the community, which seems low given the open nature
of the GitHub platform. We examined this response quantitatively, using the
GHTorrent database: indeed, for 39% of the projects that responded, no
pull request originated from the project community.</p>
<h4>Types of contributions</h4>
<p>When discussing software maintenance, we distinguish perfective (implementing
new features), corrective (fixing issues) and adaptive-preventive (refactoring)
maintenance. We asked integrators how often they receive contributions for these
types of maintenance activities. 73% of the projects receive bug fixes
as pull requests once a week, half of them receive new features once a week,
while only 33% of them receive a refactoring more often than once a week.</p>
<h4>Code reviews</h4>
<p>75% of the projects indicate that they do explicit code reviews on all
contributions on GitHub. Interestingly, 7% of the integrators indicated that
they are using other tools for code reviewing.</p>
<p>50% of the integrators report that the project’s community (people with no
direct commit access to the repository) actively participates in code reviews;
this number is impressive and indicates very vibrant communities caring about
projects.</p>
<p>Projects have established processes for doing code reviews. One of them is
delegation; 42% of the integrators delegate a code review if they are not
familiar with the code under review. Another process is implicit sign-off: at
least 20 integrators reported that multiple developers are required to review a
pull request to ensure high quality.</p>
<h4>Integrating changes</h4>
<p>In 79% of the cases, integrators use the GitHub web interface “often or always”
to do a merge. Only in 8% and 1% of the cases do integrators resort to cherry
picking or textual patches, respectively, to do the merge. In the remaining cases, integrators
rebase to maintain a clean project history instead of simply merging.</p>
<p><strong>summary:</strong> <em>Integrators successfully use the pull-based model to accommodate
code reviews, discuss new features and solicit external contributions. 75% of
the integrators conduct explicit code reviews on all contributions. Integrators
prefer commit metadata preserving merges.</em></p>
<h3>How do integrators decide whether to accept a contribution?</h3>
<h4>Decision to accept</h4>
<p><img src="/files/pr-int-acceptance.png" style="width: 50%;float: center;" class="img-polaroid"></p>
<p>The most important factor leading to acceptance of a contribution is its
quality. Quality has many manifestations in our response set; integrators
examine the <em>source code quality</em> and <em>code style</em> of incoming code, along with
its documentation and granularity. At a higher level, they also examine the
quality of the commit set and whether it adheres to the project <em>conventions</em> for
submitting pull requests.</p>
<p>A second signal that the integrators examine is <em>project fit</em>: does the pull
request fit the project roadmap? A variation is <em>technical fit</em>: does the code fit
the technical design of the project?</p>
<p>It is interesting to note that the <em>track record</em> of the contributors is ranked
low in the integrator check list. This is in line with our earlier analysis of
pull requests, in which we did not see a difference in treatment of pull
requests from the core team or from the project’s community.</p>
<p>Finally, technical factors such as whether the contribution is in a mergeable
state, its impact on the source code or its correctness are not very important
for the eventual decision to merge to the majority of respondents. In such
cases, integrators can simply postpone decisions until fixes are being
provided by the contributors.</p>
<h4>Time to make the acceptance decision</h4>
<p><img src="/files/pr-int-time.png" style="width: 50%;float: center;" class="img-polaroid"></p>
<p>The factors that strongly affect the time to make a decision are mostly social
and, as expected, have timing characteristics as well. The most important one,
affecting 14% of the projects, is <em>reviewer availability</em>. The problem is more
pronounced in projects with small integrator teams (45%) and no full time paid
developers. Another social factor is <em>contributor responsiveness</em>; if the pull
request contributor does not come back to requests for action fast, the
evaluation process stalls. <em>Long discussions</em> also negatively affect the time
to decide, but they are required for reaching consensus among core team members,
especially in case of controversial contributions. For changes that have not
been communicated before, discussions are also mandatory.</p>
<p>Technical factors, such as the complexity of the change, <em>code quality</em>, <em>code
style</em> and <em>mergeability</em> of the code also negatively affect the time to decide on
a pull request. The reason is that the code inspection reveals issues that need
to be addressed by the contributors.</p>
<p><strong>summary:</strong> <em>Integrators decide to accept a contribution based on its quality
and its degree of fit to the project’s roadmap and technical design.</em></p>
<h3>How do the integrators evaluate the quality of contributions?</h3>
<p><img src="/files/pr-int-quality.png" style="width: 50%;float: center;" class="img-polaroid"></p>
<h4>Perception</h4>
<p>One of the top priorities for integrators when evaluating pull request quality
is <em>conformance</em>. Conformance can mean conformance to project style or to API
usage throughout the project. Many integrators also examine conformance against
the <em>programming language’s style idioms</em>.</p>
<p>Integrators often relate contribution quality to the quality of the source code
it contains. To evaluate source code quality, they mostly examine non-functional
characteristics of the changes. Source code that is <em>understandable</em> and
<em>elegant</em>, has good <em>documentation</em> and provides clear <em>added value</em> to the
project with minimal impact is preferred.</p>
<p>The quality (or even the existence) of <em>documentation</em> signifies an increased
attention to detail by the submitter. The integrators also examine the <em>commit
organization</em> in the pull request along with its <em>size</em>. In the latter case, the
integrators value small pull requests as it is easier to <em>assess their impact</em>.</p>
<p>Testing plays an important role in evaluating submissions. The very
<em>existence of tests</em> in the pull request is perceived as a positive signal.</p>
<p>Finally, integrators use social signals to build trust for the examined
contribution. The most important one is the <em>contributor’s reputation</em>. The
integrators build a mental profile for the contributor by evaluating their track
record within the project or by searching information about the contributor’s
work in other projects. Some integrators also use <em>interpersonal relationships</em>
to make judgements for the contributor and, by proxy, for their work.</p>
<h4>Tools</h4>
<p>The vast majority (75%) of projects use <em>continuous integration</em>, either in hosted
services or in standalone setups. On the other hand, few projects use more
dedicated software quality tools such as metric calculators (15%) or coverage
reports (18%). It is interesting to note that practically all (98%) projects
that use more advanced quality tools, run them through continuous integration.</p>
<p><strong>summary:</strong> <em>Top priorities for integrators when evaluating contribution
quality include conformance to project style and architecture, source code
quality and test coverage. Integrators use few quality evaluation tools other
than continuous integration.</em></p>
<h3>How do the integrators prioritize the application of contributions?</h3>
<p><img src="/files/pr-int-prioritization.png" style="width: 50%;float: center;" class="img-polaroid"></p>
<p>Integrators prioritize contributions by examining their <em>criticality</em> (in case
of bug fixes), their <em>urgency</em> (in case of new features) and their <em>size</em> or
<em>complexity</em>. Bug fixes are commonly given higher priority. When prioritizing
contributions, integrators apply multiple criteria in a specific sequence. The
figure above depicts the frequencies of prioritization criteria usage for all
reported application sequences. What we can see is that criticality, urgency and
change size contribute to most prioritization criteria application sequences,
while most integrators report that they apply at most two prioritization
criteria.</p>
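<p>Applying criteria in sequence amounts to a lexicographic sort of the pull request queue. As a sketch (the <code>id,criticality,size</code> triples below are made up for illustration; criticality 1 is highest):</p>

```bash
# Hypothetical pull request queue: id,criticality,size
queue='42,2,10
17,1,300
99,1,25'

# Apply the criteria in sequence: criticality first, size as a tie-breaker.
printf '%s\n' "$queue" | sort -t, -k2,2n -k3,3n
```

<p>Pull request 99 comes out on top: it ties with 17 on criticality, and the smaller change wins the tie-break.</p>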
<h3>What key challenges do integrators face when working with pull requests?</h3>
<p><img src="/files/pr-int-key-challenges.png" style="width: 50%;float: center;" class="img-polaroid"></p>
<h4>Technical challenges</h4>
<p>At the project level, maintaining quality is what most integrators perceive as a
serious challenge. As incoming code contributions mostly originate from
non-trusted sources, adequate reviewing by integrators familiar
with the affected project area may be required. Reviewer availability is not guaranteed,
especially in projects with no funded developers. Often, integrators have to
deal with solutions tuned to a particular contributor requirement or an edge
case; asking the contributor to generalize them to fit the project goals is not
straightforward. A related issue is feature isolation; contributors submit pull
requests that contain multiple features and affect multiple areas of the
project.</p>
<p>Several issues are aggravated the bigger or more popular a project is.
Integrators of popular projects mentioned that the volume of incoming
contributions is just too big; consequently, they see triaging and work
prioritization as challenges. Additionally, as pull requests are kept on the
project queue, they age: the project moves ahead in terms of functionality or
architecture and then it is difficult to merge them without (real or logical)
conflicts. Moreover, it is not straightforward to assess the impact of stale
pull requests on the current state of the project or on each other.</p>
<p>Integrators note that aspiring contributors often ignore the project processes
for submitting pull requests, leading to unnecessary communication rounds. When
less experienced developers or regular users attempt to submit a pull request,
they often lack basic git skills.</p>
<p>Lack of responsiveness on behalf of the contributor hurts the code review
process and, by extension, project flow. This is especially pronounced in the case
of hit-and-run pull requests, as they place additional reviewing and
implementation burden on the integrator team. Integrators mention that the lack
of centralized co-ordination with respect to project goals can lead to chaos.</p>
<h4>Social challenges</h4>
<p>On a more personal level, integrators find it difficult to handle the workload
imposed by the open submission process afforded by the pull-based development
model. For many of our respondents, managing contributions is not their main
job; consequently, finding free time to devote to handling a pull request and
context switching between various tasks puts a burden on integrators.</p>
<p>Integrators often have to make decisions that affect the social dynamics of the
project. Integrators reported that explaining the reasons for rejection is one
of the most challenging parts of their job as hurting the contributor’s feelings
is something they seek to avoid. Similarly, integrators find that asking for
more work from the contributors (e.g. as a result of a code review) can be
difficult at times.</p>
<p>Reaching consensus through the pull request comment mechanism can be
challenging. Integrators often find themselves involved in a balancing act of
trying to maintain their own vision of the project's future and
incorporating (or rejecting) contributions that are tuned to the contributor's
needs.</p>
<p><strong>summary:</strong> <em>Integrators are struggling to maintain quality and mention feature
isolation and total volume as key technical challenges. Social challenges
include motivating contributors to keep working on the project, reaching
consensus through the pull request mechanism and explaining reasons for
rejection without discouraging contributors.</em></p>
<h3>So what?</h3>
<p>This study is one of the first to investigate how <em>people</em> use pull requests and
definitely the first to do it on this scale. It answered our questions, but it
generated more; we present some of them below. Researchers and aspiring
entrepreneurs are welcome to answer them by doing further research and/or by
developing new tools.</p>
<ul>
<li>How can we facilitate the quality evaluation process involved in pull requests?</li>
<li>How can we help integrators in large projects cope with the load of incoming pull requests?</li>
<li>How can we raise the awareness of newcomers wrt what is happening in the project? How can we avoid duplicate work?</li>
<li>What can projects or the Github platform do to streamline their contribution process?</li>
</ul>
<p>Some advice:</p>
<ul>
<li><p><strong>Integrators</strong>, invest in tools: streamline your contribution process to
make it testable and verifiable, find creative ways to integrate quality
analysis tools in your CI process and demand tests along with contributions
(your test suite is good, right?). Don't let pull requests linger; if you
don't want a contribution, it is better (for you, mostly) to be frank with
the contributor.</p></li>
<li><p><strong>Contributors</strong>, make sure you do your homework: see if there are
similar issues/pull requests open before you submit, try to comply with
the project's guidelines and learn your tools. After submitting, be
responsive and nice; remember that you are working together with a
work-overloaded colleague, not a stranger on the other side of the world.</p></li>
</ul>
<p>Last but not least, I would like to thank the participants for their time: this
research would not have been possible had 750 people not each donated 15
minutes (187 hours in total!).</p>
<p>What do you think of the results? Do they reflect your personal experience? What
tools would you expect us researchers to provide you with to cope with pull
requests? Chime in below!</p>
<p><em>This blog post is a brief account of our findings. An in-depth analysis,
including a description of our analysis methods and the original survey, can be
found in our
<a href="http://swerl.tudelft.nl/twiki/pub/Main/TechnicalReports/TUD-SERG-2014-013.pdf">technical report</a>. This is joint work with
<a href="http://www.st.ewi.tudelft.nl/~zaidman/">Andy Zaidman</a>,
<a href="http://margaretannestorey.wordpress.com/">Margaret-Anne Storey</a> and
<a href="http://avandeursen.com/">Arie van Deursen</a>.</em></p>
<p><em>If you liked this post, you will also like our previous work on <a href="http://www.gousios.gr/blog/Exploration-pull-requests/">how projects
use pull requests</a>.</em></p>
<p><em>Update 20/12/2014: The paper has been accepted at ICSE 2015!</em></p>
The computer scientist's guide to speech development2014-07-07T00:00:00+02:00http://www.gousios.gr/blog/Computer-scientist-speech-development<p>During the last 20 months, I've been having fun with my daughter's (from now on: little λ) efforts to learn to speak. Up to now, the whole process can be split into 4 phases.</p>
<h3>The random noise phase</h3>
<p>This starts at around 4 months. The baby mumbles random noises initially (aaa, usually) and, as the brain develops, more focused two-letter syllables (ma-ma, pa-pa etc.). Nothing interesting here, apart from the fact that the baby can combine various stimuli (noise, vision etc.) with oral expressions (say ma-ma when she hears mummy whispering at night), which computers are not very capable of yet.</p>
<h3>The grep phase</h3>
<p>This starts at around 10 months and goes up to 13-14 months. The toddler ignores the context and can only respond to certain
sounds. Desperate to communicate, he/she tries to participate in the discussion with an adult by linking patterns to specific responses. The naïveté of the approach leads to endless fun:</p>
<ul>
<li>Me: <em>λ, who's your daddy?</em></li>
<li>λ: <em>Zzzio-Zzzio!</em> (Giorgos in baby speech)</li>
<li>Me: <em>Bravo! And who's daddy's daddy?</em></li>
<li>λ: <em>Zzzio-Zzzio!</em></li>
<li>Me: <em>This little lamp's daddy is this big ram with the curly horns.</em></li>
<li>λ: <em>Zzzio-Zzzio!</em></li>
</ul>
<p>I think you can simulate this behavior with the following simple script. The <code>mappings.txt</code> file contains a word and its baby speech equivalent. The baby updates it with new words and sounds every day.</p>
<p><figure class="highlight"><pre><code class="language-bash" data-lang="bash">cat /dev/microphone |
stt | # your speech recognition package of choice
tr ' ' '\n' |
while read word; do
  found=$(grep -w "$word" mappings.txt)
  if [ -n "$found" ]; then
    echo "$found" |
    cut -f2 -d' ' |
    tts > /dev/dsp # your text-to-speech package of choice
  fi
done</code></pre></figure></p>
<h3>The Prolog phase</h3>
<p>The progression to the next stage marks the end of baby-hood and the beginning of the toddler phase. The toddler starts to build a knowledge base by assigning facts to items or persons ('daddy', 'coffee') or actions ('drink') and, later on, learns rules on how to infer facts from them. The rules are initially very simple (daddy and λ are people, people drink coffee -> λ drink coffee) but are soon reinforced with more complex ones (λ is toddler, toddlers not drink coffee -> λ not drink coffee).</p>
<p>If memory serves me right, this is very similar to how Prolog knowledge bases are built. And just as with Prolog systems that have missing rules, the inference engine in a toddler's brain can make funny inferences, e.g.</p>
<ul>
<li>Me: <em>λ, cats meow like: meow meow.</em></li>
<li>λ: <em>Nia-nio!</em> (λ's rendering of meow)</li>
<li>Me: <em>Nice! So tigers are like cats. How do they sound?</em></li>
<li>λ: <em>Nia-nio!</em></li>
<li>Me: <em>Aha! Lions roar though like tigers: ROOOAR!</em></li>
<li>λ: <em>Nia-nio!</em></li>
</ul>
<p>The critical difference between this stage and the next one is that the toddler learns rules from the people around him/her, but cannot construct his/her own rules yet.</p>
<h3>The deep learning phase</h3>
<p>I believe this stage starts when the toddler can make 2-word sentences, i.e. at around 18 months. The brain is rapidly developing and the inference rule engine gives way to more complex processes for inferring generic rules from patterns.
This is where computers have already lost the race to the brain. The toddler has absorbed enough information and experience to do 2 tasks that are extremely difficult for computers:</p>
<ul>
<li>Fuzzy matching (cats in drawings, pictures and real life all meow).</li>
<li>Construct rules from facts and behavioral patterns and generalize them along the way (daddy drinks coffee in the morning, and since grandpa visited and we love him, he should drink coffee in the morning as well).</li>
</ul>
<p>The more knowledge the toddler absorbs, the more complex networks of rules it can create, which is akin to <a href="http://en.wikipedia.org/wiki/Deep_learning">deep machine learning</a>, only way more efficient.</p>
<p><em>Disclaimer: Totally unscientific stuff. If you are a linguist, speech therapist, deep learner etc. please be gentle in your comments :-)</em></p>
What's new in GHTorrent land?2014-05-29T00:00:00+02:00http://www.gousios.gr/blog/New-in-GHTorrent-land<p>A lot of people (around 30 at last count) have been using GHTorrent lately as an easy-to-use source for accessing the wealth of data that Github has. Portions of the dataset appear in the <a href="http://ghtorrent.org/msr14.html">MSR14</a> and <a href="http://ghtorrent.org/vissoft14.html">VISSOFT14</a> data challenges, while at least 15 papers at this year's MSR and ICSE conferences are based on it.</p>
<p>In this blog post, I summarize the long list of changes that happened in the GHTorrent land since Sep 2013.</p>
<h4>Introducing Lean GHTorrent</h4>
<p>Obtaining and restoring the full GHTorrent dataset is serious business: one has to download and restore more than 3TB of MongoDB data and 30GB of MySQL data. The time to do this may be prohibitive if just a selection of repositories is enough for the task at hand. For this reason, together with Bogdan Vasilescu and Alex Serebrenik of TU Eindhoven fame and our own Andy Zaidman, I
<a href="/bibliography/GVSZ14.html">implemented</a> the Lean GHTorrent service.</p>
<p><a href="http://ghtorrent.org/lean.html">Lean GHTorrent</a> allows researchers to request a specific slice of the full GHTorrent dataset, on a per repository basis. All a researcher has to do is compile a list of repository names (i.e. using the <a href="http://ghtorrent.org/dblite/">MySQL query interface</a> to filter projects with specific characteristics) and feed them to Lean GHTorrent. Then, magic happens: Lean GHTorrent will reply with an email providing a link where the submitter can view the job status, and another email with a link from which to download the data! In the meantime, Lean GHTorrent will get a fresh copy of the data for the specific repos from Github using our existing data downloading infrastructure.</p>
<p>We describe the Lean GHTorrent design in our <a href="/bibliography/GVSZ14.html">MSR '14 data track paper</a>. You can find Lean GHTorrent at <a href="http://ghtorrent.org/lean.html">http://ghtorrent.org/lean.html</a>.</p>
<h4>MySQL schema modifications</h4>
<ul>
<li>The <code>forks</code> table has been removed. Since early January 2013, the forks information has been stored in the <code>forked_from</code> field of the <code>projects</code> table.</li>
<li>The <code>merged</code> and <code>user_id</code> fields have been removed from the pull requests table, as they represent information already stored in the <code>pull_request_history</code> table.</li>
<li>Deleted projects are now marked as such on a monthly basis. We run a script that continuously queries Github for the status of each repository in a loop.</li>
</ul>
<p>A step backwards:</p>
<ul>
<li>Follow events are no longer reported by Github in public timelines. Therefore, the corresponding table can only be updated when the <code>ght-retrieve-user</code> script has been run for a user. The last time we ran this for all users in the GHTorrent database was in late March 2014. To compensate, we run a script that fetches the followers for all users in a loop. This means that
updates to followers might take a while, and the follow timestamp is no longer
accurate.</li>
</ul>
<h4>Fixes to the data retrieval code</h4>
<ul>
<li><p>The code now processes a repository in full upon addition. No more half-retrieved repositories, even though this only applies to new repositories.</p></li>
<li><p>The code that retrieves multi-page items (e.g. collections of commits, pull
requests etc) now works on a per page basis. Before, every time a new
entity could not be found in the retriever database, GHTorrent would retrieve
the <em>full</em> list of entities for this type (e.g. all pull requests if a
pull request was missing). While this seems grossly inefficient, initially it worked surprisingly well, as the vast majority of projects only have a few commits or pull requests. However, the time had come to make things more efficient and so we did.</p></li>
<li><p>Same as above, the commits for a new repository are now processed in a
loop, leading to significantly decreased memory pressure.</p></li>
</ul>
<p>The above changes mean that we can now run the GHTorrent retrieval process more efficiently at a bigger scale. The increased efficiency of the retrieval code however (<a href="/bibliography/GVSZ14.html">once again</a>) led to increased pressure on the MySQL database, so not all benefits have been reaped yet.</p>
<h4>Querying the GHTorrent MongoDB database</h4>
<p>Since Sep 2013, researchers have been able to query both
<a href="http://ghtorrent.org/raw.html">MongoDB</a> and <a href="http://ghtorrent.org/dblite/">MySQL</a> online on our servers. We have upgraded this functionality significantly: the MongoDB server is now a delayed replica of the live MongoDB database, and it runs on a much more powerful machine. Users should be able to run bigger queries faster, on an almost-live version of the system.</p>
<h4>Querying the MySQL database</h4>
<ul>
<li><p>The MySQL server for the <a href="http://ghtorrent.org/dblite/">DBLite</a> interface was moved to much faster, server grade hardware. Simple queries should fly now.</p></li>
<li><p>Queries running in excess of 10 minutes will be killed. This is to protect the service from abuse and to make sure that other users get a fair share of CPU time. Think carefully before querying :-)</p></li>
</ul>
<h2>Calling out for help</h2>
<p>GHTorrent has been providing the research community with data for more than two
years using our own resources. Since day one, it was intended to be a community
effort, even though the community's help was not requested immediately.
Unfortunately, the rapid growth of Github has outgrown both our resources
and my personal time. For this reason, I would like to invite the
GHTorrent community to help. Here is how you can do it:</p>
<h4>Helping with data collection</h4>
<p>This is the easiest. Recently, GHTorrent gained the ability to use Github OAuth keys, in addition to user name/password pairs, for authenticating requests. This means that you can create an OAuth personal token and send it over to me to use for querying the API. To create a personal token, go to the following URL:</p>
<p><a href="https://github.com/settings/tokens/new">https://github.com/settings/tokens/new</a></p>
<p>deselect ALL checkboxes EXCEPT <code>public_repo</code>, add a token name and click on "Generate Token". Then copy the token and <a href="mailto:gousiosg@gmail.com">send it to me</a>.</p>
<p>Every token adds an extra 5k reqs/hour capacity to GHTorrent, which will allow us to retrieve more data for longer periods of time for more projects.</p>
<h4>Dealing with data inconsistencies</h4>
<p>Data inconsistencies do exist in GHTorrent. It would be nice if the community would work together to resolve them. A few low-hanging fruits are the following:</p>
<ul>
<li>Create a consistency checker: Write a script that will use GHTorrent to get all data for a given repo, then compare those with Github or the project's <code>git</code> repo. This will give us indications of where things are going wrong.</li>
<li>Consistency checks for pull request/issue lifecycle: model the lifecycle as a
state machine, make sure that only the supported transitions are in the
database.</li>
<li>Lots of people have been complaining about the lack of commit messages for commits, issues, pull requests and code reviews. I think it is worth investigating
the possibility of writing a script that will pull all such text from MongoDB and update MySQL with it.</li>
</ul>
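<p>The lifecycle consistency check could start from something as simple as the following shell function. The transition set below is a hypothetical simplification of the actual pull request lifecycle, which also includes synchronization and review events:</p>

```bash
# Hypothetical simplification of the pull request lifecycle:
# opened -> merged | closed, closed -> reopened, reopened -> merged | closed
check_transition() {
  case "$1 $2" in
    "opened merged"|"opened closed"|"closed reopened"|"reopened merged"|"reopened closed")
      echo valid ;;
    *)
      echo invalid ;;
  esac
}

check_transition opened merged    # valid
check_transition merged reopened  # invalid
```

<p>Running every consecutive pair of recorded states for a pull request through such a check flags records that cannot have come from a valid history.</p>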
<h4>Implementing services by analyzing the data</h4>
<p>Are you a student and want to play around with real big data? Consider working
on one of the following topics using the GHTorrent datasets and infrastructure.</p>
<ul>
<li>Create a dynamic version of the <a href="http://ghtorrent.org/pullreq-perf/">pull request performance reports</a>.</li>
<li>Create a project dashboard with rich analytics about the repository's life.</li>
<li>Create a more feature rich version of the <a href="http://osrc.dfm.io/">Open Source Report Card</a>.</li>
<li>Create a process to move the MongoDB database contents to Hadoop.</li>
</ul>
The triumph of online collaboration2014-03-27T00:00:00+01:00http://www.gousios.gr/blog/The-triumph-of-online-collaboration<p>For a research paper I am working on, we wanted to analyze the top 30 "most collaborative" projects on Github. Defining a quantitative metric of collaboration and sorting projects according to it is not an easy task, as collaboration is in many cases implicit and not recorded, while not all actions of collaboration are equal. As a proxy, we chose to measure the number of people that perform changes that mutate the state of a repository. On Github, we could identify the following:</p>
<ul>
<li>A: Create a commit to a repository</li>
<li>B: Perform a code review on an individual commit</li>
<li>C: Create/Update/Merge/Close a pull request</li>
<li>D: Perform a code review on a pull request</li>
<li>E: Comment on a pull request</li>
<li>F: Create/Close an issue</li>
<li>G: Comment on an issue</li>
</ul>
<p>Using <a href="http://ghtorrent.org">GHTorrent</a> as a data source, I wrote a script to measure the individual persons that performed the actions above for all non-forked repositories and then sorted the repos according to the total number of individual contributors. The results can be seen in the table below:</p>
<table class="table table-striped">
<thead>
<tr><td><b>repo</b></td><td><b>A</b></td><td><b>B</b></td><td><b>C</b></td><td><b>D</b></td><td><b>E</b></td><td><b>F</b></td><td><b>G</b></td><td><b>all</b></td></tr>
</thead>
<tbody>
<tr><td>isaacs/npm</td><td>100</td><td>21</td><td>167</td><td>23</td><td>247</td><td>2568</td><td>3302</td><td>6147</td></tr>
<tr><td>torvalds/linux</td><td>5968</td><td>14</td><td>67</td><td>3</td><td>161</td><td>0</td><td>0</td><td>6212</td></tr>
<tr><td>symfony/symfony</td><td>1021</td><td>52</td><td>1261</td><td>395</td><td>1305</td><td>1844</td><td>2160</td><td>6215</td></tr>
<tr><td>jquery/jquery-mobile</td><td>212</td><td>13</td><td>431</td><td>21</td><td>350</td><td>2888</td><td>3008</td><td>6391</td></tr>
<tr><td>joyent/node</td><td>657</td><td>52</td><td>833</td><td>132</td><td>943</td><td>2304</td><td>2805</td><td>6653</td></tr>
<tr><td>CocoaPods/Specs</td><td>2658</td><td>90</td><td>2584</td><td>39</td><td>1235</td><td>515</td><td>268</td><td>6674</td></tr>
<tr><td>gitlabhq/gitlabhq</td><td>605</td><td>89</td><td>871</td><td>138</td><td>915</td><td>2251</td><td>3608</td><td>7344</td></tr>
<tr><td>angular/angular.js</td><td>875</td><td>92</td><td>1306</td><td>139</td><td>1520</td><td>1540</td><td>3778</td><td>7919</td></tr>
<tr><td>rails/rails</td><td>2699</td><td>309</td><td>2315</td><td>607</td><td>3174</td><td>4746</td><td>4890</td><td>15339</td></tr>
<tr><td>mxcl/homebrew</td><td>3426</td><td>76</td><td>3125</td><td>528</td><td>3888</td><td>5157</td><td>7301</td><td>20510</td></tr>
</tbody>
</table>
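<p>Note that the <b>all</b> column is not the sum of the per-action columns: a person who, say, both commits and comments is counted once. With a hypothetical <code>action,login</code> event list, the deduplication is a one-liner:</p>

```bash
# Hypothetical action,login pairs for one repository, one line per
# state-mutating action.
events='commit,alice
comment,alice
issue,bob
review,carol'

# Distinct contributors overall: a set union, not a sum.
total=$(( $(printf '%s\n' "$events" | cut -d, -f2 | sort -u | wc -l) ))
echo "$total"  # 3 distinct people behind 4 events
```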
<p>The numbers are staggering. A project (<a href="http://brew.sh/">Homebrew</a>) that is just 5 years old has attracted 20.5k --- <span class="label label-success">20,500</span>, the size of a small city! --- people to contribute to it. Ruby on Rails has been collaboratively developed by a community of 15k people and still works! To compare these numbers with other software engineering projects is futile: most projects, even ones with a very long lifeline are very small in comparison. Perhaps a more fair comparison is with other online collaborative initiatives: The <a href="http://en.wikipedia.org/wiki/Wikipedia:Wikipedians">English Wikipedia</a> is being maintained by 130,800 people, while the effort of decoding the human genome has been carried out by <a href="http://www.genome.gov/DNADay/q.cfm?aid=402&year=2007">thousands of people</a>.</p>
<p>If nothing else, the above is an example of the power of the commons and certainly of the usefulness of Github as a collaboration platform.</p>
How projects use pull requests on Github2014-01-27T00:00:00+01:00http://www.gousios.gr/blog/Exploration-pull-requests<p>Pull requests form a new method for collaboration on distributed software
development. The novelty lies in the decoupling of the development effort from
the decision to incorporate the results of the development in the code base.
Several code hosting sites, including <a href="http://github.com">Github</a> and
<a href="https://bitbucket.org/">BitBucket</a>, tapped on the opportunity to make the
pull-based development model more accessible to programmers. A unique
characteristic of such sites is that they allow any user to fork any public
repository. The fork creates a public project that belongs to the user who
forked it, so the user can modify the repository without being part of the
development team. More importantly, such sites automate the selective
contribution of commits from the fork back to the source through pull requests.</p>
<p>Pull requests are not unique to code hosting sites; in fact, the Git version
control system includes the <code>git-request-pull</code> utility, which provides the same
functionality at the command line. Github improved (<a href="https://github.com/torvalds/linux/pull/17">not everyone
agrees</a>) this process significantly
by integrating code reviews, discussions and issues, thus effectively lowering
the entry barrier for casual contributions. Combined, cloning and pull requests
create a new development model, where changes are pushed to the project
maintainers and go through code review by the community before being integrated.</p>
<p>A year ago, together with <a href="http://serg.aau.at/bin/view/MartinPinzger">Martin
Pinzger</a> and <a href="http://avandeursen.com/">Arie van
Deursen</a>, we set forth to explore how this new and
exciting way of building software in a distributed manner actually works under
the wraps. How widespread is the use of pull requests? What are the
characteristics of the pull request lifecycle? What factors play a role in the
decision to merge and the time to process pull requests? Why are some pull
requests not merged?</p>
<p>The dataset we used to do this study was based on
<a href="http://www.gousios.gr/bibliography/GS12.html">our</a>
<a href="http://www.gousios.gr/bibliography/G13.html">previous</a>
<a href="http://ghtorrent.org/netviz/">work</a> on <a href="http://ghtorrent.org">GHTorrent</a>, a
<a href="http://ghtorrent.org/dblite/">queriable</a> <a href="http://ghtorrent.org/raw.html">off-line
mirror</a> of the data offered through the Github
API. The project has been collecting data since February 2012. Up to August
2013, when this study was conducted, 1.9 million pull requests from more than
200k projects have been collected.</p>
<p>Here is what we learned.</p>
<h3>Use of pull requests on Github</h3>
<p>In August 2013, Github reported more than 7 million repositories and 4 million
users. However, not all those projects are active: in the period Feb 2012 — Aug
2013, the GHTorrent dataset captured events initiated by (approximately)
2,281,000 users affecting 4,887,500 repositories. The majority of registered
repositories are forks of other repositories, special repositories hosting user
web pages, program configuration files and temporary repositories for evaluating
Git. In the GHTorrent dataset, less than half (1,877,660 or 45%) of the active
repositories are original repositories.</p>
<p>Of course, not all repositories on Github use pull requests. In fact, only 32%
of the repositories actually have more than one contributor. To get an estimate
of the popularity of pull requests among those repositories, we compared the use
of pull requests on Github in the same periods across two consecutive years (Feb
to Aug 2012 and Feb to Aug 2013). We also included the number of repositories that use the
shared repository approach exclusively. The results can be seen in the table
below:</p>
<table class="table table-striped">
<thead>
<tr>
<td> <b>Metric</b></td>
<td> <b>Feb-Aug 2012</b> </td>
<td> <b>Feb-Aug 2013</b> </td>
</tr>
</thead>
<tbody>
<tr>
<td>Active repos (&gt; 1 commit) </td>
<td> 315,522 (100%) </td>
<td> 1,157,625 (100%) </td>
</tr>
<tr>
<td><br/></td>
<td><br/></td>
<td><br/></td>
</tr>
<tr>
<td>Repos with 1 contributor</td>
<td> 261,317 (65%) </td>
<td> 913,205 (79%) </td>
</tr>
<tr>
<td>Repos with pull requests </td>
<td> 53,866 (17%) </td>
<td> 120,104 (10%) </td>
</tr>
<tr>
<td>Repos with shared repository </td>
<td> 54,205 (18%) </td>
<td> 124,316 (11%) </td>
</tr>
</tbody>
</table>
<p>If nothing else, this table is representative of the staggering growth of
Github. The number of public repos with visible activity grew more than 3x in
just one year. The number of collaborative repos grew 2.4x times, which means
that the relative number of repos using either the shared repository model
exclusively or pull requests decreased. On both years, an equal proportion of
repositories use either pull requests or the shared repository model.</p>
<p>For those projects that received pull requests in 2013, the mean number of pull
requests per project is relatively low at 8.1 (median: 2, percentiles: 5%: 1,
95%: 21); however, the distribution of the number of pull requests in projects
is highly skewed. In case you think that popular projects get the most pull
requests, you are wrong: this is only weakly supported by our data. From the
pull requests that have been opened in 2013, 73.07% have been merged using
Github facilities (more might have been merged using other methods, see below).</p>
<p><strong>summary:</strong> <em>~14% of repositories are using pull requests on Github. Pull
requests and shared repositories are equally used among projects.</em></p>
<h3>Characteristics of the pull request lifecycle</h3>
<p><img src="/files/wordcloud-icse2014.png" class="img-polaroid" style="float: right;width: 40%"></p>
<p>The GHTorrent dataset is good to get an overall view of pull requests across all
projects on Github, but we wanted to go deeper than that. For this reason, we
selected all projects written in Ruby, Python, Java and Scala (why just those
languages? because testing detection is possible) that had more than 200 pull
requests in their lifetime and using a combination of GHTorrent and their Git
repositories, we extracted more than <a href="https://github.com/gousiosg/pullreqs#generating-intermediate-files">40
features</a>
for each pull request. We also applied heuristics to detect merges happening
outside Github (e.g. using <code>git merge</code>, <code>git rebase</code>, cherry-picking etc).
This brought the number of merged pull requests up to 84%. In total, we
analyzed 291 software development projects (99 Python, 91 Java, 87 Ruby, 14
Scala) and 166,884 pull requests.</p>
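<p>As an illustration, the simplest of such heuristics checks whether the tip commit of a pull request is reachable from the project's mainline: if it is, the commits were integrated with <code>git merge</code> even though Github recorded no merge. This is only a sketch; rebases and cherry-picks rewrite commit hashes and therefore need content-based matching instead:</p>

```bash
# Print "merged" if the pull request head commit ($1) is an ancestor of the
# mainline ref ($2), "unmerged" otherwise. Sketch only: real heuristics must
# also handle rebased or cherry-picked commits whose hashes have changed.
merged_outside_github() {
  if git merge-base --is-ancestor "$1" "$2"; then
    echo merged
  else
    echo unmerged
  fi
}
```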
<p>We then analyzed the generated data in various dimensions, in order to identify
the lifecycle characteristics of pull requests and determine the factors that
influence them.</p>
<p><strong>Lifetime</strong>
<img src="/files/pr-lifetime-hist.png" style="width: 50%;float: center;" class="img-polaroid"></p>
<p>By correlating pull request features with the time to merge and the
time to close pull requests, one can find very interesting patterns.
Here is a list of our most interesting findings:</p>
<ul>
<li>The time to merge pull requests is highly skewed, as can be seen in the
figure above, with the great majority of merges happening very fast. Measured in
days, 95% of the pull requests are merged in 26, 90% in 10 and 80% in 3.7 days.
30% of the pull requests are merged in under one hour!</li>
<li>Pull requests are either merged fast or left lingering open before they are
closed.</li>
<li>Pull requests receive no special treatment, irrespective of whether
they came from core team members or from the community, as there is
no statistically significant difference in the time required to close them.</li>
<li>The time required to merge pull requests is not correlated with
the project's size, but it is correlated with the pull requester's track record.
The more pull requests that have been accepted by a specific developer, the
faster his/her new pull request will be processed. This is an indication that
eventually even software developers can learn to trust each other :-)</li>
</ul>
<p><strong>Size</strong></p>
<table class="table table-striped">
<thead>
<tr>
<td> </td>
<td> <b>median</b> </td>
<td> <b>80%</b> </td>
<td> <b>90%</b> </td>
<td> <b>95%</b> </td>
</tr>
</thead>
<tbody>
<tr>
<td># commits</td>
<td> 1 </td>
<td> 3 </td>
<td> 6 </td>
<td> 12 </td>
</tr>
<tr>
<td># Files</td>
<td> 2 </td>
<td> 7 </td>
<td> 17 </td>
<td> 36 </td>
</tr>
<tr>
<td># lines changed</td>
<td> 20 </td>
<td> 168 </td>
<td> 497 </td>
<td> 1227 </td>
</tr>
</tbody>
</table>
<p>The table above speaks for itself: most pull requests
are small, touch only a few files and consequently only a few
lines. A reader well-versed in statistics will also see
that the distributions of those variables are highly skewed.</p>
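<p>To see the skew, compare the median against the upper percentiles. The Python sketch below does this on synthetic data (not the study's dataset; it merely reproduces the long-tailed shape of the table) using a simple nearest-rank percentile:</p>

```python
import random

random.seed(42)
# Synthetic, log-normal-ish "lines changed" per pull request: NOT the
# actual data, just a long-tailed distribution like the one in the table.
lines_changed = sorted(int(10 ** random.gauss(1.3, 0.8)) for _ in range(10_000))

def percentile(data, p):
    """Nearest-rank percentile on pre-sorted data."""
    idx = min(len(data) - 1, int(p / 100 * len(data)))
    return data[idx]

for p in (50, 80, 90, 95):
    print(f"{p}th percentile: {percentile(lines_changed, p)}")
```

The median comes out far below the 95th percentile, which is exactly the pattern the table shows for commits, files and lines changed.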
<p><strong>Discussion and code review</strong></p>
<p>Once a pull request has been submitted, it is open for discussion until it is
merged or closed. The discussion is usually brief: 95% of pull requests receive
12 comments or less (80% less than 4 comments). Similarly, the number of
participants in the discussion is also low (95% of pull requests are discussed
by less than 4 people).</p>
<p>Code reviews are integrated in the pull request process. While the pull request
discussion can be considered an implicit form of code review, 12% of the pull
requests in our sample have also received comments on source code lines in the
included commits. This of course does not mean that the remaining pull requests
have not been code reviewed. Code reviews do not increase the probability of a
pull request being merged, but they do slow down the processing of a pull
request by an order of magnitude.</p>
<p><strong>Summary:</strong> <em>Most pull requests are less than 20 lines long and processed
(merged or discarded) in less than 1 day. The discussion spans 3 comments
on average, while code reviews affect the time to merge a pull request. Inclusion
of test code does not affect the time or the decision to merge a pull request.
Pull requests receive no special treatment, irrespective whether they come from
contributors or the core team.</em></p>
<h3>What factors influence merging and time to merge?</h3>
<p>To measure the combined effect of factors on the decision to merge a pull
request and the time it takes to do so, we employed machine learning. Given the
291 project dataset discussed above, we trained 2 classifiers, one for each
question, using 3 well-known classification algorithms (logistic regression,
naive Bayes and random forests) without any tuning. We then calculated the mean
area under curve (AUC) and accuracy metrics across randomized 10-fold cross-validation
runs on the whole dataset, to select the best algorithm. Data mining
people reading this will probably laugh at our algorithm selection, but
deriving the best possible classifier was not our goal. In both cases, random
forests outperformed (by far) their two contenders. So we selected random
forests.</p>
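<p>The selection procedure boils down to: for each algorithm, score 10 random folds by AUC and keep the algorithm with the best mean. A minimal Python sketch of the scoring machinery (the study itself used R; the classifier training is left as a caller-supplied function):</p>

```python
import random

def auc(labels, scores):
    """Area under the ROC curve via the rank-sum formulation: the
    probability that a random positive outranks a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return 0.5  # degenerate fold: only one class present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def kfold_mean_auc(data, train_and_score, k=10, seed=0):
    """Mean AUC over k random folds. train_and_score(train, test) is a
    caller-supplied function that fits a model on train and returns
    (labels, scores) for the held-out test fold."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    aucs = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        aucs.append(auc(*train_and_score(train, test)))
    return sum(aucs) / k
```

Running this with one <code>train_and_score</code> wrapper per algorithm and comparing the means is the whole selection step.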
<p>The random forest algorithm can report the relative importance of features to
the prediction outcome (the mean decrease in accuracy metric), which we used as
the result of this experiment. To estimate it, we trained random forest
classifiers on half of the dataset, chosen randomly in a loop of 100 runs,
with ridiculously high configuration parameters (2,000 trees, a tree depth
of 5, etc.). We then estimated the mean MDA per factor, as can be seen in the
following figures.</p>
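<p>Mean decrease in accuracy is a permutation-based importance: shuffle one feature at a time and measure how much held-out accuracy drops. A model-agnostic Python sketch of that step (not the R code used in the study):</p>

```python
import random

def mean_decrease_accuracy(predict, X, y, n_rounds=100, seed=0):
    """Permutation importance: for each feature, shuffle that column and
    measure the drop in accuracy of predict (a fitted model's per-row
    predict function) on (X, y). Returns one mean drop per feature."""
    rng = random.Random(seed)
    base = sum(predict(row) == label for row, label in zip(X, y)) / len(y)
    n_features = len(X[0])
    drops = [0.0] * n_features
    for f in range(n_features):
        for _ in range(n_rounds):
            col = [row[f] for row in X]
            rng.shuffle(col)  # break the feature/label association
            Xp = [row[:f] + [v] + row[f + 1:] for row, v in zip(X, col)]
            acc = sum(predict(row) == label for row, label in zip(Xp, y)) / len(y)
            drops[f] += (base - acc) / n_rounds
    return drops
```

A feature the model relies on (like <code>commits_on_files_touched</code> here) shows a large drop; an ignored feature shows none.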
<div style="margin-left:auto;margin-right:auto;">
<p>
<a href="/files/varimp-merge-decision.png" rel="lightbox">
<img src="/files/varimp-merge-decision.png" class="img-polaroid" align="center" width="40%"/></a>
<a href="/files/varimp-merge-time.png" rel="lightbox">
<img src="/files/varimp-merge-time.png" class="img-polaroid" align="center" width="40%"/></a>
</p>
</div>
<p>For the decision to merge a pull request, a rather unexpected factor dominates
the results: <code>commits_on_files_touched</code>, which basically measures how hot
the project area affected by the pull request is. This factor is almost enough to
predict whether a pull request will be merged or not, but there is a caveat:
most pull requests are merged anyway :-)</p>
<p>The situation is not as clear in the time to merge a pull request. The results
are a bit fragile, as the classifier did not achieve a high accuracy score
(~70%), but they are nevertheless interesting. The
<ul>
<li>track record of the contributor</li>
<li>size of the project, and</li>
<li>the test coverage</li>
</ul>
<p>seem to influence the decision of how fast a pull request will be merged.</p>
<h3>Why are some pull requests not merged?</h3>
<p>As most pull requests are indeed merged, it is interesting to explore why some
pull requests are <em>not</em> merged. For that reason, we manually looked into 350
pull requests that our heuristics identified as un-merged. The results can be
seen in the plot below:</p>
<p><img src="/files/unmerged-reasons.png" style="width: 50%;float: center;" class="img-polaroid"></p>
<p>Apparently, there is no clearly outstanding reason for not merging pull
requests. If we group together close reasons that have a timing dimension, we
see that 27% of unmerged pull requests are closed due to concurrent
modifications of the code in project branches. Another 16% (superfluous,
duplicate, deferred) is closed as a result of the contributor not having
identified the direction of the project correctly, and therefore submitting
uninteresting changes. 10% of the contributions are rejected for reasons that
have to do with project process and quality requirements (process, tests); this
may be an indicator of processes not being communicated well enough or a very
rigorous code reviewing process. Finally, another 13% of the contributions are
rejected because the code review revealed an error in the implementation.</p>
<p>For 15% of the examined pull requests, we, as human examiners, could not
identify whether they are merged or not by looking at them. For 19%, our
inspection revealed that they were actually merged, which in turn suggests that
our merge heuristics are not inclusive enough.</p>
<p>The above may mean that the pull-based model (or at least the way Github
implements it) may be transparent for the project’s core team but not so much
for potential contributors. The fact that human examiners could not understand
why pull requests are rejected even after manually reviewing them supports
this hypothesis further.</p>
<p><strong>Summary:</strong> <em>53% of pull requests are rejected for reasons having to do with
the distributed nature of pull based development. Only 13% of the pull requests
are rejected due to technical reasons.</em></p>
<h3>Conclusion and suggestions</h3>
<p>The goal of this work is to obtain a deep understanding of the pull-based
software development model, as used for many important open source projects
hosted on Github. To that end, we have conducted a statistical analysis of
millions of pull requests, as well as of a carefully composed set of hundreds of
thousands of pull requests from projects actively using the pull-based model.
Here are our recommendations based on our findings:</p>
<p><strong>For contributors</strong> Want to get your contributions accepted fast? Then:</p>
<ul>
<li>Make them short</li>
<li>Make them hot</li>
</ul>
<p><strong>For project owners</strong> Want to be effective in managing pull requests? Then:</p>
<ul>
<li>Invest in a comprehensive test suite</li>
<li>Include pull request submission guidelines in a prominent location in your project</li>
<li>Make your project roadmap clearly visible</li>
<li>Ask your potential contributors to communicate their intended changes (e.g. through issues).</li>
</ul>
<p><strong>For researchers and Github</strong> Projects are struggling under a constant influx
of pull requests. Research needs to be done on tools that help developers sort
and triage incoming contributions, assess their impact on the code base and
recommend actions based on the characteristics of pull requests. For example,
an outcome of this study is that it is relatively easy to predict whether a pull
request will be merged or not, so a tool that labels pull requests as <span
class="label label-success">you 're gonna merge this</span> might help
developers to focus on difficult to process pull requests.</p>
<p>The GHTorrent dataset is documented and distributed on <a href="http://ghtorrent.org">this
website</a>. The extracted datasets as well as custom-built
Ruby and R analysis tools are available on the Github repository
<a href="https://github.com/gousiosg/pullreqs">gousiosg/pullreqs</a>, along with
instructions on how to use them.</p>
<p>See the contents of this post in presentation format:</p>
<div style="width: 60%;margin-left:auto;margin-right:auto;">
<script async class="speakerdeck-embed" data-id="c25d64607e600130294c22000a9f019a" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</div>
<h3>Postscript</h3>
<p>A couple of weeks ago, we learned that the corresponding paper was accepted in
the proceedings of the <a href="http://2014.icse-conferences.org/">36th</a> International
Conference on Software Engineering (ICSE).
<a href="/bibliography/GPD14.html">Here is a pre-print</a>, where you can find an in
depth account of the tools and
techniques we used to come up with the results we present here.</p>
The SEFUNC project final report2013-10-31T00:00:00+01:00http://www.gousios.gr/blog/sefunc-report<p><em>by Georgios Gousios and Arie van Deursen</em></p>
<p><em>This is the publishable version of the final report submitted as part of
my Marie Curie IEF project. It summarizes what I did during the 16 months I was funded by it.</em></p>
<p>The advent of distributed version control systems has led to the development of a new paradigm for distributed software development; instead of pushing changes to a central repository, developers pull them from other repositories and merge them locally. Various code hosting sites, notably Github, have tapped on the opportunity to facilitate pull-based development by offering workflow support tools, such as code reviewing systems and integrated issue trackers. The SEFUNC project was focused on mining and analyzing distributed collaboration on social coding sites using functional as well as object-oriented programming paradigms.</p>
<p>In the context of the project, we created a large scale repository mining operation to retrieve all data available from the Github hosting site, and
using it we analyzed distributed collaboration through pull-based development and visualized language ecosystems.</p>
<h2>Large scale repository mining</h2>
<p>A common requirement of many empirical software engineering studies is the acquisition and curation of data from software repositories. During the last few years, GitHub has emerged as a popular project hosting, mirroring and collaboration platform. GitHub provides an extensive web service, which enables researchers to retrieve both the commits to the projects’ repositories and events generated through user actions on project resources. GHTorrent aims to create a scalable offline mirror of GitHub’s event streams and persistent data, and offer it to the research community as a service.</p>
<p>The primary challenge for the data collection process is the Github imposed requests per hour limit for authenticated requests, while the event generation rate is already higher; given that a single event can lead to several (even thousands) of dependent requests, it is not practical to assume that a single Github account will suffice to mirror the whole dataset. For this reason, GHTorrent was designed from the ground up to employ distributed data collection. The event and data retrieval process have been split in two components connected together through a queue; this way event retrieval is isolated from data retrieval and both can happen in parallel by multiple accounts. Shared databases keep track of the data on both ends of the retrieval. To acquire more Github accounts, we introduced the <em>workers for data</em> program, where interested researchers provided their account credentials to the project in exchange for direct access to the live databases.</p>
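<p>The split can be sketched as a producer/consumer pair around a queue. The real system uses multiple GitHub accounts and a message broker; this illustrative Python version uses a single in-process worker:</p>

```python
import queue
import threading

def run_pipeline(events, fetch):
    """Decouple event retrieval from data retrieval via a queue, so the
    two stages can proceed in parallel. The producer enqueues events as
    they arrive; the worker performs the dependent data requests."""
    q = queue.Queue()
    results = []

    def worker():
        while True:
            ev = q.get()
            if ev is None:      # sentinel: no more events
                break
            results.append(fetch(ev))  # dependent API requests happen here

    t = threading.Thread(target=worker)
    t.start()
    for ev in events:           # event retrieval: fast, rate-limited
        q.put(ev)
    q.put(None)
    t.join()
    return results
```

With several workers (each holding its own account credentials), the dependent requests are spread across accounts exactly as the shared-queue design intends.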
<p>After collection, the data is provided back to the community through the <a href="http://ghtorrent.org">project's web page</a>. Currently, more than 2 terabytes of data is on offer in two database formats. The wealth of data enables researchers for the first time to do <em>full population quantitative studies</em> in several domains, including software ecosystems, distributed collaboration and repository mining. The project has been awarded the <em>Best data showcase award</em> at the 2013 Mining Software Repositories conference, for its innovative use of distributed crawling and the sharing of valuable data with the community.</p>
<h2>Pull based development</h2>
<p>Pull-based development is an emerging paradigm for distributed software
development. As more developers appreciate isolated development and branching, more projects, both closed source and, especially, open source, are being migrated to code hosting sites such as Github and Bitbucket with support for pull-based development. A unique characteristic of such sites is that they allow any user to clone any public repository. The clone creates a public project that belongs to the user that cloned it, so the user can modify the repository without being part of the development team. Furthermore, such sites automate the selective contribution of commits from the clone to the source through pull requests.</p>
<p>Pull requests as a distributed development model in general, and as implemented by Github in particular, form a new method for collaborating on distributed software development. The novelty lies in the decoupling of the development effort from the decision to incorporate the results of the development in the code base. By separating the concerns of building artifacts and integrating changes, work is cleanly distributed between a contributor team that submits, often occasional, changes to be considered for merging and a core team that oversees the merge process, providing feedback, conducting tests, requesting changes, and finally accepting the contributions.</p>
<p>Within the context of the project, we performed the first large scale quantitative analysis of how the pull-based development model works. Specifically, we extracted data from 300 large projects (170,000 pull requests) and using statistical and machine learning tools, we examined the factors that affect pull request acceptance, rejection and the time required
to do so. We found that the pull based development model offers fast turnaround, increased opportunities for community engagement and decreased time to incorporate contributions. We showed that a relatively small number of factors affect both the decision to merge a pull request and the time to process it. Our findings contain actionable items that can be exploited
by teams and individuals to improve the efficiency of distributed collaborative projects.</p>
<h2>Visualizing language-based project ecosystems</h2>
<p><img align="right" src="/files/communities.png" width="40%"></p>
<p>In the context of software analysis, the term ecosystem means a collection of software systems, which are developed and co-evolve in the same environment. An interesting partitioning of projects in ecosystems is that of ecosystems created by projects developed in the same programming language, thus permitting the visual comparison of, e.g., functional and object-oriented languages. On the collaboration level, language ecosystems are created by sharing developers among projects. To investigate the existence and evolution of such ecosystems, we <a href="http://www.gousios.gr/blog/project-communities-visualization/">created an interactive visualization</a> (seen in the adjacent screenshot). Using it interested parties can go through millions of collaborations of developers among projects of one or more programming languages, and also investigate those through time.</p>
<h2>Dissemination actions</h2>
<p>The SEFUNC project followed an open access strategy to ensure broad and timely dissemination of project results. All source code, analysis tools
and datasets that were developed by the project were disseminated using open source and creative commons licenses from day one. We used social media (mostly blogs and Twitter) to disseminate the project's results, while
we closely monitored the reach and effect of our online dissemination strategy using social media analytics.</p>
<h3>Website</h3>
<p>The project created a web site to host and disseminate the generated artifacts. The web site is mostly targeted to researchers, even though
interested users can find interactive visualizations of significant
parts of the dataset. Through the <a href="http://ghtorrent.org">GHTorrent.org</a> site, one can find:</p>
<ul>
<li>Links to download the dataset (in two formats) along with documentation on
how to use it</li>
<li>Online tools to query both datasets.</li>
<li>Instructions on how to use the developed tools to recreate the dataset from scratch.</li>
<li>Example interactive applications (project community graph-based visualization, programming language popularity metrics) developed using the dataset.</li>
</ul>
<h3>Scientific dissemination</h3>
<p>The scientific dissemination strategy for the project included publications
to conferences and journals. The project produced 2 conference and 2 journal
publications.
One of them won the <a href="http://2014.msrconf.org/history.php">Best Data Showcase Award</a> at the 10th conference of Mining Software
Repositories. The following publications resulted from the project:</p>
<p><strong>Journals</strong></p>
<ul>
<li>Gousios, G., & Spinellis, D. (2013). <a href="http://www.gousios.gr/bibliography/GS13.html">Conducting quantitative software engineering studies with Alitheia Core</a>. Empirical Software Engineering, 1–41.</li>
<li>Louridas, P., & Gousios, G. (2012). <a href="http://www.gousios.gr/bibliography/LG12.html">A note on rigour and replicability.</a> SIGSOFT Softw. Eng. Notes, 37(5), 1–4.</li>
</ul>
<p><strong>Conferences</strong></p>
<ul>
<li><p>Gousios, G. (2013). <a href="http://www.gousios.gr/bibliography/G13.html">The GHTorrent dataset and tool suite.</a> In MSR ’13: Proceedings of the 9th Working Conference on Mining Software Repositories.</p></li>
<li><p>Mitropoulos, D., Karakoidas, V., Louridas, P., Gousios, G., & Spinellis, D. (2013). <a href="http://www.gousios.gr/bibliography/MKLGS13.html">Dismal Code: Studying the Evolution of Security Bugs.</a> In LASER ’13: Proceedings of the 2013 Workshop on Learning from Authoritative Security Experiment Results.</p></li>
</ul>
<p>2 more works that resulted from the project
(<a href="http://www.gousios.gr/bibliography/GPD14.html">large scale analysis of the pull based development model</a>,
<a href="http://www.gousios.gr/bibliography/MG13.html">experiences of teaching functional programming at TU Delft</a>) are currently under review.</p>
<h3>Invited talks</h3>
<p>Georgios Gousios was invited to present his experience in building large scale, open access research data sets at the Panel on Open Access at the 29th International Conference on Software Maintenance. He presented the project and its success at attracting external users almost immediately and attributed
it to the fact that it was open and well documented from the beginning (<a href="http://www.gousios.gr/blog/On-open-access/">blog post</a>, <a href="http://www.gousios.gr/bibliography/Gousit13b.html">presentation</a>).</p>
<p>In addition, the project was presented after invitation at the following universities/research groups/companies:</p>
<ul>
<li>Software Technology Group, TU Darmstadt, Germany (<a href="http://www.gousios.gr/bibliography/Gousit13a.html">presentation</a>)</li>
<li>Information Systems Laboratory, Athens University of Economics and Business, Greece (<a href="http://www.gousios.gr/bibliography/Gousit13a.html">presentation</a>)</li>
<li>IHomer, Breda, The Netherlands. Presentation: "On Software Changes, Large and Small"</li>
</ul>
<h3>Dissemination strategy effectiveness metrics</h3>
<ul>
<li><p>The artifacts generated by the project are in active use by at least 7 institutions outside TU Delft. 4 external scientific publications have been submitted/accepted in field conferences.</p></li>
<li><p>The dataset has been selected as the official dataset of the mining challenge at the 11th Conference on Mining Software Repositories. Currently, more than 40 downloads of the dataset from more than 30 universities have been recorded in our logs.</p></li>
<li><p>The source code repository has been starred (is being followed) by 35 Github users, while the corresponding Ruby library has been downloaded 4,500 times.</p></li>
<li><p>The <a href="http://ghtorrent.org">ghtorrent.org</a> site receives 40 unique visitors a day with a
return visitor rate of more than 50%. In total, it has been visited by 1,200 visitors, most from the US, Canada and Brazil. It has appeared in social media in excess of 50 times.</p></li>
<li><p>The blog posts by Georgios Gousios on topics about and related to GHTorrent have been viewed by more than 800 unique users while they appeared in social media more than 70 times.</p></li>
</ul>
<h2>Impact and Availability</h2>
<p>The project has generated a stream of follow up work. The dataset has been
selected for the mining challenge of the 11th Conference on Mining Software
Repositories, the flagship conference in the field. It is currently being actively analyzed by researchers in more than 7 universities, while papers from independent researchers have already been published. Analysis and visualization results produced through the project have been used as examples of cutting-edge research in conference keynotes (e.g. by <a href="https://twitter.com/avandeursen/status/336154714360139776">Brian Doll at MSR13</a>).</p>
<p><em>You can find more about the project, including the datasets, links
to source code, visualizations and documentation through the project's website at http://ghtorrent.org</em></p>
Lazy hacker's service analytics2013-10-15T00:00:00+02:00http://www.gousios.gr/blog/Hackers-data-analytics<p>A week ago, I had trouble with the GHTorrent data retrieval process.
Specifically, while the scripts were performing as expected and the event
processing error rate was within reasonable bounds, API requests took forever to
complete, in many cases as much as 20 seconds. I know that <a href="https://status.github.com/">Github's API is very
snappy</a>, and even though the response times I get
are slower than what Github reports, it is reasonably fast if we take into
account the packet trip over (or under) the Atlantic (usually, around 500 msec).</p>
<p>My main hypothesis was that Github started employing some kind of tar pitting
strategy for accounts using their API extensively. I am one of them: for every
Github account that I have collected from fellow researchers, I run two
mirroring processes, to make sure that I exploit the full 5,000 requests/hour
limit. As I maintain extensive (debug-level) logs for each request GHTorrent
makes, I decided to investigate whether that was the case.</p>
<h3>Preparing the data</h3>
<p>I had to process > 10 GB of data on a 16-core server and extract information
about URLs and timings. This was a no-brainer: Unix all the way! I slapped
together the following (totally inefficient) script, which uses convenient
functions in 3 programming languages :-) The crucial thing here is the use of
the <code>parallel</code> command. This allows the <code>doit()</code> function to be applied in
parallel on 10 input files, thereby stressing the fast machine sufficiently
enough.</p>
<p>The script outputs a CSV file with 3 fields: timestamp (as seconds since the epoch), IP address used for the request, time the request had taken.</p>
<p><figure class="highlight"><pre><code class="language-bash" data-lang="bash">processfile() {
  grep APIClient $1 |
  grep -v WARN |
  perl -lape 's/\[([T0-9-:.]*).*\] DEBUG.*\[([0-9.]*)\].*Total: ([0-9]*) ms/$1 $2 $3/' |
  cut -f2,3,4 -d' ' |
  ruby -ne 'BEGIN{require "time"}; t,i,d=$_.split(/ /); print Time.parse(t).to_i," ", i, " ", d;' |
  grep -v "#"
}

export -f processfile

find mirror -type f | grep log.txt | parallel -j10 processfile {}</code></pre></figure></p>
<h3>Data processing with R</h3>
<p>I have a love-hate relationship with R. I do admire the fact that it allows
developers to acquire data from multiple sources, manipulate it relatively
easily and plot it beautifully. Also, whenever basic R has shortcomings, the
community usually steps in with awesome libraries (be it sqldf, plyr, ggplot2
etc). R the language, however, <a href="http://www.gousios.gr/blog/new-stats-language-required/">leaves much to be
desired</a>. Nevertheless,
as I had written <a href="https://github.com/gousiosg/pullreqs/tree/master/R">lots of</a>
<a href="https://github.com/gousiosg/cliffs.d">R code</a> lately, it somehow felt like an
obvious choice for summarizing the data. Here is <a href="https://github.com/gousiosg/ghtorrent.org/blob/master/stats/api-stats.R">the
script</a>
I came up with.</p>
<p>Importing tabular data in R is trivial: a call to <code>read.csv</code> will do the job
without any trouble, loading a comma or tab separated file in an in-memory
representation called a data frame. I usually also pass to it column type
parameters to make sure that integers are indeed represented as integers and
factors are recognized as such, and also to make sure that there are no errors
in the data file. Moreover, to have a flexible representation of time, I usually
convert epoch timestamps to the <code>POSIXct</code> data type.</p>
<p>After initial importing of the file, basic statistics (quantiles and means) can be acquired using the <code>summary</code> function.</p>
<p>Typical processing of time series data includes aggregation per a configurable
time unit (e.g. hour, day etc). In R, this can be achieved using a two-step
process: i) binning (assigning labels to data based on a criterion) ii)
bin-based aggregation. Fortunately, both steps only consist of a line each! For
example, if we want to aggregate the mean time of each API request per 30
minutes it suffices to do the following:</p>
<p><figure class="highlight"><pre><code class="language-splus" data-lang="splus">data$timebin <- cut.POSIXt(data$ts, breaks = "30 mins")
mean.interval <- aggregate(ms ~ timebin, data = data, mean)</code></pre></figure></p>
<p><code>cut.POSIXt</code> is an overload of the general <code>cut</code> binning function that works on
data of type <code>POSIXt</code>. As it knows that it works on time data, the binning is
time aware, so we can specify arbitrary time-based bins (e.g. '12 minutes' or '3
months'). The <code>aggregate</code> function will then summarize our data given a formula:
in our case, it will apply the aggregation function <code>mean</code> on groups of <code>ms</code>,
where the grouping factor is the assigned time bin. In SQL, the equivalent
expression would be: <code>SELECT timebin,mean(ms) FROM... GROUP BY timebin</code>. We can
pass multiple grouping factors and arbitrary aggregation functions, which makes
<code>aggregate</code>'s functionality a superset of SQL, for single tables. For more
complex aggregations (e.g. self joins), we can use <code>sqldf</code> to convert our data
frame in an SQLite table, run an SQL query and get back a new data frame with
the results in one go.</p>
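<p>As an aside, the same bin-then-aggregate step is easy to express in plain Python as well (this was not part of the original R workflow):</p>

```python
from collections import defaultdict

def aggregate_mean(rows, bin_seconds=30 * 60):
    """Bin (epoch_ts, ms) pairs into fixed-width time bins and average
    ms per bin: the cut + aggregate step from the R snippet, in Python."""
    sums = defaultdict(lambda: [0, 0])          # bin -> [total, count]
    for ts, ms in rows:
        b = ts - ts % bin_seconds               # bin label = interval start
        sums[b][0] += ms
        sums[b][1] += 1
    return {b: total / n for b, (total, n) in sorted(sums.items())}
```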
<p>The final step would be to plot the aggregated data. For that, I use <code>ggplot2</code>,
which, in my opinion, is the best plotting tool bar none. Using <code>ggplot2</code> is
straightforward, after one understands the <a href="http://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/GOG.html">theory behind
it</a>. In the
following example, we specify an aesthetic (roughly, the data to be plotted)
which we then use to feed a line plot with a date x-axis. <code>ggplot2</code> will take
care of scales, colors etc.</p>
<p><figure class="highlight"><pre><code class="language-splus" data-lang="splus">ggplot(mean.interval) +
aes(x = timebin, y = ms) +
geom_line() +
scale_x_datetime() +
xlab('time') +
ylab('Mean API resp in ms') +
ggtitle('Mean API response time timeseries (30 min intervals)')</code></pre></figure></p>
<p>The plot that resulted from the R script on the data I processed
can be seen below:</p>
<p><a href="/files/api-resp.png" rel="lightbox">
<img src="/files/api-resp.png" class="img-polaroid" align="center" width="60%"/></a></p>
<h3>Results</h3>
<p>After I had the evidence at hand, I had a brief discussion with our network
administrators. They had recently updated the university-wide firewall policies
for higher throughput. Unfortunately, this turned out to be at the expense of
latency. As we can see at the end of the plot above, after the changes were
reverted the mirroring process started flying again, with 500 msec average
latency. So Github was innocent and my working hypothesis wrong.</p>
<p>The net result is that with 40 lines of totally unoptimized code, I can go
through several gigabytes worth of logs and plot service quality timeseries
plots in a little over a minute. Since I had the initial implementation running
reliably, I created a few more plots and added a cron job and a web page to
display them online. You can also see them <a href="http://ghtorrent.org/stats/">here</a>.</p>
<p>The moral of the story is that we don't always need to setup complicated systems
to do service monitoring. A few Unix commands to extract data from logs and a
tool to summarize and plot them might be enough to get us going.</p>
Teapots!2013-10-09T00:00:00+02:00http://www.gousios.gr/blog/teapots<p>On Oct 9 2013, I gave a lecture for
<a href="https://twitter.com/headinthebox">@headinthebox</a>'s functional programming
course at TU Delft. Similarly to <a href="https://github.com/gousiosg/teapots">last year</a>, the topic was how to draw the <a href="http://en.wikipedia.org/wiki/Utah_teapot">Utah teapot</a> using only
right triangles. However, this year the challenge was not to implement the algorithm, but rather to optimize a shared implementation. I had written an <a href="https://gist.github.com/gousiosg/6871500">implementation in raw PHP</a> and the students would have to improve it using new language or algorithmic features found
in the Facebook implementation of PHP (the <a href="https://github.com/facebook/hiphop-php">HipHop VM</a>). Extra points would be given if the output was in HTML, which
was the point of the <a href="http://queue.acm.org/detail.cfm?id=2436698">original experiment</a> done by Brian Beckman and Erik Meijer.</p>
<p>In my original implementation, I tried to follow a functional style. However,
this was not entirely possible in PHP: while the language does support passing
functions as arguments, its imperative roots cannot be hidden. Several functions
modify arguments in place (e.g. <code>usort</code>), while others use function arguments to
return more than one value (e.g. <code>preg_match</code>). Moreover, many of the higher
order functions have varying calling and return conventions (e.g. <code>array_map</code>
and <code>array_reduce</code> take the processed array as the last and first argument,
respectively), some of the common ones are missing (e.g. <code>array_flat_map</code> or at
least <code>array_flatten</code>), while the dark corners of others have been
<a href="http://www.phpsadness.com/">documented elsewhere</a>. The combination of the
above makes writing beautiful code in PHP very hard, if not impossible. As
further testament to this, see <a href="https://github.com/gousiosg/teapots/blob/master/martin.pinzger.teapot/src/martin/pinzger/teapot/Triangle.scala">the Scala code</a>
I based my solution on and <a href="https://gist.github.com/gousiosg/6871500">compare it with the PHP version</a>.</p>
<p>Nevertheless, several students took up the challenge to make the code
run faster. On my computer (a 1.8GHz MacBook Air), the code already ran
12 times faster on HipHopVM than on Zend PHP (2.8 vs 32.5 sec,
respectively). The goal, then, was to identify and grab the low-hanging
optimization fruit, of which I believed my code contained plenty.
However, this was not the case. Here are some things the students did
that only delivered marginal improvements (~5%):</p>
<ul>
<li>Removed the class Point, as it is merely a data type with no operations,
and made the Triangle class constructor accept 6 arguments, through which the
co-ordinates of its points were passed. In theory, this would help
remove excessive object field accesses.</li>
<li>Made the internal functions <code>comp1</code> and <code>comp2</code>, which are re-declared
on every call to <code>select_crossline()</code>, static, so they are declared only once.</li>
<li>Memoized the results of the <code>{max,min}_{x,y}</code> functions and even
pre-calculated their results on Triangle construction.</li>
</ul>
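<p>The memoization item, for instance, boils down to computing the extrema once at construction time; a quick Python sketch of the idea (the shape of the Triangle type here is hypothetical, not the PHP original):</p>

```python
class Triangle:
    """Pre-computes bounding-box extrema once at construction, instead of
    re-scanning the points on every {max,min}_{x,y} call."""

    def __init__(self, points):
        self.points = points                 # e.g. [(x, y), (x, y), (x, y)]
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        # memoized results, computed exactly once
        self.max_x, self.min_x = max(xs), min(xs)
        self.max_y, self.min_y = max(ys), min(ys)
```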
<p>The winners did something cleverer. Instead of going for the low-hanging
fruit, they estimated the number of invocations of certain functions by hand
(akin to manual profiling) and decided to focus on the function that is called the
most:
<a href="https://gist.github.com/gousiosg/6871500#file-teapot-php-L252">split_triangles()</a>.
What they noticed is that <code>array_flatten</code> is called recursively thousands of
times, and that its implementation performs badly in every respect.
This was in fact my fault: as
<a href="https://gist.github.com/gousiosg/6871500#file-teapot-php-L252">array_flatten</a>
is not an internal PHP function, I looked up an implementation on the
Internet and copied it verbatim into my code without thinking too much. So they
replaced the implementation of <code>split_triangles</code> with one that <a href="https://github.com/ElessarWebb/teapot/blob/master/teapot.php#L252">obliterates the
need for array flattening</a>.
This brought a massive performance win, on the order of 2x.</p>
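<p>The trick generalizes beyond PHP: instead of returning nested lists and repeatedly flattening them, each recursive call appends its leaves into a shared accumulator. A Python sketch (the splitting logic below is a stand-in for the real triangle subdivision):</p>

```python
def split_flatten(items, depth):
    """Naive version: builds nested lists and flattens at every level."""
    if depth == 0:
        return [items]
    mid = len(items) // 2
    nested = [split_flatten(items[:mid], depth - 1),
              split_flatten(items[mid:], depth - 1)]
    return [leaf for sub in nested for leaf in sub]   # repeated flattening

def split_accumulate(items, depth, out=None):
    """Accumulator version: no intermediate nesting, nothing to flatten."""
    if out is None:
        out = []
    if depth == 0:
        out.append(items)
        return out
    mid = len(items) // 2
    split_accumulate(items[:mid], depth - 1, out)
    split_accumulate(items[mid:], depth - 1, out)
    return out
```

<p>Both return the same chunks; the second avoids allocating and re-walking intermediate lists at every level of the recursion.</p>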
<p>After the exercise, we also had a brief discussion about what we learned
from it. Here are some remarks the students made:</p>
<ul>
<li><p>Measure before optimising: what the winners did was invest time in
understanding the program flow and attacking the hottest of hotspots first. This
strategy paid off quite well. On real software, one would use a profiler
for that job.</p></li>
<li><p>Massive wins do not stem from grabbing the low-hanging fruit, but from
changes that restructure the code's logic. Optimizing compilers, HHVM in this case, already do a good job of optimizing away small inefficiencies.</p></li>
<li><p>Trust language implementations: It is usually faster to write a non-pretty
version of an algorithm using only functions in a language's core library rather
than copy-pasting nice implementations from the Internet. Core libraries use
optimized versions of algorithms for handling common cases, while their
performance issues have been ironed out. This is especially true for
interpreted languages, where it is common to have internal function
implementations written in hand-optimized C.</p></li>
<li><p>HHVM can certainly be improved (and AFAIK it will be). For example,
type information is currently not taken into account when doing optimization
rounds. In our case, I suspect that type information would help a lot in
optimizing the numerical calculations that are at the core of
this exercise. Coming from the JVM world, I was also curious how
HHVM handles memory management and garbage collection, but I could not find
any information.</p></li>
</ul>
<p>Congratulations to the winners, a T-shirt sporting the logo of a big social
networking company will be waiting for you next week!</p>
<p>Here are my slides (some of which are actually @headinthebox's):</p>
<div style="width: 60%;margin-left:auto;margin-right:auto;">
<script async class="speakerdeck-embed" data-id="47ad2570124301316d276a7923093825" data-ratio="1.33333333333333" src="//speakerdeck.com/assets/embed.js"></script>
</div>
ICSM 2013 panel on open access2013-09-25T00:00:00+02:00http://www.gousios.gr/blog/On-open-access<p>On Sep 26 2013, I participated in the ICSM 2013 panel on open access. I gave a
presentation, and then we discussed the challenges of open access research with
<a href="http://gsyc.urjc.es/~grex/">Gregorio Robles</a> and
<a href="http://plg.uwaterloo.ca/~migod/">Mike Godfrey</a>
(and of course with the audience).</p>
<p>My presentation was entitled "A tale of two datasets", and it was about the
lessons I've learned by building two large software engineering research
datasets, namely <a href="http://www.sqo-oss.org">SQO-OSS</a> and
<a href="http://ghtorrent.org">GHTorrent</a>. Specifically, I attributed the relative lack
of external users in the first case, and the ever increasing number of users in
the second, to the different approaches the two projects followed with respect to open access.</p>
<p>The main points I tried to raise during my presentation, based on what
I've learned from building SQO-OSS and GHTorrent, were the following:</p>
<ul>
<li><p><em>Aim for lean and mean</em>:
When building an open access research tool or dataset, we should
offer the minimum viable product: the least possible
piece of functionality that makes sense and lets people create by building on
top of it. This can be data plus good documentation, or a tool that does one
thing very well, again with good documentation.</p></li>
<li><p><em>Infrastructures and platforms are overrated</em>: The effort required to learn
how an infrastructure works before actually exploiting it scientifically
should not be overlooked. The invested effort must return gains, and consequently
big effort calls for big gains. There is also always the risk of deprecation;
especially in the field of software engineering, where new developments
happen every day, that risk is very high. In my opinion, this
is why many people in research keep re-inventing the wheel instead of reusing.</p></li>
<li><p><em>Open now trumps open when it's done</em>: Most
importantly, no one is going to wait for us to perfect what we're doing. The
only thing we risk by opening up our research early is for it to be ignored;
then again, that is a sign that we should change direction. If we open up early,
and our work is interesting enough, "enough eyeballs" will catch our mistakes
early on and propose corrections that make our research even more interesting.
Open access is an absolute must for the adoption and spread of research
results, and it should happen as early as possible.</p></li>
</ul>
<p>See my slides below:</p>
<div style="width: 60%;margin-left:auto;margin-right:auto;">
<script async class="speakerdeck-embed" data-id="c878edb007410131b4117a4b8f91befb" data-ratio="1.33333" src="//speakerdeck.com/assets/embed.js"></script>
</div>
Greece's biggest problem2013-08-17T00:00:00+02:00http://www.gousios.gr/blog/Greeces-biggest-problem<p><em>English summary: I think that the biggest problem in Greece's current economic
situation is the fact that young people work for €800 gross, while their pension
contributions are used to fund pensions on the order of €1,500 net. This is highly unfair
and indicative of a generation that learned not to care about the future.</em></p>
<p>From time to time, I discuss Greece's economic problems with friends. Most
of them mention issues such as the oversized public sector, a private sector
that learned to invest without taking risks, the lack of infrastructure, bad
politicians, or a dysfunctional justice system (particularly in financial
matters). I do not disagree; in fact, I think the combination of all of these
was catalytic for the sorry state of our country. Still, I believe the worst
problem in Greece right now is the combination of high unemployment and low
wages (a consequence of the former) with provocatively high pensions,
especially in the public sector.</p>
<p>Because of my parents' age (both are civil servants), I occasionally hear
about the pensions that various acquaintances of theirs receive, and I am
outraged. Most are several times the guaranteed pension the government is
planning (€380, if I remember correctly), and some exceed the last salary their
recipients earned as employees! Many even complain about the cuts: a meter
reader at the Public Power Corporation, who later became head of personnel(!),
always on the strength of his qualifications (a junior high school diploma),
received a pension of €2,800 (or 110% of his last salary), which was then cut
to €1,800! Many of these people are under 60. They entered the public sector
through the well-known procedures of the 1980s. After the country joined the
Eurozone, they received raises that usually bore no relation to their
productivity (or to logic).</p>
<p>The usual argument for the €1,800 pensions is "we have paid for our
pensions"; this could not be further from the truth. A simple calculation
suffices to show the opposite: suppose someone pays €1,000 per month in social
security contributions (extremely unlikely, judging from the pay slips I have
seen) for 20 years. At a 5% interest rate (i.e. assuming the pension funds made
perfect investments over the last 20 years), this yields a present value of
€411,000 available to fund retirement over the following 20 years. Now assume
the interest rate drops to zero (i.e. money no longer loses value over time)
and the amount is paid out in equal installments: it corresponds to €1,710 per
month, which must also cover taxes and any other contributions. So even under
extremely idealized conditions, the numbers are overwhelmingly against this
argument. Besides, if the numbers added up, the funds would not need state
subsidies.</p>
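<p>The back-of-the-envelope calculation above is easy to reproduce; a few lines of Python, with the contribution level and interest rate assumed in the text:</p>

```python
def annuity_future_value(payment, annual_rate, years):
    """Future value of a monthly annuity with monthly compounding."""
    r = annual_rate / 12
    n = years * 12
    return payment * ((1 + r) ** n - 1) / r

fv = annuity_future_value(1000, 0.05, 20)   # capital accumulated after 20 years
monthly = fv / (20 * 12)                    # equal installments over 20 more years
```

<p>This gives roughly €411,000 of capital, or about €1,710 per month over the following 20 years, matching the figures above.</p>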
<p>A perhaps more reasonable explanation is that Greece's pension system is
based on inter-generational solidarity (and is therefore not contribution-based).
Indeed, it is. But personally I do not see why the current generation should
show solidarity towards the previous one (I do not mean at the personal level).
The opportunities the previous generation had to start their lives at the age
of 25-35 belong to the realm of fantasy today. Even so, the productive results
of the previous generation are disappointing: the crisis we are living through
today is the proof. By what logic should the generation of people at the start
of their productive years, with minimal prospects of prosperity, fund the
continued prosperity of the pensioners of the previous productive generation?
How many young people must work at the minimum wage to fund the €1,500 (or even
€800) pensions of people who already have a family, a house (or houses), a car
(or cars), and few obligations beyond enjoying the 30+ years of life left to
them? This is clearly immoral.</p>
<p>I think that social insurance is the next big problem Greece will face, and
I am afraid it will play out in the familiar way: at first the responsible
minister will ignore it; then the troika will "call attention" to it; the
minister will announce a "sustainability study of the social insurance system"
which will be slow to complete; the troika will "tie the changes to the next
tranche"; the so-called progressive forces will engage in revolutionary
posturing for a week or two without proposing any alternative; and finally the
minister will cut all pensions across the board by 20-30%. Everyone will be
content for 3-4 years, until the cycle repeats itself.</p>
<p>What do I propose, then? As extreme as it may sound, I think the fairest
system is for everyone to receive a guaranteed pension, topped up
proportionally to the contributions each person has paid into their pension
fund, spread over a horizon of 20 years from retirement. The retirement age
should be fixed at 65 years, with no exceptions (apart from heavy and hazardous
occupations). The guaranteed pension can be covered by today's state subsidies
to the pension funds. Anything else is extremely unfair to the generation that
is currently trying to find a job while paying contributions for the €1,800
pensioners.</p>
<p>Back in 2012, Stefanos Manos had calculated a guaranteed pension of €700 for everyone, based on these very amounts. But who listens to the extreme neo-liberals...</p>
A Note on Rigor and Replicability2013-07-01T00:00:00+02:00http://www.gousios.gr/blog/note-rigour-replicability<p>At ICSE 2012, <a href="http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6227200">one of the presented papers</a> caught my attention; the title
was provocative enough and the topic was very hot: functional vs imperative
programming. The paper presented a comparative study of programming a multicore
application in Java and Scala. The authors employed a group of master's
students to write a non-toy application in both languages and then compared
the results. They found no significant difference between the two languages.</p>
<p>I remember leaving the paper presentation with mixed feelings; my suspicions
grew stronger when I actually read the paper. There were several errors in the
paper with respect to the methods used and the statistical treatment of the
data. Together with my colleague <a href="https://twitter.com/louridas">Panos Louridas</a>, we wrote a paper criticizing the
methods used in the Pankratius et al. paper. Partially because only the paper's
abstract was published in the print version of ACM SIGSOFT Software Engineering
Notes, our criticism went relatively unnoticed. Since today marks the first
anniversary of the writing of that paper, I am summarizing our findings here.
You can also <a href="http://www.gousios.gr/bibliography/LG12.html">read the full version</a>.</p>
<h4>Problems we found</h4>
<ul>
<li><p>Wrong statistical tests being used or wrong naming of the statistical tests</p></li>
<li><p>Liberal interpretation of p-values. While the authors use p < 0.05 as a
threshold for significance, they later claim significance (or support) for p-values of 0.078 and even 0.094</p></li>
<li><p>Subjects were classified as experts in Scala after 4 weeks of training while
other subjects were classified as novices in Java after 4 years of university
studies.</p></li>
<li><p>The method used to identify imperative and functional parts of the code
classified as imperative an example created by Martin Odersky to showcase
the functional programming capabilities of the language.</p></li>
</ul>
<h4>How to fix them?</h4>
<p>We would have encountered none of the problems outlined above if published papers included:</p>
<ul>
<li>All measurement data</li>
<li>All interviews, questionnaires, research protocols, and other related data derived from subjects, anonymized if necessary</li>
<li>Full details on the statistical methods used.</li>
<li>Any other code that has been used in the publication’s research</li>
<li>Documentation for all of the above</li>
</ul>
<p>Conferences and journals should require authors to open up their data and
their data manipulation tools under a license that enables everybody to use
them. Sharing of data should happen in an organized way; for example, conference
organization committees could create a shared repository where researchers can
upload their data and tools along with instructions for using them. To enable full
replication, researchers should provide virtual machine images with the full
environment and data they used. Moreover, conferences and journals could describe
a formal redress procedure: should an error be found in a paper, the authors should
be required to reply to the error claim.</p>
<p>What we propose can be a best effort approach: by default, submissions should be
accompanied by datasets and tools; if these are not available due to <em>force
majeure</em>, it should be up to the editor/conference chair to decide on the
submission.</p>
<h4>Conclusions</h4>
<p>The purpose of this work was not to point fingers, but to raise the issue of the
dangers of inadequate reproducibility. We were drawn to this particular article,
and used it as an example, mostly because some of its findings contradict our own
experience; other articles in the same conference are equally opaque with
regard to replication and verification. We believe that
publication-time availability of experimental data, tools, and experiment
replication documentation should be a requirement for publication. Our proposal,
if adopted, might be a first step in this direction.</p>
Reactive programming2013-06-08T00:00:00+02:00http://www.gousios.gr/blog/rx-vanity-pullreqs<p><em>tl;dr summary: Reactive programming is a great way for consuming and combining
web APIs. C# is very cool.</em></p>
<h4>Reactive programming basics</h4>
<p>Lately, I have been (and am still being) introduced to reactive programming in general, and Rx
in particular, by <a href="https://twitter.com/headinthebox">@headinthebox</a>. Reactive programming is a set of techniques for
performing computations on <em>streams</em> of values rather than on individual values. What is
the difference? A stream represents the evolving state of a structure over
time, and is therefore, by definition, unbounded. This leads to interesting
problems that need to be solved; for example, to sum the elements of
a list we can use a <code>fold</code> or a <code>while</code> loop, but only because we
know the length of the list in advance.</p>
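<p>A small Python sketch makes the difference concrete: a fold needs the whole (finite) list, whereas on an unbounded stream we can only emit a running, scan-style sum:</p>

```python
import itertools

def numbers():
    """An unbounded stream of values; this generator never terminates."""
    n = 0
    while True:
        yield n
        n += 1

# A fold works on a list because its length is known up front.
total = sum([0, 1, 2, 3])

# On a stream we instead emit one partial sum per incoming event,
# and observe as many of them as we need.
running = itertools.accumulate(numbers())
first_five = list(itertools.islice(running, 5))
```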
<p>But how can we construct streams? This is easy: every time a program receives
and processes asynchronous events, it is basically reading a stream of events.
As most programs we write (from GUI applications, to servers, to operating
system kernels) do receive stimuli from their environment, most of us have
already done reactive programming, one way or another! The basic trickery behind
reactive programming and Rx is formulating asynchronous event streams as data
structures and then running higher order functions on them. Erik, of course,
explains all these much better in his brilliant paper <a href="http://queue.acm.org/detail.cfm?id=2169076">Your Mouse is a
Database</a>.</p>
<p>The discussion of streams (or signals, as they are known in other contexts), however, blurs
the practicality of reactive programming. In practice, it boils down to creating
<code>Observable</code> sequences of events (e.g. by polling an API or waiting on a socket)
and then assigning actions to incoming events. The interesting situations start
when we combine several streams in a time-dependent fashion: say we want
an action to occur as a result of polling two APIs and combining the responses
based on a common attribute. Luckily, the complicated interactions are hidden
quite well behind the Rx API, which offers powerful operators such as <code>Merge</code>, which merges
events from two streams into one, and <code>Buffer</code>, which collects a time window of events
before firing the event processors.</p>
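<p>Rx itself is push-based, but the semantics of these two operators can be sketched in a pull-based way; here is a hedged Python illustration of what <code>Merge</code> and a time-based <code>Buffer</code> do with timestamped events (this is not the Rx API, just its behaviour):</p>

```python
import heapq

def merge(*streams):
    """Merge several (timestamp, value) streams into one, ordered by time."""
    return heapq.merge(*streams, key=lambda ev: ev[0])

def buffer(stream, window):
    """Group events into consecutive windows of `window` time units."""
    batch, deadline = [], None
    for ts, value in stream:
        if deadline is None:
            deadline = ts + window
        if ts >= deadline:
            yield batch                      # fire the processors for this window
            batch, deadline = [], ts + window
        batch.append((ts, value))
    if batch:
        yield batch
```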
<h4>A working example</h4>
<p>To make sure I understood the basic concepts, I decided to write a program that
identifies "vanity pull requests", that is, Github pull requests whose URLs are
then tweeted by their creators. The process to retrieve those is
straightforward: for each new pull request, as exposed by the
<a href="https://api.github.com/events">Github event API</a>, I get the user that created
it and look for the pull request URL in his/her Twitter stream (which I obtain
through the <a href="http://search.twitter.com/search.json?q=@gousiosg">Twitter search
API</a>) for 10 minutes. For
simplicity, I assume that the user's Github and Twitter
accounts share the same user name and that no URL shortening happens on the Github
URL (both assumptions being relatively fishy). I had written some initial code myself
(in C# even!) and on (Catholic) Easter eve (6th Apr 2013) we sat down together with Erik to make it actually work.</p>
<h5>Creating collections out of API calls</h5>
<p>The first thing the code needs to do is to retrieve the
Github event stream and process it. Here is the code:</p>
<p><figure class="highlight"><pre><code class="language-csharp" data-lang="csharp"><span class="k">public</span> <span class="k">static</span> <span class="k">async</span> <span class="n">Task</span><span class="p"><</span><span class="n">IEnumerable</span><span class="p"><</span><span class="kt">string</span><span class="p">>></span> <span class="nf">ReadGithubAsync</span><span class="p">(</span><span class="kt">string</span> <span class="n">user</span><span class="p">,</span> <span class="kt">string</span> <span class="n">password</span><span class="p">)</span> <span class="p">{</span></p>
<pre><code><span class="kt">var</span> <span class="n">header</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">AuthenticationHeaderValue</span><span class="p">(</span>
<span class="s">"Basic"</span><span class="p">,</span>
<span class="n">Convert</span><span class="p">.</span><span class="nf">ToBase64String</span><span class="p">(</span><span class="n">System</span><span class="p">.</span><span class="n">Text</span><span class="p">.</span><span class="n">UTF8Encoding</span><span class="p">.</span><span class="n">UTF8</span><span class="p">.</span><span class="nf">GetBytes</span><span class="p">(</span><span class="kt">string</span><span class="p">.</span><span class="nf">Format</span><span class="p">(</span><span class="s">"{0}:{1}"</span><span class="p">,</span> <span class="n">user</span><span class="p">,</span> <span class="n">password</span><span class="p">)))</span>
<span class="p">);</span>
<span class="kt">var</span> <span class="n">client</span> <span class="p">=</span> <span class="k">new</span> <span class="n">HttpClient</span><span class="p">{</span>
<span class="n">BaseAddress</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">Uri</span><span class="p">(</span><span class="s">"https://api.github.com/"</span><span class="p">),</span>
<span class="n">DefaultRequestHeaders</span> <span class="p">=</span> <span class="p">{</span><span class="n">Authorization</span> <span class="p">=</span> <span class="n">header</span><span class="p">}</span>
<span class="p">};</span>
<span class="n">client</span><span class="p">.</span><span class="n">DefaultRequestHeaders</span><span class="p">.</span><span class="nf">Add</span><span class="p">(</span><span class="s">"user-agent"</span><span class="p">,</span> <span class="s">"rx-example"</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">response</span> <span class="p">=</span> <span class="k">await</span> <span class="n">client</span><span class="p">.</span><span class="nf">GetAsync</span><span class="p">(</span><span class="s">"/events"</span><span class="p">);</span>
<span class="kt">var</span> <span class="n">json</span> <span class="p">=</span> <span class="k">await</span> <span class="n">response</span><span class="p">.</span><span class="n">Content</span><span class="p">.</span><span class="nf">ReadAsStringAsync</span><span class="p">();</span>
<span class="k">return</span> <span class="p">(</span><span class="k">from</span> <span class="n">e</span> <span class="k">in</span> <span class="n">JArray</span><span class="p">.</span><span class="nf">Parse</span><span class="p">(</span><span class="n">json</span><span class="p">).</span><span class="nf">Children</span><span class="p">()</span>
<span class="k">where</span> <span class="p">(</span><span class="kt">string</span><span class="p">)</span><span class="n">e</span><span class="p">[</span><span class="s">"type"</span><span class="p">]</span> <span class="p">==</span> <span class="s">"PullRequestEvent"</span> <span class="p">&amp;&amp;</span>
<span class="p">(</span><span class="kt">string</span><span class="p">)</span><span class="n">e</span><span class="p">[</span><span class="s">"payload"</span><span class="p">][</span><span class="s">"action"</span><span class="p">]</span> <span class="p">==</span> <span class="s">"opened"</span>
<span class="k">select</span> <span class="p">(</span><span class="kt">string</span><span class="p">)</span><span class="n">e</span><span class="p">[</span><span class="s">"actor"</span><span class="p">][</span><span class="s">"login"</span><span class="p">]);</span>
</code></pre>
<p><span class="p">}</span></code></pre></figure></p>
<p>The code sets up a basic authentication header
and then creates an HTTP client to query Github with.
The method then reads the event stream, parses the returned JSON and, using
LINQ, filters for newly opened pull request events, returning
the Github user names of the pull requesters.</p>
<p>Notice that the method has been marked as <code>async</code>. When a method is <code>async</code>, the
C# compiler looks in its body for a corresponding <code>await</code> call;
it then converts the remainder of the method into a continuation function,
which is called after the awaited method returns! Yes, this is
<a href="http://en.wikipedia.org/wiki/Continuation-passing_style">continuation passing style</a> hidden by a very clever compiler trick. The
<code>ReadGithubAsync</code> method is fully event-based and asynchronous and can
perform non-blocking multi-threaded I/O, which apparently is the holy grail
of modern web frameworks. What's more, the simplicity of this approach
makes event-based frameworks such as Node.js, EventMachine and Gevent look
naive and awkward.</p>
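<p>The same compile-time rewriting exists in other languages; for readers more familiar with Python, here is an analogous sketch, where everything after an <code>await</code> is conceptually the continuation that the event loop resumes later:</p>

```python
import asyncio

async def fetch(name, delay):
    # Everything below the await is the continuation the event loop
    # invokes once the (simulated) I/O completes.
    await asyncio.sleep(delay)
    return f"{name}: done"

async def main():
    # Both "requests" are in flight concurrently; no thread blocks.
    return await asyncio.gather(fetch("github", 0.01), fetch("twitter", 0.02))

results = asyncio.run(main())
```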
<p>The method that reads the user's Twitter stream is similar but simpler: it
sets up an HTTP client for the Twitter search API, filters the returned
tweets for the string <code>github.com/.*/pulls/</code> and returns the
tweet text:</p>
<p><figure class="highlight"><pre><code class="language-csharp" data-lang="csharp"><span class="k">public</span> <span class="k">static</span> <span class="k">async</span> <span class="n">Task</span><span class="p"><</span><span class="n">IEnumerable</span><span class="p"><</span><span class="kt">string</span><span class="p">>></span> <span class="nf">ReadTwitterAsync</span><span class="p">(</span><span class="kt">string</span> <span class="n">user</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">client</span> <span class="p">=</span> <span class="k">new</span> <span class="n">HttpClient</span><span class="p">{</span> <span class="n">BaseAddress</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">Uri</span><span class="p">(</span><span class="s">"http://search.twitter.com/search.json"</span><span class="p">)};</span>
<span class="kt">var</span> <span class="n">response</span> <span class="p">=</span> <span class="k">await</span> <span class="n">client</span><span class="p">.</span><span class="nf">GetAsync</span><span class="p">(</span><span class="n">String</span><span class="p">.</span><span class="nf">Format</span><span class="p">(</span><span class="s">"?q={0}"</span><span class="p">,</span> <span class="n">user</span><span class="p">));</span>
<span class="kt">var</span> <span class="n">json</span> <span class="p">=</span> <span class="k">await</span> <span class="n">response</span><span class="p">.</span><span class="n">Content</span><span class="p">.</span><span class="nf">ReadAsStringAsync</span><span class="p">();</span></p>
<pre><code><span class="kt">var</span> <span class="n">tweets</span> <span class="p">=</span> <span class="k">from</span> <span class="n">e</span> <span class="k">in</span> <span class="n">JValue</span><span class="p">.</span><span class="nf">Parse</span><span class="p">(</span><span class="n">json</span><span class="p">)[</span><span class="s">"results"</span><span class="p">]</span>
<span class="k">where</span> <span class="p">(</span><span class="kt">string</span><span class="p">)</span><span class="n">e</span><span class="p">[</span><span class="s">"from_user"</span><span class="p">]</span> <span class="p">==</span> <span class="n">user</span> <span class="p">&amp;&amp;</span>
<span class="n">Regex</span><span class="p">.</span><span class="nf">Match</span><span class="p">((</span><span class="kt">string</span><span class="p">)</span><span class="n">e</span><span class="p">[</span><span class="s">"text"</span><span class="p">],</span> <span class="s">"github.com/.*/pulls/"</span><span class="p">).</span><span class="n">Success</span>
<span class="k">select</span> <span class="k">new</span><span class="p">{</span> <span class="n">Text</span> <span class="p">=</span> <span class="p">(</span><span class="kt">string</span><span class="p">)</span><span class="n">e</span><span class="p">[</span><span class="s">"text"</span><span class="p">],</span> <span class="n">From</span> <span class="p">=</span> <span class="p">(</span><span class="kt">string</span><span class="p">)</span><span class="n">e</span><span class="p">[</span><span class="s">"from_user"</span><span class="p">]</span> <span class="p">};</span>
<span class="k">return</span> <span class="n">tweets</span><span class="p">.</span><span class="nf">Select</span><span class="p">(</span><span class="n">x</span> <span class="p">=&gt;</span> <span class="n">x</span><span class="p">.</span><span class="n">Text</span><span class="p">);</span>
</code></pre>
<p><span class="p">}</span></code></pre></figure></p>
<h5>Timing matters</h5>
<p>What we have now is two asynchronous methods that read data from the
corresponding APIs; what we need to do is combine them. Intuitively, we <em>first</em>
need to read the Github event API <em>periodically</em> and, <em>for each</em> returned pull
requester, read his/her Twitter stream for <em>some time</em>, returning all tweets that
contain pull request URLs. The emphasized terms highlight the several time
dependencies even this very simple example has. Time dependencies can be quite tricky
to understand; what Erik and the Rx team usually do is draw <a href="http://channel9.msdn.com/Blogs/J.Van.Gogh/Reactive-Extensions-API-in-depth-marble-diagrams-select--where">marble diagrams</a>.
Marble diagrams present events as marbles (hence the name), colour-coded by a
specific event property, on an abstract timeline. A timeline can be owned by an entity; in
our case, the colouring represents the user that initiated the event. Here is a
hypothetical timeline of Github pull requests as returned by <code>ReadGithubAsync</code>:</p>
<p><a href="/files/rx-blog-pullreqs.png" rel="lightbox">
<img style="width: 70%;margin-left: auto; margin-right:auto;" src="/files/rx-blog-pullreqs.png" alt="Github marble diagram">
</a></p>
<p>We see that the timeline is not owned by any entity; we also observe (pun
intended) that a user might have more than one pull request, and that pull requests might
not come at regular intervals. To retrieve the tweets, we need to filter the
pull requester login names and feed them to our <code>ReadTwitterAsync</code> method,
effectively creating a timeline of tweet events per user, as shown below:</p>
<p><a href="/files/rx-blog-tweets.png" rel="lightbox">
<img style="width: 70%;margin-left: auto; margin-right:auto;" src="/files/rx-blog-tweets.png" alt="Github marble diagram">
</a></p>
<p>The question now is how we go from the first timeline to the second. LINQ
to the rescue (<em>warning: query for demo purposes only</em>)!</p>
<p><figure class="highlight"><pre><code class="language-csharp" data-lang="csharp"><span class="k">from</span> <span class="n">u</span> <span class="k">in</span> <span class="nf">ReadGithubAsync</span><span class="p">().</span><span class="n">Result</span>
<span class="k">from</span> <span class="n">t</span> <span class="k">in</span> <span class="nf">ReadTwitterAsync</span><span class="p">(</span><span class="n">u</span><span class="p">.</span><span class="n">User</span><span class="p">).</span><span class="n">Result</span>
<span class="k">select</span> <span class="k">new</span> <span class="p">{</span><span class="n">User</span> <span class="p">=</span> <span class="n">u</span><span class="p">.</span><span class="n">User</span><span class="p">,</span> <span class="n">Tweet</span> <span class="p">=</span> <span class="n">t</span><span class="p">};</span></code></pre></figure></p>
<h5>Making collections reactive</h5>
<p>Alright! Now we have combined the two streams, but if we call the LINQ query
presented above, we will only get a single result, which will represent a
single snapshot in marble diagram terms. How can we extend the marble diagram
timelines to include multiple snapshots? Here is where reactive programming
and Rx come into play. The first thing we have to do is convert our API
calls to streams; in Rx terms, we need to convert the <code>IEnumerables</code> returned
by our <code>Read*Async</code> methods to <code>IObservables</code>. The canonical way of doing
that is to use the static methods of the <code>Observable</code> class as follows:</p>
<p><figure class="highlight"><pre><code class="language-csharp" data-lang="csharp"><span class="k">public</span> <span class="k">static</span> <span class="n">IObservable</span><span class="p"><</span><span class="n">IEnumerable</span><span class="p"><</span><span class="n">String</span><span class="p">>></span> <span class="nf">PullRequesters</span><span class="p">(</span><span class="n">TimeSpan</span> <span class="n">interval</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">Observable</span><span class="p">.</span><span class="nf">Timer</span><span class="p">(</span><span class="n">TimeSpan</span><span class="p">.</span><span class="n">Zero</span><span class="p">,</span> <span class="n">interval</span><span class="p">).</span><span class="nf">SelectMany</span><span class="p">(</span> <span class="n"><em></span> <span class="p">=></span>
<span class="nf">ReadGithubAsync</span><span class="p">(</span><span class="s">"foo"</span><span class="p">,</span> <span class="s">"bar"</span><span class="p">)</span>
<span class="p">).</span><span class="nf">DistinctUntilChanged</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">public</span> <span class="k">static</span> <span class="n">IObservable</span><span class="p"><</span><span class="kt">string</span><span class="p">></span> <span class="nf">Tweets</span><span class="p">(</span><span class="kt">string</span> <span class="n">user</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">Observable</span><span class="p">.</span><span class="nf">Timer</span><span class="p">(</span><span class="n">TimeSpan</span><span class="p">.</span><span class="n">Zero</span><span class="p">,</span> <span class="n">TimeSpan</span><span class="p">.</span><span class="nf">FromSeconds</span><span class="p">(</span><span class="m">10</span><span class="p">))</span>
<span class="p">.</span><span class="nf">SelectMany</span><span class="p">(</span><span class="n"></em></span> <span class="p">=></span> <span class="nf">ReadTwitterAsync</span><span class="p">(</span><span class="n">user</span><span class="p">))</span>
<span class="p">.</span><span class="nf">Scan</span><span class="p">(</span><span class="k">new</span> <span class="kt">string</span><span class="p">[]{},</span> <span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="n">v</span><span class="p">)</span> <span class="p">=></span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">diff</span> <span class="p">=</span> <span class="n">v</span><span class="p">.</span><span class="nf">Except</span><span class="p">(</span><span class="n">a</span><span class="p">).</span><span class="nf">ToArray</span><span class="p">();</span>
<span class="k">if</span><span class="p">(</span><span class="n">diff</span><span class="p">.</span><span class="n">Length</span> <span class="p">==</span> <span class="m">0</span><span class="p">)</span> <span class="k">return</span> <span class="n">a</span><span class="p">;</span>
<span class="k">return</span> <span class="n">diff</span><span class="p">;</span>
<span class="p">})</span>
<span class="p">.</span><span class="nf">DistinctUntilChanged</span><span class="p">()</span>
<span class="p">.</span><span class="nf">SelectMany</span><span class="p">(</span><span class="n">ts</span> <span class="p">=></span> <span class="n">ts</span><span class="p">);</span>
<span class="p">}</span></code></pre></figure></p>
<p>Both methods set up a timer to call the corresponding backing read method every
6 (Github) and 10 (Twitter) seconds. The Twitter case is slightly more
complicated, as Twitter's search window is much longer (30 mins IIRC) than
the polling interval. This means that we will get the same tweets (events)
every time we call <code>ReadTwitterAsync</code> for a single user. We are obviously
interested in an event only when we observe a new tweet; this is what
<code>DistinctUntilChanged</code> does.</p>
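<p>Outside of Rx, the intent of that <code>Scan</code>/<code>DistinctUntilChanged</code> combination — emit only the tweets not seen in the previous poll — can be sketched in a few lines of plain Python (an illustration only; the data and function name are hypothetical):</p>

```python
def new_items_only(polls):
    """Given successive poll results (lists of tweets), yield only the
    items that were not present in the previous poll."""
    previous = []
    for poll in polls:
        diff = [t for t in poll if t not in previous]
        yield from diff          # an empty diff emits nothing
        previous = poll

# Three polls over an overlapping 30-minute search window:
polls = [["t1", "t2"], ["t1", "t2", "t3"], ["t1", "t2", "t3"]]
print(list(new_items_only(polls)))  # ['t1', 't2', 't3']
```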
<p>What we have now is two <code>Observable</code> streams; theoretically, we could just
combine them using LINQ as we did with the <code>Enumerable</code> versions above. However,
if we did that, we would lose all new users creating pull requests after the
first call to <code>PullRequesters</code>. What we need to do is come up with an updatable
version of the static Twitter marble diagram, where each new pull requester
automatically creates a new Twitter event timeline. To do so, we insert
an intermediate step between polling Github and polling Twitter for a specific user:</p>
<p><figure class="highlight"><pre><code class="language-csharp" data-lang="csharp"><span class="k">public</span> <span class="k">static</span> <span class="n">IObservable</span><span class="p"><</span><span class="n">IGroupedObservable</span><span class="p"><</span><span class="n">String</span><span class="p">,</span> <span class="n">String</span><span class="p">>></span> <span class="nf">PullRequestersInfo</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">u</span> <span class="p">=</span> <span class="k">from</span> <span class="n">users</span> <span class="k">in</span> <span class="nf">PullRequesters</span><span class="p">()</span>
<span class="k">from</span> <span class="n">user</span> <span class="k">in</span> <span class="n">users</span> <span class="k">select</span> <span class="n">user</span><span class="p">;</span></p>
<pre><code><span class="kt">var</span> <span class="n">groups</span> <span class="p">=</span> <span class="k">from</span> <span class="n">user</span> <span class="k">in</span> <span class="n">u</span>
<span class="k">group</span> <span class="n">user</span> <span class="k">by</span> <span class="n">user</span><span class="p">;</span>
<span class="k">return</span> <span class="n">groups</span><span class="p">;</span>
</code></pre>
<p><span class="p">}</span></code></pre></figure></p>
<p>Notice the method signature: it will return an <code>Observable</code> of
<code>GroupedObservable</code>s, where each group is identified by a user.
This way we both
solve the problem of users issuing multiple pull requests in the same
polling period and create an observable group (and therefore a list, if we just
get the group keys) of users to read tweets from. Neat, eh?</p>
<p>Finally, combining the two streams (groups of users, and tweets for a user) is
again a job for LINQ:</p>
<p><figure class="highlight"><pre><code class="language-csharp" data-lang="csharp"><span class="k">from</span> <span class="n">prinfo</span> <span class="k">in</span> <span class="nf">PullRequestersInfo</span><span class="p">()</span>
<span class="k">from</span> <span class="n">u</span> <span class="k">in</span> <span class="n">prinfo</span>
<span class="k">from</span> <span class="n">t</span> <span class="k">in</span> <span class="nf">Tweets</span><span class="p">(</span><span class="n">u</span><span class="p">).</span><span class="nf">TakeUntil</span><span class="p">(</span><span class="n">Observable</span><span class="p">.</span><span class="nf">Timer</span><span class="p">(</span><span class="n">TimeSpan</span><span class="p">.</span><span class="nf">FromMinutes</span><span class="p">(</span><span class="m">5</span><span class="p">)))</span>
<span class="k">select</span> <span class="k">new</span> <span class="p">{</span><span class="n">User</span> <span class="p">=</span> <span class="n">u</span><span class="p">,</span> <span class="n">Tweet</span> <span class="p">=</span> <span class="n">t</span><span class="p">};</span></code></pre></figure></p>
<p>The only difference to the static version is the fact that we stop scanning for
new user tweets after 5 minutes. To sum up, our example will poll Github every 6
seconds for new pull requesters asynchronously, create observable sequences of
pull requesters grouped by username, create observable sequences of tweets by
polling Twitter asynchronously every 10 seconds using the observable user groups
as the source of user names, and return an observable stream of user names and tweets
that refer to pull requests. The code is fully asynchronous and is in fact executed
using a thread pool for running the async tasks, so it is automatically
parallelizable.</p>
<p>All this, <a href="https://gist.github.com/gousiosg/5290793">in just 82 lines of code</a>
of unpolished code. Amazing, right?</p>
Analyzing pull requests on Github2013-05-14T00:00:00+02:00http://www.gousios.gr/blog/on-github-pull-requests<p>Pull requests as a distributed development model in general, and as implemented
by Github in particular, form a new method for collaborating on distributed
software development. The novelty lies in the decoupling of the development
effort from the decision to incorporate the results of the development in the
code base. To conduct a first evaluation of this exciting new way of distributed
software development, together with <a href="http://serg.aau.at/bin/view/MartinPinzger">Martin Pinzger</a> and <a href="http://www.st.ewi.tudelft.nl/~arie/">Arie van Deursen</a>, I
statistically analyzed data from Github projects to determine the
factors that affect the decision to merge a pull request and the time required
to do it.</p>
<h4>Approach</h4>
<p>We used the <a href="http://ghtorrent.org">GHTorrent dataset</a> in two ways:</p>
<ul>
<li>We first explored the usage of pull requests across all projects on Github</li>
<li>We then focused on a set of 100 projects (50 handpicked, 50 random) to see how
pull requests are being used in real projects. To be included in our
examination set, a project should have had more than 10 pull requests in 2012, should
include tests, and should have a committer count larger than the core team
size. The list of projects we selected can be seen in the following figure (size
signifies the number of pull requests examined per project).</li>
</ul>
<p> <a href="/files/wordcloud.png" rel="lightbox">
<img style="width: 70%;margin-left: auto; margin-right:auto;" src="/files/wordcloud.png" alt="Projects used in the pull request study">
</a></p>
<p>The project selection process left us with 37.5k pull requests. We extracted
more than 20 measurements for each one, which we then trimmed down to 12 through
cross-correlation analysis. The variables are split in 3 large categories,
namely pull request impact, project characteristics and developer
characteristics. We then analyzed the dataset statistically.</p>
<p>To evaluate the combined effect of factors on pull request acceptance and
processing speed, we resorted to machine learning: we constructed
prediction models using four well known machine learning algorithms (Random
Forests, Support Vector Machines, binary logistic regression, Naive Bayes)
and chose the one with the best prediction results to extract the factor
importance. For both questions Random Forests worked best, so we used the
algorithm's integrated importance calculation metric (Mean Decrease in Accuracy)
to determine the most important factor for each research question.</p>
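<p>Mean Decrease in Accuracy rests on a simple idea: permute one feature's values across observations and measure how much the model's accuracy drops. A model-agnostic toy sketch (the real analysis used R's Random Forest implementation; the model and data here are made up):</p>

```python
import random

def accuracy(model, rows, labels):
    return sum(model(r) == l for r, l in zip(rows, labels)) / len(rows)

def mean_decrease_accuracy(model, rows, labels, feature, rounds=20, seed=42):
    """Average accuracy drop after shuffling `feature` across rows."""
    base = accuracy(model, rows, labels)
    rng = random.Random(seed)
    drops = []
    for _ in range(rounds):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, column)]
        drops.append(base - accuracy(model, permuted, labels))
    return sum(drops) / rounds

# Toy "model": predict a merge iff the touched area was recently active.
model = lambda r: r["hotness"] > 0.5
rows = [{"hotness": h, "size": s}
        for h, s in [(0.9, 1), (0.8, 9), (0.2, 3), (0.1, 7), (0.7, 2), (0.3, 8)]]
labels = [model(r) for r in rows]  # the model is perfect on this data

print(mean_decrease_accuracy(model, rows, labels, "hotness"))  # substantial drop
print(mean_decrease_accuracy(model, rows, labels, "size"))     # 0.0: unused feature
```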
<h4>Findings</h4>
<p>Here are the main findings of this work:</p>
<ul>
<li><p>80% of pull requests are merged in less than 3 days, while 30% are merged within one hour. 70% of all pull requests are merged.</p></li>
<li><p>Training doesn't help: projects are not getting faster at pull request processing by processing more pull requests</p></li>
<li><p>Including test code does not help pull requests to be processed faster.</p></li>
<li><p>Pull requests democratize development: We found no statistical difference in how fast pull requests are being processed based on their origin (project team or community).</p></li>
<li><p>The three main factors that affect the decision to merge a pull request are: i) how active the area affected by the pull request has been recently, ii) the size of the project, and iii) the number of files changed by the pull request. It is possible to predict whether a pull request will be merged with an accuracy of 90%.</p></li>
<li><p>The three main factors that affect the time (faster or slower than the mean
time in the dataset) to merge a pull request are: i) The number of discussion comments ii) The size of the project iii) The project’s test coverage. It is possible to predict the time to merge a pull request with an accuracy of 70%.</p></li>
<li><p>Pull requests are much faster to merge than email-based patches. Projects can tune their reviewing and testing processes for faster turnover.</p></li>
<li><p>Pull requests can help involve the project community in the code review process.</p></li>
</ul>
<h4>Availability</h4>
<p>The technical report for this work (entitled An Exploratory Study of the Pull-based Software Development Model) can be found <a href="http://swerl.tudelft.nl/bin/view/Main/TechnicalReports">here</a> (<a href="http://swerl.tudelft.nl/twiki/pub/Main/TechnicalReports/TUD-SERG-2013-010.pdf">TUD-SERG-2013-010</a>). The corresponding paper is still under evaluation.</p>
<p>See <a href="/bibliography/GPD13.html">here</a> for a Bibtex record.</p>
Monitoring Github projects with GHTorrent2013-05-11T00:00:00+02:00http://www.gousios.gr/blog/ghtorrent-project-statistics<p>GHTorrent started as an effort to bring the rich data offered by the Github API
to the hands of the <a href="http://msrconf.org">Mining Software Repositories</a>
community. Recently, I have been working to make GHTorrent more accessible to
all Github users. As of version 0.7.2, <a href="http://ghtorrent.org">GHTorrent</a> can
run in standalone mode, using SQLite as its main database, thus doing away with
the complicated setup required to mirror in a distributed fashion.</p>
<p>In this blog post, I describe how to setup GHTorrent in order to retrieve all
metadata for a relatively small, but non-trivial, repository:
<a href="https://github.com/Netflix">Netflix</a>'s
<a href="https://github.com/Netflix/RxJava">RxJava</a> (incidentally, this is also one of my
<a href="https://gist.github.com/gousiosg/5264201">favourite projects</a>). Using the
data from this project, I also created a couple of plots
to give you an idea of what can be achieved with the data that GHTorrent
gathers.</p>
<h4>Setup and running</h4>
<p>GHTorrent is distributed as a Ruby gem, and runs on Ruby 1.9.3. If you have an
older version of Ruby, use <a href="http://rvm.io">RVM</a> to install 1.9.3 (<code>rvm install
1.9.3</code>) and make it default (<code>rvm use 1.9.3</code>). Installing GHTorrent is then
trivial:</p>
<pre><code>gem install sqlite3 ghtorrent
</code></pre>
<p>Normally, GHTorrent is run in parallel on many machines, so it is convenient
that its configuration is file based. For our standalone setup, this might
be an annoyance, but we need to create a <code>config.yaml</code> file for GHTorrent to
run:</p>
<p><figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">sql</span><span class="pi">:</span>
<span class="na">url</span><span class="pi">:</span> <span class="s">sqlite://github.db</span></p>
<p><span class="na">mirror</span><span class="pi">:</span>
<span class="na">persister</span><span class="pi">:</span> <span class="s">noop</span>
<span class="na">cache_mode</span><span class="pi">:</span> <span class="s">all</span>
<span class="na">username</span><span class="pi">:</span> <span class="s">github_username</span>
<span class="na">passwd</span><span class="pi">:</span> <span class="s">github_passwd</span></code></pre></figure></p>
<p>Make sure you change <code>username</code> and <code>passwd</code> to your Github credentials. The
configuration file instructs GHTorrent to create an SQLite database in the
local directory, using no persister (this is to avoid installing MongoDB)
and full caching, thus issuing each request only once. We can control more
parameters of how GHTorrent works using more <a href="https://github.com/gousiosg/github-mirror/blob/master/config.yaml.tmpl">configuration options</a>, but the above
should be enough to get us started.</p>
<p>Then, we need to run the following command:</p>
<pre><code>ght-retrieve-repo -c config.yaml Netflix RxJava
</code></pre>
<p>We see GHTorrent going through commits, pull requests, issues, watchers and other
Github API entities. Lots of debug statements will be printed on the screen.
You will see lots of duplicate work being done as well; this is the price to pay
for not installing MongoDB as an intelligent caching layer. Nevertheless, half an
hour or so later, we have in our directory an SQLite database with lots of
interesting data representing the project's lifetime.</p>
<h4>Project process monitoring</h4>
<p>The <a href="http://ghtorrent.org/relational.html">database schema</a> is relatively
complicated, but the data it stores is very rich in return.
Using this database, we can formulate complex queries that return
interesting insights into our project's development process, including
some that were not possible without Github's data. Below, I
present a couple of such queries, plotted using R and <code>ggplot2</code>.
You can find the R script <a href="https://gist.github.com/gousiosg/5563230#file-ghtorrent-project-stats-r">here</a>.</p>
<ul class="thumbnails">
<li class="span4">
<div class="thumbnail">
<a href="/files/pull-req-stats.png" rel="lightbox">
<img src="/files/pull-req-stats.png" alt="Pull request statistics per month">
</a>
<p>(1) Pull request statistics per month (<a href="https://gist.github.com/gousiosg/5563230#file-pullreqs_opened_per_month-sql">opened</a>, <a href="https://gist.github.com/gousiosg/5563230#file-pullreqs_merged_per_month-sql">closed</a>)</p>
</div>
</li>
<li class="span4">
<div class="thumbnail">
<a href="/files/issue-stats.png" rel="lightbox">
<img src="/files/issue-stats.png" alt="Source of commits">
</a>
<p>(2) Issue statistics per month (<a href="https://gist.github.com/gousiosg/5563230#file-issues_opened_per_month-sql">opened</a>,<a href="https://gist.github.com/gousiosg/5563230#file-issues_closed_per_month-sql">closed</a>)</p>
</div>
</li>
<li class="span4">
<div class="thumbnail">
<a href="/files/commit-source.png" rel="lightbox">
<img src="/files/commit-source.png" alt="Source of commits">
</a>
<p>(3) Source of commits (<a href="https://gist.github.com/gousiosg/5563230#file-commit_source-sql">query</a>). The more commits come from
pull requests, the more open the project process.</p>
</div>
</li>
<li class="span4">
<div class="thumbnail">
<a href="/files/fork-stats.png" rel="lightbox">
<img src="/files/fork-stats.png" alt="Fork statistics">
</a>
<p>(4) Forks created (<a href="https://gist.github.com/gousiosg/5563230#file-forks_created-sql">query1</a>) vs forks actually contributing commits back to the main repo (<a href="https://gist.github.com/gousiosg/5563230#file-forks_contributing-sql">query2</a>). Ideally, all forks should contribute back.
</p>
</div>
</li>
<li class="span4">
<div class="thumbnail">
<a href="/files/comments-commenters-external.png" rel="lightbox">
<img src="/files/comments-commenters-external.png" alt="Fork statistics">
</a>
<p>(5) Percentage of issue comments and commenters coming from the project community (i.e. users with no commit rights to the main repo)(<a href="https://gist.github.com/gousiosg/5563230#file-monthly_comments_commenters-sql">query</a>)
</p>
</div>
</li>
</ul>
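<p>Queries like the ones linked above have roughly this shape when run against the SQLite database. A self-contained sketch (toy schema; the real GHTorrent table and column names differ):</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pull_requests (id INTEGER, opened_at TEXT)")
db.executemany("INSERT INTO pull_requests VALUES (?, ?)",
               [(1, "2013-01-05"), (2, "2013-01-20"), (3, "2013-02-02")])

# Pull requests opened per month
rows = db.execute("""
    SELECT strftime('%Y-%m', opened_at) AS month, COUNT(*) AS opened
    FROM pull_requests
    GROUP BY month
    ORDER BY month
""").fetchall()
print(rows)  # [('2013-01', 2), ('2013-02', 1)]
```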
<p>That's it! Using a simple process, we can retrieve a very rich dataset to create
project specific reports from. What's more, the database can be updated
(make sure that <code>cache_mode: prod</code> is set in <code>config.yaml</code>), and the report generation process can be automated. More than one project can share the same database,
so organizations can have a centralized way of retrieving process information
about their projects. Researchers interested in doing work with Github's data
can use this process to try out ideas on smaller projects before moving
forward to the <a href="http://ghtorrent.org/dblite/">real dataset</a>.</p>
<p>Do you have any ideas for more useful reports? Leave a comment and I promise
to implement them!</p>
Visualizing programming language communities2013-04-22T00:00:00+02:00http://www.gousios.gr/blog/project-communities-visualization<p>For the past couple of weeks, I have been working on visualizing programming
language communities. The result can be seen here: <a href="http://ghtorrent.org/netviz/">http://ghtorrent.org/netviz/</a>. The graph visualizes language communities
for any language identified by Github, using commits as the medium to identify
project collaborations. This means that nodes represent projects (repositories)
and an edge denotes a developer that committed to both connected projects.</p>
<p>To build the visualization, I extracted all commits from base repositories
(commits on forks are excluded) from the <a href="http://ghtorrent.org/dblite/">GHTorrent
database</a>, along with information about the
project (e.g. language as identified by Github). The dataset I am using has:</p>
<ul>
<li>17896778 commits in</li>
<li>553993 repositories with code written in</li>
<li>106 languages by</li>
<li>309838 developers.</li>
</ul>
<h3>How it works</h3>
<p><a href="/files/netviz-screen.png" rel="lightbox"><img src="/files/netviz-screen.png" class="img-polaroid" align="center" width="60%"/></a></p>
<p>To start the visualisation, enter the name of your favourite programming
language in the top level search box and wait a couple of seconds. A graph
will start to move towards the center of the screen. It will usually look like
a hair ball, because there are really too many connections among the projects;
a good sign that developers are collaborating closely in the specific
language ecosystem. What you can do next is:</p>
<ul>
<li>Search for your favourite project using the side search field. Keep in mind
that only the top projects by number of external contributions are
being displayed, so the project you are looking for may not exist in the
displayed graph.</li>
<li>Zoom in or out (with the mouse wheel) and re-arrange the graph by dragging
nodes around.</li>
<li>Click on a node and see information about it along with a link to the original
repository.</li>
<li>Add a second language and see how the two language communities communicate.
In the screen shot above, you can see that many Scala and Haskell hackers share
a common interest in the Scalaz project.</li>
<li>Extend/Reduce the timeframe or even move the slider little by little to
see how each community evolved (or communities co-evolved).</li>
<li>If your computer and browser are fast enough, you can increase the number of
displayed links for a more realistic view (see the notes on graph stemming below).</li>
</ul>
<p>As the graphs are rendered and animated in real time, a fast browser and
computer are required. On my 1.8GHz MacBook Air running Mountain Lion
with Chrome 26, the threshold after which animation becomes too
slow is 5000 links and around 800 nodes. On my 4-core Xeon 2.6GHz desktop
with Chromium on Debian Wheezy, animations are relatively fluid at 7500 links
and 3000 nodes. Your mileage may vary.</p>
<h3>How it was implemented</h3>
<p>An interesting challenge of this project was how to process 18 million commits
(or 1.4GB of condensed CSV data) on request in (soft) real time. For that, I
exploited 2 properties of modern software systems: i) the abundance of RAM and ii)
the higher level data structure constructs present in modern languages. As I
had good experience with the abstractions of my general purpose language of
choice (Scala), I decided to construct an in-memory graph of the raw data,
choosing the least memory intensive datatypes for each object (e.g., Ints
instead of Doubles, references instead of Strings when those could be shared,
etc.). Then, Scala's parallel collections automatically split the workload
of processing (filtering, grouping etc.) the in-memory graph across multiple CPUs.
The full set of commits fits in about 3GB of heap, while queries can be
processed using all available processors. For typical languages, discovering
the nodes and links takes less than 500 msec.</p>
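<p>Stripped of the Scala specifics, the core of such an in-memory graph can be sketched as follows (hypothetical record layout; the real implementation uses compact Scala datatypes and parallel collections, and also supports cross-language queries — the sketch keeps only same-language edges for brevity):</p>

```python
from collections import defaultdict
from itertools import combinations

def build_graph(commits):
    """commits: (developer, project, language) tuples. Returns, per
    language, the set of project pairs that share a committer."""
    projects_by_dev = defaultdict(set)
    lang_of = {}
    for dev, project, language in commits:
        projects_by_dev[dev].add(project)
        lang_of[project] = language
    edges = defaultdict(set)
    for projects in projects_by_dev.values():
        for a, b in combinations(sorted(projects), 2):
            # an edge means one developer committed to both projects
            if lang_of[a] == lang_of[b]:
                edges[lang_of[a]].add((a, b))
    return edges

commits = [("alice", "rails", "Ruby"), ("alice", "sinatra", "Ruby"),
           ("bob", "rails", "Ruby"), ("bob", "scalaz", "Scala")]
print(build_graph(commits)["Ruby"])  # {('rails', 'sinatra')}
```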
<p>Due to the huge number of examined commits, queries on popular languages produce
equally huge graphs. For example, querying for Ruby in the default time
window will produce a graph consisting of 57889 nodes and 421312 edges.
To filter the important nodes, I implemented two methods of node ranking:</p>
<ul>
<li>node degree ranking: Nodes are ranked based on the number of connections.</li>
<li>pagerank: Nodes are ranked using the Pagerank algorithm. <em>caveat: pagerank is
meant to be used on directed graphs, while the produced graph is undirected</em>.</li>
</ul>
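<p>Pagerank itself is a short power iteration. A minimal version for illustration (not the jung implementation used by the API); for the undirected case, each edge is simply listed in both directions, in line with the caveat above:</p>

```python
def pagerank(edges, damping=0.85, iterations=50):
    """edges: (src, dst) pairs; for undirected graphs list each edge both ways."""
    nodes = {n for e in edges for n in e}
    out = {n: [d for s, d in edges if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for s in nodes:
            if out[s]:
                for d in out[s]:
                    incoming[d] += rank[s] / len(out[s])
            else:  # dangling node: spread its rank uniformly
                for d in nodes:
                    incoming[d] += rank[s] / len(nodes)
        rank = {n: (1 - damping) / len(nodes) + damping * incoming[n]
                for n in nodes}
    return rank

# A triangle a-b-c plus a pendant node d attached to c:
und = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
r = pagerank(und + [(d, s) for s, d in und])
print(max(r, key=r.get))  # 'c' -- the best-connected node ranks highest
```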
<p>After running the ranking algorithm, the graph is reconstructed by retrieving
the top 100 (configurable) nodes by their rank and all edges that contain them.
The reconstructed graph is further cleaned from nodes with a degree of one and
their edges (to eliminate visual noise). Again, this might not be enough: at this
stage our Ruby graph would still contain 3084 nodes and 62180 links. A hard
limit on edges must be enforced: to do that, the algorithm sets a target number
of links (by default, 5000) and randomly removes links from the graph, while
taking care not to remove links that would leave orphaned nodes.</p>
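<p>The stemming steps just described can be sketched as follows (a simplification using degree ranking; the names and structure are mine, not the actual implementation's):</p>

```python
import random
from collections import Counter

def stem_graph(edges, top_n=100, max_edges=5000, seed=1):
    """Keep the top_n nodes by degree and the edges among them, drop
    degree-one nodes, then randomly trim towards max_edges without
    orphaning any node."""
    degree = Counter(n for e in edges for n in e)
    top = {n for n, _ in degree.most_common(top_n)}
    kept = [e for e in edges if e[0] in top and e[1] in top]
    # drop nodes left with a single connection (visual noise)
    deg = Counter(n for e in kept for n in e)
    kept = [e for e in kept if deg[e[0]] > 1 and deg[e[1]] > 1]
    # randomly remove edges until the hard limit is met, skipping
    # removals that would leave an orphaned node
    rng = random.Random(seed)
    rng.shuffle(kept)
    deg = Counter(n for e in kept for n in e)
    excess, result = len(kept) - max_edges, []
    for a, b in kept:
        if excess > 0 and deg[a] > 1 and deg[b] > 1:
            deg[a] -= 1
            deg[b] -= 1
            excess -= 1
        else:
            result.append((a, b))
    return result

square = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
print(len(stem_graph(square, top_n=4, max_edges=4)))  # 4
```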
<p>To visualize the results, I used the amazingly powerful <a href="http://d3js.org/">d3js</a>
library. My limited (read: non-existent) Javascript skills did not allow
me to write code that I am proud of, but using <code>d3js</code> was a pleasure. Instead
of relying on iteration and mutation, <code>d3js</code> uses arrays as collections and
higher order functions that operate on those. The result is a fluid API
that makes sense, after you get used to the underlying concepts. To that
end, the examples at the <a href="https://github.com/mbostock/d3/wiki/Gallery">d3js gallery</a> help a lot. Apart from the initial data retrieval, all operations
happen on the client side. New data is retrieved only when a new language
has been added or the time window has changed.</p>
<h3>The API</h3>
<p>The API is open for use by anyone, so if you do not like this visualization, you
can build your own. The result format is JSON. The API is rooted at
<code>http://ghtorrent.org/netviz/</code>. The end points and parameters are the following:</p>
<p><code>/links</code>: Get nodes and links in the <a href="https://github.com/mbostock/d3/wiki/Force-Layout#wiki-nodes">format expected by d3js</a>.</p>
<ul>
<li><p><code>l</code>: Languages to calculate links for. Multiple languages can be specified. Required.</p></li>
<li><p><code>m</code>: Method to do node ranking. Can have the following values:</p>
<ul>
<li><code>rank</code>: the number of node connections</li>
<li><code>jung</code>: the jung library implementation of the pagerank algorithm (default)</li>
<li><code>par</code>: a parallel version of pagerank (slower than jung)</li>
</ul>
</li>
<li><code>n</code>: Number of nodes to select after ranking. Default: 100. Values greater than
1000 will be truncated to 1000</li>
<li><code>e</code>: Number of edges to return. Default: 5000</li>
<li><code>f</code>: The epoch timestamp of the earliest commit to examine. Default: 0</li>
<li><code>t</code>: The epoch timestamp of the latest commit to examine. Default: 2<sup>32</sup></li>
</ul>
<p><code>/hist</code>: Number of commits per week for the languages in the argument:</p>
<ul>
<li><code>l</code>: Languages to calculate timebins for. Multiple languages can be specified. Required.</li>
</ul>
<p><code>/langs</code>: A list of languages recognized by the system</p>
<h5>Examples</h5>
<p><code>http://ghtorrent.org/netviz/links?l=scala&l=clojure</code>: This will return the
graph for the whole lifetime of the data set, for the languages Scala and
Clojure, constrained to 5000 edges.</p>
<p><code>http://ghtorrent.org/netviz/links?l=scala&m=jung&f=1322611200&e=10000</code>: This will return the
graph for Scala, constrained to 10000 edges, ranked by Pagerank, for commits
after 30/11/2011.</p>
<h3>Availability</h3>
<p>As always, you can find the source code on Github:
<a href="https://github.com/gousiosg/ghtorrent-netviz">gousiosg/ghtorrent-netviz</a>.
Please file issues and requests using the project's <a href="https://github.com/gousiosg/ghtorrent-netviz/issues">issue tracker</a> instead of using this blog's
comments.</p>
On commit sizes and programming language expressiveness2013-03-27T00:00:00+01:00http://www.gousios.gr/blog/commits-and-programming-languages<p><a href="http://redmonk.com/dberkholz/2013/03/25/programming-languages-ranked-by-expressiveness/">Donnie Berkholz's blog
post</a>
on the expressiveness of programming languages has generated quite some online
buzz. In this post, Donnie used the Ohloh dataset to rank programming languages
by their expressiveness, as it can be approximated by the number of lines per
commit in projects written in each language. The details of what data the
author processed are not provided, apart from a <a href="http://redmonk.com/dberkholz/2013/03/25/programming-languages-ranked-by-expressiveness/#comment-843196282">minor comment</a> by the author as a response to another comment.</p>
<p>In our paper (missing reference), <a href="http://www.spinellis.gr">Diomidis</a> and I had included
a similar plot (even though for different reasons). To do it, we used data
from the <a href="http://www.ghtorrent.org">GHTorrent</a> projects, namely around 8.5
million commits. For each commit, we extracted the number of lines for
<em>changed</em> files; that is, we did not account for files being introduced to or
being removed from the project. We did the matching of lines changed to
language used on a per file basis, by guessing the language from the file
extension. The results can be seen in the following figure:</p>
<p><a href="/files/commit-sizes-boxplots.png" rel="lightbox">
<img src="/files/commit-sizes-boxplots.png" class="img-polaroid" align="center" width="60%"/></a></p>
<p>I will be careful not to make any remarks regarding the expressiveness of
programming languages based on this data. The data speaks for itself and
contradicts many observations Donnie included in his blog. I do believe,
however, that the method used to extract the data is sounder, because:</p>
<ul>
<li><p>The effect of copy-pasting code, common in ecosystems such as JavaScript
where developers commit whole libraries to their repositories, is
non-existent in the results, as file additions are discarded.</p></li>
<li><p>By relying on file-extension matching to identify the programming language,
we get far more accurate per-language results than relying on GitHub's or
Ohloh's project-level language identification.</p></li>
</ul>
New statistics language required2013-03-05T00:00:00+01:00http://www.gousios.gr/blog/new-stats-language-required<p>Lately, I have been using R heavily for my analysis of how GitHub pull requests
work (more on that in upcoming posts). It is not the first time I've used R;
the data analysis work in my PhD thesis was also done with R, and I used
R for the occasional correlation or plot in other papers. However, I had never
had to manipulate data with it and write code that would (hopefully) be reused.</p>
<p>The experience has not been great. The <a href="http://cran.r-project.org/doc/manuals/R-lang.html">R language</a>
may be fine for
statisticians (I believe; I don't know any statisticians), but from my perspective, it is a nightmare. While other languages may feel like
they were designed by a committee, the R language seems to have been designed by
no one. In my eyes, it looks as if every time someone needed a feature, they
implemented it and it stuck. The main problem is inconsistency, for
example:</p>
<ul>
<li><p>There are several ways to access a column in a dataframe. There are even
more ways to select specific rows and columns.</p></li>
<li><p>Functions do not maintain a consistent parameter order. For example, in
<code>lapply</code> the array to loop over is passed first, while in the semantically equivalent <code>Map</code> (why the capital letter?) it comes second.</p></li>
</ul>
<p>Moreover, R is extremely slow and inefficient. I have been using a sample
of 40,000 data points, moderate by any account, to build classifiers based on
the <a href="http://en.wikipedia.org/wiki/Random_forest">Random Forest</a> classification
algorithm. Memory usage could easily reach 3-4 GB (for just 5 MB of data
and in-memory data structures), forcing me to run my experiments on a server
rather than my laptop. It takes R 3 min 15 sec to read a 35 MB CSV file into
memory, after which it consumes more than 500 MB of RAM.</p>
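For anyone who wants to replicate this kind of measurement in another environment, the snippet below times Python's stdlib CSV parser on a synthetic in-memory sample of the same row count (40,000 rows); it is a sketch of the measurement technique only, not a reproduction of the actual 35 MB dataset:

```python
import csv
import io
import time

# Build a small synthetic CSV in memory (the real file was 35 MB on disk).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "lines", "language"])
for i in range(40_000):                 # matches the sample size quoted above
    writer.writerow([i, i % 500, "Ruby"])

# Time the parse; time.perf_counter gives a high-resolution wall clock.
start = time.perf_counter()
buf.seek(0)
rows = list(csv.reader(buf))
elapsed = time.perf_counter() - start

print(len(rows) - 1)                    # 40000 data rows (header excluded)
```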
<h4>What I would expect from a sane statistics language</h4>
<p>I would like a statistics language that is also a data manipulation and
exploration language. An ideal language, at least for the use cases I can think
of now would have:</p>
<ul>
<li><p>A uniform way to describe all (tabular) data. There is really no need for multiple <code>data.frame</code>-like types (lists, vectors or matrices). A single
type with support for unboxing (to enable fast matrix operations) should
be enough.</p></li>
<li><p>A few numeric data types (arbitrary-precision integers and floats), along
with data types for ordinal and categorical data. Memory-efficient, UTF-8-based
strings.</p></li>
<li><p>Manipulation of data through higher order functions. LINQ has shown
how this can be done with any data structure.</p></li>
<li><p>Optional typing. The lack of types in R and SciPy is great for
prototyping and quick experiments at the REPL, but when real data need
to be processed, types help a lot with consistency and optimization.
<a href="http://www.dartlang.org/">Dart</a> has shown that this can be
practical.</p></li>
<li><p>A compiler to machine code or to an
intermediate format able to produce efficient machine code (LLVM or JVM
bytecode). Big data cannot be processed by interpreters and
having to develop the same program twice (prototype with R, production in C++)
is suboptimal.</p></li>
<li><p>A library for interactive graphics. <code>ggplot</code> is great, but it is static. I am
sure that someone can design an editor that will allow people to explore data
graphically and then produce static descriptions of the generated plots.</p></li>
</ul>
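Two of the wishes above, optional typing and higher-order data manipulation, can already be combined today; the sketch below uses Python (not the hypothetical language) purely for illustration:

```python
from typing import NamedTuple

class Commit(NamedTuple):              # a single, uniform tabular record type
    language: str
    lines: int

commits = [Commit("Ruby", 10), Commit("Scala", 4), Commit("Ruby", 6)]

def median(xs: list) -> float:         # optional type annotations
    ys = sorted(xs)
    mid = len(ys) // 2
    return float(ys[mid]) if len(ys) % 2 else (ys[mid - 1] + ys[mid]) / 2

# LINQ-style pipeline: filter, project, aggregate with higher-order constructs
ruby_lines = [c.lines for c in commits if c.language == "Ruby"]
print(median(ruby_lines))  # 8.0
```

The annotations are ignored at runtime, so the REPL-friendly prototyping style survives, yet a checker can exploit them when the code graduates to production.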
<p>I have never used other statistics tools seriously. Closed-source ones I would not use, as
my research could not be replicated without a license. Of the open-source
ones, I would not use <a href="http://www.cs.waikato.ac.nz/ml/weka/">Weka</a>,
because I find clicking around an inferior way of
providing input to a program, especially since I would like executions to be
scripted. Therefore, my choices are trimmed down to three: R, <a href="http://www.gnu.org/software/octave/">Octave</a> and <a href="http://www.scipy.org/">SciPy</a> - <a href="http://scikit-learn.org/">SciKit</a>. Octave
does not include a robust machine learning toolkit. SciPy is more promising; after
all, it is built on a rather nice programming language, but it does not have
<code>ggplot</code>. While waiting for a new statistics language to arrive, I am going
to give SciPy a try for my next work.</p>
Issuing a passport for children born in the Netherlands2012-12-20T00:00:00+01:00http://www.gousios.gr/blog/How-to-register-your-kid<p>We recently had to issue a passport for our newborn daughter, so that we
could travel to Greece (I live in the Netherlands). The process is
particularly convoluted. Below I describe the steps one needs to take to
register a child and issue a temporary passport:</p>
<ol>
<li><p>The marriage must be registered with the civil registry of the country of
residence. In the Netherlands, and in several other countries I know of, this
is kept by the local municipality. Registering the marriage requires:
(i) An extract of the marriage certificate. It must be certified with an
apostille stamp and must be recent (< 6 months old).
(ii) In Greece, apostille stamps are issued only by the regional authorities. This
means that if the certificate was issued in Athens, the region of
Thessaly cannot certify it. Nor, of course, does a region undertake to
forward it through official channels. In practice, apostille certification
cannot be done from abroad; someone you know has to take
care of it.
(iii) A translation into the language of the country of residence, by a certified translator.
Fortunately, the Netherlands is flexible in this area. In our case,
they accepted the certificate translated into English.</p></li>
<li><p>Once the child is born, it must be registered with the Dutch civil
registry within 3 days. In Delft, where I live, this means a 7-field form and our
passports. The registry issues the international birth certificate,
which we then certify with an apostille at the Dutch court of first instance. In our case,
this meant a trip to The Hague. Fortunately, we live in Delft.</p></li>
<li><p>The stamped international birth certificate, together with photocopies of the
parents' identity cards and a recent family-status certificate from the
municipal rolls in which the parents are registered, is needed to register
the child with the Greek civil registry for residents abroad (which is located somewhere near Syntagma). Fortunately, a
relative can carry out this procedure, but only if they live in Athens.
Otherwise, a relative is obliged to make the trip.</p></li>
<li><p>After the child is registered with the civil registry, it must also be registered in the
family record of the municipal rolls in which the parents are registered (don't
ask me why).</p></li>
<li><p>With the certificate of registration in the Greek civil registry, 2 photographs and
the travel tickets in hand, we can issue a
temporary passport. This is done very easily and quickly at the consulate in
The Hague (Oranjestraat 7), but the passport is valid for only 10 days.</p></li>
<li><p>Once we finally manage to travel to Greece, we must go
to a police station to issue the official passport.</p></li>
</ol>
<p>EU --- In red tape, we trust.</p>
How I lost the war on spam2012-11-25T00:00:00+01:00http://www.gousios.gr/blog/How-I-lost-the-war-with-spam<p>In late 2009, I decided to replace the aging XWiki installation for the
<a href="http://www.sqo-oss.org">Alitheia Core</a> web page with Drupal. I always had a
very good impression of Drupal, so I decided to go for it instead of its main
competitor, Wordpress. The initial impression was quite good, with several of the desired
features (comments, accounts, etc.) already in the default installation,
while plug-ins enabled more advanced functionality (code highlighting, citation
management). One particular plug-in that I installed was
<a href="http://mollom.com/">Mollom</a>, which uses machine learning to
automatically filter out spam messages.</p>
<p>During those 3 years, the installation was kept relatively secure, but required
constant updates. The updates were not exactly of the click-a-button-and-wait
type either, as most of them required modifications to the database schema. Being,
perhaps, more pedantic than I should be, I always backed up the Drupal
database and applied updates one by one, which cost me quite some time.</p>
<p>Last week, I noticed that the number of visits to the site had increased sharply
during the last few months. While my first reaction was along the lines of "we
must be doing something right!", further examination revealed that I was
definitely doing something wrong:</p>
<ol>
<li>I had been running the site with comments enabled for authenticated users, to
encourage user communication and participation. Registered users did not
require any administrative action to enable their accounts and post comments.</li>
<li>From the HTTP logs, I noticed several new users had been created and they
appeared in Drupal's administration console. As the user registration page included a relatively strong CAPTCHA test, courtesy of Mollom, I wondered why this happened, so I visited it. There was no CAPTCHA at all in place!</li>
<li>Then I remembered that a couple of months earlier I had updated Mollom, so I
visited its administration console page. The page listed the correct settings. When I went back, I saw that Drupal reported a required database
upgrade, which I had never seen before.
I suspect that visiting the Mollom page triggered the notification.</li>
<li>I checked the comment sections of several web pages. No filtering was active,
and several thousand spam comments had been created.</li>
<li>I realized that this war was lost. I've disabled comments and user accounts altogether.</li>
</ol>
<p>For me, the moral of the story is that for such simple, mostly static sites, CMSs
add more burden than convenience. I don't think most research sites need
anything more than a few easily updatable web pages. Therefore, from now on, I
will use exclusively the same tool I use for creating this blog and website:
<a href="https://github.com/mojombo/jekyll">Jekyll</a>.</p>
Report from the Laser 2012 summer school2012-09-05T00:00:00+02:00http://www.gousios.gr/blog/Report-from-Laser-Summer-School<p>In Sep 2012, I had the chance to participate in the Laser summer school,
organised every year on the beautiful island of Elba, Italy, by Bertrand Meyer
and his research group at ETH Zurich. Its topic varies every year, but, as
expected, rotates around software engineering. This year, the topic was
programming languages; as I wrote on Twitter, the line-up consisted
(with a couple of omissions) of what I consider the current dream team of
programming language design: Martin Odersky (Scala), Simon Peyton Jones
(Haskell), Erik Meijer (C#, Haskell in a previous life), Andrei Alexandrescu
(C++, D), Guido van Rossum (Python), Roberto Ierusalimschy (Lua) and, of course,
Bertrand himself (Eiffel). They promised an exciting week, and they more than
delivered! What follows is a post-mortem account of my experience at
Laser. Not all speakers are represented equally, as I am finishing this
3 months after the actual event.</p>
<p>Erik's talk was about monads, co-monads and their application to real-life
programs. While hard-to-understand theoretical concepts, monads can
encapsulate state and be composed into chains of calls. Much modern functionality
in C# has its theoretical underpinnings in monads (LINQ) and co-monads (tasks,
async). Using the latter, he presented an example of converting continuation-passing-style programming (in my opinion, a really horrible paradigm) into a series
of composable asynchronous tasks. Erik's colorful style was reflected in his
presentations too; apparently, blenders are monads while gumball machines are
co-monads. Also, C#'s BigIntegers are bigger than your favourite language's ones.</p>
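The "compose into chains of calls" idea is easier to see in code than in prose. This was not one of Erik's examples; it is a minimal Maybe-style bind in Python, sketched here to show how chained steps short-circuit without nested error checks:

```python
class Maybe:
    """Minimal Maybe monad: wraps a value that may be absent (None)."""
    def __init__(self, value):
        self.value = value

    def bind(self, f):
        # Short-circuit: once a step yields nothing, later steps are skipped.
        return self if self.value is None else f(self.value)

def parse_int(s):
    try:
        return Maybe(int(s))
    except ValueError:
        return Maybe(None)

def reciprocal(n):
    return Maybe(None) if n == 0 else Maybe(1.0 / n)

# Chained calls compose; failures propagate without explicit checks.
print(Maybe("4").bind(parse_int).bind(reciprocal).value)     # 0.25
print(Maybe("oops").bind(parse_int).bind(reciprocal).value)  # None
```

Each `bind` either feeds the wrapped value forward or passes the failure through, which is exactly the chaining discipline LINQ and async tasks build on.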
<p>Andrei showed everyone what programming in the trenches can be about.
Already a renowned expert in C++ (his brilliant book Modern C++ Design should be
part of every good programmer's library), he started working on the D language
some years ago, with the aim of creating a better C++. His lecture started with a
motivating example of how expertly done string-representation optimization
saved thousands of servers (he mentioned 10k) for his employer (Facebook). He
then proceeded to praise the standard template library (STL), providing
motivating examples of how the C++ compiler and template system can optimize
away the language's abstraction mechanisms. The pinnacle of his talk was,
however, the D programming language. D is a modern systems language with
several interesting properties: advanced type-based computations at compile
time, support for compiled DSLs (see the regexp library), higher-order
functions on collections, STL-style container libraries, thread-local
variables by default, message-passing inter-thread communication, garbage
collection, provably pure functions, smart-pointer-based arrays and, finally, a
nifty context-aware error handling mechanism. To me, this was the seminar's
revelation, and D is definitely the next language I will be learning.</p>
<p>Martin's theme was, as expected, Scala. He went through what makes Scala an
interesting language to learn, and presented real coding examples (in my
opinion, the essence of any hacker talk worth its money). Simon's topic was the
advanced type system in Haskell and all the cool tricks you can do with it. He
covered type classes, generalized algebraic datatypes and several other pieces of
trickery that make Haskell an interesting language to study, but a very
hard one to use for practical applications.</p>
<p>Guido and Roberto walked us through the development of two seemingly similar but
practically quite different programming languages. Python stemmed from Guido's
work on ABC, an earlier, unsuccessful language that taught Guido and his CWI
colleagues several important lessons. Similarly, Lua stemmed from efforts to
script embedded systems. Both being dynamically typed, they offer
significant productivity enhancements in their domains: Python as the new-generation
glue that keeps the internet together (the first one was Perl) and
Lua as an embedded scripting environment for games (WoW, anyone?) and other
applications (e.g. a new version of LaTeX).</p>
<p>Participating in an event with such accomplished participants turned out to be
an overwhelming experience. To be honest, given the location, timing and summer
mood, I was expecting hour-long philosophical discussions on the beach. It was
nothing like that. The program was a strict 8 hours of lectures per day, with lots
of informal discussion on technical matters and little time to relax. Formal
dinners were followed by pool-side "therapy sessions", where you got the chance
to drink Bertrand Meyer's reserve grappa, paid for by Erik Meijer.</p>
<p>Next year, the topic is going to be software engineering for cloud computing.
I would wholeheartedly recommend it to anyone.</p>
On the importance of tools in software engineering research2012-06-03T00:00:00+02:00http://www.gousios.gr/blog/On-Tools-Soft-Eng-Research<p>On the first weekend of June, I was at the Mining Software Repositories (MSR)
2012 conference. For those not familiar with MSR, it is a venue where software
engineering meets information extraction and data mining. Researchers present
the tools and methods that they applied on software repositories (source code
repositories, but also bug databases, mailing lists and wikis) to understand
how software is written and how its quality is affected by certain events in
the project's history. Due to its wide scope, MSR is always a bit unbalanced
with respect to the quality of the papers presented. This year, however, there
were some really great submissions.</p>
<p>One of the most interesting talks was Dongmei Zhang's keynote address on the
first day. Dongmei is a senior researcher at Microsoft Research Asia, where she
leads the development analytics project. During her presentation, she told some
great tales from the research-vs-practice battlefield. One of them concerned a
code-clone detection tool that successfully graduated from Microsoft
Research to internal Microsoft teams and, finally, to a Visual Studio 2012
plug-in. Dongmei explained that the most important reason this tool was
successful was not the research upon which it was based, but the fact that
it was a TOOL. Imperfect in the beginning, its speed and accuracy were improved
after suggestions from users started pouring in. What she learned from this
experience was that producing reusable tools out of the research was more
important than the research itself. 'Make tools. It works on my computer
is no longer enough', as she put it.</p>
<p>I was curious whether the above applied to the papers presented the very
same day (and the next) at the very same conference where Dongmei gave the
keynote talk. To find out, I went through each paper and looked for pointers to
the tools or datasets used. I also Googled the paper titles, hoping that the
authors had put together a page containing the paper's data or tools, as
is often the case.</p>
<p>The following table summarizes what I have found:</p>
<table class="table table-striped">
<thead>
<tr><td>Paper </td><td> Data </td><td> Tools </td><td> Documentation</td><td>Comment</td></tr>
</thead>
<tbody>
<tr><td>Towards Improving Bug Tracking Systems with Game Mechanisms </td><td> Partial </td><td> No </td><td> No</td><td></td></tr>
<tr><td>GHTorrent: Github's Data from a Firehose </td><td> Yes </td><td> Yes </td><td> Partial</td><td></td></tr>
<tr><td>MIC Check: A Correlation Tactic for ESE Data </td><td> No </td><td> No </td><td> No</td><td></td></tr>
<tr><td>An Empirical Study of Supplementary Bug Fixes </td><td> No </td><td> No </td><td> No</td><td></td></tr>
<tr><td>Incorporating Version Histories in Information Retrieval Based Bug Localization </td><td> Yes </td><td> No </td><td> Yes </td><td> Uses existing documented dataset</td></tr>
<tr><td>Think Locally, Act Globally: Improving Defect and Effort Prediction Models </td><td> No </td><td> No </td><td> No </td><td> Promise to upload data</td></tr>
<tr><td>Green Mining: A Methodology of Relating Software Change to Power Consumption </td><td> No </td><td> No </td><td> No </td><td> Best paper award</td></tr>
<tr><td>Analysis of Customer Satisfaction Survey Data </td><td> No </td><td> No </td><td> No </td><td> Not based on open data</td></tr>
<tr><td>Mining Usage Data and Development Artifacts </td><td> No </td><td> No </td><td> No </td><td> </td></tr>
<tr><td>Why Do Software Packages Conflict? </td><td> No </td><td> No </td><td> No </td><td> Original data in Debian repository</td></tr>
<tr><td>Discovering Complete API Rules with Mutation Testing </td><td> Yes </td><td> Yes </td><td> Yes </td><td> Not open source</td></tr>
<tr><td>Inferring Semantically Related Words from Software Context </td><td> No </td><td> No </td><td> No </td><td></td></tr>
<tr><td>Do Faster Releases Improve Software Quality? An Empirical Case Study of Mozilla Firefox </td><td> No </td><td> No </td><td> No </td><td> </td></tr>
<tr><td>Explaining Software Defects Using Topic Models </td><td> No </td><td> No </td><td> No </td><td> </td></tr>
<tr><td>A Qualitative Study on Performance Bugs </td><td> No </td><td> No </td><td> No </td><td></td></tr>
<tr><td>Can We Predict Types of Code Changes? An Empirical Analysis </td><td> No </td><td> Yes (most) </td><td> No </td><td></td></tr>
<tr><td>An Empirical Investigation of Changes in Some Software Properties Over Time </td><td> Yes </td><td> No </td><td> Yes </td><td> Uses existing dataset</td></tr>
<tr><td>Who? Where? What? Examining Distributed Development in Two Large Open Source Projects </td><td> Yes(partially) </td><td> No </td><td> No </td><td> Paper mentions that data is on the PROMISE dataset, could not be retrieved at the date of the conference.</td></tr>
</tbody>
</table>
<p>As you can see, the results are not particularly encouraging. In one of the
most prominent empirical software engineering conferences, only two out of
18 papers provide really reusable tools (I have not investigated
the degree of reusability).</p>
<p>In my opinion, what applies in practice should also apply in research. As
researchers, we are often hesitant to provide reusable tools. Many times, this
is because going the extra mile to convert our 'works on my
computer' scripts into tools is very time-consuming and lacks any direct
scientific value (i.e. it does not lead to papers). Some of us might even be
afraid of competing teams: if a tool is published, others might
find flaws in our research, or a more resourceful team might leap ahead of
us using our effort.</p>
<p>Publishing a tool along with a paper has several advantages to research as a
whole:</p>
<ul>
<li>It enables research to become repeatable, facilitating both horizontal (more
hypotheses) and vertical (more data) scaling of research efforts.</li>
<li>It enables research to become reproducible, leading to more credible results.</li>
<li>It enables people to become creative with someone else's effort. This is
precisely the reason that made open source software successful, and it applies
to research tools too (see, for example, LLVM or JikesRVM).</li>
</ul>
<p>I believe that publishing reusable tools (plus data and documentation) should
be a prerequisite for publishing papers, especially so in empirical venues.
I hope that efforts such as the
<a href="http://sequoia.cs.byu.edu/reser2013">RESER</a> workshop
will raise awareness of the importance of tools in software engineering research.</p>
<p>Why do you think that people are not investing time to create tools?</p>
The Efficiency of Java and C++ revised2012-05-14T00:00:00+02:00http://www.gousios.gr/blog/Efficiency-Java-CPP<p>During my first months as a PhD student (back in Feb 2005) at AUEB, I had
frequent technical arguments with my then supervisor, <a href="http://www.spinellis.gr">Diomidis
Spinellis</a>, regarding the execution speed of various
programming languages. Diomidis's argument was that the design of the Java
language and the JVM was inherently less efficient than that of natively
compiled languages, as they force upon us services that we may not want to use.
The prominent example was garbage collection. To prove his point, he set up a
simple experiment, which he documented in this <a href="http://www.spinellis.gr/blog/20050210/index.html">blog
post</a>. In the experiment,
he created random integers which he stored in an always-sorted container
(<a href="http://docs.oracle.com/javase/7/docs/api/java/util/TreeSet.html">TreeSet</a>
in Java, STL <a href="http://www.cplusplus.com/reference/stl/set/">set</a> in C++).</p>
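For readers who want to replay the shape of the experiment without digging up the original sources, here is a Python rendition. It is a sketch only: `bisect.insort` on a Python list costs O(n) per insert, unlike the O(log n) inserts of the balanced trees used in the Java and C++ versions, so the absolute numbers are not comparable:

```python
import bisect
import random
import time

def sorted_insert(n: int) -> list:
    """Insert n random integers into an always-kept-sorted container,
    mirroring the TreeSet / std::set experiment's shape."""
    random.seed(42)                 # deterministic runs for repeatability
    xs = []
    for _ in range(n):
        bisect.insort(xs, random.randrange(1_000_000))
    return xs

start = time.perf_counter()
xs = sorted_insert(10_000)
print(f"{time.perf_counter() - start:.3f}s for {len(xs)} inserts")
```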
<p>Back then, the result came as a surprise to me. Despite my best efforts (I
did try a lot of VM options), I could not manage to bring the Java
implementation's performance anywhere near C++'s. During the 7 years in
between, I occasionally ran the code on any new system I had at hand,
and the result was, give or take, the same.</p>
<p>Today, I decided to rerun the experiment. I compiled the C++ code with Clang 3
with all optimisations enabled (-O3 -march=corei7) and used the 1.6.0_31
version of the JVM to run the Java code. The result came to me as a surprise:</p>
<p><figure class="highlight"><pre><code class="language-bash" data-lang="bash"><span class="nv">$ </span>clang++ <span class="nt">-O3</span> <span class="nt">-march</span><span class="o">=</span>corei7 sort.cpp
<span class="nv">$ </span><span class="nb">time</span> ./a.out
<span class="o">[</span>...]
real 0m1.063s
user 0m1.026s
sys 0m0.035s
<span class="nv">$ </span>javac SortInt.java
<span class="nv">$ </span><span class="nb">time </span>java SortInt
<span class="o">[</span>...]
real 0m1.102s
user 0m2.325s
sys 0m0.137s
<span class="nv">$ </span><span class="nb">time </span>java <span class="nt">-server</span> SortInt
<span class="o">[</span>...]
real 0m0.866s
user 0m1.068s
sys 0m0.071s</code></pre></figure></p>
<p>The Java version was faster than C++! Why did this happen? It turns out
that during those 7 years JVM performance engineers did not sit idle:</p>
<ul>
<li><p>Escape analysis <a href="http://weblogs.java.net/blog/forax/archive/2009/10/06/jdk7-do-escape-analysis-default">is turned on by
default</a>
after version 17 of the server Hotspot compiler (I was running version 20.6).
It allows the compiler to analyse whether objects escape the context of a
method and, if not, to remove locks and allocate them on the stack.
Consequently, this reduces the load on the garbage collector. The difference
escape analysis presumably makes can be seen by running the JVM in verbose
GC mode using the <code>-verbose:gc -XX:+PrintGCDetails</code> options. In the case of
the server VM, only one minor collection is required!</p></li>
<li><p>There were significant improvements in the garbage collector as well.
The young-generation collector is now parallel by default, taking
advantage of the multiple CPUs on modern machines. The full heap
collector is also parallel by default in all collection phases, leading
to shorter pauses.</p></li>
<li><p>The compiler supposedly performs significantly better register allocation on
machines with many registers.</p></li>
</ul>
<p>The Java platform is a prime example of research being put into
practice. All the features described above (and others: lock coarsening, biased
locking, thread-local heaps) were published as papers at major software
conferences (OOPSLA, PLDI etc). What is perhaps more interesting is
that it is not just Java that benefits from the work being done on the JVM;
Scala and Ruby, the two other languages mostly associated with the JVM, benefit
too. For example, in Java 1.7, the new <code>invokedynamic</code> opcode allows JRuby to
optimize <a href="http://www.drdobbs.com/jvm/231500287">various dynamic execution
aspects</a> and delivers <a href="http://blog.jruby.org/2011/12/getting_started_with_jruby_and_java_7/">significant
performance
improvements</a>. In all, Java as a platform does seem like a healthy
development target; I am not so sure about Java as a language.</p>
<p>To sum up: is Java inherently slower? Yes, it is, but hard optimization work
has lifted several performance hurdles. Does it matter? In some problem
domains, it does; I would never think of writing big-data processing code
in Java, except if distribution could lead to significant speedups. Most of the problems I am trying to solve are better expressed in Ruby and Scala. I am happy as long as those two languages offer 80% of the performance of Java.</p>