Statistical tests are a core part of data analysis. We use them to calculate correlations between two variables, to compare the means and medians of groups, to evaluate the magnitude of effects on populations, and so on.

While Spark has a rather nice machine learning library, statistical tests (apart from two) are mysteriously missing from its API. The purpose of this assignment is to remedy this situation by implementing high-quality statistical tests as extensions to Spark’s MLlib.

This assignment is optional and counts for an extra point toward the end grade.

Potential tests to implement

Choose one of the following tests:

Generating a ground truth for testing

An important aspect of implementing a statistical test correctly is showing that it actually computes what it is supposed to compute. As you might expect, this is not the first time the tests above have been implemented. R is the de facto standard for statistical computing, and all of the tests above have an equivalent implementation in R (or in one of the packages in the R ecosystem). Python also offers a large number of statistical tests through the scipy.stats package.

To test your implementation, you will need to generate random test data, apply an existing implementation of the statistical test in another language, and collect the results. Then save both the random data and the results, and use them as a ground truth for testing your own implementation.
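For example, here is a minimal sketch in Python, assuming the selected test is the two-sample Mann-Whitney U test (the variable names and the seed are illustrative; substitute whichever test you actually chose):

```python
import numpy as np
from scipy import stats

# Reproducible random input for a two-sample test.
rng = np.random.default_rng(seed=42)
x = rng.normal(loc=0.0, scale=1.0, size=32)
y = rng.normal(loc=0.5, scale=1.0, size=32)

# scipy returns both the test statistic and the p-value; record both, so the
# Scala implementation can later be checked against either one.
statistic, p_value = stats.mannwhitneyu(x, y, alternative="two-sided")
print(statistic, p_value)
```

The same test is available in R as wilcox.test; either language works, as long as you store both the test statistic and the p-value.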

T (2 points) Generate a series of test cases using the method described above. The input sizes should increase in powers of two, starting from \(2^5\): the first dataset has 32 items (per input variable, in the case of a two-sample test), the second 64, and the final one 32768.
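One possible way to generate and store all of these cases is sketched below; the Mann-Whitney U choice, the CSV layout, the file names, and the seed are assumptions for illustration, not part of the assignment:

```python
import csv

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Dataset sizes double at every step: 2^5 = 32 up to 2^15 = 32768.
for exponent in range(5, 16):
    n = 2 ** exponent
    x = rng.normal(loc=0.0, scale=1.0, size=n)
    y = rng.normal(loc=0.5, scale=1.0, size=n)

    # Reference result from scipy; swap in the test you actually selected.
    statistic, p_value = stats.mannwhitneyu(x, y, alternative="two-sided")

    # Store the raw input, so the exact same data can be fed to the Scala code...
    with open(f"testcase_{n}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["x", "y"])
        writer.writerows(zip(x, y))

    # ...and the expected output next to it, for the unit tests to read.
    with open(f"expected_{n}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["statistic", "p_value"])
        writer.writerow([statistic, p_value])
```

Committing these files alongside your Scala tests keeps the ground truth fixed, so the unit tests do not depend on R or Python being installed.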

Reporting the results

T (2 points) Implement the selected statistical test in Scala as part of the Apache Spark repository, and integrate it into the build process. Create appropriate tests.

T (5 points) Run your test against the test cases generated above. The reported results should be within 1% of the results R or Python produce. For each test case where the difference is more than 1%, you lose 0.5 points!
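The comparison itself belongs in your Scala test suite, but as a sketch of the intended check, the 1% criterion can be read as a relative difference against the reference value:

```python
def within_tolerance(spark_value: float, reference_value: float, tol: float = 0.01) -> bool:
    # Relative difference: the Spark result may deviate by at most 1% of the reference.
    return abs(spark_value - reference_value) <= tol * abs(reference_value)
```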

T (1 point) Run your statistical test with 10x as many data points as your largest test case (327680). Does it still finish within a reasonable time?

T (1 point, optional) Follow the Apache Spark contribution guidelines and create a pull request with your proposed changes. If the pull request gets accepted, you will receive the full 10 points for this assignment!

Bibliography