Searching

A fundamental area in computer science is the development of algorithmic processes for storing and retrieving data efficiently. Nowadays, data is produced so fast that huge volumes accumulate in very little time; Internet of Things (IoT) devices alone were anticipated to generate 400 zettabytes of data in 2018. Performing analytics or science on such large datasets poses many performance problems. Two important building blocks for solving such problems are sorting and searching algorithms. The previous lectures introduced several classes of data structures, such as lists, trees and graphs, that map a given dataset onto the organization of that data structure. At a functional level the data is the same, but working with a particular data structure, for example performing an insert or finding an element, comes with its own performance trade-offs. This can be important to consider depending on the character of the dataset and its application constraints.

The goal of this assignment is to separate the internals of a data structure from its operations in order to understand how the overall performance differs between a sorted and an unsorted data structure. The two performance aspects that you will investigate are time complexity and space complexity. The data structure you will work with is a conventional 2D list structure, also known as a matrix. The concepts and performance measurements discussed in this assignment are applicable to other data structures as well, such as graphs.

In this assignment, you will work with an open public dataset from Brandweer Amsterdam-Amstelland that contains over 125,000 fire alarms reported in this region. The dataset is provided as a compressed zip file, and the extracted CSV file is ~124MB. While there are handy libraries such as pandas that can read a CSV file into a table, we will restrict ourselves to the core libraries that are part of Python. To read a CSV file into a 2D array, we do the following:

In [ ]:
import csv #NB: don't forget to include the csv library

def load(csv_file):
    """
        Loads the data from the file in the provided argument into a 2D array
    """
    with open(csv_file) as csvDataFile:
        return [row for row in csv.reader(csvDataFile, delimiter=';')]
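
As a quick sanity check, the loader can be used as follows. This is a minimal usage sketch; the filename incidents.csv is a placeholder for whatever your extracted CSV file is called:

In [ ]:
# Minimal usage sketch; 'incidents.csv' is a placeholder for the extracted CSV file
table = load('incidents.csv')
print(len(table))    # number of rows, including the schema row
print(table[0][:3])  # first three column names from the schema row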

If we inspect the first element of the resulting array, we obtain information about the structure of the dataset, shown below. Subsequent elements of the array follow this format, where the first element in a row is the incident_id and the last element is the gemeente:

[['incident_id',
  'start_tijd',
  'incident_type',
  'landelijke_meldingsclassificatie_niveau1',
  'landelijke_meldingsclassificatie_niveau2',
  'landelijke_meldingsclassificatie_niveau3',
  'datum',
  'jaar',
  'maand_nr',
  'maand_naam',
  'dag_nr',
  'dag_naam',
  'week_nr',
  'kwartaal',
  'prioriteit',
  'uur',
  'dagdeel',
  'objecttype',
  'objectfunctie',
  'buurt',
  'wijk',
  'woonplaats',
  'gemeente'],
  ......
  ]

Ordering the dataset

We can observe by inspecting the first 5 rows of the 2D array that the data is unordered: the incident_id, which uniquely identifies each fire alarm event, is out of sync with the creation timestamp of each event.

T1 (20 points): Your first task is to implement the insertion sort algorithm to sort the dataset according to the incident_id and the creation date. Notice that the algorithm is slow and will not scale; therefore, only a subset of the dataset will be used. In the function definition below, the table argument represents the 2D array, f is the compare function, and limit describes the size of the subset (NB: the offset is always the first element).

In [ ]:
def insertionSort(table, f, limit=150):
    """
        Takes a data table, takes a subset of limit elements (150 by default) and sorts it
        NB: first value of the array is a schema of the data
    """
    pass
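
If the mechanics of insertion sort are unclear, the following generic sketch on a plain Python list illustrates the idea: each element is shifted to the left until it reaches its sorted position. This is an illustration only; your insertionSort must operate on table rows and order them through the compare function f.

In [ ]:
# Generic insertion sort on a plain list, for illustration only;
# your insertionSort must work on table rows and use the compare function f
def insertionSortPlain(lst):
    for i in range(1, len(lst)):
        current = lst[i]
        j = i
        # shift larger elements one position to the right
        while j > 0 and lst[j-1] > current:
            lst[j] = lst[j-1]
            j -= 1
        lst[j] = current
    return lst

insertionSortPlain([5, 2, 4, 1, 3])  # [1, 2, 3, 4, 5]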

Use insertionSort to sort the dataset on incident id and print the first 20 ids (use the cmpf function to do this).

The compare function f that is provided in the argument list is used for comparing two elements; it is the piece of information the sorting algorithm needs to know in which order to place elements. The incident_id should be sorted as an integer, but in the 2D array it is a string. Therefore we need to convert it to an integer, which also establishes the natural order of this type. The index 0 refers to the incident_id column:

In [ ]:
def cmpf(rowA , rowB):
    return int(rowA[0]) > int(rowB[0])
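
For example, applied to two hypothetical, truncated rows, cmpf returns True when the first incident_id is the larger integer, even though a plain string comparison would say otherwise:

In [ ]:
# Hypothetical, truncated rows; only index 0 matters for cmpf
cmpf(['112'], ['15'])   # True: 112 > 15 as integers
'112' > '15'            # False: strings compare character by character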

Similarly, this has to be done for the date comparison:

In [ ]:
import datetime

def cmpfDate(rowA, rowB):
    # index 6 = datum, e.g. 2008/01/03 00:00:00.000000000; strip the fractional seconds
    return (datetime.datetime.strptime(rowA[6][:-10], '%Y/%m/%d %H:%M:%S') >
            datetime.datetime.strptime(rowB[6][:-10], '%Y/%m/%d %H:%M:%S'))

After implementing the algorithm, try out the following limit values:

  • limit=50
  • limit=100
  • limit=150
  • limit=200
  • limit=250

Run the algorithm for each of these limit values and measure how long each of them takes.
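
A minimal timing sketch using time.perf_counter from Python's standard library is shown below; timeit would work as well:

In [ ]:
import time

# Minimal timing sketch; if your insertionSort sorts in place, pass a fresh
# copy (e.g. load the table again) so earlier runs do not skew later ones
for limit in [50, 100, 150, 200, 250]:
    start = time.perf_counter()
    insertionSort(table, cmpf, limit=limit)
    elapsed = time.perf_counter() - start
    print("limit =", limit, "took", round(elapsed, 4), "seconds")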

Notice that the time increases with the size of the data. An improvement to the algorithm is to treat this problem with a divide-and-conquer approach that splits the dataset into sub-lists. This improved version is called shell sort.
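
To see how the divide step works, the sketch below prints the gapped sublists of a small example list: with a gap of 3, the elements at indices 0, 3, 6, ... form one sublist, those at indices 1, 4, 7, ... the next, and so on. Shell sort insertion-sorts each of these sublists independently before shrinking the gap.

In [ ]:
# Illustration of how a gap partitions a list into interleaved sublists
example = [54, 26, 93, 17, 77, 31, 44, 55, 20]
gap = 3
for start in range(gap):
    print("sublist starting at", start, ":", example[start::gap])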

T2 (10 points): Your task is to modify your implementation of the insertion sort algorithm to work within the shell sort algorithm. The modified version should be called gapInsertionSort and use the arguments specified below. You have to work out how startposition and sublistcount are used to make this work.

In [ ]:
def shellSort(table, f, limit):
    # NB: skip the schema row; the slice copies the rows into subtable
    subtable = table[1:limit+1]
    sublistcount = len(subtable)//2
    while sublistcount > 0:
        for startposition in range(sublistcount):
            #implement the function call below
            gapInsertionSort(subtable,f,startposition,sublistcount)

        print("After increments of size",sublistcount,
              "The list is",subtable)
        sublistcount = sublistcount // 2
    return subtable

In [ ]:
def gapInsertionSort(table, f, startposition, sublistcount):
    pass

Use shellSort to sort the first 300 entries of the table on date and print the first 10 incident ids.

This algorithm is faster than insertion sort, but it is still not fast enough to sort all 125,000 records, so we need an algorithm with a lower time complexity. There are two other algorithms with an O(n log n) average time complexity, but their space complexity varies. For merge sort, the space complexity is O(n), which means linear growth of the space. Quick sort, on the other hand, sorts in place (using little more than the recursion stack), but its performance can degrade to O(n^2) depending on the data and the selection of the pivot, for instance when the dataset is already ordered or all elements are equal. This is not the case for the dataset that we are using.

T3 (30 points): Your task is to implement the quick sort algorithm. The stubs for the algorithm are presented below, and the challenge is choosing the pivot such that all the records can be processed in a short time.

In [ ]:
def quickSort(table):
    quickSortHelper(table,0,len(table)-1)

def quickSortHelper(table,first,last):
    pass

def partition(table,first,last):
    pass
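
A common way to make the pivot robust against (partially) ordered data is median-of-three selection: use the median of the first, middle and last element of the range as the pivot. The helper below is a sketch of the selection step on plain values; the name medianOfThree is ours and not part of the stubs, and in your partition the comparisons would be between rows instead of plain values.

In [ ]:
# Hypothetical helper, not part of the stubs: median-of-three pivot selection
def medianOfThree(lst, first, last):
    mid = (first + last) // 2
    candidates = [(lst[first], first), (lst[mid], mid), (lst[last], last)]
    candidates.sort()
    return candidates[1][1]  # index of the median of the three values

medianOfThree([9, 1, 5, 3, 7], 0, 4)  # returns 4: 7 is the median of 9, 5 and 7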

Use the quick sort algorithm to sort the table on date and print the first 10 incident ids.

T4 (10 points): Run the shellSort and the quickSort algorithms on the same limit values as in the previous task and measure the run times.

Searching the dataset

The final part of the assignment is to find elements in a dataset. We will use two search techniques: a simple linear search and a binary search.

T5 (20 points): Create a linear search algorithm; your task is to find the following ids: 93094, 66515, 47279, 122471, 16515.

In [ ]:
def lsearch(table,f,fieldIndex,find):
    pass

Use lsearch to find the given ids and print their dates (if an id can't be found, print something like: N/A). Also measure how long it takes in total to find them.

Hint 1: You can use the same compare functions as in the previous tasks for the argument f.

Hint 2: The fieldIndex=... represents the number of the column in the table in which you want to search, so if you use an index of 0 you will search for an incident id.

We can observe that two of the search queries took considerable time; this is because we don't know where the items are. Many algorithms are designed with the assumption that the dataset is sorted. Therefore, we will try the same on a sorted dataset. This way we might not need to search the entire table to know that a certain value does not appear in it.
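
The idea is sketched below on a plain sorted list of integers (illustration only; your sortedLSearch must follow the stub that comes next): the scan can stop as soon as it passes the position where the value would have had to appear.

In [ ]:
# Illustration on a plain sorted list: stop as soon as an element exceeds the
# target, because in a sorted list the target cannot appear further right
def sortedLSearchPlain(sortedLst, target):
    for value in sortedLst:
        if value == target:
            return True
        if value > target:
            return False  # early exit: every remaining value is even larger
    return False

sortedLSearchPlain([2, 3, 5, 8, 13], 7)  # False, after inspecting only 4 elements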

In [ ]:
def sortedLSearch(sortedTable,f,fieldIndex,find):
    pass

Again, try to search for the same ids and measure how long it takes (don't count the time it takes to sort the table); you will probably notice that the search executes faster.

We can further enhance the search process by exploiting the properties of a sorted array. A binary search algorithm uses the properties of a sorted dataset. The process is very intuitive and can be illustrated using De Telefoongids, a dataset of names and telephone numbers ordered by name. If you look for someone with the surname "Hejderup", you start at the letter H and continue the search from there.
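
On a plain sorted list, binary search repeatedly halves the search range, so it needs only O(log n) comparisons. The following generic sketch is an illustration; your bsearch must work on table rows via f and fieldIndex:

In [ ]:
# Generic binary search on a plain sorted list, for illustration only
def bsearchPlain(sortedLst, target):
    low, high = 0, len(sortedLst) - 1
    while low <= high:
        mid = (low + high) // 2
        if sortedLst[mid] == target:
            return mid
        elif sortedLst[mid] < target:
            low = mid + 1   # target can only be in the right half
        else:
            high = mid - 1  # target can only be in the left half
    return -1  # not found

bsearchPlain([2, 3, 5, 8, 13], 8)  # 3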

T6 (20 points): Your task is to implement a binary search version with the same search queries as in the previous task.

In [ ]:
def bsearch(sortedTable,f,fieldIndex,find):
    pass

Again, try to search for the same ids and measure how long it takes (don't count the time it takes to sort the table).

Optional (Bonus task)

T7 (10 points): The last task is to find multiple entries; your task is to modify the binary search such that all results for the 'gemeente' of Amsterdam are returned.

In [ ]:
def multiplebsearch(sortedTable,f,fieldIndex,find):
    pass

Print all the incident ids of incidents that happened in the 'gemeente' Amsterdam.