Searching

A fundamental area in Computer Science is to develop algorithmic processes for efficiently storing and retrieving data with good performance. Nowadays, data is fast-produced leading to high volume of data in no time. It is anticipated that Internet of Things (IoT) devices will generate 400 zettabytes of data in 2018. Performing analytics or science on such large datasets poses many problems with performance. Two important building blocks in solving such problems are Sorting and Searching algorithms. From the previous lectures, several classes of data structures were introduced such as lists, trees and graphs that maps a given dataset into the organization of that data structure. On a functional-level, the data is the same but working with the data structure such as performing an insert or finding an element comes with it's trade-offs to performance. This can be important to consider depending on the character of the dataset and it's application constraints.

The goal of this assignment is to separate the internals of a data structure from its operations to understand how the overall performance varies between a sorted and unsorted data structure. The two performance aspects that you will investigate is the time complexity and space complexity. The data structure you will work with is a conventional 2D list structure also known as a Matrix. The concepts and performance measurements discussed in this assignment should be applicable for other data structures such as a graph.

In this assignment, you will work with an open public dataset from Brandweer Amsterdam-Amstelland that contains over 125.000 fire alarms reported in this region. The dataset is provided in a compressed zip-format and the resulting CSV-file will be ~124MB. While there are handy libraries such as pandas that can read a CSV-file into a table, we will only restrict to using core libraries that is part of Python. To read a CSV-file to a 2D array, we do the following:

In [3]:
import csv #NB: don't forget to include the csv library

def load(csv_file):
    """
        Loads the data from the file in the provided argument into a 2D array
    """
    with open(csv_file) as csvDataFile:
        return [row for row in csv.reader(csvDataFile, delimiter=';')]

If we inspect the first element of the resulting array, we obtain information about the structure of the dataset and this is shown below. Consecutive elements of the array follow this format where the first element in the row is the incident_id and the last element is the gemeente

[['incident_id',
  'start_tijd',
  'incident_type',
  'landelijke_meldingsclassificatie_niveau1',
  'landelijke_meldingsclassificatie_niveau2',
  'landelijke_meldingsclassificatie_niveau3',
  'datum',
  'jaar',
  'maand_nr',
  'maand_naam',
  'dag_nr',
  'dag_naam',
  'week_nr',
  'kwartaal',
  'prioriteit',
  'uur',
  'dagdeel',
  'objecttype',
  'objectfunctie',
  'buurt',
  'wijk',
  'woonplaats',
  'gemeente'],
  ......
  ]

Ordering the dataset

We can observe by inspecting the first 5 rows in the 2D array that the data is unordered, the incident_id that represents the unique identification of each fire alarm event and the creation time stamp of each event is out-of-sync.

T1 (20 points): Your first task is to implement the insertion sort algorithm to sort the dataset according to the incident_id and the creation date. Notice that the algorithm is slow and will not scale. Therefore only a subset of the dataset will be used. From the function definition below, the table argument represents the 2D array, f is the compare function and limit describes the size of the dataset (NB: the offset is always the first element)

In [99]:
def insertionSort(table, f, limit=150):
    """
        Takes a data table, take a subset of 150 elements and sorts it
        NB: first value of the array is a schema of the data
    """
    subtable = table[1:limit+1]
    #implementation
    sortedTable = []
    
    for row in subtable:
        index = 0
        if (len(sortedTable) == 0):
            sortedTable.append(row)
        
        
        for element in sortedTable:
            if f(element, row):
                sortedTable.insert(index, row)
                break
            index += 1
        
        if index == len(sortedTable):
            sortedTable.append(row)
    
    for i in range(0, len(sortedTable)):
        table[i] = sortedTable[i]

Use insertionSort to sort the dataset on incident id and print the first 20 id's. (Use the cmpf function to do this)

In [38]:
incidents = load('brwaa_2010-2015.csv')
insertionSort(incidents, cmpf)

for i in range(0, 20):
    print(incidents[i][0])
2986
7056
10427
16747
20505
22992
31068
33939
39832
39925
40547
40663
40911
43049
43247
45831
45965
48323
49539
51754

The compare function f that is provided in the argument list is used for comparing two elements and is the piece of information the sort algorithm needs to know in which order to sort elements. The _incidentid should be sorted as an Integer but in the 2D array it is a String. Therefore we need to convert to an Integer and also explain the natural order of this type. The "0" indicates incident id

In [14]:
def cmpf(rowA , rowB):
    return int(rowA[0]) > int(rowB[0])

Similarly, this has to be done for date comparison

In [4]:
import datetime

def cmpfDate(rowA , rowB): 
    #6 = datum
    # 2008/01/03 00:00:00.000000000, remove microsec
    return datetime.datetime.strptime(rowA[6][:-10],'%Y/%m/%d %H:%M:%S') > datetime.datetime.strptime(rowB[6][:-10],'%Y/%m/%d %H:%M:%S')

After implementing the algorithm, try out the following limit values:

  • limit=50
  • limit=100
  • limit=150
  • limit=200
  • limit=250

Run the algorithm for each of these limit values and measure how long each of them takes.

In [52]:
import time

for i in range(1,6):
    incidents = load('brwaa_2010-2015.csv')
    start_time = time.time()
    insertionSort(incidents, cmpf, i*50)
    print("Limit " + str(i*50) + " " + str(1000*(time.time() - start_time)) + " ms")
Limit 50 1.0001659393310547 ms
Limit 100 1.4998912811279297 ms
Limit 150 4.00233268737793 ms
Limit 200 6.002187728881836 ms
Limit 250 10.005950927734375 ms

Notice that the time increases with the size of the data. An improvement to the algorithm is to treat this problem using a divide and conquer-approach that divides the dataset into sub-lists. This improvement is called the shell algorithm

T2 (10 points): Your task is to modify the current implementation of the insertion sort algorithm to work with shell sort algorithm. The modified version should be called gapInsertionSort and use the arguments specified below. You have to solve how the startposition and sublistcount are used to make this work.

In [33]:
def shellSort(table, f, limit):
    subtable = table[1:limit+1]
    sublistcount = len(subtable)//2
    while sublistcount > 0:
        for startposition in range(sublistcount):
            #implement the function call below
            gapInsertionSort(subtable,f,startposition,sublistcount)

        sublistcount = sublistcount // 2
    
    return subtable
In [114]:
def gapInsertionSort(table, f, startposition, sublistcount):
    for i in range(startposition+sublistcount,len(table),sublistcount):

        currentvalue = table[i]
        startposition = i

        while startposition>=sublistcount and f(table[startposition-sublistcount],currentvalue):
            table[startposition]=table[startposition-sublistcount]
            startposition = startposition-sublistcount

        table[startposition]=currentvalue

Use shellSort to sort the first 300 entries in the table on date and print the first 10 incident id's

In [115]:
incidents = load('brwaa_2010-2015.csv')
subtable = shellSort(incidents, cmpfDate, 300)

for i in range(10):
    print(subtable[i][0])
152257
151496
151343
150103
150062
149810
149393
149114
149008
148629

This algorithm is faster than insertion sort, but it is still not fast enough to sort all 124,000 records, so we need to have an algorithm with a lower time complexity. There are two other algorithms that have a "O(n log n)" time complexity but their space complexity varies. For merge sort, the space complexity is O (n) which means a linear growth of the space but for a quick sort, the space complexity is constant but the performance could degrade to "O(n^2)" depending on the data and selection of the pivot. For instance, if the dataset is ordered or the elements are the same. This is not the case for the dataset that we are using

T3 (30 points): Your task is to implement the quick sort algorithm, the stubs for the algorithm is presented here and the challenge here is choosing the pivot such that it can in short time process all the records.

In [64]:
def quickSort(table,f):
    quickSortHelper(table,f,0,len(table)-1)

def quickSortHelper(table,f,first,last):
    if first<last:

        splitpoint = partition(table,f,first,last)

        quickSortHelper(table,f,first,splitpoint-1)
        quickSortHelper(table,f,splitpoint+1,last)

def partition(table,f,first,last):
    pivotvalue = table[first]

    leftmark = first+1
    rightmark = last

    done = False
    while not done:

        while leftmark <= rightmark and  not f(table[leftmark], pivotvalue):
            leftmark = leftmark + 1

        while not f(pivotvalue, table[rightmark]) and rightmark >= leftmark:
            rightmark = rightmark -1

        if rightmark < leftmark:
            done = True
        else:
            temp = table[leftmark]
            table[leftmark] = table[rightmark]
            table[rightmark] = temp

    temp = table[first]
    table[first] = table[rightmark]
    table[rightmark] = temp

    return rightmark

Use quick sort algorithm to sort the table on date and print the first 10 incident id's

In [96]:
incidents = load('brwaa_2010-2015.csv')
subtable = incidents[1:len(incidents)]

quickSort(subtable, cmpfDate)

print([i[0] for i in subtable][0:10])
['154063', '154060', '154041', '154054', '154045', '154061', '154046', '154048', '154050', '154002']

T4 (10 points): Run the shellSort and the quickSort algorithms on the same limit values as in the previous task and measure the run times.

In [117]:
import time

incidents = load('brwaa_2010-2015.csv')
r = range(50, 1000, 50)
print("insertionSort:")
for limit in r:
    temp_incidents = incidents[1:limit+1]
    start_time = time.time()
    insertionSort(temp_incidents, cmpf, limit)
    print("Limit " + str(limit) + " " + str(int(100000*(time.time() - start_time)) / 100) + " ms")

print("Shellsort:")
for limit in r:
    temp_incidents = incidents[1:limit+1]
    start_time = time.time()
    shellSort(temp_incidents, cmpf, limit)
    print("Limit " + str(limit) + " " + str(int(100000*(time.time() - start_time)) / 100) + " ms")
    
print("quickSort:")
for limit in r:
    temp_incidents = incidents[1:limit+1]
    start_time = time.time()
    quickSort(temp_incidents, cmpf)
    print("Limit " + str(limit) + " " + str(int(100000*(time.time() - start_time)) / 100) + " ms")
insertionSort:
Limit 50 0.0 ms
Limit 100 1.5 ms
Limit 150 3.49 ms
Limit 200 5.5 ms
Limit 250 10.5 ms
Limit 300 14.01 ms
Limit 350 18.01 ms
Limit 400 23.51 ms
Limit 450 30.02 ms
Limit 500 37.52 ms
Limit 550 44.53 ms
Limit 600 58.54 ms
Limit 650 71.55 ms
Limit 700 76.06 ms
Limit 750 87.07 ms
Limit 800 104.57 ms
Limit 850 111.57 ms
Limit 900 135.09 ms
Limit 950 137.59 ms
Shellsort:
Limit 50 0.5 ms
Limit 100 0.49 ms
Limit 150 1.5 ms
Limit 200 1.5 ms
Limit 250 2.0 ms
Limit 300 2.5 ms
Limit 350 3.5 ms
Limit 400 4.0 ms
Limit 450 5.0 ms
Limit 500 5.0 ms
Limit 550 6.0 ms
Limit 600 6.5 ms
Limit 650 7.5 ms
Limit 700 8.5 ms
Limit 750 9.01 ms
Limit 800 11.49 ms
Limit 850 10.5 ms
Limit 900 18.01 ms
Limit 950 13.5 ms
quickSort:
Limit 50 0.5 ms
Limit 100 0.49 ms
Limit 150 1.49 ms
Limit 200 1.5 ms
Limit 250 2.51 ms
Limit 300 2.49 ms
Limit 350 3.5 ms
Limit 400 3.5 ms
Limit 450 4.0 ms
Limit 500 5.0 ms
Limit 550 5.0 ms
Limit 600 5.5 ms
Limit 650 6.5 ms
Limit 700 7.0 ms
Limit 750 7.5 ms
Limit 800 8.0 ms
Limit 850 9.0 ms
Limit 900 10.5 ms
Limit 950 10.0 ms

Searching the dataset

The final part of the assignment is to perform find elements in a data set. We will use two search techniques, a simple linear search and binary search.

T5 (20 points): Create a linear search algorithm and your task is to find the following id's: 93094, 66515, 47279, 122471, 16515

In [135]:
def lsearch(table,f,fieldIndex,find):
    for row in table[1:len(table)]:
#         print(row[fieldIndex], find == int(row[fieldIndex]))
        if find == int(row[fieldIndex]):
            return row
    
    return None

Use lsearch to find the given id's and print their date (If it can't be found print something like: N/A). Also measure how long it takes combined to find them.

In [140]:
incidents = load('brwaa_2010-2015.csv')

start_time = time.time()
print(lsearch(incidents,cmpf,0,93094)[0])
print(lsearch(incidents,cmpf,0,66515)[0])
print(lsearch(incidents,cmpf,0,47279))
print(lsearch(incidents,cmpf,0,122471)[0])
print(lsearch(incidents,cmpf,0,16515))
print("lsearch took: ", int(100000*(time.time() - start_time)) / 100, " ms")
93094
66515
None
122471
None
lsearch took:  174.12  ms

Hint 1: You can use the same compare function as in the previous tasks to provide for f

Hint 2: The fieldIndex=... represents the number of the column in the table for which you want to search, so if you use an index of 0 you will search for an incident id

We could observe that for two of the search queries, they took considerable time, this due to the fact that we don't know where the items are. Many algorithms are designed with the idea that the data set is sorted. Therefore, we will try the same on a sorted data set. This way we might not need to search the entire table to know that a certain value does not appear in it.

In [157]:
def sortedLSearch(sortedTable,f,fieldIndex,find):
    for row in sortedTable[1:len(sortedTable)]:
#         print(row[fieldIndex])
        if not f([find], row):
            if find == int(row[fieldIndex]):
                return row
            else:
                return None

Again try to search for the same id's and measure how long it takes (don't count the time it takes to sort the table), you will probably notice that the search executed faster.

In [271]:
incidents = load('brwaa_2010-2015.csv')
incidents = incidents[1:len(incidents)]
quickSort(incidents,cmpf)

start_time = time.time()
print(sortedLSearch(incidents,cmpf,0,93094)[0])
print(sortedLSearch(incidents,cmpf,0,66515)[0])
print(sortedLSearch(incidents,cmpf,0,47279))
print(sortedLSearch(incidents,cmpf,0,122471)[0])
print(sortedLSearch(incidents,cmpf,0,16515))
print("lsortedLSearch took: ", int(100000*(time.time() - start_time)) / 100, " ms")
93094
66515
None
122471
None
lsortedLSearch took:  220.9  ms

We can further enhance the search process by exploiting the properties of using a sorted array. A binary search algorithm uses the properties of a sorted algorithm. The process is very intuitive and can be illustrated using the De Telefoongids that is a sorted dataset with name and telephone numbers ordered. If you look for someone with the surname "Hejderup", you will start at the index H and continue the search there.

T6 (20 points): Your task is to implement a binary search version with the same search queries as in the previous task.

In [183]:
def bsearch(sortedTable,f,fieldIndex,find):
    left = 0
    right = len(sortedTable)
    current = (left+right)//2
    
    while find != int(sortedTable[current][fieldIndex]):
        if current <= left:
            return None
        
        if f(sortedTable[current],[find]):
            right = current
        else:
            left = current

        current = (left+right)//2
        
    return sortedTable[current]

Again try to search for the same id's and measure how long it takes (don't count the time it takes to sort the table)

In [184]:
incidents = load('brwaa_2010-2015.csv')
incidents = incidents[1:len(incidents)]
quickSort(incidents,cmpf)

start_time = time.time()
print(bsearch(incidents,cmpf,0,93094)[0])
print(bsearch(incidents,cmpf,0,66515)[0])
print(bsearch(incidents,cmpf,0,47279))
print(bsearch(incidents,cmpf,0,122471)[0])
print(bsearch(incidents,cmpf,0,16515))
print("bsearch took: ", int(100000*(time.time() - start_time)) / 100, " ms")
93094
66515
None
122471
None
bsearch took:  0.5  ms

Optional (Bonus task)

T7 (10 points): Last task is to find multiple entries, your task is to modify the binary search such that all results for the 'gemeente' of Amsterdam are returned.

In [265]:
def cmpfGemeente(rowA, rowB):
    return str(rowA[22]).rstrip() > str(rowB[22]).rstrip()

def multiplebsearch(sortedTable,f,fieldIndex,find):
    left = 0
    right = len(sortedTable)
    current = (left+right)//2
    
    while find != str(sortedTable[current][fieldIndex]).rstrip():
        print(find, str(sortedTable[current][fieldIndex]), find == str(sortedTable[current][fieldIndex]))
        
        if current <= left:
            return None
        
        if f(sortedTable[current],[None] * fieldIndex + [find]):
            right = current
        else:
            left = current

        current = (left+right)//2
        
    found = [sortedTable[current]]
    
    cursor = current - 1
    while find == str(sortedTable[cursor][fieldIndex]).rstrip():
        found.append(sortedTable[cursor])
        cursor -= 1
        
    cursor = current + 1
    while find == str(sortedTable[cursor][fieldIndex]).rstrip():
        found.append(sortedTable[cursor])
        cursor += 1
        
    return found

Print all the incident id's of incidents that happened in the 'gemeente' Amsterdam

In [258]:
incidentsGemeente = load('brwaa_2010-2015.csv')
incidentsGemeente = incidents[1:len(incidentsGemeente)]
incidentsGemeente = shellSort(incidentsGemeente,cmpfGemeente, len(incidentsGemeente))
Done
In [272]:
amsterdamIncidents = multiplebsearch(incidentsGemeente,cmpfGemeente,22,"Amsterdam")
for i in amsterdamIncidents[:10]:
    print(i[0], i[22])
75083 Amsterdam                                                                       
133560 Amsterdam                                                                       
28576 Amsterdam                                                                       
429 Amsterdam                                                                       
90401 Amsterdam                                                                       
11365 Amsterdam                                                                       
139677 Amsterdam                                                                       
1818 Amsterdam                                                                       
183045 Amsterdam                                                                       
101814 Amsterdam