What is a string?

Strings in the way computers represent textual information.

Classes of problems on strings

Most problems with strings have to do with search a string or a list of strings/patterns in another string.

String representation

Encodings

In computer memory, strings are nothing more that binary data.

Data in computer memory

Data in computer memory

In order to interprect what the value 0x48 (first 8 bits) means in terms of characters, we need an encoding.

The problem of encoding predates computers. For example, Morse code is an encoding between short and long sounds to Latin script.

Typical String Encodings

  • 1-byte encodings: EBCDIC and ASCII
  • 2-byte encodings: CJK (for logographic languages), Shift-JIS
  • multibyte-byte encodings: Unicode UTF-8

ASCII

ASCII was originally a 7-bit encoding.

The ASCII character table

The ASCII character table

Fun with ASCII

ASCII characters are representable as numbers. This means that we can do arithmetic on them.

#include<stdio.h>

main() {
    for (int i = 0; i < 128; i++)
      printf("%c ", i);

    printf("%d\n", c);    

    for (int i = 128; i < 256; i++)
      printf("%c ", i);
}
The ASCII character table

The ASCII character table

ISO-8859-*

To accomodate different languages with characters in non-Latin scripts, or special characters (e.g. ø, ß, ç) in Latin-based languages, ASCII was extended with an 8th bit.

This allowed operating systems in most of the world (non-logographic languages) to use the American string representation as a basis and only customize encodings when the extra characters where required.

Q: What problems did this create?

A file encoded in ISO-8859-7 (Greek) could not be read correctly in a computer configured with ISO-8859-1 (Latin-1).

Encodings

Encodings

Unicode

The world has around 7k languages and 3.8k writing systems. Unicode is an effort to specify 1 string encoding scheme for everything. The latest version contains a repertoire of 136,755 characters covering 139 modern and historic scripts, as well as multiple symbol sets (e.g. music, math, emoji).

Unicode defines a lookup table between a unique character ID and a character name / description

  • U+01E8 Ǩ LATIN CAPITAL LETTER K WITH CARON
  • U+0E57 ๗ THAI DIGIT SEVEN
  • U+1F638 😸 GRINNING CAT FACE WITH SMILING EYES
  • U+262D ☭ HAMMER AND SICKLE

Representing Unicode text

The word hello is represented in Unicode as:

U+0068 U+0065 U+006C U+006C U+006F

Q: How would you store that on disk?

One way would be to store the bytes: 00 68 00 65 00 6C 00 6C 00 6F

However, this assumes that Unicode symbols have equal probability of being used, while we can only store \(2^{16}\) characters. The later can be solved in we use \(2^{32}\) bytes for each character.

UTF-8

To solve space issues we can use variable length encodings. The most common encoding format for Unicode data is UTF-8, developed by Ken Tompson and Rob Pike.

In UTF-8, the most common characters are encoded with less bits than more rare ones.

  • The first 128 characters need 1 byte, and correspond to ASCII
  • The next 1,920 characters need 2 bytes, and cover most western and middle-eastern languages
  • 3 bytes cover all spoken languages
  • 4 bytes cover emoji and other symbolic languages

Various Unicode bit encodings

For the character U+262D (☭), various Unicode encodings produce the following results

Unicode encodings for ☭

Unicode encodings for ☭

One thing to remember about strings in computers:

Characters and bytes are 2 different things

String distance

Approximate string matching

Approximate string matching (or fuzzy search) is a technique of finding strings that match a pattern approximately (rather than exactly).

Given a pattern string \(P\) and a string \(T\), find a sub-string \(T'\) in the set of all substrings of \(T\), so that the distance to \(P\) is minimal (for some definition of distance).

Q Can you think of an application of fuzzy searching?

Some applications include spellcheckers, Google’s “Did you mean…”, DNA matching etc

Levenshtein distance

The distance between two strings \(a,b\) (of length \(|a|\) and \(|b|\)) is:

\[ \qquad\operatorname{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{ if} \min(i,j)=0, \\ \min \begin{cases} \operatorname{lev}_{a,b}(i-1,j) + 1 \\ \operatorname{lev}_{a,b}(i,j-1) + 1 \\ \operatorname{lev}_{a,b}(i-1,j-1) + 1 \end{cases} & \text{ otherwise.} \end{cases} \]

\(lev(i,j)\) is the distance at characters \(a_i\) and \(b_j\).

The Levenshtein distance measures the minimal number of deletions, insertions or replacements to make the strings equal.

Levenshtein distance in Python

def levenshtein(s, t):
    if s == "":
        return len(t)
    if t == "":
        return len(s)
    if s[-1] == t[-1]:
        cost = 0
    else:
        cost = 1
       
    res = min([levenshtein(s[:-1], t) + 1,
               levenshtein(s, t[:-1]) + 1, 
               levenshtein(s[:-1], t[:-1]) + cost])
    return res

Q: What is the complexity of the simple implementation?

\(O(3^{m+n-1})\), where \(m\) and \(n\) are the legths of s and t. See an explanation here

Memoization in action

memo = {}
def levenshtein2(s, t):
    if s == "":
        return len(t)
    if t == "":
        return len(s)
    cost = 0 if s[-1] == t[-1] else 1
       
    i1 = (s[:-1], t)
    if not i1 in memo:
        memo[i1] = levenshtein2(*i1)
    i2 = (s, t[:-1])
    if not i2 in memo:
        memo[i2] = levenshtein2(*i2)
    i3 = (s[:-1], t[:-1])
    if not i3 in memo:
        memo[i3] = levenshtein2(*i3)
    res = min([memo[i1]+1, memo[i2]+1, memo[i3]+cost])
    
    return res

With memoization, we store intermediate results in a shared dictionary, whose key is the function arguments. This takes space of \(O(mn)\), but the algorithm now becomes \(O(mn)\).


Differencing strings

Levenshtein distance does not tell us which steps we need to take to ma. To obtain the steps, we need to solve the longest common subsequence problem.

In LCS, we compare our \(O\)riginal string with a \(N\)ew one and obtain a sequence of all the items that are common in \(O\) and \(N\). By comparing this sequence to both \(O\) and \(N\), we can obtain a set of operations (or edit script) that when applied to \(O\) will give us \(N\).

O     1 a 2 3 4 r t
N     c 1 b 2 3 x d 4
LCS   1 2 3 4
DIFF  c 1 a b 2 3 x d 4 r t
OP    + = - + = = + + = - -
def diff(xs, ys):
    cs = lcs(xs, ys)
    edit_script(xs, ys, cs)

The least common subsequence

def lcs(xstr, ystr):
    if not xstr or not ystr:
        return ""
    x, xs, y, ys = xstr[0], xstr[1:], ystr[0], ystr[1:]
    if x == y:
        return x + lcs(xs, ys)
    else:
        return max(lcs(xstr, ys), lcs(xs, ystr), key=len)

The complexity of the naive LCS is \(O(2^{m+n})\). Similar to Levenshtein distance, we can use memoization to convert LCS to \(O(mn)\).

What LCS returns are the common characters in xstr and ystr, in order of appearence in xstr.

The edit script function

def edit_script(xs, ys, cs):
    if len(cs) == 0:
        for x in xs:
            print "-%s" % x
        for y in ys:
            print "+%s" % y
        return

    x, y, c = xs[0], ys[0], cs[0]
    if c != x:
        print "-%s" % x
        edit_script(xs[1:], ys, cs)
    elif c != y:
        print "+%s" % y
        edit_script(xs, ys[1:], cs)
    else:
        print "=%s" % c
        edit_script(xs[1:], ys[1:], cs[1:])

Given the two strings and the LCS, the edit script function will return a sequence of additions + and removals - that when applied to xs will give ys.

The diff utility

The ability to compare strings and produce edit scripts has been generalized in the diff Unix tool, where differences are only considered at the line level.

Output from Git Diff

Output from Git Diff

The output of diff is the basis for all version control systems, like Git and Subversion.

Bibliography

[1] J. Spolsky, “The absolute minimum every software developer absolutely, positively must know about unicode and character sets,” 2003. [Online]. Available: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/.