Assignment 5 - shortest paths

Deadline: Dec 9th, 2024 14:00 CET

In this assignment, we return to the segmentation problem and try to solve it using a shortest path algorithm. Segmentation is a prevalent problem in computational linguistics. Typical applications include determining words for texts written without word boundary markers (e.g., Chinese), determining possible words in automatic speech recognition, and modeling child language acquisition. The problem is also a first step towards parsing, another important problem in the field.

We will consider segmentation of a given string as finding the minimum path from start node to end node in a setting like the following example, which represents the graph for string acat.

example graph

We will first build all potential paths that may “cut through” the sequence at the correct nodes, assign weights to each edge, and use Bellman-Ford algorithm to find the shortest path from the start node (0) to the final node (4 in the example). Given correct weights, the shortest path should be the path highlighted above, which segments the string as a cat.

You are provided with a class in the template that assigns weights to sequences of characters based on probabilities estimated from a corpus (we also provide a short corpus that is used in a number of earlier studies for modeling child language acquisition.

As usual, we also have a few test cases.

Exercises

5.1 Initialize the graph

Implement the initialize_graph() function in the template, which takes a string and returns an adjacency matrix representation of the graph including all possible segmentations as described above. The weight of each edge should be the “negative log probability” of the string that falls between the nodes connected by the edge. A possible adjacency matrix for the above graph could be:

[[ 5.2 13.3 17.1 21.1]
 [      9.4 13.3 10.0]
 [           5.2  7.4]
 [                8.5]]

Note that this is a (n upper) triangular matrix due to the fact that the edges are strictly forward (with respect to order of letters). The cell (0,0) represents the weight of the edge 0 -> 1 in the graph, which corresponds to word a. Similarly, for example cell (2,3) represents the edge 2->4 (the word at). In general, any cell (i, j) represents the word that starts at index i, ends at j+1. Note that the matrix above has $n-1$ rows and $n-1$ columns ($n$: number of nodes in the graph), since we do not need to represent edges to node 0, and edges from node 4 (the last node in the graph). This saves some memory. However, if it simplifies the mapping of nodes to indices, you may include the redundant columns or rows.

5.2 Find the shortest path

Implement the find_shortest_path() that should use the Bellman-Ford algorithm to find the shortest path. Note that we are only interested in the shortest path from node 0 to the final node. This function returns the shortest path as a list of nodes (segmentation indices).

5.3 Segmentation

Implement the segment_sequence() that uses the functions above to return right segmentation (given the model assumptions) of the sequence. For example, segmentation of string acat should (probably) return a cat.

Wrapping up

Do not forget to