Assignment 5 - shortest paths
Deadline: Dec 9th, 2024 14:00 CET
In this assignment, we return to the segmentation problem and try to solve it using a shortest path algorithm. Segmentation is a prevalent problem in computational linguistics. Typical applications include determining words for texts written without word boundary markers (e.g., Chinese), determining possible words in automatic speech recognition, and modeling child language acquisition. The problem is also a first step towards parsing, another important problem in the field.
We will consider segmentation of a given string
as finding the minimum path from start node to end node
in a setting like the following example,
which represents the graph for string acat
.
We will first build all potential paths that may
“cut through” the sequence at the correct nodes,
assign weights to each edge,
and use Bellman-Ford algorithm
to find the shortest path from the start node (0)
to the final node (4 in the example).
Given correct weights,
the shortest path should be the path highlighted above,
which segments the string as a cat
.
You are provided with a class in the template that assigns weights to sequences of characters based on probabilities estimated from a corpus (we also provide a short corpus that is used in a number of earlier studies for modeling child language acquisition.
As usual, we also have a few test cases.
Exercises
5.1 Initialize the graph
Implement the initialize_graph()
function in the template,
which takes a string and returns an adjacency matrix representation
of the graph including all possible segmentations
as described above.
The weight of each edge should be the “negative log probability”
of the string that falls between the nodes connected by the edge.
A possible adjacency matrix for the above graph could be:
[[ 5.2 13.3 17.1 21.1]
[ 9.4 13.3 10.0]
[ 5.2 7.4]
[ 8.5]]
Note that this is a (n upper) triangular matrix due to the fact that
the edges are strictly forward (with respect to order of letters).
The cell (0,0) represents the weight of the edge 0 -> 1
in the graph,
which corresponds to word a
.
Similarly, for example cell (2,3) represents the edge 2->4
(the word at
).
In general, any cell (i, j) represents the word that starts at index i
, ends at j+1
.
Note that the matrix above has $n-1$ rows and $n-1$ columns ($n$:
number of nodes in the graph),
since we do not need to represent edges to node 0,
and edges from node 4 (the last node in the graph).
This saves some memory.
However, if it simplifies the mapping of nodes to indices,
you may include the redundant columns or rows.
5.2 Find the shortest path
Implement the find_shortest_path()
that should use the Bellman-Ford
algorithm to find the shortest path.
Note that we are only interested in the shortest path from node 0
to the final node.
This function returns the shortest path as a list of nodes
(segmentation indices).
5.3 Segmentation
Implement the segment_sequence()
that uses the functions above
to return right segmentation (given the model assumptions) of the
sequence. For example, segmentation of string acat
should (probably)
return a cat
.
Wrapping up
Do not forget to
- indicate your name(s) in the file header
- add the honor code (I/We pledge that this code represents my/our own work)
- commit all your changes
- push it to your GitHub repository