Assignment 2: Binary trees

Deadline: Nov 18, 2024 14:00 CET

In this set of exercises you will practice working with binary trees. We will work with examples of binary phrase structure (syntactic analyses of natural language sentences).

Exercises

Implement the following methods in the BinaryTree class in the template, which represents a binary tree as a Python list (using it as an array). Please follow the (standard) procedure for implementing binary trees with arrays as described during the lectures.

Throughout the implementation, we use indices of the array where a node is stored as the ID of the node.

add_node(): Add a right or left child to a given node with the given label, and return its index. This method should also extend the array when needed. The size of the array should be limited to the minimum storage needed for the current tree. An index value out of bounds indicates a non-existing node.
find_node(): Return an iterable (a sequence, or better, a generator), that returns/yields IDs (indices) of the nodes with the given label. If there is another match found at the left child (or one of its descendants) it should be returned before the node. Possible matches at the right child or its descendants should be returned after the node.
get_height(): Return the height of a node with the given index.
get_depth(): Return the depth of a node with the given index.
__str__(): Return a string representation of the binary tree similar to Penn Treebank notation (see below for details). Spaces are only needed when separating terminal nodes (words) from their parents. Other spacing (including line breaks) and indentation is optional.

For empty slots in the array (the locations corresponding to non-existing nodes), use None as a placeholder. The size of the list should be the minimum size required by the tree represented.

As usual a set of test cases are provided.

Penn Treebank format

A popular format for storing parse trees is so-called Penn treebank notation. The nodes represent either a phrase label (like NP below), or a word, or terminal (like John below). The format includes a hierarchical organization of parenthetical expressions. The label of an internal node is indicated just after an opening parentheses. The leaf nodes, the words, are not placed inside parenthesis, and indicated next to their immediate parent. The unary rules (internal nodes with a single child) are only allowed if the child is a leaf node (word). Hence, every expression inside the parentheses are either two labels (a leaf node preceded by its parent, no parentheses) or a node label followed by its left and right subtree (both subtrees surrounded by parentheses). An example parse tree and its Penn treebank representation are provided below.

(S (NP John)
   (VP (V ate)
       (NP (Det a)
           (N pizza)
       )
   )
)

Note that we are only concerned with binary trees of the form above (actual Penn treebank format allows n-ary trees with arbitrary n). We also assume that the terminal nodes (the words) are always the left child of their parent.

Wrapping up

Do not forget to

indicate your name(s) in the file header
commit all your changes
push it to your GitHub repository