Decision trees, one of the oldest statistical and machine learning models, are widely used as interpretable models and as the basis of random and boosted forests. However, their true power remains underexploited. The reason is well known: they involve a difficult, nonconvex, nondifferentiable optimization problem, caused by the discrete nature of the tree decisions. To this day, in spite of decades of research, the usual way to train decision trees is using a greedy recursive partitioning procedure (such as CART, C4.5 and variations of them), which is highly suboptimal. Also, the type of trees used is typically of the axis-aligned type, which is a very restrictive model form. This situation is drastically different from the way most machine learning models are trained (logistic regression, kernel machines, neural nets, etc.) — where we pick an arbitrary loss and regularization functions defined over a parametric model, and then apply some algorithm that iteratively descends down the objective function, continuously improving over previous iterates until convergence.

We will describe Tree Alternating Optimization (TAO), an algorithm that achieves all of the above and makes trees and tree-based models first-class machine learning models. TAO is based on two theorems: a separability condition over sets on non-descendant nodes, and a reduced problem that shows how to lower globally the objective function over an individual decision or leaf node. Starting from a given tree structure and initial node parameters, TAO repeatedly optimizes nodes in alternation, monotonically decreasing the objective function. TAO works with any loss and regularization and with any type of tree as long as the decision and leaf nodes’ reduced problems can be (approximately) solved. This opens up an enormous space of models that we have started exploring.

We will show how TAO can be used to train various types of trees and forests. A particularly useful model are sparse oblique trees, which impose a sparsity penalty or constraint on the tree parameters so the tree structure is learned automatically and each node uses few features. These trees strike a good compromise between accuracy and interpretability and are scalable to large datasets and trees. As an application, I will show how to use sparse oblique trees as a “microscope” to probe into the internal workings of deep neural nets and even manipulate them. This allows us, among other things, to determine which groups of neurons are responsible for which output classes of the network; and to force the network to predict, or not predict, a given class by masking certain neurons. This suggests that sparse oblique trees could also be used in neuroscience, cognitive science and psychology as a tool to understand and manipulate biological neural networks.

Another success of TAO is with decision forests. When using sparse oblique trees as the base learner, the resulting forests are consistently more accurate than the state-of-the-art (random forests, AdaBoost, gradient boosting) across many datasets, while resulting in drastically fewer trees and possibly fewer parameters and faster inference.

This joint work with my students Magzhan Gabidolla, Suryabhan Singh Hada, Yerlan Idelbayev and Arman Zharmagambetov.

BIOGRAPHY

Miguel ¡. Carreira-PerpiÒ·n is a professor in the Department of Computer Science and Engineering at the University of California, Merced. He received the degree of “licenciado en inform·tica” (MSc in computer science) from the Technical University of Madrid in 1995 and a PhD in computer science from the University of Sheffield in 2001. Prior to joining UC Merced, he did postdoctoral work at Georgetown University (in computational neuroscience) and the University of Toronto (in machine learning), and was an assistant professor at the Oregon Graduate Institute (Oregon Health & Science University). He is the recipient of an NSF CAREER award, a Google Faculty Research Award and best (student) paper awards at AISTATS and Interspeech. He is an action editor for the Journal of Machine Learning Research, a past associate editor for the IEEE Transactions on Pattern Analysis and Machine Intelligence, and has been an area chair or senior area chair for several machine learning, computer vision and data mining conferences (NIPS, ICML, AISTATS, AAAI, ECCV, SDM). His research interests lie in machine learning and optimization. Most recently, he has been interested in nonconvex optimization problems that involve nested functions (such as deep neural nets) and possibly a mixture of discrete and continuous parameters; in neural net compression, formulated as a constrained optimization; and in decision tree optimization. Other interests are in unsupervised learning problems such as dimensionality reduction (including spectral methods, nonlinear embedding methods and autoencoders), clustering and denoising, mean-shift algorithms, learning binary hash functions, and applications to speech processing (e.g. articulatory inversion and model adaptation), computer vision, sensor networks, information retrieval and other areas.