Source code for col_gen_estimator._dtree_classifier

"""
Decision tree classifier using column generation.

Reference: Murat Firat, Guillaume Crognier, Adriana F. Gabor, C.A.J. Hurkens,
and Yingqian Zhang.
Column generation based heuristic for learning classification trees.
Computers and Operations Research, 116, apr 2020.

The algorithm takes candidate splits for each node of the decision tree as
inputs. A path is a set of nodes from root node to a leaf with a fixed split
for each node and a assigned target. The master problem uses a binary variable
for each possible path. The goal is to select a set of path that are consistent
among themselves and result in the best accuracy.

Since there are exponentially many paths, the path variables are generated
using column generation by a subproblem.

We only work with complete trees. That is all the paths in the tree have the
same depth d. Each internal node has two children.

We assume a fixed ordering of the nodes and leaves in the tree. The root node
is numbered 0 (ID of the node). The left child of the node (i) is (2i + 1) and
the right child of the node (i) is (2i + 2). Leaves are numberd from (0) to
(2^d) from left to right.

Master problem:
Sets
R set of rows in data file.
F set of features in data file.
N_{lf} , N_{int} leaf and internal (non-leaf) nodes in the decision tree.
p_{BT}(l) path to leaf l in binary tree.
DP_l$ set of decision paths ending in leaf l.
DP_l(j) subset of paths in DP_l, such that j in p_{BT}(l).
R^{l}(p) subset of rows directed to leaf l through path p.
S_j set of decision splits for node j.

Parameters
k depth of the decision tree.
CP(p) number of correct predictions/true positives for path p.

Decision Variables
x_p binary variable indicating that path p is assigned to leaf l.
rho_{j,a} binary variable indicating that split a has been assigned to node j.


Max     sum_{l in N_{lf}} sum_{p in DP_{l}} CP(p)x_p
s.t.    sum_{p in DP_{l}} x_p = 1,  forall l in  N_{lf}             (alpha)
		sum_{lin N_{lf}} sum_{p in DP_{l}:r in R^l(p)} x_p = 1
                                        forall r in R               (beta)
		sum_{p in DP_{l}:s(j) = a} x_p = rho_{j,a},
                                        forall l in N_{lf},
                                            j in p_{BT}(l) and N_{int},
                                            a in S_j                (gamma)
		x_p in {0,1},  forall p in DP_{l}, l in N_{lf}
		rho_{j,a} in {0,1} forall j in N_{int}, a in S_j

The objective maximizes the accuracy. The (alpha) constraints ensure that
exactly one path is selected for each leaf. The (beta) constraints ensure
that exactly one path is selected for each row in the dataset. They are implied
by the (alpha) and (gamma) constraints, but they are needed for stronger LP
relaxation. The (gamma) constraints ensure that the selected paths are
consistent (have the same split on common nodes).

Subproblem: (Modified from the original paper)
Sets
R & set of all rows in data file.
F set of features in data file.
Y set of all class labels.
S_j set of decision splits for node j.
LC(l) set of internal nodes with left child in p_{BT}(l).
RC(l) set of internal nodes with right child in p_{BT}(l).
T(r) set of splits for which row r returns a TRUE (feature <= threshold)
F(r) set of splits for which row r returns a FALSE: (feature > threshold)

Parameters
alpha_l dual cost of (alpha) constraint in the master problem for leaf l.
beta_r dual cost of (beta) constraint in the master problem for row r.
gamma_{l,j,a} dual cost of (gamma) constraint in the master problem for
        leaf l, node j, split a.

Decision Variables
y_r binary variable indicating that row r reaches leaf l.
z_r binary variable indicating that row r reaches leaf l and has the same
    target as the path being generated.
u_{j,a} binary variable indicating that split a is assigned to node j.

DPP(l):
Max sum_{r in R} z_r - alpha_l
    - sum_{j in p_{BT}(l)} sum_{a in S_j} gamma_{l,j,a} u_{j,a}
    - sum_{r in R} beta_r y_r
s.t. sum_{a in S_j} u_{j,a} = 1,  forall j in  RC(l) or LC(l)               (1)
	 y_r <= sum_{a in S_j and T(r)} u_{j,a}, forall j in LC(l), r in R      (2)
	 y_r <= sum_{a in S_j and F(r)} u_{j,a} , forall j in RC(l), r in R     (3)
	 sum_{j in LC(l)} sum_{a in S_j and T(r)} u_{j,a}
        + sum_{j in RC(l)} sum_{a in S_j and F(r)} u_{j,a} - (k-1) <= y_r,
                                            forall r in R                   (4)
	 sum_{j in p_{BT}(l)} u_{j,a} <= 1, forall a in (Union of all splits)   (5)
     z_r <= y_r, forall r in R                                              (6)
	 z_r <= w_t, forall r in R , t = correct class for row r                (7)
	 sum_{t \in Y} w_t = 1                                                  (8)
	 z_r in {0,1},  forall r in R                                           (9)
     y_r in {0,1},  forall r in R                                          (10)
	 u_{j,a} in {0,1} forall j in RC(l) or LC(l), a in S_j                 (11)

We have one subproblem for each leaf. This is different from the original paper
where there is one subproblem for each leaf and target combination.

The objective maximizes the reduced cost. The constraints (1) ensure that
exactly one split is selected for each node. The constraints (2)(3)(4) ensure
that the row follows the path when the corresponding splits are selected. The
constraints (5) ensure that a split is only selected once in a path. The
constraints (6)(7) ensure that the z variable is set only if the row follows
the path (y variable is set) and the right target is selected (w variable is
set). The constraint (8) ensures that exactly one target is selected for the
path being generated.
"""
import numpy as np
import math
import random
import multiprocessing as mp
from ortools.linear_solver import pywraplp
from ortools.linear_solver import linear_solver_pb2
from ortools.sat.python import cp_model
from math import floor, ceil
from sklearn.utils.validation import check_array, check_is_fitted
from ._col_gen_classifier import BaseMasterProblem
from ._col_gen_classifier import BaseSubproblem
from ._col_gen_classifier import ColGenClassifier

from time import time



[docs]
class Row:
    """To make the processing faster, we store some row related information in
    this class.

    Attributes
    ----------
    id: (int) ID of the row.
    reachable_nodes: (set(int)) IDs of the nodes the row can reach.
    reachable_paths = (set(int)) IDs of the paths the row can follow.
    reachable_leaves = (set(int)) IDs of the leaves the row can reach.
    target: (int) Target of the row.
    left_splits: (set(int)) IDs of the splits where the feature value of the
        row is smaller or equal to the threshold.
    right_splits: (set(int)) IDs of the splits where the feature value of
        the row is greater than the threshold.
    removed_from_master: (bool) True if the row is removed from the master.
        This can happen if the row can only reach at max two leaves or the row
        is similar to some other row in the dataset with respect to the splits
        being considered.
    removed_from_sp: (bool) True if the row is removed from the master and the
        subproblem. This can happen if the row is similar to some other row in
        the dataset with respect to the splits being considered.
    weight: (int) Typically = 1. Some rows can have higher weights if we remove
        the other rows that are similar to this row from the dataset.
    """


[docs]
    def __init__(self) -> None:
        self.id = -1
        self.reachable_nodes = set()
        self.reachable_paths = set()
        self.reachable_leaves = set()
        self.target = -1
        # List of satisfied splits.
        self.left_splits = set()
        # List of unsatisfied splits.
        self.right_splits = set()
        self.removed_from_master = False
        self.removed_from_sp = False
        self.weight = 1





[docs]
class Path:
    """All path related information.

    Attributes
    ----------
    leaf_id: (int) ID of the associated leaf.
    node_ids: (list(int)) IDs of the nodes in the path.
    splits: (list(int)) IDs of the selected splits for nodes in the path.
        The order of splits must match the order of nodes.
    cost: (int) Number of rows following the path with correct target.
    id: (int) ID of the path.
    target: (int) Associated target of the path.
    """


[docs]
    def __init__(self) -> None:
        self.leaf_id = -1
        self.node_ids = []
        self.splits = []
        self.cost = 0
        self.id = -1
        self.target = -1
        self.initial = False



[docs]
    def is_same_as(self, path):
        """ Returns true if the current path is same as the path in the
        argument.

        Parameters
        ----------
        path: (Path) The other path being compared to.
        """
        if self.leaf_id != path.leaf_id:
            return False

        if self.target != path.target:
            return False

        if self.cost != path.cost:
            return False

        if len(self.node_ids) != len(path.node_ids):
            return False

        for i in range(len(self.node_ids)):
            node_id = self.node_ids[i]
            split_id = self.splits[i]
            found = False
            for j in range(len(path.node_ids)):
                if path.node_ids[j] == node_id and path.splits[j] == split_id:
                    found = True
                    break
            if not found:
                return False

        return True



[docs]
    def print_path(self):
        """Prints the attributes of the path."""
        print("-------------------")
        print("ID: ", self.id)
        print("Nodes: ", self.node_ids)
        print("Splits: ", self.splits)
        print("Leaf: ", self.leaf_id)
        print("Target: ", self.target)
        print("Cost: ", self.cost)
        print("-------------------")



[docs]
    def set_leaf(self, leaf_id, depth):
        """
        For a full binary tree, populates the nodes corresponding to the given
        leaf_id.

        Parameters
        ----------
        leaf_id: (int) The ID of the leaf.
        depth: (int) Depth of the path.
        """
        self.leaf_id = leaf_id
        self.node_ids = []
        count = 2**(depth) - 1 + leaf_id
        while (count > 0):
            father = math.floor((count-1) / 2)
            self.node_ids.append(father)
            count = father





[docs]
class Node:
    """All node related information.

    Attributes
    ----------
    id: (int) ID of the node.
    candidate_splits: (list(int)) IDs of the candidate splits that can be used
        at this node.
    candidate_splits_count: (list(int)) Count of how many times the
        corresponding split was seen in initialization. This is useful to trim
        the candidate_splits list. Used only at the initialization and not
        during training.
    last_split: (int) ID of the last split added to the candidate_splits. Used
        only at the initialization and not during training.
    parent: (int) ID of the parent node. Parent of the root node is -1.
    left_child: (int) ID of the left child node. -1 if the child is a leaf.
    right_child: (int) ID of the right child node. -1 if the child is a leaf.
    children_are_leaves: (bool) True if the children of this node are leaves.
    """


[docs]
    def __init__(self) -> None:
        self.id = -1
        self.candidate_splits = []
        self.candidate_splits_count = []
        self.last_split = -1
        # parent is -1 for the root node.
        self.parent = -1
        # child is -1 for the leaves.
        self.left_child = -1
        self.right_child = -1
        self.children_are_leaves = False





[docs]
class Leaf:
    """All leaf related information.

    Attributes
    ----------
    id: (int) ID of the leaf.
    left_nodes: (list(int)) Set of nodes that take left branch to reach this
        leaf.
    right_nodes: (list(int)) Set of nodes that take right branch to reach this
        leaf.
    """


[docs]
    def __init__(self) -> None:
        self.id = id
        self.left_nodes = []
        self.right_nodes = []



[docs]
    def create_leaf(self, id, depth) -> None:
        """Given the id and depth of the leaf, populates the left_nodes and
        right_nodes attributes.

        Parameters
        ----------
        id: (int) ID of the leaf.
        depth: (int) Depth of the tree.
        """
        self.id = id
        self.left_nodes = []
        self.right_nodes = []
        count = 2**(depth) - 1 + id
        while (count > 0):
            father = math.floor((count-1) / 2)
            if count % 2 == 0:
                self.right_nodes.append(father)
            else:
                self.left_nodes.append(father)
            count = father





[docs]
class Split:
    """All split related information.

    Attributes
    ----------
    id: (int) ID of the split.
    feature: (int) Index of the feature used for the split.
    threshold: (float) Threshold value for the split. The row takes left branch
        if the feature <= threshold.
    left_rows: (set(int)) IDs of the rows that take left branch on this split.
        This is used for faster computing of the rows that follow a particular
        path.
    right_rows: (set(int)) IDs of the rows that take right branch on this
        split.
    left_cuts: (set(int)) Indices of the cuts added as (beta) constraints that
        correspopnd to a row that take left branch on this split. This is used
        for faster computing of the cuts that follow a particular path.
    right_cuts: (set(int)) Indices of the cuts added as (beta) constraints that
        correspopnd to a row that take left branch on this split.
    """


[docs]
    def __init__(self) -> None:
        self.id = -1
        self.feature = -1
        self.threshold = -1
        # In reduced split case, some splits are removed.
        self.removed = False
        self.left_rows = set()
        self.right_rows = set()
        self.left_cuts = set()
        self.right_cuts = set()




def get_satisfied_rows(path, leaf, splits):
    """Returns the set of satisfied rows for the given path.
    We use set intersection to compute this faster. This requires that we store
    the set of rows that take a specific branch on each split.

    Parameters
    ----------
    path: (Path) the path.
    leaf: (Leaf) the leaf of the path. We pass this because the leaf contains
        the information about which nodes take left branch for this path.
    splits: (list(Split)) splits to get the rows that take the correct branch.
    """

    satisfied_rows = set()
    split_ids = path.splits
    node_ids = path.node_ids
    split_0 = splits[split_ids[0]]
    if node_ids[0] in leaf.left_nodes:
        satisfied_rows = split_0.left_rows
    else:
        satisfied_rows = split_0.right_rows
    for i in range(1, len(node_ids)):
        split = splits[split_ids[i]]
        if node_ids[i] in leaf.left_nodes:
            satisfied_rows = satisfied_rows.intersection(split.left_rows)
        else:  # Right node in the path
            satisfied_rows = satisfied_rows.intersection(split.right_rows)

    return satisfied_rows


def get_satisfied_cuts(path, leaf, splits):
    """Returns the set of satisfied cut (beta) rows for the given path.
    We use set intersection to compute this faster. This requires that we store
    the set of cuts that take a specific branch on each split.

    Parameters
    ----------
    path: (Path) the path.
    leaf: (Leaf) the leaf of the path. We pass this because the leaf contains
        the information about which nodes take left branch for this path.
    splits: (list(Split)) splits to get the cuts that take the correct branch.
    """

    satisfied_cuts = set()
    split_ids = path.splits
    node_ids = path.node_ids
    split_0 = splits[split_ids[0]]
    if node_ids[0] in leaf.left_nodes:
        satisfied_cuts = split_0.left_cuts
    else:
        satisfied_cuts = split_0.right_cuts
    for i in range(1, len(node_ids)):
        split = splits[split_ids[i]]
        if node_ids[i] in leaf.left_nodes:
            satisfied_cuts = satisfied_cuts.intersection(split.left_cuts)
        else:  # Right node in the path
            satisfied_cuts = satisfied_cuts.intersection(split.right_cuts)

    return satisfied_cuts


def row_satisfies_path(X_row, leaf, splits, path):
    """Returns true if the passed row follows the path. The row may have
    different target than the path.

    Parameters
    ----------
    X_row:  ndarray, shape (1, n_features)
        The datapoint representing the row.
    leaf: (Leaf) The leaf of the path.
    splits: (list(Split)) splits to get the rows that take the correct branch.
    path: (Path) The path.
    """
    # path.print_path()
    split_ids = path.splits
    node_ids = path.node_ids
    for i in range(len(node_ids)):
        split = splits[split_ids[i]]
        feature = split.feature
        threshold = split.threshold
        if node_ids[i] in leaf.left_nodes:
            if X_row[feature] > threshold:
                return False
        else:  # Right node in the path
            if X_row[feature] <= threshold:
                return False
    return True


def get_params_from_string(params):
    """ Given the params in string form, returns the MPSolverParameters.
    This method is not yet implemented completely.
    By default we keep the incrementality on and use the primal simplex as the
    lp algorithm.

    Parameters
    ----------
    params : string,
        The solver parameters in string format.
    """
    # TODO: Implement this method.
    solver_params = pywraplp.MPSolverParameters()
    solver_params.SetIntegerParam(
        pywraplp.MPSolverParameters.INCREMENTALITY,
        pywraplp.MPSolverParameters.INCREMENTALITY_ON)
    solver_params.SetIntegerParam(
        pywraplp.MPSolverParameters.LP_ALGORITHM,
        pywraplp.MPSolverParameters.PRIMAL)
    print(params)
    return solver_params


class CutGenerator(cp_model.CpSolverSolutionCallback):
    """Track intermediate solutions.

    Attributes
    ----------
    path_vars: (list(int)) Indices of path vars in the solver.
    split_vars: (list(int)) Indices of split vars in the solver.
    solution_count: (int) Count of all solutions seen so far. They may not have
        the desired objective value.
    cuts: list(list(bool)), shape (num cuts, num paths)
        List of all generated cuts. Each cut is a list of boolean representing
        whether or not the cut follows the path.
    cut_rows: list(list(bool)), shape (num cuts, num splits)
        List of all generated cuts. Each cut is a list of boolean representing
        whether or not the cut takes the left branch on the splits.
    orig_path_coeffs: (list(float)), shape (num paths)
        List of original path coefficients in the objective. We compare the
        solutions using this coefficients to get the original objective value
        without the scaling. The solution represents a valid cut if the
        original objective value is greater than 1.
    """

    def __init__(self, path_vars, split_vars, orig_path_coeffs):
        cp_model.CpSolverSolutionCallback.__init__(self)
        # path_vars are updated after each col gen iter.
        self.path_vars = path_vars
        self.split_vars = split_vars
        self.solution_count = 0
        self.cuts_ = []
        self.cut_rows = []
        self.orig_path_coeffs = orig_path_coeffs

    def on_solution_callback(self):
        self.solution_count += 1
        orig_obj = 0.0
        cut = [False] * len(self.path_vars)
        cut_row = [False] * len(self.split_vars)
        for i in range(len(self.path_vars)):
            v = self.path_vars[i]
            if self.Value(v) > 0:
                orig_obj += self.orig_path_coeffs[i]
                cut[i] = True
        for i in range(len(self.split_vars)):
            v = self.split_vars[i]
            if self.Value(v) > 0:
                cut_row[i] = True
        if orig_obj > 1 + 1e-6:
            self.cuts_.append(cut)
            self.cut_rows.append(cut_row)

    def solution_count(self):
        return self.solution_count

    def cuts(self):
        return (self.cuts_, self.cut_rows)



[docs]
class DTreeMasterProblem(BaseMasterProblem):
    """Master problem for Decision Tree classifier.
    """


[docs]
    def __init__(self, initial_paths, leaves, nodes, splits,
                 beta_constraints_as_cuts=False,
                 generate_cuts=False,
                 num_cuts_round=0,
                 solver_type='glop',
                 data_rows=None):
        super(DTreeMasterProblem, self).__init__(solver_str=solver_type)
        self.cut_gen_model_ = cp_model.CpModel()
        self.cut_obj_scale = 1000000
        self.generated_ = False
        self.paths_ = initial_paths
        self.generated_paths_ = set()
        self.leaves_ = leaves
        self.nodes_ = nodes
        self.splits_ = splits
        self.data_rows = data_rows
        self.num_cuts_round = num_cuts_round
        self.generate_cuts = generate_cuts
        self.beta_constraints_as_cuts = beta_constraints_as_cuts
        self.total_cuts_added = 0
        self.cut_gen_time = 0.0
        self.rmp_objective_ = 0.0
        self.iter_ = 0
        # Experimental
        self.last_reset_iter_ = 0
        self.reset_timer_ = False



[docs]
    def create_cut_gen_model(self):
        """Creates the model for cut generation using a CP-SAT solver."""
        # Takes value 1 if row follows left branch on split (satisfies split)
        self.cut_split_vars_ = [None]*len(self.splits_)
        for i in range(len(self.splits_)):
            self.cut_split_vars_[
                i] = self.cut_gen_model_.NewBoolVar('z_'+str(i))

        # Takes value 1 if row satisfies the path.
        self.cut_path_vars_ = [None]*len(self.paths_)
        for i, path in enumerate(self.paths_):
            self.cut_path_vars_[
                i] = self.cut_gen_model_.NewBoolVar('p_'+str(i))
            literals = []
            split_ids = path.splits
            node_ids = path.node_ids
            leaf = self.leaves_[path.leaf_id]
            for j in range(len(node_ids)):
                split_id = split_ids[j]
                if node_ids[j] in leaf.left_nodes:
                    literals.append(self.cut_split_vars_[split_id])
                else:  # Right node in the path
                    literals.append(self.cut_split_vars_[split_id].Not())
            negated_literals = [x.Not() for x in literals]

            # p => AND(splits)
            self.cut_gen_model_.AddBoolAnd(
                literals).OnlyEnforceIf(self.cut_path_vars_[i])
            # ~p => NOT(AND(splits)) = OR(NOT(splits))
            self.cut_gen_model_.AddBoolOr(
                negated_literals).OnlyEnforceIf(self.cut_path_vars_[i].Not())

        # Split dependency:
        # Split_i: f <= t1,
        # Split_j: f <= t2, t1 < t2,
        # Split_i => Split_j
        for i in range(len(self.splits_)):
            split_i = self.splits_[i]
            threshold_i = split_i.threshold
            split_i_var = self.cut_split_vars_[i]
            for j in range(i+1, len(self.splits_)):
                split_j = self.splits_[j]
                if split_i.feature != split_j.feature:
                    continue
                threshold_j = split_j.threshold
                split_j_var = self.cut_split_vars_[j]
                assert threshold_i != threshold_j
                if threshold_i < threshold_j:
                    self.cut_gen_model_.AddImplication(
                        split_i_var, split_j_var)
                else:
                    self.cut_gen_model_.AddImplication(
                        split_j_var, split_i_var)



[docs]
    def generate_mp(self, X, y):
        """TODO: Documentation.
        """
        if self.generated_:
            return

        self.X_ = X
        self.y_ = y
        self.create_cut_gen_model()
        infinity = self.solver_.infinity()
        self.solver_.SetNumThreads(1)

        objective = self.solver_.Objective()
        self.path_vars_ = [None]*len(self.paths_)
        for i in range(len(self.paths_)):
            xp_var = self.solver_.IntVar(0.0, infinity, 'p_'+str(i))
            self.path_vars_[i] = xp_var.index()
            self.paths_[i].id = xp_var.index()
            # TODO: Compute cost of the path here?
            objective.SetCoefficient(xp_var, self.paths_[i].cost)
        objective.SetMaximization()

        # leaf constraints
        self.leaf_cons_ = [None]*len(self.leaves_)
        for leaf in self.leaves_:
            self.leaf_cons_[leaf.id] = self.solver_.Constraint(
                1, 1, "leaf_"+str(leaf.id))
            for path in self.paths_:
                if path.leaf_id == leaf.id:
                    xp_var = self.solver_.variable(path.id)
                    self.leaf_cons_[leaf.id].SetCoefficient(xp_var, 1)

        if self.data_rows == None:
            self.data_rows = self.preprocess_rows()
        else:
            self.compute_satisfied_paths()
        # row constraints
        n_rows = self.X_.shape[0]
        self.row_cons_ = [None]*n_rows
        self.added_row = [False]*n_rows
        for r in range(n_rows):
            if self.beta_constraints_as_cuts:
                break
            # continue
            if self.data_rows[r].removed_from_master:
                continue
            self.added_row[r] = True
            self.row_cons_[
                r] = self.solver_.Constraint(-infinity, 1, "row_"+str(r))
            for path in self.paths_:
                if row_satisfies_path(X[r], self.leaves_[path.leaf_id],
                                      self.splits_, path):
                    xp_var = self.solver_.variable(path.id)
                    self.row_cons_[r].SetCoefficient(xp_var, 1)

        # Cut constraints to be populated later.
        self.cut_cons_ = []

        # consistency constraints
        self.ns_vars = {}
        self.ns_constraints_ = {}
        for leaf in self.leaves_:
            self.ns_constraints_[leaf.id] = {}
            nodes = leaf.left_nodes + leaf.right_nodes
            for node_id in nodes:
                node = self.nodes_[node_id]
                self.ns_constraints_[leaf.id][node.id] = {}
                for split in node.candidate_splits:

                    ns_var_ind = self.ns_vars[(node.id, split)] \
                        if (node.id, split) in self.ns_vars \
                        else self.solver_.IntVar(0.0, infinity, "r_ns_" +
                                                 str(node.id) +
                                                 "_"+str(split)).index()
                    # else self.solver_.BoolVar("r_ns_" + str(node.id) +
                    #                           "_"+str(split)).index()
                    self.ns_vars[(node.id, split)] = ns_var_ind

                    ns_var = self.solver_.variable(ns_var_ind)

                    ns_constraint = self.solver_.Constraint(
                        0, 0, "ns_"+str(node.id)+"_"+str(split)+"_"
                        + str(leaf.id))

                    ns_constraint.SetCoefficient(ns_var, -1)

                    for path in self.paths_:
                        for i in range(len(path.node_ids)):
                            if path.leaf_id == leaf.id\
                                and path.node_ids[i] == node.id \
                                    and path.splits[i] == split:
                                xp_var = self.solver_.variable(path.id)
                                ns_constraint.SetCoefficient(xp_var, 1)

                    self.ns_constraints_[
                        leaf.id][node.id][split] = ns_constraint
        # print(self.solver_.ExportModelAsLpFormat(False))
        self.generated_ = True



[docs]
    def store_lp_solution(self):
        """Store the solution values in the class."""
        self.rmp_objective_ = self.solver_.Objective().Value()
        self.path_solution_values = {}
        for xp_var_id in self.path_vars_:
            xp_var = self.solver_.variable(xp_var_id)
            self.path_solution_values[xp_var_id] = xp_var.solution_value()


    def compute_satisfied_paths(self):
        for r, data_row in enumerate(self.data_rows):
            for path in self.paths_:
                if row_satisfies_path(self.X_[r], self.leaves_[path.leaf_id],
                                      self.splits_, path):
                    data_row.reachable_paths.add(path.id)


[docs]
    def preprocess_rows(self, aggressive=False):
        """TODO: Documentation."""
        # preprocess nodes first
        n_leaves = len(self.leaves_)
        depth = int(math.log2(n_leaves))
        for node in self.nodes_:
            if node.id == 0:
                node.parent = -1
            else:
                node.parent = math.floor((node.id-1) / 2)

            node.left_child = 2 * node.id + 1
            node.right_child = 2 * node.id + 2
            if node.left_child >= 2**depth - 1:
                node.left_child -= (2**(depth) - 1)
                node.right_child -= (2**(depth) - 1)
                node.children_are_leaves = True

            # print(node.id, node.parent, node.left_child,
            #       node.right_child, node.children_are_leaves)

        # print("Splits:")
        # for split in self.splits_:
        #     feature = split.feature
        #     threshold = split.threshold
        #     print(split.id, feature, threshold)
        data_rows = []
        n_rows = self.X_.shape[0]
        removed_count = 0
        average_leaf_reach = 0.0
        for r in range(n_rows):
            data_row = Row()
            for split in self.splits_:
                feature = split.feature
                threshold = split.threshold
                if self.X_[r, feature] <= threshold:
                    data_row.left_splits.add(split.id)
                else:
                    data_row.right_splits.add(split.id)
            # Record which nodes it can reach.
            for node in self.nodes_:
                if node.parent == -1 or node.id in data_row.reachable_nodes:
                    data_row.reachable_nodes.add(node.id)
                    for split_id in node.candidate_splits:
                        if split_id in data_row.left_splits:
                            if node.children_are_leaves:
                                data_row.reachable_leaves.add(
                                    node.left_child)
                            else:
                                data_row.reachable_nodes.add(
                                    node.left_child)
                        else:
                            if node.children_are_leaves:
                                data_row.reachable_leaves.add(
                                    node.right_child)
                            else:
                                data_row.reachable_nodes.add(
                                    node.right_child)

            average_leaf_reach += len(data_row.reachable_leaves)
            # print(r, data_row.reachable_leaves)
            if len(data_row.reachable_leaves) <= 2:
                # print("Row reaches fewer leaves: ",
                #       r, data_row.reachable_leaves)
                # print(data_row.left_splits)
                # print(data_row.reachable_nodes)
                # print(self.X_[r])
                data_row.removed_from_master = True
                removed_count += 1

            # Record which paths it can follow.
            for path in self.paths_:
                if row_satisfies_path(self.X_[r], self.leaves_[path.leaf_id],
                                      self.splits_, path):
                    data_row.reachable_paths.add(path.id)

            data_rows.append(data_row)

        print("Fewer leaves removed rows: ", removed_count)
        average_leaf_reach /= n_rows
        print("Average leaf reach: ", average_leaf_reach)

        # Quadratic loop
        if not aggressive:
            return data_rows
        for r1 in range(n_rows):
            if data_rows[r1].removed_from_master:
                continue
            for r2 in range(r1+1, n_rows):
                if data_rows[r2].removed_from_master:
                    continue

                if data_rows[r1].reachable_nodes != \
                        data_rows[r2].reachable_nodes:
                    continue

                found_mismatch_split = False
                valid_splits = set()
                for node_id in data_rows[r1].reachable_nodes:
                    node = self.nodes_[node_id]
                    for split_id in node.candidate_splits:
                        valid_splits.add(split_id)

                for split_id in valid_splits:
                    left_membership_r1 = split_id in data_rows[r1].left_splits
                    left_membership_r2 = split_id in data_rows[r2].left_splits
                    if left_membership_r1 != left_membership_r2:
                        found_mismatch_split = True
                        break

                if not found_mismatch_split:
                    data_rows[r2].removed_from_master = True
                    removed_count += 1
                    # print("Rows are similar: ", r1, r2)

        print("Total removed rows: ", removed_count)
        return data_rows



[docs]
    def add_column(self, path):
        """TODO: Documentation.
        """
        # print("Adding path")
        # path.print_path()
        objective = self.solver_.Objective()
        infinity = self.solver_.infinity()
        xp_var = self.solver_.IntVar(
            0.0, infinity, 'p_'+str(len(self.path_vars_)))
        self.path_vars_.append(xp_var.index())
        path.id = xp_var.index()
        objective.SetCoefficient(xp_var, path.cost)
        self.paths_.append(path)
        self.generated_paths_.add(path.id)

        # leaf constraints
        for leaf in self.leaves_:
            if path.leaf_id == leaf.id:
                self.leaf_cons_[leaf.id].SetCoefficient(xp_var, 1)

        # row constraints
        satisfied_rows = get_satisfied_rows(
            path, self.leaves_[path.leaf_id], self.splits_)
        for r in satisfied_rows:
            self.data_rows[r].reachable_paths.add(path.id)
            if not self.added_row[r]:
                continue
            self.row_cons_[r].SetCoefficient(xp_var, 1)

        # consistency constraints
        for leaf in self.leaves_:
            nodes = leaf.left_nodes + leaf.right_nodes
            for node_id in nodes:
                node = self.nodes_[node_id]
                for split in node.candidate_splits:

                    ns_constraint = self.ns_constraints_[
                        leaf.id][node.id][split]

                    for i in range(len(path.node_ids)):
                        if path.leaf_id == leaf.id\
                            and path.node_ids[i] == node.id \
                                and path.splits[i] == split:
                            ns_constraint.SetCoefficient(xp_var, 1)

        # Cut constraints
        satisfied_cuts = get_satisfied_cuts(
            path, self.leaves_[path.leaf_id], self.splits_)
        for c in satisfied_cuts:
            self.cut_cons_[c].SetCoefficient(xp_var, 1)

        # Cut generation model.
        new_cut_path_var = self.cut_gen_model_.NewBoolVar(
            'p_'+str(len(self.cut_path_vars_)))
        self.cut_path_vars_.append(new_cut_path_var)
        literals = []
        split_ids = path.splits
        node_ids = path.node_ids
        leaf = self.leaves_[path.leaf_id]
        for j in range(len(node_ids)):
            split_id = split_ids[j]
            if node_ids[j] in leaf.left_nodes:
                literals.append(self.cut_split_vars_[split_id])
            else:  # Right node in the path
                literals.append(self.cut_split_vars_[split_id].Not())
        negated_literals = [x.Not() for x in literals]

        # p => AND(splits)
        self.cut_gen_model_.AddBoolAnd(
            literals).OnlyEnforceIf(new_cut_path_var)
        # ~p => NOT(AND(splits)) = OR(NOT(splits))
        self.cut_gen_model_.AddBoolOr(
            negated_literals).OnlyEnforceIf(new_cut_path_var.Not())
        return True


    def add_cuts(self):
        # Create the objective for the cut generation model
        path_coeffs = []
        orig_path_coeffs = []
        for path in self.paths_:
            id = path.id
            orig_path_coeff = self.path_solution_values[id]
            path_coeff = int(
                floor(orig_path_coeff * self.cut_obj_scale))
            path_coeffs.append(path_coeff)
            orig_path_coeffs.append(orig_path_coeff)
        obj = cp_model.LinearExpr.WeightedSum(self.cut_path_vars_, path_coeffs)
        self.cut_gen_model_.Maximize(obj)

        sp_solver = cp_model.CpSolver()
        sp_solver.parameters.num_search_workers = 4

        # Solve
        cut_generator = CutGenerator(
            self.cut_path_vars_, self.cut_split_vars_, orig_path_coeffs)
        status = sp_solver.Solve(self.cut_gen_model_, cut_generator)
        print("Cut generation status: ", sp_solver.StatusName(status))
        print("Cut generation objective: ", sp_solver.ObjectiveValue())
        cuts, cut_rows = cut_generator.cuts()
        if len(cuts) == 0:
            return []

        # Update splits to store cut information.
        num_existing_cuts = len(self.cut_cons_)
        for i, cut_row in enumerate(cut_rows):
            cut_index = num_existing_cuts + i
            for split in self.splits_:
                if cut_row[split.id]:
                    split.left_cuts.add(cut_index)
                else:
                    split.right_cuts.add(cut_index)

        for cut in cuts:
            r = len(self.cut_cons_)
            self.cut_cons_.append(self.solver_.Constraint(
                1, 1, "cut_"+str(r)))
            self.added_row.append(True)

            for i, path in enumerate(self.paths_):
                if cut[i]:
                    xp_var = self.solver_.variable(path.id)
                    self.cut_cons_[r].SetCoefficient(xp_var, 1)

        return cut_rows


[docs]
    def get_satisfied_path_ids(self, row):
        """TODO: Documentation."""
        return list(self.data_rows[row].reachable_paths)



[docs]
    def violated_row_constraint(self, satisfied_path_ids):
        """TODO: Documentation."""
        path_sum = 0.0
        for path_id in satisfied_path_ids:
            path_sum += self.path_solution_values[path_id]
        if abs(path_sum - 1) > 1e-6:
            return True
        return False


    def add_beta_cuts(self):
        assert self.generated_
        n_rows = self.X_.shape[0]
        # added_cuts = False
        infinity = self.solver_.infinity()
        num_cuts = 0
        for r in range(n_rows):
            if self.data_rows[r].removed_from_master:
                continue
            if num_cuts >= 10:
                break
            if self.added_row[r]:
                continue
            satisfied_path_ids = self.get_satisfied_path_ids(r)
            if (self.violated_row_constraint(satisfied_path_ids)):
                # added_cuts = True
                num_cuts += 1
                self.added_row[r] = True
                self.row_cons_[r] = self.solver_.Constraint(
                    -infinity, 1, "row_"+str(r))
                for path_id in satisfied_path_ids:
                    xp_var = self.solver_.variable(path_id)
                    self.row_cons_[r].SetCoefficient(xp_var, 1)
        return num_cuts

    def remove_gen_vars(self):
        print("Removing the generated paths. Setting UB to 0.")
        self.last_reset_iter_ = self.iter_
        self.reset_timer_ = True
        for path in self.paths_:
            if not path.initial:
                xp_var = self.solver_.variable(path.id)
                xp_var.SetUb(0)


[docs]
    def solve_rmp(self, solver_params=''):
        """TODO: Documentation.
        """
        assert self.generated_
        self.prev_rmp_objective_ = self.rmp_objective_
        self.iter_ += 1

        self.rmp_result_status = self.solver_.Solve(
            get_params_from_string(solver_params))

        # TODO: Hide this under optional logging flag.
        print('Number of variables RMIP = %d' % self.solver_.NumVariables())
        print('Number of constraints RMIP = %d' %
              self.solver_.NumConstraints())
        newly_added_cuts = []
        if self.iter_ % 10 == 0:
            for i in range(self.num_cuts_round):
                if self.rmp_result_status == pywraplp.Solver.OPTIMAL:
                    # print("Before cut iter ", i)
                    # print(self.solver_.ExportModelAsLpFormat(False))
                    self.store_lp_solution()
                    t_start = time()
                    cut_rows = []
                    num_beta_cuts = 0
                    if self.beta_constraints_as_cuts:
                        num_beta_cuts = self.add_beta_cuts()
                    if self.generate_cuts and num_beta_cuts == 0:
                        cut_rows = self.add_cuts()
                    t_end = time()
                    self.cut_gen_time += t_end - t_start
                    self.total_cuts_added += len(cut_rows) + num_beta_cuts
                    print("Beta cuts added: ", num_beta_cuts)
                    print("Total cuts added: ", self.total_cuts_added)
                    print("Time spent in cut gen: ", self.cut_gen_time)
                    if (len(cut_rows) > 0) or num_beta_cuts > 0:
                        # Experiment: Cuts added. Remove the generated vars.
                        # self.remove_gen_vars()
                        cuts_params = get_params_from_string(solver_params)
                        cuts_params.SetIntegerParam(
                            pywraplp.MPSolverParameters.LP_ALGORITHM,
                            pywraplp.MPSolverParameters.DUAL)
                        self.rmp_result_status = self.solver_.Solve(
                            cuts_params)
                        if len(cut_rows) > 0:
                            newly_added_cuts.extend(cut_rows)
                    else:
                        break
                else:
                    break
        leaf_duals = []
        row_duals = []
        cut_duals = []
        ns_duals = {}
        if self.rmp_result_status == pywraplp.Solver.OPTIMAL:
            self.store_lp_solution()
            print('RMP Optimal objective value = %f' %
                  self.solver_.Objective().Value())

            # for var in self.solver_.variables():
            #     print(var.name(), var.solution_value())

            for lid in range(len(self.leaves_)):
                leaf_duals.append(self.leaf_cons_[lid].dual_value())

            n_rows = self.X_.shape[0]
            for r in range(n_rows):
                if self.added_row[r]:
                    row_duals.append(self.row_cons_[r].dual_value())
                else:
                    row_duals.append(0)

            for cut_cons in self.cut_cons_:
                cut_duals.append(cut_cons.dual_value())

            for leaf in self.leaves_:
                ns_duals[leaf.id] = {}
                nodes = leaf.left_nodes + leaf.right_nodes
                for node_id in nodes:
                    node = self.nodes_[node_id]
                    ns_duals[leaf.id][node.id] = {}
                    for split in node.candidate_splits:
                        ns_duals[leaf.id][node.id][split] = \
                            self.ns_constraints_[
                            leaf.id][node.id][split].dual_value()
        else:
            print("Master problem not solved correctly:",
                  self.rmp_result_status)
            # print(self.solver_.ExportModelAsLpFormat(False))

        return (leaf_duals, row_duals, ns_duals, cut_duals, newly_added_cuts)



[docs]
    def rmp_objective_improved(self):
        if self.rmp_result_status == pywraplp.Solver.OPTIMAL:
            return self.prev_rmp_objective_ < self.rmp_objective_
        else:
            return False



[docs]
    def solve_ip(self, solver_params=''):
        """Solves the integer RMP with given solver params.
        Returns True if the tree is generated.

        Parameters
        ----------
        solver_params : string, default='',
            The solver parameters for solving the integer RMP.
        """
        assert self.generated_

        # We can use sat here since all the coefficients and variables are
        # integer.
        # TODO: Solver type should be a parameter.
        solver = pywraplp.Solver.CreateSolver("sat")

        # We have to load the model from LP solver.
        # This copy is not avoidable with OR-Tools since we are now switching
        # form solving LP to solving IP.
        model_proto = linear_solver_pb2.MPModelProto()
        self.solver_.ExportModelToProto(model_proto)
        solver.LoadModelFromProto(model_proto)

        num_path_vars = len(self.path_vars_)

        result_status = solver.Solve(get_params_from_string(solver_params))
        print('Problem solved in %f milliseconds' % solver.wall_time())
        self.selected_paths = []

        has_solution = (
            result_status == pywraplp.Solver.OPTIMAL or
            result_status == pywraplp.Solver.FEASIBLE)
        assert has_solution

        # TODO: Store solution instead of printing it.
        print("Integer RMP Objective = ", solver.Objective().Value())
        all_vars = solver.variables()

        for path in self.paths_:
            path_var = solver.variable(path.id)
            if path_var.solution_value() > 0:
                self.selected_paths.append(path)
        return True





[docs]
class DTreeSubProblem(BaseSubproblem):
    """TODO: Documentation."""


[docs]
    def __init__(self, leaf, nodes, splits, targets,
                 depth, data_rows=None,
                 solver_str='sat') -> None:
        super().__init__()
        self.leaf_id_ = leaf.id
        self.leaf_ = leaf
        self.nodes_ = nodes
        self.splits_ = splits
        self.targets_ = targets
        self.tree_depth_ = depth
        self.solver_ = pywraplp.Solver.CreateSolver(solver_str)
        self.generated_ = False
        self.z_vars_ = None
        self.yc_vars_ = None
        self.data_rows = data_rows
        self.cut_rows_starting_index = 0
        self.iter_ = 0



[docs]
    def create_submip(self, leaf_dual, row_duals, ns_duals, cut_duals):
        """TODO: Documentation.
        """
        assert not self.generated_, "SP is already created."
        infinity = self.solver_.infinity()
        self.solver_.SetNumThreads(7)

        n_rows = self.X_.shape[0]
        # Binary variables indicating that row reaches the leaf and has the
        # correct target.
        self.z_vars_ = [None]*n_rows
        # Binary variables indicating that row reaches the leaf.
        self.y_vars_ = [None]*n_rows
        objective = self.solver_.Objective()
        for i in range(n_rows):
            obj_coeff = 1
            if self.data_rows != None:
                if self.data_rows[i].removed_from_sp:
                    continue
                obj_coeff = self.data_rows[i].weight
            z_var = self.solver_.BoolVar('z_'+str(i))
            objective.SetCoefficient(z_var, obj_coeff)
            self.z_vars_[i] = z_var.index()

            y_var = self.solver_.BoolVar('y_'+str(i))
            row_dual = row_duals[i]
            objective.SetCoefficient(y_var, -row_dual)
            self.y_vars_[i] = y_var.index()

        num_cuts = len(cut_duals)
        self.yc_vars_ = [None]*num_cuts
        if num_cuts > 0:
            for i in range(num_cuts):
                yc_var = self.solver_.BoolVar('yc_'+str(i))
                cut_dual = cut_duals[i]
                objective.SetCoefficient(yc_var, -cut_dual)
                self.yc_vars_[i] = yc_var.index()

        # Binary variables u_j_a indicating that split s is assigned to the
        # node j.
        self.u_vars_ = {}
        self.leaf_
        nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
        all_splits = []
        for node_id in nodes:
            node = self.nodes_[node_id]
            self.u_vars_[node.id] = {}
            for split in node.candidate_splits:
                all_splits.append(split)
                u_var = self.solver_.BoolVar(
                    'u_'+str(node.id)+'_'+str(split))
                ns_dual = ns_duals[self.leaf_id_][node.id][split]
                objective.SetCoefficient(u_var, -ns_dual)
                self.u_vars_[node.id][split] = u_var.index()

        objective.SetOffset(-leaf_dual)
        objective.SetMaximization()

        # Constraints
        # For each node, exactly one feature is selected
        for node_id in nodes:
            node = self.nodes_[node_id]
            one_feature = self.solver_.Constraint(
                1, 1, "one_feature_" + str(node.id))
            for split in node.candidate_splits:
                u_var = self.solver_.variable(self.u_vars_[node.id][split])
                one_feature.SetCoefficient(u_var, 1)

        # Each feature is selected at max once in a path
        # all_splits = list(dict.fromkeys(all_splits))
        # for split in all_splits:
        #     unique_split = self.solver_.Constraint(
        #         0, 1, "unique_split_" + str(split))
        #     for node_id in nodes:
        #         node = self.nodes_[node_id]
        #         if split not in node.candidate_splits:
        #             continue
        #         u_var = self.solver_.variable(self.u_vars_[node.id][split])
        #         unique_split.SetCoefficient(u_var, 1)

        # Row follows correct path (upper and lower bound on y var)
        for i in range(n_rows):
            if self.data_rows != None:
                if self.data_rows[i].removed_from_sp:
                    continue
            y_lb_cons = self.solver_.Constraint(
                -infinity, self.tree_depth_-1, "row_" + str(i))
            y_var = self.solver_.variable(self.y_vars_[i])
            y_lb_cons.SetCoefficient(y_var, -1)
            for node_id in nodes:
                node = self.nodes_[node_id]
                rn_cons = self.solver_.Constraint(
                    -infinity, 0, "rn_" + str(i) + '_' + str(node.id))

                rn_cons.SetCoefficient(y_var, 1)
                if node.id in self.leaf_.left_nodes:
                    for split in node.candidate_splits:
                        if self.row_satisfies_split(i, self.splits_[split]):
                            u_var = self.solver_.variable(
                                self.u_vars_[node.id][split])
                            rn_cons.SetCoefficient(u_var, -1)
                            y_lb_cons.SetCoefficient(u_var, 1)
                elif node.id in self.leaf_.right_nodes:
                    for split in node.candidate_splits:
                        if not self.row_satisfies_split(i,
                                                        self.splits_[split]):
                            u_var = self.solver_.variable(
                                self.u_vars_[node.id][split])
                            rn_cons.SetCoefficient(u_var, -1)
                            y_lb_cons.SetCoefficient(u_var, 1)

        # only one class selected
        # Binary variables indicating that the class is selected.
        self.w_vars_ = [None]*len(self.targets_)
        single_target_cons = self.solver_.Constraint(0, 1, 'single_target')
        for i in range(len(self.targets_)):
            target = self.targets_[i]
            w_var = self.solver_.BoolVar('w_' + str(target))
            self.w_vars_[i] = w_var.index()
            single_target_cons.SetCoefficient(w_var, 1)
        # relation between w,y,z vars
        for i in range(n_rows):
            if self.data_rows != None:
                if self.data_rows[i].removed_from_sp:
                    continue
            z_var = self.solver_.variable(self.z_vars_[i])
            y_var = self.solver_.variable(self.y_vars_[i])
            # z <= y (row reaches leaf)
            yz_cons = self.solver_.Constraint(-infinity, 0, "yz_" + str(i))
            yz_cons.SetCoefficient(z_var, 1)
            yz_cons.SetCoefficient(y_var, -1)

            # z <= w (row has correct target)
            correct_class = self.y_[i]
            w_var = self.solver_.variable(self.w_vars_[correct_class])
            wz_cons = self.solver_.Constraint(-infinity, 0, "wz_" + str(i))
            wz_cons.SetCoefficient(z_var, 1)
            wz_cons.SetCoefficient(w_var, -1)

        self.generated_ = True



[docs]
    def update_objective(self, leaf_dual, row_duals, ns_duals, cut_duals):
        """TODO: Documentation.
        """
        objective = self.solver_.Objective()
        n_rows = self.X_.shape[0]
        for i in range(n_rows):
            obj_coeff = 1
            if self.data_rows != None:
                if self.data_rows[i].removed_from_sp:
                    continue
                obj_coeff = self.data_rows[i].weight
            z_var = self.solver_.variable(self.z_vars_[i])
            objective.SetCoefficient(z_var, obj_coeff)

            y_var = self.solver_.variable(self.y_vars_[i])
            row_dual = row_duals[i]
            objective.SetCoefficient(y_var, -row_dual)

        num_cuts = len(cut_duals)
        existing_cuts = len(self.yc_vars_)
        if num_cuts > 0:
            for i in range(num_cuts):
                yc_var = None
                if i >= existing_cuts:
                    yc_var = self.solver_.BoolVar('yc_'+str(i))
                    self.yc_vars_.append(yc_var.index())
                else:
                    yc_var = self.solver_.variable(self.yc_vars_[i])
                cut_dual = cut_duals[i]
                objective.SetCoefficient(yc_var, -cut_dual)

        nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
        for node_id in nodes:
            node = self.nodes_[node_id]
            for split in node.candidate_splits:
                u_var = self.solver_.variable(self.u_vars_[node.id][split])
                ns_dual = ns_duals[self.leaf_id_][node.id][split]
                objective.SetCoefficient(u_var, -ns_dual)

        objective.SetOffset(-leaf_dual)
        objective.SetMaximization()
        return


    def add_cut_rows(self, cut_rows):
        assert self.generated_
        infinity = self.solver_.infinity()
        nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
        n_cut_rows = len(cut_rows)
        # Row follows correct path (upper and lower bound on y var)
        for i in range(n_cut_rows):
            cut_index = i + self.cut_rows_starting_index
            yc_lb_cons = self.solver_.Constraint(
                -infinity, self.tree_depth_-1, "cut_" + str(cut_index))
            # yc_var is created in update_objective method.
            yc_var = self.solver_.variable(self.yc_vars_[cut_index])
            yc_lb_cons.SetCoefficient(yc_var, -1)
            for node_id in nodes:
                node = self.nodes_[node_id]
                cn_cons = self.solver_.Constraint(
                    -infinity, 0, "cn_" + str(cut_index) + '_' + str(node.id))

                cn_cons.SetCoefficient(yc_var, 1)
                if node.id in self.leaf_.left_nodes:
                    for split in node.candidate_splits:
                        if cut_rows[i][split]:
                            u_var = self.solver_.variable(
                                self.u_vars_[node.id][split])
                            cn_cons.SetCoefficient(u_var, -1)
                            yc_lb_cons.SetCoefficient(u_var, 1)
                elif node.id in self.leaf_.right_nodes:
                    for split in node.candidate_splits:
                        if not cut_rows[i][split]:
                            u_var = self.solver_.variable(
                                self.u_vars_[node.id][split])
                            cn_cons.SetCoefficient(u_var, -1)
                            yc_lb_cons.SetCoefficient(u_var, 1)
        self.cut_rows_starting_index += n_cut_rows


[docs]
    def generate_columns(self, X, y, dual_costs, params=""):
        """TODO: Documentation.
        """

        # Solve sub problem
        result_status = self.solver_.Solve(get_params_from_string(params))

        has_solution = (result_status == pywraplp.Solver.OPTIMAL or
                        result_status == pywraplp.Solver.FEASIBLE)

        # The current path is always feasible.
        # # TODO: Fix this. Sat solver returns infeasible sometimes
        # (even without cuts).
        if not has_solution:
            return []
            # print(self.solver_.ExportModelAsLpFormat(False))
        assert has_solution, "Result status: " + str(result_status)

        # if self.leaf_id_ == 4:
        #     sp_mps = self.solver_.ExportModelAsMpsFormat(fixed_format=False,
        #                                                  obfuscated=False)
        #     name = "sp_" + str(self.tree_depth_) + "_" + \
        #         str(self.iter_) + ".mps"
        #     self.iter_ += 1
        #     f = open(name, "w")
        #     f.write(sp_mps)
        #     f.close()

        # TODO: This threshold should be a parameter.
        print("Subproblem ", self.leaf_id_, " objective = ",
              self.solver_.Objective().Value())
        if self.solver_.Objective().Value() <= 1e-6:
            return []

        nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
        path = Path()
        path.leaf_id = self.leaf_id_
        path.node_ids = []
        path.splits = []
        path.cost = 0
        path.id = -1
        for node in self.nodes_:
            if node.id not in nodes:
                continue
            path.node_ids.append(node.id)
            for split in node.candidate_splits:
                u_var = self.solver_.variable(self.u_vars_[node.id][split])
                if u_var.solution_value() > 0.5:
                    path.splits.append(split)
                    break
        for target in self.targets_:
            w_var = self.solver_.variable(self.w_vars_[target])
            if w_var.solution_value() > 0.5:
                path.target = target
                break
        n_rows = self.X_.shape[0]
        for i in range(n_rows):
            obj_coeff = 1
            if self.data_rows != None:
                if self.data_rows[i].removed_from_sp:
                    continue
                obj_coeff = self.data_rows[i].weight
            z_var = self.solver_.variable(self.z_vars_[i])

            path.cost += obj_coeff * z_var.solution_value()
            # if z_var.solution_value() > 0.5:
            #     print(self.z_vars_[i], z_var)
        return [path]



[docs]
    def update_subproblem(self, X, y, dual_costs):
        self.X_ = X
        self.y_ = y
        leaf_dual = dual_costs[0][self.leaf_id_]
        row_duals = dual_costs[1]
        ns_duals = dual_costs[2]
        cut_duals = dual_costs[3]
        cut_rows = dual_costs[4]

        if self.generated_:
            self.update_objective(leaf_dual, row_duals, ns_duals, cut_duals)
        else:
            self.create_submip(leaf_dual, row_duals, ns_duals, cut_duals)

        if len(cut_rows) > 0:
            self.add_cut_rows(cut_rows)



[docs]
    def row_satisfies_split(self, row, split):
        """TODO: Documentation.
        """
        feature = split.feature
        threshold = split.threshold
        if self.X_[row, feature] <= threshold:
            return True
        return False




class PathGenerator(cp_model.CpSolverSolutionCallback):
    """Track intermediate solutions of sp.

    Attributes
    ----------
    TODO
    """

    def __init__(self, u_vars, w_vars, z_vars, y_vars,
                 yc_vars,
                 leaf_dual,
                 leaf, targets,
                 nodes,
                 orig_obj_coeffs):
        cp_model.CpSolverSolutionCallback.__init__(self)
        self.u_vars = u_vars
        self.w_vars = w_vars
        self.z_vars = z_vars
        self.y_vars = y_vars
        self.yc_vars = yc_vars
        self.leaf = leaf
        self.leaf_dual = leaf_dual
        self.targets = targets
        self.nodes = nodes
        self.orig_obj_coeffs = orig_obj_coeffs
        self.solution_count = 0
        self.paths = []
        self.original_objective = 0.0

    def on_solution_callback(self):
        self.solution_count += 1
        original_objective = 0.0
        path_cost = 0.0
        for z_var in self.z_vars:
            if z_var == None:
                continue
            path_cost += self.orig_obj_coeffs[z_var] * \
                self.Value(z_var)

        original_objective += path_cost

        for y_var in self.y_vars:
            if y_var == None:
                continue
            original_objective += self.orig_obj_coeffs[y_var] * \
                self.Value(y_var)

        for yc_var in self.yc_vars:
            original_objective += self.orig_obj_coeffs[yc_var] * \
                self.Value(yc_var)

        nodes = self.leaf.left_nodes + self.leaf.right_nodes
        for node_id in nodes:
            node = self.nodes[node_id]
            for split in node.candidate_splits:
                u_var = self.u_vars[node.id][split]
                original_objective += self.orig_obj_coeffs[u_var] * \
                    self.Value(u_var)

        original_objective -= self.leaf_dual

        self.original_objective = original_objective
        if original_objective <= 1e-6:
            return

        nodes = self.leaf.left_nodes + self.leaf.right_nodes
        path = Path()
        path.leaf_id = self.leaf.id
        path.node_ids = []
        path.splits = []
        path.cost = path_cost
        path.id = -1
        for node_id in nodes:
            node = self.nodes[node_id]
            path.node_ids.append(node.id)
            for split in node.candidate_splits:
                u_var = self.u_vars[node.id][split]
                if self.Value(u_var) > 0.5:
                    path.splits.append(split)
                    break
        for target in self.targets:
            w_var = self.w_vars[target]
            if self.Value(w_var) > 0.5:
                path.target = target
                break
        self.paths.append(path)
        # print("Path cost", path_cost)


class DTreeSubProblemSat(BaseSubproblem):
    """TODO: Documentation."""

    def __init__(self, leaf, nodes, splits, targets,
                 depth, data_rows=None) -> None:
        super().__init__()
        self.leaf_id_ = leaf.id
        self.leaf_ = leaf
        self.nodes_ = nodes
        self.splits_ = splits
        self.targets_ = targets
        self.tree_depth_ = depth
        self.model_ = cp_model.CpModel()
        self.objective_scale = 100000
        self.orig_obj_coeffs = {}
        self.generated_ = False
        self.z_vars_ = None
        self.yc_vars_ = None
        self.data_rows = data_rows
        self.cut_rows_starting_index = 0

    def scale_obj_coeff(self, value):
        return int(round(value * self.objective_scale))

    def create_submip(self, leaf_dual, row_duals, ns_duals, cut_duals):
        """TODO: Documentation.
        """
        assert not self.generated_, "SP is already created."

        # obj = cp_model.LinearExpr()

        n_rows = self.X_.shape[0]
        # Binary variables indicating that row reaches the leaf and has the
        # correct target.
        self.z_vars_ = [None]*n_rows
        # Binary variables indicating that row reaches the leaf.
        self.y_vars_ = [None]*n_rows
        obj_vars = []
        obj_coeffs = []

        for i in range(n_rows):
            z_coeff = 1
            if self.data_rows != None:
                if self.data_rows[i].removed_from_sp:
                    continue
                z_coeff = self.data_rows[i].weight
            z_var = self.model_.NewBoolVar('z_'+str(i))
            self.orig_obj_coeffs[z_var] = z_coeff
            z_coeff = self.scale_obj_coeff(z_coeff)
            self.z_vars_[i] = (z_var)
            # obj += cp_model.LinearExpr.Term(z_var, z_coeff)
            obj_vars.append(z_var)
            obj_coeffs.append(z_coeff)

            y_var = self.model_.NewBoolVar('y_'+str(i))
            self.orig_obj_coeffs[y_var] = -row_duals[i]
            y_coeff = self.scale_obj_coeff(-row_duals[i])
            self.y_vars_[i] = (y_var)
            # obj += cp_model.LinearExpr.Term(y_var, y_coeff)
            obj_vars.append(y_var)
            obj_coeffs.append(y_coeff)

        num_cuts = len(cut_duals)
        self.yc_vars_ = [None]*num_cuts
        if num_cuts > 0:
            for i in range(num_cuts):
                yc_var = self.model_.NewBoolVar('yc_'+str(i))
                self.orig_obj_coeffs[yc_var] = -cut_duals[i]
                yc_coeff = self.scale_obj_coeff(-cut_duals[i])
                self.yc_vars_[i] = yc_var
                # obj += cp_model.LinearExpr.Term(yc_var, yc_coeff)
                obj_vars.append(yc_var)
                obj_coeffs.append(yc_coeff)

        # Binary variables u_j_a indicating that split s is assigned to the
        # node j.
        self.u_vars_ = {}
        nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
        all_splits = []
        for node_id in nodes:
            node = self.nodes_[node_id]
            self.u_vars_[node.id] = {}
            for split in node.candidate_splits:
                all_splits.append(split)
                u_var = self.model_.NewBoolVar(
                    'u_'+str(node.id)+'_'+str(split))
                self.orig_obj_coeffs[u_var] = \
                    -ns_duals[self.leaf_id_][node.id][split]
                u_coeff = self.scale_obj_coeff(
                    -ns_duals[self.leaf_id_][node.id][split])
                self.u_vars_[node.id][split] = u_var
                # obj += cp_model.LinearExpr.Term(u_var, u_coeff)
                obj_vars.append(u_var)
                obj_coeffs.append(u_coeff)

        # obj += -self.scale_obj_coeff(leaf_dual)
        obj = cp_model.LinearExpr.WeightedSum(obj_vars, obj_coeffs) - leaf_dual
        self.model_.Maximize(obj)

        # Constraints
        # For each node, exactly one feature is selected
        for node_id in nodes:
            node = self.nodes_[node_id]
            cons_vars = []
            for split in node.candidate_splits:
                u_var = self.u_vars_[node.id][split]
                cons_vars.append(u_var)
            self.model_.Add(cp_model.LinearExpr.Sum(cons_vars) == 1)

        # Each feature is selected at max once in a path
        # all_splits = list(dict.fromkeys(all_splits))
        # for split in all_splits:
        #     cons_vars = []
        #     for node_id in nodes:
        #         node = self.nodes_[node_id]
        #         if split not in node.candidate_splits:
        #             continue
        #         u_var = self.u_vars_[node.id][split]
        #         cons_vars.append(u_var)
        #     self.model_.Add(cp_model.LinearExpr.Sum(cons_vars) <= 1)

        # Row follows correct path (upper and lower bound on y var)
        for i in range(n_rows):
            if self.data_rows != None:
                if self.data_rows[i].removed_from_sp:
                    continue
            y_lb_expr = cp_model.LinearExpr.Term(self.y_vars_[i], -1)
            for node_id in nodes:
                node = self.nodes_[node_id]
                rn_expr = cp_model.LinearExpr.Term(self.y_vars_[i], 1)

                if node.id in self.leaf_.left_nodes:
                    for split in node.candidate_splits:
                        if self.row_satisfies_split(i, self.splits_[split]):
                            u_var = self.u_vars_[node.id][split]
                            rn_expr += cp_model.LinearExpr.Term(u_var, -1)
                            y_lb_expr += cp_model.LinearExpr.Term(u_var, 1)
                elif node.id in self.leaf_.right_nodes:
                    for split in node.candidate_splits:
                        if not self.row_satisfies_split(i,
                                                        self.splits_[split]):
                            u_var = self.u_vars_[node.id][split]
                            rn_expr += cp_model.LinearExpr.Term(u_var, -1)
                            y_lb_expr += cp_model.LinearExpr.Term(u_var, 1)
                self.model_.Add(rn_expr <= 0)
            self.model_.Add(y_lb_expr <= self.tree_depth_-1)

        # only one class selected
        # Binary variables indicating that the class is selected.
        self.w_vars_ = [None]*len(self.targets_)
        for i in range(len(self.targets_)):
            target = self.targets_[i]
            w_var = self.model_.NewBoolVar('w_'+str(target))
            self.w_vars_[i] = w_var
        # Note: This can be added as <= 1 inequality because of the objective.
        self.model_.Add(cp_model.LinearExpr.Sum(self.w_vars_) == 1)

        # relation between w,y,z vars
        for i in range(n_rows):
            if self.data_rows != None:
                if self.data_rows[i].removed_from_sp:
                    continue
            # z <= y (row reaches leaf)
            self.model_.Add(self.z_vars_[i] <= self.y_vars_[i])

            # z <= w (row has correct target)
            correct_class = self.y_[i]
            self.model_.Add(self.z_vars_[i] <= self.w_vars_[correct_class])

        self.generated_ = True

    def update_objective(self, leaf_dual, row_duals, ns_duals, cut_duals):
        """TODO: Documentation.
        """
        assert self.generated_, "Called update_objective before generating SP"
        # obj = cp_model.LinearExpr()
        # TODO: This method is slow. Make it faster.
        obj_vars = []
        obj_coeffs = []

        n_rows = self.X_.shape[0]

        for i in range(n_rows):
            z_coeff = 1
            if self.data_rows != None:
                if self.data_rows[i].removed_from_sp:
                    continue
                z_coeff = self.data_rows[i].weight
            z_var = self.z_vars_[i]
            self.orig_obj_coeffs[z_var] = z_coeff
            z_coeff = self.scale_obj_coeff(z_coeff)

            # obj += cp_model.LinearExpr.Term(z_var, z_coeff)
            obj_vars.append(z_var)
            obj_coeffs.append(z_coeff)

            y_coeff = self.scale_obj_coeff(-row_duals[i])
            y_var = self.y_vars_[i]
            self.orig_obj_coeffs[y_var] = -row_duals[i]
            # obj += cp_model.LinearExpr.Term(y_var, y_coeff)
            obj_vars.append(y_var)
            obj_coeffs.append(y_coeff)

        num_cuts = len(cut_duals)
        existing_cuts = len(self.yc_vars_)
        if num_cuts > 0:
            for i in range(num_cuts):
                yc_var = None
                if i >= existing_cuts:
                    yc_var = self.model_.NewBoolVar('yc_'+str(i))
                    self.yc_vars_.append(yc_var)
                else:
                    yc_var = self.yc_vars_[i]
                yc_coeff = self.scale_obj_coeff(-cut_duals[i])
                self.orig_obj_coeffs[yc_var] = -cut_duals[i]
                # obj += cp_model.LinearExpr.Term(yc_var, yc_coeff)
                obj_vars.append(yc_var)
                obj_coeffs.append(yc_coeff)

        # Binary variables u_j_a indicating that split s is assigned to the
        # node j.
        nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
        for node_id in nodes:
            node = self.nodes_[node_id]
            for split in node.candidate_splits:
                u_coeff = self.scale_obj_coeff(
                    -ns_duals[self.leaf_id_][node.id][split])
                u_var = self.u_vars_[node.id][split]
                self.orig_obj_coeffs[u_var] = \
                    -ns_duals[self.leaf_id_][node.id][split]
                # obj += cp_model.LinearExpr.Term(u_var, u_coeff)
                obj_vars.append(u_var)
                obj_coeffs.append(u_coeff)

        # obj -= self.scale_obj_coeff(leaf_dual)
        obj = cp_model.LinearExpr.WeightedSum(obj_vars, obj_coeffs) - leaf_dual
        self.model_.Maximize(obj)
        return

    def add_cut_rows(self, cut_rows):
        assert self.generated_
        nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
        n_cut_rows = len(cut_rows)
        # Row follows correct path (upper and lower bound on y var)
        for i in range(n_cut_rows):
            cut_index = i + self.cut_rows_starting_index
            yc_lb_expr = cp_model.LinearExpr.Term(self.yc_vars_[cut_index], -1)
            for node_id in nodes:
                node = self.nodes_[node_id]
                cn_expr = cp_model.LinearExpr.Term(self.yc_vars_[cut_index], 1)

                if node.id in self.leaf_.left_nodes:
                    for split in node.candidate_splits:
                        if cut_rows[i][split]:
                            u_var = self.u_vars_[node.id][split]
                            cn_expr += cp_model.LinearExpr.Term(u_var, -1)
                            yc_lb_expr += cp_model.LinearExpr.Term(u_var, 1)
                elif node.id in self.leaf_.right_nodes:
                    for split in node.candidate_splits:
                        if not cut_rows[i][split]:
                            u_var = self.u_vars_[node.id][split]
                            cn_expr += cp_model.LinearExpr.Term(u_var, -1)
                            yc_lb_expr += cp_model.LinearExpr.Term(u_var, 1)
                self.model_.Add(cn_expr <= 0)
            self.model_.Add(yc_lb_expr <= self.tree_depth_-1)
        self.cut_rows_starting_index += n_cut_rows

    def update_subproblem(self, X, y, dual_costs):
        cut_rows = dual_costs[4]
        if len(cut_rows) == 0:
            return
        self.X_ = X
        self.y_ = y
        leaf_dual = dual_costs[0][self.leaf_id_]
        row_duals = dual_costs[1]
        ns_duals = dual_costs[2]
        cut_duals = dual_costs[3]

        if self.generated_:
            self.update_objective(leaf_dual, row_duals, ns_duals, cut_duals)
        else:
            self.create_submip(leaf_dual, row_duals, ns_duals, cut_duals)

        if len(cut_rows) > 0:
            self.add_cut_rows(cut_rows)

    def row_satisfies_split(self, row, split):
        """TODO: Documentation.
        """
        feature = split.feature
        threshold = split.threshold
        if self.X_[row, feature] <= threshold:
            return True
        return False

    def generate_columns(self, X, y, dual_costs, params=""):
        """TODO: Documentation.
        """
        self.X_ = X
        self.y_ = y
        leaf_dual = dual_costs[0][self.leaf_id_]
        row_duals = dual_costs[1]
        ns_duals = dual_costs[2]
        cut_duals = dual_costs[3]
        if self.generated_:
            self.update_objective(leaf_dual, row_duals, ns_duals, cut_duals)
        else:
            self.create_submip(leaf_dual, row_duals, ns_duals, cut_duals)

        # Solve sub problem
        self.cp_solver_.parameters.num_search_workers = 7

        # Solve
        leaf_dual = dual_costs[0][self.leaf_id_]
        path_generator = PathGenerator(self.u_vars_, self.w_vars_, self.z_vars_,
                                       self.y_vars_,
                                       self.yc_vars_, leaf_dual, self.leaf_,
                                       self.targets_,
                                       self.nodes_,
                                       self.orig_obj_coeffs)
        # status = self.cp_solver_.Solve(self.model_, path_generator)
        self.solve_model(self.model_, time_limit=2, callback=path_generator)
        # print(self.leaf_id_, " Path generation status: ",
        #       sp_solver.StatusName(status))
        print(self.leaf_id_, " Path generation objective: ",
              path_generator.original_objective)
        paths = path_generator.paths
        return paths



[docs]
class DTreeSubProblemHeuristic(BaseSubproblem):
    """TODO: Documentation."""


[docs]
    def __init__(self, leaves, nodes, splits, targets,
                 depth, data_rows=None) -> None:
        super().__init__()
        self.leaves_ = leaves
        self.nodes_ = nodes
        self.splits_ = splits
        self.targets_ = targets
        self.tree_depth_ = depth
        self.data_rows = data_rows
        self.num_cuts = 0
        self.failed_rounds = 0
        self.success_rounds = 0



[docs]
    def generate_columns(self, X, y, dual_costs, params=""):
        """ TODO: Documentation."""
        # Generate random columns and return the ones with postive reduced
        # cost.
        generated_paths = []
        # return generated_paths
        for iter in range(100):
            path = Path()
            # Pick a leaf
            leaf = random.choice(self.leaves_)
            path.leaf_id = leaf.id

            # Pick splits for each node.
            node_ids = leaf.left_nodes + leaf.right_nodes
            path.node_ids = node_ids
            path.splits = []
            success = True
            for node_id in node_ids:
                node = self.nodes_[node_id]
                candidate_splits = node.candidate_splits.copy()
                for used_split in path.splits:
                    if used_split in candidate_splits:
                        candidate_splits.remove(used_split)
                if not candidate_splits:
                    success = False
                    break
                chosen_split = random.choice(candidate_splits)
                path.splits.append(chosen_split)

            if not success:
                self.failed_rounds += 1
                continue

            # Compute the best target.
            n_rows = X.shape[0]

            possible_targets = {}
            best_target = -1
            best_target_count = 0
            row_satisfies_path_array = [False]*n_rows

            satisfied_rows = get_satisfied_rows(path, leaf, self.splits_)
            # print(satisfied_rows)
            for row in satisfied_rows:
                row_satisfies_path_array[row] = True
                row_weight = 1
                if self.data_rows != None:
                    row_weight = self.data_rows[row].weight
                    if self.data_rows[row].removed_from_sp:
                        continue

                target = y[row]
                if target in possible_targets.keys():
                    possible_targets[target] += row_weight
                else:
                    possible_targets[target] = row_weight

                if possible_targets[target] > best_target_count:
                    best_target_count = possible_targets[target]
                    best_target = target
            path.target = best_target
            path.cost = best_target_count

            already_generated = False
            for old_path in generated_paths:
                if path.is_same_as(old_path):
                    already_generated = True
                    break

            if already_generated:
                self.failed_rounds += 1
                continue

            # Evaluate the reduced cost.
            reduced_cost = self.get_reduced_cost(
                X, y, dual_costs, path, row_satisfies_path_array)
            if reduced_cost > 1e-6:
                # print("Generated new path: ", len(generated_paths))
                # path.print_path()
                generated_paths.append(path)
            else:
                self.failed_rounds += 1

        self.success_rounds += len(generated_paths)

        return generated_paths



[docs]
    def update_subproblem(self, X, y, dual_costs):
        cut_rows = dual_costs[4]

        for i, cut_row in enumerate(cut_rows):
            cut_index = self.num_cuts + i
            for split in self.splits_:
                if cut_row[split.id]:
                    split.left_cuts.add(cut_index)
                else:
                    split.right_cuts.add(cut_index)
        self.num_cuts += len(cut_rows)



[docs]
    def get_reduced_cost(self, X, y, dual_costs, path,
                         row_satisfies_path_array):
        """TODO: Documentation."""

        n_rows = X.shape[0]
        leaves_dual = dual_costs[0]
        row_duals = dual_costs[1]
        ns_duals = dual_costs[2]
        cut_duals = dual_costs[3]

        reduced_cost = 0
        try:
            reduced_cost -= leaves_dual[path.leaf_id]
        except IndexError:
            print("List index out of error: ", path.leaf_id, len(leaves_dual))

        reduced_cost += path.cost

        for row in range(n_rows):
            if row_satisfies_path_array[row]:
                reduced_cost -= row_duals[row]

        satisfied_cuts = get_satisfied_cuts(
            path, self.leaves_[path.leaf_id], self.splits_)
        for c in satisfied_cuts:
            reduced_cost -= cut_duals[c]

        for i in range(len(path.node_ids)):
            node_id = path.node_ids[i]
            split_id = path.splits[i]
            reduced_cost -= ns_duals[path.leaf_id][node_id][split_id]

        return reduced_cost




class DTreeSubProblemOld(BaseSubproblem):
    """TODO: Documentation."""

    def __init__(self, leaf, nodes, splits, target,
                 depth, data_rows=None,
                 solver_str='sat') -> None:
        super().__init__()
        self.leaf_id_ = leaf.id
        self.leaf_ = leaf
        self.nodes_ = nodes
        self.splits_ = splits
        self.target_ = target
        self.tree_depth_ = depth
        self.solver_ = pywraplp.Solver.CreateSolver(solver_str)
        self.generated_ = False
        self.yc_vars_ = None
        self.data_rows = data_rows
        self.cut_rows_starting_index = 0
        self.iter_ = 0

    def create_submip(self, leaf_dual, row_duals, ns_duals, cut_duals):
        """TODO: Documentation.
        """
        assert not self.generated_, "SP is already created."
        infinity = self.solver_.infinity()
        self.solver_.SetNumThreads(7)

        n_rows = self.X_.shape[0]
        # Binary variables indicating that row reaches the leaf.
        self.y_vars_ = [None]*n_rows
        objective = self.solver_.Objective()
        for i in range(n_rows):
            obj_coeff = 1
            if self.data_rows != None:
                if self.data_rows[i].removed_from_sp:
                    continue
                obj_coeff = self.data_rows[i].weight

            y_var = self.solver_.BoolVar('y_'+str(i))
            row_dual = row_duals[i]
            if self.y_[i] == self.target_:
                objective.SetCoefficient(y_var, 1-row_dual)
            else:
                objective.SetCoefficient(y_var, -row_dual)
            self.y_vars_[i] = y_var.index()

        num_cuts = len(cut_duals)
        self.yc_vars_ = [None]*num_cuts
        if num_cuts > 0:
            for i in range(num_cuts):
                yc_var = self.solver_.BoolVar('yc_'+str(i))
                cut_dual = cut_duals[i]
                objective.SetCoefficient(yc_var, -cut_dual)
                self.yc_vars_[i] = yc_var.index()

        # Binary variables u_j_a indicating that split s is assigned to the
        # node j.
        self.u_vars_ = {}
        self.leaf_
        nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
        all_splits = []
        for node_id in nodes:
            node = self.nodes_[node_id]
            self.u_vars_[node.id] = {}
            for split in node.candidate_splits:
                all_splits.append(split)
                u_var = self.solver_.BoolVar(
                    'u_'+str(node.id)+'_'+str(split))
                ns_dual = ns_duals[self.leaf_id_][node.id][split]
                objective.SetCoefficient(u_var, -ns_dual)
                self.u_vars_[node.id][split] = u_var.index()

        objective.SetOffset(-leaf_dual)
        objective.SetMaximization()

        # Constraints
        # For each node, exactly one feature is selected
        for node_id in nodes:
            node = self.nodes_[node_id]
            one_feature = self.solver_.Constraint(
                1, 1, "one_feature_" + str(node.id))
            for split in node.candidate_splits:
                u_var = self.solver_.variable(self.u_vars_[node.id][split])
                one_feature.SetCoefficient(u_var, 1)

        # Each feature is selected at max once in a path
        # all_splits = list(dict.fromkeys(all_splits))
        # for split in all_splits:
        #     unique_split = self.solver_.Constraint(
        #         0, 1, "unique_split_" + str(split))
        #     for node_id in nodes:
        #         node = self.nodes_[node_id]
        #         if split not in node.candidate_splits:
        #             continue
        #         u_var = self.solver_.variable(self.u_vars_[node.id][split])
        #         unique_split.SetCoefficient(u_var, 1)

        # Row follows correct path (upper and lower bound on y var)
        for i in range(n_rows):
            if self.data_rows != None:
                if self.data_rows[i].removed_from_sp:
                    continue
            y_lb_cons = self.solver_.Constraint(
                -infinity, self.tree_depth_-1, "row_" + str(i))
            y_var = self.solver_.variable(self.y_vars_[i])
            y_lb_cons.SetCoefficient(y_var, -1)
            for node_id in nodes:
                node = self.nodes_[node_id]
                rn_cons = self.solver_.Constraint(
                    -infinity, 0, "rn_" + str(i) + '_' + str(node.id))

                rn_cons.SetCoefficient(y_var, 1)
                if node.id in self.leaf_.left_nodes:
                    for split in node.candidate_splits:
                        if self.row_satisfies_split(i, self.splits_[split]):
                            u_var = self.solver_.variable(
                                self.u_vars_[node.id][split])
                            rn_cons.SetCoefficient(u_var, -1)
                            y_lb_cons.SetCoefficient(u_var, 1)
                elif node.id in self.leaf_.right_nodes:
                    for split in node.candidate_splits:
                        if not self.row_satisfies_split(i,
                                                        self.splits_[split]):
                            u_var = self.solver_.variable(
                                self.u_vars_[node.id][split])
                            rn_cons.SetCoefficient(u_var, -1)
                            y_lb_cons.SetCoefficient(u_var, 1)

        self.generated_ = True

    def update_objective(self, leaf_dual, row_duals, ns_duals, cut_duals):
        """TODO: Documentation.
        """
        objective = self.solver_.Objective()
        n_rows = self.X_.shape[0]
        for i in range(n_rows):
            obj_coeff = 1
            if self.data_rows != None:
                if self.data_rows[i].removed_from_sp:
                    continue
                # obj_coeff = self.data_rows[i].weight

            y_var = self.solver_.variable(self.y_vars_[i])
            row_dual = row_duals[i]
            if self.y_[i] == self.target_:
                objective.SetCoefficient(y_var, 1-row_dual)
            else:
                objective.SetCoefficient(y_var, -row_dual)

        num_cuts = len(cut_duals)
        existing_cuts = len(self.yc_vars_)
        if num_cuts > 0:
            for i in range(num_cuts):
                yc_var = None
                if i >= existing_cuts:
                    yc_var = self.solver_.BoolVar('yc_'+str(i))
                    self.yc_vars_.append(yc_var.index())
                else:
                    yc_var = self.solver_.variable(self.yc_vars_[i])
                cut_dual = cut_duals[i]
                objective.SetCoefficient(yc_var, -cut_dual)

        nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
        for node_id in nodes:
            node = self.nodes_[node_id]
            for split in node.candidate_splits:
                u_var = self.solver_.variable(self.u_vars_[node.id][split])
                ns_dual = ns_duals[self.leaf_id_][node.id][split]
                objective.SetCoefficient(u_var, -ns_dual)

        objective.SetOffset(-leaf_dual)
        objective.SetMaximization()
        return

    def add_cut_rows(self, cut_rows):
        assert self.generated_
        infinity = self.solver_.infinity()
        nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
        n_cut_rows = len(cut_rows)
        # Row follows correct path (upper and lower bound on y var)
        for i in range(n_cut_rows):
            cut_index = i + self.cut_rows_starting_index
            yc_lb_cons = self.solver_.Constraint(
                -infinity, self.tree_depth_-1, "cut_" + str(cut_index))
            # yc_var is created in update_objective method.
            yc_var = self.solver_.variable(self.yc_vars_[cut_index])
            yc_lb_cons.SetCoefficient(yc_var, -1)
            for node_id in nodes:
                node = self.nodes_[node_id]
                cn_cons = self.solver_.Constraint(
                    -infinity, 0, "cn_" + str(cut_index) + '_' + str(node.id))

                cn_cons.SetCoefficient(yc_var, 1)
                if node.id in self.leaf_.left_nodes:
                    for split in node.candidate_splits:
                        if cut_rows[i][split]:
                            u_var = self.solver_.variable(
                                self.u_vars_[node.id][split])
                            cn_cons.SetCoefficient(u_var, -1)
                            yc_lb_cons.SetCoefficient(u_var, 1)
                elif node.id in self.leaf_.right_nodes:
                    for split in node.candidate_splits:
                        if not cut_rows[i][split]:
                            u_var = self.solver_.variable(
                                self.u_vars_[node.id][split])
                            cn_cons.SetCoefficient(u_var, -1)
                            yc_lb_cons.SetCoefficient(u_var, 1)
        self.cut_rows_starting_index += n_cut_rows

    def generate_columns(self, X, y, dual_costs, params=""):
        """TODO: Documentation.
        """

        # Solve sub problem
        result_status = self.solver_.Solve(get_params_from_string(params))

        has_solution = (result_status == pywraplp.Solver.OPTIMAL or
                        result_status == pywraplp.Solver.FEASIBLE)

        # The current path is always feasible.
        # # TODO: Fix this. Sat solver returns infeasible sometimes
        # (even without cuts).
        if not has_solution:
            return []
            # print(self.solver_.ExportModelAsLpFormat(False))
        assert has_solution, "Result status: " + str(result_status)

        # if self.leaf_id_ == 4:
        #     sp_mps = self.solver_.ExportModelAsMpsFormat(fixed_format=False,
        #                                                  obfuscated=False)
        #     name = "sp_" + str(self.tree_depth_) + "_" + \
        #         str(self.iter_) + ".mps"
        #     self.iter_ += 1
        #     f = open(name, "w")
        #     f.write(sp_mps)
        #     f.close()

        # TODO: This threshold should be a parameter.
        print("Subproblem ", self.leaf_id_, " objective = ",
              self.solver_.Objective().Value())
        if self.solver_.Objective().Value() <= 1e-6:
            return []

        nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
        path = Path()
        path.leaf_id = self.leaf_id_
        path.node_ids = []
        path.splits = []
        path.cost = 0
        path.id = -1
        for node in self.nodes_:
            if node.id not in nodes:
                continue
            path.node_ids.append(node.id)
            for split in node.candidate_splits:
                u_var = self.solver_.variable(self.u_vars_[node.id][split])
                if u_var.solution_value() > 0.5:
                    path.splits.append(split)
                    break
        path.target = self.target_
        n_rows = self.X_.shape[0]
        for i in range(n_rows):
            obj_coeff = 1
            if self.data_rows != None:
                if self.data_rows[i].removed_from_sp:
                    continue
                # obj_coeff = self.data_rows[i].weight
            y_var = self.solver_.variable(self.y_vars_[i])
            if self.y_[i] == self.target_:
                path.cost += 1 * y_var.solution_value()
        return [path]

    def update_subproblem(self, X, y, dual_costs):
        self.X_ = X
        self.y_ = y
        leaf_dual = dual_costs[0][self.leaf_id_]
        row_duals = dual_costs[1]
        ns_duals = dual_costs[2]
        cut_duals = dual_costs[3]
        cut_rows = dual_costs[4]

        if self.generated_:
            self.update_objective(leaf_dual, row_duals, ns_duals, cut_duals)
        else:
            self.create_submip(leaf_dual, row_duals, ns_duals, cut_duals)

        if len(cut_rows) > 0:
            self.add_cut_rows(cut_rows)

    def row_satisfies_split(self, row, split):
        """TODO: Documentation.
        """
        feature = split.feature
        threshold = split.threshold
        if self.X_[row, feature] <= threshold:
            return True
        return False



[docs]
class DTreeClassifier(ColGenClassifier):
    """Decision Tree classifier using column generation.

    Parameters
    ----------
    initial_paths: list(Path), default=[],
        List of paths used to initialize the master problem. The user must
        ensure that a valid tree can be formed using the initial paths.
    leaves: list(Leaf), default=[],
        List of leaves in the tree.
    nodes: list(Node), default=[],
        List of nodes in the tree.
    splits: list(Split), default=[],
        List of split checks used in the nodes.
    tree_depth: int, default=1,
        Depth of the tree.
    targets: list(int), default=[],
        List of target ids. They must start from 0.
    max_iterations: int, default=-1
        Maximum column generation iterations. Negative values removes the
        iteration limit and the problem is solved till optimality.
    time_limit: int, default=-1,
        Time limit in seconds for training. Negative values removes the
        time limit and the problem is solved till optimality.
    num_master_cuts_round: int, default=3,
        Number of times the master problem adds cuts in an iteration.
    master_beta_constraints_as_cuts: bool, default=False,
        If True, adds existing beta constraints (constraints for data rows) as
        cutting planes in the master problem.
    master_generate_cuts: bool, default=False,
        If True, master problem generates new beta cuts using SAT solver.
    data_rows: list(Row), default=None,
        Preprocessed data rows. The preprocessed rows help with faster running
        times.
    use_old_sp: bool, default=False,
        If True, uses the old subproblem model published in Firat et. al. 2020.
    master_solver_type: str, default='glop',
        Solver for RMP from OR-Tools. Use 'glop' for tests. See OR-Tools
        documentation for other possible values.
    rmp_solver_params: string, default = "",
        Solver parameters for solving restricted master problem (rmp).
    master_ip_solver_params: string, default = "",
        Solver parameters for solving the integer master problem.
    subproblem_params: list of strings, default = [""],
        Parameters for solving the subproblem.
    """


[docs]
    def __init__(self, initial_paths=[], leaves=[], nodes=[], splits=[],
                 tree_depth=1,
                 targets=[], max_iterations=-1,
                 time_limit=-1,
                 num_master_cuts_round=3,
                 master_beta_constraints_as_cuts=False,
                 master_generate_cuts=False,
                 data_rows=None,
                 use_old_sp=False,
                 master_solver_type='glop',
                 rmp_solver_params="",
                 master_ip_solver_params="", subproblem_params=""):

        self.initial_paths = initial_paths
        self.leaves = leaves
        self.nodes = nodes
        self.splits = splits
        self.tree_depth = tree_depth
        self.targets = targets
        self.subproblem_params = subproblem_params
        self.num_master_cuts_round = num_master_cuts_round
        self.master_beta_constraints_as_cuts = master_beta_constraints_as_cuts
        self.master_generate_cuts = master_generate_cuts
        self.data_rows = data_rows
        split_ids = []
        for split in splits:
            split_ids.append(split.id)
        # Each node must be in sequence.
        node_ids = []
        node_id = 0
        for node in self.nodes:
            assert node.id == node_id
            node_ids.append(node.id)
            node_id += 1

        # Each leaf must be in sequence.
        leaf_ids = []
        leaf_id = 0
        for leaf in leaves:
            assert leaf.id == leaf_id
            leaf_ids.append(leaf.id)
            leaf_id += 1
        # Each path can only contain nodes and leaves provided.
        for path in self.initial_paths:
            path.initial = True
            assert path.leaf_id in leaf_ids
            for node_id in path.node_ids:
                assert node_id in node_ids
            for split_id in path.splits:
                assert split_id in split_ids, "split id " + str(split_id)
            assert path.target in targets
            assert len(path.node_ids) == self.tree_depth
            assert len(path.splits) == self.tree_depth

        self.master_problem = DTreeMasterProblem(
            self.initial_paths, self.leaves, self.nodes, self.splits,
            num_cuts_round=num_master_cuts_round,
            beta_constraints_as_cuts=master_beta_constraints_as_cuts,
            generate_cuts=master_generate_cuts,
            solver_type=master_solver_type,
            data_rows=data_rows)
        self.subproblems = []
        all_subproblem_params = []
        heuristic = DTreeSubProblemHeuristic(
            self.leaves, self.nodes, self.splits, self.targets,
            self.tree_depth, data_rows=data_rows)
        self.subproblems.append([heuristic])
        all_subproblem_params.append([""])
        self.subproblems.append([])
        all_subproblem_params.append([])
        for leaf in self.leaves:
            if use_old_sp:
                for target in self.targets:
                    subproblem = DTreeSubProblemOld(
                        leaf, self.nodes, self.splits, target,
                        self.tree_depth, solver_str='sat',
                        data_rows=data_rows)
                    self.subproblems[1].append(subproblem)
                    all_subproblem_params[1].append(subproblem_params)
            else:
                # subproblem = DTreeSubProblem(
                #     leaf, self.nodes, self.splits, self.targets,
                #     self.tree_depth, solver_str='sat',
                #     data_rows=data_rows)
                subproblem = DTreeSubProblemSat(
                    leaf, self.nodes, self.splits, self.targets,
                    self.tree_depth,
                    data_rows=data_rows)
                self.subproblems[1].append(subproblem)
                all_subproblem_params[1].append(subproblem_params)

        rmp_is_ip = True
        super().__init__(max_iterations, time_limit,
                         self.master_problem, self.subproblems,
                         rmp_is_ip, rmp_solver_params, master_ip_solver_params,
                         all_subproblem_params)


    def _more_tags(self):
        return {'X_types': ['categorical'],
                'non_deterministic': True,
                'binary_only': True}


[docs]
    def predict(self, X):
        """Predicts the class based on the solution of master problem. 

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            The input samples. The inputs should only contain numeric values.

        Returns
        -------
        y : ndarray, shape (n_samples,)
            The label for each sample.
        """
        X = check_array(X, accept_sparse=True)
        check_is_fitted(self, 'is_fitted_')
        selected_paths = self.master_problem.selected_paths
        # Check for each row, which path it satisfies. There should be exactly
        # one.
        num_rows = X.shape[0]
        y_predict = np.zeros(X.shape[0], dtype=int)
        for row in range(num_rows):
            for path in selected_paths:
                if row_satisfies_path(X[row], self.leaves[path.leaf_id],
                                      self.splits, path):
                    y_predict[row] = path.target
                    break
        return self.label_encoder_.inverse_transform(y_predict)