"""
Decision tree classifier using column generation.
Reference: Murat Firat, Guillaume Crognier, Adriana F. Gabor, C.A.J. Hurkens,
and Yingqian Zhang.
Column generation based heuristic for learning classification trees.
Computers and Operations Research, 116, apr 2020.
The algorithm takes candidate splits for each node of the decision tree as
inputs. A path is a set of nodes from root node to a leaf with a fixed split
for each node and a assigned target. The master problem uses a binary variable
for each possible path. The goal is to select a set of path that are consistent
among themselves and result in the best accuracy.
Since there are exponentially many paths, the path variables are generated
using column generation by a subproblem.
We only work with complete trees. That is all the paths in the tree have the
same depth d. Each internal node has two children.
We assume a fixed ordering of the nodes and leaves in the tree. The root node
is numbered 0 (ID of the node). The left child of the node (i) is (2i + 1) and
the right child of the node (i) is (2i + 2). Leaves are numberd from (0) to
(2^d) from left to right.
Master problem:
Sets
R set of rows in data file.
F set of features in data file.
N_{lf} , N_{int} leaf and internal (non-leaf) nodes in the decision tree.
p_{BT}(l) path to leaf l in binary tree.
DP_l$ set of decision paths ending in leaf l.
DP_l(j) subset of paths in DP_l, such that j in p_{BT}(l).
R^{l}(p) subset of rows directed to leaf l through path p.
S_j set of decision splits for node j.
Parameters
k depth of the decision tree.
CP(p) number of correct predictions/true positives for path p.
Decision Variables
x_p binary variable indicating that path p is assigned to leaf l.
rho_{j,a} binary variable indicating that split a has been assigned to node j.
Max sum_{l in N_{lf}} sum_{p in DP_{l}} CP(p)x_p
s.t. sum_{p in DP_{l}} x_p = 1, forall l in N_{lf} (alpha)
sum_{lin N_{lf}} sum_{p in DP_{l}:r in R^l(p)} x_p = 1
forall r in R (beta)
sum_{p in DP_{l}:s(j) = a} x_p = rho_{j,a},
forall l in N_{lf},
j in p_{BT}(l) and N_{int},
a in S_j (gamma)
x_p in {0,1}, forall p in DP_{l}, l in N_{lf}
rho_{j,a} in {0,1} forall j in N_{int}, a in S_j
The objective maximizes the accuracy. The (alpha) constraints ensure that
exactly one path is selected for each leaf. The (beta) constraints ensure
that exactly one path is selected for each row in the dataset. They are implied
by the (alpha) and (gamma) constraints, but they are needed for stronger LP
relaxation. The (gamma) constraints ensure that the selected paths are
consistent (have the same split on common nodes).
Subproblem: (Modified from the original paper)
Sets
R & set of all rows in data file.
F set of features in data file.
Y set of all class labels.
S_j set of decision splits for node j.
LC(l) set of internal nodes with left child in p_{BT}(l).
RC(l) set of internal nodes with right child in p_{BT}(l).
T(r) set of splits for which row r returns a TRUE (feature <= threshold)
F(r) set of splits for which row r returns a FALSE: (feature > threshold)
Parameters
alpha_l dual cost of (alpha) constraint in the master problem for leaf l.
beta_r dual cost of (beta) constraint in the master problem for row r.
gamma_{l,j,a} dual cost of (gamma) constraint in the master problem for
leaf l, node j, split a.
Decision Variables
y_r binary variable indicating that row r reaches leaf l.
z_r binary variable indicating that row r reaches leaf l and has the same
target as the path being generated.
u_{j,a} binary variable indicating that split a is assigned to node j.
DPP(l):
Max sum_{r in R} z_r - alpha_l
- sum_{j in p_{BT}(l)} sum_{a in S_j} gamma_{l,j,a} u_{j,a}
- sum_{r in R} beta_r y_r
s.t. sum_{a in S_j} u_{j,a} = 1, forall j in RC(l) or LC(l) (1)
y_r <= sum_{a in S_j and T(r)} u_{j,a}, forall j in LC(l), r in R (2)
y_r <= sum_{a in S_j and F(r)} u_{j,a} , forall j in RC(l), r in R (3)
sum_{j in LC(l)} sum_{a in S_j and T(r)} u_{j,a}
+ sum_{j in RC(l)} sum_{a in S_j and F(r)} u_{j,a} - (k-1) <= y_r,
forall r in R (4)
sum_{j in p_{BT}(l)} u_{j,a} <= 1, forall a in (Union of all splits) (5)
z_r <= y_r, forall r in R (6)
z_r <= w_t, forall r in R , t = correct class for row r (7)
sum_{t \in Y} w_t = 1 (8)
z_r in {0,1}, forall r in R (9)
y_r in {0,1}, forall r in R (10)
u_{j,a} in {0,1} forall j in RC(l) or LC(l), a in S_j (11)
We have one subproblem for each leaf. This is different from the original paper
where there is one subproblem for each leaf and target combination.
The objective maximizes the reduced cost. The constraints (1) ensure that
exactly one split is selected for each node. The constraints (2)(3)(4) ensure
that the row follows the path when the corresponding splits are selected. The
constraints (5) ensure that a split is only selected once in a path. The
constraints (6)(7) ensure that the z variable is set only if the row follows
the path (y variable is set) and the right target is selected (w variable is
set). The constraint (8) ensures that exactly one target is selected for the
path being generated.
"""
import numpy as np
import math
import random
import multiprocessing as mp
from ortools.linear_solver import pywraplp
from ortools.linear_solver import linear_solver_pb2
from ortools.sat.python import cp_model
from math import floor, ceil
from sklearn.utils.validation import check_array, check_is_fitted
from ._col_gen_classifier import BaseMasterProblem
from ._col_gen_classifier import BaseSubproblem
from ._col_gen_classifier import ColGenClassifier
from time import time
[docs]
class Row:
"""To make the processing faster, we store some row related information in
this class.
Attributes
----------
id: (int) ID of the row.
reachable_nodes: (set(int)) IDs of the nodes the row can reach.
reachable_paths = (set(int)) IDs of the paths the row can follow.
reachable_leaves = (set(int)) IDs of the leaves the row can reach.
target: (int) Target of the row.
left_splits: (set(int)) IDs of the splits where the feature value of the
row is smaller or equal to the threshold.
right_splits: (set(int)) IDs of the splits where the feature value of
the row is greater than the threshold.
removed_from_master: (bool) True if the row is removed from the master.
This can happen if the row can only reach at max two leaves or the row
is similar to some other row in the dataset with respect to the splits
being considered.
removed_from_sp: (bool) True if the row is removed from the master and the
subproblem. This can happen if the row is similar to some other row in
the dataset with respect to the splits being considered.
weight: (int) Typically = 1. Some rows can have higher weights if we remove
the other rows that are similar to this row from the dataset.
"""
[docs]
def __init__(self) -> None:
self.id = -1
self.reachable_nodes = set()
self.reachable_paths = set()
self.reachable_leaves = set()
self.target = -1
# List of satisfied splits.
self.left_splits = set()
# List of unsatisfied splits.
self.right_splits = set()
self.removed_from_master = False
self.removed_from_sp = False
self.weight = 1
[docs]
class Path:
"""All path related information.
Attributes
----------
leaf_id: (int) ID of the associated leaf.
node_ids: (list(int)) IDs of the nodes in the path.
splits: (list(int)) IDs of the selected splits for nodes in the path.
The order of splits must match the order of nodes.
cost: (int) Number of rows following the path with correct target.
id: (int) ID of the path.
target: (int) Associated target of the path.
"""
[docs]
def __init__(self) -> None:
self.leaf_id = -1
self.node_ids = []
self.splits = []
self.cost = 0
self.id = -1
self.target = -1
self.initial = False
[docs]
def is_same_as(self, path):
""" Returns true if the current path is same as the path in the
argument.
Parameters
----------
path: (Path) The other path being compared to.
"""
if self.leaf_id != path.leaf_id:
return False
if self.target != path.target:
return False
if self.cost != path.cost:
return False
if len(self.node_ids) != len(path.node_ids):
return False
for i in range(len(self.node_ids)):
node_id = self.node_ids[i]
split_id = self.splits[i]
found = False
for j in range(len(path.node_ids)):
if path.node_ids[j] == node_id and path.splits[j] == split_id:
found = True
break
if not found:
return False
return True
[docs]
def print_path(self):
"""Prints the attributes of the path."""
print("-------------------")
print("ID: ", self.id)
print("Nodes: ", self.node_ids)
print("Splits: ", self.splits)
print("Leaf: ", self.leaf_id)
print("Target: ", self.target)
print("Cost: ", self.cost)
print("-------------------")
[docs]
def set_leaf(self, leaf_id, depth):
"""
For a full binary tree, populates the nodes corresponding to the given
leaf_id.
Parameters
----------
leaf_id: (int) The ID of the leaf.
depth: (int) Depth of the path.
"""
self.leaf_id = leaf_id
self.node_ids = []
count = 2**(depth) - 1 + leaf_id
while (count > 0):
father = math.floor((count-1) / 2)
self.node_ids.append(father)
count = father
[docs]
class Node:
"""All node related information.
Attributes
----------
id: (int) ID of the node.
candidate_splits: (list(int)) IDs of the candidate splits that can be used
at this node.
candidate_splits_count: (list(int)) Count of how many times the
corresponding split was seen in initialization. This is useful to trim
the candidate_splits list. Used only at the initialization and not
during training.
last_split: (int) ID of the last split added to the candidate_splits. Used
only at the initialization and not during training.
parent: (int) ID of the parent node. Parent of the root node is -1.
left_child: (int) ID of the left child node. -1 if the child is a leaf.
right_child: (int) ID of the right child node. -1 if the child is a leaf.
children_are_leaves: (bool) True if the children of this node are leaves.
"""
[docs]
def __init__(self) -> None:
self.id = -1
self.candidate_splits = []
self.candidate_splits_count = []
self.last_split = -1
# parent is -1 for the root node.
self.parent = -1
# child is -1 for the leaves.
self.left_child = -1
self.right_child = -1
self.children_are_leaves = False
[docs]
class Leaf:
"""All leaf related information.
Attributes
----------
id: (int) ID of the leaf.
left_nodes: (list(int)) Set of nodes that take left branch to reach this
leaf.
right_nodes: (list(int)) Set of nodes that take right branch to reach this
leaf.
"""
[docs]
def __init__(self) -> None:
self.id = id
self.left_nodes = []
self.right_nodes = []
[docs]
def create_leaf(self, id, depth) -> None:
"""Given the id and depth of the leaf, populates the left_nodes and
right_nodes attributes.
Parameters
----------
id: (int) ID of the leaf.
depth: (int) Depth of the tree.
"""
self.id = id
self.left_nodes = []
self.right_nodes = []
count = 2**(depth) - 1 + id
while (count > 0):
father = math.floor((count-1) / 2)
if count % 2 == 0:
self.right_nodes.append(father)
else:
self.left_nodes.append(father)
count = father
[docs]
class Split:
"""All split related information.
Attributes
----------
id: (int) ID of the split.
feature: (int) Index of the feature used for the split.
threshold: (float) Threshold value for the split. The row takes left branch
if the feature <= threshold.
left_rows: (set(int)) IDs of the rows that take left branch on this split.
This is used for faster computing of the rows that follow a particular
path.
right_rows: (set(int)) IDs of the rows that take right branch on this
split.
left_cuts: (set(int)) Indices of the cuts added as (beta) constraints that
correspopnd to a row that take left branch on this split. This is used
for faster computing of the cuts that follow a particular path.
right_cuts: (set(int)) Indices of the cuts added as (beta) constraints that
correspopnd to a row that take left branch on this split.
"""
[docs]
def __init__(self) -> None:
self.id = -1
self.feature = -1
self.threshold = -1
# In reduced split case, some splits are removed.
self.removed = False
self.left_rows = set()
self.right_rows = set()
self.left_cuts = set()
self.right_cuts = set()
def get_satisfied_rows(path, leaf, splits):
"""Returns the set of satisfied rows for the given path.
We use set intersection to compute this faster. This requires that we store
the set of rows that take a specific branch on each split.
Parameters
----------
path: (Path) the path.
leaf: (Leaf) the leaf of the path. We pass this because the leaf contains
the information about which nodes take left branch for this path.
splits: (list(Split)) splits to get the rows that take the correct branch.
"""
satisfied_rows = set()
split_ids = path.splits
node_ids = path.node_ids
split_0 = splits[split_ids[0]]
if node_ids[0] in leaf.left_nodes:
satisfied_rows = split_0.left_rows
else:
satisfied_rows = split_0.right_rows
for i in range(1, len(node_ids)):
split = splits[split_ids[i]]
if node_ids[i] in leaf.left_nodes:
satisfied_rows = satisfied_rows.intersection(split.left_rows)
else: # Right node in the path
satisfied_rows = satisfied_rows.intersection(split.right_rows)
return satisfied_rows
def get_satisfied_cuts(path, leaf, splits):
"""Returns the set of satisfied cut (beta) rows for the given path.
We use set intersection to compute this faster. This requires that we store
the set of cuts that take a specific branch on each split.
Parameters
----------
path: (Path) the path.
leaf: (Leaf) the leaf of the path. We pass this because the leaf contains
the information about which nodes take left branch for this path.
splits: (list(Split)) splits to get the cuts that take the correct branch.
"""
satisfied_cuts = set()
split_ids = path.splits
node_ids = path.node_ids
split_0 = splits[split_ids[0]]
if node_ids[0] in leaf.left_nodes:
satisfied_cuts = split_0.left_cuts
else:
satisfied_cuts = split_0.right_cuts
for i in range(1, len(node_ids)):
split = splits[split_ids[i]]
if node_ids[i] in leaf.left_nodes:
satisfied_cuts = satisfied_cuts.intersection(split.left_cuts)
else: # Right node in the path
satisfied_cuts = satisfied_cuts.intersection(split.right_cuts)
return satisfied_cuts
def row_satisfies_path(X_row, leaf, splits, path):
"""Returns true if the passed row follows the path. The row may have
different target than the path.
Parameters
----------
X_row: ndarray, shape (1, n_features)
The datapoint representing the row.
leaf: (Leaf) The leaf of the path.
splits: (list(Split)) splits to get the rows that take the correct branch.
path: (Path) The path.
"""
# path.print_path()
split_ids = path.splits
node_ids = path.node_ids
for i in range(len(node_ids)):
split = splits[split_ids[i]]
feature = split.feature
threshold = split.threshold
if node_ids[i] in leaf.left_nodes:
if X_row[feature] > threshold:
return False
else: # Right node in the path
if X_row[feature] <= threshold:
return False
return True
def get_params_from_string(params):
""" Given the params in string form, returns the MPSolverParameters.
This method is not yet implemented completely.
By default we keep the incrementality on and use the primal simplex as the
lp algorithm.
Parameters
----------
params : string,
The solver parameters in string format.
"""
# TODO: Implement this method.
solver_params = pywraplp.MPSolverParameters()
solver_params.SetIntegerParam(
pywraplp.MPSolverParameters.INCREMENTALITY,
pywraplp.MPSolverParameters.INCREMENTALITY_ON)
solver_params.SetIntegerParam(
pywraplp.MPSolverParameters.LP_ALGORITHM,
pywraplp.MPSolverParameters.PRIMAL)
print(params)
return solver_params
class CutGenerator(cp_model.CpSolverSolutionCallback):
"""Track intermediate solutions.
Attributes
----------
path_vars: (list(int)) Indices of path vars in the solver.
split_vars: (list(int)) Indices of split vars in the solver.
solution_count: (int) Count of all solutions seen so far. They may not have
the desired objective value.
cuts: list(list(bool)), shape (num cuts, num paths)
List of all generated cuts. Each cut is a list of boolean representing
whether or not the cut follows the path.
cut_rows: list(list(bool)), shape (num cuts, num splits)
List of all generated cuts. Each cut is a list of boolean representing
whether or not the cut takes the left branch on the splits.
orig_path_coeffs: (list(float)), shape (num paths)
List of original path coefficients in the objective. We compare the
solutions using this coefficients to get the original objective value
without the scaling. The solution represents a valid cut if the
original objective value is greater than 1.
"""
def __init__(self, path_vars, split_vars, orig_path_coeffs):
cp_model.CpSolverSolutionCallback.__init__(self)
# path_vars are updated after each col gen iter.
self.path_vars = path_vars
self.split_vars = split_vars
self.solution_count = 0
self.cuts_ = []
self.cut_rows = []
self.orig_path_coeffs = orig_path_coeffs
def on_solution_callback(self):
self.solution_count += 1
orig_obj = 0.0
cut = [False] * len(self.path_vars)
cut_row = [False] * len(self.split_vars)
for i in range(len(self.path_vars)):
v = self.path_vars[i]
if self.Value(v) > 0:
orig_obj += self.orig_path_coeffs[i]
cut[i] = True
for i in range(len(self.split_vars)):
v = self.split_vars[i]
if self.Value(v) > 0:
cut_row[i] = True
if orig_obj > 1 + 1e-6:
self.cuts_.append(cut)
self.cut_rows.append(cut_row)
def solution_count(self):
return self.solution_count
def cuts(self):
return (self.cuts_, self.cut_rows)
[docs]
class DTreeMasterProblem(BaseMasterProblem):
"""Master problem for Decision Tree classifier.
"""
[docs]
def __init__(self, initial_paths, leaves, nodes, splits,
beta_constraints_as_cuts=False,
generate_cuts=False,
num_cuts_round=0,
solver_type='glop',
data_rows=None):
super(DTreeMasterProblem, self).__init__(solver_str=solver_type)
self.cut_gen_model_ = cp_model.CpModel()
self.cut_obj_scale = 1000000
self.generated_ = False
self.paths_ = initial_paths
self.generated_paths_ = set()
self.leaves_ = leaves
self.nodes_ = nodes
self.splits_ = splits
self.data_rows = data_rows
self.num_cuts_round = num_cuts_round
self.generate_cuts = generate_cuts
self.beta_constraints_as_cuts = beta_constraints_as_cuts
self.total_cuts_added = 0
self.cut_gen_time = 0.0
self.rmp_objective_ = 0.0
self.iter_ = 0
# Experimental
self.last_reset_iter_ = 0
self.reset_timer_ = False
[docs]
def create_cut_gen_model(self):
"""Creates the model for cut generation using a CP-SAT solver."""
# Takes value 1 if row follows left branch on split (satisfies split)
self.cut_split_vars_ = [None]*len(self.splits_)
for i in range(len(self.splits_)):
self.cut_split_vars_[
i] = self.cut_gen_model_.NewBoolVar('z_'+str(i))
# Takes value 1 if row satisfies the path.
self.cut_path_vars_ = [None]*len(self.paths_)
for i, path in enumerate(self.paths_):
self.cut_path_vars_[
i] = self.cut_gen_model_.NewBoolVar('p_'+str(i))
literals = []
split_ids = path.splits
node_ids = path.node_ids
leaf = self.leaves_[path.leaf_id]
for j in range(len(node_ids)):
split_id = split_ids[j]
if node_ids[j] in leaf.left_nodes:
literals.append(self.cut_split_vars_[split_id])
else: # Right node in the path
literals.append(self.cut_split_vars_[split_id].Not())
negated_literals = [x.Not() for x in literals]
# p => AND(splits)
self.cut_gen_model_.AddBoolAnd(
literals).OnlyEnforceIf(self.cut_path_vars_[i])
# ~p => NOT(AND(splits)) = OR(NOT(splits))
self.cut_gen_model_.AddBoolOr(
negated_literals).OnlyEnforceIf(self.cut_path_vars_[i].Not())
# Split dependency:
# Split_i: f <= t1,
# Split_j: f <= t2, t1 < t2,
# Split_i => Split_j
for i in range(len(self.splits_)):
split_i = self.splits_[i]
threshold_i = split_i.threshold
split_i_var = self.cut_split_vars_[i]
for j in range(i+1, len(self.splits_)):
split_j = self.splits_[j]
if split_i.feature != split_j.feature:
continue
threshold_j = split_j.threshold
split_j_var = self.cut_split_vars_[j]
assert threshold_i != threshold_j
if threshold_i < threshold_j:
self.cut_gen_model_.AddImplication(
split_i_var, split_j_var)
else:
self.cut_gen_model_.AddImplication(
split_j_var, split_i_var)
[docs]
def generate_mp(self, X, y):
"""TODO: Documentation.
"""
if self.generated_:
return
self.X_ = X
self.y_ = y
self.create_cut_gen_model()
infinity = self.solver_.infinity()
self.solver_.SetNumThreads(1)
objective = self.solver_.Objective()
self.path_vars_ = [None]*len(self.paths_)
for i in range(len(self.paths_)):
xp_var = self.solver_.IntVar(0.0, infinity, 'p_'+str(i))
self.path_vars_[i] = xp_var.index()
self.paths_[i].id = xp_var.index()
# TODO: Compute cost of the path here?
objective.SetCoefficient(xp_var, self.paths_[i].cost)
objective.SetMaximization()
# leaf constraints
self.leaf_cons_ = [None]*len(self.leaves_)
for leaf in self.leaves_:
self.leaf_cons_[leaf.id] = self.solver_.Constraint(
1, 1, "leaf_"+str(leaf.id))
for path in self.paths_:
if path.leaf_id == leaf.id:
xp_var = self.solver_.variable(path.id)
self.leaf_cons_[leaf.id].SetCoefficient(xp_var, 1)
if self.data_rows == None:
self.data_rows = self.preprocess_rows()
else:
self.compute_satisfied_paths()
# row constraints
n_rows = self.X_.shape[0]
self.row_cons_ = [None]*n_rows
self.added_row = [False]*n_rows
for r in range(n_rows):
if self.beta_constraints_as_cuts:
break
# continue
if self.data_rows[r].removed_from_master:
continue
self.added_row[r] = True
self.row_cons_[
r] = self.solver_.Constraint(-infinity, 1, "row_"+str(r))
for path in self.paths_:
if row_satisfies_path(X[r], self.leaves_[path.leaf_id],
self.splits_, path):
xp_var = self.solver_.variable(path.id)
self.row_cons_[r].SetCoefficient(xp_var, 1)
# Cut constraints to be populated later.
self.cut_cons_ = []
# consistency constraints
self.ns_vars = {}
self.ns_constraints_ = {}
for leaf in self.leaves_:
self.ns_constraints_[leaf.id] = {}
nodes = leaf.left_nodes + leaf.right_nodes
for node_id in nodes:
node = self.nodes_[node_id]
self.ns_constraints_[leaf.id][node.id] = {}
for split in node.candidate_splits:
ns_var_ind = self.ns_vars[(node.id, split)] \
if (node.id, split) in self.ns_vars \
else self.solver_.IntVar(0.0, infinity, "r_ns_" +
str(node.id) +
"_"+str(split)).index()
# else self.solver_.BoolVar("r_ns_" + str(node.id) +
# "_"+str(split)).index()
self.ns_vars[(node.id, split)] = ns_var_ind
ns_var = self.solver_.variable(ns_var_ind)
ns_constraint = self.solver_.Constraint(
0, 0, "ns_"+str(node.id)+"_"+str(split)+"_"
+ str(leaf.id))
ns_constraint.SetCoefficient(ns_var, -1)
for path in self.paths_:
for i in range(len(path.node_ids)):
if path.leaf_id == leaf.id\
and path.node_ids[i] == node.id \
and path.splits[i] == split:
xp_var = self.solver_.variable(path.id)
ns_constraint.SetCoefficient(xp_var, 1)
self.ns_constraints_[
leaf.id][node.id][split] = ns_constraint
# print(self.solver_.ExportModelAsLpFormat(False))
self.generated_ = True
[docs]
def store_lp_solution(self):
"""Store the solution values in the class."""
self.rmp_objective_ = self.solver_.Objective().Value()
self.path_solution_values = {}
for xp_var_id in self.path_vars_:
xp_var = self.solver_.variable(xp_var_id)
self.path_solution_values[xp_var_id] = xp_var.solution_value()
def compute_satisfied_paths(self):
for r, data_row in enumerate(self.data_rows):
for path in self.paths_:
if row_satisfies_path(self.X_[r], self.leaves_[path.leaf_id],
self.splits_, path):
data_row.reachable_paths.add(path.id)
[docs]
def preprocess_rows(self, aggressive=False):
"""TODO: Documentation."""
# preprocess nodes first
n_leaves = len(self.leaves_)
depth = int(math.log2(n_leaves))
for node in self.nodes_:
if node.id == 0:
node.parent = -1
else:
node.parent = math.floor((node.id-1) / 2)
node.left_child = 2 * node.id + 1
node.right_child = 2 * node.id + 2
if node.left_child >= 2**depth - 1:
node.left_child -= (2**(depth) - 1)
node.right_child -= (2**(depth) - 1)
node.children_are_leaves = True
# print(node.id, node.parent, node.left_child,
# node.right_child, node.children_are_leaves)
# print("Splits:")
# for split in self.splits_:
# feature = split.feature
# threshold = split.threshold
# print(split.id, feature, threshold)
data_rows = []
n_rows = self.X_.shape[0]
removed_count = 0
average_leaf_reach = 0.0
for r in range(n_rows):
data_row = Row()
for split in self.splits_:
feature = split.feature
threshold = split.threshold
if self.X_[r, feature] <= threshold:
data_row.left_splits.add(split.id)
else:
data_row.right_splits.add(split.id)
# Record which nodes it can reach.
for node in self.nodes_:
if node.parent == -1 or node.id in data_row.reachable_nodes:
data_row.reachable_nodes.add(node.id)
for split_id in node.candidate_splits:
if split_id in data_row.left_splits:
if node.children_are_leaves:
data_row.reachable_leaves.add(
node.left_child)
else:
data_row.reachable_nodes.add(
node.left_child)
else:
if node.children_are_leaves:
data_row.reachable_leaves.add(
node.right_child)
else:
data_row.reachable_nodes.add(
node.right_child)
average_leaf_reach += len(data_row.reachable_leaves)
# print(r, data_row.reachable_leaves)
if len(data_row.reachable_leaves) <= 2:
# print("Row reaches fewer leaves: ",
# r, data_row.reachable_leaves)
# print(data_row.left_splits)
# print(data_row.reachable_nodes)
# print(self.X_[r])
data_row.removed_from_master = True
removed_count += 1
# Record which paths it can follow.
for path in self.paths_:
if row_satisfies_path(self.X_[r], self.leaves_[path.leaf_id],
self.splits_, path):
data_row.reachable_paths.add(path.id)
data_rows.append(data_row)
print("Fewer leaves removed rows: ", removed_count)
average_leaf_reach /= n_rows
print("Average leaf reach: ", average_leaf_reach)
# Quadratic loop
if not aggressive:
return data_rows
for r1 in range(n_rows):
if data_rows[r1].removed_from_master:
continue
for r2 in range(r1+1, n_rows):
if data_rows[r2].removed_from_master:
continue
if data_rows[r1].reachable_nodes != \
data_rows[r2].reachable_nodes:
continue
found_mismatch_split = False
valid_splits = set()
for node_id in data_rows[r1].reachable_nodes:
node = self.nodes_[node_id]
for split_id in node.candidate_splits:
valid_splits.add(split_id)
for split_id in valid_splits:
left_membership_r1 = split_id in data_rows[r1].left_splits
left_membership_r2 = split_id in data_rows[r2].left_splits
if left_membership_r1 != left_membership_r2:
found_mismatch_split = True
break
if not found_mismatch_split:
data_rows[r2].removed_from_master = True
removed_count += 1
# print("Rows are similar: ", r1, r2)
print("Total removed rows: ", removed_count)
return data_rows
[docs]
def add_column(self, path):
"""TODO: Documentation.
"""
# print("Adding path")
# path.print_path()
objective = self.solver_.Objective()
infinity = self.solver_.infinity()
xp_var = self.solver_.IntVar(
0.0, infinity, 'p_'+str(len(self.path_vars_)))
self.path_vars_.append(xp_var.index())
path.id = xp_var.index()
objective.SetCoefficient(xp_var, path.cost)
self.paths_.append(path)
self.generated_paths_.add(path.id)
# leaf constraints
for leaf in self.leaves_:
if path.leaf_id == leaf.id:
self.leaf_cons_[leaf.id].SetCoefficient(xp_var, 1)
# row constraints
satisfied_rows = get_satisfied_rows(
path, self.leaves_[path.leaf_id], self.splits_)
for r in satisfied_rows:
self.data_rows[r].reachable_paths.add(path.id)
if not self.added_row[r]:
continue
self.row_cons_[r].SetCoefficient(xp_var, 1)
# consistency constraints
for leaf in self.leaves_:
nodes = leaf.left_nodes + leaf.right_nodes
for node_id in nodes:
node = self.nodes_[node_id]
for split in node.candidate_splits:
ns_constraint = self.ns_constraints_[
leaf.id][node.id][split]
for i in range(len(path.node_ids)):
if path.leaf_id == leaf.id\
and path.node_ids[i] == node.id \
and path.splits[i] == split:
ns_constraint.SetCoefficient(xp_var, 1)
# Cut constraints
satisfied_cuts = get_satisfied_cuts(
path, self.leaves_[path.leaf_id], self.splits_)
for c in satisfied_cuts:
self.cut_cons_[c].SetCoefficient(xp_var, 1)
# Cut generation model.
new_cut_path_var = self.cut_gen_model_.NewBoolVar(
'p_'+str(len(self.cut_path_vars_)))
self.cut_path_vars_.append(new_cut_path_var)
literals = []
split_ids = path.splits
node_ids = path.node_ids
leaf = self.leaves_[path.leaf_id]
for j in range(len(node_ids)):
split_id = split_ids[j]
if node_ids[j] in leaf.left_nodes:
literals.append(self.cut_split_vars_[split_id])
else: # Right node in the path
literals.append(self.cut_split_vars_[split_id].Not())
negated_literals = [x.Not() for x in literals]
# p => AND(splits)
self.cut_gen_model_.AddBoolAnd(
literals).OnlyEnforceIf(new_cut_path_var)
# ~p => NOT(AND(splits)) = OR(NOT(splits))
self.cut_gen_model_.AddBoolOr(
negated_literals).OnlyEnforceIf(new_cut_path_var.Not())
return True
def add_cuts(self):
# Create the objective for the cut generation model
path_coeffs = []
orig_path_coeffs = []
for path in self.paths_:
id = path.id
orig_path_coeff = self.path_solution_values[id]
path_coeff = int(
floor(orig_path_coeff * self.cut_obj_scale))
path_coeffs.append(path_coeff)
orig_path_coeffs.append(orig_path_coeff)
obj = cp_model.LinearExpr.WeightedSum(self.cut_path_vars_, path_coeffs)
self.cut_gen_model_.Maximize(obj)
sp_solver = cp_model.CpSolver()
sp_solver.parameters.num_search_workers = 4
# Solve
cut_generator = CutGenerator(
self.cut_path_vars_, self.cut_split_vars_, orig_path_coeffs)
status = sp_solver.Solve(self.cut_gen_model_, cut_generator)
print("Cut generation status: ", sp_solver.StatusName(status))
print("Cut generation objective: ", sp_solver.ObjectiveValue())
cuts, cut_rows = cut_generator.cuts()
if len(cuts) == 0:
return []
# Update splits to store cut information.
num_existing_cuts = len(self.cut_cons_)
for i, cut_row in enumerate(cut_rows):
cut_index = num_existing_cuts + i
for split in self.splits_:
if cut_row[split.id]:
split.left_cuts.add(cut_index)
else:
split.right_cuts.add(cut_index)
for cut in cuts:
r = len(self.cut_cons_)
self.cut_cons_.append(self.solver_.Constraint(
1, 1, "cut_"+str(r)))
self.added_row.append(True)
for i, path in enumerate(self.paths_):
if cut[i]:
xp_var = self.solver_.variable(path.id)
self.cut_cons_[r].SetCoefficient(xp_var, 1)
return cut_rows
[docs]
def get_satisfied_path_ids(self, row):
"""TODO: Documentation."""
return list(self.data_rows[row].reachable_paths)
[docs]
def violated_row_constraint(self, satisfied_path_ids):
"""TODO: Documentation."""
path_sum = 0.0
for path_id in satisfied_path_ids:
path_sum += self.path_solution_values[path_id]
if abs(path_sum - 1) > 1e-6:
return True
return False
def add_beta_cuts(self):
assert self.generated_
n_rows = self.X_.shape[0]
# added_cuts = False
infinity = self.solver_.infinity()
num_cuts = 0
for r in range(n_rows):
if self.data_rows[r].removed_from_master:
continue
if num_cuts >= 10:
break
if self.added_row[r]:
continue
satisfied_path_ids = self.get_satisfied_path_ids(r)
if (self.violated_row_constraint(satisfied_path_ids)):
# added_cuts = True
num_cuts += 1
self.added_row[r] = True
self.row_cons_[r] = self.solver_.Constraint(
-infinity, 1, "row_"+str(r))
for path_id in satisfied_path_ids:
xp_var = self.solver_.variable(path_id)
self.row_cons_[r].SetCoefficient(xp_var, 1)
return num_cuts
def remove_gen_vars(self):
print("Removing the generated paths. Setting UB to 0.")
self.last_reset_iter_ = self.iter_
self.reset_timer_ = True
for path in self.paths_:
if not path.initial:
xp_var = self.solver_.variable(path.id)
xp_var.SetUb(0)
[docs]
def solve_rmp(self, solver_params=''):
"""TODO: Documentation.
"""
assert self.generated_
self.prev_rmp_objective_ = self.rmp_objective_
self.iter_ += 1
self.rmp_result_status = self.solver_.Solve(
get_params_from_string(solver_params))
# TODO: Hide this under optional logging flag.
print('Number of variables RMIP = %d' % self.solver_.NumVariables())
print('Number of constraints RMIP = %d' %
self.solver_.NumConstraints())
newly_added_cuts = []
if self.iter_ % 10 == 0:
for i in range(self.num_cuts_round):
if self.rmp_result_status == pywraplp.Solver.OPTIMAL:
# print("Before cut iter ", i)
# print(self.solver_.ExportModelAsLpFormat(False))
self.store_lp_solution()
t_start = time()
cut_rows = []
num_beta_cuts = 0
if self.beta_constraints_as_cuts:
num_beta_cuts = self.add_beta_cuts()
if self.generate_cuts and num_beta_cuts == 0:
cut_rows = self.add_cuts()
t_end = time()
self.cut_gen_time += t_end - t_start
self.total_cuts_added += len(cut_rows) + num_beta_cuts
print("Beta cuts added: ", num_beta_cuts)
print("Total cuts added: ", self.total_cuts_added)
print("Time spent in cut gen: ", self.cut_gen_time)
if (len(cut_rows) > 0) or num_beta_cuts > 0:
# Experiment: Cuts added. Remove the generated vars.
# self.remove_gen_vars()
cuts_params = get_params_from_string(solver_params)
cuts_params.SetIntegerParam(
pywraplp.MPSolverParameters.LP_ALGORITHM,
pywraplp.MPSolverParameters.DUAL)
self.rmp_result_status = self.solver_.Solve(
cuts_params)
if len(cut_rows) > 0:
newly_added_cuts.extend(cut_rows)
else:
break
else:
break
leaf_duals = []
row_duals = []
cut_duals = []
ns_duals = {}
if self.rmp_result_status == pywraplp.Solver.OPTIMAL:
self.store_lp_solution()
print('RMP Optimal objective value = %f' %
self.solver_.Objective().Value())
# for var in self.solver_.variables():
# print(var.name(), var.solution_value())
for lid in range(len(self.leaves_)):
leaf_duals.append(self.leaf_cons_[lid].dual_value())
n_rows = self.X_.shape[0]
for r in range(n_rows):
if self.added_row[r]:
row_duals.append(self.row_cons_[r].dual_value())
else:
row_duals.append(0)
for cut_cons in self.cut_cons_:
cut_duals.append(cut_cons.dual_value())
for leaf in self.leaves_:
ns_duals[leaf.id] = {}
nodes = leaf.left_nodes + leaf.right_nodes
for node_id in nodes:
node = self.nodes_[node_id]
ns_duals[leaf.id][node.id] = {}
for split in node.candidate_splits:
ns_duals[leaf.id][node.id][split] = \
self.ns_constraints_[
leaf.id][node.id][split].dual_value()
else:
print("Master problem not solved correctly:",
self.rmp_result_status)
# print(self.solver_.ExportModelAsLpFormat(False))
return (leaf_duals, row_duals, ns_duals, cut_duals, newly_added_cuts)
[docs]
def rmp_objective_improved(self):
if self.rmp_result_status == pywraplp.Solver.OPTIMAL:
return self.prev_rmp_objective_ < self.rmp_objective_
else:
return False
[docs]
def solve_ip(self, solver_params=''):
"""Solves the integer RMP with given solver params.
Returns True if the tree is generated.
Parameters
----------
solver_params : string, default='',
The solver parameters for solving the integer RMP.
"""
assert self.generated_
# We can use sat here since all the coefficients and variables are
# integer.
# TODO: Solver type should be a parameter.
solver = pywraplp.Solver.CreateSolver("sat")
# We have to load the model from LP solver.
# This copy is not avoidable with OR-Tools since we are now switching
# form solving LP to solving IP.
model_proto = linear_solver_pb2.MPModelProto()
self.solver_.ExportModelToProto(model_proto)
solver.LoadModelFromProto(model_proto)
num_path_vars = len(self.path_vars_)
result_status = solver.Solve(get_params_from_string(solver_params))
print('Problem solved in %f milliseconds' % solver.wall_time())
self.selected_paths = []
has_solution = (
result_status == pywraplp.Solver.OPTIMAL or
result_status == pywraplp.Solver.FEASIBLE)
assert has_solution
# TODO: Store solution instead of printing it.
print("Integer RMP Objective = ", solver.Objective().Value())
all_vars = solver.variables()
for path in self.paths_:
path_var = solver.variable(path.id)
if path_var.solution_value() > 0:
self.selected_paths.append(path)
return True
[docs]
class DTreeSubProblem(BaseSubproblem):
"""TODO: Documentation."""
[docs]
def __init__(self, leaf, nodes, splits, targets,
depth, data_rows=None,
solver_str='sat') -> None:
super().__init__()
self.leaf_id_ = leaf.id
self.leaf_ = leaf
self.nodes_ = nodes
self.splits_ = splits
self.targets_ = targets
self.tree_depth_ = depth
self.solver_ = pywraplp.Solver.CreateSolver(solver_str)
self.generated_ = False
self.z_vars_ = None
self.yc_vars_ = None
self.data_rows = data_rows
self.cut_rows_starting_index = 0
self.iter_ = 0
[docs]
def create_submip(self, leaf_dual, row_duals, ns_duals, cut_duals):
"""TODO: Documentation.
"""
assert not self.generated_, "SP is already created."
infinity = self.solver_.infinity()
self.solver_.SetNumThreads(7)
n_rows = self.X_.shape[0]
# Binary variables indicating that row reaches the leaf and has the
# correct target.
self.z_vars_ = [None]*n_rows
# Binary variables indicating that row reaches the leaf.
self.y_vars_ = [None]*n_rows
objective = self.solver_.Objective()
for i in range(n_rows):
obj_coeff = 1
if self.data_rows != None:
if self.data_rows[i].removed_from_sp:
continue
obj_coeff = self.data_rows[i].weight
z_var = self.solver_.BoolVar('z_'+str(i))
objective.SetCoefficient(z_var, obj_coeff)
self.z_vars_[i] = z_var.index()
y_var = self.solver_.BoolVar('y_'+str(i))
row_dual = row_duals[i]
objective.SetCoefficient(y_var, -row_dual)
self.y_vars_[i] = y_var.index()
num_cuts = len(cut_duals)
self.yc_vars_ = [None]*num_cuts
if num_cuts > 0:
for i in range(num_cuts):
yc_var = self.solver_.BoolVar('yc_'+str(i))
cut_dual = cut_duals[i]
objective.SetCoefficient(yc_var, -cut_dual)
self.yc_vars_[i] = yc_var.index()
# Binary variables u_j_a indicating that split s is assigned to the
# node j.
self.u_vars_ = {}
self.leaf_
nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
all_splits = []
for node_id in nodes:
node = self.nodes_[node_id]
self.u_vars_[node.id] = {}
for split in node.candidate_splits:
all_splits.append(split)
u_var = self.solver_.BoolVar(
'u_'+str(node.id)+'_'+str(split))
ns_dual = ns_duals[self.leaf_id_][node.id][split]
objective.SetCoefficient(u_var, -ns_dual)
self.u_vars_[node.id][split] = u_var.index()
objective.SetOffset(-leaf_dual)
objective.SetMaximization()
# Constraints
# For each node, exactly one feature is selected
for node_id in nodes:
node = self.nodes_[node_id]
one_feature = self.solver_.Constraint(
1, 1, "one_feature_" + str(node.id))
for split in node.candidate_splits:
u_var = self.solver_.variable(self.u_vars_[node.id][split])
one_feature.SetCoefficient(u_var, 1)
# Each feature is selected at max once in a path
# all_splits = list(dict.fromkeys(all_splits))
# for split in all_splits:
# unique_split = self.solver_.Constraint(
# 0, 1, "unique_split_" + str(split))
# for node_id in nodes:
# node = self.nodes_[node_id]
# if split not in node.candidate_splits:
# continue
# u_var = self.solver_.variable(self.u_vars_[node.id][split])
# unique_split.SetCoefficient(u_var, 1)
# Row follows correct path (upper and lower bound on y var)
for i in range(n_rows):
if self.data_rows != None:
if self.data_rows[i].removed_from_sp:
continue
y_lb_cons = self.solver_.Constraint(
-infinity, self.tree_depth_-1, "row_" + str(i))
y_var = self.solver_.variable(self.y_vars_[i])
y_lb_cons.SetCoefficient(y_var, -1)
for node_id in nodes:
node = self.nodes_[node_id]
rn_cons = self.solver_.Constraint(
-infinity, 0, "rn_" + str(i) + '_' + str(node.id))
rn_cons.SetCoefficient(y_var, 1)
if node.id in self.leaf_.left_nodes:
for split in node.candidate_splits:
if self.row_satisfies_split(i, self.splits_[split]):
u_var = self.solver_.variable(
self.u_vars_[node.id][split])
rn_cons.SetCoefficient(u_var, -1)
y_lb_cons.SetCoefficient(u_var, 1)
elif node.id in self.leaf_.right_nodes:
for split in node.candidate_splits:
if not self.row_satisfies_split(i,
self.splits_[split]):
u_var = self.solver_.variable(
self.u_vars_[node.id][split])
rn_cons.SetCoefficient(u_var, -1)
y_lb_cons.SetCoefficient(u_var, 1)
# only one class selected
# Binary variables indicating that the class is selected.
self.w_vars_ = [None]*len(self.targets_)
single_target_cons = self.solver_.Constraint(0, 1, 'single_target')
for i in range(len(self.targets_)):
target = self.targets_[i]
w_var = self.solver_.BoolVar('w_' + str(target))
self.w_vars_[i] = w_var.index()
single_target_cons.SetCoefficient(w_var, 1)
# relation between w,y,z vars
for i in range(n_rows):
if self.data_rows != None:
if self.data_rows[i].removed_from_sp:
continue
z_var = self.solver_.variable(self.z_vars_[i])
y_var = self.solver_.variable(self.y_vars_[i])
# z <= y (row reaches leaf)
yz_cons = self.solver_.Constraint(-infinity, 0, "yz_" + str(i))
yz_cons.SetCoefficient(z_var, 1)
yz_cons.SetCoefficient(y_var, -1)
# z <= w (row has correct target)
correct_class = self.y_[i]
w_var = self.solver_.variable(self.w_vars_[correct_class])
wz_cons = self.solver_.Constraint(-infinity, 0, "wz_" + str(i))
wz_cons.SetCoefficient(z_var, 1)
wz_cons.SetCoefficient(w_var, -1)
self.generated_ = True
[docs]
def update_objective(self, leaf_dual, row_duals, ns_duals, cut_duals):
"""TODO: Documentation.
"""
objective = self.solver_.Objective()
n_rows = self.X_.shape[0]
for i in range(n_rows):
obj_coeff = 1
if self.data_rows != None:
if self.data_rows[i].removed_from_sp:
continue
obj_coeff = self.data_rows[i].weight
z_var = self.solver_.variable(self.z_vars_[i])
objective.SetCoefficient(z_var, obj_coeff)
y_var = self.solver_.variable(self.y_vars_[i])
row_dual = row_duals[i]
objective.SetCoefficient(y_var, -row_dual)
num_cuts = len(cut_duals)
existing_cuts = len(self.yc_vars_)
if num_cuts > 0:
for i in range(num_cuts):
yc_var = None
if i >= existing_cuts:
yc_var = self.solver_.BoolVar('yc_'+str(i))
self.yc_vars_.append(yc_var.index())
else:
yc_var = self.solver_.variable(self.yc_vars_[i])
cut_dual = cut_duals[i]
objective.SetCoefficient(yc_var, -cut_dual)
nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
for node_id in nodes:
node = self.nodes_[node_id]
for split in node.candidate_splits:
u_var = self.solver_.variable(self.u_vars_[node.id][split])
ns_dual = ns_duals[self.leaf_id_][node.id][split]
objective.SetCoefficient(u_var, -ns_dual)
objective.SetOffset(-leaf_dual)
objective.SetMaximization()
return
def add_cut_rows(self, cut_rows):
assert self.generated_
infinity = self.solver_.infinity()
nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
n_cut_rows = len(cut_rows)
# Row follows correct path (upper and lower bound on y var)
for i in range(n_cut_rows):
cut_index = i + self.cut_rows_starting_index
yc_lb_cons = self.solver_.Constraint(
-infinity, self.tree_depth_-1, "cut_" + str(cut_index))
# yc_var is created in update_objective method.
yc_var = self.solver_.variable(self.yc_vars_[cut_index])
yc_lb_cons.SetCoefficient(yc_var, -1)
for node_id in nodes:
node = self.nodes_[node_id]
cn_cons = self.solver_.Constraint(
-infinity, 0, "cn_" + str(cut_index) + '_' + str(node.id))
cn_cons.SetCoefficient(yc_var, 1)
if node.id in self.leaf_.left_nodes:
for split in node.candidate_splits:
if cut_rows[i][split]:
u_var = self.solver_.variable(
self.u_vars_[node.id][split])
cn_cons.SetCoefficient(u_var, -1)
yc_lb_cons.SetCoefficient(u_var, 1)
elif node.id in self.leaf_.right_nodes:
for split in node.candidate_splits:
if not cut_rows[i][split]:
u_var = self.solver_.variable(
self.u_vars_[node.id][split])
cn_cons.SetCoefficient(u_var, -1)
yc_lb_cons.SetCoefficient(u_var, 1)
self.cut_rows_starting_index += n_cut_rows
[docs]
def generate_columns(self, X, y, dual_costs, params=""):
"""TODO: Documentation.
"""
# Solve sub problem
result_status = self.solver_.Solve(get_params_from_string(params))
has_solution = (result_status == pywraplp.Solver.OPTIMAL or
result_status == pywraplp.Solver.FEASIBLE)
# The current path is always feasible.
# # TODO: Fix this. Sat solver returns infeasible sometimes
# (even without cuts).
if not has_solution:
return []
# print(self.solver_.ExportModelAsLpFormat(False))
assert has_solution, "Result status: " + str(result_status)
# if self.leaf_id_ == 4:
# sp_mps = self.solver_.ExportModelAsMpsFormat(fixed_format=False,
# obfuscated=False)
# name = "sp_" + str(self.tree_depth_) + "_" + \
# str(self.iter_) + ".mps"
# self.iter_ += 1
# f = open(name, "w")
# f.write(sp_mps)
# f.close()
# TODO: This threshold should be a parameter.
print("Subproblem ", self.leaf_id_, " objective = ",
self.solver_.Objective().Value())
if self.solver_.Objective().Value() <= 1e-6:
return []
nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
path = Path()
path.leaf_id = self.leaf_id_
path.node_ids = []
path.splits = []
path.cost = 0
path.id = -1
for node in self.nodes_:
if node.id not in nodes:
continue
path.node_ids.append(node.id)
for split in node.candidate_splits:
u_var = self.solver_.variable(self.u_vars_[node.id][split])
if u_var.solution_value() > 0.5:
path.splits.append(split)
break
for target in self.targets_:
w_var = self.solver_.variable(self.w_vars_[target])
if w_var.solution_value() > 0.5:
path.target = target
break
n_rows = self.X_.shape[0]
for i in range(n_rows):
obj_coeff = 1
if self.data_rows != None:
if self.data_rows[i].removed_from_sp:
continue
obj_coeff = self.data_rows[i].weight
z_var = self.solver_.variable(self.z_vars_[i])
path.cost += obj_coeff * z_var.solution_value()
# if z_var.solution_value() > 0.5:
# print(self.z_vars_[i], z_var)
return [path]
[docs]
def update_subproblem(self, X, y, dual_costs):
self.X_ = X
self.y_ = y
leaf_dual = dual_costs[0][self.leaf_id_]
row_duals = dual_costs[1]
ns_duals = dual_costs[2]
cut_duals = dual_costs[3]
cut_rows = dual_costs[4]
if self.generated_:
self.update_objective(leaf_dual, row_duals, ns_duals, cut_duals)
else:
self.create_submip(leaf_dual, row_duals, ns_duals, cut_duals)
if len(cut_rows) > 0:
self.add_cut_rows(cut_rows)
[docs]
def row_satisfies_split(self, row, split):
"""TODO: Documentation.
"""
feature = split.feature
threshold = split.threshold
if self.X_[row, feature] <= threshold:
return True
return False
class PathGenerator(cp_model.CpSolverSolutionCallback):
"""Track intermediate solutions of sp.
Attributes
----------
TODO
"""
def __init__(self, u_vars, w_vars, z_vars, y_vars,
yc_vars,
leaf_dual,
leaf, targets,
nodes,
orig_obj_coeffs):
cp_model.CpSolverSolutionCallback.__init__(self)
self.u_vars = u_vars
self.w_vars = w_vars
self.z_vars = z_vars
self.y_vars = y_vars
self.yc_vars = yc_vars
self.leaf = leaf
self.leaf_dual = leaf_dual
self.targets = targets
self.nodes = nodes
self.orig_obj_coeffs = orig_obj_coeffs
self.solution_count = 0
self.paths = []
self.original_objective = 0.0
def on_solution_callback(self):
self.solution_count += 1
original_objective = 0.0
path_cost = 0.0
for z_var in self.z_vars:
if z_var == None:
continue
path_cost += self.orig_obj_coeffs[z_var] * \
self.Value(z_var)
original_objective += path_cost
for y_var in self.y_vars:
if y_var == None:
continue
original_objective += self.orig_obj_coeffs[y_var] * \
self.Value(y_var)
for yc_var in self.yc_vars:
original_objective += self.orig_obj_coeffs[yc_var] * \
self.Value(yc_var)
nodes = self.leaf.left_nodes + self.leaf.right_nodes
for node_id in nodes:
node = self.nodes[node_id]
for split in node.candidate_splits:
u_var = self.u_vars[node.id][split]
original_objective += self.orig_obj_coeffs[u_var] * \
self.Value(u_var)
original_objective -= self.leaf_dual
self.original_objective = original_objective
if original_objective <= 1e-6:
return
nodes = self.leaf.left_nodes + self.leaf.right_nodes
path = Path()
path.leaf_id = self.leaf.id
path.node_ids = []
path.splits = []
path.cost = path_cost
path.id = -1
for node_id in nodes:
node = self.nodes[node_id]
path.node_ids.append(node.id)
for split in node.candidate_splits:
u_var = self.u_vars[node.id][split]
if self.Value(u_var) > 0.5:
path.splits.append(split)
break
for target in self.targets:
w_var = self.w_vars[target]
if self.Value(w_var) > 0.5:
path.target = target
break
self.paths.append(path)
# print("Path cost", path_cost)
class DTreeSubProblemSat(BaseSubproblem):
"""TODO: Documentation."""
def __init__(self, leaf, nodes, splits, targets,
depth, data_rows=None) -> None:
super().__init__()
self.leaf_id_ = leaf.id
self.leaf_ = leaf
self.nodes_ = nodes
self.splits_ = splits
self.targets_ = targets
self.tree_depth_ = depth
self.model_ = cp_model.CpModel()
self.objective_scale = 100000
self.orig_obj_coeffs = {}
self.generated_ = False
self.z_vars_ = None
self.yc_vars_ = None
self.data_rows = data_rows
self.cut_rows_starting_index = 0
def scale_obj_coeff(self, value):
return int(round(value * self.objective_scale))
def create_submip(self, leaf_dual, row_duals, ns_duals, cut_duals):
"""TODO: Documentation.
"""
assert not self.generated_, "SP is already created."
# obj = cp_model.LinearExpr()
n_rows = self.X_.shape[0]
# Binary variables indicating that row reaches the leaf and has the
# correct target.
self.z_vars_ = [None]*n_rows
# Binary variables indicating that row reaches the leaf.
self.y_vars_ = [None]*n_rows
obj_vars = []
obj_coeffs = []
for i in range(n_rows):
z_coeff = 1
if self.data_rows != None:
if self.data_rows[i].removed_from_sp:
continue
z_coeff = self.data_rows[i].weight
z_var = self.model_.NewBoolVar('z_'+str(i))
self.orig_obj_coeffs[z_var] = z_coeff
z_coeff = self.scale_obj_coeff(z_coeff)
self.z_vars_[i] = (z_var)
# obj += cp_model.LinearExpr.Term(z_var, z_coeff)
obj_vars.append(z_var)
obj_coeffs.append(z_coeff)
y_var = self.model_.NewBoolVar('y_'+str(i))
self.orig_obj_coeffs[y_var] = -row_duals[i]
y_coeff = self.scale_obj_coeff(-row_duals[i])
self.y_vars_[i] = (y_var)
# obj += cp_model.LinearExpr.Term(y_var, y_coeff)
obj_vars.append(y_var)
obj_coeffs.append(y_coeff)
num_cuts = len(cut_duals)
self.yc_vars_ = [None]*num_cuts
if num_cuts > 0:
for i in range(num_cuts):
yc_var = self.model_.NewBoolVar('yc_'+str(i))
self.orig_obj_coeffs[yc_var] = -cut_duals[i]
yc_coeff = self.scale_obj_coeff(-cut_duals[i])
self.yc_vars_[i] = yc_var
# obj += cp_model.LinearExpr.Term(yc_var, yc_coeff)
obj_vars.append(yc_var)
obj_coeffs.append(yc_coeff)
# Binary variables u_j_a indicating that split s is assigned to the
# node j.
self.u_vars_ = {}
nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
all_splits = []
for node_id in nodes:
node = self.nodes_[node_id]
self.u_vars_[node.id] = {}
for split in node.candidate_splits:
all_splits.append(split)
u_var = self.model_.NewBoolVar(
'u_'+str(node.id)+'_'+str(split))
self.orig_obj_coeffs[u_var] = \
-ns_duals[self.leaf_id_][node.id][split]
u_coeff = self.scale_obj_coeff(
-ns_duals[self.leaf_id_][node.id][split])
self.u_vars_[node.id][split] = u_var
# obj += cp_model.LinearExpr.Term(u_var, u_coeff)
obj_vars.append(u_var)
obj_coeffs.append(u_coeff)
# obj += -self.scale_obj_coeff(leaf_dual)
obj = cp_model.LinearExpr.WeightedSum(obj_vars, obj_coeffs) - leaf_dual
self.model_.Maximize(obj)
# Constraints
# For each node, exactly one feature is selected
for node_id in nodes:
node = self.nodes_[node_id]
cons_vars = []
for split in node.candidate_splits:
u_var = self.u_vars_[node.id][split]
cons_vars.append(u_var)
self.model_.Add(cp_model.LinearExpr.Sum(cons_vars) == 1)
# Each feature is selected at max once in a path
# all_splits = list(dict.fromkeys(all_splits))
# for split in all_splits:
# cons_vars = []
# for node_id in nodes:
# node = self.nodes_[node_id]
# if split not in node.candidate_splits:
# continue
# u_var = self.u_vars_[node.id][split]
# cons_vars.append(u_var)
# self.model_.Add(cp_model.LinearExpr.Sum(cons_vars) <= 1)
# Row follows correct path (upper and lower bound on y var)
for i in range(n_rows):
if self.data_rows != None:
if self.data_rows[i].removed_from_sp:
continue
y_lb_expr = cp_model.LinearExpr.Term(self.y_vars_[i], -1)
for node_id in nodes:
node = self.nodes_[node_id]
rn_expr = cp_model.LinearExpr.Term(self.y_vars_[i], 1)
if node.id in self.leaf_.left_nodes:
for split in node.candidate_splits:
if self.row_satisfies_split(i, self.splits_[split]):
u_var = self.u_vars_[node.id][split]
rn_expr += cp_model.LinearExpr.Term(u_var, -1)
y_lb_expr += cp_model.LinearExpr.Term(u_var, 1)
elif node.id in self.leaf_.right_nodes:
for split in node.candidate_splits:
if not self.row_satisfies_split(i,
self.splits_[split]):
u_var = self.u_vars_[node.id][split]
rn_expr += cp_model.LinearExpr.Term(u_var, -1)
y_lb_expr += cp_model.LinearExpr.Term(u_var, 1)
self.model_.Add(rn_expr <= 0)
self.model_.Add(y_lb_expr <= self.tree_depth_-1)
# only one class selected
# Binary variables indicating that the class is selected.
self.w_vars_ = [None]*len(self.targets_)
for i in range(len(self.targets_)):
target = self.targets_[i]
w_var = self.model_.NewBoolVar('w_'+str(target))
self.w_vars_[i] = w_var
# Note: This can be added as <= 1 inequality because of the objective.
self.model_.Add(cp_model.LinearExpr.Sum(self.w_vars_) == 1)
# relation between w,y,z vars
for i in range(n_rows):
if self.data_rows != None:
if self.data_rows[i].removed_from_sp:
continue
# z <= y (row reaches leaf)
self.model_.Add(self.z_vars_[i] <= self.y_vars_[i])
# z <= w (row has correct target)
correct_class = self.y_[i]
self.model_.Add(self.z_vars_[i] <= self.w_vars_[correct_class])
self.generated_ = True
def update_objective(self, leaf_dual, row_duals, ns_duals, cut_duals):
"""TODO: Documentation.
"""
assert self.generated_, "Called update_objective before generating SP"
# obj = cp_model.LinearExpr()
# TODO: This method is slow. Make it faster.
obj_vars = []
obj_coeffs = []
n_rows = self.X_.shape[0]
for i in range(n_rows):
z_coeff = 1
if self.data_rows != None:
if self.data_rows[i].removed_from_sp:
continue
z_coeff = self.data_rows[i].weight
z_var = self.z_vars_[i]
self.orig_obj_coeffs[z_var] = z_coeff
z_coeff = self.scale_obj_coeff(z_coeff)
# obj += cp_model.LinearExpr.Term(z_var, z_coeff)
obj_vars.append(z_var)
obj_coeffs.append(z_coeff)
y_coeff = self.scale_obj_coeff(-row_duals[i])
y_var = self.y_vars_[i]
self.orig_obj_coeffs[y_var] = -row_duals[i]
# obj += cp_model.LinearExpr.Term(y_var, y_coeff)
obj_vars.append(y_var)
obj_coeffs.append(y_coeff)
num_cuts = len(cut_duals)
existing_cuts = len(self.yc_vars_)
if num_cuts > 0:
for i in range(num_cuts):
yc_var = None
if i >= existing_cuts:
yc_var = self.model_.NewBoolVar('yc_'+str(i))
self.yc_vars_.append(yc_var)
else:
yc_var = self.yc_vars_[i]
yc_coeff = self.scale_obj_coeff(-cut_duals[i])
self.orig_obj_coeffs[yc_var] = -cut_duals[i]
# obj += cp_model.LinearExpr.Term(yc_var, yc_coeff)
obj_vars.append(yc_var)
obj_coeffs.append(yc_coeff)
# Binary variables u_j_a indicating that split s is assigned to the
# node j.
nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
for node_id in nodes:
node = self.nodes_[node_id]
for split in node.candidate_splits:
u_coeff = self.scale_obj_coeff(
-ns_duals[self.leaf_id_][node.id][split])
u_var = self.u_vars_[node.id][split]
self.orig_obj_coeffs[u_var] = \
-ns_duals[self.leaf_id_][node.id][split]
# obj += cp_model.LinearExpr.Term(u_var, u_coeff)
obj_vars.append(u_var)
obj_coeffs.append(u_coeff)
# obj -= self.scale_obj_coeff(leaf_dual)
obj = cp_model.LinearExpr.WeightedSum(obj_vars, obj_coeffs) - leaf_dual
self.model_.Maximize(obj)
return
def add_cut_rows(self, cut_rows):
assert self.generated_
nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
n_cut_rows = len(cut_rows)
# Row follows correct path (upper and lower bound on y var)
for i in range(n_cut_rows):
cut_index = i + self.cut_rows_starting_index
yc_lb_expr = cp_model.LinearExpr.Term(self.yc_vars_[cut_index], -1)
for node_id in nodes:
node = self.nodes_[node_id]
cn_expr = cp_model.LinearExpr.Term(self.yc_vars_[cut_index], 1)
if node.id in self.leaf_.left_nodes:
for split in node.candidate_splits:
if cut_rows[i][split]:
u_var = self.u_vars_[node.id][split]
cn_expr += cp_model.LinearExpr.Term(u_var, -1)
yc_lb_expr += cp_model.LinearExpr.Term(u_var, 1)
elif node.id in self.leaf_.right_nodes:
for split in node.candidate_splits:
if not cut_rows[i][split]:
u_var = self.u_vars_[node.id][split]
cn_expr += cp_model.LinearExpr.Term(u_var, -1)
yc_lb_expr += cp_model.LinearExpr.Term(u_var, 1)
self.model_.Add(cn_expr <= 0)
self.model_.Add(yc_lb_expr <= self.tree_depth_-1)
self.cut_rows_starting_index += n_cut_rows
def update_subproblem(self, X, y, dual_costs):
cut_rows = dual_costs[4]
if len(cut_rows) == 0:
return
self.X_ = X
self.y_ = y
leaf_dual = dual_costs[0][self.leaf_id_]
row_duals = dual_costs[1]
ns_duals = dual_costs[2]
cut_duals = dual_costs[3]
if self.generated_:
self.update_objective(leaf_dual, row_duals, ns_duals, cut_duals)
else:
self.create_submip(leaf_dual, row_duals, ns_duals, cut_duals)
if len(cut_rows) > 0:
self.add_cut_rows(cut_rows)
def row_satisfies_split(self, row, split):
"""TODO: Documentation.
"""
feature = split.feature
threshold = split.threshold
if self.X_[row, feature] <= threshold:
return True
return False
def generate_columns(self, X, y, dual_costs, params=""):
"""TODO: Documentation.
"""
self.X_ = X
self.y_ = y
leaf_dual = dual_costs[0][self.leaf_id_]
row_duals = dual_costs[1]
ns_duals = dual_costs[2]
cut_duals = dual_costs[3]
if self.generated_:
self.update_objective(leaf_dual, row_duals, ns_duals, cut_duals)
else:
self.create_submip(leaf_dual, row_duals, ns_duals, cut_duals)
# Solve sub problem
self.cp_solver_.parameters.num_search_workers = 7
# Solve
leaf_dual = dual_costs[0][self.leaf_id_]
path_generator = PathGenerator(self.u_vars_, self.w_vars_, self.z_vars_,
self.y_vars_,
self.yc_vars_, leaf_dual, self.leaf_,
self.targets_,
self.nodes_,
self.orig_obj_coeffs)
# status = self.cp_solver_.Solve(self.model_, path_generator)
self.solve_model(self.model_, time_limit=2, callback=path_generator)
# print(self.leaf_id_, " Path generation status: ",
# sp_solver.StatusName(status))
print(self.leaf_id_, " Path generation objective: ",
path_generator.original_objective)
paths = path_generator.paths
return paths
[docs]
class DTreeSubProblemHeuristic(BaseSubproblem):
"""TODO: Documentation."""
[docs]
def __init__(self, leaves, nodes, splits, targets,
depth, data_rows=None) -> None:
super().__init__()
self.leaves_ = leaves
self.nodes_ = nodes
self.splits_ = splits
self.targets_ = targets
self.tree_depth_ = depth
self.data_rows = data_rows
self.num_cuts = 0
self.failed_rounds = 0
self.success_rounds = 0
[docs]
def generate_columns(self, X, y, dual_costs, params=""):
""" TODO: Documentation."""
# Generate random columns and return the ones with postive reduced
# cost.
generated_paths = []
# return generated_paths
for iter in range(100):
path = Path()
# Pick a leaf
leaf = random.choice(self.leaves_)
path.leaf_id = leaf.id
# Pick splits for each node.
node_ids = leaf.left_nodes + leaf.right_nodes
path.node_ids = node_ids
path.splits = []
success = True
for node_id in node_ids:
node = self.nodes_[node_id]
candidate_splits = node.candidate_splits.copy()
for used_split in path.splits:
if used_split in candidate_splits:
candidate_splits.remove(used_split)
if not candidate_splits:
success = False
break
chosen_split = random.choice(candidate_splits)
path.splits.append(chosen_split)
if not success:
self.failed_rounds += 1
continue
# Compute the best target.
n_rows = X.shape[0]
possible_targets = {}
best_target = -1
best_target_count = 0
row_satisfies_path_array = [False]*n_rows
satisfied_rows = get_satisfied_rows(path, leaf, self.splits_)
# print(satisfied_rows)
for row in satisfied_rows:
row_satisfies_path_array[row] = True
row_weight = 1
if self.data_rows != None:
row_weight = self.data_rows[row].weight
if self.data_rows[row].removed_from_sp:
continue
target = y[row]
if target in possible_targets.keys():
possible_targets[target] += row_weight
else:
possible_targets[target] = row_weight
if possible_targets[target] > best_target_count:
best_target_count = possible_targets[target]
best_target = target
path.target = best_target
path.cost = best_target_count
already_generated = False
for old_path in generated_paths:
if path.is_same_as(old_path):
already_generated = True
break
if already_generated:
self.failed_rounds += 1
continue
# Evaluate the reduced cost.
reduced_cost = self.get_reduced_cost(
X, y, dual_costs, path, row_satisfies_path_array)
if reduced_cost > 1e-6:
# print("Generated new path: ", len(generated_paths))
# path.print_path()
generated_paths.append(path)
else:
self.failed_rounds += 1
self.success_rounds += len(generated_paths)
return generated_paths
[docs]
def update_subproblem(self, X, y, dual_costs):
cut_rows = dual_costs[4]
for i, cut_row in enumerate(cut_rows):
cut_index = self.num_cuts + i
for split in self.splits_:
if cut_row[split.id]:
split.left_cuts.add(cut_index)
else:
split.right_cuts.add(cut_index)
self.num_cuts += len(cut_rows)
[docs]
def get_reduced_cost(self, X, y, dual_costs, path,
row_satisfies_path_array):
"""TODO: Documentation."""
n_rows = X.shape[0]
leaves_dual = dual_costs[0]
row_duals = dual_costs[1]
ns_duals = dual_costs[2]
cut_duals = dual_costs[3]
reduced_cost = 0
try:
reduced_cost -= leaves_dual[path.leaf_id]
except IndexError:
print("List index out of error: ", path.leaf_id, len(leaves_dual))
reduced_cost += path.cost
for row in range(n_rows):
if row_satisfies_path_array[row]:
reduced_cost -= row_duals[row]
satisfied_cuts = get_satisfied_cuts(
path, self.leaves_[path.leaf_id], self.splits_)
for c in satisfied_cuts:
reduced_cost -= cut_duals[c]
for i in range(len(path.node_ids)):
node_id = path.node_ids[i]
split_id = path.splits[i]
reduced_cost -= ns_duals[path.leaf_id][node_id][split_id]
return reduced_cost
class DTreeSubProblemOld(BaseSubproblem):
"""TODO: Documentation."""
def __init__(self, leaf, nodes, splits, target,
depth, data_rows=None,
solver_str='sat') -> None:
super().__init__()
self.leaf_id_ = leaf.id
self.leaf_ = leaf
self.nodes_ = nodes
self.splits_ = splits
self.target_ = target
self.tree_depth_ = depth
self.solver_ = pywraplp.Solver.CreateSolver(solver_str)
self.generated_ = False
self.yc_vars_ = None
self.data_rows = data_rows
self.cut_rows_starting_index = 0
self.iter_ = 0
def create_submip(self, leaf_dual, row_duals, ns_duals, cut_duals):
"""TODO: Documentation.
"""
assert not self.generated_, "SP is already created."
infinity = self.solver_.infinity()
self.solver_.SetNumThreads(7)
n_rows = self.X_.shape[0]
# Binary variables indicating that row reaches the leaf.
self.y_vars_ = [None]*n_rows
objective = self.solver_.Objective()
for i in range(n_rows):
obj_coeff = 1
if self.data_rows != None:
if self.data_rows[i].removed_from_sp:
continue
obj_coeff = self.data_rows[i].weight
y_var = self.solver_.BoolVar('y_'+str(i))
row_dual = row_duals[i]
if self.y_[i] == self.target_:
objective.SetCoefficient(y_var, 1-row_dual)
else:
objective.SetCoefficient(y_var, -row_dual)
self.y_vars_[i] = y_var.index()
num_cuts = len(cut_duals)
self.yc_vars_ = [None]*num_cuts
if num_cuts > 0:
for i in range(num_cuts):
yc_var = self.solver_.BoolVar('yc_'+str(i))
cut_dual = cut_duals[i]
objective.SetCoefficient(yc_var, -cut_dual)
self.yc_vars_[i] = yc_var.index()
# Binary variables u_j_a indicating that split s is assigned to the
# node j.
self.u_vars_ = {}
self.leaf_
nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
all_splits = []
for node_id in nodes:
node = self.nodes_[node_id]
self.u_vars_[node.id] = {}
for split in node.candidate_splits:
all_splits.append(split)
u_var = self.solver_.BoolVar(
'u_'+str(node.id)+'_'+str(split))
ns_dual = ns_duals[self.leaf_id_][node.id][split]
objective.SetCoefficient(u_var, -ns_dual)
self.u_vars_[node.id][split] = u_var.index()
objective.SetOffset(-leaf_dual)
objective.SetMaximization()
# Constraints
# For each node, exactly one feature is selected
for node_id in nodes:
node = self.nodes_[node_id]
one_feature = self.solver_.Constraint(
1, 1, "one_feature_" + str(node.id))
for split in node.candidate_splits:
u_var = self.solver_.variable(self.u_vars_[node.id][split])
one_feature.SetCoefficient(u_var, 1)
# Each feature is selected at max once in a path
# all_splits = list(dict.fromkeys(all_splits))
# for split in all_splits:
# unique_split = self.solver_.Constraint(
# 0, 1, "unique_split_" + str(split))
# for node_id in nodes:
# node = self.nodes_[node_id]
# if split not in node.candidate_splits:
# continue
# u_var = self.solver_.variable(self.u_vars_[node.id][split])
# unique_split.SetCoefficient(u_var, 1)
# Row follows correct path (upper and lower bound on y var)
for i in range(n_rows):
if self.data_rows != None:
if self.data_rows[i].removed_from_sp:
continue
y_lb_cons = self.solver_.Constraint(
-infinity, self.tree_depth_-1, "row_" + str(i))
y_var = self.solver_.variable(self.y_vars_[i])
y_lb_cons.SetCoefficient(y_var, -1)
for node_id in nodes:
node = self.nodes_[node_id]
rn_cons = self.solver_.Constraint(
-infinity, 0, "rn_" + str(i) + '_' + str(node.id))
rn_cons.SetCoefficient(y_var, 1)
if node.id in self.leaf_.left_nodes:
for split in node.candidate_splits:
if self.row_satisfies_split(i, self.splits_[split]):
u_var = self.solver_.variable(
self.u_vars_[node.id][split])
rn_cons.SetCoefficient(u_var, -1)
y_lb_cons.SetCoefficient(u_var, 1)
elif node.id in self.leaf_.right_nodes:
for split in node.candidate_splits:
if not self.row_satisfies_split(i,
self.splits_[split]):
u_var = self.solver_.variable(
self.u_vars_[node.id][split])
rn_cons.SetCoefficient(u_var, -1)
y_lb_cons.SetCoefficient(u_var, 1)
self.generated_ = True
def update_objective(self, leaf_dual, row_duals, ns_duals, cut_duals):
"""TODO: Documentation.
"""
objective = self.solver_.Objective()
n_rows = self.X_.shape[0]
for i in range(n_rows):
obj_coeff = 1
if self.data_rows != None:
if self.data_rows[i].removed_from_sp:
continue
# obj_coeff = self.data_rows[i].weight
y_var = self.solver_.variable(self.y_vars_[i])
row_dual = row_duals[i]
if self.y_[i] == self.target_:
objective.SetCoefficient(y_var, 1-row_dual)
else:
objective.SetCoefficient(y_var, -row_dual)
num_cuts = len(cut_duals)
existing_cuts = len(self.yc_vars_)
if num_cuts > 0:
for i in range(num_cuts):
yc_var = None
if i >= existing_cuts:
yc_var = self.solver_.BoolVar('yc_'+str(i))
self.yc_vars_.append(yc_var.index())
else:
yc_var = self.solver_.variable(self.yc_vars_[i])
cut_dual = cut_duals[i]
objective.SetCoefficient(yc_var, -cut_dual)
nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
for node_id in nodes:
node = self.nodes_[node_id]
for split in node.candidate_splits:
u_var = self.solver_.variable(self.u_vars_[node.id][split])
ns_dual = ns_duals[self.leaf_id_][node.id][split]
objective.SetCoefficient(u_var, -ns_dual)
objective.SetOffset(-leaf_dual)
objective.SetMaximization()
return
def add_cut_rows(self, cut_rows):
assert self.generated_
infinity = self.solver_.infinity()
nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
n_cut_rows = len(cut_rows)
# Row follows correct path (upper and lower bound on y var)
for i in range(n_cut_rows):
cut_index = i + self.cut_rows_starting_index
yc_lb_cons = self.solver_.Constraint(
-infinity, self.tree_depth_-1, "cut_" + str(cut_index))
# yc_var is created in update_objective method.
yc_var = self.solver_.variable(self.yc_vars_[cut_index])
yc_lb_cons.SetCoefficient(yc_var, -1)
for node_id in nodes:
node = self.nodes_[node_id]
cn_cons = self.solver_.Constraint(
-infinity, 0, "cn_" + str(cut_index) + '_' + str(node.id))
cn_cons.SetCoefficient(yc_var, 1)
if node.id in self.leaf_.left_nodes:
for split in node.candidate_splits:
if cut_rows[i][split]:
u_var = self.solver_.variable(
self.u_vars_[node.id][split])
cn_cons.SetCoefficient(u_var, -1)
yc_lb_cons.SetCoefficient(u_var, 1)
elif node.id in self.leaf_.right_nodes:
for split in node.candidate_splits:
if not cut_rows[i][split]:
u_var = self.solver_.variable(
self.u_vars_[node.id][split])
cn_cons.SetCoefficient(u_var, -1)
yc_lb_cons.SetCoefficient(u_var, 1)
self.cut_rows_starting_index += n_cut_rows
def generate_columns(self, X, y, dual_costs, params=""):
"""TODO: Documentation.
"""
# Solve sub problem
result_status = self.solver_.Solve(get_params_from_string(params))
has_solution = (result_status == pywraplp.Solver.OPTIMAL or
result_status == pywraplp.Solver.FEASIBLE)
# The current path is always feasible.
# # TODO: Fix this. Sat solver returns infeasible sometimes
# (even without cuts).
if not has_solution:
return []
# print(self.solver_.ExportModelAsLpFormat(False))
assert has_solution, "Result status: " + str(result_status)
# if self.leaf_id_ == 4:
# sp_mps = self.solver_.ExportModelAsMpsFormat(fixed_format=False,
# obfuscated=False)
# name = "sp_" + str(self.tree_depth_) + "_" + \
# str(self.iter_) + ".mps"
# self.iter_ += 1
# f = open(name, "w")
# f.write(sp_mps)
# f.close()
# TODO: This threshold should be a parameter.
print("Subproblem ", self.leaf_id_, " objective = ",
self.solver_.Objective().Value())
if self.solver_.Objective().Value() <= 1e-6:
return []
nodes = self.leaf_.left_nodes + self.leaf_.right_nodes
path = Path()
path.leaf_id = self.leaf_id_
path.node_ids = []
path.splits = []
path.cost = 0
path.id = -1
for node in self.nodes_:
if node.id not in nodes:
continue
path.node_ids.append(node.id)
for split in node.candidate_splits:
u_var = self.solver_.variable(self.u_vars_[node.id][split])
if u_var.solution_value() > 0.5:
path.splits.append(split)
break
path.target = self.target_
n_rows = self.X_.shape[0]
for i in range(n_rows):
obj_coeff = 1
if self.data_rows != None:
if self.data_rows[i].removed_from_sp:
continue
# obj_coeff = self.data_rows[i].weight
y_var = self.solver_.variable(self.y_vars_[i])
if self.y_[i] == self.target_:
path.cost += 1 * y_var.solution_value()
return [path]
def update_subproblem(self, X, y, dual_costs):
self.X_ = X
self.y_ = y
leaf_dual = dual_costs[0][self.leaf_id_]
row_duals = dual_costs[1]
ns_duals = dual_costs[2]
cut_duals = dual_costs[3]
cut_rows = dual_costs[4]
if self.generated_:
self.update_objective(leaf_dual, row_duals, ns_duals, cut_duals)
else:
self.create_submip(leaf_dual, row_duals, ns_duals, cut_duals)
if len(cut_rows) > 0:
self.add_cut_rows(cut_rows)
def row_satisfies_split(self, row, split):
"""TODO: Documentation.
"""
feature = split.feature
threshold = split.threshold
if self.X_[row, feature] <= threshold:
return True
return False
[docs]
class DTreeClassifier(ColGenClassifier):
"""Decision Tree classifier using column generation.
Parameters
----------
initial_paths: list(Path), default=[],
List of paths used to initialize the master problem. The user must
ensure that a valid tree can be formed using the initial paths.
leaves: list(Leaf), default=[],
List of leaves in the tree.
nodes: list(Node), default=[],
List of nodes in the tree.
splits: list(Split), default=[],
List of split checks used in the nodes.
tree_depth: int, default=1,
Depth of the tree.
targets: list(int), default=[],
List of target ids. They must start from 0.
max_iterations: int, default=-1
Maximum column generation iterations. Negative values removes the
iteration limit and the problem is solved till optimality.
time_limit: int, default=-1,
Time limit in seconds for training. Negative values removes the
time limit and the problem is solved till optimality.
num_master_cuts_round: int, default=3,
Number of times the master problem adds cuts in an iteration.
master_beta_constraints_as_cuts: bool, default=False,
If True, adds existing beta constraints (constraints for data rows) as
cutting planes in the master problem.
master_generate_cuts: bool, default=False,
If True, master problem generates new beta cuts using SAT solver.
data_rows: list(Row), default=None,
Preprocessed data rows. The preprocessed rows help with faster running
times.
use_old_sp: bool, default=False,
If True, uses the old subproblem model published in Firat et. al. 2020.
master_solver_type: str, default='glop',
Solver for RMP from OR-Tools. Use 'glop' for tests. See OR-Tools
documentation for other possible values.
rmp_solver_params: string, default = "",
Solver parameters for solving restricted master problem (rmp).
master_ip_solver_params: string, default = "",
Solver parameters for solving the integer master problem.
subproblem_params: list of strings, default = [""],
Parameters for solving the subproblem.
"""
[docs]
def __init__(self, initial_paths=[], leaves=[], nodes=[], splits=[],
tree_depth=1,
targets=[], max_iterations=-1,
time_limit=-1,
num_master_cuts_round=3,
master_beta_constraints_as_cuts=False,
master_generate_cuts=False,
data_rows=None,
use_old_sp=False,
master_solver_type='glop',
rmp_solver_params="",
master_ip_solver_params="", subproblem_params=""):
self.initial_paths = initial_paths
self.leaves = leaves
self.nodes = nodes
self.splits = splits
self.tree_depth = tree_depth
self.targets = targets
self.subproblem_params = subproblem_params
self.num_master_cuts_round = num_master_cuts_round
self.master_beta_constraints_as_cuts = master_beta_constraints_as_cuts
self.master_generate_cuts = master_generate_cuts
self.data_rows = data_rows
split_ids = []
for split in splits:
split_ids.append(split.id)
# Each node must be in sequence.
node_ids = []
node_id = 0
for node in self.nodes:
assert node.id == node_id
node_ids.append(node.id)
node_id += 1
# Each leaf must be in sequence.
leaf_ids = []
leaf_id = 0
for leaf in leaves:
assert leaf.id == leaf_id
leaf_ids.append(leaf.id)
leaf_id += 1
# Each path can only contain nodes and leaves provided.
for path in self.initial_paths:
path.initial = True
assert path.leaf_id in leaf_ids
for node_id in path.node_ids:
assert node_id in node_ids
for split_id in path.splits:
assert split_id in split_ids, "split id " + str(split_id)
assert path.target in targets
assert len(path.node_ids) == self.tree_depth
assert len(path.splits) == self.tree_depth
self.master_problem = DTreeMasterProblem(
self.initial_paths, self.leaves, self.nodes, self.splits,
num_cuts_round=num_master_cuts_round,
beta_constraints_as_cuts=master_beta_constraints_as_cuts,
generate_cuts=master_generate_cuts,
solver_type=master_solver_type,
data_rows=data_rows)
self.subproblems = []
all_subproblem_params = []
heuristic = DTreeSubProblemHeuristic(
self.leaves, self.nodes, self.splits, self.targets,
self.tree_depth, data_rows=data_rows)
self.subproblems.append([heuristic])
all_subproblem_params.append([""])
self.subproblems.append([])
all_subproblem_params.append([])
for leaf in self.leaves:
if use_old_sp:
for target in self.targets:
subproblem = DTreeSubProblemOld(
leaf, self.nodes, self.splits, target,
self.tree_depth, solver_str='sat',
data_rows=data_rows)
self.subproblems[1].append(subproblem)
all_subproblem_params[1].append(subproblem_params)
else:
# subproblem = DTreeSubProblem(
# leaf, self.nodes, self.splits, self.targets,
# self.tree_depth, solver_str='sat',
# data_rows=data_rows)
subproblem = DTreeSubProblemSat(
leaf, self.nodes, self.splits, self.targets,
self.tree_depth,
data_rows=data_rows)
self.subproblems[1].append(subproblem)
all_subproblem_params[1].append(subproblem_params)
rmp_is_ip = True
super().__init__(max_iterations, time_limit,
self.master_problem, self.subproblems,
rmp_is_ip, rmp_solver_params, master_ip_solver_params,
all_subproblem_params)
def _more_tags(self):
return {'X_types': ['categorical'],
'non_deterministic': True,
'binary_only': True}
[docs]
def predict(self, X):
"""Predicts the class based on the solution of master problem.
Parameters
----------
X : array-like, shape (n_samples, n_features)
The input samples. The inputs should only contain numeric values.
Returns
-------
y : ndarray, shape (n_samples,)
The label for each sample.
"""
X = check_array(X, accept_sparse=True)
check_is_fitted(self, 'is_fitted_')
selected_paths = self.master_problem.selected_paths
# Check for each row, which path it satisfies. There should be exactly
# one.
num_rows = X.shape[0]
y_predict = np.zeros(X.shape[0], dtype=int)
for row in range(num_rows):
for path in selected_paths:
if row_satisfies_path(X[row], self.leaves[path.leaf_id],
self.splits, path):
y_predict[row] = path.target
break
return self.label_encoder_.inverse_transform(y_predict)