A Globally Convergent Gradient-based Bilevel Hyperparameter Optimization Method

Ankur Sinha
Production and Quantitative Methods
Indian Institute of Ahmedabad
Gujarat, India, 380015
asinha@iima.ac.in
Satender Gunwal
Centre for Data Science and AI
Indian Institute of Management Ahmedabad
Gujarat, India 380015
satender@iima.ac.in
&Shivam Kumar
Centre for Data Science and AI
Indian Institute of Management Ahmedabad
Gujarat, India 380015
shivamk@iima.ac.in
Abstract

Hyperparameter optimization in machine learning is often achieved using naive techniques, such as random search and grid search that only lead to an approximate set of hyperparameters. Although techniques such as Bayesian optimization perform an intelligent search on a given domain of hyperparameters, it does not guarantee an optimal solution. A major drawback of most of these approaches is that as the number of hyperparameters increases, the search domain increases exponentially, increasing the computational cost and making the approaches slow. The hyperparameter optimization problem is inherently a bilevel optimization task, and there are some studies that have attempted bilevel solution methodologies for solving this problem. However, these studies too suffer from a drawback, where they assume a unique set of model weights that minimize the training loss, which is generally violated by deep learning architectures. This paper discusses a gradient-based bilevel method for solving the hyperparameter optimization problem where these drawbacks have been addressed. The proposed method can handle continuous hyperparameters for which we have chosen the regularization hyperparameter in our experiments. The method guarantees convergence to the set of optimal hyperparameters that has been theoretically proven in this study. The idea is based on approximating the lower-level optimal value function mapping using Gaussian process regression. As a result, the bilevel problem is reduced to a single level constrained optimization task that is solved using the augmented Lagrangian method. We have performed an extensive computational study on the MNIST and CIFAR-10 datasets on multi-layer perceptron and LeNet architectures that confirms the efficiency of the proposed method. A comparative study against grid search, random search, Bayesian optimization and HyberBand method on multiple hyperparameter problems shows that the proposed algorithm converges with lower computation and leads to models that generalize better on the testing set.

1 Introduction

Training a machine learning model that performs better on unseen data requires hyperparameter tuning, which is a computationally expensive task. It is often done using non-exact methods where a search space is given to the model for identifying a good set of hyperparameters. Methods like grid search, used in practice, set a grid of hyperparameter values beforehand and evaluate models corresponding to each hyperparameter on a validation set. Random search (random_search:James) is an alternative to grid search where randomly generated hyperparameters are used to generate models using the training set which are then evaluated on the validation set. Bayesian optimization (PredModel:Kuhn; BayesianHO:Snoek) is a more intelligent approach that carries out the hyperparameter optimization by assuming a prior. Bayesian optimization also samples hyperparameters in the search space, evaluates them and updates the prior to form the posterior distribution. There is a significant body of literature on hyperparameter optimization in machine learning, see (ML_Book:HO; HOReview:TongYu; HO:Yang; optML:Claudio). The problem of optimizing the hyperparameters is inherently a bilevel optimization problem, as one can clearly observe that all the above methods create models corresponding to different hyperparameters which entails solving multiple optimization problems for model training. The various models trained are then evaluated based on validation data and the best performing model is chosen. Model training using the training data for a given hyperparameter is the lower level optimization task and the model selection based on validation performance is the upper level optimization task. The set of optimal hyperparameters prevents overfitting on the training data and results in a more generalizable model (HO:Yang; GenBilevel:FBao). For a review on bilevel optimization, refer to (dempe2002foundations; bilevel_review:asinha). Other state-of-the-art methods that follow the similar idea of probabilistic modelling as in Bayesian optimization include Sequential Model-based Global Optimization (hutter2011sequential), Gaussian Process Approach (BayesianHO:Snoek; GPforML), and Tree-structured Parzen Estimator Approach (bergstra2011algorithms). For most of these approaches the computational time increases exponentially and the performance deteriorates significantly as the number of hyperparameters increase (GB_HO:Maclaurin).

It is well-known that gradient-based methods can optimize a significantly large number of hyperparameters (domke2012generic; HO_Gradient:Fabian; GB_HO:Maclaurin; Fu_et_al:2016). Some aspects of our study fall into this category of hyperparameter optimization as we develop a gradient based approach that simultaneously optimizes the hyperparameters and model weights in a bilevel framework. Other studies in this area include bengio2000gradient and HO_Gradient:Fabian, which uses implicit function theory for the derivation of hyperparameter gradients. Another category of similar papers uses an iterative approach to approximate the reaction set mapping after solving a few lower level optimization problems (franceschi2018bilevel; GB_HO:Maclaurin). Most of the bilevel methods that are based on approximation of the reaction set mapping, i.e. the optimal set of model weights for a given hyperparameter, assume existence of unique optimum at the lower level that is often violated by deep learning structures. Recently, lorraine2018stochastic and mackay2019self proposed similar methods that locally approximate this mapping by estimating the hessian or its inverses, making their methods computationally intractable for a large number of model parameters.

Our proposed method relies on the approximation of the optimal value function mapping that does not require the assumption of unique optimum at the lower level. The optimal value function is approximated using a stochastic process model to convert the bilevel program into a constrained single level problem. Further, the constrained problem is reduced to an unconstrained optimization problem using the augmented Lagrangian method, which we solve using a gradient-based approach. Gradient_Based_Past recently proposed a gradient-based bilevel approach for hyperparameter tuning where the optimal value function is approximated in one-shot using Kriging approximation. In the present work, the optimal value mapping is improved using newly obtained upper level variable after each augmented Lagrangian step. We also consider the standard error arising from the stochastic process model to ensure a tighter confidence interval of the approximation. Assuming that we have a global optimizer for unconstrained optimization problems available, our method guarantees convergence to the global optimal solution (i.e. the optimal hyperparameters and the corresponding model weights) to the bilevel problem. We have provided the termination conditions and convergence results for our algorithm in this paper. An extensive comparative study against the common hyperparameter optimization approaches, like grid search, random search, Bayesian optimization (hutter2011sequential) and HyperBand (hyperband) method on MNIST (lecun2010mnist) and CIFAR-10 (Krizhevsky09learningmultiple) datasets reveals that the proposed approach is able to search models with high generalization at significantly lower computations. The MNIST dataset has been used with the multi-layer perceptron architecture (MLP) and the CIFAR-10 dataset has been used with the convolutional neural network (CNN) architecture. In this study we have presented results for problems involving up to 4 regularization parameters. The proposed method can easily handle higher number of regularization hyperparameters without a deterioration in performance or a significantly higher requirement of computational resources. However, we have restricted ourselves to 4 hyperparameters as it would not be easy to draw comparisons against other methods as their computational requirements increase substantially with increase in hyperparameters.

The paper is organized as follows. Section 2 provides a bilevel optimization framework for hyperparameter optimization in the context of machine learning. Further, it discusses some single level reduction approaches that can be used to solve such problems and their drawbacks. Section 3 discusses about the proposed bilevel approach and its termination and convergence proofs in detail. The results of our studies on MNIST and CIFAR-10 datasets are discussed in Section 4. Section 5 provides the conclusions of our study.

Category Symbols Used Description
Datasets , Training set with examples.
, Validation set with examples.
Variables Model hyperparameters (upper level variables)
Model parameters (lower level variables)
Loss Function Loss-function on set with model parameters , e.g. Least squares or Cross-entropy loss function.
Regularization (L2 regularization).
Objective Functions Upper level objective
Lower level objective
Reaction Set Mapping A set-valued mapping, which represents optimal lower level solution(s) for any upper level variable .
Optimal Value Function Optimal objective value of the lower level problem for any given upper level variable .
Table 1: Notations used in the paper

2 Bilevel Optimization Framework

In bilevel optimization problems, there are two levels of optimization, generally referred to as upper (or outer) level optimization and lower (or inner) level optimization. The lower level problem is a nested optimization task that acts as a constraint to the upper level problem. Note that the presence of multiple optimal solutions at the lower level may give rise to two different kinds of formulations; namely, optimistic or pessimistic in the bilevel context. In this paper, we are considering the optimistic formulation, i.e. in case there are two or more set of optimal model parameters for a given hyperparameter that lead to a similar training loss, then the one that is better in terms of validation loss is preferred. Below we provide the optimistic formulation for the bilevel hyperparameter optimization problem. Refer to Table 1 for the notations used throughout the paper.

(1)

Here, and are the upper and lower level objective functions, respectively, and and are the corresponding variables.

A Bilevel formulation in hyperparameter optimization minimizes the training set loss at the lower level and validation set loss at the upper level. The upper level validation loss will be minimized only for the set of model parameters that minimize the training set loss. The upper level objective is , where is the validation set and is the loss function. We use cross-entropy loss for our experiments, along with regularization, also known as weight decay in the context of neural networks. The lower level objective is as follows:

(2)

where loss is defined on the training set, , and is the regularization term. The hyperparameter controls the penalty on the training loss function, which improves the learning of the model for unseen data by preventing overfitting on the training examples. Next, we discuss some common methods from the bilevel optimization literature that are employed to solve (1) using single level reduction, and the drawbacks of these methods in the context of hyperparameter optimization in machine learning.

2.1 Single Level Reduction using Reaction Set Mapping

The reaction set mapping in bilevel optimization provides the optimal with respect to the lower level problem for the corresponding . If there are multiple global optimal s for a given , the mapping is expected to return all these solutions. The mapping is given by,

(3)

Single level reduction of the bilevel program depends on the approximation of (3). However, this mapping is not readily available, so we need to solve the lower-level problem for various values of the upper level variable vector () to obtain (3). As there can be more than one optimal solutions at the lower level, the mapping is ideally a point-to-set (set-valued) mapping. This is generally true in case of machine learning problems because of the non-convexity in the problem architecture. The single level reduction of the bilevel problem (1) using (3) is given as:

(4)

Studies by lorraine2018stochastic and mackay2019self utilize this reaction set mapping in the context of hyperparameter optimization. Both the studies assume (3) to be a single-valued mapping for their approximations, as it is mathematically and computationally hard to approximate the actual set-valued mapping .

2.2 Single Level Reduction using First-order Conditions

Another method to reduce a bilevel program into a single level problem is using the first-order conditions of the lower level problem. The reduced single level problem in such cases is the following constrained optimization problem:

(5)

As machine learning problems are highly non-linear with multiple local optimums and saddle points, using a first-order reduction usually will not lead to the actual solution of (1). Another problem with the above method is that solving the problem requires computation of the second-order derivatives. It is not feasible to compute second-order derivatives for large problems involving millions of model parameters that are fairly common in practice. The study by mehra2019penalty relies on first-order conditions of the lower level for single level reduction of the original bilevel problem.

3 Proposed Algorithm

We propose an alternative method in this section that uses the optimal value function mapping for the single level reduction of the original bilevel problem. The optimal value function mapping for problem (1) is given as:

(6)

The above function gives the optimal function value of the lower level problem corresponding to different values of . Using (6), the original problem (1) can be expressed as follows:

(7)

Note that unlike , is always a single-valued mapping, making it relatively simple to approximate. The proposed method iteratively approximates this optimal value function using Gaussian Process Regression that we discuss next.

Figure 1: -mapping approximation using GPR for MNIST-10K (2HP). Here with sample size, . Figure 2: -mapping approximation using GPR for MNIST-10K (2HP). Here with sample size, .

3.1 Gaussian Process Regression

Gaussian process regression (GPR) is used for interpolation of functions where the interpolation is governed by Gaussian processes (GP). We approximate the mapping using GPR with a sample of hyperparameters in the feasible region. The lower level problem (2) is solved for each using stochastic gradient descent and the value of is recorded. The resulting set is used for learning the model. The model structure that is used in a stochastic process is given as follows:

(8)

where is the prior’s expectation, and the random error follows a standard normal distribution with variance . We consider the radial-basis function (RBF) as the correlation function for the prior because of the kernel’s stationarity. The RBF is given as follows:

(9)

where is as follows:

(10)

Here, defines the length-scale of the respective upper level variable’s dimension. The kernel defined in (9) is a particular case of the class of correlation functions used in GP approximation. Note that the error terms have a negative correlation that decreases exponentially on increasing the distance between respective points. The optimal value of the parameters can be obtained by maximizing the log-likelihood of the sample. Figures 1 and 2 show how the GPR models change on adding more data points to initial sample . For a detailed study on the stochastic process model, refer to jones1998efficient and sacks1989design.

Let and be the approximate mapping and the standard error for , respectively, obtained using GPR with the initial sample , then a relaxation to the original problem (7) can be written as follows:

(11)

Choosing a large enough value will ensure a lower bound to the problem (7), which is discussed in the convergence results provided in Section 3.3. When the GP model is accurate, converges to and the optimal value function constraint in the above problem can be treated as an equality. We will reduce the problem (11) into an unconstrained problem by using the augmented Lagrangian method, which is discussed in the next section.

3.2 Augmented Lagrangian Formulation

The Lagrangian method for solving optimization problems with equality constraints is widely used in the field of optimization. An extension of this method is known as the augmented Lagrangian method, which adds an extra quadratic penalty on the Lagrangian of the original problem. Without loss of generality, we will assume the constraint in problem (11) holds with equality at the bilevel optimum. The unconstrained penalized formulation of (11) using the augmented Lagrangian method is given as follows:

(12)

The problem (12) is solved by optimizing the penalized loss function iteratively, and updating the multipliers after each step. The update rule for and are discussed in the pseudocode of our approach in Algorithm 1. The unconstrained problem (12) can be solved using gradient descent-based methods. An optimal pair, , of upper and lower level variables is obtained after every iteration of the augmented Lagrangian method. Using this, we add a new point into the initial sample and re-approximate and using the updated sample. We refer to our approach as Gradient-based Bilevel Hyperparameter Optimization (GBHO) in the rest of the paper.

Result:
Initialize: Training set , validation set and hyperparameter samples ;
for  do
        ;
       
end for
;
Let , such that , and ;
for  do
        with as starting point;
        ;
        ;
        ;
        ;
        ;
       
end for
Algorithm 1 Pseudocode for GBHO algorithm. Note that we use , , and in our experiments. GPR is an acronym for Gaussian process regression.

3.3 Termination and Convergence Results

For the convergence proof we assume that we are working with a global optimizer that guarantees the global optimal solution whenever a single level unconstrained optimization problem is given. Under this assumption, our proposed method will lead to the optimal set of the regularization hyperparameters that we show using the following theorems.

Theorem 1.

For the hyperparameter optimization problem (7), the following approximate optimization problem provides a lower bound with a probability .

Proof.

Note that the Gaussian process optimization provides an approximation to with an expected value of and standard deviation of for any given . Under the normality assumption, with a probability , where represents the -value corresponding to a single-tailed test for a given . For , the value of is . Therefore the problem

represents a relaxation to problem (7). A relaxation to a given optimization problem always leads to the lower bound of the problem. ∎

Theorem 2.

Solving the problem (11) iteratively using Algorithm 1 leads to the optimal solution to problem (7) at termination when is sufficiently large.

Proof.

The algorithm terminates when the following two conditions are met:

  1. : , which implies that for a sufficiently large value of (say ). This ensures that is accurately approximated at .

  2. : , which implies that .

From Theorem 1, represents the lower bound to problem 7 and since the constraint is satisfied with an error , represents a near optimal solution. ∎

The above theorem can be extended to also show that a small change in over iterations of Algorithm 1 when C2 is satisfied also implies that the in the current iteration is near optimal. We avoid this theorem for brevity. This ensures that the algorithm will never stagnate at a wrong solution. In case it does, the solution is guaranteed to be the optimal solution.

4 Experiments

This section includes our experiments on MNIST (lecun2010mnist) and CIFAR-10 (Krizhevsky09learningmultiple) datasets. MNIST dataset consists of grayscale images of handwritten digits from -. For this multi-classification problem we solve multiple test instances using MLP architecture with a single hidden layer for one and two regularization hyperparameters. CIFAR-10 dataset consists of colour images with 10 classes. For this classification problem, we employ LeNet-5 CNN architecture (lenet) which consists of 3 convolutional layers, 2 subsampling layers and 2 fully connected layers. Here also, we optimize multiple regularization hyperparameters corresponding to different layers. For all the experiments, the discrete hyperparameters, such as learning rate and momentum, are tuned over similar grid values for fair comparison. We ran our entire experiments on Google Colaboratory Pro Plus cloud platform.

4.1 MNIST Dataset

MNIST-1HP MNIST-2HP
Methods 1000 5000 10000 1000 5000 10000
Grid Search TRL 0.0216 0.0321 0.0302 0.0332 0.0685 0.0300
VAL 0.4395 0.3045 0.1524 0.4327 0.2757 0.1462
TEL 0.4125 0.2714 0.1942 0.4193 0.2556 0.1909
-6.06 -6.36 -6.77 -0.54, -4.78 -0.68, -1.03 -0.68, -6.89
LLO 100 100 100 900 900 900
Random Search TRL 0.0258 0.0534 0.0268 0.0204 0.0437 0.0328
VAL 0.4413 0.3145 0.1532 0.4403 0.2687 0.1438
TEL 0.4177 0.2740 0.1963 0.4227 0.2419 0.1830
-5.85 -5.88 -6.90 0.0, -3.44 -0.30, -2.11 -0.23, -5.08
LLO 100 100 100 900 900 900
Bayesian Search TRL 0.0379 0.0243 0.0059 0.0420 0.0218 0.0088
VAL 0.4396 0.2680 0.1318 0.4288 0.2558 0.1331
TEL 0.4278 0.2434 0.1824 0.4057 0.2360 0.1788
-10.00 -5.88 -7.92 -9.24,-3.55 -6.57, -6.36 -6.92, -10.00
LLO 60 60 60 100 100 100
HyperBand TRL 0.0277 0.0164 0.0033 0.0356 0.0100 0.0255
VAL 0.4335 0.2637 0.1391 0.4236 0.2640 0.1336
TEL 0.4224 0.2436 0.1916 0.4063 0.2444 0.1717
-5.43 -6.48 -8.61 -5.3, -6.05 -7.75, -7.14 -6.68, -6.89
LLO 254 254 254 254 254 254
GBHO TRL 0.0210 0.0483 0.0467 0.0298 0.0332 0.0114
VAL 0.0279 0.0133 0.0281 0.0457 0.0745 0.0980
TEL 0.3734 0.2383 0.1765 0.3652 0.2035 0.1711
-6.61 -2.75 -3.95 -0.11, -0.52 -2.59, -1.35 -10.26, 0.316
LLO 10 (+5) 10 (+5) 10 (+5) 50 (+5) 50 (+5) 50 (+5)
Table 2: Performance comparison of GBHO with Random search, Grid search, Bayesian Optimization and HyperBand in case of MNIST dataset. 1HP denotes single hyperparameter test instances and 2HP denotes two hyperparameter test instances. TRL, VAL and TEL denote the training loss, validation loss and testing loss, respectively. LLO denotes the number of lower level optimization problems solved by the corresponding method. In case of GBHO, (+5) indicates the number of unconstrained optimization calls needed to solve the problem (12).

For MNIST dataset, we use the MLP architecture with one hidden layer consisting of 100 nodes. We perform our study on three instances consisting of 1000, 5000 and 10000 examples which we sample randomly from the original MNIST dataset. We split the instances into 60% training and 40% validation examples for the experiments. The test set used for the comparison consists of 10000 data points sampled separately for each of the cases. We solve two hyperparameter optimization problems for each set. For the first experiment, we regularize weights of both the hidden and output layer using a single hyperparameter. For the second experiment, we use two separate regularization hyperparameters to regularize each of the layers separately. The results of the experiments are reported in Table 2. For the single hyperparameter case, we sample 10 different points from the interval in order to approximate the -mapping initially. We take as the regularization hyperparameter such that . Similarly, for the two hyperparameters we sample different points to approximate the -mapping initially.

Figure 3: Comparison of test set loss for random search, grid search, Bayesian optimization, Hyperband algorithm and GBHO for MNIST-1HP cases. Figure 4: Comparison of test set loss for random search, grid search, Bayesian optimization, Hyperband algorithm and GBHO for MNIST-2HP cases.

Figures 3 and 4 compare the test set loss for the single and the double hyperparameter cases. We provide comparison of our method with random search, grid search, Bayesian optimization (omalley2019kerastuner) and Hyperband algorithm (hyperband). We can clearly see from the figures that our method outperforms all the other techniques in terms of the test loss for all the cases. Table 2 contains a detailed comparison of our method with other techniques where we provide the training, validation and testing losses for the best model obtained from each method. The computational expense of each method is measured by counting the number of lower level optimization tasks executed by each method. Our method is computationally much faster as compared to other methods as it requires significantly lower number of lower level optimization calls.

4.2 CIFAR-10 Dataset

For CIFAR-10 dataset, we use LeNet-5 CNN architecture for our experiments, and once again compare with random search, grid search, Bayesian optimization and HyperBand method. We perform our study on a small batch of 1000 data points with training set and validation test to allow for overfitting. The results are compared on a testing dataset of points which are sampled separately. We use two and four regularization hyperparameters for our test instances. In the case of two hyperparameters, we regularize the two convolutional layers and the three dense layers with each hyperparameter, while in the case of four hyperparameters, we regularize the two convolutional layers with one hyperparameter and each of the three dense layers with three separate hyperparameters. The results obtained are similar as in the context of MNIST dataset, where we observe that the proposed method is able to find the best testing loss with the least computational requirements. The test loss results have been reported through Figure 5. The lower level optimization calls for grid search, random search, Bayesian search, HyperBand method and GBHO are 100, 100, 100, 254, and 25 (+5), respectively, for the 2HP case. For the 4HP case, the lower level optimization calls for grid search, random search, Bayesian search, HyperBand method and GBHO are 625, 625, 100, 254, and 81 (+5), respectively. Interestingly, the results on test data deteriorate for the 4HP case as compared to the 2HP case for Bayesian optimization, HyperBand as well as GBHO, the reason for which could be overfitting on the validation set with increasing hyperparameters. In the case of MNIST dataset while moving from 1HP to 2HP we observed some benefits.

Figure 5: Test set loss for CIFAR-10.

5 Conclusions

In this paper, we have proposed a Gradient-based Bilevel Hyperparameter Optimization (GBHO) method for optimizing continuous hyperparameters in machine learning models. We have performed our study in the context of regularization hyperparameters and have shown that the proposed method is able to optimize multiple hyperparameters with low computational requirements. The method formulates the hyperparameter optimization problem as a bilevel optimization task and then solves it by using the augmented Lagrangian method after reducing it to single level using the optimal value function approach. An extensive computational study with MNIST dataset on MLP architecture and CIFAR-10 dataset on CNN architecture shows that the method is able to optimize multiple hyperparameters with low computational requirements as compared to other techniques, such as, grid search, random search, Bayesian optimization and HyperBand method. The test loss of the models generated using the proposed algorithm also turns out to be better than the competing approaches.

References