
I am curious what would happen if a neural network's hyperparameters were set by the network itself, or by creating a second neural network that encapsulates and influences the hyperparameters of the network it encapsulates.

The goal for me here is to experiment and get some more in-depth knowledge about neural networks. But I had a hard time finding information that would let me execute such an experiment, which points in the direction that either it has never been done before or the idea is just really dumb.

Now what I would like to know is: do any of you know where I could find information (e.g. books, web articles, papers, et cetera) to conduct such an experiment?

user3473161

3 Answers


I am curious about what would happen to hyperparameters when they would be set by a neural network itself

In general this is not possible, as many hyperparameters are discrete and therefore not differentiable with respect to any objective. For example, this applies to layer sizes, the number of layers, and the choice of transfer functions. This prevents using any form of gradient descent to tune them directly as learnable parameters.
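To make this concrete, here is a minimal sketch (assuming PyTorch, though the point is framework-agnostic): the weights receive gradients, while the integer number of layers only shapes the computation graph, so no gradient can ever reach it:

    import torch
    import torch.nn as nn

    num_layers = 3      # hyperparameter: a plain Python int, not a tensor
    hidden_size = 16    # likewise discrete

    layers, in_size = [], 4
    for _ in range(num_layers):
        layers += [nn.Linear(in_size, hidden_size), nn.ReLU()]
        in_size = hidden_size
    layers.append(nn.Linear(in_size, 1))
    model = nn.Sequential(*layers)

    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()

    # Every weight now has a gradient...
    print(all(p.grad is not None for p in model.parameters()))  # True
    # ...but there is no such thing as d(loss)/d(num_layers): the
    # architecture choice never entered the computation graph at all.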

In fact, the separation between parameters and hyperparameters is exactly this: hyperparameters are the values that are not learnable by the model itself. This applies to other ML models too, not just neural networks.

or by creating a neural network that encapsulates and influences the hyperparameters of the network it encapsulates.

This is more feasible. You could use one neural network to try to predict the results of another, then preferentially run tests on target networks that look like they will do well (a rough sketch of the idea follows the list below). However, using a "meta" neural network like this has some major drawbacks:

  • Neural networks require a lot of training data. Getting enough samples to make good predictions would require that you train your primary neural network (a time-consuming process) many times.

  • Neural networks are bad at extrapolating outside of the areas they have already experienced, so they are not so great at making creative predictions of new hyperparameters to try.

  • Neural networks have a lot of hyper-parameters to tune. Would you need a "meta meta" neural network to predict the performance of your "meta" network?
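For concreteness, here is a minimal sketch of the idea, assuming a hypothetical train_and_evaluate(config) function that trains your primary network and returns a validation score; scikit-learn's MLPRegressor stands in as the "meta" network:

    import random
    from sklearn.neural_network import MLPRegressor

    def sample_config():
        # Encode a config as a numeric vector: [learning rate, layers, hidden size]
        return [10 ** random.uniform(-4, -1),
                random.randint(1, 5),
                random.choice([32, 64, 128])]

    history_x, history_y = [], []
    for _ in range(10):   # each iteration is one full (expensive!) training run
        cfg = sample_config()
        history_x.append(cfg)
        history_y.append(train_and_evaluate(cfg))  # hypothetical training hook

    # 10 samples is far too few for a neural network -- that is the first
    # drawback above in action.
    meta = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000)
    meta.fit(history_x, history_y)

    # Rank untried configs, then actually train only the most promising one.
    candidates = [sample_config() for _ in range(100)]
    best = max(candidates, key=lambda c: meta.predict([c])[0])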

either it has never been done before or the idea is just really dumb

This is a real issue that comes up repeatedly. In general, the search for the best hyperparameters is a chore. It is an active area of research and experimentation to find efficient ways of automating it, or of avoiding it by making some hyperparameters less important or unnecessary.

The reason you are not finding neural networks that tune neural networks is due to the issues listed above. So the main areas of research focus on different approaches that can work with limited data and don't have so many hyperparameters themselves, or on models that are robust to large ranges of hyperparameters, so that precise tuning is not a big deal.

Here are a few pointers to help with automated searches:

  • You could use a variety of hyperparameter optimisation schemes, including random search, grid search, genetic algorithms, simple gradient methods etc.

  • Random searches, perhaps constrained by previous experience or second-hand knowledge from similar problems, can be reasonably effective (see the sketch after this list).

  • The quality of any search is limited by the quality and amount of cross-validation data. There is not much point tuning the cross-validation loss to the point where you care about changes that are much smaller than the standard error of its estimate.

  • Response to hyperparameters is typically non-linear over the search space, which makes things harder.
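As a rough illustration of random search, here is a minimal sketch assuming a hypothetical cross_val_scores(config) hook that returns one score per cross-validation fold; tracking the standard error alongside the mean guards against the noise issue in the bullet above:

    import math
    import random

    space = {
        "learning_rate": lambda: 10 ** random.uniform(-4, -1),  # log-uniform
        "num_layers":    lambda: random.randint(1, 5),
        "dropout":       lambda: random.uniform(0.0, 0.5),
    }

    best_cfg, best_mean = None, -math.inf
    for _ in range(50):
        cfg = {name: draw() for name, draw in space.items()}
        scores = cross_val_scores(cfg)          # hypothetical: one score per fold
        mean = sum(scores) / len(scores)
        var = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
        stderr = math.sqrt(var / len(scores))
        if mean > best_mean:
            best_cfg, best_mean = cfg, mean
        # Improvements much smaller than `stderr` are probably just noise.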

Outside of automation, expert analysis is often a good starting point, especially if you want to assess the success of regularisation. Typically you can look at learning curves for the training and cross-validation data; based on those, you can make a reasonable guess as to whether to increase or reduce regularisation hyperparameters and/or the learning rate, even from a single training run.

There have likely been attempts to automate some parts of reading learning curves, since sometimes it is relatively easy to detect over-fitting and under-fitting scenarios. However, I could not find any examples when searching just now.
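To give a flavour of what such automation might look like, here is a crude single-run heuristic, assuming you have per-epoch training and validation losses; the thresholds are arbitrary illustrations, not recommendations:

    def diagnose(train_losses, val_losses, gap_tol=0.1, plateau_tol=1e-3):
        """Crude single-run diagnosis; thresholds are illustrative only."""
        gap = val_losses[-1] - train_losses[-1]
        # Has the training loss stopped improving over the last 10 epochs?
        plateaued = (len(train_losses) > 10
                     and train_losses[-10] - train_losses[-1] < plateau_tol)
        if gap > gap_tol * max(abs(val_losses[-1]), 1e-12):
            return "over-fitting: consider increasing regularisation"
        if plateaued:
            return "under-fitting (or converged): consider more capacity or less regularisation"
        return "no obvious pathology from this single run"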

Neil Slater

In addition to the other answer: even if a hyperparameter's influence is differentiable and so could in principle be tuned in the gradient descent training loop, it sometimes doesn't make sense. For example, if you included the L2 regularization factor as a learnable parameter, gradient descent would simply drive it to zero, removing the regularization.
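To see why: with loss = data_loss + lambda * ||w||^2, the gradient d(loss)/d(lambda) = ||w||^2 is never negative, so every gradient descent step on the training loss shrinks lambda. A minimal numeric sketch (assuming PyTorch):

    import torch

    w = torch.randn(10, requires_grad=True)
    lam = torch.tensor(0.1, requires_grad=True)

    data_loss = (w.sum() - 1.0) ** 2   # stand-in for a real training loss
    loss = data_loss + lam * w.pow(2).sum()
    loss.backward()

    print(lam.grad)   # equals ||w||^2, always >= 0
    # A gradient step lam -= lr * lam.grad therefore always shrinks lam,
    # i.e. gradient descent on the training loss removes the regularization.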

Many hyperparameters influence the training dynamics and generalization (for example the bias/variance tradeoff). Remember that the end goal is to get a model that works well on new test data. Some hyperparameters intentionally increase the training loss in exchange for an improvement on test data (i.e., regularization). So it wouldn't make sense to tune them to the training set.

isarandi

The other answers are good; I'll just add my two cents.

Besides the specific reasons related to ANNs already mentioned, the main reason that hyperparameters (e.g. the number of layers) cannot be "learned" by ANNs the way other parameters (e.g. the weights) can is more general: it is simply how fixed formal algorithmic systems (e.g. a neural network) have to work.

Necessarily, some things about the architecture, or about whatever makes an algorithm specific within some fixed system, cannot be addressed by that same formal system; otherwise the result would either be a completely random process, or an infinite regress when trying to formalize the final algorithm.

Therefore, a "meta-system" is needed (e.g. a hyperparameter-tuning system for the "system" under study) which implements a specific algorithm in order to select a specific "system" architecture and hyperparameters from among the potential ones. Note that the "meta-system" cannot tune its own hyperparameters; a "meta-meta-system" could do that.

Nikos M.