
Gradient descent is an optimization algorithm for minimizing a function: it finds the values of the function's coefficients at which the function is smallest. In machine learning and deep learning, everything depends on the weights of the neurons, which are chosen to minimize the cost function; the lower the cost function, the better the model fits the dataset. Suppose we have a neural network with a learning rate η. The weight update rule used by a gradient descent optimizer is then w ← w − η ∂J/∂w, where J is the cost function for the neural network and w is a weight.

There are many variations of the gradient descent algorithm. In general, a single epoch/backpropagation/iteration can consider a single training observation or several observations at a time. In batch gradient descent, we consider the losses of the complete training set, which together form the cost function for the neural network, at a single iteration/backpropagation/epoch. This variant has some advantages and disadvantages.

Advantages of batch gradient descent –
1. It is computationally efficient because the whole training set is processed in one go, so only a few machine cycles are required.
2. There are fewer oscillations, and convergence to the global minimum is easy when the cost function is convex.

Disadvantages of batch gradient descent –
1. It is less prone to wandering, but if it does reach a local minimum, its noise-free steps give it no way to escape.
2. Although it is computationally efficient, it is not fast, and it needs enough memory to load the complete dataset at once.
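
As a concrete illustration of the update rule and the batch variant above, here is a minimal NumPy sketch that fits a linear least-squares model; the quadratic cost, the function name, and the toy data are illustrative assumptions, not something prescribed by the text.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, epochs=100):
    """Minimal batch gradient descent for linear least squares:
    every update uses the gradient of the cost over ALL samples."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        # Cost J(w) = (1/2n) * ||Xw - y||^2, so dJ/dw = X^T (Xw - y) / n.
        grad = X.T @ (X @ w - y) / n_samples
        w -= lr * grad  # w <- w - eta * dJ/dw
    return w

# Toy usage: recover the weights of a noiseless linear model.
X = np.random.randn(200, 3)
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w
print(batch_gradient_descent(X, y, lr=0.1, epochs=500))  # approximately [1.5, -2.0, 0.5]
```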

Stochastic gradient descent is the other common variation. It works the same way as batch gradient descent, except that it considers a single training observation per epoch/backpropagation/iteration.

Advantages of stochastic gradient descent –
1. Because each update looks at only one observation, it avoids the speed problem above; hence it is often the better choice.
2. It is also memory efficient, because it holds one observation at a time rather than the complete dataset.

The disadvantage of stochastic gradient descent –
1. It does not converge along a straight path, because single-observation gradients are noisy.
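
For contrast with the batch sketch, the following version performs the same least-squares fit with one parameter update per training observation; again, the function name and the per-sample loss are our own illustrative choices.

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=20, seed=0):
    """Minimal stochastic gradient descent: one update per observation,
    visiting the training set in a freshly shuffled order each epoch."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        for i in rng.permutation(n_samples):
            # Gradient of the single-sample loss (x_i . w - y_i)^2 / 2:
            # noisy but cheap, and only this one observation is needed.
            grad = (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w
```

Run on the toy data from the batch example, sgd(X, y) ends up near the same weights, just along a noisier path.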

Deep learning, as the hottest branch of machine learning, has been playing an important role in production and everyday life in recent years. Due to its significant advantages over traditional machine learning algorithms, deep learning performs excellently in areas such as image classification, speech recognition, cancer diagnosis, rainfall forecasting, and self-driving cars. A deep neural network (DNN) is a deep learning technique that was originally designed to function like the human nervous system and the structure of the brain. Compared with earlier shallow networks, a DNN consists of multiple layers of nodes, including input, hidden, and output layers. The nodes of each layer are connected to the nodes of the adjacent layers by weights, and each node has an activation function.

Most activation functions are nonlinear, such as sigmoid, ReLU, and tanh. At each node, the inputs are multiplied by their respective weights and summed, and the sum is transformed by the activation function. The output of the activation function is then fed as input to the nodes of the next layer, and this process continues until the output layer is reached. The final output is processed by other methods to solve real-world problems. To produce the right output, the parameters of the DNN need to be optimized. Because a DNN is composed of complicated functions with many nonlinear transformations, most of the resulting objective functions are non-convex, with both local and global optima.
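
To make the layer-by-layer computation concrete, here is a minimal NumPy forward pass for a small DNN with ReLU hidden layers and a sigmoid output; the layer sizes, function names, and random weights are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass: at each layer, multiply inputs by their weights, add the
    bias, apply the activation, and feed the result to the next layer."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)                        # hidden layers use ReLU
    return sigmoid(a @ weights[-1] + biases[-1])   # output layer

# Toy network: 4 inputs -> 8 hidden -> 8 hidden -> 1 output.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 1]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
print(forward(rng.standard_normal(4), weights, biases))
```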

In recent years, deep neural networks (DNN) have been widely used in many fields, and because a deep network has numerous parameters, a great deal of effort has been put into training them. Some complex optimizers with many hyperparameters have been used to accelerate network training and improve its generalization ability, but tuning these hyperparameters is often a trial-and-error process. In this paper, we visually analyze the different roles that training samples play in a parameter update and find that each training sample contributes differently to the update. Furthermore, we present a variant of batch stochastic gradient descent for neural networks that use ReLU as the activation function in the hidden layers, which we call adaptive stochastic gradient descent (aSGD). Different from existing methods, it calculates an adaptive batch size for each parameter in the model and uses the mean effective gradient as the actual gradient for parameter updates. Experimental results on MNIST show that aSGD can speed up the optimization process of a DNN and achieve higher accuracy without extra hyperparameters. Experimental results on synthetic datasets show that it can find redundant nodes effectively, which is helpful for model compression.
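
The description above leaves the precise definitions of "effective gradient" and "adaptive batch size" to the body of the paper. The sketch below only illustrates one plausible reading, in which a sample's gradient for a parameter counts as effective when it is nonzero (with ReLU hidden units, a sample contributes an exactly zero gradient to the weights of any unit that is inactive for it), and the per-parameter adaptive batch size is the number of such samples; the function and variable names are ours, not the paper's.

```python
import numpy as np

def asgd_style_update(per_sample_grads, w, lr=0.01, eps=1e-12):
    """Sketch of an aSGD-style update for one parameter tensor.

    per_sample_grads has shape (batch, *w.shape) and holds each sample's
    gradient with respect to w. An entry is treated as "effective" when it
    is nonzero; the adaptive batch size of a parameter is the number of
    samples whose gradient for it is effective, and the update uses the
    mean effective gradient instead of the plain batch mean.
    """
    effective = np.abs(per_sample_grads) > eps        # (batch, *w.shape)
    adaptive_batch_size = effective.sum(axis=0)       # per-parameter count
    grad_sum = per_sample_grads.sum(axis=0)           # zero entries add nothing
    mean_effective_grad = grad_sum / np.maximum(adaptive_batch_size, 1)
    return w - lr * mean_effective_grad
```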
