TY - JOUR

T1 - Smaller generalization error derived for a deep residual neural network compared with shallow networks

AU - Kammonen, Aku Jaakko Alexis

AU - Kiessling, Jonas

AU - Plecháč, Petr

AU - Sandberg, Mattias

AU - Szepessy, Anders

AU - Tempone, Raul

N1 - KAUST Repository Item: Exported on 2022-09-15
Acknowledged KAUST grant number(s): OSR-2019-CRG8-4033.2, URF/1/2281-01-01, URF/1/2584-01-01
Acknowledgements: Swedish Research Council (2019-03725); ARO Grant (W911NF-19-1-0243 to P.P.); King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) (URF/1/2281-01-01 and URF/1/2584-01-01 to R.T., OSR-2019-CRG8-4033.2 to J.K.) in the KAUST Competitive Research Grants Program Round 8; and the Alexander von Humboldt Foundation.

PY - 2022/9/12

Y1 - 2022/9/12

N2 - Estimates of the generalization error are proved for a residual neural network with L random Fourier features layers z̄_{ℓ+1} = z̄_ℓ + Re ∑_{k=1}^K b̄_{ℓk} e^{iω_{ℓk} z̄_ℓ} + Re ∑_{k=1}^K c̄_{ℓk} e^{iω′_{ℓk}·x}. An optimal distribution for the frequencies (ω_{ℓk}, ω′_{ℓk}) of the random Fourier features e^{iω_{ℓk} z̄_ℓ} and e^{iω′_{ℓk}·x} is derived, based on the corresponding generalization error for the approximation of the function values f(x). The generalization error turns out to be smaller than the estimate ‖f̂‖²_{L¹(ℝ^d)}/(KL) of the generalization error for random Fourier features with one hidden layer and the same total number of nodes KL, in the case that the L∞-norm of f is much smaller than the L¹-norm of its Fourier transform f̂. This understanding of an optimal distribution for random features is used to construct a new training method for a deep residual network. Promising performance of the proposed new algorithm is demonstrated in computational experiments.

AB - Estimates of the generalization error are proved for a residual neural network with L random Fourier features layers z̄_{ℓ+1} = z̄_ℓ + Re ∑_{k=1}^K b̄_{ℓk} e^{iω_{ℓk} z̄_ℓ} + Re ∑_{k=1}^K c̄_{ℓk} e^{iω′_{ℓk}·x}. An optimal distribution for the frequencies (ω_{ℓk}, ω′_{ℓk}) of the random Fourier features e^{iω_{ℓk} z̄_ℓ} and e^{iω′_{ℓk}·x} is derived, based on the corresponding generalization error for the approximation of the function values f(x). The generalization error turns out to be smaller than the estimate ‖f̂‖²_{L¹(ℝ^d)}/(KL) of the generalization error for random Fourier features with one hidden layer and the same total number of nodes KL, in the case that the L∞-norm of f is much smaller than the L¹-norm of its Fourier transform f̂. This understanding of an optimal distribution for random features is used to construct a new training method for a deep residual network. Promising performance of the proposed new algorithm is demonstrated in computational experiments.

UR - http://hdl.handle.net/10754/665560

UR - https://academic.oup.com/imajna/advance-article/doi/10.1093/imanum/drac049/6695119

U2 - 10.1093/imanum/drac049

DO - 10.1093/imanum/drac049

M3 - Article

SN - 0272-4979

JO - IMA Journal of Numerical Analysis

JF - IMA Journal of Numerical Analysis

ER -