You can do that just by adding larger hidden layers or more hidden layers, up to a point. But eventually the signal coming from the input data gets so diluted through all the neurons and layers that your models stop performing better as you add more.

Many of the advances in NNs come from structuring the neurons in particular ways. In computer vision, for example, convolutional neural networks are built from small neural networks that each look at one part of the image and give an output - basically shrinking and summarizing the image - which is then fed into another small neural network that shrinks and summarizes some more, and so on until you're left with a single result, like "is this a cat?" Transformers, which are mostly used for natural language processing, have small neural network layers that let the network figure out which other words in a text (or tokens, really) are relevant for understanding a given word.

These structures simplify the problem for the NN so it doesn't have to sort through all of its data at once, which is a bottleneck on scaling NNs.
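If it helps to see the "shrink and summarize" idea in code, here's a rough PyTorch sketch (the layer sizes and image size are made up purely for illustration, not any particular real model):

```python
import torch
import torch.nn as nn

# Each Conv2d is the "small network" that looks at every patch of the image;
# each MaxPool2d shrinks the result before handing it to the next one.
tiny_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # look at 3x3 patches of the RGB image
    nn.ReLU(),
    nn.MaxPool2d(2),                             # shrink: 64x64 -> 32x32
    nn.Conv2d(16, 32, kernel_size=3, padding=1), # summarize the summaries
    nn.ReLU(),
    nn.MaxPool2d(2),                             # shrink again: 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 1),                  # one final "is this a cat?" score
)

image = torch.randn(1, 3, 64, 64)  # a fake 64x64 RGB image
print(tiny_cnn(image).shape)       # torch.Size([1, 1]) -- a single output
```

And the "which other words matter for this word" part of a transformer boils down to something like this toy sketch (again, made-up sizes, and a real transformer wraps this in extra layers):

```python
import torch
import torch.nn.functional as F

# 5 tokens, each represented as an 8-number vector
tokens = torch.randn(5, 8)

# Three small linear maps turn each token into a query, a key, and a value
W_q, W_k, W_v = (torch.randn(8, 8) for _ in range(3))
Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v

# How relevant is every token to every other token?
scores = Q @ K.T / (8 ** 0.5)        # 5x5 table of relevance scores
weights = F.softmax(scores, dim=-1)  # turn each row of scores into percentages

# Each token's new representation is a weighted mix of all the tokens
output = weights @ V
print(output.shape)                  # torch.Size([5, 8])
```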
Not sure that explains it like you're 5, but hopefully it addresses your question