The same way Steam allows each localization to have its own separate version of the audio files without forcing everybody to download every one.
Strings and voice clips are much easier to localize than images.
Strings are a matter of just writing the words down, and voice clips are a matter of replacing the original content outright. Both are easy to hand off to another team.
When dealing with images, you either have to hope that the text is editable (which it may not be if the artist is working in a tool that does not support that, or if they have flattened the text in order to apply certain effects to it), or you have to remove the text and start over, then apply any effects afterwards in a way that is artistically consistent with the rest of the game, which may require getting the original team involved.
There are also cases where text is size-sensitive, where translating it into another language makes it too big to fit in its original context. At that point you either have to do something clever or just give up.
It's also a lot more effort and possibly outside the skill set that a localization team typically has. You could do it if you have a good localization team and the developer is willing to put the effort in, but (my understanding is, at least, that) many times localization happens towards the end of development and there's no budget or time to flesh it out properly.