Several deep neural networks have recently been shown to generate activations
similar to those of the brain in response to the same input. These algorithms,
however, remain largely implausible: they require (1) extraordinarily large
amounts of data, (2) unobtainable supervised labels, (3) textual rather than
raw sensory input, and/or (4) implausibly large memory (e.g., thousands of
contextual words). These shortcomings highlight the need to identify
algorithms that, freed from such requirements, would still suffice to account
for both behavioral and brain responses. Focusing on the issue of speech
processing, we
hypothesize that self-supervised algorithms trained on the raw waveform
constitute promising candidates. Specifically, we compare a recent
self-supervised architecture, Wav2Vec 2.0, to the brain activity of 412
English, French, and Mandarin speakers recorded with functional Magnetic
Resonance Imaging (fMRI) while they listened to ~1h of audio books (see the
illustrative sketch below). Our results are four-fold. First, we show that
this algorithm learns brain-like
representations with as little as 600 hours of unlabelled speech -- a quantity
comparable to what infants can be exposed to during language acquisition.
Second, its functional hierarchy aligns with the cortical hierarchy of speech
processing. Third, different training regimes reveal a functional
specialization akin to that of the cortex: Wav2Vec 2.0 learns sound-generic,
speech-specific, and language-specific representations similar to those of
the prefrontal and temporal cortices. Fourth, we confirm that this
specialization is mirrored in the behavior of 386 additional participants.
These
findings, resulting from the largest neuroimaging benchmark to date, show how
self-supervised learning can account for a rich organization of speech
processing in the brain, and thus delineate a path toward identifying the laws
of language acquisition that shape the human brain.
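
To make the model-to-brain comparison above concrete, the following is a
minimal, illustrative sketch of a standard encoding-model analysis of the kind
described in the abstract: Wav2Vec 2.0 activations are averaged within each
fMRI volume, mapped to voxel responses with ridge regression, and each voxel
is scored by the correlation between predicted and held-out responses. The
checkpoint (facebook/wav2vec2-base), probed layer, repetition time, and the
random arrays standing in for the audio and fMRI recordings are all
illustrative assumptions; the paper's exact preprocessing and mapping pipeline
may differ.

```python
# Minimal sketch (assumptions noted above): compare Wav2Vec 2.0 activations
# to fMRI responses with a ridge-regression encoding model.
import numpy as np
import torch
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

SAMPLE_RATE = 16_000   # Wav2Vec 2.0 checkpoints expect 16 kHz audio
TR = 2.0               # assumed fMRI repetition time (seconds)
LAYER = 6              # assumed transformer layer to probe

# 1. Extract activations for one audio chunk (random noise stands in
#    for a real audio-book excerpt).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
audio = np.random.randn(SAMPLE_RATE * 60).astype(np.float32)  # 60 s placeholder
inputs = extractor(audio, sampling_rate=SAMPLE_RATE, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states
acts = hidden[LAYER][0].numpy()                  # (n_frames, hidden_dim)

# 2. Average activations within each fMRI volume (TR) to match the
#    temporal resolution of the BOLD signal.
frames_per_tr = int(round(acts.shape[0] / (len(audio) / SAMPLE_RATE) * TR))
n_trs = acts.shape[0] // frames_per_tr
X = acts[: n_trs * frames_per_tr].reshape(n_trs, frames_per_tr, -1).mean(axis=1)

# 3. Placeholder fMRI data: n_trs volumes x n_voxels (replace with real data).
Y = np.random.randn(n_trs, 500)

# 4. Fit a ridge encoding model and score each voxel by the correlation
#    between predicted and held-out responses (the "brain score").
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, shuffle=False)
ridge = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X_tr, Y_tr)
Y_hat = ridge.predict(X_te)

def pearson_per_voxel(a, b):
    a = (a - a.mean(0)) / (a.std(0) + 1e-8)
    b = (b - b.mean(0)) / (b.std(0) + 1e-8)
    return (a * b).mean(0)

scores = pearson_per_voxel(Y_hat, Y_te)
print(f"layer {LAYER}: mean brain score = {scores.mean():.3f}")
```

Repeating this per layer and per participant is how a layer-to-region
correspondence, such as the functional hierarchy reported above, can be
traced; with the random placeholders used here, the scores simply hover
around zero.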