The what, where and how of auditory-object perception

Jennifer K. Bizley1 and Yale E. Cohen2

1University College London Ear Institute, 332 Grays Inn Road, London, WC1X 8EE, UK

2Departments of Otorhinolaryngology, Neuroscience and Bioengineering, 3400 Spruce St – 5 Ravdin, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA

Correspondence: Jennifer K. Bizley: j.bizley@ucl.ac.uk; Yale E. Cohen: ycohen@mail.med.upenn.edu


The publisher's final edited version of this article is available at Nat Rev Neurosci

Abstract

The fundamental perceptual unit in hearing is the ‘auditory object’. Similar to visual objects, auditory objects are the computational result of the auditory system's capacity to detect, extract, segregate and group spectrotemporal regularities in the acoustic environment; the multitude of acoustic stimuli around us together form the auditory scene. However, unlike the visual scene, resolving the component objects within the auditory scene crucially depends on their temporal structure. Neural correlates of auditory objects are found throughout the auditory system. However, neural responses do not become correlated with a listener's perceptual reports until the level of the cortex. The roles of different neural structures and the contribution of different cognitive states to the perception of auditory objects are not yet fully understood.

Hearing and communication present various challenges for the nervous system. To be heard and to be understood, an auditory signal must first be transformed from a time-varying acoustic waveform into a perceptual representation (FIG. 1). This representation is then converted into an abstract one that combines the extracted sensory information with stored memories and semantic knowledge. Last, this abstract representation must be interpreted to guide the categorical decisions that determine behaviour. Did I hear the stimulus? From where and whom did it come? What does it tell me? How can I use this information to plan an action?


Figure 1

The transformation of an acoustic stimulus into a perceptual representation of a sound

The fundamental problem that is solved by the auditory system is the need to transform an acoustic stimulus into a perceptual representation of one or more auditory objects. Typically, various independent sound sources contribute to the creation of a soundscape. a | In the example shown, there are three sound sources (a banjo player, a singer and a bassist), each of which is producing an acoustic stimulus with unique spectrotemporal features. b | The auditory stimulus that reaches a listener's ear will be a complex mixture of the stimuli produced by these three sources. c | However, the listener hears each source as a distinct auditory object. BOXES 1,2 discuss the grouping cues that underlie this capacity to segregate a stimulus into unique sound sources.

There is broad agreement that the ventral auditory pathway — a pathway of brain regions that includes the core auditory cortex, the anterolateral belt region of the auditory cortex and the ventrolateral prefrontal cortex — has a role in auditory-object processing and perception. However, no consensus has been reached on either the roles of different regions in this pathway in specific elements of auditory-object processing and perception or the contributions of particular cognitive states (such as attention) to the differential modulation of activity along this pathway. Here, we discuss how the brain transforms an acoustic-based representation of a stimulus into one that is object-based. We consider how object-related neural activity might emerge and how attention and behavioural state influence perception and neural activity. We also review what is known and, more importantly, what is unknown regarding the hierarchical flow and transformation of information along the ventral pathway. Finally, we focus on studies that relate neural activity to behaviour; reviews of work underlying perceptual correlates of audition in non-behaving animals can be found elsewhere.

What is an auditory object?

The precise definition of an auditory object has been the subject of considerable debate. Intuitively, we understand an auditory object to be the perceptual consequence of the auditory system's interpretation of acoustic events and happenings. For example, when sitting outside a café, we might hear a bird sing, a car passing, the hiss of a coffee machine or the voice of our friend. Each of these different and discrete sounds can be described as an auditory object. More formally, auditory objects are the computational result of the auditory system's ability to detect, extract, segregate and group the spectrotemporal regularities in the acoustic environment into stable perceptual units. Thus, we define an auditory object as a perceptual construct, corresponding to the sound (such as the hiss) that can be assigned to a particular source (the coffee machine).

Auditory objects have several general features and characteristics. First, acoustic stimuli are emitted from or by things, as a consequence of actions or events. Some acoustic stimuli, such as human speech, are emitted with a clear intention, whereas others, such as environmental sounds, are not. In either case, we rarely hear sounds in isolation. Therefore, an auditory object spans multiple acoustic events that unfold over time, and a sequence of objects forms a ‘stream’. For example, when a person is walking, each step is a unique acoustic event or object. However, our auditory system groups these separate stimuli together into a temporal sequence of ‘footsteps’. A stream of objects can, itself, be termed an object. Second, we can parse the soundscape into its constituent objects. Therefore, one auditory object has spectrotemporal properties that make it separable from other auditory objects. As a consequence, we can detect our friend's voice among myriad other sounds in the café. Third, as with a visual object, a listener can readily describe an auditory object by the combination of its features: it might have a high or low pitch, a rich timbre or a characteristic loudness. However, the same listener would find it very difficult to describe the underlying acoustic features that give rise to these percepts, such as the harmonicity of the sound or the timing differences between our two ears. Fourth, as in vision, auditory-object recognition is invariant to various changes in an object's spectrotemporal properties that result from the context in which the object is perceived. For example, a violin still sounds like a violin regardless of whether a single high note or a rapid melody is played, whether it is played loudly or softly or whether it is played alone or as part of an orchestra. As in the visual system, we must be capable of generalizing across the different ways in which an object or event occurs. Last, we expect object representations to predict parts of the object for which no input is currently available. For example, Jan can still understand Jenny's speech despite the fact that Yale's sneezing has masked certain acoustic features of her speech by rendering them inaudible.

How are auditory objects formed? Our ear receives a composite waveform composed of all of the acoustic stimuli in the environment. The brain's job is to appropriately group these acoustic features into perceptual features and then to group these to form a representation of discrete objects that can be further analysed (FIG. 1). An auditory stimulus comes into our awareness as an auditory object by means of the simultaneous and sequential principles that group acoustic features into stable spectrotemporal entities (BOXES 1,2). Although attention is not always necessary for auditory-object formation, our awareness of an object can be influenced by attention. For example, we can choose whether to listen to — or ignore — the first violin, the strings or the whole orchestra. Likewise, we can selectively attend to the features of a person's voice that allow a listener to identify the speaker.

Hierarchical processing in the cortex

Visual information processing is thought to take place in two parallel pathways that independently analyse the identity and location of objects within the visual scene. Initially, on the basis of theoretical and anatomical studies, a similar processing scheme was proposed for the auditory cortex, whereby information is processed in parallel hierarchical pathways specialized for the extraction of spatial (‘where is the sound?’) and non-spatial (‘what is the sound?’) components of an auditory stimulus. These computations occur in the so-called ‘dorsal’ and ‘ventral’ pathways, respectively. As we discuss in detail below, both functional imaging studies in humans and single-unit neurophysiology in non-human animals provide evidence in favour of a division of labour between spatial and non-spatial processing. Conversely, other studies using the same methods suggest that rather than two hierarchically organized parallel pathways, distributed, dynamically organized processing networks are likely to support auditory perception. According to this theory, feedback between brain areas would facilitate object selection.

Processing strategies within auditory cortex

Under a hierarchical-processing model, auditory-object extraction occurs in the ventral processing pathway, and we might expect to see, as we move along the pathway, a transition from the representation of acoustic features to perceptual features and finally to objects or category-specific representations at the highest stages — computations perhaps analogous to those that are well described in higher visual areas. At least in non-human primates, the ventral pathway begins in the core auditory cortex — specifically, the primary auditory cortex and the rostral field (FIG. 2). These core areas project to the anterolateral belt region of the auditory cortex. In turn, this belt region projects to the ventrolateral prefrontal cortex.


Figure 2

Dual pathways of information flow in the auditory system and the organization of the auditory cortex

a | Information processing in the primate auditory system is hypothesized to occur in two streams. Neurons in the ‘dorsal’ stream (red), which may preferentially analyse space and motion, are involved in audiomotor processing, whereas those in the ‘ventral’ stream (green) are preferentially involved in auditory-object processing. Solid arrows indicate feedforward projections, and dashed arrows indicate feedback projections. b | A schematic representation of the organization of the auditory cortex (AC) in different species. The lemniscal auditory thalamocortical projection terminates in the ‘core’ regions of the AC (blue shading), including the primary auditory cortex (A1). In humans, this core region is in Brodmann area 41 (BA41). From these core areas, there is both serial and parallel processing in the surrounding ‘belt’ regions (such as the anterolateral (AL) and middle-lateral (ML) regions in the macaque monkey or the secondary AC (A2) in the cat) and from there to the ‘parabelt’ regions (such as the rostral parabelt (RPB) in the macaque). Although this organization was originally described in non-human primates, it appears to be a general organizational scheme in a variety of primate and non-primate species. Solid lines indicate boundaries between auditory fields, and dashed lines indicate anatomical boundaries. AAF, anterior auditory field; ADF, anterior dorsal field; A1d, dorsal region of the primary auditory field; AV, anteroventral field; CL, caudolateral belt region of the AC; CM, caudomedial area; CPB, caudal parabelt; CS, central sulcus; D, dorsal field; DC, dorsocaudal field; DCB, dorsocaudal belt; DLPFC, dorsolateral prefrontal cortex; DP, dorsoposterior field; DRB, dorsorostral belt; EP, ectosylvian posterior auditory region; IFC, inferior frontal cortex; Ins, insula; IPL, intraparietal lobule; IPS, intraparietal sulcus; LS, lateral sulcus; MM, mediomedial belt; PAF, posterior auditory field; PDF, posterior dorsal field; PMC, premotor cortex; PPF, posterior pseudosylvian field; PSF, posterior suprasylvian field; RM, rostromedial belt; RPB, rostral parabelt; RTL, rostrotemporal lateral belt; RTM, rostrotemporal medial belt; SRAF, suprarhinal auditory field; STGr, superior temporal gyrus rostral to the parabelt; STS, superior temporal sulcus; T, transitional belt area; Te, temporal; Tpt, temporal lobe association cortex; V, ventral field; VAF, ventral auditory field; VCB, ventrocaudal belt; VLPFC, ventrolateral prefrontal cortex; VM, ventromedial field; VP, ventroposterior field; VRB, ventrorostral belt. Part a is modified, with permission, from © (2009) Macmillan Publishers Ltd. All rights reserved. Part b is modified, with permission, from © (2011) Elsevier.

There are several pieces of evidence suggesting that auditory-object and spatial processing occurs in separate, parallel pathways (FIG. 2). Some of the first physiological evidence for a separation of spatial and non-spatial processing was provided by a study that investigated neural sensitivity to sound location and identity using a series of monkey vocalizations presented at different spatial locations. This study found that belt regions in the ventral auditory pathway were more sensitive to vocalization type, whereas belt regions in the dorsal pathway were modulated more by the location of a stimulus. Similarly, early human imaging data supported a division of spatial and non-spatial processing. Furthermore, a meta-analysis of functional imaging data showed that spatial tasks almost always activate the posterior auditory cortex (part of the dorsal stream), whereas non-spatial activity is observed across the temporal lobe. Finally, other findings have shown that the ventral stream is involved in the categorization of speech sounds, which is an important component of auditory-object processing. Preferential spatial and non-spatial processing is also found outside the auditory cortex: for example, in the prefrontal cortical regions that are part of these hypothesized dorsal and ventral pathways.

Nevertheless, substantive auditory-object processing has been identified in the dorsal pathway, and substantial information about auditory space has been found in the ventral pathway. Such findings suggest that a model of parallel hierarchical processing might be too simplistic and that a mixture of spatial and non-spatial auditory information might be useful for those computations that create the consistent perceptual representations that guide goal-directed behaviour. For example, spatial information can act as a grouping cue to enable auditory stream formation. When a rhythmic sequence of identical sound bursts is presented from a single location, it is perceived as one source by human observers. However, such a sequence is perceived as two sources, each with a distinct rhythm, when the sound sequences are presented from two spatially separated locations. Neural correlates of this paradigm are observed in the auditory cortex of anaesthetized cats. Likewise, non-spatial (object) information processed in the dorsal stream might contribute to computations that involve target selection, the online computational processing of dynamic auditory information, audiomotor processing and other computations that involve organization of the auditory scene (see REFS 42,43,51–54 for reviews of hierarchical processing of speech in both the posterior and anterior auditory cortex). However, as most, if not all, studies have asked listeners to attend to either spatial or non-spatial features of a sound but not to both simultaneously, the interaction between these two pathways has not been fully resolved within either the auditory or visual systems.

Within the ventral and dorsal processing pathways, both single-neuron studies and functional imaging studies indicate that the perceptual features of a sound might be localized and organized in a hierarchical manner. Pitch is probably the most widely studied perceptual feature; below, we use it to explore findings that support both hierarchical and distributed organizational schemes.

Pitch processing: hierarchical or distributed?

Several important studies indicate that pitch-selective neurons are localized to specific cortical areas. For example, in non-human primates, pitch-selective neurons are found at the border between the core and belt auditory cortex. Similarly, in humans, a pitch-sensitive area has been identified anterior to Heschl's gyrus. Moreover, whereas neural activity throughout the auditory cortex correlates better with changes in a listener's reports of features such as pitch than with changes in the stimulus features, activity recorded in the low-frequency core and belt regions of the auditory cortex predicts both pitch and listeners' reports of pitch better than activity recorded in other regions.

However, many of the same studies also provide evidence that a broader network of brain areas may subserve pitch perception. For example, pitch-related activity has also been reported in both the core and the non-core auditory cortex in humans. Similarly, pitch-sensitive neurons are broadly distributed in core and non-core regions of the ferret auditory cortex, and neural responses in multiple regions of the auditory cortex correlate with pitch-perception judgements in this species.

The difficulties inherent in comparing data derived using different experimental methods (and often in different species) limit a comprehensive understanding of the neural correlates underlying pitch perception. For example, comparing studies using single-unit recordings and those using functional imaging is difficult as both are subject to different methodological constraints. Functional MRI (fMRI) experiments, for example, usually compare the activity elicited by a pitch-evoking stimulus with that evoked by a control sound without pitch. By contrast, single-neuron studies present a particular class of pitch-evoking stimuli and test for a neuron's tuning to a specific fundamental frequency. Also, studies rarely attempt to map a neuron's pitch tuning while also using a number of spectrally different sounds in order to explore pitch constancy (although see REFS 56,68,71 for exceptions). Finally, it has proven difficult to identify individual brain regions or neurons that respond to a pitch irrespective of the stimulus' spectral properties.
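To make this distinction concrete, the short sketch below (our own illustration in Python with NumPy, not the stimulus set of any study cited here; all parameter values and function names are arbitrary) synthesizes two pitch-evoking stimuli that share a 200 Hz fundamental but differ markedly in spectral content, including a ‘missing-fundamental’ complex. A neuron or voxel with a pitch-constant representation should, in principle, respond similarly to both.

```python
import numpy as np

def harmonic_complex(f0, harmonics, dur=0.5, fs=48000):
    """Sum of equal-amplitude sinusoids at the given multiples of f0."""
    t = np.arange(int(dur * fs)) / fs
    return sum(np.sin(2 * np.pi * f0 * h * t) for h in harmonics) / len(harmonics)

f0 = 200.0                                                # both stimuli evoke a ~200 Hz pitch
with_fundamental = harmonic_complex(f0, range(1, 11))     # harmonics 1-10
missing_fundamental = harmonic_complex(f0, range(4, 11))  # harmonics 4-10 only

# The two waveforms have very different spectra but share the same
# periodicity, which is the dissociation a test of pitch constancy needs.
```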

Consequently, further studies (such as experiments in which particular neurons or brain areas are inactivated) will be required to determine whether putative pitch-selective areas have a causal role in auditory perception and to determine how these areas interact with one another. Neurophysiological experiments would additionally benefit from exploring neural tuning using various pitch-evoking stimuli, to test for neural representations that can abstract pitch. Performing such studies in animals that are actively discriminating sounds on the basis of their pitch is essential to determine the response properties underlying pitch perception.

We predict two broad outcomes of such sets of experiments. It is possible that activity in a specialized area underlies pitch perception but that broadly distributed pitch sensitivity enables pitch to be used for making sense of the auditory scene — for example, by enabling common pitch to be used as a grouping cue. Alternatively, a distributed network of pitch-activated areas might form a processing hierarchy. For example, pitch processing within the primary auditory cortex could depend on the listening context, whereas pitch processing in extra-core regions (such as the planum temporale) might be context-independent. In other words, there might be an invariant representation of pitch in the planum temporale but not in core areas such as Heschl's gyrus, which is consistent with the idea of a pitch-processing hierarchy.

Timbre: explicit and implicit representations

Similar principles can be drawn from the study of other perceptual dimensions. Another important perceptual feature of a sound is timbre. The neural representation of timbre is broadly distributed: in both core and belt regions of the auditory cortex, both single-neuron and functional imaging studies have shown that neurons are sensitive to the timbre of a sound. However, this neural representation of timbre is not invariant, as neural sensitivity to timbre is modulated by other sound features, such as pitch or spatial location. Despite this, neural activity might represent different stimulus features unambiguously at different time points: when responding to a stimulus, single-unit spiking activity is initially tuned for the sound's timbre but later becomes tuned for its pitch. The core auditory cortex might thus contain an ‘implicit’ representation of both an object and identity-preserving transformations of the object (such as changes in location or loudness) in a manner that may be analogous to the different types of visual representation contained in visual area V4.

However, an explicit or invariant representation of timbre does seem to emerge in later stages of processing, at least in humans. For example, neural responses to vowel sounds represent stimulus acoustics at the level of the brainstem but represent perceptual mappings at the level of the cortex. Functional imaging studies indicate that neurons in the planum temporale encode an invariant representation of a sound's spectral envelope, one of the key determinants of timbre. Indeed, dynamic causal modelling has directly identified a serial-processing architecture in which timbre information originates in Heschl's gyrus, is transmitted to the planum temporale and then to the superior temporal gyrus; according to this model, spectral envelope extraction is complete by the time the information reaches the planum temporale. Such a hierarchical-processing scheme might underpin a representation of sound timbre that allows us to perceptually recognize and identify a musical instrument as a bassoon or a violin across different pitches and melodies.

In summary, although single neurons in the early core and belt auditory cortex of non-human animals show broad sensitivity to a number of perceptual features, there is good evidence for specialized processing of some of these features within particular areas. Whether these areas form a linear, hierarchical processing stream or a more dynamic, distributed assembly remains a matter of debate. To advance our understanding of the mechanisms underlying timbre perception, it may prove beneficial to carry out single-unit recording studies to test predictions derived from computational modelling techniques.

From stimulus to perception

Studies in behaving animals offer the potential to observe neural correlates of perception, as indexed by changes in neural activity as a function of an animal's behavioural choice during a listening task. That is, by holding a stimulus constant and testing whether neural activity is modulated by the animal's behavioural responses or choices (such as an animal indicating whether a target pitch is perceived as higher or lower than a reference pitch), neural activity that is associated with the stimulus itself can be dissociated from neural activity associated with the sensory decision. Choice-related activity (that is, activity that represents the animal's behavioural choice rather than the stimulus) is thought to arise owing to correlations in the noise structure of neurons contributing to a sensory decision. By examining how choice-related activity and other behaviourally related signals are modulated in different cortical areas, we can gain insight into how the nervous system transforms a sensory signal into a decision variable.
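Choice-related activity of this kind is commonly summarized as a ‘choice probability’: an ROC-based measure of how well an ideal observer could predict the animal's report from a neuron's response to a nominally identical stimulus. The sketch below is a generic illustration of that computation (a toy example in Python with NumPy and simulated spike counts; it is not the analysis pipeline of any particular study discussed here). Values near 0.5 indicate no choice information; values away from 0.5 indicate choice-related modulation.

```python
import numpy as np

def choice_probability(counts_choice_a, counts_choice_b):
    """ROC area: probability that a response drawn from choice-A trials
    exceeds one drawn from choice-B trials (ties count as one half)."""
    a = np.asarray(counts_choice_a, dtype=float)
    b = np.asarray(counts_choice_b, dtype=float)
    greater = (a[:, None] > b[None, :]).mean()
    ties = (a[:, None] == b[None, :]).mean()
    return greater + 0.5 * ties

# Toy data: the stimulus is identical on every trial; trials are split only
# by the animal's report ('higher' versus 'lower' than the reference pitch).
rng = np.random.default_rng(0)
higher_trials = rng.poisson(12, size=60)   # spike counts on 'higher' reports
lower_trials = rng.poisson(10, size=60)    # spike counts on 'lower' reports
print(choice_probability(higher_trials, lower_trials))
```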

Recent investigations in behaving primate and non-primate species have found that neural activity is significantly correlated with a listener's behavioural reports. For example, in core and non-core regions of the auditory cortex, local-field potentials and spiking activity are modulated more by ferrets' decisions regarding the pitch of a target sound than by the actual pitch category. Similarly, in macaque monkeys, single- and multiunit recordings during an amplitude-modulation detection task reveal that activity in neurons in the primary auditory cortex is, once again, correlated with an animal's behavioural reports. Last, blood-oxygen-level-dependent (BOLD) signals measured in early belt regions (areas adjacent to Heschl's gyrus and in the planum temporale) with fMRI can be decoded to predict a human listener's percept of an ambiguous speech sound. These findings suggest that a population of core auditory cortical neurons contributes to or reflects the computations that underlie perceptual decision-making.

However, not all studies have found choice-related activity in the core auditory cortex. For example, in an auditory flutter experiment, choice-related activity was not found in the auditory cortex but appeared in the ventral premotor cortex. Similarly, in macaques that were discriminating between two phonemes and morphs of these phonemes, choice-related activity was not present in the auditory cortex but was found in the ventrolateral prefrontal cortex.

It is not clear why some studies have found choice-related activity in the primary auditory cortex, whereas others have only found such activity in more anterior areas. One important consideration might be the task itself. For example, whether an animal is engaged in a single- or multiple-interval forced-choice task, the task design or the animal's strategy to solve the task might determine the location of choice-related activity: a brain area that encodes the stimulus in a multiple-interval choice task is also unlikely to perform the comparison of the two stimuli. In such a task, choice-related activity would first be observed in more anterior processing areas, such as the ventrolateral prefrontal cortex or the premotor cortex. By contrast, when a task can be solved on the basis of listening during a single interval, activity during that interval could also encode the sensory decision. Therefore, differences in the level of abstraction required by the animals might determine whether choice-related activity is observed within the auditory cortex: a categorical ‘same’ versus ‘different’ task necessitates a higher level of abstraction than does a high or low pitch judgement or detection of a particular stimulus feature (such as modulation or frequency change).

Nevertheless, the finding of such signals in any brain region does not indicate that a particular cortical area is a locus for decision-making. A decision outcome is thought to require the accumulation of sensory evidence into a decision variable. It seems likely that the neural correlates of perception that are observed in the early auditory cortex represent the sensory evidence that is needed to form a perceptual decision, which is then fed forward to other areas of the ventral pathway. Alternatively, this choice-related activity could reflect feedback signals from higher areas. Finally, the time when choice-related activity appears during the temporal evolution of a task is an important consideration. For example, if choice-related activity appears before the stimulus that forms the basis of the animal's decision (such as the second stimulus in a paradigm requiring an animal to compare two sequentially presented sounds), this activity should be considered to reflect the listener's bias towards one alternative (choice) over the other.

To identify the neural mechanisms underlying auditory decision-making, scientists must systematically study changes in neural representations throughout a circuit of cortical areas to determine whether such signals reflect sensory evidence or a true decision variable. Such work has proven to be fruitful in the visual and somatosensory systems but has yet to be applied broadly to the auditory system. Additionally, formal computational models of perceptual decision-making that incorporate psychophysical and neurophysiological predictions need to be introduced into auditory studies.

Grouping features into objects

As described above, evidence suggests that the transformation from sound-source acoustics into perceptual features such as pitch and timbre, which are used to describe an object, occurs in the early auditory cortex, where, in some instances, neural activity correlates with an animal's behavioural report. It is worth repeating that these perceptual features are components of an auditory object rather than the object itself. For example, a cat's meow has a higher pitch when someone stands on its tail than when the cat wants to be fed. Other studies have focused on how and where features are bound together to allow extraction of auditory objects.

Auditory scientists test where and how objects are extracted by analysing how the sequential and simultaneous grouping principles (BOXES 1,2) that bind perceptual features into a unified auditory object are represented in the cortex. For example, in one set of studies, fMRI data were recorded while human listeners judged whether a target sound was continuous or discontinuous (the illusion that a discontinuous sound is continuous is called amodal completion; see BOX 2). These studies found that physically identical acoustic stimuli elicited different BOLD signals in the primary auditory cortex depending on whether a listener reported a continuous or a discontinuous percept. The fact that listeners did not report a discontinuous percept suggests that, in this case, the auditory object itself, rather than the low-level spectrotemporal details, determined the listener's percept. Consistent with the idea that central brain regions are responsible for this illusion, computational simulations predict that cortical activity should correlate with the identity of the object and not its spectrotemporal components. Finally, single-neuron correlates of amodal completion have been found in the primary auditory cortex of rhesus macaques. However, because behavioural reports and neural data were not gathered simultaneously, it is not clear whether this activity was related to the primitive grouping principles that are needed to form an auditory object or to the object itself.

The ‘ABA streaming’ paradigm is commonly used to test sequential grouping. In this paradigm, two interleaved sequences of tone bursts at two different frequencies (frequencies A and B) are presented to a listener. At slow rates, a listener is more likely to hear a single stream of alternating tones (FIG. 3a). When the semitone separation between frequencies A and B is small (0.5 semitones), listeners are likely to report hearing one auditory stream (FIG. 3b). When this separation is large (>10 semitones), listeners reliably report hearing two auditory streams. At intermediate semitone separations, listeners hear one or two auditory streams on alternate trials. This type of stimulus is called a ‘bistable’ stimulus because the listener's perceptual report may alternate between the two possibilities; therefore, neural activity related to the perceptual report can be dissociated from neural activity related to the stimulus. These auditory bistable stimuli might be analogous to visual bistable stimuli. The tone-burst duration, listening duration, repetition rate and other factors can also modulate a listener's reports.
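The stimulus itself is straightforward to construct. The sketch below (an illustrative reconstruction in Python with NumPy; the tone durations, rates and frequencies are arbitrary choices rather than the parameters of any specific experiment) builds an A–B–A– triplet sequence in which the separation between the A and B tones is specified in semitones; widening that separation is what shifts listeners from a one-stream towards a two-stream percept.

```python
import numpy as np

def tone(freq, dur, fs=44100, ramp=0.005):
    """Pure tone with brief linear onset/offset ramps to avoid clicks."""
    t = np.arange(int(dur * fs)) / fs
    y = np.sin(2 * np.pi * freq * t)
    n = int(ramp * fs)
    y[:n] *= np.linspace(0.0, 1.0, n)
    y[-n:] *= np.linspace(1.0, 0.0, n)
    return y

def aba_sequence(f_a=500.0, semitones=6.0, tone_dur=0.1, gap=0.02,
                 n_triplets=10, fs=44100):
    """Repeating 'A B A -' triplets; B sits `semitones` above A."""
    f_b = f_a * 2 ** (semitones / 12.0)      # one semitone is a factor of 2**(1/12)
    pause = np.zeros(int(gap * fs))
    silent_slot = np.zeros(int((tone_dur + gap) * fs))   # the '-' position
    triplet = np.concatenate([tone(f_a, tone_dur, fs), pause,
                              tone(f_b, tone_dur, fs), pause,
                              tone(f_a, tone_dur, fs), pause,
                              silent_slot])
    return np.tile(triplet, n_triplets)

one_stream_bias = aba_sequence(semitones=1)    # small separation: usually one stream
two_stream_bias = aba_sequence(semitones=12)   # large separation: usually two streams
```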


Figure 3

Auditory streaming

In a classic paradigm of auditory streaming, two sequences of tone bursts are presented in an alternating fashion. a | When the frequency separation between the tone bursts in the two sequences is large, listeners typically hear two streams. b | By contrast, when the frequency separation between the two sequences is small, listeners typically report hearing one stream. However, at intermediate frequency separations, the listener's report is bistable over time: they alternate between perceiving one or two streams (not shown). With longer listening times, this report stabilizes and listeners reliably report two streams. c | In addition to parameters such as listening duration, the temporal relationship between the two sequences is critical. When the two sequences are presented concurrently, listeners consistently report hearing one stream. This observation suggests that the temporal coherence between different neural populations is the critical mechanism for determining whether a listener hears one or two streams. See REFS 104,169 for more details on the role that temporal coherence has in auditory streaming. Figure is modified, with permission, from © (2011) Elsevier.

What neural computations underlie a listener's perception of one or two auditory streams? Correlates of the grouping principles thought to underlie ABA streaming can be observed as early as the cochlear nucleus. One reasonable hypothesis is that neurons downstream from the core auditory cortex, such as those in the belt cortex or even the frontal and parietal lobes, read out the topographic distribution of activity in the core auditory cortex. That is, if the semitone separation is small, there would be one peak of activity, which downstream neurons — as a proxy for a listener's behavioural reports — would decode as one stream. By contrast, if the semitone separation was large, there would be two peaks of activity, which would be decoded as two streams. At intermediate separations, the number of peaks would be unclear and trial-by-trial neural noise would alternate the readout between one and two peaks of activity. Importantly, however, temporal parameters also influence both listeners' reports and neural activity. For example, when the intervals between tones are short, listeners are more likely to report hearing one stream. The mechanism of this bias, which is likely to be partly inherited from earlier parts of the processing pathway, might be forward masking, which would ‘eliminate’ or minimize the second peak of activity. However, as streaming can occur in response to various sounds, including noises and harmonic sounds, that would elicit overlapping spectral representations, this topographic readout explanation is probably too simplistic.

Indeed, recent work has shown that a topographic readout is insufficient to explain auditory streaming, at least in the ABA paradigm. If spatially segregated populations of neurons are necessary for streaming to occur, then the relative timing of tone A and tone B should be inconsequential because the only factor that would be important is the topographic representation of neural activity in the auditory cortex. In an elegant series of experiments, this hypothesis was explored by testing how the timing of tone A and tone B affected a listener's behavioural reports. These authors found that, independent of semitone separation, when tone A and tone B were presented simultaneously, listeners reliably reported one stream (FIG. 3c). Thus, the relative timing of these peaks of activity is critical: when the two peaks are in phase, listeners report one stream, but when they are out of phase, listeners report two streams. This neural mechanism of temporal synchrony might also be involved in the grouping of other cues, such as harmonic stimuli and stimulus onset and offset. A strict interpretation of the temporal coherence model has itself recently been challenged by the finding that, although temporal coherence is an important factor in the formation of perceptual streams, temporally coherent sounds can be streamed. Unfortunately, the specific neural readout mechanisms that are sensitive to such timing information are not known. Future work in which large groups of neurons are recorded simultaneously while temporal synchrony is parametrically varied is essential for addressing this question.
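One way to appreciate why this timing manipulation is so diagnostic is to compare the two candidate readouts directly. The toy simulation below (our own sketch in Python with NumPy, not an implementation of the published temporal-coherence model) represents the A and B frequency channels as binary activity across tone slots: alternating and synchronous sequences yield identical time-averaged channel activity, so a purely topographic (peak-counting) readout cannot separate them, whereas the across-time correlation between channels, the quantity a temporal-coherence readout uses, distinguishes them sharply.

```python
import numpy as np

def channel_envelopes(synchronous, n_slots=100):
    """Binary activity of the A and B frequency channels across tone slots.
    Alternating: A sounds on even slots, B on odd slots.
    Synchronous: A and B sound together on even slots."""
    slots = np.arange(n_slots)
    a = (slots % 2 == 0).astype(float)
    b = a.copy() if synchronous else (slots % 2 == 1).astype(float)
    return a, b

for label, sync in [("alternating", False), ("synchronous", True)]:
    a, b = channel_envelopes(sync)
    # Time-averaged 'tonotopic' profile: both channels are equally active in
    # both conditions, so counting peaks of average activity cannot tell them apart.
    print(label, "mean activity (A, B):", a.mean(), b.mean())
    # Temporal coherence: the across-time correlation separates the conditions
    # (about -1 when the tones alternate, +1 when they are simultaneous).
    print(label, "A-B correlation:", round(float(np.corrcoef(a, b)[0, 1]), 2))
```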

Whereas single-neuron recording studies in the cochlear nucleus indicate that, in principle, the activity patterns of neurons at this early stage are sufficient to support streaming, evidence from the functional imaging literature suggests that the perception of streaming occurs in or beyond the auditory cortex. Unfortunately, despite the apparent elegance and simplicity of the ABA-stimulus paradigm, the role of different cortical areas in this streaming percept has been difficult to resolve. However, whereas the auditory cortex seems to be important for constructing the stream and the perceptual organization of the auditory scene, activity in regions in the frontal and parietal lobes appears to be correlated with a listener's reports (REFS 54,104–109).

Key to the grouping principles underlying both streaming paradigms and amodal completion is the idea of predictability: the auditory system must generate some sort of prediction from current and previously presented sounds to build a model of what is likely to occur next. Neural activity in early auditory areas seems to represent the prediction of a regular sequence of sounds: if a sound is omitted from a fully predictable sequence of sounds, the auditory cortex will respond to this omission as if the sound had actually been presented. Activity that precedes this omission-related response arises from sources within and beyond the primary auditory cortex and is thought to be the best candidate for a signal that represents a violation of ongoing predictions.

Assigning objects to categories

Neural correlates of categorical perception have been found in both the core and belt regions of the auditory cortex. For example, in one study, monkeys participated in a task in which the correct response depended on whether the frequency of a series of tone bursts was increasing or decreasing independent of the start and end frequencies. This revealed two classes of cells in the core and early belt auditory cortex (specifically, area A1 and the caudomedial belt region of the auditory cortex): the first showed phasic responses that discriminated between the two categories (increasing versus decreasing), whereas the second class showed tonic firing that, at the population level, correlated with the monkey's behavioural response.

Similarly, in another study, monkeys made a ‘same or different’ judgement based on the sequential presentation of two speech sounds (‘dad’ versus ‘bad’) or a series of morphed versions of these sounds (FIG. 4). The behavioural data showed that monkeys perceived these morphed stimuli categorically; that is, despite the fact that the acoustic stimulus varied smoothly, the monkeys consistently assigned the morphs to one of the two categories, with a sharp transition between morphed sounds being perceived as ‘dad’ rather than ‘bad’. Neurons in the belt region of the auditory cortex likewise responded in a categorical fashion. Interestingly, the degree of neural categorization depended on the type of recorded neuron: fast-spiking neurons (putative interneurons) responded more categorically. That is, they showed greater invariance across morphs that were categorized behaviourally to be the same than did slow-spiking neurons (putative pyramidal neurons).


Figure 4

Categorization in the ventral auditory pathway

a | The involvement of two key regions of the ventral auditory pathway, the anterolateral belt (ALB) and the ventral prefrontal cortex (VPFC), in assigning auditory objects to categories has been demonstrated in a series of experiments. b | In the experiment illustrated, monkeys participated in a task that required them to discriminate between a reference stimulus and a test stimulus. The reference sound was ‘dad’, a different sound, ‘bad’, or an acoustic morph of these two sounds. The 0% stimulus is the sound ‘bad’, and the 100% stimulus is the sound ‘dad’. Intermediate morph values have proportional values of the two stimuli; for example, an 80% morph has 80% of the acoustic features of ‘dad’ and 20% of ‘bad’. Data were reported in terms of the proportion of trials in which the monkeys reported that the reference and test stimuli were the same (upper panel). As can be seen, the monkeys’ behavioural reports are categorical. They treat morph stimuli of less than 50% as one category and those of greater than 50% as a second category. Similarly, when ALB neurons are recorded during such categorization, their activity also responds in a categorical fashion (lower panel). That is, ALB neurons respond similarly to all morph stimuli of less than 50% and respond in a different manner to those of greater than 50%. c | In rhesus monkeys, VPFC neurons encode the membership of a particular type of food call in an abstract category. The two categories are calls that transmit information regarding low food quality (a grunt) and calls that transmit information about high-quality food (a harmonic arch or a warble). Population VPFC activity is shown for a baseline condition and in response to a test vocalization. The presentation of the test vocalization (at the time indicated by the position of the dashed line) was preceded by repeated presentations of a different reference vocalization. Also shown are the spectrograms for the different types of vocalization. VPFC activity preferentially codes transitions between food calls that belong to different abstract categories, independently of acoustic differences between the vocalizations (lower panels). By contrast, VPFC neurons do not code transitions between acoustically distinct stimuli that transmit the same information (upper panels). Part b (upper panel) is modified, with permission, from © (2011) The American Physiological Society. Part b (lower panel) is modified, with permission, from © (2012) The Physiological Society. Part c is modified, with permission, from © (2005) MIT Press Journals.

Studies using fMRI indicate that there are categorical representations of speech sounds in both the posterior and anterior auditory cortex. Fewer studies have investigated category selectivity with non-speech stimuli. These studies are important because they allow researchers to investigate more abstract categories that are not based on similarities between stimulus features. For example, category specificity for musical and human-speech sounds is found in the anterior superior temporal cortex. By contrast, no such specificity is seen for songbird or ‘other animal’ vocalizations, although this might be because vocalization-specific clusters are interdigitated among other category-sensitive regions or are simply so small that they cannot be resolved by fMRI. An alternative interpretation is that object recognition might not require segregated, category-specific cortical subregions to represent different classes of objects.

However, another recent study suggests that the anterior areas might not be uniquely specialized for auditory-category information. This study used a heterogeneous set of natural sounds to explore the representation of stimulus categories for non-speech stimuli. The authors carried out a variance decomposition analysis that enabled them to differentiate variability due to low-level stimulus features from variability due to category specificity. Consistent with results from studies of animals, large areas of the human cortex were sensitive to low-level stimulus features. In addition, posterior areas of the auditory cortex (such as the planum temporale) can encode the abstract categories of living sounds and human sounds. Such findings suggest that there might be an increase in information abstraction as the cortical hierarchy ascends from the primary cortex in both anterior and posterior directions. In support of this notion, category representation for pitch-matched stimuli was seen in the anterolateral Heschl's gyrus, the planum temporale and the posterior superior temporal gyrus. Areas showing category specificity and specificity for acoustic information (in this case, pitch contrast) overlapped and included areas of both the lower and higher auditory cortex.

This abstraction of categorization continues beyond the auditory cortex and into the prefrontal cortex regions of the ventral auditory pathway. For example, neurons in the rhesus prefrontal cortex do not differentiate between vocalizations that transmit the same type of information despite the fact that these vocalizations have different acoustic features. That is, these neurons code the ‘meaning’ of vocalizations (FIG. 4).

How does learning shape neural category representation? In one study, gerbils were trained to categorize frequency-modulated tones as ‘upwards’ or ‘downwards’ regardless of the starting frequency, the ending frequency or the rate of the frequency modulation. During the task, epidural evoked potentials were recorded from multiple sites over the auditory cortex. An analysis of these recordings demonstrated that over time, as the gerbils acquired the categorization rule, the neural activity patterns changed. Initially, neural activity reflected the acoustic properties of the frequency-modulated tones. After learning, neural activity reflected the categorical membership of the frequency-modulated tones independently of these properties. This transformation of information representation might be mediated through feedback projections between the prefrontal cortex and auditory cortex that modulate task-relevant information.

Neural computations underlying object recognition are thought to require selectivity for object-specific features, invariance across identity-preserving changes and generalization to enable categorization. Whereas studies looking at hierarchical processing in the auditory system have sought increasing levels of selectivity and, as discussed above, some studies have looked for category-specific neural firing, the question of invariance remains underexplored.

This crucial, but unresolved, question in auditory neuroscience is particularly pertinent to our understanding of how auditory objects are formed: to perform scene analysis, we must be able to generalize across identity-preserving changes. The continuity illusion discussed earlier can be seen as a very basic form of invariance, but the ability to generalize across multiple stimulus dimensions in order to assign a particular acoustic event to the right auditory object is more computationally challenging. This task requires selectivity for certain stimulus parameters, a tolerance for differences in other parameters and ultimately the ability to generalize across features to assign a sound to a more general category, or class, of sounds.

There are two contrasting models of how neurons might represent the identity of an object (FIG. 5). Distributed-coding models postulate that ensembles of neurons represent object identity. By contrast, sparse-coding models suggest that only a small number of neurons are activated by a given stimulus, so that these neurons explicitly represent the to-be-identified object. Although sparse codes are energetically efficient and easy to read out, taken to extremes such a theory would predict the existence of grandmother cells, which would require an intractable number of neurons to represent all possible objects. Experimental evidence from the visual system also suggests that the increasing selectivity that one would expect to see at each hierarchical stage in a sparse-coding model is not observed and that accurate object identification is apparently achieved through a population code.
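The contrast between the two schemes in FIG. 5 can be made concrete with a toy population readout. The sketch below (an illustration of the general idea in Python with NumPy, not a model taken from the literature; the object labels, population size and noise level are arbitrary) builds a sparse code, in which each object has a single dedicated unit, and a distributed code, in which identity is carried only by the graded pattern across many units, and shows that a simple nearest-pattern readout recovers object identity from the distributed code.

```python
import numpy as np

rng = np.random.default_rng(1)
objects = ["banjo", "voice", "bass"]
n_neurons = 20

# Sparse code: one dedicated, strongly responsive neuron per object.
sparse_templates = {obj: 10.0 * np.eye(n_neurons)[i]
                    for i, obj in enumerate(objects)}

# Distributed code: every object drives many neurons at graded rates, so
# identity lives in the overall pattern rather than in any single neuron.
distributed_templates = {obj: rng.uniform(0.0, 5.0, n_neurons) for obj in objects}

def decode(response, templates):
    """Nearest-template (minimum Euclidean distance) readout."""
    return min(templates, key=lambda obj: np.linalg.norm(response - templates[obj]))

# A noisy single-trial response to 'voice' is decodable under either scheme,
# but deleting one neuron is catastrophic only for the sparse code (when that
# neuron happens to be the dedicated 'voice' unit).
trial = distributed_templates["voice"] + rng.normal(0.0, 0.5, n_neurons)
print(decode(trial, distributed_templates))   # expected output: 'voice'
```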


Figure 5

Strategies for coding auditory object identity

Two neural coding strategies might hypothetically underlie how information is represented in a cortical field: distributed coding or sparse coding. a | Information about the nature of an auditory object (in this case the identity of a musical instrument in a situation in which all three instruments play the same note) could be represented by the pattern of activity across the neural ensemble. Here, each sound category elicits activity in many neurons, with any individual neuron potentially increasing its firing rate to multiple sound categories. Nevertheless, each sound category elicits a unique pattern of activity across the network. b | By contrast, in a sparse representation, each neuron in the array is tuned to a single sound category such that each musical instrument elicits activity from only a very small number of neurons.

To formally understand the mechanisms underlying auditory-object formation and recognition, as has been done for the visual system, we need to develop computational models to generate testable hypotheses as to how population activity in higher auditory areas creates explicit, implicit and tolerant representations of auditory objects. However, to date, such models have not been identified for the auditory system, and this remains an important issue in auditory neuroscience.

The role of attention in object perception

Simultaneous grouping principles and their neural correlates, such as object-related negativity, can operate independently of the listener's attentional state. Attention is not required for a person to detect changes in a stimulus feature. For example, oddball paradigms, in which a rare (deviant) sound is interposed into a stream of repeating standard sounds, show that deviance-detection mechanisms operate automatically and do not require a subject to overtly attend to the stimulus. Other studies indicate that the continuity illusion does not require attention. Together, these findings support the idea that the auditory cortex automatically generates and monitors predictions about the current soundscape.

However, whether auditory streaming requires attention is a more controversial matter. Whereas attention is not always required for streams to form, attention can heavily influence a listener's perception, and switching attention ‘resets’ streaming. It seems likely that attention is required to resolve or select representations in an ambiguous auditory scene. Compatible with the concept of a two-stage process is the finding that when listeners are presented with ABA tone sequences, two distinct event-related potential (ERP) components are evoked with different latencies. The first component is thought to be the initial representation of two alternative interpretations of the sound (one stream versus two streams), whereas the later component reflects the listener's decision between these alternatives. In natural listening conditions, when there are almost always multiple competing sources, auditory-scene analysis is likely to be heavily influenced by attention and the behavioural goals of the listener.

Once an auditory scene has been parsed into its component objects, selective attention can operate on these components to facilitate further processing and resolve competition between multiple sources. Attention operates at the level of objects, and even when attention is focused on a low-level stimulus feature (such as the pitch of someone's voice), there is enhanced sensitivity to other features of that source (such as its location). Failures of object formation impair the ability to analyse a sound source, and attention itself influences perception of the auditory scene. Selective attention to a particular object in the visual scene is thought to be essential as the brain has limited resources. As a result of these limited resources, there is a biased competition between objects. As in vision, both bottom-up and top-down cues can direct auditory attention to a particular object, and thus one of the hallmarks of an ‘object-based’ neural representation is that it is modulated by behavioural demands. Indeed, highly skilled listeners have enhanced neural-processing mechanisms for particular object-based listening tasks. For example, regions in the left anterior superior temporal gyrus are modulated by a listener's expertise in perceiving and producing a given sound class: actors have greater neural activation in response to speech compared to music, whereas violinists have the opposite pattern.

Attentional signals are found throughout the auditory cortex. In the early auditory cortex, attention can modify the tuning properties of primary auditory cortex neurons and can increase the magnitude of ERPs and fMRI signals. In later parts of the auditory cortex, such as the posterior auditory cortex, which roughly corresponds to the planum temporale, neural signals reflect the listener's perception of a particular auditory object. For example, when a listener is asked to attend to one of two spectrotemporally overlapping speech signals, the attended signal preferentially modulates neural activity in this region of the auditory cortex. Similarly, in experiments conducted using surface electrodes in human patients, neural responses to irrelevant sounds are suppressed relative to those that are attended.

Attention is not mediated by a simple feedforward network. Instead, attention is mediated by a complex network that has distinct activity patterns for spatial versus non-spatial auditory attention. Differential activity patterns have been found in auditory regions of the superior temporal gyrus, as well as the superior temporal sulcus and the inferior parietal sulcus; these latter regions exhibit more attention-related modulation when listeners are asked to attend to a sound that is embedded within a complex and realistic listening environment. It seems likely that these networks may provide feedback activity to early sensory areas, enabling the selection of activity related to the object of interest.

Synthesis and discussion

We have discussed and reviewed how the auditory system represents the perceptual features and grouping principles that underlie the creation of auditory objects. We have also highlighted several important principles, such as the hierarchical processing of information and the role of the ventral stream in auditory-object processing. However, we believe that two fundamental issues remain to be investigated. First, beyond the ‘classical’ auditory cortex, a network of areas subserves the functions associated with processing auditory objects. For example, neural activity in the prefrontal cortex and hippocampus interacts with auditory cortex activity to process auditory memory and the meaning and emotional content of sounds. We do not fully understand the roles of these brain regions in auditory cognition, or the neural mechanisms that underlie these roles. Second, it is unclear which cortical areas have causal roles in auditory-object processing and perception. Thus, to advance our understanding of how the auditory cortex parses the auditory scene into recognizable objects, researchers must combine techniques that enable perception and neural activity to be studied simultaneously with methods that perturb neural activity and thereby provide causal evidence for the contribution of particular brain areas to defined functions, and must design computational models that generate testable hypotheses.

Box 1 | Analysing the soundscape: simultaneous grouping cues

Identifying an auditory object involves assigning elements of the incoming sensory input to one or more sources. Several of the cues that are used to group auditory stimuli into objects can be classified as ‘simultaneous cues’. We automatically group the elements of a visual scene, such as that shown in panel a of the figure, into distinct objects (in this case, on the basis of the colour of the letters, the proximity and orientation of adjacent letters, the size and letter font). Similarly, in audition, the brain groups together stimuli associated with acoustic cues — such as pitch, harmonicity, timbre, common onset or modulation time and spatial location — that can be quickly derived from a sound's spectral features.

Natural sounds, such as speech, are often harmonic: that is, they have energy at integer multiples of the lowest (or fundamental) frequency. This is illustrated in panel b of the figure, which shows a spectrogram of a human speech sound in which horizontal bands of energy are visible. Importantly, individual harmonics change coherently over time, and harmonic frequencies that change coherently are grouped together. This is shown schematically in panel c: sound elements that change coherently are grouped together such that the red and blue sound elements form two separate auditory objects. Pitch is another important grouping cue that allows a listener to identify and track simultaneous speakers. Panel d of the figure shows a related cue, harmonicity. Here, both a single pure tone and a harmonic series of pure tones (blue) are perceived as a single sound. However, the introduction of a ‘mistuned’ harmonic — that is, a harmonic at a frequency that is not an integer multiple of the fundamental frequency (red) — results in the perception of an additional separate sound. Differences in timbre are used to identify different vowel sounds or different musical instruments even when the instruments are playing the same note.
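The mistuned-harmonic demonstration in panel d is easy to reproduce. The sketch below (illustrative Python/NumPy code with arbitrary values, not the stimuli of any cited experiment) builds a harmonic complex and then shifts one component away from an integer multiple of the fundamental; when played, the in-tune complex is heard as a single sound, whereas the mistuned component tends to pop out as a separate whistle-like tone.

```python
import numpy as np

def complex_tone(f0, n_harmonics=10, mistuned_harmonic=None, mistune_pct=0.0,
                 dur=1.0, fs=44100):
    """Harmonic complex; optionally shift one harmonic away from f0 * k."""
    t = np.arange(int(dur * fs)) / fs
    y = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        f = f0 * k
        if k == mistuned_harmonic:
            f *= 1.0 + mistune_pct / 100.0   # e.g. an 8% upward shift
        y += np.sin(2 * np.pi * f * t)
    return y / n_harmonics

in_tune = complex_tone(200.0)                                         # heard as one sound
mistuned = complex_tone(200.0, mistuned_harmonic=4, mistune_pct=8.0)  # 4th harmonic pops out
```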

Sound components with a common onset time are likely to be perceived as originating from the same object. In natural listening conditions, onset time is one of the more important grouping cues. Spatial location provides relatively weak grouping, but when a listener attends to a particular location, attentional resources can facilitate the distinction between simultaneous speech sounds.

Box 2 | Analysing the soundscape: sequential grouping cues

Auditory stimuli can be grouped into objects using what are known as sequential grouping cues. Sequential grouping cues enable temporal sequences of sounds to be assigned to a common source: panel a of the figure shows a visual analogy in which the sets of letters are grouped into two words because they form a sequence from left to right. As shown in panel b of the figure, these cues have been studied using repeating patterns of pure tones in which the patterns are separated perceptually into two or more streams. Two factors determine most stream segregation: frequency separation (a bigger difference in the frequency of the tones makes it more likely that two streams will be perceived) and speed (if the presentation rate of the tones is increased, a listener is more likely to hear two streams). A hallmark of such streaming is that listeners find it hard to make inter-stream judgements, such as judging the order of two sounds that are in separate streams. Such percepts can be ‘bistable’: at intermediate frequency separations (such as 3–7 semitones), the perception of ‘one stream’ and ‘two streams’ alternates over time. However, with increased listening time, a stable two-stream percept is developed. Panel c illustrates another example of sequential integration that is called ‘amodal completion’ (the continuity illusion). Here, a discontinuous tone is heard as continuous when a noise burst occurs during the gap.
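The continuity illusion in panel c can be illustrated with an equally simple stimulus recipe (again a hedged sketch in Python with NumPy, with arbitrary parameter choices rather than those of any cited study): a tone is interrupted by a gap, and the gap is either left silent or filled with a louder noise burst; with the noise present, most listeners report the tone as continuing through the interruption.

```python
import numpy as np

def continuity_stimulus(fill_gap_with_noise, freq=1000.0, fs=44100,
                        tone_dur=0.4, gap_dur=0.15, noise_level=0.8):
    """Tone - gap - tone; the gap is either silent or filled with noise."""
    t = np.arange(int(tone_dur * fs)) / fs
    tone = 0.3 * np.sin(2 * np.pi * freq * t)
    gap = np.zeros(int(gap_dur * fs))
    if fill_gap_with_noise:
        gap = noise_level * np.random.default_rng(0).standard_normal(len(gap))
    return np.concatenate([tone, gap, tone])

interrupted = continuity_stimulus(False)   # heard as a tone with an audible gap
illusory = continuity_stimulus(True)       # the tone is typically heard as continuous
```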

Acknowledgments

We thank H. Hersh for a critical reading of the manuscript. J.K.B. is supported by a Royal Society Dorothy Hodgkin Research Fellowship and BBSRC grant BB/H016813/1. Y.E.C. is supported by grants from the US National Institute on Deafness and Other Communication Disorders and US National Institutes of Health.

Glossary

Pitch: The attribute of a sound that enables it to be ordered from high to low on a musical scale. The perceived pitch for a periodic sound is determined by its fundamental frequency (F0), usually the lowest frequency component.

Timbre: The quality of a sound that is determined by its spectral or temporal envelope. Timbre allows a listener to differentiate between a violin and a banjo despite the fact that the two instruments may be producing a sound that has the same pitch.

Harmonicity: A harmonic sound contains frequency components at integer multiples of the fundamental frequency (see the definition for ‘pitch’). Many vocalizations and other pitch-evoking sounds have a harmonic structure.

Spectral envelope: The distribution of power across frequency in a sound. For a harmonic sound, this equates to the relative power across harmonics.

Dynamic causal modelling: A computational approach that performs Bayesian model comparisons in order to infer the organizational structure of processing within different brain regions.

Auditory flutter: The sensation produced by a periodic stimulus in which a listener can hear the sound as being intermittent. At higher frequencies, the sound is fused into one with a continuous melodic pitch. The border between being heard as intermittent or continuous is the flicker–fusion limit.

Forward masking: A process by which a sound is obscured by a masker (for example, a noise burst) that precedes the sound.

Categorical perception: The experience of perceiving a stimulus as being the same (that is, invariant) despite the fact that the physical properties of the stimulus have changed smoothly along a specific axis or continuum. A characteristic of categorical perception is that, for a continuously changing stimulus dimension, subjects generalize across changes, with a sharp change in perception from one class to another at the position of the category boundary.

Scene analysis: The process by which the brain organizes and segregates acoustic stimuli into meaningful elements or objects.

Grandmother cells: Hypothetical cells that represent a very specific complex object or concept, such as one's grandmother.

Object-related negativity: An evoked-potential component that is elicited when two concurrently presented sounds are perceived as originating from different sources based on simultaneous grouping cues.
