High-frame-rate real-time imaging of speech production
Real-time magnetic resonance imaging (RT-MRI) involves the rapid and continuous acquisition of MR images of a dynamically evolving physiological process. It is emerging as a powerful tool for noninvasively visualizing complex spatiotemporal dynamics in vivo, with applications such as cardiac cine MRI (movement of cardiac muscles, chambers, and valves to assess heart function), functional MRI (assessment of brain activity during an ongoing task), and flow MRI (tracking blood flow).1,2 Our work seeks to develop and apply RT-MRI methods to understand human speech production,3–5 which involves intricate coordination among the lungs, diaphragm, chest wall, larynx, pharynx, vocal cords, tongue, lips, soft palate (velum), teeth, jaw, and nasal cavity.6 MRI offers unique advantages over competing modalities such as x-ray fluoroscopy, computed tomography, and ultrasound: it is noninvasive and safe, can image arbitrary planes, and can visualize deep soft-tissue structures. However, MRI is notoriously slow because of fundamental physical limitations, which forces a challenging tradeoff among spatial resolution, temporal resolution, and signal-to-noise ratio.
Our first approaches to imaging the upper airway during speech production used short spiral interleaves to rapidly scan k-space (the spatial frequency domain).3–5 Compared with the widely used 2D Fourier transform Cartesian approach, spirals are highly time-efficient and resilient to motion artifacts (errors). We implemented the sequences within a customized real-time imaging environment that allowed for interactive imaging.7 On a modern 1.5T MRI scanner equipped with high-speed gradients, we achieved a time resolution of 78 ms at a spatial resolution of 2.4 mm², and reconstructed at 24 frames/second using a sliding-window technique.
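The sliding-window idea can be illustrated with a minimal sketch. It assumes the 13-interleaf, 6 ms repetition-time figures quoted elsewhere in this article; the window step of 7 interleaves and all function names are hypothetical, chosen only because that step gives a frame rate close to 24 frames/second. It is not the reconstruction code used in the work.

```python
TR_MS = 6.0      # repetition time per spiral interleaf (value quoted in the article)
PER_IMAGE = 13   # interleaves in one Nyquist-sampled image (value quoted in the article)
STEP = 7         # ASSUMPTION: interleaves the window advances per output frame


def sliding_window_frames(n_acquired, per_image=PER_IMAGE, step=STEP):
    """Return (start, stop) interleaf ranges, stop exclusive, one per output frame."""
    return [(s, s + per_image) for s in range(0, n_acquired - per_image + 1, step)]


def temporal_footprint_ms(per_image=PER_IMAGE, tr_ms=TR_MS):
    """Time span of the data that contributes to a single frame."""
    return per_image * tr_ms


def frame_rate_hz(step=STEP, tr_ms=TR_MS):
    """Rate at which frames are produced when the window advances by `step`."""
    return 1000.0 / (step * tr_ms)


if __name__ == "__main__":
    for start, stop in sliding_window_frames(40):
        print(f"frame uses interleaves {start}..{stop - 1}")
    print(f"footprint: {temporal_footprint_ms():.0f} ms, "
          f"rate: {frame_rate_hz():.1f} frames/s")
```

Consecutive frames share interleaves, so the displayed frame rate increases while the temporal footprint of each frame stays at 78 ms.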
We have recently made further advances in the capabilities of RT-MRI.8 In addition to fast spirals, we used constrained reconstruction and advances in upper-airway radiofrequency coil design to improve the speed and quality of RT-MRI. We used a customized eight-channel upper-airway receiver coil that has four elements on either side of the jaw. The coil's design enables high sensitivity over all the important articulators (lips, tongue, epiglottis, velum), thereby greatly increasing the signal-to-noise ratio in these regions in comparison with coils developed for other body parts (such as the neurovascular or head-and-neck coil).
We used constrained reconstruction to improve the native time resolution by reconstructing images from sub-Nyquist-sampled data. As depicted in Figure 1(a), artifact-free image reconstruction requires Nyquist sampling in k-space, which corresponds to a temporal footprint of 78 ms (13 spiral interleaves with a repetition time of 6 ms). Sub-Nyquist sampling shortened the temporal footprint to 12 ms (two interleaves), but led to increased image artifacts, as shown in Figure 1(b). We resolved this issue by exploiting prior knowledge that, in the dynamic image time series, pixel time profiles are sparse under a finite-difference operation along time. We posed the reconstruction as a penalized convex optimization problem that penalizes rapidly varying pixel time profiles (which usually correspond to alias artifacts and noise) subject to consistency with the data acquired from the eight-channel coil. We solved the resulting problem with an iterative nonlinear conjugate gradient algorithm, which yielded images with excellent spatiotemporal fidelity, as shown in Figure 1(c).
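One common way to write a penalized reconstruction of this kind is sketched below. The notation and the specific norms (a squared l2 data-consistency term and an l1 penalty on temporal finite differences) are assumptions made for illustration; the article itself does not give the exact objective.

```latex
\hat{x} \;=\; \arg\min_{x} \; \lVert A(x) - b \rVert_2^2 \;+\; \lambda \, \lVert \nabla_t x \rVert_1
```

Here x is the dynamic image time series, b is the k-space data acquired from the eight-channel coil, A is a forward model combining coil sensitivities with Fourier sampling along the spiral trajectories, \nabla_t is the finite-difference operator along time, and \lambda is a regularization parameter that balances data consistency against temporal sparsity. The first term enforces consistency with the acquired data; the second penalizes rapidly varying pixel time profiles.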

The crispness of the time profiles in Figure 1 shows that our approach can dramatically improve the visualization of rapid articulatory movements, such as those in the production of consonant clusters. The gains in time resolution can also be traded for improvements in other factors, such as slice coverage and/or spatial resolution. For instance, in Figure 2, we show concurrent mid-sagittal and coronal imaging at a native time resolution of 24 ms/frame during the production of the consonant ñ. This capability allows flexibility in modeling complex spatiotemporal patterns by using information from more than one plane at a high time resolution.

The described imaging approach, along with a synchronized noise-cancelled audio acquisition scheme,9 is currently set up as the RT-MRI speech acquisition sequence at our site. We will use it to acquire data for several current and future studies that aim to address open questions in phonetics and phonology, to improve understanding of language acquisition and language disorders, and to inform treatment plans in clinical applications such as cleft lip/palate and oropharyngeal cancer.
This work is supported by the National Institutes of Health under grant NIH/NIDCD R01 DC007124.
University of Southern California