Analog electronic circuits are at the core of an important category of musical devices, including a broad range of sound synthesizers and audio effects. The nonlinear behavior of their passive and active electronic components gives analog musical devices a distinctive timbre and sound quality, making them highly sought after. The development of software that simulates analog musical devices, known as virtual analog modeling, is a significant subfield of audio signal processing. Artificial neural networks, particularly recurrent networks, are a promising technique for virtual analog modeling and have rapidly gained popularity for emulating analog audio effects circuits. While neural approaches have succeeded in accurately modeling distortion circuits, they still require architectural improvements to account for parameter conditioning and low-latency response. Although hybrid solutions can offer advantages, black-box approaches remain advantageous in some contexts. In this article, we explore the application of recent machine learning advances to virtual analog modeling. In particular, we compare State-Space models and Linear Recurrent Units against the more common Long Short-Term Memory networks; the former have shown promise in sequence-to-sequence modeling tasks, offering a notable improvement in signal-history encoding. Our comparative study applies these black-box neural modeling techniques to a variety of audio effects. We evaluate the performance and limitations of these models using multiple metrics, providing insights for future research and development. Our metrics assess the models' ability to accurately replicate energy envelopes and frequency content, with particular focus on transients in the audio signal. To incorporate control parameters into the models, we employ the Feature-wise Linear Modulation method.
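As a concrete illustration, Feature-wise Linear Modulation conditions a network by applying a per-channel affine transform to hidden activations, with the scale and shift produced from the control parameters. The following is a minimal sketch, not the paper's implementation; the single linear conditioning map and all names and dimensions are illustrative assumptions.

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each channel.

    features: (time, channels) hidden activations
    gamma, beta: (channels,) modulation coefficients derived from
    the effect's control parameters (e.g. drive or tone knobs).
    """
    return gamma * features + beta

# Hypothetical conditioning generator: one linear map from the
# control-parameter vector to a (gamma, beta) pair per channel.
rng = np.random.default_rng(0)
n_params, n_channels = 2, 3
W = rng.standard_normal((2 * n_channels, n_params))
b = rng.standard_normal(2 * n_channels)

def condition(params):
    gb = W @ params + b
    return gb[:n_channels], gb[n_channels:]  # gamma, beta

gamma, beta = condition(np.array([0.5, 0.8]))  # e.g. knob settings in [0, 1]
modulated = film(np.ones((4, n_channels)), gamma, beta)
```

In practice the conditioning generator is a small MLP and FiLM is applied at one or more layers inside the recurrent model; the affine form keeps the conditioning cheap enough for real-time use.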
Long Short-Term Memory networks exhibit better accuracy when emulating distortions and equalizers, while the State-Space model, followed by Long Short-Term Memory networks when integrated into an encoder-decoder structure, outperforms the others when emulating saturation and compression. When considering long time-variant characteristics, the State-Space model demonstrates the greatest capability to track signal history. The Long Short-Term Memory and Linear Recurrent Unit networks show a greater tendency to introduce audio artifacts, the Linear Recurrent Unit in particular, making it the least suitable modeling technique.
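The history-tracking ability attributed to the State-Space model above stems from the diagonal linear recurrence that SSM and Linear Recurrent Unit layers share: with stable eigenvalues, the state is an exponentially weighted sum of the entire input history. A minimal sketch follows; the dimensions and parameter values are illustrative assumptions, not taken from the study.

```python
import numpy as np

def linear_recurrence(x, lam, B, C, D):
    """Diagonal linear recurrence at the core of SSM/LRU layers.

    h_t = lam * h_{t-1} + B @ x_t   (element-wise state update)
    y_t = Re(C @ h_t) + D @ x_t

    With |lam| < 1 per entry, h_t accumulates an exponentially
    decaying summary of all past inputs, which is what lets these
    layers encode long signal history.
    """
    h = np.zeros(B.shape[0], dtype=complex)
    ys = []
    for x_t in x:
        h = lam * h + B @ x_t
        ys.append((C @ h).real + D @ x_t)
    return np.stack(ys)

# Toy 1-D example: a single pole at 0.5 gives an exponentially
# decaying impulse response [1, 0.5, 0.25].
x = np.array([[1.0], [0.0], [0.0]])
y = linear_recurrence(x, lam=np.array([0.5 + 0j]),
                      B=np.eye(1), C=np.eye(1), D=np.zeros((1, 1)))
```

The sequential loop is shown for clarity; in trained models this recurrence is typically evaluated with a parallel scan or as a convolution, which is what makes these layers attractive for long audio sequences.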