You press middle C on your MIDI controller, medium velocity. In that instant, the device sends out just three numbers: note on, pitch 60, velocity 90. But what comes out of your headphones is far more than three numbers. A piano note has the crisp attack of the hammer striking the string, the body of string vibration, the decay as the damper falls back, the subtle sympathetic resonance of other strings when the sustain pedal is down, the reverberation of the cabinet and the room. The gulf between three numbers and this entire soundscape is what a sound engine exists to fill.
MIDI’s job is to standardize performance gestures as event messages. It transmits “C4 pressed, velocity 90, piano voice” — not a pre-recorded sound. The MIDI Association’s explanation of General MIDI stops at this layer too: the GM specification defines the minimum configuration and a shared sound set for compatible MIDI devices, so that the same MIDI file plays back more consistently across different hardware, but it makes no guarantees about sound quality. The work of turning events into sound falls to the sound engine. And the way engines have answered that question has changed more than once over the past forty years.
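To make "three numbers" concrete, here is a minimal sketch of how a Note On message is laid out as bytes in MIDI 1.0. The helper names `note_on` and `describe` are made up for illustration and do not come from any library.

```python
NOTE_ON = 0x90  # high nibble 0x9 means Note On; the low nibble carries the channel


def note_on(channel: int, pitch: int, velocity: int) -> bytes:
    """Build a 3-byte Note On message; channel is 0-15, pitch and velocity 0-127."""
    if not 0 <= channel <= 15:
        raise ValueError("channel must be 0-15")
    if not (0 <= pitch <= 127 and 0 <= velocity <= 127):
        raise ValueError("pitch and velocity must be 0-127")
    return bytes([NOTE_ON | channel, pitch, velocity])


def describe(message: bytes) -> dict:
    """Split a 3-byte channel message back into its fields."""
    status, pitch, velocity = message
    return {
        "status": status & 0xF0,
        "channel": status & 0x0F,
        "pitch": pitch,
        "velocity": velocity,
    }


middle_c = note_on(channel=0, pitch=60, velocity=90)
print(middle_c.hex(" "))   # 90 3c 5a
print(describe(middle_c))
```

Nothing in those three bytes says what a piano sounds like, which is the gap the rest of this story is about.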
The earliest answer was: compute it directly.
Analog synthesizers generate waveforms from electrical signals, but keeping circuits stable enough to produce harmonic structures as complex as those of real instruments is hard. In 1983, Yamaha’s DX7 took a purely digital approach called FM synthesis: let one oscillator modulate the frequency of another, and the modulated oscillator generates an entirely new set of harmonics. A few simple sine waves, arranged through modulation relationships and envelope control, could produce the timbres of bells, brass, electric pianos, and plucked basses. The parameter count was astonishingly small, but the tonal variety was vast.
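The idea of one oscillator modulating another fits in a few lines. What follows is a textbook two-operator sketch, not the DX7's actual algorithm or operator layout; the decay rate, the frequency ratio, and the modulation index are arbitrary values picked for illustration.

```python
import math

SAMPLE_RATE = 44100  # assumed output rate


def fm_tone(carrier_hz, ratio, index, seconds, sample_rate=SAMPLE_RATE):
    """Two-operator FM: y(t) = env(t) * sin(2*pi*fc*t + index*env(t)*sin(2*pi*fm*t))."""
    modulator_hz = carrier_hz * ratio
    out = []
    for i in range(round(seconds * sample_rate)):
        t = i / sample_rate
        envelope = math.exp(-3.0 * t)  # illustrative decay shared by loudness and brightness
        modulation = index * envelope * math.sin(2 * math.pi * modulator_hz * t)
        out.append(envelope * math.sin(2 * math.pi * carrier_hz * t + modulation))
    return out


# A non-integer ratio gives inharmonic, bell-like partials; index sets how bright it starts.
bell = fm_tone(carrier_hz=261.63, ratio=3.5, index=5.0, seconds=0.5)
print(len(bell), "samples from three parameters: ratio, index, decay")
```

With `index=0` the function collapses to a plain decaying sine, which is a quick way to hear how much of the timbre comes from the modulation alone.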
FM’s success proved one thing: complex timbres don’t necessarily come from complex sound sources — they can be derived from simple oscillators and modulation relationships. But sounds generated from equations share a characteristic: they are too clean. Real instruments have string noise, breath sounds, mechanical hammer action, random micro-deviations; FM-generated sounds are like precision clockwork, each strike nearly identical. And controlling FM parameters to dial in a specific timbre is deeply counterintuitive. The performer wants a Steinway; the sound designer faces waveforms, envelopes, and modulation indices.
In the years that followed, another approach also tried to trade a small amount of data for realism: wavetable synthesis. Instead of computing waveforms from scratch, it used a short snippet of digitized waveform as raw material, shifting pitch and timbre by varying read speed, direction, and interpolation. Korg’s DW-8000 (1985) and Ensoniq’s ESQ-1 (1986) both took this path. Wavetables got closer to the static timbre of real instruments than pure FM, but they still lacked a sense of instrumental behavior. Press C4 once, press it again — the sound is the same. Pedal down, pedal up — it has no memory.
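A table-lookup oscillator of the kind described here can be sketched as follows. The table contents (three harmonics) and the table size are assumptions for illustration; real instruments stored digitized waveforms rather than computed ones.

```python
import math

TABLE_SIZE = 256
# One stored cycle of a waveform. Here it is built from three harmonics as a stand-in.
TABLE = [
    sum(math.sin(2 * math.pi * h * i / TABLE_SIZE) / h for h in (1, 2, 3))
    for i in range(TABLE_SIZE)
]


def wavetable_tone(table, freq_hz, n_samples, sample_rate=44100):
    """Read the table at a speed set by freq_hz, interpolating between entries."""
    step = freq_hz * len(table) / sample_rate  # table positions advanced per output sample
    position = 0.0
    out = []
    for _ in range(n_samples):
        i = int(position)
        frac = position - i
        a = table[i]
        b = table[(i + 1) % len(table)]  # wrap around at the end of the cycle
        out.append(a + frac * (b - a))   # linear interpolation
        position = (position + step) % len(table)
    return out


first_press = wavetable_tone(TABLE, 261.63, 100)
second_press = wavetable_tone(TABLE, 261.63, 100)
print(first_press == second_press)  # True: the oscillator keeps no memory between notes
```

The final comparison is the point of the paragraph above: the same input always yields the same output, sample for sample.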
In 1987, Roland’s D-50 took a clever middle path.
The hardest part of a real instrument to synthesize is usually concentrated in the attack. The first few tens of milliseconds of a hammer striking a string, the instant friction of a bow biting into a violin string, the breath burst of a wind player’s tonguing: all of these are harder to generate reliably from formulas than the sustained tone. Once the sound enters the sustain phase, the waveform becomes far more regular. Roland’s LA synthesis followed this insight: capture the attack transient with extremely short samples, then use digital synthesis for the sustain portion. The samples stored less than a second of audio, but that fraction of a second captured exactly the information the ear is most sensitive to.
This approach was economical. Recording every note of a piano at every velocity in full was expensive and impractical at the time. But recording only the attack transients fit comfortably on a small ROM chip, and the synthesis engine could calculate the sustain and decay from there. The results were surprisingly good. The D-50’s tones had a breathing quality that pure FM lacked, while retaining the controllability and richness of a synthesizer. It was a combinatorial strategy: hand off the hardest part to a recording, leave the predictable part to equations.
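The division of labor can be sketched as a short recorded attack that crossfades into a computed sustain. This is an illustration of the idea only: the D-50's real partial structure is not described here, the crossfade is an assumption, and the "recording" below is a seeded noise burst standing in for actual PCM data.

```python
import math
import random


def la_style_note(attack_sample, freq_hz, n_samples, crossfade=64, sample_rate=44100):
    """Play a stored attack, then crossfade into a synthesized decaying sine."""
    if crossfade > len(attack_sample):
        raise ValueError("crossfade must fit inside the attack sample")
    fade_start = len(attack_sample) - crossfade
    out = []
    for i in range(n_samples):
        t = i / sample_rate
        synth = math.exp(-2.0 * t) * math.sin(2 * math.pi * freq_hz * t)
        if i < fade_start:
            out.append(attack_sample[i])                         # pure recording
        elif i < len(attack_sample):
            w = (i - fade_start) / crossfade                     # 0 -> 1 across the fade
            out.append((1 - w) * attack_sample[i] + w * synth)   # blend
        else:
            out.append(synth)                                    # pure synthesis
    return out


# Hypothetical stand-in for a sampled hammer transient: about 10 ms of decaying noise.
rng = random.Random(0)
fake_attack = [rng.uniform(-1, 1) * math.exp(-i / 100) for i in range(441)]

note = la_style_note(fake_attack, freq_hz=261.63, n_samples=44100)
print(len(fake_attack), "stored samples drive a", len(note), "sample note")
```

The storage cost is the length of `fake_attack`; everything after the fade is computed, which is why the approach fit on a small ROM.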
Around the same time, the ROMpler (a ROM-based sample player) emerged. Manufacturers burned large sets of PCM waveforms into ROM. This was not full multisampling, but it was enough to cover the common ranges of piano, bass, strings, and brass. The approach gained reach in 1991, when the Roland SC-55 turned General MIDI into a purchasable multitimbral sound module. GM specified 128 instrument programs and a percussion key map; the SC-55 gave MIDI files a predictable playback baseline. The tradeoff was that timbral quality was locked at the GM preset level: listenable, but not convincing next to a real instrument.
By the late 1990s, a question had been hanging in the air for years: if storage and computation costs keep falling, why not just record every note of a real instrument and play it back?
The bottleneck was not the idea; it was memory and bandwidth. Fully recording a piano’s 88 keys, each at a dozen different velocities, with pedal up and pedal down, quickly adds up to thousands of samples. If each sample is cut to a second or two and then looped for the remainder, detail is lost. If everything is recorded in full, the size exceeds what hardware memory of the time could hold.
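The arithmetic is easy to check. The sketch below takes "a dozen" as 12 velocity layers and assumes CD-quality stereo audio (44.1 kHz, 16-bit); the recording lengths of 2 and 20 seconds are illustrative guesses, not figures from any particular library.

```python
KEYS = 88
VELOCITY_LAYERS = 12   # "a dozen" velocities per key
PEDAL_STATES = 2       # pedal up and pedal down

SAMPLE_RATE = 44100    # assumed: CD-quality audio
BYTES_PER_SAMPLE = 2   # assumed: 16-bit
CHANNELS = 2           # assumed: stereo


def library_bytes(seconds_per_recording):
    """Uncompressed size of the whole library for a given recording length."""
    recordings = KEYS * VELOCITY_LAYERS * PEDAL_STATES
    bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS
    return recordings * seconds_per_recording * bytes_per_second


print(KEYS * VELOCITY_LAYERS * PEDAL_STATES)  # 2112
for seconds in (2, 20):
    print(seconds, "s per recording:", round(library_bytes(seconds) / 1e9, 2), "GB")
```

Under these assumptions, looped two-second recordings come to roughly three quarters of a gigabyte, and full twenty-second decays to about seven and a half.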
In 1998, NemeSys’s GigaSampler, later sold by Tascam as GigaStudio, changed that constraint. It did not load all samples into RAM; instead, it streamed them from the hard disk in real time. A library could be several gigabytes, and the system fetched only the segment needed for the note currently playing. Sampling was no longer bound by memory. Native Instruments later developed this path into the Kontakt platform: multiple velocity layers, multiple mic positions, release samples (the sound of the damper touching the string when the key is released), round robin (cycling through different takes of the same note to avoid mechanical repetition), and built-in scripting to control pedal behavior.
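Disk streaming can be sketched as keeping only the head of each recording in memory and pulling the rest in chunks once the note starts. The class below is a hypothetical illustration, not GigaSampler's or Kontakt's code; the preload and chunk sizes are arbitrary, and an in-memory `io.BytesIO` stands in for the file on disk.

```python
import io


class StreamedSample:
    """Hold a small preloaded head in RAM; read the remainder on demand."""

    def __init__(self, source, preload_bytes=4096, chunk_bytes=4096):
        self.source = source
        self.chunk_bytes = chunk_bytes
        self.source.seek(0)
        self.head = self.source.read(preload_bytes)  # lets playback start without waiting

    def play(self):
        """Yield audio data: first the preloaded head, then chunks from the source."""
        yield self.head
        self.source.seek(len(self.head))
        while True:
            chunk = self.source.read(self.chunk_bytes)
            if not chunk:
                break
            yield chunk


# Stand-in for one long recording stored on disk.
recording = bytes(range(256)) * 64
sample = StreamedSample(io.BytesIO(recording), preload_bytes=512, chunk_bytes=2048)

print(len(sample.head), "bytes resident,", len(recording), "bytes in the recording")
print(b"".join(sample.play()) == recording)  # True
```

Memory use now scales with the number of recordings times the preload size, not with the total length of the library.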
Sample-based instruments reached a level of realism previously unattainable. Every C4 you hear is a real Steinway’s C4, recorded; the velocity-90 sample and the velocity-60 sample are different recordings. VI Labs’ Ravenscroft 275 even advertises “100% sample-based, no modeling or synthesis used,” achieving complex behaviors like half-pedaling and repedaling (notes continuing to ring when the pedal is lifted and quickly pressed again) through full sampling plus scripting.
But sampling is not without cost. Every sample captures a sound at one moment, in one state. It does not know what happened one second ago. When you first press C4, the string begins vibrating from rest; when you press C4 again while the string is still faintly vibrating from the previous strike, the sample does not know. Velocity layers are typically four to eight; audible timbral jumps when switching between them are common. With the sustain pedal held, samples enter a loop, and short loops under long sustain expose a repeating periodic fluctuation — it sounds like a machine rewinding itself. The deepest limitation is perhaps this: a sample is a static snapshot of an instrument at one moment, not a continuous process that responds to performance gestures.
In 2006, a small company called Modartt offered a fundamentally different answer.
Pianoteq does not record a piano. It builds a physical model of one: strings, soundboard, hammers, cabinet, air coupling — each described by mathematical equations. When you press C4, the model computes, in real time, how the hammer strikes the string, how the string vibrates, how the vibration travels through the bridge to the soundboard, how the soundboard pushes the air. Every keystroke is newly computed; the state of every note depends on everything that is happening and has happened before.
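A toy example shows what "state" means here. The sketch below uses the Karplus-Strong plucked-string algorithm, which is far simpler than a piano model and is not Modartt's method; it is chosen only because it fits in a few lines. The damping value and the noise excitation are assumptions.

```python
import random


class PluckedString:
    """Karplus-Strong string: a delay line whose contents are the string's state."""

    def __init__(self, freq_hz, sample_rate=44100, damping=0.996):
        self.line = [0.0] * max(2, round(sample_rate / freq_hz))
        self.damping = damping

    def strike(self, velocity, rng):
        """Add energy on top of whatever vibration is already in the string."""
        amplitude = velocity / 127
        self.line = [x + amplitude * rng.uniform(-1.0, 1.0) for x in self.line]

    def render(self, n_samples):
        out = []
        for _ in range(n_samples):
            first = self.line.pop(0)
            # Averaging neighbours acts as a low-pass filter; damping removes energy.
            self.line.append(self.damping * 0.5 * (first + self.line[0]))
            out.append(first)
        return out


# The same strike on a string at rest...
fresh = PluckedString(261.63)
fresh.strike(90, random.Random(1))
a = fresh.render(200)

# ...and on a string that is still ringing from an earlier strike.
ringing = PluckedString(261.63)
ringing.strike(90, random.Random(0))
ringing.render(2000)
ringing.strike(90, random.Random(1))
b = ringing.render(200)

print(a == b)  # the identical gesture gives a different sound because the state differs
```

A sample player given the same two key presses would return the same recording both times.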
The changes that physical modeling brings run deeper than timbre itself. In a sample library, velocity is a switch that selects which sample layer to play. Your force falls into layer A, so recording A is triggered; into layer B, recording B. In a physical model, velocity is a continuous input variable. Pianoteq claims support for distinct dynamic response across more than 127 levels; the practical effect is that you hear no layer-switching steps during crescendos and diminuendos — the entire dynamic transition is smooth. The pedal likewise no longer triggers binary logic of “pedal down, play sample with sustain”: half-pedaling is a continuous computation of how much the damper felt touches the string, repedaling is the natural overlap of the previous note’s tail and a new strike within the same physical state, and sympathetic resonance is the passive response of undamped strings driven by the vibration of struck strings. Behaviors that in the sampling world require layers of samples and scripts to approximate are, in a physical model, simply the results of the same set of equations under different inputs.
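The switch-versus-variable distinction can be shown side by side. Both mappings below are hypothetical: the four layer boundaries and the squared energy curve are illustrative choices, not values from Pianoteq or any sample library.

```python
LAYER_UPPER_BOUNDS = [32, 64, 96, 127]  # hypothetical: four velocity layers


def sample_layer(velocity):
    """Sample-library view: velocity only selects which recording to play."""
    for layer, upper in enumerate(LAYER_UPPER_BOUNDS):
        if velocity <= upper:
            return layer
    raise ValueError("velocity must be 1-127")


def hammer_energy(velocity):
    """Model view: velocity is a continuous input (illustrative squared curve)."""
    return (velocity / 127) ** 2


for v in (63, 64, 65, 66):
    print(v, "-> layer", sample_layer(v), "| energy", round(hammer_energy(v), 4))

print(len({sample_layer(v) for v in range(1, 128)}), "distinct sampled responses")
print(len({hammer_energy(v) for v in range(1, 128)}), "distinct modeled responses")
```

Between velocities 64 and 65 the layer index jumps while the energy value barely moves, which is the step a listener hears in a crescendo.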
Kawai’s SK-EX Rendering represents another choice: sampling for the core timbre, but handing the resonance part to a physical model. Multi-channel 88-key sampling provides the recorded fingerprint of a fine instrument, while resonance algorithms compute, in real time, the interactions among strings, between strings and dampers, and across various parts of the cabinet. This hybrid approach acknowledges what each path does best: the recorded fingerprint of sampling is something physical modeling still struggles to match completely, while the state continuity of physical modeling is something sampling struggles to achieve.
Which is better depends on the use case and your definition of “lifelike.” Sampling excels at reproducing the recorded texture of a particular piano in a particular space; modeling excels at creating an instrument that breathes, responds, and remembers state. The same performer might use a sample library in the studio for the final mix, but choose a physical model for practice and stage, because the latter feels more connected under the fingers.
A sound module just has to produce sound. A digital piano has to solve more problems.
The first is latency. From key press to sound emerging from headphones, every step (key scanning, sensor processing, sound engine computation, DAC conversion) adds delay. Sound On Sound gives a widely accepted rule of thumb in the industry: 5 to 6 milliseconds of latency is acceptable; exceeding 10 milliseconds can cause problems for instruments with sharp attacks like piano, clavichord, and drums. Beyond this threshold, the player subjectively feels that “the piano isn’t following.” Ten milliseconds sounds short, but on a computer running a DAW, Kontakt, and a few effect plugins, a slightly generous buffer setting can easily exceed it.
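The buffer arithmetic behind that last sentence is short. The sketch below counts only one output buffer at an assumed 44.1 kHz sample rate; key scanning, driver overhead, and conversion add to it, so real figures will be higher.

```python
def buffer_latency_ms(buffer_frames, sample_rate):
    """Time needed to fill one audio buffer, in milliseconds."""
    return 1000.0 * buffer_frames / sample_rate


def verdict(ms):
    """Apply the rule of thumb quoted above: up to about 6 ms is fine, over 10 ms is not."""
    if ms <= 6:
        return "acceptable"
    if ms <= 10:
        return "borderline"
    return "noticeable on sharp attacks"


for frames in (64, 128, 256, 512, 1024):
    ms = buffer_latency_ms(frames, 44100)
    print(f"{frames:5d} frames -> {ms:5.2f} ms  {verdict(ms)}")
```

At this sample rate a 256-frame buffer sits just under 6 ms, while 512 frames already passes 11 ms, so a single step in the buffer setting crosses the threshold.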
The second is velocity mapping. When the velocity sensor curve of the keybed does not match the velocity response curve of the sound engine, playing softly produces no sound, and playing harder suddenly bursts out loud with a timbral jump. Good digital pianos and stage keyboards provide adjustable velocity curves, letting the player align the key action with the sonic behavior.
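An adjustable velocity curve can be as simple as a power function applied between the keybed and the engine. The `gamma` parameter and the curve shape below are assumptions for illustration; manufacturers use their own curve families.

```python
def apply_curve(velocity, gamma):
    """Remap a MIDI velocity (0-127). gamma < 1 lifts soft playing, gamma > 1 tames it."""
    if velocity == 0:
        return 0  # velocity 0 conventionally means note off, so leave it alone
    shaped = (velocity / 127) ** gamma
    return max(1, min(127, round(shaped * 127)))


for gamma in (0.5, 1.0, 2.0):
    row = [apply_curve(v, gamma) for v in (10, 40, 90, 127)]
    print(f"gamma={gamma}: 10, 40, 90, 127 -> {row}")
```

A keybed that reports low values for ordinary playing would be paired with a `gamma` below 1, so that soft notes still reach the range where the engine responds.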
The third is polyphony. Your hands are holding four notes, but the sound engine is actually consuming more than four voices. With the sustain pedal down, every note played since the pedal went down keeps ringing; release samples are usually shorter than sustained tones but still occupy polyphony slots; sympathetic resonance requires extra voices to simulate the passive vibration of unstruck strings. A digital piano rated at 256 voices can be just barely enough for an experienced pianist who pedals heavily.
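How voices run out can be sketched with a small pool that steals the oldest voice when full. Oldest-first stealing and the two-voices-per-key cost are assumptions for illustration; real engines use their own priority rules and voice counts.

```python
from collections import OrderedDict


class VoicePool:
    """Fixed number of voices; when full, the oldest sounding voice is cut off."""

    def __init__(self, max_voices):
        self.max_voices = max_voices
        self.active = OrderedDict()  # voice id -> label, oldest first
        self.stolen = 0
        self._next_id = 0

    def start(self, label):
        if len(self.active) >= self.max_voices:
            self.active.popitem(last=False)  # steal the oldest voice
            self.stolen += 1
        self.active[self._next_id] = label
        self._next_id += 1


pool = VoicePool(max_voices=8)
# Pedal held: nothing is released. Assume each key costs a tone voice plus a resonance voice.
for pitch in (48, 55, 60, 64, 67, 72):
    pool.start(f"tone {pitch}")
    pool.start(f"resonance {pitch}")

print(len(pool.active), "voices sounding,", pool.stolen, "cut off early")  # 8 voices sounding, 4 cut off early
```

Six keys were enough to exhaust eight voices here; scale both numbers up and the same thing happens to a 256-voice instrument under a long pedal.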
Put these together and “good piano sound” is about the response of the entire chain, not whether a single C4 sample sounds authentic: keybed feel, sensor speed, the dynamic continuity of the sound generation, the pedal state model, the noise floor of the DAC and amplifier, the frequency response and placement of the speakers. Any one rough link in this chain cancels out the refinement of the others. This is also why high-end stage pianos cost several times more than entry-level digital pianos — it is not just a few extra gigabytes in the sound pack.
For the AI-literate reader, the question naturally turns to deep learning: how much of the story above has it changed?
The answer is narrower — and more informative — than the “everything replaced” narrative. AI has not overturned the old world of sampling, FM, or physical modeling. It is more like a new solver entering the picture, beginning to learn the parts of old techniques that were hardest to tune by hand.
Google Magenta’s DDSP provides a clear example. DDSP stands for Differentiable Digital Signal Processing: it makes classic DSP modules — oscillators, filters, reverb — differentiable, then lets a neural network learn how to control their parameters. The DDSP paper includes a key data point: all models used less than 13 minutes of training audio. This is possible because the DSP structure itself encodes the physical common sense of how sound is produced; the model does not need to learn from scratch “what makes a sound sound like a sound.” It only learns the mapping between control signals and synthesis parameters.
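The core move, optimizing synthesizer parameters by following gradients through the signal path, can be shown at toy scale. The sketch below is not the DDSP library and contains no neural network: it fits three harmonic amplitudes by plain gradient descent, with the gradient written out by hand where a framework would use automatic differentiation. The target signal, the step size, and the iteration count are all illustrative.

```python
import math

N = 64                 # samples in one analysis frame (exactly one fundamental period)
HARMONICS = (1, 2, 3)
# The fixed DSP structure: a bank of harmonically related sine oscillators.
BASIS = [[math.sin(2 * math.pi * k * n / N) for n in range(N)] for k in HARMONICS]


def synthesize(amplitudes):
    """Additive synthesis: weighted sum of the oscillator bank."""
    return [sum(a * row[n] for a, row in zip(amplitudes, BASIS)) for n in range(N)]


def loss_and_gradient(amplitudes, target):
    """Mean squared error and its derivative with respect to each amplitude."""
    error = [y_hat - y for y_hat, y in zip(synthesize(amplitudes), target)]
    loss = sum(e * e for e in error) / N
    gradient = [2.0 / N * sum(e * s for e, s in zip(error, row)) for row in BASIS]
    return loss, gradient


target = synthesize([1.0, 0.5, 0.25])  # stands in for a recorded frame
params = [0.0, 0.0, 0.0]               # the only things being learned
for _ in range(60):
    loss, gradient = loss_and_gradient(params, target)
    params = [p - 0.5 * g for p, g in zip(params, gradient)]

print([round(p, 4) for p in params], "final loss", loss)
```

Because the oscillator bank already knows what a harmonic sound is, the optimizer only has three numbers to find, which is a small-scale version of why so little training audio suffices.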
Following this line of thinking for piano, a 2022 paper at the DAFx conference built the first DDSP-based polyphonic differentiable piano model, synthesizing piano sound conditioned on MIDI input. Its sound quality surpassed the neural baselines of the time, and it used fewer parameters and less training data. But the same paper’s listening test also reported that the physical model it was compared against still scored higher on quality. AI here is narrowing the gap, not closing it.
Differentiable physical modeling is another active direction. ISMIR’s related workshop frames the question more broadly: if you make the physical simulation itself differentiable, you can use automatic differentiation to estimate model parameters, optimize control mappings, and inversely derive an instrument’s physical characteristics from recordings. AI’s role here is a parameter estimation tool and a control layer; what runs underneath is still a physics engine.
AI can also appear in the interface layer above the sound engine. In 2024, Yamaha demonstrated an AI-assisted piano that tracks the player’s rhythm and dynamics and fills in harmony and accompaniment in real time. AI did not reinvent how the piano produces sound; it made the piano itself better at understanding what the performer is doing. This case illustrates that when AI enters a music system, it often changes the interaction layer first, not the sound engine layer directly.
Forty years on, FM chips, ROM waveforms, multi-gigabyte sample libraries, physical models, and differentiable DSP are all still in operation. They correspond to different constraints. A GM sound chip in a maker project encapsulates audio engineering into a few MIDI commands — a module that costs a dozen dollars, just enough for Arduino toys and music programming classes. Large sample libraries in the studio provide the finest recorded detail, at the cost of dozens of gigabytes of storage and a sufficiently fast computer. Physical modeling on stage offers continuous dynamics and pedal response, at the cost of sacrificing the recorded timbral fingerprint of a specific, named instrument. Workstations bundle multiple engines together, giving touring musicians a single hub that does everything, at the cost of the piano section alone being less refined than a dedicated stage piano.
The history of sound synthesis is a tradeoff among a set of constraints: compute power, memory, cost, latency, control continuity, and “whether the instrument you want is a perfect recording or a living thing that can converse with you.” Forty years ago someone decided to compute it with FM, thirty years ago someone decided to grab the attack transient, twenty years ago someone decided to record the entire piano, fifteen years ago someone decided to rewrite the physics equations. From a decade ago to now, someone has been letting neural networks learn the parameters inside those equations. Where the sound comes from keeps changing, but the core question has not: after a note event arrives, where are you going to pull that sound from?