Voice / Speech Recognition (Automatic Speech Recognition, ASR)
Voice capture uses Automatic Speech Recognition (ASR) to convert spoken card details into card field values while keeping the sensitive audio muted to the agent.
When to Use Voice Capture
Voice capture is not a fallback for a broken flow; it is a fully compliant, first-class capture method. Common use cases include:
- Accessibility: Customers who are unable to use a keypad due to disability, impairment, or Accessible Customer requirements.
- Device limitations: Customers using a softphone, headset, or application that does not reliably generate standard DTMF tones.
- Customer preference: Some customers simply prefer to speak their details rather than key them in.
- DTMF failure: If inband DTMF is detected and the agent is informed, the agent can immediately offer to switch the customer to voice capture.
The agent can switch capture mode mid-session without restarting the payment journey.
How It Works
When voice capture mode is active, the call audio is streamed to the Subphonic ASR service. The service transcribes the spoken input and returns the recognised card values to the Transact capture session. The audio stream is automatically muted to the agent during card data entry, which prevents the agent from hearing the spoken digits, but the agent can see a text transcription of the non-PII (Personally Identifiable Information) conversation (e.g. "Please say your card details now"). This keeps the agent informed without exposing sensitive data. The raw audio is never stored.
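The routing described above can be sketched in a few lines of Python. This is only an illustrative model, not the Subphonic or Transact API: the class `VoiceCaptureRouter`, its methods, and the assumption that the ASR service delivers each utterance as a text string are all hypothetical.

```python
class VoiceCaptureRouter:
    """Hypothetical model of how recognised speech is routed during voice capture."""

    def __init__(self):
        self.session_values = {}        # stands in for the Transact capture session
        self.agent_transcript = []      # non-PII text the agent is allowed to see
        self.agent_audio_muted = False  # True while card data is being spoken
        self._active_field = None       # card field currently being captured, if any

    def begin_field(self, name):
        """Start card data entry for one field and mute the agent's audio."""
        self._active_field = name
        self.agent_audio_muted = True

    def end_field(self):
        """Finish card data entry and restore the agent's audio."""
        self._active_field = None
        self.agent_audio_muted = False

    def on_transcription(self, text):
        """Route one ASR result (assumed here to arrive as plain text)."""
        if self._active_field is None:
            # Ordinary conversation: safe to show to the agent.
            self.agent_transcript.append(text)
        else:
            # Card data: keep the digits only and send them to the session,
            # never to the agent's transcript.
            digits = "".join(ch for ch in text if ch.isdigit())
            self.session_values[self._active_field] = digits


router = VoiceCaptureRouter()
router.on_transcription("Please say your card details now")
router.begin_field("card_number")
router.on_transcription("4111 1111 1111 1111")  # well-known test card number
router.end_field()
```

In this sketch the agent's transcript contains only the prompt, while the recognised digits exist solely in `session_values`. Nothing in the model retains audio, mirroring the statement that raw audio is never stored.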
PCI DSS Scope Impact
Voice capture carries the same PCI scope reduction as DTMF keypad capture:
- The audio stream is automatically muted to the agent during card data entry, so the agent cannot hear the spoken card details.
- Recognised digit values are sent securely to the Transact session only; they are never returned to the agent's audio or displayed in raw form to the agent.
- The agent sees a text transcription of the conversation (non-PII only), allowing them to follow along without access to sensitive data.
- The audio stream used for recognition is transient; no audio recording of spoken card details is retained.
- The agent's environment (device, network, workspace) remains out of PCI scope.
Voice capture has the same compliance standing as DTMF capture. Both methods achieve the same SAQ-A reduction. The choice of method is driven by operational need, not compliance requirements.
Switching Between Methods
Agents can switch between DTMF and voice capture at any point during an active capture session from within the Transact UI. The capture session state (fields already completed, validation status) is preserved when switching modes.
If Inband Media Detection triggers an alert, indicating that inband DTMF tones were detected in the audio, the agent will receive an on-screen notification and can immediately offer the customer a switch to voice capture or Pay by Link as an alternative.
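As a rough illustration of the two behaviours above, the following Python sketch models a capture session whose completed fields survive a mode switch, and the options an agent might offer after an inband DTMF alert. All names here (`CaptureMode`, `CaptureSession`, `options_after_inband_alert`) are hypothetical and are not part of the Transact UI or API. The sketch also assumes the alert is raised while the session is in DTMF mode.

```python
from dataclasses import dataclass, field
from enum import Enum


class CaptureMode(Enum):
    DTMF = "dtmf"
    VOICE = "voice"


@dataclass
class CaptureSession:
    """Hypothetical session state: the current mode plus per-field validation status."""
    mode: CaptureMode = CaptureMode.DTMF
    # Field name -> whether validation passed. No card data is held here.
    completed_fields: dict = field(default_factory=dict)

    def switch_mode(self, new_mode):
        # Only the mode changes; completed fields and their status are kept.
        self.mode = new_mode


def options_after_inband_alert(session):
    """Alternatives the agent can offer once inband DTMF has been detected."""
    if session.mode is CaptureMode.DTMF:
        return ["voice capture", "Pay by Link"]
    # Assumption: if the session is already in voice mode, only Pay by Link remains.
    return ["Pay by Link"]


session = CaptureSession()
session.completed_fields["card_number"] = True   # already entered and validated
offered = options_after_inband_alert(session)    # agent sees the alert
session.switch_mode(CaptureMode.VOICE)           # customer chooses voice capture
```

After the switch, `session.completed_fields` still records the card number as complete, so the customer continues with the remaining fields rather than restarting the payment journey.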