CoAS: Composite Audio Steganography Based on Text and Speech Synthesis

Abstract

Digital steganography is the practice of embedding secret information in ordinary-looking data to enable covert communication. With the rapid advancement of generative models, generative steganography has gained renewed vitality. As a key medium on the Internet, audio has also become a focus of steganographic research. However, existing audio steganography methods rely on traditional audio synthesis models, which often suffer from suboptimal synthesis quality. In contrast, diffusion models perform well in audio synthesis tasks, but secure audio steganography methods designed specifically for them are lacking. In addition, existing steganography schemes are generally limited to transmitting only the steganographic object, so other key elements must be negotiated in advance, which limits their practicality. To address these issues, we propose CoAS, a composite audio steganography method based on text and speech synthesis. First, we use a provably secure linguistic steganography method to embed the synchronization side information required for audio steganography. Then, during audio generation, we replace the Gaussian noise in the diffusion model with message-driven sampling. Both theoretical analysis and experimental results validate the security and practicality of our composite steganography method in the real world.

Method

CoAS: The composite audio steganographic scheme based on diffusion models. The Gaussian vector obtained by mapping the secret message is applied to the denoising process of the diffusion model, so that the generated audio becomes the cover of the secret message. At the same time, the shared parameters required by this process are embedded in the text corresponding to the audio and guide the generation process.

Message Mapping: During message embedding, the secret message is first divided into a set of segments $s_i$ according to the payload $p$ (with $p=3$ in the figure), and the corresponding components of the Gaussian vector $\boldsymbol{z}$ are computed from them. The Gaussian vector $\boldsymbol{z}$ is then applied to the generation process of the stego audio. During extraction, the receiver recovers the same Gaussian vector from the stego audio and obtains the same message segments, from which the entire secret message can be reconstructed.
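The mapping between message segments and Gaussian components can be illustrated with a minimal sketch. The text above does not spell out the exact rule, so the sketch assumes one common construction: each $p$-bit segment $s_i$ selects one of $2^p$ equal-probability intervals of the standard normal distribution, and the component $z_i$ is drawn inside that interval through the inverse CDF. Under that assumption, uniformly random message bits (for example, an encrypted message) yield components that follow the standard normal distribution. The function names `bits_to_gaussian` and `gaussian_to_bits` are hypothetical and are not taken from the CoAS implementation.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)

def bits_to_gaussian(bits, p, rng):
    """Map a bit string to Gaussian components, p bits per component (assumed rule)."""
    if len(bits) % p != 0:
        raise ValueError("message length must be a multiple of p")
    n_bins = 2 ** p
    z = []
    for i in range(0, len(bits), p):
        s_i = int(bits[i:i + p], 2)  # segment value in [0, 2^p)
        # Draw u uniformly inside the s_i-th of 2^p equal-probability bins.
        u = (s_i + rng.random()) / n_bins
        # Keep u strictly inside (0, 1) so the inverse CDF is defined.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        z.append(STD_NORMAL.inv_cdf(u))
    return z

def gaussian_to_bits(z, p):
    """Recover the p-bit segments from the Gaussian components."""
    n_bins = 2 ** p
    segments = []
    for z_i in z:
        # The bin index of the CDF value is the original segment value.
        s_i = min(int(STD_NORMAL.cdf(z_i) * n_bins), n_bins - 1)
        segments.append(format(s_i, f"0{p}b"))
    return "".join(segments)

# Round trip with p = 3, as in the figure.
message = "101001110"
z = bits_to_gaussian(message, p=3, rng=random.Random(0))
recovered = gaussian_to_bits(z, p=3)
assert recovered == message
```

The round trip here is exact only because the receiver is handed the same vector `z`. In the real scheme the receiver has to recover $\boldsymbol{z}$ from the stego audio, and any error in that recovery near an interval boundary would flip a segment.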

Experimental Results

The generation process of the CoAS system is divided into two parts: first, the audio text is generated from the initial context using a provably secure linguistic steganography method such as Discop, and then the stego audio is generated using an audio diffusion model. For these two steps, the models we chose are LLaMA2-7B-hf and FastDiff with ProDiff. We generated a batch of cover and stego audio pairs as examples:

Audio Text	Cover Audio	Stego Audio
If you send our brother with us, we will go down and buy food for you.
It is of the first importance that the letter used should be fine in form.
Researchers believe that the new technology could help doctors find out if the treatment is effective.
Now, as all books not primarily intended as picture books consist principally of types composed to form letter press.
Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.
I personally have no difficulty with reproductive health support.
Some of our partners have clearly defined European aspirations.
This response includes operational measures and financial assistance.
Investing early is much more effective than intervening later.
This is time consuming, costly, and perfectly unnecessary.
The outcome is very welcome and reflects successful negotiations.
I have followed your valuable contributions with great interest.
The current investment situation requires resolute and swift action.
No one knows what tomorrow will bring.
We need to try to learn the lessons for the future.
Anyone who expects something entirely different is bound to be disappointed.
I have prepared a speech, but I will disregard it.

All of the audio samples above were transcribed by speech recognition models (for example, Parakeet-TDT and Whisper-large-v3) with a Word Error Rate of $WER=0$.
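For reference, the Word Error Rate is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the reference text and the recognizer's transcript, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the evaluation script used for these results. It assumes a simple normalization (lowercasing and whitespace splitting), and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

text = "No one knows what tomorrow will bring."
assert word_error_rate(text, text) == 0.0
```

A transcript that matches the audio text word for word gives $WER=0$, which is the condition reported above. Practical evaluations usually also strip punctuation before comparison, which this sketch does not do.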

BibTeX

@ARTICLE{11036088,
  author={Li, Yiming and Chen, Kejiang and Wang, Yaofei and Zhang, Xin and Wang, Guanjie and Zhang, Weiming and Yu, Nenghai},
  journal={IEEE Transactions on Information Forensics and Security}, 
  title={CoAS: Composite Audio Steganography Based on Text and Speech Synthesis}, 
  year={2025},
  volume={20},
  number={},
  pages={5978-5991},
  keywords={Steganography;Diffusion models;Security;Speech synthesis;Receivers;Noise reduction;Gaussian noise;Entropy;Reviews;Linguistics;Steganography;provably secure;text;audio;diffusion model},
  doi={10.1109/TIFS.2025.3579581}}