Saving Myself - Argument Bot - Talking Back to Company Stooges

Requirement: build a tool that helps people who struggle to put their thoughts into words talk back to company stooges.

Design:

Input: a real-time audio stream

Analysis: speech-to-text -> LLM processing (TMD-GPT) -> text-to-speech -> real-time retort.
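To make the flow above concrete, here is a minimal sketch of the three stages chained together. The class name `RetortPipeline`, the stage signatures, and the `fake_*` stubs are all assumptions made for illustration; real ASR, LLM, and TTS services would sit behind the same callables.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

# Hypothetical stage signatures (assumptions, not part of the original design).
AsrFn = Callable[[bytes], str]   # audio chunk -> recognized text
LlmFn = Callable[[str], str]     # recognized text -> retort text
TtsFn = Callable[[str], bytes]   # retort text -> synthesized audio


@dataclass
class RetortPipeline:
    asr: AsrFn
    llm: LlmFn
    tts: TtsFn

    def run(self, audio_chunks: Iterable[bytes]) -> Iterator[bytes]:
        """Yield one synthesized reply per non-empty recognized utterance."""
        for chunk in audio_chunks:
            text = self.asr(chunk).strip()
            if not text:  # silence or noise: nothing to answer
                continue
            yield self.tts(self.llm(text))


# Stub stages so the sketch runs without any external service.
def fake_asr(chunk: bytes) -> str:
    return chunk.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"Reply to: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = RetortPipeline(fake_asr, fake_llm, fake_tts)
replies = list(pipeline.run([b"work overtime for free", b"  ", b"be grateful"]))
```

Because `run` is a generator, each reply can be played back as soon as it is ready rather than after the whole stream has been processed, which is what the real-time requirement needs.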

Technologies involved:

Programming languages: JavaScript, Python, Java

Domains: AIGC, ASR, TTS, frontend, WebRTC, backend


Frontend choice

Vue

Real-time audio: WebRTC

Backend choice

Java: strong multithreading; handles streaming and WebRTC

Python: algorithm-related processing

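ASR service

1: Real-time speech transcription service; the frontend splits the audio stream and sends the audio from a JS Worker

2: Real-time speech processing: segmentation with VAD (voice activity detection), etc.

3: ASR service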
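Step 2 above can be illustrated with a simple energy-based VAD. This is only a sketch under assumptions: the frame length of 160 samples (10 ms at an assumed 16 kHz sample rate), the threshold, and the function names are all hypothetical, and a production system would more likely use a trained VAD model.

```python
import math
from typing import List, Tuple


def frame_rms(samples: List[float], frame_len: int) -> List[float]:
    """Root-mean-square energy of each full, non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return energies


def segment_speech(
    samples: List[float],
    frame_len: int = 160,     # assumed: 10 ms at 16 kHz
    threshold: float = 0.1,   # assumed energy threshold
    min_frames: int = 2,      # drop bursts shorter than this many frames
) -> List[Tuple[int, int]]:
    """Return (start_sample, end_sample) spans whose frame energy exceeds threshold."""
    segments = []
    run_start = None
    energies = frame_rms(samples, frame_len)
    for i, energy in enumerate(energies + [0.0]):  # sentinel closes a trailing run
        if energy > threshold and run_start is None:
            run_start = i
        elif energy <= threshold and run_start is not None:
            if i - run_start >= min_frames:
                segments.append((run_start * frame_len, i * frame_len))
            run_start = None
    return segments


# Synthetic signal: silence, a burst of "speech", then silence again.
signal = [0.0] * 320 + [0.5] * 480 + [0.0] * 320
spans = segment_speech(signal)
```

Each returned span can then be cut out of the buffer and sent to the ASR service as one utterance, so the recognizer is not fed long stretches of silence.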


TTS service

Voice cloning service

Desktop app

Electron


TMD-GPT

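1: Data generation: (1) generate with models such as ChatGPT; (2) add manually; (3) scrape data

2: Model selection: choosing a large model

https://arxiv.org/pdf/2305.11206.pdf LIMA: Less Is More for Alignment https://huggingface.co/datasets/GAIR/lima

https://arxiv.org/pdf/2305.15717.pdf The False Promise of Imitating Proprietary LLMs

Based on these papers, our conclusion is: pick a good pretrained model and fine-tune it on a diverse, high-quality dataset. Less is more. Candidate base models: LLaMA, BLOOM, GLM, and similar.

3: Fine-tuning choice: LoRA fine-tuning

LoRA https://arxiv.org/abs/2106.09685 P-tuning-v2 https://github.com/THUDM/P-tuning-v2 Engineering implementation: PEFT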
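To show why LoRA is attractive for step 3, here is a small pure-Python sketch of the idea from the LoRA paper: the pretrained weight W stays frozen and only a low-rank update B·A is trained, scaled by alpha / r. This is not the PEFT API; the helper names (`lora_init`, `lora_forward`, `trainable_params`) and the sizes used are assumptions chosen for illustration.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_init(d_out: int, d_in: int, r: int, seed: int = 0) -> Tuple[Matrix, Matrix]:
    """A is small random, B is zero, so the adapter is a no-op before training."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
    b = [[0.0] * r for _ in range(d_out)]
    return a, b


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: List[float], alpha: float) -> List[float]:
    """Compute W x + (alpha / r) * B (A x) without forming the full d_out x d_in update."""
    r = len(a)
    col = [[v] for v in x]
    base = matmul(w, col)
    delta = matmul(b, matmul(a, col))
    return [base[i][0] + (alpha / r) * delta[i][0] for i in range(len(w))]


def trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


w = [[1.0, 2.0], [3.0, 4.0]]          # frozen "pretrained" weight (toy size)
a, b = lora_init(d_out=2, d_in=2, r=1)
before_training = lora_forward(w, a, b, [1.0, 1.0], alpha=2.0)

# Hypothetical layer size: a 4096 x 4096 projection with rank 8.
full = 4096 * 4096
low_rank = trainable_params(4096, 4096, 8)
```

For the assumed 4096 x 4096 layer, the rank-8 adapter trains 65,536 parameters instead of 16,777,216, which is why a small, high-quality dataset and modest hardware can be enough for this kind of fine-tuning.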


References:

https://arxiv.org/pdf/2304.01097.pdf

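Extending LLaMA's context window to 32K with fewer than 1,000 fine-tuning steps, recent work from Yuandong Tian's team:

Paper: https://arxiv.org/pdf/2306.15595.pdf


ChatLaw, a Chinese legal LLM: github.com/PKU-YuanGroup/ChatLaw

Corresponding paper: https://arxiv.org/pdf/2306.16092.pdf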

https://arxiv.org/pdf/2206.08317.pdf

https://arxiv.org/pdf/2305.15062.pdf

https://arxiv.org/pdf/2106.09685.pdf

https://arxiv.org/abs/2107.13586

https://arxiv.org/pdf/2306.03901.pdf

ASR-related papers:

1 Improving End-to-End Contextual Speech Recognition with Fine-grained Contextual Knowledge Selection

2 Sentiment-Aware Automatic Speech Recognition pre-training for enhanced Speech Emotion Recognition

3 Internal language model estimation through explicit context vector learning for attention-based encoder-decoder ASR 

4 Synthesizing Dysarthric Speech Using Multi-talker TTS for Dysarthric Speech Recognition 

5 Dual-Decoder Transformer For end-to-end Mandarin Chinese Speech Recognition with Pinyin and Character 

6 Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition 

7 Human and Automatic Speech Recognition Performance on German Oral History Interviews 

8 Recent Progress in the CUHK Dysarthric Speech Recognition System 

9 The Effectiveness of Time Stretching for Enhancing Dysarthric Speech for Improved Dysarthric Speech Recognition 

10 Run-and-back stitch search: novel block synchronous decoding for streaming encoder-decoder ASR 

11 Ask2Mask: Guided Data Selection for Masked Speech Modeling 

12 The PCG-AIID System for L3DAS22 Challenge: MIMO and MISO convolutional recurrent Network for Multi Channel Speech Enhancement and Speech Recognition 

13 Non-Autoregressive ASR with Self-Conditioned Folded Encoders 

14 MLP-ASR: Sequence-length agnostic all-MLP architectures for speech recognition 

15 Conversational Speech Recognition By Learning Conversation-level Characteristics 

16 The RoyalFlush System of Speech Recognition for M2MeT Challenge

17 Visual Speech Recognition for Multiple Languages in the Wild 

18 Spanish and English Phoneme Recognition by Training on Simulated Classroom Audio Recordings of Collaborative Learning Environments 

19 Wav2Vec2.0 on the Edge: Performance Evaluation 

20 4-bit Conformer with Native Quantization Aware Training for Speech Recognition 

21 A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings 

22 Chain-based Discriminative Autoencoders for Speech Recognition 

23 CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR 

24 Enhancing Speech Recognition Decoding via Layer Aggregation 

25 Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR 

26 Locality Matters: A Locality-Biased Linear Attention for Automatic Speech Recognition 

27 Shifted Chunk Encoder for Transformer Based Streaming End-to-End ASR

28 Similarity and Content-based Phonetic Self Attention for Speech Recognition

29 Speaker recognition by means of a combination of linear and nonlinear predictive models 

30 STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation 

31 Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings 

32 Transformer-based Streaming ASR with Cumulative Attention 

33 Improving non-autoregressive end-to-end speech recognition with pre-trained acoustic and language models 

34 Variational Auto-Encoder Based Variability Encoding for Dysarthric Speech Recognition

35 Improved far-field speech recognition using Joint Variational Autoencoder

36 E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

37 Self-critical Sequence Training for Automatic Speech Recognition

38 3M: Multi-loss, Multi-path and Multi-level Neural Networks for speech recognition

39 A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition

40 Text-To-Speech Data Augmentation for Low Resource Speech Recognition 

41 Multiple Confidence Gates For Joint Training Of SE And ASR 

TTS-related papers:

1 DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs 

2 The MSXF TTS System for ICASSP 2022 ADD Challenge 

3 MHTTS: Fast multi-head text-to-speech for spontaneous speech with imperfect transcription 

4 Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance

5 ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech 

6 Unsupervised word-level prosody tagging for controllable speech synthesis 

7 FAAG: Fast Adversarial Audio Generation through Interactive Attack Optimisation 

8 Building Synthetic Speaker Profiles in Text-to-Speech Systems 

9 Revisiting Over-Smoothness in Text to Speech 

10 A Multi-Scale Time-Frequency Spectrogram Discriminator for GAN-based Non-Autoregressive TTS 

11 A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis 

12 A3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing 

13 Applying Syntax–Prosody Mapping Hypothesis and Prosodic Well-Formedness Constraints to Neural Sequence-to-Sequence Speech Synthesis 

14 BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

15 Differentiable Duration Modeling for End-to-End Text-to-Speech

16 DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning

17 ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis

18 Improve few-shot voice cloning using multi-modal learning

19 JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech 

20 Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech

21 Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation

22 Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition

23 Variational Auto-Encoder based Mandarin Speech Cloning

24 Vocal effort modeling in neural TTS for improving the intelligibility of synthetic speech in noise

25 vTTS: visual-text to speech 

26 WavThruVec: Latent speech representation as intermediate features for neural speech synthesis 

27 Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss

28 SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech

29 Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech

30 Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis

31 Simple and Effective Unsupervised Speech Synthesis 

32 AILTTS: Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech 

33 VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature 

34 Universal Adaptor: Converting Mel-Spectrograms Between Different Configurations for Speech Synthesis 

35 NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality 

36 Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech

37 Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History

38 Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech 

39 NatiQ: An End-to-end Text-to-Speech System for Arabic 

40 R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS 

41 TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking support vector machine with variational autoencoder 

42 UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder 

43 Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models 

44 Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation 

45 Diffsound: Discrete Diffusion Model for Text-to-sound Generation 

46 LIP: Lightweight Intelligent Preprocessor for meaningful text-to-speech

47 ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

48 DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

49 Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech 

50 SATTS: Speaker Attractor Text to Speech, Learning to Speak by Learning to Separate 

51 BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model 

52 Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS) 

53 Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS) 

54 Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need 

55 Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks 

56 Visualising Model Training via Vowel Space for Text-To-Speech Systems

57 A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis

58 EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models 

59 A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS 

60 Controllable Accented Text-to-Speech Synthesis 

61 Deep Speech Synthesis from Articulatory Representations 

62 AudioGen: Textually Guided Audio Generation