As a general rule, larger datasets lead to better-quality TTS models, and it is generally recommended to use as much high-quality speech data as possible when training TTS models.
To train a good TTS model, It’s important to consider below factors:
Data quality and diversity: this is critical for producing high-quality TTS outputs. The training data should be diverse and representative of the target domain and task and should be well-annotated with timing and pronunciation information.
Model architecture: The choice of model can have a major impact on the quality of the TTS Output. It’s important to choose an architecture that is well-suited to the task at hand and that has been shown to perform well on similar tasks.
Training strategy: including the choice of the loss function, optimizer, and regularization techniques, can also have a significant impact on the performance of the TTS models. We should consider the need for rapid convergence with the need to prevent overfitting
Hyperparameter tuning: model performance can be sensitive to the choice of hyperparameters such as learning rate, number of hidden layers, batch size…
Prosody modeling: Adding prosody modeling into the TTS system can lead to more natural and expressive speech output.
Fine-tuning: Fine-tuning TTS model on specific tasks or domains can lead to improved performance and reduce overfitting.