[ POSTECH Institute of Artificial Intelligence Research Intern ] HiFi-GAN Paper Review / Vocoder Concepts Explained and Summarized

hae-koos 2022. 7. 22. 04:15

Vocoder background: the component that takes a mel-spectrogram and produces a waveform → let's study it

BACKGROUND

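Factors that determine the quality of synthesized speech (from a neural speech synthesis perspective)
1. How well a mel-spectrogram can be generated from the given text
2. How cleanly the speech waveform can be synthesized from the mel-spectrogram → Vocoder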

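💡 Audio → Mel-spectrogram
1. Run an STFT on the audio to analyze it in the frequency domain and extract its frequency components.
2. Keep only the magnitude of each component (the phase is dropped here) and apply a mel filterbank to it.
3. The filterbank maps the linear frequency axis onto the mel scale, which yields the mel-spectrogram.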
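
The three steps above can be sketched with NumPy alone. This is a minimal illustration, not the exact preprocessing of any particular TTS toolkit: the HTK-style mel formula, the Hann window, the reflect padding, and the final log compression are all assumptions made for this example, and the function names are hypothetical. The default sizes follow the settings listed later in this post (80 bands, FFT 1024, hop 256).

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel formula (an assumption; other conventions exist)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    # n_mels + 2 edge points, equally spaced on the mel axis
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def stft_magnitude(audio, n_fft=1024, hop=256):
    """Steps 1 and 2: STFT, then keep only the magnitude."""
    window = np.hanning(n_fft)
    padded = np.pad(audio, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(padded) - n_fft) // hop
    frames = np.stack([padded[t * hop: t * hop + n_fft] * window
                       for t in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (n_fft // 2 + 1, n_frames)

def mel_spectrogram(audio, sr=22050, n_fft=1024, hop=256, n_mels=80):
    """Step 3: project the magnitude onto the mel scale, then log-compress."""
    mag = stft_magnitude(audio, n_fft, hop)
    mel = mel_filterbank(sr, n_fft, n_mels) @ mag
    return np.log(np.clip(mel, 1e-5, None))  # log compression is assumed here
```

One second of 22,050 Hz audio gives an array of shape (80, 87) with these settings. Note that nothing in the result stores the phase, which is why the inverse direction is not trivial.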

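Because the mel-spectrogram is obtained this way:
- If both the magnitude and the phase of the frequency components are known,
  the original audio can be restored through the inverse STFT without loss of information (given a suitable window and hop size).
- A TTS model that learns to predict mel-spectrograms, however,
  yields only magnitude information, so the phase has to be estimated.
  This is the job of the vocoder.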

Griffin-Lim Vocoder

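The traditional method used before neural vocoders
Estimates the original audio from only the STFT magnitude computed from the mel-spectrogram
- Start from an arbitrary phase, then iterate so that the MSE between the STFT magnitude of the estimated audio
  and the STFT magnitude computed from the original mel-spectrogram is minimized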
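
The iteration described above can be sketched as follows. This is a simplified NumPy version under stated assumptions: it starts from a linear STFT magnitude (mapping a mel-spectrogram back to a linear magnitude is left out), it uses a Hann window with 75% overlap, and the helper names `stft`, `istft`, and `griffin_lim` are hypothetical. Library implementations differ in padding and normalization details.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[t * hop: t * hop + n_fft] * window
                       for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # complex, (n_frames, n_fft // 2 + 1)

def istft(spec, n_fft=512, hop=128):
    """Windowed overlap-add inverse, normalized by the summed squared window."""
    window = np.hanning(n_fft)
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    signal = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for t in range(n_frames):
        sl = slice(t * hop, t * hop + n_fft)
        signal[sl] += frames[t] * window
        norm[sl] += window ** 2
    out = np.zeros(length)
    ok = norm > 1e-3  # leave the barely-covered edge samples at zero
    out[ok] = signal[ok] / norm[ok]
    return out

def griffin_lim(target_mag, n_iter=50, n_fft=512, hop=128, seed=0):
    rng = np.random.default_rng(seed)
    # Arbitrary (random) initial phase
    phase = np.exp(2j * np.pi * rng.random(target_mag.shape))
    errors = []
    for _ in range(n_iter):
        audio = istft(target_mag * phase, n_fft, hop)
        est = stft(audio, n_fft, hop)
        # MSE between the estimate's STFT magnitude and the target magnitude
        errors.append(float(np.mean((np.abs(est) - target_mag) ** 2)))
        # Keep the estimated phase, put the target magnitude back
        phase = np.exp(1j * np.angle(est))
    return istft(target_mag * phase, n_fft, hop), errors
```

Calling `griffin_lim(np.abs(stft(x)))` on a test signal `x` returns the reconstructed audio together with the per-iteration error, which is a handy way to watch the MSE shrink as the phase estimate becomes consistent.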

Auto-Regressive Generative Model - WaveNet

Applies PixelCNN, an auto-regressive generative model, to the audio domain
Far better quality, but as an AR model it takes a long time to train and especially to run inference
- WaveNet generates one sample per forward operation

Flow-based Generative Model - Parallel WaveNet, WaveGlow, etc.

With flow-based models, both quality and speed are achieved
- Synthesize the raw waveform in parallel from a noise sequence of the same size
- Parallel WaveNet: an inverse autoregressive flow trained to reduce the KL divergence,
  with a pretrained WaveNet as the teacher
WaveGlow removes the need for teacher-model distillation
However, WaveGlow has too many model parameters (roughly 4x those of WaveNet)

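Generative Adversarial Network - MelGAN, GAN-TTS, HiFi-GAN, VocGAN, etc.

To address the drawbacks of the two model families above, these improve speed and parameter count, possibly at some cost in quality
HiFi-GAN in particular raises quality while also improving speed and parameter count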

LET’S GET STARTED

🌋 HiFi-GAN

Introduction

Most neural speech synthesis models use a two-stage pipeline

predicting a low resolution intermediate representation (mel-spectrograms or linguistic features) from text
synthesizing raw waveform audio from the intermediate representation → Vocoder

The rest of the introduction is already covered in the Background above

Speech audio is made up of sinusoidal signals with various periods, so modeling periodic patterns
should be important for generating realistic speech audio

☑️ A discriminator which consists of small sub-discriminators,
each of which obtains only a specific periodic part of the raw waveform

☑️ Each discriminator checks a different part of the audio; residual blocks are designed to observe patterns of various lengths

Records a higher MOS than WaveNet and WaveGlow
Can invert mel-spectrograms of unseen speakers

HiFi-GAN

1. Overview

One Generator
Two Discriminators : Multi-scale Discriminator & Multi-period Discriminator

→ Two additional losses improve training stability and model performance
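
The two additional losses are the mel-spectrogram loss and the feature matching loss, added on top of the adversarial (LSGAN) loss. Below is a rough NumPy sketch of how the terms combine for a single discriminator. The weights 2 and 45 are the values reported in the paper; the function names and the use of plain arrays in place of network outputs are assumptions for illustration, and the real objective sums these terms over all sub-discriminators.

```python
import numpy as np

LAMBDA_FM = 2.0    # feature matching weight reported in the paper
LAMBDA_MEL = 45.0  # mel-spectrogram loss weight reported in the paper

def discriminator_loss(d_real, d_fake):
    """LSGAN objective: push D(x) toward 1 and D(G(s)) toward 0."""
    return float(np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def generator_adv_loss(d_fake):
    """The generator wants D(G(s)) to be classified as real (1)."""
    return float(np.mean((d_fake - 1.0) ** 2))

def feature_matching_loss(feats_real, feats_fake):
    """L1 distance between discriminator feature maps, averaged per layer."""
    return float(sum(np.mean(np.abs(r - f))
                     for r, f in zip(feats_real, feats_fake)))

def mel_loss(mel_real, mel_fake):
    """L1 distance between the mel-spectrograms of real and generated audio."""
    return float(np.mean(np.abs(mel_real - mel_fake)))

def generator_loss(d_fake, feats_real, feats_fake, mel_real, mel_fake):
    return (generator_adv_loss(d_fake)
            + LAMBDA_FM * feature_matching_loss(feats_real, feats_fake)
            + LAMBDA_MEL * mel_loss(mel_real, mel_fake))
```

The large mel weight means that early in training the generator is mostly pulled toward matching the input condition, while the adversarial and feature matching terms shape the fine detail.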

2. Generator

Fully convolutional network
Takes a mel-spectrogram as input and upsamples it through transposed convolutions
(until the output sequence length matches the temporal resolution of the raw waveform)
- Every transposed convolution is followed by a Multi-Receptive Field Fusion (MRF) module
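
To make "upsample until the length matches the waveform" concrete, here is a small length calculation. The upsampling rates and kernel sizes are assumed from the V1 configuration of the official implementation, and the padding rule `(kernel - rate) // 2` is likewise an assumption taken from that code; the output-length formula is the standard one for a 1-D transposed convolution with dilation 1 and no output padding.

```python
HOP_SIZE = 256                     # hop size of the mel-spectrogram
UPSAMPLE_RATES = [8, 8, 2, 2]      # assumed: V1 configuration
UPSAMPLE_KERNELS = [16, 16, 4, 4]  # assumed: V1 configuration

def conv_transpose1d_out_len(length, kernel, stride, padding):
    # Output length of a 1-D transposed convolution
    # (dilation 1, no output padding)
    return (length - 1) * stride - 2 * padding + kernel

def generator_output_length(n_frames):
    """Follow the sequence length through all transposed convolutions."""
    length = n_frames
    for kernel, rate in zip(UPSAMPLE_KERNELS, UPSAMPLE_RATES):
        length = conv_transpose1d_out_len(length, kernel, rate,
                                          padding=(kernel - rate) // 2)
    return length

print(generator_output_length(87))  # 22272, i.e. 87 frames * 256
```

With these settings each layer multiplies the length by exactly its rate, and the rates multiply to 8 * 8 * 2 * 2 = 256, the hop size, so every mel frame expands to one hop of waveform samples.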

The MRF module observes patterns of various lengths in parallel.
It returns the sum of the outputs of multiple residual blocks.
To form diverse receptive field patterns, each residual block uses a different kernel size and different dilation rates.

The authors note that adjusting the MRF module's parameters trades off efficiency against quality
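
A quick way to see what "diverse receptive fields" means is to compute how many input samples each residual block can see. The settings below are assumed from the V1 configuration (kernel sizes 3, 7, 11; each block alternating a dilated convolution with dilation 1, 3, 5 and an undilated one), and the helper name is hypothetical.

```python
def resblock_receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked stride-1 dilated 1-D convolutions."""
    rf = 1
    for d in dilations:
        # Each convolution widens the field by (kernel_size - 1) * dilation
        rf += (kernel_size - 1) * d
    return rf

KERNEL_SIZES = [3, 7, 11]       # assumed: V1 configuration
DILATIONS = [1, 1, 3, 1, 5, 1]  # assumed: dilated conv followed by an undilated conv

receptive_fields = {k: resblock_receptive_field(k, DILATIONS) for k in KERNEL_SIZES}
print(receptive_fields)  # {3: 25, 7: 73, 11: 121}
```

Since the MRF output is the sum of these blocks, one module looks at the same position through short, medium, and long windows at once. Fewer blocks or smaller kernels make the generator cheaper at the cost of coverage, which is the efficiency and quality trade-off mentioned above.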

3. Discriminator

Identifying long-term dependencies is important for modeling realistic speech audio
- EXAMPLE
  A phoneme can last longer than 100 ms → more than 2,200 adjacent raw waveform samples are correlated (0.1 s × 22 kHz ≈ 2,200 samples)
  → has been addressed by increasing the receptive fields of the generator and discriminator
- Additionally, to capture consecutive patterns and long-term dependencies,
  we use the multi-scale discriminator (MSD) proposed in MelGAN (Kumar et al., 2019),
  which consecutively evaluates audio samples at different levels.
  We conducted simple experiments to show the ability of MPD and MSD to capture periodic patterns,
  and the results can be found in Appendix B.
- Another crucial problem that has yet to be resolved
  → As speech audio consists of sinusoidal signals with various periods,
  the diverse periodic patterns underlying the audio data need to be identified.
  → Multi-period discriminator (MPD) consisting of several sub-discriminators,
  each of which handles a portion of the periodic signals in the input audio

In the MSD, the signal gets smoother each time it is down-sampled:
the low-pass filtering of average pooling reduces the amplitude of the high-frequency band.
In conclusion, we can see that MPD captures more periodic patterns of an input
signal than MSD, and capturing periodic patterns is important to model signals.

So what exactly are the Multi-period Discriminator and the Multi-scale Discriminator?
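
As a rough answer in code: the two discriminators differ in how they prepare the input. In the paper, each MPD sub-discriminator folds the 1-D audio of length T into a 2-D array of height T/p and width p (periods 2, 3, 5, 7, 11), so that each column holds samples spaced p apart; the MSD sub-discriminators look at the raw audio, x2 average-pooled audio, and x4 average-pooled audio. The sketch below covers only this input preparation. The function names are hypothetical, and the non-overlapping pooling is a simplification assumed for illustration.

```python
import numpy as np

MPD_PERIODS = [2, 3, 5, 7, 11]  # periods used in the paper

def mpd_reshape(audio, period):
    """Fold 1-D audio of length T into a 2-D array of shape (ceil(T / p), p)."""
    pad = (-len(audio)) % period
    audio = np.pad(audio, (0, pad), mode="reflect")
    # Column j now holds samples j, j + p, j + 2p, ...
    return audio.reshape(-1, period)

def avg_pool1d(audio, factor):
    """Non-overlapping average pooling: a crude low-pass filter plus down-sampling."""
    usable = len(audio) - len(audio) % factor
    return audio[:usable].reshape(-1, factor).mean(axis=1)

def msd_inputs(audio):
    """Raw, x2 and x4 average-pooled audio for the three MSD sub-discriminators."""
    return [audio, avg_pool1d(audio, 2), avg_pool1d(audio, 4)]
```

In the MPD, 2-D convolutions that are only one sample wide then run down each column independently, so every sub-discriminator sees equally spaced samples without any smoothing. The MSD keeps neighboring samples together but loses high-frequency detail as the pooling factor grows, which matches the smoothing effect noted above.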

Experiments & Results

LJSpeech Dataset
- consists of 13,100 short audio clips of a single speaker → approximately 24 hours
- audio format is 16-bit PCM with a sample rate of 22 kHz
HiFi-GAN was compared against the best publicly available models
- WaveNet (Oord et al., 2018) → Implementation (Yamamoto, 2018)
- WaveGlow official implementation (Valle, 2018b)
- MelGAN official implementation (Kumar, 2019)
To evaluate how well mel-spectrogram inversion generalizes to unseen speakers
- Used the VCTK multi-speaker dataset
  - consists of 44,200 short audio clips uttered by 109 native English speakers
  - total length is approximately 44 hours
  - audio format is 16-bit PCM with a sample rate of 44 kHz → reduced to 22 kHz
  - randomly selected 9 speakers and excluded their clips from the training set
  - then trained MoL WaveNet, WaveGlow, and MelGAN with the same settings
  - all the models were trained until 2.5M steps
Used 80-band mel-spectrograms with FFT, window, and hop sizes of 1024, 1024, and 256

Conclusion

Above all, our model outperforms the best-performing publicly available models
Moreover, it shows a significant improvement in terms of synthesis speed
We envisage that our work will serve as a basis for future speech synthesis studies

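💡 Personal Discussion
- This is how I want to write papers. It is readable, and I think you could follow it even with no background
1. The MRF module creates diverse receptive fields and improves the generator's performance
2. MPD captures periodic patterns, MSD captures consecutive patterns and long-term dependencies, and the ablation study nails it down: a well-structured paper 👏
3. GAN loss + mel-spectrogram loss + feature matching loss: balanced use of losses that each help performance
(the feature matching loss in particular shows up quite often)
4. I meant to get to reproducing but ended up putting a lot of effort into the paper review; I think that was the right call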

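Carrying over the format I organized in Notion as-is is not easy..

Let's get the reproduction done quickly

For a vertical monitor, 34 inches is too big.
One more monitor that can be rotated would be perfect, but I'm happy with what I have now.
Anyway, I do need a master's degree, so stay sharp
Let's write a paper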
