
AI Research Institute (9)

[ POSTECH AI Research Institute Research Intern ] HiFi-GAN Reproducing: Code Implementation

https://hae-koos.tistory.com/79 [ POSTECH AI Research Institute Research Intern ] HiFi-GAN Paper Review / Vocoder Concepts Explained and Summarized
Vocoder background: the model that takes a mel-spectrogram and produces a waveform → let's study it. BACKGROUND: Factors that determine the quality of synthesized speech (from a Neural Speech Synthesis perspective): how well a mel-spectrogram can be generated from the given text.. hae-koos.tistory.com
https://github.com/jik876/hifi-gan [ Official Repository ] HiFi-GAN: Generative Advers..

[ POSTECH AI Research Institute Research Intern ] AUTOVC Code Review and Reproducing

Digging into the code 🔥
make_spect.py : Generate spectrogram data from the wav files. The mel-spectrograms are saved in npy format.
make_metadata.py : Generate speaker embeddings and metadata for training. Creates train.pkl in the ./spmel folder made above; creates metadata.pkl.
main.py : Run the main training script. Converges when the reconstruction loss is around 0.0001.
conversion.ipynb : Download pre-trained AUTOVC model and run it. Loads autovc.ckpt and,..

[ POSTECH AI Research Institute Research Intern ] Voice Conversion Concepts and MaskCycleGAN-VC Paper Review

Thanks for cheering me on, Chief Hwang 😍 [References] http://dsba.korea.ac.kr/seminar/?mod=document&uid=1819 https://wdprogrammer.tistory.com/74

[ POSTECH AI Research Institute Research Intern ] Attention Is All You Need: Paper Review and Explanation

Presented at NIPS in the winter of 2017, this is a paper that anyone studying machine translation will have read. I meant to study it back when I was an undergraduate intern but let it slide; this time I finally sat down, studied it properly, and wrote up my notes. Without compressing the input sentence into a single vector, and without using RNN or CNN structures, the model stacks repeated Encoder and Decoder blocks built only on attention. This is the paper on the Transformer, which reduces the amount of computation this way and also improves performance. The model architecture is shown in the figure above. The first thing that stands out is the encoder on the left and the decoder on the right, each repeated N times. As the figure also shows, the core of the Transformer architecture can be summarized as follows. Positional Encoding Encoder Self-Att..

[ POSTECH AI Research Institute Research Intern ] Acoustic Feature, MelGAN Paper Summary and Code Practice

Acoustic Feature
Audio File Structure
Channel : Mono(1) / Stereo(2)
Length : 60s, 1m, 1h …
Sampling Rate : the number of samples per second (44.1kHz means one second contains 44,100 samples.)
Bit Depth : how finely the amplitude of the sound is quantized (24 bit can represent 2^24 levels.)
Bit Rate : the amount of audio data transferred per second (CHANNEL # x SAMPLING RATE x BIT DEPTH)
import matplotlib.pyplot as plt
from scipy.io import wavfile as wav
fs, data = wav.read('./3sec.wav')
pri..
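To make the definitions above concrete, here is a minimal sketch that computes bit rate and the number of quantization levels, and reads the same fields back from a WAV header. It uses Python's standard `wave` module rather than the `scipy.io.wavfile` call in the excerpt, and the helper names (`bit_rate`, `quantization_levels`, `describe_wav`) and the in-memory demo file are assumptions made for illustration, not code from the post.

```python
import io
import wave


def bit_rate(channels: int, sampling_rate: int, bit_depth: int) -> int:
    """Uncompressed PCM bit rate in bits per second."""
    return channels * sampling_rate * bit_depth


def quantization_levels(bit_depth: int) -> int:
    """Number of distinct amplitude levels a given bit depth can represent."""
    return 2 ** bit_depth


def describe_wav(fileobj) -> dict:
    """Read channel count, sampling rate, bit depth and length from a WAV file."""
    with wave.open(fileobj, "rb") as wf:
        channels = wf.getnchannels()
        sampling_rate = wf.getframerate()
        bit_depth = wf.getsampwidth() * 8  # sample width is reported in bytes
        n_frames = wf.getnframes()
    return {
        "channels": channels,
        "sampling_rate": sampling_rate,
        "bit_depth": bit_depth,
        "length_sec": n_frames / sampling_rate,
        "bit_rate": bit_rate(channels, sampling_rate, bit_depth),
    }


# Hypothetical demo input: one second of 16-bit mono silence at 8 kHz,
# written to memory so the sketch does not depend on a file on disk.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(8000)
    wf.writeframes(b"\x00\x00" * 8000)
buf.seek(0)

info = describe_wav(buf)
print(info)
print(bit_rate(2, 44100, 24), quantization_levels(24))
```

For a stereo, 44.1 kHz, 24-bit file the formula gives 2 x 44,100 x 24 = 2,116,800 bits per second, which is why uncompressed audio gets large quickly.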

[ POSTECH AI Research Institute Research Intern ] Shake Detection in Dashcam Video

2. CNN Feature Map + MSE
The data I currently have is dashcam footage of many kinds of collision scenes, varying in day or night, vehicle type, collision severity, collision situation, and so on. Given this, most of my supervising researcher's approaches also used MSE, treating the moment when the difference between frame t and frame t+1 grows large as a collision. They differed only in how the preprocessing was done. The best-performing algorithm was the one using Canny Edge Detection, and I decided to try applying the same idea to feature maps obtained from a CNN. Starting with the problem definition!
Problem definition: What is the visual basis for judging 'shaking'? Looking at the difference from the previous frame will not change. However, how many frames to use as the basis..
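The frame-difference idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation from the post: it works on raw grayscale frames instead of Canny edges or CNN feature maps, the function names are made up, and the "ratio times the median MSE" threshold is a hypothetical decision rule that the excerpt does not specify.

```python
import numpy as np


def frame_mse(frames: np.ndarray) -> np.ndarray:
    """MSE between frame t and frame t+1 for every t.

    frames has shape (T, H, W); the result has shape (T - 1,).
    """
    frames = frames.astype(np.float64)
    diffs = frames[1:] - frames[:-1]
    return (diffs ** 2).mean(axis=(1, 2))


def detect_jumps(mse: np.ndarray, ratio: float = 5.0, eps: float = 1e-12) -> np.ndarray:
    """Indices t where MSE(t, t+1) is much larger than the typical value.

    The threshold (ratio * median) is an assumption made for this sketch.
    """
    threshold = ratio * (np.median(mse) + eps)
    return np.flatnonzero(mse > threshold)


# Synthetic stand-in for video: brightness drifts slowly from frame to frame,
# then jumps abruptly between frame 5 and frame 6.
T, H, W = 10, 8, 8
levels = 0.01 * np.arange(T, dtype=np.float64)
levels[6:] += 1.0
frames = np.broadcast_to(levels[:, None, None], (T, H, W)).copy()

mse = frame_mse(frames)
jumps = detect_jumps(mse)
print(mse)
print(jumps)
```

The same two functions would apply unchanged to any per-frame array of shape (H, W), so the preprocessing step (edge map, feature map channel, and so on) can be swapped in before calling `frame_mse`.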
