
Graduate School of AI (11)

[ KT AI Master's Program ] POSTECH Graduate School of AI: Application Documents / Written Exam / Advisor Selection

https://hae-koos.tistory.com/84 [ KT AI Master's Program ] Working-level Interview / Executive Interview
https://hae-koos.tistory.com/83 [ KT AI Master's Program ] Aptitude and Personality Test / Coding Test
https://hae-koos.tistory.com/82 [ KT AI Master's Program ] Recruitment Notice / Information Session / Document Screening
While I was in the middle of my internship in the research department, POSTECH..

Application deadline: August 23 -> Result: August 31
Coding test & aptitude test: September 3 -> Result: morning of September 6
Working-level interview: September 7 -> Result: afternoon of September 14
Executive interview: September 19 -> Result: morning of September 22

With the steps above, the KT selection process wraps up and things move on to the graduate school admission process. POSTECH..

[ POSTECH Institute of Artificial Intelligence Research Intern ] Reproducing HiFi-GAN: Code Implementation

https://hae-koos.tistory.com/79 [ POSTECH Institute of Artificial Intelligence Research Intern ] HiFi-GAN Paper Review / Vocoder Concepts Explained and Summarized
https://github.com/jik876/hifi-gan [ Official Repository ] HiFi-GAN: Generative Advers..

[ POSTECH Institute of Artificial Intelligence Research Intern ] HiFi-GAN Paper Review / Vocoder Concepts Explained and Summarized

Vocoder background: the thing that takes a mel-spectrogram in and produces a waveform → let's study it

BACKGROUND
Factors that determine the quality of synthesized speech (from a neural speech synthesis perspective):
How well a mel-spectrogram can be generated from the given text
How clearly the speech waveform can be synthesized from the mel-spectrogram → Vocoder

💡 Audio → Mel-spectrogram
1. Run an STFT to analyze the audio in the frequency domain and extract its frequency-component features.
2. Take the magnitude of the result and apply a mel filterbank to it.
3. Convert this to the mel scale to obtain the mel-spectrogram.
Through the process above, the mel-spectrogram ..
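The three steps listed above (STFT, magnitude, mel filterbank) can be sketched in plain Python. This is only a minimal illustration, not the code from the linked post or from any library: the function names, the HTK-style mel formula, the Hann window, and the tiny sizes (`n_fft=64`, `hop=32`, `n_mels=8`) are all assumptions chosen so the example runs quickly with the standard library alone.

```python
import cmath
import math


def hz_to_mel(hz):
    # HTK-style mel formula (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def stft_magnitude(signal, n_fft, hop):
    """Steps 1 and 2: windowed DFT per frame, keeping only the magnitude."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        bins = []
        for k in range(n_fft // 2 + 1):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft)
            )
            bins.append(abs(acc))
        frames.append(bins)
    return frames


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    hz_points = [mel_to_hz(i * top / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, center, hi = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (center - lo)
            falling = (hi - f) / (hi - center)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def mel_spectrogram(signal, sample_rate, n_fft=64, hop=32, n_mels=8):
    """Step 3: project each magnitude frame onto the mel filterbank."""
    mags = stft_magnitude(signal, n_fft, hop)
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    return [[sum(w * x for w, x in zip(row, frame)) for row in bank] for frame in mags]


# Toy input: a 1 kHz sine sampled at 8 kHz (hypothetical test signal).
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * n / sample_rate) for n in range(256)]
mel = mel_spectrogram(tone, sample_rate)
print(len(mel), len(mel[0]))  # frames x mel bands
```

In practice one would likely use an FFT-based library rather than this naive DFT, and usually take a log of the result; the sketch only shows where each of the three steps sits.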

[ POSTECH Institute of Artificial Intelligence Research Intern ] AUTOVC Code Review and Reproducing

Digging into the code 🔥
make_spect.py : Generate spectrogram data from the wav files. The mel-spectrograms are written out as npy files.
make_metadata.py : Generate speaker embeddings and metadata for training. Creates train.pkl in the ./spmel folder made above, and creates metadata.pkl.
main.py : Run the main training script. Converges when the reconstruction loss is around 0.0001.
conversion.ipynb : Download pre-trained AUTOVC model and run it. Loads autovc.ckpt and,..

[ POSTECH Institute of Artificial Intelligence Research Intern ] Voice Conversion Concepts and MaskCycleGAN-VC Paper Review

Thanks for the support, Chief Hwang
[References]
http://dsba.korea.ac.kr/seminar/?mod=document&uid=1819
https://wdprogrammer.tistory.com/74

[ POSTECH Institute of Artificial Intelligence Research Intern ] Attention Is All You Need: Paper Review and Explanation

This paper came out in the winter of 2017 and was published at NIPS; anyone studying machine translation will have gone through it. I meant to study it back when I was an undergraduate intern but let it slide, and I have finally sat down, studied it properly, and written it up. Without compressing the input sentence into a single vector, and without using RNN or CNN structures, it stacks Encoders and Decoders built solely on Attention. The result is the Transformer, which reduces computation and also improves performance. The model architecture is shown in the figure above. The first thing that stands out is the encoder and decoder structures on the left and right, each repeated N times. As the figure also shows, the core of the Transformer structure can be summarized as follows. Positional Encoding Encoder Self-Att..

[ POSTECH Institute of Artificial Intelligence Research Intern ] Acoustic Feature, MelGAN Paper Summary and Code Practice

Acoustic Feature
Audio File Structure
Channel : Mono(1) / Stereo(2)
Length : 60s, 1m, 1h …
Sampling Rate : number of samples per second (44.1kHz - one second contains 44,100 samples.)
Bit Depth : how finely the sound intensity is subdivided (24 bit - 2^24 distinct levels can be represented.)
Bit Rate : amount of audio data transmitted per second (CHANNEL # x SAMPLING RATE x BIT DEPTH)

import matplotlib.pyplot as plt
from scipy.io import wavfile as wav
fs, data = wav.read('./3sec.wav')
pri..
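As a quick worked example of the bit-rate formula above (CHANNEL # x SAMPLING RATE x BIT DEPTH), here is a small sketch. The stereo setting and the 3-second duration are assumptions picked for illustration; they are not read from the actual `3sec.wav` file, and the numbers apply to uncompressed PCM only.

```python
def bit_rate(channels, sampling_rate_hz, bit_depth):
    """Bits per second of uncompressed PCM audio."""
    return channels * sampling_rate_hz * bit_depth


# Assumed settings: stereo, 44.1 kHz, 24-bit.
stereo_bps = bit_rate(channels=2, sampling_rate_hz=44_100, bit_depth=24)
print(stereo_bps)  # 2116800 bits per second

# Number of distinct amplitude levels at 24-bit depth.
levels = 2 ** 24
print(levels)  # 16777216

# Raw sample data for a hypothetical 3-second clip, in bytes.
duration_s = 3
size_bytes = stereo_bps * duration_s // 8
print(size_bytes)  # 793800
```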

[ POSTECH Institute of Artificial Intelligence Research Intern ] Detecting Shaking in Dashcam Footage

2. CNN Feature Map + MSE
The data I currently have is dashcam footage of many kinds of collision scenes, varying in day/night, vehicle type, collision severity, collision circumstances, and so on. Accordingly, most of my mentor researcher's approaches also used MSE, treating the moment when the difference between frame t and frame t+1 grows large as a collision. They differed only in how the preprocessing was done. The best-performing algorithm was the one using Canny Edge Detection, and I wondered what would happen if I applied the same idea to feature maps obtained from a CNN, so I decided to try it. Starting with the problem definition!

Problem definition
What is the image-based evidence for judging 'shaking'?
Looking at the difference from the previous frame will still be necessary. However, how many frames to use as the basis..
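The frame-difference idea described above (compute the MSE between frame t and frame t+1 and flag the moment it grows large) can be sketched as follows. This is a minimal illustration on toy data, not the post's actual pipeline: the function names, the 2x2 grayscale frames, and the threshold of 100 are all hypothetical, and the Canny or CNN feature-map preprocessing mentioned in the text is left out.

```python
def frame_mse(frame_a, frame_b):
    """Mean squared error between two equally sized 2D grayscale frames."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


def detect_jumps(frames, threshold):
    """Return indices t where MSE(frame t, frame t+1) exceeds the threshold."""
    scores = [frame_mse(frames[t], frames[t + 1]) for t in range(len(frames) - 1)]
    flagged = [t for t, score in enumerate(scores) if score > threshold]
    return flagged, scores


# Toy 2x2 frames: a sudden brightness jump between frame 1 and frame 2.
frames = [
    [[10, 10], [10, 10]],
    [[11, 10], [10, 10]],
    [[60, 60], [60, 60]],
    [[61, 60], [60, 60]],
]
flagged, scores = detect_jumps(frames, threshold=100.0)
print(flagged)  # [1]
print(scores)   # [0.25, 2475.25, 0.25]
```

To apply the same comparison after preprocessing, the raw frames would simply be replaced by edge maps or CNN feature maps before calling `detect_jumps`; how to choose the threshold and how many frames to compare are exactly the open questions the post raises.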
