1. Preparing the Data
GPT-2 can already generate decent text on its own. But if you want it to do better in a specific context, you need to fine-tune it on your own data. In my case, since I want to generate song lyrics, I will use the following Kaggle dataset (https://www.kaggle.com/neisse/scrapped-lyrics-from-6-genres), which contains a total of 12,500 pop and rock song lyrics, all in English.
Data preview: artists-data.csv
Artist,Songs,Popularity,Link,Genre,Genres
10000 Maniacs,110,0.3,/10000-maniacs/,Rock,Rock; Pop; Electronica; Dance; J-Pop/J-Rock; Gospel/Religioso; Infantil; Emocore
12 Stones,75,0.3,/12-stones/,Rock,Rock; Gospel/Religioso; Hard Rock; Grunge; Rock Alternativo; Hardcore; Punk Rock; Chillout; Electronica; Heavy Metal; Metal; World Music; Axé; Emocore
311,196,0.5,/311/,Rock,Rock; Surf Music; Reggae; Ska; Pop/Rock; Rock Alternativo; Hardcore
4 Non Blondes,15,7.5,/4-non-blondes/,Rock,Rock; Pop/Rock; Rock Alternativo; Grunge; Blues; Pop; Soft Rock; Power-Pop; Piano Rock; Indie; Chillout
A Cruz Está Vazia,13,0,/a-cruz-esta-vazia/,Rock,Rock
Aborto Elétrico,36,0.1,/aborto-eletrico/,Rock,Rock; Punk Rock; Pós-Punk; Post-Rock
Abril,36,0.1,/abril/,Rock,Rock; Emocore; Hardcore; Pop/Rock; Rock Alternativo; Romantico; Hard Rock; Blues; World Music
Abuse,13,0,/abuse/,Rock,Rock; Hardcore
AC/DC,192,10.8,/ac-dc/,Rock,Rock; Heavy Metal; Classic Rock; Hard Rock; Clássico; Metal; Punk Rock; Blues; Black Music; Rockabilly; Psicodelia; Funk Carioca; Rock Alternativo; Trilha Sonora; New Age; Hip Hop; New Wave; Sertanejo; Post-Rock; Pop/Rock; MPB; Electronica; Grunge; Progressivo; Pop/Punk; Funk; Forró
ACEIA,0,0,/aceia/,Rock,Rock
Acid Tree,5,0,/acid-tree/,Rock,Rock; Heavy Metal; Metal
Adam Lambert,110,1.4,/adam-lambert/,Pop,Pop; Pop/Rock; Rock; Romantico; Dance; Electronica; Emocore; Power-Pop; Axé; Gótico; R&B; Punk Rock; Pop/Punk; Black Music; Rock Alternativo; World Music; J-Pop/J-Rock; Gospel/Religioso; Hip Hop; K-Pop/K-Rock; Piano Rock; Heavy Metal; Velha Guarda; Soul Music; Hard Rock; Country; Soft Rock; Tecnopop; House; Trilha Sonora; Blues
Adrian Suirady,7,0,/adrian-suirady/,Rock,Rock; Gótico
Aerosmith,249,16.5,/aerosmith/,Rock,Rock; Hard Rock; Heavy Metal; Romantico; Pop/Rock; Classic Rock; Rock Alternativo; Blues; Metal; Chillout; Piano Rock; Funk; Gótico; Forró; Jovem Guarda; Hip Hop
Aliados,75,0.8,/aliados/,Rock,Rock; Pop/Rock; Rock Alternativo; Surf Music; Hardcore; Pop/Punk; Blues; R&B; Punk Rock; Axé
Alice Cooper,310,1.2,/alice-cooper/,Rock,Rock; Hard Rock; Heavy Metal; Punk Rock; Classic Rock; Grunge; Trilha Sonora; Gótico
Alter Bridge,74,1.4,/alter-bridge/,Rock,Rock; Hard Rock; Rock Alternativo; Heavy Metal; Grunge; Romantico; Rap; Metal; Hardcore
Amy Lee,33,0.5,/amy-lee/,Rock,Rock; Gótico; Hard Rock; Rock Alternativo; Heavy Metal; Piano Rock; Romantico; Metal; Indie; Classic Rock; New Age; Funk; Electronica; Industrial; Post-Rock; Psicodelia; Funk Carioca; Infantil; Pós-Punk; Dance; Pop; Clássico; Axé; Trilha Sonora
Anberlin,98,0.1,/anberlin/,Rock,Rock; Rock Alternativo; Hardcore; Emocore; Gospel/Religioso
Andi Deris,44,0,/andi-deris/,Rock,Rock; Hard Rock; Heavy Metal
Andrew W.K.,31,0,/andrew-w-k/,Rock,Rock
Andy (Brasil),7,0,/andy-brasil/,Rock,Rock
Angra,124,2.2,/angra/,Rock,Rock; Heavy Metal; Hard Rock; Progressivo; Metal; Black Music; Piano Rock; Post-Rock; Romantico; Psicodelia; Hardcore; Clássico; Forró; Pagode
Arthur Brown,2,0,/arthur-brown/,Rock,Rock
Asking Alexandria,77,1,/asking-alexandria/,Rock,Rock; Hard Rock; Hardcore; Heavy Metal; Emocore; Metal; Rock Alternativo; K-Pop/K-Rock; Classic Rock; Samba; Tecnopop; Grunge; Reggae; Chillout; World Music; Pop/Rock; Black Music; Gótico; Punk Rock; New Age
Autoramas,67,0.1,/autoramas/,Rock,Rock; Pop/Rock; Rock Alternativo; Progressivo; Indie; Punk Rock; Hardcore; Surf Music; Electronica; Funk; Pagode; Ska; R&B; Samba; New Age; MPB; Axé; Funk Carioca; Emocore; Grunge
Avante,21,0,/avante/,Rock,Rock
Data preview: lyrics-data.csv
ALink,SName,SLink,Lyric,Idiom
/10000-maniacs/,More Than This,/10000-maniacs/more-than-this.html,I could feel at the time. There was no way of knowing. Fallen leaves in the night. Who can say where they're blowing. As free as the wind. Hopefully learning. Why the sea on the tide. Has no way of turning. More than this. You know there's nothing. More than this. Tell me one thing. More than this. You know there's nothing. It was fun for a while. There was no way of knowing. Like a dream in the night. Who can say where we're going. No care in the world. Maybe I'm learning. Why the sea on the tide. Has no way of turning. More than this. You know there's nothing. More than this. Tell me one thing. More than this. You know there's nothing. More than this. You know there's nothing. More than this. Tell me one thing. More than this. There's nothing.,ENGLISH
/10000-maniacs/,Because The Night,/10000-maniacs/because-the-night.html,"Take me now, baby, here as I am. Hold me close, and try and understand. Desire is hunger is the fire I breathe. Love is a banquet on which we feed. Come on now, try and understand. The way I feel under your command. Take my hand, as the sun descends. They can't hurt you now can't hurt you now, can't hurt you now. Because the night belongs to lovers. Because the night belongs to us. Because the night belongs to lovers. Cause the night belongs to us. Have I a doubt, baby, when I'm alone. Love is a ring a telephone. Love is an angel, disguised as lust. Here in our bed 'til the morning comes. Come on now, try and understand. The way I feel under your command. Take my hand, as the sun descends. They can't hurt you now, can't hurt you now, can't hurt you now. Because the night belongs to lovers. Because the night belongs to us. Because the night belongs to lovers. Because the night belongs to us. With love we sleep,. with doubt the vicious circle turns, and burns. Without you, oh I cannot live,. forgive the yearning burning. I believe it's time to heal to feel,. so take me now, take me now, take me now. Because the night belongs to lovers. Because the night belongs to us. Because the night belongs to lovers. Because the night belongs to us",ENGLISH
Let's start by importing the necessary libraries and preparing the data. I recommend using Google Colab for this project, since access to a GPU will make everything much faster.
```python
import os
import csv
import random

import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer, AdamW, get_linear_schedule_with_warmup
from tqdm import tqdm, trange

### Prepare data
lyrics = pd.read_csv('lyrics-data.csv')
lyrics = lyrics[lyrics['Idiom'] == 'ENGLISH']

# Only keep popular artists, with genre Rock/Pop and popularity high enough
artists = pd.read_csv('artists-data.csv')
artists = artists[(artists['Genre'].isin(['Rock'])) & (artists['Popularity'] > 5)]

df = lyrics.merge(artists[['Artist', 'Genre', 'Link']],
                  left_on='ALink', right_on='Link', how='inner')
df = df.drop(columns=['ALink', 'SLink', 'Idiom', 'Link'])

# Drop the songs with lyrics that are too long (above 1024 tokens, GPT-2 does not work)
df = df[df['Lyric'].apply(lambda x: len(x.split(' ')) < 350)]

# Create a very small test set to compare generated text with the reality
test_set = df.sample(n=200)
df = df.loc[~df.index.isin(test_set.index)]

# Reset the indexes
test_set = test_set.reset_index()
df = df.reset_index()

# For the test set only, keep the last 20 words in a new column, then remove them from the original column
test_set['True_end_lyrics'] = test_set['Lyric'].str.split().str[-20:].apply(' '.join)
test_set['Lyric'] = test_set['Lyric'].str.split().str[:-20].apply(' '.join)
```
As the last few lines of the code show, I created a small test set from which I removed the last 20 words of every song. This will let me compare the generated text with the real endings and see how well the model performs.
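To see what this last-N-words split does, here is the same pandas pattern applied to a one-row toy Series (cutting off 2 words instead of 20, purely for illustration):

```python
import pandas as pd

s = pd.Series(["hold me close and try and understand"])

# The last 2 words go into one column, everything before them stays in the other
true_end = s.str.split().str[-2:].apply(' '.join)
prompt = s.str.split().str[:-2].apply(' '.join)

print(true_end[0])  # and understand
print(prompt[0])    # hold me close and try
```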
2. Creating the Dataset
To use GPT-2 on our data, a few more things are needed. We have to tokenize the data, i.e. convert each sequence of characters into a sequence of tokens, roughly splitting each sentence into word-level pieces.
We also need to make sure that every song respects a maximum of 1024 tokens.
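As a rough intuition (GPT-2 actually uses byte-pair encoding rather than whitespace splitting, and the vocabulary below is made up on the spot), tokenization simply maps text to a sequence of integer token ids, and it is the length of that sequence that the 1024-token limit applies to:

```python
# Build a toy word-level vocabulary and encode a sentence with it
sentence = "because the night belongs to lovers"
vocab = {w: i for i, w in enumerate(sorted(set(sentence.split())))}
token_ids = [vocab[w] for w in sentence.split()]

print(token_ids)  # [0, 4, 3, 1, 5, 2]
assert len(token_ids) <= 1024  # GPT-2's maximum context length
```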
The SongLyrics class below will do exactly that, for every song in our original dataframe.
```python
class SongLyrics(Dataset):

    def __init__(self, control_code, truncate=False, gpt2_type="gpt2", max_length=1024):
        self.tokenizer = GPT2Tokenizer.from_pretrained(gpt2_type)
        self.lyrics = []

        for row in df['Lyric']:
            # The control code is a tag prepended to every song;
            # lyrics are cut to max_length characters before encoding
            self.lyrics.append(torch.tensor(
                self.tokenizer.encode(f"<|{control_code}|>{row[:max_length]}<|endoftext|>")
            ))
        if truncate:
            self.lyrics = self.lyrics[:20000]
        self.lyrics_count = len(self.lyrics)

    def __len__(self):
        return self.lyrics_count

    def __getitem__(self, item):
        return self.lyrics[item]

dataset = SongLyrics("lyric", truncate=True, gpt2_type="gpt2")
```
3. Training the Model
We can now import the pretrained GPT-2 model, as well as its tokenizer. Also, as I mentioned earlier, GPT-2 is huge. If you try to use it on your own computer, you will very likely run into a bunch of CUDA out-of-memory errors.
An alternative that can be used is gradient accumulation.

The idea is simple: before calling the optimizer to perform a gradient descent step, we sum the gradients over several forward/backward passes. The total is divided by the number of accumulated steps, giving an average loss over those training samples. This keeps the memory footprint much smaller, since each pass only holds a small batch in memory.
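Stripped to its essentials, gradient accumulation looks like the sketch below (a toy linear model and random data, with acc_steps standing in for the number of summed backward passes):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)                  # stand-in for a large model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
acc_steps = 4

for step in range(16):
    x = torch.randn(1, 3)
    loss = model(x).pow(2).mean()
    (loss / acc_steps).backward()        # gradients accumulate across calls

    if (step + 1) % acc_steps == 0:      # one optimizer step per acc_steps samples
        optimizer.step()
        optimizer.zero_grad()
```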
```python
# Get the tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Accumulated batch size (since GPT-2 is so big)
def pack_tensor(new_tensor, packed_tensor, max_seq_len):
    if packed_tensor is None:
        return new_tensor, True, None
    if new_tensor.size()[1] + packed_tensor.size()[1] > max_seq_len:
        return packed_tensor, False, new_tensor
    else:
        packed_tensor = torch.cat([new_tensor, packed_tensor[:, 1:]], dim=1)
        return packed_tensor, True, None
```
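To make the return values of pack_tensor concrete, here it is applied to two toy token tensors (the function is repeated so the snippet runs on its own; note that packing drops the first token of the previously packed tensor, which in practice is the control-code token):

```python
import torch

def pack_tensor(new_tensor, packed_tensor, max_seq_len):
    if packed_tensor is None:
        return new_tensor, True, None
    if new_tensor.size()[1] + packed_tensor.size()[1] > max_seq_len:
        return packed_tensor, False, new_tensor
    else:
        packed_tensor = torch.cat([new_tensor, packed_tensor[:, 1:]], dim=1)
        return packed_tensor, True, None

a = torch.tensor([[1, 2, 3]])
b = torch.tensor([[4, 5]])

packed, carry_on, remainder = pack_tensor(a, None, max_seq_len=6)    # starts a new pack
packed, carry_on, remainder = pack_tensor(b, packed, max_seq_len=6)  # b fits, gets packed in
print(packed)  # tensor([[4, 5, 2, 3]])

too_big = torch.tensor([[6, 7, 8]])
packed, carry_on, remainder = pack_tensor(too_big, packed, max_seq_len=6)
print(carry_on)  # False: the pack is full, too_big comes back as the remainder
```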
Now, at last, we can create the training function that fine-tunes GPT-2 on all of our lyrics, so that it can predict quality verses in the future.
```python
def train(
    dataset, model, tokenizer,
    batch_size=16, epochs=5, lr=2e-5,
    max_seq_len=400, warmup_steps=200,
    gpt2_type="gpt2", output_dir=".", output_prefix="wreckgar",
    test_mode=False, save_model_on_epoch=False,
):
    acc_steps = 100
    device = torch.device("cuda")
    model = model.cuda()
    model.train()

    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=-1
    )

    train_dataloader = DataLoader(dataset, batch_size=1, shuffle=True)
    loss = 0
    accumulating_batch_count = 0
    input_tensor = None

    for epoch in range(epochs):
        print(f"Training epoch {epoch}")
        print(loss)
        for idx, entry in tqdm(enumerate(train_dataloader)):
            (input_tensor, carry_on, remainder) = pack_tensor(entry, input_tensor, 768)

            if carry_on and idx != len(train_dataloader) - 1:
                continue

            input_tensor = input_tensor.to(device)
            outputs = model(input_tensor, labels=input_tensor)
            loss = outputs[0]
            loss.backward()

            if (accumulating_batch_count % batch_size) == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
                model.zero_grad()

            accumulating_batch_count += 1
            input_tensor = None
        if save_model_on_epoch:
            torch.save(
                model.state_dict(),
                os.path.join(output_dir, f"{output_prefix}-{epoch}.pt"),
            )
    return model
```
Feel free to play with the various hyperparameters (batch size, learning rate, epochs, optimizer).
Then, finally, we can train the model:
model = train(dataset, model, tokenizer)
Using torch.save and torch.load, you can also save your trained model for future use.
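The save/reload pattern looks like this; a small nn.Linear stands in for the fine-tuned GPT-2 (in practice you would save model.state_dict() of the trained model and load it back into a freshly created GPT2LMHeadModel):

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)                         # stand-in for the fine-tuned model
torch.save(net.state_dict(), "checkpoint.pt")

reloaded = nn.Linear(4, 2)                    # rebuild the same architecture...
reloaded.load_state_dict(torch.load("checkpoint.pt"))  # ...then load the weights
reloaded.eval()

print(torch.equal(net.weight, reloaded.weight))  # True
```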
4. Lyrics Generation
Time to use our brand-new fine-tuned model to generate lyrics. With the two functions below, we can generate lyrics for every song in the test dataset. Remember, I removed the last 20 words of each song. Now, for a given song, our model will look at the lyrics it has and come up with what the ending of the song should be.
```python
def generate(
    model,
    tokenizer,
    prompt,
    entry_count=10,
    entry_length=30,  # maximum number of words
    top_p=0.8,
    temperature=1.,
):
    model.eval()
    generated_num = 0
    generated_list = []
    filter_value = -float("Inf")

    with torch.no_grad():
        for entry_idx in trange(entry_count):
            entry_finished = False
            generated = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0)

            for i in range(entry_length):
                outputs = model(generated, labels=generated)
                loss, logits = outputs[:2]
                logits = logits[:, -1, :] / (temperature if temperature > 0 else 1.0)

                sorted_logits, sorted_indices = torch.sort(logits, descending=True)
                cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

                sorted_indices_to_remove = cumulative_probs > top_p
                sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
                sorted_indices_to_remove[..., 0] = 0

                indices_to_remove = sorted_indices[sorted_indices_to_remove]
                logits[:, indices_to_remove] = filter_value

                next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
                generated = torch.cat((generated, next_token), dim=1)

                if next_token in tokenizer.encode("<|endoftext|>"):
                    entry_finished = True

                if entry_finished:
                    generated_num = generated_num + 1
                    output_list = list(generated.squeeze().numpy())
                    output_text = tokenizer.decode(output_list)
                    generated_list.append(output_text)
                    break

            if not entry_finished:
                output_list = list(generated.squeeze().numpy())
                output_text = f"{tokenizer.decode(output_list)}<|endoftext|>"
                generated_list.append(output_text)

    return generated_list

# Function to generate multiple sentences. Test data should be a dataframe
def text_generation(test_data):
    generated_lyrics = []
    for i in range(len(test_data)):
        x = generate(model.to('cpu'), tokenizer, test_data['Lyric'][i], entry_count=1)
        generated_lyrics.append(x)
    return generated_lyrics

# Run the functions to generate the lyrics
generated_lyrics = text_generation(test_set)
```
The generate function produces the text for a single prompt, while text_generation runs it over the whole test dataframe.
The entry_length argument sets the maximum length of a generation. I keep it at 30 words; punctuation matters here, because later on I remove the last few words to make sure each generation finishes at the end of a sentence.
Two other hyperparameters are worth mentioning:
Temperature. It scales the probabilities of generating each word: a high temperature pushes the model toward more original predictions, while a lower one keeps it from wandering off topic.
Top-p filtering. The model sorts the word probabilities in descending order, then adds them up until they reach p and discards the words beyond that point. This means the model keeps only the most relevant word probabilities, but not just the single best one, since several different words can be appropriate given a sequence.
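The top-p filtering step from the generate function can be watched in isolation on a hand-made logits vector (the numbers are arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.5, 0.5, 0.1, -1.0]])  # toy 5-word vocabulary
top_p = 0.8

sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)

# Mask every word after the cumulative probability passes top_p,
# always keeping at least the single most likely word
sorted_indices_to_remove = cumulative_probs > top_p
sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
sorted_indices_to_remove[..., 0] = 0

indices_to_remove = sorted_indices[sorted_indices_to_remove]
logits[:, indices_to_remove] = -float("Inf")

probs = F.softmax(logits, dim=-1)
print(probs)  # only the 3 words inside the nucleus keep a non-zero probability
```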
In the code below, I simply clean up the generated text: I make sure each generation ends at the end of a sentence (not in the middle of one) and store it in a new column of the test dataset.
```python
# Loop to keep only the generated text and add it as a new column in the dataframe
my_generations = []

for i in range(len(generated_lyrics)):
    a = test_set['Lyric'][i].split()[-30:]  # Get the matching string we want (30 words)
    b = ' '.join(a)
    c = ' '.join(generated_lyrics[i])       # Get all that comes after the matching string
    my_generations.append(c.split(b)[-1])

test_set['Generated_lyrics'] = my_generations

# Finish the sentences when there is a period, remove everything after it
final = []

for i in range(len(test_set)):
    to_remove = test_set['Generated_lyrics'][i].split('.')[-1]
    final.append(test_set['Generated_lyrics'][i].replace(to_remove, ''))

test_set['Generated_lyrics'] = final
```
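The sentence-boundary cleanup can be seen on its own with a toy generation (the lyric string is invented for the example):

```python
generated = "You know there's nothing. Tell me one thing more than"

# Everything after the last period is an unfinished sentence: drop it
to_remove = generated.split('.')[-1]
cleaned = generated.replace(to_remove, '')

print(cleaned)  # You know there's nothing.
```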
5. Evaluating the Results
There are many ways to evaluate the quality of generated text. The most popular metric is called BLEU. The algorithm outputs a score between 0 and 1 depending on how similar the generated text is to the reality, with a score of 1 meaning that every generated word is present in the real text.
Here is the code to compute the BLEU score of the generated lyrics.
```python
# Using the BLEU score to compare the real sentences with the generated ones
import statistics
from nltk.translate.bleu_score import sentence_bleu

scores = []
for i in range(len(test_set)):
    reference = test_set['True_end_lyrics'][i]
    candidate = test_set['Generated_lyrics'][i]
    scores.append(sentence_bleu(reference, candidate))

statistics.mean(scores)
```
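As an aside, nltk's sentence_bleu normally expects a list of tokenized reference sentences and a tokenized candidate, so a more standard call (with token lists invented for illustration) would look like this:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["you", "know", "there's", "nothing", "more", "than", "this"]]
candidate = ["you", "know", "there's", "nothing"]

# Smoothing avoids zero scores when some n-gram orders have no matches
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```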
We obtain an average BLEU score of 0.685, which is pretty good. In comparison, the GPT-2 model without any fine-tuning obtains a BLEU score of 0.288.
However, BLEU has its limits. It was originally created for machine translation and only looks at the vocabulary used to judge the quality of the generated text. That is a problem for us: it is entirely possible to generate high-quality verses that use completely different words from the real ones.
That is why I also made a subjective assessment of the model's performance. To do so, I built a small web interface (using Dash). The code is available in my GitHub repository.
The interface works like this: you give the app a few input words, and the model then uses them to predict what the next few verses should be. Below are some example results.
Given the input sequence in black, the text in red is what the GPT-2 model predicted. You can see that it manages to generate meaningful verses that respect the preceding context! It also produces sentences of similar length, which matters a great deal for keeping the rhythm of the song. In that regard, punctuation in the input text is absolutely essential when generating lyrics.
6. Conclusion
As this article has shown, by fine-tuning GPT-2 on specific data, it is fairly easy to generate context-relevant text.
For lyrics generation, the model can produce lyrics that match both the context and the desired sentence length. The model could of course be improved; for example, we could force it to generate rhyming verses, something often required when writing song lyrics.
Thank you very much for reading, I hope this helps!
The repository with all the code and models can be found here: https://github.com/francoisstamant/lyrics-generation-with-GPT2
https://towardsdatascience.com/how-to-fine-tune-gpt-2-for-text-generation-ae2ea53bc272