I am trying to fine-tune gpt2 with my custom dataset. I put together a basic example from the Hugging Face Transformers documentation, and I get the above error. I know what it means (basically, backward is being called on a non-scalar tensor), but since I am almost exclusively making API calls, I do not know how to fix it. Any suggestions?

```python
from pathlib import Path
from absl import flags, app
import IPython
import torch
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments
from data_reader import GetDataAsPython

# this is my custom data, but i get the same error for the basic case below
# data = GetDataAsPython('data.json')
# data = [data_point.GetText2Text() for data_point in data]
# print("Number of data samples is", len(data))
data = ["this is a trial text", "this is another trial text"]
train_texts = data

from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
special_tokens_dict = {'pad_token': '<PAD>'}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
train_encodigs = tokenizer(train_texts, truncation=True, padding=True)

class BugFixDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, index):
        item = {key: torch.tensor(val[index]) for key, val in self.encodings.items()}
        return item

    def __len__(self):
        return len(self.encodings['input_ids'])

train_dataset = BugFixDataset(train_encodigs)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

model = GPT2LMHeadModel.from_pretrained('gpt2', return_dict=True)
model.resize_token_embeddings(len(tokenizer))

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()
```
1 Answer

海綿寶寶撒
I finally figured it out. The problem was that the data samples did not contain the target outputs. Even though GPT is trained in a self-supervised way, you still have to tell the model explicitly what to predict: without labels, GPT2LMHeadModel returns only logits and no scalar loss for the Trainer to backpropagate.
You have to add the following line:
```python
item['labels'] = torch.tensor(self.encodings['input_ids'][index])
```
to the __getitem__ function of the Dataset class, and it runs correctly after that!
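For reference, a minimal sketch of the corrected dataset (the same BugFixDataset from the question, with only the labels line added) could look like this:

```python
import torch

class BugFixDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, index):
        # Turn input_ids / attention_mask for this sample into tensors, as in the original class
        item = {key: torch.tensor(val[index]) for key, val in self.encodings.items()}
        # For causal language modeling the labels are just the input_ids themselves;
        # GPT2LMHeadModel shifts them internally and computes the loss.
        item['labels'] = torch.tensor(self.encodings['input_ids'][index])
        return item

    def __len__(self):
        return len(self.encodings['input_ids'])
```

With labels present in each batch, the model returns a scalar loss that the Trainer can call backward() on, instead of the raw logits that triggered the non-scalar backward error.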