I'm running some experiments on a multi-GPU cluster and I'm using Accelerate. I'm trying to compute some metrics after every batch iteration over the training dataloader. While the training code itself seems to run fine with Accelerate (making use of the multiple GPUs), I hit an error when I try to compute the metrics mentioned above. It looks like, after the forward pass, the output tensors end up on a different device than the input tensors when I evaluate them. The code that produces the error is:
def calculatePerplexity(sentence, model, tokenizer, accelerator):
    """
    exp(loss)
    """
    input_ids = torch.tensor(sentence).unsqueeze(0)
    print(f"Input ids device: {input_ids.device}")
    model.eval()
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        outputs = accelerator.gather_for_metrics(outputs)
    loss, logits = outputs[:2]
    loss, logits = accelerator.prepare(loss, logits)
    print(f"Loss device: {loss.device}")
    print(f'Model device: {model.device}')
    print(f'Logits device: {logits.device}')

    probabilities = torch.nn.functional.softmax(logits, dim=-1)
    all_prob = []
    input_ids_processed = input_ids[0][1:]
    for i, token_id in enumerate(input_ids_processed):
        probability = probabilities[0, i, token_id].item()
        all_prob.append(probability)

    # stuff for metric calculation
    probs = torch.nn.functional.softmax(logits[0, :-1], dim=-1)
    log_probs = torch.nn.functional.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(dim=-1, index=input_ids_processed.unsqueeze(-1)).squeeze(-1)
    mu = (probs * log_probs).sum(-1)
    sigma = (probs * torch.square(log_probs)).sum(-1) - torch.square(mu)
    mink_plus = (token_log_probs - mu) / sigma.sqrt()
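(For context, as I understand them, the last few lines compute a standardized per-token log-probability: with p_i the model's distribution at position i,

\mu_i = \sum_v p_i(v)\,\log p_i(v), \qquad \sigma_i^2 = \sum_v p_i(v)\,\log^2 p_i(v) - \mu_i^2, \qquad \mathrm{mink\_plus}_i = \frac{\log p_i(x_{i+1}) - \mu_i}{\sigma_i}

but the device problem shows up before any of that matters.)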
The output of the debug statements is as follows:
Input ids device: cpu
Loss device: cuda:0
Model device: cpu
Logits device: cuda:0
This is the same for each of the 4 GPUs I'm using, and it leads to the following error:
token_log_probs = log_probs.gather(dim=-1, index=input_ids_processed.unsqueeze(-1)).squeeze(-1)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA_gather)
I'm not sure what I'm doing wrong here. I thought that using either the gather_for_metrics method or calling accelerator.prepare on the loss and logits would help, but it didn't (I get the same error when I remove those statements). Any suggestions would be greatly appreciated!
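The only workaround I can think of is moving the inputs onto the accelerator's device by hand before the forward pass, along the lines of the sketch below (a guess, assuming accelerator.device is the right target; this is not something I currently have in my code), but I'd still like to understand why the prepared model reports cpu while its outputs land on cuda:0:

# hypothetical workaround sketch, not my actual code
input_ids = torch.tensor(sentence).unsqueeze(0).to(accelerator.device)
with torch.no_grad():
    outputs = model(input_ids, labels=input_ids)
loss, logits = outputs[:2]
# input_ids, loss and logits should now all sit on accelerator.device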
For completeness, here is the rest of the (relevant) code I use when computing the metrics:
# Training loop
for i, (batch_inputs, batch_labels) in tqdm(enumerate(dataloader)):
    all_labels += batch_labels
    unlearned_model, tokenizer = load_base_model(self.experiment_args.model_dir_prefix, self.experiment_args.model)
    torch.cuda.empty_cache()
    optimizer = torch.optim.Adam(unlearned_model.parameters(), lr=self.unlearning_args.lr)
    unlearned_model, optimizer, batch_inputs = accelerator.prepare(unlearned_model, optimizer, batch_inputs)

    # Unlearn data and calculate PPL values
    for i in range(self.unlearning_args.steps):
        unlearned_model = unlearn_dataslice(unlearned_model, optimizer, batch_inputs, self.unlearning_args, accelerator)
        torch.cuda.empty_cache()

    UL_PPL_vals += calculate_PPL_values(unlearned_model, tokenizer, batch_inputs, accelerator)
def unlearn_dataslice(model, optimizer, sentences, args, accelerator):
    learning_rate = args.lr
    model.train()
    optimizer.zero_grad()

    input_data = sentences.clone().detach()
    output = model(input_data)
    # Add a minus to do gradient ascent instead of descent
    loss = -output[0]['logits']
    accelerator.backward(loss.mean())
    torch.cuda.empty_cache()
    optimizer.step()
    del optimizer
    torch.cuda.empty_cache()

    return model
def calculate_PPL_values(model, tokenizer, text_batch, accelerator):
    PPL_values = []
    for text in text_batch:
        PPL = calculatePerplexity(text, model, tokenizer, accelerator)[0]
        PPL_values.append(PPL)
    return PPL_values
In this code I've stripped out a lot of debug statements that check whether unlearned_model and batch_inputs are on the same device (cpu), so I'm fairly sure there is no inconsistency there.
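For reference, those removed checks were just simple prints along these lines (illustrative, not my exact statements; it assumes batch_inputs is already a tensor at that point):

# illustrative device checks, roughly what the removed debug statements did
print(f"Unlearned model device: {next(unlearned_model.parameters()).device}")
print(f"Batch inputs device: {batch_inputs.device}")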