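For work, I switched from a Windows PC to a MacBook Pro with an Apple M3 Pro (36 GB) running macOS Sonoma (version 14.5). I ran into something very strange, and I managed to isolate the root cause of the problem in a small example script.
The script freezes when I import pandas before tensorflow/keras. With the imports the other way around, it runs fine.
The script: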
import numpy as np
import os
import pandas as pd
from tensorflow.keras import layers, models

print("Creating simple model...")
try:
    model = models.Sequential([
        layers.Input(shape=(10,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(1, activation='linear')
    ])
    print("Model created successfully.")
except Exception as e:
    print(f"Error creating model: {e}")

x_train = np.random.rand(100, 10)
y_train = np.random.rand(100, 1)

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
try:
    model.fit(x_train, y_train, epochs=5, batch_size=32)
    print("Model training completed successfully.")
except Exception as e:
    print(f"Error during training: {e}")
When run, it produces the following output:
Creating simple model...
2024-05-31 18:04:07.639131: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M3 Pro
2024-05-31 18:04:07.639149: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 36.00 GB
2024-05-31 18:04:07.639154: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 13.50 GB
2024-05-31 18:04:07.639170: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-05-31 18:04:07.639186: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
The script freezes at this point and has to be killed. When I swap the order of the imports:
from tensorflow.keras import layers, models
import pandas as pd
I get the following:
Creating simple model...
2024-05-31 18:07:18.879661: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M3 Pro
2024-05-31 18:07:18.879680: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 36.00 GB
2024-05-31 18:07:18.879685: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 13.50 GB
2024-05-31 18:07:18.879705: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-05-31 18:07:18.879717: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
Model created successfully.
Epoch 1/5
2024-05-31 18:07:19.269585: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
4/4 1s 16ms/step - loss: 0.1177
Epoch 2/5
4/4 0s 5ms/step - loss: 0.1078
Epoch 3/5
4/4 0s 5ms/step - loss: 0.0932
Epoch 4/5
4/4 0s 5ms/step - loss: 0.1008
Epoch 5/5
4/4 0s 5ms/step - loss: 0.0865
Model training completed successfully.
Note that I don't even use pandas in the script. For comparison, I also import os without using it anywhere, and that import has no effect on the behavior.
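To make the comparison between the two import orders repeatable without killing the process by hand, each variant can be run in a fresh interpreter with a timeout. This is only a minimal sketch: `run_snippet`, `compare_import_orders`, the snippet constants and the 120-second default are hypothetical names and values chosen for illustration, not part of the original script.

```python
import subprocess
import sys

# Shared body: builds and briefly trains a tiny model (mirrors the script above).
BODY = (
    "model = models.Sequential([layers.Input(shape=(10,)), layers.Dense(1)])\n"
    "model.compile(optimizer='adam', loss='mean_squared_error')\n"
    "model.fit(np.random.rand(32, 10), np.random.rand(32, 1), epochs=1, verbose=0)\n"
)

# The two import orders under test; numpy comes first in both, as in the script.
PANDAS_FIRST = (
    "import numpy as np\n"
    "import pandas as pd\n"
    "from tensorflow.keras import layers, models\n"
) + BODY
TF_FIRST = (
    "import numpy as np\n"
    "from tensorflow.keras import layers, models\n"
    "import pandas as pd\n"
) + BODY


def run_snippet(code, timeout_s):
    """Run `code` in a fresh Python interpreter and classify the outcome."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising on timeout.
        return "hang"
    return "ok" if proc.returncode == 0 else "error"


def compare_import_orders(timeout_s=120):
    """Return the outcome ('ok', 'hang' or 'error') for each import order."""
    return {
        "pandas first": run_snippet(PANDAS_FIRST, timeout_s),
        "tensorflow first": run_snippet(TF_FIRST, timeout_s),
    }
```

Calling `compare_import_orders()` in the affected environment should report which order times out. The timeout has to be comfortably longer than a normal run, otherwise a slow but healthy start-up would be misreported as a hang.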
Here is the pip list for my environment:
Package Version
---------------------------- -----------
absl-py 2.1.0
astunparse 1.6.3
Bottleneck 1.3.7
cachetools 5.3.3
certifi 2024.2.2
charset-normalizer 3.3.2
db-dtypes 1.2.0
flatbuffers 24.3.25
gast 0.5.4
google-api-core 2.19.0
google-auth 2.29.0
google-cloud-bigquery 3.23.1
google-cloud-core 2.4.1
google-crc32c 1.5.0
google-pasta 0.2.0
google-resumable-media 2.7.0
googleapis-common-protos 1.63.0
grpcio 1.64.0
grpcio-status 1.62.2
h5py 3.11.0
idna 3.7
importlib_metadata 7.1.0
joblib 1.4.2
keras 3.3.3
libclang 18.1.1
Markdown 3.6
markdown-it-py 3.0.0
MarkupSafe 2.1.5
mdurl 0.1.2
ml-dtypes 0.3.2
namex 0.0.8
numexpr 2.8.7
numpy 1.26.4
opt-einsum 3.3.0
optree 0.11.0
packaging 24.0
pandas 2.2.1
pip 24.0
proto-plus 1.23.0
protobuf 4.25.3
pyarrow 16.1.0
pyasn1 0.6.0
pyasn1_modules 0.4.0
Pygments 2.18.0
python-dateutil 2.9.0.post0
pytz 2024.1
requests 2.32.3
rich 13.7.1
rsa 4.9
scikit-learn 1.4.2
scipy 1.11.4
setuptools 69.5.1
six 1.16.0
tensorboard 2.16.2
tensorboard-data-server 0.7.2
tensorflow 2.16.1
tensorflow-io-gcs-filesystem 0.37.0
tensorflow-macos 2.16.1
tensorflow-metal 1.1.0
termcolor 2.4.0
threadpoolctl 3.5.0
tqdm 4.66.4
typing_extensions 4.12.0
tzdata 2024.1
urllib3 2.2.1
Werkzeug 3.0.3
wheel 0.43.0
wrapt 1.16.0
zipp 3.19.0
Suggestion from the comments (@Ze'ev Ben-Tsvi):
import numpy as np
import os
import pandas as pd
from tensorflow.keras import layers, models

print("Creating simple model...")
try:
    print("Initializing Sequential model...")
    model = models.Sequential()
    print("Adding input layer...")
    model.add(layers.Input(shape=(10,)))
    print("Adding first Dense layer...")
    model.add(layers.Dense(64, activation='relu'))
    print("Adding output Dense layer...")
    model.add(layers.Dense(1, activation='linear'))
    print("Model created successfully.")
except Exception as e:
    print(f"Error creating model: {e}")

x_train = np.random.rand(100, 10)
y_train = np.random.rand(100, 1)

# Compile the model
try:
    print("Compiling model...")
    model.compile(optimizer='adam', loss='mean_squared_error')
    print("Model compiled successfully.")
except Exception as e:
    print(f"Error during compilation: {e}")

# Train the model
try:
    print("Training model...")
    model.fit(x_train, y_train, epochs=5, batch_size=32)
    print("Model training completed successfully.")
except Exception as e:
    print(f"Error during training: {e}")
The output of this script is:
Initializing Sequential model...
Adding input layer...
Adding first Dense layer...
Adding output Dense layer...
Model created successfully.
Compiling model...
Model compiled successfully.
Training model...
Epoch 1/5
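Written this way, execution seems to get a bit further. It no longer hangs at models.Sequential but at model.fit instead.
Swapping the order of the imports again (tensorflow first, then pandas), I get: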
Creating simple model...
Initializing Sequential model...
Adding input layer...
Adding first Dense layer...
Adding output Dense layer...
Model created successfully.
Compiling model...
Model compiled successfully.
Training model...
Epoch 1/5
4/4 0s 1ms/step - loss: 0.4620
Epoch 2/5
4/4 0s 636us/step - loss: 0.3263
Epoch 3/5
4/4 0s 1ms/step - loss: 0.2322
Epoch 4/5
4/4 0s 629us/step - loss: 0.1395
Epoch 5/5
4/4 0s 690us/step - loss: 0.1251
Model training completed successfully.
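The main problem here is that I never get an exception, not even when I wrap each import in its own try/except block. It seems that something either swallows the error or nothing is raised at all.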
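If the process is blocked rather than failing, no Python exception is raised, so try/except has nothing to catch. One way to at least see where the interpreter is stuck is the standard library's `faulthandler` module, which can dump the Python stack of every thread after a timeout. A minimal sketch, assuming the hang is a blocked call: `run_with_watchdog` and `slow_step` are hypothetical names, and `slow_step` merely stands in for the call that hangs (for example `model.fit`).

```python
import faulthandler
import tempfile
import time


def run_with_watchdog(func, timeout_s, log_file):
    """Run func(); if it is still running after timeout_s seconds, dump the
    Python traceback of all threads to log_file without interrupting func."""
    # The watchdog runs in a separate C-level thread, so it can fire even
    # while the main thread is blocked. log_file must stay open until cancel.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=log_file)
    try:
        return func()
    finally:
        # Disarm the watchdog once func() has returned or raised.
        faulthandler.cancel_dump_traceback_later()


def slow_step():
    """Stand-in for a call that blocks longer than expected."""
    time.sleep(0.5)
    return "done"


with tempfile.TemporaryFile(mode="w+") as log:
    result = run_with_watchdog(slow_step, 0.1, log)
    log.seek(0)
    dump = log.read()

print(result)
print(dump)  # traceback taken while slow_step was still sleeping
```

In the real script the same wrapper could be placed around model creation or `model.fit`, with a longer timeout and `sys.stderr` as the log file. The dump only shows Python-level frames, so it narrows down which call is blocked but not what the native code underneath is waiting for.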