Best way to preprocess a corpus with single tokens and bigrams?

b w, 2 months ago

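I'd like to know whether there is any general advice on the most sensible way to approach this problem.

I'm using word2vec to determine similarity scores between specific words (this is the final output I'm interested in). Some of them are single tokens, but others should be bigrams. To complicate matters, I'm using tensorflow (in order to learn how to use tensorflow).

I have the bigrams I want to keep in a separate list: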

Bigram_list = ["northern lights", "cloud cover", "table leg",...]

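At the moment, the process looks like it should be:

1. Identify bigrams in the corpus (using nltk collocations)
2. Create identified_bigrams_list = ["northern lights", "cloud cover", "banana peel",...]
3. Search identified_bigrams_list for matches with Bigram_list
4. Problem: replace the matches in the corpus with a '_'-joined form, e.g. 'northern_lights', 'cloud_cover'. I tried using a dictionary built from Bigram_list (e.g. "northern lights": "northern_lights"). In other words, I'm trying to put the joined form back into the corpus so that it is treated as a single token and handled as a single embedding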
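
Steps 1 to 3 above could be sketched roughly as follows. This is only an illustration: it swaps nltk's collocation scoring for a plain frequency count of adjacent token pairs so that it has no dependencies, and `find_candidate_bigrams`, `min_count`, and the toy `tokens` are hypothetical names that do not come from the post.

```python
from collections import Counter


def find_candidate_bigrams(tokens, min_count=2):
    """Return adjacent word pairs seen at least min_count times, as 'w1 w2' strings."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return [f"{a} {b}" for (a, b), n in pair_counts.items() if n >= min_count]


# Toy corpus, assumed to be lowercased and tokenized already.
tokens = (
    "the northern lights need little cloud cover and "
    "the northern lights fade when cloud cover returns"
).split()

Bigram_list = ["northern lights", "cloud cover", "table leg"]

identified_bigrams_list = find_candidate_bigrams(tokens)

# A set intersection keeps only the bigrams that are actually of interest.
matched_bigrams = sorted(set(identified_bigrams_list) & set(Bigram_list))
print(matched_bigrams)
```

Frequent but uninteresting pairs such as "the northern" are found by the counting step and then dropped by the intersection, which is the role `Bigram_list` plays in step 3.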

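Even if I can get this to work, it seems computationally inefficient, especially once I move to a larger corpus for the actual training (right now I'm using a tiny corpus just to get things working).

Any suggestions?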
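
One way to keep the replacement step cheap is a single left-to-right pass over the tokenized text with a set lookup per adjacent pair, instead of one string replacement per bigram. The sketch below assumes the corpus is already split into tokens; `merge_bigrams` and `sep` are hypothetical names used for illustration.

```python
def merge_bigrams(tokens, bigrams, sep="_"):
    """Join adjacent tokens that form a wanted bigram into a single token."""
    # Store the wanted bigrams as tuples in a set for constant-time membership tests.
    wanted = {tuple(b.split()) for b in bigrams}
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in wanted:
            merged.append(sep.join(pair))
            i += 2  # skip the second word of the bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged


Bigram_list = ["northern lights", "cloud cover", "table leg"]
sentence = "we saw the northern lights despite heavy cloud cover".split()
print(merge_bigrams(sentence, Bigram_list))
```

The cost grows with the number of tokens rather than with the size of `Bigram_list`, and the merged token lists can be passed on to whatever builds the word2vec vocabulary.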


CrunchyArtie, 2 months ago

#2

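I'm trying to use a Bayesian neural network by adding Bayesian layers to a neural network model. This is the code I'm using, taken from the keras website.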

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_datasets as tfds
import tensorflow_probability as tfp



dataset_size = 4898
batch_size = 256
train_size = int(dataset_size * 0.85)

# Create training and evaluation datasets
def get_train_and_test_splits(train_size, batch_size=1):
    # We prefetch with a buffer the same size as the dataset because the dataset
    # is very small and fits into memory.
    dataset = (
        tfds.load(name="wine_quality", as_supervised=True, split="train")
        .map(lambda x, y: (x, tf.cast(y, tf.float32)))
        .prefetch(buffer_size=dataset_size)
        .cache()
    )
    # We shuffle with a buffer the same size as the dataset.
    train_dataset = (
        dataset.take(train_size).shuffle(buffer_size=train_size).batch(batch_size)
    )
    test_dataset = dataset.skip(train_size).batch(batch_size)

    return train_dataset, test_dataset

# get train and test data
train_dataset, test_dataset = get_train_and_test_splits(train_size, batch_size)


# Compile, train, and evaluate the model
hidden_units = [8, 8]
learning_rate = 0.001
num_epochs = 100
mse_loss = keras.losses.MeanSquaredError()


def run_experiment(model, loss, train_dataset, test_dataset):

    model.compile(
        optimizer=keras.optimizers.RMSprop(learning_rate=learning_rate),
        loss=loss,
        metrics=[keras.metrics.RootMeanSquaredError()],
    )

    print("Start training the model...")
    model.fit(train_dataset, epochs=num_epochs, validation_data=test_dataset)
    print("Model training finished.")
    _, rmse = model.evaluate(train_dataset, verbose=0)
    print(f"Train RMSE: {round(rmse, 3)}")

    print("Evaluating model performance...")
    _, rmse = model.evaluate(test_dataset, verbose=0)
    print(f"Test RMSE: {round(rmse, 3)}")


# Create model inputs
FEATURE_NAMES = [
    "fixed acidity",
    "volatile acidity",
    "citric acid",
    "residual sugar",
    "chlorides",
    "free sulfur dioxide",
    "total sulfur dioxide",
    "density",
    "pH",
    "sulphates",
    "alcohol",
]


def create_model_inputs():
    inputs = {}
    for feature_name in FEATURE_NAMES:
        inputs[feature_name] = layers.Input(
            name=feature_name, shape=(1,), dtype=tf.float32
        )
    return inputs

# Define the prior weight distribution as Normal of mean=0 and stddev=1.
# Note that, in this example, the prior distribution is not trainable,
# as we fix its parameters.
def prior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    prior_model = keras.Sequential(
        [
            tfp.layers.DistributionLambda(
                lambda t: tfp.distributions.MultivariateNormalDiag(
                    loc=tf.zeros(n), scale_diag=tf.ones(n)
                )
            )
        ]
    )
    return prior_model


# Define variational posterior weight distribution as multivariate Gaussian.
# Note that the learnable parameters for this distribution are the means,
# variances, and covariances.
def posterior(kernel_size, bias_size, dtype=None):
    n = kernel_size + bias_size
    posterior_model = keras.Sequential(
        [
            tfp.layers.VariableLayer(
                tfp.layers.MultivariateNormalTriL.params_size(n), dtype=dtype
            ),
            tfp.layers.MultivariateNormalTriL(n),
        ]
    )
    return posterior_model

def create_bnn_model(train_size):
    inputs = create_model_inputs()
    features = keras.layers.concatenate(list(inputs.values()))
    features = layers.BatchNormalization()(features)

    # Create hidden layers with weight uncertainty using the DenseVariational layer.
    for units in hidden_units:
        features = tfp.layers.DenseVariational(
            units=units,
            make_prior_fn=prior,
            make_posterior_fn=posterior,
            kl_weight=1 / train_size,
            activation="sigmoid",
        )(features)

    # The output is deterministic: a single point estimate.
    outputs = layers.Dense(units=1)(features)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model

num_epochs = 500
train_sample_size = int(train_size * 0.3)
small_train_dataset = train_dataset.unbatch().take(train_sample_size).batch(batch_size)

bnn_model_small = create_bnn_model(train_sample_size)
run_experiment(bnn_model_small, mse_loss, small_train_dataset, test_dataset)

sample = 10
examples, targets = list(test_dataset.unbatch().shuffle(batch_size * 10).batch(sample))[
    0
]
def compute_predictions(model, iterations=100):
    predicted = []
    for _ in range(iterations):
        predicted.append(model(examples).numpy())
    predicted = np.concatenate(predicted, axis=1)

    prediction_mean = np.mean(predicted, axis=1).tolist()
    prediction_min = np.min(predicted, axis=1).tolist()
    prediction_max = np.max(predicted, axis=1).tolist()
    prediction_range = (np.max(predicted, axis=1) - np.min(predicted, axis=1)).tolist()

    for idx in range(sample):
        print(
            f"Predictions mean: {round(prediction_mean[idx], 2)}, "
            f"min: {round(prediction_min[idx], 2)}, "
            f"max: {round(prediction_max[idx], 2)}, "
            f"range: {round(prediction_range[idx], 2)} - "
            f"Actual: {targets[idx]}"
        )


compute_predictions(bnn_model_small)

num_epochs = 500
bnn_model_full = create_bnn_model(train_size)
run_experiment(bnn_model_full, mse_loss, train_dataset, test_dataset)

compute_predictions(bnn_model_full)

But I get this error:

File "/Users/S/Documents/B/Prediction/test1.py", line 148, in <module>
    bnn_model_small = create_bnn_model(train_sample_size)
  File "/Users/S/Documents/B/Prediction/test1.py", line 131, in create_bnn_model
    features = tfp.layers.DenseVariational(
  File "/Users/S/.local/share/virtualenvs/B-hz56sUDM/lib/python3.9/site-packages/tf_keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/S/.local/share/virtualenvs/B-hz56sUDM/lib/python3.9/site-packages/tf_keras/src/engine/input_spec.py", line 251, in assert_input_compatibility
    ndim = x.shape.rank
AttributeError: 'tuple' object has no attribute 'rank'

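I tried the approach recommended here, which uses from tf_agents.environments import tf_py_environment and environment = tf_py_environment.TFPyEnvironment(environment). The problem is that it requires numpy<1.20, which causes many conflicts with other libraries.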
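
The last frames of the traceback are inside `tf_keras`, the Keras 2 package, while `from tensorflow import keras` can resolve to Keras 3 on recent TensorFlow releases. If that mismatch is the cause here (an assumption, since the post does not state the installed versions), one possible workaround is to build the whole model with Keras 2. The sketch below only shows the import side; `KERAS_FLAVOUR` is a hypothetical name, and the `try`/`except` is there so the snippet still runs where the packages are missing.

```python
import os

# Assumption: this has to be set before TensorFlow is imported for the first time,
# so that tf.keras refers to the legacy Keras 2 implementation.
os.environ["TF_USE_LEGACY_KERAS"] = "1"

try:
    import tensorflow as tf
    import tf_keras as keras  # the package the traceback shows TFP calling into
    from tf_keras import layers

    KERAS_FLAVOUR = keras.__name__
except ImportError:
    # TensorFlow or tf_keras is not installed in this environment.
    KERAS_FLAVOUR = None

print("Keras package in use:", KERAS_FLAVOUR)
```

With these imports in place of `from tensorflow import keras` and `from tensorflow.keras import layers`, the `layers.Input` tensors and `tfp.layers.DenseVariational` would come from the same Keras generation.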

marcin2137, 2 months ago

#3
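My TensorFlow graph neural network has a loss function, shown below. It computes the loss based on an input key. Because the GNN stores keys as bytes, I need to evaluate them to strings in order to look up the features. But it fails because eager execution is disabled.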
```
#! /usr/bin/python
import tensorflow as tf

tf.config.run_functions_eagerly(False) # numpy() works when eager is enabled

@tf.function
def loss_fn(d):
    return tf.reduce_mean(d['features'][d['key'][0].numpy().decode('utf-8')])

d = {'key': tf.constant(['A01', 'B01']),
     'features': {'A01': [0.1,0.2],
                  'B01': [0.3,0.4]}}
loss=loss_fn(d)

print(loss)
```
```
Traceback (most recent call last):
  File "/home/mikehuang/programs/test.py", line 13, in <module>
    loss=loss_fn(d)
         ^^^^^^^^^^
  File "/home/mikehuang/.local/lib/python3.12/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/tmp/__autograph_generated_filezbap1vrc.py", line 12, in tf__loss_fn
    retval_ = ag__.converted_call(ag__.ld(tf).reduce_mean, (ag__.ld(d)['features'][ag__.converted_call(ag__.converted_call(ag__.ld(d)['key'][0].numpy, (), None, fscope).decode, ('utf-8',), None, fscope)],), None, fscope)
    ^^^^^
AttributeError: in user code:

    File "/home/mikehuang/programs/test.py", line 8, in loss_fn  *
        return tf.reduce_mean(d['features'][d['key'][0].numpy().decode('utf-8')])

    AttributeError: 'SymbolicTensor' object has no attribute 'numpy'
```
How can I access the feature based on the input key?
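
Inside a `tf.function` the key is a symbolic tensor, so it cannot be turned into a Python string for a dict lookup. One possible alternative, sketched here under the assumption that all feature vectors have the same length and can be stacked into one tensor, is to map keys to row indices with a `tf.lookup.StaticHashTable` and select the row with `tf.gather`. The names `KEYS`, `FEATURES`, and `table` are hypothetical.

```python
import tensorflow as tf

# Feature rows stacked in the same order as the keys (an assumption for illustration).
KEYS = tf.constant(["A01", "B01"])
FEATURES = tf.constant([[0.1, 0.2], [0.3, 0.4]])

# Graph-compatible mapping from string key to row index; unknown keys map to -1.
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(KEYS, tf.range(2, dtype=tf.int64)),
    default_value=-1,
)


@tf.function
def loss_fn(key):
    row = table.lookup(key[0])  # stays a tensor, no .numpy() needed
    return tf.reduce_mean(tf.gather(FEATURES, row))


loss = loss_fn(tf.constant(["A01", "B01"]))
print(loss)
```

The lookup runs as graph ops, so it works whether or not eager execution is enabled.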

Mostafa, 2 months ago

#4
export default function Journal() {
    const myImg = new Image()
    myImg.src = "src/images/my_japan_trip.jpg"
    const journal = {
        title: "My Japan trip",
        image: myImg.src,
        text: "I went to Japan last winter."
    }
    return (
        <div id="journal">
            <div>
                <h2> {journal.title} </h2>
            </div>
            <div>
                <img src={journal.image} />
            </div>
            <div>
                <p> {journal.text} </p>
            </div>
        </div>
    )
}
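Here is a screenshot of the journal entry element I'm trying to make.
Here is a screenshot of the code I used to make the journal entry element.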

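I'm working on a single-page React application and have made a module for creating journal entries. The module has a component called Journal. I have a variable that uses the Image constructor to create a new image element, and I then assign its src property a value equal to the path of an image file located in my project's assets. After creating the image variable, I create an object for the journal entry. It has title, image, and text properties. The return statement contains the jsx that renders the journal entry. The object's title and text values appear on the web page, but I can't get the image to display through the img element. At the moment it shows the broken img icon. I simply copied the image's path and pasted it in as the value of the src property of the myImg variable. Can someone help me figure out how to display the image? Thanks.

Anondev, 2 months ago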

#5

I wonder whether you should try /path_to_src/src/images/my_japan_trip.jpg (note the leading slash). Or use a relative path such as ../../images/my_japan_trip.jpg.

Mark Lawrence, 2 months ago

#6

Without knowing your file structure it's hard to tell whether the source path is wrong, but trying

myImg.src = "./src/images/my_japan_trip.jpg"

might work for you.