Is it possible to do a fuzzy match merge with python pandas?

Evan Wilson, 1 month ago

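I have two DataFrames that I want to merge on a column. However, because of alternate spellings, different numbers of spaces, and the presence or absence of diacritical marks, I would like to be able to merge them as long as the keys are similar to one another.

Any similarity algorithm will do (soundex, Levenshtein, difflib).

Say one DataFrame has the following data: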

df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])

       number
one         1
two         2
three       3
four        4
five        5

df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

      letter
one        a
too        b
three      c
fours      d
five       e

Then I want to get the resulting DataFrame:

       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e


Zailox, 1 month ago (reply #2)

The accepted solution fails when no close match is found. A simple workaround is to …

Stanley Bak, 1 month ago (reply #3)

Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then join:
```
In [23]: import difflib 

In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>

In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

In [26]: df2
Out[26]: 
      letter
one        a
two        b
three      c
four       d
five       e

In [31]: df1.join(df2)
Out[31]: 
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e
```

If these were columns, you could apply the same approach to the column and then merge:
```
import difflib
from pandas import DataFrame

df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

# Replace each name in df2 with its closest match among df1's names, then merge on 'name'
df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)
```
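
Hallmanac, 1 month ago (reply #4)

Does anyone know if there is a way to do this between the rows of a single column? I'm trying to find duplicates that might contain typos.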
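
One possible way to approach the single-column case is to compare each value only against the values that come after it, again with difflib. This is a rough sketch, not something proposed in the thread: the helper name find_near_duplicates, the sample names, and the 0.8 cutoff are all assumptions chosen for illustration.

```python
import difflib

def find_near_duplicates(values, cutoff=0.8):
    """Return (value, later_value) pairs whose difflib ratio is at least cutoff."""
    pairs = []
    for i, value in enumerate(values):
        rest = values[i + 1:]
        if not rest:
            break
        # n=len(rest) asks for every later value that clears the cutoff, best first
        for match in difflib.get_close_matches(value, rest, n=len(rest), cutoff=cutoff):
            pairs.append((value, match))
    return pairs

# Hypothetical column contents; for a DataFrame, pass df['col'].tolist()
names = ["Microsoft", "Google", "Mcrosoft", "Amazon", "Gogle"]
print(find_near_duplicates(names))
```

The comparison is quadratic in the number of values, so this is only practical for fairly small columns.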

Liubove, 1 month ago (reply #5)

You can use n=1 to limit the results to 1. docs.python.org/3/library/…
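
As a small illustration of the n parameter, the sketch below calls get_close_matches with and without n=1. The word and candidate list are the ones used in the standard library documentation for difflib, not data from this thread.

```python
import difflib

candidates = ["ape", "apple", "peach", "puppy"]

# Default n=3: up to three candidates scoring at or above the default cutoff of 0.6, best first
all_matches = difflib.get_close_matches("appel", candidates)

# n=1: at most one result, so the list has length 0 or 1
best_only = difflib.get_close_matches("appel", candidates, n=1)

print(all_matches)  # ['apple', 'ape']
print(best_only)    # ['apple']
```

Note that n=1 only caps the list length; the list can still be empty, so indexing it with [0] still needs a guard.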
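
tech_geek, 1 month ago (reply #6)

To those saying it fails: I think that is more a question of how you build this into your pipeline than a fault of the solution, which is simple and elegant.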

Domenico Ruggiano, 1 month ago (reply #7)

I used a similar solution, but used [:1] to trim the result list from get_close_matches and make sure it doesn't throw a KeyError.
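
A minimal sketch of that slicing idea, using only the standard library. The helper name closest_or_original and the choice to fall back to the unchanged label when nothing matches are assumptions of this sketch; returning None instead would be equally reasonable.

```python
import difflib

def closest_or_original(label, candidates, cutoff=0.6):
    # [:1] keeps at most one match and is safe on an empty list
    top = difflib.get_close_matches(label, candidates, cutoff=cutoff)[:1]
    # Fall back to the unchanged label when nothing clears the cutoff
    return top[0] if top else label

left_labels = ["one", "two", "three", "four", "five"]
right_labels = ["one", "too", "three", "fours", "a very different string"]

remapped = [closest_or_original(label, left_labels) for label in right_labels]
print(remapped)
```

With pandas, the same helper could be passed to df2.index.map before calling df1.join(df2); unmatched labels then simply fail to join instead of raising.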
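
Endre, 1 month ago (reply #8)

Using fuzzywuzzy

Since there are no examples with the fuzzywuzzy package, here is a function I wrote that returns all matches above a threshold you set as the user:

Example dataframes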
```
import pandas as pd

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

# df1
          Key
0       Apple
1      Banana
2      Orange
3  Strawberry

# df2
        Key
0      Aple
1     Mango
2      Orag
3     Straw
4  Bannanna
5     Berry
```
Function for fuzzy matching
```
from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: how close the matches should be to return a match, based on Levenshtein distance
    :param limit: the number of matches that will get returned, these are sorted high to low
    :return: dataframe with both keys and matches
    """
    # Candidate strings from the right table
    s = df_2[key2].tolist()

    # For each left key, the top `limit` candidates as (candidate, score) tuples
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m  # note: this adds the column to df_1 in place

    # Keep only candidates scoring at or above the threshold, joined into one string
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1
```
Using our function on the dataframes: #1
```
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)

          Key       matches
0       Apple          Aple
1      Banana      Bannanna
2      Orange          Orag
3  Strawberry  Straw, Berry
```
Using our function on the dataframes: #2
```
df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})

fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)

        Col1  matches
0  Microsoft  Mcrsoft
1     Google    gogle
2     Amazon   Amason
3        IBM         
```
Installation:

Pip
```
pip install fuzzywuzzy
```
Anaconda
```
conda install -c conda-forge fuzzywuzzy
```
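
Dimitri Sifoua, 1 month ago (reply #9)

Is there a way to carry all of df2's columns over to the matches? Let's say c is a primary or foreign key of table 2 (df2) that you'd like to keep.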
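
Arun Kumar A.J, 1 month ago (reply #10)

Hey Erfan, when you get a moment, do you think you could update this to work with pandas 1.0? I wonder what sort of performance boost it would get if you changed the engine in apply to Cython or Numba.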
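
Arabella Simpson, 1 month ago (reply #11)

This solution looks very promising for my problem as well. But could you explain how it works when the two datasets have no column in common? How can I create a match column in one of the two datasets that gives me the score? I have used your #2 solution, and I am not sure why it takes so long to run.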

Matthias, 1 month ago (reply #12)

If you need the matched keys as well, you can use s = df_2.to_dict()[key2]
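
EagleUK, 1 month ago (reply #13)

I have written a Python package that aims to solve this problem:

pip install fuzzymatcher

You can find the repo here and the docs here.

Basic usage:

Given two dataframes df_left and df_right that you want to fuzzy join, you can write the following: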
```
# Import the module itself, since the calls below are written as fuzzymatcher.<function>
import fuzzymatcher

# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]

# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]

# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
```
Or, if you just want to link on the closest match:
```
fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)
```
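
sebastian-c, 1 month ago (reply #14)

To be honest, it would be great if it didn't have so many dependencies. First I had to install the Visual Studio build tools, and now I get the error: no such module: fts4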

Lord Konadu Kweku, 1 month ago (reply #15)

@RobinL Could you elaborate on how to fix the "no such module: fts4" issue? I have been trying to work this out with no success.

Corno, 1 month ago (reply #16)

@AnakinSkywalker - I think I used the answer from reddy below, but it took me a lot of effort to solve this issue.
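
Matteo Preda, 1 month ago (reply #17)

I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [ Cohen, et al. ], [ Winkler ].

This is how I would do it with the jellyfish package: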
```
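import jellyfish
import pandas

def get_closest_match(x, list_strings):
    # Return the string in list_strings with the highest Jaro-Winkler similarity to x
    best_match = None
    highest_jw = 0

    for current_string in list_strings:
        # Recent jellyfish releases call this jaro_winkler_similarity;
        # older releases exposed it as jaro_winkler
        current_score = jellyfish.jaro_winkler_similarity(x, current_string)

        if current_score > highest_jw:
            highest_jw = current_score
            best_match = current_string

    return best_match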

df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))

df1.join(df2)
```
Output:
```
    number  letter
one     1   a
two     2   b
three   3   c
four    4   d
five    5   e
```
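
To see what the score measures without installing jellyfish, here is a minimal pure-Python sketch of the textbook Jaro-Winkler formula. The function names jaro and jaro_winkler_sketch and the prefix_scale parameter are illustrative, and the sketch assumes the usual defaults (prefix weight 0.1, common prefix capped at 4 characters). It is not jellyfish's implementation, so edge cases may score slightly differently.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only if they are within this distance of each other
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings' matched characters in order and count the ones that differ
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler_sketch(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)

print(round(jaro_winkler_sketch("MARTHA", "MARHTA"), 4))
```

The prefix boost is why Jaro-Winkler tends to suit names and short labels, where typos are more likely to occur toward the end of the string.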

Jimbo, 1 month ago (reply #18)

How about def get_closest_match(x, list_strings): return sorted(list_strings, key=lambda y: jellyfish.jaro_winkler(x, y), reverse=True)[0]
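
Since only the top candidate is needed, the same idea can be written with max instead of a full sort. The sketch below is a variation on the one-liner, not the commenter's code: to stay runnable with the standard library alone it swaps in difflib's SequenceMatcher ratio as the scorer, and the score parameter is a hypothetical hook where a Jaro-Winkler function could be passed instead.

```python
import difflib

def similarity(a, b):
    # Stand-in scorer (an assumption of this sketch): difflib's ratio in [0, 1]
    return difflib.SequenceMatcher(None, a, b).ratio()

def get_closest_match(x, list_strings, score=similarity):
    # max() finds the highest-scoring candidate in one pass;
    # default=None avoids an exception when there are no candidates
    return max(list_strings, key=lambda y: score(x, y), default=None)

labels = ["one", "two", "three", "four", "five"]
print(get_closest_match("too", labels))    # two
print(get_closest_match("fours", labels))  # four
```

Unlike sorted(...)[0], which raises IndexError on an empty list, this version returns None in that case.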
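
legoland, 1 month ago (reply #19)

For a general approach: fuzzy_merge

For the more general case, where we want to merge on columns from two dataframes that contain slightly different strings, the following function uses difflib.get_close_matches together with merge to mimic the behaviour of pandas' merge, but with fuzzy matching: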
```
import difflib 

def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
    df_other= df2.copy()
    df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff) 
                         for x in df_other[right_on]]
    return df1.merge(df_other, on=left_on, how=how)

def get_closest_match(x, other, cutoff):
    matches = difflib.get_close_matches(x, other, cutoff=cutoff)
    return matches[0] if matches else None
```
Here are some use cases with two sample dataframes:
```
print(df1)

     key   number
0    one       1
1    two       2
2  three       3
3   four       4
4   five       5

print(df2)

                 key_close  letter
0                    three      c
1                      one      a
2                      too      b
3                    fours      d
4  a very different string      e
```
With the example above, we get:
```
fuzzy_merge(df1, df2, left_on='key', right_on='key_close')

     key  number key_close letter
0    one       1       one      a
1    two       2       too      b
2  three       3     three      c
3   four       4     fours      d
```
We can do a left join with:
```
fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='left')

     key  number key_close letter
0    one       1       one      a
1    two       2       too      b
2  three       3     three      c
3   four       4     fours      d
4   five       5       NaN    NaN
```
For a right join, all non-matching keys in the left dataframe are set to None:
```
fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='right')

     key  number                key_close letter
0    one     1.0                      one      a
1    two     2.0                      too      b
2  three     3.0                    three      c
3   four     4.0                    fours      d
4   None     NaN  a very different string      e
```
Also note that difflib.get_close_matches returns an empty list if no item matches within the cutoff. In the original example, if df2 were instead:
```
print(df2)

                          letter
one                          a
too                          b
three                        c
fours                        d
a very different string      e
```
We would get an index out of range error:
```
df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
```
IndexError: list index out of range

To solve this, the get_closest_match function above indexes the list returned by difflib.get_close_matches only if it actually contains a match, and returns None otherwise.

Mike Lischke, 1 month ago (reply #20)

I'd suggest using apply to make it faster: df_other[left_on] = df_other[right_on].apply(lambda x: get_closest_match(x, df1[left_on], cutoff))
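
For reference, here is a minimal sketch of the fuzzy_merge function from reply #19 with the list comprehension swapped for Series.apply, as suggested. Both forms make one Python-level call per row, so any speedup is worth measuring rather than assuming. Passing n=1 and converting the candidates to a list once are additions of this sketch, not part of the suggestion.

```python
import difflib
import pandas as pd

def get_closest_match(x, other, cutoff):
    # n=1 asks difflib for the single best candidate; the list may still be empty
    matches = difflib.get_close_matches(x, other, n=1, cutoff=cutoff)
    return matches[0] if matches else None

def fuzzy_merge(df1, df2, left_on, right_on, how="inner", cutoff=0.6):
    df_other = df2.copy()
    candidates = df1[left_on].tolist()  # build the candidate list once
    df_other[left_on] = df_other[right_on].apply(
        lambda x: get_closest_match(x, candidates, cutoff)
    )
    return df1.merge(df_other, on=left_on, how=how)

df1 = pd.DataFrame({"key": ["one", "two", "three", "four", "five"],
                    "number": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({"key_close": ["three", "one", "too", "fours", "a very different string"],
                    "letter": ["c", "a", "b", "d", "e"]})

result = fuzzy_merge(df1, df2, left_on="key", right_on="key_close")
print(result)
```

If the right-hand column contains many repeated strings, matching only its unique values and mapping the results back is likely to save more time than the choice between apply and a list comprehension.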