This works in MSSQL, and I believe changing a couple of keywords should make it work in MySQL as well (maybe even better):
SELECT TOP 1 t.*
FROM @Table t
INNER JOIN (SELECT t.id, sum(tt.weight) AS cum_weight
FROM @Table t
INNER JOIN @Table tt ON tt.id <= t.id
GROUP BY t.id) tc
ON tc.id = t.id,
(SELECT SUM(weight) AS total_weight FROM @Table) tt,
(SELECT RAND() AS rnd) r
WHERE r.rnd * tt.total_weight <= tc.cum_weight
ORDER BY t.id ASC
SQL 1, the same query adapted for MySQL:
SELECT t.*
FROM table1 t
INNER JOIN (SELECT t.id, SUM(tt.weight) AS cum_weight
FROM table1 t
INNER JOIN table1 tt ON tt.id <= t.id
GROUP BY t.id) tc
ON tc.id = t.id,
(SELECT SUM(weight) AS total_weight FROM table1) tt,
(SELECT RAND() AS rnd) r
WHERE r.rnd * tt.total_weight <= tc.cum_weight
ORDER BY t.id ASC
LIMIT 1
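To see why this picks rows in proportion to their weights, take an illustrative table1 with ids 1, 2, 3 and weights 1, 3, 6 (total 10, cumulative weights 1, 4, 10). With rnd uniform in [0,1), the WHERE clause keeps every row whose cumulative weight is at least rnd * 10, and ORDER BY id with LIMIT 1 returns the first of them, so the three rows are chosen with probabilities 0.1, 0.3 and 0.6 respectively.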
SQL 2, using the separate table1_weight table:
SELECT t.*
FROM table1 t
INNER JOIN table1_weight w
ON w.table1_id = t.id
ORDER BY RAND()
LIMIT 1
SQL 1 consistently takes 0.4 seconds.
SQL 2 takes 0.01 to 0.02 seconds.
Conclusion
If the speed of selecting a weighted random record is not an issue, the single-table SQL suggested by van is fine, and it avoids the overhead of maintaining a separate table.
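Neither query above shows how table1_weight is maintained; for SQL 2's plain ORDER BY RAND() to give a weighted result, each table1 row has to appear in it once per unit of weight. A minimal sketch of that maintenance step in MySQL, assuming a small helper numbers table (nums is hypothetical, not from the original post):

-- Hypothetical helper table covering 1 .. MAX(weight).
CREATE TABLE nums (n INT PRIMARY KEY);
INSERT INTO nums VALUES (1), (2), (3), (4), (5);  -- extend up to the largest weight

CREATE TABLE table1_weight (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    table1_id INT NOT NULL
);

-- Rebuild whenever weights change: a row with weight = 3 gets 3 entries,
-- so a uniform pick from table1_weight is a weighted pick from table1.
INSERT INTO table1_weight (table1_id)
SELECT t.id
FROM table1 t
JOIN nums ON nums.n <= t.weight;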
I may have made a mistake in my script, but the row with weight 32 came up 30+ times, with other rows only appearing occasionally; it was being picked 32 times more often than all the other rows. I ended up creating a temporary table holding the total weight and using it to express each weight as a percentage of the total (SELECT id FROM near50, total_weight ORDER BY Random()*(1/(WEIGHT*100/total_weight.weight)) LIMIT 1).
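Pieced together, that workaround might look roughly like this in PostgreSQL (near50 comes from the comment; the temporary-table setup and the 100.0 literal, used to avoid integer division, are assumptions filled in here):

-- Assumed schema: near50(id, weight). Store the total weight once.
CREATE TEMP TABLE total_weight AS
SELECT SUM(weight) AS weight FROM near50;

-- Scale each row's random sort key by its share of the total weight.
SELECT id
FROM near50, total_weight
ORDER BY random() * (1 / (near50.weight * 100.0 / total_weight.weight))
LIMIT 1;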
select count(*),weight from t group by weight
count | weight
-------+--------
1000 | 99
1000 | 10
1000 | 100
(3 rows)
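The script that built t is not shown; one way to reproduce the same distribution (an assumption, not the poster's setup) is:

CREATE TABLE t (weight int);

-- 1000 rows for each of the three weights, matching the counts above.
INSERT INTO t (weight)
SELECT w
FROM unnest(ARRAY[99, 10, 100]) AS w,
     generate_series(1, 1000);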
My results:
jasen=# with g as ( select generate_series(1,209000) as i )
,r as (select ( select t.weight as w
FROM t
WHERE t.weight > 0
ORDER BY ( random() / t.weight ) + (g.i*0) LIMIT 1 ) from g)
select r.w, count(*), r.w*1000 as expect from r group by r.w;
w | count | expect
-----+-------+--------
99 | 98978 | 99000
10 | 10070 | 10000
100 | 99952 | 100000
(3 rows)
The + (g.i*0) has no effect on the result, but the outer reference is needed to force the planner to re-evaluate the sub-select for each of the 209K input rows produced in g.
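For contrast, the uncorrelated form below (an illustration, not from the original post) would be expected to collapse to a single weight, because the planner can run the sub-select once and reuse its result for every row:

with g as ( select generate_series(1,209000) as i )
,r as (select ( select t.weight as w
                FROM t
                WHERE t.weight > 0
                ORDER BY random() / t.weight LIMIT 1 ) from g)
select r.w, count(*) from r group by r.w;
-- typically returns a single group with count = 209000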