
Machine Learning-06

Notes on recommender systems.

Theoretical cost function

With regularization:

Generalizing to all users
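For reference, one standard way to write these two costs is sketched below. The notation is an assumption, not taken from the notes: $r(i,j)=1$ when user $j$ has rated item $i$, $y^{(i,j)}$ is that rating, $x^{(i)}$ is the feature vector of item $i$, $n$ is the number of features, and $n_u$ is the number of users.

$$J(w^{(j)}, b^{(j)}) = \frac{1}{2}\sum_{i:\,r(i,j)=1}\left(w^{(j)}\cdot x^{(i)} + b^{(j)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{k=1}^{n}\left(w_k^{(j)}\right)^2$$

$$J(w, b) = \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:\,r(i,j)=1}\left(w^{(j)}\cdot x^{(i)} + b^{(j)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}\left(w_k^{(j)}\right)^2$$

The second expression is just the first one summed over all users.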

Collaborative Filtering Algorithm

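Introduction and concepts

Consider the reverse:

If we already have values for the parameters w and b, we can work backwards and learn the features x.

The two can be combined:

The cost function is now a function of w, b, and x.

So gradient descent updates all three simultaneously.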
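As a sketch of what the combined objective might look like, assuming the usual notation ($r(i,j)=1$ when user $j$ rated item $i$, $n_u$ users, $n_m$ items, $n$ features, learning rate $\alpha$):

$$J(w, b, x) = \frac{1}{2}\sum_{(i,j):\,r(i,j)=1}\left(w^{(j)}\cdot x^{(i)} + b^{(j)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}\left(w_k^{(j)}\right)^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}\left(x_k^{(i)}\right)^2$$

$$w_k^{(j)} := w_k^{(j)} - \alpha\frac{\partial J}{\partial w_k^{(j)}},\qquad b^{(j)} := b^{(j)} - \alpha\frac{\partial J}{\partial b^{(j)}},\qquad x_k^{(i)} := x_k^{(i)} - \alpha\frac{\partial J}{\partial x_k^{(i)}}$$

All three updates are computed from the same current values before any parameter is overwritten.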

Variant of the formula for binary labels:
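A common form of this variant, given here as an assumption in the style of logistic regression, passes the linear prediction through the sigmoid $g$ and replaces the squared error with the binary cross-entropy loss:

$$f_{(w,b,x)}(x) = g\left(w^{(j)}\cdot x^{(i)} + b^{(j)}\right),\qquad g(z) = \frac{1}{1+e^{-z}}$$

$$L\left(f_{(w,b,x)}(x),\, y^{(i,j)}\right) = -y^{(i,j)}\log f_{(w,b,x)}(x) - \left(1-y^{(i,j)}\right)\log\left(1-f_{(w,b,x)}(x)\right)$$

$$J(w, b, x) = \sum_{(i,j):\,r(i,j)=1} L\left(f_{(w,b,x)}(x),\, y^{(i,j)}\right)$$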

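Mean normalization

In short: subtract the mean first, then add it back.

Subtract each item's mean rating from its ratings, train on the normalized ratings $y^{(i,j)}$, and add the mean back in the final prediction.

Advantages:

  • Speeds up training of the recommender system
  • Predictions are more reasonable for new users with few or no ratings (they default to the mean)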
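The effect on a new user can be seen in a small example. This is a minimal sketch in plain Python (no NumPy, to stay dependency-free); the toy matrix and the helper names `row_means` and `normalize` are made up for illustration.

```python
# Toy ratings: rows are movies, columns are users; None marks "not rated".
Y = [
    [5.0, 4.0, None],
    [1.0, None, None],
    [None, 3.0, None],
]

def row_means(Y):
    """Mean of the observed ratings in each row (0.0 if a row has none)."""
    means = []
    for row in Y:
        rated = [y for y in row if y is not None]
        means.append(sum(rated) / len(rated) if rated else 0.0)
    return means

def normalize(Y, means):
    """Subtract each movie's mean from its observed ratings only."""
    return [[None if y is None else y - mu for y in row]
            for row, mu in zip(Y, means)]

mu = row_means(Y)
Ynorm = normalize(Y, mu)

# The user in column 2 has no ratings. With regularization their parameters
# tend toward w = 0 and b = 0, so the raw prediction is about 0, and adding
# the mean back yields each movie's average rating.
new_user_pred = [0.0 + m for m in mu]
print(mu, Ynorm[0], new_user_pred)
```

Without normalization the same user would be predicted to rate every movie close to 0, which is rarely a sensible default.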

Implementing collaborative filtering in TensorFlow

TensorFlow's built-in gradient tape computes the partial derivatives needed for gradient descent:

import tensorflow as tf

w = tf.Variable(3.0)
x = 1.0
y = 1.0  # target value
alpha = 0.01

# Example: J = (wx - 1)^2

iterations = 30
for iter in range(iterations):
    # Record the operations used to compute the cost
    with tf.GradientTape() as tape:
        fwb = w * x
        costJ = (fwb - y) ** 2

    [dJdw] = tape.gradient(costJ, [w])  # automatic differentiation

    # One gradient descent step
    w.assign_add(-alpha * dJdw)
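To see what the tape is computing, the derivative for this example can also be written by hand: for $J = (wx - y)^2$, $\partial J / \partial w = 2(wx - y)x$. The sketch below repeats the same 30 steps in plain Python with that hand-derived gradient; it is only a cross-check, not part of the original notes.

```python
w = 3.0
x = 1.0
y = 1.0      # target value
alpha = 0.01

for _ in range(30):
    dJdw = 2 * (w * x - y) * x   # derivative of (w*x - y)**2 with respect to w
    w -= alpha * dJdw

print(w)
```

With x = 1 and y = 1 each step shrinks the gap (w - 1) by a factor of 0.98, so w moves slowly from 3 toward 1.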

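Collaborative filtering implementation:

import tensorflow as tf
from tensorflow import keras

# Adam optimizer
optimizer = keras.optimizers.Adam(learning_rate=1e-1)

iterations = 200
for iter in range(iterations):
    with tf.GradientTape() as tape:
        # cofi_cost_func_v is the vectorized cost function defined later in these notes.
        # Ynorm: mean-normalized ratings y; lambda_ is used because `lambda` is a reserved word.
        cost_value = cofi_cost_func_v(X, W, b, Ynorm, R, lambda_)
    grads = tape.gradient(cost_value, [X, W, b])
    # Update the parameters with the Adam optimizer
    optimizer.apply_gradients(zip(grads, [X, W, b]))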

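Finding related items

How do we judge how similar two items are?

We can use the distance between their learned feature vectors $x^{(i)}$: the smaller the distance, the more similar the two items.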
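A minimal sketch of this idea, assuming squared Euclidean distance between feature vectors; the feature values and the helper names `squared_distance` and `most_similar` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ak - bk) ** 2 for ak, bk in zip(a, b))

def most_similar(X, i, k=1):
    """Indices of the k items whose feature vectors are closest to item i."""
    others = [j for j in range(len(X)) if j != i]
    others.sort(key=lambda j: squared_distance(X[i], X[j]))
    return others[:k]

# Hypothetical learned features, one row per item.
X = [
    [0.9, 0.1],   # item 0
    [0.8, 0.2],   # item 1
    [0.1, 0.9],   # item 2
]
nearest = most_similar(X, 0, k=2)
print(nearest)
```

Item 1 is listed before item 2 because its features are much closer to those of item 0.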

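Limitations of the algorithm

Main problems:

  • The cold start problem

    • How do we handle new items that have very few ratings?

    • How do we handle new users who have rated very few items?

  • Using side information about items or users

    • Items: genre, etc.
    • Users: gender, age, location, preferences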

Code implementation

Computing the cost function

Brute-force implementation:

import numpy as np

def cofi_cost_func(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users))           : vector of user parameters
      Y (ndarray (num_movies,num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies,num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    nm, nu = Y.shape
    J = 0
    ### START CODE HERE ###

    for j in range(nu):
        w = W[j, :]
        b_j = b[0, j]

        for i in range(nm):
            x = X[i, :]
            y = Y[i, j]
            r = R[i, j]
            # Squared error, counted only where a rating exists (r = 1)
            J += np.square(r * (np.dot(w, x) - y + b_j))

    # Regularization, added once after both loops
    J += lambda_ * (np.sum(np.square(W)) + np.sum(np.square(X)))

    J /= 2
    ### END CODE HERE ###

    return J

Vectorized implementation:

import tensorflow as tf

def cofi_cost_func_v(X, W, b, Y, R, lambda_):
    """
    Returns the cost for collaborative filtering
    Vectorized for speed. Uses tensorflow operations to be compatible with custom training loop.
    Args:
      X (ndarray (num_movies,num_features)): matrix of item features
      W (ndarray (num_users,num_features)) : matrix of user parameters
      b (ndarray (1, num_users))           : vector of user parameters
      Y (ndarray (num_movies,num_users))   : matrix of user ratings of movies
      R (ndarray (num_movies,num_users))   : matrix, where R(i, j) = 1 if the i-th movie was rated by the j-th user
      lambda_ (float): regularization parameter
    Returns:
      J (float) : Cost
    """
    # Prediction errors for every (movie, user) pair, zeroed where no rating exists
    j = (tf.linalg.matmul(X, tf.transpose(W)) + b - Y) * R
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_ / 2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    return J

Simulate adding a new user with a few ratings:

# load_Movie_List_pd is a helper from the course lab; num_movies is the
# number of movies in the dataset and must already be defined.
movieList, movieList_df = load_Movie_List_pd()

my_ratings = np.zeros(num_movies)  # Initialize my ratings

# Check the file small_movie_list.csv for id of each movie in our dataset
# For example, Toy Story 3 (2010) has ID 2700, so to rate it "5", you can set
my_ratings[2700] = 5

# Or suppose you did not enjoy Persuasion (2007), you can set
my_ratings[2609] = 2

# We have selected a few movies we liked / did not like and the ratings we
# gave are as follows:
my_ratings[929] = 5   # Lord of the Rings: The Return of the King, The
my_ratings[246] = 5   # Shrek (2001)
my_ratings[2716] = 3  # Inception
my_ratings[1150] = 5  # Incredibles, The (2004)
my_ratings[382] = 2   # Amelie (Fabuleux destin d'Amélie Poulain, Le)
my_ratings[366] = 5   # Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
my_ratings[622] = 5   # Harry Potter and the Chamber of Secrets (2002)
my_ratings[988] = 3   # Eternal Sunshine of the Spotless Mind (2004)
my_ratings[2925] = 1  # Louis Theroux: Law & Disorder (2008)
my_ratings[2937] = 1  # Nothing to Declare (Rien à déclarer)
my_ratings[793] = 5   # Pirates of the Caribbean: The Curse of the Black Pearl (2003)
my_rated = [i for i in range(len(my_ratings)) if my_ratings[i] > 0]

print('\nNew user ratings:\n')
for i in range(len(my_ratings)):
    if my_ratings[i] > 0:
        print(f'Rated {my_ratings[i]} for {movieList_df.loc[i,"title"]}')

Add the new user to the dataset and normalize:

# Reload ratings and add new ratings (load_ratings_small is a helper from the course lab)
Y, R = load_ratings_small()
Y = np.c_[my_ratings, Y]                     # the new user becomes column 0
R = np.c_[(my_ratings != 0).astype(int), R]

# Normalize the Dataset
Ynorm, Ymean = normalizeRatings(Y, R)

Normalization function:

import numpy as np

def normalizeRatings(Y, R):
    """
    Preprocess data by subtracting mean rating for every movie (every row).
    Only include real ratings R(i,j)=1.
    [Ynorm, Ymean] = normalizeRatings(Y, R) normalizes Y so that each movie
    has a rating of 0 on average. Unrated movies then have a mean rating (0).
    Returns the mean rating in Ymean.
    """
    # 1e-12 avoids division by zero for movies with no ratings
    Ymean = (np.sum(Y * R, axis=1) / (np.sum(R, axis=1) + 1e-12)).reshape(-1, 1)
    # Subtract the mean only where a rating exists
    Ynorm = Y - np.multiply(Ymean, R)
    return Ynorm, Ymean

Initialize the parameters:

import tensorflow as tf
from tensorflow import keras

# Useful Values
num_movies, num_users = Y.shape
num_features = 100

# Set Initial Parameters (W, X), use tf.Variable to track these variables
tf.random.set_seed(1234)  # for consistent results
W = tf.Variable(tf.random.normal((num_users, num_features), dtype=tf.float64), name='W')
X = tf.Variable(tf.random.normal((num_movies, num_features), dtype=tf.float64), name='X')
b = tf.Variable(tf.random.normal((1, num_users), dtype=tf.float64), name='b')

# Instantiate an optimizer.
optimizer = keras.optimizers.Adam(learning_rate=1e-1)

Train the model:

iterations = 200
lambda_ = 1
for iter in range(iterations):
    # Use TensorFlow's GradientTape
    # to record the operations used to compute the cost
    with tf.GradientTape() as tape:

        # Compute the cost (forward pass included in cost)
        cost_value = cofi_cost_func_v(X, W, b, Ynorm, R, lambda_)

    # Use the gradient tape to automatically retrieve
    # the gradients of the trainable variables with respect to the loss
    grads = tape.gradient(cost_value, [X, W, b])

    # Run one step of gradient descent by updating
    # the value of the variables to minimize the loss.
    optimizer.apply_gradients(zip(grads, [X, W, b]))

    # Log periodically.
    if iter % 20 == 0:
        print(f"Training loss at iteration {iter}: {cost_value:0.1f}")

Make recommendations and inspect the results:

# Make a prediction using trained weights and biases
p = np.matmul(X.numpy(), np.transpose(W.numpy())) + b.numpy()

# Restore the mean
pm = p + Ymean

# The new user was added as column 0
my_predictions = pm[:, 0]

# Sort predictions
ix = tf.argsort(my_predictions, direction='DESCENDING')

for i in range(17):
    j = ix[i]
    if j not in my_rated:
        print(f'Predicting rating {my_predictions[j]:0.2f} for movie {movieList[j]}')

print('\n\nOriginal vs Predicted ratings:\n')
for i in range(len(my_ratings)):
    if my_ratings[i] > 0:
        print(f'Original {my_ratings[i]}, Predicted {my_predictions[i]:0.2f} for {movieList[i]}')

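Table query:

# Keep only movies with more than 20 ratings (note: this name shadows the built-in filter)
filter = (movieList_df["number of ratings"] > 20)
movieList_df["pred"] = my_predictions
movieList_df = movieList_df.reindex(columns=["pred", "mean rating", "number of ratings", "title"])
# Top 300 predictions, filtered, then sorted by mean rating
movieList_df.loc[ix[:300]].loc[filter].sort_values("mean rating", ascending=False)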

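Content-based filtering

Collaborative filtering: recommends items to you based on the ratings of users who are similar to you.

Content-based filtering: recommends items to you by finding good matches between user features and item features.

Neural network architecture

How do we map the user features $X_u$ and item features $X_m$ to $V_u$ and $V_m$? With neural networks.

$V_u$ and $V_m$ must have the same dimension because the prediction is their dot product, so the output layers of the two networks have the same number of units.

Cost function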
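A common way to write this cost, sketched here under the assumption that the prediction for user $j$ and item $i$ is the dot product $V_u^{(j)}\cdot V_m^{(i)}$ and that $r(i,j)=1$ marks an existing rating $y^{(i,j)}$:

$$J = \sum_{(i,j):\,r(i,j)=1}\left(V_u^{(j)}\cdot V_m^{(i)} - y^{(i,j)}\right)^2 + \text{(neural network regularization term)}$$

Both networks are trained jointly, since the gradient of $J$ flows through $V_u$ and $V_m$ into the weights of each network.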

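Recommendation steps

  • Retrieval
    • Generate a large list of candidate items
      • For each of the last 10 movies the user watched, find the 10 most similar movies
      • For the user's 3 most-watched genres, find the 10 highest-rated movies
      • The 20 highest-rated movies
    • Merge these into one list and remove duplicates and items the user has already watched
  • Ranking
    • Score and rank the items in this list (a reduction of the large catalog) with the model
    • Show the ranked list to the user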
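The two stages above can be sketched in a few lines of plain Python. Everything here is illustrative: the helper names `retrieve` and `rank`, the candidate lists, and the predicted scores are hypothetical stand-ins for real candidate sources and a trained model.

```python
def retrieve(candidate_lists, watched):
    """Merge candidate lists, dropping duplicates and already-watched items."""
    seen = set(watched)
    merged = []
    for candidates in candidate_lists:
        for item in candidates:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged

def rank(candidates, score):
    """Sort candidates by predicted score, highest first."""
    return sorted(candidates, key=score, reverse=True)

# Hypothetical candidate sources and model scores, for illustration only.
similar_to_recent = ["A", "B", "C"]
top_in_genres = ["B", "D"]
top_rated = ["E", "A"]
watched = ["C"]
predicted = {"A": 3.9, "B": 4.6, "D": 4.1, "E": 3.2}

candidates = retrieve([similar_to_recent, top_in_genres, top_rated], watched)
recommendations = rank(candidates, score=lambda item: predicted[item])
print(candidates, recommendations)
```

Retrieval is deliberately cheap and broad, so the more expensive model only has to score the short merged list.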

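Trade-off:

  • Retrieving more items generally gives better recommendations but makes them slower to produce
  • Run offline experiments to analyze and tune this trade-off

Applications

  • Movie recommendations
  • Product recommendations (items most likely to be purchased)
  • Ad recommendations (ads most likely to be clicked)
  • High-margin product recommendations (the user's needs do not necessarily come first)
  • Video recommendations that maximize engagement and watch time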

Implementing content-based filtering in TensorFlow

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Model

# num_user_features, num_item_features, the train/test arrays, and the
# column offsets u_s and i_s come from the course lab's data loading step.
num_outputs = 32
tf.random.set_seed(1)
user_NN = tf.keras.models.Sequential([
    ### START CODE HERE ###
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs),
    ### END CODE HERE ###
])

item_NN = tf.keras.models.Sequential([
    ### START CODE HERE ###
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(num_outputs),
    ### END CODE HERE ###
])

# create the user input and point to the base network
input_user = tf.keras.layers.Input(shape=(num_user_features,))  # shape must be a tuple
vu = user_NN(input_user)
vu = tf.linalg.l2_normalize(vu, axis=1)

# create the item input and point to the base network
input_item = tf.keras.layers.Input(shape=(num_item_features,))  # shape must be a tuple
vm = item_NN(input_item)
vm = tf.linalg.l2_normalize(vm, axis=1)

# compute the dot product of the two vectors vu and vm
output = tf.keras.layers.Dot(axes=1)([vu, vm])

# specify the inputs and output of the model
model = Model([input_user, input_item], output)

model.summary()

tf.random.set_seed(1)
cost_fn = tf.keras.losses.MeanSquaredError()
opt = keras.optimizers.Adam(learning_rate=0.01)
model.compile(optimizer=opt, loss=cost_fn)

tf.random.set_seed(1)
model.fit([user_train[:, u_s:], item_train[:, i_s:]], ynorm_train, epochs=30)

model.evaluate([user_test[:, u_s:], item_test[:, i_s:]], ynorm_test)

PS:

To make the recommender system more efficient, $V_m$ can be computed ahead of time, before a user ever connects.
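A minimal sketch of that idea in plain Python: the item vectors are computed once and stored, and only the user vector is computed at request time. The stored vectors below are hypothetical values standing in for the outputs of the item network, and `score_items` is an illustrative helper, not part of the original code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Offline: item vectors computed once and stored
# (hypothetical values standing in for item network outputs).
item_vectors = {
    "movie_a": l2_normalize([1.0, 0.0]),
    "movie_b": l2_normalize([1.0, 1.0]),
    "movie_c": l2_normalize([0.0, 1.0]),
}

def score_items(user_vector, item_vectors):
    """Online: normalize the user vector, then take dot products with stored items."""
    vu = l2_normalize(user_vector)
    return {name: dot(vu, vm) for name, vm in item_vectors.items()}

scores = score_items([2.0, 0.0], item_vectors)
best = max(scores, key=scores.get)
print(best, scores)
```

At request time the work is one pass through the user network plus a dot product per candidate item, which is why precomputing the item side pays off.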