数据中有3个字段分别使用逗号(,)隔开,如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | 用户 ,商品 ,评分 user1 , 101 , 5.0 user1 , 102 , 3.0 user1 , 103 , 2.5 user2 , 101 , 2.0 user2 , 102 , 2.5 user2 , 103 , 5.0 user2 , 104 , 2.0 user3 , 101 , 2.0 user3 , 104 , 4.0 user3 , 105 , 4.5 user3 , 107 , 5.0 user4 , 101 , 5.0 user4 , 103 , 3.0 user4 , 104 , 4.5 user4 , 106 , 4.0 user5 , 101 , 4.0 user5 , 102 , 3.0 user5 , 103 , 2.0 user5 , 104 , 4.0 user5 , 105 , 3.5 user5 , 106 , 4.0 |
1.将用户购买商品数据进行转化从而变成 <商品购买次数同现矩阵> 和 <商品用户评分矩阵>
注意: 上面的两个矩阵通过 MRjob 转化后是以稀疏矩阵的形式保存的
2.将以上得到的两个稀疏矩阵进行相乘从而获得每个用户每个商品的评分情况
1 2 3 4 5 6 7 8 9 | === === === === 用户同时购买商品矩阵 === === === === | == 用户评分矩阵 == == 推荐建物品得分 == [ 101 ] [ 102 ] [ 103 ] [ 104 ] [ 105 ] [ 106 ] [ 107 ] [ user3 ] [ R ] [ 101 ] 5 3 4 4 2 2 1 2.0 40.0 看过 [ 102 ] 3 3 3 2 1 1 0 0.0 18.5 [ 103 ] 4 3 4 3 1 2 0 0.0 22.5 [ 104 ] 4 2 3 4 2 2 1 X 4.0 = 40.0 看过 [ 105 ] 2 1 1 2 2 1 1 4.5 26.0 看过 [ 106 ] 2 1 2 2 1 2 0 0.0 16.5 [ 107 ] 1 0 0 1 1 0 1 5.0 15.5 看过 |
MRJob 计算代码(01_user_goods_score.py)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | #!/usr/bin/env python # -*- coding: utf-8 -*- from mrjob . job import MRJob class UserGoodsScore ( MRJob ) : “” “获得用户购买的商品与评分记录” “” def mapper ( self , _ , line ) : # 解析行: 用户, 商品, 评分 user , goods , score = line . split ( ‘,’ ) # 输出串: 商品:评分 output = ‘{goods}:{score}’ . format ( goods = goods , score = score ) yield user , output def reducer ( self , key , values ) : yield key , ‘,’ . join ( values ) def main ( ) : UserGoodsScore . run ( ) if __name__ == ‘__main__’ : main ( ) |
执行
1 2 3 4 5 6 7 | python 01_user_goods_score.py < 01_user_goods_score.data > 02_user_goods_score_record.data cat 02_user_goods_score_record.data “user1” “101:5.0,102:3.0,103:2.5” “user2” “101:2.0,102:2.5,103:5.0,104:2.0” “user3” “101:2.0,104:4.0,105:4.5,107:5.0” “user4” “101:5.0,103:3.0,104:4.5,106:4.0” “user5” “101:4.0,102:3.0,103:2.0,104:4.0,105:3.5,106:4.0” |
类似SQL:
1 2 3 4 | SELECT user_name , GROUP_CONCAT ( goods_id , ‘:’ , score ) FROM user_goods_score GROUP BY user_name ; |
MRJob 计算代码(01_user_goods_score.py)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 | #!/usr/bin/env python # -*- coding: utf-8 -*- from mrjob . job import MRJob class GoodsBoughtCountMatrix ( MRJob ) : “” “获得商品购买次数的同现矩阵” “” def mapper ( self , _ , line ) : # 解析行 tokens = line . split ( ‘”‘ ) user = tokens [ 1 ] # 获得用户, 如: 1 score_matrix = tokens [ 3 ] # 评分矩阵, 如: 101:5.0,102:3.0,103:2.5 # 获得物品 列表 goods_score_list = score_matrix . split ( ‘,’ ) # 获得商品得分列表 # 获得商品列表 goods_list = [ goods_socre . split ( ‘:’ ) [ 0 ] for goods_socre in goods_score_list ] # 获得商品同现矩阵 for goods1 in goods_list : for goods2 in goods_list : matrix_item = ‘{goods1}:{goods2}’ . format ( goods1 = goods1 , goods2 = goods2 ) yield matrix_item , 1 def reducer ( self , key , values ) : yield key , sum ( values ) def main ( ) : GoodsBoughtCountMatrix . run ( ) if __name__ == ‘__main__’ : main ( ) |
执行
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 | python 02_goods_bought_count_matrix.py < 02_user_goods_score_record.data > 03_goods_bought_count_matrix.data cat 03_goods_bought_count_matrix.data “101:101” 5 “101:102” 3 “101:103” 4 “101:104” 4 “101:105” 2 “101:106” 2 “101:107” 1 “102:101” 3 “102:102” 3 “102:103” 3 “102:104” 2 “102:105” 1 “102:106” 1 “103:101” 4 “103:102” 3 “103:103” 4 “103:104” 3 “103:105” 1 “103:106” 2 “104:101” 4 “104:102” 2 “104:103” 3 “104:104” 4 “104:105” 2 “104:106” 2 “104:107” 1 “105:101” 2 “105:102” 1 “105:103” 1 “105:104” 2 “105:105” 2 “105:106” 1 “105:107” 1 “106:101” 2 “106:102” 1 “106:103” 2 “106:104” 2 “106:105” 1 “106:106” 2 “107:101” 1 “107:104” 1 “107:105” 1 “107:107” 1 |
MRJob 计算代码(01_goods_user_score_matrix.py)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | #!/usr/bin/env python # -*- coding: utf-8 -*- from mrjob . job import MRJob class GoodsUserScoreMatrix ( MRJob ) : “” “商品用户评分矩阵” “” def mapper ( self , _ , line ) : # 解析行: 用户, 商品, 评分 user , goods , score = line . split ( ‘,’ ) # 输出串: 商品:评分 output = ‘{user}:{score}’ . format ( user = user , score = score ) yield goods , output def reducer ( self , key , values ) : for value in values : yield key , value def main ( ) : GoodsUserScoreMatrix . run ( ) if __name__ == ‘__main__’ : main ( ) |
执行
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | python 01_goods_user_score_matrix.py < 01_user_goods_score.data > 03_goods_user_score_matrix.data cat 03_goods_user_score_matrix.data “101” “user1:5.0” “101” “user2:2.0” “101” “user3:2.0” “101” “user4:5.0” “101” “user5:4.0” “102” “user1:3.0” “102” “user2:2.5” “102” “user5:3.0” “103” “user1:2.5” “103” “user2:5.0” “103” “user4:3.0” “103” “user5:2.0” “104” “user2:2.0” “104” “user3:4.0” “104” “user4:4.5” “104” “user5:4.0” “105” “user3:4.5” “105” “user5:3.5” “106” “user4:4.0” “106” “user5:4.0” “107” “user3:5.0” |
MRJob计算代码(03_user_every_goods_score_matrix.py)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 | #!/usr/bin/env python # -*- coding: utf-8 -*- from mrjob . job import MRJob import os class Matrix ( MRJob ) : flag = None # 用于标记数据是哪个矩阵来的 goods_ids = [ 101 , 102 , 103 , 104 , 105 , 106 , 107 ] # 矩阵A的商品ID user_ids = [ ‘user1’ , ‘user2’ , ‘user3’ , ‘user4’ , ‘user5’ ] # 矩阵B 用户ID def mapper ( self , _ , line ) : file_name = os . environ [ ‘mapreduce_map_input_file’ ] if file_name == ’03_goods_bought_count_matrix.data’ : # 商品购买次数同现矩阵 # 获得数组 矩阵 item_i , item_j , item_value = self . _get_matrix_item_1 ( line ) for user_id in self . user_ids : # 获取key key = ‘{user_id},{item_i}’ . format ( item_i = item_i , user_id = user_id ) value = ‘a,{item_j},{item_value}’ . format ( item_j = item_j , item_value = item_value ) value = [ ‘a’ , item_j , item_value ] yield key , value elif file_name == ’03_goods_user_score_matrix.data’ : # 商品用户评分矩阵 # 获得数组 矩阵 item_i , item_j , item_value = self . _get_matrix_item_2 ( line ) for goods_id in self . goods_ids : # 获取key key = ‘{item_j},{goods_id}’ . format ( goods_id = goods_id , item_j = item_j ) value = ‘b,{item_i},{item_value}’ . format ( item_i = item_i , item_value = item_value ) value = [ ‘b’ , item_i , item_value ] yield key , value def reducer ( self , key , values ) : goods_score = { } user_score = { } for flag , goods_id , value in values : if flag == ‘a’ : goods_score [ goods_id ] = value elif flag == ‘b’ : user_score [ goods_id ] = value # 计算有相同 goods_id 乘积之和 sum = 0 for goods_id in set ( goods_score ) & set ( user_score ) : sum += float ( goods_score [ goods_id ] ) * float ( user_score [ goods_id ] ) yield key , sum def _get_matrix_item_1 ( self , line ) : “” “解析03_goods_bought_count_matrix.data 从而获得商品购买次数同现矩阵中的每个元素 Args: line: “ 101 : 102 ” 3 Return: item_i: 101 item_j: 102 value: 3 “ “” items = line . split ( ‘”‘ ) item_i , item_j = items [ 1 ] . split ( ‘:’ ) value = items [ 2 ] . strip ( ) return item_i , item_j , value def _get_matrix_item_2 ( self , line ) : “” “解析03_goods_user_score_matrix.data 从而获得商品用户评分(只包含有买过的)矩阵中的每个元素 Args: line: “ 107 ” “ user3 : 5.0 “ Return: item_i: 107 item_j: user3 value: 5.0 “ “” items = line . split ( ‘”‘ ) item_i = items [ 1 ] item_j , value = items [ 3 ] . split ( ‘:’ ) return item_i , item_j , value def main ( ) : Matrix . run ( ) if __name__ == ‘__main__’ : main ( ) |
程序说明: 这边矩阵需要先知道参与矩阵运算的商品有哪些 和 参与矩阵运算的用户有哪些,就像程序中的两个变量
1 2 | goods_ids = [ 101 , 102 , 103 , 104 , 105 , 106 , 107 ] # 矩阵A的商品ID user_ids = [ ‘user1’ , ‘user2’ , ‘user3’ , ‘user4’ , ‘user5’ ] # 矩阵B 用户ID |
goods_ids 和 user_ids 变量也是在计算中会消耗内存的地方
执行
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 | python 03_user_every_goods_score_matrix.py 03_goods_bought_count_matrix.data 03_goods_user_score_matrix.data “user1,101” 44.0 “user1,102” 31.5 “user1,103” 39.0 “user1,104” 33.5 “user1,105” 15.5 “user1,106” 18.0 “user1,107” 5.0 “user2,101” 45.5 “user2,102” 32.5 “user2,103” 41.5 “user2,104” 36.0 “user2,105” 15.5 “user2,106” 20.5 “user2,107” 4.0 “user3,101” 40.0 “user3,102” 18.5 “user3,103” 24.5 “user3,104” 38.0 “user3,105” 26.0 “user3,106” 16.5 “user3,107” 15.5 “user4,101” 63.0 “user4,102” 37.0 “user4,103” 53.5 “user4,104” 55.0 “user4,105” 26.0 “user4,106” 33.0 “user4,107” 9.5 “user5,101” 68.0 “user5,102” 42.5 “user5,103” 56.5 “user5,104” 59.0 “user5,105” 32.0 “user5,106” 34.5 “user5,107” 11.5 |
上面的数据就显示出来的 每个用户 对应的每个商品的 评分矩阵,其中按用户评分排序就可以知道需要推荐的商品是哪个了。
1 2 3 4 5 6 7 | “user1,101” 44.0 买过评分过 过滤掉 “user1,103” 39.0 买过评分过 过滤掉 “user1,104” 33.5 需要推荐的商品 1 “user1,102” 31.5 买过评分过 过滤掉 “user1,105” 15.5 需要推荐的商品 2 “user1,106” 18.0 需要推荐的商品 3 “user1,107” 5.0 需要推荐的商品 4 |
文章转载来自:trustauth.cn