
Kaggle Playground Approach

#Tech

In the Kaggle podcast listening time prediction competition, the dataset looks like this:

#   Column                       Non-Null Count   Dtype
0   id                           750000 non-null  int64
1   Podcast_Name                 750000 non-null  object
2   Episode_Title                750000 non-null  object
3   Episode_Length_minutes       662907 non-null  float64
4   Genre                        750000 non-null  object
5   Host_Popularity_percentage   750000 non-null  float64
6   Publication_Day              750000 non-null  object
7   Publication_Time             750000 non-null  object
8   Guest_Popularity_percentage  603970 non-null  float64
9   Number_of_Ads                749999 non-null  float64
10  Episode_Sentiment            750000 non-null  object
11  Listening_Time_minutes       750000 non-null  float64

The target is column 11, Listening_Time_minutes, the listening time.

I noticed one very interesting feature: the number of characters in the podcast name.

all_data['Podcast_Name_Length'] = all_data['Podcast_Name'].apply(lambda x: len(str(x)))

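Another feature is the episode number, extracted from the episode title:

def extract_episode_number(title):
    # Join every digit in the title into one integer; fall back to 0 when there are none.
    try:
        return int(''.join(filter(str.isdigit, str(title))))
    except ValueError:
        return 0

all_data['Episode_Number'] = all_data['Episode_Title'].apply(extract_episode_number)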
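A quick check of how this digit-joining extraction behaves may be useful. The sample titles below are made up for illustration (the post does not show actual titles), and `first_number` is a hypothetical alternative, not something used in the original solution.

```python
import re

def extract_episode_number(title):
    # Same logic as above: concatenate all digits, 0 if there are none.
    try:
        return int(''.join(filter(str.isdigit, str(title))))
    except ValueError:
        return 0

def first_number(title):
    # Hypothetical variant: keep only the first run of digits.
    match = re.search(r'\d+', str(title))
    return int(match.group()) if match else 0

print(extract_episode_number('Episode 26'))     # 26
print(extract_episode_number('S2 Episode 14'))  # 214 (digits from both numbers are joined)
print(first_number('S2 Episode 14'))            # 2
print(extract_episode_number('Pilot'))          # 0
```

If titles only ever contain a single number, the two functions agree; they differ only when a title carries several numbers.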

I combined these with some conventional features:

numerical_features = [
    'Episode_Length_minutes',
    'Host_Popularity_percentage',
    'Guest_Popularity_percentage',
    'Number_of_Ads',
    'Episode_Number',
    'Podcast_Name_Length',
    'Publication_Time_Encoded',
    'Publication_Day_Encoded',
    'Episode_Sentiment_Encoded'
]
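The post does not show how the three `*_Encoded` columns were built. One common option is an ordinal mapping, sketched below. The category values, their ordering, and the helper `ordinal_encode` are all assumptions for illustration, not the original code.

```python
import pandas as pd

# Toy stand-in for all_data; the category values here are assumed.
all_data = pd.DataFrame({
    'Publication_Day': ['Monday', 'Sunday', 'Wednesday'],
    'Publication_Time': ['Morning', 'Night', 'Afternoon'],
    'Episode_Sentiment': ['Positive', 'Negative', 'Neutral'],
})

# Assumed orderings; adjust to the values actually present in the data.
orders = {
    'Publication_Day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                        'Friday', 'Saturday', 'Sunday'],
    'Publication_Time': ['Morning', 'Afternoon', 'Evening', 'Night'],
    'Episode_Sentiment': ['Negative', 'Neutral', 'Positive'],
}

def ordinal_encode(series, order):
    # Map each category to its position in `order`; unseen values become NaN.
    mapping = {value: code for code, value in enumerate(order)}
    return series.map(mapping)

for column, order in orders.items():
    all_data[column + '_Encoded'] = ordinal_encode(all_data[column], order)

print(all_data.filter(like='_Encoded'))
```

An ordinal code keeps each variable in a single column, which suits tree models; the feature importance output further down suggests Genre was one-hot encoded instead.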

I used a random forest (RF) for prediction. In local testing it scored:
RF_RMSE: 12.6554
RF_R²: 0.7823
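For reference, a local evaluation of this kind could look roughly like the sketch below. It runs on synthetic data, and the split ratio, hyperparameters, and variable names are assumptions; the post does not state the actual settings or how missing values were handled.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: the first column loosely plays the role of episode length.
rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(0, 120, size=(n, 3))
y = 0.7 * X[:, 0] + rng.normal(0, 10, size=n)

# Hold out 20% of the rows as a local validation set (assumed ratio).
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_valid)

# RMSE is the competition metric; R² is reported alongside it.
rf_rmse = float(np.sqrt(mean_squared_error(y_valid, pred)))
rf_r2 = float(r2_score(y_valid, pred))
print(f'RF_RMSE: {rf_rmse:.4f}')
print(f'RF_R²: {rf_r2:.4f}')
```

The numbers printed here come from the synthetic data and are not comparable to the scores reported in the post.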

Finally, I used this approach to predict on the test set and examined the feature importances:

Feature_importance:
    Feature                      Importance
0   Episode_Length_minutes       0.777424
1   Host_Popularity_percentage   0.052293
2   Guest_Popularity_percentage  0.042145
4   Episode_Number               0.038016
5   Podcast_Name_Length          0.017724
7   Publication_Day_Encoded      0.016786
6   Publication_Time_Encoded     0.011157
3   Number_of_Ads                0.009616
8   Episode_Sentiment_Encoded    0.007954
13  Genre_Lifestyle              0.002979
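A table like this can be produced from a fitted random forest roughly as follows. This is a sketch on synthetic data with only three of the feature names; the model settings are assumptions, not the original configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in where only the first feature drives the target.
rng = np.random.default_rng(0)
feature_names = ['Episode_Length_minutes', 'Host_Popularity_percentage', 'Number_of_Ads']
X = pd.DataFrame(rng.uniform(0, 100, size=(1000, 3)), columns=feature_names)
y = 0.7 * X['Episode_Length_minutes'] + rng.normal(0, 5, size=1000)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances; they sum to 1 across all features.
feature_importance = pd.DataFrame({
    'Feature': feature_names,
    'Importance': model.feature_importances_,
}).sort_values('Importance', ascending=False)

print(feature_importance.head(10))
```

Sorting keeps the original row index, which is why the index column in the output above is not in order.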

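The features I found turned out to be fairly important: Episode_Number ranks fourth (0.038016) and Podcast_Name_Length fifth (0.017724).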

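The final score was 12.62056, which was in the top 10 at the time. Since this is a Playground competition and it runs for a long time, though, that score now sits at around 700/3000.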