By: Cici Zhao
Today, people spend more and more time on electronic devices, and in-app advertising has become an effective promotional channel for companies. In this project, our objective is to build models that help advertisers find potentially popular apps in the App Store. We scraped data from the App Store through the iTunes API and applied feature engineering to it via text analytics. We defined apps with top-10% performance as the advertisers' goal, but in the real world users can customize the quantile based on their own needs. We performed linear and logistic regression, neural networks, KNN, decision trees, and random forests, each of which we explain in detail later in the report.
- Daily rating count performance prediction:
‘daily_cnt’ is the most important column in the whole dataset, as it is the metric we use to measure an app’s popularity. The information the App Store gave us does not include the exact number of downloads, so among all the available information we used the number of ratings as our measure of popularity. Going one step further, since apps have been available for different lengths of time, we took the average daily count of ratings as our final dependent variable (daily_cnt = total count of ratings / time on the store). Our objective was to correctly classify apps with high daily rating counts.
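The dependent variable follows directly from the scraped fields. A minimal sketch of the computation, assuming hypothetical column names `rating_count` and `released_date` (not necessarily the exact names in our files):

```python
import pandas as pd

# Toy frame standing in for the scraped App Store data.
apps = pd.DataFrame({
    "app_id": [1, 2],
    "rating_count": [3650, 200],              # total number of ratings
    "released_date": ["2015-01-01", "2018-01-01"],
})

snapshot = pd.Timestamp("2019-01-01")         # assumed date of the scrape
days_live = (snapshot - pd.to_datetime(apps["released_date"])).dt.days

# daily_cnt = total count of ratings / time on the store (in days)
apps["daily_cnt"] = apps["rating_count"] / days_live
```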
- Model Selection Criteria:
From the advertisers’ perspective, precision matters more than a low RMSE or high accuracy, because advertisers care most about placing ad campaigns on apps that actually become popular, so their promotion budgets are fully utilized. Moreover, because the data are naturally imbalanced, all models could reach around 90% accuracy. Therefore, precision is our top-ranked model selection criterion, ahead of accuracy and RMSE.
It is not feasible to measure promotion effects from apps’ own marketing strategies, or to obtain such data from producers. Therefore, our models leave the apps’ own advertising out of account, and when predicting we assume that the new app under evaluation has not yet conducted marketing promotion or bought search ads.
Given that promotion effects are out of consideration, an app’s discoverability on the App Store becomes the key factor driving its download volume. Imagine that PUBG had no advertising and nobody knew about it: how could someone find the game without spelling its name exactly right in the search bar? By contrast, even if we do not know that Google has launched a map product, we may still find it by typing the word ‘map’ into the search bar. We therefore summarize the aspects that influence an app’s discoverability.
According to Apple’s search optimization documentation, accurate keywords, the primary category, and positive ratings are all of great importance. Search ranking mainly depends on text relevance, i.e., the app’s title, keywords, and primary category, and is also closely related to user behavior, i.e., downloads and the number and quality of ratings and reviews. In addition, the developer’s name can be searched directly in the App Store and therefore affects the app’s discoverability.
To collect the data, we randomly generated three-letter search terms and gathered the search results returned for those terms by the iTunes API. Each request returned a JSON file with at most 200 results, and we gathered more than 260,000 records from this scraping. Parsing the nodes of each JSON response, we obtained app information including name, app id, rating score, number of ratings, genre, release date, seller, etc. After dropping rows with NAs and removing duplicates, we ended up with 57,000 unique apps in our dataset. We divided the data into train, test, and validation sets in the traditional 50:25:25 ratio. To make sure we could compare our models side by side, we wrote the three subsets to individual files, so that each of us would train and test on exactly the same data regardless of the randomness in the split. (This problem could also be solved by setting a seed in R; we split the data this way specifically because some of us modeled in Python instead of R.) In total, we had 30,791 samples for training, 15,395 for validation, and 15,396 for testing.
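The one-time split described above can be sketched as follows; the 50:25:25 ratio is from the report, while the toy data and the commented-out file names are placeholders:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in for the cleaned dataset of unique apps.
apps = pd.DataFrame({"app_id": range(1000), "daily_cnt": rng.random(1000)})

# Shuffle once, then cut 50:25:25 into train / validation / test.
shuffled = apps.sample(frac=1, random_state=0).reset_index(drop=True)
n = len(shuffled)
train = shuffled.iloc[: n // 2]
valid = shuffled.iloc[n // 2 : (3 * n) // 4]
test = shuffled.iloc[(3 * n) // 4 :]

# Each subset is saved to disk once, so every team member (R or Python)
# trains and tests on identical rows:
# train.to_csv("train.csv", index=False)   # likewise for valid / test
```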
For this part, we conducted text mining and built a word frequency table from the selected part of our dataset. Keyword popularity changes over time, but word frequency reflects the overall demand for a certain kind of app; with higher demand, apps are more likely to be searched for and downloaded. We gave every app name a keyword score based on the keywords it contains and the popularity weight of each keyword within its genre, and we also used the 30 most frequent words to test keyword performance from different perspectives. For each of the top-30 words, we encoded 1 if the word appears in the app name and 0 if it does not; in the dataset these features are named ‘t30’ plus the word.
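The top-30 indicator features reduce to simple string matching. A sketch, assuming the app name lives in a column called `name`; the three words below are illustrative stand-ins, not our actual top-30 list:

```python
import pandas as pd

apps = pd.DataFrame({"name": ["Photo Editor Pro", "City Map", "Sudoku"]})

# Hypothetical stand-ins for the 30 most frequent words in app names.
top30 = ["editor", "map", "photo"]

# 1 if the word appears in the app name, 0 otherwise; feature named "t30"+word.
for word in top30:
    apps["t30" + word] = (
        apps["name"].str.lower().str.contains(word, regex=False).astype(int)
    )
```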
The app’s seller is a good indicator of whether the app is worth investing in. There are tens of thousands of different sellers in our dataset, so it is not possible to keep each of them as its own category. We therefore divided sellers into three groups according to the total number of their products: group one includes sellers with more than 100 apps in our dataset, sellers in group two have between 50 and 100 apps, and sellers in group three have fewer than 50 apps. We added this group column to the dataset and deleted the original seller column.
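The grouping amounts to counting apps per seller and binning the counts. A minimal sketch; the 50 and 100 thresholds are from the report, while the column names and toy data are assumptions:

```python
import pandas as pd

# Toy data: seller A has 120 apps, B has 60, C has 3.
apps = pd.DataFrame({"seller": ["A"] * 120 + ["B"] * 60 + ["C"] * 3})

# Number of apps each seller has in the dataset.
counts = apps["seller"].map(apps["seller"].value_counts())

# Group 1: >100 apps; group 2: 50-100 apps; group 3: <50 apps.
apps["seller_group"] = pd.cut(
    counts, bins=[0, 49, 100, float("inf")], labels=[3, 2, 1]
).astype(int)

apps = apps.drop(columns="seller")  # replace seller with its group
```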
Models and Results
Linear and Logistic Regression
First, we tried linear regression and logistic regression, two widely used statistical models.
In linear regression, we fit a multiple linear regression with 65 explanatory variables to predict the daily count of ratings and to check whether a linear relationship exists between the features and the response. It turns out that although several variables are significant and the overall p-value is extremely small, the adjusted R-squared of the model is 0.01516, meaning that only 1.516% of the variation in the response variable is explained by this linear model. Moreover, to evaluate how well the model predicts whether an app is worth investing in, we tested it on the test data and applied a 90th-percentile threshold to the predictions: if the predicted daily rating count is above the 90% quantile, we label the app as good to invest in, and otherwise as bad to invest in. Finally, we compared the actual labels of the test data with our predictions. The accuracy is 86.74%, which is not low. However, since our data consist of a much larger share of “bad to invest” apps, accuracy is not a good reflection of model quality. Investors want to know whether their investment will pay off, so precision is a better criterion: precision measures the proportion of cases identified as positive that are actually positive. Here the precision is only 33.76%, meaning 66.24% of investments would not be successful.
In the logistic model, we again used the 90% quantile to label daily rating counts as valuable or not. The logistic model outperforms the linear regression model: the accuracy on the test data is 87.9%, and the precision is 39.6%.
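The evaluation step used for both models above can be sketched as follows: threshold the true values and the predictions at their 90% quantiles, then compute precision. The synthetic "model" here is a placeholder for the fitted regression:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic true daily rating counts and noisy model predictions.
y_true = rng.exponential(1.0, size=1000)
y_pred = y_true + rng.normal(0.0, 0.5, size=1000)

# Label the top 10% as good to invest (1), the rest as bad (0).
true_label = (y_true > np.quantile(y_true, 0.9)).astype(int)
pred_label = (y_pred > np.quantile(y_pred, 0.9)).astype(int)

# Precision: of the apps we flagged as popular, how many really are?
tp = np.sum((pred_label == 1) & (true_label == 1))
precision = tp / pred_label.sum()
```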
Artificial Neural Network
Besides linear and logistic regression, the artificial neural network (ANN) is another statistical approach, inspired by the functions and mechanisms of the human brain, that can be used for prediction and classification. ANNs are usually nonlinear models and are able to capture nonlinear relationships between the features and the dependent variable. As the results from the linear and logistic models were not good enough, we applied an ANN to look for such a nonlinear relationship.
To balance the computational expense, we trained an ANN with 2 nodes in the hidden layer to predict the response variable, the daily count of ratings. The 90th percentile was again used as the evaluation cutoff, so apps with top-10% performance were labeled 1 and the rest 0. To maximize precision, we first needed to select appropriate features. With 65 features in total, we applied backward selection. With all input features included, the precision on the validation data was 35.7%. In the first round of selection, the feature “music” was dropped and the validation precision rose to 43.5%. The column “t30editor” was dropped in the second round, raising the validation precision to 54.7%. Therefore, the variables “music” and “t30editor” were excluded from the final model. With the final model, classification accuracy on the training, validation, and test sets was around 90%, and the precision was 58.1% on train, 54.7% on validation, and 51.8% on test.
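A network of this size can be sketched with scikit-learn's MLPClassifier. This is an illustrative stand-in, not the exact model from the report: the data are synthetic, and 63 features mimic the count after dropping "music" and "t30editor":

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_score

rng = np.random.default_rng(2)

# Synthetic stand-in for the 63 remaining features and a binary label.
X = rng.normal(size=(500, 63))
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.5, 500) > 1.5).astype(int)

# One hidden layer with 2 nodes, as in the report.
ann = MLPClassifier(hidden_layer_sizes=(2,), max_iter=2000, random_state=0)
ann.fit(X, y)

train_precision = precision_score(y, ann.predict(X), zero_division=0)
```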
Besides the parametric models above, a non-parametric model was also applied in the analysis.
Here we used the k-nearest neighbors (k-NN) algorithm for classification. Since the weight of each feature was hard to define, we treated all features as equally weighted.
First, we needed to define the cutoff. Since our objective was to identify the so-called potential apps, we divided apps into 2 groups: those below the 90% quantile of daily rating counts were labeled 0, and the rest 1. This rule was applied to the train, validation, and test data, yielding cutoffs of 3.6705, 3.7124, and 3.3840, respectively. Any app with a daily rating count above the cutoff is regarded as a potentially popular app for investors.
In k-NN classification, the output is a class membership: an object is classified by a plurality vote of its neighbors and assigned to the class most common among its k nearest neighbors. Since k should be a relatively small integer, we built k-NN models with k from 1 to 20 and chose the model with the most satisfactory precision at the optimal k.
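The sweep over k can be sketched as follows; the data here are synthetic placeholders for our feature matrix, with roughly 10% positives to mimic the class imbalance:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score

rng = np.random.default_rng(3)

X_train = rng.normal(size=(400, 5))
y_train = (X_train[:, 0] > 1.2).astype(int)   # ~10% positives
X_valid = rng.normal(size=(200, 5))
y_valid = (X_valid[:, 0] > 1.2).astype(int)

# Fit k-NN for k = 1..20 and record validation precision for each k.
precisions = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    precisions[k] = precision_score(y_valid, knn.predict(X_valid),
                                    zero_division=0)

best_k = max(precisions, key=precisions.get)
```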
Among all our k-NN models with different k, the accuracies were always around 90% and the RMSEs around 0.3 (detailed results in the Appendix). To keep k as small as possible while still obtaining a satisfactory model, we picked the first local peak in accuracy and the first local valley in RMSE as criteria for the optimal k. Both rules select the model with k = 4.
However, since precision is our top-ranked model selection criterion, the first local peak in precision gives the optimal model at k = 8, with precision around 60.66%.
Using the model with k = 8 on the test data, the accuracy is 0.9058, the RMSE is 0.3069, and the precision is 60.87%.
Decision Tree and Random Forest
A decision tree is a tree-like model that assists in making decisions. A random forest is an ensemble of multiple decision trees whose result is determined by a vote of the trees inside. In our project, we ran classification models with a decision tree and a random forest, respectively.
For classification, we first sorted the data by the average daily rating count of each app, then labeled the bottom 90% as 0, meaning not popular in the market, and the rest as 1. Applying the same logic to all three subsets (train, test, and validation), we initially obtained a model with high accuracy, over 90%, but low precision, around 30%. Since the objective of our project is to advise people on investing in more popular apps, accuracy matters less to us than precision, which in this context measures whether the predicted popular apps are indeed popular in the market. We suspected the low precision was caused by the large share of zeros in the data (90%). Hence, to improve precision, we re-sampled the training data: after sorting and labeling, we kept all the ones and randomly kept 2/9 of the zero cases. After adjusting the data, the accuracy of the decision tree model dropped, but precision increased to over 70% on all three subsets.
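The re-balancing step (keep all ones, randomly keep 2/9 of the zeros, which leaves roughly one-third positives) plus the tree fit can be sketched as follows, with synthetic data and scikit-learn standing in for whatever package was actually used:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)

# Synthetic features; top 10% of column "a" plays the role of "popular".
df = pd.DataFrame(rng.normal(size=(1000, 4)), columns=list("abcd"))
df["label"] = (df["a"] > df["a"].quantile(0.9)).astype(int)

ones = df[df["label"] == 1]
zeros = df[df["label"] == 0].sample(frac=2 / 9, random_state=0)  # keep 2/9
balanced = pd.concat([ones, zeros])  # ~1/3 positives, ~2/3 negatives

tree = DecisionTreeClassifier(random_state=0)
tree.fit(balanced[list("abcd")], balanced["label"])
```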
Similarly, we re-balanced the training set for the random forest to two-thirds negatives and one-third positives, as above. This model also achieves high accuracy and high precision.
As shown above, the decision tree model achieved the most satisfying precision, 71.1%, on the test set, and we judge it the most reliable model. Advertisers can use our model to evaluate whether an app will become popular and is worth promoting their products on.