Airbnb Report:

Background:

Airbnb is a large international company whose business model involves providing a website, and associated guidelines, through which hosts can let and guests can rent short- and long-term accommodation. Artificial intelligence (AI) methodologies are increasingly being applied to both understand and predict Airbnb-related data. Earlier work predicted listing performance using established performance metrics (Kirkos, 2022). More recently, machine learning algorithms have been developed to predict rental prices without using amenity-driven features (Ghosh et al., 2023). That study reviewed 5 major US cities and found notable variability in predictive performance between them. Notably, it did not include data on New York listings, and it would therefore be appropriate to develop a region-specific model with high predictive performance for this market.

Aim:

To develop a model that uses the key features of Airbnb listings to predict the expected listing price for daily rental. This model will help Airbnb recommend listing rates to both current and new hosts, so that their prices fall within the expected range.

Descriptive statistics of current listings:

The data are publicly available and describe Airbnb listings in New York City (DGOMONOV, 2019). The dataset contains 48,895 instances and 16 features. Null values were identified in listing name, host name, last review date, and reviews per month, whilst 11 listings had a list price of $0.00. Means ± standard deviations were price (153 ± 240), minimum nights (7 ± 20), number of reviews (23 ± 44), reviews per month (1 ± 2), and host listing count (7 ± 33). The most common categorical values were Manhattan (neighbourhood group), Williamsburg (neighbourhood), and Entire home/apt (room type). Preliminary data cleaning was undertaken by harmonising data types and reformatting missing values. Price data were positively skewed, so exploratory analysis was undertaken both with and without removal of outliers (Holmes, Illowsky and Dean, 2017).
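The outlier rule is not stated above. As a minimal sketch, assuming the common 1.5 × IQR fences (an assumption, not necessarily the rule applied in this analysis) and a handful of hypothetical prices, removing zero-priced listings and then outliers could look like this:

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical daily prices in dollars, including a $0 listing and one extreme value.
prices = [0, 60, 75, 90, 100, 120, 150, 180, 200, 2500]

# Drop zero-priced listings first, then keep only values inside the fences.
positive = [p for p in prices if p > 0]
lower, upper = iqr_bounds(positive)
kept = [p for p in positive if lower <= p <= upper]
```

Running the exploratory plots on both `positive` and `kept` mirrors the "with and without outliers" comparison described above.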

Exploratory analysis:

Correlation analysis across multiple features showed associations of varying strength (Figure 1); as expected, the strongest was between number of reviews and reviews per month. Given our aim, we investigated price further and found significant variation both by location and between property types, as visualised in Figure 2 and Figure 3. This variation remained after normalising the data and removing outliers.
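For illustration, a pairwise Pearson correlation matrix of the kind summarised in Figure 1 can be sketched as follows. The column names and values are hypothetical stand-ins for the dataset's numerical features, and the helper `pearson` is written out by hand only to keep the example self-contained:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical values for three numerical features.
columns = {
    "price": [150, 80, 200, 60, 120],
    "number_of_reviews": [10, 40, 5, 80, 20],
    "reviews_per_month": [0.5, 2.0, 0.3, 4.1, 1.0],
}

# Build the full pairwise matrix as a nested dictionary.
names = list(columns)
matrix = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}
```

In this toy data the two review features are almost perfectly correlated, which is the same pattern the analysis reports as its strongest association.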

Figure 1. Multiple feature correlation

Figure 2. Price by location

Figure 3. Price by room type (without outliers)

Clustering of Current Listings:

Further exploratory analysis was undertaken using unsupervised K-means clustering to identify the current groups of hosts using Airbnb. We used recognised techniques (the elbow method and the silhouette method) to identify the most likely number of distinct clusters. Seven identifiable clusters of host properties were found, with cluster-specific features shown in Table 1 and cluster separation visualised in Figure 4 using principal component analysis (PCA). As expected, many hosts can be effectively clustered according to price, property type, and location. The clustering also uncovers groups of hosts who would particularly benefit from predictive pricing models. These include the groups defined by budget or luxury accommodation, for whom such a model would help to predict the category of the host’s property. ‘Mega-hosts’ were also identified, who either have many rental properties or have highly reviewed rental properties. These frequent users of Airbnb would especially benefit from a model that reduces the effort required for listing.
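The selection of the number of clusters can be sketched as follows. This is a self-contained toy version under stated assumptions: the 2-D points are synthetic (three obvious groups, not the seven clusters reported here), the farthest-point initialisation is a simplification chosen for determinism, and the helper names are hypothetical. A real analysis would more likely use a library implementation.

```python
from math import dist

def kmeans(points, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centres = [points[0]]
    while len(centres) < k:
        # Next centre: the point farthest from its nearest existing centre.
        centres.append(max(points, key=lambda p: min(dist(p, c) for c in centres)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centres[j])) for p in points]
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centre if a cluster ends up empty.
            new_centres.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else centres[j]
            )
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres

def inertia(points, labels, centres):
    """Within-cluster sum of squared distances (the quantity plotted for the elbow method)."""
    return sum(dist(p, centres[lab]) ** 2 for p, lab in zip(points, labels))

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    k = max(labels) + 1
    scores = []
    for i, p in enumerate(points):
        mean_d = []
        for j in range(k):
            others = [q for m, q in enumerate(points) if labels[m] == j and m != i]
            mean_d.append(sum(dist(p, q) for q in others) / len(others) if others else None)
        a = mean_d[labels[i]]
        b = min(d for j, d in enumerate(mean_d) if j != labels[i] and d is not None)
        scores.append(0.0 if a is None else (b - a) / max(a, b))
    return sum(scores) / len(scores)

# Synthetic data: three well-separated groups of four points each.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 1), (20, 2), (21, 1), (21, 2)]

inertias, silhouettes = {}, {}
for k in range(1, 5):
    labels, centres = kmeans(points, k)
    inertias[k] = inertia(points, labels, centres)
    if k > 1:  # the silhouette coefficient needs at least two clusters
        silhouettes[k] = silhouette(points, labels)

best_k = max(silhouettes, key=silhouettes.get)
```

The elbow method looks for the value of k after which `inertias` stops dropping sharply, while the silhouette method picks the k with the highest mean score; on this toy data both point to the same k.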

Figure 4. Visualised clusters (PCA)

Table 1. Unsupervised clusters

Cluster	Defining features
1	General mixed properties, predominantly in Manhattan
2	Entire homes/apartments provided by various hosts
3	Entire homes/apartments (all provided by 2 hosts), typically in Manhattan
4	Low-price private rooms
5	Highly reviewed mixed properties
6	Low-price mixed properties
7	High-price entire homes/apartments

Modelling:

Prior to model training, a logarithmic transformation was applied to the target variable, 'price', to mitigate the skewness observed in its distribution, in line with common practice in statistical modelling (Osborne, 2010). Missing values were handled by imputing the mean of the respective variable, a method discussed by Schafer and Graham (2002) as suitable only under certain conditions. Numerical variables were standardised to unit variance, a common step in regression analysis that aids model convergence (James et al., 2013).
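The three preprocessing steps can be sketched as follows. This is a minimal illustration on hypothetical values, assuming a natural logarithm (the base is not stated above) and assuming zero-priced listings have already been removed, since the logarithm of zero is undefined. The helper names are illustrative only.

```python
from math import exp, log
from statistics import mean, pstdev

def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def standardise(values):
    """Centre on zero and scale to unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical target: prices in dollars, all strictly positive.
prices = [50, 100, 150, 400]
log_prices = [log(p) for p in prices]

# Hypothetical feature with one missing value.
reviews_per_month = [0.5, None, 1.5, 4.0]
imputed = impute_mean(reviews_per_month)  # the gap is filled with 2.0
scaled = standardise(imputed)

# A prediction made on the log scale is mapped back to dollars with exp.
predicted_price = exp(log_prices[1])
```

One design point worth noting: to avoid leaking information from the test set, the imputation mean and the scaling parameters would normally be computed on the training data only and then reused for the test data.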

For model selection, the dataset was split into a training set (70%) and a test set (30%), and four regression models were fitted: Linear Regression, Ridge Regression, Lasso Regression, and ElasticNet Regression. This approach aimed to avoid the limitations of relying solely on plain linear regression and to improve the robustness of the analysis by comparing regularised alternatives (Hastie et al., 2001). Model performance was primarily evaluated using mean squared error (MSE) regression loss, reported in Table 2 as its square root (RMSE) alongside mean absolute error (MAE) and the coefficient of determination (R²).
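To illustrate how the four models differ, the following is a minimal sketch for the special case of a single feature with no intercept, where each penalised least-squares problem has a closed-form solution. The synthetic data, the penalty strengths (`l1`, `l2`), and the helper names are assumptions made for illustration; they are not taken from the analysis itself, which would normally rely on a library implementation with many features.

```python
import random

def soft_threshold(z, t):
    """Shrink z towards zero by t, clipping at zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def fit_slope(x, y, l1=0.0, l2=0.0):
    """Slope b minimising sum((y - b*x)^2) + l1*|b| + l2*b^2 (one feature, no intercept)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return soft_threshold(sxy, l1 / 2) / (sxx + l2)

# Synthetic data: y is roughly 2x with a small deterministic wobble.
x = [i - 4.5 for i in range(10)]
y = [2 * a + (0.5 if i % 2 else -0.5) for i, a in enumerate(x)]

# 70/30 split with a fixed seed so the example is reproducible.
idx = list(range(len(x)))
random.Random(0).shuffle(idx)
cut = len(idx) * 7 // 10
train, test = idx[:cut], idx[cut:]
xtr, ytr = [x[i] for i in train], [y[i] for i in train]

# Hypothetical penalty strengths; l1 = l2 = 0 recovers ordinary least squares.
models = {
    "linear": fit_slope(xtr, ytr),
    "ridge": fit_slope(xtr, ytr, l2=5.0),
    "lasso": fit_slope(xtr, ytr, l1=5.0),
    "elasticnet": fit_slope(xtr, ytr, l1=5.0, l2=5.0),
}

def mse(b, rows):
    """Mean squared error of slope b on the given row indices."""
    return sum((y[i] - b * x[i]) ** 2 for i in rows) / len(rows)

test_mse = {name: mse(b, test) for name, b in models.items()}
```

Ridge shrinks the coefficient by enlarging the denominator, lasso shrinks it by subtracting a fixed amount (and can set it exactly to zero), and elastic net does both. When the penalties are small relative to the data, all four fits become very similar.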

Table 2. Model performance metrics

Model	Mean Absolute Error (MAE)	Root Mean Square Error (RMSE)	R²
Linear	0.370140	0.523551	0.446063
Ridge	0.370140	0.523551	0.446063
Lasso	0.370153	0.523620	0.445918
ElasticNet	0.370481	0.524137	0.444822

Mean Absolute Error (MAE): A lower value is better. It is the average absolute difference between our predictions and the actual values.
Root Mean Square Error (RMSE): A lower value is better here too. It is the square root of the mean squared difference between predictions and actual values, so it penalises large errors more heavily than MAE does.
Coefficient of Determination (R²): A higher value is better. It is the proportion of the variation in the target that our model explains; the closer R² is to 1 (100%), the better the fit.
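The three metrics defined above can be computed directly from their definitions. The sketch below uses hypothetical actual and predicted values, and the function name `regression_metrics` is illustrative only:

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return mae, rmse, 1 - ss_res / ss_tot

# Hypothetical values on the log-price scale.
y_true = [4.0, 5.0, 6.0, 7.0]
y_pred = [4.5, 5.0, 5.5, 7.0]
mae, rmse, r2 = regression_metrics(y_true, y_pred)
```

If, as the magnitudes in Table 2 suggest, the reported errors were computed on natural-log-transformed prices (an assumption, since the scale is not stated), then an MAE of about 0.37 would correspond to a typical multiplicative error of roughly exp(0.37) ≈ 1.45 in the original price.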

Conclusion and Limitations:

The study developed a predictive model for Airbnb listing prices in New York, using regularised linear regression to address data complexity and variability. However, there are notable limitations and areas for improvement. The exclusion of New York listings from earlier work (Ghosh et al., 2023) motivated this region-specific model, but the models showed only moderate predictive power, with R² of approximately 0.45 and near-identical MAE and RMSE across all four models (Table 2). Additionally, the reliance on mean substitution for missing values may introduce bias (Schafer and Graham, 2002). Future recommendations include further region-specific modelling, more advanced imputation techniques, diversified data sources, and further model refinement through feature engineering or parameter tuning.

References:

DGOMONOV (2019). New York City Airbnb Open Data. [online] www.kaggle.com. Available at: https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data/data [Accessed 8 Dec. 2023].

Ghosh, I., Jana, R. K., & Abedin, M. Z. (2023). An ensemble machine learning framework for Airbnb rental price modelling without using amenity-driven features. International Journal of Contemporary Hospitality Management, 35(10), 3592–3611.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Available at: https://books.google.co.uk/books?id=VRzITwgNV2UC&printsec=frontcover&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false [Accessed 15 Dec. 2023].

Holmes, A., Illowsky, B. and Dean, S. (2017). Introductory business statistics. Houston, Texas: Openstax College, Rice University.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (with Applications in R). Available at: https://www.stat.berkeley.edu/users/rabbee/s154/ISLR_First_Printing.pdf [Accessed 15 Dec. 2023].

Kirkos, E. (2022). Airbnb listings’ performance: determinants and predictive models. European Journal of Tourism Research, 30, 3012–3012. https://doi.org/10.54055/EJTR.V30I.2142

Osborne, J. W. (2010). Improving your data transformations: Applying the Box-Cox transformation. Practical Assessment, Research, and Evaluation, 15(12), p.3. Available at: https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1238&context=pare [Accessed 15 Dec. 2023].

Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177. Available at: https://sci2s.ugr.es/keel/pdf/specific/articulo/Schafer_Graham02.pdf [Accessed 15 Dec. 2023].