In this project, I aimed to improve the accuracy of sales forecasting and streamline operations by using Python to build and compare multiple machine learning models.
As background, many companies currently perform sales forecasting using manual processes or simple regression models in Excel for their marketing operations.
However, these methods leave little room to improve accuracy and are hard to automate. I expected that automating sales forecasting with Python would be more effective.
Sample Data Overview
- Data Content: Sales and advertising data for an alcoholic beverage company
- Period: Weekly data for the past two years
- Key Metrics:
  - Sales
  - Digital Advertising Costs
  - OOH (Out-of-Home) Advertising
  - Gross Rating Points (GRP) for TV commercials
  - Print Advertising Spend
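To make the shape of the dataset concrete, here is a minimal sketch of a weekly table with the metrics listed above. The column names, start date, and all values are assumptions made up for illustration; they are synthetic and do not come from the project's actual data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_weeks = 104  # two years of weekly observations

# Hypothetical weekly index; the start date is an arbitrary Monday.
weeks = pd.date_range("2022-01-03", periods=n_weeks, freq="W-MON")

# Synthetic placeholder values, one column per metric.
df = pd.DataFrame(
    {
        "sales": rng.normal(1000, 100, n_weeks),
        "digital_cost": rng.uniform(50, 150, n_weeks),
        "ooh_cost": rng.uniform(10, 60, n_weeks),
        "tv_grp": rng.uniform(0, 300, n_weeks),
        "print_cost": rng.uniform(5, 40, n_weeks),
    },
    index=weeks,
)
df.index.name = "week"

print(df.head())
```

Keeping the week as a sorted datetime index makes it easier to split the data chronologically later, which matters for any time-series evaluation.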
Compared Models and Reasons for Selection
Each model has different characteristics and strengths, so I evaluated how well each one applies to sales forecasting.
- Linear Regression (OLS):
  - A simple method for modeling the basic relationship between sales and external variables.
  - Reason for Selection: Serves as a baseline model for comparison with more advanced methods.
- Prophet:
  - A time series forecasting model developed by Facebook that accounts for strong seasonality and external factors.
  - Reason for Selection: Allows for intuitive parameter adjustment for data with strong seasonal patterns.
- LightGBM:
  - A machine learning model that efficiently captures complex and non-linear interactions between variables.
  - Reason for Selection: Offers improved prediction accuracy when dealing with large-scale data or strong non-linear relationships.
- SARIMAX (Seasonal ARIMA with external regressors):
  - A method that combines external explanatory variables with the traditional ARIMA model to capture seasonal effects and external influences simultaneously.
  - Reason for Selection: Enables more realistic predictions by incorporating external factors alongside time-series trends.
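As an illustration of the baseline idea, the sketch below fits an ordinary least squares model with NumPy and scores it on a held-out final stretch of weeks. The data is synthetic, the coefficients and the 20-week holdout are arbitrary assumptions, and the helper `add_intercept` is hypothetical; this is not the project's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(42)
n_weeks = 104

# Synthetic regressors standing in for digital, OOH, TV GRP, and print spend.
X = rng.uniform(0, 100, size=(n_weeks, 4))
true_coef = np.array([2.0, 0.5, 1.5, 0.3])  # assumed effects, for illustration only
y = 500 + X @ true_coef + rng.normal(0, 10, n_weeks)

# Chronological split: train on the earlier weeks, test on the last 20.
# Shuffling would leak future information into the training set.
n_test = 20
X_train, X_test = X[:-n_test], X[-n_test:]
y_train, y_test = y[:-n_test], y[-n_test:]


def add_intercept(features):
    """Prepend a column of ones so the model learns a constant term."""
    return np.column_stack([np.ones(len(features)), features])


# Ordinary least squares via NumPy's least-squares solver.
coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)
y_pred = add_intercept(X_test) @ coef

baseline_mae = float(np.mean(np.abs(y_test - y_pred)))
print(f"Baseline MAE on held-out weeks: {baseline_mae:.2f}")
```

Any more advanced model can then be scored on the same held-out weeks, so that differences in error reflect the model rather than the split.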
Performance Comparison
I compared the predictive performance of each model using the following evaluation metrics. For all three metrics, lower values indicate higher prediction accuracy.
Since acceptable error ranges vary depending on data characteristics and industry standards, the ranges given below are general rules of thumb rather than fixed thresholds.
Root Mean Squared Error (RMSE):
- Overview: Calculates the square root of the mean of the squared differences between predicted values and actual observed values.
- Details:
  - Gives greater weight to larger errors, making it sensitive to outliers.
  - Expressed in the same units as the original data, making it intuitively understandable.
- Acceptable Range Guideline:
  - Generally, about 5-10% or less of the average sales is desirable, though this depends on the scale of the data.
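A minimal sketch of the RMSE calculation, including the comparison against average sales used in the guideline above. The function name `rmse` and the sample numbers are assumptions chosen for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between actual and predicted."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up weekly sales and forecasts; every forecast is off by exactly 5.
actual = [100.0, 120.0, 90.0, 110.0]
predicted = [105.0, 115.0, 95.0, 105.0]

error = rmse(actual, predicted)              # 5.0
relative = error / float(np.mean(actual))    # RMSE as a share of average sales

print(f"RMSE: {error:.2f} ({relative:.1%} of average sales)")
```

Dividing by the mean of the actual values gives the ratio that the 5-10% guideline refers to; here it is just under 5%.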
Mean Absolute Error (MAE):
- Overview: Calculates the average of the absolute differences between predicted and actual values.
- Details:
  - Since positive and negative errors don't offset each other, it provides a clear understanding of the overall average error.
  - Less influenced by extreme errors than RMSE because errors are not squared, making it suitable for practical error evaluation.
- Acceptable Range Guideline:
  - Often considered highly accurate if the error is within 5-10% of the average value of the target data.
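To show how MAE treats extreme errors differently from RMSE, the sketch below compares two made-up forecasts that have the same MAE: one with evenly spread errors and one with a single large miss. The function names and numbers are assumptions for illustration.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))


def rmse(y_true, y_pred):
    """Root mean squared error, included only for comparison."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


actual = [100.0, 100.0, 100.0, 100.0]
even_errors = [90.0, 110.0, 90.0, 110.0]   # every week off by 10
one_outlier = [100.0, 100.0, 100.0, 140.0]  # three perfect weeks, one off by 40

# Both forecasts have an MAE of 10, but the outlier doubles the RMSE.
print(mae(actual, even_errors), rmse(actual, even_errors))  # 10.0 10.0
print(mae(actual, one_outlier), rmse(actual, one_outlier))  # 10.0 20.0
```

Reporting both metrics side by side is a quick way to tell whether a model's error comes from many small misses or a few large ones.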
Mean Absolute Percentage Error (MAPE):
- Overview: Expresses the prediction error as a percentage of the actual value and calculates the average.
- Details:
  - Easy to compare across different scales of data, providing an intuitive understanding of "what percentage the overall error is."
  - Becomes heavily inflated when actual values are close to zero, and is undefined when an actual value is exactly zero, so caution is needed in these cases.
- Acceptable Range Guideline:
  - Generally, a MAPE below 10% is considered very good, 10-20% is good, 20-50% is acceptable, and above 50% requires improvement.
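The near-zero caveat above is easiest to see with numbers. In this sketch, the function name `mape` and the sample values are assumptions; the second case changes only one actual value to something close to zero while keeping the same absolute miss.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Assumes no actual value is exactly zero; a zero would cause a division by zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)


# Each forecast is off by 10% of the actual value.
normal = mape([100.0, 200.0, 50.0], [110.0, 180.0, 55.0])

# Same absolute miss of 5 on the last point, but the actual value is now 1,
# so that single point contributes a 500% error.
near_zero = mape([100.0, 200.0, 1.0], [110.0, 180.0, 6.0])

print(f"MAPE, typical values: {normal:.1f}%")         # 10.0%
print(f"MAPE, one near-zero actual: {near_zero:.1f}%")  # 173.3%
```

When the target series contains weeks with very low or zero sales, MAE or RMSE is usually the safer headline metric.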
Insights and Considerations
In this project, I visualized the prediction results and error distributions of each model using graphs and charts to clearly illustrate their strengths and weaknesses. This approach allows for intuitive understanding of data trends, seasonality, and the impact of outliers.
Moving forward, I plan to improve the versatility and accuracy of the models by incorporating longer-term data and information from additional marketing channels.
I will also work to further enhance prediction accuracy by exploring additional forecasting models and optimizing the hyperparameters of each model.