How to Clean Your SEM Data for Machine Learning
Highly underrated but hugely relevant is the second step in the machine learning process: data cleaning. Let's stick with the gold-rush analogy from chapter one. Even if you dig in the right location, what you'll hold in your hands first is mostly mud, and you have to wash it away to find the valuable stuff you're after.
The same goes for machine learning. Even if your data collection runs smoothly, the CSV file you pull from Google or Bing might be useless due to irrelevant columns, formatting errors or statistical errors. Google Analytics in particular is notorious for providing damaged columns in CSV files.
Possible reasons why you might have to clean your SEM dataset:
1. Missing Data
Imagine you're facing tracking issues and now have to deal with gaps in your performance data. What now? Quo vadis, Search Marketer?
First of all: never ignore missing data and values. Otherwise you'll be in trouble later on, and most likely your algorithm won't work properly. The last thing we want to do is build our model on incomplete or wrong data. Sounds reasonable, right?
How to solve: The best way to avoid missing data is to ensure smooth tracking. Nevertheless, when things have already gone wrong, it's highly recommended not to simply drop the affected columns. Better: tell your algorithm that data is missing by labeling it and creating new data classes.
When it comes to missing numeric data, you should additionally fill the affected cells with a zero value to make them readable for the algorithm and meet the technical requirements.
Fortunately, libraries like pandas come with several functions to check for and replace missing values (numerical and non-numerical) in your dataset, such as isnull(), fillna() or replace().
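Here's a minimal sketch of that workflow in pandas. The column names (keyword, clicks, device) and the " --" placeholder string are assumptions standing in for whatever your own export contains:

```python
import numpy as np
import pandas as pd

# Hypothetical SEM export; column names and values are assumptions.
df = pd.DataFrame({
    "keyword": ["red shoes", "blue boots", None, "sandals"],
    "clicks": [120, np.nan, 45, 80],
    "device": ["mobile", "desktop", " --", "mobile"],  # " --" stands in for a placeholder string
})

# Check which columns contain missing values
print(df.isnull().sum())

# Numeric gaps: fill with zero so every cell is machine-readable
df["clicks"] = df["clicks"].fillna(0)

# Non-numeric gaps: label them as their own class instead of dropping rows
df["keyword"] = df["keyword"].fillna("missing_keyword")

# Placeholder strings can be normalized with replace()
df["device"] = df["device"].replace(" --", "unknown")
```

Labeling missing categories as their own class (rather than dropping rows) preserves the signal that data was missing, which the model can learn from.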
2. Outliers
Outliers are data points that are distant from the other values in your dataset: anomalies in your usually steady performance numbers.
Huge peaks and drops in SEM performance data always require deeper analysis before you decide to drop certain values, which is not always recommended. Peaks in search engine advertising are quite common; they can be caused by weekend searches, seasonality, and even good or bad weather.
How to solve: Remove or fix? That's the crucial question. The following steps are common and recommended for outlier handling.
A. Figure out the reason for the outlier. Is it just seasonality? Is it just a formatting error? Or do you have a serious issue in your data measurement and tracking?
B. If it's actually bad data caused by tracking issues, remove it and drop the affected values.
C. If you have to deal with problematic outliers regularly, consider algorithms that are less sensitive to outliers. Random Forest algorithms are a great choice for more robust modeling.
D. If your data is correct and the outliers are caused by seasonality or increased bids, you should keep the data within a standard-deviation range and ignore everything outside it. In scikit-learn (the machine learning framework we use for all examples on this page), you can rely on the preprocessing package for standardization. It provides a bunch of functions to prepare raw data before pushing it into a model; its robust scaling tools are especially useful when outliers are present.
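Both approaches from step D can be sketched briefly. The click numbers below are made up, with one deliberate peak; the two-standard-deviation threshold is an assumption you would tune to your own data:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical daily click counts with one extreme peak
clicks = pd.Series([100, 110, 95, 105, 98, 1200, 102, 97])

# Option 1: keep only values within two standard deviations of the mean
mean, std = clicks.mean(), clicks.std()
within_range = clicks[(clicks - mean).abs() <= 2 * std]

# Option 2: scale with RobustScaler, which centers on the median and
# scales by the interquartile range, so the 1200-click peak has far
# less influence than it would with ordinary standardization
scaler = RobustScaler()
scaled = scaler.fit_transform(clicks.to_frame())
```

Option 1 discards the outlier entirely; option 2 keeps it but limits its leverage on the scaled feature, which pairs well with the robust models mentioned in step C.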
3. Irrelevant Columns
Depending on your subject and goal, there may be a couple of columns that don't contribute to the objective. Columns like 'Keyword Status' or 'Currency' are two typical examples.
How to solve: The key takeaway here: focus on the columns most relevant to your goal to make data handling easier and computational processes faster later on.
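In pandas this is a one-liner either way, dropping the unwanted columns or selecting the wanted ones. The column names below are assumptions mirroring a typical ads export:

```python
import pandas as pd

# Hypothetical Google Ads export; column names are assumptions.
df = pd.DataFrame({
    "Keyword": ["red shoes", "blue boots"],
    "Clicks": [120, 80],
    "Conversions": [4, 2],
    "Keyword Status": ["enabled", "enabled"],
    "Currency": ["EUR", "EUR"],
})

# Drop columns that don't contribute to the modeling objective...
df = df.drop(columns=["Keyword Status", "Currency"])

# ...or, equivalently, select only the relevant ones:
# df = df[["Keyword", "Clicks", "Conversions"]]
```

Explicitly selecting the columns you want is often safer than dropping, because new columns in a future export won't silently sneak into your model.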
4. Wrong format in Cells
Who hasn't been annoyed by CSV formatting issues rooted in keyword operators, decimal points instead of commas (or the other way around), and my favorite: the 'date instead of number' issue?
How to solve: A life hack that helps most of the time: push your data from the Google Ads dashboard into a spreadsheet first, then download it from there as a CSV file to get a clean file.
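If you'd rather fix the format in code, pandas can handle comma decimals at read time. The semicolon-separated sample below is a made-up stand-in for a European-style export:

```python
import io
import pandas as pd

# Hypothetical German-style CSV: semicolon separator, comma as decimal mark
raw = "Keyword;Cost;CTR\nred shoes;12,50;3,2\nblue boots;8,75;1,9\n"

# decimal="," parses 12,50 as the float 12.5 instead of a string
df = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")

# For columns a spreadsheet already mangled, coerce them back to numbers;
# anything unparseable becomes NaN so you can inspect it afterwards
df["Cost"] = pd.to_numeric(df["Cost"], errors="coerce")
```

Setting `errors="coerce"` turns broken cells into NaN rather than raising, so the missing-data techniques from section 1 can take over from there.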