Why 1st-party data is the only data that really matters for Machine Learning
The currency of Machine Learning is data quality. The higher the quality, the better. Sounds logical. But how do we define ‘data quality’, and how can we integrate it into our Machine Learning model for Search Engine Marketing?
1st party first – but what exactly is 1st-party data?
Put simply, 1st-party data is all the data you have collected yourself. It’s owned data. This can include data collected from your website, data from your CRM, data from your stores or data from market research you’ve done previously.
The advantage over 2nd-party data (partner data, e.g. keyword data in Google and Bing) or 3rd-party data (aggregated provider data you’ve bought from someone else) is quite obvious: 1st-party data always contains real interactions between the customer and your business, your brand and your product.
|Data||Examples||Interaction with brand|
|1st Party||Online sales, store sales, CRM||Always|
|2nd Party||Google keyword data||Not necessarily (e.g. generic search terms)|
|3rd Party||Audience segments bought from a data marketplace||Most likely not yet|
And why is this important for Machine Learning?
As mentioned, data quality is king in Machine Learning. And data quality means accuracy. A model is only as good as the data we provide. This applies to regression problems and classification problems in the same way. What we ultimately want to see is an output number that reflects how well our hypothesis holds: the highest possible probability for a classification, or the best possible fit for a regression.
In linear regression, for example, we’d like to predict future conversions based on the conversions we’ve already won. Or in other words: we’d like to find datapoints we don’t have yet (new customers), based on datapoints we already know (existing customers). We’d like to find the statistical twins of our customers, not just someone who looks a bit like them.
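To make the regression idea concrete, here is a minimal sketch in Python of a least-squares fit with a single input. Everything in it is an assumption made for illustration: the weekly figures are invented, clicks are just one possible input, and the names `weekly_clicks`, `weekly_conversions`, `fit_line` and `predict` are hypothetical and do not belong to any particular tool.

```python
# Illustrative only: made-up weekly figures from a hypothetical 1st-party source.
weekly_clicks = [100, 200, 300, 400, 500]
weekly_conversions = [5, 9, 16, 19, 26]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict conversions for a given number of clicks."""
    return slope * x + intercept


slope, intercept = fit_line(weekly_clicks, weekly_conversions)
forecast = predict(600, slope, intercept)
print(f"Forecast for 600 clicks: {forecast:.1f} conversions")
```

The fitted line is learned only from conversions we have actually observed, which is why the quality of those observations decides how far the forecast can be trusted.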
The moment we provide bad data, our model’s accuracy will decrease rapidly. We’d be training our model on unclear signals. The result would be a function that fits our hypothesis poorly. In the end, you’ll be guided in the wrong direction and make wrong assumptions. That’s not what Machine Learning was built for.
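The effect of unclear signals can be sketched with a small Python experiment. This is a toy setup under stated assumptions, not a benchmark: the underlying relationship (y = 2x + 5) and both noise levels are invented, and "bad data" is modelled simply as random noise on the target, which is only one of many ways data can be bad. The helper names `fit_line` and `r_squared` are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys):
    """Share of the variance in ys explained by the fitted line (1.0 = perfect fit)."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


xs = list(range(50))
# Assumed "clean" signal: small measurement noise around y = 2x + 5.
clean_ys = [2 * x + 5 + random.gauss(0, 1) for x in xs]
# Assumed "unclear" signal: the same relationship buried under heavy noise.
noisy_ys = [2 * x + 5 + random.gauss(0, 30) for x in xs]

r2_clean = r_squared(xs, clean_ys)
r2_noisy = r_squared(xs, noisy_ys)
print(f"R^2 with clean data: {r2_clean:.3f}")
print(f"R^2 with noisy data: {r2_noisy:.3f}")
```

The model and the underlying relationship are identical in both runs; only the quality of the observations differs, and the fit on the noisy data explains far less of the variance.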
So, if we have the chance to integrate 1st-party data, this should always be our first choice for Machine Learning. Nevertheless, I wouldn’t say that 2nd- or 3rd-party data is bad in general. Since Machine Learning is also a lot about testing, this kind of data should be used, but integrated very carefully and kept under observation.