Third-party data is killing your data products before they even begin. Here's why:
1. You aren't measuring the health of your third-party data.
2. You don't have a solution to understand and profile your data at scale.
3. You're ingesting more third-party data without automating monitor creation.
If you haven't checked out my most recent Data Downtime Newsletter, I dive into the risks posed by third-party data, including one of its latest high-profile victims, and what you can do to protect your pipelines. It's a short one, so if you've got a second, feel free to give it a read! https://lnkd.in/gYp2QjpP
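To make point 3 a bit more concrete, here is a minimal sketch of automating monitor creation: one freshness monitor per ingested third-party table instead of hand-written checks. The table names, staleness thresholds, and helper functions are illustrative assumptions, not any specific tool's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class FreshnessMonitor:
    """Alert when a third-party table has not been reloaded recently enough."""
    table: str
    max_staleness: timedelta

def build_monitors(third_party_tables: dict) -> list:
    # One monitor per ingested table, created automatically rather than by hand.
    return [FreshnessMonitor(table, staleness) for table, staleness in third_party_tables.items()]

def is_fresh(monitor: FreshnessMonitor, last_loaded_at: datetime) -> bool:
    # False means the source is stale and downstream consumers should be warned.
    return datetime.now(timezone.utc) - last_loaded_at <= monitor.max_staleness

# Hypothetical third-party sources and how stale each is allowed to get.
monitors = build_monitors({
    "vendor_income_data": timedelta(days=7),
    "vendor_address_data": timedelta(days=30),
})
```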
It’s extremely important to identify and tag authoritative data sources before they are processed or injected into the system. Recognizing a source as authoritative is an executive-level decision based on contractual agreements with the providers. That makes DQ assurance more definitive, because the semantics of the data are clearer and more accurate for creating DQ rules. Still, the question remains: how do you close the loop? Profiling data will only filter the data; it will not solve the DQ issues. Fixing the issues needs broader participation across people, process, and guidance.
I couldn't agree more with your insights on third-party data risks. Your points resonate strongly with challenges I've encountered in my work. I'd add that implementing robust validation processes at data ingestion points is crucial. We've found success with automated schema checks and data profiling on incoming third-party data streams. Curious about your thoughts on balancing the need for third-party data enrichment with maintaining data quality standards. Have you explored any innovative approaches to this challenge?
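For what it's worth, a rough sketch of that kind of ingestion-point validation in pandas; the expected schema, column names, and thresholds here are made-up assumptions for illustration:

```python
import pandas as pd

# Hypothetical contract for an incoming third-party feed.
EXPECTED_SCHEMA = {"customer_id": "int64", "income": "float64", "reported_at": "datetime64[ns]"}

def check_schema(df: pd.DataFrame, expected: dict) -> list:
    """Return a list of schema problems instead of silently ingesting a broken feed."""
    problems = []
    for column, dtype in expected.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            problems.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    return problems

def profile(df: pd.DataFrame) -> dict:
    """Lightweight profile to compare against the previous load of the same feed."""
    return {
        "row_count": len(df),
        "null_rates": df.isna().mean().to_dict(),
        "duplicate_rate": float(df.duplicated().mean()),
    }

incoming = pd.DataFrame({
    "customer_id": [1, 2],
    "income": [52000.0, None],
    "reported_at": pd.to_datetime(["2024-01-01", "2024-03-01"]),
})
issues = check_schema(incoming, EXPECTED_SCHEMA)   # [] when the feed matches the contract
stats = profile(incoming)                          # e.g. null_rates["income"] == 0.5
```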
Barr Moses - some good points, though the issue here seems to be that the third-party data was used incorrectly: it was out of date. "Roberts said DCF made that calculation based on outdated income data from a third-party source, Florida’s State Wage Information Collection Agency." I would suggest some different rules:
1. Define the pedigree and volatility of the third-party data. They used the correct data product, which should have been of a sufficient standard; it was just not updated properly.
2. Join and verify sources of data based on the source's core data. For instance, with Land Registry data, the property reference you own has been checked and the deed number verified; the Land Registry is the definitive source for that. The full postal address or owner's name is not verified or updated, as these change outside the register, so look to update them as needed.
3. Use external data to verify your data and highlight where the data does not match, then define a process to resolve it (see the sketch below).
4. Always be suspicious of data and seek to confirm its validity. Assume some of it will be faulty and process accordingly.
5. Make sure your users/customers can understand and challenge the data and the decisions you made based upon it.
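A minimal sketch of rule 3, assuming a made-up property dataset and column names: join the two sources on the authoritative key and surface the rows where a field disagrees, which then feed the resolution process.

```python
import pandas as pd

def find_mismatches(internal: pd.DataFrame, external: pd.DataFrame,
                    key: str, fields: list) -> pd.DataFrame:
    """Join on the authoritative key and keep rows where any compared field disagrees."""
    merged = internal.merge(external, on=key, suffixes=("_internal", "_external"))
    mask = pd.Series(False, index=merged.index)
    for field in fields:
        mask |= merged[f"{field}_internal"] != merged[f"{field}_external"]
    return merged[mask]

# Hypothetical example: verify owner names against a land-registry extract.
internal_df = pd.DataFrame({"property_ref": ["P1", "P2"], "owner_name": ["A. Smith", "B. Jones"]})
registry_df = pd.DataFrame({"property_ref": ["P1", "P2"], "owner_name": ["A. Smith", "B. Jonas"]})
mismatches = find_mismatches(internal_df, registry_df, key="property_ref", fields=["owner_name"])
print(mismatches)  # one row flagged for manual or automated resolution
```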
Agreed, if you can't remove the dependency on 3rd party data, these 3 are a must-have to mitigate the risk! I’ve noticed that alerting on Slack works really well since it grabs people's attention. We just need to be careful to only alert on the critical stuff, so people don’t start tuning it out.
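A small sketch of that alerting pattern, assuming Slack's incoming-webhook endpoint and a placeholder URL: only checks tagged as critical get posted, everything else goes to the log so the channel stays worth paying attention to.

```python
import logging
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook

def notify(check_name: str, passed: bool, severity: str) -> None:
    """Post critical failures to Slack; log everything else to avoid alert fatigue."""
    if passed:
        return
    if severity == "critical":
        requests.post(SLACK_WEBHOOK_URL, json={"text": f":rotating_light: {check_name} failed"})
    else:
        logging.warning("Non-critical check failed: %s", check_name)

notify("vendor_income_data freshness", passed=False, severity="critical")
```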
Almost all of asset management (and a lot of financial services in general) runs entirely on third-party data... I'm not suggesting anyone is perfect, but we/they have been doing it for a very long time and at scale...
Well, sometimes we need third-party data to bootstrap the initiatives. People leave; solutions stay.
Barr Moses, certainly can relate. Automating data health checks and scaling profiling are critical to maintaining reliable analysis.
My understanding is that 'data' is raw facts. Any manipulation of the data causes it to become 'information'.
I don't know how to answer point 1. As for points 2 and 3, you surely have gaps in your product design. 😁
If companies are not evaluating third-party data prior to integration, then they have bigger issues. There's always an opportunity cost to doing it yourself. The best teams I've worked with, in terms of licensing data to integrate into their platform(s), had an idea of what they were trying to accomplish, worked with us to understand whether the data was fit for purpose, and let the data speak for itself through proper evaluations.