Beyond Volume: Unlocking the Potential of Data with the Five Vs
Harvard Business Review once estimated that Bad Data Costs the U.S. $3 Trillion Per Year (https://lnkd.in/g-Gi9kfs). But what exactly constitutes good data? The Five Vs provide a blueprint, helping us judge the "goodness" of our data and prioritize curation efforts so that we can put high-value data to work sooner.
Any seasoned data scientist knows that the data underpinning a model is the most valuable asset of an AI system. It is a delicious temptation to focus only on the volume of data, but volume is just one part of the bigger picture.
1️⃣ Volume
Volume refers to the size of the data, whether counted in individual records or measured in terabytes. While it's a tempting metric to focus on, data quality is just as crucial as quantity, if not more so, and adding more data is often not the most impactful step.
2️⃣ Veracity
Veracity means accounting for noise, bias, abnormalities, missingness, semantic inconsistency, and similar issues. The adage "garbage in, garbage out" rings true: low-quality data makes ML models brittle and unreliable. Investing in the quality of production data often yields larger improvements than modifying the AI model itself.
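To make this concrete, here is a minimal sketch of a veracity check covering two of the issues above: missingness and exact duplicates. The function name `veracity_report`, the field names, and the sample records are all hypothetical, chosen only for illustration. A real audit would also look at noise, bias, and semantic consistency.

```python
from collections import Counter

def veracity_report(records, required_fields):
    """Return the missing-value rate per field and the exact-duplicate rate.

    Hypothetical helper for illustration; a value counts as missing
    when it is None or an empty string.
    """
    total = len(records)
    if total == 0:
        return {"missing_rate": {f: 0.0 for f in required_fields},
                "duplicate_rate": 0.0}
    missing = {
        f: sum(1 for r in records if r.get(f) in (None, "")) / total
        for f in required_fields
    }
    # Rows that match on every required field are treated as duplicates.
    keys = Counter(tuple(r.get(f) for f in required_fields) for r in records)
    duplicates = sum(count - 1 for count in keys.values())
    return {"missing_rate": missing, "duplicate_rate": duplicates / total}

# Made-up sample records, not real data.
records = [
    {"patient_id": 1, "age": 54, "sex": "F"},
    {"patient_id": 2, "age": None, "sex": "M"},
    {"patient_id": 2, "age": None, "sex": "M"},
    {"patient_id": 3, "age": 41, "sex": ""},
]
report = veracity_report(records, ["patient_id", "age", "sex"])
print(report)
```

On this sample, half of the `age` values are missing and one of the four rows is an exact duplicate, which is the kind of signal worth tracking before any model tuning.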
3️⃣ Variety
Variety is essential for developing robust, generalizable models. Diverse sample populations, available features and outcomes, addressable market, types of products, and quality of results are all factors to consider. Exposure to variety during training makes models more performant and representative, combating dangerous biases and improving system equity.
4️⃣ Velocity
Even the most performant models won't have real-world impact if their outputs aren't delivered at the right time with relevant data. Velocity refers to delivering real-time insights while accounting for data relevance and staleness. For healthcare applications, this is a key component of the Five Rights (https://lnkd.in/g3ePEVBD).
5️⃣ Value
These all lead to Value, the fifth, or bonus, V, which we all strive to achieve. As noted in Forbes, practical application requires that we prioritize the Vs - suggesting veracity over velocity and volume over variety (https://lnkd.in/g6QdFhku). That logic expedites business value by prioritizing accuracy over new insights, and it acknowledges the cost of integrating new data types. While I agree on prioritizing veracity, healthcare has often overlooked variety in the sense of representative datasets, exacerbating disparities. Moreover, while volume is commonly used to ensure sufficient sample sizes for all subgroups, it is not the most direct or effective route to representation.
By considering veracity, variety, and velocity in addition to volume, we unlock data's true value.
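The point about volume and representation can be shown with a small sketch. The group labels and counts below are invented for illustration, and `subgroup_shares` is a hypothetical helper: scaling a dataset up tenfold leaves each group's share unchanged, while a much smaller targeted addition changes the mix directly.

```python
from collections import Counter

def subgroup_shares(labels):
    """Return each subgroup's share of the dataset (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Invented example: group B is 10% of the original sample.
small = ["A"] * 90 + ["B"] * 10

# Ten times the volume from the same source keeps the same mix.
large = small * 10

# Targeted collection: 80 additional records from group B only.
targeted = small + ["B"] * 80

print(subgroup_shares(small))
print(subgroup_shares(large))
print(subgroup_shares(targeted))
```

More volume does raise the absolute count for group B, but it takes 900 extra records to add 90 from that group, whereas targeted collection reaches parity with 80.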
Special thanks to Scott Sorensen for his thought partnership.
#DataScience #DataQuality #ArtificialIntelligence #WinningDataStrategy