Operationalising Data Science #3 of 3 - Practical Recommendations
Introduction
In the two previous articles we (1) described the dominant technical delivery workflows and (2) examined the challenges and opportunities involved in integrating aspects of these paradigms to successfully operationalize data science solutions. This article wraps up the series by presenting a number of practical recommendations for successfully operationalizing data science solutions.
Data Science Workflow Categories
The data science workflow can be decomposed into three activity categories: data collection, data analysis, and data use and dissemination. This workflow model is leveraged as the framework to contextualize the recommendations in this article (Fig. 1).
Fig.1 – Machine learning workflow and data scientist activity categories (Kim, Zimmermann, et al. 2016; Amershi et al. 2019)
Data Science Workflow Category 1 - Data Collection
There is widespread agreement that successfully operationalizing data science solutions depends critically on the availability of high quality production data. Despite this critical dependency, it is clear that (1) data quality is usually poor and (2) data is typically not easily accessible to the appropriate people. These two issues have contributed to poor outcomes for data science solutions and a loss of executive confidence in the value of data science.
A number of key factors have been identified which, if addressed, would help resolve these two issues and would likely improve the downstream value provided by data science. The struggle to achieve appropriate data quality and access is largely a function of three factors: (1) executive stakeholder management, (2) appropriate organisational clarity and (3) technological pipeline capability.
Executive Stakeholder Management
Digital data has emerged as a significant business asset, with organisations increasingly focussed on insights elicited from digital data to inform decision-making and ultimately generate new revenue streams. Organisations that cannot effectively and rapidly leverage data are likely to lose competitiveness. The race to achieve competitive advantage has led organisations to rush the critical first steps of establishing reliable, high quality data sources and making them available to the appropriate people. The industrial reality is therefore that large amounts of data typically exist in unusable formats and in disparate, inaccessible locations.
Executives with power and legitimacy typically lack urgency when it comes to investing in the data collection activities necessary to bring data to a state where it can be effectively monetized. This lack of investment in data collection activities has resulted in many expensive data science project failures. There is an obvious need to address this deficiency in executive understanding through education.
Organisational Clarity / Technological Pipeline Capability
Significant confusion also exists in terms of the related themes of role clarity, inter-discipline boundaries and organisational structure.
Responsibility for data collection activities is generally attributed to an independent IT group, with data science acting solely as a consumer of the data. Despite this general viewpoint among data scientists, significant ambiguity still exists regarding the ownership and scope of data collection activities.
In a clear deviation from many definitions of data collection, the consensus among industry practitioners is that data collection must also include data merging and cleaning activities. This highlights the level of boundary confusion which exists within the industry, and underlines the need to define both the boundaries and ownership of data collection activities. One widely held view is that executive investment must support the establishment of an expert team to establish and continuously maintain high quality data sources and reliable data pipelines. This team must also provide secure access to clean production data in order to facilitate successful data science solution delivery. Providing this reliable data pipeline is a complex task which will benefit from the application of software engineering principles, potentially via DevOps for data, i.e. DataOps.
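As a concrete illustration, the sketch below shows what a single merge-and-clean pipeline step might look like in Python. The source files, column names and quality rules are assumptions for the purpose of the example, not a prescription; a real DataOps pipeline would encode rules agreed with the data owners and run them on a schedule.

```python
import pandas as pd

def build_clean_dataset(orders_path: str, customers_path: str) -> pd.DataFrame:
    """Illustrative DataOps-style step: merge and clean two raw extracts.

    The paths, columns and rules here are hypothetical examples only.
    """
    orders = pd.read_csv(orders_path, parse_dates=["order_date"])
    customers = pd.read_csv(customers_path)

    # Basic cleaning: drop exact duplicates and rows missing mandatory keys.
    orders = orders.drop_duplicates().dropna(subset=["customer_id", "order_value"])
    customers = customers.drop_duplicates(subset=["customer_id"])

    # Merge the extracts into a single analysis-ready table.
    merged = orders.merge(customers, on="customer_id", how="inner",
                          validate="many_to_one")

    # Simple quality gates: fail loudly rather than pass bad data downstream.
    assert merged["order_value"].ge(0).all(), "Negative order values found"
    assert merged["customer_id"].notna().all(), "Unmatched orders after merge"

    return merged
```

In practice, a step like this would be version-controlled, scheduled and monitored by the dedicated data team, with the quality gates themselves treated as part of the pipeline's contract with its consumers.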
Data Science Workflow Categories 2 & 3 - Data Analysis and Data Use & Dissemination
Successful data science solution delivery depends to a significant degree on the integration of software engineering and data science. While this dependence on paradigm integration is clearly established, there is no consensus on the ideal model to achieve it. To develop an evidence-based model, a principle-based approach has been adopted and is presented below.
Principle 1 – Multi-disciplinary Solution Delivery Team
Successful data science solution delivery has a number of key capability dependencies (Fig. 2).
Fig.2 – Key capability dependencies
The ability to manipulate, interpret and understand data is a clear capability requirement and is considered particularly effective when coupled with a deep understanding of the business domain. The ability to understand and apply complex mathematics is necessary for effective data science modelling, and the need for the ability to develop, operationalize and deploy software is also solidly established.
Individuals who possess all four capabilities are rare in the market, with a particular shortage of generalists who possess mathematical/statistical capabilities. As a strategic approach to resourcing data science, the principle of establishing a multi-disciplinary solution delivery team has proved popular as a way to resolve this challenge. Organisational challenges related to building teams with this confluence of capabilities are evident, particularly in large firms which are functionally organised.
The requirement to educate team members in each of the four key capability areas is also a recurring theme. In order for team members to fully contribute to solution delivery they must establish a common lexicon and understand the entire delivery lifecycle.
There is therefore a clear need to establish a multi-disciplinary team to enable effective data science solution delivery.
Principle 2 – Adoption of Flexible Agile Frameworks
Poor data science outcomes can be mitigated to some degree by appropriate governance of data science activities, but this governance must be flexible. There is widespread acceptance that governance of data science delivery can be achieved through the application of flexible agile frameworks, and this has therefore been identified as another core principle.
The successful application of flexible agile frameworks for data science solution delivery will, however, require a re-definition of the traditional agile roles and associated tasks. The role of the traditional business analyst/product manager/product owner will change in the context of inference-based data science solutions. New roles, such as the ML tester, will also likely emerge.
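To make the prospective ML tester role more concrete, the hedged sketch below shows one kind of behavioural check such a role might own: asserting that a trained model's prediction for a row does not depend on which other rows happen to share its scoring batch. The scikit-learn model and synthetic data are illustrative stand-ins, not a definition of the role.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def test_prediction_independent_of_batch_context():
    """Behavioural check an ML tester might own: a row's prediction must not
    depend on the other rows that happen to be in the same scoring batch."""
    # Illustrative stand-ins for the team's real model and data.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X, y)

    batch_predictions = model.predict(X[:50])                    # scored as one batch
    row_by_row = np.array([model.predict(row.reshape(1, -1))[0]  # scored individually
                           for row in X[:50]])

    assert np.array_equal(batch_predictions, row_by_row)
```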
A core tenet of agile delivery is the concept of swarming, whereby blocking issues are tackled by the entire team. This approach assumes a breadth of technical understanding across all team members. In an enhanced integration environment, data scientists will be required to become technically competent in software operationalization and deployment tasks, and software engineers will need to become fluent in mathematical and statistical concepts as well as familiar with implicit software behaviour and extreme component entanglement.
Principle 3 – End-to-end Ownership / Investment
There is also a necessity for the data science delivery team to remain involved throughout the entire lifecycle of the application, continuously monitoring and re-training the model while it is in the production environment. The performance of data science solutions is likely to regress over time due to the evolving nature of the input data; consistent monitoring and re-training of the solution is therefore necessary. This type of engagement is defined as a build it, own it model, whereby the team who develops the solution retains ownership until the solution is retired.
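One common, lightweight way to watch for this kind of regression is to track drift in the input features, for example via the Population Stability Index (PSI), and use it to trigger re-training. The sketch below is a minimal illustration with synthetic data; the binning scheme and the 0.2 alert threshold are rules of thumb, not fixed standards.

```python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between a training-time feature distribution and live production data."""
    # Bin edges are derived from the training-time (expected) distribution.
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    # Widen the outer edges so out-of-range live values still land in a bin.
    edges[0] = min(edges[0], actual.min()) - 1e-9
    edges[-1] = max(edges[-1], actual.max()) + 1e-9

    expected_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_frac = np.histogram(actual, bins=edges)[0] / len(actual)

    # Guard against empty bins before taking logs.
    expected_frac = np.clip(expected_frac, 1e-6, None)
    actual_frac = np.clip(actual_frac, 1e-6, None)

    return float(np.sum((actual_frac - expected_frac) *
                        np.log(actual_frac / expected_frac)))

# Hypothetical usage with synthetic data: flag the model for re-training when drift is high.
rng = np.random.default_rng(0)
training_feature = rng.normal(0.0, 1.0, 10_000)  # distribution seen at training time
live_feature = rng.normal(0.3, 1.2, 10_000)      # drifted production distribution
if population_stability_index(training_feature, live_feature) > 0.2:
    print("Significant drift detected - schedule model re-training")
```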
The requirement to retain solution delivery team involvement for the entire lifecycle of the product has significant investment and management implications. This involvement needs to be justified both in terms of continued investment in the product and in terms of the ever-reducing delivery capacity of the solution delivery team (a function of the number of solutions they operationalize and support).
Model for the Operationalization of Data Science Solutions
A principle-based model for the efficient operationalization of data science solutions is shown below (Fig. 3). The key principles, when synthesised, form a logical model which can be easily visualized, consumed and leveraged by data science professionals to frame their approach to operationalizing data science solutions.
Fig. 3 – High-level model to operationalize data science
Executive stakeholder management is required to secure the necessary investment to, for example, adequately resource the IT involvement in data collection and the resource-intensive build it, own it delivery model which supports the data science solution throughout the product lifecycle.
The assignment of responsibility for the delivery of a data pipeline service to a team of qualified specialists, likely within an IT function, was also highlighted as a necessary organisational design feature.
There is also a need for the creation of a multi-disciplinary solution delivery team. The team must have capabilities in four key areas: mathematics/statistics, software engineering, data manipulation/business intelligence and business domain knowledge. The necessity to adopt a flexible governance framework based on existing agile delivery principles also emerged, as did the necessity for full lifecycle involvement of the solution delivery team to monitor and re-train the data science models in production.
A key area of difficulty is that of the testing/validation of data science solutions. While tentative steps towards addressing this challenge have been taken, no practical resolution to this challenge has emerged.
Finally, the emergent nature of the field of data science must also be recognised in terms of the necessity for all stakeholders, especially the solution delivery team, to constantly upskill and remain current with the changing landscape of the discipline.
Recommendations
There are two categories of recommendations which have emerged. Firstly, there are a number of recommendations aimed at industry leaders who wish to incorporate efficient data science delivery into their organisations. A further suite of recommendations is presented as fertile ground for further examination and research.
Key Practical Recommendations
There is clearly significant diversity in the challenges and perceived opportunities in the area of data science solution delivery. What is clear, however, is the level of organisational change required to overcome these challenges. It is therefore considered appropriate to consider the key practical recommendations across the four dimensions of Leavitt’s change management framework (Leavitt 1965).
Dimension 1 - Structure
It is recommended that data collection activities be performed by non-data scientists. The assignment of complex data collection tasks, which may include highly specialized tasks such as data merging and cleaning, requires establishing and training an independent team of IT experts who, in large organisations, are assigned on a full-time basis to providing reliable data sources to the consumers of the data, i.e. the data science solution delivery team.
The establishment of multi-disciplinary data science solution delivery teams tasked with both development and maintenance of data science solutions in production is also recommended. This multi-disciplinary team approach breaks down functional silos and heightens the level of collaboration and cross-pollination of expertise within the organisation. The design and adoption of an agile framework which will achieve both the flexibility and governance required is also highly recommended.
Dimension 2 - People
There is a clear requirement for appropriate stakeholder management. Dominant executives must be educated in terms of the potential offered by data science. Of equal importance is educating executives regarding the upfront investment required to establish reliable data pipelines and establishing the appropriate degree of urgency to transition them to a definitive position in this regard. It is also critical to secure the investment necessary to sustain a build it, own it model whereby the solution delivery team owns the solution throughout the lifecycle. The uncertain nature of the ultimate outcomes of the data science effort must also be clearly explained in order to manage executive expectations. This research found that inadequate education of executive stakeholders has, in the past, led to significant loss of investment and executive confidence due to issues such as poor data quality, unreliable data sources and unrealistic expectations of data science.
The integration of traditional software engineering and data science professionals is unlikely to pose significant challenges from a personality perspective. The establishment of multi-disciplinary solution delivery teams within large companies will however be a significant change for the individuals involved and therefore must be appropriately managed by adequate education and communication. The newly formed solution delivery team will require the establishment of a common language in order to communicate effectively. Educating non-data scientists in the basics of data science and educating data scientists in the lexicon and principles of software engineering will be a necessary first step.
Dimension 3 - Tasks
Deeper cross-discipline knowledge transfer and professional training is recommended whereby, for example, software engineers become increasingly fluent in the mathematical principles of data science and data scientists develop hands-on capabilities in operationalizing and deploying enterprise solutions. This approach is recommended both to develop a more robust generalist team and to build a sustainable pipeline of data science capability for the future. A high degree of cross-over success has been observed when upskilling software engineers to become data scientists (Kim et al. 2018).
The changing nature of the relationship between business requirements and technology, driven by data science, implies that the nature of the tasks performed by the business analyst/product manager/product owner will also change. These individuals must be supported in upskilling their knowledge of inference-based systems.
Dimension 4 - Technology
Technology must be considered principally in the context of achieving technological simplicity and lower cycle times. Consensus must be reached on the simplest end-to-end toolchain necessary for the multi-discipline solution team to support all necessary activities within an acceptable cycle time. The generalist approach within the solution delivery team requires educating and upskilling team members in the selected toolchains.
Technology considerations extend to the ways of working within the flexible agile model. The standard operating procedures, tools and technologies to, for example, adequately estimate tasks, manage agile ceremonies and report on progress, must be appropriately defined, established and maintained.
Eight Key Practical Recommendations
Eight recommendations have been developed as the foundation for a practical change initiative for industry leaders and are described via Leavitt’s diamond framework (Fig. 4). The recommendations have been colour coded in terms of their principal stakeholder associations and numbered from 1 to 8 in terms of their suggested implementation schedule. Continuous education, communication, collaboration and business value measurement are critical features throughout the recommended transformation programme.
Fig. 4 – Practical recommendations for optimizing the operationalization of data science solutions
Key Future Research Recommendations
There are two areas of particular interest for further research: (1) the investigation of options regarding flexible agile frameworks for data science and (2) the further examination of the testing of non-deterministic systems.
Flexible Agile Framework for Data Science Solution Delivery
There is a clear requirement to adopt agile delivery frameworks for the delivery of data science solutions. An equally critical requirement is that the agile framework be flexible enough to support data science activities such as the development and testing of hypotheses. There is currently no published blueprint for a data-science-centric agile framework, and it is recommended that further study be conducted to define and develop the body of knowledge in this regard.
Testing / Validating Non-deterministic Software Solutions
A recurring theme emerged regarding weaknesses in the testing of data science solutions. Significant complexities were highlighted regarding the testing and debugging of highly coupled, non-deterministic software solutions; however, no consensus is evident on an approach to meet this challenge. It is anticipated that this area will continue to attract significant attention due to the quality challenges which will increasingly be experienced when productionizing data science solutions.
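As one illustration of the kind of approach being explored, the hedged sketch below tests a non-deterministic training process statistically: instead of asserting exact outputs, it trains under several random seeds and checks that accuracy stays above a floor and remains stable across runs. The synthetic dataset, model and thresholds are illustrative assumptions, not an established standard.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def test_model_accuracy_within_tolerance():
    """Tolerance-based test for a non-deterministic training process.

    Trains the model under several seeds and asserts on statistical
    properties of the results rather than on exact values. The dataset,
    model and thresholds are illustrative stand-ins for a real pipeline.
    """
    X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    scores = []
    for seed in range(5):  # repeat training under different sources of randomness
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        model.fit(X_train, y_train)
        scores.append(accuracy_score(y_test, model.predict(X_test)))

    scores = np.array(scores)
    assert scores.mean() > 0.80, f"Mean accuracy too low: {scores.mean():.3f}"
    assert scores.std() < 0.02, f"Accuracy unstable across seeds: {scores.std():.3f}"
```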
Conclusion
This series of three articles is intended to provide practical assistance to leaders within the software industry to efficiently develop, deploy and maintain data science solutions.
There are several recognised problems within the industry regarding the successful operationalization of data science solutions. Some challenges pre-exist, while others are introduced by the integration of software engineering practices; that integration was also found to present significant opportunities. Tentative integration of the two paradigms is underway across the industry but is proceeding without an established and proven blueprint.
The state of uncertainty and resistance currently being experienced regarding paradigm integration is not unprecedented. Similar evolutions within the software domain have occurred in the past. The emergence of agile delivery philosophies was met with substantial resistance twenty years ago. The cultural shift from waterfall to agile has taken a generation to take hold, and pockets of resistance are still evident. Furthermore, the introduction of lean principles to the software delivery domain (DevOps) remains a work-in-progress across most of the industry. Notable however is that the rate of DevOps adoption has been significantly faster than was the case with agile adoption.
Paradigm integration within the data science domain is certain; only the degree and speed of the integration process remain unknown. It is likely, based on the agile and DevOps experiences, that paradigm integration will take time but, given the increasing speed of new paradigm adoption within the industry, it is likely to occur more rapidly than that of agile or DevOps.
A model for the efficient operationalization of data science solutions has been presented, along with eight practical recommendations framed through the lens of Leavitt’s change management framework as a means by which the model can be practically achieved.
There are, of course, challenges in this journey. Aspects of this change process require further research, including the establishment of an optimal model for data science agile implementation and the development of a framework for the testing of non-deterministic solutions.
It is increasingly obvious that evidence-based management requires effective exploitation of digital data. In order for organisations to effectively leverage the ever increasing oceans of digital data, the operationalization of data science solutions must evolve. Evolution from the current embryonic state of data exploitation will require both the ability to appropriately manage organisational change and the ability to solve constantly emerging challenges.
Fergal Hynes, July 2021
Bibliography
Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B., Zimmermann, T. (2019) ‘Software Engineering for Machine Learning: A Case Study’, in Proceedings - 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP 2019.
Cao, L. (2017) ‘Data science: A comprehensive overview’, ACM Computing Surveys.
Capizzi, A., Distefano, S., Mazzara, M. (2020) ‘From DevOps to DevDataOps: Data Management in DevOps Processes’, in Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).
Fisher, D., DeLine, R., Czerwinski, M., Drucker, S. (2012) ‘Interactions with big data analytics’, Interactions.
Howe, B., Franklin, M., Haas, L., Kraska, T., Ullman, J. (2017) ‘Data science education: We’re missing the boat, again’, in Proceedings - International Conference on Data Engineering.
Hukkelberg, I., Berntzen, M. (2019) ‘Exploring the challenges of integrating data science roles in agile autonomous teams’, in Lecture Notes in Business Information Processing.
Kandel, S., Paepcke, A., Hellerstein, J.M., Heer, J. (2012) ‘Enterprise data analysis and visualization: An interview study’, IEEE Transactions on Visualization and Computer Graphics.
Khomh, F., Adams, B., Cheng, J., Fokaefs, M., Antoniol, G. (2018) ‘Software Engineering for Machine-Learning Applications: The Road Ahead’, IEEE Software.
Kim, M., Zimmermann, T., Deline, R., Begel, A. (2018) ‘Data scientists in software teams: State of the art and challenges’, IEEE Transactions on Software Engineering.
Kim, M., Zimmermann, T., DeLine, R., Begel, A. (2016) ‘The emerging role of data scientists on software development teams’, in Proceedings - International Conference on Software Engineering.
Leavitt, H.J. (1965) ‘Applied Organizational Change in Industry’, in Handbook of Organizations.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Hung Byers, A. (2011) ‘Big data: The next frontier for innovation, competition and productivity’, McKinsey Global Institute.
Riungu-Kalliosaari, L., Kauppinen, M., Männistö, T. (2017) ‘What can be learnt from experienced data scientists? A case study’, in Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).
Zhang, J.M., Harman, M., Ma, L., Liu, Y. (2020) ‘Machine Learning Testing: Survey, Landscapes and Horizons’, IEEE Transactions on Software Engineering.