How the Importance of DevOps for Data Science is Growing

Data science adds further responsibility to DevOps. Data engineering demands close collaboration between data science and DevOps because it deals with complex pipelines that transform data.

Apac CIOOutlook | Thursday, January 01, 1970


Data science and machine learning require mathematical, statistical, and data-wrangling skills. These skills are crucial to implementing machine learning successfully in an organization, but DevOps for data science is also gaining momentum. DevOps covers infrastructure provisioning, continuous integration and deployment, configuration management, monitoring, and testing. DevOps and development teams have long worked closely to manage the application lifecycle effectively.

Data science adds further responsibility to DevOps. Data engineering demands close collaboration between data science and DevOps because it deals with complex pipelines that transform data. Operators are expected to provision highly available clusters of Apache Kafka, Apache Hadoop, Apache Airflow, and Apache Spark.
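To make the idea of a transformation pipeline concrete, here is a minimal sketch in Python of a pipeline expressed as a dependency graph, which is roughly how workflow orchestrators model their work. It relies only on the standard library's graphlib module (Python 3.9 or later) and does not use the Airflow API. The step names, the sample data, and the run_pipeline helper are all hypothetical and chosen purely for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps: each one reads from and writes to a shared context dict.
def extract(ctx):
    ctx["raw"] = [" 3", "1 ", "x", "2"]

def clean(ctx):
    # Keep only the rows that parse as non-negative integers.
    ctx["clean"] = [int(r) for r in ctx["raw"] if r.strip().isdigit()]

def aggregate(ctx):
    ctx["total"] = sum(ctx["clean"])

STEPS = {"extract": extract, "clean": clean, "aggregate": aggregate}

# Each key depends on the steps listed in its set.
DEPENDENCIES = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
}

def run_pipeline(steps, dependencies):
    """Run every step after the steps it depends on; return (order, context)."""
    ctx = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        steps[name](ctx)
    return order, ctx

order, ctx = run_pipeline(STEPS, DEPENDENCIES)
```

Declaring dependencies separately from the step logic is what lets an operator reason about ordering, retries, and failures without reading the transformation code itself.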

Data scientists use tools such as Jupyter Notebooks, Tableau, pandas, and Power BI to visualize data and find insights. DevOps teams are expected to support data scientists by laying the groundwork for data visualization and exploration. The development of machine learning models differs from traditional application development because it is iterative and heterogeneous. A variety of popular languages are used within development environments such as Jupyter Notebooks, PyCharm, RStudio, Visual Studio Code, and Juno.

Machine learning and deep learning are computationally intensive and require massive compute infrastructure running on powerful GPUs and CPUs. Frameworks such as TensorFlow, Apache MXNet, Caffe, and Microsoft CNTK exploit GPUs. The typical DevOps function is provisioning, configuring, scaling, and managing these clusters. DevOps teams may have to write scripts that automate both the provisioning of infrastructure and the termination of instances once a training job is done.
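The provision-then-terminate pattern described above can be sketched in Python with a context manager, which guarantees teardown even when the training job fails. This is only an illustration under stated assumptions: FakeCluster is a hypothetical stand-in for a real cloud provider's API, and training_cluster and train are made-up names rather than functions from any actual library.

```python
from contextlib import contextmanager

class FakeCluster:
    """Hypothetical stand-in for a cloud provider's GPU cluster API."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.running = False

    def start(self):
        self.running = True

    def terminate(self):
        self.running = False

@contextmanager
def training_cluster(nodes):
    """Provision a cluster and always terminate it, even if training fails."""
    cluster = FakeCluster(nodes)
    cluster.start()
    try:
        yield cluster
    finally:
        cluster.terminate()

def train(cluster, fail=False):
    # Placeholder for a real training job.
    if not cluster.running:
        raise RuntimeError("cluster is not running")
    if fail:
        raise ValueError("training diverged")
    return {"nodes_used": cluster.nodes, "status": "done"}

with training_cluster(nodes=4) as cluster:
    result = train(cluster)
```

Putting the termination in a finally clause means a crashed training run does not leave expensive GPU instances running, which is usually the main reason for automating this step.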

Machine learning development is iterative: new datasets are used to train new ML models. Continuous integration and deployment (CI/CD) best practices are applied to machine learning lifecycle management. DevOps teams use CI/CD pipelines to bridge the gap between ML training and model deployment. Once a fully trained ML model is available, DevOps teams are expected to host it in a scalable environment. Containers and container management tools help keep machine learning development manageable and efficient.
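One common way a CI/CD pipeline connects training to deployment is a promotion gate: a newly trained model replaces the one in production only if it measurably improves on it. The Python sketch below illustrates that idea under simple assumptions. The should_promote and release functions, the accuracy metric, the 0.01 threshold, and the model records are all hypothetical and not taken from any particular tool.

```python
def should_promote(candidate, production, metric="accuracy", min_gain=0.01):
    """Return True if the candidate beats production by at least min_gain."""
    if production is None:
        # Nothing is deployed yet, so the first trained model is promoted.
        return True
    gain = candidate["metrics"][metric] - production["metrics"][metric]
    return gain >= min_gain

def release(candidate, production):
    """Return the model record that should serve traffic after this run."""
    return candidate if should_promote(candidate, production) else production

# Hypothetical model records as a pipeline might store them.
production = {"version": "v1", "metrics": {"accuracy": 0.90}}
better = {"version": "v2", "metrics": {"accuracy": 0.93}}
worse = {"version": "v3", "metrics": {"accuracy": 0.905}}

serving_after_better = release(better, production)  # promoted: gain of 0.03
serving_after_worse = release(worse, production)    # rejected: gain of 0.005
```

In a real pipeline the gate would typically run after automated tests and before the deployment step, so that each retraining on new data yields either a promoted model or a recorded rejection.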

DevOps teams leverage containers to provision development environments, training infrastructure, processing pipelines, and model deployment environments. Emerging technologies such as Kubeflow and MLflow are enabling DevOps teams to handle these new challenges.

Machine learning brings new challenges to DevOps. Embracing the new ML paradigm requires a collaborative effort among developers, operators, data scientists, and data engineers.