From the course: Data Pipeline Automation with GitHub Actions Using R and Python
Data pipeline maintenance - GitHub Tutorial
- [Instructor] Congratulations. We now have a deployed data pipeline running on GitHub Actions. In this chapter, we will focus on the maintenance of the data pipeline. Let's start by discussing when and why you need to maintain a data pipeline. Typically, software upgrades, new features, and data integrity issues will force you to make changes in the code or the structure of the data pipeline. Software upgrades and new features typically trigger changes in the environment settings. Generally, it is recommended to have a clear deployment strategy for new features or changes in the environment. A classic setup is to have three environments: dev, stage, and prod. The dev environment is where you first roll out and test new software updates before pushing them to stage and prod, and new features are tested in the stage environment before pushing the changes to prod. This ensures that when you update your Docker image or change a feature in the data pipeline, the production pipeline won't crash or be affected. Likewise, data integrity issues or unexpected errors will require immediate changes in the code, and you want to test those changes before rolling them out to the prod environment. This is where monitoring, the process of tracking the health of the data pipeline, becomes a critical tool in the maintenance of the pipeline. It includes a variety of methods and tools, such as setting unit tests, defining logs, and setting alerts. In chapter two, we saw different unit tests and integrated some data quality checks into the electricity data pipeline. Those were just conceptual examples used to demonstrate the topic. Now that you have seen the process of setting up a data pipeline and the functionality of Actions, the sky is the limit. There are many open source data quality tools for both R and Python, such as the pointblank R library or the ydata-profiling Python library, which provide great tools for data monitoring and reporting.
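To make the monitoring idea concrete, here is a minimal sketch of the kind of data quality check described above, written in plain Python. The field names (`timestamp`, `demand_mwh`) are hypothetical stand-ins for the electricity data, not the course's actual schema; the point is the pattern of collecting issues and failing the pipeline run when any are found.

```python
# Minimal data-quality gate for a pipeline step (conceptual sketch).
# Field names "timestamp" and "demand_mwh" are illustrative assumptions.

def check_records(records):
    """Return a list of issue strings; an empty list means the batch passes."""
    issues = []
    for i, row in enumerate(records):
        # Completeness check: every row must carry a timestamp.
        if row.get("timestamp") is None:
            issues.append(f"row {i}: missing timestamp")
        # Validity check: demand must be present and non-negative.
        value = row.get("demand_mwh")
        if value is None or value < 0:
            issues.append(f"row {i}: invalid demand_mwh {value!r}")
    return issues

good_batch = [{"timestamp": "2024-01-01T00:00", "demand_mwh": 412.5}]
bad_batch = [{"timestamp": None, "demand_mwh": -3}]

print(check_records(good_batch))  # clean batch: no issues
print(check_records(bad_batch))   # two issues flagged
```

In a GitHub Actions job, a step like this would typically raise an exception (or exit with a non-zero status) when the issue list is non-empty, which fails the workflow run and can trigger an alert, exactly the monitoring loop described above. Libraries such as pointblank in R or ydata-profiling in Python provide far richer, report-generating versions of this same idea.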
Those tools can be easily integrated into the data pipeline at runtime. In the next video, we'll learn how to render and deploy a dashboard to GitHub Pages using GitHub Actions.