As we move from the Data Warehouse to the Data Supply Chain, we broaden our perspective to the full life cycle of data, from raw material to data product.
To produce the most valuable data products efficiently and cost-effectively, quality-control processes must be put in place at each link in the chain, driven by the requirements of data scientists. With such processes in place, the burden on data scientists of cleansing data, which typically consumes 80% of their effort, can be greatly reduced.
Data Models – including schema, metadata, rules, and provenance – play a crucial role in ensuring an effective Data Supply Chain.
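To make this concrete, here is a minimal sketch of what such a Data Model might look like in Python. The `DataModel` class and the patient-visit example are hypothetical illustrations of bundling schema, metadata, rules, and provenance together, not a reference to any specific tool covered in the talk.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class DataModel:
    """A Data Model bundling schema, metadata, rules, and provenance."""
    schema: dict[str, type]                     # column name -> expected type
    metadata: dict[str, str] = field(default_factory=dict)
    rules: list[Callable[[dict[str, Any]], bool]] = field(default_factory=list)
    provenance: list[str] = field(default_factory=list)  # upstream sources/steps

    def validate(self, record: dict[str, Any]) -> bool:
        """Check a record against the schema and every quality rule."""
        typed = all(isinstance(record.get(col), t) for col, t in self.schema.items())
        return typed and all(rule(record) for rule in self.rules)

# Hypothetical model for a patient-visit feed (healthcare example).
visits = DataModel(
    schema={"patient_id": str, "visit_date": str, "charge": float},
    metadata={"owner": "intake-team", "refresh": "daily"},
    rules=[lambda r: r["charge"] >= 0],
    provenance=["hospital-emr-extract"],
)
print(visits.validate({"patient_id": "p1", "visit_date": "2024-01-05", "charge": 120.0}))
```

A model like this gives each link in the chain a machine-checkable contract, which is what lets it serve as a team boundary.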
Each link in the Data Supply Chain must be defined with firm boundaries and clear lines of team responsibility, with Data Models providing the natural borders.
In this talk we will discuss the processes that must be put into place at each link in the Data Supply Chain, including perspectives on:
- The definition of the Data Supply Chain vs. the Data Warehouse
- Tools to create, manage, utilize, and share Data Models
- Tracking Data Provenance
- ETL processes driven by Data Models (see the sketch after this list)
- Collaborative processes across Data Science teams
- Visualization of Data and Data Flow across the Data Supply Chain
- Apache Hadoop and Apache Spark as enabling technologies
- Data Science, Cross-Organizational Collaboration, Security, and Privacy
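Since the talk names Apache Spark as an enabling technology, the following is a minimal PySpark sketch of one such link: an ETL step whose schema and quality rule come from the Data Model, and which stamps provenance columns onto every row. The paths, column names, and rule are hypothetical, not part of any implementation discussed above.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("model-driven-etl").getOrCreate()

# The Data Model supplies the schema up front, so malformed records are
# caught at this link rather than by downstream data scientists.
visit_schema = StructType([
    StructField("patient_id", StringType(), nullable=False),
    StructField("visit_date", StringType(), nullable=False),
    StructField("charge", DoubleType(), nullable=True),
])

raw = spark.read.schema(visit_schema).json("s3://raw-zone/visits/")  # hypothetical path

# Apply the model's quality rule, then stamp each row with provenance
# columns recording where and when it entered the supply chain.
curated = (
    raw.filter(F.col("charge") >= 0)
       .withColumn("source", F.lit("hospital-emr-extract"))
       .withColumn("ingested_at", F.current_timestamp())
)

curated.write.mode("overwrite").parquet("s3://curated-zone/visits/")  # hypothetical path
```

Because the schema, rule, and provenance stamp all come from the Data Model, the same pattern repeats at every link, which is what makes data flow across the chain trackable and visualizable.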
During this talk we will draw examples from implementations in healthcare and financial services.