As we move from the Data Warehouse to the Data Supply Chain, we broaden our perspective to the full life cycle of data, from raw material to data product.
To produce the most valuable data products efficiently and cost-effectively, quality-control processes must be put in place at each link in the chain, driven by the requirements of data scientists. With such processes in place, the burden on data scientists of cleansing data, which typically consumes 80% of their effort, can be greatly reduced.
Data Models – including schema, metadata, rules, and provenance – play a crucial role in ensuring an effective Data Supply Chain.
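To make this concrete, here is a minimal sketch of what such a Data Model might look like in Python. The `DataModel` class and the patient-visit example are hypothetical illustrations of bundling schema, metadata, rules, and provenance together, not a reference to any specific tool covered in the talk.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class DataModel:
    """A Data Model bundling schema, metadata, rules, and provenance."""
    schema: dict[str, type]                     # column name -> expected type
    metadata: dict[str, str] = field(default_factory=dict)
    rules: list[Callable[[dict[str, Any]], bool]] = field(default_factory=list)
    provenance: list[str] = field(default_factory=list)  # upstream sources/steps

    def validate(self, record: dict[str, Any]) -> bool:
        """Check a record against the schema and every quality rule."""
        typed = all(isinstance(record.get(col), t) for col, t in self.schema.items())
        return typed and all(rule(record) for rule in self.rules)

# Hypothetical model for a patient-visit feed (healthcare example).
visits = DataModel(
    schema={"patient_id": str, "visit_date": str, "charge": float},
    metadata={"owner": "intake-team", "refresh": "daily"},
    rules=[lambda r: r["charge"] >= 0],
    provenance=["hospital-emr-extract"],
)
print(visits.validate({"patient_id": "p1", "visit_date": "2024-01-05", "charge": 120.0}))
```

A model like this gives each link in the chain a machine-checkable contract, which is what lets it serve as a team boundary.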
Each link in the Data Supply Chain must be defined with firm boundaries and clear lines of team responsibility, with Data Models providing the natural borders.
In this talk we will discuss the processes that must be put into place at each link in the Data Supply Chain, including perspectives on:
- The definition of the Data Supply Chain vs. the Data Warehouse
- Tools to create, manage, utilize, and share Data Models
- Tracking Data Provenance
- ETL processes driven by Data Models (see the sketch after this list)
- Collaborative processes across Data Science teams
- Visualization of Data and Data Flow across the Data Supply Chain
- Apache Hadoop and Apache Spark as enabling technologies
- Data Science, Cross-Organizational Collaboration, Security, and Privacy
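Since the talk names Apache Spark as an enabling technology, the following is a minimal PySpark sketch of one such link: an ETL step whose schema and quality rule come from the Data Model, and which stamps provenance columns onto every row. The paths, column names, and rule are hypothetical, not part of any implementation discussed above.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("model-driven-etl").getOrCreate()

# The Data Model supplies the schema up front, so malformed records are
# caught at this link rather than by downstream data scientists.
visit_schema = StructType([
    StructField("patient_id", StringType(), nullable=False),
    StructField("visit_date", StringType(), nullable=False),
    StructField("charge", DoubleType(), nullable=True),
])

raw = spark.read.schema(visit_schema).json("s3://raw-zone/visits/")  # hypothetical path

# Apply the model's quality rule, then stamp each row with provenance
# columns recording where and when it entered the supply chain.
curated = (
    raw.filter(F.col("charge") >= 0)
       .withColumn("source", F.lit("hospital-emr-extract"))
       .withColumn("ingested_at", F.current_timestamp())
)

curated.write.mode("overwrite").parquet("s3://curated-zone/visits/")  # hypothetical path
```

Because the schema, rule, and provenance stamp all come from the Data Model, the same pattern repeats at every link, which is what makes data flow across the chain trackable and visualizable.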
During this talk we will draw examples from implementations in healthcare and financial services.