CIS 602: Provenance & Scientific Data Management

Course Information

Instructor Information

  • Instructor: Dr. David Koop
  • Office: Dion 302B
  • Office Hours: M 3-5pm, T 2:30-3:30pm, Th 11am-12pm, by appointment
  • Phone: 508-910-6692
  • Web Page: http://www.cis.umassd.edu/~dkoop/
  • Email: dkoop@umassd.edu

Course Description

Science is increasingly accomplished via computational methods and data analyses. As such, computer scientists have worked to build infrastructure and techniques to enable such work. Provenance is one of the tools that helps users keep track of experiments and computational pipelines for increasingly complex analyses. Such information helps others trust results and reproduce them. In addition, keeping track of the different types of big data often requires new data management solutions like those for structured data.

Useful background: databases, data management

Calendar

Date | Topic | Readings | Assignment
9/4 | Introduction | | Review Syllabus & Web Page
9/9 | Provenance | Provenance for Computational Tasks: A Survey | Project Ideas
9/11 | Scientific Workflows | Scientific Workflow Management and the Kepler System | Reading Response
9/16 | Scientific Workflow Provenance | A Framework for Collecting Provenance in Data-Centric Scientific Workflows | Reading Response
9/18 | Workflow Evolution Provenance | Managing Rapidly-Evolving Scientific Workflows | Reading Response, Presentation Topics
9/23 | System-Level Provenance | Provenance-Aware Storage Systems; ES3 Slides (Frew & Slaughter) | Reading Response
9/25 | Database Provenance | Provenance in Databases: Past, Present, and Future; Provenance in Databases: Why, How, and Where; Integrating Workflow & Database Provenance Slides (Chirigati & Freire) | Reading Response (1st paper)
9/30 | Provenance Storage | Efficient Provenance Storage; Kepler Provenance Model & Storage Slides (Anand) | Reading Response
10/2 | Project Proposals | | Project Proposal
10/7 | Querying Provenance | Querying and Managing Provenance through User Views in Scientific Workflows; PQL Slides (Holland et al.) | Reading Response
10/9 | Map-Reduce Provenance, Provenance Analytics | Provenance for Generalized Map and Reduce Workflows; Examining Statistics of Workflow Evolution | Reading Response (choose one)
10/14 | Provenance Mining | Process Mining Based on Clustering: A Quest for Precision; Clustering Workflows Slides (Santos et al.) | Reading Response
10/16 | Secure Provenance | The Case of the Fake Picasso: Preventing History Forgery with Secure Provenance; Provenance & Privacy Slides (Davidson et al.) | Reading Response
10/21 | Provenance & Semantics | Janus: from Workflows to Semantic Provenance and Linked Open Data; Semantic Web & Linked Data Slides (Hassanzadeh) | Reading Response
10/23 | Provenance Standards | PROV Model Primer; PROV Tutorial (Moreau et al.) | Reading Response
10/28 | Visualization & Provenance | Supporting the Analytical Reasoning Process in Information Visualization | Reading Response
10/30 | Visualization & Provenance | Generating Photo Manipulation Tutorials by Demonstration; Graphical Histories for Visualization: Supporting Analysis, Communication, and Evaluation; Nonlinear Revision Control for Images | Reading Response (choose one)
11/4 | Reproducibility | Reproducible Research in Computational Science; Reproducible Epidemiologic Research | Reading Response (2nd paper)
11/6 | Reproducibility | Making scientific computations reproducible; BURRITO: Wrapping Your Lab Notebook in Computational Infrastructure; ReproZip: Using Provenance to Support Computational Reproducibility | Reading Response (either 1st or both 2nd & 3rd)
11/11 | No class (Veterans Day) | |
11/13 | Project Progress Reports | Sample Report | Project Progress Report
11/18 | Graph Databases | Survey of Graph Database Models (Read paper, scan appendix); A Comparison of a Graph Database and a Relational Database | Reading Response (1st paper)
11/20 | Graph Indexing | Graph Indexing: A Frequent Structure-based Approach | Reading Response
11/25 | Scientific Databases | The Architecture of SciDB | Reading Response
11/27 | No class (Thanksgiving) | |
12/2 | Project Presentations | |
12/4 | Project Presentations | |

Course Objectives

  • Become familiar with current research in provenance, reproducibility, and scientific data management
  • Develop skills to critically analyze prior work in the field
  • Develop solutions that incorporate or extend provenance, reproducibility, or data management techniques

Learning Outcomes

  • Students will understand the key challenges and techniques for scientific data management
  • Students will be able to critically discuss research papers
  • Students will be able to work on long-term research-oriented projects

Course Requirements

Because a major part of this class will be discussing and exploring active research areas, you must attend class. Your attendance is factored into the class participation part of your grade. All assignments should be completed on time. In the event that you will miss a lecture or cannot complete a required assignment on time--due to a serious and unavoidable circumstance such as illness--you must notify the instructor as far in advance as possible.

Course Project

This course will require a significant semester-long project where students are expected to design, implement, and test a new technique for managing, analyzing, or applying provenance. Sample project ideas include the following:

  • Add meaningful provenance capture to an existing tool that currently lacks it. Such a project could use a custom solution or capture provenance in an existing format (like PROV or using the VisTrails SDK). For example, you might use the VisTrails SDK to store and replay provenance from an application like a web browser. (http://www.vistrails.com/sdk.html)
  • Create new techniques for analyzing, visualizing, or querying provenance, working from existing provenance data (e.g. the ProvBench datasets).
  • Wrap an existing library for use in a scientific workflow system (e.g. Kepler, VisTrails, Taverna) and create new useful workflows. You should then be able to create provenance by running those workflows.
  • Use a scientific or graph database to store and query information that is used in a tool. For example, you might use Neo4j to support a recommendation system for movies based on a social network.
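
To make the first idea more concrete, here is a minimal sketch of what adding provenance capture to an existing function might look like. It assumes a plain in-memory log and a hypothetical `capture_provenance` decorator; the field names are loosely borrowed from PROV terms (activity, used, generated) but this is not a conforming PROV serialization, and it does not use the VisTrails SDK.

```python
import functools
import json
from datetime import datetime, timezone

# Hypothetical in-memory provenance log; a real project would persist this.
PROVENANCE_LOG = []


def capture_provenance(func):
    """Record one entry per call: what ran, what it used, and what it generated."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc).isoformat()
        result = func(*args, **kwargs)
        ended = datetime.now(timezone.utc).isoformat()
        PROVENANCE_LOG.append({
            "activity": func.__name__,
            "used": {
                "args": [repr(a) for a in args],
                "kwargs": {k: repr(v) for k, v in kwargs.items()},
            },
            "generated": repr(result),
            "startedAtTime": started,
            "endedAtTime": ended,
        })
        return result
    return wrapper


@capture_provenance
def normalize(values):
    """Example tool function that originally had no provenance capture."""
    total = sum(values)
    return [v / total for v in values]


if __name__ == "__main__":
    normalize([1, 3])
    print(json.dumps(PROVENANCE_LOG, indent=2))
```

A real project would go further than this sketch, for example by storing the log durably, linking outputs of one call to inputs of the next, and supporting replay.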

For your project proposal, you should include the following:

  • Standard Metadata: Title, Authors, Date
  • Introduction: An overview of the project and its goals
  • Motivation: Discussion of why you chose this project and how it ties into the topics of this course
  • Background: Any related work or other background information to understand the project (e.g. a project on networking might need to define how routing works)
  • Design: The conceptual design of your project--what pieces are involved and how they relate (e.g. the interface, the data storage, etc.)
  • Implementation: The language, platform(s), libraries, and system requirements for the project
  • Project Plan: Your timeline for implementing the different pieces (specifically what will be done by the progress report due date)

You will be required to present the project proposal and send me a progress report during the semester. With the proposal, I will be able to provide feedback on the feasibility of the project. The progress report is required to make sure that you are making progress on the project before the final deadline. A sample progress report is available. For the final deadline, you will need to provide the finished project code and a final report. In addition, you will be required to present your results.

Readings

Throughout the course, students are required to read the assigned readings and provide responses to them. We will review a sample report in class so students will understand the requirements. The reports should include a summary of the key ideas in the paper, a critical reaction to the ideas, and at least one question about the technique(s). The readings posted on the calendar are due on the day they are listed. Please see the Response Example for an example and the format. Students should submit their reading responses by 12pm on the day of class by sending a text or PDF document to the instructor's email address so the instructor may review the questions for class discussion.

Whenever readings are assigned, a short reading quiz may also be given at the beginning of class to check that students are prepared to discuss the day's readings.

In addition, each student will be required to present one of the assigned readings in a short (15-20 minute) presentation to prepare the class for discussion. The presentation should cover the motivation for the paper, the problem being solved, the solutions to that problem, and the results. In addition, the presentation should be critical, highlighting shortcomings in addition to the advertised benefits. The instructor will provide a list of topics and students will rank the topics they would most like to present. Then, the instructor will assign topics based on those rankings. Students may not receive their first choice if that topic is popular; in case of overlap, ties will be broken randomly.

Grading

  • 10% Course Participation
  • 10% Reading Presentation(s)
  • 20% Reading Responses
  • 10% Reading Quizzes
  • 50% Course Project
    • 10% Proposal
    • 5% Progress Report
    • 25% Final Results & Report
    • 10% Final Presentation
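
As an illustration of how these weights combine, the sketch below computes a weighted final percentage. It assumes each component is scored from 0 to 100; the component names and the example scores are hypothetical, and the syllabus does not specify how percentages map to letter grades.

```python
# Weights taken from the grading breakdown above (percent of final grade).
# The four project_* entries together make up the 50% course project.
WEIGHTS = {
    "participation": 10,
    "reading_presentation": 10,
    "reading_responses": 20,
    "reading_quizzes": 10,
    "project_proposal": 10,
    "project_progress_report": 5,
    "project_final_results_report": 25,
    "project_final_presentation": 10,
}


def final_grade(scores):
    """Combine per-component scores (each 0-100) into a weighted percentage."""
    assert sum(WEIGHTS.values()) == 100
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example_scores = {
    "participation": 100,
    "reading_presentation": 90,
    "reading_responses": 80,
    "reading_quizzes": 70,
    "project_proposal": 90,
    "project_progress_report": 100,
    "project_final_results_report": 80,
    "project_final_presentation": 90,
}

if __name__ == "__main__":
    print(final_grade(example_scores))  # prints 85.0
```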

Syllabus Change Policy

Except for changes that substantially affect the evaluation (grading) of the course, this syllabus is a guide for the course and is subject to change. Please refer to the current class web page for the most current information.

Incomplete Policy

The incomplete policy for this course is that at least 70% of the course must already be completed and an exceptional circumstance (e.g. a medical issue) must exist. If you feel you require an incomplete for an exceptional reason, you need to email me and state your reasons for the incomplete in writing. We will then decide on a course of action.

Academic Honesty Policy

All UMass Dartmouth students are expected to maintain high standards of academic integrity and scholarly practice. The University does not tolerate academic dishonesty of any variety, whether as a result of failure to understand required academic and scholarly procedure or as an act of intentional dishonesty. A student found responsible for academic dishonesty is subject to severe disciplinary action which may include dismissal from the University. Refer to the Student Handbook and Student Code of Conduct for due process.

Students must complete their own work. They should not copy work from another source (e.g. another student, a book or other published document, or a website). If you use sources that are not your own, you must explicitly acknowledge them. In this course, the instructor reserves the right to use the SafeAssign plagiarism detection software through myCourses.

Accommodation Policy

In accordance with University policy, if you have a documented disability and require accommodations to obtain equal access in this course, please meet with the instructor at the beginning of the semester and provide the appropriate paperwork from the Center for Access and Success. The necessary paperwork is obtained when you bring proper documentation to the Center, which is located in Liberal Arts, Room 016; phone: 508.999.8711.

Lecture Etiquette

You may not record lectures without the instructor's permission. Please do not cause distractions that detract from your fellow students' learning. Cell phones and other electronic devices should be silent; if there is an emergency and you need to communicate with someone, please step out of the classroom. You may use electronic devices for note-taking, but please note that not participating in lectures (e.g. working on another assignment during lecture) will affect the class participation portion of your grade.