Overview
Your final project is to build a data product – not a traditional analysis and write-up.
A data product is something that takes data as input, processes it, and delivers value to an end user. It is not a static document with tables and figures. It is a working thing – something someone could run, interact with, or query.
You may work alone or in pairs. We prefer pairs.
What Makes It a Data Product
Your product must accept user input and include at least one of three additional capabilities: API integration, a GenAI model in the pipeline, or automation. Together, these are what distinguish a data product from a traditional analysis.
Every data product must accept input from a user and return something tailored to that input. The user makes a choice, and the product responds. This could be a Shiny app with dropdown menus, a parameterized report where the user specifies a region, or an API endpoint where a user submits data and gets back a result.
If a user cannot interact with it, it is a report, not a product.
Plus At Least One Of:
API Integration
Your pipeline programmatically consumes or serves data through an API. The product does not start from a static CSV you downloaded by hand – it fetches data at runtime.
Examples:
- A Shiny app that lets users pick a country, then pulls indicators from the World Bank API and visualizes trends
- A pipeline that queries the WHO GHO API, joins the results with user-selected parameters, and produces a formatted report
- A Plumber API that accepts user queries and serves model predictions
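As a rough illustration of fetching data at runtime from a user's selection, here is a minimal sketch in Python (the same pattern applies in R). It assumes the World Bank v2 indicator endpoint and its two-element JSON response of metadata followed by records, and it assumes annual indicators whose dates are plain years. The function names are hypothetical, chosen only for this example.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://api.worldbank.org/v2"


def build_indicator_url(country_code: str, indicator: str) -> str:
    """Turn the user's selections into a request URL."""
    return (
        f"{BASE_URL}/country/{quote(country_code)}"
        f"/indicator/{quote(indicator)}?format=json&per_page=100"
    )


def parse_indicator_payload(payload: list) -> list:
    """Flatten the response [metadata, records] into sorted year/value rows."""
    if len(payload) < 2 or payload[1] is None:
        return []  # no records came back for this selection
    rows = [
        # Assumes annual data, so "date" is a plain year such as "2021".
        {"year": int(rec["date"]), "value": rec["value"]}
        for rec in payload[1]
        if rec.get("value") is not None
    ]
    return sorted(rows, key=lambda row: row["year"])


def fetch_indicator(country_code: str, indicator: str) -> list:
    """Fetch at runtime instead of reading a hand-downloaded CSV."""
    url = build_indicator_url(country_code, indicator)
    with urlopen(url, timeout=30) as resp:
        return parse_indicator_payload(json.load(resp))
```

Keeping URL construction and parsing separate from the network call makes both easy to test without hitting the API, which also helps when rate limits are tight.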
GenAI Model in the Pipeline
A generative AI model operates on your data as a functional component of the pipeline. The LLM does real work – classifying, extracting, summarizing, or generating – not just helping you write code.
Examples:
- A tool where users paste free-text clinical notes and an LLM extracts structured fields (symptoms, diagnoses, severity)
- An app where users upload survey responses and an LLM classifies them into thematic categories, then displays the distribution
- A report generator where the user selects a topic, the pipeline runs an analysis, and an LLM drafts a plain-language interpretation
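One way to treat an LLM as a functional pipeline component is to wrap it in a function that validates the model's reply before anything downstream uses it. The sketch below is in Python and is only an assumption about how such a step might look. `call_llm` is a hypothetical placeholder for whatever client you use (any function that takes a prompt string and returns the model's text), and the prompt wording and field names are illustrative.

```python
import json
from typing import Callable

REQUIRED_FIELDS = ("symptoms", "diagnoses", "severity")

PROMPT_TEMPLATE = (
    "Extract the following fields from the clinical note and reply with JSON "
    "only, using the keys symptoms (list), diagnoses (list), severity (string)."
    "\n\nNote:\n{note}"
)


def extract_fields(note: str, call_llm: Callable[[str], str]) -> dict:
    """Ask the model for structured fields and validate the reply before use."""
    # call_llm is a hypothetical stand-in for a real LLM client call.
    reply = call_llm(PROMPT_TEMPLATE.format(note=note))
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError as err:
        raise ValueError("model reply was not valid JSON") from err
    if not isinstance(parsed, dict):
        raise ValueError("model reply was not a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in parsed]
    if missing:
        raise ValueError(f"model reply is missing fields: {missing}")
    # Keep only the expected keys so stray output never reaches the app.
    return {field: parsed[field] for field in REQUIRED_FIELDS}
```

Passing the model call in as an argument lets you test the pipeline with a fake function and swap providers without touching the validation logic.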
Automation
Your pipeline runs on a schedule without human intervention. The product does its job autonomously – fetching fresh data, processing it, and producing output.
Examples:
- A GitHub Actions workflow that runs weekly, pulls updated data from an API, regenerates a report, and publishes it to a dashboard users can query
- A scheduled pipeline that checks for new data, runs quality checks, and sends an alert if anomalies are detected
- An automated data refresh that updates a hosted dashboard with the latest numbers
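The "quality checks and alert" example can be sketched as a small function that a scheduler (such as a GitHub Actions workflow) would call on each run. This Python sketch is one possible approach, not a prescribed method. The z-score threshold of 3, the function names, and the `send_alert` callable are all assumptions made for illustration.

```python
from statistics import mean, stdev


def find_anomalies(history: list, latest: list, z_threshold: float = 3.0) -> list:
    """Return human-readable problems found in the latest batch of values."""
    problems = []
    if not latest:
        problems.append("no new data arrived")
        return problems
    missing = sum(1 for value in latest if value is None)
    if missing:
        problems.append(f"{missing} missing value(s)")
    if len(history) >= 2:
        center, spread = mean(history), stdev(history)
        for value in latest:
            if value is None or spread == 0:
                continue  # nothing to compare against
            if abs(value - center) / spread > z_threshold:
                problems.append(f"value {value} is far from the historical mean")
    return problems


def run_scheduled_check(history: list, latest: list, send_alert) -> bool:
    """Entry point for a scheduler; alerts only when something is wrong."""
    problems = find_anomalies(history, latest)
    if problems:
        # send_alert is a hypothetical callable (email, chat message, log).
        send_alert("; ".join(problems))
    return not problems
```

Because the function needs no human input, the same code runs identically whether it is triggered by hand during development or by the schedule in production.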
Combining Capabilities
You are encouraged to combine capabilities. A Shiny app (user input) that pulls from an API (API integration) and uses an LLM to summarize results (GenAI) is ambitious and excellent. But user input plus one capability done well is sufficient.
What Does NOT Count
A traditional analysis-and-write-up will not meet the requirements, regardless of quality. The following do not qualify on their own:
- A Quarto document with static visualizations and narrative
- A slide deck summarizing an exploratory analysis
- A notebook that runs code and prints output
- A rendered HTML page hosted on GitHub Pages with no user input, no API, no automation, and no GenAI
These are analyses. The difference is what happens at runtime. If the only thing that happens is rendering a document from a static dataset, it is not a data product.
Pipeline Documentation
Every project must include clear documentation of the data pipeline in your README.md. This is not a methods section. It is a technical document that describes:
- Where the data comes from – API endpoint, database, user input, file
- How the data is ingested – packages, authentication, format
- How the data is processed – cleaning, joining, transforming, modeling
- What the output is – dashboard, app, report, predictions, API responses
- How someone else could run it – dependencies, environment setup, API keys, deployment steps
Think of it as the document you would hand to a colleague who needs to maintain your product after you leave.
Data
You are free to use any data source.
If your data includes protected health information, you must use the synthpop package to create a synthetic version.
Deliverables
1. Proposal (due March 17)
A short document (2 pages) covering:
- What you are building. One paragraph describing the product, who it is for, and what it does.
- Which capability/capabilities you are targeting. API, GenAI, automation, or a combination (in addition to user input, which is required).
- Data source(s). Where the data comes from and how you will access it.
- Technical plan. Key packages, services, or APIs you expect to use. A rough sketch of the pipeline.
- Division of labor (if working in a pair).
This is a checkpoint, not a contract. Your project can evolve. But this document needs to be thorough. It’s your best chance to kick the tires and make sure you’ve got a good plan.
2. Working Product
The thing itself. This could be:
- A GitHub repo with code that runs the full pipeline
- A URL where we can access a hosted app or dashboard
- A combination of both
We must be able to run it or access it. If your product requires API keys or credentials, include setup instructions, but never the keys themselves.
3. Pipeline Documentation
Your README.md covering the five pipeline components described above.
4. Presentation (April 7, 9, or 14)
5 minutes maximum. Demo the product. Show us what it does, not just what you found.
Your presentation should cover:
- What problem does your product address?
- Live demo or walkthrough of the product in action
- Key design decisions and trade-offs
- What you would improve with more time
If you work with a partner, both of you must present. Prepare slides using Quarto. The presentation order will be randomized and posted in advance.
5. Peer Evaluation
A brief evaluation rating your partner’s contribution (if working in a pair) and providing feedback on other teams’ products.
Project Structure
Submit via GitHub repository (preferred) or zip file:
project/
├── README.md # Pipeline documentation
├── app.R or index.qmd # Your product (varies by type)
├── R/ # Supporting scripts or functions
├── data/ # Raw or cached data (if applicable)
├── deck/ # Presentation slides (qmd + html)
└── .gitignore # Exclude API keys, large files, etc.
Do not commit API keys or secrets to your repository. Use environment variables or a .env file that is gitignored.
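A minimal sketch of the environment-variable approach, shown in Python (in R, `Sys.getenv()` plays the same role). The variable name `MY_API_KEY` and the function name are hypothetical. The point is that the key lives outside the repository and the code fails with a clear message when it is missing.

```python
import os


def get_api_key(name: str = "MY_API_KEY") -> str:
    """Read a secret from the environment and fail loudly if it is absent."""
    # MY_API_KEY is a placeholder name; use whatever your README documents.
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            "See the README for setup instructions."
        )
    return key
```

A clear error at startup is easier to debug than an authentication failure deep inside the pipeline, especially for someone running your product for the first time.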
Grading
| Component | Points |
|---|---|
| Proposal | 10 pts |
| Product | 40 pts |
| Pipeline documentation | 15 pts |
| Presentation | 20 pts |
| Peer evaluation | 15 pts |
Product (40 pts)
- Does it work? Can we run it, access it, or interact with it? (10 pts)
- Does it qualify? Does it accept user input and include at least one additional capability – API, GenAI, or automation? (10 pts)
- Is it useful? Does it address a real question or need? (10 pts)
- Code quality. Is the code organized, readable, and reproducible? (10 pts)
Pipeline Documentation (15 pts)
- Completeness. Are all five pipeline components documented? (5 pts)
- Reproducibility. Could someone else set up and run this from your docs? (5 pts)
- Clarity. Is it well-written and easy to follow? (5 pts)
Presentation (20 pts)
- Demo. Did you show the product working? (8 pts)
- Design rationale. Did you explain your choices and trade-offs? (4 pts)
- Communication. Was the presentation clear, organized, and within time? (4 pts)
- Self-awareness. Did you discuss limitations and what you would improve? (4 pts)
Peer Evaluation (15 pts)
- Partner evaluation of contribution, if applicable (10 pts)
- Thoughtful feedback on other teams’ products (5 pts)
Tips
- Start with something small that works, then add complexity. A working product with one feature is better than a broken product with five.
- Test your product on someone who has not seen it before. If they cannot figure out what it does in 30 seconds, simplify.
- If you are using an API, check rate limits and authentication requirements early. Do not discover on presentation day that your API key expired.
- If you are hosting something, deploy early. Deployment fights the night before presentations are not fun.
- Budget time for the README. Writing documentation after you finish building is painful but necessary.
Late Work
- There is no late submission for the presentation. You must present live on your assigned day.
- For all other deliverables, the late penalty is 5% per calendar day, up to 7 days. Notify us before the deadline if you plan to submit late.