Grading with AI – A Workflow Example
As part of my course on using AI in business, I felt compelled to develop workflows that automated or augmented parts of my course tasks, as a proof of concept for the students. To that end, I started with one of the most time-consuming and mentally challenging aspects of the job: grading. If I could find a way to speed up grading by augmenting the process with AI, I would greatly improve my quality of work life. Here’s what I learned!
I started with a conversation with Claude on how best to improve my workflow in grading course projects. https://claude.ai/share/2a7031b0-2934-48db-bcea-66407203b28f
As you can see from the conversation, I started by thinking through different options within Claude’s features. It suggested creating an app to do the grading, but that assumed I had funds to pay the API fees, which I don’t. With that option off the table, it reverted to a simpler setup, something similar to what I’ve done in the past. The simpler setup also let me use two platforms for the workflow, Claude and ChatGPT, which was great because it allowed me to compare their results.
Grading with AI workflow
Here’s the final workflow:
- I first graded three submitted projects, adding lots of details on my reasoning for each rubric item to help calibrate the AIs to my expectations.
- Next, I created a project in both Claude and ChatGPT called “AI project grading”. In those projects, I added the instructions for the project and the three calibrated project grades.
- Next, I downloaded the deliverables from each submission. One by one, I uploaded the projects to both Claude and ChatGPT. Then with the prompt below, I asked both platforms to assess the submission.
- I then compared Claude’s response to ChatGPT’s response to ensure consistency. I also read through the assessment and scanned the project submission to ensure the results were fair and objective. I found a few mistakes as I discuss below.
- Lastly, after all the submissions were done, I reviewed the overall grade average to ensure the AIs were not overly strict or permissive. If a curve was warranted, I would have applied that prior to releasing the grades for students to see.
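The final sanity check in the last step can be sketched as a short script. This is a hypothetical illustration, not part of my actual workflow: the grades, the target average, and the additive-curve approach are all made-up assumptions.

```python
# Hypothetical sketch of the final review step: compare the class
# average against a target and compute an additive curve if warranted.

def curve_needed(scores, target_avg=85.0):
    """Return the additive curve (0 if none) that would bring the
    class average up to the target average."""
    avg = sum(scores) / len(scores)
    return max(0.0, target_avg - avg)

def apply_curve(scores, curve, max_score=100):
    """Add the curve to each score, capped at the maximum possible score."""
    return [min(max_score, s + curve) for s in scores]

# Made-up example grades
grades = [78, 84, 91, 69, 88]
curve = curve_needed(grades)        # class average is 82.0, so a 3.0 curve
curved = apply_curve(grades, curve)
```

In practice I only eyeballed the average, but a check like this makes the "not overly strict or permissive" test concrete.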
As you can see in the workflow, I included multiple human-in-the-loop checkpoints to ensure the process worked well. Some lessons I learned along the way: ChatGPT performed consistently well, whereas Claude started having issues with its responses, sometimes assessing one of my calibration examples instead of the target student’s submission as requested in the prompt. I also found that while both platforms made similar comments, Claude was consistently stricter than ChatGPT when scoring. Given my problems with Claude, I stopped using it about three-quarters of the way through the grading process.
I also found several errors through the review process, including a mistake in the prompt. This required me to go back through previous submissions and fix them accordingly.
Here’s the prompt I used:
You are grading a graduate-level project submission. The student has submitted the attached document.
CONTEXT
This is a graduate course. Students are expected to produce professional, evidence-based work. The course emphasizes applying AI concepts to real organizational problems and communicating findings clearly to non-technical audiences.
ASSIGNMENT-SPECIFIC INSTRUCTIONS
– The student assessed an organization using a five-level AI Capability Maturity Model across six dimensions: Data, Infrastructure, AI Usage, People, Processes, and Governance & Ethics.
– Findings must be supported by evidence (interview quotes, observations, company materials).
– Recommendations should move the organization to the next higher maturity level per dimension.
SCORING STANDARDS
Apply these standards consistently across all papers:
– Full marks require meeting all elements of the criterion description. Do not award full marks if any element is missing or clearly underdeveloped.
– When scoring, identify not only why the current score is appropriate, but specifically what is missing that would earn the next higher score.
– Evidence quality matters. Unsupported assertions, conclusions without data, or recommendations not tied to findings should result in deductions.
– A score in the 90-95% range (e.g., 23/25) indicates strong work with minor but identifiable gaps. Reserve full marks for work that is genuinely complete on that criterion.
CALIBRATION EXAMPLES
Use the following graded examples 1, 2, and 3 to calibrate your scoring. These represent the instructor’s own judgments.
OUTPUT FORMAT
Return your evaluation in exactly this format:
—
STUDENT: [name]
ASSIGNMENT: [type]
TOTAL: [X/100]
RUBRIC:
1. Problem Framing & Context: [score/15] — [2-3 sentence reasoning: what earned the score AND what would earn the next higher score]
2. Audit Methodology / Data & Method: [score/20] — [reasoning]
3. Findings & Analysis: [score/25] — [reasoning]
4. Recommendations & Roadmap: [score/20] — [reasoning]
5. AI Audit: [score/5] — [reasoning]
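Because both platforms return the same structured format, part of the cross-platform consistency check can be automated. Here is a minimal sketch; the regexes assume the exact field labels above, and the `scores_agree` helper and its tolerance are my own hypothetical additions, not something I used in the actual workflow.

```python
import re

def parse_evaluation(text):
    """Pull the total and per-criterion scores out of a response that
    follows the output format above (assumes the model fills in real
    numbers in place of the bracketed placeholders)."""
    total_m = re.search(r"TOTAL:\s*\[?(\d+)/100", text)
    items = re.findall(r"^(\d+)\.\s+(.+?):\s*\[?(\d+)/(\d+)", text, re.MULTILINE)
    return {
        "total": int(total_m.group(1)) if total_m else None,
        "scores": {name: (int(got), int(outof)) for _, name, got, outof in items},
    }

def scores_agree(a, b, tolerance=2):
    """List the criteria where two parsed evaluations differ by more
    than `tolerance` points, to focus the human review."""
    return [k for k in a["scores"] if k in b["scores"]
            and abs(a["scores"][k][0] - b["scores"][k][0]) > tolerance]
```

A script like this would flag only the rubric items where the two platforms diverge, rather than requiring a line-by-line comparison of every submission.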
Takeaways
The major takeaways from this experiment:
- Augmenting the grading process saved me numerous hours of tedious work.
- Training the AI on my standards was essential for grading success.
- Reviewing the outputs was essential for catching errors and miscalibrations in grading standards.
- Transparency with the students is a necessary component so that they understand how the process maintains fairness and objectivity.
Cheers,