
Using LLM for grading

  • birina
  • Mar 31
  • 5 min read

Updated: Jul 6

In academia, using AI for grading and providing feedback seems like a reasonable next step. I suspect that AI will manage to grade and provide valid feedback for calculation problems of easy to medium difficulty. If it can solve a problem, it can grade it, right? But what if the assignment is not a calculation problem? We need to test it!


Note: the privacy of my students is a major concern for me (even with the setting that instructs the LLM not to save the information; as long as something is online, I have low trust that the data is not collected). For the purpose of this experiment, I have replaced the names of all students.


What activity did I ask the LLM to grade?

One of the activities that students do (juniors, enrollment 100-150, course 'Transportation Systems Analysis') is Weekly Group Activities (5 students per team), where students are asked to provide responses to 5 questions (some are debate questions, some are knowledge-based).


Each team provides a one-page report graded on a 0-5 scale.


In addition, teams are required to hold their discussions online so that participation can be graded. Here is the section from the syllabus/assignment regarding participation grading (a small sketch of this rubric as code follows the quote):

***

Chat discussion, graded individually on a 0-3 scale, where

i. 0 is assigned to students who have not participated;

ii. 0.5 per question for a student who provided a minor contribution (helped edit the question, provided a short comment, and/or expressed agreement/disagreement with others' ideas);

iii. 1 per question for a student who provided a major contribution (participated in the debates for that question, provided justification for their agreement/disagreement with others' ideas);

iv. Note: to get full contribution points, a student needs to provide major contributions for at least 2 questions and minor contributions for two other questions.

***
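
For concreteness, here is a minimal sketch of that rubric as code. The level names and the function are mine, written for illustration only; they are not part of the actual grading workflow.

***

# Per-question contribution levels and their point values, per the rubric above.
POINTS = {"none": 0.0, "minor": 0.5, "major": 1.0}

def participation_grade(contributions: dict[str, str]) -> float:
    """contributions maps a question id (e.g. 'A.2') to 'none', 'minor', or 'major'."""
    raw = sum(POINTS[level] for level in contributions.values())
    # Full credit (3.0) corresponds to, e.g., 2 major + 2 minor contributions
    # (2 * 1 + 2 * 0.5); anything beyond that is capped by the 0-3 scale.
    return min(raw, 3.0)

# Example: Model 1's breakdown for Student P in the example further below
student_p = {"A.1": "none", "A.2": "major", "A.3": "minor", "A.4": "none", "A.5": "none"}
print(participation_grade(student_p))  # 1.5

***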


What do we want to learn?

I am curious whether an LLM is capable of providing reasonable grading and feedback for the chat discussions. This year my enrollment was 103, so each week I had 21 reports and 21 chats. Every week, the assigned TA is tasked with reading through all the chats to capture participation. It is time-consuming and somewhat subjective, although the TA did a great job, with no complaints throughout the semester.


In this experiment, I wanted to see:

  • whether the LLM would be able to do this grading consistently across chats/groups;

  • how the LLM's grading compares with our grading;

  • and how different models perform. (A sketch of how this grading could be scripted follows this list.)
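
If one wanted to script this kind of experiment rather than paste transcripts into a chat window, a call along the following lines could work. This is a hypothetical sketch: the client, model name, and prompt wording are placeholders, not my exact setup.

***

# Hypothetical sketch only: the client, model name, and prompt wording are
# placeholders. Anonymize student names before sending anything to an
# external service.
from openai import OpenAI

client = OpenAI()  # or any OpenAI-compatible endpoint

RUBRIC = (
    "Grade each student's chat participation on a 0-3 scale: "
    "0.5 per question for a minor contribution, 1 per question for a major "
    "contribution, total capped at 3. Report a per-question breakdown "
    "and a one-sentence summary per student."
)

def grade_chat(transcript: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": "Chat transcript (names anonymized):\n" + transcript},
        ],
    )
    return response.choices[0].message.content

***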


My observations

I will give an example of one team's discussion graded by two different models (DeepSeek and the free Gemini model) later. Before that, let me share some observations from the experience:

  • The models more or less performed well on the extreme cases: students who did not participate or barely participated.

  • Different models often gave similar or identical grades, but their grades differed from the one given by the TA. This was so common across groups that I started checking myself whether the TA's grades were fair. My investigations supported the TA's decisions.

  • DeepSeek demonstrated initiative and often graded student participation as 3.5-4.5 out of 3 with the comment 'Exceeding Expectations'. That was funny and surprising! The grading policy and my prompt instructions stated clearly that the grade should be between 0 and 3. However, because there are directions on how to grade the contribution to each question, when a student participated in more questions than required, DeepSeek violated the 0-3 scale. The other models I tried followed the 0-3 constraint and stopped at 3 even if the student contributed to more questions. It was interesting to see how the models prioritized conditions that contradict each other. (A small sanity check one could script for this is sketched right after this list.)
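
This kind of scale violation is easy to catch automatically. Here is a tiny sketch of a check one could run on model-reported grades before trusting them; the function and the numbers are illustrative only, not part of my actual workflow.

***

# Flag any model-reported grade outside the 0-3 scale for human review,
# rather than silently clamping it. Illustrative values only.
def out_of_scale(grades: dict[str, float], lo: float = 0.0, hi: float = 3.0) -> dict[str, float]:
    return {student: g for student, g in grades.items() if not lo <= g <= hi}

print(out_of_scale({"student_1": 4.5, "student_2": 2.5, "student_3": 3.0}))
# {'student_1': 4.5}

***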


Results of my investigation into the differences between the LLM and TA assessments:

  • When students had little discussion on a question, just 1-2 messages, but those messages were to the point and at the level of a major contribution, we gave them full credit; the models, however, awarded a minor contribution (0.5) and on rare occasions even no contribution (0 points).

  • The opposite was also true: when a student had a lot of discussion with somewhat relevant text but no substance (discussing wording and formatting for the report, asking others for clarification, confirming that others had solved the question), we gave no contribution (0) or a minor contribution (0.5), but the models awarded more, qualifying that discussion as a major contribution.


This means that the models couldn't 'grasp' the meaning of the discussion; they just counted the number of relevant-looking statements. So, as of now, TAs are the only valid graders for this type of assignment. The models are very close to being able to do it, but not quite there yet.


Example

Here is an example of the analysis from two different models. For this team we gave 3 out of 3 to every student; however, the models had big disagreements with us, demonstrating some of my observations above. (I include the detailed report for only one student to keep this post readable.)


Model 1

Student Analysis and Grading:

Student P:

  • A.1: Did not contribute.

  • A.2: Provided major contribution by selecting and justifying vehicle attributes.

  • A.3: Provided minor contribution by acknowledging AI's use in spellcheck and understanding.

  • A.4: Did not contribute.

  • A.5: Did not contribute.

  • Summary: Paul contributed to the discussion, mainly focusing on Q2, and provided some general feedback and edits to the final document.

  • Grade: 1.5 (1 for A.2, 0.5 for A.3)

Grade Summary:

In summary, students Y, A, T, and E all received a 3, and P received a 1.5.


Model 2

Here’s the summary and grading of each student’s contributions based on the provided chat transcript and the evaluation metric. I’ve analyzed their participation in the discussion for each question (A.1 to A.5) and assigned scores accordingly.

Student P

  • A.1: No contribution.

  • A.2: Provided a major contribution (1 point) by agreeing with A’s suggestion for modifying vehicle attributes (tau) and proposing emissionClass and sigma as additional attributes.

  • A.3: No contribution.

  • A.4: No contribution.

  • A.5: No contribution.

Total Score: 1/3

Summary: P contributed moderately to A.2 but did not address A.1, A.3, A.4, or A.5. His participation was limited and needs improvement.

***

Final Grades

  • A: 4.5/3 (Exceeds expectations)

  • Y: 3.5/3 (Exceeds expectations)

  • T: 3/3 (Meets expectations)

  • E: 2.5/3 (Meets expectations but could improve)

  • P: 1/3 (Needs improvement)


You can guess which model was DeepSeek.
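
To put a number on the disagreement for this team: the TA gave every student 3/3, so the mean absolute difference from the TA's grades works out to 0.3 for Model 1 and 0.9 for Model 2. A quick sketch of that calculation, using the grades quoted above:

***

# Mean absolute difference from the TA's grades for this team, using the
# per-student grades quoted above. A crude but readable agreement measure.
TA      = {"A": 3.0, "Y": 3.0, "T": 3.0, "E": 3.0, "P": 3.0}
MODEL_1 = {"A": 3.0, "Y": 3.0, "T": 3.0, "E": 3.0, "P": 1.5}
MODEL_2 = {"A": 4.5, "Y": 3.5, "T": 3.0, "E": 2.5, "P": 1.0}

def mean_abs_diff(a: dict[str, float], b: dict[str, float]) -> float:
    return sum(abs(a[s] - b[s]) for s in a) / len(a)

print(mean_abs_diff(TA, MODEL_1))  # 0.3
print(mean_abs_diff(TA, MODEL_2))  # 0.9

***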


Concluding Notes

Note that Model 1 captured some contribution from student P for A.3 but labeled it minor, while Model 2 did not catch it at all. We judged both of P's contributions as major. Both models also missed minor contributions from that student, where he just clarified something in one or two sentences.


This experiment and report were created for a Lumen Circle activity during the Spring 2025 semester. By the time you read this, the experiment will likely not be replicable due to the rapid advancement of LLMs. Hopefully, the results will be much better! …and safer.


If you want to discuss this more, feel free to reach out or to join a Lumen Circle. I do recommend it!
