
Use of LLM Arena for easy-to-medium-level engineering problems

  • birina
  • Jul 6
  • 4 min read

Updated: Aug 4

The swift and widespread integration of Large Language Models into virtually every professional sector has created an immediate and critical need for their inclusion within university curricula. We need to incorporate them yesterday. 


Universities have an obligation to guide students in navigating this new technology ethically: teaching them to critically evaluate AI-generated content, understand its inherent biases, and leverage it as a powerful tool for augmenting human intellect rather than replacing it. And all of that, somehow, without sacrificing the major task: learning critical thinking and problem-solving skills in a specific area.


When the most advanced LLMs are able to solve almost all engineering problems, how can we achieve that?


Let’s share ideas!


Idea: Use of LLM Arena for easy-to-medium-level engineering problems.


LLM Arena is a crowdsourced online platform where users anonymously vote for the better of two side-by-side large language model responses to the same prompt, generating a continuous leaderboard that ranks models based on human preference.



I teach transportation engineering to juniors, so my example comes from that course.


Problem from the text book:


A four-timing-stage traffic signal has critical lane group flow ratios of 0.225, 0.175, 0.200 and 0.150. If the lost time per timing stage is 5 seconds and a critical intersection v/c of 0.85 is desired, calculate the minimum cycle length and the timing stage effective green times such that the lane group v/c ratios are equalized.
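For instructors who want the reference answer at hand, here is a minimal Python sketch of the standard minimum-cycle-length calculation, assuming the usual textbook formulation: C_min = L * X_c / (X_c - sum(Y_i)), with effective greens g_i = Y_i * C / X_c allocated so that all lane group v/c ratios equal X_c.

```python
# Sketch of the standard minimum-cycle-length calculation for the
# textbook problem above, so LLM answers can be checked quickly.
Y = [0.225, 0.175, 0.200, 0.150]  # critical lane group flow ratios
lost_per_stage = 5                # lost time per timing stage, seconds
n_stages = 4
Xc = 0.85                         # desired critical intersection v/c

L = n_stages * lost_per_stage     # total lost time per cycle = 20 s
C_min = L * Xc / (Xc - sum(Y))    # 20 * 0.85 / (0.85 - 0.75) = 170 s
greens = [y * C_min / Xc for y in Y]  # equalized-v/c effective greens

print(f"Minimum cycle length: {C_min:.0f} s")       # prints 170 s
for i, g in enumerate(greens, 1):
    print(f"Stage {i} effective green: {g:.0f} s")  # 45, 35, 40, 30 s
# Sanity check: greens sum to 150 s; adding 20 s lost time recovers
# the 170 s cycle length.
```

The variable names here are my own; the formula itself is the standard one taught alongside this problem.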


Revised problem for individual homework:


Using LLM Arena (https://lmarena.ai/), solve this problem two times with random models (each battle yields two responses, so you will get four different solutions).

“A four-timing-stage traffic signal has critical lane group flow ratios of 0.225, 0.175, 0.200 and 0.150. If the lost time per timing stage is 5 seconds and a critical intersection v/c of 0.85 is desired, calculate the minimum cycle length and the timing stage effective green times such that the lane group v/c ratios are equalized.”

Note: you may need to copy this problem into a text editor before pasting it into the Arena text field.


For every ‘battle’ provide:

  1. Screenshots of the obtained answers before and after your judgement (the first screenshot will show Assistant A/B instead of model names; the second will show the model names).

  2. A report of which models participated and what your judgement was.

  3. Justification for your judgement with at least one reason. 


For every solution, assume that you are a teacher who is grading students’ work and provide:

  1. A solution screenshot with the name of the model. 

  2. A grade on a 0-10 scale, where 0 means no solution provided and 10 means a perfect solution. Make sure your grading is fair and consistent between 'students'.

  3. If your grade is less than 10, explain all mistakes made by the ‘student’.

  4. Feedback to the 'student'.


This is how the LLM Arena interface looks as of July 2025:


LLM Arena Interface (July 2025)

Why an easy-to-medium-level problem?


As of July 2025, most of these problems CAN be solved by advanced LLMs without any additional materials. However, difficult problems that require merging together multiple concepts may not be solvable yet with a zero-shot approach. 


Why solving it at least twice? 


Some 'junior' models may not be able to solve even easy problems; see the picture below:


LLM results before user judgement (no names of the models). Right screen - no solution is provided, left screen - correct solution is provided.

You can ask students to run another battle if this happens, or ask them to solve the problem 3-4 times to get a range of outputs, including no solutions, wrong solutions, and correct solutions.


In the picture below, the right solution is wrong, while the left solution is correct.


LLM Arena results after user finished judgement (left screen is highlighted with green, names of the models are visible). Right screen - wrong solution is provided, left screen - correct solution is provided.

The problem considered here is an easy one that every student can solve correctly, usually on the first attempt.


If more advanced LLMs are used (Gemini, ChatGPT, etc.), students will more likely get the correct answer on the first attempt, limiting their incentive to look into the solution at all.


However, when using LLM Arena, we got:

  • One no solution,

  • One wrong solution,

  • Two correct solutions.

This makes for a more engaging activity for the student.


Why ask to grade and to provide feedback?


The grading process, with its discussion of errors, forces students to look deeply into the solutions (however easy the problem is) multiple times, providing the necessary practice.


Writing feedback trains the ability to communicate engineering problems.


Another bonus: there is no need to generate different numbers to avoid 'cheating', as the LLMs provide enough randomness to guarantee that students will be working with somewhat different problems. Sounds like win-win-win! Right?


Final thoughts: Is this feasible for large classes?


The only issue is how to grade such homework for 150 students. We do need new tools in BrightSpace.


Before LLMs, this assignment was a self-graded calculation problem with randomly generated numbers and unified feedback with easy solution steps. However, if a problem is formulated in the way proposed here, grading becomes a time-consuming task that requires further discussion, especially in light of the intense debate about teachers being among the first to be replaced by AI.


What do you think?
