Join us on Monday 3rd of November to discuss how consistent humans are when grading programming assignments (spoiler alert: not very)

CC BY scales of justice via flaticon.com

Giving students fair, justified and consistent feedback on their work is a cornerstone of good teaching practice. So how consistent are humans when they are grading programming assignments? Join us on Monday 3rd November at 2pm GMT (UTC) to discuss a paper [1], abstract below:

Providing consistent summative assessment to students is important, as the grades they are awarded affect their progression through university and future career prospects. While small cohorts are typically assessed by a single assessor, such as the module/class leader, larger cohorts are often assessed by multiple assessors, typically teaching assistants, which increases the risk of inconsistent grading.

To investigate the consistency of human grading of programming assignments, we asked 28 participants to each grade 40 CS1 introductory Java assignments, providing grades and feedback for correctness, code elegance, readability and documentation; the 40 assignments were split into two batches of 20. The 28 participants were divided into seven groups of four (where each group graded the same 40 assignments) to allow us to investigate the consistency of a group of assessors. In the second batch of 20, we duplicated one assignment from the first to analyse the internal consistency of individual assessors.

We measured the inter-rater reliability of the groups using Krippendorff’s α—an α > 0.667 is recommended to make tentative conclusions based on the rating. Our groups were inconsistent, with an average α = 0.02 when grading correctness and an average α < 0.01 for code elegance, readability and documentation.
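
For readers unfamiliar with the statistic, here is a minimal sketch of how inter-rater reliability could be computed for a handful of made-up grades using the open-source krippendorff Python package; the package choice, the rater layout and the numbers are illustrative assumptions, not the authors’ data or tooling.

```python
# Minimal sketch (assumed setup, not the study's data or analysis pipeline):
# estimate inter-rater reliability with Krippendorff's alpha using the
# open-source `krippendorff` package (pip install krippendorff numpy).
import numpy as np
import krippendorff

# Rows = the four assessors in a group, columns = assignments.
# np.nan marks an assignment that an assessor did not grade.
grades = np.array([
    [70, 55, np.nan, 90, 40],
    [65, 60, 80,     85, 35],
    [50, 58, 75,     95, np.nan],
    [72, 40, 70,     88, 45],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=grades,
                           level_of_measurement="interval")
# Values above 0.667 are the usual recommended minimum for even tentative conclusions.
print(f"Krippendorff's alpha: {alpha:.3f}")
```

An α near zero, as in the averages reported in the paper, indicates agreement no better than chance.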

To measure the individual consistency of graders, we measured the distance between the grades they awarded for the duplicated assignment in batch one and batch two. Only one of the 22 participants who didn’t notice that the assignment was a duplicate awarded the same grades for correctness, code elegance, readability and documentation. The average grade difference was 1.79 for correctness and less than 1.6 for code elegance, readability and documentation.

Our results show that human graders in our study cannot agree on the grade to give a piece of student work and are often individually inconsistent, suggesting that the idea of a ‘gold standard’ of human grading might be flawed. This highlights that a shared rubric alone is not enough to ensure consistency, and other aspects such as assessor training and alternative grading practices should be explored to further improve the consistency of human grading of programming assignments.

We’ll be joined by the paper’s lead author, Marcus Messer from King’s College London, who’ll give us a lightning talk summary of the research. All welcome; the meeting URL is public at zoom.us/j/96465296256 (meeting ID 9646-5296-256) but the password is private and pinned in the slack channel, which you can join by following the instructions at sigcse.cs.manchester.ac.uk/join-us

(Cite this article using DOI:10.59350/01cd6-dvq50 provided by rogue-scholar.org)

References

  1. Marcus Messer, Neil C. C. Brown, Michael Kölling and Miaojing Shi (2025) How Consistent Are Humans When Grading Programming Assignments? ACM Transactions on Computing Education, Volume 25, Issue 4, Article 49, pages 1–37. DOI:10.1145/3759256
  2. Neil Brown (2025) Consistency of grading programming assignments, academiccomputing.wordpress.com

Join us to discuss ten things engineers should learn about learning on Monday 5th February at 2pm GMT

“See one, do one, teach one” is a popular technique for teaching surgery to medical students. It has three steps:

  • You see one: by watching it, reading about it or listening to it
  • You do one: by engineering it or making it
  • You teach one: by telling others all about it


If you’re teaching engineers, what do you need to know beyond the seeing and doing? Understanding how human memory and learning work, and the differences between beginners and experts, can improve your teaching. So what practical steps can engineers take to improve the training and development of other engineers? What do engineers need to know in order to improve their own learning?

Join us on Monday 5th February at 2pm GMT (UTC) for our monthly ACM SIGCSE journal club meetup on zoom to discuss a paper on this topic by Neil Brown, Felienne Hermans and Lauren Margulieux, published in (and featured on the cover of) the January issue of Communications of the ACM. [1]

We’ll be joined by the lead author, Neil Brown of King’s College London, who will give us a lightning talk summary of the paper to kick off our discussion.

All welcome. As usual, we’ll be meeting on zoom; details at sigcse.cs.manchester.ac.uk/join-us

References

  1. Neil C.C. Brown, Felienne F.J. Hermans and Lauren Margulieux (2024) 10 Things Software Developers Should Learn about Learning, Communications of the ACM, Volume 67, No. 1. DOI:10.1145/3584859 (see accompanying video at vimeo.com/885743448)

Join us to discuss novice use of Java on Monday 7th November at 2pm GMT

Java is widely used as a teaching language in universities around the world, but what wider problems does it present for novice programmers? Join us to discuss a paper published in TOCE by Neil Brown, Pierre Weill-Tessier, Maksymilian Sekula, Alexandra-Lucia Costache and Michael Kölling. [1] From the abstract:

Objectives: Java is a popular programming language for use in computing education, but it is difficult to get a wide picture of the issues that it presents for novices, and most studies look only at the types or frequency of errors. In this observational study we aim to learn how novices use different features of the Java language.

Participants: Users of the BlueJ development environment have been invited to opt in to anonymously record their activity data for the past eight years. This dataset is called Blackbox, which was used as the basis for this study. BlueJ users are mostly novice programmers, predominantly male, with a median age of 16. Our data subset featured approximately 225,000 participants from around the world.

Study Methods: We performed a secondary data analysis that used data from the Blackbox dataset. We examined over 320,000 Java projects collected over the course of eight years, and used source code analysis to investigate the prevalence of various specifically selected Java programming usage patterns. As this was an observational study without specific hypotheses, we did not use significance tests; instead we present the results themselves with commentary, having applied seasonal trend decomposition to the data.

Findings: We found many long-term trends in the data over the course of the eight years, most of which were monotonic. There was a notable reduction in the use of the main method (common in Java but unnecessary in BlueJ), and a general reduction in the complexity of the projects. We find that there are only a small number of frequently used types: int, String, double and boolean, but also a wide range of other infrequently used types.

Conclusions: We find that programming usage patterns gradually change over a long period of time (a period where the Java language was not seeing major changes), once seasonal patterns are accounted for. Any changes are likely driven by instructors and the changing demographics of programming novices. The novices use a relatively restricted subset of Java, which implies that designers of languages specifically targeted at novices can satisfy their needs with a smaller set of language constructs and features. We provide detailed recommendations for the designers of educational programming languages and supporting development tools.
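
The “seasonal trend decomposition” mentioned in the abstract separates within-year cycles (such as the academic calendar) from the long-term trend. As a rough, hypothetical illustration of the idea (not the authors’ actual pipeline), the sketch below decomposes an invented monthly usage series with statsmodels’ STL:

```python
# Hypothetical sketch: separating long-term trend from yearly seasonality
# in a monthly usage series, in the spirit of the seasonal trend
# decomposition mentioned in the abstract. The data are synthetic and
# stand in for real Blackbox usage counts.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Eight years of monthly counts with an academic-year cycle and a slow
# downward trend (invented numbers for illustration only).
months = pd.date_range("2013-06", periods=96, freq="MS")
rng = np.random.default_rng(0)
counts = (1000 - 3 * np.arange(96)                        # gradual decline
          + 200 * np.sin(2 * np.pi * np.arange(96) / 12)  # yearly seasonality
          + rng.normal(0, 30, 96))                        # noise
series = pd.Series(counts, index=months)

result = STL(series, period=12).fit()
print(result.trend.head())     # deseasonalised long-term trend
print(result.seasonal.head())  # repeating within-year pattern
```

Once the seasonal component is removed, what remains in the trend is the kind of gradual, mostly monotonic change the paper reports.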

All welcome. As usual, we’ll be meeting on zoom; details at sigcse.cs.manchester.ac.uk/join-us

References

  1. Neil C. C. Brown, Pierre Weill-Tessier, Maksymilian Sekula, Alexandra-Lucia Costache and Michael Kölling (2022) Novice use of the Java programming language, ACM Transactions on Computing Education. DOI:10.1145/3551393