Science, Technology & Policy

Teaching Ethics & AI in the Wake of ChatGPT

The advent of Large Language Models (LLMs) has raised profound questions about why and how we teach writing at university, and about the extent to which instructors can still assess student understanding on the basis of written work. In the most recent iteration of Caltech’s first-year class on Ethics & AI, Professor of Philosophy Frederick Eberhardt tested a very permissive LLM policy. This report details his experience, highlighting the need to strike a delicate balance between training students to productively integrate LLMs into their workflow and ensuring that the development of their critical thinking skills remains center stage.

———

For better or worse, this past fall quarter I decided to adopt an extremely permissive policy on the use of Large Language Models (LLMs) for my first-year undergraduate humanities course on “Ethics & AI” (see the policy at the end of this article). The arrival of ChatGPT and its powerful siblings had led to much soul-searching amongst the humanities faculty, and last summer there was a scramble to determine course policies on LLMs. Caltech’s own overarching policy on the use of generative AI only came online midway through the fall term.

The teaching of writing forms part of the mandate of Caltech humanities courses, and written assignments, often with required revisions, have been a standard tool of assessment in these courses. The arrival of LLMs has not only challenged our standards of authorship, but also made us revisit the aims of writing instruction: Is learning to write learning to communicate? Is learning to write learning to think? Does writing well help students read carefully? Or is writing instruction about finding and developing one’s own voice?

The goals are all worthy, but for which is writing instruction the key tool? And where might LLMs help achieve these goals rather than undermine them? Unsurprisingly, and I think quite healthily, opinions on this matter vary widely among the humanities faculty and, thankfully, for now the generic division-wide policy has allowed us to try different approaches: each instructor can set their own class policy on the use of LLMs.

In a course on “Ethics & AI,” I felt that we had to give the new unknown in the field a try. We had to learn how to engage intelligently with the tools, how to integrate LLMs into our workflow, figure out what works, and explore the limitations. This applied as much to the students as it did to me. I did not really know how essays written using LLMs would look, how reliably I would be able to detect (or even just suspect) LLM usage, nor did I have a clear idea of how students would use the tools. I pitched my permissive policy to the students as an exploration, emphasizing that we had to pay attention to ensure that we don’t lose our power as writers and thinkers, that we don’t want our voices to be tempered into lukewarm corporate softspeak, and, most importantly, that we must still own and stand behind what we write, whether machine-boosted or not.

In my course, then, LLM usage was permitted in all forms for all aspects of writing, but it was not required. For each assignment, students were asked to report on their LLM usage: which LLM they used, at what stage of the writing process they had used it, and whether they had found particular prompting techniques that worked well or had encountered problems – and I asked that these reports not be written with an LLM (please!). We had several class discussions at different points during the term in which students shared advice, recommended different LLMs, or acknowledged that for certain tasks the machine was useless. Assignments consisted of several 500-1,000-word subsections of a term-length project that culminated in a policy proposal for the regulation of generative AI. Overall grades for Ethics & AI were pass/fail, as is typical for a first-term course at Caltech. This alleviated the pressure of fine-grained grading and limited the consequences that a failure of this policy might have had. Of course, it also affected the effort that students put into their work.

So, what did my students do? Few first-year students had used LLMs for substantive writing tasks prior to my course. Several chose not to use them initially, but I believe everyone eventually used LLMs in some form or other. The first assignment resulted in the expected disaster: for the most part I had to read generic ChatGPT-speak consisting of perfectly formed sentences that made very high-level points with very little substance. Detecting auto-generated text seemed trivial – I trust it is not the students themselves who now write like a marketing department, making sure everything is shiny while committing to nothing. However, students quickly adapted: ChatGPT’s failure (at the time) to provide proper references soon led to it being replaced by its (then) more powerful siblings. And students realized that the task of writing was not going to be just a matter of copy-pasting my prompt into the machine, but would require multiple iterations and revisions. I started to see excellently researched case studies in which I could no longer tell which parts were student-written and which were machine-generated. Most likely, that separation was no longer even well-defined, as students had gone back and forth many times revising and improving their text, prompting the LLM with new strategies, and trying again with different ideas that were by then a mix of their previous input and the LLM’s outputs. Students also shifted the stage at which they used LLMs in the writing process: many reported that the actual text generation was far less efficient than writing themselves, but that the LLMs were enormously helpful for brainstorming ideas and getting started on their assignments. Shortly after midterm, we hit a plateau, with several students returning to doing more of their writing on their own (or at least, so they reported), while LLMs were now used principally as sophisticated search engines that could summarize large bodies of text (the EU regulations on AI run to a few hundred pages, most of them deadly boring, so one can hardly blame them).

At the end of the term, I received several outstanding term projects that, to the best of my assessment, were better than either the student or the machine could have produced alone. These students had managed to integrate the LLM into their workflow in such a way that they could draw on the vast amount of knowledge the LLMs have processed without losing the particular angle they wanted to argue for. Their write-ups included a wealth of material that we had not covered in class, and their points were insightful, well developed, and well supported. Several students reported that ChatGPT’s failure (at the time) to provide proper sources had made them track down the sources of claims from later LLM interactions much more carefully: the LLM told them what to look for, and they could then find the relevant passage in the original source.

Of course, the outcomes were not all rosy. Unfortunately, quite a few final submissions felt like the machine had taken over. The prose was flawless, but the content did not merit the 15 pages I had to read. Arguments were not developed in detail and there was little that did not feel generic or uncommitted. 

In our last class, we had an open discussion about what the students thought the role of LLMs should be in future writing-intensive courses. Across the board, students appeared to appreciate the permission to use LLMs. The most common argument was that LLMs helped overcome what several students referred to as “writer’s block”: the tools helped them get started, generate ideas, and move the assignment along when they were stuck. A second consideration was that many felt this was a tool they had to learn how to use – it would inevitably become a staple ingredient of most future writing tasks. This point had been one of my main motivations for allowing LLMs in the first place. Too many of my friends and acquaintances in “real jobs” had reported that effective LLM usage was becoming a basic job requirement. Caltech undergrads feel the same.

Encouragingly (or perhaps depressingly?), one student noted how hard it is to detect the weaknesses and flaws in one’s own writing, but that the edits added in interacting with the LLM meant that one became something of a third-person reader of one’s own writing – the places where improvement was needed became much more apparent. (I know I am not the only humanities instructor who has always encouraged peer review in class. Students generally catch at least 90% of the places where their peer’s writing could be improved. But unless I enforced it, few students ever seemed to do it. I am a little dispirited that the machine may be the peer we are more willing to tolerate.)

Obviously, I also tried grading assignments with an LLM. As anyone who has graded a large number of papers knows, the same problems occur again and again, and it would be a huge relief to have the machine deal with those, so that one can focus on more specific feedback. It is fair to say that at least for now, this attempt at using an LLM was a massive failure. It was extremely difficult, even with quite detailed prompts, to get the LLM to systematically review the submissions, to take a stance, and to properly evaluate the work. The LLM was very good at saying something positive and something negative, but did not seem able to prioritize. I welcome ideas and advice on this front, because I would expect that I am not the only one out there who would be delighted to have a high-quality grading sidekick (we don’t have TAs in the humanities at Caltech, unfortunately). 

Over the winter break, as I was preparing course policies for my letter-graded courses the following term, I was left with many lessons but few solutions. It is clear that students’ writing can benefit enormously from the use of LLMs. Students are keen to have the option; it helps them get off the ground; and, when done well, the result is more than the sum of its parts. But it is a serious challenge to get all students onto this track of working with the LLM rather than letting the LLM do the work. While purely LLM-generated essays may still be easily detectable now, I doubt this will last long, and I do worry that in the near future essays will be bunched up near the grade ceiling, so that I will not be able to distinguish work that resulted from a constructive interaction with an LLM from work resulting from a 4am request to the LLM for an essay in response to my prompt. It may be tempting to take the output of a (sophisticated) LLM as a baseline and expect student submissions to be original and innovative beyond that. But I am doubtful this will work: it takes time and effort to learn and understand the material that the LLM has ingested, and we can’t expect a first-year student to just start from that level.

I do not expect to ban LLMs from my courses. Quite apart from not being able to detect violations, it will be extraordinarily hard for a student to fully avoid using them: just about every search engine either already has or soon will have an LLM behind it, and grammar-improvement software, which is highly useful, is also based on a form of LLM. But perhaps most importantly, like sex ed, I think LLM usage should not be learned on the street.

Obviously, for small seminar-style courses, in-class participation — an important skill to develop in its own right — provides a useful tool to gauge student understanding of the material. But if this is to play an important role in student assessment, it now has to take on a much more systematic form. For large courses, significant in-class participation is a non-starter, and so I do think there is a place again for proctored exams. In combination with low-stakes homework (where the practice for the exam outweighs the outsourcing to an LLM), proctored exams provide a useful assessment of competency, albeit under time pressure. But exams come with their own well-known problems. Quite apart from incentivizing cramming, exams don’t provide a useful basis to assess careful reasoning.

The development of a clear presentation of an argument takes time and often many revisions. But if the clear presentation of a well-developed argument is the goal, then LLM use is perhaps permissible again — and the process of successfully integrating an LLM into this revision process is one we ought to teach, and assess. The challenge here is that for many teaching topics, a completely crisp account has already been ingested into the LLMs, ready to be regurgitated with minor variations for the class assignment. It is a losing and senseless battle to invent ever more obscure topics to outmaneuver the LLMs. So, at this point I am inclined to bite the bullet: if (big if, we’re not there yet!) the LLMs are really so good at presenting the topics, then we should use them for that purpose. It will no longer be our students’ future job to develop the clear presentation themselves. And to ensure that our students can develop well-justified arguments and engage in sound reasoning, I think it will be a matter of teaching the abstract skills of logic, probabilistic reasoning, and evidence collection, with a sensitivity to all their limitations. Going forward, I will be reducing the weight of homework essays and focusing those grades on the clear presentation of arguments, while I am likely to switch my assessment of content understanding from homework essays to exams and in-class participation (where possible). I am not optimistic that the humanities model of 2-3 essays (even with feedback-based revisions) will continue to serve either as a basis for learning to write well or as a means to assess students’ understanding of content.

As an institution, we will need to provide subscriptions to high-quality LLMs to all students and employees, and while my attitude towards privacy and data ownership often marks me as distinctly European, or at least as older, I was glad to see that these concerns were part of Caltech’s overarching LLM policy. Navigating this space, which spans everything from sensitive breakthrough scientific discoveries to first-year student papers, will require a very delicate balance.

Finally, I want to leave you with a student recommendation for prompt engineering that gave me pause. The student reported that in the last weeks of the course, they finally figured out how to generate good text with an LLM. I paraphrase: “Professor, I uploaded all your published papers to the LLM, as well as a few others from the Humanities faculty. I then uploaded your prompt and asked the LLM to generate an answer in the style of those papers. Then it finally produced useful text.” Needless to say, other students kicked themselves for not having had that idea. But the student in question left it ambiguous whether they thought the generated text itself was any good or whether it was just my feedback on their submission that was more positive. Well, what was I assessing? Whether the student could write like me? Whether the student was following the writing standards among Caltech philosophy professors? Whether the quality of the argument they provided was any good? I’d like to think it was the last of these, and if (another big if) that was the case, then maybe something was gained by this approach, but the challenges are obvious.

 

[In line with Caltech policy, I would like to note that LLMs may have been used in the generation and revision of this text. Nevertheless, the text is entirely my responsibility.]

 

Frederick Eberhardt is Professor of Philosophy at Caltech and Co-director of Caltech’s Center for Science, Society, and Public Policy. He was the instructor for the “Ethics & AI” course for first-year students in the fall quarter of 2023, which tested the very permissive LLM policy below.

 

 

Course Policy on LLMs for the “Ethics & AI” course, Fall 2023 (adapted from a sample policy provided by Caltech’s Hixon Writing Center)

Classes in the Humanities and Social Sciences abide by the division-wide policy, which states that “students submitting work for HSS courses may use generative AI tools only in ways that are explicitly allowed by the course instructor in the course materials.”

In a course on “Ethics & AI,” it is essential that we engage with these tools and explore their potential and shortcomings. This is inevitably an exploration, testing how we can integrate generative AI into our workflow while recognizing the risk to our individual development as thinkers and writers. In this spirit of experimentation, I will allow you to use generative AI in any way you wish in this course. If you find a tool that you think will help you accomplish an assigned task in a more effective, efficient, or compelling manner, you can use it. You may incorporate generative AI outputs into your work without documenting their use within the text itself. You are fully responsible for the correctness and appropriateness of any language or images created by a generative AI tool that you choose to use. However, I require that you complete a “Generative AI Memo” for all graded assignments. This memo asks you to tell me which tools you used, how, and why. If you decide not to use these tools, it asks you to briefly explain why not. Your exploration in this class will help me decide whether and how to allow students to use these tools in future versions of this course.

A further note: As we are testing how we might integrate generative AI tools into our workflow, we need to be keenly aware of the potential risks. It is by no means obvious that it makes sense to use these tools in an introductory class, and perhaps especially not in one that is supposed to teach writing. So here is my thinking: We teach writing to develop our reasoning skills and to learn how to communicate effectively. There is also the aspect of developing our own voice through writing (though that has not been a focus in my courses). In that sense, learning to write is developing a means to achieve broader goals; writing (at least in my teaching and research) has never been an end in itself. So I consider it an open question whether these new tools can still help us achieve these broader goals, possibly much more easily. By learning how to use these tools well, it may be possible to develop excellent reasoning and communication skills without becoming a good writer oneself. (I may also be wrong about that.) A second, more practical, reason for using these tools in class is that I expect “generative AI literacy” to become a basic skill requirement. That is, similar to typing, I expect that knowing how to engineer prompts for a generative AI tool will become a default expectation in many jobs. Therefore we all, myself included, need to get up to speed.