
Automated Essay Scoring Using Python: A Robustness Testing Toolkit for AES Systems Against Adversarial Attacks



Essay scoring: Automated Essay Scoring (AES) is the task of assigning a score to an essay, usually in the context of assessing a language learner's writing ability. The quality of an essay is shaped by four primary dimensions: topic relevance, organization and coherence, word usage and sentence complexity, and grammar and mechanics.


The earliest AES system dates back to 1966, when Ellis Page developed Project Essay Grade, the first computerized essay scoring system. The computers of the time were extremely expensive, and there was not much advancement in the field until the 1990s, when more systems were developed. In all of these systems, the goal is to improve the efficacy of written assessment and reduce human effort. An AES system itself is evaluated on whether it is fair, valid, and reliable: it is considered successful if it does not disproportionately penalize any group of people, if it measures what it sets out to measure, if it repeatedly gives the same essay a consistent score, and if students can use its feedback to improve their writing.




[Image: automated essay scoring using Python]



Critics of AES argue that computer scoring focuses largely on surface elements and that components such as creativity and originality of ideas are not adequately assessed. This is especially important for students writing high-stakes essays (ones in which the outcome is of great importance for the test-taker). Additionally, if students get to know which features are being evaluated, they may end up writing to the test: a test-taker might compose a lengthy essay using big words and complex sentences, knowing the computer algorithm is set to look for those elements. Others worry that writers will lose motivation to write if they know a machine will evaluate them. Written communication assumes a relationship between the reader and the writer; without a human reader, the writer may not see the purpose of writing. This concern is particularly acute in a small classroom setting, where the teacher-student relationship is important in written communication. Still others raise concerns about the quality of automated scorers: occasionally, AES misses errors or provides bad feedback, unable to compete with the discerning eye of an expert human evaluator. However, having a human evaluator double-check the scores and feedback generated by a machine seems to mitigate these worries.


One cannot discount the advantageous acceleration of feedback and reduction of workload for teachers with the use of AES. Soon, students may come to expect automated feedback and scoring of their writing in all their courses, complaining to friends about old-fashioned teachers who make them wait unnecessarily for their scores.


Probably, there will also be teachers who overuse automated essay scoring systems, leaving students wondering who their audience is if only a machine ends up reading their essays. But when taxpayers start calling for lower taxes, governments may force colleges to reduce costs by increasing class sizes to the point that teachers feel they must use AES to manage their workload. An older generation will, no doubt, get nostalgic for handwritten comments in red ink and complain that youngsters have willfully dehumanized education and the writing process.


A second, highly relevant study, using the ASAP datasets, investigated how transfer learning could alleviate the need for large prompt-specific training datasets (Cummins et al. 2016). The proposed AES model consisted of both an essay rank prediction model and a holistic score prediction model. The ranking model was trained on pairs of essays, generating a difference vector between their representations and predicting which of the two essays was of higher quality. Subsequently, a simple linear regression modeled the holistic scores from the ranking output. This process reduced the data requirements of the AES system, and the proposed approach proved competitive.
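
A rough sketch of this pairwise idea is shown below. It is not the authors' actual implementation: the feature vectors, scores, and the way the ranking signal is mapped back to holistic scores are placeholders chosen only to illustrate the difference-vector approach.

```python
# Hypothetical sketch of pairwise ranking + linear regression for AES.
# Features, scores, and the score-mapping step are placeholders, not the
# Cummins et al. pipeline.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))        # placeholder essay feature vectors
y = rng.integers(0, 13, size=200)     # placeholder holistic scores (e.g. 0-12)

# Build difference vectors for essay pairs and a label saying which essay scored higher.
pairs = [(i, j) for i, j in combinations(range(len(y)), 2) if y[i] != y[j]]
X_diff = np.array([X[i] - X[j] for i, j in pairs])
y_diff = np.array([1 if y[i] > y[j] else 0 for i, j in pairs])

ranker = LogisticRegression(max_iter=1000).fit(X_diff, y_diff)

# Use the ranker's learned weights as a scalar "quality" signal per essay,
# then fit a simple linear regression from that signal to the holistic scores.
quality = X @ ranker.coef_.ravel()
scorer = LinearRegression().fit(quality.reshape(-1, 1), y)
pred_scores = scorer.predict(quality.reshape(-1, 1))
```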


For analysis purposes, these macrofeatures are more efficient than individual microfeatures. The literature shows that automated assessment of spelling accuracy correlated more strongly with human judgments of essay quality than grammatical accuracy did, possibly because mechanical errors interfere more directly with meaning and because grammatical errors were only weakly associated with writing quality (correlations below 0.15) (Crossley et al. 2019a).
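
As a minimal sketch of how such a relationship can be checked, the snippet below correlates a misspelling rate with human scores on toy data; the pyspellchecker library, the naive tokenizer, and the example essays are illustrative assumptions, not the tooling of the cited study.

```python
# Illustrative check of how spelling accuracy relates to human scores.
# pyspellchecker, the tokenizer, and the toy data are assumptions.
import re
from scipy.stats import pearsonr
from spellchecker import SpellChecker

essays = [
    "Ths essay is shrt and has sevral erors.",
    "This esay talks about the topic but mispells words.",
    "This essay is longer and mostly written carefully.",
    "This essay is well organized and carefully proofread.",
]
human_scores = [2, 3, 4, 5]  # placeholder holistic scores

spell = SpellChecker()

def misspelling_rate(text):
    words = re.findall(r"[a-zA-Z']+", text.lower())
    if not words:
        return 0.0
    return len(spell.unknown(words)) / len(words)

rates = [misspelling_rate(e) for e in essays]
r, p = pearsonr(rates, human_scores)
print(f"Pearson r between misspelling rate and score: {r:.2f} (p={p:.3f})")
```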


This study investigated both the feasibility and the benefits of applying automated essay scoring at the rubric level. Rubric scores provide high-level formative feedback that is useful to both student writers and teachers. Most of the literature in this domain focuses almost exclusively on predicting holistic scores. This article goes one step further by analyzing the performance of deep and shallow learning on rubric score prediction and by investigating the most important writing indices that determine those rubric scores.


To the best of our knowledge, only one study attempted to predict rubric scores using D7 (Jankowska et al. 2018), only one study investigated rubric score prediction on D8 (Zupanc and Bosnić 2017), and very few AES systems in general predict essay scores at the rubric level (Kumar et al. 2017). Zupanc and Bosnić (2017) reported an agreement level (quadratic weighted kappa, QWK) of 0.70 on the Organization rubric (D8). Their feature-based AES model included 29 coherence metrics, which contributed substantially to the observed performance (these coherence metrics alone achieved a QWK of 0.60).
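
For reference, the QWK agreement statistic cited above can be computed with scikit-learn's quadratically weighted Cohen's kappa; the human and predicted scores below are made up purely for illustration.

```python
# Quadratic weighted kappa (QWK) between human and predicted scores.
# The score lists are illustrative only.
from sklearn.metrics import cohen_kappa_score

human = [3, 4, 2, 5, 4, 3, 1, 4]
predicted = [3, 4, 3, 5, 3, 3, 2, 4]

qwk = cohen_kappa_score(human, predicted, weights="quadratic")
print(f"QWK = {qwk:.2f}")
```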


Automated scoring of written and spoken test responses is a growing field in educational natural language processing. Automated scoring engines employ machine learning models to predict scores for such responses based on features extracted from the text/audio of these responses. Examples of automated scoring engines include Project Essay Grade for written responses and SpeechRater for spoken responses.


Rater Scoring Modeling Tool (RSMTool) is a Python package that automates and combines into a single pipeline multiple analyses commonly conducted when building and evaluating such scoring models. The output of RSMTool is a comprehensive, customizable HTML statistical report containing the results of these analyses. While RSMTool makes it simple to run a set of standard analyses with a single command, it is also fully customizable and allows users to easily exclude unneeded analyses, modify the default analyses, and even include custom analyses in the report.
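
As a rough sketch of how an RSMTool experiment might be launched from Python: the configuration keys and the run_experiment entry point below follow RSMTool's documented pattern, but the exact names and signatures should be verified against the installed version, and the file paths are placeholders.

```python
# Minimal sketch of an RSMTool experiment run from Python.
# Configuration keys and the run_experiment signature reflect RSMTool's
# documented pattern and should be checked against the installed version.
from rsmtool import run_experiment

config = {
    "experiment_id": "toy_aes_experiment",
    "model": "LinearRegression",
    "train_file": "train_features.csv",   # features + human scores for training
    "test_file": "test_features.csv",     # held-out essays for evaluation
    "train_label_column": "human_score",
    "test_label_column": "human_score",
    "id_column": "essay_id",
    "description": "Toy experiment: linear regression scoring model.",
}

# Produces the HTML report and intermediate files under the given output directory.
run_experiment(config, "rsmtool_output")
```

Equivalently, the same configuration saved as a JSON file can be passed to the rsmtool command-line tool together with an output directory.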


We expect the primary users of RSMTool to be researchers working on developing new automated scoring engines or on improving existing ones. Note that RSMTool is not a scoring engine by itself but rather a tool for building and evaluating machine learning models that may be used in such engines.


With essays comes the need for personnel qualified enough to grade them appropriately and rank them on the basis of various testing criteria. Our project aims to automate this grading process with the aid of deep learning, in particular Long Short-Term Memory (LSTM) networks, a special kind of recurrent neural network (RNN).


Automated Essay Scoring (AES) allows the instructor to assign scores easily to the participants with a pre-trained deep learning model. This model is trained so that the scores it assigns agree with the instructor's previous scoring patterns, which requires a dataset containing the scores the instructor has given previously. AES uses Natural Language Processing, a branch of artificial intelligence that enables the trained model to understand and interpret human language, to assess essays written in natural language.
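
A minimal Keras sketch of such a model is given below, assuming each essay has already been converted to a fixed-length sequence of word-vector embeddings; the dimensions, hyperparameters, and random training data are illustrative only.

```python
# Minimal Keras sketch: an LSTM regressor that predicts an essay score from
# a sequence of word embeddings. Dimensions and hyperparameters are illustrative.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dropout, Dense

MAX_LEN, EMB_DIM = 300, 100           # tokens per essay, embedding size (assumed)

model = Sequential([
    Input(shape=(MAX_LEN, EMB_DIM)),
    LSTM(128),
    Dropout(0.4),
    Dense(1, activation="relu"),      # a single non-negative score
])
model.compile(loss="mse", optimizer="adam")

# Placeholder training data: essay embeddings and the instructor's past scores.
X_train = np.random.rand(64, MAX_LEN, EMB_DIM).astype("float32")
y_train = np.random.randint(0, 13, size=(64,)).astype("float32")
model.fit(X_train, y_train, epochs=2, batch_size=16, verbose=0)
```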


Given the growing number of candidates applying for standardized tests every year, finding a proportionate number of personnel to grade the essay component of these tests is an arduous task. These personnel must be skilled, capable of analyzing essays, scoring them according to the requirements of the institution, and able to discern between the good and the excellent.


The challenge was to create a web application that takes in an essay and predicts a score. We need to train a neural network model to predict the score of the essay in agreement with the human rater. The model is built using an LSTM.


This application makes use of Natural Language Processing to perform operations on the textual input and an LSTM to train a model that grades essays. The application also uses the Word2Vec embedding technique to convert each essay into a vector so that the model can be trained. This setup addresses the issue of time constraints: automated grading takes place within seconds, whereas manual grading requires minutes per essay. The net amount of time saved over a period of consistent use of the application is vast, and the cost of maintaining human graders is also reduced.
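
A hedged sketch of the Word2Vec step is shown below, assuming gensim and a mean-pooling strategy that averages an essay's word vectors into a single feature vector before it is passed to the trained scoring model; the corpus, dimensions, and the scoring_model reference are assumptions.

```python
# Sketch: convert an essay to a single vector by averaging its Word2Vec embeddings.
# Training Word2Vec on the essay corpus itself and mean-pooling are assumptions.
import numpy as np
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

corpus = ["This is one training essay.", "Here is another training essay."]
tokenized = [simple_preprocess(doc) for doc in corpus]

w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, workers=1)

def essay_vector(text, model):
    tokens = [t for t in simple_preprocess(text) if t in model.wv]
    if not tokens:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[t] for t in tokens], axis=0)

vec = essay_vector("A new essay to be graded.", w2v)
# `vec` (shape: (100,)) can then be reshaped and fed to the trained scoring model,
# e.g. score = scoring_model.predict(vec.reshape(1, 1, -1))  # hypothetical model
```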


The front end of the application was implemented using HTML, CSS, and Bootstrap. It provides the option for users to choose from a set of prompts and write an essay accordingly or to grade their own custom essay.


We generated back-translated essays using Google Translate and adjusted the corresponding scores in several ways. We trained and validated the model with a doubled number of essay-score pairs and tested it on the original data. Performance improved when using the augmented data.
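
A minimal sketch of the back-translation idea follows; the translate() helper is hypothetical and would need to be backed by an actual translation service, and reusing the original score for the augmented essay is a simplification of the score adjustments described above.

```python
# Sketch of back-translation data augmentation for essay-score pairs.
# `translate` is a hypothetical helper standing in for a real translation API.
def translate(text, source, target):
    """Hypothetical wrapper around a machine-translation service."""
    raise NotImplementedError("plug in a real translation backend here")

def back_translate(text, pivot="de"):
    # English -> pivot language -> English yields a paraphrased essay.
    return translate(translate(text, "en", pivot), pivot, "en")

def augment(pairs):
    # Keep the original pairs and add a back-translated copy of each essay.
    # Reusing the original score is a simplification; the study adjusted scores.
    augmented = list(pairs)
    for essay, score in pairs:
        augmented.append((back_translate(essay), score))
    return augmented  # roughly doubles the number of essay-score pairs
```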


The first AES system was created in 1966 and used a set of linguistic features to score essays [7]. Most recent works have used neural network models for AES. In 2016, Taghipour and Ng [8] designed a neural network model using a CNN and long short-term memory (LSTM) [9] and showed significant improvement over traditional methods that depend on manual feature engineering (Figure 1). This is the simplest and most representative model: it generates a representation of the input essay and obtains a score from it. The convolution layer extracts local features from the essay, and the recurrent layer generates a representation for the essay. In the mean-over-time layer, the sum of the outputs of the recurrent layer is divided by the essay length. Let h_1, ..., h_T be the outputs of the recurrent layer, where T is the essay length. Then the mean-over-time layer is defined as MoT(H) = (1/T) * (h_1 + h_2 + ... + h_T).
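
A hedged Keras sketch of this kind of architecture is given below; the vocabulary size, sequence length, and hyperparameters are illustrative and are not those of Taghipour and Ng.

```python
# Sketch of a CNN + LSTM essay-scoring model with a mean-over-time layer.
# Vocabulary size, sequence length, and hyperparameters are illustrative.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Embedding, Conv1D, LSTM,
                                     GlobalAveragePooling1D, Dense)

VOCAB, MAX_LEN = 4000, 500   # assumed vocabulary size and padded essay length

model = Sequential([
    Input(shape=(MAX_LEN,)),                         # padded token-id sequences
    Embedding(VOCAB, 50),                            # word embeddings
    Conv1D(filters=100, kernel_size=3, padding="same", activation="relu"),
    LSTM(100, return_sequences=True),                # keep all time steps h_1..h_T
    GlobalAveragePooling1D(),                        # mean over time: (1/T) * sum h_t
    Dense(1, activation="sigmoid"),                  # normalized score in [0, 1]
])
model.compile(loss="mse", optimizer="rmsprop")
model.summary()
```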

