Capstone: Making Reading Accessible and Motivating

Megan Sorel
3 min read · Jun 10, 2021

When I started thinking about a project for my capstone, I knew I loved the idea of exploring language and how accessible it is to readers, an interest that likely comes from my experience as an ESL teacher.

As I was sitting in class, my instructor briefly pulled up Kaggle to show us a homework assignment. While he scrolled, something caught my eye: the CommonLit Readability Prize, "Rate the complexity of literary passages for grades 3–12 classroom use."

I was so interested that I went back to the competition to read its description:

Can machine learning identify the appropriate reading level of a passage of text, and help inspire learning? Reading is an essential skill for academic success. When students have access to engaging passages offering the right level of challenge, they naturally develop reading skills.

In this competition, you’ll build algorithms to rate the complexity of reading passages for grade 3–12 classroom use. To accomplish this, you’ll pair your machine learning skills with a dataset that includes readers from a wide variety of age groups and a large collection of texts taken from various domains. Winning models will be sure to incorporate text cohesion and semantics.

If successful, you’ll aid administrators, teachers, and students. Literacy curriculum developers and teachers who choose passages will be able to quickly and accurately evaluate works for their classrooms. Plus, these formulas will become more accessible for all. Perhaps most importantly, students will benefit from feedback on the complexity and readability of their work, making it far easier to improve essential reading skills.

Reading is an important way of spreading knowledge, and for any message, knowing your audience and meeting that audience where it is are key. I realized I would love to explore this by using my linguistic skills to build a model that could rank the passages.

It would also provide a good foundation for further uses. The approach of using grammatical and semantic features to rank text could be extended to ranking and sorting ESL and other target-language resources. It could tell language teachers which aspects of grammar or vocabulary to teach before students read a passage, or help teachers find articles that focus on a target grammatical structure. One could also combine these ideas with audio transcription to rank podcasts, or music lyrics, for beginner language learners.

So far, I have focused on brushing up on my understanding of syntax trees and syntactic concepts such as heavy noun phrases (NPs) and heavy NP shift. I am using displaCy to visualize word dependencies in the excerpts provided by the Kaggle competition.
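As a rough sketch of what this can look like in code, the snippet below parses a passage with spaCy and asks displaCy for the dependency diagram as SVG markup. The model name `en_core_web_sm` and the helper names `tree_depth` and `parse_and_render` are assumptions for illustration. The depth measure is just one possible way to turn a dependency tree into a number, not necessarily a feature the final model will use.

```python
def tree_depth(heads):
    """Return the longest path from any token up to its root.

    `heads[i]` is the index of token i's syntactic head. A root token is its
    own head, which mirrors spaCy's `token.head.i` convention.
    """
    def depth(i):
        steps = 0
        while heads[i] != i:  # climb until we reach a root
            i = heads[i]
            steps += 1
        return steps

    return max((depth(i) for i in range(len(heads))), default=0)


def parse_and_render(text, model_name="en_core_web_sm"):
    """Parse `text` and return (SVG markup of the dependencies, tree depth).

    spaCy is imported here so that `tree_depth` stays usable without it.
    The model name is an assumption; any installed English pipeline with a
    dependency parser should work.
    """
    import spacy
    from spacy import displacy

    nlp = spacy.load(model_name)
    doc = nlp(text)
    # jupyter=False makes displaCy return the markup instead of displaying it.
    svg = displacy.render(doc, style="dep", jupyter=False)
    return svg, tree_depth([token.head.i for token in doc])
```

For a short sentence such as "The cat sat", where "The" attaches to "cat" and "cat" attaches to the root "sat", the heads list is `[1, 2, 2]` and `tree_depth` returns 2. Longer, more nested sentences should tend to produce larger values.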

I have begun looking into ways to rank the words in the articles by frequency or age of acquisition. For frequency, there is the PyPI package wordfreq; for age of acquisition, there is the database created by Victor Kuperman, Hans Stadthagen-Gonzalez, and Marc Brysbaert in their study "Age-of-acquisition ratings for 30,000 English words."
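To make the frequency idea concrete, here is a minimal sketch that averages a per-word frequency score over a passage. It relies on `zipf_frequency` from wordfreq, which scores words on the Zipf scale (higher means more common, and 0.0 means the word is unknown). The tokenizer, the helper names `mean_zipf` and `wordfreq_lookup`, and the choice of a plain average are all assumptions made for illustration.

```python
import re
from statistics import mean


def tokenize(text):
    """Very rough tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def mean_zipf(text, zipf_lookup):
    """Average the score that `zipf_lookup` assigns to each word in `text`.

    `zipf_lookup` is any function mapping a word to a number, so the
    frequency source can be swapped out. Returns 0.0 for empty text.
    """
    words = tokenize(text)
    if not words:
        return 0.0
    return mean(zipf_lookup(word) for word in words)


def wordfreq_lookup(word):
    """Look up an English word's Zipf frequency with the wordfreq package."""
    # Imported lazily so mean_zipf can be used without wordfreq installed.
    from wordfreq import zipf_frequency

    return zipf_frequency(word, "en")
```

Calling `mean_zipf(excerpt, wordfreq_lookup)` on two excerpts gives one number per excerpt, and a lower average suggests rarer vocabulary. An age-of-acquisition lookup built from the Kuperman et al. ratings could be passed in the same way, although the direction flips there, since a higher rating means a word is learned later.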

I also plan to look into semantic assumptions and lexical density as features for the model. Once I have a working model for grade levels, I want to incorporate one of the ideas above, or build a way to summarize the grammatical features of a text, and apply it to ESL articles from the website News in Levels.
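Lexical density is commonly defined as the share of words in a text that are content words rather than function words. The sketch below computes it from part-of-speech tags, assuming Universal POS labels like the ones spaCy exposes as `token.pos_`. Which tags count as content words is a judgment call: here nouns, proper nouns, main verbs, adjectives, and adverbs count, while auxiliaries do not. The helper names and the model name `en_core_web_sm` are assumptions.

```python
# Universal POS tags treated as content words (an assumption, see above).
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}
# Tags that are not words at all and are left out of the total.
NON_WORD_POS = {"PUNCT", "SPACE", "SYM"}


def lexical_density(tagged_tokens):
    """Return content words / all words for a list of (token, pos) pairs."""
    words = [(tok, pos) for tok, pos in tagged_tokens if pos not in NON_WORD_POS]
    if not words:
        return 0.0
    content = sum(1 for _, pos in words if pos in CONTENT_POS)
    return content / len(words)


def lexical_density_of(text, model_name="en_core_web_sm"):
    """Tag `text` with spaCy and return its lexical density."""
    # Imported lazily so lexical_density works on pre-tagged input alone.
    import spacy

    nlp = spacy.load(model_name)
    return lexical_density([(token.text, token.pos_) for token in nlp(text)])
```

For "The cat sat." tagged as determiner, noun, verb, and punctuation, two of the three words are content words, so the density is 2/3.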

Megan Sorel

I am a Data Scientist with a background in linguistics and education. I use NLP to explore communication and how to make it more accessible and impactful!