Extracting spells in Harry Potter
How many spells are cast by the characters in the seven Harry Potter books? That’s what master’s students of Data Science for Decision Making Moritz Haine and Markus Dienstknecht wanted to find out in celebration of Harry Potter’s 20th anniversary. To answer their question, they used text mining, an information retrieval technique common in computer science.
The project was part of the course Information Retrieval and Text Mining, taught by Prof. Jan Scholtes, who holds a special Chair in Text Mining at the Department of Data Science and Knowledge Engineering. “Every year, we have a mandatory practical assignment where students have to apply text mining algorithms on a text of their own choice. The objective is to extract patterns from a text and find answers to the questions ‘who’, ‘when’, ‘where’, ‘what’, ‘why’, ‘how’, ‘how much’ or ‘by which means’. This also includes detection and extraction of more abstract notions such as emotions, sentiments or concepts”, Prof. Scholtes explains.
Moritz and Markus were inspired by earlier course projects on very popular fantasy literature. Moritz: “Students in previous years looked at, for example, Lord of the Rings, Star Wars and Game of Thrones. However, to our surprise, Harry Potter was missing. Since the books are about magic, we decided it would be interesting to identify all of the spells and the wizards that cast the most spells.”
Expelliarmus: ‘Release your wand’
Within only 25 minutes, they extracted 41 different wizards, 64 different spells and 253 cast spells in total, using a computer with a Core i7 processor and 16 GB of RAM. Moritz: “As you might expect, Harry Potter himself used the most spells throughout the whole book series, 108 in total. He used Stupefy, Expelliarmus and Accio most often – 11 times each. This makes sense, since the most frequent spell is in fact Expelliarmus, used to disarm one’s opponent.”
The reason why a character casts a certain spell is not something the students could extract from the data, but it can often be explained by the meaning of the spell itself. Stupefy, the stunning spell, knocks a victim unconscious and stops objects from moving, while Accio is used to summon objects towards the caster. Moritz points out that they focused only on spoken spells, while the most powerful wizards can also cast spells without naming them. The students suspect this is why Dumbledore (headmaster of the wizarding school Hogwarts) and Harry Potter’s archenemy Voldemort do not rank as high as Harry. At the end of their project, Moritz and Markus constructed a complete spell-character mapping and presented their results to the other students, which to Prof. Scholtes is “always the highlight of the course”.
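As a rough illustration of how spoken spells might be tied to the characters who cast them, the Python sketch below scans quoted dialogue for known spell names and attributes each one to the character named in the adjacent narration. The article does not describe the students' actual method, so this is only one possible approach: the spell list, the character list, the sample text and the function names are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical lexicons for illustration; the students' real lists
# (64 spells, 41 wizards) are not given in the article.
SPELLS = {"Expelliarmus", "Stupefy", "Accio"}
CHARACTERS = {"Harry", "Hermione", "Ron"}

QUOTE = re.compile(r'[“"]([^“”"]*)[”"]')  # text between quotation marks
WORD = re.compile(r"[A-Za-z]+")
NAME = re.compile(r"\b(?:" + "|".join(sorted(CHARACTERS)) + r")\b")


def find_speaker(after, before):
    """Guess the speaker from the narration around a quote."""
    # Prefer the clause right after the quote: “...” shouted Harry.
    first_clause = re.split(r"[.!?]", after, maxsplit=1)[0]
    match = NAME.search(first_clause)
    if match:
        return match.group()
    # Otherwise use the clause leading up to the quote: Hermione cried, “...”
    last_clause = re.split(r"[.!?]", before)[-1]
    names = NAME.findall(last_clause)
    return names[-1] if names else None


def map_spells_to_characters(text):
    """Return {character: Counter({spell: count})} for spoken spells only."""
    quotes = list(QUOTE.finditer(text))
    mapping = defaultdict(Counter)
    for i, quote in enumerate(quotes):
        spells = [w for w in WORD.findall(quote.group(1)) if w in SPELLS]
        if not spells:
            continue
        # Narration between the previous quote and this one, and between
        # this quote and the next one.
        prev_end = quotes[i - 1].end() if i > 0 else 0
        next_start = quotes[i + 1].start() if i + 1 < len(quotes) else len(text)
        speaker = find_speaker(text[quote.end():next_start],
                               text[prev_end:quote.start()])
        if speaker:
            mapping[speaker].update(spells)
    return mapping


# Invented sample text, not a passage from the books.
sample = (
    "“Expelliarmus!” shouted Harry. "
    "Hermione raised her wand and cried, “Stupefy!” "
    "“Accio Firebolt!” Harry yelled. "
    "“Expelliarmus!” said Harry again."
)

result = map_spells_to_characters(sample)
for character, counts in sorted(result.items()):
    print(character, dict(counts))
```

A heuristic like this only sees spells that are spoken aloud and attributed to a named speaker, which matches the limitation Moritz describes: spells cast without being named leave nothing in the dialogue to extract.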
Scholtes finds that working on appealing topics is a good way to engage students and make them enthusiastic about text mining. He therefore shows students the most creative projects from previous years to stimulate their creativity, and challenges them to do better. “At the end of the course, many of my students tell me that it was one of the most interesting and fun projects they’ve worked on. Some even decide to graduate or do their internship on a text mining topic.”
What is text mining?
Text mining generally refers to the process of extracting interesting and non-trivial information and knowledge from unstructured text. It draws on several computer science disciplines with a strong orientation towards artificial intelligence, including pattern recognition, neural networks, natural language processing, information retrieval and machine learning. An important difference from standard information retrieval is that retrieval techniques require users to know what they are looking for, while text mining attempts to discover patterns that are not known beforehand. This is very relevant, for example, in criminal investigations, legal discovery, (business) intelligence, clinical research or due diligence investigations.
By Dunja Bajic