Evo-Data: Tracking the evolution of data using historic datasets
Historic datasets offer a wealth of information that can help us understand how data changes and evolves over time, and we offer an easier way of making initial explorations of those often messy datasets.
Team members
Ana Hibert, Elaine Farrow, Alexia Revueltas, David McClure, Sarah Austin
Members roles and background
Ana Hibert - Doctoral student in Education, focusing on learning analytics for English as a second language. Has an MSc in Education Research, is a teacher in Higher Education and also has a degree in Literature.
Elaine Farrow - Doctoral student in Data Science, focusing on learning analytics. Has an MSc in Data Science and has taught in Higher Education.
Alexia Revueltas - Doctoral student in Learning Sciences, focusing on early years engagement with science learning. Has an MSc in Cognitive Neuropsychology, a degree in Psychology and has worked with facial expressions of emotions and teaching robotics to children.
David McClure - Software Engineer for a company that develops CAD software for engineers. Degree in Computer Sciences.
Sarah Austin - Doctoral student in Education, focusing on refugee teachers' identity and their professional development in refugee camps. Has an MSc in Education Leadership and degrees in School Teaching and Literature.
We propose an initial exploration approach, seen through the lens of a researcher, that offers insights into datasets and sets the stage for deeper explorations.
We offer a way of using natural language processing tools to get an overview of how concepts within the text relate to each other, and how those relationships change over time, to offer a first insight into the contents of these datasets. By creating a dashboard that displays semantic networks related to key queries, we can help researchers get a feel for the data and start thinking about which questions might be relevant to pursue.
Note: Our solution does not offer to clean up or label datasets, as this would depend on the particular needs of each research project.
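As a rough illustration of what a semantic network could be built from, the sketch below counts how often two words appear in the same sentence and treats each pair as a weighted edge. This is a minimal example under our own assumptions, not the project's implementation: the function name `cooccurrence_edges`, the tiny stop-word list and the sample text are all invented for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "has", "in", "to"}

def cooccurrence_edges(text):
    """Return a Counter mapping (word_a, word_b) pairs to the number of
    sentences in which both words appear. Pairs are alphabetically ordered."""
    edges = Counter()
    for sentence in re.split(r"[.!?]+", text):
        words = set(re.findall(r"[a-z]+", sentence.lower())) - STOP_WORDS
        for pair in combinations(sorted(words), 2):
            edges[pair] += 1
    return edges

# Invented sample text, not taken from the Encyclopaedia Britannica dataset.
sample = "The planet orbits the sun. The comet orbits the sun. The comet has a tail."
edges = cooccurrence_edges(sample)
print(edges.most_common(3))
```

Each counted pair can be read as an edge whose weight is the number of shared sentences, which is the kind of structure a dashboard could draw as a network.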
As an example case, in the dataset provided by the National Library of Scotland, the exploration begins with queries that can be related to specific Encyclopaedia entries, allowing us to form an idea of how the concepts presented within the Encyclopaedia Britannica have evolved over time. The type of analysis, however, can be tailored to different kinds of queries depending on the particular dataset that is being considered. For example:
- frequency analysis, which maps the words most frequently associated with certain concepts
- semantic networks, which help us understand how concepts within the datasets relate to each other
- sentiment analysis, which can give an overview of attitudes/biases found in the data
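To make the first of these concrete, here is a minimal sketch of a frequency analysis around a query term: it counts the words that occur within a few tokens of the query. The helper names (`tokenize`, `associated_words`), the window size, the stop-word list and the sample text are assumptions made for this example only.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "of", "a", "an", "and", "is", "in", "to", "by", "or", "it"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def associated_words(text, query, window=3):
    """Count non-stop-words found within `window` tokens of each `query` hit."""
    tokens = tokenize(text)
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != query:
            continue
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i and tokens[j] not in STOP_WORDS:
                counts[tokens[j]] += 1
    return counts

# Invented sample text, not taken from the Encyclopaedia Britannica dataset.
sample = ("The comet is a body of light. A comet moves in an orbit, "
          "and the comet has a tail of light.")
counts = associated_words(sample, "comet")
print(counts.most_common())
```

On real data the stop-word list and window size would need tuning, and a proper tokenizer would likely replace the regular expression.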
Historic datasets often come in the form of messy, raw text files that can include enormous amounts of text and are complicated to understand at first sight.
Researchers often have to work with unstructured text-based datasets, especially (but not only) in historical research. With more and more text-based resources being digitised into raw text files, there is a wealth of untapped information that might seem intimidating because of the sheer amount of data and the unstructured quality of the datasets, making this data inaccessible.
Solution target group
This solution is geared toward researchers that have to work with large quantities of raw text data, especially in historical contexts.
- This solution will make it easier for researchers to get an initial feel for the data contained within the text, allowing them to spend less time on the initial exploration and more on planning in-depth analysis of their datasets.
- This solution will also make historical datasets less intimidating and more accessible, especially to researchers who do not have programming skills or natural language processing knowledge.
Solution tweet text
Travel through text and time with text-analysis made easy
While there are many tools that allow for the easy exploration of data, our solution focuses on historical datasets from a researcher's perspective, allowing a quick and easy comparison between similar entries at different points in time, which can help researchers ask more informed questions.
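One very simple form such a comparison could take is a vocabulary diff between two versions of the same entry. The sketch below is only an illustration under our own assumptions: `vocabulary` and `compare_entries` are hypothetical helpers, and the two snippets are invented rather than quoted from any edition.

```python
import re

def vocabulary(text):
    """Return the set of lower-cased alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def compare_entries(earlier, later):
    """Report which words were dropped, added or kept between two versions."""
    old, new = vocabulary(earlier), vocabulary(later)
    return {
        "dropped": sorted(old - new),
        "added": sorted(new - old),
        "shared": sorted(old & new),
    }

# Invented snippets standing in for the same entry in two different editions.
entry_early = "Electricity is a fluid observed in amber"
entry_late = "Electricity is a current observed in wires"
diff = compare_entries(entry_early, entry_late)
print(diff["dropped"], diff["added"])
```

A dashboard would probably compare weighted associations rather than raw vocabularies, but the same earlier/later framing applies.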
This project is first of all designed to work with the Encyclopaedia Britannica datasets provided by the National Library of Scotland. However, this dataset is used as a proof of concept, since the natural language processing techniques can be applied to any dataset and the time-sensitive aspect of our solution could be used in any dataset that spans similar texts across long periods of time.
In the short term, we would implement our solution in order to help better understand the Encyclopaedia Britannica datasets provided by the National Library of Scotland. By creating an informational dashboard on the relationship between words and concepts within the database, we would help provide a unique insight into the dataset and further guidance for more specific in-depth exploration of the data.
In the medium term, we want to optimise the dashboard so that it can be applied to any unstructured text-based dataset that has a time component across different parts of it.
In the long term, we would like to add more functionality to the dashboard, allowing for more sophisticated exploration, analysis and tools that could help clean up the data in ways that benefit old datasets the most, for example cleaning up common OCR mistakes or dealing with characters that have fallen out of use and might confuse OCR systems.
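As a small example of this kind of cleanup, the long s (ſ) is one character that has fallen out of use. The sketch below, which is our own illustration and not a planned feature, only handles the easy case where the OCR output preserved the character itself; cases where it was misread as another letter would need dictionary-based correction and are not attempted here. The function name `normalise_long_s` is hypothetical.

```python
LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S (ſ)

def normalise_long_s(text):
    """Replace every long s with a modern lower-case s."""
    return text.replace(LONG_S, "s")

# Invented sample phrase using the long s.
cleaned = normalise_long_s("the Congre\u017fs of the United States")
print(cleaned)
```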
Solution team work
We were a multidisciplinary team (informatics, education, learning sciences and social studies) that worked well together, drawing on everyone's areas of expertise and discussing ideas from different perspectives. We managed to balance all opinions and created an atmosphere in which every member of the team was excited and willing to share their ideas, which led to very productive discussions about the best way to tackle the challenge.
* Climate-KIC publishes the proposed solutions developed during the DigiEduHack event solely for the purposes of facilitating public access to the information concerning ideas and shall not be liable regarding any intellectual property or other rights that might be claimed to pertain to the implementation or use any of the proposed solutions shared on its website neither does it represent that it has made any effort to identify any such rights. Climate-KIC cannot guarantee that the text of the proposed solution is an exact reproduction of the proposed solution. This database is general in character and where you want to use and develop a proposed solution further, this is permitted provided that you acknowledge the source and the team which worked on the solution by using the team’s name indicated on the website.