Accessible Investigative Journalism: Navigating Canada’s Largest Corpus of Government Documents
When and Where
Speakers
Description
The Investigative Journalism Foundation (IJF) is dedicated to advancing public interest journalism through transparent, data-driven reporting. One of the IJF’s key resources is the “Open By Default” (OBD) dataset, Canada’s largest collection of government documents, comprising over 4.5 million pages of Access to Information and Privacy (ATIP) requests and the corresponding government documentation. This project aims to make the OBD dataset easier and more efficient to use, particularly for people without an investigative journalism background. Key improvements include better data capture using optical character recognition (OCR), improved search performance through Large Language Model (LLM) vectorization, and topic modelling to reveal the high-level subject matter represented in the dataset. The final stage of the project is a Retrieval Augmented Generation (RAG) LLM pipeline, which enables a chatbot to give tailored, context-rich responses to user queries, paired with suggested directions for follow-up research. Together, these changes make the dataset more accessible and easier to interpret, enabling more inclusive access for all Canadians.
About Kim Tran
Kim is an undergraduate computer science student and research fellow at the University of British Columbia’s Data Science Institute. She is passionate about interdisciplinary applications of technology and aims to develop software that creates positive social impact. As a research fellow this past summer, she worked on improving the accessibility of the IJF’s Open By Default dataset.
About Sana Shams
Sana is a research fellow through UBC’s Data Science for Social Good (DSSG) Program. She is currently pursuing a BSc in cognitive systems with a minor in data science and is passionate about the intersection of technological development and ethical compliance, particularly in the fields of data science and machine learning. As a DSSG Fellow, Sana is committed to developing equitable and responsible data-driven solutions.
About Waris Bhatia
Waris is a Data Science Intern with the University of British Columbia’s (UBC) Data Science Institute. He is a senior undergraduate studying Computer Science at UBC. At the IJF, he is working on making government documents more accessible to the general public. He has previous academic and industry experience in life sciences research. His interests lie in natural language processing, mathematical game theory, and computer vision.