13 December 2024
Looking back on the Data Science Accelerator
From April to July this year an Assistant Statistician at the Cabinet Office and a Research Software Engineer at the British Library teamed up as mentee (Catherine Macfarlane, CO) and mentor (Harry Lloyd, BL) for the Data Science Accelerator. In this blog post we reflect on the experience and what it meant for us and our work.
Introduction to the Accelerator
Harry: The Accelerator has been around since 2015, set up as a platform to ‘accelerate’ civil servants at the start of their data science journey who have a project with a clear business need and a real willingness to learn. Successful applicants are paired with mentors from across the Civil Service who have experience in techniques applicable to the problem, working together for one protected day a week over 12 weeks. I was lucky enough to be a mentee in 2020, working on statistical methods to combine different types of water quality data, and my mentor Charlie taught me a lot of what I know. The programme played a huge role in the development of my career, so it was a rewarding moment to come back as a mentor for the April cohort.
Catherine: On joining the Civil Service in 2023, I had the pleasure of becoming part of a talented data team that has motivated me to continually develop my skills. My academic background in Mathematics with Finance provides me with a strong theoretical foundation, but I am striving to improve my practical abilities. I am particularly interested in Artificial Intelligence, which is gaining increasing recognition across government, sparking discussions on its potential to improve efficiency.
I saw the Data Science Accelerator as an opportunity to deepen my knowledge, address a specific business need, and share insights with my team. The prospect of working with a mentor and immersing myself in an environment where diverse projects are undertaken was particularly appealing. A significant advantage was the protected time this project offered - a rare benefit! I was grateful to be accepted and paired with Harry, an experienced mentor who had already completed the programme. Following our first meeting, I felt ready to tackle the upcoming 12 weeks to see what we could achieve!
The Project
Catherine: Our team is interested in the annual reports and accounts of Arm’s Length Bodies (ALBs), a category of public bodies funded to deliver a public or government service. The project addressed the challenge my team faces in extracting the highly unstructured information stored in these documents. With this information we would be able to enhance the data validation process and reduce the burden that commissioning data from ALBs places on other teams. We proposed using Natural Language Processing to retrieve this information, analysing and querying it using a Large Language Model (LLM).
Initially, I concentrated on extracting five features, such as full-time equivalent staff in the organisation, from a sample of ALBs across 13 departments for the financial year 2022/23. After discussions with Harry, we decided to use Retrieval-Augmented Generation (RAG) to develop a question-answering system. RAG is a technique that combines LLMs with relevant external documents to improve the accuracy and reliability of the output: documents relevant to the question are retrieved first, and the LLM is then asked to generate an answer based on the retrieved material. We carefully selected a pre-trained LLM while considering ethical factors like model openness.
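For readers curious what this looks like in practice, below is a minimal sketch of the retrieve-then-generate pattern in Python. It is illustrative rather than the exact pipeline we built: it assumes the sentence-transformers library for embeddings and scikit-learn for similarity scoring, the model name and toy report snippets are placeholders, and the final call to a generative LLM is left as a comment.

```python
# Minimal RAG sketch (illustrative): embed report chunks, retrieve the most
# relevant ones for a question, and build a grounded prompt for an LLM.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Toy chunks standing in for text extracted from an annual report and accounts
chunks = [
    "The organisation employed 1,250 full-time equivalent staff in 2022-23.",
    "Total expenditure for the financial year was £310 million.",
    "The board met six times during the reporting period.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small open embedding model
chunk_vectors = embedder.encode(chunks)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k chunks most semantically similar to the question."""
    question_vector = embedder.encode([question])
    scores = cosine_similarity(question_vector, chunk_vectors)[0]
    top_indices = scores.argsort()[::-1][:k]
    return [chunks[i] for i in top_indices]

def build_prompt(question: str) -> str:
    """Assemble the retrieved context into a prompt that grounds the answer."""
    context = "\n".join(retrieve(question))
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# The prompt would then be passed to the chosen pre-trained LLM to generate an answer.
print(build_prompt("How many full-time equivalent staff does the organisation have?"))
```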
The first four weeks focused on exploratory analysis, data processing, and labelling, all completed in R; this groundwork was essential for preparing the data for input into the language model. The subsequent stages involved model building and evaluation in Python, which required the most time and focus. This was my first time using Python, and Harry’s guidance was extremely beneficial during our pair programming sessions. A definite highlight for me was seeing the pipeline start to generate answers!
To bring all our results together, I created a dashboard in Shiny, ensuring it was accessible to both technical and non-technical audiences. The final stage involved summarising all our hard work from the past 12 weeks in a 10-minute presentation and delivering it to the Data Science Accelerator cohort.
Harry: Catherine’s was the best-planned project of the ones I reviewed, and I suspected she’d be well placed to make the best use of the 12 weeks. I wasn’t wrong! We covered many of the steps involved in good reproducible analysis. The exploratory work gave us a great sense of the variance in the data; setting up quantitative benchmarks for the language model results drove our development of the RAG system; and I was so impressed that Catherine managed to fit in building a dashboard on top of all of that.
Our Reflections
Catherine: Overall this experience was fantastic. In a short amount of time, we managed to achieve a considerable amount. It was amazing to develop my skills and grow in confidence. Harry was an excellent mentor; he encouraged discussion and asked insightful questions, which made our sessions both productive and enjoyable. A notable highlight was visiting the British Library! It was brilliant to have an in-person session with Harry and meet the Digital Research team.
A key success of the project was meeting the objectives we set out to achieve. Patience was crucial, especially when investigating errors and identifying their root cause. The main challenge was managing a large project that could be taken in multiple directions. It is natural to spend a long time on one area, such as exploratory analysis, but we made sure to complete the key elements that allowed us to move on to the next stage. This balance was essential for the project's overall success.
Harry: We divided our days between time for Catherine to work solo and pair programming. Catherine is a really keen learner, and I think this approach helped her drive the project forward while giving us space to cover foundational programming topics and a new programming language. My other role was keeping an eye on the project timeline. Giving the occasional steer on when to stick with something and when to move on helped (I hope!) Catherine to achieve a huge amount in three months.
Ongoing Work
Catherine: Our team recognises the importance of continuing this work. I have developed an updated project roadmap, which includes using Amazon Web Services to improve the speed and memory capacity of our pipeline. I also plan to compare various large language models, again considering ethical factors, and to collaborate with other government analysts working on similar projects. I am committed to advancing this project, further upskilling the team, and keeping Harry updated on our progress.
Harry: RAG, and the semantic rather than keyword search that underlies it, represents a maturation of LLM technology that has the potential to change the way users search our collections. Anticipating that this will be a feature of future library services platforms, we have a responsibility to understand more about how these technologies will work with our collections at scale. We’re currently carrying out experiments with RAG and the linked data of the British National Bibliography to understand how searching like this will change the way users interact with our data.
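As a toy illustration of that difference (with an assumed embedding model and made-up catalogue-style records, not the BNB data itself), a keyword search misses a relevant record that shares no words with the query, while an embedding-based semantic search still ranks it first:

```python
# Toy contrast between keyword matching and semantic (embedding) search.
# The model name and the catalogue-style records are illustrative only.
from sentence_transformers import SentenceTransformer, util

records = [
    "A history of steam locomotives on Britain's railways",
    "Field guide to the wildflowers of the Scottish Highlands",
]
query = "books about trains"

# Keyword search: neither record contains the word "trains", so nothing matches.
print([r for r in records if "trains" in r.lower()])  # []

# Semantic search: the embedding model still ranks the locomotive record highest.
model = SentenceTransformer("all-MiniLM-L6-v2")
scores = util.cos_sim(model.encode([query]), model.encode(records))[0]
print(records[int(scores.argmax())])
```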
Conclusions
Disappointingly, the Office for National Statistics wound down the Data Science Accelerator at the end of the latest cohort, citing budget pressures. That makes us one of the last mentor/mentee pairings to benefit from the scheme, which we’re both incredibly grateful for and deeply saddened by. The experience has been a great one, and we’ve each learned a lot from it. We’ll continue to develop RAG at the Cabinet Office and the British Library, and we hope to advocate for and support schemes like the Accelerator in the future!