Data Science Part-Time capstone projects batch #10

by Ekaterina Butyugina

Students working on a project

We’re thrilled to celebrate the achievements of our latest graduates from Part-Time Data Science Batch #10, who have just wrapped up their Data Science journey with 3 remarkable, real-world projects. 

This round of final presentations showcased how data science and AI can drive tangible impact across industries, from transforming business development workflows to reinventing how new markets are discovered. 

Take a look at how our graduates are using data science to generate insights, push boundaries, and create real-world impact. 
 

AI Research Assistant: Navigating Academic Literature with Intelligent Topic Modeling 

Students: Helga Rabl, Heba Abu Emran, Ambrosio Acal, Victor Generaux 

Every day, the arXiv repository adds dozens of new academic papers to an already overwhelming corpus. For researchers trying to stay current, this presents a significant challenge: how do you efficiently discover relevant work and track research trends without spending hours manually reviewing abstracts? 

This capstone project tackles that problem by building an automated pipeline that monitors arXiv for specified categories, providing researchers with an intelligent, topic-aware discovery tool. 
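The monitoring step builds on arXiv's public Atom API. A minimal sketch of how a per-category query URL might be assembled; the categories and result limit below are illustrative, not the project's actual configuration:

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def build_arxiv_query(categories, max_results=100):
    """Build an arXiv API query URL for the given category codes,
    sorted newest-first by submission date."""
    search = " OR ".join(f"cat:{c}" for c in categories)
    params = {
        "search_query": search,
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"

url = build_arxiv_query(["quant-ph", "cond-mat.str-el"], max_results=50)
```

Polling this URL on a schedule and parsing the returned Atom feed yields the stream of new abstracts the rest of the pipeline consumes.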

The project's success hinged on making the right technical choices. Five topic modeling algorithms were evaluated: BERTopic (context-based), LDA (counting-based), Word2Vec (association-based), Top2Vec (density-based), and FasTopic (efficiency-focused). Equally critical was selecting the right embedding model. Domain-specific transformers dramatically outperform general-purpose models for scientific text. The team compared scispaCy (biomedical), PhysBERT (physics-specific), SciBERT (general science), BioBERT (biomedical), and PubMedBERT (medical jargon).

For an example use case provided by one of the project's stakeholders, the focus was on physics and quantum mechanics. All five algorithms were trained on PhysBERT embeddings using a 5,000-article training set, with human evaluation confirming the results. The conclusion was clear: BERTopic with PhysBERT provides the most balanced performance across interpretability, semantic coherence, and topic quality. 

A key design decision involved whether to process full papers or just abstracts. The team found that abstracts are often sufficient because they focus on key findings without noise from references, equations, and supplementary sections. Shorter text also means faster computation and lower memory requirements, enabling the system to scale efficiently. This pragmatic choice allowed processing 100,000 articles from condensed matter and quantum physics while maintaining topic quality. 

The complete pipeline orchestrates multiple components: fetching new articles by date from arXiv, using PhysBERT to convert abstracts into semantic representations, applying UMAP for dimensionality reduction, using HDBSCAN to cluster semantically similar papers, and employing c-TF-IDF to identify distinctive terms for each topic. The system maintains a topic model trained on 100,000 abstracts, enabling classification of new articles as they appear. 
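The final c-TF-IDF step can be illustrated in a few lines. The sketch below re-implements the class-based TF-IDF scoring that BERTopic uses, treating each cluster as a single "class"; the toy tokenized abstracts are invented for illustration:

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """Class-based TF-IDF: score terms per cluster.

    clusters: dict mapping cluster id -> list of token lists
    (all documents in the cluster, already tokenized).
    Returns dict: cluster id -> {term: score}.
    """
    # One bag of words per cluster: the whole cluster is one "class".
    class_counts = {cid: Counter(tok for doc in docs for tok in doc)
                    for cid, docs in clusters.items()}
    # Average number of tokens per class.
    avg_len = sum(sum(c.values()) for c in class_counts.values()) / len(class_counts)
    # Total frequency of each term across all classes.
    total = Counter()
    for c in class_counts.values():
        total.update(c)
    # Term frequency within the class, damped by cross-class frequency.
    return {
        cid: {t: tf * math.log(1 + avg_len / total[t]) for t, tf in c.items()}
        for cid, c in class_counts.items()
    }

clusters = {
    0: [["spin", "lattice", "spin"], ["lattice", "model"]],
    1: [["qubit", "gate"], ["qubit", "error", "gate"]],
}
scores = c_tf_idf(clusters)
```

Terms that are frequent inside a cluster but rare elsewhere score highest, which is what makes the extracted keywords distinctive per topic.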

The Streamlit dashboard provides several powerful features. LLM-generated summaries using GPT-3.5-turbo or TopicGPT create human-readable one-sentence descriptions for each topic cluster. Temporal tracking allows researchers to follow specific topics over time and identify emerging "hot topics." An expert network cross-references topics with author information to identify leading researchers in specific areas. Keyword search lets users see how terms distribute across topics and drill down into relevant papers. New articles are displayed sorted by topics with custom labels and direct PDF access. 
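The temporal-tracking feature reduces to counting classified articles per topic and month. A minimal sketch, with hypothetical dates and topic labels:

```python
from collections import Counter, defaultdict

def topic_trend(records):
    """Aggregate per-month article counts from classified records.

    records: iterable of (date "YYYY-MM-DD", topic_label) pairs.
    Returns dict: topic -> Counter of {month: article count},
    ready to plot as a trend line per topic.
    """
    trend = defaultdict(Counter)
    for date, topic in records:
        month = date[:7]          # "YYYY-MM"
        trend[topic][month] += 1
    return trend

records = [
    ("2024-01-05", "topological insulators"),
    ("2024-01-18", "topological insulators"),
    ("2024-02-02", "topological insulators"),
    ("2024-02-11", "quantum error correction"),
]
trend = topic_trend(records)
```

A topic whose monthly counts rise sharply relative to its baseline would surface as an emerging "hot topic" in the dashboard.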

This AI research assistant demonstrates how modern NLP techniques can transform literature discovery. Future enhancements include integration of BioBERT and PubMedBERT for cross-disciplinary research, enhanced temporal analytics to predict emerging research directions, expanded expert network analysis with citation graphs, and multi-modal support for processing figures and equations. As arXiv continues to grow, such intelligent discovery tools will become increasingly essential for researchers maintaining awareness in rapidly evolving scientific domains. 
 

Regulatory Inquiries Chatbot: Accelerating Compliance Through AI-Powered Document Intelligence 

Student: Alexander Arm 


In the wake of the 2008 financial crisis, regulatory oversight of the banking sector intensified dramatically worldwide. For multinational banks operating across multiple jurisdictions, this creates a significant challenge: responding efficiently and accurately to routine regulatory inquiries while maintaining consistency and compliance. 

The problem is multifaceted. The banking sector faces increased regulatory strictness post-2008, with multinational banks needing to interact with authorities across numerous jurisdictions. Responding to regulatory inquiries involves repetitive, manual effort that can take days or weeks, slowing response times. Information is often fragmented across individual teams, leading to delays, inefficiencies, and potential inconsistencies. Without systematic analysis, banks struggle to proactively address emerging regulatory concerns. 

The solution implements a two-pronged approach. First, a centralized regulatory repository consolidates all historical exam data into a unified database, including agenda questions, findings, recommendations, and responses. This information is stored in both a SQL database for structured queries and a vector database with embeddings for semantic search. Second, an AI-powered chatbot retrieves information about previous regulatory interactions to help teams prepare for upcoming exams and automates response generation using pre-approved answers for speed and compliance. 
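The dual-storage idea can be sketched with SQLite standing in for the structured side and a plain dict standing in for the vector store. The schema, regulator name, and embedding below are all hypothetical; the actual system uses an Azure-based vector store:

```python
import sqlite3

# Structured side: a SQL table for metadata queries (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE exams (
    id INTEGER PRIMARY KEY,
    regulator TEXT, year INTEGER, question TEXT, response TEXT)""")
conn.execute(
    "INSERT INTO exams VALUES (1, 'FINMA', 2023, "
    "'Describe liquidity controls', 'See policy L-7')")
conn.commit()

# Semantic side: the same record's embedding, keyed by the SQL row id,
# so a vector hit can be joined back to its structured metadata.
vector_index = {1: [0.12, -0.53, 0.88]}  # toy embedding

row = conn.execute(
    "SELECT question FROM exams WHERE regulator = ? AND year = ?",
    ("FINMA", 2023)).fetchone()
```

Keying both stores on the same record id is what lets the chatbot answer either "list all 2023 exam questions" (SQL) or "what did we say about liquidity?" (vector search) from one consolidated repository.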

The proof-of-concept implements a sophisticated multi-stage decision process. When a user poses a question, an LLM first assesses the intent and determines whether to retrieve using RAG for semantic understanding, query the SQL database for metadata searches, or answer from conversation context for follow-ups. The question is then reformulated based on the chosen path to improve search effectiveness. 
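The routing branch can be illustrated with a toy stand-in. In the real system an LLM classifies the intent; the keyword heuristic below only demonstrates the three-way decision, and every rule in it is invented for illustration:

```python
def route_question(question, has_history):
    """Toy stand-in for the LLM router: pick a retrieval path.

    Returns one of "sql" (metadata query), "rag" (semantic retrieval),
    or "conversation_context" (follow-up answered from history).
    """
    q = question.lower()
    if has_history and q.startswith(("and ", "what about")):
        return "conversation_context"     # follow-up on the ongoing thread
    if any(k in q for k in ("how many", "list all", "which year")):
        return "sql"                      # structured / aggregate query
    return "rag"                          # default: semantic search

route_question("How many exams did we have in 2022?", has_history=False)
```

After the path is chosen, the question would be reformulated to suit it, e.g. rewritten into filter terms for SQL or into a dense retrieval query for RAG.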

For database retrieval, an LLM determines which metadata fields to filter and constructs SQL queries. For vector store retrieval, semantic similarity search combined with metadata filtering identifies relevant document chunks, using LLM-generated filters, cosine similarity thresholds, and optimized chunking. The system leverages Azure-based Risklab Vega vector store, optimized for financial use cases. 
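The vector-store path combines exact metadata filters with a cosine-similarity cutoff. A self-contained sketch with toy two-dimensional embeddings; real embeddings have hundreds of dimensions, and the threshold value here is illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, chunks, filters, threshold=0.3, k=2):
    """Metadata-filtered semantic search over document chunks.

    chunks: list of dicts with 'vec', 'meta', 'text'.
    filters: metadata key/value pairs that must match exactly;
    only matching chunks are scored, then ranked by similarity.
    """
    candidates = [c for c in chunks
                  if all(c["meta"].get(f) == v for f, v in filters.items())]
    scored = [(cosine(query_vec, c["vec"]), c) for c in candidates]
    return [c["text"] for s, c in sorted(scored, key=lambda p: -p[0])
            if s >= threshold][:k]

chunks = [
    {"vec": [1.0, 0.0], "meta": {"year": 2023}, "text": "liquidity finding"},
    {"vec": [0.9, 0.1], "meta": {"year": 2021}, "text": "old finding"},
    {"vec": [0.0, 1.0], "meta": {"year": 2023}, "text": "unrelated note"},
]
hits = retrieve([1.0, 0.0], chunks, {"year": 2023})
```

Filtering before scoring keeps a semantically similar but out-of-scope chunk (here, the 2021 record) from ever reaching the answer-synthesis step.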

The LLM synthesizes an answer based on retrieved context and conversation history. Critically, a grader LLM acts as a guardrail against hallucinations, evaluating whether the answer actually addresses the question. If the answer doesn't meet quality standards, the system asks the user to reformulate rather than providing potentially inaccurate information. 
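The guardrail control flow can be sketched as follows. The real grader is another LLM call; the word-overlap check below is only a placeholder so the accept-or-reformulate branch can be shown end to end:

```python
def grade(question, answer):
    """Toy stand-in for the grader LLM: does the answer address
    the question? Here, just check for shared content words."""
    q_words = {w for w in question.lower().split() if len(w) > 3}
    return bool(q_words & set(answer.lower().split()))

def answer_with_guardrail(question, draft_answer):
    """Return the draft only if it passes grading; otherwise ask
    the user to rephrase instead of risking a hallucination."""
    if grade(question, draft_answer):
        return draft_answer
    return ("I couldn't verify an answer to that. "
            "Could you reformulate the question?")

answer_with_guardrail("What were the liquidity findings?",
                      "The liquidity review raised two findings.")
```

The key design choice is failing closed: an unverifiable draft is never shown, which matters more in a compliance setting than answering every query.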

A significant technical challenge involved processing regulatory exam records stored in PDFs, Word documents, and PowerPoint decks with varying formats. LLMs serve as intelligent extractors, converting these heterogeneous sources into structured SQL records and semantic embeddings with rich metadata. 

The project delivered a functional chatbot deployed on the intranet that supplies structured information about previous regulatory interactions and prepares templates for current regulatory responses. The comprehensive data processing pipeline extracts and structures information from diverse file formats, maintaining both SQL and vector representations for optimal retrieval. 

Future development includes internal deployment expansion to additional teams and jurisdictions, user feedback integration to improve retrieval accuracy and response quality, advanced analytics for trend analysis to enable proactive compliance management, and integration with existing compliance management systems. This AI-powered regulatory chatbot demonstrates how modern NLP techniques can transform compliance operations in highly regulated industries, significantly reducing time and effort while improving consistency and accuracy. 
 

Conclusion  

With the completion of the Part-Time Batch #10 Data Science capstone projects, we celebrate the impressive achievements of our graduates. Their work clearly demonstrates what happens when technical expertise meets creativity: data science unlocks its full potential for innovation. 

Interested in reading more about Constructor Nexademy and tech-related topics? Then check out our other blog posts.
