Technologies & Methods Used
As part of the Data Science Lab course at Politecnico di Torino, I completed a winter project focused on audio-based intent recognition. The challenge was to build a model that could accurately classify short spoken commands like “Increase Volume” using a dataset of audio recordings from speakers of varying ages and backgrounds.
We tackled the problem by converting the audio signals into Mel spectrograms—visual representations of audio frequencies over time—and splitting them into blocks to extract summary statistics. After filtering out inaudible samples and trimming silent sections, we transformed the spectrograms to a logarithmic scale, which aligns with how humans perceive loudness.
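A minimal sketch of the block-statistics step, assuming the log-Mel spectrogram has already been computed (e.g. with a library such as librosa); the block count and the choice of mean/standard deviation here are illustrative, not necessarily the exact settings used in the project:

```python
import numpy as np

def block_features(log_mel, n_blocks=4):
    """Split a log-Mel spectrogram (n_mels x n_frames) into
    n_blocks equal time blocks and summarize each block with
    the mean and standard deviation of every Mel band.
    Returns a 1-D vector of length n_mels * n_blocks * 2."""
    blocks = np.array_split(log_mel, n_blocks, axis=1)
    feats = []
    for b in blocks:
        feats.append(b.mean(axis=1))  # average energy per band
        feats.append(b.std(axis=1))   # variability per band
    return np.concatenate(feats)

# Example: a synthetic 40-band spectrogram with 100 frames
spec = np.random.default_rng(0).normal(size=(40, 100))
fv = block_features(spec, n_blocks=4)
print(fv.shape)  # (320,) -> 40 bands * 4 blocks * 2 stats
```

Collapsing each block to fixed-size statistics turns variable-length recordings into equal-length feature vectors, which is what classical classifiers require.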
Instead of using deep learning, we followed a classical machine learning approach: we hand-engineered features from the spectrograms and classified them with models such as Random Forest and SVM. After tuning hyperparameters with grid search, Random Forest emerged as the more effective choice, achieving a test accuracy of 70.4% and a leaderboard score of 75%—well above the course baseline.
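The tuning step can be sketched with scikit-learn's grid search; the synthetic data and the parameter grid below are placeholders, not the project's actual feature matrix or search space:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the engineered spectrogram features
X, y = make_classification(n_samples=300, n_features=64,
                           n_informative=20, n_classes=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Illustrative grid; the real search space was likely larger
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300],
                "max_depth": [None, 20]},
    cv=3,
    n_jobs=-1,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```

Grid search cross-validates every parameter combination on the training split, so the held-out test score reflects only the final chosen model.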
What made this project rewarding wasn’t just the technical outcome, but how it deepened my understanding of real-world audio processing challenges, like class imbalance, silent frames, and feature extraction.
This hands-on project was a meaningful step forward in my journey as a data scientist, and I’m excited to keep building from here.
The full PDF report is available here.