Classifying Fruit Purees with Infrared Spectroscopy


Infrared spectroscopy measures the interaction of infrared radiation with matter. In absorption infrared spectroscopy, various wavelengths of infrared light are cast onto a sample, and the amount of light absorbed at each wavelength is recorded. Uses for this technique are abundant in chemical engineering, but this article focuses on the data gathered from one specific experiment. The experiment recorded mid-infrared absorption on fruit puree samples with the intent of categorizing them as pure or adulterated. All of the observations were strawberry purees that were either (1) unadulterated or (2) intentionally adulterated with apple, plum, red grape juice, or rhubarb compote.

The purpose of the experiment was to test whether mid-infrared spectroscopy gathers sufficient data to build a predictive model that can accurately classify future samples. The business case here would be a food manufacturer wanting to confirm that the raw ingredients it receives from suppliers meet quality standards in terms of purity. If a supplier is suspected of diluting its fruit purees, is infrared spectroscopy a feasible way to determine chemical composition?

The data gathered in the study consisted of 983 fruit purees: 351 pure strawberry purees and 632 with some level of adulteration. For each sample, 256 different wavelengths were tested and the absorption levels recorded. Using 3 of these wavelength features, strawberry vs non-strawberry can be visualized below.



The strawberry purees are in yellow and all the adulterated samples are in teal. The labels were given in a binary format: 0 is strawberry and 1 is non-strawberry. We were given no data on the level of dilution or on what type of other fruit was added. Our goal is to maximize accuracy in predicting strawberry vs non-strawberry.

The first step was to separate the observations into a training set (689 samples) and a testing set (294 samples). Then the following models were fit to the training data:
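The article does not say how the split was made. As a minimal sketch, assuming a plain random shuffle of the row indices (the function name `split_indices` and the seed are illustrative, not from the original analysis), the two set sizes can be reproduced with the standard library:

```python
import random


def split_indices(n_samples, n_test, seed=0):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_test shuffled indices become the test set, the rest train.
    return indices[n_test:], indices[:n_test]


# 983 samples in total, 294 held out for testing (sizes from the article).
train_idx, test_idx = split_indices(n_samples=983, n_test=294)
print(len(train_idx), len(test_idx))  # 689 294
```

Because the classes are imbalanced (351 vs 632), a stratified split that preserves the class ratio in both sets would also be a reasonable choice.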


1. Linear Discriminant Analysis. This model attempts to create a linear classification boundary in 237-dimensional space separating strawberry from non-strawberry observations. After parameter tuning to reduce overfitting, this model was the most effective of the three individual models, with an overall accuracy score of 97.91% on the testing data. The confusion matrix is below, with 0 referring to strawberry and 1 referring to non-strawberry:
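The article does not specify which parameters were tuned. One common way to curb overfitting in LDA when there are many correlated features is covariance shrinkage, so the sketch below assumes that approach and uses scikit-learn. The data is a synthetic stand-in (two Gaussian classes), not the spectra from the study, so the printed numbers will not match the results above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the spectra: two classes whose means differ slightly
# across 50 features. This is NOT the data from the study.
rng = np.random.default_rng(0)
n_per_class, n_features = 200, 50
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_per_class, n_features)),
    rng.normal(0.5, 1.0, size=(n_per_class, n_features)),
])
y = np.array([0] * n_per_class + [1] * n_per_class)  # 0 = strawberry, 1 = non-strawberry

# Hold out every fourth row for testing.
test_mask = np.arange(len(y)) % 4 == 0
X_train, y_train = X[~test_mask], y[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]

# Shrinkage regularizes the covariance estimate; it requires the "lsqr" or
# "eigen" solver. Assumed here, not confirmed as the author's tuning.
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
lda.fit(X_train, y_train)
y_pred = lda.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # rows = true class, columns = predicted class
```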



2. Random Forest. The Random Forest classifier is built from decision trees and uses both bagging (bootstrap sampling of the training rows) and a random sample of features at each split. I apologize for departing from normal human English in this article. The results from the Random Forest classifier were very similar to those of the LDA, with an accuracy score of 97.6% on the testing data. The confusion matrix is below:
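To make the two sources of randomness concrete, here is a hedged scikit-learn sketch on synthetic data (again not the study's spectra). The hyperparameter values are illustrative assumptions; the article does not list the ones actually used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: two Gaussian classes over 20 features.
rng = np.random.default_rng(1)
n_per_class, n_features = 200, 20
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_per_class, n_features)),
    rng.normal(1.0, 1.0, size=(n_per_class, n_features)),
])
y = np.array([0] * n_per_class + [1] * n_per_class)  # 0 = strawberry, 1 = non-strawberry

# Hold out every fourth row for testing.
test_mask = np.arange(len(y)) % 4 == 0
X_train, y_train = X[~test_mask], y[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees (illustrative value)
    bootstrap=True,       # bagging: each tree sees a bootstrap sample of rows
    max_features="sqrt",  # random subset of features considered at each split
    oob_score=True,       # score each tree on the rows its bootstrap left out
    random_state=0,
)
forest.fit(X_train, y_train)

print(forest.oob_score_)               # out-of-bag accuracy estimate
print(forest.score(X_test, y_test))    # accuracy on the held-out rows
```

The out-of-bag score is a by-product of bagging: it gives a rough accuracy estimate without touching the test set, which can be handy during tuning.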


3. K Nearest Neighbors. KNN is a non-parametric model that classifies an observation based on the classes of the training observations closest to it in feature space. With tuned parameters, we were able to achieve 95.7% classification accuracy. Results below:
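The idea behind KNN fits in a few lines. The sketch below is a from-scratch illustration using Euclidean distance and a toy two-feature dataset with made-up values; the function name `knn_predict` and the choice of `k=3` are assumptions for the example, not the tuned settings from the analysis.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training observation.
    distances = [(math.dist(row, query), label) for row, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data with made-up values: 0 = strawberry, 1 = non-strawberry.
train_X = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.15),
           (0.80, 0.90), (0.85, 0.80), (0.90, 0.95)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (0.12, 0.22)))  # 0
print(knn_predict(train_X, train_y, (0.88, 0.85)))  # 1
```

With two classes, an odd `k` avoids tied votes.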




Now that we have 3 strong models, the next step is to combine the predictions from all three into an ensemble model. The ensemble strategy is simple: for each observation, consider what each of the three models classified that sample as (either strawberry or non-strawberry), and use majority rule for the ensemble model's prediction. The results are the strongest yet, with an accuracy score of 97.96%.
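The majority rule described above can be sketched as follows. The three prediction lists are hypothetical values chosen to show the models disagreeing, and `majority_vote` is an illustrative helper name.

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Combine per-model label lists into one list by majority rule."""
    # zip(*...) groups the votes that all models cast for the same sample.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*model_predictions)]


# Hypothetical predictions from the three models on five samples
# (0 = strawberry, 1 = non-strawberry).
lda_pred = [0, 1, 1, 0, 1]
rf_pred = [0, 1, 0, 0, 1]
knn_pred = [1, 1, 1, 0, 0]

print(majority_vote(lda_pred, rf_pred, knn_pred))  # [0, 1, 1, 0, 1]
```

With three voters and two classes a tie is impossible, so no tie-breaking rule is needed. scikit-learn's `VotingClassifier` with `voting="hard"` offers the same behavior for fitted estimators, though the article does not say whether it was used.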



In conclusion, infrared spectroscopy appears to be a feasible method by which a manufacturer could assess the quality of the raw ingredients received from suppliers. However, a ~98% accuracy score is not sufficient for many food-related industries, so more tuning may be required, and this certainly should not be the only process check before items move to end consumers.
