Having looked at the options using tools of increasing complexity we come to using my favourite tool for LIBS Analysis - Python. I like Python because it is reasonably easy to start, and it encourages experimentation - not everything has to be right before you can start getting something out of it.
I use WinPython as it offers (at least on Windows) a version that works out of the box.
The source files for these tutorials are Jupyter notebooks which enable me to combine the code and the explanation together. The files are available from the Elementia Consulting Github if you want to try/adapt these for your own use.
The beauty of Jupyter notebooks is that they allow the combining together of Python script and an explanation of what is happening. If you are viewing this on the libs-info website, then the file won't be interactive - you will just the output. If you download the notebook files from the Github, then they can be run live and the results seen.
If you are only interested in the techniques used, then you can skip the Python code lines and their explanation. The pure code commentary will be marked with italics
In the next could of lines, we initialise the Python code that we need to run the calculations. We will be using the Bokeh for the charts
import pandas as pd
import numpy as np
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import Range1d
from IPython.display import display, Markdown, HTML
from elementia.tutorial_fns import calculate_cross_validation, calculate_intensity_cv, calculate_sum_cv
output_notebook()
For this work we will use the following samples, and analyse for Sodium (Na)
Sample | Na. Conc (ppm) |
---|---|
OREAS45e | 594 |
OREAS501b | 20848 |
OREAS601 | 14543 |
OREAS921 | 6068 |
OREAS603 | 4283 |
OREAS933 | 1515 |
OREAS903 | 301 |
To do this, we use the Python Data Analysis Library (PANDAS). This package is included in most Python implementations.
PANDAS has a read_csv function that handles the import of the file, placing into a PANDAS dataframe, which has a range of usefule features which we will take advantage of - the first of which is that we can use the column heading to label the data
conc_data = pd.read_csv('.\\files\\Na_concs.csv')
print(conc_data)
In this case we will import the same source CSV file as we have used previously and plot all the spectra
If you are using notebook file, then the tools to the left of the chart allow zooming, panning and saving the image to file
# load the data from file
spectra_data = pd.read_csv('.\\files\\OREAS_source.csv')
# and display the first 5 lines
print(spectra_data.head(5))
#use the Bokeh library, start a chart (or figure)
p1 = figure()
#on our new figure, display the intensities we have loaded from the files. One of the many beauties of the PANDAS library we used to do
# the import is that it allows us to use the column labels from the file to represent the columns of data.
p1.line(x=spectra_data['Wavelength (nm)'], y=spectra_data['501b_6'], legend='OREAS 501b', color='red')
p1.line(x=spectra_data['Wavelength (nm)'], y=spectra_data['601_7'], legend='OREAS 601', color='orange')
p1.line(x=spectra_data['Wavelength (nm)'], y=spectra_data['921_7'], legend='OREAS 921', color='yellow')
p1.line(x=spectra_data['Wavelength (nm)'], y=spectra_data['603_7'], legend='OREAS 603', color='green')
p1.line(x=spectra_data['Wavelength (nm)'], y=spectra_data['933_5'], legend='OREAS 933', color='blue')
p1.line(x=spectra_data['Wavelength (nm)'], y=spectra_data['45e_7'], legend='OREAS 45e', color='indigo')
p1.line(x=spectra_data['Wavelength (nm)'], y=spectra_data['903_5'], legend='OREAS 903', color='violet')
show(p1)
#restrict the range of the x axis (wavelengths)
p1.x_range = Range1d(585.0, 594.0)
#and reshow the plot
show(p1)
When we used the spreadsheet for this analysis, we were forced to manually inspect the data to find the nearest points for our calculations. Here we can ask Python to do it for us.
# creating a PANDAS index allows for a range of search functions.
wl_index = pd.Index(spectra_data['Wavelength (nm)'])
# firstly find the index (that is row number) in the wavelength list that is closest to 588.0nm. Store it in the index_588 variable for use later
index_588 = wl_index.get_loc(588.0, method='nearest')
# repeat this for index closest to 589.2nm, 590nm and 587.5nm (background point).
index_589 = wl_index.get_loc(589.2, method='nearest')
index_590 = wl_index.get_loc(590.0, method='nearest')
index_5875 = wl_index.get_loc(587.5, method='nearest')
print("Nearest Point to 588nm is ", spectra_data['Wavelength (nm)'][index_588], 'nm which is at index:', index_588)
print("Nearest Point to 589.2nm is ", spectra_data['Wavelength (nm)'][index_589], 'nm which is at index:', index_589)
print("Nearest Point to 590nm is ", spectra_data['Wavelength (nm)'][index_590], 'nm which is at index:', index_590)
print("Nearest Point to 587.5nm is ", spectra_data['Wavelength (nm)'][index_5875], 'nm which is at index:', index_5875)
Next we ask for the highest value between the our indices
# we will build a list of the peak intensities
intensities_589 = []
# go through all the samples
for sample in conc_data['Sample']:
#finding the peak/maximum intensity between the rows representing our wavelengths (and subtracting off the background point)
intensities_589.append(max(spectra_data[sample][index_588:index_589])-spectra_data[sample][index_5875])
# repeat this process for the 589.5nm peak
intensities_5895 = []
for sample in conc_data['Sample']:
intensities_5895.append(max(spectra_data[sample][index_589:index_590])-spectra_data[sample][index_5875])
#and display the intensities vs concentrations plots.
p2 = figure()
p2.xaxis.axis_label = "Intensity (Arb Units)"
p2.yaxis.axis_label = "Concentration (ppm)"
p2.scatter(intensities_589, conc_data['Na Conc (ppm)'], legend="589nm", color="red")
p2.scatter(intensities_5895, conc_data['Na Conc (ppm)'], legend="589.5nm", color="blue")
show(p2)
In the spreadsheet example, we got Excel to calculate the slope of the curve.
Mathematically, the slope can be calculated using
$\begin{aligned} slope = \frac{x \cdot y}{x \cdot x} \end{aligned} $
#while we could have done this calculation many different ways, the python NumPy library has a function for calculating
# the dot product between 2 vectors.
# The cost is that the lists we made before need to be converted into the NumPy array format.
np_intensities_589 = np.array(intensities_589)
np_intensities_5895 = np.array(intensities_5895)
#now the slopes are a straightforward calculation
slope_589 = np.dot(np_intensities_589, conc_data['Na Conc (ppm)'])/np.dot(np_intensities_589,np_intensities_589)
slope_5895 = np.dot(np_intensities_5895, conc_data['Na Conc (ppm)'])/np.dot(np_intensities_5895,np_intensities_5895)
print("Slope of 589nm Curve:", slope_589)
print("Slope of 589.5nm Curve:", slope_5895)
Plotting these on our curve, alongside the data shows (again) that the peak intensity of the Na lines at 589nm are unlikely to give us good/useable calibration curves
p2.line([0.0, 40000.0], [0.0, slope_589*40000], color="red")
p2.line([0.0, 40000.0], [0.0, slope_5895*40000], color="blue")
show(p2)
As we did previously, we will now investigate the Na peaks at 818.3nm and 819.4nm
# change our previous plot to better display our new region of interest.
p1.x_range = Range1d(815, 825.0)
p1.y_range = Range1d(0, 7000)
show(p1)
As the concentration of Na reduces, there still remains some underlying structure in the spectra, which makes the 818.3nm line unlikely to be useable at low Na concentrations. For this work we will not use it.
As before, we will get Python to find the nearest peaks for the region in which we will search for the highest intensity (819.1-819.9nm) and for a measure of the underlying background signal (817.5nm)
#Similar to before, find the indices (row numbers) of interest
index_8191 = wl_index.get_loc(819.1, method='nearest')
index_8199 = wl_index.get_loc(819.9, method='nearest')
index_8175 = wl_index.get_loc(817.5, method='nearest')
# caculate the peak intensities (removing the background at 817.5nm)
intensities_8194 = []
for sample in conc_data['Sample']:
intensities_8194.append(max(spectra_data[sample][index_8191:index_8199])-spectra_data[sample][index_8175])
# show the figure
p3 = figure()
p3.xaxis.axis_label = "Intensity (Arb Units)"
p3.yaxis.axis_label = "Concentration (ppm)"
p3.scatter(intensities_8194, conc_data['Na Conc (ppm)'], color="red")
# with the calibration curve
np_intensities_8194 = np.array(intensities_8194)
slope_8194 = np.dot(np_intensities_8194, conc_data['Na Conc (ppm)'])/np.dot(np_intensities_8194,np_intensities_8194)
p3.line([0.0, 6000.0], [0.0, slope_8194*6000], color="red")
show(p3)
As with previous work, while a visual inspection of the quality of a calibration curve can tell an experienced researcher quite a bit, deciding the which curve to use should have some more formal measures. In Tutorial 2, we used LOD and the linear regression uncertainty as measures.
Here we introduce a new one - Root Mean Square Error in Prediction (RMSEP) or sometimes referred to as Root Mean Square Error in Cross Validation The aim of this measure is to provide an estimate of the uncertainty in the concentration value from the calibration curve.
# As a first example, lets calcuate the calibration curve without the point with the maximum concentration.
# So, as before, calculate the intensities
intensities_8194_drop_max = []
for sample in conc_data['Sample'][1:]:
intensities_8194_drop_max.append(max(spectra_data[sample][index_8191:index_8199])-spectra_data[sample][index_8175])
#now manually calculate the intensity for the 'missing' point
intensity_501b_6 = max(spectra_data['501b_6'][index_8191:index_8199])-spectra_data['501b_6'][index_8175]
# calculate the slope of the calibration curve
np_intensities_8194_drop_max = np.array(intensities_8194_drop_max)
slope_8194_drop_max = np.dot(np_intensities_8194_drop_max, conc_data['Na Conc (ppm)'][1:])/np.dot(np_intensities_8194_drop_max,np_intensities_8194_drop_max)
# plot the new calibration curve [blue dotted line]
p3.line([0.0, 6000.0], [0.0, slope_8194_drop_max*6000], color="blue", line_dash='dotted')
# plot the predicted concentration [blue circle]
p3.scatter([intensity_501b_6], [(slope_8194_drop_max*intensity_501b_6)], color='blue')
print("Calculated Concentration:", slope_8194_drop_max*intensity_501b_6, "ppm. Actual Conc:", conc_data['Na Conc (ppm)'][0])
print("Diff (ppm):", slope_8194_drop_max*intensity_501b_6-conc_data['Na Conc (ppm)'][0] )
show(p3)
In this example the calibration curve without the highest concentration point, predicts a concentration of 20019.7ppm (a difference of 828.3ppm). This shows there is an internal consistency in the data - the prediction is reasonably close, even when the point is not used in the calibration.
Repeating this for each of the points in turn - that is:-
removing the point $I_i$ from the data set
performing a calibration without the missing point
analysing the removed point using the new calibration curve (making concentration $C^u_i$
determine the difference to the true value $C^s_i - C^u_i $
Adding the differences using the Root-Mean-Square formula, that is
$\begin{aligned} RMSEP = \sqrt{\frac{\sum_n \left(C^s_i - C^u_i \right)^2 }{n}} \end{aligned} $
# To simplify things, we have automated this in a function.
# If you are working from the Github, make sure you also download the elementia directory as this has the functions in it
cv_concs, rmsep = calculate_cross_validation(concs=conc_data['Na Conc (ppm)'], intensities=intensities_8194)
# plot the predicted concentrations on the chart
p3.scatter(intensities_8194, cv_concs, color='blue')
# and show the predicted concs
print(cv_concs)
# and the RMSEP
print(rmsep)
show(p3)
Our calibration curve using the peak intensities has a RMSEP = 605ppm. Not bad!
As a different way of visualising the results - let's plot the predicted concentrations vs the actual concentrations. Sometimes this can give insight into where our calibration is performing well and where it isn't
p4 = figure()
p4.xaxis.axis_label = "Predicted Concentration (ppm)"
p4.yaxis.axis_label = "Actual Concentration (ppm)"
p4.scatter(conc_data['Na Conc (ppm)'], cv_concs, color="red")
# The 1:1 line (that is if the point lies on this line, then there is an exact match between predicted and actual.)
p4.line([0.0, conc_data['Na Conc (ppm)'][0]], [0.0, conc_data['Na Conc (ppm)'][0]])
show(p4)
As a comparison, we will do all the same work on the 589nm and 589.5nm peak intensities. That is,
calculate the predicted concentration from a calibration curve made without the point
calculate the difference between the predicted concentration and the actual concentration.
Calcuate the RMSEP
# Set up the Chart
p5 = figure()
p5.xaxis.axis_label = "Predicted Concentration (ppm)"
p5.yaxis.axis_label = "Actual Concentration (ppm)"
p5.line([0.0, conc_data['Na Conc (ppm)'][0]], [0.0, conc_data['Na Conc (ppm)'][0]])
# Calculate the RMSEP and Cross Validation Predictions for the 589nm peak
cv_concs_589, rmsep_589 = calculate_intensity_cv(wavelengths=spectra_data['Wavelength (nm)'], intensities=spectra_data,
concs=conc_data['Na Conc (ppm)'], samples=conc_data['Sample'],
wavelength_lower=588.0, wavelength_upper=589.25, wavelength_background=587.5)
# Calculate the RMSEP and Cross Validation Predictions for the 589.5nm peak
cv_concs_5895, rmsep_5895 = calculate_intensity_cv(wavelengths=spectra_data['Wavelength (nm)'], intensities=spectra_data,
concs=conc_data['Na Conc (ppm)'], samples=conc_data['Sample'],
wavelength_lower=589.25, wavelength_upper=591.0, wavelength_background=587.5)
# Plot the 589nm points in red
p5.scatter(conc_data['Na Conc (ppm)'], cv_concs_589, color="red")
# Plot the 589.5nm points in blue
p5.scatter(conc_data['Na Conc (ppm)'], cv_concs_5895, color="blue")
print("RMSEP (589nm):", rmsep_589)
print("RMSEP (589.5nm):", rmsep_5895)
show(p5)
The predicted points lie quite a way off the 1:1 line, and this is demonstrated in the RMSEP - at best 8 times worse than the 819.4nm peak.
Repeating similar work, we will calculate the Peak Area for each of the samples, constructing a calibration curve using these values and determine the RMSEP.
sum_8194 = []
#Calculate the peak area, subtracting off the background area.
for sample in conc_data['Sample']:
sum_8194.append(sum(spectra_data[sample][index_8191:index_8199])-(index_8199-index_8191-1)*spectra_data[sample][index_8175])
p6 = figure()
p6.xaxis.axis_label = "Intensity (Arb Units)"
p6.yaxis.axis_label = "Concentration (ppm)"
p6.scatter(sum_8194, conc_data['Na Conc (ppm)'], color="red")
# calculate the slope of the fitting curve
np_sum_8194 = np.array(sum_8194)
slope_sum_8194 = np.dot(np_sum_8194, conc_data['Na Conc (ppm)'])/np.dot(np_sum_8194,np_sum_8194)
p6.line([0.0, 15000.0], [0.0, slope_sum_8194*15000], color="red")
print(slope_sum_8194)
show(p6)
Now, calcuate the RMSEP
cv_concs_sum_8194, rmsep_sum_8194 = calculate_cross_validation(concs=conc_data['Na Conc (ppm)'], intensities=sum_8194)
p6.scatter(sum_8194, cv_concs, color='blue')
print("RMSEP (819.4nm):", rmsep_sum_8194)
show(p6)
As we found using the Spreadsheet example in Tutorial 2, the performance of the calibration curve using peak areas for the 819.4nm peak doesn't perform as well as the curve using peak intensity.
Replotting this to show the predicted vs the actual concentration shows performance is worse at both the high and low concentration ends.
p7 = figure()
p7.xaxis.axis_label = "Predicted Concentration (ppm)"
p7.yaxis.axis_label = "Actual Concentration (ppm)"
p7.scatter(conc_data['Na Conc (ppm)'], cv_concs_sum_8194, color="red")
p7.line([0.0, conc_data['Na Conc (ppm)'][0]], [0.0, conc_data['Na Conc (ppm)'][0]])
show(p7)
Combining our :-
Tells a consistent story:-
rmsep_sum_589 = calculate_sum_cv(wavelengths=spectra_data['Wavelength (nm)'], intensities=spectra_data,
concs=conc_data['Na Conc (ppm)'], samples=conc_data['Sample'],
wavelength_lower=588.0, wavelength_upper=589.25, wavelength_background=587.5)[1]
rmsep_sum_5895 = calculate_sum_cv(wavelengths=spectra_data['Wavelength (nm)'], intensities=spectra_data,
concs=conc_data['Na Conc (ppm)'], samples=conc_data['Sample'],
wavelength_lower=589.25, wavelength_upper=591.0, wavelength_background=587.5)[1]
rmsep_sum_589_5895 = calculate_sum_cv(wavelengths=spectra_data['Wavelength (nm)'], intensities=spectra_data,
concs=conc_data['Na Conc (ppm)'], samples=conc_data['Sample'],
wavelength_lower=588.0, wavelength_upper=591.0, wavelength_background=587.5)[1]
display(HTML("<table><tr><th>Peak (nm)</th> <th>Intensity RMSEP</th><th>Area RMSEP</th><th>Intensity SE (Tut 2)</th><th>Area SE (Tut 2)</th><th>Intensity LOD (Tut 2)</th><th>Area LOD (Tut 2)</th>"+
"</tr><tr><td>819.4</td><td>%.1f</td><td>%.1f</td><td>562.0</td><td>743.4</td><td>80.0</td><td>6.0</td></tr>" % (rmsep, rmsep_sum_8194) +
"</tr><tr><td>589</td><td>%.1f</td><td>%.1f</td><td>5054.2</td><td>2720.5</td><td>16.3</td><td>30.5</td></tr>" % (rmsep_589, rmsep_sum_589) +
"</tr><tr><td>589.5</td><td>%.1f</td><td>%.1f</td><td>4178.2</td><td>2511.3</td><td>19.4</td><td>8.3</td></tr>" % (rmsep_5895, rmsep_sum_5895) +
"</tr><tr><td>589/589.5</td><td></td><td>%.1f</td><td></td><td>2753.4</td><td></td><td>3.7</td></tr></table>" % (rmsep_sum_589_5895)))