This tutorial describes how to take two raw mass spectrometry (MS) data sets, convert them into formats that can be read by obiwarp, and create a time standards file from MS/MS identifications. Typically, this would be data from peptides eluted from a reverse phase column where one is trying to align the chromatograms (i.e., adjust the elution times of identical peptides across runs so they match up).
Raw mass spec data must be converted to a format that can be read by obi-warp. Currently, obi-warp expects data as either a rectilinear or a uniform matrix. Programs exist to convert mass spec data into this format (see below), but a simple example of how the conversion is performed is useful for understanding the format. The '.lmat' (binary) and '.lmata' (ascii) formats contain time and m/z axis labels that give the positions of the rows and columns of the intensity (ion count) matrix. The data is written in this order: 1) the number of time values, 2) the time values, 3) the number of m/z values, 4) the m/z values, 5) rows of intensity values (one row per time value).
Following is a simplified example of how to convert MS data into an lmata matrix (using 1.0 m/z unit bins):
(time: 0.3 sec) [m/z 300.2, intensity 10000],[m/z 301.3, intensity 10500],
[m/z 302.6, intensity 10600],[m/z 303.2, intensity 20000]
(time: 3.3 sec) [m/z 300.2, intensity 50000],[m/z 301.3, intensity 50500],
[m/z 302.6, intensity 40600],[m/z 303.2, intensity 30000]
(time: 6.2 sec) [m/z 300.2, intensity 20000],[m/z 301.3, intensity 20500],
[m/z 302.6, intensity 30600],[m/z 303.2, intensity 33000]
If we sum the intensities of m/z values that round to the same bin and we use a rectilinear
matrix, we would write an lmata file like this:
3 # <- Number of time values (along the m axis)
0.3 3.3 6.2 # <- These are the time values
4 # <- Number of m/z values (along the n axis)
300 301 302 303 # <- These are the m/z values
10000 10500 0 30600 # <- This is the start of the intensity data
50000 50500 0 70600 # <- The 303 m/z bin contained 2 summed values,
20000 20500 0 63600 # <- while the 302 bin has none.
More sophisticated methods to create this matrix might use more m/z bins or might interpolate the data along the time or m/z axes (or both simultaneously), but the end result is still a matrix similar in form to the one shown.
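The simple binning above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used by any of the converters mentioned in this tutorial; the function names (nearest_bin, spectra_to_lmata) are made up for the example, and the spectra are the three scans listed above.

```python
import math

# Example scans from the text: (time in seconds, [(m/z, intensity), ...])
spectra = [
    (0.3, [(300.2, 10000), (301.3, 10500), (302.6, 10600), (303.2, 20000)]),
    (3.3, [(300.2, 50000), (301.3, 50500), (302.6, 40600), (303.2, 30000)]),
    (6.2, [(300.2, 20000), (301.3, 20500), (302.6, 30600), (303.2, 33000)]),
]

def nearest_bin(mz):
    """Round an m/z value to the nearest 1.0 m/z bin (halves round up)."""
    return int(math.floor(mz + 0.5))

def spectra_to_lmata(spectra, mz_start, mz_end):
    """Return lmata-style text: axis labels first, then one intensity row per scan."""
    mz_bins = list(range(mz_start, mz_end + 1))
    lines = [
        str(len(spectra)),                        # number of time values
        " ".join(str(t) for t, _ in spectra),     # the time values
        str(len(mz_bins)),                        # number of m/z values
        " ".join(str(b) for b in mz_bins),        # the m/z values
    ]
    for _, peaks in spectra:
        row = [0] * len(mz_bins)
        for mz, intensity in peaks:
            b = nearest_bin(mz)
            if mz_start <= b <= mz_end:
                row[b - mz_start] += intensity    # sum peaks that share a bin
        lines.append(" ".join(str(v) for v in row))
    return "\n".join(lines) + "\n"

lmata_text = spectra_to_lmata(spectra, 300, 303)
print(lmata_text, end="")
```

Run on the example scans, this prints the same labels and intensity rows as the matrix shown above (without the trailing annotations).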
Type:
obiwarp --help formats
for more information on acceptable file formats. Conversion programs exist in the obi-warp package to convert the ascii file into binary format (e.g., lmata2lmat) and vice versa.
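To check a hand-written lmata file before passing it to obiwarp, it can help to read it back in. The following is a minimal sketch of such a reader in Python, written from the field order described above. It makes two assumptions that are not confirmed by this tutorial: that the file is a plain whitespace-separated token stream, and that text after a '#' is an annotation to be ignored (as in the annotated example above). The name parse_lmata is hypothetical.

```python
example = """\
3 # number of time values
0.3 3.3 6.2
4 # number of m/z values
300 301 302 303
10000 10500 0 30600
50000 50500 0 70600
20000 20500 0 63600
"""

def parse_lmata(text):
    """Parse lmata-style text into (times, mzs, rows of intensities)."""
    tokens = []
    for line in text.splitlines():
        # Assumption: anything after '#' is an annotation, not data.
        tokens.extend(line.split("#", 1)[0].split())
    pos = 0
    num_times = int(tokens[pos]); pos += 1
    times = [float(t) for t in tokens[pos:pos + num_times]]; pos += num_times
    num_mz = int(tokens[pos]); pos += 1
    mzs = [float(m) for m in tokens[pos:pos + num_mz]]; pos += num_mz
    values = [float(v) for v in tokens[pos:]]
    if len(values) != num_times * num_mz:
        raise ValueError("expected %d intensities, found %d"
                         % (num_times * num_mz, len(values)))
    # One row of intensities per time value.
    rows = [values[i * num_mz:(i + 1) * num_mz] for i in range(num_times)]
    return times, mzs, rows

times, mzs, rows = parse_lmata(example)
print(len(times), "time values,", len(mzs), "m/z values")
```

The length check catches the most common mistake in a hand-made file: a row with too few or too many intensity values.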
Currently, converters exist to produce an lmat matrix from mzXML (version 1 and 2.X). There are converters for various binary data formats into mzXML. Search the sashimi software page for a converter suitable for your data. An executable to create mzXML version 1 files from .RAW files on linux is here. The converter readw.exe to convert .RAW files to mzXML (version 2) files is included with the transproteomic pipeline cygwin installation.
The Ruby gem 'mspire' contains a converter from mzXML (version 1 and 2.X) and mzData to lmat (or lmata). With Ruby and RubyGems installed (see INSTALL[link:files/INSTALL.html]), one can download and install the gem with the command:
gem install mspire
Then, the converter can be run:
ms_to_lmat.rb file.mzXML
If the scans in the file do not have 'startMz' or 'endMz' tags (e.g., in many mzXML 2.X files), you may want to indicate the bounds on the matrix (although the script will attempt to determine these on-the-fly):
ms_to_lmat.rb --mz_start 300 --mz_end 1500 file.mzXML
Type 'ms_to_lmat.rb' for more options.
A converter that performs interpolation along the time axis to fill in any missing values is found in the Prt package (written in Perl). If you have an svn client installed, you can check this library out:
svn co http://svn.icmb.utexas.edu/svnrepos/Prt
It contains documentation on installation. (Don't be alarmed if about half of the tests fail on your system when you run the package's tests.) Then, to convert mzXML (version 1) files to lmata:
mzXML2matrix.pl -l file.mzXML
Type 'mzXML2matrix.pl' for more details.
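To give a feel for what interpolation along the time axis means, here is a small Python sketch that linearly resamples a matrix onto evenly spaced time points. It is not the method implemented in the Prt package (which is not described here); the function name interpolate_time_axis, the step parameter, and the toy data are all made up for illustration.

```python
def interpolate_time_axis(times, rows, step):
    """Linearly resample intensity rows onto a uniform time grid.

    times must be strictly increasing with at least two values;
    rows[i] holds the intensities recorded at times[i].
    """
    # Build the grid by index so floating point error does not accumulate.
    n = int((times[-1] - times[0]) / step + 1e-9) + 1
    new_times = [times[0] + i * step for i in range(n)]
    new_rows = []
    j = 0  # index of the original scan at or before the current grid point
    for t in new_times:
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        frac = (t - t0) / (t1 - t0)
        new_rows.append([a + frac * (b - a)
                         for a, b in zip(rows[j], rows[j + 1])])
    return new_times, new_rows

# Toy data (hypothetical): three scans at uneven times, two m/z bins.
times = [0.0, 2.0, 6.0]
rows = [[0.0, 100.0], [20.0, 100.0], [60.0, 0.0]]
new_times, new_rows = interpolate_time_axis(times, rows, 1.0)
for t, row in zip(new_times, new_rows):
    print(t, row)
```

Each m/z column is interpolated independently between the two nearest scans, so the result is a uniform matrix along the time axis.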
For lmat(a) data (i.e., a matrix with time and m/z labels), the alignment accuracy may be checked with externally derived alignment points. For instance, one may use high confidence MS/MS identifications that are shared across two runs (this would typically be the same peptide at the same charge state):
run 1: Peptide (ALEEGYYK) +2 charge (scan 1020 = 1400 seconds)
run 2: Peptide (ALEEGYYK) +2 charge (scan 998 = 1340 seconds)
run 1: Peptide (NFLALARLSGHFLDR) +3 charge (scan 710 = 1100 seconds)
run 2: Peptide (NFLALARLSGHFLDR) +3 charge (scan 640 = 1050 seconds)
run 1: Peptide (QKELTR) +1 charge (scan 1890 = 3020 seconds)
run 2: Peptide (QKELTR) +1 charge (scan 1920 = 3300 seconds)
The elution time of a peptide may be determined as simply as taking the time at which the MS/MS scan was acquired, or with as much effort as working back to the elution time of the peptide peak apex.
These would be written into a time file (for example, 'timefile.txt') like this:
1400 1340
1100 1050
3020 3300
(The line ordering of these points doesn't matter, but the order of the data in the two columns must not be mixed up: [run1time run2time]). Then, the alignment may be run:
obiwarp -t timefile.txt run1.lmat run2.lmat
The output would look something like this:
0.44 5.15 10.69 16.4 ... # <- new time values for run2
6700.000000 2233.333252 180.000000 60.000000 # <- residuals
Type
obiwarp --help long # look under the --timefile option
for more details.
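Building the time file by hand gets tedious with many identifications. The following Python sketch pairs up identifications shared between two runs and formats them as time file lines. It assumes the identifications have already been reduced to a mapping from (peptide, charge) to elution time in seconds; that data structure and the function names are assumptions for this example, and the values are the ones from the three peptides above.

```python
# (peptide sequence, charge) -> elution time in seconds, from the example above
run1_ids = {
    ("ALEEGYYK", 2): 1400,
    ("NFLALARLSGHFLDR", 3): 1100,
    ("QKELTR", 1): 3020,
}
run2_ids = {
    ("ALEEGYYK", 2): 1340,
    ("NFLALARLSGHFLDR", 3): 1050,
    ("QKELTR", 1): 3300,
}

def time_file_lines(run1, run2):
    """Return 'run1time run2time' lines for identifications found in both runs."""
    shared = sorted(set(run1) & set(run2))  # same peptide at the same charge
    return ["%s %s" % (run1[key], run2[key]) for key in shared]

def write_time_file(path, lines):
    """Write the lines to a time file, one pair of times per line."""
    with open(path, "w") as out:
        out.write("\n".join(lines) + "\n")

lines = time_file_lines(run1_ids, run2_ids)
print("\n".join(lines))
# To save the file: write_time_file("timefile.txt", lines)
```

Identifications seen in only one run are dropped by the set intersection, and the column order (run 1 first, run 2 second) is fixed by the function.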