Alignment of Mass Spec Data

This tutorial describes how to take two raw mass spectrometry (MS) data sets, convert them into formats that can be read by obiwarp, and create a time standards file from MS/MS identifications. Typically, this would be data from peptides eluted from a reverse phase column where one is trying to align the chromatograms (i.e., adjust the elution times of identical peptides across runs so they match up).

Convert raw data to lmat

Raw mass spec data must be converted to a format that can be read by obi-warp. Currently, obi-warp expects data in either a rectilinear or uniform matrix. Programs exist to convert mass spec data into this format (see below), but a simple example of the conversion helps in understanding the format. The '.lmat' (binary) and '.lmata' (ascii) formats contain time and m/z axis labels that give the position of the rows and columns of the intensity (ion count) matrix. The data is written in this order: 1) number of time values, 2) the time values, 3) number of m/z values, 4) the m/z values, 5) rows of intensity values (one row per time value, one column per m/z value).
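As a rough illustration of the five-part layout just described, the following Python sketch reads lmata text back into its axis labels and intensity rows. The function name parse_lmata is hypothetical (it is not part of the obi-warp package), and the handling of '#' comments is an assumption made for convenience with annotated examples; obi-warp itself may not accept comments in a real file.

```python
def parse_lmata(text):
    """Parse lmata (ascii) text into (times, mzs, rows).

    Hypothetical helper, not part of obi-warp. Assumes whitespace-separated
    numbers; anything after '#' on a line is ignored (an assumption).
    """
    tokens = []
    for line in text.splitlines():
        tokens.extend(line.split("#", 1)[0].split())

    pos = 0
    n_times = int(tokens[pos])                       # 1) number of time values
    pos += 1
    times = [float(t) for t in tokens[pos:pos + n_times]]   # 2) time values
    pos += n_times
    n_mzs = int(tokens[pos])                         # 3) number of m/z values
    pos += 1
    mzs = [float(t) for t in tokens[pos:pos + n_mzs]]       # 4) m/z values
    pos += n_mzs
    values = [float(t) for t in tokens[pos:]]        # 5) intensities, row by row
    if len(values) != n_times * n_mzs:
        raise ValueError("expected %d intensities, found %d"
                         % (n_times * n_mzs, len(values)))
    rows = [values[i * n_mzs:(i + 1) * n_mzs] for i in range(n_times)]
    return times, mzs, rows


# A tiny made-up matrix: 2 time points by 3 m/z bins
sample = """2
0.5 1.5
3
400 401 402
1 2 3
4 5 6
"""
times, mzs, rows = parse_lmata(sample)
print(times)
print(mzs)
print(rows)
```

Each entry of rows corresponds to one time value, so rows[i][j] is the intensity at times[i] and mzs[j].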

The following simplified example shows raw MS data (three scans of four peaks each) that will be converted into an lmata matrix using 1.0 m/z unit bins:

(time: 0.3 sec) [m/z 300.2, intensity 10000],[m/z 301.3, intensity 10500],
  [m/z 302.6, intensity 10600],[m/z 303.2, intensity 20000]
(time: 3.3 sec) [m/z 300.2, intensity 50000],[m/z 301.3, intensity 50500],
  [m/z 302.6, intensity 40600],[m/z 303.2, intensity 30000]
(time: 6.2 sec) [m/z 300.2, intensity 20000],[m/z 301.3, intensity 20500],
  [m/z 302.6, intensity 30600],[m/z 303.2, intensity 33000]

If we sum the intensities of m/z values that round to the same bin and use a rectilinear matrix, we would write an lmata file (with the '.lmata' extension) like this (comments follow the # sign):

3                        # <- Number of time values (along the m axis)
0.3 3.3 6.2              # <- These are the time values
4                        # <- Number of m/z values (along the n axis)
300 301 302 303          # <- These are the m/z values
10000 10500 0 30600      # <- This is the start of the intensity data
50000 50500 0 70600      # <- The 303 m/z bin contained 2 summed values,
20000 20500 0 63600      # <- while the 302 bin has none.
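The binning that produces the matrix above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by any of the converters mentioned in this tutorial; the function name to_lmata and its mz_start/mz_end arguments are hypothetical. Each m/z value is rounded to the nearest integer bin and intensities landing in the same bin are summed.

```python
import math

# The three scans from the example: (time, [(m/z, intensity), ...])
scans = [
    (0.3, [(300.2, 10000), (301.3, 10500), (302.6, 10600), (303.2, 20000)]),
    (3.3, [(300.2, 50000), (301.3, 50500), (302.6, 40600), (303.2, 30000)]),
    (6.2, [(300.2, 20000), (301.3, 20500), (302.6, 30600), (303.2, 33000)]),
]


def to_lmata(scans, mz_start, mz_end):
    """Bin scans into 1.0 m/z unit bins and return lmata text (hypothetical helper)."""
    bins = list(range(mz_start, mz_end + 1))
    lines = [
        str(len(scans)),                         # number of time values
        " ".join(str(t) for t, _ in scans),      # the time values
        str(len(bins)),                          # number of m/z values
        " ".join(str(b) for b in bins),          # the m/z values
    ]
    for _, peaks in scans:
        row = [0] * len(bins)
        for mz, intensity in peaks:
            b = math.floor(mz + 0.5)             # round to the nearest bin
            if mz_start <= b <= mz_end:
                row[b - mz_start] += intensity   # sum peaks sharing a bin
        lines.append(" ".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


lmata_text = to_lmata(scans, 300, 303)
print(lmata_text, end="")
```

Because 302.6 and 303.2 both round to 303, the 303 column holds their sum and the 302 column stays at zero, matching the file shown above.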

More sophisticated methods to create this matrix might use more m/z bins or might interpolate the data along the time or m/z axes (or both simultaneously), but the end result is still a matrix similar in form to the one shown.
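As one possible illustration of interpolating along the time axis, the sketch below resamples a single m/z column onto an evenly spaced time grid using linear interpolation. The function name interpolate_column and the sample numbers are made up for this example, and linear interpolation is an assumption; the converters mentioned in this tutorial may use a different scheme.

```python
def interpolate_column(times, values, new_times):
    """Linearly interpolate values sampled at `times` onto `new_times`.

    Hypothetical helper. Both time lists must be ascending; points outside
    the sampled range take the nearest end value.
    """
    out = []
    j = 0
    for t in new_times:
        if t <= times[0]:
            out.append(values[0])
            continue
        if t >= times[-1]:
            out.append(values[-1])
            continue
        # advance to the interval [times[j], times[j + 1]] containing t
        while times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        frac = (t - t0) / (t1 - t0)
        out.append(values[j] + frac * (values[j + 1] - values[j]))
    return out


# Made-up data: one m/z bin sampled at uneven times
times = [0.0, 2.0, 6.0]
column = [100.0, 300.0, 100.0]
grid = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
resampled = interpolate_column(times, column, grid)
print(resampled)
```

Applying the same function to every m/z column yields a matrix with uniform time spacing.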

Type:

obiwarp --help formats

for more information on acceptable file formats. Conversion programs exist in the obi-warp package to convert the ascii file into binary format (e.g., lmata2lmat) and vice versa.

Convert raw data to mzXML

Currently, converters exist to produce an lmat matrix from mzXML (version 1 and 2.X). Converters also exist for turning various binary data formats into mzXML; search the sashimi software page for one suitable for your data. An executable that creates mzXML version 1 files from .RAW files on Linux is also available. The converter readw.exe, which converts .RAW files to mzXML (version 2) files, is included with the transproteomic pipeline cygwin installation.

Convert .mzXML or .mzData to lmat

The Ruby gem 'mspire' contains a converter from mzXML (version 1 and 2.X) and mzData to lmat (or lmata). With Ruby and RubyGems installed (see INSTALL[link:files/INSTALL.html]), one can download and install the gem with the command:

gem install mspire

Then, the converter can be run:

ms_to_lmat.rb file.mzXML

If the scans in the file do not have 'startMz' or 'endMz' tags (e.g., in many mzXML 2.X files), you may want to indicate the bounds on the matrix (although the script will attempt to determine these on-the-fly):

ms_to_lmat.rb --mz_start 300 --mz_end 1500 file.mzXML

Type 'ms_to_lmat.rb' for more options.

A converter that performs interpolation along the time axis to fill in any missing values is found in the Prt package (written in Perl). If you have an svn client installed, you can check this library out:

svn co http://svn.icmb.utexas.edu/svnrepos/Prt

It contains installation documentation. (Don't be alarmed if roughly half of the tests fail when you run the package's tests on your system.) Then, to convert mzXML (version 1) files to lmata:

mzXML2matrix.pl -l file.mzXML

Type 'mzXML2matrix.pl' for more details.

Using a time file to determine alignment accuracy

For lmat(a) data (i.e., a matrix with time and m/z labels), the alignment accuracy may be checked with externally derived alignment points. For instance, one may use high confidence MS/MS identifications that are shared across two runs (this would typically be the same peptide at the same charge state):

run 1: Peptide (ALEEGYYK) +2 charge (scan 1020 = 1400 seconds)
run 2: Peptide (ALEEGYYK) +2 charge (scan 998 = 1340 seconds)

run 1: Peptide (NFLALARLSGHFLDR) +3 charge (scan 710 = 1100 seconds)
run 2: Peptide (NFLALARLSGHFLDR) +3 charge (scan 640 = 1050 seconds)

run 1: Peptide (QKELTR) +1 charge (scan 1890 = 3020 seconds)
run 2: Peptide (QKELTR) +1 charge (scan 1920 = 3300 seconds)

Determining the elution time of a peptide may be as simple as taking the time at which the MS/MS scan was acquired, or as complicated as working back to the elution time of the peptide peak apex.
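To make the second option a little more concrete, here is a minimal sketch of working back from an MS/MS scan time to a nearby peak apex. It assumes you already have an extracted ion chromatogram for the precursor (a time-sorted list of (time, intensity) pairs); the function name apex_time and the numbers are hypothetical, and real data would usually need smoothing first.

```python
def apex_time(chromatogram, msms_time):
    """Return the time of the local intensity maximum nearest the MS/MS scan.

    Hypothetical helper. `chromatogram` is a list of (time, intensity)
    pairs sorted by time for the precursor's m/z bin.
    """
    # start at the chromatogram point closest in time to the MS/MS scan
    i = min(range(len(chromatogram)),
            key=lambda k: abs(chromatogram[k][0] - msms_time))
    while True:
        best = i
        # step to whichever neighbor is more intense, if any
        for j in (i - 1, i + 1):
            if 0 <= j < len(chromatogram) and chromatogram[j][1] > chromatogram[best][1]:
                best = j
        if best == i:
            return chromatogram[i][0]   # no higher neighbor: this is the apex
        i = best


# Made-up chromatogram; the MS/MS scan fired on the trailing edge of the peak
xic = [(1390.0, 200), (1394.0, 900), (1398.0, 2500), (1402.0, 1800), (1406.0, 400)]
apex = apex_time(xic, 1403.0)
print(apex)
```

In this made-up case the MS/MS scan was taken at 1403 seconds, but the apex time that would go into the time file is 1398 seconds.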

The paired elution times from the identifications above would be written into a time file (for example, 'timefile.txt'), one pair per line:

1400 1340
1100 1050
3020 3300
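A time file like the one above could be assembled from two lists of identifications with a short script. The sketch below is only an illustration: the dictionaries, the function name time_file_lines, and the extra run 1 only peptide are hypothetical, and it assumes each run has already been reduced to one elution time per (peptide, charge) pair.

```python
# Elution times in seconds, keyed by (peptide sequence, charge state)
run1 = {
    ("ALEEGYYK", 2): 1400,
    ("NFLALARLSGHFLDR", 3): 1100,
    ("QKELTR", 1): 3020,
    ("PEPTIDEK", 2): 2000,   # hypothetical hit seen only in run 1
}
run2 = {
    ("ALEEGYYK", 2): 1340,
    ("NFLALARLSGHFLDR", 3): 1050,
    ("QKELTR", 1): 3300,
}


def time_file_lines(run1, run2):
    """Return 'run1time run2time' lines for identifications shared by both runs."""
    shared = sorted(set(run1) & set(run2), key=lambda k: run1[k])
    return ["%s %s" % (run1[k], run2[k]) for k in shared]


lines = time_file_lines(run1, run2)
print("\n".join(lines))

# To save for use with obiwarp -t:
# with open("timefile.txt", "w") as out:
#     out.write("\n".join(lines) + "\n")
```

Only peptides identified at the same charge state in both runs contribute a line; the run 1 only hit is dropped.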

(The order of the lines doesn't matter, but the column order must be consistent on every line: the run 1 time first, then the run 2 time.) Then, the alignment may be run:

obiwarp -t timefile.txt run1.lmat run2.lmat

The output would look something like this:

0.44 5.15 10.69 16.4 ...  # <- new time values for run2
6700.000000 2233.333252 180.000000 60.000000   # <- residuals

Type:

obiwarp --help long     # look under the --timefile option

for more details.