Summary of study ST001269

This data is available at the NIH Common Fund's National Metabolomics Data Repository (NMDR) website, the Metabolomics Workbench, where it has been assigned Project ID PR000854. The data can be accessed directly via its Project DOI: 10.21228/M8998T. This work is supported by NIH grant U2C-DK119886.


Study ID: ST001269
Study Title: Exosomal lipids for classifying early and late stage non-small cell lung cancer
Study Type: Biomarker Discovery
Study Summary: Lung cancer is the leading cause of cancer deaths in the United States. Patients with early stage lung cancer have the best prognosis with surgical removal of the tumor, but the disease is often asymptomatic until advanced disease develops, and there are no effective blood-based screening methods for early detection of lung cancer in at-risk populations. We have explored the lipid profiles of blood plasma exosomes using ultra high-resolution Fourier transform mass spectrometry (UHR-FTMS) for early detection of the prevalent non-small cell lung cancers (NSCLC). Exosomes are nanovehicles released by various cells and tumor tissues to elicit important biofunctions such as immune modulation and tumor development. Plasma exosomal lipid profiles were acquired from 39 normal and 91 NSCLC subjects (44 early stage and 47 late stage). We have applied two multivariate statistical methods, Random Forest (RF) and Least Absolute Shrinkage and Selection Operator (LASSO), to classify the data. For the RF method, the Gini importance of the assigned lipids was calculated to select the 16 lipids with the highest importance. Using the LASSO method, 7 features were selected based on a grouped LASSO penalty. The Area Under the Receiver Operating Characteristic curve for early and late stage cancer versus normal subjects using the selected lipid features was 0.85 and 0.88 for RF and 0.79 and 0.77 for LASSO, respectively. These results show the value of RF and LASSO for metabolomics data-based biomarker development, as both provide robust and independent classifiers with sparse data sets. Application of LASSO and Random Forest identifies lipid features that successfully distinguish early stage lung cancer patients from healthy individuals.
Institute: University of Kentucky
Department: Center for Environmental and Systems Biochemistry
Last Name: Thompson
First Name: Patrick
Address: 789 South Limestone, Lexington, Kentucky, 40536, USA
Submit Date: 2019-10-17
Total Subjects: 95
Raw Data Available: Yes
Raw Data File Type(s): .raw
Analysis Type Detail: LC-MS
Release Date: 2019-10-11
Release Version: 1


Combined analysis:

Analysis ID: AN002109
Analysis type: MS
Chromatography type: None (Direct infusion)
Chromatography system: Thermo Orbitrap Fusion
Column: none
MS instrument type: Orbitrap
MS instrument name: Thermo Fusion Orbitrap
Units: Ion Intensity


MS ID: MS001960
Analysis ID: AN002109
Instrument Name: Thermo Fusion Orbitrap
Instrument Type: Orbitrap
MS Comments: High sample throughput (16 min total cycle time per sample, <7 min for the MS1 portion) was achieved using the nanoelectrospray TriVersa NanoMate (Advion Biosciences, Ithaca, NY, USA) with 1.5 kV electrospray voltage and 0.4 psi head pressure. UHR-FTMS data were acquired from an Orbitrap Fusion Tribrid (Thermo Scientific, San Jose, CA, USA) set at a resolving power of 450,000 (at 200 m/z) for MS1 full scans using 10 microscans per scan in the m/z range of 150 to 1,600, achieving sub-ppm mass accuracy through <1200 m/z in positive mode. The AGC (Automatic Gain Control) target was set to 1e5 and the maximal injection time was set to 100 ms. During the MS1 run, the top 500 most intense monoisotopic precursor ions were isolated via the quadrupole using a 1 m/z isolation window, and HCD (Higher Energy Collisional Dissociation) set at 25% collision energy was performed in positive mode for data-dependent MS2 at a resolving power of 120,000 (at 200 m/z) to obtain fragments for acyl chain assignment and neutral loss of specific head groups. The AGC target was set to 5e4 with a maximal injection time of 500 ms. MS2 does not distinguish the sn1 and sn2 acyl positions of glycerolipids, nor the position of unsaturations in acyl chains and acyl branching. A representative full scan MS spectrum and an example MS2 spectrum are shown in Fig. S2. The UHR-FTMS raw data were assigned by our (CESB) in-house software PREMISE (PRecalculated Exact Mass Isotopologue Search Engine), which compares UHR-FTMS m/z data against our metabolite m/z library (calculated with mass accuracy to the 5th decimal point) to discern all known lipid molecular formulae (MF) and their 13C isotopologues, including hypothetical lipids, while simultaneously taking into account all of the major adducts (here H+, Na+, K+ and NH4+) [50,51].
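The adduct-aware library lookup described above can be illustrated with a minimal sketch. This is not the PREMISE implementation: the function names, the 1 ppm default tolerance, and the single-entry library are assumptions made for illustration only. The adduct mass shifts are the standard values for singly charged positive-mode ions.

```python
# Hypothetical sketch; this is NOT the PREMISE implementation.
# Mass shifts (Da) added to a neutral monoisotopic mass for singly charged
# positive-mode adducts (the lost electron is already accounted for).
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
    "[M+NH4]+": 18.033823,
}


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def match_adducts(observed_mz, library, tol_ppm=1.0):
    """Return (name, adduct, ppm error) for each library entry whose adduct
    m/z lies within tol_ppm of the observed peak.

    library maps a lipid name to its neutral monoisotopic mass (Da).
    tol_ppm is an assumed tolerance, chosen to mirror sub-ppm accuracy.
    """
    hits = []
    for name, neutral_mass in library.items():
        for adduct, shift in ADDUCT_SHIFTS.items():
            theoretical = neutral_mass + shift
            err = ppm_error(observed_mz, theoretical)
            if abs(err) <= tol_ppm:
                hits.append((name, adduct, round(err, 3)))
    return hits


# Illustrative library entry: PC(34:1), C42H82NO8P, neutral mass 759.57781 Da.
library = {"PC(34:1)": 759.57781}
print(match_adducts(760.58509, library))
```

Checking every adduct for every library entry lets several peaks that belong to the same lipid be traced back to one molecular formula, which is what allows multiple adducts to be tracked without counting the lipid more than once.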
An in-house developed natural abundance (NA) correction algorithm [52,53] was applied to simultaneously examine the distribution of naturally occurring 13C isotopologues of the unlabeled lipids to help verify the assigned molecular formulae, and to eliminate non-monoisotopic 13C isotopologues from further analysis. For statistical classification, we used only high accuracy monoisotopic m/z values that mapped to lipid molecular formulae, and multiple adducts of each were tracked throughout to avoid redundancy. Below, such m/z values are referred to as "lipid features", and neither molecular formulae nor lipid names were directly used. The number of assigned lipid features in each sample varied from 1 to 70. After combining all samples into a master file, the data set had a total of 430 such lipid features. Prior to multivariate statistical analyses, MS1 peaks arising from solvent blanks and known contaminants were removed from the lipid feature lists. As absolute intensities vary from sample to sample, the lipid features must be normalized. The intensities of the lipid features in each sample were thus normalized to the summed intensities of all mass peaks that were non-zero in 20%, 50%, 75%, 97%, or 100% of all samples. This is equivalent to estimating the mole fraction of each lipid feature present, and therefore can be used for determining relative changes in composition. We found that normalization using the summed intensities of lipid features that were non-zero in 20% of all samples provided the best statistical outcome according to the ROC analysis.
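The normalization step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name, the dictionary-based data layout, and the reading of "non-zero in 20% of all samples" as "non-zero in at least 20% of all samples" are all assumptions.

```python
def normalize_by_prevalent_features(samples, min_fraction=0.2):
    """Normalize each sample to the summed intensity of prevalent features.

    samples: list of dicts mapping a feature id to its intensity; a missing
    feature is treated as zero.
    min_fraction: assumed threshold; a feature is a reference feature if it
    is non-zero in at least this fraction of all samples.
    Returns (normalized samples, sorted list of reference features).
    """
    n = len(samples)
    features = sorted({f for s in samples for f in s})
    # Count the number of samples in which each feature is non-zero.
    prevalence = {
        f: sum(1 for s in samples if s.get(f, 0.0) > 0) for f in features
    }
    reference = [f for f in features if prevalence[f] / n >= min_fraction]

    normalized = []
    for i, s in enumerate(samples):
        total = sum(s.get(f, 0.0) for f in reference)
        if total == 0:
            raise ValueError(f"sample {i} has no signal in reference features")
        normalized.append({f: s.get(f, 0.0) / total for f in features})
    return normalized, reference


# Made-up intensities for four samples and four features.
samples = [
    {"a": 10.0, "b": 30.0, "c": 60.0},
    {"a": 5.0, "b": 15.0},
    {"a": 2.0, "c": 0.0, "d": 8.0},
    {"a": 1.0, "b": 3.0},
]
normalized, reference = normalize_by_prevalent_features(samples, 0.5)
print(reference)
```

When the threshold is low enough that every feature is a reference feature, the normalized values in each sample sum to 1, which matches the mole-fraction interpretation. Stricter thresholds divide by a smaller, more consistently observed subset, so individual values can exceed 1.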