In nuclear magnetic resonance (NMR)-based metabolomics data analysis, scaling is one of the key preprocessing steps. Its main purpose is to improve the results of subsequent multivariate statistical analysis by adjusting the variance structure of the data. Starting from the perspective of information entropy, the Kullback-Leibler (K-L) divergence is used to measure how strongly the ¹H NMR spectral data of biological samples differ between experimental groups, and it is combined with unit variance scaling to give a K-L divergence-based scaling method. The method first applies unit variance scaling to bring the standard deviation of every variable to the same level, and then uses the K-L divergence to weight each variable in a supervised manner, enhancing important variables and attenuating irrelevant ones. Because the K-L divergence measures the difference between data sets in terms of their probability distributions and applies to both Gaussian and non-Gaussian data, it measures the differences between the ¹H NMR spectral data of samples from different experimental groups more accurately, and therefore identifies and weights the important variables of the spectral data more effectively. Analysis of ¹H NMR spectral data of human urine shows that the K-L divergence-based scaling method effectively suppresses noise variables while clearly separating characteristic variables from non-characteristic ones; that it improves the discriminative ability of the principal component regression (PCR) model; and that it improves the explanatory ability, the predictive ability, and the ability to identify characteristic metabolites of the partial least squares discriminant analysis (PLS-DA) model.
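The two-stage procedure described above (unit variance scaling followed by supervised K-L weighting) can be illustrated with a minimal sketch. The abstract does not state how the per-variable distributions are estimated or how the divergence is turned into a weight, so the choices below are assumptions made only for illustration: two groups labelled 0 and 1, histogram density estimates with a small pseudo-count, a symmetrised divergence D(p‖q) + D(q‖p), and weights normalised so that the largest is 1. The function names `unit_variance_scale`, `symmetric_kl`, and `kl_scale` are hypothetical and do not come from the paper.

```python
import numpy as np


def unit_variance_scale(X):
    """Mean-center each column and divide it by its sample standard deviation."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant columns stay at zero instead of dividing by 0
    return centered / std


def symmetric_kl(a, b, bins=8, pseudo_count=0.5):
    """Symmetrised K-L divergence between histogram estimates of two samples.

    Assumption: densities are estimated on a shared grid of equal-width bins,
    and a pseudo-count keeps every bin probability strictly positive.
    """
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    if hi == lo:  # both samples are the same constant: no difference
        return 0.0
    p, _ = np.histogram(a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(b, bins=bins, range=(lo, hi))
    p = (p + pseudo_count) / (p + pseudo_count).sum()
    q = (q + pseudo_count) / (q + pseudo_count).sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))


def kl_scale(X, y, bins=8):
    """Unit variance scaling followed by per-variable K-L weighting.

    Returns the weighted data matrix and the weight of each variable.
    Assumption: y holds exactly two group labels, 0 and 1.
    """
    y = np.asarray(y)
    Z = unit_variance_scale(X)
    g0, g1 = Z[y == 0], Z[y == 1]
    weights = np.array(
        [symmetric_kl(g0[:, j], g1[:, j], bins=bins) for j in range(Z.shape[1])]
    )
    if weights.max() > 0:
        weights = weights / weights.max()  # illustrative normalisation to [0, 1]
    return Z * weights, weights


if __name__ == "__main__":
    # Synthetic stand-in for binned spectra: only variable 0 differs between groups.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))
    y = np.repeat([0, 1], 30)
    X[y == 1, 0] += 4.0
    _, w = kl_scale(X, y)
    print(np.round(w, 3))
```

In this sketch a variable whose group-wise distributions barely differ receives a weight near zero and is shrunk after unit variance scaling, whereas a variable that separates the groups keeps close to unit variance. This mirrors the stated aim of suppressing noise variables while retaining characteristic ones, but the exact weighting used in the paper may differ.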