{"id":100,"date":"2020-12-15T13:59:27","date_gmt":"2020-12-15T18:59:27","guid":{"rendered":"http:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/?p=100"},"modified":"2020-12-15T14:24:20","modified_gmt":"2020-12-15T19:24:20","slug":"revamping-linear-regression-in-big-data-a-split-and-resample-approach-for-predictive-modeling","status":"publish","type":"post","link":"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/2020\/12\/revamping-linear-regression-in-big-data-a-split-and-resample-approach-for-predictive-modeling\/","title":{"rendered":"Revamping linear regression in \u201cbig data&#8221; &#8212; A split and resample approach for predictive modeling"},"content":{"rendered":"\n<h5 class=\"wp-block-heading\">This blog post was written by <a href=\"https:\/\/www.linkedin.com\/in\/jing-zhang-b6790795\/\">Dr. Jing Zhang<\/a>, adapted from research she presented with <a href=\"https:\/\/www.linkedin.com\/in\/thomas-fisher-0768241b\/\">Dr. Thomas Fisher<\/a> and <a href=\"https:\/\/www.linkedin.com\/in\/qi-kee-he-88a8b2158\/\">Qi He<\/a>, a Miami University master\u2019s student.<\/h5>\n\n\n\n<p class=\"wp-block-paragraph\">Linear\nregression is likely to be the first statistical modeling tool that many of you\nlearn in your coursework, whether you are a data science and statistics\nmajor, math and stat major, analytics co-major or data analytics major. It is\nstill a popular modeling tool because it is easy to implement\nmathematically and computationally, and the model findings are intuitive to\ninterpret. However, it also lacks flexibility because of\nimportant model assumptions that may not hold in real analyses:\nnormality of responses, independence among observations, and a linear relation\nbetween the response and the group of predictors. 
When outliers exist in the data,\nor the predictors are highly correlated with each other, the model findings\nwill be distorted and therefore misleading in decision-making. What\u2019s more, linear regression\nsuffers even more in the \u201cbig-data\u201d era. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A\nreal \u201cBIG\u201d data set would be too big to hold in a single computer\u2019s memory,\nwhile \u201cbig\u201d data simply means that the data set is so large that traditional\nstatistical analysis tools, such as the linear regression model, would be too\ntime consuming to implement and would suffer more from rounding errors in\ncomputation. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Big data sets often have many variables, i.e., a lot of predictors in the linear regression model (large \u201cp\u201d), in addition to a large number of observations (large \u201cn\u201d). Therefore, we need to separate the \u201ctrue\u201d signals from a lot of \u201cnoise\u201d when fitting any statistical model to such a data set; in other words, we often need to conduct \u201cvariable selection.\u201d We wish to speed up model fitting, variable selection and prediction in the analysis of big data sets while still using relatively simple modeling tools, such as linear regression.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Popular\nchoices involve subsampling the big data and focusing on the analysis of the\nsubset of information. For\nexample, the bag of little bootstraps (Kleiner et al. 2014) selects a\nrandom sample of the data and then bootstraps the selected sample to\nrecover the original size of the data in the subsequent analysis, which\neffectively reduces the memory required for data storage and analysis and\nhelps in the \u201cBIG\u201d data case. 
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In addition to random selection,\nresearchers have also suggested using sampling weights based on features of the data\nobservations, such as the leverage values (Ma and Sun, 2014), to retain the\nfeatures of the data set as much as possible when subsampling has to be done. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Alternatively, the \u201cdivide and\nconquer\u201d idea has also been popular:&nbsp; big\ndata are split into multiple non-overlapping blocks of smaller sample size, and\nthe analysis results of each block are then aggregated to obtain the final\nestimated model and predictions (Chen and Xie, 2014). This idea utilizes all the information in the data set.\nIn a CADS-funded project, we explore the revamping of linear regression in big\ndata predictive modeling using a similar \u201cdivide and resample\u201d idea, via a\ntwo-stage algorithm. First, in the variable selection stage, we combine the least absolute shrinkage and\nselection operator (LASSO) (Tibshirani, 1996), which helps select the relevant\nfeatures, with the divide and conquer approach, which helps deal with the large\nsample size; the product of this stage is the set of\nrelevant features selected from the high-dimensional big data. Second, in the\nprediction stage, when the sample size is extremely\nlarge, we subsample the data multiple times, refit the model with the\nselected features to each subsample, and then aggregate the predictions\nacross subsamples. When the data are sizable but the chosen\nmodel can still be fitted at reasonable computing cost, predictions are obtained\ndirectly by refitting the model with the selected features to the complete data. Here\nis a detailed description of the algorithm: <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Step 1. Partition a large data set\nof sample size \u201cn\u201d into multiple blocks of equal sample size. 
The observations are randomly assigned to\nthese blocks.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Step 2. Conduct variable selection\nwith LASSO in each block individually.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Step 3. Use \u201cmajority voting\u201d to\nselect the variables, i.e., if a \u201cmajority\u201d of blocks end up selecting a\npredictor variable in the analysis, this predictor is retained.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Step 4. When \u201cn\u201d is large but can still\nbe handled as one piece,\nrefit the model with the selected variables to the\noriginal data and predict based on the\nparameter estimates. When \u201cn\u201d is too large to be\nhandled as one piece, randomly select multiple subsamples from the original data,\nrefit the model with the selected variables to each subsample, and then aggregate\nthe predictions from the models fitted to the different subsamples (e.g., the mean\nor median of the predicted values).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Simulation\nstudies are often used to evaluate the performance of statistical models,\nmethods and algorithms. In a simulation study, we simulate\n\u201cplausible\u201d data from a known probabilistic framework with pre-specified values of the\nmodel parameters. The proposed statistical analysis method is then\napplied to the simulated data, and the analysis findings, such as the\nestimated parameters and the predicted responses of a holdout set, are compared\nwith the true values known from the simulation. When many such plausible data\nsets are simulated and analyzed, we are able to empirically evaluate the\nperformance of the proposed methods through the discrepancy between the model\nfindings and the true values. 
In this project, we conducted a simulation study to\nhelp make important decisions on key components of the proposed algorithm,\nincluding the number of blocks into which we divide the complete data and how to\ndefine \u201cmajority\u201d in the variable selection step. This simulation study was\nalso designed so that we could evaluate the impact of multicollinearity\n(highly correlated predictors) and effect size (strength of the linear relation\nbetween response and predictors) on the performance of the proposed algorithm\nin variable selection and prediction. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We\nsimulated responses for a total of 11,000 independent observations from a normal\npopulation with a common variance of 2, of which 10,000 observations are\nused as the training set and the remaining 1,000 observations as the\ntest set. For each of the 11,000 observations, a vector of 500 predictors\nis simulated from a normal population with mean 0 and covariance matrix \u03a3. Two\ndifferent setups of \u03a3 were used: the first is the\nidentity matrix, representing a \u201cperfect\u201d scenario with complete independence\namong the predictors; the second is a matrix whose entry in row i and column j is 0.9<sup>|i-j|<\/sup>,\nmimicking a \u201cpractical\u201d scenario where nearby predictors are highly\ncorrelated (AR(0.9) correlation structure). Among the 500 predictors, 100 are\nrandomly chosen to be the \u201ctrue\u201d signals, and three different sets of true\nregression coefficients (i.e., the effect sizes) are chosen as follows to\nevaluate the ability of our proposed two-stage algorithm to capture\nthe \u201ctrue\u201d signals when the signals are stronger (i.e., larger effect size) vs.\nweaker (i.e., smaller effect size). 
The remaining 400 predictors are all\nassociated with zero regression coefficients in the \u201ctrue\u201d model that generates\nthe data.<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>all 100 regression\ncoefficients are equal to 2;<\/li><li>all 100 regression\ncoefficients are equal to 0.75;<\/li><li>50% of the 100 regression coefficients are\nrandomly chosen to be 0.75 and the rest are set to 2.<\/li><\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s summarize what we wish to learn from this\nsimulation study: <\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>How many blocks should we divide the 10,000 training set observations\ninto? (4 blocks of 2,500 observations each? 5 blocks? 8, 10, 16, 20 or 25 blocks?)<\/li><li>What\nis a good threshold for \u201cmajority voting\u201d? (Should a variable be selected when 50% of the blocks select it? 60%? \u2026 100%?)<\/li><li>How does the effect size impact variable selection and prediction?<\/li><li>How does multicollinearity impact variable selection and prediction?<\/li><\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">To\ncompare the performance of the proposed method under different setups, we\nevaluated the following quantities:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Sensitivity (proportion of the \u201ctrue signals\u201d picked up by the\nmethod, i.e. 
how many out of the 100 predictors impacting the mean of the response\nare selected) and specificity (proportion of the 400 \u201cnon-signal\u201d predictors\nnot selected by the method) of variable selection<\/li><li>Mean squared prediction error (MSPE) on the test set.<\/li><\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Here are four graphs\nthat help us visualize the simulation study findings on the determination of\n\u201cmajority\u201d when different numbers of blocks are used.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"273\" src=\"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/files\/2020\/12\/Jing-Figure-1.png\" alt=\"\" class=\"wp-image-104\" srcset=\"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/files\/2020\/12\/Jing-Figure-1.png 640w, https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/files\/2020\/12\/Jing-Figure-1-300x128.png 300w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Figure 1. Strong signal\nscenario (regression coefficients are all equal to 2): sensitivity and\nspecificity of the two-stage algorithm when the training set is divided into 4, 5, 8, 10, 16, 20 or 25 blocks, and a variable is\nselected when a given number of those blocks select it. Both the case with complete\nindependence among predictors (left panel) and the case with correlated predictors (right\npanel) are presented. 
<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"657\" height=\"277\" src=\"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/files\/2020\/12\/Jing-Figure-2.png\" alt=\"\" class=\"wp-image-103\" srcset=\"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/files\/2020\/12\/Jing-Figure-2.png 657w, https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/files\/2020\/12\/Jing-Figure-2-300x126.png 300w\" sizes=\"auto, (max-width: 657px) 100vw, 657px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Figure 2. Weak signal\nscenario (regression coefficients are all equal to 0.75, left panel) and mixed\nsignal scenario (regression coefficients are 0.75 or 2, right panel) with\ncorrelated predictors: sensitivity and specificity of the two-stage algorithm\nwhen the training set is divided into 4, 5, 8, 10, 16, 20 or 25 blocks, and a variable is\nselected when a given number of those blocks select it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Clearly, when\nparallel computing is possible, it is computationally cheaper to divide the\ndata into more blocks, as there are fewer observations in each block and the\noverall computation speed is limited by the cost of fitting the model to\na single block. Our simulation study did not use a real \u201cBIG\u201d sample size, so the\ncomputing time for each block is not high even when we use as few as 4 blocks.\nIn practice, however, we may have giant training data and need to\nlower the computing cost of fitting a linear regression to a single block,\nbecause the blocks themselves may still have large samples. 
All the block counts we\nconsidered here seem to approach the same level of sensitivity and specificity\nin variable selection, and it seems that requiring 50% or more of the blocks to agree on the\nselection of a variable is a good threshold for \u201cmajority voting.\u201d <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Four approaches are\ncompared for the case of using 25 blocks on the data with independent predictors\nwhen the effect sizes are 2 (strong signal scenario).<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>OLS:\nfit OLS with all 500 predictors to the original data<\/li><li>Split-resample:\nfit OLS with the variables selected by the proposed approach to the original data<\/li><li>True p: fit OLS with the 100 \u201ctrue signals\u201d to the original data<\/li><li>Split-resample: predict with the aggregated estimation function in Chen and Xie (2014)<\/li><\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">The comparison of\nprediction performance is visualized in the following figure: <\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"392\" height=\"466\" src=\"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/files\/2020\/12\/Jing-Figure-3.png\" alt=\"\" class=\"wp-image-102\" srcset=\"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/files\/2020\/12\/Jing-Figure-3.png 392w, https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/files\/2020\/12\/Jing-Figure-3-252x300.png 252w\" sizes=\"auto, (max-width: 392px) 100vw, 392px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Figure 3. MSPE computed\nby refitting the model with the selected variables to the original data and predicting\nfor the holdout test set. 
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The aggregated\npredictions result in a much higher MSPE than the other three approaches,\nso they are shown in the lower panel of Figure 3, while the other three approaches\nare shown in the upper panel. When majority voting is used with a threshold of\n50% or more of the blocks for selecting predictors, the proposed approach produces an MSPE\nclose to that of the best-case scenario, in which we know exactly which subset of the\npredictors are \u201ctrue signals\u201d in the simulated data. It seems that the proposed algorithm works well in variable\nselection whether the predictors are independent or correlated, and also predicts\nwell when 50% or more of the blocks is the threshold for selecting\npredictors. The smaller the effect size, the more blocks need to agree on\nvariable selection in order to achieve the same sensitivity and specificity,\nbut this threshold of 50% or higher seems to work for both the strong and weak signal\ncases in general. <\/p>\n\n\n\n<h4 class=\"wp-block-heading\">About the Author<\/h4>\n\n\n\n<figure class=\"wp-block-image is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/files\/2020\/12\/Jing.jpg\" alt=\"\" class=\"wp-image-105\" width=\"229\" height=\"229\" srcset=\"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/files\/2020\/12\/Jing.jpg 500w, https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/files\/2020\/12\/Jing-150x150.jpg 150w, https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/files\/2020\/12\/Jing-300x300.jpg 300w, https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/files\/2020\/12\/Jing-144x144.jpg 144w\" sizes=\"auto, (max-width: 229px) 100vw, 229px\" \/><figcaption><br><strong>Dr. 
Jing Zhang is an Associate Professor in the Department of Statistics at Miami University.<\/strong> <\/figcaption><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>This blog post has been written by Dr. Jing Zhang, adopted from research she presented with Dr. Thomas Fisher and Qi He, Miami University masters student. Linear regression is likely to be the first statistical modeling tool that many of you learn in your coursework, no matter you are a data science and statistics major, [&hellip;]<\/p>\n","protected":false},"author":3098,"featured_media":101,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"_s2mail":"","footnotes":""},"categories":[2],"tags":[],"class_list":["post-100","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-faculty-research"],"_links":{"self":[{"href":"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/wp-json\/wp\/v2\/posts\/100","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/wp-json\/wp\/v2\/users\/3098"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/wp-json\/wp\/v2\/comments?post=100"}],"version-history":[{"count":0,"href":"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/wp-json\/wp\/v2\/posts\/100\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"http
s:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/wp-json\/wp\/v2\/media\/101"}],"wp:attachment":[{"href":"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/wp-json\/wp\/v2\/media?parent=100"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/wp-json\/wp\/v2\/categories?post=100"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sites.miamioh.edu\/the-center-for-analytics-and-data-science\/wp-json\/wp\/v2\/tags?post=100"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}