PCA Fit Once Before Cross-Validation
A notebook computes PCA on the full feature matrix and then feeds the resulting components into every cross-validation fold. Why is that not a harmless speed optimization?
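One way to see the problem: when PCA is fit on the full matrix, the centering mean and the component directions are estimated partly from rows that later act as validation data, so the folds no longer score the same pipeline that would run on unseen data. The sketch below refits PCA inside each fold using training rows only. It is a minimal illustration, not the notebook's code: the helper names (`fit_pca`, `transform`, `kfold_indices`, `pca_per_fold`), the synthetic data, and the choice of 2 components and 5 folds are all assumptions made for the example.

```python
import numpy as np

def fit_pca(X, n_components):
    """Estimate the centering mean and principal directions from the rows of X only."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def transform(X, mean, components):
    """Project X with a mean and components that were fit elsewhere."""
    return (X - mean) @ components.T

def kfold_indices(n_rows, n_folds, seed=0):
    """Shuffle row indices and split them into n_folds validation blocks."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_rows), n_folds)

def pca_per_fold(X, n_components=2, n_folds=5, seed=0):
    """Return one (Z_train, Z_val, train_idx, val_idx) tuple per fold.

    PCA is refit on each training split, so validation rows never
    influence the mean or the components used to project them.
    """
    folds = kfold_indices(len(X), n_folds, seed)
    out = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        mean, comps = fit_pca(X[train_idx], n_components)
        out.append((
            transform(X[train_idx], mean, comps),
            transform(X[val_idx], mean, comps),
            train_idx,
            val_idx,
        ))
    return out

# Synthetic feature matrix (assumption: 100 rows, 6 features).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 6))
results = pca_per_fold(X)
```

The leaky variant would call `fit_pca(X, 2)` once before the loop and reuse that mean and those components in every fold. It is faster, but each validation score then depends on a transform that has already seen the validation features.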
There are 12 issuers and each issuer contributes 5 observations. In 3-fold grouped cross-validation, one fold holds out 4 issuers at a time. How many observations are used for training in each fold?
A 5-fold cross-validation comparison records four paired score differences (model A minus model B): [0.02, 0.01, -0.01, 0.03]. The desk report says the overall mean fold difference across all 5 folds was 0.01. What was the missing fifth-fold difference?
A 5-fold cross-validation comparison records four paired score differences (model A minus model B): [0.05, 0.02, 0.04, -0.01]. The desk report says the overall mean fold difference across all 5 folds was 0.026. What was the missing fifth-fold difference?
A 5-fold cross-validation comparison records four paired score differences (model A minus model B): [-0.02, 0.01, 0.0, -0.01]. The desk report says the overall mean fold difference across all 5 folds was 0.002. What was the missing fifth-fold difference?
A 5-fold cross-validation comparison records four paired score differences (model A minus model B): [0.01, 0.01, 0.02, 0.0]. The desk report says the overall mean fold difference across all 5 folds was 0.014. What was the missing fifth-fold difference?
A 5-fold cross-validation comparison records four paired score differences (model A minus model B): [0.04, -0.02, 0.01, 0.02]. The desk report says the overall mean fold difference across all 5 folds was 0.01. What was the missing fifth-fold difference?
Why can class-stratified cross-validation still fail badly when the same issuer appears many times and issuer identity carries predictive information?
Why is row-wise cross-validation inappropriate when each entity appears many times and the model can recognize entity-specific signatures?
Why can random k-fold cross-validation be invalid when each feature vector uses a rolling 20-day history from a time series?
Why can ordinary random row cross-validation severely overstate performance when each label depends on the next 5 trading days and adjacent rows overlap in those horizons?
An expanding walk-forward starts with 12 months of training and then advances by 6 months for each of 5 complete test folds. What is the average training-window length used across the 5 folds?
Suppose 50 genuinely null standardized t-statistics are approximately independent N(0,1). What is the probability the largest of them exceeds 2.4?
A team trains one model, plots test loss by boosting round, and reports the round with the best test value. Why is the final test score no longer a valid final check?
A walk-forward backtest produces 7 complete folds, and the research protocol inserts a 3-day embargo between each training block and its following test block. How many calendar days are lost to embargo across the whole run?
A test block has 25 trading days. A signal generated on day t is executed on day t+1 and evaluated on the open-to-close return from day t+1 through day t+4. How many signals inside the block can be scored without the label running past the block end?
An expanding-window walk-forward starts with 18 months of training, then uses a 1-month embargo and a 4-month test block, advancing by 4 months each round across 59 months of history. What is the training-window length in the last complete fold?
A research platform runs 200 null strategies. Only strategies with in-sample p-value below 15% are promoted, and each promoted strategy must then pass a fresh 5% confirmation test. Assuming independence under the null, what is the expected number of false strategies that survive
In R repeats of ordinary k-fold CV, each point appears in exactly one validation fold per repeat. Derive the number of validation appearances of one point across all repeats.
A desk tries 80 genuinely null strategy ideas. A strategy is kept only if it passes an in-sample screen at 10% and then a fresh out-of-sample confirmation at 5%, with the two tests treated as independent under the null. What is the probability at least one null idea survives both
A researcher generates 240 heavily correlated strategy variants but argues they amount to only 24 effectively independent families. If the desk still flags any family with p-value below 8%, what is the approximate probability of at least one false family-level winner under the nu
A team ranks 5,000 candidate features by correlation with the target on the full dataset, keeps the top 30, and only then creates train and test. Why is the later split not enough to rescue the experiment?
A researcher joins fundamentals after they were restated months later, then backtests on the original trade dates. Why is this a split-discipline failure even if no test labels were touched?
Why does entity overlap across train and test typically make confidence intervals and model-stability assessments look better than they really are?
A training set has 100 labels with 30 positives. A class-weighting routine is mistakenly fit on all 125 labels and reports an overall positive rate of 0.36. What is the positive rate in the 25 held-out labels?
A category appears 40 times in train with 18 positives and 10 times in validation. A target encoder is incorrectly fit on train plus validation and outputs 0.56 for that category. How many validation positives did the encoder implicitly use?
A nested CV uses 7 outer folds and selects exactly one hyperparameter setting inside each outer fold. What is the maximum possible number of distinct winning hyperparameter settings across outer folds?
You have 60 months of data. Each expanding-window fold uses 24 months for training, the next 6 months for validation, and then advances by 6 months. How many validation folds fit?
A single tree has variance 6, while an extremely large forest appears to level off at variance 1.8. What pairwise tree correlation rho is implied?
For one issuer, the three training rows sum to 12. A pipeline mistakenly demeans by the full-sample issuer mean 3.6 computed from five rows total. What is the sum of the two held-out rows for that issuer?