Title: | Feature Selection in Highly Correlated Spaces |
---|---|
Description: | Feature selection algorithm that extracts features in highly correlated spaces. The extracted features are meant to be fed into simple explainable models such as linear or logistic regressions. The package is useful in the field of explainable modelling as a way to understand variable behavior. |
Authors: | Allen Sunny [aut, cre] |
Maintainer: | Allen Sunny <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.1 |
Built: | 2025-02-25 06:23:11 UTC |
Source: | https://github.com/allen-1242/tangledfeatures |
Automatic Data Cleaning
DataCleaning(Data, Y_var)
Data | The imported data frame. |
Y_var | The dependent variable. |
Returns the cleaned data.
DataCleaning(Data = TangledFeatures::Housing_Prices_dataset, Y_var = 'SalePrice')
Generalized Correlation function
GeneralCor(df, cor1 = "pearson", cor2 = "polychoric", cor3 = "spearman")
df | The imported data frame. |
cor1 | The correlation metric between two continuous features. Defaults to "pearson". |
cor2 | The correlation metric between one categorical feature and one continuous feature. Defaults to "polychoric". |
cor3 | The correlation metric between two categorical features. Defaults to "spearman". |
Returns a matrix of the pairwise correlation values between the features.
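To make the idea of a mixed-type correlation matrix concrete, here is a minimal base-R sketch. It is not the package's implementation: the simulated data frame and the helpers `cramers_v` and `pair_assoc` are hypothetical. For mixed continuous/categorical pairs it substitutes the correlation ratio (the square root of R-squared from a one-way linear model) in place of the polychoric option, and it uses Cramer's V for categorical pairs.

```r
# Illustrative only: a base-R approximation, not the TangledFeatures code.
set.seed(1)
df <- data.frame(
  x1 = rnorm(50),
  g1 = factor(sample(c("a", "b"), 50, replace = TRUE)),
  g2 = factor(sample(c("u", "v", "w"), 50, replace = TRUE))
)
df$x2 <- df$x1 + rnorm(50, sd = 0.1)  # strongly correlated with x1

# Cramer's V for two categorical vectors (hypothetical helper).
cramers_v <- function(a, b) {
  tab <- table(a, b)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  as.numeric(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# Choose a measure from the types of the two columns (hypothetical helper).
pair_assoc <- function(a, b) {
  if (is.numeric(a) && is.numeric(b)) return(cor(a, b, method = "pearson"))
  if (is.factor(a) && is.factor(b)) return(cramers_v(a, b))
  # Mixed pair: correlation ratio from a one-way linear model.
  num <- if (is.numeric(a)) a else b
  fac <- if (is.factor(a)) a else b
  sqrt(summary(lm(num ~ fac))$r.squared)
}

vars <- names(df)
assoc <- outer(
  seq_along(vars), seq_along(vars),
  Vectorize(function(i, j) pair_assoc(df[[i]], df[[j]]))
)
dimnames(assoc) <- list(vars, vars)
round(assoc, 2)
```

The result is a square, symmetric matrix with ones on the diagonal, which is the shape one would expect a general correlation function to return.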
GeneralCor(df = TangledFeatures::Advertisement)
The main TangledFeatures function
TangledFeatures(
  Data,
  Y_var,
  Focus_variables = list(),
  corr_cutoff = 0.85,
  RF_coverage = 0.95,
  plot = FALSE,
  fast_calculation = FALSE,
  cor1 = "pearson",
  cor2 = "polychoric",
  cor3 = "spearman"
)
Data | The imported data frame. |
Y_var | The dependent variable. |
Focus_variables | The list of variables to which you wish to give a bias in the correlation matrix. Defaults to an empty list. |
corr_cutoff | The correlation cutoff. Defaults to 0.85. |
RF_coverage | The Random Forest coverage threshold. Defaults to 0.95 (95 percent). |
plot | Whether plotting is to be done. Logical, TRUE or FALSE. Defaults to FALSE. |
fast_calculation | If TRUE, returns the variable list without many Random Forest iterations by simply picking a variable from each correlated group. Defaults to FALSE. |
cor1 | The correlation metric between two continuous features. Defaults to "pearson". |
cor2 | The correlation metric between one categorical feature and one continuous feature. Defaults to "polychoric". |
cor3 | The correlation metric between two categorical features. Defaults to "spearman". |
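The interaction between `corr_cutoff` and `fast_calculation` can be sketched in base R. This is only an approximation of the idea described above, not the package's own algorithm: the simulated data, the use of single-linkage clustering to form groups, and the rule of keeping the first variable per group are all assumptions made for illustration.

```r
# Illustrative only: grouping variables by a correlation cutoff in base R.
set.seed(42)
n <- 200
a <- rnorm(n)
b <- rnorm(n)
X <- data.frame(
  a1 = a,
  a2 = a + rnorm(n, sd = 0.05),
  a3 = a + rnorm(n, sd = 0.05),
  b1 = b,
  b2 = b + rnorm(n, sd = 0.05),
  c1 = rnorm(n)
)

corr_cutoff <- 0.85
dist_mat <- as.dist(1 - abs(cor(X)))

# Single-linkage clusters cut at 1 - corr_cutoff are the connected components
# of the graph that links two variables when |correlation| >= corr_cutoff.
groups <- cutree(hclust(dist_mat, method = "single"), h = 1 - corr_cutoff)

# Keep the first variable of each group as its representative.
selected <- names(groups)[!duplicated(groups)]
selected
```

Here the three `a` variables collapse into one group and the two `b` variables into another, so one representative per group plus the uncorrelated `c1` remain. A Random Forest based choice, as used when `fast_calculation = FALSE`, would replace the "first variable" rule with an importance-driven one.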
Returns a list of variables ready for further modelling, along with other metrics.
TangledFeatures(Data = TangledFeatures::Advertisement, Y_var = 'Sales')
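Since the selected features are meant to feed simple explainable models, a short sketch of that last step may help. The structure of the object returned by `TangledFeatures()` is not documented here, so `selected_vars` below is a hypothetical character vector standing in for the returned variable list, and the built-in `mtcars` data set is used purely for illustration.

```r
# Illustrative only: `selected_vars` stands in for a returned variable list.
selected_vars <- c("wt", "hp")  # hypothetical selection

# Build the model formula from the variable names and fit a linear regression.
model_formula <- reformulate(selected_vars, response = "mpg")
fit <- lm(model_formula, data = mtcars)
coef(fit)
```

For a binary dependent variable, the same formula could be passed to `glm()` with `family = binomial` to fit a logistic regression instead.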