I have several data mining tasks. They are worded so that you need the textbook to solve them. I pay well and regularly for quality work.
Example 1:
1-A)
The assignment is to prepare the datasets in the format required by Weka. You need to convert the files into the ARFF format described in section 2.4 of the textbook. Prepare two sets of fit and test files. One set will be used to build and evaluate prediction models, and the other to build and evaluate classification models:
SET 1: [For Prediction]
fit.arff file to build the prediction models. You only need to reformat the original fit file to the ARFF format without any changes and add the required labels.
test.arff file to evaluate the prediction models. Same as above.
SET 2: [For Classification]
fit.arff file to build the classification models. You need to add a column describing the class of each module: fault-prone (fp) or not fault-prone (nfp). Fault proneness is based on a threshold on the number of faults. In this assignment, modules with fewer than 2 faults are considered nfp, and modules with 2 or more faults are considered fp. Make sure you do not use the number of faults column as an independent variable when doing classification.
test.arff file to evaluate the classification models. Same as above.
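The fp/nfp rule above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and dropping the FAULTS value from the row is one possible way to keep it out of the independent variables (it could also be removed later inside Weka).

```python
FAULT_THRESHOLD = 2  # from the assignment: 2 or more faults means fault-prone


def fault_class(faults):
    """Return 'fp' for 2 or more faults, 'nfp' otherwise."""
    return "fp" if faults >= FAULT_THRESHOLD else "nfp"


def to_classification_row(row):
    """Replace the trailing FAULTS value with its class label.

    `row` is assumed to hold the 9 values of one module in the
    original column order, with FAULTS last.
    """
    *metrics, faults = row
    return metrics + [fault_class(faults)]
```

For example, a module with 3 faults gets the label fp, and a module with 1 fault gets nfp.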
Please make sure you label the data correctly and comment the ARFF file (instances, attributes, date, author, etc.).
The original data file has 9 columns. Following is the description of what each column represents (in the same order):
Number of unique operators (NUMUORS)
Number of unique operands (NUMUANDS)
Total number of operators (TOTOTORS)
Total number of operands (TOTOPANDS)
McCabe's cyclomatic complexity (VG)
Number of logical operators (NLOGIC)
Lines of code (LOC)
Executable lines of code (ELOC)
Number of faults (FAULTS)
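A minimal sketch of the conversion for the prediction set, written in Python. It assumes the original file holds one module per line with 9 whitespace-separated values (the actual delimiter is not stated in the task), and the function names and the relation name are hypothetical. The ARFF elements used (`%` comments, `@relation`, `@attribute ... numeric`, `@data`) are the standard ones.

```python
ATTRIBUTES = [
    "NUMUORS", "NUMUANDS", "TOTOTORS", "TOTOPANDS",
    "VG", "NLOGIC", "LOC", "ELOC", "FAULTS",
]


def parse_rows(text):
    """Split the raw file content into rows of 9 string tokens.

    Assumption: values are separated by whitespace; blank lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if len(tokens) != len(ATTRIBUTES):
            raise ValueError(f"expected {len(ATTRIBUTES)} values, got {len(tokens)}")
        rows.append(tokens)
    return rows


def to_arff(rows, relation, comments=()):
    """Build the ARFF text: comment lines, header, then the data rows."""
    lines = [f"% {c}" for c in comments]
    lines.append(f"@relation {relation}")
    lines.append("")
    lines += [f"@attribute {name} numeric" for name in ATTRIBUTES]
    lines.append("")
    lines.append("@data")
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"
```

For the classification set, the FAULTS attribute line would be replaced by a nominal one such as `@attribute CLASS {fp,nfp}`, with the data rows changed accordingly.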
1-B)
This part of the project will build models to predict the number of faults based on the other attributes of the instances. Each model is to be first built and evaluated using 10-fold cross-validation on the fit data set, and then validated using the test data set. Use the data sets prepared for prediction in the previous part of the project.
Build the following prediction models:
Linear Regression
Decision Stump
For linear regression, compare the attribute selection methods: greedy, M5, and no selection. Compare the resulting models: how many independent variables were selected, and which ones? Use the statistical indicators provided by Weka to perform the comparisons.
Your report should include all the results based on 10-fold cross-validation and on the test data set. You should also compare the results of all the methods.
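For reference, three of the indicators Weka prints for numeric prediction (correlation coefficient, mean absolute error, root mean squared error) can be computed as in the sketch below. This is an illustrative Python version with a hypothetical function name, meant for cross-checking values in the report, not a replacement for Weka's own evaluation.

```python
import math


def regression_metrics(actual, predicted):
    """Return correlation, MAE and RMSE for paired actual/predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n

    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))

    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    # Correlation is undefined when either series is constant; report 0.0 here.
    corr = cov / math.sqrt(var_a * var_p) if var_a > 0 and var_p > 0 else 0.0

    return {
        "correlation": corr,
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
    }
```

Computing the same indicators for both the cross-validation predictions and the test set predictions makes it possible to put all the models in one comparison table.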
Textbook:
drive.google.com/open?id=...