Classification with the ID3 method.
This algorithm requires that the input set contain only nominal data. To achieve this, we select filter->unsupervised->attribute->RemoveType and remove all attribute types except nominal. The application of the RemoveType filter is shown in Figure 2.
Figure 2
After RemoveType is applied, only nominal data remain in the set, and the algorithm continues to work with them. The algorithm also requires that there be no missing values. To ensure this, the ReplaceMissingValues filter is applied, which replaces missing values with means (for nominal attributes, with the most frequent value). An example of the program output is given in Listing 3.
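The missing-value step described above can be illustrated with a minimal sketch. It is not Weka's implementation: the function name, the toy rows and the use of "?" as the missing-value marker are assumptions made for illustration, and the sketch assumes every column has at least one observed value.

```python
from collections import Counter

def replace_missing_with_mode(rows, missing="?"):
    """Return a copy of rows with each missing cell replaced by its column's mode."""
    n_cols = len(rows[0])
    modes = []
    for col in range(n_cols):
        counts = Counter(row[col] for row in rows if row[col] != missing)
        # most_common(1) yields the most frequent observed value of the column
        modes.append(counts.most_common(1)[0][0])
    return [[modes[c] if v == missing else v for c, v in enumerate(row)]
            for row in rows]

# Hypothetical toy rows (workclass, education); not taken from laba43.
rows = [
    ["Private", "Bachelors"],
    ["?", "HS-grad"],
    ["Private", "?"],
    ["State-gov", "HS-grad"],
]
cleaned = replace_missing_with_mode(rows)
print(cleaned)
```

For nominal attributes the mode is the natural replacement, since a mean is not defined for category labels.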
In our case the instances are classified by the income variable, and the resulting tree is very heavily branched and not very accurate. The tree's mean error also increases, and 11.75 % of the source instances remain unclassified.
Listing 3
=== Run information ===
Scheme: weka.classifiers.trees.Id3
Relation: laba43-weka.filters.unsupervised.attribute.RemoveType-Tnumeric-weka.filters.unsupervised.attribute.ReplaceMissingValues
Instances: 400
Attributes: 9
workclass
education
marital-status
occupation
relationship
race
sex
native-country
income
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

Id3

education = Bachelors
| marital-status = Married-civ-spouse
| | occupation = Tech-support: >50K
| | occupation = Craft-repair
| | | relationship = Wife
| | | | race = White: >50K
| | | | race = Asian-Pac-Islander: <=50K
| | | | race = Amer-Indian-Eskimo: null
| | | | race = Other: null
| | | | race = Black: null
| | | relationship = Own-child: >50K
| | | relationship = Husband: <=50K
| | | relationship = Not-in-family: null
| | | relationship = Other-relative: null
| | | relationship = Unmarried: null
| | occupation = Other-service: null
| | occupation = Sales: >50K
| | occupation = Exec-managerial
| | | workclass = Private
| | | | relationship = Wife: <=50K
| | | | relationship = Own-child: null
| | | | relationship = Husband: >50K
| | | | relationship = Not-in-family: null
| | | | relationship = Other-relative: null
| | | | relationship = Unmarried: null
| | | workclass = Self-emp-not-inc: null
| | | workclass = Self-emp-inc: null
| | | workclass = Federal-gov: >50K
| | | workclass = Local-gov: >50K
| | | workclass = State-gov: >50K
| | | workclass = Without-pay: null
| | | workclass = Never-worked: null
| | occupation = Prof-specialty
| | | workclass = Private: >50K
| | | workclass = Self-emp-not-inc: null
| | | workclass = Self-emp-inc: null
| | | workclass = Federal-gov: null
| | | workclass = Local-gov: <=50K
| | | workclass = State-gov: null
| | | workclass = Without-pay: null
| | | workclass = Never-worked: null
| | occupation = Handlers-cleaners: null
| | occupation = Machine-op-inspct: null
| | occupation = Adm-clerical: >50K
| | occupation = Farming-fishing: >50K
| | occupation = Transport-moving: null
| | occupation = Priv-house-serv: null
| | occupation = Protective-serv: <=50K
| | occupation = Armed-Forces: null
| marital-status = Divorced: <=50K
| marital-status = Never-married
| | occupation = Tech-support: <=50K
| | occupation = Craft-repair: >50K
| | occupation = Other-service: <=50K
| | occupation = Sales: <=50K
| | occupation = Exec-managerial
| | | relationship = Wife: null
| | | relationship = Own-child
| | | | workclass = Private: >50K
| | | | workclass = Self-emp-not-inc: null
| | | | workclass = Self-emp-inc: <=50K
| | | | workclass = Federal-gov: null
| | | | workclass = Local-gov: null
| | | | workclass = State-gov: null
| | | | workclass = Without-pay: null
| | | | workclass = Never-worked: null
| | | relationship = Husband: null
| | | relationship = Not-in-family: <=50K
| | | relationship = Other-relative: null
| | | relationship = Unmarried: null
| | occupation = Prof-specialty
| | | sex = Female: <=50K
| | | sex = Male
| | | | relationship = Wife: null
| | | | relationship = Own-child: <=50K
| | | | relationship = Husband: null
| | | | relationship = Not-in-family: >50K
| | | | relationship = Other-relative: <=50K
| | | | relationship = Unmarried: null
| | occupation = Handlers-cleaners: null
| | occupation = Machine-op-inspct: null
| | occupation = Adm-clerical: <=50K
| | occupation = Farming-fishing: null
| | occupation = Transport-moving: null
| | occupation = Priv-house-serv: null
| | occupation = Protective-serv: <=50K
| | occupation = Armed-Forces: null
| marital-status = Separated: <=50K
| marital-status = Widowed: <=50K
| marital-status = Married-spouse-absent: null
| marital-status = Married-AF-spouse: null
education = Some-college
| relationship = Wife
| | occupation = Tech-support: null
| | occupation = Craft-repair: <=50K
| | occupation = Other-service: <=50K
| | occupation = Sales: null
| | occupation = Exec-managerial: null
| | occupation = Prof-specialty: null
| | occupation = Handlers-cleaners: null
| | occupation = Machine-op-inspct: null
| | occupation = Adm-clerical: >50K
| | occupation = Farming-fishing: null
| | occupation = Transport-moving: null
| | occupation = Priv-house-serv: null
| | occupation = Protective-serv: null
| | occupation = Armed-Forces: null
| relationship = Own-child: <=50K
| relationship = Husband
| | occupation = Tech-support: <=50K
| | occupation = Craft-repair: <=50K
| | occupation = Other-service: >50K
| | occupation = Sales
| | | race = White: <=50K
| | | race = Asian-Pac-Islander: >50K
| | | race = Amer-Indian-Eskimo: <=50K
| | | race = Other: null
| | | race = Black: null
| | occupation = Exec-managerial
| | | workclass = Private: null
| | | workclass = Self-emp-not-inc: >50K
| | | workclass = Self-emp-inc: >50K
| | | workclass = Federal-gov: null
| | | workclass = Local-gov: >50K
| | | workclass = State-gov: <=50K
| | | workclass = Without-pay: null
| | | workclass = Never-worked: null
| | occupation = Prof-specialty: <=50K
| | occupation = Handlers-cleaners: <=50K
| | occupation = Machine-op-inspct: <=50K
| | occupation = Adm-clerical: <=50K
| | occupation = Farming-fishing: null
| | occupation = Transport-moving: >50K
| | occupation = Priv-house-serv: null
| | occupation = Protective-serv
| | | workclass = Private: null
| | | workclass = Self-emp-not-inc: null
| | | workclass = Self-emp-inc: null
| | | workclass = Federal-gov: null
| | | workclass = Local-gov: <=50K
| | | workclass = State-gov: >50K
| | | workclass = Without-pay: null
| | | workclass = Never-worked: null
| | occupation = Armed-Forces: null
| relationship = Not-in-family: <=50K
| relationship = Other-relative: <=50K
| relationship = Unmarried: <=50K
education = 11th
| occupation = Tech-support: null
| occupation = Craft-repair: <=50K
| occupation = Other-service: null
| occupation = Sales: null
| occupation = Exec-managerial
| | marital-status = Married-civ-spouse: >50K
| | marital-status = Divorced: <=50K
| | marital-status = Never-married: null
| | marital-status = Separated: null
| | marital-status = Widowed: null
| | marital-status = Married-spouse-absent: null
| | marital-status = Married-AF-spouse: null
| occupation = Prof-specialty: >50K
| occupation = Handlers-cleaners: null
| occupation = Machine-op-inspct: <=50K
| occupation = Adm-clerical: <=50K
| occupation = Farming-fishing: <=50K
| occupation = Transport-moving: <=50K
| occupation = Priv-house-serv: null
| occupation = Protective-serv: null
| occupation = Armed-Forces: null
education = HS-grad
| relationship = Wife
| | occupation = Tech-support: null
| | occupation = Craft-repair: null
| | occupation = Other-service
| | | workclass = Private: <=50K
| | | workclass = Self-emp-not-inc: null
| | | workclass = Self-emp-inc: null
| | | workclass = Federal-gov: null
| | | workclass = Local-gov: >50K
| | | workclass = State-gov: null
| | | workclass = Without-pay: null
| | | workclass = Never-worked: null
| | occupation = Sales: <=50K
| | occupation = Exec-managerial: null
| | occupation = Prof-specialty: null
| | occupation = Handlers-cleaners: null
| | occupation = Machine-op-inspct: <=50K
| | occupation = Adm-clerical: >50K
| | occupation = Farming-fishing: null
| | occupation = Transport-moving: null
| | occupation = Priv-house-serv: null
| | occupation = Protective-serv: null
| | occupation = Armed-Forces: null
| relationship = Own-child: <=50K
| relationship = Husband
| | occupation = Tech-support: <=50K
| | occupation = Craft-repair
| | | workclass = Private: <=50K
| | | workclass = Self-emp-not-inc: <=50K
| | | workclass = Self-emp-inc: >50K
| | | workclass = Federal-gov: null
| | | workclass = Local-gov: <=50K
| | | workclass = State-gov: null
| | | workclass = Without-pay: null
| | | workclass = Never-worked: null
| | occupation = Other-service: <=50K
| | occupation = Sales
| | | workclass = Private: <=50K
| | | workclass = Self-emp-not-inc: null
| | | workclass = Self-emp-inc: <=50K
| | | workclass = Federal-gov: null
| | | workclass = Local-gov: null
| | | workclass = State-gov: <=50K
| | | workclass = Without-pay: null
| | | workclass = Never-worked: null
| | occupation = Exec-managerial: <=50K
| | occupation = Prof-specialty: null
| | occupation = Handlers-cleaners: <=50K
| | occupation = Machine-op-inspct
| | | native-country = United-States
| | | | race = White: <=50K
| | | | race = Asian-Pac-Islander: null
| | | | race = Amer-Indian-Eskimo: <=50K
| | | | race = Other: null
| | | | race = Black: null
| | | native-country = Cambodia: null
| | | native-country = England: null
| | | native-country = Puerto-Rico: null
| | | native-country = Canada: null
| | | native-country = Germany: null
| | | native-country = Outlying-US(Guam-USVI-etc): null
| | | native-country = India: null
| | | native-country = Japan: null
| | | native-country = Greece: null
| | | native-country = South: null
| | | native-country = China: null
| | | native-country = Cuba: null
| | | native-country = Iran: null
| | | native-country = Honduras: null
| | | native-country = Philippines: null
| | | native-country = Italy: >50K
| | | native-country = Poland: null
| | | native-country = Jamaica: null
| | | native-country = Vietnam: null
| | | native-country = Mexico: null
| | | native-country = Portugal: null
| | | native-country = Ireland: null
| | | native-country = France: null
| | | native-country = Dominican-Republic: null
| | | native-country = Laos: null
| | | native-country = Ecuador: null
| | | native-country = Taiwan: null
| | | native-country = Haiti: null
| | | native-country = Columbia: null
| | | native-country = Hungary: null
| | | native-country = Guatemala: null
| | | native-country = Nicaragua: null
| | | native-country = Scotland: null
| | | native-country = Thailand: null
| | | native-country = Yugoslavia: null
| | | native-country = El-Salvador: null
| | | native-country = Trinadad&Tobago: null
| | | native-country = Peru: null
| | | native-country = Hong: null
| | | native-country = Holand-Netherlands: null
| | occupation = Adm-clerical
| | | workclass = Private: <=50K
| | | workclass = Self-emp-not-inc: null
| | | workclass = Self-emp-inc: null
| | | workclass = Federal-gov: >50K
| | | workclass = Local-gov: null
| | | workclass = State-gov: null
| | | workclass = Without-pay: null
| | | workclass = Never-worked: null
| | occupation = Farming-fishing
| | | workclass = Private: <=50K
| | | workclass = Self-emp-not-inc
| | | | race = White: >50K
| | | | race = Asian-Pac-Islander: <=50K
| | | | race = Amer-Indian-Eskimo: null
| | | | race = Other: null
| | | | race = Black: null
| | | workclass = Self-emp-inc: null
| | | workclass = Federal-gov: null
| | | workclass = Local-gov: null
| | | workclass = State-gov: null
| | | workclass = Without-pay: null
| | | workclass = Never-worked: null
| | occupation = Transport-moving
| | | workclass = Private: <=50K
| | | workclass = Self-emp-not-inc: null
| | | workclass = Self-emp-inc: null
| | | workclass = Federal-gov: null
| | | workclass = Local-gov: <=50K
| | | workclass = State-gov: null
| | | workclass = Without-pay: null
| | | workclass = Never-worked: null
| | occupation = Priv-house-serv: null
| | occupation = Protective-serv: null
| | occupation = Armed-Forces: null
| relationship = Not-in-family: <=50K
| relationship = Other-relative: <=50K
| relationship = Unmarried: <=50K
education = Prof-school: >50K
education = Assoc-acdm
| occupation = Tech-support: null
| occupation = Craft-repair: null
| occupation = Other-service: null
| occupation = Sales: <=50K
| occupation = Exec-managerial: >50K
| occupation = Prof-specialty: <=50K
| occupation = Handlers-cleaners: <=50K
| occupation = Machine-op-inspct: null
| occupation = Adm-clerical
| | workclass = Private: <=50K
| | workclass = Self-emp-not-inc: null
| | workclass = Self-emp-inc: null
| | workclass = Federal-gov: >50K
| | workclass = Local-gov: null
| | workclass = State-gov: null
| | workclass = Without-pay: null
| | workclass = Never-worked: null
| occupation = Farming-fishing: >50K
| occupation = Transport-moving: null
| occupation = Priv-house-serv: null
| occupation = Protective-serv: null
| occupation = Armed-Forces: null
education = Assoc-voc
| relationship = Wife: <=50K
| relationship = Own-child: <=50K
| relationship = Husband
| | workclass = Private: >50K
| | workclass = Self-emp-not-inc: null
| | workclass = Self-emp-inc: null
| | workclass = Federal-gov: <=50K
| | workclass = Local-gov: null
| | workclass = State-gov: null
| | workclass = Without-pay: null
| | workclass = Never-worked: null
| relationship = Not-in-family: <=50K
| relationship = Other-relative: null
| relationship = Unmarried: <=50K
education = 9th: <=50K
education = 7th-8th: <=50K
education = 12th
| occupation = Tech-support: null
| occupation = Craft-repair: <=50K
| occupation = Other-service: null
| occupation = Sales: null
| occupation = Exec-managerial: null
| occupation = Prof-specialty: >50K
| occupation = Handlers-cleaners: null
| occupation = Machine-op-inspct: null
| occupation = Adm-clerical: null
| occupation = Farming-fishing: null
| occupation = Transport-moving: null
| occupation = Priv-house-serv: null
| occupation = Protective-serv: null
| occupation = Armed-Forces: null
education = Masters
| occupation = Tech-support: <=50K
| occupation = Craft-repair: null
| occupation = Other-service: null
| occupation = Sales: >50K
| occupation = Exec-managerial: >50K
| occupation = Prof-specialty
| | relationship = Wife: >50K
| | relationship = Own-child: >50K
| | relationship = Husband
| | | workclass = Private
| | | | race = White: >50K
| | | | race = Asian-Pac-Islander: null
| | | | race = Amer-Indian-Eskimo: null
| | | | race = Other: null
| | | | race = Black: >50K
| | | workclass = Self-emp-not-inc: >50K
| | | workclass = Self-emp-inc: null
| | | workclass = Federal-gov: null
| | | workclass = Local-gov: null
| | | workclass = State-gov: null
| | | workclass = Without-pay: null
| | | workclass = Never-worked: null
| | relationship = Not-in-family: <=50K
| | relationship = Other-relative: null
| | relationship = Unmarried: <=50K
| occupation = Handlers-cleaners: null
| occupation = Machine-op-inspct: null
| occupation = Adm-clerical: null
| occupation = Farming-fishing: null
| occupation = Transport-moving: null
| occupation = Priv-house-serv: null
| occupation = Protective-serv: >50K
| occupation = Armed-Forces: null
education = 1st-4th: <=50K
education = 10th
| workclass = Private: <=50K
| workclass = Self-emp-not-inc: >50K
| workclass = Self-emp-inc: null
| workclass = Federal-gov: null
| workclass = Local-gov: <=50K
| workclass = State-gov: null
| workclass = Without-pay: null
| workclass = Never-worked: null
education = Doctorate
| workclass = Private
| | marital-status = Married-civ-spouse: >50K
| | marital-status = Divorced: null
| | marital-status = Never-married: <=50K
| | marital-status = Separated: null
| | marital-status = Widowed: null
| | marital-status = Married-spouse-absent: null
| | marital-status = Married-AF-spouse: null
| workclass = Self-emp-not-inc: <=50K
| workclass = Self-emp-inc: null
| workclass = Federal-gov: >50K
| workclass = Local-gov: <=50K
| workclass = State-gov: null
| workclass = Without-pay: null
| workclass = Never-worked: null
education = 5th-6th: <=50K
education = Preschool: null

Time taken to build model: 0.03 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances     297     74.25 %
Incorrectly Classified Instances    56     14 %
Kappa statistic                      0.4963
Mean absolute error                  0.1909
Root mean squared error              0.3951
Relative absolute error             64.2259 %
Root relative squared error        103.2838 %
UnClassified Instances              47     11.75 %
Total Number of Instances          400

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.554     0.082     0.641       0.554    0.594       >50K
0.918     0.446     0.886       0.918    0.901       <=50K

=== Confusion Matrix ===

   a    b   <-- classified as
  41   33 | a = >50K
  23  256 | b = <=50K
The tree has a very heavily branched structure. However, many of the leaves are empty (null), which means that the corresponding combination of attribute values does not occur in the source data. Where a leaf is not empty, the path leading to it forms a chain of conditions that ends in >50K or <=50K.
If all lines with the value null are removed, what remains is a small set of rules that can be used to classify objects.
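Dropping the null leaves can be sketched as follows. This is only an illustration, under the assumption that the tree is available as text in the format of Listing 3; the function name is hypothetical and the sample is an abridged excerpt of the education = 11th and education = Prof-school branches.

```python
def extract_rules(tree_text):
    """Turn an indented Id3-style text tree into (conditions, label) rules, skipping null leaves."""
    rules = []
    path = []  # conditions on the path from the root to the current line
    for line in tree_text.strip().splitlines():
        depth = line.count("|")            # one "|" per nesting level
        body = line.replace("|", "").strip()
        condition, _, label = body.partition(": ")
        del path[depth:]                   # forget conditions of deeper or sibling branches
        path.append(condition)
        if label and label != "null":
            rules.append((tuple(path), label))
    return rules

sample = """
education = 11th
| occupation = Tech-support: null
| occupation = Craft-repair: <=50K
| occupation = Exec-managerial
| | marital-status = Married-civ-spouse: >50K
| | marital-status = Divorced: <=50K
| | marital-status = Never-married: null
| occupation = Prof-specialty: >50K
education = Prof-school: >50K
"""
rules = extract_rules(sample)
for conditions, label in rules:
    print(" AND ".join(conditions), "=>", label)
```

Each printed line is one rule: the conjunction of conditions along a path, followed by the class of the non-empty leaf.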
Classification with the J4.8 method (a modification of C4.5).
This algorithm is applied to the source data without any modification. Its result is a decision tree, which can be viewed both as a tree (Figure 3) and as text (Listing 4).
Listing 4
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: laba43
Instances: 400
Attributes: 15
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
income
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

education-num <= 11
| capital-gain <= 3908: <=50K (286.0/29.0)
| capital-gain > 3908
| | capital-gain <= 4064: <=50K (2.0)
| | capital-gain > 4064: >50K (6.0)
education-num > 11
| marital-status = Married-civ-spouse
| | age <= 28: <=50K (3.0)
| | age > 28: >50K (49.0/8.0)
| marital-status = Divorced
| | hours-per-week <= 45: <=50K (13.0)
| | hours-per-week > 45: >50K (5.0/1.0)
| marital-status = Never-married
| | capital-gain <= 5178: <=50K (30.0/3.0)
| | capital-gain > 5178: >50K (3.0)
| marital-status = Separated: >50K (2.0/1.0)
| marital-status = Widowed: <=50K (1.0)
| marital-status = Married-spouse-absent: <=50K (0.0)
| marital-status = Married-AF-spouse: <=50K (0.0)

Number of Leaves : 13
Size of the tree : 20

Time taken to build model: 0.02 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances     347     86.75 %
Incorrectly Classified Instances    53     13.25 %
Kappa statistic                      0.5697
Mean absolute error                  0.208
Root mean squared error              0.3436
Relative absolute error             60.9529 %
Root relative squared error         83.273 %
Total Number of Instances          400

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.563     0.048     0.766       0.563    0.649       >50K
0.952     0.437     0.887       0.952    0.918       <=50K

=== Confusion Matrix ===

   a    b   <-- classified as
  49   38 | a = >50K
  15  298 | b = <=50K
Figure 3
In this case the tree looks better than the one produced by ID3. This is due to two improvements of the algorithm over ID3:
1. The ability to work not only with categorical attributes but also with numeric ones.
2. After the tree is built, its branches are pruned. If the resulting tree is too large, either several nodes are grouped into a single leaf or a tree node is replaced by its underlying subtree. Before the operation, the error of the classification rule contained in the node under consideration is computed. If the error does not grow after the replacement (or grouping), and the entropy does not increase much, the replacement can be made without harming the model.
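The first improvement, splitting on a numeric attribute, can be sketched as a search for the threshold with the highest information gain. This is a simplified illustration rather than the J48 procedure: it uses plain information gain and midpoints between neighbouring values, and the function names and toy data are assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the binary split value <= threshold with the highest gain."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Hypothetical toy values; not taken from the laba43 data set.
education_num = [9, 9, 10, 10, 13, 13, 14, 16]
income = ["<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K", ">50K"]
threshold, gain = best_threshold(education_num, income)
print(threshold, round(gain, 3))
```

A test of the form attribute <= threshold, like education-num <= 11 in Listing 4, is the kind of split such a search produces.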
As a result, a higher percentage of correctly classified instances is achieved (86.75 % versus 74.25 % for ID3). The output, and the tree in particular, shows a fairly logical classification of the objects. In our case the tree branches further at marital-status = Married-civ-spouse, marital-status = Divorced and marital-status = Widowed; for the remaining values we get the counts of correctly and incorrectly classified objects:
marital-status = Never-married: <=50K (114.0/3.0)
marital-status = Separated: <=50K (7.0)
marital-status = Married-spouse-absent: <=50K (4.0)
marital-status = Married-AF-spouse: <=50K (0.0)
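The summary figures can be cross-checked against the confusion matrix. The sketch below recomputes the accuracy and the Kappa statistic from the J48 confusion matrix of Listing 4 using the standard definition of Cohen's kappa; the function name is an assumption.

```python
def accuracy_and_kappa(matrix):
    """Compute accuracy and Cohen's kappa from a square confusion matrix (rows = actual class)."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # agreement expected by chance from the marginal totals
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return observed, (observed - expected) / (1 - expected)

j48_matrix = [[49, 38], [15, 298]]  # confusion matrix from Listing 4
accuracy, kappa = accuracy_and_kappa(j48_matrix)
print(f"accuracy={accuracy:.4f} kappa={kappa:.4f}")
```

Note that the ID3 matrix in Listing 3 covers only 353 instances; its 74.25 % is 297 correct out of all 400, so the 47 unclassified instances count against it.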
Classification with the 1R method (called OneR in Weka).
1R is one of the simplest and most understandable classification methods. It applies both to numeric data, which are split into intervals, and to nominal data.
An example of the algorithm's output is given in Listing 5.
Listing 5
=== Run information ===
Scheme: weka.classifiers.rules.OneR -B 6
Relation: laba43
Instances: 400
Attributes: 15
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
income
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

capital-gain:
< 4621.0 -> <=50K
>= 4621.0 -> >50K
(329/400 instances correct)

Time taken to build model: 0.02 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances     321     80.25 %
Incorrectly Classified Instances    79     19.75 %
Kappa statistic                      0.1963
Mean absolute error                  0.1975
Root mean squared error              0.4444
Relative absolute error             57.8673 %
Root relative squared error        107.7134 %
Total Number of Instances          400

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.161     0.019     0.7         0.161    0.262       >50K
0.981     0.839     0.808       0.981    0.886       <=50K

=== Confusion Matrix ===

   a    b   <-- classified as
  14   73 | a = >50K
   6  307 | b = <=50K
On our data this method did not perform very well. It is known to be prone to overfitting. The method favors variables that take the largest possible number of distinct values, because the error for such variables is the smallest. For example, for a variable that has a unique value for every record the error is zero, but rules built on such a variable are useless. In our case that variable is capital-gain. Accordingly, the accuracy after cross-validation is still fairly high, 80.25 %, although the Kappa statistic is only 0.1963 and only 14 of the 87 instances of class >50K are recognized.
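The bias toward many-valued attributes can be seen in a minimal sketch of the 1R idea for nominal attributes. It is not Weka's OneR code: the function name and the toy rows, including the record-id column, are assumptions, and the discretization of numeric attributes is left out.

```python
from collections import Counter, defaultdict

def one_r(rows, class_index):
    """Return (attribute_index, rule, errors) for the attribute with the fewest training errors."""
    best = None
    for attr in range(len(rows[0])):
        if attr == class_index:
            continue
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[class_index]] += 1
        # each attribute value predicts its majority class
        rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
        errors = sum(sum(counts.values()) - max(counts.values())
                     for counts in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

# Hypothetical toy rows: (record id, sex, income); not taken from laba43.
rows = [
    ("id1", "Male", ">50K"),
    ("id2", "Male", "<=50K"),
    ("id3", "Female", "<=50K"),
    ("id4", "Female", "<=50K"),
]
attr, rule, errors = one_r(rows, class_index=2)
print(attr, errors)
```

The record-id column wins with zero training errors because every value is unique, yet the resulting rule cannot classify any new record.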
Classification with the SVM method (called SMO in Weka).
This method does not require any transformation of the source sample.
This method is a classification algorithm based on mathematical functions. It can use nonlinear functions, although in the run below Weka reports a linear machine. Nominal data are converted to numeric data. The main idea of the support vector method is to map the source vectors into a space of higher dimension and to find the maximum-margin separating hyperplane in that space. The result of running the algorithm is given in Listing 6.
Listing 6
=== Run information ===
Scheme: weka.classifiers.functions.SMO -C 1.0 -E 1.0 -G 0.01 -A 250007 -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1
Relation: laba43
Instances: 400
Attributes: 15
age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
sex
capital-gain
capital-loss
hours-per-week
native-country
income
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

SMO

Classifier for classes: >50K, <=50K

BinarySMO

Machine linear: showing attribute weights, not support vectors.

-0.0585 * (normalized) age
+ 0.0084 * (normalized) workclass=Private
+ -0.0213 * (normalized) workclass=Self-emp-not-inc
+ -0.2018 * (normalized) workclass=Self-emp-inc
+ -0.2739 * (normalized) workclass=Federal-gov
+ 0.2433 * (normalized) workclass=Local-gov
+ 0.2452 * (normalized) workclass=State-gov
+ 0.0193 * (normalized) fnlwgt
+ -0.7706 * (normalized) education=Bachelors
+ 0.6757 * (normalized) education=Some-college
+ 0.3256 * (normalized) education=11th
+ 0.5779 * (normalized) education=HS-grad
+ -0.9101 * (normalized) education=Prof-school
+ -0.7434 * (normalized) education=Assoc-acdm
+ 1 * (normalized) education=Assoc-voc
+ 0.1488 * (normalized) education=9th
+ 0.2148 * (normalized) education=7th-8th
+ -0.3453 * (normalized) education=12th
+ -0.9079 * (normalized) education=Masters
+ 0.0496 * (normalized) education=1st-4th
+ 0.2341 * (normalized) education=10th
+ -0.3222 * (normalized) education=Doctorate
+ 0.773 * (normalized) education=5th-6th
+ -1.6111 * (normalized) education-num
+ -1.1023 * (normalized) marital-status=Married-civ-spouse
+ 0.4264 * (normalized) marital-status=Divorced
+ 0.7828 * (normalized) marital-status=Never-married
+ -0.4317 * (normalized) marital-status=Separated
+ 0.3247 * (normalized) marital-status=Widowed
+ 0.1073 * (normalized) occupation=Tech-support
+ -0.0689 * (normalized) occupation=Craft-repair
+ -0.0632 * (normalized) occupation=Other-service
+ -0.042 * (normalized) occupation=Sales
+ -0.2862 * (normalized) occupation=Exec-managerial
+ -0.1301 * (normalized) occupation=Prof-specialty
+ 1 * (normalized) occupation=Handlers-cleaners
+ 0.1839 * (normalized) occupation=Machine-op-inspct
+ -0.0754 * (normalized) occupation=Adm-clerical
+ -0.2496 * (normalized) occupation=Farming-fishing
+ -0.0682 * (normalized) occupation=Transport-moving
+ -0.3074 * (normalized) occupation=Protective-serv
+ -0.6652 * (normalized) relationship=Wife
+ -0.0987 * (normalized) relationship=Own-child
+ 0.111 * (normalized) relationship=Husband
+ -0.0157 * (normalized) relationship=Not-in-family
+ 0.1003 * (normalized) relationship=Other-relative
+ 0.5683 * (normalized) relationship=Unmarried
+ -0.208 * (normalized) race=White
+ 0.0377 * (normalized) race=Asian-Pac-Islander
+ 0.582 * (normalized) race=Other
+ -0.4117 * (normalized) race=Black
+ -0.5355 * (normalized) sex
+ -1.1261 * (normalized) capital-gain
+ -1.2683 * (normalized) capital-loss
+ -0.2404 * (normalized) hours-per-week
+ 0.2201 * (normalized) native-country=United-States
+ -0.6401 * (normalized) native-country=Canada
+ 0.2992 * (normalized) native-country=Germany
+ 0.6778 * (normalized) native-country=China
+ -0.9273 * (normalized) native-country=Cuba
+ -1 * (normalized) native-country=Italy
+ 1 * (normalized) native-country=Vietnam
+ 0.1772 * (normalized) native-country=Mexico
+ 0.193 * (normalized) native-country=Nicaragua
+ 2.9283

Number of kernel evaluations: 38130 (96.102% cached)

Time taken to build model: 0.59 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances     331     82.75 %
Incorrectly Classified Instances    69     17.25 %
Kappa statistic                      0.4293
Mean absolute error                  0.1725
Root mean squared error              0.4153
Relative absolute error             50.5423 %
Root relative squared error        100.6655 %
Total Number of Instances          400

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.448     0.067     0.65        0.448    0.531       >50K
0.933     0.552     0.859       0.933    0.894       <=50K

=== Confusion Matrix ===

   a    b   <-- classified as
  39   48 | a = >50K
  21  292 | b = <=50K
The output of the algorithm shows the weights for all possible attribute values, and there is a noticeable delay before it appears because of the computations involved. The percentage of correct classification is fairly high, 82.75 %, while the mean absolute error of the classifier, 0.1725, is the lowest among all the methods considered.
The final output of the algorithm is a vector in n-dimensional space. The numbers in the output are the coefficients that define the plane separating the source data into classes.
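How such coefficients are used can be sketched with a tiny linear model. The weights, bias and attribute values below are hypothetical and are not those of Listing 6; the function names are assumptions, and which sign of the result corresponds to which class is not specified here.

```python
def one_hot(value, categories):
    """Encode a nominal value as a list of 0/1 indicators, one per category."""
    return [1.0 if value == category else 0.0 for category in categories]

def decision_value(weights, bias, features):
    """Evaluate w . x + b for a linear machine."""
    return sum(w * x for w, x in zip(weights, features)) + bias

# Hypothetical model: three indicator weights for marital-status plus one numeric weight.
categories = ["Married-civ-spouse", "Divorced", "Never-married"]
weights = [-1.0, 0.5, 0.75, -1.5]
bias = 0.5

# One instance: marital-status = Divorced and a normalized numeric attribute equal to 0.5.
features = one_hot("Divorced", categories) + [0.5]
score = decision_value(weights, bias, features)
print(score)
```

The sign of the score tells on which side of the separating plane the instance lies. The indicator encoding also explains why Listing 6 shows a separate weight for each value of a nominal attribute.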
Task 3: Building association rules. The Apriori method.
Finding association rules is done almost the same way as classification. On the Associate tab a method is selected, its parameters are set by clicking on its name, then the Start button is pressed and the output is analyzed. Filters are applied where necessary (in this case the filters are similar to those used for the ID3 method). In our case the association rules are built with the Apriori method.
Listing 7
=== Run information ===
Scheme: weka.associations.Apriori -N 11 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0
Relation: laba43-weka.filters.unsupervised.attribute.RemoveType-Tnumeric
Instances: 400
Attributes: 9
workclass
education
marital-status
occupation
relationship
race
sex
native-country
income
=== Associator model (full training set) ===
Apriori
=======

Minimum support: 0.4 (160 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 12

Generated sets of large itemsets:

Size of set of large itemsets L(1): 7
Size of set of large itemsets L(2): 15
Size of set of large itemsets L(3): 13
Size of set of large itemsets L(4): 2

Best rules found:

1. relationship=Husband 167 ==> marital-status=Married-civ-spouse sex=Male 167 conf:(1)
2. marital-status=Married-civ-spouse relationship=Husband 167 ==> sex=Male 167 conf:(1)
3. relationship=Husband sex=Male 167 ==> marital-status=Married-civ-spouse 167 conf:(1)
4. relationship=Husband 167 ==> sex=Male 167 conf:(1)
5. relationship=Husband 167 ==> marital-status=Married-civ-spouse 167 conf:(1)
6. marital-status=Married-civ-spouse sex=Male 173 ==> relationship=Husband 167 conf:(0.97)
7. marital-status=Married-civ-spouse race=White 182 ==> native-country=United-States 173 conf:(0.95)
8. marital-status=Married-civ-spouse native-country=United-States 182 ==> race=White 173 conf:(0.95)
9. race=White sex=Male 238 ==> native-country=United-States 224 conf:(0.94)
10. marital-status=Married-civ-spouse sex=Male 173 ==> native-country=United-States 162 conf:(0.94)
11. marital-status=Married-civ-spouse sex=Male 173 ==> race=White 162 conf:(0.94)
As a result of running the algorithm, the rules whose metric exceeds the minimum are shown.
The method was configured to generate 11 association rules. The algorithm finds frequently occurring itemsets, so the most reliable rules come from the most frequent itemsets, but, as can be seen, not all of them are meaningful.
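The confidence values in Listing 7 follow directly from the counts printed next to each rule. A minimal sketch, with hypothetical function names, recomputes them for rules 6, 7 and 9, together with the instance count behind the minimum support of 0.4.

```python
def confidence(antecedent_count, rule_count):
    """Confidence of A ==> B: instances containing both A and B divided by instances containing A."""
    return rule_count / antecedent_count

def min_support_count(min_support, n_instances):
    """Number of instances that corresponds to a relative minimum support."""
    return round(min_support * n_instances)

# (rule number, antecedent count, count of antecedent and consequent together), from Listing 7.
for number, antecedent, both in [(6, 173, 167), (7, 182, 173), (9, 238, 224)]:
    print(number, round(confidence(antecedent, both), 2))

print(min_support_count(0.4, 400))
```

Rules 1 to 5 have confidence 1 because every one of the 167 instances matching the left-hand side also matches the right-hand side.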
CONCLUSION
In this work, methods of classification and of building association rules were studied. The source data set was the US population census, in which classification was performed by income.
DATA SET datamining400-02
Moscow 2008