Subgroup discovery (SD) is a data mining field that aims to describe data using supervised learning techniques. The goal is to find simple, general and interesting patterns with respect to a given variable of interest. Throughout the literature, SD has been successfully applied to different real-world problems in areas such as marketing, medicine and e-learning, among others.
SD is a rule learning process within a complex search space, so the search strategy used becomes a key factor in the efficiency of the method. Different strategies can be found in the literature, such as the beam search used in the CN2-SD and Apriori-SD algorithms, exhaustive algorithms such as SDMap, or Evolutionary Algorithms (EAs), for example.
EAs are stochastic algorithms for optimization and search tasks, based on the process of natural evolution. There are different paradigms within EAs: genetic algorithms, evolution strategies, evolutionary programming and genetic programming. The use of rules as knowledge representation within these methods is known as evolutionary rule-based systems, and it has the advantage of allowing the inclusion of domain knowledge and of returning better rules. EAs are very well suited to SD because they perform a global search of the space in a convenient way, favoring the extraction of simple, interesting and precise subgroups.
Nowadays there are several frameworks for data mining tasks, but only a few of them implement SD algorithms. The best-known frameworks with SD algorithms are KEEL and VIKAMINE, although ORANGE and CORTANA provide some implementations too. In fact, VIKAMINE also provides an R package called rsubgroup, which is an R interface to the VIKAMINE Java algorithms.
In this contribution the SDEFSR package is introduced. It provides the user with the most important evolutionary fuzzy rule-based methods for SD documented in the literature, which makes it a major contribution to the R community. In addition, it brings the capability of reading datasets in the KEEL data format, a file format not natively supported by R. The package also provides a Graphical User Interface (GUI) to make the task easier for the user, especially for novices.
This contribution is organized according to the following structure: the first section presents SD concepts and its main properties and features. In the second section, the structure of the SDEFSR package and its operations are described. In the third section, a complete example of use of the package is presented. Finally, the GUI of SDEFSR is shown.
SD was originally defined as:
In subgroup discovery, we assume we are given a so-called population of individuals (objects, customers, ...) and a property of those individuals we are interested in. The task of subgroup discovery is then to discover the subgroups of the population that are statistically “most interesting”, i.e. are as large as possible and have the most unusual statistical (distributional) characteristics with respect to the property of interest.
SD tries to find relations between different properties of a set with respect to one target variable of interest. Such relations must be statistically interesting, so it is not necessary to find complete relations; partial relations can be interesting too.
Usually these relations are represented by means of rules, one per relation. These rules are defined as

R: Cond → Target_value

where Target_value is the value of the variable of interest (target variable) for the SD task and Cond is normally a conjunction of attribute-value pairs which describe the characteristics of the induced subgroup.

SD is halfway between description and classification, because it has a target variable but its objective is not to predict it but rather to describe it. The use of a target variable is not possible in description, because description simply finds relationships between unlabeled objects.

A key point to fully understand the goal of SD is how it differs from the classification objective. Classification is a predictive task that tries to split the entire search space, usually in a complex way, aiming to find the correct value of the variable in new incoming instances. SD, on the other hand, aims to find interesting relations among the instances with respect to the variable of interest.

For instance, assume there is a group of patients and the variable of interest is whether they have heart disease or not. The predictive data mining objective is to predict whether new patients will have heart disease. SD, however, tries to find which subgroups of patients are more likely to have heart disease according to certain characteristics, which is relevant, for instance, for developing treatments targeted at those characteristics.
A quality measure tries to quantify the interestingness of a given rule or subgroup, although there is no formal definition of interestingness. It can, however, be characterized as a concept that emphasizes conciseness, coverage, reliability, peculiarity, diversity, novelty, surprisingness, utility, and actionability. For SD, the most widely used criteria to measure the interestingness of a rule are conciseness, generality or coverage, reliability, and novelty.
Quality measures that fulfill these criteria and are available in the SDEFSR package include, among others, coverage, significance, unusualness, accuracy, crisp and fuzzy support, crisp and fuzzy confidence, and the true and false positive rates. A more extensive classification and definition of quality measures for SD is available in the literature.
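As an illustration only (these are the standard crisp SD definitions, not the package's internal code), the sketch below computes three of these measures for a rule, given logical vectors indicating which examples the rule covers and which belong to the target class:

qualityMeasures <- function(covered, isTarget) {
  N <- length(covered)
  coverage   <- sum(covered) / N                        #generality of the rule
  support    <- sum(covered & isTarget) / N             #generality w.r.t. the target
  confidence <- sum(covered & isTarget) / sum(covered)  #reliability
  c(Coverage = coverage, Support = support, Confidence = confidence)
}

#150 examples; a rule covering 30 of them, 25 belonging to the target class
covered  <- c(rep(TRUE, 30), rep(FALSE, 120))
isTarget <- c(rep(TRUE, 25), rep(FALSE, 105), rep(TRUE, 20))
qualityMeasures(covered, isTarget)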
Evolutionary fuzzy systems (EFS) are the union of two powerful tools for approximate reasoning: evolutionary algorithms and fuzzy logic.
On the one hand, evolutionary algorithms are stochastic methods, based on natural evolution, for solving complex optimization problems. They are based on a population of representations of possible solutions, called chromosomes. A basic evolutionary algorithm scheme is:
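A minimal, self-contained illustration of this scheme in R (a toy sketch that maximizes the number of ones in binary chromosomes, not the package's implementation) could be:

evaluate <- function(pop) apply(pop, 1, sum)      #fitness: number of ones
select <- function(pop, fit)                      #fitness-proportional selection
  pop[sample(nrow(pop), 2, prob = fit + 1e-9), , drop = FALSE]
crossover <- function(p) {                        #one-point crossover of two parents
  cut <- sample(ncol(p) - 1, 1)
  rbind(c(p[1, 1:cut], p[2, -(1:cut)]),
        c(p[2, 1:cut], p[1, -(1:cut)]))
}
mutate <- function(off, pm = 0.01) {              #bit-flip mutation
  mask <- matrix(runif(length(off)) < pm, nrow(off))
  off[mask] <- 1 - off[mask]
  off
}

pop <- matrix(sample(0:1, 20 * 10, replace = TRUE), nrow = 20)  #initial population
for (gen in 1:50) {                               #evolutionary loop
  fit <- evaluate(pop)
  offspring <- mutate(crossover(select(pop, fit)))
  pop[order(fit)[1:2], ] <- offspring             #replace the two worst individuals
}
max(evaluate(pop))                                #best fitness reached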
These algorithms perform a global stochastic search through a huge search space efficiently. They may not find an optimal solution (the global optimum), but rather a good one (a local optimum) that can also solve the problem. They are well suited for SD because the problem of finding subgroups can be formulated from the optimization point of view, encoding rules as parameter structures that optimize some quality measures. Additionally, different kinds of rules exist in SD (with inequalities, with intervals, fuzzy rules, ...), which can change the size of the search space dramatically, and genetic algorithms can adapt to these structures easily without performance degradation. Likewise, the choice of genetic operators makes genetic algorithms great candidates for introducing expert knowledge.
On the other hand, fuzzy logic is an extension of traditional set theory. Its main objective is to model and deal with imprecise knowledge, which is closer to human reasoning. The main difference with traditional set theory is that the membership degree is not zero or one, but a real value in [0, 1]. This makes it possible to define fuzzy limits and allows overlapping between fuzzy sets.
A fuzzy variable is a set of linguistic labels, e.g. low, medium and high, which are defined by overlapping fuzzy sets. This kind of information is closer to human reasoning, while the value of each membership degree can still be computed with precision. This expressivity allows simpler rules to be obtained, because continuous variables become more understandable for humans. A rule can be a set of fuzzy variables; to determine whether the rule covers an example, the membership degree of each variable in the rule is computed with respect to the example. If all variables have a membership degree greater than zero, the example is covered.
Evolutionary fuzzy systems are the union of both techniques, and can work in several ways, typically by evolutionary learning or tuning of the components of a fuzzy rule-based system.

SDEFSR is a package entirely written in R from scratch. It includes the most important EFSs for SD presented in the literature up to the moment. In addition, SDEFSR can read data in different standard and well-known formats such as ARFF (Weka), KEEL, CSV and data.frame objects. Similarly, all functions included in the SDEFSR package have default parameter values recommended by their authors. This allows the algorithms to be executed in an easy way, without end users needing deep knowledge of the parameters.
We now describe the algorithms available in our package: SDIGA, MESDIF, NMEEF-SD and FuGePSD. A later section describes how to use them.
The algorithm SDIGA is a subgroup discovery algorithm that extracts rules with two possible representations: one with conjunctions of attribute-value pairs (called canonical representation) and one with a disjunctive normal form (DNF) in the antecedent. It follows an iterative scheme with a genetic algorithm at its core to extract those rules. This genetic algorithm extracts only one rule, the best of the population, and after the extraction a local optimization can be performed in order to improve the generality of the rule. Following the iterative scheme, the genetic algorithm is executed one time after another until a stopping criterion is reached: the rule obtained by the genetic algorithm after the local improvement phase must cover at least one example not covered by preceding rules and must have a minimum confidence.
SDIGA can work with missing data (represented by the maximum value of the variable + 1), with categorical variables and with numerical ones, using fuzzy logic with the latter to improve the interpretability of the results.
As mentioned above, a genetic algorithm is the core of SDIGA. This genetic algorithm is responsible for extracting one rule per execution, this rule being the best of the population at the end of the evolutionary process.
Each chromosome in the population represents a possible rule, although it only represents the antecedent part, because the consequent is prefixed. SDIGA can handle the two types of representation mentioned above, canonical and DNF.
The canonical representation is formed by a numerical vector of fixed length equal to the number of variables, with possible values in the range [0, max], where max is the number of possible values for categorical variables or the number of fuzzy sets defined for numerical variables. The value max represents the non-participation of that variable in the rule.
The DNF representation is formed by a binary vector, also of fixed length, equal to the sum of the numbers of values/fuzzy sets over all the variables. Here, a variable does not participate in a rule when all of its values are set to zero or all of them are set to one.
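As an illustration (hypothetical values, not taken from the package internals), consider a problem with three variables, each one having three possible values or fuzzy sets:

#Canonical: one gene per variable, in [0, 3]; the value 3 (max) means
#that the variable does not take part in the rule
canonical <- c(0, 3, 2)                  #rule: var1 = value0 AND var3 = value2

#DNF: one bit per value; a variable is ignored when all of its bits
#are zero (or all of them are one)
dnf <- c(1, 0, 0,  0, 0, 0,  0, 1, 1)    #var1 = value0; var2 ignored;
                                         #var3 = value1 OR value2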
SDIGA follows a pure steady-state schema in which only the two best chromosomes of an iteration of the evolutionary process are crossed. This crossover is performed by a two-point crossover operator.
As these rules use fuzzy logic, an expression is needed to determine whether an example is covered by a rule and to compute the membership degree of that example to the rule. This is given by the Antecedent Part Compatibility (APC), calculated as

APC(e, R) = T(TC(μ11(e), ..., μ1l(e)), ..., TC(μn1(e), ..., μnl(e)))

where T is the minimum t-norm, TC is the maximum t-conorm, and:
μni is the membership degree of the example for variable n and value/linguistic label i. It is one or zero when the variable is categorical, depending on whether its value matches the one in the rule. For numerical variables, μni is computed by the triangular membership function.
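A minimal sketch of this computation, assuming triangular fuzzy sets and a canonical rule (illustrative code, not the package internals):

triangular <- function(x, a, b, c) {
  #membership degree of x in the triangular fuzzy set with vertices a <= b <= c
  if (x <= a || x >= c) return(0)
  if (x <= b) (x - a) / (b - a) else (c - x) / (c - b)
}

#APC of a canonical rule: minimum t-norm over the membership degrees
#of the variables taking part in the rule
apc <- function(degrees) min(degrees)

#Example: a rule with a single fuzzy condition "Variable = Label(1.3, 2.5, 3.7)"
mu <- triangular(2.0, a = 1.3, b = 2.5, c = 3.7)
apc(mu)   #greater than zero, so the example is covered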
To obtain the population for the next iteration, a replacement operator is applied. This operator sorts the population by fitness value and keeps only the n best chromosomes, n being the size of the population.
After the genetic algorithm returns a rule, this rule can be improved by means of a local optimization based on a first-best hill-climbing search. The algorithm removes variables from the rule one by one, and if the resulting rule has a support and confidence greater than or equal to the current ones, the process starts again from the rule without that variable.
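A self-contained sketch of this local search over a canonical rule on a categorical dataset (an illustration under assumed encodings, not the package's implementation; covers, ruleQuality and localImprove are hypothetical helpers defined here):

#A rule is an integer vector; the value maxVal[i] means "variable i unused"
covers <- function(rule, x, maxVal) all(rule == maxVal | rule == x)

ruleQuality <- function(rule, X, y, target, maxVal) {
  cov <- apply(X, 1, covers, rule = rule, maxVal = maxVal)
  c(support = mean(cov & y == target),
    confidence = if (any(cov)) mean(y[cov] == target) else 0)
}

localImprove <- function(rule, X, y, target, maxVal) {
  repeat {
    q <- ruleQuality(rule, X, y, target, maxVal)
    improved <- FALSE
    for (v in which(rule != maxVal)) {
      cand <- rule
      cand[v] <- maxVal[v]                           #drop variable v from the rule
      if (all(ruleQuality(cand, X, y, target, maxVal) >= q)) {
        rule <- cand                                 #first best: accept and restart
        improved <- TRUE
        break
      }
    }
    if (!improved) return(rule)                      #no removal improves the rule
  }
}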
MESDIF is a multi-objective genetic algorithm that extracts fuzzy rules. The representation used can be canonical or DNF, as described for SDIGA. The algorithm follows the SPEA2 approach, where an elite population is used throughout the whole evolutionary process. At the end of the process, the rules stored in the elite population are returned as the result.
The multi-objective approach is based on a niching scheme relying on dominance in the Pareto front. This scheme finds non-dominated individuals with which to fill the elite population, which has a fixed size. A basic schema of the algorithm is depicted in the corresponding figure.
We now briefly describe the components of the genetic algorithm to show how it works.
The initial population operator of MESDIF produces random chromosomes in two ways: a percentage of the chromosomes is generated completely at random, while the rest are also generated randomly, but with the restriction that only a maximum number of variables can participate in the rule. This yields more general rules.
As the elite population has a fixed size, a truncation operator is needed when all the non-dominated individuals do not fit into it. This operator takes all non-dominated individuals and computes the distance between every pair of them. Then, it iteratively takes the two closest individuals and eliminates the one whose distance to its k-th nearest neighbour is smaller. This process is repeated until the non-dominated individuals fit into the elite population.
If the number of non-dominated individuals is smaller than the size of the elite population, it is filled with dominated individuals: the operator sorts the individuals by fitness value and copies the first n individuals to the elite population, n being the size of the elite population.
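A sketch of the distance-based truncation (a simplified, nearest-neighbour-only illustration in the spirit of SPEA2, not the package code):

#'nondom' is a matrix with one row per non-dominated individual and one
#column per objective value
truncateElite <- function(nondom, eliteSize) {
  while (nrow(nondom) > eliteSize) {
    d <- as.matrix(dist(nondom))            #pairwise distances between individuals
    diag(d) <- Inf
    nn <- apply(d, 1, min)                  #distance to the nearest neighbour
    nondom <- nondom[-which.min(nn), , drop = FALSE]  #drop the most crowded one
  }
  nondom
}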
NMEEF-SD is another multi-objective genetic algorithm, in this case based on the NSGA-II approach. The algorithm uses a fast sorting algorithm and a re-initialization based on coverage if the population does not evolve for a period of time.
This algorithm only works with the canonical representation (described for SDIGA). A study presented in the literature reflects the low quality of the rules obtained with a DNF representation.
The genetic operators of NMEEF-SD are the two-point crossover operator described for SDIGA and a biased mutation operator.
The re-initialization operator is applied if the Pareto front does not cover any new example during 5% of the total number of evaluations. In that case, the algorithm copies the non-duplicated individuals of the Pareto front to the next population, P_{t+1}, and the rest of the individuals are generated by trying to cover one example of the target class with a maximum number of variables.
FuGePSD is another genetic algorithm that finds interesting subgroups of a given class. It uses a genetic programming approach, where individuals are represented as trees of variable length instead of vectors, and the consequent of the rule is represented as well. This schema has the advantage of obtaining rules for all possible values of a given target variable in one execution. Furthermore, FuGePSD has a variable population size, which can change over the evolutionary process based on a competitive-cooperative approach, the token competition.
The evolutionary process of FuGePSD follows a generational scheme. Its key function is the token competition procedure, which promotes diversity in the population, while some of the genetic operators bring specificity to the rules. These genetic operators are applied with probabilities given by the user.
Token competition is the key procedure of FuGePSD: it brings diversity to the population while keeping the best individuals. The underlying idea is the following: let a token be an example of the dataset; each individual can seize a token if its rule covers that example. When this occurs, a flag is set on the token, and it cannot be seized by other rules.
Thus, the token competition works as follows: individuals are processed from best to worst fitness, each individual seizes as many free tokens as it can, its fitness is penalized according to the number of tokens actually seized, and individuals that cannot seize any token are removed from the population.
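The sketch below illustrates this procedure (an approximation based on the description above, not the package internals); fitness holds the raw fitness of each individual and coverMat[i, j] is TRUE when individual i covers example j:

tokenCompetition <- function(fitness, coverMat) {
  taken <- rep(FALSE, ncol(coverMat))               #flags: token already seized
  penalized <- numeric(length(fitness))
  for (i in order(fitness, decreasing = TRUE)) {    #strongest individuals first
    seized <- coverMat[i, ] & !taken                #free tokens this rule covers
    taken <- taken | seized
    ideal <- sum(coverMat[i, ])                     #tokens it could seize at most
    #penalize fitness by the fraction of coverable tokens actually seized
    penalized[i] <- if (ideal > 0) fitness[i] * sum(seized) / ideal else 0
  }
  penalized                                         #individuals at zero can be removed
}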
At the end of the evolutionary process, a screening function is launched to return only quality rules to the user. These rules must reach a minimum level of confidence and sensitivity. The function has an external parameter (ALL_CLASS): if true, the algorithm returns at least one rule per value of the target variable; if false, it returns at least the best rule of the population.
The main advantage of the SDEFSR package is that it provides all the evolutionary fuzzy systems for SD that exist in the bibliography. The algorithms included represent an important kind of SD method that was not available in R until now. Therefore, this package offers the R community a brand new possibility for data mining and data analysis.
The basis of the package is defined by two S3 classes: SDEFSR_Dataset, which represents a dataset, and SDEFSR_Rules, which represents a set of rules. Additionally, the package has a general function called read.dataset() that reads datasets in ARFF, KEEL or CSV format, and the function SDEFSR_DatasetFromDataFrame() to transform a data.frame into a SDEFSR_Dataset.
In this section we explain how to use the package, which aims to make the subgroup discovery algorithms really simple to use, without any dependencies.
The SDEFSR package is now available on CRAN, so it can be installed like any other package by simply typing:
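install.packages("SDEFSR")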
Also, the development version is available on GitHub at http://github.com/aklxao2/SDEFSR; feel free to clone and contribute if you wish. If you wish to use the development version, you need to install the devtools package and use the command install_github():
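install.packages("devtools")
devtools::install_github("aklxao2/SDEFSR")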
SDEFSR depends only on the shiny package, which is necessary to use the user interface locally. If shiny is not installed, the user is asked to install it when running the function SDEFSR_GUI(). Once installed, the package has to be loaded before use, either through library() or require(). After loading the package, six datasets are available as SDEFSR_Dataset objects: carTra, carTst, germanTra, germanTst, habermanTra and habermanTst, corresponding to the training and test partitions of the car, german and haberman datasets. Also, the rules generated by the SDIGA algorithm with default parameters over the haberman dataset are loaded as habermanRules, an SDEFSR_Rules object.
In order to use the SD algorithms available in the SDEFSR package, a SDEFSR_Dataset object is necessary. This object can be generated using the read.dataset() function. This function reads “.dat” files with the KEEL data mining tool format, ARFF files (“.arff”) from WEKA, and even CSV files (“.csv”). Assuming the files iris.dat, iris.arff and iris.csv, corresponding to the classical iris dataset in KEEL, ARFF and CSV formats respectively, are in the working directory, the loading of these files is as follows:
irisFromKEEL <- read.dataset("iris.dat")
irisFromARFF <- read.dataset("iris.arff")
irisFromCSV <- read.dataset("iris.csv")
Note that the function detects the type of data by the file extension. To read CSV files, the function has optional parameters that define the field separator, the decimal separator, the quote indicator and the NA identifier. By default, these options are sep = ",", quote = "\"", dec = "." and na.strings = "?", respectively. It is important to remark that sparse ARFF data is not supported.
If the dataset is not available in these formats, it is possible to obtain a SDEFSR_Dataset object from a data.frame. This data.frame could be loaded by read.table() or similar functions. Eventually, the resulting data.frame has to be given to the SDEFSR_DatasetFromDataFrame() function. As shown in this example, the function even allows the creation of datasets on the fly:
library(SDEFSR)
df <- data.frame(matrix(data = runif(1000), ncol = 10))
#Add class attribute (It must be the last attribute and it must be categorical)
df[,11] <- c("Class 0", "Class 1", "Class 2", "Class 3")
SDEFSR_DatasetObject <- SDEFSR_DatasetFromDataFrame(df, relation = "random")
#Load from iris dataset
irisFromDataFrame <- SDEFSR_DatasetFromDataFrame(iris, "iris")
This will assign to SDEFSR_DatasetObject a new dataset created randomly with 100 examples and 11 attributes. Note that the target variable must be categorical: the SD algorithms can only deal with categorical target variables.
The SDEFSR_DatasetFromDataFrame() function has three additional parameters: names, types, and classNames. These allow the manual assignment of the attribute names, their types, and the vector of values of the target variable, respectively. Leaving the default values (NA), the function automatically retrieves these values from the information found in the dataset. However, if the information in the dataset is not accurate, this could cause unexpected results in the SD algorithms.
Once the dataset is loaded, it is possible to view a simple summary of its contents by using the usual summary() function:
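summary(irisFromDataFrame)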
## Summary of the SDEFSR_Dataset object: 'irisFromDataFrame'
## - relation: iris
## - nVars: 4
## - Ns: 150
## - attributeNames: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
## - class_names: setosa, versicolor, virginica
## - examplesPerClass: 50, 50, 50
Any of these values can be obtained individually using the $ operator on the SDEFSR_Dataset object:
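irisFromDataFrame$nVars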
## [1] 4
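irisFromDataFrame$attributeNames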
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
It is also possible to print the dataset as a data.frame with the print() function.
Once our dataset is ready to be used, it is time to execute a subgroup discovery algorithm; for example, the MESDIF algorithm. The steps are the same for the rest of the algorithms, with only a few different parameters. If you need help with the parameters, refer to the help page of each function using help(function) or ?function.
It is possible to execute an SD algorithm in two ways: through a parameters file, specifying as argument the path to such a file, or by entering all the parameter names and values at the command line. You can find the structure of the parameters file, among other useful information, on the help pages of each function.
Assuming the SDEFSR_Dataset object irisFromDataFrame loaded before, and that the params.txt parameters file is stored in the working directory, the easiest way to run the MESDIF() algorithm, for example, is shown below (here all parameters are entered at the command line, so paramFile is NULL). When the algorithm is executed, the results shown in the console are divided into three sections:
ruleSet <- MESDIF(paramFile = NULL,
training = irisFromDataFrame,
test = NULL,
output = c("optionsFile.txt", "rulesFile.txt", "testQM.txt"),
seed = 0,
nLabels = 3,
nEval = 300,
popLength = 100,
eliteLength = 2,
crossProb = 0.6,
mutProb = 0.01,
RulesRep = "can",
Obj1 = "CSUP",
Obj2 = "CCNF",
Obj3 = "null",
Obj4 = "null",
targetVariable = "Species",
targetClass = "virginica")
## --------------------------------
## Algorithm: MESDIF
## Relation: iris
## Training dataset: training
## Test dataset: test
## Rules Representation: CAN
## Number of evaluations: 300
## Number of fuzzy partitions: 3
## Population Length: 100
## Elite Population Length: 2
## Crossover Probability: 0.6
## Mutation Probability: 0.01
## Obj1: CSUP (Weigth: )
## Obj2: CCNF (Weigth: )
## Obj3: null (Weigth: )
## Obj4: null
## Number of examples in training: 150
## Number of examples in test: 150
## Target Variable: Species
## Target Class: virginica
## --------------------------------
##
##
## Searching rules for only one value of the target class...
##
## GENERATED RULE 1
## Variable Sepal.Width = Label 1 ( 2 , 3.2 , 4.4 )
## THEN virginica
##
## GENERATED RULE 2
## Variable Petal.Width = Label 2 ( 1.3 , 2.5 , 3.7 )
## THEN virginica
##
##
## Testing rules...
## Rule 1 :
## - N_vars: 2
## - Coverage: 0.8
## - Significance: 0.602743
## - Unusualness: 0.02
## - Accuracy: 0.357724
## - CSupport: 0.286667
## - FSupport: 0.245
## - CConfidence: 0.358333
## - FConfidence: 0.3528
## - True Positive Rate: 0.86
## - False Positive Rate: 0.77
## Rule 2 :
## - N_vars: 2
## - Coverage: 0.193333
## - Significance: 27.673033
## - Unusualness: 0.128889
## - Accuracy: 0.9375
## - CSupport: 0.193333
## - FSupport: 0.201667
## - CConfidence: 1
## - FConfidence: 0.889706
## - True Positive Rate: 0.58
## - False Positive Rate: 0
## Global:TRUE
## - N_rules: 2
## - N_vars: 2
## - Coverage: 0.496666
## - Significance: 14.137888
## - Unusualness: 0.074444
## - Accuracy: 0.647612
## - CSupport: 0.3
## - FSupport: 0.223334
## - FConfidence: 0.621253
## - CConfidence: 0.679166
## - True Positive Rate: 0.72
## - False Positive Rate: 0.385
The output has three clearly defined sections: an echo of the parameters used in the execution, the rules generated, and the quality measures obtained by each rule (and by the whole rule set) on the test data.
As this output can be extremely large, the function also saves it to three files, one for each of the above sections. By default, these files are named optionsFile.txt, rulesFile.txt and testQM.txt, and they are saved into the working directory, overwriting any existing files. The format of these files is identical to the output generated by the algorithm, divided into the sections described above.
After its execution, an SD algorithm returns a SDEFSR_Rules object containing the rules obtained. Following the example, with the ruleSet object obtained we can draw a TPR vs FPR plot to view the reliability and generality of the rules: reliable rules have low FPR and high TPR values, while overly general rules have high values of both. To plot the rules, we use the plot() function:
library(ggplot2)
plot(ruleSet)
Additionally, we can directly order the rule set by a quality measure with the function orderRules, which returns another SDEFSR_Rules object with the sorted rules. By default, the ordering is done by confidence.
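For instance, a minimal call (assuming the rule set as the only argument, so that the default measure, confidence, is used):

orderedRules <- orderRules(ruleSet)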
Filtering rules by the number of attribute-value pairs, or keeping only those rules with a quality measure greater than a given threshold, is an interesting functionality for retaining high-quality rules only. Such filtering operations are quite simple to apply in SDEFSR: using the subset operator ('[]') and introducing the filtering conditions generates a new SDEFSR_Rules object with the result:
#Apply filter by unusualness
filteredRules <- ruleSet[Unusualness > 0.05]
#We only check whether the number of rules decreases.
#In this case, the value must be 1.
length(filteredRules)
## [1] 1
#The filter can also be made as complex as needed.
#Filter by Unusualness, TPr and number of variables:
filteredRules <- ruleSet[Unusualness > 0.05 & TPr > 0.9 & nVars == 3]
#In this case, none of the rules match the conditions. Therefore, the
#number of rules must be 0.
length(filteredRules)
## [1] 0
As mentioned at the beginning of this paper, the SDEFSR package provides the user with an interface to perform the subgroup discovery task in an easier way, as well as to do some basic exploratory analysis tasks.
The user interface can be used in two ways. One of them is the web interface available at http://sdrinterface.shinyapps.io/shiny, which has the advantage of running the algorithms without having R installed on our system, and also avoids spending processing time on our machine. The other way is to use the interface locally, on our own server. This is possible by simply running:
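SDEFSR_GUI()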
This function launches the user interface in the default web browser. As shown in the corresponding figure, the page is organized into an options panel on the left and a tabbed panel on the right, where the results are shown when the algorithms are executed.
The first thing to do is to load a KEEL dataset. If we want to execute a subgroup discovery algorithm, we must load a training and a test file having the same '@relation' field in the KEEL dataset file.
Once the dataset is selected, a graph with information about the last variable defined in the dataset is automatically shown. The graph shows the distribution of examples having each value of the variable. To the right of this graph there is a table with some basic information, which is more extensive if the variable is numerical.
After the execution of a subgroup discovery algorithm, the interface switches automatically to the 'Rules generated' tab, which contains a table with all the subgroups generated. Rules can be filtered, by variable for example, by typing into the 'Search' field.
The 'Execution Info' tab shows an echo of the parameters used to launch the algorithm. This echo is equal to the one shown in the R console.
The 'Test Quality Measures' tab shows a table with the quality measures of every rule applied to the test dataset. The last row is a summary of the results, showing the number of rules and the average value of every quality measure.
In this paper the SDEFSR package has been introduced. This package provides four subgroup discovery algorithms without any dependencies on other tools or packages. It also provides the possibility of reading and loading datasets in the KEEL dataset format, as well as reading datasets in ARFF format or loading them from a data.frame object. Finally, a web-based interface has been developed to make the work easier, even when R is not installed on our system.
The development of the package will continue in the future, including more functionality for working with KEEL datasets, adding new subgroup discovery algorithms and improving the web interface. There is plenty of work ahead, so we encourage other developers to participate by adding tools or algorithms to the package, as we will do in future releases.