Subgroup Discovery with Evolutionary Fuzzy Systems in R: the SDEFSR Package

Introduction

Subgroup discovery (SD) is a data mining field that aims to describe data using supervised learning techniques. The goal is to find simple, general and interesting patterns with respect to a given variable of interest. Throughout the literature, SD has been applied with success to different real-world problems in areas such as marketing, medicine and e-learning, among others.

SD is a rule learning process within a complex search space. Therefore, the search strategy used becomes a key factor in the efficiency of the method. Different strategies can be found in the literature, such as beam search in the CN2-SD and Apriori-SD algorithms, exhaustive algorithms such as SDMap, or Evolutionary Algorithms (EAs), for example.

EAs are stochastic algorithms for optimization and search tasks, based on the natural evolution process. There are different paradigms within EAs: genetic algorithms, evolution strategies, evolutionary programming and genetic programming. When these methods use rules as their knowledge representation they are known as evolutionary rule-based systems, which have the advantage of allowing the inclusion of domain knowledge and returning better rules. EAs are very well suited to SD because they perform a global search of the space in a convenient way, encouraging the discovery of simple, interesting and precise subgroups.

Nowadays, several frameworks support data mining tasks, but only a few of them provide implementations of SD algorithms. The best-known frameworks with SD algorithms are KEEL and VIKAMINE, but ORANGE and CORTANA provide some implementations too. In fact, VIKAMINE also provides an R package called rsubgroup, which is an R interface to the VIKAMINE Java algorithms.

In this contribution the SDEFSR package is introduced. It provides the user with the most important evolutionary fuzzy rule-based methods for SD documented in the literature, being a major contribution to the R community. In addition, it brings the capability of reading datasets in the KEEL data format, which is not otherwise supported in R. Similarly, this package provides a Graphical User Interface (GUI) to make this task easier for the user, especially the novice one.

This contribution is organized as follows: the first section presents the main concepts, properties and features of SD. The second section describes the structure of the SDEFSR package and its operations. The third section presents a complete usage example of the package. Finally, the GUI of SDEFSR is shown.

Subgroup Discovery

SD was defined by as:

In subgroup discovery, we assume we are given a so-called population of individuals (objects, customer, . . .) and a property of those individuals we are interested in. The task of subgroup discovery is then to discover the subgroups of the population that are statistically “most interesting”, i.e. are as large as possible and have the most unusual statistical (distributional) characteristics with respect to the property of interest.

SD tries to find relations between different properties of a set of individuals with respect to one interesting or target variable. Such relations must be statistically interesting; it is not necessary to find complete relations, as partial relations can be interesting too.

Usually these relations are represented by means of rules, one per relation. These rules are defined as: where Targetvalue is the value of the variable of interest (target variable) for the SD task and Cond is normally a conjunction of attribute-value pairs which describe the characteristics of the induced subgroup.

SD is halfway between description and classification, because it has a target variable but its objective is not to predict it but rather to describe it. The use of a target variable is not possible in description, which simply finds relationships between unlabeled objects.

A key point to fully understand the goal of SD is how it differs from the objective of classification. Classification is a predictive task that tries to split the entire search space, usually in a complex way, aiming to find the correct value of the variable for new incoming instances. On the other hand, SD aims to find interesting relations among these instances with respect to the variable of interest.

For instance, assume there is a group of patients and the variable of interest is whether they have heart disease or not. The predictive data mining objective is to predict whether new patients will have heart disease. SD, however, tries to find which subgroups of patients are more likely to have heart disease according to certain characteristics. This is relevant, for instance, for developing treatments targeted at those characteristics.

Main elements of subgroup discovery algorithms

Below, the most relevant aspects of SD algorithms are presented:

Quality measures for subgroup discovery

A quality measure tries to quantify the interestingness of a given rule or subgroup, but there is no formal definition of what interestingness is. However, it can be described as a concept that emphasizes conciseness, coverage, reliability, peculiarity, diversity, novelty, surprisingness, utility, and actionability. For SD, the criteria most used to measure the interestingness of a rule are conciseness, generality or coverage, reliability, and novelty.

Quality measures that fulfil these criteria and are available in the SDEFSR package are:

A more extended classification and definition of quality measures for SD is available in and .
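As an illustration of how some of these measures relate, the following sketch computes coverage, support, confidence and unusualness (WRAcc) from basic counts. The counts are made up for the example, and the formulas follow the usual formulations in the SD literature; they are not necessarily the package's internal code.

```r
# Toy computation of common SD quality measures from basic counts.
# These are illustrative figures, not taken from any real dataset.
n   <- 100  # total number of examples
nC  <- 40   # examples belonging to the target class
nA  <- 25   # examples covered by the rule (antecedent holds)
nAC <- 20   # covered examples that also belong to the target class

coverage    <- nA / n                          # generality of the rule
support     <- nAC / n
confidence  <- nAC / nA                        # reliability
unusualness <- (nA / n) * (nAC / nA - nC / n)  # WRAcc: novelty

c(coverage = coverage, support = support,
  confidence = confidence, unusualness = unusualness)
```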

Evolutionary fuzzy systems

Evolutionary fuzzy systems (EFS) are the union of two powerful tools for approximate reasoning: evolutionary algorithms and fuzzy logic.

On one side, evolutionary algorithms are stochastic algorithms, based on natural evolution, that solve complex optimization problems. They work on a population of representations of possible solutions, called chromosomes. A basic evolutionary algorithm scheme is:
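The scheme follows the classic generate-evaluate-select-recombine loop. A minimal, self-contained sketch on a toy "one-max" problem (maximize the number of ones in a binary chromosome) could look like this; it is purely illustrative and unrelated to the package's internal code:

```r
# Minimal evolutionary algorithm sketch on the one-max toy problem.
set.seed(42)
pop_size <- 20; chrom_len <- 10; n_gens <- 50
fitness <- function(ch) sum(ch)

# 1. Create an initial random population of binary chromosomes
pop <- replicate(pop_size, sample(0:1, chrom_len, replace = TRUE),
                 simplify = FALSE)

for (g in seq_len(n_gens)) {
  fit <- vapply(pop, fitness, numeric(1))
  # 2. Select parents by binary tournament
  select <- function() {
    i <- sample(pop_size, 2)
    pop[[i[which.max(fit[i])]]]
  }
  # 3. Crossover (one point) and 4. mutation (bit flip)
  offspring <- lapply(seq_len(pop_size), function(k) {
    p1 <- select(); p2 <- select()
    cut <- sample(chrom_len - 1, 1)
    child <- c(p1[1:cut], p2[(cut + 1):chrom_len])
    flip <- runif(chrom_len) < 0.05
    child[flip] <- 1 - child[flip]
    child
  })
  # 5. Elitist replacement: keep the best of parents and offspring
  all_ind <- c(pop, offspring)
  all_fit <- vapply(all_ind, fitness, numeric(1))
  pop <- all_ind[order(all_fit, decreasing = TRUE)][1:pop_size]
}
best <- pop[[1]]  # best chromosome found
```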

These algorithms efficiently perform a global stochastic search through a huge search space. However, they may not find an optimal solution (the global optimum), but a good one (a local optimum) that can also solve the problem. They are well suited for SD because the problem of finding subgroups can be formulated as an optimization problem, coding rules as parameter structures that optimize some measures. Additionally, different kinds of rules exist in SD (with inequalities, with intervals, fuzzy rules, etc.). This can change the size of the search space dramatically, and genetic algorithms can adapt to these structures easily without performance degradation. Likewise, the choice of genetic operators makes a GA a great candidate for introducing expert knowledge.

On the other side, fuzzy logic is an extension of traditional set theory. Its main objective is to model and deal with imprecise knowledge, which is closer to human reasoning. The main difference with traditional set theory is that the membership degree is not zero or one, but a real value in [0,1]. This allows fuzzy limits to be defined and fuzzy sets to overlap.

A fuzzy variable is a set of linguistic labels, e.g. low, medium and high, which are defined by overlapping fuzzy sets. This representation is closer to human reasoning, and the membership degree of each value can be computed precisely. This expressivity allows simpler rules to be obtained, because continuous variables become more understandable for humans. A rule can be a set of fuzzy variables. To determine whether a rule covers an example, it is necessary to compute the membership degree of each variable in the rule with respect to the example. If all variables have a membership degree greater than zero, the example is covered.

Evolutionary fuzzy systems are the union of both techniques and can work in three ways:

The SDEFSR package

SDEFSR is a package entirely written in R from scratch. It includes the most important EFSs for SD presented in the literature to date. In addition, SDEFSR can read data in different standard, well-known formats such as ARFF (WEKA), KEEL, CSV and data.frame objects. Similarly, all functions included in the SDEFSR package have default parameter values recommended by their authors. This allows the algorithms to be executed easily, without end users needing to know the parameters.

Algorithms of the package

Now, we describe the algorithms that are available in our package: SDIGA, MESDIF, NMEEF-SD and FuGePSD. Later sections describe how to use them.

SDIGA (Subgroup Discovery Iterative Genetic Algorithm)

The algorithm SDIGA is a subgroup discovery algorithm that extracts rules with two possible representations: one with conjunctions of attribute-value pairs (called canonical representation) and one with a disjunctive normal form (DNF) in the antecedent. It follows an iterative schema with a genetic algorithm at its core to extract those rules. This genetic algorithm extracts only one rule, the best of the population, and after the extraction a local optimization can be performed in order to improve the generality of the rule. As the algorithm follows an iterative schema, the genetic algorithm is executed repeatedly until a stopping criterion is reached: the rule obtained by the genetic algorithm after the local improvement phase must cover at least one example not covered by preceding rules, and it must have a minimum confidence (see Equation ).
SDIGA can work with missing data (represented by the maximum value of the variable + 1), categorical variables and numerical ones, using fuzzy logic with the latter to improve the interpretability of the results.

Components of genetic algorithm of SDIGA

As mentioned above, a genetic algorithm is the core of SDIGA. This genetic algorithm is responsible for extracting one rule per execution, the best of the population at the end of the evolutionary process.

Representation schema

Each chromosome in the population represents a possible rule, but only its antecedent part, because the consequent is prefixed. SDIGA can handle two types of representation, as mentioned above: canonical and DNF.
The canonical representation is a numerical vector of fixed length, equal to the number of variables, with possible values in the range [0, max], where max is the number of possible values for categorical variables or the number of fuzzy sets defined for numerical variables. The max value represents the non-participation of that variable in the rule.
The DNF representation is a binary vector, also of fixed length, equal to the sum of the numbers of values/fuzzy sets of all variables. Here, a variable does not participate in a rule when all of its values are equal to zero or all are equal to one.
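As a concrete illustration of the two encodings (with made-up variable names; the actual internal encoding used by SDEFSR may differ):

```r
# Toy dataset description: a categorical variable with 3 values and a
# numeric variable covered by 3 fuzzy labels (names are made up).
n_values <- c(color = 3, height = 3)

# Canonical: one gene per variable, in [0, max]; the value 'max' encodes
# non-participation of that variable in the rule.
canonical <- c(color = 1, height = 3)  # 'height' does not participate

# DNF: one bit per value/label; a variable does not participate when all
# of its bits are 0 or all are 1.
dnf <- c(1, 0, 0,  0, 0, 0)  # genes 1-3: color, genes 4-6: height

participates_can <- canonical < n_values
participates_dnf <- function(bits) !(all(bits == 0) || all(bits == 1))
```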

Crossover operator

SDIGA follows a purely stationary schema that crosses only the two best chromosomes in each iteration of the evolutionary process. The crossover is performed by a two-point crossover operator.

Mutation operator
The mutation operator is applied over the whole population of parents and the chromosomes generated by crossover, with the mutation probability applied at every gene. This operator can be applied in two ways (both with the same probability):
Fitness function
The function defining the quality of a chromosome in SDIGA is a weighted average of three of the quality measures described in Section . The function to maximize is:
where:

As these rules use fuzzy logic, we need an expression to determine whether an example is covered by a rule and to compute the membership degree of that example to the rule. This is given by the Antecedent Part Compatibility (APC) expression, calculated as: where:

μni will be one or zero if the variable is categorical, depending on whether its value matches the rule or not. For a numerical variable, μni is computed with the triangular membership function.
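The triangular membership degree can be sketched as follows. This is an illustrative implementation, not the package's code; a, b and c denote the left point, the peak and the right point of the triangle, mirroring the label definitions printed by the package, e.g. Label 2 ( 1.3 , 2.5 , 3.7 ).

```r
# Triangular membership function (illustrative sketch).
triangular_degree <- function(x, a, b, c) {
  ifelse(x <= a | x >= c, 0,
         ifelse(x <= b, (x - a) / (b - a), (c - x) / (c - b)))
}

# Membership of the value 1.9 in the fuzzy set defined by (1.3, 2.5, 3.7)
triangular_degree(1.9, 1.3, 2.5, 3.7)  # 0.5: halfway up the left slope
```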

Replace operator

To obtain the population for the next iteration, a replacement operator is applied. This operator sorts the population by fitness value and keeps only the n best chromosomes, n being the size of the population.

Local optimization

After the genetic algorithm returns a rule, it can be improved by means of a local optimization based on a best-first hill-climbing search. The algorithm removes variables one by one and, if the resulting rule has support and confidence greater than or equal to the current one, the process starts again from the rule with that variable removed.
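The improvement loop can be sketched like this. Here rule is a vector of variable indices and quality is a caller-supplied function returning c(support, confidence) for a candidate rule; both names are made up for the sketch.

```r
# Best-first hill climbing over variable elimination (illustrative sketch).
local_improve <- function(rule, quality) {
  repeat {
    base <- quality(rule)
    improved <- FALSE
    for (i in seq_along(rule)) {
      candidate <- rule[-i]
      # keep the simpler rule if support and confidence do not worsen
      if (length(candidate) > 0 && all(quality(candidate) >= base)) {
        rule <- candidate
        improved <- TRUE
        break  # "first best": restart the search from the new rule
      }
    }
    if (!improved) return(rule)
  }
}

# With a toy quality that prefers shorter rules, all but one variable
# is eliminated:
local_improve(1:4, function(r) c(1 / length(r), 1))
```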

MESDIF (Multiobjective Evolutionary Subgroup DIscovery Fuzzy rules)

MESDIF is a multi-objective genetic algorithm that extracts fuzzy rules. The representation used can be canonical or DNF (see Section ). The algorithm follows the SPEA2 approach, where an elite population is used throughout the whole evolutionary process. At the end of the process, the rules stored in the elite population are returned as the result.
The multi-objective approach is based on a niching schema built on dominance in the Pareto front. This schema makes it possible to find non-dominated individuals to fill the elite population, which has a fixed size. In Figure  we see a basic schema of the algorithm.

Components of the genetic algorithm.

Now we briefly describe the components of the genetic algorithm to show how it works.

Initial Population generation

The initial population operator of MESDIF produces random chromosomes in two ways: a percentage of the chromosomes is generated fully at random, while the rest are also generated randomly but with the restriction that only a maximum number of variables can participate in the rule. This produces more general rules.

Fitness function
To obtain the fitness value of each chromosome, MESDIF follows these steps:
Truncation operator

As the elite population has a fixed size, a truncation operator is needed when not all non-dominated individuals fit in it. This operator takes all non-dominated individuals and computes the distance between every pair. Then, among the two closest individuals, it eliminates the one whose k-th nearest neighbour is at the smaller distance. This process is repeated until the non-dominated individuals fit in the elite population.
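A simplified sketch of this distance-based truncation follows. It is illustrative only: it drops the individual whose nearest neighbour is closest, whereas the full SPEA2 operator breaks ties using the k-th nearest neighbours.

```r
# Repeatedly remove the individual closest to its nearest neighbour until
# the elite size is reached. 'points' is a matrix of objective values.
truncate_elite <- function(points, elite_size) {
  while (nrow(points) > elite_size) {
    d <- as.matrix(dist(points))   # pairwise Euclidean distances
    diag(d) <- Inf                 # ignore self-distances
    nearest <- apply(d, 1, min)    # distance to each nearest neighbour
    points <- points[-which.min(nearest), , drop = FALSE]
  }
  points
}

# Two nearly identical individuals: one of them is removed first.
elite <- rbind(c(0, 0), c(0, 0.1), c(5, 5), c(10, 10))
truncate_elite(elite, 3)
```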

Fill operator

If the number of non-dominated individuals is smaller than the size of the elite population, it must be filled with dominated individuals. The operator sorts the individuals by fitness value and copies the first n individuals into the elite population, where n is its size.

Genetic operators
The genetic operators of MESDIF are:

NMEEF-SD (Non-dominated Multi-objective Evolutionary algorithm for Extracting Fuzzy rules in Subgroup Discovery)

NMEEF-SD is another multi-objective genetic algorithm, based on the NSGA-II approach. It features a fast sorting algorithm and a coverage-based reinitialisation that is triggered if the population does not evolve for a period.
This algorithm only works with the canonical representation (see Section ). A study presented in reflects the low quality of the rules obtained with a DNF representation.

Evolutionary process

The evolutionary process follows these steps:

Genetic operators

The genetic operators of NMEEF-SD are a two-point crossover operator (see Section ) and a biased mutation operator (see Section ).

Reinitialisation operator

This operator is applied if the Pareto front does not cover any new example during 5% of the total number of evaluations. The algorithm then copies the non-duplicated individuals of the Pareto front to Pt + 1, and the rest of the individuals are generated by trying to cover one example of the target class with a maximum number of variables.

FuGePSD

FuGePSD is another genetic algorithm that finds interesting subgroups of a given class. It uses a genetic programming approach, where individuals are represented as trees of variable length instead of vectors, and the consequent of the rule is represented too. This schema has the advantage of obtaining rules for all possible values of a given target variable in one execution. Furthermore, FuGePSD has a variable population length, which can change over the evolutionary process based on a cooperative-competitive approach: the token competition.

Evolutionary process

The evolutionary process of FuGePSD works as follows:

The key function in this scheme is the token competition procedure, which promotes diversity in the population, while some of the genetic operators bring specificity to the rules.

Genetic operators

The genetic operators used by FuGePSD are:

These genetic operators are applied with a probability given by the user.

Token Competition procedure

The token competition is the key procedure of FuGePSD; it brings diversity to the population while keeping the best individuals. Let a token be an example of the dataset: each individual can seize a token if its rule covers that example. When this occurs, a flag is set on the token and it cannot be seized by other rules.

So, the token competition works as follows:
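The procedure can be sketched in R as follows, with covers an assumed rules-by-examples logical matrix and fitness a numeric vector (both made up for the sketch):

```r
# Token competition (illustrative sketch): rules seize tokens (examples)
# in decreasing fitness order; rules that seize no token are discarded.
token_competition <- function(covers, fitness) {
  taken <- rep(FALSE, ncol(covers))   # no token seized yet
  keep  <- logical(nrow(covers))
  for (r in order(fitness, decreasing = TRUE)) {
    seized <- covers[r, ] & !taken
    if (any(seized)) {                # the rule seizes at least one token
      keep[r] <- TRUE
      taken <- taken | seized
    }
  }
  keep
}

# Rule 2 only covers an example already seized by rule 1, so it is dropped.
covers <- rbind(c(TRUE, TRUE, FALSE),
                c(TRUE, FALSE, FALSE),
                c(FALSE, FALSE, TRUE))
token_competition(covers, fitness = c(3, 2, 1))  # TRUE FALSE TRUE
```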

Screening Function

At the end of the evolutionary process, the screening function is launched to return only rules of user-defined quality. These rules must reach a minimum level of confidence and sensitivity. The function has an external parameter (ALL_CLASS): if true, the algorithm returns at least one rule per value of the target variable; if false, it returns at least the best rule of the population.

Package architecture

The main advantage of the SDEFSR package is that it provides the evolutionary fuzzy systems for SD that exist in the literature. The algorithms included are an important kind of SD method not available in R until now. Therefore, this package offers the R community a brand new possibility for data mining and data analysis.

The base of the package is defined by two S3 classes. These classes are:

Additionally, the package has a general function called read.dataset() that reads datasets in ARFF, KEEL or CSV format, and the function SDEFSR_DatasetFromDataFrame() to transform a data.frame into a SDEFSR_Dataset object.

Use of SDEFSR package

In this section we explain how to use the package. SDEFSR aims to make subgroup discovery algorithms really simple to use, with minimal dependencies.

Installing and load the package.

The SDEFSR package is available on CRAN, so it can be installed like any other package by simply typing:

install.packages("SDEFSR")

The development version is also available on GitHub at http://github.com/aklxao2/SDEFSR; feel free to clone and contribute if you wish. To use the development version, you need the devtools package and its install_github() function:

devtools::install_github('aklxao2/SDEFSR')

SDEFSR depends only on the shiny package, which is necessary to run the user interface locally; if shiny is not installed, the user is asked to install it when running the SDEFSR_GUI() function. Once installed, the package has to be loaded before use, through library() or require(). After loading the package, six datasets are available as SDEFSR_Dataset objects: carTra, carTst, germanTra, germanTst, habermanTra and habermanTst, corresponding to the car, german and haberman training and test datasets. Also, the rules generated by the SDIGA algorithm with default parameters over the haberman dataset are loaded as habermanRules, an SDEFSR_Rules object.

Loading a dataset

In order to use the SD algorithms available in the SDEFSR package, an SDEFSR_Dataset object is necessary. This object can be generated using the read.dataset() function, which reads ".dat" files in the KEEL data mining tool format, ARFF files (".arff") from WEKA, and even CSV files (".csv"). Assuming the files iris.dat, iris.arff and iris.csv, corresponding to the classical iris dataset in KEEL, ARFF and CSV formats respectively, are in the working directory, they can be loaded as follows:

irisFromKEEL <- read.dataset("iris.dat") 
irisFromARFF <- read.dataset("iris.arff")
irisFromCSV <- read.dataset("iris.csv")

Note that the function detects the type of data by the file extension. To read CSV files, the function has optional parameters that define the field separator, the decimal separator, the quote character and the NA identifier. Their defaults are sep = ",", quote = "\"", dec = "." and na.strings = "?" respectively. It is important to remark that sparse ARFF data is not supported.

If the dataset is not available in these formats, it is possible to obtain an SDEFSR_Dataset object from a data.frame. The data.frame could be loaded with read.table() or similar functions, and then given to the SDEFSR_DatasetFromDataFrame() function. This also allows the creation of datasets on the fly, as in this example:

library(SDEFSR)
df <- data.frame(matrix(data = runif(1000), ncol = 10))
#Add class attribute (It must be the last attribute and it must be categorical)
df[,11] <- c("Class 0", "Class 1", "Class 2", "Class 3")
SDEFSR_DatasetObject <- SDEFSR_DatasetFromDataFrame(df, relation = "random")
#Load from iris dataset
irisFromDataFrame <- SDEFSR_DatasetFromDataFrame(iris, "iris")

This will assign to SDEFSR_DatasetObject a new dataset created randomly with 100 examples and 11 attributes. Note that the target variable must be categorical, because the SD algorithms can only deal with categorical target variables.

The SDEFSR_DatasetFromDataFrame() function has three additional parameters: names, types, and classNames. These allow the manual assignment of attribute names, their types, and the vector of values of the target variable, respectively. With the default values (NA), the function automatically retrieves this information from the dataset. However, if the information in the dataset is not accurate, it could cause unexpected results for the SD algorithms.

Obtaining information from a loaded dataset

Once the dataset is loaded, it is possible to view a simple summary of its content by using the usual summary() function:

summary(irisFromDataFrame)
## Summary of the SDEFSR_Dataset object: 'irisFromDataFrame'
##  - relation: iris
##  - nVars: 4
##  - Ns: 150
##  - attributeNames: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
##  - class_names: setosa, versicolor, virginica
##  - examplesPerClass: 50, 50, 50

Any of these values can be obtained individually using the $ operator on the SDEFSR_Dataset object:

irisFromDataFrame$nVars
## [1] 4
irisFromDataFrame$attributeNames
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

Also, it is possible to print the dataset as a data.frame with the print() function.

Executing Subgroup Discovery algorithms

Once our datasets are ready, it is time to execute a subgroup discovery algorithm, for example MESDIF. For the other algorithms the steps are the same and only a few parameters differ; if you need help with the parameters, refer to the help page of the function using help(function) or ?function.
It is possible to execute an SD algorithm in two ways: through a parameters file, specifying as argument the path to such a file, or by entering all the parameter names and values at the command line. You can find the structure of the parameters file, among other useful information, on the help pages of each function.

Assuming the SDEFSR_Dataset object irisFromDataFrame loaded before, and the parameters file params.txt stored in the working directory, the easiest ways to run the MESDIF() algorithm are the following. When the algorithm is executed, the results are shown in the console:

ruleSet <- MESDIF(paramFile = "params.txt")
#Or, equivalently, specifying every parameter explicitly:
ruleSet <- MESDIF(paramFile = NULL, 
                  training = irisFromDataFrame, 
                  test = NULL,
                  output = c("optionsFile.txt", "rulesFile.txt", "testQM.txt"),
                  seed = 0,
                  nLabels = 3,
                  nEval = 300,
                  popLength = 100,
                  eliteLength = 2,
                  crossProb = 0.6,
                  mutProb = 0.01,
                  RulesRep = "can",
                  Obj1 = "CSUP",
                  Obj2 = "CCNF",
                  Obj3 = "null",
                  Obj4 = "null",
                  targetVariable = "Species",
                  targetClass = "virginica")
## --------------------------------
## Algorithm: MESDIF
## Relation: iris
## Training dataset: training
## Test dataset: test
## Rules Representation:  CAN
## Number of evaluations: 300
## Number of fuzzy partitions: 3
## Population Length: 100
## Elite Population Length: 2
## Crossover Probability: 0.6
## Mutation Probability: 0.01
## Obj1: CSUP   (Weigth:  )
## Obj2: CCNF   (Weigth:  )
## Obj3: null   (Weigth:  )
## Obj4: null
## Number of examples in training: 150
## Number of examples in test: 150
## Target Variable: Species
## Target Class: virginica
## --------------------------------
## 
## 
## Searching rules for only one value of the target class...
## 
## GENERATED RULE 1
## Variable Sepal.Width = Label 1 ( 2 , 3.2 , 4.4 ) 
##  THEN virginica
## 
## GENERATED RULE 2
## Variable Petal.Width = Label 2 ( 1.3 , 2.5 , 3.7 ) 
##  THEN virginica
## 
## 
## Testing rules...
## Rule 1 :
##   - N_vars: 2
##   - Coverage: 0.8
##   - Significance: 0.602743
##   - Unusualness: 0.02
##   - Accuracy: 0.357724
##   - CSupport: 0.286667
##   - FSupport: 0.245
##   - CConfidence: 0.358333
##   - FConfidence: 0.3528
##   - True Positive Rate: 0.86
##   - False Positive Rate: 0.77
## Rule 2 :
##   - N_vars: 2
##   - Coverage: 0.193333
##   - Significance: 27.673033
##   - Unusualness: 0.128889
##   - Accuracy: 0.9375
##   - CSupport: 0.193333
##   - FSupport: 0.201667
##   - CConfidence: 1
##   - FConfidence: 0.889706
##   - True Positive Rate: 0.58
##   - False Positive Rate: 0
## Global:TRUE
##   - N_rules: 2
##   - N_vars: 2
##   - Coverage: 0.496666
##   - Significance: 14.137888
##   - Unusualness: 0.074444
##   - Accuracy: 0.647612
##   - CSupport: 0.3
##   - FSupport: 0.223334
##   - FConfidence: 0.621253
##   - CConfidence: 0.679166
##   - True Positive Rate: 0.72
##   - False Positive Rate: 0.385
The output has three clearly defined sections:

As this output can be extremely large, the function also saves it to three files, one for each of the above sections. By default these files are named optionsFile.txt, rulesFile.txt and testQM.txt and are saved into the working directory, overwriting existing files. The format of these files is identical to the output generated by the algorithm, divided into the sections described above.

Analyzing the rules obtained

After its execution, an SD algorithm returns an SDEFSR_Rules object that contains the rules obtained. Following the example, with the ruleSet object we can draw a TPR vs FPR plot to inspect the reliability and generality of the rules. Reliable rules have low FPR and high TPR values, while overly general rules have high values of both. To plot the rules, we can use the plot() function:

library(ggplot2)
plot(ruleSet)

Additionally, we can order the rule set by a quality measure with the orderRules function, which returns another SDEFSR_Rules object with the sorted rules. By default, the ordering is done by confidence.

rulesOrderedBySignificance <- orderRules(ruleSet, by = "Significance")

Filtering rules by number of attribute-value pairs, or keeping only those rules with a quality measure greater than a given threshold, is useful functionality for retaining only high-quality rules. Such filtering operations are quite simple to apply in SDEFSR: using the subset operator ('[]') with the filtering conditions generates a new SDEFSR_Rules object with the result:

#Apply filter by unusualness
filteredRules <- ruleSet[Unusualness > 0.05]
#We check that the number of rules decreases.
#In this case, the value must be 1.
length(filteredRules)
## [1] 1
#Also, you can make the filter as complex as you can
#Filter by Unusualness, TPr and number of variables:
filteredRules <- ruleSet[Unusualness > 0.05 & TPr > 0.9 & nVars == 3]
#In this case, none of the rules match the conditions. Therefore, the
#number of rules must be 0.
length(filteredRules) 
## [1] 0

The user interface

As mentioned at the beginning of this paper, the SDEFSR package provides the user with an interface for performing the subgroup discovery task in an easier way, and also for doing some basic exploratory analysis tasks.
The user interface can be used in two ways. One of them is via the web at http://sdrinterface.shinyapps.io/shiny; this has the advantage of using the algorithms without having R installed on our system and avoids spending processing time on our machine. The other way is to use the interface locally, on our own local server. This is possible simply by typing:

SDEFSR_GUI()

This function launches the user interface in our default web browser, as we can see in Figure . The page is organized into an options panel on the left and a tabbed panel on the right, where the results are shown when the algorithms are executed.
The first thing we have to do is load a KEEL dataset. If we want to execute a subgroup discovery algorithm, we must load a training file and a test file having the same '@relation' field in the KEEL dataset file.
Once we select the dataset, a graph with information about the last variable defined in the dataset is automatically shown. The graph shows the distribution of examples having certain values of the variable. To the right of this graphic we can see a table with some basic information, more extensive if the variable is numerical.

Once a dataset is loaded we can do many things; Figure  shows all the possibilities:

After the execution of a subgroup discovery algorithm, we automatically go to the 'Rules generated' tab, which contains a table with all the subgroups generated. If we wish, we can filter the rules, for example by variable, typing into the 'Search' field.
The 'Execution Info' tab shows an echo of the parameters used to launch the algorithm. This echo is the same as the one we can see in the R console.
The 'Test Quality Measures' tab shows a table with the quality measures of every rule applied to the test dataset. The last row is a summary of the results and shows the number of rules and the average of every quality measure.

Summary

In this paper the SDEFSR package has been introduced. The package provides four subgroup discovery algorithms without depending on other tools or packages. It also provides the possibility of reading and loading datasets in the KEEL dataset format, as well as in ARFF format or from a data.frame object. Finally, a web-based interface has been developed to make the work easier, even if R is not installed on our system.
The development of the package will continue in the future, including more functionality for working with KEEL datasets, adding new subgroup discovery algorithms and improving the web interface. As we can see, we have a great deal of work ahead, so we encourage other developers to participate by adding tools or algorithms to the package, as we will do in future releases.