1. INTRODUCTION
One of the greatest problems that Brazilian electric power distribution companies have to deal with is commercial losses, which result mainly from consumer fraud. To reduce these losses, the companies carry out on-site inspections to try to detect fraud. The inspections are made by technicians who visit the consumer and evaluate the equipment and electrical connections. Generally, company specialists indicate which consumers must undergo an inspection. However, due to the large number of consumers, it is almost impossible to evaluate the behavior of each one and indicate those suspected of fraud. It is also not viable to inspect all the units, because the number of fraudulent consumers is small compared to the total number of clients. This problem is present in all consumer classes, from residential to industrial. Still, high voltage electricity consumers account for major financial losses because of their high energy consumption and their separate fees for electrical demand (kW) and consumption (kWh). On the other hand, electric power distribution companies store client information in their databases. This information can be used as input to a data mining system that identifies clients with suspicious behavior, that is, good candidates for inspection. There are many works on fraud detection, but only a few of them address fraud detection for electricity consumers, using data mining techniques such as Decision Trees and Rough Sets. Basically, the methodology of those works is:
1. Generate a sample database of normal and fraudulent consumers;
2. Preprocess the data for the data mining tools;
3. Apply the tools to create decision rules;
4. Verify which consumers fit the rules with a “fraud” decision and inspect them.
However, some important characteristics distinguish high voltage from low voltage consumers and make the methodology cited above impossible to apply. First, the number of high voltage electricity consumers (mainly industries) is small, which makes telemetering economically feasible, gives the opportunity to follow other variables in addition to consumption (kWh), and makes it possible to inspect all the clients over a relatively short period of time (annually). Second, committing fraud on a high voltage connection is a complex and dangerous act, so the number of detected cases is small, and creating a consistent database of fraudulent consumers and deriving rules from it are almost impossible. Finally, the clients must inform the maximum electrical demand (contracted demand) that they will need each month, information that can be compared with the electrical demand actually registered to reveal possible abnormal consumption. The objective of this work is therefore to present a methodology and an identification system for possible fraud by high voltage electricity consumers. First, the methodology, based on an Artificial Intelligence technique called Self-Organizing Maps (SOM), is proposed. After that, some implementation details of the fraud detection system (FDS) are shown. Finally, the validation, results and conclusions are presented.
2. METHODOLOGY
The methodology proposed here is based on a reference model known as Knowledge Discovery in Databases (KDD), widely used in data mining projects. The steps of the methodology are described in the following subsections.
2.1 Choice of the Variables and Data Consolidation
Among the variables (or attributes) considered for each consumer, there are those whose values change with time, called dynamic, and those that are kept unaltered or are rarely updated, called static (or contract variables). The dynamic variables are the most important for fraud detection, because they represent the behavior of the consumer in the time domain. However, because there are new values for each time unit considered, the dynamic variables are more complex to handle and analyze. Table 2.1.1 shows the chosen set of static and dynamic variables. To obtain information about the high voltage consumers, the data collected by the measurement devices (telemeters) installed by the company at each consumer were used, as well as the information in the contracts between the consumers and the electric power company. The telemeter registers the data of each consumer every 15 minutes, which gives 96 registers per day and almost 3,000 per month. Registers are stored on the device and transmitted via GPRS at the end of the month. Transmitting online, or at the end of each day, is still too expensive for the company. The database of a Brazilian electric power distribution company was used in this work. It stores data on approximately 2,000 high voltage consumers, so almost 6 million registers are added to it each month. When the data of a client are selected for a desired time interval (12 months, for example), a large volume of data is returned, and it has to be prepared before the data mining tool of the next step can be applied. The consumption (kWh) variable, with 96 registers per day, was grouped into working weeks (Monday to Friday), that is, blocks of 480 (5 x 96) registers. Register 1 therefore represents the consumption at 00:00:00 on Monday and register 480 the consumption at 23:45:00 on Friday. In this way, client consumption was converted into large weekly registers, each with 480 values. Saturdays and Sundays were excluded because they are atypical days: on these days the client could be consuming normally, partially, or not carrying out any activity at all.
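As an illustration of this consolidation step, the following Python sketch groups 15-minute readings into Monday to Friday vectors of 480 values. It is only a sketch under stated assumptions, not the code used in the FDS: the function name group_into_weeks, the (timestamp, kWh) input format and the decision to discard incomplete weeks are hypothetical choices that the text does not specify. Python numbers positions from 0, so register 1 of the text corresponds to index 0.

```python
from datetime import datetime, timedelta

READINGS_PER_DAY = 96                 # one reading every 15 minutes
WEEK_LENGTH = 5 * READINGS_PER_DAY    # Monday to Friday -> 480 values


def group_into_weeks(readings):
    """Group (timestamp, kwh) pairs into Monday-Friday vectors of 480 values.

    Weekend readings are discarded. Weeks with missing weekday readings are
    dropped (an assumption; the source does not say how gaps are handled).
    Returns a dict that maps the date of each Monday to its list of values.
    """
    weeks = {}
    for ts, kwh in readings:
        if ts.weekday() >= 5:         # 5 = Saturday, 6 = Sunday
            continue
        monday = (ts - timedelta(days=ts.weekday())).date()
        slot = ts.weekday() * READINGS_PER_DAY + (ts.hour * 60 + ts.minute) // 15
        weeks.setdefault(monday, {})[slot] = kwh
    return {
        monday: [slots[i] for i in range(WEEK_LENGTH)]
        for monday, slots in sorted(weeks.items())
        if len(slots) == WEEK_LENGTH
    }
```

Each value of the returned dictionary is one weekly register in the sense used above, ready to be presented to a clustering tool.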
TABLE 2.1.1 CHOSEN VARIABLES TO BE USED IN THE METHODOLOGY
2.2 Self-Organizing Maps (SOM)
SOM is a specific Artificial Neural Network model based on unsupervised learning that maps time-varying inputs according to their graphical representation, allowing the identification of clusters or patterns among the inputs. In other words, given a set of registers that can be visualized graphically, the SOM identifies groups of similar registers (clusters). An important characteristic of the SOM is that it needs no prior information or guidance about the clusters, so it can be used as a tool for identifying standard profiles in data without a classification (or decision) attribute, like the data used here. To illustrate how the SOM was used as a data mining tool in the proposed methodology, the data of one client were selected and grouped by week (Monday to Friday). Figure 2.2.1 illustrates the consumption (kWh) for 68 weeks (the period of data collection). The x-axis shows all 480 values that compose a weekly register, from Monday (2) to Friday (6). The highlighted curve (black) represents the mean consumption over all the weeks (colored). It can be observed that there are many distinct weeks and that the behavior is similar from one day to the next, which is not necessarily the case for all consumers. When these weeks were applied to the SOM, it found 2 clusters. The weeks that compose each cluster can be seen in Figs. 2.2.2 and 2.2.3. Analyzing Cluster 1 (Fig. 2.2.2) and Cluster 2 (Fig. 2.2.3), it is possible to notice that this consumer has a typical profile, represented by Cluster 1 (with 44 weeks), while Cluster 2 is an atypical profile, with a relatively low mean consumption. The graph in Fig. 2.2.4 shows all the weeks in chronological order on the x-axis and the mean consumption of the cluster on the y-axis. Now it can be clearly seen that Cluster 2 represents atypical and sporadic consumption until week 50. After this week, however, it is the only cluster. Since the mean consumption of the weeks in Cluster 2 is 40% of the mean consumption of the weeks in Cluster 1, the suspicion could be raised that the client has been committing some type of electricity fraud since the 50th week. However, immediately assuming fraud may lead to many false positives, because atypical behavior is common for some clients, especially those whose production varies throughout the year due to the nature of their commercial or industrial activity.
Thus, the application of the SOM to this problem needs to be complemented by other operations.
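The clustering idea can be illustrated with a very small SOM written in plain Python. This is a minimal sketch, not the implementation used in this work: the one-dimensional map, the linear initialization between the minimum and maximum of the data, the linearly decaying learning rate and neighborhood width, and all names and default parameter values are assumptions made for the example.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def best_matching_unit(weights, x):
    """Return the index of the node whose prototype is closest to x."""
    return min(range(len(weights)), key=lambda j: squared_distance(weights[j], x))


def train_som(samples, n_nodes=2, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a one-dimensional SOM and return one prototype vector per node."""
    dim = len(samples[0])
    lo = [min(s[d] for s in samples) for d in range(dim)]
    hi = [max(s[d] for s in samples) for d in range(dim)]
    # Assumed initialization: spread the nodes linearly between the
    # per-dimension minimum and maximum of the data.
    weights = [
        [lo[d] + (hi[d] - lo[d]) * j / max(n_nodes - 1, 1) for d in range(dim)]
        for j in range(n_nodes)
    ]
    rng = random.Random(seed)
    order = list(range(len(samples)))
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                     # learning rate shrinks over time
        sigma = max(sigma0 * decay, 0.1)     # neighborhood width shrinks too
        rng.shuffle(order)
        for i in order:
            x = samples[i]
            bmu = best_matching_unit(weights, x)
            for j, w in enumerate(weights):
                # Nodes near the winner on the map are pulled along with it.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights
```

With weekly registers as samples, best_matching_unit(weights, week) gives the cluster of each week, both for the weeks used in training and for weeks that arrive later.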
2.3 Automatic Behavior Analysis
In the same way that the SOM is able to identify the weekly profiles that a consumer presents in a given time interval, it can also classify new weeks according to pre-computed clusters. Based on this, it is proposed that the behavior of a consumer be analyzed as follows:
1. Verify whether there is a consumption drop (negative variation) between the current month of the analysis and the previous one (a 30% drop, for example);
2. Select the last 12 months of data (history) and organize them into weeks;
3. Compute the clusters of weeks with the SOM;
4. Assign each new week of the current month to one of the clusters found by the SOM (4 or 5 weeks per month);
5. Verify whether each new week fits adequately into the cluster to which it was assigned (fitness), or whether this week probably represents a new profile, unknown until now;
6. Verify whether the unknown profile is justified by modifications of the consumer contract that keep the ratio between the monthly registered electrical demand and the contracted electrical demand approximately constant (RD/CD = k).
Fig. 2.3.1 Flowchart indicating the steps of the behavior analysis of each consumer, each new month.
The flowchart in Fig. 2.3.1 illustrates the steps of the behavior analysis described above. In this analysis, it is assumed that all clients are consuming electricity normally. Those who present abrupt drops undergo a consumption behavior analysis, supported by the clusters found with the SOM. The methodology points out fraud suspects only when a truly abnormal behavior is identified and is not explained by contractual modifications of the electrical demand.
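The decision logic of steps 1, 5 and 6 can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the FDS code: steps 2 to 4 are assumed to have already produced the cluster prototypes, the fitness test is assumed to be a Euclidean distance threshold (the text does not define the fitness measure), and the function names, the max_distance parameter and the 10% tolerance on RD/CD are hypothetical.

```python
import math


def consumption_drop(previous_kwh, current_kwh):
    """Fractional drop between two monthly totals (positive means a decrease)."""
    return (previous_kwh - current_kwh) / previous_kwh


def fits_known_profile(week, prototypes, max_distance):
    """Step 5: True if the week is close enough to its nearest cluster prototype."""
    nearest = min(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(week, p))) for p in prototypes
    )
    return nearest <= max_distance


def ratio_is_stable(rd_history, cd_history, rd_now, cd_now, tolerance=0.1):
    """Step 6: True if RD/CD stays close to its historical mean k."""
    k = sum(rd / cd for rd, cd in zip(rd_history, cd_history)) / len(rd_history)
    return abs(rd_now / cd_now - k) <= tolerance * k


def is_suspect(prev_kwh, curr_kwh, new_weeks, prototypes, max_distance,
               rd_history, cd_history, rd_now, cd_now, drop_threshold=0.30):
    """Return True only for an abrupt, unknown and unjustified drop."""
    if consumption_drop(prev_kwh, curr_kwh) < drop_threshold:
        return False  # step 1: no abrupt drop, nothing to analyze
    if all(fits_known_profile(w, prototypes, max_distance) for w in new_weeks):
        return False  # step 5: the new weeks match a profile seen before
    # Step 6: an unknown profile is accepted only if the contracted demand
    # changed along with the registered demand, keeping RD/CD near k.
    return not ratio_is_stable(rd_history, cd_history, rd_now, cd_now)
```

Keeping each check in its own function mirrors the flowchart: a consumer becomes a suspect only after failing every justification in turn.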
3. FRAUD DETECTION SYSTEM
The methodology presented in the previous section was the basis for the implementation of a fraud detection system (FDS). This system was integrated into the information system (IS) of a Brazilian electric power distribution company. It is important to emphasize that the FDS is not expected to replace the critical judgment and experience of the specialists. This is because the number of high voltage clients is much smaller than the number of ordinary clients (residential, for example). Thus, even when the system identifies consumers with a high level of fraud suspicion, their number is normally small enough to be reviewed by the specialists. This subsequent analysis or verification of the suspicions by a specialist eliminates inspections of false positives, which are consumers with atypical, suspect-like behavior who have not committed any illicit act.
MATLAB was chosen as the FDS development platform because it includes a series of toolboxes that facilitate data manipulation and analysis, including the use of the SOM. The FDS is executed as a monthly scheduled task (service). So, every month, the system performs the following tasks:
1. Select the clients that must be analyzed for fraud suspicion;
2. For each client, select its data and apply the developed methodology;
3. Insert all the fraud suspicions found into the database, together with justifications and additional information about the consumers, to facilitate the analysis of the suspicions, which is performed by the specialists.
So that the specialists can analyze the suspicions, an interface was created within the IS. When viewing the suspicions, the specialists can: immediately launch an inspection; examine detailed information on the suspect consumer to understand what triggered the alarm; or clear the client of any suspicion, once the specialist knows the reason for the alarm raised by the FDS.
Fig. 3.1 illustrates the integration between the IS and the FDS, defining the operations that each system executes on the database.
Fig. 3.1 Integration between IS and FDS.
4. VALIDATION AND RESULTS
To validate the FDS, a simulation was carried out in which 156 random consumers with different behaviors were selected. First, all of them were analyzed by the FDS for fraud suspicion. Then, an intentional and temporary 30% drop was applied to the consumption registers of all the consumers for a specific period of time, and they were submitted to the FDS again. The numbers of suspicions before and after the intentional consumption drop are presented in Table 4.1. Analyzing this result, it can be observed that the FDS is able to identify and flag as suspect the abrupt consumption drops that have no justification or similar precedent. In other words, if a consumer commits fraud, the FDS is very likely to indicate this consumer as a suspect. The explanation for the consumers that were still considered normal even after the consumption modification (15%) is that they presented naturally atypical behavior in the same months as the simulation period, but in the previous year. Therefore, the FDS assumed that this abnormality was expected.
TABLE 4.1 SIMULATION RESULTS
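The validation procedure described above, an artificial drop applied to the registers followed by a second run of the detector, can be sketched in Python as follows. The helper names apply_drop and suspicion_rate are hypothetical, and detector stands for any function that returns True for a suspect consumer; the sketch does not reproduce the FDS or its results.

```python
def apply_drop(values, start, end, fraction=0.30):
    """Return a copy of values with positions start..end-1 reduced by fraction."""
    return [
        v * (1.0 - fraction) if start <= i < end else v
        for i, v in enumerate(values)
    ]


def suspicion_rate(consumers, detector):
    """Share of consumers that the detector marks as suspect."""
    flagged = sum(1 for c in consumers if detector(c))
    return flagged / len(consumers)
```

Comparing suspicion_rate on the original registers and on the registers returned by apply_drop gives the before and after figures of the kind reported in the simulation.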
Since the incidence of fraud among high voltage consumers is historically small (approximately 1 case per year), and the developed FDS is in its first months of operation, there is so far no record of suspicions effectively confirmed as fraud after on-site inspection. However, the suspicions raised by the FDS are helping specialists to understand the behavior of their consumers, since only the most severe and intense abnormalities were observed before the implementation of the system.
5. CONCLUSION
One of the most relevant points of this work is the proposal of a practical methodology for applying data mining to a real problem. This methodology can easily be applied to other behavior analysis problems, especially when historical abnormal cases are limited. The SOM proved to be a very efficient data mining tool. Identifying clusters in data is not a simple task, especially when the identification is unsupervised. The consumer clusters proved to be very consistent with reality, separating regular from atypical weeks, mainly when the commercial or industrial activities of the consumer drastically influence its consumption profile. Fraud detection is a very complex problem, since the difference between a fraudulent profile and a naturally atypical one is very subtle. The developed FDS proved satisfactory because it could perceive alterations in the consumption profile and also compares this atypical behavior with the historical data of the consumer. Therefore, when indicating a consumer as a suspect, the FDS does not declare that the consumer is committing fraud, but signals that consumption is lower than normal and, mainly, that the current consumption profile is different from the profile expected for this consumer. From the values found in the validation, it can be concluded that, in the event of fraud, the chance that the FDS points to the consumer as a suspect is high. Inevitably, some consumers will be flagged as suspects at some moments. Once inspection certifies that these consumers are not fraudulent, the FDS can be fed with this new information and start to treat the new behavior as already known. It is important to highlight that the FDS is parametric and can work with strict or loose values. With its practical operation and the corresponding monitoring, it will be possible to tune these parameters more conveniently.
6. REFERENCES
[1] J. E. Cabral, "Fraud Detection in High Voltage Electricity Consumers Using Data Mining," in Proc. 2008 IEEE International Conference on Networking, Sensing and Control, pp. 761-766, 2008.
[2] Y. Kou, "Survey of Fraud Detection Techniques," in Proc. 2004 IEEE International Conference on Networking, Sensing and Control, vol. 1, pp. 749-754, 2004.
[3] J. R. Filho, "Fraud Identification in Electricity Company Customers Using Decision Tree," in Proc. 2004 IEEE SMC System, Man and Cybernetics Conf., pp. 3730-3734, 2004.
[4] J. E. Cabral, "Methodology for Fraud Detection Using Rough Sets," in Proc. 2006 IEEE Granular Computing Conf., pp. 244-249, 2006.