A Survey of Sequence Patterns in Data Mining
Techniques
Mrs.N.Nithya Research scholar, Pioneer College of
Arts and Science, Jothipuram, Coimbatore-47. Dr.N.Balakumar Assistant
Professor, Pioneer College of Arts and Science, Jothipuram, Coimbatore-47.
Abstract
Data mining techniques are used in many areas in
the world to retrieve the useful knowledge from the very large amount of data.
Sequence pattern mining is the important techniques in data mining concepts
with the wide range of applications. The applications of the sequence patterns
data mining are weblog click streams, DNA sequences, sales analysis, telephone
calling patterns, stock markets and etc., The methods for sequential pattern
mining are categorized in to two approached. First approach is Apriori-based approach
and second is Pattern-Growth-based approaches. In this paper, a
methodical review of the sequential pattern mining algorithms is accomplished.
Finally, reasonable study is done on the base of important key features
reinforced by many algorithms and current research encounters are discoursed in
this area of data mining.
In this paper, an organized survey of the
sequential pattern mining algorithms is accomplished. This paper examines these
algorithms by studying the classification algorithm for
sequential pattern-mining. These algorithms classified into two
extensive classes. First, on the foundation of algorithms which are considered
to surge effectiveness of mining and the other, on the origin of numerous
additions of sequential pattern mining planned for certain application. At the
end, comparative analysis is done on the basis of important key features
supported by various algorithms and current research challenges are discussed.
4
Keywords: Data Mining, Sequence pattern, Association Rule,
Pattern Mining.
Mrs.N.Nithya, Dr.N.Balakumar
Introduction
In Knowledge Discovery
Process, Data mining techniques are divided into two major categories. These
are descriptive type and prediction type. Each of the type will have different
type of the approaches.
The sequential pattern
mining is anidenticalmain concept of data mining, a further extension to the
concept of association rule mining 1.The set of sequences of the given data
is called data-sequences. Customer transactions list is the data
sequences and the set of items is the transactions. Each transaction is
associated with the transaction time of the sequence database. Association rule
mining and the sequential pattern mining is more or less comparable, the events
linked with the time is the difference among them. The sequential pattern
mining determines the correlation between the dissimilar transactions, but in
the event of association rule mining it determines the association of items in
the similar transaction 2.
In this paper segments
are ordered as follows: Section II deals with types of the sequential pattern
mining models, Section III discusses limitations of sequential pattern mining
algorithms, Section IV discusses the comparative analysis of sequential pattern
mining algorithms, Section V discusses about the comparative analysis of
sequential pattern mining algorithms. Finally, the conclusion part is discussed
about the
Problem definition: Let I= {i1, i2, in} be a set of all items. An item set is
a non- empty set of items. A sequence is an ordered list of item sets. A
sequence is denoted by
element of the sequence and denoted as (x1,x2,…xm), where xk?I for 1?k?m. The number
of instances of items in a sequence is called the length of the sequence. A
sequence with length l is called a l-sequence. A sequence a=
subsequence of b=
1?j1?j2…?jn?m such that a1? bj1 , a2? bj2 , … , an?bjn.
A sequence database D is a set of tuples
is a sequence-id and s is a sequence. A tuple
to contain a sequence a, if a is a subsequence of s, i.e., a?s. The number of tuples in a sequence database D containing
sequence a is called the support of a, denoted as sup (a). 22
Given a sequence
database D and some user specified minimum support min_sup, a sequence a is a
sequential pattern in D if sup(a) min_sup. The sequential pattern mining
problem is to find the complete set of sequential pattern with respect to D and
min_sup.
A Survey of Sequence Patterns in Data Mining
Techniques
Categories of sequence pattern mining Techniques
As defined
by Yen-Liang Chen and Ya-Han Hu 4 in latest years,
several methods in sequential pattern mining have been projected; these studies
cover a wide-ranging variety of problems. In general, there are two
different concerns in the area of sequential pattern mining in research. The
first is to increase the efficacy in sequential pattern mining process while
the other one is to. Secondly, extend the mining of sequential pattern to other
time- related patterns.
The algorithms of
sequential pattern mining are differed in two different ways, based on the
researches done on the fields of sequential pattern mining 3. First,
generating the sequences of candidates with storing, and the second is, how the
counting and testing performed on the candidate sequence in a frequent manner.
The main goal of the primary one is to reduce the generation of the total
number of candidate sequences, so that the I/O cost will be reduced. The main
goal of the second one is,to remove any database or data structure that has to
be sustained all the period for support of counting commitments only. The main
benefits and shortcomings of sequential pattern mining are listed in Table 1.
Table 1 Pros and Cons of Sequential Pattern Mining
Type/Techniques
Pros
Cons
Apriori-
Based Algorithms5
It is easy algorithm to implement.
It takes more memory, lot of space and it will
take more time for the process of
candidate generation.
Pattern Growth Algorithm 7.
It can be faster when given large volume of
data.
Normally more multifarious to progress, investigation and maintain
A Survey of Sequence Patterns in
Data Mining Techniques
Limitations of Sequential Pattern Mining
Algorithms
Sequential pattern
mining algorithms are typically centred on string. It is not focus on discovery
of the sequential patterns with the limitations in an agreed database. In query
languages like, SQL or MySQL, it will not permit the practice of the non-
aggregate functions for the portion of the query compilation 19.
Sequential pattern
mining retrieved the relationships among objects in sequential dataset 18.
The most familiar pattern mining in the sequential is Apriori. This algorithm,
also having the drawbacks like, too many candidate sets, more number of passes
over the databases. Another disadvantages of the above mentioned algorithm is,
requirement of the huge memory space 20.
The assignment of
determining entire frequent sequences in huge databases is relatively
interesting. The exploration of the memory space is tremendously large21.
Table 1 Comparative analysis of algorithm performance 3. The
symbol ?-?means an algorithm crashes with the parameters provided, and
memory usage could not be measured. 3
Algorithm
Data Set Size
Minimum Support
Execution
Memory Usage
Time(sec)
(MB)
Medium
Low(0.1%)
>3600
800
GSP Apriori
(D=200K)
Medium(1%)
2126
687
Large
Low(0.1%)
–
–
(D=800K)
Medium(1%)
–
–
Medium
Low(0.1%)
–
–
SPAM Apriori
(D=200K)
Medium(1%)
136
574
Large
Low(0.1%)
–
–
(D=800K)
Medium(1%)
674
1052
Medium
Low(0.1%)
31
13
PrefixSpanPattern
(D=200K)
Medium(1%)
5
10
Growth
Large
Low(0.1%)
1958
525
(D=800K)
Medium(1%)
798
320
Comparative Analysis of Sequential Pattern
Mining Algorithms
Sequential pattern
mining is precise significant because it is the foundation of numerous
applications. A sequential mining algorithm should discover the entire set of
patterns, when potentially, adequate the least support. Working with the big
data, the scalability is the one of the important issue of the mining the
knowledge from the huge amount of data. The above mentioned issue will be
raised in MapReduce model in the cloud. The SPAM algorithm, suggestively
decrease the mining period with big data, and also it will attain enormously great
scalability 11. The important and familiar algorithm for mining the data is
Apriori. Using this algorithm, finding the sequence data from
the d-dimensional sequence data is not possible. Using the PREFIXMD
SPAN algorithm, the retrieval of the sequence data is possible from
thed-dimensional data 14. Generating the huge amount of the unpromising
candidate sub sequences is difficult, while using
the Generate-and-test algorithm. This will be overcome, applying the
algorithm called Maximum weighted upper-bound model. The maximum
weighted upper-bound model will give the good performance of pruning
efficiency and also it will improve the performance efficiency 17.
The huge amount of
repeated projected databases in mining data sets will be creating applying
pattern growth type of algorithm. It will be overwhelmed, using the SMPM 13
algorithm. This algorithm will avoid the repeated projected database and evade
physical forecast 13. The greedy algorithm will raise the issues in the
sensor network applications, by creating the multiple interleaved patterns. The
GAIS 15 method algorithm will find the sequential pattern from the small
amount of quality data. The Frequent Pattern Tree type is another type for
finding the pattern using the sequence mining. In this algorithm will be work
in scanning the database many number of time. It will be time consuming
comparing with another type of algorithm. Yi Sui, Feng Jing Shao, Rencheng Sun
and Jinlong Wang were used the STMFP algorithm. In this algorithm required to
scan the database in a single. After the single scan itself, the tree can store
the all the sequences from the source data 9.
The Association rule
mining algorithm is the important type of algorithm in the Apriori model of
mining methods. The Apriori based association rule algorithm is the single
minimum support. The single minimum support cannot exactly discover the
interesting pattern.
The number of minimum
support is very high in the usage of MSCP growth algorithm 10. More number of
minimum supports will produce the interesting pattern.Xilu Wang and Weill Yao
used their optimum maximum sequence pattern mining for getting the sequence
pattern. The advantage of this algorithm is, to acquiring the sequential
pattern is very reliable. The existing mathematical models for mining the
sequential pattern will be failed in noisy data with the candidate
patterns12.
Measuring
the multidimensional-attribute of the material is not completely
measured concurrently in modified Apriori and PrefisSpan algorithms. It will be
overwhelmed using the Leaner Preference Tree (LPT) algorithm. The advantage of
this algorithm is, the learners actual learning favourite can be fulfilled
perfectly 16. Mining the pattern from the incremental data is very difficult
to handle. In this problem will be solved using the Direct Appending (DirApp)
algorithm. The improvement of this method, the incremental data can be easily
dealt and also the static database 8.
Performance Based
Comparative Study
The above table 1 described about the Comparative analysis of
different algorithm based on their performance in sequential mining. These
algorithms are studied with the help of the different size of the data sets.
The parameters chosen for these studies are Minimum Support, execution time (sec)
and memory usage (MB). The execution time is measured here is in the form
seconds and, the memory usages of these algorithm is measured in the form of
megabytes. In the data sizes, we have categories like medium size of data sets
and the large size of data sets.
The data size is denoted
as D. The value of D for the all the algorithms are categories in to medium
size and large size. The Medium size value is 200k and the large size value is
800k. Minimum support for the each algorithm is categorised as low and medium
for the both data sizes medium and large respectively.
The large amount of the
execution time taken by the GSP Apriori algorithm was more than the 3600sec
with the memory usage 800mb in the medium size of the data sets. The minimum
support of this highest execution time low. The Prefix Span pattern algorithm
execution time is very less. The execution time for this algorithm is 5sec with
the memory usage of MB in the large data set size of medium support size.
Table 2 Comparative Analysis of Sequential Pattern
Mining Algorithms
Reference paper
Author
Methodology Used
Algorithm Used
Existing System
Proposed System
ABCF 16
Mojtaba
Modified
Leaner
Multidimensional-
Learner’s actual
Salehi, Isa
Apriori and
Preference
attribute of
learning favourite
Nakhai
PrefixSpan
Tree (LPT)
materials is not
can be fulfilled
Kamalabadi
algorithms
Completely
perfectly
And
Measured
Mohammad
concurrently.
Bagher
Ghaznavi
Ghoushchi
OMSP
Xilu Wang and
Mathematic
Optimum
Noisy data, with
Acquired
M 12
Weill Yao
al Model.
maximum
fewer candidate
sequential patterns
sequence
patterns.
are reliable.
pattern
mining
MMS
Ya-Han Hu,
Apriori
MSCP-
Single minimum
Numerous
10
Fan Wu and
based
Growth
support cannot
minimum supports
Yi-Chun Liao
Association
exactly discover
possible.
Rule mining
interesting pattern.
Jen-Wei
–
Dealing the
It can easily deal
Huang, Taipei,
incremental data is
with a static
SPMP
Chi-Yao
Direct
difficult.
database or an
Tseng, Jian-
Appending
incremental
D 8
Chih Ou and
(DirApp)
database as well.
Ming-Syan
Chen
IFPT
Yi Sui,
Frequent
STMFP
Need to scan the
After the single
9
FengJing Shao,
Pattern Tree
Algorithm
database many
scan, the tree can
Rencheng Sun
times.
store the all the
and Jinlong
sequences.
Wang
GAIS
Ruotsalainen,
Greedy
GAIS
Issues in sensor
Sequential
15
M, Ala-
Algorithm
method.
network
patterns can be
Kleemola, T
application are
identified from
and Visa
multiple
little quality data
interleaved
patterns.
WSP
Guo-Cheng
Generate-
Maximum
Generate a large
Good enactment
17
Lan, Tzung-Pei
and-test.
weighted
number of
of pruning
Hong and
upper-
unpromising
efficiency and
Conclusions
In this paper, we discussed about the sequential
pattern mining and also briefly represented the major categories of the sequential
pattern mining. The comparison between the some of the types of algorithm was
discussed with the help of previously completed work. Primarily, this topicwas
initiated based on the improvement of the performance of the algorithm with the
help of the dissimilar data structure and representation. The comparative study
of different type of the algorithm is used for the mining the sequential
pattern. As well as, we discussed about the comparative analysis of the
algorithm performance. From the discussion about the pros and cons of
sequential mining, easily can be define the strength and their limitations. The
analysis of the comparison based on the different type of methodology and their
algorithms are discussed in detail.
Reference
1J. Han and M. Kamber,
?Data Mining: Concepts and Techniques?, Morgan Kaufman publishers, 2001.
2Vishal S.
Motegaonkar, Prof. Madhav V. Vaidya ?A Survey on Sequential Pattern Mining
Algorithms?, International Journal of Computer Science and Information
Technologies, Vol. 5 (2) , 2014, 2486-2492.
3Nizar R. Mabroukeh
and C. I. Ezeife, ?A Taxonomy of Sequential Pattern Mining Algorithms?, ACM
Computing Surveys, Vol. 43, No. 1, Article 3, Publication date: November 2010.
4J.Pei, J.Han,
B.MortazaviAsl, J.Wang, H.Pinto, Q.Chen, U.Dayal and M.- C.Hsu, ?Mining
sequential patterns by pattern-growth: The PrefixSpan approach?, IEEE
Transactions on Knowledge and Data Engineering, vol.16, no.11, 2004,
pp. 1424-1440.
5Hilderman R. J.,
Hamilton H. J.,?Knowledge Discovery and Interest Measures?,In: Kluwer Academic
Publishers, Boston, 2002.
6V.Chandra Shekhar
Rao, P.Sammulal,Ph.D, ?Survey on Sequential Pattern Mining Algorithms ?
International Journal of Computer Applications (0975 – 8887) Volume 76– No.12,
August 2013.
7Carl H. Mooney and
John F. Roddick, ACM Journal Name, Vol. V, No. N, M 20YY, Pages 1–46.
8Jen-Wei Huang ,
Nat. Taiwan University, Taipei, Chi-Yao Tseng, Jian-ChihOu
and Ming-Syan Chen,?A General Model for Sequential Pattern Mining
with a Progressive Database?, IEEE Trans. Knowledge and Data Eng., Volume:20 ,
Issue 9, page no 1153-1167, September 2008.
9Yi Sui, Feng, Jing
Shao, Rencheng Sun and Jinlong Wang,?A Sequential Pattern Mining Algorithm
Based on Improved FP-tree?,. Ninth ACIS International Conference on
Software Engineering, Artificial Intelligence, Networking, and
Parallel/Distributed Computing, 2008. SNPD 2008, Page(s): 440- 444.
10Ya-Han Hu, Fan
Wu and Yi-Chun Liao, ?Sequential pattern mining with multiple minimum
supports: A tree based approach?, 2nd International Conference on Software
Engineering and Data Mining (SEDM), 2010, Page(s): 428-433.
11Chun-Chieh Chen, Chi-Yao Tseng
and Ming-Syan Chen ?Highly Scalable Sequential Pattern Mining Based
on MapReduce Model on the Cloud?, IEEE International Congress on Big Data (Big
Data Congress), 2013, Page(s) 310- 317.
12Xilu Wang and Weill
Yao, ?Sequential Pattern Mining: Optimum Maximum Sequential Patterns and
Consistent Sequential Patterns?, IEEE International Conference on Integration
Technology, 2007, Page(s): 365-368.
13Yong-Gui Zou
and Hong Yu, ?Moving sequential pattern mining based on Spatial Constraints in
Mobile Environment?, IEEE International Conference on Intelligent Computing and
Intelligent Systems (ICIS), 2010, Page(s) 103- 107.
14Chung-Ching Yu
and Yen-Liang Chen, ?Mining Sequential Patterns from
15Ruotsalainen,
M, Ala-Kleemola, T and Visa, ?A GAIS: A Method for Detecting
Interleaved Sequential Patterns from Imperfect Data?, IEEE Symposium on
Computational Intelligence and Data Mining, 2007, Pages(s) 530- 534.
16Mojtaba Salehi, Isa
Nakhai Kamalabadi, Mohammad Bagher Ghaznavi Ghoushchi, ?Personalized recommendation
of learning material using sequential pattern mining and attribute based
collaborative filtering?, Education and Information Technologies, December
2014, Volume 19, Issue 4, page(s) 713-735.
17Guo-Cheng Lan, Tzung-Pei Hong
and Hong-Yu Lee, ?An efficient approach for finding weighted
sequential patterns from sequence databases?, Applied Intelligence, September
2014, Volume 41, Issue 2, pp 439-452.
18Thanh-Trung Nguyen, Phi-Khu Nguyen,
?A New Approach for Problem of Sequential Pattern Mining?, Lecture Notes in
Computer Science on Computational Collective Intelligence. Technologies and
Applications Volume 7653, 2012, pp 51-60.
19VangipuramRadhakrishna,
Chintakindi Srinivas and C.V.Guru Rao, “Constraint Based Sequential
Pattern Mining in Time Series Databases – A Two Way Approach”, AASRI
Conference on Intelligent Systems and Control, AASRI
Procedia 4(2013)313-318.
20ShamilaNasreen,
Muhammad AwaisAzamb, KhurramShehzada, UsmanNaeemc, and Mustansar Ali
Ghazanfara, ?Frequent Pattern Mining Algorithms for Finding Associated Frequent
Patterns for Data Streams: A Survey?, The 5th International Conference on
Emerging Ubiquitous Systems and Pervasive
Networks (EUSPN-2014), Procedia Computer Science 37
( 2014 ) 109 – 116.
21Mohammed J. Zak,
?SPADE: An Efficient Algorithm for MiningFrequent Sequences?, Kluwer Academic
Publishers. Manufactured in The Netherlands, 42, 31–60, 2001.
22RamakrishnanSrikant,
Rakesh Agrawal,? Mining Sequential Patterns: Generalizations and Performance
Improvements?, Advances in Database Technology — EDBT ’96, Lecture Notes in
Computer Science, Springer, Volume 1057, 1996, pp 1-17.