A Survey of Sequence Patterns in Data Mining

Techniques

Mrs.N.Nithya Research scholar, Pioneer College of

Arts and Science, Jothipuram, Coimbatore-47. Dr.N.Balakumar Assistant

Professor, Pioneer College of Arts and Science, Jothipuram, Coimbatore-47.

Abstract

Data mining techniques are used in many areas in

the world to retrieve the useful knowledge from the very large amount of data.

Sequence pattern mining is the important techniques in data mining concepts

with the wide range of applications. The applications of the sequence patterns

data mining are weblog click streams, DNA sequences, sales analysis, telephone

calling patterns, stock markets and etc., The methods for sequential pattern

mining are categorized in to two approached. First approach is Apriori-based approach

and second is Pattern-Growth-based approaches. In this paper, a

methodical review of the sequential pattern mining algorithms is accomplished.

Finally, reasonable study is done on the base of important key features

reinforced by many algorithms and current research encounters are discoursed in

this area of data mining.

In this paper, an organized survey of the

sequential pattern mining algorithms is accomplished. This paper examines these

algorithms by studying the classification algorithm for

sequential pattern-mining. These algorithms classified into two

extensive classes. First, on the foundation of algorithms which are considered

to surge effectiveness of mining and the other, on the origin of numerous

additions of sequential pattern mining planned for certain application. At the

end, comparative analysis is done on the basis of important key features

supported by various algorithms and current research challenges are discussed.

4

Keywords: Data Mining, Sequence pattern, Association Rule,

Pattern Mining.

Mrs.N.Nithya, Dr.N.Balakumar

Introduction

In Knowledge Discovery

Process, Data mining techniques are divided into two major categories. These

are descriptive type and prediction type. Each of the type will have different

type of the approaches.

The sequential pattern

mining is anidenticalmain concept of data mining, a further extension to the

concept of association rule mining 1.The set of sequences of the given data

is called data-sequences. Customer transactions list is the data

sequences and the set of items is the transactions. Each transaction is

associated with the transaction time of the sequence database. Association rule

mining and the sequential pattern mining is more or less comparable, the events

linked with the time is the difference among them. The sequential pattern

mining determines the correlation between the dissimilar transactions, but in

the event of association rule mining it determines the association of items in

the similar transaction 2.

In this paper segments

are ordered as follows: Section II deals with types of the sequential pattern

mining models, Section III discusses limitations of sequential pattern mining

algorithms, Section IV discusses the comparative analysis of sequential pattern

mining algorithms, Section V discusses about the comparative analysis of

sequential pattern mining algorithms. Finally, the conclusion part is discussed

about the

Problem definition: Let I= {i1, i2, in} be a set of all items. An item set is

a non- empty set of items. A sequence is an ordered list of item sets. A

sequence is denoted by

element of the sequence and denoted as (x1,x2,…xm), where xk?I for 1?k?m. The number

of instances of items in a sequence is called the length of the sequence. A

sequence with length l is called a l-sequence. A sequence a=

subsequence of b=

1?j1?j2…?jn?m such that a1? bj1 , a2? bj2 , … , an?bjn.

A sequence database D is a set of tuples

is a sequence-id and s is a sequence. A tuple

to contain a sequence a, if a is a subsequence of s, i.e., a?s. The number of tuples in a sequence database D containing

sequence a is called the support of a, denoted as sup (a). 22

Given a sequence

database D and some user specified minimum support min_sup, a sequence a is a

sequential pattern in D if sup(a) min_sup. The sequential pattern mining

problem is to find the complete set of sequential pattern with respect to D and

min_sup.

A Survey of Sequence Patterns in Data Mining

Techniques

Categories of sequence pattern mining Techniques

As defined

by Yen-Liang Chen and Ya-Han Hu 4 in latest years,

several methods in sequential pattern mining have been projected; these studies

cover a wide-ranging variety of problems. In general, there are two

different concerns in the area of sequential pattern mining in research. The

first is to increase the efficacy in sequential pattern mining process while

the other one is to. Secondly, extend the mining of sequential pattern to other

time- related patterns.

The algorithms of

sequential pattern mining are differed in two different ways, based on the

researches done on the fields of sequential pattern mining 3. First,

generating the sequences of candidates with storing, and the second is, how the

counting and testing performed on the candidate sequence in a frequent manner.

The main goal of the primary one is to reduce the generation of the total

number of candidate sequences, so that the I/O cost will be reduced. The main

goal of the second one is,to remove any database or data structure that has to

be sustained all the period for support of counting commitments only. The main

benefits and shortcomings of sequential pattern mining are listed in Table 1.

Table 1 Pros and Cons of Sequential Pattern Mining

Type/Techniques

Pros

Cons

Apriori-

Based Algorithms5

It is easy algorithm to implement.

It takes more memory, lot of space and it will

take more time for the process of

candidate generation.

Pattern Growth Algorithm 7.

It can be faster when given large volume of

data.

Normally more multifarious to progress, investigation and maintain

A Survey of Sequence Patterns in

Data Mining Techniques

Limitations of Sequential Pattern Mining

Algorithms

Sequential pattern

mining algorithms are typically centred on string. It is not focus on discovery

of the sequential patterns with the limitations in an agreed database. In query

languages like, SQL or MySQL, it will not permit the practice of the non-

aggregate functions for the portion of the query compilation 19.

Sequential pattern

mining retrieved the relationships among objects in sequential dataset 18.

The most familiar pattern mining in the sequential is Apriori. This algorithm,

also having the drawbacks like, too many candidate sets, more number of passes

over the databases. Another disadvantages of the above mentioned algorithm is,

requirement of the huge memory space 20.

The assignment of

determining entire frequent sequences in huge databases is relatively

interesting. The exploration of the memory space is tremendously large21.

Table 1 Comparative analysis of algorithm performance 3. The

symbol ?-?means an algorithm crashes with the parameters provided, and

memory usage could not be measured. 3

Algorithm

Data Set Size

Minimum Support

Execution

Memory Usage

Time(sec)

(MB)

Medium

Low(0.1%)

>3600

800

GSP Apriori

(D=200K)

Medium(1%)

2126

687

Large

Low(0.1%)

–

–

(D=800K)

Medium(1%)

–

–

Medium

Low(0.1%)

–

–

SPAM Apriori

(D=200K)

Medium(1%)

136

574

Large

Low(0.1%)

–

–

(D=800K)

Medium(1%)

674

1052

Medium

Low(0.1%)

31

13

PrefixSpanPattern

(D=200K)

Medium(1%)

5

10

Growth

Large

Low(0.1%)

1958

525

(D=800K)

Medium(1%)

798

320

Comparative Analysis of Sequential Pattern

Mining Algorithms

Sequential pattern

mining is precise significant because it is the foundation of numerous

applications. A sequential mining algorithm should discover the entire set of

patterns, when potentially, adequate the least support. Working with the big

data, the scalability is the one of the important issue of the mining the

knowledge from the huge amount of data. The above mentioned issue will be

raised in MapReduce model in the cloud. The SPAM algorithm, suggestively

decrease the mining period with big data, and also it will attain enormously great

scalability 11. The important and familiar algorithm for mining the data is

Apriori. Using this algorithm, finding the sequence data from

the d-dimensional sequence data is not possible. Using the PREFIXMD

SPAN algorithm, the retrieval of the sequence data is possible from

thed-dimensional data 14. Generating the huge amount of the unpromising

candidate sub sequences is difficult, while using

the Generate-and-test algorithm. This will be overcome, applying the

algorithm called Maximum weighted upper-bound model. The maximum

weighted upper-bound model will give the good performance of pruning

efficiency and also it will improve the performance efficiency 17.

The huge amount of

repeated projected databases in mining data sets will be creating applying

pattern growth type of algorithm. It will be overwhelmed, using the SMPM 13

algorithm. This algorithm will avoid the repeated projected database and evade

physical forecast 13. The greedy algorithm will raise the issues in the

sensor network applications, by creating the multiple interleaved patterns. The

GAIS 15 method algorithm will find the sequential pattern from the small

amount of quality data. The Frequent Pattern Tree type is another type for

finding the pattern using the sequence mining. In this algorithm will be work

in scanning the database many number of time. It will be time consuming

comparing with another type of algorithm. Yi Sui, Feng Jing Shao, Rencheng Sun

and Jinlong Wang were used the STMFP algorithm. In this algorithm required to

scan the database in a single. After the single scan itself, the tree can store

the all the sequences from the source data 9.

The Association rule

mining algorithm is the important type of algorithm in the Apriori model of

mining methods. The Apriori based association rule algorithm is the single

minimum support. The single minimum support cannot exactly discover the

interesting pattern.

The number of minimum

support is very high in the usage of MSCP growth algorithm 10. More number of

minimum supports will produce the interesting pattern.Xilu Wang and Weill Yao

used their optimum maximum sequence pattern mining for getting the sequence

pattern. The advantage of this algorithm is, to acquiring the sequential

pattern is very reliable. The existing mathematical models for mining the

sequential pattern will be failed in noisy data with the candidate

patterns12.

Measuring

the multidimensional-attribute of the material is not completely

measured concurrently in modified Apriori and PrefisSpan algorithms. It will be

overwhelmed using the Leaner Preference Tree (LPT) algorithm. The advantage of

this algorithm is, the learners actual learning favourite can be fulfilled

perfectly 16. Mining the pattern from the incremental data is very difficult

to handle. In this problem will be solved using the Direct Appending (DirApp)

algorithm. The improvement of this method, the incremental data can be easily

dealt and also the static database 8.

Performance Based

Comparative Study

The above table 1 described about the Comparative analysis of

different algorithm based on their performance in sequential mining. These

algorithms are studied with the help of the different size of the data sets.

The parameters chosen for these studies are Minimum Support, execution time (sec)

and memory usage (MB). The execution time is measured here is in the form

seconds and, the memory usages of these algorithm is measured in the form of

megabytes. In the data sizes, we have categories like medium size of data sets

and the large size of data sets.

The data size is denoted

as D. The value of D for the all the algorithms are categories in to medium

size and large size. The Medium size value is 200k and the large size value is

800k. Minimum support for the each algorithm is categorised as low and medium

for the both data sizes medium and large respectively.

The large amount of the

execution time taken by the GSP Apriori algorithm was more than the 3600sec

with the memory usage 800mb in the medium size of the data sets. The minimum

support of this highest execution time low. The Prefix Span pattern algorithm

execution time is very less. The execution time for this algorithm is 5sec with

the memory usage of MB in the large data set size of medium support size.

Table 2 Comparative Analysis of Sequential Pattern

Mining Algorithms

Reference paper

Author

Methodology Used

Algorithm Used

Existing System

Proposed System

ABCF 16

Mojtaba

Modified

Leaner

Multidimensional-

Learner’s actual

Salehi, Isa

Apriori and

Preference

attribute of

learning favourite

Nakhai

PrefixSpan

Tree (LPT)

materials is not

can be fulfilled

Kamalabadi

algorithms

Completely

perfectly

And

Measured

Mohammad

concurrently.

Bagher

Ghaznavi

Ghoushchi

OMSP

Xilu Wang and

Mathematic

Optimum

Noisy data, with

Acquired

M 12

Weill Yao

al Model.

maximum

fewer candidate

sequential patterns

sequence

patterns.

are reliable.

pattern

mining

MMS

Ya-Han Hu,

Apriori

MSCP-

Single minimum

Numerous

10

Fan Wu and

based

Growth

support cannot

minimum supports

Yi-Chun Liao

Association

exactly discover

possible.

Rule mining

interesting pattern.

Jen-Wei

–

Dealing the

It can easily deal

Huang, Taipei,

incremental data is

with a static

SPMP

Chi-Yao

Direct

difficult.

database or an

Tseng, Jian-

Appending

incremental

D 8

Chih Ou and

(DirApp)

database as well.

Ming-Syan

Chen

IFPT

Yi Sui,

Frequent

STMFP

Need to scan the

After the single

9

FengJing Shao,

Pattern Tree

Algorithm

database many

scan, the tree can

Rencheng Sun

times.

store the all the

and Jinlong

sequences.

Wang

GAIS

Ruotsalainen,

Greedy

GAIS

Issues in sensor

Sequential

15

M, Ala-

Algorithm

method.

network

patterns can be

Kleemola, T

application are

identified from

and Visa

multiple

little quality data

interleaved

patterns.

WSP

Guo-Cheng

Generate-

Maximum

Generate a large

Good enactment

17

Lan, Tzung-Pei

and-test.

weighted

number of

of pruning

Hong and

upper-

unpromising

efficiency and

Conclusions

In this paper, we discussed about the sequential

pattern mining and also briefly represented the major categories of the sequential

pattern mining. The comparison between the some of the types of algorithm was

discussed with the help of previously completed work. Primarily, this topicwas

initiated based on the improvement of the performance of the algorithm with the

help of the dissimilar data structure and representation. The comparative study

of different type of the algorithm is used for the mining the sequential

pattern. As well as, we discussed about the comparative analysis of the

algorithm performance. From the discussion about the pros and cons of

sequential mining, easily can be define the strength and their limitations. The

analysis of the comparison based on the different type of methodology and their

algorithms are discussed in detail.

Reference

1J. Han and M. Kamber,

?Data Mining: Concepts and Techniques?, Morgan Kaufman publishers, 2001.

2Vishal S.

Motegaonkar, Prof. Madhav V. Vaidya ?A Survey on Sequential Pattern Mining

Algorithms?, International Journal of Computer Science and Information

Technologies, Vol. 5 (2) , 2014, 2486-2492.

3Nizar R. Mabroukeh

and C. I. Ezeife, ?A Taxonomy of Sequential Pattern Mining Algorithms?, ACM

Computing Surveys, Vol. 43, No. 1, Article 3, Publication date: November 2010.

4J.Pei, J.Han,

B.MortazaviAsl, J.Wang, H.Pinto, Q.Chen, U.Dayal and M.- C.Hsu, ?Mining

sequential patterns by pattern-growth: The PrefixSpan approach?, IEEE

Transactions on Knowledge and Data Engineering, vol.16, no.11, 2004,

pp. 1424-1440.

5Hilderman R. J.,

Hamilton H. J.,?Knowledge Discovery and Interest Measures?,In: Kluwer Academic

Publishers, Boston, 2002.

6V.Chandra Shekhar

Rao, P.Sammulal,Ph.D, ?Survey on Sequential Pattern Mining Algorithms ?

International Journal of Computer Applications (0975 – 8887) Volume 76– No.12,

August 2013.

7Carl H. Mooney and

John F. Roddick, ACM Journal Name, Vol. V, No. N, M 20YY, Pages 1–46.

8Jen-Wei Huang ,

Nat. Taiwan University, Taipei, Chi-Yao Tseng, Jian-ChihOu

and Ming-Syan Chen,?A General Model for Sequential Pattern Mining

with a Progressive Database?, IEEE Trans. Knowledge and Data Eng., Volume:20 ,

Issue 9, page no 1153-1167, September 2008.

9Yi Sui, Feng, Jing

Shao, Rencheng Sun and Jinlong Wang,?A Sequential Pattern Mining Algorithm

Based on Improved FP-tree?,. Ninth ACIS International Conference on

Software Engineering, Artificial Intelligence, Networking, and

Parallel/Distributed Computing, 2008. SNPD 2008, Page(s): 440- 444.

10Ya-Han Hu, Fan

Wu and Yi-Chun Liao, ?Sequential pattern mining with multiple minimum

supports: A tree based approach?, 2nd International Conference on Software

Engineering and Data Mining (SEDM), 2010, Page(s): 428-433.

11Chun-Chieh Chen, Chi-Yao Tseng

and Ming-Syan Chen ?Highly Scalable Sequential Pattern Mining Based

on MapReduce Model on the Cloud?, IEEE International Congress on Big Data (Big

Data Congress), 2013, Page(s) 310- 317.

12Xilu Wang and Weill

Yao, ?Sequential Pattern Mining: Optimum Maximum Sequential Patterns and

Consistent Sequential Patterns?, IEEE International Conference on Integration

Technology, 2007, Page(s): 365-368.

13Yong-Gui Zou

and Hong Yu, ?Moving sequential pattern mining based on Spatial Constraints in

Mobile Environment?, IEEE International Conference on Intelligent Computing and

Intelligent Systems (ICIS), 2010, Page(s) 103- 107.

14Chung-Ching Yu

and Yen-Liang Chen, ?Mining Sequential Patterns from

15Ruotsalainen,

M, Ala-Kleemola, T and Visa, ?A GAIS: A Method for Detecting

Interleaved Sequential Patterns from Imperfect Data?, IEEE Symposium on

Computational Intelligence and Data Mining, 2007, Pages(s) 530- 534.

16Mojtaba Salehi, Isa

Nakhai Kamalabadi, Mohammad Bagher Ghaznavi Ghoushchi, ?Personalized recommendation

of learning material using sequential pattern mining and attribute based

collaborative filtering?, Education and Information Technologies, December

2014, Volume 19, Issue 4, page(s) 713-735.

17Guo-Cheng Lan, Tzung-Pei Hong

and Hong-Yu Lee, ?An efficient approach for finding weighted

sequential patterns from sequence databases?, Applied Intelligence, September

2014, Volume 41, Issue 2, pp 439-452.

18Thanh-Trung Nguyen, Phi-Khu Nguyen,

?A New Approach for Problem of Sequential Pattern Mining?, Lecture Notes in

Computer Science on Computational Collective Intelligence. Technologies and

Applications Volume 7653, 2012, pp 51-60.

19VangipuramRadhakrishna,

Chintakindi Srinivas and C.V.Guru Rao, “Constraint Based Sequential

Pattern Mining in Time Series Databases – A Two Way Approach”, AASRI

Conference on Intelligent Systems and Control, AASRI

Procedia 4(2013)313-318.

20ShamilaNasreen,

Muhammad AwaisAzamb, KhurramShehzada, UsmanNaeemc, and Mustansar Ali

Ghazanfara, ?Frequent Pattern Mining Algorithms for Finding Associated Frequent

Patterns for Data Streams: A Survey?, The 5th International Conference on

Emerging Ubiquitous Systems and Pervasive

Networks (EUSPN-2014), Procedia Computer Science 37

( 2014 ) 109 – 116.

21Mohammed J. Zak,

?SPADE: An Efficient Algorithm for MiningFrequent Sequences?, Kluwer Academic

Publishers. Manufactured in The Netherlands, 42, 31–60, 2001.

22RamakrishnanSrikant,

Rakesh Agrawal,? Mining Sequential Patterns: Generalizations and Performance

Improvements?, Advances in Database Technology — EDBT ’96, Lecture Notes in

Computer Science, Springer, Volume 1057, 1996, pp 1-17.