Abstract

Personalised webpage recommendation is now commonplace. Web servers run recommendation systems that analyse users' browsing behaviour and recommend webpages, using data obtained implicitly from the users' Web browsing patterns. The existing system collects Web logs, generates clusters of similar users, and recommends pages by analysing the logs online. However, this online analysis is time-consuming. To reduce this time and increase the correctness of recommendations, a method is designed that applies a Firefly-based algorithm for recommending Web pages together with Naive Bayes clustering. User Web logs are first clustered offline using the Naive Bayes clustering technique. A Firefly algorithm based similarity measure is then used to find the similarity between the active user's query and the other users in the cluster. The proposed approach uses probability-based clustering, which eliminates odd records while forming clusters. The Firefly algorithm searches the Web logs in the active user's cluster and recommends the top pages. Because the Firefly algorithm uses time efficiently, it is used for the online processing. The pages obtained are ranked, and the top pages most relevant to the query are recommended. The efficiency of the system can be evaluated using measures such as precision, recall, Matthews correlation coefficient and fallout rate. The proposed approach is expected to improve time utilization in the online process and to recommend more accurate webpages.


 

I. Introduction

 

Web page recommendation is a sub-domain of recommendation systems that recommends a set of Web pages to users based on their past browsing patterns. It works by applying mining techniques to data previously gathered from the users, which in turn discover and extract information from Web documents and services. The major concern is to find reliable and efficient recommendation algorithms. A recommendation system typically produces its results in one of two ways: collaborative filtering or content-based filtering.

 

A. Collaborative Filtering

Most recommendation systems make wide use of collaborative filtering. This method relies on collecting and processing information about users' behaviours or activities, and then predicting items of interest from each user's similarity to other users. The collaborative filtering approach builds a model from the user's past decisions and the behaviour of other similar users, and this model is used to predict the items the user is interested in. Since collaborative filtering does not depend on machine-analysable content, it can recommend complex items accurately without "understanding" the items themselves.
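As an illustration only, the idea of predicting items from similar users can be sketched as follows. This is a minimal user-based sketch, assuming that each user is represented by a dictionary of page visit counts and that cosine similarity is used to compare users; the function names, the page identifiers and the choice of cosine similarity are assumptions made for this example, not part of the proposed system.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two users' page -> visit-count dictionaries."""
    shared = set(a) & set(b)
    dot = sum(a[p] * b[p] for p in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend_collaborative(active, others, k=2):
    """Score pages the active user has not visited by similarity-weighted visits."""
    scores = {}
    for visits in others.values():
        sim = cosine_similarity(active, visits)
        if sim <= 0:
            continue  # users with nothing in common contribute no evidence
        for page, count in visits.items():
            if page not in active:
                scores[page] = scores.get(page, 0.0) + sim * count
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [page for page, _ in ranked[:k]]

# Hypothetical visit counts (page -> number of visits)
active_user = {"news.example/a": 3, "shop.example/b": 1}
other_users = {
    "u1": {"news.example/a": 2, "news.example/c": 4},
    "u2": {"blog.example/d": 5},
}
print(recommend_collaborative(active_user, other_users))
```

Note that the pages are never inspected: only the overlap in behaviour between users drives the recommendation.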

 

 

 

B. Content Based Filtering

Content-based filtering is a widely used approach for designing recommendation systems. This technique relies on a description of each item along with a profile of the user's preferences. In a content-based recommendation system, keywords represent the user's interests. It uses a set of distinct properties of an item to find and recommend items with the same properties. These algorithms recommend items by examining the items that the user liked in the past or likes at present: in general, the items of a candidate set are compared with items previously rated by the user, and the best-matching items are recommended. Content-based and collaborative approaches are often combined in hybrid recommendation systems.
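The comparison of candidate items with a keyword profile can be sketched as below. This is a minimal sketch, assuming that both the user profile and each candidate page are reduced to keyword sets and that Jaccard overlap is the matching score; the names `jaccard` and `recommend_content_based`, and the sample data, are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_content_based(profile_keywords, candidates, k=2):
    """Return up to k candidate pages whose keywords best match the profile."""
    scored = [(jaccard(profile_keywords, kws), page) for page, kws in candidates.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [page for score, page in scored[:k] if score > 0]

# Hypothetical profile and candidate pages
profile = {"python", "tutorial", "web"}
candidates = {
    "p1": {"python", "web", "framework"},
    "p2": {"cooking", "recipes"},
    "p3": {"python", "tutorial"},
}
print(recommend_content_based(profile, candidates))
```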

 

II. Literature survey

 

Recommendation systems play a major role in recommending personalized items to users based on their interests in Web services. The Web contains rich and dynamic information, and the amount of information is growing continuously, as are the number of Web sites and the number of webpages per site. Predicting the behaviour and needs of Web users has therefore gained importance. Many webpage recommendation systems have been developed in the past; since they compute recommendations online, their time utilization must be efficient. A system [4] that uses a support vector machine (SVM) learning-based model was developed to compute the similarity between two items, and it performed better than the latent factor approach for group recommendations. Since a matrix representation was followed, the data sparsity problem was solved.
However, the system could not scale stably when the size of the group increased dynamically.

Hybrid recommendation systems, which combine several recommendation techniques, were designed in [5]. A hybrid eliminates weaknesses that exist when only one recommendation technique is used. The systems can be combined in several ways, such as weighted hybrid recommendation, in which the score of a recommended item is computed from the results of all the recommendation techniques available in the system. However, data sparseness was still a problem: the system may generate weak recommendations if few users have rated the same items, and it does not overcome the cold start problem.

Hyperspectral sensors can acquire hundreds of contiguous bands over a wide electromagnetic spectrum for each pixel. To reduce computational cost and avoid using an actual classifier within the band-searching process, an improved Firefly algorithm based band selection method [8] was used. The Firefly algorithm is an evolutionary optimization algorithm proposed by Yang [13]. After the parameters are initialized, the brightness is calculated with the objective function. Then the movement states are evaluated and the bands are selected. The Firefly algorithm also converged faster even when the data size was larger.

Further, to improve the accuracy of similarity measures, Firefly algorithm based similarity measures were introduced [10]. The method considered separate effects for ratings of users with similar opinions and with conflicting opinions. To generate the initial population of fireflies, the population was generated randomly in two halves. Mean absolute error, obtained as the difference between the predicted rating and the real rating, was chosen as the objective function to measure recommendation accuracy. An optimal similarity measure built from a simple linear combination of values and ratios of ratings for user-based collaborative filtering provided better results. It increased the speed of finding the nearest neighbours of the active user and reduced computation time. The similarity function equation based on the Firefly algorithm was simpler than the equations used in traditional metrics; therefore, the method provided recommendations faster than traditional metrics.
Graph colouring problems are generally discrete, and algorithms for discrete problems are quite complex. A new algorithm based on similarity, which discretizes the Firefly algorithm directly without any other hybrid algorithm, was developed [11]. It was adaptable to dynamic graph sizes.

A system for assigning an electronic document to one or more predefined categories or classes, based on its textual content and the use of an agglomerative clustering algorithm, was developed [6]. This type of clustering, with the sample correlation coefficient as the similarity measure, allowed a high reduction factor of the indexing term space together with a gain in classification accuracy.

To minimize noise and outlier data, a modified DBSCALE algorithm using Naïve Bayes has been designed [7]. This algorithm is basically a probability-based utility function. The function is used to estimate the outlier cluster data and to increase the correctness rate of the algorithm for a given threshold value. Since Naïve Bayes is a probability-based function, it removes outlier cluster data and increases the correctness rate according to the threshold value. It also computes the maximum posterior hypothesis for outlier data.

The memory-based collaborative system uses matrix-based computation and solves the data sparsity problem, but its scalability is not stable when the size of the group increases dynamically. A hybrid system could help overcome the scalability issue, but it again leads to the cold start problem. To eliminate outliers as well as to overcome the other two problems, Naive Bayes clustering, a probability-based method, has been used in the past.
The Firefly algorithm converges faster and searches all possible subsets with better time utilization. Thus, to design an efficient recommendation system, the Naïve Bayes method can be followed for offline clustering. Since the time complexity of the online stage should be low, the Firefly algorithm, which is more efficient in terms of time utilization, can be used for calculating similarity online. Combining these two techniques might increase the accuracy of the recommendation system as well as result in efficient time utilization.
       

 

 III. Overview of the proposed work   

 

Initially, the Web log files are obtained from America Online Inc. (AOL) [1]. The log files consist of five fields: an anonymous ID for each user, the query of each user, the query time, the URL that the user clicked, and its rank in the result list. These logs are collected and grouped by anonymous ID. The URLs visited by all users are gathered, and their content is downloaded and processed. Processing includes removing stop words from the URLs' data and extracting keywords. Similar users are clustered on the fetched keywords using the Naïve Bayes clustering technique, which provides more efficient clusters than clustering by association rules. The clusters created are passed to the online component. In the online process, when an active user submits a query, the keywords are extracted from the query. The similarity between the extracted keywords and the other users in the active user's cluster is calculated using the Firefly similarity measure. The similarity values are sorted along with the Web pages browsed by similar users in the cluster, and the top k Web pages are recommended to the active user.

 

 IV. The proposed work  

The proposed system follows a linear process: it first collects and processes the Web logs, then clusters similar users with the Naïve Bayes clustering technique, and finally generates recommendations based on a similarity measure derived from the Firefly algorithm.

A. Pre-processing of Web Logs

The Web logs are collected from AOL Inc. [1]. The data set consists of 20 million Web queries from 650 thousand real users over 3 months, and includes the anonymous ID, query, query time, item rank and click URL. The log file contains many users along with the Web pages they visited. It is validated and split by anonymous ID, so that each user's records are stored in an individual file. The content of each URL is fetched and downloaded. The fetched text undergoes stop word removal and stemming, and the final keywords are then extracted. Features such as keywords, timing, frequency, click URL and revisit are fetched, and the user profile is constructed from these features taken from the user log files:

Keywords: the keywords extracted from the content of the URL.
Timing: the time the user spent on a particular URL.
Frequency: the number of times the user visited the URL.
Clickstream: the number of clickstreams visited by the user.
Revisit: whether the user revisited the Web page.

The keywords are generated from the data fetched from the URL. The timing for each URL is estimated from the logged date and time, by calculating the difference between consecutive URLs searched in a single day, subject to some time constraints. Frequency is the number of times the user clicked the URL. Clickstreams are the links the user clicked for additional information. The timing of a revisit is used to decide whether the user preferred the page. The information from the URL is thus collected and processed to obtain the features of the user.
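The grouping and feature extraction steps can be sketched as follows. This is a simplified sketch under several assumptions: log rows are already parsed into five-field tuples in the order listed above, the query text stands in for the downloaded page content, the stop word list is a tiny illustrative one, and stemming and timing are omitted. The helper names are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny illustrative stop word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "to", "in"}

def group_by_user(rows):
    """Group log rows by anonymous ID."""
    per_user = defaultdict(list)
    for anon_id, query, query_time, item_rank, click_url in rows:
        per_user[anon_id].append((query, query_time, item_rank, click_url))
    return per_user

def extract_keywords(text):
    """Lower-case, split on whitespace and drop stop words (no stemming here)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_profile(entries):
    """Build keyword counts and per-URL click frequency for one user."""
    keywords = Counter()
    frequency = Counter()
    for query, _time, _rank, url in entries:
        keywords.update(extract_keywords(query))
        if url:
            frequency[url] += 1
    return {"keywords": keywords, "frequency": frequency}

# Hypothetical rows: (anonymous ID, query, query time, item rank, click URL)
rows = [
    ("101", "history of the web", "2006-03-01 10:00:00", 1, "http://example.org/web"),
    ("101", "web history", "2006-03-01 10:05:00", 2, "http://example.org/web"),
    ("202", "firefly algorithm", "2006-03-02 09:00:00", 1, "http://example.org/ff"),
]
profiles = {uid: build_profile(e) for uid, e in group_by_user(rows).items()}
print(profiles["101"]["frequency"])
```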

B. Naïve Bayes Clustering

Clustering, also known as unsupervised classification, is a descriptive task with many applications. Clustering is the decomposition or partition of a data set into groups such that the objects in one group are similar to each other but as different as possible from the objects in other groups. The three main approaches to clustering data are partition-based clustering, hierarchical clustering and probabilistic model-based clustering. Probabilistic model-based clustering is a soft clustering in which an object can belong to several clusters according to a probability distribution. A clustering is useful if it produces some interesting insight into the problem being analysed.

Naïve Bayes clustering is a probabilistic clustering technique based on Bayes' theorem with a strong independence assumption between features. The feature variables can be discrete or continuous. This probabilistic clustering handles both nominal and numeric variables in the data set, and its novelty lies in the use of mixtures of truncated exponentials (MTE) densities to model the numeric variables. In Naïve Bayes clustering the class is the only root variable, and all the attributes are conditionally independent given the class. The clustering problem reduces to taking a data set of instances and a previously specified number of clusters (k), and working out each cluster's distribution and the population distribution between the clusters. The expectation maximization (EM) algorithm is used to obtain these parameters. Since Naïve Bayes clustering is a probability-based technique, an item belongs to a cluster if and only if it is related to it. This helps eliminate outlier data during clustering, and it also provides proper clustering with fewer computations.

The given data set is divided into two parts, one for training and the other for testing. For each record in the test and train databases, the distribution of the class variable is computed. According to the obtained distribution, a value for the class variable is simulated and the record is inserted in the corresponding cluster. The log-likelihood of the new model is then computed. If it is higher than that of the initial model, the process is repeated; otherwise, the process stops and the obtained clusters are returned.
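The EM loop described above can be illustrated with a deliberately simplified sketch. It assumes binary features (for example, keyword present or absent) modelled with Bernoulli distributions instead of MTE densities, and it keeps soft cluster memberships rather than simulating a class value per record. The function name, the Laplace smoothing and the initial memberships are assumptions made for this example only.

```python
import math

def em_naive_bayes(data, k, init_resp, iterations=50, tol=1e-6):
    """Soft-cluster binary feature vectors with a Bernoulli naive Bayes mixture."""
    n, d = len(data), len(data[0])
    resp = [row[:] for row in init_resp]  # resp[i][c] = P(cluster c | record i)
    prev_ll = -math.inf
    for _ in range(iterations):
        # M-step: cluster priors and Laplace-smoothed per-feature parameters
        weight = [sum(resp[i][c] for i in range(n)) for c in range(k)]
        prior = [weight[c] / n for c in range(k)]
        theta = [[(sum(resp[i][c] * data[i][j] for i in range(n)) + 1.0) / (weight[c] + 2.0)
                  for j in range(d)] for c in range(k)]
        # E-step: cluster posteriors, features independent given the cluster
        ll = 0.0
        for i in range(n):
            joint = []
            for c in range(k):
                p = prior[c]
                for j in range(d):
                    p *= theta[c][j] if data[i][j] else (1.0 - theta[c][j])
                joint.append(p)
            total = sum(joint)
            ll += math.log(total)
            resp[i] = [p / total for p in joint]
        # Stop once the log-likelihood no longer improves
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return resp, prior, theta

# Hypothetical keyword-presence vectors for six users
data = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0],
        [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
init = [[0.9, 0.1], [0.5, 0.5], [0.5, 0.5],
        [0.1, 0.9], [0.5, 0.5], [0.5, 0.5]]
resp, prior, theta = em_naive_bayes(data, 2, init)
labels = [max(range(2), key=lambda c: r[c]) for r in resp]
print(labels)
```

The stopping rule mirrors the description in the text: the loop repeats while the log-likelihood of the new model is higher than that of the previous one.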

C. Optimisation Using Firefly Algorithm
 

The Firefly algorithm is an evolutionary algorithm based on the behaviour of fireflies. Fireflies live in colonies and cooperate for the survival of the colony. To model the behaviour of fireflies, three assumptions are generally made: all fireflies are homogeneous; the attractiveness of each firefly is related to its level of brightness; and the brightness of a firefly is determined by an exponential objective function. Each firefly emits a kind of light that attracts other fireflies. The amount of light received depends on parameters such as the distance and the absorption coefficient of the surroundings. The longer the distance, the less light is received. Likewise, in surroundings with a high light absorption coefficient, such as foggy weather, the intensity of light decreases. The essential point is that every firefly, regardless of its gender, is attracted to and moves toward a brighter firefly.
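The dependence of attraction on distance and absorption is commonly written in the Firefly algorithm literature as beta(r) = beta0 * exp(-gamma * r^2), with a firefly moving toward a brighter one by that fraction of the gap plus a small random step. The sketch below follows that standard formulation as an illustration; it is an assumption that the proposed system uses this exact form, and the parameter names `beta0`, `gamma` and `alpha` are the conventional ones rather than values taken from this work.

```python
import math
import random

def attractiveness(distance, beta0=1.0, gamma=1.0):
    """Attractiveness decays with distance and the absorption coefficient gamma."""
    return beta0 * math.exp(-gamma * distance ** 2)

def move_towards(x_i, x_j, beta0=1.0, gamma=1.0, alpha=0.0, rng=random):
    """Move firefly x_i toward brighter firefly x_j; alpha scales the random step."""
    r = math.dist(x_i, x_j)
    beta = attractiveness(r, beta0, gamma)
    return [a + beta * (b - a) + alpha * (rng.random() - 0.5)
            for a, b in zip(x_i, x_j)]

# With alpha=0 the move is deterministic: the firefly covers a fraction
# exp(-1) of the unit gap toward the brighter firefly.
print(move_towards([0.0, 0.0], [1.0, 0.0]))
```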
Each firefly has a light intensity of its own. The key concept is that a firefly with low light intensity is always attracted to a firefly with high light intensity, and this concept can be used for calculating similarity. A Firefly-based similarity measure yields unique and distinguishable results, which is a useful feature for ranking. The algorithm deals with highly non-linear, multi-modal optimization problems naturally and efficiently. It does not use velocities, so it avoids the problems associated with velocity in particle swarm optimization (PSO). Its speed of convergence is high, as is its probability of finding the globally optimal answer. It can be integrated flexibly with other optimization techniques to form hybrid tools, and it does not require a good initial solution to start its iteration process.

Each Web page visited by user i is considered a firefly. The number of users who visited a particular page is taken as the light intensity of the firefly. The objective function is formulated from frequency and duration. Frequency is the ratio of the number of visits to a page to the average number of visits over all pages. Duration is the ratio of the time spent on a page to the total time spent on all the pages visited by the user. Thus, the objective function can be defined as in equation 5.1:
$$\mathrm{Interest}(i) = \frac{2 \cdot \mathrm{Frequency}(i) \cdot \mathrm{Duration}(i)}{\mathrm{Frequency}(i) + \mathrm{Duration}(i)} \tag{5.1}$$
The interest of all users in the cluster is calculated. The pages to be recommended are then found by applying the page rank algorithm [2] to the obtained result, and the pages returned by the page rank algorithm are recommended to the user.
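Equation 5.1 is the harmonic mean of frequency and duration, and it can be computed per page as in the sketch below. The dictionary layout (page to visit count, page to seconds spent) and the function name are assumptions for illustration; the frequency and duration ratios follow the definitions given in the text.

```python
def interest_scores(visits, durations):
    """Interest(i) = 2 * F(i) * D(i) / (F(i) + D(i)) for each page i."""
    avg_visits = sum(visits.values()) / len(visits) if visits else 0.0
    total_duration = sum(durations.values())
    scores = {}
    for page, count in visits.items():
        # Frequency: visits to this page relative to the average over all pages
        frequency = count / avg_visits if avg_visits else 0.0
        # Duration: time on this page relative to total time on all pages
        duration = durations.get(page, 0.0) / total_duration if total_duration else 0.0
        denom = frequency + duration
        scores[page] = 2 * frequency * duration / denom if denom else 0.0
    return scores

# Hypothetical visit counts and seconds spent per page for one user
visits = {"a": 4, "b": 2, "c": 0}
durations = {"a": 60.0, "b": 30.0, "c": 10.0}
scores = interest_scores(visits, durations)
print(scores)
```

Because the harmonic mean is dominated by the smaller of its two arguments, a page scores highly only when it is both visited often and viewed for a long time.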

D. Ranking the Web Pages

The resulting set of Web pages should be ranked in the order in which the user is likely to have the highest interest. Thus, they are sorted by the interest of the active user. The association rule checks the maximum possible combinations, which provides more accurate pages.
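Selecting the top pages from a set of interest scores is a plain top-k selection. A minimal sketch, assuming the scores are held in a page-to-score dictionary (the function name and sample values are hypothetical):

```python
import heapq

def top_k_pages(scores, k):
    """Return the k pages with the highest interest score, best first."""
    best = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    return [page for page, _ in best]

print(top_k_pages({"a": 0.9, "b": 0.4, "c": 0.7}, 2))
```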

 

E. Recommendation Process
 

The URLs to be recommended are identified from the ranking and the similarity measure. The similarity measure is calculated among the users by comparing their similar interests. From the resulting pages, the page rank algorithm is used to rank the pages most relevant to the user, and the resulting URLs are recommended. Hence the Web pages recommended to the user will be more relevant. The use of Naive Bayes clustering eliminates the outliers, and the Firefly-based similarity calculation checks all the subsets of the clusters.
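For reference, the page rank step can be illustrated with the textbook power-iteration form of the algorithm. The text does not specify how the link graph over candidate pages is built, so the graph below, the damping factor of 0.85 and the fixed iteration count are all assumptions for this sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph among three candidate pages
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
print(max(rank, key=rank.get))
```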
