some novel approaches to study spam detection. In the absence of a gold-standard dataset, they used Amazon product review data and trained models on features from the review text, reviewer and product characteristics to distinguish between duplicate and non-duplicate reviews.

Lim et al. [LNJ+10] aim at detecting user-generated spam reviews and review spammers by modelling the characteristic behaviour of spammers. They model two behaviours: spammers target a specific product or group of products to maximize their impact, and they tend to deviate from other reviewers in their rating behaviour. A scoring method is proposed to measure the degree of spam and is applied to Amazon review data. The results show that the proposed ranking methods are effective in discovering review spammers.

Li et al. [LFW+16] proposed a two-mode labelled Hidden Markov model to detect spammers using their co-bursting behaviour and temporal patterns. The proposed model is applied to a real-life dataset from the review hosting site dianping.com. Experimental results show that the model outperforms supervised learning with linguistic and behavioural features in identifying spammers. Li et al. [LCM+15] also proposed novel temporal and spatial features for supervised opinion spam detection. Their analysis of Yelp filtered review data shows that these features significantly outperform existing state-of-the-art features.

Li et al. [LCL+14] present the first study of real fake reviews in Chinese, using the filtered fake reviews from Dianping's fake review detection system. Most of the filtered reviews are fake, but the remaining unknown reviews may also contain many fake reviews. A positive-unlabelled learning model is therefore proposed that uses both the filtered fake reviews and the unknown reviews. Experiments are conducted on real-life reviews of 500 restaurants in China.
Since the model uses language-independent features, it can be easily generalized.

The difficulty of the human labelling needed for supervised learning and evaluation makes spam detection highly challenging. Mukherjee et al. [MKL+13] addressed the problem with an unsupervised model called the Author Spamicity Model (ASM). It works in a Bayesian setting, which allows the spamicity of authors to be modelled as a latent variable and the behavioural footprints of reviewers to be exploited. Experiments on Amazon product review data demonstrate the effectiveness of the proposed model, which considerably outperforms state-of-the-art models.

Previous studies have used various types of pseudo fake reviews for training. These pseudo fake reviews are produced with the Amazon Mechanical Turk (AMT) crowdsourcing tool and are not the same as real fake reviews. Classification models reach an accuracy of 89.6% on pseudo fake reviews, whereas Mukherjee et al. [MVLG13] obtained only 67.8% accuracy on real-life Yelp data using n-gram features.

Jitendra et al. [RSJB17] applied both supervised and unsupervised techniques to identify review spam. The most effective feature set is used for model building, and sentiment analysis is also incorporated in the detection process.
Since there is no gold-standard labelled dataset, supervised classifiers are not always preferable, so unsupervised learning is used in this work.

A summary of the previous studies on spam detection is shown in Table 2.1.

| Paper | Dataset | Features used | Learner | Performance metric |
|---|---|---|---|---|
| JL07 | Reviewers crawled from the Amazon website | Review and reviewer features | LR | AUC |
| LHYZ11 | Epinions reviews | Review and reviewer features | NB with co-training | F-score |
| OCCH11 | Hotel reviews collected through Amazon Mechanical Turk (AMT) by Ott et al. | LIWC and bigrams | SVM | Accuracy |
| MKL+13 | Yelp data | Behavioural features combined with bigram features | SVM | Accuracy |
| LOCH14 | Hotel reviews | LIWC + POS + unigram | SAGE | Accuracy |
| SIR+17 | Amazon electronics product reviews | Reviewer, review and product | K-NN | Accuracy |

Table 2.1: Spam review detection methods

2.2 Related Study for helpfulness prediction

This section reviews the existing work that is relevant to review helpfulness and review spam detection. It is divided into four subsections that define helpfulness and spaminess and discuss the existing methods in these fields.

Helpful review

Over the past decades, the e-commerce industry has grown drastically. Online reviews are the major source of product evaluation because the point of sale for online merchants does not involve a retail store where customers can view the products physically. Given a large number of reviews, both retailers and consumers want to effectively identify the reviews that provide the most insight. Helpfulness is traditionally assessed by asking the simple question "Was this review helpful to you?" and providing "thumbs up" and "thumbs down" buttons [SIR+17]. When the number of reviews is large, this traditional voting method does not work well because (i) very few reviews receive helpful votes and (ii) recent reviews have not yet received votes, so their helpfulness cannot be decided.
To overcome this limitation, automated mechanisms have been developed using machine learning.

For a given Amazon review r, let f_p be the number of users that found the review helpful (i.e., positive votes) and let f_n be the number of users that did not find the review helpful (i.e., negative votes). We denote by s_r the helpfulness score of r:

$$s_r = \frac{f_p}{f_p + f_n} \tag{2.1}$$

A sample review from the Amazon website is shown in Figure 2.1.

Figure 2.1: Sample Amazon review (Source: [SIR+17])

Methods for predicting the helpfulness of reviews

Various studies over the past decade have addressed predicting the helpfulness of a review. The study of reviews is termed opinion mining, an interdisciplinary research field that involves natural language processing, computational linguistics and text mining. Most researchers have used regression techniques, while others have used classification, deep learning and neural networks. A summary of the previous studies referred to in this work is shown in Table 2.2.

| Paper | Dataset | Features used | Learner | Performance metric |
|---|---|---|---|---|
| NYSS17 | Amazon and Yelp datasets | Review length, readability, sentiment, rating and age | SVR | MAE, RMSE |
| YYQB15 | Amazon product review dataset | STR, UGR, GALC, LIWC, INQUIRER | SVR with RBF kernel | RMSE |
| NYS14 | Amazon and Yelp datasets | BOW + RFM | SVR | RMSE |
| SIR+17 | Amazon reviews dataset | Review + reviewer + product | Ensemble | RMSE |
| QSR+16 | TripAdvisor | NCR, ANCS | Tobit regression | Std. error, z-statistic |
| GI09 | Amazon | Subjectivity, readability, reviewer characteristics and review history | Regression | R-square |

Table 2.2: Existing studies for helpfulness prediction

Singh et al. [SIR+17] developed machine learning models for predicting the helpfulness of consumer reviews using various textual features such as polarity, subjectivity, entropy and readability. They used an ensemble learning technique (the gradient boosting algorithm) to analyse the data. For the experiments, the Amazon dataset for the Book category is used.
The experiment shows that the MSE on the training dataset decreases as the number of trees is increased. Rating is also a very influential feature in predicting helpfulness.

Thomas et al. [NYSS17] proposed a new script-enriched model to predict review helpfulness and evaluated its effectiveness and efficiency. The efficiency of the script-enriched model is demonstrated by comparing it with two benchmark models, a baseline model and a Bag of Words model. The results show that the script-enriched model not only yields the highest accuracy but also requires less time for training, testing and feature selection.

Yang et al. [YYQB15] approached helpfulness prediction from a different angle by hypothesizing that helpfulness is an internal property of the text. They used semantic features such as INQUIRER and LIWC for review helpfulness prediction. The insight behind this is that people usually embed semantic meaning such as emotion and reasoning in their text. A regression model is trained on an Amazon dataset and validated using human annotation. They achieved the best results, with a very low RMSE, when the GALC, LIWC and INQUIRER features are combined. A cross-category test also shows that the semantic features are transferable to other categories.

Qazi et al. [QSR+16] developed a concept model for helpfulness prediction. This study considers not only quantitative factors such as the number of concepts but also qualitative aspects of a review, including the review type (regular, comparative or suggestive) and reviewer helpfulness. A set of 1500 reviews is randomly chosen from TripAdvisor across multiple hotels for analysis, and a set of four hypotheses is used to test the model. The results suggest that the number of concepts contained in a review, the number of concepts per sentence and the review type contribute to the perceived helpfulness of online reviews.

Liu et al. [LJJ+13] focused on how to automatically evaluate the helpfulness of a review from a designer's viewpoint, using only the review content.
They conducted an exploratory study to understand what makes a review helpful from the product designer's viewpoint.

Ghose et al. [GI11] analysed the impact of online reviews on economic outcomes such as product sales and examined how various factors affect usefulness. Their approach explores multiple aspects of the review text, such as subjectivity, readability and spelling errors, to identify the important text-based features. They also explore reviewer-level aspects such as the average rating. Their analysis reveals that the extent of subjectivity, informativeness, readability and linguistic correctness in reviews strongly influences sales and perceived usefulness. Reviews with a mixture of objective and highly subjective sentences influence usefulness negatively. Using Random Forest based classifiers, they show that usefulness can be predicted accurately.

Singh et al. [SIR+17] proposed integrating an ensemble learning model into the website itself to predict the helpfulness of online consumer reviews. The proposed system would be able to perform the initial evaluation of a review. This would help in prioritizing the better reviews in an appropriate order so that they can be viewed by other consumers. Recent studies have shown that about 87% of consumers read only the top ten reviews. The proposed system ensures that a helpful review is ranked appropriately.

In the next chapter, an overview of the three modules to be implemented to predict helpfulness and to detect spaminess is provided.
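As a small illustration of the helpfulness score in Equation 2.1, the following minimal Python sketch computes s_r from the vote counts f_p and f_n. The function name `helpfulness_score`, the sample vote counts and the choice to return `None` for a review without votes are assumptions made for this example and are not taken from the cited studies.

```python
from typing import Optional


def helpfulness_score(f_p: int, f_n: int) -> Optional[float]:
    """Return s_r = f_p / (f_p + f_n) as in Equation 2.1.

    Returning None for a review with no votes is an assumption of this
    sketch; the score is simply undefined in that case.
    """
    if f_p < 0 or f_n < 0:
        raise ValueError("vote counts must be non-negative")
    total = f_p + f_n
    if total == 0:
        return None  # nobody has voted yet, so the ratio cannot be formed
    return f_p / total


# Hypothetical vote counts (f_p, f_n), chosen only for illustration.
reviews = {"r1": (8, 2), "r2": (1, 3), "r3": (0, 0)}
scores = {rid: helpfulness_score(fp, fn) for rid, (fp, fn) in reviews.items()}
```

Review `r3` has received no votes, so its score is undefined. This is the situation described for recent reviews, and it is one reason why helpfulness is predicted from review features instead of being read directly from the votes.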