Data mining can be used to monitor social networks and discussion forums for suspicious posts or comments. Malicious posts can hurt religious or racial sentiments, and because discussion forums can broadcast a message to a large population quickly, it is important to monitor the posts on these forums. This application collects posts and comments from discussion sites and analyses them using various data mining techniques and algorithms. The posts and comments are checked for provocative content by comparing their words against the set of sensitive keywords used by the algorithm. The sensitive keywords are divided into six categories: hacking, sexuality, religious, piracy, gambling, and fraud. When the algorithm encounters a sensitive word from any of the six categories in a comment in the data set, the comment is assigned to the category to which that word belongs.
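The keyword comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six category names come from the text, but the keyword lists, the tokenization rule, and the function name `categorize` are hypothetical placeholders, not the actual lists or code of the application.

```python
import re

# Hypothetical keyword lists: the text names the six categories
# but does not list the actual sensitive keywords.
SENSITIVE_KEYWORDS = {
    "hacking": {"exploit", "phishing", "malware"},
    "sexuality": {"explicit"},
    "religious": {"blasphemy"},
    "piracy": {"torrent", "crack"},
    "gambling": {"casino", "betting"},
    "fraud": {"scam", "counterfeit"},
}


def categorize(comment):
    """Return the sorted categories whose keywords occur in the comment."""
    # Lowercase and split into word tokens (an assumed, simple tokenizer).
    words = set(re.findall(r"[a-z']+", comment.lower()))
    return sorted(
        category
        for category, keywords in SENSITIVE_KEYWORDS.items()
        if words & keywords  # at least one keyword of this category is present
    )


if __name__ == "__main__":
    print(categorize("Download the torrent and win big at the casino"))
```

A comment can match several categories at once, so the sketch returns a list rather than a single label; a comment with no sensitive word yields an empty list.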
Traditional data mining techniques, such as association analysis, classification and prediction, outlier analysis, and cluster analysis, identify patterns in structured data. Newer techniques, on the other hand, recognize patterns in both unstructured and structured data. Like other forms of data mining, crime data mining raises privacy concerns. However, our effort is to promote the various automated data mining techniques for national security applications and local law enforcement.
Clustering techniques group data objects into classes with similar characteristics so as to maximize similarity within a class and minimize similarity between classes. Clustering data objects into legal and illegal groups can automate a major part of crime analysis, but it is limited by the high computational intensity typically required.
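As a rough illustration of grouping objects by similarity, the sketch below uses k-means on two-dimensional points. The choice of k-means, the seeding with the first k points, and the function name `kmeans` are assumptions made for this example; the text does not say which clustering algorithm is meant.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [
                (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (
                sum(x for x, _ in cluster) / len(cluster),
                sum(y for _, y in cluster) / len(cluster),
            )
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids


if __name__ == "__main__":
    data = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
    groups, centers = kmeans(data, k=2)
    print(groups)
```

Every iteration compares each point with each centroid, which hints at the computational cost mentioned above when the data set is large.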
Association rule mining discovers frequently occurring item sets in a database and presents the patterns as rules. It has been used in network intrusion detection to derive association rules from users’ interaction history. In network intrusion detection, this approach can identify intrusion patterns among time-stamped data. Revealing hidden patterns benefits analysis, but obtaining meaningful results requires rich and highly structured data.
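A minimal sketch of the idea, restricted to item pairs for brevity: count how often two items occur together (support) and how often one item implies the other (confidence). The event names, the thresholds, and the function names `frequent_pairs` and `confidence` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs whose support >= min_support."""
    counts = Counter()
    for transaction in transactions:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(transaction)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    matching = sum(1 for t in with_antecedent if consequent in t)
    return matching / len(with_antecedent)


if __name__ == "__main__":
    # Hypothetical per-session event sets from an interaction log.
    sessions = [
        {"login_fail", "port_scan"},
        {"login_fail", "port_scan", "ftp"},
        {"login_fail"},
        {"ftp"},
    ]
    print(frequent_pairs(sessions, min_support=0.5))
    print(confidence(sessions, "port_scan", "login_fail"))
```

Full algorithms such as Apriori extend the same counting to item sets of any size and prune candidates whose subsets are infrequent.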
Deviation detection uses specific measures to study data that differs markedly from the rest of the data. Investigators can apply this technique, also called outlier detection, to fraud detection, network intrusion detection, and other analyses.
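One simple measure of deviation is the z-score, the distance of a value from the mean in units of standard deviation. The sketch below assumes that measure and a threshold of 2.0; both are illustrative choices, as the text does not specify which measure is used, and `zscore_outliers` is a hypothetical name.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


if __name__ == "__main__":
    # Hypothetical daily transaction counts with one unusual day.
    print(zscore_outliers([10, 12, 11, 13, 12, 11, 10, 95]))
```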
Classification finds common properties among various crime entities and arranges them into predefined classes. It has been applied to identify the source of email spamming based on the sender’s structural features and linguistic patterns. Often used to predict crime trends, classification can reduce the time required to identify suspicious entities. However, the technique requires a predefined classification scheme. Classification also requires reasonably complete training and testing data, because a high degree of missing data would limit prediction accuracy.
String comparator techniques compare the textual fields in pairs of database records and compute the correspondence between the records, which can help detect suspicious information. Researchers can use string comparators to evaluate textual data, although this often requires intensive computation. String comparison, whether by string matching or by string distance measures, is an active field for computer scientists. Levenshtein defined a widely used measure of similarity between two strings, the edit distance.
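The Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a standard dynamic-programming version; the normalization in `similarity` is one common convention and is an assumption here, not something the text prescribes.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(similarity("kitten", "sitting"))
```

The nested loops take time proportional to the product of the two string lengths, which is why comparing many record pairs becomes computationally intensive.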
Stopword selection: Stop words are the most frequently used words in the English language, including pronouns such as “I, he, she”, articles such as “a, an, the”, and prepositions. The concept of stop words was first introduced in Information Retrieval (IR) systems. A small portion of the words in the English language accounts for a significant portion of text size in terms of frequency of appearance. It was noticed that such pronouns and prepositions were not used as index words to retrieve documents. Thus, it was concluded that such words do not carry significant information about documents.
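Stop-word removal can be sketched as a filter applied after tokenization. The stop-word list below is a small illustrative sample built from the examples in the text plus a few prepositions; it is an assumption, not the list used in this work, and `remove_stop_words` is a hypothetical name.

```python
import re

# Small illustrative stop-word list (an assumption, not the actual list).
STOP_WORDS = {"i", "he", "she", "a", "an", "the", "in", "on", "of", "to", "at"}


def remove_stop_words(text):
    """Lowercase the text, split it into words, and drop the stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    print(remove_stop_words("She posted the link on a forum"))
```

Filtering these words before keyword comparison reduces the number of tokens that must be examined without discarding index-worthy terms.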