# Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications)

## Bing Liu

Language: English

Pages: 624

ISBN: 3642268919

Format: PDF / Kindle (mobi) / ePub

Web mining aims to discover useful information and knowledge from Web hyperlinks, page contents, and usage data. Although Web mining uses many conventional data mining techniques, it is not purely an application of traditional data mining due to the semi-structured and unstructured nature of the Web data. The field has also developed many of its own algorithms and techniques.

Liu has written a comprehensive text on Web mining, which consists of two parts. The first part covers the data mining and machine learning foundations, where all the essential concepts and algorithms of data mining and machine learning are presented. The second part covers the key topics of Web mining, where Web crawling, search, social network analysis, structured data extraction, information integration, opinion mining and sentiment analysis, Web usage mining, query log mining, computational advertising, and recommender systems are all treated both in breadth and in depth. His book thus brings all the related concepts and algorithms together to form an authoritative and coherent text.

The book offers a rich blend of theory and practice. It is suitable for students, researchers and practitioners interested in Web mining and data mining both as a learning text and as a reference book. Professors can readily use it for classes on data mining, Web mining, and text mining. Additional teaching materials such as lecture slides, datasets, and implemented algorithms are available online.

Formal Languages and Compilation (2nd Edition) (Texts in Computer Science)

Distributed Systems: Concepts and Design (5th Edition)

Artificial Intelligence for Advanced Problem Solving Techniques

Testing Computer Software (2nd Edition)

papers have been published on the topic. This short chapter introduces only some basics and by no means does justice to the huge body of literature in the area. The bibliographic notes here should help you explore further. Because the solution (the set of frequent itemsets or the set of rules) is uniquely determined once a data set, a minimum support, and a minimum confidence are given, most papers focus on improving mining efficiency. The best-known algorithm is the Apriori algorithm proposed by


partitioning clustering methods. It is able to take any form of distance or similarity function. Moreover, unlike the k-means algorithm, which only gives k clusters at the end, the hierarchy of clusters from hierarchical clustering enables the user to explore clusters at any level of detail (or granularity). In many applications, this resulting hierarchy can be very useful in its own right. For example, in text document clustering, the cluster hierarchy may
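The two properties mentioned in this excerpt (an arbitrary distance function and a full hierarchy rather than a fixed k) can be illustrated with a minimal sketch. The single-link merge rule, the function names `agglomerative` and `manhattan`, and the sample points are assumptions chosen for illustration; the excerpt itself does not specify them.

```python
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def agglomerative(
    points: List[Point], dist: Callable[[Point, Point], float]
) -> List[Tuple[List[int], List[int], float]]:
    """Naive single-link agglomerative clustering (illustrative only).

    Returns the merge history: (members of cluster A, members of cluster B,
    distance at which they were merged). The history is the hierarchy.
    """
    clusters = [[i] for i in range(len(points))]  # start with singletons
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters under single link
        # (minimum distance between any two members).
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


def manhattan(p: Point, q: Point) -> float:
    """Any distance function can be plugged in; Manhattan is one choice."""
    return sum(abs(x - y) for x, y in zip(p, q))


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
merges = agglomerative(points, manhattan)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

Reading the merge history from the start gives fine-grained clusters, and reading further gives coarser ones, which is the "any level of detail" idea. This brute-force version recomputes all pairwise cluster distances on every pass, so it is only suitable for small inputs.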

value range of one attribute is from 0 to 1, while the value range of the other attribute is from 0 to 1000. Consider the following pair of data points $\mathbf{x}_i$: (0.1, 20) and $\mathbf{x}_j$: (0.9, 720). The Euclidean distance between the two points is

$$
dist(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(0.9 - 0.1)^2 + (720 - 20)^2} = 700.000457, \tag{18}
$$

which is almost completely dominated by (720 − 20) = 700. To deal with the problem, we standardize the attributes, e.g., to force the attributes to have a common value range. If both attributes are forced
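The effect of the differing value ranges can be checked numerically. The sketch below assumes that "a common value range" means rescaling each attribute to [0, 1] using the stated ranges (0 to 1 and 0 to 1000); the helper names `euclidean` and `range_scale` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_scale(point, mins, maxs):
    """Rescale each attribute to [0, 1] given its minimum and maximum."""
    return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, mins, maxs))


xi = (0.1, 20.0)
xj = (0.9, 720.0)

# Distance on the raw attributes: dominated by the second attribute.
raw = euclidean(xi, xj)

# Assumed attribute ranges taken from the text: [0, 1] and [0, 1000].
mins = (0.0, 0.0)
maxs = (1.0, 1000.0)
scaled = euclidean(range_scale(xi, mins, maxs), range_scale(xj, mins, maxs))

print(f"raw distance:    {raw:.6f}")
print(f"scaled distance: {scaled:.3f}")
```

After rescaling, the points become (0.1, 0.02) and (0.9, 0.72), so both attributes contribute comparably to the distance instead of the second one swamping the first.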

Science, Sports, and Politics. Each class has 300 documents, and each document is labeled with one of the topics (classes). We use this collection to perform clustering to find three clusters. Class/topic labels are not used in clustering. After clustering, we want to measure the effectiveness of the clustering algorithm. First, a confusion matrix (Fig. 4.17) is constructed based on the clustering results. From Fig. 4.17, we see that cluster 1 has 250 Science documents, 20 Sports documents, and