189

Views

CrossRef citations to date

Altmetric

Original Articles

DEFINING TOPIC BOUNDARIES IN SEARCH ENGINE TRANSACTION LOGS USING GENETIC ALGORITHMS

Seda Ozmutlu Industrial Engineering Department, Uludag University, Gorukle, Bursa, TurkeyCorrespondence[email protected]

Coskun Gencer Cosar Industrial Engineering Department, Uludag University, Gorukle, Bursa, Turkey

Abstract

This study proposes to use genetic algorithms for defining the topic boundaries in search of engine transaction logs. Users are interested in multiple topics during a search session, and genetic algorithms are used in this study to determine whether a search engine user has changed topics during a session. Sample data logs from FAST and Excite search engines were analyzed. The findings show that genetic algorithms are fairly successful in identifying topic continuations and shifts in search engine transaction logs.

It is a challenge to develop effective information retrieval algorithms and interpret user information-seeking behavior, since people have different and changing information needs, and they utilize different information-searching strategies to solve their information problems (Gremett Citation2006). One of the challenges in information-seeking behavior is new topic identification or session identification. New topic identification or session identification is discovering when the user has switched from one topic to another during a single search session (He, Goker, and Harper Citation2002). In order to find useful patterns in user sessions, it is necessary to group the queries on the transaction logs into clusters. After the query clusters have been identified, the common usage patterns of search engine users can be discovered by statistical tools (Huang, Yao, and An Citation2006).

It is important to identify when the user has changed topics to design clustering algorithms and query recommendation algorithms. In addition, it is a very difficult problem to identify user sessions in common access computer centers, such as libraries and PC labs. When a new user accesses the computer, he/she might begin using the search engine with a new topic compared to the previous user. In this case, a topic change occurs, although it might seem that the user is the same (since the computer's IP has not changed). If this topic change is identified, search engines might offer more successful results to the user, by adjusting their ranking and clustering algorithms according to the changing needs of different users. Other implications of session and new topic identification, in terms of personalized web services, web caching systems, and website design, are well-documented by Huang, Peng, An, and Schuurmans (Citation2004).

New topic identification, session identification, and query clustering studies generally analyzed the queries semantically. Defining topic boundaries by relying on semantics of query terms are “dangerously circular” and conceal the persistence of users' long-term information needs (Murray, Lin, and Chowdhury Citation2006). It is also quite a challenge to determine the topics of user queries looking at the few keywords that a user enters as a query. An alternative approach for session identification or automatic new topic identification is nonsemantic methodologies, where the statistical characteristics of queries, such as query duration or search pattern are used to estimate an upcoming topic change. Hence, nonsemantic methodologies aim to understand that a user has changed topics during a session without looking at the context of the queries, instead of considering the statistical characteristics of the queries. The initial indications of the relationship between statistical characteristics of queries and topic change were shown in Spink et al. (Citation2002a) and Goker and He (Citation2000). Ozmutlu (Citation2006) also showed that the statistical characteristics of queries, such as time interval and search pattern of queries, have a statistically significant effect on topic changes.

Several nonsemantic methodologies, primarily based on learning algorithms, have been previously applied for new topic identification and session identification, such as the Demspter–Shafer Theory (Ozmutlu and Cavdur Citation2005a; Ozmutlu, Cavdur, and Ozmutlu Citation2006), neural networks (Ozmutlu and Cavdur Citation2005b; Ozmutlu, Cavdur, Ozmutlu, and Spink Citation2004a), but room for improvement exists. A new nonsemantic methodology for fast and successful automatic topic identification would be a valuable contribution to the field of information science.

In this study, we apply genetic algorithms to perform automatic new topic identification of search engine user queries. The queries are categorized with respect to their statistical characteristics. The statistical characteristics are a time interval and search pattern of queries. After the queries are categorized, genetic algorithms are applied to determine whether new queries contain a topic shift or a continuation compared to the previous query. It is also one of the objectives of the article to determine the parameter set for genetic algorithms that will provide the best performance values for automatic new topic identification through elaborate experimental design and analysis of variance.

The layout of the article is as follows: We initially present the literature review related to new topic identification, followed by the description of the research design, results, and the conclusion.

RELATED STUDIES

There are many large-scaled studies on web search engine transaction logs, such as those of Silverstein, Henzinger, Marais, and Moricz (Citation1999), Spink, Bateman, and Jansen (Citation1999), Spink et al. (Citation2002b), and Ozmutlu, Ozmutlu, and Spink (Citation2004b). The number of studies on content analysis is few, the reason generally being the effort required to manually process the queries for topic identification; however, content analysis is a growing area (Pu, Chuang, and Yang Citation2002). Some researchers, such as Silverstein et al. (Citation1999) and Spink, Wolfram, Jansen, and Saracevic (Citation2001) have performed content analysis of search engine data logs at the term and topical level, and concluded that the popular topics are pornography, entertainment, and e-commerce. Other topic-related studies include Shen, Dumais, and Horvitz (Citation2005), who used marginal models, Markov models, and time-specific Markov models to investigate the transition from one topic to another. It is also a known fact in information retrieval that users might be interested in multiple topics during a single search session, which is called multi-tasking. In terms of information retrieval, multi-tasking is “the process of searches over time in relation to more than one, possibly evolving, set of information problems” (Spink et al. Citation2002a). Spink et al. (Citation2002a) and Ozmutlu, Ozmutlu, and Spink (Citation2003) show that 11.4% of Excite users and 31.8% of FAST users performed multi-tasking searches, respectively.

Since there might be several topics in a single-user session, it might be important to divide a user session into query clusters. Beeferman and Berger (Citation2000) and Wen, Nie, and Zhang (Citation2002) applied query clustering that uses search engine query logs including click-through data, which provides the documents that the user have selected as a result of the search query. Pu et al. (Citation2002) developed an automatic classification methodology to classify search queries into broad subject categories. They formed a subject taxonomy and fit each search query into one of the categories in the taxonomy. Zhu, Greiner, and Haubl (Citation2003) proposed a link-based system that generates queries from terms contained in the previously accessed documents. Muresan and Harper (Citation2004) proposed a topic-modeling system for developing mediated queries. They performed a statistical analysis of terms in documents available in a source collection and a statistical representation of the lexicographic model of the query. Recently, Liu, Croft, Oh, and Hart (Citation2004) and Metzler and Croft (Citation2005) used support vector machines (SVMs) to classify queries in different groups. Results showed that SVMs achieved recognition accuracy around 80% for queries.

Session identification is an important problem, since it is difficult to identify web user sessions. A single IP address might not mean a single user, due to dynamic IP applications and use of search engines at common-access computer centers such as libraries, computer labs, etc. (Jansen, Spink, Blakely, and Koshman Citation2007). The most commonly used session identification method is timeout, but it is difficult to set the time threshold (Huang et al. Citation2004; Cooley, Mobasher, and Srivastava Citation1999). Search engine researchers have also used cookies, along with IP addresses, for user and session identification. The advantages and disadvantages of these approaches have been discussed in detail in Jansen et al. (Citation2007). Essentially, session identification and new topic identification are equivalent problems. Since different users from a single IP might be differentiated by the topic change that occurs from user to user; the session identification problem can also be solved by successful new topic identification.

As mentioned in the introduction, natural language processing techniques have not yet been successfully applied on search engine transaction logs. One promising approach to the problem of session identification or new topic identification is to use nonsemantic methodologies. In such an approach, queries can be categorized in different topic groups with respect to their statistical characteristics, such as the time-interval between subsequent queries or the reformulation of queries. There are several studies adapting such an approach. He et al. (Citation2002) proposed a topic identification algorithm that uses the Dempster-Shafer Theory (Shafer Citation1976). Their algorithm automatically identifies topic changes using statistical data from web search logs. He et al.'s (Citation2002) approach was replicated on Excite search engine data (Ozmutlu and Cavdur Citation2005a) and FAST search engine data (Ozmutlu et al. Citation2006). The main finding of these studies was that the idea of using query patterns and time-intervals in identifying topic shifts is valuable, but can be improved. In order to improve the performance of new topic identification Ozmutlu et al. (Citation2004a) and Ozmutlu and Cavdur (Citation2005b) applied artificial neural networks to define boundaries of separate topics in the same session, and identified topic shifts fairly successfully. Ozmutlu (Citation2006) found that the effect of the characteristics of web search queries on the occurrence of topic shifts/continuations is statistically significant. Furthermore, Ozmutlu, Ozmutlu, and Buyuk (Citation2007) applied conditional probabilities to perform automatic new topic identification. These methods have applied session or new topic identification successfully; room for improvement exists, however. The methodologies in these articles have estimated too many topic shifts and as a result the precision of topic shift estimation was rather low. There is a need for a fast and efficient new topic identification that can successfully apply new topic identification.

METHODOLOGY

Research Design

New topic identification is performed on two datasets in this study: the FAST transaction log and the Excite transaction log. The FAST search engine (http://www.alltheweb.com) provided a query log of 1,257,891 for our analysis. Queries were collected from 12:00 AM (Norwegian time) on February 6, 2001 for 24 hours until 12:00 AM February 7, 2001. In the FAST data log structure, the entries are given in the order they arrive. FAST assigns a new user ID to every new user. It is possible to identify new sessions through a user ID and each query contains three fields:

1) Identification: anonymous code assigned by the search engine server to a user;
2) Time of day: in hours, minutes, and seconds;
3) Query: user terms as entered.

We selected a sample of 963 sessions from the entire dataset using Poisson sampling (Ozmutlu, Spink, and Ozmutlu Citation2002). It is necessary to select entire sessions, since the aim of the study is to perform automatic topic identification within sessions. The 963 sessions consisted of 10,007 queries. The sample size was not kept very large, since evaluation of the performance of the algorithm would require a human expert to go over all the queries. Poisson sampling provides a basis for sampling from large-scale data logs, such as the FAST search query log, while preserving the characteristics of the main dataset (Ozmutlu et al. Citation2002). Poisson sampling algorithms select sample points from a certain dataset by skipping a random number of observations that are distributed according to a Poisson process versus systematic sampling algorithms skipping a constant number of observations. The Poisson sampling process is a useful random sampling process, which includes properties such as unbiased sampling, proportional sampling, comparability of heterogeneous Poisson sampling arrivals, and flexibility on the stochastic arrival process from which the sample is selected. Due to the well-established property of stochastic science, the Poisson arrival see time averages (PASTA) property, Poisson sampling can be applied to any kind of stochastic arrival process (Wolff Citation1982), which is the most important property for the applicability of Poisson sampling on the web search query data logs, since there is no information on the type of stochastic arrival process of the web query sessions (Ozmutlu et al. Citation2002).

The second search query log used in this study comes from the Excite search engine (http://www.excite.com) located in the USA. The data was collected on May 4, 2001 and consists of 1.7 million for our analysis. We selected a sample of 3592 sessions (10,256 queries) using Poisson sampling.

We use the following concepts in the study (Spink et al. Citation2001):

(1) Query: a set of one or more search terms; it may include advanced search features such as logical operators and modifiers.
(2) Session: the entire set of queries by the same user over time. A session could be as short as one query or contain many unique and repeat queries.

Notation

The notation used in this study is below, and is similar to the notation used in previous studies of new topic identification (Ozmutlu et al. Citation2004a, Citation2006; Ozmutlu and Cavdur Citation2005a, Citation2005b).

Topic shift: A change from one topic to another between queries within a single-user session

Topic continuation: Staying on the same topic from one query to another within a single-user session

N _shift: Number of queries labeled as topic shifts by genetic algorithms
N _contin: Number of queries labeled as topic continuation by genetic algorithms
N _{true shift}: Number of queries labeled as topic shifts by manual examination of human expert
N _{true contin}: Number of queries labeled as topic continuation by manual examination of human expert
N _{shift & correct}: Number of queries labeled as topic shifts by genetic algorithms and by manual examination of human expert
N _{contin & correct}: Number of queries labeled as topic continuation by genetic algorithms and by manual examination of human expert

Type A error: This type of error occurs in situations where queries on same topics are considered as separate topic groups.

Type B error: This type of error occurs in situations where queries on different topics are grouped together into a single topic group.

Some useful formulation related to the above notation is as follows:

Precision (P) and Recall (R) are used in this study to measure the performance of new topic identification using genetic algorithms. Interpreted in terms of topic shifts, as in Ozmutlu et al. (Citation2004a, Citation2006) and Ozmutlu and Cavdur (Citation2005a, Citation2005b), precision (P _shift) is the correctly estimated number of shifts by genetic algorithms among all the shifts marked by genetic algorithms (Equation (Equation5)), and recall (R _shift) is the correctly estimated number of shifts by genetic algorithms among all the shifts marked by the human expert (Equation (Equation6)). On the other hand, interpreted in terms of topic continuations, precision (P _contin) is the correctly estimated number of continuations by genetic algorithms among all the continuations marked by genetic algorithms (Equation (Equation7)) and recall (R _contin) is the correctly estimated number of continuations by genetic algorithms among all the continuations marked by the human expert (Equation (Equation8)). The fifth measure, F _{β_shift} combines P _shift and R _shift to provide a single parameter to compare different results and to be consistent with previous studies. β is varied in this study between 1.3 and 1.7 to determine its best level in terms of performance measures. The sixth measure, fitness function F _{β_contin} (Equation (Equation10)), combines P _contin and R _contin into a single value. The formulations of these measures are as follows:

Proposed Methodology

In this section, we present the methodology based on genetic algorithms to perform automatic new topic identification based on statistical characteristics of search queries, such as search patterns and time-interval of queries. The initial steps of the methodology, i.e., evaluation by human expert, separating the data into two sets, identification of search pattern, and time interval of each query in the dataset, can be found in previous articles by Ozmutlu et al. (Citation2004a, Citation2006), and Ozmutlu and Cavdur (Citation2005a,b). However they are also briefly explained in this article for readers' convenience.

Evaluation by human expert: The actual topic shifts and continuations in the 10,000 query FAST and Excite datasets are identified and marked by a human expert.

Separating the data into two sets: The data is separated to two approximately equal-sized sections. The first section of the datasets is used to apply genetic algorithms to match topic shifts and continuations with query categories and the second section is used to test the results obtained with the first section of the dataset, i.e., test the performance of genetic algorithms as a new topic identification methodology. The two halves of the datasets are approximate halves and might not be equal in size (as seen in the FAST dataset) to keep the entirety of user sessions. The Excite dataset is divided into two equal portions as a coincidence. Sizes of the datasets are given in Table .

TABLE 1 Size of the Datasets Used in the Study

Download CSV Display Table

Identification of search pattern and time-interval of each query in the dataset: Each query in the transaction logs is categorized in terms of its search pattern and time-interval. Ozmutlu (Citation2006) showed that time-interval and search pattern of queries have a statistically significant effect on topic shifts. The classification of the search patterns is based on terms of the consecutive queries within a session. The time-interval is the difference of the arrival times of two consecutive queries. The categories of time-intervals are determined with respect to the length of the difference of the arrival times of two consecutive queries. The categorization of time-interval and search pattern remains similar to those of Ozmutlu and Cavdur (Citation2005a,b) and Ozmutlu et al. (Citation2004a, Citation2006).

The search patterns are automatically identified by a computer program. The logic for the automatic search pattern identification can be summarized in Figure (Ozmutlu et al. Citation2004a, Citation2006; Ozmutlu and Cavdur Citation2005a,b). We use seven categories of search patterns in this study which are as follows: Ozmutlu et al. (Citation2004a, Citation2006) and Ozmutlu and Cavdur (Citation2005a,b). We also provide examples for each search pattern:

FIGURE 1 Search pattern identification algorithm.

Unique (New): the second query has no common term compared to the first query:

Next Page (Browsing): the second query requests another set of results on the first query:

Generalization: all of the terms of second query are also included in the first query but the first query has some additional terms:

Specialization: all of the terms of the first query are also included in the second query but the second query has some additional terms:

Reformulation: some of the terms of the second query are also included in the first query but the first query has some other terms that are not included in the second query. This means that the user has added and deleted some terms of the first query:

Relevance feedback: the second query has zero terms (empty) and it is generated by the system when the user selects the choice of “related pages”:

Others: If the second query does not fit any of the above categories, it is labeled as other. Any nonempty query listed after an initial empty query, such as relevance feedback belongs to the “others” category. Note that if this pattern observed on an intermediate query in a session, this property does not hold:

The categories of time-intervals are determined with respect to the length of the difference of the arrival times of two consecutive queries and are similar to those used in Ozmutlu et al. (Citation2004a, Citation2006) and Ozmutlu and Cavdur (Citation2005a,b). We use seven categories of time-intervals for a query: 0–5 minutes, 5–10 minutes, 10–15 minutes, 15–20 minutes, 20–25 minutes, 25–30 minutes, 30+ minutes.

Applying Genetic Algorithms for automatic new topic identification of search engine transaction logs: Recall that the first section of the datasets is used to apply genetic algorithms to match topic shifts and continuations with query categories. After the query categories are matched with topic shifts and continuation, the queries in the second portion of the dataset are tagged as topic shift or continuation according to the results obtained in the first portion of the dataset.

To apply such an approach, the genetic algorithm framework is modeled to solve the new topic identification problem, and match query categories with the binary information of topic shifts and continuations. There are two statistical characteristics of the transaction log queries—time-interval and search pattern—each with seven levels. We propose to use a combination of these two statistical characteristics, which yields 49 categories of queries, and any query in the transaction log falls into one of these categories. For example: Time Interval_1&Search Pattern_1; Time Interval_1&Search Pattern_2;…..; Time Interval 7&Search Pattern_7. Considering these categories, we modeled a genetic algorithm framework with 49 chromosomes. Each chromosome is labeled as 1 (topic shift exists) or 0 (no topic shifts). The gene structure for the problem is in Figure .

FIGURE 2 Structure of the chromosomes.

MATLAB is used to model and run the genetic algorithm structure used in the study. Some of the parameters are kept at the default values of MATLAB, unless there is necessary reason to change them. Other parameters are varied to determine the best set of parameters for the study. The parameters of the genetic algorithm structure used in the study are as follows:

– Population type: bit string since the problem involves a binary output, i.e., topic shift or continuation. The chromosomes take the value of 0 or 1 based on whether there is a topic shift or continuation involved with the query.
– Population size: 50
– Fitness scaling: the current genes are presented in a rank-scaling fashion in the new generations, MATLAB default.
– Selection rule: Stochastic uniform, MATLAB default.
– Crossover rule: Scattered. MATLAB default.

We believe that the mutation and crossover fractions and the “ß” value in the fitness function are important parameters in the performance of automatic new topic identification, and there are no specific values assigned to these parameters in literature. Therefore, we preferred to make an elaborate experimental design, where these parameters are varied systematically. These parameters and their values that are used in the experimental design are as follows:

– Mutation fraction: In literature, we have observed values between %0.5 and %10 for this parameter. Considering that there are 49 chromosomes in each gene; we have chosen the values of (%2.1 pertaining to 1 chromosome out of 49, %4.2 pertaining to 2 chromosomes out of 49, %6.3 pertaining to 3 chromosomes out of 49.
– Crossover fraction: In literature, we have observed values between %50 and %95 for this parameter. Consequently, we have chosen four values for crossover fraction between these levels; 0.6, 0.7, 0.8, and 0.9.
– β value: Several values for “ß” have been used in information science literature for the performance measure of F _ß. In our previous studies, we have used the values of 1.2, 1.3, and 1.5 to provide the best results for new topic identification. In this study, we choose to vary “ß” between 1.3 and 1.7, and use the values of 1.3, 1.4, 1.5, 1.6, 1.7 for “ß.”

A full fractional experimental design can be used to determine the effect of these parameters on the performance of new topic identification. A full fractional factorial design entails 60 run points (3∗4∗5). Sixty run points are replicated for both the Excite and the FAST datasets. The genetic algorithm applications were always terminated based on the change in the values of the fitness function (amount of change in the fitness function was around 0,000001 from one iteration to the next).

After the query categories are assigned with either topics or continuations as a result of applying genetic algorithms on the first portion of the dataset, the second portion of the datasets are tagged with the topic continuation or shift base don the results of genetic algorithms. For example, if in the first portion of the dataset a query of the category Time Interval_1&Search Pattern_1 is marked as a continuation. Whenever we meet a query of the same category in the second portion of the dataset, it is also marked as a continuation. Similarly, if in the first portion of the dataset a query of the category Time Interval_1&Search Pattern_5 is marked as a shift, whenever we meet a query of the same category in the second portion of the dataset, it is also marked as a shift. The results of the tests on the second half of the Excite 2001 and FAST datasets are in Tables and , respectively.

TABLE 2 Experimental Design Results for the Excite 2001 Dataset

Download CSV Display Table

TABLE 3 Experimental Design Results for the FAST Dataset

Download CSV Display Table

After the runs are completed, analysis of variance is performed to test whether the factors considered in the experimental design have a statistically significant effect on the performance measures of new topic identification. The performance measures used for ANOVA are N _shift, N _contin, Type A error and Type B error. Other performance measures such as P, R, and F _ß are functions of N _shift, N _contin, Type A error, and Type B error. Two factor interactions are also considered in the study which makes a total of six factors. The hypotheses for testing the effect of the design factors and two factor interactions are in the following equation:

If H ₀ cannot be rejected, it is concluded that a factor's effect on the performance measure is not statistically significant. If H ₀ is rejected, it is concluded that a factor's effect on the performance measure is statistically significant. Then, the factor combinations providing the best fitness function value can be determined and used for new topic identification.

RESULTS AND DISCUSSION

At first glance, it is beneficial to recognize the true characteristics of the transaction logs as the human expert analyzed them. When the human expert evaluated the 10,256 Excite query dataset, 6001 topic continuations and 663 topic shifts were found. Eliminating the last query of each session leaves 6664 queries to be included in the analysis. There are 2879 topic continuations/391 topic shifts in the first half, and 3122 topic continuations/272 topic shifts in the second half of the dataset. The structure of the dataset in terms of the number of sessions and the number of queries included in the analysis can be seen in Table .

TABLE 4 Topic Shifts and Continuations in the Excite and FAST Datasets as Evaluated by Human Experts

Download CSV Display Table

When the human expert evaluated the FAST dataset, 8348 topic continuations and 696 topic shifts were found. Eliminating the last query of each session leaves 9044 queries to be included in the analysis. In the first half of the dataset (4997 queries), there were 437 user sessions. Thus 4560 queries of the first half of the dataset are used for computing transition probabilities. Out of 4560 queries, there are 4174 topic continuations and 386 topic shifts. In the second half of the dataset, there were 5010 queries and 526 user sessions. Eliminating the last query of each session leaves 4484 queries to be included in the analysis. Out of 4484 queries, 4174 were topic continuations, whereas 310 were topic shifts.

After investigating the results of genetic algorithms application in Tables and , we observe that the genetic algorithm yields similar results across different values of the design factors crossover fraction and design fraction. The value of the F _ß(shift) performance measure primarily changes based on the value of the ß parameter. In both datasets, the runs yielding the best values for F _ß(shift) were runs 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, and 60, where ß equals 1.7. The result of these runs is shown in Table . For smaller values of ß, such as 1.3 and 1.4, Type A errors seem to decrease and Type B errors seem to increase, and the number of queries marked as shifts also seem to decrease. This result is due to the formulation of F _ß, which puts more emphasis on Type B errors than Type A errors. The importance of Type B errors is a subjective issue, and an extensive study should be conducted to develop different performance measures than F _ß; but this is an area of further research.

TABLE 5 The Results of the Best Run for Genetic Algorithm on the Excite and FAST Datasets

Download CSV Display Table

For the Excite dataset, we observe that genetic algorithm application marked 2941 queries as topic continuation, whereas the human expert identified 3122 queries as topic continuation. Similarly, the genetic algorithm approach marked 453 queries as topic shifts, whereas the human expert identified 272 queries as topic shifts. 237 out of 272 topic shifts are identified correctly, yielding an R _shift value of 0.8713, and 2906 out of 2133 topic continuations are identified correctly, yielding an R _contin value of 0.93. These results demonstrate that the topic shifts were estimated fairly correctly and topic continuations were estimated almost entirely correct by genetic algorithms. On the other hand, the genetic algorithm application yielded 453 topic shifts, when actually there are 272 topic shifts, giving a value of 0.52 for P _shift. Although this result means that the genetic algorithm approach overestimates the number of topic shifts, it still yields the best results for P _shift, compared to previous studies with other datasets. Similarly, the value F _ß(shift) is the highest ever obtained for any dataset, 0.744. Recalling that the research objective of this study was overcoming the problems of the previous studies, we can conclude that genetic algorithms might have the potential to improve the performance of new topic identification in terms of shift-based measures, which were fairly problematic in previous studies. In terms of topic continuations P _contin was 0.99, i.e., almost all the topic continuations marked by the genetic algorithms approach were correct. For the FAST dataset, we observe the values of 0.37 for P _shift, 0.98 for R _shift, 0.87 for R _contin, and 0.99 for P _contin. The overestimation of topic shifts is still a problem with the FAST dataset; continuation-based measures, however, provided satisfactory results.

As mentioned in the methodology section, analysis of variance is performed after the runs are completed. The analysis of variance yields the results listed in Tables and for the Excite and FAST datasets, respectively.

TABLE 6 ANOVA for the Excite 2001 Dataset, Analyzing the Effect of Statistical Characteristics of Queries on New Topic Identification Performance

Download CSV Display Table

TABLE 7 ANOVA for the FAST Dataset, Analyzing the Effect of Statistical Characteristics of Queries on New Topic Identification Performance

Download CSV Display Table

The critical F value used in this study is F _{0, 05, 1, 53} ≈ ∼4.02. ANOVA confirms the results that we have reached by initially observing the results in Tables and . The ß factor in the fitness function of the genetic algorithm has a statistically significant effect on the performance measures of new topic identification. Increasing values of ß, increases N _shift and Type A errors, but decreases N _contin and Type B error. The other factors do not have an effect on the dependent variables and on F _ß(shift), the fitness function. The performance of new topic identification with genetic algorithms is directly affected from the ß value, since F _ß(shift) is the fitness function. Genetic algorithms might have provided different or better results with alternative fitness functions. As mentioned in the previous paragraphs, it is an important line of research to develop alternative performance measures for automatic new topic identification specifically, and information science generally. Then, new topic identification with genetic algorithms should be performed with alternative fitness functions. This is an area of further research.

CONCLUSION AND FURTHER RESEARCH

New topic identification and session identification are important problems in information science. This study proposes to apply genetic algorithms to automatically define topic boundaries in a user session by using statistical characteristics of queries, such as time-intervals and query reformulation patterns. Transaction logs from the FAST and Excite search engines are used. An elaborate experimental design scheme was used to determine the best set of parameters of the genetic algorithm to optimize the performance measures of new topic identification. As a result of genetic algorithms, the performance measures yielded fairly satisfactory results, especially for the Excite dataset. The overestimation of topic shifts still exists for the FAST dataset, but stays at moderate values for the Excite dataset. Genetic algorithms have performed very well for continuation-based measures on both datasets. Therefore, we conclude that genetic algorithms are fairly successful in automatic identification of topic shifts and continuations in search engine data logs, and have a promising application potential in information science. Future work includes solving the problem of overestimation of topic shifts, improving the other performance measures, and integrating genetic algorithms as the automatic new topic identification tool in information retrieval and search engine algorithms.

This research has been funded by TUBITAK, Turkey, and is a National Young Researchers Career Development Project 2005, fund number 105M320, “Application of Web Mining and Industrial Engineering Techniques in the Design of New Generation Intelligent Information Retrieval Systems.”

REFERENCES

Beeferman , D. , and A. Berger . 2000 . Agglomerative clustering of a search engine query log . In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , Boston , MA , pp. 407 – 416 .
Google Scholar
Cooley , R. , B. Mobasher , and J. Srivastava . 1999 . Data preparation for mining world wide web browsing patterns . Knowledge and Information Systems 1 : 5 – 32 .
Google Scholar
Goker , A. , and D. He . 2000 . Analysing intranet logs to determine session boundaries for user-oriented learning . In Proceedings of AH2000: The International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems , Trento , Italy , pp. 319 – 322 .
Google Scholar
Gremett , P. 2006 . Utilizing a user's context to improve search results . Journal of the American Society for Information Science and Technology 57 ( 6 ): 808 – 812 .
Google Scholar
He , D. , A. Goker , and D. J. Harper . 2002 . Combining evidence for automatic web session identification . Information Processing and Management 38 : 727 – 742 .
Google Scholar
Huang , X. , F. Peng , A. An , and D. Schuurmans . 2004 . Dynamic web log session identification with statistical language models . Journal of the American Society of Information Science and Technology 55 : 1290 – 1303 .
Google Scholar
Huang , X. , Q. Yao , and A. An . 2006 . Applying language modeling to session identification from database trace logs . Knowledge and Information Systems 10 ( 4 ): 473 – 504 .
Google Scholar
Jansen , B. J. , A. Spink , C. Blakely , and S. Koshman . 2007 . Defining a session on web search engines . Journal of the American Society for Information Science and Technology 58 ( 6 ): 862 – 871 .
Web of Science ®Google Scholar
Liu , X. , W. B. Croft , P. Oh , and D. Hart . 2004. Automatic recognition of reading levels from user queries. In Proceedings of the 27th ACM ACM International Conference on Research and Development in Information Retrieval, Sheffield, UK, pp. 548–549.
Google Scholar
Metzler , D. , and W. B. Croft . 2005 . Analysis of statistical question classification for fact-based questions . Information Retrieval 8 : 481 – 504 .
Google Scholar
Muresan , G. , and D. J. Harper . 2004 . Topic modeling for mediated access to very large document collections . Journal of the American Society for Information Science and Technology 55 : 892 – 910 .
Web of Science ®Google Scholar
Murray , G. C. , J. Lin , and A. Chowdhury . 2006 . Identification of user sessions with hierarchical agglomerative clustering . In Proceedings of ASIST 2006: Annual Meeting of the American Society for Information Sciences and Technology , Austin , TX .
Google Scholar
Ozmutlu , S. 2006 . Automatic new topic identification using multiple linear regression . Information Processing and Management 42 ( 4 ): 934 – 950 .
Google Scholar
Ozmutlu , H. C. , and F. Cavdur . 2005a . Application of automatic topic identification on Excite web search engine data logs . Information Processing and Management 41 : 1243 – 1262 .
Google Scholar
Ozmutlu , S. , and F. Cavdur . 2005b . Neural network applications for automatic new topic identification . Online Information Review 29 : 35 – 53 .
Google Scholar
Ozmutlu , H. C. , F. Cavdur , and S. Ozmutlu . 2006 . Automatic new topic identification in search engine datalogs . Internet Research 16 : 323 – 338 .
Google Scholar
Ozmutlu , H. C. , F. Cavdur , S. Ozmutlu , and A. Spink . 2004a . Neural network applications for automatic new topic identification on Excite web search engine datalogs . In Proceedings of ASIST 2004, Annual Meeting of the American Society for Information Science and Technology , Providence , RI , pp. 310 – 316 .
Google Scholar
Ozmutlu , S. , H. C. Ozmutlu , and B. Buyuk . 2007 . Using conditional probabilities for automatic new topic identification . Online Information Review 31 ( 4 ): 491 – 515 .
Google Scholar
Ozmutlu , S. , H. C. Ozmutlu , and A. Spink . 2003 . Multitasking web searching and implications for design . In Proceedings of ASIST 2003, Annual Meeting of the American Society for Information Science and Technology , Long Beach , CA , pp. 416 – 421 .
Google Scholar
Ozmutlu , S. , H. C. Ozmutlu , and A. Spink . 2004b . A day in the life of web searching: An exploratory study . Information Processing and Management 40 : 319 – 345 .
Google Scholar
Ozmutlu , S. , A. Spink , and H. C. Ozmutlu . 2002 . Analysis of large data logs: An application of Poisson sampling on excite web queries . Information Processing and Management 38 : 473 – 490 .
Google Scholar
Pu , H. T. , S.-L. Chuang , and C. Yang . 2002 . Subject categorization of query terms for exploring web users' search interests . Journal of the American Society for Information Science and Technology 53 : 617 – 630 .
Google Scholar
Shafer , G. 1976 . A Mathematical Theory of Evidence . Princeton , NJ : Princeton University Press .
Google Scholar
Shen , X. , S. Dumais , and E. Horvitz . 2005 . Analysis of topic dynamics in web search . In Proceedings of World Wide Web Conference 2005 , Chiba , Japan .
Google Scholar
Silverstein , C. , M. Henzinger , H. Marais , and M. Moricz . 1999 . Analysis of a very large web search engine query log . In ACM SIGIR Forum , Vol. 33 , pp. 6 – 12 .
Google Scholar
Spink , A. , J. Bateman , and B. J. Jansen . 1999 . Searching heterogeneous collections on the web: A survey of excite users . Internet Research: Electronic Networking Applications and Policy 9 : 117 – 128 .
Web of Science ®Google Scholar
Spink , A. , B. J. Jansen , D. Wolfram , and T. Saracevic . 2002a . From e-sex to e-commerce: Web search changes . IEEE Computer 35 : 133 – 135 .
Google Scholar
Spink , A. , H. C. Ozmutlu , and S. Ozmutlu . 2002b . Multitasking information seeking and searching processes . Journal of the American Society for Information Science and Technology 53 : 639 – 652 .
Web of Science ®Google Scholar
Spink , A. , D. Wolfram , B. J. Jansen , and T. Saracevic . 2001 . Searching the web: The public and their queries . Journal of the American Society for Information Science and Technology 53 : 226 – 234 .
Google Scholar
Wen , J. R. , J. Y. Nie , and H. J. Zhang . 2002 . Query clustering using user logs . ACM Transactions on Information Systems 20 : 59 – 81 .
Google Scholar
Wolff , R. W. 1982 . Poisson arrivals see time averages . Operations Research 30 : 223 – 231 .
Web of Science ®Google Scholar
Zhu , T. , R. Greiner , and G. Haubl . 2003 . Learning a model of a web user's interests . In Proceedings of the 9th International Conference on User Modeling , Johnstown , PA .
Google Scholar

Reprints and Corporate Permissions

Please note: Selecting permissions does not provide access to the full text of the article, please see our help page How do I view content?

To request a reprint or corporate permissions for this article, please click on the relevant link below:

Order Reprints Request Corporate Permissions

Academic Permissions

Please note: Selecting permissions does not provide access to the full text of the article, please see our help page How do I view content?

Obtain permissions instantly via Rightslink by clicking on the button below:

Request Academic Permissions

If you are unable to obtain permissions via Rightslink, please complete and submit this Permissions form. For more information, please visit our Permissions help page.

DEFINING TOPIC BOUNDARIES IN SEARCH ENGINE TRANSACTION LOGS USING GENETIC ALGORITHMS

Abstract

RELATED STUDIES