Search Tool Data Analysis

By Susan Kennedy (BIT330, Fall 2008)

Questions and queries

Web search engines

Based on factors such as sleep, diet, and exercise, what are ways that a college-aged student can increase the consistency of energy levels on a daily basis? This research seeks to find what types of small changes can be made each day in order to increase productivity, efficiency, self-esteem, mood stability, and general quality of life. Although much information exists on this topic, many sources seem contradictory or inaccurate. This research seeks to produce consistent and reliable findings on how students aged 18-22 who find it hard to stay motivated and engaged throughout the entire day can improve their experience as students.

Query Search for Google, Windows Live, and Yahoo Web:

Increasing energy for college students + diet + exercise + sleep

Blog search engines

For any student interested in marketing, staying current with media and the world of entertainment is important. Knowing which celebrities are marketable to certain social groups, such as families, college students, and high school students, is important when putting together persuasive mass marketing campaigns. For a busy college student, regularly buying gossip magazines can be financially challenging as well as time consuming. For this reason, three blog search sites will be searched in order to answer the question: which American celebrities are currently held in (relatively) high regard by students ages 18-22?

Query search for Technorati, Google Blog Search, and Bloglines:

Celebrity-news

Data that I collected

Search engine overlap data

Web search     Live    Google    Yahoo Web
Live           15      15        15
Google                 20        15
Yahoo Web                        30
All            5

Blog search    Technorati    Google Blog    Bloglines
Technorati     40            0              0
Google Blog                  60             0
Bloglines                                   50
All            0

Search engine ranking overlap data

This table provides a measure of how many of Google's responses are reproduced by Yahoo.

GY           Yahoo
Google       5     10    20
5            0     0     1
10           0     0     0
20           0     0     1

This table provides a measure of how many of Yahoo's responses are reproduced by Google.

YG           Google
Yahoo        5     10    20
5            0     0     0
10           0     0     0
20           1     0     2

This table provides a measure of how many of Bloglines' responses are reproduced by Google Blog Search.

BG           Google
Bloglines    5     10    20
5            0     0     0
10           0     0     0
20           0     0     0

This table provides a measure of how many of Google Blog Search's responses are reproduced by Bloglines.

GB           Bloglines
GBlog        5     10    20
5            0     0     0
10           0     0     0
20           0     0     0

Results

Web search

Aggregate Statistics from Data Set One:

Precision             Live          Google        Yahoo Web
Max                   80            90            85
Min                   10            20            10
Mean                  approx. 43%   approx. 49%   approx. 52%
Median                42            57            52
Standard Deviation    21.54         19.53         21.21

Overlap               L/G           L/Y           G/Y           All
Max                   35            45            35            25
Min                   0             5             5             0
Mean                  approx. 19%   approx. 20%   approx. 21%   approx. 10%
Median                20            20            20            10
Standard Deviation    9.07          10.94         7.4           7.92

* Margin of error: +/- .23

Precision is a measure of how many retrieved documents can be considered relevant. The above statistics record the precision for three search engines, Google, Yahoo, and Live, based on the population of current BIT330 students. The mean represents the average percentage, although it may be distorted by outliers, which is why the median may be a more accurate point of comparison. The standard deviation indicates how closely the data clustered around the average. The overlap data is a measure of how many sites within the top 20 retrievals of one search engine also appear in the top 20 of the other search engines examined. The mean overlap shows which search engines tend to have the most overlap, which can be used with the precision results to determine which search engine may be most useful.
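The two measures described above can be sketched in Python as follows. This is only a minimal illustration, not the procedure used in BIT330: the function names, the site names, and the relevance judgments are all made up, and overlap is assumed to mean the share of a 20-result list that two engines have in common.

```python
def precision(relevance_judgments):
    """Percentage of retrieved results that were judged relevant."""
    if not relevance_judgments:
        return 0.0
    return 100.0 * sum(relevance_judgments) / len(relevance_judgments)


def overlap(results_a, results_b, depth=20):
    """Percentage of the top `depth` results that two engines share."""
    shared = set(results_a[:depth]) & set(results_b[:depth])
    return 100.0 * len(shared) / depth


# Hypothetical judgments for one engine's top 20 (True = relevant).
judgments = [True] * 9 + [False] * 11

# Hypothetical top-20 lists from two engines; three sites appear in both.
engine_a = [f"site{i}.example" for i in range(20)]
engine_b = [f"site{i}.example" for i in range(17, 37)]

print(precision(judgments))         # 45.0
print(overlap(engine_a, engine_b))  # 15.0
```

Under these assumptions, a precision of 45 means 9 of the 20 retrieved results were relevant, and an overlap of 15 means 3 of the 20 results were returned by both engines.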

Aggregate Statistics from Data Set Two:

Google to Yahoo   o(5,5)  o(10,5)  o(20,5)  o(5,10)  o(10,10)  o(20,10)  o(5,20)  o(10,20)  o(20,20)
Max               4       4        4        4        4         5         4        5         7
Min               0       0        0        0        0         0         0        0         0
Mean              1.1     1.35     1.65     1.29     2         2.06      1.65     2.47      3.71

Yahoo to Google   o(5,5)  o(10,5)  o(20,5)  o(5,10)  o(10,10)  o(20,10)  o(5,20)  o(10,20)  o(20,20)
Max               4       4        4        4        4         5         4        5         7
Min               0       0        0        0        0         0         0        0         0
Mean              1.1     1.176    1.65     1.47     1.94      2.47      1.88     2.65      3.76
  • Margin of error: +/- .2425

This data is a more detailed analysis of overlaps between Yahoo and Google. The first table (Google to Yahoo) counts how many of Google's ranked results are also found in Yahoo's top five, top ten, and top twenty. For example, column (5,5) in the first table of this section counts the top five returns for Google that are found anywhere within the top five of Yahoo's results. The next column, (10,5), counts how many of the top ten results for Google can be found anywhere in the top five for Yahoo. The second table works the same way, counting how many of Yahoo's ranked results can be found within the top five, top ten, and top twenty retrievals for Google.
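The column definitions above can be expressed as a short Python sketch. It assumes that o(a,b) counts the results in the first engine's top a that also appear in the second engine's top b, which is how the (5,5) and (10,5) examples read. The function names and the two result lists are hypothetical, not the class data.

```python
def ranked_overlap(source, target, a, b):
    """Count results in the top `a` of `source` that appear in the top `b` of `target`."""
    return len(set(source[:a]) & set(target[:b]))


def overlap_table(source, target, depths=(5, 10, 20)):
    """Build every o(a,b) cell for the given depths, keyed by (a, b)."""
    return {(a, b): ranked_overlap(source, target, a, b)
            for b in depths for a in depths}


# Hypothetical ranked lists of 20 results each.
google = [f"doc{i}" for i in range(20)]
yahoo = ["doc0", "doc12"] + [f"other{i}" for i in range(17)] + ["doc3"]

table = overlap_table(google, yahoo)
print(table[(5, 5)])    # 1  (only doc0 is in both top fives)
print(table[(20, 5)])   # 2  (doc0 and doc12)
print(table[(5, 20)])   # 2  (doc0 and doc3)
print(table[(20, 20)])  # 3
```

Swapping the two arguments gives the cells for the reverse table (Yahoo to Google).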

Blog search

Aggregate Statistics from Data Set One:

Precision             Technorati    GBlog         Bloglines
Max                   85            100           75
Min                   5             25            20
Mean                  approx. 53%   approx. 49%   approx. 45%
Median                30            45            52
Standard Deviation    20.62         21.56         13.96

Overlap               T/G           T/B           G/B           All
Max                   25            25            20            10
Min                   0             0             0             0
Mean                  approx. 4%    approx. 10%   approx. 7%    approx. 2%
Median                0             10            5             0
Standard Deviation    6.94          7.66          6.37          3.36

Precision is a measure of how many retrieved documents can be considered relevant. The above statistics record the precision for three blog search engines, Technorati, GBlog, and Bloglines. The mean represents the average percentage, although it may be distorted by outliers, which is why the median may be a more accurate point of comparison. The standard deviation indicates how closely the data clustered around the average. The overlap data is a measure of how many sites within the top 20 retrievals of one blog search engine also appear in the top 20 of the other blog search engines examined. The mean overlap shows which search engines tend to have the most overlap, which can be used with the precision results to determine which search engine may be most useful or whether more than one blog search engine should be utilized.
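To see why the median can be a steadier point of comparison than the mean, consider the small Python sketch below. The scores are invented for illustration and are not the class data. The use of the sample standard deviation (statistics.stdev) is also an assumption, since this report does not say which formula was used.

```python
import statistics


def summarize(scores):
    """Return the summary statistics reported in the tables above."""
    return {
        "max": max(scores),
        "min": min(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation (assumed)
    }


# Hypothetical precision scores from six students; 100 is an outlier.
scores = [40, 45, 50, 50, 55, 100]

summary = summarize(scores)
print(round(summary["mean"], 2))  # 56.67
print(summary["median"])          # 50.0
```

The single outlier pulls the mean well above the typical score while the median stays at 50, which is the effect described above.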

Aggregate Statistics from Data Set Two:

GoogleBlog to Bloglines   o(5,5)  o(10,5)  o(20,5)  o(5,10)  o(10,10)  o(20,10)  o(5,20)  o(10,20)  o(20,20)
Max                       1       2        3        2        2         4         2        3         4
Min                       0       0        0        0        0         0         0        0         0
Mean                      .29     .35      .47      .41      .47       .82       .71      .76       1.1

Bloglines to GoogleBlog   o(5,5)  o(10,5)  o(20,5)  o(5,10)  o(10,10)  o(20,10)  o(5,20)  o(10,20)  o(20,20)
Max                       1       2        2        2        2         3         3        4         4
Min                       0       0        0        0        0         0         0        0         0
Mean                      .29     .35      .59      .41      .53       .82       .53      .89       1.1

This data is a more detailed analysis of overlaps between Bloglines and GoogleBlog. The first table (GoogleBlog to Bloglines) counts how many of GoogleBlog's ranked results are also found in Bloglines' top five, top ten, and top twenty. For example, column (5,5) in the first table of this section counts the top five returns for GoogleBlog that are found anywhere within the top five of Bloglines' results. The next column, (10,5), counts how many of the top ten results for GoogleBlog can be found anywhere in the top five for Bloglines. The second table works the same way, counting how many of Bloglines' ranked results can be found within the top five, top ten, and top twenty retrievals for GoogleBlog.

Discussion

Web search

According to the population of current BIT330 students, Yahoo had the highest average precision at 52%. However, it should be noted that it also had a wider spread than Google, with a standard deviation of 21.21, meaning that the data was less concentrated and possibly more affected by outliers. Google had the highest median at 57 and also the smallest standard deviation at 19.53. In this case, the median is possibly the more useful score because extreme values, such as the min and max, can greatly distort the mean. Looking at the overlap data, the two search engines that overlap the most are Google and Yahoo, with an average of 21%. This pair also has the smallest standard deviation, 7.4, which reflects the consistency of the data. As these two engines also have higher average precision than Live, it makes sense that more of their results would overlap with each other than with Live. However, there isn't a large difference between the pairwise overlaps (approx. 19%, 20%, and 21%), especially with a margin of error of +/- .23, although it is apparent that the results are inconsistent across all three search engines, as only about 10% of results are shared by all three.

The data from the ranked overlaps reveals many inconsistencies in the results when the same query is typed into different search engines. Results that Google would consider most relevant (i.e. the first 5) overlap, on average, with 1.65 of the documents that Yahoo would consider less relevant (i.e. results 11-20). In the same way, results that Yahoo considers less relevant (i.e. documents 11-20) overlap with an average of 1.88 of Google's top five results. Both engines have a relatively high average overlap with the other in the 11-20 documents (3.71 and 3.76), implying that many sites that are considered less relevant are consistent across both engines. On the whole, there seem to be many consistencies between the two sets of data, making it impossible to say that one search engine is better at putting relevant documents in the top five retrieved documents.

From this data, it is apparent that the two engines share no objective system for determining which sites are the most relevant, suggesting that in order to be exposed to the widest range of relevant sites, both search engines may need to be examined. For this reason, it is strongly recommended to use multiple search engines and decide for yourself what is relevant. For example, a site that is the fifteenth retrieval for Yahoo, which you may have found useful, could be ranked first on Google, and you might never have discovered it if it were not for using more than one search engine. Likewise, it is apparent that putting a lot of thought into your query will greatly increase the usefulness of a search. The better and more specific the query, the more likely the results are to be useful.

I personally have learned that the differences between Google, Yahoo, and Live are small, although Yahoo and Google seem consistently preferable to Live. However, I was surprised that the same query can produce very inconsistent results across search engines, which is why, for a complete and effective search, several search engines should be utilized. In the future, I would like to compare the effectiveness of Google and Yahoo to that of more specialized search engines (such as ones that only search for cars). Likewise, I would like to look into how the results of this study would change if I had the opportunity to use a more effective and specific query than the one I had originally selected.

Blog search

Based on the population of BIT330 students, the blog search engine with the highest average precision was Technorati, with 53%. However, Bloglines had the highest median at 52 and the lowest spread at 13.96, suggesting that Bloglines may be more precise, as the median may be a more accurate representation of precision. As far as general overlap data, the two sites with the highest average overlap were Technorati and Bloglines, which overlapped 10% of the time. As Technorati and Bloglines were considered the more useful blog search engines, it is possible that these two should be used in combination. It should be noted that overlaps are much less common for blog searches than for web searches. Therefore, in order to gather as much information as possible through blogs, it is not as effective to use only one blog search engine.

For the second set of data, which measured the overlaps of GoogleBlog and Bloglines based on their rankings of top five, top ten, and top twenty, there were many parallels. Although a few data points were different, there were not enough dramatic differences to conclude that one blog search engine's results are consistently ranked lower or higher on the other blog search engine. For this reason, it is again apparent that there is no universal system for determining what is relevant and that each blog search engine will deliver a very different array of documents for the same query.

Based on this, it is recommended that the user of a blog search engine keep in mind that there will be few overlaps between blog search engines and that, in order to get the widest selection of results back, it is necessary to use multiple search engines. Additionally, keep in mind that one search engine, for a given query, may list something in 20th place that another ranks fifth. In order to subjectively rank results, the user cannot rely on one blog search engine to determine relevancy. Finally, blog queries are not the same as web search engine queries: the user will have to be very specific without being too complicated.

Blogs are typically written in the vernacular, and many of the search results will be too. For this reason, I would like to do more exploring on what kinds of keywords within blog searches will yield the highest rate of return. For example, searching "Britney Spears" via Technorati yielded very few relevant results, but searching "celebrities" did, which surprised me. I would also like to explore which blog search engines focus on the topics that I would find most useful. For example, are there blog search engines specifically for political blogs? I would also like to explore RSS feeds further in order to understand how to compare multiple blog search engines at once without having to spend hours every day.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License