29

My mother is taking an online course to become a librarian of sorts. The course covers Boolean searches, so that students can search databases efficiently. She got a question that went something like this:

The search "x OR y" will result in 105 000 hits, while a search for only x will result in 80 000 hits, and a search for only y will get 35 000 hits. Why does the search "x OR y" give 105 000 hits, when the individual searches combined give 115 000 hits?

This sounded strange to me, so I tested it myself, using the words bacon and sandwich.

  • Only bacon yielded 179 000 000 results
  • Only sandwich yielded 312 000 000 results
  • bacon OR sandwich gave 491 000 000 results

But for me it adds up: 179 000 000 (bacon) + 312 000 000 (sandwich) = 491 000 000 (bacon OR sandwich)

Why could an OR query result in fewer hits than both individual queries combined?

200_success
sch

5 Answers

95

The counting principle that applies here is inclusion-exclusion.

$$ \left|X \cup Y\right| = \left|X\right| + \left|Y\right| - \left|X \cap Y \right|$$

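To make the numbers work out, $\left|X \cap Y \right|$ must be $80\,000 + 35\,000 - 105\,000 = 10\,000$: that many documents match both x and y, and they are counted twice when the individual hit counts are added.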
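As a quick check of that arithmetic, here is a minimal Python sketch. The variable names are illustrative; the hit counts are the ones given in the question.

```python
# Hit counts taken from the question.
hits_x = 80_000
hits_y = 35_000
hits_x_or_y = 105_000

# Inclusion-exclusion rearranged: |X ∩ Y| = |X| + |Y| - |X ∪ Y|
hits_x_and_y = hits_x + hits_y - hits_x_or_y

print(hits_x_and_y)  # 10000
```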

A Venn diagram may be more convincing to someone who may be intimidated by the notation.

[Figure: Venn diagram]

200_success
63

Hint: The search x AND y will result in 10 000 hits.

Yuval Filmus
13

Document 1: The cat is on the table
Document 2: My cat is black
Document 3: The dog is under the table
Document 4: What's the name of your cat?
Document 5: This is a black and white photo

Search for cat: returned documents are 1,2,4 (3 documents returned)
Search for black: returned documents are ...
Search for cat OR black: returned documents are ...

:-D :-D
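For anyone who would rather let a program do the counting, the exercise above could be sketched in Python roughly as follows. The `search` helper and its whole-word, case-insensitive matching rule are assumptions made for illustration, not how any particular search engine works. The output is deliberately not shown, so the exercise stays an exercise.

```python
import re

# The five example documents from the answer, keyed by document number.
documents = {
    1: "The cat is on the table",
    2: "My cat is black",
    3: "The dog is under the table",
    4: "What's the name of your cat?",
    5: "This is a black and white photo",
}

def search(word):
    """Return the set of document numbers containing `word` as a whole word.

    Hypothetical helper: matching is case-insensitive and word-based.
    """
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return {doc_id for doc_id, text in documents.items() if pattern.search(text)}

cat = search("cat")
black = search("black")
either = cat | black  # cat OR black: set union, each document counted once
both = cat & black    # cat AND black: set intersection

# Inclusion-exclusion: |cat OR black| = |cat| + |black| - |cat AND black|
assert len(either) == len(cat) + len(black) - len(both)

print(sorted(cat), sorted(black), sorted(either), sorted(both))
```

Because the results are sets of document numbers, a document that matches both words appears only once in the union, which is exactly why the OR count is smaller than the sum.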

Vor
3

In simple words:

Search for X gives you n answers.
Search for Y gives you m answers.
Search for X AND Y gives you p answers.

In searching for X OR Y, a document counts as a hit as soon as the search finds either X or Y in it. So if a document contains both an X and a Y, it is still counted only once, not twice. Therefore your search for X OR Y will give you n + m - p answers.

It's important to note that the set of documents found will be the same whether you do two searches or just one. It's just that when you add up the counts of the two searches, the documents containing both X and Y are counted twice.

frank
3

Imagine you have only one document. Call it Document #1, with this content:

X Y

Now imagine you have a search function that can give you all the documents based on one keyword:

search("X") => 1
search("Y") => 1

Notice that the number of documents in both cases is 1. Now if you have a search function that gives you the number of documents that matched one or more of the keywords supplied:

search("X", "Y") => 1

When you add the number of documents containing X to the number of documents containing Y, you count the same document twice. In your case, this happened 10 000 times, as pointed out above :)
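The one-document example can be made runnable with a small Python sketch. The `search` function here is a hypothetical stand-in for the search function imagined above: it returns the number of documents that contain at least one of the given keywords.

```python
# A collection holding only Document #1, whose content is "X Y".
documents = ["X Y"]

def search(*keywords):
    """Count the documents that contain at least one of the keywords."""
    return sum(
        1
        for text in documents
        if any(keyword in text.split() for keyword in keywords)
    )

print(search("X"))       # 1
print(search("Y"))       # 1
print(search("X", "Y"))  # 1, not 2: the single document is counted once
```

Adding the first two results gives 2, even though the collection contains only one document.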

Arnab Datta