
Case Study: Guess Who

Code related to this article can be found here.

Implementing an online Guess Who app capable of listing millions of popular people is a large-data problem.
We also need to be able to add more people to the system over time, so it doesn't become outdated, and it needs to be fast and scalable enough for many people to use the app at the same time.

There are countless solutions to this task; the one I will be covering uses relational databases.
Relational tables are a proven way of storing large amounts of interlinking data, and over many years they have been refined to be highly efficient and scalable.


Data Structure

We need to work out how to break down all the linking information that exists in this problem and turn it into linear data with restricted fields.
First we can think of breaking the problem down into rules (questions) and people.

tblQuestions

Field   Type        Description
prompt  String(45)  The actual question

tblPeople

Field  Type        Description
name   String(25)  The name of the person

It is important to remember that we are only storing facts about people, never what isn't a fact (e.g. "they don't have brown hair").

tblMatch

Field     Type  Description
person    uint  Index reference of the referred person
question  uint  Index reference of the referred question
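
To make the linking concrete, here is a hypothetical handful of rows expressed as JavaScript objects; the ids, names, and prompts are all invented for illustration:

let tblQuestions = [
  { id: 0, prompt: "Do they have brown hair?" },
  { id: 1, prompt: "Do they wear glasses?" },
];

let tblPeople = [
  { id: 0, name: "Alice Example" },
  { id: 1, name: "Bob Example" },
];

// Each tuple stores one positive fact: the question is true of the person.
let tblMatch = [
  { person: 0, question: 0 }, // Alice has brown hair
  { person: 1, question: 0 }, // Bob has brown hair
  { person: 1, question: 1 }, // Bob wears glasses
];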

Interfacing

Now that we know how to store our data, we need to think about how we can pick a person out of this database as the result, as well as which questions to prompt the user with to find out who their person is.

If we break out into a little thought experiment and presume that for every 10,000 people in the world one is in our system: the world population is roughly $$ 7 \times 10^9 $$, and we'll say on average every person has at least 30 rules linked to them. With two 8-byte uint fields per tuple, the size of our matching table would be:

\[ \frac{ 7 \times 10^{9} }{ 10^{4} } \times 30 \times 2 \times 8 \,\text{B} \approx 0.34\,\text{GB} \]

So we can see that we can't keep a working copy of possible matches for each game: even if each question halved the number of possible people, holding hundreds of megabytes per active game in RAM would be impractical once many games are running at the same time.
We need to calculate everything on each request and then dump all of the working data.

So we need the client to remember which questions they have asked, and each time they answer a question they send that result, as well as all previous results, as part of a single request.
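
As a sketch, that request might just carry a map of question ids to the answers given so far; the field names here are an assumption rather than a fixed API:

// Hypothetical stateless request: the rules map holds every answer so far,
// so the server needs no per-game session state.
let request = {
  rules: {
    3:  false, // the user answered "no" to question 3
    12: true,  // the user answered "yes" to question 12
  },
};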


Matching People

To select the most appropriate question we need to know which people satisfy the current rule set, so we can work out which question will best divide the options available. Thus, matching people against the currently known rules isn't just an endgame state, but also a recurring feature.

A way of doing this is to loop through every single tuple in the matching table and remove any matches that include people who don't follow the rules.
To save on read cycles we can do this in one pass of the table.
We read the table tuple by tuple, checking whether the current tuple relates to any of the known rules; as we go we add people to a list along with the questions that apply to them.

If we find a tuple that links to a question the user marked as wrong, we can remove that person from our cache and then add them to a black-list so we don't re-add them.

let people = {};           // person id -> list of linked question ids
let blackList = new Set(); // people already ruled out

scan: for (let tuple of tblMatch){
  if (blackList.has(tuple.person)){
    continue scan;
  }

  // The user answered "no" to this question, so this person cannot match.
  if (tuple.question in rules && rules[tuple.question] === false){
    delete people[tuple.person];
    blackList.add(tuple.person);
    continue scan;
  }

  (people[tuple.person] = people[tuple.person] || []).push(tuple.question);
}

However, this does nothing for any rules that are true. A true rule only needs to be linked once per person, but we have to have scanned the entire table before we can tell whether a given person is linked to it at all, and the list we just built (people) is exactly that scan.
Now we can run through it and remove anyone who is missing a rule whose value was true.

outer: for (let person in people){
  for (let rule in rules){
    // Every rule answered "yes" must be linked to the person.
    if (rules[rule] === true && !people[person].includes(Number(rule))){
      delete people[person];
      continue outer;
    }
  }
}
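
Putting the two passes together, here is a self-contained sketch; the function name matchPeople and the exact data shapes are my assumptions, not code from the article:

// Returns a map of surviving person ids to their linked question ids.
function matchPeople(tblMatch, rules){
  let people = {};
  let blackList = new Set();

  // Pass 1: collect question links, dropping anyone hit by a "no".
  for (let tuple of tblMatch){
    if (blackList.has(tuple.person)) continue;

    if (tuple.question in rules && rules[tuple.question] === false){
      delete people[tuple.person];
      blackList.add(tuple.person);
      continue;
    }

    (people[tuple.person] = people[tuple.person] || []).push(tuple.question);
  }

  // Pass 2: drop anyone missing a rule that was answered "yes".
  outer: for (let person in people){
    for (let rule in rules){
      if (rules[rule] === true && !people[person].includes(Number(rule))){
        delete people[person];
        continue outer;
      }
    }
  }

  return people;
}

// e.g. matchPeople([{ person: 0, question: 1 }], { 1: true }) -> { "0": [1] }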


Finding Appropriate Questions

Now that we have a list of question links for the people that satisfy all the current rules, we can choose an appropriate question.

The best possible question to ask is one that evenly divides the remaining options, because no matter the answer a considerable number of possibilities will be removed.

To do this we need to loop through our results and count how many times each question occurs.

let count = {}; // question id -> number of remaining people it applies to

for (let person in people){
  for (let rule of people[person]){
    count[rule] = (count[rule] || 0) + 1;
  }
}

Now we need to find the dividing factor for each question; this will be $$ \lvert \text{have not} - \text{have} \rvert $$, and the closer it is to zero, the more even the split.

let num = Object.keys(people).length; // number of people still possible
let score = {};                       // question id -> |no - yes|

for (let opt in count){
  let yes = count[opt];       // people the question applies to
  let no  = num - count[opt]; // people it does not apply to

  score[opt] = Math.abs( no - yes );
}

Now we know what the next best question is, because we can just pick the question with the lowest score and ask that.
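
A small sketch of that final pick, assuming the score map built above (bestQuestion is my name for the helper, not the article's):

// Returns the question id whose split is closest to even (lowest score).
function bestQuestion(score){
  return Object.keys(score).reduce(
    (best, q) => (score[q] < score[best] ? q : best)
  );
}

// e.g. bestQuestion({ 2: 5, 7: 1, 9: 3 }) -> "7"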

It is important to remember that this system has no mechanism to stop it from asking repeated questions.


Summary

It is important to remember that for the first couple of questions the system will still need to hold a large percentage of the table in RAM for processing.
If every question halved the number of possible results, then the fraction still held in RAM would be $$ 0.5^n $$

So after three questions of our previous example the RAM used would be $$ 0.5^3 \times 0.34\,\text{GB} \approx 42\,\text{MB} $$

It also means that, by the same rule, finding one specific person would require

\[ \begin{align*} \frac{7 \times 10^9}{10^4} \times 0.5^n &= 1 \\ 0.5^n &= \frac{1}{7 \times 10^5} \\ 2^n &= 7 \times 10^5 \\ n \log(2) &= \log(7 \times 10^5) \\ n &= \frac{\log(7) + 5}{\log(2)} \\ n &\approx 20 \end{align*} \]

Around twenty questions would be extremely boring as a player, and it would also require a lot of admin work to keep that many people in the DBMS. Thus, realistically the system would be a lot smaller in scale.
However, it is still a good idea to cache the first couple of questions as a tree so that they don't need to be recalculated for each user; each request then starts from a candidate set that the cached questions have already cut down dramatically.
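
A sketch of what that cache might look like, assuming each node stores the precomputed question for that point in the game plus one branch per answer; the shape is an assumption, not the article's design:

// The first few levels are computed once and shared by every game;
// null branches fall back to the per-request calculation above.
let questionTree = {
  question: 12,    // hypothetical best opening question
  yes: {
    question: 7,   // best follow-up after a "yes"
    yes: null,
    no: null,
  },
  no: {
    question: 31,  // best follow-up after a "no"
    yes: null,
    no: null,
  },
};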