Behind the scenes of Social Networking : An introduction to Trees, Graphs and Databases

For the past few years, social networking sites have seriously become part of our daily lives. We are twitting, poking, commenting and sharing a lot more than we did in the last decade. Today, successful social sites like Facebook, Twitter and LinkedIn are each hosting hundreds of millions of users, probably reaching a total of almost one billion users around the world.


This is too much data. But even for a mediocre programmer, this is nothing new. A billion record database including the user profile details, with horizontal scaling can easily be designed. We can also easily set someone a friend of someone else over a simple table which holds the relationships.

We call this a hierarchy or a tree, with branches and leafs. It is also possible to go deeper in the tree and analyse it as we like, which is again very easy to achieve in todays RDBMSs. This can be done by Oracle’s CONNECT BY statement and SQL Server’s HierarchyID type and CTE features for recursion. Or even with a custom developed nested set model. Below graph represents the scenario of A knows B and C, B knows D and E

These types of trees can be useful in representing organizational hierarchies, bill of materials, multi level network memberships and etc. But this is not the real life scenario for social networking. In real life D may know A and E may know C, giving us cyclic structures in our tree.

In this case another model is needed, and here we can introduce adjacency list model or adjacency matrix (often called sociomatrix).

And remember, we are talking about hundreds of millions of users interrelated with each other, giving us billions of relationships and possibly of different types. And what is more important we have data at hand which may not be bound to a schema and changes frequently. A typical RDBMS is schema bound meaning that the properties of a node is predefined and nothing else can be stored. But social data is considered to be schema free data.  

So what should be done, now that our tree has become a directed graph like the following?

 

This is where the concept of Graph Databases comes to rescue.

Either you will implement the models yourself and reinvent the wheel or you will use an already invented wheel and work on other functionalities.

Graph databases are implemented based on graph theory and technically they are good old linked lists stored in disk. Persisted linked lists together with caching can give great performance results in very large social graphs.

Primary elements of a graph database are nodes, properties, and edges.

Nodes are entities such as people and businesses. Properties are information that relate to these nodes or edges. For example, if “Cem Guler” was a node, then it could have properties like “engineer”, “male”, “a person  whose name starts with C” and maybe “handsome”. Since graphs are not bound to any predefined schema, we can add as many properties as we like without being limited by a schema.

Edges are the relationship lines between 2 nodes. Actually this is what we are interested in, the most. In the example of : A knows B knows D, A is father of E and E is a friend of C; The relationships “knows”, “father” and “friend” are each an edge with a property of “relationship type”. Here we can limit our “relationship type” data domain to { “knows”, “father”, “friend” }, but evidently there will be more relationship types like “mother”,  “aunt”, “friend”, “owner”, “colleague” etc. and may be some more user defined types. See that we are about to lose control over our data domain and if we had relational database, this might have destroyed our database normal form or cause complex problems. Full text feature of graph databases allow us to add as many relationship types as we like.

Another benefit of using a graph database, is that they provide ready to use traversal algorithms for queries like “finding the shortest path between 2 nodes” or even “finding all possible paths between 2 nodes” and therefore giving us the answer to the question “how am I related to this person?” or “how can I reach this person over my existing contacts?”

Here are some of the most popular graph databases that you may wish to check,

Please also note that all of them are implemented in Java, and make use of REST based JSON. For open source projects they are free.

For the Microsoft enthusiasts, please note that Microsoft is also working on a new project called “Trinity”.

In the upcoming posts, I will try to emphasize more on technical information for social site implementations.

To have a better understanding of the differences between relational databases and graph databases, please take a look at a paper comparing MySQL and Neo4j prepared by the Department of Computer and Information Science of University of Mississippi.

Cheers,

Cem GULER

Advertisements


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s