Why do different representations of the same data points impact what we're able to know and do so remarkably?
TL;DR:
Different data representations (data structures) greatly impact what information and range of operations and capabilities are available to us because of differences in their suitability for the problem at hand.
One data structure may be useful for a problem but completely useless for another.
Our Opening Question
We have two diagrams in this piece (Fig. 1 and Fig. 2) dealing with the same data points - seven people in a family. Our question is simple:
can we tell who the children of Person3 are just by looking at Fig. 1? How about when we look at Fig. 2?
One word you would hear quite often among engineers and tech experts is "data". You may also hear people talk about "information", but that's just people talking about an arrangement of data within a context that makes sense or that tells us something meaningful - so data is still at the core of information.
In fact, data is so important in Computer Science circles that it's got entire academic endeavours dedicated to studying it and how it's represented. So, let's dive into it!
What is data?
Within the context of computing, data is any symbol or figure that can be stored and processed by a computer - it doesn't have to convey any meaning. For example, "0" is a piece of data. It has no meaning without additional context.
Now that we know what data is, how does a computer know how to deal with it? Surely, it has to be represented in a way that the computer understands for the computer to be able to process and store it. This representation of data is what we refer to as a Data Structure.
Data Structures
A data structure is a representation of data or a way of organising data within a computing system to enable efficient storage, retrieval, and processing of such data. The value of structuring data cannot be overemphasised. It literally could determine what operations are possible on that data and how efficiently those operations can be performed - if they can even be performed at all. Generally, how you plan to manipulate or work with your data should influence how you choose to structure or organise the data.
To get an idea of what this really means, let's assume you need the names of all the students in your school, what would you typically ask for?
That's right - a list of the names of students!
Whereas, if you need to know the path that leads from your home to the nearest coffee shop, you'd likely prefer a geographical map showing directions other than a mere list of the places that are on your way to the coffee shop. That's because while a map and a list are both representations of data, they're not applicable to the same situations and they afford you different capabilities.
Within the context of computing, we'd call the students' list an Array which is one of the most popular and widely used data structures in computing. Geographical maps are just an application of the Graph data structure.
With the examples we've considered, let's talk about two very popular and widely used data structures in software engineering: Arrays and Trees. The two structures will be dealing with the seven people mentioned at the beginning of this piece and we'd see why the two structures tell us different stories even though they deal with the same data points.
Please note that there are so many structures in use out there. We have only chosen to discuss these two to keep our sample data structures simple and short.
Array
An array is a homogeneous collection of data. It is homogeneous because it contains data of the same type. For instance, the list of student names in our previous example contains strings - a string is a sequence of characters. "Daniel" is a string, for example.
Another example would be the collection of the heights of participants in a contest. Now, to easily compute stuff like the average height, what type of data structure should we choose to represent these heights?
You got it - an array of numbers (not their word equivalents)!
The size (n) of an array is the number of items in the array. Array items occupy positions starting from 0. Naturally, since we start counting array elements from position 0, the last element would be on the (n-1)th position.
So, for Fig. 1:
the size (n) of the array is 7
Person1 is at position 0
Person2 is at position 1
Person7 is at position 6.
So, what range of operations are possible with an array? We can, among other operations, insert items into and remove items from the array, search for an item by scanning or reading through the array, and sort the items in a given order.
But how about answering our opening question: can we tell who the children of Person3 are from Fig. 1? Can we even tell the ancestral relationship between Person1 and Person5?
No, we can't!
Fig. 1 tells us nothing about how the people in the array are related. Clearly, an array is not a suitable structure (data representation) for this problem. Let's take a look at our next data structure (Tree) and see if it allows us to know about the relationship between the individuals in the family.
Tree
A tree is a hierarchical structure that represents items (nodes) related to each other via links (edges). The edges specify the relationship between the nodes.
A little note on a few tree terms:
The node at the start of a tree is called the root node. A node is a parent node if an edge proceeds directly from it to another node which is its child node. The children of a particular parent node are sibling nodes. Any node that has no child is a leaf node.
Notice how some tree terms are those we use to describe human family relationships. In fact, you could correctly guess what ancestor and descendant nodes are 😉
In Fig. 2:
Person1 is the root of the tree. It has three children (Person2, Person3, and Person4)
Person2, Person3, and Person4 are siblings because they share the same parent (Person1)
Person6 is a descendant of Person1
Person4, Person5, Person6, and Person7 are leaf nodes because they have no children
The family tree is a very popular concept that describes the ancestry of individuals. It shows how each individual in the tree relates to their parent up to their common ancestor who is seen as the root. The relationship or link in the family tree is based on "ancestry" - we say an edge within the family tree shows an ancestry line.
Our Opening Question Answered!
In addition to the operations made possible by a tree, like inserting and removing items, we can easily tell, by looking at Fig. 2, that the children of Person3 are Person6 and Person7. We can also tell the ancestral relationship between Person1 and Person5. These are tasks we couldn't perform using an array (Fig. 1).
By now, you probably could imagine other real-life situations that can be represented by a tree.
Conclusion
So, why does our choice of data representation (data structure) greatly impact what information and range of operations and capabilities are open to us? The answer is suitability for the problem at hand.
One data structure may be useful for a problem but completely useless for another.
We saw an example of this when we considered Fig. 1 and Fig. 2 and their ability to tell us about family relationships. The two structures (array and tree) dealt with the same data points - seven people in a family. But only Fig. 2 was able to tell us about the children of Person3 and about other family relationships between the individuals. That's because trees are more suitable for such relationship situations than arrays.
An array is, of course, a very useful data structure - and a very popular one too. For example, we saw that it can be used to store numbers for which we need to find an average.
So, in choosing how to represent your data, it's very important to consider what information you're trying to pass across and what operations you want to make possible on that data.
Use data structures that are suitable for the problem you're trying to solve.
Commentaires