Blog

For the last 8 years I’ve worked in the storage industry, with most of that time spent on a class of storage referred to as Distributed Object Stores, or more generally, object stores. I’ve worked on three commercial object stores: Centera, Atmos, and now Hitachi Content Platform (HCP). In that time, I’ve witnessed numerous misunderstandings about this particular class of storage technology, not just in how these systems function but, more fundamentally, in where, when, and why such a system would be employed. In a series of articles, I’ll try to shed some light on this topic.

It’s always challenging to get the right level of technical detail with a diverse audience. Generally there will be a bias toward simplicity as there's an inverse relationship between how technical an article is and the number of possible readers. Therefore, the typical article will strive to hit a moderate technical level, hopefully avoiding the arcane. However, if there's sufficient interest in a particular topic, I'll go back later and add greater technical rigor to those specific areas of interest.

The first three articles will be on the following topics:

  1. The difference between structured and unstructured data types;
  2. A basic description of an object; and
  3. An overview article describing distributed object stores. 

These first three articles will be accompanied by videos. 

Video

Here’s a link to a video version of this blog entry:  Structured Data Video

Why do we care?

In storage lately there’s been a lot of discussion about unstructured data. The first question I ask whenever I hear a new term is: why do I care?

The answer turns out to be: because the growth models for the two data types are so dramatically different.

The graph below, from IDC, depicts storage usage over time, breaking out structured and unstructured data. In this case structured is labeled block and unstructured is labeled file.

What’s striking is that the unstructured data has a growth curve substantially steeper than that of structured data. If we look out to the 2014 projections we see that, combined, the forecast is to ship 80EB of storage. An exabyte (EB) is 10^18 bytes, or a billion billion bytes. So it’s a big number.

It also shows that the lion’s share – almost 70 of this 80EB – is expected to come from unstructured data. That is close to 90% of all storage shipped going to store unstructured data. That’s a big enough difference to qualify as a substantial shift in the storage industry. And it doesn’t really matter which side of the equation you’re on – whether a producer or consumer of storage – change of this magnitude warrants a look under the covers to see what it is all about. And that’s what we’ll be doing in this article.
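As a quick check on the figures quoted above, the arithmetic can be sketched in a few lines of Python. The 70EB value is the rounded "almost 70" from the text, and the exabyte is taken in its decimal sense; both are assumptions made for illustration only.

```python
# Figures quoted in the text from the IDC forecast for 2014.
EXABYTE = 10**18          # bytes in one exabyte (decimal definition)

total_eb = 80             # combined structured + unstructured forecast
unstructured_eb = 70      # "almost 70" of the 80EB, rounded up here


def share(part: float, whole: float) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


unstructured_share = share(unstructured_eb, total_eb)
total_bytes = total_eb * EXABYTE

print(f"Unstructured share: {unstructured_share:.1f}%")
print(f"Total forecast: {total_bytes:.1e} bytes")
```

The share works out to a little under 90%, which is where the "close to 90%" figure comes from.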

Unstructured Data Definition

While there isn’t a canonical definition of the two terms, generally the term structured data (SD) is applied to databases (DB) and unstructured data (UD) applies to everything else. 

The terms themselves aren’t terribly meaningful as all computer data is structured. Anyone who has written computer programs realizes early on that digital computers don’t do well with anything other than perfect structure. Misplace so much as a single semicolon in a program and you’ll be spending time in the debugger. 

Writing out data is similar. The data you write doesn’t have to make any sense – it can be random bytes, but the way the data is written out to the storage medium is very orderly. Otherwise, it would be impossible to read it back. 

Structured Data

I suspect the term SD came from the name for a common language used to access DBs, called Structured Query Language, or SQL.

SQL provides a well-defined way for applications to manage data in a DB. 

The little snippet of SQL code here illustrates how an application could retrieve all rows from a table called Book where the Price is greater than 100 and request that the result be sorted in ascending order by title. 
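For readers who want to try a query of this kind, here is a minimal sketch using Python’s built-in sqlite3 module. The table name Book and the Price and Title columns come from the description above; the rows themselves are invented purely for illustration.

```python
import sqlite3

# In-memory database holding a Book table; the rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Book (Title TEXT, Price REAL)")
conn.executemany(
    "INSERT INTO Book (Title, Price) VALUES (?, ?)",
    [
        ("Storage Systems", 120.00),
        ("Databases in Depth", 250.00),
        ("Pocket SQL", 15.00),
        ("Archival Methods", 101.50),
    ],
)

# Retrieve all rows where Price is greater than 100, sorted ascending by title.
QUERY = "SELECT * FROM Book WHERE Price > 100 ORDER BY Title ASC"
rows = conn.execute(QUERY).fetchall()

for title, price in rows:
    print(f"{title}: {price:.2f}")
```

Only the three books priced above 100 come back, in alphabetical order by title.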

Most people haven’t written DB code themselves, but they have used an office spreadsheet tool like Excel on Windows or Numbers on the Mac. These tools allow you to create simple DB tables. 

The example here shows customer records with the typical information — name, address, and so forth. And that’s it — you’ve created a DB table! While sophisticated database management systems (DBMS) allow you to do much more than you can do with a spreadsheet, the base concept is essentially the same: tables with rows and columns, often containing straight ASCII text as this example shows. 

Unstructured Data 

Earlier, we said that UD is simply the complement of SD; that is, it’s everything other than a DB. That’s a pretty broad class. To break that down further, UD can be divided into two subclasses: file and object. 

Files

File data is the more familiar, as computer users are accustomed to seeing this data in the Windows Explorer or Mac Finder screens, as in the image below.

This is an image of a file system on a disk that contains the files created by a digital camera. Each file is an image of type JPG and has associated system metadata (SMD). In this case, the SMD shown is:

  • The filename;
  • The date the file was last modified; 
  • The file size; and
  • The file type (JPEG image).

This view is quite familiar to computer users and is the principal mechanism for finding and using files on your computer.
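The same four pieces of SMD can also be read programmatically. The sketch below uses only Python’s standard library; the function name system_metadata, the dictionary keys, and the stand-in file are all assumptions made for illustration, not part of any particular product.

```python
import mimetypes
import os
import tempfile
from datetime import datetime


def system_metadata(path: str) -> dict:
    """Collect the four SMD fields listed above for one file."""
    info = os.stat(path)
    kind, _ = mimetypes.guess_type(path)
    return {
        "name": os.path.basename(path),
        "modified": datetime.fromtimestamp(info.st_mtime),
        "size_bytes": info.st_size,
        "type": kind or "unknown",
    }


# Demonstrate on a stand-in file; the bytes are placeholders, not a real image.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "DSCN0141.JPG")
    with open(path, "wb") as handle:
        handle.write(b"\x00" * 2048)
    smd = system_metadata(path)
    print(smd)
```

Note that everything here is supplied by the file system itself; nothing in it says what the picture actually shows.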

Objects

Objects are the evolution of the basic file types that we’ve had for over 50 years now. The next article and video, “What is an Object?”, talk about the difference between objects and files in detail. Here, it is sufficient to say that objects are files with an additional type of metadata, called custom metadata (CMD).

While SMD allows only a fairly pedestrian set of characteristics to be expressed (file name, size, and so forth), CMD allows for much richer data expression. It’s therefore no surprise that CMD was introduced around the time that rich data types, for example videos, pictures, and music files, were introduced to the general computer user. 

A simple example of CMD is when you use an application to import the pictures from your digital camera and are able to add any text that suits you, for example the names of the people in the photograph or where the picture was taken. The point of CMD is that, unlike SMD, CMD is not limited. A good software system will let you add any arbitrary text you want and associate it directly with the object (in this case, a photograph).

Why is Unstructured Data Growing So Rapidly?

A good question is why would one form of data, UD, grow so much more rapidly than the other? After all, both data types are required, as DBs perform essential functions. 

One reason is that the actual user content of a DB is typically text, as shown in the earlier spreadsheet example. For about 40 years, files were likewise most often composed of just text. But the world has changed. Now users want rich content, not just plain text.

Rich data types include things such as pictures, music, movies, and x-rays. Even basic office document types such as Word and PowerPoint are becoming increasingly rich media containers, where it’s now easy for a user to embed much more than just text.

                

While rich data types provide a far superior user experience over text alone, they do so at the expense of storage space. Rich media types are not just slightly larger than basic text; they can be orders of magnitude larger.

To get a sense of both the difference in user experience and the different storage capacity usage of these two data types, consider this simple example. 

Rich Data vs. Text

Let’s say you’re trying to decide which movie to go see. A traditional method of doing so would be to read a movie review. Here is a link to a movie review by the Boston Globe for the movie “Real Steel”. When I downloaded a copy of this review, it took up ~10KB of capacity.

The movie snippet below is from the full movie trailer available on the Internet Movie Database (IMDB) web site.

Of the two, which gave you a better sense of the movie? Clearly the trailer serves this purpose far better, which is why movie theaters show coming attractions in the form of trailers rather than written text on the screen. However, when I downloaded the trailer, rather than using up just 10KB of capacity, it used ~200MB. That’s an incredible difference, with 20,000 times more capacity required for the trailer than for the movie review!
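The 20,000x figure follows directly from the two sizes. A small sketch of the calculation, assuming decimal units (1KB = 1,000 bytes, 1MB = 1,000KB) and treating the approximate sizes above as exact:

```python
KB = 1000        # decimal kilobyte, in bytes
MB = 1000 * KB   # decimal megabyte, in bytes

review_bytes = 10 * KB     # the ~10KB text review
trailer_bytes = 200 * MB   # the ~200MB trailer

# How many copies of the review would fit in the space used by the trailer.
ratio = trailer_bytes // review_bytes
print(f"The trailer is {ratio:,}x the size of the review")
```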

Of course, we can’t extrapolate too much from a single example; that’s the job of industry analysts. But this does give a sense of just how great a difference there is between the storage required for rich data and that required for plain text, and it does give us a sense of why analysts forecast so much more storage dedicated to unstructured versus structured data going forward.

Takeaways

Here are three takeaways.

  1. Most prefer Rich Data over basic text;
  2. Rich data takes up WAY more space
    • Text movie review: ~10KB
    • Full HD Trailer: ~200MB
    • 20,000x greater storage capacity!
  3. Use of Rich Data is increasing at an increasing rate

Video

Here’s a link to the video: What is an Object?

Introduction

In the last article and video (Structured vs. Unstructured Data), I talked briefly about the differences between an object and a file. This article describes those differences in greater detail.  

First, a note on technical detail.  Both this article and the accompanying video were created with the goal that no specific technical knowledge should be required. I have therefore consciously sacrificed technical specificity for the sake of simplicity. To that end, this article doesn’t touch on topics such as objects versus classes of objects, object types, etc. For now, we deal just with objects in the simplest terms. 

Object = Abstract Term

The term object is an abstract term. By itself, it doesn’t really mean anything in particular. And this is on purpose, because we really don’t want to limit what an object can be. As soon as we name it to be something specific, we’ve limited it.

Objects in the Physical World

Let’s start with an example in the physical world. Let’s say I want to talk generally about a class of objects called motor vehicles.

If I use a specific term, such as Ford Thunderbird, I’ve limited the scope of what I can talk about. Specificity has the advantage of being quite easy to conceptualize, but has the disadvantage of constrained scope. In other words, by calling the car object by a specific name, that’s all it can be.


However, if I use a more abstract term, such as motor vehicles, I’ve instantly expanded the scope of what I can talk about. I’m no longer limited to just a single make and model of car. By using a more abstract term, I can talk about any type of car I want. In other words, I’ve already substantially expanded the set of things I can address simply by moving from the specific to the abstract.

On the plus side, abstraction is a very powerful thing because it doesn’t impose artificial limits on our thinking. On the other hand, abstraction is harder to wrap our heads around than is specificity. 

Worse, this difficulty only increases when we move from the physical world of objects such as cars to the logical world of computer software. We always have to deal with that added complexity when discussing concepts in the software domain. 

As we transition from the physical world to the logical world of computer systems, let’s narrow this abstract term object down to something specific to help understand what it is.

Digital Photo Example

Perhaps the computer object most familiar to the greatest number of people is the image that results from taking a picture with a digital camera. 

In common terms, we call this image a file. Digital cameras tend to label these files with names like DSCN0141.JPG that aren’t particularly useful to us as human users. Such naming conventions provide an easy way for the camera to automatically generate unique names, thus avoiding file name collisions. 
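A sequential counter is one simple way such names could be generated. The sketch below is only an illustration of the idea: the function name camera_names, the DSCN prefix default, and the four-digit width are assumptions modeled on the DSCN0141.JPG example, not a description of any real camera’s firmware.

```python
import itertools


def camera_names(prefix: str = "DSCN", start: int = 1, digits: int = 4):
    """Yield sequential file names such as DSCN0141.JPG."""
    for number in itertools.count(start):
        # Zero-pad the counter so every name has the same length.
        yield f"{prefix}{number:0{digits}d}.JPG"


# Take the first three names, starting from picture number 141.
names = list(itertools.islice(camera_names(start=141), 3))
print(names)
```

Because the counter only ever moves forward, no two pictures receive the same name, but the names say nothing about what the pictures contain.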


Camera Memory Card Contents

When we take pictures, the memory card on the camera fills up with a bunch of files with seemingly arbitrary names. Here’s an example of the contents of a camera memory card I have used.

You can see that, while functional, this isn’t terribly useful because I can’t easily discern which of these files represents a given picture.

Simple Object Store - iPhoto

To make the digital images useful, one of the first things we typically do is import these files into some type of photo editing software, such as iPhoto on the Mac. iPhoto is an example of an object store. It stores these files, called digital photographs, in a software system that allows us to do useful things by converting them into objects.  Here’s a screen shot of iPhoto.


One of the prime values of this object store software is that it makes it easy for us to recognize what we’re working with. We are no longer dealing with odd file names, such as we saw in the file system view, but instead can work with the object content itself — the pictures.



Converting a File into an Object

Now that we have our picture, let’s build a first-class object. We begin with the file itself, which is just the digital image. Next we have system metadata (SMD). Metadata just means data about data. For the picture we have typical SMD such as the file name, when the file was created, and when it was last modified. This type of SMD is pretty basic, and it’s familiar to most computer users because they see it in file system views such as Finder on the Mac or Windows Explorer on the PC.


The image below is a file system view we see when using Finder on the Mac. This shows the contents of the memory card from my digital camera. And what you see here is that we have seven files, each of which is a picture. We also see the typical SMD. There’s the file name, the date last modified, the file size, and the kind of file we’re dealing with.


This is the typical file system view that we’re used to seeing. It’s very useful for organizing files, particularly with small data sets like this where we can easily understand what we’re looking at. However, we want to convert these files to objects, so let’s go back to our object construction.



I want to start to make this a bit more user-friendly. Now that I’ve imported the file into an object store, in this case iPhoto, I’m going to customize this object to make it useful to me. 

I’ll begin by adding some custom metadata (CMD). CMD is metadata that’s provided by the user. 

The first thing I’m going to do with this object is give it a name that means something to me and not just some arbitrary name automatically generated by the camera. I’ll call the model Lisa Simpson. 

Now I’m going to add where the photo was taken: Tempe, AZ.

Next I want to add this object to a category, in this case, Family.

Finally, I want to do some advanced functions that iPhoto doesn’t presently support, but we’ll assume the object is going to a more sophisticated enterprise-class object store, so I’ll also say:

  • Don’t delete this object.
  • Allow this object to be shared. 

And there you have it. We’ve created a first-class object that contains the three components of all objects: the file itself, which is also called the data, the SMD, and the CMD. 
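The three components can be pictured as a simple data structure. This is a minimal sketch, not how iPhoto or any enterprise object store actually represents objects; the class name StoredObject, the field names, and the CMD keys are all hypothetical, while the CMD values are the ones added in the walkthrough above.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """An object: the file's data plus system and custom metadata."""
    data: bytes                               # the file itself
    smd: dict                                 # system metadata (SMD)
    cmd: dict = field(default_factory=dict)   # custom metadata (CMD)


# Placeholder bytes stand in for the real image content.
image_bytes = b"placeholder image bytes"

photo = StoredObject(
    data=image_bytes,
    smd={
        "file_name": "DSCN0141.JPG",
        "size_bytes": len(image_bytes),
        "kind": "JPEG image",
    },
)

# CMD supplied by the user, following the steps above.
photo.cmd["name"] = "Lisa Simpson"
photo.cmd["location"] = "Tempe, AZ"
photo.cmd["category"] = "Family"
photo.cmd["deletable"] = False   # "Don't delete this object."
photo.cmd["shareable"] = True    # "Allow this object to be shared."

print(photo.cmd)
```

The SMD dictionary has a fixed, predictable set of keys, whereas the CMD dictionary accepts whatever the user chooses to add.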

This ability to attach whatever arbitrary information I want to a photograph makes the system far more useful to me than it would be if I were looking at the same data on a file system, such as in the Mac Finder shown above.  

Even in this simple example where I’ve added very little in the way of CMD, you can see that the digital photographs are far more useful to me now that they’re in an object store (in this case, iPhoto).

What I’ve illustrated in this rather simple example, without going into deep technical detail, is just how important CMD is to rich data types such as digital pictures.

Without CMD, the value of my digital picture library would be substantially reduced. 

Evolution

I also hope you can see why it’s a natural evolution to go from files, which began in the middle of last century, to objects, which are becoming the predominant data type. You can view the video I did on “Structured vs. Unstructured Data” to get a sense of the magnitude of the shift we’re seeing. 

Value of Metadata over Time

In the photo example, I talked about the value that metadata, specifically CMD, brings to rich data types such as pictures. However, what may not be obvious at first glance is that the value of CMD can actually exceed the value of the data itself over time. 

Here’s an example of what I’m talking about.

Below is a screen shot of my iPhoto library, which holds over 4,000 photos to date. Some of these photos are more than ten years old.

Let’s say I’m looking for a particular photo that includes my kids. There’s no way that it’s at all efficient for me to search through 4,000 photos in the hope of finding the picture I want. Further, I couldn’t possibly do this by trying to guess whatever cryptic name the camera decided to assign to my picture.  I need CMD just to give me the basic functionality of doing a search in my photo library for those photos that contain my kids. So right off the bat, we can see that CMD is essential to making rich data useful.
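To make the point concrete, here is a small sketch of a search driven by CMD. The library contents, the "people" key, and the function name find_by_tag are invented for illustration; a real photo application would store and index this information in its own way.

```python
def find_by_tag(library, tag):
    """Return file names of photos whose CMD 'people' list contains tag."""
    return [
        photo["file_name"]
        for photo in library
        if tag in photo["cmd"].get("people", [])
    ]


# A tiny stand-in for a photo library; only some photos carry CMD.
library = [
    {"file_name": "DSCN0141.JPG", "cmd": {"people": ["kids"], "place": "Tempe, AZ"}},
    {"file_name": "DSCN0142.JPG", "cmd": {"people": ["friends"]}},
    {"file_name": "DSCN0143.JPG", "cmd": {}},
    {"file_name": "DSCN0144.JPG", "cmd": {"people": ["kids", "friends"]}},
]

matches = find_by_tag(library, "kids")
print(matches)
```

The photo with no CMD at all can never be returned by a search like this, no matter what it shows, which is exactly the problem with relying on camera-generated file names.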

However, earlier I asserted that the value of CMD is greater than the value of the data itself over time. What justifies this assertion?

Because software continually evolves. 

It used to be that iPhoto simply allowed users to search against CMD, which is useful. But I want the software to do these functions for me. In recent releases, iPhoto added the ability to automatically traverse your photo library and use facial recognition algorithms to create views of all the people in your photo library.


There were many, many releases of iPhoto over many years that never had this type of functionality. But the beauty of allowing users to enter CMD with their pictures is that as the software evolved and grew smarter, it could use the CMD to do more interesting things. Such a feature not only makes the user experience better, but it actually increases the usefulness of the data itself.

The photos I can’t find are of no use to me at all. The data itself has value only if I can easily find it. The CMD is what allows me to do sophisticated and interesting operations against the data.

This is what I mean by the “value over time.” I can’t predict right now what will be the next cool feature. But I can say that if your objects don’t have CMD associated with them, it’s likely that at least some advances in technology will pass you by. 

Rich Data Types

To keep this discussion simple I’ve focused on a single object data type: a digital photograph. However, an object can be any type of file, from Office file types such as word processing and presentation files, to music, pictures, movies, medical images, and objects from virtually every other vertical, such as Oil and Gas.

Further, to take full advantage of the richness of data as it evolves from the basic file types we saw last century to the rich file types that we see in this century, we require some type of software system to enable us to do things that go beyond looking at file names in a Finder window. That software system is generically referred to as an Object Store.

Recap

This article has provided a working definition of the term object, as used in the computer storage realm.

The accompanying video reinforces these points:

  • Objects are more powerful than files alone;
  • Objects are a natural evolution of the basic file types that we saw in the last century; and
  • Objects are not exotic.


© Robert Primmer 2013