Data, whether structured or unstructured, is the lifeblood of business and at the heart – or should be at the heart – of every decision your company makes. The term “big data” has become commonplace not only in the tech industry but also in everyday speech. As with many tech terms, definitions of big data vary, but the common denominator is that it is data available in high volumes and delivered at high velocity, which makes it difficult to analyze with traditional tools. To put that into a real-world context, think about the large volumes of real-time data produced by everything from your car to an offshore oil rig.
Before you use big data to drive business outcomes you need to understand where it comes from and how to recognize and capture it so you can build an efficient data model. The more organization or structure you can give to your data, the easier it is to record and analyze it. For that reason, structured data – data born to be analyzed – is the backbone of big data.
What is Structured Data?
Data falls into two categories: structured and unstructured. Looking at this yin and yang of data, the names are somewhat self-explanatory. Unstructured data includes content such as video, email, images, podcasts, social media posts and PDFs. In short, unstructured data has no internal identifier to let search functions recognize it. The commonly cited estimate is that it also makes up a whopping 80 percent of all data generated.
Structured data exists in a format created to be captured, stored, organized and analyzed. It’s neatly organized for easy access. If structured data were an office, it would contain many file cabinets that are efficiently set up, clearly labeled and easy to access. For that reason, structured data brings inherent benefits when dealing with high volumes of information.
Structured data vs unstructured data isn’t a zero-sum game, however. Structured data also complements unstructured data and can help you find insights in your unstructured data sets. For example, a structured data record can hold unstructured data within it. Consider a form that offers questions with a list of answers available in a drop-down menu but also allows users to add freeform comments. The answers generated from the pick list are structured data, but the comments field yields unstructured data.
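The form example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`satisfaction`, `comments`) and the pick-list values are hypothetical, not taken from any particular product.

```python
from dataclasses import dataclass

# Hypothetical pick-list values offered by the drop-down menu.
ALLOWED_RATINGS = ("poor", "fair", "good", "excellent")


@dataclass
class SurveyResponse:
    satisfaction: str  # structured: must be one of ALLOWED_RATINGS
    comments: str      # unstructured: any freeform text

    def __post_init__(self):
        # Reject answers that are not on the pick list.
        if self.satisfaction not in ALLOWED_RATINGS:
            raise ValueError(f"unknown rating: {self.satisfaction!r}")


responses = [
    SurveyResponse("good", "Checkout was quick, but shipping took a week."),
    SurveyResponse("poor", "The site kept logging me out."),
    SurveyResponse("good", ""),
]

# The structured field can be counted directly.
counts = {rating: 0 for rating in ALLOWED_RATINGS}
for response in responses:
    counts[response.satisfaction] += 1

print(counts)
```

The pick-list answers can be tallied as they are, while the comments would need further processing, such as text analysis, before they could be summarized in the same way.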
Most data is a hybrid to some degree. For that reason, you may also see the term semi-structured data, which is a loosely defined subset of structured data. This format includes the capability to add tags, keywords and metadata to data types that were once considered unstructured data. Images, email and word processing files that carry such descriptive elements are examples of semi-structured data. Markup languages such as XML are often used to manage semi-structured data.
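As a rough illustration of how XML can wrap descriptive elements around otherwise unstructured content, the sketch below parses a small record with Python's standard library. The element names, file name and values are invented for the example.

```python
import xml.etree.ElementTree as ET

# A hypothetical photo record: the image itself is unstructured,
# but the tags and metadata around it can be searched.
DOCUMENT = """
<photo file="rig-inspection-042.jpg">
  <taken>2017-03-14</taken>
  <keywords>
    <keyword>oil rig</keyword>
    <keyword>inspection</keyword>
  </keywords>
  <caption>Crew checking the valve assembly on deck 3.</caption>
</photo>
"""

root = ET.fromstring(DOCUMENT.strip())

file_name = root.get("file")          # attribute on the root element
taken = root.findtext("taken")        # text of a child element
keywords = [element.text for element in root.iter("keyword")]

print(file_name, taken, keywords)
```

The caption stays freeform text, but the date and keywords can now be queried like any other structured field.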
Structured data, unlike unstructured data, tends to be a more natural fit for the data mining processes of traditional big data applications.
Where Does Structured Data Come From?
The two primary places where structured data is generated are databases and search algorithms.
The term structured data is often associated with relational database management systems, which date back to 1970 and a mathematical theory developed by Edgar Codd at IBM’s San Jose Research Laboratory. Codd’s model organizes data into one or more tables (also known as relations) of columns and rows. A few years later, fellow IBMers Donald D. Chamberlin and Raymond Boyce designed the structured query language (SQL), which is used with the vast majority of relational databases.
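To make the table-and-query idea concrete, here is a minimal sketch using SQLite through Python's built-in `sqlite3` module. The `customers` table, its columns and its rows are made up for illustration.

```python
import sqlite3

# An in-memory database, discarded when the connection closes.
conn = sqlite3.connect(":memory:")

# A relation (table) with named, typed columns.
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)

# Each tuple becomes one row.
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ada", "London"), ("Grace", "New York"), ("Alan", "London")],
)

# SQL describes which rows and columns are wanted, not how to find them.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name",
    ("London",),
).fetchall()
london_customers = [name for (name,) in rows]

print(london_customers)
conn.close()
```

Because every row follows the same column layout, the query can filter and sort without inspecting the content of each record individually.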
In addition to relational databases, spreadsheets are also common sources of structured data. Whether it’s a complex SQL database or an Excel spreadsheet, because structured data depends on you creating a data model, you must plan for how you will capture, store and access data. For example, will you be storing numeric, monetary or alphabetic data?
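One way to picture that planning step is to declare a type for each column before loading any data. The sketch below assumes a small spreadsheet exported as CSV; the `OrderLine` class, the column names and the figures are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class OrderLine:
    product: str         # alphabetic
    quantity: int        # numeric
    unit_price: Decimal  # monetary: Decimal avoids binary floating-point rounding

    @property
    def total(self) -> Decimal:
        return self.quantity * self.unit_price


# Stand-in for a spreadsheet exported as CSV.
SHEET = """product,quantity,unit_price
widget,3,19.99
gasket,10,0.10
"""

# Convert each text cell to the type the data model declares for it.
lines = [
    OrderLine(row["product"], int(row["quantity"]), Decimal(row["unit_price"]))
    for row in csv.DictReader(io.StringIO(SHEET))
]

grand_total = sum((line.total for line in lines), Decimal("0"))
print(grand_total)
```

Deciding up front that prices are `Decimal` rather than floating-point values is the kind of data-model choice that is much harder to change once data has been captured.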
While relational databases and SQL have a long history, more recently, structured data also plays a major role in internet searches and offers benefits to organic search. According to Google’s Introduction to Structured Data, “When information is highly structured and predictable, search engines can more easily organize and display it in creative ways.” Google says that by using structured data markup you make it possible for your content to appear in rich results and Knowledge Graph cards.
To create a structured data standard for web-based applications, email messages and other forms of internet content, Google, Microsoft, Yahoo and Yandex founded Schema.org, an open community. Schema.org says its vocabulary can be used with encodings such as RDFa (an HTML5 extension used in both the head and body sections of the HTML page), Microdata (an open HTML specification used to include structured data in HTML content) and JSON-LD (JavaScript Object Notation for Linked Data).
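For a sense of what JSON-LD markup looks like, the sketch below builds a small Schema.org `Article` description and wraps it in the script tag used to embed JSON-LD in a page. The headline, author and date are placeholder values, and a real page would typically need more properties than are shown here.

```python
import json

# Placeholder values; "Article", "Person" and the property names
# come from the Schema.org vocabulary.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "What Is Structured Data?",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "datePublished": "2017-03-14",
}

json_ld = json.dumps(article, indent=2)

# JSON-LD is embedded in HTML inside a script element of this type.
script_tag = f'<script type="application/ld+json">\n{json_ld}\n</script>'
print(script_tag)
```

The visible page content stays as it is; the script block simply restates the key facts in a predictable form that a search engine can read.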
Sources of Structured Data
Unlike unstructured data, which grows organically – and uncontrollably – and comes from a wide range of sources, structured data is created in two ways. The first is machine-generated data, produced by devices or sensors without human intervention. According to IDC, by 2025 80 billion devices will be connected to the internet, versus approximately 11 billion devices connected now. That means a lot more devices producing a lot more data.
Examples of machine-generated data include the following:
Data from sensors such as GPS units, RFID tags and medical devices; data from network and web logs; and retail and ecommerce data, to name only a few.
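Web logs are a convenient example of machine-generated records with a predictable layout. The sketch below assumes a server writing the widely used Common Log Format and turns one line into named fields; the log line itself is invented, and real logs may use a different or extended format.

```python
import re

# Pattern for one line in Common Log Format:
# host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# An invented log line using a documentation-reserved IP address.
line = (
    '192.0.2.10 - - [14/Mar/2017:09:15:02 +0000] '
    '"GET /products/42 HTTP/1.1" 200 5123'
)

match = LOG_PATTERN.match(line)
if match is None:
    raise ValueError("line is not in the expected format")

record = match.groupdict()
# Convert numeric fields so they can be filtered and aggregated.
record["status"] = int(record["status"])
record["size"] = 0 if record["size"] == "-" else int(record["size"])

print(record)
```

Because every line follows the same layout, the same pattern can be applied to millions of lines without any human involvement.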
The second is human-generated data, created by people to feed databases and spreadsheets. This type of structured data comes from humans who interact with computers and other devices. Examples include (non-freeform) data generated through interaction with online forms, kiosks, games and so on.