Unstructured data represents 85% or more of corporate data. Textual unstructured data includes word processing, presentations, video and audio files, email, chat, and social media postings. Machine data includes sensor data, satellite imagery, digital microscopy, sonar explorations, and much more.
Thanks to massive data, many different file types, and high creation speeds, analyzing unstructured data is very challenging – and very worthwhile. That dizzying amount of unstructured data holds significant business value for those businesses that successfully tap it.
Unstructured Data as Big Data
Let’s start by defining unstructured data as big data. The storage industry considers the three Vs of data — volume, variety, and velocity – when defining data characteristics and trending. In unstructured big data, we’re looking at high values in all three.
- Volume: Massive volumes of unstructured data. The trend is for continued high growth, meaning that analytics platforms must scale along with data.
- Variety: Many different file types. The above table illustrates just some of the data types that fall under the big data/unstructured data flag. Few analytics platforms that were created for relational databases can handle, say, traffic sensors, digital microscopy, email, and search histories.
- Velocity: High-speed data creation. Humans and machines produce data quickly. This speed requires accelerated ingest and processing, and fast throughput to applications.
- And a 4th V – Value: Business intelligence. Without business value, big data is simply a lot of data. With business value, it becomes a rich mine of business intelligence. Spend resources on big data analytics to realize that value.
Big Data is often said to be comprised of “the 3 V’s.” That is: volume, velocity and variety. These factors in combination can be said to offer value, if data mined properly.
Steps to Mine Value from Unstructured Big Data
It is a relatively simple matter to extract actionable information from structured data. The structured schema of a relational database lends itself to extracting and analyzing records. Analyzing unstructured data is a very different story.
When companies want to analyze unstructured data, they need specialized tools to do it. If they have very different types of data – such as textual and non-textual they will probably need multiple tools to launch the analysis.
First: Decide on Business Objectives
Decide on your business objectives and the type of data you need to analyze. Analyzing sensor data for example is wildly different from textually analyzing email or social media, and analyzing email for compliance is a completely different objective then analyzing network traffic for technical support metrics.
Second: Choose the Right Analytics Tool for the Task
Choosing the right analytics tool for the right task is next. If the company only has a single data source to analyze, such as social media postings for marketing campaign metrics, then choose Web harvesting or social media analytics and sentiment analysis.
If an organization wants to mine information from a broader set of text-based data, choose tools that analyze across a variety of textual formats. There are several choices. Some software-only tools run on your own storage infrastructure, some run online, some run from their own storage hardware, and still others run on a Hadoop infrastructure.
Whatever tool you choose should clearly tabulate and visualize results for you, and reporting should work on computers, mobile devices, and browser-based clients.
- Application-specific analytics: When a single unstructured application contains rich value, look at analytics tools that work specifically for that application such as analysis tools for Salesforce applications.
- Text analysis: This is a large category containing data mining, textual analytics such as metadata or part-of-speech tagging, and natural language processing (NLP). Algorithms use document relevance-based analytics to search a variety of textual data types.
- Web harvesting: These tools search for relevant structured and unstructured data from the Web. Technology connects data with user-generated patterns and filters to harvest related data.
- Hadoop: Hadoop merits its own mention as the market leader for big data analysis infrastructure. Apache develops open source Hadoop, but it’s not a single product. It’s an ecosystem that spawns multiple vendors and products. It’s based on clusters of commodity servers with massively parallel processing that supports structured and unstructured analytics.
- Business intelligence software (BI): BI is a category of analytics that works across both structured and unstructured data. It employs data mining, reporting, and dashboards that present data in the context of informed business decisions.
- Data integration tools: These tools consolidate data from different sources so users may view and analyze them from a centralized dashboard. Traditionally they work structured data, but some of them work on unstructured data as well.
- Array-based analytics: Several storage array manufacturers include native analytics on their storage systems. A popular ecosystem for big machine data is ingest raw data on high capacity/high throughput storage systems, which deliver data to scientific applications. These applications store their data onto specialized storage systems that contain built-in data analytics.
Additional categories are not officially analytics tools, although they can help to corral unstructured data to present to analytics tools. These include document management systems, information management products for lifecycle tracking, and search and indexing.
Third: Plan the Technology Stack
Once you have chosen your tools, choose the technology stacks that supports them. You have several deployment choices to make. If you choose a hardware based analytics tool, then it’s a matter of purchasing a storage system with native analytics. If you choose a highly-distributed grid architecture using Hadoop architecture, you may want to deploy it yourself or choose a service provider to install and/or manage it. You may also choose to deploy your own storage infrastructure and run software-only analytics tools on in-house data sources, or purchase an analytics tool made specifically for an online application.
Whatever method of deployment you choose, if you’re working with on-premise you will need to scale for massive data volumes and high performance. You will also need both availability and data durability. If you are after real-time results then you should ensure for high availability. If your goal is meaningful historical trending, then data durability will be the overriding factor.
Analyzing data for valuable information has always been a business goal. Today, a lot of that business wealth is found in unstructured data on the web and on-premise. Those companies who can effectively capture that value will increase their effectiveness, and accelerate business decision quality and speed. The even better news is that unstructured data analytics are not just available to the enterprise. They’re also available to mid-sized business and SMB who are serious about wringing business intelligence out of their own data.