Dealing with big data is a big challenge. Big data is big because of sheer volume, because of the velocity of creation, and because of the huge variety of unstructured data types.
One of the biggest of those challenges is testing the unstructured data that feeds the software industry’s big data applications. Traditional big data testing for relational database management systems (RDBMS) isn’t a walk in the park, but it is a mature, well-defined process. Testing unstructured big data is another matter entirely.
Experian recently researched the phenomenon of poor data quality in applications and reported that 75% of businesses waste 14% of revenue simply due to poor data quality. Evans Data Corporation surveyed big data application developers, and 19.2% of them said that data quality is the biggest problem they consistently face.
What is Big Data?
First, let’s get our definition straight on what constitutes big data. A common approach is to define big data in terms of the 3V’s: volume, velocity, and variety. High volume is the biggest clue, but not the only one. Velocity – the speed of creation – is critical, as is the wide variety of data types that unstructured data brings.
Unlike structured data, unstructured data does not have a defined data model. Unstructured data includes social media like Twitter and Facebook, email and chat applications, video and audio files, digital photos, voicemail, and call center records. And these are just human-generated files; once you get into machine-generated files, you’re talking about massive and fast-growing volumes of data.
What is Big Data Testing?
Big data testing is, in essence, the process of testing for data integrity and processing integrity, so that organizations can verify their big data. Big data presents big computing challenges, thanks to massive dataset sizes and a wide variety of formats. Investing in big data analytics creates business intelligence – if organizations can trust that intelligence. Hence the importance of big data testing.
The level of difficulty varies widely between testing structured and unstructured big data. Much big data testing is based on the ETL process: Extract, Transform, Load. The extraction phase extracts a set of test data from structured data applications, usually relational database management systems (RDBMS). The transformation process varies in extent depending on the ETL goal, and includes data verification and process verification for testing purposes. Once the data is successfully transformed, testers can either move it into a data warehouse or delete the test data.
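To make the structured side concrete, below is a minimal sketch of the verification end of an ETL test in Python. The tables, column names, and business rule are hypothetical stand-ins; a real test would run against the actual source RDBMS and target warehouse rather than in-memory SQLite.

```python
# Minimal sketch of an ETL verification pass, assuming a toy SQLite source
# and target; a real test would point at the actual RDBMS and data warehouse.
import sqlite3

# Hypothetical source table with raw order records.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, region TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1999, "us"), (2, 2500, "eu"), (3, 1200, "us")])

# Extract: pull the test data set from the source system.
rows = src.execute("SELECT id, amount_cents, region FROM orders").fetchall()

# Transform: apply a simple business rule (cents to dollars, uppercase region).
transformed = [(i, cents / 100.0, region.upper()) for i, cents, region in rows]

# Load: write the transformed rows into the target (stand-in for a warehouse).
tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE orders (id INTEGER, amount_dollars REAL, region TEXT)")
tgt.executemany("INSERT INTO orders VALUES (?, ?, ?)", transformed)

# Data verification: record counts and a reconciliation total must match.
src_count = src.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
tgt_count = tgt.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
src_total = src.execute("SELECT SUM(amount_cents) FROM orders").fetchone()[0] / 100.0
tgt_total = tgt.execute("SELECT SUM(amount_dollars) FROM orders").fetchone()[0]

assert src_count == tgt_count, "row count mismatch between source and target"
assert abs(src_total - tgt_total) < 0.01, "reconciliation total mismatch"
print("ETL verification passed:", tgt_count, "rows,", tgt_total, "dollars")
```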
The big data testing process for unstructured data is a considerably bigger challenge. To see the differences more closely, let’s look at the divide between traditional database testing and unstructured application testing in Hadoop:
Area | Structured Data Testing | Unstructured Data Testing |
---|---|---|
Data | A relational database schema defines the structured data model. Testers typically use time-tested tools like manual sampling or automated verification. | Unstructured data presents a variety of data types with no relational database structure. Testers can use sampling strategies on some types of unstructured data, but quality is an issue. |
Data Environment | Generally limited file sizes do not need specialized testing environments. | The variety and volume of unstructured data may require specialized infrastructure and file systems. |
Testing Tools | Tried-and-true testing tools include MS Excel macros or automated testing applications. Automated testing tools simplify the structured data testing process. | Tools for testing unstructured data are relatively new, with more being introduced all the time. Unstructured testing tools reflect the complexity of the job; learning and administering testing toolsets takes skill and ongoing training thanks to fast upgrades and development. |
Big Data Testing for Unstructured Data
Structured and unstructured data testing share the same goals, which are to 1) validate the quality of the data, and 2) validate the data processes. Although some testers use the principles of ETL to describe the unstructured data testing process, the testing tools are entirely different. Unstructured data cannot be contained in relational databases (although it may sometimes be stored in NoSQL document databases). And automating unstructured data testing is a requirement: the tools themselves are complex, and the process is very complicated given big data’s volume and the speed of data creation from users and machines.
Big Data Testing Steps
Big data testing for applications does not test individual features; rather, it tests the quality of the test data and the performance and validity of data processing. Processing tests may be batch, interactive, or real-time. Data quality tests cover validity, completeness, duplication, consistency, accuracy, and conformity.
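As a rough illustration, here is a small Python sketch of what a few of those quality dimensions look like as automated checks. The field names and rules are invented for the example; a production suite would apply equivalent checks at scale through an automated testing tool.

```python
# Minimal sketch of data-quality checks (validity, completeness, duplication,
# consistency) over a batch of records; field names and rules are hypothetical.
import re

records = [
    {"id": "1001", "email": "a@example.com", "country": "US", "amount": 25.0},
    {"id": "1002", "email": "not-an-email",  "country": "US", "amount": 17.5},
    {"id": "1002", "email": "b@example.com", "country": None, "amount": -3.0},
]

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def quality_report(rows):
    ids = [r["id"] for r in rows]
    return {
        # Validity: does each field match its expected format?
        "invalid_email": sum(1 for r in rows if not EMAIL_RE.match(r["email"] or "")),
        # Completeness: are required fields populated?
        "missing_country": sum(1 for r in rows if not r["country"]),
        # Duplication: are record keys unique?
        "duplicate_ids": len(ids) - len(set(ids)),
        # Consistency/accuracy: do values respect business constraints?
        "negative_amounts": sum(1 for r in rows if r["amount"] < 0),
    }

print(quality_report(records))
# {'invalid_email': 1, 'missing_country': 1, 'duplicate_ids': 1, 'negative_amounts': 1}
```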
There are different data testing procedures, but the most common explanation involves three major steps: validate data staging, validate testing rules, and validate output. Since the leading processing framework for unstructured big data is MapReduce, you will often see industry experts describe stages 2 and 3 as MapReduce testing and output validation.
Step 1: Validate Data Staging
Validating data staging starts with a big data cluster – usually Hadoop, which may be on-premise or in the cloud. Testers then pull unstructured test data in from the source and use automated testing tools to compare the source data to the staged data. If there is a problem at this point, the test is compromised.
In fact, building and testing the workload environment is critical to running a successful test. Testers cannot properly test verification and performance on a poorly designed and implemented cluster. Set up high-performance, high-capacity clusters to run testing workloads, or work with cloud providers to construct testing environments in the cloud.
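Here is a minimal sketch, in Python, of the staging comparison itself: confirming that what landed in the testing environment matches the source. The directory paths are hypothetical, and on a real Hadoop cluster the staged copies would be read from HDFS rather than a local disk.

```python
# Minimal sketch of staging validation: confirm that the staged copies match
# the source files. Paths are hypothetical stand-ins for real source and
# staging locations.
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """Return the SHA-256 digest of a file, streamed in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def validate_staging(source_dir: str, staged_dir: str) -> list[str]:
    """Compare source files against their staged copies; return any problems."""
    problems = []
    for src in Path(source_dir).glob("*"):
        staged = Path(staged_dir) / src.name
        if not staged.exists():
            problems.append(f"missing in staging: {src.name}")
        elif checksum(src) != checksum(staged):
            problems.append(f"checksum mismatch: {src.name}")
    return problems

if __name__ == "__main__":
    issues = validate_staging("./source_data", "./staged_data")
    print("staging OK" if not issues else "\n".join(issues))
```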
Step 2: Validate Testing Rules
In Hadoop environments – whether on-premise or cloud – this step validates the MapReduce transformation process for unstructured data. Testing proves that the business rules that aggregate and segregate the test data are working properly.
The test runs node by node to verify the business logic on each node. A successful test proves that the process correctly implements the data aggregation or segregation rules.
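A rough sketch of this kind of rule validation, written as a plain Python map/reduce rather than a real Hadoop job, might look like the following. The business rule (total views per region) and the log format are hypothetical; the point is that the aggregated result gets checked against an independently computed expectation.

```python
# Minimal sketch of validating a MapReduce-style aggregation rule in plain
# Python. On a real cluster, the same idea applies: compare the job's output
# against totals computed outside the map/reduce path.
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

log_lines = [
    "us,home,3", "eu,home,1", "us,cart,2", "eu,cart,4", "us,home,5",
]

# Map: emit (region, views) pairs from each raw log line.
def mapper(line):
    region, _page, views = line.split(",")
    yield region, int(views)

# Reduce: aggregate views per region.
def reducer(region, values):
    return region, sum(values)

mapped = sorted((kv for line in log_lines for kv in mapper(line)), key=itemgetter(0))
result = dict(reducer(k, [v for _, v in grp]) for k, grp in groupby(mapped, key=itemgetter(0)))

# Validation: recompute the expected totals independently of the map/reduce path.
expected = defaultdict(int)
for line in log_lines:
    region, _page, views = line.split(",")
    expected[region] += int(views)

assert result == dict(expected), f"aggregation rule failed: {result} != {dict(expected)}"
print("aggregation rule verified:", result)
```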
Step 3: Validate Output
This stage validates both the tested data and its processing. It verifies that Step 2 testing successfully applied the business logic rules, that the tested workload retains data integrity, and that the processing introduced no data corruption.
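As a simple illustration, the following Python sketch checks those three output conditions: record integrity, reconciling aggregates, and no corruption in derived fields. The record shapes and field names are hypothetical stand-ins for a real workload’s schema.

```python
# Minimal sketch of output validation: the processed output must still
# reconcile with the input, and no corruption may creep in. Record shapes
# are hypothetical.
input_records = [
    {"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": 35.5},
    {"order_id": 3, "amount": 12.0},
]
# Output produced by the (hypothetical) Step 2 transformation.
output_records = [
    {"order_id": 1, "amount": 20.0, "bucket": "small"},
    {"order_id": 2, "amount": 35.5, "bucket": "medium"},
    {"order_id": 3, "amount": 12.0, "bucket": "small"},
]

def validate_output(inp, out):
    errors = []
    # Data integrity: no records gained or lost.
    if {r["order_id"] for r in inp} != {r["order_id"] for r in out}:
        errors.append("record set changed between input and output")
    # Reconciliation: aggregate measures must still match.
    if abs(sum(r["amount"] for r in inp) - sum(r["amount"] for r in out)) > 0.01:
        errors.append("amount totals do not reconcile")
    # No corruption: every derived field must be populated and well-formed.
    if any(r.get("bucket") not in {"small", "medium", "large"} for r in out):
        errors.append("invalid or missing derived field in output")
    return errors

problems = validate_output(input_records, output_records)
print("output validated" if not problems else "\n".join(problems))
```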
When testing is complete, testers are free to move the tested data into a storage system or to delete it from the testing cluster.
Big Data Testing Challenges
This process requires a high level of automation given massive data volumes and the speed of unstructured data creation. However, even with automated toolsets, big data testing isn’t easy.
- Good source data and reliable data insertion: “Garbage in, garbage out” applies. You need good source data to test, and a reliable method of moving the data from the source into the testing environment.
- Test tools require training and skill: Automated testing for unstructured data is highly complex with many steps. In addition, there will always be problems that pop up during a big data test phase. Testers will need to know how to problem-solve despite unstructured data complexity.
- Setting up the testing environment takes time and money: Hadoop eases the pain because it was created as a commodity-based big data analytics platform. However, IT still needs to buy, deploy, maintain, and configure Hadoop clusters as needed for testing phases. Even with a Hadoop cloud provider, provisioning the cluster requires resources, consultation, and service level agreements.
- Virtualization challenges: Most business application vendors develop for virtual environments, so virtualized testing is a necessity. Virtualized images can introduce latency into big data tests, and managing virtual images in a big data environment is not a straightforward process.
- No end-to-end big unstructured data testing tools: No vendor toolset can run big data tests on all unstructured data types. Testers need to invest in and learn multiple tools depending on the data types they need to test.
No matter how challenging the big data testing process is, it must be done – developers can hardly release untested applications. There are certain features to look for that make the job easier for both structured and unstructured data testing. Look for high levels of automation and repeatability, so testers do not have to reinvent the data wheel every time, or pause the testing process to research and take manual steps.
And although Hadoop is very popular for structured and unstructured big data, it’s not the only game in town. If testing data resides on other platforms, such as application servers, the cloud, or NoSQL databases, look for tools that expand to include them. Also consider testing speeds and data coverage verification, a smooth training process, and centralized management consoles.