Unstructured data is proliferating rapidly. It is growing in volume by more than 50% a year and, according to IDC, will account for 80% of all data by 2025, a threshold some organisations have already passed.
Another 80% figure is often cited in relation to unstructured data: that four-fifths of all business-relevant information originates in unstructured form, mostly text.
In other words, it is in emails, reports, articles, customer reviews, client notes and other forms of unstructured text. It is also in social media posts, medical research findings, video, voice recordings and remote system monitoring data (internet of things). In short, unstructured data is very varied and can range in size from a few bytes to multiple gigabytes or more.
So, whether or not the 80% figures are accurate, they do highlight the importance of unstructured data.
In this article, we will look at the huge variety of unstructured data, the structures that exist in unstructured data, NAS and object storage, and the cloud services that are aimed at unstructured data.
No one-size-fits-all in storage terms
In terms of size and format, unstructured data can comprise everything from internet of things (IoT) remote system monitoring data to video. That encompasses file sizes ranging from a few bytes to multiple gigabytes or beyond. In between, there is a lot of text-based data that derives from emails, reports, customer interactions, and so on.
To define it, we can say it is the type of data that is not held in the structured format we associate with a traditional relational database. Instead, it could reside in anything from raw data to some type of NoSQL database, a category that in reality encompasses a range of products and methods of ordering data that go beyond the traditional SQL way of doing things.
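The contrast can be sketched in a few lines of Python: a document store in the NoSQL mould accepts records of differing shapes, where a relational table would demand one fixed schema. The records below are invented purely for illustration.

```python
# Sketch of schema-less storage: each "document" carries whatever fields
# suit it, unlike rows in a relational table with a fixed column set.
documents = [
    {"id": 1, "type": "email", "subject": "Q3 report", "body": "..."},
    {"id": 2, "type": "iot", "sensor": "temp-04", "reading": 21.7},
]

# Queries still work, but by inspecting each record rather than a schema.
emails = [d for d in documents if d["type"] == "email"]
print(len(emails))  # 1
```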
What type of storage is required depends on two things: the capacity needed and the I/O demands the organisation will place on it. We are not talking here about the database in use, but about the storage on which it sits.
So, unstructured data storage could be anything from relatively low-volume, low-I/O-performance deployments – such as a NAS or object storage appliance, or a cloud instance – to huge, highly performant distributed file or object storage.
Not as unstructured as you might think
“Unstructured” can be something of a misnomer. In fact, you could see unstructured data as existing on a continuum. At one end would be things like IoT data, emails, documents, and possibly some less obvious candidates such as voice and video, which have metadata headers or come in formats such as XML or JSON that allow for some basic analysis.
This is semi-structured data.
At the other end would be vast amounts of free text gathered from websites or social media posts, which would be the most difficult to analyse and process.
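The continuum can be illustrated with a small Python sketch: a semi-structured record wraps free text in a JSON envelope whose fields support basic analysis, while the body itself remains unstructured text. The field names and values here are invented for illustration.

```python
import json

# A "semi-structured" record: metadata fields in a JSON envelope around
# an unstructured text body (example fields are invented).
record = json.loads("""{
  "sender": "ops@example.com",
  "timestamp": "2020-03-01T09:15:00Z",
  "body": "Server rack 4 reported intermittent faults overnight..."
}""")

# The envelope can be queried directly...
print(record["sender"])  # ops@example.com

# ...but the body is free text, which needs heavier processing to analyse.
print(len(record["body"].split()), "words of unstructured text")
```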
It is beyond the scope of this article to go into detail about data lakes, warehouses, marts, swamps, and so on, and the methods of ordering data within them, such as NoSQL.
The key decision outlined earlier remains: back-end storage will depend on the capacity required, access times and I/O profile, and potentially on availability and the ability to scale.
NAS isn’t what it used to be. Scale-out NAS has brought file access storage into the realms of very high capacity and performance. NAS used to mean a single filer, and that meant the potential to become siloed.
Scale-out NAS is built with a parallel file system that provides a single namespace across multiple NAS boxes with the ability to scale to billions of files. Capacity can be added, and in some cases, so can processing power.
Scale-out NAS has the benefit of being Posix-compliant, so it works well with traditional applications and offers functionality such as file locking, which may be important from an access point of view.
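File locking of the kind Posix semantics provide can be demonstrated with a short Python sketch using the standard fcntl module (this assumes a Posix system; the file is a throwaway temp file):

```python
import fcntl
import os
import tempfile

# A writer takes an exclusive advisory lock on a file...
fd, path = tempfile.mkstemp()
os.close(fd)
f1 = open(path, "w")
fcntl.flock(f1, fcntl.LOCK_EX)

# ...so a second opener asking for a non-blocking shared lock is refused.
f2 = open(path, "r")
try:
    fcntl.flock(f2, fcntl.LOCK_SH | fcntl.LOCK_NB)
    locked_out = False
except BlockingIOError:
    locked_out = True

fcntl.flock(f1, fcntl.LOCK_UN)  # release so others can proceed
f1.close()
f2.close()
os.unlink(path)

print(locked_out)  # True: the exclusive lock held off the shared request
```

This coordination between concurrent accessors is what object storage, covered below, traditionally lacks.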
Scale-out NAS was until recently the only choice for high-performance unstructured data, although object storage is catching up.
On-prem scale-out NAS storage is available from the big five physical storage array makers – Dell EMC, NetApp, Hitachi, HPE and IBM. They also have ways to tier data to the cloud and, in some cases, offer cloud instances of their NAS products.
The big three cloud providers – AWS, Azure and Google Cloud – all provide file storage that ranges from standard to premium service levels, often based on NetApp storage.
There is also a new breed of file storage products designed for hybrid cloud use. These include Qumulo, WekaIO, Nexenta and Hedvig. Elastifile was counted among these, but was bought by Google in 2019.
Object storage is a more recent contender for the unstructured data storage crown. It keeps data in a flat format accessed via a unique ID, with metadata headers that allow for search and some analysis.
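Those semantics can be sketched in a few lines of Python: an in-memory stand-in for an object store (the function names here are invented, not a real API) keeps blobs in a flat keyspace and answers searches from metadata alone, without reading the data itself.

```python
# Illustrative in-memory object store: a flat keyspace of unique IDs,
# each holding a data blob plus a searchable metadata header.
store = {}

def put(object_id, data, metadata):
    """Store a blob under a unique ID with its metadata header."""
    store[object_id] = {"data": data, "meta": metadata}

def search(**criteria):
    """Find object IDs whose metadata matches all given criteria."""
    return [oid for oid, obj in store.items()
            if all(obj["meta"].get(k) == v for k, v in criteria.items())]

put("vid-001", b"...", {"type": "video", "dept": "marketing"})
put("doc-042", b"...", {"type": "report", "dept": "marketing"})

print(search(dept="marketing"))  # both IDs match on metadata alone
```

Note there is no hierarchy: unlike a file system, nothing nests, which is what lets real object stores scale out so flatly.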
Object storage gained traction as an alternative to some of the drawbacks of scale-out NAS, which can suffer performance hits as it grows due to its hierarchical structure.
Object storage is arguably the native format of the cloud, too. It is hugely scalable and accessible via application programming interfaces (APIs), which fits well with the DevOps way of doing things.
Compared to file storage, object storage lacks file locking, and until recently it lagged in terms of performance, although that is changing, driven by the need for rapid analysis of unstructured data.
All the big five make object storage for on-prem use, with ways to tier to object storage in the cloud. Also, there are object storage specialists such as Scality, Cloudian, Quantum, Pure Storage and the open source Ceph.
All the big cloud providers’ basic storage offerings are based on object storage, with varying classes of service/performance offered. AWS, for example, offers different classes of S3 storage that vary according to access time requirements and value or reproducibility of data.
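As a rough illustration, that choice of class can be modelled as a simple function. The class names below are real S3 storage-class identifiers, but the thresholds are assumptions made for illustration, not AWS's actual criteria:

```python
def pick_s3_class(accesses_per_month: float, reproducible: bool) -> str:
    """Illustrative storage-class chooser (thresholds are assumptions)."""
    if accesses_per_month > 1:
        return "STANDARD"            # frequently accessed data
    if accesses_per_month >= 1 / 12:  # touched at least roughly yearly
        # One Zone-IA trades durability guarantees for price, so it suits
        # data that can be reproduced if lost.
        return "ONEZONE_IA" if reproducible else "STANDARD_IA"
    return "GLACIER"                 # archival, slow to retrieve

print(pick_s3_class(30, False))   # STANDARD
print(pick_s3_class(0.5, True))   # ONEZONE_IA
print(pick_s3_class(0.01, False)) # GLACIER
```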
Cloud benefits and containers
All the big three cloud providers offer their core object storage services for use as data lake storage.
Microsoft also offers Azure Data Lake, a service targeted at unstructured data.
The benefits here are that the cloud provider offers expandable capacity and the means of getting data to it via gateways, etc. The downside, of course, is that you have to pay for it, and the more data you put into the data lake, the more it costs.
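A back-of-envelope sketch of that cost dynamic, using an illustrative per-gigabyte price rather than any provider's actual quote:

```python
# Back-of-envelope data lake cost (price is an assumption, not a quote).
gb_stored = 500_000            # half a petabyte in the data lake
price_per_gb_month = 0.02      # illustrative object storage price, USD

monthly_cost = gb_stored * price_per_gb_month
print(f"${monthly_cost:,.0f}/month")  # $10,000/month; doubles as data doubles
```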
Also, the hyperscalers offer NoSQL databases in their clouds. These can be their own – Google Datastore, Amazon DynamoDB, Azure Cosmos DB – or third-party NoSQL databases that can be deployed in their clouds.