AI Trends: Open Data and Big Data Sharing

Rob Light  |  January 18, 2018

Artificial intelligence (AI) is the preferred buzzword for machine learning, which is helping to rapidly progress business operations and strategy as companies embrace digital transformation. 

In 2018, businesses and developers alike will utilize not only accessible company data to train machine learning tools, but also public, open-source data sets that will have significant impact on business solutions. If a business truly understands the value of digital transformation, then it is already mining its data for actionable insights with the likes of artificial intelligence software, big data analytics software and business intelligence software, but it should also be using that data to train machine and deep learning models for the benefit of the company.

Big data trends for 2018

These companies will also have to use open data sources to train AI if they want to remain on the cutting edge and further their business modernization; otherwise, they will limit themselves and miss opportunities to enhance their solutions with public data sets.

Big data has been a term thrown around for some time now, but it represents the massive boom in accessible open-data sets over the past decade. This boom has been made possible by the migration of data storage to the cloud, the ability to easily scale data retention with the use of virtual machines for big data processing, and the sheer number of connected devices that create data themselves. The internet of things is only increasing the already unfathomable big data statistics and will continue to do so as that industry takes off. The ability to feed big data sets to deep learning models that contain artificial neural networks has been the reason for such rapid AI advancement (and the reason you read about it so much in the news).

Enterprise companies have already been able to take advantage of data to build machine learning tools because they have so much of it. Amazon and Google have an enormous amount of information regarding everything from the way consumers shop online to what videos people watch on YouTube, to name a few examples. With this information, these companies can build AI-driven solutions to benefit their own organizations, but also to resell as a machine learning service to outside companies, the same way they sell cloud storage. This machine learning as a service (MLaaS) is not the only benefit that enterprise data creates for third-party companies; many large businesses have also opted to open source their data sets for the benefit of developers building their own business applications.

In the coming year, machine learning developers and businesses will take more advantage of open data sets from the likes of big companies, and enterprises will continue to share data that they feel could be impactful. There are a number of different data sets that provide a variety of information for machine learning tools to consume. These include image and video repositories, text data, geospatial and environmental data, transportation information and climate data. Each of these different open data sets can provide unique and critical information for developers building their own applications.

Software as a service (SaaS) vendors have been working tirelessly to add AI capabilities to their solutions, and some are beginning to prove out the benefits of machine learning features to their users. However, to take advantage of the embedded deep learning within SaaS products, businesses will have to opt in to sharing their own data with the vendor, so that the AI can best learn how to help the user and company. While businesses may be wary of this agreement (and for good reason: sharing data is scary), they will become much more open to the concept in 2018 because the benefits will outweigh the risks. These tools will be critical to digital transformation, so opening up their own data to help improve how their tools work for them will become commonplace.

In a similar vein, SaaS vendors, cloud infrastructure providers and general businesses will open up data partnerships in mutually beneficial agreements to share specific data sets. Instead of hoarding valuable data, businesses will strategically, and more frequently, share data to better analyze aspects of their business, whether it is customer interactions or business operations. Many of these agreements may be beneficial specifically for machine learning purposes, as businesses are building automated processes in-house and need data to progress those capabilities. Either way, the concept of open data will progress beyond the needs of developers and will lead to major business data swaps, ultimately increasing digital transformation in 2018.

Big data helps business modernization and enables open data

Over the last decade, the amount of data that companies have access to has outgrown the term “big.” The volume of data at a business’ fingertips has allowed companies to adopt a data-driven business strategy, which helps provide transparency into their operations, workforce and customer interactions. It has become so vital that a C-level job, the chief data officer, has been created just to deal with all of it. The boom is not stopping anytime soon.

According to IDC’s “Data Age 2025: The Evolution of Data to Life-Critical”, this hyper growth of data will only continue going forward. “IDC forecasts that by 2025 the global datasphere will grow to 163 zettabytes (that is a trillion gigabytes). That’s ten times the 16.1ZB of data generated in 2016. All this data will unlock unique user experiences and a new world of business opportunities.” A major part of that new world of business opportunities is the machine learning advancements made possible by the data.
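As a quick sanity check on the figures quoted above, the arithmetic can be sketched in a few lines of Python. The unit conversion assumes decimal (SI) units, and the implied yearly growth rate is a derived illustration rather than a number from the IDC report; the function names are hypothetical.

```python
# Decimal (SI) units are assumed: 1 ZB = 10**21 bytes, 1 GB = 10**9 bytes.
GIGABYTES_PER_ZETTABYTE = 10**21 // 10**9


def growth_factor(start_zb: float, end_zb: float) -> float:
    """How many times larger the end volume is than the start volume."""
    return end_zb / start_zb


def implied_annual_growth(start_zb: float, end_zb: float, years: int) -> float:
    """Constant yearly growth rate that would turn start_zb into end_zb.

    This is a derived illustration, not a figure stated in the IDC report.
    """
    return (end_zb / start_zb) ** (1 / years) - 1


if __name__ == "__main__":
    factor = growth_factor(16.1, 163)
    rate = implied_annual_growth(16.1, 163, 2025 - 2016)
    print(f"1 ZB = {GIGABYTES_PER_ZETTABYTE:,} GB")
    print(f"Growth from 2016 to 2025: about {factor:.1f}x")
    print(f"Implied compound growth: about {rate:.0%} per year")
```

Under these assumptions, a tenfold increase over nine years works out to a little under 30% compound growth per year.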

There are a few reasons behind the increase in data, the first being the migration of data storage to public, private and hybrid cloud offerings. As the popularity of public infrastructure services, such as Amazon Web Services (AWS), Google’s Cloud Platform and Microsoft’s Azure, continues to grow, so does big data. These infrastructure as a service (IaaS) providers allow businesses to quickly spin up reliable and cost-efficient data storage. Not only does it allow for the data to sit there, but it also makes it easily accessible with the use of analytics tools. The ability for big data processing and distribution systems to consume and store unstructured data has also been an important advancement for big data.

Additionally, the growth in connected devices has contributed to the growth of the overall datasphere. The term ‘connected device’ has the connotation of a mobile phone; however, as IoT devices become more prominent in both the consumer and business world, the number of internet-enabled devices continues to skyrocket. Everything from a toaster or refrigerator, to assembly line and farming machinery may be connected to the internet, therefore constantly creating data.

This industry is poised for continued growth, so the data creation from connected devices is not slowing down. Per the IDC report, “Big Data and metadata (data about data) will eventually touch nearly every aspect of our lives — with profound consequences. By 2025, an average connected person anywhere in the world will interact with connected devices nearly 4,800 times per day — basically one interaction every 18 seconds.”
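The “one interaction every 18 seconds” figure follows directly from the daily count. A minimal check, using a hypothetical helper name:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def seconds_between_interactions(interactions_per_day: float) -> float:
    """Average gap between interactions if they were spread evenly over a day."""
    return SECONDS_PER_DAY / interactions_per_day


if __name__ == "__main__":
    # 86,400 seconds / 4,800 interactions = 18 seconds per interaction
    print(seconds_between_interactions(4800))
```

Note that this spreads the interactions evenly across all 24 hours, sleep included, so the gap during waking hours would be shorter.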

If that doesn’t sound spooky to you, then you are much braver than I am.

Outlandish statistics are hard to really comprehend, but ultimately the big data explosion will contribute to the growth of machine learning. Inevitably, some of these data sets will become open-sourced because they will provide value to developers or organizations outside of those that technically own them. Large enterprises are already recognizing this, which is why Amazon, Google and government institutions, such as NASA, are making some of their data sets public. Developers will be using these big, open data sets to develop business applications for many years to come.

Types of open data: business value and applications

1. Image and video data sets

A frequently used group of data sets is image and video repositories. These massive compilations of pictures and videos help train image and speech recognition tools. Some prominent image and video data sets are Google’s Open Images, which contains roughly 9 million URLs to images with labels and classes for machine learning, and Amazon’s Bin Image Dataset, which has “Over 500,000 bin JPEG images and corresponding JSON metadata files describing products in an operating Amazon Fulfillment Center.” Google also offers a robust video data set for developers to take advantage of with YouTube-8M, which provides 7 million video URLs, 450,000 hours of video, and 3.2 billion audio and visual features, among other data points.

2. Geospatial and environmental data sets

There are a number of public data sets, mostly opened up by NASA and other government funded organizations, that provide geographical and environmental data from satellite imagery. These data sets provide a wide range of information, from climate information to maps of cities. Such data can be used for training self-driving cars and image recognition algorithms, among other uses.

3. Text data sets

Similar to the image and video data sets, these offer a compilation of documents to be fed to natural language processing (NLP) models. These text data sets help train the machine learning behind such applications as chatbots. Some valuable open data sets include WikiText, which is a “language modeling data set of over 100 million tokens extracted from the set of verified ‘Good’ and ‘Featured’ articles on Wikipedia.” This data set was collected by Salesforce’s MetaMind. Google Books Ngrams is also an expansive data set of text consisting of words taken from Google Books. This data set is offered in a “Hadoop friendly file format” on Amazon S3.

4. Transportation data sets

In recent years, mobile application services like Uber and Lyft have recorded an enormous amount of transportation data, ranging from traffic patterns to popular spots where people like to be dropped off. More traditional transportation services have also opened up their data sets for the benefit of developers. These data sets include the Department of Transportation’s airline delay data and New York City’s taxi data. Both Uber and Lyft have open sourced some of their data sets to partner with cities for traffic planning and with automotive companies to help fuel the AI behind autonomous driving cars.

5. Other data sets

Other sets of data may include genome or DNA data, facial data, sentiment analysis and recommendation systems, among many others. In the coming year, look for more and more companies to make their own unique data sets available to developers for machine learning purposes.

Data sharing for SaaS enhancements

Embedded AI is being pushed by SaaS vendors as a crucial part of their product plans for 2018. Buyers will more frequently ask whether or not a software product contains machine learning functionality, and how that functionality can benefit the buyer’s company and employees. Often, this functionality will help automate simple tasks that are tedious and undesirable for workers, but nonetheless critical. However, to receive the fullest and most complete machine learning capabilities, companies will have to opt in to sharing their company data with the SaaS vendor, which may turn off some businesses. In 2018, it will become commonplace for businesses to opt in to data sharing to benefit in full from embedded AI.

The main reason that businesses are wary of opening up their data is the threat of security breaches, and SaaS vendors will be doing everything they can to prevent data breaches and to protect their customers. As time goes on, and vendors gain the trust of their customers, businesses will acknowledge that the benefits of smarter AI outweigh the risks.

An easy example of this is Salesforce’s Einstein, which can be embedded throughout all of the company’s products, but most prominently in its flagship solution, Salesforce CRM. If users want to use Einstein but not share their company data with Salesforce and outside companies, the machine learning capabilities will learn only from the company’s own CRM data. But if a business chooses to opt in to data sharing, then Einstein will be smarter, because it is working from the data sets of the entire Salesforce CRM customer base that has also opted in. Sharing data benefits not only the customer using the CRM, but also Salesforce itself, because its machine learning tools will get smarter as they consume more data.
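The opt-in trade-off described here can be pictured with a small sketch. This is purely a hypothetical illustration of the idea, not Salesforce’s implementation or API: the `Tenant` class and `training_pool` function are invented names, and the records are placeholder strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tenant:
    """A hypothetical customer of a SaaS vendor."""
    name: str
    opted_in: bool
    records: tuple  # placeholder for the customer's own CRM records


def training_pool(tenant: Tenant, all_tenants: list) -> list:
    """Return the records a model serving `tenant` could learn from.

    Assumption: a tenant that has not opted in learns only from its own data,
    while an opted-in tenant learns from every opted-in tenant's data.
    """
    if not tenant.opted_in:
        return list(tenant.records)
    pool = []
    for other in all_tenants:
        if other.opted_in:
            pool.extend(other.records)
    return pool


if __name__ == "__main__":
    acme = Tenant("acme", opted_in=True, records=("a1", "a2"))
    globex = Tenant("globex", opted_in=True, records=("g1",))
    initech = Tenant("initech", opted_in=False, records=("i1", "i2", "i3"))
    tenants = [acme, globex, initech]

    print(len(training_pool(acme, tenants)))     # pooled: acme + globex records
    print(len(training_pool(initech, tenants)))  # isolated: initech records only
```

In this toy model, the company that keeps its data private never contributes to, or benefits from, the shared pool, which is the trade-off the paragraph describes.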

Data-sharing opt-ins will extend far beyond the CRM space in 2018, as companies will choose to open their data for such business solutions as improved supply chain efficiency and increasingly intelligent machines. Businesses would be able to identify gaps in a supply chain operation and, based on what other companies do in the same scenario, the intelligent system could make a recommendation, whether that is an optimized shipping route or a suggestion on machine maintenance. The possibilities are seemingly endless.

Along the same lines, SaaS vendors will begin to enter more data swap partnerships that benefit both companies involved, as will general businesses outside of software. The idea of a company opening its data to another company could be seen as irresponsible by some traditional businesses, but if it can help improve the digital transformation of customers or the business itself, then the companies will find an agreement. Is it so unfathomable that AWS would share some of its infrastructure data with Salesforce in return for its customer data? Or what about GE sharing its machine and IoT data with American Airlines in return for flight and delay information? These partnerships simply make sense and will become more prominent in 2018.

Concluding predictions

Data has been pushed to the forefront of business strategy, because it allows employees and companies to make more informed, educated decisions. So much data that businesses never had access to before cloud computing is now influencing all aspects of digital transformation, including the workforce, customer interactions and business operations. According to the IDC report, “by 2025, nearly 20% of the data in the global datasphere will be critical to our daily lives and nearly 10% of that will be hypercritical.” That is a startling statistic, but one that companies should take heed of going into 2018.

For AI purposes, developers will take advantage of open-source data sets when building and training machine learning models, because they do not have the luxury of owning massive amounts of data like enterprise companies. However, those same enterprise companies will continue to be the leaders in opening up their data to developers in the coming year. Businesses will begin to opt in to data-sharing agreements with software vendors to improve the embedded AI within the solutions they are using. Not only will they open formerly proprietary data to SaaS vendors, but also to other businesses in data swap partnerships to improve overall digital transformation. Open data, data sharing and data swaps will all make headlines in one way or another in 2018 and will benefit all parties involved.



Rob Light is a Research Principal focusing on artificial intelligence, big data, and analytics categories at G2 Crowd. Additionally, Rob focuses heavily on enterprise technology vendors across all software categories. He started working at the Crowd in the Fall of 2015 after two years of teaching English in Spain and a brief stint in the customer service world. In his free time, Rob enjoys watching as many films as possible and even dabbles in some amateur screenwriting.