Democratizing Machine Learning - Data

The Industrial Revolution was driven by the standardization of parts, tools, and interfaces. In the context of machine learning, democratization is only possible through the standardization of machine intelligence. As I established in my previous post, empowering people to build their own models is just as important as providing them with ready-to-use tools. Tools and frameworks make machine learning more widespread. Open-source frameworks lower the barrier for software developers to adopt it; they can create models and play around with ideas. However, they do not completely address the problems of the “common people”. Big tech companies still hold the upper hand over data. Therefore, even with the right tools, the “common people” would not be able to produce intelligent systems. In the absence of data, machines cannot learn. The model would be just as dormant as your brain after multiple tequila shots at 2 am on a Saturday night. To make models more intelligent, vast amounts of data are pooled into data centers from the devices of the “common people”, and most people are quite unaware of this collection. Access to data is limited and concentrated in the hands of the few.

Data is produced by an event: either an action or an inaction. Devices like mobile phones produce data through user interactions over time. For example, a hidden application may collect the news headlines that you are most likely to click on (it probably does). This information can be used to provide you with a better “service” and recommend the news you are most likely to be interested in. It is this “intelligence” and control over what you see that make these companies so powerful. Despite the growing public concern about data privacy, companies will continue to find innovative methods to gather intelligence. This is literally how companies like Google and Facebook make money; for them, it is an existential matter. Personally, I do not want to go back to a time when Google did not exist. I love free web search! To be honest, I am not sure I could survive without it. Writing code would definitely have been harder. I would also rather have an independent company provide me that service than the government. I mean, have you read 1984?!

On the one hand, I am glad that governments around the world are cracking down on bad data collection practices. On the other hand, I believe good data collection practices should be encouraged, and government agencies should strive to make their datasets publicly available. Fortunately, the number of AI initiatives is growing. Public initiatives such as data trusts provide a platform for small companies and non-profits to be a part of the game. Given that most researchers end up working at the big tech companies, improving publicly available datasets is just as important as funding public research in artificial intelligence. Long story short, improving publicly available data trusts can relieve the situation a bit, and all we need is competition (and love).

Accessing privacy-sensitive data is another issue. AI is expected to be especially useful in healthcare. There is already a growing number of companies building wonderful applications in this space. However, healthcare data is extremely privacy-sensitive, and access to it is very restricted. We must find a way to realize the potential of AI, as it can be the “cure” to many of our problems. Instead of restricting access to privacy-sensitive data, we must strive to make it more publicly available without breaching privacy. So... how can we train our models without sharing private information? How can we ensure anonymity?

The answer is to share intelligence rather than data. Machine learning methods like “federated learning” cultivate this shared intelligence to build powerful models in a “centralized manner”. (Unsurprisingly, the research is backed by Google. Thanks, Google!) Do you like the idea? Good, then note down this textbook definition:

“Federated learning is a distributed machine learning approach that enables training on decentralized data.”
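
To make that definition concrete, here is a minimal sketch of federated averaging (FedAvg, the algorithm behind Google’s federated learning work) in plain NumPy. Everything in it is a hypothetical stand-in: a toy linear regression task, synthetic client datasets, and a `local_update` helper I made up for illustration. The point is the shape of the protocol: raw data stays on the clients, and only model weights travel to the server.

```python
import numpy as np

# Hypothetical setup: five clients each hold a private dataset for a
# toy linear regression task. Only weights ever leave a client.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(5):
    X = rng.normal(size=(100, 2))                      # private features
    y = X @ true_w + rng.normal(scale=0.1, size=100)   # private labels
    clients.append((X, y))

def local_update(w, X, y, lr=0.1, epochs=5):
    """One client's round: a few gradient steps on its own data.
    The raw (X, y) never leaves this function."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)          # MSE gradient
        w -= lr * grad
    return w

# The server loop: broadcast weights, collect updates, average them.
global_w = np.zeros(2)
for _ in range(20):
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    global_w = np.mean(local_ws, axis=0)               # federated averaging

print(global_w)  # converges to roughly [2.0, -1.0]
```

In a real deployment the “clients” would be phones, only a sample of them would participate in each round, and their updates would be weighted by local dataset size; the sketch keeps everything uniform for clarity.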

Another approach to this problem is “decentralized learning”, which is a direct descendant of anarcho-syndicalism. Fully decentralized learning enables learning on decentralized data without a central entity coordinating the process; in other words, it makes peer-to-peer learning possible. Improvements to this approach could therefore provide the underlying framework for open-source intelligence.
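
For contrast, here is an equally hypothetical sketch of the fully decentralized version, in the style of gossip averaging: the server disappears, and each peer mixes its weights only with its neighbors. The ring topology, the task, and all the names are my own illustrative assumptions, not a reference implementation.

```python
import numpy as np

# Same toy task as above, but no server: peers sit on a ring and only
# ever exchange weights with their two immediate neighbors.
rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
n_peers = 5
peers = []
for _ in range(n_peers):
    X = rng.normal(size=(100, 2))                      # private features
    y = X @ true_w + rng.normal(scale=0.1, size=100)   # private labels
    peers.append({"X": X, "y": y, "w": np.zeros(2)})

def local_step(peer, lr=0.1):
    """One gradient step on the peer's own private data."""
    X, y, w = peer["X"], peer["y"], peer["w"]
    grad = 2 * X.T @ (X @ w - y) / len(y)
    peer["w"] = w - lr * grad

for _ in range(50):
    for peer in peers:                     # 1) everyone trains locally
        local_step(peer)
    ws = [p["w"] for p in peers]
    for i, peer in enumerate(peers):       # 2) gossip with ring neighbors
        left, right = ws[(i - 1) % n_peers], ws[(i + 1) % n_peers]
        peer["w"] = (ws[i] + left + right) / 3

print(peers[0]["w"])  # every peer ends up near [2.0, -1.0]
```

No single node ever holds the data, and no single node ever holds “the” model; intelligence emerges from repeated local training and neighbor-to-neighbor mixing, which is exactly the peer-to-peer quality that makes this approach appealing.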

Written on July 30, 2019