Tuesday, April 25, 2017

Why do we need independent data operators that will store the digital personality? Part 2.


In the previous post you can read about the four types of data, and the problems with each, that Pedro Domingos, professor of Computer Science and Engineering at the University of Washington, highlights in his book “The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World”. Professor Domingos also offers a solution for organizing and keeping that data.
We need a new type of company that will play the same role for our data as a bank does for our savings. Banks do not steal our money, and they are supposed to invest it wisely. Today, many companies offer to consolidate our data somewhere in cloud storage, but they are still very far from being personal data banks. Cloud providers try to lock us in, and that is unacceptable (imagine opening an account with Bank of America without being sure you will later be able to transfer your funds to another bank).
Companies of this new type, as the professor imagines them, will provide several services for a subscription fee. First, they will anonymize our interactions in the electronic world by routing them through their own servers and aggregating them with the similar actions of other users. Second, they will store in one place all the data collected during our lives. Third, they will form a complete model of our personality and our world and constantly update it. Fourth, they will apply this model on our behalf, to the best of its abilities, always doing exactly what we would have done ourselves. The company's main obligation to us is never to use our data and our model against our interests. The guarantee will not be one hundred percent; after all, we ourselves are not immune from doing things that harm us. Nevertheless, the company's viability will depend on keeping this agreement to the same extent that a bank's survival depends on keeping our money safe, so we can trust such companies the way we trust banks today.
Such companies can quickly become some of the most valuable in the world. As Alexis Madrigal of The Atlantic points out, today our profiles can be bought for half a cent or even less, but for the online advertising industry a user's value is approaching $1200 per year. The piece of information about us that Google holds is worth about $20, Facebook's about $5, and so on. Add to this the fragments that no one has yet, and the fact that the whole is worth more than the sum of its parts: a personality model based on all our data is much better than a thousand models built from separate pieces.
Of course, some existing companies would happily host our "digital personality". Google is one example. Sergey Brin wants Google to become the "third half of your brain", and some of the company's acquisitions are probably explained by how well their streams of user data complement Google's own. But despite their head start, Google and Facebook, for example, are not well suited to the role of our digital home, because they have a conflict of interest. They earn their living by targeting advertising, so they will always have to balance the interests of users against those of advertisers. Domingos asks a rhetorical question: you would presumably not allow one of the halves of your brain to be less than entirely loyal to you, so why allow it in the third?
A potential threat can come from government bodies if they have the right to demand our data or even to put us behind bars preemptively. To prevent this, the company storing the data should encrypt it, and the key should remain in our hands (it is already possible to perform computations on encrypted data without decrypting it). Alternatively, we can keep everything on a hard drive at home, and the company will simply rent us the software.
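The idea of computing on encrypted data while the key stays with the user can be illustrated with a small sketch. The post does not name any particular scheme; the example below assumes the Paillier cryptosystem, which supports adding two numbers while both are encrypted, and uses tiny hard-coded primes purely for illustration. The function names are hypothetical, and nothing here is secure enough for real use.

```python
import math
import secrets


def keygen(p: int, q: int):
    """Build a toy Paillier key pair from two primes (illustrative sizes only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1
    # mu is the modular inverse of L(g^lam mod n^2), where L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(public_key, message: int) -> int:
    """Encrypt an integer 0 <= message < n with fresh randomness."""
    n, g = public_key
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, message, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(public_key, private_key, ciphertext: int) -> int:
    """Recover the plaintext; only the holder of the private key can do this."""
    n, _ = public_key
    lam, mu = private_key
    return ((pow(ciphertext, lam, n * n) - 1) // n * mu) % n


def add_encrypted(public_key, c1: int, c2: int) -> int:
    """Multiplying two ciphertexts yields an encryption of the plaintext sum."""
    n, _ = public_key
    return (c1 * c2) % (n * n)


# The user keeps private_key; the storage company only ever sees public_key
# and ciphertexts, yet it can still add the two hidden values together.
public_key, private_key = keygen(1009, 1013)
encrypted_total = add_encrypted(
    public_key, encrypt(public_key, 1200), encrypt(public_key, 25)
)
total = decrypt(public_key, private_key, encrypted_total)
```

In this sketch the company could run `add_encrypted` on its servers without ever learning the two amounts, while only the key holder can read `total`.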
Privacy is only one aspect of the broader issue of access to information, and if we focus on it at the expense of the whole, we risk reaching the wrong conclusions. For example, laws that prohibit the use of data for any purpose other than the one originally intended are extremely short-sighted.
In the professor's opinion, companies that store our digital identity, together with data protection associations, will define how data is handled in the future. Today, most people do not realize how much data they give away or what costs and benefits that entails. Companies, for their part, are happy to maintain the status quo and work behind the scenes. Domingos concludes that sooner or later such a system will collapse. It is better to raise awareness now and give everyone the right to choose whether to share data and, if so, how and where.


Why do we need independent data operators that will store the digital personality? Part 1.


Pedro Domingos, professor of Computer Science and Engineering at the University of Washington, wrote the book “The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World”. I would like to introduce you to the main idea of one of the book's chapters, "This Is the World on Machine Learning".
The professor divides data into four categories: data that we share with everyone, data that we share only with friends and colleagues, data that we share with various companies, and data that we do not share at all. The first type includes, for example, reviews on Yelp, Amazon and TripAdvisor, ratings on eBay, a resume on LinkedIn, blogs, and tweets. This data is very valuable and causes the fewest problems. We share it with the world because we want to, and everyone benefits. The only difficulty is that the companies that store this data do not always allow it to be downloaded in bulk for building models. They should change their approach. Today you can go to TripAdvisor and see the reviews and ratings of the hotels that interest you, but what about a model of the factors that make a hotel good or bad in general? With its help, it would be possible to evaluate hotels that have few reliable reviews or none at all. TripAdvisor could create something like that. And what about a model of the factors that determine how attractive a hotel is to you personally? This requires information about you, and you may not want to share it with TripAdvisor. It would be better to have a trusted third party that combines the two types of data and gives you the result.
Data of the second kind should not create problems either, but it does, because it overlaps with the third kind. We share news and pictures with our friends on Facebook, and they share with us. In doing so, each of us also shares all this information with Facebook itself. The network is the one that benefits: it has a billion friends. Day after day, it learns far more about the world than any individual could. Facebook uses all this knowledge mainly for targeted advertising, and in exchange it provides the infrastructure for sharing information: this is the deal every user makes. Learning algorithms are becoming more powerful and extract more and more value from the data, part of which is returned to us in the form of more relevant advertising and better services. The only problem is that Facebook is free to do things with the data and the models that go against the user's interests, and the user has no way to prevent it.
This problem appears wherever a person shares data with companies, which these days covers almost everything we do on the Internet and much of what we do in real life. Everyone wants your data, yet each company holds only a fragment of the whole. Google knows what we search for on the Internet, Amazon knows our purchases, AT&T our phone calls, Apple the music we download, Safeway has a complete picture of the food we buy, and Capital One knows our credit card transactions. Some companies, such as Acxiom, collate information about us and sell it, but if you inspect what they actually hold (at aboutthedata.com), it turns out to be little, and partly wrong. No one has anything close to a complete picture of our personality. This is both good and bad. It is good because whoever managed to obtain such a picture would have too much power. It is bad because, as long as this is the case, a comprehensive model cannot be built. What we really need is to be the sole owner of such a model and to grant access to it solely on our own terms.
The last type of data, the data we do not share, also poses a problem: sometimes such information should be shared. Maybe it has not occurred to you to share it, maybe there is no easy way to do so, or maybe you simply do not want to. In the latter case, it is worth considering whether we have an ethical duty to share data about ourselves. Cancer patients can contribute to defeating the disease if they provide access to their tumor genomes and treatment histories. The data that we generate in our daily lives can provide answers to all sorts of questions about society and politics. The social sciences are entering their golden age: they will finally have a volume of data that matches the complexity of the phenomena they study, and the benefits for all of us will be enormous, provided that this data is accessible to scientists, decision makers and citizens themselves. This does not mean that we should let others spy on our personal lives. It means that we need to let them see the resulting models, which will contain only statistical information. Between us and them there should be an honest data broker that ensures that information about us is not misused and that there are no "freeloaders" who seek the benefits without sharing their own data.
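One way to picture a broker that passes on "only statistical information" is a function that releases group-level aggregates and withholds any group too small to hide an individual. This is a minimal sketch under stated assumptions: the function name, the record layout and the threshold of five are hypothetical choices for illustration, not something the book or this post specifies.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical threshold: groups smaller than this are never released,
# so that a published figure cannot be traced back to one or two people.
MIN_GROUP_SIZE = 5


def release_statistics(records, group_key, value_key, min_group_size=MIN_GROUP_SIZE):
    """Return per-group counts and means, omitting groups that are too small."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {
        group: {"count": len(values), "mean": mean(values)}
        for group, values in groups.items()
        if len(values) >= min_group_size
    }
```

A researcher calling `release_statistics(records, "city", "visits")` would receive only the aggregated figures for sufficiently large groups and never the individual records themselves.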
Professor Domingos concludes that all four types of data have their problems and suggests a solution, which you can read about in the next post.

For more analytics posts follow this link: https://analyticsinbusinessworld.blogspot.com/2017/04/monetization-of-analytics-data.html

The black market of data.



We live in a post-privacy society in which, with rare exceptions, everything is known about us. Above all, we disclose information about ourselves: on social networks, on Instagram, on service portals, through various game-like elements (tests and surveys), in online stores and even in retail stores when we use discount cards. Collecting this information produces huge data sets, which contain personal details their owners do not even suspect are there. Once the information in such repositories is properly correlated, it is no longer impersonal and describes a particular person. We can talk about a person's "digital fingerprints". Once this data has been created, companies can sell it.
Data brokers are notoriously secretive. Paul Stephens, a director at Privacy Rights Clearinghouse in San Diego, says: “It’s hard to tell who’s selling what to whom.” In fact, it is not known exactly how many data brokers operate in the United States, because so many keep a low profile. Credible estimates range from 2,500 to 4,000. There are giants in the field, such as Acxiom and Experian, but there are also myriad smaller companies that few have heard of: CubeYou, Exact Data, Paramount Lists, Datalogix, Statlistics.
In addition to publicly and commercially available data, these data sets include information stolen from various sources, from dating sites to payment services. One of the nominally "safe" commercial uses of such stores is targeted marketing: based on a consumer's profile, a package of offers for goods or services is tailored to the individual as precisely as possible. The healthcare industry, for example, is becoming a target for cybercriminals. Personal health information (medical data) usually includes, besides the details of the patient's health, much else, from the patient's name and home address to their place of work and payment card details. According to Redspin, a company specializing in healthcare information security, more than 29 million medical records protected by the Health Insurance Portability and Accountability Act have been compromised in the United States since 2009.
On the black market, medical information is much more valuable than payment information: the estimated price of medical data is now about 10 times higher than the price of bank card data. While stolen card numbers sell for several dollars per dozen, a single medical record costs more than 350 dollars (according to the Ponemon Institute). Medical data is in many ways more valuable than other types of information because of its "permanence". Stolen or compromised bank cards can be replaced and fraudulent transactions can be disputed, but medical data cannot be revoked. For what purposes is medical data "monetized"? First of all, it replenishes the data repositories, after which it can be used, for example, by swindlers who sell "magic medicines" and "miracle devices" that supposedly cure all diseases. The test results of specific individuals are especially valuable, for instance when someone is looking for donors. In addition, information about the health of public figures can be sold to the media or used for blackmail and in political or corporate struggles. One widely discussed case dates from 2014, when an employee of a Swiss ambulance service tried to sell information from Michael Schumacher's medical record to the media for 50 thousand euros.
There are two main reasons for this problem. The first is a worldwide trend: users have no culture of handling personal information carefully. The second is the inadequate protection of user data by the organizations that receive it from the client and process it for one purpose or another. The problem needs to be tackled in a comprehensive, coordinated way. Otherwise, the black market for data will continue to exist and grow, legislative measures will not be fully effective, and everyone will remain at risk of having their personal data used against them.

You can find more interesting analytics information here: https://analyticsinbusinessworld.blogspot.com/2017/04/monetization-of-analytics-data.html

Monday, April 17, 2017

Why did Google buy Kaggle, a service for data scientists?




Google representatives, speaking at the Google Cloud Next conference in San Francisco (March 2017), confirmed the purchase of the start-up Kaggle, which built a platform where data scientists can test their analysis models on real-world problems.
Kaggle is the largest platform for data science and machine learning competitions. Hundreds of thousands of users are registered on the site. Access to them will strengthen Google's influence in the community of artificial intelligence specialists. Competing with the other major players in the artificial intelligence market, such as Amazon, requires constant activity, and buying Kaggle is an important step in this direction.
It will strengthen Google's position in the data science research community, as well as in the competition for the best professionals on the market.
The rapid spread of artificial intelligence technologies opens the market to small and medium-sized companies, whose combined influence could shake Google's acknowledged leadership in machine learning. The new acquisition is an additional trump card that helps the corporation preserve its status.
Founded in 2010, Kaggle unites more than 500,000 data specialists. Despite strong competitors such as DrivenData, Topcoder and HackerRank, the project quickly gained popularity thanks to its clearly defined niche. The service's investors include Yuri Milner and Max Levchin, as well as Khosla Ventures, Index Ventures and other funds. Since its founding, Kaggle has raised $12.5 million.