We use the terms ‘data scientist’, ‘data analyst’ and ‘data engineer’ interchangeably, but anyone who has been in a data profession long enough knows this is not right. They have very different job profiles and demands. I don’t want to go into the intricacies, as the discussion would become too technical, but they all share one requirement that is as clear as day: data!
Big-tech companies will give us a billion bytes of free code (for a lot of reasons, do read it here) but won’t give us more than a bone-sized scrap of the data they hold. What should be very clear is that there is no magic when big-tech companies announce unbelievable feats of AI. They have a practically unlimited supply of data and enough money to buy the computing muscle to process it. Data are the ‘razor blades’ in this business, and the software that we use to make sense of them is the razor handle. The analogy is not as accurate as I would like, because blades are never in short supply, whereas domain-specific data are. Data is a luxury, and people who have it will not give it away unless they have a real business incentive to do so.
New data-driven solution companies have to prove that there is an incentive for sharing data before prospective clients will hand it over, yet they need that data to improve their solution in the first place. This is clearly a chicken-and-egg problem. How can we start this cycle, and how can we showcase our capability to clients if we have no solution to show? This should be the pertinent question within every young data start-up: how can you bootstrap a data-driven product? It is also very bad news for every young data enthusiast who dreams of jump-starting a career with a glorious ML or DL model that solves a pertinent problem in their organization. The reality is that companies recruit new talent (sometimes in an uninformed way) in high hopes of seeing some magic, while the new employees expect the companies to provide them with data. What eventually ensues is deadlock and despair at both ends.
What will a data analyst analyze? What pipelines or models will a machine learning engineer build without data? Machine learning is not a data-efficient process, and you need tons of data just to get a model that is even proposal-worthy. But for a data scientist, this is an opportunity. A data scientist works by understanding the process that generates the data of interest. With the help of a Subject Matter Expert (SME) or a domain specialist, they can come up with mathematical models that simulate data. They can use statistical techniques to infer more data when only a limited number of data points is available. They can design experiments that produce realistic data, or even formulate equations that solve for a set of state variables and use those to compute data for the system at hand. At the very least, they can research similar open datasets available for download online, but it takes skill and experience to choose the right one for your problem.
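To make two of these ideas a little more concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not a recipe: the process model (Newton's law of cooling), every parameter value, the "observed" readings and the function names `simulate_cooling` and `bootstrap_mean_interval` are all hypothetical stand-ins for whatever an SME and your own domain would actually supply. The first part simulates noisy data from a known equation; the second uses bootstrap resampling, one common statistical technique, to estimate uncertainty from a handful of points.

```python
import math
import random
import statistics


def simulate_cooling(t_ambient, t_start, k, times, noise_sd, rng):
    """Simulate noisy temperature readings from Newton's law of cooling.

    The model and all parameter values are illustrative assumptions,
    standing in for the equations a domain expert would provide.
    """
    readings = []
    for t in times:
        ideal = t_ambient + (t_start - t_ambient) * math.exp(-k * t)
        readings.append(ideal + rng.gauss(0.0, noise_sd))
    return readings


def bootstrap_mean_interval(sample, n_resamples, rng, alpha=0.05):
    """Percentile bootstrap interval for the mean of a small sample."""
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, keeping the original sample size.
        resample = [rng.choice(sample) for _ in sample]
        means.append(statistics.fmean(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# 1) Simulated data from a process model (hypothetical parameters).
times = [float(t) for t in range(0, 60, 5)]
simulated = simulate_cooling(20.0, 90.0, 0.05, times, 0.5, rng)

# 2) Uncertainty from a handful of measurements (made-up values).
observed = [71.2, 69.8, 70.5, 72.1, 70.9]
low, high = bootstrap_mean_interval(observed, 2000, rng)
print(len(simulated), round(low, 2), round(high, 2))
```

Neither step creates information out of thin air. The simulation is only as good as the equation behind it, and the bootstrap only describes the uncertainty in the few points you already have, which is exactly why the domain understanding matters more than the tooling.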
A programmer evolves into a data engineer. This does entail a learning curve, mainly for machine learning and data manipulation APIs. People usually bank on do-it-yourself methods complemented by good online courses for it. These tools are brilliant when you have some good data to try them on and, trust me, there are plenty of avenues for that too. But if you want to attack an untouched data space (one with business value) where very little or no data is available for the taking, the tools can’t help you. If you can break this ice, then you can call yourself a data scientist in the true contemporary sense of the term, not as a mere designation. It is now time to really ask yourself: are you a data scientist? Or are you a programmer who is evolving up this hierarchy?