While working for a Fortune 500 firm, my team was tasked with creating an application to pull data from one of their Hadoop servers and perform some analysis. What we learned in the process is that even with the best intentions, if you don’t plan your Big Data implementation properly, you can waste a lot of time, and still end up with a solution that falls well short of expectations.
Even though Hadoop is open source, it requires a large investment of time to administer and maintain. Employee time is usually a company's most expensive commodity, and yet it is so often wasted without a second thought. A wasted day of a developer's time costs nearly as much as a low-end PC; a wasted week, as much as a workstation. An on-premises Big Data system like Hadoop requires knowledgeable staff, constant monitoring and a certain level of expertise to keep a Big Data project from becoming a huge waste of time.
There is nothing particularly magical that differentiates a Big Data system like Hadoop from a traditional relational database like Oracle. They both run on the same hardware and operating systems. The difference lies in how each approaches data management. Hadoop’s approach, via Hive databases, employs file partitioning to take advantage of its distributed architecture: the data is stored as files arranged in a directory hierarchy, with one directory per partition value. Whereas a relational database relies on specialized indexes to locate a record, Hive navigates the file system itself. In either case, if the index or partition structure doesn’t support a particular query, a table scan is performed. In a relational database, that means reading a single, relatively compact table. In Hadoop, it means reading each of a potentially huge number of files scattered across the file system. To have a viable Big Data system, the data has to be partitioned in such a way that queries can take advantage of the partition scheme and minimize the number of files read.
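To make the idea concrete, here is a minimal sketch of partition pruning against a toy, Hive-style directory layout. The column name, values and paths are invented for illustration, and nothing here touches a real cluster; the point is simply that a predicate on the partition column skips whole directories, while any other predicate forces every file to be opened.

```python
import os
import tempfile

def write_toy_warehouse(root):
    """Create a tiny partitioned layout: <root>/dt=YYYY-MM-DD/part-0000.csv"""
    for dt in ("2017-01-01", "2017-01-02", "2017-01-03"):
        part_dir = os.path.join(root, f"dt={dt}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-0000.csv"), "w") as f:
            f.write(f"{dt},WIDGET-123,42.0\n")

def files_to_read(root, partition_filter=None):
    """Return the data files a query would have to open.

    A filter on the partition column lets us skip whole directories
    (partition pruning). Any other predicate gives no pruning: every
    file under every partition directory has to be scanned.
    """
    selected = []
    for part_dir in sorted(os.listdir(root)):
        key, _, value = part_dir.partition("=")
        if partition_filter and partition_filter.get(key, value) != value:
            continue  # pruned: this partition cannot contain matching rows
        full = os.path.join(root, part_dir)
        selected.extend(os.path.join(full, name) for name in os.listdir(full))
    return selected

warehouse = tempfile.mkdtemp()
write_toy_warehouse(warehouse)

# A filter on the partition column reads one file out of three...
print(len(files_to_read(warehouse, {"dt": "2017-01-02"})))  # -> 1
# ...any other predicate (say, on a product id) reads all of them.
print(len(files_to_read(warehouse)))                        # -> 3
```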
In our case, the data engineers, interested in making sure the ingest went quickly, partitioned the data by time (year, month, day, hour). Our queries, however, had no time component in them. Rather, as a manufacturing company, we were interested in process steps, product identifiers and the like. From the beginning, because we couldn’t use the existing partition scheme, we had trouble pulling data. Over time, we developed data marts covering specific products and processes, but they never scaled beyond the prototype stage. It was an impasse.
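In HiveQL terms, the mismatch looked roughly like this. This is a hypothetical sketch with invented table, column and path names: the only difference between the two layouts is the PARTITIONED BY clause, yet that clause decides whether our typical query prunes down to a single directory or reads them all.

```python
# Hypothetical HiveQL, held in Python strings; run it with any Hive client.

# Ingest-friendly layout: fast to load, partitioned by arrival time.
DDL_BY_TIME = """
CREATE EXTERNAL TABLE IF NOT EXISTS readings_by_time (
    product_id   STRING,
    process_step STRING,
    reading      DOUBLE
)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
STORED AS PARQUET
LOCATION '/data/readings_by_time'
"""

# Query-friendly layout for our workload: partitioned by what we filter on.
DDL_BY_PROCESS = """
CREATE EXTERNAL TABLE IF NOT EXISTS readings_by_process (
    reading    DOUBLE,
    reading_ts TIMESTAMP
)
PARTITIONED BY (product_id STRING, process_step STRING)
STORED AS PARQUET
LOCATION '/data/readings_by_process'
"""

# Our typical question.
TYPICAL_QUERY = """
SELECT process_step, AVG(reading)
FROM   readings_by_time
WHERE  product_id = 'WIDGET-123'
GROUP  BY process_step
"""
# Against readings_by_time this WHERE clause prunes nothing: every
# year/month/day/hour directory is read. The same predicate against
# readings_by_process touches a single partition directory.
```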
The Hadoop system we used also had a particularly nasty defect. Both R and Tableau use Open Database Connectivity (ODBC) to access Hadoop, and to support ODBC our cluster had a Knox server acting as a gateway. Under a certain load caused by a long-running query, however, the Knox server would block all incoming connections until the query finished or was killed. In the meantime, it reported back to the user that there was a security-related access problem.
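On the client side, the failure surfaces as an authentication error, so about the only sensible defense is to treat it as transient and retry. Here is a minimal sketch, assuming the pyodbc driver and a pre-configured ODBC DSN; the DSN name, timeouts and retry counts are placeholders.

```python
import time
import pyodbc  # the same ODBC route that R and Tableau use

def connect_with_retry(dsn="HiveViaKnox", attempts=5, wait_seconds=30):
    """Retry ODBC connections that fail while the Knox gateway is saturated.

    When Knox is blocked by a long-running query, connection attempts come
    back looking like security failures, so we treat them as transient
    rather than assuming our credentials are bad.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # 'timeout' is pyodbc's login timeout, in seconds.
            return pyodbc.connect(f"DSN={dsn}", timeout=60)
        except pyodbc.Error as exc:
            last_error = exc
            print(f"attempt {attempt} failed ({exc}); retrying in {wait_seconds}s")
            time.sleep(wait_seconds)
    raise last_error

# conn = connect_with_retry()
# ...then hand 'conn' to pandas.read_sql(), or point R/Tableau at the same DSN.
```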
The team wrestled with this problem for months without knowing the real cause. Things would be fine, then we couldn’t connect for hours at a time. Our Hadoop administrators, also thinking it was a security problem, would reboot the Knox server or apply some new patch, hoping the problem would go away. It never did. After months of fruitless attempts to fix things, management decided the solution to the Knox problem was to abandon Hadoop entirely in favor of a newly purchased Teradata system.
However, Teradata is yet another Big Data system with its own limitations. Teradata distributes rows across its parallel units by hashing on a primary index, and, much like Hadoop, it is very fast when the data is laid out to match the queries. If it isn’t, the server generates spool-space errors or falls back to relatively slow processing that engages only a minimal number of nodes. This can be mitigated by careful choice of primary and secondary indexes, but only to a degree. In our case, we had to implement so many secondary indexes to get the latency down to acceptable levels that our high-performance, very expensive Teradata system got bogged down managing them and was unable to make maximum use of its hashing. Worse, our original SQL Server-based prototype still outperformed the Teradata system, at roughly 1/3000th the cost.
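For readers unfamiliar with Teradata, the trade-off lives in the DDL. Here is a minimal sketch, assuming the teradatasql driver and using invented database, table and column names plus placeholder connection details: the PRIMARY INDEX drives the hashing, and every secondary index added for another access path is one more structure the system has to maintain.

```python
import teradatasql  # Teradata's Python DB API driver; any SQL client works here

# The PRIMARY INDEX determines how rows are hashed across the parallel units
# (AMPs): filter on it and the work is routed straight to the right place;
# filter on anything else and the query either scans everything or leans on
# a secondary index.
CREATE_TABLE = """
CREATE TABLE mfg.readings (
    product_id   VARCHAR(32),
    process_step VARCHAR(32),
    reading_ts   TIMESTAMP(0),
    reading      FLOAT
)
PRIMARY INDEX (product_id)
"""

# Each secondary index buys one more fast access path, at the cost of storage
# and upkeep; pile on too many and the box spends its time maintaining
# indexes instead of exploiting its hashing.
CREATE_NUSI = "CREATE INDEX (process_step) ON mfg.readings"

with teradatasql.connect(host="td.example.com", user="analyst", password="***") as con:
    cur = con.cursor()
    cur.execute(CREATE_TABLE)
    cur.execute(CREATE_NUSI)
```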
In summary, when planning a Big Data implementation, one should consider:
- Is the staff sufficiently experienced to handle the varied maintenance and security issues certain to come up when an implementation goes into production? It is one thing to install Hadoop, but quite another to integrate it seamlessly into your corporate infrastructure.
- Do the developers understand how best to utilize the architecture, and most importantly, know when it is totally unsuitable for the task at hand?
- Does the architecture support how the data will be retrieved, including the filters, updates and data reorganizations you expect to need?
- Is the use case an appropriate use of the technology? Would a “smaller” data system better handle the requirements?
Among many companies new to Big Data, there is a misperception that Hadoop can do everything any other commercial system can do, only faster and better. Hadoop does certain things quite well with its massively parallel architecture, but there are always trade-offs. By understanding those trade-offs, companies can plan how best to integrate Hadoop as part of a larger commercial and open source implementation. In the end, they can save that most precious of resources, time, and spare themselves a lot of embarrassment.