It was my great honor to spend a fantastic three-month summer internship with the Query Optimizer Group at Teradata in El Segundo. During my time there, I learned not only how to work on a specific topic, conduct experiments, and write a scientific paper, but, more importantly, how Teradata runs its business successfully as a leading global company in data warehousing and data analytics.
I was assigned a very challenging but interesting topic: Multidimensional Cardinality Estimation. (I would say selectivity estimation is one type of cardinality estimation, in which we try to estimate the number of results returned by a particular query.) I was so fascinated by this topic that I was able to understand the concepts and implement a naive approach, with some encouraging deliverable results, during the first week. My motivation to work on this topic was the important role cardinality estimation plays in relational databases (Teradata Database, Oracle, MySQL, ...). Put simply, executing a query involves two steps: query parsing and query processing, which happens in the data manipulation language (DML) layer. There can be multiple ways to execute a query, so-called query execution plans; however, there is only one optimal plan, the one that executes in the shortest time. Importantly, if we know the number of results a query will return in advance (by estimation), it is more likely we can choose the optimal plan or a close-to-optimal plan. For example, suppose a query has three execution plans: Plan A takes 10 days to complete, Plan B takes 1 day, and Plan C takes 1 hour. If the result set size were given in advance, we would know Plan C is the best plan. However, we don't know the result set size until the query runs. It is like a chicken-and-egg problem! So we try to estimate the result set's size: the more accurately the result set size is estimated, the more likely the query optimizer chooses the optimal plan. The problem thus becomes estimating the result set size of a query, so-called selectivity estimation. Note that a little improvement in selectivity estimation could make a great difference in business execution (10 days vs. 1 hour).
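To make the idea concrete, here is a minimal sketch of the kind of estimate an optimizer relies on: the textbook uniformity assumption for a range predicate. The function name and numbers are hypothetical illustrations, not part of my project or of Teradata's actual optimizer.

```python
# Hypothetical sketch: estimating the selectivity of a range predicate
# (col BETWEEN lo AND hi) under the classic uniformity assumption.

def uniform_selectivity(lo, hi, col_min, col_max):
    """Fraction of rows expected to satisfy lo <= col <= hi,
    assuming values are spread uniformly over [col_min, col_max]."""
    if col_max <= col_min:
        return 1.0
    # Length of the overlap between the query range and the column's domain.
    overlap = max(0.0, min(hi, col_max) - max(lo, col_min))
    return overlap / (col_max - col_min)

# A column with values in [0, 1000]; predicate: 100 <= x <= 200.
sel = uniform_selectivity(100, 200, 0, 1000)
estimated_rows = round(sel * 1_000_000)  # table has 1M rows -> 100000
```

The optimizer multiplies this selectivity by the table's row count to predict the result set size, and that prediction is what drives the choice between plans A, B, and C above.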
I then dug deeper into histograms, a specific approach to selectivity estimation, implemented various histogram algorithms, and came up with our own algorithms. Extensive experiments on various datasets showed the superiority of our algorithms over the current approaches.
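As a flavor of what histogram-based estimation looks like, the sketch below implements the standard textbook equi-width histogram with uniformity assumed within each bucket. This is a baseline technique from the literature, not one of the algorithms we developed at Teradata, and all names are my own.

```python
# Sketch of equi-width histogram selectivity estimation (textbook baseline).

def build_equi_width_histogram(values, num_buckets):
    """Split the value domain into equal-width buckets and count values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0  # guard against a constant column
    counts = [0] * num_buckets
    for v in values:
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return lo, width, counts

def estimate_range_count(hist, q_lo, q_hi):
    """Estimate how many values fall in [q_lo, q_hi], assuming values
    are uniform within each bucket (interpolate partial overlaps)."""
    lo, width, counts = hist
    total = 0.0
    for i, count in enumerate(counts):
        b_lo, b_hi = lo + i * width, lo + (i + 1) * width
        overlap = max(0.0, min(q_hi, b_hi) - max(q_lo, b_lo))
        if overlap > 0:
            total += count * overlap / width
    return total

data = list(range(1000))                 # 1000 evenly spread values
hist = build_equi_width_histogram(data, 10)
est = estimate_range_count(hist, 100, 300)  # true answer: 201 values
```

Equi-width buckets work well on uniform data like this but degrade badly on skew, which is exactly why the literature (and our project) moves toward more adaptive bucket boundaries.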
In the second part of this post, I would like to share my working experience at the Optimizer Group, mostly the logistics. As an intern, I was in charge of a research problem, supported by a mentor and a manager. I worked directly with the mentor, talking to him every couple of days about ideas and execution. Every week I reported my progress to my manager, cc'ing the mentor, and received feedback from both of them. Although I was solely responsible for the project, in terms of reading materials and coding, the support from my mentor and my manager helped me constantly move forward. Fortunately, my project caught the attention of the director and the architect of the Optimizer Group. They joined many of the weekly meetings and actively contributed their ideas to the project. Their support turned out to be one of the main reasons I worked hard and tried to deliver good results.
I also valued the opportunity to attend every seminar of the Optimizer Group, as well as the presentations that the CEO, Vice Presidents, and others gave to interns like me. From what I observed, Teradata is indeed a leading corporation in data warehousing and data analytics, with its biggest customers among the top 500 companies, such as Apple and Walmart. Following global trends, Teradata recently acquired other companies, for example Aster Data, with the ambition to explore and then lead a new area of unstructured data, such as text analytics.