Like Philosophy? Try Data Engineering
“Data engineering can be very philosophical.”
That’s according to Mark Magpayo, the principal innovations engineer at Smarter Sorting, a product intelligence platform.
For the uninitiated, it might seem strange that budding developers with a flair for philosophy would want to consider a data engineering career, a path that Magpayo himself took.
The Outlook of Data Engineering
Perhaps this metaphysical quality stems from the fact that data engineering is still a developing discipline. These professionals are akin to plumbers, building sound data pipelines and keeping them well-maintained and unclogged. The work can make a big impact, and it also raises some genuinely intellectual questions.
“You’re constantly defining the meaning of data: What is useful and what isn’t for your intended audience,” said Magpayo. “You have to figure out when it’s broken, what it means to be broken, and justify how you know it’s broken.”
Determining the best way to define, use and view a set of data for users is a surprisingly profound exercise. In practice, Smarter Sorting harnesses data to provide guidance for manufacturing, moving, marketing, selling and handling products throughout the supply chain. It’s a complex system to sort, but the impacts are tangible for both retailers and manufacturers. With properly harnessed data, more waste can be diverted from landfills, more donations can be made to charity and more compliance infractions can be avoided.
Built In Colorado checked in with Magpayo for more insights into the philosophical nature of a career in data engineering.
What led you to a career as a data engineer?
My career took twists and turns to get to this point. I started as a developer and was interested in databases. I almost became a database administrator when I decided to join a startup in a software engineer role. Not long after the IT systems director moved on, I became responsible for running our data center on top of my developer duties. This naturally led to my next roles in DevOps and finally as a principal engineer at Smarter Sorting.
In every role I’ve held, I’ve had to gather and present data in many different ways: which features drew the most time spent on our social media platform, metadata from the most-watched videos on our platform, the chemical makeup of a product. I never set out to be a data engineer explicitly; my diverse path just happened to get me here.
Describe a project you’re working on right now. What is challenging about it?
Currently, I’m working on getting data from all our disparate sources into a Snowflake warehouse. A lot of our applications were built with different data stores, such as PostgreSQL and MongoDB. We’re also trying to pull in data from third-party sources such as Stripe, HubSpot, and product aggregators like GS1 and 1WorldSync. The short-term goal is to get all of our data into one location for analysis. The long-term goal is to use this data to build what we call product intelligence, where we better understand the chemical and physical attributes of consumer products.
This is a challenging feat because equivalent data is often repeated across sources but represented in different ways. More challenging is running into data that does not follow standard formats. Parsers need to be continually refined, and broken data needs to be identified quickly. Those two parts alone can make data engineering feel like a continual game of cat and mouse, but being able to provide the proper underlying data is key to the overall goal of Smarter Sorting: to help retailers and brands better make, market and move consumer products.
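Identifying broken data quickly, as Magpayo describes, can start with simple structural checks on incoming feeds. Here is a minimal, hypothetical sketch — the file name, columns and contents are invented for illustration — that uses awk to flag any row whose field count disagrees with the header:

```shell
# Invented sample feed: a delimited product export with two malformed rows.
cat > feed.csv <<'EOF'
sku,weight_kg,flash_point_c
A100,0.5,60
A101,1.2
A102,0.8,45,extra
EOF

# Flag rows whose column count differs from the header's -- a cheap
# structural invariant that surfaces "broken" records early.
awk -F',' 'NR==1 {cols=NF; next}
           NF!=cols {print "line " NR ": expected " cols " fields, got " NF}' feed.csv
```

A check like this won’t catch semantic problems, but it cheaply narrows the cat-and-mouse game to rows that are definitely malformed.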
“It can feel like a continual game of cat and mouse, but being able to provide the proper underlying data is key.”
What are the key technical skills that you use most often during your workday?
Scripting is the technical skill I probably use the most — shell scripting in particular. Many of the utilities available in most shells, such as awk, grep and sed, go a long way toward getting data formatted properly.
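As a concrete illustration of that kind of shell work — the file, columns and values below are invented for this sketch — sed, grep and awk can combine to normalize a messy CSV export: trimming stray whitespace, dropping records with an empty name field, and lowercasing a category column:

```shell
# Invented sample export with inconsistent spacing, case and an empty field.
cat > products_raw.csv <<'EOF'
id, name ,category
1, Acme Cleaner ,Household
2,PAINT THINNER, solvents
3,, unknown
EOF

sed 's/ *, */,/g' products_raw.csv \
  | grep -v '^[0-9]*,,' \
  | awk -F',' 'NR==1 {print; next} {print $1 "," $2 "," tolower($3)}'
# sed:  trim spaces around commas
# grep: drop rows whose name field (second column) is empty
# awk:  lowercase the category column and reprint as CSV
```

Each tool does one small job, which is what makes pipelines like this easy to adjust as upstream formats drift.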
After that, Python and SQL are important for doing extract, transform and load (ETL) data integration and for understanding how to store data in a database or data warehouse. Having a solid understanding of cloud services allows you to come up with different ways of presenting data to your stakeholders, whether it’s traditional databases, NoSQL databases, or CSV or TSV files hosted on cloud storage.
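A toy end-to-end version of that ETL idea can be sketched from the shell, using the sqlite3 command-line tool as a stand-in for a real warehouse (all file and table names here are invented; in practice the targets would be systems like Snowflake or PostgreSQL):

```shell
# Invented sample extract from an upstream system.
cat > orders.csv <<'EOF'
id,amount
1,10.50
2,4.25
EOF

# Transform: strip the header row so the loader sees only data.
sed '1d' orders.csv > orders_body.csv

# Load and query with SQL -- sqlite3 here is just a toy warehouse stand-in.
sqlite3 :memory: <<'EOF'
CREATE TABLE orders (id INTEGER, amount REAL);
.mode csv
.import orders_body.csv orders
.mode list
SELECT COUNT(*), SUM(amount) FROM orders;
EOF
# prints: 2|14.75  (row count and order total)
```

The shape is the same at any scale: extract files from sources, apply light transforms, load into one queryable store, then let SQL do the analysis.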