What is Data Engineering: Laravel Edition

Published in Data Engineering Laravel on Dec 27, 2019

Spoiler Alert: It isn't data science, it isn't "Big Data"...and you're probably already doing it.

Introduction & Definition

In the past few years, it seems like "Data Engineering" has been thrown around a lot. In fact, at the time of writing, there are ~87,000 open jobs on Glassdoor.com with the exact title "Data Engineer." Apparently, ~30k of those fuckers are verified to make over $100k a year:

[Screenshot: Glassdoor search results for "Data Engineer" jobs]

With an incredibly reductive lens, the definition for data engineer is pretty simple:

A "Data Engineer" is a developer that is responsible for collecting and preparing data for use by other consumers (e.g: Data Scientists, software applications, BI tools, etc)

That feels pretty straightforward, doesn't it? What does a more complicated definition look like? Maybe something like this:

"Data Engineers" are responsible for the creation and maintenance of analytics infrastructure that enables almost every other function in the data world. They are responsible for the development, construction, maintenance and testing of architectures, such as databases and large-scale processing systems.

Oh alright. So they still manipulate and manage data for other people to use, but it's probably a whole lot of data? Not particularly revolutionary. How about this one?

Data Engineers build and manage code pipelines that alter the location, format, or existence of datasets.

I like that one a lot. In my opinion, it clearly states the process and the intent of Data Engineering while leaving room for the complexity or scale of the specific implementation to vary.

Armed with this new definition, let's play a game called:

"Is that Data Engineering?" - The Laravel Edition

Scenario 1: You outgrew a config file

Problem: You have "app settings" data in a config file, but want to move it to the database so it can be managed in an admin panel.

Solution: You write a migration for a new "settings" table, and a route closure in routes/web.php like this:

    Route::get('load-settings-into-table', function () {
        $settings = collect(config('myApp.settings'));

        // Each $item is one entry from the config array, keyed by setting name
        $settings->each(function ($item, $key) {
            $setting = new App\Setting;
            $setting->name = $key;
            $setting->value = $item['value'];
            $setting->category = $item['category'];
            $setting->status = 'Active';
            $setting->save();
        });

        return 'Completed Load';
    });

Then you delete the route closure and the settings array from the config/myApp.php file.

Is that Data Engineering?

Well yeah. I mean it's not particularly elegant. You probably should have handled this with a seeder or something, but you definitely altered the location, format, AND existence of data for use by a software application. You're basically a Data Engineer, my friend.
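
And if you did reach for that seeder instead, a rough sketch might look something like this (same Setting model and config shape as the closure above; it assumes those columns are mass assignable on the model):

    use Illuminate\Database\Seeder;

    class SettingsSeeder extends Seeder
    {
        // One-off load, run with: php artisan db:seed --class=SettingsSeeder
        public function run()
        {
            collect(config('myApp.settings'))->each(function ($item, $key) {
                App\Setting::create([
                    'name' => $key,
                    'value' => $item['value'],
                    'category' => $item['category'],
                    'status' => 'Active',
                ]);
            });
        }
    }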

Scenario 2: Migrate a Blog

Problem: You convince your client to let you rebuild their WP site in Laravel. Hell yes! However, they want all of their old content to be available in the new site.

Solution: You write a new fancy wp:import artisan command to grab those old posts and pull them into the new DB with their original tags. Maybe you handle URL changes, and scrub the content too.

Note: this is a real-life example provided by Freek Van der Herten of Spatie

    class ImportWp extends Command
    {
        ...
        public function handle()
        {
            // Source the original data
            $oldPosts = DB::table('wp_posts')
                ->where('post_status', 'publish')
                ->where('post_type', 'post')
                ->get();

            // Convert and load the new posts
            collect($oldPosts)
                ->each(function (stdClass $oldPost) {
                    $post = Post::create([
                        'title' => $oldPost->post_title,
                        'text' => $this->sanitizePostContent($oldPost->post_content),
                        'wp_post_name' => $oldPost->post_name,
                        'publish_date' => Carbon::createFromFormat('Y-m-d H:i:s', $oldPost->post_date),
                        'published' => true,
                        'tweet_sent' => true,
                        'posted_on_medium' => true,
                    ]);
                    $this->attachTags($oldPost, $post);
                });
        }

        protected function sanitizePostContent(string $postContent): string
        {
            // Strip the old WordPress thumbnail-size suffixes out of image URLs
            $postContent = str_replace('-700x458', '', $postContent);
            $postContent = str_replace('-700x459', '', $postContent);
            return $postContent;
        }

        protected function attachTags(stdClass $oldPost, Post $post)
        {
            // Source the original tags
            $tags = DB::select(DB::raw("SELECT * FROM wp_terms
                     INNER JOIN wp_term_taxonomy
                     ON wp_term_taxonomy.term_id = wp_terms.term_id
                     INNER JOIN wp_term_relationships
                     ON wp_term_relationships.term_taxonomy_id = wp_term_taxonomy.term_taxonomy_id
                     WHERE taxonomy = 'post_tag' AND object_id = {$oldPost->ID}"));

            // Alter the tag format and load into the target DB
            collect($tags)
                ->map(function (stdClass $tag) {
                    return $tag->name;
                })
                ->pipe(function (Collection $tags) use ($post) {
                    return $post->attachTags($tags);
                });
        }
    }
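
(To actually run that, the boilerplate elided at the top of the class would hold the usual command signature, presumably protected $signature = 'wp:import'; given the name above, plus a database connection pointed at the old WordPress tables. Then the whole migration is a one-time php artisan wp:import.)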

Is that Data Engineering?

Definitely!! We collected the "source" data (literally with collect() haha), altered its format (using map, pipe, and str_replace functions), and then loaded the new data into our "target" database.

This couldn't really be any more data engineering than it already is. Sort of.

Scenario 3: Denormalize for Performance (Admin Panel)

Problem: You need to show the total number of "likes" a user has ever received across all of their posts. The data model looks like this (simplified for brevity):

[Figure: data model showing that users have many posts and posts have many likes]
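
In case the image doesn't come through: a user has many posts, and a post has many likes. In Eloquent terms (model names assumed), that's roughly:

    use Illuminate\Database\Eloquent\Model;

    class User extends Model
    {
        public function posts()
        {
            return $this->hasMany(Post::class);   // posts.user_id
        }
    }

    class Post extends Model
    {
        public function likes()
        {
            return $this->hasMany(Like::class);   // likes.post_id
        }
    }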

Each time we need the "all time likes per user," we have to find all of the posts for the given user and then sum the likes on all of those posts.

Easy, right? Under the hood, Eloquent can join all of the needed data and return an answer for you. Unfortunately, you also need to display the top 50 users by "all time likes" in an admin dashboard. Joining and aggregating across the users, posts, and likes tables on every dashboard load gets expensive as those tables grow, and things get worse still once you start applying sorting/filtering/ranking on top of it. The query starts taking too long and your marketing team is pissed.
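
Just to make the pain concrete, the naive "top 50" query might look something like this (a sketch; table and column names assume Laravel conventions):

    // Joins users -> posts -> likes and aggregates on every single dashboard load
    $top_users = User::query()
        ->leftJoin('posts', 'posts.user_id', '=', 'users.id')
        ->leftJoin('likes', 'likes.post_id', '=', 'posts.id')
        ->select('users.*', DB::raw('COUNT(likes.id) as all_time_like_count'))
        ->groupBy('users.id')   // allowed on MySQL 5.7+, since users.id is the primary key
        ->orderByDesc('all_time_like_count')
        ->take(50)
        ->get();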

Solution: Maybe you could add a posted_by_user_id field to the likes table? That would help a bit. Maybe you could just optimize the query with tools like Laravel Debugbar? Both decent options. Instead, you decide to add an all_time_like_count column to your users table. To keep it in sync, you create a LikeObserver with php artisan make:observer LikeObserver, and update the new column for a given user every time one of their posts is "liked." (Technically, you could have used a scheduled artisan command to recalculate it instead.) Either way, now you can just run something like $top_users = User::orderBy('all_time_like_count', 'DESC')->take(50)->get(); and your dashboard is snappy again.
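
Here's roughly what that observer might look like (a sketch; it assumes Like belongsTo Post and Post belongsTo User):

    class LikeObserver
    {
        public function created(Like $like)
        {
            // A post was just liked: bump the denormalized counter on its author
            $like->post->user->increment('all_time_like_count');
        }

        public function deleted(Like $like)
        {
            // Keep the counter honest when likes are removed
            $like->post->user->decrement('all_time_like_count');
        }
    }

    // Registered once, e.g. in AppServiceProvider::boot():
    // Like::observe(LikeObserver::class);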

Is that Data Engineering?

Yes. We denormalized and aggregated our users' posts' like data to optimize performance. Sounds pretty data engineer-y.

Still not sure if you're a data engineer?

If you've been a Laravel developer for a while, odds are you've encountered at least one of the above scenarios before. If not, there are countless other examples of data engineering in most production Laravel applications. Surely you've tackled one of them:

  • Exporting/syncing users to Salesforce or Mailchimp
  • Processing user uploaded CSVs and manipulating rows before merging into your database (see the sketch after this list)
  • Piping data to and from Algolia for full text search
  • Collecting and visualizing log data
  • Even backing up data to S3 applies
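
To pick on the CSV one, since it comes up constantly, a stripped-down version might look something like this (the Subscriber model, column names, and upload field are made up for illustration; it assumes every row has the same columns as the header and that those columns are mass assignable):

    Route::post('subscribers/import', function (Illuminate\Http\Request $request) {
        // Source: the uploaded CSV, read into arrays of cells
        $rows = array_map('str_getcsv', file($request->file('csv')->getRealPath()));
        $header = array_shift($rows);   // first row holds the column names

        collect($rows)
            ->map(function ($row) use ($header) {
                return array_combine($header, $row);   // e.g. ['email' => ..., 'name' => ...]
            })
            ->each(function ($row) {
                // Alter the format (trim/normalize), then merge into the target table
                App\Subscriber::updateOrCreate(
                    ['email' => strtolower(trim($row['email']))],
                    ['name' => trim($row['name'])]
                );
            });

        return 'Completed Load';
    });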

In a future post, I want to dive into some extreme examples. Seeing advanced or deliberate implementations of data engineering will help us better understand the full spectrum. These concepts can get very complicated and can involve massive datasets, but many of the same concepts you already know will still apply, I promise.

Sign up for my newsletter if you like the articles. It makes me feel good...😉