Hello, my name is Victor Liu. I am a Solutions Engineer for Global Philanthropy at Datameer.org, helping researchers, academia, and non-profits find the insights that matter using data. Today I am here to walk you through a production-environment, personalized music recommendation project that was done for Pono Music. So what is Pono Music? Pono Music is revolutionizing music listening by bringing back native, high-resolution audio: the way that artists intended and recorded their music. Neil Young, the iconic artist and musician, is also the founder and CEO. Pono Music is the only complete high-resolution music ecosystem, which includes the Pono player, the Pono Music store, and the Pono Music community. They have been super generous in allowing us to take a look behind the scenes at how we built out the recommendations. Datameer supports their mission by providing a music recommendation engine that is both scalable and flexible as Pono grows their user base. Let's begin by visiting Pono Music's website. As you can see, they have a few shelves that focus on various recommendations that are not tailored to each individual user. If you log in, an additional shelf appears that delivers the recommendations created in Datameer. So how does Datameer accomplish this? In a nutshell, Datameer is an end-to-end analytics platform which contains ingestion, preparation, analysis, visualization, and operationalization, all in the same platform. Having all the capabilities in one platform is superior to a traditional technology stack, because integration between disparate technologies brings a lot of unnecessary challenges. We take advantage of open source big data technologies: specifically, we run natively on Hadoop and Spark for our back end, and we leverage D3.js on the front end for visualizations. Our Excel-like interface makes it easy for folks who don't know how to code to also get in on the data modeling. It is very much self-service.
In the case of Pono Music's Datameer deployment, we have deployed on Hadoop on Amazon Web Services. But Datameer can also be deployed on premises if needed; it doesn't have to be in the cloud. In fact, it works with many different distributions of Hadoop. We can access Datameer through any modern web browser. As you can see here, we're using Google Chrome, but it works on any browser. So let's go ahead and log into Datameer. What you're seeing here is the browser tab, and we can see various artifacts, which include connections such as a connection to Pono's salesforce.com instance or a connection to an AWS S3 instance. Datameer comes with many prebuilt connectors, including connectors to salesforce.com and S3, so we can easily grant access to those systems in order to pull or push data in and out of the deployment. For example, let's take a quick look at the salesforce.com configuration here. As you can see, we've already configured Salesforce as the connector type. When we hit Next here, we'll see that you can authorize Datameer to retrieve data. We're not going to do it, because this is a production environment and I don't want to disrupt Pono's systems, but if we were to click it, it would give us a salesforce.com login screen. As soon as we logged in, Datameer would have access via the OAuth token, which is valid for a period of time. I'm going to go ahead and hit the Cancel button, as this is a production environment. Now we're ready to look at an import job artifact. This is the Salesforce import job, and as you can see, we go in here and configure it. We can see that it's connected to the Pono Music SFDC connection. I'm going to go ahead and hit Next here. With salesforce.com, we simply have to define the SOQL, which is a querying language based on SQL, to locate the data contained within the salesforce.com objects.
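The actual query isn't shown in full in this demo, but a minimal SOQL statement of the following shape illustrates the idea. The object and field names here are hypothetical stand-ins, since Pono's real Salesforce schema isn't public:

```sql
-- Hypothetical SOQL: Purchase__c and its fields are illustrative names,
-- not Pono's actual Salesforce objects.
SELECT Owner_Email__c, Album_Id__c, Purchase_Date__c
FROM Purchase__c
WHERE Purchase_Date__c >= LAST_N_DAYS:365
```

As in SQL, the SELECT clause picks the fields, FROM names the Salesforce object, and WHERE restricts the rows pulled into the import job.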
You can see the SELECT statement here, the familiar FROM statement, and the WHERE statement as well. In order to protect the privacy of the users, Datameer has the capability to obfuscate columns that are sensitive. In other words, we can scramble sensitive fields such as email or physical addresses. At this point I'm going to go ahead and cancel out, and I'm going to show the next part of this demo in an obfuscated workbook environment. The artifacts contained in the obfuscated folders are duplicates of the import job and the workbook from their production counterparts, as you can see here. In the import job settings, we have configured the same connection, but with obfuscated columns: we're obfuscating the owner email as well as the street address. Now we hit Next here, and you can see various fields being pulled in from Salesforce. In particular, I want to highlight the owner email field; as you can see, these are the obfuscated values. Similarly, the actual street address is also masked. I'm going to go ahead and hit Cancel, and we're going to check out the obfuscated workbook environment, which is where we do the preparation and analysis. As you can see, we're starting with a raw source of data here; this is the data that was pulled in by the import job. Take note that there is a variety of icons that indicate different types of sheets: this is the raw data type, this is a regular worksheet, you have a union sheet, and you have a join sheet, and we'll go into a little more detail as we get to them. As you can see, the interface is very much like Excel. Looking at the second sheet, Pairs, it's using a function called group by. The group by function is a type of aggregation function in Datameer speak.
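To make the obfuscation idea concrete, here is a minimal sketch of how a sensitive column can be scrambled while staying usable for joins and group-bys. This is my own illustration, not Datameer's actual masking algorithm; the field names and salt are hypothetical:

```python
import hashlib

def obfuscate(value: str, salt: str = "demo-salt") -> str:
    """Replace a sensitive value with a stable, irreversible token.

    The same input always maps to the same token, so grouping and
    joining on the column still work, but the original value cannot
    be recovered from the output.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]

# Hypothetical row: mask only the sensitive fields, keep the rest.
row = {"owner_email": "fan@example.com", "street": "1 Main St", "album_id": "A42"}
masked = {k: (obfuscate(v) if k in {"owner_email", "street"} else v)
          for k, v in row.items()}
```

The key property is determinism: every occurrence of the same email obfuscates to the same token, so the downstream workbook logic behaves identically on masked and unmasked data.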
As you can see, you can build the functions by pointing and clicking, but you can also enter the formula to create nested functions or apply arithmetic operations, just as in a cell. We take all the unique combinations of albums purchased for each unique email using a group pairs function, then pull out each of the elements of the pairings using a list element function. If you were to build a function from scratch, you can see it here: you create a group by function, direct it at a column, and basically pass it the arguments. To find out the occurrence of each pairing of albums, we're going to look at this sheet. We're using group by functions here as well: we group by the first item and then the second item, and then do a group count, which counts the frequency of occurrences of those pairs. Then we do a similar thing, but with item two first and then item one, followed by a group count. The reason is that we don't care about the order, so we repeat this on the next sheet, reversing the order of the albums, to make sure we capture all the combinations. Then we use a union sheet to append the second frequency-counting sheet to the first one, as you can see here. It's drag and drop, of course, and we've connected them together. You can union multiple sheets, but in this case we only have two. We ignore the co-occurrence sheet, as it is currently not used in the final output; we were experimenting with other recommendation models while rebuilding these workbooks. One of the really cool features is that we can execute the workbook but deselect the storage of intermediate sheets. This means that only the relevant functions are executed, according to the data lineage of the columns contained in the selected sheets.
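The pairing-and-counting steps just described can be sketched in a few lines of plain Python. This is a simplified stand-in for the workbook sheets, with toy data; Datameer's actual spreadsheet functions work differently under the hood:

```python
from collections import Counter
from itertools import combinations

# Toy purchase history: email -> albums bought (hypothetical data).
purchases = {
    "u1": ["A", "B", "C"],
    "u2": ["A", "B"],
    "u3": ["B", "C"],
}

# Unique album pairs per user. Order doesn't matter to us, so we count
# both orderings -- mirroring the two frequency sheets that get unioned.
pair_counts = Counter()
for albums in purchases.values():
    for a, b in combinations(sorted(set(albums)), 2):
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
```

After this step, `pair_counts` plays the role of the unioned frequency sheet: for every ordered pair of albums, it holds how many purchase histories contain both.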
Moving on, we create a separate sheet from the raw data that counts the number of times an album occurs across all of the purchases. As you can see here, we group by album and then do a count. We bring the co-occurrences and the frequency together using the join feature. As you can see, it is a drag-and-drop interface: you've got the obfuscated workbook, each of the worksheets, and within each worksheet various columns, and you can drag and drop them in here to do the joins. You can do different types of joins (inner, outer, and so on), multi-column joins, and multi-sheet joins, and you can select which columns you want to keep during the join. I'll go ahead and cancel out here. Datameer does not allow us to add additional functions to a join sheet, so we duplicate it by right-clicking and hitting the Duplicate button. Then we create the next sheet, which contains columns linked back to the join sheet; this becomes the rating sheet. We simply add a calculation: the number of times the co-occurrence occurs, divided by the number of times that the first album in the column appears throughout the data set. The idea is to give a high recommendation score to albums that appear frequently together, while simultaneously penalizing albums that are too common. At this point, we write a few more joins that bring together the album ID, the email, and the recommended album ID based on the album ID and email, as you can see here. Before we finish, we also do an anti-join by performing a left outer join on the album ID and the email (this, by the way, is a two-column join) and then applying a filter on the obfuscated owner field, keeping only the rows where it is empty.
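The rating calculation and the anti-join can be sketched as follows. This is a minimal illustration with hypothetical inputs standing in for the workbook sheets; the variable names are mine, not Datameer's:

```python
# Co-occurrence counts (both orderings) and per-album purchase frequency,
# standing in for the unioned frequency sheet and the album-count sheet.
co_counts = {("A", "B"): 2, ("B", "A"): 2, ("A", "C"): 1,
             ("C", "A"): 1, ("B", "C"): 2, ("C", "B"): 2}
album_freq = {"A": 2, "B": 3, "C": 2}

# Rating = co-occurrence count / overall frequency of the first album.
# Pairs that appear often together score high; albums that are simply
# very common everywhere are penalized by the denominator.
ratings = {
    (first, second): count / album_freq[first]
    for (first, second), count in co_counts.items()
}

# Anti-join: drop recommendations the user has already purchased.
owned = {("u1", "A"), ("u1", "B")}

def recommend(user, album):
    return [
        (second, ratings[(first, second)])
        for (first, second) in ratings
        if first == album and (user, second) not in owned
    ]
```

Calling `recommend("u1", "A")` keeps only album C, because user u1 already owns B, which is exactly what the anti-join in the workbook accomplishes.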
So finally, we use a few group by functions to do some de-duplication: we make sure that the recommended albums are unique for each user, and then we use a group max function in order to preserve the highest rating for each of the unique recommendations. The final desired output is on the recommendations sheet. It leverages the group top N function, as you can see here, because we only want to keep the top 20 recommendations for each user. In addition, we also create an email-plus-row-number field, which is a key needed for salesforce.com, generated by a simple concatenation function here. Note that Datameer is big on data governance and data lineage: as you can see, we can keep track of how all the data flows through, from the raw data set all the way to the end product. Each workbook's configuration can also be exported in JSON format to keep track of revision history. Now we're ready to operationalize things, so how do we do it? Let's exit out of the workbook and go back to the production environment. As you can see, the workbook recalculation retriggers when the import job is completed. We also only retain the latest results, so we select "purge historical results," as you can see here. As I mentioned earlier, Datameer makes it easy to operationalize a workbook because we only select the final output sheets. You can see everything is unchecked except for the recommendations sheet, and Datameer will automatically determine the critical path of intermediate sheets needed to produce the desired output sheet. Once the recommendation workbook is calculated, this triggers the export job run. I'm going to move to the export job artifact: exit out of this, without saving, and we'll see the export job. Let's go ahead and configure it: hit Next, and you can see the CSV output settings.
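The top-N selection and the concatenated Salesforce key can be sketched like this. Again, this is a toy illustration with made-up rows, not the workbook's actual function calls, and I've used a top-2 cutoff instead of the real top-20 just to keep the example small:

```python
from collections import defaultdict

# Hypothetical (email, album_id, rating) rows after de-duplication.
rows = [
    ("u1", "C", 0.50), ("u1", "D", 0.90), ("u1", "E", 0.70),
    ("u2", "C", 0.40),
]

TOP_N = 2  # the workbook keeps the top 20; 2 is enough for a toy example

by_user = defaultdict(list)
for email, album, rating in rows:
    by_user[email].append((album, rating))

recommendations = []
for email, recs in by_user.items():
    recs.sort(key=lambda r: r[1], reverse=True)  # highest rating first
    for rank, (album, rating) in enumerate(recs[:TOP_N], start=1):
        # Concatenated email + row number: the unique key salesforce.com
        # needs to upsert each recommendation row.
        recommendations.append(
            {"key": f"{email}-{rank}", "album_id": album, "rating": rating}
        )
```

Because the key combines the email with the per-user rank, each user contributes at most `TOP_N` stable, unique keys, which is what makes the repeated pushes back into Salesforce idempotent.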
The export job is how we push the results of the recommendations to S3. We elected to use S3 because we are already hosting on Amazon Web Services and S3 is an affordable storage solution. We cannot push directly to salesforce.com at present because it's a pull-only connection. Therefore, we need to push the data to a storage area from which it can be picked up on a schedule using Apex. A general challenge here is the size limitation on each pull. Fortunately, Datameer comes with the option to push files broken into one-megabyte chunks: as you can see here in the advanced settings, 1 megabyte, and then we set up a consecutive numbering scheme, a dynamic naming convention using timestamps and numbering for each of the chunks. So that sums up the current co-occurrence recommendation deployment. As you've seen, we have easily integrated data from salesforce.com, created a recommendation model, and set up an automatic push to S3 for easy consumption back into salesforce.com. Finally, we operationalized this by triggering each of the steps sequentially, all sitting on top of the powerful Hadoop platform. What's next? Well, Pono Music is working on implementing Google Analytics tracking for each user on the album pages. Let's go back to the Pono Music store. Just as an example, let's take a look at the Led Zeppelin II album. As you can see, the album IDs that we were looking at earlier are actually embedded in the URL. This means that in the near future, we can adapt recommendations based on buying behavior into recommendations based on both buying and browsing behavior. We could look at recommendations based on genre; perhaps at the most recent purchases or browsing behavior in the last three, six, or nine months; or we could use album metadata such as album release date to give additional recommendations.
Also, it is quite simple to just duplicate the recommendations workbook in Datameer to try any number of these options. That means we can do a lot of A/B testing as we track how users react to each of the modified recommendation algorithms. The possibilities are literally endless. So anyhow, thanks again for checking out how Datameer built a simple co-occurrence recommendation engine for music. Please feel free to direct questions or comments to me at victor.liu@datameer.com. Thanks again.