The Python Tool In Alteryx
Version 2018.3 of Alteryx introduced the Python Jupyter Notebook Tool, which means you can now use your Jupyter notebooks directly in Alteryx. This integration lets you augment your workflows with the capabilities of Python. In this post I’m going to show an example of how your Python scripts can be used in Alteryx.
I will assume throughout the post that you are familiar with the basics of Python. If you already know how to use the tool, you can skip the first part and jump to the second, where I present a use case showing why it makes sense to combine your Python scripts with Alteryx. That part also shows a neat way of clustering a large number of text files into topics, each described by the words most widely used within it.
Part 1 – Getting started using the tool
The tool itself is basically a Jupyter notebook inside an Alteryx tool. Many packages come preinstalled, and additional ones can be installed from within the notebook. Here I install the NLTK package, which provides various Natural Language Processing techniques.
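A minimal sketch of what the install cell can look like, assuming the ayx helper module that ships with the Python tool and its Package.installPackages call. The try/except guard is my own addition so the cell does not fail when the notebook is opened outside Alteryx.

```python
# Install extra packages from inside the Python tool's notebook.
# Assumption: the `ayx` module is only importable inside Alteryx, so the
# import is guarded to keep the cell harmless elsewhere.
packages = ["nltk"]

try:
    from ayx import Package
except ImportError:
    Package = None  # not running inside the Alteryx Python tool

if Package is not None:
    Package.installPackages(packages)
else:
    print("ayx not available; install manually with: pip install " + " ".join(packages))
```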
Once the packages are installed, you can import them into your notebook the same way you normally would in Python, e.g. import numpy as np.
Importing data from your workflow is easy: just connect the data streams you want to use to the Python tool, as shown in the picture above. Back in the Jupyter notebook, you then read each incoming connection into a Pandas DataFrame by referring to its connection name, starting with #1.
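As a sketch, reading the first connection can look like this. It assumes the Alteryx.read call from the ayx module; the variable name df and the import guard are illustrative choices of mine.

```python
# Read the first incoming connection into a pandas DataFrame.
# Assumption: `ayx` is only importable inside the Alteryx Python tool,
# so the import is guarded to keep the cell runnable elsewhere.
try:
    from ayx import Alteryx
except ImportError:
    Alteryx = None  # not running inside the Alteryx Python tool

df = Alteryx.read("#1") if Alteryx is not None else None

if df is not None:
    print(df.head())  # quick look at the first rows
```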
To import an additional data stream, connect it to the Python tool as well and add a line that reads it into a second DataFrame, changing the #1 to #2.
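With two connections attached, the reading cell might look like the sketch below. The names df1 and df2 are hypothetical, and the guard is again only there so the cell also runs outside Alteryx.

```python
# Read two incoming connections, one DataFrame per connection name.
try:
    from ayx import Alteryx
except ImportError:
    Alteryx = None  # not running inside the Alteryx Python tool

df1 = df2 = None
if Alteryx is not None:
    df1 = Alteryx.read("#1")  # first incoming connection
    df2 = Alteryx.read("#2")  # second incoming connection
```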
From here on it is just like working with any Jupyter notebook, except that, for now, you can only send Pandas DataFrames from the notebook back to the canvas. This takes a single line in the notebook that names the DataFrame and the output anchor to write it to.
Up to five different DataFrames can be exported to the canvas from each tool.
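A sketch of the export step, assuming the Alteryx.write call from the ayx module. The small DataFrame is made-up data for illustration only; in a real workflow you would write the result of your own analysis.

```python
# Send a DataFrame back to the canvas through output anchor 1.
try:
    from ayx import Alteryx
except ImportError:
    Alteryx = None  # not running inside the Alteryx Python tool

OUTPUT_ANCHOR = 1  # the tool's output anchors are numbered 1 to 5

if Alteryx is not None:
    import pandas as pd

    # Hypothetical result table, standing in for real analysis output.
    result = pd.DataFrame({"topic": [1, 2], "word": ["network", "polymer"]})
    Alteryx.write(result, OUTPUT_ANCHOR)
```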
Part 2 – Use case
Generally, you can do almost anything with Alteryx in terms of data prepping and advanced analytics, but if you ever feel limited by the tools available in the Tool Palette, you can always turn to the Python Tool and draw on the vast amount of resources readily available on GitHub and elsewhere on the internet.
Combining Alteryx with your Python scripts can dramatically increase the user-friendliness of an analysis. Say, for example, you want an end user to be able to change some of the algorithm’s hyperparameters. Why not embed the Python Tool in an Alteryx macro? That way the user can adapt the analysis to their specific needs.
Clustering of Large Text Files
The picture below shows a macro which is capable of clustering large text files by using the Python Tool:
In this example I want to categorize 15,000 abstracts from patent applications filed with the European Patent Office in 2017. Say I want to find out what these patents are about. I could start reading all of them, in which case I would probably be done by the end of the year. Alternatively, I could use an unsupervised machine learning algorithm to assign a topic to each abstract and thereby learn in broad terms what has been patented.
Specifically, the algorithm clusters the abstracts into N topics and returns the M most coherent words within each cluster, where N and M are hyperparameters that the user can change, as seen below. Here I’ve divided the abstracts into 10 topics of 12 words each. The output also contains a coherence score that weights each word by how frequently it is used in the abstracts within the topic.
The user can also choose which words to exclude from the analysis. The algorithm makes use of the stopwords package, so words like the, if and then are excluded. We also use a lemmatizer to transform each verb to its base form.
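To make the preprocessing idea concrete, here is a small standard-library sketch of the word-filtering step only. It is not the macro's code: the tiny stop-word set stands in for the stopwords package, there is no lemmatizer or topic model, and the frequency share is a simple stand-in for the coherence score. The function names, the sample abstracts and the parameter m (playing the role of M) are all hypothetical.

```python
import re
from collections import Counter

# Tiny stand-in stop-word list (assumption); the macro uses a stopwords package.
STOP_WORDS = {"the", "if", "then", "a", "an", "and", "of", "to", "in", "is", "for"}

def tokenize(text, excluded=()):
    """Lowercase, split into alphabetic tokens, drop stop words and user exclusions."""
    drop = STOP_WORDS | {w.lower() for w in excluded}
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in drop]

def top_words(abstracts, m, excluded=()):
    """Return the m most frequent remaining words with their share of all kept tokens."""
    counts = Counter(t for text in abstracts for t in tokenize(text, excluded))
    total = sum(counts.values())
    return [(word, n / total) for word, n in counts.most_common(m)]

# Made-up abstracts, for illustration only.
abstracts = [
    "The network node sends a signal to the base station.",
    "A method for routing a signal in a wireless network.",
]
print(top_words(abstracts, m=2, excluded=["method"]))
```

In the macro, values such as m and the excluded words would come from the user's choices in the macro interface rather than being hard-coded in the notebook.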
Examples of the resulting word clouds for the different topics are visualized below using Tableau. The first topic clearly contains patents that have something to do with network technology, while the fourth topic relates to chemistry, and so on. The size of each bubble indicates the coherence score of the word.
To conclude: if you want to implement a piece of Python code for analytics purposes, you can seriously improve its usability by integrating it with Alteryx, since the end user can then change the configuration of the code without being able to edit the Python script itself.
If you want to have a look at the macro, send me an email at firstname.lastname@example.org. I’d be happy to share it.