As we've seen in the first article, the future for data scientists looks bright. Here, we'll cover how GIS technology can enhance the skillset of any data scientist.
As discussed in the first article, the core skills of any data scientist are Python, R and SQL. The GIS industry uses all three, which makes it easier for data science practitioners to extend their workflows with GIS software and data. Let's have a look at how data science uses GIS concepts, data and technology:
GIS and SQL
GIS is about the analysis of location data, stored in spatial databases or as tabular data that can be visualized using mapping software. SQL is the language used in GIS to work with tabular data, for example to make attribute queries, select records and join tables. This makes SQL knowledge a requirement for GIS work, just as it is for data science. In a broader sense, working with various (spatial) databases will greatly enhance your data science skillset. GIS technology is not tied to a single database product: take for instance the open source relational database PostgreSQL and its spatial extension PostGIS. Connections to both are available in QGIS and ArcGIS, two popular desktop GIS products.
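The attribute queries and table joins mentioned above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table names, column names and values are made up for the example, and a real GIS setup would more likely use a spatial database such as PostGIS.

```python
import sqlite3

# In-memory database with two hypothetical tables (names and values are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regions (code TEXT PRIMARY KEY, name TEXT);
CREATE TABLE places (name TEXT, region_code TEXT, population INTEGER,
                     lon REAL, lat REAL);
INSERT INTO regions VALUES ('N', 'North'), ('S', 'South');
INSERT INTO places VALUES
    ('Alpha', 'N', 900000, 4.9, 52.4),
    ('Beta',  'N', 300000, 5.1, 52.1),
    ('Gamma', 'S', 700000, 5.5, 51.4);
""")

# Attribute query: select records whose population exceeds a threshold.
large_places = conn.execute(
    "SELECT name FROM places WHERE population > ? ORDER BY name",
    (500000,),
).fetchall()

# Table join: attach the region name to every place via the shared code.
places_with_region = conn.execute(
    """
    SELECT places.name, regions.name
    FROM places
    JOIN regions ON places.region_code = regions.code
    ORDER BY places.name
    """
).fetchall()

conn.close()

print(large_places)
print(places_with_region)
```

The same SELECT, WHERE and JOIN statements carry over to the attribute tables of a spatial database; only the connection and the spatial functions differ.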
GIS and R
R remains one of the most popular programming languages in the scientific community, alongside Python. It's possible to extend current data science workflows using R and GIS. Both QGIS and ArcGIS offer ways to integrate R into desktop mapping software: for QGIS there is RQGIS, a package that enables access to QGIS geospatial algorithms from within R. The R-ArcGIS bridge offers similar functionality for ArcGIS. Besides these packages, R offers many libraries for working with spatial data natively. Plots can be created and viewed in RStudio, a free and open source development environment for R.
GIS and Python
Both QGIS and ArcGIS Desktop (ArcMap and ArcGIS Pro) have adopted Python as a scripting language, which makes it possible to standardize geospatial workflows and create user-defined add-ins and geoprocessing toolboxes. NumPy, a Python package for working with data arrays, can greatly improve performance when working with large amounts of data, such as raster files.
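To give an idea of why NumPy suits raster data, here is a minimal sketch that treats a raster band as a two-dimensional array and processes all cells at once instead of looping over them. The grid values, the nodata marker and the 20-metre threshold are all assumptions made up for this example. In practice the array would be read from a raster file with a library such as GDAL.

```python
import numpy as np

NODATA = -9999.0  # assumed nodata marker for this example

# A small made-up elevation grid standing in for one raster band.
elevation = np.array([
    [12.0, 15.0, NODATA],
    [18.0, 22.0, 30.0],
    [NODATA, 27.0, 35.0],
])

# Boolean mask of cells that hold real measurements.
valid = elevation != NODATA

# Mean elevation over valid cells only, computed without a Python loop.
mean_elevation = elevation[valid].mean()

# Reclassify: 1 where a valid cell lies above 20 m, 0 everywhere else.
high_ground = np.where(valid & (elevation > 20.0), 1, 0)

print(round(float(mean_elevation), 2))
print(high_ground)
```

Because the comparison and reclassification run on the whole array in compiled code, the same few lines scale to rasters with millions of cells.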
It's also possible to do away with desktop-based GIS and use web mapping APIs, such as the ArcGIS API for Python, which lets you work with geospatial data from Python scripts. The preferred working environment for this API is Jupyter Notebook, the application of choice for many data analysts. A big incentive to use this particular API is its mapping widget, which immediately displays the results of data analysis workflows. Location intelligence platform provider Carto just released a similar initiative called cartoframes, which can be used to enhance existing data science workflows with components from the Carto stack. Its mapping widget is a particular highlight.
Because the Python community has created many tools for the data science field, it's easy to combine different libraries in a single workflow, such as the SciPy stack or machine learning libraries. Many spatial libraries are available, as well as Python wrappers for libraries written in other languages. The most important of these, GDAL, is indispensable for working with spatial data and is used by many other spatial Python packages, notably GeoPandas. Finally, Python offers tools to work with geospatial data as well as web and cloud frameworks, which all in all is considerably more than what R offers.