CSC462 - Stakeholder Report

Andreas Anglin

Mitchell Read

Ben Austin

10/08/2021

Introduction

Over the Summer 2021 term, CSC462 students were challenged to research and develop innovative solutions to global problems: issues that affect multiple industries and populations and involve potentially thousands of gigabytes of data. The work began with satellite imagery for selected regions of British Columbia, used to develop metrics that could benefit a variety of groups. After building a simple proof-of-concept script, the team examined potential applications and compared calculation methods against a large proprietary mosaic dataset. Finally, listening to communities and integrating edge-computing techniques became key goals for our proof-of-concept system going forward.

Phase 1

Phase one consisted of designing a basic cloud architecture and determining how satellite imagery computations for different metrics could be integrated into it. The design incorporates Arbutus, a Compute Canada platform hosted in Victoria whose virtual machines (VMs) give Canadian researchers access to vast amounts of storage and computational power. Ideally, the system would also use Microsoft Azure’s cloud computing infrastructure for scalability.

This architecture is intended for running large calculations on satellite imagery covering the province of British Columbia. We used Sentinel-2 bands to calculate metrics such as the Normalized Difference Vegetation Index (NDVI). To perform these computations at scale, we would break the large datasets into manageable chunks and distribute the workload among several worker nodes, improving performance significantly while keeping the underlying system accessible.
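As a minimal sketch of this idea, the snippet below computes NDVI from the Sentinel-2 red (B04) and near-infrared (B08) bands, assuming they are already loaded as NumPy arrays; the tile size and the chunking loop are illustrative stand-ins for the actual work distribution across worker nodes.

```python
import numpy as np

def ndvi(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """NDVI = (NIR - RED) / (NIR + RED), computed per pixel."""
    red = red.astype(np.float32)
    nir = nir.astype(np.float32)
    denom = nir + red
    # Avoid division by zero where both bands are empty.
    return np.where(denom == 0, 0.0, (nir - red) / denom)

def ndvi_tiled(red: np.ndarray, nir: np.ndarray, tile: int = 1024):
    """Yield (row, col, ndvi_tile) chunks so each tile could be
    farmed out to a worker node instead of processed in one pass."""
    rows, cols = red.shape
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            yield r, c, ndvi(red[r:r + tile, c:c + tile],
                             nir[r:r + tile, c:c + tile])
```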

Potential Applications

A system that can efficiently handle thousands of gigabytes of data enables several useful applications. Our initial research suggested these could benefit governments, the private sector, local communities, and many others. Potential metrics include measuring aerosol contamination, mapping life below the water's surface, and identifying suitable locations for sustainable settlements. In short, this type of system can provide useful information that is otherwise difficult to obtain.

System Difficulties

Building such a system also entails several difficulties, most apparent in the underlying distributed system performing the computations and in the satellite imagery data itself.

Developing distributed systems is no easy task; it requires significant planning and software engineering to achieve an efficient system. Over the term, the team studied several core aspects of distributed systems, such as reaching consensus and partitioning work for parallel execution. These concepts would need to be combined into a single distributed system to achieve the efficiency these large datasets demand.

Our initial research and testing made it clear that satellite imagery tends to be rather “messy”. Satellite coverage is inherently somewhat inconsistent: days are often missing, and images are frequently degraded by weather. Masking techniques can remove artifacts such as cloud cover, but they are not guaranteed to work every time. Another approach is to build a large mosaic by combining several days' worth of data and optimizing each pixel for the best possible result; this is explored further in Phase 2.
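One common masking approach (shown here only as an illustration, not necessarily the method used in our scripts) relies on the Sentinel-2 Level-2A scene classification (SCL) band, in which certain class values mark cloud shadow, cloud, and cirrus pixels:

```python
import numpy as np

# Sentinel-2 L2A scene classification (SCL) classes related to clouds:
# 3 = cloud shadow, 8 = cloud (medium prob.), 9 = cloud (high prob.), 10 = thin cirrus
CLOUD_CLASSES = [3, 8, 9, 10]

def mask_clouds(band: np.ndarray, scl: np.ndarray) -> np.ma.MaskedArray:
    """Mask out cloudy pixels so they are excluded from later metrics."""
    cloudy = np.isin(scl, CLOUD_CLASSES)
    return np.ma.masked_array(band, mask=cloudy)
```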

Phase 2

Phase 2 of the project consisted of analyzing alternative forms of Sentinel-2 data and learning about data privacy and edge computing and their roles in distributed systems. Finally, we worked to integrate the term's lessons into a final prototype.

Mosaic Data

Phase two started with an analysis of the EarthDaily Analytics Sentinel-2 mosaic of British Columbia. This dataset allowed us to compare the mosaic Sentinel-2 data with the scene-by-scene Sentinel-2 data from the first phase of the project. With access to this proprietary dataset, we ran a simple NDVI calculation over Greater Vancouver to compare and contrast the two data sources and assess their viability.

Figure: NDVI over Vancouver calculated from the mosaic data

Figure: NDVI over Vancouver calculated scene-by-scene

As the figures show, the mosaic tended to capture more detailed readings in the individual bands used to calculate the NDVI metric, allowing greater precision and accuracy in decisions based on the data. The mosaic was also far less susceptible to cloud coverage and missing dates, which were common problems in the scene-by-scene data.

Cultural Intelligence and Kelp Farming on Vancouver Island

Through weekly guest speakers over the course of the term, we were exposed to many knowledgeable individuals in fields related to remote sensing and to the Indigenous community. We also learned about the importance of cultural intelligence and of preserving the privacy of the sensitive generational knowledge within these communities.

Our team also learned about the importance of, and recent surge of interest in, kelp farming on Vancouver Island. Kelp is in demand across a large variety of industries, appearing in products such as shampoos, toothpastes, dairy products, and pharmaceuticals. Kelp forests also provide biologically productive habitats for many sea creatures, including fish, urchins, sea otters, sea lions, and even some whales. Overfishing currently harms natural kelp forests: with few carnivorous fish remaining, herbivorous fish are free to decimate kelp as they please. Sustainable kelp farming on the island would therefore provide both commercial and environmental value.

To empower the Nuu-Chah-Nulth community and the Indigenous seafood economy, modern tools must be leveraged to expand ecological knowledge and monitoring, helping communities and organizations make informed business decisions. For kelp farming in particular, many factors contribute to a productive environment, including salinity, irradiance, oceanographic upwelling, and water clarity. Many of these would be difficult to approximate with satellite imagery alone.

Data Privacy and Edge Computing

With so many factors involved, a system blending satellite analysis with local measurements would be ideal for Nuu-Chah-Nulth and other seafood companies looking to expand. To protect the sovereignty of this local data, the prospective system could use edge computing. Edge computation executes operations on a user's own device rather than sending data to cloud servers outside the user's control. Local data stays local, and only the output of the edge computation (typically an aggregation) is shared with a cloud architecture, if anything is shared at all.

The level of privacy required of a prospective system heavily impacts its design. Edge computation can provide weak privacy by obfuscating datasets with noise and/or aggregation before uploading to a cloud architecture. Strong privacy would require a non-trivial client application that amalgamates local datasets with satellite imagery entirely via edge computation, avoiding sharing any component, obfuscated or otherwise, with the cloud architecture.
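The weak-privacy idea can be illustrated with a short sketch: Laplace noise is added to the raw readings, which are then aggregated into a coarse grid so only binned averages ever leave the device. The noise scale and bin count here are illustrative and not tuned for any formal differential-privacy guarantee.

```python
import numpy as np

def obfuscate(points: np.ndarray, values: np.ndarray,
              noise_scale: float = 1.0, bins: int = 32):
    """Add Laplace noise to raw readings, then aggregate them into a
    coarse 2-D grid so only binned means leave the device.
    points: (N, 2) array of (lon, lat); values: (N,) readings."""
    noisy = values + np.random.laplace(0.0, noise_scale, size=values.shape)
    # Sum of noisy values per cell, divided by counts -> mean per cell.
    sums, xe, ye = np.histogram2d(points[:, 0], points[:, 1],
                                  bins=bins, weights=noisy)
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=bins)
    with np.errstate(invalid="ignore"):
        heatmap = np.where(counts > 0, sums / counts, np.nan)
    return heatmap, xe, ye
```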

It is easy today to build an application in which the server takes ownership of all user data; it is much harder to build collaborative software that respects users' ownership of, and privacy over, their local data. Moving forward, we hope to see more effort put into such local-first computing models, as privacy grows increasingly scarce online and large corporations strive ever harder to harvest their users' valuable personal information.

Prototype

From the lessons we learned about the privacy of local data sources and about maintaining cultural awareness in development and business decisions, we identified several key factors to incorporate into our final prototype.

Design Elements

First, we wanted a proper client/server environment in which a client-side data source could be abstracted before reaching the server, promoting data privacy and local ownership of sensitive data. For our purposes, we generated an artificial dataset, stored in a GeoJSON file, to simulate local drone data analytics. Second, we wanted to guarantee data privacy through data abstraction, never sharing the raw local data with the server component of the system.
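A mock dataset of this shape could be generated along the following lines; the coordinate bounds (roughly around Kamloops) and the property name `value` are assumptions for illustration, not the exact fields of our file.

```python
import json
import random

def make_mock_dataset(n: int = 200, path: str = "mock_drone_data.geojson"):
    """Write a GeoJSON FeatureCollection of random points, each with
    an arbitrary reading between 1 and 10, simulating drone analytics."""
    features = []
    for _ in range(n):
        # Random points roughly around Kamloops, BC (illustrative bounds).
        lon = random.uniform(-120.45, -120.20)
        lat = random.uniform(50.60, 50.75)
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"value": random.uniform(1, 10)},
        })
    with open(path, "w") as f:
        json.dump({"type": "FeatureCollection", "features": features}, f)
```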

Architecture Goals

The following diagram shows the intended architecture of a fully developed system. Requests are issued by the client library running on a user's local machine. The client library adds noise to and aggregates local datasets before making requests to the internal cluster. Worker nodes receive each request via a load-balancing agent and plot the given dataset against satellite imagery, which is obtained from SentinelHub unless it is already available in the database cluster.

Figure: Ideal cloud architecture

Implementation

For this prototype demonstration, a Flask server was run locally to simulate a server running on a service such as Compute Canada's Arbutus. Once the Flask server is started, the client is run from another terminal. The client first loads a mock dataset from a local file containing geographical points, each with a value between 1 and 10; these values are arbitrary by design, in order to accommodate many kinds of readings. The client then adds a level of noise to this local data, as described in the next section, and aggregates the points into a heatmap. Finally, the client calls the server with this “abstracted” heatmap and the associated area of interest. The server fetches a colour image of the area from SentinelHub, plots the heatmap overlay onto it, and returns the final product to the client. Any location can be used, so long as SentinelHub has satellite imagery for it.
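The server side of this flow could look roughly like the sketch below. The route name, payload fields, and helper functions are assumptions for illustration; the SentinelHub fetch and the overlay plotting are stubbed out, since credentials and the exact request code are beyond the scope of this report.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def fetch_true_colour(bbox):
    # Stub: in the prototype this step would request a true-colour
    # image of the bounding box from SentinelHub.
    raise NotImplementedError("SentinelHub fetch omitted from this sketch")

def compose_overlay(image, heatmap):
    # Stub: plot the heatmap over the image and save the result.
    raise NotImplementedError("overlay plotting omitted from this sketch")

@app.route("/overlay", methods=["POST"])
def overlay():
    """Accept an obfuscated heatmap plus an area of interest and
    return a reference to the composed overlay image."""
    payload = request.get_json()
    heatmap = payload["heatmap"]   # aggregated, noise-added grid
    bbox = payload["bbox"]         # [min_lon, min_lat, max_lon, max_lat]
    image = fetch_true_colour(bbox)
    result_path = compose_overlay(image, heatmap)
    return jsonify({"result": result_path})

if __name__ == "__main__":
    app.run(port=5000)
```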

Data Privacy

The client library ensures weak data privacy by adding noise to and aggregating local datasets, preventing raw measurements and units from being passed to the cloud architecture. The figures below show heatmaps of the same dataset after three separate runs of noise addition and aggregation via the client library, with the noise argument set to 100%.

Figure: Heatmaps of one dataset after three runs of noise addition and aggregation

Sample Results

After the server has fetched a colour image of the selected geographical area from SentinelHub, the aggregated heatmap can be overlaid onto the image, combining the two data sources into a single picture. An example is shown below: our simulated data overlaid onto the colour image of Kamloops fetched from SentinelHub.

Figure: Simulated heatmap overlaid on a SentinelHub colour image of Kamloops
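One straightforward way to produce such an overlay is with matplotlib, assuming the colour image and the heatmap grid are NumPy arrays covering the same geographic extent; this is a sketch of the technique, not necessarily the exact plotting code used in the prototype.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_overlay(image: np.ndarray, heatmap: np.ndarray,
                 out: str = "overlay.png"):
    """Draw the semi-transparent heatmap over the colour image.
    Both arrays are assumed to cover the same geographic extent."""
    fig, ax = plt.subplots()
    ax.imshow(image)                           # base satellite image
    ax.imshow(heatmap, cmap="hot", alpha=0.5,  # translucent heatmap on top
              extent=(0, image.shape[1], image.shape[0], 0))
    ax.set_axis_off()
    fig.savefig(out, bbox_inches="tight", dpi=150)
    plt.close(fig)
```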

Conclusion

Throughout these initial phases of the project, the team learned a great deal about building systems for the future, and about the difficulties involved: the growing need for scalable systems that can handle vast amounts of data while coping with datasets that are often inconsistent. The methods that remedy these issues can demand significant resources and expertise across multiple technical fields. By combining that expertise and communicating with local communities, we can find solutions that make a real difference, offering greater control over data privacy while addressing significant problems that impact our local ecosystems.