Linked data (LINDAS/SPARQL/hydrodata)

TL;DR;

Get the up-to-date data from LINDAS, i.e. the currently available discharge / water level / water temperature values, for all BAFU stations.

PREFIX schema: <http://schema.org/>
PREFIX hg: <http://rdf.histograph.io/>
PREFIX la: <https://linked.art/ns/terms/>
prefix ex: <http://example.org/>

# hydro related prefixes
prefix h: <https://environment.ld.admin.ch/foen/hydro/>
prefix hd: <https://environment.ld.admin.ch/foen/hydro/dimension/>
prefix hgs: <https://environment.ld.admin.ch/foen/hydro/station/>
prefix river: <https://environment.ld.admin.ch/foen/hydro/river/observation/>
prefix lake: <https://environment.ld.admin.ch/foen/hydro/lake/observation/>

# geo related prefixes
prefix wkt: <http://www.opengis.net/ont/geosparql#asWKT>

# hydrodata related prefixes (fake) that we use to reference to the historical data. 
prefix hydrodata: <https://www.hydrodaten.admin.ch/plots/>

# what we want to get out of the query as "nicely" formatted data
SELECT ?station_id ?station_name ?station_type ?station_type_nice
    ?latitude ?longitude 
    ?station_iri ?station_data_iri ?station_geometry_iri
	  (group_concat(?measurement_type_measurement_value;SEPARATOR=";") as ?data)
		#?hydrodata_url_temperature ?hydrodata_url_discharge_waterlevel
		?query_time 
		# add the https://hydrodaten.admin.ch url for the data that go back 
		# 7 days (actually 40 days sometimes, see dev console)
  		(if(contains(?data,"discharge"), 
  		    uri(concat(str(hydrodata:), 'p_q_7days/',
  		               ?station_id, '_p_q_7days_de.json')), 
          "") 
    		as ?hydrodata_url_discharge_waterlevel)
        (if(contains(?data,"temp"), 
            uri(concat(str(hydrodata:), 'temperature_7days/', 
                ?station_id, '_temperature_7days_de.json')), 
           "") 
    	as ?hydrodata_url_temperature)
 
where {
  		# it seems counterintuitive but getting everything and filtering
  		# is faster than first getting the station list
  		# and going through their IRIs
        ?station ?prop ?value.
  		# we only want to keep actual data variables and time
        filter( ?prop in (ex:isLiter, hd:measurementTime, hd:waterTemperature,  hd:waterLevel, hd:discharge))
  		# convert the subject station to the id, which is the last thing after the /;
  		bind(replace(replace(
        	str(?station), str(river:), "", "i"),
      		str(lake:), "", "i")
    		as ?station_id)
  		# the above is equivalent to this in theory, but it uses prefixes
  		# and would therefore be easier to adapt in case the IRIs change
  		#bind(strafter(str(?station), "observation/") as ?station_id2)
  		# get the kind of station this is (whether lake or river)
  		# - the hierarchy/ontology here is odd, as all are measurement stations
  		# or observations, but the distinction is river/lake first, then observation
  		bind(strbefore(strafter(str(?station), str(h:)), "/") as ?station_type)
  		# uppercase the first letter -> why is there no capitalize function in SPARQL
  		bind(concat(ucase(substr(?station_type,1,1)), substr(?station_type,2,strlen(?station_type)-1)) as ?station_type_nice)
  		# strip the dimension prefix from the property to get the measurement type
		bind(replace(str(?prop), str(hd:), "", "i") as ?measurement_type)
  		# convert the actual measurement value to a literal
        bind(str(?value) as ?measurement_value)
  		# introduce the time of the query to (maybe) see if the values are old
        bind( str(now()) as ?query_time)
  		# combine the type of measurement with the value so we
  		# can group by and not lose information
  		bind(concat(?measurement_type, ';', ?measurement_value, '') as ?measurement_type_measurement_value) 
    	
  		# start making IRIs so we can check/link back to lindas
  		# the IRI containing information of the station itself (not linked to river/lake)
  		bind(IRI(concat(str(hgs:), ?station_id)) as ?station_iri)
  		# generate the geometry IRI
  		bind(IRI(concat(str(hgs:), concat(?station_id,"/geometry"))) as ?station_geometry_iri)
  		# generate the station IRI that holds the data information
  		bind(IRI(concat(str(h:), ?station_type, "/observation/", ?station_id)) as ?station_data_iri)
  		# get the information about the location of the station
  		?station_geometry_iri wkt: ?coordinates.
  		# also get the name and remove commas to not confuse them with seps
      ?station_iri schema:name ?stat_name.
      bind(replace(?stat_name, ',', ';') as ?station_name)

  		# simplify the geometry to separate lon/lat values
        BIND(STRBEFORE(STRAFTER(STR(?coordinates), " "), ")") AS ?latitude)
        BIND(STRBEFORE(STRAFTER(STR(?coordinates), "("), " ") AS ?longitude)



}
# group by, basically keep everything of the select here
# except the ?measurement_type_measurement_value which we want to combine to one row
# (ideally we would actually get them as separate columns again, but this seems the best we can do for the moment)
group by # the station information
		?station_id ?station_type ?station_type_nice ?station_name 
		?latitude ?longitude
		# the IRIs
		 ?station_iri ?station_data_iri ?station_geometry_iri
		# validation information
		 ?query_time   

This one took quite some time and, for now, serves little purpose other than SPARQL practice for me. Essentially, I was once asked to use “the API” for the current water temperatures that are embedded in my Raspberry Pi webcams.

Now, initially I actually web-scraped the JSON data from hydrodata and transferred them to the MCR database, as this allowed the concurrent use of in-house data with the BAFU data. Additionally, the urban/rural/water temperature graph of dolueg2 uses exactly the same temperature, so it makes sense to keep it in the same database (at least for some of the relevant stations around Basel; B2091, for example, refers to BAFU station 2091).
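For reference, these 7-day JSON series can still be fetched directly from hydrodata; a minimal sketch using the URL pattern that also appears in the query at the top (station 2091 as used elsewhere in this post; the structure of the returned JSON is not documented here, so only the top level is inspected):

import requests

station_id = '2091'  # BAFU station Rheinfelden, as referenced in this post
url = ('https://www.hydrodaten.admin.ch/plots/temperature_7days/'
       f'{station_id}_temperature_7days_de.json')

resp = requests.get(url, timeout=10)
resp.raise_for_status()
series = resp.json()
# only inspect the top-level structure, as the exact layout is not documented here
print(type(series).__name__, list(series) if isinstance(series, dict) else len(series))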

Now, going back to “the API” and RDF, where data are stored as triples (subject, predicate, object) to allow reuse and linking: we can get the most recent data of any BAFU station via, e.g.,

prefix hd: <https://environment.ld.admin.ch/foen/hydro/dimension/>
SELECT * WHERE {
  <https://environment.ld.admin.ch/foen/hydro/river/observation/2091> hd:waterTemperature ?obj .
}
# see also the page: https://environment.ld.admin.ch/foen/hydro/river/observation/2091

or several values relevant for us (more than just the water temperature):

prefix hd: <https://environment.ld.admin.ch/foen/hydro/dimension/>
prefix obs: <https://environment.ld.admin.ch/foen/hydro/river/observation/>


SELECT * WHERE {
  obs:2091 ?prop ?obj .
  filter( ?prop in (hd:measurementTime, hd:waterTemperature,  hd:waterLevel, hd:discharge))

}

Running the above against the LINDAS SPARQL endpoint returns the current values (at the time of writing). Notably, we can also get the information about the station itself (note the added prefix):

prefix station: <https://environment.ld.admin.ch/foen/hydro/station/>
SELECT * WHERE {
  station:2091 ?prop ?obj .
}

And as you can see, there is also a geometry linked to the station data (the corresponding query is not shown this time).

Cool – except for the fact that the southernmost point of Switzerland is located at 45°49’N. And the station in question (id 2091) is in Rheinfelden, i.e. in the North of Switzerland. So something seems off here and I informed the responsible person in the hopes that they’ll fix that soon.

Now, initially I thought I’d first extract the IDs of the relevant stations and then get their properties/values of interest, but somehow that query runs rather slowly (6+ seconds). This could probably be fixed, and if someone wants to try, the following should give a nicely formatted data table. (I also assume that using the measurement time as the trigger might eventually not be a good choice once more and more datasets become available on LINDAS.)

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
# hydro related prefixes
prefix h: <https://environment.ld.admin.ch/foen/hydro/>
prefix hd: <https://environment.ld.admin.ch/foen/hydro/dimension/>
prefix hydro_geo: <https://environment.ld.admin.ch/foen/hydro/station/>
# geo related prefixes
prefix wkt: <http://www.opengis.net/ont/geosparql#asWKT>

select ?station_iri ?station_id ?station_type ?coordinates ?station_coordinates ?query_time ?measurement_time ?waterTemperature ?discharge ?waterLevel 
where {
  
    ?station hd:measurementTime ?measurement_time.
    optional {?station hd:waterLevel ?waterLevel. }
    optional { ?station hd:discharge ?discharge.}
  	optional { ?station hd:waterTemperature ?waterTemperature.}
  # measurement related things
  bind(now() as ?query_time)
  bind(strbefore(strafter(str(?station), str(h:)), "/") as ?station_type)
  bind(strafter(str(?station), "observation/") as ?station_id)
  # geo related things
  # first find all the relevant subjects, i.e. station property pages
  bind(IRI(concat(str(hydro_geo:), concat(?station_id,"/geometry"))) as ?station_geometry_iri)
  bind(IRI(concat(str(hydro_geo:), ?station_id)) as ?station_iri)
  ?station_geometry_iri wkt: ?coordinates.
  BIND(strafter(strbefore(str(?coordinates),")"), "POINT(") AS ?station_coordinates)
  # can be used for yasgui and nicer representation but slows the query
  #bind(xsd:decimal(?station_id) as ?station_id_numeric)
} 
#order by desc(?station_id_numeric)
  

As you can see, it’s important to use the OPTIONAL keyword, as not all stations have all variables (water temperature / discharge / water level). Additionally, I retrieve the kind of station (lake/river) that is available. The choice of prefix hierarchy feels a bit odd to me, as it seems to be river/lake first and then station rather than the other way around – which must have a reason that I’m too inexperienced to understand. The rest of the query is cosmetics to get slightly nicer values (if you keep the geometry as its proper type, you can also get a map of the observations).

As mentioned, this runs rather slowly (6+ seconds for around 230 records is too slow on any day), so I started experimenting and ultimately arrived at the code at the top (here shortened to remove the geometry and hydrodata links): first filter for the relevant dimensions, essentially getting the data first, then clean up a bit and concatenate by station id.

PREFIX schema: <http://schema.org/>
PREFIX hg: <http://rdf.histograph.io/>
PREFIX la: <https://linked.art/ns/terms/>
prefix ex: <http://example.org/>

# hydro related prefixes
prefix h: <https://environment.ld.admin.ch/foen/hydro/>
prefix hd: <https://environment.ld.admin.ch/foen/hydro/dimension/>
prefix hgs: <https://environment.ld.admin.ch/foen/hydro/station/>
prefix river: <https://environment.ld.admin.ch/foen/hydro/river/observation/>
prefix lake: <https://environment.ld.admin.ch/foen/hydro/lake/observation/>

# geo related prefixes
prefix wkt: <http://www.opengis.net/ont/geosparql#asWKT>


# what we want to get out of the query as "nicely" formatted data
SELECT ?station_id ?station_name ?station_type ?station_type_nice ?station_iri ?station_data_iri 
	  (group_concat(?measurement_type_measurement_value;SEPARATOR=", ") as ?data)
		#?hydrodata_url_temperature ?hydrodata_url_discharge_waterlevel
		?query_time  
where {
  		# it seems counterintuitive but getting everything and filtering is faster than first getting
  		# the station list and going through their IRIs
        ?station ?prop ?value.
  		# we only want to keep actual data variables and time
        filter( ?prop in (hd:measurementTime, hd:waterTemperature,  hd:waterLevel, hd:discharge))
  		# convert the subject station to the id, which is the last thing after the /;
  		bind(replace(replace(
        	str(?station), str(river:), "", "i"),
      		str(lake:), "", "i")
    		as ?station_id)
  		# the above is equivalent to this in theory, but it uses prefixes
  		# and would therefore be easier to adapt in case the IRIs change
  		#bind(strafter(str(?station), "observation/") as ?station_id2)
  		# get the kind of station this is (whether lake or river) - the hierarchy/ontology here is odd, as all are measurement stations
  		# or observations, but the distinction is river/lake first, then observation
  		bind(strbefore(strafter(str(?station), str(h:)), "/") as ?station_type)
  		# uppercase the first letter -> why is there no capitalize function in SPARQL (╯°□°)╯︵ ┻━┻
  		bind(concat(ucase(substr(?station_type,1,1)), substr(?station_type,2,strlen(?station_type)-1)) as ?station_type_nice)
  		# remove some filler words from the data
		bind(replace(replace(replace(
            str(?prop), str(hd:), "", "i"),
            "water", "", "i"),
            "measurementTime", "measurement_time", "i") 
            as ?measurement_type)
  		# convert the actual measurement value to a literal
        bind(str(?value) as ?measurement_value)
  		# introduce the time of the query to (maybe) see if the values are old
        bind( str(now()) as ?query_time)
  		# combine the type of measurement with the value so we can group by and not lose information
  		bind(concat(?measurement_type, ': ', ?measurement_value, '') as ?measurement_type_measurement_value) 
    	
  		# start making IRIs so we can check/link back to lindas
  		# the IRI containing information of the station itself (not linked to river/lake)
  		bind(IRI(concat(str(hgs:), ?station_id)) as ?station_iri)
  	
  

}
# group by, basically keep everything of the select here
# except the ?measurement_type_measurement_value which we want to combine to one row
# (ideally we would actually get them as separate columns again, but this seems the best we can do for the moment)
group by # the station information
		?station_id ?station_type ?station_type_nice ?station_name 
		# the IRIs
		 ?station_iri ?station_data_iri
		# validation information
		 ?query_time   

This query is the reduced version of the one at the top. It runs in around one second or less and can now be used for automated queries or other products, as described here on LINDAS. The only additions in the full version are the geometry and the URLs for the hydrodata website, which give you the historical data needed for a product I have in mind that might come up later. And maybe at some point we can also get past data on LINDAS, which would make that project easier.
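For automated queries, the SPARQL can be sent to LINDAS with any HTTP client; a minimal sketch, assuming the public endpoint at https://ld.admin.ch/query and the standard SPARQL JSON results format (shown with the short station query from above to keep it compact):

import requests

LINDAS_ENDPOINT = 'https://ld.admin.ch/query'  # assumed public LINDAS endpoint

query = """
prefix hd: <https://environment.ld.admin.ch/foen/hydro/dimension/>
prefix obs: <https://environment.ld.admin.ch/foen/hydro/river/observation/>
SELECT * WHERE {
  obs:2091 ?prop ?obj .
  filter( ?prop in (hd:measurementTime, hd:waterTemperature, hd:waterLevel, hd:discharge))
}
"""

resp = requests.post(LINDAS_ENDPOINT,
                     data={'query': query},
                     headers={'Accept': 'application/sparql-results+json'},
                     timeout=30)
resp.raise_for_status()
for binding in resp.json()['results']['bindings']:
    print(binding['prop']['value'], binding['obj']['value'])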

Update:
The offset in the geometry seems to be consistently 400’000 m in the “Hochwert” (northing/latitude), based on three stations compared between hydrodata and LINDAS; the longitude is almost correct:

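# note: wgs84_to_ch1903 (not included in this post) is assumed to convert
# WGS84 lat/lon to CH1903/LV95 easting/northing; plus=True is assumed to
# return the LV95 values (with the 2'000'000/1'000'000 offsets)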
b = 2634026 , 1125901 # 2531
a = 7.854351126515431, 42.67878915566025 # 2531
a = wgs84_to_ch1903(*a[::-1], plus=True)[::-1]
print([i - j for i, j in zip(a, b)])
#out: [132.7675012690015, -401020.22037598956]

b = 2627189 , 1267845 # 2091
a = 7.777036377264884, 43.96335989872657 # 2091
a = wgs84_to_ch1903(*a[::-1], plus=True)[::-1]
print([i - j for i, j in zip(a, b)])
#out: [1.471713425591588, -399994.8737920185]

b = 2830800 , 1168706 # 2617
a = 10.265672193049967, 43.036021653211186 # 2617
a = wgs84_to_ch1903(*a[::-1], plus=True)[::-1]
print([i - j for i, j in zip(a, b)])
#out: [48.71478863572702, -399969.5114403835]

My implementation of swipe detection via JavaScript touch events

TL;DR; The code below can be added as a script to your page to implement swipe detection via touch events and attach it directly to a “menu” element

async function swipedetect(el, callback, threshold=50, minimumTime=150, allowedTime=2000){

    var touchsurface = el,
    swipedir,
    startX,
    startY,
    distX,
    distY,
    threshold = threshold, //required min distance traveled to be considered a swipe
//    restraintX = restraintX, // maximum distance allowed at the same time in perpendicular direction in X
//    thresholdY = thresholdY, //required min distance traveled to be considered swipe in Y
//    restraintY = restraintY, // maximum distance allowed at the same time in perpendicular direction in Y
    minimumTime = minimumTime, // minimum time required to travel that distance
    allowedTime = allowedTime, // maximum time allowed to travel that distance
    dist,
    angle,
    sector,
    elapsedTime,
    startTime,
    handleswipe = callback || function(swipedir){}

    touchsurface.addEventListener('touchstart', function(e){
        var touchobj = e.changedTouches[0]
        swipedir = 'none'
        dist = 0
        startX = touchobj.pageX
        startY = touchobj.pageY
        startTime = new Date().getTime() // record time when finger first makes contact with surface
        // e.preventDefault()
    }, false)

    touchsurface.addEventListener('touchmove', function(e){
        e.preventDefault() // prevent scrolling when inside DIV
    }, false)

    touchsurface.addEventListener('touchend', function(e){
        var touchobj = e.changedTouches[0]
        distX = touchobj.pageX - startX; // get horizontal dist traveled by finger while in contact with surface
        distY = touchobj.pageY - startY; // get vertical dist traveled by finger while in contact with surface
        dist = distY * distY + distX * distX;
        dist = Math.sqrt(dist);
        elapsedTime = new Date().getTime() - startTime; // get time elapsed
        // calculate the angle of the movement
        angle = (Math.atan2(distY, distX) / Math.PI * 180)
        // convert the angle to a sector of the movement,
        // offsetting it by 45 degrees so the sectors line up with the four directions
        // and ensuring it is positive
        sector = (angle + 45 + 360) % 360;
        sector = Math.floor(sector/90);
        if (elapsedTime >= minimumTime && elapsedTime <= allowedTime && dist >= threshold){ // first condition for a swipe met
            switch (sector) {
               case 0: swipedir = "right"; break;
               case 1: swipedir = "up"; break;
               case 2: swipedir = "left"; break;
               case 3: swipedir = "down"; break;
            }
        }
        handleswipe(swipedir)
        if (dist >= threshold) {
          e.preventDefault()
        }
    }, false)
}

async function swipe_enable(elementid='menu', table={'right':'a', 'up':'w', 'left':'d', 'down':'s', 'none': null}) {
  var el = document.getElementById(elementid);
  if (el) {
    swipedetect(el, async function(swipedir){
      //swipedir contains either "none", "left", "right", "up", or "down"
      let key = table[swipedir];
      document.dispatchEvent(new KeyboardEvent('keypress', {'key': key}));
    })
  }
}

Handling touch events via vanilla JS is nothing new, and I also took the initial implementation from a website I cannot find anymore. The structure was similar to this gist and the related discussion: initially I simply reacted to the touch events, but the detection of right/left/up/down was patchy until I also added the direction sector. Additionally, I did not want to attach the handler manually to each element, so I added another function that takes care of the attaching.

Now I can simply keep the above in a js-lib file, source it on our data visualisation page, call swipe_enable() in the doc.ready function, and be certain that swiping is handled on the corresponding element (selects in this case). This makes using the page far easier on mobile devices; highly relevant if you are in the field, as in CLOUDLAB, and your phone is the only thing available due to the windy/rainy/snowy weather.

Get the most recent census data from the Basler Atlas

TL;DR;

Get the census data (and more) from the Basler Atlas, which are updated by the Statistisches Amt Basel. The up-to-date version of the script can be found on my GitHub Gists. Don’t forget to credit the data source, i.e. the Stat. Amt Basel, when you use their data – it is added automatically to the “meta” part of the data container (along with the URL, the name and the short name of the dataset).

import datetime


headers = {'User-Agent': 'Mozilla/5.0'}


def data2csv(data,
             outpath='./',
             prefix='',
             suffix='',
             filesep='_',
             datasep=',',
             ext='.csv',
             quiet=True,
             ):

    import os
    if isinstance(data, dict) and 'data' not in data.keys():
        of = []
        for key in data.keys():
            _of = data2csv(data[key],
                           outpath=outpath,
                           prefix=prefix,
                           suffix=suffix,
                           filesep=filesep,
                           datasep=datasep,
                           ext=ext,
                           quiet=quiet,
                           )
            if isinstance(_of, list):
                of += _of
            else:
                of.append(_of)
        return of

    outpath = os.path.expanduser(outpath)
    filename = filesep.join([prefix,
                             data['meta']['theme'],
                             data['meta']['subtheme'],
                             _nivgeo2nicename(data['meta']['nivgeo']),
                             suffix,
                             ])
    filename = filename.lstrip(filesep).rstrip(filesep) + ext
    os.makedirs(outpath, exist_ok=True)
    if not quiet:
        print('Saving as', outpath+filename)
    with open(outpath+filename, 'w') as file_obj:
        file_obj.writelines(datasep.join(data['header']) + '\n')

        datalines = [datasep.join(map(lambda x: str(x), line))
                     for line in data['data']]
        file_obj.writelines('\n'.join(datalines) + '\n')
    return outpath+filename


def _get_topic_url():
    url = 'https://www.basleratlas.ch/GC_listIndics.php?lang=de'
    return url


def _get_data_url():
    # used to be the first url at around 2018
    # "https://www.basleratlas.ch/GC_ronds.php?lang=de&js=1&indic=bevstruk.alter0&nivgeo=wge&ts=1&serie=2015"
    url = 'https://www.basleratlas.ch/GC_ronds.php'
    url = 'https://www.basleratlas.ch/GC_indic.php'
    return url


def _get_references_url():
    # e.g. https://www.basleratlas.ch/GC_refdata.php?nivgeo=wbl&extent=extent1&lang=de
    url = 'https://www.basleratlas.ch/GC_refdata.php'
    return url


def get_ref_data(nivgeo, slimmed=True):
    import requests
    payload = {'nivgeo': nivgeo,
               'lang': 'de', 'extent': 'extent1'}
    refs = requests.get(_get_references_url(),
                        params=payload,
                        headers=headers)
    refdata = refs.json()['content']['territories']
    # contains the actual numbering in the main geographical administrative
    # dataset that it can be linked against
    if slimmed:
        return refdata['libgeo'], refdata['codgeo']
    else:
        return refdata


def _replace_error_value(value, error_value, replace_value):
    return replace_value if value <= error_value else value


def get_basler_atlas_data(nivgeo,
                          theme,
                          subtheme,
                          year,
                          quiet=True,
                          error_value=-9999,
                          replace_value=0,
                          empty_value='',
                          ):
    import requests
    from requests.exceptions import JSONDecodeError

    payload = _params2payload(nivgeo, theme, subtheme, year=year)

    response = requests.get(_get_data_url(),
                            params=payload,
                            headers=headers)
    if response.ok:

        try:
            data = response.json()['content']['distribution']
        except JSONDecodeError:
            print(f'issues with {response.url} and the payload {payload}')
            return None, None
        if not quiet:
            print(f'Got data from {response.url}, transforming ...')

        values = data['values']

        if 'sortIndices' in data:
            # reported values are to be sorted, usual case?
            indices = data['sortIndices']
        else:
            # reported values are sorted already, usual case for Bezirk?
            indices = range(len(values))
        if isinstance(values, dict):
            keys = list(values.keys())
            indices = range(len(values[keys[0]]))
            values = [[_replace_error_value(values[key][i],
                                            error_value,
                                            replace_value,
                                            )
                      for key in keys]
                      for i in sorted(indices)
                      ]
            data = values
            return keys, data
        else:
            data = [str(_replace_error_value(values[i],
                                             error_value,
                                             replace_value,
                                             )
                        )
                    for i in sorted(indices)]
            return None, data
    else:
        if not quiet:
            print(f'Request for {payload} failed')
        return None, None


def _nivgeo2map(nivgeo):
    lookuptable = {'block': 'map5',
                   'wbl': 'map5',
                   'bezirk': 'map6',
                   'wbe': 'map6',
                   'viertel': 'map2',
                   'wvi': 'map2',
                   'gemeinde': 'map3',
                   'wge': 'map3',
                   # 'wahlkreis': 'map7',
                   # 'pwk': 'map7',
                   }

    return lookuptable[nivgeo]


def _get_nivgeos():
    return 'wbe', 'wvi', 'wge', 'wbl'  # , 'pwk'


def _nicename2nivgeo(nivgeo=None):
    nicenames = _nivgeo2nicename(nivgeo=nivgeo)
    if nivgeo is None:
        return {v: k for k, v in nicenames.items()}
    else:
        return {nicenames: nivgeo}


def _nivgeo2nicename(nivgeo=None):
    names = {'wbe': 'bezirk',
             'wvi': 'viertel',
             'wge': 'gemeinde',
             'wbl': 'block',
             # 'pwk': 'wahlkreis',
             }

    if nivgeo is None:
        return names
    else:
        return names[nivgeo]


def _params2payload(nivgeo,
                    theme,
                    subtheme,
                    year=None):

    payload = {'lang': 'de',
               'dataset':  theme,
               'indic': subtheme,
               'view': _nivgeo2map(nivgeo),
               }

    if year is not None:
        payload['filters'] = 'jahr='+str(year)

    return payload


def get_basler_atlas(start_year=1998,
                     end_year=datetime.date.today().year,
                     # population has different ages,
                     # men m , women w and a grand total
                     themes={'bevstruk': ['alter0',
                                          'alter20',
                                          'alter65',
                                          'w',
                                          'm',
                                          'gesbev'],
                             'bevheim': ['anteil_ch',
                                         'anteil_bs',
                                         'anteil_al',
                                         'anteil_bsanch',
                                         # "gesbev" has been replaced by
                                         'gesbev_f',
                                         ],
                             'bau_lwg': ['anzahl_lwg',
                                         ],
                             },
                     geographical_levels='all',
                     error_value=-9999,
                     replace_value=0,
                     testing=False,
                     quiet=True,
                     ):
    _nicenames = _nicename2nivgeo()

    if geographical_levels == 'all':
        nivgeos = _get_nivgeos()
    else:
        if isinstance(geographical_levels, str):
            geographical_levels = [geographical_levels]
        _nivgeos = _get_nivgeos()

        nivgeos = [i if i in _nivgeos else _nicenames[i]
                   for i in geographical_levels]

        assert all([i in _nivgeos or i in _nicenames
                    for i in nivgeos])

    # the defaults that we know of - there is wahlbezirke too on the
    # atlas but we don't particularly care about that one...
    data = {}
    
    # mapping of information from topic url to meta information entries
    info2meta = {'url': 'c_url_indicateur',
                 'name': 'c_lib_indicateur',
                 'short_name': 'c_lib_indicateur_court',
                 'unit': 'c_unite',
                 'source': 'c_source',
                 'description': 'c_desc_indicateur',
    }
    for nivgeo in nivgeos:
        refname, refnumber = get_ref_data(nivgeo)
        refdata = [[_refname, _refnumber]
                   for _refname, _refnumber in zip(refname, refnumber)]
        # ids of the nivgeo is in refdata['codgeo']
        # names of the nivgeo is in refdata['libgeo']
        nicename = _nivgeo2nicename(nivgeo)

        for theme in themes:
            if not quiet:
                print(f'Working on {theme} for {_nivgeo2nicename(nivgeo)}')
            for subtheme in themes[theme]:
                if not quiet:
                    print(f'Working on {theme}.{subtheme} for ',
                          f'{_nivgeo2nicename(nivgeo)}')
                # force a copy of refdata otherwise we keep updating the old list of lists.
                container = {'data': [i.copy() for i in refdata],
                             'meta': {'theme': theme,
                                      'nivgeo': nivgeo,
                                      'subtheme': subtheme,
                                      },
                             'header': ['referencename', 'referencenumber'],
                             }
                topicinfo = get_basler_atlas_topics(theme=theme,
                                                    subtheme=subtheme,
                                                    fullinfo=True)
                for key, value in info2meta.items():
                    container['meta'][key] = topicinfo[theme][subtheme][value]

                # values will be nested, adjust header line with extra
                for year in range(start_year, end_year+1):

                    if not quiet:
                        print(f'Getting data for {year} for {theme}',
                              f'{subtheme} for {_nivgeo2nicename(nivgeo)}')

                    keys, thisdata = get_basler_atlas_data(nivgeo,
                                                           theme,
                                                           subtheme,
                                                           year,
                                                           quiet=quiet,
                                                           )
                    if thisdata is None:
                        if not quiet:
                            print(f'Failure to get data for {year} for {theme}',
                                  f'{subtheme} for {_nivgeo2nicename(nivgeo)}')
                        thisdata = [''] * len(container['data'])

                    if keys is None:
                        container['header'] += [f'{year}']
                    else:
                        container['header'] += [f'{year}_{key}'
                                                for key in keys]

                    for i, value in enumerate(thisdata):
                        if not isinstance(value, list):
                            value = [value]
                        container['data'][i] += value

                    if testing:
                        break  # year

                data[nicename+'_'+theme+'_'+subtheme] = container

                if testing:
                    break  # specific theme
            if testing:
                break  # theme
        if testing:
            break  # nivgeo

    return data


def get_basler_atlas_topics(theme=None,
                            subtheme=None,
                            fullinfo=False):
    import requests
    if subtheme is not None and isinstance(subtheme, str):
        subtheme = [subtheme]
    payload = {"tree": "A01", 'lang': 'de'}
    if theme is not None:
        payload['theme'] = theme
    response = requests.get(_get_topic_url(),
                            params=payload,
                            headers=headers)
    topiclist = response.json()['content']['indics']

    data = {}
    for topicinfo in topiclist:
        maintopic, subtopic = (topicinfo['c_id_dataset'],
                               topicinfo['c_id_indicateur'])
        if maintopic not in data:
            if fullinfo:
                data[maintopic] = {}
            else:
                data[maintopic] = []

        if subtheme is not None and subtopic not in subtheme:
            continue
        if fullinfo:
            data[maintopic][subtopic] = topicinfo
        else:
            data[maintopic] += [subtopic]

    return data
  
if __name__ == '__main__':
  print("""
        Some example usages:
        # get some information for the population for the years 2000 to 2003
        data = get_basler_atlas(end_year=2003, 
                                start_year=2000, 
                                themes={'bevheim': ['anteil_ch','anteil_bs', ]},
                                geographical_levels='wvi',
                                quiet=True)
        # just get everything that is a predefined topic: 
        themes={'bevstruk': ['alter0',
                             'alter20',
                             'alter65',
                             'w',
                             'm',
                             'gesbev'],
                'bevheim': ['anteil_ch',
                            'anteil_bs',
                            'anteil_al',
                            'anteil_bsanch',
                            # "gesbev" has been replaced by
                            'gesbev_f',
                            ],
                'bau_lwg': ['anzahl_lwg',
                            ],
        }
        # limit just by years
        data = get_basler_atlas(themes=themes, start_year=2000, end_year=2010)
        # also save the data to csv files (by theme and subtheme)
        data2csv(data)
        # get some information regarding the available topics
        themes = get_basler_atlas_topics()
        """)

From around 2014 to 2020 I was teaching “An Introduction to Geoinformatics” at the University of Basel. As a starter, students were supposed to load the administrative units of Basel as shapefiles into a GIS (ArcMap or QGIS, I gave support in both) and join them with population data (age groups or similar). These data can be downloaded from the Statistisches Amt Basel, usually as Excel sheets – which is doable for a GIS but not great to work with: sometimes I even struggled to get the right separator in the Excel sheet recognised automatically for the later import (which is just one reason why I prefer QGIS). However, the updated numbers were sometimes published later than I wanted to use them – even though the Basler Atlas already had them. So I went to work and scraped the data directly from that source, because I did not want to download the data manually again every semester, and for several datasets individually. We can do better than that.

Under the hood, the Basler Atlas uses geoclip, a product of a French company (you also see their logo when opening the Atlas). Relevant for me was that, with the browser’s inspect feature, you can see the requests being made in the background and where the data come from (you can of course also click on a request to see the direct answer).

As you can see, there is a listIndics and an indic API endpoint. listIndics gives an overview of the datasets; indic gives the data of the “indices”. Both require a specific payload to work, the most important part being {'lang': 'de'}. There is, however, a third endpoint that gives the crucial reference information, namely the name/id of the administrative features, at GC_refdata.php. Initially, I only used the indic endpoint to collect the data directly and the refdata endpoint to get the reference frame. Combining requests to both, the data can be collected automatically, either for an update or from scratch, and come back as a dict, including a header line and meta information for the later export (also included above). Now you can get your up-to-date data from the Basler Atlas as well, and have a better time using it for temporal analysis, by running the above code and then calling something like this:

data = get_basler_atlas(geographical_levels='block',
                        themes={'bevstruk': ['alter0',
                                             'alter20',
                                             'alter65',
                                             'w',
                                             'm',
                                             'gesbev'], },
                        quiet=True)

This gives you the population data on the block level of the years from 1998 until today (2024 as of this writing). Calling the below saves the data to your home directory, namely 6 CSV files.

data2csv(data, outpath='~/')
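For reference, the raw endpoints can also be hit directly without the helpers; a minimal sketch of a request against the refdata endpoint, with the URL, payload and the 'content' -> 'territories' fields taken from get_ref_data() above ('wge' being the Gemeinde level):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
# reference data endpoint, as used by get_ref_data() above
url = 'https://www.basleratlas.ch/GC_refdata.php'
payload = {'nivgeo': 'wge', 'lang': 'de', 'extent': 'extent1'}

refs = requests.get(url, params=payload, headers=headers)
refs.raise_for_status()
territories = refs.json()['content']['territories']
# 'codgeo' holds the ids, 'libgeo' the names of the administrative units
for code, name in list(zip(territories['codgeo'], territories['libgeo']))[:5]:
    print(code, name)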

Today I extended the code above to also get the available topics (i.e. the categories), which allows us to discover topics or download data that would otherwise take some time to find:

geolevel = 'block'
# the available topics, as returned by the extension mentioned above
topics = get_basler_atlas_topics()

for key, values in topics.items():
    if key != 'bevstruk':
        continue
    data = get_basler_atlas(themes={key: values},
                            quiet=True,
                            geographical_levels=geolevel)
    of = data2csv(data, outpath='~/')
    break

As always, I hope this helps someone to discover data or save some time for themselves. And maybe at some point I’ll find the time to document everything a bit better. Don’t forget to credit the data source, i.e. the Stat. Amt Basel, when you use their data.

Update/Fix

As I noticed today, there was an issue with the previous version of this script, namely a missing “deep copy” that led to the data container holding data from all connected subthemes when more than one was requested. Oops. It should be fine now: the gist on top is updated and some usage information has been added as well. 🙂

Download data from the Lufft CHM 15k ceilometer webinterface

TL;DR; Summary

Download all newer data from the web interface of the Lufft CHM 15k ceilometer into a main directory, or into subdirectories following a Year/Month/Day format, with the code below. An up-to-date version can also be found at my GitHub gists. Adjust it according to your needs (especially the format of the subfolders).

Python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Sep 15 09:22:30 2021

@author: spirrobe
"""

import os
import datetime
import requests
import json


class chm15ksession(requests.Session):
    """
    A class for interacting with the CHM-15k data server.

    This class inherits from the requests.Session class and is designed to
    facilitate downloading netCDF and zipped netCDF files from the CHM-15k.
    To use this class, you must have a valid password for accessing the
    server.

    Parameters
    ----------
    url : str
        The URL of the CHM-15k ceilometer. Can be a local IP or an http URL.
    password : str, optional
        The password for accessing the CHM-15k.
        Default is "15k-Nimbus".
    outpath : str, optional
        The path to save downloaded files to. Default is the current directory.
    download2subdirs : bool, optional
        Whether to put files into a subdirectory as outpath/{year}/{month}/{day} 
        where year, month, day are inferred for each file based on the filename
    quiet : bool, optional
        Whether to print information about the download progress.
        Default is True.

    Attributes
    ----------
    url : str
        The URL of the CHM-15k.
    session : requests.Session
        The requests session object used to communicate with the server.
    password : str
        The password for accessing the CHM-15k data server.
    outpath : str
        The path to save downloaded files to.
    filecount : int
        The number of files available on the server.
    quiet : bool
        Whether to print information about the download progress.
    sessionid : str
        The ID of the current session with the server.
    zipfiles : list of str
        The names of the zipped netCDF files available on the server.
    zipsizes : list of int
        The sizes of the zipped netCDF files available on the server, in bytes.
    ncfiles : list of str
        The names of the netCDF files available on the server.
    ncsizes : list of int
        The sizes of the netCDF files available on the server, in bytes.

    Methods
    -------
    connect()
        Connects to the CHM-15k data server and establishes a session.
    getfilelist()
        Returns a dictionary of available netCDF and zipped netCDF files on the
        CHM-15k data server.
    getncfiles(overwrite=False)
        Downloads all available netCDF files from the CHM-15k to the
        local file system.
    getzipfiles(overwrite=False)
        Downloads all available zipped netCDF files from the CHM-15k
        to the local file system.
    """

    def __init__(self,
                 url,
                 password="15k-Nimbus",
                 outpath='./',
                 download2subdirs=False,
                 timeout=20,
                 quiet=True,
                 *args, **kwargs,
                 ):
        """
        Initialize a new instance of the chm15ksession class.

        Parameters
        ----------
        url : str
            The URL of the CHM-15k.
        password : str, optional
            The password for accessing the CHM-15k data server. Default is
            "15k-Nimbus".
        outpath : str, optional
            The path to save downloaded files to.
            Default is the current directory.
        timeout : int, optional
            The timeout in seconds for the get calls, adjust if on low bandwidth/slow network.
        quiet : bool, optional
            Whether to print information about the download progress.
            Default is True.
        """
        super().__init__(*args, **kwargs)
        # assert url, str, 'url must be a str'
        self.timeout = timeout
        self.url = url

        if not self.url.endswith('/'):
            self.url += '/'

        if not self.url.startswith('http'):
            self.url = 'http://' + self.url

        self.__cgi = "cgi-bin/chm-cgi"
        self.__cgiurl = self.url + self.__cgi
        #self.session = requests.Session()
        #self = requests.Session()
        self.password = password
        self.outpath = outpath
        self.__subpath = ''
        self.download2subdirs = download2subdirs
        if not self.outpath.endswith(os.sep):
            self.outpath += os.sep

        self.filecount = None
        self.sessionid = None
        self.zipfiles = []
        self.zipsizes = []
        self.ncfiles = []
        self.ncsizes = []
        self.quiet = quiet

    def _filename2date(self, filename):
        # pattern is YYYYMMDD
        _ = filename.split(os.sep)[-1].split('_')[0]
        if len(_) == 8:
            # typical netcdf files
            return _[:4], _[4:4+2], _[4+2:4+2+2]
        elif len(_) == 6:
            # zipfiles do not have a day as they are for the month
            return _[:4], _[4:4+2]
        else:
            print(f'Date could not be inferred from {filename}')
            return '', '', ''

    def _filename2datefolder(self, filename):
        date = self._filename2date(filename)
        if date[0]:
            date = [s + i for s, i in zip(['Y', 'M', 'D'], date)]
            date = os.sep.join(date) + os.sep

            if not self.outpath.endswith(os.sep):
                date = os.sep + date
            return date
        else:
            return ''

    def connect(self):
        """
        Connect to the CHM-15k using the provided password.

        This method sends a validation request to the CHM-15k data server
        with the provided password and obtains a session ID that can be
        used for subsequent requests.

        Raises
        ------
        requests.exceptions.RequestException
            If the request fails.

        """
        validationurl = self.__cgiurl+f"?validatetoken&code={self.password}"
        # this url could be used to check if the connection worked
        # checkurl = self.__cgiurl+"?checkvalidation"
        try:
            resp = self.get(validationurl, timeout=self.timeout)
        except requests.exceptions.RequestException:
            now = datetime.datetime.now(datetime.UTC)
            print(f'{now}: Connection failed, check url {self.url} and '
                  f'password {self.password}')
            return
        sessionid = resp.text.strip().split('{')[1].split('}')[0]
        resp.close()
        sessionid = sessionid.split(':')[1].split(',')[0]
        self.sessionid = sessionid
        self.cookies.set("session", self.sessionid,
                                 domain=self.url.split(':')[1][2:])
        if not self.quiet:
            now = datetime.datetime.now(datetime.UTC)
            print(f'{now}: Connection successful to {self.url}')
        self.sessionid = True

    def getfilelist(self):
        """
        Get a list of files from the CHM-15k.

        If the connection to the server has not been established,
        this method will establish a connection. Sets attributes of the
        object to contain the return values as well.

        Returns
        -------
        dict
            A dictionary containing the following keys:
            - 'zipfiles': A list of the names of zipped netCDF files.
            - 'netcdffiles': A list of the names of netCDF files.
            - 'zipsizes': A list of the sizes of zipped netCDF files.
            - 'ncsizes': A list of the sizes of netCDF files.
        """
        if self.sessionid:
            pass
        else:
            self.connect()
        resp = self.get(self.__cgiurl + '?filelist', timeout=self.timeout)
        filelist = resp.text
        resp.close()
        filelist = filelist[filelist.index('{'):]
        filelist = filelist[:-filelist[::-1].index('}')]
        try:
            filelist = json.loads(filelist)
        except json.JSONDecodeError:
            if not self.quiet:
                now = datetime.datetime.now(datetime.UTC)
                print(f'{now}: Issue with getting proper filelist, aborting getfilelist and potential callers')
            return None
        self.filecount = filelist['count']
        self.zipfiles = [i[0] for i in filelist["ncfiles"] if 'zip' in i[0]]
        self.zipsizes = [i[1] for i in filelist["ncfiles"] if 'zip' in i[0]]

        self.ncfiles = [i[0] for i in filelist["ncfiles"] if 'zip' not in i[0]]
        self.ncsizes = [i[1] for i in filelist["ncfiles"] if 'zip' not in i[0]]

        if not self.quiet:
            now = datetime.datetime.now(datetime.UTC)
            print(f'{now}: Found {filelist["count"]} files in total to be checked')
            print(f'{now}: Found {len(self.ncfiles)} netCDF files')
            print(f'{now}: Found {len(self.zipfiles)} zipped netCDF files')

        return {'zipfiles': self.zipfiles, 'netcdffiles': self.ncfiles,
                'zipsizes': self.zipsizes, 'ncsizes': self.ncsizes}

    def getsinglefile(self, filename, overwrite=True):
        """
        Download a single file from the CHM15k to the specified output path.

        Parameters
        ----------
        filename : str
            Name of the file to be downloaded. Can be either zip or nc file.
        overwrite : bool, optional
            Flag indicating whether to overwrite the file if it already
            exists in the output path and has the same size.
            Defaults to True.

        Returns
        -------
        None
            If the file is not available on the server or
            if the file transfer fails.

        Raises
        ------
        None

        Notes
        -----
        This method uses the requests library to download the file
        from the server, and saves it to the output path using
        the same filename as on the device.

        """
        if self.filecount:
            pass
        else:
            self.getfilelist()

        if filename not in self.ncfiles and filename not in self.zipfiles:
            print(f'File {filename} not available')
            return
        else:
            if filename in self.ncfiles:
                filesize = self.ncsizes[self.ncfiles.index(filename)]
            elif filename in self.zipfiles:
                filesize = self.zipsizes[self.zipfiles.index(filename)]
            else:
                print(f'File {filename} not available')
                return

        if self.download2subdirs:
            self.__subpath = self._filename2datefolder(filename)

        os.makedirs(self.outpath + self.__subpath, exist_ok=True)

        # check if the file exists, and if it does has the same size
        # if so continue
        if os.path.exists(self.outpath + self.__subpath + filename):
            fs = os.path.getsize(self.outpath + self.__subpath + filename) // 1024
            if fs == filesize and not overwrite:
                if not self.quiet:
                    print(f'File {filename} already exists and has the same '
                          'size as the file on the CHM15k. Pass overwrite to',
                          'download anyway')

                return

        filecontent = self.get(self.__cgiurl+'/'+filename+"?getfile", timeout=self.timeout)
        # check if the transfer worked in the firstplace, if not continue
        if filecontent.status_code != 200:
            if not self.quiet:
                now = datetime.datetime.now(datetime.UTC)
                print(f'{now}: Filetransfer failed for {filename}')
            return

        with open(self.outpath + self.__subpath + filename, 'wb') as fo:
            fo.write(filecontent.content)

        if not self.quiet:
            now = datetime.datetime.now(datetime.UTC)
            print(f'{now}: Successfully downloaded {filename}')

        self.__subpath = ''

    def getncfiles(self, overwrite=False):
        """
        Download netCDF files from the CHM-15k to the specified `outpath`.

        Parameters
        ----------
        overwrite : bool, optional
            Whether to overwrite existing files with the same name and size
            in the `outpath`.
            Default is False.

        Raises
        ------
        ValueError
            If `filecount` attribute is False.

        Notes
        -----
        This method first checks whether the `filecount` attribute is set.
        If not, it calls the `getfilelist` method to obtain a list of files
        available for download. Then, for each netCDF file in the list,
        it checks whether the file already exists in the `outpath` and has
        the same size as the file.
        If not, it downloads the file using a GET request and saves it
        to the `outpath`.

        """
        if self.filecount:
            pass
        else:
            self.getfilelist()

        dlcount = 0
        for fileno, (filename, filesize) \
                in enumerate(zip(self.ncfiles, self.ncsizes)):
            if self.download2subdirs:
                self.__subpath = self._filename2datefolder(filename)
            # check if the file exists, and if it does has the same size
            # if so continue
            if os.path.exists(self.outpath + self.__subpath + filename):
                fs = os.path.getsize(self.outpath + self.__subpath + filename) // 1024
                if fs == filesize and not overwrite:
                    if not self.quiet:
                        now = datetime.datetime.now(datetime.UTC)
                        print(f'Not downloading {filename} as it exists and has the same size')
                        print(f'{now}: Progress at ',
                             f'{round((fileno+1)/len(self.ncfiles) * 100,1)} %')

                    continue
            else:
                os.makedirs(self.outpath + self.__subpath, exist_ok=True)

            filecontent = self.get(
                self.__cgiurl+'/'+filename+"?getfile", timeout=self.timeout)
            # check if the transfer worked in the firstplace, if not continue

            if filecontent.status_code != 200:
                if not self.quiet:
                    print(f'Filetransfer failed for {filename}')
                continue

            with open(self.outpath + self.__subpath + filename, 'wb') as fo:
                fo.write(filecontent.content)

            if not self.quiet:
                now = datetime.datetime.now(datetime.UTC)
                print(f'{now}: Successfully downloaded {filename} (file {dlcount+1})')
                print(f'{now}: Progress at '
                      f'{round((fileno+1)/len(self.ncfiles) * 100,1)} %')
            dlcount += 1
        now = datetime.datetime.now(datetime.UTC)
        print(f'{now}: Downloaded all {dlcount} files that contained new data '
              f'to {self.outpath + self.__subpath}')
        self.__subpath = ''

    def getzipfiles(self, overwrite=False):
        """
        Download zip files from the CHM-15k to the specified `outpath`.

        Parameters
        ----------
        overwrite : bool, optional
            Whether to overwrite existing files with the same name and size
            in the `outpath`.
            Default is False.

        Raises
        ------
        ValueError
            If `filecount` attribute is False.

        Notes
        -----
        This method first checks whether the `filecount` attribute is set.
        If not, it calls the `getfilelist` method to obtain a list of files
        available for download. Then, for each zip file in the list,
        it checks whether the file already exists in the `outpath` and has
        the same size as the file.
        If not, it downloads the file using a GET request and saves it
        to the `outpath`.

        """
        if self.filecount:
            pass
        else:
            self.getfilelist()

        os.makedirs(self.outpath, exist_ok=True)

        for fileno, (filename, filesize) \
                in enumerate(zip(self.zipfiles, self.zipsizes)):
            if self.download2subdirs:
                self.__subpath =  self._filename2datefolder(filename)
            # check if the file exists, and if it does has the same size
            # if so continue
            if os.path.exists(self.outpath + self.__subpath + filename):
                fs = os.path.getsize(self.outpath + self.__subpath + filename) // 1024
                if fs == filesize and not overwrite:
                    if not self.quiet:
                        print('File already exists and has '
                              f'the same size ({filename})')
                    continue
            else:
                os.makedirs(self.outpath + self.__subpath, exist_ok=True)

            filecontent = self.get(
                self.__cgiurl+'/'+filename+"?getfile", timeout=self.timeout)
            # check if the transfer worked in the firstplace, if not continue
            if filecontent.status_code != 200:
                if not self.quiet:
                    print(f'Filetransfer failed for {filename}')
                continue

            with open(self.outpath + self.__subpath + filename, 'wb') as fo:
                fo.write(filecontent.content)

            if not self.quiet:
                now = datetime.datetime.now(datetime.UTC)
                print(f'{now}: Successfully downloaded {filename}')
                print(f'{now}: Progress at '
                      f'{round((fileno+1)/len(self.zipfiles) * 100,1)} %')
        now = datetime.datetime.now(datetime.UTC)
        print(f'{now}: Downloaded all {len(self.zipfiles)} available '
              f'zip files at {self.outpath + self.__subpath}')
        self.__subpath = ''

if __name__ == '__main__':
    url = ''  # the url to connect to, either http/s or ip directly of the chm15k
    a = chm15ksession(url,
                      outpath='./',
                      quiet=False)
                      
    # establish a connection, setting up a session; this will be done automatically
    # upon calling other get functions
    a.connect()
    
    # get the available files in case you want to download only one file
    a.getfilelist()
    
    # usually, one is interested only in the netcdf files that are available,
    # especially in an operational setting where other files have already
    # been downloaded. 
    # per default, existing files are not downloaded again
    # a.getncfiles()
    
    # zipfiles are created by the device for each month and can be downloaded as well
    # per default, existing files are not downloaded again
    # a.getzipfiles()

Background & motivation

The CHM15k offers the choice between a serial and an ethernet connection for sampling data. While serial connections are tried and tested, especially with data loggers, the reality might be that you don’t have one on-site, that its serial ports are full, or that you would need a USB-to-serial adapter (which can be quite bothersome with Linux machines). We actually do sample a Parsivel2 with our data server at the CLOUDLAB field site, which requires frequently self-compiled drivers as we are running Fedora with its frequent kernel updates.

So we chose to go via the web interface of the Lufft CHM15k, even though it requires a login. The upside is that checking for missing data is quite straightforward, it can be done interactively, and if you forward ports to its network correctly you can also sample it from the outside.

For this purpose, I had a look with the browser inspection tool to see what happens when the password is sent, and then used a requests session to stay validated. The rest is fairly standard file checking and downloading. With the above, the script only needs to be adapted once with the correct URL (this can be the IP or similar, including a port of course). Be aware that you should really, really change the password if you make your device world-accessible via port forwarding.

Once that is done, you can run the file via a cronjob or task scheduler as often as you want, as only the most recent files are downloaded. Alternatively, import the class and explore its functionality yourself, for example for downloading single files. I hope this helps someone out there to facilitate sampling from their ceilometer.
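For such a scheduled setup, a minimal sketch of a wrapper script (the module name, IP, password and output path here are placeholders/assumptions; getncfiles() skips files that already exist with the same size, so repeated runs only fetch new data):

# assumes the code above is saved as chm15k_download.py next to this script
from chm15k_download import chm15ksession

# placeholders: adjust URL/IP, password and output path to your setup
session = chm15ksession('192.168.1.50',
                        password='15k-Nimbus',
                        outpath='/data/ceilometer/',
                        download2subdirs=True,
                        quiet=True)
session.getncfiles()  # only new or changed files are downloaded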