User:Smalyshev (WMF)/Publishing query data


To enable a better understanding of Wikidata Query Service SPARQL usage and to support research into this data, we want to periodically publish usage data. The main premises are:

  • No request should be linkable to a specific user
  • No PII should be present anywhere in the data

The data will be generated periodically from raw analytics logs and stored long-term. How far back the data will be retained is TBD, but we do not plan to impose limits beyond technical ones, and since the data will be public, it can be archived by third parties. All published data should therefore be considered available forever, and any data point that would be problematic under that assumption should not be included.

Opt-out option: users should be able to specify that their queries are not published.

Public dataset contents

The public dataset will contain the following data:

  • Timestamp, with hourly accuracy
  • Namespace queried (from URL)
  • Sanitized SPARQL request
  • User agent class
  • Bot flag (boolean, true for bots and non-interactive tools)
  • Request source (WMF production, WMF labs, external)
  • Response size
  • Time to first byte
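
As a concrete illustration, a single published record might look like the sketch below; the field names and overall shape are assumptions for illustration, not a committed schema:

```python
# A hypothetical published record, shown as a Python dict. Field names and
# the serialization format are illustrative assumptions, not a fixed schema.
record = {
    "timestamp": "2017-06-01T14:00:00Z",  # truncated to the hour
    "namespace": "wdq",                   # namespace from the request URL
    "query": 'SELECT ?var1 WHERE { ?var1 wdt:P31 wd:Q5 }',  # sanitized SPARQL
    "ua_class": "Chrome",                 # user agent class, not the raw UA
    "is_bot": False,                      # true for bots/non-interactive tools
    "source": "external",                 # WMF production, WMF labs, external
    "response_size": 5120,                # bytes
    "ttfb_ms": 312,                       # time to first byte, milliseconds
}
```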

Additional data

Additionally, we may want to provide secondary data derived from the above, such as:

  • List of Q-ids used in the query
  • List of properties used in the query

Having those as separate fields would allow easier querying and metrics without having to parse SPARQL.
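
As a rough sketch, these fields could be extracted from the sanitized query text with simple pattern matching; the Q-id/P-id patterns below are assumptions about the common entity and property ID shapes, and a real implementation might instead walk the parsed query:

```python
import re

def extract_ids(sparql: str) -> tuple[list[str], list[str]]:
    """Pull Q-ids and P-ids out of a sanitized SPARQL query string."""
    qids = sorted(set(re.findall(r"\bQ\d+\b", sparql)))
    pids = sorted(set(re.findall(r"\bP\d+\b", sparql)))
    return qids, pids

print(extract_ids("SELECT ?var1 WHERE { ?var1 wdt:P31 wd:Q5 }"))
# (['Q5'], ['P31'])
```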

SPARQL sanitization

In order to remove potential PII, SPARQL requests will be sanitized. This proposal is based on Markus Kroetzsch's proposal for SPARQL sanitization.

  1. Queries that OpenRDF cannot parse are discarded (i.e. all queries with syntax errors, etc.). In the future, we may want to support Blazegraph's syntax extensions.
  2. All comments are removed
  3. All non-significant whitespace is removed
  4. All variables are renamed to "varNNN", where NNN is a numeric counter. Instances of the same variable are renamed to the same name. This preserves the "Label", "Description" and "Alias" suffixes used by the Label service, so if ?item becomes ?var123, then ?itemLabel becomes ?var123Label (steps 4-6 are sketched in code after this list).
  5. All string literals longer than 3 characters are replaced with "stringNNN" literals, where NNN is a numeric counter. The same string within a query is replaced with the same literal.
    • There will be a whitelist of strings and contexts exempted from this, such as:
      • Pre-registered service names
      • Arguments for Label service
      • Fixed arguments to services, such as MediawikiAPI and query hints, according to a whitelist. The whitelist will match by string value (e.g. "com.bigdata.rdf.graph.analytics.BFS") and by predicate (e.g. "mwapi:generator").
      • Numeric arguments to services, such as bd:serviceParam wikibase:radius "100".
  6. All coordinate values are rounded to the full degree. The same is done to numeric values used in triples involving wikibase:geoLongitude and wikibase:geoLatitude.
  7. Once https://phabricator.wikimedia.org/T127929 is implemented, URIs pointing to specific users will be anonymized to URIs pointing to generic "Anonymous" user.
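
A minimal sketch of steps 4-6 referenced above, operating on the raw query text with regular expressions for brevity; an actual implementation would rewrite the OpenRDF parse tree rather than the string, and the whitelist handling from step 5 is omitted:

```python
import itertools
import re

def sanitize(query: str) -> str:
    """Illustrative regex-based pass over sanitization steps 4-6; a real
    implementation would operate on the OpenRDF parse tree instead."""

    # Step 6 (done first, so the generic string replacement below does not
    # swallow coordinate literals): round "Point(long lat)" WKT values to
    # the full degree.
    def round_point(m: re.Match) -> str:
        lon, lat = round(float(m.group(1))), round(float(m.group(2)))
        return '"Point(%d %d)"' % (lon, lat)

    query = re.sub(r'"Point\(([-\d.]+) ([-\d.]+)\)"', round_point, query)

    # Step 4: rename variables to varNNN, reusing the same name for the
    # same variable and preserving Label/Description/Alias suffixes.
    var_counter = itertools.count(1)
    var_names: dict[str, str] = {}

    def rename_var(m: re.Match) -> str:
        base, suffix = m.group(1), m.group(2) or ""
        if base not in var_names:
            var_names[base] = "var%d" % next(var_counter)
        return "?" + var_names[base] + suffix

    query = re.sub(r"\?(\w+?)(Label|Description|Alias)?\b", rename_var, query)

    # Step 5: replace string literals longer than 3 characters with
    # stringNNN, reusing the replacement for repeated strings. Whitelist
    # checks for service arguments would go here; they are omitted.
    str_counter = itertools.count(1)
    str_names: dict[str, str] = {}

    def replace_string(m: re.Match) -> str:
        value = m.group(1)
        if len(value) <= 3:
            return m.group(0)
        if value not in str_names:
            str_names[value] = '"string%d"' % next(str_counter)
        return str_names[value]

    return re.sub(r'"([^"]*)"', replace_string, query)

print(sanitize('SELECT ?item WHERE { ?item rdfs:label "Douglas Adams" }'))
# SELECT ?var1 WHERE { ?var1 rdfs:label "string1" }
```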

User agent sanitization

User agents will not be preserved as is, but will be classified into broader classes, according to the following list:

  1. Known tools - a set of regexps identifying known clients, with pre-defined names (e.g. Magnus Manske's tools)
  2. Client classes such as "Java", "Pywikibot", "PHP client", etc. according to pre-defined set of regexps
  3. Identifiable major browsers, such as Chrome, Firefox, Opera, etc. - without version information
  4. All other user agents will be classified as "Other".

In addition to the above, if a user agent class identified per the rules above accounts for fewer than 10K queries on a particular day, or for less than 1% of that day's queries, it will also be reclassified as "Other". The query counts used for this are taken from the published data set, not the original one (i.e., with discarded queries excluded).
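
A minimal sketch of this two-pass classification, assuming a hand-maintained, ordered regexp list; the patterns and class names below are placeholders, not the real whitelist:

```python
import re
from collections import Counter

# Ordered (pattern, class) pairs: known tools first, then client libraries,
# then major browsers. These patterns are placeholders, not the real list.
UA_CLASSES = [
    (re.compile(r"Pywikibot", re.I), "Pywikibot"),
    (re.compile(r"\bJava/"), "Java"),
    (re.compile(r"Opera|OPR/"), "Opera"),  # before Chrome: Opera UAs contain "Chrome/"
    (re.compile(r"Chrome/"), "Chrome"),
    (re.compile(r"Firefox/"), "Firefox"),
]

def classify(user_agent: str) -> str:
    """First pass: map a raw user agent string to a class."""
    for pattern, ua_class in UA_CLASSES:
        if pattern.search(user_agent):
            return ua_class
    return "Other"

def apply_threshold(day_classes: list[str], min_count: int = 10_000,
                    min_share: float = 0.01) -> list[str]:
    """Second pass: demote classes below the daily volume thresholds,
    counted against the published (post-sanitization) data set."""
    counts = Counter(day_classes)
    total = len(day_classes) or 1
    return [
        c if counts[c] >= min_count and counts[c] / total >= min_share
        else "Other"
        for c in day_classes
    ]
```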

Timestamp sanitization

The timestamp is cut off at the hour, and a random minute-second value is generated. The logs are then re-sorted by that synthetic timestamp. This reduces the de-anonymization potential of observing query sequences (e.g. revealing that a certain tool is in use by spotting a sequence of queries peculiar to that tool).
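
A minimal sketch of this step, assuming each log record carries a datetime under a "timestamp" key (an illustrative field name):

```python
import random
from datetime import timedelta

def sanitize_timestamps(records: list[dict]) -> list[dict]:
    """Publish hour-accurate timestamps and shuffle record order within
    each hour via a synthetic random minute-second sort key."""
    for rec in records:
        hour = rec["timestamp"].replace(minute=0, second=0, microsecond=0)
        # Synthetic timestamp: the hour plus a random minute-second offset,
        # used only for re-sorting so original query sequences within the
        # hour are not recoverable from record order.
        rec["_sort_ts"] = hour + timedelta(seconds=random.randrange(3600))
        rec["timestamp"] = hour  # published value keeps hourly accuracy
    records.sort(key=lambda r: r["_sort_ts"])
    for rec in records:
        del rec["_sort_ts"]
    return records
```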
