SHINRA2020-ML (Categorizing 30 language Wikipedia into Extended Named Entity categories)

Introduction

Wikipedia contains a large volume of entities (i.e., articles) and is a great knowledge resource for many NLP tasks. To make full use of this knowledge, resources created from Wikipedia need to be structured for inference, reasoning, and other purposes in NLP applications. Current structured knowledge bases, such as DBpedia, Wikidata, Freebase, and YAGO, are created mostly by bottom-up crowdsourcing, which leads to a significant amount of undesirable noise in the KB. We believe that the structure of the knowledge should be defined top-down rather than bottom-up to create clean and valuable knowledge bases. The first step should be to classify multilingual Wikipedia articles into well-defined, fine-grained categories instead of the existing, cumbersome Wikipedia categories.

Data Availability

The Extended Named Entity (ENE) hierarchy defines about 200 categories in a top-down manner to categorize Wikipedia entities. Compared with Wikipedia categories, which include not only categories but also topics and related issues, ENE categories consist only of clean “categories”, and the hierarchy has up to 4 levels. The current ENE categories have been used to annotate the November 2017 Japanese Wikipedia snapshot. A new version of the annotation, based on the 201901 Wikipedia snapshot, will be shared in October 2019.

Task Definition

This year’s shared task is to categorize Wikipedia entities in the 30 languages with the largest number of users, listed below. We will provide annotated data for training. The annotated corpus consists of 780K Japanese articles (entities) that have already been categorized into ENE categories by supervised machine learning methods; the less reliable data (around half of the total) have been manually verified. Wikipedia editions in different languages are connected by cross-language links. For example, out of the 780K Japanese articles, about 511K have links to English articles. These linked articles can be used as training data, and the task is to categorize the remaining 5.3M English Wikipedia articles (English Wikipedia has about 5 million articles). Training data will be available for the other languages in the same way. Because it is derived from Wikipedia, this dataset will be subject to the Creative Commons Attribution Share-Alike license. The Japanese Wikipedia entities, together with their annotations, are provided as JSON documents. The fully annotated dataset with the multilingual links will be shared via download and database querying.
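The label projection described above can be sketched as follows. This is only a minimal illustration, not the organizers' tooling: the JSON schema is not specified here, so the field names (`page_id`, `ene`, `langlinks`), the helper `project_labels`, and the toy records are all hypothetical.

```python
import json


def project_labels(ja_records, target_lang):
    """Map target-language article titles to ENE labels via cross-language links.

    Each record is ASSUMED (hypothetical schema) to look like:
      {"page_id": 1, "ene": ["City"], "langlinks": {"en": "Kyoto"}}
    """
    training = {}
    for record in ja_records:
        title = record.get("langlinks", {}).get(target_lang)
        if title is None:
            continue  # no linked article in the target language
        training[title] = list(record["ene"])
    return training


# Toy input in JSON Lines form; the values are made up for illustration.
sample = "\n".join([
    json.dumps({"page_id": 1, "ene": ["City"], "langlinks": {"en": "Kyoto"}}),
    json.dumps({"page_id": 2, "ene": ["Person"], "langlinks": {}}),
])
ja_records = [json.loads(line) for line in sample.splitlines()]
en_training = project_labels(ja_records, "en")
print(en_training)
```

Articles in the target language that receive no projected label in this way are the ones a participating system would still need to categorize.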

We will provide not only the training data but also the cleaned, relevant Wikipedia information in JSON format. Participants can leverage any available resources, whether or not they are explicitly provided by the organizers, to improve entity categorization in this task. All submitted systems will be evaluated by the organizers against a test dataset with ground truth (not publicly available to participants). Note that participants must categorize all the data, although only a small test dataset will be used for the evaluation.

The 30 languages are as follows:

English, Spanish, French, German, Chinese, Russian, Portuguese, Italian, Arabic, Indonesian, Turkish, Dutch, Polish, Persian, Swedish, Vietnamese, Korean, Hebrew, Romanian, Norwegian, Czech, Ukrainian, Hindi, Finnish, Hungarian, Danish, Thai, Catalan, Greek, Bulgarian

Resource by Collaborative Contribution (RbCC)

The most notable feature of this project is RbCC. We believe a shared task is a good framework for motivating researchers to develop technologies for a defined task. Researchers use different technologies to optimize their systems on the training data, expecting similar performance on the test data. However, their efforts may be abandoned if the task changes or the evaluation results are not satisfactory. To make full use of the participants' efforts, we propose to design the tasks so that those efforts remain collectively useful in the future. Toward this objective, we propose to create resources in a collaborative manner, dubbed Resource by Collaborative Contribution (RbCC).

Beyond running a shared task, we plan to create resources for multilingual Wikipedia articles from the submitted systems by ensemble learning. For example, if the majority of participants in a specific language agree on the category of an article, we may create the category resource by majority decision. For the articles that remain unsure, the shared task can be run again with enlarged data (bootstrapping) or after annotating some more data (active learning).
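A majority decision of this kind could look like the sketch below. It is an illustration under stated assumptions rather than the organizers' actual procedure: each system is assumed to output one label per article as a plain dictionary, the "more than half" threshold is an arbitrary choice, and the names `majority_label` and `build_resource` are hypothetical.

```python
from collections import Counter


def majority_label(votes, min_agreement=0.5):
    """Return the label chosen by more than `min_agreement` of the votes, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


def build_resource(system_outputs):
    """Split articles into those with an agreed label and those left unsure."""
    agreed, unsure = {}, []
    articles = set().union(*(out.keys() for out in system_outputs))
    for article in sorted(articles):
        # Only systems that labeled this article get a vote.
        votes = [out[article] for out in system_outputs if article in out]
        label = majority_label(votes)
        if label is None:
            unsure.append(article)  # candidate for bootstrapping or more annotation
        else:
            agreed[article] = label
    return agreed, unsure


# Toy outputs from three hypothetical systems; labels are made up.
outputs = [
    {"Kyoto": "City", "Apollo": "God"},
    {"Kyoto": "City", "Apollo": "Spaceship"},
    {"Kyoto": "City", "Apollo": "Company"},
]
agreed, unsure = build_resource(outputs)
print(agreed, unsure)
```

The `unsure` list corresponds to the articles that would be revisited in a later round of the shared task.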

Schedule

October 2019: Release of the data, including Wikipedia data for all 30 languages and the annotated Japanese category information, in JSON format. All Wikipedia information is based on the 201901 version.

October 2019 – July 2020: Formal run.

August 2020: Evaluation Results Released

December 2020: Workshop at NTCIR-15

Task Organizers

Satoshi Sekine (Riken, AIP, Japan)

Jiewen Wu (A*STAR, Singapore)

Christophe Gravier (Université de Lyon, France)

Hsin-Hsi Chen (National Taiwan University)

Haizhou Li (National University of Singapore, Singapore)

