Wikipedia contains a large number of entities (a.k.a. articles), making it a rich source of knowledge for many NLP tasks. To make the most of this knowledge, resources created from Wikipedia need to be structured so that they support inference, reasoning, and other purposes in NLP applications. Current structured knowledge bases, such as DBpedia, Wikidata, Freebase, and YAGO, are created mostly by bottom-up crowdsourcing, which leads to a significant amount of undesirable noise in the KB. We believe that the structure of the knowledge should be defined top-down rather than bottom-up to create clean and valuable knowledge bases. The first step should be to classify multi-lingual Wikipedia articles into well-defined and fine-grained categories, instead of the existing, cumbersome Wikipedia categories.
The Extended Named Entity (ENE) hierarchy defines about 200 categories in a top-down manner to categorize Wikipedia entities. Compared with Wikipedia categories, which include not only categories but also topics and related issues, ENE categories consist only of clean “categories”, and the hierarchy has up to 4 levels. The current ENE categories have been used to annotate the November 2017 Japanese Wikipedia snapshot. A new version of the annotation, based on the November 2018 Wikipedia snapshot, is planned and will be completed by March 2019 for the Shared Task.
This year’s shared task is to categorize Wikipedia entities in the 9 languages with the largest number of users, namely English, Spanish, French, German, Chinese, Russian, Portuguese, Italian and Arabic. We will provide annotated data for training. The annotated corpus consists of 720,000 Japanese articles (entities) that have already been categorized into ENE categories by supervised machine learning methods; the less reliable data (around half of the total) have been manually verified. Wikipedia provides cross-language links between its different language editions. For example, out of the 720,000 Japanese articles, about 200,000 have links to English articles. These can be used as training data, and the task is to categorize the remaining 4,800,000 English Wikipedia articles (English Wikipedia has about 5 million articles). Similarly, training data will be available in the other languages, each of which will have more than 10,000 training articles. Because it is derived from Wikipedia, this dataset will be subject to the Creative Commons Attribution Share-Alike license. The Japanese Wikipedia entities, together with their annotations, are given in the form of JSON documents. The fully annotated dataset with the multilingual links will be made available for download and through database queries.
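As a rough illustration of how training data for a target language can be derived from the Japanese annotations, the sketch below projects ENE categories through cross-language links. The JSON layout is not specified here, so the field names ("title", "ene", "langlinks"), the helper project_labels, and the sample records with their category names are all assumptions made for this example, not the format of the released data.

```python
import json


def project_labels(japanese_records, target_lang):
    """Map target-language article titles to the ENE category of the linked Japanese article.

    Assumes (hypothetically) that each record has an "ene" field holding its
    category and a "langlinks" dict from language code to linked article title.
    """
    projected = {}
    for record in japanese_records:
        target_title = record.get("langlinks", {}).get(target_lang)
        if target_title is not None:
            projected[target_title] = record["ene"]
    return projected


# Hypothetical sample documents, one JSON object per line.
sample_lines = [
    '{"title": "東京都", "ene": "Province", "langlinks": {"en": "Tokyo", "fr": "Tokyo"}}',
    '{"title": "富士山", "ene": "Mountain", "langlinks": {"en": "Mount Fuji"}}',
    '{"title": "架空の記事", "ene": "Person", "langlinks": {}}',
]
records = [json.loads(line) for line in sample_lines]

english_training = project_labels(records, "en")
french_training = project_labels(records, "fr")
```

Articles without a link to the target language, such as the third record, contribute no training example; these unlinked target-language articles are the ones a participating system has to categorize.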
We will provide not only the training data, but also cleaned, relevant Wikipedia information through MongoDB. Participants can leverage any available resources, whether or not they are explicitly provided by the organizers, to improve entity categorization in this task. All submitted systems will be evaluated by the organizers against a test dataset with ground truth, which is not publicly available to participants. Note that participants have to categorize all of the data, although only a small test dataset will be used for the evaluation.
Resource by Collaborative Contribution (RbCC)
The most notable feature of this project is RbCC. We believe a shared task is a good framework for motivating researchers to develop technologies for a defined task. Researchers use different technologies to optimize their systems on the training data, expecting similar performance on the test data. However, their efforts may be abandoned if the task changes or the evaluation results are not satisfactory. To fully utilize the participants' efforts, we propose to design the tasks so that those efforts remain collectively useful in the future. Towards this objective, we propose to create resources in a collaborative manner, dubbed Resource by Collaborative Contribution (RbCC).
Beyond simply running a shared task, we are planning to create resources for multi-lingual Wikipedia articles from the submitted systems by ensemble learning. For example, if a majority of the participants in a specific language agree on the category of an article, we may create the category resource by majority decision. For the articles that remain uncertain, the shared task can be run again, either with enlarged data (bootstrapping) or after annotating some more data (active learning).
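The majority decision described above could be sketched as follows. This is only an illustrative example under stated assumptions: the function names, the strict-majority threshold of 0.5, and the sample system outputs are hypothetical, and the actual ensemble method used by the organizers may differ.

```python
from collections import Counter


def majority_label(predictions, min_agreement=0.5):
    """Return the category chosen by more than min_agreement of the systems, else None."""
    if not predictions:
        return None
    label, count = Counter(predictions).most_common(1)[0]
    return label if count / len(predictions) > min_agreement else None


def build_resource(system_outputs, min_agreement=0.5):
    """Split articles into those resolved by majority vote and those left unsure.

    system_outputs maps a system name to a dict of article title -> category.
    """
    articles = sorted(set().union(*(out.keys() for out in system_outputs.values())))
    resolved, unsure = {}, []
    for article in articles:
        votes = [out[article] for out in system_outputs.values() if article in out]
        label = majority_label(votes, min_agreement)
        if label is None:
            unsure.append(article)  # candidates for bootstrapping or active learning
        else:
            resolved[article] = label
    return resolved, unsure


# Hypothetical outputs from three participating systems.
system_outputs = {
    "system_a": {"Tokyo": "City", "Mount Fuji": "Mountain", "Apple": "Company"},
    "system_b": {"Tokyo": "City", "Mount Fuji": "Mountain", "Apple": "Food"},
    "system_c": {"Tokyo": "City", "Mount Fuji": "Volcano", "Apple": "Plant"},
}
resolved, unsure = build_resource(system_outputs)
```

In this sketch, articles on which no category reaches a strict majority are collected in the unsure list, which corresponds to the data that would be revisited in a later round of the shared task.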
April 1, 2019: Data release, including Wikipedia data for all 10 languages and the annotated Japanese category information in JSON format. All of the Wikipedia information is based on the 201810 version.
June 30, 2019: Dry run. Initial results can be submitted, and we will evaluate them.
October 1, 2019: Deadline for formal run submission.
November: Workshop (hopefully as an EMNLP-2019 workshop)
Satoshi Sekine (AIP, Japan)
Jiewen Wu (A*STAR, Singapore)
Akio Kobayashi (AIP, Japan)