[Fad16d] Towards an Automatic Analyze and Standardization of Unstructured Data in the Context of Big and Linked Data
International conference with peer review:
Keywords: Semantic Web, Computational linguistics, Information extraction, Ontologies, Data Mining, Text Mining, Web Mining, (Big, Linked, Smart) Data, Semantic relations, Contextual
Unstructured data refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Many studies confirm that around 80-90% of all produced information is in unstructured form. This content, rich and above all precious, must therefore be integrated and taken into account for processing and exploitation: the extraction of relevant information from heterogeneous textual data. The goal of the research described here is to present an approach for automating the detection and extraction of meaning from the unstructured Web using its normalized part, the Web of Data and Linked Open Data (LOD), such as RDF WordNet, DBpedia, etc. The approach follows a cyclical process consisting of two phases: (a) creating and generating normalized smart data, either by experts or automatically; (b) exploiting the data created in (a), as "validated expert data", to analyze Big Data and automatically generate new data by learning from Linked Open Data (LOD). The approach is based on a range of linguistic and ontological techniques in the context of Big Data. A software system, EC3, is being implemented at LIP6. EC3 is currently being tested on very large corpora in electronic form, provided by the labex OBVIL (http://obvil.paris-sorbonne.fr) and the BnF (National Library of France).
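The two-phase cycle can be sketched as follows. This is a minimal illustrative sketch, not the actual EC3 implementation: the names (`smart_data`, `annotate`, `propose_candidates`) and the toy triples are assumptions, standing in for real LOD resources such as DBpedia or RDF WordNet.

```python
# Phase (a): expert-validated "smart data" -- here, a tiny entity -> triple
# map in the spirit of DBpedia. All content is illustrative, not real EC3 data.
smart_data = {
    "Paris": ("Paris", "rdf:type", "dbo:City"),
    "Victor Hugo": ("Victor Hugo", "rdf:type", "dbo:Writer"),
}

def annotate(text, knowledge):
    """Phase (b), step 1: detect known entities in unstructured text
    and return the matching validated triples."""
    return [triple for name, triple in knowledge.items() if name in text]

def propose_candidates(text, knowledge):
    """Phase (b), step 2 (naive learning step): capitalized tokens not yet
    in the knowledge base become candidate entities awaiting validation."""
    known = set(knowledge)
    return sorted({w.strip(".,") for w in text.split()
                   if w[:1].isupper() and w.strip(".,") not in known})

text = "Victor Hugo lived in Paris before moving to Guernsey."
found = annotate(text, smart_data)          # validated triples detected
candidates = propose_candidates(text, smart_data)  # new data to validate
```

Feeding the validated candidates back into `smart_data` closes the loop: each pass enriches the knowledge base, which in turn improves the next analysis pass over the Big Data corpus.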