US10726018B2 - Semantic matching and annotation of attributes - Google Patents
Semantic matching and annotation of attributes Download PDFInfo
- Publication number
- US10726018B2 US10726018B2 US14/177,081 US201414177081A US10726018B2 US 10726018 B2 US10726018 B2 US 10726018B2 US 201414177081 A US201414177081 A US 201414177081A US 10726018 B2 US10726018 B2 US 10726018B2
- Authority
- US
- United States
- Prior art keywords
- tables
- semantic
- identifier
- attributes
- query
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Links
- 230000003416 augmentation Effects 0.000 claims abstract description 62
- 238000012545 processing Methods 0.000 claims description 53
- 238000006243 chemical reaction Methods 0.000 claims description 48
- 230000015654 memory Effects 0.000 claims description 40
- 238000003860 storage Methods 0.000 claims description 27
- 238000000034 method Methods 0.000 abstract description 145
- 238000013459 approach Methods 0.000 description 76
- 238000000605 extraction Methods 0.000 description 17
- 230000006870 function Effects 0.000 description 16
- 238000002372 labelling Methods 0.000 description 15
- 238000004891 communication Methods 0.000 description 13
- 238000002474 experimental method Methods 0.000 description 12
- 238000007781 pre-processing Methods 0.000 description 10
- 230000008569 process Effects 0.000 description 9
- 239000008186 active pharmaceutical agent Substances 0.000 description 8
- 238000004458 analytical method Methods 0.000 description 8
- 238000010586 diagram Methods 0.000 description 8
- 230000002093 peripheral effect Effects 0.000 description 8
- 230000003190 augmentative effect Effects 0.000 description 7
- 230000008859 change Effects 0.000 description 7
- 238000013500 data storage Methods 0.000 description 4
- 238000013507 mapping Methods 0.000 description 4
- 208000030514 Leukocyte adhesion deficiency type II Diseases 0.000 description 3
- 238000003491 array Methods 0.000 description 3
- 238000010276 construction Methods 0.000 description 3
- 238000007796 conventional method Methods 0.000 description 3
- 238000012937 correction Methods 0.000 description 3
- 230000008030 elimination Effects 0.000 description 3
- 238000003379 elimination reaction Methods 0.000 description 3
- 230000003068 static effect Effects 0.000 description 3
- 230000005540 biological transmission Effects 0.000 description 2
- 238000009826 distribution Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 239000004744 fabric Substances 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 238000005457 optimization Methods 0.000 description 2
- 230000000644 propagated effect Effects 0.000 description 2
- 230000001902 propagating effect Effects 0.000 description 2
- 230000035945 sensitivity Effects 0.000 description 2
- 230000009466 transformation Effects 0.000 description 2
- 241000721619 Najas Species 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000000295 complement effect Effects 0.000 description 1
- 238000005094 computer simulation Methods 0.000 description 1
- 238000012790 confirmation Methods 0.000 description 1
- 230000009193 crawling Effects 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 235000019800 disodium phosphate Nutrition 0.000 description 1
- 235000019580 granularity Nutrition 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000010295 mobile communication Methods 0.000 description 1
- 230000002085 persistent effect Effects 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
- G06F16/24573—Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24564—Applying rules; Deductive queries
Definitions
- Entities of interest such as companies, schools, etc.
- HTML hypertext markup language
- Such efforts are often referred to as entity augmentation.
- Accuracy of entity augmentation depends on semantic relationships between web tables and semantic labels of those tables.
- Current techniques work well for string-valued, e.g., textual, and static attributes, but current techniques perform poorly for numeric and time-varying attributes. While numeric and/or time-varying information may be available, they are often provided in different units or for different periods of time. Thus, while existing techniques may be well suited to string-values and static attributes, they will often return incorrect information for numeric and/or time-varying attributes. The inaccuracy and need for error correction make information gathering tasks for numeric and time-varying attributes extremely labor-intensive today.
- An entity augmentation operation as described herein can take entity names and multiple keywords describing an attribute associated with the entities as input and leverage a corpus of data, such as tables, which can include HTML web tables, to automatically populate values of the named entities for the attribute including when the attribute is represented by a numeric value, which may be in different units or scales, and when the attribute is represented by numeric values that can vary by time.
- constructs discussed herein enable efficient SMA of attributes from tables, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time.
- the constructs discussed herein can be implemented in a semantic matching and annotation framework provided via an application programming interface (API) or as a separate service, program, or system.
- API application programming interface
- An entity augmentation application programming interface may be used to accept queries with numeric criteria, parameters, or arguments, including those that vary by time.
- FIG. 1 is a block diagram depicting an example environment in which embodiments of semantic matching and automated annotation (SMA) of attributes can operate.
- SMA semantic matching and automated annotation
- FIG. 2 is a block diagram depicting an example computing device of a distributed computing resource.
- FIG. 3 is a block diagram depicting an example client computing device.
- FIG. 4 is a block diagram depicting an example process architecture that can perform SMA of attributes, according to various embodiments.
- FIG. 5 depicts example conversion rules and mutex groups used to perform SMA of attributes, according to some embodiments.
- FIG. 6 illustrates example entity augmentation operations for attributes, according to an example scenario.
- FIG. 7 depicts an example semantic graph over web tables, according to the example scenario of FIG. 6 .
- FIG. 8 depicts an example graphical model to facilitate SMA of attributes from three web tables, according to some embodiments.
- FIG. 9 depicts an example algorithm for independent inference employed in SMA of attributes, according to various embodiments.
- FIG. 10 depicts an example semantic graph employed in SMA of attributes, according to some embodiments.
- FIG. 11 depicts an example algorithm for collective inference employed in SMA of attributes, according to various embodiments.
- FIG. 12 depicts an example algorithm for query processing employed in SMA of attributes, according to various embodiments.
- FIG. 13 illustrates a number of examples of experimental results.
- Embodiments described herein provide techniques and constructs to improve semantic matching and annotation of numeric and time-varying attributes, such as from web tables, using resources including, for example, processing units and accelerators.
- resources may be implemented using specialized programming.
- resources may have different execution models as is the case for graphics processing units (GPUs) and computer processing unit (CPUs).
- GPUs graphics processing units
- CPUs computer processing unit
- entity augmentation involves entities and multiple keywords describing an attribute associated with the entities and leverages data including from a corpus of tables, such as web tables, to automatically populate values of the named entities for the attribute.
- An entity augmentation operation can take entity names and multiple keywords describing an attribute associated with the entities as input and leverage a corpus of tables, such as web tables, to automatically populate values of the named entities for the attribute.
- the corpus can include tables in one or more of hypertext markup language (HTML), resource description framework (RDF), web ontology language (OWL), and/or extensible markup language (XML), for example.
- HTML hypertext markup language
- RDF resource description framework
- OWL web ontology language
- XML extensible markup language
- a semantic graph can be built to facilitate entity augmentation. Accuracy in the semantic graph depends on correctly identifying relationships between tables and meanings of the labels included in the tables.
- Conventional techniques are inadequate for attributes represented by numeric values, including numeric attributes that can vary by time, because the conventional techniques do not account for tables representing data using varied units, scale, and/or time.
- numeric units can include currencies such as US Dollar, Euro, Yen, and/or Yuan; measurements such as metric and imperial, e.g., feet and meters, kilograms and pounds, and Centigrade and Fahrenheit, etc.
- varying numeric scales can include differences in order of magnitude (e.g., billions, millions, thousands, etc.; decade, year, quarter, month, day, hour, minute, second, etc.; dollars, cents, etc.; yards, feet, inches, meters, centimeters, millimeters, etc.).
- Numeric time-varying attributes can include values that change over time or are aggregated by time such as sales per fiscal year; units of production per quarter; costs per month, etc.
- Semantic matching and automated annotation (SMA) of attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, as described herein enhances information gathering tasks involving such attributes by facilitating accurate entity augmentation.
- accurate entity augmentation for attributes involves building a probabilistic graphical model to leverage semantic relationships between tables and semantic labels of those tables.
- One example implementation includes building a semantic graph in which (i) table columns are labeled with unit, scale and timestamp information and (ii) semantic matches between columns are computed, including when the same numeric attribute is expressed in different units or scales in various tables.
- This example implementation can include developing or using an entity augmentation application programming interface (API) suited for attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, which leverages the semantic graph. Building a semantic graph for this example can be challenging since such label information is often missing from the column headers in one or more of the tables.
- the techniques described herein complement local label extraction from column headers by leveraging a wealth of tables on the web and inferring label information from semantically matching columns of other tables. Examples are presented in greater detail in the description of the following figures.
- Techniques for SMA of attributes as described herein can create interdependence between labels and semantic matches which can present a challenge when tables include related data that is presented in differing units, in different scales, or for different periods of time.
- Techniques as described herein address this challenge by representing the entity augmentation task as a probabilistic graphical model that jointly discovers labels and semantic matches over a number of columns, and in some instances all columns, from tables such as web tables.
- Example graphical models are presented in greater detail in the description of the following figures.
- HTML tables where each row corresponds to an entity and each column corresponds to an attribute, although the techniques can be adapted for use with other types of tables.
- web-based HTML tables contain the information sought in information-gathering tasks, although the desired information is typically scattered among various tables.
- To automate information gathering an entity augmentation operation can receive via an interface such as an input interface implemented by an interface module and/or API, for example, entity names and multiple keywords describing an attribute associated with the entities as input.
- the entity augmentation operation can via one or more modules and/or APIs, for example, leverage a corpus of tables, such as tables, to automatically construct, annotate, infer, graph, etc. to populate values of the named entities for the attribute.
- the corpus can include tables in one or more of hypertext markup language (HTML), resource description framework (RDF), web ontology language (OWL), and/or extensible markup language (XML), for example.
- HTML hypertext markup language
- RDF resource description framework
- OWL web ontology language
- the baseline approach to compute semantically consistent matches is to attempt to build a semantic matching graph over web tables.
- each web table has a binary relation with the first column corresponding to the entity name and the second column corresponding to an attribute of the entity (referred to as entity-attribute binary (EAB) relations).
- EAB entity-attribute binary
- each web table can be represented as a node in a graph.
- an edge between two nodes if and only if (i) the first columns of the two web tables contain the same type of entities and (ii) the second columns refer to the same attribute of those entities.
- schema matching There are three main types of schema matching has. The first is semantic schema matching that uses information provided only by the schema and not from particular data instances. The second is syntactic schema matching that uses the actual data instances. The third uses external information, like thesauri, standard schemas, and past mappings. Most current schema matching solutions use hybrid approaches that include all three.
- the techniques and constructs discussed herein are designed to be implemented and executed to perform efficient entity augmentation for attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, by leveraging tables.
- Embodiments described herein provide techniques and constructs applicable to solve problems encountered in SMA of attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time.
- a processing unit configured via programming from modules or APIs to perform techniques as described herein can include one or more of a GPU, a field-programmable gate array (FPGA), another class of digital signal processor (DSP), or other hardware logic components that may, in some instances, be driven by the CPU.
- FPGA field-programmable gate array
- DSP digital signal processor
- illustrative types of hardware logic components include Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
- ASICs Application-Specific Integrated Circuits
- ASSPs Application-Specific Standard Products
- SOCs System-on-a-chip systems
- CPLDs Complex Programmable Logic Devices
- FIGS. 1-13 Various embodiments, scenarios, and aspects are described further with reference to FIGS. 1-13 .
- FIG. 1 shows an example environment 100 in which embodiments of SMA of attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time can operate.
- the various devices and/or components of environment 100 include distributed computing resources 102 that can communicate with one another and with external devices via one or more networks 104 .
- network(s) 104 can include public networks such as the Internet, private networks such as an institutional and/or personal intranet, or some combination of private and public networks.
- Network(s) 104 can also include any type of wired and/or wireless network, including but not limited to local area networks (LANs), wide area networks (WANs), satellite networks, cable networks, Wi-Fi networks, WiMax networks, mobile communications networks (e.g., 3G, 4G, and so forth) or any combination thereof.
- Network(s) 104 can utilize communications protocols, including packet-based and/or datagram-based protocols such as internet protocol (IP), transmission control protocol (TCP), user datagram protocol (UDP), or other types of protocols.
- IP internet protocol
- TCP transmission control protocol
- UDP user datagram protocol
- network(s) 104 can also include a number of devices that facilitate network communications and/or form a hardware basis for the networks, such as switches, routers, gateways, access points, firewalls, base stations, repeaters, backbone devices, and the like.
- network(s) 104 can further include devices that enable connection to a wireless network, such as a wireless access point (WAP).
- WAP wireless access point
- Example embodiments support connectivity through WAPs that send and receive data over various electromagnetic frequencies (e.g., radio frequencies), including WAPs that support Institute of Electrical and Electronics Engineers (IEEE) 802.11 standards (e.g., 802.11g, 802.11n, and so forth), and other standards.
- IEEE Institute of Electrical and Electronics Engineers
- distributed computing resources 102 include devices 106 ( 1 )- 106 (N).
- device(s) 106 can include one or more computing devices that operate in a cluster or other grouped configuration to share resources, balance load, increase performance, provide fail-over support or redundancy, or for other purposes.
- Device(s) 106 can belong to a variety of categories or classes of devices such as traditional server-type devices, desktop computer-type devices, mobile devices, special purpose-type devices, embedded-type devices, and/or wearable-type devices. Thus, although illustrated as desktop computers, device(s) 106 can include a diverse variety of device types and are not limited to a particular type of device.
- Device(s) 106 can represent, but are not limited to, desktop computers, server computers, web-server computers, personal computers, mobile computers, laptop computers, tablet computers, wearable computers, implanted computing devices, telecommunication devices, automotive computers, network enabled televisions, thin clients, terminals, personal data assistants (PDAs), game consoles, gaming devices, work stations, media players, personal video recorders (PVRs), set-top boxes, cameras, integrated components for inclusion in a computing device, appliances, or any other sort of computing device.
- PDAs personal data assistants
- PVRs personal video recorders
- Device(s) 106 can include any type of computing device having one or more processing unit(s) 108 operably connected to memory 110 such as via a bus 112 , which in some instances can include one or more of a system bus, a data bus, an address bus, a PCI bus, a Mini-PCI bus, and any variety of local, peripheral, and/or independent buses.
- Executable instructions stored on memory 110 can include, for example, an operating system 114 , a semantic matching and annotation framework 116 , and other modules, programs, or applications that are loadable and executable by processing units(s) 108 .
- the functionally described herein can be performed, at least in part, by one or more hardware logic components such as accelerators.
- an accelerator can represent a hybrid device, such as one from ZYLEX or ALTERA that includes a CPU course embedded in an FPGA fabric.
- Device 106 can also include one or more network interfaces 118 to enable communications between computing device 106 and other networked devices such as client computing device(s) 120 involved in SMA of attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, or other device(s) 106 over network(s) 104 .
- network interface(s) 118 can include one or more network interface controllers (NICs) or other types of transceiver devices to send and receive communications over a network.
- NICs network interface controllers
- other components are omitted from the illustrated device 106 .
- client computing devices 120 1 )- 120 (N).
- Device(s) 120 can belong to a variety of categories or classes of devices such as traditional client-type devices, desktop computer-type devices, mobile devices, special purpose-type devices, embedded-type devices, and/or wearable-type devices.
- client computing device(s) 120 can include a diverse variety of device types and are not limited to any particular type of device.
- Client computing device(s) 120 can include, but are not limited to, computer navigation type client computing devices 120 ( 1 ) such as satellite-based navigation systems including global positioning system (GPS) devices and other satellite-based navigation system devices, telecommunication devices such as mobile phone 120 ( 2 ) mobile phone tablet hybrid 120 ( 3 ), personal data assistants (PDAs) 120 ( 4 ), tablet computers 120 ( 5 ), laptop computers such as 120 (N), other mobile computers, wearable computers, implanted computing devices, desktop computers, personal computers, automotive computers, network-enabled televisions, thin clients, terminals, game consoles, gaming devices, work stations, media players, personal video recorders (PVR 5 ), set-top boxes, cameras, integrated components for inclusion in a computing device, appliances, or any other sort of computing device.
- computer navigation type client computing devices 120 ( 1 ) such as satellite-based navigation systems including global positioning system (GPS) devices and other satellite-based navigation system devices
- telecommunication devices such as mobile phone 120 ( 2 ) mobile phone tablet hybrid 120 ( 3 ), personal data assistant
- Client computing device(s) 120 can represent any type of computing device having one or more processing unit(s) 122 operably connected to memory 124 such as via a bus 126 , which in some instances can include one or more of a system bus, a data bus, an address bus, a PCI bus, a Mini-PCI bus, and any variety of local, peripheral, and/or independent buses.
- a bus 126 which in some instances can include one or more of a system bus, a data bus, an address bus, a PCI bus, a Mini-PCI bus, and any variety of local, peripheral, and/or independent buses.
- Executable instructions stored on memory 124 can include, for example, an operating system 128 , a semantic matching and annotation framework 130 , and other modules, programs, or applications that are loadable and executable by processing units(s) 122 .
- the functionally described herein can be performed, at least in part, by one or more hardware logic components such as accelerators.
- illustrative types of hardware logic components include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
- an accelerator can represent a hybrid device, such as one from ZYLEX or ALTERA that includes a CPU course embedded in an FPGA fabric.
- Client computing device 120 can also include one or more network interfaces 132 to enable communications between client computing device 120 and other networked devices such as other client computing device(s) 120 or devices 106 over network(s) 104 .
- network interface(s) 132 can include one or more network interface controllers (NICs) or other types of transceiver devices to send and receive communications over a network.
- NICs network interface controllers
- FIG. 2 is a block diagram depicting an example computing device 200 of a distributed computing resource, such as a device 106 from FIG. 1 .
- processing unit(s) 202 can be processing unit(s) 108 and can represent, for example, a CPU-type processing unit, a GPU-type processing unit, a field-programmable gate array (FPGA), another class of digital signal processor (DSP), or other hardware logic components that may, in some instances, be driven by a CPU.
- FPGA field-programmable gate array
- DSP digital signal processor
- illustrative types of hardware logic components include Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
- ASICs Application-Specific Integrated Circuits
- ASSPs Application-Specific Standard Products
- SOCs System-on-a-chip systems
- CPLDs Complex Programmable Logic Devices
- memory 204 can be memory 110 and can store instructions executable by the processing unit(s) 202 , which as discussed above, can represent a processing unit incorporated in device 200 .
- Memory 204 can also store instructions executable by external processing units such as by an external CPU, an external GPU, and/or executable by an external accelerator, such as an FPGA type accelerator, a DSP type accelerator, or any other internal or external accelerator.
- an external CPU such as by an external CPU, an external GPU, and/or executable by an external accelerator, such as an FPGA type accelerator, a DSP type accelerator, or any other internal or external accelerator.
- at least one CPU, GPU, and/or accelerator is incorporated in device 200 , while in some embodiments one or more of a CPU, GPU, and/or accelerator is external to device 200 .
- memory 204 also includes a data store 206 .
- data store 206 includes data storage such as a database, data warehouse, or other type of structured or unstructured data storage.
- data store 206 includes a corpus and/or a relational database with one or more tables, indices, stored procedures, and so forth to enable data access such as web tables including one or more of hypertext markup language (HTML) tables, resource description framework (RDF) tables, web ontology language (OWL) tables, and/or extensible markup language (XML) tables, for example.
- HTML hypertext markup language
- RDF resource description framework
- OWL web ontology language
- XML extensible markup language
- Data store 202 can store data for the operations of processes, applications, components, and/or modules stored in memory 204 and/or executed by processing unit(s) and/or accelerator(s) 202 . Alternately, some or all of the above-referenced data can be stored on separate memories 208 on board one or more processing unit(s) 202 such as a memory on board a CPU-type processor, a GPU-type processor, an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.
- processing unit(s) 202 such as a memory on board a CPU-type processor, a GPU-type processor, an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.
- Device(s) 200 can further include one or more input/output (I/O) interfaces 210 to allow device 200 to communicate with input/output devices such as user input devices including peripheral input devices (e.g., a keyboard, a mouse, a pen, a game controller, a voice input device, a touch input device, a gestural input device, and the like) and/or output devices including peripheral output devices (e.g., a display, a printer, audio speakers, a haptic output, and the like).
- I/O input/output
- peripheral input devices e.g., a keyboard, a mouse, a pen, a game controller, a voice input device, a touch input device, a gestural input device, and the like
- peripheral output devices e.g., a display, a printer, audio speakers, a haptic output, and the like.
- network interface(s) 212 which can be network interface(s) 118 , can represent, for example, network interface controllers (NICs) or other types of transceiver devices to send and receive communications over a network.
- NICs network interface controllers
- memory 204 also includes an operating system 214 , which can be operating system 114 .
- Memory 204 also includes a semantic matching and annotation framework 216 , which can be semantic matching and annotation framework 116 .
- Semantic matching and annotation framework 216 can include one or more modules and/or APIs, which are illustrated as blocks 218 , 220 , 222 , 224 , 226 , and 228 , although this is just an example, and the number can vary higher or lower.
- block 218 can represent an extraction module with logic to program processing unit 202 of device 200 for extraction of one or more tables, such as tables from web pages.
- the extraction module further includes logic to distinguish between relational tables and at least one other type of table.
- Block 220 can represent a construction module with logic to program processing unit 202 for constructing a semantic match between at least two of a plurality of tables including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time.
- Block 222 can represent an annotation module with logic to program processing unit 202 of device 200 for annotating columns of one or more of the at least two of the plurality of tables with unit, scale, and/or time information corresponding to the values of numeric attributes, which may be in different units or scales, and can vary by time.
- block 222 represents logic for performance of such annotation including when label information is missing from column headers in the one or more of the at least two of the plurality of tables.
- Block 224 can represent an inference module with logic to program processing unit 202 of device 200 for inferring label information from another of the at least two of the plurality of tables including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time.
- Block 226 can represent a graphing module with logic to program processing unit 202 of device 200 for building and/or using a probabilistic graphical model to model label discovery and match discovery.
- Block 228 can represent an indexing module with logic to program processing unit 202 of device 200 for building and/or using indexes on the plurality of tables.
- Bus 230 which can be bus 112 , and which in some instances can include one or more of a system bus, a data bus, an address bus, a PCI bus, a Mini-PCI bus, and any variety of local, peripheral, and/or independent buses, can operably connect memory 204 to processing unit(s) 202 .
- FIG. 3 is a block diagram depicting an example client computing device 300 , such as a client device 120 from FIG. 1 .
- processing unit(s) 302 can be processing unit(s) 122 and can represent, for example, a CPU-type processing unit, a GPU-type processing unit, a field-programmable gate array (FPGA), another class of digital signal processor (DSP), or other hardware logic components that may, in some instances, be driven by a CPU.
- FPGA field-programmable gate array
- DSP digital signal processor
- illustrative types of hardware logic components that can be used include Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
- ASICs Application-Specific Integrated Circuits
- ASSPs Application-Specific Standard Products
- SOCs System-on-a-chip systems
- CPLDs Complex Programmable Logic Devices
- memory 304 can be memory 124 and can store instructions executable by the processing unit(s) 302 , which as discussed above, represents a processing unit incorporated in device 300 .
- Memory 304 can also store instructions executable by external processing units such as by an external CPU, an external GPU, and/or executable by an external accelerator, such as an FPGA type accelerator, a DSP type accelerator, or any other internal or external accelerator.
- an external CPU such as by an external CPU, an external GPU, and/or executable by an external accelerator, such as an FPGA type accelerator, a DSP type accelerator, or any other internal or external accelerator.
- at least one CPU, GPU, and/or accelerator is incorporated in device 300 , while in some embodiments one or more of a CPU, GPU, and/or accelerator is external to device 300 .
- memory 304 also includes a data store 306 .
- data store 306 includes data storage such as a database, data warehouse, or other type of structured or unstructured data storage.
- data store 306 includes a relational database with one or more tables, indices, stored procedures, and so forth to enable data access such as web tables including one or more of hypertext markup language (HTML) tables, resource description framework (RDF) tables, web ontology language (OWL) tables, and/or extensible markup language (XML) tables, for example.
- HTML hypertext markup language
- RDF resource description framework
- OWL web ontology language
- XML extensible markup language
- Data store 306 can store data for the operations of processes, applications, components, and/or modules stored in memory 304 and/or executed by processing unit(s) and/or accelerator(s) 302 .
- some or all of the above-referenced data can be stored on separate memories on board one or more processing unit(s) 302 such as a memory 308 on board a CPU-type processor, a GPU-type processor, an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.
- processing unit(s) 302 such as a memory 308 on board a CPU-type processor, a GPU-type processor, an FPGA-type accelerator, a DSP-type accelerator, and/or another accelerator.
- Device(s) 300 can further include one or more input/output (I/O) interfaces 310 to allow device 300 to communicate with input/output devices such as user input devices including peripheral input devices (e.g., a keyboard, a mouse, a pen, a game controller, a voice input device, a touch input device, a gestural input device, and the like) and/or output devices including peripheral output devices (e.g., a display, a printer, audio speakers, a haptic output, and the like).
- I/O input/output
- peripheral input devices e.g., a keyboard, a mouse, a pen, a game controller, a voice input device, a touch input device, a gestural input device, and the like
- peripheral output devices e.g., a display, a printer, audio speakers, a haptic output, and the like.
- network interface(s) 312 which can be network interface(s) 132 , can represent, for example, network interface controllers (NICs) or other types of transceiver devices to send and receive communications over a network.
- NICs network interface controllers
- memory 304 also includes an operating system 314 , which can be operating system 128 .
- Memory 304 also includes a semantic matching and annotation framework 316 , which can be semantic matching and annotation framework 130 .
- Semantic matching and annotation framework 316 can include one or more modules and/or APIs, which are illustrated as blocks 318 , 320 , 322 , 324 , 326 , and 328 , although this is just an example, and the number can vary higher or lower.
- block 318 can represent an interface module with logic to program processing unit 302 of device 300 for receiving an entity augmentation query.
- the entity augmentation query can include a name of an entity, a keyword associated with the entity, time information associated with the keyword, and at least one of unit information associated with the keyword or scale information associated with the keyword.
- Block 320 can represent an identification module with logic to program processing unit 302 for identifying a name of an entity, a keyword associated with the entity, time information associated with the keyword, and at least one of unit information associated with the keyword or scale information associated with the keyword from the entity augmentation query.
- Block 322 can represent an annotation module with logic to program processing unit 302 of device 300 for annotating columns of one or more of the at least two of the plurality of tables with unit, scale, and/or time information corresponding to the values of numeric attributes, which may be in different units or scales, and can vary by time.
- block 222 represents logic for performance of such annotation including when label information is missing from column headers in the one or more of the at least two of the plurality of tables.
- Block 324 can represent a conversion module with logic to program processing unit 302 of device 300 for processing the entity augmentation query based at least in part on existing conversion rules or graphs.
- block 324 represents logic that enables the logic of block 322 to perform annotation including when label information is missing from column headers in the one or more of the at least two of the plurality of tables.
- Block 326 can represent a display module with logic to program processing unit 302 of device 300 for presenting results of the entity augmentation query on a display associated with device 300 .
- block 326 represents logic that optimizes for presentation the results according to a size and/or type of display and/or according to a program for which the results are obtained.
- the logic of block 326 can optimize the results for presentation on a relatively small screen of a mobile device in accordance with the size and resolution of the display.
- the logic of block 326 can expose results of the entity augmentation query to a spreadsheet program for a tabular presentation, a graphical presentation, and/or a chart-type presentation.
- Block 328 can represent a communication module with logic to program processing unit 302 of device 300 for sending or making available the entity augmentation query, or such query having undergone some processing as intermediate stage query toward or for one or more devices 200 and/or for receiving entity augmentation results from one or more devices 200 .
- block 328 can send a query toward one or more indexes on device 200 , or expose the same for access via an API.
- block 328 can receive results such as an augmented table from one or more devices 200 .
- Bus 330 which can be bus 126 , and which in some instances can include one or more of a system bus, a data bus, an address bus, a PCI bus, a Mini-PCI bus, and any variety of local, peripheral, and/or independent buses, can operably connect memory 304 to processing unit(s) 302 .
- one or more of the modules and logic associated with device 200 may operate on device 300 and/or one or more of the modules and logic associated with device 300 may operate on device 200 .
- the modules and logic can be stored on, operated from, or installed from computer-readable media associated with device 200 and/or device 300 .
- Computer-readable media may include computer storage media and/or communication media.
- Computer storage media can include volatile memory, nonvolatile memory, and/or other persistent and/or auxiliary computer storage media, removable and non-removable computer storage media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data.
- Memories 110 , 204 , 208 , 124 , 308 , and/or 304 are examples of computer storage media.
- the memory 110 , 204 , 208 , 124 , 308 , and/or 304 includes tangible and/or physical forms of media included in a device and/or hardware component that is part of a device or external to a device, including but not limited to random-access memory (RAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), phase change memory (PRAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, compact disc read-only memory (CD-ROM), digital versatile disks (DVDs), optical cards or other optical storage media, magnetic cassettes, magnetic tape, magnetic disk storage, magnetic cards or other magnetic storage devices or media, solid-state memory devices, storage arrays, network attached storage, storage area networks, hosted computer storage or any other storage memory, storage device, and/or storage medium that can be used to store and maintain information for access by a computing device.
- RAM random-access memory
- SRAM static random-access
- communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism.
- computer storage media does not include communication media. That is, memory 110 , 204 , 208 , 124 , 308 , and/or 304 , and the described computer storage media encompassed thereby does not include communications media consisting solely of a modulated data signal, a carrier wave, or a propagated signal, per se.
- FIG. 4 is a block diagram depicting an example architecture of processes that semantic matching and annotation framework 216 and/or semantic matching and annotation framework 316 can perform to facilitate SMA of attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time as described herein.
- the architecture 400 includes an offline pre-processing portion 402 and a query-time processing portion 404 .
- semantic matching and annotation framework 116 of device(s) 106 and/or semantic matching and annotation framework 216 of device(s) 200 can perform offline pre-processing portion 402 and semantic matching and annotation framework 130 of device(s) 120 and/or semantic matching and annotation framework 316 of device(s) 300 can perform query time processing 404 .
- one or a number of computing devices extract tables from the web.
- computing device(s) 106 extract relational web tables in HTML format from web pages 408 via web crawling.
- the computing device(s) 106 distinguish relational tables from other types of tables such as layout tables and attribute-value tables to avoid expending further processing on non-relational tables.
- alternate processing can be performed on non-relational tables and/or other data structures.
- the computing device(s) 106 can employ an approach similar to that described in “Uncovering the Relational Web,” Cafarella et al., Proceedings of the 11 th International Workshop on Web and Databases, Jun. 13, 2008, Vancouver Canada, which is incorporated herein by reference.
- one or a number of computer systems such as computing device(s) 200 or 106 build a semantic graph 410 .
- the offline pre-processing part of the system receives conversion rules 412 , which are applied in building the semantic graph 410 .
- one or a number of computing device(s) such as device(s) 200 or 106 build indexes on the tables and the graph for efficient query time processing 414 , which can be stored in a data store or distributed for storage among one or more data stores such as a data store 206 or a data store 306 .
- the offline pre-processing part of the system builds indexes.
- the offline pre-processing part of the system can build three indexes: (i) An inverted index on entities EI. Given a query table Q, EI(Q) returns the set of tables (along with scores) that contains at least one of the query entities.
- NLI An inverted index on column names and semantic labels
- NLI Given a query table Q, NLI(Q) returns the set of tables (along with scores) whose column headers contain the query keywords and/or the set whose semantic labels match with the query labels.
- An index on graph edges GI Given a web table T, GI(T) returns the set of tables that are connected to T in the semantic graph along with a score indicating the strength of the matching relationship.
- Query time processing 404 can be employed for a variety of types of queries. Although examples herein focus on entity augmentation, other appropriate classes of queries include augmentation by example, attribute discovery, and search by column keywords using architecture 400 .
- one or a number of computer computing device(s) such as device(s) 106 , device(s) 200 , device(s) 120 , and/or device(s) 300 identify matching tables and edges along with their scores for the query.
- the computing device(s) identify matching tables and edges along with their scores by leveraging the EI, NLI, and GI indexes from offline pre-processing to fill in values. For example, for individual ones of the query entities, in some cases for each query entity, the system collects the corresponding values in these matching tables along with the scores, converts to the desired unit and scale, aggregates the scores for each value and selects the one with the highest aggregate score to form an augmented table 418 .
- a table represents an entity-attribute binary (EAB) relation.
- a table T ⁇ T is of the form T(K,B) where K denotes an entity name and B denotes an attribute of the entity.
- FIG. 7 shows five web tables (T 1 , T 2 , T 3 , T 4 , T 5 ) satisfying the EAB property.
- Table T has (i) a column header H T of the attribute column T.B (e.g., H T1 for table T 1 in FIG. 7 is “2010 Revenues (USD bil)”).
- H T can be empty if the table has no column headers.
- Table T has (ii) context information C T including header rows that describes the table (that spans across all the columns), caption of the table, text surrounding the table, and the uniform resource locator (URL) and title of the web page from which the table was extracted.
- context information C T including header rows that describes the table (that spans across all the columns), caption of the table, text surrounding the table, and the uniform resource locator (URL) and title of the web page from which the table was extracted.
- URL uniform resource locator
- EAB relations In practice, not all tables or even all web tables are EAB relations. Most such tables have a “subject column,” which contains the names of the entities, and the other columns contain attributes of these entities. Furthermore, there are effective heuristics to detect the subject column.
- techniques for SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time as described herein can employ an approach similar to that described in “Recovering Semantics of Tables on the Web,” Ventetis et al., PVLDB, 4(9):528-538, 2011, which is incorporated herein by reference.
- the techniques described herein include splitting the n-ary table into (n ⁇ 1) EAB relations: each of the so constructed EAB relations including the subject column with one of the other (n ⁇ 1) columns.
- FIG. 5 depicts a set of example conversion rules 502 and mutually exclusive groups (referred to as “mutex groups”) 504 used to perform SMA of attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time in tables, according to some embodiments.
- mutex groups mutually exclusive groups
- the techniques described herein define a set of mutex groups such that any table can be assigned at most one label from a mutex group.
- a set of conversion rules 502 are known at the time of pre-processing.
- conversion rules can be specified by a system administrator or one or more domain experts.
- the rules can be expressed using a simple rule specification language.
- a set of eight rules is shown.
- each rule has a rule identifier (ID) 506 and three components, though other numbers of components may be possible.
- the three components illustrated include a left-hand side (LHS) 508 , a conversion factor ( ⁇ ) 510 , and a right-hand side (RHS) 512 .
- the same unit and scale can be referred to in several ways. For example, USD also can be referred to as $ and US Dollar; mil also can be referred to as min, million and millions.
- the techniques as described herein receive canonical strings for units and scales, in some instances with the assumption that there is a canonical string for every unit and scale. As illustrated in the set 502 , the rules are specified using the canonical strings. Furthermore, the techniques as described herein receive synonyms so that the occurrence of synonyms can be detected in column headers and column values, in some instances with the assumption that all the synonyms are known by the constructs implementing the techniques.
- the techniques and constructs can handle ranges as the conversion factor in order to capture fluctuating conversion factors.
- a rule might specify the conversion factor between euro and USD to be anywhere between 1.2 and 1.3.
- the description herein focuses on a single number as the conversion factor.
- the rules can change with time. For example, a system administrator, other user, or automatic control code, can insert new rules, delete existing rules or modify the LHS, conversion factor or RHS of existing rules.
- FIG. 5 depicts example mutex groups 504 ( 1 ), 504 ( 2 ), and 504 ( 3 ) corresponding to the set of eight rules depicted at 502 , which can be used to perform SMA of attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, according to some embodiments. In other embodiments some of the illustrated mutex groups may be omitted and/or other mutex groups may be added.
- the techniques described herein can identify mutex groups 504 by constructing a graph with each label l ⁇ L as a node and inserting an edge between two nodes if there is a rule with the two strings as its LHS and RHS.
- the techniques determine the connected components of the graph so that each connected component corresponds to a mutex group.
- the techniques described herein facilitate building a semantic graph G over tables that enables SMA of attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time to perform accurate entity augmentation.
- the semantic graph G can contain two types of edges, termed S edges and X edges, in addition to semantic labels.
- S edges refer to edges between two nodes T(K,B) and T(K′,B′) if and only if T.K and T.K′ refer to the same type of entities and T.B and T.B′ refer to the same attribute of those entities, expressed in the same unit and scale and reflect information of the same periods of time (e.g., same year, same quarter, etc.).
- X edges refer to edges between two nodes T(K,B) and T(K′,B′) if and only if T.K and T.K′ refer to the same type of entities and T.B and T.B′ refer to the same attribute of those entities, and reflect information of the same periods of time (e.g., same year, same quarter, etc.) but expressed in different units and/or scales.
- Each X edge is associated with a set of conversion rules which converts the values from T.B to T.B′. Since the reverse rule is present for each forward rule, for every X edge from T to T′, there is an X edge with an equivalent set of rules from T′ to T. Hence, directionality of X edges can be ignored in the discussion.
- a semantic label refers to a label for scale, unit, and/or time at each node. Techniques as described herein distinguish between the scale and unit labels (SU labels in short) and time labels. Generally a node will be assigned multiple SU labels and only one time label.
- FIG. 6 illustrates entity augmentation operations for attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, according to various embodiments.
- Query table 602 is an example of an entity augmentation query for four entities 604 and one keyword 606 in a query table.
- the four entities are four pharmaceutical companies, Eli Lilly 604 ( 1 ), Merck 604 ( 2 ), Roche 604 ( 3 ), and Novartis 604 ( 4 ) with one keyword, revenues 606 .
- a system employing a baseline approach finds tables that match with the query table 602 .
- matching means that the tables contain at least one of the entities and an attribute whose name matches the keyword of the query.
- the baseline system then consolidates the matching tables to fill in the desired values.
- query 602 can match with tables containing 2010 revenues and tables containing 2011 revenues.
- query 602 can also match with tables containing revenues in billions of USD, tables containing revenues in billions of Euros, tables containing revenues in millions of USD and tables containing revenues in millions of Euros (because all of these tables contain the keyword ‘revenues’). Consolidating values from such different tables without understanding the semantic relationships between them can lead to erroneous augmentation as illustrated in result table 608 . Absent markers 610 , it may not be apparent that table 608 shows revenues for Eli Lilly in billions of US dollars for the year 2011, revenues for Merck in billions of US dollars for the year 2010, and revenues for Roche and Novartis in millions of Euros for the year 2010.
- query table 612 shows an example of an entity augmentation query using semantic matching and annotation of attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, for the same four entities as query table 602 , e.g., Eli Lilly 604 ( 1 ), Merck 604 ( 2 ), Roche 604 ( 3 ), and Novartis 604 ( 4 ) with the same keyword, revenues 606 .
- Query table 612 also includes two numeric attributes, “bil” (i.e., billions) 614 and “USD” (i.e., United States Dollars) 616 , and one time-varying attribute, “2010” representing the year 618 .
- matching means that the tables contain at least one of the entities and an attribute whose name matches the keyword of the query as well as tables with attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, provided for by the conversion rules.
- the SMA of attributes system as described herein then applies the conversion rules to attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, from the matching tables and consolidates the matching tables to fill in the desired values.
- the matching tables need not be semantically consistent.
- the query 612 does not match with tables containing 2010 revenues as well as those containing 2011 revenues because there are not conversion rules between 2010 and 2011.
- the query 612 does match with tables containing 2010 revenues in billions of USD as well as tables containing 2010 revenues in millions of Euros since such tables contain the keyword revenues and the time attribute 2010. Applying the conversion rules to consolidate values from such different tables with an understanding the semantic relationships between them leads to appropriate augmentation as illustrated in result table 620 , which represents an augmented table.
- FIG. 7 depicts an example semantic graph 700 over five web tables T 1 , T 2 , T 3 , T 4 , T 5 , according to at least one embodiment of SMA of attributes as described herein.
- the baseline technique can be used to identify the 3 edges shown by lines 702 , 704 , and 706 .
- Edges 702 , 704 , and 706 can be referred to as simple (S) edges because they capture simple, 1:1 mappings, between T 2 and T 3 , T 1 and T 4 , and T 4 and T 5 , respectively.
- the baseline technique will match query table 602 with all five tables T 1 , T 2 , T 3 , T 4 , and T 5 .
- the result of entity augmentation using the baseline approach is as shown in result table 608 of FIG. 6 .
- T 4 and T 5 provide the value 29.1 while T 1 provides 21.8.
- the baseline approach selects 29.1 for presentation in results table 608 of FIG. 6 .
- T 1 provides 27.4, T 5 provides 45.9 and T 2 provides 21091; the baseline approach may select 27.4 for presentation in results table 608 of FIG. 6 based on a matching score.
- T 3 provides the values 36113 and 33762 respectively. The result is undesirable: the value for Eli Lilly is in USD billion and from 2011, the one for Merck is also in USD billion but from 2010 and the ones for Roche and Novartis are in Euro million and from 2010.
- the baseline approach produces semantically inconsistent results.
- a number of the attributes in the example web tables T 1 , T 2 , T 3 , T 4 , and T 5 are numeric and the same semantic attribute occurs in a variety of units and scales across different of the example web tables. For example, in FIG. 7 T 1 shows 2010 revenue in billions of USD while T 2 and T 3 show 2010 revenue in millions of Euros.
- the baseline approach has no knowledge of unit and/or scale nor does the baseline approach have an ability to identify discrepancy in unit and/or scale. Thus, it is impossible for the baseline approach to produce semantically consistent results in this instance.
- many numeric attributes are time-varying in nature. Thus, different tables, which may have been created at different times, contain values of the same attribute for different periods of time.
- T 1 shows the revenue information for 2010 while T 4 and T 5 show the same information for 2011.
- the baseline approach has no knowledge of time nor does the baseline approach have an ability to identify discrepancy in the time periods shown. Thus, again it is impossible for the baseline approach to produce semantically consistent results in this instance.
- the baseline approach fails to detect relationships where the same semantic attribute is expressed in different units or scales. For example, the baseline approach fails to detect that T 1 , T 2 , and T 3 contain the same semantic attribute (2010 revenue) and thus the values could have been converted from one to the other. Without this knowledge, it is impossible for the baseline approach to convert the values in the matching tables into a matching scale and unit.
- Techniques for SMA of attributes as described herein build a semantic graph over tables suited for numeric and time-varying attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time.
- a semantic graph over tables suited for attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time includes semantic labels and semantic matches.
- Semantic labels annotate each attribute column with unit, scale, and a time period or timestamp.
- the examples provided show year as the timestamp for simplicity of explanation, but other time periods may be used.
- Techniques for SMA of attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, as described herein can handle other granularities of timestamp like quarter, month, day, etc.
- the semantic labels are shown in boxes 708 , 710 , 712 , 714 , and 716 of FIG. 7 .
- Semantic matches techniques for SMA of attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time as described herein determine semantic matches between columns representing the same semantic attribute, even if the attribute values are expressed in different units and/or scales.
- the graph shows an edge between T 1 and T 2 , 718 , and between T 1 and T 3 , 720 , in FIG. 7 .
- Edges 718 and 720 can be referred to as transformation or conversion (X) edges because they capture conversion between units and/or scale for mappings between T 1 and T 2 and T 1 and T 3 , respectively.
- an X edge between a pair of tables T and T′ can be associated with a set of conversion rules which when applied convert the values between T.B and T.B′.
- the X edge 718 between T 1 and T 2 and the X edge 720 between T 1 and T 3 are associated with two conversion rules.
- T 3 semantically matches with table T 2 ; the techniques extract T 2 's labels mil and Euro locally, e.g., from T 2 's column header and column values. The system then propagates extract T 2 's labels mil and Euro to T 3 .
- this example illustrates propagation over an S edge, this propagation also can occur over X edges.
- Such propagation creates an interdependence between labels and semantic matches.
- the system employs semantic matches to compute the labels and the system employs the labels to compute the matches.
- techniques for SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, addresses the interdependence by representing the task as a probabilistic graphical model (e.g., a Markov random field) that simultaneously discovers labels and semantic matches over columns, in some instances, all columns.
- a probabilistic graphical model e.g., a Markov random field
- techniques for SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, addresses such inconsistencies by defining hard constraints to control such inconsistent labels and integrating the hard constraints into the inference algorithm.
- spurious matches can result from the values of an attribute such as revenues not changing significantly from one year to another.
- the revenues of Pfizer and Abbott Labs did not change significantly from 2010 to 2011, which led to a spurious semantic match between T 1 and T 4 , which leads to semantically inconsistent results as shown in a second example presented in the discussion below of FIG. 12 .
- Techniques of SMA of attributes as described herein can leverage the discovered labels to eliminate such spurious edges. For example, these techniques can infer that the edge 704 between T 1 and T 4 is spurious based on the discovered labels 2010 and 2011 respectively. Hence edge 704 can be marked as spurious as illustrated by the cross-off 722 or omitted.
- techniques and/or constructs of SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, encode this knowledge as hard constraints into a graphical model.
- techniques as described herein are based on probabilistic graphical models to discover the semantic labels of columns and semantic matches between columns over a number of tables, in some instances all tables, collectively instead of individually.
- the graphical model described herein elegantly combines diverse signals such as “local” extraction of labels from the column headers and values, semantic matches computed using traditional schema matching techniques, and label propagation.
- Such techniques employ particular efficient algorithms to solve tasks of joint discovery.
- techniques and constructs as described herein provide for an entity augmentation API suited for attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time.
- an API allows input specifying unit, scale and time information to unambiguously specify the augmentation task.
- result table 620 of FIG. 6 An example of using such an API and the desired output is shown in result table 620 of FIG. 6 , which is based on the graph shown in FIG. 7 .
- Such techniques employ particular query processing algorithms for the new entity augmentation operation.
- An option for semantic graph construction is to employ a “staged” approach.
- An example staged approach includes a first stage of building a semantic graph as proposed in the baseline approach.
- the semantic graph acquired using the baseline approach will contain only S edges.
- the second stage includes adding semantic labels to the semantic graph acquired using the baseline approach.
- the third stage includes adding X edges to the semantic graph from the second stage.
- this option for semantic graph construction can be implemented as follows.
- L denote a set of scale and unit descriptor strings that appear in the LHSs and RHSs of the conversion rules.
- Y denote the set of year strings.
- the set of year strings includes all year strings from 1900 to 2050, though other time or year strings are possible.
- the system annotates a table T, such as a web table, with y ⁇ Y if the y occurs in either in the column header H T of T.B or in the context C T of T
- T 1 is annotated with 2010.
- the approach can be termed “local extractions.”
- the techniques described herein can add an X edge between T and T′ associated with the set R ⁇ R of conversion rules if and only if (i) the set L T of labels of T contains LHS of each rule r ⁇ R, i.e., ⁇ r ⁇ R r.LHS ⁇ L T (ii) the set L′ T of labels of T′ contains RHS of each rule r ⁇ R, i.e., ⁇ r ⁇ R r.RHS ⁇ L′ T and (iii) the values of the common entities in the two tables can be converted from one to the other by multiplying with the product of the conversion factors, i.e., Sim x (T, T′,R)> ⁇ where Sim x (T, T′,R) denotes the fraction of common entities that can be converted using the product of the conversion factors and ⁇ is a threshold.
- Sim x (T, T′,R) denotes the fraction of common entities that can be converted using the product of the conversion factors and ⁇ is a threshold
- staged approach can suffer from problems of low precision and low coverage.
- text in the context fields and text in the column headers can be noisy and ambiguous, which can cause annotations based on only local extraction to lead to incorrect labels.
- T 3 from FIG. 7 does not contain this information. So, the staged approach can fail to annotate T 3 with the unit, scale and year information. Consequently, the staged approach fails to detect the two X edges (corresponding to conversion rules R 1 and R 2 listed in FIG. 3 ) between T 1 and T 3 .
- conversion rules R 1 and R 2 listed in FIG. 3 The fact that many tables do not have column headers and/or context information exacerbates the low coverage problem.
- Computing the labels using labels of semantically matching columns in addition to using local extractions improves the precision and coverage.
- Computing the labels using labels of semantically matching columns in addition to using local extractions creates an interdependence between labels and semantic matches.
- Techniques as described herein employ semantic matches to compute the labels and the techniques employ the labels to compute the matches.
- Techniques as described herein employ a global approach that collectively computes all the labels and matches and combines the diverse signals.
- techniques for SMA of attributes addresses the interdependence by representing the task as a probabilistic graphical model (e.g., a Markov random field) that simultaneously discovers labels and semantic matches over columns, in some instances, all columns
- a probabilistic graphical model e.g., a Markov random field
- the model represents each element of x as a node in a graph G and captures the dependencies between them with a set of cliques of G.
- a clique is a subset of the set of vertices of the graph G, such that every two vertices in the subset are connected by an edge, or in other words, the subgraph represented by the clique is complete.
- the techniques described herein use two kinds of potentials: node potentials ⁇ (i, X i ) defined on the label X i of a single node i and edge potentials ⁇ (i,j,X i ,X j ) defined over edge (i,j) in G and labels (X i ,X j ) assigned to the two nodes edge (i,j) connects.
- the overall probability distribution is the normalized product of all of the potentials. In logarithmic representation,
- the inference problem is to find argmax x P(X 1 , . . . , X n ), the most likely joint assignment of labels to variables.
- the following descriptions provides operations performed in various embodiments to model the semantic annotation and matching task as a graphical model by defining the random variables, node and edge potentials, the overall objective and finally the inference algorithm.
- FIG. 8 depicts an example graphical model for matching and annotating attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time from three web tables T 1 , T 2 , and T 3 , according to some embodiments.
- the techniques associate two random variables: the first random variable L T to denote the set of SU labels and the second random variable y T to denote the time (year) label.
- L T can take a value from the set P(L) where p(s) denotes the power set of any set s.
- L denotes the set of scale and unit descriptor strings that appear in the LHSs and RHSs of the conversion rules.
- ⁇ T can take a value from the set Y ⁇ ⁇ NA ⁇ (NA denotes no year).
- B TT′ denoting the semantic match between T and T′.
- B TT′ can take a value from the set ⁇ p(R) ⁇ ⁇ S, NA ⁇ .
- R denotes the set of conversion rules.
- techniques for SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time as described herein define node and clique potentials to combine diverse signals like local extraction of labels from the column headers and values, semantic matches computed using traditional schema matching techniques, and label propagation.
- One technique for assigning SU labels is the local extraction technique described in the staged approach, in which an SU label l ⁇ L is assigned to a table T, such as a web table, if and only if either l or one of its synonyms occur in either the column header H T of T.B or column values of T.B.
- techniques for SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time as described herein define two features for this purpose.
- the techniques define a binary feature function ⁇ H (T, l) which is set to 1 if H T contains either the label l or a synonym of l, and 0 otherwise.
- a set LT of labels is a valid assignment if either ⁇ H (T, l) or ⁇ v (T, l) is 1 for all or most labels in it.
- Aspect 802 of FIG. 8 illustrates an example node potential ⁇ su .
- ⁇ su ⁇ ( T , L T ) ⁇ l ⁇ L T ⁇ max ⁇ ( f H ⁇ ( T , l ) , f V ⁇ ( T , l ) ) ⁇ L T ⁇
- SU labeling especially with label propagation, can lead to inconsistencies.
- the same table might be assigned both Euro and USD as labels.
- techniques for SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time as described herein define hard constraints to control such inconsistent labels and integrate them into node potential ⁇ su (T, L T ).
- the techniques described herein define a set of mutually exclusive groups (referred to as “mutex groups”) such that any table can be assigned at most one label from a mutex group.
- the system has the eight conversion rules 502 of FIG. 5 , the mutex groups for the eight rules are shown at 504 of FIG. 5 .
- SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time techniques as described herein can compute the mutex groups by constructing a graph with each label l ⁇ L as a node and inserting an edge between two nodes if there is a rule with the two strings of labels as the LHS and RHS of the rule.
- the techniques as described herein can compute all the connected components of the graph when each connected component corresponds to a mutex group by applying a mutex function.
- Mutex(l, l′) denote a binary variable which is true if there exists a mutex group containing both l and l′ and false otherwise.
- the techniques described herein can modify ⁇ su (T, L T ) to disallow inconsistent labeling by recognizing large negative values as representing inconsistent labels.
- the final node potential ⁇ su (T, L T ) can be formally represented as follows.
- One technique for determining S edges is the schema matching technique described in the baseline approach in which there is likely to be an S edge between two web tables T and T′ if the common entities in the two tables have equal values in the attribute column.
- techniques for SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time as described herein define tuple similarity Sim s (T, T′) as the fraction of common entities that have approximately equal values as follows.
- S is a good assignment for B TT′ if Sim s (T, T′) is above a certain threshold, for example, ⁇ .
- the techniques follow the computation technique described in the staged approach, which means there is likely to be an X edge associated with a set R ⁇ R of conversion rules between T and T′ if (i) labels of T contain the LHSs of R, (ii) labels of T′ contain the RHSs of R, and (iii) the values of the common entities can be converted using the product of the conversion factors.
- Sim X (T, T′, R) can denote the fraction of common entities that can be converted using the product of the conversion factors.
- An X edge associated with set of rules R represents a good assignment if (i) and (ii) are true and Sim X (T, T′, R) is above a certain threshold, for example, ⁇ .
- Aspect 804 of FIG. 8 illustrates an example clique potential ⁇ e .
- the techniques described herein define the final clique potential ⁇ e (T, T′, L T , L′ T , B TT ) as follows:
- SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time
- techniques as described herein assign a year label y ⁇ Y to T if the y occurs either in the column header H T of T B or in the context C T of T.
- the techniques as described herein can define two binary feature functions: (i) ⁇ H (T, y), which is set to 1 if H T contains y and to 0 otherwise, and (ii) ⁇ C (T, y) which is set to 1 if C T contains y and to 0 otherwise.
- y T is a good assignment if either ⁇ H (T, y T ) or ⁇ C (T, y T ) is 1.
- Aspect 806 of FIG. 8 illustrates an example time potential ⁇ y .
- ⁇ y (T, y T ) max( ⁇ H ( T, yT ), ⁇ C ( T, yT ))
- SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time
- the techniques described herein can propagate all the labels l ⁇ L T of T to T′, and vice-versa.
- FIG. 7 illustrates an example of labels so propagated (mil and Euro) from T 2 to T 3 .
- L T and L′ T represent valid assignments to two tables T and T′ connected by an S edge if all of their elements are the same, the number of their elements that are the same exceeds a threshold, or most of their elements are the same, e.g., the set similarity (such as Jaccard Similarity) is high.
- SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time techniques as described herein
- the techniques described herein can propagate all labels l ⁇ (L T ⁇ r ⁇ R r.LHS) from T to T′ and all labels l ⁇ (L T ⁇ r ⁇ R r.RHS) from T′ to T′′.
- the techniques described herein leverage that T and T′ are semantically identical except in the scales and units present in the rules connecting them. Thus, all labels of T can be applied to T′ and vice-versa.
- L T and L′ T represent valid assignments to two tables T and T′ connected by an X edge associated with the set of rules R if all or most of the elements in L T ⁇ r ⁇ R r.LHS and L′ T ⁇ r ⁇ R r.RHS are the same or if the number of their elements that are the same exceeds a threshold, in other words, when the set similarity (e.g., Jaccard Similarity) between those two sets is high.
- Aspect 808 of FIG. 8 illustrates an example label edge potential ⁇ lp .
- the techniques as described herein can propagate the time, e.g., year, label from T to T′ if there is either an S edge or X edge between them.
- Aspect 810 of FIG. 8 illustrates an example time edge potential ⁇ lp .
- Schema matching techniques can introduce spurious edges. Time-varying attributes can be significant source of such errors. For example, if the values of an attribute (e.g., revenues) did not change significantly from one year to another, a spurious match can be returned. In FIG. 7 , the revenues of Pfizer and Abbott Labs did not change significantly from 2010 to 2011, which leads to a spurious semantic match 722 between T 1 and T 4 .
- techniques for SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time as described herein disallow such spurious matches by recognizing large negative values as representing year labels that are not identical.
- Aspect 812 of FIG. 8 illustrates an example edge constraint clique potential ⁇ ec .
- techniques for SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, as described herein accomplish a goal of finding assignment to the variables L T , y T , and B TT′ such that the following objective is maximized:
- FIG. 9 depicts an example algorithm for independent inference in pseudo code, according to various embodiments.
- techniques for SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, first show how to find the optimal node and edge labeling to maximize the objective described above in the case of no label propagation. This optimization can be solved in polynomial time because the optimization for each objective component can be solved completely based on local information, which is similar to the staged approach described before. Pseudo code of Algorithm 1 is shown in FIG. 9 .
- S and X edges are computed using the similarity function and the computation technique described above.
- the edge labels can then be obtained accordingly.
- the techniques for SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, as described herein change the edge label to NA for those having different years on two sides to satisfy the edge constraint ⁇ ec .
- FIG. 10 depicts an example semantic graph 1000 , according to some embodiments of SMA of attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time.
- Techniques for SMA of attributes apply an approximation algorithm to resolve the collective inference by adapting the techniques proposed in probabilistic graphical model.
- At least one embodiment employs a belief propagation algorithm as described in “Probabilistic Graphical Models: Principles and Techniques,” D. Koller and N. Friedman. MIT Press, 2009, which is incorporated by reference herein.
- a belief propagation algorithm operates on a factor graph which contains variable nodes and factor nodes, and proceeds by passing “belief” via messages between variable nodes and factor nodes. In the problem described for FIG. 7 and FIG. 8 , there are three variables and five factors.
- FIG. 7 and FIG. 8 there are three variables and five factors.
- the algorithm can start with initializing the node potential ⁇ su at 1002 and the node potential ⁇ y at 1004 for each table T by doing the local extraction of scale, unit and year labels.
- the collective inference is then processed by iteratively passing messages between neighboring nodes.
- techniques for SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, as described herein schedule the message passing process in three steps.
- the first step is centered on edge potential ⁇ e .
- the messages are sent from the scale/unit label L T 1006 to ⁇ e 1008 , then to the edge label B TT′ 1010 and then sent back.
- This step tries to create more edges between tables based on current belief of node labeling and, at the same time, tries to select labels that are beneficial for edge creation.
- the focus is on the edge potential ⁇ lp 1012 for label propagation.
- the messages are passed from L T 1006 , y T 1014 , and B TT′ 1010 to ⁇ lp 1012 ; ⁇ lp 1012 then sends back messages according to the set of labels that produce the highest potential.
- the final step deals with the constraint ⁇ ec 1016 for eliminating spurious edges.
- the passing schedule for this step is similar to Step 1 except that the messages are sent between year label y T 1014 , edge label B TT′ 1010 , and ⁇ ec 1016 .
- the processing stops when the message values converge or the number of iteration reaches the threshold set for a maximum limit.
- the final values of variables are derived from the messages. Details of an algorithm for collective inference are presented in pseudo code in FIG. 11 .
- FIG. 11 depicts an example algorithm for collective inference in pseudo code, according to at least one embodiment.
- Finding labels that maximize the final objective function with label propagation is NP-Hard, even for a simple case with one variable Y T .
- the objective can be represented by the following:
- each node v ⁇ V corresponds to one T ⁇ T.
- the variable y T can only take labels from set L and cannot be NA.
- maximizing ⁇ y (v, l) is equivalent to minimizing the cost for assigning l to v.
- maximizing ⁇ ec (v, v′, l, l′) is equivalent to minimizing the cost for assigning l and l′ to a pair of nodes.
- the techniques let ⁇ lp be zero so that it does not affect the final labeling.
- the objective function (2) is consistent with the total cost in metric labeling. If a labeling function ⁇ can be found in polynomial time to maximize the objective (2), the labeling function ⁇ can also minimize the total cost C in the metric labeling problem.
- a belief propagation algorithm will be able to scale to factor graphs containing billions of nodes. Since the belief propagation algorithm follows a computational model consistent with large-scale graph processing, the techniques can leverage large-scale graph processing technologies for scalable implementation of collective inference. For example, many large scale graph processing tasks can be expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology.
- Many systems have been developed for such a computation model over large-scale graphs.
- FIG. 12 depicts an example algorithm for query processing in pseudo code, according to various embodiments.
- techniques for SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, as described herein can support entity augmentation queries and/or other types of table search queries, such as web table search queries.
- the API of SMA of attributes significantly extends the previous entity augmentation API which takes only attribute name and entity set as input described in “Infogather: entity augmentation and attribute discovery by holistic matching with web tables,” Yakout, et al., SIGMOND ' 12, 2012.
- Techniques for SMA of attributes, including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, as described herein extend the query language with SU and Y to unambiguously specify the augmentation task.
- the techniques for SMA of attributes including attributes represented by numeric values, which may be in different units or scales, and attributes represented by numeric values that can vary by time, as described herein better leverage the semantic graph built as described herein.
- FIG. 12 represents an example algorithm providing details for query processing in pseudo code, according to various embodiments.
- techniques as described herein identify matching tables. Given a query Q, the techniques leverage the EI index from data store 414 of FIG. 4 to identify the set of tables that contains at least one of the entities in Q.E. Among these tables, the techniques use the NLI index from data store 414 of FIG. 4 to identify the subset of tables that satisfy the following conditions: (1) The column header matches with the attribute name keywords in Q.A; (2) The year label matches with the query label in Q.Y (if any); (3) The unit and scale labels match with each of the query labels in Q.SU. A unit/scale label is considered a match if it occurs in the same mutex group as Q.SU.
- the techniques can convert values from the units in which the values are available to the desired query unit.
- the techniques search for semantically matching tables for each qualified table using GI index from data store 414 of FIG. 4 .
- the qualified tables together with their matching tables become the sources for filling in values at 1204 .
- techniques as described herein fill-in values. For each table identified at 1202 , the techniques collect the values provided in the table for each of the query entities. The techniques compare the semantic labels (unit and scale) of the table with the query labels. If these labels do not agree, the techniques convert the values to the desired unit and scale. The techniques aggregate the converted values. At 1206 , the techniques augment the query entities with the values with the highest aggregate score.
- T 1 , T 2 , and T 3 are matching tables because (1) their column headers contain “revenues” (2) their year labels match with 2010, and (3) they contain unit/scale labels that are in the same mutex group as USD (USD, Euro and Euro respectively).
- USD USD, Euro and Euro respectively.
- T 1 provides 21.8 which is already in USD bil.
- T 1 provides 27.4 and T 2 provides 21091; the latter is converted to USD bil which also results in 27.4.
- T 3 provides the values 36113 and 33762 respectively; they are converted to USD bil resulting in 46.9 and 43.8 respectively. This produces the desirable, semantically consistent results, all from 2010 and all in billions of U.S. Dollars as shown in table 620 of FIG. 6 .
- Table 2 shows conversion rules relevant to the entity augmentation queries regarding company revenue and profit data, data on the population of cities, and data on the total area of countries, showing unit and scale while, for simplicity, omitting conversion factors involved in the rules.
- FIG. 13 shows a number of examples of results from real-life experiments.
- Charts 1302 show the coverage and precision results of augmenting country area in square km.
- the x-axis represents different entity sets: the top-k countries in descending order of the country area for various values of k to study the sensitivity of the approaches to head and tail entities.
- the study assumed the countries with larger area would be head countries and the ones with smaller area would be tail countries.
- S and S-Syn approaches have poor precision: the highest being 0.22 on 250 countries.
- II and CI achieve consistently high precision across all entity sets: an average precision of 0.87 for II approach and 0.93 for CI. This confirms that adding X edges and semantic labels significantly impacts the quality of entity augmentation queries. The improvement is more significant in CI more X edges and labels are created via label propagation.
- Charts 1304 show the coverage and precision results of augmenting country area using a different unit, square feet.
- a drawback of the baseline approaches is more obvious with this query.
- the baseline approaches fail to augment even a single entity because country area is not available in the desired unit; hence, S and S-Syn finds zero relevant tables.
- II and CI approaches achieve high coverage and precision as II and CI can convert the values from the units they are available in to the desired unit.
- Charts 1306 show the coverage and precision results of augmenting country tax rate for year 2006-2009.
- the baseline approaches reach high coverage for each of the query years.
- the average precision is only 0.55 and 0.54 for S and S-Syn respectively due to spurious edges between tables that contain information from different years; hence, the results contain values from years different from the query year.
- the II and CI improve the average precision to 0.72 and 0.92 respectively by eliminating spurious edges. Note that II returned no result for the year 2009 because II did not recognize the tables that contain the query answer without propagating the years. Overall, CI significantly outperforms II and the baselines as it does a better job in propagating the labels and eliminating spurious edges (as shown in Table 4).
- Charts 1308 , 1310 , 1312 , and 1314 illustrate results for four entity augmentation queries on Company and City datasets.
- the charts omit illustration of the S approach because S either returned no results or similar results as S-Syn.
- Charts 1308 show the coverage and precision results of 3 approaches, S-Syn, II, and CI, for the company-revenue query in ⁇ million, $> for year 2011.
- the baseline approach had good coverage but suffered from poor precision ( ⁇ 0.4).
- II did not improve the precision significantly but CI improved the precision to 0.9, which indicates that the higher quality labels and edges produced by CI significantly improved the quality of entity augmentation queries.
- Charts 1310 show the coverage and precision results of the S-Syn, II, and CI approaches for the company-revenue query in ⁇ billion, $> for year 2011. Similar to augmenting country area in square feet 1304 , the baseline approach returned no results, whereas II and CI were not affected by changing of query scale from million to billion. This is due to information not being available in ⁇ billion, $> and II and CI having the ability to convert the values from the units they are available into ⁇ billion, $>. The II and CI approaches achieved high coverage and precision since II and CI can convert the values from the units they are available in to the desired unit.
- Charts 1312 show the coverage and precision results of the S-Syn, II, and CI approaches for the company-profit query in ⁇ million, $> for year 2011.
- the CI approach consistently outperforms both the II approach and the baseline approach.
- Charts 1314 show the coverage and precision results of the S-Syn, II, and CI approaches for the city-population in million for year 2011. Again, CI significantly outperforms both the II approach and the baseline approach on precision and achieves high coverage as well. As shown, in the experiment the average coverage and precision were 0.9 and 0.9 respectively.
- the operations of the example processes are illustrated in individual blocks and summarized with reference to those blocks.
- the processes are illustrated as logical flows of blocks, each block of which can represent one or more operations that can be implemented in hardware, software, or a combination thereof.
- the operations represent computer-executable instructions stored on one or more computer-readable media that, when executed by one or more processors, enable the one or more processors to perform the recited operations.
- computer-executable instructions include routines, programs, objects, modules, components, data structures, and the like that perform particular functions or implement particular abstract data types.
- the order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations can be executed in any order, combined in any order, subdivided into multiple sub-operations, and/or executed in parallel to implement the described processes.
- the described processes can be performed by resources associated with one or more device(s) 106 , 120 , 200 , and/or 300 such as one or more internal or external CPUs or GPUs, and/or one or more pieces of hardware logic such as FPGAs, DSPs, or other types of accelerators.
- All of the methods and processes described above may be embodied in, and fully automated via, software code modules executed by one or more general purpose computers or processors.
- the code modules may be stored in any type of computer-readable storage medium or other computer storage device. Some or all of the methods may alternatively be embodied in specialized computer hardware.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Library & Information Science (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
Description
where R denotes the set of conversion rules and r.LHS (r.RHS) denotes the scale or unit descriptor included in the LHS (RHS) of rule r. Given the synonyms for each scale or unit descriptor l ∈ L, the system annotates a web table T with a label l ∈ L if l or a synonym of l occurs in either the column header HT of T.B or column values of T.B. For example, the system annotates T1 in
For example, the techniques described herein add an X edge between T1 and T2 corresponding to conversion rules R={USD=0.77×Euro, bil=1000×mil} because (i) T1 has labels {USD, bil} (ii) T2 has labels {Euro, mil} and (iii) the values of the common entities (i.e., Pfizer and Merck) satisfy the product of the conversion factors, i.e., 46277/60.1˜(0.77×1000) for Pfizer and 21091/27.4˜(0.77×1000) for Merck. So, Simx(T1, T2,R)=1.0. Note that in some embodiments the techniques allow slight variations in values while checking the convertibilities. For example, the conversion ratio of USD to Euro does not have to be exactly 0.77; and may cover a range such as 0.769 to 0.772, etc.
ψy(T, yT)=max(ƒH(T, yT), ƒC(T, yT))
ψy(v,l)=−c(v,l)
ψec(v, v′, l,l′)=−w(v,v′)·d(l,l′)
ψlp(v, v′, l,l′)=0
-
- A: keywords describing attribute name.
- E: set of entities of interest.
- SU: unit and scale (optional).
- Y: timestamp (optional).
TABLE 1 | |||||
Numeric | |||||
Domain | Tables | Attributes | Average | ||
Company | 39,223 | 80,149 | 2.04 | ||
City | 81,977 | 140,459 | 1.71 | ||
Country | 159,730 | 344,509 | 2.16 | ||
TABLE 2 | |||
Query Attribute | Conversion Tokens | ||
Company | Euro USD | ||
revenue/profit | billion million thousand NA | ||
City population | billion million thousand NA | ||
Country area | sq. meter sq. feet sq. km sq. mile | ||
TABLE 3 | |||||
| II | 510 | same: 428 | ||
CI | 690 | diff: 1 | |||
more: 81 (100%) | |||||
new: 180 (~97%) | |||||
Year | II | 129 | same: 127 | ||
CI | 527 | diff: 2 | |||
more: — | |||||
new: 398 (~89%) | |||||
TABLE 4 | |||||
X Edge | Elimination |
II | CI | II | CI | ||
Number | 205 | 405 | 4 | 336 | ||
Accuracy | ~90% | ~83% | 100% | ~79% | ||
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/177,081 US10726018B2 (en) | 2014-02-10 | 2014-02-10 | Semantic matching and annotation of attributes |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/177,081 US10726018B2 (en) | 2014-02-10 | 2014-02-10 | Semantic matching and annotation of attributes |
Publications (2)
Publication Number | Publication Date |
---|---|
US20150227589A1 US20150227589A1 (en) | 2015-08-13 |
US10726018B2 true US10726018B2 (en) | 2020-07-28 |
Family
ID=53775098
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/177,081 Active 2035-09-05 US10726018B2 (en) | 2014-02-10 | 2014-02-10 | Semantic matching and annotation of attributes |
Country Status (1)
Country | Link |
---|---|
US (1) | US10726018B2 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11366967B2 (en) * | 2019-07-24 | 2022-06-21 | International Business Machines Corporation | Learning roadmaps from unstructured text |
US11580081B1 (en) * | 2019-11-01 | 2023-02-14 | United Services Automobile Association (Usaa) | System and method for large scale anomaly detection |
Families Citing this family (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9565067B1 (en) * | 2014-02-20 | 2017-02-07 | The Mathworks, Inc. | Heterogeneous units management system with dimensionless, ambiguous, and partial units |
WO2015172329A1 (en) * | 2014-05-14 | 2015-11-19 | 华为技术有限公司 | Terminal matching method and matched terminal |
KR102174122B1 (en) | 2014-09-02 | 2020-11-04 | 애플 인크. | Semantic framework for variable haptic output |
US9898488B2 (en) * | 2014-12-01 | 2018-02-20 | Oracle International Corporation | Preserving deprecated database columns |
US20160283523A1 (en) * | 2015-03-24 | 2016-09-29 | International Business Machines Corporation | Schema generation using natural language processing |
US10452661B2 (en) * | 2015-06-18 | 2019-10-22 | Microsoft Technology Licensing, Llc | Automated database schema annotation |
US10102276B2 (en) * | 2015-12-07 | 2018-10-16 | International Business Machines Corporation | Resolving textual numerical queries using natural language processing techniques |
DK179823B1 (en) | 2016-06-12 | 2019-07-12 | Apple Inc. | Devices, methods, and graphical user interfaces for providing haptic feedback |
DK179489B1 (en) | 2016-06-12 | 2019-01-04 | Apple Inc. | Devices, methods and graphical user interfaces for providing haptic feedback |
DK201670720A1 (en) | 2016-09-06 | 2018-03-26 | Apple Inc | Devices, Methods, and Graphical User Interfaces for Generating Tactile Outputs |
DK179278B1 (en) | 2016-09-06 | 2018-03-26 | Apple Inc | Devices, methods and graphical user interfaces for haptic mixing |
US10592557B2 (en) * | 2017-03-31 | 2020-03-17 | Microsoft Technology Licensing, Llc | Phantom results in graph queries |
CN109416695B (en) | 2017-05-08 | 2022-05-27 | 微软技术许可有限责任公司 | Providing local service information in automatic chat |
DK201770372A1 (en) | 2017-05-16 | 2019-01-08 | Apple Inc. | Tactile feedback for locked device user interfaces |
CN108763221B (en) * | 2018-06-20 | 2022-05-17 | 科大讯飞股份有限公司 | Attribute name representation method and device |
CN109271392B (en) * | 2018-10-30 | 2022-07-26 | 长威信息科技发展股份有限公司 | Method and equipment for quickly distinguishing and extracting relational database entity and attribute |
US11500867B2 (en) * | 2018-11-07 | 2022-11-15 | International Business Machines Corporation | Identification of multiple foci for topic summaries in a question answering system |
JP7174377B2 (en) * | 2018-11-26 | 2022-11-17 | 株式会社日立製作所 | Database management system and anonymization processing method |
CN109710811B (en) * | 2018-11-28 | 2021-03-02 | 汉海信息技术(上海)有限公司 | User portrait detection method, device and application system |
CN110427480B (en) * | 2019-06-28 | 2022-10-11 | 平安科技(深圳)有限公司 | Intelligent personalized text recommendation method and device and computer readable storage medium |
EP3896619A1 (en) * | 2020-04-16 | 2021-10-20 | Robert Bosch GmbH | Method and system for keyword search over a knowledge graph |
EP3926492A1 (en) * | 2020-06-19 | 2021-12-22 | Robert Bosch GmbH | Computer-implemented method for keyword search in a knowledge graph |
CN116982037A (en) * | 2020-10-01 | 2023-10-31 | 克劳德斯玛特有限公司 | Semantic coverage in managing and measuring knowledge discovery processes |
US11636119B2 (en) * | 2021-05-20 | 2023-04-25 | Innoplexus Ag | System and method for efficient management of a search database for retrieving context-based information |
CN114741385B (en) * | 2022-03-10 | 2025-01-07 | 北京元年科技股份有限公司 | Method, device, equipment and readable storage medium for generating general data relation structure |
US20230342653A1 (en) * | 2022-04-21 | 2023-10-26 | International Business Machines Corporation | Action Space Reduction for Planning Domains |
CN117353284B (en) * | 2023-09-22 | 2025-02-18 | 东莞市柏群电子科技有限公司 | Low-rotation-speed power supply method and system for generator based on exercise bicycle |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020103793A1 (en) * | 2000-08-02 | 2002-08-01 | Daphne Koller | Method and apparatus for learning probabilistic relational models having attribute and link uncertainty and for performing selectivity estimation using probabilistic relational models |
US20030012428A1 (en) * | 1999-11-16 | 2003-01-16 | Syeda-Mahmood Tanveer Fathima | Method and apparatus for indexing and retrieving images from an image database based on a color query |
US20050262051A1 (en) * | 2004-05-13 | 2005-11-24 | International Business Machines Corporation | Method and system for propagating annotations using pattern matching |
US7149746B2 (en) * | 2002-05-10 | 2006-12-12 | International Business Machines Corporation | Method for schema mapping and data transformation |
US20070156669A1 (en) * | 2005-11-16 | 2007-07-05 | Marchisio Giovanni B | Extending keyword searching to syntactically and semantically annotated data |
US7246116B2 (en) * | 2004-04-22 | 2007-07-17 | International Business Machines Corporation | Method, system and article of manufacturing for converting data values quantified using a first measurement unit into equivalent data values when quantified using a second measurement unit in order to receive query results including data values measured using at least one of the first and second measurement units |
US7249328B1 (en) * | 1999-05-21 | 2007-07-24 | E-Numerate Solutions, Inc. | Tree view for reusable data markup language |
US20070185868A1 (en) * | 2006-02-08 | 2007-08-09 | Roth Mary A | Method and apparatus for semantic search of schema repositories |
US20100030801A1 (en) * | 2008-08-01 | 2010-02-04 | Mitsubishi Electric Corporation | Table classification device, table classification method, and table classification program |
US20100332511A1 (en) | 2009-06-26 | 2010-12-30 | Entanglement Technologies, Llc | System and Methods for Units-Based Numeric Information Retrieval |
US20110071965A1 (en) * | 2009-09-24 | 2011-03-24 | Yahoo! Inc. | System and method for cross domain learning for data augmentation |
US7958489B2 (en) * | 2007-04-12 | 2011-06-07 | Microsoft Corporation | Out of band data augmentation |
US20120011115A1 (en) * | 2010-07-09 | 2012-01-12 | Jayant Madhavan | Table search using recovered semantic information |
US20120254143A1 (en) | 2011-03-31 | 2012-10-04 | Infosys Technologies Ltd. | Natural language querying with cascaded conditional random fields |
US20130238621A1 (en) * | 2012-03-06 | 2013-09-12 | Microsoft Corporation | Entity Augmentation Service from Latent Relational Data |
US20140059043A1 (en) * | 2012-08-27 | 2014-02-27 | Oracle International Corporation | Normalized ranking of semantic query search results |
-
2014
- 2014-02-10 US US14/177,081 patent/US10726018B2/en active Active
Patent Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7249328B1 (en) * | 1999-05-21 | 2007-07-24 | E-Numerate Solutions, Inc. | Tree view for reusable data markup language |
US20030012428A1 (en) * | 1999-11-16 | 2003-01-16 | Syeda-Mahmood Tanveer Fathima | Method and apparatus for indexing and retrieving images from an image database based on a color query |
US20020103793A1 (en) * | 2000-08-02 | 2002-08-01 | Daphne Koller | Method and apparatus for learning probabilistic relational models having attribute and link uncertainty and for performing selectivity estimation using probabilistic relational models |
US7149746B2 (en) * | 2002-05-10 | 2006-12-12 | International Business Machines Corporation | Method for schema mapping and data transformation |
US7246116B2 (en) * | 2004-04-22 | 2007-07-17 | International Business Machines Corporation | Method, system and article of manufacturing for converting data values quantified using a first measurement unit into equivalent data values when quantified using a second measurement unit in order to receive query results including data values measured using at least one of the first and second measurement units |
US20050262051A1 (en) * | 2004-05-13 | 2005-11-24 | International Business Machines Corporation | Method and system for propagating annotations using pattern matching |
US20070156669A1 (en) * | 2005-11-16 | 2007-07-05 | Marchisio Giovanni B | Extending keyword searching to syntactically and semantically annotated data |
US20070185868A1 (en) * | 2006-02-08 | 2007-08-09 | Roth Mary A | Method and apparatus for semantic search of schema repositories |
US7958489B2 (en) * | 2007-04-12 | 2011-06-07 | Microsoft Corporation | Out of band data augmentation |
US20100030801A1 (en) * | 2008-08-01 | 2010-02-04 | Mitsubishi Electric Corporation | Table classification device, table classification method, and table classification program |
US20100332511A1 (en) | 2009-06-26 | 2010-12-30 | Entanglement Technologies, Llc | System and Methods for Units-Based Numeric Information Retrieval |
US20110071965A1 (en) * | 2009-09-24 | 2011-03-24 | Yahoo! Inc. | System and method for cross domain learning for data augmentation |
US20120011115A1 (en) * | 2010-07-09 | 2012-01-12 | Jayant Madhavan | Table search using recovered semantic information |
US20120254143A1 (en) | 2011-03-31 | 2012-10-04 | Infosys Technologies Ltd. | Natural language querying with cascaded conditional random fields |
US20130238621A1 (en) * | 2012-03-06 | 2013-09-12 | Microsoft Corporation | Entity Augmentation Service from Latent Relational Data |
US20140059043A1 (en) * | 2012-08-27 | 2014-02-27 | Oracle International Corporation | Normalized ranking of semantic query search results |
Non-Patent Citations (28)
Title |
---|
"Global 2000 Leading Companies 2011", retrieved on Sep. 13, 2013, at <<http://www.forbes.com/lists/2012/18/global2000_2011.html>>, Forbes, 16 pages. |
"List of Countries and Dependencies by Area", retrieved on Sep. 13, 2013, at <<http://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area>>, Wikipedia, 20 pages. |
"OECD Corporate Income Tax Rates, 1981-2012", retrieved on Sep. 13, 2013, at <<http://taxfoundation.org/article/oecd-corporate-income-tax-rates-1981-2012>>, Tax Foundation, Jun. 29, 2012, 3 pages. |
Assaf et al., "RUBIX: A Framework for Improving Data Integration with Linked Data", In Proceedings of the First International Workshop on Open Data, May 25, 2012, 9 pages. |
Cafarella et al., "Uncovering the Relational Web", In Proceedings of the 11th International Workshop on Web and Databases, Jun. 13, 2008, 6 pages. |
Cafarella et al., "Webtables: Exploring the Power of Tables on the Web", In Proceedings of the VLDB Endowment, vol. 1, No. 1, Aug. 2008, 12 pages. |
Cafarella, et al., "Data Integration for the Relational Web", In Proceedings of the VLDB Endowment,vol. 2, No. 1, Aug. 24, 2009, 12 pages. |
Dhamankar et al., "iMAP: Discovering Complex Semantic Matches Between Database Schemes", In Proceedings of the ACM SIGMOD International Conference on Management of Data, Jun. 13, 2004, 12 pages. |
Doan et al., "Semantic Integration Research in the Database Community: A Brief Survey", In Journal, AI Magazine-Special Issue on Semantic Integration, vol. 26, No. 1, Mar. 2005, 10 pages. |
Doan et al., "Semantic Integration Research in the Database Community: A Brief Survey", In Journal, AI Magazine—Special Issue on Semantic Integration, vol. 26, No. 1, Mar. 2005, 10 pages. |
Johnson, "City Mayors", retrieved on Sep. 19, 2013, at <<http://www.citymayors.com/statistics/largest-cities-mayors-1.html>>, City Mayors & Tann vom Hove, 3 pages. |
Kleinberg et al., "Approximation Algorithms for Classification Problems with Pairwise Relationships: Metric Labeling and Markov Random Fields", In the Journal of the ACM, Sep. 2002, 24 pages. |
Koller et al., "Probabilistic Graphical Models: Principles and Techniques", In the MIT Press, Aug. 2009, 1265 pages. |
Limaye et al., "Annotating and Searching Web Tables Using Entities, Types and Relationships", In Proceedings of the VLDB Endowment, vol. 3, No. 1-2, Sep. 2010, 10 pages. |
Malewicz et al., "Pregel: A System for Large-Scale Graph Processing", In Proceedings of the ACM SIGMOD International Conference on Management of Data, Jun. 6, 2010, 11 pages. |
Mulwad et al., "A Domain Independent Framework for Extracting Linked Semantic Data from Tables", In Search Computing, vol. 7538Jul. 2012, 18 pages. |
Mulwad et al., "Generating Linked Data by Inferring the Semantics of Tables", In Proceedings of the First International Workshop on Searching and Integrating New Web Data Sources-Very Large Data Search, Sep. 2, 2011, 6 pages. |
Mulwad et al., "Generating Linked Data by Inferring the Semantics of Tables", In Proceedings of the First International Workshop on Searching and Integrating New Web Data Sources—Very Large Data Search, Sep. 2, 2011, 6 pages. |
Pimplikar et al., "Answering Table Queries on the Web Using Column Keywords", In Proceedings of the VLDB Endowment, vol. 5, No. 10, Jun. 2012, 12 pages. |
Quercini et al., "Entity Discovery and Annotation in Tables", In Proceedings of the 16th International Conference on Extending Database Technology, Mar. 18, 2013, 12 pages. |
Rahm et al., "A Survey of Approaches to Automatic Schema Matching", In International Journal on Very Large Data Bases, Springer-Verlag, Dec. 2001, pp. 334-350. |
Sarma et al., "Finding Related Tables", In Proceedings of the ACM SIGMOD International Conference on Management of Data, May 20, 2012, 12 pages. |
Shao et al., "Trinity: A Distributed Graph Engine on a Memory Cloud", In Proceedings of the ACM SIGMOD International Conference on Management of Data, Jun. 22, 2013, 12 pages. |
Talukdar et al., "Coupled Temporal Scoping of Relational Facts", In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, Feb. 8, 2012, 10 pages. |
Venetis et al., "Recovering Semantics of Tables on the Web", In Proceedings of the VLDB Endowment, vol. 4, No. 9, Jun. 2011, 11 pages. |
Wang et al., "Harvesting Facts from Textual Web Sources by Constrained Label Propagation", In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, Oct. 24, 2011, 10 pages. |
Yakout et al., "Infogather: Entity Augmentation and Attribute Discovery by Holistic Matching with Web Tables", In Proceedings of the ACM SIGMOD International Conference on Management of Data, May 20, 2012, 12 pages. |
Zhang et al., "Automatic Discovery of Attributes in Relational Databases", In Proceeding of the ACM SIGMOD International Conference on Management of Data, Jun. 12, 2011, 12 pages. |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11366967B2 (en) * | 2019-07-24 | 2022-06-21 | International Business Machines Corporation | Learning roadmaps from unstructured text |
US11580081B1 (en) * | 2019-11-01 | 2023-02-14 | United Services Automobile Association (Usaa) | System and method for large scale anomaly detection |
Also Published As
Publication number | Publication date |
---|---|
US20150227589A1 (en) | 2015-08-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10726018B2 (en) | Semantic matching and annotation of attributes | |
US12013850B2 (en) | Method and system for advanced data conversations | |
US12282757B2 (en) | System and method for automated mapping of data types for use with dataflow environments | |
Zhang et al. | Infogather+ semantic matching and annotation of numeric and time-varying attributes in web tables | |
US10120930B2 (en) | Identifying entity mappings across data assets | |
CN108292310B (en) | Techniques for digital entity correlation | |
US20200301916A1 (en) | Query Template Based Architecture For Processing Natural Language Queries For Data Analysis | |
US10423881B2 (en) | Systems and methods for semantic inference and reasoning | |
US10452661B2 (en) | Automated database schema annotation | |
US20080256121A1 (en) | Method and system for mapping multi-dimensional model to data warehouse schema | |
US11120086B2 (en) | Toponym disambiguation | |
US20230075655A1 (en) | Systems and methods for context-independent database search paths | |
Safra et al. | Location‐based algorithms for finding sets of corresponding objects over several geo‐spatial data sets | |
US10769140B2 (en) | Concept expansion using tables | |
CN118981475A (en) | SQL statement generation method and device based on large model | |
CN118673038B (en) | Index acquisition method, apparatus, electronic device and computer readable storage medium | |
Hou et al. | GEE-OPs: an operator knowledge base for geospatial code generation on the Google Earth Engine platform powered by large language models | |
US8560468B1 (en) | Learning expected values for facts | |
US12099575B2 (en) | Auto-triage failures in A/B testing | |
Araújo et al. | Incremental entity blocking over heterogeneous streaming data | |
Leblay et al. | Exploring the veracity of online claims with backdrop | |
Agael | Navigating Towards Building a Big Data Analytics Platforms with Comprehensive Analytics Capabilities | |
Oreščanin et al. | Conceptual Framework for entity integration from multiple data sources | |
Ardalan | Large-scale information extraction using rules, machine learning, and human computation | |
Mao et al. | A MapReduce-Based Decision-Making Approach for Multiple Criteria Sorting |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: MICROSOFT CORPORATION, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CHAKRABARTI, KAUSHIK;ZHANG, MEIHUI;SIGNING DATES FROM 20140129 TO 20140203;REEL/FRAME:032649/0556 |
|
AS | Assignment |
Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034747/0417 Effective date: 20141014 Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:039025/0454 Effective date: 20141014 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: ADVISORY ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
AS | Assignment |
Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:052504/0345 Effective date: 20141014 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |