CN102364473B - Netnews search system and method based on geographic information and visual information - Google Patents

Info

Publication number
CN102364473B
Authority
CN
China
Prior art keywords
news
module
matrix
retrieval
event
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2011103520023A
Other languages
Chinese (zh)
Other versions
CN102364473A (en)
Inventor
卢汉清
刘静
李泽超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN2011103520023A
Publication of CN102364473A
Application granted
Publication of CN102364473B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a network news retrieval system and method that integrate geographic information and visual information. The system comprises: a data preprocessing module for crawling news data and performing text analysis and information extraction, the news data including person, location, time and text information; a location correlation analysis module for analyzing the correlation between news events and news locations; a news illustration module for selecting suitable images for a news item; and a retrieval result display module for displaying the retrieved news ranked by retrieval relevance. The system and method use geographic location information together with visual information to describe and display network news, and offer network users location-based multimedia news retrieval. They also combine the location-event relationship, the correlation among news locations, and the relationship among news events, so as to give users more vivid and more informative news search results.

Description

Network news retrieval system and method integrating geographic information and visual information

Technical Field

The invention relates to the field of network news retrieval, and in particular to a network news retrieval system and method that integrate geographic information and visual information.

Background

With the development of information technology and the globalization of the Internet, online news has grown in both volume and popularity and has become an important way for people to obtain information in daily life. People can read news through major web portals such as Yahoo and MSN, or through large news websites such as CNN, AOL and MSNBC.

However, the news presentation methods of the prior art have several deficiencies.

For example, existing news presentation methods lack geography-based organization. Studies have shown that users often give priority to news about a few specific places, such as their hometown and workplace. Most large news websites can organize news by country, and a user can submit a place name as a query to retrieve news. However, the place names contained in a document are often noisy, which degrades retrieval performance.

In addition, existing news presentation methods do not contain comprehensive visual information.

Fig. 1 shows the distribution of the number of pictures contained in a news document in the prior art.

As can be seen from Fig. 1, most news documents in the prior art contain no pictures or very few. For example, fewer than 5% of news documents contain more than one picture.

A picture is generally worth a thousand words: as a supplement to the news text, news pictures let users take in information faster. However, as Fig. 1 shows, existing news documents contain very few pictures and therefore fall far short of users' need for comprehensive information.

Summary of the Invention

The object of the invention is to provide a network news retrieval system and method that integrate geographic and visual information. The system and method provide users with news organized by geographic information, so that users can quickly browse news events occurring in the regions they care about. The invention further uses image information to supplement the text, so that users can quickly grasp the content of a news event.

According to one aspect of the invention, a network news retrieval system integrating geographic information and visual information is provided. The system comprises: a data preprocessing module for crawling news data and performing text analysis and information extraction, the news data including person, location, time and text information; a location correlation analysis module for analyzing the correlation between news events and news locations; a news illustration module for selecting suitable images for a news item; and a retrieval result display module for displaying the retrieved news ranked by retrieval relevance.

The data preprocessing module comprises: a news data crawling module for crawling news documents and the corresponding news images from news websites; a text analysis module for extracting the title, time, website, abstract, body text and corresponding URL of each news item, as well as the URL of each news image and the text associated with the image; and a news entity extraction module for extracting persons, locations and times from the news data.

The location correlation analysis module comprises: a place name filtering and expansion module for obtaining the geographic location information of place names; and a matrix factorization based correlation analysis module for analyzing the relationship between news locations and news events using a consistency-constrained probabilistic matrix factorization method.

The news illustration module comprises: a query generation module for extracting one or more keywords from the news data, combining them into queries and submitting the queries to an image search engine for image retrieval; and an image ranking and selection module for ranking and deduplicating the retrieved images and selecting images that can express the content of the news document.

The retrieval result display module comprises: a map view module for displaying where the selected news items are located on a map; and a news event list module for ranking the retrieved news events according to predetermined rules and displaying them as a list.

In the location correlation analysis module, the consistency-constrained probabilistic matrix factorization method analyzes the relationship between news locations and news events based on the following rules: news events that are highly similar are likely to have occurred in the same place, and locations that are highly correlated with one another have similar relationships to the same news event.

In the news illustration module, the query generation module extracts query terms from multiple parts of the news data for image retrieval, and the image ranking and selection module ranks the retrieved images using a rank aggregation method.

In the retrieval result display module, the map view module responds to a query entered by the user, or to a click on any location on the map, by displaying the titles and corresponding images of the most relevant news events. The predetermined ranking rules include one or more of the following: the correlation among news events, the correlation between news events and the queried location, and the time at which the news occurred.

According to another aspect of the invention, a network news retrieval method integrating geographic information and visual information is also provided. The method comprises: a data preprocessing step of crawling news data and performing text analysis and information extraction, the news data including person, location, time and text information; a location correlation analysis step of analyzing the correlation between news events and news locations; a news illustration step of selecting suitable images for a news item; and a retrieval result display step of displaying the retrieved news ranked by retrieval relevance.

As described above, the system and method of the invention provide location-based news retrieval, and estimate and refine the relationship between news documents and geographic locations, taking into account the initial document-location relationship, the relationships among locations, and the similarity among news documents. In addition, a method for illustrating news documents with pictures lets users obtain news more directly and quickly. The invention also provides a retrieval module that supports geographic terms in news, and a search and browsing interface in which users can search by clicking the corresponding location on a map.

The consistency-constrained probabilistic matrix factorization model proposed by the invention fuses the location-event relationship, the correlation among locations and the similarity among events in order to estimate and refine the correlation between locations and events. It can both remove noise and infer latent relationships.
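To make the fusion described above concrete, the following is a minimal sketch and not the patent's actual formulation. It assumes that the consistency constraints can be approximated by graph-Laplacian smoothness terms added to a standard PMF squared-error objective, and that plain gradient descent is an acceptable solver. The function names (`laplacian`, `ccpmf_sketch`) and all hyperparameters (`alpha`, `beta`, `lam`, `lr`, `steps`) are hypothetical.

```python
import numpy as np

def laplacian(sim):
    """Graph Laplacian D - S of a (symmetrized) similarity matrix."""
    sim = (sim + sim.T) / 2.0
    return np.diag(sim.sum(axis=1)) - sim

def ccpmf_sketch(R, S_loc, S_doc, rank=2, alpha=0.1, beta=0.1,
                 lam=0.01, lr=0.01, steps=3000, seed=0):
    """Factorize the location-by-document matrix R as U.T @ V.

    Assumed objective (illustrative only):
      0.5 * ||mask * (R - U.T V)||^2 + lam/2 * (||U||^2 + ||V||^2)
      + alpha/2 * tr(U L_loc U.T) + beta/2 * tr(V L_doc V.T)
    The last two terms pull correlated locations, and similar documents,
    toward similar latent vectors.
    """
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = 0.1 * rng.standard_normal((rank, m))   # latent location factors
    V = 0.1 * rng.standard_normal((rank, n))   # latent document factors
    mask = (R > 0).astype(float)               # fit only observed relations
    L_loc, L_doc = laplacian(S_loc), laplacian(S_doc)
    for _ in range(steps):
        E = mask * (U.T @ V - R)               # masked residual, shape (m, n)
        grad_U = V @ E.T + lam * U + alpha * U @ L_loc
        grad_V = U @ E + lam * V + beta * V @ L_doc
        U -= lr * grad_U
        V -= lr * grad_V
    return U.T @ V                             # refined location-document scores
```

In this sketch the entries of the returned matrix that were zero in `R` act as estimates of latent location-event relations, which is the role the text assigns to the model; how well they do so depends entirely on the assumed objective and parameters.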

The method proposed by the invention for extracting query terms from text, retrieving web images and ranking them makes it possible to illustrate news documents accurately according to a variety of rules.

The method of combining query terms avoids two problems: current web search engines cannot handle complex queries, and a single word used as a query cannot express the content of a document.

In addition, for the different image lists returned by a web search engine, the invention proposes a rank aggregation method that fuses and ranks these lists in order to select the images that best express the content of the news document.

The news document ranking method proposed by the invention considers the timeliness, importance and retrieval relevance of news together. The method is based on the conventional Markov random walk model: the news event-news location correlation obtained from the preceding analysis and the timeliness of the news documents are linearly fused to form the initial state of the model, and the retrieval relevance ranking of the news document collection is then computed from the similarity among news documents.
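The ranking idea in the preceding paragraph can be sketched as a random walk with a prior. This is an illustration under stated assumptions, not the patent's exact procedure: the exponential freshness decay, the fusion weight `mu`, the damping factor and the name `rank_news` are all hypothetical choices.

```python
import numpy as np

def rank_news(sim, relevance, age_days, mu=0.7, damping=0.85,
              half_life=7.0, iters=100):
    """Score documents with a Markov random walk over document similarity.

    sim        -- (n, n) non-negative document similarity matrix
    relevance  -- length-n location relevance of each document to the query
    age_days   -- length-n document age; newer documents get a higher prior
    """
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    # Assumed freshness model: score halves every `half_life` days.
    freshness = 0.5 ** (np.asarray(age_days, dtype=float) / half_life)
    # Linear fusion of relevance and timeliness forms the initial state.
    prior = mu * np.asarray(relevance, dtype=float) + (1.0 - mu) * freshness
    prior = prior / prior.sum()
    # Row-normalize similarities into transition probabilities;
    # documents with no neighbours jump uniformly.
    row_sums = sim.sum(axis=1, keepdims=True)
    P = np.divide(sim, row_sums, out=np.full_like(sim, 1.0 / n),
                  where=row_sums > 0)
    r = prior.copy()
    for _ in range(iters):
        r = damping * (P.T @ r) + (1.0 - damping) * prior
    return r
```

Sorting documents by the returned scores in descending order gives the display order. Documents that are similar to many relevant, recent documents accumulate score through the walk, which is one way to reflect the "importance" criterion mentioned in the text.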

The invention also provides an interactive interface that makes it easy for users to search and browse news. Users can search by submitting a query or by clicking on the map. Each search result is shown with a title, news pictures and a content summary, so users can obtain the information they need quickly and vividly. Users who want more detail can click the "More" button in the interface.

In summary, the invention takes the name of the place where the news occurred as the search keyword, or a click on a location of interest on the map, and gives the user more vivid and more informative news search results. The result interface has two parts: first, a real map on which the titles and pictures of the news most relevant to the queried place are shown at the locations where the events occurred; second, a multimodal result list containing news titles, pictures and short descriptions.

Brief Description of the Drawings

Fig. 1 shows the distribution of the number of pictures contained in a news document in the prior art;

Fig. 2 is a schematic diagram of the network news retrieval system of the invention;

Fig. 3 is a diagram of the consistency-constrained probabilistic matrix factorization model proposed by the invention;

Fig. 4 shows an example of news illustration according to the invention;

Fig. 5 shows the network news search and browsing interface of one embodiment of the invention;

Fig. 6 shows the retrieval performance of the BM25 ranking model, the probabilistic matrix factorization model and the consistency-constrained probabilistic matrix factorization model;

Fig. 7 shows the results for varying parameters under the NDCG50 criterion;

Fig. 8 compares the performance of the news illustration method of the invention with the prior art;

Fig. 9 compares the retrieval relevance of the result ranking method of the invention with that of prior-art ranking methods;

Fig. 10 compares the timeliness of the result ranking method of the invention with that of prior-art ranking methods.

Detailed Description

To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to specific examples and the accompanying drawings. Although the examples are based on English-language news, the method of the invention is not limited to any particular language.

The invention proposes a computer-implemented news retrieval system based on multimedia analysis that makes combined use of geographic and visual information. First, a candidate set of news locations is extracted from the text, filtered and expanded using web information, and geographic coordinates (latitude and longitude) are obtained for each location. The relationship-mining technique proposed by the invention, based on consistency-constrained probabilistic matrix factorization, then discovers latent relationships between news locations and news events, taking into account the correlation among news locations, the similarity among news events, and the initial location-event relationship. Then, so that users can take in news quickly and vividly, the invention proposes a method for illustrating news with pictures.

Although current news documents do contain news pictures, there are too few of them, and more than half of the documents have no picture at all, as shown in Fig. 1. The method proposed by the invention can supply a document with several expressive pictures. For the retrieval results, the invention proposes a PageRank-style ranking method that takes time information into account. A user-friendly news search and browsing interface is also provided.

Fig. 2 is a schematic diagram of the network news retrieval system of the invention.

As shown in Fig. 2, the news retrieval system integrating geographic and visual information comprises a data preprocessing module, a location correlation analysis module, a news illustration module and a retrieval result display module.

The data preprocessing module crawls news data and performs text analysis and information extraction; the news data include person, location, time and text information. The data preprocessing module comprises sub-modules including a news data crawling module, a text analysis module and a news entity extraction module, wherein:

The news data crawling module uses a web crawler to fetch news documents and the corresponding news images from news websites (for example, ABC, BBC, CNN and Google News).

The text analysis module uses natural language processing techniques to extract the title, time, website, abstract, body text and corresponding URL of each news document, as well as the URL of each news image and the text associated with the image.

The news entity extraction module uses natural language processing techniques to remove duplicate documents and to extract persons, locations and times from the news documents.

The location correlation analysis module analyzes the correlation between news events and news locations. It comprises sub-modules including a place name filtering and expansion module and a matrix factorization based correlation analysis module, wherein:

The place name filtering and expansion module obtains the geographic location information (for example, latitude and longitude) of place names.

The matrix factorization based correlation analysis module uses the consistency-constrained probabilistic matrix factorization method of the invention to analyze the relationship between news locations and news events.

The news illustration module selects images that can illustrate the content of a news item. It comprises sub-modules including a query generation module and an image ranking and selection module, wherein:

The query generation module extracts one or more keywords from the news document, combines them into queries of different lengths, and submits the queries to an image search engine (for example, Google) for image retrieval.

In this embodiment, the differing importance of the parts of a news item (title, abstract, body text and so on) can be exploited when extracting query terms for web image retrieval. This addresses two problems: current image search engines cannot handle long queries, and a single query term cannot express the content of a document.
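A minimal sketch of query generation under these ideas follows. The section weights, the tiny stopword list, the frequency-based scoring and the name `build_queries` are assumptions made for illustration; the text does not specify how keywords are scored.

```python
import re
from collections import Counter

# Assumed importance of each part of a news item (not given in the text).
SECTION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}
# Deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "and", "for", "was", "from", "after", "with"}

def build_queries(doc, max_terms=3):
    """Return queries of length 1..max_terms built from weighted keywords."""
    scores = Counter()
    for section, weight in SECTION_WEIGHTS.items():
        for token in re.findall(r"[a-z]+", doc.get(section, "").lower()):
            if token not in STOPWORDS and len(token) > 2:
                scores[token] += weight      # a title hit counts more than a body hit
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    top = ranked[:max_terms]
    # Queries of increasing length, from the single best keyword upward.
    return [" ".join(top[:k]) for k in range(1, len(top) + 1)]

doc = {
    "title": "Earthquake hits Tokyo",
    "abstract": "A strong earthquake shook Tokyo on Monday",
    "body": "Residents of Tokyo reported damage after the earthquake",
}
print(build_queries(doc))
# ['earthquake', 'earthquake tokyo', 'earthquake tokyo hits']
```

Each query in the returned list would be submitted separately to the image search engine, producing one ranked image list per query length.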

The image ranking and selection module ranks and deduplicates the retrieved images and selects suitable ones. It considers the position of each image in the returned list and its similarity to the pictures contained in the source document, uses a rank aggregation method to learn a weight for the list returned by each query length, ranks the images with these weights, removes duplicate images, and then selects the images that can express the news content.
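The aggregation step can be illustrated with a weighted Borda count, which is one common rank aggregation scheme and is used here as a stand-in. In the text the list weights are learned from position and similarity to the source pictures; in this sketch they are simply passed in. Treating images as identical when their identifiers match is also an assumption, and `aggregate_rankings` is a hypothetical name.

```python
from collections import defaultdict

def aggregate_rankings(ranked_lists, weights, top_k=3):
    """Fuse several ranked image lists into one list of the top_k images.

    ranked_lists -- one list of image identifiers per query, best first
    weights      -- one weight per list (assumed given; learned in the text)
    """
    scores = defaultdict(float)
    for images, weight in zip(ranked_lists, weights):
        n = len(images)
        seen = set()
        for position, image in enumerate(images):
            if image in seen:            # ignore repeats within one list
                continue
            seen.add(image)
            # Borda-style score: 1.0 for the top position, 1/n for the last.
            scores[image] += weight * (n - position) / n
    # Merging by identifier also removes duplicates across lists.
    ranked = sorted(scores, key=lambda img: (-scores[img], img))
    return ranked[:top_k]

lists = [["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]]
print(aggregate_rankings(lists, weights=[0.2, 0.3, 0.5]))
# ['b', 'a', 'c']
```

Image `b` wins because it is ranked first in the two most heavily weighted lists. A real system would also need near-duplicate detection on image content, which this sketch does not attempt.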

The retrieval result display module displays the retrieved news ranked by retrieval relevance. The invention provides a user interface for presenting the retrieval results. As shown in Fig. 2, the retrieval result display module comprises sub-modules including a map view module and a news event list module, wherein:

The map view module displays where the selected news items are located on the map.

The news event list module ranks the retrieved news events according to predetermined rules and displays them as a list.

As shown in Fig. 2, the user can enter a query in the search box, or browse the map and click the position corresponding to the place of interest; the system then returns the relevant results automatically.

The retrieval result display module of the invention jointly considers the timeliness of the news, its relevance to the query, and its importance.

As shown in Fig. 2, the map shows the titles and the top two images of the most relevant news. In the list on the right, each news item is shown with its title, related images and a short summary. More information is available by clicking the "More" button.

The structure of the news retrieval system of the invention has been described above. As shown in Fig. 2, and corresponding to the modules of that system, the invention also proposes a network news retrieval method integrating geographic information and visual information. The method comprises the following steps: a data preprocessing step of crawling news data and performing text analysis and information extraction, the news data including person, location, time and text information; a location correlation analysis step of analyzing the correlation between news events and news locations; a news illustration step of selecting suitable images for a news item; and a retrieval result display step of displaying the retrieved news ranked by retrieval relevance.

The location correlation analysis step comprises: a place name filtering and expansion step of obtaining the geographic location information of place names; and a matrix factorization based correlation analysis step of analyzing the relationship between news locations and news events using the consistency-constrained probabilistic matrix factorization method.

Preferably, the consistency-constrained probabilistic matrix factorization method analyzes the relationship between news locations and news events based on the following rules: news events that are highly similar are likely to have occurred in the same place, and locations that are highly correlated with one another have similar relationships to the same news event.

The news illustration step comprises: a query generation step of extracting one or more keywords from the news data, combining them into queries and submitting the queries to an image search engine for image retrieval; and an image ranking and selection step of ranking and deduplicating the retrieved images and selecting suitable ones.

Preferably, the query generation step extracts query terms from multiple parts of the news data for image retrieval, and the image ranking and selection step ranks the retrieved images using a rank aggregation method.

The retrieval result display step comprises: a map view step of displaying where the selected news items are located on the map; and a news event list step of ranking the retrieved news events according to predetermined rules and displaying them as a list.

The predetermined ranking rules include one or more of the following: the correlation among news events, the correlation between news events and the queried location, and the time at which the news occurred.

The map view step responds to a query entered by the user, or to a click on any location on the map, by displaying the titles and corresponding images of the most relevant news events.

As described above, the news retrieval system uses the following four main processes: (1) location correlation analysis based on the consistency-constrained probabilistic matrix factorization model; (2) news illustration; (3) ranking of the retrieval results; (4) the interface for browsing retrieval results.

These sub-processes are described below using English-language retrieval as an example; they mainly comprise the process of locating news documents, the illustration process, and the process of ranking a user's retrieval results. Obviously the invention is not limited to English and can reasonably be applied to other languages, such as Chinese.

<Correlation Analysis of Geographic Locations and News Events>

This process comprises four steps: (1) extracting candidate place names; (2) removing ambiguity from the candidate place names; (3) obtaining the initial relationship between place names and documents; (4) refining the relationship between place names and documents.

First, place names are extracted from the news documents in the database using natural language processing techniques, yielding a list of candidate place names. Each candidate in the list is then submitted to Wikipedia; if the returned page contains no geographic information, the candidate is treated as noise and removed.

Next, to handle cases in which different places share the same name, the filtered list is submitted to a geographic information system (GeoNames) for expansion, and the corresponding geographic coordinates (latitude and longitude) are retrieved.

The frequency with which each place name occurs in the news documents of the database is then counted, which gives the initial relationship between place names and documents.
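The counting step can be sketched as follows, assuming the candidate list has already been filtered and expanded. Whole-word, case-insensitive matching and per-document normalization by the largest count are assumptions added for illustration, and `initial_relation` is a hypothetical name.

```python
import re
import numpy as np

def initial_relation(documents, place_names):
    """Build the initial place-by-document matrix from occurrence counts."""
    R = np.zeros((len(place_names), len(documents)))
    for j, text in enumerate(documents):
        lowered = text.lower()
        for i, name in enumerate(place_names):
            # Whole-word match so that a name is not counted inside another word.
            pattern = r"\b" + re.escape(name.lower()) + r"\b"
            R[i, j] = len(re.findall(pattern, lowered))
    # Assumed normalization: scale each document's counts to [0, 1].
    col_max = R.max(axis=0, keepdims=True)
    return np.divide(R, col_max, out=np.zeros_like(R), where=col_max > 0)

docs = [
    "Beijing hosted the games. Crowds filled Beijing.",
    "The wedding was held in Paris; the bride is from London.",
]
print(initial_relation(docs, ["Beijing", "Paris", "London"]).tolist())
# [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

In the second document the venue (Paris) and an incidental mention (London) receive the same score, which shows why raw counts alone are a noisy starting point.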

The correspondence between place names and documents obtained in this way is noisy. For example, a news story about a celebrity wedding may mention the wedding venue as well as the hometowns of the bride and groom. The wedding venue is the true and most relevant location of the story; the other places are irrelevant. Conversely, the place most relevant to a story may not appear in the article at all: a report on the Beijing Olympics may describe the sporting events without mentioning Beijing, yet Beijing is the most relevant location.

Therefore, to better mine the relationship between news locations and news events (news events and news documents are in one-to-one correspondence, that is, each news document is assumed to describe one news event), the present invention builds on the traditional Probabilistic Matrix Factorization (PMF) model (see Ruslan Salakhutdinov and Andriy Mnih, "Probabilistic Matrix Factorization", NIPS 2008) and proposes a Consistent Constraints Probabilistic Matrix Factorization (CCPMF) model to analyze the relevance between news locations and news documents. Compared with the traditional PMF model, the CCPMF model introduces the consistency between news documents and locations as a constraint on the optimization (related news documents should correspond to related locations, and vice versa), so that the true relevance between the two can be analyzed more effectively. Here, the location correlation is the statistical co-occurrence correlation between place names, computed with a search engine (for example, the Google distance); the correlation between news documents is a text similarity computed as a linear combination that weights the news title, abstract and body text according to their different importance. The fourth step, refining the relationship between place names and documents, is described in detail below.
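As a hedged illustration of a search-engine co-occurrence measure, the sketch below computes the Normalized Google Distance from hit counts. Mapping the distance to a correlation with `exp(-NGD)` is an assumption made here for illustration; the text only says that a Google distance is used. The hit counts are supplied by the caller, since no search API is specified.

```python
import math


def normalized_google_distance(fx, fy, fxy, n_pages):
    """Normalized Google Distance from search hit counts.

    fx, fy  : number of pages containing place name x, y
    fxy     : number of pages containing both names
    n_pages : total number of indexed pages (must exceed fx and fy)
    """
    if min(fx, fy, fxy) <= 0:
        return math.inf  # the names never co-occur
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n_pages) - min(log_fx, log_fy))


def place_correlation(fx, fy, fxy, n_pages):
    """Assumed mapping from distance to a correlation value in [0, 1]."""
    return math.exp(-normalized_google_distance(fx, fy, fxy, n_pages))
```

Two names that always appear together get a correlation of 1, and names that never co-occur get 0.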

Fig. 3 is a diagram of the consistency-constrained probabilistic matrix factorization model proposed by the present invention.

As shown in Fig. 3, highly related events are likely to occur at the same place, and a single news story may be related to several closely related place names. The CCPMF model therefore jointly considers the correlation between place names, the similarity between documents, and the relationship between place names and documents.

Suppose there are $M$ locations and $N$ events. Let $R \in \mathbb{R}^{M \times N}$, $C \in \mathbb{R}^{M \times M}$ and $S \in \mathbb{R}^{N \times N}$ denote the place-event relationship matrix, the correlation matrix between places and the event similarity matrix, respectively. Following the idea of matrix factorization, a latent high-quality feature space is mined by approximating $R$ with $P^T E$, where $P \in \mathbb{R}^{H \times M}$ and $E \in \mathbb{R}^{H \times N}$ are the latent $H$-dimensional place-name and event feature matrices. Figure BDA0000106783350000096 is the initial 0-1 place-event relationship matrix. In the probabilistic matrix factorization model, the estimation error of the place-event relationship is assumed to follow a Gaussian distribution with mean 0 and variance $\sigma_R^2$, so that:

Figure BDA0000106783350000098

where Figure BDA0000106783350000099 denotes the Gaussian distribution function with mean 0 and variance $\sigma^2$, and $p_i$ and $e_j$ are the $i$-th and $j$-th columns of the matrices $P$ and $E$, respectively. $\delta$ is an indicator matrix: $\delta_{ij} = 1$ if the relationship between $i$ and $j$ is greater than zero, and $\delta_{ij} = 0$ otherwise. In addition, the latent feature space and the coefficient matrix are assumed to follow spherical Gaussian distributions, namely:

Figure BDA00001067833500000910

Figure BDA00001067833500000911

where $I$ is the identity matrix. After a simple Bayesian derivation and taking the logarithm, the objective function is obtained:

where $\lambda_P = \sigma_R^2 / \sigma_P^2$ and $\lambda_E = \sigma_R^2 / \sigma_E^2$.
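For orientation only, the objective of the cited PMF paper, rewritten in this document's notation, has the form below. This is a hedged reconstruction rather than the patent's own equation: the symbol $\hat{R}$ for the initial 0-1 place-event matrix is an assumption introduced here, and the regularizers are written with traces to match the $\mathrm{Tr}[E^T E]$ term that appears in the CCPMF objective.

```latex
\mathcal{L}_{\mathrm{PMF}}(P, E)
  = \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{N}
      \delta_{ij} \left( \hat{R}_{ij} - p_i^{T} e_j \right)^{2}
  + \frac{\lambda_P}{2} \, \mathrm{Tr}\!\left[ P^{T} P \right]
  + \frac{\lambda_E}{2} \, \mathrm{Tr}\!\left[ E^{T} E \right]
```

Minimizing this sum of squared errors with quadratic regularization is equivalent to maximizing the log-posterior under the Gaussian likelihood and spherical Gaussian priors described above.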

Taking into account the relationships among place names and among documents, the CCPMF model adds two consistency constraints to the PMF model, which gives the corresponding objective function:

Figure BDA0000106783350000104

$$+ \frac{\lambda_E}{2}\,\mathrm{Tr}[E^T E] + \frac{\lambda_C}{2}\,F_C(R) + \frac{\lambda_S}{2}\,F_S(R)$$

where $\lambda_C$ and $\lambda_S$ are two non-negative weight coefficients that balance place-name correlation against document similarity. $F_C(R)$ and $F_S(R)$ account for the mutual relationships among place names and among documents, respectively, and are defined as:

$$F_C(R) = \frac{1}{2} \sum_{k=1}^{N} \sum_{i,j=1}^{M} (R_{ik} - R_{jk})^2 \, C_{ij} = \mathrm{Tr}[R^T L_C R]$$

$$F_S(R) = \frac{1}{2} \sum_{k=1}^{M} \sum_{i,j=1}^{N} (R_{ki} - R_{kj})^2 \, S_{ij} = \mathrm{Tr}[R L_S R^T]$$

where $L_C = D_C - C$ and $L_S = D_S - S$ are Laplacian matrices, $D_C$ is a diagonal matrix defined as $(D_C)_{ii} = \sum_j C_{ij}$, $D_S$ is a diagonal matrix defined as $(D_S)_{ii} = \sum_j S_{ij}$, and $\mathrm{Tr}[\cdot]$ is the matrix trace operation. The objective function is therefore:

$$+ \frac{\lambda_C}{2}\,\mathrm{Tr}[R^T L_C R] + \frac{\lambda_S}{2}\,\mathrm{Tr}[R L_S R^T].$$
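The equality between the pairwise form of the place consistency term and its trace form can be checked numerically. The sketch below is an illustration with made-up values; it assumes a symmetric correlation matrix and a diagonal degree matrix whose entries are the row sums of that matrix, which is what the identity requires.

```python
def consistency_pairwise(R, C):
    """0.5 * sum over k, i, j of (R[i][k] - R[j][k])**2 * C[i][j]."""
    M, N = len(R), len(R[0])
    return 0.5 * sum((R[i][k] - R[j][k]) ** 2 * C[i][j]
                     for k in range(N) for i in range(M) for j in range(M))


def laplacian(W):
    """L = D - W, where D is diagonal with D[i][i] = sum_j W[i][j]."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def consistency_trace(R, C):
    """Tr[R^T L R] = sum over k, i, j of R[i][k] * L[i][j] * R[j][k]."""
    L = laplacian(C)
    M, N = len(R), len(R[0])
    return sum(R[i][k] * L[i][j] * R[j][k]
               for k in range(N) for i in range(M) for j in range(M))


# Toy data: 3 places x 2 events, symmetric place correlations.
R = [[1.0, 0.0], [0.5, 1.0], [0.0, 1.0]]
C = [[0.0, 0.8, 0.1], [0.8, 0.0, 0.3], [0.1, 0.3, 0.0]]
pairwise_value = consistency_pairwise(R, C)
trace_value = consistency_trace(R, C)
```

Both forms penalize strongly correlated places whose rows of the relationship matrix differ; the trace form is simply easier to differentiate.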

The objective function is solved by gradient descent, which yields a locally optimal solution.
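A minimal gradient-descent sketch is given below. It is written under stated assumptions and is not the patent's implementation: the data-fit term is taken to be the standard PMF squared error over the positive entries of the initial 0-1 matrix (here `R0`), the consistency terms are evaluated on the product of the transposed place factors and the event factors, and the learning rate, step count, toy matrices and function names are all illustrative.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def laplacian(W):
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def inner(A, B):
    """Sum of elementwise products, i.e. Tr[A^T B]."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def objective(P, E, R0, LC, LS, lam_p, lam_e, lam_c, lam_s):
    X = matmul(transpose(P), E)  # X = P^T E, shape M x N
    fit = 0.5 * sum((r0 - x) ** 2 for row_r, row_x in zip(R0, X)
                    for r0, x in zip(row_r, row_x) if r0 > 0)
    reg = 0.5 * lam_p * inner(P, P) + 0.5 * lam_e * inner(E, E)
    f_c = inner(X, matmul(LC, X))  # Tr[X^T L_C X]
    f_s = inner(X, matmul(X, LS))  # Tr[X L_S X^T]
    return fit + reg + 0.5 * lam_c * f_c + 0.5 * lam_s * f_s


def gradients(P, E, R0, LC, LS, lam_p, lam_e, lam_c, lam_s):
    X = matmul(transpose(P), E)
    LCX, XLS = matmul(LC, X), matmul(X, LS)
    # G is the gradient with respect to X (the Laplacians are symmetric).
    G = [[(x - r0 if r0 > 0 else 0.0) + lam_c * a + lam_s * b
          for x, r0, a, b in zip(row_x, row_r, row_a, row_b)]
         for row_x, row_r, row_a, row_b in zip(X, R0, LCX, XLS)]
    grad_p = [[g + lam_p * p for g, p in zip(row_g, row_p)]
              for row_g, row_p in zip(matmul(E, transpose(G)), P)]
    grad_e = [[g + lam_e * e for g, e in zip(row_g, row_e)]
              for row_g, row_e in zip(matmul(P, G), E)]
    return grad_p, grad_e


def fit_ccpmf(R0, C, S, H=2, lam_p=0.001, lam_e=0.001, lam_c=2 ** -3,
              lam_s=2 ** -4, lr=0.02, steps=500, seed=0):
    rng = random.Random(seed)
    M, N = len(R0), len(R0[0])
    P = [[rng.uniform(-0.1, 0.1) for _ in range(M)] for _ in range(H)]
    E = [[rng.uniform(-0.1, 0.1) for _ in range(N)] for _ in range(H)]
    args = (R0, laplacian(C), laplacian(S), lam_p, lam_e, lam_c, lam_s)
    history = [objective(P, E, *args)]
    for _ in range(steps):
        grad_p, grad_e = gradients(P, E, *args)
        P = [[p - lr * g for p, g in zip(rp, rg)] for rp, rg in zip(P, grad_p)]
        E = [[e - lr * g for e, g in zip(re_, rg)] for re_, rg in zip(E, grad_e)]
        history.append(objective(P, E, *args))
    return P, E, history


# Toy data: 3 places, 4 events.
R0 = [[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 1]]
C = [[0.0, 0.8, 0.1], [0.8, 0.0, 0.2], [0.1, 0.2, 0.0]]
S = [[0.0, 0.1, 0.7, 0.1], [0.1, 0.0, 0.1, 0.6],
     [0.7, 0.1, 0.0, 0.1], [0.1, 0.6, 0.1, 0.0]]
P, E, history = fit_ccpmf(R0, C, S)
```

The regularization weights default to the values reported in the embodiment; the latent dimension is reduced to 2 only because the toy data are tiny. The refined relationship matrix is then the product of the transposed `P` and `E`.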

<News image matching>

To help users grasp news content quickly and vividly, the present invention provides a method for matching images to news documents. The process consists of two steps: generating image search queries and ranking the retrieved images.

Fig. 4 shows an example of news image matching according to the present invention.

As shown in Fig. 4, a user who reads news about the American film star Julia Roberts and wonders what she looks like can easily find out from the images provided by the present invention; a user who reads news about a curling match and wonders what curling looks like can quickly understand it from the results provided.

First, image search queries are extracted from the news document. Because current web image search engines cannot handle complex queries, and a single-word query cannot express the content of a document, the present invention provides an effective method for generating image search queries.

Because an article is usually long, extracting queries from the body text is complicated, whereas the manually edited title is a good summary of the document. The present invention therefore extracts query terms from the title first, and falls back to the body text only when the title is too short.

This extraction requires scoring the words in the title and body text; the present invention scores them with a term-frequency/inverse-frequency (TF-IDF style) model. For each document, c query terms are selected. In general, if too many terms are used in a single image query, the search engine returns few or no results; if a single term is used, the returned results cannot represent the content of the document. Combining terms into queries of different lengths therefore gives better results. Accordingly, the present invention combines the query terms into queries of different lengths for image retrieval and then fuses and ranks the returned result lists. The c terms yield a total of Figure BDA0000106783350000111 queries, which are submitted to an image search engine, and the returned images are saved.
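The term selection and query combination can be sketched as follows. This is an illustration under assumptions: the smoothed TF-IDF weighting and the choice to enumerate every non-empty combination of the selected terms (which gives 31 queries for c = 5) are not confirmed by the text, whose query-count formula is not recoverable here; the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def top_terms(title_tokens, corpus_token_lists, c):
    """Pick the c highest-scoring title terms by a smoothed TF-IDF weight."""
    n_docs = len(corpus_token_lists)
    doc_freq = Counter(t for tokens in corpus_token_lists for t in set(tokens))
    tf = Counter(title_tokens)

    def score(term):
        # Smoothed IDF so that rare or unseen terms never divide by zero.
        return tf[term] * math.log((1 + n_docs) / (1 + doc_freq[term]))

    # Ties are broken alphabetically to keep the result deterministic.
    return sorted(tf, key=lambda t: (-score(t), t))[:c]


def build_queries(terms):
    """All non-empty combinations of the terms, from shortest to longest."""
    return [" ".join(combo)
            for k in range(1, len(terms) + 1)
            for combo in combinations(terms, k)]


corpus = [
    ["julia", "roberts", "wins", "award"],
    ["curling", "final", "wins"],
    ["award", "ceremony", "wins"],
]
selected = top_terms(corpus[0], corpus, 2)
queries = build_queries(["julia", "roberts", "film"])
```

Each query string would then be submitted to an image search engine, and the length of each query is kept so that queries of equal length can later share one fusion weight.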

Then the saved images are ranked. For each query, the top h images are crawled to form a list, giving L lists in total. The present invention uses rank aggregation to fuse these L lists and thereby select suitable images for the news document. Some documents already contain manually selected images that reflect the document content well, so images that are visually more similar to the images in the document should be ranked higher. In addition, the position of an image in a list reflects its textual relevance to the query. The proposed method therefore assigns each image an initial relevance score based on its position in the list and its similarity to the images contained in the document:

where $x_i$ is the image at the $k$-th position of the $j$-th list, and Figure BDA0000106783350000121 is the set of images in the document. The present invention uses a 1000-dimensional bag-of-visual-words feature and cosine similarity to measure the similarity between images.
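The visual similarity measure can be sketched as below, treating each image as a bag-of-visual-words histogram. Taking the maximum similarity over the document's own images is an assumption made for illustration; the text does not say how similarity to the image set is aggregated.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_to_document(candidate, document_images):
    """Assumed aggregation: best match against any image in the document."""
    return max((cosine_similarity(candidate, d) for d in document_images),
               default=0.0)
```

In the embodiment the vectors would be 1000-dimensional; the functions above work for any length.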

To balance the contributions of queries of different lengths, the present invention assigns the same weight to all queries of the same length, that is,

Figure BDA0000106783350000122

where $\eta_k$ is the weight shared by the Figure BDA0000106783350000123 queries of length $k$. The score of $x_i$ is therefore

$$s(x_i) = \sum_{j=1}^{L} \theta_j \, s_j(x_i)$$
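The weighted fusion can be sketched as follows. It assumes that the weight of list j is simply the weight assigned to the length of query j, and that an image absent from a list contributes a score of zero for that list; both are assumptions, and the data structures are hypothetical.

```python
def fuse_scores(list_scores, query_lengths, eta):
    """Fuse per-list image scores into one ranking.

    list_scores   : list of dicts, image id -> score in that list
    query_lengths : length (number of terms) of the query behind each list
    eta           : dict, query length -> weight shared by that length
    """
    fused = {}
    for scores, length in zip(list_scores, query_lengths):
        theta = eta[length]  # assumption: per-list weight equals eta[length]
        for image_id, s in scores.items():
            fused[image_id] = fused.get(image_id, 0.0) + theta * s
    # Highest fused score first; ties broken by image id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = fuse_scores(
    [{"a": 1.0, "b": 0.5}, {"b": 1.0, "c": 0.5}],
    query_lengths=[1, 2],
    eta={1: 0.25, 2: 0.75},
)
```

Here image `b` ranks first because it appears in both lists, including the more heavily weighted two-term query.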

The c weights are learned from a training data set: a grid search selects the weights that maximize the normalized Discounted Cumulative Gain at position 10 (nDCG@10) on the training set.
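The evaluation measure used by the grid search can be sketched as follows. Several nDCG variants exist; the exponential gain and base-2 logarithmic discount used here are an assumption, since the text does not state which variant was used. The grid search itself is not shown.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k graded relevance labels."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

The labels are graded relevance values listed in ranked order, for example 2, 1 and 0 for very relevant, relevant and irrelevant.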

A ranked list is obtained from these scores. A duplicate-detection algorithm first removes duplicate images, and then r images are selected for each document from the deduplicated ranked list. Images that already belong to the document are selected first.

<Ranking of retrieval results>

In response to a query submitted by the user, the system returns a series of relevant results. In the news domain, users care about news that is recent, important and relevant to the query. The present invention proposes a news document ranking method that jointly considers the timeliness, importance and query relevance of news. The ranking process includes the following steps: quantifying time information, normalizing the relevance between place names and documents, and relevance ranking (ranking initialization and ranking).

First, the time information of the news documents is quantified. Time is an important factor for news. The time of each news item is first expressed in "year-month-day" form; for example, "September 12, 2010" is expressed as "20100912". Let $date_k$ denote the quantified time value of the $k$-th document; it is normalized as follows:

$$date_k = \frac{date_k - \min_j(date_j)}{\max_j(date_j) - \min_j(date_j)}$$

$$date_k = \frac{date_k}{\sum_j date_j}$$
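The two normalization steps can be sketched as below, reading them as min-max scaling followed by division by the sum. That reading, and the uniform fallback when all dates are equal, are assumptions.

```python
def quantize_date(year, month, day):
    """Encode a date as the integer YYYYMMDD, e.g. 20100912."""
    return year * 10000 + month * 100 + day


def normalize_dates(dates):
    """Min-max scale the quantified dates, then make them sum to 1."""
    lo, hi = min(dates), max(dates)
    if hi == lo:
        # Assumption: identical dates are treated as equally recent.
        scaled = [1.0] * len(dates)
    else:
        scaled = [(d - lo) / (hi - lo) for d in dates]
    total = sum(scaled)
    return [s / total for s in scaled]


normalized = normalize_dates([20100912, 20100913, 20100914])
```

Note that the YYYYMMDD encoding is not linear in elapsed time across month or year boundaries, so the scaled values only preserve the ordering of dates exactly.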

Next, the relevance between documents and locations is normalized. The relevance values between documents and place names obtained from the CCPMF model are normalized as:

$$score_k = \frac{score_k}{\sum_j score_j}$$

Finally, the news documents are ranked. To present the latest, hottest and most relevant news to users, the present invention proposes a news document ranking method based on a Markov random walk model that jointly considers the time information, importance and query relevance of news. The model can be expressed as:

$$r_k^{iter} = y \times r_k^{iter-1} + (1 - y)\, r_k^{0}$$

where $r_k^{iter}$ is the value of the $k$-th document at iteration $iter$, $r_k^{0}$ is the initial ranking value of the $k$-th document, and $y$ is a non-negative constant weight coefficient.

The random walk model requires an initial ranking value. Considering both the time information of the news and its relevance to the query, the present invention sets the initial ranking value to

$$r_k^{0} = \frac{date_k + score_k}{2}$$

During the iteration, the importance of the news documents is taken into account, and the iteration formula proposed by the present invention is:

$$r_k^{iter} = (1 - y)\, r_k^{0} + y \sum_j \frac{S_{kj}}{\sum_m S_{mj}}\, r_j^{iter-1}$$

where $S_{kj}$ denotes the similarity between documents $k$ and $j$, and $y$ is set to 0.85. The iteration is repeated until a steady state is reached, which gives the final ranking.
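The iteration can be sketched as follows. The stopping tolerance, the iteration cap and the toy inputs are assumptions; the update itself follows the iteration formula, with the similarity matrix normalized by its column sums.

```python
def random_walk_rank(r0, S, y=0.85, tol=1e-10, max_iter=1000):
    """Iterate r = (1 - y) * r0 + y * T r, where T[k][j] = S[k][j] / sum_m S[m][j]."""
    n = len(r0)
    col_sums = [sum(S[m][j] for m in range(n)) for j in range(n)]
    r = list(r0)
    for _ in range(max_iter):
        new_r = [
            (1 - y) * r0[k]
            + y * sum(S[k][j] / col_sums[j] * r[j]
                      for j in range(n) if col_sums[j] > 0)
            for k in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_r, r)) < tol:
            return new_r
        r = new_r
    return r


# Toy inputs: normalized dates and normalized place relevance scores.
dates = [0.0, 1 / 3, 2 / 3]
scores = [0.5, 0.3, 0.2]
r0 = [(d + s) / 2 for d, s in zip(dates, scores)]
S = [[1.0, 0.2, 0.1], [0.2, 1.0, 0.6], [0.1, 0.6, 1.0]]
ranks = random_walk_rank(r0, S)
order = sorted(range(len(ranks)), key=lambda k: -ranks[k])
```

Because each column of the normalized similarity matrix sums to 1, the total ranking mass is preserved from one iteration to the next, and with y below 1 the iteration converges to a unique fixed point.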

The user interface is described in detail below.

<Retrieval result browsing interface>

Fig. 5 shows the web news retrieval and browsing interface of one embodiment of the present invention.

To give users a vivid and fast way to search and browse, the present invention provides a novel user retrieval and browsing interface.

Referring to Fig. 5, the interface consists of two modules, a map view and a news event list, corresponding to the left and right halves of Fig. 5. The left half is a map view based on Google Maps. Users can search by entering a query in the search box above, or by browsing the map and double-clicking a place of interest. The system returns the results and opens a window at the corresponding location on the map showing the title of the top-ranked news item and its two most relevant images; users can click the "More" button for further information. The right half lists all events related to the query, sorted in descending order of relevance between the news documents and the query. Each list entry corresponds to one news event and shows five news images, the title and a brief description of that event. Users who want the details can click the "More" button to read the full text, or click the news title to visit the original web page.

<Preferred embodiment>

The technical effects of the algorithm and interface provided by the present invention are illustrated below through a specific embodiment. In this embodiment, all data were crawled from news websites such as ABC, BBC, CNN and Google. There are 48,429 news documents and 20,862 news images in total. After filtering and expansion, 4,742 place names were obtained. The parameters were set as follows. For the CCPMF model, $H = 100$, $\lambda_P = \lambda_E = 0.001$, $\lambda_C = 2^{-3}$ and $\lambda_S = 2^{-4}$. For news image matching, $h = 20$ and $c = 5$, that is, 5 query terms are extracted from each document.

To evaluate user satisfaction with the whole system, several users scored the results according to predefined rules, and the scores were evaluated with the nDCG criterion.

The evaluation requires some manually labeled data, including the relevance between place names and documents and the relevance between images and documents. Three levels were used in this embodiment: very relevant, relevant and irrelevant, quantified as 2, 1 and 0, respectively. In addition, 30 users aged between 20 and 30 were invited to take part in a user study; the participants came from two countries and regularly read English news online.

The location relevance analysis was evaluated first. 500 documents were randomly selected and evaluated before and after the relationships were refined by CCPMF, giving average values of 0.492 and 0.954, respectively. This shows that CCPMF is markedly effective at denoising and refining the place-event relationships.

Second, the place-name relevance analysis was tested through retrieval. 100 place names were randomly selected as queries, and CCPMF was compared with the BM25 ranking model and the traditional PMF model.

Fig. 6 shows the retrieval performance of the BM25 ranking model, the PMF model and the CCPMF model.

In Fig. 6, the horizontal axis is the number of top-ranked documents used in the evaluation and the vertical axis is the nDCG value. As shown in Fig. 6, the nDCG values of the proposed CCPMF method are clearly higher than those of the other methods, which indicates that the analysis method of the present invention has a significant technical effect.

To further examine how the model parameters affect the system, the results were evaluated while varying $\lambda_C$ (with $\lambda_S$ fixed) and while varying $\lambda_S$ (with $\lambda_C$ fixed).

Fig. 7 gives the results under the NDCG50 criterion as the parameters vary. The results show that the proposed CCPMF model clearly outperforms the other two methods over a wide range of parameter values.

Fig. 8 compares the performance of the news image matching method of the present invention with the prior art.

In the comparison of Fig. 8, 300 documents were first randomly selected and the relevance between their images and the documents was labeled, forming a training set from which the weight coefficients were learned. Then 1,000 documents were randomly selected for testing, and the news image matching method of the present invention was compared with simple search (using the title as the image query) and simple fusion (using each word of the title as a separate image query and fusing the results). As shown in Fig. 8, the method of the present invention clearly outperforms the other two, which demonstrates its effectiveness.

In addition, a user study compared the system before and after news image matching. Each user freely browsed and compared the two versions. The worse version received 1 point, and the other version received 2, 3 or 1 points if it was better, much better or comparable, respectively. A two-factor analysis of variance was also performed. The results are shown in Table 1 below.

Table 1: User study results before and after news image matching

Figure BDA0000106783350000151

Table 1 shows the mean, standard deviation and analysis of variance before and after news image matching: the left part gives the mean scores and standard deviations, and the right part gives the analysis-of-variance results. As Table 1 shows, users prefer news with images, and the news image matching method of the present invention shows a clear beneficial effect in the statistical results.

Fig. 9 compares the retrieval relevance of the retrieval result ranking method of the present invention with prior-art ranking methods.

In the test of Fig. 9, the ranking method of the present invention was compared with PRT (a web page ranking method that uses time information as the static rank), PRR (a web page ranking method that uses the place-name relevance value as the static rank) and BM25. 100 queries were randomly selected for relevance evaluation, and the results are shown in Fig. 9. BM25 performed worst and PRR performed best. The ranking method of the present invention is only slightly behind PRR in relevance but clearly better than PRR in timeliness (see Fig. 10); that is, the present invention presents the most recent news to users first.

Fig. 10 compares the timeliness of the retrieval result ranking method of the present invention with prior-art ranking methods.

In Fig. 10, the vertical axis is the percentage of news that occurred within the most recent week. In this test, for the result lists of 100 queries, the average percentage of news from the most recent week among the top d (d = 5, 10, 20, 50, 100) results was computed. As Fig. 10 shows, the ranking method of the present invention is only slightly behind PRT in timeliness, while, as shown in Fig. 9, it is clearly better than PRT in retrieval relevance.

Taken together, the results in Fig. 9 and Fig. 10 show that the ranking method of the present invention clearly outperforms the prior-art ranking methods in the combined effect of relevance and timeliness, and therefore achieves satisfactory performance.

It should be understood that the above specific embodiments are only intended to illustrate or explain the principles of the present invention and do not limit it. Any modification, equivalent replacement or improvement made without departing from the spirit and scope of the present invention shall therefore fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and bounds of the appended claims, or equivalents of such scope and bounds.

Claims (12)

1. A network news retrieval system fusing geographic information and visual information, the system comprising:
a data preprocessing module for crawling news data and performing text analysis and information extraction, wherein the news data comprises person, place, time and text information;
a location relevance analysis module for analyzing the relevance between news events and news locations;
a news matching module for selecting, for each news item, images that illustrate the news content;
a retrieval result display module for displaying the retrieved news ranked by retrieval relevance;
wherein the location relevance analysis module includes:
a geographic noun filtering and expanding module for acquiring the geographic position information of geographic nouns; and
a matrix-factorization-based correlation analysis module for analyzing the relationship between news locations and news events using a consistency-constrained probabilistic matrix factorization method, which analyzes that relationship based on the following rules: news events with high similarity are likely to occur at the same location, and locations with high correlation have similar relationships to the same news event; the objective function of the consistency-constrained probabilistic matrix factorization method is:
Figure FDA0000377634290000018
Figure FDA0000377634290000012
wherein $M$ is the number of locations; $N$ is the number of events; $\delta$ is an indicator matrix, with $\delta_{ij} = 1$ if the relationship between $i$ and $j$ is greater than zero and $\delta_{ij} = 0$ otherwise;
Figure FDA0000377634290000019
is the initial 0-1 location-event relationship matrix;
Figure FDA00003776342900000112
represents the desired location-event relationship matrix;
Figure FDA0000377634290000013
Figure FDA0000377634290000014
is the variance of the Gaussian distribution followed by the estimation error of the location-event relationship;
Figure FDA0000377634290000015
is the variance of the Gaussian distribution followed by the latent H-dimensional place-name feature matrix; $\sigma_E^2$ is the variance of the Gaussian distribution followed by the latent H-dimensional event feature matrix;
Figure FDA00003776342900000111
represents the latent H-dimensional place-name feature matrix;
Figure FDA00003776342900000110
represents the latent H-dimensional event feature matrix; $\lambda_C$ and $\lambda_S$ are two non-negative weight coefficients; $L_C = D_C - C$ and $L_S = D_S - S$ are Laplacian matrices; $D_C$ is a diagonal matrix, defined as
Figure FDA0000377634290000017
$D_S$ is a diagonal matrix, defined as
Figure FDA0000377634290000023
represents the correlation matrix between the locations;
Figure FDA0000377634290000022
represents the event similarity matrix; $\mathrm{Tr}[\cdot]$ is the matrix trace operation; $P$ and $E$ are obtained by solving the model, and $P^T E$ is then used to approximate $R$;
the news matching module comprises:
the search word generation module is used for extracting one or more key words from the news data, combining the key words into a search word and submitting the search word to an image search engine for image search;
and the image sorting and selecting module is used for sorting and removing the duplication of the retrieved images and selecting the images capable of explaining the news content.
2. The system of claim 1, the data pre-processing module comprising:
the news data crawling module is used for crawling news documents and corresponding news images from a news website;
the text analysis module is used for extracting the title, time, website, abstract and text of the news data and corresponding websites, and extracting the websites of the news images and text information corresponding to the images;
and the news entity extraction module is used for extracting people, places and time from the news data.
3. The system of claim 1, the search result presentation module comprising:
the map view module is used for displaying the distribution position of the selected news on a map;
and the news event list module is used for sequencing and displaying the list of the retrieved news events according to a preset rule.
4. The system of claim 1, wherein
the search term generating module extracts search terms from a plurality of parts of the news data to carry out image search;
the image sorting and selecting module sorts the retrieved images by a method based on rank aggregation.
5. The system of claim 3, wherein the predetermined rule ordering includes one or more of: the correlation between news events, the correlation between news events and retrieval locations, and the time information of news occurrences.
6. The system of claim 3, wherein the map view module displays the titles of the most relevant news events and the corresponding images in response to a search word input by a user or clicking on any one of the locations on the map.
7. A network news retrieval method fusing geographic information and visual information, comprising the following steps:
a data preprocessing step for crawling news data and performing text analysis and information extraction, wherein the news data comprises people, places, time, and text information;
a location correlation analysis step for analyzing the correlation between news events and news locations;
a news matching step for selecting, for each news item, images that illustrate the news content;
a retrieval result display step for displaying the retrieved news sorted by retrieval relevance;
wherein the location correlation analysis step comprises:
a geographic noun filtering and expansion step for acquiring the geographic position information of geographic nouns;
a correlation analysis step based on matrix decomposition, for analyzing the relationship between news locations and news events using a consistency-constrained probabilistic matrix factorization method, which analyzes that relationship based on the following rules: news events with high similarity are likely to occur at the same location, and locations that are highly correlated with one another have similar relationships to the same news event; the objective function of the consistency-constrained probabilistic matrix factorization method is:
[Figure FDA0000377634290000032: objective function, image not reproduced]
wherein M is the number of locations; N is the number of events; δ is an indicator matrix, with δ_ij = 1 if the relationship between location i and event j is greater than zero, and δ_ij = 0 otherwise;
[Figure FDA0000377634290000039]
is the initial 0-1 location-event relationship matrix;
[Figure FDA00003776342900000313]
represents the desired location-event relationship matrix;
[Figure FDA0000377634290000033]
[Figure FDA0000377634290000034]
is the variance of the Gaussian distribution obeyed by the estimation error of the location-event relationship;
[Figure FDA0000377634290000035]
is the variance of the Gaussian distribution obeyed by the latent H-dimensional place-name feature matrix;
[Figure FDA0000377634290000036]
is the variance of the Gaussian distribution obeyed by the latent H-dimensional event feature matrix;
[Figure FDA00003776342900000310]
represents the latent H-dimensional place-name feature matrix;
[Figure FDA00003776342900000311]
represents the latent H-dimensional event feature matrix; λ_C and λ_S are two non-negative weight coefficients; L_C = D_C - C and L_S = D_S - S are Laplacian matrices; D_C is a diagonal matrix, defined as
[Figure FDA0000377634290000037]
D_S is a diagonal matrix, defined as
[Figure FDA00003776342900000312]
represents the correlation matrix between locations; [symbol not reproduced] represents the event similarity matrix; tr[·] is the matrix trace operation; P and E are obtained by solving this model, and P^T E is then used to approximate R;
and wherein the news matching step comprises:
a search word generation step for extracting one or more keywords from the news data, combining the keywords into a search term, and submitting the search term to an image search engine for image search;
and an image sorting and selecting step for sorting and de-duplicating the retrieved images and selecting images that illustrate the news content.
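The matrix factorization step of claim 7 can be sketched numerically. The formula image of the objective function is not available in this text, so the code below assumes a common form that is consistent with the symbols the claim lists: a squared error masked by δ, Frobenius-norm penalties on P and E, and Laplacian trace terms weighted by λ_C and λ_S. The exact form in the patent may differ. Plain gradient descent is also an assumption, since the claim names no solver, and the function names and default hyperparameters are hypothetical.

```python
import numpy as np


def laplacian(W):
    """Graph Laplacian L = D - W, where D is the diagonal matrix of row sums of W."""
    return np.diag(W.sum(axis=1)) - W


def objective(P, E, R0, L_C, L_S, lam_P, lam_E, lam_C, lam_S):
    """Assumed objective: masked squared error + norm penalties + Laplacian terms."""
    delta = (R0 > 0).astype(float)
    err = delta * (R0 - P.T @ E)
    return 0.5 * (np.sum(err ** 2)
                  + lam_P * np.sum(P ** 2) + lam_E * np.sum(E ** 2)
                  + lam_C * np.trace(P @ L_C @ P.T)
                  + lam_S * np.trace(E @ L_S @ E.T))


def factorize(R0, C, S, H=2, lam_P=0.1, lam_E=0.1, lam_C=0.1, lam_S=0.1,
              lr=0.01, iters=2000, seed=0):
    """Gradient descent on the assumed objective.

    R0: M x N initial 0-1 location-event matrix.
    C:  M x M symmetric location correlation matrix.
    S:  N x N symmetric event similarity matrix.
    Returns P (H x M) and E (H x N); P.T @ E estimates the relationship matrix.
    """
    M, N = R0.shape
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((H, M))
    E = 0.1 * rng.standard_normal((H, N))
    delta = (R0 > 0).astype(float)
    L_C, L_S = laplacian(C), laplacian(S)
    for _ in range(iters):
        err = delta * (R0 - P.T @ E)                        # M x N masked residual
        grad_P = -E @ err.T + lam_P * P + lam_C * P @ L_C   # H x M
        grad_E = -P @ err + lam_E * E + lam_S * E @ L_S     # H x N
        P -= lr * grad_P
        E -= lr * grad_E
    return P, E
```

The Laplacian terms penalize differences between the latent vectors of correlated locations and of similar events, which is how the two rules stated in the claim would enter the optimization under this assumed form.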
8. The method of claim 7, wherein the data preprocessing step comprises:
a news data crawling step for crawling news documents and the corresponding news images from news websites;
a text analysis step for extracting the title, time, web address, abstract, and body text of the news data, and for extracting the web addresses of the news images and the text information corresponding to the images;
and a news entity extraction step for extracting people, places, and times from the news data.
9. The method of claim 7, wherein the retrieval result display step comprises:
a map view step for displaying the locations of the selected news on a map;
and a news event list step for sorting and displaying the list of retrieved news events according to a predetermined rule.
10. The method of claim 7, wherein:
the search word generation step extracts search terms from a plurality of parts of the news data to perform the image search;
and the image sorting and selecting step sorts the retrieved images using a rank-aggregation-based method.
11. The method of claim 9, wherein sorting according to the predetermined rule is based on one or more of: the correlation between news events, the correlation between news events and the retrieval location, and the time at which the news occurred.
12. The method of claim 9, wherein the map view step displays the title of the most relevant news event and the corresponding image in response to a search term entered by a user or a click on any location on the map.
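The click-driven behaviour of the map view can be sketched as follows, under the assumption that a click is resolved to the nearest known place by great-circle distance and that a place-event relevance table is already available. The data layout and the names `haversine_km` and `top_event_for_click` are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def top_event_for_click(click, places, scores, events):
    """Return the nearest place name and its most relevant (title, image_url) pair.

    click:  (lat, lon) of the map click.
    places: list of (name, lat, lon).
    scores: scores[i][j] is the assumed relevance of place i to event j.
    events: list of (title, image_url).
    """
    i = min(range(len(places)),
            key=lambda k: haversine_km(click[0], click[1], places[k][1], places[k][2]))
    j = max(range(len(events)), key=lambda k: scores[i][k])
    return places[i][0], events[j]
```

A text query entered by the user could be handled the same way after mapping the entered place name to an index in `places`.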
CN2011103520023A 2011-11-09 2011-11-09 Netnews search system and method based on geographic information and visual information Expired - Fee Related CN102364473B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011103520023A CN102364473B (en) 2011-11-09 2011-11-09 Netnews search system and method based on geographic information and visual information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011103520023A CN102364473B (en) 2011-11-09 2011-11-09 Netnews search system and method based on geographic information and visual information

Publications (2)

Publication Number Publication Date
CN102364473A CN102364473A (en) 2012-02-29
CN102364473B true CN102364473B (en) 2013-11-20

Family

ID=45691039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011103520023A Expired - Fee Related CN102364473B (en) 2011-11-09 2011-11-09 Netnews search system and method based on geographic information and visual information

Country Status (1)

Country Link
CN (1) CN102364473B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103634736A (en) * 2012-08-21 2014-03-12 北京友友天宇系统技术有限公司 A hot news sharing method based on geographical information, an apparatus and a system
CN103425770B (en) * 2013-08-08 2017-09-01 刘广宇 Event multi-dimensional information display device and method
CN103390068A (en) * 2013-08-22 2013-11-13 济南中维世纪科技有限公司 News retrieval method
CN105683949A (en) * 2013-11-27 2016-06-15 英特尔公司 High level of detail news maps and image overlays
CN104281691B (en) * 2014-10-11 2017-07-21 百度在线网络技术(北京)有限公司 A kind of data processing method and platform based on search engine
CN104965847B (en) * 2015-02-04 2017-11-10 北京奇虎科技有限公司 Information displaying method and device
CN104615715A (en) * 2015-02-05 2015-05-13 北京航空航天大学 Social network event analyzing method and system based on geographic positions
US9654549B2 (en) * 2015-05-18 2017-05-16 Somchai Akkarawittayapoom Systems and methods for creating user-managed online pages (MAPpages) linked to locations on an interactive digital map
CN104933171B (en) * 2015-06-30 2019-06-18 百度在线网络技术(北京)有限公司 Interest point data association method and device
WO2017041239A1 (en) * 2015-09-08 2017-03-16 余青山 Geographical location-based application for searching news within certain distance range
CN105808761A (en) * 2016-03-16 2016-07-27 山东大学 Solr webpage sorting optimization method based on big data
CN106066862B (en) * 2016-05-25 2019-05-31 东软集团股份有限公司 Media event display methods and device
CN106326447B (en) * 2016-08-26 2019-06-21 北京量科邦信息技术有限公司 A kind of detection method and system of crowdsourcing web crawlers crawl data
CN106599285B (en) * 2016-12-23 2020-06-30 北京奇虎科技有限公司 Method and device for providing search results based on news search
CN106951493A (en) * 2017-03-14 2017-07-14 北京搜狐新媒体信息技术有限公司 Automatic figure methods of exhibiting and device without figure news
CN107133290B (en) * 2017-04-19 2019-10-29 中国人民解放军国防科学技术大学 A kind of Personalized search and device
CN108182232B (en) * 2017-12-27 2018-10-23 掌阅科技股份有限公司 Personage's methods of exhibiting, electronic equipment and computer storage media based on e-book
CN108446377A (en) * 2018-03-16 2018-08-24 四川高原之宝牦牛网络技术有限公司 Map special efficacy methods of exhibiting and device
CN109033358B (en) * 2018-07-26 2022-06-10 李辰洋 Method for associating news aggregation with intelligent entity
CN109063198B (en) * 2018-09-10 2022-02-11 浙江广播电视集团 Multi-dimensional visual search recommendation system for fusing media resources
CN109543876A (en) * 2018-10-17 2019-03-29 天津大学 A kind of visual analysis method of urban issues
CN110136226B (en) * 2019-04-08 2023-12-22 华南理工大学 News automatic image distribution method based on image group collaborative description generation
CN110890130B (en) * 2019-12-03 2022-09-20 大连理工大学 Biological network module marker identification method based on multi-type relationship
CN111639173B (en) * 2020-05-22 2023-07-14 程鹏 Epidemic data processing method, device, equipment and storage medium
CN113626668B (en) * 2021-07-02 2024-05-14 武汉大学 News multi-scale visualization method for map
CN118779353B (en) * 2024-06-07 2025-03-14 新疆文腾信息科技有限公司 Public security event-based data mining method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101714145A (en) * 2008-10-07 2010-05-26 英业达股份有限公司 Website news analysis system and method
CN102024056A (en) * 2010-12-15 2011-04-20 中国科学院自动化研究所 Computer aided newsmaker retrieval method based on multimedia analysis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100449497B1 (en) * 2000-12-21 2004-09-21 주식회사 매직아이 Apparatus and method for providing realtime information

Also Published As

Publication number Publication date
CN102364473A (en) 2012-02-29

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20131120
