Key attributes play an essential role in defining an entity. For instance, once the brand and model of a cell phone are fixed, the cell phone is determined as a product. From the E-commerce perspective, discovering key attributes enables duplicate item detection and the consolidation of various items into products. Nowadays, billions of items are listed on Taobao, belonging to tens of thousands of commodity categories, so manually specifying the key attributes for each item is impractical. First, no single domain expert has the knowledge required to identify the key attributes for all commodity categories. Second, attribute diversity and potential attribute hierarchies make the key attributes hard to specify by hand. The attributes associated with an item usually serve different purposes: some describe physical features such as color, weight, length, and width, while others describe conceptual features such as brand and model. Further, these conceptual attributes may take values from hierarchical structures. For example, a cell phone model can be iPhone 7, iPhone 8, iPhone X, iPhone XR, iPhone XS, or iPhone XS Max. It is unclear how to leverage the semantics of such hierarchies in the key attribute discovery process; defining entities at different granularities is probably necessary.
Some dimension reduction techniques have been devised to perform key feature selection. For instance, the well-known Principal Component Analysis (PCA), which is widely applied to datasets from various domains, including social sciences, economics, biology, and chemistry, maps data points from a high-dimensional space to a low-dimensional one while trying to keep all relevant linear structure intact. In this line of work, a subset of k columns/attributes from the original dataset, as opposed to the k eigenvectors or eigenfeatures returned by PCA, is selected such that it can accurately reproduce the structure derived by PCA. However, the purpose of key attributes is to define entities rather than to reduce dimensionality. Specifically, we use key attributes to perform duplicate entity detection and entity resolution.
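As a minimal sketch of this use of key attributes, the snippet below consolidates items into products by grouping them on their key attribute values. The item records, the attribute names, and the `consolidate` helper are hypothetical and serve only as an illustration; they are not part of any existing system.

```python
from collections import defaultdict


def consolidate(items, key_attributes):
    """Group item ids by their values on the given key attributes.

    Hypothetical helper: items that agree on every key attribute are
    treated as listings of the same product.
    """
    products = defaultdict(list)
    for item in items:
        # An item missing a key attribute cannot be resolved; skip it here.
        if any(attr not in item for attr in key_attributes):
            continue
        key = tuple(item[attr] for attr in key_attributes)
        products[key].append(item["id"])
    return dict(products)


# Illustrative records only; color is a physical attribute, not a key one.
items = [
    {"id": 1, "brand": "Apple", "model": "iPhone XS", "color": "gold"},
    {"id": 2, "brand": "Apple", "model": "iPhone XS", "color": "silver"},
    {"id": 3, "brand": "Apple", "model": "iPhone XR", "color": "gold"},
]

products = consolidate(items, ["brand", "model"])
for key, ids in products.items():
    print(key, ids)
```

With brand and model assumed to be the key attributes, items 1 and 2 fall into the same product even though their colors differ, while item 3 forms a product of its own.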
Besides, the discovery of key attributes should be efficient, the discovered key attributes should be concise, and they should be organized in a collection of hierarchical structures. As stated, the E-commerce ecosystem contains billions of items. If the runtime of key attribute discovery fails to scale with the number of involved items, the discovery process is impractical. Moreover, as we rely on the discovered key attributes to perform the time-consuming entity resolution, conciseness is a major consideration. The discovered key attributes should include every attribute that bears on the definition of an entity and leave out every attribute that does not. Lastly, for an item, different key attributes may be discovered at different granularities. As such, a hierarchical structure should be generated to manage the key attributes. Furthermore, different items can have different key attributes; therefore, multiple hierarchical structures should be established.
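The conciseness and granularity requirements can be illustrated with a small sketch. It assumes, purely for illustration, that a sample of items already carries an entity label at two granularities (`product` and `maker`), and it searches by brute force for the smallest attribute subsets that determine each label. The function `minimal_keys`, the labels, and the records are hypothetical; the exhaustive search is exponential in the number of attributes, so it does not meet the efficiency requirement and is not a proposed method.

```python
from itertools import combinations


def minimal_keys(items, attributes, label):
    """Return the smallest attribute subsets whose values determine `label`.

    Brute-force illustration: subsets are tried in order of increasing
    size, and all consistent subsets of the first successful size are
    returned.
    """
    for size in range(1, len(attributes) + 1):
        found = []
        for subset in combinations(attributes, size):
            seen = {}
            consistent = True
            for item in items:
                key = tuple(item[a] for a in subset)
                # setdefault stores the first label seen for this key and
                # returns it; a different label later means a conflict.
                if seen.setdefault(key, item[label]) != item[label]:
                    consistent = False
                    break
            if consistent:
                found.append(subset)
        if found:
            return found
    return []


# Hypothetical labeled sample; brands and models are made up.
items = [
    {"brand": "BrandA", "model": "Pro", "color": "black", "product": "A-Pro", "maker": "A"},
    {"brand": "BrandA", "model": "Pro", "color": "white", "product": "A-Pro", "maker": "A"},
    {"brand": "BrandA", "model": "Lite", "color": "black", "product": "A-Lite", "maker": "A"},
    {"brand": "BrandB", "model": "Pro", "color": "black", "product": "B-Pro", "maker": "B"},
]
attributes = ["brand", "model", "color"]

fine_keys = minimal_keys(items, attributes, "product")
coarse_keys = minimal_keys(items, attributes, "maker")
print(fine_keys, coarse_keys)
```

On this sample, the fine-grained label needs both brand and model, whereas the coarse-grained label needs brand alone, which hints at how key attributes found at different granularities could be arranged as levels of a hierarchy.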
- A methodical model of the E-commerce item system, and key attribute discovery for such a system
- An efficient and scalable algorithm for concise key attribute discovery
- Key attribute trees for the Fashion, FMCG, and Electronics areas
Related Research Topics
- Define an entity in knowledge graph
- Entity alignment in knowledge graph
- Information organization for specific domain data