@ Query Expansion and Contraction
Introduction
This is a report of my experience in identifying an under-explored idea in the field of photo meta data aggregation, and an implementation of this idea in the Emme photo search system. This report combines the original discussion for my motivation, ideas, background research, and initial project proposal, with the results of my implementation efforts, a brief synopsis of my experience with the system, and a proposed future work for both my extension, and the system overall. The first few sections of this paper are unchanged from my original proposal, and new sections start with the section tiled ‘Implementation Overview’.
Motivation
Initially I had hoped to implement a semantic search component to Arjun Satish’s Emme project. My goal had been to significantly improve picture ordering and ranking given a natural language query and to use the query itself to build up significant data on images, image groupings, and image categories.
Since beginning research on the topics of image and semantic search I’ve come to appreciate the difficulty of my initial proposal. The core of my initial idea was to improve the image search in Arjun’s project. I had intended on doing this through parsing ‘story like’ natural language queries and storing the novel data contained in these queries as meta-data for the pictures discovered in the search. This initial idea has turned out to be too difficult. When unconstrained there were too many interpretations that a given natural language sentence could have (or indeed as I had intended paragraph), and only probabilistic understanding of what the natural language block means is possible. Under my original idea I would have had to deal with named entity recognition, co-reference resolution, relationship extraction, sentence breaking, topic segmentation, word sense segmentation, and morphological segmentation. Most current NLP toolkits use statistical machine learning technique to solve these problems with some degree of certainty. That means that I would have to provide quite a bit of training data for possible ‘story like’ image queries; I don’t have a source for such training data, and I don’t know of a good method to generate it.
My idea of a ‘simple semantic graph’ for the data was premised on an unstated assumption I made. Namely, I assumed that pieces of the natural language query could be extracted and ‘classed’ according to the terms and sentence order. The classification would allow me to convert the query string into a similar logical representation as the meta-data for images in Arjun’s repository. Data for Images in Arjun’s database is represented as a series of name-value pairs which describe the collected meta-data for the photos. Prominent among meta-data categories are some distinct classes, time, geographic area, and event data. It had been my idea that portions of the natural language query could be similarly classed. In particular I had hoped that date range queries stated as something like “last year I went on many trips.” could be recognized and converted to a representation closer to Arjun’s meta-data representation of the images for example a _1: 1262304000 which is a timestamp that represents the first second of the year 2010. I had hoped to use sentence order, and adverb position to link and order images for example “visited uncle in Delhi. Then visited friends in Hyderabad.” would add a field mapping all pictures from Delhi for the duration of the trip to the pictures from Hyderabad during the trip. This field would indicate that the photo sets are related. Finally I thought that the remaining portions of such queries could be parsed into miscellaneous terms that could also be appended to the photo sets so as to preserve information present in the query; for example “in Hyderabad my friends and I had fresh brewed tea from a tea stand” those pictures that take place within the context of the trip, and are brought up in the search would have a field labeled ‘misc’ with the terms “fresh”, ”brewed”, “tea”, “stand”. Then a subsequent search for “tea” would instantly bring up all pictures from the trip.
After researching natural language processing, and freely available nlp software, I found out that the trade-off of doing anything with nlp is that you are going to be dealing with intent probabilities and optimizing those. Furthermore most of the popular libraries invoked a machine learning approach that would necessitate training data, choosing a classifier algorithm and then finally parsing back the results of the library into a query representation matching the ‘key : value’ style representation of images in Arjun’s database.
Form this point I decided to step back, and explore more classical approaches to meeting my goals. I wanted to look at 3 search improvements to Arjun’s system. 1) the ability to to temporal range queries. 2) The ability to unambiguously and without error specify several locations, and look at images from all those locations. 3) To associate with retrieved images keywords present in the query that could then be used to again find images with these keywords. I have decided on a content analysis free, meta-data/keyword view of both images, and image queries. This being the environment I will be working with it makes sense for me to review document based information retrieval techniques with an emphasis on meta-data based documents, and with a further emphasis on meta-data image queries.
Current Techniques in Document and Meta-data Search
Image search is partitioned into two principal approaches, content based image retrieval (CBIR) and image meta-data search or ‘content free image retrieval’. CBIR retrieves images based on similarities in the contents of the query image and images in an image repository. The similarities are generally measured based on image factors such as colors, shapes, textures, and other image features that are compiled as a result of the application of image understanding and computer vision techniques to the repository and query images [1]. Results seem to indicate that a combination of content-based image feature search and meta-data based image search (in a document environment) perform better than either of the two approaches in isolation [2].
Since I am unfamiliar with image understanding techniques, and a content based implementation project seems infeasible given the timeline of the course, I will concentrate my efforts and my research on image meta-data search (in with a focus on query formulation and refinement). In meta-data image search the meta-data of each image is indexed and stored in a database. Queries will be processed and converted to a representation matching the meta-data of the indexed images. Upon search time the query is matched to the meta-data records stored in the index and collects the image identifiers of those index records that match with the query. The results are ranked in order of relevance and the images pointed to by the identifiers are presented to the user.
Since a meta-data view of an image is similar to a highly structured document with clear semantic meanings and boundaries many document search techniques will apply to my effort. Additionally since the query environment is principally a text-box many if not most query formulation techniques centered on keyword queries will apply to my effort.
The first step in improving the quality of text query results is defining ‘quality’. Classically search performance metrics attempt to measure relevance or some function of relevance. Relevance is a topic that has been debated and theorized upon in many disciplines, but within the context of information systems, and topical relevance therein the defining relevance as the “Topical or subject relevance: Relation between the subject or topic expressed in a query and topic or subject covered by information or information objects (retrieved or in the systems file, or even in existence). It is assumed that both queries and objects can be identified as being about a topic or subject. Aboutness is the criterion by which topicality is inferred.”[3] seems an adequate fit. Here relevance is measured as the degree of relatedness of concepts in the query to concepts in the document. Practically, this type of relevance can be measured by user interaction, and then normalized using statistical mechanisms. However, aside from perhaps providing training data, or rigorously measuring the performance of a system this approach will not scale, and an algorithmic non interactive method for both measuring relevance, and increasing the numbers of results returned that are most relevant is necessary. There are different approaches to algorithmically measuring relevance, from using the tf-idf scores of query terms, to machine learning classifiers that get training data from user feedback experiments, to string match counts of query terms and synonyms thereof in a document. Commercial search engines measure relevance as a combination of anchor text matches, and link popularity [4]. Regardless of the method it is safe to say that relevance is a topic of research and several, several competing relevance measures exist, but the sorts of things (popularity, tf-idf) that relevance is a function of are known.
Precision and recall were defined in class, but since many performance metrics make use of these measures I will mention them briefly. Precision is the fraction of relevant retrieved documents of the total corpus of documents. Recall is the fraction of documents that are relevant to the query. Fallout is the measure of non-relevant documents retrieved out of all non relevant documents in the corpus [5]. The F1 performance measure is the harmonic mean of precision and recall 2*p*r / (p+r) [6]. Average precision is a measure of the sum of the precisions of all retrieved documents over the total number of relevant documents. The reciprocal rank of a query response is the multiplicative inverse of the first relevant result, the mean reciprocal rank of a set of queries Q is the sum of all the reciprocal ranks in set Q divided by |Q|. [7]
Mean Average Precision or MAP is a commonly used correctness metric used it many if not most of the papers I reviewed in the course of my research. MAP the map of a series of k queries is the sum of the average precisions of all the queries divided by k.
In papers I reviewed many of these measures as well as others are used to train classifiers when applying a supervised machine learning approach, or when determining coefficients in a statistical retrieval method.
There are a number of logical models for documents and queries. Models are either term-independent, immanent term interdependent, or transcendent term interdependent. Term-independent models treat each term in the document as independent, immanent term interdependent models consider terms dependant and derive the degree of dependence, transcendent term interdependent models rely on external data for term interdependence [8]. The standard Boolean model represents each document as a set of terms, a query in this model proceeds as a series of q filters where q is the size of the set Q representing the query. Each filter removes all documents from consideration that do not contain a term in Q. Similarly the vector space model (which Lucene uses!) represents each document and query as a D sized vector (where D is the size of the dictionary), queries are matched according to the normalized dot product of a query vector to each document vector. Vectors can be binary (indicating term presence/absence) or continuous indicating a term weight calculated using tf-idf. The generalized vector space model is an improvement on the vector space model that attempts to take synonyms into account by representing each term by a linear combination of vectors representing the similarity of a term to any other term. The extended Boolean model is basically the same as the vector model when using tf-idf weights, and in set notation. The topic-based vector space model is similar to the vector space model, except it requires semantic analysis and attempts to break up documents and queries into topic vectors rather than term vectors. The semantic analysis is necessary to decide which topics occur in a document, and with what weight. Latent semantic indexing is a model which attempts to address the major flaw of vector and set model, namely the problem of identifying synonyms and related concepts in a document or query. LSI constructs a term-document matrix to identify unique terms in a collection of documents, this is done as in the vector model. “A rank-reduced, Singular Value Decomposition is performed on the matrix to determine patterns in the relationships between the terms and concepts contained in the text. The SVD forms the foundation for LSI”. [9] LSI is described in [10]. Very roughly stated computing the SVD gives the vector space for the terms and documents and this conveys information on term similarity. As many of the models explored used the vector model as a motivation, or were in many respects similar to the vector space model I decided to concentrate my research in the direction or set or vector model.
Assuming a relevance function, and a chosen document and query model a similarity function is now needed to help match documents and queries. There are many similarity functions that attempt to measure the similarity of a given query to each document in the collection and sort them according to their similarity. There are two classes of similarity functions, those created by human information retrieval experts using probabilistic techniques and with parameters tuned by machine learning and those functions entirely constructed by machine learning. Overall several sources seem to indicate that functions developed by means of machine learning have the can be superior to even industry standard similarity functions. For example Wang, Fan, Xi, and Fox achieved such a result when comparing Okapi BM25(the similarity function used in Lucene) to a genetic algorithm solution of the 2 classification in document retrieval. “When combined with query expansion, the GP function can get a significantly better performance, measured as MAP(mean average precision), than the baseline system using Okapi BM25.” [11].
After all of these pieces are in place, relevance function, logical model, similarity function we can perform search. I looked into ways to improve search when all of these prior pieces are in place. Improvements can be categorized into roughly 2 classes expansive and reductive. Expansive techniques increase the amount of high value data present in either the query, the documents, or both. Reductive techniques reduce the amount of low value. One of the most common techniques for keyword based queries is ‘query expansion’ a technique where the query is examined and pre-processed prior to document matching so as to improve either or both precision and recall. There are several techniques for query expansion, including stemming, misspelling identification and correction, and the addition of synonymous stem words. These techniques improve recall in that more documents can be matched to a larger query with more general (stemmed or spelling corrected) words. They may improved precision if so few documents are found to be relevant to the original query (a badly misspelled word, or incredibly antiquated term for example) that very few results are retrieved, then by expanding the query many more relevant documents could be retrieved. [12] There are both interactive human interactive query expansion techniques (where a human judges whether a given expansion is accurate) and automated expansion techniques where the expansion is done algorithmically. One important algorithmic expansion technique is pseudo relevance expansion (pseudo because terms are deemed relevant without the support of a human operator).
Pseudo relevance feedback is a technique used in meta data search (as well as keyword search on textual data) Pseudo relevance feedback decides on which terms are most informative out of the terms presents in those documents deemed most relevant to the initial query, adds them to the original query, then recurses using this new (longer) query.The terms added to the original query are those which have a high calculated value, the values of these terms are preserved in the query so that they do not have to be recalculated. The values used are a function of the implementation, but the tf-idf of the query term in query is usually a factor of the value derivation. [13] Empirical testing indicates that PRF improves the mean average precision of all information retrieval algorithms tested (out of a set of commonly used algorithms including one used in Lucene). [14] .
Document expansion is related technique to query expansion, and the same methods can be applied with a slightly differently to increase search performance from the document side. In document expansion each meta-data document in the collection iteratively serves as a query to the system. After stemming and otherwise pre-processing this query some similarity function is used to find the most similar documents (not including the query document itself of course). Some set of the the most similar and relevant documents will be chosen, then those documents are analysed and the top X weighted terms are appended to the meta data describing the current document being expanded. The description of document expansion is derived from its definition [15]
A useful reductive technique is query reduction. This technique is only really applicable to complex and verbose queries, when small sparse queries are given, query expansion is more relevant. In some sense query reduction is the inverse operation to query expansion, in that the main idea of query reduction is to reduce queries that are ‘too long’ to a smaller size. ‘Too long’ here means that the elimination of terms in the query will improve precision of the query evaluation. A simple and example and a mainstay of most search systems is the elimination of ‘stop words’ word with a high a very high and very low idf like ‘a’ ‘but’ etc. While useful in semantic search these words are thrown out of a query in almost all keyword based search systems.Here as in query expansion it is possible to test whether a query reduction has been successful by use of human judges, or algorithmically. Again the use of human judges is un-scalable and is only interesting in that it helps form a test set against which to compare algorithmic attempts. The core question of the algorithmic approach is to decide which terms stay and which terms to throw out. This usually decided by giving each term a weight and throwing out all terms below a certain weight. Both identifying which weight to give to each term, and how to determine the threshold of which terms to dismiss are topics of research. Some approaches to term ranking are supervised learning in Markov random field models [16]. Another approach is to identify key concepts in the query with supervised machine learning techniques, and to retain only those keywords found to have a high degree of relevance to the most relevant concepts in the query [17]. A third approach (and one that makes sense to me) is to train a classifier to look 2n possible subsets that can be generated from a query of n terms, and rank them according to a utility function that would have as its components various search quality metrics such as idf ranking, td-idf with respect to the original and proposed queries, query scope, and others. After training the classifier will choose reduced query formulations that rank highest according to its utility function. [18] indicates that iterative query reduction and query expansion techniques seem to present the best results.
Document reduction could also be useful. Again this is similar to query reduction, a threshold value T is chosen, all documents in the collection are scanned and the weight or importance of a given term within the context of its given document is derived (using any of the techniques discussed td-idf, Markov random field models, concept ranking via supervised learning, etc), and terms appearing in the document with a weight lower than the threshold are dropped. There is however one important caveat, if we apply something like the tricks used in query reduction to our collection of documents, we end up with a different collection of documents. There I see two answers here, one is that we take this modified collection and feed that into a document expansion which uses pseudo-relevance feedback rather that simply feeding it the unmodified documents. [19] Shows results indicating that in their experiments with the ImageCLEF’s wikipediaMM 2010 task set show a dramatic improvement when the previously mentioned method is used as compared to query expansion only. Strangely enough results were not mentioned comparing document expansion in isolation and document expansion with document reduction, I could not find a similar study using this technique, but it seems quite promising. An obvious idea seems to me to be to use document reduction techniques to produce a ‘shadow document’ which would only contain terms that meat the threshold value. Then this shadow document could be the basis for forming the index entries at the base of the information retrieval system. I am not sure to what extent this application would be helpful as I was unable to find studies implementing it, but it seems very in line with the expansion and reduction techniques discussed earlier.
Possibility for Future Work
Having looked at a sampling of work in query/document expansion and reduction and at the basics at the prerequisites for these approaches it seems to me that at least the literature that I read presents a some trends. In particular, applying both expansive and reductive techniques to both query and document presents the best results. Secondly, statistical and machine learning approaches in trained classifiers outperform hand crafted relevance classifiers by a wide margin. Finally, there is room for research into a genetic algorithm approach towards both document classification and term weighing. I think that given the results of the work I studied at least, a system that does both reduction and expansion of both query and document iteratively and perhaps until equilibrium, and uses a classifier produced through reinforced learning of a genetic program would be interesting to study.
A second point that I was surprised to note was the lack of work done into document reduction, and the preservation of the result thereof. As I mentioned in the previous section, it seems like a very obvious direction given the success of document expansion, and query expansion and reduction. On that note it seems strange to me that the results of both document expansion and possibly reduction aren’t saved somewhere other than in a compressed form in the index. Why not leave a duplicate index entry at the server of the document being expanded/reduced ? This document can be updated upon the update of the work on which it is based. There are a few advantages to this, firstly there is a complete backup of the index available on the host servers, and the backup can be loaded without all the pre-processing work done initially when each entry was added. Secondly the distributed backup index is distributed with the same locality as the websites it indexes. This property could be used to populate local caches to improve query service times.
A last research direction that I thought would be worth of exploration is the preservation of query information in the document. It seems that if many people are looking for something using terms that do not lead to a successful result and after extensive browsing they finally find pages that they like on this topic the previous queries should be used to improve the rank of the documents they find relevant for other searchers. I know that past query context preservation is a current topic of research, but that’s not what I mean. I think that if after many disappointing queries a user finds some resources he likes those documents themselves should have their meta-data changed to account for these past searches. A use case I have come across is that I need an explanation of a standard library function in C, I query google and look at many many results none telling me a critical piece of information. Finally on third page of results I find a forum post that notes that the function takes a parameter in octal representation, and that was the information I needed so I bookmark the site or I ‘like’ it on Facebook, or I +1 it on google. I think that given all of my past queries on C and the library function these terms should be given a greater weight in that page since I gave a detectable sign of the sites utility to me. The next time someone does one of the searches I tried, the forum post should at least come up on the second page.
Proposed Implementation Work
My intent had been to take a couple of obvious search features out of the general search that Arjun and his team have devised and cater to them explicitly. Since my idea of doing this through NLP turned out to be very difficult I have decided on a simpler and more brute force approach (described below). Currently, Emme “maps each of the JSON tags (which contain meta-data generated for the image) into a Lucene field, and declare all the text fields index-able. A search query is executed over all fields. If the query term exists in any field, the photo is thrown out into the result set.” [20]. Since one of the fields describing the image is a UNIX time-stamp temporal queries by the user are essentially lost (nobody is going to query for photos using a UNIX timestamp representation), I propose an alternative mechanism.
I propose that a separate index be created to handle temporal range queries. At the front end the query environment is now slightly more complex, there will be six additional multi-choice fields start_year, start_month, start_day, end_year, end_month, end_day. Each multi-choice field will be populated once the previous has been selected. For example, there will be no choices in start_month until start_year has been selected similarly for start_day. The same technique will be used for end_{year,month,day}. A non-selection of any of these fields will imply that no constraint will be placed on that attribute of the image. For example if only start_year is set to 2009 and no other field is entered then only photos from 2009 to the current time will come up. The query environment will then consist of a mechanism for temporal range selection and the usual text query box. On the data side the UNIX timestamp for each image can easily be parsed into year, month, day, then these fields can be stored as additional BSON columns in the mongodb database representing each image. Then an index on these fields, and the photo identifiers will be created. Mongodb supports range queries [21] so a range query on the set of all documents in the collection is now a possibility. So as to make this addition modular and not disturb the current setup of Arjun’s system I propose to integrate the current Lucene style string matching search as follows. Perform the range query on the set of all documents U to produce set of unique photo identifiers T. Perform the Lucene search on the set of documents U using the keywords in the text-box (as is currently done) to get a set of unique identifiers L. Present to the user the documents in the collection identified the identifiers in set T intersect L. This addition will grant the Emme system the ability to do temporal and temporal range queries in addition to the current string matching approach.
To address my second point of interest (the ability to unambiguously and without error specify several locations, and look at images from all those locations) I propose the following system change. Next to the search box, and temporal query selection there will be a mult-choice location triple for the user to specify locations the triple will be {Country, state, city}. The choices for each will be populated once the previous has been specified by the user. For example there will be no items to choose from in the ‘state’ and city multi-choice boxes until the country has been specified, similarly with state and city. If no choices are specified for any of the choice-boxes an implicit “for all” query for that field is predicted. Once the first multi-choice box is filled the data from that action is used to populate the choices of the next multi-choice box and so on. the user will also have the option of adding an additional triple [22]. This will allow the user to specify a list of any number exact locations with each location in the most general case being specified by country only and in the most specific by country & state & city. After these locations have been specified a search will take the union of all photos represented in any of the mentioned locations. The meta-data representations of Arjun’s photos already have fields for Country, State, and City, an index in mongodb can be created over these attributes and the photo unique identifiers. Then I plan to use the same technique as I had with the temporal queries, I plan to do a temporal query (if specified by the user), a location query (if specified by the user), and finally a keyword query on the remaining data (if specified by the user) and done as currently by Lucene. The intersection of the photos in these 3 queries will be returned to the user. I think this geographic feature requires some motivation from my part since currently if you type ‘Irvine’ into the search box of Arjun’s system you will indeed get all images from ‘Irvine’, and if type ‘Irvine Santa Ana’ you’ll get all pictures from either Santa Ana or Irvine (Lucene handles it). My alternative offers 2 improvements. First of all did you know that there are at least 8 cities called Irvine ? They are spread across the US, Canada, and England. There are at least 10 Sana Anas, spread across the US,Mexico, and central America. What I’m getting at is that my system removes a the vast amount of such ambiguity by having the user spell out their exact intent. While this may not seem very useful to users who don’t ‘glob trot’ around the world, there is another feature that definitely would be. There are no misspelled locations names with my system, the names will be input correctly by me and once selected will be queried in their correct spelling. This will remove the difficulty of remembering complex place names and allow the user to specify exactly what they want unambiguously. Finally a third benefit of this approach is that it frees up the search text box. Temporal and geographical terms need no longer be entered, that aspect of the search will have been handled. No longer will a photo at ‘Ana’s Restaurant’ come up as a result of a query for ‘Santa Ana’. Events, trips, perhaps in the future peoples names can all now be queried unambiguously by Lucene’s engine without worrying about picking up photos for badly phrased temporal and geographic queries.
The final part of my proposal addresses my third point (associating with retrieved images keywords present in the query that could then be used to again find images with these keywords). I would like to do the following, suppose the user has a specific event in mind, and enters some time range, and a location (or a few) and then the user enters some words that match none of the attributes stored in the meta-data of the images returned. For example given a specific date range and location the search string “had tea with my friend Raj”. Given the current attributes of the system [20] the system would not be able to narrow the results at all from this string (events, trip names, store names, etc are not specified). Now suppose that the temporal and geographic data is enough to select a small collection of photos, (some of which the user wants). It would be a real shame to let the terms {“tea”,“friend”,“Raj”} and their probable association with the set of queries returned disappear. I propose taking all these terms and appending them to each of the returned photos as a new key-value pair in Arjun’s system the field would be something like ‘misc’ and would collect all keyword terms associated (or probably associated with a photo) and Lucene would then search these fields as well. The next time I type the word ‘Raj’ into the text box one of the photos that should come up will be me sipping tea with Raj (assuming my time and date had been correct in the previous query). Practically, my rule would be this: If a query is done using both the temporal multi-choice boxes, and the geographic boxes, and the text box, and if the query returns more than 0 but fewer than X results, and if after stemming and stop word removal there are any terms that are not present in any of the meta-data fields the photos, then add those key terms to the collection of photos under the _misc key. The additional constraint involving X is to make sure we don’t include irrelevant keywords on a huge collection of photos, like if the query was for all years between 2000 and 2010 in Irvine, when the person lives in Irvine.
Conclusion of Project Proposal
Natural language processing is an ongoing research area, it is also difficult, and the most promising approaches utilize statistical and machine learning techniques to get an array of meanings each with some probability of being the intended meaning. There is little middle ground with natural language processing, you are either doing it or you aren’t, there is no middle ground that would allow you to enter data in a semi-structured natural language way, there either is a simple structure or there is not. If you’re implementing something structured anyways, and the difficulty of the problem is mostly in identifying the structure you are looking for, why not just force the user to enter the structure in exactly the way you want them to ? If you don’t want to force the user to do that well, you either want to try out natural language processing or drop the idea of structure entirely there is little room or point for ‘middle-ground’ approaches. This is roughly the train of thought that lead me from my initial natural language motivation to my more practical project 2 proposal. Finally, I think I should address an obvious criticism of my first 2 proposed features. Namely, I’m cluttering the query environment when all the user wants is a text box. I would reply that ideally yes the user should be able to communicate everything they need in a text box, but that is really hard, and I think temporal and spacial dimensions are so important that the benefits of using my approach out-weight the user inconvenience. Lastly, if they do clutter the environment the users can just ignore them and use Arjun’s original search, maybe pushing all this stuff to an ‘advance search’ setting like in google (I use advance search all the time by the way).
Implementation Overview
Overall I consider my experience in the implementation of project 2 to be a qualified success. I was able to accomplish the vast majority of the goals I set out to, and indeed was able to add an unanticipated feature to the program (sentence based batch search), that I thought was an excellent addition. Moreover the portions that I did accomplish were the more significant in in the project, and the portion that was left unfinished (geographic filtering) was left out purely due to lack of time, it’s implementation is nearly identical to temporal filtering. I found that my design document was specific enough and expressed doable goals and methods for accomplishing these, indeed at almost no time did I deviate significantly from the initial design I proposed. Finally, and to my surprise I found that upon testing on Arjun’s set of photos the features worked exactly as I expected, the experience was easy and intuitive, and the benefits conveyed were real and potentially significant. The features I implemented were chronological filtering (with a significant front-end and back-end component), a partial implementation of geographic filtering, query keyword association of filtered photo-sets, and batch query execution. I shall discuss the specifics of the implementation of these features and sub-features, then discuss my experience with the Emme system, and propose future work for my features, and the Emme system.
Chronological Filtering : User Interface
Because substantial chronological filtering is critical to the correct function of my query keyword association that this filtering is the portion that I implemented first. Most of my front end changes for this feature were in the file .../ept/web/qpage.html this is where the html for Arjun’s search function resides. As part of the form submission I created six combo boxes to represent the time anchors for the range search, start {year,month,day} and end {year,month,day}. In my design doc, and in my conception of these filters I envisioned dynamically populating combo-boxes that would disallow incorrect date ranges, and limit user selection to choices that only make sense in the context of the choices already made. For example, if the same year is chosen for both time anchors, and then month 5 is chosen as the end month, then the choices for month start will be 1-5 only. Similarly if the start month is specified first then the end month selection is limited. This limiting applies to years, months, and days and occurs exactly as one would expect so I won’t belabor the point further than saying that only valid dates and date ranges are open for selection to the user at any time. I looked for pre-made javascript libraries that would accomplish this, in particular I looked at jquery date picker, as well as several others, a clean implementation of this function or its equivalent was not available or could not be found so I implemented this function from scratch in javascript. I used the “cascading combo boxes” technique to implement these checks, I take into account leap years, and all event combinations that could result in incorrect date selection. By default, and when unselected each box represents either the minimum or the maximum selection. For example a blank year represents a start year of 2000, a hard coded start year I built into the function, a blank end year indicates the last the current year, similarly a blank month indicates either the first or last month of the selected year. Needless to say, months cannot be selected before the respective year is selected, and days before the respective month is selected. The point of this functionality is that I had initially intended to strip such date information from natural language, that being too difficult and probabilistic I decided to take the opposite approach and ensure that regardless of what the user enters as time anchors the query is exactly correct, and fully specified.
Chronological Filtering : Back End
After a setting (even a default one) is fixed for the time anchors these values are loaded into an object at form submission (query time) along with the query string. All this information is then submitted to Node.js which (through Redis) submits it to a dynamic queue serviced by a handler running on the java server. Within this server I unpack the request page, and get the stored fields. Originally I had intended on doing the date range query on mongoDB and then filtering them down using the keywords in the query, however Arjun’s method of querying the Lucene index, and returning the photo identifier directly from that service made this approach less efficient, with more overhead, and modification of the code-base. Instead I did the reverse, I filter the results after Lucene retrieves matched photos. I set up a filter’s module that allows one to filter a result set down given the end timestamp, and start timestamp (which are recalculated whenever any the date combo-boxes on the client side are changed. Filtering is done primitively, I iterate through the list composing a new list of photos in which all photos pass the time range constraint.
Geographical Filtering
Though I got to this feature last, and it is only partially implemented I’ll mention it here since it is so similar to the temporal filtering. One complication that arises with implementing the geographical filtering is that if one wants to cover all the cities and states of all the countries in the world, that is a lot of data items and makes for a very large javascript file. It may make sense to have a separate database of these data items, and stream them over to the client on an as needed basis, so as not to clog the memory at the client terminal. I evaded this problem by starting with only a small sample set, a couple of countries (U.S , Canada, Mexico) a couple of states, and several cities. That was really the only new problem with this functionality. I got the fully function user interface portion of this aspect working and about half of the back end work to get this done before I realized that I would run out of time during debugging and stopped. The completion of this feature is a question of an additional day or two.
Query Keyword Association
This was the centerpiece of my design and implementation. The idea is this, given a query after stop word elimination and stemming, an index is used to match photos to the logical representation of the query. Then after possible temporal and geographical filtering a set of pictures is left that either comprises several related sets event pictures, a single set of event pictures, or a thematically unrelated set of photos. If temporal filtering is used, photos are assured of being related in at least one very important dimension (time), and this eliminates that possibility of a set of seemingly unrelated photos being present. The keyword matching performed by Lucene on Arjun’s existing meta-data collection is another filter on the full set of photos in the collection, and ensures that photos retrieved are at least somewhat likely to be thematically similar. If the collection returned after filtering on these dimensions is small enough then it is likely that these photos are all related, hence it makes sense to add in keywords present in the query that were not present in the index to the collection of photos returned because the user clearly believes that those terms are related to the collection he has in mind. In this way the user performs implicit tagging. I punted on the difficult question of what is “small enough” and chose arbitrarily the number 8 to be a limit on the number of photos that will represent the same event or event sequence. I believe that setting this variable is very likely a difficult problem, that would best be addressed with a machine learning scheme, and the implementation of which would constitute a separate project entirely.
After querying the Lucene index for either a keyword query or a natural language query that is then converted to a keyword query and filtering it down based on the provided time ranges, I test whether the number of returned photos is less than 8 or more. If it is more, I do nothing. If it is 8 or less I take all keywords in the query (eliminating a quickly assembled set of stop words) then for each photo in the set returned, I take the set of terms associated with the photo by previous queries, and take the union of this set, and the set found in the query. Then for each such union I persist this new set in the database under the JSON key “_MISC” . To facilitate this I had to write an update method for Arjun’s IndexingService, it was basically a rewrite of one of Arjun’s functions of inserting an index entry with the change being that Lucene’s update function is used which will replace entries found with a search on a unique identifier with the newly presented queries. Another change I had to make was to add a field that would serve as a unique identifier for the photo set and was not the file location of the image itself, for reasons I could not discover Lucene failed to match strings with forward slashes, perhaps they are removed as a pre-processing step. Unmentioned are several function rewrites, and modifications I had to make for all of this to work. Work it did however, and it worked quite well, as can be seen in the demo, the module identifies related sets quite well and on the few queries I tried it builds up relevant meta-data about these photo sets quite quickly. My only real regret here is that the system could have worked so much better if the magic number 8 were instead some number developed through machine learning techniques, as I was using the system I noticed that the number 8 undershot the true maximum number of photos for related sets, and thus many were too large to have query data associated with them. Overall even with the limited functionality I built into it I believe that this implementation shows how useful such a dynamically annotating feature can be after even a little use.
My Experience with the Emme System, and Suggestions for its Development
I like Arjun’s system and what it does, it does it’s job quite well and the work of annotating photos is interesting and relevant. However there are a few places where working with the system is not as nice as it can be. In particular, some design and functional documentation would have been a great boon to me. Secondly, there is a definite lack of a comprehensive installation procedure for even a single system architecture. Finally, there are many redundant, overlapping containers, and methods resulting in operations where a record is moved from one place to another becoming a mess of 3 to 4 conversion steps before the task can be accomplished. A more ubiquitous data organization scheme, and standard interfaces would do wonders for Arjun’s system. While all these steps are simple ‘grunt work’ they are all also very easy, and would immeasurably improve the usability of this system. I have to thank Arjun for guiding me through many points in the system which I would never have discovered without his help, there is no way I would have been able to implement the features that I did without his help.
Future Work
I think that the module I have been working on shows definite promise, and there are numerous additions and extensions to it that would improve both its functionality and that of the system. First of course, it makes sense to finish the geographical filter as described in the design document for this project. Secondly, a feature should be added where keyword entered through the query association mechanism, that incorrectly associate to some picture or set of pictures can be removed from there, by the user. Another possible addition would be to have a feedback mechanism from the user, those photos deemed by the user to be relevant to the query (if any) can be tagged rather than just the the full result of the query if that result set is small enough. An optimization to this module would be to remove words from the _MISC field that already appear in an indexed field of the meta-data for a given photo. Finally, and most importantly I think that a separate system should be set up to identify for each user what is the most likely number of photos taken per event through a machine learning approach, then this data can be used for each query, rather than the number 8.
Project Conclusion
I enjoyed this project, and I think I got the chance to explore a new idea, or at least an idea I’ve never seen treated in quite the same manner as in the literature. I think I got some very promising results, even thought the idea was a simple one. Finally, I was able to develop an understanding for the kind of steps necessary to implement this idea in a functional system.
Demo and Epilogue
I set up Arjun’s system, and modified it on my laptop. In the tenth week of the quarter I carelessly dropped this machine, and after running some disk analysis programs ascertained that the hard disk was dying. I quickly removed my code from the machine, and recorded a demo video of my modifications in use. Very soon thereafter my machine became totally inoperable. Given some time I can acquire and set up a new system that will display my modifications. As of this moment however, I can only provide the reader with a link to the demo video I hurriedly made as the laptop was dying.
The video demo I filmed can he found online here:
http://www.youtube.com/watch?v=JfH9uJVGsa8
References: