Information retrieval (IR) has evolved into a mature technology for determining the relevance of information retrieved from various sources, not only in the news domain but also in other specialized domains. In this study, information retrieval is restricted to information accessible on the internet. This chapter begins with retrieval models and the methods used to improve retrieval, followed by cross-language information retrieval approaches; finally, it discusses the use of IR for Telugu. Mooers coined the term “information retrieval” in 1950. It took many early studies, such as [7, 8, 9], before IR matured in the mid-1990s. Here, IR refers to monolingual retrieval, where queries and information are presented in the same language.
In [10], “Information Retrieval” describes the technology of finding information of an unstructured nature (text) that satisfies an information need from within a large number of sources. The overall flow of information retrieval is shown in Figure 2.1, which may be divided into three sections: the first focuses on techniques to prepare data for retrieval; the second covers algorithms for parsing user queries and subsequently enhancing them; and the third covers the retrieval engine itself. The first step is gathering data from numerous sources, such as online documents, databases, and so on. Several pre-processing procedures are required before indexing the information: in general, terms that are excessively frequent or infrequent are filtered out at this stage.
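The data preparation step described above can be illustrated with a minimal sketch. It assumes a simple alphanumeric tokenizer and two hypothetical thresholds (min_df and max_df_ratio, neither taken from the source) for discarding terms that occur in too few or too many documents before an inverted index is built. The sample documents are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, min_df=1, max_df_ratio=0.9):
    """Build an inverted index, dropping very rare and very common terms.

    docs: dict mapping a document id to its raw text.
    min_df, max_df_ratio: hypothetical thresholds chosen for illustration.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    n_docs = len(docs)
    # Keep a term only if its document frequency lies within both bounds.
    return {
        term: sorted(ids)
        for term, ids in postings.items()
        if min_df <= len(ids) <= max_df_ratio * n_docs
    }


# Invented sample collection.
docs = {
    1: "The retrieval engine ranks the documents",
    2: "The index stores the terms of the documents",
    3: "The user query is parsed and expanded",
}
index = build_index(docs, min_df=1, max_df_ratio=0.7)
```

Here the word "the" occurs in every document and is therefore excluded from the index, while "documents" is kept with the ids of the two documents that contain it.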
The third issue concerns the assignment of term weights, on which query-document relevance is based; iii) because the Boolean model does not provide an estimate of query-document relevance, the retrieved documents are generally presented in no particular order; iv) it is difficult to control the number of documents that will be returned; and v) there is no way to find a suitable balance between AND and OR. As a compromise, Salton [Sal86] suggested using a query formulation that is neither too broad nor too narrow. Several investigations [82, 83] have extended the basic Boolean model to include term weighting and output ranking capabilities. In the vector space model (VSM) [Sal71], a ranking algorithm orders documents according to their similarity to the query terms. [Boo82]
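To make the contrast with unranked Boolean output concrete, the following is a minimal sketch of vector space ranking. It assumes TF-IDF term weights and cosine similarity, which are one common choice and not necessarily the weighting used in [Sal71]; the function names and the tiny pre-tokenized collection are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Return one TF-IDF vector (dict) per document, plus the IDF table."""
    n = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed IDF (an assumption): terms found in every document keep weight 1.
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in tokenized_docs
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_tokens, tokenized_docs):
    """Return (document position, score) pairs, best match first."""
    vectors, idf = tf_idf_vectors(tokenized_docs)
    query = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    scores = [(i, cosine(query, vec)) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Invented, already tokenized sample collection.
collection = [
    ["boolean", "query", "model"],
    ["vector", "space", "model", "ranking"],
    ["term", "weighting", "ranking", "ranking"],
]
ranked = rank(["ranking", "model"], collection)
```

Unlike a Boolean AND or OR, every document receives a graded score, so the result list has a defined order and can be cut off at any length.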
The people’s attic trust faces a difficult storage and retrieval task: its collection spans a wide range of media formats and holds several decades’ worth of data on various types of technological media.
Some items are stored on a single medium, while others span a variety of media. Some of the material is text, but the bulk of it is multimedia. Organizing it into extractable formats and then delivering the information to a large audience through an Information Retrieval mechanism is difficult.
Fortunately, several solutions already exist to address this issue. The project’s primary goal is to document the existence of the media so that they may be properly described and recovered in the future. This document focuses on the retrieval concerns of the project. It examines a number of options for organizing the retrieval system and evaluates them, before making a selection recommendation.
Different types of IR systems
Components of an Information Retrieval (IR) system
The four basic elements of an Information Retrieval system are a database, a search mechanism, a language, and an interface that allows the user to interact with the system. Databases “comprise information represented and organized in a specific way,” according to Chu (2005, p.15).
In other words, a database is a structured storage system that allows for the retrieval of information based on pre-defined criteria. The search mechanism is the method used to search for data in a database. Depending on the technical capabilities of the user accessing it, query methods may be more or less complex. Language, which can be either “natural language” or “controlled vocabulary” (Chu, 2005), is the third component of an Information Retrieval system.
Information is “made” by language, whether spoken or written. When information is processed, transferred, or communicated, it depends on language (Chu, 2005). The user interface is the final element of an Information Retrieval system. It is the point of contact between the user and the technology. Its ease of use will have a huge impact on how popular the system is among users. More than anything else, it determines whether or not a system succeeds in its mission.
Categorisation of items in attic
The attic contains a variety of different objects. Items fall into the following four categories: text-based media, image-based applications, streamed media applications, and multimedia applications. Poems, performance art scripts, and newspaper clippings are examples of text-based material. Image-based applications use picture elements (pixels) to store information.
A pixel is the smallest unit of digital image data. It carries values that describe its color and intensity and, when combined with other pixels, forms a particular picture. The collection contains photographs stored on CD-ROMs and hard drives, as well as 35mm film negatives.
Maps and paintings are also included. If they must be retrieved from a computerized Information Retrieval system, they will need to be digitised. Streamed media applications are those that require a time component in order to properly interpret the data. Altering the timeline changes the information within it. The collection includes audio recordings available in such file formats as .wav and .mp3.
Notable items in the collection include a large number of audio recordings, both vinyl records and cassette tapes. The existence of spoken language and music on audio cassettes and vinyl records is also catalogued. If these forms are to be made accessible to a larger audience, they will need to be digitized. Finally, multimedia applications utilize a variety of media to present information. Video in digital format, as well as on tape, is included in the collection’s multimedia applications.
Text Based Retrieval Systems
A text-based retrieval system will serve the text-based media in the collection. Some of the media are based on vintage technologies, which makes it difficult to store them on media accessible to the general public. The text-based materials in the collection will have to be digitized.
The main benefit of text-based retrieval is that it is well developed, which means there is a high degree of format standardization. It has fewer compatibility issues than other technologies, and when such issues do arise, numerous conversion options are available to retrieve data in the desired form. Its disadvantage derives from its use of letters and words as the fundamental unit of data storage and retrieval.
Many of the methods currently available for text retrieval do not account for the semantic components of a search. They rely on word matching, which means that most search systems return material that closely matches the words of the search query rather than content that is relevant in meaning. Contextual searches, which employ thesauri to identify words with closely related meanings, are becoming more advanced.
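The difference between plain word matching and a thesaurus-assisted search can be shown with a small sketch. The thesaurus entries, the function names, and the three sample documents below are hypothetical; a real system would rely on a curated thesaurus and a proper index.

```python
# Hypothetical miniature thesaurus: each term maps to closely related words.
THESAURUS = {
    "poem": {"verse", "poetry"},
    "picture": {"image", "photograph"},
}


def expand_query(terms, thesaurus):
    """Return the query terms together with their thesaurus entries."""
    expanded = set(terms)
    for term in terms:
        expanded |= thesaurus.get(term, set())
    return expanded


def search(terms, docs, thesaurus=None):
    """Return ids of documents that share at least one word with the query.

    Without a thesaurus this is plain word matching; with one, related
    words are matched as well.
    """
    wanted = expand_query(terms, thesaurus) if thesaurus else set(terms)
    return [
        doc_id
        for doc_id, text in docs.items()
        if wanted & set(text.lower().split())
    ]


# Invented sample documents.
docs = {
    "a": "a poem about the sea",
    "b": "collected verse from the attic",
    "c": "newspaper clipping on local trade",
}
exact = search(["poem"], docs)
expanded = search(["poem"], docs, THESAURUS)
```

With word matching alone, only document "a" is found; once the query is expanded with the thesaurus, document "b" is returned as well, even though it never uses the word "poem".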
Multimedia Retrieval Systems
In contrast, multimedia retrieval systems employ various techniques to match a query’s keywords. A multimedia Information Retrieval system can easily handle search queries for image-based applications and streamed media applications. Multimedia search queries contain terms that may be used in either image-based applications or streamed media applications.
The field of multimedia information retrieval is still in its infancy. Because different formats are used to represent similar media types, it has many compatibility issues. The collection contains both .wav and .mp3 files, for example, which are different audio formats. Formats multiply because each new one provides greater functionality, and the more recent formats typically lack backward compatibility.
The main limitations that drive the use of various formats include maximizing storage space or maintaining media quality. Many media players, on the other hand, are designed to take these restrictions into account. They frequently contain functionality for processing a variety of media types and format conversion facilities. The key is to have the most up-to-date version of a media player, which will be able to display the newest file formats.
Requirements for an IR system
Comparison of Requirements for Text Based IR Systems and Multimedia IR Systems
Both kinds of retrieval system use a mechanism to identify the information source, which a search method can then use to locate media in a database. This is as far as the resemblance between the two types of retrieval system goes. Text-based Information Retrieval methods compare the contents of files in the database with the search query in order to find a document, while multimedia Information Retrieval systems rely on various descriptive elements to identify media that contain pertinent data.
Text components, including a designated name for the media in the database, are examples of such elements. Provided that the file containing a film bears the film’s name, you can search the database for it by title. Media duration and file type are other multimedia file locators. These are useful in narrowing down a search query.
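As a rough illustration of the locators mentioned above (title, duration, and file type), the sketch below filters a small catalogue of media records. The MediaRecord fields, the find_media function, and the sample entries are assumptions made for this example rather than part of any existing system.

```python
from dataclasses import dataclass


@dataclass
class MediaRecord:
    """Hypothetical catalogue entry describing one media file."""
    title: str
    file_type: str   # for example "mp3", "wav" or "mp4"
    duration_s: int  # running time in seconds


def find_media(records, title_contains=None, file_type=None, max_duration_s=None):
    """Return the records that satisfy every locator that was supplied."""
    results = []
    for record in records:
        if title_contains and title_contains.lower() not in record.title.lower():
            continue
        if file_type and record.file_type != file_type:
            continue
        if max_duration_s is not None and record.duration_s > max_duration_s:
            continue
        results.append(record)
    return results


# Invented sample catalogue.
catalogue = [
    MediaRecord("Harbour Song", "mp3", 180),
    MediaRecord("Harbour Song (live)", "wav", 420),
    MediaRecord("Attic Tour", "mp4", 600),
]
by_title = find_media(catalogue, title_contains="harbour")
narrowed = find_media(
    catalogue, title_contains="harbour", file_type="mp3", max_duration_s=300
)
```

A title search alone returns both recordings of the song; adding the file type and a maximum duration narrows the result to a single record.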
Main Solutions Available to Designers of IR System
The attic trust’s access to searchable information is contingent on the digitisation of all records currently in the collection, as well as some degree of standardization of file types to make retrieval easier. Text-based items in the collection will need to be typed or scanned. Typing allows the information to be reformatted, which gives more leeway in how it is displayed. The items are antiques, however, and much of their appeal lies in looking exactly as they did at their creation; a reformatted presentation will best serve users seeking information for semantic purposes.
A restored version, digitised and optimized for better readability, suits those users. For those seeking the material for sentimental reasons, the best solution is to use a digital image of the text, which preserves the original appearance. Scanning the material without text detection is the quickest way to do this. This effectively turns the material into a ‘picture’ of words, with the consequence that text retrieval techniques cannot be applied to it.
The majority of the information will need to be digitized in order to preserve it. The most significant consideration is the format used in the process. Conversion technologies are available that transfer audio and video tapes to digital data. Archiving physical artifacts such as sculptures for public display would require taking digital photographs of them.
Converting photographs into three-dimensional representations through animation, or making short films about the items, gives the opportunity to add sound clips. Animation provides for greater user involvement, and narration allows more details to be included, enriching the experience. The format used depends on the type of user. An animated sequence that lets viewers manipulate the image to obtain particular perspectives would be ideal for a person interested in art.
A video with a sound clip that provides background information on the object will be perfect for the inquisitive semantic user. “Speech may introduce, summarize, stimulate, and tell” (Jalal, 2001, p.6). Because sound does not vary significantly between users, audio data has the fewest presentation issues. There should be no significant technological challenges since the data delivery takes place in a widely accessible form.
Different Methods of Representation
Information Retrieval systems are classified into two categories. Belkin (n.d.) distinguishes them as “retrospective” and “ad-hoc,” with the latter defined as “Information filtering or routing.” Retrospective solutions meet one-time information demands that taper off once they have been satisfied. E-books, newspaper articles, online magazines, and information websites are among these.
The database should be focused on information, rather than individuals. Sources that are accessed repeatedly because of their high utility fall under information filtering. Websites that provide changing information, such as weather patterns, stock prices, and map services, are examples of this type. When designing the database based on the available representation methods, there are some important things to consider.
The choice of database language is crucial. There are two methods for dealing with the problem of language across the database. One option is to use natural language from users, which becomes the basis for search queries, whereas the second option is to utilize a controlled lexicon. If the trust opts for natural language in its Information Retrieval system, people will have an easier time using it because they won’t have to learn the controlled vocabulary of the database.
Users will nevertheless be confronted with issues of ambiguity and irrelevancy. If the trust adopts a controlled vocabulary, users would first need to learn it before obtaining more relevant search results. Tedd et al. (2005, p. 39) underline that “users must have adequate skills to access timely and effective information.”
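The trade-off between natural language and a controlled vocabulary can be sketched as a lookup from the words a user types to the descriptors the database accepts. The vocabulary entries and the to_descriptors function below are hypothetical, and the sketch only handles single-word entry terms.

```python
# Hypothetical mapping from entry terms (what users type) to the
# preferred descriptors used in the database.
CONTROLLED_VOCABULARY = {
    "photo": "photograph",
    "photos": "photograph",
    "picture": "photograph",
    "lp": "vinyl record",
    "record": "vinyl record",
}


def to_descriptors(query, vocabulary):
    """Split a free-text query into known descriptors and unmapped words."""
    descriptors, unmapped = [], []
    for term in query.lower().split():
        if term in vocabulary:
            descriptor = vocabulary[term]
            if descriptor not in descriptors:
                descriptors.append(descriptor)
        else:
            # Words outside the vocabulary are reported, not guessed at.
            unmapped.append(term)
    return descriptors, unmapped


descriptors, unmapped = to_descriptors("old photos and LP", CONTROLLED_VOCABULARY)
```

The unmapped words show the cost of a controlled vocabulary: anything the user types that is not an entry term contributes nothing to the search until the user learns the accepted terms.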
The whole database will need to be indexed. This entails assigning words or phrases to each item in the database. The trust may choose between free indexing and descriptors, depending on whether it adopts natural language or a controlled vocabulary. Categorization would necessitate the creation of numerous categories for all of the collection’s items. Chu (2005, p. 29) suggests that two criteria, “exhaustive” and “mutually exclusive,” should be used when developing useful categories.
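One way to read the two criteria is that every item must receive a category (exhaustive) and no item may receive more than one (mutually exclusive). The sketch below checks a category scheme against both conditions; the item identifiers, the category names, and the check_categories function are invented for illustration.

```python
def check_categories(items, categories):
    """Check a category scheme against the two criteria.

    items: list of item identifiers in the collection.
    categories: dict mapping a category name to the set of its item ids.
    Returns (uncategorized, overlapping): items that break exhaustiveness
    and items that break mutual exclusivity.
    """
    counts = {item: 0 for item in items}
    for members in categories.values():
        for item in members:
            if item in counts:
                counts[item] += 1
    uncategorized = sorted(item for item, n in counts.items() if n == 0)
    overlapping = sorted(item for item, n in counts.items() if n > 1)
    return uncategorized, overlapping


# Invented sample items and category scheme.
items = ["poem-01", "map-07", "tape-12", "film-03"]
categories = {
    "text": {"poem-01"},
    "image": {"map-07", "film-03"},
    "multimedia": {"film-03"},
}
uncategorized, overlapping = check_categories(items, categories)
```

In this sample scheme, "tape-12" has no category and "film-03" sits in two, so the scheme fails both criteria until those two items are resolved.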
All items in the collection must be classified into a particular category, and no two categories should share the same space. Summarization techniques allow text-based programs to perform better. Summarization entails supplying a user with succinct information regarding a body of text. Abstracts, summaries, and extracts are all methods used to give readers an overview of the material. An abstract allows for a broader perspective on the text and may serve as a substitute for it, although it omits some substance.
A summary skips over portions such as the background and the study’s methodology in order to save time. An extract, on the other hand, is a passage taken verbatim from the original text. Each of these techniques has advantages and disadvantages, and they differ depending on the situation.
Querying a database is the process of asking it questions in a language. In the information retrieval process, Nordbotten (2008) states that “query language will always specify the selection criteria for the sought-for data for the subsequent operations.” The key to building a query system is to figure out how much semantically driven searching is required for optimal user experience.
Synonyms must be managed, which may affect query processing and therefore system performance, and may also increase design and management expenses. A simple query mechanism that matches input to metadata and comparable phrases generates a large number of results, making it more difficult for the user to sift through the data and possibly harming the user experience. Metadata might help improve search outcomes by allowing for additional access options.
Implications of Using IR systems
A two-part system is most suitable for the project. One of them is the preservation of physical relics that contain the information that needs to be stored. The second part is developing a digital library or digital museum so that users from all over the world interested in the trust’s activities can interact with the materials.
The phrase “on the internet” has come to mean, at the least, anything one can put on a screen, such as Facebook and YouTube. According to Castle (2001, p.5), “a digital library delivers information directly to the user’s desk, either at work or at home.” Because the trust is intended for an international audience, the most effective Information Retrieval system will employ natural language rather than a restricted vocabulary approach. Keywords aid in tuning queries. Because seeing artifacts in their original state is what makes viewing relics so unique, storage methods that preserve them in this form are preferred.
Later, the trust may weigh storage options relevant to semantically oriented users who want to discover meaning in the material, especially for instructional purposes. The collection must be digitized in its entirety. This entails converting audio recordings to suitable digital formats. If the goal is to save disk space, the MP3 format will come in handy. It is also playable on most media players.
Discretion will be needed in handling the text-based media. Some of them will need to be preserved as digital text so that formatting can take place. This would apply to essays and poems. Other materials, such as newspaper excerpts and poems, may need only to be scanned, without the aid of text recognition software.
Digital photographs of physical objects such as sculptures may aid in the creation of animated collections. This is much easier than handling multimedia items. The collection’s multimedia components will require a variety of file types for retrieval. Creating a unique media player for the trust should be considered as an option. Because it would use a single format, this solution would address compatibility issues and could cut administrative costs.
Information retrieval systems may be found in Web search engines, library catalogs, store catalogues, cookbook indexes, and other places. Information retrieval (IR), also known as information storage and retrieval (ISR or ISAR) or information organization and retrieval, is the practice of extracting a subset of items from a collection to meet the user’s requirement.
The fourth stage is the information retrieval system, which comprises an information retrieval language (also known as IRL), rules for converting natural language to the information retrieval language and vice versa, and match criteria designed to carry out information retrieval. It’s critical to distinguish information retrieval systems from information retrieval devices, which are specialized equipment or methods for combining technological tools in order to perform practical data retrievals.
Unstructured data is like a level 5 hoarding house: an enormous, chaotic collection of various objects that are useless until identified and stored in an organized manner. Once this organization process has been completed through the use of specialized software, the items can be searched through and classified (to some extent) to obtain insights.
Most company data today is relatively unstructured. Bitmap pictures/objects, text, and other data types that aren’t part of a database are examples of unstructured data. An email message is one such example. You may have a solid reason to harvest and categorize data from email, even if data mining tools can’t process the information in email messages (however structured it might be). This demonstrates the significance and potential breadth of unstructured data.