
Unstructured and complex documents user guide
Best practices
This section contains best practices on how to write good prompt instructions at the project (that is, overall extraction) level, the field group level, and the individual field level.
- Clarity and simplicity - Use clear, direct, and unambiguous language. Avoid overcomplicating instructions that could confuse the model. Use plain language and keep sentences short.
- Consistency - Maintain consistent terminology across fields, field groups, and instructions to avoid confusion.
- Provide context - Give the model the context it needs to understand the general scope of the task, such as industry information, document type, or overall data format. The more relevant context the prompt provides, the more likely the model is to predict the field correctly and consistently.
- Iterate - As refining prompts is an iterative process, maintaining a record of your drafts and their corresponding results can provide valuable insights for future adjustments and improvements. Write a prompt, test, and edit. Repeat this process until you get your desired extraction.
- Avoid negative instructions - Phrase instructions positively. For example, instead of "Do not leave out any sections of the document", write "Ensure all key sections of the document, such as X, Y, and Z, are covered".
- Avoid repetitive language - Repetitive language can lead to redundancy, confusion, and unclear instructions for the model.
- Watch out for contradictory information - Make sure that your project, field group, and field-level instructions do not contradict one another in terms of the information to extract, the format of the extraction, and where the information can be found. Contradictions confuse the model and lead to inconsistent results.
- Example reinforcement - Whenever possible, reinforce the prompt instruction with examples of correct responses. These examples can guide the model towards the expected outcome.
Best practice | Details | Importance | Correct example | Incorrect example |
---|---|---|---|---|
Define the industry and the document type | Briefly describe the industry and the document type from which information is being extracted. Then, specify key characteristics and the expected structure of the document type to guide the extraction. | This provides important context for the data extraction process. | Instruction: Extract information from a brokerage statement, which is commonly found in the Financial Services Industry. Brokerage statements typically consist of a few sections: account overview, account summary, account holdings, and account transaction activity. | Instruction: Extract the fields below from the document. Explanation: This project instruction example does not benefit the model. It does not provide any important context or key characteristics that would help guide the model. |
Specify if you expect multiple occurrences of the document within one file | Indicate if the document contains multiple instances of identical data, and provide guidance for each extraction instance. In use cases that may have multiple documents within a single file, identify a unique identifier and include it as a field in each field group. | This will facilitate post-processing, allowing for more efficient automation. | Instruction: There may be multiple brokerage accounts within a single document file. A brokerage account can be identified via a unique account number field present in each field group. Extract the account information, account holdings, and account activity field groups for each account. | Instruction: Extract all instances of data from each account document. Explanation: This instruction example is poor as it fails to specify how to determine if there are multiple occurrences of a document type within the file. |
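
The post-processing benefit of a unique identifier can be illustrated with a minimal Python sketch. It assumes, hypothetically, that extraction results are available as a list of dictionaries, one per extracted field group instance, each carrying an `account_number` key. The record layout and all names are illustrative and do not represent the product's actual output format.

```python
from collections import defaultdict

def group_by_key_field(records, key_field="account_number"):
    """Bucket extracted field-group records by the value of a shared key field."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[key_field]].append(record)
    return dict(grouped)

# Hypothetical extraction output for one file containing two brokerage accounts.
records = [
    {"field_group": "Account Information", "account_number": "A-100", "owner": "J. Doe"},
    {"field_group": "Account Holdings", "account_number": "A-100", "ticker": "XYZ"},
    {"field_group": "Account Information", "account_number": "B-200", "owner": "R. Roe"},
]

by_account = group_by_key_field(records)
for account, items in by_account.items():
    print(account, [item["field_group"] for item in items])
```

Because every field group carries the same key field, records from different accounts in one file can be separated without guessing from page order.
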
Best practice | Details | Importance | Correct example | Incorrect example |
---|---|---|---|---|
Group similar data points that you want to be extracted together into field groups | Organize related fields into logical groups. | This helps to streamline extraction and minimize errors. | The name, address, and marital status of the account owner can all be grouped under an Account Owner Information field group. | Field Group: Account Information Fields: Account Holdings, Transaction Date, Account Owner Explanation: This grouping might work in a situation where a user only wants to extract those three fields. However, if there are other fields like the holding ticker symbol and cost basis, the design or structure of this group will not be the most effective. |
Field group context | Explain how each field group contributes to the overall meaning and purpose of the document. | This helps the model understand the context of the extraction. | Instruction: This section outlines key brokerage statement account holding details, including the equity name, purchase date, quantity purchased, cost basis, and total price paid. These details help determine the current holdings in a brokerage statement. | Instruction: Extract the fields below from the document. Explanation: The prompt instruction lacks context and detail. It neither explains the type of information to extract nor highlights its importance. |
Leverage the location and structure of information in the document within your field group prompts | Indicate likely locations for the data of each field (for example, table, header, or body) to guide extraction. Note: If the information consistently appears in the same section of the document, state that section in the prompt. | This helps the model focus on the correct part of the document for each field. | Instruction: The field-level data for this section will most likely be found in the header of the report on the first page under the document title. | Instruction: Extract the information from the beginning of the document. Explanation: The prompt is vague and does not provide the model with enough detail on where specifically to look within the document. |
Model tables using field groups with fields | Treat a field group as a table, with each column acting as a unique field within that group. This approach is key to effective data modeling as it ensures clear differentiation, minimizes data duplication, and increases data consistency. | This method enables a logically structured and systematic arrangement of data, which subsequently leads to enhanced efficiency during data queries and analysis. | Field group: Customers Fields: Name, Address, Phone Number | Field groups: Customer Name, Customer Address, Customer Phone Number Fields: Name, Address, Phone Number Explanation: This example unnecessarily separates each customer detail into its own field group, making data management complex and prone to inconsistencies. |
Create parent and child field groups | Denote parent-child relationships with a greater-than sign (>) in the field group name. A parent field group can have multiple child field groups. | Leveraging field groups to show relationships between data within the documents is a great way of maintaining hierarchical data organization. | Field group: Brokerage Statement Fields: Account Owner, Account Type Field group: Brokerage Statement > Asset Allocation Fields: Asset Type (for example, Stocks, Bonds, Cash), Percentage of Total Assets Field group: Brokerage Statement > Investments Fields: Investment Name, Quantity Owned, Price per Share, Total Value of Investment | Field group: Account Owner Fields: Name, Investment Name, Type of Account, Number of Shares, Stocks, Bonds Field group: Account Owner > Address Fields: Street, City, State, ZIP Code Field group: Account Owner > Contact Info Fields: Phone Number, Email Explanation: This is a poorly structured hierarchy because it combines unrelated fields under the same parent, and the child field groups (Address and Contact Info) do not logically relate to the fields of the parent (Investment Name, Number of Shares, Stocks, Bonds). This could confuse the AI model as it does not reflect the natural organization of the data within the document. |
Use a key field for files that contain multiple documents | Select a unique identifier in the document that will allow you to differentiate the data. Include this field in every field group. You do not need to alter the instruction for this field from one field group to another. | Including this key field allows for the separation of information within the document and removes confusion when processing the extracted data. | Field: Account Number, Social Security Number, Policy Number | Field: Date, Name Explanation: The field names listed would not make good key fields as they are not unique. Dates and names can both be repeated. |
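
The parent and child naming convention can be read as a small tree. The following Python sketch is illustrative only: it assumes field groups are held in a plain dictionary keyed by name, and `build_hierarchy` is a hypothetical helper, not part of any product API. It derives each group's parent from the `>` separator in its name.

```python
def build_hierarchy(field_groups):
    """Map each field group to its parent, derived from '>' in the group name."""
    tree = {}
    for name, fields in field_groups.items():
        parts = [part.strip() for part in name.split(">")]
        normalized = " > ".join(parts)
        # Everything before the last segment names the parent; top level has none.
        parent = " > ".join(parts[:-1]) or None
        tree[normalized] = {"parent": parent, "fields": fields}
    return tree

# Field groups taken from the correct example above.
field_groups = {
    "Brokerage Statement": ["Account Owner", "Account Type"],
    "Brokerage Statement > Asset Allocation": ["Asset Type", "Percentage of Total Assets"],
    "Brokerage Statement > Investments": [
        "Investment Name", "Quantity Owned", "Price per Share", "Total Value of Investment",
    ],
}

hierarchy = build_hierarchy(field_groups)
children = [name for name, info in hierarchy.items()
            if info["parent"] == "Brokerage Statement"]
print(children)
```

Listing the children of each parent this way is a quick check that every child group logically belongs under the fields of its parent.
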
Best practice | Details | Importance | Correct example | Incorrect example |
---|---|---|---|---|
Pick field names carefully | Choose clear, recognizable names for fields that align with the expectations of the user. If there is a universal name that is used in all document variations, make sure to include it. | Precise field names ensure accurate extraction and reduce ambiguity. | Field: Date of Accident | Field: Date Explanation: Date is a generic term and does not provide any context about what the date refers to. This can lead to inaccurate data extraction, as the AI model might pick up any date that appears in the document. |
Be explicit and detailed with instructions | Kickstart the model by explicitly stating what you want the model to extract. Specify the exact format and structure of the data to be extracted. | Clear, detailed prompts guide the model to extract exactly what you need, in the format you expect. | Instruction: Extract the list of all the advisors from the document, format them into a comma-separated list, and arrange them in alphabetical order. | Instruction: Get all of the advisors Explanation: The prompt is vague and does not provide the model with clear instructions about the desired outcome and how it should be formatted. This can lead to inconsistencies in the extracted information, making it more difficult to process the results. |
Provide examples within the instructions | Provide example inputs and corresponding expected outputs to clarify the expected outcomes. | This helps the model understand exactly what you are looking for. | Instruction: Extract the transaction dates from the document. The dates should be in MM/DD/YYYY format. For example, if the document states that the transaction was completed on January 1, 2021, the extracted date should be 01/01/2021. If the transaction date is stated in the MM/YYYY format, then extract it as the first day of that month. For example, if the date is presented as 05/2021, extract it as 05/01/2021. | Instruction: Get the transaction dates from the document. Explanation: The prompt above is not as effective because it does not provide explicit instructions on how to handle different date formats found in the document. This lack of clarity can lead to inconsistent extraction of dates, making the task of interpreting and analyzing data more complicated. |
Stick to one main idea per field instruction | To improve accuracy, avoid overloading the prompt by trying to extract large, sequential amounts of data in a single field. Each field-level instruction should focus on extracting one piece of data. | This improves accuracy and makes post-processing easier. | Field 1: Extract the Account Number. Field 2: Extract the Transaction Date. Field 3: Extract the Account Balance. | Instruction: Extract the account number, transaction date, and account balance together. Explanation: The prompt is overloaded with multiple instructions directing the model to extract different types of data simultaneously. This approach could create messy extraction outcomes and make post-processing difficult. |
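
The date rules in the "Provide examples within the instructions" row describe a precise input-to-output mapping. The Python sketch below spells out that mapping, for instance as a way to spot-check extracted values during post-processing. It is an illustration under stated assumptions: `normalize_transaction_date` is a hypothetical helper, it handles only the three formats named in the example instruction, and parsing month names with `%B` assumes an English locale.

```python
from datetime import datetime

def normalize_transaction_date(raw):
    """Return the date as MM/DD/YYYY, following the example instruction's rules."""
    text = raw.strip()
    # Full dates are tried first. A month/year input carries no day, and
    # strptime fills in day 1, which matches the "first day of that month" rule.
    for fmt in ("%B %d, %Y", "%m/%d/%Y", "%m/%Y"):
        try:
            return datetime.strptime(text, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(normalize_transaction_date("January 1, 2021"))  # 01/01/2021
print(normalize_transaction_date("05/2021"))          # 05/01/2021
```

Writing the expected mapping down this explicitly, whether in code or in the prompt itself, is what removes ambiguity about how each date format should be handled.
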
Best practice | Details | Importance | Correct example | Incorrect example |
---|---|---|---|---|
Choose data types with purpose | Consider how you want the extracted data formatted and ensure it aligns with downstream use cases to optimize extraction for automation. | Selecting the appropriate data type enables accurate formatting and easier downstream processing. | Field name: Transaction Volume Data type: Number | Field name: Phone Number Data type: Number Explanation: Using the Number data type for a phone number is not beneficial. Although a phone number is composed of digits, it is not a numerical value, meaning that you do not perform arithmetic with it; it is better described as a string of digits. Therefore, using an Exact Text data type would be the appropriate choice. |
Only include field type-specific instructions in the field type | When providing instructions for data extraction, it is crucial to keep them specific to each field type. If there are general instructions that apply to all fields of a certain type, you can provide them at the field type level to avoid repetition. For example, if all Monetary Quantity fields need to be in USD, specify this at the field type level. However, some datasets may require unique fields not covered by existing field types (Date, Text, Monetary Quantity, and so on). In these cases, you can create a new, customized field type. When writing instructions for these new field types, specify how the data should be formatted to ensure the extracted data meets its intended purpose. | These practices enhance the precision and consistency of your extracted data. | Field type: Date Instruction: Extract all the dates associated with transactions from the document. Dates should be normalized to the format YYYY-MM-DD. | Field type: Monetary Quantity Instruction: Extract the item price from the Price column under the invoice line items table. Explanation: The instruction is relevant specifically to extracting a Monetary Quantity from a certain field (the Price column), not to any other Monetary Quantity-based field. |
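
One concrete consequence of treating a phone number as a number can be shown with a short Python sketch. The phone number is fictional and the variable names are illustrative. Converting the digits to a numeric type silently drops leading zeros, while a text type preserves every digit.

```python
raw_phone = "0044 20 7946 0958"  # fictional number, used only for illustration

# Keep only the digits, as a downstream system might before storing the value.
digits_only = "".join(ch for ch in raw_phone if ch.isdigit())

as_number = int(digits_only)  # numeric type: leading zeros are lost
as_text = digits_only         # text type: every digit is preserved

print(as_number)  # 442079460958
print(as_text)    # 00442079460958
```

The same reasoning applies to other digit strings that are identifiers and not quantities, such as ZIP codes or account numbers.
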
- Create a field for each piece of information you want extracted, but do not include any instructions yet.
- Select a sample of 2 to 3 documents and run predictions on each one. These documents should reflect the variation present in the documents that you are building the model for.
- Compare the extractions of the model to what you expected. For the fields that did not perform well, draft a prompt using the previously listed best practices. This draft serves as your baseline prompt.
- Rerun the predictions using the same 2 to 3 sample documents you tested earlier and check whether the extraction performance has improved.
- If the predictions are incorrect or incomplete, refine the prompts by adding the details needed to improve the extraction performance of the model. If the predictions align with your expectations, widen your sample of documents. Increase the sample size gradually: move from 2 or 3 documents to 10, then to 20, 30, and so on. Continue until you are confident that the predictions of the model are correct.
- If the instructions have changed, reevaluate previously viewed documents to ensure predictions remain accurate.
- Once you are satisfied with the performance of the model, revisit the first document and start annotating. Annotate at least 10 documents to gain valuable field performance metrics through the Measure tab. This feature allows you to evaluate the extraction performance at both the overall project and field levels.
- Monitor performance metrics to inform your large-scale prompt refinement. Prompt iteration should occur primarily at the field level, where adjustments have a targeted, direct impact on the specific fields that are not performing well. If a field group as a whole scores poorly, adjusting your project and field group instructions may be more impactful, as they affect several fields.
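
The compare-and-refine loop above can be approximated outside the product with a simple per-field comparison. The Python sketch below is an assumption-laden illustration: `field_accuracy` is a hypothetical helper that uses exact-match scoring on hand-written dictionaries, and it is not the metric computed by the Measure tab.

```python
def field_accuracy(predictions, expected):
    """Fraction of documents in which each field's prediction equals the expected value."""
    totals, correct = {}, {}
    for doc_id, expected_fields in expected.items():
        predicted_fields = predictions.get(doc_id, {})
        for field, value in expected_fields.items():
            totals[field] = totals.get(field, 0) + 1
            if predicted_fields.get(field) == value:
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / totals[field] for field in totals}

# Hypothetical expected values and model predictions for two sample documents.
expected = {
    "doc1": {"Account Number": "A-100", "Transaction Date": "01/01/2021"},
    "doc2": {"Account Number": "B-200", "Transaction Date": "05/01/2021"},
}
predictions = {
    "doc1": {"Account Number": "A-100", "Transaction Date": "01/01/2021"},
    "doc2": {"Account Number": "B-200", "Transaction Date": "05/2021"},
}

scores = field_accuracy(predictions, expected)
# Fields scoring below 100% are candidates for field-level prompt refinement.
needs_work = sorted(field for field, score in scores.items() if score < 1.0)
print(scores)
print(needs_work)
```

Here only Transaction Date falls short, which points to a field-level prompt change (for example, stating how to handle the MM/YYYY format) instead of a change to the project or field group instructions.
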