This blog discusses an option to improve Solr indexing performance. To improve the speed at which indexing occurs by solr it is important that Solr can quickly retrieve a list of documents that require indexing. When a document or folder is added or updated in Alfresco a transaction is written to the DB. Using a rest request Solr asynchronously requests the latest set of transactions that have been created to determine what documents need to be indexed. It then requests the documents for each transaction and adds their metadata and content to the index. A key driver of the speed for which this occurs is that solr can get the list of transactions in a timely manner. This is especially important if you are re-indexing your Repository shards. We ran into a problem for one of our customers where the request to get the latest transactions was taking > 2 minutes and therefore timing out. The advice we got from Alfresco was to increase the timeout time on the DB. While that prevented timeouts it did nothing to improve the speed at which the index was building which resulted in the index lagging under even normal load. We also found that the DB was using > 50% cpu as it tried to fullfill queries to the transaction tables that were taking forever. This occurred in a shared solr instance with >100,000,000 transactions in the Repo.
How Indexing Works
Solr makes a number of requests to Alfresco to get new transactions. A transaction is created whenever a node or its permissions have changed (CUD). It is important that these requests return quickly or else indexing can slow as solr waits for the list of transactions to be returned.
The below request is made by Solr to Alfresco to get new transactions within DB_ID_Range 140-190 million transactions.
time curl -H "X-Alfresco-Search-Secret: <secret password>" -u {username:password} -X GET "http://<domain>:<port>/alfresco/service/api/solr/transactionInterval?fromNodeId=140000000&toNodeId=190000000"
In a sharded solution each shard is given a DB_ID_Range, the range being used to query the transaction table for changes. The solution below adds an index to the transaction table to match the db_id_range of transactions that a shard is configured to query. This massively decreases the response time for the query.
The rest request above results in Alfresco calling the following query (embedded in a larger query) to retrieve new transactions based on the shard db_id_range values.
select min(transaction_id)
from alf_node
where id >= <db id range start value>
and id <= <db id range end value>;
Looking at the query plan showed that it is using a btree index and that all of the slowness is in execution not the parsing of the query. The index itself looked healthy but was slow to respond when above 70000000.
Alfresco by default includes an index on the transaction table called idx_alf_node_txn. However as this grows it becomes less efficient. Adding specific indexes to match the db_id_range ensures each shard query will use its specific index and therefore improve query performance.
Solution
Create a separate index for each shard with the index including the the same range as configured for the db_id_range of the shard (shard2 and shard3 in example below).
Eamples below create DB indexes for 2 shards, shard1 having a db_id_range of 70,000,000 to 140,000,000 and shard2 with a range of 140,000,000 to 190,000,000
CREATE INDEX idx_alf_node_txn_shard2 ON public.alf_node USING btree (transaction_id)
where id >= 70000000
and id <= 140000000;
CREATE INDEX idx_alf_node_txn_shard3 ON public.alf_node USING btree (transaction_id)
where id >= 140000000
and id <= 190000000;
By creating a separate index for shards 1 and 2 these will be used when those shards query the transaction table for more transactions within their specific db_id_range.
This means that the query execution is now using the separate indexes for each shard when executing the query and returns very fast (15ms) compared to 160,000 ms before we tuned it. One interesting thing to note here is that the db indexes have to be exactly the same range as the shard db_id_range or else it is not used when the shard requests changes from the index.
Testing
Run the query below before adding the index and note the time:
EXPLAIN ANALYZE select min(transaction_id)from alf_node where id >= db id range start value and id <= db id range end value;
Once new index has been created then rerun the query to verify that the query returns quickly.
Also,you can run the curl request made by Solr to Alfresco to verify the time it taked to get back the results.
Verify that solr index is not lagging.
Conclusion
For the customer we implemented this for it had a very positive impact. Not only did the index stop lagging, the DB also used considerably less cpu and therefore could be downsized saving considerable cost. I hope this helps anyone who is tuning Alfresco solr to avoid many hours of frustration when managing Alfresco large repositories.