Abstract
Apache Solr is available in CDP Public Cloud through the “Data exploration and analytics” data hub template. In this article we investigate how to connect to the Solr REST API running in the Public Cloud, and highlight the performance impact of session cookie configuration when Apache Knox Gateway proxies the traffic to the Solr servers. The information in this blog post can be useful for engineers developing Apache Solr client applications.
The Apache Solr servers in the Cloudera Data Platform (CDP) expose a REST API protected by Kerberos authentication. In general, when the Solr cluster runs in distributed mode, any Solr server instance can handle traffic. The server that receives a request from the client forwards the query to the servers hosting shards of the collection and combines the results before sending the response back to the client. For scalability, it is best to distribute the queries among the Solr servers in a round-robin fashion.
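As an illustration of round-robin distribution on the client side, here is a minimal Python sketch. The host names, the port, and the helper names (round_robin, select_url) are hypothetical and not taken from any Cloudera tooling; only the rotation logic matters.

```python
from itertools import cycle
from urllib.parse import urlencode

# Hypothetical Solr base URLs; replace them with the Solr hosts of your cluster.
SOLR_SERVERS = [
    "https://solr-worker0.example.com:8985/solr",
    "https://solr-worker1.example.com:8985/solr",
    "https://solr-worker2.example.com:8985/solr",
]


def round_robin(servers):
    """Return a function that hands out the servers in turn, wrapping around."""
    rotation = cycle(servers)
    return lambda: next(rotation)


next_server = round_robin(SOLR_SERVERS)


def select_url(collection, query):
    """Build a select URL that targets the next server in the rotation."""
    return f"{next_server()}/{collection}/select?{urlencode({'q': query})}"
```

Each call to select_url targets a different server, so no single Solr instance has to coordinate every distributed query.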
When Solr is deployed in the public cloud using the “Data exploration and analytics” data hub template, there are two ways to reach the Solr cluster from a separate client host. The first, easier approach is to reach Solr using Knox Gateway as a proxy. The Apache Knox Gateway is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster. In the CDP Data Hub cluster, Knox accepts HTTP basic authentication, so CDP users can authenticate with their workload or machine user credentials. Based on these credentials, Knox forwards the requests to the Solr servers in round-robin, using Kerberos and Simple and Protected GSSAPI Negotiation Mechanism (SPNEGO) on behalf of the authenticated end user. (See Figure 1.)
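A request through Knox can be sketched with the Python standard library as follows. The Knox base URL (including the topology path), the collection name, and the credentials are placeholders that we assume for illustration; take the real endpoint from your data hub cluster. The helper name knox_solr_request is hypothetical as well.

```python
import base64
import urllib.request
from urllib.parse import urlencode


def knox_solr_request(knox_base_url, collection, query, user, password):
    """Build a GET request for a Solr select query proxied by Knox."""
    url = f"{knox_base_url}/solr/{collection}/select?{urlencode({'q': query})}"
    # HTTP basic authentication: base64("user:password") in the Authorization header.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# All values below are assumptions, not real endpoints or credentials.
request = knox_solr_request(
    "https://knox.example.com/my-datahub/cdp-proxy-api",
    "my_collection",
    "*:*",
    "workload_user",
    "workload_password",
)
# urllib.request.urlopen(request) would send the request to Knox.
```

The client needs no Kerberos configuration here; Knox performs the SPNEGO negotiation with Solr on the user's behalf.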
When we connect to Solr through Knox, the Knox Gateway sets the KNOXSESSIONID cookie in the HTTPS response. This cookie can be reused and set in each subsequent request, which will drastically improve the performance of handling Solr requests.
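One possible way to reuse the cookie is to read it from the Set-Cookie headers of the first response and send it back in a Cookie header on later requests. The sketch below assumes the cookie arrives as an ordinary Set-Cookie header; the helper name session_cookie_header and the sample cookie value are made up for illustration.

```python
from http.cookies import SimpleCookie


def session_cookie_header(set_cookie_values, name="KNOXSESSIONID"):
    """Return a Cookie header value for the named cookie, or None if it is absent."""
    for raw in set_cookie_values:
        parsed = SimpleCookie()
        parsed.load(raw)
        if name in parsed:
            return f"{name}={parsed[name].value}"
    return None


# Made-up Set-Cookie value, shaped like a typical session cookie.
headers = ["KNOXSESSIONID=abc123; Path=/gateway; Secure; HttpOnly"]
cookie = session_cookie_header(headers)
```

With urllib, the Set-Cookie values of a response are available through response.headers.get_all("Set-Cookie"), and the returned string can be attached to the next request with request.add_header("Cookie", cookie). An HTTP client with built-in cookie handling achieves the same effect with less code.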
The other approach is to connect to any Solr server instance directly, using HTTPS with SPNEGO authentication. In this case the Knox Gateway is not used. Setting up this connection can be more challenging, because basic authentication is not available and Kerberos credentials are required. Also, if the Solr client host is outside of the CDP environment, the Solr server ports on all worker hosts need to be exposed. (See Figure 2.)
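For a direct connection, curl can perform the SPNEGO negotiation when a valid Kerberos ticket is present (for example after kinit) and curl was built with GSS-API support. The following Python sketch only assembles such a curl command line; the helper name, host, port, and collection are hypothetical.

```python
def spnego_curl_command(url, cookie_file=None):
    """Build a curl argument list that authenticates with SPNEGO."""
    # --negotiate enables SPNEGO; "--user :" makes curl take the identity
    # from the Kerberos ticket cache instead of a user name and password.
    command = ["curl", "--silent", "--negotiate", "--user", ":"]
    if cookie_file is not None:
        # --cookie sends cookies stored in the file, --cookie-jar saves new ones to it.
        command += ["--cookie", cookie_file, "--cookie-jar", cookie_file]
    command.append(url)
    return command


# Hypothetical Solr endpoint; pass the list to subprocess.run to execute it.
command = spnego_curl_command(
    "https://solr-worker0.example.com:8985/solr/my_collection/select?q=*:*",
    cookie_file="/tmp/solr-cookies.txt",
)
```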
Benchmarking
To measure the performance of the Solr API, we developed a small benchmark script and executed it from a gateway node of the data hub cluster. The benchmark script is available under the Apache 2.0 license in this repository.
The following table and graph present our benchmark results. We executed short Solr queries on a very small Solr collection. We varied the number of parallel threads (1..10) and on each thread we executed 100 Solr REST calls using the “curl” command. We tested the Solr API both directly (connecting to a single given Solr server without load balancing) and using Knox (connecting to Solr through a Knox Gateway instance). We repeated the tests both with and without reusing the cookies sent back in the HTTPS responses. In all cases, the benchmark script was running on the gateway host of the Solr data hub cluster.
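The structure of such a measurement can be sketched as below. This is not the benchmark script mentioned above, which drives curl; it is a simplified Python illustration in which the function name run_benchmark and its parameters are our own, and call stands for any function that issues one Solr request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_benchmark(call, threads, calls_per_thread=100):
    """Run `call` repeatedly on several threads; return (elapsed seconds, total calls)."""

    def worker():
        for _ in range(calls_per_thread):
            call()

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(worker) for _ in range(threads)]
        for future in futures:
            future.result()  # re-raises any exception raised inside a worker
    elapsed = time.perf_counter() - start
    return elapsed, threads * calls_per_thread
```

Running it for thread counts from 1 to 10, once for each connection mode and cookie setting, yields the kind of comparison described here.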
Our results clearly show how important it is to reuse the KNOXSESSIONID cookie when connecting to Solr through the Knox Gateway. When the cookie is set, performance is basically the same as with a direct connection, suggesting that the Knox Gateway is not the bottleneck for this particular benchmark. Without the KNOXSESSIONID cookie, however, performance degrades very significantly, because the Knox Gateway has to authenticate every HTTPS request individually; when the cookie is set, Knox can rely on the earlier authentication.
Conclusion
We described two ways to connect to the Solr REST API in the CDP Public Cloud; hopefully the information in this blog post will help you choose the best one for your project. Connecting through Knox is preferable, as the Knox Gateway provides load balancing and simplifies authentication by eliminating the need for client-side Kerberos configuration. A direct connection to the Solr server instances is also possible and might be a good approach if the Knox Gateway becomes a bottleneck or if the extra routing step through Knox adds too much latency. Still, in most cases we suggest starting the project by using the Knox Gateway to reach Solr, mainly because setting up a secure connection and load balancing for direct Solr access can be more challenging. Using the KNOXSESSIONID cookie can help to reach performance similar to the direct setup.