Muon id

    • Objective. Build a classifier that distinguishes muons from non-muons in the LHCb detector.

      Background information. Let us start from the beginning. Normal matter, the kind that planets, humans, and stars are made of, makes up only about 5% of the mass-energy content of the Universe. The rest is invisible dark matter and dark energy, whose existence is inferred from their gravitational effects. One way of studying these mysteries is to recreate the conditions just after the Big Bang with particle accelerators. Using a very rough analogy, we collide automobiles at supersonic speed and try to learn how they work by looking at photos of the collisions. One such camera is the LHCb detector.

      Here is a typical collision event recorded by the LHCb detector, one of the four big experiments at the Large Hadron Collider. The point on the left is where the protons collided; the lines are the tracks of the secondary particles.

      Your goal is to build an algorithm that distinguishes the muon tracks (green) from the tracks of the other particle types, using the information from the Muon subdetector. This is an extremely important problem: muon identification is used, one way or another, in the majority of physics analyses at LHCb.

      The Muon subdetector consists of five stations (sensitive planes perpendicular to the beam pipe). Only four of them (M2-M5) are used in our competition. The green parallelepipeds are the detector pads that registered a charged particle passing through them. The physical idea is that only muons are penetrating enough to pass through the lead shielding that separates the Muon subdetector from the rest of the detector. Of course, in the real world not all hits are generated by muons; that is why we need machine learning.

      You are given tracks of three types: muon, pion and proton. Pions might decay in flight into genuine muons, so some of their tracks are very muon-like; you want to reject them as well.

      The data sample is real (i.e. not simulated), so the particle types cannot be known with certainty (see the paper on how it was obtained). To account for that, we use a statistical method called sPlot (original paper, blog post). Each example is assigned a weight; when the examples are used with those weights, the distribution of the features matches the distribution over pure samples. Some of the weights are negative; this is expected.

      Since the data for different particle types have been obtained from different decays, the distributions of the tracks' kinematic observables differ. But in the end we need an algorithm that differentiates particle types in general, not only in the specific decays. In ML terms, this can be viewed as domain adaptation. To achieve that, we reweighted the sample so that the momentum distributions of signal and background match.

      You can find the baseline repository here: https://github.com/yandexdataschool/MLHEP-2020-muon-id

    • Quality metric: background (pion and proton) rejection rate at 90% signal (muon) efficiency. This is a point on the ROC curve: 1 - false positive rate (FPR) at true positive rate (TPR) = 0.9.

       

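      Since there are negative weights, the computation is a bit tricky. This is precisely how we do it. Denote by

      - $p^s_i$ the prediction for the i-th signal (ground truth muon) example, with the signal examples sorted by prediction in decreasing order;

      - $w^s_i$ the weight of the i-th signal example;

      - $p^b_i$ the prediction for the i-th background (ground truth pion or proton) example;

      - $w^b_i$ the weight of the i-th background example;

      - $\mathrm{TPR}_{\mathrm{target}}$ the target true positive rate (TPR).

      Let

      $$T[j] = I\left[\frac{\sum_{i \le j} w^s_i}{\sum_i w^s_i} \ge \mathrm{TPR}_{\mathrm{target}}\right],$$

      where "I" is the indicator function. $T[j]$ is a boolean function that tells whether the TPR for threshold $p^s_j$ is greater than or equal to the target TPR. In the case of non-negative weights, $T$ would have the form -----++++: as the threshold decreases, the TPR increases. In the case of some weights being negative, $T$ might become a bit fuzzy, e.g. ----+--++++. We define the threshold as follows:

      1. Find the minimal index l for which T[l] is true.

      2. Find the maximal index r for which T[r] is false.

      3. Take the final threshold value $t$ [expression in terms of l and r not preserved in this copy of the page].

      Note that for non-negative weights $l = r + 1$.

      The metric equals 1 - FPR at the threshold $t$:

      $$1 - \mathrm{FPR} = 1 - \frac{\sum_i w^b_i \, I\left[p^b_i \ge t\right]}{\sum_i w^b_i}.$$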
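      The procedure above can be sketched in plain Python. This is an illustration under stated assumptions, not the organisers' scoring code: the function names are hypothetical, the final threshold is assumed to be the midpoint between the predictions at index l and index r + 1, and a background example is assumed to count as a false positive when its prediction is greater than or equal to the threshold. Check the baseline repository for the authoritative implementation.

```python
from itertools import accumulate


def threshold_at_tpr(pred_s, w_s, target_tpr=0.9):
    """Threshold on signal predictions using the l / r rule (sketch)."""
    # Sort signal examples by prediction, in decreasing order.
    pairs = sorted(zip(pred_s, w_s), key=lambda pw: pw[0], reverse=True)
    p = [pred for pred, _ in pairs]
    cum = list(accumulate(w for _, w in pairs))
    total = cum[-1]
    if total <= 0:
        raise ValueError("total signal weight must be positive")
    # T[j]: is the weighted TPR at threshold p[j] at least the target?
    T = [c >= target_tpr * total for c in cum]
    l = T.index(True)  # minimal index with T true (the last entry is always true)
    false_idx = [j for j, flag in enumerate(T) if not flag]
    if not false_idx:
        return p[l]  # the target is already reached at the highest prediction
    r = false_idx[-1]  # maximal index with T false
    # ASSUMPTION: midpoint rule; the source does not preserve the exact formula.
    return 0.5 * (p[l] + p[r + 1])


def rejection_at_tpr(pred, label, weight, target_tpr=0.9):
    """Weighted background rejection (1 - FPR) at the target signal TPR."""
    sig = [(p, w) for p, y, w in zip(pred, label, weight) if y == 1]
    bkg = [(p, w) for p, y, w in zip(pred, label, weight) if y == 0]
    t = threshold_at_tpr([p for p, _ in sig], [w for _, w in sig], target_tpr)
    # ASSUMPTION: predictions equal to the threshold count as passing.
    passed = sum(w for p, w in bkg if p >= t)
    total_bkg = sum(w for _, w in bkg)
    return 1.0 - passed / total_bkg
```

      With non-negative weights the sequence T is monotone, so r + 1 equals l and the threshold reduces to the prediction at index l. The pure-Python loops are chosen for readability; on the full dataset a vectorised version would be much faster.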

    • MLHEP 2020 MuId Rules

      Results must be submitted before 2020-08-09 20:09:00+00:00. You may make up to 100 submissions per day and 1000 in total. You can't use datasets other than those provided. Getting a dump of LHCb data is just not fair to other competitors. You can, of course, use publicly available physics constants and published LHCb detector materials. For everything in between, please ask the organisers. You can use any code you find on the Internet; just acknowledge it.

    • Submit a .zip file containing submission.csv (the name is important). Your submission file must contain two columns: id and prediction.
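      As an illustration of the submission format, here is a minimal sketch that writes the two columns to submission.csv inside a zip archive using only the standard library. The function name, the archive name submission.zip, and the presence of a header row are assumptions; only the inner file name submission.csv and the column names id and prediction come from the rules above.

```python
import csv
import io
import zipfile


def write_submission(ids, predictions, zip_path="submission.zip"):
    """Write id/prediction pairs to submission.csv inside a zip archive."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "prediction"])  # assumed header row
    for example_id, prediction in zip(ids, predictions):
        writer.writerow([example_id, prediction])
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # The file inside the archive must be named submission.csv.
        archive.writestr("submission.csv", buffer.getvalue())
    return zip_path
```

      For example, write_submission([0, 1], [0.25, 0.75]) produces an archive whose only member is submission.csv.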

    • Data files description:

      The public leaderboard is based on 20% of the test data. The private leaderboard is based on the remaining 80%.

      Dataset fields description:

      1. label, integer in {0, 1} - you need to predict it. 0 is background (pions and protons), 1 is signal (muons)

      2. particle_type, integer in {0, 1, 2} - type of the particle. 0 - pion, 1 - muon, 2 - proton. Available only for the training dataset.

      3. weight, float - example weight, used in both training and evaluation. Product of sWeight and kinWeight.

      4. sWeight, float - a component of the example weight that accounts for uncertainty in labeling

      5. kinWeight, float > 0 - a component of the example weight that equalizes kinematic observables between signal and background

      6. id, integer - example id

      7. Lextra_{X,Y}[N], float - coordinates of the track linear extrapolation intersection with the Nth station. The extrapolation uses the following station Z coordinates: [15270, 16470, 17670, 18870]

      8. Mextra_D{X,Y}2[N], float - uncertainty for squared {X, Y} coordinate of the track extrapolation.

      9. MatchedHit_{X,Y,Z}[N], float - coordinates of the hit in the Nth station that a physics-based tracking algorithm associated with the track. Poster about the algorithm (χ2COR), code

      10. MatchedHit_TYPE[N], categorical in {0, 1, 2} - whether the matched hit is crossed. 1 means uncrossed, 2 means crossed. 0 means there is no matched hit in the station (missing value). See pages 6-8 here

      11. MatchedHit_T[N], integer in {255} ∪ [0, 15] - timing of the matched hit, in ticks of 25/16 = 1.5625 ns. 255 means missing value (no matched hit in the station)

      12. MatchedHit_D{X,Y,Z}[N], float in {-9999} ∪ (0, +∞) - uncertainty of the matched hit coordinates, also known as pad size

      13. MatchedHit_DT[N], integer - time delta for the matched hit in the Nth station, in ticks of 25/16 = 1.5625 ns. This is a highly technical thing. A simplified explanation:

        1. For uncrossed hits (MatchedHit_TYPE=1) Hit_DT is not defined and the value of MatchedHit_T is stored as a placeholder.

        2. For crossed hits (MatchedHit_TYPE=2) Hit_DT is the uncertainty of the hit time. If its absolute value is high, that might also mean the hit is not real, but rather a product of noise in the system or an artifact of the hit reconstruction algorithm.

      Technically speaking, most muon pads have two independent strips (one horizontal and one vertical). These strips are crossed in order to detect the hit. Each of the strips has information about the arrival time of the particle: Hit_T is the time reported by the first strip and Hit_T + Hit_DT is the time reported by the second strip. For normal hits, one would expect the Hit_DT distribution to peak at zero. More in the paper.
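      To make the timing fields concrete, the sketch below converts the tick counts into nanoseconds and recovers the two strip times for a crossed hit. The function name and the choice to return None for undefined values are assumptions made for illustration; the tick length and the meaning of the hit types come from the field descriptions above.

```python
TICK_NS = 25 / 16  # one tick is 1.5625 ns


def strip_times_ns(hit_type, hit_t, hit_dt):
    """Return (first_strip_ns, second_strip_ns); None where undefined."""
    if hit_type == 0:
        # No matched hit in the station: both times are missing.
        return None, None
    first = hit_t * TICK_NS
    if hit_type == 1:
        # Uncrossed hit: Hit_DT only holds a placeholder, so there is no second time.
        return first, None
    # Crossed hit: the second strip reports Hit_T + Hit_DT.
    return first, (hit_t + hit_dt) * TICK_NS
```

      For a crossed hit with Hit_T = 8 and Hit_DT = -1, this gives 12.5 ns for the first strip and 10.9375 ns for the second.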

      14. FOI_hits_N, integer ≥ 0 - number of hits inside a physics-defined cone around the track (aka Field Of Interest, FOI)

      15. FOI_hits_{,D}{X,Y,Z,T}, array of float of size FOI_hits_N - same as MatchedHit_{,D}{X,Y,Z,T}, per hit in the FOI

      16. FOI_hits_S, array of integer in {0, 1, 2, 3} - stations of the FOI hits

      17. ncl[N], integer - number of clusters in the Nth station. A high-level variable computed by an experimental undocumented algorithm; code for it is here

      18. avg_cs[N], float ≥ 0 - average cluster size in the Nth station, computed by the same algorithm as ncl[N]

      19. ndof, integer in {4, 6, 8} - number of degrees of freedom used in the χ2 computation, a function of momentum

      20. NShared, integer ≥ 0 - number of closest hits shared with the neighbouring tracks. See pages 4-5 here and pages 10-11 here. For almost all tracks, NShared ≤ FOI_hits_N. There is, however, a single event in train where NShared > FOI_hits_N; this is most likely due to a bug in the LHCb software.

      21. P, float > 3000 - magnitude of the momentum, MeV/c

      22. PT, float > 800 - component of the momentum transverse (i.e. perpendicular) to the beam line, MeV/c
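      Two relations stated in the field descriptions can be checked directly on the training data: weight is the product of sWeight and kinWeight, and label is 1 exactly when particle_type is 1 (muon). The sketch below does this for a single row represented as a dictionary. The function name, the row representation, and the tolerances are assumptions chosen for illustration.

```python
import math


def row_is_consistent(row, rel_tol=1e-6):
    """Check the weight and label relations stated in the field descriptions."""
    # weight should equal sWeight * kinWeight up to rounding.
    expected_weight = row["sWeight"] * row["kinWeight"]
    weight_ok = math.isclose(
        row["weight"], expected_weight, rel_tol=rel_tol, abs_tol=1e-12
    )
    # label is 1 for muons (particle_type 1) and 0 for pions and protons.
    expected_label = 1 if row["particle_type"] == 1 else 0
    return weight_ok and row["label"] == expected_label
```

      Negative sWeight values are handled like any other value, since the check only compares the product with the stored weight.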

       

      Missing values:

      • 0 for MatchedHit_TYPE

      • 255 for MatchedHit_T

      • -1 for MatchedHit_DT in case there is no matching hit in the station

      • the value of Hit_T for Hit_DT in uncrossed hits

      • -9999 for the rest
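      The sentinel values listed above can be mapped to an explicit missing marker before training. The following sketch does this for a single value. The function name, the use of None as the marker, and the assumption that column names carry a "[N]" station suffix are illustrative. Note that -1 can also be a legitimate time delta for a crossed hit, so the sketch treats MatchedHit_DT as missing based on the hit type rather than on the value alone.

```python
def clean_value(column, value, hit_type=None):
    """Map a sentinel to None; hit_type is MatchedHit_TYPE for the same station."""
    base = column.split("[")[0]  # drop the assumed "[N]" station suffix
    if base == "MatchedHit_TYPE":
        return None if value == 0 else value
    if base == "MatchedHit_T":
        return None if value == 255 else value
    if base == "MatchedHit_DT":
        # Missing when there is no matched hit (type 0, stored as -1) and
        # undefined for uncrossed hits (type 1, stores a copy of Hit_T).
        return None if hit_type in (0, 1) else value
    # All remaining columns use -9999 as the missing marker.
    return None if value == -9999 else value
```

      Whether to keep the sentinels, replace them, or add explicit "is missing" indicator features is a modelling choice; tree-based models often cope with the raw sentinels.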

      Submit your results as a zip archive containing a file named submission.csv.

    • Due to the large size of the training dataset, we have split it into two datasets. The public data for phase 1 is the training set; the public data for phase 2 is the test set. Note that the submission file should be called submission.csv and be zipped before sending.

      Public test: starting_kit, public_data

      Private test: public_data