shahbazsyed
Repos: 20
Followers: 13
Following: 1

Events

closed issue
Any updates on the code release?
Created at 3 days ago
how to score all documents in the index

Thanks! That did it.

Created at 1 month ago

Add files via upload

Created at 1 month ago
how to score all documents in the index

@cmacdonald
Okay I tried this:

df = df[['qid', 'query']]

(I used corpus_id from the scenario above as qid.) Say I have 100 rows in this dataframe, each of which is now processed by the following function:

def _my_func(indf):
  # indf holds the rows for a single query; pair that query with every docid in the index
  qRow = indf.iloc[0]
  rows = [[qRow["qid"], qRow["query"], docid] for docid in range(len(colbert))]
  return pd.DataFrame(rows, columns=['qid', 'query', 'docid'])

Create and apply the pipeline:

pipe = pt.apply.by_query(_my_func) >> colbert.index_scorer(query_encoded=False)
pipe(df)

This gives me KeyError: 'score'

Complete error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/.pyenv/versions/3.8.6/envs/temp/lib/python3.8/site-packages/pandas/core/indexes/base.py:3621, in Index.get_loc(self, key, method, tolerance)
   3620 try:
-> 3621     return self._engine.get_loc(casted_key)
   3622 except KeyError as err:

File ~/.pyenv/versions/3.8.6/envs/temp/lib/python3.8/site-packages/pandas/_libs/index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()

File ~/.pyenv/versions/3.8.6/envs/temp/lib/python3.8/site-packages/pandas/_libs/index.pyx:163, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'score'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
File ~/.pyenv/versions/3.8.6/envs/temp/lib/python3.8/site-packages/pyterrier/apply_base.py:36, in ApplyForEachQuery.transform(self, res)
     35 if self.add_ranks:
---> 36     dfs = [add_ranks(df, single_query=True) for df in dfs]
     37 rtr = pd.concat(dfs)

File ~/.pyenv/versions/3.8.6/envs/temp/lib/python3.8/site-packages/pyterrier/apply_base.py:36, in <listcomp>(.0)
     35 if self.add_ranks:
---> 36     dfs = [add_ranks(df, single_query=True) for df in dfs]
     37 rtr = pd.concat(dfs)

File ~/.pyenv/versions/3.8.6/envs/temp/lib/python3.8/site-packages/pyterrier/model.py:33, in add_ranks(df, single_query)
     31 if single_query:
     32     # -1 assures that first rank will be FIRST_RANK
---> 33     df["rank"] = df["score"].rank(ascending=False, method="first").astype(int) -1 + FIRST_RANK
     34     if STRICT_SORT:

File ~/.pyenv/versions/3.8.6/envs/temp/lib/python3.8/site-packages/pandas/core/frame.py:3505, in DataFrame.__getitem__(self, key)
   3504     return self._getitem_multilevel(key)
-> 3505 indexer = self.columns.get_loc(key)
   3506 if is_integer(indexer):

File ~/.pyenv/versions/3.8.6/envs/temp/lib/python3.8/site-packages/pandas/core/indexes/base.py:3623, in Index.get_loc(self, key, method, tolerance)
   3622 except KeyError as err:
-> 3623     raise KeyError(key) from err
   3624 except TypeError:
   3625     # If we have a listlike key, _check_indexing_error will raise
   3626     #  InvalidIndexError. Otherwise we fall through and re-raise
   3627     #  the TypeError.

KeyError: 'score'
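
Reading the traceback, the KeyError appears to come from the rank-adding step inside pt.apply.by_query rather than from the ColBERT scorer: _my_func returns only qid, query, and docid, so df["score"] does not exist when add_ranks runs. The pandas-only sketch below reproduces that failure mode. add_ranks_sketch and candidates_for_query are hypothetical stand-ins modelled on the lines shown above, and FIRST_RANK = 0 is an assumption made for illustration; none of this is PyTerrier's own code.

```python
import pandas as pd

FIRST_RANK = 0  # assumption: chosen for illustration only


def add_ranks_sketch(df):
    """Stand-in for the ranking line in the traceback; needs a 'score' column."""
    df = df.copy()
    df["rank"] = (
        df["score"].rank(ascending=False, method="first").astype(int) - 1 + FIRST_RANK
    )
    return df


def candidates_for_query(indf, num_docs):
    """Pair one query with every docid, as _my_func does (no scores yet)."""
    q = indf.iloc[0]
    rows = [[q["qid"], q["query"], docid] for docid in range(num_docs)]
    return pd.DataFrame(rows, columns=["qid", "query", "docid"])


queries = pd.DataFrame([["q1", "first query"]], columns=["qid", "query"])
cands = candidates_for_query(queries, num_docs=3)

# Ranking before scoring fails because there is no 'score' column.
try:
    add_ranks_sketch(cands)
    failed = False
    missing = None
except KeyError as err:
    failed = True
    missing = err.args[0]

# Once a scorer has added scores, the same ranking step succeeds.
scored = cands.assign(score=[0.2, 0.9, 0.5])
ranked = add_ranks_sketch(scored)
```

Since the traceback shows the step guarded by `if self.add_ranks:`, passing add_ranks=False to pt.apply.by_query may skip it so that ranks are only computed after the scorer has run; that is worth checking against the installed PyTerrier version.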


Created at 1 month ago
how to score all documents in the index

Not yet, but I will try it out soon and let you know. I already have some questions regarding this snippet, though.

Assume my queries are in the same dataframe as my documents (with columns corpus_id, query, corpus), where each row is a small collection of documents from different domains.

  1. What is the input to _my_func() now? Is it one row at a time or the entire dataframe?
  2. What is colbert here? Is it the indexer or the ranking_factory?
Created at 1 month ago
how to score all documents in the index

Hi, while using the dense retrieval pipeline via factory.end_to_end(), I noticed that not all documents are ranked and returned by the search(query) method of the pipeline. I am struggling to understand where to explicitly specify the number of results to return.

For example, my index has 5000 documents, but the search returns only around 3000 documents even when I call search like this:

ranked_df = (end_to_end_factory % 5000).search('my query')

Can someone please guide me in this regard?
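
For context, the % operator in a PyTerrier pipeline is a rank cutoff applied to whatever the preceding stage returned, so it can shorten a result list but cannot make the retriever return more documents than it found. A minimal plain-Python sketch of that behaviour follows; rank_cutoff and the toy result list are hypothetical and are not PyTerrier's implementation.

```python
def rank_cutoff(results, k):
    """Keep at most the k highest-scoring results (hypothetical helper)."""
    ordered = sorted(results, key=lambda r: r["score"], reverse=True)
    return ordered[:k]


# Pretend the retriever only surfaced 3000 of the 5000 indexed documents.
retrieved = [{"docno": f"d{i}", "score": 1.0 / (i + 1)} for i in range(3000)]

# A cutoff of 5000 cannot add the missing documents back.
top = rank_cutoff(retrieved, 5000)
```

Under this reading, getting all 5000 documents back means changing how many candidates the retrieval stage produces, not the cutoff applied afterwards.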

Created at 1 month ago
how to score all documents in the index

Thanks for the quick response!

Apparently the maximum depth for a FAISS GpuIndex search is 2048 items, which probably means I can't rank all the documents this way.

RuntimeError: Error in virtual void faiss::gpu::GpuIndex::search(faiss::Index::idx_t, const float*, faiss::Index::idx_t, float*, faiss::Index::idx_t*) const at /project/faiss/faiss/gpu/GpuIndex.cu:248: Error: 'k <= (Index::idx_t)getMaxKSelection()' failed: GPU index only supports k <= 2048 (requested 21825)
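
Because the error caps k at 2048 for the GPU index, one option is to score every document directly instead of going through nearest-neighbour search. The sketch below shows exhaustive late-interaction (MaxSim) scoring, the scoring rule used by ColBERT-style models, in plain Python on toy vectors. maxsim_score, score_all, and the data are hypothetical and only illustrate the idea; this is not how pyterrier_colbert implements it.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product with any document token."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def score_all(query_vecs, docs):
    """Score every document in the collection; no top-k limit is involved."""
    scored = [(docid, maxsim_score(query_vecs, vecs)) for docid, vecs in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy token embeddings: each document is a list of 2-d token vectors.
docs = {
    "d0": [[1.0, 0.0], [0.0, 1.0]],
    "d1": [[0.5, 0.5]],
    "d2": [[0.0, 1.0]],
}
query = [[1.0, 0.0], [0.0, 1.0]]

ranking = score_all(query, docs)
```

The cost grows linearly with the number of documents, which is acceptable for a few thousand documents but is the reason approximate search is normally used on large collections.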
Created at 1 month ago